EDA aka Exploratory Data Analysis is a statistical approach to analyzing data sets. It has gained significance due to its role in the detection of mistakes while analyzing experimental data, as well as in checking the validity of any assumptions that were made while gathering data, and in the selection of the suitable models. Statistical tools are employed in order to understand the characteristics of data sets.
'It is a capital mistake to theorize before one has data.' - Sherlock Holmes (Arthur Conan Doyle)
How it Works
First of all, the data is collected into rectangular arrays. Every row in the array represents a particular experimental subject, and every column signifies an outcome variable or identifier. With exploratory data analysis, certain data is given more significance over others, and so that sort of data is used and others are hidden in order to aid effective decision-making.
There are two major classifications of Exploratory Data Analysis
- Non-graphical and graphical
Graphical methods involve representing and summarizing data in chart, graph or diagram form. Non-graphical methods involve the data being calculated using statistical methods in non-pictorial form.
- Univariate and multivariate
As the name suggests, the univariate method only makes use of a single variable at a time. Multivariate methods, on the other hand, consider more than one variable concurrently.
This means there are four types of exploratory data analysis:
- Univariate non-graphical method
- Multivariate non-graphical method
- Univariate graphical method
- Multivariate graphical method
Students of exploratory data analysis will become used to the use of techniques like the histogram, box plot, run chart, multivariate chart, scatter plot and pareto chart when working on the graphical methods, as well as with techniques like the tri-mean, ordination and median polish when studying the non-graphical methods.
The Importance of EDA
EDA provides a great many exploratory techniques, which the data scientist can use to explore their data with. Even once you’ve completely understood the data set, it is to your advantage to use alternative techniques in order to make the data even more refined. Exploratory data analysis, therefore, allows you to reach a point of confidence in your data that allows for the formation of a machine learning algorithm.
Additionally, it can refine your choice of feature variables that will later be utilized for machine learning. This is important because after you have completely understood the data set and familiarized yourself with its characteristics, you may see that the features you originally selected are not totally suitable for your purposes. Thus you may decide to change these, and add other features in order to create a more comprehensive picture of the data. EDA should thus provide you with a firm set of features to use with statistical learning.
EDA thus has a profound importance in the realm of data science and especially machine learning. It is important to use EDA to its fullest extent in order to generate accurate models on the right data, and to create the right kinds of variables in data preparation. It allows you to utilize your resources efficiently by keeping your data free of outliers and unbiased.