The concept of Exploratory Data Analysis, or EDA, was introduced by John W. Tukey in 1977. Although it has many similarities with classical analysis, its approach, or rather its philosophy of data analysis, is very different. In both cases the analysis starts with a scientific problem or a business goal, and the aim is to reach the right conclusions based on the collected data. The difference lies in the intermediate steps.
In classical analysis, data collection is followed by imposing a model (e.g. linearity, normality), and the subsequent analysis, estimation, and testing focus on the parameters of that model. In EDA, data collection is followed immediately by analysis, with the goal of finding out which model is appropriate. Technically, there are six main differences between the two approaches.
Whether you are an analyst, a scientist, an engineer, or a marketer, one thing is for sure: you have to know your data like the back of your hand. In a world with seemingly endless amounts of data, it’s not enough to gather the pieces. If you do not understand how the elements relate to each other, you will not be able to make the right decisions when the time comes. But how can you extract the necessary information? Of course, you can sit in front of your computer scanning rows and columns, but getting results that way takes too much time and energy.
Work smarter, not harder: take a step back, visualize the data, find the patterns, and “feel” the information without committing to any hypotheses. In other words, do an Exploratory Data Analysis.
Classical Analysis vs EDA
The classical approach imposes models (deterministic and probabilistic) on the data. Deterministic models include, for example, regression models and ANOVA (analysis of variance) models. The most common probabilistic model assumes that the errors about the deterministic model are normally distributed, and this assumption affects the validity of tests such as the ANOVA F test.
The Exploratory Data Analysis approach, by contrast, does not impose deterministic or probabilistic models on the data. Instead, it lets the data suggest the models that fit it best.
The two approaches also differ significantly in their focus. Classical analysis focuses mainly on the model: estimating its parameters and generating predicted values from it. In EDA, the focus is on the data itself, its structure and its outliers, and on the models that the data suggests.
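To make the classical side of this contrast concrete, here is a minimal sketch of the one-way ANOVA F statistic in plain Python. The three groups are made-up numbers and the function name `one_way_anova_f` is hypothetical, chosen only for illustration. The statistic itself is just a ratio of variances; turning it into a p-value is the step that relies on the normal-errors assumption described above.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Illustrative data only: three small groups with different means.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(groups)
print(f"F = {f_stat:.2f}")
```

A large F suggests the group means differ, but only as far as the model's assumptions hold, which is exactly what an exploratory look at the data would check first.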
Classical techniques are generally quantitative in nature, so they mostly produce numeric or tabular output. Hypothesis tests, confidence intervals, and ANOVA are the mainstream tools of classical analysis.
The EDA approach relies mostly on graphic techniques such as histograms, scatter plots, and box plots, which draw on the natural pattern-recognition ability of the human brain to give deeper insight into the data. With statistical graphics, one can more easily find and validate the right model, detect outliers, identify relationships, and spot possible factor effects.
Classical techniques serve as cornerstones of science and engineering: they are rigorous, formal, and objective.
EDA techniques lack this kind of formality, but they are very suggestive and insightful about what the proper model should be. The techniques are subjective and their interpretation may differ from analyst to analyst, although experienced analysts usually reach the same conclusions.
The aim of classical estimation techniques is to reduce the data to a few important characteristics by mapping and filtering. This process causes information loss, leaving only the “appropriate” values and characteristics for further processing. EDA is, in this sense, a more holistic approach: it works with all of the available data, so little or no information is lost, and analysts can gain more insight by seeing every available variable.
Tests based on classical techniques are usually very sensitive: for example, a shift in location can easily be detected and declared statistically significant. On the other hand, the validity of the test conclusions depends on the validity of the underlying assumptions. In the worst case, the assumptions remain unknown to the analyst or are never tested, which makes the whole scientific conclusion suspect.
EDA techniques require few or no prior assumptions, because they present the data itself rather than a model of it.
Now that you know the main differences between the two statistical approaches, let us take a closer look at the graphic techniques of EDA, with the help of AnswerMiner.
How to Do Exploratory Data Analysis?
First of all, you need a dataset. You don’t have one? No worries, you can use one from the collection in AnswerMiner. You can also upload your own file or connect an external data source such as Facebook, Google Analytics, Mailchimp, surveys, or SQL files.
Now comes the fun part. After uploading or connecting your dataset, you can start investigating it. Pick one or more variables and check the Suggested Charts to recognize the above-mentioned patterns in your data. There are a couple of data visualization tools and techniques that can help you in this process.
Histograms, for example, are commonly used to summarize the distribution of a univariate data set. They give you a feel for the spread and skewness of your data and make it easy to find its center and its outliers. This graphic method strongly suggests the right distribution model for further exploration.
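As a rough illustration of what a histogram computes (outside of any particular tool), the sketch below bins a small made-up sample into equal-width intervals and prints a text bar per bin. The data and the function name `histogram` are assumptions for illustration; a charting tool does the same counting and draws bars instead of `#` characters.

```python
def histogram(values, bins=6):
    """Count values in equal-width bins; return (left, right, count) tuples."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: a single bin holds everything.
        return [(lo, hi, len(values))]
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value falls on the right edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]


# Illustrative data only: mostly small values plus one far-away point.
sample = [12, 15, 14, 13, 18, 16, 15, 17, 14, 22, 25, 19, 16, 15, 48]

for left, right, count in histogram(sample):
    print(f"{left:5.1f} - {right:5.1f} | {'#' * count}")
```

In the printed bars, most of the mass sits in the first bins, the tail stretches to the right, and the isolated last bin stands out as a candidate outlier.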
But what if you want to know the relationship between two or more variables?
Box Plots are well suited to showing how location and variation change between different groups of data. They let you judge from the data whether a factor has a significant effect, while summarizing large amounts of information effectively.
Creating a Scatter Plot, on the other hand, is a useful way to find relationships between two variables and to visualize a positive or negative correlation.
Other graphic tools, such as the Word Cloud or the Mosaic Plot, are available to show the relative frequencies of the values within your variables.
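The numbers behind a box plot can be sketched in a few lines of Python. The example below is an illustration under stated assumptions: the two groups are made-up data, `box_summary` is a hypothetical helper, the quartiles use the "inclusive" method of the standard library (other tools may use slightly different quartile conventions), and outliers are flagged with the common 1.5 × IQR rule.

```python
import statistics


def box_summary(values):
    """Return the quantities a box plot draws for one group of values."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Common convention: points beyond 1.5 * IQR from the box are outliers.
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }


# Illustrative data only: one measurement split by a two-level factor.
groups = {
    "A": [10, 12, 11, 13, 12, 14, 11],
    "B": [15, 17, 16, 18, 19, 16, 17, 30],
}
summary = {name: box_summary(values) for name, values in groups.items()}

for name, stats in summary.items():
    print(name, stats)
```

Comparing the medians and quartiles of the groups side by side is the numeric counterpart of comparing the boxes visually: a clear shift between groups hints that the factor matters.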
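A well-known illustration of why the plot matters, and not only the correlation coefficient, is Anscombe's quartet: four small data sets with nearly identical means and correlations that look completely different when plotted. The sketch below computes those summary numbers in plain Python; the helper name `pearson` is chosen here for illustration, and the plotting itself is left to whichever charting tool you use.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Anscombe's quartet: sets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
anscombe = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Mean of y and correlation for each set: almost identical across all four.
stats = {
    name: (sum(ys) / len(ys), pearson(xs, ys))
    for name, (xs, ys) in anscombe.items()
}

for name, (mean_y, r) in stats.items():
    print(f"{name:>3}: mean(y) = {mean_y:.2f}, r = {r:.3f}")
```

All four sets show the same strong positive correlation on paper, yet a scatter plot reveals a roughly linear cloud, a curve, a line with one outlier, and a vertical stack with one extreme point.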
So, you’re done with the exploration? Then use the Prediction Tree to support your decisions or to find the factors that influence your variables the most. Finally, draw the right conclusions and make the right moves based on your data.
Why Is Exploratory Data Analysis Important?
Data visualization is key to gaining deeper insights during the analytical process. Even once you have understood the data set, it is to your advantage to apply alternative techniques in order to refine your picture of the data.
Graphic techniques and statistical graphics engage the pattern-recognition ability of the human brain and help you investigate the dataset without preliminary assumptions. This is important because once you have understood the data set and familiarized yourself with its characteristics, you may see that the features you originally selected are not entirely suitable for your purposes.
You may then decide to change them and add other features in order to create a more comprehensive picture of the data. In this way, EDA should provide you with a firm set of features to use with statistical learning, and it can refine your choice of the feature variables that will later be used for machine learning.
EDA therefore has profound importance in data science, and especially in machine learning. Using it to its fullest extent helps you build accurate models on the right data and create the right kinds of variables during data preparation. It also lets you use your resources efficiently, because outliers and sources of bias are found and handled before modeling starts.
The philosophy of Exploratory Data Analysis paired with the quantitative approach of Classical Analysis is a powerful combination, and data visualization applications like AnswerMiner can help you understand your customers’ behavior, find the right variables for your model, or predict important business outcomes.