Applied Supervised Learning with R

Understanding the Science Behind EDA

In layman's terms, we can define EDA as the science of understanding data. A more formal definition is the process of analyzing and exploring datasets to summarize their characteristics, properties, and latent relationships using statistical, visual, or analytical techniques, or a combination of them.

To cement our understanding, let's break down the definition further. A dataset is typically a combination of numeric and categorical features. To study the data, we might need to explore features individually, and to study relationships, we might need to explore features together. Depending on the number and type of features, we may come across different types of EDA.

To simplify, we can broadly classify the process of EDA as follows:

  • Univariate analysis: Studying a single feature
  • Bivariate analysis: Studying the relationship between two features
  • Multivariate analysis: Studying the relationship between more than two features
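As a rough illustration of the three categories, the sketch below uses the `mtcars` dataset that ships with base R. The dataset and the chosen columns (`mpg`, `wt`, `hp`) are assumptions made for this example only; they are not the data used in this chapter.

```r
# mtcars ships with base R; it is used here purely as a stand-in dataset.
data(mtcars)

# Univariate: summarize a single feature
mpg_summary <- summary(mtcars$mpg)
print(mpg_summary)

# Bivariate: measure the relationship between two features
mpg_wt_cor <- cor(mtcars$mpg, mtcars$wt)
print(mpg_wt_cor)

# Multivariate: look at relationships among more than two features at once
cor_matrix <- cor(mtcars[, c("mpg", "wt", "hp")])
print(round(cor_matrix, 2))
```

The same function, `cor()`, serves the bivariate and the multivariate case here; only the number of features passed to it changes.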

For now, we will restrict the scope of the chapter to univariate and bivariate analysis. A few forms of multivariate analysis, such as regression, will be covered in the upcoming chapters.

To accomplish each of the previously mentioned analyses, we can use visualization techniques such as boxplots, scatter plots, and bar charts; statistical techniques such as hypothesis testing; or simple analytical techniques such as averages, frequency counts, and so on.
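To make the three families of techniques concrete, here is a minimal sketch that applies one of each to the same bivariate question: does fuel efficiency differ by transmission type? It assumes the built-in `mtcars` dataset, and the variable names `cars`, `mpg_test`, `mpg_means`, and `am_counts` are introduced only for this illustration.

```r
data(mtcars)

# am is coded 0/1 in mtcars (0 = automatic, 1 = manual);
# convert it to a factor so it is treated as a categorical feature.
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# Visual technique: boxplot of a numeric feature split by a categorical one
boxplot(mpg ~ am, data = cars,
        xlab = "Transmission", ylab = "Miles per gallon")

# Statistical technique: Welch two-sample t-test for a difference in mean mpg
mpg_test <- t.test(mpg ~ am, data = cars)
print(mpg_test$p.value)

# Analytical technique: group averages and frequency counts
mpg_means <- aggregate(mpg ~ am, data = cars, FUN = mean)
am_counts <- table(cars$am)
print(mpg_means)
print(am_counts)
```

Each technique answers the question at a different level: the boxplot shows the shape of the two distributions, the averages give a single number per group, and the test indicates whether the gap is likely to be more than noise.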

Breaking this down further, there is another dimension to consider: the type of feature, numeric or categorical. Within each type of analysis (univariate or bivariate), the appropriate visual technique depends on the type of the feature. For univariate analysis of a numeric variable, we could use a histogram or a boxplot, whereas for a categorical variable we might use a frequency bar chart. We will get into the details of the overall EDA exercise using a lazy programming approach; that is, we will explore the context and details of each analysis as and when it comes up in the book.
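The dependence of the plot on the feature type can be sketched as a small helper. `plot_univariate` is a hypothetical function written for this illustration, not part of any package, and the built-in `iris` dataset is assumed only as convenient sample data.

```r
# Hypothetical helper: choose a univariate plot based on the feature's type.
# Returns (invisibly) the name of the plot that was drawn.
plot_univariate <- function(x, name = "feature") {
  if (is.numeric(x)) {
    # Numeric feature: a histogram shows the distribution
    hist(x, main = paste("Histogram of", name), xlab = name)
    invisible("histogram")
  } else {
    # Categorical feature: a bar chart of frequency counts
    barplot(table(x), main = paste("Frequency of", name),
            xlab = name, ylab = "Count")
    invisible("bar chart")
  }
}

data(iris)
plot_univariate(iris$Sepal.Length, "Sepal.Length")  # numeric feature
plot_univariate(iris$Species, "Species")            # categorical feature
```

A boxplot would be an equally reasonable choice in the numeric branch; the histogram is used here only to keep the sketch short.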

With the basic background context set for the exercise, let's get ready for a specific EDA exercise.