Data cleaning

9 minute read

Introduction

The majority of a data scientist's time is spent on data preparation and cleaning. That is why I wanted to dedicate a section of my github repository to this task. The goal is to discuss and summarize the most important aspects of data cleaning and the impact different approaches have on model training.

Process

The data cleaning process starts with the raw data and results in clean and tidy data.

Choices in the process need to be the result of conscious decisions and have to be properly documented. Flag changed observations with an indicator variable. When a change affects many observations, it may affect the results of the analysis. In such a case it is good practice to try different solutions to the problem and see whether and how the results of the analysis change. This process is sometimes called robustness checking.

Variable format

All important variables should be stored in an appropriate format.

  • Binary variables are best stored as 0 or 1.
  • Qualitative variables with several values may be stored as text or numbers.
  • Convert them to dummy variables if appropriate (see the sketch below).
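
A minimal sketch of both conversions with pandas (the column names and values are made up):

```python
import pandas as pd

# hypothetical raw data with a binary and a qualitative variable
df = pd.DataFrame({"smoker": ["yes", "no", "yes"],
                   "city": ["Berlin", "Vienna", "Zurich"]})

# store the binary variable as 0/1
df["smoker"] = (df["smoker"] == "yes").astype(int)

# convert the qualitative variable to dummy variables
df = pd.get_dummies(df, columns=["city"], prefix="city")
print(df)
```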

Variable cleaning

Often numbers are stored in weird formats such as “12 min” or are wrongly interpreted upon import because of the separators used (point versus comma).

  • Extract numerical information where necessary
  • Transform text into numbers (see the sketch below)
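
A hedged sketch of both steps, using made-up values and pandas string methods:

```python
import pandas as pd

# hypothetical badly formatted columns
df = pd.DataFrame({"duration": ["12 min", "7 min", "23 min"],
                   "price": ["1.234,50", "987,10", "2.001,00"]})

# extract the numerical part of "12 min" and convert it to an integer
df["duration_min"] = df["duration"].str.extract(r"(\d+)", expand=False).astype(int)

# fix separators (point for thousands, comma for decimals) and convert to float
df["price_num"] = (df["price"]
                   .str.replace(".", "", regex=False)
                   .str.replace(",", ".", regex=False)
                   .astype(float))
print(df)
```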

Missing values

It is crucial to identify missing values and to make appropriate decisions on them.

  • If the fraction of missing values is not too high, we can remove those observations and work with the observations that have non-missing values for all variables used in the analysis (beware of selection bias!).
  • As an alternative, keep these observations but apply imputation: fill in some value for the missing values, such as the average value or a value randomly selected from the other observations (a sketch of both options follows below).
  • You may opt to leave them as missing; then the missing values need to be stored in a way that the statistical software recognizes them. For a qualitative variable, you may want to add "missing" as a new value.
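
A minimal sketch of the first two options (the data is made up); the indicator variable documents which observations were imputed, as recommended above:

```python
import numpy as np
import pandas as pd

# hypothetical variable with missing values
df = pd.DataFrame({"income": [1200, np.nan, 2300, np.nan, 1800]})

# option 1: drop observations with missing values (beware of selection bias)
complete = df.dropna(subset=["income"])

# option 2: impute the mean and flag the imputed observations
df["income_missing"] = df["income"].isna().astype(int)   # indicator variable
df["income"] = df["income"].fillna(df["income"].mean())
print(df)
```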

Missing values cause problems when …

  • we have too few observations with valid information, so that we cannot confidently generalize our results from the particular dataset to the situation we truly care about. The variable that best captures the information we need may have many missing values, and removing those observations might reduce the sample size too much. One solution is to focus on a less perfect variable that is more completely filled. Keep in mind that removing observations with missing values causes selection bias. One version of this approach is to work with observations that make up a well-defined subsample in which the missing data problem is a lot less severe. For example, if missing values are a lot more prevalent among small firms in administrative data on firms, we may exclude all small firms from the data. The advantage of such a choice is transparency: the results of the analysis will refer to medium and large firms.
  • they are stored in a way that the statistical software does not recognize them.

Sometimes, missing values are recorded with numeric values outside the admissible range (e.g., 0 for no, 1 for yes, 9 for missing). Such values need to be replaced, as in the sketch below.
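
A minimal sketch of recoding such a sentinel value (the variable is made up):

```python
import numpy as np
import pandas as pd

# hypothetical survey answer: 0 = no, 1 = yes, 9 = missing
df = pd.DataFrame({"answer": [0, 1, 9, 1, 9, 0]})

# recode the sentinel value 9 as a proper missing value
df["answer"] = df["answer"].replace(9, np.nan)
print(df["answer"].isna().sum())  # number of recognized missing values
```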

Checking range

Values of each variable should be within their admissible range.

  • Values outside the range are best replaced as missing, unless it is obvious what the value was meant to be (e.g., an obvious misplacement of digits)
  • Change units of measurement where needed (currency, large numbers measured in thousands, etc.)

Check duplicates

Observations that appear more than once should be identified and then removed or consolidated.
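
A minimal sketch with made-up data, using pandas to inspect and drop exact duplicates:

```python
import pandas as pd

# hypothetical data with one duplicated row
df = pd.DataFrame({"id": [1, 2, 2, 3], "value": [10, 20, 20, 30]})

# inspect duplicates before dropping them
print(df[df.duplicated(keep=False)])

# keep only the first occurrence of each duplicated row
df = df.drop_duplicates()
```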

Check extreme values / outliers

Even if you are not actively looking for extreme values (for example in a fraud detection system), it is still recommended to find extreme values in your data and remove or modify them:

Extreme values can cause problems with certain types of models. For example, linear regression models are less robust to extreme values than decision tree models. In general, if you have a legitimate reason to remove an extreme value, doing so will help your model’s performance. A legitimate reason for removing an extreme value could be a suspicious measurement that is unlikely to reflect real data.

1. Definition and types of extreme values

An extreme value is a data object that deviates significantly from the rest of the objects, as if it were generated by a different mechanism.

They might be unimportant if they capture inaccurate information or carry little weight in the analysis, but they might be important if they carry a lot of weight or if they provide important information that normal data won’t.

When working with small datasets, the model might be greatly affected by extreme values, yet deleting them would bias the results significantly. An alternative might be a transformation. Therefore it is recommended to investigate further:

  • Size of the sample: If the sample is small, it is crucial not to ignore extreme values!
  • Methodology: Check whether the outlier is likely to be the result of how the data was collected
  • Character of the data: If the data is a collection of real-world phenomena, do not drop extreme values, as they might contain critical information
  • Occurrence: If an extreme value is so far off the distribution that it is highly unlikely to reoccur, it is probably safe to drop/ignore these data points

Applications

  • Intrusions in communication networks
  • Fraud in financial data
  • Fake news and misinformation
  • Healthcare analysis
  • Industry damage detection
  • Security and surveillance

Causes of extreme values

  • Data entry errors = human error
  • Measurement errors = instrument error
  • Experimental errors = data extraction or experimental planning or execution errors
  • Intentional = dummy outlier made to test detection methods
  • Data processing errors = data manipulation or unintended mutations
  • Sampling errors = extracting and mixing data from raw sources
  • Natural = novelty in the data, not actually an error

Impact

  • Change the results: Can drastically change the result of statistical modelling
  • Error variance: Increases error variance and reduces the power of statistical tests
  • Normality: Can decrease normality
  • Bias: Can bias estimates that may be of substantive interest
  • Assumptions: Impacts the basic assumptions of regression, ANOVA and other statistical models
Outlier detection vs Anomaly detection

Outlier detection is used to find the odd one out within a data sample. Example: a customer whose behaviour differs from the rest of the customers. Anomaly detection finds abnormal behaviour in the data. Example: if on a particular day the behaviour of that same customer changes, we detect it as an anomaly. Novelty detection assumes that the training data is not polluted with outliers, and we are interested in detecting anomalies in new observations. In outlier detection the training data already contains outliers that we need to find and then decide how to treat for model training.

Types

  • Global outlier: an object that deviates significantly from the rest of the dataset.
  • Contextual outlier: an object that deviates significantly within a selected context. (Example: 24 degrees Celsius is an outlier for a Moscow winter day, but not for a Moscow summer day)
  • Collective outlier: a subset of the data that collectively deviates significantly from the whole dataset, even if the individual data points may not be outliers. (Example: a large set of transactions of the same stock among a small party in a short period of time can be considered evidence of market manipulation)

  • Univariate outliers: Can be found when looking at the distribution in a single feature space
  • Multivariate outliers: Can be found when looking at the distribution in an n-dimensional space

2. Detection methods

Extreme Value Analysis: This is the most basic form of outlier detection. It works well for one-dimensional data but is of limited value in a multivariate setting. It is assumed that values that are too small or too large are outliers.

Methods: Z-test and Student’s t-test, Interquartile Range Method (IQR), Standard Deviation Method

Examples:

  • Set thresholds at 2.5, 3, 3.5 or more standard deviations.

```python
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import std

seed(1)
data = 5 * randn(10000) + 50
data_mean, data_std = mean(data), std(data)

# set boundaries using 3 standard deviations below and above the mean
cut_off = data_std * 3
lower, upper = data_mean - cut_off, data_mean + cut_off

# outliers
outliers = [x for x in data if x < lower or x > upper]
print('Identified outliers: %d' % len(outliers))
```
  • Use the interquartile range: values below Q1 - 1.5 · IQR or above Q3 + 1.5 · IQR are flagged.

```python
from numpy import percentile

q25, q75 = percentile(data, 25), percentile(data, 75)
iqr = q75 - q25

# set boundaries using the IQR
cut_off = iqr * 1.5
lower, upper = q25 - cut_off, q75 + cut_off

# outliers
outliers = [x for x in data if x < lower or x > upper]
print('Identified outliers: %d' % len(outliers))
```

More examples can be found here.

Probabilistic and Statistical Modelling: Assumes a specific distribution for the data and uses expectation-maximization (EM) methods to estimate the parameters of the model and calculate the probability of membership of each data point in the distribution. The points with the lowest probability of membership are marked as outliers.

Linear regression models: The data is modelled in lower-dimensional subspaces with the help of linear correlations, and the distance of each data point to the plane fitted to the subspace is calculated. This distance is used to find extreme values.

Methods: Principal Component Analysis (PCA)

Proximity based models: Here we identify outliers by isolating them from the rest of the observations (a small sketch follows the examples below).

Methods: Clustering analysis, Density based analysis, Nearest Neighbours

Examples:

  • KNN
  • K-means
  • Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
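
One way to illustrate the proximity-based idea is DBSCAN, which labels points it cannot assign to any cluster with -1. A minimal sketch with made-up sample data (eps and min_samples would need tuning on real data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(1)
X = rng.normal(50, 5, size=(200, 2))            # made-up "normal" observations
X = np.vstack([X, [[90, 90], [10, 95]]])        # two artificial outliers

# points that cannot be assigned to any cluster receive the label -1
labels = DBSCAN(eps=4, min_samples=5).fit_predict(X)
outliers = X[labels == -1]
print('Identified outliers: %d' % len(outliers))
```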

Information theory models: In this approach, outliers are data points that increase the minimum code length needed to describe the data set.

High dimensional outlier detection models: Specific methods to handle high-dimensional data (an Isolation Forest sketch follows the examples below)

Methods: Tree-based anomaly detection algorithms, feature bagging

Examples:

  • Isolation forest (iForest)
  • Local correlation integral
  • Angle-based outlier detection
  • Deep learning approaches such as autoencoders, GANs and LSTMs
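
As a hedged sketch of the tree-based approach (the data is made up and the contamination parameter would need tuning), scikit-learn's IsolationForest can be used like this:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(1)
X = rng.normal(50, 5, size=(1000, 4))              # made-up "normal" data
X = np.vstack([X, rng.uniform(0, 100, (10, 4))])   # a few injected anomalies

# contamination is the expected share of outliers in the data
iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=1)
pred = iso.fit_predict(X)                          # -1 = outlier, 1 = inlier
print('Identified outliers: %d' % (pred == -1).sum())
```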

3. Approach / best practices

  1. Start with a simple approach like calculating a z-score to have a baseline model in place.

  2. Inspect data points where the baseline model fails. When data points are noisy, smoothing might help; for highly seasonal data, adapt the size of the rolling window.

  3. Plot the data points. The shape of the decision boundaries might guide your decision on which machine learning algorithm to choose for identifying outliers.

  4. Understand the baseline model and the data. If the decision boundaries are smooth and Gaussian, a robust covariance model might work well (see the sketch after this list). If there are complex multi-dimensional interactions, Isolation Forest might work well.

  5. Experiment: Explore more approaches like autoencoders, but be prepared for a lot of fine tuning and close inspection of those models.
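
For point 4, a minimal sketch of a robust covariance approach using scikit-learn's EllipticEnvelope (the data is made up; the contamination parameter is an assumption to tune):

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(1)
X = rng.multivariate_normal([50, 50], [[5, 2], [2, 5]], size=500)  # roughly Gaussian data
X = np.vstack([X, [[80, 20], [20, 80]]])                           # two artificial outliers

# fits a robust covariance estimate and flags points far from the fitted ellipse
ee = EllipticEnvelope(contamination=0.01, random_state=1)
pred = ee.fit_predict(X)                                           # -1 = outlier, 1 = inlier
print('Identified outliers: %d' % (pred == -1).sum())
```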

4. Removal of outliers

Depending on the cause of the extreme value and its impact on your results, be careful when choosing how to deal with extreme values. In general, extreme values should not be the driver of your results.

  • Trimming: Define a range and delete or cap values outside of this range (see the sketch below)
  • Deleting
  • Transforming / binning: Use transformation techniques such as the log value instead of the original value
  • Imputing: Replace the outlier value with the mean or median value of the feature
  • Treating separately
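
A minimal sketch of trimming, transforming and imputing with pandas (the series and the percentile cut-offs are made up):

```python
import numpy as np
import pandas as pd

# hypothetical skewed variable with one extreme value
s = pd.Series([12, 15, 14, 13, 16, 15, 400])

# trimming / winsorizing: cap values at the 1st and 99th percentile
lower, upper = s.quantile(0.01), s.quantile(0.99)
trimmed = s.clip(lower, upper)

# transforming: log values reduce the influence of large outliers
logged = np.log(s)

# imputing: replace extreme values with the median
imputed = s.mask(s > upper, s.median())
print(pd.DataFrame({"raw": s, "trimmed": trimmed, "log": logged, "imputed": imputed}))
```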