A Tolerant Guide To Cleaning Data in HR Analytics


The problem: data always arrives dirty
How confident you can be in your (people) analytics work largely depends on the quality of the data you are working with, and unfortunately, data always arrives loaded with errors.
One of our favorite sayings in analytics is "garbage in, garbage out." It applies fully to people analytics.
However, you might be able to recycle all that garbage.
Cleaning, sorting, and merging data from different sources, coupled with compliance processes, can take months. Even worse, if you work with multinational companies, you will probably come across different coding systems for the same types of data.
They’re lying to you when they tell you that in analytics you spend three quarters of your time cleaning or preparing data. It’s actually much more.

Perfect is the enemy of good

There are always problems with data: missing information, errors, duplicate values, and much more. So…
How do you know if the quality of the data is good enough to obtain reliable results in an analytics project?
Let's turn to Voltaire, our go-to philosopher on tolerance (Traité sur la tolérance) and simple-minded optimism (Candide, ou l'Optimisme). These quotes are taken from his Dictionnaire philosophique:
"Le mieux est l'ennemi du bien."
Perfect is the enemy of good.

The synthesis that could resolve this dialectical confrontation between the perfect and the good might be found in another quote:
"La discorde est le plus grand mal du genre humain, et la tolérance en est le seul remède."
Discord is the great ill of mankind; and tolerance is the only remedy for it.

Asking the Owner of the Data

Now let’s return to the question above:
How do you know if the quality of the data is good enough to obtain reliable results in an analytics project?
It's probably not up to us to answer. Data owners are usually in a better position to know when the quality of the data is sufficient to produce useful results.
If we find, as we will see below, wage records with negative values (someone who earns -$2523.35 gross) or €0.00, the person in charge of payroll can probably clarify whether they are errors or have a reasonable explanation.
In a necessarily imperfect world, we are going to look for partial solutions to solve the challenges that we face in the area of data quality.
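As a minimal illustration of that conversation, here is a sketch in Python with pandas; the payroll table and column names are invented for this example. It simply pulls out the suspect wage records to send back to the payroll owner:

    import pandas as pd

    # Hypothetical payroll extract; column names are invented for illustration.
    payroll = pd.DataFrame({
        "employee_id": [1, 2, 3, 4],
        "gross_salary": [38000.00, -2523.35, 0.00, 45200.00],
    })

    # Flag non-positive salaries and send them to the data owner for review.
    suspect = payroll[payroll["gross_salary"] <= 0]
    print(suspect)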

Challenge 1. Outliers


Outliers are values that are abnormally higher or lower than most other values in a sample of data. Identifying atypical values in the data is important because some extreme values can alter the results considerably.
Sometimes atypical values are legitimate. Other times, they are the result of an error. Either way, they can lead to misleading conclusions.
One of the most frequently used methods to detect outliers is the Tukey test.
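The Tukey test flags any value that falls more than 1.5 interquartile ranges below the first quartile or above the third. Here is a minimal sketch in Python with pandas; the salary figures and the function name tukey_outliers are invented for illustration:

    import pandas as pd

    def tukey_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
        """Flag values outside Tukey's fences: [Q1 - k*IQR, Q3 + k*IQR]."""
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        return (values < q1 - k * iqr) | (values > q3 + k * iqr)

    # Hypothetical data, including a negative salary and an extreme value.
    salaries = pd.Series([32000, 35000, 36500, 38000, 41000, 250000, -2523.35])
    print(salaries[tukey_outliers(salaries)])  # records to review with the data owner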
Solutions when there are outliers
• The first thing to figure out is whether the outliers are errors or just extreme values. Is that negative salary record an error? And a "zero" salary? Most likely, the owners of the data will have to clarify.
• Keep them. In large samples, avoid eliminating these atypical values when they are not errors. Use an algorithm that is robust to outliers (one based on medians rather than means) and models that work well with these values.
• Another option is simply to eliminate cases with outliers, as long as you understand the consequences of doing so. Two common approaches are trimming, which discards the outliers, and the Winsorising technique, which replaces them with the closest non-suspect values (see the sketch after this list).
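As a sketch of those two approaches (with made-up salary data and an arbitrary 20% cut in each tail), SciPy provides both out of the box:

    import numpy as np
    from scipy.stats import trim_mean
    from scipy.stats.mstats import winsorize

    salaries = np.array([32000, 35000, 36500, 38000, 41000, 250000])

    # Trimming: discard the most extreme 20% in each tail, then average.
    print(trim_mean(salaries, proportiontocut=0.20))

    # Winsorising: replace the extreme 20% in each tail with the closest
    # remaining value, then average.
    print(winsorize(salaries, limits=[0.20, 0.20]).mean())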

Challenge 2. No data


Sometimes, no system or database has the data we need for the analysis.
Imagine we are analyzing the causes of turnover. We have different variables to analyze, but we believe that promotions (that is, when and how often people have been promoted) could be an important variable.
Unfortunately, we discovered that there is no data related to promotions. What do we do?
Solutions to the lack of data
a. Start collecting that data from that moment on, even though, of course, it will not solve the immediate problem.
b. Look for proxy variables that already exist in the internal data. For example, a strong salary increase could be a good indicator of a promotion (see the sketch after this list).
c. Go to LinkedIn to check for changes in employee profiles.
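Option b can be sketched in a few lines of Python with pandas. The table, the column names, and the 15% threshold below are all assumptions for illustration; the threshold should be validated with whoever owns the compensation data:

    import pandas as pd

    # Hypothetical long-format salary history: one row per employee per year.
    history = pd.DataFrame({
        "employee_id": [1, 1, 1, 2, 2, 2],
        "year":        [2015, 2016, 2017, 2015, 2016, 2017],
        "salary":      [30000, 31200, 40000, 45000, 46000, 47500],
    })

    history = history.sort_values(["employee_id", "year"])
    history["pct_increase"] = history.groupby("employee_id")["salary"].pct_change()

    # Treat an unusually large raise (here, an assumed 15% threshold) as a
    # probable promotion.
    history["likely_promotion"] = history["pct_increase"] > 0.15
    print(history)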

Challenge 3. Some data missing


This problem is pervasive. Suppose you are analyzing the results of an eNPS survey to evaluate employee experience in your organization. Inevitably, some people will not answer the survey.
a. First of all, check whether the lack of data is random or whether there are patterns in the missing records. If there are signs that patterns exist, a deeper analysis will have to be carried out.
We talked about an eNPS survey above. Typically, people who are most concerned about data privacy are less likely to answer certain questions in a survey because they do not trust what will happen to their data. We should keep that in mind. It could be relevant to the analysis.
b. One option is to simply eliminate cases with missing values from the analysis, although this may reduce the representativeness of the sample (causing biased results) and shrink the sample size (which is undesirable; in general, the more data, the better).
c. Alternatively, the missing values can be "filled in" by inferring them, for example with regression techniques that estimate them from the other variables. It is a delicate job that requires knowledge and experience (see the sketch after this list).
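Here is a minimal sketch in Python with pandas of all three steps. The survey table and column names are invented, and the median fill stands in for the more careful regression-based imputation mentioned above (scikit-learn's IterativeImputer is one such tool):

    import pandas as pd

    # Hypothetical eNPS responses; NaN marks people who skipped the question.
    df = pd.DataFrame({
        "department": ["IT", "IT", "Sales", "Sales", "HR", "HR"],
        "enps_score": [9, None, 7, 3, None, None],
    })

    # (a) Look for patterns: does missingness cluster in certain groups?
    print(df["enps_score"].isna().groupby(df["department"]).mean())

    # (b) Listwise deletion: simple, but it shrinks and may bias the sample.
    complete_cases = df.dropna(subset=["enps_score"])

    # (c) Naive imputation with the median, as a stand-in for model-based
    # imputation (e.g., regression on the other variables).
    df["enps_imputed"] = df["enps_score"].fillna(df["enps_score"].median())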

Challenge 4. Outdated data


A classic problem in People Analytics is when salaries are not up to date.
a. A sensitivity analysis can be useful in determining whether updated values would significantly change the findings (see the sketch after this list).
b. If data refreshing cycles are frequent (or imminent), the best option may be to wait for the next update.
c. Another option is to manually update the values, if necessary. It is a dangerous intervention.
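A sensitivity analysis for option a can be sketched like this in Python. Everything here is an assumption for illustration: the data, the choice of a tenure-salary correlation as the finding of interest, and the 0-10% range of simulated raises since the last refresh:

    import numpy as np
    import pandas as pd

    # Hypothetical data: stale salaries and the variable we correlate with.
    df = pd.DataFrame({
        "tenure_years": [1, 3, 5, 8, 12, 20],
        "salary":       [31000, 34000, 39000, 45000, 52000, 60000],
    })

    print("baseline:", df["tenure_years"].corr(df["salary"]))

    # Simulate plausible updates: each salary grew by a random 0-10% since
    # the last refresh. If the correlation barely moves across simulations,
    # outdated salaries are unlikely to change the findings.
    rng = np.random.default_rng(0)
    for trial in range(3):
        updated = df["salary"] * (1 + rng.uniform(0.0, 0.10, size=len(df)))
        print("simulated:", df["tenure_years"].corr(updated))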

Challenge 5. Does the data have a normal distribution?


For many inferential statistics (analyses of variance, correlations, regressions, etc.), the variable being tested (or, in regression, the residuals) should follow a normal distribution (mean = median = mode) for the inferences to be valid. If the distribution is not normal, statistical tests can produce misleading results. There are several examples in later chapters where I perform this kind of preliminary validation.
a. Transform the variable to normalize it (a log transformation, for example).
b. There are alternative statistical techniques for analyzing data that do not have a normal distribution. Nonparametric tests are also called "distribution-free tests" because they do not assume that the data follows a specific distribution. They are like a universe parallel to that of parametric tests. For example, the one-sample Student's t-test is what we use when the distribution is normal; when it is not, we use the Wilcoxon signed-rank test (see the sketch below).
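This decision can be sketched in a few lines with SciPy. The simulated scores, the hypothesized value of 7.0, and the 0.05 cutoff for the Shapiro-Wilk normality test are all assumptions for illustration:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    scores = rng.normal(loc=7.2, scale=1.1, size=40)  # hypothetical survey scores
    hypothesized = 7.0

    # Shapiro-Wilk: a small p-value is evidence against normality.
    _, p_normal = stats.shapiro(scores)

    if p_normal > 0.05:
        # Parametric route: one-sample Student's t-test on the mean.
        result = stats.ttest_1samp(scores, popmean=hypothesized)
    else:
        # Nonparametric route: Wilcoxon signed-rank test on the differences
        # from the hypothesized value (it tests the median, not the mean).
        result = stats.wilcoxon(scores - hypothesized)
    print(result)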

Challenge 6. Duplication of data


Duplicate data is an inseparable part of our lives.
Often in HR, some people have more than one position in the company. Systems often create separate records for each position. These people, therefore, end up having multiple records in a database.
a. Delete them. This might not be advisable, since you lose the information the additional positions carry.
b. Aggregating them seems more advisable; it helps ensure that the quality of the analysis is not compromised (see the sketch after this list).
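Both options can be sketched with pandas; the extract and its column names are invented for illustration:

    import pandas as pd

    # Hypothetical extract: one row per position, so some employees repeat.
    positions = pd.DataFrame({
        "employee_id": [101, 101, 102, 103],
        "position":    ["Analyst", "Trainer", "Manager", "Engineer"],
        "hours_week":  [20, 10, 40, 40],
        "salary":      [18000, 9000, 55000, 48000],
    })

    # Option a: keep one record per employee (the second position is lost).
    deduplicated = positions.drop_duplicates(subset="employee_id", keep="first")

    # Option b: aggregate to one row per employee, preserving the totals.
    aggregated = positions.groupby("employee_id").agg(
        n_positions=("position", "count"),
        hours_week=("hours_week", "sum"),
        salary=("salary", "sum"),
    )
    print(aggregated)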

Takeaways

For each data quality problem, there are one or more imperfect solutions you can use to address it.