People Analytics: Don't be Fooled by Data!

Whenever people use statistics to support their assertions, it is very hard to keep some form of deception (intentional or otherwise) from slipping into what appears to be definitive proof of their claims.
In this post:
1.    Correlation doesn’t imply causation.
2.    Hidden variables
3.    Reverse causality
4.    Simpson’s paradox
5.    The law of small numbers
“It is my job to arm you against the foulest of creatures known to wizardkind.”
Harry Potter and the Chamber of Secrets. J.K. Rowling
A critical spirit is an essential tool to separate the truth from lies and to cope with the constant threats that you face.
To overcome evil and lies, you must be well trained, like the students of Hogwarts, in defending against these statistical dark arts.

Professor Lockhart:
"You may find yourself facing your worst fears in this room. Know only that no harm will befall you whilst I am here. I must ask you not to scream. It might provoke them!"
Sometimes data that contains lies is used with good intentions, without malice. The data tricked someone in the beginning, and with an equal mix of goodwill and ignorance, you accept it as true and do not hesitate to share their false truth with the world.
Other times, falsehood is intentional. These are deliberate attempts to deceive you and play you for a fool.
Ordinary citizens (“muggles”), just like analysts (“wizards”), have to learn to defend themselves from both the malicious and the naive. You have to learn to check a few things before accepting any data that comes to you as statistics.
The attacks come from two sources. We're going to study them one by one.
        1.     Correlation does not imply causality
        2.     The Law of Small Numbers
In fact, I should make a complete list that includes deliberate deceptions (with and without graphics) and those related to the ignorance of other data that completely alters its meaning, deliberately or not. Let's stick with these two points, which lend themselves to a little reading.

1. Correlation does not imply causality.

Not understanding this principle is definitely the main attack of the dark arts.
If we have two variables (A and B), we say they are correlated when changes in the value of A tend to go together with changes in the value of B. When the correlation is positive, the two move in the same direction: as one variable increases, so does the other, and as one decreases, so does the other. In a negative correlation, they move in opposite directions: one variable increases while the other decreases.
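To make the definition concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand. The numbers are invented for illustration only, and `pearson_r` is a helper name I'm introducing here, not something from a particular library.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values, invented for illustration only
a = [1, 2, 3, 4, 5, 6]
b_up = [52, 55, 61, 60, 68, 71]    # tends to rise as a rises
b_down = [71, 68, 60, 61, 55, 52]  # tends to fall as a rises

print("positive correlation:", round(pearson_r(a, b_up), 2))
print("negative correlation:", round(pearson_r(a, b_down), 2))
```

A value close to +1 means the two variables move together, a value close to -1 means they move in opposite directions, and a value near 0 means there is no linear relationship.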

Chocolate Consumption and Weight Gain: A Hypothetical Example

This is the first example I give in people analytics courses to illustrate what a correlation is.

Imagine that you weigh (vertical axis, Y) a group of people who eat different amounts of chocolate (horizontal axis, X) per week. You can see the correlation in the chart above. The higher the consumption of chocolate, the greater the weight of the individuals. They're correlated.
Many statistical tests calculate correlations between variables. When two variables correlate, it's tempting to assume that this proves that one variable causes the other.
In fact, the most common practice is to place the predictor variable on the X-axis and the target variable on the Y-axis. In the example above, this layout invites us to read the consumption of chocolate (X-axis) as the cause of the excess weight (Y-axis). But you shouldn't jump to hasty conclusions. The fact that they appear together doesn't mean we can logically infer that there is a cause-and-effect relationship between them. There are other possibilities that we have to consider first.

A) Hidden Variable

It could well be that a third unknown factor is actually the cause of the relationship between A and B. We call that a hidden variable.
For example, suppose your records show that the people your company hires through advertisements perform worse than those referred by current employees. Should you stop posting ads because that recruitment channel attracts lower-quality employees? Is the hiring channel really the cause? It could be. But there may be other causes.
It could also be that you only post ads when the job is harder to fill. If so, the real cause of the weaker hires could be a third factor, a hidden variable: hard-to-fill jobs, such as an HR analyst position, that force you to resort to ads. Because so few people are trained in the field, those hired for these jobs (through ads) often do not perform well.
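The hidden-variable story can be sketched as a small simulation. Everything here is an assumption made up for illustration: the 80%/20% use of ads, the 40%/70% chance of a good hire, and the function names. The key point is that, by construction, the channel has no effect at all on performance; only job difficulty does.

```python
import random

def simulate_hires(n=20000, seed=0):
    """Return (hard_job, via_ad, good_performer) tuples for n hypothetical hires."""
    rng = random.Random(seed)
    hires = []
    for _ in range(n):
        hard_job = rng.random() < 0.5                        # the hidden variable
        via_ad = rng.random() < (0.8 if hard_job else 0.2)   # ads used mostly for hard jobs
        good = rng.random() < (0.4 if hard_job else 0.7)     # depends on the job only
        hires.append((hard_job, via_ad, good))
    return hires

def good_rate(hires, via_ad, hard_job=None):
    """Share of good performers for a channel, optionally within one job type."""
    rows = [h for h in hires
            if h[1] == via_ad and (hard_job is None or h[0] == hard_job)]
    return sum(h[2] for h in rows) / len(rows)

hires = simulate_hires()
# Referral rate minus ad rate, overall and within each job type
overall_gap = good_rate(hires, False) - good_rate(hires, True)
hard_gap = good_rate(hires, False, hard_job=True) - good_rate(hires, True, hard_job=True)
easy_gap = good_rate(hires, False, hard_job=False) - good_rate(hires, True, hard_job=False)

print("overall gap:", round(overall_gap, 3))
print("gap within hard jobs:", round(hard_gap, 3))
print("gap within easy jobs:", round(easy_gap, 3))
```

Overall, referrals look clearly better than ads, yet once you compare like with like (hard jobs with hard jobs, easy with easy) the gap all but vanishes.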
In addition to the hidden cause, there are two other possible relationships between two correlated variables. Strictly speaking, there are more, but they are not as frequent.

B) Reverse causality

Suppose we notice that those who have spent more time in the same job have lower performance. Should we conclude from these two pieces of information that performance decreases over time and that, therefore, we must move people around more frequently? What if it were the other way around, and lower performance is the real reason these people have not changed positions?

"Smoking is good for you."

Australian physician Dr. William Whitby published Smoking is Good For You in 1978. Much of his book is dedicated not only to discrediting the belief that smoking causes lung cancer (or anything else) and the fear of second-hand smoking, but also to testifying that smoking is an effective treatment for many chest ailments, including bronchitis and asthma.
Dr. Whitby doesn't hesitate to discredit the statistics. In one passage, he asks:
"Is there anything wrong with tobacco? The answer, it seems, is NOTHING other than an alleged statistical relationship. But are we going to give some credit to statistics? We've already seen how unreliable they are. The relationship [between smoking and lung cancer] in most cases is usually only apparent because many cancer patients, as well as most people with chest conditions, smoke to relieve their cough. Blaming smoking for cancer is putting the cart before the horse."
From the field of statistics, there were also voices pointing to reverse causality.
In the 1950s, Sir Ronald Fisher was already considered the best statistician of the 20th century. While he was on the payroll of the tobacco industry, he wrote:
"The alleged consequence (lung cancer) is actually the cause, that is, what leads the subject to smoke. An incipient cancer or a precancerous condition with chronic inflammation is the factor that induces smoking cigarettes."
So be careful! Something is not necessarily true just because a doctor or the best statistician of his time says so.

C) Coincidences

The best collection of these coincidences is found on the website SPURIOUS CORRELATIONS. There, you can find some funny correlations that are hard to see as anything other than pure coincidence.
There are many more, but I've selected these three:
1. For ten years suicides by hanging are correlated with US spending on science.

2. The number of people who died from falling into swimming pools is correlated with the number of films in which Nicolas Cage appeared between 1999 and 2009.

3. Oil imports from Norway are correlated with the number of train crash fatalities.
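Why are such coincidences so easy to find? A rough sketch, under assumptions of my own: take one random series of ten yearly values, compare it against a couple of hundred other unrelated random series, and keep the best match. The series, their length, and the function names are all hypothetical.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def best_chance_match(n_series=200, length=10, seed=1):
    """Strongest |r| between one random series and n_series unrelated ones."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(length)]
    best = 0.0
    for _ in range(n_series):
        candidate = [rng.gauss(0, 1) for _ in range(length)]
        best = max(best, abs(pearson_r(target, candidate)))
    return best

print("strongest correlation found by pure chance:", round(best_chance_match(), 2))
```

With only ten data points per series and enough candidates to choose from, a strong-looking correlation is almost guaranteed to turn up, even though every series is pure noise.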

Simpson's Paradox and Causality

Simpson's Paradox (first described by Edward H. Simpson in 1951) happens when a trend that appears in several groups of data disappears or reverses when these groups are combined. In other words, the combined data suggests one thing, but it turns out to be the opposite when we analyze the data more carefully.
Simpson's Paradox disappears when you analyze causal relationships.

Gender Discrimination in Admissions at Berkeley

One of the best known examples of Simpson's Paradox occurred when the University of California, Berkeley was sued for discrimination against women who had applied for admission to graduate school. The admissions data for the fall of 1973 showed that male applicants were more likely to be admitted than women, and that the difference was too large to be due to chance.

However, by examining individual departments, it was found that there was no bias against women in any department. In fact, most departments had presented a "small but statistically significant bias in favor of women." Data from the six largest departments is listed below.

The conclusion was that women tended to submit applications in competitive fields with a low admission rate (such as the English Department) while men tended to apply in departments with less competition and a higher admission rate. The admissions data of the specific departments constituted a defense against the discrimination charges.
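The reversal can be checked with a few lines of code. The figures below are not taken from this post: they are the counts for the six largest departments as commonly reproduced, for example in R's built-in UCBAdmissions dataset, with departments labeled A to F. The function names are my own.

```python
# (applicants, admitted) per department, six largest departments, fall 1973
# Source assumption: counts as distributed in R's UCBAdmissions dataset
men = {"A": (825, 512), "B": (560, 353), "C": (325, 120),
       "D": (417, 138), "E": (191, 53), "F": (373, 22)}
women = {"A": (108, 89), "B": (25, 17), "C": (593, 202),
         "D": (375, 131), "E": (393, 94), "F": (341, 24)}

def overall_rate(group):
    """Admission rate with all departments pooled together."""
    applicants = sum(a for a, _ in group.values())
    admitted = sum(ad for _, ad in group.values())
    return admitted / applicants

def dept_rate(group, dept):
    """Admission rate within a single department."""
    applicants, admitted = group[dept]
    return admitted / applicants

print("pooled: men", round(overall_rate(men), 3), "women", round(overall_rate(women), 3))
for dept in men:
    print(dept, "men", round(dept_rate(men, dept), 3),
          "women", round(dept_rate(women, dept), 3))
```

Pooled, men are admitted at a clearly higher rate than women. Department by department, women have the higher rate in four of the six, and most women applied to the departments with the lowest admission rates.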

So, how is causation demonstrated?

To consider that there is a causal relationship between two variables, you have to prove that one is responsible for the occurrence of the other.
The standard method for demonstrating a causal relationship between two correlated variables is performing a controlled experiment.
This article by Khan Academy explains it very well. Test participants are randomly assigned to the group receiving the treatment in question (the possible cause) or to a group receiving standard treatment (or treatment with placebo) as a control group.
Although how to demonstrate causality is far from universally agreed upon and remains the subject of intense academic debate, I can safely state that three conditions are needed to prove causality:
1. The cause occurs before the effect.
2. Experiments show that when the cause changes, the effect changes with it.
3. Other possible causes are ruled out due to randomization.

Design of quasi-experimental studies

Performing randomized controlled experiments in the context of people analytics is often impossible. When the subjects of the experiment are people, randomization or the existence of a control group can be problematic or impossible.
There are, however, quasi-experimental studies, which allow some confidence in the identification of a causal relationship, although they are not as robust as a randomized experiment.
In a quasi-experimental study the employees are not randomly assigned to the experimental and control groups. This limits the inferences that can be drawn confidently from the analyses when compared with the experimental designs, because the lack of randomization usually makes it impossible to be sure that the two groups were the same before the experiment.
Despite these limitations, strong conclusions can still be made about effective interventions from quasi-experimental designs in HR analytics, particularly if the results are replicated in multiple studies.
This type of quasi-experimental study is associated with a series of effects that are important to consider, such as:
1.  The Hawthorne effect: The subjects of an experiment show a modification in some aspect of their behavior as a result of knowing that they are being studied, and not in response to any type of manipulation foreseen in the experimental study.
2. The Placebo effect: A positive response occurs in a person as a consequence of the administration of a treatment, but it cannot be attributed to the treatment itself.
3. Regression to the mean: In statistics, regression to the mean is the phenomenon in which if a variable is extreme on its first measurement, it will tend to be closer to the mean in its second measurement and, paradoxically, if it is extreme in its second measurement, will tend to have been closer to the mean in its first.
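Regression to the mean is the easiest of the three to see in a simulation. A minimal sketch, assuming each measurement is a stable "skill" plus fresh random luck of the same size (an assumption of mine, as are all the names and numbers):

```python
import random

def simulate_ratings(n=10000, seed=2):
    """Return (first, second) measurements for n hypothetical employees."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        skill = rng.gauss(0, 1)           # stable component
        first = skill + rng.gauss(0, 1)   # first measurement = skill + luck
        second = skill + rng.gauss(0, 1)  # second measurement = skill + fresh luck
        people.append((first, second))
    return people

def mean(values):
    return sum(values) / len(values)

people = simulate_ratings()
people.sort(key=lambda p: p[0], reverse=True)
top = people[: len(people) // 10]         # top 10% on the first measurement

top_first = mean([p[0] for p in top])
top_second = mean([p[1] for p in top])
print("top group, first measurement:", round(top_first, 2))
print("same group, second measurement:", round(top_second, 2))
```

Nothing was done to the top group between the two measurements, yet their second score is noticeably closer to the average. If you had given them a training course in between, you might have concluded that it hurt them.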

Correlational Studies

Designs that do not involve randomization and/or a control group are called correlational or observational. In these studies, we observe how the variables are related to each other without inferring a causal connection between the variables.
A correlational study involves analyzing an increase or decrease in one variable that coincides with an increase or decrease in the other variable.
Correlational designs typically have no control group. Inferring causal associations from these designs is difficult, but they may be useful in identifying variables for further study using experimental or quasi-experimental designs.
In any case, these designs lead to stronger conclusions than intuition alone.

Longitudinal Studies

All experimental, quasi-experimental, or correlational designs are strengthened when the variables being studied are measured repeatedly at multiple points in time. This is known as a longitudinal study.

2. Too Small of a Sample: the Law of Small Numbers

The Law of Small Numbers describes the strong tendency that people have to believe that the information obtained in a small sample will be representative of the total population. Daniel Kahneman (Nobel Prize winner in economics) and Amos Tversky came up with this name, parodying the well-known Law of Large Numbers.
With a six-sided die, there's a one in six chance of a specific number coming up on each roll. If you throw the die only six times (here is our small number), there is a good chance that one of the numbers will not come up at all, or that it comes up twice or even three times.
But if you continue to throw the die and make hundreds or thousands of throws, each number will progressively adjust to the 1/6 probability. That's the Law of Large Numbers.
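Both halves of the die example can be checked in a few lines. This is only a sketch: the seed and the function names are my own choices. The first figure is exact (the only way for no face to be missing in six rolls is for all six to be different); the rest is a simulation.

```python
import random
from collections import Counter
from math import factorial

# Exact: probability that at least one face never appears in six rolls
p_some_face_missing = 1 - factorial(6) / 6 ** 6

def face_shares(n_rolls, seed=3):
    """Share of rolls landing on each face after n_rolls simulated throws."""
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    return {face: counts[face] / n_rolls for face in range(1, 7)}

def max_deviation(n_rolls, seed=3):
    """Largest gap between any face's share and the ideal 1/6."""
    return max(abs(share - 1 / 6) for share in face_shares(n_rolls, seed).values())

print("P(some face missing in 6 rolls):", round(p_some_face_missing, 3))
for n in (6, 600, 100000):
    print(n, "rolls, largest deviation from 1/6:", round(max_deviation(n), 4))
```

In six rolls, at least one number fails to come up more than 98% of the time. As the number of rolls grows, the shares settle ever closer to 1/6.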
The Law of Small Numbers is also known by other names, like "hasty generalization," "insufficient sample fallacy," or simply "jumping to conclusions."
        1.     Aureliano is tall, and he brings in more sales than the company average.
        2.     Remedios is tall, and she brings in more sales than the company average.
Therefore, tall people are better salespeople. At least in this company.

From this demonstration alone, it will be hard for you to accept that small numbers ALWAYS mislead.

Suppose you have a jar with 100 marbles. Exactly half are red and half are green (50-50).
You draw four marbles from the jar at random. You count the results and return them to the jar. You repeat this process a thousand times.
I do the marble experiment fairly regularly (in class), and it only takes 20 seconds to finish the experiment.
So, if you take a random four-marble sample, how likely is it to draw two reds and two greens?
I have asked this question dozens of times in my courses. Other than the occasional know-it-all, people almost always answer "50%".
But that's not the case. The correct answer is 37.5%.
Warning: It will be hard for you to accept. Even Kahneman confesses that it was difficult for him to overcome this mistaken intuition.
Let's look at what's called the "sample space." In probability theory, the sample space consists of all the possible outcomes of a random experiment. Here, treating each draw as an even chance of red or green, the four marbles can come out in 16 equally likely sequences.
Only 6 of those 16 sequences (37.5% of the time) give us two red marbles and two green marbles, not 50% as we instinctively respond. 50% of the time there are three of one color and one of the other, and 12.5% of the time the four marbles are either all red or all green.
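You can enumerate the sample space yourself. The sketch below lists the 16 sequences, which treats every draw as an independent 50-50 choice. Strictly speaking, four marbles taken from a jar of 100 without putting each one back are not quite independent, so I also compute the exact figure for that case; it comes out slightly higher, at about 38.3%. The variable names are my own.

```python
from collections import Counter
from itertools import product
from math import comb

# All 2**4 = 16 sequences of four draws, each draw red (R) or green (G)
sequences = list(product("RG", repeat=4))
reds = Counter(seq.count("R") for seq in sequences)

p_two_two = reds[2] / len(sequences)                # 6 of 16
p_three_one = (reds[1] + reds[3]) / len(sequences)  # 8 of 16
p_four_zero = (reds[0] + reds[4]) / len(sequences)  # 2 of 16

# Exact value when drawing 4 of 100 marbles (50 red, 50 green) without replacement
p_two_two_exact = comb(50, 2) * comb(50, 2) / comb(100, 4)

print("two and two:", p_two_two)
print("three and one:", p_three_one)
print("all one color:", p_four_zero)
print("two and two, exact for the jar:", round(p_two_two_exact, 4))
```

Either way, the answer is nowhere near the 50% that intuition suggests.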

Try it at home, there's no danger involved.

Once I took an actual jar to class, like the one from the experiment. Only this time it was filled with red and green mints instead of marbles. That was the first and last time. Clara Cabañas, doing an internship at the time, would take the mints four at a time. The experiment was a little chaotic, actually, and not very practical. How do you repeat a test like this thousands of times in less than five minutes, which was the time that this failed experiment lasted?
To make matters worse, at the end of the class, the participants sneakily ate the experiment! Of the 100 original mints, there were soon only 90, then 80, and so on. One last green mint survived for almost a year in the nearly empty jar that peered over at me half sadly, half accusingly, from my desk. I keep the jar in my house, and I'm thinking of donating it to science when I die.
So, for the next course I prepared a script that simulated the same process with R. You can run it as many times as you want, and simulate 1,000, 10,000 or 100,000 "rolls" by changing the limit on line 6 (while i < 1000).
What you'll find is that the results are the same as those of the sample space. You can do the experiment as many times as you want.
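If you'd like to try this without R, here is a rough Python stand-in. It is not the script I use in class, just a sketch of the same idea; `n_trials` plays the role of the loop limit, and the seed and names are arbitrary choices.

```python
import random
from collections import Counter

def simulate_draws(n_trials=1000, seed=4):
    """Draw 4 marbles from a 50 red / 50 green jar n_trials times.

    Returns the share of trials that produced 0, 1, 2, 3 or 4 red marbles.
    """
    rng = random.Random(seed)
    jar = ["red"] * 50 + ["green"] * 50
    tally = Counter()
    for _ in range(n_trials):
        sample = rng.sample(jar, 4)  # four marbles at once; the jar itself is unchanged
        tally[sample.count("red")] += 1
    return {k: tally[k] / n_trials for k in range(5)}

shares = simulate_draws(n_trials=100000)
print("two red, two green:", round(shares[2], 3))
print("three and one:", round(shares[1] + shares[3], 3))
print("all one color:", round(shares[0] + shares[4], 3))
```

Because the marbles in each sample are drawn without putting them back one by one, expect the two-and-two share to land a little above 37.5%, but still far from 50%.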

How ignoring the Law of Small Numbers cost Bill and Melinda Gates $1.7 billion

This is the story of how the Bill and Melinda Gates Foundation squandered $1.7 billion because they did not take the Law of Small Numbers into account. You can read the case in the first chapter of the book Picturing the Uncertain World by Howard Wainer (free).
The characteristic urbanization of the 20th century in the United States gave rise to the abandonment of rural life and thus, to the increase in school sizes. In the urban landscape, the small rural schools were replaced by large schools, often with more than 1,000 students.
However, during the last quarter of the 20th century, dissatisfaction with large schools grew, and more and more people began to wonder if smaller schools could provide better quality education.
In the late 1990s, the Bill and Melinda Gates Foundation began supporting small schools. In 2001, the Foundation awarded grants to education projects totaling approximately $1.7 billion. The availability of large sums of money to implement a smaller school policy produced a proportional increase in pressure to transform large schools into small schools.
It was quite intuitive to think that when schools are smaller, students' results improve. You know, personal attention and so forth.
But there were also studies that "showed" that the small schools obtained better results than the large ones.
Let's take one example of these studies. If we examine the average 5th grade reading scores of the 1,662 elementary schools in Pennsylvania, we find that out of the top fifty schools (3%), six (12%) were from the small school group. If school size were not related to the results, you would expect small schools to hold only about 3% of the places in this select group, but they hold 12%. This means that small schools were over-represented four to one!
Can you draw the conclusion from here that small schools are four times better than big ones?
Apparently, the Gates believed it.
But wait a minute! If we take a look at the fifty worst schools, which have the lowest scores, nine of them (18%) were from the smallest schools: six times their expected share.
Look at the graph below where the squares represent the small schools with the best results, and the circles represent the worst.

What is happening here is that smaller schools have a much higher variability and are therefore overrepresented at both ends. But the regression line also shows a significant positive slope, which tells us that the higher the number of students in a school (horizontal axis), the better the results (vertical axis).
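The "overrepresented at both ends" effect is easy to reproduce. In this sketch, every student in every school is drawn from the same score distribution, so by construction school size has no effect on quality at all (unlike the real data, where larger schools did slightly better). The enrollments, the score scale, and the names are all invented for illustration.

```python
import random
from math import sqrt

def simulate_schools(n_schools=1662, seed=5):
    """Return (size, average_score) for hypothetical schools of assorted sizes."""
    rng = random.Random(seed)
    schools = []
    for _ in range(n_schools):
        size = rng.choice([20, 50, 100, 200, 500, 1000])  # hypothetical enrollments
        # Average of `size` student scores, each with mean 500 and sd 100:
        # the school average has mean 500 and sd 100 / sqrt(size)
        average = rng.gauss(500, 100 / sqrt(size))
        schools.append((size, average))
    return schools

def small_share(schools):
    """Share of schools in the list that belong to the smallest size group."""
    return sum(1 for size, _ in schools if size == 20) / len(schools)

schools = simulate_schools()
ranked = sorted(schools, key=lambda s: s[1], reverse=True)

baseline = small_share(schools)
top_share = small_share(ranked[:50])
bottom_share = small_share(ranked[-50:])

print("small schools overall:", round(baseline, 2))
print("small schools among the top 50:", round(top_share, 2))
print("small schools among the bottom 50:", round(bottom_share, 2))
```

The smallest schools crowd both the top and the bottom of the ranking, purely because an average over a few students bounces around more than an average over many.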
In October 2005, the Gates Foundation announced that it was moving away from its emphasis on converting large secondary schools into smaller schools. The leaders of the Foundation said they had come to the conclusion that school size was not important. The spokesman concluded:
"I'm afraid we've done the children a big disservice."
They paid a big price for ignoring the Law of Small Numbers.