
Creating correlation matrixes to generate hypotheses

Often a startup will offer multiple features, like task management, video calls, messaging, etc.

As a data scientist or growth engineer, you're always on the hunt for causal relationships, as they can be optimized to drive results. To find causal relationships, you often first have to find correlated metrics and then run randomized controlled experiments to determine whether the relationship between the two is indeed causal.

Take this correlation matrix below:

[Figure: Correlation Matrix Example]

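Where 1 is a perfect positive correlation (in practice that's close to impossible unless it's the same metric), 0 is no linear correlation, and -1 is a perfect negative correlation (one metric goes up exactly as the other goes down).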

We can see that the final exam score and hours spent studying have a high correlation of 82%. On the flip side, it might surprise you that hours spent sleeping and exam score were not highly correlated. It's only an example, but it shows the kind of insights you can derive from the matrix.
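A matrix like this can be built by computing the correlation for every pair of metrics. Here is a minimal sketch in plain Python. The numbers are made up purely for illustration (they are not the data behind the figure), the metric names are hypothetical, and the `pearson` helper uses the formula covered later in this article.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

def correlation_matrix(metrics):
    """Return a nested dict {metric_a: {metric_b: r}} covering every pair."""
    names = list(metrics)
    return {a: {b: pearson(metrics[a], metrics[b]) for b in names} for a in names}

# Hypothetical, made-up values for six students (illustration only).
metrics = {
    "hours_studying": [2, 4, 5, 7, 8, 10],
    "hours_sleeping": [7, 6, 8, 5, 7, 6],
    "exam_score": [55, 62, 70, 78, 85, 92],
}

matrix = correlation_matrix(metrics)
for name, row in matrix.items():
    # One row of the matrix, rounded for readability.
    print(name, [round(row[other], 2) for other in metrics])
```

The diagonal is always 1 because each metric is compared with itself, and the matrix is symmetric, so in practice you only need to read one half of it.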

How do you calculate correlation?

In statistics, the Pearson correlation coefficient is a measure of linear correlation between two sets of data. In even simpler terms, it measures how closely two variables move together along a straight line.

You can calculate it using the formula below:

R = (n * Σxy - Σx * Σy) / sqrt(
  (n * Σx^2 - (Σx)^2) * (n * Σy^2 - (Σy)^2)
)
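If you'd rather read code than notation, the Pearson formula can be sketched as a small Python function. The function name `pearson` and the tiny sample lists are our own choices for illustration. They are picked so that the results land exactly on 1, -1 and 0.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient computed from the column sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

x = [1, 2, 3, 4, 5]

print(pearson(x, [2, 4, 6, 8, 10]))  # 1.0  (y rises in lockstep with x)
print(pearson(x, [10, 8, 6, 4, 2]))  # -1.0 (y falls in lockstep with x)
print(pearson(x, [2, 1, 3, 1, 2]))   # 0.0  (no linear relationship)
```

Note that the function divides by zero if either list is constant (every value the same), so a real implementation would want to guard against that case.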

Example: Tasks created versus Messages Sent

Take this dataset, where each week we calculate the total number of tasks and messages created:

Week | Tasks created (x) | Messages sent (y)
1    | 20                | 7
2    | 24                | 8
3    | 22                | 9
4    | 27                | 12
5    | 21                | 7
6    | 23                | 8

1) The first step is to add a new column called xy, which is x multiplied by y:

Week | Tasks created (x) | Messages sent (y) | xy
1    | 20                | 7                 | 140
2    | 24                | 8                 | 192
3    | 22                | 9                 | 198
4    | 27                | 12                | 324
5    | 21                | 7                 | 147
6    | 23                | 8                 | 184

2) Next, we create an x^2 column, where we square tasks created (x):

Week | Tasks created (x) | Messages sent (y) | xy  | x^2
1    | 20                | 7                 | 140 | 400
2    | 24                | 8                 | 192 | 576
3    | 22                | 9                 | 198 | 484
4    | 27                | 12                | 324 | 729
5    | 21                | 7                 | 147 | 441
6    | 23                | 8                 | 184 | 529

3) We do the same for y^2, squaring messages sent (y):

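Week | Tasks created (x) | Messages sent (y) | xy  | x^2 | y^2
1    | 20                | 7                 | 140 | 400 | 49
2    | 24                | 8                 | 192 | 576 | 64
3    | 22                | 9                 | 198 | 484 | 81
4    | 27                | 12                | 324 | 729 | 144
5    | 21                | 7                 | 147 | 441 | 49
6    | 23                | 8                 | 184 | 529 | 64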

4) Then we simply add up all of the numbers in the columns and put the result at the bottom of the column. The Greek letter sigma (Σ) is a short way of saying “sum of” or summation.

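Week | Tasks created (x) | Messages sent (y) | xy    | x^2   | y^2
1    | 20                | 7                 | 140   | 400   | 49
2    | 24                | 8                 | 192   | 576   | 64
3    | 22                | 9                 | 198   | 484   | 81
4    | 27                | 12                | 324   | 729   | 144
5    | 21                | 7                 | 147   | 441   | 49
6    | 23                | 8                 | 184   | 529   | 64
Σ    | 137               | 51                | 1,185 | 3,159 | 451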
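Steps 1 to 4 can also be reproduced in a few lines of Python. This is only a sketch: the variable names (`tasks_created`, `messages_sent`, `rows`, `totals`) are our own, while the numbers are the weekly figures from the dataset above.

```python
# Weekly figures from the example dataset.
tasks_created = [20, 24, 22, 27, 21, 23]  # x
messages_sent = [7, 8, 9, 12, 7, 8]       # y

# Steps 1-3: add the xy, x^2 and y^2 columns for each week.
rows = []
for week, (x, y) in enumerate(zip(tasks_created, messages_sent), start=1):
    rows.append({"week": week, "x": x, "y": y, "xy": x * y, "x2": x * x, "y2": y * y})

# Step 4: sum every column (the Σ row at the bottom of the table).
totals = {col: sum(row[col] for row in rows) for col in ("x", "y", "xy", "x2", "y2")}

for row in rows:
    print(row)
print("Σ", totals)  # Σ {'x': 137, 'y': 51, 'xy': 1185, 'x2': 3159, 'y2': 451}
```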

5) We can now calculate the correlation coefficient between x and y:

R = (n * Σxy - Σx * Σy) / sqrt(
  (n * Σx^2 - (Σx)^2) * (n * Σy^2 - (Σy)^2)
)

n = sample size (which in our case is 6 rows)
Σx = 137
Σy = 51
Σxy = 1185
Σx^2 = 3159
Σy^2 = 451

R = (6 * 1185 - 137 * 51) / sqrt(
  (6 * 3159 - 137^2) * (6 * 451 - 51^2 )
) = 0.88 (88%)
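To double-check the arithmetic, here is the same calculation as a short Python sketch. It plugs in the column sums from the table, and the variable names are our own.

```python
import math

n = 6                                 # sample size (number of weeks)
sum_x, sum_y = 137, 51                # Σx, Σy
sum_xy, sum_x2, sum_y2 = 1185, 3159, 451  # Σxy, Σx^2, Σy^2

numerator = n * sum_xy - sum_x * sum_y   # 7110 - 6987 = 123
x_term = n * sum_x2 - sum_x ** 2         # 18954 - 18769 = 185
y_term = n * sum_y2 - sum_y ** 2         # 2706 - 2601 = 105

r = numerator / math.sqrt(x_term * y_term)
print(round(r, 2))  # 0.88
```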

88% signifies quite a high correlation between tasks created and messages sent. From here we can develop our own hypotheses for why this might be:

  • Do users who create new tasks then message other users to talk about the task?
  • Is there an issue with our data collection where a task will trigger a message?
  • If we reduce the number of tasks being created will that also reduce messages sent?
  • Etc.

Remember kids: correlation does not mean causation. Use randomized controlled A/B experiments to determine whether the relationship between the two metrics is causal.