In this set of notes we’re going to:

  1. Continue our discussion of how Expectations are informative about the probability measure of interest: $\mathbb{P} \circ X^{-1}$, but now focusing on vector-valued random variables
  2. Explore the difference between Covariance and Correlation

1.

In our previous class, we discussed how we could take the expected value of a transformation of a random variable to understand the underlying probability measure / distribution.

<aside>

We’ll now consider the situation where we’re interested in understanding distributions defined over multiple random variables. For example, given our interest in the WNBA, we may be interested in the distribution over Assists and Points.

From one viewpoint, we can regard Assists and Points as two random variables: functions mapping from the sample space into their respective domains. From another viewpoint, though, we can view them as a single vector-valued random variable, that is, a function that returns a pair of numbers $(\text{Assists}, \text{Points})$ as its output.
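To make the second viewpoint concrete, here is one way to write it down. The symbols $\Omega$ (the sample space) and $B$ (a measurable set of assist-point pairs) are notation assumed for this sketch:

$$ X : \Omega \to \mathbb{R}^2, \quad X(\omega) = (\text{Assists}(\omega), \text{Points}(\omega)), \quad (\mathbb{P} \circ X^{-1})(B) = \mathbb{P}(\{\omega \in \Omega : X(\omega) \in B\}) $$

Read this way, $\mathbb{P} \circ X^{-1}$ assigns a probability to each region $B$ of the assists-points plane.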


As before, we can continue to use expectations to better understand the measure / distribution over the space of assists and points ($\mathbb{P} \circ X^{-1}$).

Continuous Variables

Since assists and points are continuous, we can ask what the covariance is between them. The covariance tells us whether, on average, players with above-average assists also have above-average points. Note that there is nothing causal about this relationship. And while we can compute the covariance in Pandas via `df['AST'].cov(df['PTS'])`, I think it’s helpful to walk through the underlying math and Python to understand exactly what this term captures.

Letting $X$ denote assists and $Y$ points, and letting $\tilde{X}$ denote the demeaned variable $X - \mathbb{E}[X]$ (with $\tilde{Y} = Y - \mathbb{E}[Y]$ defined the same way), we can express the covariance as the expected value of the product of the demeaned variables. If this value is positive, then on average, players who are above average in assists are also above average in points.

$$ W = \tilde{X}\tilde{Y}, \quad \mathbb{E}[W] = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \text{Cov}(X,Y) $$
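The formula above can be sketched in a few lines of pandas. The six-player table below is made-up data for illustration only (not real WNBA numbers); the `AST` and `PTS` column names follow the labels used in these notes. One detail worth checking for yourself: averaging $W$ divides by $n$, whereas `Series.cov` defaults to `ddof=1` and divides by $n - 1$, so the two numbers differ slightly on a finite sample.

```python
import pandas as pd

# Hypothetical data for illustration only; not real WNBA statistics.
df = pd.DataFrame({
    "AST": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "PTS": [5.0, 12.0, 7.0, 9.0, 14.0, 13.0],
})

# Demean each variable: X-tilde = X - E[X], Y-tilde = Y - E[Y].
x_tilde = df["AST"] - df["AST"].mean()
y_tilde = df["PTS"] - df["PTS"].mean()

# W is the product of the demeaned variables.
w = x_tilde * y_tilde

n = len(df)
cov_population = w.mean()        # E[W] under the empirical distribution (divide by n)
cov_sample = w.sum() / (n - 1)   # unbiased sample version (divide by n - 1)

# pandas uses ddof=1 by default, so this should match cov_sample.
cov_pandas = df["AST"].cov(df["PTS"])

print("average of W (divide by n):    ", cov_population)
print("sum of W / (n - 1):            ", cov_sample)
print("df['AST'].cov(df['PTS']):      ", cov_pandas)
```

With a large number of players the gap between dividing by $n$ and by $n - 1$ becomes negligible, but it explains why a hand-rolled `w.mean()` will not exactly reproduce the pandas result.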

We can get a better sense of this term by constructing these demeaned variables and overlaying one of them on the underlying scatter plot. What do you notice? The points with longer lines (higher demeaned values) tend to be to the right and up: players with above-average assists tend to be above average in points.
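As a numeric counterpart to the plot, we can count how many players sit on the same side of both averages and see how much each group contributes to the sum of $W$. This is a rough sketch on made-up data (the same hypothetical six players, not real WNBA numbers), with `AST` and `PTS` as assumed column names.

```python
import pandas as pd

# Hypothetical data for illustration only; not real WNBA statistics.
df = pd.DataFrame({
    "AST": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "PTS": [5.0, 12.0, 7.0, 9.0, 14.0, 13.0],
})

x_tilde = df["AST"] - df["AST"].mean()
y_tilde = df["PTS"] - df["PTS"].mean()
w = x_tilde * y_tilde

# Same side: above average in both, or below average in both (W > 0).
same_side = w > 0
# Opposite sides: above average in one, below average in the other (W < 0).
opposite_side = w < 0

print("players on the same side of both means:", int(same_side.sum()))
print("players on opposite sides:             ", int(opposite_side.sum()))
print("contribution to sum of W (same side):  ", w[same_side].sum())
print("contribution to sum of W (opposite):   ", w[opposite_side].sum())
```

The covariance is positive when the same-side contributions outweigh the opposite-side ones. That depends on the size of the deviations as well as on how many players fall in each group.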