In this set of notes we’re going to

  1. Provide a high-level overview of what we’ve discussed thus far
  2. Introduce the idea of a Conditional Expectation Function as an alternative to the Correlation Coefficient when working with Continuous and Categorical Data
  3. Introduce the idea of a Conditional Distribution

YouTube Recording of Class Notes

1.

Roughly two weeks ago, we learned how we can use a probability space to reason about uncertainty. We said look — we have a sample space (the set of all things that could happen), we have a probability measure (a function which tells us how likely subsets of the sample space are to occur) and we have a random variable which maps from the sample space onto another outcome space, and through composition, carries the probability measure onto the new space.
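
To make this concrete, here is a minimal toy sketch in Python (a fair six-sided die, not an example from class) of a sample space, a probability measure, and a random variable whose composition carries the measure onto a new outcome space:

```python
from collections import defaultdict

# Sample space: all outcomes of rolling one fair die
sample_space = [1, 2, 3, 4, 5, 6]

# Probability measure: how likely each outcome is
P = {omega: 1 / 6 for omega in sample_space}

# Random variable: maps each outcome onto a new outcome space ("even" / "odd")
def X(omega):
    return "even" if omega % 2 == 0 else "odd"

# Pushforward distribution P ∘ X^{-1}: the measure carried onto the new space
pushforward = defaultdict(float)
for omega, prob in P.items():
    pushforward[X(omega)] += prob

print(dict(pushforward))  # {'odd': 0.5, 'even': 0.5}
```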

We cautioned that this level of abstraction is difficult to grasp the first time you see it. But we’re invested in this framework because it is general enough that we can use it to seamlessly reason about uncertainty in a lot of different contexts. For instance, given our understanding of probability theory, we can understand what it means to take the expected positive rating of a movie review as assessed by a Large Language Model.

This past week, we took the sample space, the probability measure, and the random variable as given, and said, let’s try and understand the distribution generated by the random variable and the probability measure: $\mathbb{P} \circ X^{-1}$. We discussed how, for non-categorical variables, the expected value and the standard deviation are useful summary statistics of the distribution, and how, when the random variable is vector-valued (that is, it outputs multiple values like Points and Assists), we can use the covariance and correlation coefficient to capture the joint relationship.
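
As a rough sketch of how these summaries can be computed in pandas (the numbers below are made up, and the 'AST' column name for Assists is an assumption; the notes' data uses 'PTS' for Points):

```python
import pandas as pd

# Hypothetical player-level data with Points ('PTS') and Assists ('AST')
df = pd.DataFrame({
    "PTS": [25.0, 18.2, 30.1, 12.4, 22.7],
    "AST": [7.1, 3.4, 8.9, 2.0, 5.6],
})

# Summary statistics of each marginal distribution
print(df["PTS"].mean(), df["PTS"].std())  # expected value and standard deviation

# Joint summaries for the vector-valued random variable (Points, Assists)
print(df[["PTS", "AST"]].cov())           # covariance matrix
print(df["PTS"].corr(df["AST"]))          # correlation coefficient
```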


<aside>

💡Note, up to this point, I’ve been a bit loose with the distinction between a categorical and a non-categorical variable. When thinking about random variables, it’s worthwhile to have a clear understanding of the distinction. A categorical variable is one where there’s no notion of distance between the values. For example, variables like Street Number, Position, Zip Code, Area Code, Flight Number, Course Code, etc. are all categorical variables. Python may represent them as numerical, so you would be able to compute the mean of Flight Number if you had a dataset in Python, but it would not be meaningful. Note — even binary random variables, like whether someone goes to Boston University, should be considered categorical.

</aside>
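
As a quick illustration of this point (a toy sketch; the FlightNumber column name and its values are made up), pandas will happily average an integer-coded categorical variable even though the result means nothing:

```python
import pandas as pd

# Flight Number is stored as an integer, so pandas computes its mean without
# complaint -- but the resulting number has no meaningful interpretation.
flights = pd.DataFrame({"FlightNumber": [1042, 88, 2301, 517]})
print(flights["FlightNumber"].mean())  # 987.0 -- a number, but not a meaningful one
```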

2.

Now, the correlation coefficient is not informative about the strength of the relationship between two variables when one of them is categorical. To see this, let’s consider the following two random variables: Points and Position. The former is a continuous variable (lots of values, and a meaningful notion of distance between the values), while the latter is a categorical variable (no meaningful notion of distance between the values).

We can visualize the relationship between Points and Position in Python by

  1. Relabeling the Positions with integer labels

  2. Creating a scatter plot of these integer labels and Points (see the code sketch below the figure).

    [Figure: scatter plot of Points against integer-coded Position labels]
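
Here is a minimal sketch of those two steps (plus the correlation computation discussed next), using a small made-up stand-in for the notes’ DataFrame; the column name 'Pos' for Position is an assumption, while 'Pos_num' and 'PTS' match the code in the notes:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the notes' data: Position ('Pos') and Points ('PTS')
df = pd.DataFrame({
    "Pos": ["PG", "SG", "C", "PF", "SF", "PG"],
    "PTS": [22.1, 18.4, 12.9, 15.2, 20.0, 9.7],
})

# 1. Relabel the positions with (arbitrary) integer labels
position_labels = {pos: i for i, pos in enumerate(df["Pos"].unique())}
df["Pos_num"] = df["Pos"].map(position_labels)

# 2. Scatter plot of the integer labels against Points
plt.scatter(df["Pos_num"], df["PTS"])
plt.xlabel("Position (integer label)")
plt.ylabel("Points")
plt.show()

# Correlation coefficient between the arbitrary labels and Points
print(df["Pos_num"].corr(df["PTS"]))
```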

We can then compute the correlation coefficient between these two variables via `df['Pos_num'].corr(df['PTS'])`. Now, is this number meaningful?

$$
\rho(\text{Position Number}, \text{Points}) = -0.006
$$