In this set of notes we’re going to

  1. Distinguish between the Population and the Sample in data analysis
  2. Understand how to view a Population level data set within the framework of probability theory
  3. Discuss how taking the expected value of a transformation of a random variable can help us understand its distribution.
  4. Consider an application of Large Language Models

Let’s begin, then, by revisiting some of the data transformations we’ve seen in both the lectures and the homework, but now thinking about them from the perspective of our probability theory framework.

1.

When working with data, the first question we want to ask ourselves is: what are we trying to understand?

For the next few classes, we’re only going to consider questions that depend entirely on the data that we have. For instance, working with the dataset of all WNBA players (and their statistics), we can ask what the average number of points scored per game by players was last season. This question depends only on the players we observe in the dataset. When we observe all of the data needed to answer a question, as we do here, we say that we’re working at the population level.
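As a minimal sketch of what a population-level computation looks like in pandas (the file name 'wnba_2024.csv' is an assumption; substitute whichever file we've been using in class):

```python
import pandas as pd

# Load the 2024 WNBA player dataset (file name is an assumption;
# use whichever CSV we've worked with in lecture/homework).
df = pd.read_csv('wnba_2024.csv')

# Every player in the population appears as a row, so this single
# aggregation answers the question exactly: no estimation is needed.
avg_pts = df['PTS'].mean()
print(avg_pts)
```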

Conversely, you could imagine a situation where our data consists of the current employment records of 60,000 people in the United States, but we’re interested in the unemployment rate across the entire United States. In this case, the question depends both on the individuals that we observe in our data (the 60,000) and on many individuals whom we don’t observe in our data. When we observe only some of the data related to answering a question, we say that we’re working with a sample.

For the next few classes, we’ll focus on questions that can be answered by the data at hand. After we’re comfortable with population-level analysis, we’ll then consider how to work with samples.

2.

When working at the population level, it’s helpful to understand how we can view our dataset from the perspective of probability theory. To make things concrete, let’s consider the 2024 WNBA dataset. The dataset consists of rows representing different players and columns containing different statistical measures of the players’ performance. We can regard this dataset from the perspective of probability theory by saying that the sample space is the set of all WNBA players (represented by the rows) and the random variables are the columns. Random variables, as we’ve discussed, take elements from the sample space (here, the players) and output a corresponding value. That’s exactly what the columns do.

[Figure: a preview of the 2024 WNBA dataset, with players as rows and statistics such as PTS as columns]

One can view the points column, df['PTS'], as a function that maps each player in the WNBA to their average number of points per game this season. We’re not saying it’s literally a Python function. It is, after all, a pandas Series (see type(df['PTS'])), but in terms of the underlying information, it’s equivalent to one. It can tell us how many points per game a player like Caitlin Clark averaged this season.
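To make that analogy concrete, here's a small sketch that turns the series into a literal function from players to values (assuming, hypothetically, that the dataset labels players in a 'Player' column):

```python
# Re-index the points column by player name so it maps
# player -> points per game. The 'Player' column name is an
# assumption about how the dataset labels players.
pts = df.set_index('Player')['PTS']

# Wrapping the Series in an actual function makes the random-variable
# view literal: sample-space element in, value out.
def PTS(player):
    return pts[player]

print(PTS('Caitlin Clark'))  # the value of the random variable at one player
```

Nothing about the data changed here; we just repackaged the column so that evaluating the "random variable" at a player looks exactly like a function call.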