In this set of notes we’re going to:

  1. Introduce the notion of a Discrete Random Variable
  2. Introduce the idea of sampling from a distribution
  3. Create Sentences by sampling from a Joint Distribution over Text

<aside>

I recognize that people are not yet comfortable with the idea of a probability space. We are going to continue to “rep it” though until it becomes second nature to you. You will understand this framework by the end of the semester. And once you do, it will be difficult to think of probabilistic events and contexts without naturally thinking via this framework.

</aside>

Motivation

As mentioned, a central aim of this class is to help you think better about data. Much of the raw data that we’ve seen so far has consisted of lists of numbers: Points Per Game, Assists Per Game, etc. But more and more of the data that businesses work with, whether it’s client-facing or shared internally among teams, is textual. So it’s worthwhile in an introductory class like BA222 to get some exposure to working with textual data.

Now, modern textual analysis is largely about applying large language models to some task. So we’re going to begin by (1) understanding how to think about Large Language Models and (2) getting some experience working with them on a variety of tasks in Python.

Large Language Models

High-Level Introduction

One way to think of Large Language Models is that they are giant probability distributions over sequences of words. For the moment, don’t worry about the mathematically precise definition of what it means to assign probabilities to sequences of words. Instead, focus on the idea that a Large Language Model takes in a sentence and outputs the probability of that sentence occurring.

For example, we can pass the sentence “I walked my dog” to a Large Language Model like gpt2, and the model will tell us how likely that sentence is to occur. The function sentence_probability defined in this Colab notebook does this for us! Play around with it by varying the input sentence.

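To get a feel for the "sentence in, probability out" idea without loading a real model, here is a minimal toy sketch. It is not the notebook's sentence_probability and it is not how gpt2 works internally. The table TOY_BIGRAM_PROBS and the function toy_sentence_probability are hypothetical names, and every number in the table is made up for illustration. The sketch assumes each word depends only on the word right before it, and it multiplies those conditional probabilities together to score a sentence.

```python
# Toy stand-in for a language model (all numbers are made up for illustration).
# Each entry is P(next word | previous word); "<s>" marks the start of a sentence.
TOY_BIGRAM_PROBS = {
    "<s>": {"i": 0.6, "my": 0.4},
    "i": {"walked": 0.5, "fed": 0.5},
    "walked": {"my": 0.8, "i": 0.2},
    "fed": {"my": 1.0},
    "my": {"dog": 0.7, "cat": 0.3},
}


def toy_sentence_probability(sentence):
    """Score a sentence by multiplying P(word | previous word) across its words."""
    probability = 1.0
    previous = "<s>"
    for word in sentence.lower().split():
        # Word pairs missing from the table get probability 0 in this toy model.
        probability *= TOY_BIGRAM_PROBS.get(previous, {}).get(word, 0.0)
        previous = word
    return probability


for s in ["I walked my dog", "I walked my cat", "dog my walked I"]:
    print(s, "->", toy_sentence_probability(s))
```

In this toy table, “I walked my dog” scores higher than “I walked my cat”, and the scrambled sentence scores zero because the table never lets a sentence start with “dog”. A real Large Language Model plays the same role, except that its probabilities are learned from large amounts of text and it conditions on all the preceding words rather than just one.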

Usefulness