<aside>
I recognize that people are not yet comfortable with the idea of a probability space. We are going to continue to “rep it” though until it becomes second nature to you. You will understand this framework by the end of the semester. And once you do, it will be difficult to think of probabilistic events and contexts without naturally thinking via this framework.
</aside>
As mentioned, a central aim of this class is to help you better think about data. Much of the raw data that we've seen so far has consisted of lists of numbers - Points Per Game, Assists Per Game, and so on. But more and more of the data that businesses work with, whether it's client-facing or shared internally among teams, is textual. So it's worthwhile in an introductory class like BA222 to get some exposure to working with textual data.
Now, modern textual analysis is really all about applying large language models to some task. So we're going to begin by (1) understanding how to think about Large Language Models, and (2) getting some experience working with them on a variety of tasks in Python.
High Level Introduction
One way to think of Large Language Models is that they are giant probability distributions over sequences of words. For the moment, don't worry about the mathematically precise definition of what it means to assign probabilities to sequences of words. Instead, focus on the idea that a Large Language Model takes in a sentence and outputs the probability of that sentence occurring.
For example, we can pass the sentence “I walked my dog” to a Large Language Model like gpt2, and the model will tell us how likely that sentence is to occur. The function sentence_probability defined in this colab notebook does this for us! Play around with it by varying the initial sentence.
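If you're curious what a function like sentence_probability might look like under the hood, here is a minimal sketch using the Hugging Face transformers library and the gpt2 model. The implementation in the colab notebook may differ; only the function name is taken from above, and the details here are illustrative.

```python
# A sketch of a sentence_probability function built on Hugging Face transformers.
# Assumes: pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_probability(sentence):
    # Turn the sentence into token ids the model understands.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # Passing the inputs as labels makes the model return the average
        # negative log-likelihood per predicted token as `loss`.
        outputs = model(**inputs, labels=inputs["input_ids"])
    # GPT-2 predicts every token after the first, so multiply the average
    # negative log-likelihood by (n - 1) tokens to recover the total,
    # then exponentiate to get a probability for the whole sentence.
    n_predicted = inputs["input_ids"].shape[1] - 1
    return torch.exp(-outputs.loss * n_predicted).item()

print(sentence_probability("I walked my dog"))
print(sentence_probability("Dog my walked I"))  # scrambled word order: far less likely
```

The second print illustrates the key idea: a fluent English sentence gets a much higher probability than the same words in a scrambled order.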
Usefulness