Everything we’ve done this week with datasets has been geared toward analyzing the entire dataset. We’ve plotted entire columns against each other using plt.scatter. We’ve computed the correlation coefficient across entire columns to understand how strongly one variable is linearly correlated with another.

In practice, though, we’ll find that we often want to work with subsets of a dataset, and in particular subsets that satisfy a meaningful condition. To give a concrete example of this, let’s open up the following Colab notebook, which contains data on mortgages in Massachusetts.


Before we jump right into filtering, though, I want to introduce an extension to the .value_counts() method, which we’ve seen before. In a prior class we showed that we can use it to see the frequency (or relative frequency) of each value of a discrete variable. We discussed briefly how it wasn’t really suitable for continuous variables, where nearly every value is unique. The workaround is to call .value_counts(bins=5), which groups the values into five equal-width bins, or .value_counts(bins=[0, 2000, 10000, 20000]), which uses the listed numbers as the bin edges.
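
To make the two forms of bins concrete, here is a minimal sketch. The loan amounts are made up for illustration and are not taken from the notebook, and the variable names are my own.

```python
import pandas as pd

# Hypothetical loan amounts (made up for illustration, not from the notebook).
loan_amounts = pd.Series([50, 120, 180, 250, 400, 950, 1500, 12000])

# An integer splits the observed range into that many equal-width bins.
equal_width_counts = loan_amounts.value_counts(bins=4)

# A list is treated as bin edges: 0 to 2000, 2000 to 10000, 10000 to 20000.
custom_counts = loan_amounts.value_counts(bins=[0, 2000, 10000, 20000])

print(equal_width_counts)
print(custom_counts)
```

With data this skewed, the equal-width bins put almost everything in the first bin, which is usually the reason to reach for custom edges.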


Filtering

We can create a subset of a dataset by indexing a DataFrame with a boolean series. You might need to re-read that sentence a few times before it becomes clear. To understand it better, we’ll first clarify what a boolean series is and how to create one before seeing how it can be used to index a DataFrame.

Boolean Series

A boolean series is essentially a list-like pandas object (a Series) where each value is either True or False. As an example, let’s make one which has the value True if the mortgage is a conventional mortgage and False otherwise: df['loan_type_name'] == 'Conventional'. We can also create a boolean series by combining multiple conditions using the logical operators & (and), | (or), and ~ (not). For instance, we could specify the following:

(df['loan_type_name'] == 'Conventional') & (df['loan_purpose_name'] == 'Home purchase')
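
To see what each operator does, here is a small sketch on a made-up four-row table. Only the column names mirror the notebook; the rows and the variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; only the column names mirror the mortgage notebook.
df = pd.DataFrame({
    "loan_type_name": ["Conventional", "FHA-insured", "Conventional", "VA-guaranteed"],
    "loan_purpose_name": ["Home purchase", "Home purchase", "Refinancing", "Home purchase"],
})

is_conventional = df["loan_type_name"] == "Conventional"
is_purchase = df["loan_purpose_name"] == "Home purchase"

both = is_conventional & is_purchase    # True only where both conditions hold
either = is_conventional | is_purchase  # True where at least one condition holds
not_conventional = ~is_conventional     # flips every True and False

print(both.tolist())    # [True, False, False, False]
print(either.tolist())  # [True, True, True, True]
```

When the conditions are written inline rather than stored in variables, each one needs its own parentheses, because & and | bind more tightly than == in Python.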


<aside>

Check your Understanding

Create a boolean series that takes the value True if the agency_abbr is not HUD

</aside>

We can then index a DataFrame with the boolean series as follows. Note that this line only creates a subset of the DataFrame; to keep the subset or use it later in our code, we’ll want to assign it to a variable.

df[df['loan_type_name'] == 'Conventional']
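
Here is a minimal sketch of filtering and then keeping the result. The rows are made up, and the loan_amount_000s column name is an assumption for illustration.

```python
import pandas as pd

# Made-up rows; loan_amount_000s is a hypothetical column name.
df = pd.DataFrame({
    "loan_type_name": ["Conventional", "FHA-insured", "Conventional", "VA-guaranteed"],
    "loan_amount_000s": [200, 150, 320, 275],
})

# Assign the filtered rows to a variable so we can reuse them later.
conventional = df[df["loan_type_name"] == "Conventional"]

print(len(conventional))                        # 2
print(conventional["loan_amount_000s"].mean())  # 260.0
```

Notice that the subset keeps the original row labels (0 and 2 here) rather than renumbering from zero, and that df itself is left unchanged.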

When filtering a DataFrame, we can also specify a subset of the columns to keep by using df.loc, as in