Everything we’ve done this week with datasets has been geared toward analyzing the entire dataset. We’ve plotted entire columns against each other using plt.scatter. We’ve computed the correlation coefficient across entire columns to understand how strongly one variable is linearly correlated with another over the whole time period.

We’ll find in practice, though, that we often want to work with subsets of a dataset, in particular subsets that satisfy a meaningful condition. To give a concrete example, let’s open up the following Colab notebook, which contains data on mortgages in Massachusetts.


Before we jump right into filtering, I want to introduce an extension to the .value_counts() method, which we’ve seen before. In a prior class, we showed that we can use it to see the frequency (or relative frequency) of values for a discrete variable. We also discussed briefly how it wasn’t really suitable for continuous variables.

To apply this method to a continuous variable, we either specify the number of evenly spaced bins to create, .value_counts(bins=5), or pass in a list of the thresholds to split the variable on, .value_counts(bins=[0, 2000, 10000, 20000]). Either way, pandas returns a Series of counts indexed by bin, which we can then use to create a bar graph! If counts holds that Series, we can call plt.bar(counts.index.astype(str), counts.values); converting the interval labels to strings gives matplotlib something it can place on the axis. See the notebook for more details.
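As a rough sketch of the two ways of binning described above, the snippet below uses a tiny made-up DataFrame. The column name loan_amount and its values are assumptions for illustration only; the notebook's real column and numbers will differ.

```python
import pandas as pd

# Hypothetical toy data (not from the mortgage notebook).
df = pd.DataFrame({"loan_amount": [150, 320, 95, 410, 275, 1200, 60, 880]})

# Option 1: ask pandas for five evenly spaced bins across the observed range.
even_counts = df["loan_amount"].value_counts(bins=5)

# Option 2: supply the thresholds ourselves.
# sort=False keeps the bins in threshold order rather than ordering by count.
edge_counts = df["loan_amount"].value_counts(bins=[0, 100, 500, 1500], sort=False)

# Each index entry is an interval; string versions are convenient bar labels.
labels = edge_counts.index.astype(str)

print(even_counts)
print(edge_counts)
```

Leaving sort at its default orders the result from most to least frequent bin, which is handy for a quick look but usually not what we want on the x-axis of a bar graph.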



Filtering

We can create a subset of a dataset by indexing a DataFrame with a boolean series. You might need to re-read that sentence a few times before it becomes clear. To understand it better, we’ll first clarify what a boolean series is and how to create one before seeing how it can be used to index a DataFrame.

Boolean Series

A boolean series is essentially a list-like pandas object where each value is either True or False. As an example, let’s make one which has the value True if the mortgage is a conventional mortgage and False otherwise: df['loan_type_name'] == 'Conventional'. We can also create a boolean series by combining multiple conditions using the logical operators & (and), | (or), and ~ (not). For instance, we could specify the following:

(df['loan_type_name'] == 'Conventional') & (df['loan_purpose_name'] == 'Home purchase')
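To see what these expressions actually produce, here is a minimal sketch on a four-row DataFrame. The rows are invented for illustration; only the column names loan_type_name and loan_purpose_name come from the notebook.

```python
import pandas as pd

# Hypothetical rows for illustration; the real data has many more columns and rows.
df = pd.DataFrame({
    "loan_type_name": ["Conventional", "FHA-insured", "Conventional", "VA-guaranteed"],
    "loan_purpose_name": ["Home purchase", "Home purchase", "Refinancing", "Refinancing"],
})

# A single comparison gives one True/False value per row.
is_conventional = df["loan_type_name"] == "Conventional"
is_purchase = df["loan_purpose_name"] == "Home purchase"

# Combine conditions element by element.
both = is_conventional & is_purchase    # True only where both hold
either = is_conventional | is_purchase  # True where at least one holds
not_conventional = ~is_conventional     # flips True and False

print(both.tolist())    # [True, False, False, False]
print(either.tolist())  # [True, True, True, False]
print(both.sum())       # 1, since True counts as 1 when summed
```

The parentheses around each comparison in the one-line version matter: Python evaluates & before ==, so leaving them out changes the meaning of the expression and typically raises an error.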


<aside>

Check your Understanding

Create a boolean series that takes the value True if the agency_abbr is not HUD

</aside>