We wrapped up last class by creating our very first DataFrame in Pandas. We showed that we could create a DataFrame by (1) importing the pandas library (import pandas as pd ), (2) creating a dictionary and (3) passing the dictionary as an input to pd.DataFrame which is a function defined in pandas. We talked about how pd.DataFrame maps dictionaries to DataFrames.

We ended the class with the realization that slicing a DataFrame via df.loc method is inclusive. That df.loc[1:3] in our example selects the second, third, and forth rows of the DataFrame. This is because when we create the DataFrame, a “default” index of integers was created. We can view the index byprint(df.index). df.loc performs an inclusive slice in reference to the index. df.loc[1:3]slices starting from the element associated with the index 1 upto and including the element associated with label 3.

While this behavior may initially appear annoying (and you might prefer using .iloc which slices like a list, .loc is attractive because we can specify both the rows and names of columns that we would like to extract. For instance, to select the first three rows together with the vehicle and sports columns we can write df.loc[:3, ['vehicle', 'sports']].

<aside>

Check Your Understanding

What happens when you write

new_df = df.loc[1:3, ['vehicle', 'sports']]
new_df.loc[0]
# Try `new_df.iloc[0]` after

</aside>

Plotting

Our first visualization technique (if you can call it that) with DataFrames was to call the head/ tail methods to display the first / last 5 rows of the dataset. We’ve also talked about how to use .value_counts() to display the total counts of each value. These are helpful methods. But to get a better sense of our data, we’ll typically want to our data. To get started, let’s open up the following colab notebook which downloads a dataset on US consumer debt and credit from the New York Federal Reserve.

Pandas makes it relatively easy to read in real world data using pandas.read_csv or pandas.read_excel depending on the type of file. To ensure that everything goes well, though, it can be helpful to actually open the dataset as an in excel. After doing so, we notice that (1) There are multiple tabs each corresponding to a different figure/dataset (2) The column names are specified in the fourth row (3) The first column doesn't have a name. To ensure that pandas reads in the data as we want, we just have to specify a couple of arguments in the pandas.read_excel method.

mf.png

Plotting how the variables in a dataset evolve over time is a great way to get a sense of the data. To see the list of variables in the DataFrame, we can write df.columns . Doing so in our specific colab notebook reveals that we have a number of interesting variables to consider. Period corresponds to a time variable and the rest of the variables correspond to different kinds of US debt. We can visualize how the debt evolves by plotting them as follows.

# Importing a plotting library (only neeed to run this once)
import matplotlib.pyplot as plt 
plt.plot(df['Period'], df['Mortgage'])
plt.show()

Time Series of US Mortgage Debt (Source: New York Federal Reserve). Note, I’ve added a few additional statements to the notebook that improve the readability of the figure, so the code below won’t exactly produce the corresponding figure, but it contains the essential commands.

Time Series of US Mortgage Debt (Source: New York Federal Reserve). Note, I’ve added a few additional statements to the notebook that improve the readability of the figure, so the code below won’t exactly produce the corresponding figure, but it contains the essential commands.

<aside>

Check Your Understanding:

Using a for loop, create a figure that plots the the evolution of each type of debt over time (don’t include the total!).

Time Series of Types of Debt (Source: New York Federal Reserve)

Time Series of Types of Debt (Source: New York Federal Reserve)

</aside>

So graphs are clearly great. The differences in the data just jump out. The majority of U.S. household debt is in mortgages which is currently greater than 12 trillion dollars. Furthermore, we can use graphs to help us see the relationships between variables. Writing plt.scatter(df['Auto Loans'], df['Morgage']) allows us to see how the amount of Mortgage debt relates to the amount of Auto Loan Debt.

Scatter Plot of Mortgage Debt with Auto Loans (Source: New York Federal Reserve)

Scatter Plot of Mortgage Debt with Auto Loans (Source: New York Federal Reserve)