In this notebook we dive into some plotting methods commonly used for Exploratory Data Analysis (EDA).
Our goals for EDA are to open-mindedly explore the data, and see what insights we may find.
The purpose of the EDA approach is to:
In this notebook we'll investigate these plotting techniques:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
%matplotlib inline
As with each notebook, we first need to read in our dataset.
df = pd.read_csv('loans.csv')
df.head()
Before diving into our exploratory data analysis, it is worth reiterating that this whole process is about understanding the distribution of data and relationships between different features.
When we move on to machine learning algorithms, we will be asking a question and trying to answer it using the statistical relationships between different features in the data. EDA will help us shape this question and give us a clear idea of how to approach building the algorithm!
With that in mind, let's look at several visualization methods to examine the data and any relationships between features…
To start, the scatter plot! This is a very popular and powerful way to visualize the relationship between two continuous features. Essentially, this plot shows us how feature Y varies as feature X varies. If the points form a clear pattern, we say that X and Y are correlated.
There are several outcomes we see on a scatter plot:
Let's try this out on our data and choose two continuous variables to plot. First, let's extract all the numeric variables from our dataset.
numeric_vars = df.select_dtypes(include=[np.number]).columns.tolist()
for variable in numeric_vars:
    print(variable)
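To see what `select_dtypes` keeps and drops, here is a minimal sketch on a tiny made-up DataFrame (`toy_df` and its values are assumptions for illustration, not the loans data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two numeric columns and one text column
toy_df = pd.DataFrame({
    'loan_amount': [500, 1200, 75],
    'funded_amount': [500.0, 900.0, 75.0],
    'sector': ['Retail', 'Food', 'Retail'],
})

# Keep only the columns whose dtype is numeric (ints and floats)
numeric_cols = toy_df.select_dtypes(include=[np.number]).columns.tolist()
print(numeric_cols)
```

Keep in mind that "numeric" is not always the same as "continuous": ID columns and counts are stored as numbers too, so it is worth scanning the printed list before plotting.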
To start, let's see whether there is a relationship between lender_count and loan_amount... intuition suggests that bigger loans must have more lenders. If this is true, we'll see it in the scatter plot!
ax = sns.regplot(x='lender_count', y='loan_amount', data=df)
Where does the data follow the line?
Where does the data not follow the line?
What are possible reasons that data does not follow the line?
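If we want a single number for how closely the points follow a straight line, one option is the Pearson correlation coefficient. The sketch below uses synthetic data (the column values and the "about 25 per lender" rule are assumptions made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the loans data (values are made up)
rng = np.random.default_rng(0)
lender_count = rng.integers(1, 100, size=500)
# Assumption: each lender contributes roughly 25, plus some noise
loan_amount = 25 * lender_count + rng.normal(0, 200, size=500)
toy_df = pd.DataFrame({'lender_count': lender_count,
                       'loan_amount': loan_amount})

# Pearson r: close to +1 or -1 means the points hug a line, close to 0 means no linear trend
r = toy_df['lender_count'].corr(toy_df['loan_amount'])
print(f'Pearson r = {r:.2f}')
```

On the real data the same call would be `df['lender_count'].corr(df['loan_amount'])`. A high value only says the two features move together; it does not tell us why.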
How about the repayment term and the loan amount?
What kind of relationship would you expect between the repayment term and the loan amount?
ax = sns.regplot(x='repayment_term',
                 y='loan_amount',
                 data=df)
Where does the data follow the line?
Where does the data not follow the line?
What are possible reasons that data does not follow the line?
When we have lots of continuous variables, we could go through them one by one to see the relationship or we could use a scatterplot matrix! This creates a scatter plot between every combination of variables in a list.
Another interesting quality of the scatter matrix is that the diagonals give a histogram of the variable in question.
# Let's choose only a couple of columns to examine:
columns = ['loan_amount', 'funded_amount', 'status']
num_df = df[columns]
num_df
# Remove the NaN rows so Seaborn can plot
num_df = num_df.dropna(axis=0, how='any')
# Create the scatter plot matrix and color the data points by their status.
sns.pairplot(num_df, hue='status');
What can we say about the data?
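A numeric companion to the scatterplot matrix is the correlation matrix, which gives one coefficient per pair of variables. This is a minimal sketch on made-up values (`toy_df` and the funding fractions are assumptions, not taken from the loans file):

```python
import numpy as np
import pandas as pd

# Made-up data: funded_amount is some fraction (50% to 100%) of loan_amount
rng = np.random.default_rng(3)
loan_amount = rng.uniform(100, 5000, size=200)
funded_amount = loan_amount * rng.uniform(0.5, 1.0, size=200)
toy_df = pd.DataFrame({'loan_amount': loan_amount,
                       'funded_amount': funded_amount})

# One Pearson coefficient for every pair of columns; the diagonal is always 1
corr_matrix = toy_df.corr()
print(corr_matrix.round(2))
```

The matrix is symmetric, just like the grid of scatter plots: the plot in row i, column j shows the same pair as the one in row j, column i with the axes swapped.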
A histogram is useful for looking at the distribution of values for a single variable and also identifying outliers. It shows us the count of data.
The plot below shows the data distribution of loan_amount using both bars and a continuous line. Without going into too much detail about the value on the y-axis, what we can take away from this is there is a much higher occurrence of small loans (high bar/peak in the line) and that large loans are much rarer (low bars/drop in the line).
# kde=True overlays the continuous density line on top of the bars
sns.displot(df['loan_amount'].dropna(), kde=True);
sns.histplot(df['loan_amount'].dropna(axis = 0));
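Under the hood, a count histogram is just "how many values fall in each bin". The sketch below reproduces those counts with `np.histogram` on made-up amounts (the lognormal parameters are an assumption chosen to give many small values and a few large ones):

```python
import numpy as np

# Made-up loan amounts: many small values and a few large ones
rng = np.random.default_rng(1)
amounts = rng.lognormal(mean=6.5, sigma=0.8, size=1000)

# Split the range from min to max into 10 equal-width bins and count the values in each
counts, bin_edges = np.histogram(amounts, bins=10)
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f'{left:8.0f} to {right:8.0f}: {count}')
```

Each printed row corresponds to one bar: the bin edges set its position and width, and the count sets its height.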
# Let's just look at those under 5K
small_loans_df = df[(df['loan_amount'] < 5000)]
sns.displot(small_loans_df['loan_amount']);
Looking at the loans under 5000, we see a much clearer distribution, although it is still right-skewed: most of the values sit on the left, with a long tail toward larger amounts.
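We can also check the direction of the skew numerically rather than by eye. This is a small sketch on synthetic amounts (the lognormal values are an assumption for illustration); a positive skewness means a long tail toward larger values, and in that case the mean is pulled above the median:

```python
import numpy as np
import pandas as pd

# Made-up amounts with a long tail toward large values
rng = np.random.default_rng(2)
amounts = pd.Series(rng.lognormal(mean=6.5, sigma=0.8, size=1000))

skewness = amounts.skew()          # > 0: tail to the right, < 0: tail to the left
print(f'skewness = {skewness:.2f}')
print(f'mean = {amounts.mean():.0f}, median = {amounts.median():.0f}')
```

On the real data, `small_loans_df['loan_amount'].skew()` would give the equivalent check.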
Bar plots are useful for understanding how categorical groups are different with respect to a continuous variable.
p = sns.barplot(x='sector', y = 'loan_amount', data=df, estimator=np.mean);
p.set(title='Average loan amount by sector')
p.set_xticklabels(p.get_xticklabels(), rotation=-45);
Which sector is the largest? Why?
p = sns.barplot(x='sector', y = 'loan_amount', data=df, estimator=np.sum);
p.set(title='Total loan amount by sector')
p.set_xticklabels(p.get_xticklabels(), rotation=-45);
p = sns.barplot(x='sector', y = 'loan_amount', data=df, estimator=np.sum, hue='status');
p.set(title='Total loan amount by sector and status')
p.set_xticklabels(p.get_xticklabels(), rotation=-45);
sns.set(rc={"figure.figsize":(10, 5)})
Which sector is the largest? Why?
p = sns.countplot(x = 'sector', data=df);
p.set_xticklabels(p.get_xticklabels(), rotation=-45);
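The three plots above (mean, total, and count per sector) correspond to three aggregations of the same groups. A minimal sketch with a hypothetical six-row table (the sectors and amounts are made up) shows why "largest" depends on which one we pick:

```python
import pandas as pd

# Hypothetical data: Agriculture has one big loan, Food has several medium ones
toy_df = pd.DataFrame({
    'sector': ['Retail', 'Retail', 'Food', 'Food', 'Food', 'Agriculture'],
    'loan_amount': [100, 300, 200, 400, 600, 1000],
})

# Same grouping, three different summaries: bar height in each of the plots above
summary = toy_df.groupby('sector')['loan_amount'].agg(['mean', 'sum', 'count'])
print(summary)
```

Here Agriculture has the highest mean, while Food has both the highest total and the most loans, so it pays to check all three before deciding which sector is the "largest".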
A box plot describes the distribution of data based on five important summary numbers: the minimum, first quartile, median, third quartile, and maximum. In the simplest box plot the central rectangle spans the first quartile to the third quartile (the interquartile range or IQR). A segment inside the rectangle shows the median and "whiskers" above and below the box show the locations of the minimum and maximum.
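To make the five summary numbers concrete, here is a sketch that computes them by hand on a few made-up amounts. It also applies the common 1.5 × IQR rule, which is what seaborn's box plot uses by default for its whiskers: they stop at the furthest point inside the fences, and anything beyond is drawn as an individual outlier, so the whiskers do not always reach the minimum and maximum.

```python
import numpy as np

# Made-up loan amounts with one very large value
amounts = np.array([100, 200, 250, 300, 350, 400, 500, 2500])

q1, median, q3 = np.percentile(amounts, [25, 50, 75])
iqr = q3 - q1

# Fences for the 1.5 * IQR rule; points outside them count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = amounts[(amounts < lower_fence) | (amounts > upper_fence)]

# Whiskers end at the most extreme points that are still inside the fences
inside = amounts[(amounts >= lower_fence) & (amounts <= upper_fence)]
print(f'min={amounts.min()}, Q1={q1}, median={median}, Q3={q3}, max={amounts.max()}')
print(f'whiskers: {inside.min()} to {inside.max()}, outliers: {outliers.tolist()}')
```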
Let's use this to look at the distribution of borrower counts by sector for different loan statuses and different partners. First, let's look at how many loans come from different partners.
df_retail = df[df.sector=='Retail']
df_retail.head()
# Define the sector once so the filter and the plot title stay in sync
sector = 'Retail'
df_retail = df[df.sector == sector]
p = sns.boxplot(x='sector',
                y='loan_amount',
                data=df_retail);
p.set(title=f'Loan amounts for {sector}');
p.set_xticklabels(p.get_xticklabels(), rotation=-45);
# Same plot for a different sector
sector = 'Food'
df_sector = df[df.sector == sector]
p = sns.boxplot(x='sector',
                y='loan_amount',
                data=df_sector);
p.set(title=f'Loan amounts for {sector}');
p.set_xticklabels(p.get_xticklabels(), rotation=-45);
Try this: select other sectors and see how they look.
Aha! It looks like we are onto something here... we can see different trends for different partners! We'll look into this further in feature_engineering to see how we can use this to create powerful features.
Quite often it's useful to see how a variable changes over time. This means creating a plot with time on the x-axis and the variable on the y-axis.
Let's have a look at how the average loan amount changes over time on a monthly basis.
# Convert the funded date column to datetime objects
time_column = 'funded_date'
df[time_column] = pd.to_datetime(df[time_column])
# Resample the dates to monthly intervals, taking the mean of loan_amount
# This creates a Series where the index is the timestamp and the value is the mean loan amount
time_data = df.resample('M', on=time_column)['loan_amount'].mean().fillna(0)
fig, ax = plt.subplots(figsize=(15,8))
ax.plot(time_data)
plt.title('Mean loan_amount over time');
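To see exactly what `resample` produces, here is a sketch on three made-up loans (the dates and amounts are assumptions). It uses the month-start alias `'MS'` rather than `'M'`, because recent pandas versions warn that `'M'` is deprecated; the grouping logic is the same, only the label on each bin differs.

```python
import pandas as pd

# Three hypothetical loans: two in January, none in February, one in March
toy_df = pd.DataFrame({
    'funded_date': pd.to_datetime(['2017-01-05', '2017-01-20', '2017-03-10']),
    'loan_amount': [100, 300, 500],
})

# One bin per calendar month, labelled by the first day of the month
monthly_mean = toy_df.resample('MS', on='funded_date')['loan_amount'].mean()
print(monthly_mean)

# Months with no loans come back as NaN; fillna(0) turns them into zeros
print(monthly_mean.fillna(0))
```

Note that the empty month becomes a zero after `fillna(0)`, which shows up in the line plot as a dip to zero even though no loan of amount zero exists. Leaving the NaN in place would draw a gap in the line instead.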
df.dtypes
We can look at different time frames by changing the parameter in resample. Let's look on a weekly basis!
# Resample the dates to weekly (7-day) intervals, taking the mean of loan_amount
# This creates a Series where the index is the timestamp and the value is the mean loan amount
time_data = df.resample('7D', on=time_column)['loan_amount'].mean().fillna(0)
fig, ax = plt.subplots(figsize=(15,8))
ax.plot(time_data)
plt.title('Mean loan_amount over time');