Pandas is a Python package for data analysis and exposes two core
data structures: DataFrames and Series.
In this notebook, we will explore the data structures that Pandas
provides, and learn how to interact with them.
To import an external Python library such as Pandas, use Python's
import statement. To save yourself some typing later on, you can
give the library you import an alias. Here, we are importing Pandas
and giving it the alias pd.
import pandas as pd
# Creating Series from lists
names = ['Rexy', 'Raphael', 'Rita', 'Razel', 'Randiel']
weight = [70, 75, 78, 73, 72]
names
# Use the weights as the index labels
SeriesNames = pd.Series(names, index=weight)
SeriesNames
# Look up the value whose index label is 70
SeriesNames[70]
SeriesNames2 = pd.Series(['Damalie','Damon','Durim','Donald'], index=['A','B','C','D'])
SeriesNames2
SeriesNames2['B']
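A Series supports both label-based and position-based access; a quick illustration using SeriesNames2:
# Label-based lookup using the index we supplied
SeriesNames2['B']
# Position-based lookup with .iloc (0-based), returning the same value
SeriesNames2.iloc[1]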
# Creating Series from dictionaries
cities = {0: 'Kampala',
          1: 'Arusha',
          2: 'Nyeri',
          3: 'Addis',
          4: 'Accra'}
cities
DSA_CITIES = pd.Series(cities)
DSA_CITIES
population = pd.Series({0: 250000,
                        1: 150000,
                        2: 230000,
                        3: 500000,
                        4: 100000})
population
AreaPerKm = pd.Series({0: 1000,
                       1: 5000,
                       2: 24000,
                       3: 4300,
                       4: 6000})
AreaPerKm
# Creating a DataFrame from the three Series
data = pd.DataFrame({'City': DSA_CITIES,
                     'Size': AreaPerKm,
                     'Population': population})
data
# Querying the data
# data.City
# print(data.columns)
data[['Size', 'Population']]
# Adding a new column
data['Population Density'] = data['Population'] / data['Size']
data
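With the new column in place you can, for example, sort the cities by density (sort_values is revisited on a larger dataset below):
data.sort_values('Population Density', ascending=False)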
We will load a CSV file as a DataFrame using Pandas' read_csv
function. This will allow us to use Pandas' DataFrame methods to
explore the data in the CSV.
df = pd.read_csv('loans_full_africa.csv')
Once we have loaded the CSV as a dataframe, we can start to explore the data. Here are a few useful methods:
df.head()
df.shape
df.head(8)
df.tail()
df.tail(3)
The dtypes attribute lets you see the data type of each column in your dataset:
df.dtypes
To get some basic stats for the columns, you can use .describe() for numeric data or .value_counts() for categorical data.
df.describe()
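By default, .describe() summarizes only the numeric columns; passing include='object' summarizes the categorical (text) columns instead:
df.describe(include='object')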
# Checking for missing values (True marks a missing cell)
df.isnull()
# Counting the missing values in each column
df.isnull().sum()
This chain lets you see the number of missing values in each column of your dataset.
# Dealing with missing values
# The right strategy is situation dependent
import numpy as np
# Dropping all rows that contain missing values
df1 = df.dropna()
# Replacing them with a fixed value
df2 = df.fillna(0)
# Replacing NULL values with the value from the previous row
# (fillna(method='ffill') is deprecated in recent Pandas releases)
df3 = df.ffill()
# Replacing NULL values with the value from the next row
df4 = df.bfill()
# Replacing them using replace
df5 = df.replace({np.nan: 5})
# Replacing them with the mean
df['loan_amount'].fillna(value=df['loan_amount'].mean())
# Replacing them with the mode
# (.mode() returns a Series, so take its first element)
df['borrower_count'].fillna(value=df['borrower_count'].mode()[0])
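Note that fillna returns a new Series rather than modifying df in place. To keep the result, assign it back to the column; here is a minimal sketch on a copy so that the original df stays intact:
df_filled = df.copy()
df_filled['loan_amount'] = df_filled['loan_amount'].fillna(df_filled['loan_amount'].mean())
df_filled['loan_amount'].isnull().sum()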
# Note: df_test = df would only create a reference to the same object,
# so we take an explicit copy instead
df_test = df.copy()
# Pitfall: assigning a dropna() result back aligns on the index,
# so the dropped positions simply come back as NaN
df_test['loan_amount'] = df_test['loan_amount'].dropna()
df_test.isnull().sum()
df1.shape
Alternatively, if you want just the count or the min / max of one column, you can use Pandas' built-in functions:
print(len(df['borrower_count']))
print(max(df['funded_amount']))
print(df['loan_amount'].mean())
df['activity'].value_counts()
df['activity'].unique()
df['activity'].nunique()
df.sort_values(by='location_town')
To examine a specific column of the DataFrame:
df['activity'].head()
df[['activity','basket_amount']].tail()
df['description_texts_en'].tail(10)
To examine specific rows and columns of a DataFrame, Pandas provides the iloc and loc methods. iloc is used when you want to specify a list or range of integer positions, and loc is used when you want to specify a list or range of labels.
For both of these methods you need to specify two elements, with the first element indicating the rows that you want to select and the second element indicating the columns that you want to select.
# Get the rows at positions 1 and 2 and the first five columns
# (iloc slices exclude the right endpoint)
df.iloc[1:3, :5]
# Get rows with index values of 2-4 and the columns basket_amount and activity
df.loc[2:4, ["basket_amount", "activity"]]
df.iloc[3500:3501,0]
df.loc[3500:3500,["activity"]]
What do you notice about the way the indices work for iloc versus loc?
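A small example makes the difference explicit: iloc slices exclude the right endpoint, while loc slices include it.
s = pd.Series(['a', 'b', 'c', 'd'])
# iloc: positions 1 and 2 -> 'b', 'c'
s.iloc[1:3]
# loc: labels 1 through 3, inclusive -> 'b', 'c', 'd'
s.loc[1:3]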
# To see all the rows and columns:
# Note: [remove the .head() to see it all]
df.iloc[:,:].head()
# You can also store a slice of the DataFrame as a new object!
# (selecting a single column with iloc returns a Series)
titles_df = df.iloc[:, 2]
titles_df.head()
A powerful feature of DataFrames is that you can view a subset of the DataFrame based on the values of the columns or rows. For example, let's say you only wanted to view loans with a status of "funded":
df['status'].value_counts()
df[df['status']=='funded'].head()
To view all loans with a status of "expired" or "fundraising":
df[(df['status']=='expired')|(df['status']=='fundraising')]
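The same "or" filter can be written more compactly with .isin():
df[df['status'].isin(['expired', 'fundraising'])]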
Select loans that have expired and that have loan amounts greater than 1000 (note that the parentheses around each condition are required):
df[(df['status']=='expired')&(df['loan_amount']>1000)]
You can group rows by a column that has repeated values, such as activity:
df.groupby('activity')['loan_amount'].sum()
df.groupby('activity')['loan_amount'].mean().reset_index()
df.groupby('activity')[['loan_amount','borrower_count']].std().reset_index()
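Several aggregations can also be computed in a single pass with .agg():
df.groupby('activity')['loan_amount'].agg(['sum', 'mean', 'std'])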
You can also perform SQL-style joins (inner, outer, left / right) using pd.merge(). Find documentation on this concept here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
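As a minimal sketch (the two frames and the id column here are made up for illustration), an inner join on a shared key looks like this:
left = pd.DataFrame({'id': [1, 2, 3], 'activity': ['Farming', 'Retail', 'Fishing']})
right = pd.DataFrame({'id': [2, 3, 4], 'loan_amount': [500, 1200, 800]})
# Only rows whose id appears in both frames (2 and 3) survive an inner join
pd.merge(left, right, on='id', how='inner')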
To measure the pairwise correlation between numeric columns, use .corr() (Pearson by default):
df[['loan_amount','borrower_count']].corr()
df[['loan_amount','borrower_count']].corr(method='kendall')
df[['loan_amount','borrower_count']].corr(method='spearman')
# If a library you need is missing, you can install it from within the notebook:
!pip install "name of the missing library"