1. Our dataset: Auralin and Novodra Trials
We will be looking at the phase two clinical trial data of 350 patients for a new innovative oral insulin called Auralin - a proprietary capsule that can solve this stomach lining problem.
Phase two trials are intended to:
In this trial, half of the patients are being treated with Auralin, and the other 175 being treated with a popular injectable insulin called Novodra. By comparing key metrics between these two drugs, we can determine if Auralin is effective.
import pandas as pd
import numpy as np
patients = pd.read_csv('patients.csv')
treatments = pd.read_csv('treatments.csv')
adverse_reactions = pd.read_csv('adverse_reactions.csv')
This Auralin Phase II clinical trial dataset comes in three tables: patients
, treatments
, and adverse_reactions
. Acquaint yourself with them through visual assessment below.
In the cells below, each column of each table in this clinical trial dataset is described. To see the table that goes hand in hand with these descriptions, display each table in its entirety by displaying the pandas DataFrame that it was gathered into. This task is the mechanical part of visual assessment in pandas.
# Display the patients table
patients
patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | contact | birthdate | weight | height | bmi | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | female | Zoe | Wellish | 576 Brown Bear Drive | Rancho California | California | 92390.0 | United States | 951-719-9170ZoeWellish@superrito.com | 7/10/1976 | 121.7 | 66 | 19.6 |
1 | 2 | female | Pamela | Hill | 2370 University Hill Road | Armstrong | Illinois | 61812.0 | United States | PamelaSHill@cuvox.de+1 (217) 569-3204 | 4/3/1967 | 118.8 | 66 | 19.2 |
2 | 3 | male | Jae | Debord | 1493 Poling Farm Road | York | Nebraska | 68467.0 | United States | 402-363-6804JaeMDebord@gustr.com | 2/19/1980 | 177.8 | 71 | 24.8 |
3 | 4 | male | Liêm | Phan | 2335 Webster Street | Woodbridge | NJ | 7095.0 | United States | PhanBaLiem@jourrapide.com+1 (732) 636-8246 | 7/26/1951 | 220.9 | 70 | 31.7 |
4 | 5 | male | Tim | Neudorf | 1428 Turkey Pen Lane | Dothan | AL | 36303.0 | United States | 334-515-7487TimNeudorf@cuvox.de | 2/18/1928 | 192.3 | 27 | 26.1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
498 | 499 | male | Mustafa | Lindström | 2530 Victoria Court | Milton Mills | ME | 3852.0 | United States | 207-477-0579MustafaLindstrom@jourrapide.com | 4/10/1959 | 181.1 | 72 | 24.6 |
499 | 500 | male | Ruman | Bisliev | 494 Clarksburg Park Road | Sedona | AZ | 86341.0 | United States | 928-284-4492RumanBisliev@gustr.com | 3/26/1948 | 239.6 | 70 | 34.4 |
500 | 501 | female | Jinke | de Keizer | 649 Nutter Street | Overland Park | MO | 64110.0 | United States | 816-223-6007JinkedeKeizer@teleworm.us | 1/13/1971 | 171.2 | 67 | 26.8 |
501 | 502 | female | Chidalu | Onyekaozulu | 3652 Boone Crockett Lane | Seattle | WA | 98109.0 | United States | ChidaluOnyekaozulu@jourrapide.com1 360 443 2060 | 2/13/1952 | 176.9 | 67 | 27.7 |
502 | 503 | male | Pat | Gersten | 2778 North Avenue | Burr | Nebraska | 68324.0 | United States | PatrickGersten@rhyta.com402-848-4923 | 5/3/1954 | 138.2 | 71 | 19.3 |
503 rows × 14 columns
patients
columns:
# Display the treatments table
treatments
given_name | surname | auralin | novodra | hba1c_start | hba1c_end | hba1c_change | |
---|---|---|---|---|---|---|---|
0 | veronika | jindrová | 41u - 48u | - | 7.63 | 7.20 | NaN |
1 | elliot | richardson | - | 40u - 45u | 7.56 | 7.09 | 0.97 |
2 | yukitaka | takenaka | - | 39u - 36u | 7.68 | 7.25 | NaN |
3 | skye | gormanston | 33u - 36u | - | 7.97 | 7.62 | 0.35 |
4 | alissa | montez | - | 33u - 29u | 7.78 | 7.46 | 0.32 |
... | ... | ... | ... | ... | ... | ... | ... |
275 | albina | zetticci | 45u - 51u | - | 7.93 | 7.73 | 0.20 |
276 | john | teichelmann | - | 49u - 49u | 7.90 | 7.58 | NaN |
277 | mathea | lillebø | 23u - 36u | - | 9.04 | 8.67 | 0.37 |
278 | vallie | prince | 31u - 38u | - | 7.64 | 7.28 | 0.36 |
279 | samúel | guðbrandsson | 53u - 56u | - | 8.00 | 7.64 | 0.36 |
280 rows × 7 columns
350 patients participated in this clinical trial. None of the patients were using Novodra (a popular injectable insulin) or Auralin (the oral insulin being researched) as their primary source of insulin before. All were experiencing elevated HbA1c levels.
All 350 patients were treated with Novodra to establish a baseline HbA1c level and insulin dose. After four weeks, which isn’t enough time to capture all the change in HbA1c that can be attributed by the switch to Auralin or Novodra:
treatments
columns:
hba1c_start
- hba1c_end
. For Auralin to be deemed effective, it must be "noninferior" to Novodra, the current standard for insulin. This "noninferiority" is statistically defined as the upper bound of the 95% confidence interval being less than 0.4% for the difference between the mean HbA1c changes for Novodra and Auralin (i.e. Novodra minus Auralin).# Display the adverse_reactions table
adverse_reactions
given_name | surname | adverse_reaction | |
---|---|---|---|
0 | berta | napolitani | injection site discomfort |
1 | lena | baer | hypoglycemia |
2 | joseph | day | hypoglycemia |
3 | flavia | fiorentino | cough |
4 | manouck | wubbels | throat irritation |
5 | jasmine | sykes | hypoglycemia |
6 | louise | johnson | hypoglycemia |
7 | albinca | komavec | hypoglycemia |
8 | noe | aranda | hypoglycemia |
9 | sofia | hermansen | injection site discomfort |
10 | tegan | johnson | headache |
11 | abel | yonatan | cough |
12 | abdul-nur | isa | hypoglycemia |
13 | leon | scholz | injection site discomfort |
14 | gabriele | saenger | hypoglycemia |
15 | jia li | teng | nausea |
16 | jakob | jakobsen | hypoglycemia |
17 | christopher | woodward | nausea |
18 | ole | petersen | hypoglycemia |
19 | finley | chandler | headache |
20 | anenechi | chidi | hypoglycemia |
21 | miłosław | wiśniewski | injection site discomfort |
22 | lixue | hsueh | injection site discomfort |
23 | merci | leroux | hypoglycemia |
24 | kang | mai | injection site discomfort |
25 | elliot | richardson | hypoglycemia |
26 | clinton | miller | throat irritation |
27 | idalia | moore | hypoglycemia |
28 | xiuxiu | chang | hypoglycemia |
29 | alex | crawford | hypoglycemia |
30 | monika | lončar | hypoglycemia |
31 | steven | roy | headache |
32 | cecilie | nilsen | hypoglycemia |
33 | krisztina | magyar | hypoglycemia |
adverse_reactions
columns:
Additional useful information:
patients
table.patients
tabletreatments
tableadverse_reactions
tableThese are the programmatic assessment methods in pandas that you will probably use most often:
.head()
(DataFrame and Series).tail()
(DataFrame and Series).sample()
(DataFrame and Series).info()
(DataFrame only).describe()
(DataFrame and Series).value_counts()
(Series only)Try them out below and keep their results in mind. Some will come in handy later in the lesson.
Check out the pandas API reference for detailed usage information.
.head
and .tail
on the patients
table.patients.head()
patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | contact | birthdate | weight | height | bmi | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | female | Zoe | Wellish | 576 Brown Bear Drive | Rancho California | California | 92390.0 | United States | 951-719-9170ZoeWellish@superrito.com | 7/10/1976 | 121.7 | 66 | 19.6 |
1 | 2 | female | Pamela | Hill | 2370 University Hill Road | Armstrong | Illinois | 61812.0 | United States | PamelaSHill@cuvox.de+1 (217) 569-3204 | 4/3/1967 | 118.8 | 66 | 19.2 |
2 | 3 | male | Jae | Debord | 1493 Poling Farm Road | York | Nebraska | 68467.0 | United States | 402-363-6804JaeMDebord@gustr.com | 2/19/1980 | 177.8 | 71 | 24.8 |
3 | 4 | male | Liêm | Phan | 2335 Webster Street | Woodbridge | NJ | 7095.0 | United States | PhanBaLiem@jourrapide.com+1 (732) 636-8246 | 7/26/1951 | 220.9 | 70 | 31.7 |
4 | 5 | male | Tim | Neudorf | 1428 Turkey Pen Lane | Dothan | AL | 36303.0 | United States | 334-515-7487TimNeudorf@cuvox.de | 2/18/1928 | 192.3 | 27 | 26.1 |
patients.tail()
patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | contact | birthdate | weight | height | bmi | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
498 | 499 | male | Mustafa | Lindström | 2530 Victoria Court | Milton Mills | ME | 3852.0 | United States | 207-477-0579MustafaLindstrom@jourrapide.com | 4/10/1959 | 181.1 | 72 | 24.6 |
499 | 500 | male | Ruman | Bisliev | 494 Clarksburg Park Road | Sedona | AZ | 86341.0 | United States | 928-284-4492RumanBisliev@gustr.com | 3/26/1948 | 239.6 | 70 | 34.4 |
500 | 501 | female | Jinke | de Keizer | 649 Nutter Street | Overland Park | MO | 64110.0 | United States | 816-223-6007JinkedeKeizer@teleworm.us | 1/13/1971 | 171.2 | 67 | 26.8 |
501 | 502 | female | Chidalu | Onyekaozulu | 3652 Boone Crockett Lane | Seattle | WA | 98109.0 | United States | ChidaluOnyekaozulu@jourrapide.com1 360 443 2060 | 2/13/1952 | 176.9 | 67 | 27.7 |
502 | 503 | male | Pat | Gersten | 2778 North Avenue | Burr | Nebraska | 68324.0 | United States | PatrickGersten@rhyta.com402-848-4923 | 5/3/1954 | 138.2 | 71 | 19.3 |
.sample
on the treatments
table.treatments.sample(5)
given_name | surname | auralin | novodra | hba1c_start | hba1c_end | hba1c_change | |
---|---|---|---|---|---|---|---|
60 | onyekachukwu | obinna | 37u - 46u | - | 7.58 | 7.12 | NaN |
54 | oles | zhdanov | 54u - 67u | - | 7.52 | 7.11 | NaN |
255 | jia li | teng | 48u - 54u | - | 7.66 | 7.32 | 0.34 |
143 | nora | nyborg | 55u - 59u | - | 7.83 | 7.48 | 0.35 |
271 | leo | vieira | - | 30u - 33u | 7.74 | 7.36 | NaN |
.info
on the all tables.patients.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 patient_id 503 non-null int64
1 assigned_sex 503 non-null object
2 given_name 503 non-null object
3 surname 503 non-null object
4 address 491 non-null object
5 city 491 non-null object
6 state 491 non-null object
7 zip_code 491 non-null float64
8 country 491 non-null object
9 contact 491 non-null object
10 birthdate 503 non-null object
11 weight 503 non-null float64
12 height 503 non-null int64
13 bmi 503 non-null float64
dtypes: float64(3), int64(2), object(9)
memory usage: 55.1+ KB
treatments.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 given_name 280 non-null object
1 surname 280 non-null object
2 auralin 280 non-null object
3 novodra 280 non-null object
4 hba1c_start 280 non-null float64
5 hba1c_end 280 non-null float64
6 hba1c_change 171 non-null float64
dtypes: float64(3), object(4)
memory usage: 15.4+ KB
adverse_reactions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 given_name 34 non-null object
1 surname 34 non-null object
2 adverse_reaction 34 non-null object
dtypes: object(3)
memory usage: 944.0+ bytes
.describe
on the patients
& treatment
tables.patients.describe()
patient_id | zip_code | weight | height | bmi | |
---|---|---|---|---|---|
count | 503.000000 | 491.000000 | 503.000000 | 503.000000 | 503.000000 |
mean | 252.000000 | 49084.118126 | 173.434990 | 66.634195 | 27.483897 |
std | 145.347859 | 30265.807442 | 33.916741 | 4.411297 | 5.276438 |
min | 1.000000 | 1002.000000 | 48.800000 | 27.000000 | 17.100000 |
25% | 126.500000 | 21920.500000 | 149.300000 | 63.000000 | 23.300000 |
50% | 252.000000 | 48057.000000 | 175.300000 | 67.000000 | 27.200000 |
75% | 377.500000 | 75679.000000 | 199.500000 | 70.000000 | 31.750000 |
max | 503.000000 | 99701.000000 | 255.900000 | 79.000000 | 37.700000 |
treatments.describe()
hba1c_start | hba1c_end | hba1c_change | |
---|---|---|---|
count | 280.000000 | 280.000000 | 171.000000 |
mean | 7.985929 | 7.589286 | 0.546023 |
std | 0.568638 | 0.569672 | 0.279555 |
min | 7.500000 | 7.010000 | 0.200000 |
25% | 7.660000 | 7.270000 | 0.340000 |
50% | 7.800000 | 7.420000 | 0.380000 |
75% | 7.970000 | 7.570000 | 0.920000 |
max | 9.950000 | 9.580000 | 0.990000 |
.value_counts
on the adverse_reaction column of the adverse_reactions
table.adverse_reactions.adverse_reaction.value_counts()
hypoglycemia 19
injection site discomfort 6
headache 3
cough 2
throat irritation 2
nausea 2
Name: adverse_reaction, dtype: int64
patients
table for patients that are from the city New York.patients[patients.city == 'New York']
patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | contact | birthdate | weight | height | bmi | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9 | 10 | female | Sophie | Cabrera | 3303 Anmoore Road | New York | New York | 10011.0 | United States | SophieCabreraIbarra@teleworm.us1 718 795 9124 | 12/3/1930 | 194.7 | 64 | 33.4 |
35 | 36 | female | Kamila | Pecinová | 3558 Longview Avenue | New York | New York | 10004.0 | United States | 718-501-0503KamilaPecinova@dayrep.com | 12/23/1985 | 198.9 | 62 | 36.4 |
84 | 85 | female | Nương | Vũ | 465 Southern Street | New York | NY | 10001.0 | United States | VuCamNuong@fleckens.hu516-720-5094 | 2/1/1981 | 138.2 | 63 | 24.5 |
129 | 130 | female | Rebecca | Jephcott | 989 Wayback Lane | New York | NY | 10004.0 | United States | 631-370-7406RebeccaJephcott@armyspy.com | 8/1/1966 | 203.3 | 65 | 33.8 |
142 | 143 | male | Finley | Chandler | 2754 Westwood Avenue | New York | New York | 10001.0 | United States | 516-740-5280FinleyChandler@dayrep.com | 10/25/1936 | 150.9 | 70 | 21.6 |
152 | 153 | male | Christopher | Woodward | 3450 Southern Street | New York | NY | 10004.0 | United States | ChristopherWoodward@jourrapide.com+1 (516) 630... | 9/4/1984 | 212.2 | 66 | 34.2 |
188 | 189 | male | Søren | Sørensen | 2397 Bell Street | New York | NY | 10011.0 | United States | SrenSrensen@superrito.com1 212 201 3108 | 12/31/1942 | 157.1 | 67 | 24.6 |
213 | 214 | female | Onyemaechi | Onwughara | 685 Duncan Avenue | New York | NY | 10013.0 | United States | 917-622-9142OnyemaechiOnwughara@einrot.com | 3/8/1989 | 131.1 | 69 | 19.4 |
215 | 216 | male | John | Doe | 123 Main Street | New York | NY | 12345.0 | United States | johndoe@email.com1234567890 | 1/1/1975 | 180.0 | 72 | 24.4 |
229 | 230 | male | John | Doe | 123 Main Street | New York | NY | 12345.0 | United States | johndoe@email.com1234567890 | 1/1/1975 | 180.0 | 72 | 24.4 |
237 | 238 | male | John | Doe | 123 Main Street | New York | NY | 12345.0 | United States | johndoe@email.com1234567890 | 1/1/1975 | 180.0 | 72 | 24.4 |
244 | 245 | male | John | Doe | 123 Main Street | New York | NY | 12345.0 | United States | johndoe@email.com1234567890 | 1/1/1975 | 180.0 | 72 | 24.4 |
247 | 248 | male | Tuukka | Leppäluoto | 1886 Bicetown Road | New York | NY | 10011.0 | United States | 917-408-8855TuukkaLeppaluoto@teleworm.us | 3/7/1978 | 211.0 | 73 | 27.8 |
251 | 252 | male | John | Doe | 123 Main Street | New York | NY | 12345.0 | United States | johndoe@email.com1234567890 | 1/1/1975 | 180.0 | 72 | 24.4 |
263 | 264 | female | Julia | Carvalho | 3662 Shinn Street | New York | NY | 10036.0 | United States | JuliaAzevedoCarvalho@superrito.com+1 (212) 782... | 4/11/1931 | 171.8 | 61 | 32.5 |
277 | 278 | male | John | Doe | 123 Main Street | New York | NY | 12345.0 | United States | johndoe@email.com1234567890 | 1/1/1975 | 180.0 | 72 | 24.4 |
301 | 302 | female | Onyekachukwu | Obinna | 2970 Forest Avenue | New York | NY | 10004.0 | United States | OnyekachukwuObinna@teleworm.us646-982-6609 | 1/24/1997 | 154.7 | 65 | 25.7 |
461 | 462 | male | Cannan | Cabrera | 2102 Geraldine Lane | New York | NY | 10014.0 | United States | 646-289-4177CannanCabreraOrdonez@superrito.com | 10/12/1980 | 209.7 | 71 | 29.2 |
patients.loc[patients['city'] == 'New York']
patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | contact | birthdate | weight | height | bmi | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9 | 10 | female | Sophie | Cabrera | 3303 Anmoore Road | New York | New York | 10011.0 | United States | SophieCabreraIbarra@teleworm.us1 718 795 9124 | 12/3/1930 | 194.7 | 64 | 33.4 |
35 | 36 | female | Kamila | Pecinová | 3558 Longview Avenue | New York | New York | 10004.0 | United States | 718-501-0503KamilaPecinova@dayrep.com | 12/23/1985 | 198.9 | 62 | 36.4 |
84 | 85 | female | Nương | Vũ | 465 Southern Street | New York | NY | 10001.0 | United States | VuCamNuong@fleckens.hu516-720-5094 | 2/1/1981 | 138.2 | 63 | 24.5 |
129 | 130 | female | Rebecca | Jephcott | 989 Wayback Lane | New York | NY | 10004.0 | United States | 631-370-7406RebeccaJephcott@armyspy.com | 8/1/1966 | 203.3 | 65 | 33.8 |
142 | 143 | male | Finley | Chandler | 2754 Westwood Avenue | New York | New York | 10001.0 | United States | 516-740-5280FinleyChandler@dayrep.com | 10/25/1936 | 150.9 | 70 | 21.6 |
152 | 153 | male | Christopher | Woodward | 3450 Southern Street | New York | NY | 10004.0 | United States | ChristopherWoodward@jourrapide.com+1 (516) 630... | 9/4/1984 | 212.2 | 66 | 34.2 |
188 | 189 | male | Søren | Sørensen | 2397 Bell Street | New York | NY | 10011.0 | United States | SrenSrensen@superrito.com1 212 201 3108 | 12/31/1942 | 157.1 | 67 | 24.6 |
213 | 214 | female | Onyemaechi | Onwughara | 685 Duncan Avenue | New York | NY | 10013.0 | United States | 917-622-9142OnyemaechiOnwughara@einrot.com | 3/8/1989 | 131.1 | 69 | 19.4 |
215 | 216 | male | John | Doe | 123 Main Street | New York | NY | 12345.0 | United States | johndoe@email.com1234567890 | 1/1/1975 | 180.0 | 72 | 24.4 |
229 | 230 | male | John | Doe | 123 Main Street | New York | NY | 12345.0 | United States | johndoe@email.com1234567890 | 1/1/1975 | 180.0 | 72 | 24.4 |
237 | 238 | male | John | Doe | 123 Main Street | New York | NY | 12345.0 | United States | johndoe@email.com1234567890 | 1/1/1975 | 180.0 | 72 | 24.4 |
244 | 245 | male | John | Doe | 123 Main Street | New York | NY | 12345.0 | United States | johndoe@email.com1234567890 | 1/1/1975 | 180.0 | 72 | 24.4 |
247 | 248 | male | Tuukka | Leppäluoto | 1886 Bicetown Road | New York | NY | 10011.0 | United States | 917-408-8855TuukkaLeppaluoto@teleworm.us | 3/7/1978 | 211.0 | 73 | 27.8 |
251 | 252 | male | John | Doe | 123 Main Street | New York | NY | 12345.0 | United States | johndoe@email.com1234567890 | 1/1/1975 | 180.0 | 72 | 24.4 |
263 | 264 | female | Julia | Carvalho | 3662 Shinn Street | New York | NY | 10036.0 | United States | JuliaAzevedoCarvalho@superrito.com+1 (212) 782... | 4/11/1931 | 171.8 | 61 | 32.5 |
277 | 278 | male | John | Doe | 123 Main Street | New York | NY | 12345.0 | United States | johndoe@email.com1234567890 | 1/1/1975 | 180.0 | 72 | 24.4 |
301 | 302 | female | Onyekachukwu | Obinna | 2970 Forest Avenue | New York | NY | 10004.0 | United States | OnyekachukwuObinna@teleworm.us646-982-6609 | 1/24/1997 | 154.7 | 65 | 25.7 |
461 | 462 | male | Cannan | Cabrera | 2102 Geraldine Lane | New York | NY | 10014.0 | United States | 646-289-4177CannanCabreraOrdonez@superrito.com | 10/12/1980 | 209.7 | 71 | 29.2 |
len(patients[patients.city == 'New York'])
18
patients[patients['address'].isnull()]
patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | contact | birthdate | weight | height | bmi | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
209 | 210 | female | Lalita | Eldarkhanov | NaN | NaN | NaN | NaN | NaN | NaN | 8/14/1950 | 143.4 | 62 | 26.2 |
219 | 220 | male | Mỹ | Quynh | NaN | NaN | NaN | NaN | NaN | NaN | 4/9/1978 | 237.8 | 69 | 35.1 |
230 | 231 | female | Elisabeth | Knudsen | NaN | NaN | NaN | NaN | NaN | NaN | 9/23/1976 | 165.9 | 63 | 29.4 |
234 | 235 | female | Martina | Tománková | NaN | NaN | NaN | NaN | NaN | NaN | 4/7/1936 | 199.5 | 65 | 33.2 |
242 | 243 | male | John | O'Brian | NaN | NaN | NaN | NaN | NaN | NaN | 2/25/1957 | 205.3 | 74 | 26.4 |
249 | 250 | male | Benjamin | Mehler | NaN | NaN | NaN | NaN | NaN | NaN | 10/30/1951 | 146.5 | 69 | 21.6 |
257 | 258 | male | Jin | Kung | NaN | NaN | NaN | NaN | NaN | NaN | 5/17/1995 | 231.7 | 69 | 34.2 |
264 | 265 | female | Wafiyyah | Asfour | NaN | NaN | NaN | NaN | NaN | NaN | 11/3/1989 | 158.6 | 63 | 28.1 |
269 | 270 | female | Flavia | Fiorentino | NaN | NaN | NaN | NaN | NaN | NaN | 10/9/1937 | 175.2 | 61 | 33.1 |
278 | 279 | female | Generosa | Cabán | NaN | NaN | NaN | NaN | NaN | NaN | 12/16/1962 | 124.3 | 69 | 18.4 |
286 | 287 | male | Lewis | Webb | NaN | NaN | NaN | NaN | NaN | NaN | 4/1/1979 | 155.3 | 68 | 23.6 |
296 | 297 | female | Chỉ | Lâm | NaN | NaN | NaN | NaN | NaN | NaN | 5/14/1990 | 181.1 | 63 | 32.1 |
patients
tabletreatments
tableadverse_reactions
tableIn the video, it is stated that "having the country column as the data type object (string) is fine, while I argue that state should be the category data type". This topic deserves a little bit more discussion.
state is categorical because its values are a finite set of options without order. country, for all intents and purposes, also has a finite set of values and therefore could be argued to be of categorical type as well. It seems there isn't much freedom of values in country to deserve classifying it as a string.
So why use object here for the data type for country? Well, country has a lot of values. Categorical data with tons of categories isn't that useful. Another reason for using object here is situational, i.e., it depends on the context in which you'd like to use the country column. In this dataset, all of the clinical trial patients are from the United States, so there are no advantages gained from switching the data type from object to category. The country column won't be used for analysis.
A more general scenario outside of this dataset is as follows. Say you had one to a few observations from each country, it would probably be best to treat country like a string and group observations on a larger unit, like world_region (Africa, Asia, Central America, etc.). If you had a lot of observations from a few countries, like test scores from students sampled in a handful of countries, making country categorical would be more appropriate.
.value_counts
on the surname and address columns of the patients
table.patients.surname.value_counts()
Doe 6
Jakobsen 3
Taylor 3
Ogochukwu 2
Tucker 2
..
Casárez 1
Mata 1
Pospíšil 1
Rukavina 1
Onyekaozulu 1
Name: surname, Length: 466, dtype: int64
patients.address.value_counts()
123 Main Street 6
2778 North Avenue 2
2476 Fulton Street 2
648 Old Dear Lane 2
3094 Oral Lake Road 1
..
1066 Goosetown Drive 1
4291 Patton Lane 1
4643 Reeves Street 1
174 Lost Creek Road 1
3652 Boone Crockett Lane 1
Name: address, Length: 483, dtype: int64
.duplicated
on the address column of the patients
table.patients[patients.address.duplicated()]
patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | contact | birthdate | weight | height | bmi | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
29 | 30 | male | Jake | Jakobsen | 648 Old Dear Lane | Port Jervis | New York | 12771.0 | United States | JakobCJakobsen@einrot.com+1 (845) 858-7707 | 8/1/1985 | 155.8 | 67 | 24.4 |
219 | 220 | male | Mỹ | Quynh | NaN | NaN | NaN | NaN | NaN | NaN | 4/9/1978 | 237.8 | 69 | 35.1 |
229 | 230 | male | John | Doe | 123 Main Street | New York | NY | 12345.0 | United States | johndoe@email.com1234567890 | 1/1/1975 | 180.0 | 72 | 24.4 |
230 | 231 | female | Elisabeth | Knudsen | NaN | NaN | NaN | NaN | NaN | NaN | 9/23/1976 | 165.9 | 63 | 29.4 |
234 | 235 | female | Martina | Tománková | NaN | NaN | NaN | NaN | NaN | NaN | 4/7/1936 | 199.5 | 65 | 33.2 |
237 | 238 | male | John | Doe | 123 Main Street | New York | NY | 12345.0 | United States | johndoe@email.com1234567890 | 1/1/1975 | 180.0 | 72 | 24.4 |
242 | 243 | male | John | O'Brian | NaN | NaN | NaN | NaN | NaN | NaN | 2/25/1957 | 205.3 | 74 | 26.4 |
244 | 245 | male | John | Doe | 123 Main Street | New York | NY | 12345.0 | United States | johndoe@email.com1234567890 | 1/1/1975 | 180.0 | 72 | 24.4 |
249 | 250 | male | Benjamin | Mehler | NaN | NaN | NaN | NaN | NaN | NaN | 10/30/1951 | 146.5 | 69 | 21.6 |
251 | 252 | male | John | Doe | 123 Main Street | New York | NY | 12345.0 | United States | johndoe@email.com1234567890 | 1/1/1975 | 180.0 | 72 | 24.4 |
257 | 258 | male | Jin | Kung | NaN | NaN | NaN | NaN | NaN | NaN | 5/17/1995 | 231.7 | 69 | 34.2 |
264 | 265 | female | Wafiyyah | Asfour | NaN | NaN | NaN | NaN | NaN | NaN | 11/3/1989 | 158.6 | 63 | 28.1 |
269 | 270 | female | Flavia | Fiorentino | NaN | NaN | NaN | NaN | NaN | NaN | 10/9/1937 | 175.2 | 61 | 33.1 |
277 | 278 | male | John | Doe | 123 Main Street | New York | NY | 12345.0 | United States | johndoe@email.com1234567890 | 1/1/1975 | 180.0 | 72 | 24.4 |
278 | 279 | female | Generosa | Cabán | NaN | NaN | NaN | NaN | NaN | NaN | 12/16/1962 | 124.3 | 69 | 18.4 |
282 | 283 | female | Sandy | Taylor | 2476 Fulton Street | Rainelle | WV | 25962.0 | United States | 304-438-2648SandraCTaylor@dayrep.com | 10/23/1960 | 206.1 | 64 | 35.4 |
286 | 287 | male | Lewis | Webb | NaN | NaN | NaN | NaN | NaN | NaN | 4/1/1979 | 155.3 | 68 | 23.6 |
296 | 297 | female | Chỉ | Lâm | NaN | NaN | NaN | NaN | NaN | NaN | 5/14/1990 | 181.1 | 63 | 32.1 |
502 | 503 | male | Pat | Gersten | 2778 North Avenue | Burr | Nebraska | 68324.0 | United States | PatrickGersten@rhyta.com402-848-4923 | 5/3/1954 | 138.2 | 71 | 19.3 |
.sort_values
on the weight column of the patients
table.patients.weight.sort_values()
210 48.8
459 102.1
335 102.7
74 103.2
317 106.0
...
144 244.9
61 244.9
283 245.5
118 254.5
485 255.9
Name: weight, Length: 503, dtype: float64
The minimum value of weight in the patients table is 48.8 pounds which looks very low. We can check this by looking at the height and BMI entries for this patient we can see that 48.8 is actually kilograms instead of pounds.
# Convert weight to punds and calculate BMI to see if it matches the recorded bmi
weight_lbs = patients[patients.surname == 'Zaitseva'].weight * 2.20462 # to convert to pounds
height_in = patients[patients.surname == 'Zaitseva'].height
bmi_check = 703 * weight_lbs / ( height_in * height_in )
bmi_check
210 19.055827
dtype: float64
# Check recorded bmi to see if it matches bmi_check
patients[patients.surname == 'Zaitseva'].bmi
210 19.1
Name: bmi, dtype: float64
.isnull
on the auraline and novodra columns of the treatments
table.sum(treatments.auralin.isnull())
0
sum(treatments.novodra.isnull())
0
patients
tabletreatments
tableadverse_reactions
tablepatients
table should be split into phone number and email.treatments
table (treatment, start dose and end dose).treatments
table.patients
table duplicated in treatments
and adverse_reactions
tables.The auralin and novodra columns violate the first rule of tidiness: that each variable forms a column. There are three variables: treatment (auralin or novodra), start dose (for example 41 units), and end dose (for example 48 units).
Because there are three variables, there should be three columns and there are currently only two, and both contain two variables. The auralin column contains the start and end dose for auralin patients and the novodra column contains the start and end dose for patients that were treated with novodra.
If these two columns both contain two variables, start and end dose each, where is that missing third variable?
The third variable, treatment, is hidden in the column headers. Column headers in this case are values, not variable names. Instead of the auralin and novodra columns, there should be three columns:
treatment
, which contains values Auralin or Novodrastart_dose
end_dose
Two tables are needed: patients
and treatments
.
patients
should have the same columns as the current patients table.adverse_reactions
table should be included in the treatments
table. When looking at treatment data, the adverse reaction for that treatment is the same type of observational unit.adverse_reactions
table are already present in both the treatments and patients table, so they can be eliminated.# Create a series of all table columns to use .duplicated()
all_columns = pd.Series(list(patients) + list(treatments) + list(adverse_reactions))
all_columns
0 patient_id
1 assigned_sex
2 given_name
3 surname
4 address
5 city
6 state
7 zip_code
8 country
9 contact
10 birthdate
11 weight
12 height
13 bmi
14 given_name
15 surname
16 auralin
17 novodra
18 hba1c_start
19 hba1c_end
20 hba1c_change
21 given_name
22 surname
23 adverse_reaction
dtype: object
len(all_columns)
24
all_columns[all_columns.duplicated()]
14 given_name
15 surname
21 given_name
22 surname
dtype: object
The very first thing to do before any cleaning occurs is to make a copy of each piece of data. All of the cleaning operations will be conducted on this copy so you can still view the original dirty and/or messy dataset later.
patients_clean = patients.copy()
treatments_clean = treatments.copy()
adverse_reactions_clean = adverse_reactions.copy()
It's important to do this upfront so that subsequent data cleaning will not have to be repeated.
treatments
: Missing records (280 instead of 350)Note: the missing treatments
records are stored in a file named treatments_cut.csv
Hint: documentation page for the concat
function used in the solution.
Import the cut treatments into a DataFrame and concatenate it with the original treatments DataFrame.
treatments_cut = pd.read_csv('treatments_cut.csv')
treatments_clean = pd.concat([treatments_clean, treatments_cut],
ignore_index=True)
treatments_clean.shape
(350, 7)
treatments_clean.head()
given_name | surname | auralin | novodra | hba1c_start | hba1c_end | hba1c_change | |
---|---|---|---|---|---|---|---|
0 | veronika | jindrová | 41u - 48u | - | 7.63 | 7.20 | NaN |
1 | elliot | richardson | - | 40u - 45u | 7.56 | 7.09 | 0.97 |
2 | yukitaka | takenaka | - | 39u - 36u | 7.68 | 7.25 | NaN |
3 | skye | gormanston | 33u - 36u | - | 7.97 | 7.62 | 0.35 |
4 | alissa | montez | - | 33u - 29u | 7.78 | 7.46 | 0.32 |
treatments_clean.tail()
given_name | surname | auralin | novodra | hba1c_start | hba1c_end | hba1c_change | |
---|---|---|---|---|---|---|---|
345 | rovzan | kishiev | 32u - 37u | - | 7.75 | 7.41 | 0.34 |
346 | jakob | jakobsen | - | 28u - 26u | 7.96 | 7.51 | 0.95 |
347 | bernd | schneider | 48u - 56u | - | 7.74 | 7.44 | 0.30 |
348 | berta | napolitani | - | 42u - 44u | 7.68 | 7.21 | NaN |
349 | armina | sauvé | 36u - 46u | - | 7.86 | 7.40 | NaN |
treatments
: Missing HbA1c changes and Inaccurate HbA1c changes (leading 4s mistaken as 9s)Note: the "Inaccurate HbA1c changes (leading 4s mistaken as 9s)" observation, which is an accuracy issue and not a completeness issue, is included in this header because it is also fixed by the cleaning operation that fixes the missing "Missing HbA1c changes" observation. Multiple observations in one Define, Code, and Test header occurs multiple times in this notebook.
Recalculate the hba1c_change
column: hba1c_start
minus hba1c_end
.
treatments_clean.hba1c_change = treatments_clean.hba1c_start - treatments_clean.hba1c_end
treatments_clean.hba1c_change.head()
0 0.43
1 0.47
2 0.43
3 0.35
4 0.32
Name: hba1c_change, dtype: float64
treatments_clean.hba1c_change.isnull().sum()
0
After addressing missing data the next logical step is cleaning for tidiness.
This means it's generally best to clean structural issues first, like tidiness, and then clean content issues, like quality.
patients
table contains two variables: phone number and emailHint 1: use regular expressions with pandas' str.extract
method. Here is an amazing regex tutorial. Hint 2: various phone number regex patterns. Hint 3: email address regex pattern, which you might need to modify to distinguish the email from the phone number.
Extract the phone number and email variables from the contact column using regular expressions and pandas' str.extract
method. Drop the contact column when done.
patients_clean['phone_number'] = patients_clean.contact.str.extract('((?:\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})', expand=True)
# [a-zA-Z] to signify emails in this dataset all start and end with letters
patients_clean['email'] = patients_clean.contact.str.extract('([a-zA-Z][a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+[a-zA-Z])', expand=True)
# Note: axis=1 denotes that we are referring to a column, not a row
patients_clean = patients_clean.drop('contact', axis=1)
# Confirm contact column is gone
list(patients_clean)
['patient_id',
'assigned_sex',
'given_name',
'surname',
'address',
'city',
'state',
'zip_code',
'country',
'birthdate',
'weight',
'height',
'bmi',
'phone_number',
'email']
patients_clean.phone_number.sample(25)
251 1234567890
198 +1 (859) 977-9462
136 714-507-4204
236 303-360-3627
342 860 618 4560
308 631-875-3023
436 703-547-0551
163 803-712-1180
478 +1 (415) 755-6435
229 1234567890
153 775 770 7827
29 +1 (845) 858-7707
298 361-533-5161
17 678-829-8578
146 219 391 7729
470 386-234-5932
256 570-797-6666
141 903-644-2098
371 +1 (908) 287-7099
30 303-910-2058
91 713-230-1739
37 +1 (605) 204-6572
327 908-871-7009
303 313 589 1746
161 406-759-6160
Name: phone_number, dtype: object
# Confirm that no emails start with an integer (regex didn't match for this)
patients_clean.email.sort_values().head()
404 AaliyahRice@dayrep.com
11 Abdul-NurMummarIsa@rhyta.com
332 AbelEfrem@fleckens.hu
258 AbelYonatan@teleworm.us
305 AddolorataLombardi@jourrapide.com
Name: email, dtype: object
treatments
table (treatment, start dose and end dose)Hint: use pandas' melt function and str.split()
method. Here is an excellent melt
tutorial.
Melt the auralin and novodra columns to a treatment and a dose column (dose will still contain both start and end dose at this point). Then split the dose column on ' - ' to obtain start_dose and end_dose columns. Drop the intermediate dose column.
treatments_clean = pd.melt(treatments_clean, id_vars=['given_name', 'surname', 'hba1c_start', 'hba1c_end', 'hba1c_change'],
var_name='treatment', value_name='dose')
treatments_clean.head()
given_name | surname | hba1c_start | hba1c_end | hba1c_change | treatment | dose | |
---|---|---|---|---|---|---|---|
0 | veronika | jindrová | 7.63 | 7.20 | 0.43 | auralin | 41u - 48u |
1 | elliot | richardson | 7.56 | 7.09 | 0.47 | auralin | - |
2 | yukitaka | takenaka | 7.68 | 7.25 | 0.43 | auralin | - |
3 | skye | gormanston | 7.97 | 7.62 | 0.35 | auralin | 33u - 36u |
4 | alissa | montez | 7.78 | 7.46 | 0.32 | auralin | - |
treatments_clean = treatments_clean[treatments_clean.dose != "-"]
treatments_clean.head()
given_name | surname | hba1c_start | hba1c_end | hba1c_change | treatment | dose | |
---|---|---|---|---|---|---|---|
0 | veronika | jindrová | 7.63 | 7.20 | 0.43 | auralin | 41u - 48u |
3 | skye | gormanston | 7.97 | 7.62 | 0.35 | auralin | 33u - 36u |
6 | sophia | haugen | 7.65 | 7.27 | 0.38 | auralin | 37u - 42u |
7 | eddie | archer | 7.89 | 7.55 | 0.34 | auralin | 31u - 38u |
9 | asia | woźniak | 7.76 | 7.37 | 0.39 | auralin | 30u - 36u |
treatments_clean[['dose_start','dose_end']] = treatments_clean['dose'].str.split(' - ', 1, expand = True)
treatments_clean
given_name | surname | hba1c_start | hba1c_end | hba1c_change | treatment | dose | dose_start | dose_end | |
---|---|---|---|---|---|---|---|---|---|
0 | veronika | jindrová | 7.63 | 7.20 | 0.43 | auralin | 41u - 48u | 41u | 48u |
3 | skye | gormanston | 7.97 | 7.62 | 0.35 | auralin | 33u - 36u | 33u | 36u |
6 | sophia | haugen | 7.65 | 7.27 | 0.38 | auralin | 37u - 42u | 37u | 42u |
7 | eddie | archer | 7.89 | 7.55 | 0.34 | auralin | 31u - 38u | 31u | 38u |
9 | asia | woźniak | 7.76 | 7.37 | 0.39 | auralin | 30u - 36u | 30u | 36u |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
688 | christopher | woodward | 7.51 | 7.06 | 0.45 | novodra | 55u - 51u | 55u | 51u |
690 | maret | sultygov | 7.67 | 7.30 | 0.37 | novodra | 26u - 23u | 26u | 23u |
694 | lixue | hsueh | 9.21 | 8.80 | 0.41 | novodra | 22u - 23u | 22u | 23u |
696 | jakob | jakobsen | 7.96 | 7.51 | 0.45 | novodra | 28u - 26u | 28u | 26u |
698 | berta | napolitani | 7.68 | 7.21 | 0.47 | novodra | 42u - 44u | 42u | 44u |
350 rows × 9 columns
treatments_clean = treatments_clean.drop('dose', axis=1)
treatments_clean.head()
given_name | surname | hba1c_start | hba1c_end | hba1c_change | treatment | dose_start | dose_end | |
---|---|---|---|---|---|---|---|---|
0 | veronika | jindrová | 7.63 | 7.20 | 0.43 | auralin | 41u | 48u |
3 | skye | gormanston | 7.97 | 7.62 | 0.35 | auralin | 33u | 36u |
6 | sophia | haugen | 7.65 | 7.27 | 0.38 | auralin | 37u | 42u |
7 | eddie | archer | 7.89 | 7.55 | 0.34 | auralin | 31u | 38u |
9 | asia | woźniak | 7.76 | 7.37 | 0.39 | auralin | 30u | 36u |
treatments
tableHint: tutorial for the merge
function used in the solution.
Merge the adverse_reaction column to the treatments
table, joining on given_name and surname.
treatments_clean = pd.merge(treatments_clean, adverse_reactions_clean,
on=['given_name', 'surname'], how='left')
treatments_clean
given_name | surname | hba1c_start | hba1c_end | hba1c_change | treatment | dose_start | dose_end | adverse_reaction | |
---|---|---|---|---|---|---|---|---|---|
0 | veronika | jindrová | 7.63 | 7.20 | 0.43 | auralin | 41u | 48u | NaN |
1 | skye | gormanston | 7.97 | 7.62 | 0.35 | auralin | 33u | 36u | NaN |
2 | sophia | haugen | 7.65 | 7.27 | 0.38 | auralin | 37u | 42u | NaN |
3 | eddie | archer | 7.89 | 7.55 | 0.34 | auralin | 31u | 38u | NaN |
4 | asia | woźniak | 7.76 | 7.37 | 0.39 | auralin | 30u | 36u | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
345 | christopher | woodward | 7.51 | 7.06 | 0.45 | novodra | 55u | 51u | nausea |
346 | maret | sultygov | 7.67 | 7.30 | 0.37 | novodra | 26u | 23u | NaN |
347 | lixue | hsueh | 9.21 | 8.80 | 0.41 | novodra | 22u | 23u | injection site discomfort |
348 | jakob | jakobsen | 7.96 | 7.51 | 0.45 | novodra | 28u | 26u | hypoglycemia |
349 | berta | napolitani | 7.68 | 7.21 | 0.47 | novodra | 42u | 44u | injection site discomfort |
350 rows × 9 columns
patients
table duplicated in treatments
and adverse_reactions
tables and Lowercase given names and surnamesHint: tutorial for merge
function used in the solution and tutorial for drop
function used in the solution.
Adverse reactions table is no longer needed so ignore that part. Isolate the patient ID and names in the patients
table, then convert these names to lower case to join with treatments
. Then drop the given name and surname columns in the treatments table (so these being lowercase isn't an issue anymore).
id_names = patients_clean[['patient_id', 'given_name', 'surname']]
id_names
patient_id | given_name | surname | |
---|---|---|---|
0 | 1 | Zoe | Wellish |
1 | 2 | Pamela | Hill |
2 | 3 | Jae | Debord |
3 | 4 | Liêm | Phan |
4 | 5 | Tim | Neudorf |
... | ... | ... | ... |
498 | 499 | Mustafa | Lindström |
499 | 500 | Ruman | Bisliev |
500 | 501 | Jinke | de Keizer |
501 | 502 | Chidalu | Onyekaozulu |
502 | 503 | Pat | Gersten |
503 rows × 3 columns
id_names.given_name = id_names.given_name.str.lower()
id_names
C:\Users\Mic\anaconda3\lib\site-packages\pandas\core\generic.py:5516: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self[name] = value
patient_id | given_name | surname | |
---|---|---|---|
0 | 1 | zoe | Wellish |
1 | 2 | pamela | Hill |
2 | 3 | jae | Debord |
3 | 4 | liêm | Phan |
4 | 5 | tim | Neudorf |
... | ... | ... | ... |
498 | 499 | mustafa | Lindström |
499 | 500 | ruman | Bisliev |
500 | 501 | jinke | de Keizer |
501 | 502 | chidalu | Onyekaozulu |
502 | 503 | pat | Gersten |
503 rows × 3 columns
id_names.surname = id_names.surname.str.lower()
id_names
C:\Users\Mic\anaconda3\lib\site-packages\pandas\core\generic.py:5516: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self[name] = value
patient_id | given_name | surname | |
---|---|---|---|
0 | 1 | zoe | wellish |
1 | 2 | pamela | hill |
2 | 3 | jae | debord |
3 | 4 | liêm | phan |
4 | 5 | tim | neudorf |
... | ... | ... | ... |
498 | 499 | mustafa | lindström |
499 | 500 | ruman | bisliev |
500 | 501 | jinke | de keizer |
501 | 502 | chidalu | onyekaozulu |
502 | 503 | pat | gersten |
503 rows × 3 columns
treatments_clean = pd.merge(id_names, treatments_clean, on=['given_name', 'surname'])
treatments_clean.head()
patient_id | given_name | surname | hba1c_start | hba1c_end | hba1c_change | treatment | dose_start | dose_end | adverse_reaction | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | zoe | wellish | 7.71 | 7.30 | 0.41 | novodra | 33u | 33u | NaN |
1 | 2 | pamela | hill | 9.53 | 9.10 | 0.43 | novodra | 27u | 29u | NaN |
2 | 4 | liêm | phan | 7.58 | 7.10 | 0.48 | novodra | 43u | 48u | NaN |
3 | 6 | rafael | costa | 7.73 | 7.34 | 0.39 | auralin | 50u | 60u | NaN |
4 | 7 | mary | adams | 7.65 | 7.26 | 0.39 | novodra | 32u | 33u | NaN |
treatments_clean = treatments_clean.drop(['given_name', 'surname'], axis=1)
# Confirm the merge was executed correctly
treatments_clean
patient_id | hba1c_start | hba1c_end | hba1c_change | treatment | dose_start | dose_end | adverse_reaction | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 7.71 | 7.30 | 0.41 | novodra | 33u | 33u | NaN |
1 | 2 | 9.53 | 9.10 | 0.43 | novodra | 27u | 29u | NaN |
2 | 4 | 7.58 | 7.10 | 0.48 | novodra | 43u | 48u | NaN |
3 | 6 | 7.73 | 7.34 | 0.39 | auralin | 50u | 60u | NaN |
4 | 7 | 7.65 | 7.26 | 0.39 | novodra | 32u | 33u | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... |
344 | 495 | 8.90 | 8.59 | 0.31 | novodra | 26u | 24u | NaN |
345 | 497 | 7.71 | 7.35 | 0.36 | auralin | 35u | 38u | NaN |
346 | 499 | 7.92 | 7.60 | 0.32 | novodra | 35u | 33u | NaN |
347 | 500 | 7.72 | 7.39 | 0.33 | auralin | 46u | 53u | NaN |
348 | 502 | 7.54 | 7.27 | 0.27 | novodra | 42u | 41u | NaN |
349 rows × 8 columns
# Patient ID should be the only duplicate column
all_columns = pd.Series(list(patients_clean) + list(treatments_clean))
all_columns[all_columns.duplicated()]
15 patient_id
dtype: object
Hint: see the "Data Cleaning Process" page.
Convert the zip code column's data type from a float to a string using astype
, remove the '.0' using string slicing, and pad four digit zip codes with a leading 0.
patients_clean.zip_code = patients_clean.zip_code.astype(str).str[:-2].str.pad(5, fillchar='0')
# Reconvert NaNs entries that were converted to '0000n' by code above
patients_clean.zip_code = patients_clean.zip_code.replace('0000n', np.nan)
patients_clean.zip_code.head()
0 92390
1 61812
2 68467
3 07095
4 36303
Name: zip_code, dtype: object
Replace height for rows in the patients
table that have a height of 27 in (there is only one) with 72 in.
patients_clean.height = patients_clean.height.replace(27, 72)
# Should be empty
patients_clean[patients_clean.height == 27]
patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | birthdate | weight | height | bmi | phone_number |
---|
# Confirm the replacement worked
patients_clean[patients_clean.surname == 'Neudorf']
patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | birthdate | weight | height | bmi | phone_number | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 5 | male | Tim | Neudorf | 1428 Turkey Pen Lane | Dothan | AL | 36303 | United States | 2/18/1928 | 192.3 | 72 | 26.1 | 334-515-7487 | TimNeudorf@cuvox.de |
Hint: tutorial for apply
method used in solution.
Apply a function that converts full state name to state abbreviation for California, New York, Illinois, Florida, and Nebraska.
# Mapping from full state name to abbreviation
state_abbrev = {'California': 'CA',
'New York': 'NY',
'Illinois': 'IL',
'Florida': 'FL',
'Nebraska': 'NE'}
# Function to apply
def abbreviate_state(patient): # patient refers to df to which the function is applied
if patient['state'] in state_abbrev.keys():
abbrev = state_abbrev[patient['state']]
return abbrev
else:
return patient['state']
patients_clean['state'] = patients_clean.apply(abbreviate_state, axis=1)
patients_clean.state.value_counts()
CA 60
NY 47
TX 32
IL 24
FL 22
MA 22
PA 18
GA 15
OH 14
MI 13
OK 13
LA 13
NJ 12
VA 11
WI 10
MS 10
AL 9
TN 9
IN 9
MN 9
NC 8
KY 8
WA 8
MO 7
NE 6
KS 6
ID 6
NV 6
SC 5
IA 5
CT 5
RI 4
ND 4
AR 4
AZ 4
ME 4
CO 4
MD 3
DE 3
SD 3
WV 3
OR 3
VT 2
MT 2
DC 2
AK 1
NM 1
NH 1
WY 1
Name: state, dtype: int64
Replace given name for rows in the patients
table that have a given name of 'Dsvid' with 'David'.
patients_clean.given_name = patients_clean.given_name.replace('Dsvid', 'David')
patients_clean[patients_clean.surname == 'Gustafsson']
patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | birthdate | weight | height | bmi | phone_number | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8 | 9 | male | David | Gustafsson | 1790 Nutter Street | Kansas City | MO | 64105 | United States | 3/6/1937 | 163.9 | 66 | 26.5 | 816-265-9578 | DavidGustafsson@armyspy.com |
Hint: documentation page for astype
method used in solution, documentation page for .to_datetime
function used in the solution, and documentation page for .str.strip
method used in the solution.
Convert assigned sex and state to categorical data types. Zip code data type was already addressed above. Convert birthdate to datetime data type. Strip the letter 'u' in start dose and end dose and convert those columns to data type integer.
# To category
patients_clean.assigned_sex = patients_clean.assigned_sex.astype('category')
patients_clean.state = patients_clean.state.astype('category')
# To datetime
patients_clean.birthdate = pd.to_datetime(patients_clean.birthdate)
# Strip u and to integer
treatments_clean.dose_start = treatments_clean.dose_start.str.strip('u').astype(int)
treatments_clean.dose_end = treatments_clean.dose_end.str.strip('u').astype(int)
patients_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 patient_id 503 non-null int64
1 assigned_sex 503 non-null category
2 given_name 503 non-null object
3 surname 503 non-null object
4 address 491 non-null object
5 city 491 non-null object
6 state 491 non-null category
7 zip_code 491 non-null object
8 country 491 non-null object
9 birthdate 503 non-null datetime64[ns]
10 weight 503 non-null float64
11 height 503 non-null int64
12 bmi 503 non-null float64
13 phone_number 491 non-null object
14 email 491 non-null object
dtypes: category(2), datetime64[ns](1), float64(2), int64(2), object(8)
memory usage: 53.7+ KB
treatments_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 349 entries, 0 to 348
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 patient_id 349 non-null int64
1 hba1c_start 349 non-null float64
2 hba1c_end 349 non-null float64
3 hba1c_change 349 non-null float64
4 treatment 349 non-null object
5 dose_start 349 non-null int32
6 dose_end 349 non-null int32
7 adverse_reaction 35 non-null object
dtypes: float64(3), int32(2), int64(1), object(2)
memory usage: 21.8+ KB
Hint: helpful Stack Overflow answer.
Strip all " ", "-", "(", ")", and "+" and store each number without any formatting. Pad the phone number with a 1 if the length of the number is 10 digits (we want country code).
patients_clean.phone_number = patients_clean.phone_number.str.replace(r'\D+', '').str.pad(11, fillchar='1')
C:\Users\Mic\AppData\Local\Temp/ipykernel_13776/3922059896.py:1: FutureWarning: The default value of regex will change from True to False in a future version.
patients_clean.phone_number = patients_clean.phone_number.str.replace(r'\D+', '').str.pad(11, fillchar='1')
patients_clean.phone_number.head()
0 19517199170
1 12175693204
2 14023636804
3 17326368246
4 13345157487
Name: phone_number, dtype: object
Recall that it is assumed that the data that this John Doe data displaced is not recoverable.
Remove the non-recoverable John Doe records from the patients
table.
patients_clean = patients_clean[patients_clean.surname != 'Doe']
# Should be no Doe records
patients_clean.surname.value_counts()
Jakobsen 3
Taylor 3
Aranda 2
Tucker 2
Souza 2
..
Casárez 1
Mata 1
Pospíšil 1
Rukavina 1
Onyekaozulu 1
Name: surname, Length: 465, dtype: int64
# Should be no 123 Main Street records
patients_clean.address.value_counts()
2778 North Avenue 2
2476 Fulton Street 2
648 Old Dear Lane 2
576 Brown Bear Drive 1
2272 Williams Avenue 1
..
1066 Goosetown Drive 1
4291 Patton Lane 1
4643 Reeves Street 1
174 Lost Creek Road 1
3652 Boone Crockett Lane 1
Name: address, Length: 482, dtype: int64
Remove the Jake Jakobsen, Pat Gersten, and Sandy Taylor rows from the patients
table. These are the nicknames, which happen to also not be in the treatments
table (removing the wrong name would create a consistency issue between the patients
and treatments
table). These are all the second occurrence of the duplicate. These are also the only occurences of non-null duplicate addresses.
# tilde means not: http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
patients_clean = patients_clean[~((patients_clean.address.duplicated()) & patients_clean.address.notnull())]
patients_clean[patients_clean.surname == 'Jakobsen']
patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | birthdate | weight | height | bmi | phone_number | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
24 | 25 | male | Jakob | Jakobsen | 648 Old Dear Lane | Port Jervis | NY | 12771 | United States | 1985-08-01 | 155.8 | 67 | 24.4 | 18458587707 | JakobCJakobsen@einrot.com |
432 | 433 | female | Karen | Jakobsen | 1690 Fannie Street | Houston | TX | 77020 | United States | 1962-11-25 | 185.2 | 67 | 29.0 | 19792030438 | KarenJakobsen@jourrapide.com |
patients_clean[patients_clean.surname == 'Gersten']
patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | birthdate | weight | height | bmi | phone_number | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
97 | 98 | male | Patrick | Gersten | 2778 North Avenue | Burr | NE | 68324 | United States | 1954-05-03 | 138.2 | 71 | 19.3 | 14028484923 | PatrickGersten@rhyta.com |
patients_clean[patients_clean.surname == 'Taylor']
patient_id | assigned_sex | given_name | surname | address | city | state | zip_code | country | birthdate | weight | height | bmi | phone_number | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
131 | 132 | female | Sandra | Taylor | 2476 Fulton Street | Rainelle | WV | 25962 | United States | 1960-10-23 | 206.1 | 64 | 35.4 | 13044382648 | SandraCTaylor@dayrep.com |
426 | 427 | male | Rogelio | Taylor | 4064 Marigold Lane | Miami | FL | 33179 | United States | 1992-09-02 | 186.6 | 69 | 27.6 | 13054346299 | RogelioJTaylor@teleworm.us |
Use advanced indexing to isolate the row where the surname is Zaitseva and convert the entry in its weight field from kg to lbs.
weight_kg = patients_clean.weight.min()
mask = patients_clean.surname == 'Zaitseva'
column_name = 'weight'
patients_clean.loc[mask, column_name] = weight_kg * 2.20462
patients_clean.weight[patients_clean.surname == 'Zaitseva'] = patients_clean.weight.min() * 2.20462
C:\Users\Mic\AppData\Local\Temp/ipykernel_13776/815172696.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
patients_clean.weight[patients_clean.surname == 'Zaitseva'] = patients_clean.weight.min() * 2.20462
# 48.8 shouldn't be the lowest anymore
patients_clean.weight.sort_values()
459 102.1
335 102.7
74 103.2
317 106.0
171 106.5
...
61 244.9
144 244.9
283 245.5
118 254.5
485 255.9
Name: weight, Length: 494, dtype: float64