Machine Learning – Police Shootings
Lonell Childred – Final Project
(Police Shootings Dataset)
My background for choosing this topic comes from my personal experience in dealing with the police, and I will analyse the dataset using machine learning models to further my study. This is the background for my data story.
This is an excerpt that I posted on Facebook right after the killing of George Floyd in May 2020. “I am extremely disturbed and saddened by what is happening in our nation tonight. In fact, I haven’t had much sleep in the last 2 nights. I do not endorse the violence and destruction of property, but at the same time as a black man in America I understand completely the anger and frustration that black and brown people in our nation have and are experiencing.
We just saw this morning Omar Jimenez from CNN get arrested for NO reason and was told NO reason for his arrest when he asked. Unfortunately, I am in the same club. I have never shared this, but in my lifetime I have been falsely arrested 3 times and had an Ohio State Trooper pull out a firearm on me once, and I have NO criminal record, nor have I committed any crimes. I used to drive a green BMW and was pulled over and harassed by the police, asking if this was my car, how did it get this expensive BMW and do I have any drugs or guns in my car. My response was that I actually am a IT Professional and a member of the Screen Actor’s Guild (SAG-AFTRA). The police said that they had probable cause to arrest me, my car was towed and I was taken to the local sheriff’s department, stripped down butt naked and then taken downtown Cincinnati where I was thrown in jail for the weekend. I was not told what crime I had committed or probable cause to be arrested. After having a client of mine bail be out of jail, I hired an attorney and threatened to sue the city. Then, magically I started getting all types of apologies from the police department and of course the arrest was cleared from my record.
In another incident, which was mistaken identity, which I proved to the police I was not the person that they were looking for; however, this time I was handcuffed and ruffed up by 2 police officers, slammed onto the police car and arrested. I filed a complaint against the police officers and had them internally investigated. This time I received a written apology from the police department and the charges were dropped.
The point is that I am very upset, because just like George Floyd, I complied 100% with the police and he is now dead. This could have easily happened to me. As an Orthodox Christian man, I am praying for peace, and positive changes in our nation, and I am asking everyone who sees this message to do the same. GOD Bless Us All.”
https://www.facebook.com/lonell/posts/10221100403238729
Today, 4.20.2021, we saw the conviction of Derek Chauvin on all 3 counts for the murder of George Floyd. Previously, a $27 million civil settlement was awarded to the Floyd family.
This alone is proof of the racism, bias and disparity against black and brown minorities in the United States.
Read & Convert Data for Statistical Analysis.
I would like to primarily compare cause of death based on race and age. Another way to say this is white v. non-white deaths by the police.
The Labels are as follows:
Gender
M = Male
F = Female
None = Unknown
Race
W = White, non-Hispanic
B = Black, non-Hispanic
A = Asian
N = Native American
H = Hispanic
O = Other
None = Unknown
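The label codes above can be captured in a small lookup, so that decoding touches only the race values and unknown entries are handled explicitly. This is a minimal sketch; `RACE_LABELS` and `decode_race` are hypothetical names that do not appear in the original notebook.

```python
import math

# Race codes as listed above; RACE_LABELS and decode_race are illustrative names.
RACE_LABELS = {
    'W': 'White',
    'B': 'Black',
    'A': 'Asian',
    'N': 'Native American',
    'H': 'Hispanic',
    'O': 'Other',
}

def decode_race(code):
    """Return the readable label for a race code, or 'Unknown' when missing."""
    # Missing values may arrive as None or as a float NaN.
    if code is None or (isinstance(code, float) and math.isnan(code)):
        return 'Unknown'
    return RACE_LABELS.get(code, 'Unknown')

print(decode_race('H'))
print(decode_race(None))
```

On a plain (non-categorical) pandas column, the same function could be applied with something like `data_df['race'].map(decode_race)`, which leaves every other column untouched.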
#import vital packages for proper analysis
import time
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from pandas import Series, DataFrame
import pandas as pd
from scipy import stats
%matplotlib inline
# Read the Police Shooting data set
data_df=pd.read_csv('/home/mint/fatal-police-shootings-data.csv')
# Print the shape & Head of the data frame
print(data_df.shape)
print(data_df.head(6))
(5416, 14)
   id                name        date   manner_of_death       armed   age  \
0   3          Tim Elliot  2015-01-02              shot         gun  53.0
1   4    Lewis Lee Lembke  2015-01-02              shot         gun  47.0
2   5  John Paul Quintero  2015-01-03  shot and Tasered     unarmed  23.0
3   8     Matthew Hoffman  2015-01-04              shot  toy weapon  32.0
4   9   Michael Rodriguez  2015-01-04              shot    nail gun  39.0
5  11   Kenneth Joe Brown  2015-01-04              shot         gun  18.0

  gender race           city state  signs_of_mental_illness threat_level  \
0      M    A        Shelton    WA                     True       attack
1      M    W          Aloha    OR                    False       attack
2      M    H        Wichita    KS                    False        other
3      M    W  San Francisco    CA                     True       attack
4      M    H          Evans    CO                    False       attack
5      M    W        Guthrie    OK                    False       attack

          flee  body_camera
0  Not fleeing        False
1  Not fleeing        False
2  Not fleeing        False
3  Not fleeing        False
4  Not fleeing        False
5  Not fleeing        False
#Properly assigning categorical records as a category
data_df.id = data_df.id.astype('category')
data_df.armed = data_df.armed.astype('category')
data_df.gender = data_df.gender.astype('category')
data_df.city = data_df.city.astype('category')
data_df.state = data_df.state.astype('category')
data_df.race = data_df.race.astype('category')
data_df.threat_level = data_df.threat_level.astype('category')
data_df.flee = data_df.flee.astype('category')
data_df.manner_of_death = data_df.manner_of_death.astype('category')
#Properly naming each one of the races, to facilitate analysis and comprehension in visualizations
data_df.replace(to_replace = ['A'], value = ['Asian'], inplace = True)
data_df.replace(to_replace = ['B'], value = ['Black'], inplace = True)
data_df.replace(to_replace = ['H'], value = ['Hispanic'], inplace = True)
data_df.replace(to_replace = ['N'], value = ['Native American'], inplace = True)
data_df.replace(to_replace = ['O'], value = ['Other'], inplace = True)
data_df.replace(to_replace = ['W'], value = ['White'], inplace = True)
data_df['month'] = pd.to_datetime(data_df['date']).dt.month
data_df['year'] = pd.to_datetime(data_df['date']).dt.year
data_df.head()
index | id | name | date | manner_of_death | armed | age | gender | race | city | state | signs_of_mental_illness | threat_level | flee | body_camera | month | year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | Tim Elliot | 2015-01-02 | shot | gun | 53.0 | M | Asian | Shelton | WA | True | attack | Not fleeing | False | 1 | 2015 |
1 | 4 | Lewis Lee Lembke | 2015-01-02 | shot | gun | 47.0 | M | White | Aloha | OR | False | attack | Not fleeing | False | 1 | 2015 |
2 | 5 | John Paul Quintero | 2015-01-03 | shot and Tasered | unarmed | 23.0 | M | Hispanic | Wichita | KS | False | other | Not fleeing | False | 1 | 2015 |
3 | 8 | Matthew Hoffman | 2015-01-04 | shot | toy weapon | 32.0 | M | White | San Francisco | CA | True | attack | Not fleeing | False | 1 | 2015 |
4 | 9 | Michael Rodriguez | 2015-01-04 | shot | nail gun | 39.0 | M | Hispanic | Evans | CO | False | attack | Not fleeing | False | 1 | 2015 |
Data Visualizations based on the data set.
The Research Question is as follows:
Is there RACIAL BIAS that shows that Non-White Americans are shot and killed by the police more often than White Americans based on age?
# In order to facilitate our analysis, and understand if there is racial bias in shootings, we will create categories for the following
# Armed = Will be categorized into Armed and Unarmed
# Fleeing = Will be categorized into Fleeing and Not Fleeing
# ARMED CATEGORY - BUCKET
# Note: the string 'NaN' does not match truly missing values (np.nan) in the 'armed' column
UnavailableUndetermined = ['NaN','undetermined',]
Unarmed = ['unarmed']
Armed = ['gun',
'toy weapon',
'nail gun',
'knife',
'shovel',
'hammer',
'hatchet',
'sword',
'machete',
'box cutter',
'metal object',
'screwdriver',
'lawn mower blade',
'flagpole',
'guns and explosives',
'cordless drill',
'crossbow',
'metal pole',
'Taser',
'metal pipe',
'metal hand tool',
'blunt object',
'metal stick',
'sharp object',
'meat cleaver',
'carjack',
'chain',
"contractor's level",
'unknown weapon',
'stapler',
'beer bottle',
'bean-bag gun',
'baseball bat and fireplace poker',
'straight edge razor',
'gun and knife',
'ax',
'brick',
'baseball bat',
'hand torch',
'chain saw',
'garden tool',
'scissors',
'pole',
'pick-axe',
'flashlight',
'vehicle',
'baton',
'spear',
'chair',
'pitchfork',
'hatchet and gun',
'rock',
'piece of wood',
'bayonet',
'pipe',
'glass shard',
'motorcycle',
'pepper spray',
'metal rake',
'crowbar',
'oar',
'machete and gun',
'tire iron',
'air conditioner',
'pole and knife',
'baseball bat and bottle',
'fireworks',
'pen',
'chainsaw',
'gun and sword',
'gun and car',
'pellet gun',
'claimed to be armed',
'BB gun',
'incendiary device',
'samurai sword',
'bow and arrow',
'gun and vehicle',
'vehicle and gun',
'wrench',
'walking stick',
'barstool',
'grenade',
'BB gun and vehicle',
'wasp spray',
'air pistol',
'Airsoft pistol',
'baseball bat and knife',
'vehicle and machete',
'ice pick',
'car, knife and mace']
df_UnavailableUndetermined = pd.DataFrame({'armed': UnavailableUndetermined})
df_UnavailableUndetermined ['category'] = 'Unavailable_Undetermined'
df_Unarmed = pd.DataFrame({'armed': Unarmed})
df_Unarmed ['category'] = 'Unarmed'
df_Armed = pd.DataFrame({'armed': Armed})
df_Armed ['category'] = 'Armed'
# Stack the three lookup tables (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df_lookup = pd.concat([df_Armed, df_Unarmed, df_UnavailableUndetermined])
df2 = pd.merge(data_df, df_lookup, on = 'armed', how = 'outer' )
df2 = df2.rename({'category':'armed_category'}, axis = 1)
df2.head()
#df2.armed_category.value_counts(normalize = True)
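An alternative to listing every weapon by hand is a rule-based bucket. The sketch below assumes, as the lookup above does, that anything other than 'unarmed', 'undetermined', or a missing value counts as armed; `bucket_armed` is a hypothetical helper, not part of the original notebook.

```python
import math

def bucket_armed(value):
    """Map a raw 'armed' value to one of the three buckets used above."""
    # Missing values arrive as None or float NaN when the CSV is read by pandas.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 'Unavailable_Undetermined'
    if value == 'undetermined':
        return 'Unavailable_Undetermined'
    if value == 'unarmed':
        return 'Unarmed'
    # Assumption: every other description (gun, knife, toy weapon, ...) is treated as armed.
    return 'Armed'

for raw in ['gun', 'unarmed', 'undetermined', float('nan'), 'toy weapon']:
    print(raw, '->', bucket_armed(raw))
```

Unlike the lookup merge, this approach would also label truly missing values and any weapon description that is absent from the hand-written list.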
# FLEE CATEGORY - BUCKET
Fleeing = ['Car', 'Foot', 'Other']
NotFleeing = ['Not fleeing']
FleeLookUp2 = pd.DataFrame({'flee': Fleeing})
FleeLookUp2['flee_category'] = "Fleeing"
FleeLookUp1 = pd.DataFrame({'flee': NotFleeing})
FleeLookUp1['flee_category'] = "Not_Fleeing"
FleeLookUp = pd.concat([FleeLookUp1, FleeLookUp2])
#FleeLookUp.head()
df3 = pd.merge(df2,FleeLookUp,how='outer', on = 'flee')
#df3.head()
#df3.flee_category.value_counts(normalize=True)
#df3.race.value_counts(normalize=True)
# The majority of deaths involve 3 racial groups: White, Black and Hispanic
df3.race.value_counts(normalize=True).plot(kind='pie', figsize = (8,8))
plt.title('Deaths by Race\nNormalized Data')
Text(0.5, 1.0, 'Deaths by Race\nNormalized Data')
Important Racial Statistics
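White Americans make up 63.4% of the U.S. population, while Latinos and Black Americans make up 15% and 13.4%, respectively. If a group's share of deaths is larger than its share of the population, that group is dying at a disproportionate rate; on this basis, we can infer that Latinos and African Americans die more often relative to their own population.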
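The comparison described above can be made explicit by dividing each group's share of deaths by its share of the population. The following is a minimal sketch under stated assumptions: the population shares are the ones quoted in the text, `disparity_ratios` is a hypothetical helper, and the toy list is made-up data rather than the real dataset.

```python
from collections import Counter

# Population shares (in percent) as quoted in the text above.
POPULATION_SHARE = {'White': 63.4, 'Hispanic': 15.0, 'Black': 13.4}

def disparity_ratios(races, population_share=POPULATION_SHARE):
    """Return share of deaths divided by share of population for each group.

    A ratio above 1 means the group is over-represented among deaths.
    """
    known = [r for r in races if isinstance(r, str)]  # skip missing values
    counts = Counter(known)
    total = len(known)
    return {
        group: (counts[group] / total) / (pct / 100)
        for group, pct in population_share.items()
    }

# Made-up toy data for illustration only, NOT the real dataset.
toy_races = ['White'] * 5 + ['Black'] * 3 + ['Hispanic'] * 2
for group, ratio in disparity_ratios(toy_races).items():
    print(f"{group}: {ratio:.2f}")
```

With the real data, passing the race column (for example `df3['race']`) in place of the toy list would give the ratios for the actual deaths.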
# Top States
top_states = df3['state'].value_counts()[:10]
plt.barh(top_states.index, top_states.values)
plt.xlabel("Frequency")
plt.ylabel("State")
plt.title("Top 10 States with Most Fatal Police Shootings")
# Sum by percentage
# we can see that the top 10 states in the US account for 53.32%
# of all deaths in the US. Might be worth focusing on these states to look for trends
df3.state.value_counts(normalize=True)[:10].sum()
0.5332348596750369
# Listing Specifically Black vs. White
RaceList = ['White', 'Black']
df3_race = df3[df3.race.isin(RaceList)]
df3_race.race.unique()
CityList = ['Los Angeles','Phoenix','Houston','Las Vegas','San Antonio','Columbus','Chicago','Albuquerque','Kansas City','Jacksonville']
df3_race_city = df3_race[df3_race.city.isin(CityList)]
df3_race_city.city.unique()
df3_race_city.groupby('race').city.value_counts(normalize=True).unstack().plot(kind='bar', figsize=(18,8))
plt.title('Deaths Per Race and City')
plt.ylabel('% of Total Deaths per Race')
Text(0, 0.5, '% of Total Deaths per Race')
# Adding a correlation matrix (corrmat) as well as a heatmap
# Only numeric and boolean columns can be correlated, so select them first
corrmat = data_df.select_dtypes(include=['number', 'bool']).corr()
print(corrmat)
sns.set()
f, ax = plt.subplots(figsize=(12,8))
sns.heatmap(corrmat, vmax=.8, square=True, annot=True)
                              age  signs_of_mental_illness  body_camera  \
age                      1.000000                 0.105763    -0.040138
signs_of_mental_illness  0.105763                 1.000000     0.051838
body_camera             -0.040138                 0.051838     1.000000
month                    0.011028                -0.027029     0.011036
year                     0.035409                -0.079972     0.018592

                            month      year
age                      0.011028  0.035409
signs_of_mental_illness -0.027029 -0.079972
body_camera              0.011036  0.018592
month                    1.000000 -0.144633
year                    -0.144633  1.000000
<AxesSubplot:>
data_df.race.unique()
array(['Asian', 'White', 'Hispanic', 'Black', 'Other', nan, 'Native American'], dtype=object)
# Here I add two columns to the df3 dataframe, named white and non_white.
# I map race to 1/0 indicators: white is 1 for White and 0 otherwise; non_white is the reverse.
# Finally, I drop rows with missing values and display the head of df3 to confirm the mapping.
df3['white'] = df3.race.map({'Asian':0,'White':1,'Hispanic':0,'Black':0,'Other':0,'Native American':0})
df3['non_white'] = df3.race.map({'Asian':1,'White':0,'Hispanic':1,'Black':1,'Other':1,'Native American':1})
df3.dropna(inplace=True)
df3.head()
index | id | name | date | manner_of_death | armed | age | gender | race | city | state | signs_of_mental_illness | threat_level | flee | body_camera | month | year | armed_category | flee_category | white | non_white |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | Tim Elliot | 2015-01-02 | shot | gun | 53.0 | M | Asian | Shelton | WA | True | attack | Not fleeing | False | 1.0 | 2015.0 | Armed | Not_Fleeing | 0.0 | 1.0 |
1 | 4 | Lewis Lee Lembke | 2015-01-02 | shot | gun | 47.0 | M | White | Aloha | OR | False | attack | Not fleeing | False | 1.0 | 2015.0 | Armed | Not_Fleeing | 1.0 | 0.0 |
2 | 11 | Kenneth Joe Brown | 2015-01-04 | shot | gun | 18.0 | M | White | Guthrie | OK | False | attack | Not fleeing | False | 1.0 | 2015.0 | Armed | Not_Fleeing | 1.0 | 0.0 |
3 | 15 | Brock Nichols | 2015-01-06 | shot | gun | 35.0 | M | White | Assaria | KS | False | attack | Not fleeing | False | 1.0 | 2015.0 | Armed | Not_Fleeing | 1.0 | 0.0 |
4 | 21 | Ron Sneed | 2015-01-07 | shot | gun | 31.0 | M | Black | Freeport | TX | False | attack | Not fleeing | False | 1.0 | 2015.0 | Armed | Not_Fleeing | 0.0 | 1.0 |
# Display the scatter plot of Age v. Non-white deaths from the updated df3 dataframe
x = df3['age']
y = df3['non_white']
plt.scatter(x,y, marker='+',color='red')
plt.xlabel("Age")
plt.ylabel("Non-White Deaths")
Text(0, 0.5, 'Non-White Deaths')
# I believe the Linear Regression Model used here should really be a Logistic Regression Model, because the target is 0/1.
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    return slope * x + intercept
mymodel = list(map(myfunc, x))
plt.scatter(x, y, marker='+',color='red')
plt.xlabel("Age")
plt.ylabel("Non-White Deaths")
plt.plot(x, mymodel)
[<matplotlib.lines.Line2D at 0x7fb8934a0f40>]
Machine Learning Models.
# I am using a Logistic Regression Model to train and test against age.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df3[['age']],df3.non_white,test_size=0.7)
X_test
index | age |
---|---|
1915 | 42.0 |
2816 | 17.0 |
4900 | 40.0 |
5093 | 43.0 |
4870 | 27.0 |
… | … |
2783 | 20.0 |
3065 | 40.0 |
3134 | 48.0 |
932 | 52.0 |
3280 | 40.0 |
3080 rows × 1 columns
X_train
index | age |
---|---|
1050 | 55.0 |
2433 | 23.0 |
2383 | 46.0 |
238 | 40.0 |
2299 | 35.0 |
… | … |
2297 | 44.0 |
3529 | 40.0 |
1508 | 28.0 |
1699 | 27.0 |
747 | 57.0 |
1319 rows × 1 columns
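The row counts above follow from `test_size=0.7`, which holds out 70% of the rows for testing and trains on the remaining 30%. The arithmetic can be checked with the sketch below, which assumes that scikit-learn rounds the test count up for a fractional `test_size`; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    # Assumption: the test count is rounded up and the rest go to training.
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# 1319 + 3080 = 4399 rows remained in df3 after dropna
n_train, n_test = split_sizes(4399, 0.7)
print(n_train, n_test)
```

A more common choice is `test_size=0.3`, which would train on roughly 70% of the rows instead of 30%.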
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train,y_train)
LogisticRegression()
model.predict(X_test)
array([0., 1., 0., ..., 0., 0., 0.])
# Accuracy of my model's predictions on X_test
model.score(X_test,y_test)
0.6126623376623377
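An accuracy score is easier to interpret next to a simple baseline: the accuracy obtained by always predicting the most common class. The sketch below shows that calculation; `majority_baseline_accuracy` is a hypothetical helper and the labels are made up for illustration.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Made-up labels for illustration only (1 = non-white, 0 = white).
toy_labels = [0, 0, 0, 1, 1, 0, 1, 0, 0, 1]
print(majority_baseline_accuracy(toy_labels))  # 6 of the 10 labels are 0
```

Running it on `list(y_test)` would show how much of the score above is explained by class balance alone and how much is contributed by age.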
model.predict_proba(X_test)
array([[0.57123091, 0.42876909],
       [0.31489565, 0.68510435],
       [0.5502643 , 0.4497357 ],
       ...,
       [0.63234362, 0.36765638],
       [0.67096591, 0.32903409],
       [0.5502643 , 0.4497357 ]])
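Each row of `predict_proba` is [P(white), P(non-white)] for one age in `X_test`, and the second column is the logistic (sigmoid) function of `intercept + coefficient * age`. The sketch below uses rough coefficients back-calculated from the printed probabilities for ages 17 and 42; they are approximations for illustration, not the values stored in `model.coef_` and `model.intercept_`.

```python
import math

# Approximate values back-calculated from the printed output (ages 17 and 42);
# the fitted model's own coefficients may differ slightly.
INTERCEPT = 1.501
SLOPE = -0.04257

def prob_non_white(age):
    """Logistic model: P(non_white = 1 | age) = sigmoid(intercept + slope * age)."""
    z = INTERCEPT + SLOPE * age
    return 1.0 / (1.0 + math.exp(-z))

for age in (17, 40, 48):
    print(age, round(prob_non_white(age), 3))

# Age at which the predicted probability crosses 50%
print(round(-INTERCEPT / SLOPE, 1))
```

Because the slope is negative, the predicted probability that a victim is non-white falls steadily as age increases.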
Lonell Childred – Results & Conclusions.
Based on the data from my research and my personal experience, it is clear that there is racial bias in police shootings and police brutality in the United States. Based on age, my logistic regression model predicts that younger victims are more likely to be non-white: the predicted probability of a victim being non-white is about 69% at age 17 and declines with age, dropping to about 45% at age 40. I deleted rows with NaN/blank data from my dataframe as needed, and I chose to leave most columns, such as name, intact for future ML projects. I added two columns for analysis, white and non_white, which I converted to numeric 0/1 to help with my Machine Learning analysis and Logistic Regression modeling. My Machine Learning model has an accuracy of 61.26% on the test data (X_test), and I also list the predicted probabilities for X_test. This class and final project have made it very clear how important Data Science is and that Machine Learning will become more important both now and in the future.
Code References:
https://www.facebook.com/lonell/posts/10221100403238729
https://www.kaggle.com/gusvalicente/is-the-police-killing-minorities
https://www.kaggle.com/andle1/kernel2cdb30105f
https://www.kaggle.com/sameensalam/police-shooting-analysis
https://www.kaggle.com/mrinaal007/police-shootouts
https://www.kaggle.com/umerkk12/police-shooting-analysis
https://academic.oup.com/policing/advance-article/doi/10.1093/police/paz035/5518992
https://www.tutorialspoint.com/python_pandas/python_pandas_merging_joining.htm
https://blog.patricktriest.com/police-data-python/
https://re-thought.com/how-to-add-new-columns-in-a-dataframe-in-pandas/
https://www.w3schools.com/python/python_ml_getting_started.asp