Using a daily updated global pandemic data set, we seek to understand how COVID-19 is affecting people of various ages, to predict the number of deaths, and to determine the effect of GDP on pandemic response and vice versa. The target population is global, broken down by country. Using descriptive and analytical methods, we aim to answer questions not addressed by other sources. This pandemic has hugely disrupted world economies, and gleaning meaningful insights by applying the data science strategies we have learned so far to relevant data can help us identify changing trends or outlier observations and be better prepared in these challenging times.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
import matplotlib.pylab as plt
from pandas.plotting import scatter_matrix
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score,GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree
import sklearn.ensemble as ske
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import mean_squared_error, r2_score, confusion_matrix, classification_report
from sklearn import metrics
import seaborn as sns
import datetime as dt
import random
import scipy.stats as stats
%matplotlib inline
sns.set()
from pandas.api.types import CategoricalDtype
from datetime import timedelta
from statsmodels.tsa.api import Holt
import statsmodels.formula.api as sm
# Install plotnine if not available
!pip install plotnine
from plotnine import *
# Import warning to avoid warnings appearing on the displayed notebook
import warnings
warnings.filterwarnings('ignore')
# Setting Max rows and columns for full visibility
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
df = pd.read_csv("../data/owid-covid-data.csv",parse_dates=['date'])
We take a look at the top 10 rows to get an initial idea of the structure of the data we will be working with. Here we can clearly see each column, as well as that the data is broken down by country. We also see that the dataset has a 'date' column, which will act as an index when the data is examined over time.
df.head(10)
df.tail(10)
Looking at the last 10 rows of the dataset shows that there are a few rows tagged as "International" that are not tied to any continent.
Variables and data types can be very helpful when pre-processing data. Knowing this information allows us to handle the data more accurately and correct it more easily.
df.info()
There are 33417 observations and 34 columns where 5 variables are of categorical data type and the remaining 29 variables are of numerical data types.
Looking at the column names, the dataset provides information about the total number of COVID cases, tests, and deaths by continent and by different age brackets. It also has daily information about GDP per capita, life expectancy, cardiovascular death rate, and diabetes prevalence.
As mentioned above, knowing the data types is very useful for handling and correcting the data. Here we convert the columns into data types that we can use more effectively. This is crucial for the date column, which we will use as the time index.
df['date']=df['date'].dt.strftime("%m-%d-%Y")
df['date']= pd.to_datetime(df['date'])
print(df['date'].dtype)
df = df.set_index('date')
df.head(3)
df.describe()
There are some negative numbers in the output above. For example, the new_cases column has a minimum value of -29726. Since this is a count of new cases, it cannot be negative, so this looks like a data error. Other columns with negative values include new_deaths, new_cases_per_million, new_deaths_per_million, new_tests, and new_tests_per_thousand. We will handle these errors later; one possible approach is sketched below.
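As a sketch of one possible way to handle these errors (an assumption on our part, not something applied to the working dataframe: negative daily counts are treated as reporting corrections and clipped to zero):
# Possible cleanup of the negative daily counts flagged above (sketch only):
# treat them as reporting corrections / data errors and clip them at zero.
count_cols = ['new_cases', 'new_deaths', 'new_cases_per_million',
              'new_deaths_per_million', 'new_tests', 'new_tests_per_thousand']
df_clipped = df.copy()
df_clipped[count_cols] = df_clipped[count_cols].clip(lower=0)
print((df_clipped[count_cols] < 0).sum())  # all zeros after clipping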
# Taking a look at the outcome variable: 'total_deaths'
print(df['total_deaths'].value_counts())
df.isna().sum()
There are a total of 33417 observations, and the following variables have more than 60% of their values missing: new_tests, total_tests, total_tests_per_thousand, new_tests_per_thousand, new_tests_smoothed, new_tests_smoothed_per_thousand, and tests_units. The variable handwashing_facilities has 19653 missing values, close to 60% of the data.
These variables should therefore be removed to avoid introducing bias into the modeling. The variable extreme_poverty also has about 40% of its values missing; however, we will keep it for now, as less than half of its values are missing.
df.drop(['new_tests', 'total_tests', 'total_tests_per_thousand', 'new_tests_per_thousand', 'new_tests_smoothed', 'new_tests_smoothed_per_thousand', 'tests_units', 'handwashing_facilities'], axis = 1, inplace = True)
The variables listed below should also be removed, as they are per-million conversions of other columns. In general we would prefer per-million figures over raw counts; however, the per-million variables have more missing values, so we stick with the original variables (a quick check of the relative missingness is sketched after the list).
total_cases_per_million, new_cases_per_million, total_deaths_per_million, and new_deaths_per_million
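As a quick, hedged check of that claim, we can compare the share of missing values in each per-million column against its raw counterpart (a sketch only):
# Compare missingness of the per-million columns with their raw counterparts
pairs = {'total_cases': 'total_cases_per_million',
         'new_cases': 'new_cases_per_million',
         'total_deaths': 'total_deaths_per_million',
         'new_deaths': 'new_deaths_per_million'}
for raw, per_million in pairs.items():
    print(f"{raw}: {df[raw].isna().mean():.2%} missing | "
          f"{per_million}: {df[per_million].isna().mean():.2%} missing")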
df.drop(['total_cases_per_million', 'new_cases_per_million', 'total_deaths_per_million', 'new_deaths_per_million'], axis = 1, inplace = True)
Also, total cases already incorporate new cases and total deaths already incorporate new deaths (the totals are cumulative sums of the daily counts). Therefore, we can remove new_cases and new_deaths from our data to avoid redundancy.
df.drop(['new_cases', 'new_deaths'], axis = 1, inplace = True)
Because population density is derived from population, we remove the population column to avoid redundancy.
df.drop(['population'], axis = 1, inplace = True)
df.describe(include = 'O')
Looking at the categorical variables: some rows have no value for iso_code, the corresponding continent is also empty, and the location is recorded only as "International". This means there is no way to tell which continent these rows belong to, so we should remove them. There are only 64 such rows with missing iso_code, so removing this small amount of data will not affect our model.
df = df.dropna(how='all', subset=['iso_code'])
We now have 33353 observations instead of 33417 observations and 18 variables instead of 34 variables.
df.shape
All the variable names in our current data set are listed below.
df.columns
df.describe()
df.shape
df.isnull().sum()
We can observe that, even in our reduced data, the variable extreme_poverty has almost 40% of its values missing. The better option is therefore to delete this column, as imputing that many values would lead to misleading results.
df.drop(['extreme_poverty'], axis = 1, inplace = True)
df.shape
Before we go deeper into the dataset, it is good to perform some graphical exploratory analysis, as this lets us quickly spot issues with the data.
GDP, or gross domestic product, is a good indication of how a country is faring economically. Here we compare countries' GDP per capita and how it relates to the number of cases as well as to the response to the pandemic.
(ggplot(df, aes(x='gdp_per_capita'))
+ geom_histogram(bins=12,
color ="red",
fill ="orange")
+ labs(title="Histogram for gdp_per_capita", x="gdp_per_capita", y="Count")
)
Like GDP, we look at the median age in a histogram. Here, median_age is the median age of each country's population, so the histogram shows that the most common value in the data is a median population age in the early 40s.
(ggplot(df, aes(x='median_age'))
+ geom_histogram(bins=12,
color ="red",
fill ="orange")
+ labs(title="Histogram for median_age", x="median_age", y="Count")
)
plt.scatter(df['total_cases'], df['total_deaths'], alpha = 0.1)
plt.xlabel("Total Number of Cases")
plt.ylabel("Total Deaths")
plt.title("Total Number of Cases vs Total Deaths")
As you can see, as the number of cases grows, deaths grow as well. The relationship can be influenced by a number of factors; for example, a country could be performing a large number of tests in a short time frame, much like the US.
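To put a rough number on this visual relationship (a minimal sketch; note that both columns are cumulative, which inflates the correlation):
# Pearson correlation between cumulative cases and cumulative deaths
print(df[['total_cases', 'total_deaths']].corr())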
df.hist(bins = 50, figsize = (20,15))
plt.show()
Of the graphs above, we look at hospital beds per thousand. The distribution is concentrated at low values with a long right tail (right skewed), indicating that most countries have relatively few hospital beds per thousand people. In times of a resource crunch, fewer beds and more cases put stress on hospitals and might compromise the health care offered to patients, not just those seeking COVID treatment but those with other conditions as well.
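As a quick numeric check of that skew (a sketch using scipy.stats, already imported above as stats; a positive value indicates a right/positive skew, i.e. most countries report few beds per thousand):
# Skewness of hospital beds per thousand, ignoring missing values
print(stats.skew(df['hospital_beds_per_thousand'].dropna()))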
datewise = df.groupby(["date"]).agg({"total_cases" : "sum", "total_deaths" : "sum"})
print("Total Number of Cases: ", datewise["total_cases"].iloc[-1])
print("Total Number of Deaths: ", datewise["total_deaths"].iloc[-1])
print("Total Number of Active Cases ", (datewise["total_cases"].iloc[-1] - datewise["total_deaths"].iloc[-1]))
# Sampling a more recent window of records to see the recent trend more clearly
# than when looking at 30K+ records together
datewise_sample = datewise[datewise.index.to_series().between('2020-03-30', '2020-06-30')]
plt.figure(figsize = (20, 5))
sns.barplot(x = datewise_sample.index.date, y = datewise_sample["total_cases"] - datewise_sample["total_deaths"])
plt.title("Active Cases")
plt.xticks(rotation = 90)
plt.show()
Sadly, but as expected, the number of cases has been increasing, as you can see above.
plt.figure(figsize = (20, 5))
sns.barplot(x = datewise_sample.index.date, y = datewise_sample["total_deaths"])
plt.title("Total Deaths")
plt.xticks(rotation = 90)
plt.show()
The number of deaths has been increasing as well.
datewise["WeekofYear"] = datewise.index.weekofyear
num_week = []
weekly_cases = []
weekly_deaths = []
w = 1
for i in list(datewise["WeekofYear"].unique()):
    weekly_cases.append(datewise[datewise["WeekofYear"] == i]["total_cases"].iloc[-1])
    weekly_deaths.append(datewise[datewise["WeekofYear"] == i]["total_deaths"].iloc[-1])
    num_week.append(w)
    w = w + 1
plt.figure(figsize = (10, 5))
plt.plot(num_week, weekly_cases, label = "Weekly Cases", linewidth = 3)
plt.plot(num_week, weekly_deaths, label = "Weekly Deaths", linewidth = 3)
plt.xlabel("Number of Week")
plt.ylabel("Number of Cases")
plt.title("Weekly Analysis of Cases")
As you can see, after about 10 weeks the number of cases starts to grow exponentially.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (12, 4))
sns.barplot(x = num_week, y = pd.Series(weekly_cases).diff().fillna(0), ax = ax1)
sns.barplot(x = num_week, y = pd.Series(weekly_deaths).diff().fillna(0), ax = ax2)
ax1.set_xlabel("Number of Week")
ax2.set_xlabel("Number of Week")
ax1.set_ylabel("Total Number of Cases")
ax2.set_ylabel("Total Number of Deaths")
ax1.set_title("Weekly Number of Cases")
ax2.set_title("Weekly Deaths")
plt.show()
Curiously, you can see that there was a fall-off in the weekly number of deaths before it began to rise again.
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
print("Average number of cases increasing everyday: ", np.round(datewise["total_cases"].diff().fillna(0).mean()))
print("Average number of deaths increasing everyday: ", np.round(datewise["total_deaths"].diff().fillna(0).mean()))
plt.figure(figsize = (10, 5))
plt.plot(datewise["total_cases"].diff().fillna(0), label = "Daily increase in total number of cases", linewidth = 3)
plt.plot(datewise["total_deaths"].diff().fillna(0), label = "Daily increase in total number of deaths", linewidth = 3)
plt.xlabel("Daily")
plt.ylabel("Daily Increase")
plt.title("Daily Increase in Cases")
plt.legend()
plt.xticks(rotation = 90)
plt.show()
df.groupby('continent').size()
sns.countplot(x='continent',data=df, palette="OrRd")
plt.xticks(rotation = 90)
plt.show()
The above graph shows that Europe has the largest number of observations in the data.
by_continent = df[df.index == df.index.max()].groupby(["continent"]).agg({"total_cases" : "sum", "total_deaths" : "sum", "gdp_per_capita" : "sum", "population_density" : "sum"})
by_continent["deaths %"] = (by_continent["total_deaths"]/by_continent["total_cases"])*100
by_continent.head()
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (25, 10))
top_total_cases = by_continent.sort_values(["total_cases"], ascending = False).head(10)
top_total_deaths = by_continent.sort_values(["deaths %"], ascending = False).head(10)
sns.barplot(x = top_total_cases["total_cases"], y = top_total_cases.index, ax = ax1)
ax1.set_title("Top Continents - Total Number of Cases")
sns.barplot(x = top_total_deaths["deaths %"], y = top_total_deaths.index, ax = ax2)
ax2.set_title("Top Continents - Total Percentage of Deaths")
plt.show()
From the above charts, it is clearly seen that North America has the highest number of cases and the second highest percentage of deaths. South America ranks third in both total number of cases and total percentage of deaths. Surprisingly, Asia has the second highest number of cases but a low percentage of deaths (only 2.2%). This might be due to several factors; one possible reason is inaccuracy in reporting the number of cases/deaths, and another could be differences in population immunity. Europe, also surprisingly, has relatively few cases but ranks first in total percentage of deaths. Again, several factors could be associated with this; one possible reason is that Europe did not take the same precautions as other continents.
plt.figure(figsize=(10,6))
ax = sns.boxplot(x="continent", y="gdp_per_capita", data=df, showfliers=False)
# Trying to understand any relation between GDP and total deaths
sns.lmplot(y='gdp_per_capita', x='total_deaths', data=df, height=12)
We can observe from the plot above that total deaths are concentrated among countries with lower GDP per capita, although we cannot be very sure: there is a spike close to a GDP per capita of 60,000, which corresponds to richer countries. Still, a significant portion of the deaths falls in the lower-GDP range, which might indicate a disparity in life expectancy between richer and poorer countries.
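To complement this visual impression, a rank-based (Spearman) correlation, which is less sensitive to the extreme GDP values visible in the plot, can be computed as a rough check (sketch only):
# Spearman (rank) correlation between GDP per capita and total deaths
print(df[['gdp_per_capita', 'total_deaths']].corr(method='spearman'))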
The chart below shows that population density is highest in North America, followed by Asia, Europe, Africa, Oceania, and lastly South America. Therefore, even though Europe seems to have the highest number of deaths and North America among the lowest, it is important to consider that North America has a higher population density than Europe.
plt.figure(figsize=(10,6))
ax = sns.boxplot(x="continent", y="population_density", data=df, showfliers=False)
The first box plot shows that Europe and Oceania have the highest GDP per capita of all the continents. Oceania has the second lowest population density, which could contribute to it having the fewest deaths. Europe, on the other hand, is among the richest continents, ranks third in population density, ranks fourth in total number of cases, and still has the highest number of total deaths. So it will be interesting to study Europe among all the continents and to predict its total number of deaths ahead of time. This could help Europe manage the situation better to some extent.
# heatmap for the correlation matrix
plt.figure(figsize=(10, 5))
sns.heatmap(df.corr(), annot=True, cmap='cubehelix_r')
plt.show()
This heatmap shows the correlation between the numerical variables, allowing us to see how they may affect each other. We can observe that median_age, aged_65_older, and aged_70_older are strongly correlated. We will keep only median_age and aged_65_older in our model to avoid redundant, noisy features.
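For reference, the exact pairwise correlations among the three age-related variables can be pulled out of the same correlation matrix (a sketch; values close to 1 justify dropping one of the near-duplicates):
# Pairwise correlations among the age-related variables
age_cols = ['median_age', 'aged_65_older', 'aged_70_older']
print(df[age_cols].corr())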
df.drop(['aged_70_older'], axis = 1, inplace = True)
df.dtypes
df_eu = df[df['continent'] == "Europe"]
df_eu = df_eu.drop("continent", axis=1)
df_eu.head()
datewise_europe = df_eu.groupby(df_eu.index).agg({"total_cases" : "sum", "total_deaths" : "sum"})
print("Total Number of Cases in Europe: ", datewise_europe["total_cases"].iloc[-1])
print("Total Number of Deaths in Europe: ", datewise_europe["total_deaths"].iloc[-1])
print("Total Number of Active Cases in Europe ", (datewise_europe["total_cases"].iloc[-1] - datewise_europe["total_deaths"].iloc[-1]))
plt.figure(figsize=(20,4))
ax = sns.countplot(x ='location',data=df_eu, palette="OrRd")
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.show()
by_location = df_eu[df_eu.index == df_eu.index.max()].groupby(["location"]).agg({"total_cases" : "sum", "total_deaths" : "sum"}).sort_values(["total_cases"], ascending = False)
by_location.head()
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (25, 10))
loc_cases = by_location.sort_values(["total_cases"], ascending = False).head(10)
loc_deaths = by_location.sort_values(["total_deaths"], ascending = False).head(10)
sns.barplot(x = loc_cases["total_cases"], y = loc_cases.index, ax = ax1)
ax1.set_title("Total Number of Cases by Location in Europe")
sns.barplot(x = loc_deaths["total_deaths"], y = loc_deaths.index, ax = ax2)
ax2.set_title("Total Number of Deaths by Location")
plt.show()
The above results show that Russia has the highest number of cases, but the United Kingdom has the highest number of deaths.
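To make that comparison more concrete, a rough case-fatality percentage can be added to the same by_location table, mirroring the deaths % column computed for continents above (a sketch; this ratio is sensitive to differences in testing and reporting):
# Rough case-fatality percentage for the top European countries by total cases
by_location["deaths %"] = (by_location["total_deaths"] / by_location["total_cases"]) * 100
by_location.sort_values("total_cases", ascending=False).head(10)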
# Getting dummy variables
df_eu = pd.get_dummies(df_eu, columns = ['iso_code', 'location'], drop_first = True)
# Describing the numeric columns
df_eu.describe(include = [np.number])
This is another key part of repairing the data set: detecting data points recorded as null and replacing them with the median value of the corresponding variable. There are still several missing values in this dataset, so we need to handle the missing values in the numerical variables before proceeding further. We will use SimpleImputer for this.
df_eu.isnull().sum()
# Impute missing values using Imputer in sklearn.preprocessing
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputer.fit(df_eu)
df_eu = pd.DataFrame(data=imputer.transform(df_eu) , columns=df_eu.columns)
# Assign X as a DataFrame of features and y as a Series of the outcome variable
X = df_eu.drop('total_deaths', axis=1)
y = df_eu.total_deaths
First, we perform a linear regression on our Europe data frame to understand the impact of GDP, demographic factors such as median age, smoking prevalence, and life expectancy, along with location and total cases, on total deaths.
# Feature Scaling and linear regression
scaler = StandardScaler()
reg = LinearRegression()
steps = [('scaling', scaler), ('regression', reg)]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
pipeline.score(X_test, y_test)
print("The intercept term of the linear model:", reg.intercept_)
cdf = pd.DataFrame(data=reg.coef_, index=X_train.columns, columns=["Coefficients"])
cdf
From the model we can see that the R² score on the test set is about 70.5%, and we can also get the intercept and the coefficients. But we need to check whether this model is valid by plotting the residuals for normality and the residuals versus the predicted values for homoscedasticity.
plt.figure(figsize=(10,7))
plt.title("Histogram of residuals to check for normality",fontsize=18)
plt.xlabel("Residuals",fontsize=15)
plt.ylabel("Kernel density", fontsize=15)
sns.distplot([y_test-np.round(y_pred)],color='purple')
This plot shows that, although the residuals might look roughly bell-shaped, the distribution is much narrower and more peaked than a normal distribution, so the residuals are not normally distributed.
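A formal test can back up this visual impression (a sketch using D'Agostino and Pearson's normality test from scipy.stats, imported above as stats; a very small p-value means we reject normality of the residuals):
# Normality test on the residuals of the linear model
residuals = y_test - y_pred
stat, p_value = stats.normaltest(residuals)
print("statistic = %.2f, p-value = %.3g" % (stat, p_value))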
plt.figure(figsize=(10,7))
plt.title("Residuals vs. predicted values plot (Homoscedasticity)\n",fontsize=18)
plt.xlabel("Predicted total deaths",fontsize=15)
plt.ylabel("Residuals", fontsize=15)
plt.scatter(x=np.round(y_pred),y=y_test-np.round(y_pred),color='red')
We can also observe from the residuals vs. predicted plot that the residuals are not homoscedastic, so this data set is not well suited to a linear regression model.
The linear regression model gives us an R² score of about 70.5%, but the relationship between COVID outcomes and demographic factors is unlikely to be purely linear, so a linear model cannot capture these relations well. Thus, let's model with a decision tree.
def train_score_regressor(sklearn_regressor, X_train, y_train, X_test, y_test, model_parameters, print_oob_score=False):
    """A helper function that:
    - Trains a regressor on training data
    - Scores data on training and test data
    - Returns a trained model
    """
    # Step 1: Initializing the sklearn regressor
    regressor = sklearn_regressor(**model_parameters)
    # Step 2: Training the algorithm using the X_train dataset of features and y_train, the associated target
    regressor.fit(X_train, y_train)
    y_pred1 = regressor.predict(X_train)
    y_pred2 = regressor.predict(X_test)
    # Step 3: Calculating the predictive power on the training and testing datasets
    print('Train score: %.3f' % r2_score(y_train, np.round(y_pred1)))
    print('Test score: %.3f' % r2_score(y_test, np.round(y_pred2)))
    print(regressor)
    return regressor
trained_regressor = train_score_regressor(sklearn_regressor=DecisionTreeRegressor,
X_train=X_train,
y_train=y_train,
X_test=X_test,
y_test=y_test,
model_parameters={'random_state':100})
This clearly shows that the decision tree is overfitting the data. Let's find the best parameters with GridSearchCV and perform hyperparameter tuning.
# Setting parameters to search through
parameters = {"max_depth":[3,4,5],
"max_leaf_nodes":[2,3,4]}
decision_regressor= DecisionTreeRegressor(random_state=100)
# Initialize GridSearch and then fit
regressor=GridSearchCV(decision_regressor,parameters)
regressor.fit(X_train, y_train)
print(regressor)
regressor.best_estimator_.get_params()
# evaluating the tuned model
trained_regressor = train_score_regressor(sklearn_regressor=DecisionTreeRegressor,
X_train=X_train,
y_train=y_train,
X_test=X_test,
y_test=y_test,
model_parameters=regressor.best_estimator_.get_params())
We can observe from the results above that the decision tree performs much better once tuned with GridSearchCV to find the best parameters for predicting total deaths from GDP, age, and location across Europe. The train score is 92.4% and the test score is 91.8%, which shows that the tuned model predicts total deaths very closely.
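To see which variables the tuned tree actually relies on, we can inspect its feature importances (a sketch; trained_regressor is the fitted model returned above, and the importances indicate how much weight variables such as total_cases, gdp_per_capita, and the age features carry in the splits):
# Feature importances of the tuned decision tree
importances = pd.Series(trained_regressor.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))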
We tried a regression model based on location details to predict total deaths and got an R² score of 70.5%, but the evaluation shows a different picture regarding the usefulness of this data set for linear regression. Our decision tree model showed signs of overfitting at first, but once tuned it gave a better score of 91.8% on the test set. This is very helpful for taking proactive measures at locations that might become hotspots, based on the prior trends we have captured. Moving beyond the data set we worked with, continuous data input would greatly improve the accuracy of our models over time. The data sets we're working with are volatile and subject to change, as spikes in infections can occur. With more data, the models would be able to predict future outcomes more reliably and help us anticipate future hotspots, whether by age, GDP, or other variables.