Overview

Using a daily updated global pandemic data set, we seek to understand how COVID-19 is affecting people of various ages, to predict the number of deaths, and to determine the effect of GDP on pandemic response and vice versa. The target population is global, broken down by country. Using descriptive and analytical methods, we aim to answer questions not asked by other sources. This pandemic has hugely disrupted world economies, and applying the data science strategies we have learned so far to relevant data can surface changing trends and outlier observations, helping us be better prepared in these challenging times.

Data Pre-processing

Understanding structure of data

  • Identifying Continuous and Categorical data
  • Handling Missing data
  • Methods to identify outliers (see the sketch after this list)
  • Measuring centrality of data
  • Measuring spread of data
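
As a preview of the outlier, centrality, and spread checks, here is a minimal sketch (the helper name and the use of the IQR rule are our own choices, not fixed by the dataset):

import pandas as pd

def summarize_column(s: pd.Series):
    # Centrality: mean and median
    print("mean:", s.mean(), "median:", s.median())
    # Spread: standard deviation and interquartile range
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    print("std:", s.std(), "IQR:", iqr)
    # IQR rule: flag points more than 1.5 * IQR beyond the quartiles as outliers
    outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
    print("IQR-rule outliers:", outliers.count())

For example, summarize_column(df['total_cases']) can be run once the data frame below is loaded.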

Loading required libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from pandas.api.types import CategoricalDtype
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree
import sklearn.ensemble as ske
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import mean_squared_error, r2_score, confusion_matrix, classification_report
from sklearn import metrics
import seaborn as sns
import datetime as dt
from datetime import timedelta
import random
import scipy.stats as stats
from statsmodels.tsa.api import Holt
import statsmodels.formula.api as sm
%matplotlib inline
sns.set()
In [2]:
# Install plotnine if not available
!pip install pandas plotnine
from plotnine import *

# Import warning to avoid warnings appearing on the displayed notebook
import warnings
warnings.filterwarnings('ignore')
In [3]:
# Setting Max rows and columns for full visibility
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

Reading Data

In [4]:
df = pd.read_csv("../data/owid-covid-data.csv",parse_dates=['date'])

Understanding structure of data

Looking at top 10 rows of data

We take a look at the top 10 rows to get an initial idea of the structure of the data we will be working with. Here we can clearly see each column, as well as that the data is separated by country. We also see a ‘date’ column, which will act as an index when the data is examined over time.

In [5]:
df.head(10)
Out[5]:
iso_code continent location date total_cases new_cases total_deaths new_deaths total_cases_per_million new_cases_per_million total_deaths_per_million new_deaths_per_million new_tests total_tests total_tests_per_thousand new_tests_per_thousand new_tests_smoothed new_tests_smoothed_per_thousand tests_units stringency_index population population_density median_age aged_65_older aged_70_older gdp_per_capita extreme_poverty cardiovasc_death_rate diabetes_prevalence female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy
0 AFG Asia Afghanistan 2019-12-31 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN 38928341.0 54.422 18.6 2.581 1.337 1803.987 NaN 597.029 9.59 NaN NaN 37.746 0.5 64.83
1 AFG Asia Afghanistan 2020-01-01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN 0.0 38928341.0 54.422 18.6 2.581 1.337 1803.987 NaN 597.029 9.59 NaN NaN 37.746 0.5 64.83
2 AFG Asia Afghanistan 2020-01-02 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN 0.0 38928341.0 54.422 18.6 2.581 1.337 1803.987 NaN 597.029 9.59 NaN NaN 37.746 0.5 64.83
3 AFG Asia Afghanistan 2020-01-03 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN 0.0 38928341.0 54.422 18.6 2.581 1.337 1803.987 NaN 597.029 9.59 NaN NaN 37.746 0.5 64.83
4 AFG Asia Afghanistan 2020-01-04 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN 0.0 38928341.0 54.422 18.6 2.581 1.337 1803.987 NaN 597.029 9.59 NaN NaN 37.746 0.5 64.83
5 AFG Asia Afghanistan 2020-01-05 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN 0.0 38928341.0 54.422 18.6 2.581 1.337 1803.987 NaN 597.029 9.59 NaN NaN 37.746 0.5 64.83
6 AFG Asia Afghanistan 2020-01-06 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN 0.0 38928341.0 54.422 18.6 2.581 1.337 1803.987 NaN 597.029 9.59 NaN NaN 37.746 0.5 64.83
7 AFG Asia Afghanistan 2020-01-07 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN 0.0 38928341.0 54.422 18.6 2.581 1.337 1803.987 NaN 597.029 9.59 NaN NaN 37.746 0.5 64.83
8 AFG Asia Afghanistan 2020-01-08 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN 0.0 38928341.0 54.422 18.6 2.581 1.337 1803.987 NaN 597.029 9.59 NaN NaN 37.746 0.5 64.83
9 AFG Asia Afghanistan 2020-01-09 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN 0.0 38928341.0 54.422 18.6 2.581 1.337 1803.987 NaN 597.029 9.59 NaN NaN 37.746 0.5 64.83
In [6]:
df.tail(10)
Out[6]:
iso_code continent location date total_cases new_cases total_deaths new_deaths total_cases_per_million new_cases_per_million total_deaths_per_million new_deaths_per_million new_tests total_tests total_tests_per_thousand new_tests_per_thousand new_tests_smoothed new_tests_smoothed_per_thousand tests_units stringency_index population population_density median_age aged_65_older aged_70_older gdp_per_capita extreme_poverty cardiovasc_death_rate diabetes_prevalence female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy
33407 NaN NaN International 2020-02-23 634.0 0.0 2.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
33408 NaN NaN International 2020-02-24 691.0 57.0 3.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
33409 NaN NaN International 2020-02-25 691.0 0.0 3.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
33410 NaN NaN International 2020-02-26 691.0 0.0 4.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
33411 NaN NaN International 2020-02-27 705.0 14.0 4.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
33412 NaN NaN International 2020-02-28 705.0 0.0 4.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
33413 NaN NaN International 2020-02-29 705.0 0.0 6.0 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
33414 NaN NaN International 2020-03-01 705.0 0.0 6.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
33415 NaN NaN International 2020-03-02 705.0 0.0 6.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
33416 NaN NaN International 2020-03-10 696.0 -9.0 7.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Looking at the last 10 rows of the dataset shows a few rows tagged as "International" that do not tie to any one continent.

Looking at all the variables and their types

Knowing the variables and their data types is very helpful when pre-processing data: it lets us handle each column appropriately and correct errors more easily.

In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33417 entries, 0 to 33416
Data columns (total 34 columns):
iso_code                           33353 non-null object
continent                          33141 non-null object
location                           33417 non-null object
date                               33417 non-null datetime64[ns]
total_cases                        33062 non-null float64
new_cases                          33062 non-null float64
total_deaths                       33062 non-null float64
new_deaths                         33062 non-null float64
total_cases_per_million            32998 non-null float64
new_cases_per_million              32998 non-null float64
total_deaths_per_million           32998 non-null float64
new_deaths_per_million             32998 non-null float64
new_tests                          10401 non-null float64
total_tests                        10647 non-null float64
total_tests_per_thousand           10647 non-null float64
new_tests_per_thousand             10401 non-null float64
new_tests_smoothed                 11520 non-null float64
new_tests_smoothed_per_thousand    11520 non-null float64
tests_units                        12288 non-null object
stringency_index                   27130 non-null float64
population                         33353 non-null float64
population_density                 31910 non-null float64
median_age                         30074 non-null float64
aged_65_older                      29638 non-null float64
aged_70_older                      29919 non-null float64
gdp_per_capita                     29708 non-null float64
extreme_poverty                    19865 non-null float64
cardiovasc_death_rate              30083 non-null float64
diabetes_prevalence                31104 non-null float64
female_smokers                     23877 non-null float64
male_smokers                       23591 non-null float64
handwashing_facilities             13764 non-null float64
hospital_beds_per_thousand         27353 non-null float64
life_expectancy                    32951 non-null float64
dtypes: datetime64[ns](1), float64(29), object(4)
memory usage: 8.7+ MB

There are 33417 observations and 34 columns: 4 variables are of categorical (object) data type, 1 is a datetime, and the remaining 29 are numerical.

Looking at the column names, the dataset provides information about the total number of COVID cases, tests, and deaths by continent and by different age brackets, on a daily basis. It also carries country-level context such as GDP per capita, life expectancy, and death rates from cardiovascular disease and diabetes.

Converting data types

As mentioned above, knowing the data type is very useful for handling and correcting the data. Here we convert columns into data types we can use more effectively. This is crucial for the date column, which we will rely on for time-based analysis.

In [8]:
# Normalize the date format, then parse it back to a proper datetime type
df['date']=df['date'].dt.strftime("%m-%d-%Y")
In [9]:
df['date']= pd.to_datetime(df['date'])
print(df['date'].dtype)
datetime64[ns]
In [10]:
df = df.set_index('date')
df.head(3)
Out[10]:
iso_code continent location total_cases new_cases total_deaths new_deaths total_cases_per_million new_cases_per_million total_deaths_per_million new_deaths_per_million new_tests total_tests total_tests_per_thousand new_tests_per_thousand new_tests_smoothed new_tests_smoothed_per_thousand tests_units stringency_index population population_density median_age aged_65_older aged_70_older gdp_per_capita extreme_poverty cardiovasc_death_rate diabetes_prevalence female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy
date
2019-12-31 AFG Asia Afghanistan 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN 38928341.0 54.422 18.6 2.581 1.337 1803.987 NaN 597.029 9.59 NaN NaN 37.746 0.5 64.83
2020-01-01 AFG Asia Afghanistan 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN 0.0 38928341.0 54.422 18.6 2.581 1.337 1803.987 NaN 597.029 9.59 NaN NaN 37.746 0.5 64.83
2020-01-02 AFG Asia Afghanistan 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN 0.0 38928341.0 54.422 18.6 2.581 1.337 1803.987 NaN 597.029 9.59 NaN NaN 37.746 0.5 64.83

Describing and Summarizing numerical or continuous variables

In [11]:
df.describe()
Out[11]:
total_cases new_cases total_deaths new_deaths total_cases_per_million new_cases_per_million total_deaths_per_million new_deaths_per_million new_tests total_tests total_tests_per_thousand new_tests_per_thousand new_tests_smoothed new_tests_smoothed_per_thousand stringency_index population population_density median_age aged_65_older aged_70_older gdp_per_capita extreme_poverty cardiovasc_death_rate diabetes_prevalence female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy
count 3.306200e+04 33062.000000 33062.000000 33062.00000 32998.000000 32998.000000 32998.000000 32998.000000 10401.000000 1.064700e+04 10647.000000 10401.000000 11520.000000 11520.000000 27130.000000 3.335300e+04 31910.000000 30074.000000 29638.000000 29919.000000 29708.000000 19865.000000 30083.000000 31104.000000 23877.000000 23591.000000 13764.000000 27353.000000 32951.000000
mean 5.091939e+04 1010.762809 2655.291634 39.93243 1103.657007 17.858746 40.909829 0.533204 16320.258341 7.689958e+05 30.980448 0.572316 15589.503906 0.551412 58.327987 9.443562e+07 368.561392 31.634754 9.450372 5.990319 21546.066343 11.489011 249.517591 8.039533 10.990606 32.629508 53.246010 3.146980 74.244388
std 5.180225e+05 9309.139517 25233.329557 347.73264 2674.940362 62.928423 123.250689 3.006846 59168.420750 3.022411e+06 55.964699 1.104416 54168.666654 0.979232 29.773501 6.370159e+08 1680.063490 9.012636 6.375376 4.362110 20697.420278 18.736936 117.957827 4.116805 10.504692 13.328649 31.456423 2.549325 7.316460
min 0.000000e+00 -29726.000000 0.000000 -1918.00000 0.000000 -437.881000 0.000000 -41.023000 -3743.000000 1.000000e+00 0.000000 -0.398000 0.000000 0.000000 0.000000 8.090000e+02 0.137000 15.100000 1.144000 0.526000 661.240000 0.100000 79.370000 0.990000 0.100000 7.700000 1.188000 0.100000 53.280000
25% 2.100000e+01 0.000000 0.000000 0.00000 8.521500 0.000000 0.000000 0.000000 805.000000 2.585100e+04 1.437000 0.049000 903.000000 0.051000 37.960000 1.701583e+06 39.497000 24.400000 3.607000 2.162000 6171.884000 0.500000 153.493000 5.310000 1.900000 21.400000 22.863000 1.380000 70.390000
50% 4.460000e+02 5.000000 9.000000 0.00000 155.458000 0.773000 2.043000 0.000000 2766.000000 1.105140e+05 8.105000 0.221000 3115.000000 0.239000 67.590000 8.655541e+06 90.672000 31.800000 7.104000 4.458000 15183.616000 1.700000 235.954000 7.110000 6.434000 31.400000 55.182000 2.540000 75.860000
75% 5.066500e+03 102.000000 107.000000 2.00000 936.628000 10.572000 21.692000 0.140000 9307.000000 4.324700e+05 38.056000 0.693000 9528.250000 0.691000 81.940000 3.236600e+07 222.873000 39.800000 14.864000 9.720000 33132.320000 15.000000 318.949000 10.080000 19.600000 40.900000 83.741000 4.210000 80.100000
max 1.670892e+07 284710.000000 660123.000000 10512.00000 38138.741000 4944.376000 1237.551000 200.040000 929838.000000 5.063568e+07 638.167000 20.611000 801014.000000 15.456000 100.000000 7.794799e+09 19347.500000 48.200000 27.049000 18.493000 116935.600000 77.600000 724.417000 23.360000 44.000000 78.100000 98.999000 13.800000 86.750000

There are some negative numbers in the above output. For example, the new_cases column has a minimum value of -29726. Since this is a count of cases, it cannot legitimately be negative, so this looks like a data error. Other columns with negative minimums are new_deaths, new_cases_per_million, new_deaths_per_million, new_tests, and new_tests_per_thousand. We will handle these errors later.
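
One hedged way to handle them later is to treat negative daily counts as reporting corrections and mask them as missing (masking rather than clipping to zero is our own assumption):

count_cols = ['new_cases', 'new_deaths', 'new_cases_per_million',
              'new_deaths_per_million', 'new_tests', 'new_tests_per_thousand']
for col in count_cols:
    # Negative daily counts are treated as data errors and set to NaN
    df.loc[df[col] < 0, col] = np.nan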

In [12]:
# Taking a look at the outcome variable: 'total_deaths'
print(df['total_deaths'].value_counts())
0.0         10059
1.0          2053
2.0           972
3.0           805
5.0           595
10.0          546
7.0           520
6.0           519
9.0           515
4.0           467
8.0           402
11.0          347
12.0          279
13.0          266
15.0          241
21.0          224
14.0          186
19.0          185
24.0          179
26.0          167
22.0          163
31.0          161
28.0          151
20.0          135
17.0          135
23.0          128
42.0          124
18.0          123
30.0          123
25.0          118
16.0          111
51.0          110
52.0          106
58.0          105
50.0          100
53.0          100
69.0           99
27.0           96
29.0           89
32.0           88
34.0           88
35.0           87
41.0           87
33.0           82
40.0           82
46.0           82
36.0           73
45.0           67
48.0           66
38.0           66
49.0           63
56.0           63
47.0           63
110.0          62
55.0           61
43.0           60
111.0          56
74.0           54
102.0          53
44.0           53
121.0          53
37.0           52
64.0           51
107.0          50
79.0           50
54.0           48
108.0          47
66.0           46
39.0           45
103.0          44
93.0           43
61.0           42
67.0           42
87.0           41
60.0           40
75.0           40
83.0           39
68.0           38
65.0           38
57.0           37
109.0          37
86.0           36
104.0          36
63.0           35
85.0           35
4638.0         35
328.0          34
78.0           34
92.0           33
98.0           33
80.0           32
71.0           32
112.0          32
59.0           31
115.0          31
82.0           31
72.0           31
88.0           30
99.0           30
70.0           29
84.0           29
76.0           29
90.0           28
193.0          28
113.0          27
139.0          27
123.0          27
122.0          26
120.0          26
106.0          26
91.0           25
95.0           24
62.0           24
94.0           24
188.0          24
313.0          23
117.0          23
136.0          23
97.0           22
96.0           22
146.0          22
81.0           22
116.0          22
273.0          21
119.0          21
126.0          21
167.0          20
235.0          20
149.0          20
130.0          20
73.0           20
4637.0         20
140.0          19
77.0           19
208.0          19
264.0          19
169.0          19
359.0          18
89.0           18
159.0          18
250.0          18
129.0          18
105.0          18
192.0          18
4641.0         18
255.0          18
156.0          18
147.0          18
100.0          17
282.0          17
191.0          17
144.0          17
152.0          17
124.0          17
326.0          16
127.0          16
300.0          16
174.0          16
153.0          16
213.0          16
125.0          16
298.0          16
142.0          15
148.0          15
164.0          15
114.0          15
101.0          15
183.0          15
133.0          15
194.0          15
141.0          15
239.0          15
160.0          15
373.0          15
168.0          14
209.0          14
200.0          14
128.0          14
135.0          14
175.0          14
158.0          14
151.0          14
260.0          14
269.0          14
293.0          13
280.0          13
306.0          13
185.0          13
143.0          13
199.0          13
203.0          13
1685.0         13
596.0          13
165.0          13
163.0          13
190.0          13
327.0          13
242.0          13
276.0          13
145.0          13
179.0          13
118.0          12
244.0          12
225.0          12
201.0          12
351.0          12
252.0          12
177.0          12
182.0          12
329.0          12
585.0          12
27136.0        12
155.0          12
220.0          12
161.0          12
237.0          12
233.0          12
256.0          11
206.0          11
172.0          11
375.0          11
281.0          11
249.0          11
131.0          11
197.0          11
170.0          11
706.0          11
385.0          11
214.0          11
262.0          11
150.0          11
232.0          11
243.0          11
285.0          11
134.0          11
173.0          11
238.0          11
251.0          11
154.0          11
4636.0         10
611.0          10
274.0          10
297.0          10
245.0          10
171.0          10
180.0          10
308.0          10
186.0          10
259.0          10
218.0          10
            ...  
12400.0         1
7935.0          1
27643.0         1
5370.0          1
4408.0          1
9832.0          1
3743.0          1
12829.0         1
2506.0          1
874.0           1
987.0           1
29752.0         1
34926.0         1
1649.0          1
139684.0        1
5901.0          1
8505.0          1
9753.0          1
4477.0          1
44198.0         1
34914.0         1
877.0           1
466.0           1
9003.0          1
5332.0          1
20047.0         1
9758.0          1
1655.0          1
9761.0          1
34945.0         1
28236.0         1
3296.0          1
1696.0          1
39680.0         1
3564.0          1
80684.0         1
34938.0         1
10023.0         1
110845.0        1
8078.0          1
27711.0         1
1319.0          1
8735.0          1
9044.0          1
8856.0          1
44968.0         1
4064.0          1
4950.0          1
29731.0         1
4125.0          1
15889.0         1
4503.0          1
808.0           1
2134.0          1
15074.0         1
12998.0         1
44220.0         1
5333.0          1
1175.0          1
311772.0        1
1642.0          1
1694.0          1
6919.0          1
34869.0         1
1445.0          1
4630.0          1
51271.0         1
279098.0        1
16448.0         1
7574.0          1
34854.0         1
947.0           1
3388.0          1
34899.0         1
12285.0         1
5640.0          1
4362.0          1
9746.0          1
43332.0         1
2012.0          1
17654.0         1
6145.0          1
43081.0         1
27784.0         1
5382.0          1
6424.0          1
8693.0          1
4374.0          1
5931.0          1
27563.0         1
6829.0          1
1837.0          1
9685.0          1
6075.0          1
118434.0        1
2632.0          1
447.0           1
753.0           1
1438.0          1
9859.0          1
8659.0          1
34634.0         1
51017.0         1
4948.0          1
1134.0          1
830.0           1
375475.0        1
24648.0         1
5777.0          1
2594.0          1
8667.0          1
4122.0          1
34514.0         1
6692.0          1
1802.0          1
4142.0          1
10772.0         1
23473.0         1
8663.0          1
34657.0         1
2672.0          1
1385.0          1
1532.0          1
682.0           1
9024.0          1
3646.0          1
27555.0         1
578341.0        1
4350.0          1
1148.0          1
9674.0          1
9708.0          1
12745.0         1
5131.0          1
1256.0          1
26273.0         1
3993.0          1
28322.0         1
138358.0        1
3358.0          1
13767.0         1
42632.0         1
40261.0         1
28323.0         1
9669.0          1
342148.0        1
2456.0          1
3828.0          1
14011.0         1
29958.0         1
5468.0          1
1381.0          1
28315.0         1
1915.0          1
30366.0         1
3502.0          1
4458.0          1
1851.0          1
4839.0          1
3935.0          1
6541.0          1
4042.0          1
885.0           1
34610.0         1
5033.0          1
1388.0          1
7398.0          1
1283.0          1
554738.0        1
686.0           1
1085.0          1
4343.0          1
2314.0          1
34738.0         1
1436.0          1
9784.0          1
419.0           1
42927.0         1
3359.0          1
944.0           1
34730.0         1
3968.0          1
28289.0         1
5877.0          1
2370.0          1
45312.0         1
3698.0          1
27359.0         1
2854.0          1
9711.0          1
29663.0         1
19693.0         1
88539.0         1
2429.0          1
10670.0         1
34767.0         1
2749.0          1
85906.0         1
5782.0          1
3929.0          1
9628.0          1
904.0           1
1352.0          1
728.0           1
790.0           1
6498.0          1
10045.0         1
29920.0         1
852.0           1
34716.0         1
9692.0          1
25549.0         1
3281.0          1
3656.0          1
2640.0          1
5115.0          1
2618.0          1
26251.0         1
13791.0         1
621.0           1
6051.0          1
5359.0          1
25531.0         1
342954.0        1
5709.0          1
34675.0         1
22157.0         1
1814.0          1
894.0           1
22108.0         1
6649.0          1
29633.0         1
7073.0          1
46784.0         1
13798.0         1
9704.0          1
8677.0          1
2708.0          1
4431.0          1
29640.0         1
9699.0          1
9053.0          1
4403.0          1
4922.0          1
5528.0          1
7921.0          1
6486.0          1
5476.0          1
3670.0          1
13354.0         1
Name: total_deaths, Length: 3778, dtype: int64

Feature Selection

Dropping unnecessary and redundant numerical columns before analyzing data

In [13]:
df.isna().sum()
Out[13]:
iso_code                              64
continent                            276
location                               0
total_cases                          355
new_cases                            355
total_deaths                         355
new_deaths                           355
total_cases_per_million              419
new_cases_per_million                419
total_deaths_per_million             419
new_deaths_per_million               419
new_tests                          23016
total_tests                        22770
total_tests_per_thousand           22770
new_tests_per_thousand             23016
new_tests_smoothed                 21897
new_tests_smoothed_per_thousand    21897
tests_units                        21129
stringency_index                    6287
population                            64
population_density                  1507
median_age                          3343
aged_65_older                       3779
aged_70_older                       3498
gdp_per_capita                      3709
extreme_poverty                    13552
cardiovasc_death_rate               3334
diabetes_prevalence                 2313
female_smokers                      9540
male_smokers                        9826
handwashing_facilities             19653
hospital_beds_per_thousand          6064
life_expectancy                      466
dtype: int64

There are 33417 observations in total, and the following variables have roughly 60% or more of their values missing:

new_tests, total_tests, total_tests_per_thousand, new_tests_per_thousand, new_tests_smoothed, new_tests_smoothed_per_thousand, and tests_units. The variable handwashing_facilities, with 19653 missing values, is close to 60% missing as well.

These variables should therefore be removed to avoid introducing bias into the modeling. The variable extreme_poverty also has about 40% of its values missing; however, we will keep it for now, since fewer than half of its values are missing.
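
As a quick check of these percentages, a sketch (the 60% cutoff is the threshold we chose above, not a fixed rule):

# Share of missing values per column, highest first
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share[missing_share > 0.60])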

In [14]:
df.drop(['new_tests', 'total_tests', 'total_tests_per_thousand', 'new_tests_per_thousand', 'new_tests_smoothed', 'new_tests_smoothed_per_thousand', 'tests_units', 'handwashing_facilities'], axis = 1, inplace = True)

The variables listed below should also be removed. In general we would prefer per-million figures over raw counts; however, the per-million variables have more missing values, so we keep the raw counts instead.

total_cases_per_million, new_cases_per_million, total_deaths_per_million, and new_deaths_per_million

In [15]:
df.drop(['total_cases_per_million', 'new_cases_per_million', 'total_deaths_per_million', 'new_deaths_per_million'], axis = 1, inplace = True)

Also, total_cases is the running total of new_cases, and total_deaths is the running total of new_deaths. We can therefore remove new_cases and new_deaths from our data to avoid redundancy.
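
Before dropping them, this relationship can be sanity-checked with a sketch like the following (small discrepancies are possible where counts were corrected retroactively):

# Per location, the cumulative sum of new_cases should track total_cases
diff = df.groupby('location')['new_cases'].cumsum().values - df['total_cases'].values
print("max absolute discrepancy:", np.nanmax(np.abs(diff)))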

In [16]:
df.drop(['new_cases', 'new_deaths'], axis = 1, inplace = True)

Because population_density is calculated using population, we remove population to avoid carrying both.

In [17]:
df.drop(['population'], axis = 1, inplace = True)

Describing and Summarizing categorical variables

In [18]:
df.describe(include = 'O')
Out[18]:
iso_code continent location
count 33353 33141 33417
unique 211 6 212
top GBR Europe United Arab Emirates
freq 212 9113 212

Drop specific rows

Looking at the categorical variables, some rows have no value for iso_code, and the corresponding continent is also empty. In these rows the location is always "International", so there is no way to tell which continent they belong to. We should therefore remove these rows. There are only 64 of them, so removing this small amount of data will not affect our model.

In [19]:
df = df.dropna(how='all', subset=['iso_code'])

Looking at shape of the new dataset

We now have 33353 observations instead of 33417 observations and 18 variables instead of 34 variables.

In [20]:
df.shape
Out[20]:
(33353, 18)

All the names of variables in our current data set are listed below

In [21]:
df.columns
Out[21]:
Index(['iso_code', 'continent', 'location', 'total_cases', 'total_deaths',
       'stringency_index', 'population_density', 'median_age', 'aged_65_older',
       'aged_70_older', 'gdp_per_capita', 'extreme_poverty',
       'cardiovasc_death_rate', 'diabetes_prevalence', 'female_smokers',
       'male_smokers', 'hospital_beds_per_thousand', 'life_expectancy'],
      dtype='object')
In [22]:
df.describe()
Out[22]:
total_cases total_deaths stringency_index population_density median_age aged_65_older aged_70_older gdp_per_capita extreme_poverty cardiovasc_death_rate diabetes_prevalence female_smokers male_smokers hospital_beds_per_thousand life_expectancy
count 3.299800e+04 32998.000000 27130.000000 31910.000000 30074.000000 29638.000000 29919.000000 29708.000000 19865.000000 30083.000000 31104.000000 23877.000000 23591.000000 27353.000000 32951.000000
mean 5.101779e+04 2660.440057 58.327987 368.561392 31.634754 9.450372 5.990319 21546.066343 11.489011 249.517591 8.039533 10.990606 32.629508 3.146980 74.244388
std 5.185198e+05 25257.517537 29.773501 1680.063490 9.012636 6.375376 4.362110 20697.420278 18.736936 117.957827 4.116805 10.504692 13.328649 2.549325 7.316460
min 0.000000e+00 0.000000 0.000000 0.137000 15.100000 1.144000 0.526000 661.240000 0.100000 79.370000 0.990000 0.100000 7.700000 0.100000 53.280000
25% 2.100000e+01 0.000000 37.960000 39.497000 24.400000 3.607000 2.162000 6171.884000 0.500000 153.493000 5.310000 1.900000 21.400000 1.380000 70.390000
50% 4.470000e+02 9.000000 67.590000 90.672000 31.800000 7.104000 4.458000 15183.616000 1.700000 235.954000 7.110000 6.434000 31.400000 2.540000 75.860000
75% 5.105500e+03 108.000000 81.940000 222.873000 39.800000 14.864000 9.720000 33132.320000 15.000000 318.949000 10.080000 19.600000 40.900000 4.210000 80.100000
max 1.670892e+07 660123.000000 100.000000 19347.500000 48.200000 27.049000 18.493000 116935.600000 77.600000 724.417000 23.360000 44.000000 78.100000 13.800000 86.750000
In [23]:
df.shape
Out[23]:
(33353, 18)
In [24]:
df.isnull().sum()
Out[24]:
iso_code                          0
continent                       212
location                          0
total_cases                     355
total_deaths                    355
stringency_index               6223
population_density             1443
median_age                     3279
aged_65_older                  3715
aged_70_older                  3434
gdp_per_capita                 3645
extreme_poverty               13488
cardiovasc_death_rate          3270
diabetes_prevalence            2249
female_smokers                 9476
male_smokers                   9762
hospital_beds_per_thousand     6000
life_expectancy                 402
dtype: int64

We can observe that, even in our reduced data, the variable extreme_poverty still has almost 40% of its values missing. The better course is therefore to delete this column, as imputing that much data would produce misleading results.

In [25]:
df.drop(['extreme_poverty'], axis = 1, inplace = True)
In [26]:
df.shape
Out[26]:
(33353, 17)

Graphical Exploratory Analysis

Before we go deeper into the dataset, it is good to perform some graphical exploratory analysis, as it lets us quickly spot issues with the data.

GDP

GDP, or gross domestic product, is a good indicator of how a country is faring economically. Here we use GDP per capita and examine how it relates to the number of cases as well as to the response to the pandemic.

In [27]:
(ggplot(df, aes(x='gdp_per_capita'))   
 + geom_histogram(bins=12,
                 color ="red", 
                 fill ="orange")
 + labs(title="Histogram for gdp_per_capita", x="gdp_per_capita", y="Count")
)
Out[27]:
<ggplot: (-9223371897895102383)>

Median Age

Like GDP, we look at median age in a histogram. Note that median_age here is each country's population median age, not the age of the people tested. Across the data the distribution is centered around the low 30s (the overall median is 31.8).

In [28]:
(ggplot(df, aes(x='median_age'))   
 + geom_histogram(bins=12,
                 color ="red", 
                 fill ="orange")
 + labs(title="Histogram for median_age", x="median_age", y="Count")
)
Out[28]:
<ggplot: (-9223371897895125658)>

Number of Cases vs Deaths

In [29]:
plt.scatter(df['total_cases'], df['total_deaths'], alpha = 0.1)
plt.xlabel("Total Number of Cases")
plt.ylabel("Total Deaths")
plt.title("Total Number of Cases vs Total Deaths")
Out[29]:
Text(0.5, 1.0, 'Total Number of Cases vs Total Deaths')

As you can see, as the number of cases grows, the number of deaths grows as well. The rate of growth can be influenced by a number of factors; for example, a country could be performing a large number of tests in a short time frame, much like the US.
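
To put a rough number on this relationship, the linear correlation between the two columns can be checked (a sketch; both columns are cumulative, which inflates the correlation, so treat it as descriptive only):

# Pearson correlation between cumulative cases and cumulative deaths
print(df['total_cases'].corr(df['total_deaths']))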

Distribution of all Numerical Variables

In [30]:
df.hist(bins = 50, figsize = (20,15))
plt.show()

Out of the above graphs, consider hospital beds per thousand: the distribution is right skewed, indicating that most countries have few hospital beds per thousand people. In times of a resource crunch, fewer beds and more cases put stress on hospitals and might compromise the care offered not just to patients seeking COVID treatment but to patients with any other condition as well.
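
This can be confirmed numerically (a sketch; a positive value indicates right skew, and the mean of 3.15 exceeding the median of 2.54 in the summary above already hints at it):

# Sample skewness of hospital beds per thousand
print(df['hospital_beds_per_thousand'].skew())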

Analysis for Daily Cases

In [31]:
datewise = df.groupby(["date"]).agg({"total_cases" : "sum", "total_deaths" : "sum"})
In [32]:
print("Total Number of Cases: ", datewise["total_cases"].iloc[-1])
print("Total Number of Deaths: ", datewise["total_deaths"].iloc[-1])
print("Total Number of Active Cases ", (datewise["total_cases"].iloc[-1] - datewise["total_deaths"].iloc[-1]))
Total Number of Cases:  33136534.0
Total Number of Deaths:  1291803.0
Total Number of Active Cases  31844731.0
In [33]:
# Restrict to a recent window of dates so the recent trend is easier to read
# than plotting the full date range at once
datewise_sample = datewise[datewise.index.to_series().between('2020-03-30', '2020-06-30')]

plt.figure(figsize = (20, 5))
sns.barplot(x = datewise_sample.index.date, y = datewise_sample["total_cases"] - datewise_sample["total_deaths"])
plt.title("Active Cases")
plt.xticks(rotation = 90)
plt.show()

Sadly, but as expected, the number of cases has been increasing, as you can see above.

In [34]:
plt.figure(figsize = (20, 5))
sns.barplot(x = datewise_sample.index.date, y = datewise_sample["total_deaths"])
plt.title("Total Deaths")
plt.xticks(rotation = 90)
plt.show()

The number of deaths has been increasing as well.

Analysis for Weekly Cases

In [35]:
datewise["WeekofYear"] = datewise.index.weekofyear
num_week = []
weekly_cases = []
weekly_deaths = []

# For each week, take the last reported cumulative value
for w, i in enumerate(datewise["WeekofYear"].unique(), start=1):
    weekly_cases.append(datewise[datewise["WeekofYear"] == i]["total_cases"].iloc[-1])
    weekly_deaths.append(datewise[datewise["WeekofYear"] == i]["total_deaths"].iloc[-1])
    num_week.append(w)
    
plt.figure(figsize = (10, 5))
plt.plot(num_week, weekly_cases, label = "Weekly Cases", linewidth = 3)
plt.plot(num_week, weekly_deaths, label = "Weekly Deaths", linewidth = 3)
plt.xlabel("Number of Week")
plt.ylabel("Number of Cases")
plt.title("Weekly Analysis of Cases")
Out[35]:
Text(0.5, 1.0, 'Weekly Analysis of Cases')

As you can see, after about 10 weeks the number of cases starts to grow exponentially.

In [36]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (12, 4))
sns.barplot(x = num_week, y = pd.Series(weekly_cases).diff().fillna(0), ax = ax1)
sns.barplot(x = num_week, y = pd.Series(weekly_deaths).diff().fillna(0), ax = ax2)
ax1.set_xlabel("Number of Week")
ax2.set_xlabel("Number of Week")
ax1.set_ylabel("Total Number of Cases")
ax2.set_ylabel("Total Number of Deaths")
ax1.set_title("Weekly Number of Cases")
ax2.set_title("Weekly Deaths")
plt.show()

Curiously, you can see that there was a fall-off in the weekly number of deaths, which then began to rise again.

In [37]:
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
    
print("Average number of cases increasing everyday: ", np.round(datewise["total_cases"].diff().fillna(0).mean()))
print("Average number of deaths increasing everyday: ", np.round(datewise["total_deaths"].diff().fillna(0).mean()))

plt.figure(figsize = (10, 5))
plt.plot(datewise["total_cases"].diff().fillna(0), label = "Daily increase in total number of cases", linewidth = 3)
plt.plot(datewise["total_deaths"].diff().fillna(0), label = "Daily increase in total number of deaths", linewidth = 3)
plt.xlabel("Daily")
plt.ylabel("Daily Increase")
plt.title("Daily Increase in Cases")
plt.legend()
plt.xticks(rotation = 90)
plt.show()
Average number of cases increasing everyday:  156304.0
Average number of deaths increasing everyday:  6093.0

Analysis by Continent

Number of Observations per Continent

In [38]:
df.groupby('continent').size()
Out[38]:
continent
Africa           7479
Asia             8274
Europe           9113
North America    5047
Oceania          1253
South America    1975
dtype: int64
In [39]:
sns.countplot(x='continent',data=df, palette="OrRd")
plt.xticks(rotation = 90)
plt.show()

The above graph shows that Europe has the largest number of observations in the data.

Exploring Total Number of Cases and Total Number of Deaths by Continent

In [40]:
by_continent = df[df.index == df.index.max()].groupby(["continent"]).agg({"total_cases" : "sum", "total_deaths" : "sum", "gdp_per_capita" : "sum", "population_density" : "sum"})
In [41]:
by_continent["deaths %"] = (by_continent["total_deaths"]/by_continent["total_cases"])*100
In [42]:
by_continent.head()
Out[42]:
total_cases total_deaths gdp_per_capita population_density deaths %
continent
Africa 873331.0 18471.0 288523.368 5485.100 2.115006
Asia 4053864.0 92728.0 976155.985 19582.162 2.287398
Europe 2495240.0 173495.0 1366873.611 29988.597 6.953039
North America 5146329.0 209663.0 584691.580 8928.663 4.074030
Oceania 17078.0 197.0 93260.722 605.862 1.153531
In [43]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (25, 10))
top_total_cases = by_continent.sort_values(["total_cases"], ascending = False).head(10)
top_total_deaths = by_continent.sort_values(["deaths %"], ascending = False).head(10)
sns.barplot(x = top_total_cases["total_cases"], y = top_total_cases.index, ax = ax1)
ax1.set_title("Top Continents - Total Number of Cases")
sns.barplot(x = top_total_deaths["deaths %"], y = top_total_deaths.index, ax = ax2)
ax2.set_title("Top Continents - Total Percentage of Deaths")

plt.show()

From the above charts, it is clear that North America has the highest number of cases and the second highest percentage of deaths. South America ranks third in both total number of cases and total percentage of deaths. Surprisingly, Asia has the second highest number of cases but a low percentage of deaths (only about 2.3%). Several factors might explain this; one is possible inaccuracy in reporting cases and deaths, another is demographic differences such as younger populations. Europe, also surprisingly, has comparatively few cases but ranks first in total percentage of deaths. Again, several factors may be at play; one possibility is that Europe did not take the same precautions as other continents.

Exploring Continents by GDP

In [44]:
plt.figure(figsize=(10,6))
ax = sns.boxplot(x="continent", y="gdp_per_capita", data=df, showfliers=False)
In [45]:
# trying to understand any relation between gdp and total deaths
sns.lmplot(y='gdp_per_capita',x='total_deaths',data=df, size=12)
Out[45]:
<seaborn.axisgrid.FacetGrid at 0x205ac756198>

We can observe from the above plot that total deaths are concentrated among poorer countries, but we cannot be very sure: there is a spike close to 60,000 deaths among countries that can be deemed higher-GDP. Still, a significant portion of deaths falls in the lower-GDP range, which might indicate a disparity in life expectancy between richer and poorer countries.
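
One way to probe this further is to compare deaths across GDP quartiles (a sketch; the quartile binning is our own choice, used only to inspect the pattern):

# Latest snapshot per country, binned into GDP-per-capita quartiles
latest = df[df.index == df.index.max()].copy()
latest['gdp_quartile'] = pd.qcut(latest['gdp_per_capita'], 4,
                                 labels=['Q1 (lowest)', 'Q2', 'Q3', 'Q4 (highest)'])
print(latest.groupby('gdp_quartile')['total_deaths'].sum())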

Exploring population density per continent

The below chart shows that population density is highest in North America, followed by Asia, Europe, Africa, Oceania, and lastly South America. Therefore, even though Europe has the highest percentage of deaths and North America a lower one, it is important to consider that North America has a higher population density than Europe.

In [46]:
plt.figure(figsize=(10,6))
ax = sns.boxplot(x="continent", y="population_density", data=df, showfliers=False)

The first box plot shows that Europe and Oceania have the highest GDP per capita of all the continents. Oceania also has the second lowest population density, which could contribute to it having the fewest deaths. Europe, on the other hand, is among the richest continents, ranks third in population density and fourth in total number of cases, and still has the highest percentage of deaths. It will therefore be interesting to study Europe among all the continents and help predict its total number of deaths ahead of time, which could help Europe manage the situation better to some extent.

In [47]:
# heatmap for the correlation matrix
plt.figure(figsize=(10, 5))
sns.heatmap(df.corr(), annot=True, cmap='cubehelix_r')
plt.show()

This heatmap shows the correlation between the different variables, which lets us see how they may affect one another. We can observe that median_age, aged_65_older, and aged_70_older are strongly correlated. We will keep only median_age and aged_65_older in our model to avoid redundant, collinear features.

In [48]:
df.drop(['aged_70_older'], axis = 1, inplace = True)
df.dtypes
Out[48]:
iso_code                       object
continent                      object
location                       object
total_cases                   float64
total_deaths                  float64
stringency_index              float64
population_density            float64
median_age                    float64
aged_65_older                 float64
gdp_per_capita                float64
cardiovasc_death_rate         float64
diabetes_prevalence           float64
female_smokers                float64
male_smokers                  float64
hospital_beds_per_thousand    float64
life_expectancy               float64
dtype: object

Studying Europe

In [49]:
df_eu = df[df['continent'] == "Europe"]
df_eu = df_eu.drop("continent", axis=1)

df_eu.head()
Out[49]:
iso_code location total_cases total_deaths stringency_index population_density median_age aged_65_older gdp_per_capita cardiovasc_death_rate diabetes_prevalence female_smokers male_smokers hospital_beds_per_thousand life_expectancy
date
2020-03-09 ALB Albania 2.0 0.0 36.11 104.871 38.0 13.188 11803.431 304.195 10.08 7.1 51.2 2.89 78.57
2020-03-10 ALB Albania 6.0 0.0 41.67 104.871 38.0 13.188 11803.431 304.195 10.08 7.1 51.2 2.89 78.57
2020-03-11 ALB Albania 10.0 0.0 51.85 104.871 38.0 13.188 11803.431 304.195 10.08 7.1 51.2 2.89 78.57
2020-03-12 ALB Albania 11.0 1.0 51.85 104.871 38.0 13.188 11803.431 304.195 10.08 7.1 51.2 2.89 78.57
2020-03-13 ALB Albania 23.0 1.0 78.70 104.871 38.0 13.188 11803.431 304.195 10.08 7.1 51.2 2.89 78.57
In [50]:
datewise_europe = df_eu.groupby(df_eu.index).agg({"total_cases" : "sum", "total_deaths" : "sum"})
In [51]:
print("Total Number of Cases in Europe: ", datewise_europe["total_cases"].iloc[-1])
print("Total Number of Deaths in Europe: ", datewise_europe["total_deaths"].iloc[-1])
print("Total Number of Active Cases in Europe ", (datewise_europe["total_cases"].iloc[-1] - datewise_europe["total_deaths"].iloc[-1]))
Total Number of Cases in Europe:  2495240.0
Total Number of Deaths in Europe:  173495.0
Total Number of Active Cases in Europe  2321745.0

Exploring the regions within Europe

In [52]:
plt.figure(figsize=(20,4))
ax = sns.countplot(x ='location',data=df_eu, palette="OrRd")
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.show()
In [53]:
by_location = df_eu[df_eu.index == df_eu.index.max()].groupby(["location"]).agg({"total_cases" : "sum", "total_deaths" : "sum"}).sort_values(["total_cases"], ascending = False)
In [54]:
by_location.head()
Out[54]:
total_cases total_deaths
location
Russia 823515.0 13504.0
United Kingdom 300692.0 45878.0
Italy 246488.0 35123.0
Germany 206926.0 9128.0
France 183804.0 30223.0
In [55]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (25, 10))
loc_cases = by_location.sort_values(["total_cases"], ascending = False).head(10)
loc_deaths = by_location.sort_values(["total_deaths"], ascending = False).head(10)
sns.barplot(x = loc_cases["total_cases"], y = loc_cases.index, ax = ax1)
ax1.set_title("Total Number of Cases by Location in Europe")
sns.barplot(x = loc_deaths["total_deaths"], y = loc_deaths.index, ax = ax2)
ax2.set_title("Total Number of Deaths by Location")

plt.show()

The above results show that Russia has the highest number of cases, but the United Kingdom has the highest number of deaths.

In [56]:
# Getting dummy variables
# Note: iso_code and location map one-to-one onto each other, so one set of
# dummies would suffice; both are encoded here
df_eu = pd.get_dummies(df_eu, columns = ['iso_code', 'location'], drop_first = True)
In [57]:
# Describing the numeric columns
df_eu.describe(include = [np.number])
Out[57]:
total_cases total_deaths stringency_index population_density median_age aged_65_older gdp_per_capita cardiovasc_death_rate diabetes_prevalence female_smokers male_smokers hospital_beds_per_thousand life_expectancy ...
count 9027.000000 9027.000000 7207.000000 8712.000000 7475.000000 7475.000000 7824.000000 7613.000000 8165.000000 7410.000000 7410.000000 8027.000000 8711.000000 ...
mean 24474.764374 2088.278609 47.684135 650.715591 42.144187 17.862396 35119.094811 220.870109 6.363129 23.401538 34.752227 5.166233 80.007628 ...
std 73462.064768 6821.101474 30.470777 2902.621269 2.555188 2.734599 17827.141089 121.927932 1.947425 7.186321 10.555906 2.422809 3.482640 ...
min 0.000000 0.000000 0.000000 3.404000 37.300000 10.864000 5189.972000 86.060000 3.280000 5.900000 15.200000 2.220000 71.900000 ...
25% 70.000000 0.000000 19.440000 65.180000 40.300000 15.070000 23313.199000 117.992000 4.790000 19.600000 27.300000 3.320000 76.880000 ...
50% 1097.000000 29.000000 51.850000 106.749000 42.400000 18.577000 32605.906000 156.139000 5.720000 23.000000 33.100000 4.510000 81.320000 ...
75% 9861.000000 329.000000 74.070000 205.859000 43.500000 19.718000 45436.686000 322.688000 7.550000 28.200000 40.200000 6.620000 82.400000 ...
max 823515.000000 45878.000000 100.000000 19347.500000 47.900000 23.021000 94277.965000 539.849000 10.080000 44.000000 58.300000 13.800000 86.750000 ...
(dummy indicator columns iso_code_* and location_* elided; each is a 0/1 flag)

Handling missing values for modeling

This is another key part of repairing the data set. Several of the numerical variables contain missing values, which must be handled before we proceed with modeling. We detect any data points listed as null and replace them with the median value of that variable, using sklearn's SimpleImputer.

In [58]:
df_eu.isnull().sum()
Out[58]:
total_cases                          86
total_deaths                         86
stringency_index                   1906
population_density                  401
median_age                         1638
aged_65_older                      1638
gdp_per_capita                     1289
cardiovasc_death_rate              1500
diabetes_prevalence                 948
female_smokers                     1703
male_smokers                       1703
hospital_beds_per_thousand         1086
life_expectancy                     402
iso_code_AND                          0
iso_code_AUT                          0
iso_code_BEL                          0
iso_code_BGR                          0
iso_code_BIH                          0
iso_code_BLR                          0
iso_code_CHE                          0
iso_code_CYP                          0
iso_code_CZE                          0
iso_code_DEU                          0
iso_code_DNK                          0
iso_code_ESP                          0
iso_code_EST                          0
iso_code_FIN                          0
iso_code_FRA                          0
iso_code_FRO                          0
iso_code_GBR                          0
iso_code_GGY                          0
iso_code_GIB                          0
iso_code_GRC                          0
iso_code_HRV                          0
iso_code_HUN                          0
iso_code_IMN                          0
iso_code_IRL                          0
iso_code_ISL                          0
iso_code_ITA                          0
iso_code_JEY                          0
iso_code_LIE                          0
iso_code_LTU                          0
iso_code_LUX                          0
iso_code_LVA                          0
iso_code_MCO                          0
iso_code_MDA                          0
iso_code_MKD                          0
iso_code_MLT                          0
iso_code_MNE                          0
iso_code_NLD                          0
iso_code_NOR                          0
iso_code_OWID_KOS                     0
iso_code_POL                          0
iso_code_PRT                          0
iso_code_ROU                          0
iso_code_RUS                          0
iso_code_SMR                          0
iso_code_SRB                          0
iso_code_SVK                          0
iso_code_SVN                          0
iso_code_SWE                          0
iso_code_UKR                          0
iso_code_VAT                          0
location_Andorra                      0
location_Austria                      0
location_Belarus                      0
location_Belgium                      0
location_Bosnia and Herzegovina       0
location_Bulgaria                     0
location_Croatia                      0
location_Cyprus                       0
location_Czech Republic               0
location_Denmark                      0
location_Estonia                      0
location_Faeroe Islands               0
location_Finland                      0
location_France                       0
location_Germany                      0
location_Gibraltar                    0
location_Greece                       0
location_Guernsey                     0
location_Hungary                      0
location_Iceland                      0
location_Ireland                      0
location_Isle of Man                  0
location_Italy                        0
location_Jersey                       0
location_Kosovo                       0
location_Latvia                       0
location_Liechtenstein                0
location_Lithuania                    0
location_Luxembourg                   0
location_Macedonia                    0
location_Malta                        0
location_Moldova                      0
location_Monaco                       0
location_Montenegro                   0
location_Netherlands                  0
location_Norway                       0
location_Poland                       0
location_Portugal                     0
location_Romania                      0
location_Russia                       0
location_San Marino                   0
location_Serbia                       0
location_Slovakia                     0
location_Slovenia                     0
location_Spain                        0
location_Sweden                       0
location_Switzerland                  0
location_Ukraine                      0
location_United Kingdom               0
location_Vatican                      0
dtype: int64
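Raw counts are easier to judge as a share of the data. As a minimal sketch (using the same df_eu as above), the missingness can be expressed as a percentage of rows per column:

# Missingness per column as a percentage of rows, largest first
missing_pct = (df_eu.isnull().mean() * 100).round(1)
print(missing_pct[missing_pct > 0].sort_values(ascending=False))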
In [59]:
# Impute missing values using SimpleImputer from sklearn.impute
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputer.fit(df_eu)
df_eu = pd.DataFrame(data=imputer.transform(df_eu), columns=df_eu.columns)
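As a quick sanity check (a one-line sketch on the imputed df_eu), we can confirm that no missing values remain:

# After median imputation there should be no NaNs left anywhere
print("Remaining missing values:", df_eu.isnull().sum().sum())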
In [60]:
# Assign X as a DataFrame of features and y as a Series of the outcome variable
X = df_eu.drop('total_deaths', axis=1)
y = df_eu.total_deaths

Linear Regression

Firstly, we perform a linear regression on our EU data frame to understand the impact on total deaths of GDP, demographic factors (median age, share aged 65 and older, smoking prevalence, life expectancy), location, and total cases.

In [61]:
# Feature Scaling and linear regression
scaler = StandardScaler() 
reg = LinearRegression()
steps = [('scaling', scaler), ('regression', reg)] 
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42) 

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test) 
pipeline.score(X_test, y_test)
Out[61]:
0.7083510471733059
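R² alone does not convey the size of the prediction errors. As a complementary sketch (reusing y_test and y_pred from the cell above, plus sklearn.metrics), we can report the errors in units of deaths:

from sklearn.metrics import mean_squared_error, mean_absolute_error

# RMSE penalises large errors more heavily; MAE is the typical absolute miss
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
print("RMSE: %.1f deaths, MAE: %.1f deaths" % (rmse, mae))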
In [62]:
print("The intercept term of the linear model:", reg.intercept_)
cdf = pd.DataFrame(data=reg.coef_, index=X_train.columns, columns=["Coefficients"])
cdf
The intercept term of the linear model: 2243.888492529553
Out[62]:
Coefficients
total_cases 4.388124e+03
stringency_index 4.041565e+02
population_density -7.639171e+15
median_age 1.145649e+16
aged_65_older -9.617242e+15
gdp_per_capita 2.845609e+16
cardiovasc_death_rate 2.546675e+16
diabetes_prevalence 1.111645e+16
female_smokers 1.107000e+16
male_smokers 4.598792e+15
hospital_beds_per_thousand 2.914445e+16
life_expectancy -2.018291e+16
iso_code_AND 1.677666e+16
iso_code_AUT 1.008480e+16
iso_code_BEL -2.479658e+16
iso_code_BGR -1.401767e+16
iso_code_BIH -3.326751e+16
iso_code_BLR -2.110048e+16
iso_code_CHE -3.052661e+15
iso_code_CYP -1.225351e+16
iso_code_CZE -1.263972e+15
iso_code_DEU 7.723661e+15
iso_code_DNK 2.689768e+16
iso_code_ESP 2.775221e+15
iso_code_EST 4.449207e+15
iso_code_FIN 6.387662e+15
iso_code_FRA 5.909560e+16
iso_code_FRO 1.350117e+16
iso_code_GBR 1.100902e+16
iso_code_GGY -1.286488e+16
iso_code_GIB 1.040587e+16
iso_code_GRC -6.008871e+14
iso_code_HRV -9.216104e+15
iso_code_HUN -1.994166e+16
iso_code_IMN 1.659487e+16
iso_code_IRL -1.119291e+16
iso_code_ISL 6.475450e+15
iso_code_ITA -2.739907e+16
iso_code_JEY 3.690339e+15
iso_code_LIE 3.272328e+14
iso_code_LTU -1.701523e+16
iso_code_LUX -2.166666e+16
iso_code_LVA 5.074342e+15
iso_code_MCO 1.041756e+15
iso_code_MDA -1.326413e+16
iso_code_MKD -1.458366e+16
iso_code_MLT 5.813585e+15
iso_code_MNE -2.075621e+16
iso_code_NLD -2.445927e+16
iso_code_NOR -6.922632e+15
iso_code_OWID_KOS -2.415548e+16
iso_code_POL -2.412902e+16
iso_code_PRT -2.519373e+16
iso_code_ROU -2.036722e+15
iso_code_RUS -2.596940e+16
iso_code_SMR 3.837194e+16
iso_code_SRB -1.672350e+16
iso_code_SVK 1.518220e+15
iso_code_SVN -1.782923e+16
iso_code_SWE 3.207623e+15
iso_code_UKR -9.536771e+14
iso_code_VAT -2.010495e+15
location_Andorra -1.694448e+16
location_Austria -2.107024e+16
location_Belarus -7.335600e+14
location_Belgium 2.307448e+16
location_Bosnia and Herzegovina 2.486882e+16
location_Bulgaria -1.359825e+15
location_Croatia 1.383471e+14
location_Cyprus 1.179733e+16
location_Czech Republic -1.031157e+16
location_Denmark -2.350112e+16
location_Estonia -8.848710e+15
location_Faeroe Islands -1.403089e+16
location_Finland -2.233834e+15
location_France -6.015327e+16
location_Germany -2.255030e+16
location_Gibraltar -1.034263e+16
location_Greece -1.232976e+15
location_Guernsey 1.283442e+16
location_Hungary 7.394707e+15
location_Iceland 1.260433e+15
location_Ireland 9.888678e+15
location_Isle of Man -1.655653e+16
location_Italy 3.329812e+16
location_Jersey -3.721587e+15
location_Kosovo 2.901669e+16
location_Latvia -1.650324e+16
location_Liechtenstein 2.428077e+15
location_Lithuania 3.960924e+15
location_Luxembourg 9.274371e+15
location_Macedonia 5.159152e+15
location_Malta -7.060992e+15
location_Moldova 5.215677e+15
location_Monaco -5.972042e+15
location_Montenegro 7.535881e+15
location_Netherlands 2.518425e+16
location_Norway 5.814110e+15
location_Poland 1.731636e+16
location_Portugal 2.693640e+16
location_Romania -1.710978e+16
location_Russia 2.001379e+15
location_San Marino -3.982231e+16
location_Serbia -4.298116e+14
location_Slovakia -1.209923e+16
location_Slovenia 1.704641e+16
location_Spain -4.683184e+14
location_Sweden 3.958107e+15
location_Switzerland 5.861035e+14
location_Ukraine -1.873240e+16
location_United Kingdom -3.976763e+15
location_Vatican -2.218792e+15

From the model we can see that the R² score on the test set is 70.8%, and we can also extract the intercept and the coefficients. Note, however, that many coefficients are of order 1e15 to 1e16: the iso_code and location dummies encode the same country twice, so the design matrix is perfectly collinear and the individual coefficients are unstable and not interpretable. We also need to check whether this model is valid by plotting the residuals to check for normality, and the residuals versus the predicted values to check for homoscedasticity.

In [63]:
plt.figure(figsize=(10,7))
plt.title("Histogram of residuals to check for normality",fontsize=18)
plt.xlabel("Residuals",fontsize=15)
plt.ylabel("Kernel density", fontsize=15)
sns.distplot(y_test - np.round(y_pred), color='purple')
Out[63]:
<AxesSubplot:title={'center':'Histogram of residuals to check for normality'}, xlabel='Residuals', ylabel='Kernel density'>

Although the residual distribution might look vaguely bell-shaped, it is far too narrow and heavy-tailed: the residuals are not normally distributed.
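The visual impression can be confirmed numerically. As a sketch (using the scipy.stats import from earlier), the D'Agostino-Pearson test rejects normality when its p-value is small:

# Null hypothesis: the residuals come from a normal distribution
residuals = y_test - y_pred
stat, p_value = stats.normaltest(residuals)
print("Statistic: %.2f, p-value: %.4g" % (stat, p_value))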

In [64]:
plt.figure(figsize=(10,7))
plt.title("Residuals vs. predicted values plot (Homoscedasticity)\n",fontsize=18)
plt.xlabel("Predicted total deaths",fontsize=15)
plt.ylabel("Residuals", fontsize=15)
plt.scatter(x=np.round(y_pred),y=y_test-np.round(y_pred),color='red')
Out[64]:
<matplotlib.collections.PathCollection at 0x205aa584ba8>

The residuals-versus-predicted plot shows clear structure rather than a random, constant-variance band, so the homoscedasticity assumption is also violated; this data set is not well suited to a linear regression model.
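A formal check is possible too. One option (a sketch, assuming statsmodels is installed as in the earlier imports) is the Breusch-Pagan test, where a small p-value indicates heteroscedastic residuals:

import statsmodels.api as smapi
from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan regresses the squared residuals on the predictors;
# a small p-value means the residual variance is not constant
exog = smapi.add_constant(X_test)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(y_test - y_pred, exog)
print("Breusch-Pagan LM p-value: %.4g" % lm_pvalue)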

Decision Tree Model

The linear regression model gives us an R² score of 70.8%, but the relationship between COVID-19 deaths and demographic factors is unlikely to be captured by a purely linear form, so a linear model alone cannot give an accurate view of these relations. Let's therefore model with a decision tree.

In [65]:
def train_score_regressor(sklearn_regressor, X_train, y_train, X_test, y_test, model_parameters, print_oob_score=False):
    """A helper function that:
        - Trains a regressor on the training data
        - Scores the model on both the training and test data
        - Returns the trained model
    """
    # Step 1: Initialize the sklearn regressor with the supplied parameters
    regressor = sklearn_regressor(**model_parameters)

    # Step 2: Train the regressor on X_train and the associated target y_train
    regressor.fit(X_train, y_train)
    y_pred1 = regressor.predict(X_train)
    y_pred2 = regressor.predict(X_test)

    # Step 3: Report the R^2 score on the training and test sets
    print('Train score: %.3f' % r2_score(y_train, np.round(y_pred1)))
    print('Test score: %.3f' % r2_score(y_test, np.round(y_pred2)))

    print(regressor)
    return regressor
In [66]:
trained_regressor = train_score_regressor(sklearn_regressor=DecisionTreeRegressor,
                                          X_train=X_train, 
                                          y_train=y_train, 
                                          X_test=X_test, 
                                          y_test=y_test, 
                                          model_parameters={'random_state':100})
Train score: 1.000
Test score: 1.000
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=100, splitter='best')

This clearly shows that the decision tree is overfitting to the data. A perfect score on both the train and test sets is a red flag: with no depth limit the tree simply memorises the data, and since consecutive daily rows for the same country are nearly identical, the test split likely contains near-duplicates of training rows. Let's first sanity-check this with cross-validation (see the sketch below), then get the best parameters with GridSearchCV and conduct hyperparameter tuning.
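As a quick check (a minimal sketch reusing cross_val_score from the earlier imports and the unconstrained tree above), cross-validation on the training data shows how the model behaves across folds:

# Score the unconstrained tree across 5 folds of the training data
cv_scores = cross_val_score(DecisionTreeRegressor(random_state=100),
                            X_train, y_train, cv=5, scoring='r2')
print("Fold R2 scores:", np.round(cv_scores, 3))
print("Mean: %.3f, std: %.3f" % (cv_scores.mean(), cv_scores.std()))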

In [67]:
# Setting parameters to search through
parameters = {"max_depth":[3,4,5],
              "max_leaf_nodes":[2,3,4]}
decision_regressor= DecisionTreeRegressor(random_state=100)

# Initialize GridSearch and then fit
regressor=GridSearchCV(decision_regressor,parameters)
regressor.fit(X_train, y_train)
print(regressor)
GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=DecisionTreeRegressor(criterion='mse', max_depth=None,
                                             max_features=None,
                                             max_leaf_nodes=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             presort=False, random_state=100,
                                             splitter='best'),
             iid='warn', n_jobs=None,
             param_grid={'max_depth': [3, 4, 5], 'max_leaf_nodes': [2, 3, 4]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
In [68]:
regressor.best_estimator_.get_params()
Out[68]:
{'criterion': 'mse',
 'max_depth': 3,
 'max_features': None,
 'max_leaf_nodes': 4,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'presort': False,
 'random_state': 100,
 'splitter': 'best'}
In [69]:
# evaluating the tuned model
trained_regressor = train_score_regressor(sklearn_regressor=DecisionTreeRegressor,
                                          X_train=X_train, 
                                          y_train=y_train, 
                                          X_test=X_test, 
                                          y_test=y_test, 
                                          model_parameters=regressor.best_estimator_.get_params())
Train score: 0.924
Test score: 0.918
DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None,
                      max_leaf_nodes=4, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=100, splitter='best')

We can observe from the results above that, once tuned with GridSearchCV to find the best parameters, the decision tree generalises far better when predicting total deaths from GDP and age-related variables across European locations. The train score is 92.4% and the test score is 91.8%; the two scores are close, which indicates the model is no longer overfitting and predicts total deaths reliably.
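To see which variables the tuned tree actually relies on, a small sketch (assuming trained_regressor is the tuned model returned above) ranks its impurity-based feature importances:

# Rank predictors by the tuned tree's impurity-based importance
importances = pd.Series(trained_regressor.feature_importances_,
                        index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))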

Conclusion

We first tried a linear regression model based on location and demographic details to predict total deaths and obtained an R² of 70.8%, but the residual diagnostics painted a different picture of the data set's suitability for linear regression. Our decision tree model showed signs of overfitting at first, but once tuned it achieved a better score of 91.8% on the test set. This is very helpful for taking proactive measures at locations that might become hotspots, based on the prior trends we have captured. Moving beyond the data set we worked with, continuous data input would greatly improve the accuracy of our models over time. The data sets we are working with are volatile and subject to change as spikes in infections occur. With more data, the models would predict future outcomes more reliably and help us identify future hotspots, whether by age, GDP or other variables.