Approach
Below is the approach we took for this project:
- Data understanding and EDA
- Data type, data range constraints
- Looking for missing data
- Looking for inconsistencies in data
- Visualizing the above to gain better insight into the uniformity or inconsistencies of the data
- Understanding and comparing distributions through histograms, CDF, PMF and KDE plots
- Visualizing relationships between the variables
- Data Preparation
- Handling missing data with imputation or deletion (preferably imputation)
- Converting data types where needed
- Scaling and normalizing the data
- Pre-processing and feature engineering: encoding, and handling numerical and date variables
- Feature selection
- Removing redundant features
- Checking correlations via a correlation matrix and plot to decide which variables to select
- Dimensionality reduction with PCA, if necessary
- Machine Learning (Supervised)
- Supervised learning models to understand the effect of GDP and predict deaths
- Evaluation
- Splitting the data into train and test sets
- Scores, classification reports and confusion matrices to assess the accuracy of classification predictions
- Exploring this with cross-validation as well
- Tuning the hyperparameters, with grid search planned for this tuning
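
The steps above can be sketched end to end as follows. This is only a minimal illustration, not the project's actual code: the data is synthetic, the column names (`gdp`, `population`, `region`, `high_deaths`) are hypothetical, and the use of scikit-learn with logistic regression, median imputation and a small grid over `C` is an assumption made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "gdp": rng.normal(10.0, 2.0, n),
    "population": rng.normal(50.0, 10.0, n),
    "region": rng.choice(["A", "B", "C"], n),
})
# Hypothetical binary target derived from GDP plus noise.
score = df["gdp"] + rng.normal(0.0, 1.0, n)
df["high_deaths"] = (score < score.median()).astype(int)
# Blank out 20 GDP values to imitate missing data.
df.loc[rng.choice(n, 20, replace=False), "gdp"] = np.nan

# Data understanding: missing values per column and pairwise correlations.
missing_counts = df.isna().sum()
corr = df[["gdp", "population", "high_deaths"]].corr()

# Data preparation: impute and scale numeric columns, one-hot encode categoricals.
numeric_cols = ["gdp", "population"]
categorical_cols = ["region"]
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Evaluation: hold out a test set, then cross-validate on the training part.
X = df[numeric_cols + categorical_cols]
y = df["high_deaths"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Hyperparameter tuning with grid search over the regularization strength.
grid = GridSearchCV(model, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)

# Report classification quality on the held-out test set.
y_pred = grid.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(classification_report(y_test, y_pred))
print(cm)
```

Keeping imputation, scaling and encoding inside the `Pipeline` means they are refit on each training fold during cross-validation and grid search, which avoids leaking information from the validation folds into the preprocessing.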