This project models and visualizes Covid-19 cases using county-level demographic and economic data.
- The Covid-19 data (from JHU-CSSEGIS)
- The GeoJSON data (from Plotly)
- The US county-level demographics and economy data (from US Census Bureau)
- Notebook or Notebook via nbviewer
- Visualized the spatial distribution of cumulative Covid-19 confirmed cases and deaths in the United States through 07-31-2020 via Plotly.
- Developed a data processing pipeline to process the Covid-19 data with Spark.
- Modeled the temporal evolution of Covid-19 confirmed cases via a logistic growth model.
- Interpreted the fitted model parameters and their implications for the current Covid-19 situation.
- Predicted future Covid-19 cases over time for Orange County, CA.
- Notebook
- Leveraged the county-level demographics and economy data from the US Census Bureau.
- Joined the two datasets on a common key, then cleaned the data and imputed missing values.
- Visualized the correlation between features.
- Performed feature selection via LASSO regression, reducing the feature set from 49 to 9 features.
- Developed an XGBoost model using the selected features to model log10(confirmed_Cases) as a function of county-level demographic and economic information.
- Fine-tuned the model hyperparameters via randomized search in Scikit-Learn and achieved an RMSE of 0.33 on the hold-out set.
- Extracted the feature importances to gain insight into how local demographics and economy shape the spread of Covid-19.
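
The logistic growth modeling of confirmed cases listed above can be illustrated with a minimal sketch. It assumes the common three-parameter form `K / (1 + exp(-r * (t - t0)))` fitted with `scipy.optimize.curve_fit`; the notebook's exact parameterization and fitting routine may differ. The function names, starting guesses, and the synthetic data are all hypothetical and used only for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit


def logistic(t, K, r, t0):
    """Logistic growth curve: plateau K, growth rate r, midpoint t0."""
    return K / (1.0 + np.exp(-r * (t - t0)))


def fit_logistic(days, cases):
    """Least-squares fit of (K, r, t0) to cumulative case counts."""
    # Rough starting guess (an assumption): plateau above the last
    # observation, modest growth rate, midpoint in the middle of the series.
    p0 = [cases.max() * 1.5, 0.1, days.mean()]
    params, _ = curve_fit(logistic, days, cases, p0=p0, maxfev=10000)
    return params


# Synthetic cumulative counts with 2% multiplicative noise.
# These are NOT real Covid-19 data.
rng = np.random.default_rng(0)
days = np.arange(120, dtype=float)
true_K, true_r, true_t0 = 50_000.0, 0.08, 70.0
noise = 1.0 + 0.02 * rng.standard_normal(days.size)
cases = logistic(days, true_K, true_r, true_t0) * noise

K, r, t0 = fit_logistic(days, cases)

# Extrapolate the fitted curve 30 days past the observed window.
future_days = np.arange(120, 150, dtype=float)
forecast = logistic(future_days, K, r, t0)
print(f"K={K:.0f}, r={r:.3f}, t0={t0:.1f}")
```

In this form, `K` is the projected final case count, `r` sets how fast the curve rises, and `t0` marks the day of peak daily growth, which is what makes the fitted parameters interpretable.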
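
The LASSO feature selection step can be sketched in the same spirit. This example assumes scikit-learn's `Lasso` on standardized features, keeping the features whose coefficients are non-zero; the `alpha` value, the synthetic feature matrix, and the variable names are hypothetical and are not taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the county feature table (NOT real Census data):
# 500 "counties", 49 features, of which only the first 3 drive the target.
rng = np.random.default_rng(42)
n_counties, n_features = 500, 49
X = rng.standard_normal((n_counties, n_features))
true_coef = np.zeros(n_features)
true_coef[:3] = [1.0, -0.8, 0.5]
log_cases = 2.5 + X @ true_coef + 0.1 * rng.standard_normal(n_counties)

# Standardize so the L1 penalty treats all features on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# The L1 penalty drives uninformative coefficients to exactly zero.
# alpha=0.1 is an illustrative choice, not a tuned value.
lasso = Lasso(alpha=0.1).fit(X_scaled, log_cases)
selected = np.flatnonzero(lasso.coef_)
print("selected feature indices:", selected)
```

The surviving indices would then be used to subset the columns passed to the downstream model. In practice `alpha` is usually chosen by cross-validation, for example with `LassoCV`.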