This project delivers an end-to-end data science analysis of the key drivers behind the Human Development Index (HDI) across 180+ countries from 2000 to 2022.
Going beyond descriptive statistics, the study integrates economic theory, robust data engineering, and predictive modeling. It compares a traditional econometric approach (Linear Regression) with a machine learning method (Random Forest) to uncover non-linear dynamics in global development.
The result is a reproducible, research-grade workflow suitable for policy analysis, academic research, and applied data science portfolios.
| Data Quality & Imputation | Correlation Analysis |
|---|---|
| Group-wise imputation of missing time-series data. | Strong correlation (0.91) between Life Expectancy and HDI. |

| The Preston Curve | Structural Clustering (PCA) |
|---|---|
| Diminishing returns of GDP on Life Expectancy. | Distinct regional and income-based development clusters. |
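
As a rough illustration of the group-wise imputation mentioned in the first panel, the sketch below interpolates gaps separately within each country's time series. It uses base R and a toy data frame so that it stays self-contained; the column names (`country`, `year`, `life_exp`) and the helper `impute_series` are hypothetical, and the project's own scripts may use a different method.

```r
# Toy panel: two countries with yearly values and gaps (illustrative data only).
panel <- data.frame(
  country  = rep(c("A", "B"), each = 5),
  year     = rep(2000:2004, times = 2),
  life_exp = c(70, NA, 72, NA, 74, 60, 61, NA, NA, 64)
)

# Linearly interpolate missing values within a single country's series.
# rule = 2 carries the nearest observed value into leading/trailing gaps.
impute_series <- function(value, year) {
  observed <- !is.na(value)
  if (sum(observed) < 2) return(value)  # too few points to interpolate
  approx(x = year[observed], y = value[observed], xout = year, rule = 2)$y
}

# Sort, then impute per country so values never leak across countries.
panel <- panel[order(panel$country, panel$year), ]
panel$life_exp_imputed <- unlist(lapply(
  split(panel, panel$country),
  function(g) impute_series(g$life_exp, g$year)
), use.names = FALSE)

print(panel)
```

Grouping by country before interpolating matters because a global fill would blend values from countries at very different development levels.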
- Linear Regression (Baseline): RMSE = 0.056
- Random Forest (ML): RMSE = 0.026
- Performance Gain: about 54% reduction in RMSE relative to the linear baseline
➡️ This suggests that human development follows non-linear patterns that linear models capture poorly.
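
The performance gain can be checked from the two reported RMSE values. The snippet below is a small sketch of that arithmetic; the `rmse` helper is included only for reference and is not taken from the project's scripts.

```r
# RMSE values as reported in this README.
rmse_linear <- 0.056
rmse_forest <- 0.026

# Relative reduction in error: (baseline - new) / baseline.
relative_gain <- (rmse_linear - rmse_forest) / rmse_linear
gain_pct <- round(100 * relative_gain)
cat("RMSE reduction:", gain_pct, "%\n")

# Generic RMSE helper (hypothetical name), for reference.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
```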
Key Drivers of HDI:
- GDP per Capita (Primary Driver)
- Life Expectancy
- Health Expenditure
- Unemployment Rate (Critical bottleneck effect)
- Machine Learning Superiority: Random Forest roughly halves RMSE relative to linear regression (0.026 vs. 0.056), pointing to complex interactions among development indicators.
- Preston Curve Validated: Economic growth yields diminishing returns on health and human development after a threshold.
- Policy-Relevant Bottlenecks: Health spending and labor market conditions meaningfully constrain development outcomes beyond income alone.
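
The diminishing-returns pattern behind the Preston Curve is often approximated by regressing life expectancy on the logarithm of income. The sketch below shows that functional form on synthetic numbers that are constructed for illustration and are not World Bank values; the variable names and the log-linear specification are assumptions, not necessarily what the report fits.

```r
# Synthetic, illustrative data only (not real country values).
gdp <- c(500, 1000, 2000, 4000, 8000, 16000, 32000, 64000)
life_exp <- 40 + 4 * log(gdp)  # exact log relationship by construction

# Fit life expectancy against log income.
fit <- lm(life_exp ~ log(gdp))
slope <- unname(coef(fit)["log(gdp)"])

# Under this form, each doubling of GDP adds a constant slope * log(2) years,
# so an extra dollar buys less life expectancy at higher income levels.
gain_per_doubling <- slope * log(2)
marginal_at <- function(g) slope / g  # derivative of life_exp with respect to gdp

cat("Years gained per doubling of GDP:", round(gain_per_doubling, 2), "\n")
cat("Marginal gain at 1,000 vs 10,000:", marginal_at(1000), marginal_at(10000), "\n")
```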
📄 View Full Analysis Report - Download report.html and open in your browser for the complete interactive report with all visualizations and code.
- R (4.x)
- tidyverse, janitor, WDI (World Bank API), naniar (missing data diagnostics & imputation)
- ggplot2, GGally, corrplot, factoextra (PCA & clustering)
- tidymodels, randomForest, vip (model interpretability)
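
Before running the scripts, it may help to check which of the packages listed above are already available. The following is a minimal sketch; the package names come from the list above, and the install line is left commented out as an optional step.

```r
# Packages named in this README's stack list.
required <- c("tidyverse", "janitor", "WDI", "naniar",
              "ggplot2", "GGally", "corrplot", "factoextra",
              "tidymodels", "randomForest", "vip")

# requireNamespace() checks availability without attaching the package.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
} else {
  message("All listed packages are available.")
}
```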
- World Bank Open Data API
- UNDP Human Development Reports
- Clone the repository: `git clone https://github.com/your-username/global-development-analytics.git`
- Open `human_development_index_R`
- Run data collection: `scripts/01_data_collection.R`
- Generate the full report: `rmarkdown::render("report.Rmd")`
├── data/ # Raw and processed datasets
├── scripts/ # Modular R scripts (ETL, EDA, Modeling)
├── results/ # Figures and model outputs
├── report.Rmd # Reproducible analysis report
├── mega-hdi-analysis.Rproj
└── README.md # Project documentation
⭐ If you find this project useful, feel free to star the repository or reach out for collaboration.




