Ordinary Least Squares regression with comprehensive diagnostics, robust alternatives, and best practices.
This Jupyter notebook provides an in-depth exploration of Ordinary Least Squares (OLS) regression, covering theoretical foundations, practical implementation, model diagnostics, and advanced techniques. The notebook serves as both an educational resource and a practical toolkit for regression analysis.
- Mathematical formulation of OLS with proper LaTeX formatting
- Gauss-Markov theorem and assumptions
- Geometric interpretation as orthogonal projection
- Statistical inference (t-tests, F-tests, confidence intervals)
- Residual Analysis: Six diagnostic plots including:
  - Residuals vs Fitted
  - QQ plot for normality
  - Scale-Location plot
  - Residuals vs Order
  - ACF plot for autocorrelation
  - Residual distribution histogram
- Statistical Tests:
  - Durbin-Watson for autocorrelation
  - Jarque-Bera for normality
  - Breusch-Pagan for heteroscedasticity
  - Rainbow test for linearity
- Non-linear relationships linear in parameters
- Categorical variables with dummy encoding
- Multicollinearity analysis:
  - Variance Inflation Factors (VIF)
  - Condition number calculation
  - Correlation matrix visualization
- Model comparison and selection:
  - Information criteria (AIC, BIC)
  - Likelihood ratio tests
  - Nested model comparison
- Cook's Distance calculation and visualization
- DFFITS and DFBETAS analysis
- Leverage points identification
- Comprehensive influence measures
- Comparison with Huber regression
- Comparison with Bisquare (Tukey) regression
- Performance evaluation with outliers
- Scale parameter comparison
- Multiple approaches to OLS calculation
- Numerical stability considerations
- Cross-validation techniques
- Model validation best practices
- VIF Calculation: Automatic detection of multicollinearity
- Influence Metrics: Cook's D, DFFITS, DFBETAS with threshold guidelines
- Residual Diagnostics: Complete visualization suite
- Statistical Tests: Comprehensive assumption checking
- Huber Regression: M-estimation with Huber's T norm
- Bisquare Regression: Tukey's biweight estimation
- Performance Comparison: Side-by-side evaluation with OLS
- Model Selection: AIC, BIC, and cross-validation
- Assumption Checking: Complete diagnostic workflow
- Outlier Treatment: Identification and handling strategies
- Reporting Standards: Complete results interpretation
- numpy (1.21.0+)
- pandas (1.3.0+)
- matplotlib (3.4.0+)
- statsmodels (0.13.0+)
- scipy (1.7.0+)
- scikit-learn (0.24.0+)

pip install numpy pandas matplotlib statsmodels scipy scikit-learn

import statsmodels.api as sm
import numpy as np
# Generate sample data
nsample = 100
x = np.linspace(0, 10, nsample)
X = np.column_stack((x, x**2))
X = sm.add_constant(X)
beta = np.array([1, 0.1, 10])
y = np.dot(X, beta) + np.random.normal(size=nsample)
# Fit model
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

The notebook includes a comprehensive diagnostic section that automatically:
- Generates six diagnostic plots
- Performs statistical tests for assumptions
- Identifies influential observations
- Provides interpretation guidelines
Compare different model specifications using:
- Information criteria (AIC/BIC)
- Likelihood ratio tests
- Cross-validation performance
- Introduction to OLS
  - Theoretical foundations
  - Mathematical formulation
  - Assumptions and properties
- Basic OLS Implementation
  - Simple regression example
  - Results interpretation
  - Parameter extraction
- Comprehensive Diagnostics
  - Visual diagnostics (6 plots)
  - Statistical tests
  - Assumption validation
- Advanced Topics
  - Non-linear relationships
  - Categorical variables
  - Interaction terms
- Model Comparison
  - Multiple specifications
  - Information criteria
  - Hypothesis testing
- Outlier Analysis
  - Influence measures
  - Detection methods
  - Treatment strategies
- Robust Regression
  - Alternative estimators
  - Performance comparison
  - Use case recommendations
- Best Practices
  - Model selection guidelines
  - Diagnostic workflows
  - Reporting standards
- Complete diagnostic workflow
- Model comparison framework
- Robust regression alternatives
- Production-ready code patterns
- Theoretical foundations
- Statistical inference methods
- Assumption validation techniques
- Publication-quality visualizations
- Step-by-step explanations
- Practical examples
- Common pitfalls and solutions
- Best practice guidelines
- Diagnostic Plots:
  - Residual analysis (6 plots)
  - Multicollinearity diagnostics
  - Influence measures
  - Model comparison charts
- Regression Results:
  - Fitted vs actual values
  - Confidence intervals
  - Prediction intervals
  - Component contributions
- Statistical Tests:
  - QQ plots for normality
  - ACF plots for autocorrelation
  - Leverage vs residuals
  - Cook's Distance visualization
- Numerical Stability: Uses QR decomposition for better numerical stability
- Memory Efficiency: Optimized for large datasets
- Computational Speed: Efficient algorithms for common operations
- Scalability: Suitable for datasets up to millions of observations
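To make the numerical stability point concrete, here is a minimal NumPy sketch of solving the least squares problem through a QR decomposition, alongside the normal equations for comparison. It is not the notebook's implementation; the data and variable names are assumptions, and the design matrix mirrors the quadratic example used earlier in this README.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; the data are synthetic

# Hypothetical design matrix: intercept, x, and x**2.
n = 100
x = np.linspace(0, 10, n)
X = np.column_stack((np.ones(n), x, x**2))
beta_true = np.array([1.0, 0.1, 10.0])
y = X @ beta_true + rng.normal(size=n)

# QR route: X = QR with orthonormal Q and upper-triangular R,
# so the least squares solution satisfies R beta = Q^T y.
Q, R = np.linalg.qr(X)  # reduced QR: Q is n x 3, R is 3 x 3
beta_qr = np.linalg.solve(R, Q.T @ y)

# Normal equations route: solves (X^T X) beta = X^T y directly.
# Forming X^T X squares the condition number of the problem.
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's SVD-based least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("QR:              ", beta_qr)
print("Normal equations:", beta_ne)
print("lstsq:           ", beta_lstsq)
print(f"cond(X) = {np.linalg.cond(X):.3g}, cond(X^T X) = {np.linalg.cond(X.T @ X):.3g}")
```

On a small, well-scaled problem like this one the three routes agree closely; the advantage of QR shows up when the columns of the design matrix are nearly collinear, because it avoids forming X^T X.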
This notebook is designed as an educational resource. Contributions are welcome through pull requests, particularly:
- Additional diagnostic tools
- More regression techniques
- Performance optimizations
- Documentation improvements
This project is licensed under the MIT License - see the LICENSE file for details.
For questions, suggestions, or collaboration opportunities, please open an issue in the repository.
Note: This notebook is designed to be self-contained and educational. All code examples include explanations and interpretations to facilitate learning and practical application.