esosetrov/ordinary_least_squares

ordinary_least_squares

Ordinary Least Squares regression with comprehensive diagnostics, robust alternatives, and best practices.

Overview

This Jupyter notebook provides an in-depth exploration of Ordinary Least Squares (OLS) regression, covering theoretical foundations, practical implementation, model diagnostics, and advanced techniques. The notebook serves as both an educational resource and a practical toolkit for regression analysis.

Features

1. Theoretical Foundation

  • Mathematical formulation of OLS with proper LaTeX formatting
  • Gauss-Markov theorem and assumptions
  • Geometric interpretation as orthogonal projection
  • Statistical inference (t-tests, F-tests, confidence intervals)
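The orthogonal-projection view can be checked numerically. The following is a minimal sketch on synthetic data (the variable names and the data-generating coefficients are illustrative assumptions, not taken from the notebook): it builds the hat matrix H = X (XᵀX)⁻¹ Xᵀ and confirms that it is idempotent and that the residuals are orthogonal to every column of X.

```python
import numpy as np

# Synthetic data; the coefficients below are arbitrary illustration values.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# OLS coefficients from the normal equations (fine for a small, well-conditioned X).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat (projection) matrix: maps y onto the column space of X.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
resid = y - y_hat

print("H is idempotent:         ", np.allclose(H @ H, H))
print("residuals orthogonal to X:", np.allclose(X.T @ resid, 0.0, atol=1e-8))
print("trace(H) equals #columns: ", np.isclose(np.trace(H), X.shape[1]))
```

The trace of H equals the number of fitted parameters, which is why its diagonal entries are used as leverage values in the influence diagnostics.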

2. Comprehensive Model Diagnostics

  • Residual Analysis: Six diagnostic plots:

    • Residuals vs Fitted
    • QQ plot for normality
    • Scale-Location plot
    • Residuals vs Order
    • ACF plot for autocorrelation
    • Residual distribution histogram
  • Statistical Tests:

    • Durbin-Watson for autocorrelation
    • Jarque-Bera for normality
    • Breusch-Pagan for heteroscedasticity
    • Rainbow test for linearity

3. Advanced Regression Techniques

  • Non-linear relationships that remain linear in the parameters (for example, polynomial terms)

  • Categorical variables with dummy encoding

  • Multicollinearity analysis:

    • Variance Inflation Factors (VIF)
    • Condition number calculation
    • Correlation matrix visualization
  • Model comparison and selection:

    • Information criteria (AIC, BIC)
    • Likelihood ratio tests
    • Nested model comparison
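A model that combines a polynomial term with a dummy-encoded categorical variable can be written compactly with the statsmodels formula interface. This is only a sketch: the column names (`x`, `group`, `y`) and the group offsets are hypothetical, chosen for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: a numeric predictor and a three-level categorical variable.
rng = np.random.default_rng(2)
n = 300
df = pd.DataFrame({
    "x": rng.uniform(0, 10, size=n),
    "group": rng.choice(["a", "b", "c"], size=n),
})
offsets = {"a": 0.0, "b": 2.0, "c": -1.0}  # illustrative group effects
df["y"] = (
    1.0 + 0.5 * df["x"] + 0.2 * df["x"] ** 2
    + df["group"].map(offsets)
    + rng.normal(size=n)
)

# I(x**2) adds a quadratic term (still linear in the parameters);
# C(group) dummy-encodes the factor with the first level ("a") as baseline.
results = smf.ols("y ~ x + I(x**2) + C(group)", data=df).fit()
print(results.params)
```

Each `C(group)[T.…]` coefficient is the estimated shift relative to the baseline level, so only two dummy columns are needed for three groups.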

4. Outlier and Influence Diagnostics

  • Cook's Distance calculation and visualization
  • DFFITS and DFBETAS analysis
  • Leverage points identification
  • Comprehensive influence measures

5. Robust Regression Methods

  • Comparison with Huber regression
  • Comparison with Bisquare (Tukey) regression
  • Performance evaluation with outliers
  • Scale parameter comparison

6. Practical Implementation

  • Multiple approaches to OLS calculation
  • Numerical stability considerations
  • Cross-validation techniques
  • Model validation best practices
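Cross-validation for a linear model can be sketched with scikit-learn, which is already among the listed requirements. The example below compares a straight-line and a quadratic specification by 5-fold cross-validated RMSE; the data, the helper name `cv_rmse`, and the fold settings are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data with a genuinely quadratic relationship.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.1 * x + 0.5 * x**2 + rng.normal(size=n)

X_linear = x.reshape(-1, 1)
X_quadratic = np.column_stack([x, x**2])

# Shuffled folds with a fixed seed so both models see the same splits.
cv = KFold(n_splits=5, shuffle=True, random_state=0)


def cv_rmse(X, y):
    """Hypothetical helper: root of the mean cross-validated MSE."""
    scores = cross_val_score(
        LinearRegression(), X, y, cv=cv, scoring="neg_mean_squared_error"
    )
    return float(np.sqrt(-scores.mean()))


rmse_linear = cv_rmse(X_linear, y)
rmse_quadratic = cv_rmse(X_quadratic, y)
print(f"CV RMSE, linear:    {rmse_linear:.3f}")
print(f"CV RMSE, quadratic: {rmse_quadratic:.3f}")
```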

Key Highlights

Diagnostic Tools

  • VIF Calculation: Automatic detection of multicollinearity
  • Influence Metrics: Cook's D, DFFITS, DFBETAS with threshold guidelines
  • Residual Diagnostics: Complete visualization suite
  • Statistical Tests: Comprehensive assumption checking

Robust Alternatives

  • Huber Regression: M-estimation with Huber's T norm
  • Bisquare Regression: Tukey's biweight estimation
  • Performance Comparison: Side-by-side evaluation with OLS

Best Practices

  • Model Selection: AIC, BIC, and cross-validation
  • Assumption Checking: Complete diagnostic workflow
  • Outlier Treatment: Identification and handling strategies
  • Reporting Standards: Complete results interpretation

Requirements

Python Libraries

  • numpy (1.21.0+)
  • pandas (1.3.0+)
  • matplotlib (3.4.0+)
  • statsmodels (0.13.0+)
  • scipy (1.7.0+)
  • scikit-learn (0.24.0+)

Installation

pip install numpy pandas matplotlib statsmodels scipy scikit-learn

Usage

Basic OLS Estimation

import numpy as np
import statsmodels.api as sm

# Generate sample data: y = 1 + 0.1*x + 10*x^2 + noise
rng = np.random.default_rng(0)  # seeded so the example is reproducible
nsample = 100
x = np.linspace(0, 10, nsample)
X = np.column_stack((x, x**2))
X = sm.add_constant(X)  # prepend the intercept column
beta = np.array([1, 0.1, 10])
y = X @ beta + rng.normal(size=nsample)

# Fit the model and print the summary table
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

Running Diagnostics

The notebook includes a comprehensive diagnostic section that automatically:

  1. Generates six diagnostic plots
  2. Performs statistical tests for assumptions
  3. Identifies influential observations
  4. Provides interpretation guidelines

Model Comparison

Compare different model specifications using:

  • Information criteria (AIC/BIC)
  • Likelihood ratio tests
  • Cross-validation performance

Notebook Structure

  1. Introduction to OLS

    • Theoretical foundations
    • Mathematical formulation
    • Assumptions and properties
  2. Basic OLS Implementation

    • Simple regression example
    • Results interpretation
    • Parameter extraction
  3. Comprehensive Diagnostics

    • Visual diagnostics (6 plots)
    • Statistical tests
    • Assumption validation
  4. Advanced Topics

    • Non-linear relationships
    • Categorical variables
    • Interaction terms
  5. Model Comparison

    • Multiple specifications
    • Information criteria
    • Hypothesis testing
  6. Outlier Analysis

    • Influence measures
    • Detection methods
    • Treatment strategies
  7. Robust Regression

    • Alternative estimators
    • Performance comparison
    • Use case recommendations
  8. Best Practices

    • Model selection guidelines
    • Diagnostic workflows
    • Reporting standards

Key Features for Practitioners

For Data Scientists

  • Complete diagnostic workflow
  • Model comparison framework
  • Robust regression alternatives
  • Production-ready code patterns

For Researchers

  • Theoretical foundations
  • Statistical inference methods
  • Assumption validation techniques
  • Publication-quality visualizations

For Students

  • Step-by-step explanations
  • Practical examples
  • Common pitfalls and solutions
  • Best practice guidelines

Visualizations Included

  1. Diagnostic Plots:

    • Residual analysis (6 plots)
    • Multicollinearity diagnostics
    • Influence measures
    • Model comparison charts
  2. Regression Results:

    • Fitted vs actual values
    • Confidence intervals
    • Prediction intervals
    • Component contributions
  3. Statistical Tests:

    • QQ plots for normality
    • ACF plots for autocorrelation
    • Leverage vs residuals
    • Cook's Distance visualization

Performance Considerations

  • Numerical Stability: Uses QR decomposition for better numerical stability
  • Memory Efficiency: Optimized for large datasets
  • Computational Speed: Efficient algorithms for common operations
  • Scalability: Suitable for datasets up to millions of observations
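The stability argument for QR can be illustrated by solving the same least-squares problem three ways with NumPy. This is a sketch on assumed synthetic data, not the notebook's implementation: forming XᵀX squares the condition number, which is why QR (or `lstsq`) is preferred when the design matrix is poorly conditioned.

```python
import numpy as np

# Quadratic design matrix on [0, 10]; its columns differ widely in scale.
rng = np.random.default_rng(0)
n = 100
x = np.linspace(0, 10, n)
X = np.column_stack([np.ones(n), x, x**2])
y = X @ np.array([1.0, 0.1, 10.0]) + rng.normal(size=n)

# 1. Normal equations: solve (X'X) beta = X'y.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# 2. QR decomposition: X = QR, then solve R beta = Q'y.
Q, R = np.linalg.qr(X)  # reduced QR: Q is n x 3, R is 3 x 3
beta_qr = np.linalg.solve(R, Q.T @ y)

# 3. SVD-based least squares.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", beta_normal)
print("QR:              ", beta_qr)
print("lstsq:           ", beta_lstsq)
print("cond(X):   ", np.linalg.cond(X))
print("cond(X'X): ", np.linalg.cond(X.T @ X))
```

On this small problem all three agree to many digits; the gap between the two condition numbers indicates how quickly the normal equations would lose accuracy on a harder one.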

Contributing

This notebook is designed as an educational resource. Contributions are welcome through pull requests, in particular:

  • Additional diagnostic tools
  • More regression techniques
  • Performance optimizations
  • Documentation improvements

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For questions, suggestions, or collaboration opportunities, please open an issue in the repository.


Note: This notebook is designed to be self-contained and educational. All code examples include explanations and interpretations to facilitate learning and practical application.
