This project implements a modular, end-to-end classical ML system for phishing URL detection. It covers every stage from raw CSV ingestion to a deployed inference API, and its architecture is designed for production use.
The design emphasizes:
- Component-based pipeline structure
- Clear separation between training and inference
- Clean artifact + config management
- Schema-driven data validation
- Production-style logging & exception handling
- FastAPI for serving
- Docker for deployment
This mirrors real-world ML engineering standards: modular, testable, maintainable, and deployment-ready.
The system classifies URLs as phishing or legitimate using a classical ML model.
Raw labeled URLs pass through a multi-stage training pipeline, and the deployed FastAPI inference service returns:
- predicted class
- confidence score
Consistent preprocessing ensures training-serving parity.
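One common way to get training-serving parity is to route both pipelines through a single feature function. The sketch below illustrates the idea only. The feature names (`url_length`, `num_dots`, and so on) are hypothetical and are not taken from this project's actual transformer.

```python
import re
from urllib.parse import urlparse

# Matches hosts that are bare IPv4 addresses, e.g. "192.168.0.1".
IPV4_HOST = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def extract_features(url: str) -> dict:
    """Hypothetical lexical features; the project's real feature set may differ."""
    # urlparse only fills in the hostname when a scheme is present.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_dots": host.count("."),
        "has_ip_host": int(bool(IPV4_HOST.match(host))),
        "uses_https": int(parsed.scheme == "https"),
        "has_at_symbol": int("@" in url),
    }


# Training and inference both call the same function, so the model
# always sees features computed the same way.
train_row = extract_features("https://example.com/login")
serve_row = extract_features("https://example.com/login")
```

Keeping the function in one module that both pipelines import avoids the two code paths drifting apart over time.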
Final Model: Random Forest Classifier
Test Metrics:
- F1 Score: 0.9741
- Precision: 0.9737
- Recall: 0.9745
All metrics and artifacts are tracked using MLflow.
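For reference, the three metrics relate to each other as in the sketch below. It is a plain-Python illustration, not the project's evaluation code, and the confusion-matrix counts in the usage lines are made up. The last line checks that the reported F1 matches the harmonic mean of the reported precision and recall, assuming all three were computed for the same positive class.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = harmonic_f1(precision, recall)
    return precision, recall, f1


def harmonic_f1(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Illustrative counts only (not from this project's test set).
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=10)

# Consistency check on the figures reported above.
reported_f1 = round(harmonic_f1(0.9737, 0.9745), 4)
```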
The deployed API accepts:
- URL string
The service returns:
- predicted label (phishing/legitimate)
- confidence score
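As a rough illustration of how a label and confidence score can be derived from a classifier's phishing probability, consider the sketch below. The field names `label` and `confidence`, the helper name `format_prediction`, and the 0.5 threshold are assumptions made for this example. The service's actual response schema may differ.

```python
def format_prediction(phishing_probability: float, threshold: float = 0.5) -> dict:
    """Turn a phishing probability into a label plus a confidence score.

    Hypothetical helper: the confidence is the probability of whichever
    class was predicted.
    """
    if not 0.0 <= phishing_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    is_phishing = phishing_probability >= threshold
    confidence = phishing_probability if is_phishing else 1.0 - phishing_probability
    return {
        "label": "phishing" if is_phishing else "legitimate",
        "confidence": round(confidence, 4),
    }


example = format_prediction(0.93)
```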
A simple web UI (Jinja templates) supports single URL input and batch CSV prediction.
Training Pipeline (main.py)
↓
Saved Artifacts (model, transformer, metrics)
↓
Inference Pipeline (loads artifacts only)
↓
FastAPI Server (runtime prediction)
↓
Docker Deployment
- Strict separation of training and inference
- Component-based modular pipeline
- Schema-based validation (YAML)
- Logged experiments via MLflow
- Production-grade logging + error handling
- Lightweight Dockerized inference environment
- Data Ingestion: Reads raw CSV, splits into train/test.
- Data Validation: Validates structure using schema.yaml.
- Data Transformation: Feature engineering + preprocessing for classical ML.
- Model Training: Trains multiple classical ML models via GridSearchCV.
- Model Evaluation: Computes F1/Precision/Recall, logs results to MLflow.
- Model Pushing: Saves model + transformer artifacts using dill.
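The validation stage above can be pictured roughly as follows. The layout of schema.yaml is not shown in this README, so the `schema` dict here is an assumed shape standing in for the parsed file, and the column names are hypothetical.

```python
# Maps schema type names to Python callables used to test each value.
CASTERS = {"int": int, "float": float, "str": str}


def validate_columns(header: list[str], rows: list[dict], schema: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    expected = schema["columns"]
    errors = [f"missing column: {c}" for c in expected if c not in header]
    errors += [f"unexpected column: {c}" for c in header if c not in expected]

    # Only type-check columns that are both expected and present.
    present = [c for c in expected if c in header]
    for i, row in enumerate(rows):
        for col in present:
            type_name = expected[col]
            try:
                CASTERS[type_name](row[col])
            except (KeyError, TypeError, ValueError):
                errors.append(f"row {i}: column '{col}' is not a valid {type_name}")
    return errors


# Assumed schema shape and made-up rows, for illustration only.
schema = {"columns": {"url_length": "int", "label": "int"}}
rows = [
    {"url_length": "25", "label": "1"},
    {"url_length": "abc", "label": "0"},
]
problems = validate_columns(["url_length", "label"], rows, schema)
```

Collecting every problem rather than stopping at the first makes it easier to fix a bad input file in one pass.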
Artifacts are stored in a versioned structure inside Artifacts/.
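A minimal sketch of a versioned artifact layout is given below, assuming timestamped run directories and the file names `model.pkl` and `transformer.pkl`. Both the directory naming and the file names are assumptions, and placeholder dicts stand in for the fitted model and transformer. The code uses dill when it is installed and otherwise falls back to the standard-library pickle module, which exposes the same `dump`/`load` calls.

```python
import tempfile
from datetime import datetime
from pathlib import Path

try:
    import dill as serializer  # serializer named in this README
except ImportError:
    import pickle as serializer  # stdlib fallback with the same dump/load API


def save_artifacts(root: Path, model, transformer) -> Path:
    """Write both objects into a new timestamped directory under root."""
    run_dir = root / datetime.now().strftime("%Y%m%d_%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    for name, obj in (("model.pkl", model), ("transformer.pkl", transformer)):
        with open(run_dir / name, "wb") as fh:
            serializer.dump(obj, fh)
    return run_dir


def load_latest(root: Path):
    """Load the artifacts from the most recent run directory."""
    # Timestamped names sort chronologically, so max() picks the newest run.
    run_dir = max(p for p in root.iterdir() if p.is_dir())
    loaded = []
    for name in ("model.pkl", "transformer.pkl"):
        with open(run_dir / name, "rb") as fh:
            loaded.append(serializer.load(fh))
    return tuple(loaded)


# Placeholder objects stand in for a fitted estimator and transformer.
with tempfile.TemporaryDirectory() as tmp:
    artifacts_root = Path(tmp) / "Artifacts"
    saved_dir = save_artifacts(artifacts_root, {"kind": "model"}, {"kind": "transformer"})
    model, transformer = load_latest(artifacts_root)
```

Because inference only calls `load_latest`, it never needs the training code or the raw data, which is the separation the architecture above describes.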
A FastAPI app (app.py) serves real-time predictions.
python app.py
Open:
http://localhost:8000
Supports:
- UI-based predictions
- JSON API predictions
- CSV batch predictions
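A JSON prediction request could be issued from Python along the lines below. The `/predict` route and the `url` request field are assumptions for illustration, since this README does not list the actual routes. Check app.py for the real path and payload shape.

```python
import json
import urllib.request


def build_predict_request(base_url: str, url_to_check: str) -> urllib.request.Request:
    """Build a POST request carrying the URL to classify as JSON."""
    body = json.dumps({"url": url_to_check}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/predict",  # hypothetical route
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(base_url: str, url_to_check: str, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON response (requires a running server)."""
    request = build_predict_request(base_url, url_to_check)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not contact the server.
request = build_predict_request("http://localhost:8000/", "http://example.com")
```

With the server running locally, `predict("http://localhost:8000", "http://example.com")` would return the decoded response body.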
docker build -t phishing-detector .
docker run -p 8000:8000 \
-e MONGO_DB_URL="$MONGO_DB_URL" \
-v /path/to/artifacts:/app/Artifacts \
phishing-detector
Open:
http://localhost:8000
ssh -i <key.pem> ubuntu@<public-ip>
docker run -d -p 8000:8000 \
-e MONGO_DB_URL="mongodb://..." \
-v /home/ubuntu/artifacts:/app/Artifacts \
phishing-detector
Visit:
http://<public-ip>:8000
- Python
- Scikit-Learn
- Pandas / NumPy
- Modular component-based architecture
- YAML schema validation
- Dill-based artifact serialization
- MLflow
- FastAPI
- Uvicorn
- Jinja2 templates
- Docker
- Optional MongoDB logging
MIT