End-to-End Phishing URL Detection — Modular, Production-Ready ML Pipeline & Deployment System

This project implements a modular, end-to-end classical ML system for phishing URL detection. It covers every stage from raw CSV ingestion to a deployed inference API, and it is structured the way a production ML codebase would be.

The design emphasizes:

Component-based pipeline structure
Clear separation between training and inference
Clean artifact + config management
Schema-driven data validation
Production-style logging & exception handling
FastAPI for serving
Docker for deployment

The goal is a codebase that follows common ML engineering practice: modular, testable, maintainable, and ready to deploy.

Problem Statement

The system classifies URLs as phishing or legitimate using a classical ML model.

Raw labeled URLs pass through a multi-stage training pipeline, and the deployed FastAPI inference service returns:

predicted class
confidence score

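Training and inference apply the same preprocessing, so the features the model sees at serving time match the ones it was trained on (training-serving parity).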
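
One common way to get this parity is to route both the training pipeline and the API through a single feature function. The sketch below is illustrative only: the name `extract_features` and the specific features (URL length, dot count, IP-address host, HTTPS flag) are assumptions made for the example, not the project's actual feature set.

```python
import ipaddress
from urllib.parse import urlparse


def extract_features(url: str) -> dict:
    """Hypothetical shared feature function, imported by both training and serving code."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    try:
        # ip_address raises ValueError when the host is a name rather than an IP.
        ipaddress.ip_address(host)
        host_is_ip = 1
    except ValueError:
        host_is_ip = 0
    return {
        "url_length": len(url),
        "dot_count": host.count("."),
        "host_is_ip": host_is_ip,
        "uses_https": int(parsed.scheme == "https"),
    }


print(extract_features("http://192.168.0.1/login"))
```

Because training and serving would call the same function, a change to any feature takes effect on both sides at once.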

Model Summary

Final Model: Random Forest Classifier

Test Metrics:

F1 Score: 0.9741
Precision: 0.9737
Recall: 0.9745

All metrics and artifacts are tracked using MLflow.
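
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the three reported values can be cross-checked against each other. The helper name below is illustrative and not taken from the project.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the Test Metrics above.
f1 = f1_from_precision_recall(0.9737, 0.9745)
print(round(f1, 4))  # 0.9741
```

The result agrees with the reported F1 score to four decimal places.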

API Inputs

The deployed API accepts:

URL string

The service returns:

predicted label (phishing / legitimate)
confidence score

A simple web UI (Jinja templates) supports single URL input and batch CSV prediction.
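
The single and batch prediction paths can share one response-building function. The following is a minimal sketch under stated assumptions: the column name `url`, the 0.5 decision threshold, the definition of confidence as the probability of the predicted class, and the `dummy_score` placeholder are all hypothetical and stand in for the trained model.

```python
import csv
import io
from typing import Callable, Dict, List


def predict_one(url: str, score_fn: Callable[[str], float]) -> Dict[str, object]:
    """Build the response for one URL.

    score_fn stands in for the trained model and returns an assumed
    probability that the URL is phishing.
    """
    p_phishing = score_fn(url)
    label = "phishing" if p_phishing >= 0.5 else "legitimate"
    # Assumption: confidence is the probability of the predicted class.
    confidence = p_phishing if label == "phishing" else 1.0 - p_phishing
    return {"url": url, "label": label, "confidence": round(confidence, 4)}


def predict_csv(
    csv_text: str, score_fn: Callable[[str], float], url_column: str = "url"
) -> List[Dict[str, object]]:
    """Score every row of a CSV that has a (hypothetical) `url` column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [predict_one(row[url_column], score_fn) for row in reader]


def dummy_score(url: str) -> float:
    """Placeholder scorer used only to make the sketch runnable."""
    return 0.9 if "login" in url else 0.1


rows = predict_csv("url\nhttp://example.com/login\nhttps://example.org\n", dummy_score)
for row in rows:
    print(row)
```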

Architecture Overview

Training Pipeline (main.py)
↓
Saved Artifacts (model, transformer, metrics)
↓
Inference Pipeline (loads artifacts only)
↓
FastAPI Server (runtime prediction)
↓
Docker Deployment
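
The hand-off between the training and inference stages in the diagram above can be sketched as a save/load pair. This is a toy illustration, not the project's code: the standard-library `pickle` module stands in for dill (which offers the same `dump`/`load` calls), the file names `transformer.pkl` and `model.pkl` are hypothetical, and plain dictionaries replace the real fitted transformer and model.

```python
import pickle  # stand-in for dill, which exposes the same dump/load calls
import tempfile
from pathlib import Path


def save_artifact(obj, path: Path) -> None:
    """Training side: serialize a fitted object to disk."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_artifact(path: Path):
    """Inference side: read a previously saved object; no training code runs."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Training run: persist toy state in place of a real transformer and model.
artifact_dir = Path(tempfile.mkdtemp()) / "Artifacts"
save_artifact({"mean_url_length": 40.0}, artifact_dir / "transformer.pkl")
save_artifact({"threshold": 10.0}, artifact_dir / "model.pkl")

# Inference process: loads artifacts only.
transformer = load_artifact(artifact_dir / "transformer.pkl")
model = load_artifact(artifact_dir / "model.pkl")


def predict(url: str) -> str:
    """Toy decision rule: apply the saved transform, then the saved threshold."""
    centered = len(url) - transformer["mean_url_length"]
    return "phishing" if centered > model["threshold"] else "legitimate"


print(predict("http://a.example"))
```

Keeping the inference side limited to `load_artifact` and `predict` is what allows the serving image to omit training dependencies.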

Key Engineering Decisions

Strict separation of training and inference
Component-based modular pipeline
Schema-based validation (YAML)
Logged experiments via MLflow
Production-grade logging + error handling
Lightweight Dockerized inference environment
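
Schema-based validation, as listed above, amounts to comparing incoming rows against a declared set of columns and types. The sketch below assumes a schema shaped like the dictionary a YAML loader would return for schema.yaml; the `columns` key, the column names, and the function name `validate_rows` are hypothetical.

```python
from typing import Dict, List

# Hypothetical schema, shaped like the result of loading a YAML schema file.
SCHEMA = {
    "columns": {"url_length": "int", "dot_count": "int", "label": "int"},
}

_CASTERS = {"int": int, "float": float, "str": str}


def validate_rows(rows: List[Dict[str, str]], schema: dict) -> List[str]:
    """Return a list of problems; an empty list means the data passed."""
    errors = []
    expected = schema["columns"]
    for i, row in enumerate(rows):
        missing = set(expected) - set(row)
        extra = set(row) - set(expected)
        if missing:
            errors.append(f"row {i}: missing columns {sorted(missing)}")
        if extra:
            errors.append(f"row {i}: unexpected columns {sorted(extra)}")
        for name, type_name in expected.items():
            if name in row:
                try:
                    # A failed cast means the value does not match the declared type.
                    _CASTERS[type_name](row[name])
                except (TypeError, ValueError):
                    errors.append(f"row {i}: column {name!r} is not a valid {type_name}")
    return errors


good = [{"url_length": "24", "dot_count": "3", "label": "1"}]
bad = [{"url_length": "abc", "dot_count": "2"}]
print(validate_rows(good, SCHEMA))
print(validate_rows(bad, SCHEMA))
```

Returning a list of problems instead of raising on the first one lets a validation stage report every issue in a single run.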

Training Pipeline Components

Data Ingestion: Reads raw CSV, splits into train/test.
Data Validation: Validates structure using schema.yaml.
Data Transformation: Feature engineering + preprocessing for classical ML.
Model Training: Trains multiple classical ML models via GridSearchCV.
Model Evaluation: Computes F1/Precision/Recall, logs results to MLflow.
Model Pushing: Saves model + transformer artifacts using dill.

Artifacts are stored in a versioned structure inside Artifacts/.
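
The component structure above can be pictured as a chain of stages, each reading what earlier stages produced and adding its own output. The sketch below is a simplified illustration: the stage functions, the shared context dictionary, and the timestamp-based run directory are assumptions, since this README does not specify how the versioned structure inside Artifacts/ is named.

```python
from datetime import datetime
from pathlib import Path
from typing import Any, Callable, Dict, List

Stage = Callable[[Dict[str, Any]], Dict[str, Any]]


def make_run_dir(root: Path, now: datetime) -> Path:
    """Assumed versioning scheme: one timestamped sub-directory per training run."""
    return root / now.strftime("%Y%m%d_%H%M%S")


def ingest(ctx: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for reading the raw CSV: (url, label) pairs.
    return {"rows": [("http://a.example", 0), ("http://b.example/login", 1)]}


def validate(ctx: Dict[str, Any]) -> Dict[str, Any]:
    if not ctx["rows"]:
        raise ValueError("no rows ingested")
    return {"validated": True}


def transform(ctx: Dict[str, Any]) -> Dict[str, Any]:
    # Toy feature: URL length only.
    return {"features": [[len(url)] for url, _ in ctx["rows"]]}


def run_pipeline(stages: List[Stage], run_dir: Path) -> Dict[str, Any]:
    """Run stages in order; each stage sees the outputs of all earlier stages."""
    ctx: Dict[str, Any] = {"run_dir": run_dir, "completed": []}
    for stage in stages:
        ctx.update(stage(ctx))
        ctx["completed"].append(stage.__name__)
    return ctx


run_dir = make_run_dir(Path("Artifacts"), datetime(2024, 1, 2, 3, 4, 5))
result = run_pipeline([ingest, validate, transform], run_dir)
print(result["completed"], result["run_dir"])
```

Because each stage is a plain function of the context, any stage can be unit-tested by handing it a hand-built dictionary.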

Deployment

A FastAPI app (app.py) serves real-time predictions.

Run Locally

python app.py

Open:

http://localhost:8000

Supports:

UI-based predictions
JSON API predictions
CSV batch predictions

Docker Deployment

Build Image

docker build -t phishing-detector .

Run Container

docker run -p 8000:8000 \
  -e MONGO_DB_URL="$MONGO_DB_URL" \
  -v /path/to/artifacts:/app/Artifacts \
  phishing-detector

Open:

http://localhost:8000
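
The `MONGO_DB_URL` variable passed to the container above is typically read once at startup. How app.py actually uses it is not shown in this README, so the following is only a sketch of treating MongoDB logging as optional; the function names and the `"skipped"`/`"logged"` return values are hypothetical, and the real database write is left as a comment.

```python
import os
from typing import Mapping, Optional


def get_mongo_url(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the configured MongoDB URL, or None when the variable is unset or blank."""
    url = env.get("MONGO_DB_URL", "").strip()
    return url or None


def log_prediction(record: dict, env: Mapping[str, str] = os.environ) -> str:
    """Report whether a prediction record would be logged under this configuration."""
    if get_mongo_url(env) is None:
        return "skipped"
    # A real implementation would write `record` to MongoDB at this point.
    return "logged"


print(log_prediction({"url": "http://example.com", "label": "legitimate"}, env={}))
```

With this shape, leaving the `-e MONGO_DB_URL=...` flag off the `docker run` command would disable logging without affecting predictions.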

EC2 Deployment Example

SSH into instance

ssh -i <key.pem> ubuntu@<public-ip>

Run the Docker container

docker run -d -p 8000:8000 \
  -e MONGO_DB_URL="mongodb://..." \
  -v /home/ubuntu/artifacts:/app/Artifacts \
  phishing-detector

Visit:

http://<public-ip>:8000

Tech Stack

Core

Python
Scikit-Learn
Pandas / NumPy

Pipeline

Modular component-based architecture
YAML schema validation
Dill-based artifact serialization

Experiment Tracking

MLflow

Serving

FastAPI
Uvicorn
Jinja2 templates

Deployment

Docker
Optional MongoDB logging

License

MIT
