An end-to-end fraud intelligence system combining knowledge graphs, graph neural networks,
real-time streaming, and classical ML to detect coordinated fraud rings at scale.
🚀 Quick Start • 🏗️ Architecture • 📊 Datasets • 🧠 ML Pipeline • 📁 Project Structure • 🛣️ Roadmap
Most fraud detection systems look at transactions in isolation.
This system looks at the entire network — who shares a device, an IP, a card, an email — and finds coordinated rings that traditional ML completely misses.
Traditional approach: This system:
Transaction → Score User ─── Device ─── User
│ │
❌ Misses rings IP ──────────────── Card
❌ No graph context │
❌ Reactive Merchant
✅ Detects rings
✅ Graph-aware scoring
✅ Real-time streaming
╔══════════════════════════════════════════════════════════════════════╗
║ DATA SOURCES ║
║ PaySim (6.3M txns) · IEEE-CIS (590K txns) · Elliptic (GNN) ║
╚══════════════════════════╦═══════════════════════════════════════════╝
▼
╔══════════════════════════════════════════════════════════════════════╗
║ KAFKA STREAMING LAYER ║
║ transactions_raw · fraud_alerts · login_events ║
╚══════════════════════════╦═══════════════════════════════════════════╝
▼
╔══════════════════════════════════════════════════════════════════════╗
║ db.py ─ INGESTION ORCHESTRATOR ║
╚═════════════════╦════════════════════════╦════════════════════════════╝
▼ ▼
╔═════════════════════════╗ ╔═════════════════════════════════════════╗
║ POSTGRESQL ║ ║ NEO4J ║
║ ║ ║ ║
║ customers ║ ║ (:Customer)-[:MADE]->(:Transaction) ║
║ transactions ║ ║ (:Customer)-[:USES]->(:Device) ║
║ devices · cards · ips ║ ║ (:Customer)-[:OWNS]->(:Card) ║
║ transaction_features ║ ║ (:Customer)-[:LOGGED_FROM]->(:IP) ║
║ predictions ║ ║ (:IP)-[:LOCATED_IN]->(:Country) ║
╚═════════════════════════╝ ╚═════════════════════════════════════════╝
║ ║
╚═══════════╦════════════╝
▼
╔══════════════════════════════════════════════════════════════════════╗
║ FEATURE ENGINEERING (features.py) ║
║ tx_velocity · balance_drain · shared_device_count · pagerank ║
║ fraud_neighbor_ratio · betweenness · country_risk · night_ratio ║
╚══════════════════════════╦═══════════════════════════════════════════╝
▼
╔══════════════════════════════════════════════════════════════════════╗
║ ML PIPELINE (model.py) ║
║ ║
║ NetworkX Graph ──► Node2Vec ──► GraphSAGE/GAT ──► Embeddings ║
║ │ ║
║ Feature Matrix ─────────────────────────────────────► │ ║
║ ▼ ║
║ XGBoost · LightGBM ║
║ HDBSCAN · IsolationForest║
╚══════════════════════════╦═══════════════════════════════════════════╝
▼
╔══════════════════════════════════════════════════════════════════════╗
║ FRAUD PROBABILITY SCORE · 0.0 → 1.0 ║
╚══════╦══════════════╦══════════════════╦═══════════════╦═════════════╝
▼ ▼ ▼ ▼
FastAPI Kafka Alert Pyvis Graph MLflow
/score fraud_alerts Ring Visual Tracking
| Dataset | Purpose | Size | Source |
|---|---|---|---|
| PaySim | Core transactions, customer + merchant nodes | 6.3M rows | Kaggle |
| IEEE-CIS Transaction | Card data, email nodes, enriched features | 590K rows | Kaggle |
| IEEE-CIS Identity | Device, browser, OS nodes | 144K rows | Kaggle |
| Elliptic Bitcoin | GNN pre-training only | 203K nodes | Kaggle |
| DB-IP Geolocation | IP → Country enrichment | ~3.2M rows | db-ip.com |
data/
├── raw/
│ ├── paysim/ ← PS_20174392719_1491204439457_log.csv
│ ├── ieee/ ← train_transaction.csv · train_identity.csv
│ ├── elliptic/ ← elliptic_txs_features · classes · edgelist
│ └── ip/ ← dbip-country-lite.csv
└── processed/ ← pipeline writes cleaned DataFrames here
- Python 3.11
- Docker Desktop
- Neo4j Desktop
- Conda
git clone https://github.com/yourusername/fraud_detection.git
cd fraud_detection
conda create -n fraud_detection python=3.11.9 -y
conda activate fraud_detection
pip install --upgrade pip# PostgreSQL via Docker
docker run --name postgres-fraud \
-e POSTGRES_PASSWORD=password \
-e POSTGRES_DB=fraud_db \
-p 5432:5432 -d postgres
# Redis via Docker
docker run --name redis-fraud \
-p 6379:6379 -d redis
# Kafka via Docker
docker run -d --name kafka \
-p 9092:9092 apache/kafka:latestNeo4j — open Neo4j Desktop, create a new database, set password, click Start.
Bolt port will bebolt://localhost:7687
cp .env.example .env
# Fill in your passwords in .env# config.yaml is pre-filled with localhost defaults
# Only change if your ports differ# All packages
pip install neo4j==5.20.0 psycopg2-binary==2.9.9 redis==5.0.4 \
sqlalchemy==2.0.30 confluent-kafka==2.3.0 pandas==2.2.2 \
numpy==1.26.4 pyarrow==16.0.0 scipy==1.13.0 networkx==3.3 \
node2vec==0.5.0 scikit-learn==1.4.2 xgboost==2.0.3 \
lightgbm==4.3.0 imbalanced-learn==0.12.3 matplotlib==3.9.0 \
seaborn==0.13.2 pyvis==0.3.2 plotly==5.22.0 fastapi==0.111.0 \
uvicorn==0.29.0 pydantic==2.7.1 python-multipart==0.0.9 \
httpx==0.27.0 pyyaml==6.0.1 python-dotenv==1.0.1 tqdm==4.66.4 \
loguru==0.7.2 tenacity==8.3.0 mlflow==2.13.0 apscheduler==3.10.4 \
celery==5.4.0 prometheus-client==0.20.0 evidently==0.4.30 \
pytest==8.2.0 pytest-asyncio==0.23.7
# PyTorch + PyG (CPU)
pip install torch==2.3.0
pip install torch-geometric==2.5.3
pip install pyg_lib torch_scatter torch_sparse torch_cluster \
torch_spline_conv \
-f https://data.pyg.org/whl/torch-2.3.0+cpu.htmldocker exec kafka /opt/kafka/bin/kafka-topics.sh \
--create --topic transactions_raw \
--bootstrap-server localhost:9092
docker exec kafka /opt/kafka/bin/kafka-topics.sh \
--create --topic fraud_alerts \
--bootstrap-server localhost:9092# Create all tables and seed databases
python main.py --setup
# Run full pipeline
python main.py --pipeline
# Start API server
uvicorn src.main:app --reload
# Start real-time consumer (separate terminal)
python src/kafka_consumer.py
# Launch MLflow UI
mlflow ui # → localhost:5000Step 1 — Node2Vec
NetworkX graph → random walks → 128-dim node embeddings
Captures graph topology: who is connected to whom
Step 2 — GraphSAGE (PyTorch Geometric)
Node2Vec embeddings as input features
Pre-trained on Elliptic Bitcoin dataset
Fine-tuned on PaySim + IEEE graph
Output: refined fraud-aware embeddings
Step 3 — Classical ML
GNN embeddings + engineered features → XGBoost / LightGBM
Isolation Forest for unsupervised anomaly scoring
HDBSCAN for community/cluster detection
Step 4 — Final Score
Weighted combination → 0.0 to 1.0 fraud probability
SHAP values → which features drove the score
Stored in PostgreSQL predictions table
| Pattern | How It's Found | Query |
|---|---|---|
| Shared Device Ring | 3+ users → same device fingerprint | SHARED_DEVICE_QUERY |
| Shared IP Ring | 3+ users → same IP address | SHARED_IP_QUERY |
| Shared Card Fraud | 2+ users → same card | SHARED_CARD_QUERY |
| Email Cluster | Users sharing disposable email domains | SHARED_EMAIL_QUERY |
| Velocity Burst | 10+ transactions in 1 hour | features.py |
| Night Pattern | 70%+ transactions between 11pm–5am | features.py |
| VPN + High Value | VPN IP + transaction > threshold | registry.py |
fraud_detection/
│
├── 📂 src/
│ ├── 📂 adapters/
│ │ ├── __init__.py
│ │ ├── base.py ← Neo4j driver · Postgres engine · Redis client
│ │ ├── neo.py ← All Neo4j node + relationship functions
│ │ └── ps.py ← All PostgreSQL table + insert + query functions
│ │
│ ├── analysis.py ← Cypher queries → NetworkX graph export
│ ├── config.py ← Reads config.yaml + .env
│ ├── db.py ← Ingestion orchestrator (calls both adapters)
│ ├── features.py ← Feature matrix builder (Postgres + Neo4j)
│ ├── kafka_consumer.py ← Real-time transaction consumer + scorer
│ ├── kafka_producer.py ← Fraud alert publisher with Redis dedup
│ ├── logger.py ← Loguru config · file + console sinks
│ ├── model.py ← Node2Vec → GraphSAGE → XGBoost → score
│ ├── pipeline.py ← Master runner (CSV → DB → Graph → ML → Alert)
│ ├── queries.py ← All Cypher strings as constants
│ ├── registry.py ← All thresholds and constants
│ ├── visualize.py ← Pyvis fraud rings · Plotly dashboards
│ └── main.py ← Entry point
│
├── 📂 data/
│ ├── raw/ ← Original CSV files (git-ignored)
│ └── processed/ ← Cleaned DataFrames
│
├── 📂 output/
│ ├── logs/ ← fraud.log (rotates at 10MB)
│ ├── graphs/ ← Pyvis HTML fraud ring visualizations
│ ├── embeddings/ ← Saved Node2Vec embeddings (.pkl)
│ └── models/ ← Trained model weights
│
├── config.yaml ← DB URLs · Kafka topics · model hyperparams
├── .env ← Passwords (never commit this)
├── .env.example ← Template for .env
├── requirements.txt
└── README.md
Start the server: uvicorn src.main:app --reload
Interactive docs: http://localhost:8000/docs
| Endpoint | Method | Description |
|---|---|---|
/score/{user_id} |
GET |
Returns fraud probability + top contributing features |
/ring/{user_id} |
GET |
Returns the fraud ring subgraph for a user |
/alerts |
GET |
Returns all high-score predictions above threshold |
/health |
GET |
Returns service health status |
/metrics |
GET |
Prometheus metrics (transactions scored, alerts fired, latency) |
{
"user_id": "U0012483",
"fraud_score": 0.923,
"risk_level": "HIGH",
"top_features": [
{ "feature": "shared_device_count", "contribution": 0.41 },
{ "feature": "tx_velocity_1h", "contribution": 0.29 },
{ "feature": "vpn_flag", "contribution": 0.18 }
],
"ring_members": ["U0012483", "U0019234", "U0045821"],
"shared_entities": {
"devices": ["D_8f3a2b"],
"ips": ["192.168.1.1"]
},
"scored_at": "2025-06-10T14:32:11Z"
}| Label | Key Property | Source |
|---|---|---|
(:Customer) |
customer_id |
PaySim nameOrig |
(:Transaction) |
tx_id |
PaySim + IEEE |
(:Device) |
device_id |
IEEE identity |
(:Card) |
card_id |
IEEE transaction |
(:IP) |
address |
DB-IP + enrichment |
(:Email) |
domain |
IEEE P/R_emaildomain |
(:Merchant) |
merchant_id |
PaySim nameDest |
(:Country) |
code |
DB-IP |
(:BitcoinTx) |
tx_id |
Elliptic (GNN only) |
(Customer)-[:MADE]->(Transaction)
(Transaction)-[:TO]->(Customer)
(Transaction)-[:TO]->(Merchant)
(Customer)-[:USES]->(Device)
(Customer)-[:OWNS]->(Card)
(Customer)-[:USES_EMAIL]->(Email)
(Customer)-[:LOGGED_FROM]->(IP)
(Transaction)-[:USED_CARD]->(Card)
(Transaction)-[:USED_DEVICE]->(Device)
(Transaction)-[:USED_IP]->(IP)
(IP)-[:LOCATED_IN]->(Country)
(BitcoinTx)-[:TRANSFERRED_TO]->(BitcoinTx)customers — customer_id · external_id · is_fraud
transactions — tx_id · customer_id · merchant_id · card_id
device_id · ip_id · amount · is_fraud · timestamp
cards — card_id · card_fingerprint · card_network
devices — device_id · device_info · os · browser
ips — ip_id · ip_address · country · vpn_flag · tor_flag
transaction_features — C1–C14 · D1–D15 · M1–M9 (IEEE columns)
customer_identity — id_01–id_38 (IEEE identity columns)
login_events — event_id · customer_id · ip_id · device_id
predictions — customer_id · fraud_score · top_features · scored_at- PostgreSQL schema and ingestion
- Neo4j graph schema and ingestion
- PaySim dataset pipeline
- IEEE-CIS dataset pipeline
- Node2Vec embeddings
- GraphSAGE training on Elliptic
- Feature engineering matrix
- XGBoost + Isolation Forest scoring
- Kafka producer / consumer
- FastAPI scoring endpoint
- Pyvis fraud ring visualization
- MLflow experiment tracking
- SHAP explainability layer
- Prometheus metrics
- Evidently drift monitoring
- Plaid sandbox webhook integration
- Stripe webhook + real-time scoring
- Temporal GNN upgrade
| Layer | Technology |
|---|---|
| Graph Database | Neo4j 5.20 |
| Relational DB | PostgreSQL 16 (Docker) |
| Cache | Redis 5.0 (Docker) |
| Streaming | Apache Kafka + confluent-kafka |
| Graph ML | PyTorch Geometric 2.5.3 · GraphSAGE · GAT |
| Graph Embeddings | Node2Vec 0.5.0 |
| Classical ML | XGBoost · LightGBM · Isolation Forest · HDBSCAN |
| Graph Library | NetworkX 3.3 |
| API | FastAPI 0.111 + Uvicorn |
| Experiment Tracking | MLflow 2.13 |
| Monitoring | Prometheus · Evidently |
| Scheduling | Celery · APScheduler |
| Visualization | Pyvis · Plotly · Seaborn |
| Config | PyYAML · python-dotenv |
| Logging | Loguru |