Files: stage2.py · cuda_spmm/warp_gat_kernel.cu · cuda_spmm/spmm_kernel.cu · cuda_spmm/setup.py
Stage 2 takes the unsupervised embeddings from Stage 1 and fine-tunes them using a Graph Attention Network (GAT) with a small set of labeled anomaly nodes. The goal: assign every node in the 1.5M-node graph an anomaly score between 0 and 1.
| File | Description |
|---|---|
Z_embeddings_stage1.npy |
(N_total, 128) float32 — pre-trained node embeddings from Stage 1 GAE |
data/labels.npy |
(L, 2) int — column 0: global node ID, column 1: binary label (1 = anomaly) |
data/node_counts.json |
{ users, products, sellers, total } |
data/epu_*.bin, eps_*.bin, euu_*.bin |
CSR topology binaries from preprocessing |
| File | Description |
|---|---|
anomaly_scores_stage2.npy |
(N_total,) float32 — sigmoid anomaly probability for every node |
Z_stage2.npy |
(N_total, 128) float32 — refined embeddings after GAT fine-tuning |
gat_stage2_best.pt |
PyTorch state dict of the best model checkpoint |
The model is a 2-layer heterogeneous GAT followed by a linear anomaly scoring head:
Z_stage1 (128-D)
│
▼
┌─────────────────┐
│ GAT Layer 1 │ ← Type-specific attention across 3 edge streams
│ (128 → 128) │
│ + ELU │
└────────┬────────┘
│
▼
┌─────────────────┐
│ GAT Layer 2 │
│ (128 → 128) │
│ + ELU │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Linear Head │ ← nn.Linear(128, 1)
│ → Anomaly │
│ Logit │
└─────────────────┘
Each TypeSpecificGATLayer processes three edge streams independently, each with its own learned weights:
| Stream | Edge Type | Weight Matrix | Attention Vector | What It Does |
|---|---|---|---|---|
| Purchase | E_pu + E_pu_T | W_pu |
a_pu |
Messages between Users ↔ Products |
| Selling | E_ps + E_ps_T | W_ps |
a_ps |
Messages between Products ↔ Sellers |
| Co-purchase | E_uu | W_uu |
a_uu |
Messages between Users ↔ Users |
Step-by-step per stream (e.g. for Purchase edges):
- Linear projection: Apply
W_puto both user and product features →Wh_u,Wh_p(both 128-D) - Attention computation: For each edge
(u, p), compute attention logit:e_up = LeakyReLU(a_pu[:128]·Wh_u + a_pu[128:]·Wh_p)
- Softmax normalization: Normalize attention weights across each node's neighbors:
α_up = softmax(e_up)(with numerically stable max-subtraction) - Weighted aggregation: Each destination node's output = Σ (α_up · Wh_source) over all its neighbors
- Accumulate: Add this stream's output into the node's accumulator
After all three streams, apply ELU activation (with scaling: user/product outputs are divided by 2 since they receive messages from 2 streams).
The final output is [H_u_out | H_p_out | H_s_out] concatenated back into a single (N_total, 128) tensor.
The most interesting engineering decision in Stage 2 — training and inference use different backends:
| Mode | Backend | Why |
|---|---|---|
Training (model.train()) |
Native PyTorch with torch.autograd.checkpoint |
Needs gradient computation; CUDA kernel has no backward pass |
Inference (model.eval()) |
Custom warp_gat CUDA kernel |
No gradients needed; kernel runs 3-5× faster |
Training VRAM optimizations:
- 85% DropEdge: During training, 85% of edges are randomly dropped per chunk — this prevents over-smoothing AND saves massive VRAM
- Micro-chunking: Nodes are processed in chunks of 2,500 to avoid VRAM spikes from "super-nodes" with thousands of neighbors
- Intra-loop garbage collection: All intermediate tensors are explicitly deleted after each chunk
- Gradient checkpointing:
torch.utils.checkpoint.checkpointrecomputes forward activations during backward instead of storing them, trading compute for VRAM
The total loss is a weighted sum of supervised and unsupervised components:
L_total = L_supervised + λ · L_unsupervised (λ = 0.5)
Standard Binary Cross-Entropy with Logits on the labeled nodes only:
L_sup = BCE(logits[labeled_idx], labels)Only labeled nodes contribute gradients — unlabeled nodes still participate through message passing in the GAT layers.
Encourages connected nodes to have similar anomaly scores (Equation 17 from the paper):
- Convert logits to bounded scores via sigmoid:
s_i = σ(logit_i) - Sample 5,000 random edges from each of the 3 edge types
- For each sampled edge
(i, j): compute(s_i - s_j)² - Average across all sampled edges
This enforces the intuition that if two nodes are connected (e.g. a user and a product they bought), their anomaly scores should be similar — anomaly signal propagates through the graph.
Stage 2's data loader (load_and_sanitize_csr) does something Stage 1 didn't need — it strips cross-type self-loops from the CSR files:
Stage 1 preprocessing added self-loops where every node connects to itself via its global ID. But in Stage 2's GAT, edges are processed per-relation with strict bipartite boundaries (e.g. E_pu should only connect Users → Products). A user self-loop in the E_pu CSR would point to a User ID, which is out-of-bounds for the Product node type.
The sanitizer:
- Creates a validity mask:
(col_idx >= valid_min) & (col_idx <= valid_max) - Filters
col_idxto keep only edges pointing to the correct node type - Reconstructs
row_ptrfrom scratch to reflect the new edge counts per row
Labels are split into Train / Validation / Test sets (60% / 20% / 20%) with a fixed seed for reproducibility.
Metric: AUC-ROC (Area Under the ROC Curve) — implemented manually without sklearn:
- Sort all labeled nodes by their anomaly score (descending)
- Compute cumulative true positive rate (TPR) and false positive rate (FPR)
- Integrate using the trapezoidal rule
Validation AUC-ROC is checked every 10 epochs, and the best checkpoint is saved. Final evaluation runs on the held-out test set.
| Parameter | Value |
|---|---|
| Epochs | 100 |
| Optimizer | Adam |
| Learning Rate | 0.001 |
| λ (unsupervised weight) | 0.5 |
| Mixed Precision | FP16 autocast + GradScaler |
| DropEdge rate (training) | 85% |
| Chunk size | 2,500 nodes |
| Unsupervised edge samples | 5,000 per edge type |
| Train / Val / Test split | 60% / 20% / 20% |
Two separate CUDA extensions are built via setup.py:
File: warp_gat_kernel.cu — Used during model.eval() for fast inference.
How it works — 1 warp (32 threads) processes 1 destination node:
For each destination node u:
┌─────────────────────────────────────────────────────┐
│ Step 1: Compute dst_proj = Σ H_dst[u] * a_dst │
│ (distributed across 32 lanes, warp-reduced) │
├─────────────────────────────────────────────────────┤
│ Step 2: For each neighbor v of u: │
│ Compute src_proj = Σ H_src[v] * a_src │
│ e_uv = LeakyReLU(dst_proj + src_proj) │
│ Track max(e_uv) for numerical stability │
├─────────────────────────────────────────────────────┤
│ Step 3: Second pass over neighbors: │
│ exp_sum += exp(e_uv - max_e) │
│ (warp-reduced for softmax denominator) │
├─────────────────────────────────────────────────────┤
│ Step 4: Feature aggregation: │
│ For each feature dim d (striped by lane): │
│ out[u,d] = Σ_v α_uv * H_src[v,d] │
│ where α_uv = exp(e_uv-max)/exp_sum │
└─────────────────────────────────────────────────────┘
Key CUDA primitives:
__shfl_down_sync(0xffffffff, val, offset)— warp shuffle for in-register reductions (no shared memory needed)warpReduceMax— finds maximum attention logit across all 32 lanes for numerically stable softmaxwarpReduceSum— sums exp values across all 32 lanes for softmax denominator
Launch config: 128 threads/block (4 warps/block), ceil(N_dst × 32 / 128) blocks.
File: spmm_kernel.cu — Used in Stage 1 GAE training.
Kernel 1: csr_spmm_warp_kernel (Encoder)
- Performs Sparse Matrix × Dense Matrix multiplication:
out = A · X - 1 warp = 1 row of the sparse adjacency matrix
- Reads FP16 input features, accumulates in FP32 to prevent precision loss
- Applies pre-computed symmetric GCN weights (
D^{-1/2} A D^{-1/2}) at the neighbor aggregation step - Supports up to 1024-D features via chunked accumulation (32 chunks × 32 lanes)
Kernel 2: warp_sparse_dot_product_kernel (Decoder)
- Computes batched inner products
Z_src · Z_dstfor sampled edge pairs - 1 warp = 1 edge: each warp computes the dot product for one (src, dst) pair
- Uses
__shfl_down_syncwarp shuffle for fast in-warp reduction — no global memory synchronization
Python-side backprop (SpMM_Autograd):
The custom SpMM has a hand-written PyTorch autograd.Function that implements chunk-streaming backpropagation — processing 5,000 rows at a time to prevent VRAM spikes during gradient computation on the 1.5M-node graph.
cd cuda_spmm
python setup.py installThis compiles both extensions (custom_spmm and warp_gat) as importable Python modules via PyTorch's CUDAExtension build system. Requires CUDA Toolkit + a CUDA-capable GPU.
┌──────────────────────────────────────────────────────────────┐
│ Stage 1 (Unsupervised) │
│ train_stage1_gae.py + custom_spmm CUDA kernel │
│ │
│ X_combined.memmap ──▶ 2-Layer GCN ──▶ Z_embeddings_stage1 │
│ (128-D features) (SpMM encoder) (128-D latent) │
│ (dot-product │
│ decoder) │
└────────────────────────────┬─────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Stage 2 (Semi-Supervised) │
│ stage2.py + warp_gat CUDA kernel │
│ │
│ Z_stage1 + labels.npy ──▶ 2-Layer GAT ──▶ anomaly_scores │
│ (128-D embeddings) (attention (0 to 1 per node) │
│ message │
│ passing) │
└──────────────────────────────────────────────────────────────┘
Stage 1 learns what the graph looks like (structure) without any labels. Stage 2 learns what anomalies look like by fine-tuning with a small labeled set, while the unsupervised loss ensures anomaly signals propagate through the graph topology.