ray-singh/aphex

aphex

A hardware-aware deployment planner that profiles arbitrary ML models, searches the deployment space, and produces a recommended serving configuration with predicted latency/throughput tradeoffs — locally or on a remote cloud machine.



Features

  • Hardware profiling: detects CPU cores, RAM, CUDA GPUs, Apple MPS, and CoreML availability
  • Model inspection: parameter count, memory footprint (FP32/FP16/BF16), architecture family
  • Pre-flight checks: fast feasibility check before committing to a full benchmark run
  • Multi-backend benchmarking: PyTorch (FP32/FP16/BF16), ONNX Runtime (CPU/CUDA/CoreML), torch.compile, INT8 quantization, TensorRT, OpenVINO
  • Batch size sweep: benchmarks every backend across multiple batch sizes in one run
  • Quality-constrained recommendation: requires a labelled test dataset; measures accuracy/F1/MAE/RMSE drop from quantization and filters candidates that exceed your tolerance before ranking
  • Magnitude pruning: 30 / 50 / 70 % unstructured + 2:4 structured (Ampere+ sparse Tensor Cores) appear as first-class benchmark candidates with measured accuracy drop
  • Knowledge distillation: aphex distill trains a user-supplied student model from a teacher using soft-label KD (Hinton KL on temperature-scaled logits + optional CE)
  • Multi-GPU data-parallel sweep: when ≥2 CUDA devices are detected, aphex automatically benchmarks nn.DataParallel variants (pytorch_dp{2,4,8}_{fp32,fp16,bf16}) so throughput scaling on multi-GPU boxes is measured, not assumed
  • Artifact export: converts the recommended model to its deployment format (.pt, .onnx, .engine, .xml)
  • HTML report: interactive latency-vs-throughput chart with full candidate table
  • Remote execution: runs the full benchmark pipeline on an EC2 instance (or any SSH host) and pulls results back locally
  • Cloud registry: push/pull versioned model artifacts to S3
  • sklearn / XGBoost / LightGBM / CatBoost support: ONNX export for traditional ML models
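
As a rough illustration of the memory-footprint figure in the model inspection bullet, weight memory can be estimated as parameter count times bytes per parameter. This is a minimal sketch, not aphex's inspection code: the function name and the byte table are assumptions, and it ignores activations, buffers, and runtime overhead.

```python
# Bytes used by one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}

def weight_memory_mb(num_params: int, dtype: str) -> float:
    """Estimate weight storage in MiB (weights only, no activations)."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)

# Hypothetical model with one million parameters.
for dtype in ("fp32", "fp16", "bf16"):
    print(f"{dtype}: {weight_memory_mb(1_000_000, dtype):.2f} MiB")
```

Halving the precision halves the weight footprint, which is why FP16 and BF16 are the first things to try when a model is tight on memory.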

Installation

Install the core CLI (no ML frameworks):

pip install aphex

Add the extras you need:

pip install 'aphex[torch]'      # PyTorch benchmarking (~2 GB)
pip install 'aphex[sklearn]'    # scikit-learn / tree model support
pip install 'aphex[onnx]'       # ONNX export + ONNX Runtime
pip install 'aphex[tensorflow]' # TensorFlow models
pip install 'aphex[aws]'        # S3 registry + EC2 remote execution
pip install 'aphex[full]'       # everything above

Quickstart

# Inspect hardware and model
aphex analyze model.pt

# Feasibility check before benchmarking
aphex preflight model.pt --dtype fp16

# Benchmark all deployment strategies
aphex benchmark model.pt --input-shape 3,224,224

# Get an optimized recommendation (eval data + inference callable required)
aphex optimize model.pt --input-shape 3,224,224 \
  --eval val.pt --infer-fn infer.py:predict \
  --max-accuracy-loss 0.02

# Regression model: constrain by MAE instead
aphex optimize model.pt --input-shape 16 --eval val.pt --max-mae-loss 0.05 --objective latency

# Save an HTML report and metrics JSON
aphex optimize model.pt --input-shape 3,224,224 --eval val.pt --max-accuracy-loss 0.02 \
  --report report.html --metrics metrics.json

# Distill a teacher into a smaller student (training-based; requires labels)
aphex distill teacher.pt --student make_student.py:tiny_mlp \
  --eval val.pt --epochs 5 --output student.pt

Example output

racing 7 backends × 4 batch sizes

  ✓ PyTorch FP32 CPU           bs=1     17.55 ms      57 req/s
  ✓ PyTorch FP32 CPU           bs=8      2.44 ms     410 req/s
  ✓ ONNX Runtime + CoreML      bs=1      0.92 ms    1085 req/s
  ✓ ONNX Runtime + CoreML      bs=8      0.31 ms    3226 req/s
  ✓ ONNX Runtime INT8 (CPU)    bs=1      0.01 ms    9200 req/s
  ✓ ONNX Runtime INT8 (CPU)    bs=8      0.04 ms   24800 req/s
  ...

  #1  ONNX Runtime INT8 (CPU)   bs=8   0.04 ms   24800 req/s  ████████████████░░░░
  #2  ONNX Runtime INT8 (CPU)   bs=4   0.03 ms   16600 req/s  █████████████░░░░░░░
  #3  ONNX Runtime + CoreML     bs=8   0.31 ms    3226 req/s  ██░░░░░░░░░░░░░░░░░░

CLI reference

Command                                            Description
aphex analyze <model>                              Hardware profile + model inspection
aphex preflight <model>                            Feasibility check (fast, no benchmarking)
aphex benchmark <model>                            Full benchmark across all backends
aphex optimize <model>                             Benchmark + Pareto-optimal recommendation + artifact export
aphex convert <model>                              Convert a model to a specific backend format
aphex distill <teacher>                            Knowledge-distill a teacher into a smaller student model
aphex check <model> --from-config deployment.yaml  Regression-check a model against a saved deployment baseline (CI-friendly; exits 1 on threshold breach)
aphex targets                                      List available hardware targets
aphex push <deployment.yaml> <artifact>            Push a versioned model to S3
aphex pull <name>                                  Pull a model artifact from S3
aphex ls                                           List models and versions in the S3 registry

Common options

--input-shape 3,224,224     Input tensor shape (no batch dim)
--batch-sizes 1,2,4,8       Batch sizes to sweep (comma-separated)
--objective latency          Optimization goal: latency | throughput | memory
--eval PATH                 Labelled test dataset (.pt, .csv, .parquet, image dir, or s3://, gs:// URI). Required for optimize.
--infer-fn module.py:fn     Inference callable for true accuracy measurement. Required to score `--eval` against the user's full pipeline.
--max-accuracy-loss 0.02    Max relative accuracy drop vs original model baseline (classification)
--max-f1-loss 0.02          Max relative macro-F1 drop vs original model baseline (classification)
--max-mae-loss 0.05         Max relative MAE increase vs original model baseline (regression)
--max-rmse-loss 0.05        Max relative RMSE increase vs original model baseline (regression)
--max-latency-ms 5.0        Hard latency constraint (p50)
--max-memory-mb 512         Hard memory constraint
--min-throughput-rps 200    Hard throughput constraint
--calibration-data PATH     .pt file or image dir for INT8 quantization calibration
--format table|json         Output format (json suppresses Rich output)
--report PATH               Write an HTML benchmark report
--metrics PATH              Write benchmark metrics as JSON
--remote HOST               Run benchmark on a remote SSH host
--output PATH               Where to write the deployment artifact
--jobs N, -j N              Run N (candidate, batch_size) benchmarks in parallel (default 1; keep at 1 for accurate latency/memory)
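
To make the hard constraints above concrete, the sketch below computes p50 / p95 / p99 from raw latency samples and checks a candidate against the three limits. It is an illustration only: the function names and argument names are hypothetical, and aphex's own percentile method may differ.

```python
import statistics

def latency_percentiles(samples_ms):
    """Return p50/p95/p99 from per-request latencies in milliseconds."""
    # quantiles(n=100) yields 99 cut points; cut point k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

def meets_hard_constraints(p50_ms, memory_mb, throughput_rps,
                           max_latency_ms=None, max_memory_mb=None,
                           min_throughput_rps=None):
    """True if the candidate violates none of the limits that were set."""
    if max_latency_ms is not None and p50_ms > max_latency_ms:
        return False
    if max_memory_mb is not None and memory_mb > max_memory_mb:
        return False
    if min_throughput_rps is not None and throughput_rps < min_throughput_rps:
        return False
    return True

# Made-up samples with one slow outlier; the outlier moves p99, not p50.
samples = [1.0, 1.2, 0.9, 1.1, 5.0, 1.0, 1.3, 0.95, 1.05, 1.15]
stats = latency_percentiles(samples)
ok = meets_hard_constraints(stats["p50"], memory_mb=300, throughput_rps=900,
                            max_latency_ms=5.0, max_memory_mb=512,
                            min_throughput_rps=200)
print(stats, ok)
```

Because --max-latency-ms applies to p50, a candidate with occasional slow requests can still pass; inspect p95 / p99 in the report if tail latency matters.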

Loading user models safely

Some models can only be loaded with PyTorch's pickle-based loader, which executes arbitrary code on load. aphex defaults to the safe weights_only=True path; if it fails with a pickle-related error you get a friendly message listing both options (re-save as a state_dict, or opt in for trusted files):

APHEX_TRUST_PICKLE=1 aphex optimize model.pt --input-shape 3,224,224 --eval val.pt ...

Do not set this for models from untrusted sources.
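
For readers unfamiliar with the risk, the standard-library sketch below shows why pickle-based loading is unsafe. It is unrelated to aphex's internals: the Payload class is hypothetical, and the function it names is deliberately harmless.

```python
import pickle

class Payload:
    """Hypothetical object whose pickle names a function to call on load."""
    def __reduce__(self):
        # The unpickler calls sorted([3, 1, 2]) instead of rebuilding Payload.
        # A malicious file could name os.system here instead.
        return (sorted, ([3, 1, 2],))

blob = pickle.dumps(Payload())
loaded = pickle.loads(blob)   # runs the function chosen by the file's author
print(type(loaded).__name__, loaded)
```

Loading never reconstructed a Payload; it ran whatever callable the file specified. Re-saving a model as a plain state_dict avoids this path, which is why it is the preferred of the two options.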

Pruning

Magnitude pruning is wired in as four additional benchmark candidates on every device path, so aphex benchmark / aphex optimize automatically score them alongside FP16, INT8, etc.:

Backend                        Sparsity  Notes
pytorch_prune_unstructured_30  30 %      L1-magnitude, every Linear/Conv weight
pytorch_prune_unstructured_50  50 %      same, more aggressive
pytorch_prune_unstructured_70  70 %      accuracy cost is usually visible past 50 %
pytorch_prune_2_4              50 %      structured 2-of-4 pattern for NVIDIA Ampere+ sparse Tensor Cores

aphex measures both latency and accuracy drop for pruned candidates through the same pipeline as quantized backends. Latency improvement on dense CPUs is usually modest; the value is the storage / accuracy tradeoff and, on Ampere+ GPUs, the 2:4 sparse-kernel speedup. Use the existing --max-accuracy-loss / --max-f1-loss / etc. flags to filter out pruned variants that exceed your quality budget before ranking.
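
As a worked example of how a relative quality budget filters candidates, the sketch below assumes the drop is computed as (baseline - candidate) / baseline for accuracy and F1, and as (candidate - baseline) / baseline for MAE and RMSE. The function names and the exact formula are assumptions, not confirmed aphex behavior.

```python
def relative_drop(baseline: float, candidate: float, higher_is_better: bool = True) -> float:
    """Relative quality loss versus the baseline; positive means worse."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative comparison")
    delta = baseline - candidate if higher_is_better else candidate - baseline
    return delta / baseline

def within_budget(baseline, candidate, max_loss, higher_is_better=True):
    """True if the candidate's relative loss does not exceed the budget."""
    return relative_drop(baseline, candidate, higher_is_better) <= max_loss

# Accuracy (higher is better), budget 0.02 as in --max-accuracy-loss 0.02
print(within_budget(0.90, 0.885, 0.02))   # drop of about 1.7 %: kept
print(within_budget(0.90, 0.880, 0.02))   # drop of about 2.2 %: filtered out

# MAE (lower is better), budget 0.05 as in --max-mae-loss 0.05
print(within_budget(2.00, 2.08, 0.05, higher_is_better=False))  # 4 % increase: kept
```

Note that a relative budget scales with the baseline: 0.02 allows a larger absolute drop on a 0.99-accuracy model than on a 0.60-accuracy one.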

aphex's pruning is post-training: no labels, no gradient updates. For recovery training, distill into a smaller dense student instead (below).
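
The two pruning patterns can be sketched on a flat list of weights. This is a framework-free illustration of the technique, not aphex's code: the helper names are hypothetical, and a real implementation would operate on each layer's weight tensor.

```python
def prune_unstructured(weights, sparsity):
    """Zero the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    # Indices of the k smallest |w|; these are the weights to drop.
    drop = set(sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def prune_2_4(weights):
    """In every consecutive group of 4, keep the 2 largest-magnitude weights."""
    if len(weights) % 4:
        raise ValueError("2:4 pruning needs a length divisible by 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        keep = set(sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned

w = [0.1, -0.5, 0.3, -0.05, 0.8, 0.2, -0.9, 0.4]
print(prune_unstructured(w, 0.5))
print(prune_2_4(w))
```

Both calls zero half of the weights here, but only the 2:4 result guarantees exactly two zeros in every group of four, which is the layout sparse Tensor Cores require.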

Distillation

aphex distill is the only command that performs gradient updates. It trains a student model to imitate a teacher using soft-label knowledge distillation:

L = α · KL( softmax(teacher / T) || softmax(student / T) ) · T²
  + (1 - α) · CE(student, hard_label)        # classification
L = MSE(student, teacher)                    # regression
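
The loss can be checked by hand with a framework-free sketch for a single example. The KL term below treats the teacher distribution as the target, KL(teacher ‖ student), which is the direction used in Hinton-style distillation; whether aphex computes it exactly this way is an assumption, and the function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_classification_loss(student_logits, teacher_logits, hard_label,
                           temperature=3.0, alpha=0.7):
    """alpha * KL(teacher || student) * T^2 + (1 - alpha) * CE(student, label)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Assumes no student probability underflows to zero (moderate logits).
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    ce = -math.log(softmax(student_logits)[hard_label])
    return alpha * kl * temperature ** 2 + (1 - alpha) * ce

def kd_regression_loss(student_out, teacher_out):
    """Mean squared error between student and teacher outputs."""
    return sum((s - t) ** 2 for s, t in zip(student_out, teacher_out)) / len(student_out)

print(kd_classification_loss([0.2, 1.0, -0.5], [0.1, 2.5, -1.0], hard_label=1))
print(kd_regression_loss([1.0, 2.0], [1.5, 2.0]))
```

The T² factor compensates for the 1/T² scaling of the soft-target gradients, so the KD and CE terms stay comparable as the temperature changes.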

The student architecture is yours; provide a zero-argument factory function and aphex handles the training loop, scoring, and report:

# Write a tiny factory file
cat > make_student.py <<'PY'
import torch.nn as nn
def tiny_mlp():
    return nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 3))
PY

aphex distill teacher.pt \
  --student make_student.py:tiny_mlp \
  --eval val.pt \
  --epochs 8 --batch-size 16 --lr 1e-2 \
  --temperature 3.0 --alpha 0.7 \
  --task classification --device cpu \
  --output student.pt --report distill_report.json

Output (excerpt):

  teacher params  387
  student params  51
  epochs=8  batch=16  lr=0.01  temp=3.0  alpha=0.7  task=classification  device=cpu

  compression  7.6×
  final loss   1.93  (first epoch 3.99)
  accuracy     teacher 1.0000  →  student 0.7350
  ✓  student state_dict → student.pt

Common flags:

--student module.py:fn      Student factory: zero-arg callable returning an nn.Module
--eval PATH                 Labelled dataset for distillation
--task classification|regression
--temperature 4.0           KD softmax temperature (higher = softer teacher distribution)
--alpha 0.7                 Weight on KD loss; (1 - alpha) on hard-label CE (classification only)
--epochs 3 --batch-size 32 --lr 1e-3
--device cpu|cuda|mps
--output student.pt         Destination for the trained student state_dict
--report report.json        Optional JSON: per-epoch losses, param counts, teacher/student scores

The teacher is frozen during training. If no labels are available (labels=None), aphex falls back to pure KD (alpha=1.0). The output is a state_dict: rebuild the student with the same factory, call load_state_dict(), and then deploy it or feed it back into aphex optimize for a deployment-format search.
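
The module.py:fn form used by --student (and by --infer-fn) can be resolved with importlib. The sketch below shows one way such a spec could be loaded; load_callable is a hypothetical helper, not aphex's implementation.

```python
import importlib.util
from pathlib import Path

def load_callable(spec: str):
    """Resolve 'path/to/module.py:function_name' to the function object."""
    # Split on the last colon so drive letters such as C:\ stay in the path.
    path_str, sep, attr = spec.rpartition(":")
    if not sep or not path_str or not attr:
        raise ValueError(f"expected 'module.py:fn', got {spec!r}")
    path = Path(path_str)
    module_spec = importlib.util.spec_from_file_location(path.stem, path)
    if module_spec is None or module_spec.loader is None:
        raise ImportError(f"cannot load module from {path}")
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)   # runs the file's top-level code
    fn = getattr(module, attr)
    if not callable(fn):
        raise TypeError(f"{attr!r} in {path} is not callable")
    return fn

# Usage, assuming make_student.py from the example above is in the working directory:
# student = load_callable("make_student.py:tiny_mlp")()
```

Loading the file executes its top-level code, so a factory or inference file deserves the same trust as any other script you run.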

Multi-GPU benchmarking

When profile_hardware() detects more than one CUDA device, the candidate generator emits single-process nn.DataParallel variants alongside the regular PyTorch backends:

Backend                       Devices                      Dtype
pytorch_dp2_{fp32,fp16,bf16}  2 × GPU                      matches dtype suffix
pytorch_dp4_{fp32,fp16,bf16}  4 × GPU (host must have ≥4)  matches dtype suffix
pytorch_dp8_{fp32,fp16,bf16}  8 × GPU (host must have ≥8)  matches dtype suffix

BF16 variants are only emitted on Ampere+ (sm_80+). DP shards the batch dimension across replicas, so it is a throughput win at large batch sizes and a latency no-op (or slight loss) at batch=1. The runner requires every batch size to be at least N, the number of replicas, and reports a clear error otherwise, so give a multi-GPU sweep a batch list that satisfies this:

aphex benchmark model.pt --input-shape 3,224,224 --batch-sizes 8,16,32

DP candidates do not run the cosine-similarity accuracy proxy: replication doesn't alter weights, so accuracy is identical to the underlying-dtype candidate (e.g. pytorch_dp4_fp16 shares the same accuracy signal as pytorch_fp16).
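
The rules in this section (replica counts of 2, 4, or 8 limited by the GPUs present, BF16 only on sm_80+, batch size at least N) can be sketched as follows. The function names are hypothetical and the logic is inferred from the text above, not taken from aphex's candidate generator.

```python
def dp_candidates(num_gpus: int, supports_bf16: bool) -> list[str]:
    """List the DataParallel backend names a host with `num_gpus` GPUs could run."""
    dtypes = ["fp32", "fp16"] + (["bf16"] if supports_bf16 else [])
    return [f"pytorch_dp{n}_{dtype}"
            for n in (2, 4, 8) if n <= num_gpus
            for dtype in dtypes]

def check_batch_size(batch_size: int, replicas: int) -> None:
    """Reject batches too small to give every replica at least one sample."""
    if batch_size < replicas:
        raise ValueError(
            f"batch size {batch_size} is smaller than the {replicas} DP replicas"
        )

print(dp_candidates(num_gpus=4, supports_bf16=False))
check_batch_size(8, replicas=4)    # fine: two samples per replica
```

A single-GPU host yields an empty list, which matches the text: DP variants appear only when more than one CUDA device is detected.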

For real distributed training / inference (DDP, tensor parallelism, pipeline parallelism, multi-node), see Out of scope below.

AWS integration

Remote benchmarking on EC2

Run the full benchmark pipeline on a remote machine — useful when you want results for a GPU instance without setting up a local GPU environment.

# Benchmark on an EC2 instance and pull results back locally
aphex optimize model.pt \
  --input-shape 3,224,224 \
  --eval val.pt \
  --max-accuracy-loss 0.02 \
  --remote ec2-user@<instance-ip> \
  --output deployment.yaml \
  --report report.html \
  --metrics metrics.json

aphex uploads the model and eval dataset, runs the full benchmark on the remote host, streams output to your terminal, then downloads deployment.yaml, the HTML report, and the metrics JSON. The remote temp directory is cleaned up automatically.

Setup

  1. Add the instance to ~/.ssh/config:
       Host <instance-ip>
           IdentityFile ~/.ssh/your-key.pem
           User ec2-user
           StrictHostKeyChecking no
  2. Install aphex on the instance:
       ssh ec2-user@<instance-ip> "pip install 'aphex[torch,onnx]'"
  3. Verify the connection:
       ssh ec2-user@<instance-ip> "aphex --help"

Recommended instance type for cost-effective benchmarking: t3a.large (8 GB RAM, ~$0.02/hr as a spot instance) covers most CPU/ONNX workloads. Use a g4dn.xlarge for GPU benchmarking.

S3 model registry

Push versioned model artifacts to S3 and pull them from any machine.

# Configure storage (one-time)
export APHEX_BUCKET=my-models-bucket
export AWS_REGION=us-east-1

# Push a deployment artifact
aphex push deployment.yaml model.onnx --name resnet50 --version v1

# Pull on another machine
aphex pull resnet50             # latest version
aphex pull resnet50@v1          # specific version
aphex pull resnet50 --out ./models/

# List what's in the registry
aphex ls                        # all models
aphex ls resnet50               # versions of a specific model

Credentials are picked up from the standard AWS chain (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY, ~/.aws/credentials, or an IAM instance role).

Pipeline

model.pt + hardware
       |
       v
  inspect_model()     → parameters, memory, family
  profile_hardware()  → CPU, RAM, GPU / MPS / CoreML
       |
       v
  run_preflight()     → feasibility: ok / tight / unlikely / impossible
       |
       v
  generate_candidates() → (backend, dtype, device, batch_size) combos
                          incl. quantized, pruned (30/50/70/2:4), and torch.compile variants
       |
       v
  benchmark_candidate() × (backends × batch sizes) → p50 / p95 / p99, throughput, memory
       |
       v
  evaluate_quality()  → accuracy/F1/MAE/RMSE drop vs original model baseline (--eval dataset)
       |
       v
  recommend()         → Pareto frontier → filter by quality constraint → best candidate for objective
       |
       v
  convert()           → deployment artifact (.pt / .onnx / .engine / .xml)
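
The recommend() step can be illustrated with a small Pareto-frontier sketch over latency and throughput. Everything here is an assumption for illustration: the Result fields, the two-axis dominance rule, and the order of operations. The sketch applies the quality filter before building the frontier so that a candidate failing the budget cannot push a passing one off the frontier; the diagram lists the frontier first, and which order aphex uses is not confirmed here. Only two of the three objectives are handled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Result:
    name: str
    latency_ms: float
    throughput_rps: float
    quality_drop: float   # relative drop versus the original model

def dominates(a: Result, b: Result) -> bool:
    """a dominates b if it is no worse on both axes and better on at least one."""
    no_worse = a.latency_ms <= b.latency_ms and a.throughput_rps >= b.throughput_rps
    better = a.latency_ms < b.latency_ms or a.throughput_rps > b.throughput_rps
    return no_worse and better

def recommend(results, max_quality_drop, objective="latency"):
    feasible = [r for r in results if r.quality_drop <= max_quality_drop]
    frontier = [r for r in feasible if not any(dominates(o, r) for o in feasible)]
    if not frontier:
        return None
    if objective == "latency":
        return min(frontier, key=lambda r: r.latency_ms)
    if objective == "throughput":
        return max(frontier, key=lambda r: r.throughput_rps)
    raise ValueError(f"unsupported objective in this sketch: {objective}")

# Synthetic numbers, not benchmark output.
results = [
    Result("a_fp32", latency_ms=10.0, throughput_rps=100.0, quality_drop=0.00),
    Result("b_fp16", latency_ms=4.0, throughput_rps=250.0, quality_drop=0.01),
    Result("c_int8", latency_ms=1.0, throughput_rps=900.0, quality_drop=0.05),
]
print(recommend(results, max_quality_drop=0.02).name)
```

With a 0.02 budget the fastest candidate is excluded for its quality drop, so the recommendation falls to the best remaining point on the frontier.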

Supported backends

Backend                         Device                                  Dtype
PyTorch eager                   CPU                                     FP32
PyTorch eager                   MPS (Apple Silicon)                     FP32, FP16
PyTorch eager                   CUDA                                    FP32, FP16, BF16
torch.compile                   CPU / CUDA                              FP32
ONNX Runtime                    CPU                                     FP32
ONNX Runtime + CoreML           Apple Silicon                           FP32
ONNX Runtime                    CUDA                                    FP32
PyTorch INT8 dynamic            CPU                                     INT8
ONNX Runtime INT8               CPU                                     INT8
TensorRT                        CUDA                                    FP32, FP16, INT8
OpenVINO                        CPU                                     FP32, INT8
PyTorch + magnitude prune       CPU / MPS / CUDA                        FP32 @ 30 / 50 / 70 % sparsity
PyTorch + 2:4 structured prune  CPU / MPS / CUDA (Ampere+ for speedup)  FP32 @ 50 % sparsity
PyTorch + nn.DataParallel       CUDA × {2, 4, 8} GPUs                   FP32, FP16, BF16 (throughput-oriented)

Out of scope (for current version)

  • Pruning recovery training: aphex's pruning is post-training only. If your model degrades past tolerance at the sparsity you want, distill into a smaller dense student instead.
  • Quantization-aware training (QAT): only post-training quantization is supported.
  • LLM-specific quality metrics: cosine-similarity proxies are skipped for generative families (llm, transformer_decoder, seq2seq); score those models with a custom --infer-fn (perplexity, task benchmarks).
  • Distributed multi-GPU (DDP / tensor parallelism / pipeline parallelism): aphex sweeps single-process nn.DataParallel candidates (pytorch_dp{2,4,8}_{fp32,fp16,bf16}) when ≥2 CUDA devices are detected, but real DDP / torchrun orchestration and tensor- or pipeline-parallel sharding are out of scope. DP variants are throughput-oriented and require every swept batch size to be at least N, the number of GPUs.

Requirements

  • Python 3.12+
  • At least one framework extra (aphex[torch], aphex[sklearn], etc.)
  • For remote execution: ssh and scp on the local machine, aphex installed on the remote

License

MIT
