Hybrid Prompt-Injection Guard — Detection Engine of the TUP AIGSMP
About • Architecture • Results • Evaluation • Limitations • Roadmap • Getting Started • Contributing • Structure • Configuration
TUP Detection is the prompt-injection detection engine of TUP — an enterprise-grade, open-core AI Governance and Security Monitoring Platform (AIGSMP). It is the analytical core that powers the TUP Manager service: a hybrid pipeline that evaluates LLM inputs and outputs against deterministic policies and neural classifiers, then emits structured, OWASP-mapped security alerts to the wider platform.
Unlike traditional SIEM detection rules designed for OS-level or network-level events, TUP Detection operates directly at the intelligence layer — scoring prompts, system instructions, and model responses to catch jailbreaks, instruction overrides, and multi-turn adversarial steering.
This repository is the standalone detection & benchmarking module. It plugs into the full platform at notyorch/TUP-fullstack.
This work was submitted to Apart Research · Global South 2026. See Citation for how to reference it.
TUP Detection is the TUP Manager brain inside the wider AIGSMP platform: the Collector intercepts LLM telemetry, the Manager scores it, the Indexer persists alerts, and the Dashboard surfaces them.
Within the Manager, the engine runs a multi-layer, fail-safe pipeline (M1–M5). Cheap deterministic checks run first; the neural classifier only runs when no rule fires, and an optional LLM judge arbitrates the gray zone.
flowchart TD
X(["Prompt / LLM output"])
X --> M1
M1["M1 Normalize + segment<br/><i>text_normalize · prompt_segments</i>"]
M1 --> M2
M2["M2 Build variant set V(x)<br/><i>raw · normalized · per-segment</i>"]
M2 --> M3
M3["M3 L1 Regex policy<br/><i>policies/rules/</i>"]
M3 -- hit --> ALERT["ALERT<br/><i>rule_id, OWASP-mapped</i>"]
M3 -- no hit --> M4
M4["M4 Sentinel v2 — max s(v) over V(x)<br/><i>injection_classifier (HF endpoint)</i>"]
M4 -- gray zone --> L3
L3["L3 LLM judge (optional)<br/><i>nvidia_judge_engine</i>"]
L3 -.-> M4
M4 --> M5
M5["M5 Threshold τ → verdict → structured alert → TUP-fullstack"]
| Layer | Component | Characteristics |
|---|---|---|
| L1 | OWASP-mapped regex (policies/rules/) |
Deterministic, zero-latency, traceable rule_id |
| L2 | Sentinel v2 (HF Inference Endpoint) | Neural classifier, paraphrase-robust, no fine-tuning |
| L3 (optional) | LLM judge (NVIDIA NIM — Llama 3.1) | Gray-zone arbitration for s ∈ [0.15, 0.85] |
The engine monitors both inputs and outputs — bidirectional scoring catches attacks that are only observable after the model has been steered.
| Mode | τ | Benign guard | Use |
|---|---|---|---|
benchmark |
0.15 | off | Tier-B evaluation / max recall |
production |
0.50 | on | Live traffic / FP suppression |
Primary benchmark: deepset/prompt-injections (n = 662, Tier B). Metric: PINT balanced accuracy = ½ (attack recall + benign specificity).
| System | PINT Balanced Accuracy |
|---|---|
| TUP + DeBERTa (legacy baseline) | 72.4% |
| Sentinel v2 (model card, indirect) | ~88% |
| TUP + Sentinel v2 (this repo) | 95.1% |
Stack ablation on deepset (τ = 0.15):
| Stack | PINT | Attack recall | Benign pass | TP | FN | FP |
|---|---|---|---|---|---|---|
| L1 only | 58.4% | 17.9% | 99.0% | 47 | 216 | 4 |
| Sentinel only | 95.1% | 93.2% | 97.0% | 245 | 18 | 12 |
| Hybrid | 95.1% | 94.3% | 96.0% | 248 | 15 | 16 |
PINT is rounded to one decimal: the hybrid stack trades slightly more false positives for higher attack recall, so both rows land at 95.1%.
On Crescendo multi-turn adversarial dialogues (n = 10), full-transcript scoring achieves 100% attack recall.
Frozen score caches in
notebooks/data/external/results/reproduce all metrics without re-querying the inference endpoint.
Stack ablation across four metrics. Layer 1 alone protects benign traffic (99% pass rate) but catches only 18% of attacks. Sentinel v2 alone provides strong recall. The hybrid retains Tier-B PINT accuracy while adding 3 explainable catches that Sentinel misses, each with traceable rule_id attribution.
TUP + Sentinel v2 measured on the same deepset split vs. our legacy TUP + DeBERTa stack and publicly reported baselines. The two measured stacks (TUP + Sentinel v2, TUP + DeBERTa) are evaluated on our identical YAML split; the literature values (Sentinel v2 model card, ProtectAI DeBERTa) are reported under different conditions — see Limitations.
Among the 263 attack samples, the two layers are complementary: 201 detected by Sentinel alone, 44 by both, and 3 exclusively by Layer 1 — those 3 carry rule_id attribution traceable to OWASP-mapped patterns in policies/rules/, something no classifier provides. Adding Layer 1 recovers them at a cost of +4 FP over Sentinel alone.
Crescendo attacks gradually escalate across conversation turns — early turns appear benign. A stateless per-turn guard misses 25% of attacks. Full-transcript scoring feeds the complete conversation to the classifier and achieves 100% attack recall across all 10 dialogues.
First detection occurs at turn 2.7 on average; all conversations are flagged by turn 6.
We report these openly so results are interpreted in context:
- Baseline comparison is approximate. Only the two TUP stacks (TUP + Sentinel v2 and TUP + DeBERTa) are measured on our identical deepset YAML split. The Sentinel v2 model-card (~88%) and ProtectAI DeBERTa (77.6%) figures are reported values produced under different datasets, splits, and decision thresholds, and were not re-run under our infrastructure. Treat cross-system gaps as indicative, not head-to-head.
- Single primary dataset. The headline Tier-B claim rests on one public benchmark (deepset/prompt-injections, n = 662). Broader-distribution validation (e.g. Antijection, OWASP v2) is in progress and not yet completed.
- Small multi-turn sample. The Crescendo evaluation covers n = 10 dialogues — strong directional evidence of multi-turn coverage, but not a large-scale robustness study.
- Full-transcript favourability. Crescendo's 100% recall is obtained with full-transcript scoring, which feeds the complete conversation to the classifier. Stateless per-turn scoring (a stricter, more operationally realistic setting) is harder and detects less.
- Endpoint dependency. Live scoring requires the gated Sentinel v2 HF Inference Endpoint. The frozen score caches reproduce all reported metrics offline, but new inputs need the deployed model.
These map directly to the Limitations above and are tracked as GitHub issues — contributions welcome (see Contributing):
- Expand the Crescendo multi-turn benchmark from n = 10 to n ≥ 50 dialogues for a stronger robustness claim.
- Complete broader-distribution validation on Antijection and OWASP v2 splits beyond the primary deepset benchmark.
- Add an adversarial evasion test suite (paraphrase, encoding, and token-level perturbations) to probe Layer 2 robustness.
- Re-run literature baselines under our infrastructure to replace approximate cross-system comparisons with head-to-head numbers.
- Grow the Layer 1 rule pack with additional OWASP-mapped patterns and per-rule false-positive regression tests.
- Python 3.10+
- A Hugging Face account (Read token + accepted Sentinel v2 license)
- (optional, L3 judge) An NVIDIA NIM API key
Want to see the engine run in 30 seconds without any token or endpoint? Layer 1 (the deterministic OWASP-mapped regex engine) needs no credentials and no network:
python3 -m venv .venv-pint && source .venv-pint/bin/activate
pip install -r scripts/requirements-pint.txt && pip install -r tup-manager/requirements.txt
python scripts/smoke_l1.pyExpected output — benign prompts pass, attacks fire with a traceable rule_id:
[PASS] benign | alert=False (rule: —)
[PASS] attack | alert=True (rule: tup-rule-0001, tup-rule-0009, tup-rule-0011)
...
RESULT: all 5 cases matched — Layer 1 engine is working
The full pipeline (Layer 2 Sentinel v2 + optional L3 judge) needs the steps below.
python3 -m venv .venv-pint && source .venv-pint/bin/activate
pip install -r scripts/requirements-pint.txt
pip install -r tup-manager/requirements.txtcp notebooks/.env.pint.example .envThen edit .env with your secrets (see Configuration Reference):
SENTINEL_API_KEY=hf_... # HF token (Read scope, license accepted)
HF_INFERENCE_ENDPOINT=https://xxxxx.aws.endpoints.huggingface.cloud
NVIDIA_JUDGE_API_KEY=nvapi-... # optional — only for the L3 judge
DETECTION_MODE=benchmark # or: production
BENIGN_GUARD_ENABLED=false # true for productionWarning:
.envholds live secrets and is already in.gitignore— never commit it.
Sentinel v2 is a gated model — accept the license first.
- Accept at rogue-security/prompt-injection-jailbreak-sentinel-v2
- Create an endpoint at ui.endpoints.huggingface.co/new
- Model:
rogue-security/prompt-injection-jailbreak-sentinel-v2 - Task: Text Classification · Instance: CPU · Scale-to-zero: ON
- Model:
- Paste the endpoint URL into
.env(HF_INFERENCE_ENDPOINT)
python scripts/verify_hf_endpoint.py# Automated (import deepset + benchmark)
./scripts/run_sentinel_tier_b.sh
# Manual
python scripts/import_external_dataset.py --preset deepset \
--out notebooks/data/external/deepset.yaml
python scripts/run_pint_benchmark.py \
--dataset notebooks/data/external/deepset.yaml \
--detection-mode benchmark \
--results-out notebooks/data/external/results/deepset-sentinel.jsonpytest tup-manager/tests/ -vTUP-detection/
├── tup-manager/ # Detection engine (TUP Manager core)
│ ├── tup_manager/
│ │ ├── detection_engine.py # Pipeline orchestration (M1–M5)
│ │ ├── injection_classifier.py # Sentinel v2 backend (hf / local)
│ │ ├── prompt_segments.py # Context/user segment parser
│ │ ├── text_normalize.py # Input normalization (M1)
│ │ ├── rules_engine.py # L1 regex dispatch
│ │ ├── benign_guard.py # FP suppression for production
│ │ ├── ensemble_classifier.py # Optional Llama Prompt Guard 2
│ │ └── nvidia_judge_engine.py # Optional L3 LLM judge (NVIDIA NIM)
│ └── tests/ # Unit tests (pytest)
│
├── policies/
│ └── rules/ # OWASP-mapped regex rules (YAML)
│
├── scripts/
│ ├── run_pint_benchmark.py # Main benchmark + detection modes
│ ├── run_stack_ablation_benchmark.py
│ ├── run_crescendo_benchmark.py
│ ├── import_external_dataset.py # deepset / OWASP v2 / Antijection
│ ├── verify_hf_endpoint.py # Endpoint smoke test
│ └── requirements-pint.txt
│
└── notebooks/
├── benchmark.ipynb
├── tup_detection_guard_benchmark_report.ipynb
├── tier_b_guard_comparison.ipynb
├── data/external/results/ # Frozen benchmark JSON results
└── .env.pint.example
| Variable | Default | Description |
|---|---|---|
SENTINEL_API_KEY |
— | HF token (also accepted as HF_TOKEN) |
HF_INFERENCE_ENDPOINT |
— | Deployed Sentinel v2 endpoint URL |
DETECTION_MODE |
production |
benchmark or production |
INJECTION_THRESHOLD |
0.5 |
Production threshold τ |
INJECTION_THRESHOLD_STRICT |
0.15 |
Benchmark threshold τ |
BENIGN_GUARD_ENABLED |
true |
FP suppression for doc-like inputs |
INJECTION_FAIL_OPEN |
true |
On inference failure: true → benign (0.0), false → malicious (1.0) |
HF_INFERENCE_TIMEOUT |
180 |
Seconds per request before retry |
HF_INFERENCE_RETRIES |
5 |
Max retry attempts (scale-to-zero cold start) |
DETECTION_JUDGE_ENABLED |
auto |
Enable the L3 LLM judge |
NVIDIA_JUDGE_API_KEY |
— | NVIDIA NIM key for the L3 judge |
NVIDIA_JUDGE_MODEL |
meta/llama-3.1-8b-instruct |
Judge model |
JUDGE_THRESHOLD |
0.65 |
Judge decision threshold |
| Symptom | Fix |
|---|---|
401 / 403 |
Token scope or model license not accepted |
503 / model not supported |
Use a dedicated Inference Endpoint, not the serverless free tier |
Score always 0 |
Endpoint not Running or wrong URL |
| Slow first request | Scale-to-zero cold start — a warmup request is sent automatically |
Contributions from the AI-safety and LLM-security community are welcome — new detection rules, benchmarks, and fixes. See CONTRIBUTING.md for how to add a Layer 1 YAML rule and how to run the tests before opening a PR. The quickest way in is the credential-free smoke test:
python scripts/smoke_l1.pyIf you use TUP Detection in your research or build on its benchmarks, please cite it. A machine-readable CITATION.cff is included in the repository root (GitHub's "Cite this repository" button uses it).
Research submission — Apart Research · Global South 2026:
@misc{tup_detection_apart_2026,
title = {(HckPrj) TUP Detection: Hybrid Prompt-Injection Guard for AI Generative Security Monitoring},
author = {Jorge Enrique Vargas Pech and Jose Luis Rej{\'o}n Quintal and William Emmanuel Fern{\'a}ndez Castillo and Sa{\'u}l Ruiz Pe{\~n}a},
date = {2026-06-22},
organization = {Apart Research},
note = {Research submission to the research sprint hosted by Apart.},
howpublished = {\url{https://apartresearch.com/project/tup-detection-hybrid-promptinjection-guard-for-ai-generative-security-monitoring-r4w6}}
}Software:
@software{tup_detection_2026,
author = {Vargas Pech, Jorge Enrique and Fern{\'a}ndez Castillo, William Emmanuel and Ruiz Pe{\~n}a, Sa{\'u}l and Rej{\'o}n Quintal, Jose Luis},
title = {TUP Detection: A Hybrid Tier-B Prompt-Injection Engine for the TUP AIGSMP Platform},
year = {2026},
url = {https://github.com/notyorch/TUP-detection},
note = {Detection engine of the TUP AI Governance \& Security Monitoring Platform (AIGSMP)}
}Built by Jorge Vargas Pech, William Fernández Castillo, Saúl Ruiz Peña, and Jose Luis Rejón Quintal as the detection module of TUP-fullstack.
Powered by Sentinel v2, evaluated on deepset/prompt-injections and Crescendo multi-turn attacks.
This project is licensed under the MIT License — see the LICENSE file for details.





