notyorch/TUP-detection: Hybrid prompt-injection detection engine (regex · Sentinel v2 · LLM judge) with 95.1% PINT balanced accuracy. Detection core of TUP AIGSMP. Built at Apart Research Global South Hackathon 2026.

Hybrid Prompt-Injection Guard — Detection Engine of the TUP AIGSMP

About • Architecture • Results • Evaluation • Limitations • Roadmap • Getting Started • Contributing • Structure • Configuration

About the Project

TUP Detection is the prompt-injection detection engine of TUP — an enterprise-grade, open-core AI Governance and Security Monitoring Platform (AIGSMP). It is the analytical core that powers the TUP Manager service: a hybrid pipeline that evaluates LLM inputs and outputs against deterministic policies and neural classifiers, then emits structured, OWASP-mapped security alerts to the wider platform.

Unlike traditional SIEM detection rules designed for OS-level or network-level events, TUP Detection operates directly at the intelligence layer — scoring prompts, system instructions, and model responses to catch jailbreaks, instruction overrides, and multi-turn adversarial steering.

This repository is the standalone detection & benchmarking module. It plugs into the full platform at notyorch/TUP-fullstack.

This work was submitted to Apart Research · Global South 2026. See Citation for how to reference it.

Core Architecture

TUP Detection is the brain of the TUP Manager within the wider AIGSMP: the Collector intercepts LLM telemetry, the Manager scores it, the Indexer persists alerts, and the Dashboard surfaces them.

Within the Manager, the engine runs a multi-layer, fail-safe pipeline (M1–M5). Cheap deterministic checks run first; the neural classifier only runs when no rule fires, and an optional LLM judge arbitrates the gray zone.

flowchart TD
    X(["Prompt / LLM output"])

    X --> M1
    M1["M1 Normalize + segment<br/><i>text_normalize · prompt_segments</i>"]
    M1 --> M2
    M2["M2 Build variant set V(x)<br/><i>raw · normalized · per-segment</i>"]

    M2 --> M3
    M3["M3 L1 Regex policy<br/><i>policies/rules/</i>"]
    M3 -- hit --> ALERT["ALERT<br/><i>rule_id, OWASP-mapped</i>"]
    M3 -- no hit --> M4

    M4["M4 Sentinel v2 — max s(v) over V(x)<br/><i>injection_classifier (HF endpoint)</i>"]
    M4 -- gray zone --> L3
    L3["L3 LLM judge (optional)<br/><i>nvidia_judge_engine</i>"]
    L3 -.-> M4

    M4 --> M5
    M5["M5 Threshold τ → verdict → structured alert → TUP-fullstack"]

Layer	Component	Characteristics
L1	OWASP-mapped regex (`policies/rules/`)	Deterministic, negligible latency (no network call), traceable `rule_id`
L2	Sentinel v2 (HF Inference Endpoint)	Neural classifier, paraphrase-robust, no fine-tuning
L3 (optional)	LLM judge (NVIDIA NIM — Llama 3.1)	Gray-zone arbitration for `s ∈ [0.15, 0.85]`
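
As a rough illustration of the M1–M5 flow above, the sketch below wires the stages together in plain Python. It is not the code in `detection_engine.py`: the rule `demo-rule-override`, the helper names (`normalize`, `variants`, `detect`), the line-based segmentation, the `>=` comparison against τ, and the convention that the judge returns a replacement score are all assumptions made for the example. The `score` callable stands in for the Sentinel v2 endpoint.

```python
import re
import unicodedata

# Hypothetical single-rule pack; the real rules are YAML files in policies/rules/.
RULES = {
    "demo-rule-override": re.compile(
        r"ignore (all )?(previous|prior) instructions", re.IGNORECASE
    ),
}


def normalize(text: str) -> str:
    """M1 (assumed): fold Unicode compatibility forms and collapse whitespace."""
    folded = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", folded).strip()


def variants(text: str) -> list[str]:
    """M2 (assumed): raw text, normalized text, and one variant per line."""
    candidates = [text, normalize(text)]
    candidates += [normalize(line) for line in text.splitlines() if line.strip()]
    return list(dict.fromkeys(candidates))  # de-duplicate while keeping order


def detect(text, score, tau=0.50, judge=None, gray_zone=(0.15, 0.85)) -> dict:
    """Run the layered check; `score` and `judge` are caller-supplied callables."""
    vs = variants(text)

    # M3: a rule hit on any variant short-circuits to an alert.
    hits = [rid for rid, pat in RULES.items() if any(pat.search(v) for v in vs)]
    if hits:
        return {"alert": True, "layer": "L1", "rule_ids": hits, "score": None}

    # M4: the classifier score is the maximum over the variant set V(x).
    s = max(score(v) for v in vs)
    layer = "L2"

    # L3 (optional): the judge arbitrates only inside the gray zone.
    if judge is not None and gray_zone[0] <= s <= gray_zone[1]:
        s = judge(text, s)
        layer = "L3"

    # M5: threshold the final score into a verdict.
    return {"alert": s >= tau, "layer": layer, "rule_ids": [], "score": s}
```

Calling `detect("Please IGNORE previous   instructions", score=lambda v: 0.0)` fires the demo rule on the normalized variant even though the raw text, with its repeated spaces, does not match. That is the reason for scoring a variant set rather than the raw input alone.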

The engine monitors both inputs and outputs — bidirectional scoring catches attacks that are only observable after the model has been steered.

Detection modes

Mode	τ	Benign guard	Use
`benchmark`	0.15	off	Tier-B evaluation / max recall
`production`	0.50	on	Live traffic / FP suppression
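
The two modes can be pictured as presets over the variables listed in the Configuration Reference. The sketch below is one plausible way to resolve them, not the engine's actual logic: `ModeSettings` and `resolve_mode` are hypothetical names, and the assumption that explicitly set environment variables override the preset is ours.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ModeSettings:
    tau: float          # decision threshold
    benign_guard: bool  # FP suppression on/off


# Presets taken from the detection-modes table.
MODES = {
    "benchmark": ModeSettings(tau=0.15, benign_guard=False),
    "production": ModeSettings(tau=0.50, benign_guard=True),
}


def resolve_mode(env: Mapping[str, str] = os.environ) -> ModeSettings:
    """Pick the preset for DETECTION_MODE, then apply explicit overrides (assumed)."""
    mode = env.get("DETECTION_MODE", "production").strip().lower()
    if mode not in MODES:
        raise ValueError(f"unknown DETECTION_MODE: {mode!r}")
    preset = MODES[mode]

    # Benchmark mode reads the strict threshold, production the regular one.
    key = "INJECTION_THRESHOLD_STRICT" if mode == "benchmark" else "INJECTION_THRESHOLD"
    tau = float(env.get(key, preset.tau))

    raw_guard = env.get("BENIGN_GUARD_ENABLED")
    if raw_guard is None:
        guard = preset.benign_guard
    else:
        guard = raw_guard.strip().lower() in {"1", "true", "yes"}

    return ModeSettings(tau=tau, benign_guard=guard)
```

With an empty environment this resolves to the `production` preset (τ = 0.50, guard on); setting only `DETECTION_MODE=benchmark` yields τ = 0.15 with the guard off.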

Results

Primary benchmark: deepset/prompt-injections (n = 662: 263 attack and 399 benign samples, Tier B). Metric: PINT balanced accuracy = ½ (attack recall + benign specificity).

System	PINT Balanced Accuracy
TUP + DeBERTa (legacy baseline)	72.4%
Sentinel v2 (model card, indirect)	~88%
TUP + Sentinel v2 (this repo)	95.1%

Stack ablation on deepset (τ = 0.15):

Stack	PINT	Attack recall	Benign pass	TP	FN	FP
L1 only	58.4%	17.9%	99.0%	47	216	4
Sentinel only	95.1%	93.2%	97.0%	245	18	12
Hybrid	95.1%	94.3%	96.0%	248	15	16

PINT is rounded to one decimal. Unrounded, Sentinel only scores 95.07% and Hybrid 95.14%: the hybrid stack trades 4 more false positives for 3 more true positives, so both rows land at 95.1%.
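
The ablation rows can be re-derived from the confusion counts alone. The short script below applies the balanced-accuracy formula to the TP / FN / FP columns; the benign total of 399 is inferred as 662 minus the 263 attack samples, and the function name is ours rather than something exported by the benchmark scripts.

```python
def pint_balanced_accuracy(tp: int, fn: int, fp: int, tn: int) -> float:
    """Balanced accuracy = 0.5 * (attack recall + benign specificity)."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (recall + specificity)


N_ATTACK, N_BENIGN = 263, 399  # 263 + 399 = 662 samples

# (TP, FN, FP) per stack, copied from the ablation table.
ABLATION = {
    "L1 only": (47, 216, 4),
    "Sentinel only": (245, 18, 12),
    "Hybrid": (248, 15, 16),
}

for name, (tp, fn, fp) in ABLATION.items():
    assert tp + fn == N_ATTACK
    tn = N_BENIGN - fp  # benign samples that were correctly passed
    print(f"{name:14s} {100 * pint_balanced_accuracy(tp, fn, fp, tn):.2f}%")

# Prints:
# L1 only        58.43%
# Sentinel only  95.07%
# Hybrid         95.14%
```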

On Crescendo multi-turn adversarial dialogues (n = 10), full-transcript scoring achieves 100% attack recall.

Frozen score caches in notebooks/data/external/results/ reproduce all metrics without re-querying the inference endpoint.

Evaluation

Test 1 — Single-turn injection detection (deepset, n = 662)

The stack ablation compares Layer 1 alone, Sentinel v2 alone, and the hybrid stack on the metrics tabulated under Results. Layer 1 alone protects benign traffic (99% pass rate) but catches only 18% of attacks. Sentinel v2 alone provides strong recall (93.2%). The hybrid keeps the same rounded PINT accuracy (95.1%) while adding 3 explainable catches that Sentinel misses, each with traceable rule_id attribution.

Test 2 — Comparison against public baselines

TUP + Sentinel v2 is compared against our legacy TUP + DeBERTa stack and publicly reported baselines. The two measured stacks (TUP + Sentinel v2, TUP + DeBERTa) are evaluated on our identical deepset YAML split; the literature values (Sentinel v2 model card, ProtectAI DeBERTa) were reported under different conditions (see Limitations).

Test 3 — Layer complementarity (what each layer catches)

Among the 263 attack samples, the two layers are complementary: 201 detected by Sentinel alone, 44 by both, and 3 exclusively by Layer 1 — those 3 carry rule_id attribution traceable to OWASP-mapped patterns in policies/rules/, something no classifier provides. Adding Layer 1 recovers them at a cost of +4 FP over Sentinel alone.
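
The overlap figures are plain set arithmetic over the per-layer hit lists. A minimal sketch, using synthetic sample IDs arranged to match the reported counts (the real IDs live in the frozen result files, whose schema is not assumed here):

```python
def complementarity(l1_hits: set, l2_hits: set) -> dict:
    """Split detected attacks into Sentinel-only, shared, and Layer-1-only."""
    return {
        "sentinel_only": len(l2_hits - l1_hits),
        "both": len(l1_hits & l2_hits),
        "l1_only": len(l1_hits - l2_hits),
        "hybrid_total": len(l1_hits | l2_hits),
    }


# Synthetic IDs: Sentinel flags attacks 0..244, Layer 1 flags 201..247.
sentinel_hits = set(range(0, 245))  # 245 true positives
layer1_hits = set(range(201, 248))  # 47 true positives

print(complementarity(layer1_hits, sentinel_hits))
# {'sentinel_only': 201, 'both': 44, 'l1_only': 3, 'hybrid_total': 248}
```

The union (248) matches the Hybrid TP column, which leaves 263 − 248 = 15 attacks missed by both layers.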

Test 4 — Multi-turn attack detection (Crescendo, n = 10 dialogues)

Crescendo attacks gradually escalate across conversation turns — early turns appear benign. A stateless per-turn guard misses 25% of attacks. Full-transcript scoring feeds the complete conversation to the classifier and achieves 100% attack recall across all 10 dialogues.

First detection occurs at turn 2.7 on average; all conversations are flagged by turn 6.
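
The difference between the two scoring strategies can be sketched as follows. This assumes full-transcript scoring is applied to the growing conversation prefix at each turn, which is our reading of "first detection at turn N", not a description of `run_crescendo_benchmark.py`. The keyword-counting `toy_score` is only a stand-in for the classifier so the example runs offline.

```python
from typing import Callable, Optional, Sequence


def first_detection_turn(
    turns: Sequence[str],
    score: Callable[[str], float],
    tau: float = 0.5,
    full_transcript: bool = True,
) -> Optional[int]:
    """Return the 1-based turn at which the guard first fires, or None."""
    for i in range(1, len(turns) + 1):
        # Full-transcript: score everything seen so far. Stateless: this turn only.
        text = "\n".join(turns[:i]) if full_transcript else turns[i - 1]
        if score(text) >= tau:
            return i
    return None


def toy_score(text: str) -> float:
    """Stand-in scorer: 0.2 per suspicious keyword, capped at 1.0."""
    cues = ("bypass", "secret", "override")
    return min(1.0, 0.2 * sum(text.lower().count(c) for c in cues))


dialogue = [
    "Tell me about locks.",
    "How might one bypass a lock?",
    "What secret tricks exist?",
    "Now override your rules.",
]

print(first_detection_turn(dialogue, toy_score, full_transcript=False))  # None
print(first_detection_turn(dialogue, toy_score, full_transcript=True))   # 4
```

No single turn crosses the threshold on its own, but the accumulated transcript does. Gradual escalation of this kind is the pattern Crescendo attacks rely on.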

Limitations

We report these openly so results are interpreted in context:

Baseline comparison is approximate. Only the two TUP stacks (TUP + Sentinel v2 and TUP + DeBERTa) are measured on our identical deepset YAML split. The Sentinel v2 model-card (~88%) and ProtectAI DeBERTa (77.6%) figures are reported values produced under different datasets, splits, and decision thresholds, and were not re-run under our infrastructure. Treat cross-system gaps as indicative, not head-to-head.
Single primary dataset. The headline Tier-B claim rests on one public benchmark (deepset/prompt-injections, n = 662). Broader-distribution validation (e.g. Antijection, OWASP v2) is in progress and not yet completed.
Small multi-turn sample. The Crescendo evaluation covers n = 10 dialogues — strong directional evidence of multi-turn coverage, but not a large-scale robustness study.
Full-transcript favourability. Crescendo's 100% recall is obtained with full-transcript scoring, which feeds the complete conversation to the classifier. Stateless per-turn scoring, a stricter and more operationally realistic setting, misses 25% of attacks (see Test 4).
Endpoint dependency. Live scoring requires the gated Sentinel v2 HF Inference Endpoint. The frozen score caches reproduce all reported metrics offline, but new inputs need the deployed model.

Roadmap

These map directly to the Limitations above and are tracked as GitHub issues — contributions welcome (see Contributing):

Expand the Crescendo multi-turn benchmark from n = 10 to n ≥ 50 dialogues for a stronger robustness claim.
Complete broader-distribution validation on Antijection and OWASP v2 splits beyond the primary deepset benchmark.
Add an adversarial evasion test suite (paraphrase, encoding, and token-level perturbations) to probe Layer 2 robustness.
Re-run literature baselines under our infrastructure to replace approximate cross-system comparisons with head-to-head numbers.
Grow the Layer 1 rule pack with additional OWASP-mapped patterns and per-rule false-positive regression tests.

Getting Started

Prerequisites

Python 3.10+
A Hugging Face account (Read token + accepted Sentinel v2 license)
(optional, L3 judge) An NVIDIA NIM API key

Quick smoke test (no credentials)

Want to see the engine run in 30 seconds without any token or endpoint? Layer 1 (the deterministic OWASP-mapped regex engine) needs no credentials and no network:

python3 -m venv .venv-pint && source .venv-pint/bin/activate
pip install -r scripts/requirements-pint.txt && pip install -r tup-manager/requirements.txt

python scripts/smoke_l1.py

Expected output — benign prompts pass, attacks fire with a traceable rule_id:

[PASS] benign | alert=False (rule: —)
[PASS] attack | alert=True  (rule: tup-rule-0001, tup-rule-0009, tup-rule-0011)
...
RESULT: all 5 cases matched — Layer 1 engine is working

The full pipeline (Layer 2 Sentinel v2 + optional L3 judge) needs the steps below.

1. Install

python3 -m venv .venv-pint && source .venv-pint/bin/activate
pip install -r scripts/requirements-pint.txt
pip install -r tup-manager/requirements.txt

2. Configure

cp notebooks/.env.pint.example .env

Then edit .env with your secrets (see Configuration Reference):

SENTINEL_API_KEY=hf_...                 # HF token (Read scope, license accepted)
HF_INFERENCE_ENDPOINT=https://xxxxx.aws.endpoints.huggingface.cloud
NVIDIA_JUDGE_API_KEY=nvapi-...          # optional — only for the L3 judge

DETECTION_MODE=benchmark                # or: production
BENIGN_GUARD_ENABLED=false              # true for production

Warning: .env holds live secrets and is already in .gitignore — never commit it.

3. Deploy the Sentinel v2 Inference Endpoint

Sentinel v2 is a gated model — accept the license first.

Accept at rogue-security/prompt-injection-jailbreak-sentinel-v2
Create an endpoint at ui.endpoints.huggingface.co/new
- Model: rogue-security/prompt-injection-jailbreak-sentinel-v2
- Task: Text Classification · Instance: CPU · Scale-to-zero: ON
Paste the endpoint URL into .env (HF_INFERENCE_ENDPOINT)

4. Verify

python scripts/verify_hf_endpoint.py
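
For orientation, a manual check against a Hugging Face Inference Endpoint amounts to one authenticated POST with an `{"inputs": ...}` body. The sketch below is not the contents of `verify_hf_endpoint.py`. The helper names are ours, the positive label string must be taken from the Sentinel v2 model card (none is assumed here), and treating a lone non-positive label as `1 - score` is an assumption that only holds for a two-class model.

```python
import json
import urllib.request


def build_request(endpoint: str, token: str, text: str) -> urllib.request.Request:
    """Prepare the POST request; nothing is sent until urlopen() is called."""
    body = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def injection_score(payload: list, positive_label: str) -> float:
    """Extract the positive-class probability from a text-classification reply."""
    # Replies may be a flat list of {label, score} rows or nested one level deeper.
    rows = payload[0] if payload and isinstance(payload[0], list) else payload
    for row in rows:
        if row["label"].lower() == positive_label.lower():
            return float(row["score"])
    if len(rows) == 1:
        # Only the top label came back and it is the other class (binary assumption).
        return 1.0 - float(rows[0]["score"])
    raise ValueError(f"label {positive_label!r} not found in response")


# Usage (requires a deployed endpoint and the values from .env):
# req = build_request(HF_INFERENCE_ENDPOINT, SENTINEL_API_KEY, "hello")
# with urllib.request.urlopen(req, timeout=180) as resp:
#     print(injection_score(json.load(resp), positive_label="<label from model card>"))
```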

5. Run the benchmark

# Automated (import deepset + benchmark)
./scripts/run_sentinel_tier_b.sh

# Manual
python scripts/import_external_dataset.py --preset deepset \
  --out notebooks/data/external/deepset.yaml

python scripts/run_pint_benchmark.py \
  --dataset notebooks/data/external/deepset.yaml \
  --detection-mode benchmark \
  --results-out notebooks/data/external/results/deepset-sentinel.json

Run the test suite

pytest tup-manager/tests/ -v

Repository Structure

TUP-detection/
├── tup-manager/                    # Detection engine (TUP Manager core)
│   ├── tup_manager/
│   │   ├── detection_engine.py     # Pipeline orchestration (M1–M5)
│   │   ├── injection_classifier.py # Sentinel v2 backend (hf / local)
│   │   ├── prompt_segments.py      # Context/user segment parser
│   │   ├── text_normalize.py       # Input normalization (M1)
│   │   ├── rules_engine.py         # L1 regex dispatch
│   │   ├── benign_guard.py         # FP suppression for production
│   │   ├── ensemble_classifier.py  # Optional Llama Prompt Guard 2
│   │   └── nvidia_judge_engine.py  # Optional L3 LLM judge (NVIDIA NIM)
│   └── tests/                      # Unit tests (pytest)
│
├── policies/
│   └── rules/                      # OWASP-mapped regex rules (YAML)
│
├── scripts/
│   ├── run_pint_benchmark.py       # Main benchmark + detection modes
│   ├── run_stack_ablation_benchmark.py
│   ├── run_crescendo_benchmark.py
│   ├── import_external_dataset.py  # deepset / OWASP v2 / Antijection
│   ├── verify_hf_endpoint.py       # Endpoint smoke test
│   └── requirements-pint.txt
│
└── notebooks/
    ├── benchmark.ipynb
    ├── tup_detection_guard_benchmark_report.ipynb
    ├── tier_b_guard_comparison.ipynb
    ├── data/external/results/      # Frozen benchmark JSON results
    └── .env.pint.example

Configuration Reference

Variable	Default	Description
`SENTINEL_API_KEY`	—	HF token (also accepted as `HF_TOKEN`)
`HF_INFERENCE_ENDPOINT`	—	Deployed Sentinel v2 endpoint URL
`DETECTION_MODE`	`production`	`benchmark` or `production`
`INJECTION_THRESHOLD`	`0.5`	Production threshold τ
`INJECTION_THRESHOLD_STRICT`	`0.15`	Benchmark threshold τ
`BENIGN_GUARD_ENABLED`	`true`	FP suppression for doc-like inputs
`INJECTION_FAIL_OPEN`	`true`	On inference failure: `true` → benign (0.0), `false` → malicious (1.0)
`HF_INFERENCE_TIMEOUT`	`180`	Seconds per request before retry
`HF_INFERENCE_RETRIES`	`5`	Max retry attempts (scale-to-zero cold start)
`DETECTION_JUDGE_ENABLED`	`auto`	Enable the L3 LLM judge
`NVIDIA_JUDGE_API_KEY`	—	NVIDIA NIM key for the L3 judge
`NVIDIA_JUDGE_MODEL`	`meta/llama-3.1-8b-instruct`	Judge model
`JUDGE_THRESHOLD`	`0.65`	Judge decision threshold

Troubleshooting

Symptom	Fix
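`401` / `403`	Check that the token has Read scope and that the Sentinel v2 license has been accepted
`503` / model not supported	Use a dedicated Inference Endpoint, not the serverless free tier
Score always `0`	Check that the endpoint is Running and `HF_INFERENCE_ENDPOINT` is correct; with `INJECTION_FAIL_OPEN=true`, failed requests are scored 0.0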
Slow first request	Scale-to-zero cold start — a warmup request is sent automatically

Contributing

Contributions from the AI-safety and LLM-security community are welcome — new detection rules, benchmarks, and fixes. See CONTRIBUTING.md for how to add a Layer 1 YAML rule and how to run the tests before opening a PR. The quickest way in is the credential-free smoke test:

python scripts/smoke_l1.py

Citation

If you use TUP Detection in your research or build on its benchmarks, please cite it. A machine-readable CITATION.cff is included in the repository root (GitHub's "Cite this repository" button uses it).

Research submission — Apart Research · Global South 2026:

@misc{tup_detection_apart_2026,
  title        = {(HckPrj) TUP Detection: Hybrid Prompt-Injection Guard for AI Generative Security Monitoring},
  author       = {Jorge Enrique Vargas Pech and Jose Luis Rej{\'o}n Quintal and William Emmanuel Fern{\'a}ndez Castillo and Sa{\'u}l Ruiz Pe{\~n}a},
  date         = {2026-06-22},
  organization = {Apart Research},
  note         = {Research submission to the research sprint hosted by Apart.},
  howpublished = {\url{https://apartresearch.com/project/tup-detection-hybrid-promptinjection-guard-for-ai-generative-security-monitoring-r4w6}}
}

Software:

@software{tup_detection_2026,
  author  = {Vargas Pech, Jorge Enrique and Fern{\'a}ndez Castillo, William Emmanuel and Ruiz Pe{\~n}a, Sa{\'u}l and Rej{\'o}n Quintal, Jose Luis},
  title   = {TUP Detection: A Hybrid Tier-B Prompt-Injection Engine for the TUP AIGSMP Platform},
  year    = {2026},
  url     = {https://github.com/notyorch/TUP-detection},
  note    = {Detection engine of the TUP AI Governance \& Security Monitoring Platform (AIGSMP)}
}

Authors & Acknowledgements

Built by Jorge Vargas Pech, William Fernández Castillo, Saúl Ruiz Peña, and Jose Luis Rejón Quintal as the detection module of TUP-fullstack.

Powered by Sentinel v2, evaluated on deepset/prompt-injections and Crescendo multi-turn attacks.

License

This project is licensed under the MIT License — see the LICENSE file for details.
