Multi-classifier behavioral analysis engine for LLM responses.
PSA-core is the standalone engine that powers PSA. It classifies every AI response into behavioral postures, then derives metrics from posture sequences to detect adversarial stress, sycophancy, hallucination risk, persuasion techniques, input pressure, and agentic behavioral drift — in real time.
For the full web application (FastAPI, dashboards, billing, REST API), see the PSA repository.
| Component | Function |
|---|---|
| PSA v2 | 7 micro-classifiers (C0–C4, C3-v3, CA), DRM session-level risk engine, SIGTRACK v2 incident archive, CPF3 behavioral snapshot analysis |
| PSA Human Layer | Longitudinal behavioral profile of the human (Layers 1–4), built across sessions |
| PSA v3 | Multi-agent analysis — Swiss Cheese detection (SCS), contagion metrics (PPI, CAHS, WLS, CER, AGM), action-risk classification (C5/PAI), HMM temporal prediction, swarm coordination, corpus-wide intelligence |
| PSA-RAG (RDM) | Retrieval Drift Monitor — detects context-biased RAG retrieval (FPC + RDS) for legal, health, finance |
| Browser Extension | Chrome MV3 — real-time PSA monitoring + PSA Legal extension (RDM-powered) |
Get an API key from splabs.io/settings (requires a Pro or Enterprise plan).
curl -X POST https://splabs.io/api/v2/psa/analyze \
-H "Authorization: Bearer psa_your_key" \
-H "Content-Type: application/json" \
-d '{"response_text": "Of course, I would be happy to help!", "dry_run": true}'

{
"c1": { "postures": [5], "poi": 0.0, "pe": 0.0, "dpi": 0.31, "mps": 0 },
"c2": { "postures": [2], "sd": 0.82 },
"c3": { "postures": [0], "hri": 0.0 },
"c4": { "postures": [0], "pd": 0.0, "td": 0 },
"bhs": 0.67,
"alert": "yellow",
"dry_run": true
}

See API.md for the full endpoint reference.
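The same call can be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_analyze_request` and `analyze` are hypothetical, and only the endpoint, headers, and the two payload fields come from the curl example above.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://splabs.io/api/v2/psa/analyze"


def build_analyze_request(api_key: str, response_text: str,
                          dry_run: bool = True) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    payload = json.dumps({"response_text": response_text, "dry_run": dry_run})
    return urllib.request.Request(
        API_URL,
        data=payload.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def analyze(api_key: str, response_text: str, dry_run: bool = True) -> dict:
    """Send the request and decode the JSON body (needs network and a valid key)."""
    request = build_analyze_request(api_key, response_text, dry_run)
    with urllib.request.urlopen(request, timeout=10) as reply:
        return json.load(reply)
```

Calling `analyze("psa_your_key", "Of course, I would be happy to help!")` should return a dictionary shaped like the JSON above, so `result["bhs"]` and `result["alert"]` can be read directly.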
Micro-classifiers sharing a fine-tuned MiniLM embedding backbone (384-dim, L2-normalised, ONNX runtime):
| ID | Name | Code prefix | Classes | Classifies | Detects |
|---|---|---|---|---|---|
| C0 | Input Pressure | I0–I9 | 10 | User messages | Override commands, authority claims, emotional loading, jailbreak attempts |
| C1 | Adversarial Stress | P0–P20 | 21 | Model responses | Boundary erosion — RESTRICT vs. CONCEDE vs. SOFT posture |
| C2 | Sycophancy Delta | S0–S9 | 10 | Model responses | Agreement creep, validation seeking, opinion mirroring |
| C3 | Hallucination Risk | H0–H7 | 8 | Model responses | Over-specification, phantom attribution, confidence-hedge mismatch |
| C4 | Persuasion Density | M0–M11 | 12 | Model responses | Framing, anchoring, authority, social proof, scarcity, reciprocity |
| C3-v3 | Agentic Behavioral Stability | G0–G10 | 11 | Agent turns | Boundary dissolution, role capture, epistemic overconfidence, conceptual substitution |
| CA | Inter-Agent Pressure | A0–A11 | 12 | Agent-to-agent messages | Authority spoofing, constraint removal, cascade amplification, anomaly suppression |
H-layer (user-side classifiers, used in Human Profile feature):
| ID | Code prefix | Classes | Detects |
|---|---|---|---|
| H2 | 0–5 | 6 | Relational dynamics — validation seeking, agency erosion, dependency |
| H3 | 0–4 | 5 | Cognitive patterns — rigidity, reality anchoring, distortion, semantic compression |
| H4 | 0–3 | 4 | Social dynamics — legibility adaptation, reciprocity expectation, social substitution |
| H5 | 0–3 | 4 | Adversarial patterns — manipulation, ideological drift, radicalization |
sentence → MiniLM encoder (ONNX / ST fallback) → 384-dim embedding
→ MLP head (2–3 layers) → softmax → (label, confidence)
- ONNX path: `encoder.onnx` + `{clf}_head.npz` (< 1 ms/sentence)
- Fallback: `sentence-transformers` from HuggingFace
- All heads use a minimum 2-layer MLP; C3-v3 uses 3-layer (512→256→11)
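To make the head stage concrete, here is a dependency-free sketch of the embedding → MLP → softmax → (label, confidence) step. It is illustrative only: the ReLU activation, the hidden size of 64, the random weights, and the function names are assumptions, and the real heads are loaded from `{clf}_head.npz` rather than generated.

```python
import math
import random


def l2_normalise(vector):
    """Scale a vector to unit length (a no-op for already-normalised input)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def linear(x, weights, bias):
    """y = W x + b, with weights stored as one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(embedding, layers, labels):
    """Run an MLP head: ReLU between layers (assumed), softmax on the last."""
    hidden = l2_normalise(embedding)
    for index, (weights, bias) in enumerate(layers):
        hidden = linear(hidden, weights, bias)
        if index < len(layers) - 1:
            hidden = [max(0.0, h) for h in hidden]
    probs = softmax(hidden)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


def random_layer(rng, n_in, n_out):
    """Placeholder weights; a real head would be loaded from disk."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
embedding = [rng.gauss(0.0, 1.0) for _ in range(384)]   # stand-in for MiniLM output
layers = [random_layer(rng, 384, 64), random_layer(rng, 64, 10)]
labels = [f"S{i}" for i in range(10)]                   # C2 codes S0-S9
label, confidence = classify(embedding, layers, labels)
```

Because every head consumes the same 384-dim embedding, the encoder runs once per sentence and only the small head differs per classifier.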
All metrics returned per turn by POST /api/v2/psa/analyze:
| Metric | Full Name | Range | Description |
|---|---|---|---|
| BHS | Behavioral Health Score | 0–1 | Per-turn composite health. Low = degraded. 1 − (0.4×POI + 0.2×SD + 0.2×HRI + 0.2×PD×TD) |
| POI | Posture Oscillation Index | 0–1 | Variability of C1 postures across turns. High = unstable — no stable boundary. |
| PE | Posture Entropy | 0 to log₂(N) | Shannon entropy of the posture distribution. Low = postures concentrated in a few classes (normal or post-dissolution); high = active stress. |
| DPI | Dissolution Position Index | 0–1 | Normalised mean ordinal position of CONCEDE/RESTRICT postures. 0 = no concession; ≥ 0.53 = active dissolution. |
| MPS | Max Posture Span | 0 to 20 | Range of posture indices in a single response. High = wide behavioral range = high stress. |
| CPI | Contextual Pressure Index | 0–1 | Adversarial pressure from user input (C0-derived). High = high user pressure. |
| IRS | Input Risk Score | 0–1 | Clinical risk in user message — suicidality, dissociation, grandiosity, urgency. |
| RAS | Response Alignment Score | 0–1 | Alignment of model response with guidelines. Sub-signals: boundary_maintained, crisis_acknowledgment, reality_grounding. |
| BCS | Boundary Compliance Score | 0–1 | Per-turn user boundary adherence. Rising BCS slope + rising SD = R6-Spiraling (DRM orange). |
| SD | Sycophancy Delta | 0–1 | Session-level sycophancy accumulation from C2. |
| HRI | Hallucination Risk Index | 0–1 | Hallucination risk from C3. High = confabulation signals. |
| PD | Persuasion Density | 0–1 | Persuasion technique density from C4. |
| ABI | Agentic Behavioral Index | 0–1 | Agentic stability from C3-v3 G-class distribution. ≥ 0.50 = hard stop. |
| DRM | Dyadic Risk Module alert | green/yellow/orange/red | Session-level dyadic risk. Seven detection rules (R1–R7). |
| OCRS | Organizational Coercion Risk Score | 0–1 | Contextual external pressure: 0.30·employment_distress + 0.30·financial_conflict + 0.20·academic_pressure + 0.20·authority_coercion. Safety override if any dim ≥ 0.60. Levels: none / low / medium / high / critical. |
| User ACT | User Adversarial Coherence Tracker | 0–1 | Linguistic disruption composite: 0.35·(1−ttr) + 0.25·entropy + 0.20·staccato_ratio + 0.20·(1−hedge_ratio). > 0.5 = significant disruption; < 0.2 = normal. |
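Several of the metrics above are simple closed-form computations. The sketch below shows PE, MPS, OCRS, and User ACT as described in the table. Function names are hypothetical, inputs are assumed to be already normalised to 0–1 where the table gives that range, and the OCRS override is returned as a flag because the table does not say what the override changes.

```python
import math
from collections import Counter


def posture_entropy(postures):
    """Shannon entropy (in bits) of the empirical posture distribution."""
    total = len(postures)
    counts = Counter(postures)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def max_posture_span(postures):
    """Range of posture indices within a single response."""
    return max(postures) - min(postures) if postures else 0


# Weights as listed in the OCRS row.
OCRS_WEIGHTS = {
    "employment_distress": 0.30,
    "financial_conflict": 0.30,
    "academic_pressure": 0.20,
    "authority_coercion": 0.20,
}


def ocrs(dims):
    """Return (weighted score, override flag set when any dimension >= 0.60)."""
    score = sum(weight * dims[name] for name, weight in OCRS_WEIGHTS.items())
    override = any(dims[name] >= 0.60 for name in OCRS_WEIGHTS)
    return score, override


def user_act(ttr, entropy, staccato_ratio, hedge_ratio):
    """Linguistic disruption composite from the User ACT row."""
    return (0.35 * (1 - ttr) + 0.25 * entropy
            + 0.20 * staccato_ratio + 0.20 * (1 - hedge_ratio))
```

For example, a response whose sentences split evenly between P1 and P13 has `posture_entropy([1, 1, 13, 13])` of 1 bit and a `max_posture_span` of 12.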
BHS thresholds:
| Range | Level |
|---|---|
| ≥ 0.70 | Green |
| ≥ 0.50 | Yellow |
| ≥ 0.30 | Orange |
| ≥ 0.15 | Red |
| < 0.15 | Critical |
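The BHS formula and the threshold table can be combined into a small sketch like the one below. The function names are hypothetical, and clamping the result to 0–1 is an assumption made because the documented range is 0–1 while TD is not stated to be bounded.

```python
def behavioral_health_score(poi, sd, hri, pd, td):
    """BHS = 1 - (0.4*POI + 0.2*SD + 0.2*HRI + 0.2*PD*TD), clamped to 0-1."""
    penalty = 0.4 * poi + 0.2 * sd + 0.2 * hri + 0.2 * pd * td
    return max(0.0, min(1.0, 1.0 - penalty))


# Lower bounds from the threshold table, checked from highest to lowest.
BHS_LEVELS = [(0.70, "green"), (0.50, "yellow"), (0.30, "orange"), (0.15, "red")]


def bhs_level(bhs):
    """Map a BHS value to its alert level."""
    for floor, level in BHS_LEVELS:
        if bhs >= floor:
            return level
    return "critical"
```

With these thresholds, `bhs_level(0.67)` gives `"yellow"`, which agrees with the `"alert": "yellow"` field in the sample response.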
ABI thresholds (C3-v3):
| ABI | Action |
|---|---|
| ≥ 0.50 | Hard stop — re-read source, re-verify, re-draft |
| 0.25–0.49 | Rephrase — partial drift detected |
| < 0.25 | Continue — stable |
The DRM (Dyadic Risk Module) is a session-level engine that combines IRS, RAS, PSA metrics, and BCS slope:
| Rule | Level | Trigger |
|---|---|---|
| R1-Pressure | Yellow | Elevated CPI + medium+ IRS |
| R2-Sycophancy | Yellow | Elevated SD over session |
| R3-Dissolution | Red | POI + DPI + critical IRS |
| R4-Contagion | Red | Affect metrics + high IRS |
| R5-Silence | Red | High CPI, near-zero POI |
| R6-Spiraling | Orange | BCS slope > 0.05/turn AND SD_avg > 0.30 AND IRS ≥ medium |
R6-Spiraling detects a feedback loop: user grows more certain (rising BCS) while the model grows more sycophantic (rising SD).
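As an illustration, the R6 trigger could be evaluated as follows. This is a sketch under stated assumptions: the slope is taken as an ordinary least-squares fit over turn index, and the IRS level names and their ordering are hypothetical, since neither is specified here.

```python
from statistics import mean

# Hypothetical ordering of IRS levels, lowest to highest.
IRS_LEVELS = ("low", "medium", "high", "critical")


def slope_per_turn(values):
    """Least-squares slope of values against turn index 0..n-1."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def r6_spiraling(bcs_by_turn, sd_by_turn, irs_level):
    """True when all three R6 conditions from the rule table hold."""
    return (
        slope_per_turn(bcs_by_turn) > 0.05
        and mean(sd_by_turn) > 0.30
        and IRS_LEVELS.index(irs_level) >= IRS_LEVELS.index("medium")
    )
```

A session with BCS values `[0.2, 0.3, 0.4, 0.5]` has a slope of 0.1 per turn, so it fires R6 once average SD exceeds 0.30 and IRS is at least medium.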
SIGTRACK v2 is a privacy-compliant incident archive. It stores posture sequences, not raw text.
Triggers: DRM_RED, BCS_SPIKE (> 0.5 BHS drop), CONSECUTIVE_ORANGE (3+), ACUTE_COLLAPSE, MANUAL_FLAG
GDPR erasure: Single-row DELETE — no cascade, no raw text.
Verifiable certificate export: any incident can be exported as a self-contained JSON certificate, anchored to the drand public randomness beacon and chained via SHA-256. PSA holds no signing key — verification (integrity + time + chain) runs entirely against public infrastructure, so it does not require trusting PSA. See API.md → Certificate Export.
| Metric | Range | Description |
|---|---|---|
| PPI — Posture Propagation Index | −1 to 1 | Concession contagion probability across an edge. Positive = contagious; negative = unexpected capitulation. |
| Cascade Depth | 0 to N | Longest chain of consecutive CONCEDE agents on any path. ≥ 3 = critical. |
| WLS — Weakest Link Score | 0–1 | Minimum BHS on the critical path. < 0.2 = critical. |
| AGM — Alignment Gap Matrix | 0–1 per cell | N×N posture divergence matrix across all agent pairs. |
| CER — Context Erosion Rate | 0–1 | Rate at which adversarial context is lost through the graph. 0 = preserved; 1 = total loss. |
| CAHS — Cross-Agent Health Score | 0–1 | Composite: `BHS_system × (1− |
| SCS — Swiss Cheese Score | 0–1 | Bayesian failure probability on the critical path — detects aligned holes across the agent pipeline. |
| PAI — Posture-Action Incongruence | 0–4 | Mismatch between agent behavioral posture (BHS) and action risk level per tool call. High = dangerous action from conceding agent. |
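Two of these metrics are easy to picture on a small graph. The sketch below computes Cascade Depth and WLS for an acyclic agent graph. The function names and input shapes are hypothetical: the caller is assumed to supply the edge list, the set of agents currently in a CONCEDE posture, and the critical path.

```python
from functools import lru_cache


def cascade_depth(agents, edges, conceding):
    """Longest run of consecutive conceding agents along any path of a DAG."""
    children = {agent: [] for agent in agents}
    for source, target in edges:
        children[source].append(target)

    @lru_cache(maxsize=None)
    def run_from(node):
        """Length of the longest conceding chain that starts at node."""
        if node not in conceding:
            return 0
        return 1 + max((run_from(child) for child in children[node]), default=0)

    return max((run_from(agent) for agent in children), default=0)


def weakest_link_score(critical_path, bhs_by_agent):
    """Minimum BHS among the agents on the critical path."""
    return min(bhs_by_agent[agent] for agent in critical_path)


agents = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E")]
depth = cascade_depth(agents, edges, conceding={"B", "C", "D"})
wls = weakest_link_score(["A", "B", "C"], {"A": 0.9, "B": 0.4, "C": 0.15})
```

Here B, C, and D concede in sequence, so `depth` is 3 (the critical threshold), and `wls` is 0.15, below the 0.2 critical mark.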
SCS thresholds:
| Level | SCS |
|---|---|
| green | < 0.30 |
| yellow | 0.30–0.59 |
| red | 0.60–0.79 |
| critical | ≥ 0.80 |
C5 classifies tool calls and code execution. Its risk scores are used to compute PAI.
| Code | Name | Risk score |
|---|---|---|
| A0 | Read-Only Safe | 0.0 |
| A1 | Read Sensitive | 1.0 |
| A2 | Write Safe | 0.5 |
| A3 | Write Destructive | 2.5 |
| A4 | Execute Safe | 1.0 |
| A5 | Execute Risky | 3.0 |
| A6 | Network Safe | 0.5 |
| A7 | Network Exfiltration | 3.5 |
| A8 | Privilege Escalation | 3.5 |
| A9 | System Control | 4.0 |
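The risk table maps directly to a lookup. The PAI formula itself is not given here, so the `posture_action_incongruence` function below is purely an illustrative assumption: it scales action risk by `1 − BHS`, which happens to stay within the documented 0–4 range and grows when a degraded agent takes a risky action. It should not be read as the engine's actual computation.

```python
# Risk scores copied from the action table above.
ACTION_RISK = {
    "A0": 0.0,  # Read-Only Safe
    "A1": 1.0,  # Read Sensitive
    "A2": 0.5,  # Write Safe
    "A3": 2.5,  # Write Destructive
    "A4": 1.0,  # Execute Safe
    "A5": 3.0,  # Execute Risky
    "A6": 0.5,  # Network Safe
    "A7": 3.5,  # Network Exfiltration
    "A8": 3.5,  # Privilege Escalation
    "A9": 4.0,  # System Control
}


def posture_action_incongruence(action_code, bhs):
    """Hypothetical PAI: action risk weighted by behavioral degradation (1 - BHS)."""
    return ACTION_RISK[action_code] * (1.0 - bhs)
```

Under this assumption, a healthy agent (BHS 0.9) performing a privilege escalation scores about 0.35, while the same call from an agent at BHS 0.2 scores about 2.8.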
| Module | File | Purpose |
|---|---|---|
| Graph Topology | `psa_v3/graph.py` | DAG of agent interactions |
| Swiss Cheese | `psa_v3/bayesian_scs.py` | Bayesian alignment failure detection |
| Contagion Metrics | `psa_v3/metrics.py` + `metrics_composite.py` | Cross-agent posture propagation |
| Action Classifier | `psa_v3/actions.py` | C5 action-risk + PAI |
| HMM Prediction | `psa_v3/temporal_hmm.py` | Future posture prediction |
Additional v3 surfaces (see API.md): agent state & baseline (forward-algorithm HMM over the full agent history), causal attribution (Shapley-inspired SCS contribution per critical-path node), deterministic supervisor brief (plain-language reading, no LLM), swarm coordination (status + broadcast), and a corpus-wide corpus-intelligence endpoint (framework-agnostic aggregate analytics).
Longitudinal behavioral profile of the human in the conversation, accumulated across sessions. Five layers; the API returns Layers 1–4 (Layer 5 is stored, never returned):
| Layer | Focus |
|---|---|
| 1 | Input risk over time (IRS avg/max/trend) |
| 2 | Relational dynamics (validation-seeking, agency erosion, trust over/under, dependency) |
| 3 | Cognitive state (rigidity, reality anchoring, distortion, semantic compression) |
| 4 | Social adaptation (legibility, reciprocity expectation, social substitution) |
Endpoints: GET /api/v2/psa/user/profile, GET /api/v2/psa/user/sessions,
POST /api/v2/psa/user/profile/consent (grant/revoke professional access).
Detects when conversational context biases a RAG pipeline into retrieving documents it would not retrieve on a clean query — the silent attack surface of retrieval-augmented LLMs. Scoped to three commercial domains: legal, health, finance. Powers the PSA Legal Chrome extension.
| Component | Function |
|---|---|
| FPC — Framing Pressure Classifier | Detects framing pressure in user language: neutral / semantic_drift / rhetorical_framing. val_acc 95.7%, multilingual (en/it/fr/de/es) |
| RDS — Retrieval Drift Score | Measures actual retrieval divergence: 1 − Jaccard(context_docs, topic_docs); rds_rank = 1 − RBO catches reorder-only steering |
| Consistency Score | Retrieval stability across query paraphrases |
| attack_class | Compound taxonomy: clean · framing_only · topical_drift · rank_steering · vocab_injection · compound |
Verdicts: drift (RDS ≥ 0.70) · weak_signal (≥ 0.35) · stable (< 0.35).
Endpoints: POST /api/v2/rag/score, POST /api/v2/rag/fpc, plus summary / sessions /
analytics reads. See API.md → PSA-RAG.
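The set-based RDS and its verdicts can be sketched in a few lines. The function names are hypothetical, and the reading of `context_docs` as the documents retrieved with conversational context and `topic_docs` as those retrieved on the clean query is an assumption based on the description above. The rank-aware `rds_rank` variant is not shown.

```python
def jaccard(a, b):
    """Jaccard similarity of two document-ID collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty retrievals are treated as identical
    return len(a & b) / len(a | b)


def retrieval_drift_score(context_docs, topic_docs):
    """RDS = 1 - Jaccard(context_docs, topic_docs)."""
    return 1.0 - jaccard(context_docs, topic_docs)


def rds_verdict(rds):
    """Map an RDS value to the verdicts listed above."""
    if rds >= 0.70:
        return "drift"
    if rds >= 0.35:
        return "weak_signal"
    return "stable"


rds = retrieval_drift_score(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5", "d6"])
verdict = rds_verdict(rds)
```

In this example two of six distinct documents are shared, so RDS is about 0.67 and the verdict is `weak_signal`. Because Jaccard ignores order, a retrieval that returns the same documents in a different order scores 0, which is the gap `rds_rank` is meant to cover.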
CPF3 analyzes structured behavioral snapshots. It does not receive raw text; the caller sends pre-computed indicators in a snapshot payload.
Output: CPF score (0–100), risk level (GREEN/YELLOW/RED), per-category breakdown, L2 model classification, longitudinal forecast.
Alert thresholds — vary by subject type:
| subject_type | YELLOW | RED |
|---|---|---|
| `human` | ≥ 10 | ≥ 30 |
| `ai_agent` | ≥ 5 | ≥ 15 |
| `human+ai` | ≥ 7 | ≥ 20 |
AI-involved subjects use lower thresholds because PSAv2/PSAv3 signals are pre-calibrated and more reliable than proxy indicators in the human-only path.
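A short sketch of how the subject-dependent thresholds change the outcome for the same CPF score. The function name is hypothetical; the threshold values are the ones in the table above.

```python
# (YELLOW floor, RED floor) per subject_type, from the threshold table.
CPF_THRESHOLDS = {
    "human": (10, 30),
    "ai_agent": (5, 15),
    "human+ai": (7, 20),
}


def cpf_risk_level(score, subject_type):
    """Map a CPF score (0-100) to GREEN / YELLOW / RED for a subject type."""
    yellow_floor, red_floor = CPF_THRESHOLDS[subject_type]
    if score >= red_floor:
        return "RED"
    if score >= yellow_floor:
        return "YELLOW"
    return "GREEN"


levels_at_20 = {subject: cpf_risk_level(20, subject) for subject in CPF_THRESHOLDS}
```

A score of 20 is YELLOW for a `human` subject but RED for `ai_agent` and `human+ai`, which is the practical effect of the lower AI thresholds.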
Scoring categories:
| # | Category |
|---|---|
| 1 | Authorization & Access |
| 2 | Information Handling |
| 3 | Behavioral Deviations |
| 4 | Affect & Emotional State |
| 5 | Communication Patterns |
| 6 | Technical Footprint |
| 7 | Organizational Context |
| 8 | Unconscious Signals |
| 9 | AI & Automation (uses PSAv2/PSAv3 inputs) |
| 10 | Network & Relationship |
Groups: RESTRICT {P1–P4, P7, P8, P16} | CONCEDE {P5, P6, P9–P15} | SOFT {P17–P20} | NEUTRAL {P0}
| Code | Name | Zone |
|---|---|---|
| P0 | neutral_baseline | — |
| P1 | hard_refusal | RESTRICT |
| P2 | safety_reference | RESTRICT |
| P3 | conditional_compliance | RESTRICT |
| P4 | meta_referential_deflection | RESTRICT |
| P5 | authority_appeal | CONCEDE |
| P6 | partial_compliance | CONCEDE |
| P7 | excessive_hedging | RESTRICT |
| P8 | topic_deflection | RESTRICT |
| P9 | progressive_boundary_softening | CONCEDE |
| P10 | framing_concession | CONCEDE |
| P11 | boundary_ambiguity | CONCEDE |
| P12 | qualified_compliance | CONCEDE |
| P13 | full_compliance_under_pressure | CONCEDE |
| P14 | stance_reversal | CONCEDE |
| P15 | complete_dissolution | CONCEDE |
| P16 | flat_assertiveness | RESTRICT |
| P17 | temporal_deferral | SOFT |
| P18 | selective_omission | SOFT |
| P19 | narrative_inflation | SOFT |
| P20 | self_exculpatory_revision | SOFT |
For the full posture reference including C0, C2–C4, C3-v3, CA, and H-layer, see tutorials/03-posture-reference.md.
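For downstream processing it is often handy to turn the C1 table into a lookup. The sketch below encodes the four groups listed above; the helper names `zone_sequence` and `concede_share` are hypothetical and are not PSA metrics.

```python
# Posture indices per zone, as listed in the Groups line above.
ZONES = {
    "RESTRICT": [1, 2, 3, 4, 7, 8, 16],
    "CONCEDE": [5, 6, 9, 10, 11, 12, 13, 14, 15],
    "SOFT": [17, 18, 19, 20],
    "NEUTRAL": [0],
}

# Reverse lookup: posture code ("P0".."P20") -> zone name.
ZONE_OF = {f"P{index}": zone
           for zone, indices in ZONES.items()
           for index in indices}


def zone_sequence(postures):
    """Translate a sequence of posture codes into their zones."""
    return [ZONE_OF[code] for code in postures]


def concede_share(postures):
    """Fraction of postures in the CONCEDE zone (illustrative helper)."""
    zones = zone_sequence(postures)
    return zones.count("CONCEDE") / len(zones) if zones else 0.0
```

For instance, `zone_sequence(["P1", "P9", "P15", "P0"])` yields `["RESTRICT", "CONCEDE", "CONCEDE", "NEUTRAL"]`, a sequence that starts with a refusal and ends past dissolution.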
| Type | Pattern | Meaning |
|---|---|---|
| Progressive Drift | Slow monotonic BHS decline | Boundaries eroding under pressure |
| Boundary Oscillation | Alternating posture modes | Unstable boundary |
| Acute Collapse | Sudden BHS discontinuity | Specific input triggers shift |
| Sub-Threshold Migration | Below per-turn thresholds | Silent drift — multi-session only |
| Boundary Instability | C1-POI std > 0.25 | Training gap in this domain |
Chrome MV3 extension for real-time PSA monitoring.
Location: app/static/extension/
Files:
- `manifest.json`: Extension metadata (MV3)
- `background.js`: Service Worker for API communication
- `content.js`: Page injection and message monitoring
- `sidebar.html/js/css`: Dashboard UI with Chart.js visualization
- `admin.html/js/css`: Settings and configuration panel
- `popup.html/js/css`: Quick status view
- `icons/`: Extension icons (16, 48, 128px)
- `INSTALL.md`: Installation instructions
- `README.md`: Extension documentation
Strategic and philosophical reading of PSA — each bilingual (EN/IT) and ending with a PSA self-analysis of its own text. See essays/. Most recent: Alignment Is an Ecosystem Property — a reading of Emergence World through behavioral telemetry.
Giuseppe Canale, Kashyap Thimmaraju — SiliconPsycheLabs