Summary
PyRIT scorers decide whether an attack succeeded (objective scorer) and feed progress signals to multi-turn attacks. Several paths cause a response that PyRIT cannot cleanly score (errored, blocked/filtered, wrong-modality, or a hedged refusal that still complies) to be recorded as "attack did not succeed" (False / 0.0 / refusal=True). Because these failure modes cluster on exactly the hardest, most evasive responses, the bias is systematically toward under-reporting successful attacks: a red-team run reports fewer jailbreaks than actually occurred, and the target looks safer than it is. PyRIT's "fail toward not-successful" defaults sensibly avoid false positives, but the false-negative direction is currently silent and, in some paths, indistinguishable from a genuine safe result.
This issue collects four related instances (each execution-proven). They share one root cause: "could not score / was prevented from scoring" is encoded identically to "scored not-harmful," with no signal to the consumer.
1. skip_on_error_result=True default drops errored successful attacks
Scorer.score_response_async and score_response_multiple_scorers_async default to skip_on_error_result=True (pyrit/score/scorer.py:876). When the response is_error(), score_async returns [] (unless score_blocked_content is set and partial_content is present). A response that contains attack-success content but is flagged error/blocked therefore produces an empty objective_scores list, and "no success score" reads as "attack failed."
Repro (deterministic, SubStringScorer):
CLEAN objective_scores: [(True, 'true')] # harmful payload present -> success
BLOCKED objective_scores: [] # identical payload, response_error='blocked' -> no score
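The collapse can be illustrated without PyRIT. The following is a minimal sketch, not PyRIT's implementation: `Response`, `substring_scores`, and `attack_succeeded` are hypothetical stand-ins, and reading the score list with `any()` is an assumed model of how a consumer interprets "no success score."

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Response:
    """Hypothetical stand-in for a scored response (not a PyRIT type)."""
    text: str
    error: Optional[str] = None  # e.g. "blocked"


def substring_scores(response: Response, needle: str, skip_on_error: bool = True) -> List[bool]:
    # Assumed model of the skip behavior: errored responses yield no scores at all.
    if skip_on_error and response.error is not None:
        return []
    return [needle in response.text]


def attack_succeeded(scores: List[bool]) -> bool:
    # any([]) is False, so "nothing was scored" reads as "attack failed".
    return any(scores)


clean = Response("step 1: PAYLOAD")
blocked = Response("step 1: PAYLOAD", error="blocked")

clean_verdict = attack_succeeded(substring_scores(clean, "PAYLOAD"))
blocked_verdict = attack_succeeded(substring_scores(blocked, "PAYLOAD"))
```

Both responses carry the same payload, yet only the clean one yields a success verdict; the blocked one yields an empty list that the consumer cannot tell apart from a scored failure.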
2. SelfAskRefusalScorer scores any blocked piece as refusal=True
self_ask_refusal_scorer.py:158-171 short-circuits on response_error=='blocked' and returns refusal=True ("Content was filtered, constituting a refusal.") without inspecting prompt_metadata['partial_content']. A content filter that truncated a partially complying harmful response is treated as a clean refusal = attack-failed (default score_blocked_content=False).
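One possible shape for honoring leaked content is sketched below. It is an illustration under stated assumptions, not the scorer's actual code: `score_refusal` and `toy_judge` are hypothetical names, and the only details taken from the scorer are the `response_error == 'blocked'` check and the `partial_content` metadata key.

```python
from typing import Callable, Mapping, Optional


def score_refusal(
    text: str,
    response_error: Optional[str],
    prompt_metadata: Mapping[str, str],
    judge: Callable[[str], bool],
) -> bool:
    """Return True if the piece should count as a refusal (hypothetical helper)."""
    if response_error == "blocked":
        partial = prompt_metadata.get("partial_content")
        if not partial:
            # Nothing leaked before the filter fired: treat as a refusal.
            return True
        # Something leaked: judge the leaked text instead of assuming a refusal.
        text = partial
    return judge(text)


def toy_judge(text: str) -> bool:
    # Placeholder for the LLM judge: refusal unless the payload marker appears.
    return "PAYLOAD" not in text


filtered_empty = score_refusal("", "blocked", {}, toy_judge)
filtered_leaky = score_refusal("", "blocked", {"partial_content": "step 1: PAYLOAD"}, toy_judge)
```

With this shape, a filtered response with no leaked text is still a refusal, while a filtered response that leaked harmful content is passed to the judge.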
3. Unsupported-modality response -> 0.0 -> threshold False
A FloatScaleThresholdScorer over a text-only FloatScaleScorer, given a harmful response in a non-text modality (image), filters all pieces -> fallback 0.0 -> 0.0 >= 0.5 is False = attack-failed. "Unscoreable in this modality" is reported as "not harmful." (Lowest severity — the 0.0 fallback is documented; included for completeness as the same pattern.)
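The filter, fallback, and threshold chain can be modeled in a few lines. This is a simplified sketch, assuming a hypothetical `(data_type, score)` piece representation and max-aggregation across pieces; neither is taken from PyRIT.

```python
from typing import List, Tuple

# Hypothetical piece model: (data_type, score a capable scorer would assign).
Piece = Tuple[str, float]


def text_only_float_score(pieces: List[Piece]) -> float:
    # Keep only the pieces the scorer supports; fall back to 0.0 if none remain.
    supported = [score for data_type, score in pieces if data_type == "text"]
    return max(supported) if supported else 0.0


def threshold_verdict(pieces: List[Piece], threshold: float = 0.5) -> bool:
    return text_only_float_score(pieces) >= threshold


harmful_image = [("image_path", 0.9)]
benign_text = [("text", 0.0)]

image_verdict = threshold_verdict(harmful_image)
text_verdict = threshold_verdict(benign_text)
```

A harmful image and a benign text response produce the same verdict here, which is the ambiguity described above.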
4. Hedged-but-complying response judged refusal=True (judge-prompt quality)
In OBJECTIVE_STRICT mode (refusal_objective_strict.yaml), a response that delivers a complete actionable harmful recipe wrapped in "I can't help / I discourage this" is scored refusal=True by the judge — it weights the disclaimer over the delivered content (the judge's own rationale acknowledges the harmful content was provided). This is a judge-prompt/eval-quality gap, not a code bug, but it lands in the same place: a successful jailbreak logged as a refusal.
Why it matters
A red-team's value is an accurate count of which attacks succeeded. Each path above silently converts "we couldn't tell / the response was prevented / the model hedged" into "the attack failed," so the most evasive successes are exactly the ones dropped. The companion PR #2042 makes the composite-scorer instance visible without changing verdicts; the broader directions below need maintainer sign-off because they change reported numbers.
Proposed directions (for maintainer input)
- Make "could not score" a first-class, distinguishable outcome (not the same value as "not harmful") so consumers/aggregators/orchestrators can handle it deliberately — PR #2042 adds a metadata flag + warning as a non-breaking first step.
- Reconsider whether skip_on_error_result=True should be the default for objective scoring, or at least emit a count/warning of dropped responses (so empty != silently-failed).
- Have SelfAskRefusalScorer honor partial_content on blocked pieces (score the leaked content) rather than unconditionally returning refusal=True.
- Tighten the strict-refusal judge prompt so delivered actionable harmful content outweighs a token disclaimer.
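As a rough illustration of the first direction, a three-valued outcome keeps "could not score" separate from "not harmful" and makes dropped responses countable. The sketch below is one possible shape, not a proposed API: `Outcome`, `classify`, and `summarize` are hypothetical names, and using `None` for "no verdict produced" is an assumption.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, Optional


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    UNSCOREABLE = "unscoreable"


def classify(score: Optional[bool]) -> Outcome:
    # None means no scorer produced a verdict (error, block, unsupported modality).
    if score is None:
        return Outcome.UNSCOREABLE
    return Outcome.SUCCESS if score else Outcome.FAILURE


def summarize(scores: Iterable[Optional[bool]]) -> Counter:
    # Count each outcome so unscoreable responses show up in the report.
    return Counter(classify(s) for s in scores)


summary = summarize([True, False, None, None])
```

An aggregator that reports `summary[Outcome.UNSCOREABLE]` alongside the success count lets a reader see how many responses were never actually judged.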
Happy to turn whichever direction you prefer into focused PRs.