Summary
When a TrueFalseScorer cannot evaluate a response (no message piece survives its validator filtering — e.g. an image-only scorer handed a text-only response), the base class returns a fallback Score(score_value="false") whose rationale is "No supported pieces to score after filtering; returning false." (see pyrit/score/true_false/true_false_scorer.py, _build_fallback_score, and pyrit/score/scorer.py:267-268).
This "could-not-score" false is indistinguishable from a genuine "not harmful" false. In a TrueFalseCompositeScorer (pyrit/score/true_false/true_false_composite_scorer.py) using TrueFalseScoreAggregator.AND (pyrit/score/true_false/true_false_score_aggregator.py — functools.reduce(operator.and_, bool_values)), a sub-scorer that merely could not evaluate the response contributes a hard false that silently vetoes another sub-scorer's confirmed harmful true. The aggregate reports false = "attack did not succeed", with no signal that a sub-scorer abstained.
For a red team, this is a false-assurance hazard: a real success is under-reported, and nothing in the verdict, rationale, or logs flags that the false came from an abstention rather than a genuine "not harmful" judgement.
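The veto comes down to a plain boolean AND reduction. The following is a minimal sketch using only the standard library; the two input values are hypothetical stand-ins for sub-scorer results, not output from PyRIT itself.

```python
import functools
import operator

# Hypothetical sub-score values (stand-ins, not produced by PyRIT).
harm_detector = True   # genuine "harmful" verdict from a scorer that ran
image_only = False     # fallback false: nothing left to score after filtering

# Same reduction the issue cites for the AND aggregator.
and_verdict = functools.reduce(operator.and_, [harm_detector, image_only])

# A composite where every sub-scorer genuinely judged "not harmful".
genuine_negative = functools.reduce(operator.and_, [False, False])

# Both cases collapse to the same bare boolean.
print(and_verdict, genuine_negative)
```

Once the values are reduced to booleans, the reason behind each false is gone, so any distinction has to be carried alongside the value (for example in metadata or the rationale).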
Deterministic repro
The repro composes two real TrueFalseScorer subclasses under AND:
- A text-only scorer that flags the text response as a successful jailbreak (true).
- An image-only scorer fed the same text response; its piece is filtered, so the base TrueFalseScorer falls back to false ("No pieces to score after filtering").
child HarmDetector -> true
child ImageOnly -> false | No pieces to score after filtering; returning false.
AND composite verdict -> False | get_value = False
The child harm detector confirmed true and the image-only scorer merely abstained (false because it could not score), yet the AND composite reports False.
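For readers without a PyRIT checkout, the same mechanism can be modelled in a few lines. This is a simplified stand-in, not the actual repro script: Piece, Verdict, and score are hypothetical names, and the filtering rule (match on a single data type) is an assumption about how the validator behaves.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Piece:
    data_type: str  # e.g. "text" or "image_path" (assumed labels)
    value: str


@dataclass
class Verdict:
    value: bool
    rationale: str


def score(pieces: List[Piece], supported: str, judge: Callable[[Piece], bool]) -> Verdict:
    """Keep only supported pieces; fall back to false if none survive."""
    kept = [p for p in pieces if p.data_type == supported]
    if not kept:
        # The abstention is encoded as an ordinary false.
        return Verdict(False, "No pieces to score after filtering; returning false.")
    return Verdict(any(judge(p) for p in kept), "judged supported pieces")


response = [Piece("text", "Sure, here is how to ...")]

harm = score(response, "text", lambda p: p.value.startswith("Sure"))
image = score(response, "image_path", lambda p: True)  # judge is never called
composite = harm.value and image.value

print("child HarmDetector ->", harm.value)
print("child ImageOnly ->", image.value, "|", image.rationale)
print("AND composite verdict ->", composite)
```

The image-only judge never runs, yet its fallback decides the composite.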
Expected vs actual
- Expected: A consumer can tell that the composite false is contaminated by a sub-scorer that could not evaluate the response, i.e. that an abstention, not a genuine "not harmful" judgement, contributed to the false. At minimum this should be visible in logs and in the aggregated rationale, and the could-not-score state should be programmatically distinguishable.
- Actual: The composite returns a bare false identical to a genuine all-"not harmful" false. There is no metadata flag, no log warning, and no rationale note. A confirmed harmful true is silently vetoed by an abstaining sub-scorer.
Fix
A non-breaking observability fix for this is implemented in PR #2042 — it marks the filtered fallback false with score_metadata={"unscoreable": 1}, adds logger.warning + rationale notes in the aggregators when unscoreable sub-scores are present, and propagates the flag into aggregate metadata. The aggregate verdict value is intentionally not changed by that PR.
Whether the default verdict for an unscoreable input should differ (e.g. abstain / skip rather than contribute false) is a separate, behavior-changing discussion for this issue.
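To illustrate the observability approach described above, here is a rough sketch of an AND aggregator that keeps the verdict unchanged but surfaces abstentions. It is not the code from the PR: SubScore and aggregate_and are hypothetical names, and only the {"unscoreable": 1} metadata key is taken from the description above.

```python
import logging
from dataclasses import dataclass, field
from typing import Dict, List

logger = logging.getLogger(__name__)


@dataclass
class SubScore:
    value: bool
    rationale: str
    metadata: Dict[str, int] = field(default_factory=dict)


def aggregate_and(scores: List[SubScore]) -> SubScore:
    """AND the values; flag, log, and annotate any unscoreable inputs."""
    verdict = all(s.value for s in scores)  # verdict logic is unchanged
    unscoreable = sum(1 for s in scores if s.metadata.get("unscoreable"))
    rationale = " | ".join(s.rationale for s in scores)
    metadata: Dict[str, int] = {}
    if unscoreable:
        logger.warning("%d of %d sub-scores were unscoreable", unscoreable, len(scores))
        rationale += f" | NOTE: {unscoreable} sub-score(s) could not be evaluated"
        metadata["unscoreable"] = 1  # propagate the flag to the aggregate
    return SubScore(verdict, rationale, metadata)


result = aggregate_and([
    SubScore(True, "harmful content found"),
    SubScore(False, "No pieces to score after filtering; returning false.",
             {"unscoreable": 1}),
])
print(result.value, result.metadata)
```

A consumer can then check result.metadata.get("unscoreable") before treating the false as a genuine "attack did not succeed".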