Composite AND scorer silently masks a confirmed true with a could-not-score false #2043

@AUTHENSOR

Summary

When a TrueFalseScorer cannot evaluate a response (no message piece survives its validator filtering — e.g. an image-only scorer handed a text-only response), the base class returns a fallback Score(score_value="false") whose rationale is "No supported pieces to score after filtering; returning false." (see pyrit/score/true_false/true_false_scorer.py, _build_fallback_score, and pyrit/score/scorer.py:267-268).

This "could-not-score" false is indistinguishable from a genuine "not harmful" false. In a TrueFalseCompositeScorer (pyrit/score/true_false/true_false_composite_scorer.py) using TrueFalseScoreAggregator.AND (pyrit/score/true_false/true_false_score_aggregator.py, implemented as functools.reduce(operator.and_, bool_values)), a sub-scorer that merely could not evaluate the response contributes a hard false that silently vetoes another sub-scorer's confirmed harmful true. The aggregate reports false = "attack did not succeed", with no signal that a sub-scorer abstained.
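
The masking can be shown without PyRIT installed. The following is a minimal sketch that assumes only the reduction quoted above; `SubScore` and `and_aggregate` are hypothetical stand-ins for illustration, not PyRIT classes or functions.

```python
import functools
import operator
from dataclasses import dataclass


@dataclass
class SubScore:
    """Hypothetical stand-in for a true/false score; not PyRIT's Score class."""
    scorer: str
    value: bool
    rationale: str


def and_aggregate(scores):
    # Same reduction the issue quotes for the AND aggregator.
    return functools.reduce(operator.and_, [s.value for s in scores])


confirmed = SubScore("HarmDetector", True, "Response is a successful jailbreak.")
# Abstention: the scorer had nothing it could evaluate, so it fell back to false.
abstained = SubScore(
    "ImageOnly", False,
    "No supported pieces to score after filtering; returning false.",
)
# Genuine negative: the scorer evaluated the input and judged it not harmful.
genuine = SubScore("ImageOnly", False, "Image is not harmful.")

masked_verdict = and_aggregate([confirmed, abstained])
genuine_verdict = and_aggregate([confirmed, genuine])
print(masked_verdict, genuine_verdict)
```

Both aggregates come out as the same bare boolean, so a caller that looks only at the verdict cannot tell the abstention case from the genuine negative.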

For a red team, this is a false-assurance hazard: a real success is under-reported, and nothing in the verdict, rationale, or logs flags that the false came from an abstention rather than a genuine "not harmful" judgement.

Deterministic repro

The repro composes two real TrueFalseScorer subclasses under AND:

  1. A text-only scorer that flags the text response as a successful jailbreak (true).
  2. An image-only scorer fed the same text response; its piece is filtered, so the base TrueFalseScorer falls back to false ("No pieces to score after filtering").

child HarmDetector  -> true
child ImageOnly     -> false | No pieces to score after filtering; returning false.
AND composite verdict -> False | get_value = False

The child harm detector confirmed true, the image-only scorer only abstained (false because it could not score), and the AND composite reports False.

Expected vs actual

  • Expected: A consumer can tell that the composite false is contaminated by a sub-scorer that could not evaluate the response — i.e. that an abstention, not a genuine "not harmful" judgement, contributed to the false. At minimum this should be visible in logs and in the aggregated rationale, and the could-not-score state should be programmatically distinguishable.
  • Actual: The composite returns a bare false identical to a genuine all-"not harmful" false. There is no metadata flag, no log warning, and no rationale note. A confirmed harmful true is silently vetoed by an abstaining sub-scorer.

Fix

A non-breaking observability fix for this is implemented in PR #2042 — it marks the filtered fallback false with score_metadata={"unscoreable": 1}, adds logger.warning + rationale notes in the aggregators when unscoreable sub-scores are present, and propagates the flag into aggregate metadata. The aggregate verdict value is intentionally not changed by that PR.
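
As a rough illustration of the three pieces described above (metadata flag, warning plus rationale note, propagation), here is a minimal sketch. It is not the code from PR #2042: `SketchScore`, `fallback_score`, and `and_aggregate_with_flag` are hypothetical names, and only the `{"unscoreable": 1}` metadata shape is taken from the description.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


@dataclass
class SketchScore:
    """Hypothetical stand-in for a score object; not PyRIT's Score class."""
    value: bool
    rationale: str
    metadata: dict = field(default_factory=dict)


def fallback_score():
    # The could-not-score false, now tagged so it can be told apart later.
    return SketchScore(
        False,
        "No supported pieces to score after filtering; returning false.",
        {"unscoreable": 1},
    )


def and_aggregate_with_flag(scores):
    # The verdict is computed exactly as before; only observability is added.
    verdict = all(s.value for s in scores)
    rationale = " | ".join(s.rationale for s in scores)
    metadata = {}
    unscoreable = [s for s in scores if s.metadata.get("unscoreable")]
    if unscoreable:
        logger.warning(
            "%d of %d sub-scores were unscoreable; the aggregate may be masked.",
            len(unscoreable), len(scores),
        )
        rationale += f" [note: {len(unscoreable)} sub-score(s) could not be evaluated]"
        metadata["unscoreable"] = 1
    return SketchScore(verdict, rationale, metadata)


flagged = and_aggregate_with_flag(
    [SketchScore(True, "Response is a successful jailbreak."), fallback_score()]
)
clean = and_aggregate_with_flag(
    [SketchScore(True, "Harmful."), SketchScore(False, "Not harmful.")]
)
```

In this sketch both aggregates still evaluate to false, but only `flagged` carries the metadata flag and the rationale note, so a consumer can separate the two cases programmatically.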

Whether the default verdict for an unscoreable input should differ (e.g. abstain / skip rather than contribute false) is a separate, behavior-changing discussion for this issue.
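
To make the "abstain / skip" option concrete for that discussion, here is one possible shape, offered only as an illustration and not as a proposal from the PR. It assumes abstention is modelled as `None`; the function name `and_skipping_abstentions` is hypothetical.

```python
from typing import Optional, Sequence


def and_skipping_abstentions(values: Sequence[Optional[bool]]) -> Optional[bool]:
    """AND over the sub-verdicts that were actually decided.

    None means "could not score". Abstentions are skipped rather than
    counted as false; if every sub-scorer abstained, the aggregate abstains.
    """
    decided = [v for v in values if v is not None]
    if not decided:
        return None
    return all(decided)


skip_one = and_skipping_abstentions([True, None])
real_veto = and_skipping_abstentions([True, False])
all_abstain = and_skipping_abstentions([None, None])
```

Under these semantics a confirmed true is no longer vetoed by an abstaining sub-scorer, while a genuine false still is. The trade-off is that callers must handle a third outcome, which is why this would be a behavior change rather than an observability fix.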
