Composite AND scorer silently masks a confirmed true with a could-not-score false #2043

## Summary

When a `TrueFalseScorer` cannot evaluate a response (no message piece survives its validator filtering — e.g. an image-only scorer handed a text-only response), the base class returns a fallback `Score(score_value="false")` whose rationale is "No supported pieces to score after filtering; returning false." (see `pyrit/score/true_false/true_false_scorer.py`, `_build_fallback_score`, and `pyrit/score/scorer.py:267-268`).
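The fallback path can be modelled in a few lines. This is a minimal sketch, not PyRIT code: `Piece`, `Score`, `score_response`, and the `judge` callback are hypothetical stand-ins, and only the rationale string is taken from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Set


@dataclass
class Piece:
    """Hypothetical stand-in for a message piece."""
    data_type: str  # e.g. "text" or "image_path"
    value: str


@dataclass
class Score:
    """Hypothetical stand-in for a true/false score."""
    score_value: str
    score_rationale: str


FALLBACK_RATIONALE = "No supported pieces to score after filtering; returning false."


def score_response(
    pieces: Iterable[Piece],
    supported_types: Set[str],
    judge: Callable[[Piece], bool],
) -> Score:
    """Score only the supported pieces; fall back to "false" when none survive."""
    supported = [p for p in pieces if p.data_type in supported_types]
    if not supported:
        # Nothing was evaluated, yet the value is the same "false" as a real verdict.
        return Score("false", FALLBACK_RATIONALE)
    harmful = any(judge(p) for p in supported)
    return Score("true" if harmful else "false", "Judged by the scorer.")


text_response = [Piece("text", "example model output")]

# Image-only scorer handed a text-only response: every piece is filtered out.
abstained = score_response(text_response, {"image_path"}, judge=lambda p: True)

# Text scorer that genuinely judges the response as not harmful.
genuine = score_response(text_response, {"text"}, judge=lambda p: False)

print(abstained.score_value, genuine.score_value)
```

In this sketch the two scores carry the same `score_value`; only the free-text rationale differs, which is not something a downstream aggregator can branch on reliably.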

This "could-not-score" `false` is **indistinguishable from a genuine "not harmful" `false`**. In a `TrueFalseCompositeScorer` (`pyrit/score/true_false/true_false_composite_scorer.py`) using `TrueFalseScoreAggregator.AND` (`pyrit/score/true_false/true_false_score_aggregator.py` — `functools.reduce(operator.and_, bool_values)`), a sub-scorer that merely *could not evaluate* the response contributes a hard `false` that silently vetoes another sub-scorer's confirmed harmful `true`. The aggregate reports `false` = "attack did not succeed", with no signal that a sub-scorer abstained.
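The veto follows directly from the reduction. The sketch below uses plain booleans rather than PyRIT `Score` objects; the variable names are illustrative, and only the `functools.reduce(operator.and_, ...)` expression comes from the aggregator as quoted above.

```python
import functools
import operator

# Stand-in sub-score values (plain booleans, not PyRIT Score objects).
harm_detector_value = True   # text scorer confirmed the jailbreak
image_only_value = False     # fallback "false": the scorer had nothing to score

# The same reduction the AND aggregator is described as using.
masked_verdict = functools.reduce(
    operator.and_, [harm_detector_value, image_only_value]
)

# A genuine case where both sub-scorers judged the response not harmful.
genuine_verdict = functools.reduce(operator.and_, [False, False])

print(masked_verdict, genuine_verdict)
```

Both reductions yield the same boolean, so the aggregate alone cannot reveal which case occurred.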

For a red team, this is a false-assurance hazard: a real success is under-reported, and nothing in the verdict, rationale, or logs flags that the `false` came from an abstention rather than a genuine "not harmful" judgement.

## Deterministic repro

The repro composes two real `TrueFalseScorer` subclasses under AND:

1. A text-only scorer that flags the text response as a successful jailbreak (`true`).
2. An image-only scorer fed the same text response; its only piece is filtered out, so the base `TrueFalseScorer` falls back to `false` with the could-not-score rationale described in the summary.

```
child HarmDetector  -> true
child ImageOnly     -> false | No pieces to score after filtering; returning false.
AND composite verdict -> False | get_value = False
```

The child harm detector confirmed `true`, the image-only scorer only *abstained* (`false` because it could not score), and the AND composite reports `False`.

## Expected vs actual

- **Expected:** A consumer can tell that the composite `false` is contaminated by a sub-scorer that could not evaluate the response — i.e. that an abstention, not a genuine "not harmful" judgement, contributed to the `false`. At minimum this should be visible in logs and in the aggregated rationale, and the could-not-score state should be programmatically distinguishable.
- **Actual:** The composite returns a bare `false` identical to a genuine all-"not harmful" `false`. There is no metadata flag, no log warning, and no rationale note. A confirmed harmful `true` is silently vetoed by an abstaining sub-scorer.

## Fix

A non-breaking observability fix is implemented in PR [#2042](https://github.com/microsoft/PyRIT/pull/2042). It marks the filtered fallback `false` with `score_metadata={"unscoreable": 1}`, adds a `logger.warning` and a rationale note in the aggregators when unscoreable sub-scores are present, and propagates the flag into the aggregate metadata. That PR intentionally does **not** change the aggregate verdict value.
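The shape of that fix can be sketched as follows. This is an illustration under assumptions, not the code in the PR: `Score` and `aggregate_and` are hypothetical stand-ins, and the warning and rationale wording is invented for the example. Only the `{"unscoreable": 1}` marker and the three behaviours (warning, rationale note, propagated flag) come from the description above.

```python
import logging
from dataclasses import dataclass, field
from typing import Dict, List

logger = logging.getLogger(__name__)


@dataclass
class Score:
    """Hypothetical stand-in for a true/false score with metadata."""
    score_value: str
    score_rationale: str
    score_metadata: Dict[str, int] = field(default_factory=dict)


def aggregate_and(scores: List[Score]) -> Score:
    """AND-aggregate sub-scores, surfacing any that were unscoreable."""
    # The verdict is computed exactly as before; the flag never changes it.
    verdict = all(s.score_value == "true" for s in scores)
    rationale = " ".join(s.score_rationale for s in scores)
    metadata: Dict[str, int] = {}

    unscoreable = [s for s in scores if s.score_metadata.get("unscoreable")]
    if unscoreable:
        logger.warning(
            "%d of %d sub-scores were unscoreable", len(unscoreable), len(scores)
        )
        rationale += (
            f" Note: {len(unscoreable)} sub-scorer(s) could not score the response."
        )
        metadata["unscoreable"] = 1  # propagate the flag to the aggregate

    return Score(str(verdict).lower(), rationale, metadata)


confirmed = Score("true", "Response contains a successful jailbreak.")
fallback = Score(
    "false",
    "No supported pieces to score after filtering; returning false.",
    {"unscoreable": 1},
)
rejected = Score("false", "Response is a refusal.")

flagged = aggregate_and([confirmed, fallback])
clean = aggregate_and([confirmed, rejected])
```

A consumer could then treat `flagged.score_metadata.get("unscoreable")` as a cue to re-examine the sub-scores, while `clean` remains an ordinary `false`.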

Whether the *default* verdict for an unscoreable input should differ (e.g. abstain / skip rather than contribute `false`) is a separate, behavior-changing discussion for this issue.

