All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
First public release. 7 failure-mode detectors, 4 data adapters, Markdown/HTML reporters, FastAPI server with Docker, embeddings similarity backend, MkDocs documentation site. 203 tests, 90% coverage.
- Rewritten README: badges, value prop, Mermaid architecture diagram, positioning section explaining the niche without competitor comparisons, examples links, development quickstart
- `ARCHITECTURE.md`: full system design doc with data flow diagrams, extension points, design decisions, performance characteristics, security model
- `SUPPORT.md`: where to go for help, response-time expectations
- `examples/` directory with 5 runnable end-to-end scripts and an index
- `.editorconfig`: cross-editor consistency
- `.github/CODEOWNERS`: PR review routing
- `.github/FUNDING.yml`: sponsor button (template)
- YAML-form issue templates (GitHub's newer format) replacing the old Markdown templates
- `.github/ISSUE_TEMPLATE/config.yml` redirects support questions to Discussions and security issues to private advisories
- `PUBLISHING.md`: first-push instructions, PyPI trusted publisher setup, release process, deprecation policy
- `chatbot_auditor.server` FastAPI application with `/healthz`, `/readyz`, `/version`, `/detectors`, and `/analyze` endpoints
- Optional bearer-token auth on `/analyze` via `CHATBOT_AUDITOR_API_KEYS` environment variable (comma-separated)
- `CHATBOT_AUDITOR_MAX_CONVERSATIONS_PER_REQUEST` env cap (default 1000)
- Per-startup detector registry via FastAPI lifespan
- Auto-generated OpenAPI docs at `/docs` (Swagger) and `/redoc`
- Multi-stage `Dockerfile` running as unprivileged user with healthcheck
- `docker-compose.yml` for local development
- `chatbot_auditor.backends.embeddings.EmbeddingsSimilarity`: drop-in semantic similarity backend for `DeathLoopDetector` using sentence-transformers; LRU cache by text, injectable encoder for tests
- Docs: self-host tutorial, LLM & embedding backends tutorial, server and backends reference pages
- 203 tests passing, 90% coverage, mypy strict clean, ruff clean, docs strict build clean
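
The optional bearer-token check on `/analyze` can be pictured roughly as follows. This is a minimal stdlib-only sketch, not the server's actual code: the helper names `parse_api_keys` and `is_authorized` are hypothetical, and it assumes that an unset or empty `CHATBOT_AUDITOR_API_KEYS` leaves the endpoint open.

```python
import hmac
from typing import Optional


def parse_api_keys(raw: Optional[str]) -> frozenset:
    """Split a comma-separated key list, dropping blanks and surrounding whitespace."""
    if not raw:
        return frozenset()
    return frozenset(key.strip() for key in raw.split(",") if key.strip())


def is_authorized(authorization_header: Optional[str], keys: frozenset) -> bool:
    """Return True when auth is disabled (no keys) or the bearer token matches a key."""
    if not keys:
        return True  # assumption: no configured keys means auth is turned off
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest avoids leaking key contents through timing differences
    return any(hmac.compare_digest(token.encode(), key.encode()) for key in keys)
```

In a real deployment the raw value would come from `os.environ.get("CHATBOT_AUDITOR_API_KEYS")` and the header from the incoming request.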
- `reporting` module with `MarkdownReporter`, `HTMLReporter`, `Reporter` base class, and `ReportSummary` dataclass
- `render_markdown()` and `render_html()` convenience functions
- Reports include: overall summary metrics, detections-by-severity table, detections-by-detector table, and top-N ranked conversations with evidence
- HTML output is a self-contained document with inline CSS — email-safe, Slack-attachable, and fully escapes user-provided content (XSS-safe)
- CLI reworked: `--format text|json|markdown|html` (default `text`) replaces the previous `--json` flag; `--output PATH` writes to a file
- Updated docs, tutorials, and reference pages to cover the new commands
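
The XSS-safety claim for HTML reports comes down to escaping every user-provided value before it reaches the markup. A minimal sketch of that idea, using only the standard library; `render_evidence_row` is a hypothetical helper and not part of the `reporting` module:

```python
from html import escape


def render_evidence_row(conversation_id: str, evidence: str) -> str:
    """Render one table row, escaping every user-supplied value."""
    return (
        "<tr>"
        f"<td>{escape(conversation_id)}</td>"
        f"<td>{escape(evidence)}</td>"
        "</tr>"
    )
```

`html.escape` also escapes quotes by default, so the same helper would be safe for attribute values.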
- MkDocs Material site with home, getting-started, concepts, tutorials, reference sections, auto-deployed to GitHub Pages on push to `main` via the `docs.yml` workflow
- Auto-generated API reference via `mkdocstrings[python]` covering schema, detectors, adapters, knowledge bases, and audit entry points
- Three tutorials: audit Intercom data, write a custom detector, configure a policy base
- `[docs]` optional dependency group: `pip install chatbot-auditor[docs]`
- `Adapter` abstract base class defining the common `fetch()` contract
- `JSONAdapter`: reads conversations from `.json` (single or list) or `.jsonl` files with format auto-detection
- `CSVAdapter`: reads conversations from CSV/TSV files with flexible header detection (accepts `conversation_id`/`conv_id`/`thread_id`, `role`/`author`, `content`/`message`/`body`), customizable role mapping, and ISO-8601 or Unix timestamp parsing
- `IntercomAdapter`: pulls conversations via Intercom REST API with cursor-based pagination, HTML body cleaning, rate-limit retry/backoff, and role mapping across user/bot/admin author types
- `ZendeskAdapter`: pulls tickets + comments via Zendesk API with OAuth or email+API-token auth, pagination, rate-limit handling, bot user ID configuration, and public/private comment role mapping
- CLI `analyze-intercom` and `analyze-zendesk` commands for direct API access
- CLI `analyze` command now auto-detects file type from extension
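
Flexible header detection of the kind described for the CSV adapter can be sketched as an alias lookup. This is an illustration under assumptions, not the adapter's code: `HEADER_ALIASES`, `resolve_headers`, and `read_rows` are hypothetical names, and case-insensitive matching is assumed rather than documented.

```python
import csv
import io

# Accepted header spellings, taken from the aliases listed above.
HEADER_ALIASES = {
    "conversation_id": ("conversation_id", "conv_id", "thread_id"),
    "role": ("role", "author"),
    "content": ("content", "message", "body"),
}


def resolve_headers(fieldnames):
    """Map each canonical field to the column name actually present in the file."""
    present = {name.strip().lower(): name for name in fieldnames}
    resolved = {}
    for canonical, aliases in HEADER_ALIASES.items():
        match = next((present[a] for a in aliases if a in present), None)
        if match is None:
            raise ValueError(f"missing column for {canonical!r}")
        resolved[canonical] = match
    return resolved


def read_rows(text, delimiter=","):
    """Parse CSV/TSV text into dicts keyed by the canonical field names."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    columns = resolve_headers(reader.fieldnames or [])
    return [{field: row[col] for field, col in columns.items()} for row in reader]
```

Passing `delimiter="\t"` covers the TSV case with the same lookup.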
- `SentimentCollapseDetector`: pluggable `SentimentScorer` protocol with a stdlib-only `KeywordSentimentScorer` default; compares early/late thirds of user messages and flags meaningful sentiment drops with severity scaling
- `BrandDamageDetector`: pluggable `ContentSafetyChecker` with a stdlib-only `PatternSafetyChecker` default covering profanity, self-deprecation, competitor endorsements, and off-brand content (poems, jokes, politics); configurable competitor names
- `ConfidentLiesDetector`: regex-based detection of bot commitments (refunds, timelines, guarantees, account changes); takes an optional `PolicyBase` knowledge base to distinguish allowed from disallowed commitments; without a policy, flags all commitments for review
- `ConfidentMisinformationDetector`: regex-based detection of factual claims (pricing, hours, availability, policy); takes an optional `FactBase` to cross-check claims against ground truth; distinguishes "verified", "contradiction", and "unverifiable" outcomes
- `knowledge` module: `PolicyBase` and `FactBase` dataclasses defining the minimal knowledge-base interfaces
- `default_registry()` now includes SentimentCollapse and BrandDamage. ConfidentLies/Misinformation are available but opt-in, since they need a knowledge base to be genuinely useful.
- 131 tests passing, 93% coverage, mypy strict clean, ruff clean
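
The early/late-thirds comparison behind sentiment-collapse detection can be illustrated with a toy keyword scorer. Everything here is an assumption made for the sketch: the word lists, the `keyword_score` and `sentiment_drop` names, and the scoring formula are hypothetical and do not reproduce `KeywordSentimentScorer`.

```python
# Tiny illustrative vocabularies; a real scorer would use a much larger lexicon.
POSITIVE = {"thanks", "great", "perfect", "helpful"}
NEGATIVE = {"useless", "frustrated", "terrible", "angry"}


def keyword_score(text):
    """Score in [-1, 1]: (positive hits - negative hits) / total hits, 0.0 if none."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def mean_score(messages):
    """Average keyword score over a non-empty list of messages."""
    return sum(keyword_score(m) for m in messages) / len(messages)


def sentiment_drop(user_messages):
    """Mean score of the first third minus mean score of the last third."""
    if len(user_messages) < 3:
        return 0.0  # too short to split into thirds
    third = len(user_messages) // 3
    return mean_score(user_messages[:third]) - mean_score(user_messages[-third:])
```

A detector built on this would flag the conversation once the drop exceeds some threshold and scale severity with its size.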
- `similarity` module: `normalize`, `lexical_similarity` (SequenceMatcher-based, stdlib-only, explicitly symmetric), `token_jaccard`, and a `SimilarityFn` type for pluggable backends
- `DeathLoopDetector`: connected-components grouping over pairwise similarity, configurable threshold, min repeat count, minimum content length, pluggable similarity function, confidence scoring with frustration-keyword boost, severity scaling from low to critical
- `SilentChurnDetector`: flags multi-turn conversations that ended with no customer-side resolution signal; confidence boosted when the platform reported the conversation as resolved
- `EscalationBurialDetector`: detects explicit human-agent requests the bot deflects; aggregates multiple burials per conversation into one detection with severity scaling; transfer-confirmation phrases recognized as properly handled escalations
- `ConversationGenerator`: deterministic synthetic generator for healthy conversations, death loops (3 paraphrase levels), silent churn, and escalation burial scripts
- `audit()` and `default_registry()` entry points; default registry includes all three Phase 1 detectors
- CLI `analyze` command: accepts a JSON file, prints or emits JSON detections, returns non-zero exit code when failures are detected
- `scripts/benchmark.py`: writes `docs/benchmarks.md` with precision/recall/F1 for every detector against the synthetic corpus
- Property-based tests using `hypothesis`: identical-message detection invariant, unique-message non-detection invariant, detector idempotence
- 96 tests passing, 90% coverage, mypy strict clean, ruff clean
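
Connected-components grouping over pairwise similarity can be sketched with a small union-find. This is a hedged illustration, not the `DeathLoopDetector` implementation: the function bodies, the default `threshold` and `min_repeats` values, the whitespace/lowercase normalization, and the choice of taking the maximum of both argument orders to make the ratio symmetric are all assumptions.

```python
from difflib import SequenceMatcher


def lexical_similarity(a, b):
    """Symmetric ratio: SequenceMatcher can depend on argument order, so take the max."""
    a = " ".join(a.lower().split())
    b = " ".join(b.lower().split())
    return max(SequenceMatcher(None, a, b).ratio(), SequenceMatcher(None, b, a).ratio())


def repeated_groups(messages, threshold=0.85, min_repeats=3):
    """Group message indices into connected components of near-duplicates."""
    parent = list(range(len(messages)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    # Link every pair whose similarity clears the threshold.
    for i in range(len(messages)):
        for j in range(i + 1, len(messages)):
            if lexical_similarity(messages[i], messages[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(messages)):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) >= min_repeats]
```

Grouping by components rather than by exact match means a chain of paraphrases still counts as one loop even when its first and last messages fall below the threshold against each other.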
- Initial project scaffold: `pyproject.toml`, CI, linting, type-checking configuration
- Core Pydantic schema: `Message`, `Conversation`, `Detection`, `FailureMode`, `Severity`
- Abstract `Detector` base class defining the detection contract
- Detector registry for dynamic loading
- Typer-based CLI skeleton
- Apache 2.0 license, NOTICE file, SECURITY policy, Code of Conduct, contributing guide
- GitHub Actions CI pipeline (test matrix on Python 3.11, 3.12, 3.13 across Linux, macOS, Windows)
- Pre-commit hooks (ruff, mypy, pytest)
Initial pre-release. Not published to PyPI.