Benchmarks
Transparent evaluation for clinical AI.
Clinical buyers should be able to inspect how a system was tested, where it performs well, and where it does not. GuidelinesIQ evaluates against a fixed clinician benchmark and reviews failures directly on the real application path.
Why publish this
The benchmark exists to make retrieval behavior, citation grounding, and failure cases inspectable. We do not present evaluation as solved; we publish the current state and use the same harness from release to release so changes are measurable rather than anecdotal.
Current snapshot
Snapshot from 2026-03-22. Benchmark metadata and an example evidence packet are published from committed JSON rather than inline marketing copy.
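As a minimal sketch of what "published from committed JSON" can look like, the snippet below reads a metadata file and prints a few fields. The file path (`benchmarks/metadata.json`) and field names are illustrative assumptions, not the actual repository layout.

```typescript
// Hedged sketch: render benchmark metadata from a committed JSON file
// instead of hard-coding it in page copy. The path and field names below
// are assumptions for illustration only.
import { readFileSync } from "node:fs";

interface BenchmarkMetadata {
  name: string;
  version: string;
  generatedAt: string; // ISO 8601 timestamp
  reviewRequired: boolean;
  canonicalProtocols: number;
  citationsSupported: boolean;
}

const raw = readFileSync("benchmarks/metadata.json", "utf8"); // hypothetical path
const metadata: BenchmarkMetadata = JSON.parse(raw);

console.log(`${metadata.name} (v${metadata.version})`);
console.log(`Generated: ${metadata.generatedAt}`);
console.log(`Clinician review required: ${metadata.reviewRequired ? "Yes" : "No"}`);
```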
Methodology
Benchmark construction
Questions are organized into 12 categories across a 500-case clinician benchmark. The current public snapshot uses the committed benchmark corpus and the real application path.
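To make the construction concrete, here is a minimal sketch of how a benchmark case and its category could be represented. The field names and the `expectedBehavior` label set are assumptions; the committed corpus defines the real schema.

```typescript
// Hedged sketch of a benchmark case record; the real corpus schema may differ.
interface BenchmarkCase {
  id: string;
  category: string; // one of the 12 question categories
  question: string;
  expectedBehavior: "answer_with_citation" | "refuse"; // assumed label set
}

// Count how many of the 500 cases fall into each category.
function countByCategory(cases: BenchmarkCase[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const c of cases) {
    counts.set(c.category, (counts.get(c.category) ?? 0) + 1);
  }
  return counts;
}
```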
Scoring
Responses are graded for successful return, grounding or refusal behavior, citation presence when citations are supported, and pass/fail under the benchmark rubric.
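A minimal sketch of a per-case grade under this rubric, assuming the field names below (the actual harness record may differ):

```typescript
// Hedged sketch of one graded response; field names are assumptions.
interface CaseGrade {
  returnedSuccessfully: boolean;   // a response came back at all
  groundedOrRefused: boolean;      // answered from protocol text or refused appropriately
  citationPresent: boolean | null; // null when the benchmark does not support citations
  pass: boolean;                   // overall pass/fail under the rubric
}

// Aggregate pass rate across all graded cases.
function passRate(grades: CaseGrade[]): number {
  if (grades.length === 0) return 0;
  return grades.filter((g) => g.pass).length / grades.length;
}
```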
Execution path
The benchmark hits the production chat path rather than a synthetic shortcut, so the results reflect real retrieval and generation behavior.
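As an illustration of this execution path, the sketch below posts one benchmark question to a chat endpoint the way the app would. The endpoint URL and request/response shapes are assumptions, not the actual API.

```typescript
// Hedged sketch: run one benchmark case against the production chat path.
// The endpoint URL and payload shape are illustrative assumptions.
async function runCase(question: string): Promise<string> {
  const res = await fetch("https://guidelinesiq.example/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message: question }),
  });
  if (!res.ok) {
    throw new Error(`Chat endpoint returned ${res.status}`);
  }
  const body = (await res.json()) as { answer: string };
  return body.answer;
}
```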
Current results snapshot
Benchmark notes
Current public benchmark metadata
- Benchmark name: UAB Trauma Protocols Clinician Benchmark Draft
- Version: 1.0.0-draft
- Generated: 2026-03-22T13:46:07.858705+00:00
- Review required: Yes
- Canonical protocols: 59
- Citations supported: No
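For illustration, the metadata above could look like this once committed. The values mirror the published list; the field names are assumptions.

```typescript
// Hedged sketch of the committed snapshot metadata; field names are assumptions,
// values mirror the published list above.
const snapshotMetadata = {
  name: "UAB Trauma Protocols Clinician Benchmark Draft",
  version: "1.0.0-draft",
  generatedAt: "2026-03-22T13:46:07.858705+00:00",
  reviewRequired: true,
  canonicalProtocols: 59,
  citationsSupported: false,
} as const;
```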
Failure mix
Draft benchmark generated from canonical protocol titles. Clinician review is required before it is used for release gates.
