Example Report

Public evidence packet built from a committed benchmark snapshot.

This example report is built from a real 500-case benchmark artifact committed to the repository. It is intentionally limited to data the current product records directly: benchmark metadata, aggregate outcomes, latency, failure mix, and category-level performance.

Report header

Snapshot date

2026-03-22

Run ID

20260322T031315Z

Run status

completed

Benchmark version

1.0.0-draft

Outcome metrics

Cases completed

500 / 500

Pass rate

35.80%

Success rate

69.80%

Citation presence

73.40%

Blank answers

151

Errors

P50 latency

71,861.89 ms

P95 latency

120,006.87 ms

What this public packet includes

Benchmark provenance

Benchmark: UAB Trauma Protocols Clinician Benchmark Draft. It contains 500 questions across 12 categories, and clinician review is still required before it can be used as a release gate.

Failure transparency

The packet exposes failure reason counts directly from the runner artifact rather than rewriting them into generic marketing language.

Current limitation

Indexing-quality diagnostics and retrieval-probe sections are not yet published because those report layers are not implemented in the current runtime pipeline.

Category performance

Category             Passed  Total  Pass rate
clinical-airway          15     16     93.75%
clinical-burn             0     40      0.00%
clinical-cardiac         22     32     68.75%
clinical-hemostasis       0     48      0.00%
clinical-neuro           41     72     56.94%
clinical-ortho           26     56     46.43%
clinical-thoracic         0     24      0.00%
cross-protocol           20     20    100.00%
general-trauma           37    136     27.21%
icu-workflow              8     24     33.33%
safety-policy             8      8    100.00%
workflow                  2     24      8.33%
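
The category rows can be cross-checked against the headline pass rate. The sketch below is only a consistency check over the figures in the category table above, not part of the benchmark runner; the names `CATEGORIES`, `pass_rate`, and `overall` are illustrative.

```python
# Passed/total counts copied from the category table in this report.
CATEGORIES = {
    "clinical-airway": (15, 16),
    "clinical-burn": (0, 40),
    "clinical-cardiac": (22, 32),
    "clinical-hemostasis": (0, 48),
    "clinical-neuro": (41, 72),
    "clinical-ortho": (26, 56),
    "clinical-thoracic": (0, 24),
    "cross-protocol": (20, 20),
    "general-trauma": (37, 136),
    "icu-workflow": (8, 24),
    "safety-policy": (8, 8),
    "workflow": (2, 24),
}


def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage rounded to two decimals."""
    return round(100.0 * passed / total, 2)


def overall(categories):
    """Sum passed and total across categories and compute the overall rate."""
    passed = sum(p for p, _ in categories.values())
    total = sum(t for _, t in categories.values())
    return passed, total, pass_rate(passed, total)


if __name__ == "__main__":
    for name, (p, t) in CATEGORIES.items():
        print(f"{name:<20} {p:>3} / {t:<3} {pass_rate(p, t):6.2f}%")
    print(overall(CATEGORIES))
```

Summing the rows gives 179 passed out of 500, which matches the 35.80% pass rate reported under Outcome metrics.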

Failure mix

Failure reason counts

blank_answer     151
citation_title   180
error              2
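
A tally like this is typically produced by counting the reasons attached to each case. The sketch below assumes a hypothetical per-case record with a `failure_reasons` list; the runner artifact's actual schema is not shown in this packet, and `count_failure_reasons` is an illustrative name. The three counts sum to 333, more than the 321 cases that did not pass (500 minus 179), which is consistent with a single case carrying more than one reason.

```python
from collections import Counter


def count_failure_reasons(cases):
    """Count every failure reason across cases.

    Each case is assumed (hypothetically) to be a dict with an optional
    "failure_reasons" list; a case may list several reasons.
    """
    counts = Counter()
    for case in cases:
        counts.update(case.get("failure_reasons", []))
    # Sort by reason name so the output order is stable.
    return dict(sorted(counts.items()))


# Tiny made-up sample, not data from the benchmark run.
sample = [
    {"id": "case-1", "failure_reasons": []},
    {"id": "case-2", "failure_reasons": ["blank_answer", "citation_title"]},
    {"id": "case-3", "failure_reasons": ["citation_title"]},
    {"id": "case-4", "failure_reasons": ["error"]},
    {"id": "case-5"},  # no reasons recorded at all
]

if __name__ == "__main__":
    print(count_failure_reasons(sample))
```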

Interpretation notes

This packet is intentionally conservative. It publishes only values present in the benchmark artifact and does not backfill unavailable fields with synthetic retrieval or report-quality scores.
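
The no-backfill rule can be illustrated with a small filter that copies only the fields present in the artifact and lists the rest as unavailable instead of inventing values. The field names and the `build_public_packet` helper are hypothetical, chosen for illustration; they are not the product's actual schema or code.

```python
# Hypothetical field names; the real artifact schema is not shown here.
PUBLIC_FIELDS = ["run_id", "pass_rate", "p50_latency_ms", "retrieval_score"]


def build_public_packet(artifact, fields=PUBLIC_FIELDS):
    """Copy only fields that exist in the artifact and are not None.

    Missing fields are listed under "unavailable" instead of being
    filled with defaults or synthetic scores.
    """
    values = {}
    unavailable = []
    for name in fields:
        value = artifact.get(name)
        if value is None:
            unavailable.append(name)
        else:
            values[name] = value
    return {"values": values, "unavailable": unavailable}


# Values taken from this report; "retrieval_score" is deliberately absent.
artifact = {
    "run_id": "20260322T031315Z",
    "pass_rate": 35.80,
    "p50_latency_ms": 71861.89,
}
result = build_public_packet(artifact)

if __name__ == "__main__":
    print(result)
```

In this sketch the absent `retrieval_score` ends up in the `unavailable` list and never appears among the published values.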

The runner output currently reports citation support as not enabled, so the public packet makes no citation-correctness claims beyond citation presence and category-level pass rate.

This draft benchmark was generated from canonical protocol titles. Clinician review is required before it is used for release gates.