# Benchmark prompt behavior
Compare prompts across models with transparent output quality metrics.
| Prompt | Model | Score | Invalid Output Rate | Avg. Latency (ms) |
|---|---|---|---|---|
| Safety-first v3 | gpt-4.1-mini | 902 | 1.7% | 162 |
| Safety-first v3 | gpt-4.1 | 948 | 0.6% | 188 |
| Format Guard v2 | gpt-4o-mini | 877 | 0.2% | 151 |
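As a rough illustration of how rows like these can be derived, the sketch below aggregates individual benchmark runs into per-prompt, per-model invalid-output rates and average latencies. All names here (`Run`, `summarize`, the validation flag) are hypothetical, not part of any benchmarking API; scoring is omitted since the scale is tool-specific.

```python
from dataclasses import dataclass

@dataclass
class Run:
    prompt: str        # prompt variant, e.g. "Safety-first v3"
    model: str         # model identifier
    valid: bool        # did the output pass format/quality validation?
    latency_ms: float  # wall-clock latency for this run

def summarize(runs):
    """Group runs by (prompt, model) and compute invalid rate and mean latency."""
    groups = {}
    for r in runs:
        groups.setdefault((r.prompt, r.model), []).append(r)
    rows = []
    for (prompt, model), rs in groups.items():
        invalid_pct = 100 * sum(not r.valid for r in rs) / len(rs)
        avg_latency = sum(r.latency_ms for r in rs) / len(rs)
        rows.append((prompt, model, round(invalid_pct, 1), round(avg_latency)))
    return rows

runs = [
    Run("Safety-first v3", "gpt-4.1-mini", True, 160.0),
    Run("Safety-first v3", "gpt-4.1-mini", False, 164.0),
]
print(summarize(runs))  # [('Safety-first v3', 'gpt-4.1-mini', 50.0, 162)]
```

Averaging per group keeps prompt variants directly comparable across models, since each row reflects only that pairing's runs.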