LLM Reasoning Benchmark
Adversarial Reasoning Evaluation for LLMs
Fallax surfaces failure modes that single-turn benchmarks miss. It scores the correctness of each reasoning step across 25 adversarial prompt templates, not just final-answer accuracy.
Benchmark v1 Results
| Model | Overall Score | Failure Rate | Captured |
|---|---|---|---|
| claude-sonnet-4-6 | 6.77 | 82.0% | 2026-05-13 |
| gpt-4o-mini | 8.14 | 91.0% | 2026-05-13 |
Scores are on a 0-10 step-failure scale; lower is better. Failure rate is the fraction of prompts that score at or above 4. See benchmarks/v1/baselines.json for per-category breakdowns.
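The failure-rate definition above can be sketched in a few lines of Python. This is a minimal illustration, not Fallax's own code: the function name `failure_rate`, the constant `FAILURE_THRESHOLD`, and the sample scores are all hypothetical, and only the 0-10 scale and the "at or above 4" cutoff come from the text.

```python
from typing import Sequence

# From the text: a prompt counts as a failure when its score is at or above 4.
FAILURE_THRESHOLD = 4.0


def failure_rate(scores: Sequence[float], threshold: float = FAILURE_THRESHOLD) -> float:
    """Return the fraction of per-prompt scores at or above the threshold.

    Hypothetical helper for illustration; not part of the Fallax API.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    for score in scores:
        # Scores live on the 0-10 step-failure scale described above.
        if not 0.0 <= score <= 10.0:
            raise ValueError(f"score out of range 0-10: {score}")
    failures = sum(1 for score in scores if score >= threshold)
    return failures / len(scores)


if __name__ == "__main__":
    # Made-up scores, not benchmark results.
    sample = [2.0, 4.0, 7.5, 3.9, 10.0]
    print(f"{failure_rate(sample):.1%}")  # 3 of 5 scores are >= 4, so 60.0%
```

Note that the comparison is inclusive: a score of exactly 4 counts as a failure, while 3.9 does not.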
Failure Taxonomy
| Category | Subtypes |
|---|---|
| logic_error | contradiction, invalid_inference |
| assumption_error | unstated_assumption, unjustified_assumption |
| constraint_violation | ignored_constraint, partial_satisfaction |
| generalization_error | overgeneralization, pattern_misapplication |
| ambiguity_failure | ambiguity_failure |
| multi_step_break | multi_step_break |
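For readers who want to work with the taxonomy programmatically, it can be written as a plain mapping from category to subtypes. This is a sketch under stated assumptions: the names `FAILURE_TAXONOMY` and `category_of` are hypothetical and may not match how Fallax represents the taxonomy internally. Only the category and subtype identifiers are taken from the list above.

```python
# Category -> subtypes, copied from the taxonomy above.
# The structure and names are illustrative, not Fallax's actual API.
FAILURE_TAXONOMY: dict[str, tuple[str, ...]] = {
    "logic_error": ("contradiction", "invalid_inference"),
    "assumption_error": ("unstated_assumption", "unjustified_assumption"),
    "constraint_violation": ("ignored_constraint", "partial_satisfaction"),
    "generalization_error": ("overgeneralization", "pattern_misapplication"),
    # These two categories each have a single subtype that shares their name.
    "ambiguity_failure": ("ambiguity_failure",),
    "multi_step_break": ("multi_step_break",),
}


def category_of(subtype: str) -> str:
    """Return the category that contains the given subtype (hypothetical helper)."""
    for category, subtypes in FAILURE_TAXONOMY.items():
        if subtype in subtypes:
            return category
    raise KeyError(f"unknown failure subtype: {subtype}")


if __name__ == "__main__":
    print(category_of("overgeneralization"))  # generalization_error
```

Keeping the subtypes as tuples makes the mapping read-only in practice, which suits a fixed taxonomy.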
Quick Start
# install
uv sync

# run evaluation
uv run python -m fallax run \
    --models claude-sonnet-4-6 \
    --judge claude-haiku-4-5-20251001 \
    --output results.jsonl

# benchmark against v1
uv run python -m fallax baseline capture \
    --version v1 \
    --model claude-sonnet-4-6 \
    --judge claude-haiku-4-5-20251001

# analyze results
uv run python -m fallax analyze results.jsonl