LLM Reasoning Benchmark

Adversarial Reasoning Evaluation for LLMs

Fallax surfaces failure modes that single-turn benchmarks miss. It scores step-level correctness across 25 adversarial prompt templates, not just final-answer accuracy.

25 templates

100 benchmark prompts

6 failure categories

Python 3.12+

MIT

Benchmark v1 Results

Model	Overall Score	Failure Rate	Captured
claude-sonnet-4-6	6.77	82.0%	2026-05-13
gpt-4o-mini	8.14	91.0%	2026-05-13

Scores are on a 0-10 step-failure scale; lower is better. Failure rate is the fraction of prompts scoring at or above 4. See benchmarks/v1/baselines.json for per-category breakdowns.
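The failure-rate definition above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not Fallax's own code: the function name `failure_rate` and the constant `FAILURE_THRESHOLD` are hypothetical, and the sample scores are made up for the example.

```python
from collections.abc import Iterable

# From the text: a prompt counts as a failure when its score is at or above 4.
FAILURE_THRESHOLD = 4.0


def failure_rate(scores: Iterable[float], threshold: float = FAILURE_THRESHOLD) -> float:
    """Return the fraction of per-prompt scores at or above the threshold.

    Hypothetical helper for illustration; not part of the Fallax package.
    """
    scores = list(scores)
    if not scores:
        raise ValueError("at least one score is required")
    failures = sum(1 for score in scores if score >= threshold)
    return failures / len(scores)


# Made-up scores on the 0-10 scale: 4.0, 7.5 and 9.0 meet the threshold.
example_scores = [2.0, 4.0, 7.5, 9.0, 3.9]
example_rate = failure_rate(example_scores)  # 3 of 5 prompts
```

Note that the comparison is inclusive, so a score of exactly 4 counts as a failure.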

Failure Taxonomy

logic_error

contradiction · invalid_inference

assumption_error

unstated_assumption · unjustified_assumption

constraint_violation

ignored_constraint · partial_satisfaction

generalization_error

overgeneralization · pattern_misapplication

ambiguity_failure

multi_step_break
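For downstream analysis, the taxonomy above could be held in a plain mapping from category to subtypes. The sketch below is an assumption about how one might represent it, not Fallax's internal data model: `FAILURE_TAXONOMY` and `category_of` are hypothetical names, and the empty tuples only reflect that no subtypes are listed above for `ambiguity_failure` and `multi_step_break`.

```python
# Category -> subtypes, copied from the taxonomy listing above.
FAILURE_TAXONOMY: dict[str, tuple[str, ...]] = {
    "logic_error": ("contradiction", "invalid_inference"),
    "assumption_error": ("unstated_assumption", "unjustified_assumption"),
    "constraint_violation": ("ignored_constraint", "partial_satisfaction"),
    "generalization_error": ("overgeneralization", "pattern_misapplication"),
    "ambiguity_failure": (),   # no subtypes listed in the source
    "multi_step_break": (),    # no subtypes listed in the source
}


def category_of(label: str) -> str:
    """Map a category or subtype label to its top-level category.

    Hypothetical helper for illustration; raises KeyError for unknown labels.
    """
    for category, subtypes in FAILURE_TAXONOMY.items():
        if label == category or label in subtypes:
            return category
    raise KeyError(f"unknown failure label: {label!r}")
```

For example, `category_of("invalid_inference")` returns `"logic_error"`, and a top-level label such as `"multi_step_break"` maps to itself.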

Quick Start

# install
uv sync

# run evaluation
uv run python -m fallax run \
  --models claude-sonnet-4-6 \
  --judge claude-haiku-4-5-20251001 \
  --output results.jsonl

# benchmark against v1
uv run python -m fallax baseline capture \
  --version v1 \
  --model claude-sonnet-4-6 \
  --judge claude-haiku-4-5-20251001

# analyze results
uv run python -m fallax analyze results.jsonl
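For ad hoc inspection outside the CLI, a JSONL results file can be aggregated with the standard library. The sketch below assumes each line is a JSON object with a `category` field and a numeric `score` field; that schema is a guess for illustration and is not confirmed by anything above, so adjust the keys to match the actual output. The function name `mean_score_by_category` is likewise hypothetical.

```python
import json
from collections import defaultdict
from collections.abc import Iterable


def mean_score_by_category(lines: Iterable[str]) -> dict[str, float]:
    """Average the score per category over JSONL lines.

    Assumes (hypothetically) that each record has "category" and "score" keys.
    Blank lines are skipped.
    """
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        totals[record["category"]] += float(record["score"])
        counts[record["category"]] += 1
    return {category: totals[category] / counts[category] for category in totals}


# Made-up records in the assumed schema, standing in for a real results file.
sample_lines = [
    '{"category": "logic_error", "score": 6}',
    '{"category": "logic_error", "score": 8}',
    "",
    '{"category": "multi_step_break", "score": 3}',
]
sample_means = mean_score_by_category(sample_lines)
```

Because the function accepts any iterable of strings, an open file handle for `results.jsonl` can be passed in directly in place of `sample_lines`.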
