Date: Feb 2–13, 2026 · Models: Mistral-7B, Tulu-3-8B, Llama-3.2-3B, Qwen3-1.7B, AceMath-7B · Type: Critique + Debate
12 Critique Templates · 7 Debate Templates · 5 Models Tested · 100% Fix-It Success
1. Critique Scores: 47 to 100 Across 12 Templates

Each template reveals different model strengths. Creative Writing and Bias Analysis hit near-perfect scores. Business Analysis remains the hardest challenge.

Critique quality scores across 12 templates
Why this matters: The best model pair changes for every template. Model Pile automatically selects the right combination based on this data.
2. Model × Template: Who Excels Where?

Mistral-7B hits a perfect 100 on Bias, Creative Writing, and Tech Docs. Qwen3-1.7B dominates Case Law (95.9). No model leads everywhere.

Model performance heatmap across critique templates
Why this matters: Each cell in this heatmap drives Model Pile's model selection. Your Case Law query goes to Qwen3. Your Bias check goes to Mistral. Automatically.
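For readers who want to picture the mechanics, here is a minimal sketch of benchmark-driven routing: keep a template-to-model score table and send each query to the top scorer. The function names and the score table below are illustrative assumptions, not Model Pile's actual API; only the two scores quoted above (Qwen3-1.7B at 95.9 on Case Law, Mistral-7B at 100 on Bias Analysis) come from the charts.

```python
# Illustrative sketch only; not Model Pile's real code or API.
from typing import Dict

ScoreTable = Dict[str, Dict[str, float]]  # template -> model -> benchmark score

def pick_model(template: str, scores: ScoreTable) -> str:
    """Return the highest-scoring model for the given critique template."""
    per_model = scores[template]
    return max(per_model, key=per_model.get)

# Placeholder table: only the Case Law and Bias Analysis leaders' scores are
# taken from the charts above; the other values are made-up filler.
scores: ScoreTable = {
    "case_law":      {"Qwen3-1.7B": 95.9, "Mistral-7B": 90.0, "Tulu-3-8B": 85.0},
    "bias_analysis": {"Mistral-7B": 100.0, "Qwen3-1.7B": 80.0, "Tulu-3-8B": 75.0},
}

print(pick_model("case_law", scores))       # -> Qwen3-1.7B
print(pick_model("bias_analysis", scores))  # -> Mistral-7B
```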
3. Mistral-7B: The Value of Adding the Right Model

When we added Mistral-7B-Instruct, Bias Analysis jumped +28 points to a perfect 100. Tech Docs gained +6. Every template improved.

Impact of adding Mistral-7B model to the benchmark pool
Why this matters: This is why we continuously benchmark new models. Adding the right model to the mix can push scores from good to perfect.
4. Fix-It Enhancement: Up to +297% Document Improvement

After critique analysis identifies issues, the Fix-It process rewrites documents. Writing Quality documents grew by 297%; Contract Analysis documents grew by 287%.

Fix-It document enhancement percentages across templates
Why this matters: Model Pile doesn't just find problems — it fixes them. The Before/After Fix It Proof shows exactly what changed and why.
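As a rough illustration of what a +297% figure means, assuming (the page does not define the metric) that the enhancement percentage is relative growth from the original document to the rewritten one:

```python
# Assumption: "enhancement %" = relative growth of some document metric
# (length or score) from before the Fix-It pass to after it.
def enhancement_pct(before: float, after: float) -> float:
    return (after - before) / before * 100.0

# +297% means the rewritten document measures almost 4x the original.
print(enhancement_pct(100.0, 397.0))  # 297.0
```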
5. Debate Quality: 7 Templates from Pro vs Con to Legal

Pro vs Con and Contract Terms debates scored highest (0.78–0.79). Philosophical and Scientific debates are the hardest. Fix-It achieved a 100% success rate across all templates.

Debate quality scores across 7 debate templates
Why this matters: AI debates require models to argue both sides convincingly. This data shows, by debate style, which models can reason, counter-argue, and provide evidence.
6. Debate Performance: Model × Template

Llama-3.2-3B leads Legal Interpretation (0.81) and Scientific debate (0.73). Qwen3 dominates Structured debate (0.75). Mistral-7B struggles with Scientific (0.33) and Philosophical (0.41).

Model performance heatmap across debate templates
Why this matters: Even strong models have blind spots. Mistral-7B excels at critique but struggles in scientific debate. A multi-model approach ensures no blind spot goes unnoticed.
Date: Nov 15–24, 2025 · Models: Tulu-3-8B, Qwen3-1.7B, EXAONE-2.4B, AceMath-7B, Llama-3.2-3B · Type: Debate + Critique + Document Analysis
500+ Tests Run · 5 Models Tested · 3 Task Categories · 18 Templates
1. No Single Model Wins Everything

Rankings shift dramatically across task types. The debate champion drops to #4 in document analysis. The critique leader ranks #5 in debate.

AI model rankings shift across tasks
Why this matters: If you rely on one model, you inherit its weaknesses. Model Pile selects the best models for each task automatically.
2. AI Debate Performance: Model × Template

Each model has a debate style where it excels. Tulu-3 dominates structured debates (0.89), Qwen3 owns policy (0.85), AceMath leads scientific (0.84).

Heatmap of debate performance
Why this matters: A policy question and a scientific question need different models. Model Pile knows which one to pick.
3. Automated Optimization: 0% → 100% Quality

Four models initially failed critique quality checks. After automated parameter tuning, three reached a 100% pass rate with up to a 61% speed improvement.

Before and after optimization
Why this matters: Model Pile doesn't just run models — we optimize them per task type for better quality and faster responses.
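A hedged sketch of what per-task tuning can look like: grid-search a few decoding parameters, keep only configurations whose outputs pass the quality check, and prefer the fastest one. The parameter grid and the `run_critique` / `passes_quality_check` hooks are hypothetical stand-ins, not Model Pile's actual tuning pipeline.

```python
# Hypothetical tuning loop; the hooks and parameter grid are stand-ins.
from itertools import product

def tune(model_name, prompts, run_critique, passes_quality_check):
    """Return (config, avg_seconds) for the fastest config that passes on all prompts."""
    best = None
    for temperature, max_tokens in product([0.2, 0.5, 0.8], [512, 1024]):
        config = {"temperature": temperature, "max_tokens": max_tokens}
        results = [run_critique(model_name, prompt, **config) for prompt in prompts]
        if all(passes_quality_check(r["text"]) for r in results):
            avg_seconds = sum(r["seconds"] for r in results) / len(results)
            if best is None or avg_seconds < best[1]:
                best = (config, avg_seconds)
    return best  # None if every configuration fails the quality check
```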
4. Document Analysis: 240 Tests Across 8 Templates

From Writing Quality to Statistical Review, each template reveals different strengths. The top 3 models are separated by just 0.018 points.

Document analysis benchmark
Why this matters: Statistical Review is hardest for all models. Running multiple models catches what any single model misses.
5. Model Size vs Performance: Bigger Isn't Always Better

Qwen3, with only 1.7B parameters, beats 7B and 8B models in document analysis. Performance depends on task fit, not parameter count.

Model size vs performance scatter plot
Why this matters: Picking the "biggest" model isn't a strategy. Task-specific benchmarking is what Model Pile provides.
6. Debate Skills Breakdown

Debates are scored across Argumentation (30%), Rebuttal (25%), Logic (20%), Evidence (15%), and Completeness (10%). AceMath, a math-specialized model, leads in evidence usage.

Debate skills breakdown by criteria
Why this matters: Models differ most in rebuttal quality. This skill gap is invisible when you use just one model.
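The weighting above implies a simple composite: multiply each criterion score by its weight and sum. A minimal sketch, assuming per-criterion scores on a 0–1 scale (the example sub-scores below are placeholders, not measured results):

```python
# Composite debate score using the weights stated above; the criterion
# names are from the text, the example sub-scores are placeholders.
WEIGHTS = {
    "argumentation": 0.30,
    "rebuttal":      0.25,
    "logic":         0.20,
    "evidence":      0.15,
    "completeness":  0.10,
}

def debate_score(sub_scores: dict) -> float:
    """Weighted sum of per-criterion scores (each on a 0-1 scale)."""
    return sum(WEIGHTS[k] * sub_scores[k] for k in WEIGHTS)

example = {"argumentation": 0.8, "rebuttal": 0.6, "logic": 0.7,
           "evidence": 0.9, "completeness": 0.75}
print(round(debate_score(example), 3))  # 0.74
```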

See Multi-Model AI in Action

These benchmarks drive how Model Pile selects and combines models for every query you run.

Try Model Pile