- **⚖️ Pro vs Con**: Classic debate format with opposing positions; one model argues FOR, another argues AGAINST.
  Topics: AI regulation, social media liability, UBI, genetic engineering
- **🎯 Structured**: Full formal structure: opening statements, rebuttals, cross-examination, closing arguments.
  Topics: Cryptocurrency economics, remote work, nuclear energy
- **🔬 Scientific**: Evidence-based argumentation with a focus on empirical support and methodology.
  Topics: String theory, dark matter, consciousness, AGI feasibility
- **📊 Policy**: Multi-stakeholder perspectives on policy issues with trade-off analysis.
  Topics: Climate change, healthcare, immigration, education funding
- **🧠 Philosophical**: Abstract reasoning with logical argumentation and thought experiments.
  Topics: Free will, trolley problem, personal identity, hard problem of consciousness
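In a test harness, the template-to-topic mapping above might be represented as a simple lookup table. This is an illustrative sketch: the dictionary and function names are hypothetical, not taken from the project's code.

```python
# Hypothetical mapping of debate templates to their topic pools;
# contents mirror the list above, names are illustrative.
DEBATE_TEMPLATES = {
    "pro_vs_con": ["AI regulation", "social media liability", "UBI", "genetic engineering"],
    "structured": ["cryptocurrency economics", "remote work", "nuclear energy"],
    "scientific": ["string theory", "dark matter", "consciousness", "AGI feasibility"],
    "policy": ["climate change", "healthcare", "immigration", "education funding"],
    "philosophical": ["free will", "trolley problem", "personal identity",
                      "hard problem of consciousness"],
}

def pick_topic(template: str, index: int = 0) -> str:
    """Return a topic from the given template's pool (wraps around)."""
    topics = DEBATE_TEMPLATES[template]
    return topics[index % len(topics)]
```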
- **Argumentation (30%)**: Strength, coherence, and persuasiveness of initial arguments
- **Rebuttal (25%)**: Quality of counter-arguments and engagement with opposing views
- **Logic (20%)**: Use of valid reasoning and logical connectives
- **Evidence (15%)**: Citation of data, research, examples, and facts
- **Completeness (10%)**: All debate phases executed successfully
| Rank | Model | Parameters | Best Template (Score) | Strengths |
|---|---|---|---|---|
| 1 | Llama-3.1-Tulu-3-8B | 8B | Structured (0.89) | Formal debates, detailed rebuttals |
| 2 | Qwen3-1.7B | 1.7B | Policy (0.85) | Versatile, efficient |
| 3 | EXAONE-Deep-2.4B | 2.4B | Pro vs Con (0.82) | Balanced, consistent |
| 4 | AceMath-Nemotron-7B | 7B | Scientific (0.84) | Technical topics, evidence-heavy |
| 5 | Llama-3.2-3B-Instruct | 3B | Philosophical (0.79) | Abstract reasoning, quick responses |
| Template | #1 Model | Score | #2 Model | Score | Notes |
|---|---|---|---|---|---|
| ⚖️ Pro vs Con | EXAONE-Deep-2.4B | 0.82 | Qwen3-1.7B | 0.80 | Clear position-taking |
| 🎯 Structured | Llama-3.1-Tulu-3-8B | 0.89 | EXAONE-Deep-2.4B | 0.84 | Follows format well |
| 🔬 Scientific | AceMath-Nemotron-7B | 0.84 | Llama-3.1-Tulu-3-8B | 0.76 | Strong evidence usage |
| 📊 Policy | Qwen3-1.7B | 0.85 | Llama-3.1-Tulu-3-8B | 0.83 | Trade-off analysis |
| 🧠 Philosophical | Llama-3.2-3B-Instruct | 0.79 | Llama-3.1-Tulu-3-8B | 0.77 | Abstract reasoning |
| Model | Argumentation (30%) | Rebuttal (25%) | Logic (20%) | Evidence (15%) | Completeness (10%) |
|---|---|---|---|---|---|
| Llama-3.1-Tulu-3-8B | 0.86 | 0.84 | 0.79 | 0.76 | 1.00 |
| Qwen3-1.7B | 0.82 | 0.78 | 0.77 | 0.72 | 1.00 |
| EXAONE-Deep-2.4B | 0.79 | 0.76 | 0.75 | 0.71 | 1.00 |
| AceMath-Nemotron-7B | 0.74 | 0.68 | 0.73 | 0.85 | 1.00 |
| Llama-3.2-3B-Instruct | 0.73 | 0.69 | 0.72 | 0.67 | 1.00 |
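As a sanity check, the per-criterion scores can be combined with the stated weights into a single composite. This is only an illustrative calculation (variable names are hypothetical); the report's own overall scores may be computed differently.

```python
# Criterion weights from the rubric above (sum to 1.0).
WEIGHTS = {"argumentation": 0.30, "rebuttal": 0.25, "logic": 0.20,
           "evidence": 0.15, "completeness": 0.10}

# Per-criterion scores for one model (the Llama-3.1-Tulu-3-8B row above).
tulu_scores = {"argumentation": 0.86, "rebuttal": 0.84, "logic": 0.79,
               "evidence": 0.76, "completeness": 1.00}

def weighted_score(scores: dict) -> float:
    """Weighted sum of criterion scores."""
    return sum(WEIGHTS[k] * v for k, v in scores.items())

print(round(weighted_score(tulu_scores), 3))  # 0.84
```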
| Specification | Value | Details |
|---|---|---|
| Hardware | NVIDIA GPU with CUDA | 8-bit quantization for efficiency |
| Framework | Quart (async Flask) | Streaming SSE responses |
| Test Duration | 82 minutes | 5 templates × 5 model combinations |
| Timeout | 5 minutes per debate | Auto-retry on timeout (max 3 attempts) |
| Success Rate | 100% | All tests completed successfully |
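The timeout-and-retry behaviour in the table can be sketched as a small asyncio wrapper. The function and constant names here are illustrative, not the harness's actual API.

```python
import asyncio

MAX_ATTEMPTS = 3           # matches the "max 3 attempts" policy above
DEBATE_TIMEOUT_S = 5 * 60  # 5-minute per-debate timeout

async def run_with_retry(coro_factory, timeout: float = DEBATE_TIMEOUT_S,
                         attempts: int = MAX_ATTEMPTS):
    """Run a debate coroutine, retrying on timeout up to `attempts` times.

    `coro_factory` is called fresh on each attempt, since a coroutine
    cancelled by the timeout cannot be awaited again.
    """
    last_exc = None
    for _ in range(attempts):
        try:
            return await asyncio.wait_for(coro_factory(), timeout)
        except asyncio.TimeoutError as exc:
            last_exc = exc  # transient timeout: retry with a fresh coroutine
    raise last_exc
```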
| Parameter | Debate Value | Critique Value | Purpose |
|---|---|---|---|
| temperature | 0.7 | 0.3-0.5 | Higher creativity for argumentation |
| max_new_tokens | 1024 | 400-512 | Longer responses for comprehensive arguments |
| top_p | 0.9 | 0.9 | Nucleus sampling for quality |
| repetition_penalty | 1.1 | 1.1 | Prevents circular reasoning |
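With Hugging Face-style generation APIs, the two parameter sets above would typically be passed as keyword arguments to `generate()`. The dict names below are illustrative, and the critique values are one point picked from the stated ranges.

```python
# Sampling parameters from the table; dict names are illustrative.
DEBATE_GEN_KWARGS = dict(
    do_sample=True,          # sampling must be on for temperature/top_p to apply
    temperature=0.7,         # higher creativity for argumentation
    max_new_tokens=1024,     # room for comprehensive arguments
    top_p=0.9,               # nucleus sampling
    repetition_penalty=1.1,  # discourages circular reasoning
)

CRITIQUE_GEN_KWARGS = dict(
    do_sample=True,
    temperature=0.4,         # within the table's 0.3-0.5 range
    max_new_tokens=512,      # upper end of the table's 400-512 range
    top_p=0.9,
    repetition_penalty=1.1,
)
```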
| Metric | Value | Notes |
|---|---|---|
| Total Debates | 25 | 5 templates × 5 model pairings |
| Models Tested | 5 | All debate-capable models |
| Templates | 5 | Pro/Con, Structured, Scientific, Policy, Philosophical |
| Test Duration | 82 minutes | November 24, 2025 |
| Avg Debate Time | 3.3 minutes | Including initial + rebuttals + summary |
| Success Rate | 100% | With auto-retry handling |
| Retries Used | 3 | Auto-retry handled transient errors |
| Score Range | 0.71 - 0.82 | All models in GOOD+ range |
The debate testing benchmark demonstrates that all 5 tested models are capable of engaging in multi-model debates across various formats. Llama-3.1-Tulu-3-8B leads overall performance, particularly excelling in structured formal debates, while Qwen3-1.7B shows remarkable efficiency, competing with larger models despite having only 1.7B parameters.
Key Takeaways: