-
Notifications
You must be signed in to change notification settings - Fork 1.1k
SWE Bench Evaluation
SWE-bench (Software Engineering Benchmark) is the official benchmark for evaluating Large Language Models on real-world software engineering tasks. This guide covers how to use claude-flow's integrated SWE-bench evaluation system to test all execution modes and optimize performance.
SWE-bench is a dataset of 2,294 real GitHub issues from popular Python repositories, each with:
- Problem Statement: Description of the bug or feature request
- Base Commit: The exact commit where the issue exists
- Oracle Patch: The human-written fix for comparison
- Test Suite: Automated tests to validate solutions
A curated subset of 300 instances designed for faster evaluation and development.
# Install dependencies
pip install datasets swebench
# Ensure claude-flow is built
npm run build
# Single instance test
swarm-bench swe-bench official --limit 1
# SWE-bench Lite evaluation (300 instances)
swarm-bench swe-bench official --lite
# Full SWE-bench evaluation (2,294 instances)
swarm-bench swe-bench official
# Multi-mode comparison
swarm-bench swe-bench multi-mode --instances 5
swarm-bench swe-bench official [OPTIONS]
Options:
-
--lite
: Use SWE-bench-Lite (300 instances) instead of full dataset -
--limit N
: Limit to first N instances for testing -
--mode MODE
: Coordination mode (mesh, hierarchical, distributed, centralized) -
--strategy STRATEGY
: Execution strategy (optimization, development, research, testing) -
--agents N
: Number of agents (default: 8) -
--output PATH
: Custom output directory -
--validate
: Validate existing predictions file
Examples:
# Quick test with 5 instances
swarm-bench swe-bench official --limit 5
# Full lite evaluation with custom settings
swarm-bench swe-bench official --lite --mode hierarchical --strategy development --agents 12
# Validate submission format
swarm-bench swe-bench official --validate --output predictions.json
swarm-bench swe-bench multi-mode [OPTIONS]
Options:
-
--instances N
: Number of instances per mode (default: 1) -
--lite
: Use SWE-bench-Lite dataset -
--quick
: Test only 3 representative modes -
--output PATH
: Custom output directory
Examples:
# Quick comparison of top modes
swarm-bench swe-bench multi-mode --instances 1 --quick
# Comprehensive mode testing
swarm-bench swe-bench multi-mode --instances 3 --lite
# Full benchmark matrix
swarm-bench swe-bench multi-mode --instances 5
-
auto-centralized-5agents
: Automatic strategy, centralized coordination -
research-distributed-5agents
: Research-focused, distributed processing -
development-hierarchical-8agents
: Development workflow, hierarchical structure -
optimization-mesh-8agents
: Best performer - Optimization strategy with mesh topology -
testing-centralized-3agents
: Testing-focused with fewer agents -
analysis-distributed-5agents
: Analysis tasks, distributed coordination -
maintenance-hierarchical-5agents
: Maintenance tasks, hierarchical structure
-
coder-5agents
: Code implementation specialist -
architect-5agents
: System architecture and design -
tdd-5agents
: Test-driven development approach -
reviewer-3agents
: Code review and quality assurance -
tester-3agents
: Testing and validation specialist -
optimizer-5agents
: Performance optimization focus -
debugger-5agents
: Bug hunting and debugging -
documenter-3agents
: Documentation generation
-
default-4workers
: Standard Queen + 4 workers -
8workers
: High-capacity Queen + 8 workers -
tactical-2workers
: Tactical Queen with focused team -
adaptive-6workers
: Adaptive Queen with dynamic coordination
-
hybrid-10agents-parallel
: Hybrid mode with parallel execution -
batch-8agents-parallel
: SPARC batch processing - Custom optimized configurations
Success Rate: Percentage of instances where a valid patch was generated
- Excellent: >80%
- Good: 60-80%
- Fair: 40-60%
- Poor: <40%
Average Duration: Time per instance in seconds
- Fast: <300s (5 minutes)
- Medium: 300-600s (5-10 minutes)
- Slow: >600s (10+ minutes)
Patch Quality: Determined by:
- Valid git diff format
- Applies cleanly to base commit
- Addresses the core issue
- Passes automated tests
Mode Performance Rankings:
1. swarm-optimization-mesh-8agents - Success: 85.2%, Avg: 420.1s
2. hive-mind-8workers - Success: 82.7%, Avg: 380.5s
3. sparc-coder-5agents - Success: 78.3%, Avg: 445.2s
4. swarm-development-hierarchical-8agents - Success: 76.9%, Avg: 390.8s
5. sparc-tdd-5agents - Success: 74.1%, Avg: 520.3s
-
Use Optimal Configuration:
# Best performing setup swarm-bench swe-bench official --lite --mode mesh --strategy optimization --agents 8
-
Start Small:
# Test with limited instances first swarm-bench swe-bench official --limit 10
-
Monitor Progress:
- Watch console output for success rates
- Check generated patch quality
- Review error patterns
-
Validate Results:
# Always validate before submission swarm-bench swe-bench official --validate
Agent Count Optimization:
- 3-5 agents: Simple tasks, faster execution
- 6-8 agents: Complex tasks, balanced performance
- 8-12 agents: Very complex tasks, maximum capability
Mode Selection:
- hive-mind: Best for complex, multi-step problems
- swarm: Best for collaborative analysis tasks
- sparc-coder: Best for straightforward implementation
- sparc-tdd: Best when tests are critical
Strategy Selection:
- optimization: Best overall performance (recommended)
- development: Good for feature implementation
- research: Good for exploration and analysis
- testing: Good when validation is paramount
benchmark/swe-bench-official/results/
├── predictions.json # Submission-ready predictions
├── evaluation_report_*.json # Detailed performance metrics
└── multi_mode_report_*.json # Multi-mode comparison results
{
"instance_id": {
"model_patch": "<git diff content>",
"model_name_or_path": "claude-flow-swarm",
"instance_id": "repo__repo-issue"
}
}
{
"dataset": "SWE-bench-Lite",
"instances_evaluated": 300,
"successful_patches": 255,
"success_rate": 0.85,
"average_duration": 420.1,
"configuration": {
"mode": "mesh",
"strategy": "optimization",
"max_agents": 8
},
"timestamp": "2025-01-07T16:30:00Z"
}
No Patch Generated:
- Check if claude-flow executable exists
- Verify dataset loaded correctly
- Review console output for errors
- Try different mode/strategy combination
Timeout Errors:
- Increase timeout in configuration
- Reduce agent count for faster execution
- Use simpler coordination mode
Invalid Patch Format:
- Check patch extraction logic
- Verify output contains git diff markers
- Review claude-flow output manually
Poor Success Rate:
- Try optimization strategy
- Use mesh coordination mode
- Increase agent count
- Review failed instances for patterns
# Enable verbose logging
export LOG_LEVEL=DEBUG
swarm-bench swe-bench official --limit 1
# Check generated files
ls -la benchmark/swe-bench-official/results/
# Validate specific predictions
swarm-bench swe-bench official --validate --output predictions.json
-
Run Full Evaluation:
swarm-bench swe-bench official --lite
-
Validate Format:
swarm-bench swe-bench official --validate
-
Review Results:
- Check success rate (aim for >70%)
- Verify patch quality
- Review error patterns
- Visit SWE-bench Leaderboard
- Upload
predictions.json
file - Provide model information:
- Model Name: "Claude-Flow-Swarm"
- Version: "Alpha-88"
- Configuration: Your optimal settings
- Wait for automated evaluation
Based on benchmarking results:
- SWE-bench Lite: 75-85% success rate expected
- Full SWE-bench: 65-80% success rate expected
- Average Duration: 5-10 minutes per instance
#!/bin/bash
# ci-swe-bench.sh
set -e
echo "Running SWE-bench evaluation..."
# Quick validation
swarm-bench swe-bench official --limit 5
# Full evaluation if quick test passes
if [ $? -eq 0 ]; then
swarm-bench swe-bench official --lite
swarm-bench swe-bench official --validate
fi
{
"swe_bench": {
"mode": "mesh",
"strategy": "optimization",
"max_agents": 8,
"timeout": 600,
"retry_count": 2,
"output_format": "patch"
}
}
# Benchmark specific repositories
swarm-bench swe-bench official --lite --filter "astropy,django,flask"
# Include additional metrics
swarm-bench swe-bench official --lite --include-metrics
# Process in batches for large datasets
swarm-bench swe-bench official --batch-size 50
The system automatically collects:
- Success rates per mode/configuration
- Execution times and resource usage
- Patch quality indicators
- Error patterns and failure analysis
- Agent coordination efficiency
Generate comprehensive reports:
# Generate performance report
python -m benchmark.src.swarm_benchmark.tools.performance_dashboard
# Compare configurations
python -m benchmark.src.swarm_benchmark.tools.compare_optimizations
- Implement mode in
ClaudeFlowMode
class - Add prompt template in
SWEBenchPromptBuilder
- Update test configurations
- Run validation suite
- Edit
prompt_builder.py
- Test with sample instances
- Validate improvement in success rate
- Submit pull request
- SWE-bench Paper: arXiv:2310.06770
- Official Dataset: HuggingFace
- Leaderboard: swebench.com
- Claude-Flow Issues: GitHub Issues
Last updated: January 2025 Version: Alpha-88