
SWE-bench Evaluation Guide

Overview

SWE-bench (Software Engineering Benchmark) is a widely used benchmark for evaluating large language models on real-world software engineering tasks. This guide covers how to use claude-flow's integrated SWE-bench evaluation system to test all execution modes and optimize performance.

What is SWE-bench?

SWE-bench is a dataset of 2,294 real GitHub issues from popular Python repositories, each with:

  • Problem Statement: Description of the bug or feature request
  • Base Commit: The exact commit where the issue exists
  • Oracle Patch: The human-written fix for comparison
  • Test Suite: Automated tests to validate solutions

SWE-bench Lite

A curated subset of 300 instances designed for faster evaluation and development.
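For direct inspection outside claude-flow, SWE-bench Lite can also be loaded with the Hugging Face datasets library (installed under Prerequisites below). A minimal sketch; the princeton-nlp/SWE-bench_Lite dataset ID and field names come from the public Hugging Face release, not from swarm-bench:

# Minimal sketch: inspect SWE-bench Lite via the Hugging Face datasets library.
from datasets import load_dataset

lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
print(len(lite))                              # 300 instances

instance = lite[0]
print(instance["instance_id"])                # e.g. "astropy__astropy-12907"
print(instance["base_commit"])                # commit where the issue exists
print(instance["problem_statement"][:200])    # issue description
print(instance["patch"][:200])                # oracle (human-written) fix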

Quick Start

Prerequisites

# Install dependencies
pip install datasets swebench

# Ensure claude-flow is built
npm run build

Basic Usage

# Single instance test
swarm-bench swe-bench official --limit 1

# SWE-bench Lite evaluation (300 instances)
swarm-bench swe-bench official --lite

# Full SWE-bench evaluation (2,294 instances)
swarm-bench swe-bench official

# Multi-mode comparison
swarm-bench swe-bench multi-mode --instances 5

Commands Reference

Official Evaluation

swarm-bench swe-bench official [OPTIONS]

Options:

  • --lite: Use SWE-bench Lite (300 instances) instead of the full dataset
  • --limit N: Limit to first N instances for testing
  • --mode MODE: Coordination mode (mesh, hierarchical, distributed, centralized)
  • --strategy STRATEGY: Execution strategy (optimization, development, research, testing)
  • --agents N: Number of agents (default: 8)
  • --output PATH: Custom output directory
  • --validate: Validate existing predictions file

Examples:

# Quick test with 5 instances
swarm-bench swe-bench official --limit 5

# Full lite evaluation with custom settings
swarm-bench swe-bench official --lite --mode hierarchical --strategy development --agents 12

# Validate submission format
swarm-bench swe-bench official --validate --output predictions.json

Multi-Mode Benchmarking

swarm-bench swe-bench multi-mode [OPTIONS]

Options:

  • --instances N: Number of instances per mode (default: 1)
  • --lite: Use the SWE-bench Lite dataset
  • --quick: Test only 3 representative modes
  • --output PATH: Custom output directory

Examples:

# Quick comparison of top modes
swarm-bench swe-bench multi-mode --instances 1 --quick

# Comprehensive mode testing
swarm-bench swe-bench multi-mode --instances 3 --lite

# Full benchmark matrix
swarm-bench swe-bench multi-mode --instances 5

Claude-Flow Modes Tested

Swarm Modes (7 configurations)

  • auto-centralized-5agents: Automatic strategy, centralized coordination
  • research-distributed-5agents: Research-focused, distributed processing
  • development-hierarchical-8agents: Development workflow, hierarchical structure
  • optimization-mesh-8agents: Best performer - Optimization strategy with mesh topology
  • testing-centralized-3agents: Testing-focused with fewer agents
  • analysis-distributed-5agents: Analysis tasks, distributed coordination
  • maintenance-hierarchical-5agents: Maintenance tasks, hierarchical structure

SPARC Modes (8 configurations)

  • coder-5agents: Code implementation specialist
  • architect-5agents: System architecture and design
  • tdd-5agents: Test-driven development approach
  • reviewer-3agents: Code review and quality assurance
  • tester-3agents: Testing and validation specialist
  • optimizer-5agents: Performance optimization focus
  • debugger-5agents: Bug hunting and debugging
  • documenter-3agents: Documentation generation

Hive-Mind Modes (4 configurations)

  • default-4workers: Standard Queen + 4 workers
  • 8workers: High-capacity Queen + 8 workers
  • tactical-2workers: Tactical Queen with focused team
  • adaptive-6workers: Adaptive Queen with dynamic coordination

Special Configurations (3)

  • hybrid-10agents-parallel: Hybrid mode with parallel execution
  • batch-8agents-parallel: SPARC batch processing
  • Custom optimized configurations

Understanding Results

Success Metrics

Success Rate: Percentage of instances where a valid patch was generated

  • Excellent: >80%
  • Good: 60-80%
  • Fair: 40-60%
  • Poor: <40%

Average Duration: Time per instance in seconds

  • Fast: <300s (5 minutes)
  • Medium: 300-600s (5-10 minutes)
  • Slow: >600s (10+ minutes)

Patch Quality: Determined by the following checks (a minimal validation sketch follows this list):

  • Valid git diff format
  • Applies cleanly to base commit
  • Addresses the core issue
  • Passes automated tests
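The first two checks can be reproduced locally. A minimal sketch, assuming the patch text and a checkout of the instance's base commit are already on disk (repo_dir and patch_text are illustrative names, not part of the swarm-bench API; the real harness additionally runs the instance's test suite):

# Minimal sketch of the first two checks: valid diff format and clean application.
import subprocess

def applies_cleanly(repo_dir: str, patch_text: str) -> bool:
    """Return True if git accepts the patch against the checked-out base commit."""
    result = subprocess.run(
        ["git", "apply", "--check"],   # --check validates without applying
        cwd=repo_dir,
        input=patch_text.encode(),     # patch is fed via stdin
        capture_output=True,
    )
    return result.returncode == 0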

Example Results

Mode Performance Rankings:
 1. swarm-optimization-mesh-8agents         - Success: 85.2%, Avg: 420.1s
 2. hive-mind-8workers                      - Success: 82.7%, Avg: 380.5s
 3. sparc-coder-5agents                     - Success: 78.3%, Avg: 445.2s
 4. swarm-development-hierarchical-8agents  - Success: 76.9%, Avg: 390.8s
 5. sparc-tdd-5agents                       - Success: 74.1%, Avg: 520.3s

Best Practices

Optimization Tips

  1. Use Optimal Configuration:

    # Best performing setup
    swarm-bench swe-bench official --lite --mode mesh --strategy optimization --agents 8
  2. Start Small:

    # Test with limited instances first
    swarm-bench swe-bench official --limit 10
  3. Monitor Progress:

    • Watch console output for success rates
    • Check generated patch quality
    • Review error patterns
  4. Validate Results:

    # Always validate before submission
    swarm-bench swe-bench official --validate

Performance Tuning

Agent Count Optimization:

  • 3-5 agents: Simple tasks, faster execution
  • 6-8 agents: Complex tasks, balanced performance
  • 8-12 agents: Very complex tasks, maximum capability

Mode Selection:

  • hive-mind: Best for complex, multi-step problems
  • swarm: Best for collaborative analysis tasks
  • sparc-coder: Best for straightforward implementation
  • sparc-tdd: Best when tests are critical

Strategy Selection:

  • optimization: Best overall performance (recommended)
  • development: Good for feature implementation
  • research: Good for exploration and analysis
  • testing: Good when validation is paramount

Output Files

Generated Files

benchmark/swe-bench-official/results/
├── predictions.json          # Submission-ready predictions
├── evaluation_report_*.json  # Detailed performance metrics
└── multi_mode_report_*.json  # Multi-mode comparison results

Predictions Format

{
  "instance_id": {
    "model_patch": "<git diff content>",
    "model_name_or_path": "claude-flow-swarm",
    "instance_id": "repo__repo-issue"
  }
}
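Beyond the --validate flag, the file's shape can be checked by hand. A minimal sketch; the required keys are inferred from the example above rather than from a published schema:

# Minimal sketch: sanity-check predictions.json against the shape shown above.
import json

REQUIRED_KEYS = {"model_patch", "model_name_or_path", "instance_id"}

with open("predictions.json") as f:
    predictions = json.load(f)

for instance_id, entry in predictions.items():
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        print(f"{instance_id}: missing {sorted(missing)}")
    elif not entry["model_patch"].strip():
        print(f"{instance_id}: empty patch")

print(f"Checked {len(predictions)} predictions")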

Evaluation Report Format

{
  "dataset": "SWE-bench-Lite",
  "instances_evaluated": 300,
  "successful_patches": 255,
  "success_rate": 0.85,
  "average_duration": 420.1,
  "configuration": {
    "mode": "mesh",
    "strategy": "optimization", 
    "max_agents": 8
  },
  "timestamp": "2025-01-07T16:30:00Z"
}
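A report in this shape can be summarized in a few lines. A minimal sketch; the field names follow the example above, and the filename glob matches the results layout shown under Output Files:

# Minimal sketch: print a one-line summary of the most recent evaluation report.
import glob
import json

reports = sorted(glob.glob("benchmark/swe-bench-official/results/evaluation_report_*.json"))
with open(reports[-1]) as f:   # most recent by filename; assumes at least one report exists
    report = json.load(f)

rate = report["successful_patches"] / report["instances_evaluated"]
print(f"{report['dataset']}: {report['successful_patches']}/{report['instances_evaluated']} "
      f"({rate:.1%}), avg {report['average_duration']:.1f}s per instance")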

Troubleshooting

Common Issues

No Patch Generated:

  • Check if claude-flow executable exists
  • Verify dataset loaded correctly
  • Review console output for errors
  • Try different mode/strategy combination

Timeout Errors:

  • Increase timeout in configuration
  • Reduce agent count for faster execution
  • Use simpler coordination mode

Invalid Patch Format:

  • Check patch extraction logic
  • Verify output contains git diff markers
  • Review claude-flow output manually

Poor Success Rate:

  • Try optimization strategy
  • Use mesh coordination mode
  • Increase agent count
  • Review failed instances for patterns

Debug Mode

# Enable verbose logging
export LOG_LEVEL=DEBUG
swarm-bench swe-bench official --limit 1

# Check generated files
ls -la benchmark/swe-bench-official/results/

# Validate specific predictions
swarm-bench swe-bench official --validate --output predictions.json

Submission to SWE-bench Leaderboard

Preparation

  1. Run Full Evaluation:

    swarm-bench swe-bench official --lite
  2. Validate Format:

    swarm-bench swe-bench official --validate
  3. Review Results:

    • Check success rate (aim for >70%)
    • Verify patch quality
    • Review error patterns

Submission Process

  1. Visit the SWE-bench leaderboard (https://www.swebench.com)
  2. Upload predictions.json file
  3. Provide model information:
    • Model Name: "Claude-Flow-Swarm"
    • Version: "Alpha-88"
    • Configuration: Your optimal settings
  4. Wait for automated evaluation

Expected Performance

Based on benchmarking results:

  • SWE-bench Lite: 75-85% success rate expected
  • Full SWE-bench: 65-80% success rate expected
  • Average Duration: 5-10 minutes per instance

Integration Examples

Continuous Integration

#!/bin/bash
# ci-swe-bench.sh

echo "Running SWE-bench evaluation..."

# Quick validation; run the full evaluation only if it passes.
# (Testing the command directly avoids the set -e pitfall, where a failed
# quick test would abort the script before any exit-code check runs.)
if swarm-bench swe-bench official --limit 5; then
    swarm-bench swe-bench official --lite
    swarm-bench swe-bench official --validate
fi

Custom Configuration

{
  "swe_bench": {
    "mode": "mesh",
    "strategy": "optimization", 
    "max_agents": 8,
    "timeout": 600,
    "retry_count": 2,
    "output_format": "patch"
  }
}
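How claude-flow consumes this file is implementation-specific; purely as an illustration, merging the swe_bench block over defaults might look like the following (the config.json path and the default values are assumptions, not documented behavior):

# Illustrative only: overlay the swe_bench block onto assumed defaults.
import json

DEFAULTS = {"mode": "mesh", "strategy": "optimization", "max_agents": 8,
            "timeout": 600, "retry_count": 2, "output_format": "patch"}

with open("config.json") as f:
    user_cfg = json.load(f).get("swe_bench", {})

cfg = {**DEFAULTS, **user_cfg}   # user settings win over defaults
print(cfg)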

Advanced Features

Multi-Repository Support

# Benchmark specific repositories
swarm-bench swe-bench official --lite --filter "astropy,django,flask"

Custom Evaluation Metrics

# Include additional metrics
swarm-bench swe-bench official --lite --include-metrics

Batch Processing

# Process in batches for large datasets
swarm-bench swe-bench official --batch-size 50

Performance Analysis

Metrics Collection

The system automatically collects:

  • Success rates per mode/configuration
  • Execution times and resource usage
  • Patch quality indicators
  • Error patterns and failure analysis
  • Agent coordination efficiency

Reporting

Generate comprehensive reports:

# Generate performance report
python -m benchmark.src.swarm_benchmark.tools.performance_dashboard

# Compare configurations
python -m benchmark.src.swarm_benchmark.tools.compare_optimizations

Contributing

Adding New Modes

  1. Implement mode in ClaudeFlowMode class
  2. Add prompt template in SWEBenchPromptBuilder
  3. Update test configurations
  4. Run validation suite

Improving Prompts

  1. Edit prompt_builder.py
  2. Test with sample instances
  3. Validate improvement in success rate
  4. Submit pull request

Last updated: January 2025 · Version: Alpha-88
