# Evaluating AI Models for Software Testing

## Overview

This notebook provides a comprehensive framework for evaluating AI models in software testing contexts. As AI becomes increasingly integrated into testing workflows, it's critical to assess model performance, reliability, and suitability for specific testing tasks.

**Research Goals:**
- Define evaluation criteria for testing-focused AI models
- Establish benchmarking methodologies
- Analyze trade-offs between different model architectures
- Provide practical evaluation frameworks


## 1. Evaluation Dimensions

### 1.1 Core Testing Capabilities

AI models for software testing should be evaluated across multiple dimensions:

| Dimension | Description | Key Metrics |
|-----------|-------------|-------------|
| **Test Generation Quality** | Ability to create comprehensive test cases | Coverage, edge case detection, code quality |
| **Bug Detection Accuracy** | Precision in identifying real defects | Precision, recall, F1-score |
| **Code Understanding** | Comprehension of code semantics and structure | Semantic accuracy, context retention |
| **Reasoning Capability** | Logical inference for test planning | Chain-of-thought quality, decision accuracy |
| **Adaptability** | Performance across different languages/frameworks | Cross-domain performance |
| **Speed & Efficiency** | Response time and resource utilization | Latency, throughput, token efficiency |
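
For reference, the bug-detection metrics in the table follow the standard definitions, where $TP$, $FP$, and $FN$ count true positives, false positives, and false negatives among the reported defects:

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

As an illustrative case with invented counts, $TP = 8$, $FP = 2$, and $FN = 4$ give a precision of $0.80$, a recall of about $0.67$, and an $F_1$ of about $0.73$.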


### 1.2 Domain-Specific Requirements

Different testing contexts require specialized evaluation:

- **Unit Testing**: Code coverage, assertion quality, test independence
- **Integration Testing**: System boundary understanding, data flow analysis
- **Security Testing**: Vulnerability detection rates, false positive management
- **Performance Testing**: Load scenario generation, bottleneck identification
- **UI/UX Testing**: User journey comprehension, accessibility awareness


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple
from dataclasses import dataclass
import json

# Set visualization defaults
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries loaded successfully")


## 2. Evaluation Framework

### 2.1 Model Evaluation Class


In [None]:
@dataclass
class ModelEvaluationMetrics:
    """Comprehensive metrics for model evaluation in testing context"""
    model_name: str
    test_generation_score: float
    bug_detection_accuracy: float
    code_understanding: float
    reasoning_quality: float
    avg_latency_ms: float
    token_efficiency: float
    context_window: int
    
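    def overall_score(self) -> float:
        """Calculate weighted overall score.

        Assumes every score field is normalized to [0, 1]. The 'efficiency'
        weight applies to token_efficiency only; avg_latency_ms and
        context_window are reported but not included in the score.
        """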
        weights = {
            'test_gen': 0.25,
            'bug_detect': 0.25,
            'code_understanding': 0.20,
            'reasoning': 0.15,
            'efficiency': 0.15
        }
        
        return (
            weights['test_gen'] * self.test_generation_score +
            weights['bug_detect'] * self.bug_detection_accuracy +
            weights['code_understanding'] * self.code_understanding +
            weights['reasoning'] * self.reasoning_quality +
            weights['efficiency'] * self.token_efficiency
        )

class ModelEvaluator:
    """Framework for evaluating AI models for software testing tasks"""
    
    def __init__(self):
        self.results = []
        self.benchmarks = {}
    
    def evaluate_test_generation(self, model_output: str, ground_truth: Dict) -> float:
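        """
        Estimate the quality of generated test cases with keyword heuristics.

        Each component checks for indicative substrings in model_output, so
        the result is a rough proxy rather than measured coverage.
        ground_truth is currently unused.

        Components:
        - Test scaffolding keywords (proxy for coverage completeness)
        - Edge case identification
        - Test structure quality
        - Documentation
        """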
        score = 0.0
        max_score = 100.0
        
        # Coverage analysis (0-30 points)
        coverage_keywords = ['setUp', 'tearDown', 'test_', 'assert', 'mock']
        coverage_score = sum(1 for kw in coverage_keywords if kw in model_output) * 6
        score += min(coverage_score, 30)
        
        # Edge case detection (0-30 points)
        edge_cases = ['null', 'None', 'empty', 'zero', 'negative', 'boundary', 'max', 'min']
        edge_score = sum(1 for ec in edge_cases if ec.lower() in model_output.lower()) * 5
        score += min(edge_score, 30)
        
        # Structure quality (0-20 points)
        structure_indicators = ['def test_', 'class Test', 'self.assert']
        structure_score = sum(1 for si in structure_indicators if si in model_output) * 7
        score += min(structure_score, 20)
        
        # Documentation (0-20 points)
        doc_indicators = ['"""', 'Args:', 'Returns:', '#']
        doc_score = sum(1 for di in doc_indicators if di in model_output) * 5
        score += min(doc_score, 20)
        
        return score / max_score
    
    def evaluate_bug_detection(self, predictions: List[Dict], actual_bugs: List[Dict]) -> Tuple[float, float, float]:
        """
        Calculate precision, recall, and F1 for bug detection
        """
        # Compare findings by location; duplicate locations collapse in the sets
        predicted_bugs = {p['location'] for p in predictions}
        actual_bug_locations = {b['location'] for b in actual_bugs}
        
        true_positives = len(predicted_bugs & actual_bug_locations)
        false_positives = len(predicted_bugs - actual_bug_locations)
        false_negatives = len(actual_bug_locations - predicted_bugs)
        
        # Guard each ratio against division by zero (no predictions or no known bugs)
        precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0.0
        recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0.0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0
        
        return precision, recall, f1
    
    def evaluate_code_understanding(self, model_response: str, code_context: str) -> float:
        """
        Assess model's comprehension of code structure and semantics
        """
        score = 0.0
        
        # Check for language-specific understanding
        if 'function' in code_context or 'def ' in code_context:
            if 'function' in model_response.lower() or 'method' in model_response.lower():
                score += 0.2
        
        # Check for pattern recognition
        patterns = ['loop', 'conditional', 'class', 'inheritance', 'async']
        matches = sum(1 for p in patterns if p in code_context.lower() and p in model_response.lower())
        score += (matches / len(patterns)) * 0.4
        
        # Check for dependency understanding
        if 'import' in code_context and ('dependency' in model_response.lower() or 'import' in model_response.lower()):
            score += 0.2
        
        # Check for data flow understanding
        if any(var in model_response for var in ['parameter', 'argument', 'return', 'variable']):
            score += 0.2
        
        return min(score, 1.0)
    
    def add_evaluation(self, metrics: ModelEvaluationMetrics):
        """Add evaluation results for a model"""
        self.results.append(metrics)
    
    def compare_models(self) -> pd.DataFrame:
        """Generate comparison table of all evaluated models"""
        data = []
        for result in self.results:
            data.append({
                'Model': result.model_name,
                'Test Generation': result.test_generation_score,
                'Bug Detection': result.bug_detection_accuracy,
                'Code Understanding': result.code_understanding,
                'Reasoning': result.reasoning_quality,
                'Latency (ms)': result.avg_latency_ms,
                'Token Efficiency': result.token_efficiency,
                'Context Window': result.context_window,
                'Overall Score': result.overall_score()
            })
        return pd.DataFrame(data)

print("Model evaluation framework initialized")
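
`overall_score` leaves `avg_latency_ms` out of the weighted sum, since latency is measured in milliseconds rather than on the 0-1 scale of the other fields. One possible way to fold it in, sketched below, is to map latency onto 0-1 with a linear cutoff first. The names `normalize_latency` and `weighted_score` and the 2000 ms cutoff are assumptions made for illustration and are not part of the framework.

```python
from typing import Dict


def normalize_latency(latency_ms: float, worst_ms: float = 2000.0) -> float:
    """Map latency to [0, 1]: 1.0 is instantaneous, 0.0 is worst_ms or slower."""
    if worst_ms <= 0:
        raise ValueError("worst_ms must be positive")
    return max(0.0, 1.0 - latency_ms / worst_ms)


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of scores; weights are renormalized so they need not sum to 1."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * scores[name] for name in weights) / total_weight


# Illustrative values only, not measurements
scores = {
    "test_gen": 0.90,
    "bug_detect": 0.80,
    "latency": normalize_latency(500.0),  # 0.75 with the assumed 2000 ms cutoff
}
weights = {"test_gen": 0.4, "bug_detect": 0.4, "latency": 0.2}

combined = weighted_score(scores, weights)
print(f"Combined score: {combined:.3f}")
```

The cutoff controls how harshly slow models are penalized, so it should be chosen from the latency budget of the actual testing workflow.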


## 3. Benchmark Dataset Creation

### 3.1 Synthetic Test Cases


In [None]:
# Sample evaluation dataset
benchmark_data = {
    'test_generation': [
        {
            'task': 'Generate unit tests for a calculator class',
            'code': '''
class Calculator:
    def add(self, a, b):
        return a + b
    
    def divide(self, a, b):
        if b == 0:
            raise ValueError("Cannot divide by zero")
        return a / b
            ''',
            'expected_coverage': ['basic_operations', 'edge_cases', 'error_handling']
        },
        {
            'task': 'Generate integration tests for API endpoint',
            'code': '''
@app.route('/users/<int:user_id>', methods=['GET', 'PUT', 'DELETE'])
def user_endpoint(user_id):
    if request.method == 'GET':
        return get_user(user_id)
    elif request.method == 'PUT':
        return update_user(user_id, request.json)
    elif request.method == 'DELETE':
        return delete_user(user_id)
            ''',
            'expected_coverage': ['http_methods', 'authentication', 'error_responses', 'data_validation']
        }
    ],
    'bug_detection': [
        {
            'code': '''
def process_items(items):
    total = 0
    for item in items:
        total += item.price  # Bug: no null check
    return total
            ''',
            'bugs': [{'type': 'null_pointer', 'location': 'line 4', 'severity': 'high'}]
        },
        {
            'code': '''
def authenticate(username, password):
    if username == "admin" and password == "admin":  # Bug: hardcoded credentials
        return True
    return False
            ''',
            'bugs': [{'type': 'security', 'location': 'line 2', 'severity': 'critical'}]
        }
    ]
}

print(f"Benchmark dataset created with {len(benchmark_data['test_generation'])} test generation cases")
print(f"and {len(benchmark_data['bug_detection'])} bug detection cases")
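
To show how a benchmark entry might be scored, the sketch below compares a hypothetical prediction list against the `bugs` ground truth of the first `bug_detection` entry, matching on `location` as `evaluate_bug_detection` does. The predictions and the helper name `score_predictions` are invented for illustration.

```python
from typing import Dict, List, Tuple


def score_predictions(predictions: List[Dict], actual_bugs: List[Dict]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), matching bugs by their 'location' string."""
    predicted = {p["location"] for p in predictions}
    actual = {b["location"] for b in actual_bugs}
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1


# Ground truth copied from the first bug_detection entry
actual_bugs = [{"type": "null_pointer", "location": "line 4", "severity": "high"}]

# Hypothetical model output: one correct finding and one false alarm
predictions = [
    {"type": "null_pointer", "location": "line 4"},
    {"type": "style", "location": "line 2"},
]

precision, recall, f1 = score_predictions(predictions, actual_bugs)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Matching on exact location strings is strict: a prediction that reports the same defect as `line 5` counts as both a false positive and a false negative, so a tolerance window may be worth adding for real model output.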


## 4. Model Comparison

### 4.1 Sample Model Evaluations


In [None]:
# Initialize evaluator
evaluator = ModelEvaluator()

# Illustrative placeholder scores, not measured results; replace them with real evaluation output
sample_models = [
    ModelEvaluationMetrics(
        model_name='GPT-4',
        test_generation_score=0.92,
        bug_detection_accuracy=0.88,
        code_understanding=0.94,
        reasoning_quality=0.91,
        avg_latency_ms=1200,
        token_efficiency=0.85,
        context_window=128000
    ),
    ModelEvaluationMetrics(
        model_name='Claude 3.5 Sonnet',
        test_generation_score=0.95,
        bug_detection_accuracy=0.90,
        code_understanding=0.96,
        reasoning_quality=0.93,
        avg_latency_ms=1000,
        token_efficiency=0.88,
        context_window=200000
    ),
    ModelEvaluationMetrics(
        model_name='CodeLlama-34B',
        test_generation_score=0.82,
        bug_detection_accuracy=0.79,
        code_understanding=0.86,
        reasoning_quality=0.78,
        avg_latency_ms=800,
        token_efficiency=0.92,
        context_window=16384
    ),
    ModelEvaluationMetrics(
        model_name='Gemini Pro 1.5',
        test_generation_score=0.89,
        bug_detection_accuracy=0.85,
        code_understanding=0.90,
        reasoning_quality=0.87,
        avg_latency_ms=1100,
        token_efficiency=0.86,
        context_window=1000000
    ),
    ModelEvaluationMetrics(
        model_name='GPT-3.5 Turbo',
        test_generation_score=0.75,
        bug_detection_accuracy=0.72,
        code_understanding=0.78,
        reasoning_quality=0.70,
        avg_latency_ms=600,
        token_efficiency=0.90,
        context_window=16384
    )
]

# Add all evaluations
for model in sample_models:
    evaluator.add_evaluation(model)

# Generate comparison
comparison_df = evaluator.compare_models()
print("\nModel Comparison:")
print(comparison_df.to_string(index=False))
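
A single overall score hides the latency trade-off, so it can help to list the models that no other model beats on both score and latency at once (a Pareto front). The sketch below does this for the sample values above. The `pareto_front` helper is an assumption for illustration, and the overall scores are recomputed by hand from the sample inputs using the weights in `overall_score`.

```python
import pandas as pd

# Overall scores recomputed from the sample inputs; latencies as listed above
models = pd.DataFrame({
    "Model": ["GPT-4", "Claude 3.5 Sonnet", "CodeLlama-34B", "Gemini Pro 1.5", "GPT-3.5 Turbo"],
    "Overall Score": [0.9020, 0.9260, 0.8295, 0.8745, 0.7635],
    "Latency (ms)": [1200, 1000, 800, 1100, 600],
})


def pareto_front(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows that no other row dominates (higher-or-equal score AND lower-or-equal latency, one strictly)."""
    keep = []
    for _, row in df.iterrows():
        at_least_as_good = (df["Overall Score"] >= row["Overall Score"]) & (df["Latency (ms)"] <= row["Latency (ms)"])
        strictly_better = (df["Overall Score"] > row["Overall Score"]) | (df["Latency (ms)"] < row["Latency (ms)"])
        keep.append(not (at_least_as_good & strictly_better).any())
    return df[keep].reset_index(drop=True)


front = pareto_front(models)
print(front.to_string(index=False))
```

Models outside the front are slower and lower-scoring than at least one alternative, so under these two criteria alone they would rarely be the first choice.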


### 4.2 Visualize Results


In [None]:
# Create comparison visualizations
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Overall scores
ax1 = axes[0, 0]
comparison_df.plot(x='Model', y='Overall Score', kind='bar', ax=ax1, color='steelblue', legend=False)
ax1.set_title('Overall Testing Capability Score', fontsize=14, fontweight='bold')
ax1.set_ylabel('Score')
ax1.set_ylim(0, 1)
ax1.grid(axis='y', alpha=0.3)
ax1.tick_params(axis='x', rotation=45)

# Latency comparison
ax2 = axes[0, 1]
comparison_df.plot(x='Model', y='Latency (ms)', kind='bar', ax=ax2, color='coral', legend=False)
ax2.set_title('Average Latency', fontsize=14, fontweight='bold')
ax2.set_ylabel('Milliseconds')
ax2.grid(axis='y', alpha=0.3)
ax2.tick_params(axis='x', rotation=45)

# Capability breakdown
ax3 = axes[1, 0]
capability_cols = ['Test Generation', 'Bug Detection', 'Code Understanding', 'Reasoning']
comparison_df.plot(x='Model', y=capability_cols, kind='bar', ax=ax3)
ax3.set_title('Capability Breakdown', fontsize=14, fontweight='bold')
ax3.set_ylabel('Score')
ax3.set_ylim(0, 1)
ax3.legend(loc='lower right')
ax3.grid(axis='y', alpha=0.3)
ax3.tick_params(axis='x', rotation=45)

# Context window comparison (log scale)
ax4 = axes[1, 1]
comparison_df.plot(x='Model', y='Context Window', kind='bar', ax=ax4, color='mediumseagreen', legend=False)
ax4.set_title('Context Window Size', fontsize=14, fontweight='bold')
ax4.set_ylabel('Tokens (log scale)')
ax4.set_yscale('log')
ax4.grid(axis='y', alpha=0.3)
ax4.tick_params(axis='x', rotation=45)

plt.tight_layout()
print("\nVisualization generated successfully")
plt.show()


## 5. Advanced Evaluation Techniques

### 5.1 Adversarial Testing


In [None]:
class AdversarialTestingSuite:
    """Test models with adversarial and edge case scenarios"""
    
    def __init__(self):
        self.test_cases = []
        
    def add_ambiguous_code_test(self):
        """Test with intentionally ambiguous or obfuscated code"""
        return {
            'name': 'Ambiguous Code',
            'code': 'x = lambda a: a if a > 0 else -a if a < 0 else 0',
            'expected': 'Should identify absolute value logic'
        }
    
    def add_security_vulnerability_test(self):
        """Test detection of security vulnerabilities"""
        return {
            'name': 'SQL Injection Vulnerability',
            'code': 'query = f"SELECT * FROM users WHERE username = \'{user_input}\'"',
            'expected': 'Should flag SQL injection risk'
        }
    
    def add_concurrency_issue_test(self):
        """Test detection of race conditions and concurrency bugs"""
        return {
            'name': 'Race Condition',
            'code': '''
counter = 0
def increment():
    global counter
    temp = counter
    counter = temp + 1
            ''',
            'expected': 'Should identify race condition in multi-threaded context'
        }
    
    def add_memory_leak_test(self):
        """Test detection of potential memory leaks"""
        return {
            'name': 'Memory Leak',
            'code': '''
cache = {}
def add_to_cache(key, value):
    cache[key] = value  # No size limit or eviction
            ''',
            'expected': 'Should identify unbounded cache growth'
        }
    
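    def run_adversarial_suite(self, model_evaluator):
        """Build, store, and print the adversarial test cases.

        No model is queried here: model_evaluator is accepted for future
        scoring but is not used yet.
        """
        tests = [
            self.add_ambiguous_code_test(),
            self.add_security_vulnerability_test(),
            self.add_concurrency_issue_test(),
            self.add_memory_leak_test()
        ]
        self.test_cases = tests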
        
        print("\n=== Adversarial Testing Suite ===")
        for test in tests:
            print(f"\nTest: {test['name']}")
            print(f"Expected: {test['expected']}")
            print(f"Code Sample:\n{test['code']}")
        
        return tests

adversarial_suite = AdversarialTestingSuite()
test_results = adversarial_suite.run_adversarial_suite(evaluator)
print(f"\n{len(test_results)} adversarial tests ready for model evaluation")
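
The suite above only prints its cases; it does not score a model's answer. A minimal way to close that gap, sketched here, is to attach keywords to each case and check whether a response mentions at least one of them. The `expected_keywords` field, the `response_flags_issue` helper, and the sample response are all assumptions for illustration, and keyword matching is only a rough proxy for a correct diagnosis.

```python
from typing import Dict, List


def response_flags_issue(response: str, expected_keywords: List[str]) -> bool:
    """Return True if the response mentions any expected keyword (case-insensitive)."""
    text = response.lower()
    return any(keyword.lower() in text for keyword in expected_keywords)


# Hypothetical keyword lists for two of the adversarial cases
cases: List[Dict] = [
    {"name": "SQL Injection Vulnerability", "expected_keywords": ["sql injection", "parameterized"]},
    {"name": "Race Condition", "expected_keywords": ["race condition", "lock", "thread"]},
]

# Hypothetical model response, written by hand for this example
sample_response = "Building the query with an f-string allows SQL injection; use parameterized queries."

results = {
    case["name"]: response_flags_issue(sample_response, case["expected_keywords"])
    for case in cases
}
for name, flagged in results.items():
    print(f"{name}: {'flagged' if flagged else 'missed'}")
```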


### 5.2 Multi-Language Support Evaluation


In [None]:
# Multi-language test suite
multilang_tests = {
    'Python': {
        'code': 'def fibonacci(n):\n    return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)',
        'test_type': 'recursive_optimization'
    },
    'JavaScript': {
        'code': 'const fetchUser = async (id) => { const res = await fetch(`/api/users/${id}`); return res.json(); }',
        'test_type': 'async_error_handling'
    },
    'Java': {
        'code': 'public class User { private String name; public void setName(String name) { this.name = name; } }',
        'test_type': 'null_safety'
    },
    'Go': {
        'code': 'func divide(a, b int) (int, error) { if b == 0 { return 0, errors.New("division by zero") }; return a / b, nil }',
        'test_type': 'error_handling'
    },
    'TypeScript': {
        'code': 'interface User { id: number; name: string; email?: string; } function getUser(id: number): User | null { return null; }',
        'test_type': 'type_safety'
    }
}

# Create DataFrame for visualization
lang_data = []
for lang, test_info in multilang_tests.items():
    lang_data.append({
        'Language': lang,
        'Test Type': test_info['test_type'],
        'Code Length': len(test_info['code'])
    })

lang_df = pd.DataFrame(lang_data)
print("\nMulti-Language Test Suite:")
print(lang_df.to_string(index=False))

# Visualize code sample length per language
fig, ax = plt.subplots(figsize=(10, 6))
lang_df.plot(x='Language', y='Code Length', kind='bar', ax=ax, color='mediumpurple', legend=False)
ax.set_title('Code Sample Length by Programming Language', fontsize=14, fontweight='bold')
ax.set_ylabel('Code Sample Length (characters)')
ax.grid(axis='y', alpha=0.3)
ax.tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()
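
Once per-language scores exist, cross-domain performance can be summarized by the mean, the spread, and the weakest language. The scores below are made-up placeholders, and `summarize_cross_language` is a hypothetical helper.

```python
from statistics import mean, pstdev
from typing import Dict


def summarize_cross_language(scores: Dict[str, float]) -> Dict[str, object]:
    """Summarize per-language scores: mean, population std dev, and weakest language."""
    if not scores:
        raise ValueError("scores must not be empty")
    values = list(scores.values())
    weakest = min(scores, key=scores.get)
    return {
        "mean": mean(values),
        "std": pstdev(values),
        "weakest_language": weakest,
        "weakest_score": scores[weakest],
    }


# Placeholder scores in [0, 1]; replace with measured results per language
language_scores = {"Python": 0.90, "JavaScript": 0.85, "Java": 0.80, "Go": 0.70, "TypeScript": 0.75}

summary = summarize_cross_language(language_scores)
print(summary)
```

Reporting the weakest language alongside the mean keeps a strong average from hiding poor results on one stack.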


## 6. Real-World Performance Metrics

### 6.1 Production Testing Scenarios


In [None]:
class ProductionMetrics:
    """Track real-world performance metrics for model deployment"""
    
    def __init__(self):
        self.metrics = {
            'false_positive_rate': [],
            'test_execution_time': [],
            'maintenance_overhead': [],
            'developer_productivity_impact': []
        }
    
    def calculate_roi(self, model_name: str, tests_generated: int, bugs_found: int, 
                      time_saved_hours: float, implementation_cost: float) -> Dict:
        """
        Calculate ROI for model deployment in testing pipeline
        """
        # Assumptions
        avg_hourly_rate = 75  # USD
        cost_per_bug_in_production = 1000  # USD
        manual_test_time_per_test = 0.5  # hours
        
        # Benefits
        time_saved_value = time_saved_hours * avg_hourly_rate
        tests_generated_value = tests_generated * manual_test_time_per_test * avg_hourly_rate
        bugs_prevented_value = bugs_found * cost_per_bug_in_production
        
        total_benefit = time_saved_value + tests_generated_value + bugs_prevented_value
        # An ROI of 0% is break-even; 100% means the benefit is double the cost
        roi = ((total_benefit - implementation_cost) / implementation_cost) * 100
        
        return {
            'model': model_name,
            'total_benefit': total_benefit,
            'implementation_cost': implementation_cost,
            'roi_percentage': roi,
            'breakeven_tests': implementation_cost / (manual_test_time_per_test * avg_hourly_rate)
        }
    
    def generate_deployment_report(self, models: List[str]) -> pd.DataFrame:
        """Generate comprehensive deployment readiness report"""
        report_data = []
        # Fixed seed so the simulated report is reproducible across runs
        rng = np.random.default_rng(42)
        
        for model in models:
            # Simulated inputs, not measurements; replace with production data
            roi = self.calculate_roi(
                model_name=model,
                tests_generated=int(rng.integers(100, 500)),
                bugs_found=int(rng.integers(10, 50)),
                time_saved_hours=rng.uniform(50, 200),
                implementation_cost=rng.uniform(5000, 15000)
            )
            
            report_data.append({
                'Model': model,
                'ROI (%)': round(roi['roi_percentage'], 2),
                'Total Benefit ($)': round(roi['total_benefit'], 2),
                'Implementation Cost ($)': round(roi['implementation_cost'], 2),
                'Breakeven Tests': round(roi['breakeven_tests'], 0),
                'Deployment Ready': 'Yes' if roi['roi_percentage'] > 100 else 'Review'
            })
        
        return pd.DataFrame(report_data)

# Generate production metrics
prod_metrics = ProductionMetrics()
model_names = ['GPT-4', 'Claude 3.5 Sonnet', 'CodeLlama-34B', 'Gemini Pro 1.5']
deployment_report = prod_metrics.generate_deployment_report(model_names)

print("\n=== Production Deployment ROI Analysis ===")
print(deployment_report.to_string(index=False))
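
Because the report above draws its inputs at random, a fixed worked example makes the ROI arithmetic easier to check. The sketch restates the formula from `calculate_roi` as a standalone function with the same assumed rates (75 USD per hour, 1000 USD per production bug, 0.5 hours per manual test). The function name `roi_percentage` and the input figures are invented for illustration.

```python
def roi_percentage(tests_generated: int, bugs_found: int, time_saved_hours: float,
                   implementation_cost: float, hourly_rate: float = 75.0,
                   cost_per_bug: float = 1000.0, hours_per_test: float = 0.5) -> float:
    """ROI in percent: (total benefit - cost) / cost * 100."""
    if implementation_cost <= 0:
        raise ValueError("implementation_cost must be positive")
    total_benefit = (
        time_saved_hours * hourly_rate                      # engineer time saved
        + tests_generated * hours_per_test * hourly_rate    # manual test writing avoided
        + bugs_found * cost_per_bug                         # production bugs avoided
    )
    return (total_benefit - implementation_cost) / implementation_cost * 100


# Invented inputs: 200 tests, 20 bugs, 100 hours saved, 10,000 USD cost
# Benefit = 100*75 + 200*0.5*75 + 20*1000 = 7,500 + 7,500 + 20,000 = 35,000 USD
roi = roi_percentage(tests_generated=200, bugs_found=20,
                     time_saved_hours=100, implementation_cost=10_000)
print(f"ROI: {roi:.1f}%")  # (35,000 - 10,000) / 10,000 * 100
```

With this formula an ROI of 0% is break-even, and 100% means the benefit is twice the cost.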


### 6.2 Cost-Benefit Analysis


In [None]:
# Visualize ROI comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# ROI comparison
ax1 = axes[0]
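colors = ['green' if x > 100 else 'orange' for x in deployment_report['ROI (%)']]
# pandas applies a color list per column, so draw with ax.bar to color each bar individually
ax1.bar(deployment_report['Model'], deployment_report['ROI (%)'], color=colors)
# 100% ROI is the 'Deployment Ready' cutoff used in the report; break-even is 0%
ax1.axhline(y=100, color='red', linestyle='--', label='Deployment-ready threshold (100% ROI)')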
ax1.set_title('Return on Investment by Model', fontsize=14, fontweight='bold')
ax1.set_ylabel('ROI (%)')
ax1.legend()
ax1.grid(axis='y', alpha=0.3)
ax1.tick_params(axis='x', rotation=45)

# Cost vs Benefit
ax2 = axes[1]
x = np.arange(len(deployment_report))
width = 0.35
ax2.bar(x - width/2, deployment_report['Implementation Cost ($)'], width, label='Cost', color='coral')
ax2.bar(x + width/2, deployment_report['Total Benefit ($)'], width, label='Benefit', color='mediumseagreen')
ax2.set_xlabel('Model')
ax2.set_ylabel('Amount ($)')
ax2.set_title('Cost vs Benefit Analysis', fontsize=14, fontweight='bold')
ax2.set_xticks(x)
ax2.set_xticklabels(deployment_report['Model'], rotation=45, ha='right')
ax2.legend()
ax2.grid(axis='y', alpha=0.3)

plt.tight_layout()
print("\nROI analysis visualization generated")
plt.show()


## 7. Evaluation Best Practices

### 7.1 Recommended Evaluation Pipeline

```
1. Define Testing Context
   ├── Identify primary use cases (unit, integration, e2e)
   ├── Determine performance requirements
   └── Set quality thresholds

2. Create Benchmark Dataset
   ├── Collect representative code samples
   ├── Include edge cases and adversarial examples
   └── Validate with domain experts

3. Run Comprehensive Evaluation
   ├── Test generation quality
   ├── Bug detection accuracy
   ├── Code understanding
   ├── Reasoning capabilities
   └── Performance metrics

4. Analyze Results
   ├── Compare against baselines
   ├── Identify strengths and weaknesses
   └── Calculate ROI

5. Production Validation
   ├── Pilot deployment
   ├── Monitor real-world performance
   └── Iterate based on feedback
```
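
Steps 1 and 4 of the pipeline (set quality thresholds, then compare results against them) can be sketched as a simple gate. The threshold values, metric names, and the `check_thresholds` helper are assumptions chosen for illustration.

```python
from typing import Dict, List


def check_thresholds(metrics: Dict[str, float], minimums: Dict[str, float],
                     maximums: Dict[str, float]) -> List[str]:
    """Return a list of human-readable threshold violations (empty if all pass)."""
    failures = []
    for name, floor in minimums.items():
        if metrics[name] < floor:
            failures.append(f"{name}={metrics[name]} is below minimum {floor}")
    for name, ceiling in maximums.items():
        if metrics[name] > ceiling:
            failures.append(f"{name}={metrics[name]} is above maximum {ceiling}")
    return failures


# Hypothetical thresholds set in step 1
minimums = {"bug_detection_f1": 0.80, "test_generation_score": 0.85}
maximums = {"avg_latency_ms": 1500}

# Hypothetical measured results from step 3
candidate = {"bug_detection_f1": 0.78, "test_generation_score": 0.90, "avg_latency_ms": 1200}

failures = check_thresholds(candidate, minimums, maximums)
print("PASS" if not failures else "FAIL: " + "; ".join(failures))
```

Fixing the thresholds before running the evaluation keeps the pass or fail decision from drifting toward whichever model happens to score best.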


### 7.2 Key Evaluation Considerations


In [None]:
evaluation_checklist = {
    'Technical Factors': [
        'Accuracy on domain-specific test cases',
        'Performance at scale (latency, throughput)',
        'Context window size for large codebases',
        'Token efficiency and cost per test',
        'Multi-language support quality',
        'Integration with existing tools'
    ],
    'Practical Factors': [
        'Developer experience and learning curve',
        'Maintenance overhead',
        'False positive management',
        'Customization capabilities',
        'Vendor lock-in risks',
        'Community and support'
    ],
    'Business Factors': [
        'Implementation costs',
        'Licensing and pricing model',
        'Return on investment timeline',
        'Security and compliance requirements',
        'Scalability for team growth',
        'Long-term viability'
    ]
}

print("\n=== Model Evaluation Checklist ===")
for category, factors in evaluation_checklist.items():
    print(f"\n{category}:")
    for i, factor in enumerate(factors, 1):
        print(f"  {i}. {factor}")


## 8. Conclusion and Recommendations

### Key Findings

The framework and sample data above point to the following considerations when selecting AI models for software testing:

1. **No One-Size-Fits-All Solution**: Different models excel in different areas. Match model capabilities to your specific testing needs.

2. **Context Window Matters**: For large codebases, models with large context windows (100K+ tokens) can take in more of the surrounding code in a single prompt, which reduces the need to split or summarize files.

3. **Balance Speed and Quality**: Faster models may be suitable for initial test generation, while more sophisticated models excel at complex bug detection.

4. **ROI Must Be Measured**: The ROI figures in Section 6 come from simulated inputs. Replace them with measured values from a pilot before estimating a payback period.

5. **Continuous Evaluation**: Model capabilities evolve rapidly. Establish regular re-evaluation cycles.

### Recommended Model Selection by Use Case

| Use Case | Recommended Models | Rationale |
|----------|-------------------|----------|
| **Unit Test Generation** | Claude 3.5 Sonnet, GPT-4 | High code understanding, excellent structure |
| **Bug Detection** | Claude 3.5 Sonnet, GPT-4 | Strong reasoning, low false positives |
| **Security Testing** | GPT-4, Gemini Pro 1.5 | Specialized vulnerability knowledge |
| **Performance Testing** | CodeLlama-34B, GPT-3.5 Turbo | Fast generation for load scenarios |
| **Legacy Code Analysis** | Gemini Pro 1.5 | Massive context window for old codebases |

### Next Steps

1. Adapt this notebook to your specific codebase and testing needs
2. Run evaluations with your actual code samples
3. Pilot deployment with highest-scoring model
4. Monitor production metrics and iterate
5. Share findings with the community to advance the field



In [None]:
# Export results for further analysis
comparison_df.to_csv('model_evaluation_results.csv', index=False)
deployment_report.to_csv('deployment_roi_analysis.csv', index=False)

print("\n=== Evaluation Complete ===")
print("Results exported:")
print("  - model_evaluation_results.csv")
print("  - deployment_roi_analysis.csv")
print("\nReady for production decision-making!")
