Evaluating RL Models for Mortal Kombat II

This guide explains how to thoroughly evaluate trained reinforcement learning models using the Kane vs Abel framework.

Why Evaluate?

Proper evaluation helps:

  • Determine if a model is ready for deployment
  • Compare different training approaches
  • Identify specific weaknesses or failure modes
  • Track improvements during fine-tuning
  • Understand generalization to new opponents/scenarios

Using the Evaluation Script

The Kane vs Abel framework provides a comprehensive evaluation script (test.py) with multiple evaluation modes.

Basic Usage

python test.py --model_path models/my_model.zip --model_type DUELINGDDQN --num_episodes 10

Command Line Options

  Option               Description
  --model_path         Path to the saved model file
  --model_type         Model type (DQN, DDQN, DUELINGDDQN, PPO)
  --game               Name of the ROM/game (default: MortalKombatII-Genesis)
  --state              Game state to load for evaluation
  --states             Comma-separated list of states to evaluate on
  --individual_eval    If set, evaluate each state individually
  --render_mode        Render mode (human, rgb_array, none)
  --num_stack          Number of frames to stack (default: 4)
  --num_skip           Number of frames to skip (default: 4)
  --num_episodes       Number of episodes for evaluation (default: 10)

Evaluation Modes

Single State Evaluation

Evaluate on a single, specific game state:

python test.py --model_path models/kane/my_model.zip --model_type DUELINGDDQN \
    --state "Level1.LiuKangVsJax" --render_mode human

This mode is useful for:

  • Visually inspecting model behavior against a specific opponent
  • Focused testing of challenging scenarios
  • Direct comparison between models on a standard benchmark

Multiple States Evaluation (Combined)

Evaluate across multiple states with random sampling:

python test.py --model_path models/kane/my_model.zip --model_type DUELINGDDQN \
    --states "VeryEasy.LiuKang-04,VeryEasy.LiuKang-05,VeryEasy.LiuKang-06"

This mode is useful for:

  • Testing overall model robustness
  • Getting an aggregate performance metric
  • Simulating varied opponent encounters

Multiple States Evaluation (Individual)

Evaluate each state separately with individual results:

python test.py --model_path models/kane/my_model.zip --model_type DUELINGDDQN \
    --states "VeryEasy.LiuKang-04,VeryEasy.LiuKang-05" --individual_eval

This mode is useful for:

  • Identifying specific strengths/weaknesses against different opponents
  • Granular performance analysis
  • Detecting overfitting to specific scenarios

Understanding Evaluation Results

The evaluation script generates CSV files with detailed metrics:

Output Format

episode,reward,won
1,253.0,True
2,187.5,True
3,-52.5,False
...
average,178.2,
std,97.5,

Result Analysis

Key metrics to examine:

  1. Average Reward: Higher is better, but context matters:
     • 200+: Excellent performance (usually wins consistently)
     • 100-200: Good performance (wins most matches)
     • 0-100: Mediocre performance (inconsistent results)
     • <0: Poor performance (usually loses)

  2. Win Rate: Percentage of episodes won
     • Primary indicator of agent effectiveness
     • Consider alongside average reward (some wins might be barely scraped)

  3. Standard Deviation: Indicates consistency:
     • Low std dev + high average: Consistent strong performance
     • High std dev: Inconsistent (sometimes great, sometimes terrible)

  4. Per-State Performance: For individual evaluations
     • Identifies matchup-specific strengths/weaknesses
     • Helps target fine-tuning efforts
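
To turn one of these CSV files into the summary numbers above, a small helper like the following can be used. This is a minimal sketch that assumes the CSV layout shown under Output Format; the file name results.csv is only illustrative.

import csv

def summarize(path):
    """Recompute average reward, reward std dev, and win rate from an
    episode-level results CSV (columns: episode, reward, won)."""
    rewards, wins = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if not row["episode"].isdigit():
                continue  # skip the appended average/std summary rows
            rewards.append(float(row["reward"]))
            wins.append(row["won"] == "True")
    avg = sum(rewards) / len(rewards)
    std = (sum((r - avg) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return avg, std, sum(wins) / len(wins)

avg, std, win_rate = summarize("results.csv")  # illustrative file name
print(f"avg reward: {avg:.1f}  std dev: {std:.1f}  win rate: {win_rate:.0%}")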

Evaluation Strategies

Visualization

Use the --render_mode human option to visually observe agent behavior:

python test.py --model_path models/my_model.zip --model_type DUELINGDDQN \
    --state "Level1.LiuKangVsJax" --render_mode human --num_episodes 3

This helps identify:

  • Action patterns and strategies
  • Positioning and spacing behavior
  • Defensive reactions and counter-attacks
  • Obvious mistakes or sub-optimal behaviors

Cross-Model Comparison

To compare multiple models:

  1. Evaluate each model on the same set of states:

     python test.py --model_path models/model_A.zip --model_type DUELINGDDQN \
         --states "state1,state2,state3" --individual_eval

     python test.py --model_path models/model_B.zip --model_type DDQN \
         --states "state1,state2,state3" --individual_eval

  2. Compare results using the generated CSV files

  3. Consider implementing an automated comparison script for large-scale evaluations
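
If you want to automate step 3, a minimal comparison sketch is shown below. The per-model result directories and CSV naming are assumptions for illustration, not the layout test.py is guaranteed to produce.

import csv
import glob

def average_reward(path):
    """Mean per-episode reward from one results CSV (skips summary rows)."""
    with open(path, newline="") as f:
        rewards = [float(r["reward"]) for r in csv.DictReader(f)
                   if r["episode"].isdigit()]
    return sum(rewards) / len(rewards)

# Assumed layout: one directory of per-state CSVs for each evaluated model.
for model_dir in ("results/model_A", "results/model_B"):
    for path in sorted(glob.glob(f"{model_dir}/*.csv")):
        print(f"{model_dir}  {path}  avg reward: {average_reward(path):.1f}")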

Stress Testing

Test resilience by evaluating on particularly challenging scenarios:

python test.py --model_path models/my_model.zip --model_type DUELINGDDQN \
    --states "Hard.LiuKang-01,VeryHard.LiuKang-02" --individual_eval

Ensemble Evaluation

For critical applications, evaluate across a broad ensemble of states with many episodes each (30+) so the results are statistically meaningful:

python test.py --model_path models/my_model.zip --model_type DUELINGDDQN \
    --states "State1,State2,State3,State4,State5" --individual_eval --num_episodes 30
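
With 30+ episodes per state you can also attach a rough confidence interval to the win rate. The sketch below uses the normal approximation to the binomial and assumes episodes are independent:

import math

def win_rate_ci(wins, episodes, z=1.96):
    """Approximate 95% confidence interval for a win rate
    (normal approximation to the binomial)."""
    p = wins / episodes
    half_width = z * math.sqrt(p * (1 - p) / episodes)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Example: 24 wins out of 30 episodes on one state
low, high = win_rate_ci(24, 30)
print(f"win rate 80%, 95% CI roughly [{low:.0%}, {high:.0%}]")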

Customizing the Evaluation Framework

Custom Metrics

You can extend the evaluate_agent function in test.py to track additional metrics:

def evaluate_agent(model, env, num_episodes=10):
    # ...existing code...

    additional_metrics = {
        'combos_performed': [],
        'avg_reaction_time': [],
        'defensive_actions': []
    }

    # ...track these during evaluation loop...

    return avg_reward, std_reward, episode_rewards, episode_wins, additional_metrics
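
A fuller sketch of how the loop might track one such metric is shown below. It assumes a Gymnasium-style step/reset API, a Stable-Baselines3-style model.predict, and a win signal derived from the final reward; the DEFENSIVE_ACTIONS IDs are purely illustrative.

import numpy as np

DEFENSIVE_ACTIONS = {3, 7}  # illustrative action IDs for block/back-step

def evaluate_agent_with_metrics(model, env, num_episodes=10):
    """Evaluation loop that also counts defensive actions per episode."""
    episode_rewards, episode_wins, defensive_counts = [], [], []
    for _ in range(num_episodes):
        obs, _ = env.reset()
        done, total_reward, defensive = False, 0.0, 0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            if int(action) in DEFENSIVE_ACTIONS:
                defensive += 1
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            done = terminated or truncated
        episode_rewards.append(total_reward)
        episode_wins.append(total_reward > 0)  # assumed win signal
        defensive_counts.append(defensive)
    return (np.mean(episode_rewards), np.std(episode_rewards),
            episode_rewards, episode_wins,
            {'defensive_actions': defensive_counts})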

Best Practices for Evaluation

  1. Statistical Significance: Always evaluate on enough episodes (10 minimum, 30+ preferred)
  2. Diverse Scenarios: Test on states both seen and unseen during training
  3. Controlled Comparisons: Use identical seeds when comparing models (see the seeding sketch after this list)
  4. Regular Benchmarking: Establish standard evaluation scenarios for ongoing development
  5. Version Control: Track evaluation results alongside model versions
  6. Reproducibility: Document exact evaluation parameters
  7. Reality Check: Supplement metrics with visual inspection
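
For the controlled-comparison point (item 3), a typical way to pin seeds in a custom evaluation loop is sketched below; whether test.py itself exposes a seed option is not covered here, so treat this as illustrative.

import random

import numpy as np
import torch

def seed_everything(env, seed=42):
    """Seed Python, NumPy, PyTorch, and the environment so that two
    models are evaluated under identical conditions."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    env.reset(seed=seed)        # Gymnasium-style environment seeding
    env.action_space.seed(seed)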

Troubleshooting Evaluation

Common issues and solutions:

  • Inconsistent Results: Increase the number of evaluation episodes
  • Model Loading Errors: Verify that --model_type matches the architecture the model was trained with
  • Environment Errors: Check that the specified game states exist for your ROM (see the state-listing snippet after this list)
  • Performance Gaps: Compare against baseline models or human performance
  • Resource Usage: For large evaluations, disable rendering (--render_mode none) or reduce render quality
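
For environment errors specifically, if the framework sits on top of gym-retro/stable-retro (suggested by the MortalKombatII-Genesis ROM name, but an assumption here), you can list the states that are actually installed and compare them against what you pass to --state/--states:

import retro

# Print every state available for the game so you can spot typos in
# the names passed via --state/--states.
for state in retro.data.list_states("MortalKombatII-Genesis"):
    print(state)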