Mario Game with DQN
Double-Dueling-DQN with NoisyNet and N-Step Replay
OVERVIEW
This project implements a Reinforcement Learning (RL) agent for Super Mario Bros using advanced DQN variants to improve training stability and performance.
The main techniques used are:
- Double DQN: Mitigates overestimation of action values by decoupling action selection from evaluation (see the target sketch after this list).
- Dueling DQN: Separates state-value and advantage estimation for better policy learning.
- Noisy Networks (NoisyNet): Replaces ε-greedy exploration with parameterized, learnable noise.
- N-Step Replay Buffer: Stabilizes learning by bootstrapping from multi-step returns.
- Curriculum Learning: Structured training phases to prevent premature local optima.
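As a concrete illustration of how Double DQN and n-step returns fit together, here is a minimal PyTorch sketch of the target computation; the function and argument names are illustrative, not taken from this repository.

```python
import torch

# Minimal sketch (not the repository's actual code): Double DQN target
# with an n-step bootstrap. `rewards` are assumed to be the already
# accumulated n-step returns produced by the replay buffer.
def double_dqn_target(online_net, target_net, rewards, next_states, dones,
                      gamma=0.99, n=5):
    with torch.no_grad():
        # Double DQN: the online network *selects* the next action...
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # ...while the target network *evaluates* it, curbing overestimation.
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        # Bootstrapping happens n steps ahead, so the discount is gamma ** n.
        return rewards + (gamma ** n) * next_q * (1.0 - dones.float())
```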
DEMO VIDEO (DQN Variants)
VANILLA DQN COMPARISON
This video shows the performance of a vanilla DQN agent after 7,000 episodes of training. Compared with the DQN variants above, the difference in stability and performance is clear.
KEY FEATURES
- Frame Skipping with Max-Pooling to reduce computation and suppress sprite flicker (see the wrapper sketch after this list).
- Frame Stacking (4 frames) for temporal information.
- Randomized No-Op Start for better generalization.
- Life-Based Early Episode Termination to handle “death” scenarios.
- Adaptive Exploration via Noisy Linear Layers.
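As an illustration of the first feature, here is a minimal Gym-style wrapper that repeats each action and max-pools the last two raw frames; it assumes the classic 4-tuple `step` API used by `gym-super-mario-bros`, and the class name and defaults are illustrative.

```python
import gym
import numpy as np

class SkipAndMaxPool(gym.Wrapper):
    """Illustrative wrapper (not the repository's actual class): repeat each
    agent action for `skip` raw frames and max-pool the last two frames so
    sprites that flicker on alternate frames stay visible."""
    def __init__(self, env, skip=4):
        super().__init__(env)
        self._skip = skip

    def step(self, action):
        total_reward, done, info = 0.0, False, {}
        frames = []
        for _ in range(self._skip):
            obs, reward, done, info = self.env.step(action)
            frames.append(obs)
            total_reward += reward
            if done:
                break
        # Pixel-wise max over the last two raw frames.
        obs = np.max(np.stack(frames[-2:]), axis=0)
        return obs, total_reward, done, info
```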
TRAINING STRATEGY
To keep the agent from falling into local optima early on, training is divided into three conceptual phases:
| Phase | Episodes | Purpose |
|---|---|---|
| Phase 1 | 300 episodes | Randomized levels (random 1-1 to 1-4) to encourage broad exploration. |
| Phase 2 | Planned 500 episodes per level | Sequential level training (1-1, 1-2, 1-3, 1-4) — skipped during actual training. |
| Phase 3 | Planned 5000 episodes | End-to-end full game training. |
Although the code budgets roughly 5,000 episodes for Phase 3, about 2,000 episodes were enough to reach strong performance, and training was stopped early.
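A minimal sketch of what this schedule could look like in code; the stage ids are real `gym_super_mario_bros` environment ids, but the function and its thresholds are illustrative and simply mirror the table (Phase 2 is skipped, as it was in the actual run).

```python
import random

# Illustrative phase schedule mirroring the table above (Phase 2 skipped).
STAGES = ["SuperMarioBros-1-1-v0", "SuperMarioBros-1-2-v0",
          "SuperMarioBros-1-3-v0", "SuperMarioBros-1-4-v0"]

def pick_env_id(episode):
    if episode < 300:                  # Phase 1: random stages, broad exploration
        return random.choice(STAGES)
    return "SuperMarioBros-v0"         # Phase 3: end-to-end full game
```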
AGENT ARCHITECTURE
- Input: Stack of 4 grayscale frames (84×84 pixels).
- Convolutional Backbone: Two convolutional layers (DeepMind DQN style).
- Head: Shared fully connected NoisyLinear layer, split into two branches:
  - One estimating the State Value.
  - One estimating the Action Advantages.
- Output: Q-values for each action in the COMPLEX_MOVEMENT action space (see the sketch below).
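Below is a sketch of this architecture in PyTorch. The `NoisyLinear` layer follows the factorised-Gaussian formulation of Fortunato et al. (2018); the 512-unit shared layer, input normalization, and other sizes are illustrative assumptions, not copied from the repository.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Factorised-Gaussian noisy layer, sketched from the NoisyNet paper."""
    def __init__(self, in_f, out_f, sigma0=0.5):
        super().__init__()
        self.in_f, self.out_f = in_f, out_f
        self.w_mu = nn.Parameter(torch.empty(out_f, in_f))
        self.w_sigma = nn.Parameter(torch.empty(out_f, in_f))
        self.b_mu = nn.Parameter(torch.empty(out_f))
        self.b_sigma = nn.Parameter(torch.empty(out_f))
        bound = 1 / math.sqrt(in_f)
        for mu in (self.w_mu, self.b_mu):
            nn.init.uniform_(mu, -bound, bound)
        for sigma in (self.w_sigma, self.b_sigma):
            nn.init.constant_(sigma, sigma0 / math.sqrt(in_f))

    @staticmethod
    def _f(x):  # signed-sqrt noise transform from the paper
        return x.sign() * x.abs().sqrt()

    def forward(self, x):
        eps_in = self._f(torch.randn(self.in_f, device=x.device))
        eps_out = self._f(torch.randn(self.out_f, device=x.device))
        w = self.w_mu + self.w_sigma * eps_out.outer(eps_in)
        b = self.b_mu + self.b_sigma * eps_out
        return F.linear(x, w, b)

class DuelingDQN(nn.Module):
    def __init__(self, n_actions):
        super().__init__()
        self.backbone = nn.Sequential(  # two conv layers, as described above
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened feature size with a dummy 4x84x84 input.
        feat = self.backbone(torch.zeros(1, 4, 84, 84)).shape[1]
        self.shared = NoisyLinear(feat, 512)
        self.value = NoisyLinear(512, 1)              # state-value branch
        self.advantage = NoisyLinear(512, n_actions)  # advantage branch

    def forward(self, x):
        h = F.relu(self.shared(self.backbone(x / 255.0)))  # assumes uint8 pixels
        v, a = self.value(h), self.advantage(h)
        # Dueling combination: Q = V + (A - mean(A)).
        return v + a - a.mean(dim=1, keepdim=True)
```

Subtracting the mean advantage in the last line is the standard identifiability trick from the Dueling DQN paper; with COMPLEX_MOVEMENT (12 actions) the network would be built as `DuelingDQN(n_actions=12)`.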
HYPERPARAMETERS
| Parameter | Value |
|---|---|
| Batch Size | 512 |
| Replay Buffer Capacity | 200,000 transitions |
| Learning Rate | 1e-4 |
| Discount Factor (Gamma) | 0.99 |
| Target Network Update Frequency | 5,000 steps |
| Max Gradient Norm | 5.0 |
| N-Step Return | 5 agent actions (≈ 20 raw frames with 4-frame skip) |
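To show how the last three rows interact, here is a sketch of an n-step replay buffer with the table's values as defaults; the class and method names are assumptions, and flushing of the shorter-than-n tail at episode end is simplified away.

```python
import random
from collections import deque

class NStepReplayBuffer:
    """Illustrative n-step buffer (not the repository's code): a sliding
    window of n transitions is collapsed into a single
    (s_t, a_t, R_n, s_{t+n}, done) entry before storage."""
    def __init__(self, capacity=200_000, n=5, gamma=0.99):
        self.buffer = deque(maxlen=capacity)
        self.window = deque(maxlen=n)
        self.gamma = gamma

    def push(self, state, action, reward, next_state, done):
        self.window.append((state, action, reward, next_state, done))
        if len(self.window) < self.window.maxlen and not done:
            return  # keep accumulating until the window is full
        # Discounted sum over the window: R_n = sum_k gamma^k * r_{t+k}.
        ret = 0.0
        for _, _, r, _, _ in reversed(self.window):
            ret = r + self.gamma * ret
        first_state, first_action = self.window[0][0], self.window[0][1]
        self.buffer.append((first_state, first_action, ret, next_state, done))
        if done:
            # Simplification: drop the remaining sub-n windows at episode end.
            self.window.clear()

    def sample(self, batch_size=512):
        return random.sample(self.buffer, batch_size)
```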