Mario Game with DQN

Double-Dueling-DQN with NoisyNet and N-Step Replay

OVERVIEW

This project implements a Reinforcement Learning (RL) agent for Super Mario Bros using advanced DQN variants to improve training stability and performance.

The main techniques used are:

  • Double DQN: Helps mitigate overestimation of action values.
  • Dueling DQN: Separates state-value and advantage estimation for better policy learning.
  • Noisy Networks (NoisyNet): Replaces ε-greedy exploration with parameterized, learnable noise.
  • N-Step Replay Buffer: Stabilizes learning by bootstrapping from multi-step returns (combined with the Double DQN target in the sketch after this list).
  • Curriculum Learning: Structured training phases to prevent premature local optima.
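
A minimal PyTorch sketch of how the Double DQN target and the n-step return fit together is shown below. The names (online_net, target_net, the batch keys) and the tensor layout are assumptions for illustration, not the project's actual code.

    # Sketch: n-step Double DQN target (illustrative names, not project code).
    import torch

    GAMMA = 0.99   # discount factor
    N_STEP = 5     # multi-step return length

    def double_dqn_nstep_target(online_net, target_net, batch):
        """batch holds 'next_state' (B, 4, 84, 84), 'reward' (B,) as the
        already-summed discounted n-step reward, and 'done' (B,) in {0, 1}."""
        with torch.no_grad():
            # Double DQN: the online network selects the next action...
            next_actions = online_net(batch["next_state"]).argmax(dim=1, keepdim=True)
            # ...and the target network evaluates it.
            next_q = target_net(batch["next_state"]).gather(1, next_actions).squeeze(1)
            # Bootstrap n steps ahead, zeroed out at episode termination.
            return batch["reward"] + (GAMMA ** N_STEP) * next_q * (1.0 - batch["done"])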

DEMO VIDEO (DQN Variants)

VANILLA DQN COMPARISON

This video shows the performance of a vanilla DQN agent after 7,000 episodes of training. Compared with the agent trained with the DQN variants above, the difference in stability and performance is clear.

KEY FEATURES

  • Frame Skipping and Max-Pooling to reduce computation and extract motion (a wrapper sketch follows this list).
  • Frame Stacking (4 frames) for temporal information.
  • Randomized No-Op Start for better generalization.
  • Life-Based Early Episode Termination to handle “death” scenarios.
  • Adaptive Exploration via Noisy Linear Layers.
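
As an illustration of the frame-skipping and max-pooling step, here is a sketch of a wrapper in the classic Gym API (step() returning obs, reward, done, info); the class name and default skip count are assumptions, not the project's exact wrapper.

    # Sketch: frame skip with max-pooling over the last two raw frames.
    import numpy as np
    import gym

    class SkipAndMaxFrame(gym.Wrapper):
        def __init__(self, env, skip=4):
            super().__init__(env)
            self._skip = skip

        def step(self, action):
            frames, total_reward, done, info = [], 0.0, False, {}
            for _ in range(self._skip):
                obs, reward, done, info = self.env.step(action)
                frames.append(obs)
                total_reward += reward
                if done:
                    break
            # Pixel-wise max over the last two frames suppresses sprite
            # flicker and preserves some motion information.
            obs = np.max(np.stack(frames[-2:]), axis=0)
            return obs, total_reward, done, info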

TRAINING STRATEGY

To keep the agent from settling into local optima early on, training is divided into three conceptual phases:

Phase   | Episodes                       | Purpose
Phase 1 | 300 episodes                   | Randomized levels (random 1-1 to 1-4) to encourage broad exploration.
Phase 2 | Planned 500 episodes per level | Sequential level training (1-1, 1-2, 1-3, 1-4); skipped during actual training.
Phase 3 | Planned 5,000 episodes         | End-to-end full game training.

Although the code plans for about 5,000 episodes, 2,000 episodes proved sufficient for strong performance in practice, so training was stopped early.
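
The phase schedule above can be expressed as a simple per-episode level picker. The sketch below uses the phase boundaries and level names from the table and is illustrative only, not the project's actual scheduling code.

    # Sketch: three-phase curriculum as a per-episode stage picker.
    import random

    LEVELS = ["1-1", "1-2", "1-3", "1-4"]

    def pick_stage(episode: int) -> str:
        if episode < 300:
            # Phase 1: random stage each episode for broad exploration.
            return random.choice(LEVELS)
        elif episode < 300 + 4 * 500:
            # Phase 2 (planned, skipped in practice): 500 episodes per stage, in order.
            return LEVELS[(episode - 300) // 500]
        else:
            # Phase 3: end-to-end play of the full game, starting from 1-1.
            return "1-1"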

AGENT ARCHITECTURE

  • Input: Stack of 4 grayscale frames (84×84 pixels).
  • Convolutional Backbone: Two convolutional layers (DeepMind DQN style).
  • Head: Shared fully connected NoisyLinear layer, split into two branches (sketched after this list):
    • One estimating State Value.
    • One estimating Action Advantages.
  • Output: Q-values for each action in the COMPLEX_MOVEMENT action space.
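
A compact PyTorch sketch of the described network is given below. Layer widths, the NoisyLinear initialisation constants, and other details are assumptions for illustration; the project's real model may differ.

    # Sketch: dueling DQN with a factorised-Gaussian NoisyLinear head.
    import math
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NoisyLinear(nn.Module):
        """NoisyNet-style linear layer with learnable noise scales."""
        def __init__(self, in_features, out_features, sigma0=0.5):
            super().__init__()
            self.in_features, self.out_features = in_features, out_features
            self.mu_w = nn.Parameter(torch.empty(out_features, in_features))
            self.sigma_w = nn.Parameter(torch.empty(out_features, in_features))
            self.mu_b = nn.Parameter(torch.empty(out_features))
            self.sigma_b = nn.Parameter(torch.empty(out_features))
            bound = 1.0 / math.sqrt(in_features)
            nn.init.uniform_(self.mu_w, -bound, bound)
            nn.init.uniform_(self.mu_b, -bound, bound)
            nn.init.constant_(self.sigma_w, sigma0 * bound)
            nn.init.constant_(self.sigma_b, sigma0 * bound)

        @staticmethod
        def _scale(x):
            return x.sign() * x.abs().sqrt()

        def forward(self, x):
            # Fresh factorised noise every forward pass drives exploration.
            eps_in = self._scale(torch.randn(self.in_features, device=x.device))
            eps_out = self._scale(torch.randn(self.out_features, device=x.device))
            weight = self.mu_w + self.sigma_w * torch.outer(eps_out, eps_in)
            bias = self.mu_b + self.sigma_b * eps_out
            return F.linear(x, weight, bias)

    class DuelingNoisyDQN(nn.Module):
        def __init__(self, n_actions, in_channels=4):
            super().__init__()
            # Two-layer convolutional backbone (early DeepMind DQN style).
            self.conv = nn.Sequential(
                nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            )
            conv_out = 32 * 9 * 9  # for 84x84 inputs with the strides above
            self.shared = NoisyLinear(conv_out, 256)
            self.value = NoisyLinear(256, 1)              # state-value branch
            self.advantage = NoisyLinear(256, n_actions)  # advantage branch

        def forward(self, x):
            h = self.conv(x).flatten(start_dim=1)
            h = F.relu(self.shared(h))
            v, a = self.value(h), self.advantage(h)
            # Dueling combination: Q = V + (A - mean(A)).
            return v + a - a.mean(dim=1, keepdim=True)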

HYPERPARAMETERS

Parameter                       | Value
Batch Size                      | 512
Replay Buffer Capacity          | 200,000 transitions
Learning Rate                   | 1e-4
Discount Factor (Gamma)         | 0.99
Target Network Update Frequency | 5,000 steps
Max Gradient Norm               | 5.0
N-Step Return                   | 5 agent action blocks (≈ 20 frames)
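
To make the n-step entry concrete, the sketch below accumulates 5-step transitions with gamma = 0.99 before they would be stored in the replay buffer. The deque-based structure and names are assumptions; a full implementation would also flush the remaining tail transitions at episode end.

    # Sketch: collapsing the last n transitions into one replay entry.
    from collections import deque

    GAMMA, N_STEP = 0.99, 5

    class NStepAccumulator:
        """Turns raw transitions into (s, a, R_n, s_n, done) n-step entries."""
        def __init__(self):
            self.buffer = deque(maxlen=N_STEP)

        def push(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))
            if len(self.buffer) < N_STEP and not done:
                return None  # not enough steps accumulated yet
            # Discounted sum of the buffered rewards (buffer only ever holds
            # transitions from the current episode; see clear() below).
            ret = 0.0
            for _, _, r, _, _ in reversed(self.buffer):
                ret = r + GAMMA * ret
            first_state, first_action = self.buffer[0][0], self.buffer[0][1]
            last_next, last_done = self.buffer[-1][3], self.buffer[-1][4]
            if done:
                self.buffer.clear()
            return (first_state, first_action, ret, last_next, last_done)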

CODE

Full implementation, README, and training details are on GitHub.
