Blank RL Agent Template

Reinforcement learning tutorials often jump straight to neural networks, replay buffers, and libraries. That can hide something simpler and more useful first: the shape of an agent.

This page is about that shape—before PyTorch, before Gymnasium, before performance tuning.

The loop to keep in mind:

Observe → Decide → Act → Store → Learn → Repeat

This is scaffolding: an abstract frame for behavior, memory, and later learning—not a scoring system.


Why start with a blank agent?

Starting “blank” means you separate roles from implementations:

  • What does it mean to choose an action?

  • Where does experience live?

  • Where would learning plug in?

If you begin with a full DQN stack, those questions are buried under tensor shapes and optimizers. A minimal class makes the boundaries visible. You can swap the policy from a random stub to a heuristic to a neural net without rewriting the outer loop.

Performance is not the goal here; legibility is.


The core loop

In code, interaction with an environment usually looks like this pattern:

state = environment.reset()
done = False
while not done:
    action = agent.act(state)
    next_state, reward, done = environment.step(action)
    agent.remember((state, action, reward, next_state, done))
    agent.learn()   # possibly a no-op until you add learning
    state = next_state

Whether or not learn updates anything yet, the slots exist: act, remember, learn. That mirrors how real RL systems are organized, regardless of which algorithm fills the slots on day one.
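To watch the loop run without any external library, here is a minimal sketch built on a made-up environment. CoinFlipEnv, StubAgent, and run_episode are hypothetical names invented for this example, and the (next_state, reward, done) return shape follows the pattern above rather than any particular library's API.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin for a fixed number of steps."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0  # the state is just the step counter

    def step(self, action):
        self.steps += 1
        coin = random.randint(0, 1)
        reward = 1 if action == coin else 0
        done = self.steps >= self.max_steps
        return self.steps, reward, done


class StubAgent:
    """Same three slots as the loop above: act, remember, learn."""

    def __init__(self):
        self.memory = []

    def act(self, state):
        return random.randint(0, 1)

    def remember(self, experience):
        self.memory.append(experience)

    def learn(self):
        pass  # no-op until an update rule is added


def run_episode(env, agent):
    """Run one episode and return how many transitions the agent has stored."""
    state = env.reset()
    done = False
    while not done:
        action = agent.act(state)
        next_state, reward, done = env.step(action)
        agent.remember((state, action, reward, next_state, done))
        agent.learn()
        state = next_state
    return len(agent.memory)
```

Calling run_episode(CoinFlipEnv(), StubAgent()) exercises every slot once per step, which is enough to confirm the wiring before any learning exists.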


The minimal class

Below is an intentionally small skeleton: memory list, epsilon for future exploration wiring, stubs for policy and learning.

The names are illustrative; rename them to fit your codebase.

import random


class Agent:
    """
    Minimal RL agent skeleton.

    The agent:
    1. observes the world
    2. decides an action
    3. stores experience
    4. learns from outcomes
    """

    def __init__(self):
        self.memory = []
        self.epsilon = 1.0
        self.gamma = 0.99

    def act(self, state):
        if random.random() < self.epsilon:
            return self.random_action()

        return self.policy(state)

    def policy(self, state):
        return 0

    def remember(self, experience):
        self.memory.append(experience)

    def learn(self):
        pass

    def random_action(self):
        # Assumes two discrete actions, 0 and 1; adjust for your environment.
        return random.randint(0, 1)

The methods in brief:
  • act — chooses between exploration-style randomness and policy.

  • policy — placeholder: always returns 0 until you replace it.

  • remember — appends transitions; format is yours to define.

  • learn — empty until you add an update rule.

This is not PyTorch RL yet. It is the scaffold where PyTorch would later attach.
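As a rough illustration of swapping the policy slot without touching act, the sketch below subclasses a condensed copy of the skeleton. ThresholdAgent and its rule are hypothetical, chosen only to make the example concrete, and the scalar states are an assumption of this sketch.

```python
import random


class Agent:
    """Condensed copy of the skeleton above, repeated so this example runs alone."""

    def __init__(self):
        self.memory = []
        self.epsilon = 1.0

    def act(self, state):
        if random.random() < self.epsilon:
            return self.random_action()
        return self.policy(state)

    def policy(self, state):
        return 0

    def random_action(self):
        return random.randint(0, 1)


class ThresholdAgent(Agent):
    """Hypothetical subclass: only the policy slot changes."""

    def policy(self, state):
        # Arbitrary illustrative rule: action 1 for positive states, else 0.
        return 1 if state > 0 else 0


agent = ThresholdAgent()
agent.epsilon = 0.0  # turn exploration off so act() always defers to policy()
choices = [agent.act(s) for s in (-2.0, 0.5, 3.0)]
```

The outer loop that calls act never needs to know which policy is installed; that is the boundary the skeleton is meant to expose.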


First runnable system

The blank class becomes more useful once it is connected to an environment and a training loop. In one sentence each:

  • Agent — chooses actions.

  • Environment — responds to actions with new observations, rewards, and done flags.

  • Training loop — repeats the interaction and stores experience so learning can attach later.

Together, that is a minimal RL system: real execution, even before meaningful learning.

Suggested layout

A small project can look like this:

rl_project/
├── agent.py
├── train.py
├── model.py
└── README.md

What each file holds:
  • agent.py — behavior: exploration, acting, remembering, and (later) learning.

  • train.py — the loop: reset, step, accumulate reward, repeat.

  • model.py — empty or optional at first; reserved for when you plug in a network.

  • README.md — how to install dependencies and run train.py.

Gymnasium environment

Gymnasium provides standard environments. CartPole is a common first choice: a pole on a cart that the agent balances by pushing left or right.

import gymnasium as gym

env = gym.make("CartPole-v1")

CartPole gives the agent:

  • a state (observation vector),

  • two discrete actions (0 pushes the cart left, 1 pushes it right),

  • a reward of +1 for every step taken,

  • termination when the pole falls or the cart goes out of bounds (and truncation when a time limit is hit).

agent.py — simple agent

This version is intentionally small: epsilon for future exploration, memory for transitions, and a stub learn. For CartPole-v1, valid actions are 0 and 1.

import random


class Agent:
    def __init__(self):
        self.epsilon = 1.0
        self.memory = []

    def act(self, state):
        if random.random() < self.epsilon:
            return random.randint(0, 1)

        return 0

    def remember(self, experience):
        self.memory.append(experience)

    def learn(self):
        pass

Think of it as a baseline organism: it can act and store experience, but it does not improve yet. With epsilon fixed at 1.0, every action is random; the return 0 branch is a placeholder that only runs once epsilon drops below 1.

train.py — training loop

Gymnasium’s step returns terminated and truncated separately. For the loop, treat the episode as finished when either is true:

from agent import Agent
import gymnasium as gym


env = gym.make("CartPole-v1")
agent = Agent()

episodes = 100

for episode in range(episodes):
    state, _ = env.reset()
    done = False
    total_reward = 0

    while not done:
        action = agent.act(state)

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        agent.remember((state, action, reward, next_state, done))

        state = next_state
        total_reward += reward

    print(f"Episode {episode}: {total_reward}")

What the loop is doing:

  1. The agent chooses an action from the current state.

  2. The environment returns the next state, reward, and whether the run ended.

  3. The transition is stored in memory.

  4. State moves forward; total reward tracks how well the run went.

  5. After each full run, you print a score line for inspection.

What this does not do yet

This is real reinforcement learning execution (agent, environment, transitions, rewards), but not meaningful learning yet:

  • learn() is still empty.

  • The policy is pure random exploration while epsilon stays at 1.0; nothing updates from data.

  • The point is to make the execution loop visible and debuggable before you add update rules.

Learning comes in the next layers—tabular or neural—using the same slots.

Upgrade path

Rough order many projects follow:

  1. Add a simple reward-based or hand-tuned rule (optional bridge before full RL).

  2. Add Q-learning (tabular) if the state space allows.

  3. Add a PyTorch model for Q-values or policy.

  4. Add experience replay and larger batches.

  5. Add target networks and other stabilizers as needed.

  6. Later: multi-agent or more complex environments.

Next step with neural nets: PyTorch DQN Agent Walkthrough


Evolving learn()

The template begins with structure only:

def learn(self):
    pass

  • No learning: no weights or tables are updated.

  • No feedback applied yet: nothing changes in response to rewards.

  • Just structure: a named place where a behavior-update rule will live.

Everything above still assumed an observe → act → remember loop. Once that loop exists, learn() answers one question:

Given what happened before, how should I change behavior so the next rollout looks better?

Stages below are pedagogical—they are not a claim that each snippet is optimal or production-ready. They show how the same hook fills in from simple bookkeeping to approximate dynamic programming (tabular Q) to gradient-based nets.


Stage 1 — Remember what worked

This is not deep learning. It is a simple association:

state + action → accumulated reward.

Add bookkeeping on the agent, for example in __init__:

self.score_table = {}

Then a first-cut learn might sum the reward each (state, action) pair has collected:

def learn(self):
    """
    Stage 1: basic reinforcement.

    No neural network yet.
    The agent only records which actions received reward.
    """

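    # Unpack only the first three fields so both (state, action, reward)
    # and full (state, action, reward, next_state, done) tuples work.
    for state, action, reward, *_ in self.memory:
        key = (tuple(state), action)
        self.score_table[key] = self.score_table.get(key, 0) + reward

    # Clear processed transitions so a later call does not count them twice.
    self.memory.clear()

If the agent took action a in state s and saw reward r, that pair’s score goes up by r. This is a rough record of “what worked,” with no notion of future return yet.

(The loop reads only the first three fields of each transition, so it accepts the full (s, a, r, s', done) tuples stored by train.py. Because memory is cleared after each pass, keep a separate copy if another component needs the raw transitions.)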
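Once scores exist, a policy can read them back. The following is a minimal sketch, assuming a score table keyed by (tuple(state), action) as in Stage 1; best_action is a hypothetical helper name, and the default of two actions is an assumption carried over from CartPole.

```python
def best_action(score_table, state, actions=(0, 1)):
    """Return the action with the highest accumulated reward for this state."""
    s = tuple(state)
    # Unseen (state, action) pairs count as 0; ties go to the first action listed.
    return max(actions, key=lambda a: score_table.get((s, a), 0))


score_table = {((0, 1), 0): 2.0, ((0, 1), 1): 5.0}
seen = best_action(score_table, [0, 1])    # a state with recorded scores
unseen = best_action(score_table, [9, 9])  # a state the table has never seen
```

A helper like this could serve as the body of policy(state), which keeps act unchanged while the decision starts to depend on experience.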


Stage 2 — Estimate future reward

Here the idea is closer to Q-learning: immediate reward counts, but expected future reward counts too—you care what happens after the next transition.

Represent per-state action values in a table. On the agent, you might define:

self.q_table = {}
self.alpha = 0.1
self.gamma = 0.99

A teaching-style incremental update:

def learn(self):
    """
    Stage 2: tabular Q-learning style update.
    """

    for state, action, reward, next_state, done in self.memory:
        s = tuple(state)
        s_next = tuple(next_state)

        if s not in self.q_table:
            self.q_table[s] = [0, 0]

        if s_next not in self.q_table:
            self.q_table[s_next] = [0, 0]

        current_q = self.q_table[s][action]
        max_future_q = max(self.q_table[s_next])

        target = reward
        if not done:
            target += self.gamma * max_future_q

        self.q_table[s][action] += self.alpha * (target - current_q)

The agent compares the current estimate (current_q) with a one-step bootstrap target (target), then moves the estimate a fraction alpha toward that target—the familiar “prediction error” pattern, implemented on a discrete table rather than neural weights.
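A single update worked through with numbers may help. The values of current_q, reward, and max_future_q below are made up for illustration; alpha and gamma match the values set on the agent above.

```python
alpha, gamma = 0.1, 0.99   # same values as on the agent above
current_q = 0.0            # illustrative starting estimate
reward = 1.0               # illustrative immediate reward
max_future_q = 2.0         # illustrative best value in the next state

# One-step bootstrap target: 1.0 + 0.99 * 2.0, about 2.98.
target = reward + gamma * max_future_q

# Move the estimate a fraction alpha toward the target: about 0.298.
new_q = current_q + alpha * (target - current_q)
```

The estimate moves only a tenth of the way toward the target, which is why repeated visits are needed before table entries settle.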

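Tables blow up as the number of distinct states grows. With continuous observations such as CartPole’s, almost every tuple(state) is unique, so entries are rarely revisited unless you discretize the state first. Still, the mental model transfers.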
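One common workaround for continuous observations is to bucket each feature before using it as a table key. The sketch below is an assumption-laden example, not part of the original template: discretize is a hypothetical helper, and the bounds and bin count are illustrative values that would need tuning for a real environment.

```python
def discretize(state, low, high, bins=6):
    """Map each continuous feature to an integer bucket in [0, bins - 1]."""
    key = []
    for value, lo, hi in zip(state, low, high):
        clipped = min(max(value, lo), hi)        # clamp out-of-range values
        fraction = (clipped - lo) / (hi - lo)    # position within [0, 1]
        key.append(min(int(fraction * bins), bins - 1))
    return tuple(key)


# Illustrative bounds only; tune them for the environment you actually use.
low = [-2.4, -3.0, -0.21, -3.0]
high = [2.4, 3.0, 0.21, 3.0]

a = discretize([0.01, 0.5, 0.02, -0.3], low, high)
b = discretize([0.02, 0.6, 0.03, -0.2], low, high)
```

Here a and b land in the same bucket tuple, so two nearby observations share one table row. Using discretize(state, low, high) in place of tuple(state) inside learn is the only change the tabular code would need.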


Stage 3 — Train a neural network

Instead of a row for every state–action pair, a network approximates values or policies over many states at once—same feedback idea, differentiable parameters.

Representative wiring on the agent might include:

self.model = self.build_model()
self.optimizer = torch.optim.Adam(self.model.parameters(), lr=0.001)
self.loss_fn = torch.nn.MSELoss()
self.gamma = 0.99

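Batch-style learn sketch. It assumes the model returns one Q-value per action, shaped (batch_size, n_actions); align build_model, tensor shapes, and batch construction with your environment. Assume import numpy as np, import torch, and a build_model you define elsewhere.

def learn(self, batch):
    states, actions, rewards, next_states, dones = batch

    # Stack into arrays first; building tensors from lists of arrays is slow.
    states = torch.as_tensor(np.asarray(states), dtype=torch.float32)
    next_states = torch.as_tensor(np.asarray(next_states), dtype=torch.float32)
    actions = torch.as_tensor(np.asarray(actions), dtype=torch.int64)
    rewards = torch.as_tensor(np.asarray(rewards), dtype=torch.float32)
    dones = torch.as_tensor(np.asarray(dones), dtype=torch.float32)

    q_values = self.model(states)
    # Pick the Q-value of the action actually taken; squeeze(1) keeps the
    # batch dimension even when the batch holds a single transition.
    q_action = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)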

    with torch.no_grad():
        next_q_values = self.model(next_states)
        max_next_q = next_q_values.max(1)[0]

    target = rewards + self.gamma * max_next_q * (1 - dones)

    loss = self.loss_fn(q_action, target)

    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()

Conceptually:

predict → compare targets → backward pass → optimizer step — prediction error drives weight change. That loop is what bridges into the fuller PyTorch DQN Agent Walkthrough.


What learn() becomes

Stage   What learn() does
1       Remembers associations between (state, action) and summed reward
2       Updates a tabular estimate using immediate reward plus discounted future
3       Trains neural parameters with a loss comparing predictions to bootstrap targets

learn() is the behavior-update mechanism sitting inside the broader feedback loop: experience in, revised policy or value estimates out.

For a full minimal neural Agent and learn() in code, open PyTorch DQN Agent Walkthrough—the same link appears again at the end of this page.


Random actions validate the loop

Early on, policy might stay trivial while you verify the environment loop: resets, steps, terminal conditions, and what a “state” actually is.

Random or fixed actions prove wiring before intelligence.
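One way to make that validation concrete is to check the stored transitions after a few random episodes. This is a sketch under assumptions: check_transitions is a hypothetical helper, and it expects the five-field (state, action, reward, next_state, done) tuples used by train.py, stored in the order they occurred.

```python
def check_transitions(memory, n_actions=2):
    """Raise AssertionError if stored transitions look malformed."""
    for i, (state, action, reward, next_state, done) in enumerate(memory):
        assert action in range(n_actions), f"bad action at {i}: {action}"
        assert len(state) == len(next_state), f"state size changed at {i}"
    # Within one episode, each next_state should be the following state.
    for prev, cur in zip(memory, memory[1:]):
        if not prev[4]:
            assert tuple(prev[3]) == tuple(cur[0]), "next_state/state mismatch"
    return len(memory)
```

Running it on agent.memory after training with random actions catches wiring mistakes, such as forgetting state = next_state, before any learning code depends on the data.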


Hand-written rules before gradients

Replace the stub policy with hand-written rules: if state looks like this, return that. No gradients—just explicit behavior.

That clarifies what “better” might mean before the model learns it for you.
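For CartPole, one hand-written rule that is easy to reason about is to push the cart toward the side the pole is leaning. The sketch below assumes the CartPole observation layout of cart position, cart velocity, pole angle, and pole angular velocity; lean_policy is a hypothetical name, and the rule is a simple heuristic rather than a tuned controller.

```python
def lean_policy(state):
    """Hand-written rule: push the cart toward the side the pole is leaning."""
    # Observation layout assumed: [position, velocity, angle, angular velocity].
    pole_angle = state[2]
    return 1 if pole_angle > 0 else 0  # 1 pushes right, 0 pushes left
```

Dropping this in as policy(state) and lowering epsilon gives a non-random baseline to compare later learned policies against.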


When learn uses memory directly

When memory holds enough structured transitions, learn can:

  • sample or iterate over experiences,

  • compute targets (e.g., reward + discounted future value),

  • update parameters—first maybe tabular or simple statistics, later a network.

The interface does not change; only the body of learn does. The Evolving learn() section above walks concrete versions of those bodies step by step.
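The first two bullets above, sampling and computing targets, can be sketched in plain Python. sample_batch and one_step_targets are hypothetical helper names; the sketch assumes a non-empty memory of five-field transitions and takes the next-state value estimates as an input rather than computing them.

```python
import random


def sample_batch(memory, batch_size):
    """Draw transitions at random and regroup them into per-field tuples."""
    batch = random.sample(memory, min(batch_size, len(memory)))
    states, actions, rewards, next_states, dones = zip(*batch)
    return states, actions, rewards, next_states, dones


def one_step_targets(rewards, next_values, dones, gamma=0.99):
    """reward + gamma * next value, with the future term dropped at episode end."""
    return [
        r if done else r + gamma * v
        for r, v, done in zip(rewards, next_values, dones)
    ]


memory = [
    ((0.0,), 0, 1.0, (0.1,), False),
    ((0.1,), 1, 1.0, (0.2,), False),
    ((0.2,), 0, 1.0, (0.3,), True),
]
states, actions, rewards, next_states, dones = sample_batch(memory, batch_size=2)
targets = one_step_targets([1.0, 1.0], next_values=[2.0, 2.0], dones=[False, True])
```

The per-field tuples returned by sample_batch match the batch layout that a batch-style learn unpacks, so the same helper can feed a tabular or a neural update.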


Neural network hooks

Swap policy and learn to use a small network: state in, action values or logits out, loss and optimizer inside learn.

You have now connected the same class outline to differentiable programming—see PyTorch DQN Agent Walkthrough.


Toward a full RL stack

Further work adds what research and production stacks need: stable targets, replay buffers with capacity, batched GPU updates, proper evaluation, seeds, logging, and environments from a standard library (e.g., Gymnasium).

Those are incremental upgrades to the skeleton—not a different mental model.
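As an example of how small one of those upgrades can be, a replay buffer with capacity is a thin wrapper around a bounded queue. ReplayBuffer here is a hypothetical class for illustration, built on collections.deque from the standard library.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity memory: the oldest transitions are dropped when full."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Never ask for more items than are stored.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add((step, 0, 1.0, step + 1, False))
```

After five additions the buffer holds only the three most recent transitions. Replacing the plain list in the agent with a buffer like this leaves remember and learn with the same roles.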


What this teaches

  • Separation of data flow (remember) from decision (act) from optimization (learn).

  • Exploration (epsilon) as a first-class knob, even before neural nets.

  • A path from understandable stubs to sophisticated implementations without throwing the structure away.
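To make the epsilon knob concrete, a common pattern is to start fully random and decay toward a small floor. The sketch below is one possible schedule, not the only one; decay_epsilon, the rate of 0.995, and the floor of 0.05 are illustrative assumptions.

```python
def decay_epsilon(epsilon, rate=0.995, floor=0.05):
    """Shrink epsilon by a constant factor, never going below the floor."""
    return max(floor, epsilon * rate)


epsilon = 1.0
for episode in range(1000):
    epsilon = decay_epsilon(epsilon)
```

Calling this once per episode and assigning the result to agent.epsilon gradually shifts act from random_action toward policy without changing either method.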


Next page

After the minimal class, First runnable system, and Evolving learn(), continue to:

PyTorch DQN Agent Walkthrough

That page shows how the same shape becomes a neural RL agent with a non-empty learn().