Exploring AI Reinforcement Learning with OpenAI's Gym Toolkit
What is Reinforcement Learning?
Reinforcement learning is akin to guiding a system to learn through experimentation. Picture teaching a dog new tricks: rewarding the animal for correct actions while disregarding incorrect ones. Eventually, the dog discerns which behaviors yield rewards and which do not.
In this learning paradigm, a model explores various actions to identify those that result in the most favorable outcomes. The goal is to train the model to make informed decisions to achieve specific objectives across diverse scenarios.
This method is distinct from other machine learning approaches, which typically rely on preprocessed historical data for training. Instead, reinforcement learning thrives on interactive learning, where an agent modifies its behavior based on the feedback received from its actions, aiming to maximize overall rewards.
The potential for reinforcement learning in finance, especially in trading, is substantial. Here are several ways it can be applied:
- Algorithmic Trading: By employing reinforcement learning algorithms, traders can formulate strategies that adapt to fluctuating market conditions. These algorithms learn from both historical data and live market signals to make educated decisions regarding asset transactions.
- Risk Management: These algorithms can enhance risk management by optimizing portfolio distributions and controlling exposure to different assets. By analyzing volatility, correlations, and market dynamics, reinforcement learning aids traders in risk mitigation while striving for better returns.
- Market Prediction: Techniques from reinforcement learning can be utilized to project market trends and foresee price changes. By sifting through vast amounts of financial data to uncover patterns, these algorithms provide valuable insights into future market movements, assisting traders in their investment choices.
- High-Frequency Trading: In environments where rapid execution is paramount, reinforcement learning can optimize trading strategies to seize transient opportunities. By swiftly responding to market fluctuations, these algorithms can exploit minor price variances for profit.
Reinforcement learning presents a robust framework for crafting intelligent and adaptive trading systems, empowering traders to better navigate intricate market conditions and fulfill their investment goals.
This was my initial experience with reinforcement learning, and it took me several weeks to grasp its mechanics and create something functional in Python. I discovered an impressive Python library by OpenAI, known as Gym, which greatly simplified the process and enhanced my understanding. If I made any errors or could have approached something differently, please share your feedback in the comments.
What is OpenAI’s Gym Library?
OpenAI’s Gym is an open-source library tailored for developing and evaluating reinforcement learning algorithms. It offers a variety of environments for testing and benchmarking, ranging from simple grid-based setups to intricate physics-based simulations.
The Gym library provides a standardized interface to interact with these environments, facilitating experimentation with various reinforcement learning algorithms and performance comparisons. It includes environments with both discrete and continuous action spaces and supports episodic and continuous tasks.
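To get a feel for that standardized interface, here is a minimal sketch (assuming the library is installed, which is covered in a later section) that simply creates an environment and inspects its action and observation spaces:
import gym

# Create an environment and inspect its spaces
env = gym.make("CartPole-v1")
print(env.action_space)       # Discrete(2): push the cart left or right
print(env.observation_space)  # Box with 4 continuous values: cart position/velocity, pole angle/velocity
env.close()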
OpenAI’s Gym encompasses several pre-built environments across diverse categories. Here are a few examples:
- Classic Control: Features straightforward control tasks like CartPole, where the objective is to maintain the balance of a pole on a cart by moving left or right.
- Atari: Presents classic Atari video games such as Pong, Breakout, and Space Invaders, where agents learn to play by processing pixel inputs.
- Box2D: Utilizes physics-based simulations employing the Box2D engine, including tasks like LunarLander, where the goal is to land a spacecraft safely on the moon.
- MuJoCo: Leverages the MuJoCo physics engine for more advanced control challenges, such as directing a simulated humanoid robot.
- Toy Text: Provides simple text-based environments like FrozenLake, where agents navigate a grid to reach a goal while avoiding pitfalls.
For my reinforcement learning experiment in Gym, I needed to create a custom environment, as none of the pre-existing ones fit my requirements. I’ll demonstrate how to do this shortly. If you decide to try it out, I strongly suggest designing your own environment. It clarifies the process significantly and offers greater flexibility.
In my case, I intended to build a trading environment using readily available data from EODHD APIs. I believe a trading example is relatable and easy to visualize. I appreciate EODHD APIs for their user-friendly endpoints, substantial data per request—which is ideal for model training—and their extensive market data. I’ve been subscribed to their services for years and highly recommend them.
Introducing Gym
You can install the Python “gym” library via PIP. It’s advisable to do this within a virtual environment to avoid potential issues.
rl % python3 -m venv venv
rl % source venv/bin/activate
(venv) rl % python3 -m pip install --upgrade pip
Requirement already satisfied: pip in ./venv/lib/python3.11/site-packages (23.0.1)
Collecting pip
Using cached pip-24.0-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 23.0.1
Uninstalling pip-23.0.1:
Successfully uninstalled pip-23.0.1
Successfully installed pip-24.0
(venv) rl % python3 -m pip install gym
Collecting gym
Using cached gym-0.26.2-py3-none-any.whl
Collecting numpy>=1.18.0 (from gym)
Downloading numpy-1.26.4-cp311-cp311-macosx_10_9_x86_64.whl.metadata (61 kB)
...
Successfully installed cloudpickle-3.0.0 gym-0.26.2 gym-notices-0.0.8 numpy-1.26.4
A “Classic Control” environment can be instantiated as follows:
import gym # Import the Gym library
# Create the environment
env = gym.make("CartPole-v1") # Selecting the CartPole-v1 environment
# Reset the environment to its initial state and get the first observation
observation = env.reset()
# Execute the simulation for a defined number of steps
for t in range(100):
# Optionally render the environment for visualization
env.render()
# Randomly select an action from the action space
action = env.action_space.sample() # Random action selection
# Apply the action to the environment and receive the subsequent observation, reward, and termination status
observation, reward, done, info = env.step(action)
# Output details of the current step
print("Step:", t)
print("Action:", action)
print("Observation:", observation)
print("Reward:", reward)
print("Done:", done)
print("Info:", info)
# Check if the episode has concluded
if done:
print("Episode finished after {} timesteps".format(t + 1))
break
# Close the environment
env.close()
In this instance, the following occurs:
- The “Classic Control” environment “CartPole-v1” is established.
- The environment is reset to obtain the initial observation.
- A loop runs for 100 steps within this episode.
- Optionally, rendering occurs during the process.
- In this example, the selected action is random, which serves merely as a demonstration. In a trading context, actions could be represented numerically: 0 for hold, 1 for buy, and 2 for sell.
A potential trading scenario could be structured as follows:
obs = env.reset()
done = False

while not done:
    # Current prices and moving averages
    current_price = env.df.loc[env.current_step, "close"]
    short_ma = env.df.loc[env.current_step, "sma50"]
    long_ma = env.df.loc[env.current_step, "sma200"]

    # Action determination based on moving average crossover strategy
    if short_ma > long_ma and env.position == 0:  # Golden cross - Buy signal
        action = 1
    elif short_ma < long_ma and env.position == 1:  # Death cross - Sell signal
        action = 2
    else:
        action = 0  # Hold

    obs, reward, done, info = env.step(action)
    env.render()
You may have already identified a limitation: you dictate the action at each step. While this approach functions, it relies solely on rudimentary technical analysis. A key aspect missing here is incorporating the reward from successful trades into the learning loop, allowing the model to improve. It took me some time to figure this out, but I’ll elaborate on it later. As this topic is intricate, I aim to build your understanding gradually.
- Returning to the CartPole example: the action is fed into the episode step, which returns the outcome of that action as the observation, the reward (positive or negative), the done status (True if the episode has finished, False if still in progress), and optional debug information.
- Process the steps and then close the environment.
You will notice methods such as “reset,” “render,” and “close” within the environment. To create your own environment, implementing these methods is essential.
Here’s how it can be done:
class MyEnv:
    def __init__(self, input1, input2):
        self.input1 = input1
        self.input2 = input2

    def reset(self):
        return None

    def step(self, action):
        reward = 1
        done = True
        return "current_state", reward, done, {}

    def render(self):
        print("something useful")

    def close(self):
        pass
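To show how Gym-style code would drive a class like this, here is a small hypothetical usage sketch; a real agent would choose the action rather than hard-coding it:
# Hypothetical usage of the skeleton class above
env = MyEnv("input1", "input2")
state = env.reset()

done = False
while not done:
    action = 0  # A real agent would pick this based on the state
    state, reward, done, info = env.step(action)  # This skeleton always returns done=True
    env.render()

env.close()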
Creating Your Classes
As mentioned earlier, when a reinforcement learning action is completed, and a reward is issued, it needs to be reintroduced into the process for improvement. You will need to develop a class for this purpose. My implementation is as follows.
components/QLearningAgent.py
This is a straightforward reinforcement learning agent utilizing Q-learning. Envision the agent as a navigator within a maze. Initially unaware of the optimal route to the reward, it must explore and learn from its encounters. The QLearningAgent class maintains a table (the Q-table) where each row signifies a state (a maze position), and each column signifies an action (like moving in a particular direction).
The agent adopts an "epsilon-greedy" strategy when selecting actions. Occasionally, it chooses a random action to explore new pathways, while other times it capitalizes on prior experiences to select the most advantageous action. After each move, it updates its Q-table, refining its understanding of which actions yield better rewards. Gradually, the agent reduces random exploration in favor of relying on accumulated knowledge to consistently choose the most rewarding actions. Essentially, the agent learns through trial and error, progressively discovering improved strategies to attain maximum rewards.
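Concretely, the table update described above is the standard Q-learning rule, which the update_policy method below implements:
new_value = (1 - learning_rate) * old_value + learning_rate * (reward + discount_factor * max(Q[next_state]))
Here old_value is the current Q-table entry for the state/action pair, and the discounted future term is zeroed out when the episode is done.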
import numpy as np
class QLearningAgent:
def __init__(self, n_actions, state_dim, learning_rate=0.01, discount_factor=0.99, exploration_rate=1.0, max_exploration_rate=1.0, min_exploration_rate=0.01, exploration_decay_rate=0.001):
self.n_actions = n_actions
self.state_dim = state_dim
self.learning_rate = learning_rate
self.discount_factor = discount_factor
self.exploration_rate = exploration_rate
self.max_exploration_rate = max_exploration_rate
self.min_exploration_rate = min_exploration_rate
self.exploration_decay_rate = exploration_decay_rate
self.q_table = np.zeros((state_dim, n_actions))
    def choose_action(self, state):
        if np.random.rand() < self.exploration_rate:
            action = np.random.randint(self.n_actions)
        else:
            action = np.argmax(self.q_table[state])
        print(f"Choosing action: {action} for state: {state}")  # Debug statement
        return action
def update_policy(self, state, action, reward, next_state, done):
old_value = self.q_table[state, action]
next_max = np.max(self.q_table[next_state])
new_value = (1 - self.learning_rate) * old_value + self.learning_rate * (reward + self.discount_factor * next_max * (not done))
self.q_table[state, action] = new_value
if self.exploration_rate > self.min_exploration_rate:
self.exploration_rate -= self.exploration_decay_rate
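As a quick, standalone sanity check (with made-up values, not part of the training script later on), the agent can be exercised like this:
# Standalone sanity check with made-up values
agent = QLearningAgent(n_actions=3, state_dim=3)

state = 0                            # One of the discrete states the environment reports
action = agent.choose_action(state)  # Random at first, since exploration_rate starts at 1.0

# Pretend the action earned a reward of 1.0 and led to state 2
agent.update_policy(state, action, reward=1.0, next_state=2, done=False)
print(agent.q_table)                 # The (state, action) entry is now slightly positive (0.01)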
components/TradingEnv.py
I developed this environment class for my trading example, designed to be adaptable for various use cases. The intention is to illustrate the concept and provide a foundation for further exploration. I find trading scenarios relatable and comprehensible for many.
Key points to note include:
self.action_space = spaces.Discrete(3) # 0=hold, 1=buy, 2=sell
The action space delineates the permissible actions, which are assigned numerical values: 0 for hold, 1 for buy, and 2 for sell.
self.observation_space = spaces.Box(low=0, high=1, shape=(len(df.columns),), dtype=np.float32)
This line is dynamic, so you shouldn’t need to modify it, but it’s crucial that the observation space matches the dimensions of your data. I used “shape=(len(df.columns),)” to achieve this.
self.scaler = MinMaxScaler()
self.df_scaled = self.scaler.fit_transform(self.df[['open', 'high', 'low', 'close', 'adjusted_close', 'volume']])
As with many data science tasks, scaling the data is typically necessary. In this instance, I opted for a MinMaxScaler to normalize all data between 0 and 1. Caution is advised with alternatives like StandardScaler, which standardizes each column to zero mean and unit variance and therefore produces negative values. I’m uncertain whether this would affect results, but debugging could become challenging if negative prices emerge.
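As a quick illustration with made-up numbers, each column is independently rescaled to the 0–1 range:
# Quick illustration of MinMaxScaler with made-up numbers
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

prices = pd.DataFrame({"close": [100.0, 105.0, 110.0], "volume": [1000, 3000, 2000]})
print(MinMaxScaler().fit_transform(prices))
# [[0.  0. ]
#  [0.5 1. ]
#  [1.  0.5]]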
The remaining code should be self-explanatory, accompanied by comments for clarification. Should you have questions, feel free to ask in the comments, and I’ll do my best to assist.
import gym
from gym import spaces
import numpy as np
from sklearn.preprocessing import MinMaxScaler
class TradingEnv(gym.Env):
metadata = {"render.modes": ["human"]}
def __init__(self, df, initial_balance=10000):
super(TradingEnv, self).__init__()
self.df = df
self.initial_balance = initial_balance
self.action_space = spaces.Discrete(3) # 0=hold, 1=buy, 2=sell
self.observation_space = spaces.Box(
low=0, high=1, shape=(len(df.columns),), dtype=np.float32)
self.scaler = MinMaxScaler()
self.df_scaled = self.scaler.fit_transform(
self.df[["open", "high", "low", "close", "adjusted_close", "volume"]])
self.df["short_ma"] = self.df["adjusted_close"].rolling(window=50).mean()
self.df["long_ma"] = self.df["adjusted_close"].rolling(window=200).mean()
self.reset()
def reset(self):
self.balance = self.initial_balance
self.position = 0
self.open_position_price = 0
self.current_step = 0
self.trade_open = False
self.trade_summary = {}
return self.get_discrete_state()
    def _next_observation(self):
        return self.df_scaled[self.current_step]

    def get_discrete_state(self):
        current_price = self.df.loc[self.current_step, "adjusted_close"]
        short_ma = self.df.loc[self.current_step, "short_ma"]
        long_ma = self.df.loc[self.current_step, "long_ma"]
        if current_price > short_ma > long_ma:
            return 0  # Bullish signal
        elif current_price < short_ma < long_ma:
            return 1  # Bearish signal
        else:
            return 2  # Neutral

    def step(self, action):
        done = False
        self.current_step += 1
        if self.current_step >= len(self.df) - 1:
            done = True
        current_price = self.df.loc[self.current_step, "adjusted_close"]
reward = 0
trade_info = "hold"
if action == 1 and self.position == 0: # Buy
self.position = 1
self.open_position_price = current_price
trade_info = "buy"
self.trade_summary = {
"open_price": current_price,
"open_step": self.current_step,
}
elif action == 2 and self.position == 1: # Sell
profit = current_price - self.open_position_price
reward = profit - abs(profit) * 0.01
self.balance += profit
self.position = 0
trade_info = "sell"
self.trade_summary.update(
{
"close_price": current_price,
"close_step": self.current_step,
"profit": reward, # Here, reward includes the fee
}
)
unrealized_profit = (
current_price - self.open_position_price if self.position else 0)
self.info = {
"trade": trade_info,
"open_position_price": self.open_position_price if self.position else None,
"current_price": current_price,
"unrealised_profit": unrealized_profit,
}
next_state = self.get_discrete_state()
return next_state, reward, done, self.info
def render(self, mode="human", close=False):
trade_status = "open" if self.position else "closed"
current_price = self.df.loc[self.current_step, "adjusted_close"]
        if self.position:
            unrealized_profit = current_price - self.open_position_price
        else:
            unrealized_profit = 0

        # General information about the current step
print(
f"Step: {self.current_step}, Balance: {self.balance:.2f}, "
f"Open Trade: {trade_status}, Action: {self.info['trade']}, "
f"Current Price: {current_price:.2f}, "
f"Unrealised Profit: {unrealized_profit:.2f}"
)
# Detailed trade summary when a position is closed
if "profit" in self.trade_summary and not self.position:
print(
f'Trade Summary - Open Price: {self.trade_summary["open_price"]:.2f}, '
f'Close Price: {self.trade_summary["close_price"]:.2f}, '
f'Profit: {self.trade_summary["profit"]:.2f}, Steps Held: '
f'{self.trade_summary["close_step"] - self.trade_summary["open_step"]}'
)
The final segment encompasses the training code.
train.py
import sys
import warnings
import pandas as pd
import numpy as np
from eodhd import APIClient
from components import TradingEnv, QLearningAgent
import config as cfg
api = APIClient(cfg.API_KEY)
def get_ohlc_data():
    df = api.get_historical_data("GSPC.INDX", "d", results=1825)  # Roughly 5 years of daily data
# Remove features we don't need
df.drop(columns=["symbol", "interval"], inplace=True)
# Reset index
df.reset_index(drop=True, inplace=True)
return df
if __name__ == "__main__":
df = get_ohlc_data()
# df.to_csv("data/ohlc_data.csv", index=True)
# df = pd.read_csv("data/ohlc_data.csv", index_col=0)
env = TradingEnv(df)
    state_dim = 3  # Three discrete market states (bullish, bearish, neutral)
n_actions = env.action_space.n
agent = QLearningAgent(n_actions, state_dim)
n_episodes = 100 # Run for a defined number of episodes
max_steps_per_episode = len(df) # Limit the number of steps per episode if necessary
for episode in range(n_episodes):
state = env.reset()
done = False
total_reward = 0
steps = 0
while not done and steps < max_steps_per_episode:
action = agent.choose_action(state)
next_state, reward, done, info = env.step(action)
agent.update_policy(state, action, reward, next_state, done)
state = next_state
total_reward += reward
steps += 1
env.render() # This needs to be called here
print(f"Episode: {episode}, Total reward: {total_reward:.2f}, Final balance: {env.balance:.2f}, Exploration rate: {agent.exploration_rate:.4f}, Steps: {steps}")
Most of this should be self-explanatory, and I’ve added comments for clarity.
Episodes:
- An episode represents a single attempt where the agent interacts with the environment, starting from a defined point toward a specific goal.
- Throughout the episode, the agent executes a series of actions aimed at maximizing rewards.
- Once the goal is achieved or the maximum number of steps is reached (known as “max steps per episode”), the episode concludes, leading to the initiation of a new one.
Max Steps per Episode:
- This parameter defines the upper limit of actions the agent can perform before the episode concludes.
- If the agent achieves its goal sooner, the episode ends prematurely. Otherwise, it concludes after reaching the predefined limit set by “max steps per episode.”
- In my instance, I set the maximum to match the number of days/rows in the trading data.
Conclusion
I incorporated print statements to illustrate the training process, showing when trades are executed, the unrealized profits of open positions, and sell actions.
The output might resemble this:
Episode: 999, Total reward: 1763.29, Final balance: 11808.21, Exploration rate: 0.0100, Steps: 1257
At a high level, my trial commenced with £10,000 and concluded with £11,808.21, which is encouraging.
The total reward calculated indicates that successful trades yield positive rewards, while unsuccessful ones result in negative rewards. In this example, the overall reward is 1763.29.
I hope this provides insight into potential applications for various use cases.
Thank you for reading this article! If you found it engaging and informative, please consider following me and signing up for email notifications.
> If you liked this article, I recommend checking out EODHD APIs on Medium. They have some intriguing articles.
Michael Whittle
- If you enjoyed this, please follow me on Medium
- For more interesting articles, please follow my publication
- Interested in collaborating? Let’s connect on LinkedIn
- Support me and other Medium writers by signing up here
- Please don’t forget to clap for the article :) Thank you!