Exploring AI Reinforcement Learning with OpenAI's Gym Toolkit
What is Reinforcement Learning?
Reinforcement learning is akin to guiding a system to learn through experimentation. Picture teaching a dog new tricks: rewarding the animal for correct actions while disregarding incorrect ones. Eventually, the dog discerns which behaviors yield rewards and which do not.
In this learning paradigm, a model explores various actions to identify those that result in the most favorable outcomes. The goal is to train the model to make informed decisions to achieve specific objectives across diverse scenarios.
This method is distinct from other machine learning approaches, which typically rely on preprocessed historical data for training. Instead, reinforcement learning thrives on interactive learning, where an agent modifies its behavior based on the feedback received from its actions, aiming to maximize overall rewards.
The potential for reinforcement learning in finance, especially in trading, is substantial. Here are several ways it can be applied:
- Algorithmic Trading: By employing reinforcement learning algorithms, traders can formulate strategies that adapt to fluctuating market conditions. These algorithms learn from both historical data and live market signals to make educated decisions regarding asset transactions.
- Risk Management: These algorithms can enhance risk management by optimizing portfolio distributions and controlling exposure to different assets. By analyzing volatility, correlations, and market dynamics, reinforcement learning aids traders in risk mitigation while striving for better returns.
- Market Prediction: Techniques from reinforcement learning can be utilized to project market trends and foresee price changes. By sifting through vast amounts of financial data to uncover patterns, these algorithms provide valuable insights into future market movements, assisting traders in their investment choices.
- High-Frequency Trading: In environments where rapid execution is paramount, reinforcement learning can optimize trading strategies to seize transient opportunities. By swiftly responding to market fluctuations, these algorithms can exploit minor price variances for profit.
Reinforcement learning presents a robust framework for crafting intelligent and adaptive trading systems, empowering traders to better navigate intricate market conditions and fulfill their investment goals.
This was my initial experience with reinforcement learning, and it took me several weeks to grasp its mechanics and create something functional in Python. I discovered an impressive Python library by OpenAI, known as Gym, which greatly simplified the process and enhanced my understanding. If I made any errors or could have approached something differently, please share your feedback in the comments.
What is OpenAI’s Gym Library?
OpenAI’s Gym is an open-source library tailored for developing and evaluating reinforcement learning algorithms. It offers a variety of environments for testing and benchmarking, ranging from simple grid-based setups to intricate physics-based simulations.
The Gym library provides a standardized interface to interact with these environments, facilitating experimentation with various reinforcement learning algorithms and performance comparisons. It includes environments with both discrete and continuous action spaces and supports episodic and continuous tasks.
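To get a feel for that standardized interface, here is a minimal sketch (assuming the library is installed, which is covered in a later section) that simply creates an environment and inspects its action and observation spaces:
import gym

# Create an environment and inspect its spaces
env = gym.make("CartPole-v1")
print(env.action_space)       # Discrete(2): push the cart left or right
print(env.observation_space)  # Box with 4 continuous values: cart position/velocity, pole angle/velocity
env.close()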
OpenAI’s Gym encompasses several pre-built environments across diverse categories. Here are a few examples:
- Classic Control: Features straightforward control tasks like CartPole, where the objective is to maintain the balance of a pole on a cart by moving left or right.
- Atari: Presents classic Atari video games such as Pong, Breakout, and Space Invaders, where agents learn to play by processing pixel inputs.
- Box2D: Utilizes physics-based simulations employing the Box2D engine, including tasks like LunarLander, where the goal is to land a spacecraft safely on the moon.
- MuJoCo: Leverages the MuJoCo physics engine for more advanced control challenges, such as directing a simulated humanoid robot.
- Toy Text: Provides simple text-based environments like FrozenLake, where agents navigate a grid to reach a goal while avoiding pitfalls.
For my reinforcement learning experiment in Gym, I needed to create a custom environment, as none of the pre-existing ones fit my requirements. I’ll demonstrate how to do this shortly. If you decide to try it out, I strongly suggest designing your own environment. It clarifies the process significantly and offers greater flexibility.
In my case, I intended to build a trading environment using readily available data from EODHD APIs. I believe a trading example is relatable and easy to visualize. I appreciate EODHD APIs for their user-friendly endpoints, substantial data per request—which is ideal for model training—and their extensive market data. I’ve been subscribed to their services for years and highly recommend them.
Introducing Gym
You can install the Python “gym” library via PIP. It’s advisable to do this within a virtual environment to avoid potential issues.
rl % python3 -m venv venv
rl % source venv/bin/activate
(venv) rl % python3 -m pip install --upgrade pip
Requirement already satisfied: pip in ./venv/lib/python3.11/site-packages (23.0.1)
Collecting pip
Using cached pip-24.0-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 23.0.1
Uninstalling pip-23.0.1:
Successfully uninstalled pip-23.0.1
Successfully installed pip-24.0
(venv) rl % python3 -m pip install gym
Collecting gym
Using cached gym-0.26.2-py3-none-any.whl
Collecting numpy>=1.18.0 (from gym)
Downloading numpy-1.26.4-cp311-cp311-macosx_10_9_x86_64.whl.metadata (61 kB)
...
Successfully installed cloudpickle-3.0.0 gym-0.26.2 gym-notices-0.0.8 numpy-1.26.4
A “Classic Control” environment can be instantiated as follows:
import gym # Import the Gym library
# Create the environment
env = gym.make("CartPole-v1") # Selecting the CartPole-v1 environment
# Reset the environment to its initial state and get the first observation
observation = env.reset()
# Execute the simulation for a defined number of steps
for t in range(100):
# Optionally render the environment for visualization
env.render()
# Randomly select an action from the action space
action = env.action_space.sample() # Random action selection
# Apply the action to the environment and receive the subsequent observation, reward, and termination status
observation, reward, done, info = env.step(action)
# Output details of the current step
print("Step:", t)
print("Action:", action)
print("Observation:", observation)
print("Reward:", reward)
print("Done:", done)
print("Info:", info)
# Check if the episode has concluded
if done:
print("Episode finished after {} timesteps".format(t + 1))
break
# Close the environment
env.close()
In this instance, the following occurs:
- The “Classic Control” environment “CartPole-v1” is established.
- The environment is reset to obtain the initial observation.
- A loop runs for 100 steps within this episode.
- Optionally, rendering occurs during the process.
- In this example, the selected action is random, which serves merely as a demonstration. In a trading context, actions could be represented numerically: 0 for hold, 1 for buy, and 2 for sell.
A potential trading scenario could be structured as follows:
obs = env.reset()
done = False

while not done:
    # Current prices and moving averages
    current_price = env.df.loc[env.current_step, "close"]
    short_ma = env.df.loc[env.current_step, "sma50"]
    long_ma = env.df.loc[env.current_step, "sma200"]

    # Action determination based on moving average crossover strategy
    if short_ma > long_ma and env.position == 0:  # Golden cross - Buy signal
        action = 1
    elif short_ma < long_ma and env.position == 1:  # Death cross - Sell signal
        action = 2
    else:
        action = 0  # Hold

    obs, reward, done, info = env.step(action)
    env.render()
You may have already identified a limitation: you dictate the action at each step. While this approach functions, it relies solely on rudimentary technical analysis. A key aspect missing here is incorporating the reward from successful trades into the learning loop, allowing the model to improve. It took me some time to figure this out, but I’ll elaborate on it later. As this topic is intricate, I aim to build your understanding gradually.
- Returning to the CartPole example: the action is fed into the episode step, which returns the outcome of that action as the observation, the reward (positive or negative), the done status (True if the episode has finished, False if still in progress), and optional debug information.
- Process the steps and then close the environment.
You will notice methods such as “reset,” “render,” and “close” within the environment. To create your own environment, implementing these methods is essential.
Here’s how it can be done:
class MyEnv:
    def __init__(self, input1, input2):
        self.input1 = input1
        self.input2 = input2

    def reset(self):
        return None

    def step(self, action):
        reward = 1
        done = True
        return "current_state", reward, done, {}

    def render(self):
        print("something useful")

    def close(self):
        pass
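To show how Gym-style code would drive a class like this, here is a small hypothetical usage sketch; a real agent would choose the action rather than hard-coding it:
# Hypothetical usage of the skeleton class above
env = MyEnv("input1", "input2")
state = env.reset()

done = False
while not done:
    action = 0  # A real agent would pick this based on the state
    state, reward, done, info = env.step(action)  # This skeleton always returns done=True
    env.render()

env.close()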
Creating Your Classes
As mentioned earlier, when a reinforcement learning action is completed, and a reward is issued, it needs to be reintroduced into the process for improvement. You will need to develop a class for this purpose. My implementation is as follows.
components/QLearningAgent.py
This is a straightforward reinforcement learning agent utilizing Q-learning. Envision the agent as a navigator within a maze. Initially unaware of the optimal route to the reward, it must explore and learn from its encounters. The QLearningAgent class maintains a table (the Q-table) where each row signifies a state (a maze position), and each column signifies an action (like moving in a particular direction).
The agent adopts an "epsilon-greedy" strategy when selecting actions. Occasionally, it chooses a random action to explore new pathways, while other times it capitalizes on prior experiences to select the most advantageous action. After each move, it updates its Q-table, refining its understanding of which actions yield better rewards. Gradually, the agent reduces random exploration in favor of relying on accumulated knowledge to consistently choose the most rewarding actions. Essentially, the agent learns through trial and error, progressively discovering improved strategies to attain maximum rewards.
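Concretely, the table update described above is the standard Q-learning rule, which the update_policy method below implements:
new_value = (1 - learning_rate) * old_value + learning_rate * (reward + discount_factor * max(Q[next_state]))
Here old_value is the current Q-table entry for the state/action pair, and the discounted future term is zeroed out when the episode is done.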
import numpy as np
class QLearningAgent:
def __init__(self, n_actions, state_dim, learning_rate=0.01, discount_factor=0.99, exploration_rate=1.0, max_exploration_rate=1.0, min_exploration_rate=0.01, exploration_decay_rate=0.001):
self.n_actions = n_actions
self.state_dim = state_dim
self.learning_rate = learning_rate
self.discount_factor = discount_factor
self.exploration_rate = exploration_rate
self.max_exploration_rate = max_exploration_rate
self.min_exploration_rate = min_exploration_rate
self.exploration_decay_rate = exploration_decay_rate
self.q_table = np.zeros((state_dim, n_actions))
    def choose_action(self, state):
        if np.random.rand() < self.exploration_rate:
            action = np.random.randint(self.n_actions)
        else:
            action = np.argmax(self.q_table[state])
        print(f"Choosing action: {action} for state: {state}")  # Debug statement
        return action
def update_policy(self, state, action, reward, next_state, done):
old_value = self.q_table[state, action]
next_max = np.max(self.q_table[next_state])
new_value = (1 - self.learning_rate) * old_value + self.learning_rate * (reward + self.discount_factor * next_max * (not done))
self.q_table[state, action] = new_value
if self.exploration_rate > self.min_exploration_rate:
self.exploration_rate -= self.exploration_decay_rate
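As a quick, standalone sanity check (with made-up values, not part of the training script later on), the agent can be exercised like this:
# Standalone sanity check with made-up values
agent = QLearningAgent(n_actions=3, state_dim=3)

state = 0                            # One of the discrete states the environment reports
action = agent.choose_action(state)  # Random at first, since exploration_rate starts at 1.0

# Pretend the action earned a reward of 1.0 and led to state 2
agent.update_policy(state, action, reward=1.0, next_state=2, done=False)
print(agent.q_table)                 # The (state, action) entry is now slightly positive (0.01)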
components/TradingEnv.py
I developed this environment class for my trading example, designed to be adaptable for various use cases. The intention is to illustrate the concept and provide a foundation for further exploration. I find trading scenarios relatable and comprehensible for many.
Key points to note include:
self.action_space = spaces.Discrete(3) # 0=hold, 1=buy, 2=sell
The action space delineates the permissible actions, which are assigned numerical values: 0 for hold, 1 for buy, and 2 for sell.
self.observation_space = spaces.Box(low=0, high=1, shape=(len(df.columns),), dtype=np.float32)
This line is dynamic, so you shouldn’t need to modify it, but it’s crucial that the observation space matches the dimensions of your data. I used “shape=(len(df.columns),)” to achieve this.
self.scaler = MinMaxScaler()
self.df_scaled = self.scaler.fit_transform(self.df[['open', 'high', 'low', 'close', 'adjusted_close', 'volume']])
As with many data science tasks, scaling the data is typically necessary. In this instance, I opted for a MinMaxScaler to normalize all data between 0 and 1. Caution is advised with alternatives like StandardScaler, which standardizes each column to zero mean and unit variance and therefore produces negative values. I’m uncertain whether this would affect results, but debugging could become challenging if negative prices emerge.
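As a quick illustration with made-up numbers, each column is independently rescaled to the 0–1 range:
# Quick illustration of MinMaxScaler with made-up numbers
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

prices = pd.DataFrame({"close": [100.0, 105.0, 110.0], "volume": [1000, 3000, 2000]})
print(MinMaxScaler().fit_transform(prices))
# [[0.  0. ]
#  [0.5 1. ]
#  [1.  0.5]]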
The remaining code should be self-explanatory, accompanied by comments for clarification. Should you have questions, feel free to ask in the comments, and I’ll do my best to assist.
import gym
from gym import spaces
import numpy as np
from sklearn.preprocessing import MinMaxScaler
class TradingEnv(gym.Env):
metadata = {"render.modes": ["human"]}
def __init__(self, df, initial_balance=10000):
super(TradingEnv, self).__init__()
self.df = df
self.initial_balance = initial_balance
self.action_space = spaces.Discrete(3) # 0=hold, 1=buy, 2=sell
self.observation_space = spaces.Box(
low=0, high=1, shape=(len(df.columns),), dtype=np.float32)
self.scaler = MinMaxScaler()
self.df_scaled = self.scaler.fit_transform(
self.df[["open", "high", "low", "close", "adjusted_close", "volume"]])
self.df["short_ma"] = self.df["adjusted_close"].rolling(window=50).mean()
self.df["long_ma"] = self.df["adjusted_close"].rolling(window=200).mean()
self.reset()
def reset(self):
self.balance = self.initial_balance
self.position = 0
self.open_position_price = 0
self.current_step = 0
self.trade_open = False
self.trade_summary = {}
return self.get_discrete_state()
    def _next_observation(self):
        return self.df_scaled[self.current_step]

    def get_discrete_state(self):
        current_price = self.df.loc[self.current_step, "adjusted_close"]
        short_ma = self.df.loc[self.current_step, "short_ma"]
        long_ma = self.df.loc[self.current_step, "long_ma"]
        if current_price > short_ma > long_ma:
            return 0  # Bullish signal
        elif current_price < short_ma < long_ma:
            return 1  # Bearish signal
        else:
            return 2  # Neutral

    def step(self, action):
        done = False
        self.current_step += 1
        if self.current_step >= len(self.df) - 1:
            done = True
        current_price = self.df.loc[self.current_step, "adjusted_close"]
reward = 0
trade_info = "hold"
if action == 1 and self.position == 0: # Buy
self.position = 1
self.open_position_price = current_price
trade_info = "buy"
self.trade_summary = {
"open_price": current_price,
"open_step": self.current_step,
}
elif action == 2 and self.position == 1: # Sell
profit = current_price - self.open_position_price
reward = profit - abs(profit) * 0.01
self.balance += profit
self.position = 0
trade_info = "sell"
self.trade_summary.update(
{
"close_price": current_price,
"close_step": self.current_step,
"profit": reward, # Here, reward includes the fee
}
)
unrealized_profit = (
current_price - self.open_position_price if self.position else 0)
self.info = {
"trade": trade_info,
"open_position_price": self.open_position_price if self.position else None,
"current_price": current_price,
"unrealised_profit": unrealized_profit,
}
next_state = self.get_discrete_state()
return next_state, reward, done, self.info
def render(self, mode="human", close=False):
trade_status = "open" if self.position else "closed"
current_price = self.df.loc[self.current_step, "adjusted_close"]
        if self.position:
            unrealized_profit = current_price - self.open_position_price
        else:
            unrealized_profit = 0

        # General information about the current step
print(
f"Step: {self.current_step}, Balance: {self.balance:.2f}, "
f"Open Trade: {trade_status}, Action: {self.info['trade']}, "
f"Current Price: {current_price:.2f}, "
f"Unrealised Profit: {unrealized_profit:.2f}"
)
# Detailed trade summary when a position is closed
if "profit" in self.trade_summary and not self.position:
print(
f'Trade Summary - Open Price: {self.trade_summary["open_price"]:.2f}, '
f'Close Price: {self.trade_summary["close_price"]:.2f}, '
f'Profit: {self.trade_summary["profit"]:.2f}, Steps Held: '
f'{self.trade_summary["close_step"] - self.trade_summary["open_step"]}'
)
The final segment encompasses the training code.
train.py
import sys
import warnings
import pandas as pd
import numpy as np
from eodhd import APIClient
from components import TradingEnv, QLearningAgent
import config as cfg
api = APIClient(cfg.API_KEY)
def get_ohlc_data():
    df = api.get_historical_data("GSPC.INDX", "d", results=1825)  # Roughly 5 years of daily data
# Remove features we don't need
df.drop(columns=["symbol", "interval"], inplace=True)
# Reset index
df.reset_index(drop=True, inplace=True)
return df
if __name__ == "__main__":
df = get_ohlc_data()
# df.to_csv("data/ohlc_data.csv", index=True)
# df = pd.read_csv("data/ohlc_data.csv", index_col=0)
env = TradingEnv(df)
    state_dim = 3  # Three discrete market states (bullish, bearish, neutral)
n_actions = env.action_space.n
agent = QLearningAgent(n_actions, state_dim)
n_episodes = 100 # Run for a defined number of episodes
max_steps_per_episode = len(df) # Limit the number of steps per episode if necessary
for episode in range(n_episodes):
state = env.reset()
done = False
total_reward = 0
steps = 0
while not done and steps < max_steps_per_episode:
action = agent.choose_action(state)
next_state, reward, done, info = env.step(action)
agent.update_policy(state, action, reward, next_state, done)
state = next_state
total_reward += reward
steps += 1
env.render() # This needs to be called here
print(f"Episode: {episode}, Total reward: {total_reward:.2f}, Final balance: {env.balance:.2f}, Exploration rate: {agent.exploration_rate:.4f}, Steps: {steps}")
Most of this should be self-explanatory, and I’ve added comments for clarity.
Episodes:
- An episode represents a single attempt where the agent interacts with the environment, starting from a defined point toward a specific goal.
- Throughout the episode, the agent executes a series of actions aimed at maximizing rewards.
- Once the goal is achieved or the maximum number of steps is reached (known as “max steps per episode”), the episode concludes, leading to the initiation of a new one.
Max Steps per Episode:
- This parameter defines the upper limit of actions the agent can perform before the episode concludes.
- If the agent achieves its goal sooner, the episode ends prematurely. Otherwise, it concludes after reaching the predefined limit set by “max steps per episode.”
- In my instance, I set the maximum to match the number of days/rows in the trading data.
Conclusion
I incorporated print statements to illustrate the training process, showing when trades are executed, the unrealized profits of open positions, and sell actions.
The output might resemble this:
Episode: 999, Total reward: 1763.29, Final balance: 11808.21, Exploration rate: 0.0100, Steps: 1257
At a high level, my trial commenced with £10,000 and concluded with £11,808.21, which is encouraging.
The total reward calculated indicates that successful trades yield positive rewards, while unsuccessful ones result in negative rewards. In this example, the overall reward is 1763.29.
I hope this provides insight into potential applications for various use cases.
Thank you for reading this article! If you found it engaging and informative, please consider following me and signing up for email notifications.
> If you liked this article, I recommend checking out EODHD APIs on Medium. They have some intriguing articles.
Michael Whittle
- If you enjoyed this, please follow me on Medium
- For more interesting articles, please follow my publication
- Interested in collaborating? Let’s connect on LinkedIn
- Support me and other Medium writers by signing up here
- Please don’t forget to clap for the article :) Thank you!