Reinforcement Learning with TF-Agents
Reinforcement learning (RL) is a general framework where agents learn to perform actions in an environment so as to maximize a reward. The two main components are the environment, which represents the problems to be solved, and the agent, which represents the learning algorithm.
The agent and environment continuously interact with each other. At each time step, the agent takes an action on the environment based on its policy $\pi(a_t \mid s_t)$, where $s_t$ is the current observation from the environment, and receives a reward $r_{t+1}$ and the next observation $s_{t+1}$ from the environment. The goal is to improve the policy so as to maximize the sum of rewards.
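The interaction loop described above can be sketched in a few lines of plain Python. This is only an illustration: `CoinFlipEnv`, `run_episode`, and the two policies are hypothetical names made up for this example and are not part of TF-Agents. The toy environment shows a coin value as the observation and pays a reward of 1 when the action matches it.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment (not a TF-Agents class).

    The observation is the current coin value (0 or 1). The agent earns a
    reward of 1.0 when its action equals the coin value it was shown.
    """

    def __init__(self, episode_length=10, seed=0):
        self._rng = random.Random(seed)
        self._episode_length = episode_length
        self._t = 0
        self._coin = 0

    def reset(self):
        self._t = 0
        self._coin = self._rng.randint(0, 1)
        return self._coin  # initial observation

    def step(self, action):
        reward = 1.0 if action == self._coin else 0.0
        self._t += 1
        self._coin = self._rng.randint(0, 1)  # next observation
        done = self._t >= self._episode_length
        return self._coin, reward, done


def run_episode(env, policy):
    """Run one episode and return the sum of rewards the agent collected."""
    observation = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(observation)                  # agent acts on the observation
        observation, reward, done = env.step(action)  # environment responds
        total_reward += reward
    return total_reward


def copy_policy(observation):
    """A policy that uses the observation: repeat the coin value."""
    return observation


def constant_policy(observation):
    """A policy that ignores the observation and always picks 0."""
    return 0
```

Calling `run_episode(CoinFlipEnv(), copy_policy)` collects the maximum return for the episode, while `constant_policy` can only match it by luck. Improving the policy means moving from the second kind of behavior toward the first.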
Introduction
TF-Agents makes designing, implementing and testing new RL algorithms easier by providing well-tested modular components that can be modified and extended. It enables fast code iteration, with good test integration and benchmarking.
Cartpole Environment
The Cartpole environment is one of the most well-known classic RL problems.
 - The observation from the environment $s_t$ is a 4D vector representing the position and velocity of the cart, and the angle and angular velocity of the pole.
 - The agent can control the system by taking one of 2 actions $a_t$: push the cart right (+1) or left (-1).
 - A reward $r_{t+1} = 1$ is provided for every timestep that the pole remains upright. The episode ends when one of the following is true:
   - the pole tips over some angle limit
   - the cart moves outside of the world edges
   - 200 time steps pass.
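The three termination conditions can be written as a small predicate. The text above does not give numeric limits, so the values below are assumptions: they are the thresholds used by the Gym CartPole-v0 implementation (12 degrees for the pole, 2.4 units for the cart). The function name `episode_over` is made up for this sketch.

```python
import math

# Assumed limits, taken from Gym's CartPole-v0; the text above only says
# "some angle limit" and "world edges".
ANGLE_LIMIT_RADIANS = math.radians(12)
CART_POSITION_LIMIT = 2.4
MAX_EPISODE_STEPS = 200  # stated in the text


def episode_over(cart_position, pole_angle, steps_taken):
    """Return True if any of the three termination conditions holds."""
    pole_tipped = abs(pole_angle) > ANGLE_LIMIT_RADIANS
    cart_out_of_bounds = abs(cart_position) > CART_POSITION_LIMIT
    time_limit_reached = steps_taken >= MAX_EPISODE_STEPS
    return pole_tipped or cart_out_of_bounds or time_limit_reached
```

Because the reward is 1 per time step and the episode is capped at 200 steps, the best possible return in this setup is 200.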
DQN Agent
The DQN (Deep Q-Network) algorithm was developed by DeepMind in 2015. It was able to solve a wide range of Atari games (some to superhuman level) by combining reinforcement learning and deep neural networks at scale. The algorithm was developed by enhancing a classic RL algorithm called Q-learning with deep neural networks and a technique called experience replay.
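To make the two ingredients concrete, here is a minimal sketch of an experience replay buffer and the Q-learning target, written with NumPy only. It is not the TF-Agents implementation: `ReplayBuffer`, `td_targets`, and the discount factor `gamma=0.99` are assumptions chosen for illustration, and `q_function` stands in for the neural network that maps a state to one Q-value per action.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size store of past transitions; the oldest are dropped first."""

    def __init__(self, capacity):
        self._storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self._storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up the correlation
        # between consecutive time steps.
        return random.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


def td_targets(batch, q_function, gamma=0.99):
    """Compute the Q-learning target for each transition in the batch.

    target = r                               if the episode ended
    target = r + gamma * max_a' Q(s', a')    otherwise
    """
    targets = []
    for _state, _action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else gamma * float(np.max(q_function(next_state)))
        targets.append(reward + bootstrap)
    return np.array(targets)
```

In a full DQN the network is then trained so that its prediction Q(s, a) for the stored action moves toward these targets, for example by minimizing a squared or Huber loss over each sampled batch.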
Setup
Dependencies
Hyperparameters
In the Cartpole environment:

 - observation is an array of 4 floats: the position and velocity of the cart, and the angular position and velocity of the pole
 - reward is a scalar float value
 - action is a scalar integer with only two possible values: 0 ("move left") and 1 ("move right")

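The shapes and types listed above can be expressed as simple checks. This is a sketch using NumPy only; the helper names `is_valid_observation` and `is_valid_action` are hypothetical and are not TF-Agents functions.

```python
import numpy as np


def is_valid_observation(observation):
    """True if the observation is an array of exactly 4 floating-point values."""
    obs = np.asarray(observation)
    return obs.shape == (4,) and bool(np.issubdtype(obs.dtype, np.floating))


def is_valid_action(action):
    """True if the action is an integer equal to 0 (move left) or 1 (move right)."""
    if isinstance(action, bool):  # bool is a subclass of int; reject it explicitly
        return False
    return isinstance(action, (int, np.integer)) and int(action) in (0, 1)
```

For example, `is_valid_observation(np.zeros(4, dtype=np.float32))` passes, while an action of 2 is rejected because only 0 and 1 are defined.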
Training the Agent
Training takes about 7 minutes to run.