Is Reinforcement Learning Right For Your AI Problem?
In the world of machine learning, reinforcement learning is an important paradigm. It is frequently combined with deep learning, in which the hierarchical structure of the human brain is mimicked through man-made artificial neural networks.
Reinforcement learning (RL) is a basic machine learning paradigm that does not require labeled data, as supervised machine learning typically does. Instead, RL is based on the interaction between an AI system and its environment: the algorithm is given a numerical score (a reward) indicating whether its decision was a good one, and positive behaviors are “reinforced” to refine the algorithm over time. In recent years, RL has been the source of superhuman performance at Go, Atari games and many other applications.
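The loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' method: the toy “corridor” environment, Q-learning as the update rule, and all parameter values are assumptions chosen for the example.

```python
import random

random.seed(0)

# Toy "corridor" environment (invented for illustration): states 0..4,
# actions 0 = left, 1 = right; reaching state 4 yields reward +1.
N_STATES, GOAL = 5, 4

def step(state, action):
    """Move left or right; reward +1 only upon reaching the goal state."""
    next_state = max(0, min(GOAL, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

# Tabular Q-learning: one numerical score per (state, action). Choices
# that lead to reward are "reinforced" by nudging Q toward the observed
# reward plus the estimated value of what follows.
Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for episode in range(200):
    state, done = 0, False
    while not done:
        # Explore occasionally; otherwise exploit the current best score.
        if random.random() < epsilon:
            action = random.randrange(2)
        else:
            action = 0 if Q[state][0] > Q[state][1] else 1
        next_state, reward, done = step(state, action)
        target = reward + (0.0 if done else gamma * max(Q[next_state]))
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state

# After training, the learned policy prefers "right" in every state.
policy = [0 if Q[s][0] > Q[s][1] else 1 for s in range(GOAL)]
```

No labeled examples of “correct” moves appear anywhere: the agent discovers the policy purely from the reward signal.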
Imagine training a machine learning agent to trade stocks. One option is to provide the system with many examples of good strategies, i.e. data labeled with whether or not to sell a particular stock at any given time. This is the well-known paradigm of supervised learning. But because the agent only tries to imitate those strategies, it can never surpass them. How do you find strategies that beat the expert? The answer is RL.
But while RL is a powerful approach to AI, it is not suitable for every problem, and there are several types of RL.
Ask yourself these six questions to decide which approach might best fit the problem you’re trying to solve:
- Does my algorithm have to make a sequence of decisions?
RL is ideal for problems that require sequential decision making, that is, a series of decisions that all affect one another. If you are developing an AI program to win a game, it is not enough for the algorithm to make one good decision; it has to make a whole series of good decisions. By providing a single reward for a positive outcome, RL eliminates solutions that result in low rewards and elevates those that produce a full sequence of good decisions.
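The way a single end-of-sequence reward credits every decision along the way can be made concrete with a discounted return. A short sketch, with an illustrative game of four moves and a discount factor chosen for the example:

```python
def discounted_returns(rewards, gamma=0.9):
    """Return-to-go for each step: R_t = r_t + gamma * R_{t+1}.
    A single terminal reward propagates credit backward to every
    earlier decision in the sequence."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

# A game with a single reward at the end (a win = +1 after four moves):
returns = discounted_returns([0.0, 0.0, 0.0, 1.0])
# Every move in the winning sequence receives some credit, with
# earlier moves discounted more heavily than later ones.
```

Sequences that never reach the reward score zero everywhere, which is exactly how low-reward solutions get eliminated.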
- Do I have an existing model?
If you want to write a program for a robot to pick up a physical object, you can use the laws of physics to inform your model. But if you are trying to write a program to maximize stock market returns, there is no existing model that can be used. Instead, you’ll need to use heuristics that have been tuned manually over time. But these heuristics could be suboptimal. Generally, RL is a good choice when there is no existing model to build on or you want to improve on an existing decision-making strategy.
- How much data do I have? What is at stake if the wrong decision is made?
The amount of data you already have and the cost of making the wrong decisions can help you determine whether to use RL online or offline.
For example, imagine that you run a video platform and you have to train an algorithm to provide recommendations to users. If you have no data, you have no choice but to interact with users and make recommendation decisions in real time, using an online process. Such exploration comes at a cost: a few bad recommendations made while the system is learning can disappoint users. However, if you already have large amounts of data, you can develop a good policy without interacting with specific users. This is offline RL.
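The contrast can be sketched as follows. Everything here is invented for illustration: the two video IDs, the simulated click rates standing in for real users, and the small logged dataset used for the offline case.

```python
import random

random.seed(1)

# Invented example: two recommendations with unknown click rates.
TRUE_CLICK_RATE = {"video_a": 0.7, "video_b": 0.3}

def user_feedback(rec):
    """Simulated user interaction (stands in for the real platform)."""
    return 1.0 if random.random() < TRUE_CLICK_RATE[rec] else 0.0

# Online: learn by interacting, paying for bad recommendations en route.
counts = {"video_a": 0, "video_b": 0}
totals = {"video_a": 0.0, "video_b": 0.0}
for _ in range(500):
    # Explore 10% of the time; otherwise pick the current best estimate.
    if random.random() < 0.1 or min(counts.values()) == 0:
        rec = random.choice(list(counts))
    else:
        rec = max(counts, key=lambda k: totals[k] / counts[k])
    counts[rec] += 1
    totals[rec] += user_feedback(rec)
online_best = max(counts, key=lambda k: totals[k] / counts[k])

# Offline: no new interaction -- estimate from an existing log instead.
logged = [("video_a", 1.0), ("video_a", 0.0), ("video_a", 1.0),
          ("video_b", 0.0), ("video_b", 1.0), ("video_b", 0.0)]
avg = {k: sum(r for a, r in logged if a == k) /
          sum(1 for a, r in logged if a == k) for k in counts}
offline_best = max(avg, key=avg.get)
```

The online loop reaches the same answer, but only after serving real users some bad recommendations along the way; the offline estimate never touches a user.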
- Does my goal change?
Sometimes in AI your goal never changes. With stocks, you will always want to maximize your returns. Such a problem is not goal-conditioned, because you are always solving for the same goal. But in other cases, your goal may be a moving target. Consider Loon, Google’s recently shut-down effort to build giant balloons to bring the internet to rural areas. Here, the optimal position for each balloon is different. For such cases, goal-conditioned RL is more suitable.
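The core idea of goal conditioning is that the goal becomes an input to the policy alongside the state, so one policy can serve many targets. The hand-coded rule below is a stand-in for a learned policy, and the 1-D “position” world is invented; the point is only the interface:

```python
# Stand-in for a learned goal-conditioned policy: it receives the goal
# as an input, rather than having a single fixed goal baked in.
def goal_conditioned_policy(position, goal):
    """Move toward whatever goal is supplied: +1, -1, or stay put."""
    if position < goal:
        return +1
    if position > goal:
        return -1
    return 0

def rollout(position, goal, max_steps=20):
    """Apply the policy until the supplied goal is reached."""
    for _ in range(max_steps):
        if position == goal:
            break
        position += goal_conditioned_policy(position, goal)
    return position
```

Because the goal is an argument, the same policy reaches different targets without retraining, which is what a moving target requires.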
- How long is my time horizon?
In other words, how many decisions does my algorithm have to make before arriving at a solution?
The answer can help you determine whether to use hierarchical or non-hierarchical RL. Consider writing a program to make a robot pick up an object. The robot must approach the object and close its grippers to lift it. For problems like this, with a small number of decisions, non-hierarchical RL is often adequate. Now imagine that the same robot has to locate nails, place them on a board, then pick up a hammer and strike each nail. At the abstract level, there are only three or four stages. But a program that controls the position of the robot’s hands must produce a long sequence of actions. In such cases, with longer time horizons, hierarchical RL is often useful.
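The decomposition can be sketched as two levels of decisions. The task names, the number of motor steps per stage, and the string-valued “motor commands” are all invented for illustration; a real hierarchical RL agent would learn both levels rather than have them written out.

```python
# High level: three or four abstract stages for the nail-hammering task.
HIGH_LEVEL_PLAN = ["locate_nail", "place_nail", "fetch_hammer", "strike_nail"]

# Low level: each abstract stage expands into many primitive motor
# commands (here, stand-in strings for hand/gripper positions).
def low_level_actions(subtask, n_steps=50):
    return [f"{subtask}:motor_step_{i}" for i in range(n_steps)]

# Non-hierarchical RL must search over the flat sequence of motor
# commands; hierarchical RL searches over the short plan first, then
# solves each subtask separately over a much shorter horizon.
flat_sequence = [a for sub in HIGH_LEVEL_PLAN
                 for a in low_level_actions(sub)]
```

Four abstract decisions versus two hundred motor-level ones is the gap that makes the hierarchy pay off.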
- Is my task really sequential decision making? What information do I have about my users?
Say you are looking to optimize the design of a website to sell a particular product. In some cases, a user may never return to your website, and whether the user makes a purchase may depend on something like the color of the site. You could show users three backgrounds in different colors at random and see which one works best. But if you have additional information about your users, such as their gender or location, you can incorporate that information and use it to better shape your AI program. Contextual bandits are a distinct decision-making approach suited to these situations: an algorithm can test different actions and learn which one is most rewarding for a given context, with theoretical guarantees on performance. However, if the same user comes back multiple times, the problem becomes truly sequential; in that case, go ahead and use RL in its most general form, albeit at the cost of those theoretical guarantees.
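A contextual bandit for the website example can be sketched as one value estimate per (context, action) pair. The contexts, colors, and “true” purchase rates below are invented purely to simulate users; a real deployment would observe real purchases instead.

```python
import random

random.seed(2)

# Invented example: pick one of three background colors per user context.
COLORS = ["red", "blue", "green"]
CONTEXTS = ["mobile", "desktop"]
# Hypothetical purchase rates per (context, color) -- simulation only.
TRUE_RATE = {("mobile", "red"): 0.2, ("mobile", "blue"): 0.5,
             ("mobile", "green"): 0.3, ("desktop", "red"): 0.6,
             ("desktop", "blue"): 0.2, ("desktop", "green"): 0.3}

# One running estimate per (context, action): that's the "contextual" part.
counts = {key: 0 for key in TRUE_RATE}
totals = {key: 0.0 for key in TRUE_RATE}

for _ in range(3000):
    ctx = random.choice(CONTEXTS)      # each user arrives with a context
    if random.random() < 0.1:          # explore a random color
        color = random.choice(COLORS)
    else:                              # exploit the best estimate so far
        color = max(COLORS,
                    key=lambda c: totals[ctx, c] / max(counts[ctx, c], 1))
    reward = 1.0 if random.random() < TRUE_RATE[ctx, color] else 0.0
    counts[ctx, color] += 1
    totals[ctx, color] += reward

# The learned choice differs by context -- the point of the setup.
best = {ctx: max(COLORS,
                 key=lambda c: totals[ctx, c] / max(counts[ctx, c], 1))
        for ctx in CONTEXTS}
```

Note there is no sequence here: each user is a single independent decision, which is exactly why the simpler bandit machinery suffices.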
This list of questions is by no means exhaustive. For example, there are also safety and fairness considerations to take into account. But by asking these six questions, data scientists can begin to get a feel for how RL might best help them solve their problems.
About the Authors
Pulkit Agrawal is an assistant professor of electrical engineering and computer science at MIT and heads the Improbable AI Lab, which is part of MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL).
Cathy Wu is the Gilbert W. Winslow Career Development Assistant Professor in Civil and Environmental Engineering at MIT and has worked in many fields and organizations, including Microsoft Research, OpenAI, the Google X Self-Driving Car Team, AT&T, Caltrans, Facebook and Dropbox. Wu is also the founder and chair of the Interdisciplinary Research Initiative at the ACM Future of Computing Academy.
Agrawal and Wu are also co-instructors of MIT’s professional training course, Advanced Reinforcement Learning, which is part of the Professional Certificate in Machine Learning and Artificial Intelligence.