Weekly review of Reinforcement Learning papers #2
Every Monday, I present 4 publications from my research area. Let’s discuss them!
Paper 1: Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification
Eysenbach, B., Levine, S., & Salakhutdinov, R. (2021). Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification. arXiv preprint arXiv:2103.12656.
The core of reinforcement learning is the reward function: it tells the agent how well it is doing. In some cases, this reward function is easy to describe: in a video game, just take the score. In other cases, it is not easy to write one down. Take the example from the article: closing a drawer. If the observation space is an image of the scene (with the robot and the drawer), it is very hard to decide whether the task has succeeded: we would need an algorithm to estimate the position of the drawer, handle the occlusion caused by the piece of furniture, then a whole cascade of conditions, and so on…
Nevertheless, it is easy to produce images of the closed drawer. These would be example observations of the task being solved. Well, that is the idea behind this article.
Make no mistake, this is not imitation learning: the intermediate states leading to success are not given. Basically, example-based learning is close to goal-conditioned RL: maximizing the probability of reaching a given state.
The algorithm does not learn a reward function from the examples of success. Instead, it learns directly to predict whether the task will succeed in the future, through a recursive scheme. At each iteration, a classifier is trained to predict y = 1 for the success-state examples, and y = γw / (1 + γw) for regular transitions, where w is the classifier's prediction at the next time step (hence the "recursive" in the name). Then new trajectories are collected by interacting with the environment, and so on. They call their algorithm Recursive Classification of Examples (RCE).
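To make the recursion more concrete, here is a minimal sketch of one classifier update, written from my description above rather than from the authors' code; `classifier` and the batch names are placeholders, and I gloss over how the paper handles actions for the success examples and its exact weighting scheme.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def rce_classifier_loss(classifier, success_obs, success_act,
                        obs, act, next_obs, next_act, gamma=0.99):
    """One RCE-style update (simplified sketch)."""
    # Success examples are labeled y = 1.
    p_success = classifier(success_obs, success_act)
    loss_success = bce(p_success, torch.ones_like(p_success))

    # Regular transitions get the recursive label y = gamma*w / (1 + gamma*w),
    # where w comes from the classifier's own prediction at the next step.
    with torch.no_grad():
        w = classifier(next_obs, next_act)
        y = gamma * w / (1.0 + gamma * w)
    loss_transitions = bce(classifier(obs, act), y)
    return loss_success + loss_transitions
```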
The results look quite good. I wonder whether similar results can be obtained on all the classical RL benchmarks. This is an approach that tries to simplify the reinforcement learning paradigm, and it is the kind of publication I like very much.
Paper 2: Learning Monopoly Gameplay: A Hybrid Model-Free Deep Reinforcement Learning and Imitation Learning Approach
Haliem, M., Bonjour, T., Alsalem, A., Thomas, S., Li, H., Aggarwal, V., … & Kejriwal, M. (2021). Learning Monopoly Gameplay: A Hybrid Model-Free Deep Reinforcement Learning and Imitation Learning Approach. arXiv preprint arXiv:2103.00683.
No major breakthrough in this publication, but I definitely wanted to talk about it. The authors trained an agent to play Monopoly! I have played this game a lot, and there are many strategies to consider. There is of course randomness involved, since you have to roll the dice to advance and draw Chance cards. This is a reinforcement learning environment I had not thought of!
To do this, they start with imitation learning, using an agent whose choices are fixed in advance (a hard-coded strategy). Then the agent continues its training with a DQN algorithm with experience replay. To evaluate their agent, they propose 4 hard-coded agents of increasing skill (called P1, …, P4). For the evaluation, they organized tournaments (50 games with 4 players), pitting their agent against the agents P1, …, P4. Regardless of the composition of the opponents, their agent won at least 75% of the tournaments. I wonder if Monopoly will become a classic environment for RL benchmarks.
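As a rough illustration of this kind of hybrid pipeline (my own sketch, not the authors' code), one could pre-train the Q-network to imitate the hard-coded strategy and then fine-tune it with standard DQN updates on a replay buffer; all names below are placeholders.

```python
import torch
import torch.nn.functional as F

def imitation_step(q_net, optimizer, states, expert_actions):
    """Supervised phase: treat the hard-coded agent's action as a class label."""
    loss = F.cross_entropy(q_net(states), expert_actions)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

def dqn_step(q_net, target_net, optimizer, batch, gamma=0.99):
    """RL phase: one DQN update on a replay batch (s, a, r, s', done)."""
    s, a, r, s_next, done = batch
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```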
Paper 3: Sample-efficient Reinforcement Learning Representation Learning with Curiosity Contrastive Forward Dynamics Model
Nguyen, T., Luu, T. M., Vu, T., & Yoo, C. D. (2021). Sample-efficient Reinforcement Learning Representation Learning with Curiosity Contrastive Forward Dynamics Model. arXiv preprint arXiv:2103.08255.
How can an agent learn effectively from raw images? This is an important question in reinforcement learning: often, most of the pixels carry little information. It is as if the important information were hidden in a very high-dimensional space.
To address this question, the authors propose an algorithm they call Curiosity Contrastive Forward Dynamics Model (CCFDM). Its three major components are contrastive learning, a Forward Dynamics Model (FDM) and a curiosity module. Let me explain briefly.
The contrastive learning aspect is borrowed from unsupervised learning (here is a good article that will help you understand what it is). It all starts by sampling transitions from the replay buffer. To put it simply, the current observation will be the query and the next observation will be the key. Both are augmented and encoded with a CNN (giving q and k'). This is where the FDM comes in: the observation feature (q) and the action feature (a_e) are fed into the FDM, and its output is, conceptually, a prediction of the next state's feature (q'). This prediction q' then acts as the query, and together with the key k' it goes through the contrastive unsupervised loss.
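Here is how I picture the core of this objective, as a rough sketch under my own assumptions (the encoder, FDM, bilinear matrix W and augmentation function are placeholders, and I leave out the momentum encoder usually used for the keys in this kind of method):

```python
import torch
import torch.nn.functional as F

def contrastive_fdm_loss(encoder, fdm, W, obs, action, next_obs, augment):
    """InfoNCE-style loss between the FDM prediction and the encoded next observation."""
    q = encoder(augment(obs))            # encoded (augmented) current observation
    k = encoder(augment(next_obs))       # encoded (augmented) next observation (the key)
    q_pred = fdm(q, action)              # FDM prediction of the next latent (the query)
    # Bilinear similarities between every prediction and every key in the batch.
    logits = q_pred @ W @ k.t()          # shape (batch, batch)
    labels = torch.arange(logits.shape[0])   # positive pair = matching index
    return F.cross_entropy(logits, labels)
```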
On top of this, they add the curiosity module. I do not have much information about how it works; the main point is that it adds a reward for exploring new states. This is an idea that comes up often and is quite effective. At this point, everything is ready for reinforcement learning.
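Setting aside the paper's exact formulation, a common way to implement such a bonus is to reward the forward model's prediction error, so that poorly predicted (novel) transitions get a larger intrinsic reward; here is an illustrative sketch with placeholder names:

```python
import torch

def curiosity_bonus(encoder, fdm, obs, action, next_obs, scale=0.1):
    """Illustrative intrinsic reward: the forward model's prediction error."""
    with torch.no_grad():
        k = encoder(next_obs)                  # actual next latent
        k_pred = fdm(encoder(obs), action)     # predicted next latent
    return scale * ((k_pred - k) ** 2).mean(dim=-1)   # one bonus per transition
```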
The results are pretty good: they are close to what the authors call the upper learning bound (the one obtained with SAC trained directly on the state). It is interesting to connect contrastive learning and deep RL methods, even if I regret the complexity of the algorithm compared to DrQ.
Paper 4: Explainability in Deep Reinforcement Learning
Heuillet, A., Couthouis, F., & Díaz-Rodríguez, N. (2021). Explainability in deep reinforcement learning. Knowledge-Based Systems, 214, 106685.
This is the magic and the pitfall of AI: the explainability of models. As good scientists, we cannot just say "it works"; we must be able to explain the learning mechanisms. This is a recurring question, widely studied for classification problems but much less so for reinforcement learning. This paper gives an overview of the work that aims to explain reinforcement learning, split into two categories: transparent algorithms and post-hoc explainability. It is a very dense 17-page paper. Since each discussed work approaches the question in its own way, it is impossible to give a faithful synthesis of all the material here, so I will settle for a single, visual example.
A method for computing saliency based on perturbation: a perturbation is applied to the image, removing information and adding spatial uncertainty to the perturbed region. By measuring the effect this perturbation has on the agent, we can identify the areas that drive the agent's choices. The result is shown in the image on the right, and it looks quite natural: the regions around the ball and the paddle have high values.
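To give an idea of how such a perturbation-based saliency map can be computed, here is a minimal sketch of the general idea (not the exact method from the cited work; `policy` is a placeholder for the agent's network, returning a NumPy array):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def saliency_map(policy, frame, patch=5, sigma=3):
    """Blur small regions of the frame and score them by how much the policy's output changes."""
    base = policy(frame)                           # output on the clean frame
    blurred = gaussian_filter(frame, sigma=sigma)  # information-removing perturbation
    sal = np.zeros(frame.shape[:2])
    for i in range(0, frame.shape[0], patch):
        for j in range(0, frame.shape[1], patch):
            perturbed = frame.copy()
            perturbed[i:i + patch, j:j + patch] = blurred[i:i + patch, j:j + patch]
            # A large change in the output means this region matters to the agent.
            sal[i:i + patch, j:j + patch] = np.sum((policy(perturbed) - base) ** 2)
    return sal
```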
This is one of the many methods presented in this paper. I really encourage you to look at this publication in detail.
Bonus Paper: First M87 Event Horizon Telescope Results. VII. Polarization of the Ring
The Event Horizon Telescope Collaboration et al. (2021). First M87 Event Horizon Telescope Results. VII. Polarization of the Ring. The Astrophysical Journal Letters, 910:L12.
Do you remember the historic image that the Event Horizon Telescope released on April 10, 2019? The very first image of a black hole, the one at the heart of the galaxy M87, located about 53 million light-years away. To achieve such a remarkable feat, not one but 8 radio observatories, spread over 4 continents, were used. This collaboration is called the Event Horizon Telescope. By combining their observations, it was possible to obtain an image clearly showing the matter orbiting around what is called the event horizon: the boundary beyond which the gravity of the black hole is so strong that nothing, not even light or matter, can escape.
Since this fabulous breakthrough, researchers have not been idle. And last week, they published this image.
By examining the polarization of the light, astronomers noticed that part of the ring shows interesting, ordered patterns. This polarization actually traces the magnetic fields in the region the light comes from. Overlaying these polarization directions on the original image reveals very beautiful field lines. Beyond being aesthetically pleasing, this will help astronomers better understand the mechanisms at work around supermassive black holes.
A small remark to make your head spin: the photons captured to make this image left the black hole's surroundings shortly after the extinction of the dinosaurs.
It was with great pleasure that I presented my readings of the week to you. Feel free to send me your feedback.
To read my reviews from Sunday evening, visit my blog: https://qgallouedec.github.io