Autonomous robotic nanofabrication with reinforcement learning

Intelligent learning agent controls a scanning probe microscope and masters a nanoscale robotic task of molecule manipulation.

The swift development of quantum technologies could be further advanced if we managed to free ourselves from the imperatives of crystal growth and self-assembly, and learned to fabricate custom-built metastable structures on atomic and molecular length scales routinely. Metastable structures, apart from being more abundant than stable ones, tend to offer attractive functionalities, because their constituent building blocks can be arranged more freely and in particular in desired functional relationships [7].
With such actuators, a sequence of manipulation steps can be carried out to bring a system of molecular building blocks into a desired target state [7]. The problem of creating custom-built structures from single molecules can therefore be cast as a challenge in robotics. In the macroscopic world, robots are typically steered using either human expert knowledge or model-based approaches. Neither strategy is available at the nanoscale: on the one hand, human intuition is largely trained on concepts like inertia and gravity, which do not apply here; on the other hand, atomistic simulations are either too imprecise or too computationally expensive to be helpful. Even if such calculations were possible, actuators at the nanometer scale have variable and typically unknown properties, making it extremely difficult, if not impossible, to establish a connection between the actual robotic process and the atomistic simulation. This leaves autonomous robotic nanofabrication as the preferred option.
In the current study, we show for the first time that Reinforcement Learning (RL) can be used to automate a manipulation task at the nanoscale. In RL, a software agent is placed in an environment at time t = 0 and sequentially performs actions to alter the state s_t of this environment. While executing random actions in the beginning, the agent, based on its accumulated experience, incrementally learns a policy π for choosing actions a_t that maximize a sum over reward signals r_{t+1}. The reward signal is returned by the environment in a manner specified by the experimenter beforehand. The experimenter designs the reward signal such that behaviour which solves the problem yields a high reward, whereas failure to do so yields a low reward. The advantage of this approach is that the experimenter does not have to specify how the agent needs to act, but only has to check whether the intended task is being solved.
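The agent-environment loop described above can be sketched in a few lines. The following is a minimal tabular illustration, not the setup used in the paper: the toy chain environment, the ±1 action encoding, and the reward values are placeholders chosen only to make the loop runnable; the discount γ = 0.97 matches the value used later in the text.

```python
import random

class ChainEnv:
    """Toy environment: walk right from state 0 to state 4 for +1 reward;
    stepping left of state 0 ends the episode with -1 (a 'rupture' analogue)."""
    actions = (+1, -1)
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s += a
        if self.s >= 4:
            return self.s, 1.0, True    # success: goal state reached
        if self.s < 0:
            return self.s, -1.0, True   # failure: left the chain
        return self.s, 0.0, False       # ordinary nonterminal transition

def greedy_eps(q, s, actions, eps=0.1):
    """Epsilon-greedy action selection over tabular Q-values."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: q.get((s, a), 0.0))

def run_episode(env, q, alpha=0.5, gamma=0.97, eps=0.1):
    """One pass of the RL loop: observe s_t, act a_t, receive r_{t+1}, s_{t+1},
    and update Q(s_t, a_t) towards the discounted future reward."""
    s, done, ret = env.reset(), False, 0.0
    while not done:
        a = greedy_eps(q, s, env.actions, eps)
        s2, r, done = env.step(a)
        best = 0.0 if done else max(q.get((s2, b), 0.0) for b in env.actions)
        q[(s, a)] = q.get((s, a), 0.0) + alpha * (r + gamma * best - q.get((s, a), 0.0))
        s, ret = s2, ret + r
    return ret

random.seed(0)
q = {}
returns = [run_episode(ChainEnv(), q) for _ in range(200)]
final = run_episode(ChainEnv(), q, eps=0.0)   # greedy evaluation after learning
```

After a few hundred episodes the learned Q-table steers the greedy agent straight to the goal, illustrating how the experimenter only specifies the reward, never the solution path.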
Considering a compact layer of PTCDA (3,4,9,10-perylene-tetracarboxylic dianhydride) on an Ag(111) surface, we define removing individual molecules from this layer using an SPM as the RL agent's goal (see Fig. 1). This task is an example of a subtractive manufacturing process that starts from a self-assembled molecular structure. We note that subtractive manufacturing has been identified as key to nanoscale fabrication [6,19].

Figure 1: Subtractive manufacturing with a RL agent. a, PTCDA molecules can spontaneously bind to the SPM tip and be removed from a monolayer upon tip retraction on a suitable trajectory. Bond formation and breaking cause strong increases or decreases in the tunneling current (left inset). The removal task is complicated since PTCDA is retained in the layer by a network of hydrogen bonds (dotted lines in right inset). The RL agent can repeatedly choose from the five indicated actions a_1-5 (green arrows) to find a suitable trajectory (action set A: ∆z = 0.1 Å step plus ±0.3 Å step in x or y direction, or no lateral movement). b, STM image of a PTCDA layer with 16 vacancies created by the RL agent (scale bar is 5 nm). c, Probability of bond rupture in 0.5 Å z-intervals as a function of z, based on all bond-breaking events accumulated during the RL-agent experiments (inset). d, The Q-function is approximated by a neural network with 30 neurons in the first and 2 × 15 neurons in the second hidden layer.
This dueling network architecture [20] features separate outputs A_i and V, with Q_i = V + A_i for actions a_{i=1...5}. The actually performed action is then randomly chosen from A with probabilities computed with the policy π.

Robotic nanofabrication as a Reinforcement Learning challenge
A RL task is usually modeled as a Markov Decision Process (MDP) [10], which is a Markov Process equipped with an agent that can perform an action at each state to influence the transition into the next state. In nanofabrication, a complete numerical representation of the environment state s̃_t would consist of the coordinates of all relevant atoms in the environment. The probability distribution p(s̃_{t+1}, r_{t+1} | s̃_t, a_t), which defines the probability to reach state s̃_{t+1} and receive reward r_{t+1} after taking action a_t in state s̃_t, is deterministic at low temperatures (in our example at T = 5 K) and stochastic at temperatures where thermal fluctuations are enabled. The complete state s̃_t of the environment is generally not observable at the current level of technology. To apply RL to nanofabrication, we therefore suggest using an approximate state definition s_t. There are several plausible elements from which to construct such a definition. First, there is the known actuator position. Second, there are typically some measurable quantities (like forces) in any robotic nanofabrication setup which are functions of the complete state s̃_t of the environment and which could thus be used to approximate this state.
Importantly, any approximate state definition will be of much lower dimensionality than the complete state definition, such that transitions s̃_t → s̃_{t+1} in the complete state space cannot be captured fully by transitions s_t → s_{t+1} in the approximate state space. Hence, two states s_t and s'_t could be identical even when the underlying complete states s̃_t and s̃'_t are not. Using an approximate state description has several consequences: Firstly, it could break the Markov property, because the history s_0, . . . , s_t would provide more information about s̃_t than s_t alone. Secondly, the problem could become effectively non-stationary, because a change in the actuator (i.e., in the arrangement of its atoms) could change the entire probability distribution p(s_{t+1}, r_{t+1} | s_t, a_t) from s_{t=0} onward, without the approximate state definition being capable of capturing these changes. This could render the accumulated experience at least partially worthless. An additional source of non-stationarity in nanofabrication systems are parameters of the (external) macroscopic environment (the apparatus, the room, etc.), which also vary slowly.
This non-stationarity is at the heart of the difficulty of autonomous nanofabrication, because it means that the successful policy is not static but must be evolved constantly. Further, the speed at which a policy is learned needs to be faster than the rate at which p(s_{t+1}, r_{t+1} | s_t, a_t) changes. In practice, this requires a substantial speed-up of the standard RL algorithm, which is typically very slow because it needs a lot of training data. If this key advance were achieved, a policy π(s_t) could be learned within the lifetime of the distribution p(s_{t+1}, r_{t+1} | s_t, a_t). Moreover, the intrinsically adaptive RL agent would be able to deal with occasional hidden changes of s̃_t. Below, we demonstrate that RL can be sped up sufficiently to solve our exemplary nanofabrication task.

The PTCDA lifting task as a Reinforcement Learning challenge
We have previously studied the removal of single PTCDA molecules from condensed layers on Ag(111) by manual tip control [6, 21-25]. Using motion capture and virtual reality, we were able to identify specific three-dimensional tip trajectories that reach the target state in which the molecule is fully disconnected from the surface but still bonded to the tip. We stress that the bond between one of the carboxylic oxygen atoms and the tip (Fig. 1a) typically ruptures if a random retraction trajectory is chosen, since along many trajectories the molecule-surface and intermolecular forces together exceed the strength of the tip-molecule bond. Thus, to be successful, the RL agent has to find trajectories on which the retaining force, which holds the molecule in the layer, never surpasses the tip-molecule bond strength.
In our example, neither the atomic coordinates of the object system (PTCDA layer and manipulated PTCDA molecule) nor the atomic structure of the rest of the environment (SPM tip apex) are known precisely. For the definition of the approximate state s_t, we could exploit two measurable quantities, the tunneling current and the force gradient of the SPM. We tried to use these quantities together with the Cartesian coordinates of the tip apex as the state description of the environment (5 dimensions), but we had to conclude that, given the limited time until the task needs to be solved, the measured quantities have too high a variance to be helpful in our RL setup. We therefore chose to significantly reduce the state description and only include the Cartesian coordinates of the tip (3 dimensions) in the state definition. Although this state definition probably does not have the Markov property for the given nanofabrication task, the task could be solved successfully. We therefore chose to keep the complexity in our proof-of-concept study as low as possible and did not use strategies to attempt to restore the Markov property (see e.g. [10], ch. 17).
In the PTCDA lifting task, the non-stationarity of p(s_{t+1}, r_{t+1} | s_t, a_t) discussed above arises, for example, from abrupt changes in the atomic structure of the tip apex and from a slow drift of the piezo crystals controlling the SPM tip. Such changes influence both the measurable quantities and the trajectories on which it is possible to lift the molecule.
It should be noted that despite the non-observability of the precise state, two important key events can be unambiguously detected: the undesired loss of contact to the tip and the desired loss of contact to the surface. In the former event, the tunneling current typically drops by an order of magnitude (Fig. 1a). Additionally, in both events, a clear signal can be observed in the SPM force-gradient channel.

Formal Reinforcement Learning setup
In our formal description of the MDP we use upper-case characters to represent random variables and lower-case characters to represent values that the random variables assume. The RL agent steers the SPM with its actions.
The environment comprises the actuators of the SPM and the object system. At time step t, the environment is in a state which we represent numerically by S_t ∈ S. As noted above, we simplify the state representation to S ⊂ R^3, where the three components are the Cartesian coordinates of the tip apex. Based on the received state, the agent picks an action A_t from the set of actions A. We specify A to consist of five possible actions, all of which move the tip in different directions (see Fig. 1a). The performed action in turn causes the environment to emit a new state S_{t+1} and also a reward R_{t+1} ∈ R.
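The five-action set and the deterministic tip motion described above can be written down directly. This is a sketch based on the action set given in the Fig. 1 caption (∆z = 0.1 Å per step, ±0.3 Å laterally in x or y, or no lateral movement); the ordering and tuple encoding of the actions are our own choice, not taken from the paper.

```python
# The five actions a_1..a_5 (Fig. 1a): every action retracts the tip by
# dz = 0.1 Angstrom; four of them additionally step +/-0.3 Angstrom in x or y.
# A state is the Cartesian tip-apex position (x, y, z) in Angstrom.
ACTIONS = [
    (0.0, 0.0, 0.1),    # retract only, no lateral movement
    (+0.3, 0.0, 0.1),   # retract + step in +x
    (-0.3, 0.0, 0.1),   # retract + step in -x
    (0.0, +0.3, 0.1),   # retract + step in +y
    (0.0, -0.3, 0.1),   # retract + step in -y
]

def transition(state, action):
    """Deterministic state transition s_t -> s_{t+1}: pure vector addition.
    Rounding keeps repeated 0.1/0.3 steps on a clean grid."""
    x, y, z = state
    dx, dy, dz = action
    return (round(x + dx, 6), round(y + dy, 6), round(z + dz, 6))

s0 = (0.0, 0.0, 0.0)          # episode start: tip at the origin
s1 = transition(s0, ACTIONS[1])
```

Because the transition is pure vector addition, the environment model needed later for model-based RL is exact by construction.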
We design the reward system as follows: If the environment transitions to a nonterminal state, we assign a default reward of r_{t+1} = 0.01 (see A.3 for a discussion). If transitioning into a state in which the SPM tip loses contact with the molecule, the agent is penalized with r_{t+1} = −1 and the current episode stops. Finally, if transitioning into a state where the molecule has been lifted successfully, we assign a reward of r_{t+1} = +1 and the episode also stops. After each failed episode, the molecule, by virtue of the hydrogen bonds (Fig. 1), drops back to its original position in the PTCDA layer, and the tip is moved back to s_0 = (0, 0, 0), where the tip-molecule bond re-establishes such that the next episode can start with identical conditions (provided that no change in the tip apex has occurred).
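The reward scheme above amounts to a three-way case distinction. The sketch below uses the reward values quoted in this section and the z > 14 Å success criterion quoted in the SPM-setup section; the function signature and the boolean `contact_lost` flag are our own illustrative choices.

```python
def reward_and_done(z, contact_lost, z_success=14.0):
    """Reward scheme from the text: -1 on bond rupture (episode ends),
    +1 on a successful lift (episode ends), +0.01 for every nonterminal step.
    z is the tip height in Angstrom; z_success = 14 A is the detection
    threshold quoted in the SPM-setup section."""
    if contact_lost:
        return -1.0, True    # tip-molecule bond ruptured: failure, episode stops
    if z > z_success:
        return +1.0, True    # molecule fully lifted: success, episode stops
    return 0.01, False       # ordinary step: small shaping reward
```

The small positive default reward is what later biases the agent towards trajectories on which it previously advanced far before rupturing (see A.3).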
Central to RL is the Q-value function, which is learned (Fig. 2c) and which, in our case, is approximated by a neural network (NN) (Fig. 1d). Q(s_t, a_t) is the agent's estimate of the expected discounted future reward it will receive when performing action a_t in state s_t and afterwards following its policy π. In a given state, the policy π assigns action-selection probabilities to each action depending on their respective Q-values. In our case, π is computed from Q̃ = −Q using the Boltzmann distribution

π(a_i | s_t) = exp(−Q̃(s_t, a_i)/T) / Σ_j exp(−Q̃(s_t, a_j)/T).   (1)

As is common in RL, Q appears with opposite sign in this equation, because a high Q means a high probability, opposite to the energy/occupation relation for which this distribution was initially derived. The "temperature" parameter T determines how greedily the agent chooses actions having higher Q-values.
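The Boltzmann action-selection rule can be illustrated numerically. The Q-values below are arbitrary placeholders; the example also previews the negative training temperature introduced later, which inverts the preference so that low-Q (dangerous) actions dominate the distribution.

```python
import math

def boltzmann_policy(q_values, T):
    """Action probabilities pi(a_i) proportional to exp(-Q~/T) with Q~ = -Q,
    i.e. proportional to exp(Q_i/T). A positive T favors high-Q actions;
    a negative T inverts the preference (the rupture-avoidance trick)."""
    scaled = [q / T for q in q_values]
    m = max(scaled)                          # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    z = sum(weights)
    return [w / z for w in weights]

q = [0.5, 0.1, 0.1, 0.1, 0.1]                # placeholder Q-values for the 5 actions
probs_act = boltzmann_policy(q, T=0.1)       # acting: best action most probable
probs_train = boltzmann_policy(q, T=-0.1)    # training: worst actions emphasized
```

With T = 0.1 the first (highest-Q) action receives almost all probability mass; flipping the sign of T makes it the least probable, which is exactly how failure information is amplified during training.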
The interaction with the environment is organized into state-action-reward-state tuples (s_t, a_t, r_{t+1}, s_{t+1}) and stored in an experience memory to be used for training (Fig. 1d). During training, the Q-values predicted by the NN are adjusted to the discounted future rewards (Fig. 2c). We use an off-policy variant (see below) of the Expected SARSA [26] algorithm, for which the loss is computed as

L = ( Q(s_t, a_t) − [ r_{t+1} + γ Σ_a π(a | s_{t+1}) Q(s_{t+1}, a) ] )²   (2)

and used to optimize the NN weights via gradient descent, with samples obtained by prioritized sampling [27] from the experience memory. Note that the discounted (γ = 0.97) future reward at t + 1 is given by the Q-value function itself. This recursive formulation, called temporal-difference learning, allows Q-values to be learned particularly efficiently and propagated through the state space [10].
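The Expected SARSA target and the squared temporal-difference loss can be computed in a few lines. The Q-values and policy probabilities below are placeholders; the γ = 0.97 discount is the value quoted in the text.

```python
def expected_sarsa_target(r_next, q_next, pi_next, gamma=0.97, terminal=False):
    """Expected SARSA target: r_{t+1} + gamma * sum_a pi(a|s_{t+1}) Q(s_{t+1}, a).
    Terminal states contribute no future value, only the final reward."""
    if terminal:
        return r_next
    return r_next + gamma * sum(p * q for p, q in zip(pi_next, q_next))

def td_loss(q_pred, target):
    """Squared temporal-difference error minimized by gradient descent on the NN."""
    return (q_pred - target) ** 2

# placeholder values: default step reward, 5 next-state Q-values, 5 policy probs
tgt = expected_sarsa_target(0.01, [0.2, 0.1, 0.0, 0.0, 0.0], [0.6, 0.1, 0.1, 0.1, 0.1])
```

Because the target itself contains Q-values at t + 1, repeated updates propagate reward information backwards through the state space, which is the temporal-difference mechanism the text describes.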
Simulation results and RL adjustments

Before connecting the RL agent to our microscope, we benchmarked our RL setup on a simulated system using synthetic bond-breaking criteria (Fig. 2a) derived from prior lifting experiments [6]. Note that the probability of bond rupture as a function of tip height is similar between the simulation and the real experiment (Fig. 1c and Fig. 2b). Specifically, there are two heights at which there is an increased chance of rupture in the experiment, and our synthetic bond-breaking criteria recover this pattern. Even in this stable simulation with no uncontrolled variability (and complete observability), the agent typically requires more than 150 episodes to find a successful policy (Fig. 2d). As discussed above, this low data efficiency would make it (almost) impossible to achieve the goal in the real-world experiment, where we expect a substantial degree of variability over this time scale of hundreds of episodes, rendering much of the collected experience worthless.
Driven by the need to solve the task more efficiently, we introduce two modifications to the standard Expected SARSA algorithm. First, we make use of our purely Cartesian-coordinate state description to perform model-based RL similar to the Dyna algorithm [28]. Dyna uses both actual experience from interaction with the environment and experience obtained from a learned environment model to update the Q-values. In our case, learning an environment model is easy, because the state transition from s_t under action a_t to s_{t+1} is deterministic. The environment model also needs to model the obtained reward. We implement our learned environment model such that it emits the default reward r_{t+1} = 0.01 unless the successor state is a known failure state (i.e., the bond previously ruptured at this position in the experiment), in which case it emits the failure reward r_{t+1} = −1.0. We use this determinism to sample (s_t, a_t, r_{t+1}, s_{t+1}) tuples around states obtained from prior experience (see A.7) and train our model with them.
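The model-based sampling step can be sketched as follows. One simplification to note: the paper samples positions around previously visited states (see A.7), whereas this sketch draws from the visited states themselves; the action tuples and rounding are the same illustrative encoding as above.

```python
import random

def sample_model_experience(visited, failures, actions, n=5, rng=None):
    """Dyna-style synthetic experience: pick a visited state, apply a random
    action deterministically, and emit the default reward 0.01 unless the
    successor is a known failure state (previously observed bond rupture),
    in which case the failure reward -1.0 is emitted."""
    rng = rng or random.Random(0)
    batch = []
    for _ in range(n):
        s = rng.choice(visited)
        a = rng.choice(actions)
        s2 = tuple(round(c + d, 6) for c, d in zip(s, a))   # deterministic transition
        r = -1.0 if s2 in failures else 0.01
        batch.append((s, a, r, s2))
    return batch

actions = [(0.0, 0.0, 0.1), (0.3, 0.0, 0.1), (-0.3, 0.0, 0.1),
           (0.0, 0.3, 0.1), (0.0, -0.3, 0.1)]
visited = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.1)]   # placeholder experience
failures = {(0.6, 0.0, 0.2)}                   # placeholder known rupture position
batch = sample_model_experience(visited, failures, actions, n=10)
```

Synthetic tuples like these can be mixed into the training batches alongside real experience, multiplying the value of every costly real-world episode.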
Second, we introduce a rupture-avoidance mechanism by setting a negative temperature T_train < 0 in Eq. 1 during training. Using a negative temperature gives lower Q-values at time step t + 1 in Eq. 2 more importance. Therefore, information about impending failure states is propagated much further towards previous positions, and the agent can use this information to avoid them. Of course, while acting in the environment we still set a positive temperature T_act > 0.
We next benchmark the performance of the RL with our two modifications in the simulation. Fig. 2 shows that especially in combination, the two modifications speed up the learning process dramatically (Fig. 2d), to the extent that it now becomes possible to connect this modified RL agent to the real-world experiment.

SPM setup
While the RL agent controls our ultra-high-vacuum low-temperature non-contact AFM/STM fitted with a qPlus tuning-fork sensor [29], the tunneling current through the junction (V_bias = 10 mV) is continuously monitored by our software (see A.1). When the bond to the tip ruptures, the increased tunnelling barrier leads to a sudden drop in current and the failure reward is assigned. Since the length of the molecule is known (11.5 Å), the target state can also be automatically detected as any state with z > 14 Å and the contact to the molecule still in place. Thus, the agent works fully autonomously. The final success of the manipulation is verified by the experimenter, who deposits the molecule from the tip elsewhere onto the Ag(111) surface [6] and images the vacancy that is created in the PTCDA layer (Fig. 1b).

Figure 3: Performance of the RL agent in experiment. a, Swarm plot of the number of episodes n required to accomplish the removal task. Groups of at least 3 data points acquired with the same tip are identically coloured (except black). If a tip capable of removal (proven by a successful experiment) failed in another experiment, the respective data point is labeled as "agent fail". Points labeled as "tip fail" denote tips with which the removal task has never been accomplished, notwithstanding that this could, in principle, also be a failure of the agent. Open green data points have been obtained for Tip B with a P-agent initialised from a successful agent for Tip B. The small n values show that agents strongly benefit from pre-training if the tip remains unchanged. b, Density of (x, y)-positions where all (ultimately successful) tip trajectories pass through the z-region of highest bond-rupture probability (z = 2 Å, Fig. 1c). The positions for Tip D (strong tip) and Tip E (weak tip) are indicated by dots. c, (x, y)-projections of all bond ruptures occurring within the first ten episodes for R- and P-agents. Cross sizes indicate rupture heights z.
The quoted numbers give the percentage of rupture points located in each of the four quadrants of the coordinate system. The green curve shows the last trajectory chosen by the P-agent during its pre-training. Its direction indicates that the P-agents have a clear preference for exploring the promising lower left quadrant (panel b), which explains their performance edge (panel a).

Analysis of the learning process
We measure the performance of each RL agent by the number n of episodes which it requires to solve the removal task. To separate intrinsic RL stochasticity from the uncontrollable variability of the SPM manipulation, we plot the data points n of real-world experiments that were conducted with identical tips in the same colour in Fig. 3a. The scatter is indeed smaller within groups of experiments with identical tips. Moreover, the difficulty of the removal task clearly depends on the tip. For example, with tip D removal is easy, resulting on average in small n, while tip E, with which removal appears more difficult, results in larger n.
A small force threshold of the tip-molecule bond reduces the fraction of successful trajectories. For particularly weak tips (labeled tip failure in Fig. 3a), the removal task cannot be solved at all. One would expect that for larger n, i.e. weaker tips, successful trajectories cluster more narrowly in space. Fig. 3b, in which we compare the (x, y) coordinates of successful trajectories at z = 2Å for tip D and tip E, clearly reveals this effect. The distribution of the corresponding (x, y) coordinates for all tips (plotted as a gray-scale background) is rather broad and indeed similar to the distribution for the strong tip D. For the weak tip E, however, the agent has to traverse a very specific region in the xy plane to avoid bond-breaking. This naturally explains why tip E tends to require larger numbers n of episodes until success.
In Fig. 3a we also compare randomly initialized (R) agents with pre-trained (P) ones. R-agents start with random weights in the neural network, while P-agents have already solved the removal task once with one particular tip. Initially, all P-agents are identical, i.e. they have the same experience and NN weights. On average, P-agents perform better than R-agents. This is evident both in the complete dataset and for individual tips, see for instance tip E. It clearly demonstrates that at least some knowledge about the removal task is universal and can be transferred to new tip configurations. Fig. 3c shows the (x, y)-projections of the rupture points (i.e., termini of unsuccessful episodes) for R-agents (black) and P-agents (red). The data is limited to the first ten episodes of each experiment, for which the difference in training and experience between both types of agents is most significant. The plot shows that the randomly initialized agents explore all (x, y) directions rather uniformly, while the pre-trained agents have a strong bias towards the lower left quadrant (x < 0, y < 0) through which almost all successful trajectories pass (Fig. 3b). This bias is the essence of the universally valid policy which, once learned, gives P-agents a performance edge over R-agents.

Conclusion
Automatically fabricating complex metastable structures at the molecular level is a highly desirable goal. Given the limited observability and uncontrolled variability involved, this goal seemed out of reach until now. In this proof-of-concept study, we demonstrated that autonomous robotic nanofabrication indeed becomes possible using RL, without the necessity of human intervention. We chose the real-world task of lifting a molecular structure off a material surface as a textbook example. Because we used the RL framework, it was not necessary to specify how to solve the task; instead, only the goal had to be set, which is clearly easier. We showed that an RL agent in full control of an SPM setup is not only capable of reaching a real manipulation goal within a moderate number of trials, but is moreover also robust enough to transfer a previously learned policy to new object systems.
The limited observability is perhaps the most severe limitation of RL at the nanoscale. Although RL can indeed work under partial observability and in stochastic environments, such conditions increase the number of trials needed to solve a task. To alleviate this problem, one could try a hybrid approach in which insight from atomistic simulations guides the RL agent in its exploration of possible solutions. While atomistic conformations may not be practically accessible in detail, related measurements like the tunneling current and the force (gradient) are. Hence, future research could focus on ways of including such helpful variables in an RL setup.
In conclusion, we demonstrated that autonomous robotic nanofabrication is viable. It enables immediate progress towards the freedom of designing quantum matter, beyond the constraints of even the most complex quantum materials.

Additional information
Correspondence and requests for materials should be addressed to C.W. or K.-R.M.

A.1 Experiments
PTCDA was deposited onto Ag(111) at room temperature and briefly annealed at 200 °C. The PtIr tip of the qPlus sensor was cut by ion-beam etching and prepared via indentation into the uncovered Ag(111) surface. Since each indentation typically changes the tip apex structure, it affects the strength of the molecule-tip bond and thus the difficulty of the removal task. This allowed us to test the RL agent at various levels of difficulty. The RL agent controls the tip via a voltage source whose output is added to the piezo voltages of the SPM setup.
In principle, the primary criterion to quantify the performance of an agent should be the time the agent requires to accomplish a manipulation task. In order to assess agents for simulated and real systems on equal footing, we use the number of episodes n required for this task. This quantity n is not fully, but closely related to the time needed (episodes may take longer or shorter depending on the length of the trajectory). Using the wall-clock time as a criterion is moreover rather meaningless, since we have intentionally slowed down the removal process in the experiment to a point where we could carefully observe the actions of the agent in order to, for example, spot changes in tip apex structure immediately. A removal process took typically 5 to 10 minutes in the experiment. The tip apex changed in 20% of the removal experiments (not to be confused with episodes) which were excluded from the statistics. During re-deposition of the removed molecule onto the surface, the apex changed with a probability of 15%.

A.2 Reinforcement Learning
The pure state transitions P(S_{t+1} = s_{t+1} | S_t = s_t, A_t = a_t) in our setup are deterministic, because the SPM tip can be moved deterministically to a new position. We use this fact to introduce model-based RL (see A.7).
Note that this determinism is a result of our choice of restricting the state description to the coordinates of the tip. If we included the tunneling current or the force gradient measurements, model-based RL would not be possible anymore, unless one would learn to model these variables as well, which proved too unstable in our pilot experiments.
In order to prevent overly optimistic estimates of the Q-values [30], we use the Double-Q-learning approach [31], which also works with function approximation [32]. Note that instead of Q-learning, we use Expected Sarsa [26].
While Expected Sarsa is an on-policy algorithm, we make it off-policy by using different temperatures T_train, T_act for the Boltzmann distribution (Eq. 1) during training compared to when using the network to act in the environment (see A.6). In Double-Q-learning, two networks are used in parallel: they start out with equal weights, but in the subsequent training steps, only one network is updated (the "live network") while the other is held fixed for a while (the "target network"). When computing Q-value estimates for time step t + 1 in the loss function (Eq. 2), the target network is used to obtain the actual Q-values, while the live network is used for the probabilities of the Boltzmann distribution in the training policy (Eq. 1). Every 200 training steps, the weights of the target network are set to the weights of the live network, such that both networks once again have equal weights.
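The live/target bookkeeping can be sketched as follows. The "networks" here are stand-in weight lists and the update is a placeholder, not an actual gradient step; only the synchronization schedule (copy every 200 training steps, as quoted above) is taken from the text.

```python
def train_step(live, target, step, sync_every=200):
    """Double-Q bookkeeping: the live network is updated every training step,
    and every `sync_every` steps the target network's weights are overwritten
    with the live network's weights. Networks are stand-in weight lists."""
    live = [w + 0.01 for w in live]      # placeholder for one gradient-descent update
    if (step + 1) % sync_every == 0:
        target = list(live)              # sync: target <- live
    return live, target

live, target = [0.0], [0.0]
for step in range(400):                  # two full sync periods
    live, target = train_step(live, target, step)
```

Between syncs the target network lags behind the live one, which is what damps the overly optimistic Q-value estimates that plain Q-learning suffers from [30, 31].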

A.3 Rewards
There is a trade-off between sparse and shaped rewards. The former are only given once the agent either accomplishes its final goal or ultimately fails, while the latter also reward (or punish) intermediate steps, thus directing the agent more efficiently towards its goal. If not chosen well, shaped rewards can induce unwanted agent behaviour, while sparse rewards induce no such bias. We use sparse rewards because of the poor observability of the full state (atomic coordinates) of the object system, and the resulting lack of information for assessing intermediate steps. Therefore, we give a reward of +1.0 for success (fully lifting the PTCDA molecule off the surface) and −1.0 for rupture of the bond between the tip and the PTCDA molecule. However, we do slightly shape the reward by giving a reward of +0.01 for each non-terminal step of the agent. Physically, this is motivated by the fact that each step separates the molecule 0.1 Å further from the surface. On the RL side, this small reward makes the agent prefer exploring trajectories on which it previously advanced very far before rupture. This happens because the (discounted) propagation of rewards to previous states during training (Eq. 2) makes states along longer trajectories have higher value and therefore appear more "attractive" to the agent.

A.4 Neural network architecture
The neural network used to approximate the Q-function consists of a three-neuron input layer receiving the tip coordinates (x, y, z), a first hidden layer of 30 neurons, and a second hidden layer of 2 × 15 neurons split into two streams that output the advantages A_i and the value V of the dueling architecture [20], which are combined into the Q-values Q_i = V + A_i (see Fig. 1d).

A.5 Training details
The neural network was trained after each episode and held fixed during episodes. The optimization method was "Adam" [33] with a constant learning rate of 10^-3 and a batch size of 30. In order to avoid performing many training steps while little experience is present, the number of training steps was increased over the first ten episodes, from 200 after the first episode to finally 2000 training steps after ten or more episodes. The discount for future rewards was γ = 0.97. The training temperature (see section A.6) was T_train = −0.1, and the action temperature used to select actions during an episode was T_act = 0.004.
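The training-step ramp can be written as a small schedule function. The text only quotes the endpoints (200 steps after episode 1, 2000 from episode ten onward); the linear interpolation in between is our assumption.

```python
def training_steps(episode):
    """Training steps performed after a given episode (1-indexed).
    Endpoints from the text: 200 after episode 1, 2000 from episode 10 onward.
    The linear ramp in between is an assumed interpolation, not from the paper."""
    if episode >= 10:
        return 2000
    return 200 * episode

schedule = [training_steps(e) for e in range(1, 13)]
```

Ramping up like this avoids overfitting the network to the handful of tuples available after the earliest episodes.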
During training, 10% of the samples are chosen from actual experience that was obtained during any previous episode. These samples are drawn with prioritized experience replay [27]. 90% of the samples are obtained from the environment model (see A.7).

A.6 Rupture avoidance mechanism
Prior human trials of removing the PTCDA molecule from the surface show that if a bond rupture (failure) occurs at a given location, ruptures are likely to occur in the area around it too. Also, as stated in the main text, a viable trajectory needs to be found as fast as possible, because the SPM tip might change at any time. So the agent needs to explore the state space quickly for a promising direction. Therefore, we implement a mechanism to make the agent rapidly avoid states where the tip-molecule bond previously ruptured, and particularly also the states leading up to it.
Usually, RL algorithms like the one trained with Eq. 2 tend to "ignore" future negative rewards if there is a strategy that narrowly avoids them: to compute the expected future reward at time t + 1, they weigh the highest Q-values more than the lower ones. In the limit T → 0, all Q-values but the highest one are ignored. Because of this, information about failure states barely propagates to states that lie two or more steps away. In our case, this means that with a randomly initialized neural network, an agent would try to lift the molecule on similar paths each episode until it is absolutely certain that this path is not viable. This leads it to fail at nearly identical locations each time, and therefore it does not explore efficiently. For general RL settings, this may be the desired behavior, but we need the agent to learn as quickly as possible to avoid the failure state and the states leading up to it. Our solution is to use a negative temperature T_train during training, which has the effect of inverting the importance of the Q-values as computed by Eq. 1. Now, low Q-values, which indicate danger of rupture, are given high importance. This information about a rupture is therefore propagated much further.
There is a trade-off here, because the further the agent tries to stay away from known failure states, the more likely it becomes that it misses a viable trajectory. We can control this trade-off by changing the training temperature. In the simulation environment, we found the optimal training temperature to be T_train = −0.1.
We also tested whether using RL is necessary at all if, in the end, all the agent does is avoid the regions of rupture states. To this end, we conducted an experiment in which the agent tries a random trajectory in the first episode and, in all following episodes, chooses its actions such that it stays as far away as possible from all previously occurred ruptures. Despite significant experimentation with this approach, the agent never reached the goal state but got stuck in dead ends. RL, on the other hand, can identify such dead ends and plan trajectories to avoid them.

A.7 Exploiting the Cartesian state description for model-based planning
We use a slight variation of Dyna [28] for model-based planning. Dyna in general updates Q-values with state transitions sampled from a learned environment model. To learn an environment model, we make use of the fact that the result of performing an action from a known state deterministically results in a new state, since the state description includes only the Cartesian coordinates of the tip, and each action moves the tip by a specified amount. To obtain a state from our environment model, we pick a random state from the unique set of actually visited states, and sample a position around it, which becomes s t . Then, we sample an action a t and the resulting