Research Article | Applied Sciences and Engineering

Autonomous robotic nanofabrication with reinforcement learning


Science Advances, 02 Sep 2020:
Vol. 6, no. 36, eabb6987
DOI: 10.1126/sciadv.abb6987
  • Fig. 1 Subtractive manufacturing with an RL agent.

    (A) PTCDA molecules can spontaneously bind to the SPM tip and be removed from a monolayer upon tip retraction on a suitable trajectory. Bond formation and breaking cause strong increases or decreases in the tunneling current (left inset). The removal task is challenging because PTCDA is retained in the layer by a network of hydrogen bonds (dotted lines in right inset). The RL agent can repeatedly choose from the five indicated actions a1 to a5 (green arrows) to find a suitable trajectory (action set A: ∆z = 0.1 Å step plus a ±0.3 Å step in the x or y direction, or no lateral movement). (B) STM image of a PTCDA layer with 16 vacancies created by the RL agent (scale bar, 5 nm). (C) Probability of bond rupture in intervals of 0.5 Å around tip height z as a function of z, based on all bond-breaking events accumulated during the RL agent experiments (inset). (D) The Q function is approximated by a neural network with 30 neurons in the first and 2 × 15 neurons in the second hidden layer. This dueling network architecture (39) features separate outputs Ai and V, with Qi = V + Ai for actions ai, i = 1 to 5. The action actually performed is then chosen at random from A with probabilities computed with the policy π.
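To make the architecture described in Fig. 1D concrete, the following is a minimal sketch of a dueling Q-network with a softmax action policy: a shared 30-neuron hidden layer, two separate 15-neuron streams producing the state value V and the advantages Ai, the combination Qi = V + Ai, and an action sampled from the resulting policy π over the five tip moves of Fig. 1A. Only the layer sizes, the five actions, and the combination rule are taken from the caption; the state representation, activations, weight initialization, and softmax temperature are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Action set A from Fig. 1A: each step raises the tip by dz = 0.1 Å,
# combined with a lateral step of ±0.3 Å in x or y, or no lateral movement.
ACTIONS = np.array([
    [ 0.3,  0.0, 0.1],   # +x
    [-0.3,  0.0, 0.1],   # -x
    [ 0.0,  0.3, 0.1],   # +y
    [ 0.0, -0.3, 0.1],   # -y
    [ 0.0,  0.0, 0.1],   # no lateral movement
])

def init_params(state_dim=3, h1=30, h2=15, n_actions=5):
    """Random weights for a dueling network: one shared 30-neuron hidden layer,
    then two separate 15-neuron streams for V and the advantages A_i."""
    def layer(n_in, n_out):
        return rng.normal(0, 1 / np.sqrt(n_in), (n_in, n_out)), np.zeros(n_out)
    return {
        "shared": layer(state_dim, h1),
        "v_hidden": layer(h1, h2), "v_out": layer(h2, 1),
        "a_hidden": layer(h1, h2), "a_out": layer(h2, n_actions),
    }

def q_values(params, state):
    """Forward pass computing Q_i = V + A_i, as in Fig. 1D."""
    relu = lambda x: np.maximum(x, 0.0)
    W, b = params["shared"];   h  = relu(state @ W + b)
    W, b = params["v_hidden"]; hv = relu(h @ W + b)
    W, b = params["v_out"];    V  = (hv @ W + b)[0]
    W, b = params["a_hidden"]; ha = relu(h @ W + b)
    W, b = params["a_out"];    A  = ha @ W + b
    return V + A

def select_action(params, state, temperature=1.0):
    """Sample an action from a softmax policy pi over the Q values
    (the temperature value is an illustrative choice, not from the paper)."""
    q = q_values(params, state)
    p = np.exp((q - q.max()) / temperature)
    p /= p.sum()
    return rng.choice(len(ACTIONS), p=p), p

# Example: one decision at a hypothetical tip position (x, y, z) = (0, 0, 1.0) Å.
params = init_params()
action_idx, policy = select_action(params, np.array([0.0, 0.0, 1.0]))
tip_step = ACTIONS[action_idx]
```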

  • Fig. 2 Training and performance of RL agents.

    (A) Map [two-dimensional (2D) slice through the 3D system] of the synthetic bond-rupture criteria used to study the RL agent’s behavior under controlled conditions. The criteria are based on a successful experimental trajectory, around which a corridor of variable diameter has been created (light red), beyond which the bond ruptures (blue). The corridor diameter is chosen to approximately reproduce the experimental bond-rupture probabilities (Fig. 1C). One successful trajectory [see (C)] is indicated in green. (B) Probability of agent failure in z intervals of 0.5 Å in the simulation of (A). (C) Learning progress of one RL agent. The six plots show 2D cuts (y = 0) through the color-encoded value function V after the number of episodes indicated in the upper right corner. A 2D projection of the agent’s trajectory in each episode is shown as a black line. Crosses indicate bond-breaking events triggered according to the criteria in (A) (see Supplementary Animation for a 3D view). (D) Swarm plot comparing the performance of different types of RL agents acting in the simulation of (A). Plotted is the number of episodes n required to accomplish the removal task, for four sets of 80 simulated experiments each. An experiment was considered a failure after 150 unsuccessful episodes. The respective probabilities of agent failure are indicated in the upper part of the graph.
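As a reading aid for Fig. 2A, here is one way such a synthetic bond-rupture criterion could be expressed: a rupture is triggered whenever the tip leaves a corridor of given radius around a reference (successful) trajectory. Only the idea of a corridor around a successful trajectory, tuned to mimic the experimental rupture probabilities, comes from the caption; the corridor parameterization, the linear interpolation of the reference trajectory, and all numbers below are assumptions made purely for illustration.

```python
import numpy as np

def make_rupture_criterion(reference_traj, corridor_radius):
    """Return a function that decides whether a bond rupture is triggered.

    reference_traj : (N, 3) array of (x, y, z) points of one successful
        experimental trajectory (Fig. 2A).
    corridor_radius : callable z -> radius in Å, the (possibly z-dependent)
        half-width of the corridor around the reference trajectory.
    """
    zs = reference_traj[:, 2]

    def ruptures(tip_xyz):
        x, y, z = tip_xyz
        # Reference lateral position at this height, by linear interpolation
        # along the recorded trajectory (an illustrative choice).
        x_ref = np.interp(z, zs, reference_traj[:, 0])
        y_ref = np.interp(z, zs, reference_traj[:, 1])
        lateral_dist = np.hypot(x - x_ref, y - y_ref)
        return lateral_dist > corridor_radius(z)

    return ruptures

# Example: a straight vertical reference trajectory from z = 0 to 6 Å and a
# corridor that is narrowest near z = 2 Å, where the experimental bond-rupture
# probability peaks (Fig. 1C). The numbers are made up.
ref = np.column_stack([np.zeros(61), np.zeros(61), np.linspace(0.0, 6.0, 61)])
radius = lambda z: 0.4 + 0.6 * abs(z - 2.0)
rupture = make_rupture_criterion(ref, radius)

print(rupture(np.array([0.3, 0.0, 2.0])))   # inside the corridor -> False
print(rupture(np.array([1.5, 0.0, 2.0])))   # outside the corridor -> True
```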

  • Fig. 3 Performance of the RL agent in experiment.

    (A) Swarm plot of the number of episodes n required to accomplish the removal task. Groups of at least three data points acquired with the same tip are identically colored (except black). If a tip capable of removal (proven by a successful experiment) failed in another experiment, the respective data point is labeled “agent fail.” Points labeled “tip fail” denote tips with which the removal task was never accomplished, although this could, in principle, also be a failure of the agent. (B) Density of (x, y) positions at which all (ultimately successful) tip trajectories pass through the z region of highest bond-rupture probability (z = 2 Å; Fig. 1C). The positions for Tip D (strong tip) and Tip E (weak tip) are indicated by dots. (C) (x, y) projections of all bond ruptures occurring within the first 10 episodes for R- and P-agents. Cross sizes indicate rupture heights z. The quoted numbers give the percentage of rupture points located in each of the four quadrants of the coordinate system. The green curve shows the last trajectory chosen by the P-agent during its pretraining. Its direction indicates why the P-agents show a clear preference for exploring the promising lower left quadrant [see (B)], which explains their performance edge [see (A)].
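The quadrant percentages quoted in Fig. 3C amount to simple bookkeeping: count the bond-rupture positions falling in each (x, y) quadrant and normalize. The snippet below shows that calculation; how points lying exactly on an axis are assigned and the example coordinates are arbitrary choices, not taken from the paper.

```python
import numpy as np

def quadrant_percentages(rupture_xy):
    """Percentage of bond-rupture points in each (x, y) quadrant, as quoted
    in Fig. 3C. rupture_xy is an (N, 2) array of lateral rupture positions."""
    x, y = rupture_xy[:, 0], rupture_xy[:, 1]
    quadrants = {
        "upper right": (x >= 0) & (y >= 0),
        "upper left":  (x < 0) & (y >= 0),
        "lower left":  (x < 0) & (y < 0),
        "lower right": (x >= 0) & (y < 0),
    }
    n = len(rupture_xy)
    return {name: 100.0 * mask.sum() / n for name, mask in quadrants.items()}

# Example with made-up rupture positions (Å):
pts = np.array([[-0.2, -0.1], [-0.4, -0.3], [0.1, -0.2], [-0.1, 0.2]])
print(quadrant_percentages(pts))
```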

Supplementary Materials


    Autonomous robotic nanofabrication with reinforcement learning

    Philipp Leinen, Malte Esders, Kristof T. Schütt, Christian Wagner, Klaus-Robert Müller, F. Stefan Tautz
