Research Article, Chemistry

Extraction of organic chemistry grammar from unsupervised learning of chemical reactions


Science Advances  07 Apr 2021:
Vol. 7, no. 15, eabe4166
DOI: 10.1126/sciadv.abe4166


Humans use different domain languages to represent, explore, and communicate scientific concepts. During the last few hundred years, chemists compiled the language of chemical synthesis inferring a series of “reaction rules” from knowing how atoms rearrange during a chemical transformation, a process called atom-mapping. Atom-mapping is a laborious experimental task and, when tackled with computational methods, requires continuous annotation of chemical reactions and the extension of logically consistent directives. Here, we demonstrate that Transformer Neural Networks learn atom-mapping information between products and reactants without supervision or human labeling. Using the Transformer attention weights, we build a chemically agnostic, attention-guided reaction mapper and extract coherent chemical grammar from unannotated sets of reactions. Our method shows remarkable performance in terms of accuracy and speed, even for strongly imbalanced and chemically complex reactions with nontrivial atom-mapping. It provides the missing link between data-driven and rule-based approaches for numerous chemical reaction tasks.


Humans leverage domain-specific languages to communicate and record a variety of concepts. Every language contains structural patterns that can be formalized as a grammar, i.e., a set of rules that describe how words can be combined to form sentences. Through the use of these rules, it is possible to create an infinite number of comprehensible clauses (knowledge) using a set of domain characteristic elements (words) obeying domain-specific rules (grammar and syntax). When applied to scientific and technical domains, a language is often more a method of computation than a method of communication.

Organic chemistry rules, for instance, have been developed over two centuries, in which experimental observations were translated into a specific language where molecular structures are the words and reaction templates the grammar. These grammar rules illustrate the outcome of chemical reactions and are routinely taught using specific diagrammatic representations (Markush representations). More convenient representations like reaction SMILES (1) also exist for information technologies applied to synthesis planning and reaction prediction. In both Markush and SMILES representations, the grammar rules are present as latent knowledge in the historical corpus of raw reaction data.

The digitization of these rules proved to be a successful approach to design modern computer programs (2) aiding chemists in synthetic laboratory tasks. Compiling reaction rules from domain data is tedious, requires decades of labor hours, and is challenging to scale. The availability of an automatic and reliable method for annotating how atoms rearrange in chemical reactions, a process known as atom-mapping, could profoundly change the way organic chemistry is currently digitized. However, atom-mapping is an NP-hard problem that has been tackled with computational methods since the 1970s (3, 4). Most atom-mapping solutions are either structure based (5–10) or optimization based (11–15). The current state of the art is a combination of heuristics, a set of expert-curated rules that precompute candidates for complex reactions, and a graph-theoretical algorithm to generate the final mapping as developed by Jaworski et al. (16). Nonetheless, brittle preprocessing steps, closed-source code, computationally intensive strategies (more than 100 s for some reactions), and the need for expert-curated rules hinder its wider adoption. Most public reaction data come with rule-based Indigo atom-maps (17), which are taken as ground truth for subsequent work (18–23), irrespective of the explicit warnings about atom-map quality issues (24).

Natural language processing (NLP) models (25) are among the few neural network architectures showing a substantial impact on synthetic chemistry (26) and not relying on atom-mapping algorithms. Their ability to encode latent knowledge from a training set of molecules and reactions represented as text [SMILES (1)] avoids the need to codify the chemical reaction grammar. Molecular Transformer models, a recent addition to the NLP family, are the state of the art for forward reaction prediction tasks, achieving an accuracy higher than 90% (27–30). Understanding the reasons for this performance requires the analysis of the neural network’s hidden weights, which introduces the inherent complexity of interpreting neural networks.

Here, we report the evidence that Transformer encoder models (31, 32) learn atom-mapping as a key signal when trained on unmapped reactions on the self-supervised task of predicting the randomly masked parts in a reaction sequence, a process depicted in Fig. 1A. Transformer architectures can learn the underlying atom-mapping of chemical reactions, without any human labeling or supervision, solely from a large training set of reaction SMILES tokenized by atoms (28, 33). After establishing an attention-guided atom-mapper and introducing a neighbor attention multiplier, we were able to achieve 99.4% correct full atom-mappings on a test set of 49k strongly unbalanced patent reactions (34) with high-quality atom-maps (35).

Fig. 1 Overview.

(A) Process that led to the discovery of the atom-mapping signal and ultimately to the development of RXNMapper. (B) Directly affected chemical reaction prediction tasks. (C) Importance of atom-mapping in affected downstream applications.

The advantage of this approach is its unsupervised nature. In contrast to supervised approaches, here, the atom-mapping signal is learned during training as a consistent pattern hidden in the reaction datasets, without ever seeing any example of atom-mapped reactions. As a consequence, the quality of this approach is not limited by the quality of labeled data generated by an existing annotation tool. Moreover, the unsupervised nature allows scaling the extraction of chemical reaction grammar without the need of increasing human resources.

Numerous deep learning methods developed for organic chemistry, like forward and backward reaction prediction, will benefit from better atom-mapping (Fig. 1B). Examples range from template-based approaches that use atom-mapping to automatically extract the templates from chemical reaction datasets (18, 36–38), to graph-based approaches, predicting bond changes or graph edits, that require atom-mapped reactions to extract the labels used for training the models (19, 21). Even the predictions of atom-mapping–independent and template-free SMILES-2-SMILES approaches (28, 33) may benefit from better atom-mapping, thus becoming more transparent and interpretable. In SMILES-2-SMILES approaches, the models generate the product structures sequentially, atom by atom, given the precursors or, vice versa, generate the precursors given the product, without any support from atom-mapping information. After adding the atom-mapping in a postprocessing step, predictions can be linked back to training reactions with the same reaction template. The atom-maps also enable the use of quantum mechanical simulations to compute reaction energies and the mechanism without human intervention by providing the corresponding atom pairs between precursors and products.

Moreover, our contributions will lead to improvements in the downstream applications that depend on better atom-mapping and chemical reaction rules (Fig. 1C): retrosynthesis planning methods (36, 38, 39), chemical reactivity predictions using graph neural network algorithms (21), reactant-reagent role assignments (34), interpretation of predictions (28), and knowledge extraction from reaction databases (40).

The attention-guided reaction mapper (henceforth referred to as RXNMapper) can handle stereochemistry and unbalanced reactions and is, in terms of speed and accuracy, the state-of-the-art open-source tool for atom-mapping, providing an effective alternative to the time-intensive human extraction of chemical reaction rules. We release RXNMapper together with the atom-mapped public reaction dataset of Lowe (24) and a set of retrosynthetic rules (18, 36–38) extracted from it. The observed atom-mapping performance indicates that a consistent set of atom-mapping grammar rules exists as latent information in large datasets of chemical reactions, providing the link between data-driven/template-free and rule-based systems.


Attention-guided chemical reaction mapping

Self-attention is the major component of algorithms called Transformers that are setting records on NLP benchmarks, e.g., BERT (31) and ALBERT (32), and even creating breakthroughs in the chemical domain (28, 33, 41). Transformers use several self-attention modules, called heads, across multiple layers to learn how to represent each token in an input (e.g., each atom and bond in a reaction SMILES) given the tokens around it. Each head learns to attend to the inputs independently. When applied to chemical reactions, Transformers use attention to focus on atoms relevant to understand important molecular structures, describe the chemical transformation, and detect useful latent information. Fortunately, the internal attention mechanisms are intuitive to visualize and interpret using interactive tools (42–44). Through visual analysis, we observed that some Transformer heads learn distinct chemical features. Specific heads learned how to connect product atoms to reactant atoms, the process defined above as atom-mapping. We call these Transformer heads atom-mapping heads.

Throughout this work, our Transformer architecture of choice is ALBERT (32). ALBERT’s primary advantage over its predecessor BERT (31) is that it shares network weights across layers during training. This both makes the model smaller and keeps the functionality learned by a head the same across layers and consistent across inputs. Learned functions such as forward and backward scanning of the sequence, focusing on nonatomic tokens (ring openings/closures), and atom-mapping all perform similarly, irrespective of the input.

From raw attention to atom-mapping

To quantify our observations, we developed an attention-guided algorithm that converts the bidirectional attention signal of an atom-mapping head into a products-to-reactants atom-mapping. This specific mapping order ensures that each atom in the products corresponds to an atom in the reactants, which is important given that the most sizable open-source reaction datasets (24, 45) report only major products and show reactions that have fewer product atoms than reactant atoms.

The product atoms are mapped to reactant atoms one at a time, starting with product atoms that have the largest attention to an identical atom in the reactants. At each step, we introduce a neighbor attention multiplier that increases the attention connection from adjacent atoms of the newly mapped product atom to adjacent atoms of the newly mapped reactant atom, boosting the likelihood of an atom having the same adjacent atoms in reactants and products. This process continues until all product atoms are mapped to corresponding reactant atoms. The constraint of mapping only to equivalent atoms led to negligible improvements in terms of atom-mapping correctness, indicating that the model had already learned this rule in its atom-mapping function.
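The greedy procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released RXNMapper code: the function name `attention_guided_map`, the list-of-lists attention format, and the neighbor dictionaries are hypothetical, and the attention matrix is assumed to be already restricted to product-to-reactant atom pairs with different atom types masked to zero.

```python
def attention_guided_map(attention, product_neighbors, reactant_neighbors,
                         multiplier=90.0):
    """Greedily map every product atom to one reactant atom.

    attention[i][j]: weight between product atom i and reactant atom j
        (assumed pre-masked so that mismatched atom types are 0).
    product_neighbors[i] / reactant_neighbors[j]: adjacent atom indices.
    Assumes at least one reactant atom is present.
    """
    # Work on a copy so the caller's matrix is left untouched.
    scores = [list(row) for row in attention]
    n_products = len(scores)
    mapping = {}

    while len(mapping) < n_products:
        # Pick the unmapped product atom with the largest remaining attention.
        best = None
        for i in range(n_products):
            if i in mapping:
                continue
            for j, value in enumerate(scores[i]):
                if best is None or value > best[0]:
                    best = (value, i, j)
        _, i, j = best
        mapping[i] = j

        # Default behaviour in this sketch: a mapped reactant atom is not reused.
        for row in scores:
            row[j] = 0.0

        # Neighbor attention multiplier: boost the links between the
        # neighbors of the newly mapped product/reactant pair.
        for pi in product_neighbors[i]:
            if pi in mapping:
                continue
            for rj in reactant_neighbors[j]:
                scores[pi][rj] *= multiplier
    return mapping


# Toy example (made-up numbers): a three-atom product chain and a reactant
# side with a matching chain (atoms 0-1-2) plus an unconnected atom 3.
toy_attention = [
    [0.90, 0.05, 0.03, 0.02],
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.05, 0.40, 0.50],
]
toy_product_neighbors = {0: [1], 1: [0, 2], 2: [1]}
toy_reactant_neighbors = {0: [1], 1: [0, 2], 2: [1], 3: []}

with_boost = attention_guided_map(
    toy_attention, toy_product_neighbors, toy_reactant_neighbors, multiplier=90.0)
without_boost = attention_guided_map(
    toy_attention, toy_product_neighbors, toy_reactant_neighbors, multiplier=1.0)
```

In the toy example, raw attention alone would send the last product atom to the unconnected reactant atom, whereas the multiplier pulls it toward the neighbor of an already mapped atom, which is the effect the multiplier is meant to have.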

We selected the best performing model/layer/head combination after evaluation on a curated set of 1k patent reactions by Schneider et al. (34) originally mapped with the rule-based NameRXN tool (35). We used the remaining 49k reactions as a test set. We consider the atom-maps in NameRXN (35) to be of high quality because they are a side product of successfully matched, human-designed reaction rules. We used our best ALBERT model configuration (12 layers and 8 heads; layer 11, head 6, and multiplier 90) for RXNMapper.

Atom-mapping evaluation

The predominant use case for atom-mapping algorithms is to map heavily imbalanced reactions, such as those in patent reaction datasets (24, 45) or those predicted by data-driven reaction prediction models (28). After training RXNMapper on unmapped reactions (24), we investigated the chemical knowledge our model had extracted by comparing our predicted atom-maps to a set of 49k test reactions (34). The majority (96.8%) of the atom-mappings matched the reference, including methylene transfers, epoxidations, and Diels-Alder reactions (Fig. 2). We manually annotated the remaining discrepancies to find edge cases where RXNMapper seemingly failed. A more careful analysis showed that of the 1551 nonmatching reactions, only 284 predictions were incorrect. In 415 reactions, RXNMapper gave atom-maps equivalent to the original (e.g., tautomers), and in 436, the atom-maps were better than the reference. In 369 cases, the original reaction was questionable and likely wrongly extracted from patents. For 47 reactions, the key reagents to determine the reaction mechanisms were missing. After removing questionable reactions from the statistics and counting the equivalent mappings as correct, the overall correctness increased to 99.4%.
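The percentages quoted above can be roughly reproduced from the reported counts. The short check below is only a sketch: the text gives the test set size as "49k", so the round figure of 49,000 is an assumption, as is the choice to treat only the 284 incorrect predictions as errors once the 369 questionable reactions are removed.

```python
# Counts reported in the text for the nonmatching reactions.
incorrect = 284
equivalent = 415
better_than_reference = 436
questionable = 369
missing_reagents = 47
nonmatching = (incorrect + equivalent + better_than_reference
               + questionable + missing_reagents)

# ASSUMPTION: the text only says "49k"; 49,000 is a round stand-in.
total = 49_000

# Share of predictions that matched the reference outright.
matching_rate = 100 * (total - nonmatching) / total

# Share counted as correct after dropping the questionable reactions and
# treating only the 284 incorrect predictions as errors (an assumption).
corrected_rate = 100 * (1 - incorrect / (total - questionable))

print(nonmatching, round(matching_rate, 1), round(corrected_rate, 1))
# prints: 1551 96.8 99.4
```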

Fig. 2 Reaction map and examples.

(A) Visualizing the results on the whole 49k Schneider test set with a focus on the mismatched atom-mappings (together with 1.5k matches for context) using reaction tree maps (TMAPs) (41, 58). (B) Examples of atom-mappings generated by RXNMapper. Reactants and reagents were not separated in the inputs.

Among the most frequent failures of RXNMapper, we find examples of wrong atom ordering in rings and azide compounds (Fig. 2B, d). In others, the model assigns wrong mappings to a single oxygen atom, like in reductions (Fig. 2B, e) or in Mitsunobu reactions (Fig. 2B, f), where the phenolic oxygen should become part of the product, but the model maps the primary or secondary alcohol instead.

We also observed counterexamples of Mitsunobu reactions (Fig. 2B, c) for which our model correctly mapped the reacting oxygen, while the rule-based reference contained the wrong mapping as a result of the reaction not matching the Mitsunobu reaction rule. Although the overall quality of the reference atom-maps in the 49k test set (46) is high, we were able to identify a few important advantages of using RXNMapper instead of the rule-based mapped dataset. RXNMapper correctly assigns the oxygen of the primary alcohols to be part of the major product for esterification reactions (Fig. 2B, a) like Fischer-Speier and Steglich esterifications, as opposed to the annotated ground truth. It also correctly recognizes anhydrides (Fig. 2B, b) and peroxides as reactants in acylation and oxidation reactions where the ground truth favored formic acid and water.

RXNMapper not only excels on patent reactions but also performs remarkably well on reactions involving rearrangements of the carbon skeleton where humans require an understanding of the reaction mechanism to correctly atom-map. Notable examples include an intramolecular Claisen rearrangement used to construct the fused seven- to eight-membered ring system in the synthesis of the natural product micrandilactone A (Fig. 3A) (47, 48) and the tandem palladium-catalyzed semipinacol rearrangement/direct arylation used for a stereoselective synthesis of benzodiquinanes from cyclobutanols (Fig. 3B) (49). In both cases, RXNMapper completes the correct atom-mapping despite the entirely rearranged carbon skeletons resulting in different ring sizes and connections. ReactionMap, Marvin, ChemDraw, and Indigo failed at this atom-mapping task. RXNMapper also succeeds in atom-mapping the ring rearrangement metathesis of a norbornene to form a bicyclic enone under catalysis by Grubbs-(I) catalyst (Fig. 3C) (50). In this case, ChemDraw successfully completes the mapping, while the other tools failed. Furthermore, RXNMapper performs well with multicomponent reactions such as the Ugi four-component condensation of isonitriles, aldehydes, amines, and carboxylic acids to form acylated aminoacid amides (Fig. 3D) (51). Here, RXNMapper maps all atoms correctly except for the carbonyl oxygen atom of the isonitrile-derived carboxamide. RXNMapper assigns this oxygen atom to the oxygen atom of the carbonyl group of the aldehyde reagent, although this atom actually comes from the hydroxyl group of the carboxylic acid reagent. All other tools failed this atom-mapping task except for Mappet.

Fig. 3 Atom-mapping on complex reactions.

Examples and results for commercially available tools from the complex reactions dataset by Jaworski et al. (16). (A) Bu3Al-promoted Claisen rearrangement (47, 48). (B) Palladium-catalyzed semipinacol rearrangement and direct arylation (49). (C) Grubbs-catalyzed ring rearrangement metathesis reaction (50). (D) Ugi reaction (51).

Similar to Jaworski et al. (16), we analyzed the atom-mapping in United States Patent and Trademark Office (USPTO) patent reactions according to the number of bond changes (Fig. 4A). RXNMapper performs better than Mappet (16) on all reactions except for those involving only one bond change. With an average time to solution of 7.7 ms per reaction on graphics processing unit (GPU) accelerators and 36.4 ms per reaction on central processing unit (CPU), RXNMapper’s speed is similar to the Indigo toolkit (17) on balanced reactions and far exceeds Indigo on unbalanced ones (Fig. 4B). As a comparison, Mappet (16) takes more than 10 s per reaction for 3.2% of their balanced test set reactions and, for a few of the reactions, even more than 100 s per reaction. In addition, RXNMapper outputs a confidence score for the generated atom-maps. An analysis of the confidence scores and more detailed comparisons are available in the Supplementary Materials.

Fig. 4 Comparison with other tools.

(A) Comparison of RXNMapper, Mappet (16), and the original Indigo mapping from the USPTO dataset (281 reactions). The error bars show the Wilson confidence interval (59). (B) Mapping speed comparison between RXNMapper and Indigo (17), which is orders of magnitude faster than Mappet (16). For Indigo, we set a timeout of 500 ms, after which the tool would return an incomplete mapping. We averaged the timing on the imbalanced reactions for Indigo without timeout on 20k reactions.
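For readers unfamiliar with the Wilson confidence interval used for the error bars, the following is a minimal sketch of the standard Wilson score interval for a binomial proportion. The function name `wilson_interval` and the 95% level (z = 1.96) are assumptions for illustration; the caption does not state which confidence level was used.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to roughly 95% confidence (an assumed default).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total
                         + z * z / (4 * total * total)) / denom
    # Clip to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts, not taken from the paper: 5 correct out of 10.
low, high = wilson_interval(5, 10)
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and remains informative for small groups or proportions near 0 or 1, which matters when only a few reactions fall into a bond-change category.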

The advantages of RXNMapper compared to the open-source Indigo (17) and the closed-source Mappet (16) are summarized in Table 1. RXNMapper is noticeably faster than other tools, handles strongly unbalanced reactions, performs well even on complex reactions, and is open-source. It can also be used for compiling retrosynthetic rules, which are of crucial importance for several reaction and retrosynthesis prediction schemes. For instance, in the Chematica project (2), numerous Ph.D. students and Postdocs across 15 years continuously worked to extract reactions from literature and convert them into retrosynthetic rules. With unsupervised schemes such as RXNMapper, the extraction of retrosynthetic rules can be completed in a matter of weeks, with little human intervention. We demonstrate such an extraction by atom-mapping the entire USPTO datasets and by extracting the retrosynthetic rules using the approach described by Thakkar et al. (38). We make available the corresponding atom-mappings of the USPTO dataset and the 21k most frequently extracted retrosynthetic rules along with the most commonly used reagents, the corresponding patent numbers, and the first year of appearance. The application of unsupervised schemes demonstrates the feasibility of running a completely unassisted construction of retrosynthetic rules in just a few days—three orders of magnitude faster than previous human curation protocols. The use of unsupervised schemes will facilitate the compilation of previously unidentified retrosynthetic rules in existing rule-based systems.

Table 1 Comparison of different atom-mapping tools.

Comparing RXNMapper to Indigo (17) and Mappet (16).



We have shown that the application of unsupervised, attention-based language models to a corpus of organic chemistry reactions provides a way to extract the organic chemistry grammar without human intervention. We unboxed the neural network architecture to extract the rules governing atom rearrangements between products and reactants/reagents. Using this information, we developed an attention-guided reaction mapper that exhibits remarkable performance in both speed and accuracy across many different reaction classes. We showed how to create a state-of-the-art atom-mapping tool within 2 days of training without the need for tedious and potentially biased human encoding or curation. Because the entire approach is completely unsupervised, the use of specific reaction datasets can improve the atom-mapping performance on corner cases. The resulting atom-mapping tool is significantly faster and more effective than existing tools, especially for strongly imbalanced reactions. Last, our work provides evidence that unannotated collections of chemical reactions contain all the relevant information necessary to construct a coherent set of atom-mapping rules. Numerous applications built on atom-mapping will immediately benefit from our findings (21, 36, 38), and others will become more interpretable exploiting the potential of unsupervised atom-mappings (28, 33).

The use of symbolic representations and the means to learn autonomously from rich chemical data led to the design of valuable assistants in chemical synthesis (26). A strengthened trust between human and interpretable data-driven assistants will spark the next revolutions in chemistry, where domain patterns and knowledge can be easily extracted and explained from the inner architectures of trained models.



Transformers are a class of deep neural network architectures that rely on multiple and sequential applications of self-attention layers (27). These layers are composed of one or more heads, each of which learns a square attention matrix $A \in \mathbb{R}^{N \times N}$ of weights that connect each token’s embedding $Y_i$ in an input sequence $Y$ of length $N$ to every other token’s embedding $Y_j$. Thus, each element $A_{ij}$ is the attention weight connecting $Y_i$ to $Y_j$. This formulation makes the attention weights in the Transformer architecture amenable to visualizations as the curves connecting an input sequence to itself, where a thicker, darker line indicates a higher attention value.

The calculation of the attention matrix of each head can be easily interpreted as a probabilistic hashmap or lookup table over all other elements $Y_j$. Each head in a self-attention layer will first convert the vector representation of every token $Y_i$ into a key, query, and value vector using the following operations

$$K_i = W_k Y_i, \qquad Q_i = W_q Y_i, \qquad V_i = W_v Y_i \tag{1}$$

where $W_k \in \mathbb{R}^{d_k \times d_e}$, $W_q \in \mathbb{R}^{d_k \times d_e}$, and $W_v \in \mathbb{R}^{d_v \times d_e}$ are learnable parameters. $A_i$, or the vector of attention out of token $Y_i$, is then a discrete probability distribution over the other input tokens, and it is calculated by taking a dot product over that token’s query vector and every other token’s key vector followed by a softmax to convert the information into probabilities

$$A_i = \mathrm{softmax}\left(\frac{Q_i\,(W_k Y)}{\sqrt{d_k}}\right) \tag{2}$$

Note that one can define input sequence $Y$ as an $N \times d_e$ matrix and matrix $W_k$ as a $d_k \times d_e$ matrix, where $d_e$ is the embedding dimension of each token and $d_k$ is the embedding dimension shared by the query and the key.
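To make the key/query formulation concrete, here is a small pure-Python sketch that computes one attention row from token embeddings. It is an illustration under assumptions: the helper names (`matvec`, `softmax`, `attention_row`) and the toy embeddings are hypothetical, each token embedding is treated as a plain list of length d_e, and the logits are the query-key dot products divided by the square root of d_k.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    s = sum(exps)
    return [e / s for e in exps]


def attention_row(i, Y, W_q, W_k):
    """Attention distribution of token i over all tokens in Y."""
    d_k = len(W_k)                             # rows of W_k = key dimension
    q_i = matvec(W_q, Y[i])                    # Q_i = W_q Y_i
    keys = [matvec(W_k, y_j) for y_j in Y]     # K_j = W_k Y_j
    logits = [sum(q * k for q, k in zip(q_i, k_j)) / math.sqrt(d_k)
              for k_j in keys]
    return softmax(logits)


# Toy input: three tokens with two-dimensional embeddings, identity weights.
Y = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
row0 = attention_row(0, Y, identity, identity)
```

With identity weights, token 0 attends equally to itself and to token 2 (both share its first coordinate) and less to token 1, and the weights sum to one as expected for a probability distribution.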

Each head must learn a unique function to accomplish the masked language modeling task, and some of these functions are inherently interpretable to the domain of the data. For example, in NLP, it has been shown that certain heads learn dependency and part of speech relationships between words (52, 53). Using visual tools can make exploring these learned functions easier (42).

Model details

For our experiments, we used PyTorch (v1.3.1) (54) and huggingface transformers (v2.5.0) (55). The ALBERT model was trained for 48 hours on a single Nvidia P100 GPU with the hyperparameters stated in the Supplementary Materials. Schwaller et al. (28) developed the tokenization regex used to tokenize the SMILES. We expect further performance improvements when using more extensive datasets (e.g., commercially available ones). The RXNMapper model uses 12 layers, 8 heads, a hidden size of 256, an embedding size of 128, and an intermediate size of 512. In contrast to ALBERT base (32) with 12M parameters, our model is small and contains only 770k trainable parameters.
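As an illustration of atom-wise SMILES tokenization, the sketch below uses the regular expression that is widely circulated alongside the Molecular Transformer work. That it is character-for-character the regex referred to above is an assumption, and the names `SMILES_TOKEN_PATTERN` and `tokenize_smiles` are hypothetical.

```python
import re

# Atom-wise SMILES pattern commonly distributed with the Molecular Transformer.
# ASSUMPTION: this is the same tokenization regex the text refers to.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@"
    r"|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles):
    """Split a (reaction) SMILES string into atom-level tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Fail loudly if any character was silently skipped by the pattern.
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize: {smiles!r}")
    return tokens


tokens = tokenize_smiles("CC(=O)O.OCC>>CC(=O)OCC")
print(tokens)
```

Bracket atoms such as `[NH4+]` stay together as one token, two-letter halogens (`Cl`, `Br`) are not split, and the reaction arrow `>>` is kept as a single token, so each atom corresponds to exactly one position in the attention matrix.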


The work by Lowe (24) provides the datasets used for training, composed of chemical reactions extracted from both grants and patent applications. We removed the original atom-mapping from this dataset, canonicalized the reactions with RDKit (56), and removed any duplicate reactions. The dataset includes reactions with fragment information twice, once with and once without fragment bonds, as defined in the work of Schwaller et al. (33). The final training set for the masked language modeling task contained a total of 2.8M reactions. For the evaluation and the model selection, we sampled 996 random reactions from the dataset of Schneider et al. (34).
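The preprocessing steps above (dropping the original atom-map numbers and removing duplicates) can be sketched with the standard library alone. This is not the authors' pipeline: the function names are hypothetical, and the RDKit canonicalization step is represented only by an optional `canonicalize` callable that defaults to the identity, so the sketch runs without RDKit.

```python
import re

# Matches an atom-map suffix such as ":12" directly before a closing bracket.
_ATOM_MAP = re.compile(r":\d+(?=\])")


def strip_atom_maps(rxn_smiles):
    """Remove atom-map numbers from a reaction SMILES string."""
    return _ATOM_MAP.sub("", rxn_smiles)


def deduplicate(reactions, canonicalize=lambda rxn: rxn):
    """Strip atom maps, canonicalize, and keep the first copy of each reaction.

    `canonicalize` is a placeholder hook (identity by default); an
    RDKit-based canonicalizer would be plugged in here.
    """
    seen = set()
    unique = []
    for rxn in reactions:
        cleaned = canonicalize(strip_atom_maps(rxn))
        if cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique


# Made-up example reactions, two of which differ only in their map numbers.
raw = [
    "[CH3:1][OH:2]>>[CH3:1][O-:2]",
    "[CH3:5][OH:7]>>[CH3:5][O-:7]",
    "CCO>>CC=O",
]
cleaned = deduplicate(raw)
```

Without a real canonicalizer, bracket atoms such as `[CH3]` are left as written, so two differently written but chemically identical reactions would not be merged; that is the gap the canonicalization step closes.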

To test our models, we first used the remaining 49k reactions from the Schneider 50k patents dataset (34). We do not distinguish between reactants and reagents in the inputs of our models. We also used the human-curated test sets that were introduced by Jaworski et al. (16) to compare our approach to previous methods. Table 2 shows an overview of the test sets. Note that patent reactions differ from the reactions in Jaworski et al. (16) because the latter removes most reactants and reagents in an attempt to balance the reactions.

Table 2 Test datasets.

Datasets used for the comparison with other tools.


Attention-guided atom-mapping algorithm

The attention-guided algorithm relies on the construction of the attention matrix for a selected layer and head, where we sum the product-to-reactant and the corresponding reactant-to-product atom attentions. Algorithm 1 provides the exact atom-mapping algorithm. By default, after matching a product-reactant pair, the attentions to those atoms are zeroed. Optionally, atoms in product and reactants can have multiple corresponding atoms. We always mask out attention to atoms of different types.
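A minimal sketch of how such a product-by-reactant matrix might be assembled from one head's token-level attention is shown below. The function name `combined_attention`, the index lists, and the toy attention values are hypothetical; the sketch only illustrates summing the two attention directions and masking atoms of different types.

```python
def combined_attention(attn, product_idx, reactant_idx, symbols):
    """Build a product-by-reactant matrix from one head's attention.

    attn: square token-level attention matrix (list of rows).
    product_idx / reactant_idx: token positions of product / reactant atoms.
    symbols: atom symbol for every token position.
    """
    combined = []
    for p in product_idx:
        row = []
        for r in reactant_idx:
            if symbols[p] != symbols[r]:
                # Mask attention between atoms of different types.
                row.append(0.0)
            else:
                # Sum product-to-reactant and reactant-to-product attention.
                row.append(attn[p][r] + attn[r][p])
        combined.append(row)
    return combined


# Toy reaction "CO>>OC" with tokens C, O, >>, O, C (made-up attention values).
symbols = ["C", "O", ">>", "O", "C"]
attn = [
    [0.0, 0.1, 0.0, 0.2, 0.7],
    [0.1, 0.0, 0.0, 0.8, 0.1],
    [0.2, 0.2, 0.2, 0.2, 0.2],
    [0.1, 0.6, 0.1, 0.0, 0.2],
    [0.5, 0.2, 0.1, 0.2, 0.0],
]
matrix = combined_attention(attn, product_idx=[3, 4], reactant_idx=[0, 1],
                            symbols=symbols)
```

In the toy case, the product oxygen can only receive weight from the reactant oxygen and the product carbon only from the reactant carbon, so each row has a single nonzero entry.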

Atom-mapping curation

Chemically equivalent atoms exist in many chemical reactions. Most of the chemically equivalent atoms could be matched after canonicalizing the atom-mapped reaction using RDKit (56, 57). Exceptions were atoms of the same type connected to another atom with different bond types, which would form a resonance structure with delocalized electrons. We manually curated these exceptions and added them as alternative maps in the USPTO bond changes test set (16).



Supplementary material for this article is available at

This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial license, which permits use, distribution, and reproduction in any medium, so long as the resultant use is not for commercial advantage and provided the original work is properly cited.


Acknowledgments: We thank the RXN for Chemistry team and the Reymond group for insightful discussions and comments. H. Strobelt is a visiting research scientist at MIT. Funding: This work was supported by IBM Research. Author contributions: The project was conceived and planned by P.S. and B.H. and supervised by J.-L.R., H.S., and T.L. P.S. implemented and trained the models. B.H. and H.S. developed the visualization tools. P.S. and B.H. built RXNMapper. P.S., T.L., and J.-L.R. analyzed and compared the atom-mapping. All the authors were involved in discussions on the project and wrote the manuscript. Competing interests: The authors declare that they have no competing interests. Data and materials availability: All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. Additional data related to this paper may be requested from the authors. All our generated atom-mappings, including those for the largest open-source patent dataset (24), the unmapped training, validation, and test set reactions, can be found in the following repository The code is available at and a demo at
