Estimating the synthetic accessibility of molecules with building block and reaction-aware SAScore

JCIM. 2024. ISSN: 1758-2946

The one question most chemists ask when viewing molecules generated by computational tools, machine-learning methods or AI models is "but how easy is it to make?" MedChemica users will know that we use Peter Ertl’s SAScore [1] within CoreDesign to help users understand how synthetically accessible the suggested molecules are. This score ranges from 1 to 10, where 1 is easy to synthesise and 10 is very difficult. Chen and Jung have taken this score and enhanced it to consider building blocks and actual synthetic routes, giving BR-SAScore.

In addition to taking building blocks and reaction pathways into account, they wanted the score to be quick, unlike some other synthetic accessibility scoring methods such as RAScore [2], which is more than 300 times slower than SAScore. Chen and Jung use three datasets constructed from a variety of sources, with every molecule labelled either ES (easy-to-synthesize) or HS (hard-to-synthesize). These labels are either taken from the source database or created using Retro* [3] to determine whether a synthesis route can be resolved: if a route is found within 10 steps the molecule is labelled ES, otherwise HS.
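The labelling rule can be sketched in a few lines of Python. This is only an illustration of the 10-step cut-off described above: the function name, the dictionary of planner results and the use of None for "no route found" are assumptions made for the example, not part of Retro* or the paper's code.

```python
from typing import Optional

MAX_STEPS = 10  # step limit quoted in the text above


def label_molecule(route_length: Optional[int], max_steps: int = MAX_STEPS) -> str:
    """Return "ES" if a route was found within max_steps, otherwise "HS".

    route_length is the number of steps in the route reported by a synthesis
    planner, or None when no route was found (a hypothetical convention).
    """
    if route_length is not None and route_length <= max_steps:
        return "ES"
    return "HS"


# Hypothetical planner output: molecule name -> route length (None = unresolved)
planner_results = {"mol_a": 3, "mol_b": 10, "mol_c": 14, "mol_d": None}
labels = {name: label_molecule(steps) for name, steps in planner_results.items()}
```

Here "within 10 steps" is read as ten steps or fewer, so mol_b is labelled ES while mol_c and mol_d are labelled HS.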

SAScore is calculated as fragmentScore – complexityPenalty, whereas BR-SAScore introduces a BR-fragmentScore, giving BR-fragmentScore – complexityPenalty. The new term tries to encapsulate the building block fragments that are used in the synthesis planning program. The BR-fragmentScore is calculated by fragmenting the molecule using building block and reaction knowledge (see Figure 1) and producing a BScore and an RScore from these fragments. Both scores are calculated by extracting extended-connectivity fingerprints with radius 2 (ECFP4) from the fragments and excluding fingerprint bits associated with radius 0 or 1. The logarithm of the fingerprint bit count divided by 0.1% of the number of fragments is then taken to generate both scores.
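As a rough illustration of the log-count step, the sketch below scores each fingerprint bit from how often it occurs. It is written under assumptions: the base-10 logarithm, the function names and the toy bit identifiers are choices made for this example, and the fingerprinting itself (ECFP4 with the radius 0 and 1 bits removed) is taken as already done. It is not the authors' implementation.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def bit_scores(bit_ids: Iterable[int], n_fragments: int) -> Dict[int, float]:
    """Score each fingerprint bit as log10(count / (0.1% of n_fragments)).

    bit_ids holds one entry per occurrence of a bit across all fragments.
    The log base is an assumption; the text only says the ratio is "logged".
    """
    counts = Counter(bit_ids)
    reference = n_fragments / 1000  # 0.1% of the number of fragments
    return {bit: math.log10(count / reference) for bit, count in counts.items()}


def raw_score(fragment_score: float, complexity_penalty: float) -> float:
    """Combine the two terms as described: fragment term minus penalty."""
    return fragment_score - complexity_penalty


# Toy data: bit 101 occurs 50 times, bit 202 once, among 10,000 fragments
toy_bits = [101] * 50 + [202]
scores = bit_scores(toy_bits, n_fragments=10_000)
```

With this form, a bit seen more often than 0.1% of the fragment count scores above zero and a rarer bit scores below zero, so common substructures raise the fragment term and unusual ones lower it.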

[Figure 1: fragmentation of a molecule using building block and reaction knowledge]

BR-SAScore was then compared with six existing methods: two likeness-based scores (SAScore and CLScore) and four learning-based methods (SYBA, RAScore, GASA and DeepSA). The three datasets (TS1, TS2 and TS3) were scored by all seven methods, and the results can be seen in Figure 2. BR-SAScore shows higher precision and true positive rates at nearly all recall and false positive rate values. Chen and Jung also examine the area under the curve for both sets of plots alongside the computational time, which suggests that BR-SAScore is the best method, followed closely by SAScore and DeepSA.
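For readers less familiar with these plots, here is a minimal sketch of how precision-recall and ROC points, and the area under the ROC curve, can be computed from scores and ES/HS labels. The data are made up, the function names are hypothetical, and the code assumes that scores are oriented so that a higher value means "more likely ES" (an SAScore-style value, where lower means easier, would be negated first) and that no two scores are tied.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def roc_and_pr(scores: Sequence[float], labels: Sequence[int]) -> Tuple[List[Point], List[Point]]:
    """Sweep a threshold from the highest score down.

    labels: 1 for ES (positive class), 0 for HS.
    Returns ROC points as (false positive rate, true positive rate)
    and precision-recall points as (recall, precision).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    roc: List[Point] = [(0.0, 0.0)]
    pr: List[Point] = []
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        roc.append((fp / n_neg, tp / n_pos))
        pr.append((tp / n_pos, tp / (tp + fp)))
    return roc, pr


def trapezoid_area(points: Sequence[Point]) -> float:
    """Area under a curve given as (x, y) points ordered by increasing x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up example: three ES molecules (1) and three HS molecules (0)
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
example_labels = [1, 1, 0, 1, 0, 0]
roc_points, pr_points = roc_and_pr(example_scores, example_labels)
roc_auc = trapezoid_area(roc_points)
```

A method whose curve sits above the others at nearly every threshold, as reported for BR-SAScore, will also have the larger area under that curve.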

[Figure 2: precision-recall and ROC comparison of the seven methods on TS1, TS2 and TS3]

Additional tests were then run on more complex structures. For some molecules the authors show how each atom contributes to the BR-SAScore and compare this with DeepSA. BR-SAScore tends to align with the confidence level of the synthesis planning program.

BR-SAScore is another interesting take on the age-old problem of how to numerically quantify how easy a molecule is to synthesise. It will be intriguing to see the uptake of the score, as it builds on the existing SAScore but takes into account building blocks and actual synthetic routes, which in the examples presented leads to slightly more accurate scoring.

 

  1. Ertl P, Schuffenhauer A (2009) Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J Cheminform 1:8. https://doi.org/10.1186/1758-2946-1-8
  2. Thakkar A, Chadimová V, Bjerrum EJ et al (2021) Retrosynthetic accessibility score (RAscore)—rapid machine learned synthesizability classification from AI driven retrosynthetic planning. Chem Sci 12:3339–3349. https://doi.org/10.1039/D0SC05401A
  3. Chen B, Li C, Dai H, Song L (2020) Retro*: Learning Retrosynthetic Planning with Neural Guided A* Search. In: Proceedings of the 37th International Conference on Machine Learning. PMLR, pp 1608–1616

 
