Identifying protein targets in SARS-CoV-2 via machine learning


The coronavirus disease 2019 (COVID-19) pandemic has affected nearly 271 million people and killed 5.32 million, with the most recent wave driven by the Delta variant of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). COVID-19 adds to the list of infectious diseases that have posed potential global threats, such as Severe Acute Respiratory Syndrome (SARS), Middle East Respiratory Syndrome (MERS), Ebola and Zika. Such infections highlight the need to develop therapeutic agents against emerging pathogens.

Study: Language models for predicting SARS-CoV-2 inhibitors. Image Credit: Andrii Vodolazhskyi / Shutterstock

Developing therapeutic solutions against new viruses is tedious and prohibitively slow, taking up to 10 to 15 years. The initial step of identifying molecules of interest and therapeutic targets for further investigation is crucial, because the sheer size of chemical space precludes an exhaustive search using expensive experiments and assays. Tools from machine learning (ML) and high performance computing (HPC) are increasingly used to guide the selection of promising drug candidates. Although computational methods can partially offset some of the associated experimental costs, training an ML algorithm normally requires a large library of compounds with measured properties. Organizing a rapid response to an emerging pandemic therefore also poses a challenge for computational methods, as large datasets must first be generated for the target of interest.

Automating the drug discovery process requires an algorithm that (i) leverages large libraries of existing compounds without the need for chemical property measurements; (ii) predicts affinities for new protein targets when very limited experimental data are available; and (iii) explores the chemical space relevant to the target pathogen or infection to efficiently identify compounds for further investigation.

To meet all three criteria, researchers at Oak Ridge National Laboratory used high performance computing (HPC) to train generalizable ML models for both candidate generation and affinity prediction.

Their study was recently posted to the preprint server bioRxiv* and provides an overview of the use of an ML-based algorithm to analyze and predict therapeutic targets in emerging pathogens with a range of mutations.


To take advantage of existing large libraries of compounds, the researchers used a text representation of molecular structure known as the Simplified Molecular Input Line Entry System (SMILES). Using the Enamine REAL database as a starting point, they generated a new dataset of approximately 9.6 billion unique molecules. The dataset was used to pre-train a Transformer-based BERT model with the mask prediction task common in natural language processing. During pre-training, subsequences of a given molecule were replaced with a mask, and the model was trained to predict the masked sequence from its context. The model thus learned a representation of chemical structure in a completely unsupervised manner, without requiring any additional property measurements.
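The masking step described above can be illustrated with a minimal Python sketch. It is not the researchers' pipeline: the regex tokenizer, the `[MASK]` symbol, the 15% masking rate and the function names are all assumptions made for illustration, and the sketch masks individual tokens rather than longer subsequences.

```python
import random
import re

# Simplified SMILES tokenizer (an assumption for illustration): bracket atoms,
# two-letter halogens, two-digit ring closures, then any single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

MASK = "[MASK]"  # hypothetical mask symbol


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Hide random tokens and record what the model would have to predict.

    Returns (masked_tokens, labels). labels[i] holds the original token at
    every masked position and None everywhere else.
    """
    if rng is None:
        rng = random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Usage: mask a tokenized aspirin SMILES string.
tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")
masked, labels = mask_tokens(tokens, rng=random.Random(0))
print(" ".join(masked))
```

Because the labels come from the molecule itself, no measured property is needed, which is what makes this kind of pre-training unsupervised.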

When pre-training the deep learning language model (BERT) on approximately 9.6 billion molecules, the researchers achieved a peak performance of 603 petaflops in mixed precision. This reduced the pre-training time from a few days to a few hours compared with previous efforts using this architecture, while also increasing the size of the dataset by almost an order of magnitude.

For scoring, the researchers fine-tuned the language model on an assembled set of multiple protein targets with binding affinity data, then searched for inhibitors of two specific protein targets, SARS-CoV-2 Mpro and PLpro. They used a genetic algorithm approach to find optimal candidates using the language model’s generation and scoring capabilities.
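The generate-and-score loop behind a genetic algorithm can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions: in the study the language model supplies both the candidate generation and the affinity score, whereas here `toy_mutate` and `toy_score` are hypothetical stand-ins with no chemical meaning, and the population sizes are arbitrary.

```python
import random


def evolve(population, mutate, score, generations=20,
           survivors=10, children_per_parent=4, rng=None):
    """Elitist genetic search: propose variants, score them, keep the best."""
    if rng is None:
        rng = random.Random()
    pool = list(population)
    for _ in range(generations):
        # Generation step: propose variants of every current candidate.
        offspring = [mutate(parent, rng)
                     for parent in pool
                     for _ in range(children_per_parent)]
        # Scoring and selection: rank unique parents and offspring together,
        # breaking score ties alphabetically so the result is deterministic.
        ranked = sorted(set(pool + offspring),
                        key=lambda c: (score(c), c), reverse=True)
        pool = ranked[:survivors]
    return pool


# Hypothetical stand-ins for the language model's two roles.
ALPHABET = "CNO"


def toy_mutate(candidate, rng):
    """Replace one randomly chosen character (stand-in for model generation)."""
    i = rng.randrange(len(candidate))
    return candidate[:i] + rng.choice(ALPHABET) + candidate[i + 1:]


def toy_score(candidate):
    """Count 'N' characters (stand-in for a predicted binding affinity)."""
    return candidate.count("N")


# Usage: start from a single seed string and print the top candidate.
best = evolve(["CCCCCC"], toy_mutate, toy_score, rng=random.Random(0))
print(best[0], toy_score(best[0]))
```

Keeping the parents in the ranking means the best score never gets worse from one generation to the next, which is a common design choice when each scoring call is expensive.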


The main constraint in locating therapeutic targets in emerging pathogens and infections is the presence of mutations. The ML algorithm developed by the researchers can be effectively used to speed up the identification of protein binding sites on the surfaces of mutant pathogens and help model inhibitor-based drugs for these emerging therapeutic targets.

*Important Notice

bioRxiv publishes preliminary scientific reports that are not peer reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.

