Adaptive machine learning for protein engineering

2022 ◽  
Vol 72 ◽  
pp. 145-152
Author(s):  
Brian L. Hie ◽  
Kevin K. Yang
2021 ◽  
Vol 42 (3) ◽  
pp. 151-165
Author(s):  
Harini Narayanan ◽  
Fabian Dingfelder ◽  
Alessandro Butté ◽  
Nikolai Lorenzen ◽  
Michael Sokolov ◽  
...  

2020 ◽  
Author(s):  
Adam C. Mater ◽  
Mahakaran Sandhu ◽  
Colin Jackson

Abstract: Machine learning (ML) has the potential to revolutionize protein engineering. However, the field currently lacks standardized and rigorous evaluation benchmarks for sequence-fitness prediction, which makes accurate evaluation of the performance of different architectures difficult. Here we propose a unifying framework for ML-driven sequence-fitness prediction. Using simulated (the NK model) and empirical sequence landscapes, we define four key performance metrics: interpolation within the training domain, extrapolation outside the training domain, robustness to sparse training data, and ability to cope with epistasis/ruggedness. We show that architectural differences between algorithms consistently affect performance against these metrics across both experimental and theoretical landscapes. Moreover, landscape ruggedness is revealed to be the greatest determinant of the accuracy of sequence-fitness prediction. We hope that this benchmarking method and the code that accompanies it will enable robust evaluation and comparison of novel architectures in this emerging field and assist in the adoption of ML for protein engineering.
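Illustrative note: the NK model mentioned in the abstract above can be sketched in a few lines. This is a minimal sketch of a generic NK landscape, not the benchmark code that accompanies the paper. The function names, the two-letter alphabet, and the choice of right-hand wrapping neighbours are assumptions made for illustration. Raising k couples more sites together, which makes the landscape more rugged (more local optima).

```python
import itertools
import random


def make_nk_landscape(n, k, alphabet="AB", seed=0):
    """Return a fitness function for a random NK landscape (hypothetical helper).

    Each of the n sites contributes a random value that depends on its own
    state and on the states of its k right-hand neighbours (wrapping around).
    """
    rng = random.Random(seed)
    # One lookup table per site: (k+1)-tuple of states -> random contribution.
    tables = [
        {combo: rng.random() for combo in itertools.product(alphabet, repeat=k + 1)}
        for _ in range(n)
    ]

    def fitness(seq):
        if len(seq) != n:
            raise ValueError("sequence length must equal n")
        total = 0.0
        for i in range(n):
            neighbourhood = tuple(seq[(i + j) % n] for j in range(k + 1))
            total += tables[i][neighbourhood]
        return total / n  # mean contribution, so fitness lies in [0, 1)

    return fitness


def count_local_optima(fitness, n, alphabet="AB"):
    """Count sequences with no single-mutation neighbour of higher fitness."""
    optima = 0
    for seq in itertools.product(alphabet, repeat=n):
        f = fitness(seq)
        is_peak = True
        for i in range(n):
            for a in alphabet:
                if a != seq[i]:
                    mutant = seq[:i] + (a,) + seq[i + 1:]
                    if fitness(mutant) > f:
                        is_peak = False
        optima += is_peak
    return optima
```

With k = 0 every site contributes independently, so the landscape is additive and has a single peak; comparing `count_local_optima` for k = 0 and k = n - 1 gives a quick feel for the ruggedness axis the abstract describes.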


2021 ◽  
Author(s):  
Christian Dallago ◽  
Jody Mou ◽  
Kadina E Johnston ◽  
Bruce Wittmann ◽  
Nicholas Bhattacharya ◽  
...  

Machine learning could enable an unprecedented level of control in protein engineering for therapeutic and industrial applications. Critical to its use in designing proteins with desired properties, machine learning models must capture the protein sequence-function relationship, often termed fitness landscape. Existing benchmarks like CASP or CAFA assess structure and function predictions of proteins, respectively, yet they do not target metrics relevant for protein engineering. In this work, we introduce Fitness Landscape Inference for Proteins (FLIP), a benchmark for function prediction to encourage rapid scoring of representation learning for protein engineering. Our curated tasks, baselines, and metrics probe model generalization in settings relevant for protein engineering, e.g. low-resource and extrapolative. Currently, FLIP encompasses experimental data across adeno-associated virus stability for gene therapy, protein domain B1 stability and immunoglobulin binding, and thermostability from multiple protein families. In order to enable ease of use and future expansion to new tasks, all data are presented in a standard format. FLIP scripts and data are freely accessible at https://benchmark.protein.properties/home.


2021 ◽  
Vol 2021 ◽  
pp. 1-7
Author(s):  
Li-li Jia ◽  
Ting-ting Sun ◽  
Yan Wang ◽  
Yu Shen

Artificial intelligence technologies such as machine learning have been applied to protein engineering, with unique advantages in protein structure, function prediction, catalytic activity, and other issues in recent years. Screening better mutants is still a bottleneck in protein engineering. In this paper, a new sequence-activity relationship method was analyzed for its application in improving the thermal stability of Aspergillus terreus (R)-ω-selective amine transaminase. The experimental data from 6 single-point mutated enzymes were used as a learning dataset to build models and predict the thermostability of 26 mutants. Based on digital signal processing (DSP), this method digitized the amino acid sequence of proteins by fast Fourier transform (FFT) and then established the best model applying partial least squares regression (PLSR) to screen out all possible mutants, especially those with high performance. In protein engineering, the innovative sequence activity relationship (ISAR) method can make a reasonable prediction using limited experimental data and significantly reduce the experimental cost. The half-life (T1/2) of (R)-ω-transaminase was fitted with the amino acid sequence by the ISAR algorithm, resulting in an R² of 0.8929 and a cvRMSE of 4.89. At the same time, the mutants with higher T1/2 than the existing ones were predicted, laying the groundwork for better (R)-ω-transaminase in the later stage. The ISAR algorithm is expected to provide a new technique for protein evolution and screening.
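Illustrative note: the digitize-then-transform step described in the abstract above can be sketched as follows. This is a rough sketch under stated assumptions, not the ISAR implementation. The abstract does not say which numeric amino acid index was used, so the Kyte-Doolittle hydropathy scale here is an assumption, and a direct discrete Fourier transform stands in for an FFT library call (same result, slower). The helper names `encode` and `power_spectrum` are hypothetical. The regression step (PLSR) is not shown; the spectra would serve as its input features.

```python
import cmath

# Kyte-Doolittle hydropathy values. ASSUMPTION: the abstract does not name
# the numeric index used to digitize sequences; this scale is a stand-in.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(seq):
    """Turn an amino acid string into a numeric signal."""
    return [KYTE_DOOLITTLE[aa] for aa in seq]


def power_spectrum(signal):
    """Power spectrum via a direct DFT (an FFT gives the same values).

    Returns the power at frequencies 0 .. n//2 for a real-valued signal.
    """
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum
```

A single substitution changes the whole spectrum rather than one feature, which is one reason a spectral encoding can carry position-coupled information into a linear model.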


2020 ◽  
Vol 60 (6) ◽  
pp. 2773-2790 ◽  
Author(s):  
Yuting Xu ◽  
Deeptak Verma ◽  
Robert P. Sheridan ◽  
Andy Liaw ◽  
Junshui Ma ◽  
...  

2020 ◽  
Vol 104 (24) ◽  
pp. 10515-10529
Author(s):  
Sanni Voutilainen ◽  
Markus Heinonen ◽  
Martina Andberg ◽  
Emmi Jokinen ◽  
Hannu Maaheimo ◽  
...  

Abstract: In this work, deoxyribose-5-phosphate aldolase (Ec DERA, EC 4.1.2.4) from Escherichia coli was chosen as the protein engineering target for improving the substrate preference towards smaller, non-phosphorylated aldehyde donor substrates, in particular towards acetaldehyde. The initial broad set of mutations was directed to 24 amino acid positions in the active site or in the close vicinity, based on the 3D complex structure of the E. coli DERA wild-type aldolase. The specific activity of the DERA variants containing one to three amino acid mutations was characterised using three different substrates. A novel machine learning (ML) model utilising Gaussian processes and feature learning was applied for the 3rd mutagenesis round to predict new beneficial mutant combinations. This led to the most clear-cut (two- to threefold) improvement in acetaldehyde (C2) addition capability with the concomitant abolishment of the activity towards the natural donor molecule glyceraldehyde-3-phosphate (C3P) as well as the non-phosphorylated equivalent (C3). The Ec DERA variants were also tested on aldol reaction utilising formaldehyde (C1) as the donor. Ec DERA wild-type was shown to be able to carry out this reaction, and furthermore, some of the improved variants on acetaldehyde addition reaction turned out to have also improved activity on formaldehyde.

Key points
• DERA aldolases are promiscuous enzymes.
• Synthetic utility of DERA aldolase was improved by protein engineering approaches.
• Machine learning methods aid the protein engineering of DERA.


Author(s):  
Surojit Biswas ◽  
Grigory Khimulya ◽  
Ethan C. Alley ◽  
Kevin M. Esvelt ◽  
George M. Church

Abstract: Protein engineering has enormous academic and industrial potential. However, it is limited by the lack of experimental assays that are consistent with the design goal and sufficiently high-throughput to find rare, enhanced variants. Here we introduce a machine learning-guided paradigm that can use as few as 24 functionally assayed mutant sequences to build an accurate virtual fitness landscape and screen ten million sequences via in silico directed evolution. As demonstrated in two highly dissimilar proteins, avGFP and TEM-1 β-lactamase, top candidates from a single round are diverse and as active as engineered mutants obtained from previous multi-year, high-throughput efforts. Because it distills information from both global and local sequence landscapes, our model approximates protein function even before receiving experimental data, and generalizes from only single mutations to propose high-functioning epistatically non-trivial designs. With reproducible >500% improvements in activity from a single assay in a 96-well plate, we demonstrate the strongest generalization observed in machine-learning guided protein function optimization to date. Taken together, our approach enables efficient use of resource intensive high-fidelity assays without sacrificing throughput, and helps to accelerate engineered proteins into the fermenter, field, and clinic.
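Illustrative note: "in silico directed evolution" over a virtual fitness landscape, as described in the abstract above, can be pictured as a mutate-score-accept loop. This is a minimal sketch assuming a Metropolis-style acceptance rule; the abstract does not specify the search procedure, and `in_silico_evolution`, `toy_score`, the target string, and all parameter values are hypothetical. In practice `score` would be a trained surrogate model rather than the toy function used here.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def in_silico_evolution(start, score, steps=500, temperature=0.05, seed=0):
    """Propose single substitutions and accept them with a Metropolis rule.

    `score` is any callable mapping a sequence to a number (higher is better).
    Returns the best sequence seen and its score.
    """
    rng = random.Random(seed)
    current, current_score = start, score(start)
    best, best_score = current, current_score
    for _ in range(steps):
        pos = rng.randrange(len(current))
        proposal = current[:pos] + rng.choice(AMINO_ACIDS) + current[pos + 1:]
        proposal_score = score(proposal)
        delta = proposal_score - current_score
        # Always accept improvements; accept worse proposals with a
        # probability that shrinks as the score drop grows.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = proposal, proposal_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score


def toy_score(seq, target="MKTAYIAK"):
    """Stand-in for a surrogate model: fraction of positions matching a target."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)
```

Calling `in_silico_evolution("AAAAAAAA", toy_score)` walks toward higher-scoring sequences; the temperature controls how readily the search crosses fitness valleys, which matters on rugged landscapes.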


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Yunan Luo ◽  
Guangde Jiang ◽  
Tianhao Yu ◽  
Yang Liu ◽  
Lam Vo ◽  
...  

Abstract: Machine learning has been increasingly used for protein engineering. However, because the general sequence contexts they capture are not specific to the protein being engineered, the accuracy of existing machine learning algorithms is rather limited. Here, we report ECNet (evolutionary context-integrated neural network), a deep-learning algorithm that exploits evolutionary contexts to predict functional fitness for protein engineering. This algorithm integrates local evolutionary context from homologous sequences that explicitly model residue-residue epistasis for the protein of interest with the global evolutionary context that encodes rich semantic and structural features from the enormous protein sequence universe. As such, it enables accurate mapping from sequence to function and provides generalization from low-order mutants to higher-order mutants. We show that ECNet predicts the sequence-function relationship more accurately as compared to existing machine learning algorithms by using ~50 deep mutational scanning and random mutagenesis datasets. Moreover, we used ECNet to guide the engineering of TEM-1 β-lactamase and identified variants with improved ampicillin resistance with high success rates.


2020 ◽  
Vol 4 (1) ◽  
pp. 7-9
Author(s):  
Harry F. Rickerby ◽  
Katya Putintseva ◽  
Christopher Cozens
