Accurate contact-based modelling of repeat proteins predicts the structure of new repeat protein families

2021 ◽  
Vol 17 (4) ◽  
pp. e1008798
Author(s):  
Claudio Bassot ◽  
Arne Elofsson

Repeat proteins are abundant in eukaryotic proteomes. They are involved in many eukaryote-specific functions, including signalling. For many of these proteins, the structure is not known, as they are difficult to crystallise. Today, using direct coupling analysis and deep learning, it is often possible to predict a protein’s structure. However, the unique sequence features of repeat proteins have made it challenging to use direct coupling analysis for predicting contacts. Here, we show that deep learning-based methods (trRosetta, DeepMetaPsicov (DMP) and PconsC4) overcome this problem and can predict intra- and inter-unit contacts in repeat proteins. In a benchmark dataset of 815 repeat proteins, about 90% can be correctly modelled. Further, among 48 PFAM families lacking a protein structure, we produce models of forty-one families with estimated high accuracy.
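The distinction between intra- and inter-unit contacts can be made concrete with a small sketch. It assumes that the start position of each repeat unit is already known and that predicted contacts are given as zero-based residue index pairs; the function names and the example values are hypothetical and are not taken from the paper.

```python
from bisect import bisect_right

def unit_index(position, unit_starts):
    """Return the index of the repeat unit that contains a residue position.

    unit_starts must be sorted and its first entry must not exceed position.
    """
    return bisect_right(unit_starts, position) - 1

def classify_contacts(contacts, unit_starts):
    """Split residue pairs into intra-unit and inter-unit contacts."""
    intra, inter = [], []
    for i, j in contacts:
        same_unit = unit_index(i, unit_starts) == unit_index(j, unit_starts)
        (intra if same_unit else inter).append((i, j))
    return intra, inter

# Hypothetical example: three repeat units starting at residues 0, 30 and 60.
unit_starts = [0, 30, 60]
contacts = [(2, 25), (10, 40), (35, 58), (29, 30)]
intra, inter = classify_contacts(contacts, unit_starts)
```

A pair such as (29, 30) is adjacent in sequence but spans a unit boundary, so this sketch counts it as inter-unit; a real evaluation would likely also apply a minimum sequence separation.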

2019 ◽  
Author(s):  
Claudio Bassot ◽  
Arne Elofsson

Abstract: Repeat proteins are an abundant class in eukaryotic proteomes. They are involved in many eukaryote-specific functions, including signalling. For many of these families, the structure is not known. Recently, it has been shown that the structure of many protein families can be predicted by using contact predictions from direct coupling analysis and deep learning. However, the unique sequence features of repeat proteins are a challenge for contact prediction with DCA methods. Here, we show that the deep learning-based PconsC4 is more effective than DCA methods at predicting both intra- and inter-unit contacts among a comprehensive set of repeat proteins. In a benchmark dataset of 819 repeat proteins, about one third can be correctly modelled, and among 51 PFAM families lacking a protein structure, we produce models of five families with estimated high accuracy.

Author summary: Repeat proteins are widespread among organisms and particularly abundant in eukaryotic proteomes. Their primary sequences contain repetitions that give rise to structures with repeated folds or domains. Although the repeated units are easy to recognise in the primary sequence, structural information is often missing. Here, we used contact prediction to predict the structure of repeat proteins directly from their primary sequences. We benchmark our method on a dataset comprising all known repeat structures. We evaluate the contact predictions and the resulting models for different classes of proteins and different target lengths, and we benchmark the quality assessment of the models on repeat proteins. Finally, we applied the methods to the repeat PFAM families lacking resolved structures; five of them were modelled with high accuracy.


Author(s):  
Edwin Rodriguez Horta ◽  
Martin Weigt

Abstract: Coevolution-based contact prediction, either directly by coevolutionary couplings resulting from global statistical sequence models or using structural supervision and deep learning, has found widespread application in protein-structure prediction from sequence. However, one of the basic assumptions in global statistical modeling is that sequences form an at least approximately independent sample of an unknown probability distribution, which is to be learned from data. In the case of protein families, this assumption is obviously violated by phylogenetic relations between protein sequences. It has turned out to be notoriously difficult to take phylogenetic correlations into account in coevolutionary model learning. Here, we propose a complementary approach: we develop two strategies to randomize or resample sequence data, such that conservation patterns and phylogenetic relations are preserved, while intrinsic (i.e. structure- or function-based) coevolutionary couplings are removed. An analysis of these data shows that the strongest coevolutionary couplings, i.e. those used by Direct Coupling Analysis to predict contacts, are only weakly influenced by phylogeny. However, phylogeny-induced spurious couplings are of similar size to the bulk of coevolutionary couplings, and dissecting functional from phylogeny-induced couplings might lead to more accurate contact predictions in the range of intermediate-size couplings. The code is available at https://github.com/ed-rodh/Null_models_I_and_II.

Author summary: Many homologous protein families contain thousands of highly diverged amino-acid sequences, which fold into close-to-identical three-dimensional structures and fulfill almost identical biological tasks. Global coevolutionary models, like those inferred by the Direct Coupling Analysis (DCA), assume that families can be considered as samples of some unknown statistical model, and that the parameters of these models represent evolutionary constraints acting on protein sequences. To learn these models from data, DCA and related approaches also have to assume that the distinct sequences in a protein family are close to independent, while in reality they are characterized by involved hierarchical phylogenetic relationships. Here we propose null models for sequence alignments, which maintain the patterns of amino-acid conservation and phylogeny contained in the data, but destroy any coevolutionary couplings of the kind frequently used in protein structure prediction. We find that phylogeny actually induces spurious non-zero couplings. These are, however, significantly smaller than the largest couplings derived from natural sequences, and therefore have only little influence on the first predicted contacts. However, in the range of intermediate couplings, they may lead to statistically significant effects. Dissecting phylogenetic from functional couplings might therefore extend the range of accurately predicted structural contacts down to smaller coupling strengths than those currently used.
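The idea of a null alignment that keeps conservation but removes couplings can be illustrated with a deliberately simplified sketch. It shuffles each alignment column independently, which preserves the amino-acid counts of every column and breaks correlations between columns. Note the limitation: unlike the two strategies described in the abstract, this plain column shuffle does not preserve phylogenetic relations between sequences. The function names and the toy alignment are hypothetical.

```python
import random
from collections import Counter

def shuffle_columns(alignment, seed=0):
    """Shuffle every column of an alignment independently.

    Per-column residue counts (conservation) are unchanged, while
    correlations between columns are destroyed. Phylogeny is NOT preserved.
    """
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*alignment)]
    for col in columns:
        rng.shuffle(col)
    # Transpose back from columns to sequences.
    return ["".join(row) for row in zip(*columns)]

def column_counts(alignment):
    """Count residues per column, used to check that conservation is kept."""
    return [Counter(col) for col in zip(*alignment)]

# Toy alignment in which columns 0 and 2 are perfectly correlated (A-D, G-K).
alignment = ["ACDE", "ACDF", "GCKE", "GCKF"]
shuffled = shuffle_columns(alignment)
```

Comparing couplings inferred from `alignment` and from `shuffled` would then indicate how much signal survives without intrinsic coevolution; under this simplified null model any surviving signal reflects finite sampling only, not phylogeny.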


Author(s):  
Maureen Muscat ◽  
Giancarlo Croce ◽  
Edoardo Sarti ◽  
Martin Weigt

Abstract: Predicting three-dimensional protein structure and assembling protein complexes using sequence information are among the most prominent tasks in computational biology. Recently, substantial progress has been made in the case of single proteins using a combination of unsupervised coevolutionary sequence analysis with structurally supervised deep learning. While reaching impressive accuracies in predicting residue-residue contacts, deep learning has a number of disadvantages. The need for large structural training sets limits the applicability to multi-protein complexes, and the deep architecture of convolutional neural networks makes them intrinsically hard to interpret. Here we introduce FilterDCA, a simpler supervised predictor for inter-domain and inter-protein contacts. It is based on the fact that contact maps of proteins show typical contact patterns, which result from secondary structure and are reflected by patterns in coevolutionary analysis. We explicitly integrate averaged contact patterns with coevolutionary scores derived by Direct Coupling Analysis, reaching results comparable to more complex deep-learning approaches, while remaining fully transparent and interpretable. The FilterDCA code is available at http://gitlab.lcqb.upmc.fr/muscat/FilterDCA.

Author summary: The de novo prediction of tertiary and quaternary protein structures has recently seen important advances, by combining unsupervised, purely sequence-based coevolutionary analyses with structure-based supervision using deep learning for contact-map prediction. While showing impressive performance, deep-learning methods require large training sets and are hard to interpret. Here we construct a simple, transparent and therefore fully interpretable inter-domain contact predictor, which uses the results of coevolutionary Direct Coupling Analysis in combination with explicitly constructed filters reflecting typical contact patterns in a training set of known protein structures, and which improves the accuracy of predicted contacts significantly. Our approach thereby sheds light on the question of how contact information is encoded in coevolutionary signals.
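The notion of matching a coevolutionary score map against a typical contact pattern can be sketched as a small windowed correlation. This is only an illustration of the general idea, not the FilterDCA implementation: the 3x3 anti-diagonal pattern, the zero padding at the borders and the toy score matrix are all assumptions made here.

```python
def filter_response(scores, pattern):
    """Correlate a square pattern with a score matrix (zero padding at borders)."""
    n, m = len(scores), len(scores[0])
    k = len(pattern)
    half = k // 2
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            total = 0.0
            for a in range(k):
                for b in range(k):
                    r, c = i + a - half, j + b - half
                    if 0 <= r < n and 0 <= c < m:
                        total += scores[r][c] * pattern[a][b]
            out[i][j] = total
    return out

# Hypothetical pattern: an anti-diagonal stripe, as expected in a contact map
# for two anti-parallel strands pairing with each other.
ANTI_DIAGONAL = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
]

# Toy score map: an anti-diagonal stripe plus one isolated high score at (0, 0).
scores = [
    [1.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
]
response = filter_response(scores, ANTI_DIAGONAL)
```

In this toy case the pair (2, 2), which sits inside the stripe, receives a higher response than the isolated pair (0, 0) even though both have the same raw score. How such a response would be combined with the raw score into a final prediction is left open in this sketch.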


2017 ◽  
Vol 114 (13) ◽  
pp. E2662-E2671 ◽  
Author(s):  
Guido Uguzzoni ◽  
Shalini John Lovis ◽  
Francesco Oteri ◽  
Alexander Schug ◽  
Hendrik Szurmant ◽  
...  

Proteins have evolved to perform diverse cellular functions, from serving as reaction catalysts to coordinating cellular propagation and development. Frequently, proteins do not exert their full potential as monomers but rather undergo concerted interactions as either homo-oligomers or with other proteins as hetero-oligomers. The experimental study of such protein complexes and interactions has been arduous. Theoretical structure prediction methods are an attractive alternative. Here, we investigate homo-oligomeric interfaces by tracing residue coevolution via the global statistical direct coupling analysis (DCA). DCA can accurately infer spatial adjacencies between residues. These adjacencies can be included as constraints in structure prediction techniques to predict high-resolution models. By taking advantage of the ongoing exponential growth of sequence databases, we go significantly beyond anecdotal cases of a few protein families and apply DCA to a systematic large-scale study of nearly 2,000 Pfam protein families with sufficient sequence information and structurally resolved homo-oligomeric interfaces. We find that large interfaces are commonly identified by DCA. We further demonstrate that DCA can differentiate between subfamilies with different binding modes within one large Pfam family. Sequence-derived contact information for the subfamilies proves sufficient to assemble accurate structural models of the diverse protein-oligomers. Thus, we provide an approach to investigate oligomerization for arbitrary protein families leading to structural models complementary to often-difficult experimental methods. Combined with ever more abundant sequential data, we anticipate that this study will be instrumental to allow the structural description of many heteroprotein complexes in the future.


2011 ◽  
Vol 108 (49) ◽  
pp. E1293-E1301 ◽  
Author(s):  
F. Morcos ◽  
A. Pagnani ◽  
B. Lunt ◽  
A. Bertolino ◽  
D. S. Marks ◽  
...  

2021 ◽  
Vol 3 (2) ◽  
Author(s):  
Bernat Anton ◽  
Mireia Besalú ◽  
Oriol Fornes ◽  
Jaume Bonet ◽  
Alexis Molina ◽  
...  

Abstract: Direct-coupling analysis (DCA) for studying the coevolution of residues in proteins has been widely used to predict the three-dimensional structure of a protein from its sequence. We present RADI/raDIMod, a variation of the original DCA algorithm that groups chemically equivalent residues and combines them with super-secondary structure motifs to model protein structures. Interestingly, the simplification produced by grouping amino acids into only two groups (polar and non-polar) is still representative of the physicochemical nature that characterizes the protein structure, and it is in line with the role of hydrophobic forces in protein-folding funneling. As a result of the compressed alphabet, the number of sequences required for the multiple sequence alignment is reduced. The number of long-range contacts predicted is limited; therefore, our approach requires the use of neighboring sequence positions. We use the prediction of secondary structure and of super-secondary structure motifs to predict local contacts. Using RADI together with raDIMod, a fragment-based protein structure modelling approach, we achieve near-native conformations when the super-secondary motifs cover >30–50% of the sequence. Interestingly, although different contacts are predicted with different alphabets, they produce similar structures.
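The two-group alphabet described above can be sketched as a simple mapping applied to each aligned sequence. The assignment of individual amino acids to the non-polar group is an assumption made for this illustration and may differ from the grouping used by RADI; the function name and the toy alignment are likewise hypothetical.

```python
# Assumed non-polar set for illustration only; the grouping in RADI may differ.
NON_POLAR = set("AVLIMFWPGC")

def reduce_sequence(sequence):
    """Map an aligned sequence onto a two-letter alphabet.

    H = non-polar, P = polar; gap characters are kept unchanged.
    """
    reduced = []
    for residue in sequence:
        if residue == "-":
            reduced.append("-")
        elif residue in NON_POLAR:
            reduced.append("H")
        else:
            reduced.append("P")
    return "".join(reduced)

# Toy alignment with hypothetical sequences.
alignment = ["AKLE", "VRIE", "KALE", "A-LD"]
reduced = [reduce_sequence(seq) for seq in alignment]
```

With fewer states per alignment column, a pairwise model has far fewer coupling parameters per pair of positions (on the order of the squared alphabet size), which is consistent with the reduced sequence requirement stated in the abstract.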


2019 ◽  
Author(s):  
Marco Fantini ◽  
Simonetta Lisi ◽  
Paolo De Los Rios ◽  
Antonino Cattaneo ◽  
Annalisa Pastore

Abstract: Direct Coupling Analysis (DCA) is a powerful technique that makes it possible to extract structural information on proteins belonging to large protein families exclusively by in silico analysis. This method is, however, limited by sequence availability and various biases. Here, we propose a method that exploits molecular evolution to circumvent these limitations: instead of relying on existing protein families, we used in vitro mutagenesis of TEM-1 beta lactamase combined with in vivo functional selection to generate the sequence data necessary for evolutionary analysis. With this strategy, which we called CAMELS (Coupling Analysis by Molecular Evolution Library Sequencing), we could reconstruct the lactamase fold exclusively from sequence data. By generating and sequencing large libraries of variants, we can deal with any protein, ancient or recent, from any species, the only constraint being the need to set up a functional phenotypic selection for the protein. This method allows us to obtain protein structures without solving the structure experimentally.

