Datasets

Benchmark dataset collection

To test our methods, the single-domain proteins in the benchmark dataset (Benchmark-I) were collected from the SCOPe 2.06 database45 (717 targets), the PDB (257 targets released after 1 May 2022) and the FM and FM/TBM targets from CASP 8–14 (refs. 46,47,48,49,50; 288 targets). Redundancy was then removed using a pairwise sequence identity cutoff of <30%, and only sequences with lengths between 30 and 850 amino acids were kept in the benchmark dataset. Furthermore, targets with chain breaks were removed, that is, targets whose residue indices were not consecutive or in which the Cα distance between two consecutive residues was greater than 5 Å. In total, there were 1,262 targets consisting of 323 α proteins, 164 β proteins and 775 α/β or α + β proteins in the benchmark dataset, which can be classified as 211 trivial (TBM-easy), 551 easy (TBM-hard), 383 hard (FM/TBM) and 117 very hard (FM) targets (see ‘LOMETS3 pipeline for meta-server threading’) based on LOMETS3 (refs. 26,51,52). In the benchmark analysis, the ‘trivial’ and ‘easy’ targets were combined into one group called ‘easy targets’ (762), while the ‘hard’ and ‘very hard’ targets were combined into one group called ‘hard targets’ (500).

The multidomain proteins in the second benchmark dataset (Benchmark-II) were sourced from the PDB database53. To eliminate redundancy, a pairwise sequence identity cutoff of <30% was applied. In total, 230 targets with lengths ranging from 80 to 1,250 amino acids were chosen. These targets cover 557 domains and can be divided into 167 two-domain targets, 37 three-domain targets and 26 high-order domain (≥4 domains) targets. Notably, 43 of the targets in Benchmark-II contain at least one discontinuous domain, where a discontinuous domain is defined as a domain that contains two or more segments from separate regions of the protein sequence.

Please note that when LOMETS3 threading was performed, all homologous templates with a sequence identity >30% to the target were excluded.

Human proteome dataset

The human proteome dataset contains 20,595 proteins with lengths between 2 and 34,350 amino acids collected from UniProt. To meet the scalability limits of D-I-TASSER (3.0), we only kept proteins with lengths ≤1,500 amino acids. Additionally, we removed proteins with lengths <40 because such short proteins generally form simple helix or coil structures, for which structure prediction is of little value. In total, 19,512 (94.7%) human proteins were retained and predicted in this work, comprising 12,236 single-domain proteins and 7,276 multidomain proteins as classified by FUpred (v1.0)36 or ThreaDom35 (v1.0; see ‘Protocols for domain partition and multidomain structural assembly’). The 7,276 multidomain proteins can be further split into 22,732 domains. Consequently, there are in total 34,968 (=12,236 + 22,732) domains for D-I-TASSER domain-level modeling.

As defined by LOMETS (v3.0), 43%/57% of the 19,512 full-chain proteins were identified as easy/hard targets, while for the 34,968 domain-level proteins, the proportion of easy targets was higher, with a ratio of 65:35 for easy and hard targets (Supplementary Fig. 8a). Meanwhile, the average neff of the MSAs for the domain-level proteins (501) is more than two times higher than that of the full-chain proteins (238; Supplementary Fig. 8b). These data suggest the advantage of domain-level structure prediction: more homologous templates provide a better starting conformation, and higher-neff MSAs contain more complete co-evolution information, thus helping AlphaFold2 (ref. 12), AttentionPotential and DeepPotential to create better restraints to assist the D-I-TASSER simulations.

D-I-TASSER pipeline

D-I-TASSER is a hybrid approach for uniform single-domain and multidomain protein structure prediction that couples deep learning with threading assembly simulations. The pipeline consists of the following six steps: (1) deep MSA generation, (2) threading template identification, (3) inter-residue restraint prediction, (4) domain boundary partition and assembly, (5) iterative structure assembly simulation and (6) atomic-level structure refinement and model quality estimation (Fig. 1).

DeepMSA2 for MSA generation

To generate a sufficient number of homologous sequences in an MSA, we extended our previous MSA generation method, DeepMSA54 (v1.0), to DeepMSA2 (refs. 54,55; v2.0, https://zhanggroup.org/DeepMSA2), which uses HHblits56 (v2.0.15), Jackhmmer57 (3.1b2) and HMMsearch56,57 (3.1b2) to iteratively search three whole-genome sequence databases, including Uniclust30 (ref. 58), UniRef30 (ref. 58) and UniRef90 (ref. 59), and six metagenome sequence databases, including Metaclust60, BFD61, Mgnify62, TaraDB63, MetaSourceDB64 and JGIclust65 (Supplementary Fig. 9). Because the metagenome databases contain far more sequence information than standard genome databases, their inclusion may help improve the MSA quality. Detailed descriptions of these genome and metagenome databases can be found in Supplementary Note 1. As shown in Supplementary Fig. 9, DeepMSA2 contains the following three pipelines: dMSA, qMSA and mMSA (see details in Supplementary Note 2). The MSAs generated by dMSA, qMSA and mMSA are ranked by a simplified version of AlphaFold2, in which the template detection module is deactivated and the embedding parameter is set to one to expedite model generation. Up to ten MSAs are obtained from the MSA generation step, and each of these MSAs is used as input to the simplified AlphaFold2 program, resulting in five structural models. The highest pLDDT score among these models is assigned as the ranking score for that specific MSA. Ultimately, the MSA with the highest ranking score among all generated MSAs is selected as the final MSA, representing an optimization of the information content contributing to the folding process.
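
To make the ranking procedure concrete, the following is a minimal sketch of the MSA selection loop. The `fold_and_score` callable stands in for the simplified AlphaFold2 run, which is not reproduced here; its name and signature are assumptions of this sketch.

```python
from typing import Callable, List, Sequence

def select_final_msa(msas: Sequence[str],
                     fold_and_score: Callable[[str], List[float]]) -> str:
    """Rank candidate MSAs as described in the text: each MSA is folded into
    five models by a simplified AlphaFold2 (placeholder here), and the best
    per-MSA pLDDT becomes that MSA's ranking score."""
    best_msa, best_score = None, float("-inf")
    for msa in msas:  # up to ten MSAs from dMSA, qMSA and mMSA
        score = max(fold_and_score(msa))  # ranking score = highest pLDDT
        if score > best_score:
            best_msa, best_score = msa, score
    return best_msa
```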

To quantify the diversity of an MSA, we define the number of effective sequences (neff) by

$$n_{\mathrm{eff}}=\frac{1}{\sqrt{L}}\mathop{\sum }\limits_{n=1}^{n_{\mathrm{MSA}}}\frac{1}{1+{\sum }_{m=1,m\ne n}^{n_{\mathrm{MSA}}}I\left(S_{m,n}\ge 0.8\right)},$$

(1)

where L is the length of the query protein, $n_{\mathrm{MSA}}$ is the number of sequences in the MSA, $S_{m,n}$ is the sequence identity between the mth and nth sequences and $I(\cdot)$ represents the Iverson bracket, which takes the value $I(S_{m,n}\ge 0.8)=1$ if $S_{m,n}\ge 0.8$, and 0 otherwise.
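
As a minimal illustration of equation (1), the sketch below computes neff from an integer-coded MSA; the dense all-pairs identity matrix is an assumption of the sketch that is only practical for modest MSA sizes.

```python
import numpy as np

def n_eff(msa: np.ndarray, identity_cut: float = 0.8) -> float:
    """Number of effective sequences, equation (1).

    `msa` is an (n_MSA, L) integer array of residue codes (gaps included as
    their own code). Sequence pairs with identity >= identity_cut count as
    redundant."""
    n, L = msa.shape
    # S[m, n]: fraction of identical columns between sequences m and n.
    S = (msa[:, None, :] == msa[None, :, :]).mean(axis=2)
    redundant = (S >= identity_cut).sum(axis=1) - 1  # exclude self (S = 1)
    return float(np.sum(1.0 / (1.0 + redundant)) / np.sqrt(L))
```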

LOMETS3 pipeline for meta-server threading

LOMETS3 (https://zhanggroup.org/LOMETS)26,51,52 is a meta-threading server for quick template-based fold recognition and protein structure prediction. It integrates the following 11 state-of-the-art threading programs: five contact-based threading programs, namely CEthreader66 (v1.0), Hybrid-CEthreader66 (v1.0), MapAlign67 (v1.0), DisCovER68 (v1.0) and EigenThreader69 (v1.0), and six profile-based threading programs, namely HHpred70 (v1.0), HHsearch71 (2.0.15), FFAS3D72 (v1.0), MUSTER73 (v1.0), MRFsearch and SparksX74 (v1.0), which together help improve the quality of the meta-threading results. All individual threading methods are locally installed and run on our computer cluster to ensure the quick generation of initial threading alignments, and the template libraries are updated weekly. Currently, the template library contains 106,803 domains/chains with a pairwise sequence identity of <70%. For a protein chain that consists of multiple domains, both the whole-chain and individual domain structures are included in the library. Due to its speed and accuracy, LOMETS3 is used as the initial step of D-I-TASSER to identify structural templates and generate query-template alignments.

The LOMETS3 pipeline consists of the following three consecutive steps: generation of sequence profiles, fold recognition through its component threading programs and template ranking and selection.

Generation of sequence profiles

Starting from a target protein sequence, the DeepMSA2 (refs. 54,55) method (see ‘DeepMSA2 for MSA generation’) is used to generate deep MSAs by iterative sequence homology searches through multiple sequence databases. The deep profiles are calculated from the MSAs in the form of sequence profiles or profile hidden Markov models (HMMs), which are prerequisites for the different individual threading programs. The MSAs are also used to predict residue–residue contacts, distances and hydrogen bond (HB) geometries, which are used by the five contact-based threading programs and for template ranking.

Fold recognition through the component threading programs

The profiles generated in the first step are used by the 11 LOMETS3 threading programs to identify template structures from the template library, where profiles are prebuilt for each template.

Template ranking and selection

For a given target, 220 templates are generated by the 11 component servers, where each server contributes its 20 top templates sorted by the z score of the corresponding threading algorithm. The top ten templates are finally selected from the 220 templates based on the following scoring function, which integrates the z score, the confidence of each method and the sequence identity between the identified templates and the query sequence:

$${\mathrm{score}}\left(i,j\right)={\mathrm{conf}}\left(\,j\right)\times \frac{z\,{\mathrm{score}}\left(i,j\right)}{z_{0}\left(\,j\right)}+{\mathrm{seqid}}\left(i,j\right),$$

(2)

where seqid (i, j) is the sequence identity between the query and the ith template for the jth program, and conf (j) is the confidence score of the jth program, calculated as the average TM score of the program’s first-ranked templates relative to the native structures on a training set of 243 nonredundant target proteins51. The detailed definition of z score (i, j) can be found in Supplementary Note 3; it includes three score terms from the contacts, distances and HB geometries predicted by AttentionPotential (v1.0) and DeepPotential (v1.0), and one sequence profile score term from the original profile-based threading methods. z0 (j) is the z-score cutoff for defining good/bad templates for the jth program, which was determined by maximizing the Matthews correlation coefficient (MCC) for distinguishing good templates (TM score ≥0.5) from bad templates (TM score <0.5) on the same training set. The resulting parameters z0 (j) (and conf (j)) are 6.1 (0.495), 7.8 (0.478), 6.0 (0.472), 22.0 (0.471), 3.8 (0.471), 8.5 (0.461), 6.0 (0.456), 6.9 (0.445), 46.0 (0.440), 6.0 (0.437) and 83.0 (0.389) for Hybrid-CEthreader, SparksX, CEthreader (https://zhanggroup.org/CEthreader), HHsearch, MapAlign, MUSTER (https://zhanggroup.org/MUSTER), MRFsearch, DisCovER, FFAS3D, EigenThreader and HHpred, respectively.

Based on the quality and number of threading alignments from LOMETS3, protein targets can be classified as ‘trivial’, ‘easy’, ‘hard’ or ‘very hard’. The classification of targets was considered in the contact prediction and REMC simulation sections of D-I-TASSER to train the parameters and weights with regard to different target types. The detailed procedure of target classification is shown as follows:

For each protein target, we first select the top template from each of the 11 threading methods in LOMETS3. Based on the selected templates, za, the average normalized z score (divided by z0), is calculated over the 11 threading methods. We further calculate the pairwise TM scores among the 11 templates selected by the 11 threading methods. There are 55 ($={C}_{11}^{2}=11\times 10/2$) distinct template–template pairs and corresponding TM scores. We define TM1, TM2, TM3 and TM4 as the average TM scores over the quartiles of the template pairs ranked by their TM scores (starting from the highest-scoring pairs). Thus, we get a set of nine scores, that is, S = {za, TM1, TM2, TM3, TM4, za × TM1, za × TM2, za × TM3, za × TM4}. Based on set S, the target can be classified by the following rule:

$${\mathrm{Target}}\;{\mathrm{is}}\;{\mathrm{classified}}\;{\mathrm{as}}\left\{\begin{array}{ll}{\mathrm{Trivial}},&{\mathrm{if}}\;\left|\left\{s\in S\,|\,s > 1.8\times {\mathrm{cut}}2\left(s\right)\right\}\right|\ge 8\\ {\mathrm{Easy}},&{\mathrm{else}}\;{\mathrm{if}}\;\left|\left\{s\in S\,|\,s > 1.0\times {\mathrm{cut}}2\left(s\right)\right\}\right|\ge 7\\ {\mathrm{Very}}\;{\mathrm{hard}},&{\mathrm{else}}\;{\mathrm{if}}\;\left|\left\{s\in S\,|\,s < 1.0\times {\mathrm{cut}}1\left(s\right)\right\}\right|\ge 6\\ {\mathrm{Hard}},&{\mathrm{otherwise}}\end{array}\right.,$$

(3)

where cut1 (S) = {0.620, 0.273, 0.250, 0.216, 0.185, 0.151, 0.137, 0.096, 0.093} and cut2 (S) = {1.052, 0.508, 0.396, 0.350, 0.339, 0.353, 0.279, 0.239, 0.209}. Here $\left|\{\ldots\}\right|$ denotes the number of items in the set {…}.
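
The classification rule of equation (3) can be written compactly as below; this is a sketch of the published rule, with the per-score cutoffs taken verbatim from the text.

```python
CUT1 = [0.620, 0.273, 0.250, 0.216, 0.185, 0.151, 0.137, 0.096, 0.093]
CUT2 = [1.052, 0.508, 0.396, 0.350, 0.339, 0.353, 0.279, 0.239, 0.209]

def classify_target(za: float, tm: list) -> str:
    """Equation (3): classify a target from the nine-score set
    S = {za, TM1..TM4, za*TM1..za*TM4}; `tm` holds [TM1, TM2, TM3, TM4]."""
    S = [za] + tm + [za * t for t in tm]
    if sum(s > 1.8 * c for s, c in zip(S, CUT2)) >= 8:
        return "Trivial"
    if sum(s > 1.0 * c for s, c in zip(S, CUT2)) >= 7:
        return "Easy"
    if sum(s < 1.0 * c for s, c in zip(S, CUT1)) >= 6:
        return "Very hard"
    return "Hard"
```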

To simplify the logic of the analyses in the manuscript, we redefined target classification as the following two groups of targets: easy targets and hard targets, where easy targets here include both ‘trivial’ and ‘easy’ types, while hard targets are a combination of both the ‘hard’ and ‘very hard’ groups. However, for the parameter determination, we still keep the four classification groups.

Deep learning module for contact map, distance map and HB network prediction

The deep learning module contains DeepPotential, AttentionPotential, AlphaFold2 and five contact predictors, which are designed to predict the spatial restraints used in the D-I-TASSER folding simulations, including contacts, distances and HB networks.

First, the definitions of contact, distance and HB are shown in the following sections.

Inter-residue contact

A contact is defined as a pair of residues where the distance between their Cα or Cβ atoms is less than or equal to 8 Å, provided that they are separated by at least five residues in the sequence. The long-, medium- and short-range contacts are defined by sequence separation |i − j| ≥24, 23 ≥ |i − j| ≥ 12 and |i − j| ≤11, respectively.

Inter-residue distance

A distance is defined as the CαCα or CβCβ distance between a pair of residues.

Inter-residue HB

The HBs used in D-I-TASSER are described by the inner products of two local Cartesian coordinate systems formed by a residue pair i and j. As shown in Supplementary Fig. 10, for residue i, three unit direction vectors, Ai, Bi and Ci, are used to define the local coordinate system that describes the hydrogen direction. Here Bi is the unit vector normal to the plane formed by the three neighboring Cα atoms, i − 1, i and i + 1, while Ai and Ci are mutually perpendicular unit vectors located in the plane. The equations for Ai, Bi and Ci are given in equations (16)–(18), respectively. For two residues i and j, we define AA, BB and CC as the inner products Ai·Aj, Bi·Bj and Ci·Cj, respectively. AA, BB and CC are used to represent the HBs between two residues and are helpful for correcting the secondary structures in the modeling simulations. The equations for AA, BB and CC are given in equations (19)–(21), respectively.

Second, we list the predictors used in the deep learning module.

DeepPotential pipeline

The DeepPotential pipeline is used to predict contacts, distances and HB networks. In DeepPotential (https://zhanggroup.org/DeepPotential), a set of co-evolutionary features are extracted from the MSA obtained by DeepMSA2. The raw coupling parameters from the pseudo-likelihood maximized (PLM) 22-state Potts model and the raw mutual information (MI) matrix are the two major two-dimensional features in DeepPotential, where the 22 states represent the 20 standard amino acids, the nonstandard amino acid type and the gap state. The PLM feature minimizes the following loss function:

$$\begin{array}{ccl}L_{\mathrm{PLM}} & = & -\mathop{\sum }\limits_{i=1}^{L}\mathop{\sum }\limits_{n=1}^{N}\log \frac{\exp \left(e_{i}\left(X_{n,i}\right)+\mathop{\sum }\nolimits_{j=1,\;j\ne i}^{L}P_{i,j}\left(X_{n,i},X_{n,j}\right)\right)}{{\sum }_{q=1}^{Q}\exp \left(e_{i}\left(q\right)+\mathop{\sum }\nolimits_{j=1,\;j\ne i}^{L}P_{i,j}\left(q,X_{n,j}\right)\right)}\\ & & +\,{\lambda }_{\mathrm{single}}\mathop{\sum }\limits_{i=1}^{L}{\|e_{i}\|}_{2}^{2}+{\lambda }_{\mathrm{pair}}\mathop{\sum }\limits_{{i,j=1}\atop{i\ne j}}^{L}{\|P_{i,j}\|}_{2}^{2},\end{array}$$

(4)

where X is the N × L matrix representing the MSA; $e\in {R}^{L\times Q}$ and $P\in {R}^{L\times L\times Q\times Q}$ are the field and coupling parameters of the Potts model, respectively; λsingle = 1 and λpair = 0.2 × (L − 1) are the regularization coefficients for e and P; L is the sequence length; and Q = 22 is the number of states. The MI feature of residues i and j is defined as follows:

$$M_{i,j}\left(q_{1},q_{2}\right)=f_{i,j}\left(q_{1},q_{2}\right)\mathrm{ln}\frac{f_{i,j}\left(q_{1},q_{2}\right)}{f_{i}\left(q_{1}\right)f_{j}\left(q_{2}\right)},$$

(5)

where $f_{i}(q_{1})$ is the frequency of residue type $q_{1}$ at position i of the MSA, and $f_{i,j}(q_{1},q_{2})$ is the co-occurrence frequency of the two residue types $q_{1}$ and $q_{2}$ at positions i and j.
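
As a minimal sketch of equation (5), the MI feature tensor can be computed from an integer-coded MSA as follows; the pseudo count guarding the logarithm is an assumption of the sketch, not a detail given in the text.

```python
import numpy as np

def mi_feature(msa: np.ndarray, Q: int = 22, pseudo: float = 1e-9) -> np.ndarray:
    """Raw mutual-information feature of equation (5).

    `msa` is an (N, L) array of integer residue states in [0, Q). The output
    M has shape (L, L, Q, Q) with entries f_ij * ln(f_ij / (f_i * f_j))."""
    N, L = msa.shape
    onehot = np.eye(Q)[msa]                              # (N, L, Q)
    f1 = onehot.mean(axis=0)                             # f_i(q), (L, Q)
    f2 = np.einsum('nia,njb->ijab', onehot, onehot) / N  # f_ij(q1, q2)
    denom = f1[:, None, :, None] * f1[None, :, None, :]
    return f2 * np.log((f2 + pseudo) / (denom + pseudo))
```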

For a given sequence, s, the corresponding parameters for each residue pair in the PLM and MI matrices, Pi,j (si, sj) and Mi,j (si, sj), are also extracted as additional features that measure query-specific co-evolutionary information in an MSA, where si and sj indicate the residue types at positions i and j of the query sequence. The field parameters ei and the self-mutual information Mi,i are treated as one-dimensional features and incorporated with the HMM features. The one-hot representation of the MSA and other descriptors, such as the number of sequences in the MSA, are also included. The one-dimensional and two-dimensional features are fed into deep convolutional neural networks separately, where each is passed through a set of ten one-dimensional or two-dimensional residual blocks, respectively, before being tiled together. The resulting feature representations are used as the inputs of another fully residual neural network containing 40 two-dimensional residual blocks, which outputs several inter-residue interaction terms (Fig. 1a, left, column 2).

AttentionPotential pipeline

The AttentionPotential pipeline is an improved model that can predict various inter-residue geometry potentials, including contacts, distances and HB networks. In the AttentionPotential model (Fig. 1a, left, column 1), the co-evolutionary information is extracted directly using the attention transformer mechanism, which models the interactions between residues, instead of the precomputed evolutionary coefficients used in DeepPotential. Starting from an MSA $m_{si}^{\mathrm{init}}$, with S aligned sequences and L positions, the InputEmbedder module is applied to obtain the embedded MSA representation $m_{si}$ and the pairwise representation $z_{ij}$. Additionally, the MSA embeddings and attention maps from the MSA transformer, that is, $m_{si}^{\mathrm{esm}}$ and $z_{ij}^{\mathrm{esm}}$, are linearly projected and added to $m_{si}$ and $z_{ij}$, respectively. Please note that $m_{si}^{\mathrm{esm}}$ is the MSA representation of the last hidden layer and $z_{ij}^{\mathrm{esm}}$ stacks the attention maps of each hidden layer in the MSA transformer. The obtained representations are then fed into the Evoformer model consisting of 48 Evoformer stacks. The equations that define the process are as follows:

$${{m}},{{z}}={{{\varnothing }}}_{{{e}}}\left({{{m}}}^{{\rm{init}}}\right)$$

(6)

$${{{m}}}^{{\rm{esm}}},{{{z}}\;}^{{\rm{esm}}}={{{\varnothing }}}_{{{t}}}\left({{{m}}}^{{\rm{init}}}\right)$$

(7)

$${\widehat{{{m}}}}^{{\rm{esm}}},{\widehat{{{z}}}}^{{\rm{esm}}}={{{\varnothing }}}_{{{m}}}\left({{{m}}}^{{\rm{esm}}}\right),{{{\varnothing }}}_{{{z}}}\left({{{z}}}^{\;{\rm{esm}}}\right)$$

(8)

$$\widehat{{{m}}},\widehat{{{z}}}={{{\varnothing }}}_{{\rm{Evo}}}\left({{m}}+{\widehat{{{m}}}}^{{\rm{esm}}},{{z}}+{\widehat{{{z}}}}^{{\rm{esm}}}\right),$$

(9)

where $\varnothing_{e}$ and $\varnothing_{t}$ are the InputEmbedder module and the MSA transformer, respectively, and $\varnothing_{m}$ and $\varnothing_{z}$ are the projectors for $m_{si}^{\mathrm{esm}}$ and $z_{ij}^{\mathrm{esm}}$, respectively. $\varnothing_{\mathrm{Evo}}$ defines the Evoformer, which is the backbone network of AttentionPotential. The inter-residue geometry prediction is based on $\hat{z}_{ij}$ in the form of multitask learning. Each of the geometry terms is predicted by its own projection of $\hat{z}_{ij}$, followed by a softmax layer, which produces a multinomial distribution for each residue pair.

We implemented and trained AttentionPotential with PyTorch (1.7.0). For the MSA transformer, the weights are initialized with the pretrained model75 and kept fixed during training and inference. To make the deep learning model trainable on limited resources, that is, a single V100 GPU, the channel sizes of the pair and MSA representations in the Evoformer blocks were set to 64, and the number of heads and the channel size in the MSA row- and column-wise attention were set to 8. Please note that the row- and column-wise dropout layers were not implemented because the model is of a small scale.

The CαCα contacts, CβCβ contacts, CαCα distances, CβCβ distances and Cα-based HB network geometry descriptors between residues are used as the prediction terms. The contact, distance, orientation and HB geometry values are discretized into bins, and the neural networks were trained using a cross-entropy loss.

AlphaFold2 pipeline

The AlphaFold2 pipeline was used to predict contact maps and distance restraints for D-I-TASSER across all benchmarks presented in this study. The AlphaFold2 method was originally developed by DeepMind, where an end-to-end network architecture is implemented to predict the 3D structure of monomeric proteins from an MSA and homologous templates12. In D-I-TASSER, a slightly modified version of the AlphaFold2 program has been used to predict the structural models associated with the CβCβ distance restraints, in which the default input MSA is replaced by the DeepMSA2 MSA, and the default templates are replaced by LOMETS3 templates. Finally, AlphaFold2 generates five models. The distance output from the model with the highest pLDDT score is used for guiding D-I-TASSER folding simulation together with distance restraints from DeepPotential and AttentionPotential pipelines.

Five contact predictors

In addition to contact predictions from AttentionPotential, DeepPotential and AlphaFold2, D-I-TASSER also uses contact map information from TripletRes76 (v1.0), ResTriplet77 (v1.0), ResPRE66 (v1.0), ResPLM77 (v1.0) and NeBcon78 (v1.0), the methods of which are outlined in Supplementary Note 4.

Finally, we show the selection strategies for contact, distance and HB in the following sections.

Contact selection and reranking

Due to the variation of the scoring schemes used by different contact predictors, we chose predictor-specific confidence score cutoffs corresponding to a contact precision of at least 0.5 for each range type, where long-, medium- and short-range contacts have sequence separations |i − j| ≥ 24, 23 ≥ |i − j| ≥ 12 and |i − j| ≤ 11, respectively. For each individual contact predictor p, we first rank all of the residue–residue pairs in descending order of the confidence scores predicted by that predictor. A residue–residue pair (i, j) is selected as a predicted contact if $\mathrm{conf}^{\;p}(i,j) > \mathrm{conf}_{\mathrm{cut}}^{\;p}(r)$, where $\mathrm{conf}^{\;p}(i,j)$ is the confidence score of the residue–residue pair (i, j) predicted by predictor p and $\mathrm{conf}_{\mathrm{cut}}^{\;p}(r)$ is the confidence score cutoff of predictor p at range type $r\in$ {short, medium, long}, or if Lc(p) < Lcut(p), where Lc(p) is the number of contacts currently selected by predictor p and Lcut(p) is the minimum number of contacts to be selected by predictor p. It is important to note that all of the confidence cutoffs and parameter sets were determined on a separate set of 243 training proteins: Lcut(p) = L for all predictors p; $\mathrm{conf}_{\mathrm{cut}}^{\;p}$ (short range) = 0.310, 0.418, 0.647, 0.809, 0.607, 0.604, 0.483 and 0.512; $\mathrm{conf}_{\mathrm{cut}}^{\;p}$ (medium range) = 0.328, 0.433, 0.622, 0.789, 0.581, 0.598, 0.626 and 0.652; and $\mathrm{conf}_{\mathrm{cut}}^{\;p}$ (long range) = 0.308, 0.422, 0.678, 0.806, 0.654, 0.652, 0.849 and 0.906 for AttentionPotential, DeepPotential, TripletRes, ResTriplet, ResPRE, ResPLM, NeBconB and NeBconA, respectively.

After the contacts have been selected from each contact predictor, we normalize the contact prediction results from different predictors. For each of the predicted contacts \(({{i}},{{j}})\), the new normalized confidence scores over different contact predictors are calculated as follows:

$$U_{i,j}=\frac{1}{n}\times \mathop{\sum }\limits_{p=1}^{n}w_{p}\left(i,j\right)$$

(10)

$$w_{p}\left(i,j\right)=\left\{\begin{array}{ll}2.5\times \left(1+{\mathrm{conf}}^{\;p}\left(i,j\right)-{\mathrm{conf}}_{\mathrm{cut}}^{\;p}\left(r\right)\right)\times {\mathrm{Fw}},&{\mathrm{if}}\;{\mathrm{predictor}}\;p\;{\mathrm{selects}}\;\left(i,j\right)\\ 0,&{\mathrm{otherwise}}\end{array}\right.,$$

(11)

where n is the number of predictors, $\mathrm{conf}^{\;p}(i,j)$ is the contact confidence score of the residue–residue pair (i, j) predicted by predictor p and $\mathrm{conf}_{\mathrm{cut}}^{\;p}(r)$ is the contact confidence score cutoff of predictor p at range type $r\in$ {short, medium, long}, as given above. Fw = 0.62, 1.25, 6.25 and 5 for trivial, easy, hard and very hard target types, respectively, when neff > 50, and Fw = 0.62, 1.5, 3 and 3.75, respectively, when neff < 50.
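
A minimal sketch of the normalization in equations (10) and (11) is shown below, assuming each predictor's selected contacts are supplied as an (L, L) confidence map with NaN marking unselected pairs; this input encoding is an assumption of the sketch.

```python
import numpy as np

def normalized_contact_confidence(conf_maps, cutoffs, fw):
    """Equations (10)-(11): combine per-predictor contact selections into the
    normalized confidence U_ij.

    `conf_maps[p]` is the (L, L) confidence map of predictor p (NaN where the
    pair was not selected) and `cutoffs[p]` is the range-dependent cutoff
    conf_cut^p(r) broadcast to the same shape; `fw` is the target-type
    weight Fw."""
    U = np.zeros_like(conf_maps[0], dtype=float)
    for c, cut in zip(conf_maps, cutoffs):
        w = 2.5 * (1.0 + np.nan_to_num(c) - cut) * fw   # equation (11)
        U += np.where(np.isnan(c), 0.0, w)
    return U / len(conf_maps)                           # equation (10)
```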

Distance selection

For the CαCα and CβCβ distances, four upper thresholds (10 Å, 13 Å, 16 Å and 20 Å) were used. Considering that both AttentionPotential and DeepPotential tend to have higher confidence for distance models with shorter distance cutoffs, four sets of distance profiles were generated for each method, with distance ranges of (2, 10), (2, 13), (2, 16) and (2, 20) Å divided into 18, 24, 30 and 38 distance bins, respectively. Only the distance profiles from the lower distance cutoffs were selected, that is, distances in (2–10) Å were taken from model set 1, distances in (10–13) Å from set 2, (13–16) Å from set 3 and (16–20) Å from set 4. AlphaFold2, in contrast, predicts CβCβ distances ranging from 2 Å to 22 Å, divided into 64 bins. Only one distance restraint is selected from among the AlphaFold2, AttentionPotential and DeepPotential models for a given pair (i, j), based on the higher value of

$$S_{i,j}=\frac{1}{1+{\sigma }_{i,j}-0.4\times {\sum }_{k=1}^{n}P_{i,j}\left(k\right)-0.2\times \mathop{\max }\limits_{k}\left(P_{i,j}\left(k\right)\right)},$$

(12)

where $P_{i,j}(k)$ is the probability that the distance of residue pair (i, j) falls in the kth bin, n is the number of bins and ${\sigma }_{i,j}$ is the s.d. of the distance distribution for residue pair (i, j). After $S_{i,j}$ has been used to select among the AlphaFold2, AttentionPotential and DeepPotential models for each (i, j), a second round of selection is performed to pick the set of distance restraints with the highest values of $S_{i,j}$. For trivial and easy targets, the top 0.5L, 2L and 5L distances are selected from the short (separation ≥ 3), medium and long ranges, respectively, while for hard and very hard targets, the top 0.25L, 1L and 2.5L distances are selected from the short (separation ≥ 3), medium and long ranges, respectively. The combined distances were then converted into a negative logarithm-style function used as the distance potential (equation (27)).
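
The per-pair score of equation (12) can be sketched as follows; estimating the standard deviation from the bin centers is an assumption of the sketch.

```python
import numpy as np

def distance_restraint_score(P: np.ndarray, centers: np.ndarray) -> float:
    """Equation (12): score the predicted distance histogram of one pair.

    `P` holds the probabilities of the n bins below the upper threshold and
    `centers` the corresponding bin-center distances."""
    mass = max(P.sum(), 1e-8)
    mean = (P * centers).sum() / mass
    sigma = np.sqrt((P * (centers - mean) ** 2).sum() / mass)  # s.d. of dist.
    return 1.0 / (1.0 + sigma - 0.4 * P.sum() - 0.2 * P.max())
```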

HB selection

For HBs, the AttentionPotential and DeepPotential pipelines predict the angles between the corresponding unit vectors of residues i and j (that is, ${\bf{A}}_{i}$ and ${\bf{A}}_{j}$) if the distance between i and j is below 10 Å, which is assessed using the sum of the predictive probability below the 10 Å cutoff. Please note that for each residue pair (i, j), only one set of HBs is selected, from either AttentionPotential or DeepPotential, based on whichever has the larger sum of the predictive probability. Finally, the top 5L predicted angles, ranked by their predicted probabilities, are selected. The predicted probability distribution of angles is then converted into an HB energy potential with a form similar to that of the distance energy.

Distance assessment measures

To assess the accuracy of the deep learning distance predictions, we used the measure MAEn, defined as the mean absolute error between the top k×L predicted distances and the corresponding distances calculated from the experimentally solved structures. The equation is as follows:

$${\mathrm{MAE}}_{n}=\frac{1}{kL}\mathop{\sum }\limits_{(i,\;j)}^{kL}\left|d_{i,\;j}^{\mathrm{pred}}-d_{i,\;j}^{\;\exp }\right|,$$

(13)

where $d_{i,j}^{\exp}$ is the CαCα (or CβCβ) distance between residues i and j in the experimental structure, and $d_{i,j}^{\mathrm{pred}}$ is the corresponding distance predicted by AlphaFold2, AttentionPotential or DeepPotential. Because AlphaFold2 (CβCβ), AttentionPotential (CαCα and CβCβ) and DeepPotential (CαCα and CβCβ) predict a probability distribution for each residue pair (i, j), the distance distributions were first ranked by their peak probabilities (only distances <20 Å were considered, or <22 Å for AlphaFold2). Then, the top k×L-ranked distance distributions were used to calculate ${\mathrm{MAE}}_{n}$, where $d_{i,j}^{\mathrm{pred}}$ was estimated as the middle value of the bin in which the highest probability was located. In particular, we used the top 5L-ranked long-range (|i − j| > 23) CβCβ distances from the combined AlphaFold2, AttentionPotential and DeepPotential models to calculate ${\mathrm{MAE}}_{n}$ because this choice had the maximal Pearson correlation coefficient (PCC) with the TM scores of the predicted models.

To quantify how well the predicted models fit with the predicted distances from the deep learning models, we defined another measure MAEm as the mean absolute distance error between the top k×L (where L is the protein length) predicted distances and the corresponding distances calculated from the D-I-TASSER models. The equation is as follows:

$${\mathrm{MAE}}_{m}=\frac{1}{kL}\mathop{\sum }\limits_{(i,\;j)}^{kL}\left|d_{i,\;j}^{\mathrm{mod}}-d_{i,\;j}^{\mathrm{pred}}\right|,$$

(14)

where, as for ${\mathrm{MAE}}_{n}$, the top 5L-ranked long-range (|i − j| > 23) CβCβ distances from the combination of AlphaFold2, AttentionPotential and DeepPotential were used, and $d_{i,j}^{\mathrm{mod}}$ is the CβCβ distance between residues i and j in the predicted model structure.
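
Both measures reduce to the same computation with different reference distances, as the sketch below shows; the flat-array input layout is an assumption of the sketch.

```python
import numpy as np

def mae_top_kl(d_pred, d_ref, peak_prob, k, L):
    """Equations (13)-(14): mean absolute error over the top k*L pairs ranked
    by peak probability. Passing experimental distances as `d_ref` gives
    MAE_n; passing model distances gives MAE_m."""
    top = np.argsort(-np.asarray(peak_prob))[: int(k * L)]
    return float(np.mean(np.abs(np.asarray(d_pred)[top]
                                - np.asarray(d_ref)[top])))
```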

Protocols for domain partition and multidomain structural assembly

To model multidomain proteins, we introduced a new domain partition and structural assembly module into the D-I-TASSER pipeline. In contrast to our previous domain handling module used in CASP14, which attempted to dock the domain-level models into full-chain models, the new module creates full-chain models directly from full-chain level D-I-TASSER assembly simulations under the guidance of composite domain-level and whole-chain level restraints from LOMETS and the deep learning models. The new domain partition and structural assembly module consists of the following five steps: domain boundary prediction, domain-level threading and restraint generation, full-chain level MSA collection and spatial restraint creation, full-chain level template collection and full-chain level D-I-TASSER structural assembly.

Domain boundary prediction

The domain boundaries of the query sequence are predicted by two complementary programs35,36.

First, ThreaDom (https://zhanggroup.org/ThreaDom) is a template-based algorithm for protein domain boundary prediction derived from threading alignments. Given a protein sequence, ThreaDom first threads the target through the PDB library to identify protein templates with similar structural folds. A domain conservation score (DCS) is then calculated for each residue, which combines information from the template domain structures, terminal and internal gaps and insertions. Finally, the domain boundary information is derived from the DCS profile distribution. ThreaDom is designed to predict both continuous and discontinuous domains. The templates used in ThreaDom are obtained using LOMETS3 (see ‘LOMETS3 pipeline for meta-server threading’) with the full-chain query sequence as input.

Second, FUpred (https://zhanggroup.org/FUpred) is a newly developed domain prediction method that uses a recursive strategy to detect domain boundaries based on predicted contact maps and secondary structure information. The core idea of the algorithm is to predict domain boundary locations by maximizing the number of intradomain contacts while minimizing the number of interdomain contacts from the contact maps. FUpred achieved state-of-the-art performance on domain boundary detection, especially for discontinuous domains36. The contact map used in FUpred is predicted by the deep learning module (see ‘Deep learning module for contact map, distance map and HB network prediction’) with the full-chain query sequence and deep MSA as input.

Depending on the LOMETS definition of the target class, the final boundary models are taken from ThreaDom (if the query is an easy target) or FUpred (if the query is a hard target).

Domain-level threading and restraint generation

After domain boundaries have been detected, the full-chain query sequence is divided into domain-level sequences. Subsequently, the sequence of each individual domain is input to DeepMSA2 for domain-level MSA construction, to LOMETS3 for domain-level template detection and to the deep learning module for domain-level spatial restraint prediction.

Full-chain level MSA collection and spatial restraint creation

The domain-level MSAs and the initial full-chain MSA from DeepMSA2 are used for assembling a new checkerboard-style full-chain MSA, in which the full-chain homologous sequences in the initial full-chain MSA are first put into the new MSA, followed by the placement of domain-level sequences of each domain with gap padding to all other domains (Fig. 1b). This newly assembled MSA is again fed to the deep learning module to predict a new set of full-chain-level spatial restraints (see ‘Deep learning module for contact map, distance map and HB network prediction’). The final restraint set consists of the full-chain-level deep learning restraints plus the restraints converted from domain-level deep learning restraints with reordered residue indexes.
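
A minimal sketch of the checkerboard assembly is given below, assuming continuous domains whose aligned sequences already match the domain lengths; discontinuous domains would need segment-wise padding.

```python
def checkerboard_msa(full_msa, domain_msas, boundaries, L):
    """Assemble the checkerboard-style full-chain MSA: full-chain homologs
    first, then each domain's sequences padded with gaps over all other
    positions.

    `boundaries[d]` is the (start, end) span of domain d on the full chain
    (0-based, end-exclusive) and each sequence in `domain_msas[d]` has
    length end - start."""
    rows = list(full_msa)
    for (start, end), dom_msa in zip(boundaries, domain_msas):
        for seq in dom_msa:
            rows.append("-" * start + seq + "-" * (L - end))
    return rows
```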

Full-chain level template collection

The domain-level threading templates are assembled into ‘full-chain’ templates using DEMO2 (ref. 79; v2.0, https://zhanggroup.org/DEMO). Here starting from domain-level LOMETS templates, DEMO2 identifies a set of ten analogous global template structures that cover as many domains as possible from a nonredundant multidomain protein structure library by matching each domain template to the multidomain template structures using TM-align80 (22 August 2019). A limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) optimization is then performed starting from initial global templates to detect each domain’s optimal translation vectors and rotation angles. The optimization is guided by a comprehensive energy function that includes a knowledge-based potential, a template-based potential and the interdomain spatial restraints from the deep learning module. The translation vectors and rotation angles with the lowest energy are selected to construct a set of assembled ‘full-chain’ templates. The final template set consists of the DEMO2 assembled full-chain templates plus the full-chain-level LOMETS threading templates.

Multidomain structure construction by D-I-TASSER

Starting with the full-chain templates, full-chain multidomain structural models are reassembled by D-I-TASSER simulations, which are guided by the above-collected full-chain spatial restraints. Technically, the domain-level structural folding is mainly controlled by the domain-level threading and deep learning modeling, while the interdomain orientations are guided by the full-chain-level deep learning restraints and global threading alignments, together with the inherent knowledge-based D-I-TASSER force field. A detailed description of the unified D-I-TASSER structural assembly and model selection for both single-domain and multidomain proteins is given in Methods (see ‘REMC protocol in D-I-TASSER’, ‘D-I-TASSER force field’, ‘Model selection and atomic structure generation’ and ‘Global quality estimation of D-I-TASSER structure predictions’).

REMC protocol in D-I-TASSER

D-I-TASSER is an extension of the established I-TASSER pipeline15,22 for REMC protein structure assembly simulations. The initial conformations used in the REMC simulations come from LOMETS3 threading templates, together with the full-length models built by AlphaFold2 and by DeepFold (v1.0, https://zhanggroup.org/DeepFold)81 with the spatial restraints. In the initial conformation generation step, a total of ten full-length models are created by the DeepFold L-BFGS folding system using spatial restraints collected from LOMETS3 templates (see ‘LOMETS3 pipeline for meta-server threading’) and predicted by DeepPotential or AttentionPotential (see ‘Deep learning module for contact map, distance map and HB network prediction’). To assist the L-BFGS folding process, the probabilities of the distance terms for each pair of residues are converted into smooth potentials for the gradient-descent-based protein folding system: the negative log of the raw probability histogram is interpolated using a cubic spline to derive the potentials. For the distance probability histogram of residue pair i and j, the probability ${P(i,j)}_{\mathrm{dis}}$ is a fusion probability combining the raw probability ${P(i,j)}_{\mathrm{dis}}^{\mathrm{dp}}$ predicted by DeepPotential (or AttentionPotential) and the statistical probability ${P(i,j)}_{\mathrm{dis}}^{\mathrm{tem}}$ derived from the top n ranked LOMETS3 templates with alignment coverage >0.5 for ‘easy’ targets and >0.6 for ‘hard’ targets. Here n is 50 for an ‘easy’ target and 30 for a ‘hard’ target. The fusion probability ${P(i,j)}_{\mathrm{dis}}$ is calculated as follows:

$${P\left(i,j\right)}_{\mathrm{dis}}={wP\left(i,j\right)}_{\mathrm{dis}}^{\mathrm{dp}}+\left(1-w\right){P\left(i,j\right)}_{\mathrm{dis}}^{\mathrm{tem}},$$

(15)

where w is a weight equal to 0.8. For each restraint source (DeepPotential or AttentionPotential, combined with the LOMETS3 templates), five models were generated by DeepFold with varying random seeds. Thus, a total of 15 full-length models, comprising five AlphaFold2 models, five AttentionPotential-based models and five DeepPotential-based models, are collected from the deep learning module. These models are merged with the 220 top-ranked LOMETS3 threading templates to provide the initial conformations for the D-I-TASSER REMC folding simulations.

To reduce the conformational search space, only the Cα atom of each residue is treated explicitly by restricting the Cα trace to a 3D underlying cubic lattice system with a lattice grid of 0.87 Å (Supplementary Fig. 11a). The backbone length of the structural model is allowed to fluctuate from 3.26 Å to 4.35 Å (that is, the actual distance from Cα(i) to Cα(i + 1) is required to be in the range (3.26 Å, 4.35 Å) in Supplementary Fig. 11a) to preserve sufficient flexibility for the conformational movements and geometric fidelity of the structure representation. Therefore, 312 basic vectors can be used to represent the virtual and reasonable CαCα bonds. The average vector length is about 3.8 Å, consistent with the value of real proteins. Furthermore, the reasonable CαCα bond angle is restricted to the experimental range (65°, 165°) to reduce the configurational entropy. Please note that all of the allowable CαCα bond combinations are precalculated.
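
The geometry of the lattice representation can be checked directly. The sketch below enumerates the basic bond vectors, using a squared-norm window of 14–25 in grid units (0.87 × √14 ≈ 3.26 Å and 0.87 × 5 = 4.35 Å, matching the quoted bond-length range), and recovers the 312 vectors and the ~3.8 Å average bond length.

```python
import numpy as np
from itertools import product

GRID = 0.87  # lattice spacing in angstroms

# Integer lattice vectors whose length lies in the allowed CaCa bond window:
# squared norm 14..25 in grid units, that is, about 3.26-4.35 A.
vectors = [v for v in product(range(-5, 6), repeat=3)
           if 14 <= v[0] ** 2 + v[1] ** 2 + v[2] ** 2 <= 25]
print(len(vectors))                                          # 312
print(GRID * np.mean([np.linalg.norm(v) for v in vectors]))  # ~3.8 A
```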

The positions of three consecutive Cα atoms define the local coordinate system, which in turn is used to determine the remaining two interaction units—the β carbon (Cβ; except glycine) and the center of side-group heavy atoms (SG; except glycine and alanine). As shown in Supplementary Fig. 10b, let Vi − 1 be the vector from Cα(i − 1) to Cα(i) and Ui − 1 be the unit vector for Vi − 1. Thus, the local Cartesian coordinate system can be represented in the form of

$${\bf{A}}_{i}={\bf{e}}_{xi}=\frac{{\bf{U}}_{i-1}+{\bf{U}}_{i}}{\left|{\bf{U}}_{i-1}+{\bf{U}}_{i}\right|}$$

(16)

$${\bf{B}}_{i}={\bf{e}}_{yi}=\frac{{\bf{U}}_{i-1}\times {\bf{U}}_{i}}{\left|{\bf{U}}_{i-1}\times {\bf{U}}_{i}\right|}$$

(17)

$${\bf{C}}_{i}={\bf{e}}_{zi}=\frac{{\bf{U}}_{i-1}-{\bf{U}}_{i}}{\left|{\bf{U}}_{i-1}-{\bf{U}}_{i}\right|}.$$

(18)

Here \({{{\bf{B}}}_{{{i}}}}\) is also the direction of the HB. Furthermore, we can use three inner products, AA, BB and CC (see below), to represent the hydrogen bonds.

$${\rm{AA}}={{{\bf{A}}}_{{{i}}}}\,\cdot\, {{{\bf{A}}}_{{{j}}}}$$

(19)

$${\rm{BB}}={{{\bf{B}}}_{{{i}}}}\cdot{{{\bf{B}}}_{{{j}}}}$$

(20)

$${\rm{CC}}={{{\bf{C}}}_{{{i}}}}\cdot {{{\bf{C}}}_{{{j}}}}.$$

(21)
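
Equations (16)–(21) translate directly into code; the sketch below computes the local frames and the three inner products from Cα coordinates, assuming i and j are non-terminal residues.

```python
import numpy as np

def hb_descriptors(ca: np.ndarray, i: int, j: int):
    """Equations (16)-(21): local frames (A, B, C) at residues i and j from
    consecutive Ca positions, and their inner products (AA, BB, CC).

    `ca` is an (L, 3) float array of Ca coordinates; i and j must have both
    sequence neighbors present."""
    def frame(k):
        u_prev = ca[k] - ca[k - 1]          # U_{k-1}
        u_next = ca[k + 1] - ca[k]          # U_k
        u_prev = u_prev / np.linalg.norm(u_prev)
        u_next = u_next / np.linalg.norm(u_next)
        A = (u_prev + u_next) / np.linalg.norm(u_prev + u_next)  # eq. (16)
        B = np.cross(u_prev, u_next)                             # eq. (17)
        B = B / np.linalg.norm(B)
        C = (u_prev - u_next) / np.linalg.norm(u_prev - u_next)  # eq. (18)
        return A, B, C

    Ai, Bi, Ci = frame(i)
    Aj, Bj, Cj = frame(j)
    return np.dot(Ai, Aj), np.dot(Bi, Bj), np.dot(Ci, Cj)  # eqs. (19)-(21)
```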

Let Cβ(i) be the position of the ith Cβ atom, and SG(i) be the position of the ith center of the side-group heavy atoms. Therefore, the corresponding vectors relative to Cα(i) can be represented as follows:

$${{{{{\mathbf{V}}}}}_{{{i}}}^{{{C}}_{{\beta }}}\left({{\rm{AA}}}_{{{i}}}\right)}={{{x}}}^{{{C}}_{{\beta }}}\left({{\rm{AA}}}_{{{i}}}\right)\times {{{\bf{e}}}_{{{xi}}}}\,+\,{{{y}}}^{{{C}}_{{\beta }}}\left({{\rm{AA}}}_{{{i}}}\right)\times {{{\bf{e}}}_{{{yi}}}}+{{{z}}}^{{{C}}_{{\beta }}}\left({{\rm{AA}}}_{{{i}}}\right)\times {{{\bf{e}}}_{{{zi}}}}$$

(22)

$${{\bf{V}}}_{i}^{\mathrm{SG}}\left({\mathrm{AA}}_{i}\right)=x^{\mathrm{SG}}\left({\mathrm{AA}}_{i}\right)\times {\bf{e}}_{xi}+y^{\mathrm{SG}}\left({\mathrm{AA}}_{i}\right)\times {\bf{e}}_{yi}+z^{\mathrm{SG}}\left({\mathrm{AA}}_{i}\right)\times {\bf{e}}_{zi},$$

(23)

where the parameters \({{{x}}}^{{{C}}_{{\beta }}}({{\rm{AA}}}_{{{i}}})\), \({{{y}}}^{{{C}}_{{\beta }}}({{\rm{AA}}}_{{{i}}})\), \({{{z}}}^{{{C}}_{{\beta }}}({{\rm{AA}}}_{{{i}}})\), \({{{x}}}^{{\rm{SG}}}({{\rm{AA}}}_{{{i}}})\), \({{{y}}}^{{\rm{SG}}}({{\rm{AA}}}_{{{i}}})\) and \({{{z}}}^{{\rm{SG}}}({{\rm{AA}}}_{{{i}}})\) are amino acid type-dependent statistical values that were extracted from the PDB.

The structure reassembly in D-I-TASSER is conducted by REMC simulations, which make use of the following six types of conformational movements (Supplementary Fig. 11c): (1) two-bond vector walk, (2) three-bond vector walk, (3) four-bond vector walk, (4) five-bond vector walk, (5) six-bond vector walk and (6) N- or C-terminal random walk. To speed up the simulations, the two-bond and three-bond conformational changes—referred to as movements (1) and (2)—for any given distance vector within the moving window are precalculated and rapidly applied using a look-up table. Movements (3)–(5) can also be performed rapidly by recursively conducting combinations of movements (1) and (2).

Following the standard REMC protocol, there are n simulation replicas that are implemented in parallel, with the temperature of the ith replica being

$${{{T}}}_{{{i}}}={{{T}}}_{\min }{\left(\frac{{{{T}}}_{\max }}{{{{T}}}_{\min }}\right)}^{\frac{{{i}}-1}{{{n}}-1}},$$

(24)

where Tmin and Tmax are the temperatures of the first and the last replicas, respectively. $n\in \left(40,80\right)$, ${T}_{\min }\in \left(1.6\,{k}_{B}^{-1},1.98\,{k}_{B}^{-1}\right)$ and ${T}_{\max }\in \left(66\,{k}_{B}^{-1},106\,{k}_{B}^{-1}\right)$, depending on the protein size; larger proteins use more replicas and higher temperatures. These parameter settings result in an acceptance rate of ~3% for the lowest-temperature replica and ~65% for the highest-temperature replica across different protein sizes.

As shown in Supplementary Fig. 11d, after every 200 × L local conformational movements, where L represents the protein length, a global swap movement between each pair of neighboring replicas is attempted following the standard Metropolis criterion with a probability of min \((1,\,{{\rm{e}}}^{({{{E}}}_{{{i}}}-{{{E}}}_{{{j}}})(\frac{1}{{{k}}{{{T}}}_{{{i}}}}-\frac{1}{{{k}}{{{T}}}_{{{j}}}})})\), where k is a constant and the temperature distribution is shown in equation (24). This parameter setting results in an approximate 40% acceptance rate for the swap movement between each neighboring replica.
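
As a minimal sketch, the temperature ladder of equation (24) and the swap criterion can be written as below; the unit Boltzmann constant is an assumption of the sketch.

```python
import numpy as np

def replica_temperatures(n: int, t_min: float, t_max: float) -> np.ndarray:
    """Equation (24): geometric temperature ladder T_i for replicas i = 1..n."""
    i = np.arange(1, n + 1)
    return t_min * (t_max / t_min) ** ((i - 1) / (n - 1))

def swap_accepted(e_i, e_j, t_i, t_j, k=1.0, rng=None):
    """Metropolis criterion for swapping neighboring replicas i and j."""
    rng = rng if rng is not None else np.random.default_rng()
    p = min(1.0, np.exp((e_i - e_j) * (1.0 / (k * t_i) - 1.0 / (k * t_j))))
    return rng.random() < p
```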

D-I-TASSER force field

The D-I-TASSER simulations are governed by different energy terms that achieve various effects on the generation of native-like states. The overall force field used in D-I-TASSER is as follows:

$$\begin{array}{rcl}{{E}} & = & {{{w}}}_{1}{{{E}}}_{{\rm{Sdist}}}^{{{C}}_{{\alpha}}}+{{{w}}}_{2}{{{E}}}_{{\rm{Sdist}}}^{{{C}}_{{\beta }}}+{{{w}}}_{3}{{{E}}}_{{\rm{SHB}}}+{{{w}}}_{4}{{{E}}}_{{\rm{Scon}}}^{{{C}}_{{\alpha }}}+{{{w}}}_{5}{{{E}}}_{{\rm{Scon}}}^{{{C}}_{{\beta}}}\\&& +{{{w}}}_{6}{{{E}}}_{{\rm{dist}}}^{{\rm{Short}}}+{{{w}}}_{7}{{{E}}}_{{\rm{dist}}}^{{\rm{Long}}}+\,{{{w}}}_{8}{{{E}}}_{{\rm{Tcon}}}^{{{C}}_{{\alpha}}}+{{{w}}}_{9}{{{E}}}_{{\rm{Tcon}}}^{{\rm{SG}}}\\&& +{{{w}}}_{10}{{{E}}}_{{\rm{burial}}}^{{\rm{SG}}}+{{{w}}}_{11}{{{E}}}_{\mathrm{sec}}^{{{C}}_{{\alpha }}}+{{{w}}}_{12}{{{E}}}_{{\rm{crumpling}}}+{{{w}}}_{13}{{{E}}}_{\mathrm{sec}}^{{\rm{frag}}}\\&& \,+\,{{{w}}}_{14}{{{E}}}_{{\rm{pair}}}^{{{C}}_{{\alpha }}-{\rm{SG}}}+{{{w}}}_{15}{{{E}}}_{{\rm{pair}}}^{{\rm{SG}}}+{{{w}}}_{16}{{{E}}}_{{\rm{P}}}^{{{C}}_{{\alpha }}}+{{{w}}}_{17}{{{E}}}_{{\rm{NP}}}^{{{C}}_{{\alpha }}}+{{{w}}}_{18}{{{E}}}_{{\rm{HB}}}\\&& \,+\,{{{w}}}_{19}{{{E}}}_{{\rm{corr}}}^{{{C}}_{{\alpha }}}+{{{w}}}_{20}{{{E}}}_{{\rm{vol}}}^{{\rm{SG}}}+{{{w}}}_{21}{{{E}}}_{{\rm{mvol}}}^{{\rm{SG}}}+{{{w}}}_{22}{{{E}}}_{{\rm{Spair}}1-5}^{{{C}}_{{\alpha }}}+{{{w}}}_{23}{{{E}}}_{{\rm{cprof}}}+{{{w}}}_{24}{{{E}}}_{{\rm{Ncon}}}\end{array}$$

(25)

There are 24 energy terms in the D-I-TASSER force field, which can be categorized into seven energy groups (or E groups), including (E group 1) deep learning sequence-based spatial geometric restraints, (E group 2) threading template-based restraints, (E group 3) burial interaction restraints, (E group 4) secondary structure-based restraints, (E group 5) statistical pairwise potentials, (E group 6) HB restraints and (E group 7) statistical restraints from the PDB library. Below, we explain in detail the newly developed E group 1 terms built on the deep learning restraints, while the other six E groups extended from the classical I-TASSER force fields are explained in Supplementary Note 5.

E group 1: deep-learning sequence-based spatial geometric restraints

This group, which includes the predicted distance restraints, HB restraints and contact restraints, is newly implemented to guide the folding simulations based on the deep learning predictions in D-I-TASSER.

Distance restraints

Sequence-based distances are predicted by AlphaFold2, AttentionPotential and DeepPotential; only one distance restraint is selected from among these models for a given pair (i, j), based on the higher value of the Si,j score defined in equation (12). A set of high-confidence distance restraints is selected by sorting the Si,j values (see ‘Distance selection’). The selected distances were converted into a negative logarithm-style function used as the distance potential, as described below:

$${{{E}}}_{{\rm{Sdist}}}^{{{C}}_{{\alpha }}/{{C}}_{{\beta }}}=\mathop{\sum }\limits_{{{i}}=1}^{{{L}}-1}\mathop{\sum }\limits_{{{j}} > {{i}}}^{{{L}}}{{{{E}}}}_{{\rm{Sdist}}}^{{{C}}_{{\alpha }}/{{C}}_{{\beta }}}\left({{{d}}}_{{{ij}}}\right)$$

(26)

$${{{{E}}}}_{{\rm{Sdist}}}^{{{C}}_{{\alpha }}/{{C}}_{{\beta }}}\left({{{d}}}_{{{ij}}}\right)=-\log \left(\frac{{{{P}}}_{{{ij}}}\left({{{d}}}_{{{ij}}}\right)+{{{P}}}_{{{ij}}}^{{{n}}}}{2{{{P}}}_{{{ij}}}^{{{n}}}}\right),$$

(27)

where \({{{d}}}_{{{ij}}}\) is the distance between residue pair i and j, which follows a predicted probability distribution \({{{P}}}_{{{ij}}}\). \({{{P}}}_{{{ij}}}({{{d}}}_{{{ij}}})\) is the probability that the distance is located at \({{{d}}}_{{{ij}}}\), and \({{{P}}}_{{{ij}}}^{{{n}}}\) is the probability of the last distance bin below the upper threshold (that is, 10 Å, 13 Å, 16 Å and 20 Å as described in the ‘Distance selection’). The illustration of the distance restraints is shown in Supplementary Fig. 12a.
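
A small sketch of equation (27) is given below, converting one pair's binned probability distribution into a lookup-style potential; the bin-edge lookup is an assumption of the sketch.

```python
import numpy as np

def distance_potential(P: np.ndarray, bin_edges: np.ndarray):
    """Equation (27): E(d) = -log((P(d) + P_n) / (2 * P_n)), where P_n is the
    probability of the last bin below the upper threshold.

    `P` has n bin probabilities and `bin_edges` has n + 1 edges; the returned
    callable looks up the potential for a distance d."""
    p_n = P[-1]
    energy = -np.log((P + p_n) / (2.0 * p_n))
    def E(d: float) -> float:
        k = int(np.clip(np.searchsorted(bin_edges, d) - 1, 0, len(P) - 1))
        return float(energy[k])
    return E
```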

HB restraints

The predicted probability distribution of angles is converted into an energy potential with a similar form as the distance energy, where the potential is described as follows:

$${{{{E}}}}_{{\rm{SHB}}}=\mathop{\sum }\limits_{{{i}}=2}^{{{L}}-2}\mathop{\sum }\limits_{{{j}} > {{i}}}^{{{L}}-1}{{{{E}}}}_{{\rm{SHB}}}^{{\rm{AA}}}\left({{\rm{\theta }}}_{{{ij}}}^{{\rm{AA}}}\right)+\mathop{\sum }\limits_{{{i}}=2}^{{{L}}-2}\mathop{\sum }\limits_{{{j}} > {{i}}}^{{{L}}-1}{{{{E}}}}_{{\rm{SHB}}}^{{\rm{BB}}}\left({{\rm{\theta }}}_{{{ij}}}^{{\rm{BB}}}\right)+\mathop{\sum }\limits_{{{i}}=2}^{{{L}}-2}\mathop{\sum }\limits_{{{j}} > {{i}}}^{{{L}}-1}{{{{E}}}}_{{\rm{SHB}}}^{{\rm{CC}}}\left({{\rm{\theta }}}_{{{ij}}}^{{\rm{CC}}}\right)$$

(28)

$${{{E}}}_{{\rm{SHB}}}^{\rm{{AA}}/{{BB}}/{{CC}}}\left({{\rm{\theta }}}_{{{ij}}}^{\rm{{AA}}/{{BB}}/{{CC}}}\right)=-\log \left(\frac{{{{P}}}_{{{ij}}}\left({{\rm{\theta }}}_{{{ij}}}^{\rm{{AA}}/{{BB}}/{{CC}}}\right)+{{\varepsilon }}}{{{{P}}}_{{{ij}}}^{{{n}}}+{{\varepsilon }}}\right),$$

(29)

where \({{\rm{\theta }}}_{{{ij}}}^{\rm{{AA}}/{{BB}}/{{CC}}}\) is the hydrogen angle between residue pair i and j, that is, the angle between vector \({{{{{\mathbf{A}}}}}_{{{i}}}}/{{{{{\mathbf{B}}}}}_{{{i}}}}/{{{{{\mathbf{C}}}}}_{{{i}}}}\) and \({{{{{\mathbf{A}}}}}_{{{j}}}}/{{{{{\mathbf{B}}}}}_{{{j}}}}/{{{{{\mathbf{C}}}}}_{{{j}}}}\), which follows a probability distribution \({{{P}}}_{{{ij}}}\) predicted by AttentionPotential or DeepPotential, \({{{P}}}_{{{ij}}}({{\rm{\theta }}}_{{{ij}}}^{\rm{{AA}}/{{BB}}/{{CC}}})\) is the probability that the angle is located at \({{\rm{\theta }}}_{{{ij}}}^{\rm{{AA}}/{{BB}}/{{CC}}}\) and \({{\varepsilon }}=\)1.0 × 10−4 is a pseudo count introduced to avoid the logarithm of zero. The illustration of the HB restraints is shown in Supplementary Fig. 12b. Here for each residue pair (i, j), only one set of HBs will be selected from AttentionPotential or DeepPotential, based on whichever has the largest sum of the predictive probability under the threshold of 10 Å (see ‘HB selection’).

Contact restraints

This energy term was developed to account for the restraints from the predicted contacts, where for each residue pair (i, j), the predicted contacts from different deep learning predictors are combined using equations (10) and (11) as described in ‘Deep learning module for contact map, distance map and HB network prediction’. We define it as the three-gradient contact potential, which has the following form for both Cα and Cβ atoms:

$${{{{E}}}}_{{\rm{Scon}}}^{{{C}}_{{\alpha }}/{{C}}_{{\beta }}}=\mathop{\sum }\limits_{{{i}}=1}^{{{L}}-1}\mathop{\sum }\limits_{{{j}} > {{i}}}^{{{L}}}{{{{E}}}}_{{\rm{Scon}}}^{{{C}}_{{\alpha }}/{{C}}_{{\beta }}}\left({{{d}}}_{{{ij}}}\right)$$

(30)

$${{{{E}}}}_{{\rm{Scon}}}^{{{C}}_{{\alpha }}/{{C}}_{{\beta }}}\left({{{d}}}_{{{ij}}}\right)=\left\{\begin{array}{ll}-{{{U}}}_{{{ij}}},&{{{d}}}_{{{ij}}} < {{{d}}}_{{\rm{cut}}}\\ -\frac{1}{2}{{{U}}}_{{{ij}}}\left(1-\sin \left(\frac{{{{d}}}_{{{ij}}}-\left(\frac{{{{d}}}_{{\rm{cut}}}+{{D}}}{2}\right)}{{{D}}-{{{d}}}_{{\rm{cut}}}}{\rm{\pi }}\right)\right),&{{{d}}}_{{\rm{cut}}}\le {{{d}}}_{{{ij}}} < {{D}}\\\frac{1}{2}{{{U}}}_{{{ij}}}\left(1+\sin \left(\frac{{{{d}}}_{{{ij}}}-\left(\frac{{{D}}+80}{2}\right)}{\left(80-{{D}}\right)}{{\pi }}\right)\right),&{{D}}\le {{{d}}}_{{{ij}}} < 80\,{\text{\AA }}\\ {{{U}}}_{{{ij}}},&{{{d}}}_{{{ij}}}\ge 80\,{\text{\AA }}\end{array}\right.,$$

(31)

where ${d}_{ij}$ is the Cα or Cβ distance between the ith and jth residues of the model, and ${U}_{ij}$ is calculated by equation (10). ${d}_{\mathrm{cut}}=8\,\text{Å}$ and $D=8\,\text{Å}+{d}_{\mathrm{well}}$, where ${d}_{\mathrm{well}}$ is the well width of the first sine function term and 80 − D is the well width of the second sine function term. The well width (${d}_{\mathrm{well}}$) is a crucial parameter that determines the rate at which residues predicted to be in contact are drawn together, and it was tuned based on the length of the training proteins.
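
The three-gradient form of equation (31) is easy to verify in code; in the sketch below, the 4 Å default well width is an assumption, since the tuned value is length dependent.

```python
import numpy as np

def contact_energy(d: float, U: float, d_cut: float = 8.0,
                   d_well: float = 4.0) -> float:
    """Equation (31): three-gradient contact potential for one residue pair;
    `U` is the normalized confidence U_ij from equation (10)."""
    D = d_cut + d_well
    if d < d_cut:
        return -U
    if d < D:     # first sine well: rises smoothly from -U to 0
        return -0.5 * U * (1.0 - np.sin((d - (d_cut + D) / 2.0)
                                        / (D - d_cut) * np.pi))
    if d < 80.0:  # second sine well: rises smoothly from 0 to U
        return 0.5 * U * (1.0 + np.sin((d - (D + 80.0) / 2.0)
                                       / (80.0 - D) * np.pi))
    return U
```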

Model selection and atomic structure generation

Decoy structures generated from the REMC simulations of D-I-TASSER are then clustered by SPICKER (v3.0) with the backbone atoms added by REMO (v1.0) and the side chains repacked by FASPR (v1.0) to remove steric clashes. Finally, the fragment-guided molecular dynamics (FG-MD) refinement pipeline is used to derive the atomic-level structural models.

SPICKER30 (https://zhanggroup.org/SPICKER) is a clustering algorithm to identify near-native models from a pool of protein structure decoys. The most frequently occurring conformations in the D-I-TASSER structure assembly simulations are selected by the SPICKER clustering program. These conformations correspond to the models with the lowest free energy states in the Monte Carlo simulations because the number of decoys at each conformational cluster nc is proportional to the partition function Zc, that is, \({{{n}}}_{{{C}}} \sim {{{Z}}}_{{{c}}}=\int {{{e}}}^{-{\rm{\beta }}{{E}}}{{\rm{d}}{E}}\). Thus, the logarithm of the normalized cluster size is related to the free energy of the simulation, that is, \({{F}}=-{{{k}}}_{{\rm{B}}}{{T}}\log {{Z}} \sim \log ({{{n}}}_{{{c}}}/{{{n}}}_{{\rm{tot}}})\) where ntot is the total number of decoys submitted for clustering. After SPICKER clusters the structure decoys produced by the first round of simulations, the cluster centroids are generated by averaging all the clustered structures after superposition. Because the centroid models often contain steric clashes, a second round of assembly simulations is conducted by D-I-TASSER to remove the local clashes and to further refine the global topology. Starting from the cluster centroid conformations, the REMC simulations are performed again. The distance and contact restraints in the second round of the D-I-TASSER simulations are taken from the combination of the centroid structures and the PDB structures searched by the structure alignment program TM-align80 based on the cluster centroids. The conformation with the lowest energy in the second round is selected. Finally, REMO (https://zhanggroup.org/REMO)82 is used to add backbone atoms (N, C and O), and FASPR (https://zhanggroup.org/FASPR)83 is used to build side-chain rotamers.

The FG-MD84 protocol (https://zhanggroup.org/FG-MD) is a molecular dynamics (MD)-based algorithm for atomic-level protein structure refinement. Starting from a target protein structure, the sequence is split into separate secondary structure elements (SSEs). The substructures of every three consecutive SSEs, together with the full-chain structure, are used as probes to search through a nonredundant PDB library by TM-align80 for structure fragments closest to the target. The top 20 template structures with the highest TM scores28 are used to collect spatial restraints. Simulated annealing MD simulations are then carried out using a modified version of LAMMPS85 (9 January 2009), which is guided by the following four energy potential terms: distance map restraints, explicit hydrogen bonding, a repulsive potential and the AMBER99 force field86. The final refined models are selected on the basis of the sum of the z score of the HBs, z score of the number of steric clashes and z score of the FG-MD energy.

Global quality estimation of D-I-TASSER structure predictions

The global quality of a structural model is usually assessed by the TM score (https://zhanggroup.org/TM-score) between the model and the experimental structure:

$${\rm{TM}}\,{\rm{score}}=\frac{1}{L}\mathop{\sum }\limits_{i=1}^{{L}_{{\rm{ali}}}}\frac{1}{1+{\left(\frac{{d}_{i}}{{d}_{0}}\right)}^{2}},$$

(32)

where L is the number of residues in the target, \(L_{\rm{ali}}\) is the number of aligned residue pairs, \(d_i\) is the distance between the ith pair of aligned residues and \(d_0=1.24\sqrt[3]{L-15}-1.8\) is a scaling factor. The TM score ranges between 0 and 1, with TM scores ≥0.5 indicating that the structural models have correct global topologies. Stringent statistics showed that a TM score >0.5 corresponds to two structures sharing the same fold as defined in SCOP/CATH29.
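For concreteness, a minimal Python implementation of equation (32) is sketched below; it assumes the per-residue distances of the aligned pairs have already been computed from a superposition (the actual TM score program additionally searches for the superposition that maximizes the score).

```python
import numpy as np

def tm_score(distances, L):
    # Equation (32): `distances` holds d_i (in A) for the L_ali aligned
    # residue pairs; the sum is normalized by the target length L.
    d0 = 1.24 * (L - 15) ** (1.0 / 3.0) - 1.8
    d = np.asarray(distances, dtype=float)
    return float(np.sum(1.0 / (1.0 + (d / d0) ** 2)) / L)
```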

Please note that the TM score can disagree with the widely used RMSD for some protein structure pairs. On the one hand, RMSD \((=\sqrt{\frac{1}{n}\mathop{\sum }\nolimits_{i=1}^{n}{d}_{i}^{2}})\) averages the distance errors \(d_i\) with equal weight over all residue pairs, so a large local error on a few residue pairs can produce a quite large RMSD. On the other hand, by placing \(d_i\) in the denominator, the TM score naturally weighs smaller distance errors more heavily than larger ones, which makes its value more sensitive to the global structural similarity than to local structural errors. Another advantage of the TM score is the scale \(d_0=1.24\sqrt[3]{L-15}-1.8\), which makes the magnitude of the TM score length independent for random structure pairs, whereas RMSD is a length-dependent metric28. For these reasons, our discussion of the modeling results is mainly based on the TM score; because RMSD is intuitively more familiar to most readers, however, we also list RMSD values when necessary.
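The toy comparison below (reusing the tm_score sketch above) illustrates this contrast with synthetic distance errors: a single mislocated five-residue tail inflates the RMSD roughly sevenfold, while the TM score drops only slightly.

```python
import numpy as np

d_uniform = np.full(100, 1.0)                                  # evenly accurate model
d_tail = np.concatenate([np.full(95, 1.0), np.full(5, 30.0)])  # one badly placed tail

rmsd = lambda d: float(np.sqrt(np.mean(np.square(d))))
print(rmsd(d_uniform), tm_score(d_uniform, L=100))  # ~1.0 A, TM score ~0.93
print(rmsd(d_tail), tm_score(d_tail, L=100))        # ~6.8 A, TM score ~0.88
```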

For real-world protein structure prediction, when experimental structures are not available, an estimate of the modeling accuracy is essential for users to decide how to use the models in their own research. In this study, we use an estimated TM score (eTM score), computed from the structure assembly simulations, to assess the expected accuracy of the D-I-TASSER structural models:

$$\begin{array}{lll}{\rm{eTM}}\,{\rm{score}} & = & {w}_{1}\ln \left(\frac{M}{{M}_{{\rm{total}}}}\times \frac{1}{\langle {\rm{RMSD}}\rangle }\right)+{w}_{2}\ln \left(\mathop{\prod }\limits_{m}\frac{Z\left(m\right)}{{Z}_{0}\left(m\right)}\right)\\ & & +\,{w}_{3}{w}_{{n}_{{\rm{eff}}}}\ln \left(\frac{O\left({{\rm{CM}}}^{{\rm{model}}},{{\rm{CM}}}^{{\rm{pred}}}\right)}{n\left({{\rm{CM}}}^{{\rm{pred}}}\right)}\right)\\ & & +\,{w}_{4}{w}_{{n}_{{\rm{eff}}}}\ln \left(\frac{1}{n}\mathop{\sum }\limits_{\left(i,\;j\right)}^{n}\left|{d}_{i,\;j}^{{\rm{pred}}}-{d}_{i,\;j}^{{\rm{model}}}\right|\right)+{w}_{5}{w}_{{n}_{{\rm{eff}}}}{\rm{pLDDT}}+{w}_{6}\end{array}$$

(33)

$${w}_{{n}_{{\rm{eff}}}}=\min \left(\max \left(0.98,\frac{{\log }_{2}\left({n}_{{\rm{eff}}}\right)-4}{12-4}\right),1\right),$$

(34)

where \(M_{\rm{total}}\) is the total number of decoy conformations used for clustering, M is the number of decoys in the top cluster and \(\langle{\rm{RMSD}}\rangle\) is the average RMSD among decoys in the same cluster; these three terms describe the extent of convergence of the structure assembly simulations. Z(m) is the score of the top template identified by threading method m, and \(Z_0(m)\) is a cutoff above which templates are considered reliable/good; these z-score-related measures describe the significance of the LOMETS3 threading templates and alignments. \(n({\rm{CM}}^{\rm{pred}})\) is the number of predicted contacts used to guide the REMC simulation, and \(O({\rm{CM}}^{\rm{model}},{\rm{CM}}^{\rm{pred}})\) is the number of overlapping contacts between the final model and the predicted contacts; these terms account for the contact satisfaction rate. \(d_{i,j}^{\rm{model}}\) is the Cβ–Cβ distance between residues i and j extracted from the D-I-TASSER structural model, \(d_{i,j}^{\rm{pred}}\) is the predicted Cβ–Cβ distance between residues i and j from a combination of AlphaFold2, AttentionPotential and DeepPotential, and \(n_{\rm{eff}}\) is calculated by equation (1). pLDDT is the pLDDT score from AlphaFold2. \(w_1=0.032\), \(w_2=0.010\), \(w_3=0.014\), \(w_4=-0.071\), \(w_5=-0.052\) and \(w_6=0.660\) are free parameters obtained by linear regression.
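The sketch below assembles equations (33) and (34) in Python using the published weights. It assumes all component quantities have been precomputed, and that its inputs (for example, the scale of pLDDT) follow whatever convention was used when the weights were fitted.

```python
import numpy as np

def w_neff(n_eff):
    # Equation (34): weight driven by log2(n_eff), clamped to [0.98, 1].
    return min(max(0.98, (np.log2(n_eff) - 4) / (12 - 4)), 1.0)

def etm_score(M, M_total, mean_rmsd, z_ratio_product,
              n_overlap, n_pred, mean_abs_dist_err, plddt, n_eff):
    # Equation (33) with the published regression weights.
    w1, w2, w3, w4, w5, w6 = 0.032, 0.010, 0.014, -0.071, -0.052, 0.660
    w = w_neff(n_eff)
    return (w1 * np.log((M / M_total) * (1.0 / mean_rmsd))  # convergence
            + w2 * np.log(z_ratio_product)                  # prod_m Z(m)/Z0(m)
            + w3 * w * np.log(n_overlap / n_pred)           # contact satisfaction
            + w4 * w * np.log(mean_abs_dist_err)            # mean |d_pred - d_model|
            + w5 * w * plddt
            + w6)
```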

We analyzed the ability of the eTM score to assess model quality, as shown in Fig. 5a. For the 1,492 (=1,262 single-domain + 230 multidomain) proteins in the benchmark datasets, we calculated both the true TM scores between the models and the experimental structures and the eTM scores of the predicted models. The eTM score correlated strongly with the real TM score, with a PCC of 0.79.

COFACTOR for function annotation

COFACTOR (v2.0, https://zhanggroup.org/COFACTOR)40 is a structure-, sequence- and protein–protein interaction (PPI)-based method for the biological function annotation of protein molecules. Starting from the 3D structural model, COFACTOR threads the query through the BioLiP (https://zhanggroup.org/BioLiP) protein function database using local and global structure matches to identify functional sites and homologs. Functional insights, including GO terms, EC numbers and LBSs, are then derived from the best functional homology templates.

GO term prediction

MetaGO (v1.0, https://zhanggroup.org/MetaGO)87 is used to predict the GO terms of proteins. It detects functional homologs through three pipelines: (1) local and global structure alignments, (2) sequence and sequence-profile comparison and (3) partner-homology-based PPI mapping. The final function predictions combine the outputs of the structure-based, sequence-based and PPI-based pipelines via logistic regression.
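As a schematic of this combination step, the sketch below fits a logistic regression on toy [structure, sequence, PPI] score vectors; the feature layout and data are illustrative and do not reproduce MetaGO's actual features or training set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per candidate GO term: [structure, sequence, PPI] pipeline scores.
X = np.array([[0.9, 0.8, 0.3],
              [0.2, 0.1, 0.4],
              [0.7, 0.9, 0.6],
              [0.1, 0.3, 0.2]])
y = np.array([1, 0, 1, 0])   # whether the GO term is truly associated

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[0.6, 0.7, 0.5]])[:, 1])  # combined confidence
```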

In the structure-based pipeline, the query structure is compared to a nonredundant set of known proteins in the BioLiP library88 through two sets of local and global structural alignments based on the TM-align (https://zhanggroup.org/TM-align/) algorithm80 for functional homology detection. Here BioLiP is a semi-manually curated structure–function database containing known associations between experimentally solved structures and the biological functions of proteins in terms of GO terms, EC numbers and LBSs. The current version of BioLiP contains 35,238 entries annotated with 465,838 GO terms.

In the sequence-based pipeline, the query is searched against UniProt-GOA by BLAST (2.5.0+) with an E value cutoff of 0.01 to identify sequence homologs; unreviewed annotations with 'inferred from electronic annotation' or 'no biological data available' evidence codes are excluded. Similarly, a three-iteration PSI-BLAST (2.5.0+) search of the query against the UniRef90 (ref. 59) database is performed to create a sequence profile, which is then used to jump-start a one-iteration PSI-BLAST search against UniProt-GOA.
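The two-stage profile search can be sketched with the standard BLAST+ command-line interface as below; the database names and file paths are placeholders rather than the pipeline's actual configuration.

```python
import subprocess

def profile_search(query_fasta):
    # Three PSI-BLAST iterations against UniRef90 to build a profile (PSSM).
    subprocess.run(["psiblast", "-query", query_fasta, "-db", "uniref90",
                    "-num_iterations", "3", "-evalue", "0.01",
                    "-out_pssm", "query.pssm"], check=True)
    # One iteration against UniProt-GOA, jump-started from the saved PSSM.
    subprocess.run(["psiblast", "-in_pssm", "query.pssm", "-db", "uniprot_goa",
                    "-evalue", "0.01", "-out", "goa_hits.txt"], check=True)
```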

In the PPI-based pipeline, the query is first mapped to the STRING89 PPI database by BLAST; only the BLAST hit with the most significant E value is subsequently considered. GO terms of the interaction partners, as annotated in the STRING database, are then collected and assigned to the query protein. The underlying assumption is that interacting protein partners tend to participate in the same biological pathway at the same subcellular location and, therefore, may share similar GO terms.

EC number prediction

The pipeline for EC number prediction is similar to the structure-homology-based method used in GO prediction. Enzymatic homologs are identified by aligning the target structure, using TM-align, against a library of 8,392 enzyme structures from BioLiP, with the active-site residues mapped from the Catalytic Site Atlas database90.

LBS prediction

Ligand-binding prediction in COFACTOR consists of the following three steps:

First, functional homologs are identified by matching the query structure against a nonredundant set of the BioLiP library, which currently contains 58,416 structure templates harboring a total of 76,679 LBSs for interactions between receptor proteins and small-molecule compounds, short peptides and nucleic acids. The initial binding sites are then mapped onto the query from the individual templates based on the structural alignments.

Next, the ligands from each individual template are superposed to the predicted binding sites on the query structure using superposition matrices from a local alignment of the query and template binding sites. To resolve atomic clashes, the ligand poses are refined by a short Metropolis Monte Carlo simulation under rigid-body rotation and translation.
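A toy version of such a Metropolis refinement is sketched below; the move set (small translations only), temperature and clash-scoring function are illustrative assumptions, as the actual refinement also samples rigid-body rotations.

```python
import numpy as np

def metropolis_refine(ligand_xyz, clash_energy, n_steps=500, kT=1.0, seed=0):
    # ligand_xyz: (n_atoms, 3) coordinates; clash_energy: callable scoring clashes.
    rng = np.random.default_rng(seed)
    xyz, e = ligand_xyz.copy(), clash_energy(ligand_xyz)
    for _ in range(n_steps):
        trial = xyz + rng.normal(scale=0.2, size=3)  # small rigid translation
        e_trial = clash_energy(trial)
        # Metropolis criterion: always accept downhill, sometimes uphill.
        if e_trial < e or rng.random() < np.exp(-(e_trial - e) / kT):
            xyz, e = trial, e_trial
    return xyz
```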

Finally, the consensus binding sites are obtained by clustering all ligands superposed onto the query structure, based on the distances between the ligands' centers of mass with a cutoff of 8 Å. Different ligands within the same binding pocket are further grouped by average-linkage clustering of chemical similarity, using the Tanimoto coefficient91 with a cutoff of 0.7. The model with the highest ligand-binding confidence score among all the clusters is selected.
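This two-level grouping can be sketched with SciPy's hierarchical clustering as below; the linkage scheme for the pocket step and the input formats are assumptions, and Jaccard distance on binary fingerprints is used as 1 − Tanimoto.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def group_ligands(coms, fps, com_cutoff=8.0, tc_cutoff=0.7):
    # coms: (N, 3) ligand centers of mass; fps: (N, F) binary fingerprints.
    # Step 1: pockets from center-of-mass distances with an 8 A cutoff.
    pockets = fcluster(linkage(pdist(coms), method="average"),
                       t=com_cutoff, criterion="distance")
    groups = {}
    for p in np.unique(pockets):
        idx = np.flatnonzero(pockets == p)
        if len(idx) == 1:
            groups[(p, 1)] = idx
            continue
        # Step 2: within a pocket, average-linkage clustering on
        # Jaccard distance (= 1 - Tanimoto for binary fingerprints).
        chem = fcluster(linkage(pdist(fps[idx].astype(bool), metric="jaccard"),
                                method="average"),
                        t=1.0 - tc_cutoff, criterion="distance")
        for c in np.unique(chem):
            groups[(p, c)] = idx[chem == c]
    return groups
```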

Resource requirement

The standalone version of D-I-TASSER is available for download at https://zhanggroup.org/D-I-TASSER/download/ and can be installed on any Linux-based machine, ranging from laptops to high-performance computing clusters. The package itself requires approximately 15 GB of hard disk space, with an additional 200 GB to 3 TB needed for the library, depending on whether the DeepMSA2 databases are included. We tested the D-I-TASSER standalone package on 645 proteins, with sequence lengths ranging from 30 to 350 amino acids, using ten CPUs; detailed running time comparisons are provided in Supplementary Fig. 13. On average, D-I-TASSER generates five models within 8.2 h, requiring approximately 20 GB of memory. Although this running time is longer than that of AlphaFold2 (1.2 h, with 60 GB of memory), the improved modeling performance of D-I-TASSER justifies the modest increase in computational demand, particularly when considering the substantial experimental effort and expense likely to be guided by the predictions.

Model quality assessment and data analysis

The TM score program (version of 22 August 2019) was used in this work to assess model quality, and all statistical analyses were performed with R (v4.4.2).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
