WP2 – IDR functions and evolution
Objectives
WP2 will explore the functionality of IDRs from an evolutionary standpoint, taking their unique characteristics into account. Its main objective is to drive the development of novel computational methods that capture the functional features of IDRs.
Intrinsic disorder arises from the absence of a stable hydrophobic core and is often characterised by high proportions of charged and polar amino acids, low sequence complexity, and a lack of long-range interactions. However, the connection between the types of disorder and their function is still not well understood. While IDRs have generally been linked to a lack of conservation at the sequence level, the specific sequence patterns or amino acid biases associated with different types of IDRs, such as linkers, transient binders, regions involved in amyloid formation and liquid-liquid phase separation (LLPS), and regions mediating multivalent interactions, have not yet been thoroughly examined.
WP2 will exploit recently developed alternative representations of proteins, such as the embeddings generated by protein Language Models (pLMs), to study IDR functions. pLMs represent a protein sequence as a multidimensional vector, which enables direct comparisons between proteins and captures much more information than multiple sequence alignments, which represented the state of the art until recently. Additionally, the availability of protein structure predictions of near-experimental quality, as provided by AlphaFold, allows for evolutionary analysis based on structures (3D coordinates) rather than sequences (1D strings), thereby facilitating the identification of distant relationships that were previously statistically inaccessible.
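To make the idea of comparing proteins through embedding vectors concrete, the following is a minimal sketch and not the project's method. It assumes that per-residue embeddings have already been obtained from some pLM, that they are averaged into one vector per protein (mean pooling is only one possible choice), and that cosine similarity is used as the comparison measure. The function names and the toy three-dimensional vectors are hypothetical; real pLM embeddings typically have hundreds to thousands of dimensions.

```python
import math


def mean_pool(residue_embeddings):
    """Average per-residue vectors into a single fixed-length protein vector."""
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[i] for vec in residue_embeddings) / n for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy per-residue embeddings (hypothetical values, one inner list per residue).
# Proteins of different lengths still yield vectors of the same dimension.
protein_a = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]
protein_b = [[2.0, 0.0, 1.0], [2.0, 0.0, 1.0], [2.0, 0.0, 1.0]]
protein_c = [[0.0, 1.0, 0.0]]

sim_ab = cosine_similarity(mean_pool(protein_a), mean_pool(protein_b))
sim_ac = cosine_similarity(mean_pool(protein_a), mean_pool(protein_c))
```

Because pooling produces a fixed-length vector regardless of sequence length, proteins can be compared without an alignment, which is relevant for IDRs whose sequences are often hard to align.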
This WP will concentrate on evolutionary studies of IDRs, leveraging the wealth of AI-generated data to uncover novel evolutionary relationships and elucidate the connections between IDRs and other types of non-globular proteins, such as tandem repeat and membrane proteins.
The WP will also explore the role of IDRs in phenomic studies where prokaryotes interact with higher organisms. The functional studies carried out in this WP will be instrumental for the development of a new software toolkit of portable and standardised prediction algorithms. These algorithms can be easily integrated into large-scale analysis pipelines such as Galaxy, and their predictions will be integrated into IDR databases (DisProt, MobiDB, PED, …) and core resources such as those maintained at the EMBL-EBI (InterPro, UniProtKB, IntAct, …).
Particular attention will be paid to testing the accuracy of the software tools developed, as well as their usability and carbon footprint, to improve the overall sustainability of the project's output. The IDPfun Consortium will leverage its collaboration with the Critical Assessment of Function Annotation (CAFA) initiative and the Gene Ontology (GO) Consortium.
Task list
Task 2.1 – Emerging properties of different flavours of IDRs
Details
Task 2.1 focuses on the analysis of IDRs and their known subtypes in relation to sequence patterns. The analysis will take into consideration various dimensions and features, including the level of sequence (low-)complexity in terms of amino acid composition, sequence conservation across species and structural dynamics (when available). The analysis will also extend beyond amino acid sequences to encompass genomic DNA and coding sequences, examining codon usage bias and the distribution of regulatory elements. The task aims to define parameters and constraints that can distinguish IDR subtypes from each other, and IDRs from the biases observed in other non-globular proteins such as tandem repeat and membrane proteins. Using the computational tools developed, proteome-wide analyses of these region types will be performed in order to uncover novel evolutionary relationships. The task will leverage available predicted structures, while also assessing the limitations of traditional sequence-based methodologies, such as alignment algorithms and Hidden Markov Models, for evolutionary analyses.
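As an illustration of the kind of sequence features mentioned above, the sketch below computes compositional complexity as Shannon entropy over a sliding window, together with the fraction of charged residues. It is a minimal example under stated assumptions, not the measure the task will adopt: the window length of 12, the function names and the residue grouping (D, E, K, R as charged) are illustrative choices.

```python
import math
from collections import Counter

CHARGED = set("DEKR")  # assumption: His is not counted as charged here


def shannon_entropy(seq):
    """Shannon entropy (bits) of the amino acid composition of seq."""
    n = len(seq)
    counts = Counter(seq)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def windowed_entropy(seq, window=12):
    """Entropy of every window of the given length; low values flag low complexity."""
    return [
        shannon_entropy(seq[i:i + window])
        for i in range(len(seq) - window + 1)
    ]


def fraction(seq, residues):
    """Fraction of positions in seq whose residue belongs to the given set."""
    return sum(1 for aa in seq if aa in residues) / len(seq)


# Toy sequence (hypothetical): a homopolymeric stretch followed by a mixed one.
toy = "QQQQQQQQQQQQ" + "ACDEFGHIKLMN"
profile = windowed_entropy(toy, window=12)
charged_fraction = fraction(toy, CHARGED)
```

The first window of the toy sequence contains a single residue type and therefore has zero entropy, while the last window contains twelve distinct residues and reaches the maximum possible for that window length, log2(12) bits.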
Task 2.2 – IDRs in prokaryotes
Details
Prokaryotes are the oldest life forms on Earth, and they associate with animals, plants and fungi to form so-called holobionts. In Task 2.2 we will use data on known holobionts, derived from modern phenomics studies, to validate and test the findings of Task 2.1. We will explore to what extent IDRs are present in the gene sets that prokaryotes use to associate with their eukaryotic hosts (mutualistic and pathogenic associations). We will analyse the IDRs found in these associative proteins in terms of their putative function, conformational diversity, evolutionary rates and textual form (i.e. codon usage), in particular the extent to which codons with the highest translational adaptation are preferentially enriched. The coding strategies for IDPs/IDRs from prokaryotic organisms with high and low GC content will be compared with each other and, when available, with homologs in eukaryotes. Results are expected to provide new knowledge on the incidence, type, structural features and encoding strategies of IDRs that participate in the associative life of prokaryotes.
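The codon-level quantities discussed above can be sketched as follows. This example computes GC content and relative synonymous codon usage (RSCU) for a coding sequence. RSCU is used here only as a simple, self-contained stand-in: it is not the translational adaptation measure named in the task, which would additionally require organism-specific tRNA information. The function names and the toy sequence are hypothetical; the codon assignments are those of the standard genetic code.

```python
from collections import Counter, defaultdict

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))


def gc_content(dna):
    """Fraction of G and C bases in a DNA sequence."""
    dna = dna.upper()
    return sum(1 for base in dna if base in "GC") / len(dna)


def split_codons(cds):
    """Split an in-frame coding sequence into codons, dropping a trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def rscu(cds):
    """RSCU per codon: observed count divided by the count expected under
    uniform usage of all synonymous codons for the same amino acid."""
    counts = Counter(c for c in split_codons(cds) if c in CODON_TABLE)
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":  # stop codons are excluded
            families[aa].append(codon)
    values = {}
    for aa, synonymous in families.items():
        total = sum(counts[c] for c in synonymous)
        if total == 0:
            continue  # amino acid absent from this sequence
        for c in synonymous:
            values[c] = counts[c] * len(synonymous) / total
    return values


# Toy coding fragment (hypothetical): four alanine codons, three GCT and one GCC.
toy_cds = "GCTGCTGCTGCC"
toy_rscu = rscu(toy_cds)
toy_gc = gc_content(toy_cds)
```

An RSCU value above 1 indicates a codon used more often than expected under uniform synonymous usage. Comparing such profiles between IDR-coding and non-IDR-coding segments, and between high-GC and low-GC genomes, is one simple way to frame the comparison described in this task.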
Task 2.3 – Novel software for the prediction of IDPs and their function
Details
The goal of Task 2.3 is to develop novel techniques to enhance the sequence-based prediction of IDRs and their functional roles. These approaches can take advantage of novel machine learning (ML) models based on Deep Learning (DL) and protein Language Models (pLMs), as well as the analyses provided in Tasks 2.1 and 2.2 and the evaluation setting provided in WP3. The ML models developed will follow the DOME recommendations. The prediction software will be standardised and containerised for easy execution in pipelines on Galaxy and other infrastructures, such as the CAID Prediction Portal. The novel methods will aid in the automatic annotation of IDRs with Gene Ontology (GO) terms and structural propensities, among other features.
Deliverables
D2.1
Functional studies of IDR types
[Confidential document]
New IDR type and function predictions from the sequence