The Evolution of Proteomics: From the Human Genome Project to Modern Advances
Discover the journey of proteomics from its origins to its current impact on biological research. Learn how technological advancements have transformed proteomics into a crucial field for understanding complex biological functions and diseases.
- Introduction: From the Human Genome Project to Proteomics
- The Birth of Proteomics: A Historical Perspective
- Origins of Proteomics Technology
- Rapid Development of Proteomics Technologies: From Qualitative to Quantitative
- Proteomics: Evolving Towards Maturity
Introduction: From the Human Genome Project to Proteomics
In the 1990s, the Human Genome Project captivated the world's attention, as advances in gene sequencing technology made it possible to explore the mysteries of life and decipher the "book of life." Many people expected that completing the project would unlock the secrets of human aging and disease and lead directly to major advances in medicine. However, as scientists delved deeper into the vast amount of genomic data and conducted functional genomics research, they realized that the reality was far more complex than anticipated. While genomics provided insights into gene activity and disease correlations, it became evident that most diseases are not caused solely by genetic changes. Gene expression patterns are intricate, and the same gene can play very different roles under different conditions and at different stages. Genomics alone could not answer these questions.
The term "proteome" was coined by Australian scientists Williams and Wilkins in 1995 to describe the full set of proteins expressed by an organism's genome. Whereas the genome comprises an organism's genes, the proteome encompasses all the proteins those genes encode, and it reflects the dynamic, ever-changing nature of biological systems. Proteomics is the discipline that studies the proteome using a range of technologies. Its primary goals are to investigate the types, expression levels, and modification states of all proteins within an organism, to understand the interactions and connections between proteins, and to elucidate protein function and the rules governing cellular life.
The Birth of Proteomics: A Historical Perspective
The discovery of the DNA double helix in the mid-20th century marked the advent of the molecular era in the life sciences. It was once believed that all the genetic information of a species or individual was contained within the genome, and that comprehensively decoding the genome would fully reveal the molecular basis of life. Accordingly, in the early 1990s, American scientists initiated and organized the Human Genome Project (HGP), which involved scientists from multiple countries, including China. The goal was to determine the nucleotide sequence of the 3 billion base pairs that make up the haploid human genome, thereby mapping the genome, identifying its genes, and ultimately deciphering humanity's genetic information. The project was a major step in humanity's exploration of itself and, like the Manhattan Project and the Apollo Moon landing program, a monumental feat of scientific engineering. In 2001, the draft of the human genome was published, a milestone in the project's success. However, comparison with the genomes of organisms such as yeast and fruit flies showed that the number of protein-coding genes in the human genome was only about four times that of single-celled yeast and similar to that of simpler organisms such as fruit flies.
So, what determines the characteristics of the human species and the complexity of the human body? As the Human Genome Project was completed, scientists realized that the genome alone could not answer this question. In response, the focus of biological research shifted from unraveling genetic information to studying biological functions at a holistic level, ushering in the post-genomic era, also known as the functional genomics era. Researchers began using techniques such as gene expression profiling and RNA sequencing to study gene expression in biological samples. However, large-scale proteomic analyses across multiple species have since shown that mRNA abundance correlates with protein abundance far less strongly than previously believed. A comprehensive review published in Cell in 2016 summarized these findings, reporting correlation coefficients of less than 0.4 between mRNA and protein abundance, which indicates that transcriptomic analysis alone cannot fully reflect protein expression levels. Furthermore, because of post-translational modifications, subcellular localization, conformational changes, and interactions with other biomolecules, much protein-level information cannot be obtained from DNA and mRNA alone. This prompted a shift towards directly studying proteins, the functional executors of genes, including their composition, expression, and functional patterns, in order to elucidate the fundamental principles of life. When the human genome sequence was published, Nature and Science marked the moment with articles titled "And now for the proteome" and "Proteomics in genomeland," respectively, heralding the proteomics era.
Origins of Proteomics Technology
Scientific progress relies on technological innovation and breakthroughs, and proteomics is no exception. Edman degradation, a protein sequencing technique that originated around 1950, made it possible to determine the sequences of purified proteins. Using this technique, scientists successfully determined the sequences of many important proteins such as hemoglobin and insulin. However, because the method is slow and low-throughput, researchers began seeking alternative protein and peptide identification techniques to replace it.
Mass spectrometry became a focal point for many researchers. Until the 1980s, however, most mass spectrometers were equipped with electron ionization (EI) sources. Peptides generated by proteolytic cleavage are often strongly polar and non-volatile, which makes them difficult to ionize under EI conditions. Ionization was therefore the primary obstacle in protein and peptide mass spectrometry. Researchers often resorted to chemical derivatization to determine amino acid sequences, but EI-MS analysis of derivatized peptides still had many flaws. As a result, many researchers turned their attention to the ionization mode itself, aiming to detect peptides or proteins by direct ionization.
In the late 1980s, mass spectrometry saw two groundbreaking advances: matrix-assisted laser desorption/ionization (MALDI) and electrospray ionization (ESI). MALDI technology, originally designed for analyzing non-volatile small organic molecules, reached a significant milestone when work led by Koichi Tanaka of Shimadzu Corporation in Japan enabled the direct detection of large biomolecules such as proteins. ESI, pioneered by Professor John Fenn at Yale University, introduced an alternative ionization mode that converts molecules in solution directly into gas-phase ions. This laid the groundwork for the rapid development of liquid chromatography-mass spectrometry (LC-MS). Notably, because ESI generates multiply charged ions from biomolecules, it improved the detection of high molecular weight proteins. In recognition of their contributions to the identification and structural analysis of biomolecules, Tanaka and Fenn shared the 2002 Nobel Prize in Chemistry with Swiss scientist Kurt Wüthrich, who developed nuclear magnetic resonance (NMR) methods for determining the three-dimensional structures of biological macromolecules in solution. Although MALDI and ESI greatly improved protein and peptide detection, deducing amino acid sequences from mass spectrometry data remained a challenge. Because the 20 natural amino acids can be combined in an enormous number of ways, de novo sequencing from mass spectrometry data alone is difficult.
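To illustrate why multiple charging matters, the following minimal sketch computes the mass-to-charge ratio of a protonated ion. The protein mass and the analyzer limit are hypothetical values chosen for illustration, not figures from the work described here.

```python
PROTON_MASS = 1.007276  # mass of a proton in daltons

def mz(neutral_mass: float, charge: int) -> float:
    """Return m/z for an [M + zH]z+ ion formed by adding `charge` protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge

# Hypothetical 17 kDa protein and a hypothetical analyzer limit of m/z 2000.
protein_mass = 17000.0
scan_limit = 2000.0

for z in (1, 5, 10, 20):
    value = mz(protein_mass, z)
    status = "within range" if value <= scan_limit else "out of range"
    print(f"z={z:2d}  m/z={value:9.2f}  {status}")
```

Under these assumptions, the singly charged ion falls far outside the assumed range, while the more highly charged states fall inside it, which is the practical benefit of multiple charging.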
During that period, mass spectrometry was largely limited to validating known proteins. In the late 20th century, however, the rapid advance of genomics, driven by innovative sequencing techniques, led to the deciphering of the genomes of many simple organisms. This made it possible to predict gene products such as protein sequences more accurately, which in turn made mass spectrometry data easier to interpret for both known and unknown proteins. Against this backdrop, several international research groups independently developed protein identification strategies based on peptide mass fingerprinting (PMF). In this approach, an isolated protein is enzymatically digested into a mixture of peptides, the mixture is analyzed by mass spectrometry, and the observed ions are matched against theoretical mass-to-charge ratios calculated from sequences in protein databases. Combined with high-resolution two-dimensional polyacrylamide gel electrophoresis (2-D PAGE) or high-performance liquid chromatography (HPLC), PMF was widely applied to protein identification in biological samples. Using this technique, Professors Wilkins and Humphery-Smith from the University of Sydney, Australia, identified 50 proteins in Mycoplasma in 1995 and introduced the concept of proteomics for the first time.
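The PMF workflow described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: trypsin is assumed as the enzyme (cleaving after K or R, but not before P), masses are neutral monoisotopic peptide masses rather than m/z values, the score is a plain count of matched masses, and the two-entry "database" and its names are hypothetical toy data.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def tryptic_digest(sequence):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides

def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(MONO[aa] for aa in peptide) + WATER

def pmf_score(observed_masses, protein_sequence, tol=0.01):
    """Count observed masses that match a theoretical peptide within tol Da."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(protein_sequence)]
    return sum(any(abs(m - t) <= tol for t in theoretical) for m in observed_masses)

# Hypothetical toy database; real searches use full protein sequence databases.
database = {
    "protA": "MKWVTFISLLLLFSSAYSRGVFRR",
    "protB": "GLSDGEWQQVLNVWGKVEADIAGHGQEVLIR",
}
# Simulated measurement: the peptide masses of protA.
observed = [peptide_mass(p) for p in tryptic_digest(database["protA"])]
best = max(database, key=lambda name: pmf_score(observed, database[name]))
print(best)
```

Because only peptide masses are compared, two different peptides with the same amino acid composition are indistinguishable in this scheme.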
Rapid Development of Proteomics Technologies: From Qualitative to Quantitative
Because PMF relies solely on the measured masses of intact peptides, it often produced high false positive rates in database matching, especially with low-resolution mass spectrometers such as ion traps or quadrupole analyzers. To address this, by the end of 1994, the Yates and Mann groups had independently developed two similar database search algorithms based on tandem mass spectrometry (MS/MS) data from peptides, significantly improving the efficiency and accuracy of protein identification. The former searched all fragment ions in an MS/MS spectrum against a library of theoretical MS/MS spectra generated from protein sequence databases, while the latter first performed rapid partial de novo sequencing of the MS/MS spectrum to derive short sequence tags and then used these tags to search similar theoretical libraries.
Although both algorithms had advantages and disadvantages, the former was developed and applied more extensively, eventually becoming the well-known SEQUEST algorithm. Nevertheless, what was then called "proteomics research" focused mainly on analyzing the composition of protein complexes, with little analysis of complete cellular or tissue proteomes. This was largely because offline electrophoretic or chromatographic separations restricted the throughput of proteome analysis. Researchers therefore developed many proteomics methods based on liquid chromatography-mass spectrometry (LC-MS) to improve the sensitivity and throughput of protein identification in complex samples. For instance, in 2001, the Yates group applied Multidimensional Protein Identification Technology (MudPIT), which couples online two-dimensional chromatography with mass spectrometry. This approach identified around 1,500 proteins in yeast cells for the first time, a significant leap in the number of proteins identified in a single organism.
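To show why fragment-ion information is more discriminating than peptide mass alone, the sketch below generates theoretical singly charged b and y ions for candidate peptides and counts how many observed peaks each candidate explains. The shared-peak count is a deliberately simple stand-in chosen for illustration; it is not the scoring function of SEQUEST or of the sequence-tag approach, and the peptides are hypothetical.

```python
# Monoisotopic residue masses in daltons (only the residues used below).
MONO = {"G": 57.02146, "V": 99.06841, "F": 147.06841, "R": 156.10111}
WATER = 18.01056
PROTON = 1.007276

def fragment_ions(peptide):
    """Return theoretical singly charged b-ion and y-ion m/z lists."""
    masses = [MONO[aa] for aa in peptide]
    n = len(masses)
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y_ions = [sum(masses[-i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions, y_ions

def shared_peak_count(observed, peptide, tol=0.02):
    """Count observed peaks within tol of any theoretical b or y ion."""
    b_ions, y_ions = fragment_ions(peptide)
    theoretical = b_ions + y_ions
    return sum(any(abs(o - t) <= tol for t in theoretical) for o in observed)

# Two hypothetical candidates with identical composition and precursor mass.
candidates = ["GVFR", "VGFR"]

# Simulated spectrum: the ideal fragment ions of GVFR.
b_obs, y_obs = fragment_ions("GVFR")
observed = b_obs + y_obs

for peptide in candidates:
    print(peptide, shared_peak_count(observed, peptide))
```

Both candidates would be indistinguishable by intact mass, but the fragment ions separate them: the correct sequence explains every simulated peak, while the permuted sequence explains only some.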
From then on, proteomics research entered a phase of rapid development, with new technologies and methods emerging continually. In particular, advances in mass spectrometry, such as the emergence of Orbitrap mass spectrometers and data-dependent acquisition modes, greatly enhanced the sensitivity of protein analysis [19] and significantly increased the number of proteins identified in a single proteomic analysis. Manually validating protein identifications was no longer feasible. How, then, could the accuracy of large-scale protein identification be ensured? To address this question, the Gygi research group developed an evaluation method based on the false discovery rate (FDR), estimated by searching against a decoy database of reversed protein sequences. This has become the recognized standard for assessing the accuracy of proteomic analysis [20].
Entering the 21st century, proteomics research gradually moved from qualitative identification to quantitative analysis. Representative technologies include the spectral counting method developed by the Yates group [21], isotope-coded affinity tag (ICAT) technology developed by the Aebersold group [22], stable isotope labeling by amino acids in cell culture (SILAC) developed by the Mann group [23], and isobaric tags for relative and absolute quantitation (iTRAQ) developed by Applied Biosystems [24]. In addition, some techniques widely used for small-molecule quantification, such as quantification based on MS1 signal intensity (MS1 filtering) [25] and multiple reaction monitoring (MRM) [26], have been successfully applied to protein quantification. Over the past decade, quantitative proteomics has diversified considerably, and new technologies and methods continue to emerge. For example, at the end of 2012, the Coon and Domon groups proposed parallel reaction monitoring (PRM), which quantifies peptides and proteins using all ion signals in high-resolution tandem mass spectrometry (MS/MS) spectra [27]. In the same year, the Aebersold group developed SWATH, a quantitative proteomics technology based on data-independent acquisition, which retains qualitative and quantitative information for almost all peptides and is therefore particularly suitable for digitizing and storing the proteomes of rare, trace-level biological samples [28].
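The decoy-database FDR estimate mentioned above can be sketched as follows. Decoy sequences are made by reversing target sequences, and the FDR above a score threshold is estimated here as the number of decoy hits divided by the number of target hits. That ratio is one common variant and is an assumption of this sketch; the scores and the 20% cutoff are hypothetical toy values (practical analyses typically use far more identifications and a stricter cutoff).

```python
def make_decoy(sequence):
    """Build a decoy entry by reversing a target protein sequence."""
    return sequence[::-1]

def fdr_at_threshold(psms, threshold):
    """Estimate FDR among matches scoring >= threshold as decoys / targets.

    psms is a list of (score, is_decoy) pairs, one per peptide-spectrum match.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

def score_cutoff(psms, max_fdr):
    """Return the lowest score threshold whose estimated FDR is <= max_fdr."""
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, threshold) <= max_fdr:
            best = threshold
    return best

# Hypothetical search results: (score, matched a decoy sequence?).
psms = [
    (9.1, False), (8.7, False), (8.2, False), (7.9, False), (7.5, True),
    (7.1, False), (6.8, False), (6.0, True), (5.5, False), (5.1, True),
]
print(make_decoy("PEPTIDEK"))
print(score_cutoff(psms, max_fdr=0.2))
```

The idea is that matches to reversed sequences can only be wrong, so their frequency above a threshold indicates how many target matches above that threshold are likely to be wrong as well.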
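As a rough illustration of relative quantification with isotope labels, the sketch below summarizes light and heavy peptide intensities into a single protein-level log2 ratio. Using the median of the peptide ratios is an assumption made for this example, not the documented procedure of any particular software, and the intensities are hypothetical.

```python
from math import log2
from statistics import median

def protein_log2_ratio(peptide_pairs):
    """Summarize (light, heavy) peptide intensity pairs as log2(heavy/light).

    Pairs with a non-positive intensity are skipped because their ratio
    is undefined.
    """
    ratios = [heavy / light for light, heavy in peptide_pairs if light > 0 and heavy > 0]
    if not ratios:
        raise ValueError("no usable peptide pairs")
    return log2(median(ratios))

# Hypothetical intensities for three peptides of one protein.
pairs = [(1.0e6, 2.0e6), (5.0e5, 1.0e6), (2.0e5, 8.0e5)]
print(protein_log2_ratio(pairs))
```

A median is used here so that a single outlying peptide ratio does not dominate the protein-level value.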
Proteomics: Evolving Towards Maturity
With these technological advances, the scope of proteomics research has expanded significantly, from the initial qualitative and relative expression analysis of proteins to absolute quantification, protein-protein interactions, post-translational modifications, spatial localization within organelles and subcellular compartments, and dynamic changes in proteins or modifications under specific physiological and pathological conditions. A PubMed search for the keyword "proteome" shows that the number of proteomics research papers has grown by three orders of magnitude over the past 20 years. Qualitative and quantitative techniques have also matured. For instance, in 2011, the Mann and Aebersold teams independently reported the identification of 9,207 genes encoding 10,255 proteins in HeLa cells and 7,716 genes encoding 11,548 proteins in U2OS cells, a milestone in deep coverage of the human cell proteome [29,30].
Alongside coverage depth, analysis speed has improved significantly. The rapid qualitative and quantitative proteomics workflow developed by the teams of Dr. Qin and Dr. Qian reduced the time required for deep proteome coverage from three days to 12 hours [31]. In 2014, the Coon group reported the identification of nearly 4,000 proteins within 1.3 hours, covering almost 90% of yeast gene expression products [32]. With continuously updated mass spectrometers and highly efficient chromatographic separation systems, many proteomics experiments can now identify 6,000-8,000 proteins from cell or tissue samples within 8-12 hours [33]. The precision and repeatability of quantitative proteomics have also improved greatly. For example, in 2014, the Paulovich group collaborated with laboratories in Seattle, Boston, and South Korea to perform MRM-based quantitative analysis of 319 proteins in breast cancer cells [34]. The measurements from the different laboratories correlated well, demonstrating the method's potential for standardization across laboratories and borders and supporting the use of global resources to establish standardized quantitative assays for all human proteins.
In the wake of these technological advances, proteomics research has entered a new era of rapid progress, marked by a succession of groundbreaking findings. For example, in 2014, two independent research groups simultaneously published the first drafts of the human proteome in the journal Nature. Using mass spectrometry-based proteomic techniques, they analyzed dozens of different tissues and body fluids, collectively identifying nearly 20,000 protein products encoded by genes in non-diseased human bodies. This laid the foundation for a better understanding of how the organism changes in disease states. In 2006, the United States established the Clinical Proteomic Tumor Analysis Consortium (CPTAC), a collaborative group dedicated to studying the proteomes of several major cancers, which has made significant progress in recent years. Members of this consortium, including the Liebler and Carr groups, reported large-scale proteogenomic studies of colorectal cancer and breast cancer in Nature in 2014 and 2016, respectively. They analyzed protein expression in nearly one hundred matching tumor tissues collected by The Cancer Genome Atlas (TCGA) and integrated the data with existing genomic data and clinical information, providing an important basis for the precise classification of tumors and the study of tumor biology. Another study by this consortium, on the proteogenomics of ovarian cancer, was published in the journal Cell in 2016.
Notably, as large-scale proteomics research has grown, mass spectrometry data are being generated far more quickly, which places greater demands on the storage, sharing, and quality control of proteomic data. To address this, several public proteomics repositories have been developed, such as PRIDE and PeptideAtlas. The PRIDE database developed by the European Bioinformatics Institute (http://www.ebi.ac.uk/pride/), for example, provides an open database for protein identification data in which researchers can store, share, and compare their results. This freely accessible database collects proteomic data from many sources, helps researchers retrieve data associated with peer-reviewed publications, and lets users submit their own data in standardized formats.
In just a few decades, proteomics has made significant contributions to the study of cell proliferation, differentiation, and tumor formation, covering more than ten major diseases such as leukemia, breast cancer, colorectal cancer, ovarian cancer, prostate cancer, lung cancer, kidney cancer, and neuroblastoma. It has led to the discovery of new diagnostic markers and innovative therapeutic drugs, laying an important foundation for improving disease prevention, diagnosis, and treatment, and vigorously promoting the development of "precision medicine" as a new medical model.
References:
1. Wasinger, V.C., et al., Progress with gene-product mapping of the Mollicutes: Mycoplasma genitalium. Electrophoresis, 1995. 16(7): p. 1090-4.
2. The promise of proteomics. Nature, 1999. 402(6763): p. 703.
3. Venter, J.C., et al., The sequence of the human genome. Science, 2001. 291(5507): p. 1304-51.
4. Gholami, A.M., et al., Global proteome analysis of the NCI-60 cell line panel. Cell Rep, 2013. 4(3): p. 609-20.
5. Selevsek, N., et al., Reproducible and consistent quantification of the Saccharomyces cerevisiae proteome by SWATH-mass spectrometry. Mol Cell Proteomics, 2015. 14(3): p. 739-49.
6. Jovanovic, M., et al., Immunogenetics. Dynamic profiling of the protein life cycle in response to pathogens. Science, 2015. 347(6226): p. 1259038.
7. Liu, Y., A. Beyer, and R. Aebersold, On the Dependency of Cellular Protein Levels on mRNA Abundance. Cell, 2016. 165(3): p. 535-50.
8. Abbott, A., And now for the proteome. Nature, 2001. 409(6822): p. 747.
9. Fields, S., Proteomics in genomeland. Science, 2001. 291(5507): p. 1221-4.
10. Edman, P., A method for the determination of amino acid sequence in peptides. Arch Biochem, 1949. 22(3): p. 475.
11. Fenn, J.B., et al., Electrospray ionization for mass spectrometry of large biomolecules. Science, 1989. 246(4926): p. 64-71.
12. Rosenfeld, J., et al., In-gel digestion of proteins for internal sequence analysis after one- or two-dimensional gel electrophoresis. Anal Biochem, 1992. 203(1): p. 173-9.
13. Mortz, E., et al., Identification of proteins in polyacrylamide gels by mass spectrometric peptide mapping combined with database search. Biol Mass Spectrom, 1994. 23(5): p. 249-61.
14. Monch, W. and W. Dehnen, High-performance liquid chromatography of peptides. J Chromatogr, 1977. 140(3): p. 260-2.
15. O'Farrell, P.H., High resolution two-dimensional electrophoresis of proteins. J Biol Chem, 1975. 250(10): p. 4007-21.
16. Eng, J.K., A.L. McCormack, and J.R. Yates, An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom, 1994. 5(11): p. 976-89.
17. Mann, M. and M. Wilm, Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal Chem, 1994. 66(24): p. 4390-9.
18. Washburn, M.P., D. Wolters, and J.R. Yates, 3rd, Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat Biotechnol, 2001. 19(3): p. 242-7.
19. Makarov, A., Electrostatic axially harmonic orbital trapping: a high-performance technique of mass analysis. Anal Chem, 2000. 72(6): p. 1156-62.
20. Elias, J.E., et al., Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations. Nat Methods, 2005. 2(9): p. 667-75.
21. Liu, H., R.G. Sadygov, and J.R. Yates, A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Anal Chem, 2004. 76(14): p. 4193-201.
22. Gygi, S.P., et al., Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol, 1999. 17(10): p. 994-9.
23. Ong, S.E., et al., Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol Cell Proteomics, 2002. 1(5): p. 376-86.
24. Ross, P.L., et al., Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Mol Cell Proteomics, 2004. 3(12): p. 1154-69.
25. Cox, J. and M. Mann, MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol, 2008. 26(12): p. 1367-72.
26. Anderson, L. and C.L. Hunter, Quantitative mass spectrometric multiple reaction monitoring assays for major plasma proteins. Mol Cell Proteomics, 2006. 5(4): p. 573-88.
27. Peterson, A.C., et al., Parallel reaction monitoring for high resolution and high mass accuracy quantitative, targeted proteomics. Mol Cell Proteomics, 2012. 11(11): p. 1475-88.
28. Gillet, L.C., et al., Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: a new concept for consistent and accurate proteome analysis. Mol Cell Proteomics, 2012. 11(6): p. O111.016717.
29. Nagaraj, N., et al., Deep proteome and transcriptome mapping of a human cancer cell line. Mol Syst Biol, 2011. 7: p. 548.
30. Beck, M., et al., The quantitative proteome of a human cell line. Mol Syst Biol, 2011. 7: p. 549.
31. Ding, C., et al., A fast workflow for identification and quantification of proteomes. Mol Cell Proteomics, 2013. 12(8): p. 2370-80.
32. Hebert, A.S., et al., The one hour yeast proteome. Mol Cell Proteomics, 2014. 13(1): p. 339-47.
33. Riley, N.M., A.S. Hebert, and J.J. Coon, Proteomics Moves into the Fast Lane. Cell Syst, 2016. 2(3): p. 142-3.
34. Kennedy, J.J., et al., Demonstrating the feasibility of large-scale development of standardized assays to quantify human proteins. Nat Methods, 2014. 11(2): p. 149-55.
35. Wilhelm, M., et al., Mass-spectrometry-based draft of the human proteome. Nature, 2014. 509(7502): p. 582-7.
36. Kim, M.S., et al., A draft map of the human proteome. Nature, 2014. 509(7502): p. 575-81.
37. Zhang, B., et al., Proteogenomic characterization of human colon and rectal cancer. Nature, 2014. 513(7518): p. 382-7.
38. Mertins, P., et al., Proteogenomics connects somatic mutations to signalling in breast cancer. Nature, 2016. 534(7605): p. 55-62.
39. Zhang, H., et al., Integrated Proteogenomic Characterization of Human High-Grade Serous Ovarian Cancer. Cell, 2016. 166(3): p. 755-765.