Proteomics Databases: UniProt, PDB, and Other Must-Know Resources
Proteomics is the large-scale study of proteins, and it plays a central role in understanding biological processes, disease mechanisms, and drug development. Because proteins take part in almost every cellular function, researchers need access to accurate, up-to-date protein data.
Fortunately, many proteomics databases provide detailed information about protein sequences, structures, interactions, and functions. This guide introduces several commonly used ones, including UniProt, PDB, PRIDE, PeptideAtlas, STRING, and BioGRID, and explains how to use them effectively in research.
UniProt
UniProt (Universal Protein Resource) is a free, open-access database of protein sequences and functional information, which aims to provide high-quality protein information to the scientific community. It is jointly maintained by the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR) in the United States.
The core components of UniProt include:
- UniProt Knowledgebase (UniProtKB): The main protein sequence database, divided into two parts: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. The former contains manually annotated and reviewed protein sequences, ensuring high accuracy and detailed functional information. The latter contains automatically annotated protein sequences, providing wider coverage.
- UniProt Reference Clusters (UniRef): Clusters similar protein sequences to reduce redundancy and speed up sequence search and analysis.
- UniProt Archive (UniParc): A comprehensive, non-redundant archive of publicly available protein sequences, in which each unique sequence is stored only once.
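As a rough illustration of the Swiss-Prot/TrEMBL split, FASTA files downloaded from UniProtKB mark reviewed entries with an `sp|` prefix and unreviewed ones with `tr|`. The sketch below parses such headers. The helper name `parse_uniprot_fasta` is hypothetical, and the entry names, descriptions, and sequences in the sample are placeholders rather than real records.

```python
import re

# UniProtKB FASTA headers have the general shape:
#   >db|ACCESSION|ENTRY_NAME Description OS=Organism OX=TaxId ...
# where db is "sp" (Swiss-Prot, reviewed) or "tr" (TrEMBL, unreviewed).
HEADER_RE = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s*(.*)$")


def parse_uniprot_fasta(text):
    """Return one dict per FASTA record (hypothetical helper)."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            match = HEADER_RE.match(line)
            if match is None:
                raise ValueError(f"Unrecognized header: {line}")
            db, accession, entry_name, description = match.groups()
            current = {
                "accession": accession,
                "entry_name": entry_name,
                "reviewed": db == "sp",
                "description": description,
                "sequence": "",
            }
            records.append(current)
        elif current is not None:
            # Sequence lines are concatenated until the next header.
            current["sequence"] += line
    return records


# Placeholder records: names, descriptions and sequences are invented.
EXAMPLE = """>sp|P12345|DEMO_HUMAN Demo protein OS=Homo sapiens OX=9606
MKTAYIAKQR
QISFVKSHFS
>tr|A0A000XXX0|A0A000XXX0_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606
MSTNPKPQRK
"""

for rec in parse_uniprot_fasta(EXAMPLE):
    status = "Swiss-Prot" if rec["reviewed"] else "TrEMBL"
    print(rec["accession"], status, len(rec["sequence"]))
```

Filtering on the `reviewed` flag is one simple way to restrict an analysis to manually curated entries when coverage matters less than annotation quality.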
How to use UniProt
1. Visit UniProt: https://www.uniprot.org
2. Search for target proteins:
   - Direct search: Enter the protein name, gene name, UniProt accession number (such as P12345), or other relevant keywords in the search bar on the homepage, and then click the "Search" button.
   - Advanced search: Click the "Advanced" link on the right side of the search bar to open the advanced search page. Here, you can combine specific fields (such as gene name, protein name, or species) into complex queries to obtain more precise results.
3. Filter and browse search results: The search results page will display a list of proteins that match your query, including information such as entry name, protein name, gene name, species origin, and sequence length. You can use the filter options on the left side of the page to further filter the results based on species, sequence status, evidence of protein existence, etc.
4. View protein details: Click the protein entry of interest to enter its details page.
5. Download sequence and related data: On the protein details page, you can choose different formats (such as FASTA, XML, etc.) to download protein sequences or related annotation information.
6. Batch retrieval and ID mapping: If you need to query the information of multiple proteins, you can use UniProt's batch retrieval or ID mapping tools. Click the "Retrieve/ID mapping" link at the top of the page, enter the protein ID list as prompted, and select the target database to obtain relevant information in batches.
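The search and download steps above can also be scripted. The sketch below only builds request URLs and does not contact the server. It assumes the public REST endpoint at `rest.uniprot.org` and query fields such as `gene`, `organism_id`, and `reviewed`, none of which are described in this article, so check them against the current UniProt API documentation before relying on them. The function names are hypothetical.

```python
import re
from urllib.parse import urlencode

# Assumed base URL of the UniProtKB REST service.
UNIPROT_REST = "https://rest.uniprot.org/uniprotkb"

# Accession pattern as given in the UniProt help pages.
ACCESSION_RE = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def is_accession(text):
    """True if text looks like a UniProtKB accession such as P12345."""
    return bool(ACCESSION_RE.match(text))


def search_url(query, fields=("accession", "gene_names", "organism_name", "length"), size=25):
    """Build a tab-separated search request, similar to an advanced search."""
    params = {
        "query": query,
        "format": "tsv",
        "fields": ",".join(fields),
        "size": size,
    }
    return f"{UNIPROT_REST}/search?{urlencode(params)}"


def fasta_url(accession):
    """Build the FASTA download URL for a single entry."""
    if not is_accession(accession):
        raise ValueError(f"Not a UniProtKB accession: {accession}")
    return f"{UNIPROT_REST}/{accession}.fasta"


# Example: reviewed human entries for the gene TP53, and one FASTA download.
print(search_url("gene:TP53 AND organism_id:9606 AND reviewed:true"))
print(fasta_url("P12345"))
```

The resulting URLs can be opened in a browser or fetched with `urllib.request.urlopen`. Validating the accession first avoids sending requests for gene names or typos by mistake.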
Protein Data Bank
The Protein Data Bank (PDB) is an international, open-access database of three-dimensional structures of proteins and nucleic acids.
The database is composed of experimental data submitted by structural biologists around the world, mainly obtained through methods such as X-ray crystallography, nuclear magnetic resonance spectroscopy (NMR) and cryo-electron microscopy.
PDB is managed by the Worldwide Protein Data Bank (wwPDB), and its member institutions include:
- RCSB PDB (USA)
- PDBe (Europe)
- PDBj (Japan)
The data in PDB are free and open to the public and are an important resource for structural biology research. Many scientific journals and funding agencies require scientists to deposit their structural data in PDB, which helps keep the archive complete and authoritative. Several databases build on PDB data and classify or summarize it according to different principles, such as SCOP and CATH, which classify protein structures, and PDBsum, which provides a graphical overview of PDB entries. PDB initially used the fixed-column-width PDB file format and later introduced the more flexible mmCIF format, which became the archive's standard format in 2014.
Other formats include PDBML, an XML representation of PDB data. These structure files can be viewed with a variety of free or open-source programs, including Jmol, PyMOL, VMD, Mol*, and RasMol. As of March 14, 2017, the PDB contained 127,352 structures, most of them determined by X-ray crystallography. These data give scientists a valuable resource for understanding the structure and function of biological macromolecules.
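To make "fixed column width" concrete, here is a minimal sketch of reading one ATOM record from a legacy PDB file by column position. The column ranges follow the published wwPDB format description. The sample line is assembled field by field for readability, and its coordinates and temperature factor are illustrative values, not taken from a real entry. `parse_atom_line` is a hypothetical helper, not part of any library.

```python
# One ATOM record built field by field so the column layout is visible.
# All values are illustrative.
SAMPLE_ATOM_LINE = (
    "ATOM  "       # cols  1-6   record name
    "    1"        # cols  7-11  atom serial number
    " "            # col  12     blank
    " N  "         # cols 13-16  atom name
    " "            # col  17     alternate location indicator
    "MET"          # cols 18-20  residue name
    " "            # col  21     blank
    "A"            # col  22     chain identifier
    "   1"         # cols 23-26  residue sequence number
    "    "         # cols 27-30  insertion code and blanks
    "  27.340"     # cols 31-38  x coordinate (angstroms)
    "  24.430"     # cols 39-46  y coordinate
    "   2.614"     # cols 47-54  z coordinate
    "  1.00"       # cols 55-60  occupancy
    "  9.67"       # cols 61-66  temperature factor
    "          "   # cols 67-76  blanks
    " N"           # cols 77-78  element symbol
)


def parse_atom_line(line):
    """Slice an ATOM/HETATM record by fixed columns (0-based slices)."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("Not an ATOM or HETATM record")
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


atom = parse_atom_line(SAMPLE_ATOM_LINE)
print(atom["res_name"], atom["chain"], atom["res_seq"], atom["name"], atom["x"])
```

Because every field is tied to a column range, numbers that outgrow their field (for example very large atom counts) cannot be represented, which is one reason the more flexible mmCIF format was introduced. For real work, an established parser is usually a safer choice than hand-written slicing.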
How to use the Protein Data Bank
1. Visit the official PDB website: PDB is maintained by multiple organizations around the world. You can choose the appropriate portal based on your region, for example:
   - RCSB PDB (US): https://www.rcsb.org/
   - PDBe (Europe): https://www.ebi.ac.uk/pdbe/
   - PDBj (Japan): https://pdbj.org/
2. Search for target molecular structures:
   - Direct search: Enter the name of the protein or nucleic acid of interest, the PDB entry ID (such as "1A2B"), or other relevant keywords in the search bar on the homepage, and then click the "Search" button.
   - Advanced search: Using the advanced search function, you can filter by specific conditions such as molecule type, experimental method, and resolution to obtain more precise search results.
3. Browse search results: The search results page will list the structural entries that match your query, including information such as PDB ID, molecule name, experimental method, resolution, etc. You can further filter or sort the results as needed.
4. View structure details: Click on the structure entry of interest to enter its details page. This page usually contains the following:
   - Structure summary: Provides basic information such as molecular function, source species, and experimental method.
   - 3D structure viewing: With the built-in molecular visualization tool, you can interactively view the 3D structure of a molecule.
   - Download options: Structural files in multiple formats (such as PDB, mmCIF, and PDBML) are available for download, as well as related experimental data.
5. Analyze the structure using visualization tools: After downloading the structure file, you can use the following free or open source molecular visualization tools for in-depth analysis:
   - PyMOL: Powerful molecular visualization software that supports advanced drawing and animation.
   - Jmol: Java-based interactive molecular viewer suitable for teaching and research.
   - VMD: A visualization tool designed for biomolecular systems, suitable for processing large structures.
   - RasMol: A classic molecular graphics program suitable for quick viewing and simple analysis.
Through the above steps, you can effectively retrieve, view, and analyze the 3D structural information of biomacromolecules in PDB to support your research and learning needs.
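For scripted downloads, the steps above can be approximated as follows. This is a sketch under two assumptions that do not come from this article: that RCSB serves structure files from `files.rcsb.org/download/<ID>.<ext>`, and that classic PDB IDs are four characters starting with a digit from 1 to 9. The function names are hypothetical, and nothing is downloaded unless `download_structure` is called.

```python
import re
import urllib.request

# Assumed RCSB file server location.
RCSB_FILES = "https://files.rcsb.org/download"

# Classic four-character PDB ID: a digit 1-9 followed by three alphanumerics.
PDB_ID_RE = re.compile(r"^[1-9][A-Za-z0-9]{3}$")

# File extension used for each supported format.
EXTENSIONS = {"pdb": "pdb", "mmcif": "cif"}


def structure_url(pdb_id, fmt="mmcif"):
    """Build the download URL for one entry in the requested format."""
    if not PDB_ID_RE.match(pdb_id):
        raise ValueError(f"Not a four-character PDB ID: {pdb_id}")
    if fmt not in EXTENSIONS:
        raise ValueError(f"Unsupported format: {fmt}")
    return f"{RCSB_FILES}/{pdb_id.upper()}.{EXTENSIONS[fmt]}"


def download_structure(pdb_id, destination, fmt="mmcif"):
    """Fetch the file and save it locally (performs a network request)."""
    urllib.request.urlretrieve(structure_url(pdb_id, fmt), destination)
    return destination


# Example with the entry ID used earlier in this guide.
print(structure_url("1A2B"))
print(structure_url("1a2b", fmt="pdb"))
```

The saved file can then be opened in PyMOL, Jmol, VMD, or RasMol for inspection.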
Other Essential Proteomics Databases
PRIDE
PRIDE (Proteomics Identifications Database) is a public data repository dedicated to storing mass spectrometry-based proteomics data. Maintained by the European Bioinformatics Institute (EBI), PRIDE is part of the ProteomeXchange (PX) consortium, which aims to standardize the submission and dissemination of proteomics data.
Main features:
- Data submission: Researchers can submit protein and peptide identification or quantification data, as well as associated mass spectrometry data, directly to the PRIDE archive through ProteomeXchange's submission tool.
- Data access: PRIDE provides a user-friendly interface that allows scientists and researchers to browse, search, and download publicly available proteomics datasets.
PRIDE supports a variety of standard data formats, including mzML, mzIdentML, and mzTab. These formats facilitate data standardization and sharing, ensuring data reproducibility and reusability. The data in PRIDE are widely used in biomedical research, systems biology and proteomics research, supporting scientists to gain a deeper understanding of protein functions, interactions and roles in different biological processes.
Access method: https://www.ebi.ac.uk/pride/archive/
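When working with a downloaded PRIDE dataset, a small helper can sort files by the standard formats mentioned above. The sketch below is an illustration only. It assumes that ProteomeXchange dataset identifiers have the form `PXD` plus six digits, that project pages live under `/projects/<accession>` on the archive site, and that the formats use the file suffixes `.mzML`, `.mzid`, and `.mzTab`. The file names are made up.

```python
import re
from pathlib import PurePosixPath

PRIDE_ARCHIVE = "https://www.ebi.ac.uk/pride/archive"

# Assumed shape of a ProteomeXchange dataset identifier.
PXD_RE = re.compile(r"^PXD\d{6}$")

# Assumed lower-cased suffix for each standard format.
FORMAT_BY_SUFFIX = {
    ".mzml": "mzML",
    ".mzid": "mzIdentML",
    ".mztab": "mzTab",
}


def project_url(accession):
    """Build the archive page URL for a dataset (assumed URL layout)."""
    if not PXD_RE.match(accession):
        raise ValueError(f"Not a PXD accession: {accession}")
    return f"{PRIDE_ARCHIVE}/projects/{accession}"


def classify_file(filename):
    """Map a file name to a standard format name, or 'other'."""
    name = filename.lower()
    if name.endswith(".gz"):
        name = name[:-3]  # look through gzip compression
    suffix = PurePosixPath(name).suffix
    return FORMAT_BY_SUFFIX.get(suffix, "other")


# Hypothetical file listing for one dataset.
for fname in ["run01.mzML", "run01.mzid.gz", "summary.mzTab", "run01.raw"]:
    print(fname, "->", classify_file(fname))
```

Grouping files this way makes it easier to hand spectra, identifications, and summary tables to the tools that expect each format.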
PeptideAtlas
PeptideAtlas is a multi-species, publicly accessible proteomics data resource that brings together tandem mass spectrometry (MS/MS) datasets from around the world. By collecting and reprocessing these data, PeptideAtlas provides researchers with high-quality peptide identification information to support proteomics research and improved genome annotation.
Main features:
- Data collection and processing: PeptideAtlas collects mass spectrometry data from a variety of organisms, including human, mouse, and yeast. The raw data are searched with up-to-date search engines and protein sequence databases, and the results are then run through the Trans-Proteomic Pipeline (TPP) to ensure high-quality peptide identification.
- Data sharing: All processed data, including raw data, search results, and complete build results, are available for researchers to download for further analysis and research.
Currently, PeptideAtlas is maintained and developed by Professor Robert Moritz's team at the Institute for Systems Biology, and aims to achieve comprehensive annotation of eukaryotic genomes by verifying expressed proteins. PeptideAtlas data can be used to plan targeted proteomics experiments, improve genome annotations, and support data mining projects.
Access method: https://peptideatlas.org/
STRING (Search Tool for the Retrieval of Interacting Genes/Proteins)
STRING is a bioinformatics resource designed to integrate known and predicted protein-protein interaction information. The database brings together data from multiple sources, including experimental results, computational prediction methods, and public text collections. It provides researchers with a comprehensive web platform for exploring and analyzing functional associations between proteins.
Main features:
- Diverse data sources: STRING integrates experimentally verified interaction data, computational predictions (such as gene neighborhood, gene fusion events, and co-occurrence patterns), and literature mining results. This multi-level data integration improves the coverage and reliability of the interaction network.
- Extensive species coverage: As of version 11.0, STRING covers about 24.5 million proteins from 5,090 organisms, making it a valuable resource for studying protein interactions in different biological systems.
- User-friendly interface: STRING provides intuitive network visualization tools that allow users to easily browse and analyze protein interaction networks. The database also provides functional enrichment analysis to help identify significantly enriched functional categories in a set of proteins.
STRING has broad applications in systems biology, functional genomics, and proteomics research. By analyzing protein interaction networks, researchers can reveal a systems-level understanding of cellular processes, predict protein functions, identify potential drug targets, and explore the molecular mechanisms of diseases.
Access method: https://string-db.org/
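STRING networks can also be requested programmatically. The sketch below builds a request URL and filters a tab-separated result by confidence score, without contacting the server. It assumes the API layout `string-db.org/api/<format>/network` with the parameters `identifiers`, `species`, `required_score`, and `caller_identity`, and result columns named `preferredName_A`, `preferredName_B`, and `score`. These details are not covered in this article, so verify them against the STRING API documentation. The sample scores and the gene name `GENE_X` are invented.

```python
import csv
import io
from urllib.parse import urlencode

STRING_API = "https://string-db.org/api"


def network_url(identifiers, species, required_score=400, output_format="tsv"):
    """Build a network request; species is an NCBI taxonomy ID (9606 = human)."""
    params = {
        # Multiple identifiers are separated by a carriage return (%0D).
        "identifiers": "\r".join(identifiers),
        "species": species,
        "required_score": required_score,      # assumed scale: 0 to 1000
        "caller_identity": "example_script",   # hypothetical caller name
    }
    return f"{STRING_API}/{output_format}/network?{urlencode(params)}"


def high_confidence_edges(tsv_text, min_score=0.7):
    """Keep interactions whose combined score is at least min_score."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        (row["preferredName_A"], row["preferredName_B"], float(row["score"]))
        for row in reader
        if float(row["score"]) >= min_score
    ]


# Invented sample response with only the columns used above.
SAMPLE_TSV = "\n".join(
    "\t".join(cols)
    for cols in [
        ("preferredName_A", "preferredName_B", "score"),
        ("TP53", "MDM2", "0.999"),
        ("TP53", "GENE_X", "0.450"),
    ]
)

print(network_url(["TP53", "MDM2"], species=9606, required_score=700))
print(high_confidence_edges(SAMPLE_TSV))
```

Raising the score threshold trades coverage for reliability, which matters because STRING mixes experimental evidence with predictions and text mining.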
BioGRID
BioGRID (Biological General Repository for Interaction Datasets) is a manually curated biomedical database that specializes in collecting and storing information on protein-protein interactions, genetic interactions, chemical interactions, and post-translational modifications. The database is jointly maintained by institutions such as the University of Montreal, Princeton University, and Mount Sinai Hospital in Toronto, and aims to provide a comprehensive interaction data resource for major model organisms.
Main functions:
- Data collection and organization: BioGRID extracts interaction data from the biomedical literature through systematic manual curation to ensure the accuracy and reliability of the data.
- Multi-species support: As of January 2021, BioGRID covers interaction data for more than 80 species, including major model organisms such as human, mouse, and yeast.
- Data types: In addition to protein-protein and genetic interactions, BioGRID also includes information such as chemical interactions and post-translational modifications, providing a comprehensive view of biomolecular interactions.
BioGRID data are widely used in systems biology research, helping to understand the interaction network between biomolecules, reveal gene functions, and explore the molecular mechanisms of diseases.
Access method: https://thebiogrid.org
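Since BioGRID records both protein-protein (physical) and genetic interactions, a first step with a downloaded file is often to count unique interactions of each type. The sketch below assumes a tab-delimited file whose header contains the columns `Official Symbol Interactor A`, `Official Symbol Interactor B`, and `Experimental System Type`. These names are an assumption about BioGRID's download format and should be checked against the actual file. The sample rows are invented.

```python
import csv
import io
from collections import Counter

# Assumed column names in a BioGRID tab-delimited download.
COL_A = "Official Symbol Interactor A"
COL_B = "Official Symbol Interactor B"
COL_TYPE = "Experimental System Type"


def count_unique_interactions(tab_text):
    """Count distinct interactor pairs per interaction type.

    The same pair reported several times, or in the opposite order,
    is counted once per type.
    """
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    seen = set()
    counts = Counter()
    for row in reader:
        pair = frozenset((row[COL_A], row[COL_B]))  # order-independent
        key = (pair, row[COL_TYPE])
        if key in seen:
            continue
        seen.add(key)
        counts[row[COL_TYPE]] += 1
    return counts


# Invented sample rows; gene names are placeholders.
SAMPLE_TAB = "\n".join(
    "\t".join(cols)
    for cols in [
        (COL_A, COL_B, COL_TYPE),
        ("GENE_A", "GENE_B", "physical"),
        ("GENE_B", "GENE_A", "physical"),
        ("GENE_A", "GENE_C", "genetic"),
    ]
)

print(dict(count_unique_interactions(SAMPLE_TAB)))
```

Collapsing repeated reports is a design choice: curated databases list one row per publication and experiment, so raw row counts overstate the number of distinct interactions.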
How to Choose the Right Proteomics Database for Your Research
Database Categories & Use Cases
| Database Type | Example Databases | Best For |
| --- | --- | --- |
| Protein Sequences | UniProt, PeptideAtlas | Identifying protein sequences, functional annotations |
| 3D Protein Structures | PDB | Structural analysis, drug discovery |
| Protein-Protein Interactions | STRING, BioGRID | Studying molecular networks |
| Protein Expression & Abundance | HPA (Human Protein Atlas), PaxDb | Analyzing protein localization, disease studies |
| Mass Spectrometry Proteomics | PRIDE | Storing and sharing experimental data |
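For in-depth protein analysis, you will usually need to combine several of these databases, for example by mapping identifiers between them so that sequence, structure, and interaction data all refer to the same protein.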
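The table above can be encoded as a simple lookup when planning which resources a project needs. This is only a convenience sketch: the mapping copies the table, and the function name `suggest_databases` is hypothetical.

```python
# Mapping copied from the "Database Categories & Use Cases" table.
DATABASES_BY_GOAL = {
    "protein sequences": ["UniProt", "PeptideAtlas"],
    "3d protein structures": ["PDB"],
    "protein-protein interactions": ["STRING", "BioGRID"],
    "protein expression & abundance": ["HPA", "PaxDb"],
    "mass spectrometry proteomics": ["PRIDE"],
}


def suggest_databases(*goals):
    """Return the databases covering all given goals, without duplicates."""
    suggestions = []
    for goal in goals:
        key = goal.strip().lower()
        if key not in DATABASES_BY_GOAL:
            raise KeyError(f"Unknown database type: {goal}")
        for name in DATABASES_BY_GOAL[key]:
            if name not in suggestions:
                suggestions.append(name)
    return suggestions


# A project that needs sequence, structure and interaction data.
print(suggest_databases(
    "Protein Sequences",
    "3D Protein Structures",
    "Protein-Protein Interactions",
))
```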
Explore Advanced Proteomics with MetwareBio
Proteomics databases provide indispensable tools for researchers studying protein functions, interactions, and structures. Whether analyzing sequences in UniProt, visualizing structures in PDB, or exploring protein networks in STRING, leveraging these resources effectively can accelerate discoveries in biomedical research, drug development, and beyond.
As a leading provider of metabolomics and proteomics services, MetwareBio is at the forefront of biotechnology advancements, using comprehensive analytical techniques to enhance the understanding and application of human proteins. Our state-of-the-art services are designed to support industry in tackling the complexity of metabolomics, lipidomics and proteomics, empowering your business with precise, innovative solutions. Visit us at www.metwarebio.com to learn how MetwareBio can transform your product development.