Essential for metabolomics research: a summary of commonly used public metabolite databases
Metabolomics research relies heavily on access to high-quality metabolite data for accurate identification and interpretation of biochemical pathways. Public metabolite databases play a crucial role in storing, organizing, and sharing metabolic information, enabling researchers to advance discoveries in systems biology, drug development, and disease research.
| Database | Description | Key Features | Data Coverage |
| --- | --- | --- | --- |
| COCONUT | A comprehensive open-source natural products database integrating multiple sources. | Contains chemical structures, biological activities, and metadata from various sources. | Over 400,000 natural products. |
| SciFinder | A commercial database for chemical substances, reactions, and literature search. | Extensive chemical compound and reaction data, literature references, and patent information. | Millions of chemical substances and reactions. |
| Reaxys | A chemical database offering detailed reaction and compound information. | Includes reaction mechanisms, synthesis pathways, and bioactivity data. | Over 240 million chemical substances and reactions. |
| DNP | A curated database of known natural products and their properties. | Includes chemical structures, biological activity, and references. | Over 328,000 natural products. |
| NAPRALERT | A natural products database focusing on ethnobotany, pharmacology, and biological activity. | Includes biological activities, chemical properties, and ethnomedicinal uses. | Extensive plant-derived natural products data. |
| ChEMBL | A bioactivity database containing drug-like and natural compounds. | Includes chemical structures, targets, and activity data. | Over 2.3 million bioactive compounds. |
| PubChem | A general chemical database including synthetic and natural compounds. | Chemical structures, properties, bioactivity data. | Over 100 million compound entries. |
| KEGG | A database for metabolic pathways and natural compounds. | Pathway maps, compound data, and reaction networks. | Data on 700+ organisms. |
| MetaCyc | A metabolic pathway database with curated biochemical reactions. | Pathways, enzymes, metabolites, and reactions. | Over 2,700 pathways across multiple species. |
| NPASS | A curated database for natural product activity and source organisms. | Includes bioactivity, taxonomy, and structure data. | Over 35,000 natural products. |
| KNApSAcK | A phytochemical database linking metabolites to plant species. | Includes plant metabolites, sources, and biological activities. | Over 50,000 metabolites. |
| CMAUP | A curated database focusing on bioactive compounds in traditional medicine. | Includes natural compounds, pharmacokinetics, and target data. | Over 23,000 compounds. |
1. COCONUT
COCONUT (COlleCtion of Open Natural ProdUcTs) Online is an open-source database for the storage, search and analysis of natural products (NPs). It collects data from more than 50 open NP resources and is free to use without restrictions. Each entry corresponds to a "two-dimensional" NP structure and is linked, where available, to its known stereochemical forms, literature references, producing organisms, natural geographical occurrence and various pre-computed molecular properties.
So-called "simple" searches can be performed from the search bar at the top of the page. There users can enter molecule names (e.g. "curcumin"), SMILES, InChI, InChIKeys, COCONUT IDs and molecular formulas. Name searches use native MongoDB text indexes, allowing fuzzy, flexible matching against the "name" and "synonyms" fields. The current version (2021.3) contains 406,747 unique natural products.
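The simple search accepts several kinds of identifier in one box, so a script that prepares queries in bulk may want to guess which kind each string is. The sketch below is a rough heuristic for that, not COCONUT's own logic: the function name `classify_query`, the regular expressions, and the assumed `CNP` + digits shape of a COCONUT ID are all illustrative assumptions.

```python
import re

# Illustrative heuristics only; these are NOT taken from COCONUT's code base.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")   # standard InChIKey layout
COCONUT_ID_RE = re.compile(r"^CNP\d+$")                     # assumed ID shape
FORMULA_RE = re.compile(r"^(?:[A-Z][a-z]?\d*)+$")           # e.g. "C21H20O6"
SMILES_CHARS_RE = re.compile(r"^[A-Za-z0-9@+\-\[\]\(\)=#$/\\%.:*]+$")


def classify_query(query: str) -> str:
    """Guess the identifier type of a search string.

    Returns one of: "inchi", "inchikey", "coconut_id", "formula",
    "smiles", "name". Ambiguous strings fall back to "name".
    """
    q = query.strip()
    if q.startswith("InChI="):
        return "inchi"
    if INCHIKEY_RE.match(q):
        return "inchikey"
    # Checked before the formula test, because "CNP123" also looks like a formula.
    if COCONUT_ID_RE.match(q):
        return "coconut_id"
    # Require at least one digit so that plain words are not read as formulas.
    if FORMULA_RE.match(q) and any(ch.isdigit() for ch in q):
        return "formula"
    # Only strings containing bond or branch symbols are treated as SMILES;
    # short SMILES such as "CCO" are indistinguishable from names here.
    if SMILES_CHARS_RE.match(q) and re.search(r"[=#()\[\]]", q):
        return "smiles"
    return "name"
```

A heuristic like this will misclassify some inputs (very short SMILES, for instance), so it is best used only to pre-sort queries before a manual check.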
2. SciFinder
SciFinder is produced by Chemical Abstracts Service (CAS), a division of the American Chemical Society (ACS). It is a research and development platform providing literature, substance and reaction information on chemistry and related disciplines since 1957. SciFinder covers multidisciplinary and interdisciplinary scientific and technical information in chemistry and related fields such as biology, medicine, engineering, agronomy and physics. The document types it indexes include journals, patents, conference papers, dissertations, books, technical reports, reviews and online resources. Currently, there are more than 300,000 natural metabolites in the database. Through SciFinder, you can search the following databases: CAplus℠ (literature), CAS REGISTRY℠ (substance information), CASREACT® (chemical reactions), MARPAT® (Markush structures from patents), CHEMLIST® (regulated chemicals), CHEMCATS® (commercially available chemicals) and MEDLINE® (the US National Library of Medicine literature database).
Access: Currently, the database is available to university students and staff and to some pharmaceutical companies. Registration requires an institutional email address.
3. Reaxys
Reaxys, published by Elsevier, is the world's largest database of physicochemical properties, reaction facts and medicinal chemistry data, covering more than 119 million compounds, 46 million reactions, 500 million experimentally verified data points and 53 million abstract records. It draws on the world's seven major patent offices, 16,000 journals and more than 1,000 books. Its predecessors are the long-established German Beilstein and Gmelin databases, whose literature coverage dates back to 1771.
Typical use: querying metabolic reactions.
4. DNP (Dictionary of Natural Products)
The Dictionary of Natural Products is a sub-database of the Chapman & Hall/CRC Chemical Database, a structured database of information about chemical substances. It includes descriptive and numerical data on the chemical, physical and biological properties of compounds; systematic and common names; references; and structure diagrams with their associated connection tables. The Dictionary of Natural Products is the only comprehensive and fully edited database on natural products.
5. NAPRALERT
NAPRALERT℠ stands for NAtural PRoducts ALERT. It is the world's largest literature database on ethnomedicine and traditional medicine, natural product chemistry, botanicals, microorganisms and marine medicines. NAPRALERT℠ also holds a large amount of chemical and pharmacological data (including in vivo studies) on natural metabolites of known structure. It currently contains information from 200,000 scientific papers published from 1650 to the present, of which 25% were indexed from abstracts and 75% from the original articles. These papers cover 151,000 plants, marine products, microorganisms or animals. Currently, the database contains relevant literature on about 250 plants.
6. ChEMBL
ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data, with the goal of capturing medicinal chemistry data and knowledge across the drug research and development process. Information on small molecules and their bioactivity is extracted from full-text articles in several core medicinal chemistry journals and combined with data on approved drugs and clinical development candidates, such as mechanisms of action and therapeutic indications.
7. PubChem
PubChem was established by the National Institutes of Health (NIH) and is the world's largest collection of freely accessible chemical information. It mainly contains small molecules, but also larger molecules such as nucleotides, carbohydrates, lipids, peptides and chemically modified macromolecules. Chemical substances can be searched by name, molecular formula, structure and other identifiers to find chemical and physical properties, biological activity, safety and toxicity information, patents, literature citations and more. It currently offers 110 million compounds, 278 million substances, 295 million bioactivity records, 34 million literature records and 4.2 million patents for retrieval.
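Searching by name or identifier can also be done programmatically through PubChem's PUG REST interface. The following is a minimal sketch rather than a complete client: the helper names `pubchem_property_url` and `fetch_properties` are made up for illustration, and only three of the input namespaces PUG REST supports (`name`, `cid`, `inchikey`) are handled.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# Subset of PUG REST input namespaces handled by this sketch.
SUPPORTED_NAMESPACES = {"name", "cid", "inchikey"}


def pubchem_property_url(identifier, namespace="name",
                         properties=("MolecularFormula", "MolecularWeight", "InChIKey")):
    """Build a PUG REST URL that requests compound properties as JSON."""
    if namespace not in SUPPORTED_NAMESPACES:
        raise ValueError(f"unsupported namespace: {namespace!r}")
    encoded = quote(str(identifier), safe="")  # e.g. spaces become %20
    return (f"{PUG_REST_BASE}/compound/{namespace}/{encoded}"
            f"/property/{','.join(properties)}/JSON")


def fetch_properties(identifier, namespace="name"):
    """Fetch and parse the property table (requires network access)."""
    with urlopen(pubchem_property_url(identifier, namespace), timeout=30) as resp:
        data = json.load(resp)
    # Matching records are listed under PropertyTable -> Properties.
    return data["PropertyTable"]["Properties"]
```

Calling `fetch_properties("curcumin")` would return a list of property dictionaries for the matching compounds. For batch work, requests should be throttled in line with PubChem's usage policy.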
8. KEGG
KEGG is a database resource for understanding the high-level functions and utilities of biological systems (such as cells, organisms and ecosystems) from molecular-level information, especially the large-scale molecular data sets generated by genome sequencing and other high-throughput experimental techniques. It was established in 1995 by the Kanehisa Laboratory at the Bioinformatics Center of Kyoto University, Japan, and is one of the most widely used bioinformatics databases in the world.
9. MetaCyc
MetaCyc is a sub-database of BioCyc, which provides reference genomes and metabolic pathways for thousands of sequenced organisms. MetaCyc is a database of experimentally elucidated metabolic pathways, containing 3,006 pathways from 3,349 different organisms. It covers pathways of primary and secondary metabolism together with their associated metabolites, reactions, enzymes and genes, and can be used to predict metabolic pathways in sequenced genomes.
10. NPASS (The Natural Product Activity and Species Source Database)
The current version of NPASS (V1.0) provides 35,032 unique natural products isolated from 25,041 source organisms, and 446,552 activity records against 5,863 targets. In addition to isolated natural products, many organisms (whole organisms or certain parts) have long been used as medicinal materials in traditional Chinese medicine (TCM) and other traditional medicines, so NPASS records also include the chemical constituents of TCMs and their biological activities. For example, at an activity cutoff of ≤ 10 µM, NPASS includes 719 TCM constituent chemicals active against 464 human protein targets (1,936 NP-target pairs). Its content is freely accessible and searchable by keyword, physicochemical property range, structural similarity, species and target. The database was created to provide a reliable source of highly curated NPs with structures, experimental activity values and the organisms that produce them.
11. KNApSAcK
KNApSAcK is a database describing the relationships between species and their metabolites. In 2012 it contained 41,548 geographic zone (GZ)-plant pair entries, covering 222 GZs and 15,240 medicinal/edible plants. KAMPO DB consists of 336 formulations containing 278 medicinal plants; JAMU DB consists of 5,310 formulations containing 550 medicinal plants. The Bioactivity Database consists of 2,418 bioactivities and 33,706 pairwise relationships between medicinal plants and their bioactivities. The binary relationships between the individual databases have been characterized by degree distribution analysis, which predicts at least 1,060,000 metabolites across all plants. As of the 2022 update, the database contains 57,906 substances and 141,486 metabolite-species pairs.
12. CMAUP
The CMAUP database provides a collection of molecular activities of 5,645 plants, including 2,567 medicinal plants used in 79 countries, against 646 human target proteins, together with 2,473 gene ontology terms, 234 KEGG pathways and their relationships with 656 human diseases. To build the database, the chemical constituents of plants were collated by manually checking 1,562 publications or drawn from other relevant databases, resulting in a comprehensive dataset of plant constituents. Next, the biological activities of these constituents were collected from the ChEMBL and NPASS databases by matching chemical structures. The medicinal uses of the plants were collected manually from the literature or from a few traditional medicine databases. Only constituent activities with an activity type of IC50, EC50, AC50, Ki or Potency, reported in nM against human target proteins, were included for further analysis (activities against non-human proteins were excluded). After the activities of individual plant constituents had been annotated, proteins with activities ≤ 1,000 nM (≤ 10,000 nM as the second-level cutoff) were defined as human target proteins of the plant. Gene ontology and KEGG pathway enrichment analyses were performed on the gene list of human target proteins. Approved or clinically investigated therapeutic targets and their associated human diseases were collected from the Therapeutic Target Database (TTD). Based on the TTD data, a human disease is associated with a plant when one or more of the plant's target proteins are targets for that disease.
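The activity filter and two-level cutoff described above can be sketched in a few lines. This is an illustration under stated assumptions, not CMAUP's actual pipeline: the `ActivityRecord` fields, the function names, and the use of the string "Homo sapiens" to mark human targets are hypothetical, and real records would first have to be mapped from ChEMBL or NPASS exports.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

# Activity types and cutoffs as stated in the text above.
ALLOWED_TYPES = {"IC50", "EC50", "AC50", "Ki", "Potency"}
LEVEL_1_MAX_NM = 1_000
LEVEL_2_MAX_NM = 10_000


@dataclass
class ActivityRecord:
    """Hypothetical record: one constituent activity, tagged with its plant."""
    plant: str
    target: str
    target_organism: str
    activity_type: str
    value: float
    unit: str


def target_level(rec: ActivityRecord) -> Optional[int]:
    """Return 1 or 2 for a qualifying human-target activity, else None."""
    if rec.activity_type not in ALLOWED_TYPES or rec.unit != "nM":
        return None
    if rec.target_organism != "Homo sapiens":  # non-human targets are excluded
        return None
    if rec.value <= LEVEL_1_MAX_NM:
        return 1
    if rec.value <= LEVEL_2_MAX_NM:
        return 2
    return None


def plant_targets(records: Iterable[ActivityRecord]) -> Dict[str, Dict[str, int]]:
    """Map each plant to its target proteins, keeping the best (lowest) level."""
    result: Dict[str, Dict[str, int]] = {}
    for rec in records:
        level = target_level(rec)
        if level is None:
            continue
        targets = result.setdefault(rec.plant, {})
        targets[rec.target] = min(level, targets.get(rec.target, level))
    return result
```

A plant with no qualifying activity simply does not appear in the result, which mirrors the rule that only proteins passing one of the two cutoffs count as targets of that plant.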
Discover Metabolomics Services with MetwareBio
Metabolomics is a powerful tool for unraveling the complexities of metabolism and its implications for health and disease. By understanding the full spectrum of metabolites and their interactions, researchers can gain valuable insights into disease mechanisms, identify potential biomarkers, and develop targeted therapies. If you're embarking on a metabolomics research project, MetwareBio offers a comprehensive suite of metabolomics services, including TM widely-targeted metabolomics, to help you achieve your goals. Our expertise in data analysis and extensive metabolite database will provide you with the foundation for groundbreaking discoveries.