Abstract
The database at Nutrigenetics.net has been under development since 2007 to facilitate the identification and classification of PubMed articles relevant to human genetics. A controlled vocabulary (i.e., standardized terminology) is used to index these records, with links back to PubMed for every article title. This enables the display of indexes (alphabetical subtopic listings) for any given topic, or for any given combination of topics, including for genes and specific genetic variants. Stepwise use of such indexes (first for one topic, then for combinations of topics) can reveal relationships that are otherwise easily overlooked. These relationships include environmental and lifestyle variables with potential relevance to risk modification (both beneficial and detrimental), and to prevention, or at least to the potential delay of symptom onset for health conditions such as Alzheimer disease, among many others. Thirty-four specific genetic variants have each been mentioned in at least 1,000 PubMed titles/abstracts, and these numbers are steadily increasing. The benefits of indexing with standardized terminology are illustrated for genetic variants like MTHFR 677C-T and its various synonyms (e.g., rs1801133 or Ala222Val). Such use of a controlled vocabulary is also helpful for numerous health conditions, and for potential risk (or effect) modifiers.
Introduction
The PubMed database at https://pubmed.ncbi.nlm.nih.gov/ is an online resource freely provided by the National Library of Medicine, which allows users to browse through >30 million journal article records. For highly specific queries, the search results can be kept to a manageable number of citations. However, for larger result sets, PubMed does not provide a way to sort the results by subtopics. A breakout by subtopics could help identify not only individual variables or subtopics of interest, but also general classes of subtopics like health conditions, potential risk (or effect) modifiers like drugs or dietary practices, lifestyle variables such as exercise or tobacco use, and genetic variables such as mutations, common genetic variants, or gene expression.
Some topics like gene expression have been mentioned in hundreds of thousands of PubMed articles. Regarding genetic variants, APOE4 has been specifically mentioned using one of its various synonyms in >9,000 PubMed records. The BRAF p.Val600Glu and MTHFR 677C-T variants have both been mentioned in >5,000 PubMed titles or abstracts. Table 1 lists 34 genetic variants which have each been mentioned in at least 1,000 PubMed records, and the volume of articles on these and many other variants continues to grow steadily.
In addition to the daunting volume of this literature, another challenge when using PubMed is the inconsistent terminology used by different authors, particularly when naming the genetic variants being reported. For anyone not already familiar with most of the synonyms, this can result in missing many of (sometimes even most of) the potentially relevant articles when searching in PubMed.
To help deal with such challenges, the development of the database at Nutrigenetics.net began in early 2007, with the goal of identifying PubMed articles relevant to human genetics and the related omics. Records are created in the Nutrigenetics.net database which correspond to PubMed records, but are further indexed using a controlled vocabulary which also takes into account the most common synonyms.
Perhaps the greatest benefit of using such indexing is that it enables individual online database users to create their own customized index (i.e., an alphabetical subtopic listing with subtopic article number counts) for any given topic, or combination of topics, including genes and genetic variants. Stepwise use of such indexes (first for 1 topic only, and then for ≥2 in combination) enables a multitopic exploration of potentially important relationships that can otherwise be easily overlooked. Once a topic or combination of topics is selected using the indexes for guidance, the corresponding PubMed titles can be easily displayed via the Index/Search drop-down box, and each title serves as a link back to PubMed. More complete bibliographic information, including the PubMed article IDs (PMID numbers) can be found by clicking on the “View Printer-Friendly Indexes/Searches” link.
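The stepwise use of indexes described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the database's actual implementation (which is proprietary): the record identifiers, the topic assignments, and the function name subtopic_index are all hypothetical.

```python
from collections import Counter

# Hypothetical mini-corpus: record labels mapped to controlled-vocabulary topics.
# Neither the labels nor the topic assignments come from the real database.
records = {
    "record-1": {"MTHFR 677C-T", "SMOKING", "ALZHEIMER DISEASE"},
    "record-2": {"MTHFR 677C-T", "EXERCISE"},
    "record-3": {"APOE4", "ALZHEIMER DISEASE", "EXERCISE"},
    "record-4": {"MTHFR 677C-T", "ALZHEIMER DISEASE", "EXERCISE"},
}

def subtopic_index(records, selected):
    """Return an alphabetical list of (subtopic, article count) pairs for
    the records that are indexed with every topic in `selected`."""
    selected = set(selected)
    counts = Counter()
    for topics in records.values():
        if selected <= topics:                  # record carries all selected topics
            counts.update(topics - selected)    # count only the remaining subtopics
    return sorted(counts.items())

# Step 1: index for a single topic.
print(subtopic_index(records, ["MTHFR 677C-T"]))
# Step 2: narrow to a combination of topics chosen from the first index.
print(subtopic_index(records, ["MTHFR 677C-T", "ALZHEIMER DISEASE"]))
```

Each step narrows the record set and re-counts the remaining subtopics, which is what allows a less obvious co-occurrence to surface in the second or third index.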
Exploring indexes with the aim of discovering/recognizing less obvious relationships can also stimulate the interest and engagement of a wider audience. For researchers and clinicians, this can lead to both investigational and translational opportunities, including those relevant to the prevention of health consequences. Beyond health professionals, students, educators, and other members of the public like journalists can benefit, especially in the light of the abundance of new articles which appear continuously. The large new-article volume makes it difficult for journalists and others to effectively inform members of the public, resulting in a growing information gap [1].
Another system of terminology standardization for genes and gene-related proteins has been developed by the Gene Ontology Consortium [2], but with different aims. The Nutrigenetics.net database provides a unique tool for enhancing the identification of genetics-relevant PubMed articles for almost anyone who is interested, and is especially useful for the purposes of education, engagement, and information dissemination. Because of the inherent intricacies associated with genetics and the related omics, the goal here is to equip both the public and professionals to hold a more effective dialogue, thereby facilitating partnerships that can result in appropriate applications of the emerging science.
Coping with Inconsistent Terminology
The terminology used by the Nutrigenetics.net database for indexing is often based on the National Library of Medicine’s Medical Subject Headings or MeSH terms (https://meshb.nlm.nih.gov/search). For most genes, PubMed lists a corresponding gene product/protein in their MeSH terms; however, because the Nutrigenetics.net database is gene-centric, official gene symbols (https://www.genenames.org) are used to index for both the genes themselves and for any corresponding protein products. Each gene variant name uses the gene symbol as a prefix, e.g., MTHFR 677C-T. This allows the gene and any related gene variants to be grouped together within an alphabetical index listing.
Gene variant terminology presents a particular challenge, since more than one naming convention is often used. With very few exceptions, PubMed does not list gene variants within its MeSH terms. The magnitude of the challenge is illustrated by variants that occur within a coding region of a gene, which often result in an amino acid substitution. For gene variants like MTHFR 677C-T, some authors refer not to the nucleotide substitution but to the corresponding amino acid substitution, using 3-letter abbreviations for the amino acid change (e.g., Ala222Val), while other authors use single-letter abbreviations (A222V). If the gene variant happens to be a single-nucleotide polymorphism (SNP) like MTHFR 677C-T, then some authors prefer to use the reference SNP number (“rs number”), which for this variant is rs1801133 (https://www.ncbi.nlm.nih.gov/snp/). Others may prefer to use an alternative nucleotide substitution designation for the variant, e.g., C665T or C677T.
Clearly, the multiple naming options make it more difficult to find all of the relevant literature in PubMed. When PubMed article records are carefully searched for all of the potential naming options in combination with the gene symbol or protein name (methylenetetrahydrofolate reductase in this example), the total number of PubMed articles for this gene variant is >5,000. If, instead, PubMed is searched for rs1801133 alone, without the other synonyms, then only about 500 records are currently found, i.e., only around 10% of the articles that are actually of potential relevance.
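As a rough illustration of combining the naming options with the gene symbol or protein name, the following Python sketch assembles a Boolean PubMed query string. It is a minimal sketch under stated assumptions: the function name build_query is hypothetical, the [tiab] (Title/Abstract) field tag is one possible choice, and how PubMed tokenizes hyphenated terms such as 677C-T has not been verified here, so the resulting string should be checked against actual search results.

```python
def build_query(gene_terms, variant_synonyms, field="tiab"):
    """Combine gene names and variant synonyms into one Boolean query string.

    Every term is quoted and tagged with the same search field; gene terms
    are ORed together, variant synonyms are ORed together, and the two
    groups are joined with AND.
    """
    def tag(term):
        return f'"{term}"[{field}]'

    genes = " OR ".join(tag(t) for t in gene_terms)
    variants = " OR ".join(tag(t) for t in variant_synonyms)
    return f"({genes}) AND ({variants})"

# Synonyms taken from the text above; the list is not claimed to be exhaustive.
query = build_query(
    ["MTHFR", "methylenetetrahydrofolate reductase"],
    ["677C-T", "C677T", "C665T", "Ala222Val", "A222V", "rs1801133"],
)
print(query)
```

The string can be pasted into the PubMed search box; comparing its hit count with that of rs1801133 alone gives a direct sense of how many records a single-synonym search leaves out.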
Another caveat is that when more than just a few rs numbers are mentioned in an article, typographical errors may easily creep in. Moreover, some authors append information about the nucleotide changes (alleles) to the end of the rs numbers, which can then result in missing the article when conducting a search in PubMed. For instance, a PubMed search for rs1801133 will not find rs1801133C or rs1801133C-T [3] unless a wildcard symbol (*) is added to the end of the rs number, e.g., rs1801133*.
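The effect of appended allele letters can be reproduced locally with regular expressions. The sketch below is not a model of PubMed's search engine; it only shows, on plain strings, why a whole-token match misses forms like rs1801133C-T and how a looser pattern recovers them without also matching a longer rs number. The function names and the sample strings are hypothetical.

```python
import re

def exact_token(text, rs_id):
    """True only if rs_id appears as a complete token (word boundary on both sides)."""
    return re.search(rf"\b{re.escape(rs_id)}\b", text, re.IGNORECASE) is not None

def with_suffix(text, rs_id):
    """True if rs_id appears, optionally followed by allele letters,
    but not if it is merely the prefix of a longer number."""
    return re.search(rf"\b{re.escape(rs_id)}(?!\d)", text, re.IGNORECASE) is not None

print(exact_token("carriers of rs1801133C-T", "rs1801133"))  # whole-token match fails
print(with_suffix("carriers of rs1801133C-T", "rs1801133"))  # suffix-tolerant match succeeds
print(with_suffix("a made-up id rs18011330", "rs1801133"))   # longer number is rejected
```

A plain trailing wildcard is broader than the second pattern, since it would also accept additional digits, so wildcard results may need a quick check for unrelated rs numbers.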
The use of rs numbers can be helpful for many gene variants, and their use has been encouraged [4, 5]. One major limitation is that rs numbers have not been assigned to every type of genetic variant. Even when assigned, some rs numbers like rs121912438 can refer to >1 nucleotide substitution, which can result in >1 amino acid substitution. For these reasons, a complete transition to the use of rs numbers for all types of gene variants is unlikely to occur. The Nutrigenetics.net database therefore selects one of the most commonly used synonyms as a standard, and then adds cross-references to the other synonyms.
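The idea of one standard term with cross-references from its synonyms can be pictured as a simple lookup table. The Python sketch below is illustrative only: the table layout and the function name standard_term are assumptions, and the entries are limited to the MTHFR synonyms mentioned in this article.

```python
# Hypothetical cross-reference table: synonym (upper-cased) -> standard index term.
STANDARD = "MTHFR 677C-T"
CROSS_REFERENCES = {
    "MTHFR 677C-T": STANDARD,
    "MTHFR C677T": STANDARD,
    "MTHFR C665T": STANDARD,
    "MTHFR ALA222VAL": STANDARD,
    "MTHFR A222V": STANDARD,
    "MTHFR RS1801133": STANDARD,
    "RS1801133": STANDARD,
}

def standard_term(name):
    """Return the standard index term for a synonym, or None if it is unknown.

    Matching ignores letter case and repeated whitespace.
    """
    key = " ".join(name.upper().split())
    return CROSS_REFERENCES.get(key)

print(standard_term("mthfr a222v"))
print(standard_term("rs1801133"))
```

Because every synonym resolves to the same term, records indexed under that term are retrieved regardless of which name the article's authors used.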
Besides rs numbers, another major naming convention for gene variants is to refer to the amino acid substitution, which creates its own opportunities for misinterpretation. For any gene variants that are unfamiliar, note that single-letter amino acid abbreviations can be easily confused with nucleotide base names (e.g., “A” can refer to either alanine or adenine, “T” either threonine or thymine, “C” either cysteine or cytosine, and “G” either glycine or guanine). Therefore, the Nutrigenetics.net database follows the naming convention used by the OMIM database (https://www.ncbi.nlm.nih.gov/omim), where nucleotide substitutions are denoted by the letter(s) following the location, e.g., 677C-T is used instead of C677T for the example of MTHFR described above. Although such confusion is unlikely for substitution variant names like MTHFR A222V, it is more apt to occur for others like SOD1 Gly93Ala, which is frequently abbreviated in PubMed as G93A. The database at Nutrigenetics.net deals with this by adding the following type of cross-reference to the auto-suggest feature when selecting the topic(s) of interest: SOD1 G93A (select SOD1 GLY93ALA instead). This database also displays the most common synonyms for genes and genetic variants (and many other topics) whenever an alphabetical index listing of subtopics is generated, such as that in the description column of Figure 1 for the gene MTHFR.
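The ambiguity of single-letter names, and the position-first notation for nucleotide substitutions, can both be expressed as small string checks. The following Python sketch is an assumption-laden illustration rather than the database's own logic: the function names are hypothetical, and only simple one-letter substitution names of the form letter-number-letter are handled.

```python
import re

NUCLEOTIDES = set("ACGT")
_SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")

def is_ambiguous(name):
    """True when both letters could be read as nucleotides or as amino acids,
    e.g. G93A (guanine/adenine or glycine/alanine)."""
    m = _SUBSTITUTION.match(name)
    return bool(m) and m.group(1) in NUCLEOTIDES and m.group(3) in NUCLEOTIDES

def position_first(name):
    """Rewrite a nucleotide substitution such as C677T as 677C-T.

    The caller must already know that the name denotes a nucleotide change;
    this function cannot tell G93A (an amino acid change) apart from one.
    """
    m = _SUBSTITUTION.match(name)
    if not m or not {m.group(1), m.group(3)} <= NUCLEOTIDES:
        raise ValueError(f"not a one-letter nucleotide substitution: {name!r}")
    return f"{m.group(2)}{m.group(1)}-{m.group(3)}"

print(is_ambiguous("G93A"))      # both letters are also nucleotide codes
print(is_ambiguous("A222V"))     # V is not a nucleotide code
print(position_first("C677T"))
```

Names flagged by is_ambiguous are the ones for which a cross-reference to the 3-letter amino acid form is most useful.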
Coping with Article Overload
Now that the PubMed evidence base has become quite extensive for genetics-related topics, database assistance with information management is clearly useful. The 34 genetic variants listed in Table 1 (each of which has been mentioned in at least 1,000 PubMed records) represent a total of >64,000 PubMed records when combined (for the publication year 2000 up to mid-2020). The middle and right-most sections of Table 1 show how many articles from within this group of >64,000 PubMed citations also mention ≥1 of the various health conditions or potential risk (or effect) modifiers listed.
Lists of potential risk/effect modifiers like those shown in Table 1 can be useful for identifying opportunities for more research or for practical applications, including opportunities for managing or mitigating risks. However, it should be cautioned that risk/effect modifiers can be either helpful or harmful depending on the circumstances, which is why individualized, precision healthcare has been gaining attention.
The potential role of both environmental and lifestyle factors is illustrated by even just a cursory inspection of the potential risk/effect modifiers listed, including things like nutrition, smoking, alcohol consumption, stress, or exercise, to name just a few. Although many of these are already recognized by health professionals, this type of listing, especially when correlated with the matching PubMed records, can be extremely useful for educational purposes, disseminating information, and engaging members of the public. Several genomics-guided studies have also found encouraging results with regard to study outcomes [6-8].
Growing Environmental and Lifestyle Opportunities, Including Prevention
There are tens of thousands of PubMed articles which touch on genetics/genomics and the environment, with articles on lifestyle also gaining momentum [9-11]. As might be expected, some of the most commonly mentioned potential lifestyle/environment risk/effect modifiers include smoking, exercise or the lack thereof (a sedentary lifestyle), pollution of various types, alcohol, social environment, vitamins or dietary supplements, dietary fats, stress, intestinal microbiota, pesticides, fruits and vegetables, and meats (especially red or processed meats).
Awareness of potential risk/effect modifiers can also be an important factor when it comes to prevention. The number of PubMed article records relevant to risk score, gene score, polygenic score, or genotype score, has increased dramatically in recent years, and some may be on the verge of clinical application [12].
Database Features and Limitations
The Nutrigenetics.net database is a subset of PubMed covering the publication years 2000 up to the present for articles relevant to human genetics/epigenetics and the related omics (and potentially relevant models), further indexed with a controlled vocabulary. Access to the database is freely available to the public on weekends (US Pacific time) via the complimentary login at https://www.nutrigenetics.net/Login.aspx.
The original focus of the Nutrigenetics.net database was nutrition. However, because nutrition is a profound example of an environmental variable with many overlaps into other health-related areas, the database soon expanded its coverage to include all types of interactions such as lifestyle, social environment, pharmaceuticals, etc., all of which can be important for physical and mental health and well-being as well as performance.
Except for genes and genetic variants, most of the standardized terminology used for indexing in the Nutrigenetics.net database follows the conventions used by the Medical Subject Headings (https://meshb.nlm.nih.gov/search). For genetic variants, it should be noted that most of the PubMed records contain traditional naming conventions. The Nutrigenetics.net database therefore often uses a traditional term for a genetic variant but includes cross-references with other synonyms, including rs numbers where applicable.
An effort is made to keep the Nutrigenetics.net database reasonably current, especially for genetic variants and related subtopics frequently mentioned in PubMed. For instance, a higher priority is placed on adding records where PubMed mentions an rs number in titles or abstracts, e.g., FTO rs9939609. Indexing for certain topics like FTO rs9939609 is straightforward because there are essentially no synonyms for this particular variant in PubMed. In sharp contrast, for rs numbers like rs1801133 (MTHFR 677C-T), the indexing process involves multiple synonyms that must all be taken into consideration, searched, and combined in order to identify all of the relevant PubMed records, and priority is given to updating the indexing for the more frequently cited genetic variants.
The Nutrigenetics.net database uses a proprietary combination of both manual and semiautomated processes for identifying the relevant PubMed records and indexing them with a controlled vocabulary. Controlled-vocabulary indexing with cross-referencing requires extra effort compared to simple full-text indexing, so some variability in update frequency does occur. The Nutrigenetics.net database serves as an auxiliary, multitopic indexing tool which can reveal otherwise easily overlooked relationships. It is intended to further enhance PubMed’s usefulness but not to compete with or replace PubMed. New topics and cross-references to synonyms are added on an ongoing basis and in response to user requests submitted to info@Nutrigenetics.net.
Since its inception, the database at Nutrigenetics.net has been adding records which correspond to PubMed at an average rate of >100,000 citations per year, with the current total count at >1.7 million records (the latest count, plus additional information can be found at https://nutrigenetics.net/AboutUs/FrequentlyAskedQuestions.aspx).
Conclusion
With the continuing emergence of environmental and lifestyle evidence, along with abundant reports on genetics/epigenetics and the related omics, there is now a growing opportunity for the practical use of this evidence [13]. Any translational applications must be chosen and applied responsibly; this presents its own challenges, including the acute need for genetics literacy/education, information dissemination, partnering with and among healthcare professionals, community and public engagement, and avoiding/minimizing healthcare disparities [14-20]. As more PubMed records appear, the aim of the Nutrigenetics.net database is to become increasingly useful to all potential audiences.
Conflict of Interest Statement
R.L.M. is the founder and president of Nutrigenetics Unlimited, Inc., which produces the database at Nutrigenetics.net. He is one of the original members of the International Society of Nutrigenetics/Nutrigenomics (ISNN) and has been providing ISNN members with 24/7 access to the Nutrigenetics.net database.
Funding Sources
There was no funding.