Database Structure and Content
The Human Gene Mutation Database (HGMD) constitutes a comprehensive collection of single base-pair substitutions in coding (missense and nonsense), regulatory and splicing-relevant regions of human nuclear genes, micro-deletions and micro-insertions, indels, repeat expansions, as well as gross gene lesions (deletions, insertions and duplications) and complex gene rearrangements (Table A3-1). This unique resource currently contains 96,631 different germline mutations and disease-associated/functional polymorphisms in a total of over 3,600 nuclear genes (December 2009 release) causing or associated with human inherited disease. Inspection of HGMD mutation data reveals that every year ~300 new ‘inherited disease genes’ are identified (Fig. 1.3-1), corresponding to ~9,000 new mutations.
Table A3-1: Summary of Mutation Data in HGMD (December 2009)
|Mutation type |Number of entries |Proportion of total (%) |
|Single base-pair substitutions |
|Missense |43600 |45.1 |
|Nonsense |10822 |11.2 |
|Splicing |9267 |9.6 |
|Regulatory |1700 |1.8 |
|Other lesions |
|Small (≤20-bp) deletions |15231 |15.8 |
|Small (≤20-bp) insertions |6273 |6.5 |
|Small (≤20-bp) indels |1413 |1.5 |
|Gross (>20-bp) deletions |5912 |6.1 |
|Gross (>20-bp) insertions and duplications |1210 |1.2 |
|Complex rearrangements (including inversions) |912 |0.9 |
|Repeat variations |291 |0.3 |
|Total |96631 |100 |
|Note: Numbers include data not yet publicly available. |
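The proportions in Table A3-1 follow directly from the entry counts. As a quick consistency check, the short Python sketch below recomputes them from the counts copied out of the table. The variable names are illustrative only, and a recomputed value may differ from the printed one in the last digit because of rounding.

```python
# Entry counts copied from Table A3-1 (HGMD, December 2009 release).
entries = {
    "Missense": 43600,
    "Nonsense": 10822,
    "Splicing": 9267,
    "Regulatory": 1700,
    "Small deletions": 15231,
    "Small insertions": 6273,
    "Small indels": 1413,
    "Gross deletions": 5912,
    "Gross insertions and duplications": 1210,
    "Complex rearrangements": 912,
    "Repeat variations": 291,
}

# The grand total should match the "Total" row of the table.
total = sum(entries.values())

# Proportion of the total contributed by each mutation type, in percent.
percentages = {name: 100 * count / total for name, count in entries.items()}

for name, count in entries.items():
    print(f"{name:36s} {count:6d} {percentages[name]:5.1f}%")
print(f"{'Total':36s} {total:6d}")
```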
HGMD does not include somatic lesions or mitochondrial genome mutations. This is because there are several databases devoted to the collection of somatic mutations, the most comprehensive of which is COSMIC.2 Mitochondrial mutations are well covered by MITOMAP.3 HGMD provides links to both of these databases.
HGMD currently provides free access to the bulk of its mutation data to over 33,000 registered academic/non-profit users worldwide. It receives 6,900 user logins per month and 15,600 gene queries per month (2009 average figures given). In the absence of any public funding, HGMD is maintained courtesy of a subscription-based version (HGMD Professional), distributed through BIOBASE GmbH (http://www.biobase-international.com).
Data input is managed by a small dedicated team (Peter Stenson, Eddie Ball, Matthew Mort, Katy Howells, and Andrew Phillips) based in Cardiff, UK, who are responsible for identifying relevant mutation reports from the literature [or from locus-specific mutation databases (LSDBs)], assessing each report for novelty and accuracy, augmenting the mutation data with flanking DNA sequence, annotating the data where necessary and appropriate, and applying a uniform data entry format.
Single base-pair substitutions in coding regions are presented in terms of a triplet change with an additional flanking base included if the mutated base lies in either the first or third position in the triplet. Substitutions causing regulatory abnormalities are logged with 30 nucleotides flanking the site of mutation on both sides; the location of the mutation relative to the transcriptional initiation site, initiator ATG, or polyadenylation site is given. Mutations affecting mRNA splicing are presented in brief with information specifying the relative position of the lesion with respect to a numbered intron donor or acceptor splice site. Positions logged as positive integers refer to a 3′ (downstream) location; negative integers refer to a 5′ (upstream) location. Microdeletions (of 20 bp or less) are presented in terms of the deleted bases in lowercase plus, in uppercase, 10-bp DNA sequences flanking both sides of the lesion. The numbered codon is preceded in the given sequence by the caret character (^). In cases where any location parameter is listed as "?", either the location is unknown or a consistent nucleotide/codon numbering system is lacking. Where deletions extend outside the coding region of the gene in question, other positional information is provided, e.g., 5′ untranslated region (5′ UTR) or exon 6/intron 6 boundary (E6I6). Chromosome coordinates are provided for the vast majority of missense and nonsense mutations, microinsertions, microdeletions, indels and regulatory mutations. We also provide HGVS (Human Genome Variation Society; http://www.hgvs.org/mutnomen) nomenclature for microlesions wherever possible.
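The presentation conventions for microdeletions and splice-site positions can be illustrated with a minimal Python sketch. This is not HGMD's own software: the function names, argument names, and the example sequence are hypothetical, and the code simply encodes the rules as stated above (deleted bases in lowercase, 10 bp of uppercase flank on each side, a caret before the numbered codon, and signed offsets for splice positions).

```python
from typing import Optional

FLANK_LENGTH = 10        # bp of flanking sequence shown on each side
MAX_MICRODELETION = 20   # microdeletions are 20 bp or less


def format_microdeletion(left_flank: str, deleted: str, right_flank: str,
                         codon_offset: Optional[int] = None) -> str:
    """Render a microdeletion: uppercase flanks, lowercase deleted bases.

    codon_offset (hypothetical parameter) is the index in the rendered
    sequence at which the numbered codon starts; a caret is placed there.
    """
    if not 1 <= len(deleted) <= MAX_MICRODELETION:
        raise ValueError("a microdeletion spans 1 to 20 bp")
    if len(left_flank) != FLANK_LENGTH or len(right_flank) != FLANK_LENGTH:
        raise ValueError("each flank must be exactly 10 bp")
    sequence = left_flank.upper() + deleted.lower() + right_flank.upper()
    if codon_offset is None:
        return sequence
    if not 0 <= codon_offset <= len(sequence):
        raise ValueError("codon_offset lies outside the sequence")
    return sequence[:codon_offset] + "^" + sequence[codon_offset:]


def describe_splice_offset(offset: int) -> str:
    """Interpret a signed position relative to a splice site."""
    if offset > 0:
        return f"{offset} bp 3' (downstream) of the splice site"
    if offset < 0:
        return f"{-offset} bp 5' (upstream) of the splice site"
    raise ValueError("offset must be a non-zero integer")


# Made-up sequence, for illustration only.
print(format_microdeletion("ACGTACGTAC", "GGA", "TTGCATTGCA", codon_offset=7))
print(describe_splice_offset(-2))
```

With the made-up input above, the first call renders the record as `ACGTACG^TACggaTTGCATTGCA`.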
It should be noted that codon numbering may in some cases display inconsistencies with the literature owing to different residue numbering systems being adopted for the same protein. For most genes (where there is no risk of error or ambiguity), residue numbering has been standardized with respect to the generally accepted numbering system. For gross deletions, gross insertions, repeat variations, and complex rearrangements, information regarding the nature and location of a lesion is logged in narrative form owing to the extremely variable quality of the original data reported.
The database can be electronically searched by gene symbol, gene/protein name, disease/clinical phenotype, or Online Mendelian Inheritance in Man (OMIM) accession number. We also intend, in the not too distant future, to provide an expanded search facility to allow much more specific subcategorization of mutations.
Mutation data in HGMD are accessible on the basis of every gene being allocated one Web page per mutation type, if data of that type are present. Meaningful integration with phenotypic, structural, and mapping information has been accomplished through bidirectional links between HGMD and OMIM.4 Links to GenAtlas,5 the HUGO Nomenclature Committee,6 GeneCards,7 GeneTests-GeneClinics,8 and LocusLink9 have also been established.
Each mutation is entered into HGMD only once (citing the first literature report) in order to avoid confusion between recurrent and identical-by-descent lesions. Additional phenotype and functional characterization data are handled by provision of extra references for cited mutations. Silent mutations within the coding region that do not alter the encoded amino acid are not recorded unless there is evidence of altered splicing and/or a disease association. Mutations that have not been adequately or unambiguously described in the corresponding report are also excluded unless full details can subsequently be obtained from the authors. Such problems could be minimized if authors were strictly to follow published mutation nomenclature guidelines.10 In addition to mutation data, HGMD provides supplementary information that can be used to assist in interpretation, e.g., links to cDNA reference sequences (defined as coding sequences stretching from the ATG translational initiation codon to the termination codon, duly translated; they currently number 3,532), mutation maps, and splice junction data (for ~60 genes).
Data are obtained by means of a combined electronic and manual search procedure. HGMD currently contains data derived from more than 1,260 different life-science and medical journals, with entries accumulating at a rate in excess of 9,000 per annum (Table A3-2). Data from five journals (Human Mutation, American Journal of Human Genetics, Human Molecular Genetics, Human Genetics, and Nature Genetics), ranked by number of mutations reported and listed in HGMD, account for just over 36% of all HGMD entries, with the top 100 journals accounting for more than 84% of entries.
Table A3-2: Summary of Entries in HGMD by Year of Publication (December 2009)
|Year |Number of Entries |
|Before 1990 |746 |
|1990 |450 |
|1991 |747 |
|1992 |1182 |
|1993 |1391 |
|1994 |2293 |
|1995 |2405 |
|1996 |3020 |
|1997 |3804 |
|1998 |4674 |
|1999 |5264 |
|2000 |5528 |
|2001 |5905 |
|2002 |5556 |
|2003 |6056 |
|2004 |6021 |
|2005 |8187 |
|2006 |8562 |
|2007 |8439 |
|2008 |8027 |
|2009 |8374 |
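The yearly figures in Table A3-2 can be cross-checked against the overall total of 96,631 entries given in Table A3-1. The Python sketch below is one way to do so; the variable names are illustrative, and the counts are copied directly from the table.

```python
# Entries by year of publication, copied from Table A3-2 (December 2009).
entries_by_year = {
    "Before 1990": 746, "1990": 450, "1991": 747, "1992": 1182,
    "1993": 1391, "1994": 2293, "1995": 2405, "1996": 3020,
    "1997": 3804, "1998": 4674, "1999": 5264, "2000": 5528,
    "2001": 5905, "2002": 5556, "2003": 6056, "2004": 6021,
    "2005": 8187, "2006": 8562, "2007": 8439, "2008": 8027,
    "2009": 8374,
}

# Running (cumulative) total of entries, in table order.
cumulative = {}
running_total = 0
for year, count in entries_by_year.items():
    running_total += count
    cumulative[year] = running_total

for year, count in entries_by_year.items():
    print(f"{year:12s} {count:5d} {cumulative[year]:6d}")

# The final cumulative value should equal the total in Table A3-1.
print("Total entries:", running_total)
```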
HGMD does not usually include mutations lacking obvious phenotypic consequences, although a few such coding sequence variants have been included where they could conceivably have some clinical effect (e.g. albumins, butyrylcholinesterases). Some variants have, however, been included on the basis that they significantly reduce the expression of a given gene or the functional activity of its protein product, even though these mutations may not yet have been shown to be of direct clinical relevance.
Disease-associated polymorphisms currently constitute ~5% of mutation entries in HGMD. The vast majority of disease-associated polymorphisms are single base-pair substitutions, but a small number are of an insertion/deletion type. They generally occur within gene promoters or coding regions, but an increasing number of ‘functional polymorphisms’ are being found in introns and 3′ UTRs. These polymorphisms often serve to alter either the level of expression of the gene or the functional activity of the gene product. Currently, ~55% of the polymorphisms recorded in HGMD have been shown to be disease-associated. The remainder, while not yet known to be disease-associated, can nevertheless still exert a marked influence on either the level of expression of the associated gene or the functional activity of the protein product.
Although functional polymorphisms with no known disease association do not have any immediate clinical relevance, these data are potentially very valuable in terms of understanding inter-individual differences in disease susceptibility. The distinction between a disease-associated polymorphism and a pathological mutation sensu stricto is of course to some extent arbitrary. For example, the F508del mutation in the CFTR gene (which, in a homozygous or compound heterozygous state, is responsible for the majority of cases of cystic fibrosis in Caucasian populations) occurs at polymorphic frequency. By contrast, the Arg506Gln substitution in the F5 gene known as factor V Leiden (which occurs at a frequency of 3-5% in the general population) is found in between 15 and 25% of individuals with venous thrombosis. In terms of their relationship to disease, the former category of mutation may be regarded as causative whereas the latter may be regarded only as associated, owing to reduced penetrance.
Many reports of disease-associated polymorphisms that appear in the literature are of uncertain significance and may eventually turn out to be unreliable as a result of inappropriate case-control matching, inadequate choice of statistical test, inconsistent phenotype definition, etc. The decision as to whether to include disease-associated polymorphisms in HGMD is not an arbitrary one but rather involves an exercise of judgment. To be included as disease-associated, a statistically significant (p<0.05) association between the polymorphism and phenotype must have been observed, or either in vitro or in vivo expression/functional data should have served to demonstrate that the polymorphism influences either the level of expression or the function of the protein product. Such information is generally required in order to ensure that ‘neutral’ variants in linkage disequilibrium with other disease-causing (but hitherto unidentified) variants are not included. In some instances, the above criteria have been only partially satisfied so that the HGMD curators remained unconvinced as to the phenotypic relevance of the variants reported. A decision to include the polymorphism may nevertheless have been made, either as a result of other supporting information that became available since publication of the original (first) report, or because the associated gene/disease state was deemed to be of sufficient importance for it to warrant confirmation or otherwise by further work. Such variants have been ascribed the descriptor "association with?" (as opposed to "association with" without a question mark) to indicate that some degree of uncertainty is involved. The difficulty inherent in making decisions regarding the inclusion or exclusion of such potential disease associations highlights the need for a methodical and methodologically uniform approach to assessing such reports as they appear in the literature.11
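The inclusion criteria described above can be summarized, in much simplified form, as a small decision function. The Python sketch below is only an illustration: curation is an exercise of judgment rather than a mechanical rule, and the class, field, and function names (including the two boolean flags that stand in for curator judgment) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

P_VALUE_THRESHOLD = 0.05  # significance level quoted in the text


@dataclass
class PolymorphismReport:
    """Hypothetical summary of one literature report of a polymorphism."""
    p_value: Optional[float] = None       # association p-value, if reported
    functional_evidence: bool = False     # in vitro/in vivo expression or functional data
    curators_convinced: bool = True       # stand-in for curator judgment
    included_despite_doubt: bool = False  # e.g. later support, or an important gene/disease


def association_descriptor(report: PolymorphismReport) -> Optional[str]:
    """Return the descriptor for an included variant, or None if excluded."""
    significant = (report.p_value is not None
                   and report.p_value < P_VALUE_THRESHOLD)
    meets_criteria = significant or report.functional_evidence
    if meets_criteria and report.curators_convinced:
        return "association with"
    if report.included_despite_doubt:
        # Criteria only partially satisfied: flag the uncertainty.
        return "association with?"
    return None


print(association_descriptor(PolymorphismReport(p_value=0.01)))
print(association_descriptor(PolymorphismReport(p_value=0.20)))
print(association_descriptor(PolymorphismReport(
    p_value=0.04, curators_convinced=False, included_despite_doubt=True)))
```

The three calls illustrate, respectively, a clear inclusion, an exclusion, and an inclusion flagged with the question-mark descriptor.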
Some 8.5% of HGMD mutation entries are listed in dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/). This does not, however, mean that all these variants necessarily occur at polymorphic frequencies (it simply means that dbSNP is now beginning to include pathological variants as well as polymorphisms sensu stricto). Of the 4,833 bona fide polymorphisms listed in HGMD, 1,964 are ‘disease-associated’, 783 are ‘disease-associated with supporting functional evidence’ for direct disease involvement, and 2,086 are ‘functional polymorphisms’ which serve either to alter gene expression or the structure/function of the gene product in some way. Some 86% of the 4,833 polymorphisms in HGMD are single-base-pair substitutions.
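The breakdown of polymorphisms can be checked with simple arithmetic. The Python sketch below uses the three category counts quoted above and the overall HGMD total of 96,631 entries from Table A3-1; the variable names are illustrative. It recovers the rounded shares quoted earlier in the text (polymorphisms as roughly 5% of all entries, and somewhat over half of them disease-associated).

```python
# Category counts for bona fide polymorphisms, as quoted in the text.
polymorphisms = {
    "disease-associated": 1964,
    "disease-associated with supporting functional evidence": 783,
    "functional polymorphism": 2086,
}
hgmd_total_entries = 96631  # total from Table A3-1

total_polymorphisms = sum(polymorphisms.values())

# Polymorphisms as a share of all HGMD entries, in percent.
share_of_hgmd = 100 * total_polymorphisms / hgmd_total_entries

# Both disease-associated categories combined, as a share of polymorphisms.
disease_associated = (
    polymorphisms["disease-associated"]
    + polymorphisms["disease-associated with supporting functional evidence"]
)
share_disease_associated = 100 * disease_associated / total_polymorphisms

print(f"Total polymorphisms:         {total_polymorphisms}")
print(f"Share of all HGMD entries:   {share_of_hgmd:.1f}%")
print(f"Disease-associated (either): {share_disease_associated:.1f}%")
```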
Polymorphic variants affecting individual drug response, patient survival times after diagnosis and responses to surgical intervention are not generally included in HGMD. Studies that simply report SNPs in association with disease (and hence which are likely to represent merely a linkage disequilibrium effect), but with no additional evidence of direct functional involvement of the variants in question, are also not included. Reports of haplotypes associated with an increased risk of disease are not included unless there is some indication as to precisely which variant(s) within the haplotype is/are responsible for the disease association or functional effect. These fairly stringent criteria mean that HGMD is currently the only database that focuses specifically on the collation of functional/disease-associated polymorphic variants to the exclusion of linkage (or tag) markers.
Since January 2006, HGMD has been working in partnership with BIOBASE GmbH (http://www.biobase.de). Since HGMD does not receive any public funding to support its upkeep, it has been necessary to develop a sustainable model to ensure both current and future funding of the database. The ideal model would be a mixture of income from both public and private sources. This, in principle, would allow HGMD to provide free database access to academic/non-profit users alongside a subscription-based distribution for commercial users marketed by a commercial company. With this eventual aim in mind, the HGMD curators opted to market their data in collaboration with BIOBASE GmbH. As part of the commercial agreement, Cardiff University, as HGMD’s host institution, agreed to provide BIOBASE with a period of exclusive access to newly added mutational information. This period extends to 2½ years from the date of initial inclusion. BIOBASE provides HGMD (in the form of HGMD Professional) as a stand-alone product as part of its database subscription package. The publicly available version of HGMD will, however, continue to be made available as a free service to registered users from academic/nonprofit institutions via the Cardiff website.
HGMD Professional provides access to the very latest mutation data but also contains many valuable extra features, including an expanded search engine, genomic coordinates, additional literature references, Human Genome Variation Society (HGVS) nomenclature (http://www.hgvs.org/mutnomen) and a suite of advanced search tools that greatly enhance the utility of the database. Together with BIOBASE, we are working toward being able to make all HGMD data and search tools available to the academic community free of charge and in a timely fashion, with the costs of upkeep being borne primarily by industry and commerce. We believe that this funding model should not only guarantee the financial viability of HGMD, but ought also to allow this unique resource to be sustainable into the long term, to the benefit of the scientific community.