Types of Nucleotide Substitutions
A database containing reports of mutations in the coding regions of human genes causing genetic disease, mainly characterized by DNA sequencing, has been maintained by two authors of this chapter (D.N.C. and M.K.). As of April 15, 2000, this database, referred to throughout this chapter as the Human Gene Mutation Database (HGMD3; http://www.hgmd.org), included 21,591 entries in 1039 genes. Earlier versions of the HGMD have been published.4,5 Only one example of each mutation is recorded, owing to the difficulty in determining whether repeated mutations are identical by descent or truly recurrent. Fig. 13-1 illustrates the spectrum of mutations logged in the database. Missense nucleotide substitutions represent the most common type of mutation, accounting for 50 percent of the total entries.
Fig. 13-1: Spectrum of different types of human gene mutations logged in the Human Gene Mutation Database as of September 13, 1998 (http://www.hgmd.org).
Regarding missense mutations, evidence for causality comes from one or more of the following sources:
Occurrence of the mutation in a region of known structure or function
Occurrence of the lesion in an evolutionarily conserved residue
Previous independent occurrence of the mutation in an unrelated patient
Failure to observe the mutation in a large sample of normal controls
Novel appearance and subsequent cosegregation of the gene lesion and disease phenotype through a family pedigree
Demonstration that a mutant protein produced in vitro possesses the same biochemical properties and characteristics as its in vivo counterpart
Reversal of the pathological phenotype in the patient/cultured cells by replacement of the mutant gene/protein with its wild-type counterpart.
The spectrum of single-base-pair substitutions logged in the HGMD by November 1997 (the time it was last subjected to an extensive meta-analysis6) is summarized in Table 13-2. Mutations occurring in CpG dinucleotides account for 2133 (29.3 percent) of the total and therefore represent a major cause of human genetic disorders (see below). If only CG-to-TG and CG-to-CA transitions (i.e., those consistent with methylation-mediated deamination) are considered, this figure falls to 1675 (23 percent). Breakdown of the data by chromosomal location revealed that the proportion of CG-to-TG or CG-to-CA substitutions was significantly higher for autosomal genes (1325/5296 = 25.0 percent) than for X-chromosomal genes (350/1975 = 17.7 percent; χ² = 43.21, 1 df, P < 10⁻⁵). In part, this disparity can be explained by a generally more pronounced CpG suppression observed in X-linked genes: the average CpG content was 3.67 percent for the 401 autosomal cDNA sequences provided by the HGMD, and 2.86 percent for the 45 X-chromosomal cDNAs (Student's t = 2.35, 444 df, P < 0.01). Analysis of Table 13-2 yields transversion (T to A or G, A to T or C, G to C or T, C to G or A) and transition (T to C, C to T, G to A, A to G) frequencies of 37.5 percent and 62.5 percent, respectively. There is therefore a highly significant excess of transitions as compared with the expected frequency (33 percent, since 4 of the 12 possible substitutions are transitions). Most but not all of this excess can be attributed to the hypermutability of the CpG dinucleotide: even when CpG mutations (36.9 percent of all transitions and 16.8 percent of all transversions) are removed from the analysis, the excess of transitions is still significant (55.8 percent vs 33 percent expected). It is important to point out that mutation frequencies observed in the context of human inherited disease are unlikely to reflect the true underlying rates of mutation occurrence. 
Since different amino acid substitutions have different effects upon protein structure and function, they have necessarily come to clinical attention (and thus entered the HGMD) with different probabilities. Moreover, codon frequencies differ from one another, implying that different amino acid residues are involved in a mutational event with different prior probabilities. Relative single-base-pair substitution rates corrected for these two confounding factors6 are presented in Table 13-3. The data in this table confirm the existence of a high rate of C-to-T and G-to-A substitutions (48 percent of total).
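The autosomal versus X-chromosomal comparison quoted above can be rechecked with a few lines of Python. This is a minimal sketch that assumes a Pearson χ² test on a 2 × 2 table without continuity correction; the text does not say which variant was used, and the helper name `chi_square_2x2` is illustrative only.

```python
# Counts quoted in the text: CG>TG or CG>CA transitions versus all other
# single-base-pair substitutions, split by chromosomal location.
observed = {
    "autosomal": {"cpg_transition": 1325, "other": 5296 - 1325},
    "x_linked": {"cpg_transition": 350, "other": 1975 - 350},
}


def chi_square_2x2(table):
    """Pearson chi-square statistic (no continuity correction, assumed)."""
    rows = list(table)
    cols = list(table[rows[0]])
    row_tot = {r: sum(table[r].values()) for r in rows}
    col_tot = {c: sum(table[r][c] for r in rows) for c in cols}
    grand = sum(row_tot.values())
    stat = 0.0
    for r in rows:
        for c in cols:
            expected = row_tot[r] * col_tot[c] / grand
            stat += (table[r][c] - expected) ** 2 / expected
    return stat


chi2 = chi_square_2x2(observed)
auto_share = 1325 / 5296
x_share = 350 / 1975
print(f"autosomal {auto_share:.1%}, X-linked {x_share:.1%}, chi-square {chi2:.2f}")
```

Under these assumptions the statistic comes out close to the published value of 43.21, which suggests that the uncorrected test is at least compatible with the reported figure.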
Table 13-2: Spectrum of Single-Base-Pair Substitutions in the Human Gene Mutation Database
| ||Nucleotide Resulting from Single-Base-Pair Change |
| || |
|Initial Nucleotide ||T ||C ||A ||G ||Total |
|T || || 654 ||271 ||312 ||1237 |
|C || 1632 (940) || ||371 ||340 ||2343 |
|A ||201 ||163 || || 538 ||902 |
|G ||619 ||453 || 1717 (735) || ||2789 |
|Total ||2452 ||1270 ||2359 ||1190 ||7271 |
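The transition and transversion percentages discussed in the text can be recomputed directly from Table 13-2. The sketch below uses the table counts plus the two CpG totals given in the text (2133 and 1675); treating the 458 remaining CpG substitutions as transversions is an inference from those numbers. Variable names are illustrative.

```python
# Counts from Table 13-2, keyed as substitutions[initial][resulting].
substitutions = {
    "T": {"C": 654, "A": 271, "G": 312},
    "C": {"T": 1632, "A": 371, "G": 340},
    "A": {"T": 201, "C": 163, "G": 538},
    "G": {"T": 619, "C": 453, "A": 1717},
}
PURINES = {"A", "G"}


def is_transition(initial, resulting):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return (initial in PURINES) == (resulting in PURINES)


total = sum(n for row in substitutions.values() for n in row.values())
transitions = sum(
    n
    for initial, row in substitutions.items()
    for resulting, n in row.items()
    if is_transition(initial, resulting)
)
transversions = total - transitions

# CpG counts quoted in the text: 2133 CpG substitutions, of which 1675 are
# CG>TG or CG>CA transitions; the remainder are taken to be transversions.
cpg_transitions = 1675
cpg_transversions = 2133 - cpg_transitions

non_cpg_transitions = transitions - cpg_transitions
non_cpg_transversions = transversions - cpg_transversions
non_cpg_transition_share = non_cpg_transitions / (
    non_cpg_transitions + non_cpg_transversions
)

print(f"transitions: {transitions / total:.1%}")
print(f"transitions after removing CpG sites: {non_cpg_transition_share:.1%}")
```

The expected transition share of 33 percent follows from counting: each base can change to three others, only one of which is a transition, so 4 of the 12 possible substitutions are transitions.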
Table 13-3: Relative Single-Base-Pair Substitution Rates in Human Nuclear Genes Causing Inherited Disease
| ||Substituting Nucleotide |
| || |
|Original Nucleotide ||T ||C ||A ||G |
|T ||— ||1.525 ||0.374 ||0.410 |
|C ||2.702 ||— ||0.541 ||0.505 |
|A ||0.187 ||0.268 ||— ||1.127 |
|G ||0.521 ||0.712 ||3.128 ||— |
DNA Polymerase Fidelity and Single-Nucleotide Substitutions.
DNA replication occurs as a result of an accurate, yet error-prone, multistep process. The final accuracy depends on the initial fidelity of the replicative step and the efficiency of subsequent error-correction mechanisms.7 Since DNA polymerases are involved in replication, recombination, and repair processes (Table 13-4), 8 their base incorporation fidelity is probably a critical factor in determining mutation rates in the cell. To test the hypothesis that nonrandom base misincorporation during DNA replication is a major contributory factor in human mutations, Cooper and Krawczak5 compared the base substitution rates from Table 13-3 with the in vitro measured base substitution error rates (data from studies by Kunkel and Alexander9 and others) exhibited by vertebrate DNA polymerases α, β, and δ. A significant correlation between these two sets of values was observed for polymerase β but not for polymerases α or δ (Spearman rank correlation coefficient, 0.74; P< 0.005). In this comparison, any consideration of the efficacy of the different proofreading and postreplicative mismatch-repair mechanisms was excluded. This is because the purified polymerase preparations used in vitro lacked the 3′ to 5′ exonuclease activities thought to be responsible for proofreading in vivo. The result obtained for DNA polymerase β is consistent with the postulate that a substantial proportion of the nucleotide substitutions causing human genetic disease are due to misincorporation of bases during DNA replication.
Table 13-4: Eukaryotic DNA Polymerases
| ||α || β || γ || δ || ε |
|Catalytic polypeptide ||165 kDa ||40 kDa ||140 kDa ||125 kDa ||255 kDa |
|Associated subunits ||70, 58, 48 kDa ||None ||Unknown ||48 kDa ||Unknown |
|Cellular localization ||Nuclear ||Nuclear ||Mitochondrial ||Nuclear ||Nuclear |
|Associated activities || || || || || |
|3′ → 5′ Exonuclease ||None ||None ||Yes ||Yes ||Yes |
|Primase ||Yes ||None ||None ||None ||None |
|Properties || || || || || |
|Processivity ||Medium ||Low ||High ||Low ||High |
|Fidelity ||High ||Low ||High ||High ||High |
|Major characteristics ||Principal replicative DNA polymerase, lagging-strand DNA synthesis ||Short-patch DNA repair ||Mitochondrial DNA polymerase ||Leading-strand DNA synthesis ||UV-induced repair synthesis |
Slipped Mispairing and Single-Nucleotide Substitutions.
A mechanistic model for single-base-pair mutagenesis, the slipped-mispairing model, 10 seeks to explain nucleotide misincorporation through transient misalignment of the primer-template caused by looping out of a template base. During replication synthesis, the template strand slips back one base, resulting in the misincorporation of the next nucleotide on the primer strand. After realignment of both primer and template strand, the mismatch may be corrected in favor of the misincorporated base (Fig. 13-2). Misalignment or dislocation mutagenesis is thought to be mediated by runs of identical bases or by other repetitive DNA sequences in the vicinity. If misincorporation mediated by one-base-pair (1-bp) slippage is important, then a substantial proportion of point mutations should exhibit identity of the newly introduced base to one of the bases flanking the mutation site. Comparison in the HGMD of the observed and expected frequency of this type of mutation revealed that this is indeed the case, but only at certain codon positions.6 Mutation toward the 5′ flanking nucleotide occurred significantly more often than expected at the second position (642 observed vs 558 expected) but not at the first position (565 observed vs 568 expected) or last position of a codon (167 observed vs 170 expected); mutation toward the 3′ flanking base was significantly favored at the first position (490 observed vs 390 expected) but disfavored at the second position (592 observed vs 659 expected) of a codon. These findings suggest a mechanism of mutation at either position 1 or 2 in the codon (both critical in specifying the encoded amino acid residue) that is biased toward the nucleotide at the other position. Inspection of the genetic code reveals that such a bias invariably serves to avoid the de novo introduction of termination codons.
Fig. 13-2: Schematic representation of the slipped-mispairing model for single-nucleotide substitutions.
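The closing claim of the paragraph above, that a bias of codon position 1 or 2 toward the base at the other position never creates a termination codon, can be checked by enumeration. The sketch below assumes the standard genetic code (stop codons TAA, TAG, TGA); the helper `biased_mutants` is a hypothetical name introduced for illustration.

```python
from itertools import product

BASES = "TCAG"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumed)


def biased_mutants(codon):
    """Mutants in which position 1 copies position 2, or position 2 copies position 1."""
    first, second, third = codon
    mutants = set()
    if first != second:
        mutants.add(second + second + third)  # position 1 mutates toward its 3' neighbor
        mutants.add(first + first + third)    # position 2 mutates toward its 5' neighbor
    return mutants


all_codons = ["".join(c) for c in product(BASES, repeat=3)]
sense_codons = [c for c in all_codons if c not in STOP_CODONS]

# Every (sense codon, mutant) pair in which the mutant is a stop codon.
new_stops = [
    (codon, mutant)
    for codon in sense_codons
    for mutant in biased_mutants(codon)
    if mutant in STOP_CODONS
]
print(len(new_stops))
```

The list stays empty because every such mutant carries identical bases at positions 1 and 2, whereas the three stop codons begin with TA or TG.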
CpG Dinucleotides as Hotspots for Nucleotide Substitutions (Methylation-Mediated Deamination of 5-Methylcytosine)
CpG Distribution in the Vertebrate Genome and Its Origins.
In eukaryotic genomes, 5-methylcytosine (5mC) occurs predominantly in CpG dinucleotides, the majority of which appear to be methylated.11,12 Methylation of cytosine results in a high level of mutation due to the propensity of 5mC to undergo deamination to form thymine (Fig. 13-3). Deamination of 5mC probably occurs with the same frequency as either cytosine or uracil. However, whereas uracil DNA glycosylase activity in eukaryotic cells can recognize and excise uracil, thymine—being a normal DNA base—is thought to be less readily detectable and hence removable by cellular DNA repair mechanisms. One consequence of the hypermutability of 5mC is the paucity of CpG in the genomes of many eukaryotes, the heavily methylated vertebrate genomes exhibiting the most extreme CpG suppression.12 In vertebrate genomes, the frequency of CpG dinucleotides is between 20 and 25 percent of the frequency predicted from observed mononucleotide frequencies.13,14 The distribution of CpG in the genome is also nonrandom: About 1 percent of the vertebrate genome consists of a fraction that is rich in CpG and that accounts for about 15 percent of all CpG dinucleotides (reviewed by Bird15). In contrast to most of the scattered CpG dinucleotides, these CpG islands represent unmethylated domains and in many cases appear to coincide with transcribed regions. The evolution of the heavily methylated vertebrate genome has been accompanied by a progressive loss of CpG dinucleotides as a direct consequence of their methylation in the germ line.
Fig. 13-3: Schematic representation of cytosine, 5-methylcytosine, and thymine and of the chemical events that transform cytosine into thymine.
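CpG suppression, as described above, is usually expressed as the ratio of the observed CpG count to the count predicted from the mononucleotide frequencies. The following is a minimal sketch of that calculation for a single sequence; the function name and the exact form of the expectation (C frequency × G frequency × number of dinucleotide positions) are assumptions made for illustration, not a formula taken from this chapter.

```python
def cpg_observed_expected(seq):
    """Observed CpG count divided by the count expected from C and G frequencies."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    if n < 2 or c_count == 0 or g_count == 0:
        return 0.0
    observed = seq.count("CG")  # "CG" occurrences cannot overlap each other
    expected = (c_count / n) * (g_count / n) * (n - 1)
    return observed / expected


# Toy inputs only; real estimates need long genomic or cDNA sequences.
print(cpg_observed_expected("GGCC"))  # contains no CpG at all
print(cpg_observed_expected("CGCG"))
```

On this scale, the 20 to 25 percent figure quoted for vertebrate genomes corresponds to a ratio of roughly 0.20 to 0.25, whereas a CpG island would score much closer to 1.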
The CpG Dinucleotide and Human Genetic Disease.
An excess of C-to-T transitions was first reported by Vogel and Röhrborn16 in a study of the mutations responsible for hemoglobin variants in humans. Further studies confirmed the existence of this phenomenon.17 Many additional studies in eukaryotes (reviewed by Cooper and Krawczak5) have now shown that the CpG dinucleotide is specifically associated with a high frequency of C-to-T and G-to-A transitions. The G-to-A transitions arise as a result of a 5mC-to-T transition on the antisense DNA strand, followed by miscorrection of G to A on the sense strand. A high frequency of polymorphism has also been detected in the human genome by restriction enzymes containing CpG in their recognition sequences.18 CpG was found by molecular analysis to be a hotspot for mutation first in the factor VIII (F8C) gene19,20 and subsequently in a wide range of different human genes.21,22 From the relative dinucleotide mutabilities as estimated by Cooper and Krawczak4,5 (see below for a description), it follows that the CG-to-TG or CG-to-CA substitutions are approximately 13 times more likely than any other substitution in the CG dinucleotide. This is perhaps an underestimate, since in the HGMD each nucleotide substitution has been logged only once, resulting in the systematic exclusion of multiple independent de novo mutations. It has been repeatedly noted in various genes that specific CG-to-TG or CG-to-CA mutations recur independently. For example, CG-to-TG or CG-to-CA mutations in the factor VIII (F8C) gene causing hemophilia A account for 25 percent of all different single-nucleotide substitutions, but for 48 percent if recurrent mutations are considered (based on 586 F8C point mutations; see Kaufman and Antonarakis23 and http://hadb.org.uk/). The observed frequency of CG-to-TG and CG-to-CA mutations varies between human genes; for example, it is less than 10 percent in the β-globin (HBB) and HPRT genes but greater than 50 percent in the ADA gene. 
In two studies of the coagulation F8C and F9 genes in which almost all mutations in a given set of patients have been identified, approximately 35 percent of nucleotide substitutions were CG to TG or CG to CA.24-26 The distribution of CpG mutations within a given gene may also be nonuniform. For example, 9 of 122 single-base-pair substitutions in exon 7 of the protein C (PROC) gene occur in a CpG; by contrast, none of 13 point mutations reported in exons 5 and 6 are in CpG dinucleotides,27 although these exons contain a larger number of CpGs. In the assumed absence of a detection bias (see below), variation in CpG mutability is due to differences in germ-line DNA methylation and/or relative intragenic CpG frequency. Indeed, CpG hypermutability in inherited disease implies that the affected sites are methylated in the germ line and thereby rendered prone to 5mC deamination. That 5mC deamination is directly responsible for mutational events is supported by the fact that several cytosine residues known to have undergone a germ-line mutation in the low-density-lipoprotein receptor gene (LDLR) (hypercholesterolemia) and the tumor protein 53 (TP53) gene (various types of tumor) are indeed methylated in sperm DNA.28
The frequency of CG-to-TG or CG-to-CA mutations may differ between male and female germ lines because there is a profound difference in DNA methylation in the germ cells of the two sexes: the oocyte is markedly undermethylated, whereas sperm is heavily methylated.29,30 Thus, it may be that CG-to-TG or CG-to-CA mutations occur more commonly in male germ cells. Table 13-5 shows the germ-line origin of mutations in the F9 gene. In this data set, there is a sevenfold male excess of transitions at CpG dinucleotides.31 Pattinson et al32 have noted differences between ethnic groups in the mutation frequency at specific CpG sites within the F8C gene in a small sample. By contrast, the pattern of germ-line CpG mutation in the F9 gene appeared to be indistinguishable between Asians, mostly of Korean origin, and Caucasians.33 This finding argues for the absence of population-specific methylation patterns and is consistent with the absence of methylation differences between individuals from different ethnic backgrounds.34 In somatic tissues, 5mC deamination also appears to be an important mechanism of single-base-pair substitution.35,36 Indeed, the relative rate of mitotic cancer-associated CG-to-TG or CG-to-CA transitions observed in the TP53 gene, the most widely mutated gene in human tumorigenesis, is very similar to the overall germ-line rate observed in other human genes.37 Fig. 13-4 depicts the codon usage in human genes (data from 6,130,940 human codons from GenBank release 107), together with the relative frequency at which codons are affected by any of the 8604 missense/nonsense mutations recorded in the HGMD on July 14, 1998 (http://www.hgmd.org). It is obvious from Fig. 13-4 that codons for Arg and Gly underwent more mutations than expected from codon usage alone in human genes. Although four codons for Arg contain CG dinucleotides, it is less clear why the codons for Gly are hypermutable. They all start with G and could therefore be part of a CG dinucleotide. 
In addition, they are all GGN, and a nearest-neighbor analysis of single-base substitutions (Table 13-6) indicated that a mutated G is often flanked by another G on its 5′ side. To overcome the bias of counting independent mutations only once, we also compared, in Fig. 13-5, the number of recurrent mutations found in different codons of five X-linked genes (F8C, F9, L1CAM, OTC, and BTK) with the codon usage in these five genes. The information included was extracted from the following locus-specific databases: for BTK, http://structure.bmc.lu.se/idbase/BTKbase/; for F8C, http://hadb.org.uk/; for F9, http://www.biochem.ucl.ac.uk/pavithra/fix/structure.html.php; for L1CAM, http://www.l1cammutationdatabase.info/ mutations; and, for OTC, http://ureacycle.cnmcresearch.org/otc/. It is again apparent from Fig. 13-5 that codons for Arg are more vulnerable to point mutations, emphasizing the hypermutability of the CG dinucleotide.
Fig. 13-4: Histogram of codon usage in human genes and mutations found in codons for the various amino acids. The values on the x-axis were normalized to 10,000.
Fig. 13-5: Histogram of the independent recurrent mutations found in codons of five X-linked genes and the occurrence of these codons in these five genes. The genes were F8C, F9, L1CAM, OTC, and BTK.
Table 13-5: Germ-Line Origin of Mutations in the Clotting Factor IX Gene
| ||Male ||Female ||M/F Ratio || p Value |
|All base substitutions ||20 ||16 ||2.5 ||4.99 × 10⁻³ |
|All deletions ||3 ||11 ||0.55 ||NS |
|Transitions || || || || |
|At CpG ||10 ||3 ||6.7 ||1.65 × 10⁻³ |
|Non-CpG ||5 ||4 ||2.5 ||NS |
|Transversions ||5 ||9 ||1.1 ||NS |
|Deletions || || || || |
|Small (<50 bp) ||1 ||8 ||0.25 ||NS |
|Large (>50 bp) ||2 ||3 ||1.3 ||NS |
|Insertions ||1 ||1 || || |
|Total ||24 ||28 || || |
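The M/F ratios in Table 13-5 are not the plain quotient of the male and female counts. Every tabulated ratio is reproduced, to rounding, by 2 × male/female; the sketch below makes that explicit. The factor of 2 is inferred from the numbers alone, since the table does not state how the ratio was derived, and the row labels and function name are illustrative.

```python
# Table 13-5: F9 mutations by germ line of origin, as (male, female, tabulated ratio).
origin_counts = {
    "All base substitutions": (20, 16, 2.5),
    "All deletions": (3, 11, 0.55),
    "Transitions at CpG": (10, 3, 6.7),
    "Transitions, non-CpG": (5, 4, 2.5),
    "Transversions": (5, 9, 1.1),
    "Small deletions (<50 bp)": (1, 8, 0.25),
    "Large deletions (>50 bp)": (2, 3, 1.3),
}


def sex_ratio(male, female, weight=2.0):
    """Weighted male-to-female ratio; weight=2 is an inferred, unstated factor."""
    return weight * male / female


ratios = {label: sex_ratio(m, f) for label, (m, f, _) in origin_counts.items()}
for label, (m, f, tabulated) in origin_counts.items():
    print(f"{label:26s} computed {ratios[label]:.2f}  tabulated {tabulated}")
```

With this weighting, transitions at CpG give a ratio of about 6.7, which is the "sevenfold male excess" cited in the text.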
Table 13-6: Nucleotide Frequencies at the 3′ and 5′ Sides of Point Mutations Causing Human Genetic Disease
| ||3′ Neighboring Base |
| || |
|(a) Mutated Base ||T ||C ||A ||G ||Total |
|T ||202 ||240 ||164 ||631 ||1237 |
|C ||235 ||354 ||619 ||1135 (195) ||2343 (1403) |
|A ||311 ||218 ||209 ||164 ||902 |
|G ||493 (374) ||613 (457) ||732 (547) ||951 (676) ||2789 (2054) |
|Total ||1241 (1122) ||1425 (1269) ||1724 (1539) ||2881 (1666) ||7271 (5596) |
| || Mutated Base |
| || |
| (b) 5′ Neighboring Base || T || C || A || G || Total |
|T ||182 ||509 (314) ||161 ||669 ||1521 (1326) |
|C ||438 ||716 (350) ||295 ||998 (263) ||2447 (1346) |
|A ||347 ||519 (355) ||173 ||345 ||1384 (1220) |
|G ||270 ||599 (384) ||273 ||777 ||1919 (1704) |
|Total ||1237 ||2343 (1403) ||902 ||2789 (2054) ||7271 (5596) |
Are other mechanisms also responsible for CpG deamination? The suggestion that CpG deamination may result from endogenous enzymatic activity has been mooted by Steinberg and Gorman,38 who found that some 70 percent of their (independent) mouse lymphoma cell mutants possessed a specific CGG-to-TGG substitution converting Arg 334 to Trp in the gene encoding the protein kinase regulatory subunit. In 5 percent of these mutants, a second mutation (CGT to TGT) was found converting Arg 332 to Cys. The co-occurrence of these two mutations at such a high frequency argues for some type of enzymatic mechanism and against two independent methylation-mediated deamination events. Such a mechanism could involve a deaminase, although no such activity has yet been purified. The relevance of the observation to human gene mutation is doubtful, since (1) there are no known examples, including CpG dinucleotides, of pathologic base changes that occur with such a high proportional frequency in humans, and (2) although a very few isolated examples of double mutation have been reported as causes of human genetic disease, these do not involve CpG dinucleotides. Shen et al39 have reported that DNA methyltransferase is capable of inducing C-to-T transitions directly in prokaryotes, and that the mutation frequency was sensitive to the concentration of the methyl donor, S-adenosylmethionine. The importance of this putative deamination mechanism in eukaryotes is at present unclear.
Non-CpG Point-Mutation Hotspots
In an early analysis that has not since been updated, a total of 30 codons in 16 different genes were identified as potential hotspots for single-base-pair substitutions among the 879 point mutations in HGMD not readily explicable by methylation-mediated deamination. These residues were characterized either by a single base being affected by at least two nonidentical substitutions or by mutations affecting two or three nucleotides within that codon. Some trinucleotide and tetranucleotide motifs are significantly overrepresented within 10 bp on either side of the mutation hotspots. These motifs are TTT (17 observed vs 8 expected), CTT (18 vs 8), TGA (23 vs 11), TTG (20 vs 8), CTTT (8 vs 2), TCTT (8 vs 2), and TTTG (11 vs 2). In addition, Cooper and Krawczak5 screened a region of 10 bp around 219 non-CpG base substitution sites for triplets and quadruplets that occurred at significantly increased frequencies. Only one trinucleotide was again found to occur at a frequency significantly higher than expected: CTT, the topoisomerase-I cleavage site consensus sequence.40 CTT was observed 36 times in the vicinity of a point mutation, whereas the expected frequency was 20. By contrast, two tetranucleotides were significantly overrepresented at the screened positions. TCGA was observed 17 times (7 expected; this was probably because TaqI restriction enzyme was used for detection of the mutations), whereas TGGA was observed 25 times (12 expected). The latter motif fits perfectly with the deletion hotspot consensus sequence drawn up previously for human genes,41 which, in turn, resembles the putative arrest site for DNA polymerase α.42 Thus, it may be that the arrest or pausing of the polymerase at the replication fork predisposes the replication complex to misincorporation of nucleotides as well as deletions.
A Nearest-Neighbor Analysis of Single-Base-Pair Substitutions
Methylation-mediated deamination as a primary cause of point mutation is characterized by an increased rate of CG-to-TG and CG-to-CA transitions. However, the relative likelihoods of point mutations at other dinucleotides may also vary, as is suggested by the nearest-neighbor frequencies observed in the HGMD (Table 13-6). (Note that each point mutation can be regarded as occurring within two distinct dinucleotides, depending on whether one considers the 5′ or the 3′ neighboring base.) In Table 13-6, considerable differences are apparent with respect to nucleotides occurring adjacent to sites of point mutation. For example, G residues are clearly overrepresented as 3′ flanking nucleotides when T is mutated, and a mutated G is often flanked by another G residue on the 5′ side.
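The two examples in the paragraph above can be read off Table 13-6 as simple proportions. The sketch below does so for a mutated T (3′ neighbors) and a mutated G (5′ neighbors). It assumes that the parenthesized table values are counts after removal of CG-to-TG and CG-to-CA transitions; this reading is inferred from the arithmetic (for example, 998 − 735 = 263) rather than stated in the table.

```python
# Table 13-6(a): 3' neighbors of a mutated T.
three_prime_of_mutated_t = {"T": 202, "C": 240, "A": 164, "G": 631}

# Table 13-6(b): 5' neighbors of a mutated G, raw counts and the parenthesized
# counts (taken here to exclude CpG transitions, an inferred interpretation).
five_prime_of_mutated_g = {"T": 669, "C": 998, "A": 345, "G": 777}
five_prime_of_mutated_g_no_cpg = {"T": 669, "C": 263, "A": 345, "G": 777}


def shares(counts):
    """Convert neighbor counts into fractions of the row or column total."""
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


t_shares = shares(three_prime_of_mutated_t)
g_shares = shares(five_prime_of_mutated_g)
g_shares_no_cpg = shares(five_prime_of_mutated_g_no_cpg)

print(f"3' G next to mutated T: {t_shares['G']:.1%}")
print(f"5' G next to mutated G: {g_shares['G']:.1%} "
      f"({g_shares_no_cpg['G']:.1%} without CpG transitions)")
```

In the raw counts the most frequent 5′ neighbor of a mutated G is C, reflecting the CpG effect; once the CpG transitions are set aside, G becomes the most frequent 5′ neighbor.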
Differences in the phenotypic consequences of specific point mutations, and thus in the likelihood of their coming to clinical attention, introduce a serious bias to the observed spectrum of mutations underlying human disease. In-depth studies of the phenotypic effect of large numbers of different missense mutations in a specific gene are few. One such study for missense mutations in the F9 gene43 showed that mutations at generic residues (amino acid residues conserved in F9 of other mammalian species and in three related serine proteases) would invariably cause disease. Mutations at F9-specific residues (residues conserved in the factor IX of other mammalian species but not in three related serine proteases) were some sixfold less likely to cause disease, whereas mutations at nonconserved residues were 33 times less likely to result in a hemophilia-B phenotype. Bottema et al43 estimated that 40 percent of all possible missense changes would cause hemophilia B, implying that 60 percent of residues serve merely as spacers to maintain the relative position of critical amino acid residues and probably do not fulfill any specific (known) function. Thus, detectable mutations, identified by virtue of their effect on protein structure and function and subsequently on clinical phenotype, appear to be a subset of a rather larger number of mutations, many of which have no clinical effect, at least in the case of hemophilia B. To what extent this finding in hemophilia B (in which <5 percent normal F9 activity must be present to generate a clinically abnormal phenotype) can be extrapolated to other genetic disorders is, however, unclear. Nevertheless, it would seem reasonable to suppose that the phenotypic consequences of a given point mutation are determined by the magnitude of the amino acid exchange as assessed by the resulting structural perturbation of the protein. 
Thus, specific amino acid substitutions might come to clinical attention more readily, depending on the severity of the resulting phenotype. Several methods have been reported for assessing the relative net effect of a specific amino acid exchange.44,45 Perhaps the best comparative measure of amino acid relatedness available is that devised by Grantham, 45 who combined the three interdependent properties of composition, polarity, and molecular volume to assign each amino acid pair a mean chemical difference. Krawczak et al6 devised an iterative multivariate procedure that takes into account the phenotypic consequences of a mutation, measured by means of Grantham's chemical difference between the wild type and mutant amino acid. Over and above the hypermutability of CpG dinucleotides, only a subtle and locally confined influence of the surrounding DNA sequence upon relative single-base-pair substitution rates was observed which extended no further than 2 bp from the substitution site.6 Maximum-likelihood estimates of relative substitution rates taking the immediate 5′ and 3′ flanking nucleotides into account are summarized in Table 13-7. A steady increase in clinical observation likelihood with increasing chemical difference was also noted. Furthermore, nonsense mutations were found to be more than twice as likely to come to clinical attention as the most extreme missense mutations and three times more likely than the average amino acid change. However, the phenotypic consequences of a given mutation must depend not only on the nature of the amino acid substitution, but also on the location of the substitution within the protein. In general, and with the exception of charged residues, most amino acids that make critical interactions (e.g., disulfide bonds, hydrophobic forces, and hydrogen bonds) are rigid or buried within the protein structure, and their mutational substitution will be profoundly destabilizing.
Table 13-7: Relative Dinucleotide Mutabilities
| ||Newly Introduced 5′ Base ||Newly Introduced 3′ Base |
| || || |
|Dinucleotide ||T ||C ||A ||G ||T ||C ||A ||G |
|TT ||— ||1.17 ||0.31 ||0.36 ||— ||0.71 ||0.20 ||0.28 |
|CT ||1.17 ||— ||0.31 ||0.41 ||— ||1.57 ||0.27 ||0.43 |
|AT ||0.44 ||0.20 ||— ||2.06 ||— ||1.53 ||0.40 ||0.34 |
|GT ||0.86 ||0.71 ||3.13 ||— ||— ||1.17 ||0.39 ||0.30 |
|TC ||— ||0.93 ||0.37 ||0.19 ||2.06 ||— ||0.37 ||0.56 |
|CC ||1.16 ||— ||0.49 ||0.32 ||2.54 ||— ||0.39 ||0.40 |
|AC ||0.23 ||0.37 ||— ||0.95 ||2.27 ||— ||0.48 ||0.44 |
|GC ||0.48 ||0.71 ||2.68 ||— ||2.06 ||— ||0.51 ||0.32 |
|TA ||— ||1.19 ||0.32 ||0.34 ||0.16 ||0.24 ||— ||1.36 |
|CA ||1.28 ||— ||0.43 ||0.42 ||0.12 ||0.43 ||— ||1.23 |
|AA ||0.14 ||0.22 ||— ||0.87 ||0.12 ||0.16 ||— ||0.92 |
|GA ||0.45 ||0.60 ||2.82 ||— ||0.24 ||0.15 ||— ||0.63 |
|TG ||— ||1.86 ||0.37 ||0.52 ||0.42 ||0.62 ||1.84 ||— |
|CG || 9.01 ||— ||0.88 ||0.90 ||0.90 ||1.17 || 12.17 ||— |
|AG ||0.10 ||0.19 ||— ||0.60 ||0.30 ||0.46 ||1.38 ||— |
|GG ||0.46 ||0.69 ||3.09 ||— ||0.49 ||0.54 ||1.76 ||— |
Strand Difference in Base Substitution Rates
A noteworthy feature of Table 13-7 is that it reveals some asymmetry, suggesting a strand difference for single-base-pair substitutions. For example, the relative rates of CT to CC and AG to GG differ by more than twofold. Since the latter transition is complementary to the former, these two figures should coincide if point mutagenesis were acting similarly on both DNA strands. Estimation of relative substitution rates conditional on both the 5′ and 3′ flanking nucleotides served to identify 10 pairs of substitutions, complementary to each other, that exhibit the same feature.6 These are listed in Table 13-8. A strand difference in mutation rates has already been described by Wu and Maeda.46 By comparison of nonfunctional sequences near the β-globin genes of six primate species, they demonstrated that purine-to-pyrimidine (R-to-Y) transversions occurred approximately 1.5 times more frequently than their pyrimidine-to-purine counterparts. However, complementary transitions were found to occur at equal frequencies. These findings are compatible with the mutational spectrum from the HGMD: R to Y was observed 11 percent more frequently than Y to R, and both T-to-C and A-to-G transitions account for some 10 percent of the mutations in Table 13-2. A slightly different result was obtained for G-to-A transitions, which are 1.4 times more frequent than C-to-T transitions. Nevertheless, Table 13-7 reveals that strand differences in mutation rates depend on the nucleotides flanking the site of mutation. For example, whereas CT to CC is more than 2.5 times more likely than AG to GG, TA is 15 percent more likely to mutate to TG than to CA. A disparity between the likelihoods of CG-to-TG and CG-to-CA transitions is also evident from inspection of Tables 13-7 and 13-8. This observation strongly suggests that, at least within gene coding regions, the two strands are differentially methylated and/or differentially repaired. 
Holmes et al47 have demonstrated in vitro the existence of a strand-specific correction process in human and Drosophila cells whose efficiency depends on the nature of the mispair. Such differential repair could also account for the observed strand differences in mutation frequency.
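Several of the strand-asymmetry figures quoted in this section follow from Tables 13-2 and 13-7 by simple arithmetic. The sketch below recomputes the purine-to-pyrimidine excess from the Table 13-2 counts and the two dinucleotide comparisons from the rounded Table 13-7 entries; variable names are illustrative.

```python
# Counts from Table 13-2, keyed as substitutions[initial][resulting].
substitutions = {
    "T": {"C": 654, "A": 271, "G": 312},
    "C": {"T": 1632, "A": 371, "G": 340},
    "A": {"T": 201, "C": 163, "G": 538},
    "G": {"T": 619, "C": 453, "A": 1717},
}
PURINES = {"A", "G"}

# Transversions only: purine -> pyrimidine (R to Y) and the reverse (Y to R).
r_to_y = sum(
    n
    for initial, row in substitutions.items()
    for resulting, n in row.items()
    if initial in PURINES and resulting not in PURINES
)
y_to_r = sum(
    n
    for initial, row in substitutions.items()
    for resulting, n in row.items()
    if initial not in PURINES and resulting in PURINES
)
r_to_y_excess = r_to_y / y_to_r - 1

# Relative rates read from Table 13-7 (rounded to two decimals in the table).
ct_to_cc, ag_to_gg = 1.57, 0.60  # complementary pair on opposite strands
ta_to_tg, ta_to_ca = 1.36, 1.19

print(f"R-to-Y excess over Y-to-R: {r_to_y_excess:.0%}")
print(f"CT>CC versus AG>GG: {ct_to_cc / ag_to_gg:.2f}-fold")
print(f"TA>TG versus TA>CA: {ta_to_tg / ta_to_ca - 1:.0%} more likely")
```

With the rounded table entries, the TA comparison comes out slightly below the 15 percent quoted in the text, which is presumably based on unrounded estimates.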
Table 13-8: A Strand Difference in Relative Single-Base-Pair Substitution Rates
|Original Substitution ||Relative Rate ||Watson-Crick Homologue ||Relative Rate |
|GGT > GTT ||1.16 ± 0.16 ||ACC > AAC ||0.51 ± 0.08 |
|TGG > TAG ||1.64 ± 0.13 ||CCA > CTA ||0.99 ± 0.09 |
|CGG > CAG ||13.01 ± 0.88 ||CCG > CTG ||8.35 ± 0.47 |
|CTT > CCT ||1.14 ± 0.20 ||AAG > AGG ||0.35 ± 0.10 |
|CTC > CCC ||1.20 ± 0.17 ||GAG > GGG ||0.32 ± 0.07 |
|TGC > TCC ||0.76 ± 0.13 ||GCA > GGA ||0.19 ± 0.06 |
|CTG > CCG ||1.87 ± 0.15 ||CAG > CGG ||0.81 ± 0.13 |
|GGT > GAT ||1.79 ± 0.21 ||ACC > ATC ||0.68 ± 0.14 |
|CTG > CAG ||0.22 ± 0.04 ||CAG > CTG ||0.03 ± 0.02 |
|CTT > CGT ||0.47 ± 0.11 ||AAG > ACG ||0.12 ± 0.04 |
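Each row of Table 13-8 pairs a substitution with the same event read from the opposite strand, that is, the reverse complement of both the original and the mutated trinucleotide. The sketch below checks that pairing for all ten rows and lists the ratio of the two relative rates (central estimates only; the standard errors in the table are ignored). Function and variable names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


# (original, mutant, rate, homologue original, homologue mutant, rate)
table_13_8 = [
    ("GGT", "GTT", 1.16, "ACC", "AAC", 0.51),
    ("TGG", "TAG", 1.64, "CCA", "CTA", 0.99),
    ("CGG", "CAG", 13.01, "CCG", "CTG", 8.35),
    ("CTT", "CCT", 1.14, "AAG", "AGG", 0.35),
    ("CTC", "CCC", 1.20, "GAG", "GGG", 0.32),
    ("TGC", "TCC", 0.76, "GCA", "GGA", 0.19),
    ("CTG", "CCG", 1.87, "CAG", "CGG", 0.81),
    ("GGT", "GAT", 1.79, "ACC", "ATC", 0.68),
    ("CTG", "CAG", 0.22, "CAG", "CTG", 0.03),
    ("CTT", "CGT", 0.47, "AAG", "ACG", 0.12),
]

pairing_ok = all(
    reverse_complement(orig) == h_orig and reverse_complement(mut) == h_mut
    for orig, mut, _, h_orig, h_mut, _ in table_13_8
)
rate_ratios = [rate / h_rate for _, _, rate, _, _, h_rate in table_13_8]

print(pairing_ok)
for (orig, mut, *_), ratio in zip(table_13_8, rate_ratios):
    print(f"{orig} > {mut}: {ratio:.1f}-fold higher than its homologue")
```

If mutagenesis and repair acted identically on both strands, every ratio would be expected to be close to 1; in the table, the listed substitution is the more frequent member of each pair.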
Single-Base-Pair Substitutions in Human mRNA Splice Junctions
Single-base-pair substitutions (point mutations) affecting mRNA splicing are nonrandomly distributed, and this nonrandomness can be related to the phenotypic consequences of mutation.48 Naturally occurring point mutations that affect mRNA splicing fall into four main categories: (1) Mutations within 5′ or 3′ consensus splice sites. Such lesions usually reduce the amount of correctly spliced mature mRNA and/or lead to the utilization of alternative splice sites in the vicinity. This results in the production of mRNAs that either lack a portion of the coding sequence (exon skipping) or contain additional sequence of intronic origin (cryptic splice-site utilization). (2) Mutations within an intron or exon that may serve to activate cryptic splice sites and lead to the production of aberrant mRNA species. (3) Mutations within a branch-point sequence. (4) Mutations in the introns that may regulate the efficiency of splicing, balance of alternative transcripts, and spliceosome assembly. Our understanding of the mechanism of these latter mutations is poor.
Splice-Junction Mutations Causing Human Genetic Disease
Splicing defects are not an uncommon cause of human genetic disease. The vast majority of known gene lesions that affect splicing are point mutations within 5′ and 3′ splice sites (ss). As shown in Fig. 13-1, the 1373 splicing mutations account for 9.6 percent of the 14363 mutations recorded in the HGMD (as of September 13, 1998). Krawczak et al48 first collected from the literature (until mid-1991) a total of 101 different examples of point mutation in the vicinity of exon-intron splice junctions of human genes that altered the accuracy or efficiency of mRNA splicing and were responsible for a specific disease phenotype. Since then, the accelerated pace of gene and mutation discovery has greatly enriched our understanding of the importance of different splice signal sequences. Of 1373 different splice-site mutations, 797 (58 percent) affected the 5′ ss (donor splice site), 464 (34 percent) were located in 3′ ss (acceptor splice site), and most of the remaining 112 resulted in the creation of novel splice sites. Fig. 13-6 shows the consensus splice-site sequences of mammalian genes. For both the wild-type and mutated splice sites, consensus values (CVs 49) can be calculated that reflect the similarity of any one splice site to the consensus sequence. A splice site containing the least frequent bases at each position would yield a CV of 0, whereas splice sites containing only the most frequent bases would have a CV of 1. CV for the wild-type splice sites (consensus value for normal [CVN] splice sites) studied were from 0.7 to 1, with a mean of about 0.83 for the 5′ and 3′ ss. Sequences with either extremely small or extremely high CVN were lacking. This finding suggested that splice sites that are less than optimal in terms of their similarity to the consensus sequence are especially prone to the deleterious effects of mutation. Splice sites with an already extremely low degree of similarity would not be further functionally impaired by single-base changes. 
An analysis was also conducted for the CVs of mutated splice sites (CVM).48 These CVMs ranged from 0.48 to 0.74 for the 3′ ss and from 0.5 to 0.84 for the 5′ ss. This clearly indicated that mutations at splice-site junctions serve to reduce the similarity to the consensus.
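The behavior of a consensus value can be illustrated with a small sketch. The function below is a min-max normalized score that satisfies the two boundary conditions stated above (0 for the least frequent base at every position, 1 for the most frequent base at every position); it is an assumption that this matches the published CV definition in every detail, and the four-position frequency matrix is hypothetical, not the data of Fig. 13-6.

```python
def consensus_value(site, freqs):
    """Min-max normalised similarity of `site` to a position frequency matrix.

    freqs is a list with one dict per position, mapping base -> frequency.
    Returns 0.0 for the least frequent base everywhere, 1.0 for the most frequent.
    """
    if len(site) != len(freqs):
        raise ValueError("site and frequency matrix must have the same length")
    total = sum(column[base] for base, column in zip(site, freqs))
    lowest = sum(min(column.values()) for column in freqs)
    highest = sum(max(column.values()) for column in freqs)
    return (total - lowest) / (highest - lowest)


# Hypothetical frequencies (percent) for a toy 4-nt window spanning an invariant GT.
TOY_MATRIX = [
    {"A": 10, "C": 5, "G": 80, "T": 5},
    {"A": 0, "C": 0, "G": 100, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 100},
    {"A": 60, "C": 5, "G": 30, "T": 5},
]

if __name__ == "__main__":
    wild_type = consensus_value("GGTA", TOY_MATRIX)
    mutant = consensus_value("GATA", TOY_MATRIX)  # invariant G replaced by A
    print(f"CVN = {wild_type:.2f}, CVM = {mutant:.2f}")
```

In this toy case the substitution at the invariant G lowers the score, mirroring the observation that CVMs are lower than the corresponding CVNs.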
Consensus sequences for the 5′ splice site (ss) (donor site), 3′ ss (acceptor site), and the branch point. Numbers corresponding to the nucleotides represent frequencies of each given nucleotide in the collections of Padgett et al50 and Shapiro and Senapathy.49
Location and Spectrum of Splice-Site Mutations
Comparison of the number of mutations in the HGMD reported at particular splice-site positions with their corresponding expectations, based on substitution rates from human gene-coding regions, indicates that point mutations are significantly overrepresented at the invariant positions +1 (observed, 414; expected, 189.5) and +2 (observed, 89; expected, 51.0) of the 5′ ss, and positions −1 (observed, 192; expected 62.8) and −2 (observed, 168; expected, 23.3) of the 3′ ss. Mutations at all other positions within splice sites were underrepresented. Table 13-9 summarizes the observed and expected frequencies of point mutations at different positions in 5′ ss and 3′ ss. Of the 1373 splicing mutations in the HGMD, 414 (30.1 percent) occur at the 5′-ss position +1 and 89 (6.5 percent) at position +2 (http://www.hgmd.org). The majority (58 percent) of the G+1 mutations were to A, and the majority of the T+2 mutations were to C. In the 3′ ss, there are 168 mutations (12 percent) in the invariant A−2, and 192 (14 percent) in the invariant G−1. The majority (53 percent) of the G−1 mutations were to A, and the majority (69 percent) of the A−2 mutations were to G. Therefore, the four invariant nucleotides (of the 24 involved in the splice-site consensus sequences) in the 5′ ss and the 3′ ss represent a total of 863, i.e., 62.8 percent of the splicing mutations. Fig. 13-7 depicts the distribution of mutations within the consensus 5′ ss and 3′ ss logged in the HGMD. It is of interest that a considerable number of mutations have been found in nucleotides +5 and −1 of the consensus 5′ ss, although these positions are not invariant. At position +5 of the 5′ ss, a total of 103 mutations have been reported, whilst, at position −1, a total of 100 mutations have been found (as of September 13, 1998). Table 13-9, however, shows that these numbers are not higher than expected under a model of random mutations.
Mutations in the consensus sequences of splice junctions recorded in the Human Gene Mutation Database.
Table 13-9: Observed and Expected Frequencies of Point Mutations at Different Positions in 5′ and 3′ Splice Sites (from the Human Gene Mutation Database, September 13, 1998)
|5′ Splice Sites ||3′ Splice Sites |
| || |
|Pos ||Obs ||Exp ||Pos ||Obs ||Exp |
|–2 ||15 ||84 ||–6 ||11 ||47 |
|–1 ||100 ||154 ||–5 ||5 ||57 |
|+1 ||414 ||189 ||–4 ||5 ||48 |
|+2 ||89 ||51 ||–3 ||26 ||45 |
|+3 ||42 ||73 ||–2 ||168 ||23 |
|+4 ||20 ||57 ||–1 ||192 ||63 |
|+5 ||103 ||119 ||+1 ||4 ||72 |
|+6 ||14 ||68 ||+2 ||4 ||61 |
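The over- and underrepresentation described in the text can be checked directly from the counts in Table 13-9. This is a minimal sketch: the numbers are copied from the table and the text, only the 5′ ss rows and the two invariant 3′ ss positions are included, and the variable names are illustrative.

```python
# (observed, expected) counts from Table 13-9.
FIVE_PRIME_SS = {
    "-2": (15, 84), "-1": (100, 154), "+1": (414, 189), "+2": (89, 51),
    "+3": (42, 73), "+4": (20, 57), "+5": (103, 119), "+6": (14, 68),
}
THREE_PRIME_INVARIANT = {"-2": (168, 23), "-1": (192, 63)}
TOTAL_SPLICING_MUTATIONS = 1373  # HGMD, September 13, 1998


def enrichment(table):
    """Observed/expected ratio per position; values above 1 mean overrepresentation."""
    return {pos: obs / exp for pos, (obs, exp) in table.items()}


overrepresented_5ss = [pos for pos, ratio in enrichment(FIVE_PRIME_SS).items() if ratio > 1]

invariant_total = (
    FIVE_PRIME_SS["+1"][0] + FIVE_PRIME_SS["+2"][0]
    + THREE_PRIME_INVARIANT["-2"][0] + THREE_PRIME_INVARIANT["-1"][0]
)
invariant_share = 100 * invariant_total / TOTAL_SPLICING_MUTATIONS

if __name__ == "__main__":
    print("Overrepresented 5' ss positions:", overrepresented_5ss)
    print(f"Invariant positions: {invariant_total} mutations, {invariant_share:.2f} percent")
```

Only the invariant +1 and +2 positions of the 5′ ss exceed their expectation; positions −1 and +5, despite their high absolute counts, do not.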
It appears very likely that the observed nonrandomness of mutation within splice sites is a reflection of relative phenotypic severity (and hence detection bias) rather than any intrinsic difference in the underlying frequency of mutation. The replacement of G residues at positions +1 and +5 of 5′ ss would be predicted to reduce significantly the stability of base pairing of the splice site with the complementary region of U1 small nuclear RNA (snRNA). Binding to U1 snRNA is essential for the pre-mRNA to be folded correctly before cleavage and ligation can occur within the spliceosome. The same argument holds true for the mutations observed at position −1.51 Only 42 examples of mutations at the +3 and 20 at the +4 positions of 5′ ss, respectively, were noted; the corresponding residues in U1 snRNA are pseudouridines rather than a cytosine. Thus, the spectrum of 5′-ss mutations observed in vivo suggests an important role for U1 snRNA binding.
Mutations Creating Novel Splice Sites
A different category of mutation affecting mRNA splicing is provided by single-base-pair substitutions outside actual splice sites that create novel splice sites that substitute for the wild-type sites. This category may contain more mutations than currently appreciated, because very few sequence data exist for introns as compared with coding regions. A total of 13 mutations creating novel splice sites (13 percent of the 101 splice mutations) were collected in a survey by Krawczak et al;48 in all but one case, the novel splice site was situated upstream of the original wild-type site. One intriguing finding for mutations creating novel 3′ acceptor splice sites should be noted: All six mutations introduced an A at −2, but never a G at −1. CVs for the activated cryptic splice sites (CVAs) were calculated when possible; in 8 of 12 cases, the CVA was as high as or higher than the wild-type CVN, suggesting that the novel splice sites successfully compete with the wild-type sites for splicing factors. For mutations in the vicinity of 3′ ss, the relative proportion of cryptic splice-site-utilizing mRNA appeared to correlate positively with the CVA/CVN ratio, whereas, at 5′ ss, the distance to the wild-type site may have also played an important role. The current version of the HGMD contains 112 mutations outside the consensus splice sites, and most of them create novel splice sites.
Phenotypic Consequences of Splice-Site Mutation in Vivo
The phenotypic consequences of naturally occurring point mutations in the 5′ ss of seven human genes were studied by Talerico and Berget, 52 who observed exon skipping in six cases as compared with only one case (β-globin gene) of cryptic splice-site usage. These initial results suggested that exon skipping might be the preferred in vivo phenotype, an assertion confirmed by many subsequent reports. One major mRNA species was usually observed, and this invariably lacked either the exon upstream of the mutated 5′ ss or downstream of the mutated 3′ ss. A detection bias is nevertheless possible, since a single exon-skipped transcript might be easier to detect/identify than a number of less frequent transcripts each resulting from the use of a different cryptic splice site. Several instances of the detection of small amounts of residual wild-type mRNA from the cells of patients with a 5′-ss defect have also been reported. All these involve the mutation of bases outside the invariant GT dinucleotide, suggesting that normal splicing is still possible in such cases, albeit at greatly reduced efficiency. The choice between exon skipping and cryptic splice-site usage may be visualized merely as a decision about whether to utilize the next available legitimate splice site or the next best, albeit illegitimate, sequence in the immediate vicinity. This choice may be made on the basis of the presence/absence of sites capable of competing with the mutated splice site for splicing factors. Krawczak et al48 studied the regions both upstream and downstream of their collection of mutations in an attempt to correlate sequence properties with the observed phenotypic consequences of mutation. 
They presented evidence that, at least for 5′-ss mutations, cryptic splice-site usage is favored under conditions in which a number of such sites are present in the immediate vicinity and exhibit sufficient homology to the splice-site consensus sequence for them to compete successfully with the mutated splice site. Fig. 13-8 schematically represents the consequences of splice mutations with reference to representative examples. In this chapter, exon skipping as a consequence of nonsense mutations in the skipped exon63,67 is discussed under nonsense mutations.
Examples of exon skipping and utilization of cryptic splice sites as a result of mutations in splice sites. Solid square and circle denote normal or activated 3′ ss and 5′ ss, respectively. Open square and circle represent cryptic 3′ ss and 5′ ss, respectively. The arrow denotes a nonsense mutation. Examples of exon skipping due to 5′ ss mutations are reported in Weil et al, 53,54 Grandchamp et al, 55 Carstens et al, 56 and Wen et al;57 exon skipping due to 3′ ss mutations is reported in Tromp and Prockop58 and Dunn et al59 ; use of cryptic 5′ ss due to 5′ ss mutations is reported in Treistman et al60 and Atweh et al61 ; use of cryptic 3′ ss due to 3′ ss mutations is reported in Carstens et al56 and Su and Lin62 ; exon skipping due to nonsense mutations is reported in Dietz et al63 ; and activation of cryptic 5′ ss and 3′ ss is reported in Orkin et al, 64 Nakano et al, 65 and Mitchell et al66
Mutations within the Pyrimidine Tract
The HGMD contains 69 mutations in the pyrimidine tract of the 3′ ss. Some examples include the steroid 21-hydroxylase B (CYP21B) and HBB genes causing adrenal hyperplasia and β thalassemia, respectively.68-70 It is not clear how and why these mutations at polypyrimidine tracts exert a pathologic influence on efficient mRNA splicing. It may be that some 3′ ss are more susceptible to the effects of pyrimidine loss than are others by virtue of the relative length of the pyrimidine tract.
Mutations at the Branch Point
An intermediate stage in eukaryotic RNA splicing is the formation of a lariat structure utilizing an A (adenosine) residue approximately 10 to 50 nucleotides from the 3′ ss. A weak consensus sequence, CTRAY, for this branch point has been observed in mammalian genes. After lariat structure formation, the first downstream AG dinucleotide is usually chosen as the acceptor splice site.71 In a family with X-linked hydrocephalus, an A-to-C mutation 19 nucleotides upstream from a normal splice acceptor site of exon Q of the L1CAM gene on Xq28 segregated with the disease phenotype.72 The mutation resulted in several RNA species exhibiting exon-Q skipping, insertion of 69 bp due to utilization of a cryptic splice site, or normal splicing. Another example of such a mutation was reported in the COL5A1 gene causing Ehlers-Danlos syndrome type II. Affected members from two British families were heterozygous for a T-to-G point mutation in intron 32 (IVS32, −25T>G), causing loss of the 45-bp exon 33 from the mRNA in 60 percent of transcripts of the mutant gene.73 The mutation lies only 2 bp upstream of a highly conserved adenosine in the consensus branch-site sequence that is required for lariat formation. A similar branch-site point mutation (IVS4, −22T>C) in the LCAT gene was observed in a family with fish eye disease, a condition characterized by corneal opacities and low plasma high-density-lipoprotein (HDL) cholesterol. The mutation caused intron retention rather than exon skipping.74 In a patient with erythropoietic protoporphyria, a C-to-T mutation in intron 2 (IVS2, −23C>T) of the ferrochelatase (FECH) gene was found.75 This mutation was associated with skipping of exon 2.
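A search for candidate branch points using the weak CTRAY consensus and the 10-to-50-nucleotide window described above might be sketched as follows. The function name, the distance convention (the last intronic base is position −1), and the toy intron sequence are assumptions made for illustration; real branch-point prediction is considerably less certain than a simple pattern match.

```python
import re

# CTRAY with R = A or G and Y = C or T; the lookahead allows overlapping matches.
BRANCH_PATTERN = re.compile(r"(?=(CT[AG]A[CT]))")


def candidate_branch_points(intron_end, min_dist=10, max_dist=50):
    """Find CTRAY motifs near the 3' end of an intron.

    intron_end: intronic sequence ending with the AG of the 3' splice site.
    Returns (distance, motif) pairs, where distance is the number of nucleotides
    from the branch-point A to the end of the intron (last intronic base = 1).
    """
    seq = intron_end.upper()
    hits = []
    for match in BRANCH_PATTERN.finditer(seq):
        a_index = match.start(1) + 3  # the A is the fourth base of CTRAY
        distance = len(seq) - a_index
        if min_dist <= distance <= max_dist:
            hits.append((distance, match.group(1)))
    return hits


if __name__ == "__main__":
    toy_intron = "TTTT" + "CTGAC" + "T" * 18 + "AG"  # hypothetical sequence
    toy_mutant = "TTTT" + "CTGCC" + "T" * 18 + "AG"  # branch-point A replaced by C
    print(candidate_branch_points(toy_intron))
    print(candidate_branch_points(toy_mutant))
```

Replacing the branch-point A in the toy sequence removes the only candidate, which is the kind of lesion illustrated by the L1CAM example.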
Mutations in Alu Sequences and Creation of a New 5′ Splice Site
The creation of a new 5′ ss consequent to a point mutation in a member of the Alu family of human repetitive elements was noted by Mitchell et al.76 Analysis of the ornithine δ-aminotransferase (OAT) mRNA of a patient with gyrate atrophy revealed a 142-nucleotide insertion at the junction of exons 3 and 4. The patient possessed a much reduced level (5 percent) of abnormal mRNA in his fibroblasts and an even smaller amount of normal-sized mRNA. An Alu sequence is normally present in intron 3 of the OAT gene, 150 bp downstream of exon 3. The patient was homozygous for a C-to-G transversion in the right arm of this Alu repeat, which served to create a new 5′ ss. This mutation activated an upstream cryptic 3′ ss (the polyT complement of the Alu polyA tail followed by an AG dinucleotide), and a new “exon,” containing the majority of the right arm of the Alu sequence, was recognized by the splicing apparatus and incorporated into the mRNA. The splice-mediated insertion of an Alu sequence in reverse orientation may yet prove to be a not unusual mechanism of insertional mutagenesis, since Alu sequences are interspersed through many coding sequences, the sequence requirements for a functional 3′ ss are far from stringent, and the reverse complement of a consensus Alu repeat contains at least two cryptic 3′ ss and several potential 5′ ss.
There are certainly several intron sequence motifs, not yet fully recognized, that contribute to the regulation of the splicing mechanism. For example, mutations were detected in IVS3 of the human growth hormone (GH1) gene that affect a novel putative consensus sequence and perturb splicing, resulting in exon skipping.77 These mutations did not occur within the 5′ and 3′ ss or branch consensus sites. The first was a G-to-A transition at nt +28, and the second deleted 18 bp (del+28−45), of IVS3 of the human GH1 gene. These mutations segregated with autosomal dominant GH deficiency in both kindreds, and no other allelic GH1 gene changes were detected. Reverse transcriptase-polymerase chain reaction (RT-PCR) amplification showed a >10-fold preferred use of alternative splicing. Both mutations were located 28 bp downstream from the 5′ ss, and both perturbed an intronic XGGG repeat similar to that found to regulate mRNA splicing in chicken β-tropomyosin. Binding of heterogeneous nuclear ribonucleoprotein (hnRNP) to these sequences in pre-mRNA transcripts is thought to play an important role in pre-mRNA packaging and transport as well as 5′-ss selection in pre-mRNAs that contain multiple 5′ ss.77
In patients with frontotemporal dementia with parkinsonism, three heterozygous mutations in a cluster of 4 nts +13 to +16 of exon 10 of the tau (MAPT) gene were found.78 All of these mutations destabilized a potential stem-loop structure that is probably involved in regulating the alternative splicing of exon 10. This caused more frequent use of the 5′ ss and an increased proportion of tau transcripts that include exon 10. The increase in exon 10+ mRNA was expected to increase the proportion of tau protein containing four microtubule-binding repeats, which is consistent with the neuropathology described in families with this type of frontotemporal dementia. Mutations in intron regions that regulate the proportion of alternatively spliced exons may therefore be an important mechanism for late-onset phenotypes.
mRNA Processing (Other Than Splicing) and Translation Mutations
Mutations affecting mRNA processing and translation may exert their pathologic effects at any one of the various stages between transcriptional initiation and translation. Mutations other than those affecting mRNA splicing are now described and their phenotypic consequences assessed.
The transcription of an mRNA is initiated at the cap site (+1), so named because of the posttranscriptional addition of 7-methylguanine at this position to protect the transcript from exonucleolytic degradation. Wong et al 79 described an A-to-C transversion at the cap site in the HBB gene of an Indian patient with β thalassemia. Kozak80 collated known eukaryotic mRNA sequence data and showed that the cap site is an adenine in 76 percent of cases. A cytosine residue at position 1 was noted in only 6 percent of cases. It is not clear, however, whether it is transcription of the β-globin gene that is severely reduced in the above patient or whether transcriptional initiation occurs efficiently but at a different, incorrect site. In the latter case, the resulting transcript could be either incomplete or unstable.
Mutations in Initiation Codons.
There are 59 mutations recorded in the HGMD affecting Met (ATG) translational initiation codons, with a preponderance of M-to-V substitutions. The consequences for mRNA transcription and translation have not been well studied. It is particularly useful to compare and contrast the two ATG mutations reported in the α1- and α2-globin genes, respectively. The α1-globin gene mutation was associated with a reduction in the steady-state α1-globin mRNA level to one-fourth normal.81 The corresponding α2-globin mRNA level consequent to the α2-globin gene lesion was similarly reduced to one-third normal.82 The α2-globin gene mutation results in a greater reduction in α-globin synthesis and a more severe α-thalassemia phenotype than its α1-globin counterpart. This is presumably because, in normal individuals, the ratio of α2 to α1 mRNA produced from the two genes is 2.6, reflecting the relative importance of the α2-globin gene in α-globin synthesis. The observed reductions in steady-state mRNA levels are reminiscent of the consequences of nonsense mutations (see below). Mitchell et al 66 reported a normal amount of OAT mRNA in Lebanese patients with gyrate atrophy who were homozygous for an initiation codon mutation.
Is the mutant mRNA translated? The answer is likely to be determined by a complex interplay of the different structural features of an mRNA that serve to modulate its translation (reviewed by Kozak83). Until fairly recently, it was thought that an AUG codon was an absolute requirement for translational initiation in mammals. However, some exceptions are now known—for example, ACG and CUG (reviewed by Kozak83)—indicating that some mutations might be tolerated more than others. The scanning model of translational initiation predicts that the 40S ribosomal subunit initiates at the first AUG codon to be encountered within an acceptable sequence context (GCC A/G CCAUGG is believed to be optimal83). Ribosomes may be capable of utilizing mutated AUG codons, albeit with reduced efficiency, or they may be able to initiate translation at the next best available site downstream.84 The phenotypic consequences of a given ATG mutation are thus likely to depend on the nature of the mutational lesion, the tolerance of the ribosome with respect to translational initiation codon recognition, the presence of alternative downstream ATG codons with flanking translational initiation site consensus sequence, and the functional importance of the absent N-terminal end of the protein.
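The scanning model can be illustrated with a short sketch that walks along an mRNA and returns the first AUG found in an acceptable context. The acceptance rule used here (a purine at −3 or a G at +4, with both required for a "strong" site) is a common simplification of the Kozak context and is an assumption of this example, as are the function name and the toy sequences.

```python
def first_aug_in_context(mrna, strong_only=False):
    """Return the 0-based index of the first AUG in an acceptable context, or None.

    Simplified rule (an assumption of this sketch): a site is acceptable if it has
    a purine at -3 or a G at +4; with strong_only=True both are required.
    """
    seq = mrna.upper().replace("T", "U")
    pos = seq.find("AUG")
    while pos != -1:
        minus3 = seq[pos - 3] if pos >= 3 else ""
        plus4 = seq[pos + 3] if pos + 3 < len(seq) else ""
        has_purine = minus3 in ("A", "G")
        has_g4 = plus4 == "G"
        acceptable = (has_purine and has_g4) if strong_only else (has_purine or has_g4)
        if acceptable:
            return pos
        pos = seq.find("AUG", pos + 1)
    return None


if __name__ == "__main__":
    wild_type = "GCCACCAUGGCUCCCAUGGAA"  # hypothetical mRNA fragment
    mutant = "GCCACCGUGGCUCCCAUGGAA"     # initiation codon AUG -> GUG (Met to Val)
    print(first_aug_in_context(wild_type))
    print(first_aug_in_context(mutant))
```

In the toy mutant, initiation shifts to the next AUG downstream, which here happens to be in frame; whether such a site is used in vivo, and whether the shortened protein is functional, depends on the factors listed above.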
Creation of a New Initiation Codon.
Another type of mutation that interferes with correct initiation is the creation of a cryptic ATG codon (in the context of a favorable Kozak consensus sequence) in the vicinity of the one normally used. An example of this type of lesion is provided by the G-to-A transition at position 122 (relative to the cap site) of the β-globin gene causing β-thalassemia intermedia.85 This cryptic initiation codon is 26 bp 5′ to the normal ATG codon, and its use would lead to a frameshift and premature termination 36 bp downstream. Although the relative extent of utilization of the two ATG codons in this patient is not known, the comparatively mild clinical phenotype suggests that at least some β-globin is correctly initiated and translated. There are 208 cases in the HGMD for the creation of an ATG (Met) codon, but it is unclear how many of these are then used as aberrant initiation codons.
Mutation in Termination Codons
The first reported example of a mutation in a termination codon was that in the α2-globin (HBA2) gene causing Hemoglobin Constant Spring (Hb Constant Spring), an abnormal hemoglobin that occurs frequently in Southeast Asia.86,87 The associated α-globin chain is 172 amino acids long, rather than the normal 141 amino acids, as a result of a TAA-to-CAA transition in the termination codon. In this patient, translation extends into the 3′ noncoding region of the α2-globin mRNA. The resulting mRNA is highly unstable, resulting in low production of hemoglobin in the red cells of heterozygous carriers.88 Several other mutations are known to occur in the α2-globin termination codon, and a phenotype similar to that of Hb Constant Spring is observed.89 A total of 15 point mutations in the termination codon have been included in the HGMD. Nine of these occur in the TGA, five in the TAA, and one in the TAG termination codons, respectively, of the APRT, ARSB, ATM, CTSK, FGFR3, HBA2, IDUA, PROS1, and AIRE1 genes (see the HGMD). With regard to the distribution of termination codons in mammalian genes, TGA is found in 52 percent, TAA in 27 percent, and TAG in 21 percent of these genes. Elongated proteins may also be generated by a second mechanism: a frameshift mutation close to the natural termination codon that results in the extension of translation until the next available downstream termination codon. A number of examples of this type of lesion are known to cause β thalassemia.69,90-94 All give rise to an imbalance in α- and β-globin chain synthesis and inclusion-body (containing precipitated α and β chains) formation and are associated with the dominant form of the disease.
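The length of an elongated protein produced by a termination-codon mutation follows from the position of the next in-frame stop codon in the 3′ noncoding region. The sketch below shows the arithmetic; the 3′ sequence is a toy placeholder chosen only so that the result reproduces the 141-to-172 change described for Hb Constant Spring, not the real HBA2 sequence, and the function names are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def readthrough_codons(utr3):
    """Count sense codons read in frame from the 3' noncoding region
    (starting immediately after the mutated stop) before the next stop codon.
    Returns None if no in-frame stop codon is found."""
    seq = utr3.upper()
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


def elongated_length(normal_length, utr3):
    """Protein length after a stop-codon mutation: the normal residues, plus one
    residue encoded by the mutated stop codon, plus the read-through codons."""
    extra = readthrough_codons(utr3)
    return None if extra is None else normal_length + 1 + extra


if __name__ == "__main__":
    toy_utr = "GCT" * 30 + "TAA" + "GCC"  # hypothetical: 30 sense codons, then a stop
    print(elongated_length(141, toy_utr))
```

If no in-frame stop codon exists in the sequence supplied, the function returns None rather than guessing a length.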
Polyadenylation/Cleavage Signal Mutations
All polyadenylated mRNAs in higher eukaryotes possess the sequence AAUAAA, or a close homologue, 10 to 30 nucleotides upstream of the polyadenylation site. This motif is thought to play a role in 3′-end formation through endonucleolytic cleavage and polyadenylation of the mRNA transcript. Several single-base-pair substitutions are now known in the cleavage/polyadenylation signal sequences of the α2- and β-globin genes, and all of these cause a relatively mild form of thalassemia due to the reduction of HbA2 synthesis to 3 to 5 percent of the normal level. In the β-globin gene mutants, cleavage and polyadenylation at the normal site are markedly reduced but do still occur at <10 percent of the normal level as judged by both in vivo and in vitro assays.95,96 These mutants are characterized by a novel species of β-globin mRNA 1500 nucleotides long and 900 nucleotides larger than the wild-type transcript. This results from the use of an alternative cleavage/polyadenylation site (AATAAAA) 900 bp 3′ to the mutated site; polyadenylation occurs within 15 nucleotides of this cryptic site. This abnormal mRNA may be highly unstable, since it was extremely difficult to isolate. Several other polyadenylated mRNA species up to 2900 bp in length have been reported in an Israeli patient with a polyadenylation site mutation;97 the β+-thalassemia phenotype exhibited by this patient was consistent with the translation of these extended mRNA species. Outside of the globin genes, a polyadenylation mutation has been described (AATAAC to AGTAAC) in the ARSA gene and causes arylsulfatase pseudodeficiency.98 An unusual T-to-C substitution in the β-globin gene, 12 bp upstream of the AATAAA polyadenylation signal, has been described in an Irish family.85 It is thought that this lesion may serve to destabilize the β-globin mRNA.
Nonsense Mutations and Their Effect on mRNA Levels
Nonsense mutations obviously cause premature termination of translation and truncated polypeptides, but these lesions may also exert their effects at the transcriptional level. Benz et al99 first noticed that some patients with β thalassemia who had nonsense codons in the β-globin gene exhibited very low levels (<1 percent) of β-globin mRNA in erythrocytes. Subsequently, a considerable number of nonsense or frameshift mutations from a variety of different genes have been shown to be associated with dramatic reductions in the steady-state level of cytoplasmic mRNA. However, this rule is not completely inviolable; a few nonsense mutations are associated with normal levels of cytoplasmic mRNA that appears to be efficiently translated to generate a truncated protein (e.g., low-density lipoprotein receptor [LDLR],100 apolipoprotein C-II [apo C-II],101 and β-globin102). Moreover, considerable variation in mRNA levels is apparent between different instances of introduced nonsense codons within the same gene: Thus, measured reticulocyte β-globin mRNA varied from <1 percent normal in a patient with β thalassemia who had a 1-bp frameshift deletion at codon 44 (Kinniburgh et al103) to 15 percent in a patient with a nonsense mutation in codon 17 (Chang and Kan104). Brody et al105 observed that mutations that cause premature termination in the terminal exon of the OAT gene have no effect on mRNA level, but termination in the penultimate exon or earlier is associated with markedly reduced levels of mRNA. Decreased in vitro accumulation of cytoplasmic mRNA has been reported to be associated with several nonsense mutations in the β-globin gene but not with missense mutations.106-110 One potential explanation for the observed effect of nonsense mutations on mRNA metabolism is that mRNA which is incompletely translated is not protected properly from RNase digestion on the ribosome and is therefore likely to exhibit an increased turnover rate. 
Consistent with this postulate, the β-globin mRNA bearing the codon-44 mutation appears to be highly unstable.111 Moreover, Daar and Maquat112 reported that all triosephosphate isomerase I (TPI1) gene nonsense and frameshift mutations tested in vitro exhibited a reduced mRNA stability but did not alter the rate of transcription. At least for the β-globin codon 39 mutation, however, the decreased steady-state levels of both nuclear and cytoplasmic mRNA have been shown not to be due to increased mRNA instability in the cytoplasm.106,107,109
The mechanism by which an in-frame termination codon results in a decrease in concentration of steady-state cytoplasmic mRNA is not understood. One or more parameters could be affected—the transcription rate, the efficiency of mRNA processing or transport to the cytoplasm, or mRNA stability.113 Urlaub et al114 showed that whereas nonsense mutations in the dihydrofolate reductase (DHFR) gene located prior to the final exon resulted in drastically reduced (10- to 20-fold) mRNA levels, nonsense mutations in the last exon of the gene yielded normal levels of DHFR mRNA. Nuclear run-on studies and experiments with the transcriptional inhibitor actinomycin demonstrated that the low mRNA levels resulted neither from a reduced rate of transcription nor from decreased mRNA stability. Similar results were obtained for nonsense mutations artificially introduced into the TPI1 gene and expressed in vitro.115 Urlaub et al. 114 proposed two explanatory models that imply some form of coupling between processing and/or transport of the mRNA and translation: (1) Translational translocation model: This model proposes that translation of the mRNA on the ribosome would begin as soon as the mRNA emerged from the nuclear pore and would serve to pull the pre-mRNA physically through the splicing apparatus and through the pores in the nuclear membrane. Nonsense mutations would halt the pulling process, leaving the RNA molecule vulnerable to RNase digestion. However, nonsense mutations occurring in the last exon would not be recognized until the translocation of the mRNA from the nucleus was virtually complete. (2) Nuclear scanning of translation frames model: In this model, pre-mRNA are scanned within the nucleus for nonsense mutations prior to their translocation through the nuclear membrane. Detection of an in-frame termination codon would then result in a slowing of mRNA splicing/translocation. 
Such a mechanism might be an intrinsic part of the mRNA-splicing process since open-reading-frame recognition could be important for exon definition. The translational translocation model would predict a probability gradient from 5′ to 3′, with a gradually increasing likelihood that an mRNA containing a termination codon would be successfully transported across the nuclear membrane. In support of this hypothesis are the several examples of normal levels of mRNA transcripts derived from genes bearing termination codons in their 3′-most exons (see the OAT example105) and the TPI1 and DHFR examples that may imply links between pre-mRNA splicing, mRNA transport, and translation. However, counterexamples, such as the β-globin gene codon-17 and codon-44 nonsense codons quoted above, argue against its validity in all cases, since they are inconsistent with a perfect linear relationship between the relative position of the nonsense mutation and the level of mRNA produced by the mutant allele. The problem with invoking any one model alone is that it cannot adequately explain the inconsistencies observed between studies regarding the possible position effect associated with nonsense mutations in vivo and the role of changes in mRNA stability if they occur. In practical terms, the common finding of greatly reduced or absent cytoplasmic mRNA associated with nonsense mutations has important implications for mutation screening. Attempts to obtain mRNA for RT-PCR amplification and DNA sequencing116,117 may be thwarted in patients with nonsense mutations by a cellular mechanism that links mRNA processing/transport to translation.
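The position effect reported for the OAT and DHFR genes (normal mRNA levels when the premature termination codon lies in the last exon, reduced levels otherwise) can be written as a simple rule. The sketch below is only an illustration of that observation: the exon lengths are hypothetical, the function names are invented for this example, and, as the β-globin counterexamples show, the rule should not be read as a reliable predictor.

```python
from bisect import bisect_left
from itertools import accumulate


def exon_number(cdna_position, exon_lengths):
    """Return the 1-based exon containing a 1-based position in the spliced mRNA."""
    ends = list(accumulate(exon_lengths))
    if not 1 <= cdna_position <= ends[-1]:
        raise ValueError("position lies outside the transcript")
    return bisect_left(ends, cdna_position) + 1


def mrna_level_by_position_rule(ptc_position, exon_lengths):
    """Apply the last-exon observation: 'normal' if the premature termination
    codon is in the final exon, otherwise 'reduced'."""
    in_last_exon = exon_number(ptc_position, exon_lengths) == len(exon_lengths)
    return "normal" if in_last_exon else "reduced"


if __name__ == "__main__":
    toy_exons = [100, 50, 200]  # hypothetical exon lengths in nucleotides
    print(mrna_level_by_position_rule(120, toy_exons))
    print(mrna_level_by_position_rule(300, toy_exons))
```

A rule of this kind is useful mainly for planning mutation screening, for example when deciding whether RT-PCR from patient mRNA is likely to recover the mutant transcript.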
Nonsense Mutations and Exon Skipping
Dietz et al63 and Naylor et al67 have reported exon skipping in exons that contain nonsense mutations. In a patient with Marfan syndrome, exon B of the fibrillin gene FBN1 that contained a TAT-to-TAG nonsense mutation was completely skipped.63 The exon skipping was discovered by RT-PCR analysis of fibroblast mRNA. Two additional examples of this phenomenon have been reported by the same authors in the OAT transcripts of patients with gyrate atrophy: exon 6 was skipped when a Trp 178 to Stop mutation was present in this exon; similarly, exon 8 with a Trp 275 to Stop mutation was skipped. The skipping of the exons with nonsense codons in the OAT cases was partial, that is, there were RNA species that contained the nonsense-mutation-containing exons. The authors proposed a mechanism of reading pre-mRNA exon sequences in frame either by direct coupling between translation and RNA processing or by a scanning function of ribosome-like molecules in the nucleus. Naylor et al67 reported similar observations associated with two different nonsense mutations in exons 19 and 22 in the F8C gene in patients with hemophilia A. Partial skipping has been observed with the exon-19 nonsense mutation whereas, in the case of the exon-22 nonsense codon, only PCR products lacking exon 22 were observed. The exon skipping associated with nonsense mutations has been observed in more than 10 disease-related genes in humans. The mechanism that accounts for these observations is unknown.
Dietz and Kendzior,118 using chimeric constructs in a model in vivo expression system, identified premature termination codons as determinants of splice-site selection. Nonsense codon recognition prior to RNA splicing necessitates the ability to read the frame of precursor mRNA in the nucleus. They proposed that maintenance of an open reading frame can serve as an additional level of scrutiny during exon definition.
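The notion of "reading the frame" can be made concrete with a minimal sketch. It assumes a coding sequence that begins at the first base of a codon; the function name and the toy sequences are hypothetical and are not taken from FBN1, OAT, or F8C. The function reports the first in-frame termination codon that lies upstream of the normal final codon, which is what a TAT-to-TAG substitution of the kind described above would create.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_premature_stop(cds):
    """Return the 0-based codon index of the first in-frame stop codon
    that precedes the final codon, or None if the reading frame is open."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    for index, codon in enumerate(codons[:-1]):  # final codon is the normal stop
        if codon in STOP_CODONS:
            return index
    return None


# Hypothetical toy sequences (not real gene sequence).
wild_type = "ATGGCTTATGGCTAA"   # ATG GCT TAT GGC TAA
mutant = "ATGGCTTAGGGCTAA"      # TAT -> TAG at codon index 2

print(first_premature_stop(wild_type))
print(first_premature_stop(mutant))
```

A check of this kind says nothing about how the nucleus might perform such scanning; it only illustrates what information an open-reading-frame test would need.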
Most mutations causing human genetic disease occur in transcribed regions. A different class of molecular lesion is that represented by regulatory mutations. These lesions disrupt the normal processes of gene activation and transcriptional initiation and serve either to increase or decrease the level of mRNA/gene product synthesized rather than altering its nature. The vast majority of regulatory mutations so far described are found in gene promoter regions—the 5′ flanking sequences that contain constitutive promoter elements, enhancers, repressors, the determinants of tissue-specific gene expression, and other regulatory elements. Mutations in the regulatory elements may have several consequences, such as alteration of the amount of mRNA transcript and/or alteration of the developmental expression of a gene. In the majority of regulatory mutations, the mRNA produced is qualitatively normal, and therefore mutation detection methods based on RT-PCR will fail to recognize these lesions. On the other hand, the detection of mutations in potential unknown regulatory elements may predict the existence of such elements. A total of 119 regulatory mutations have been cataloged in the HGMD. Some representative examples of mutations in regulatory elements in the human genome are discussed below.
Mutations in DNA Motifs in the Immediate 5′ Flanking Sequences.
Single-base-pair substitutions that occur in the promoter region 5′ to the β-globin (HBB) gene causing β thalassemia give rise to a moderate reduction in globin synthesis. The known naturally occurring mutations are highly clustered around two regions that have been implicated in the regulation of the human β-globin (HBB) gene. One is a CACCC motif located between −91 and −86 relative to the transcriptional initiation site and the other is the TATA box found at about −30. Mutations have been described in the CACCC motif at positions −92, −90, −88, −87, and −86, and the TATA motif at positions −31, −30, −29, and −28, of the β-globin gene.60,119-127 Almost all of these mutations are associated with a mild clinical phenotype. The CACCC box binds one or more erythroid-specific nuclear factors involved in the developmental activation of β-globin gene transcription. A −101 mutation occurs in the second upstream CACCC motif between −105 and −101 of the β-globin gene.128 Matsuda et al129 have reported a T-to-C transition at position −77 of the δ-globin (HBD) gene in Japanese patients with δ thalassemia. This lesion occurs at the second position of an inverted binding motif (TTATCT) for the DNA-binding protein GATA1. Gel retardation and CAT expression assays demonstrated that this mutation appears to impair δ-globin gene expression by abolishing GATA1 binding to its recognition sequence.
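Positions such as −30 or −88 are counted backward from the transcriptional initiation site, with −1 being the base immediately upstream of +1 (there is no position 0). A minimal sketch of how motif coordinates in this convention can be derived from a 5′ flanking sequence follows. The function name and the toy sequence are hypothetical; the toy sequence is not the HBB promoter and is built only so that the arithmetic can be followed.

```python
def motif_positions(upstream, motif):
    """Locate every occurrence of `motif` in `upstream`, a 5' flanking
    sequence whose last base sits at position -1 relative to the
    transcriptional initiation site. Returns (start, end) pairs, inclusive."""
    upstream = upstream.upper()
    motif = motif.upper()
    length = len(upstream)
    hits = []
    start = upstream.find(motif)
    while start != -1:
        first = start - length                # index 0 maps to position -length
        hits.append((first, first + len(motif) - 1))
        start = upstream.find(motif, start + 1)
    return hits


# Hypothetical 61-bp toy flanking sequence with G filler around two motifs.
toy = "G" * 5 + "CACCC" + "G" * 20 + "TATAAA" + "G" * 25

print(motif_positions(toy, "CACCC"))
print(motif_positions(toy, "TATAAA"))
```

With a real flanking sequence, a reported substitution can then be checked against the motif span to see whether it falls inside a known element.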
Mutations in cis-acting regulatory elements can also increase gene expression. The best examples of such mutations have been observed in hereditary persistence of fetal hemoglobin (HPFH), which is usually a heterozygous condition in which inherited gene lesions cause a marked but variable increase in HbF (α2γ2) synthesis above the normal adult level of <1 percent. The molecular analysis of HPFH has revealed both deletion and nondeletion forms. The nondeletion form of HPFH is caused by point mutations within the highly homologous promoter regions of the γ-globin genes. There are three examples of mutation at homologous positions in the Aγ- and Gγ-globin genes, at positions −114, −175, and −202.130-135 The −202 mutation occurs within a GGGGCCCC motif reminiscent of the GC box (GGGCGG) that serves as a binding site for the transcription factor Sp1. The T-to-C mutations at −175 occur within an ATGCAAAT motif (−182 to −175) known as the octamer, found in the promoters of genes encoding immunoglobulins, histones, and snRNA. The −175 lesion has been shown to increase promoter activity between 3- and 20-fold in erythroid cells.136-140 This lesion appears to reduce or abolish the ability of the ubiquitous octamer-binding protein (OTF-1, which is thought to be a repressor of γ-globin gene transcription) to bind at this site140-143 and alters the binding of GATA1.136,139-141 Using gel-retardation assays, Fucharoen et al132 demonstrated that the −114 mutation abolishes the binding of CP1 to the distal CCAAT motif of the Gγ-globin gene, although the lesion does not affect the binding of erythroid-specific factors.
Hemophilia B Leyden is an F9 variant characterized by severe childhood hemophilia ameliorated at puberty, probably under the influence of testosterone,144 and is an example of developmental specificity of regulatory mutations. The amelioration in clinical phenotype is foreshadowed by an increase in plasma F9 activity/antigen values from <1 percent to between 30 and 60 percent of normal. Several mutations have been found at positions −20, −6, +6, +8, and +13 relative to the transcriptional initiation site in such patients.145-153 Reitsma et al146 noted that the region from −5 to +23 possesses significant homology with the region immediately upstream, from −31 to −6. All mutated sites occur within the region of homology. Crossley and Brownlee154 demonstrated that the +13 mutation lies within a binding site (+1 to +18) for the CCAAT/enhancer-binding protein (C/EBP) and serves to abolish binding of C/EBP to this site. Other transcription factors have been shown to bind in the −32 to +23 region.155 Hirosawa et al149 demonstrated that mutations at −20 and −6 were associated with lowered expression of the F9 gene and that restoration of expression in a concentration-dependent fashion was observed on treatment of the cultured cells with androgen. Crossley et al155 found that an AGCTCAGCTTGTACT motif between −36 and −22, with strong homology to the androgen-response element (ARE) consensus sequence, is functional. It would appear that, before puberty, several transcription factors (including C/EBP, LF-A1/HNF4, and a further protein that binds to the −6 site) are involved in potentiating the expression of the F9 gene. Since mutations interfering with the binding of any of these factors lead to the abolition of F9 gene transcription, these proteins probably act in concert. It is assumed that at puberty, when a testosterone-dependent mechanism mediated by the ARE comes into play, the binding of all three transcription factors ceases to be an absolute requirement for transcription to occur.
Mutations Outside the Immediate 5′ Flanking Sequences.
In addition to known mutations in the remote promoter element known as the locus control region (LCR; see below), Berg et al156 have reported a +ATA/−T mutation at position −530, upstream of the HBB gene, that is associated with reduced β-globin synthesis. This lesion reportedly results in a ninefold increase in the binding capacity of BP1, a protein that may therefore possess the properties of a repressor. In two families with X-linked dominant Charcot-Marie-Tooth neuropathy, a T-to-G transversion at position −528 and a C-to-T transition at position −458 relative to the ATG start codon of the connexin 32 gene (GJB1) have been found.157 The first mutation is located in the nerve-specific GJB1 promoter just upstream of the transcription start site, whereas the second is located in the 5′ untranslated region (UTR) of the mRNA.
Regulatory mutations have also been reported in the 3′ flanking sequences of genes. A G-to-A transition 69 bp 3′ to the polyadenylation site appears to be responsible for drastically reducing the expression of the δ-globin gene, causing δ thalassemia.158 The lesion occurs within a motif homologous to the consensus recognition sequence for the erythroid-specific DNA-binding protein GATA1. Gel-retardation assays have shown that the G-to-A transition resulted in an increased binding affinity for GATA1.158
Mutation in Remote Promoter Elements.
The first indication that mutations at a considerable distance 5′ to the transcriptional initiation site could affect the expression of a downstream gene came from van der Ploeg et al:159 A >40-kb deletion of the Gγ-, Aγ-, and δ-globin genes was found in a Dutch case of γδβ thalassemia, but this deletion had left the β-globin gene intact, together with at least a 2.5-kb 5′ flanking sequence (Fig. 13-9). The implication was that the removal of sequences far upstream of the β-globin gene had resulted in suppression of its transcriptional activity. Kioussis et al160 then showed that although the β-globin gene in this patient was identical in sequence to that of the wild type, the surrounding chromatin appeared to be in an inactive conformation as judged by DNase-1 sensitivity and methylation analysis. Curtin et al161 reported a 90-kb deletion of the β-globin gene cluster in an English patient with γδβ thalassemia; the ε-globin gene and part of the Gγ-globin gene were deleted, but Aγ-, ψβ-, δ-, and β-globin genes were intact. A deletion more than 25 kb upstream of the β-globin gene therefore served to abolish its expression. Driscoll et al162 described an important 25-kb deletion in a Hispanic patient with γδβ thalassemia; the deletion was located between 9.5 and 39 kb upstream of the ε-globin gene and included three of the four erythroid cell-specific DNase-1 hypersensitive sites 5′ to the ε-globin gene. All of the globin genes, including the ε-globin gene, remained intact; the β-globin gene, some 60 kb downstream from the 3′ deletion breakpoint, was nevertheless nonfunctional. Grosveld et al163 showed that DNA containing the four erythroid-specific hypersensitive sites was capable of directing a high level of position-independent β-globin gene expression in vitro. The LCR 5′ to the ε-globin gene is thought to organize the β-globin gene cluster into an active chromatin domain and to enhance the transcription of individual globin genes. 
The Hispanic γδβ thalassemia deletion results in an altered chromatin structure throughout more than 100 kb in the β-globin gene cluster as revealed by a change in the sensitivity to DNase-1 digestion (see Fig. 13-9). A similar LCR is also present in the α-globin gene cluster at chromosome 16pter-p13.3.164 Hatton et al165 reported a 62-kb deletion causing α thalassemia encompassing the embryonic α-like ζ2-globin gene that left the other genes and pseudogenes of the α-globin gene cluster intact. Though the sequences of the α1- and α2-globin genes were found to be normal, they nevertheless appeared to be transcriptionally inactive. Several other examples of similar deletions 5′ to the α-globin gene cluster have now been reported.166-169 These deletions exhibit an area of overlap between 30 and 50 kb upstream of the α-globin genes. This region contains several DNase-1 hypersensitive sites (two erythroid-specific) and is capable of directing the high-level expression of an α-globin gene both in stably transfected mouse erythroleukemia cells and when integrated into the genomes of transgenic mice.170,171
Schematic representation of the deletions in the β-globin gene cluster that eliminate the locus control region (LCR) and result in silencing of the normal β-globin (HBB) gene. The extent of each deletion is shown as a thick black line. The LCR and its four DNase hypersensitive sequences (HRS) are depicted. The bottom part of the figure shows the conversion of the entire β-globin gene cluster to a DNase I-resistant state as a result of the Hispanic γδβ-thalassemia deletion.
There are also reports of associations between apparently polymorphic variations in the vicinity of particular genes and specific phenotypes. All of these associations require large epidemiologic studies and detailed characterization of the potential regulatory elements to determine whether the polymorphic differences contribute to these phenotypes. For example, there is a G or A polymorphism at position +97 downstream from the termination codon in the 3′ UTR of the prothrombin (F2) gene. Eighteen percent of patients with a documented familial history of venous thrombophilia had the A at this position as compared with 1 percent of a group of healthy controls.172 An association was also found between the presence of the A allele and elevated prothrombin levels. This allele could therefore be used as a marker of risk for venous thrombophilia. The polymorphism may either be directly responsible for the association or be in linkage disequilibrium with another mutation that in turn is responsible for the observed association.
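The strength of an association of this kind is commonly summarized as an odds ratio. The sketch below computes a crude odds ratio from the two carrier frequencies quoted above (18 percent of patients, 1 percent of controls). It is illustrative arithmetic only: the function name is hypothetical, the group sizes are not given here, and so no confidence interval can be derived and the figure should not be read as the estimate reported by the original study.

```python
def crude_odds_ratio(carrier_freq_cases, carrier_freq_controls):
    """Odds of carrying the allele among cases divided by the odds among
    controls. Both arguments are proportions strictly between 0 and 1."""
    for freq in (carrier_freq_cases, carrier_freq_controls):
        if not 0.0 < freq < 1.0:
            raise ValueError("frequencies must lie strictly between 0 and 1")
    odds_cases = carrier_freq_cases / (1.0 - carrier_freq_cases)
    odds_controls = carrier_freq_controls / (1.0 - carrier_freq_controls)
    return odds_cases / odds_controls


# Carrier frequencies as quoted in the text: 18 percent vs 1 percent.
print(round(crude_odds_ratio(0.18, 0.01), 1))  # 21.7
```

A large crude ratio such as this one does not distinguish between a causal variant and one in linkage disequilibrium with the causal lesion.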