96 CATALYZING INQUIRY
are unlikely to be separated by recombination that takes place during reproduction. Further, only a
relatively small number of haplotype patterns appear across portions of a chromosome in any given
population.^114 This discovery potentially simplifies the problem of associating SNPs with disease because
a much smaller number of “tag” SNPs (500,000 versus the estimated 10 million SNPs) might be
used as representative markers for blocks of variation in initial studies to find correlations between
parts of the genome and common diseases. In October 2002, the National Institutes of Health (NIH)
launched the effort to map haplotype patterns (the HapMap) across the human genome.
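The idea of a tag SNP can be illustrated with a small sketch. It assumes a toy formulation that is not taken from the HapMap project itself: within one block, a set of tag SNPs is any subset of sites whose alleles are enough to tell apart every haplotype pattern observed in that block. The function name `minimal_tag_snps` and the example haplotypes are hypothetical, and the brute-force search is only practical for short blocks.

```python
from itertools import combinations


def minimal_tag_snps(haplotypes):
    """Return a smallest tuple of site indices that distinguishes all
    distinct haplotype patterns (equal-length strings) in one block.

    Brute force over subsets of increasing size; illustrative only.
    """
    distinct = set(haplotypes)
    n_sites = len(haplotypes[0])
    for k in range(n_sites + 1):
        for sites in combinations(range(n_sites), k):
            # Project every haplotype onto the candidate tag sites.
            projected = {tuple(h[i] for i in sites) for h in distinct}
            if len(projected) == len(distinct):
                return sites
    return tuple(range(n_sites))


# Hypothetical block of five SNP sites with three observed patterns.
block = ["AACGT", "ATCGA", "GTTCA"]
tags = minimal_tag_snps(block)
```

In this toy block no single site separates all three patterns, but two sites do, so typing two SNPs instead of five would identify which pattern a chromosome carries.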
Developing a haplotype map requires determination of all of the possible tag SNP combinations
that are common in a population, and therefore relies on data from high-throughput screening of SNPs
from a large number of individuals. A difficulty is that a haplotype represents a specific group of SNPs
on a single chromosome. However, with the exception of gametes (sperm and egg), human cells contain
two copies of each chromosome (one inherited from each parent). High-throughput studies generally
do not permit the separate, parallel examination of each SNP site on both members of an individual’s
pair of chromosomes. SNP data obtained from individuals represent a combination of information
(referred to as the genotype) from both of an individual’s chromosomes. For example, genotyping an
individual for the presence of a particular SNP will result in two data values (e.g., A and T). Each value
is the allele found at that site on one of the two chromosomes, but the genotype alone does not indicate
which chromosome carries which allele. Only recently has it become possible to determine
experimentally the specific chromosomes to which A and T belong.^115
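The ambiguity described above can be made concrete with a short sketch. It assumes a hypothetical encoding in which a genotype is a list of unordered allele pairs, one per SNP site; the function name `compatible_haplotype_pairs` is illustrative only. Every heterozygous site can be assigned to the two chromosomes in two ways, so a genotype with k heterozygous sites (k at least 1) is consistent with 2^(k-1) distinct unordered haplotype pairs.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List every unordered pair of haplotypes consistent with a genotype.

    genotype: list of (allele, allele) tuples, one per SNP site.
    Returns a sorted list of (haplotype, haplotype) string pairs.
    """
    het_sites = [i for i, (a, b) in enumerate(genotype) if a != b]
    pairs = set()
    # Try both orientations at each heterozygous site.
    for flips in product([False, True], repeat=len(het_sites)):
        flip_at = dict(zip(het_sites, flips))
        h1, h2 = [], []
        for i, (a, b) in enumerate(genotype):
            if flip_at.get(i, False):
                a, b = b, a
            h1.append(a)
            h2.append(b)
        # Sorting the pair removes the (h1, h2) versus (h2, h1) duplicate.
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)


# Three sites, two of them heterozygous: two possible haplotype pairs.
example = [("A", "T"), ("C", "C"), ("G", "A")]
resolutions = compatible_haplotype_pairs(example)
```

For the example genotype the individual could carry either ACA with TCG or ACG with TCA, and genotype data alone cannot distinguish the two.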
There are two problems in creating a HapMap. The first is to extract haplotype information
computationally from genotype information for any individual. The second is to estimate haplotype
frequencies in a population. Although good approaches to the first problem are known,^116 the second
remains challenging. Algorithms based on expectation-maximization, Gibbs sampling, and
partition-ligation have been developed to tackle this second problem.
Some haplotype-inference programs rely on the concept of evolutionary coalescence or a perfect
phylogeny, that is, a rooted tree whose branches describe the evolutionary history of a set of sequences (or
haplotypes) in sample individuals. In this scenario, each sequence has a single ancestor in the previous
generation, on the presumption that the haplotype blocks have not been subject to recombination,
and it is taken as given that only one mutation has occurred at any one SNP site. Given a set of
genotypes, the algorithm attempts to find a set of haplotypes that fit a perfect phylogeny (i.e., could
have originated from a common ancestor). The performance of algorithms for haplotype prediction
generally improves as the number of individuals sampled and the number of SNPs included in the
analysis increase. Algorithm development in this area will continue to be a robust field of research in
the future as scientists and industry seek to associate genetic variation with common diseases.
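One piece of the perfect-phylogeny approach, checking whether a candidate set of haplotypes is compatible with such a tree, can be sketched briefly. The sketch assumes biallelic sites coded as '0' and '1' and uses the standard four-gamete condition: with no recombination and at most one mutation per site, no pair of sites can show all four allele combinations 00, 01, 10, and 11. The function name is hypothetical, and the harder step of choosing haplotypes from unphased genotypes is not shown.

```python
from itertools import combinations


def admits_perfect_phylogeny(haplotypes):
    """Four-gamete test on equal-length strings over '0' and '1'.

    Returns True when no pair of sites exhibits all four gametes, which
    is the condition for the haplotypes to fit a perfect phylogeny when
    the ancestral sequence is not fixed in advance.
    """
    n_sites = len(haplotypes[0]) if haplotypes else 0
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            return False  # needs recombination or a repeated mutation
    return True


consistent = admits_perfect_phylogeny(["000", "011", "110"])
conflicting = admits_perfect_phylogeny(["00", "01", "10", "11"])
```

A haplotype-inference program built on this model would, in effect, search among the possible resolutions of each genotype for a set that passes such a check.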
Direct haplotyping is also possible, and can circumvent many of the difficulties and ambiguities
encountered when a statistical approach is used.^117 For example, Ding and Cantor have developed a
technique that enables direct molecular haplotyping of several polymorphic markers separated by as
many as 24 kb.^118 The haplotype is directly determined by simultaneously genotyping several
polymorphic markers in the same reaction with a multiplex PCR and base extension reaction. This approach
does not rely on pedigree data and does not require previous amplification of the entire genomic region
containing the selected markers.
(^114) E.S. Lander, L.M. Linton, B. Birren, C. Nusbaum, M.C. Zody, J. Baldwin, et al., “Initial Sequencing and Analysis of the
Human Genome,” Nature 409(6822):860-921, 2001.
(^115) C. Ding and C.R. Cantor, “Direct Molecular Haplotyping of Long-range Genomic DNA with M1-PCR,” Proceedings of the
National Academy of Sciences 100(13):7449-7453, 2003.
(^116) See, for example, D. Gusfield, “Inference of Haplotypes from Samples of Diploid Populations: Complexity and
Algorithms,” Journal of Computational Biology 8(3):305-323, 2001.
(^117) J. Tost, O. Brandt, F. Boussicault, D. Derbala, C. Caloustian, D. Lechner, and I.G. Gut, “Molecular Haplotyping at High
Throughput,” Nucleic Acids Research 30(19):e96, 2002.
(^118) C. Ding and C.R. Cantor, “Direct Molecular Haplotyping of Long-range Genomic DNA with M1-PCR,” Proceedings of the
National Academy of Sciences 100(13):7449-7453, 2003.