HapBlock - The Dynamic Programming Algorithms for Haplotype Block Partitioning and Tag SNPs
Selection by Haplotype Data and Genotype Data
Program:
Our algorithms have been implemented in a program by C++, here are the pre-compiled executable files:
- The executable file for Unix operating system (Sun OS Release 5.6 Version).
- The executable file for Linux operating system.
- The executable file for Windows operating system (Windows 95,95, NT, 2000,XP).
After downloading the package, unzip it and copy all files into the same file folder.
A brief Help File (PDF format) for how to use this program can also be downloaded.
NEW FETURES of Program:
- Three Dynamic Programming Algorithms:
- A dynamic programming algorithm to find a block partition with minimum number of tag SNPs.
- A paramatric dynamic programming algorithm for block partitioning with a fixed genome coverage using the
minimum number of tag SNPs.
- A discrete dynamic programming algorithm for block partitioning with a fixed number of tag SNPs that can cover
maximum length of genome.
- New types of data can be processed:
- Haplotype data.
- Genotype data from unrelated individuals.
- Genotype data from general pedigrees.
- Three different haplotype block definitions:
- The coverage of common haplotypes.
- LD based block.
- No historical recombination
- A Set of Tag SNP Selection Options:
- Common haplotype.
- Haplotype diversity.
- Haplotype entropy.
- Haplotype determination coefficient.
- LD measure r^2.
- Common Haplotype with missing data.
Attention: This program is free for academic use.
The commercial users should contact Dr. Sun before using this program.
The part of program is produced with collaboration with Steve Qin and Jun Liu in the
department of statistics at Harvard University.
The souce code of program has not been provided and may be available upon request.
Copyright © 2003 The University of Southern California. All RIGHTS RESERVED.
Test Data Sets:
The haplotype data is simulated by Coalesence Process with recombination implemented in the lab of
Richard Hudson. The following data sets are used
in our testing and can be used for exploring our program:
- A haplotype data file - TestHap.dat, which contains 80 haplotypes consisting of 234 SNPs.
- A genotype data file - TestGenoInd.dat, which contains 40 individuals and is generated by randomly pairing up haplotypes.
- A genotype data file - TestGenoPed.dat, which contains 20 trios nad is generated from haplotype data.
- A file for all the possible blocks and their number of tag SNPs - TestBT.dat, which is generated by our program.
- A map file for all the SNPs - TestSNPSimuPos.dat, in which the genomic position of each SNPs is listed.
- A map file for all the SNPs - TestSNPIndexPos.dat, in which the position of each SNPs is consecutively coded from 1 to the number of SNPs.
- A file for pre-selected Tag SNPs - TestPreSNP.dat.
- A file for the name of all SNPs - TestSNPName.dat.
Results: Blocks, Tag SNPs and Haplotype Patterns
We test our program based on aforementioned data using a number of different setting of parameters. In the following, we list the parameter file,
the corresponding output files and the short explainations for those parameters and results. For the contents and format
of these files and the meaning of each parameter in the parameter file, please refer to our
help file (PDF format).
As for illustration, the following paramter files share many common parameters (Attention: since several options have been
implenmented in the new version of program, the format parameter file is sligly different from the previous one. Please refer to the
help file for details.). In all examples, the haplotypes with frequency
greater than 4.99% are considered as common haplotypes.
We also set the maximum number of samples, the maximum number
of SNPs, the maximum length of a block as 100, 250 and 100, respevtively.
- The dynamic programming algorithm is used to find a block partition with minimum number
of tag SNPs on the basis of haplotype data. A set of consecutive SNPs with size one or more are defined as a block only if
the percentage of common haplotypes is more than 80%. The tag SNPs are a minimum set of SNPs that can distinguish
a set of common haplotypes that can acocunt for at least 80% of all unambiguous haplotypes.
The program will output all intermediate results, all possible sets of tag SNPs and haplotype patterns. The tag SNPs are labeld
by their names provided from file TestSNPName.dat. 100 permutations are performed and the statistics
of interest are stored in a file. Please find the setting of parameters and results from the
following files:
- The dynamic programming algorithm is used to find a block partition with minimum number
of tag SNPs on the basis of genotype data from unrelated individuals. Here, the haplotype pairs assigned to each individual are used.
A set of consecutive SNPs with one or more are defined as block only 80% of all pair-wise LD greater than 0.70.
The tag SNPs are a minimum set of SNPs that can distinguish all common haplotypes within a block. The program will output all intermediate results,
all possible sets of tag SNPs and haplotype patterns. The tag SNPs are labeld
by their names provided from file TestSNPName.dat.
No permutations are performed. Please find the setting of parameters and results from the
following files:
- The dynamic programming algorithm is used to find a block partition with minimum number
of tag SNPs on the basis of genotype data from pedigrees.
A set of consecutive SNPs with size one or more are defined as a block only if
the percentage of common haplotypes is more than 80%.
The tag SNPs are a minimum set of SNPs that can distinguish account at least 90% of overall haplotype diversity
A set of pre-selected tag SNPs is included in the analysis and pre-stored in the file
TestPreSNP.dat , The program will output all intermediate results,
all possible sets of tag SNPs and haplotype patterns. The tag SNPs are labeld
by their indexes (starting from 0).
No permutations are performed. Please find the setting of parameters and results from the
following files:
- The parametric dynamic programming algorithm is used to find a block partition
with minimum number of tag SNPs that can cover 60% of genome on the basis of gneotype data from unrelated
individuals. The haplotype frequencies estimated from the PL-EM algorithm is used.
A set of consecutive SNPs with size one or more are defined as a block only if
the percentage of common haplotypes is more than 80%. The tag SNPs are a minimum set of SNPs that
can distinguish account at least 80% of overall haplotype entropy.
The program will output all intermediate results, all possible sets of tag SNPs and haplotype patterns.
The tag SNPs are labeld by their names provided from file TestSNPName.dat.
The position of a SNP is defined in the file TestSNPSimuPos.dat.
Please find the setting of parameters and results from the
following files:
- The two-dimension dynamic programming algorithm is used to find block partition
with 40 tag SNPs that can cover maximum fraction of genome on the basis of
all the possible blocks and the corresponding number of tag SNPs.
The blocks and the number of tag SNPs are in the file TestBT.dat,
which is extracted from another file TestBlockRes-001.dat.
The position of a SNP is defined as its index and stored in a
file.
Please find the setting of parameters and results from the
following files:
Old Programs:
If you would like to use the old program (only deal with haplotye data) and browse the supplemental materials of our paper
in Proc. Natl. Acad. Sci. USA 99: 7335-7339 , please go to
here.
If you would like to use another old program that can deal with genotype data from unrelated individuals
but only has a few options for block pratitioning and tag SNP selection, please go to
here.
Created Date: March 20, 2003
Last Updated Date: July 15, 2004
We are planning to incorporate more methods in our program. You are welcome to provide new methods
that you want us to implement into this program. We greatly appreciate if you could point out any bugs
when you use our program. Our contact information is:
Fengzhu Sun, Ph.D.
Program in Molecular and Computational Biology
Department of Biological Sciences
University of Southern California
Los Angeles, CA, 90089
(213) 740-2413 (phone)
(213) 740-2424 (fax)
Email: fsun@hto.usc.edu