Please note that the format of the compu* files has changed, so setup files
from older versions of MORGAN will not work.

This README file contains:
  A. To install the ibd_haplo program within your MORGAN-3
  B. Running the program on the Gold test examples
  C. Running the program on your test examples
  D. Details of input files for examples
  E. Details of output files for examples
  F. A note on the python script pairwise.py

-----------------------------------------------------
A. To install the ibd_haplo program within your MORGAN-3:
-----------------------------------------------------

0) Download and install your MORGAN-3.
   In the main MORGAN-3 directory you will say
        make morgan.gcc.dbg
   (Of course, you may use any of the morgan make options, but I always
   use this one.)

1) Untar the ibdhap_prog.tar.gz file within your main MORGAN-3 directory.
   This will create a subdirectory IBD_Haplo, which contains a Makefile
   and four *.c source code files.

2) cd into the IBD_Haplo subdirectory; then
        make ibd_haplo.gcc.dbg
   Note 1: it is probably advisable to use the same make option as in step 0.
   Note 2: you will likely get a lot of warnings of the form
        ../Makefile.progs:188: warning: ignoring old commands for target `.cc.dml'
      You can ignore these; they come about because we have only one main
      program in this subdirectory.

3) To remove the executable either simply
        rm ibd_haplo
   or
        make myclean
   Note: You MUST do this if for any reason you remake the rest of your
      MORGAN-3. The general "make morgan" commands will not clean and remake
      the ibd_haplo program, so library links will be incorrect.
      That is, if you redo step 0, you MUST remove the ibd_haplo executable
      and redo step 2.

--------------------------------------------------------------
B. Running the Gold test examples

Similarly to other MORGAN programs, the IBD_Haplo directory includes a Gold
subdirectory, which is where this README file is located.

1.
   There are currently 4 examples:
     (i)   for 4 sets of 4 haplotypes:                        ibd_haplo_gold
     (ii)  for 4 pairs of individuals with genotypic data:    ibd_genos_gold
     (iii) same as (ii) but data input as haplotypes:         ibd_hapgen_gold
     (iv)  for 4 pairs of individuals with partially phased
           genotypic data:                                    ibd_pphas_gold
           (there are currently no gold output files for this example,
           but the parameter files are provided)

2. As with other MORGAN programs, these are more easily run using the "make"
   command, e.g.
        make ibd_haplo_gold
   but to prepare for your own examples you may prefer to run them directly:
        ../ibd_haplo ibd_haplo.par > ibd_haplo.out
   or
        ../ibd_haplo ibd_genos.par > ibd_genos.out

3. In this case, make sure you have cleaned out old output files first:
   see the Makefile, and/or say "make help" for more info.

---------------------------------------------------------------
C. Running the program on your test examples:

To run your own examples you may prefer to have your own directory.
The easiest way is to set up any directory, put your input and parameter
files (see section D) there, and then

1. Make a (soft) link to the ibd_haplo executable: for example
        ln -s /castor/thompson/MORGAN_3_Feb09_CVS/IBD_Haplo/ibd_haplo ibd_haplo
   Note 1: Of course, you will use the full pathname of where your ibd_haplo
      program is.
   Note 2: If
        ls -l ibd_haplo
      shows there is already an ibd_haplo link, you should probably unlink it
      first:
        unlink ibd_haplo
   Note 3: If preferred, the link may go in your bin directory, or anywhere
      else your system will look for commands, as you or your system
      administrator prefer.

2. As with other MORGAN programs, the general format is
        progname parfile > outfile
   So here, for example, one might say
        ./ibd_haplo ibd_haplo.par > ibd_haplo.out

3. You will find you have generated two new output files, e.g.
        qibd_h.out, ibd_haplo.out
   These are described in detail below in section E.

4. Before rerunning, remove or move (if you want to keep them) these two
   output files!!
   In the case of "ibd_haplo.out", the program probably (depending on your
   setup) will not run if that file already exists.
   In the case of the "qibd_h.out" file, the next output will be appended to
   the file, and this (large) file gets ever larger and more confusing!!

----------------------------------------------------------------------
D. Details of input files for examples

1. First look at the MORGAN parameter file in the Gold subdirectory
        ibd_haplo.par
   It just gives the names of two files:
        an input file    ibdhap_input_files
        an output file   qibd_h.out    -- which we met above.
   Note: clearly you could change these names for your examples.
   (The ibd_genos.par and ibd_hapgen.par files give parallel examples, but we
   do not describe the details.)

2. Now look at the MORGAN extra input file
        ibdhap_input_files
   It consists only of three (optionally four) more filenames:
        compu_4haps.dat
        haplotypes.dat
        chr07.markers
        (phasing.txt)    -- only in case of partially phased genotypes
   which we now describe.
   Note: again, you could change these names for your examples.

3. chr07.markers
   This file contains the marker information, in a very similar way to other
   MORGAN files, but without the explicit MORGAN parameter statements.
   The data are for 2132 SNP markers on "chr07".
   First the 2132 chromosomal positions of the SNPs are listed. For
   convenience they are here in 213 lines of 10, with two extra, but that
   layout is not required.
   These are sex-averaged cM positions; only differences (cM distances) are
   used. If your position info is in bp, a rough translation is given by
   dividing by 10^6.
   Then the allele frequencies of each SNP are given, for markers 1 to 2132,
   in order (the integer count is for convenience; it is ignored by the
   program). Again these are put one per line for convenience. This is not
   required, but having them in the correct order is!!

4. haplotypes.dat
   This file contains the haplotypic data for the haplotypes or genotypes to
   be analyzed.
   This particular data set consists of 16 haplotypes, each with an integer
   "name". The name is followed by 2132 alleles making up the haplotype.
   1 and 2 are the SNP alleles, and 0 denotes missing.
   For genotypic data the format would be the same, but there would be
   2132*2 = 4264 alleles following each "name". The two alleles making up a
   genotype can be entered in either order ("1 2" or "2 1"). The program
   makes no assumptions about phase when analyzing genotypic data.
   Here the file has all the data for one haplotype/genotype on one line,
   because this is how the script produced it. Again, lines may be cut for
   convenience if desired.

5. compu_4haps.dat
   This is the complicated file that tells the program what to do!!
   The comment in the compu_4haps.dat file reads
        # [# of states] [# of allelic phenotypes] [data input as genotypic] [analysis to be done as genotypic]
        # [# of sets of haplotypes] [# of haplotypes in a set] [total chromosome length]
        # [total # of markers] [ffkin] [ffrate] [delta]
        #
   (a) This particular file "compu_4haps.dat" reads:
        15 16 0 0
        4 4 192.30
        2132 0.15 0.1 0.2
   (b) A similar example for genotypic data would read:
        9 9 1 1
        4 4 192.30
        2132 0.1 0.1 0.2
   (c) An analysis of partially phased genotypic data would read:
        15 16 0 2
        4 4 192.30
        2132 0.15 0.1 0.2

   Line one:
        # [# of states] [# of allelic phenotypes] [data input as genotypic] [analysis to be done as genotypic]
   Example (a): For 4 haplotypes there are 15 ibd states, and 16 phenotypic
      data configurations at each SNP (not counting missing data): i.e. each
      SNP can be allele 1 or 2 on each of the 4 ordered haplotypes.
   Example (b): For a pair of genotypes there are 9 ibd states, and 9 data
      configurations: each individual can be 1 1, 1 2 or 2 2.
   Example (c): For analysis of partially phased data, we model 15 underlying
      haplotypic states, although we may not be able to distinguish between
      some states if the data are unphased.
      The last two fields indicate that the data are formatted as haplotypic
      data (0) but should be analyzed as partially phased data (2). These
      parameters must have these values for partially phased data.

   Note: the format of the input data is now separated from the
   interpretation for analysis. That is, data may be put in either as
   haplotypic (a line for each haplotype) or genotypic (a line for each
   individual), and then analyzed either as genotypic data or as haplotypic
   data.
   If haplotype data are to be analyzed as genotypic, the phasing is ignored.
   If genotypic data are to be analyzed as haplotypic, the first [second]
   allele of each pair is assumed to constitute the first [second] haplotype.
   (Currently, partially phased data can only be input as haplotypic; future
   versions will do the analysis on either form of input.)

   Line two:
        # [# of sets of haplotypes] [# of haplotypes in a set] [total chromosome length]
   Examples (a),(c): 4 sets of 4 haplotypes to be analyzed.
   Example (b): same, but this time the program will take each successive
      pair of alleles and interpret it as an unphased genotype.
   Total chromosome length is given in centimorgans.

   Line three:
        # [total # of markers] [ffkin] [ffrate] [delta]
   IMPORTANT:
   ffkin is the prior probability of IBD. 0.15 is VERY high, unless you know
      you have a lot of IBD.
   ffrate is the rate-of-change parameter for IBD (0.1 in these examples).
      This is the total change rate per cM; approximately it is the inverse
      of the ibd length between any pair of haplotypes, but where there are
      >2 haplotypes, the length in a given ibd state will be shorter.
   delta is a parameter that modifies the transition matrix to allow for
      ancestral shared junctions.

6. phasing.txt
   This is an optional file for when the genotypes are only partially phased.
   Each line starts with the id numbers for the first "haplotype" in each
   pair, then has a 1 or a 0 for each locus on the haplotype.
   1 indicates that the pair of genotypes has been phased into four
   haplotypes at that locus; 0 indicates that the genotypes are unphased,
   and the analysis should treat the phase as unknown.

----------------------------------------------------------------------
E. Details of output files for Gold examples

As we have seen in section C, there are two output files, one specified in
the parameter file (e.g. qibd_h.out) and the other as standard output in your
command line (e.g. ibd_haplo.out).

"qibd_h.out": this is the core output, which can then be processed further
   (e.g. in R).
   Each line is for one marker:
        the marker number, 1,2,3,...
        the marker position, in cM, as originally input
        and, in the current example, 15 additional probabilities summing
        (hopefully) to 1.
   IMPORTANT: these are the probabilities, under the given model and
   conditional on the data, of each of the 15 states of ibd among the four
   haplotypes.
   The ordering of the 15 states is
        1111, 1122, 1112, 1121, 1123, 1211, 1222, 1233,
        1212, 1221, 1213, 1231, 1223, 1232, 1234
   Note:
   for genotypic analyses there will be 9 state probs (11 columns):
      we have the same 15 latent states, but genotypically equivalent ones
      are combined. The order is the same:
        1111, 1122, 1112+1121, 1123, 1211+1222, 1233,
        1212+1221, 1213+1231+1223+1232, 1234
   for pairwise haplotype analyses there will be 2 state probs (4 columns):
      of the two state probs, the first is ibd and the second is non-ibd.

"ibd_haplo.out": standard output file.
   As with many MORGAN programs this is simply a summary of what has been
   read in and what is processed, mainly so the user can check that all is
   as expected:
        first, all the various parameters of the run are printed;
        then the genetic data, with the first 10 alleles only (for checking).
   Next the equilibrium state probabilities under the provided ffkin value,
   and the latent-ibd-process transition matrix under the given parameters,
   are printed for a 1 cM distance.
   This is repeated (unnecessarily!!) for each set of haplotypes processed.
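
For checking a qibd_h.out file outside R, the column layout described above
can be read with a few lines of Python. This is only a sketch under
assumptions: it assumes whitespace-separated fields with no header or
separator lines, the function names are made up for illustration, and the
sample line is invented rather than taken from a gold file. It checks that
each row of probabilities sums to about 1, and sums the 15 haplotypic state
probabilities into the 9 genotypic groups in the order listed above.

```python
# Order of the 15 latent ibd states, as listed in this README.
STATES_15 = ["1111", "1122", "1112", "1121", "1123",
             "1211", "1222", "1233", "1212", "1221",
             "1213", "1231", "1223", "1232", "1234"]

# The 9 genotypically distinguishable groups, in the README's order.
GROUPS_9 = [["1111"], ["1122"], ["1112", "1121"], ["1123"],
            ["1211", "1222"], ["1233"], ["1212", "1221"],
            ["1213", "1231", "1223", "1232"], ["1234"]]


def parse_line(line):
    """Split one line into (marker number, position in cM, probabilities)."""
    fields = line.split()
    return int(fields[0]), float(fields[1]), [float(x) for x in fields[2:]]


def read_qibd(lines, n_states=15, tol=1e-3):
    """Parse non-blank lines; raise ValueError on a malformed row.

    Assumption: every non-blank line is a data row (no headers).
    """
    rows = []
    for line in lines:
        if not line.strip():
            continue
        marker, pos, probs = parse_line(line)
        if len(probs) != n_states:
            raise ValueError("marker %d: expected %d probabilities, got %d"
                             % (marker, n_states, len(probs)))
        if abs(sum(probs) - 1.0) > tol:
            raise ValueError("marker %d: probabilities sum to %g"
                             % (marker, sum(probs)))
        rows.append((marker, pos, probs))
    return rows


def collapse_to_9(probs15):
    """Sum 15 haplotypic state probabilities into the 9 genotypic groups."""
    by_state = dict(zip(STATES_15, probs15))
    return [sum(by_state[s] for s in group) for group in GROUPS_9]


# Invented line for illustration only; not real program output.
SAMPLE = ("1 0.000 0.40 0.10 0.05 0.05 0.05 0.05 0.05 0.05 "
          "0.04 0.04 0.03 0.03 0.02 0.02 0.02")

if __name__ == "__main__":
    for marker, pos, probs in read_qibd([SAMPLE]):
        print(marker, pos, [round(p, 3) for p in collapse_to_9(probs)])
```

To use it on a real file, pass an open file object instead of the sample
list, e.g. read_qibd(open("qibd_h.out")). Because the program appends to
qibd_h.out, a file left over from several runs will contain several runs'
rows one after another.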
The other two gold standards, ibd_genos_gold and ibd_hapgen_gold, each have
a similar pair of output files.

---------------------------------------------------------------------------

There is a new option in ibd_haplo, to include data that are partially
haplotypic and partially genotypic; see the ibd_pphas_gold example above.
Gold standard output files for this option remain to be added.

----------------------------------------------------------------------
F. The python script pairwise.py

For details see the comments in this script file.
Basically the script sets up the data files for an IBD_Haplo analysis of all
pairs of individuals from a set of individuals.

----------------------------------------------------------------------
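
To give an idea of what "all pairs of individuals" means in practice, here
is a small Python sketch. It is NOT the code of pairwise.py, and the
individual ids are hypothetical; it only shows how the unordered pairs of a
set of individuals can be enumerated, and that n individuals give n(n-1)/2
pairs.

```python
from itertools import combinations


def all_pairs(ids):
    """Return every unordered pair of distinct individuals, in input order."""
    return list(combinations(ids, 2))


# Hypothetical individual ids, for illustration only.
individuals = [101, 102, 103, 104]
pairs = all_pairs(individuals)

if __name__ == "__main__":
    for first, second in pairs:
        print(first, second)
```

The number of pairs grows quadratically with the number of individuals, so
the input and output files for an all-pairs analysis can become large.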