12. Population-based inference of IBD

See Concept Index for: population-based IBD inference.

12.1 Introduction to ibd_create and ibd_haplo

See References, for details of the cited papers.

The program ibd_create is a suite of seven subprograms that together provide a set of tools for the creation of haplotypes and ibd specifications, including ibd graphs (See The ibd_class utility). These programs produce realistic simulated data for use in testing analysis programs such as ibd_haplo. Similarly to the ibd_class utility (See The ibd_class utility), ibd_create calls each of its seven subprograms through a command line option. This option also determines which MORGAN parameter statements will be recognized and how they will be interpreted.

All the examples for ibd_create and ibd_haplo are based on the current gold standards. The marker data for these gold standards have been updated to the publicly available European samples of the 1000 Genomes project. These provide 758 phased and imputed haplotypes based on the GBR, FIN, IBS, CEU and TSI subpopulation samples. The phased haplotypes were downloaded from the Browning website, http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes.phase1_release_v3/ while allele frequency and map position information were obtained from the original 1000 genomes vcf files: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ (See http://mathgen.stats.ox.ac.uk/impute/README_1000G_phase1integrated_v3.txt for additional information.) A BEAGLE DAG model was constructed for these haplotypes by running BEAGLE with a scale parameter 1.0. All files with the partial name "eur_recode_s1" derive from this data set and DAG model.

For greater clarity we here divide the programs into three subgroups. The first two subprograms are:

beaglesim: A program written by Chris Glazner to simulate independent realizations of haplotypes out of a BEAGLE DAG [Browning06]. The input to BEAGLE is a set of real SNP haplotypes from a population sample. The goal of beaglesim is to realize haplotypes on this set of real SNPs, at known real SNP locations, with the sample SNP allele frequencies, and with the LD structure as fit by BEAGLE to the original sample.
- beaglesim also allows an “LD relaxation parameter” to allow haplotypes to be generated from the same DAG but at varying LD levels.
- However, for purposes of realistic simulation studies, it is recommended that BEAGLE be run with the low scale parameter 1.0 (not the value 4.0 which is the default for the BEAGLE 3.3.1 option that produces the DAG), and that beaglesim then be run without additional LD tuning (i.e. the LD relaxation probability parameter set equal to 0).
- beaglesim can take and produce data with either character (A,C,G,T) or numeric (1=reference allele, 2=alternate allele) allele labels. For downstream use in other MORGAN programs, it is more convenient to convert to numeric allele labels.
- Raw data haplotypes are typically coded with a row for each SNP, and a column for each haplotype. beaglesim generates realizations by haplotype, but can output these haplotypes either as rows or as columns or in MORGAN marker data genotypic format (See ‘marker data’ in Concept Index). Care must be taken in ensuring the correct row/column specification for downstream use in other MORGAN programs, or other software.
For additional information about beaglesim see [BGZT12].
beagledag: This is a prototype program to parse and compute local haplotype probabilities given a BEAGLE DAG model. In its current form, it provides a list of haplotypes and their DAG probabilities for all feasible local haplotypes between two specified markers. Some of the routines of this program will be useful in developing fast methods to compute the DAG probabilities of specific local haplotypes. These probabilities could be used to weight realized IBD segments inferred using gl_auto or ibd_haplo. This would provide an adjustment for LD when this inferred ibd is used in gl_lods to produce Monte Carlo lod score estimates (See Parameter files for the gl_lods program). (See The ibd_class utility for more information on ibd graphs).

The next subprogram of ibd_create is:

simpop_fgl: This is a version of a program originally written by Chris Glazner, and modified by Fiona Grimson. It simulates both a population pedigree, and the crossover process on a chromosome. With MORGAN version 3.3.2 the program is further modified, so that the cross-over process is in terms of (micro)-centiMorgans rather than base pairs. The program produces an ibd graph of current individuals in terms of a compact specification of the FGL segments of their maternal and paternal gametes (See The ibd_class utility for more information on ibd graphs).
- A population of diploid size N is simulated, with N/2 males and N/2 females in each generation. For each of N times, a random male and random female is chosen, and the couple produces a son and a daughter.
- A modification due to Fiona Grimson reduces the numbers of half-sibs and multiple matings. A parameter m defines the remating probability. If a selected individual has been selected k times previously as a parent, he/she is accepted again with probability m to the power k. With m=0.3, this gives an effective population size close to the census size: the greater variance of offspring number caused by families of size 2 is compensated by the less-than-Poisson variation in the number of matings of a parent.
- In any meiosis, crossovers between the two parental gametes are generated at rate 1 per Morgan (100 million micro-centiMorgans) across the chromosome, and these crossover events are used in generating the FGL segments of the offspring gamete.
simped_fgl: This subprogram is no longer in ibd_create. It is replaced in MORGAN V3.4 by a new version of ibddrop; See Introduction to ibddrop. The new ibddrop not only simulates descent at a finite number of linked markers, but can now also simulate the recombination breakpoints across the chromosome.
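
The remating rule described for simpop_fgl above can be sketched in a few lines of Python. This is an illustration of the acceptance rule only (accept a candidate parent with probability m to the power k), not the simpop_fgl source; the function names, the seed and the individual labels are hypothetical.

```python
import random

def pick_parent(candidates, times_chosen, m, rng):
    """Draw a candidate uniformly; accept with probability m**k,
    where k is the number of times it was already selected."""
    while True:
        cand = rng.choice(candidates)
        k = times_chosen.get(cand, 0)
        if rng.random() < m ** k:        # k = 0 gives probability 1
            times_chosen[cand] = k + 1
            return cand

def simulate_matings(n_per_sex, n_matings, m, seed=53):
    """Return a list of (father, mother) couples for one generation.
    Assumption: the same rule is applied independently to each sex."""
    rng = random.Random(seed)
    males = [f"M{i}" for i in range(n_per_sex)]
    females = [f"F{i}" for i in range(n_per_sex)]
    male_count, female_count = {}, {}
    return [(pick_parent(males, male_count, m, rng),
             pick_parent(females, female_count, m, rng))
            for _ in range(n_matings)]

if __name__ == "__main__":
    couples = simulate_matings(n_per_sex=19, n_matings=19, m=0.3)
    print(couples[:3])
```

With m=1 every draw is accepted, so parents are chosen with replacement; with m=0 no individual can be a parent twice, so the number of matings must not exceed the number of candidates of each sex or the loop will not terminate.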

The final three subprograms of ibd_create are:

fgl2ibd: An initial version of this program for sets of four gametes was first written in C by Fiona Grimson. It was later generalized, initially to sets of 6 gametes and then (in principle) to any number. The program has three inputs:
1. The key input is an ibd graph for a set of gametes specified in compressed FGL form; with MORGAN 3.3.2 the recombination breakpoints of the graph are in micro-centiMorgans.
2. Also input is a specified set of SNP marker locations. These may be input either
  - on the cM scale via standard MORGAN map statements, or
  - on the base-pair (bp) scale.
  In the latter case, a parameter statement is provided to convert from the bp scale to the cM scale of other MORGAN programs.
3. The MORGAN program set proband gametes statements are used to define the gametes among which ibd is scored.
The output is ibd states for the specified gametes, at the specified marker locations: this output is in a variety of formats specified more fully below.
fgl2haplo: This program assigns haplotypes to individuals on the basis of their location-specific IBD as specified by the FGLs of the ibd graph. The haplotypes are given at specified SNP markers. The data generated are typically used in testing analysis methods. The program has been updated, so that repeated runs use different random assignments of founder haplotypes to the FGL.
- The inputs to the program include an ibd graph in compact FGL form and a map of locations of selected markers at which haplotypes are to be scored.
- Also input is a set of haplotypes to be assigned to these FGLs. The haplotypes are typically either from a public data base or generated by beaglesim from a BEAGLE DAG that was fit to such haplotypes. Each FGL is assigned a unique haplotype. To permit multiple realizations on a single IBD structure, the set of haplotypes is randomly permuted before the assignment is made.
- The output of the program is the haplotypes of the individuals of the ibd graph, constructed by assigning segments of the input haplotypes to the individuals in accordance with the defined FGL structure.
- The output haplotypes may be as rows or as columns, or in MORGAN marker data genotypic format.
fgl2dgl: This program takes as input the ibd graph in the compact FGL form with distances in micro-centiMorgans, as produced by simpop_fgl. Its output is an ibd graph in the format produced by the gl_auto program, which can therefore be input into gl_lods; See Introduction to lm_auto gl_auto and lm_pval.
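
The core idea shared by fgl2ibd and fgl2haplo, reading off the FGL carried by each gamete at each marker location, can be sketched as follows. The segment representation used here (a sorted list of `(start_position, fgl)` pairs per gamete) is a hypothetical stand-in for the compact FGL format, which is not reproduced in this section, and all function names are illustrative.

```python
import random
from bisect import bisect_right

def fgl_at(segments, pos):
    """FGL carried at map position pos; segments is a list of
    (start_position, fgl) pairs sorted by start, the first starting at 0."""
    starts = [s for s, _ in segments]
    return segments[bisect_right(starts, pos) - 1][1]

def ibd_state(fgls):
    """Label gametes by order of first appearance of their FGL:
    gametes sharing a label are ibd at this marker."""
    labels = {}
    return tuple(labels.setdefault(f, len(labels)) for f in fgls)

def assign_founder_haplotypes(fgl_ids, founder_haps, rng):
    """Randomly permute the haplotypes, then give each FGL a unique one."""
    haps = list(founder_haps)
    if len(haps) < len(fgl_ids):
        raise ValueError("need at least one haplotype per FGL")
    rng.shuffle(haps)
    return dict(zip(sorted(fgl_ids), haps))

def gamete_haplotype(segments, marker_pos, fgl_to_hap):
    """Copy, marker by marker, the allele of the founder haplotype
    assigned to the FGL present at that marker."""
    return [fgl_to_hap[fgl_at(segments, pos)][j]
            for j, pos in enumerate(marker_pos)]

if __name__ == "__main__":
    markers = [10.0, 40.0, 60.0, 100.0]
    g1 = [(0.0, 1), (50.0, 2)]          # FGL 1 up to 50, then FGL 2
    g2 = [(0.0, 2)]                     # FGL 2 throughout
    print([ibd_state([fgl_at(g1, p), fgl_at(g2, p)]) for p in markers])
    haps = assign_founder_haplotypes({1, 2}, [[1, 1, 1, 1], [2, 2, 2, 2]],
                                     random.Random(53))
    print(gamete_haplotype(g1, markers, haps))
```

Shuffling before assignment mirrors the description above: repeated runs with different seeds give different realizations of marker data on the same ibd structure.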

The program ibd_haplo computes conditional probabilities of gene IBD (identity by descent) states, given data for marker loci for specified sets of proband gametes (i.e. scoresets). The proband gametes in each scoreset are specified gametes (maternal or paternal) of specified individuals. These are input as for the lm_auto program: See Sample lm_auto parameter file.

The program has been generalized to allow for sets of up to ten gametes, although computational limitations suggest that considering more than seven gametes jointly is impractical. Internally, IBD states are the gametic states (that is, 15 states for 4 gametes), and states are ordered lexicographically, although the “traditional” Jacquard ordering may be requested for sets of four gametes.
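
The count of 15 states for 4 gametes is the number of ways to partition 4 gametes into ibd groups (a Bell number), which also shows why large scoresets become impractical. The sketch below enumerates partitions as label tuples in one natural lexicographic order; this labelling is an assumption for illustration, and the state labels actually written by ibd_haplo may differ.

```python
def ibd_states(n):
    """Yield every ibd partition of n gametes as a tuple of group labels.
    Gamete 0 always has label 0; each later gamete either joins an
    existing group or opens the next unused label. Tuples are produced
    in lexicographic order."""
    def extend(prefix, max_label):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        for label in range(max_label + 2):
            yield from extend(prefix + [label], max(max_label, label))
    yield from extend([0], 0)

if __name__ == "__main__":
    for n in (2, 3, 4, 5, 6, 7):
        print(n, sum(1 for _ in ibd_states(n)))
    four = list(ibd_states(4))
    print(four[0], four[-1])   # all four gametes ibd ... no two gametes ibd
```

The number of states grows quickly with the number of gametes, and an HMM transition matrix has the square of that number of entries.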

The marker data are read in using a standard MORGAN marker data file; See Sample lm_auto parameter file. Each individual named in the marker data must have a unique name. There may be missing data, but each single-locus genotype of an individual must be either present or absent (" 0 0"); presence of a single allele cannot be specified. The marker data are read in as genotypes, but they may be analyzed as an ordered pair of alleles (i.e. phased), and must be so if only a single gamete of any individual is specified in the scoreset.

The program uses an HMM for the latent IBD states. There are two options for the transition matrix of the latent IBD state; the one applicable to any number of gametes is the ‘2011’ matrix developed by Chaozhi Zheng [BGZT12]. Given the latent state, the locus-specific genotype probabilities are based on the premise that IBD DNA should be of the same allelic type, and that non-IBD DNA is of independent allelic types, although allowance is made for typing error to eliminate zero emission probabilities. The transition matrices are also modified to eliminate zero transition probabilities. A forward-backward HMM computation provides the probabilities for each IBD state at each locus for each set of proband gametes.
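
To make the forward-backward idea concrete, here is a deliberately simplified sketch for a single pair of phased gametes with two latent states (non-IBD, IBD). It is not the ‘2011’ or ‘2009’ transition matrix and not the ibd_haplo emission model: the two-state transition form, the error mixture, and the parameter names `beta`, `rate` and `err` are all assumptions chosen for illustration.

```python
import math

def emission(a, b, freq, err):
    """[P(a,b | non-IBD), P(a,b | IBD)]; freq maps allele -> frequency.
    Assumed error model: with probability err the IBD pair behaves as
    independent alleles, so no emission probability is zero."""
    indep = freq[a] * freq[b]
    ibd = (1.0 - err) * (freq[a] if a == b else 0.0) + err * indep
    return [indep, ibd]

def transition(d, beta, rate):
    """Two-state matrix over map distance d with stationary (1-beta, beta)."""
    stay = math.exp(-rate * d)
    pi = [1.0 - beta, beta]
    return [[stay * (i == j) + (1.0 - stay) * pi[j] for j in range(2)]
            for i in range(2)]

def normalise(v):
    total = sum(v)
    return [x / total for x in v]

def forward_backward(hap1, hap2, pos, freqs, beta=0.05, rate=0.05, err=0.01):
    """Posterior [P(non-IBD), P(IBD)] at each marker given both haplotypes."""
    n = len(pos)
    em = [emission(hap1[t], hap2[t], freqs[t], err) for t in range(n)]
    tr = [None] + [transition(pos[t] - pos[t - 1], beta, rate)
                   for t in range(1, n)]
    # Forward pass, rescaled at each marker to avoid underflow.
    fwd = [normalise([(1.0 - beta) * em[0][0], beta * em[0][1]])]
    for t in range(1, n):
        fwd.append(normalise(
            [em[t][j] * sum(fwd[-1][i] * tr[t][i][j] for i in range(2))
             for j in range(2)]))
    # Backward pass.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        bwd[t] = normalise(
            [sum(tr[t + 1][i][j] * em[t + 1][j] * bwd[t + 1][j]
                 for j in range(2)) for i in range(2)])
    return [normalise([fwd[t][s] * bwd[t][s] for s in range(2)])
            for t in range(n)]

if __name__ == "__main__":
    n = 20
    pos = [0.1 * t for t in range(n)]
    freqs = [{1: 0.5, 2: 0.5}] * n
    same = forward_backward([1] * n, [1] * n, pos, freqs)
    print(round(same[n // 2][1], 4))
```

ibd_haplo performs the analogous computation over all IBD states of the scoreset, with its own transition and emission models.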

For the case when the scoreset consists of both maternal and paternal gametes of a set of (normally two) individuals, additional options are available. For two individuals there are 15 IBD states, although only 9 are distinguishable from unphased genotypes. In this case there are two transition matrices available; the earlier ‘2009’ matrix of [Tho08b] is also an option. The input genotypes may be analyzed as an ordered or unordered pair of alleles (i.e. phased or unphased). There is also an option for partial phasing, in which segments of chromosome (sets of contiguous markers) are specified as phased. If the data are analyzed as unphased or partially phased, the output state probabilities are of the genotypically distinguishable state classes (i.e. 9 states instead of 15). Finally, although internally the program still works with a lexicographic ordering of states, for four gametes output may be in terms of the more conventional ordering of the Jacquard states.
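
The reduction from 15 to 9 states can be checked directly: two gametic states are indistinguishable from unphased genotypes when one is obtained from the other by swapping the maternal and paternal gametes within an individual. The sketch below is a counting check only; the label tuples and the order of the resulting classes are illustrative and do not reproduce the Jacquard numbering used in the program output.

```python
from itertools import product

def canonical(labels):
    """Relabel groups by order of first appearance, e.g. (2, 0, 2) -> (0, 1, 0)."""
    seen = {}
    return tuple(seen.setdefault(x, len(seen)) for x in labels)

def gametic_states(n=4):
    """All ibd partitions of n gametes, as canonical label tuples."""
    return sorted({canonical(t) for t in product(range(n), repeat=n)})

def genotype_class(state):
    """Representative of the class of a four-gamete state (a1, a2, b1, b2)
    under swapping the two gametes within individual a and/or b."""
    a1, a2, b1, b2 = state
    variants = [(a1, a2, b1, b2), (a2, a1, b1, b2),
                (a1, a2, b2, b1), (a2, a1, b2, b1)]
    return min(canonical(v) for v in variants)

if __name__ == "__main__":
    states = gametic_states(4)
    classes = sorted({genotype_class(s) for s in states})
    print(len(states), len(classes))
    for c in classes:
        print(c, [s for s in states if genotype_class(s) == c])
```

The two individuals themselves are not interchanged, which is why nine classes remain rather than fewer.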

The methods and study results of this approach are provided in [Tho08b] and [BGZT12]. Note that the data files and software released for [BGZT12] are for an earlier version of ibd_haplo. The version described here includes improvements both in user interface, and also in the way the IBD transitions are implemented. This version was first released for MORGAN V3.1.1 with more minor improvements for MORGAN 3.2. Specifically, for MORGAN 3.2 there have been modifications in the computation of locus-to-locus transition probabilities, providing for better approximation to the underlying continuous process modeled in [BGZT12].

See Concept Index for: ibd_create introduction, ibd_haplo introduction, marker data, proband gametes, ibd graph

12.2 Sample parameter files for ibd_create; beaglesim and beagledag

Here is a parameter file for beaglesim:

# Include everything in the output file.
set printlevel 5

# Provide a file name for the beaglesim input DAG file.
input extra file              "./eur_recode_s1.bgl.dag.gz"

# Provide a file name for the beaglesim haplotypes file.
output overwrite scores file  "./eur_recode_s1.haplotypes"

# The sampler seeds are going to be 53 and 5353 (ie '0x35' '0x14e9').
set sampler seeds 53 5353     # Set the sampler seeds.

# The following Morgan parameter statements are needed by beaglesim.

output 758 haplotypes as rows
set LD relaxation probability 0.05

These statements are mostly self-explanatory: See beaglesim and beagledag parameter statements for more details.

Here is a parameter file for beagledag:

# Include everything in the output file.
set printlevel 5

# Provide a file name for the beagledag input DAG file.
input extra file              "./eur_recode_s1.bgl.dag.gz"

# Provide a file name for the beagledag reduced DAG file.
output overwrite scores file  "./eur_recode_s1.reduced"

Note that beagledag is still a prototype program. This particular version is hard-wired to compute local haplotype frequencies for a specific subset of consecutive markers from this DAG. The subroutine structures are more general and a more flexible example of the calling program will be released in the future.

See Concept Index for: sample parameter file for beaglesim and beagledag, using the BEAGLE DAG.

12.3 Running ibd_create examples and sample output; beaglesim and beagledag

The ibd_create subprograms are called by invoking the program with a flag specific to the subprogram. Example files are included in the ‘Haplo’ subdirectory of MORGAN_Examples. Here are the relevant run commands:

./ibd_create -s beaglesim.par: Running ibd_create with the -s option runs beaglesim. The example runs on the BEAGLE DAG [Browning06] that was fit to 758 European haplotypes of the 1000 Genomes Data. Haplotypes are generated from the DAG model. It is recommended that both the output haplotypes and the unzipped version of the DAG file be removed after the program is run: these are large files.
./ibd_create -d beagledag.par: Running ibd_create with the -d option runs beagledag. This program computes local haplotype frequencies for a specific set of markers in the DAG file. It is still very much a prototype program; it may be generalized in the future. It is recommended that both the output reduced DAG file and the unzipped version of the DAG file be removed after the program is run: these are large files.

See Concept Index for: running beaglesim and beagledag.

12.4 Sample parameter files for ibd_create; simpop_fgl

The formats for simpop_fgl have been modified for MORGAN version 3.3.2 and subsequent versions. This subprogram now works in centiMorgans rather than base pairs. The Gold standard parameter file is identical to the one in the ‘MORGAN_Examples/Haplo’ subdirectory.

Here is a parameter file for simpop_fgl:

set printlevel 3

# Provide a file name for the simpop_fgl results file.
output overwrite scores file  "./simpop_fgl.ibdgraphs"

set sampler seeds 11 7654                      # Set the sampler seeds.

# The following Morgan parameter statements are being used to pass the integers
# and double precision reals needed as arguments for the simpop_fgl subprogram.

set 19 females per generation
set 30 offspring generations
output final 2 generations

set chromosome length 120.0 centiMorgans
set 0.9 remating probability weight

These statements are mostly self-explanatory: See simpop_fgl parameter statements for more details, including the control of individuals as parents through the remating probability.

See Concept Index for: sample parameter file for simpop_fgl, simulation of a population, simulation of descent in a pedigree.

12.5 Running ibd_create examples and sample output; simpop_fgl

The examples here are the ones in the ‘IBD_Haplo/Gold’ directory. They may be run as the gold standard simpop_fgl_gold, which is part of gold.4 in this directory. Alternatively, they may be run directly in the Gold directory as follows:

../ibd_create -p simpop_fgl.par
Running ibd_create with the -p option runs simpop_fgl.

In each case, a file containing the ibd graph of the requested individuals is produced. Note that these ibd graphs have a slightly different format from those used in the gl_auto and gl_lods programs. They are indexed in micro-centiMorgans, and contain additional information about the simulated population pedigree.

The fgl2dgl subprogram provides a translation between these two forms of ibd graph.

See Concept Index for: running simpop_fgl, ibd graph.

12.6 Sample parameter files for ibd_create; fgl2ibd and fgl2haplo and fgl2dgl

See References, for details of the cited papers.

There are two basic alternative requests for the fgl2ibd subprogram of ibd_create. One requests scoring of ibd among all pairs of individuals; the other, used in this example, requests ibd scoring for specified sets of proband gametes. This example requests sets of size 2, 3, 4, 5, 6, 7 and 10; the scoreset names indicate this but the names are arbitrary.

In this example, the marker data file provides marker locations in centiMorgans: thus any scaling factor to convert from base pairs to a genetic map is unnecessary and (if included) has no effect, although the output will report the scaling factor.

Here is the fgl2ibd_varying.par parameter file for this example:

# Provide the simulated fgl (simpop_fgl's output) file name.
input gamete data file        "./simpop_fgl.ibdgraphs"

#  Each FGL in the data set is assigned a unique haplotype.
#  The "sampler" seeds are used to randomly permute the haplotypes before
#     assignment, to permit multiple realizations on a single IBD data set.

set sampler seeds 0x00003039 0x00000431  # Replace by a seed file for real runs.

# Provide a marker data file name of a file containing the marker map.
input marker data file        "./eur_recode_s1_map.markers"

# Provide a file name for the fgl2ibd results file.
output overwrite scores file  "./fgl2ibd_varying.statelabels"

# Provide a file name for the fgl2ibd score set identifiers file.
output overwrite extra file   "./fgl2ibd_varying.ids"

# The following Morgan parameter statements are being used to specify which
# proband gametes to include in each analysis run by the fgl2ibd subprogram.

set scoreset 21 proband gametes 1148 0  1161 0

set scoreset 31 proband gametes 1148 0  1158 1  1162 0

set scoreset 41 proband gametes 1158 0  1158 1  1168 0  1168 1

set scoreset 51 proband gametes 1148 0  1151 0  1158 0  1161 0  1168 0

set scoreset 61 proband gametes
1158 0  1158 1  1161 0  1161 1  1168 0  1168 1

set scoreset 71 proband gametes
1148 1  1151 1  1158 1  1161 1  1168 1  1171 1  1178 1

set scoreset 101 proband gametes
1148 0  1148 1  1151 0  1151 1  1158 0  1158 1  1161 0  1161 1  1168 0  1168 1


# The following Morgan parameter statement specifies the marker locations
# at which ibd will be scored.

select markers  10    20    30    40    50    60    70    80    90   100
               110   120   130   140   150   160   170   180   190   200
               210   220   230   240   250   260   270   280   290   300
               310   320   330   340   350   360   370   380   390   400
               410   420   430   440   450   460   470   480   490   500
.........
              4410  4420  4430  4440  4450  4460  4470  4480  4490  4500
              4510  4520  4530  4540

For the fgl2haplo program there are several options regarding the format of the input haplotypes. Here we give one example where the input haplotypes are specified as columns. The marker map is in centiMorgans so any scaling factor declared is irrelevant. However, the input value will be reported in the output (or 1.0 will be reported, if the statement is not included).

# Provide the ibd graph input (simpop_fgl's output) file name.
input gamete data file        "./simpop_fgl.ibdgraphs"

# Provide the output haplotype labels and output with spaces options.
output haplotype labels
output with spaces

# The following Morgan parameter statement specified the markers 
# over which haplotypes will be generated.  (Typically one will generate
# complete haplotypes.)  

select all markers

# The following Morgan parameter statement is irrelevant unless the input
#  marker map is specified in base pairs.  For a centiMorgan map there is
#  no scaling.   However the output will report the input value, or
#  1.0 if the statement is not included.

set 1.2 million base pairs per centiMorgan

# Provide a marker data file name of a file containing the marker map.
input marker data file        "./eur_recode_s1_map.markers"

# Provide the founder haplotypes file name.
input extra file              "./eur_recode_s1_hapcols"

# Tell fgl2haplo how to interpret the haplotype data read in from the founder
# haplotypes file.

input 758 haplotypes as columns of 4547 snps

# Provide a file name for the fgl2haplo results file.
output overwrite scores file  "./fgl2haplo_haps_as_cols.results"

The fgl2dgl subprogram converts the ibdgraphs file from simpop_fgl to the form used by the gl_lods program; See Parameter files for the gl_lods program. The output file scores switchpoints at marker locations. In this example the marker locations are specified in base pairs, and a non-standard scaling is used in the conversion to illustrate the use of this statement. This is the ‘fgl2dgl_bp_0.75.par’ file:

# Provide the simulated fgls (simpop_fgl's output) file name.
input gamete data file        "./simpop_fgl.ibdgraphs"

# Provide a marker data file name of a file containing the marker map.
input marker data file        "./fgl2dgl_bp_map.markers"

# Provide a file name for the fgl2dgl results file.
output overwrite scores file  "./fgl2dgl_bp_0.75.ibdgraphs"

# The following Morgan parameter statement is required in order to satisfy
# the proc_auto subroutine.

select markers   120   130   290   450   525   770   979   980  1091  1290
                1492  1597  1726  1900  2116  2400  2810  3055  3250  3400

# The following Morgan parameter statement is now optional.  If the user needs
# to specify something other than 1.0 million base pairs per centiMorgan, it
# can be used to pass a double precision real number as a scale factor to be
# used by the fgl2dgl subprogram.

set 0.75 million base pairs per centiMorgan

See Concept Index for: sample parameter files for fgl2ibd, fgl2haplo and fgl2dgl.

12.7 Running ibd_create examples and sample output; fgl2ibd and fgl2haplo and fgl2dgl

These three programs have a variety of input formats and output options. Here we include only one example of each program. For additional information see the README_userdoc file in MORGAN/IBD_Haplo, or fgl2ibd and fgl2haplo and fgl2dgl parameter statements. All three programs use as input a file of ibd graphs in the format produced by simpop_fgl.

./ibd_create -i fgl2ibd_varying.par: Running ibd_create with the -i flag runs fgl2ibd. The program fgl2ibd simply scores the ibd in that file at specified markers for specified individuals. Note that you should retain (or re-create) the simpop_fgl output file, ‘simpop_fgl.ibdgraphs’ which is used as an input in fgl2ibd. The program produces two output files. The first specifies only the scored proband gametes, while the second gives the state specification at each marker for each proband gamete set. This separation simplifies downstream analysis, particularly if using R. Although for this example the output files are not large, it is good practice to remember to remove the output files after running the program.
./ibd_create -f fgl2haplo_hapcols.par: Running ibd_create with the -f flag runs fgl2haplo. The program fgl2haplo uses the supplied ibd graphs (for example those produced by simpop_fgl) and supplied haplotypes to generate genetic marker data for specified individuals, in accordance with their ibd across the chromosome. Note that the "haps_as_cols" refers to the input format of haplotypes, not the output. Output is either as haplotypes or genotypes in rows. Remember to remove any large output files after running the program.
./ibd_create -g fgl2dgl_bp_0.75.par: Running ibd_create with the -g flag runs fgl2dgl. This program takes the supplied ibd graphs in the format produced by simpop_fgl, and produces an ibd graph (or dgl graph) in the format produced by gl_auto and used in programs such as gl_lods. The input has switch-points in micro-centiMorgans. The output form provides switch-points at selected markers, so that, if the marker map is in base-pairs rather than centiMorgans, the program makes the appropriate conversion.
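
The arithmetic behind the ‘million base pairs per centiMorgan’ statement can be sketched as below. This assumes a simple linear rescaling of the whole map, which is how the statement reads; the function names are hypothetical and the sketch is not taken from the fgl2dgl source.

```python
def bp_to_cm(bp, mbp_per_cm=1.0):
    """Convert a base-pair position to centiMorgans under a constant
    scale of mbp_per_cm million base pairs per centiMorgan."""
    return bp / (mbp_per_cm * 1_000_000)

def cm_to_micro_cm(cm):
    """Express a centiMorgan position in micro-centiMorgans,
    the unit of the simpop_fgl switch-points."""
    return round(cm * 1_000_000)

if __name__ == "__main__":
    # With 0.75 million bp per cM, 1.5 million bp is 2 cM.
    print(bp_to_cm(1_500_000, 0.75))
    # With the default scale, 1 bp corresponds to 1 micro-centiMorgan.
    print(cm_to_micro_cm(bp_to_cm(1)))
```

Under the default scale of 1.0, a position in base pairs and the same position in micro-centiMorgans are numerically equal.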

See Concept Index for: running fgl2ibd, fgl2haplo and fgl2dgl, phased and unphased genotypes.

12.8 Sample parameter files for ibd_haplo

Three sample parameter files for ibd_haplo can be found in the directory ‘MORGAN_Examples/Haplo’. All three examples are based on examples in the Gold standards for the program. The examples are analogous to the three examples previously used, except that the data have been changed to use (simulated) individuals and haplotypes from the 1000 Genomes data.

The first two examples (‘phased_2011.par’ and ‘unphased_2011.par’) use the same data, and score the same sets of 4 gametes, consisting of the maternal and paternal gametes in five pairs of individuals. The examples differ only in whether the data are treated as phased haplotypes or unphased genotypes.

Here is the unphased_2011.par parameter file:

set printlevel 3     # See comment below

input marker data file       "./sim76indivs.markers"
output overwrite scores file "./unphased_2011.qibd"
output overwrite extra file  "./unphased_2011.ids"

# The following five Morgan parameter statements specify the
#   computational set-up for the program

select 2011 state transition matrix
select unphased data

set population kinship              0.05
set kinship change rate             0.05
set transition matrix null fraction 0.05

set genotyping error rate           0.01
output four-gamete state order jacquard

# The following Morgan parameter statements are being used to specify which
# proband gametes to include in each analysis run by ibd_haplo.

set scoreset   1 proband gametes  1107 0 1107 1 1119 0 1119 1
set scoreset   2 proband gametes  1111 0 1111 1 1115 0 1115 1
set scoreset   3 proband gametes  1123 0 1123 1 1127 0 1127 1
set scoreset   4 proband gametes  1131 0 1131 1 1135 0 1135 1
set scoreset   5 proband gametes  1169 0 1169 1 1170 0 1170 1

# The program computes ibd at, and uses only,  a subset of the markers:
#  in fact every tenth marker in this example

select markers  10    20    30    40    50    60    70    80    90   100
               110   120   130   140   150   160   170   180   190   200
               210   220   230   240   250   260   270   280   290   300
.... lines omitted here
              4310  4320  4330  4340  4350  4360  4370  4380  4390  4400
              4410  4420  4430  4440  4450  4460  4470  4480  4490  4500
              4510  4520  4530  4540

Since ibd_haplo typically runs with a very large number of markers, it is advisable to suppress printing of the marker map and allele frequencies using the ‘printlevel’ setting. The file specifications, and marker data, are as for previous programs such as lm_auto: See Sample lm_auto parameter file. Note that there are two output files.

The marker data file ‘sim76indivs.markers’ contains the positions and allele frequencies of 4547 markers and marker genotypes of 76 individuals. (These data were created using simpop_fgl and fgl2haplo, starting from the BEAGLE DAG fit to the European chromosomes of the 1000 Genomes data.) In this example a subset of the markers is selected; see the end of the file.
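
Typing several hundred marker indices by hand is error-prone, so a statement like the one above may be generated by a short script. The sketch below reproduces every tenth marker from 10 to 4540 with ten indices per line; the column widths and the helper name are assumptions, on the premise that the layout of the continuation lines is free-format as in the listing above.

```python
def select_markers_statement(first, last, step, per_line=10):
    """Return (marker_indices, statement_text) for a MORGAN
    'select markers' statement listing first, first+step, ..., <= last."""
    markers = list(range(first, last + 1, step))
    lines = []
    for i in range(0, len(markers), per_line):
        chunk = "".join(f"{m:6d}" for m in markers[i:i + per_line])
        prefix = "select markers" if i == 0 else " " * len("select markers")
        lines.append(prefix + chunk)
    return markers, "\n".join(lines)

if __name__ == "__main__":
    markers, statement = select_markers_statement(10, 4540, 10)
    print(statement.splitlines()[0])
    print("markers selected:", len(markers))
    # Five scoresets scored at each selected marker:
    print("scoreset-by-marker combinations:", 5 * len(markers))
```

For this example the script selects 454 of the 4547 markers, so five scoresets give 2270 scoreset-by-marker combinations.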

The second group of statements relate to the ibd_haplo implementation. The ‘2011’ transition matrix is to be used; this is the one described in [BGZT12] and is recommended. The earlier ‘2009’ option of [Tho08b] is retained for backwards compatibility. The data are to be analyzed as unphased genotypes, and there are four numerical parameters of the HMM model. Most importantly these include the ‘population kinship’, which is the mean a priori level of pairwise IBD between any pair of gametes.

For compatibility with previous output formats, the parameter file requests ordering of the states in all output information in the “genetic” (or Jacquard) order, rather than lexicographic order. Note that this option is only available for sets of four gametes. Here the four gametes are the two of each of two individuals, and data are analyzed as unphased, so there will be nine states in the output IBD probabilities. (The ordering of the states is given at the end of Running ibd_haplo examples and sample output.)

The next set of statements specifies five sets of four gametes among which IBD is to be scored. Since the data are to be analyzed as unphased it is required that each set contains both the maternal and paternal gametes of individuals. In general, any gametes of individuals in the marker data file may be specified.

The second example parameter file ‘phased_2011.par’ differs only in the names of the output files and in the statement ‘select phased data’, which specifies that the data should be treated as phased haplotypes. In this case it is not necessary that a scoreset consists of both gametes of individuals.

The third example ‘ten_ss.par’ shows the flexibility of scoresets. This example differs in that only a smaller subset of markers is used:

select markers 3010  3020  3030  3040  3050  3060  3070  3080  3090  3100
               3110  3120  3130  3140  3150  3160  3170  3180  3190  3200
               3210  3220  3230  3240  3250  3260  3270  3280  3290  3300

Additionally, the scoresets are quite varied:

set scoreset 61  proband gametes  1174 0 1174 1 1176 0 1176 1 1163 0 1163 1
set scoreset 41  proband gametes  1163 0 1163 1 1169 0 1169 1
set scoreset 43  proband gametes  1165 0 1165 1 1167 0 1167 1
set scoreset 51  proband gametes  1163 0 1163 1 1169 0 1169 1 1165 0
set scoreset 32  proband gametes  1163 0 1163 1 1169 0
set scoreset 21  proband gametes  1176 0 1176 1
set scoreset 42  proband gametes  1174 0 1174 1 1176 0 1176 1
set scoreset 52  proband gametes  1163 1 1165 1 1167 1 1169 1 1170 1
set scoreset 44  proband gametes  1170 1 1170 0 1172 0 1172 1
set scoreset 31  proband gametes  1167 0 1169 1 1169 0

The scoresets may have arbitrary numerical indicators, and range in size from 2 to 6 gametes. The program will reorder them according to size.

Note that the Jacquard ordering is requested for the four-gamete scoresets. There are four such scoresets (41, 42, 43 and 44); these will use Jacquard ordering, while for the remainder the ordering will be lexicographic. This again shows the flexibility of the program, but mixing orderings is likely to be confusing in real analyses.

12.9 Running ibd_haplo examples and sample output

Run the examples in the ‘Haplo’ subdirectory of the ‘MORGAN-examples’ directory with one of the following commands:

./ibd_haplo  unphased_2011.par > unphased_2011.out
or
./ibd_haplo  phased_2011.par > phased_2011.out
or
./ibd_haplo  ten_ss.par > ten_ss.out

Each example produces two output files in addition to the standard output. The standard output gives little information when ‘printlevel 3’ is used. The proband gamete sets are specified, and the program reports as it analyzes each set. In between, the program does give the prior probability of IBD states for each size of scoreset requested, and the transition matrix at a distance of 1 centiMorgan – these are mainly for checking purposes.

The two main output files are the ‘qibd’ file and the ‘ids’ file. For the first example, these have been named as ‘unphased_2011.qibd’ and ‘unphased_2011.ids’ with analogous names for the other examples. (Of course, any names for these files can be specified in the parameter file.) Each file contains only numeric data, so that it can be read easily into R or other programs for further analysis. The ‘qibd’ file is the key output of probabilities of IBD states in each scoreset at each marker, computed conditional on marker data. The ‘ids’ file gives the scoresets.

For the first example, the ‘ids’ output file is

 1  4  9  1107 0  1107 1  1119 0  1119 1
 2  4  9  1111 0  1111 1  1115 0  1115 1
 3  4  9  1123 0  1123 1  1127 0  1127 1
 4  4  9  1131 0  1131 1  1135 0  1135 1
 5  4  9  1169 0  1169 1  1170 0  1170 1

That is, there are five scoresets numbered 1 to 5. Each consists of 4 gametes. These gametes are specified in the usual format: ‘0’ for a maternal gamete and ‘1’ for a paternal, of the individual whose name ID is given. Since the data are analyzed as unphased, there are only 9 IBD states for each scoreset.
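Because the ‘ids’ file is purely numeric, it is easy to read outside MORGAN. The sketch below is one possible reader, written under the assumption (taken from the description above) that each line holds the scoreset number, the number of gametes, the number of IBD states, and then (name, meiosis indicator) pairs. The function name parse_ids_line and the dictionary layout are hypothetical.

```python
def parse_ids_line(line):
    """Split one line of an 'ids' file into its parts (illustrative helper)."""
    fields = line.split()
    scoreset, n_gametes, n_states = (int(x) for x in fields[:3])
    rest = fields[3:]
    # Remaining fields come in (individual name, 0 = maternal / 1 = paternal) pairs.
    gametes = [(rest[i], int(rest[i + 1])) for i in range(0, len(rest), 2)]
    if len(gametes) != n_gametes:
        raise ValueError("gamete count does not match the second column")
    return {
        "scoreset": scoreset,
        "n_gametes": n_gametes,
        "n_states": n_states,
        "gametes": gametes,
    }


# Two lines copied from the 'ids' output shown above.
ids_text = """\
 1  4  9  1107 0  1107 1  1119 0  1119 1
 5  4  9  1169 0  1169 1  1170 0  1170 1
"""

records = [parse_ids_line(l) for l in ids_text.splitlines() if l.strip()]
for rec in records:
    print(rec["scoreset"], rec["n_states"], rec["gametes"])
```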

The number of lines in the output ‘qibd’ file is the number of scoresets times the number of markers used: 2270 for the first two examples here. For the ‘unphased’ case, each line consists of two integers followed by 10 real numbers. The first line starts

 1    10    0.618463   0.0000  0.0000  0.0028  0.0003  0.0028  0.0003  0.8416 ....

while line 2037 starts

 5  2210   58.999373   0.1462  0.0009  0.2902  0.0011  0.1479  0.0007  0.0095 ....

The first item indicates the scoreset, the second the marker number, and the third the centiMorgan (or Mbp) position of the marker. The remaining 9 numbers are the probabilities of the 9 IBD states. In the first line, most of the probability (0.8416) is in state 7, which is the state ‘1212+1221’. That is, the two individuals are likely to share both gametes IBD at this first locus, but in this state there is no IBD between the gametes within individuals. In the second line shown, at marker 2210 in scoreset 5, there is probability 0.1462 that all four gametes of the two individuals are IBD, and an even higher probability (0.2902) that the two gametes of the first individual are IBD and are shared IBD with one of the two gametes of the second individual.
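The layout just described can be checked with a few lines of Python. The sketch below is illustrative only: it parses the visible part of the first ‘qibd’ line quoted above (the trailing values are elided there and are not filled in here), finds the most probable of the states that are shown, and names it using the nine reduced-state labels in the order this section gives for unphased four-gamete scoresets. The helper name parse_qibd_prefix is hypothetical, and the line-number arithmetic assumes that lines are grouped by scoreset with the same number of markers in each group.

```python
REDUCED_STATES = [
    "1111", "1122", "1112+1121", "1123", "1211+1222",
    "1233", "1212+1221", "1213+1231+1223+1232", "1234",
]


def parse_qibd_prefix(line):
    """Return (scoreset, marker, position, probabilities present on the line)."""
    # Drop the '....' elision marker used in the printed excerpt.
    fields = [f for f in line.split() if f.strip(".")]
    scoreset, marker = int(fields[0]), int(fields[1])
    position = float(fields[2])
    probs = [float(x) for x in fields[3:]]
    return scoreset, marker, position, probs


first = " 1    10    0.618463   0.0000  0.0000  0.0028  0.0003  0.0028  0.0003  0.8416 ...."
scoreset, marker, position, probs = parse_qibd_prefix(first)

# Index of the largest probability among the values shown.
best = max(range(len(probs)), key=probs.__getitem__)
print("scoreset", scoreset, "marker", marker, "state", best + 1, REDUCED_STATES[best])

# Line-count arithmetic: 2270 lines over 5 scoresets gives 454 markers each.
markers_per_scoreset = 2270 // 5
scoreset_of_line_2037 = (2037 - 1) // markers_per_scoreset + 1
print(markers_per_scoreset, scoreset_of_line_2037)
```

Since 0.8416 already exceeds one half, the elided values cannot change which state is most probable on that line.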

For sets of 4 gametes, we use the traditional ordering of the 15 IBD states or 9 reduced genotypic states:

The order of the 15 states is 1111, 1122, 1112, 1121, 1123, 1211, 1222, 1233, 1212, 1221, 1213, 1231, 1223, 1232, 1234.

For the nine reduced states, the order is the same, but genotypically equivalent ones are combined: 1111, 1122, 1112+1121, 1123, 1211+1222, 1233, 1212+1221, 1213+1231+1223+1232, 1234.

For more general gamete sets, the ordering is lexicographic.
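To make the lexicographic ordering concrete, the sketch below enumerates IBD states for n gametes and lists them in dictionary order. It rests on two assumptions that are not stated in this section: that states are labelled as in the four-gamete case (the first gamete is labelled 1, and each new IBD group takes the next unused integer), and that "lexicographic" means dictionary order on those label sequences. The function name ibd_states is hypothetical.

```python
def ibd_states(n):
    """All IBD state labellings of n gametes, as tuples in lexicographic order."""
    states = []

    def extend(prefix, max_label):
        if len(prefix) == n:
            states.append(tuple(prefix))
            return
        # A gamete may join an existing group (1..max_label) or start a new one.
        for label in range(1, max_label + 2):
            extend(prefix + [label], max(max_label, label))

    extend([1], 1)
    return states


four = ["".join(str(x) for x in s) for s in ibd_states(4)]
print(len(four))
print(" ".join(four))

# The same 15 states appear in the traditional ordering, only in a different sequence.
jacquard = ["1111", "1122", "1112", "1121", "1123", "1211", "1222", "1233",
            "1212", "1221", "1213", "1231", "1223", "1232", "1234"]
print(sorted(jacquard) == four)
```

Under these assumptions a scoreset of 2 gametes has 2 states, one of 3 gametes has 5, and one of 6 gametes has 203.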

For more on the specification of IBD states see Sample lm_auto parameter file.

12.10 Population-based IBD inference parameter statements

See Concept Index for: population-based IBD inference parameter statements.

12.10.1 beaglesim and beagledag parameter statements

The following statements are specific to the beaglesim and beagledag subprograms of ibd_create, or have a particular role in these subprograms.

set LD relaxation probability X

beaglesim can simulate haplotypes from the same base DAG but at varying (lesser) levels of LD. At each DAG node, with a probability equal to the parameter, the program selects a random node at the next level, breaking LD in the generation of this particular haplotype. The default value 0.0 is generally recommended, except where these varying LD levels are the target of analysis.
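The effect of the relaxation probability can be pictured with a toy model. The Python sketch below is not beaglesim code and does not read BEAGLE DAG files: the three-level DAG, its data layout, and the exact point at which the random jump is made are all assumptions chosen to illustrate the idea. At each level an outgoing edge is chosen in proportion to its count and its allele is emitted; then, with probability relax_prob, the walk moves to a uniformly chosen node at the next level instead of the edge's own child.

```python
import random

# Toy DAG (hypothetical): one dict per marker level,
# mapping node -> [(allele, count, child_node), ...].
TOY_DAG = [
    {"a": [(1, 3, "b"), (2, 1, "c")]},
    {"b": [(1, 1, "d")], "c": [(2, 1, "d")]},
    {"d": [(1, 1, None), (2, 1, None)]},
]


def simulate_haplotype(dag, relax_prob, rng):
    """Simulate one haplotype from the toy DAG with optional LD relaxation."""
    node = next(iter(dag[0]))  # single root node at the first level
    alleles = []
    for level, nodes in enumerate(dag):
        edges = nodes[node]
        allele, _, child = rng.choices(edges, weights=[e[1] for e in edges])[0]
        alleles.append(allele)
        if level + 1 < len(dag):
            if rng.random() < relax_prob:
                # Relaxation: ignore the edge's child and jump to a random node.
                node = rng.choice(sorted(dag[level + 1]))
            else:
                node = child
    return alleles


rng = random.Random(1)
strict = [simulate_haplotype(TOY_DAG, 0.0, rng) for _ in range(200)]
relaxed = [simulate_haplotype(TOY_DAG, 1.0, rng) for _ in range(200)]

# With no relaxation the first two markers are in complete LD in this toy DAG.
print(all(h[0] == h[1] for h in strict))
# With full relaxation the association between them is broken.
print(any(h[0] != h[1] for h in relaxed))
```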

output haplotypes as (rows | columns)

beaglesim takes the DAG model produced by the BEAGLE software, and simulates haplotypes from the model. These haplotypes are generated one-by-one as “rows”, but may be output either as rows or as columns for easier use in downstream programs.

output I genotypes

This is an alternate output option for beaglesim. Note that either an output haplotypes ... or an output ... genotypes statement is required.

beaglesim can output its haplotypes in MORGAN marker data file format, so that they can be more easily used in MORGAN downstream analyses. The output produced by beaglesim in this case consists of a set markers ... data ... statement with the gametes ordered as defined for this parameter statement.

output with spaces

This statement can be used to insert spaces between the alleles in the output file of haplotypes produced by beaglesim. The file is twice as large, but it may be easier for input to downstream analysis programs. The default is absence of spaces.

output haplotype labels

This statement implies the "output with spaces" option for haplotype output. If haplotype labels are requested, and if haplotypes are to be output as rows, then each row will include as the first item the ID number of the individual to whom the haplotype belongs. If labels are requested, and if haplotypes are output as columns, then the first line of the output will contain the ID numbers of the individuals to whom the corresponding haplotype columns belong.

12.10.2 simpop_fgl parameter statements

The following statements are specific to the simpop_fgl subprogram of ibd_create, or have a particular role in this subprogram.

set chromosome length R centiMorgans

This statement is required by simpop_fgl.

It provides the total length of the chromosome in which these programs simulate recombination breakpoints at each meiosis.

set I offspring generations

This statement is required by simpop_fgl.

This sets the number of additional generations after the founder generation that simpop_fgl will generate.

set I females per generation

This statement is required by simpop_fgl.

simpop_fgl simulates generations of constant size, with equal numbers of males and females in each generation. This parameter specifies the number of females in each generation.

set X remating probability weight

This statement is optional for simpop_fgl.

simpop_fgl simulates each generation by successively selecting a random male and a random female and generating a male and a female offspring. If a sampled individual has been a member of m previous matings, then the selected individual is accepted with probability X^m. The value X = 1/3 generates reasonable human pedigrees, and gives an effective population size about equal to the census size. (Random mating is achieved by the default parameter value 1.0, but is not recommended.)
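The acceptance rule can be sketched in a few lines of Python. This is not simpop_fgl code: the function names are hypothetical, and the sketch assumes that the male and the female are each drawn and accepted independently, and that one mating is made per female so that the generation size stays constant.

```python
import random


def sample_parent(n_candidates, matings, weight, rng):
    """Draw candidates uniformly until one is accepted with probability weight**m,
    where m is the number of matings that candidate already has."""
    while True:
        i = rng.randrange(n_candidates)
        if rng.random() < weight ** matings[i]:
            matings[i] += 1
            return i


def simulate_matings(n_females, weight, rng):
    """Form n_females couples; each couple yields one son and one daughter."""
    male_matings = [0] * n_females      # equal numbers of males and females
    female_matings = [0] * n_females
    couples = []
    for _ in range(n_females):
        father = sample_parent(n_females, male_matings, weight, rng)
        mother = sample_parent(n_females, female_matings, weight, rng)
        couples.append((father, mother))
    return couples, male_matings, female_matings


rng = random.Random(2019)
couples, male_matings, female_matings = simulate_matings(50, 1.0 / 3.0, rng)
print(len(couples), max(male_matings), max(female_matings))

# Weight 0.0 rejects anyone who has already mated, so every individual mates once.
_, strict_m, strict_f = simulate_matings(50, 0.0, rng)
print(set(strict_m), set(strict_f))
```

With weight 1.0 every draw is accepted, which corresponds to the random-mating default described above.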

output final I generation

This statement is required by simpop_fgl.

simpop_fgl outputs only the last generations that it simulates. This parameter specifies the number of these final generations that will be output.

12.10.3 fgl2ibd and fgl2haplo and fgl2dgl parameter statements

The following statements are specific to the fgl2ibd and fgl2haplo subprograms of ibd_create, or have a particular role in these subprograms.

map [chromosome I] marker positions base pairs X1 X2...

Although genetic-map (centiMorgan) locations are to be preferred where available, the fgl2haplo, fgl2ibd, and fgl2dgl subprograms can alternatively be provided with marker locations in base pairs.

set R million base pairs per centiMorgan

The fgl2haplo, fgl2ibd, and fgl2dgl subprograms can be provided with marker locations in base pairs. In this case, the scaling factor R provided by this statement is used to convert these locations to the centiMorgan scale used by simpop_fgl. If no scaling is provided, the program equates 1 million base pairs with 1 centiMorgan.
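The conversion implied by this statement is a simple rescaling. As a minimal sketch (the function name and argument names are illustrative, not MORGAN identifiers), a position in base pairs is divided by R million to give centiMorgans, with R defaulting to 1:

```python
def base_pairs_to_centimorgans(positions_bp, mbp_per_cm=1.0):
    """Convert base-pair positions to centiMorgans.

    mbp_per_cm plays the role of R in 'set R million base pairs per centiMorgan';
    the default of 1.0 equates 1 million base pairs with 1 centiMorgan.
    """
    return [bp / (mbp_per_cm * 1_000_000) for bp in positions_bp]


positions = [1_000_000, 2_500_000]
print(base_pairs_to_centimorgans(positions))        # default scaling
print(base_pairs_to_centimorgans(positions, 2.0))   # 2 Mbp per cM
```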

input gamete data file S

This statement is used by both fgl2ibd and fgl2haplo to specify the input file from which the program should read the fgl-segment compact-format chromosomes that it uses. Currently it is assumed that the chromosomes are in the ibd graph format generated by simpop_fgl. This format differs slightly from that of the ibd graphs produced by the Autozyg program gl_auto from analysis of marker data on a defined pedigree (See The ibd_class utility).

input I haplotypes as (columns | rows) of I2 SNPs

fgl2haplo requires a set of marker haplotypes which it will apply to the fgl-based ibd graph. These may be input either as rows or as columns, but the number of SNPs and haplotypes should be specified.

input genotypes as rows

Alternatively, the input haplotypes for fgl2haplo may be input in standard MORGAN marker genotype format. In this case each row of marker genotypes will be interpreted as a pair of phased haplotypes. (See Single and multiple meiosis LM-samplers and “phased and unphased marker haplotypes” in Concept Index.)

output all genotypes

This is an alternative output option for fgl2haplo.

fgl2haplo can output its haplotypes in MORGAN marker data file format, so that they can be more easily used in MORGAN downstream analyses. In this case the individual names from the simpop_fgl file are output as the first item in each line of genotypes.

output with spaces

This statement can be used to insert spaces between the alleles in the output file of haplotypes produced by fgl2haplo. The file is twice as large, but it may be easier for input to downstream analysis programs. The default is absence of spaces.

output haplotype labels

This statement implies the "output with spaces" option for haplotype output. If haplotype labels are requested, and if haplotypes are to be output as rows, then each row will include as the first item the ID name of the individual to whom the haplotype belongs. If labels are requested, and if haplotypes are output as columns, then the first line of the output will contain the ID names of the individuals to whom the corresponding haplotype columns belong.

output four-gamete state order jacquard

In the case of four gametes, the user may select to output states in the traditional "Jacquard" order: 1111, 1122, 1112, 1121, 1123, 1211, 1222, 1233, 1212, 1221, 1213, 1231, 1223, 1232, 1234. If the gametes are of a pair of individuals, and the data are analyzed as unphased, the output state-probabilities will be reduced to nine, in the ordering: 1111, 1122, 1112+1121, 1123, 1211+1222, 1233, 1212+1221, 1213+1231+1223+1232, 1234.

See Concept Index for: state ordering: Jacquard; state ordering: lexicographic.
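The reduction from fifteen states to nine amounts to summing the probabilities of genotypically equivalent states. The following Python sketch shows that bookkeeping using the two orderings listed for this statement; the function name reduce_to_nine is hypothetical, and the uniform input vector is made-up data used only to exercise the code.

```python
JACQUARD_15 = [
    "1111", "1122", "1112", "1121", "1123", "1211", "1222", "1233",
    "1212", "1221", "1213", "1231", "1223", "1232", "1234",
]
REDUCED_9 = [
    "1111", "1122", "1112+1121", "1123", "1211+1222",
    "1233", "1212+1221", "1213+1231+1223+1232", "1234",
]


def reduce_to_nine(probs15):
    """Sum 15 state probabilities (Jacquard order) into the 9 reduced states."""
    by_state = dict(zip(JACQUARD_15, probs15))
    return [sum(by_state[s] for s in group.split("+")) for group in REDUCED_9]


# Made-up input: equal probability on each of the 15 states.
uniform = [1.0 / 15] * 15
reduced = reduce_to_nine(uniform)
for label, p in zip(REDUCED_9, reduced):
    print(label, round(p, 4))
```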

12.10.4 ibd_haplo parameter statements

The following statements are specific to ibd_haplo, or have a particular role in this program. Note that a number of the statements apply only when the proband gamete set consists of the four gametes of a pair of individuals.

select ([partially] phased | unphased) data

The "select ... data" statement is used to inform "ibd_haplo" whether to handle the data as phased data, unphased data or partially phased data. If the data are phased there is no restriction on whether proband gametes are related to each other or not. If the data are unphased or partially phased it is necessary that the proband gametes are pairs of haplotypes, each pair belonging to a whole individual.

select [2009 | 2011] state transition matrix

There are two different state transition matrices implemented in the ibd_haplo program. The user must specify which transition matrix to use for the analysis: see Sample parameter files for ibd_haplo. Note that the 2009 matrix is applicable only to sets of four gametes.

set transition matrix null fraction X

This statement sets a parameter that modifies the transition matrix to allow for transitions that cannot occur under the base transition matrices. The argument, X, is a real number greater than or equal to 0.0 and less than or equal to 1.0.

set genotyping error rate E

This statement sets the genotyping error rate to be used by "ibd_haplo". The value of E is a real number greater than or equal to 0.0 and less than 1.0. The value 0.01 would be a typical value for E.

set population kinship X

This statement sets the prior population kinship parameter to be used by "ibd_haplo" to X, where X is a real number greater than 0.0 and less than 1.0. Typically in small populations a value from 0.01 to 0.05 might be reasonable.

set kinship change rate X

This statement sets the kinship change rate parameter for IBD. This is the total change rate per centiMorgan. It should be a real number greater than 0.0. It is approximately the prior for the inverse of an IBD segment length in centiMorgans between any pair of haplotypes. However, a smaller value than the typical expected length generally works better.

set [scoreset N] proband gametes N1 K1 N2 K2 …

One or more scoring sets may be given, where a scoring set consists of two or more haplotypes. If there is more than one set, each set is assigned a number 1 or greater. The maximum number of haplotypes in each set is limited to 10, due to computer memory considerations. Pairs of names and meiosis indicators are given, with 0 indicating maternal inheritance, 1 indicating paternal inheritance. At least one proband gametes score set must be specified when running ibd_haplo.

This statement is also used by fgl2ibd.

set proband gametes all individual pairs

If this statement is used, then the ibd_haplo program will set up scoresets of 4 gametes for every pair of observed individuals. Typically the user will have unphased genotypes for this purpose, although the data may be specified as phased or as unphased.

This statement is also used by fgl2ibd.
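To give a sense of what "all individual pairs" implies, the sketch below builds one four-gamete scoreset for each unordered pair of individuals. It only illustrates the counting: the function name, the numbering of the scoresets, and the order of gametes within a scoreset are assumptions, not a description of how ibd_haplo itself arranges them.

```python
from itertools import combinations


def all_pair_scoresets(individuals):
    """One scoreset of 4 gametes (maternal 0, paternal 1) per pair of individuals."""
    scoresets = []
    for number, (a, b) in enumerate(combinations(individuals, 2), start=1):
        scoresets.append((number, [(a, 0), (a, 1), (b, 0), (b, 1)]))
    return scoresets


individuals = ["1163", "1165", "1167", "1169"]
pairs = all_pair_scoresets(individuals)
for number, gametes in pairs:
    print(number, gametes)
```

For n observed individuals this gives n(n-1)/2 scoresets, so the size of the output grows quadratically with the number of individuals.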

output four-gamete state order jacquard

output scores at markers I1 I2 ....

The HMM algorithms of ibd_haplo require that all markers be used in the computation of the marker-based conditional probabilities of ibd states. However, subsequent analyses may not require computation at every marker location. Thus, to limit the size of output files, it is possible to request that the state probabilities be output only at a subset of the markers.

This document was generated by Elizabeth Thompson on September 6, 2019 using texi2html 1.82.