[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

9. Estimating Conditional ibd Probabilities by MCMC

See Concept Index for: conditional ibd probabilities, identity by descent, ibd.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

9.1 Introduction to lm_auto, gl_auto and lm_pval

See References, for details of the cited papers.

The MORGAN programs lm_auto, gl_auto and lm_pval are referred to as “Autozyg” programs, as they estimate autozygosity, or identity by descent (ibd). The Autozyg programs use MCMC to infer patterns of ibd among members of a pedigree conditional on marker data, and possibly also on trait data. These inferred patterns may then be used in multipoint linkage analyses on large pedigrees where many individuals may be unobserved and exact computation is infeasible. The data are the genotypes at marker loci of observed individuals in pedigrees. For lm_auto and lm_pval there may also be affectation status (affected / unaffected / unknown) for the trait of interest.

lm_auto uses either the old (single-meiosis) or new (multiple meiosis) LM-sampler to realize ibd configurations from their conditional distribution given the marker and/or trait data. Given the data, it estimates conditional probabilities of genome sharing patterns (gene ibd) among specified haplotypes, often chosen from affected individuals. The marker data are used jointly in the sampling. The resulting ibd is either scored marginally at each marker locus, or over windows of a small number of loci.

gl_auto also uses either the old or new LM-sampler to realize ibd configurations from their conditional distribution given the marker data. By default, a null (no-data) unlinked ‘trait’ locus is also included: optionally this may be omitted Autozyg MCMC parameters and options. Rather than estimating ibd probabilities, gl_auto outputs realizations of ibd configurations directly to an output file. The output may either be of founder genome labels (FGL or ‘gl’) or of the meiosis indicators that determine the FGL, or both. In either case, the output is in a compact format where only changes of FGL or meiosis indicators are recorded, together with the positions of these changes. Positions are in terms of marker indices in the original marker data file, even where markers are subselected for MCMC. These output ibd graphs may be used for subsequent analyses of trait data on the pedigrees, without further reference to the pedigree structure or marker data. For further details see [Tho11].

lm_pval uses the LM-sampler to provide the conditional distribution of an ibd measure, T, given marker data. In principle it can be used to provide Monte Carlo estimates of any NPL (Non-Parametric Linkage) statistics for detecting linkage. Trait information provided to the program consists of the list of affected members of the pedigree, provided as the phenotypic status in the pedigree file.

The version of the program lm_pval in MORGAN 3.0 (originally released in MORGAN V.2.8) uses the latent p-value distribution of [TG07]. In lm_pval, marker data are assumed available on some pedigree members, at some of the marker loci. At each test genome location, the distribution of the ibd measure, T, conditional on marker data is compared to the unconditional distribution under the null a priori distribution uniform over all inheritance vectors. Then quantiles of a latent (fuzzy) p-value distribution are produced. A latent p-value distribution corrected for multiple testing over genome locations is also produced, by scoring the maximum of the ibd measure, T, over test locations.

Additional programs using latent p-values are under development, including programs for the distribution of latent lod scores obtained in MCMC sampling (lm_fuzlod), p-values and randomized tests based on latent lod score statistics (lm_fzplod), and randomized confidence sets for the location of a trait locus (lm_fzconf). The methods are described by Thompson in [Tho08a]. The program civil (see Introduction to lm_ibdtests and civil) also uses latent p-values [DT09].

See Concept Index for: Autozyg programs, lm_auto introduction, gl_auto introduction, lm_pval introduction, Markov chain Monte Carlo, autozygosity, meiosis indicators, identity by descent, descent graph and ibd graph, ibd, latent p-values, LM-sampler, multiple meiosis sampler.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

9.2 Sample lm_auto parameter file

lm_auto uses the parameter file ‘jv_rep_auto.par’ in the ‘IBD’ subdirectory:

 
input pedigree file 'jv_rep.ped'
input seed file '../sampler.seed'
output overwrite seed file '../sampler.seed'

set printlevel 5
select all markers  
select trait 1
set trait 1 tloc 1

map gender F  markers            dist 25.5 25.5 25.5 25.5 
map gender M  markers            dist 11.2 45.8 11.2 45.8 
map gender F  tloc 1  marker 2  dist 12.8
map gender M  tloc 1  marker 2  dist 5.8

set markers 1 2 3 4    allele freqs  .2 .2 .4 .1 .06 .04
set markers 5	       allele freqs  .3 .2 .3 .1

set tloc 1  allele freqs  .95 .05

set marker data 5
333   1 3  1 3  1 3  1 3  1 3 
331   3 4  3 4  3 4  3 4  3 4 
334   2 3  2 3  2 3  2 3  2 3 
431   3 4  3 4  3 4  3 4  3 4 
531   3 3  3 3  3 3  3 3  3 3 

343   1 3  1 3  1 3  1 3  1 3 
341   3 5  3 5  3 5  0 0  3 3 
344   4 6  4 6  4 6  2 4  2 4 
441   3 4  3 4  0 0  3 4  3 4 
541   3 3  3 3  3 3  3 3  3 3 

set window patterns 0 4
set locus window 3

set component 1 proband gametes  531 1  531 0  331 0  333 1
set component 2 proband gametes  541 1  541 0 

set L-sampler probability 0.2
use multiple meiosis sampler
set MC iterations 2000

The ‘select’ statement is analogous to genedrop’s ‘simulate’ statement (see genedrop computing requests). The statements first specify that all markers will be used. Then the trait (‘1’) for which data will be analyzed is identified and connected to a tloc (‘1’).

The trait values are specified in the pedigree file: the specified pedigree file, ‘jv_rep.ped’, is a 30-member, two-component pedigree. Since there is no ‘input pedigree record trait’ statement in the example parameter file, the default behavior is implemented and so the trait value is listed after the three names and integer gender in the pedigree file. In this example, this is the 5th and final column (2nd integer). Because the trait type is not specified in the parameter file via a ‘set trait data’ statement, by default the trait data are ‘genotypic’, so that they are are coded as ‘1’, ‘3’, ‘4’ or ‘0’, corresponding to trait locus genotypes of ‘1 1’, ‘1 2’ (or ‘2 1’), ‘2 2’ or ‘missing’, respectively. In the example, the final individuals of each pedigree component, named 531 and 541, have trait value ‘4’. All other individuals in the file have trait value ‘0’.

The ‘map’ statements specify the marker map and trait position in terms of genetic distances (centiMorgan). In this example there are five markers with gender-specific maps. The trait locus position is measured from the marker to its left. In this example, the trait locus for males is between markers 2 and 3 at a distance of 12.8 cM to the left of marker 2 (See See genedrop mapping model parameters. The ‘set markers’ statements specify the number and frequency of alleles for each marker. In the example, the first four markers each have six alleles (labeled 1–6) with frequencies 0.2, 0.2, 0.4, 0.1, 0.06 and 0.04. The fifth marker has four alleles with frequencies 0.3, 0.2, 0.3 and 0.1. The trait locus has two alleles; alleles ‘1’ and ‘2’ have frequencies 0.95 and 0.05, respectively.

The ‘set marker data’ statement specifies the number of markers to be five. Following the ‘set marker data’ statement are genotype data for typed individuals. Alternatively, lm_auto can read genotype data from a separate file specified with an ‘input marker data file’ statement. Note that in the parameter file the marker-5 allele frequencies do not sum to 1. By default, allele frequencies will not be normalized. The implication is that there are other alleles not present in the marker data, whose frequencies therefore need not be listed. If the program encounters in the marker data an allele at this locus other than ‘1’ to ‘4’, an error will be generated.

The ‘set window patterns’ and ‘set locus window’ statements instruct lm_auto to compute the probabilities that the gametes named in the ‘set proband gametes’ statement have a particular ibd pattern (also called state) jointly across several loci. The ‘set locus window’ statement specifies the number of loci to be examined simultaneously, in this case 3. This statement was discussed briefly in the ibddrop example: See Running ibddrop example and sample output. The probabilities in the ‘IBD’ column of the output specifies whether one of the specified set of patterns holds (‘1’) or not (‘0’) at each of the three loci across the window.

Using the ‘set window patterns’ statement in lm_auto, the user can specify ibd patterns of interest over two or more loci. Recall that in the ibddrop windows option, one can can only estimate the probability of two gametes being ibd or not. The ‘set window patterns’ statement indicates that we are interested in patterns ‘0’ and/or ‘4’, which correspond to ibd patterns ‘1 1 1 1’ and ‘1 1 2 2’, respectively. That is, in component 1, we are interested in the probability that all four of the gametes named in the ‘set proband gametes’ statement are ibd across 3-locus windows or that the first and second gametes (maternal and paternal haplotypes of individual 531) are ibd and the third and fourth gametes (maternal haplotype of individual 531 and paternal haplotype of individual 333) are ibd, but these two pairs are not ibd with each other.

Recall the output of the ibddrop program generated when using the parameter file ‘ibd.par’. In the section of the program output headed ‘Probabilities of IBD patterns’, each of the ibd patterns listed in the leftmost column is associated with a label in the right-most column.

 
Probabilities of IBD patterns

Proband gamete set 1:  541 0  541 1  341 0  343 1

pattern marker-1 marker-2  trait-1 marker-3 marker-4 marker-5    label

1 1 1 1    .0287    .0298    .0310    .0273    .0287    .0298        0
1 1 1 2    .0290    .0275    .0292    .0282    .0302    .0305        1
1 1 2 1    .0132    .0135    .0138    .0140    .0139    .0132        3

The ‘set window patterns’ statement in the parameter file for lm_auto expects one or more of these labels, which instruct it to calculate the probabilities of the associated pattern(s). This means that you must determine the labels of the patterns of interest (for example, by running ibddrop), before using lm_auto to estimate multi-locus probabilities.

The ‘set proband gametes’ statement is the key statement for lm_auto. It specifies which haplotypes are to be scored with ibd probabilities. The syntax is as follows, where N1, N2, ... are individual ID’s and K1, K2, ... indicate the haplotype as paternal (1) or maternal (0):

 
set [component M proband gametes N1 K1 N2 K2 ...

In the example, ‘531 1’ refers to the paternal (1) haplotype of individual ‘531’. The first statement requests scoring both haplotypes of 531, the maternal (0) haplotype of 331, and the paternal (1) haplotype of 333. Note that currently the number of proband gametes to be scored jointly is limited to 10. See ibddrop statements, for more discussion of the ‘set proband gametes’ statement.

As with all of MORGAN’s MCMC-based programs, the user can specify the desired number of MC iterations using the ‘set MC iterations’ statement, the desired number of burn-in iterations using ‘set burn-in iterations’, and the probability that the L-sampler is selected instead of the M-sampler using ‘set L-sampler probability’. In this example, 2000 sampling iterations are to be performed, using the L-sampler 20 percent of the time. These iterations are preceded by burn-in iterations. Because the number of burn-in iterations is not specified, lm_auto will use the default value of 10 percent of the number of main iterations. In practice, one would run the MCMC sampler much longer than 2000 iterations (on the order of 10^5). The ‘multiple meiosis’ sampler is requested and the ‘set global MCMC’ statement is not used, so MCMC will be performed separately for each pedigree component. For further details of these statements: See MCMC parameter statements.

See Concept Index for: lm_auto sample parameter file, trait data, genotypic trait, gender–specific maps, marker data file, window patterns, proband gametes, L-sampler, M-sampler, burn-in.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

9.3 Running lm_auto example and sample output

The syntax for running a MORGAN program is:

 
./<program> <parfile> [> <output file name>]

The lm_auto example can be run under the subdirectory ‘IBD

 
./lm_auto jv_rep_auto.par > auto.out

Below are sections of the output file ‘auto.out’, generated by running lm_auto using the parameter file ‘jv_rep_auto.par’. Note, as for the program ibddrop, the exact values of the probability estimates will depend on the value of the random seed. The tables of estimated ibd probabilities are given for each component, towards the end of the output. The estimated probabilities of gene ibd patterns are given for each marker and for the trait locus (in the map order). In the following extracted output, the MCMC diagnostic information has been omitted.

 
 ====== IBD scores for component 1 are estimated using MCMC ======

       Proband gamete set 1:  531 1  531 0  331 0  333 1

       pattern marker-1 marker-2 trt-geno marker-3 marker-4 marker-5    label

       1 1 1 1    .2345    .3375    .3655    .2945    .2450    .1870        0
       1 1 1 2    .1165    .1815    .2250    .1785    .1425    .0940        1
       1 1 2 1    .1540    .2025    .2595    .1605    .1300    .1070        3
       1 1 2 2    .0135    .0215    .0270    .0195    .0095    .0080        4
       1 1 2 3    .0260    .0270    .0435    .0250    .0210    .0195        5
       1 2 1 1    .0190    .0110    .0010    .0115    .0200    .0275        6
       1 2 1 2    .0715    .0430    .0130    .0600    .0665    .0720        7
       1 2 1 3    .0560    .0250    .0090    .0275    .0420    .0520        8
       1 2 2 1    .0345    .0205    .0030    .0265    .0380    .0310        9
       1 2 2 2    .0925    .0435    .0120    .0685    .1060    .1410       10
       1 2 2 3    .0425    .0285    .0105    .0325    .0410    .0515       11
       1 2 3 1    .0245    .0125    .0040    .0115    .0210    .0275       12
       1 2 3 2    .0620    .0300    .0130    .0515    .0670    .0885       13
       1 2 3 3    .0115    .0020    .0040    .0125    .0125    .0185       14
       1 2 3 4    .0415    .0140    .0100    .0200    .0380    .0750       15

    Probabilities of IBD for pattern set for windows of 3 loci

       Proband gamete set 1

       Pattern set:   0  4

         IBD  wndw 1 wndw 2 wndw 3 wndw 4

       0 0 0   .4955  .4635  .4630  .5410
       0 0 1   .0810  .0815  .0520  .0740
       0 1 0   .0345  .0415  .0375  .0430
       0 1 1   .1410  .0545  .0550  .0280
       1 0 0   .0495  .0515  .1520  .1125
       1 0 1   .0150  .0110  .0190  .0180
       1 1 0   .0280  .1295  .0930  .1085
       1 1 1   .1555  .1670  .1285  .0750


 ====== IBD scores for component 2 are estimated using MCMC ======

    Probabilities of IBD patterns

       Proband gamete set 1:  541 1  541 0

       pattern marker-1 marker-2 trt-geno marker-3 marker-4 marker-5    label

          1 1    .7000    .8760    .9580    .8165    .6570    .4670        0
          1 2    .3000    .1240    .0420    .1835    .3430    .5330        1

Interpretation of these results is similar to that of ibddrop See Running ibddrop example and sample output. Briefly, the probabilities are summarized by ibd pattern. A pattern is a series of integers, one representing each gamete listed in the ‘set proband gametes’ statement. The order of gametes in the output file patterns is the same as the order in which the gametes were listed in ‘set proband gametes’. Numbers that are the same indicate gametes that are ibd. For instance, in the first row of the table above, the pattern is ‘1 1 1 1’, which means that the values in the first row represent probabilities that all four gametes are ibd at each marker locus and at the trait locus. Likewise, ‘1 2 1 1’ means gametes 1, 3, and 4 are ibd while gamete 2 is not ibd with the others; ‘1 2 3 4’ means all four gametes are not ibd.

The second table in the above output is a result of the window size and ibd pattern statements in the parameter file. Its interpretation is similar to the output of ibddrop when statement ‘set locus window’ was used, See Running ibddrop example and sample output. Recall that in ibddrop, the values in the ‘IBD’ column of the output indicate whether the two gametes specified in the ‘set proband gametes’ statement are ibd (indicated by a ‘1’) or not (indicated by a ‘0’). With lm_auto, the user can specify additional ibd patterns of interest over two or more gametes. In this example, the parameter file ‘jv_rep_auto.par’ includes the statement ‘set window patterns 0 4’, which indicates that we are interested in ibd patterns ‘0’ and ‘4’, corresponding to ‘1 1 1 1’ and ‘1 1 2 2’, respectively, as discussed in the previous section. That is, we would like to know the probability that either all four gametes are ibd or that the first two are ibd and the second two are ibd, but the pairs are not ibd with one another for each window of three loci. Consequently, interpretation of the ‘IBD’ column of the lm_auto output is as follows. The row headed by ‘0 0 0’ gives probabilities that the gametes do not follow either of the two ibd patterns of interest at all three loci for each window. The row headed by ‘0 0 1’ gives probabilities that the gametes do not follow either of the two ibd patterns of interest at the first two loci in the window, but at the third loci either all four gametes are ibd or the first two are ibd and the last two are ibd, but the pairs are not ibd with one another.

In this section of the lm_auto output, the order of the marker and trait loci is the same as in the table of results for each locus; that is, the map order. In this example, the trait locus was between markers 2 and 3. Therefore, the windows are as below:

window

loci

wndw 1

marker 1, marker 2, trait

wndw 2

marker 2, trait, marker 3

wndw 3

trait, marker 3, marker 4

wndw 4

marker 3, marker 4, marker 5

For more information regarding the MCMC parameters and diagnostic output, See MCMC computational options.

See Concept Index for: running lm_auto example, lm_auto sample output, ibd pattern, proband gametes.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

9.4 Sample gl_auto parameter file

The parameter file of the gl_auto example is in the ‘IBD’ subdirectory of ‘MORGAN_Examples’. Like the earlier markerdrop example (see Sample markerdrop parameter file -- conditional on trait), the gl_auto uses the 3-component pedigree with a total of 73 individuals, ‘ped73.ped’, and the corresponding 10-marker data, ‘ped73.marker.missing’. Both these files are in the main ‘MORGAN_Examples’ directory; that is ‘..’ relative to the parameter file.

The gl_auto parameter file is given here in full, as it contains several options not so far encountered in this tutorial, as well as some statements specific to this program.

 
input pedigree file '../ped73.ped'
input seed file '../sampler.seed'
output overwrite seed file '../sampler.seed'
input marker data file '../ped73.marker.missing'
select all markers

set printlevel 5                # Include everything in the output file.

select trait 1
input pedigree record trait 1 integer 7     # dummy (0) trait values
set trait 1  tloc 11                        # Connect trait and tloc
map tloc 11 unlinked                        # Trait locus is unlinked.
set tloc 11 allele freqs 0.5 0.5

# Monte Carlo setup and requests. 
# Use the default option of sampling by component.
# Specify which meiosis sampler is to be used.

use multiple meiosis sampler
set limit for exact computation 12
set MCMC markers only                        # Do MCMC for markers only 
use sequential imputation for setup
use 5 sequ impu realiz for setup
sample by scan                               # Default: for clarity only 
set L-sampler probability 0.5

## For real analyses, recommended number of iterations is of order 10^5
set MC iterations 2000                  # For golds and checks only.
set burn-in iterations 15               # For golds and checks only.

# Specify what type of output is desired (or both).
# Specify the desired scoring interval.

output founder genome labels
output overwrite scores file './ped73_glauto.fgl`
output meiosis indicators
output overwrite extra file './ped73_glauto.meio'
output scores every 30 scored MC iterations

The first block of statements specify data and seed files in a way that should by now be familiar. Note all markers are selected; all will be used in the MCMC and output ibd graphs. The second block defines a trait and tloc combination. However the trait expected by gl_auto is a dummy trait consisting entirely of 0 values for ‘unobserved’.

The Monte Carlo requests are those used by most of the MCMC-based programs: See MCMC parameter statements. Unlike earlier examples, the multiple meiosis sampler is used, and sampling is (by default) by pedigree component as‘global MCMC’ is not requests. A limit of 12 meioses is set for exact computation; in fact even the smallest 11-member pedigree component has 14 meioses, so MCMC will be done on each of the three components. Other MCMC requests are standard and similar to those used in the lm_auto example. Note again than many more MCMC scans (and associated burn-in) would be done in a real example.

The final five parameter statements are specific to the gl_auto program. This program outputs ibd graphs or meiosis indicators (or both) to an output file. The ibd graph output is sent to the output scores file, so if this output is requested the name must be given. Meiosis indicators are sent to the output extra file, so if meiosis indicators are requested this file name must be given. It is strongly recommended that the overwrite option be used for both these outputs to avoid appending to previous runs of the program. Finally the the output scoring frequency is given. Here every MCMC iteration is ‘scored’ (the default), but ibd graphs and meiosis indicators are computed and output only every 30 iterations. Note that in pedigrees with multiple components, the same number of ibd graphs are generated on each component. On small components where exact IID realizations are generated, every realization is output; the number of such realizations is the same as the number output on any component for which MCMC is used.

See Concept Index for: gl_auto sample parameter file, multiple meiosis sampler, founder genome labels, exact computation, ibd graph.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

9.5 Running gl_auto example and sample output

The syntax for running a MORGAN program is:

 
./<program> <parfile> [> <output file name>]

The gl_auto example can be run under the subdirectory ‘IBD

 
./gl_auto ped73_glauto.par > ped73_glauto.out

The output file ‘ped73_glauto.out’, generated by running lm_auto using the parameter file ‘ped73_glauto.par’ is actually of little interest, but should be checked to see that the program has interpreted the data as expected. The output consists of summaries of the input pedigrees, and several MCMC diagnostics. As for the ibddrop lm_auto programs, the exact values of the probability estimates will depend on the value of the random seed; note that this parameter will output final seeds overwriting the previous input seed file.

The important output is contained in the ‘output scores file’, which the parameter file specified to be ‘ped73_glauto.fgl’. Note that if you run the example with different seeds your results will differ from the example output given below. Note also that the parameter file specifies that any previous output scores file of the same name will be overwritten; if you want to save an earlier one, rename it! The output scores file contains 9636 lines, which are 66 (2000/30) realizations of ibd graphs. There are three components, with 47, 11 and 15 individuals, and each individual is output on two lines – a maternal chromosome and a paternal chromosomes. The first 94 (47+47) lines is the first output realization on the first component. The other 65 realizations follow for a total of 94 times 66 (6204) lines. There are then 66 realizations of the second component ((11+11) times 66 = 1452 lines), and finally the realizations on the third component ((15+15) times 66 = 1980 lines). The total line-count of the file is 9636.

For the seeds on this run, the first few lines of the file are shown as

 
101 0 1 0
101 0 2 0
102 0 3 0
102 0 4 0
201 0 4 3 5 3  8 4  9 3
201 0 1 1 2 2

The first column is the name of the individual (2 lines for each individual). The zero second column may be ignored. The third column is the initial (first marker) FGL, and for 101 and 102 there are no FGL-changes as they are founders. Individual 201 is the offspring of 101 and 102. His maternal chromosome consists of segments of FGL 3 and 4: it is initially 4 and there are 3 switches. At marker 5 the switch is to FGL 3, at marker 8 back to 4, and at 9 back to 3 again. Individual 201’s paternal chromosome has only 1 switch, switching from 1 to 2 at marker 2.

Lower down the pedigree there may be more FGL and more switches. For example for the maternal chromosome of 407 we have

 
407 0 4 4 5 3  8 4  9 3  10 6

The initial FGL is 4, and there are 4 switches: to 3, 4, 3, and 6 at markers 5, 8, 9, and 10. respectively.

Next consider the meiosis output in the ‘output extra file’ which was specified to be ‘ped73_glauto.meio’. This file contains a line for every meiosis sampled. Typically this is 2 lines for every non-founder individual, but a single child of a founder provides only one meiosis. In the three components of size 47, 11, and 15, there are 13, 4 and 5 founders respectively, and hence 34, 7, and 10 non-founders, but only 65, 14, and 20 meioses sampled. Hence the 66 output realizations give 4290, 924, 1320 lines for a total of 6534 lines.

The first few lines of the output file are

 
201 2 0 0 1 0
201 2 1 0 0 1 5
202 3 0 0 0 0
202 3 1 0 1 3 4 7 9
301 5 0 0 1 0
301 5 1 0 0 0
302 6 0 0 1 1 8
302 6 1 0 0 1 8

The first two columns are the name and the internal MORGAN 0-origin index of the individual, and the third column specifies whether this is the maternal (0) or paternal (1) meiosis of the individual. The fourth column refers to the unlinked null locus, and is here null (0) as this locus was not sampled by the MCMC. The fifth column gives the meiosis indicator at the first marker, and the sixth the number of changes across the chromosome. The final integers, in number corresponding to the number of switches, give the change points. Unlike the ibd graph file, no additional details of the changes are needed. since changes are from 0 to 1, or 1 to 0,

Some important points are that, first, there are programs to process and use this ibd graph and meiosis output; the user should not be concerned with the details. However, it is important that the user knows how many graphs they have generated, and whether they are done by component or globally. Until processing programs with multi-component capability have been released, it is recommended to separate the graphs by component. In general. if sampling is by component, then output is for each component, for each realization, for each individual/meiosis. If sampling is global, then output is for each realization, for each component, for each individual/meiosis. Sampling by component is normally recommended: "global MCMC" is maintained mainly for backwards compatibility.

Finally, it should be seen that the output format is compact. Here we have only 10 markers, so could have output by marker. However, the files would be no larger on the same pedigrees if we have many thousands of markers. The number of switches depends on the length of chromosome and pedigree depth, not on the marker count. (See [Tho11]).

See Concept Index for: running gl_auto example, gl_auto sample output.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

9.6 Sample lm_pval parameter file

Files for lm_pval may be found in the ‘TraitTests’ subdirectory of ‘MORGAN_Examples’. The parameter file, ‘ped73_pval.par’ is similar to the parameter file used for lm_auto. An abbreviated version of ‘ped73_pval.par’ is given below:

 
input pedigree file '../ped73.ped'

input pedigree record trait 1 integer 3
select trait 1

input seed file '../sampler.seed'

input marker data file '../ped73.marker.missing'
select all markers

set L-sampler probability 0.2
set MC iterations 2000

For lm_pval, markers are selected, but no trait locus is selected. Therefore, no ‘map tloc marker’ statements are included, and no ‘set traits tlocs’ statement is included. The file ‘ped73.marker.missing’ contains the marker map and genotypes, and is accessed by the statement ‘input marker data file’.

Pedigree members affected with the disease must be specified when using lm_pval. The set of affected individuals is determined implicitly by using trait data. The statement ‘select trait 1’ instructs the program to determine the affected individuals by using the trait data for trait 1 in the pedigree file. The statement ‘input pedigree record traits’ is needed to define the correspondence between trait numbers and integers in the pedigree record, so that the program knows where to find the desired trait data. The trait data in this example are (by default) genotypic: the lm_pvals program treats both homozygotes and heterozygotes for the disease allele ‘2’ as affected.

See Concept Index for: lm_pval sample parameter file.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

9.7 Running lm_pval example and sample output

Under the subdirectory ‘TraitTests’, run the lm_pval example by typing:

 
./lm_pval ped73_pval.par > pval.out

A portion of the output giving latent (fuzzy) p-values is below. See ‘pval.out’ for the entire output file.

 
 Combined distribution of fuzzy p-values, by locus:
pval maxim  marker-1  marker-2  marker-3  marker-4  marker-5  marker-6  marker-7
            marker-8  marker-9 marker-10
0.00 0.000     0.000     0.000     0.000     0.000     0.000     0.000     0.000
               0.000     0.000     0.000
0.01 0.002     0.015     0.008     0.000     0.001     0.002     0.000     0.002
               0.000     0.000     0.000
0.02 0.006     0.025     0.019     0.000     0.002     0.002     0.000     0.007
               0.000     0.000     0.000
0.03 0.008     0.035     0.030     0.003     0.006     0.003     0.000     0.017
               0.000     0.000     0.000
0.04 0.012     0.045     0.041     0.007     0.009     0.004     0.000     0.028
               0.000     0.000     0.000
0.05 0.015     0.055     0.051     0.011     0.013     0.006     0.002     0.039
               0.000     0.000     0.000

The output table shows the cumulative distribution of the latent (fuzzy) p-values generated at each marker position, as well as the cumulative distribution of the maximum latent p-value over the markers. These distributions are over the latent inheritance patterns sampled, given the marker data. That is, for each value of ‘pval’ in the left column, the table gives the proportion of sampled inheritance vectors at each marker that yield a p-value less than ‘pval’. In the last row of the example output, when pval = 0.05, 0.6% of the realizations have a p-value less than 0.05 at marker-5; at marker-7 this value is 3.9%. Overall, 1.5% of the realizations have a maximum p-value over the markers that is less than pval = 0.05 (shown in the second column labeled ’maxim’).

Recall again, that exact values in the output will depend on the random seed. In the case of a relatively short run of lm_pval there may be substantial differences in the estimated latent p-value distributions.

For more information regarding the MCMC parameters and diagnostic output: See MCMC computational options.

See Concept Index for: running lm_pval example, lm_pval sample output, fuzzy p-value.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

9.8 Autozyg statements

Many of the lm_auto and other statements following are also used for the location lod scores programs. See Location lod scores statements.

See Concept Index for: Autozyg statements, lm_auto statements, gl_auto statements, lm_pval statements.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

9.8.1 Autozyg computing requests

select [chromosome I] all markers

This statement selects all markers on the chromosome for the computation; if not all markers are to be used, use the next statement.

select [chromosome I] markers J1 J2

This alternate form of the ‘select markers’ statement specifies a subset of the markers.

select [chromosome I] all markers except J1 J2

This new form of statement allows the user to specify markers on the chromosome to be omitted in computation.

select trait K

This statement names the selected trait, and is used in both lm_auto and lm_pval. For the former, this names the trait to be used in subsequent mappings to trait loci. For the latter, it is used to determine disease affection status of individuals.

set traits K1 … tlocs L1

This statement establishes the correspondence between traits and trait loci in lm_auto. Note that this statement is not applicable for lm_pval.

map tlocs L1 … unlinked

Use this statement for lm_auto if the trait locus is unlinked. L1 is the trait locus number.

check marker consistency [only]

Before running a MORGAN program which uses marker data, the setup routines check that the marker data for the selected markers are consistent with Mendel’s first law and stated pedigree information. In the absence of this parameter statement, the program will terminate with the first error found. If this statement is included, the program continues checking the rest of the markers for further inconsistencies and provides details regarding each inconsistency detected. The program terminates only after checking all the selected markers. In the absence of this statement, if no marker data inconsistencies are found, the program continues with any requested analyses. If the ‘check marker consistency only’ statement is used, the program will terminate after checking all the markers; this may be useful in any initial phase of checking the data.

These alternative terminations for ‘check marker consistency’ are given in tabular form:

  No error is present  Some error(s) present
No statement  No termination  Terminates at first error
check markers consist  No termination After checking all markers
check mark consist onlyAfter checking all markers  After checking all markers

See Concept Index for: Autozyg computing requests consistency (Mendelian) of marker data.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

9.8.2 Autozyg file identification statements

All of the general MORGAN file identification statements can be used with the ‘Autozyg’ programs. For a list of these statements, see File identification statements. Some additional file identification can be used by ‘Autozyg’ programs:

output score file ‘filename

This file can be used by lm_auto to save interim cumulative scores of ibd probabilities. The gl_auto program expects expects an ‘output score file’ to be specified, since it uses it for its primary output that is then input to other programs.

input rescue file ‘filename

A rescue file may be used to continue an lm_auto run instead of restarting at the beginning. This file contains intermediate data, which is periodically saved when an output rescue file has been specified in a preceding run. (Note this option is not currently used/tested/)

output rescue file ‘filename

This statement, which is optional for lm_auto, specifies the periodic dumping of intermediate results to files that may be used to restart the program midstream. Data are written alternately to files with ‘1’ and ‘2’ appended to the file name. (Note this option is not currently used/tested/)

See Concept Index for: Autozyg file identification statements, score file, rescue file.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

9.8.3 Autozyg pedigree file description

Both Autozyg programs use the general MORGAN pedigree file description statements; see Pedigree file description statements.

See Concept Index for: Autozyg pedigree file description.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

9.8.4 Autozyg output file description

Three output file description statements are expected by gl_auto and one additional one is optional for lm_auto.

output (founder genome labels | meiosis indicators)

This statement is expected by ‘gl_auto’. It tells the program whether its output is to be in the form of founder genome labels or meiosis indicators (or both). The founder genome labels are output to the ‘output scores file’ and the meiosis indicators are output to the ‘output extra file’ so these filenames must be specified if the output is requested.

output rescue data I iterations

This statement can be used to specify the frequency of dumping program data if an output rescue file is specified. (Note this option is not currently used/tested/)

output scores every I scored MC iterations

Note that this output option is in terms of scored iterations; not every MCMC iteration may be scored; see the compute scores statement. This statement is expected by ‘gl_auto’, since this program uses the d output score file for its primary output of founder genome labels or meiosis indicators (see the ‘output’ statement above. If an output score file is specified, but this statement is not present, ‘lm_auto’ writes the score file at the conclusion of the last iteration only (see ‘set MC iterations’ statement).

See Concept Index for: Autozyg output file description, founder genome labels.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

9.8.5 Autozyg mapping model parameters

See Concept Index for: Autozyg mapping model parameters.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

9.8.6 Autozyg population model parameters

set [chromosome I] marker names N1 N2...

This statement, which is optional for both lm_auto, gl_auto and lm_pval, specifies the names of the markers in the order of their position in the marker data file, for example, ‘set marker names D1S306 D1S249 D1S245 ....’.

See Concept Index for: Autozyg population model parameters.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

9.8.7 Autozyg computational parameters

set [chromosome I] markers K data N1 M11 M12 …[N2 M21 M22 …] …

Individuals with at least one observed marker are named, together with their marker genotypes. The number of allele pairs for an individual is the same as the number of markers mapped on the chromosome. Marker loci not observed for an individual are given alleles ‘0 0’. (Those individuals with no observed markers may but need not be included in this statement.)

In the example, there are 5 markers mapped for the chromosome:

 
set markers 5 data   343   1 3  1 3  1 3  1 3  1 3
                     331   3 4  3 4  3 4  3 4  3 4
                     334   2 3  2 3  2 3  2 3  2 3
                     431   3 4  3 4  3 4  3 4  3 4
                     531   3 4  3 3  0 0  3 3  3 3

In this example five individuals have some observed marker data, but individual ‘531’ is unobserved at marker 3.

set [component M] [scoreset N] proband gametes N1 K1 N2 K2

This statement is required for lm_auto. One or more scoring sets may be given for each pedigree component, where a scoring set consists of two or more haplotypes. If there is more than one set for the component, each set is assigned a number 1 or greater. The maximum number of haplotypes in each set is limited to 10, due to computer memory considerations.

Pairs of names and meiosis indicators are given, with 0 indicating maternal inheritance and 1 indicating paternal inheritance. In the example, there are two sets for the component:

 
set component 1 scoreset 2 proband gametes 531 1 531 0 331 0 331 1
set component 1 scoreset 4 proband gametes 561 1 362 0 364 1

At least one proband gamete set must be specified when running lm_auto.

set [chromosome I] locus window K

This statement is optional for lm_auto and gives the window size (number of loci) for which the multi-locus ibd probabilities are scored. If no size is given, each locus is scored separately.

set [component M] [scoreset N] window patterns L1

This statement is a companion to ‘set locus window’ and is required for lm_auto when the window option is chosen. It identifies the ibd patterns to be jointly scored for the proband gamete set N assigned by the ‘set proband gametes’ statement. A prior run, with the same proband gametes, but without the window option is needed to select the ibd patterns. That is, the user is required to list ibd patterns of interest by label; the labeling of the patterns is not obvious without first running lm_auto. In the example, we were interested in ibd patterns ‘1 1 1 1’ and ‘1 1 2 2’, which are assigned labels ‘0’ and ‘4’, respectively, in the output table headed ‘Probabilities of IBD patterns’. One needs to run lm_auto to obtain these labels.

set trait K data (genotypic | discrete | quantitative)

Trait data are specified as genotypic, discrete (phenotypic), or quantitative (continuous). They may also be specified as ‘discrete with covariate’ and as ‘discrete with liability’ With a genotypic trait, the trait locus genotype can be inferred from the trait value. There are four possible trait values: ‘0’ = missing, ‘1’ = homozygous for allele 1, ‘3’ = heterozygous, and ‘4’ = homozygous for allele 2. There are three possible trait values with a discrete (or phenotypic trait): ‘0’ = missing, ‘1’ = unaffected, and ‘2’ = affected. If a discrete trait is chosen, the next statement, ‘set incomplete penetrances’, must be included. With a quantitative trait, a missing value is denoted as a real number with integer portion ‘999’. For example, ‘999’, ‘999.3’ and ‘999.543’ all mean ‘missing’. The default trait type is genotypic.

set traits K1 … for tlocs L1 … incomplete penetrances X1 X2 X3

This statement is required for discrete trait data. Penetrances (probability of expressing the trait) are provided for the (1 1), the (1 2), and the (2 2) genotypes, respectively.

input pedigree record trait K integer pairs J1 J2

For liability classes or age-or-onset the trait data consists of the qualitative trait status (J1) and an additional covariate integer (J2) which may specify a liability class or age of onset.

set trait data K discrete with liability

This statement allows trait data to be input as a pair of integers, the trait status and the liability class.

set K liability classes with penetrances X11 X12 X13XK1 XK2 XK3

This statement provides the penetrances for each of the three tloc penetrances for each of the K liability classes. There is a hard maximum of 24 liability classes.

set trait K1 data discrete with covariate

The statement allows a covariate such as age-of-onset to be input as a part of the trait data. The program then expected that the input pedigree record trait statement will specify integer pairs.

set standard deviations for genotypes X1 X2 X3

This statement specifies the standard deviation of age-of-onset for each of the three tloc genotypes. The mean age-of-onset is specified via the set traits for tlocs genotype means parameter statement (see genedrop population model parameters).

set width of ages of onset window X1

This number is used in the age-of-onset penetrance function, allowing actual onset to be up to X1 units (usually years) earlier than the recorded age of onset.

select trait K

lm_pval needs to know which members of the pedigree are affected with the disease. Discrete or genotypic data for the selected trait is used to determine the disease affection status of the individuals. Here, lm_pval is to determine the affected individuals from the trait data in the pedigree file. A trait genotypic code of 3 (genotype (1 2) or (2 1)) or 4 (genotype (2 2)) indicates an affected individual. The trait number, K, determines the position of this genotypic code in the pedigree records (see Pedigree file description statements).

See Concept Index for: Autozyg computational parameters, marker data, missing marker data, scoreset, proband gametes, locus window, scoreset, window patterns, trait data, genotypic trait, discrete trait, quantitative trait, incomplete penetrance, affected individuals.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

9.8.8 Autozyg MCMC parameters and options

Please see that section for details regarding:

 
use (locus-by-locus sampling | sequential imputation) for setup 
use I sequential imputation realizations for setup

set MC iterations I
set burn-in iterations I
sample by (scan | step)
set L-sampler probability X

check progress I MC iterations 

One additional statement is specific to the gl_auto program:

set MCMC markers only

This statement will cause the gl_auto program to do MCMC only for the markers, and not for the assumed unlinked no-data ‘trait’ locus. This allows for identical MCMC with the lm_linkage lod score program if the same MCMC options are used, and hence for comparison of results among these programs.

See Concept Index for: Autozyg MCMC parameters and options.


[ << ] [ >> ]           [Top] [Contents] [Index] [ ? ]

This document was generated by Elizabeth Thompson on September 6, 2019 using texi2html 1.82.