[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

14. Estimating Genetic Maps from Marker Data

See Concept Index for: genetic map estimation.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

14.1 Introduction to lm_map

See References, for details of the cited papers.

The program lm_map finds the maximum likelihood estimate (MLE) of the marker map, estimates the (statistical not Monte Carlo) variance of the MLE, and tests hypotheses about the true map. All inference is based on the analysis of multilocus marker data obtained from some (possibly all) members of a set of independent families (pedigree components).

To find the MLE, lm_map uses either Monte Carlo expectation-maximization (MCEM) or a hybrid of MCEM and stochastic approximation (SA). In either case, the user must supply an initial map estimate, and an initial Monte Carlo (MC) sample size for the MCEM algorithm. The MCEM sample size is automatically increased with each successive step of the algorithm, and only a small number of MCEM steps are needed to estimate the MLE. If the hybrid option is chosen, lm_map uses the MCEM estimate to seed the SA algorithm. Then, a relatively large number of SA steps are used to estimate the MLE with greater precision.

Once the MLE is obtained, a long Markov chain is used to estimate the variance of the MLE. Finally, a slight adaptation of the MC likelihood ratio formula is used to estimate the likelihood ratio test (LRT) statistics for testing the simple and/or composite null hypotheses. For more details, see [ST06].

See Concept Index for: lm_map introduction.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

14.2 Sample lm_map parameter file

The two sample parameter files for lm_map can be found in the directory ‘MORGAN_Examples/Map’. The two files are ‘map_G.par’ and ‘map_P.par’, along with the corresponding marker data files ‘map_G.markers’ and ‘map_P.markers’. Thus there are two examples, one for genotypic markers (G) and one for phenotypic markers (P). ‘G’ denotes that marker genotypes are observed without error. ‘P’ denotes the possibility of error, so that the observed marker phenotype is not the same as the underlying true marker genotype. This example uses the pedigree file ‘map.ped’, but different marker data files depending on the choice of ‘P’ or ‘G’.

map_G.par’ and ‘map_P.par’ have the following statements in common:

 
input pedigree file './map.ped'
input marker data file './map_G.markers'  # or './map_P.markers' for 'map_P.par'

select all markers
set marker 1 2 3 allele freqs  .2 .2 .2 .2 .2 
set marker names DS123 DS456 DS789

map gender F marker recomb fract .18 .18  # true F map (cM): 20 20 
map gender M marker recomb fract .08 .08  # true M map (cM): 10 10

limit recomb fracts .001

use sequential imputation for setup
use 100 sequential imputation realizations for setup
set burn-in iterations 100
sample by scan
set L-sampler probability .8
set MC iterations 50 	 # The initial number of MCMC scans per step
limit EM iterations 10   # The total number of MCEM steps 

As seen in previous examples, the ‘select all markers’ statement instructs the program to use all markers on the chromosome for computation. The alternative is to use only selected markers for computation, which can be achieved by using the ‘select markers’ statement (see Autozyg computing requests). The ‘set marker 1 2 3 allele freqs .2 .2 .2 .2’ statement specifies the marker allele frequencies for markers 1, 2, and 3. This statement, as constructed, requires markers 1, 2, and 3 to each have five alleles with frequencies of 0.2 for each allele. If the number of alleles per marker varies from marker to marker, or if the allele frequencies vary from marker to marker, a separate ‘set marker freqs’ statement is needed for each marker (see markerdrop population model parameters). The ‘set marker names’ statement overrides the default behavior, which labels markers consecutively: marker-1, marker-2, etc.

The two ‘map gender marker recomb fract’ statements specify the marker map in terms of recombination fractions. This is the initial starting estimate of the map.

The ‘limit recomb fracts 0.001’ statement is optional and places lower and upper bounds on the estimated recombination fractions of the map. For markers that are separated by little or no recombination, the MCEM algorithm may yield estimated recombination fractions of zero which could lead to a severe bias in the results. As a safeguard against such events, this statement places a lower bound 0.001 and an upper bound 0.5 - 0.001 on the estimated recombination fractions of the map.

The statement ‘use sequential imputation for setup’ instructs lm_map to initialize the set of maternal and paternal meiosis indicators for all members of the pedigree who are not founders; this is done prior to the Monte Carlo simulation. The default behavior is specified in this statement, with the alternative being to ‘use locus-by-locus sampling for setup’. The statement ‘use 100 sequential imputation realizations for setup’ is optional and modifies the default behavior for setup by sequential imputation (which is 10% of the MC iterations). The next three lines in the parameter files contain statements introduced in the Autozyg examples of this tutorial. For explanation of ‘set burn-in iterations’, ‘sample by scan’, and ‘set L-sampler probability’ see Autozyg MCMC parameters and options. The statement ‘set MC iterations 50’ indicates how many MC iterations are to be performed at each EM iteration. The statement ‘limit EM iterations’ was introduced in the multivar example and puts an upper bound on the number of MCEM iterations.

Now we’ll take a look at the remaining statements in ‘map_G.par’:

 
output maps gender averaged specific
set map estimation model with no mistyping
set EM convergence .01

use MCEM and SA for maximization   
set SA curvature iterations 10	
set SA ascent iterations 10	
set SA gradient iterations 10 
set SA convergence .001

The ‘output maps gender averaged specific’ statement specifies the type of map to be estimated by lm_map. In this example, the default behavior is specified, which instructs lm_map to automatically compute the likelihood ratio test statistic for testing the null hypothesis of a sex-averaged map. The statement ‘set map estimation model with no mistyping’ instructs lm_map to assume that the genotypes are observed without error. The ‘set EM convergence’ statement instructs lm_map to stop the MCEM algorithm if all recombination fraction updates are within 0.01 of their previous values.

The statement ‘use MCEM and SA for maximization’ in ‘map_G.par’ instructs lm_map to attempt to refine its MCEM-based estimate of the MLE by performing additional SA steps. The alternative is to ‘use MCEM only for maximization’, with no further refining. There are several statements that allow additional control of the SA algorithm. First, an estimate of the curvature of the likelihood is needed to initiate the SA algorithm. The statement ‘set SA curvature iterations 10’ instructs lm_map to use at least 10 MCMC realizations to estimate the curvature of the likelihood. Also, lm_map will not initiate the SA algorithm with a step that decreases likelihood. So, when the SA algorithm is used for refining the likelihood estimate, the statement ‘set SA ascent iterations 10’ instructs lm_map to use at least 10 MCMC realizations to determine whether a proposed first step increases the likelihood. The SA algorithm also requires an estimate of the gradient of the likelihood at each SA step. The statement ‘set SA gradient iterations 10’ instructs lm_map to use at least 10 MCMC realizations to estimate the gradient of the likelihood. Finally, the map estimate obtained from the final step of the MCEM algorithm is used to seed the SA algorithm. The ‘set SA convergence 0.001’ statement instructs lm_map to terminate the SA algorithm when the absolute change in successive map estimates is less than 0.001 for each recombination fraction in the map.

The file ‘map_P.par’ shows some different Monte Carlo and estimation options. Here are the remaining statements in that file:

 
output maps gender averaged
set map estimation model with mistyping
set genotyping error rate .02
use MCEM only for maximization

In this parameter file, a gender averaged map is specified by using the ‘output maps gender averaged’ statement. Unlike in the previous parameter file, ‘map_P.par’ does not assume the genotypes are recorded without error; this is indicated by the statement ‘set map estimation model with mistyping’. When ‘with mistyping’ is chosen, one has the option of specifying an estimate of the error rate with the statement ‘set genotyping error rate E’. In this example, the error rate is set at 0.02. Finally, the statement ‘use MCEM only for maximization’ instructs lm_map not to use the SA algorithm to further refine the MCEM-based estimate of the MLE. Since the SA algorithm will not be used, none of the ‘SA’ statements are used in ‘map_P.par’.

See Concept Index for: sample parameter file for lm_map.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

14.3 Running lm_map with genotypic data

Run the genotypic example in the ‘Map’ subdirectory of the ‘MORGAN-examples’ directory with the following command

 
./lm_map map_G.par

The lm_map program is one of the more computationally intensive MORGAN programs. Even running this small example takes about 30 seconds to run (depending on the computer used, of course). Again, different random seeds will result in different outputs with each run.

Here is the output from one of the runs: The maximum likelihood estimates (MLEs) of marker map recombination frequencies are given for each marker interval and for male and female meioses. Also given is the estimated variance-covariance matrix of the MLEs. Of course, the MLE will not be identical to the true parameter value, but the variance-covariance matrix gives an estimate of the precision. The ‘effective number of meioses’ is also a measure of this precision, giving the number of fully informative meioses required for the same precision of the MLEs.

 
     MAXIMUM LIKELIHOOD ESTIMATES

  Interval    Female (RF)         Male (RF)
  --------    -----------         ---------
     1           0.2231            0.0264 
     2           0.2745            0.0801 

 ESTIMATED VARIANCE OF SEX-SPECIFIC MAP [F1,M1,F2,M2,... x F1,M1,F2,M2,...]

    0.004402  -0.000034  -0.000422  -0.000040 
   -0.000034   0.000621   0.000075  -0.000025 
   -0.000422   0.000075   0.006546  -0.000136 
   -0.000040  -0.000025  -0.000136   0.001482 

 EFFECTIVE NUMBER OF MEIOSES

 Interval  Female      Male
--------  --------    -------
     1:      40         42 
     2:      31         50

See Concept Index for: running lm_map with genotypic data, lm_map sample output for genotypic data.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

14.4 Running lm_map with phenotypic data

 
./lm_map map_P.par

Running this example takes a noticeable amount of time. Given are the MLEs of the sex-averaged recombination frequency in each of the two marker intervals and of the mistyping (error) rate. Also given is the estimated variance-covariance matrix of these MLEs and the effective number of meioses (see the previous section).

 
       MAXIMUM LIKELIHOOD ESTIMATES

  Interval        Sex-Averaged (RF)
  --------        ----------------
    1                  0.1510 
    2                  0.1787 

       MISTYPING RATE: 1.479401%


 ESTIMATED VARIANCE OF (MAP,MISTYPING RATE) SEX-AVERAGED
------------------------------------------------------
    0.001432  -0.000482   0.000007 
   -0.000482   0.001517  -0.000016 
    0.000007  -0.000016   0.000072 

Following this section, there is a table of the estimated error probability for each individual at each marker. From your output you should see that the program detects errors in individual 32 and individual 49 for marker-3 and in individual 90 for marker-1. Some other instances of data with low (non-error) probability also show non-zero estimated probability of error. The exact values of these probabilities will depend on the random seeds used in the run.

See Concept Index for: running lm_map with phenotypic data, lm_map sample output for phenotypic data.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

14.5 lm_map statements

limit recombination fractions L

This statement is optional and places lower and upper bounds on the estimated recombination fractions of the map. For markers that are separated by little or no recombination, the MCEM algorithm may yield estimated recombination fractions of zero which could lead to a severe bias in the results. As a safeguard against such events, this statement places a lower bound L and an upper bound 0.5 - L on the estimated recombination fractions of the map.

output maps gender [averaged] [specific]

This statement specifies the type of map to be estimated. The default behavior is to select both options, which instructs lm_map to automatically compute the likelihood ratio test statistic for testing the null hypothesis of a sex-averaged map.

use MCEM and SA for maximization

If the statement ‘use MCEM only for maximization’ is replaced by this statement, lm_map will attempt to refine its MCEM based estimate of the MLE by performing additional SA steps.

set EM convergence X

The MCEM algorithm is used to find a suitable starting value for the SA algorithm. The MCEM algorithm terminates when the percent change in successive parameter estimates is less than X. The default value of X is 0.2: smaller values may substantially increase the total CPU time.

set genotyping error rate E

When the statement ‘set map estimation with mistyping’ is used, the genotype observations are assumed to have an associated error rate. This statement allows for the specification of the ‘mistyping’ rate.

set SA curvature iterations I

An estimate of the curvature of the likelihood is needed to initiate the SA algorithm. This statement tells lm_map to use at least I MCMC realizations to estimate the curvature of the likelihood. The curvature is only estimated once.

set SA ascent iterations I

lm_map will not initiate the SA algorithm with a step that decreases the likelihood. This statement tells lm_map to use at least I MCMC realizations to determine whether a proposed first step increases the likelihood.

set SA gradient iterations I

If SA is initiated, this tells lm_map to use at least I MCMC realizations to estimate the gradient of the likelihood. An estimate of the gradient is needed for each SA step.

set SA convergence R

The SA algorithm is terminated, if all recombination fraction updates are within R of their previous values. In addition, the maximum possible runtime for the SA algorithm is proportional to the total runtime of the MCEM algorithm.

set map estimation (with | with no) mistyping

This statement can be used to specify whether or not errors were made during the observation of genotype. If ‘with no’ is selected, the genotypes are assumed to have been observed without error. If ‘with’ is selected, the genotype observations are assumed to have some error associated with them, which can be specified using the ‘set genotyping error rate’ statement.

set LRT statistics iterations I

This statement tells lm_map to use at least I MCMC realizations to estimate the LRT statistics. If only one option is used in ‘output maps gender ...’, then the estimated LRT statistic compares the MLE to the initial map. Otherwise, two LRT statistics are estimated. The first compares the MLE of the sex-averaged map to the initial sex-averaged map, while the second compares the MLE of the sex-specific map to the MLE of sex-averaged map.

compute estimates I times

This statement tells lm_map to conduct its entire analysis I times, and to report the map with the highest likelihood. While this statement offers some protection against convergence to local modes, the default value is 1.

See Concept Index for: lm_map statements.


[ << ] [ >> ]           [Top] [Contents] [Index] [ ? ]

This document was generated by Elizabeth Thompson on September 6, 2019 using texi2html 1.82.