
5. Simulating Marker and Trait Data in Pedigrees

See Concept Index for: simulating marker and trait data.



5.1 Introduction to genedrop.

genedrop simulates pedigree data for analysis by other programs. Given a genetic map, it simulates genotypes at marker loci (linked or unlinked) and the discrete genotypes and polygenic values contributing to quantitative traits. The trait loci may or may not be linked to marker maps. Thus, one or more of three kinds of loci are simulated on a chromosome: markers, traits linked to markers, and traits not linked to markers.

genedrop assigns marker and trait genotypes and polygenic trait values to the founders by using a random number generator. Meiosis indicators are then simulated for non-founders in chronological order, thus determining the founder genome labels inherited. Markers and traits, if present, are then simulated for each individual: First, marker genes are simulated in the order mapped on the chromosome, then linked traits are simulated in map order, and finally, unlinked traits are simulated.

Because founders of a pedigree are assumed to be unrelated, a unique identifier or founder genome label is assigned to each of the two haploid genomes of each founder. The user may choose to identify the ancestral source of each gene at each locus in non-founders by including the founder labels in the output pedigree.
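The labelling scheme described above can be sketched for a single locus as follows. This is a minimal illustration and not genedrop's own code: the function name, the pedigree representation, and the use of Python's `random` module are all assumptions made for the example.

```python
import random

def drop_founder_labels(pedigree, rng):
    """Assign founder genome labels (FGL) at one locus.

    pedigree: list of (name, father, mother) tuples in chronological order;
              founders have None for both parents (hypothetical format).
    Returns a dict: name -> (paternal label, maternal label).
    """
    labels = {}
    next_label = 1
    for name, father, mother in pedigree:
        if father is None and mother is None:
            # Each founder receives two unique labels, one per haploid genome.
            labels[name] = (next_label, next_label + 1)
            next_label += 2
        else:
            # A meiosis indicator (0 or 1) selects which of the parent's
            # two genomes is transmitted to the offspring.
            paternal = labels[father][rng.randint(0, 1)]
            maternal = labels[mother][rng.randint(0, 1)]
            labels[name] = (paternal, maternal)
    return labels

pedigree = [("gf", None, None), ("gm", None, None), ("kid", "gf", "gm")]
labels = drop_founder_labels(pedigree, random.Random(1))
```

Because parents are always processed before their offspring, a single pass through the chronologically ordered pedigree is enough. Linked loci would additionally require correlated meiosis indicators along the chromosome, which this single-locus sketch omits.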

The user may provide random number seeds for both the marker simulation and the trait simulation. This permits multiple simulations, for a pedigree, of identical marker genotypes, but with different quantitative trait values.

The population and segregation model parameters (trait genotype means, additive and residual variances) may be specified by the user and take default values if not specified. Allele frequencies have no default values and must be specified by the user. Several different trait models can be specified as in the following table:

                            Equal Genotypic Means   Zero Additive Variance
        non-genetic model           YES                     YES
        polygenic model             YES                     NO
        major gene model            NO                      YES
        mixed model                 NO                      NO

The trait locus must have two alleles and the trait residual variance must be greater than zero. A very small residual variance can be specified if one desires to simulate a qualitative trait.
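The classification in the table, together with the residual variance constraint, can be expressed as a small helper. This is only a sketch for checking one's own parameter choices; the function name and return strings are hypothetical and are not part of MORGAN.

```python
def classify_trait_model(geno_means, additive_variance, residual_variance):
    """Return the trait model name implied by the parameter values."""
    if len(geno_means) != 3:
        raise ValueError("three genotype means are expected (two-allele locus)")
    if residual_variance <= 0.0:
        raise ValueError("residual variance must be greater than zero")
    equal_means = len(set(geno_means)) == 1
    zero_additive = additive_variance == 0.0
    if equal_means and zero_additive:
        return "non-genetic"
    if equal_means:
        return "polygenic"
    if zero_additive:
        return "major gene"
    return "mixed"

# Genotype means 90, 100, 110 with zero additive variance: a major gene model.
model = classify_trait_model((90.0, 100.0, 110.0), 0.0, 25.0)
```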

Genetic data on all individuals may be included in the simulated pedigree, or some individuals may be specified as ‘missing’. If any individuals are to be missing genetic data, an ‘observed’ indicator column must be included in the pedigree file. See Pedigree file, for details.

See Concept Index for: genedrop introduction, quantitative trait, polygenic model, major gene model, mixed model, non-genetic model, founder genome labels, seeds for data simulation, additive variance, unobserved individuals.



5.2 Sample genedrop parameter file

Files for genedrop may be found in the ‘Simulation’ subdirectory of ‘MORGAN_Examples’. The example here refers to ‘ped73_gdrop.par’.

The seed file is used to store the random seeds used in the simulations. Occasionally one will want to use the same seed with multiple runs, but most often one will want to use new seeds so as to obtain different output with each run. The seed file contains one or more statements like ‘set marker seeds 0xde5e8d39’. For more about the way genedrop handles seeds, see genedrop computational parameters.

The seed file can be specified in the command line or in the parameter file. The following statements are needed to specify the seed file in the parameter file:

 
input seed file '../marker.seed'
output marker seeds only
output overwrite seed file '../marker.seed'

The first line specifies ‘marker.seed’ in the main examples directory as the input seed file for the marker simulation. The second statement, ‘output marker seeds only’, overrides the default behavior of saving both the marker and the trait seeds and causes the program to save only the marker seeds before exiting. The ‘overwrite’ option in line 3 enables the program to replace the current seed file content with the newly generated random numbers, which can be used for simulation in the future. When an overwrite is not requested, MORGAN appends the new output seeds to the existing file at the end of the run. Thus, at the next run, more than one ‘set marker seeds’ statement exists in the seed file. The program uses only the last ‘set marker seeds’ statement in the file.
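The "last statement wins" rule for an appended seed file can be illustrated with a short parser. This is a sketch under assumptions: the helper name is hypothetical, the second seed value in the sample text is invented for illustration, and the real file may contain other statements, which are simply skipped here.

```python
def last_marker_seeds(seed_file_text):
    """Return the seeds from the last 'set marker seeds' statement, or None."""
    seeds = None
    for line in seed_file_text.splitlines():
        words = line.split()
        if words[:3] == ["set", "marker", "seeds"]:
            # Seeds are written in hexadecimal, e.g. 0xde5e8d39.
            seeds = [int(word, 16) for word in words[3:]]
    return seeds

# Two statements, as would result from a run that appended rather than overwrote.
sample = "set marker seeds 0xde5e8d39\nset marker seeds 0x0000beef\n"
seeds = last_marker_seeds(sample)
```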

In the example, we have chosen to access the seed file from the command line, which will overrule the parameter file statement and generate a warning. See the next section for command line implementation.

Note: The statement ‘output pedigree chronological’ is included in the example ‘ped73_gdrop.par’ file so that the output pedigree will be in the chronological order required for use with other MORGAN programs.

The next statements in the parameter file are the simulation requests:

 
simulate chrom 1 markers
simulate traits 1
set traits 1 tlocs 1

The above statements ask genedrop to simulate marker loci on chromosome 1. Additionally, one quantitative trait controlled by one tloc will be simulated. The number of markers and the relative locations of the tloc and marker loci will be determined from the ‘map’ statements below. In MORGAN-3, traits are distinguished from trait loci, and thus the statement ‘set traits 1 tlocs 1’ assigns trait 1 to trait locus 1. In general, one or more traits may be assigned to any given trait locus. If no trait locus is to be simulated, the lines ‘simulate traits 1’ and ‘set traits 1 tlocs 1’ can be removed.

 
map chrom 1 marker dist  10 10 10 10 10 10 10 10 10
map chrom 1 tlocs 1 marker 5 dist 5

The above statements indicate a marker map on chromosome 1, with 10 equally spaced markers, each at a distance of 10 (Haldane) centiMorgans from the preceding one. Note that the number of markers is inferred from the first statement: nine intermarker distances imply ten markers. The trait locus is between markers 5 and 6 on chromosome 1, at a distance of 5 cM from marker 5.
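The arithmetic behind this map can be checked with a few lines of Python. The sketch assumes, purely for illustration, that marker 1 sits at position 0 cM; genedrop itself needs only the relative distances.

```python
from itertools import accumulate

# Intermarker distances from 'map chrom 1 marker dist 10 10 10 10 10 10 10 10 10'.
distances = [10.0] * 9

# Nine distances imply ten markers; marker 1 is placed at 0 cM (an assumption).
positions = [0.0] + list(accumulate(distances))

# 'map chrom 1 tlocs 1 marker 5 dist 5': the tloc lies 5 cM beyond marker 5.
tloc_position = positions[5 - 1] + 5.0
```

With these values the trait locus falls at 45 cM, midway between marker 5 (40 cM) and marker 6 (50 cM).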

A marker map or tloc position can also be specified by recombination fractions. For example,

 
map chrom 1 marker recomb fracs 0.1 0.5 0.2

gives a map of four ordered markers, M1, M2, M3 and M4, with recombination fraction 0.1 between M1 and M2, 0.5 between M2 and M3, and 0.2 between M3 and M4.

Marker allele frequencies are set by the following lines:

 
set chrom 1 markers 1  allele freqs 0.13 0.66 0.16 0.05
set chrom 1 markers 2  allele freqs 0.06 0.23 0.41 0.25 0.05
set chrom 1 markers 3  allele freqs 0.11 0.02 0.01 0.06 0.24 0.56
set chrom 1 markers 4  allele freqs 0.07 0.04 0.89
set chrom 1 markers 5  allele freqs 0.12 0.11 0.03 0.03 0.50 0.21
set chrom 1 markers 6  allele freqs 0.50 0.44 0.06
set chrom 1 markers 7  allele freqs 0.01 0.33 0.62 0.04
set chrom 1 markers 8  allele freqs 0.20 0.05 0.42 0.27 0.06
set chrom 1 markers 9  allele freqs 0.18 0.18 0.25 0.16 0.08 0.15
set chrom 1 markers 10 allele freqs 0.17 0.35 0.04 0.29 0.15

In the case where several markers have the same number of alleles and allele frequencies, one can group those markers together into one line:

 
set chrom 1 markers 11 12 13 15 allele freqs 0.2 0.8

However, we consider it good practice to specify the frequencies separately for each marker.

The following lines describe the trait model. The trait locus can have only two alleles; here the frequencies are 0.5 and 0.5, for alleles 1 and 2, respectively. The mean values of the trait for each trait locus genotype are on the next line. Values correspond to the (1 1), (1 2) and (2 2) genotypes, respectively. The residual variance gives the within-genotype variance of phenotypic values about the mean. The additive variance (0 in this example, and by default if not specified) is the variance of an additive polygenic contribution to trait values.

 
set trait 1 allele freqs 0.5 0.5

set trait 1 for tlocs 1 geno means 90 100 110
set trait 1 residual variance 25.0
set trait 1 additive variance 0.0
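How these parameters combine into a trait value can be sketched for a founder as follows. The sketch assumes that the additive and residual components are drawn from normal distributions with the stated variances, which the text above does not state explicitly; the function name and defaults are hypothetical and simply mirror the example values.

```python
import random

def simulate_founder_trait(rng, freq_allele1=0.5,
                           geno_means=(90.0, 100.0, 110.0),
                           residual_var=25.0, additive_var=0.0):
    """Draw a trait locus genotype and a trait value for one founder."""
    # Two alleles are sampled independently according to the allele frequency.
    alleles = tuple(1 if rng.random() < freq_allele1 else 2 for _ in range(2))
    # Means are ordered (1 1), (1 2), (2 2), i.e. by the count of allele 2.
    mean = geno_means[alleles.count(2)]
    # Assumed normal components; the additive part vanishes when its variance is 0.
    additive = rng.gauss(0.0, additive_var ** 0.5) if additive_var > 0.0 else 0.0
    residual = rng.gauss(0.0, residual_var ** 0.5)
    return alleles, additive, residual, mean + additive + residual

alleles, additive, residual, value = simulate_founder_trait(random.Random(42))
```

Only founders are covered here; in non-founders the polygenic component would depend on the parents' values, which this sketch does not attempt to model.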

The following three lines may be included in the parameter file (we have commented them out in the example so as to keep the output file small and easy to read).

 
output pedigree record founder genome labels
output pedigree record trait latent variables
output pedigree record unobserved variables

These lines request that the founder genome labels and latent variable values for the trait be included in the output file, and that the data be output for all (observed and unobserved) individuals. Founder gene labels indicate, for all non-founders, which founder alleles were passed to the individual. For the trait variables, the latent founder genome labels, the trait locus genotype, and the additive and residual contributions to the trait value are given. Latent trait variables will precede the trait value in the output file.

See Concept Index for: genedrop sample parameter file, seed file, additive variance, residual variance, founder genome labels.



5.3 Running genedrop examples and sample output

Two examples are available under the subdirectory ‘Simulation/’. The only difference is in whether command line options are to replace some parameter statements (see the ‘README’ file in the ‘Simulation’ directory).

The command to run the first example is:

 
./<program> <parfile> [ped <pedfile>] [seed <seedfile>] [oped <opedfile>]
./genedrop ped73_gdrop.par ped ../ped73.ped seed ../marker.seed oped gdrop.oped 

For the parameter file ‘ped73_gdrop_2.par’, the three files (input pedigree, output pedigree, and seed) are given in the file, and so are not needed as command line options. This file may be run simply as

 
./genedrop ped73_gdrop_2

The output is the same as for ‘ped73_gdrop.par’.

When running the genedrop example, we here use an (unchanging) input marker seed file ‘../marker.seed’ but output to the current directory file ‘marker.seed’. However, in practice a file such as ‘marker.seed’ can be specified as both the input and output seed file. If the ‘overwrite’ option is not included in the ‘output seed file’ statement, successive runs will generate warnings (W), but this is not a concern. Recall from the previous section that, by default, MORGAN appends the new output seeds to the existing seed file at the end of each run. In the next run, the last (most recent) seed will be used. To avoid this warning (and an ever-growing seed file), either use the ‘overwrite’ option when outputting the seeds (see the previous section, Sample genedrop parameter file), or manually edit the seed file to remove earlier lines.

Since the function of genedrop is to simulate marker and trait data, it, unlike other MORGAN programs, always creates and outputs a pedigree file. The output file ‘gdrop.oped’ is structured similarly to the input file ‘ped73.ped’, with one individual per record (line). However, the output file contains additional columns and does not include the parameter statements found at the top of the input file. The first four items are the individual’s name, the names of the parents, and gender. If no additional output options are set, the next items are the genotypes of the markers (two items per marker) in the order they are found on the chromosomes, followed by the trait values in the order of the trait labels.

Notice the three statements at the end of the parameter file. In order to save space and make the output more readable, these statements have been commented out so that they are not executed by the program.

If the statement ‘output pedigree record trait latent variables’ were included in the parameter file, the output file would contain four additional columns preceding the trait value. The first two of these columns would be the trait locus genotype, followed by the additive component of the trait value and the residual component of the trait value. In this example, everyone has a ‘0.000’ in the additive component column because we set the additive variance to zero in the parameter file.

If the ‘output pedigree record founder gene labels’ statement is included, the founder genome labels (FGL) for markers precede the marker genotypes and the trait FGL precede the trait values (or the trait latent variables, if these are requested).

If the ‘output pedigree record unobserved variables’ statement is included in ‘gdrop.par’, an observed indicator follows gender in the output pedigree file, and marker and trait data are output for all individuals, not only those indicated as ‘observed’.

See Concept Index for: running genedrop examples, genedrop sample output, seeds for data simulation.



5.4 genedrop statements

See Concept Index for: genedrop statements.



5.4.1 genedrop computing requests

simulate [chromosome I] markers

One statement is given for each chromosome on which markers or both markers and traits are to be simulated. Only unlinked traits are simulated if no such statement is provided. The ‘chromosome’ keyword can be omitted if all markers and linked traits are on the same chromosome. Note that the number of markers is inferred from the number mapped on the chromosome in the parameter file.

simulate traits K1

The linked traits to be simulated are specified here. The linked traits are specified as positive integers.

set traits K1... tlocs L1...

This statement establishes the correspondence between traits and trait loci. Presently in ‘genedrop’ each trait may have only one trait locus, but more than one trait may be assigned to the same locus. The trait loci are specified as positive integers.

map tlocs L1... unlinked

Optional. This statement specifies trait loci which are unlinked to markers, and hence have no map specification.

See Concept Index for: genedrop computing requests.



5.4.2 genedrop mapping model parameters

map [chromosome I] [gender (F | M)] marker ( [Kosambi] distances | recombination fractions | [Kosambi] positions) X1 X2

This statement is required if simulation of more than one marker is requested. One statement is used per chromosome. This statement specifies the marker map or positions given in units of genetic distance (cM), or recombination fractions between markers. Marker map or positions can be sex-specific if gender is included in the statement. If ‘distances’ is chosen, intermarker distances are provided such that the number of distances is one less than the number of markers. If ‘positions’ is chosen, the number of positions is equal to the number of markers, as these are absolute positions relative to a zero point to the left of all of the markers. The Haldane mapping function is used to convert between genetic distances and recombination fractions unless Kosambi is specified.
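For reference, the standard Haldane and Kosambi map functions can be written as below. These are the textbook formulas, not an extract of MORGAN's source; the function names are chosen for this example and distances are taken in centiMorgans.

```python
import math

def haldane_recomb(d_cm):
    """Haldane: r = (1 - exp(-2d)) / 2, with d in Morgans."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def kosambi_recomb(d_cm):
    """Kosambi: r = tanh(2d) / 2, with d in Morgans."""
    return 0.5 * math.tanh(2.0 * d_cm / 100.0)

def haldane_distance(r):
    """Inverse Haldane function, returning cM; r = 0.5 means unlinked."""
    if r >= 0.5:
        return math.inf
    return -50.0 * math.log(1.0 - 2.0 * r)

r_haldane = haldane_recomb(10.0)   # about 0.0906 for 10 cM
r_kosambi = kosambi_recomb(10.0)   # about 0.0987 for 10 cM
```

A recombination fraction of 0.5, as between M2 and M3 in the earlier recombination fraction example, corresponds to an infinite Haldane distance, that is, to unlinked loci.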

map [chromosome I] [gender (F | M)] tlocs K1 K2 … markers J1 J2 …  ( [Kosambi] distances | recombination fractions) X1 X2

This statement is required if simulated trait loci are to be linked to markers; i.e., it is not required if no trait loci or only unlinked trait loci are to be simulated. The statement specifies the location of each trait locus with respect to one of the marker loci. Thus, the number of trait loci listed in the statement must be equal to the number of markers listed and to the number of distances (or recombination fractions) listed. The trait locus will follow the corresponding marker locus (to the right, so to speak) at the distance specified. To simulate a trait locus that precedes all marker loci, list marker ‘0’ in the statement. For example, with ‘map tlocs 3 2 marker 6 0 distances 5 4’, trait loci 3 and 2 will be placed 5 cM to the right of marker 6 and 4 cM to the left of marker 1, respectively.
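The placement rule, including the special marker ‘0’, can be sketched as follows. The function name and the choice of marker 1 at 0 cM are assumptions made for illustration only.

```python
def place_tlocs(marker_positions, tlocs, markers, distances):
    """Return a dict: tloc -> position in cM, given each tloc's anchor marker."""
    if not (len(tlocs) == len(markers) == len(distances)):
        raise ValueError("tlocs, markers and distances must have equal length")
    placed = {}
    for tloc, marker, dist in zip(tlocs, markers, distances):
        if marker == 0:
            # Marker 0 means the tloc precedes all markers: left of marker 1.
            placed[tloc] = marker_positions[0] - dist
        else:
            # Otherwise the tloc follows the named marker (markers are 1-based).
            placed[tloc] = marker_positions[marker - 1] + dist
    return placed

# Ten markers spaced 10 cM apart, marker 1 at 0 cM (an assumption).
marker_positions = [10.0 * i for i in range(10)]

# 'map tlocs 3 2 marker 6 0 distances 5 4'
placed = place_tlocs(marker_positions, [3, 2], [6, 0], [5.0, 4.0])
```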

See Concept Index for: genedrop mapping model parameters, gender–specific maps, Haldane map function, Kosambi map function.



5.4.3 genedrop population model parameters

set [chromosome I] markers K1 … allele frequencies X1 X2

This statement specifies marker allele frequencies. Allele frequencies for a marker should sum to between 0.9999 and 1.0001; otherwise they are normalized. Multiple markers can be specified in a single statement if they reside on the same chromosome and have the same number of alleles with the same allele frequencies.
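The tolerance check and normalization described here amount to the following. This is a sketch with a hypothetical function name; how genedrop reports or rounds the normalized values is not specified in this section.

```python
def check_allele_freqs(freqs, tol=1e-4):
    """Return freqs unchanged if they sum to 1 within tol, else normalized."""
    total = sum(freqs)
    if total <= 0.0:
        raise ValueError("allele frequencies must have a positive sum")
    if abs(total - 1.0) <= tol:
        return list(freqs)
    # Outside the tolerance: rescale so that the frequencies sum to 1.
    return [f / total for f in freqs]

# Marker 1 from the sample parameter file already sums to 1.
kept = check_allele_freqs([0.13, 0.66, 0.16, 0.05])

# A set summing to 0.4 is rescaled.
rescaled = check_allele_freqs([0.2, 0.2])
```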

set tlocs K1 … allele frequencies X1 X2

This statement specifies the trait loci allele frequencies. Allele frequencies for a trait locus should sum to between 0.9999 and 1.0001. Otherwise they are normalized. Multiple trait loci can be specified in a single statement if they have the same allele frequencies. Trait loci must have two alleles.

set normalized allele frequencies

If the set of allele frequencies for each marker and trait is to be normalized, this statement is given. Normalization of the frequencies is recommended when simulating pedigree data, but not recommended when using the other programs.

set traits K1 for ... tlocs L1... genotype means X1 X2 X3

Since two alleles are simulated for each trait locus, three means must be specified for the polygenic trait values: one each for the (1 1), the (1 2) or (2 1), and the (2 2) genotypes. The default values are 0.0, 0.0, and 0.0.

set traits K1 … additive variance X

Here we specify the additive genetic variance for one or more traits. One of these statements is given for each value assigned. The default variance is 0.0.

set traits K1 ... residual variance X

This statement has the same form as the preceding one. It sets the residual variance, that is, the environmental contribution to the trait.

See Concept Index for: genedrop population model parameters, allele frequencies, additive variance, residual variance.



5.4.4 genedrop computational parameters

set marker seeds H1 H2

This statement initializes the seeds for the random number generator in the gene dropping algorithms. The seeds are to be positive and no greater than hexadecimal 0xFFFFFFFF, with the first seed (congruential seed) odd, and the second seed (Tausworthe seed) nonzero. In genedrop, markers are simulated before traits, so that, if no seeds are specified for marker simulation, default seeds (0x3039 0x431) are used.
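The constraints on a seed pair can be checked before a run with a few lines such as these. The helper is hypothetical and only restates the rules above; it does not reproduce MORGAN's random number generator.

```python
def valid_seed_pair(h1, h2):
    """Check the stated constraints on (congruential, Tausworthe) seeds."""
    # Both seeds must be positive and no greater than 0xFFFFFFFF;
    # positivity also guarantees that the second seed is nonzero.
    in_range = 0 < h1 <= 0xFFFFFFFF and 0 < h2 <= 0xFFFFFFFF
    # The first (congruential) seed must additionally be odd.
    return in_range and h1 % 2 == 1

# The default seeds quoted above satisfy the constraints.
defaults_ok = valid_seed_pair(0x3039, 0x431)
```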

set trait seeds H1 H2

This statement initializes the seeds for trait simulation. If no seeds are given, the starting seeds for trait simulation are the seeds returned by the random number generator at completion of marker simulation. Note that if output of marker seed is requested, this will be the same value as is output to the marker seed file for a subsequent genedrop run.

See Concept Index for: genedrop computational parameters, seeds for data simulation, simulating marker and trait data.



5.4.5 genedrop output pedigree options

output pedigree record founder gene labels

When this option is selected, each record contains a pair of founder genome labels for each locus. Each founder is assigned a pair of labels, which are in the same order as the names of the parents. Then, for each locus of each descendant, founder genome labels are determined by the simulated meiosis indicators.

This statement is useful in cases where the founder origins or descent of trait locus alleles are required, for example in assessing the results of subsequent analyses of the simulated data.

output pedigree record trait latent variables

This statement requests that the quantitative trait latent variables be included in the output. The genotype at each trait locus, as well as the additive and residual components of each quantitative trait, will appear in the output record.

output pedigree record unobserved variables

If this option is set, genotypes, gene labels and trait values are output for both observed and unobserved individuals. An additional data field, following the gender indicator, specifies whether the individual is observed (‘1’) or unobserved (‘0’).

When this option is not selected, unobserved individuals take on default values: the genotype at each locus is represented as ‘0 0’, the founder genome label (if requested) at each locus is represented as ‘0 0’, and each quantitative trait value is recorded as ‘999’.
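The default masking of unobserved individuals can be pictured as follows. The record layout and function name are hypothetical, and the exact numeric formatting of the missing codes in the output file may differ from the plain values used here.

```python
def mask_unobserved(observed, genotypes, founder_labels, trait_values):
    """Replace the data of an unobserved individual by the missing codes."""
    if observed:
        return genotypes, founder_labels, trait_values
    # Missing codes as described above: '0 0' per locus, '999' per trait.
    return ([(0, 0)] * len(genotypes),
            [(0, 0)] * len(founder_labels),
            [999] * len(trait_values))

# One unobserved individual with two markers and one trait (made-up values).
masked = mask_unobserved(False, [(1, 2), (3, 3)], [(1, 4), (2, 4)], [101.7])
```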

input pedigree record observed (absent | present)

The observed indicator is used to designate which members are observed, with ‘0’ indicating unobserved and ‘1’ indicating observed. When the observed indicator is present in the pedigree file, it follows gender (or parents, if gender is not present). If this statement is not given, all pedigree members are assumed to be observed. See also the next statement ‘assume all observed’.

If individuals are flagged in the pedigree file as unobserved, the default behavior is to indicate in the output pedigree file that the data for these individuals is missing.

assume all observed

When this statement is used, all members of the pedigree are treated as “observed” in the simulation. If an observed indicator column is present in the input file, it is ignored by the simulation.

See Concept Index for: genedrop output pedigree options, founder genome labels, meiosis indicators, inheritance indicators, unobserved individuals.



5.4.6 genedrop output seed file options

output (marker | trait) seeds only

If an output seed file is given, both ending marker and trait seeds are saved unless one or the other is requested in this statement.

See Concept Index for: genedrop output seed file, seeds for data simulation.



This document was generated by Elizabeth Thompson on September 6, 2019 using texi2html 1.82.