Program calculate relatedness

Interim Report. Occasional Paper No. The report describes the characteristics and usage of a computer program, the Relatedness Coefficient Matrix Program (RCMAT), designed to summarize associative responses given to verbal stimuli by individual respondents and by groups of respondents.

It can be derived similarly that, when allele frequencies are calculated from the sample by omitting the two focal genotypes, , and.

Equation 18 shows that the estimator underestimates relatedness; the extent of the underestimation depends only on N and is unaffected by the actual population allele frequencies p and q. For a given sample size N, more bias is induced by excluding than by including the two focal genotypes in allele frequency estimation (Figure 1).

It is much more complicated than. In the case of a diallelic locus, the estimator is. It can be shown that and and are identical for a diallelic locus. However, and are different for a locus with more than two alleles (Wang). Estimators (19), (20) and (21), or their multiallelic forms (Wang), are calculated for a single locus. Following previous work (Ritland; Lynch and Ritland), Wang derived the variances of these estimators by assuming zero relatedness.

Weighting single-locus estimates by the inverses of their variances yields multilocus estimators (Wang). For a single diallelic locus, and have identical properties as shown above for the latter. For multiple loci, they are slightly different because different weighting schemes were applied to loci with different allele frequencies (Wang). It can be shown that, for a sample of N unrelated individuals, when population allele frequencies are known and used in the estimation.
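The inverse-variance weighting described above can be sketched as follows. This is a minimal illustration of the general technique, not the exact multilocus estimators of the cited work: the function name `combine_loci` and the numeric inputs are hypothetical, and the single-locus variances are assumed to have been computed beforehand (in the text they are derived under zero relatedness).

```python
from typing import Sequence


def combine_loci(estimates: Sequence[float], variances: Sequence[float]) -> float:
    """Inverse-variance weighted mean of single-locus relatedness estimates."""
    if len(estimates) != len(variances) or len(estimates) == 0:
        raise ValueError("need one positive variance per single-locus estimate")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    # A locus with a smaller sampling variance receives a larger weight.
    weights = [1.0 / v for v in variances]
    weighted_sum = sum(w * r for w, r in zip(weights, estimates))
    return weighted_sum / sum(weights)


# Hypothetical single-locus estimates and variances for three loci.
r_multilocus = combine_loci([0.40, 0.10, 0.25], [0.04, 0.16, 0.08])
```

When all loci have the same variance, the weights are equal and the result reduces to the plain average across loci.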

However, when allele frequencies are estimated from the sample with the focal individuals either included or excluded, is positively and is negatively biased in general (Supplementary Figure S1). Much of the opposite biases cancel each other, so that is much less biased (Figure 1). Loiselle et al. The estimator can also be used for two individuals, as shown by Heuertz et al. An important characteristic of the estimator, denoted as LS hereafter, is that it uses a correction for small sample sizes.

For two individuals X and Y in a sample of N individuals genotyped at L loci, the relatedness estimator is. The first term of the estimator gives the relatedness when allele frequencies are either known (that is, not estimated from the sample) or estimated from a large sample (that is, N large).

The second term of the estimator corrects for the bias caused by estimating p li from a small sample of N individuals. It can be shown that the average in a sample of N individuals is always zero when known population allele frequency p is used in the calculation.

The average relatedness among the N individuals is expected to be. It can be shown , despite the correction for sample size N. The value of depends on both sample size N and population allele frequency p. As can be seen, is most often negatively biased. The extent of underestimation depends on values of N and p.

The bias of the LS estimator is usually smaller than that of the other estimators for the same values of N and p, thanks to the correction for sample size. Note that and can be zero when the focal pair of individuals is excluded, leaving the estimator undefined. In such cases, the estimator is set to zero.

Like 23 , in 24 does not reduce to zero but varies with both sample size N and allele frequency p. Estimators and can be modified to become unbiased when population allele frequencies are estimated from the same small sample of individuals whose relatedness is being estimated.

The sum of estimated allele frequencies to the m-th power, t_m[N], can be calculated from the sample N as. Equation 25 corresponds to Equation 16 for the case of known population allele frequencies. It reduces asymptotically to Equation 16 with an increasing sample size N, as expected.

It is derived by considering sampling without replacement. Let us consider the estimation of as an example. The probability that the first gene drawn at random from the sample is of allele type i is. Given the first allele i , the probability that the second gene drawn at random from the remaining sample is also of allele type i is.

Therefore, the probability of sampling two alleles of type i from the sample without replacement is. Similarly, is derived by considering the probability of sampling three genes of the same allele type without replacement from sample N. Using calculated by Equation 25 instead of t_m calculated by Equation 16 leads to an unbiased LL estimator. For a diallelic locus, the expected value of S_0 for a sample of N individuals is. The average observed similarity, S_XY, among the N sampled individuals is obtained in deriving (17), which is Inserting and into estimator (14) leads to , irrespective of sample size N, and population allele frequencies p and q.
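The sampling-without-replacement argument can be illustrated with a short sketch. It assumes a sample of 2N gene copies at one locus (N diploid individuals) and estimates the sum over alleles of the m-th power of allele frequency as the probability that m copies drawn without replacement are all of the same type. The function names and the toy sample are hypothetical, and the code is an illustration of the reasoning rather than a transcription of Equation 25.

```python
from collections import Counter
from math import prod


def sum_freq_powers(alleles, m):
    """Probability that m gene copies drawn without replacement share a type.

    `alleles` lists every gene copy in the sample (2N entries for N diploids).
    """
    total = len(alleles)
    if total < m:
        raise ValueError("sample has fewer gene copies than m")
    counts = Counter(alleles)
    # Ordered ways to draw m copies of allele i: c * (c - 1) * ... * (c - m + 1).
    same_type = sum(prod(c - k for k in range(m)) for c in counts.values())
    # Ordered ways to draw any m copies from the sample.
    any_type = prod(total - k for k in range(m))
    return same_type / any_type


def naive_sum_freq_powers(alleles, m):
    """Plug-in version: raise each estimated allele frequency to the m-th power."""
    total = len(alleles)
    return sum((c / total) ** m for c in Counter(alleles).values())


# Toy sample: 3 diploid individuals, 6 gene copies, alleles A, B and C.
sample = ["A", "A", "A", "B", "B", "C"]
corrected_2 = sum_freq_powers(sample, 2)      # 8 / 30
naive_2 = naive_sum_freq_powers(sample, 2)    # 14 / 36
corrected_3 = sum_freq_powers(sample, 3)      # 6 / 120
```

In this toy sample the plug-in value for m = 2 is larger than the without-replacement value, and the gap shrinks as the number of gene copies grows, in line with the asymptotic behaviour noted in the text.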

Similarly, using by Equation 25 instead of t_2 calculated by Equation 16 leads to unbiased. However, and are still biased in opposite directions. Their effects on cancel out exactly such that is always unbiased.

Simulations were conducted to check the above analytical results, and to investigate other cases such as multiallelic loci, multiple loci and a mixed sample containing both unrelated and closely related individuals.

A sample of N individuals was drawn from a large outbred population at Hardy-Weinberg equilibrium and linkage equilibrium. Two types of samples were considered. For an unrelated sample, all pairs of sampled individuals were unrelated, as assumed in the analytical study above.

For a mixed sample, one pair of individuals was related as full sibs (FS), half sibs (HS) or parent-offspring (PO), and the rest of the pairs were unrelated (UR). Each sampled individual was genotyped at L loci, and each locus had a fixed number n of codominant alleles with a uniform, equal or triangular frequency distribution in the population.
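A simulation of this kind might be set up roughly as below. The sketch covers only two of the three frequency distributions and only the parent-offspring case, and several details are assumptions rather than the study's procedure: the triangular distribution is taken to mean that allele i has frequency proportional to i, loci are simulated independently to mimic linkage equilibrium, and all function names are hypothetical.

```python
import random


def allele_frequencies(n_alleles, kind="equal"):
    """Population allele frequencies for one locus."""
    if kind == "equal":
        return [1.0 / n_alleles] * n_alleles
    if kind == "triangular":
        # Assumption: the frequency of allele i is proportional to i (i = 1..n).
        total = n_alleles * (n_alleles + 1) / 2
        return [i / total for i in range(1, n_alleles + 1)]
    raise ValueError(f"unsupported distribution: {kind}")


def random_genotype(freqs, rng):
    """Draw two alleles independently, as under Hardy-Weinberg equilibrium."""
    return tuple(rng.choices(range(len(freqs)), weights=freqs, k=2))


def simulate_locus(n_individuals, freqs, rng, po_pair=False):
    """Genotypes of one locus; optionally individuals 0 and 1 form a PO pair."""
    genotypes = [random_genotype(freqs, rng) for _ in range(n_individuals)]
    if po_pair:
        if n_individuals < 2:
            raise ValueError("a parent-offspring pair needs two individuals")
        # Individual 1 inherits one allele from individual 0 and one from
        # an unsampled mate drawn from the same population.
        mate = random_genotype(freqs, rng)
        genotypes[1] = (rng.choice(genotypes[0]), rng.choice(mate))
    return genotypes


rng = random.Random(1)
freqs = allele_frequencies(4, "triangular")
# Mixed sample: 5 individuals, 10 independent loci, one PO pair.
sample = [simulate_locus(5, freqs, rng, po_pair=True) for _ in range(10)]
```

Full-sib and half-sib pairs could be added the same way by drawing two offspring from shared parents.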

All of the sampled genotypes were used in calculating allele frequencies (that is, the focal pair of individuals was not omitted) and relatedness estimators. The quality of a relatedness estimator was measured by its bias and accuracy (RMSE, root mean squared error). The simulated true value of r is 0. The simulations (Figure 2 and Supplementary Figure S2) confirm the analytical results above that all estimators give biased r estimates for different types of relationships (FS, HS, PO, UR) when the same genotype data of a small sample of individuals are used to calculate both allele frequencies and relatedness.
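Bias and RMSE can be computed from replicate estimates as in the sketch below. It uses the standard definitions (mean estimate minus true value, and the square root of the mean squared deviation from the true value), which are assumed to match those of the study; the function name and the replicate values are hypothetical.

```python
from math import sqrt


def bias_and_rmse(estimates, true_r):
    """Return (bias, RMSE) of replicate estimates around the true relatedness."""
    if not estimates:
        raise ValueError("need at least one estimate")
    n = len(estimates)
    bias = sum(estimates) / n - true_r
    rmse = sqrt(sum((e - true_r) ** 2 for e in estimates) / n)
    return bias, rmse


# Hypothetical replicate estimates for a pair whose true relatedness is 0.5.
bias, rmse = bias_and_rmse([0.30, 0.45, 0.40, 0.25], true_r=0.5)
```

Because the squared RMSE is the squared bias plus the sampling variance, an estimator with many loci and a small sample can have an RMSE that is almost entirely bias.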

The underestimation increases rapidly with a decreasing sample size. Results for PO dyads are similar to those for FS dyads. As shown in Figure 2 and Supplementary Figure S2, the extent of bias varies with the true relatedness. The relatedness of closely related individuals (for example, FS and PO) tends to be much more underestimated than that of unrelated individuals (UR). As a result, no single correction in terms of N exists that can make an estimator unbiased for all types of relationships.

The correction of LS results in underestimated relatedness for closely related individuals (for example, FS) and overestimated relatedness for unrelated individuals (UR) in a mixed sample. All estimators become less biased with an increase in sample size N. However, the rate of decline in bias with N is slow. The bias patterns of different estimators for multiallelic loci (Supplementary Figure S2) are generally similar to those for diallelic loci (Figure 2).

To investigate the genetic structure of Atlantic salmon in the entire North American range of the species, Moore et al. Individuals sampled from within a population were not studied for relatedness. To demonstrate the bias of the original estimators and the sample size-independent properties of the modified estimators, a sample of 25 individuals taken from a single population was analysed.

First, the 25 individuals were used to calculate allele frequencies at each locus, and these estimated frequencies were used to obtain pairwise relatedness estimates. Second, the 25 individuals were partitioned into 5 non-overlapping subsamples, each containing 5 individuals.

Each subsample was then analysed for allele frequencies that were then used in calculating relatedness. If an estimator is robust to small sample size, then relatedness estimates for a given dyad obtained from the original sample (25 individuals) and from the subsamples (each having 5 individuals) should be similar.
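The subsampling design can be sketched as follows. How individuals were assigned to subsamples is not stated, so the sketch assumes consecutive blocks of five; the integer labels and the function name are hypothetical.

```python
from itertools import combinations


def partition(individuals, size):
    """Split a list into consecutive non-overlapping subsamples of equal size."""
    if size <= 0 or len(individuals) % size != 0:
        raise ValueError("sample size must be a multiple of the subsample size")
    return [individuals[i:i + size] for i in range(0, len(individuals), size)]


individuals = list(range(25))            # hypothetical labels 0..24
subsamples = partition(individuals, 5)   # 5 subsamples of 5 individuals

# Dyads whose two members fall in the same subsample; only these have an
# estimate from both the original sample and a subsample.
within_dyads = [pair for sub in subsamples for pair in combinations(sub, 2)]
```

Each subsample of 5 contributes 10 dyads, so the five subsamples give 50 dyads that can be compared between the two analyses.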

Figure 3 (see also Supplementary Figure S3) plots these estimates for different estimators. The modified estimators, WM and LLM, and the estimator with bias correction, LS, give very similar, although not identical, estimates calculated from the subsamples and the original sample. Most of the points (each showing the relatedness estimates of a dyad calculated from the original sample and a subsample) are centred on the diagonal line (Figure 3), and there is no obvious trend that estimates from the subsamples are uniformly smaller or larger than those from the original sample.

In contrast, all pairwise estimates obtained from subsamples are much smaller than estimates from the original sample for each of the five unmodified estimators without bias correction. Although LS gives consistent estimates that are little affected by sample size, it could underestimate the relatedness of close relatives, as shown in simulations (Figure 2 and Supplementary Figure S2).

All estimates from LS tend to shrink toward 0, with the highest and lowest related dyads whose MW estimates are 0. An original sample of 25 individuals was taken from a single population, with each individual genotyped at SNP loci.

Five non-overlapping subsamples, each having five individuals, were obtained from the original sample. Each point plots the relatedness estimates for each of 50 dyads obtained from an estimator using the original sample (x axis) and a subsample (y axis).

The thin diagonal line shows the ideal case when relatedness estimates made from the original sample and subsamples are equal across the 50 dyads. The slope and intercept of the six estimators are 0.

Estimating pairwise relatedness from genetic marker data is now a routine analysis in molecular ecology, evolutionary biology and conservation studies. The estimators developed for this purpose invariably assume that population allele frequencies of markers are known without error, and the behaviours of these estimators were usually investigated under this assumption (see, for example, Lynch and Ritland; Wang; Milligan). Unfortunately, however, population allele frequencies are rarely known in reality.

Frequently, the only data one has in a relatedness analysis are a sample of multilocus genotypes. In such a case, we have to calculate both allele frequencies and relatedness from the same sample. Furthermore, because of various constraints, sample sizes of individuals (or numbers of genotypes at a locus, to be precise) can be quite small.

Current relatedness estimators were developed in the pre-genomic era mainly for application to microsatellite data. In the genomic era, however, the N » L situation is reversed; a typical large-yet-sparse data set given by next-generation sequencing could have millions of SNP loci, with each having a small number of genotypes because of a small number of sampled individuals and a high rate of missing data.

This study showed, for the first time, that the popular relatedness estimators can become highly biased, and their accuracy is dominated by bias rather than sampling error, when they are applied to such SNP data sets (that is, L » N).

The direction (that is, over- or under-estimation) and extent of bias depend on sample sizes, the underlying unknown population allele frequencies, the estimators and the true relatedness. For example, the relatedness of first-degree relatives (PO, FS) is expected to be 0.5. As a possible consequence, first-degree relatives may be mistaken for second-degree relatives if one is unaware of the bias. This study also showed that omitting the focal individuals in calculating allele frequencies, as suggested in the literature (see, for example, Queller and Goodnight; Lynch and Ritland), cannot remove the bias of popular relatedness estimators.

On the contrary, this ad hoc treatment in estimating allele frequencies not only causes a high frequency of undefined estimators but also induces more biased estimates (Figure 1). This is perhaps not too surprising. At a small sample size, allele frequencies are estimated without bias by the allele counting method, although estimates of higher-order terms of the frequencies can be biased (Nei and Chesser; Weir). When a focal pair of individuals is omitted, however, both allele frequencies and their higher-order terms are biased, leading to worse estimates of relatedness than those obtained when all sampled individuals are used in calculating allele frequencies.

Among the estimators investigated in this study, the one described in Loiselle et al. Both analytical and simulation results show that, compared with other estimators, LS has substantially reduced biases for all types of relationships and for different sample sizes. As a result, it is more accurate than most of the unmodified estimators (Figure 2 and Supplementary Figure S2). However, the correction is insufficient to make the estimator unbiased (Figure 1).

In a mixed sample containing both related and unrelated individuals, LS tends to underestimate and overestimate relatedness for related and unrelated individuals, respectively. This is not surprising because the extent of bias of a relatedness estimator varies with true relatedness, and it is impossible to apply a single correction for small sample size, such as , to obtain unbiased relatedness estimates for all possible relationships.

In contrast, the modified estimators, WM and LLM, are almost unbiased for all relationships, sample sizes and allele frequency distributions. Estimating both allele frequencies and relatedness from the same sample has three problems (see Introduction; Wang). This study has addressed the third problem, underestimation of relatedness due to small sample sizes.

The first problem (that is, negative relatedness estimates, with the mean of relatedness estimates across dyads in a sample being close to zero) is no longer pertinent when relatedness is defined, understood and used in terms of a correlation coefficient rather than a probability of IBD (Wright; Wang). The second problem comes from the genetic structure of a sample, no matter whether it is small or large.

When a sample containing both related and unrelated individuals is used in calculating allele frequencies by naively assuming unrelated individuals, relatedness will be underestimated because of the biased allele frequency estimates. Indeed, my simulation in Figure 2 shows that the modified estimators, WM and LLM, underestimate r for all relationships (FS, PO, UR, …) when sample size is extremely small such that sample genetic structures become substantial.

However, the bias is rather small. For example, the mean LLM estimates are 0. The mean LLM estimates are 0. Compared with the huge bias caused by small sample size as shown in this study, the bias caused by the genetic structure of a sample is negligible. This study modified the LL and W estimators and showed, using analytical (Figure 1), simulated (Figure 2 and Supplementary Figure S2) and empirical data (Figure 3 and Supplementary Figure S3), that relatedness can be reliably estimated by the modified estimators with little bias even when sample size is extremely small (say, 3 individuals).

Because of the great reduction in bias and some decrease in sampling variance (Supplementary Figure S2), the modified estimators are always much more accurate (that is, have smaller RMSE) than the original estimators, except when few loci are used (such that RMSE is dominated by sampling variance rather than bias) and true relatedness is low (for example, UR).

The smaller the sample size is, the greater the improvement in accuracy the modified estimators make. When sample sizes are large or when population allele frequencies are known, the relative performances of different estimators depend on the true relationship being estimated.

When sample sizes are small and many loci are used, however, the modified estimators, WM and LLM, always perform better than all of the original estimators, regardless of the actual relatedness being estimated.

This study assumes an outbred population in which close inbreeding, due to close relative mating such as sib mating and selfing, is absent or rare.

It is usually the case that the generation distance is the same for all common ancestors of a pair of individuals. Therefore, having worked out the relatedness between A and B due to any one of the ancestors, all you have to do in practice is to multiply by the number of ancestors.

First cousins, for instance, have two common ancestors, and the generation distance via each one is 4.
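The rule above can be checked with a few lines of code. The sketch assumes that each common ancestor contributes one half raised to the power of the generation distance through that ancestor, and that contributions are summed across ancestors (which equals multiplying by the number of ancestors when all distances are the same). The function name is hypothetical.

```python
from fractions import Fraction


def relatedness(generation_distances):
    """Sum (1/2)**g over common ancestors, g being the generation distance."""
    return sum((Fraction(1, 2) ** g for g in generation_distances), Fraction(0))


# First cousins: two common ancestors (the shared grandparents), distance 4 each.
first_cousins = relatedness([4, 4])
# Full sibs: two common ancestors (the parents), distance 2 each.
full_sibs = relatedness([2, 2])
```

For first cousins this gives 2 × (1/2)^4 = 1/8, and for full sibs 2 × (1/2)^2 = 1/2.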
