ALBERT — All Library Books, journals and Electronic Records Telegrafenberg

1

Unknown

Fast spatial ancestry via flexible allele frequency surfaces (2014)

Ranola, J. M., Novembre, J., Lange, K.

Oxford University Press

In: Bioinformatics

add to mindlist on the mindlist

Details

Publication Date: 2014-10-04

Description: Motivation: Unique modeling and computational challenges arise in locating the geographic origin of individuals based on their genetic backgrounds. Single-nucleotide polymorphisms (SNPs) vary widely in informativeness, allele frequencies change non-linearly with geography and reliable localization requires evidence to be integrated across a multitude of SNPs. These problems become even more acute for individuals of mixed ancestry. It is hardly surprising that matching genetic models to computational constraints has limited the development of methods for estimating geographic origins. We attack these related problems by borrowing ideas from image processing and optimization theory. Our proposed model divides the region of interest into pixels and operates SNP by SNP. We estimate allele frequencies across the landscape by maximizing a product of binomial likelihoods penalized by nearest neighbor interactions. Penalization smooths allele frequency estimates and promotes estimation at pixels with no data. Maximization is accomplished by a minorize–maximize (MM) algorithm. Once allele frequency surfaces are available, one can apply Bayes’ rule to compute the posterior probability that each pixel is the pixel of origin of a given person. Placement of admixed individuals on the landscape is more complicated and requires estimation of the fractional contribution of each pixel to a person’s genome. This estimation problem also succumbs to a penalized MM algorithm. Results: We applied the model to the Population Reference Sample (POPRES) data. The model gives better localization for both unmixed and admixed individuals than existing methods despite using just a small fraction of the available SNPs. Computing times are comparable with the best competing software. Availability and implementation: Software will be freely available as the OriGen package in R. Contact: ranolaj@uw.edu or klange@ucla.edu Supplementary information: Supplementary data are available at Bioinformatics online.

Print ISSN: 1367-4803

Electronic ISSN: 1460-2059

Topics: Biology , Computer Science , Medicine

Published by Oxford University Press

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

PAPER CURRENT

S·F·X

Fulltext

2

Unknown

forqs: forward-in-time simulation of recombination, quantitative traits and selection (2014)

Kessner, D., Novembre, J.

Oxford University Press

In: Bioinformatics

add to mindlist on the mindlist

Details

Publication Date: 2014-02-20

Description: : forqs is a forward-in-time simulation of recombination, quantitative traits and selection. It was designed to investigate haplotype patterns resulting from scenarios where substantial evolutionary change has taken place in a small number of generations due to recombination and/or selection on polygenic quantitative traits. Availability and implementation : forqs is implemented as a command-line C++ program. Source code and binary executables for Linux, OSX and Windows are freely available under a permissive BSD license: https://bitbucket.org/dkessner/forqs . Contact: jnovembre@uchicago.edu Supplementary information: Supplementary data are available at Bioinformatics online.

Print ISSN: 1367-4803

Electronic ISSN: 1460-2059

Topics: Biology , Computer Science , Medicine

Published by Oxford University Press

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

PAPER CURRENT

S·F·X

Fulltext

3

Unknown

Maximum Likelihood Estimation of Frequencies of Known Haplotypes from Pooled Sequence Data (2013)

Kessner, D., Turner, T. L., Novembre, J.

Oxford University Press

In: Molecular Biology and Evolution

add to mindlist on the mindlist

Details

Publication Date: 2013-04-16

Description: DNA samples are often pooled, either by experimental design or because the sample itself is a mixture. For example, when population allele frequencies are of primary interest, individual samples may be pooled together to lower the cost of sequencing. Alternatively, the sample itself may be a mixture of multiple species or strains (e.g., bacterial species comprising a microbiome or pathogen strains in a blood sample). We present an expectation–maximization algorithm for estimating haplotype frequencies in a pooled sample directly from mapped sequence reads, in the case where the possible haplotypes are known. This method is relevant to the analysis of pooled sequencing data from selection experiments, as well as the calculation of proportions of different species within a metagenomics sample. Our method outperforms existing methods based on single-site allele frequencies, as well as simple approaches using sequence read data. We have implemented the method in a freely available open-source software tool.

Print ISSN: 0737-4038

Electronic ISSN: 1537-1719

Topics: Biology

Published by Oxford University Press

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

PAPER CURRENT

S·F·X

Fulltext

4

Unknown

Characterizing Bias in Population Genetic Inferences from Low-Coverage Sequencing Data (2014)

Han, E., Sinsheimer, J. S., Novembre, J.

Oxford University Press

In: Molecular Biology and Evolution

add to mindlist on the mindlist

Details

Publication Date: 2014-02-27

Description: The site frequency spectrum (SFS) is of primary interest in population genetic studies, because the SFS compresses variation data into a simple summary from which many population genetic inferences can proceed. However, inferring the SFS from sequencing data is challenging because genotype calls from sequencing data are often inaccurate due to high error rates and if not accounted for, this genotype uncertainty can lead to serious bias in downstream analysis based on the inferred SFS. Here, we compare two approaches to estimate the SFS from sequencing data: one approach infers individual genotypes from aligned sequencing reads and then estimates the SFS based on the inferred genotypes (call-based approach) and the other approach directly estimates the SFS from aligned sequencing reads by maximum likelihood (direct estimation approach). We find that the SFS estimated by the direct estimation approach is unbiased even at low coverage, whereas the SFS by the call-based approach becomes biased as coverage decreases. The direction of the bias in the call-based approach depends on the pipeline to infer genotypes. Estimating genotypes by pooling individuals in a sample (multisample calling) results in underestimation of the number of rare variants, whereas estimating genotypes in each individual and merging them later (single-sample calling) leads to overestimation of rare variants. We characterize the impact of these biases on downstream analyses, such as demographic parameter estimation and genome-wide selection scans. Our work highlights that depending on the pipeline used to infer the SFS, one can reach different conclusions in population genetic inference with the same data set. Thus, careful attention to the analysis pipeline and SFS estimation procedures is vital for population genetic inferences.

Print ISSN: 0737-4038

Electronic ISSN: 1537-1719

Topics: Biology

Published by Oxford University Press

Permalink