ALBERT

All Library Books, journals and Electronic Records Telegrafenberg

Filter
Collection
  • Articles  (1,700)
Publisher
  • BioMed Central  (1,700)
  • American Meteorological Society
  • American Physical Society (APS)
  • Elsevier
  • Institute of Physics
  • MDPI Publishing
  • Reed Business Information
Publication period
  • 2010-2014  (1,700)
  • 1985-1989
  • 1955-1959
  • 1935-1939
Year
  • 2012  (839)
  • 2010  (861)
Journal
  • BMC Bioinformatics  (230)
Subject
  • 1
    Publication date: 2012-12-29
    Description: Background: RNA interference (RNAi) is becoming an increasingly important and effective genetic tool for studying the function of target genes by suppressing specific genes of interest. This systems approach helps identify signaling pathways and cellular phase types by tracking intensity and/or morphological changes of cells. The traditional RNAi screening scheme, in which one siRNA is designed to knock down one specific mRNA target, needs a large library of siRNAs and turns out to be time-consuming and expensive. Results: In this paper, we propose a conceptual model, called compressed sensing RNAi (csRNAi), which employs unique combinations of small interfering RNAs (siRNAs) to knock down a much larger set of genes. This strategy is based on the fact that one gene can be partially bound by several siRNAs and, conversely, one siRNA can bind to a few genes with distinct binding affinities. This model constructs a many-to-many correspondence between siRNAs and their targets, with far fewer siRNAs than mRNA targets compared with the conventional scheme. Mathematically, this problem involves an underdetermined system of equations (linear or nonlinear), which is ill-posed in general. However, the recently developed compressed sensing (CS) theory can solve this problem. We present a mathematical model to describe the csRNAi system based on both CS theory and biological concerns. To build this model, we first search for nucleotide motifs in a target gene set. Then we propose a machine-learning-based method to find effective siRNAs, using novel features such as image features and speech features to describe an siRNA sequence. Numerical simulations show that we can reduce the siRNA library to one third of that in the conventional scheme. In addition, the features used to describe siRNAs substantially outperform the existing ones. Conclusions: The csRNAi system is very promising for saving both time and cost in large-scale RNAi screening experiments, which may benefit biological research on cellular processes and pathways.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
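    For readers unfamiliar with compressed sensing, the recovery step csRNAi relies on can be sketched in a few lines: with fewer siRNA measurements than genes, basis pursuit recovers a sparse effect vector from an underdetermined linear system. This is a generic toy (synthetic A and b, scipy assumed), not the authors' model.
```python
# Sketch: recover a sparse gene-effect vector from few siRNA measurements
# via basis pursuit (min ||x||_1 s.t. Ax = b). Illustrative only; the
# data (A, b) are synthetic, not from the paper.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n_sirnas, n_genes = 30, 90                 # fewer measurements than unknowns
A = rng.normal(size=(n_sirnas, n_genes))   # siRNA-gene binding matrix (synthetic)
x_true = np.zeros(n_genes)
x_true[rng.choice(n_genes, 5, replace=False)] = rng.normal(size=5)  # sparse effects
b = A @ x_true                             # observed knockdown readout

# min 1'u + 1'v  s.t.  A(u - v) = b,  u, v >= 0   (so x = u - v, |x| = u + v)
c = np.ones(2 * n_genes)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=b,
              bounds=[(0, None)] * (2 * n_genes))
x_hat = res.x[:n_genes] - res.x[n_genes:]
print("max recovery error:", np.abs(x_hat - x_true).max())
```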
  • 2
    Publication date: 2012-12-29
    Description: Background: Copy number variations (CNVs) are genomic structural variants that are found in healthy populations and have been observed to be associated with disease susceptibility. Existing methods for CNV detection are often performed on a sample-by-sample basis, which is not ideal for large datasets where common CNVs must be estimated by comparing the frequency of CNVs in the individual samples. Here we describe a simple and novel approach to locate genome-wide CNVs common to a specific population, using human ancestry as the phenotype. Results: We utilized our previously published Genome Alteration Detection Analysis (GADA) algorithm to identify common ancestry CNVs (caCNVs) and built a caCNV model to predict population structure. We identified a 73-caCNV signature using a training set of 225 healthy individuals of European, Asian, and African ancestry. The signature was validated on an independent test set of 300 individuals with similar ancestral backgrounds. The error rate in predicting ancestry in this test set was 2% using the 73-caCNV signature. Among the caCNVs identified, several were previously confirmed experimentally to vary by ancestry. Our signature also contains a caCNV region with a single microRNA (MIR270), which represents the first reported variation of a microRNA by ancestry. Conclusions: We developed a new methodology to identify common CNVs and demonstrated its performance by building a caCNV signature to predict human ancestry with high accuracy. The utility of our approach could be extended to large case-control studies to identify CNV signatures for other phenotypes such as disease susceptibility and drug response.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 3
    Publication date: 2012-11-09
    Description: Background: The robust identification of isotope patterns originating from peptides analyzed by mass spectrometry (MS) is often significantly hampered by noise artifacts and the interference of overlapping patterns arising, e.g., from post-translational modifications. As the classification of the recorded data points into either 'noise' or 'signal' lies at the very root of essentially every proteomic application, the quality of the automated processing of mass spectra can significantly influence the way the data are interpreted within a given biological context. Results: We propose non-negative least squares/non-negative least absolute deviation regression to fit a raw spectrum by templates imitating isotope patterns. In a carefully designed validation scheme, we show that the method exhibits excellent performance in pattern picking. It is demonstrated that the method is able to disentangle complicated overlaps of patterns. Conclusions: We find that regularization is not necessary to prevent overfitting and that thresholding is an effective and user-friendly way to perform feature selection. The proposed method avoids problems inherent in regularization-based approaches, comes with a set of well-interpretable parameters whose default configuration is shown to generalize well without the need for fine-tuning, and is applicable to spectra from different platforms. The R package IPPD implements the method and is available from the Bioconductor platform (http://bioconductor.fhcrc.org/help/bioc-views/devel/bioc/html/IPPD.html).
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
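    The core fitting idea, expressing a spectrum as a non-negative combination of isotope-pattern templates and using a threshold instead of regularization, can be illustrated with a generic sketch. Templates and spectrum here are synthetic; the actual implementation is the R/Bioconductor package IPPD.
```python
# Sketch: non-negative least squares fit of a spectrum by isotope-pattern
# templates, with thresholding as feature selection. Synthetic data.
import numpy as np
from scipy.optimize import nnls

mz = np.linspace(999.0, 1003.0, 2000)

def gaussian(mu, sigma=0.02):
    return np.exp(-0.5 * ((mz - mu) / sigma) ** 2)

def template(m0):
    heights = [1.0, 0.8, 0.4, 0.15]          # crude isotope envelope
    return sum(h * gaussian(m0 + k) for k, h in enumerate(heights))

candidates = np.arange(999.5, 1001.6, 0.5)   # candidate monoisotopic masses
T = np.column_stack([template(m) for m in candidates])

spectrum = 3.0 * template(1000.0) + 0.05 * np.random.default_rng(1).normal(size=mz.size)
beta, _ = nnls(T, spectrum)                  # non-negative template weights
picked = candidates[beta > 0.5]              # threshold instead of regularization
print("picked monoisotopic masses:", picked)
```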
  • 4
    Publication date: 2012-11-10
    Description: Background: The inference of homologies among DNA sequences, that is, positions in multiple genomes that share a common evolutionary origin, is a crucial yet difficult task facing biologists. Its computational counterpart is known as the multiple sequence alignment problem. There are various criteria and methods available to perform multiple sequence alignments, and among these, the minimization of the overall cost of the alignment on a phylogenetic tree is known in combinatorial optimization as the Tree Alignment Problem. This problem typically occurs as a subproblem of the Generalized Tree Alignment Problem, which looks for the tree with the lowest alignment cost among all possible trees. This is equivalent to the Maximum Parsimony problem when the input sequences are not aligned, that is, when phylogeny and alignments are simultaneously inferred. Results: For large data sets, a popular heuristic is Direct Optimization (DO). DO provides a good tradeoff between speed, scalability, and competitive scores, and is implemented in the computer program POY. All other (competitive) algorithms have greater time complexities compared to DO. Here, we introduce a new algorithm, Affine-DO, and present experiments with it, to accommodate the indel (alignment gap) models commonly used in phylogenetic analysis of molecular sequence data. Affine-DO has the same time complexity as DO, but is correctly suited for the affine gap edit distance. We demonstrate its performance with more than 330,000 experimental tests. These experiments show that the solutions of Affine-DO are close to the lower bound inferred from a linear programming solution. Moreover, iterating over a solution produced using Affine-DO shows little improvement. Conclusions: Our results show that Affine-DO is likely producing near-optimal solutions, with approximations within 10% for sequences with small divergence, and within 30% for random sequences, for which Affine-DO produced the worst solutions. The Affine-DO algorithm has the necessary scalability and optimality to be a significant improvement in the real-world phylogenetic analysis of sequence data.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
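    For context, the affine gap edit distance that Affine-DO targets is the classic three-state (Gotoh) recurrence. A minimal, simplified sketch follows (illustrative costs, no traceback; immediately adjacent opposite-direction gaps are disallowed, a common simplification):
```python
# Sketch: affine-gap pairwise edit distance (Gotoh-style three-state DP),
# the distance model Affine-DO is suited to. Parameters are illustrative.
import numpy as np

def affine_distance(a, b, sub=1, gap_open=3, gap_ext=1):
    n, m = len(a), len(b)
    INF = float("inf")
    M = np.full((n + 1, m + 1), INF)  # match/mismatch state
    X = np.full((n + 1, m + 1), INF)  # gap in b (deletion from a)
    Y = np.full((n + 1, m + 1), INF)  # gap in a (insertion)
    M[0, 0] = 0.0
    for i in range(1, n + 1):
        X[i, 0] = gap_open + (i - 1) * gap_ext
    for j in range(1, m + 1):
        Y[0, j] = gap_open + (j - 1) * gap_ext
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 0 if a[i - 1] == b[j - 1] else sub
            M[i, j] = s + min(M[i - 1, j - 1], X[i - 1, j - 1], Y[i - 1, j - 1])
            X[i, j] = min(M[i - 1, j] + gap_open, X[i - 1, j] + gap_ext)
            Y[i, j] = min(M[i, j - 1] + gap_open, Y[i, j - 1] + gap_ext)
    return min(M[n, m], X[n, m], Y[n, m])

print(affine_distance("ACGTACGT", "ACGACGGT"))
```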
  • 5
    Publication date: 2012-11-15
    Description: Background: Time-course gene expression data such as yeast cell cycle data may be periodically expressed. To cluster such data, currently used Fourier series approximations of periodic gene expression have been found not to be sufficiently adequate to model the complexity of the time-course data, partly because they ignore the dependence between the expression measurements over time and the correlation among gene expression profiles. We further investigate the advantages and limitations of available models in the literature and propose a new mixture model with autoregressive random effects of the first order for the clustering of time-course gene-expression profiles. Some simulations and real examples are given to demonstrate the usefulness of the proposed models. Results: We illustrate the applicability of our new model using synthetic and real time-course datasets. We show that our model outperforms existing models in providing more reliable and robust clustering of time-course data. Our model provides superior results when genetic profiles are correlated. It also gives comparable results when the correlation between the gene profiles is weak. In the applications to real time-course data, relevant clusters of co-regulated genes are obtained, which are supported by gene-function annotation databases. Conclusions: Our new model, under our extension of the EMMIX-WIRE procedure, is more reliable and robust for clustering time-course data because it adopts a random effects model that allows for the correlation among observations at different time points. It postulates gene-specific random effects with an autocorrelation variance structure that models coregulation within the clusters. The developed R package is flexible in its specification of the random effects through user-input parameters, which enables improved modelling and consequent clustering of time-course data.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
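    The dependence structure the model captures can be illustrated by simulating profiles that share a periodic cluster trend plus gene-specific AR(1) noise. Parameters are synthetic; this is not the EMMIX-WIRE fitting code.
```python
# Sketch: clustered time-course expression with gene-specific AR(1) random
# effects, the correlation structure the proposed mixture model allows for.
import numpy as np

rng = np.random.default_rng(2)
T, genes_per_cluster = 12, 50
t = np.arange(T)

def ar1(n, rho=0.7, sigma=0.5):
    e = rng.normal(scale=sigma, size=n)
    x = np.zeros(n)
    for i in range(1, n):
        x[i] = rho * x[i - 1] + e[i]   # autocorrelated gene-specific effect
    return x

cluster_mean = np.sin(2 * np.pi * t / T)   # shared periodic trend
profiles = np.array([cluster_mean + ar1(T) for _ in range(genes_per_cluster)])
# within-cluster correlation induced by the shared trend + AR(1) noise
print(np.corrcoef(profiles)[0, 1:4])
```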
  • 6
    Publication date: 2012-11-16
    Description: Background: 454 pyrosequencing is a commonly used massively parallel DNA sequencing technology with a wide variety of application fields such as epigenetics, metagenomics and transcriptomics. A well-known problem of this platform is its sensitivity to base-calling insertion and deletion errors, particularly in the presence of long homopolymers. In addition, the base-call quality scores are not informative with respect to whether an insertion or a deletion error is more likely. Surprisingly, not much effort has been devoted to the development of improved base-calling methods and more intuitive quality scores for this platform. Results: We present HPCall, a 454 base-calling method based on a weighted Hurdle Poisson model. HPCall uses a probabilistic framework to call the homopolymer lengths in the sequence by modeling well-known 454 noise predictors. Base-calling quality is assessed based on estimated probabilities for each homopolymer length, which are easily transformed into useful quality scores. Conclusions: Using a reference data set of the Escherichia coli K-12 strain, we show that HPCall produces superior quality scores that are very informative with respect to possible insertion and deletion errors, while maintaining a base-calling accuracy that is better than that of the current base-caller. Given the generality of the framework, HPCall has the potential to adapt to other homopolymer-sensitive sequencing technologies as well.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
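    A hurdle Poisson distribution over homopolymer lengths, the building block that HPCall's weighted probabilistic framework is based on, looks like this in a toy parameterization (illustrative only, including the crude Phred-style score; not the fitted model):
```python
# Sketch: hurdle Poisson over homopolymer lengths. P(K=0) is a separate
# "hurdle" mass; positive lengths follow a zero-truncated Poisson.
import numpy as np
from scipy.stats import poisson

def hurdle_poisson_pmf(k, pi0, lam):
    k = np.asarray(k)
    trunc = poisson.pmf(k, lam) / (1.0 - poisson.pmf(0, lam))
    return np.where(k == 0, pi0, (1.0 - pi0) * trunc)

lengths = np.arange(0, 8)
probs = hurdle_poisson_pmf(lengths, pi0=0.15, lam=2.4)
call = lengths[np.argmax(probs)]                        # most probable length
quality = -10 * np.log10(1 - probs.max() / probs.sum()) # crude Phred-style score
print(dict(zip(lengths.tolist(), np.round(probs, 3))), call, round(quality, 1))
```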
  • 7
    Publication date: 2012-12-09
    Description: Background: Multivariate approaches have been successfully applied to genome-wide association studies. Recently, a Partial Least Squares (PLS) based approach was introduced for mapping yeast genotype-phenotype relations, where background information such as gene function classification, gene dispensability, recent or ancient gene copy number variations and the presence of premature stop codons or frameshift mutations in reading frames was used post hoc to explain selected genes. One of the latest advancements in PLS is L-Partial Least Squares (L-PLS), where 'L' denotes the shape of the data structure used; it enables the use of background information at the modeling level. Here, a modification of L-PLS with variable importance on projection (VIP) was implemented using a stepwise regularized procedure for gene and background information selection. Results were compared to PLS-based procedures where no background information was used. Results: Applying the proposed methodology to yeast Saccharomyces cerevisiae data, we found the genotype-phenotype relationship easier to interpret. Phenotypic variations were explained by the variations of relatively stable genes and stable background variations. The suggested procedure provides an automatic way to perform genotype-phenotype mapping. The selected phenotype-influencing genes were evolving 29% faster than non-influential genes, and the current results are supported by a recently conducted study. Further power analysis on simulated data verified that the proposed methodology selects relevant variables. Conclusions: A modification of L-PLS with VIP in a stepwise regularized elimination procedure can improve the understandability and stability of selected genes and background information. The approach is recommended for genome-wide association studies where background information is available.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
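    The VIP ingredient can be sketched with the textbook VIP formula on a fitted PLS model (scikit-learn assumed, synthetic data; the paper's stepwise regularized L-PLS elimination is not reproduced here):
```python
# Sketch: variable importance in projection (VIP) from a fitted PLS model,
# textbook definition: VIP_j = sqrt(p * sum_a ss_a (w_ja/||w_a||)^2 / sum_a ss_a).
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 20))
y = X[:, 0] - 2 * X[:, 3] + rng.normal(size=60)   # variables 0 and 3 matter

pls = PLSRegression(n_components=3).fit(X, y)
T = pls.transform(X)                               # X scores
W, Q = pls.x_weights_, pls.y_loadings_
ss = (Q.ravel() ** 2) * (T ** 2).sum(axis=0)       # y-variance explained per component
w_norm = W / np.linalg.norm(W, axis=0)
vip = np.sqrt(X.shape[1] * (w_norm ** 2 @ ss) / ss.sum())
print("top variables by VIP:", np.argsort(vip)[::-1][:5])
```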
  • 8
    Publication date: 2012-12-09
    Description: Background: Biomarker panels derived separately from genomic and proteomic data and with a variety of computational methods have demonstrated promising classification performance in various diseases. An open question is how to create effective proteo-genomic panels. The framework of ensemble classifiers has been applied successfully in various analytical domains to combine classifiers so that the performance of the ensemble exceeds the performance of individual classifiers. Using blood-based diagnosis of acute renal allograft rejection as a case study, we address the following question in this paper: Can acute rejection classification performance be improved by combining individual genomic and proteomic classifiers in an ensemble? Results: The first part of the paper presents a computational biomarker development pipeline for genomic and proteomic data. The pipeline begins with data acquisition (e.g., from bio-samples to microarray data), quality control, statistical analysis and mining of the data, and finally various forms of validation. The pipeline ensures that the various classifiers to be combined later in an ensemble are diverse and adequate for clinical use. Five mRNA genomic and five proteomic classifiers were developed independently using single time-point blood samples from 11 acute-rejection and 22 non-rejection renal transplant patients. The second part of the paper examines five ensembles ranging in size from two to ten individual classifiers. Performance of ensembles is characterized by area under the curve (AUC), sensitivity, and specificity, as derived from the probability of acute rejection for individual classifiers in the ensemble in combination with one of two aggregation methods: (1) Average Probability or (2) Vote Threshold. One ensemble demonstrated superior performance and was able to improve sensitivity and AUC beyond the best values observed for any of the individual classifiers in the ensemble, while staying within the range of observed specificity. The Vote Threshold aggregation method achieved improved sensitivity for all five ensembles, but typically at the cost of decreased specificity. Conclusion: Proteo-genomic biomarker ensemble classifiers show promise in the diagnosis of acute renal allograft rejection and can improve classification performance beyond that of individual genomic or proteomic classifiers alone. Validation of our results in an international multicenter study is currently underway.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
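    The two aggregation rules are simple enough to state directly; a sketch with made-up per-classifier probabilities and illustrative thresholds:
```python
# Sketch: the two ensemble aggregation rules named above, applied to
# per-classifier probabilities of acute rejection (synthetic numbers).
import numpy as np

def average_probability(probs, threshold=0.5):
    """Call rejection if the mean probability across classifiers exceeds threshold."""
    return np.mean(probs) > threshold

def vote_threshold(probs, threshold=0.5, min_votes=1):
    """Call rejection if at least min_votes classifiers individually exceed threshold."""
    return np.sum(np.asarray(probs) > threshold) >= min_votes

panel = [0.35, 0.62, 0.48, 0.71, 0.30]   # five classifiers' outputs (made up)
print(average_probability(panel), vote_threshold(panel, min_votes=2))
```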
  • 9
    Publication date: 2012-12-10
    Description: Background: Co-expression measures are often used to define networks among genes. Mutual information (MI) is often used as a generalized correlation measure. It is not clear how much MI adds beyond standard (robust) correlation measures or regression-model-based association measures. Further, it is important to assess what transformations of these and other co-expression measures lead to biologically meaningful modules (clusters of genes). Results: We provide a comprehensive comparison between mutual information and several correlation measures in eight empirical data sets and in simulations. We also study different approaches for transforming an adjacency matrix, e.g. using the topological overlap measure. Overall, we confirm close relationships between MI and correlation in all data sets, which reflects the fact that most gene pairs satisfy linear or monotonic relationships. We discuss rare situations in which the two measures disagree. We also compare correlation- and MI-based approaches when it comes to defining co-expression network modules. We show that a robust measure of correlation (the biweight midcorrelation transformed via the topological overlap transformation) leads to modules that are superior to MI-based modules and maximal information coefficient (MIC) based modules in terms of gene ontology enrichment. We present a function that relates correlation to mutual information, which can be used to approximate the mutual information from the corresponding correlation coefficient. We propose the use of polynomial or spline regression models as an alternative to MI for capturing non-linear relationships between quantitative variables. Conclusions: The biweight midcorrelation outperforms MI in terms of elucidating gene pairwise relationships. Coupled with the topological overlap matrix transformation, it often leads to more significantly enriched co-expression modules. Spline and polynomial networks form attractive alternatives to MI in the case of non-linear relationships. Our results indicate that MI networks can safely be replaced by correlation networks when it comes to measuring co-expression relationships in stationary data.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
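    For intuition on why correlation can stand in for MI: under a bivariate Gaussian model the two are linked exactly by MI = -(1/2) log(1 - r^2). The paper fits such a relating function empirically; the quick numerical check below covers only the Gaussian case.
```python
# Sketch: correlation-to-MI relationship for bivariate Gaussian data,
# MI = -0.5 * log(1 - r^2), checked against a crude histogram MI estimate.
import numpy as np

def mi_from_correlation(r):
    return -0.5 * np.log(1.0 - r ** 2)   # in nats

rng = np.random.default_rng(4)
r = 0.8
x, y = rng.multivariate_normal([0, 0], [[1.0, r], [r, 1.0]], size=200_000).T
print("theory:", mi_from_correlation(r))

hist, _, _ = np.histogram2d(x, y, bins=60)
p = hist / hist.sum()
px, py = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
nz = p > 0
print("estimate:", (p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())
```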
  • 10
    Publication date: 2012-12-12
    Description: Background: Illumina BeadArray technology includes non-specific negative control features that allow a precise estimation of the background noise. As an alternative to the background subtraction proposed in BeadStudio, which leads to a considerable loss of information by generating negative values, a background correction method modeling the observed intensities as the sum of an exponentially distributed signal and normally distributed noise has been developed. Nevertheless, Wang and Ye (2012) display a kernel-based estimator of the signal distribution on Illumina BeadArrays and suggest that a gamma distribution would better model the signal density. Hence, the normal-exponential modeling may not be appropriate for Illumina data, and background corrections derived from this model may lead to incorrect estimates. Results: We propose a more flexible modeling based on a gamma-distributed signal and normally distributed background noise, and develop the associated background correction, implemented in the R package NormalGamma. Our model proves to be markedly more accurate for Illumina BeadArrays: on the one hand, it is shown on two types of Illumina BeadChips that this model offers a better fit of the observed intensities. On the other hand, the comparison of the operating characteristics of several background correction procedures on spike-in and on normal-gamma simulated data shows high similarities, reinforcing the validation of the normal-gamma modeling. The performance of the background corrections based on the normal-gamma and normal-exponential models is compared on two dilution data sets, through testing procedures which represent various experimental designs. Surprisingly, we observe that the implementation of a more accurate parametrisation in the model-based background correction does not increase the sensitivity. These results may be explained by the operating characteristics of the estimators: the normal-gamma background correction offers an improvement in terms of bias, but at the cost of a loss in precision. Conclusions: This paper addresses the lack of fit of the usual normal-exponential model by proposing a more flexible parametrisation of the signal distribution as well as the associated background correction. This new model proves to be considerably more accurate for Illumina microarrays, but the improvement in terms of modeling does not lead to higher sensitivity in differential analysis. Nevertheless, this realistic modeling opens the way for future investigations, in particular to examine the characteristics of pre-processing strategies.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
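    The convolution model itself is easy to simulate, which also shows why naive background subtraction produces the negative values mentioned above. Parameters are illustrative; the actual fitting lives in the R package NormalGamma.
```python
# Sketch: normal-gamma model for observed bead intensities,
# observed = gamma-distributed signal + normally distributed background noise.
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
signal = rng.gamma(shape=1.8, scale=120.0, size=n)   # true expression signal
noise = rng.normal(loc=400.0, scale=30.0, size=n)    # background (negative controls)
observed = signal + noise

# Naive mean-background subtraction can go negative; model-based correction avoids this.
subtracted = observed - noise.mean()
print("fraction negative after naive subtraction:", (subtracted < 0).mean())
```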
  • 11
    Publication date: 2012-09-25
    Description: Background: Biologists are elucidating complex collections of genetic regulatory data for multiple organisms. Software is needed to manage and analyze such regulatory network data. Results: The Pathway Tools software provides a comprehensive environment for manipulating molecular regulatory interactions that integrates regulatory data with an organism's genome and metabolic network. The Pathway Tools regulation ontology captures transcriptional and translational regulation, substrate-level regulation of enzyme activity, post-translational modifications, and regulatory pathways. Curated collections of regulatory data are available for Escherichia coli, Bacillus subtilis, and Shewanella oneidensis. Regulatory visualizations include a novel diagram that summarizes all regulatory influences on a gene; a transcription-unit diagram; and an interactive visualization of a full transcriptional regulatory network that can be painted with gene expression data to probe correlations between gene expression and regulatory mechanisms. We introduce a novel type of enrichment analysis that asks whether a gene-expression dataset is over-represented for known regulators. We present algorithms for ranking the degree of regulatory influence of genes, and for computing the net positive and negative regulatory influences on a gene.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
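    The regulator enrichment question, whether a hit list is over-represented for a regulator's known targets, is typically answered with a hypergeometric test. A generic sketch with toy gene sets (not Pathway Tools code):
```python
# Sketch: hypergeometric enrichment of a regulator's target set (regulon)
# within a list of differentially expressed genes. Toy data.
from scipy.stats import hypergeom

def regulator_enrichment(hits, regulon, genome_size):
    """P(overlap >= observed) for a regulon within the hit list."""
    overlap = len(hits & regulon)
    return hypergeom.sf(overlap - 1, genome_size, len(regulon), len(hits))

hits = {f"g{i}" for i in range(0, 40)}               # DE genes (made up)
regulon = {f"g{i}" for i in range(0, 25)} | {"g99"}  # one regulator's targets (made up)
print(regulator_enrichment(hits, regulon, genome_size=4000))
```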
  • 12
    Publication date: 2012-09-26
    Description: Background: Inverted repeat genes encode precursor RNAs characterized by hairpin structures. These RNA hairpins are then metabolized by biosynthetic pathways to produce functional small RNAs. In eukaryotic genomes, short non-autonomous transposable elements can have similar sizes and hairpin structures as non-coding precursor RNAs. This resemblance leads to problems in annotating small RNAs. Methods: We mapped all microRNA precursors from miRBase to several genomes and studied the repetition and dispersion of the corresponding loci. We then searched for repetitive elements overlapping these loci. Results: We developed an automatic method called ncRNAclassifier to classify pre-ncRNAs according to their relationship with transposable elements (TEs). We show that the number of scattered occurrences of ncRNA precursor candidates is correlated with the presence of TEs. We applied ncRNAclassifier to six chordate genomes and report our findings. Among the 1,426 human and 721 mouse pre-miRNAs of miRBase, we identified 235 and 68 mis-annotated pre-miRNAs, respectively, corresponding completely to TEs. Conclusions: We provide a tool enabling the identification of repetitive elements in precursor ncRNA sequences. ncRNAclassifier is available at http://EvryRNA.ibisc.univ-evry.fr
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
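    The underlying classification signal, how much of a precursor locus is covered by a TE annotation, reduces to interval arithmetic. A toy sketch (the 0.9 cutoff is made up for illustration):
```python
# Sketch: flag precursor loci that are almost entirely covered by annotated
# transposable elements. Toy coordinates; not the ncRNAclassifier pipeline.
def overlap_fraction(locus, te):
    """Fraction of the locus covered by a TE annotation; both are (start, end)."""
    start, end = max(locus[0], te[0]), min(locus[1], te[1])
    return max(0, end - start) / (locus[1] - locus[0])

pre_mirna = (1_000, 1_090)                    # candidate precursor locus
te_annotations = [(950, 1_100), (5_000, 5_300)]
covered = max(overlap_fraction(pre_mirna, te) for te in te_annotations)
print("TE-derived" if covered > 0.9 else "likely genuine ncRNA", covered)
```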
  • 13
    Publication date: 2012-10-14
    Description: Background: New computational resources are needed to manage the increasing volume of biological data from genome sequencing projects. One fundamental challenge is the ability to maintain a complete and current catalog of protein diversity. We developed a new approach for the identification of protein families that focuses on the rapid discovery of homologous protein sequences. Results: We implemented fully automated and high-throughput procedures to de novo cluster proteins into families based upon global alignment similarity. Our approach employs an iterative clustering strategy in which homologs of known families are sifted out of the search for new families. The resulting reduction in computational complexity enables us to rapidly identify novel protein families found in new genomes and to perform efficient, automated updates that keep pace with genome sequencing. We refer to protein families identified through this approach as "Sifting Families," or SFams. Our analysis of ~10.5 million protein sequences from 2,928 genomes identified 436,360 SFams, many of which are not represented in other protein family databases. We validated the quality of SFam clustering through statistical as well as network-topology-based analyses. Conclusions: We describe the rapid identification of SFams and demonstrate how they can be used to annotate genomes and metagenomes. The SFam database catalogs protein-family quality metrics, multiple sequence alignments, hidden Markov models, and phylogenetic trees. Our source code and database are publicly available and will be subject to frequent updates (http://edhar.genomecenter.ucdavis.edu/sifting_families/).
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 14
    Publication date: 2012-09-22
    Description: Background: Chromosome conformation capture experiments result in pairwise proximity measurements between chromosome locations in a genome, and they have been used to construct three-dimensional models of genomic regions, chromosomes, and entire genomes. These models can be used to understand long-range gene regulation, chromosome rearrangements, and the relationships between sequence and spatial location. However, it is unclear whether these pairwise distance constraints provide sufficient information to embed chromatin in three dimensions. A priori, it is possible that an infinite number of embeddings are consistent with the measurements due to a lack of constraints between some regions. It is therefore necessary to separate regions of the chromatin structure that are sufficiently constrained from regions with measurements that do not provide enough information to reconstruct the embedding. Results: We present a new method based on graph rigidity to assess the suitability of experiments for constructing plausible three-dimensional models of chromatin structure. Underlying this analysis is a new, efficient, and accurate algorithm for finding sufficiently constrained (rigid) collections of constraints in three dimensions, a problem for which there is no known efficient algorithm. Applying the method to four recent chromosome conformation experiments, we find that, for even stringently filtered constraints, a large rigid component spans most of the measured region. Filtering highlights higher-confidence regions, and we find that the organization of these regions depends crucially on short-range interactions. Conclusions: Without performing an embedding or creating a frequency-to-distance mapping, our proposed approach establishes which substructures are supported by a sufficient framework of interactions. It also establishes that interactions from recent highly filtered genome-wide chromosome conformation experiments provide an adequate set of constraints for embedding. Pre-processing experimentally observed interactions with this method before relating chromatin structure to biological phenomena will ensure that hypothesized correlations are not driven by the arbitrary choice of a particular unconstrained embedding. The software for identifying rigid components is GPL-licensed and available for download at http://cbcb.umd.edu/kingsford-group/starfish.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 15
    Publication date: 2012-09-22
    Description: Background: Experimental determination of protein 3D structures is expensive, time-consuming and sometimes impossible. The gap between the number of protein structures deposited in the World Wide Protein Data Bank and the number of sequenced proteins is constantly widening. Computational modeling is deemed to be one of the ways to deal with this problem. Although protein 3D structure prediction is a difficult task, many tools are available. These tools can model a structure from a sequence or from partial structural information, e.g. contact maps. Consequently, biologists can automatically generate a putative 3D structure model of any protein. However, the main issue then becomes the evaluation of model quality, which is one of the most important challenges of structural biology. Results: GOBA (Gene Ontology-Based Assessment) is a novel protein model quality assessment program. It estimates the compatibility between a model structure and its expected function. GOBA is based on the assumption that a high-quality model is expected to be structurally similar to proteins that are functionally similar to the prediction target. Whereas DALI is used to measure structural similarity, protein functional similarity is quantified using the standardized and hierarchical description of proteins provided by the Gene Ontology, combined with Wang's algorithm for calculating semantic similarity. Two approaches are proposed to express the quality of protein model structures. One is a single-model quality assessment method; the other is its modification, which provides a relative measure of model quality. Exhaustive evaluation is performed on data sets of model structures submitted to the CASP8 and CASP9 contests. Conclusions: The validation shows that the method is able to discriminate between good and bad model structures. The best of the tested GOBA scores achieved mean Pearson correlations of 0.74 and 0.8 with the observed quality of models in our CASP8- and CASP9-based validation sets. GOBA also obtained the best result for two CASP8 targets and one CASP9 target, compared to the contest participants. Consequently, GOBA offers a novel single-model quality assessment program that addresses the practical needs of biologists. In conjunction with other model quality assessment programs (MQAPs), it will prove useful for the evaluation of single protein models.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 16
    Publication date: 2012-09-22
    Description: Background: The University of California, Santa Cruz (UCSC) genome database is among the most used sources of genomic annotation in human and other organisms. The database offers an excellent web-based graphical user interface (the UCSC genome browser) and several means for programmatic queries. However, a simple application programming interface (API) in a scripting language aimed at the biologist was not yet available. Here, we present the Ruby UCSC API, a library to access the UCSC genome database using Ruby. Results: The API is designed as a BioRuby plug-in and built on the ActiveRecord 3 framework for the object-relational mapping, making writing SQL statements unnecessary. The current version of the API supports databases of all organisms in the UCSC genome database, including human, mammals, vertebrates, deuterostomes, insects, nematodes, and yeast. The API uses the bin index (if available) when querying for genomic intervals. The API also supports genomic sequence queries using locally downloaded *.2bit files that are not stored in the official MySQL database. The API is implemented in pure Ruby and is therefore available in different environments and with different Ruby interpreters (including JRuby). Conclusions: Assisted by the straightforward object-oriented design of Ruby and ActiveRecord, the Ruby UCSC API will help biologists query the UCSC genome database programmatically. The API is available through the RubyGem system. Source code and documentation are available at https://github.com/misshie/bioruby-ucsc-api/ under the Ruby license. Feedback and help are provided via the website at http://rubyucscapi.userecho.com/.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 17
    Publication date: 2012-09-25
    Description: Background: Sporadic amyotrophic lateral sclerosis (sALS) is a devastating, complex disease of unknown etiology. We studied this disease with microarray technology to capture as much biological complexity as possible. The Affymetrix-focused BaFL pipeline takes into account problems with probes that arise from physical and biological properties, so we adapted it to handle the long-oligonucleotide probes on our arrays (hence LO-BaFL). The revised method was tested against a validated array experiment and then used in a meta-analysis of peripheral white blood cells from healthy control samples in two experiments. We predicted differentially expressed (DE) genes in our sALS data, combining the results obtained using the TM4 suite of tools with those from the LO-BaFL method. Those predictions were tested using qRT-PCR assays. Results: LO-BaFL filtering and DE testing accurately predicted previously validated DE genes in a published experiment on coronary artery disease (CAD). Filtering healthy control data from the sALS and CAD studies with LO-BaFL resulted in highly correlated expression levels across many genes. After bioinformatics analysis, twelve genes from the sALS DE gene list were selected for independent testing using qRT-PCR assays. High-quality RNA from six healthy control and six sALS samples yielded the predicted differential expression for seven genes: TARDBP, SKIV2L2, C12orf35, DYNLT1, ACTG1, B2M, and ILKAP. Four of the seven have been previously described in sALS studies, while ACTG1, B2M and ILKAP appear in the context of this disease for the first time. Supplementary material can be accessed at: http://webpages.uncc.edu/~cbaciu/LO-BaFL/supplementary_data.html Conclusion: LO-BaFL predicts DE results that are broadly similar to those of other methods. The small healthy control cohort in the sALS study is a reasonable foundation for predicting DE genes. Modifying the BaFL pipeline allowed us to remove noise and systematic errors, improving the power of this study, which had a small sample size. Each bioinformatics approach revealed DE genes not predicted by the other; subsequent PCR assays confirmed seven of twelve candidates, a relatively high success rate.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 18
    Publication date: 2012-09-27
    Description: Background: With the advent of next-generation sequencing (NGS) technologies, full cDNA shotgun sequencing has become a major approach in the study of transcriptomes, and several different protocols for 454 sequencing have been invented. As each protocol uses its own short DNA tags or adapters attached to the ends of cDNA fragments for labeling or sequencing, different contaminants may lead to mis-assembly and inaccurate sequence products. Results: We have designed and implemented a new program for raw sequence cleaning, with a graphical user interface and a batch script. The cleaning process consists of several modules, including barcode trimming, sequencing adapter trimming, amplification primer trimming, poly-A tail trimming, vector screening and low-quality region trimming. These modules can be combined based on various sequencing applications. Conclusions: ESTclean is a software package not only for cleaning cDNA sequences, but also for helping to develop sequencing protocols by providing summary tables and figures for sequencing quality control in a graphical user interface. It excels at cleaning read sequences from complicated sequencing protocols that use barcodes and multiple amplification primers.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
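    The chained trimming modules can be illustrated with naive string operations on a toy read; real cleaners such as ESTclean use quality-aware, mismatch-tolerant matching. All sequences below are made up.
```python
# Sketch: naive barcode -> adapter -> poly-A trimming chain on a toy read.
# Note: a naive poly-A strip also eats a genuine trailing A in the insert,
# one reason real tools are more careful.
def trim_read(read, barcode, adapter):
    if read.startswith(barcode):
        read = read[len(barcode):]   # barcode trimming
    idx = read.find(adapter)
    if idx != -1:
        read = read[:idx]            # adapter trimming
    return read.rstrip("A")          # poly-A tail trimming

insert, adapter = "TTGGCCAATCGA", "CTGAGACTGCCAAGGCACACAG"
read = "ACGT" + insert + "AAAAAA" + adapter
print(trim_read(read, barcode="ACGT", adapter=adapter))
```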
  • 19
    Publication date: 2012-09-27
    Description: Background: While the genetics of diploid inheritance is well studied and software for linkage mapping, haplotyping and QTL analysis is available, the tools available for tetraploids are limited. In order to develop such tools, it would be helpful if simulated populations based on a variety of models of tetraploid meiosis were available. Results: Here we present PedigreeSim, a software package that simulates meiosis in both diploid and tetraploid species and uses this to simulate pedigrees and cross populations. For tetraploids, a variety of models can be used, including both bivalent and quadrivalent formation, varying degrees of preferential pairing of hom(oe)ologous chromosomes, different quadrivalent configurations and more. Simulation of quadrivalent meiosis results, as expected, in double reduction and recombination between more than two hom(oe)ologous chromosomes. The results are shown to match theoretical predictions. Conclusions: This is the first simulation software that implements all features of meiosis in tetraploids. It allows users to generate data for tetraploid and diploid populations, and to investigate different models of tetraploid meiosis. The software and manual are available from http://www.plantbreeding.wur.nl/UK/software_pedigreeSim.html and as Additional files 1, 2, 3 and 4 with this publication.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 20
    Publication date: 2012-10-04
    Description: Background: Sharing of data about variation and the associated phenotypes is a critical need, yet variant information can be arbitrarily complex, making a single standard vocabulary elusive and re-formatting difficult. Complex standards have proven too time-consuming to implement. Results: The GEN2PHEN project addressed these difficulties by developing a comprehensive data model for capturing biomedical observations, Observ-OM, and building the VarioML format around it. VarioML pairs a simplified open specification for describing variants with a toolkit for adapting the specification to one's own research workflow. Straightforward variant data can be captured, federated, and exchanged with no overhead; more complex data can be described without loss of compatibility. The open specification enables push-button submission to gene variant databases (LSDBs), e.g. the Leiden Open Variation Database, using the Cafe Variome data publishing service, while VarioML bidirectionally transforms data between XML and web-application code formats, opening up new possibilities for open-source web applications building on shared data. A Java implementation toolkit makes VarioML easily integrated into biomedical applications. VarioML is designed primarily for LSDB data submission and transfer scenarios, but can also be used as a standard variation data format for JSON and XML document databases and user interface components. Conclusions: VarioML is a set of tools and practices improving the availability, quality, and comprehensibility of human variation information. It enables researchers, diagnostic laboratories, and clinics to share that information with ease, clarity, and without ambiguity.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 21
    Publication date: 2012-10-04
    Description: Background: We consider the problem of finding the maximum frequent agreement subtrees (MFASTs) in a collection of phylogenetic trees. Existing methods for this problem often do not scale beyond datasets with around 100 taxa. Our goal is to address this problem for datasets with over a thousand taxa and hundreds of trees. Results: We develop a heuristic solution that aims to find MFASTs in sets of many, large phylogenetic trees. Our method works in multiple phases. In the first phase, it identifies small candidate subtrees from the set of input trees which serve as the seeds of larger subtrees. In the second phase, it combines these small seeds to build larger candidate MFASTs. In the final phase, it performs a post-processing step that ensures that we find a frequent agreement subtree that is not contained in a larger frequent agreement subtree. We demonstrate that this heuristic can easily handle data sets with 1000 taxa, greatly extending the estimation of MFASTs beyond current methods. Conclusions: Although this heuristic does not guarantee finding all MFASTs or the largest MFAST, it found the MFAST in all of our synthetic datasets where we could verify the correctness of the result. It also performed well on large empirical data sets. Its performance is robust to the number and size of the input trees. Overall, this method provides a simple and fast way to identify strongly supported subtrees within large phylogenetic hypotheses.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 22
    Publication date: 2012-10-04
    Description: Background: Currently, there is no open-source, cross-platform, scalable framework for coalescent analysis in population genetics, nor is there a scalable GUI-based user application. Such a framework and application would not only drive the creation of more complex and realistic models but also make them truly accessible. Results: As a first attempt, we built a framework and user application for the domain of exact calculations in coalescent analysis. The framework provides an API with the concepts of model, data, statistic, phylogeny, gene tree and recursion. Infinite-alleles and infinite-sites models are considered. It defines pluggable computations such as counting and listing all the ancestral configurations and genealogies and computing the exact probability of data. It can visualize a gene tree, trace and visualize the internals of the recursion algorithm for further improvement, and dynamically attach a number of output processors. The user application defines jobs in a plug-in-like manner so that they can be activated, deactivated, installed or uninstalled on demand. Multiple jobs can be run and their inputs edited. Job inputs are persisted across restarts and running jobs can be cancelled where applicable. Conclusions: Coalescent theory plays an increasingly important role in analysing molecular population genetic data. The models involved are mathematically difficult and computationally challenging. An open-source, scalable framework that lets users immediately take advantage of the progress made by others will enable exploration of yet more difficult and realistic models. As models become more complex and mathematically less tractable, the need for an integrated computational approach is obvious. Object-oriented designs, though they have upfront costs, are practical now and can provide such an integrated approach.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 23
    Publication date: 2012-10-06
    Description: Background: Clinical bioinformatics is currently growing and is based on the integration of clinical and omics data, aiming at the development of personalized medicine. Thus, the introduction of novel technologies able to investigate the relationship between clinical states and biological machinery may help the development of this field. For instance, the Affymetrix DMET platform (drug metabolism enzymes and transporters) is able to study the relationship between variation in patient genomes and drug metabolism, detecting SNPs (single nucleotide polymorphisms) on genes related to drug metabolism. This may allow, for instance, finding genetic variants in patients who present different drug responses, in pharmacogenomics and clinical studies. Despite this, there is currently a lack of open-source algorithms and tools for the analysis of DMET data. Existing software tools for DMET data generally allow only the preprocessing of binary data (e.g. the DMET-Console provided by Affymetrix) and simple data analysis operations, but do not allow testing the association of the presence of SNPs with the response to drugs. Results: We developed DMET-Analyzer, a tool for automatic association analysis between variation in patient genomes and the clinical conditions of patients, i.e. different responses to drugs. The proposed system allows: (i) automation of the workflow of analysis of DMET-SNP data, avoiding the use of multiple tools; (ii) automatic annotation of DMET-SNP data and search in existing SNP databases (e.g. dbSNP); (iii) association of SNPs with pathways through a search in PharmGKB, a major knowledge base for pharmacogenomic studies. DMET-Analyzer has a simple graphical user interface that allows users (doctors/biologists) to upload and analyse DMET files produced by the Affymetrix DMET-Console in an interactive way. The effectiveness and ease of use of DMET-Analyzer are demonstrated through different case studies regarding the analysis of clinical datasets produced in the University Hospital of Catanzaro, Italy. Conclusion: DMET-Analyzer is a novel tool able to automatically analyse data produced by the DMET platform in case-control association studies. Using this tool, users may avoid wasting time on the manual execution of multiple statistical tests, avoiding possible errors and reducing the amount of time needed for a whole experiment. Moreover, annotations and direct links to external databases may increase the biological knowledge extracted. The system is freely available for academic purposes at: https://sourceforge.net/projects/dmetanalyzer/files/
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
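    The per-SNP test in a case-control DMET-style analysis is typically Fisher's exact test on a 2x2 genotype-by-response table. A minimal sketch with made-up counts (not DMET-Analyzer code):
```python
# Sketch: Fisher's exact test for association between carrying a variant
# allele and drug response. Counts are invented for illustration.
from scipy.stats import fisher_exact

#                responders  non-responders
table = [[18, 4],    # carriers of the variant allele
         [12, 26]]   # non-carriers
odds_ratio, p_value = fisher_exact(table)
print(f"OR={odds_ratio:.2f}, p={p_value:.4f}")
```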
  • 24
    Publication date: 2012-08-01
    Description: Background: The Hedgehog Signaling Pathway is one of the signaling pathways that are very important to embryonic development. The participation of inhibitors in the Hedgehog Signaling Pathway can control cell growth and death, and the search for novel inhibitors of the pathway's functioning is in great demand. In fact, effective inhibitors could provide efficient therapies for a wide range of malignancies, and targeting such a pathway in cells represents a promising new paradigm for cell growth and death control. Current research mainly focuses on the synthesis of inhibitors based on cyclopamine derivatives, which bind specifically to the Smo protein and can be used for cancer therapy. While quantitative structure-activity relationship (QSAR) studies have been performed for these compounds among different cell lines, none of them has achieved acceptable results in the prediction of activity values of new compounds. In this study, we propose a novel collaborative QSAR model for inhibitors of the Hedgehog Signaling Pathway that integrates information from multiple cell lines. Such a model is expected to substantially improve on QSAR from single cell lines, and to provide useful clues for developing clinically effective inhibitors and modifications of parent lead compounds targeting the Hedgehog Signaling Pathway. Results: In this study, we have presented: (1) a collaborative QSAR model, which integrates information among multiple cell lines to boost the QSAR results, rather than modeling only a single cell line; our experiments have shown that the performance of our model is significantly better than that of single-cell-line QSAR methods; and (2) an efficient feature selection strategy under such a collaborative environment, which can derive the commonly important features related to the entire given set of cell lines, while simultaneously showing their specific contributions to a particular cell line. Based on the feature selection results, we have proposed several possible chemical modifications to improve inhibitor affinity towards multiple targets in the Hedgehog Signaling Pathway. Conclusions: Our model with the feature selection strategy presented here is efficient, robust, and flexible, and can be easily extended to model large-scale multiple cell line/QSAR data. The data and scripts for collaborative QSAR modeling are available in the Additional file 1.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 25
    Publication date: 2012-08-02
    Description: Background: Accurate gene structure annotation is a fundamental but somewhat elusive goal of genome projects, as witnessed by the fact that (model) genomes typically undergo several cycles of re-annotation. In many cases, it is not only different versions of annotations that need to be compared but also different sources of annotation of the same genome, derived from distinct gene prediction workflows. Such comparisons are of interest to annotation providers, prediction software developers, and end-users, who all need to assess what is common and what is different among distinct annotation sources. We developed ParsEval, a software application for pairwise comparison of sets of gene structure annotations. ParsEval calculates several statistics that highlight the similarities and differences between the two sets of annotations provided. These statistics are presented in an aggregate summary report, with additional details provided as individual reports specific to non-overlapping, gene-model-centric genomic loci. Genome-browser-styled graphics embedded in these reports help visualize the genomic context of the annotations. Output from ParsEval is both easily read and parsed, enabling systematic identification of problematic gene models for subsequent focused analysis. Results: ParsEval is capable of analyzing annotations for large eukaryotic genomes on typical desktop or laptop hardware. In comparison to existing methods, ParsEval exhibits a considerable performance improvement, both in terms of runtime and memory consumption. Reports from ParsEval can provide relevant biological insights into the gene structure annotations being compared. Conclusions: Implemented in C, ParsEval provides the quickest and most feature-rich solution for genome annotation comparison to date. The source code is freely available (under an ISC license) at http://parseval.sourceforge.net/.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 26
    Publication date: 2012-08-03
    Description: Background: Web-based synteny visualization tools are important for sharing data and revealing patterns of complicated genome conservation and rearrangements. Such tools should allow biologists to upload genomic data for their own analysis. This requirement is critical because individual biologists are generating large amounts of genomic sequences that would quickly overwhelm any centralized web resource attempting to collect and display all those data. Recently, we published a web-based synteny viewer, GSV, which was designed to satisfy the above requirement. However, GSV can only compare two genomes at a time. Extending the functionality of GSV to visualize multiple genomes is important to meet the increasing demand of the research community. Results: We have developed a multi-Genome Synteny Viewer (mGSV). Similar to GSV, mGSV is a web-based tool that allows users to upload their own genomic data files for visualization. Multiple genomes can be presented in a single integrated view with an enhanced user interface. Users can navigate through all the selected genomes in either pairwise or multiple viewing mode to examine conserved genomic regions as well as the accompanying genome annotations. Besides serving users who manually interact with the web server, mGSV also provides Web Services for machine-to-machine communication to accept data sent by other remote resources. The entire mGSV package can also be downloaded for easy local installation. Conclusions: mGSV significantly enhances the original functionalities of GSV. A web server hosting mGSV is provided at http://cas-bioinfo.cas.unt.edu/mgsv.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 27
    Publication date: 2012-08-03
    Description: Background: Increasingly, biologists and biochemists use computational tools to design experiments to probe the function of proteins and/or to engineer them for a variety of different purposes. The most effective strategies rely on knowledge of the three-dimensional structure of the protein of interest. However, it is often the case that an experimental structure is not available and models of different quality are used instead. On the other hand, the relationship between the quality of a model and its appropriate use is not easy to derive in general, and so far it has been analyzed in detail only for specific applications. Results: This paper describes a database and related software tools that allow testing of a given structure-based method on models of a protein representing different levels of accuracy. Comparing the results of a computational experiment on the experimental structure and on a set of its decoy models allows developers and users to assess the specific threshold of accuracy required to perform the task effectively. Conclusions: The ModelDB server automatically builds decoy models of different accuracy for a given protein of known structure and provides a set of useful tools for their analysis. Pre-computed data for a non-redundant set of deposited protein structures are available for analysis and download in the ModelDB database.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 28
    Publication date: 2012-07-03
    Description: Background: Protein-protein, cell-signaling, metabolic, and transcriptional interaction networks are useful for identifying connections between lists of experimentally identified genes/proteins. However, besides physical or co-expression interactions there are many ways in which pairs of genes, or their protein products, can be associated. By systematically incorporating knowledge on shared properties of genes from diverse sources to build functional association networks (FANs), researchers may be able to identify additional functional interactions between groups of genes that are not readily apparent. Results: Genes2FANs is a web-based tool and a database that utilizes 14 carefully constructed FANs and a large-scale protein-protein interaction (PPI) network to build subnetworks that connect input lists of human and mouse genes. The FANs are created from mammalian gene set libraries where mouse genes are converted to their human orthologs. The tool takes as input a list of human or mouse Entrez gene symbols to produce a subnetwork and a ranked list of intermediate genes that are used to connect the query input list. In addition, users can enter any PubMed search term and then the system automatically converts the returned results to gene lists using GeneRIF. This gene list is then used as input to generate a subnetwork from the user's PubMed query. As a case study, we applied Genes2FANs to connect disease genes from 90 well studied disorders. We find an inverse correlation between the counts of links connecting disease genes through PPIs and the counts of links connecting disease genes through FANs, separating diseases into two categories. Conclusions: Genes2FANs is a useful tool for interpreting the relationships between gene/protein lists in the context of their various functions and networks. Combining functional association interactions with physical PPIs can be useful for revealing new biology and can help form hypotheses for further experimentation. Our finding that disease genes in many cancers are mostly connected through PPIs whereas other complex diseases, such as autism and type-2 diabetes, are mostly connected through FANs without PPIs, can guide better strategies for disease gene discovery. Genes2FANs is available at: http://actin.pharm.mssm.edu/genes2FANs.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
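    The connection step described in the Genes2FANs entry above can be pictured with a small Python sketch. The edge list and gene symbols below are hypothetical, and the real tool ranks intermediates across 14 FANs plus a PPI network rather than one toy graph:

    from collections import defaultdict

    # Hypothetical undirected network and query list.
    edges = [("TP53", "MDM2"), ("TP53", "EP300"), ("EP300", "BRCA1"),
             ("MDM2", "BRCA1"), ("EGFR", "GRB2"), ("GRB2", "BRCA1")]
    query = {"TP53", "BRCA1", "EGFR"}

    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Rank candidate intermediate genes by how many query genes they touch.
    scores = {n: len(neighbors[n] & query) for n in set(neighbors) - query}
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    print(ranked)   # MDM2, EP300 and GRB2 each connect two query genes

    Intermediates touching several query genes are the ones worth adding to the subnetwork, which is essentially what a ranked intermediate list conveys.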
  • 29
    Publication date: 2012-07-03
    Description: Background: Due to hybridization events in evolution, studying two different genes of a set of species may yield two related but different phylogenetic trees for the set of species. In this case, we want to combine the two phylogenetic trees into a hybridization network with the fewest hybridization events. This leads to three computational problems, namely, the problem of computing the minimum size of a hybridization network, the problem of constructing one minimum hybridization network, and the problem of enumerating a representative set of minimum hybridization networks. The previously best software tools for these problems (namely, Chen and Wang's HybridNet and Albrecht et al.'s Dendroscope 3) run very slowly for large instances that cannot be reduced to relatively small instances. Indeed, when the minimum size of a hybridization network of two given trees is larger than 23 and the problem for the trees cannot be reduced to relatively smaller independent subproblems, HybridNet almost always takes longer than 1 day and Dendroscope 3 often fails to complete. Thus, a faster software tool for these problems is needed. Results: We develop a software tool in ANSI C, named FastHN, for the following problems: computing the minimum size of a hybridization network, constructing one minimum hybridization network, and enumerating a representative set of minimum hybridization networks. We obtain FastHN by refining HybridNet with three ideas. The first idea is to preprocess the input trees so that the trees become smaller or the problem becomes to solve two or more relatively smaller independent subproblems. The second idea is to use a fast algorithm for computing the rSPR distance of two given phylogenetic trees to cut more branches of the search tree in the exhaustive-search stage of the algorithm. The third idea is that during the exhaustive-search stage of the algorithm, we find two sibling leaves in one of the two forests (obtained from the given trees by cutting some edges) such that they are as far apart as possible in the other forest. As a result, FastHN always runs much faster than HybridNet. Unlike Dendroscope 3, FastHN is a single-threaded program. Despite this disadvantage, our experimental data show that FastHN runs substantially faster than the multi-threaded Dendroscope 3 on a PC with multiple cores. Indeed, FastHN can finish within 16 minutes (on average on a Windows-7 (x64) desktop PC with i7-2600 CPU) even if the minimum size of a hybridization network of two given trees is about 25, the trees each have 100 leaves, and the problem for the input trees cannot be reduced to two or more independent subproblems via cluster reductions. It is also worth mentioning that, like HybridNet, FastHN does not use much memory (indeed, the amount of memory is at most quadratic in the input size). In contrast, Dendroscope 3 uses a huge amount of memory. Executables of FastHN for Windows XP (x86), Windows 7 (x64), Linux, and Mac OS are available. Conclusions: For both biological datasets and simulated datasets, our experimental results show that FastHN runs substantially faster than HybridNet and Dendroscope 3. The superiority of FastHN in speed over the previous tools becomes more significant as the hybridization number becomes larger. In addition, FastHN uses much less memory than Dendroscope 3 and uses the same amount of memory as HybridNet.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 30
    Publication date: 2012-08-22
    Description: Background: Ongoing innovation in phylogenetics and evolutionary biology has been accompanied by a proliferation of software tools, data formats, analytical techniques and web servers. This brings with it the challenge of integrating phylogenetic and other related biological data found in a wide variety of formats, and underlines the need for reusable software that can read, manipulate and transform this information into the various forms required to build computational pipelines. Results: We built a Python software library for working with phylogenetic data that is tightly integrated with Biopython, a broad-ranging toolkit for computational biology. Our library, Bio.Phylo, is highly interoperable with existing libraries, tools and standards, and is capable of parsing common file formats for phylogenetic trees, performing basic transformations and manipulations, attaching rich annotations, and visualizing trees. We unified the modules for working with the standard file formats Newick, NEXUS and phyloXML behind a consistent and simple API, providing a common set of functionality independent of the data source. Conclusions: Bio.Phylo meets a growing need in bioinformatics for working with heterogeneous types of phylogenetic data. By supporting interoperability with multiple file formats and leveraging existing Biopython features, this library simplifies the construction of phylogenetic workflows. We also provide examples of the benefits of building a community around a shared open-source project. Bio.Phylo is included with Biopython, available through the Biopython website, http://biopython.org.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
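    Since Bio.Phylo ships with Biopython, its basic usage can be shown directly; this snippet parses a Newick tree, renders it as ASCII art, and queries it (the tree string itself is an arbitrary example):

    from io import StringIO
    from Bio import Phylo

    # Parse a Newick tree from a string (a file path works the same way).
    tree = Phylo.read(StringIO("(((A:1,B:1):1,C:2):1,D:3);"), "newick")

    Phylo.draw_ascii(tree)                                  # quick visualization
    print([leaf.name for leaf in tree.get_terminals()])    # ['A', 'B', 'C', 'D']
    print(tree.total_branch_length())                      # sum of branch lengths

    The same Phylo.read/Phylo.write calls also handle NEXUS and phyloXML, which is the consistent API the abstract refers to.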
  • 31
    Publication date: 2012-08-22
    Description: Background: The increased use of multi-locus data sets for phylogenetic reconstruction has increased the need to determine whether a set of gene trees significantly deviate from the phylogenetic patterns of other genes. Such unusual gene trees may have been influenced by other evolutionary processes such as selection, gene duplication, or horizontal gene transfer. Results: Motivated by this problem we propose a nonparametric goodness-of-fit test for two empirical distributions of gene trees, and we developed the software GeneOut to estimate a p-value for the test. Our approach maps trees into a multi-dimensional vector space and then applies support vector machines (SVMs) to measure the separation between two sets of pre-defined trees. We use a permutation test to assess the significance of the SVM separation. To demonstrate the performance of GeneOut, we applied it to the comparison of gene trees simulated within different species trees across a range of species tree depths. Applied directly to sets of simulated gene trees with large sample sizes, GeneOut was able to detect very small differences between two sets of gene trees generated under different species trees. Our statistical test can also include tree reconstruction into its test framework through a variety of phylogenetic optimality criteria. When applied to DNA sequence data simulated from different sets of gene trees, results in the form of receiver operating characteristic (ROC) curves indicated that GeneOut performed well in the detection of differences between sets of trees with different distributions in a multi-dimensional space. Furthermore, it controlled false positive and false negative rates very well, indicating a high degree of accuracy. Conclusions: The non-parametric nature of our statistical test provides fast and efficient analyses, and makes it an applicable test for any scenario where evolutionary or other factors can lead to trees with different multi-dimensional distributions. The software GeneOut is freely available under the GNU public license.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
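    The core of the GeneOut procedure, an SVM separation score assessed by a permutation test, can be sketched in a few lines of Python. GeneOut itself is a separate program; the random "tree vectors" below merely stand in for trees mapped into a vector space, and training accuracy is used here as a crude separation score for illustration:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # Stand-ins for two sets of gene trees mapped into a 5-D vector space.
    X = np.vstack([rng.normal(0.0, 1.0, (20, 5)),
                   rng.normal(0.8, 1.0, (20, 5))])
    y = np.array([0] * 20 + [1] * 20)

    def separation(X, y):
        # Training accuracy of a linear SVM as a simple separation score.
        return SVC(kernel="linear").fit(X, y).score(X, y)

    observed = separation(X, y)
    perm = [separation(X, rng.permutation(y)) for _ in range(200)]
    p_value = (1 + sum(s >= observed for s in perm)) / (1 + len(perm))
    print(f"separation={observed:.2f}  permutation p={p_value:.3f}")

    A small p-value means the two sets of trees occupy detectably different regions of the vector space, which is exactly the question the test answers.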
  • 32
    Publication date: 2012-08-23
    Description: Background: Histone deacetylase (HDAC) is a novel target for the treatment of cancer, and it can be classified into three classes, i.e., classes I, II, and IV. Inhibitors that selectively target individual HDACs have proved to be better candidate antitumor drugs. To screen selective HDAC inhibitors, several proteochemometric (PCM) models based on different combinations of three kinds of protein descriptors, two kinds of ligand descriptors and multiplication cross-terms were constructed in our study. Results: The results show that structure similarity descriptors are better than sequence similarity descriptors and geometry descriptors in the characterization of HDACs. Furthermore, the predictive ability was not improved by introducing the cross-terms in our models. Finally, the best PCM model, based on protein structure similarity descriptors and 32-dimensional general descriptors, was derived (R^2 = 0.9897, Q^2_test = 0.7542), which shows a powerful ability to screen selective HDAC inhibitors. Conclusions: Our best model can not only predict the activities of inhibitors for each HDAC isoform, but also screen and distinguish class-selective inhibitors and even more isoform-selective inhibitors, thus providing a potential way to discover or design novel candidate antitumor drugs with reduced side effects.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 33
    Publication date: 2012-08-23
    Description: Background: A scientific name for an organism can be associated with almost all biological data. Name identification is an important step in many text mining tasks aiming to extract useful information from biological, biomedical and biodiversity text sources. A scientific name acts as an important metadata element to link biological information. Results: We present NetiNeti (Name Extraction from Textual Information-Name Extraction for Taxonomic Indexing), a machine learning based approach for recognition of scientific names, including the discovery of new species names from text, that will also handle misspellings, OCR errors and other variations in names. The system generates candidate names using rules for scientific names and applies probabilistic machine learning methods to classify names based on structural features of candidate names and features derived from their contexts. NetiNeti can also disambiguate scientific names from other names using the contextual information. We evaluated NetiNeti on legacy biodiversity texts and biomedical literature (MEDLINE). NetiNeti performs better (precision = 98.9% and recall = 70.5%) compared to a popular dictionary-based approach (precision = 97.5% and recall = 54.3%) on a 600-page biodiversity book that was manually marked by an annotator. On a small set of PubMed Central's full text articles annotated with scientific names, the precision and recall values are 98.5% and 96.2% respectively. NetiNeti found more than 190,000 unique binomial and trinomial names in more than 1,880,000 PubMed records when used on the full MEDLINE database. NetiNeti also successfully identifies almost all of the new species names mentioned within web pages. Additionally, we present the comparison results of various machine learning algorithms on our annotated corpus. Naive Bayes and Maximum Entropy with Generalized Iterative Scaling (GIS) parameter estimation are the top two performing algorithms. Conclusions: We present NetiNeti, a machine learning based approach for identification and discovery of scientific names. The system implementing the approach can be accessed at http://namefinding.ubio.org
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
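    A toy version of NetiNeti's candidate-generation stage (rules first, classifier second) in Python; the single regex below is far cruder than the tool's actual rules and deliberately overgenerates to show why the probabilistic classification step is needed:

    import re

    # Naive rule: a capitalized word followed by a lowercase word.
    # NetiNeti's real rules also handle abbreviations, trinomials, OCR variants.
    BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{2,})\b")

    text = ("The parasite Leishmania major and the yeast "
            "Saccharomyces cerevisiae were studied.")
    candidates = [m.group(0) for m in BINOMIAL.finditer(text)]
    print(candidates)
    # ['The parasite', 'Leishmania major', 'Saccharomyces cerevisiae']
    # Rule-based candidates overgenerate ('The parasite'), which is why a
    # trained classifier scores them before accepting them as names.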
  • 34
    Publication date: 2012-10-16
    Description: Background: Plants are important as foods, pharmaceuticals, biorenewable chemicals, fuel resources, bioremediation tools and general tools for recombinant technology. The study of plant biological pathways is advanced by easy access to integrated data sources. Today, various plant data sources are scattered throughout the web, making it increasingly complicated to build comprehensive datasets. Results: MetNet Online is a web-based portal that provides access to a regulatory and metabolic plant pathway database. The database and portal integrate Arabidopsis, soybean (Glycine max) and grapevine (Vitis vinifera) data. Pathways are enriched with known or predicted information on subcellular location. MetNet Online enables pathways, interactions and entities to be browsed or searched by multiple categories such as subcellular compartment, pathway ontology, and GO term. In addition to this, the "My MetNet" feature allows registered users to bookmark content and track, import and export customized lists of entities. Users can also construct custom networks using existing pathways and/or interactions as building blocks. Conclusion: The site can be reached at http://www.metnetonline.org. Extensive video tutorials on how to use the site are available through http://www.metnetonline.org/tutorial/.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 35
    Publication date: 2012-10-20
    Description: Background: Dysregulation of imprinted genes, which are expressed in a parent-of-origin-specific manner, plays an important role in various human diseases, such as cancer and behavioral disorders. To date, however, fewer than 100 imprinted genes have been identified in the human genome. The recent availability of high-throughput technology makes large-scale prediction of imprinted genes possible. Here we propose a Bayesian model (dsPIG) to predict imprinted genes on the basis of allelic expression observed in mRNA-Seq data of independent human tissues. Results: Our model (dsPIG) was capable of identifying imprinted genes with high sensitivity and specificity and a low false discovery rate when the number of sequenced tissue samples was fairly large, according to simulations. By applying dsPIG to the mRNA-Seq data, we predicted 94 imprinted genes in 20 cerebellum samples and 57 imprinted genes in 9 diverse tissue samples with expected low false discovery rates. We also assessed dsPIG using previously validated imprinted and non-imprinted genes. With simulations, we further analyzed how imbalanced allelic expression of non-imprinted genes or different minor allele frequencies affected the predictions of dsPIG. Interestingly, we found that, among biallelically expressed genes, at least 18 genes expressed significantly more transcripts from one allele than the other among different individuals and tissues. Conclusion: With the prevalence of the mRNA-Seq technology, dsPIG has become a useful tool for analysis of allelic expression and large-scale prediction of imprinted genes. For ease of use, we have set up a web service and also provided an R package for dsPIG at http://www.shoudanliang.com/dsPIG/.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
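    dsPIG is a Bayesian model, but the allelic-imbalance signal it builds on can be illustrated with a plain binomial test on hypothetical allele counts at a heterozygous site (this is not the paper's model, just the underlying intuition that imprinted genes express one allele far more than the other):

    from scipy.stats import binomtest

    ref_reads, alt_reads = 46, 4       # hypothetical mRNA-Seq allele counts
    res = binomtest(ref_reads, ref_reads + alt_reads, p=0.5)
    print(f"imbalance p-value = {res.pvalue:.2e}")
    # A tiny p-value flags strongly monoallelic expression at this site;
    # dsPIG aggregates such evidence across sites and tissues in a Bayesian way.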
  • 36
    Publication date: 2012-08-28
    Description: Modern analytical methods in biology and chemistry use separation techniques coupled to sensitive detectors, such as gas chromatography-mass spectrometry (GC-MS) and liquid chromatography-mass spectrometry (LC-MS). These hyphenated methods provide high-dimensional data. Comparing such data manually to find corresponding signals is a laborious task, as each experiment usually consists of thousands of individual scans, each containing hundreds or even thousands of distinct signals. In order to allow for successful identification of metabolites or proteins within such data, especially in the context of metabolomics and proteomics, an accurate alignment and matching of corresponding features between two or more experiments is required. Such a matching algorithm should capture fluctuations in the chromatographic system which lead to non-linear distortions on the time axis, as well as systematic changes in recorded intensities. Many different algorithms for the retention time alignment of GC-MS and LC-MS data have been proposed and published, but all of them focus either on aligning previously extracted peak features or on aligning and comparing the complete raw data containing all available features. Results: In this paper we introduce two algorithms for retention time alignment of multiple GC-MS datasets: multiple alignment by bidirectional best hits peak assignment and cluster extension (BiPACE) and center-star multiple alignment by pairwise partitioned dynamic time warping (CeMAPP-DTW). We show how the similarity-based peak group matching method BiPACE may be used for multiple alignment calculation individually and how it can be used as a preprocessing step for the pairwise alignments performed by CeMAPP-DTW. We evaluate the algorithms individually and in combination on a previously published small GC-MS dataset studying the Leishmania parasite and on a larger GC-MS dataset studying grains of wheat (Triticum aestivum). Conclusions: We have shown that BiPACE achieves very high precision and recall and a very low number of false positive peak assignments on both evaluation datasets. CeMAPP-DTW finds a high number of true positives when executed on its own, but achieves even better results when BiPACE is used to constrain its search space. The source code of both algorithms is included in the OpenSource software framework Maltcms, which is available from http://maltcms.sf.net. The evaluation scripts of the present study are available from the same source.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
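    The bidirectional-best-hit idea at the heart of BiPACE can be sketched on one-dimensional retention times; the real algorithm also scores mass-spectral similarity and applies tolerance thresholds, and the peak times below are invented:

    # Two runs' peaks: name -> retention time (minutes).
    peaks_a = {"a1": 5.02, "a2": 7.51, "a3": 9.80}
    peaks_b = {"b1": 5.10, "b2": 7.49, "b3": 12.00}

    def best_hit(t, pool):
        # Nearest peak in the other run; a real implementation would also
        # reject hits beyond a maximum retention-time tolerance.
        return min(pool, key=lambda q: abs(pool[q] - t))

    pairs = []
    for a, ta in peaks_a.items():
        b = best_hit(ta, peaks_b)
        if best_hit(peaks_b[b], peaks_a) == a:   # reciprocal best hit
            pairs.append((a, b))
    print(pairs)

    Only reciprocal nearest neighbours are matched, which is what makes the assignment robust to spurious one-sided matches; BiPACE then extends such pairs into multi-run clusters.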
  • 37
    Publication date: 2012-08-29
    Description: Background: Biomedical processes can provide essential information about the (mal-)functioning of an organism and are thus frequently represented in biomedical terminologies and ontologies, including the GO Biological Process branch. These processes often need to be described and categorised in terms of their attributes, such as rates or regularities. The adequate representation of such process attributes has been a contentious issue in bio-ontologies recently, and domain ontologies have correspondingly developed ad hoc workarounds that compromise interoperability and logical consistency. Results: We present a design pattern for the representation of process attributes that is compatible with upper ontology frameworks such as BFO and BioTop. Our solution rests on two key tenets: firstly, that many of the sorts of process attributes which are biomedically interesting can be characterised by the ways that repeated parts of such processes constitute, in combination, an overall process; secondly, that entities for which a full logical definition can be assigned do not need to be treated as primitive within a formal ontology framework. We apply this approach to the challenge of modelling and automatically classifying examples of normal and abnormal rates and patterns of heart beating processes, and discuss the expressivity required in the underlying ontology representation language. We provide full definitions for process attributes at increasing levels of domain complexity. Conclusions: We show that a logical definition of process attributes is feasible, though limited by the expressivity of DL languages, so that the creation of primitives is still necessary. This finding may endorse current formal upper-ontology frameworks as a way of ensuring consistency, interoperability and clarity.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 38
    Publication date: 2012-08-28
    Description: No description available
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 39
    Publication date: 2012-08-31
    Description: Background: The epidermal growth factor receptor (EGFR) signaling pathway and angiogenesis in brain cancer act as an engine for tumor initiation, expansion and response to therapy. Since the existing literature does not have any models that investigate the impact of both angiogenesis and molecular signaling pathways on treatment, we propose a novel multi-scale, agent-based computational model that includes both angiogenesis and EGFR modules to study the response of brain cancer under tyrosine kinase inhibitors (TKIs) treatment. Results: The novel angiogenesis module integrated into the agent-based tumor model is based on a set of reaction-diffusion equations that describe the spatio-temporal evolution of the distributions of micro-environmental factors such as glucose, oxygen, TGFalpha, VEGF and fibronectin. These molecular species regulate tumor growth during angiogenesis. Each tumor cell is equipped with an EGFR signaling pathway linked to a cell-cycle pathway to determine its phenotype. EGFR TKIs are delivered through the blood vessels of the tumor microvasculature and the response to treatment is studied. Conclusions: Our simulations demonstrated that the entire tumor growth profile is a collective behaviour of cells regulated by the EGFR signaling pathway and the cell cycle. We also found that angiogenesis has a dual effect under TKI treatment: on one hand, through the neo-vasculature TKIs are delivered to decrease tumor invasion; on the other hand, the neo-vasculature can transport glucose and oxygen to tumor cells to maintain their metabolism, which results in an increase of the cell survival rate in the late simulation stages.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
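    The reaction-diffusion module can be pictured with one explicit finite-difference update for a single diffusible species with linear uptake; the grid, coefficients and time step below are invented for illustration and are not the paper's calibrated model:

    import numpy as np

    def diffuse_step(C, D=0.1, uptake=0.01, dt=0.1):
        # 5-point discrete Laplacian on a periodic grid.
        lap = (np.roll(C, 1, 0) + np.roll(C, -1, 0) +
               np.roll(C, 1, 1) + np.roll(C, -1, 1) - 4 * C)
        # dC/dt = D * Laplacian(C) - uptake * C, stepped forward in time.
        return C + dt * (D * lap - uptake * C)

    C = np.ones((50, 50))
    C[25, 25] = 5.0                # a local source, e.g. near a vessel
    for _ in range(100):
        C = diffuse_step(C)
    print(C.max(), C.min())        # the source spreads out and decays

    Each micro-environmental factor in such a model (glucose, oxygen, VEGF, ...) gets an equation of this general form, with its own diffusion and reaction terms.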
  • 40
    Publication date: 2012-09-05
    Description: Background: Next-generation sequencing technologies have become important tools for genome-wide studies. However, the quality scores that are assigned to each base have been shown to be inaccurate. If the quality scores are used in downstream analyses, these inaccuracies can have a significant impact on the results. Results: Here we present ReQON, a tool that recalibrates the base quality scores from an input BAM file of aligned sequencing data using logistic regression. ReQON also generates diagnostic plots showing the effectiveness of the recalibration. We show that ReQON produces quality scores that are both more accurate, in the sense that they more closely correspond to the probability of a sequencing error, and better at discriminating between sequencing errors and non-errors than the original quality scores. We also compare ReQON to other available recalibration tools and show that ReQON is less biased and performs favorably in terms of quality score accuracy. Conclusion: ReQON is an open source software package, written in R and available through Bioconductor, for recalibrating base quality scores for next-generation sequencing data. ReQON produces a new BAM file with more accurate quality scores, which can improve the results of downstream analysis, and produces several diagnostic plots showing the effectiveness of the recalibration.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
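    ReQON is an R/Bioconductor package, so the following Python sketch is only an analogy of its central idea: fit a logistic regression of the error indicator on the reported quality, then convert the predicted error probabilities back to Phred-scaled scores (the simulated miscalibration is made up):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    reported_q = rng.integers(10, 41, size=5000).astype(float)
    # Simulate miscalibration: true error rate higher than reported Q implies.
    true_p = 10 ** (-(0.7 * reported_q) / 10)
    errors = rng.random(5000) < true_p

    model = LogisticRegression().fit(reported_q.reshape(-1, 1), errors)
    p_err = model.predict_proba(reported_q.reshape(-1, 1))[:, 1]
    recalibrated_q = -10 * np.log10(p_err)      # back to the Phred scale
    print(reported_q[:5], recalibrated_q[:5].round(1))

    The recalibrated scores track the empirical error probability instead of the instrument's optimistic estimate, which is the property the paper evaluates.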
  • 41
    Publication date: 2012-08-26
    Description: Background: Quantitative analysis of changes in dendritic spine morphology has become an interesting issue in contemporary neuroscience. However, the diversity of the dendritic spine population might seriously influence the results of measurements in which their morphology is studied, and the detection of differences in spine morphology between control and test groups is often compromised by the number of dendritic spines taken for analysis. In order to estimate how severe such an impact is, we have performed Monte Carlo simulations examining various experimental setups and statistical approaches. The confocal images of dendritic spines from hippocampal dissociated cultures have been used to create a set of variables exploited as the simulation resources. Results: The tabulated results of the simulations are given, providing the number of dendritic spines required for the detection of hidden morphological differences between control and test groups in spine head-width, length and area. Among these three variables, it is in head-width that changes are most easily detected. Simulations of changes occurring in a subpopulation of spines reveal a strong dependence of detectability on the statistical approach applied. An analysis based on comparing the percentage of spines in subclasses is less sensitive than the direct comparison of the relevant variables describing spine morphology. Conclusions: We evaluated the sampling aspect and the effect of systematic morphological variation on detecting differences in spine morphology. The results provided may serve as a guideline in selecting the number of samples to be studied in a planned experiment. Our simulations might be a step towards the development of a standardized method for the quantitative comparison of dendritic spine morphology, in which different sources of error are considered.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
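    The kind of Monte Carlo power calculation tabulated in the entry above can be sketched directly; the head-width mean, spread and effect size below are invented placeholders, not the paper's measured values:

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(2)

    def power(n, shift=0.05, sd=0.15, reps=1000, alpha=0.05):
        # Fraction of simulated experiments in which a t-test detects the shift.
        hits = 0
        for _ in range(reps):
            a = rng.normal(0.50, sd, n)           # control head widths (um)
            b = rng.normal(0.50 + shift, sd, n)   # test group, 10% wider
            hits += ttest_ind(a, b).pvalue < alpha
        return hits / reps

    for n in (50, 100, 200, 400):
        print(n, power(n))   # power rises with the number of spines sampled

    Reading off the smallest n at which power reaches the desired level (e.g. 0.8) is exactly how such tables guide the choice of sample size.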
  • 42
    Publication date: 2012-09-04
    Description: Background: Food security is an issue that has come under renewed scrutiny amidst concerns that substantial yield increases in cereal crops are required to feed the world's booming population. Wheat is of fundamental importance in this regard, being one of the three most important crops for both human consumption and livestock feed; however, increases in crop yields have not kept pace with the demands of a growing world population. In order to address this issue, plant breeders require new molecular tools to help them identify genes for important agronomic traits that can be introduced into elite varieties. Studies of the genome using next-generation sequencing enable the identification of molecular markers such as single nucleotide polymorphisms that may be used by breeders to identify and follow genes when breeding new varieties. The development and application of next-generation sequencing technologies has made the characterisation of SNP markers in wheat relatively cheap and straightforward. There is a growing need for the widespread dissemination of this information to plant breeders. Description: CerealsDB is an online resource containing a range of genomic datasets for wheat (Triticum aestivum) that will assist plant breeders and scientists to select the most appropriate markers for marker-assisted selection. CerealsDB includes a database which currently contains in excess of 100,000 putative varietal SNPs, of which several thousand have been experimentally validated. In addition, CerealsDB contains databases for DArT markers and EST sequences, and links to a draft genome sequence for the wheat variety Chinese Spring. Conclusion: CerealsDB is an open access website that is rapidly becoming an invaluable resource within the wheat research and plant breeding communities.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 43
    Publication date: 2012-09-04
    Description: Background: One of the crucial steps in regulation of gene expression is the binding of transcription factor(s) to specific DNA sequences. Knowledge of the binding affinity and specificity at a structural level between transcription factors and their target sites has important implications in our understanding of the mechanism of gene regulation. Due to their unique functions and binding specificity, there is a need for a transcription factor-specific, structure-based database and corresponding web service to facilitate structural bioinformatics studies of transcription factor-DNA interactions, such as development of knowledge-based interaction potentials, transcription factor-DNA docking, binding-induced conformational changes, and the thermodynamics of protein-DNA interactions. Description: TFinDit is a relational database and a web search tool for studying transcription factor-DNA interactions. The database contains annotated transcription factor-DNA complex structures and related data, such as unbound protein structures, thermodynamic data, and binding sequences for the corresponding transcription factors in the complex structures. TFinDit also provides a user-friendly interface and allows users to either query individual entries or generate datasets through culling the database based on one or more search criteria. Conclusions: TFinDit is a specialized structural database with annotated transcription factor-DNA complex structures and other preprocessed data. We believe that this database/web service can facilitate the development and testing of TF-DNA interaction potentials and TF-DNA docking algorithms, and the study of protein-DNA recognition mechanisms.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 44
    Publication date: 2012-09-06
    Description: Background: High-density oligonucleotide microarrays are an appropriate technology for genomic analysis, and are particularly useful in the generation of transcriptional maps, ChIP-on-chip studies and re-sequencing of the genome. Transcriptome analysis of tiling microarray data facilitates the discovery of novel transcripts and the assessment of differential expression in diverse experimental conditions. Although new technologies such as next-generation sequencing have appeared, microarrays might still be useful for the study of small genomes or for the analysis of genomic regions with custom microarrays due to their lower price and good accuracy in expression quantification. Results: Here, we propose a novel wavelet-based method, named ZCL (zero-crossing lines), for the combined denoising and segmentation of tiling signals. The denoising is performed with the classical SUREshrink method and the detection of transcriptionally active regions is based on the computation of the Continuous Wavelet Transform (CWT). In particular, the detection of the transitions is implemented as the thresholding of the zero-crossing lines. The algorithm described has been applied to the public Saccharomyces cerevisiae dataset and it has been compared with two well-known algorithms: pseudo-median sliding window (PMSW) and the structural change model (SCM). As a proof-of-principle, we applied the ZCL algorithm to the analysis of the custom tiling microarray hybridization results of a S. aureus mutant deficient in the sigma B transcription factor. The challenge was to identify those transcripts whose expression decreases in the absence of sigma B. Conclusions: The proposed method achieves the best performance in terms of positive predictive value (PPV) while its sensitivity is similar to the other algorithms used for the comparison. The computation time needed to process the transcriptional signals is low compared with model-based methods and in the same range as those based on the use of filters. Automatic parameter selection has been incorporated and, moreover, the method can be easily adapted to a parallel implementation. We can conclude that the proposed method is well suited for the analysis of tiling signals, in which transcriptional activity is often hidden in the noise. Finally, the quantification and differential expression analysis of the S. aureus dataset have demonstrated the utility of this novel method in the biological analysis of the S. aureus transcriptome.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
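    Loosely echoing the ZCL idea (denoise, then threshold zero-crossings to locate transitions), this sketch substitutes a moving average for SUREshrink and a second derivative for the CWT, so it is an analogy rather than the published algorithm:

    import numpy as np

    rng = np.random.default_rng(3)
    signal = np.concatenate([np.zeros(100), np.ones(150), np.zeros(100)])
    noisy = signal + rng.normal(0, 0.2, signal.size)

    smooth = np.convolve(noisy, np.ones(15) / 15, mode="same")  # crude denoising
    d2 = np.gradient(np.gradient(smooth))                       # curvature
    zc = np.where(np.diff(np.sign(d2)) != 0)[0]                 # zero-crossing candidates
    slope = np.abs(np.gradient(smooth))
    edges = zc[slope[zc] > 0.5 * slope.max()]                   # keep strong transitions
    print(edges)   # indices near the true boundaries at 100 and 250

    The thresholding step is what separates genuine transcript boundaries from the many zero-crossings produced by noise, which is the role the zero-crossing lines play in ZCL.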
  • 45
    Publication date: 2012-08-29
    Description: Background: A number of software packages are available to generate DNA multiple sequence alignments (MSAs) evolved under continuous-time Markov processes on phylogenetic trees. On the other hand, methods of simulating the DNA MSA directly from the transition matrices do not exist. Moreover, existing software is restricted to time-reversible models and is not optimized to generate nonhomogeneous data (i.e. placing distinct substitution rates at different lineages). Results: We present the first package designed to generate MSAs evolving under discrete-time Markov processes on phylogenetic trees, directly from probability substitution matrices. Based on the input model and a phylogenetic tree in the Newick format (with branch lengths measured as the expected number of substitutions per site), the algorithm produces DNA alignments of the desired length. GenNon-h is publicly available for download. Conclusion: The software presented here is an efficient tool to generate DNA MSAs on a given phylogenetic tree. GenNon-h provides the user with nonstationary or nonhomogeneous phylogenetic data that is well suited for testing complex biological hypotheses, exploring the limits of the reconstruction algorithms and their robustness to such models.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
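    Simulating an alignment directly from substitution probability matrices on a fixed tree, as GenNon-h does, reduces to drawing each child sequence from the rows of a matrix. This sketch uses an invented 3-taxon tree and Jukes-Cantor-like matrices, with one deliberately faster lineage to mimic nonhomogeneous data (none of this reproduces GenNon-h's code):

    import numpy as np

    rng = np.random.default_rng(4)
    NUC = np.array(list("ACGT"))

    def p_matrix(p_change):
        # 4x4 substitution probability matrix: equal off-diagonal rates.
        off = p_change / 3
        return np.full((4, 4), off) + np.eye(4) * (1 - p_change - off)

    def evolve(parent_states, P):
        # Draw each child state from the row of P indexed by the parent state.
        return np.array([rng.choice(4, p=P[s]) for s in parent_states])

    n_sites = 10
    root = rng.choice(4, size=n_sites, p=[0.25] * 4)
    internal = evolve(root, p_matrix(0.10))            # root -> internal node
    leaves = {"A": evolve(internal, p_matrix(0.05)),
              "B": evolve(internal, p_matrix(0.20)),   # faster lineage: nonhomogeneous
              "C": evolve(root, p_matrix(0.15))}
    for name, states in leaves.items():
        print(name, "".join(NUC[states]))

    Because each branch carries its own matrix, nothing forces time-reversibility or homogeneity, which is the flexibility the abstract emphasizes.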
  • 46
    Publication date: 2012-08-17
    Description: Background: Many biological processes are context-dependent or temporally specific. As a result, relationships between molecular constituents evolve across time and environments. While cutting-edge machine learning techniques can recover these networks, exploring and interpreting the rewiring behavior is challenging. Information visualization shines in this type of exploratory analysis, motivating the development of TVNViewer (http://sailing.cs.cmu.edu/tvnviewer), a visualization tool for dynamic network analysis. Results: In this paper, we demonstrate visualization techniques for dynamic network analysis by using TVNViewer to analyze yeast cell cycle and breast cancer progression datasets. Conclusions: TVNViewer is a powerful new visualization tool for the analysis of biological networks that change across time or space.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 47
    Publication date: 2012-08-17
    Description: Background: Numerous models for use in interpreting quantitative PCR (qPCR) data are present in recent literature. The most commonly used models assume the amplification in qPCR is exponential and fit an exponential model with a constant rate of increase to a select part of the curve. Kinetic theory may be used to model the annealing phase and does not assume constant efficiency of amplification. Mechanistic models describing the annealing phase with kinetic theory offer the most potential for accurate interpretation of qPCR data. Even so, they have not been thoroughly investigated and are rarely used for interpretation of qPCR data. New results for kinetic modeling of qPCR are presented. Results: Two models are presented in which the efficiency of amplification is based on equilibrium solutions for the annealing phase of the qPCR process. Model 1 assumes annealing of complementary target strands and annealing of target and primers are both reversible reactions and reach a dynamic equilibrium. Model 2 assumes all annealing reactions are nonreversible and equilibrium is static. Both models include the effect of primer concentration during the annealing phase. Analytic formulae are given for the equilibrium values of all single and double stranded molecules at the end of the annealing step. The equilibrium values are then used in a stepwise method to describe the whole qPCR process. Rate constants of kinetic models are the same for solutions that are identical except for possibly having different initial target concentrations. qPCR curves from such solutions are thus analyzed by simultaneous non-linear curve fitting, with the same rate constant values applying to all curves and each curve having a unique value for initial target concentration. The models were fit to two data sets for which the true initial target concentrations are known. Both models give better fit to observed qPCR data than other kinetic models present in the literature. They also give better estimates of initial target concentration. Model 1 was found to be slightly more robust than model 2, giving better estimates of initial target concentration when estimation of parameters was done for qPCR curves with very different initial target concentrations. Both models may be used to estimate the initial absolute concentration of target sequence when a standard curve is not available. Conclusions: It is argued that the kinetic approach to modeling and interpreting quantitative PCR data has the potential to give more precise estimates of the true initial target concentrations than other methods currently used for analysis of qPCR data. The two models presented here give a unified model of the qPCR process in that they explain the shape of the qPCR curve for a wide variety of initial target concentrations.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
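    A stepwise qPCR model in which per-cycle efficiency falls as primers deplete can be sketched as follows; the efficiency expression here is a qualitative stand-in, not the equilibrium formulae of model 1 or model 2:

    def qpcr_curve(target0=1e3, primer0=1e14, cycles=40):
        # Stepwise simulation: efficiency depends on the current primer pool.
        target, primer, curve = target0, primer0, []
        for _ in range(cycles):
            eff = primer / (primer + target)   # falls as target rivals primers
            new = target * eff                 # newly synthesized strands
            primer -= new                      # each new strand consumes a primer
            target += new
            curve.append(target)
        return curve

    curve = qpcr_curve()
    print([f"{x:.2e}" for x in curve[::10]])   # exponential rise, then plateau

    Early cycles double the target (eff near 1); as the target approaches the primer pool the efficiency drops and the familiar sigmoid plateau emerges, which is the behaviour a constant-efficiency exponential fit cannot capture.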
  • 48
    Publication date: 2012-08-17
    Description: Background: Variations in DNA copy number carry information on the modalities of genome evolution and mis-regulation of DNA replication in cancer cells. Their study can help localize tumor suppressor genes, distinguish different populations of cancerous cells, and identify genomic variations responsible for disease phenotypes. A number of different high throughput technologies can be used to identify copy number variable sites, and the literature documents multiple effective algorithms. We focus here on the specific problem of detecting regions where variation in copy number is relatively common in the sample at hand. This problem encompasses the cases of copy number polymorphisms, related samples, technical replicates, and cancerous sub-populations from the same individual. Results: We present a segmentation method named generalized fused lasso (GFL) to reconstruct copy number variant regions; it is based on penalized estimation and is capable of processing multiple signals jointly. Our approach is computationally very attractive and leads to sensitivity and specificity levels comparable to those of state-of-the-art specialized methodologies. We illustrate its applicability with simulated and real data sets. Conclusions: The flexibility of our framework makes it applicable to data obtained with a wide range of technologies. Its versatility and speed make GFL particularly useful in the initial screening stages of large data sets.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
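    For orientation (the abstract names the estimator but not its form), the standard single-signal fused lasso objective that GFL generalizes is, with y the observed probe log-ratios and beta the piecewise-constant copy-number estimate (the paper's multi-signal version couples several such problems, so this is a hedged reference point, not a quotation from the paper):

    \hat{\beta} \;=\; \arg\min_{\beta}\; \frac{1}{2}\sum_{i=1}^{n} (y_i - \beta_i)^2 \;+\; \lambda_1 \sum_{i=1}^{n} |\beta_i| \;+\; \lambda_2 \sum_{i=2}^{n} |\beta_i - \beta_{i-1}|

    The \lambda_1 term shrinks estimates toward zero (normal copy number) and the \lambda_2 term penalizes successive differences, forcing piecewise-constant segments; the segment boundaries are the detected copy-number breakpoints.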
  • 49
    Publication date: 2012-07-17
    Description: Background: Although genome-scale expression experiments are performed routinely in biomedical research, methods of analysis remain simplistic and their interpretation challenging. The conventional approach is to compare the expression of each gene, one at a time, between treatment groups. This implicitly treats the gene expression levels as independent, but they are in fact highly interdependent, and exploiting this enables substantial power gains to be realized. Results: We assume that information on the dependence structure between the expression levels of a set of genes is available in the form of a Bayesian network (directed acyclic graph), derived from external resources. We show how to analyze gene expression data conditional on this network. Genes whose expression is directly affected by treatment may be identified using tests for the independence of each gene and treatment, conditional on the parents of the gene in the network. We apply this approach to two datasets: one from a hepatotoxicity study in rats using a PPAR pathway, and the other from a study of the effects of smoking on the epithelial transcriptome, using a global transcription factor network. Conclusions: The proposed method is straightforward, simple to implement, gives rise to substantial power gains, and may assist in relating the experimental results to the underlying biology.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
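    A regression-based stand-in for the conditional-independence tests described above: check whether gene and treatment remain correlated after both are adjusted for the gene's parent in the network (the data are simulated here, and the paper's actual tests may take a different form):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    n = 100
    treatment = rng.integers(0, 2, n).astype(float)
    parent = rng.normal(size=n) + 0.8 * treatment          # parent responds to treatment
    gene = 0.9 * parent + rng.normal(scale=0.5, size=n)    # gene affected only via parent

    def residuals(y, x):
        # Residuals of y after linear adjustment for x (with intercept).
        X = np.column_stack([np.ones_like(y), x])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return y - X @ beta

    r, p = stats.pearsonr(residuals(gene, parent), residuals(treatment, parent))
    print(f"partial corr = {r:.3f}, p = {p:.3f}")   # near 0: no direct effect

    A gene whose association with treatment survives conditioning on its parents is a candidate for a direct treatment effect; here the effect is entirely mediated by the parent, so the partial correlation is null.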
  • 50
    Publication date: 2012-07-16
    Description: Background: Identifying variants associated with complex human traits in high-dimensional data is a central goal of genome-wide association studies. However, complicated etiologies such as gene-gene interactions are ignored by the univariate analysis usually applied in these studies. Random Forests (RF) are a popular data-mining technique that can accommodate a large number of predictor variables and allow for complex models with interactions. RF analysis produces measures of variable importance that can be used to rank the predictor variables. Thus, single nucleotide polymorphism (SNP) analysis using RFs is gaining popularity as a potential filter approach that considers interactions in high-dimensional data. However, the impact of data dimensionality on the power of RF to identify interactions has not been thoroughly explored. We investigate the ability of rankings from variable importance measures to detect gene-gene interaction effects and their potential effectiveness as filters compared to p-values from univariate logistic regression, particularly as the data becomes increasingly high-dimensional. Results: RF effectively identifies interactions in low dimensional data. As the total number of predictor variables increases, probability of detection declines more rapidly for interacting SNPs than for non-interacting SNPs, indicating that in high-dimensional data the RF variable importance measures are capturing marginal effects rather than capturing the effects of interactions. Conclusions: While RF remains a promising data-mining technique that extends univariate methods to condition on multiple variables simultaneously, RF variable importance measures fail to detect interaction effects in high-dimensional data in the absence of a strong marginal component, and therefore may not be useful as a filter technique that allows for interaction effects in genome-wide data.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
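    The paper's central observation can be reproduced in miniature with scikit-learn: give the phenotype a pure two-SNP interaction and watch the importance ranks of the interacting pair drift as noise SNPs are added (all data are simulated; this is an illustration, not the paper's experimental design):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(6)

    def ranks_of_pair(n_snps, n=500):
        X = rng.integers(0, 2, size=(n, n_snps))
        y = X[:, 0] ^ X[:, 1]      # pure interaction, no marginal effect
        rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
        order = np.argsort(rf.feature_importances_)[::-1]
        return [int(np.where(order == j)[0][0]) for j in (0, 1)]

    for p in (10, 100, 1000):
        print(p, ranks_of_pair(p))  # the interacting SNPs' ranks tend to worsen

    With few predictors the trees condition on one interacting SNP and then find the other, but as noise SNPs multiply that conditioning becomes rare, which is the dimensionality effect the abstract reports.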
  • 51
    Publication date: 2012-07-18
    Description: Background: Alpha-helical transmembrane channel and transporter proteins play vital roles in a diverse range of essential biological processes and are crucial in facilitating the passage of ions and molecules across the lipid bilayer. However, the experimental difficulties associated with obtaining high-quality crystals have led to their significant under-representation in structural databases; therefore, computational methods that can identify structural features from sequence alone are of high importance. Results: We present a method capable of automatically identifying pore-lining regions in transmembrane proteins from sequence information alone, which can then be used to determine the pore stoichiometry. By labelling pore-lining residues in crystal structures using geometric criteria, we have trained a support vector machine classifier to predict the likelihood of a transmembrane helix being involved in pore formation. Results from testing this approach under stringent cross-validation indicate that prediction accuracy of 72% is possible, while a support vector regression model is able to predict the number of subunits participating in the pore with 62% accuracy. Conclusion: To our knowledge, this is the first tool capable of identifying such regions and we present the results of applying it to a data set of sequences with available crystal structures. Our method provides a way to characterise pores in transmembrane proteins and may provide valuable insight into routes of therapeutic intervention in a number of important diseases. This software is freely available as source code from: http://bioinfadmin.cs.ucl.ac.uk/downloads/memsat-svm/
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 52
    Publication date: 2012-07-19
    Description: Background: The k-mer hash length is a key factor affecting the output of de novo transcriptome assembly packages using de Bruijn graph algorithms. Assemblies constructed with varying single k-mer choices might result in the loss of unique contiguous sequences (contigs) and relevant biological information. A common solution to this problem is the clustering of single k-mer assemblies. Even though annotation is one of the primary goals of a transcriptome assembly, evaluations of assembly strategies rarely consider the impact of k-mer selection on the annotation output. This study provides an in-depth k-mer selection analysis that is focused on the degree of functional annotation achieved for a non-model organism where no reference genome information is available. Individual k-mers and clustered assemblies (CA) were considered using three representative software packages. Pair-wise comparison analyses (between individual k-mers and CAs) were produced to reveal missing Kyoto Encyclopedia of Genes and Genomes (KEGG) ortholog identifiers (KOIs), and to determine a strategy that maximizes the recovery of biological information in a de novo transcriptome assembly. Results: Analyses of single k-mer assemblies resulted in the generation of various quantities of contigs and functional annotations within the selection window of k-mers (k-19 to k-63). For each k-mer in this window, generated assemblies contained certain unique contigs and KOIs that were not present in the other k-mer assemblies. Producing a non-redundant CA of k-mers 19 to 63 resulted in a more complete functional annotation than any single k-mer assembly. However, a fraction of unique annotations remained (~0.19 to 0.27% of total KOIs) in the assemblies of individual k-mers (k-19 to k-63) that were not present in the non-redundant CA. A workflow to recover these unique annotations is presented. Conclusions: This study demonstrated that different k-mer choices result in various quantities of unique contigs per single k-mer assembly, which affects the biological information that is retrievable from the transcriptome. This undesirable effect can be minimized, but not eliminated, with clustering of multi-k assemblies with redundancy removal. The complete extraction of biological information in de novo transcriptomics studies requires both the production of a CA and efforts to identify unique contigs that are present in individual k-mer assemblies but not in the CA.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
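    The KOI bookkeeping behind the comparisons above is plain set arithmetic; the identifiers below are placeholders, not data from the study:

    # KOI sets recovered by two single-k assemblies and a clustered assembly (CA).
    kois = {
        "k25": {"K00001", "K00002", "K00010"},
        "k31": {"K00001", "K00003"},
        "CA":  {"K00001", "K00002", "K00003", "K00004"},
    }
    union_single = kois["k25"] | kois["k31"]
    print("gained by clustering:", kois["CA"] - union_single)   # {'K00004'}
    print("lost by clustering:  ", union_single - kois["CA"])   # {'K00010'}

    The second line is the ~0.19-0.27% of annotations that remain unique to individual k-mer assemblies, which is why the recommended workflow inspects both the CA and the single-k assemblies.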
  • 53
    Publication date: 2012-07-24
    Description: Background: Today, recognition and classification of sequence motifs and protein folds is a mature field, thanks to the availability of numerous comprehensive and easy-to-use software packages and web-based services. Recognition of structural motifs, by comparison, is less well developed and much less frequently used, possibly due to a lack of easily accessible and easy-to-use software. Results: In this paper, we describe an extension of DeepView/Swiss-PdbViewer through which structural motifs may be defined and searched for in large protein structure databases, and we show that common structural motifs involved in stabilizing protein folds are present in evolutionarily and structurally unrelated proteins, including in deeply buried locations which are not obviously related to protein function. Conclusions: The possibility to define custom motifs and search for their occurrence in other proteins permits the identification of recurrent arrangements of residues that could have structural implications. The possibility to do so without having to maintain a complex software/hardware installation on site brings this technology to experts and non-experts alike.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 54
    Publication date: 2012-07-24
    Description: Background: Clustering DNA sequences into functional groups is an important problem in bioinformatics. We propose a new alignment-free algorithm, mBKM, based on a new distance measure, DMk, for clustering gene sequences. This method transforms DNA sequences into feature vectors which contain the occurrence, location and order relation of k-tuples in the DNA sequence. Afterwards, a hierarchical procedure is applied to cluster DNA sequences based on the feature vectors. Results: The proposed distance measure and clustering method are evaluated by clustering functionally related genes and by phylogenetic analysis. This method is also compared with BlastClust and CD-HIT-EST. The experimental results show our method is effective in classifying DNA sequences with similar biological characteristics and in discovering the underlying relationship among the sequences. Conclusions: We introduced a novel clustering algorithm which is based on a new sequence similarity measure. It is effective in classifying DNA sequences with similar biological characteristics and in discovering the relationship among the sequences.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
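    A toy k-tuple feature extractor in the spirit of DMk, recording occurrence and positions of each k-tuple (DMk's actual formula combines occurrence, location and order differently; this only shows the kind of information captured):

    from collections import defaultdict

    def ktuple_features(seq, k=3):
        feats = defaultdict(list)
        for i in range(len(seq) - k + 1):
            feats[seq[i:i + k]].append(i)      # every location of the k-tuple
        # Summarize each k-tuple by its count and mean position.
        return {t: (len(pos), sum(pos) / len(pos)) for t, pos in feats.items()}

    print(ktuple_features("ACGTACGTTT"))

    Because positions enter the vector, two sequences with identical k-tuple counts but different arrangements still get different feature vectors, which is what distinguishes this family of measures from plain k-mer counting.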
  • 55
    Publication date: 2012-07-24
    Description: Background: Increasingly, biological text mining research is focusing on the extraction of complex relationships relevant to the construction and curation of biological networks and pathways. However, one important category of pathway - metabolic pathways - has been largely neglected. Here we present a relatively simple method for extracting metabolic reaction information from free text that scores different permutations of assigned entities (enzymes and metabolites) within a given sentence based on the presence and location of stemmed keywords. This method extends an approach that has proved effective in the context of the extraction of protein-protein interactions. Results: When evaluated on a set of manually-curated metabolic pathways using standard performance criteria, our method performs surprisingly well. Precision and recall rates are comparable to those previously achieved for the well-known protein-protein interaction extraction task. Conclusions: We conclude that automated metabolic pathway construction is more tractable than has often been assumed, and that (as in the case of protein-protein interaction extraction) relatively simple text-mining approaches can prove surprisingly effective. It is hoped that these results will provide an impetus to further research and act as a useful benchmark for judging the performance of more sophisticated methods that are yet to be developed.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 56
    Publication date: 2012-06-16
    Description: Background: The distance matrix computed from multiple alignments of homologous sequences is widely used by distance-based phylogenetic methods to provide information on the evolution of protein families. This matrix can also be visualized in a low dimensional space by metric multidimensional scaling (MDS). Applied to protein families, MDS provides information complementary to the information derived from tree-based methods. Moreover, MDS gives a unique opportunity to compare orthologous sequence sets because it can add supplementary elements to a reference space. Results: The R package bios2mds (from BIOlogical Sequences to MultiDimensional Scaling) has been designed to analyze multiple sequence alignments by MDS. Bios2mds starts with a sequence alignment, builds a matrix of distances between the aligned sequences, and represents this matrix by MDS to visualize a sequence space. This package also offers the possibility of performing K-means clustering in the MDS derived sequence space. Most importantly, bios2mds includes a function that projects supplementary elements (a.k.a. "out of sample" elements) onto the space defined by reference or "active" elements. Orthologous sequence sets can thus be compared in a straightforward way. The data analysis and visualization tools have been specifically designed for an easy monitoring of the evolutionary drift of protein sub-families. Conclusions: The bios2mds package provides the tools for a complete integrated pipeline aimed at the MDS analysis of multiple sets of orthologous sequences in the R statistical environment. In addition, as the analysis can be carried out from user provided matrices, the projection function can be widely used on any kind of data.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 57
    Publication date: 2012-06-20
    Description: Background: Two-dimensional polyacrylamide gel electrophoresis (2D PAGE) is commonly used to identify differentially expressed proteins under two or more experimental or observational conditions. Wu et al. (2009) developed a univariate probabilistic model which was used to identify differential expression between Case and Control groups by applying a Likelihood Ratio Test (LRT) to each protein on a 2D PAGE. In contrast to commonly used statistical approaches, this model takes into account the two possible causes of missing values in 2D PAGE: either (1) the non-expression of a protein, or (2) a level of expression that falls below the limit of detection. Results: We develop a global Bayesian model which extends the previously described model. Unlike the univariate approach, the model reported here is able to treat all differentially expressed proteins simultaneously. Whereas each protein is modelled by the univariate likelihood function previously described, several global distributions are used to model the underlying relationship between the parameters associated with individual proteins. These global distributions are able to combine information from each protein to give more accurate estimates of the true parameters. In our implementation of the procedure, all parameters are recovered by Markov chain Monte Carlo (MCMC) integration. The 95% highest posterior density (HPD) intervals for the marginal posterior distributions are used to determine whether differences in protein expression are due to differences in mean expression intensities, and/or differences in the probabilities of expression. Conclusions: Simulation analyses showed that the global model is able to accurately recover the underlying global distributions, and identify more differentially expressed proteins than the simple application of a LRT. Additionally, simulations also indicate that the probability of incorrectly identifying a protein as differentially expressed (i.e., the False Discovery Rate) is very low. The source code is available at https://github.com/stevenhwu/BIDE-2D.
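    The decision rule above hinges on 95% HPD intervals computed from MCMC samples. A minimal sketch of that computation, under the usual definition of the HPD interval as the shortest interval containing the target posterior mass (the synthetic "posterior" sample is an assumption):

```python
# Sketch: 95% highest posterior density (HPD) interval from an MCMC sample;
# a difference parameter is flagged when 0 falls outside its HPD interval.
import numpy as np

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing `mass` of the sampled values."""
    s = np.sort(np.asarray(samples))
    n = len(s)
    k = int(np.ceil(mass * n))
    widths = s[k - 1:] - s[:n - k + 1]     # widths of all candidate intervals
    i = np.argmin(widths)                  # the shortest one is the HPD interval
    return s[i], s[i + k - 1]

rng = np.random.default_rng(0)
diff = rng.normal(0.8, 0.3, size=20000)    # stand-in posterior of a mean difference
lo, hi = hpd_interval(diff)
print(f"95% HPD: ({lo:.2f}, {hi:.2f}); differential:", not (lo <= 0 <= hi))
```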
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 58
    Publication date: 2012-06-20
    Description: Background: The identification of gene sets that are significantly impacted in a given condition based on microarray data is a crucial step in current life science research. Most gene set analysis methods treat genes equally, regardless of how specific they are to a given gene set. Results: In this work we propose a new gene set analysis method that computes a gene set score as the mean of absolute values of weighted moderated gene t-scores. The gene weights are designed to emphasize the genes appearing in few gene sets, versus genes that appear in many gene sets. We demonstrate the usefulness of the method when analyzing gene sets that correspond to the KEGG pathways, and hence we called our method Pathway Analysis with Down-weighting of Overlapping Genes (PADOG). Unlike most gene set analysis methods, which are validated through the analysis of 2-3 data sets followed by a human interpretation of the results, the validation employed here uses 24 different data sets and a completely objective assessment scheme that makes minimal assumptions and eliminates the need for possibly biased human assessments of the analysis results. Conclusions: PADOG significantly improves gene set ranking and boosts sensitivity of analysis using information already available in the gene expression profiles and the collection of gene sets to be analyzed. The advantages of PADOG over other existing approaches are shown to be stable to changes in the database of gene sets to be analyzed. PADOG was implemented as an R package available at http://bioinformaticsprb.med.wayne.edu/PADOG/ or www.bioconductor.org.
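    A tiny Python sketch of the core idea: genes occurring in many gene sets get lower weights, and a set's score is the mean absolute weighted t-score. The 1/sqrt(frequency) weighting below is one simple illustrative choice, not necessarily PADOG's exact formula, and the gene sets and t-scores are made up.

```python
# Sketch of down-weighting overlapping genes in a gene set score.
import numpy as np

gene_sets = {"pathA": ["g1", "g2", "g3"],
             "pathB": ["g2", "g3", "g4"],
             "pathC": ["g3", "g5"]}
t_scores = {"g1": 2.5, "g2": 0.4, "g3": 1.1, "g4": 3.0, "g5": 0.2}

# frequency of each gene across all sets
freq = {}
for genes in gene_sets.values():
    for g in genes:
        freq[g] = freq.get(g, 0) + 1

def set_score(genes):
    w = np.array([1.0 / np.sqrt(freq[g]) for g in genes])  # down-weight shared genes
    t = np.array([abs(t_scores[g]) for g in genes])
    return float(np.mean(w * t))

for name, genes in gene_sets.items():
    print(name, round(set_score(genes), 3))
```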
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 59
    Publication date: 2012-06-20
    Description: Background: Microarray data enable the high-throughput survey of mRNA expression profiles at the genomic level; however, the data present a challenging statistical problem because of the large number of transcripts with small sample sizes that are obtained. To reduce the dimensionality, various Bayesian or empirical Bayes hierarchical models have been developed. However, because of the complexity of the microarray data, no model can explain the data fully. It is generally difficult to scrutinize the irregular patterns of expression that are not expected by the usual statistical gene-by-gene models. Results: As an extension of empirical Bayes (EB) procedures, we have developed the beta-empirical Bayes (beta-EB) approach based on a beta-likelihood measure which can be regarded as an 'evidence-based' weighted (quasi-)likelihood inference. The weight of a transcript t is described as a power function of its likelihood, f^beta(y_t | theta). Genes with low likelihoods have unexpected expression patterns and low weights. By assigning low weights to outliers, the inference becomes robust. The value of beta, which controls the balance between robustness and efficiency, is selected by maximizing the predictive beta0-likelihood by cross-validation. The proposed beta-EB approach identified six significant (p < 10^-5) contaminated transcripts as differentially expressed (DE) in normal/tumor tissues from head and neck cancer patients. These six genes were all confirmed to be related to cancer; they were not identified as DE genes by the classical EB approach. When applied to the eQTL analysis of Arabidopsis thaliana, the proposed beta-EB approach identified some potential master regulators that were missed by the EB approach. Conclusions: The simulation data and real gene expression data showed that the proposed beta-EB method was robust against outliers. The distribution of the weights was used to scrutinize the irregular patterns of expression and diagnose the model statistically. When beta-weights outside the range of the predicted distribution were observed, a detailed inspection of the data was carried out. The beta-weights described here can be applied to other likelihood-based statistical models for diagnosis, and may serve as a useful tool for transcriptome and proteome studies.
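    The weighting scheme w_t = f^beta(y_t | theta) is easy to illustrate. The sketch below uses a normal likelihood and an arbitrary beta purely for demonstration; the model and parameter values are assumptions, not those fitted in the paper.

```python
# Sketch of beta-likelihood weighting: outlying observations get low weight.
import numpy as np
from scipy.stats import norm

beta = 0.3
y = np.array([0.1, -0.2, 0.05, 4.5])       # the last value is an outlier
w = norm.pdf(y, loc=0.0, scale=0.5) ** beta  # w_t = f(y_t | theta)^beta
print((w / w.max()).round(3))                # the outlier's weight collapses toward 0
```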
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 60
    Publication date: 2012-07-17
    Description: Background: The emergence of Next Generation Sequencing technologies has made it possible for individual investigators to generate gigabases of sequencing data per week. Effective analysis and manipulation of these data is limited due to large file sizes, so even simple tasks such as data filtration and quality assessment have to be performed in several steps. This requires (potentially problematic) interaction between the investigator and a bioinformatics/computational service provider. Furthermore, such services are often performed using specialized computational facilities. Results: We present a Windows-based application, Slim-Filter, designed to interactively examine the statistical properties of sequencing reads produced by the Illumina Genome Analyzer and to perform a broad spectrum of data manipulation tasks including: filtration of low-quality and low-complexity reads; filtration of reads containing undesired subsequences (such as parts of adapters and PCR primers used during the sample and sequencing library preparation steps); exclusion of duplicated reads (while keeping each read's copy number information in a specialized data format); and sorting of reads by copy number, allowing for easy access and manual editing of the resulting files. Slim-Filter is organized as a sequence of windows summarizing the statistical properties of the reads. Each data manipulation step has roll-back abilities, allowing for return to previous steps of the data analysis process. Slim-Filter is written in C++ and is compatible with fasta, fastq, and the specialized AS file formats presented in this manuscript. Setup files and a user's manual are available for download at the supplementary web site (https://www.bioinfo.uh.edu/Slim_Filter/). Conclusion: The presented Windows-based application has been developed with the goal of providing individual investigators with integrated sequencing read analysis, curation, and manipulation capabilities.
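    The kinds of filters listed above are simple to express in code. The following pure-Python sketch (not the tool itself) applies a mean-quality filter, an adapter filter, a crude low-complexity filter, and duplicate collapsing with copy numbers; Phred+33 quality encoding, the thresholds, and the adapter string are assumptions.

```python
# Sketch of quality/adapter/complexity filtering plus duplicate collapsing.
from collections import Counter

ADAPTER = "AGATCGGAAGAGC"            # a common Illumina adapter fragment (assumed)

def passes(seq, qual, min_mean_q=20):
    mean_q = sum(ord(c) - 33 for c in qual) / len(qual)     # Phred+33
    low_complexity = max(seq.count(b) for b in "ACGT") > 0.8 * len(seq)
    return mean_q >= min_mean_q and ADAPTER not in seq and not low_complexity

def filter_and_collapse(records):
    """records: iterable of (sequence, quality) tuples."""
    counts = Counter(seq for seq, qual in records if passes(seq, qual))
    return counts.most_common()       # unique reads sorted by copy number

reads = [("ACGTACGTACGT", "IIIIIIIIIIII"),
         ("ACGTACGTACGT", "IIIIIIIIIIII"),
         ("AAAAAAAAAAAA", "IIIIIIIIIIII")]
print(filter_and_collapse(reads))     # [('ACGTACGTACGT', 2)]
```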
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 61
    Publication date: 2012-07-17
    Description: Background: In high-throughput cancer genomic studies, results from the analysis of single datasets often suffer from a lack of reproducibility because of small sample sizes. Integrative analysis can effectively pool and analyze multiple datasets and provides a cost-effective way to improve reproducibility. In integrative analysis, simultaneously analyzing all genes profiled may incur high computational cost. A computationally affordable remedy is prescreening, which fits marginal models, can be conducted in a parallel manner, and has low computational cost. Results: An integrative prescreening approach is developed for the analysis of multiple cancer genomic datasets. Simulation shows that the proposed integrative prescreening has better performance than alternatives, particularly including prescreening with individual datasets, an intensity approach, and meta-analysis. We also analyze multiple microarray gene profiling studies on liver and pancreatic cancers using the proposed approach. Conclusions: The proposed integrative prescreening provides an effective way to reduce the dimensionality in cancer genomic studies. It can be coupled with existing analysis methods to identify cancer markers.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 62
    Publication date: 2012-07-17
    Description: Background: Precise DNA-protein interactions play a vital role in maintaining the normal physiological functioning of the cell, as they control many high-fidelity cellular processes. Detailed study of the nature of these interactions has paved the way for understanding the mechanisms behind the biological processes in which they are involved. In 2000, a systematic classification of DNA-protein complexes based on the structural analysis of the proteins was proposed at two tiers, namely groups and families. With the growth in the number and resolution of structures of DNA-protein complexes deposited in the Protein Data Bank, it is important to revisit the existing classification. Results: On the basis of the sequence analysis of DNA-binding proteins, we have built upon the protein-centric, two-tier classification of DNA-protein complexes by adding new members to existing families and creating new families and groups. While classifying the new complexes, we observed the emergence of new groups and families. The new group observed comprises complexes in which a beta-propeller interacts with DNA. There were 34 SCOP folds observed to be present in the complexes of both the old and new classifications, whereas 28 folds are present exclusively in the new complexes. Among the new families observed are the NarL transcription factor, Z-alpha DNA-binding proteins, Forkhead transcription factor, AP2 protein, and methyl-CpG-binding protein families. Conclusions: Our results suggest that, with the increasing availability of DNA-protein complexes in the Protein Data Bank, the number of families in the classification has increased approximately three-fold. The folds present exclusively in the newly classified complexes suggest the inclusion of proteins with new functions in the new classification, the most populated of which are the folds responsible for DNA damage repair. The proposed revisited classification can be used to perform genome-wide surveys in genomes of interest for the presence of DNA-binding proteins. Further analysis of these complexes can aid in developing algorithms for identifying DNA-binding proteins and their family members from mere sequence information.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 63
    Publication date: 2012-07-25
    Description: Background: Seqcrawler takes its roots in software like SRS or Lucegene. It provides an indexing platform to ease the search of data and meta-data in biological banks, and it can scale to face the current flow of data. While many biological bank search tools are available on the Internet, mainly provided by large organizations to search their data, there is a lack of free and open-source solutions to browse one's own set of data with a flexible query system, able to scale from a single computer to a cloud system. A personal index platform will help labs and bioinformaticians to search their meta-data but also to build a larger information system with custom subsets of data. Results: The software is scalable from a single computer to a cloud-based infrastructure. It has been successfully tested in a private cloud with 3 index shards (pieces of the index) hosting ~400 million sequence records (whole GenBank, UniProt, PDB, and others) for a total size of 600 GB in a fault-tolerant architecture (high availability). It has also been successfully integrated with software to add extra meta-data from BLAST results to enhance users' result analysis. Conclusions: Seqcrawler provides a complete open-source search and store solution for labs or platforms needing to manage large amounts of data/meta-data with a flexible and customizable web interface. All components (search engine, visualization, and data storage), though independent, share a common and coherent data system that can be queried with a simple HTTP interface. The solution scales easily and can also provide a high-availability infrastructure.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 64
    Publication date: 2012-07-25
    Description: Background: Chromatin immunoprecipitation combined with high-throughput sequencing (ChIP-Seq) is the most frequently used method to identify the binding sites of transcription factors. Active binding sites can be seen as peaks in enrichment profiles when the sequencing reads are mapped to a reference genome. However, the profiles are normally noisy, making it challenging to identify all significantly enriched regions in a reliable way and with an acceptable false discovery rate. Results: We present the Triform algorithm, an improved approach to automatic peak finding in ChIP-Seq enrichment profiles for transcription factors. The method uses model-free statistics to identify peak-like distributions of sequencing reads, taking advantage of an improved peak definition in combination with known characteristics of ChIP-Seq data. Conclusions: Triform outperforms several existing methods in the identification of representative peak profiles in curated benchmark data sets. We also show that Triform in many cases is able to identify peaks that are more consistent with biological function, compared with other methods. Finally, we show that Triform can be used to generate novel information on transcription factor binding in repeat regions, which represent a particular challenge in many ChIP-Seq experiments. The Triform algorithm has been implemented in R, and is available via http://tare.medisin.ntnu.no/triform.
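    For readers unfamiliar with peak calling, the following generic Python sketch flags positions whose coverage exceeds the background by several standard deviations and merges them into peaks. This is only a crude stand-in to convey the task; it is not Triform's model-free statistic, and all thresholds are assumptions.

```python
# Generic peak detection sketch on a read-coverage profile.
import numpy as np

def call_peaks(coverage, z_cut=4.0):
    bg_mean, bg_sd = np.median(coverage), np.std(coverage)
    hot = np.where((coverage - bg_mean) / (bg_sd + 1e-9) > z_cut)[0]
    peaks, start = [], None
    for i in hot:                       # merge consecutive hot positions
        if start is None:
            start = prev = i
        elif i == prev + 1:
            prev = i
        else:
            peaks.append((start, prev)); start = prev = i
    if start is not None:
        peaks.append((start, prev))
    return peaks

rng = np.random.default_rng(1)
cov = rng.poisson(3, 2000).astype(float)
cov[900:950] += 40                      # one enriched region
print(call_peaks(cov))                  # [(900, 949)]
```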
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 65
    Publication date: 2012-07-26
    Description: Background: Previous studies on tumor classification based on gene expression profiles suggest that gene selection plays a key role in improving classification performance. Moreover, finding important tumor-related genes with the highest accuracy is a very important task because these genes might serve as tumor biomarkers, which is of great benefit not only to tumor molecular diagnosis but also to drug development. Results: This paper proposes a novel gene selection method with rich biomedical meaning based on a Heuristic Breadth-first Search Algorithm (HBSA) to find as many optimal gene subsets as possible. Due to the curse of dimensionality, this type of method can suffer from over-fitting and selection bias problems. To address these potential problems, an HBSA-based ensemble classifier is constructed using a majority voting strategy from individual classifiers built on the selected gene subsets, and a novel HBSA-based gene ranking method is designed to find important tumor-related genes by measuring the significance of genes through their occurrence frequencies in the selected gene subsets. The experimental results on nine tumor datasets, including three pairs of cross-platform datasets, indicate that the proposed method can not only obtain better generalization performance but also find many important tumor-related genes. Conclusions: It is found that the frequencies of the selected genes follow a power-law distribution, indicating that only a few top-ranked genes can be used as potential diagnosis biomarkers. Moreover, the top-ranked genes leading to very high prediction accuracy are closely related to specific tumor subtypes and even to hub genes. Compared with other related methods, the proposed method can achieve higher prediction accuracy with fewer genes. The findings are further justified by analyzing the top-ranked genes in the context of individual gene function, biological pathways, and the protein-protein interaction network.
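    A small Python sketch of the two devices described above: a majority-vote ensemble over classifiers trained on different gene subsets, and gene ranking by occurrence frequency across the selected subsets. The data, subsets, and classifier choice are synthetic placeholders, not the paper's HBSA search.

```python
# Sketch: subset-ensemble voting plus frequency-based gene ranking.
import numpy as np
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
y = (X[:, 3] + X[:, 7] > 0).astype(int)
subsets = [[3, 7], [3, 12], [7, 5], [3, 7, 9]]        # "selected" gene subsets

models = [KNeighborsClassifier(3).fit(X[:40][:, s], y[:40]) for s in subsets]
votes = np.array([m.predict(X[40:][:, s]) for m, s in zip(models, subsets)])
majority = (votes.mean(axis=0) > 0.5).astype(int)     # majority vote
print("ensemble accuracy:", (majority == y[40:]).mean())

freq = Counter(g for s in subsets for g in s)          # occurrence-frequency ranking
print("top-ranked genes:", freq.most_common(3))
```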
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 66
    Publication date: 2012-07-20
    Description: Background: The enteric pathogen Salmonella is the causative agent of the majority of food-borne bacterial poisonings. Recent research revealed that colonization of plants by Salmonella is an active infection process: Salmonella changes its metabolism and adjusts to the plant host by suppressing the host's defense mechanisms. In this report we developed an automatic algorithm to quantify the symptoms caused by Salmonella infection on Arabidopsis. Results: The algorithm is designed to attribute image pixels to one of two classes: healthy and unhealthy. The task is solved in three steps. First, we perform segmentation to divide the image into foreground and background. In the second step, a support vector machine (SVM) is applied to predict the class of each pixel belonging to the foreground. Finally, we refine the result with a neighborhood check in order to omit all falsely classified pixels from the second step. The developed algorithm was tested on infection with the non-pathogenic E. coli and the plant pathogen Pseudomonas syringae, and used to study the interaction between plants and Salmonella wild type and T3SS mutants. We proved that T3SS mutants of Salmonella are unable to suppress the plant defenses. Results obtained through the automatic analyses were further verified at the biochemical and transcriptome levels. Conclusion: This report presents an automatic pixel-based classification method for detecting "unhealthy" regions in leaf images. The proposed method was compared to an existing method and showed higher accuracy. We used this algorithm to study the impact of the human pathogenic bacterium Salmonella Typhimurium on the plant immune system. The comparison between wild type bacteria and T3SS mutants showed similarities between the infection process in animals and in plants. Plant epidemiology is only one possible application of the proposed algorithm; it can easily be extended to other detection tasks that also rely on color information, or even to other features.
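    The three-step scheme (segmentation, SVM pixel classification, neighborhood refinement) can be sketched compactly in Python. The synthetic "green" and "brown" pixel colors stand in for real leaf images, and the median filter is one simple way to realize a neighborhood check; none of this is the authors' implementation.

```python
# Sketch: SVM classification of pixels, then a neighborhood-based cleanup.
import numpy as np
from scipy.ndimage import median_filter
from sklearn.svm import SVC

rng = np.random.default_rng(0)
healthy = rng.normal([40, 160, 60], 10, size=(200, 3))      # greenish pixels
unhealthy = rng.normal([150, 140, 70], 10, size=(200, 3))   # brownish pixels
X = np.vstack([healthy, unhealthy])
y = np.array([0] * 200 + [1] * 200)
svm = SVC(kernel="rbf", gamma="scale").fit(X, y)

img = rng.normal([40, 160, 60], 10, size=(32, 32, 3))       # a healthy leaf patch
img[10:20, 10:20] = rng.normal([150, 140, 70], 10, size=(10, 10, 3))  # lesion
labels = svm.predict(img.reshape(-1, 3)).reshape(32, 32)
refined = median_filter(labels, size=3)     # neighborhood check: drop lone pixels
print("unhealthy fraction:", refined.mean().round(3))       # ~100/1024
```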
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 67
    Publication date: 2012-07-25
    Description: Background: Based on available biological information, genomic data can often be partitioned into pre-defined sets (e.g. pathways) and subsets within sets. Biologists are often interested in determining whether some pre-defined sets of variables (e.g. genes) are differentially expressed under varying experimental conditions. Several procedures are available for performing gene set analysis, but they do not take into account information regarding the subsets within each set. Moreover, variables (e.g. genes) belonging to a set or a subset are potentially correlated, yet such information is often ignored and univariate methods are used. This may result in loss of power and/or an inflated false positive rate. Results: We introduce a multiple testing-based methodology which makes use of available information regarding biologically relevant subsets within each pre-defined set of variables while exploiting the underlying dependence structure among the variables. Using this methodology, a biologist may not only determine whether a set of variables is differentially expressed between two experimental conditions, but may also test whether specific subsets within a significant set are themselves significant. Conclusions: The proposed methodology (a) is easy to implement, (b) does not require inverting potentially singular covariance matrices, (c) controls the family-wise error rate (FWER) at the desired nominal level, and (d) is robust to the underlying distribution and covariance structures. Although for simplicity of exposition the methodology is described for microarray gene expression data, it is also applicable to any high-dimensional data, such as mRNA-seq data, CpG methylation data, etc.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 68
    Publication date: 2012-09-12
    Description: Background: The detection of significant compensatory mutation signals in multiple sequence alignments (MSAs) is often complicated by noise. A challenging problem in bioinformatics remains the separation of significant signals between two or more non-conserved residue sites from phylogenetic noise and unrelated pair signals. Determining these non-conserved residue sites is as important as recognizing strictly conserved positions for understanding the structural basis of protein functions and identifying functionally important residue regions. In this study, we developed a new method, the Coupled Mutation Finder (CMF), which quantifies the phylogenetic noise for the detection of compensatory mutations. Results: To demonstrate the effectiveness of this method, we analyzed essential sites of two human proteins: epidermal growth factor receptor (EGFR) and glucokinase (GCK). Our results suggest that the CMF is able to separate significant compensatory mutation signals from the phylogenetic noise and unrelated pair signals. The vast majority of compensatory mutation sites found by the CMF are related to essential sites of both proteins and are likely to affect protein stability or functionality. Conclusions: The CMF is a new method which includes an MSA-specific statistical model, based on multiple testing procedures, that quantifies the error made in terms of the false discovery rate, and a novel entropy-based metric to upscale BLOSUM62-dissimilar compensatory mutations. It is therefore a helpful tool for predicting and investigating compensatory mutation sites of structural or functional importance in proteins. We suggest that the CMF could be used as a novel automated function prediction tool, which is needed for a better understanding of the structural basis of proteins. The CMF server is freely accessible at http://cmf.bioinf.med.uni-goettingen.de.
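    The statistical core of compensatory-mutation detection is a coupling measure between two alignment columns. The Python sketch below computes mutual information between columns and compares it against a shuffle null; this is a crude stand-in for CMF's MSA-specific noise model, and the toy columns are fabricated for illustration.

```python
# Sketch: mutual information between alignment columns vs. a shuffle null.
import numpy as np
from collections import Counter

def mutual_information(col_a, col_b):
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    return sum(c / n * np.log2((c / n) / (pa[a] / n * pb[b] / n))
               for (a, b), c in pab.items())

rng = np.random.default_rng(0)
col1 = list("AAAAGGGGAAGG")
col2 = ["L" if c == "A" else "F" for c in col1]     # perfectly coupled column
obs = mutual_information(col1, col2)
null = [mutual_information(col1, rng.permutation(col2).tolist())
        for _ in range(1000)]
p = (np.sum(np.array(null) >= obs) + 1) / 1001
print("MI =", round(obs, 3), " empirical p =", round(p, 4))
```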
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 69
    Publication date: 2012-09-12
    Description: Background: Gene-set enrichment analyses (GEA or GSEA) are commonly used for the biological characterization of an experimental gene set. This is done by finding known functional categories, such as pathways or Gene Ontology terms, that are over-represented in the experimental set; the assessment is based on an overlap statistic. Rich biological information in terms of gene interaction networks is now widely available, but this topological information is not used by GEA, so there is a need for methods that exploit this type of information in high-throughput data analysis. Results: We developed a method of network enrichment analysis (NEA) that extends the overlap statistic in GEA to network links between genes in the experimental set and those in the functional categories. For the crucial step in statistical inference, we developed a fast network randomization algorithm in order to obtain the distribution of any network statistic under the null hypothesis of no association between an experimental gene set and a functional category. We illustrate the NEA method using gene and protein expression data from a lung cancer study. Conclusions: The results indicate that the NEA method is more powerful than the traditional GEA, primarily because the relationships between gene sets were more strongly captured by network connectivity than by simple overlaps.
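    The NEA statistic is a count of network links crossing between two gene sets, judged against a randomization null. The Python sketch below uses a simple label-permutation null for brevity; the paper's method uses a faster network randomization scheme, and the toy network is invented.

```python
# Sketch: cross-link count between gene sets with a permutation null.
import random

edges = [("g1", "g2"), ("g2", "g5"), ("g3", "g4"), ("g5", "g6"), ("g1", "g6")]
nodes = sorted({n for e in edges for n in e})
experimental = {"g1", "g2"}
category = {"g5", "g6"}

def cross_links(edge_list, set_a, set_b):
    return sum((a in set_a and b in set_b) or (a in set_b and b in set_a)
               for a, b in edge_list)

obs = cross_links(edges, experimental, category)
random.seed(0)
null = []
for _ in range(2000):
    perm = dict(zip(nodes, random.sample(nodes, len(nodes))))  # relabel nodes
    null.append(cross_links([(perm[a], perm[b]) for a, b in edges],
                            experimental, category))
p = (sum(n >= obs for n in null) + 1) / (len(null) + 1)
print("observed cross-links:", obs, " permutation p ~", round(p, 3))
```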
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 70
    Publication date: 2012-09-13
    Description: Background: Meta-analysis (MA) is widely used to pool genome-wide association studies (GWASes) in order to (a) increase the power to detect strong or weak genotype effects or (b) serve as a result verification method. As a consequence of differing SNP panels among genotyping chips, imputation is the method of choice within GWAS consortia to avoid losing too many SNPs in a MA. YAMAS (Yet Another Meta Analysis Software), however, enables cross-GWAS conclusions prior to finished and polished imputation runs, which can be time-consuming. Results: Here we present a fast method to avoid forfeiting SNPs present in only a subset of studies, without relying on imputation. This is accomplished by using reference linkage disequilibrium data from the 1,000 Genomes/HapMap projects to find proxy-SNPs together with in-phase alleles for SNPs missing in at least one study. MA is conducted by combining association effect estimates of a SNP and those of its proxy-SNPs. Our algorithm is implemented in the MA software YAMAS. Association results from GWAS analysis applications can be used as input files for MA, tremendously speeding up MA compared to the conventional imputation approach. We show that our proxy algorithm is well-powered and yields valuable ad hoc results, possibly providing an incentive for follow-up studies. We propose our method as a quick screening step prior to imputation-based MA, as well as an additional main approach for studies without available reference data matching the ethnicities of study participants. As a proof of principle, we analyzed six dbGaP Type II Diabetes GWASes and found that the proxy algorithm clearly outperforms naive MA at the p-value level: in 17 out of 23 cases we observe an improvement in the p-value by a factor of more than two, and a maximum improvement by a factor of 2127. Conclusions: YAMAS is an efficient and fast meta-analysis program which offers various methods, including conventional MA as well as inserting proxy-SNPs for missing markers to avoid unnecessary power loss. MA with YAMAS can be readily conducted, as YAMAS provides a generic parser for heterogeneous tabulated file formats within the GWAS field and avoids cumbersome setups. In this way, it supplements the meta-analysis process.
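    The pooling step is standard fixed-effect, inverse-variance meta-analysis; when a proxy-SNP substitutes for a missing marker, its effect estimate enters the same formula (after aligning effect signs via the in-phase alleles). A minimal sketch with invented numbers:

```python
# Sketch: fixed-effect inverse-variance meta-analysis of effect estimates.
import math

# (beta, standard error) per study; study 2 contributes a proxy-SNP estimate
estimates = [(0.12, 0.05), (0.09, 0.06), (0.15, 0.07)]

w = [1 / se**2 for _, se in estimates]             # inverse-variance weights
beta = sum(wi * b for wi, (b, _) in zip(w, estimates)) / sum(w)
se = math.sqrt(1 / sum(w))
z = beta / se
print(f"pooled beta = {beta:.3f}, se = {se:.3f}, z = {z:.2f}")
```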
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 71
    Publication date: 2012-09-15
    Description: Background: Existing statistical methods for tiling array transcriptome data either focus on transcript discovery in one biological or experimental condition or on the detection of differential expression between two conditions. Increasingly often, however, biologists are interested in time-course studies, studies with more than two conditions, or even multiple-factor studies. As these studies are currently analyzed with traditional microarray analysis techniques, they do not exploit the genome-wide nature of tiling array data to its full potential. Results: We present an R Bioconductor package, waveTiling, which implements a wavelet-based model for analyzing transcriptome data and extends it towards more complex experimental designs. With waveTiling the user is able to discover (1) group-wise expressed regions and (2) differentially expressed regions between any two groups, both in single-factor studies and (3) in multifactorial designs. Moreover, for time-course experiments it is also possible to detect (4) linear time effects and (5) a circadian rhythm of transcripts. By considering the expression values of the individual tiling probes as a function of genomic position, effect regions can be detected regardless of existing annotation. Three case studies with different experimental set-ups illustrate the use and the flexibility of the model-based transcriptome analysis. Conclusions: The waveTiling package provides the user with a convenient tool for the analysis of tiling array transcriptome data for a multitude of experimental set-ups. Regardless of the study design, the probe-wise analysis allows for the detection of transcriptional effects in exonic, intronic, and intergenic regions, without prior consultation of existing annotation.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 72
    Publication date: 2012-09-15
    Description: Background: Identification of protein structural cores requires isolating sets of proteins that all share a same subset of structural motifs. In the context of an ever-growing number of available 3D protein structures, standard and automatic clustering algorithms require adaptations to allow for efficient identification of such sets of proteins. Results: When considering a pair of 3D structures, they are stated as similar or not according to the local similarities of their matching substructures in a structural alignment. This binary relation can be represented in a graph of similarities where a node represents a 3D protein structure and an edge states that two 3D protein structures are similar. The classification of proteins into structural families can therefore be viewed as a graph clustering task. Unfortunately, because such a graph encodes only pairwise similarity information, clustering algorithms may group in the same cluster a subset of 3D structures that do not share a common substructure. To overcome this drawback, we first define a ternary similarity on a triple of 3D structures as a constraint to be satisfied by the graph of similarities. Such a ternary constraint takes into account similarities between pairwise alignments, so as to ensure that the three involved protein structures do have some common substructure. We propose a modification algorithm that eliminates edges from the original graph of similarities and outputs a reduced graph in which no ternary constraints are violated. Our procedure is thus first to build a graph of similarities, then to reduce the graph with the modification algorithm, and finally to apply a standard graph clustering algorithm to the reduced graph. We applied this method to ASTRAL-40 non-redundant protein domains, identifying significant pairwise similarities with Yakusa, a program devised for rapid 3D structure alignments. Conclusions: We show that filtering similarities prior to the standard graph-based clustering process by applying ternary similarity constraints (i) improves the separation of proteins of different classes and consequently (ii) improves the classification quality of standard graph-based clustering algorithms with respect to the SCOP reference classification.
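    To make the edge-filtering step concrete, here is a simplified Python sketch in which an edge of the similarity graph is kept only if its two endpoints share a similar third structure. This common-neighbour test is a simplified stand-in for the pairwise-alignment compatibility check the paper describes, and the toy graph is invented.

```python
# Sketch: drop similarity edges with no ternary (third-structure) support.
def filter_ternary(nodes, edges):
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b); adj[b].add(a)
    # keep an edge only if its endpoints have a common neighbour
    return [(a, b) for a, b in edges if adj[a] & adj[b]]

nodes = ["s1", "s2", "s3", "s4", "s5"]
edges = [("s1", "s2"), ("s2", "s3"), ("s1", "s3"), ("s4", "s5")]
print(filter_ternary(nodes, edges))   # the unsupported s4-s5 edge is dropped
```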
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 73
    Publication date: 2012-09-18
    Description: Background: Protein interactions play a key role in life processes. Characterization of the conformational properties of protein-protein interactions is important for understanding the mechanisms of protein association. The rapidly increasing number of experimentally determined structures of proteins and protein-protein complexes provides a foundation for research on protein interactions and complex formation. Knowledge of the conformations of surface side chains is essential for modeling protein complexes. The purpose of this study was to analyze and compare the dihedral angle distribution functions of side chains at the interface and non-interface areas in bound and unbound proteins. Results: To calculate the dihedral angle distribution functions, the configuration space was divided into grid cells. Statistical analysis showed that the similarity between bound and unbound interface and non-interface surfaces depends on the amino acid type and the grid resolution. The correlation coefficients between the distribution functions increased with increasing grid spacing for all amino acid types. The Manhattan distance, showing the degree of dissimilarity between the distribution functions, decreased accordingly. Short residues with one or two dihedral angles had higher correlations and smaller Manhattan distances than longer residues. Met and Arg had the slowest growth of the correlation coefficient with increasing grid spacing. The correlations between the interface and non-interface distribution functions had a similar dependence on grid resolution in both bound and unbound states. The interface and non-interface differences between bound and unbound distribution functions, caused by biological protein-protein interactions or crystal contacts, disappeared at a grid spacing of 70° for interfaces and 30° for the non-interface surface, which agrees with the average span of side-chain rotamers. Conclusions: The two-fold difference in the critical grid spacing indicates larger conformational changes upon binding at the interface than at the rest of the surface. At the same time, transitions between rotamers induced by interactions across the interface or by crystal packing are rare, with most side chains undergoing local readjustments that do not change the rotameric state. The analysis is important for better understanding protein interactions and for developing flexible docking approaches.
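    The grid-resolution effect described above is easy to reproduce on toy data: bin two angle samples on grids of increasing spacing and watch the correlation rise while the Manhattan distance falls. The angle distributions below are synthetic stand-ins for real chi1 angles.

```python
# Sketch: compare binned dihedral-angle distributions at several grid spacings.
import numpy as np

rng = np.random.default_rng(0)
bound = rng.normal(60, 15, 5000) % 360      # chi1 angles, "bound" state (toy)
unbound = rng.normal(65, 15, 5000) % 360    # slightly shifted "unbound" state

for spacing in (10, 30, 70):                # grid spacing in degrees
    bins = np.arange(0, 360 + spacing, spacing)
    f1, _ = np.histogram(bound, bins=bins)
    f2, _ = np.histogram(unbound, bins=bins)
    f1, f2 = f1 / f1.sum(), f2 / f2.sum()
    r = np.corrcoef(f1, f2)[0, 1]
    manhattan = np.abs(f1 - f2).sum()
    print(f"{spacing:3d} deg: r = {r:.3f}, Manhattan = {manhattan:.3f}")
```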
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 74
    Publication date: 2012-09-09
    Description: Background: The COG database is the most popular collection of orthologous proteins from many different completely sequenced microbial genomes. By definition, a cluster of orthologous groups (COG) within this database exclusively contains proteins that most likely achieve the same cellular function. Recently, the COG database was extended by assigning to every protein both the corresponding amino acid sequence and its encoding nucleotide sequence, resulting in the NUCOCOG database. This extended version of the COG database is a valuable resource connecting sequence features with the functionality of the respective proteins. Results: Here we present ANCAC, a web tool and MySQL database for the analysis of amino acid, nucleotide, and codon frequencies in COGs on the basis of freely definable phylogenetic patterns. We demonstrate the usefulness of ANCAC by analyzing amino acid frequencies, codon usage, and GC content in a species- or function-specific context. With respect to amino acids we, at least in part, confirm the cognate bias hypothesis by using ANCAC's NUCOCOG dataset, the largest one available for that purpose thus far. Conclusions: Using the NUCOCOG datasets, ANCAC connects taxonomic, amino acid, and nucleotide sequence information with the functional classification via COGs, and provides a GUI for flexible mining for sequence bias. Thereby, to our knowledge, it is the only tool for the analysis of sequence composition in the light of physiological roles and phylogenetic context without requiring substantial programming skills.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 75
    Publication date: 2012-09-11
    Description: Background: Employing methods to assess the quality of modeled protein structures is now standard practice in bioinformatics. In a broad sense, the techniques can be divided into methods relying on consensus prediction on the one hand, and single-model methods on the other. Consensus methods frequently perform very well when there is a clear consensus, but this is not always the case. In particular, they frequently fail to select the best possible model in the hard cases (lacking consensus) or in the easy cases where models are very similar. In contrast, single-model methods do not suffer from these drawbacks and could potentially be applied to any protein of interest, to assess quality or as a scoring function for sampling-based refinement. Results: Here, we present a new single-model method, ProQ2, based on ideas from its predecessor, ProQ. ProQ2 is a model quality assessment algorithm that uses support vector machines to predict local as well as global quality of protein models. Improved performance is obtained by combining previously used features with updated structural and predicted features. The most important contribution can be attributed to the use of profile weighting of the residue-specific features and the use of features averaged over the whole model, even though the prediction is still local. Conclusions: ProQ2 is significantly better than its predecessors at detecting high-quality models, improving the sum of Z-scores for the selected first-ranked models by 20% and 32% compared to the second-best single-model method in CASP8 and CASP9, respectively. The absolute quality assessment of the models at both the local and global levels is also improved. The Pearson's correlation between the correct and predicted local scores is improved from 0.59 to 0.70 on CASP8 and from 0.62 to 0.68 on CASP9; for the global score against the correct GDT_TS, it improves from 0.75 to 0.80 and from 0.77 to 0.80, again compared to the second-best single-model methods in CASP8 and CASP9, respectively.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 76
    Publication date: 2012-09-12
    Description: Background: Relative expression algorithms such as the top-scoring pair (TSP) and the top-scoring triplet (TST) have several strengths that distinguish them from other classification methods, including resistance to overfitting, invariance to most data normalization methods, and biological interpretability. The top-scoring 'N' (TSN) algorithm is a generalized form of other relative expression algorithms which uses generic permutations and a dynamic classifier size to control both the permutation and combination space available for classification. Results: TSN was tested on nine cancer datasets, showing statistically significant differences in classification accuracy between different classifier sizes (choices of N). TSN also performed competitively against a wide variety of different classification methods, including artificial neural networks, classification trees, discriminant analysis, k-nearest neighbor, naive Bayes, and support vector machines, when tested on the Microarray Quality Control II datasets. Furthermore, TSN exhibits low levels of overfitting on training data compared to other methods, giving confidence that results obtained during cross-validation will be more generally applicable to external validation sets. Conclusions: TSN preserves the strengths of other relative expression algorithms while allowing a much larger permutation and combination space to be explored, potentially improving classification accuracies when fewer measured features are available.
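    The relative-expression idea is simplest to see for N = 2 (the TSP case): classification depends only on whether gene i is expressed above gene j, and the pair is chosen to maximize the difference in that ordering between classes. A toy Python sketch (synthetic data, not the TSN implementation):

```python
# Sketch: the top-scoring pair (TSP) score, the N = 2 case of TSN.
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)
n = 40
X0 = rng.normal([5, 3, 4], 1, size=(n, 3))     # class 0: gene0 > gene1
X1 = rng.normal([3, 5, 4], 1, size=(n, 3))     # class 1: gene1 > gene0
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

def tsp_score(i, j):
    p0 = (X[y == 0][:, i] > X[y == 0][:, j]).mean()
    p1 = (X[y == 1][:, i] > X[y == 1][:, j]).mean()
    return p0 - p1          # large when the ordering flips between classes

best = max(permutations(range(3), 2), key=lambda ij: tsp_score(*ij))
print("top-scoring pair:", best, " score:", round(tsp_score(*best), 3))
```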
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 77
    Publication date: 2012-09-15
    Description: Background: A large panel of methods exists that aim to identify residues with critical impact on protein function based on evolutionary signals and on sequence and structure information. However, it is not clear to what extent the different methods overlap, and whether any of the methods have higher predictive potential compared to others when it comes to, in particular, the identification of catalytic residues (CR) in proteins. Using a large set of enzymatic protein families and measures based on different evolutionary signals, we sought to break up the different components of the information content within a multiple sequence alignment to investigate their predictive potential and degree of overlap. Results: Our results demonstrate that the different methods included in the benchmark can in general be divided into three groups with limited mutual overlap: one group containing real-value Evolutionary Trace (rvET) methods and conservation, another containing mutual information (MI) methods, and the last containing methods designed explicitly for the identification of specificity-determining positions (SDPs): integer-value Evolutionary Trace (ivET), SDPfox, and XDET. In terms of prediction of CR, we find, using a proximity score integrating structural information (as the sum of the scores of residues located within a given distance of the residue in question), that only the methods from the first two groups display reliable performance. Next, we investigated to what degree proximity scores for conservation, rvET, and cumulative MI (cMI) provide complementary information capable of improving the performance of CR identification. We found that integrating conservation with proximity scores for rvET and cMI achieved the highest performance. The proximity conservation score contained no complementary information when integrated with proximity rvET. Moreover, the signal from rvET provided only a limited gain in predictive performance when integrated with mutual information and conservation proximity scores. Combined, these observations demonstrate that the rvET and cMI scores add complementary information to the prediction system. Conclusions: This work contributes to the understanding of the different signals of evolution and also shows that it is possible to improve the detection of catalytic residues by integrating structural and higher-order sequence evolutionary information with sequence conservation.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 78
    Publication date: 2012-09-13
    Description: Background: High-throughput molecular biology techniques yield vast amounts of data, often by detecting small portions of ribonucleotides corresponding to specific identifiers. Existing bioinformatic methodologies categorize and compare these elements using inferred descriptive annotation given this sequence information, irrespective of the fact that it may not be representative of the identifier as a whole. Results: All annotations, no matter the granularity, can be aligned to genomic sequences and therefore annotated by genomic intervals. We have developed AbsIDconvert, a methodology for converting between genomic identifiers by first mapping them onto a common universal coordinate system using an interval tree, which is subsequently queried for overlapping identifiers. AbsIDconvert has many potential uses, including gene identifier conversion, identification of features within a genomic region, and cross-species comparisons. The utility is demonstrated in three case studies: 1) a comparative genomic study mapping Plasmodium gene sequences to corresponding human and mosquito transcriptional regions; 2) a cross-species study of Incyte clone sequences; and 3) an analysis of human Ensembl transcripts mapped by Affymetrix and Agilent microarray probes. AbsIDconvert supports ID conversion of 53 species for a given list of input identifiers, genomic sequences, or genome intervals. Conclusion: AbsIDconvert provides an efficient and reliable mechanism for conversion between identifier domains of interest. The flexibility of this tool allows for custom definition of identifier domains contingent upon the availability and determination of a genomic mapping interval. As the genomes and the sequences for genetic elements are further refined, this tool will become increasingly useful and accurate. AbsIDconvert is freely available as a web application or downloadable as a virtual machine at: http://bioinformatics.louisville.edu/abid/.
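    At its heart, identifier conversion here is an interval-overlap query on a common coordinate system. The Python sketch below uses a sorted list with binary search as a simple stand-in for the interval tree the tool uses; intervals and names are invented.

```python
# Sketch: identifiers as genomic intervals; conversion = overlap query.
from bisect import bisect_right

# (start, end, identifier) on one chromosome, sorted by start
intervals = sorted([(100, 500, "geneA"), (450, 900, "geneB"), (1200, 1300, "probeX")])
starts = [iv[0] for iv in intervals]

def overlapping(query_start, query_end):
    hits = []
    # only intervals starting at or before the query end can overlap
    for s, e, name in intervals[:bisect_right(starts, query_end)]:
        if e >= query_start:            # ...and they must end after the query start
            hits.append(name)
    return hits

print(overlapping(480, 600))   # ['geneA', 'geneB']
```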
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 79
    Publication date: 2012-09-18
    Description: Background: Yeasts are a model system for exploring eukaryotic genome evolution. Next-generation sequencing technologies are poised to vastly increase the number of yeast genome sequences, both from resequencing projects (population studies) and from de novo sequencing projects (new species). However, the annotation of genomes presents a major bottleneck for de novo projects, because it still relies on a process that is largely manual. Results: Here we present the Yeast Genome Annotation Pipeline (YGAP), an automated system designed specifically for new yeast genome sequences lacking transcriptome data. YGAP performs automatic de novo annotation, exploiting homology and synteny information from other yeast species stored in the Yeast Gene Order Browser (YGOB) database. The basic premises underlying YGAP's approach are that data from other species already tell us what genes we should expect to find in any particular genomic region, and that orthologous genes are likely to have similar intron/exon structures. Additionally, YGAP is able to detect probable frameshift sequencing errors and can propose corrections for them. It searches intelligently for introns, and detects tRNA genes and Ty-like elements. Conclusions: In tests on Saccharomyces cerevisiae and on the genomes of Naumovozyma castellii and Tetrapisispora blattae, newly sequenced with Roche-454 technology, YGAP outperformed another popular annotation program (AUGUSTUS). For S. cerevisiae and N. castellii, 91-93% of YGAP's predicted gene structures were identical to those in previous manually curated gene sets. YGAP has been implemented as a webserver with a user-friendly interface at http://wolfe.gen.tcd.ie/annotation
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 80
    Publication date: 2012-09-12
    Description: Background: Protein-DNA interactions are important for many cellular processes; however, structural knowledge for a large fraction of known and putative complexes is still lacking. Computational docking methods aim at the prediction of complex architecture given detailed structures of its constituents. They are becoming an increasingly important tool in the field of macromolecular assemblies, complementing the particularly demanding X-ray crystallography of protein-nucleic acid complexes and providing means for the refinement and integration of low-resolution data coming from rapidly advancing methods such as cryo-electron microscopy. Results: We present a new coarse-grained force field suitable for protein-DNA docking. The force field is an extension of previously developed parameter sets for protein-RNA and protein-protein interactions. The docking is based on potential energy minimization in the translational and orientational degrees of freedom of the binding partners. It allows for a fast and efficient systematic search for native-like complex geometry without any prior knowledge regarding binding site location. Conclusions: We find that the force field gives very good results for bound docking. The quality of predictions in the case of unbound docking varies, depending on the level of structural deviation from bound geometries. We analyze the role of specific protein-DNA interactions in force field performance, both with respect to complex structure prediction and the reproduction of experimental binding affinities. We find that such direct, specific interactions only partially contribute to protein-DNA recognition, indicating an important role for shape complementarity and sequence-dependent DNA internal energy, in line with the concept of an indirect protein-DNA readout mechanism.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 81
    Publication date: 2012-09-13
    Description: Background: Correct segmentation is critical to many applications within automated microscopy image analysis. Despite the availability of advanced segmentation algorithms, variations in cell morphology, sample preparation, and acquisition settings often lead to segmentation errors. This manuscript introduces a ranked-retrieval approach using logistic regression to automate the selection of accurately segmented nuclei from a set of candidate segmentations. The methodology is validated on an application of spatial gene repositioning in breast cancer cell nuclei. Gene repositioning is analyzed in patient tissue sections by labeling sequences with fluorescence in situ hybridization (FISH), followed by measurement of the relative position of each gene from the nuclear center to the nuclear periphery. This technique requires hundreds of well-segmented nuclei per sample to achieve statistical significance. Although the tissue samples in this study contain a surplus of available nuclei, automatic identification of the well-segmented subset remains a challenging task. Results: Logistic regression was applied to features extracted from candidate segmented nuclei, including nuclear shape, texture, context, and gene copy number, in order to rank objects according to the likelihood of being an accurately segmented nucleus. The method was demonstrated on a tissue microarray dataset of 43 breast cancer patients, comprising approximately 40,000 imaged nuclei in which the HES5 and FRA2 genes were labeled with FISH probes. Three trained reviewers independently classified nuclei into three classes of segmentation accuracy. In man vs. machine studies, the automated method outperformed the inter-observer agreement between reviewers, as measured by area under the receiver operating characteristic (ROC) curve. Robustness of gene position measurements to boundary inaccuracies was demonstrated by comparing 1086 manually and automatically segmented nuclei. Pearson correlation coefficients between the gene position measurements were above 0.9 (p < 0.05). A preliminary experiment was conducted to validate the ranked retrieval in a test to detect cancer. Independent manual measurement of gene positions agreed with automatic results in 21 out of 26 statistical comparisons against a pooled normal (benign) gene position distribution. Conclusions: Accurate segmentation is necessary to automate quantitative image analysis for applications such as gene repositioning. However, due to heterogeneity within images and across different applications, no segmentation algorithm provides a satisfactory solution. Automated assessment of segmentations by ranked retrieval is capable of reducing or even eliminating the need to select segmented objects by hand, and represents a significant improvement over binary classification. The method can be extended to other high-throughput applications requiring accurate detection of cells or nuclei across a range of biomedical applications.
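    Ranked retrieval with logistic regression reduces to sorting candidates by predicted probability and keeping the top of the list. A minimal Python sketch; the two features here are synthetic placeholders for the shape/texture/context/copy-number features described above.

```python
# Sketch: rank candidate segmentations by logistic-regression probability.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
good = rng.normal([0.9, 0.1], 0.05, size=(100, 2))   # e.g. solidity, boundary roughness
bad = rng.normal([0.6, 0.4], 0.10, size=(100, 2))
X = np.vstack([good, bad])
y = np.array([1] * 100 + [0] * 100)
model = LogisticRegression().fit(X, y)

candidates = rng.normal([0.75, 0.25], 0.15, size=(10, 2))
prob = model.predict_proba(candidates)[:, 1]
ranking = np.argsort(-prob)                  # best-segmented candidates first
print("ranked candidate indices:", ranking, prob[ranking].round(2))
```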
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 82
    Publication date: 2012-09-13
    Description: Background: Roche 454 sequencing is the leading sequencing technology for producing long-read, high-throughput sequence data. Unlike most methods, where sequencing errors translate to base uncertainties, 454 sequencing inaccuracies create nucleotide gaps. These gaps are particularly troublesome for translated search tools such as BLASTx, where they introduce frame-shifts and result in regions of decreased identity and/or terminated alignments, which affect further analysis. Results: To address this issue, the Homopolymer Aware Cross Alignment Tool (HAXAT) was developed. HAXAT uses a novel dynamic programming algorithm for solving the optimal local alignment between a 454 nucleotide sequence and a protein sequence by allowing frame-shifts, guided by 454 flowpeak values. The algorithm is an efficient minimal extension of the Smith-Waterman-Gotoh algorithm that easily fits into other tools. Experiments using HAXAT demonstrate, through the introduction of 454-specific frame-shift penalties, significantly increased accuracy of alignments spanning homopolymer sequence errors. The full effect of the new parameters introduced with this novel alignment model is explored. Experimental results evaluating homopolymer inaccuracy through alignments show a two- to five-fold increase in Matthews Correlation Coefficient over previous algorithms for 454-derived data. Conclusions: The increased accuracy provided by HAXAT not only results in improved homologue estimation, but also provides uninterrupted reading frames, which greatly facilitates further analysis of protein space, for example phylogenetic analysis. The alignment tool is available at http://bioinfo.ifm.liu.se/454tools/haxat.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 83
    Publication date: 2012-08-18
    Description: Background: We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus. Results: Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on these data. Conclusions: The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full-text publications.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 84
    Publication date: 2012-08-18
    Description: Background: It is now well established that nearly 20% of human cancers are caused by infectious agents, and the list of human oncogenic pathogens is likely to grow in the future across a variety of cancer types. Whole tumor transcriptome and genome sequencing by next-generation sequencing technologies presents an unparalleled opportunity for pathogen detection and discovery in human tissues but requires development of new genome-wide bioinformatics tools. Results: Here we present CaPSID (Computational Pathogen Sequence IDentification), a comprehensive bioinformatics platform for identifying, querying and visualizing both exogenous and endogenous pathogen nucleotide sequences in tumor genomes and transcriptomes. CaPSID includes a scalable, high performance database for data storage and a web application that integrates the genome browser JBrowse. CaPSID also provides useful metrics for sequence analysis of pre-aligned BAM files, such as gene and genome coverage, and is optimized to run efficiently on multiprocessor computers with low memory usage. Conclusions: To demonstrate the usefulness and efficiency of CaPSID, we carried out a comprehensive analysis of both a simulated dataset and transcriptome samples from ovarian cancer. CaPSID correctly identified all of the human and pathogen sequences in the simulated dataset, while in the ovarian dataset CaPSID's predictions were successfully validated in vitro.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 85
    Publication date: 2012-08-17
    Description: Background: The transcriptional activity of genes depends on many factors, such as DNA motifs, conformational characteristics of DNA, and melting properties, and computational approaches exist for their identification. However, in real applications the number of predicted elements, for example DNA motifs, may be considerably large. When several computational programs are applied, systematically knocking out each potential element experimentally becomes unproductive. Hence, one needs an approach that is able to integrate many heterogeneous computational methods and, on that basis, suggest selected regulatory elements for experimental verification. Results: Here, we present an integrative bioinformatic approach aimed at the discovery of regulatory modules that can be effectively verified experimentally. It is based on combinatorial analysis of known and novel binding motifs, as well as of any other known features of promoters. The goal of this method is the identification of a collection of modules that are specific for an established dataset and at the same time are optimal for experimental verification. The method is particularly effective on small datasets, where most statistical approaches fail. We apply it to promoters that drive tumor-specific gene expression in tumor-colonizing Gram-negative bacteria. The method successfully identified a number of potential modules, which required only a few experiments to be verified. The resulting minimal functional bacterial promoter exhibited high specificity of expression in cancerous tissue. Conclusions: Experimental analysis of promoter structures guided by bioinformatics has proved to be efficient. The developed computational method is able to include heterogeneous features of promoters and suggest combinatorial modules for experimental testing. The extensibility and robustness of the methodology ensure good results for a wide range of problems.
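    The combinatorial heart of the approach can be illustrated in a few lines: enumerate small combinations of promoter features and rank them by how specifically they occur in the target set versus background. The feature names and sets below are invented for illustration; the actual method integrates many more heterogeneous features.

        # Toy combinatorial module search: score feature pairs by how cleanly
        # their joint presence separates target promoters from background.
        from itertools import combinations

        target = [{"motifA", "motifB", "bendability"}, {"motifA", "motifB"},
                  {"motifA", "motifB", "melting"}]
        background = [{"motifA"}, {"motifB", "melting"}, {"bendability"}, set()]

        def specificity(module):
            hit_t = sum(module <= p for p in target) / len(target)
            hit_b = sum(module <= p for p in background) / len(background)
            return hit_t - hit_b          # high = present in targets, absent elsewhere

        features = set().union(*target)
        modules = sorted(combinations(sorted(features), 2),
                         key=lambda m: specificity(set(m)), reverse=True)
        print(modules[:3])                # most specific candidate modules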
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 86
    Publication date: 2012-06-19
    Description: Background: Choosing appropriate primers is probably the single most important factor affecting the polymerase chain reaction (PCR). Specific amplification of the intended target requires that primers do not have matches to other targets in certain orientations and within certain distances that allow undesired amplification. The process of designing specific primers typically involves two stages. First, the primers flanking regions of interest are generated either manually or using software tools; then they are searched against an appropriate nucleotide sequence database using tools such as BLAST to examine the potential targets. However, the latter is not an easy process, as one needs to examine many details between primers and targets, such as the number and the positions of matched bases, the primer orientations and the distance between forward and reverse primers. The complexity of such analysis usually makes this a time-consuming and very difficult task for users, especially when the primers have a large number of hits. Furthermore, although the BLAST program has been widely used for primer target detection, it is in fact not an ideal tool for this purpose, as BLAST is a local alignment algorithm and does not necessarily return complete match information over the entire primer range. Results: We present a new software tool called Primer-BLAST to alleviate the difficulty in designing target-specific primers. This tool combines BLAST with a global alignment algorithm to ensure a full primer-target alignment and is sensitive enough to detect targets that have a significant number of mismatches to primers. Primer-BLAST allows users to design new target-specific primers in one step as well as to check the specificity of pre-existing primers. Primer-BLAST also supports placing primers based on exon/intron locations and excluding single nucleotide polymorphism (SNP) sites in primers. Conclusions: We describe a robust and fully implemented general purpose primer design tool that designs target-specific PCR primers. Primer-BLAST offers flexible options to adjust the specificity threshold and other primer properties. This tool is publicly available at http://www.ncbi.nlm.nih.gov/tools/primer-blast.
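    The specificity logic being described, that a primer pair only produces an unintended product if both primers match a template over their full length, in opposite orientations and within an amplifiable distance, can be sketched as follows. The naive scanning search and mismatch threshold are simplifications for illustration; the real tool combines BLAST seeding with global alignment.

        # Toy full-length primer specificity check with a product-size window.
        def revcomp(s):
            return s.translate(str.maketrans("ACGT", "TGCA"))[::-1]

        def sites(primer, template, max_mismatch=2):
            """All start positions where the FULL primer matches the template."""
            hits = []
            for i in range(len(template) - len(primer) + 1):
                window = template[i:i + len(primer)]
                if sum(a != b for a, b in zip(primer, window)) <= max_mismatch:
                    hits.append(i)
            return hits

        def amplicons(fwd, rev, template, max_len=2000):
            fwd_hits = sites(fwd, template)
            rev_hits = sites(revcomp(rev), template)  # reverse primer binds the other strand
            return [(f, r + len(rev))
                    for f in fwd_hits for r in rev_hits
                    if f < r and (r + len(rev)) - f <= max_len]

        template = "AAACGTACGTTTTTTTTTTGGGGGGGGGGCCCCATCGATCG"
        print(amplicons("ACGTACGT", "CGATCGAT", template))  # predicted products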
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 87
    Publication date: 2012-06-21
    Description: Background: Molecular recognition based on the complementary base pairing of deoxyribonucleic acid (DNA) is the fundamental principle in the fields of genetics, DNA nanotechnology and DNA computing. We present an exhaustive DNA sequence design algorithm that allows the generation of sets containing a maximum number of sequences with defined properties. EGNAS (Exhaustive Generation of Nucleic Acid Sequences) offers the possibility of controlling both interstrand and intrastrand properties. The guanine-cytosine content can be adjusted. Sequences can be forced to start and end with guanine or cytosine. This option reduces the risk of "fraying" of DNA strands. It is possible to limit cross-hybridizations of a defined length, and to adjust the uniqueness of sequences. Self-complementarity and hairpin structures of a certain length can be avoided. Sequences and subsequences can optionally be forbidden. Furthermore, sequences can be designed to have minimum interactions with predefined strands and neighboring sequences. Results: The algorithm is realized in a C++ program. TAG sequences can be generated and combined with primers for single-base extension reactions, which were described for multiplexed genotyping of single nucleotide polymorphisms. Thereby, possible foldback through intrastrand interaction of TAG-primer pairs can be limited. The design of sequences for specific attachment of molecular constructs to DNA origami is presented. Conclusions: We developed a new software tool called EGNAS for the design of unique nucleic acid sequences. The presented exhaustive algorithm allows the generation of larger sequence sets than previous software under equal constraints. EGNAS is freely available for noncommercial use at http://www.chm.tu-dresden.de/pc6/EGNAS.
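    A toy version of this constraint-driven generation strategy is easy to write down: enumerate candidate strands and accept one only if it satisfies the constraints given those already accepted. The sketch below checks just two constraints, a GC-content window and a shared-substring limit as a crude stand-in for cross-hybridization control, with invented parameter values; EGNAS itself handles many more constraints and is written in C++.

        # Toy exhaustive sequence-set generation under simple constraints.
        from itertools import product

        def gc_ok(seq, lo=0.4, hi=0.6):
            gc = sum(c in "GC" for c in seq) / len(seq)
            return lo <= gc <= hi

        def cross_hyb(a, b, k):
            """True if a and b share any common substring of length k."""
            subs = {a[i:i + k] for i in range(len(a) - k + 1)}
            return any(b[i:i + k] in subs for i in range(len(b) - k + 1))

        def design(length=8, k=4, limit=10):
            accepted = []
            for cand in map("".join, product("ACGT", repeat=length)):
                if not gc_ok(cand):
                    continue
                if any(cross_hyb(cand, s, k) for s in accepted):
                    continue
                accepted.append(cand)
                if len(accepted) == limit:
                    break
            return accepted

        print(design())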
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 88
    Publication date: 2012-07-07
    Description: Background: The ability to predict protein-protein binding sites has a wide range of applications, including signal transduction studies, de novo drug design, structure identification and comparison of functional sites. The interface in a complex involves two structurally matched protein subunits, and the binding sites can be predicted by identifying structural matches at protein surfaces. Results: We propose a method which enumerates "all" the configurations (or poses) between two proteins (3D coordinates of the two subunits in a complex) and evaluates each configuration by the interaction between its components using the Atomic Contact Energy function. The enumeration is achieved efficiently by exploring a set of rigid transformations. Our approach incorporates a surface identification technique and a method for avoiding clashes of two subunits when computing rigid transformations. When the optimal transformations according to the Atomic Contact Energy function are identified, the corresponding binding sites are given as predictions. Our results show that this approach consistently performs better than other methods in binding site identification. Conclusions: Our method achieved a higher success rate than other methods, with the prediction quality improved in terms of both accuracy and coverage. Moreover, our method is able to predict the configurations of two binding proteins, whereas most other methods predict only the binding sites. The software package is available at http://sites.google.com/site/guofeics/dobi for non-commercial use.
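    The enumerate-and-score scheme can be sketched with toy coordinates: sample rigid transformations of the mobile subunit over a grid and keep the pose with the best interaction score. The placeholder energy below simply rewards near-contacts and penalizes clashes; the actual method uses the Atomic Contact Energy and much finer, clash-aware sampling.

        # Coarse pose enumeration: rotate/translate subunit B over a grid and
        # score each pose with a stand-in contact energy.
        import numpy as np
        from itertools import product

        def rot_z(theta):
            c, s = np.cos(theta), np.sin(theta)
            return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

        def contact_score(A, B):
            d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
            clashes = (d < 2.0).sum()                 # overlapping atoms
            contacts = ((d >= 2.0) & (d < 5.0)).sum()
            return contacts - 10 * clashes            # toy energy, not ACE

        rng = np.random.default_rng(0)
        A = rng.normal(size=(20, 3)) * 3              # fixed subunit (toy atoms)
        B = rng.normal(size=(20, 3)) * 3              # mobile subunit

        best_pose, best_score = None, float("-inf")
        for th in np.linspace(0, 2 * np.pi, 12, endpoint=False):
            for tx, ty in product(range(-6, 7, 3), repeat=2):
                pose = B @ rot_z(th).T + np.array([tx, ty, 0.0])
                s = contact_score(A, pose)
                if s > best_score:
                    best_pose, best_score = (round(th, 2), tx, ty), s
        print(best_pose, best_score)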
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 89
    Publication date: 2012-07-04
    Description: Background: Computational prediction of protein subcellular localization can greatly help to elucidate protein function. Despite the existence of dozens of protein localization prediction algorithms, prediction accuracy and coverage remain low. Several ensemble algorithms have been proposed to improve the prediction performance, and these usually include as many as 10 or more individual localization algorithms. However, their performance is still limited by running complexity and redundancy among the individual prediction algorithms. Results: This paper proposes a novel method for the rational design of minimalist ensemble algorithms for practical genome-wide protein subcellular localization prediction. The algorithm is based on combining a feature-selection-based filter and a logistic regression classifier. Using a novel concept of contribution scores, we analyzed issues of algorithm redundancy, consensus mistakes, and algorithm complementarity in designing ensemble algorithms. We applied the proposed minimalist logistic regression (LR) ensemble algorithm to two genome-wide datasets of Yeast and Human and compared its performance with current ensemble algorithms. Experimental results showed that the minimalist ensemble algorithm can achieve high prediction accuracy with only 1/3 to 1/2 of the individual predictors used by current ensemble algorithms, which greatly reduces computational complexity and running time. It was found that high-performance ensemble algorithms are usually composed of predictors that together cover most of the available features. Compared to the best individual predictor, our ensemble algorithm improved the prediction accuracy from an AUC score of 0.558 to 0.707 for the Yeast dataset and from 0.628 to 0.646 for the Human dataset. Compared with popular weighted-voting-based ensemble algorithms, our classifier-based ensemble algorithm achieved much better performance without suffering from the inclusion of too many individual predictors. Conclusions: We proposed a method for the rational design of minimalist ensemble algorithms using feature selection and classifiers. The proposed minimalist ensemble algorithm based on logistic regression can achieve equal or better prediction performance while using only one third to one half of the individual predictors required by other ensemble algorithms. The results also suggest that meta-predictors that take advantage of a variety of features by combining individual predictors tend to achieve the best performance. The LR ensemble server and related benchmark datasets are available at http://mleg.cse.sc.edu/LRensemble/cgibin/predict.cgi
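    The two-stage recipe, a feature-selection filter over the outputs of individual predictors followed by a logistic regression combiner, can be sketched with scikit-learn as below. The synthetic predictor scores merely mimic a situation where a few of many tools carry most of the signal; they are not the paper's data.

        # Minimalist-ensemble sketch: filter individual predictor outputs,
        # then combine the survivors with logistic regression.
        import numpy as np
        from sklearn.feature_selection import SelectKBest, f_classif
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        y = rng.integers(0, 2, size=300)
        # Columns = scores from 10 hypothetical individual predictors; the
        # first three carry signal, the rest stand in for redundant tools.
        X = rng.normal(size=(300, 10))
        X[:, :3] += y[:, None] * 1.5

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        pick = SelectKBest(f_classif, k=3).fit(X_tr, y_tr)       # the "filter"
        clf = LogisticRegression().fit(pick.transform(X_tr), y_tr)
        auc = roc_auc_score(y_te, clf.predict_proba(pick.transform(X_te))[:, 1])
        print("kept predictors:", pick.get_support().nonzero()[0],
              "AUC:", round(auc, 3))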
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 90
    Publication date: 2012-07-10
    Description: Background: Over the past years, statistical and Bayesian approaches have become increasingly appreciated to address the long-standing problem of computational RNA structure prediction. Recently, a novel probabilistic method for the prediction of RNA secondary structures from a single sequence has been studied, which is based on generating statistically representative and reproducible samples of the entire ensemble of feasible structures for a particular input sequence. This method samples the possible foldings from a distribution implied by a sophisticated (traditional or length-dependent) stochastic context-free grammar (SCFG) that mirrors the standard thermodynamic model applied in modern physics-based prediction algorithms. Specifically, that grammar represents an exact probabilistic counterpart to the energy model underlying the Sfold software, which employs a sampling extension of the partition function (PF) approach to produce statistically representative subsets of the Boltzmann-weighted ensemble. Although both sampling approaches have the same worst-case time and space complexities, it has been indicated that they differ in performance (both with respect to prediction accuracy and quality of generated samples), where neither of these two competing approaches generally outperforms the other. Results: In this work, we consider the SCFG-based approach in order to analyze how the quality of generated sample sets and the corresponding prediction accuracy change when different degrees of disturbance are incorporated into the required sampling probabilities. This is motivated by the fact that if the results prove to be resistant to large errors on the distinct sampling probabilities (compared to the exact ones), then it will be an indication that these probabilities do not need to be computed exactly; it may be sufficient and more efficient to approximate them. Thus, it might then be possible to decrease the worst-case time requirements of such an SCFG-based sampling method without significant accuracy losses. If, on the other hand, the quality of sampled structures can be observed to react strongly to slight disturbances, there is little hope for improving the complexity by heuristic procedures. We hence provide a reliable test for the hypothesis that a heuristic method could be implemented to improve the time scaling of RNA secondary structure prediction in the worst case -- without sacrificing much of the accuracy of the results. Conclusions: Our experiments indicate that absolute errors generally lead to the generation of useless sample sets, whereas relative errors seem to have only a small negative impact on both the predictive accuracy and the overall quality of the resulting structure samples. Based on these observations, we present some useful ideas for developing a time-reduced sampling method guaranteeing an acceptable predictive accuracy. We also discuss some inherent drawbacks that arise in the context of approximation. The key results of this paper are crucial for the design of an efficient and competitive heuristic prediction method based on the increasingly accepted and attractive statistical sampling approach. This has indeed been indicated by the construction of prototype algorithms (see [25]).
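    The perturbation experiment itself is straightforward to mimic: disturb an exact sampling distribution with relative or absolute errors, renormalize, and measure how far the disturbed distribution has moved. The toy probabilities below are arbitrary; in the study the distribution is over ensembles of secondary structures. Even on such toy inputs one can see why absolute errors are harsher: a fixed-size error overwhelms small probabilities, while a proportional one preserves their relative order.

        # Relative vs. absolute disturbance of a sampling distribution.
        import numpy as np

        rng = np.random.default_rng(0)
        p = np.array([0.5, 0.3, 0.15, 0.05])          # "exact" probabilities

        def perturb(p, eps, mode):
            if mode == "relative":                     # error proportional to p_i
                q = p * (1 + rng.uniform(-eps, eps, p.size))
            else:                                      # absolute error of size eps
                q = p + rng.uniform(-eps, eps, p.size)
            q = np.clip(q, 1e-12, None)
            return q / q.sum()

        for mode in ("relative", "absolute"):
            q = perturb(p, 0.2, mode)
            tv = 0.5 * np.abs(p - q).sum()            # total variation distance
            print(mode, "TV distance:", round(tv, 4))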
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 91
    Publication date: 2012-07-12
    Description: Background: There is a need for automated methods to learn general features of the interactions of a ligand class with its diverse set of protein receptors. An appropriate machine learning approach is Inductive Logic Programming (ILP), which automatically generates comprehensible rules in addition to prediction. The development of ILP systems which can learn rules of the complexity required for studies on protein structure remains a challenge. In this work we use a new ILP system, ProGolem, and demonstrate its performance on learning features of hexose-protein interactions. Results: The rules induced by ProGolem detect interactions mediated by aromatics and by planar-polar residues, in addition to less common features such as the aromatic sandwich. They also reveal a previously unreported dependency involving the residues CYS and LEU, and specify interactions involving aromatic and hydrogen-bonding residues. Conclusions: In addition to confirming literature results, ProGolem's model has a 10-fold cross-validated predictive accuracy that is superior, at the 95% confidence level, to another ILP system previously used to study protein/hexose interactions and is comparable with state-of-the-art statistical learners.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 92
    Publication date: 2012-07-13
    Description: Background: In bioinformatics, it is important to build extensible and low-maintenance systems that are able to deal with the new tools and data formats that are constantly being developed. The traditional and simplest implementation of pipelines involves hardcoding the execution steps into programs or scripts. This approach can lead to problems when a pipeline expands, because the incorporation of new tools is often error prone and time consuming. Current approaches to pipeline development, such as workflow management systems, focus on analysis tasks that are systematically repeated without significant changes in their course of execution, such as genome annotation. However, more dynamism in pipeline composition is necessary when each execution requires a different combination of steps. Results: We propose a graph-based approach to implement extensible and low-maintenance pipelines that is suitable for pipeline applications with multiple functionalities requiring different combinations of steps in each execution. Here pipelines are composed automatically by compiling a specialised set of tools on demand, depending on the functionality required, instead of specifying every sequence of tools in advance. We represent the connectivity of pipeline components with a directed graph in which components are the graph edges, their inputs and outputs are the graph nodes, and the paths through the graph are pipelines. To that end, we developed special data structures and a pipeline system algorithm. We demonstrate the applicability of our approach by implementing a format conversion pipeline for the fields of population genetics and genetic epidemiology, but our approach is also helpful in other fields where multiple software tools are needed to perform comprehensive analyses, such as gene expression and proteomics analyses. The project code, documentation and the Java executables are available under an open source license at http://code.google.com/p/dynamic-pipeline. The system has been tested on Linux and Windows platforms. Conclusions: Our graph-based approach enables the automatic creation of pipelines by compiling a specialised set of tools on demand, depending on the functionality required. It also allows the implementation of extensible and low-maintenance pipelines and contributes towards consolidating openness and collaboration in bioinformatics systems. It is targeted at pipeline developers and is suited to implementing applications with sequential execution steps and combined functionalities. In the format conversion application, the automatic combination of conversion tools increased both the number of possible conversions available to the user and the extensibility of the system to allow for future updates with new file formats.
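    The graph formulation maps directly onto a shortest-path search: formats are nodes, converters are directed edges, and composing a pipeline is finding a path on demand. A minimal Python sketch of the idea (the project itself is implemented in Java, and the tool and format names below are invented):

        # File formats as nodes, converter tools as edges; a pipeline is a path.
        from collections import deque

        tools = [
            ("vcf", "ped", "vcf2ped"),
            ("ped", "bed", "ped2bed"),
            ("vcf", "hapmap", "vcf2hapmap"),
            ("hapmap", "bed", "hapmap2bed"),
        ]

        graph = {}
        for src, dst, name in tools:
            graph.setdefault(src, []).append((dst, name))

        def compose(src, dst):
            """BFS over formats; returns the shortest tool chain src -> dst."""
            queue, seen = deque([(src, [])]), {src}
            while queue:
                fmt, chain = queue.popleft()
                if fmt == dst:
                    return chain
                for nxt, tool in graph.get(fmt, []):
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append((nxt, chain + [tool]))
            return None

        print(compose("vcf", "bed"))   # e.g. ['vcf2ped', 'ped2bed']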
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 93
    Publication date: 2012-07-10
    Description: Background: Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. Results: This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement. Conclusions: As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 94
    Publication date: 2012-07-10
    Description: Background: Next-generation sequencing systems are capable of rapid and cost-effective DNA sequencing, thus enabling routine sequencing tasks and taking us one step closer to personalized medicine. The accuracy and lengths of their reads, however, have yet to surpass those provided by the conventional Sanger sequencing method. This motivates the search for computationally efficient algorithms capable of reliable and accurate detection of the order of nucleotides in short DNA fragments from the acquired data. Results: In this paper, we consider Illumina's sequencing-by-synthesis platform, which relies on reversible-terminator chemistry, and describe the acquired signal by reformulating its mathematical model as a Hidden Markov Model. Relying on this model and sequential Monte Carlo methods, we develop a parameter estimation and base calling scheme called ParticleCall. ParticleCall is tested on a data set obtained by sequencing phiX174 bacteriophage using Illumina's Genome Analyzer II. The results show that the developed base calling scheme is significantly more computationally efficient than the best performing unsupervised method currently available, while achieving the same accuracy. Conclusions: ParticleCall provides more accurate calls than Illumina's base calling algorithm, Bustard. At the same time, ParticleCall is significantly more computationally efficient than other recent schemes with similar performance, rendering it more feasible for high-throughput sequencing data analysis. Improved base calling accuracy will have immediate beneficial effects on the performance of downstream applications such as SNP and genotype calling.
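    The sequential Monte Carlo machinery underneath can be illustrated with a bare-bones bootstrap filter on a toy four-channel emission model: propose particle bases, weight them by the observation likelihood, and resample. The signal model below is invented and far simpler than the paper's reversible-terminator model with phasing effects.

        # Bare-bones particle-filter base calling on a toy signal model.
        import numpy as np

        rng = np.random.default_rng(0)
        BASES = 4
        true_seq = rng.integers(0, BASES, size=10)
        # Noisy observations: one intensity channel per base, signal + noise.
        obs = np.eye(BASES)[true_seq] + rng.normal(scale=0.4, size=(10, BASES))

        def likelihood(bases, y):
            """Gaussian emission likelihood of observation y given each base."""
            mu = np.eye(BASES)[bases]
            return np.exp(-0.5 * ((y - mu) ** 2).sum(axis=1) / 0.4 ** 2)

        N = 500
        calls = []
        for y in obs:
            particles = rng.integers(0, BASES, size=N)      # uniform base prior
            w = likelihood(particles, y)
            w /= w.sum()
            particles = particles[rng.choice(N, size=N, p=w)]  # resampling step
            calls.append(np.bincount(particles, minlength=BASES).argmax())

        print("call errors:", int((np.array(calls) != true_seq).sum()))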
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 95
    Publication date: 2012-06-15
    Description: Background: The analysis of complex diseases is an important problem in human genetics. Because multifactoriality is expected to play a pivotal role, many studies are currently focused on collecting information on the genetic and environmental factors that potentially influence these diseases. However, there is still a lack of efficient and thoroughly tested statistical models that can be used to identify implicated features and their interactions. Simulations using large, biologically realistic data sets with known gene-gene and gene-environment interactions that influence the risk of a complex disease are a convenient and useful way to assess the performance of statistical methods. Results: The Gene-Environment iNteraction Simulator 2 (GENS2) simulates interactions among two genetic and one environmental factor and also allows for epistatic interactions. GENS2 is based on data with realistic patterns of linkage disequilibrium, and imposes no limitations either on the number of individuals to be simulated or on the number of non-predisposing genetic/environmental factors to be considered. The GENS2 tool is able to simulate gene-environment and gene-gene interactions. To make the simulator more intuitive, the input parameters are expressed as standard epidemiological quantities. GENS2 is written in Python and takes advantage of operators and modules provided by the simuPOP simulation environment. It can be used through a graphical or a command-line interface and is freely available from http://sourceforge.net/projects/gensim. The software is released under the GNU General Public License version 3.0. Conclusions: Data produced by GENS2 can be used as a benchmark for evaluating statistical tools designed for the identification of gene-gene and gene-environment interactions.
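    The kind of data such a simulator produces can be mocked up in a few lines: draw genotypes and an exposure, then generate disease status from a logistic model containing a gene-environment interaction term. All frequencies and effect sizes below are arbitrary illustration values, not GENS2 defaults.

        # Miniature gene-environment simulation with an interaction term.
        import numpy as np

        rng = np.random.default_rng(0)
        n = 1000
        g1 = rng.binomial(2, 0.3, n)        # genotype at locus 1 (0/1/2 risk alleles)
        g2 = rng.binomial(2, 0.1, n)        # genotype at locus 2
        env = rng.binomial(1, 0.4, n)       # binary environmental exposure

        # log-odds: baseline + main effects + gene-environment interaction (g1 x env)
        logit = -2.0 + 0.3 * g1 + 0.2 * g2 + 0.5 * env + 0.8 * g1 * env
        disease = rng.binomial(1, 1 / (1 + np.exp(-logit)))

        print("simulated prevalence:", disease.mean())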
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 96
    Publication date: 2012-06-22
    Description: Background: High-throughput technologies such as DNA, RNA, protein, antibody and peptide microarrays are often used to examine differences across drug treatments, diseases, transgenic animals, and others. Typically one trains a classification system by gathering large amounts of probe-level data, selecting informative features, and classifying test samples using a small number of features. As new microarrays are invented, classification systems that worked well for other array types may not be ideal. Expression microarrays, arguably one of the most prevalent array types, have been used for years to help develop classification algorithms. Many biological assumptions are built into classifiers that were designed for these types of data. One of the more problematic is the assumption of independence, both at the probe level and again at the biological level. Probes for RNA transcripts are designed to bind single transcripts. At the biological level, many genes have dependencies across transcriptional pathways, where co-regulation of transcriptional units may make many genes appear completely dependent. Thus, algorithms that perform well for gene expression data may not be suitable when other technologies with different binding characteristics exist. The immunosignaturing microarray is based on complex mixtures of antibodies binding to arrays of random-sequence peptides. It relies on many-to-many binding of antibodies to the random-sequence peptides: each peptide can bind multiple antibodies and each antibody can bind multiple peptides. This technology has been shown to be highly reproducible and appears promising for diagnosing a variety of disease states. However, it is not clear which classification algorithm is optimal for analyzing this new type of data. Methods: We characterized several classification algorithms for analyzing immunosignaturing data. We selected several datasets that range from easy to difficult to classify, from simple monoclonal binding to complex binding patterns in asthma patients. We then classified the biological samples using 17 different classification algorithms. Results: Using a wide variety of assessment criteria, we found 'Naive Bayes' far more useful than other widely used methods due to its simplicity, robustness, speed and accuracy. Conclusions: The 'Naive Bayes' algorithm appears to accommodate the complex patterns hidden within multilayered immunosignaturing microarray data due to its fundamental mathematical properties.
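    Naive Bayes is attractive in this setting partly because it scales linearly with the number of features and tolerates many weakly informative ones. The toy benchmark below, with random "peptide" intensities standing in for immunosignatures, shows the typical usage pattern rather than the study's actual evaluation.

        # Toy Naive Bayes run on many weakly informative features.
        import numpy as np
        from sklearn.naive_bayes import GaussianNB
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        y = rng.integers(0, 2, size=120)                 # two disease states
        X = rng.normal(size=(120, 1000))                 # 1000 random-peptide features
        X[y == 1, :50] += 0.6                            # weak shifts on a feature subset

        print(cross_val_score(GaussianNB(), X, y, cv=5).mean())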
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 97
    Publication date: 2012-06-23
    Description: Background: The NCBI Conserved Domain Database (CDD) consists of a collection of multiple sequence alignments of protein domains that are at various stages of being manually curated into evolutionary hierarchies based on conserved and divergent sequence and structural features. These domain models are annotated to provide insights into the relationships between sequence, structure and function via web-based BLAST searches. Results: Here we automate the generation of conserved domain (CD) hierarchies using a combination of heuristic and Markov chain Monte Carlo sampling procedures and starting from a (typically very large) multiple sequence alignment. This procedure relies on statistical criteria to define each hierarchy based on the conserved and divergent sequence patterns associated with protein functional specialization. At the same time this facilitates the sequence and structural annotation of residues that are functionally important. These statistical criteria also provide a means to objectively assess the quality of CD hierarchies, a non-trivial task considering that the protein subgroups are often very distantly related--a situation in which standard phylogenetic methods can be unreliable. Our aim here is to automatically generate (typically sub-optimal) hierarchies that, based on statistical criteria and visual comparisons, are comparable to manually curated hierarchies; this serves as the first step toward the ultimate goal of obtaining optimal hierarchical classifications. A plot of runtimes for the most time-intensive (non-parallelizable) part of the algorithm indicates a nearly linear time complexity so that, even for the extremely large Rossmann fold protein class, results are obtained in about a day. Conclusions: This approach automates the rapid creation of protein domain hierarchies and thus will eliminate one of the most time consuming aspects of conserved domain database curation. At the same time, it also facilitates protein domain annotation by identifying those pattern residues that most distinguish each protein domain subgroup from other related subgroups.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 98
    Publication date: 2012-06-26
    Description: Background: Linkage analysis is the first step in the search for a disease gene. Linkage studies have facilitated the identification of several hundred human genes that can harbor mutations leading to a disease phenotype. In this paper, we study a very important case, where the sampled individuals are closely related but the pedigree is not given. This situation arises very often when the individuals share a common ancestor six or more generations ago. To our knowledge, no existing algorithm gives good results for this case. Results: To solve this problem, we first developed heuristic algorithms for haplotype inference without any given pedigree. We propose a model using the parsimony principle that can be viewed as an extension of the model first proposed by Dan Gusfield. Our heuristic algorithm uses Clark's inference rule to infer haplotype segments. Conclusions: We ran our program both on simulated data and on a set of real data from the phase II HapMap database. Experiments show that our program performs well. The recall value ranges from 90% to 99% in various cases, implying that the program can report more than 90% of the true mutation regions. The precision varies from 29% to 90%. When the precision is 29%, the size of the reported regions is three times that of the true mutation region; this is still very useful for narrowing down the range of the disease gene location. Our program can complete the computation for all the tested cases, with about 110,000 SNPs on a chromosome, within 20 seconds.
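    Clark's inference rule, the engine of the heuristic, is easy to demonstrate: resolve genotypes with at most one heterozygous site directly, then repeatedly explain an ambiguous genotype as one known haplotype plus its forced complement. The three-marker genotypes below are invented; coding is 0/1 for homozygous sites and 2 for heterozygous ones.

        # Minimal demonstration of Clark's inference rule for haplotyping.
        def complement(genotype, hap):
            """If hap is compatible with genotype, return the other haplotype."""
            other = []
            for g, h in zip(genotype, hap):
                if g == 2:
                    other.append(1 - h)
                elif g == h:
                    other.append(h)
                else:
                    return None
            return tuple(other)

        genotypes = [(0, 1, 0), (2, 1, 0), (2, 1, 2)]
        known = set()

        # Phase 1: genotypes with <= 1 heterozygous site are unambiguous.
        for g in genotypes:
            if g.count(2) == 0:
                known.add(g)
            elif g.count(2) == 1:
                known.add(tuple(0 if x == 2 else x for x in g))
                known.add(tuple(1 if x == 2 else x for x in g))

        # Phase 2: Clark's rule -- explain the rest via known haplotypes.
        changed = True
        while changed:
            changed = False
            for g in genotypes:
                if g.count(2) > 1:
                    for h in list(known):
                        other = complement(g, h)
                        if other is not None and other not in known:
                            known.add(other)
                            changed = True

        print(sorted(known))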
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 99
    Publication date: 2012-06-29
    Description: Background: Dihydrouridine (D) is a modified base found at conserved positions in the D-loop of tRNA in Bacteria, Eukaryota, and some Archaea. Despite the abundant occurrence of D, little is known about its biochemical roles in mediating tRNA function. It is assumed that D may destabilize the structure of tRNA and thus enhance its conformational flexibility. D is generated post-transcriptionally by reduction of the 5,6-double bond of a uridine residue in tRNA transcripts. The reaction is carried out by dihydrouridine synthases (DUS). DUS constitute a conserved family of enzymes encoded by the orthologous gene family COG0042. In protein sequence databases, members of COG0042 are typically annotated as "predicted TIM-barrel enzymes, possibly dehydrogenases, nifR3 family". Results: To elucidate sequence-structure-function relationships in the DUS family, a comprehensive bioinformatic analysis was carried out. We performed extensive database searches to identify all members of the currently known DUS family, followed by clustering analysis to subdivide it into subfamilies of closely related sequences. We analyzed the phylogenetic distributions of all members of the DUS family and inferred the evolutionary tree, which suggests a scenario for the evolutionary origin of dihydrouridine-forming enzymes. For a human representative of the DUS family, the hDus2 protein, suggested as a potential drug target in cancer, we generated a homology model. While this article was under review, a crystal structure of a DUS representative was published, giving us an opportunity to validate the model. Conclusions: Our evolutionary and structural classification of the DUS family provides a framework to study the functional differences among these proteins, suggests a scenario for the evolutionary origin of dihydrouridine formation, and establishes a background that will guide experimental analyses.
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central
  • 100
    Publication date: 2012-06-29
    Description: Background: Evidence suggests that in prokaryotes sequence-dependent transcriptional pauses affect the dynamics of transcription and translation, as well as of small genetic circuits. So far, a few pause-prone sequences have been identified from in vitro measurements of transcription elongation kinetics. Results: Using a stochastic model of gene expression at the nucleotide and codon levels with realistic parameter values, we investigate three different but related questions and present statistical methods for their analysis. First, we show that information from in vivo RNA and protein temporal numbers is sufficient to discriminate between models with and without a pause site in their coding sequence. Second, we demonstrate that it is possible to separate a large variety of models from each other, with pauses of various durations and locations in the template, by means of hierarchical clustering and a random forest classifier. Third, we introduce an approximate likelihood function that allows estimation of the location of a pause site. Conclusions: This method can aid in detecting unknown pause-prone sequences from temporal measurements of RNA and protein numbers at a genome-wide scale and thus elucidate possible roles that these sequences play in the dynamics of genetic networks and phenotype.
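    The model-discrimination step can be caricatured as a supervised learning problem: summarize each simulated time series into a few statistics and train a classifier to separate pause from no-pause models. The two-regime toy simulator below only mimics the idea that a pause changes temporal statistics; it is not the paper's delayed stochastic model.

        # Toy pause-vs-no-pause discrimination with a random forest.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)

        def simulate(pause, steps=200):
            rate = np.full(steps, 1.0)
            if pause:
                rate[80:120] = 0.2              # transient slowdown at a "pause site"
            counts = np.cumsum(rng.poisson(rate * 0.1))
            # Summary statistics stand in for richer time-series features.
            return [counts.mean(), counts.std(), np.diff(counts).var()]

        X = np.array([simulate(p) for p in [0, 1] * 100])
        y = np.array([0, 1] * 100)
        print(cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean())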
    Digital ISSN: 1471-2105
    Subject: Biology, Computer Science
    Published by BioMed Central