ALBERT

All Library Books, journals and Electronic Records Telegrafenberg


Filter
Collection
  • Articles  (4,424)
Publisher
  • BioMed Central  (4,424)
  • Copernicus
  • Frontiers Media
Years
  • 2010-2014  (4,424)
  • 1980-1984
  • 1925-1929
Journal
  • BMC Bioinformatics  (1,006)
Topic
  • Biology  (4,424)
  • 1
    Publication Date: 2013-09-12
    Description: Background: High-throughput sequencing technologies are improving in quality, capacity and costs, providing versatile applications in DNA and RNA research. For small genomes or fractions of larger genomes, DNA samples can be mixed and loaded together on the same sequencing track. This so-called multiplexing approach relies on a specific DNA tag or barcode that is attached to the sequencing or amplification primer and hence appears at the beginning of the sequence in every read. After sequencing, each sample read is identified on the basis of the respective barcode sequence. Alterations of DNA barcodes during synthesis, primer ligation, DNA amplification, or sequencing may lead to incorrect sample identification unless the error is revealed and corrected. This can be accomplished by implementing error-correcting algorithms and codes. This barcoding strategy increases the total number of correctly identified samples, thus improving overall sequencing efficiency. Two popular sets of error-correcting codes are Hamming codes and Levenshtein codes. Results: Levenshtein codes operate only on words of known length. Since a DNA sequence with an embedded barcode is essentially one continuous long word, application of the classical Levenshtein algorithm is problematic. In this paper we demonstrate the decreased error correction capability of Levenshtein codes in a DNA context and suggest an adaptation of Levenshtein codes that is proven to efficiently correct nucleotide errors in DNA sequences. In our adaptation we take the DNA context into account and redefine the word length whenever an insertion or deletion is revealed. In simulations we show the superior error correction capability of the new method compared to traditional Levenshtein and Hamming based codes in the presence of multiple errors. Conclusion: We present an adaptation of Levenshtein codes to DNA contexts capable of correction of a pre-defined number of insertion, deletion, and substitution mutations. Our improved method is additionally capable of recovering the new length of the corrupted codeword and of correcting on average more random mutations than traditional Levenshtein or Hamming codes. As part of this work we prepared software for the flexible generation of DNA codes based on our new approach. To adapt codes to specific experimental conditions, the user can customize sequence filtering, the number of correctable mutations and barcode length for highest performance.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
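The record above turns on a simple idea: decode a read's barcode by edit distance, re-deriving the effective word length when an indel shifts the barcode boundary. Below is a minimal Python sketch of that idea, not the authors' software; the barcode set, read, and distance threshold are invented for illustration.
```python
# Toy barcode decoding by Levenshtein distance. Trying several read-prefix
# lengths crudely mimics "redefining the word length" after an indel.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def assign_barcode(read: str, barcodes: list[str], max_dist: int = 1):
    """Return the unique barcode within max_dist of some read prefix, else None."""
    best = []
    for bc in barcodes:
        # Try prefix lengths around len(bc) to absorb insertions/deletions.
        d = min(levenshtein(read[:n], bc)
                for n in range(len(bc) - max_dist, len(bc) + max_dist + 1))
        best.append((d, bc))
    best.sort()
    if best[0][0] <= max_dist and (len(best) == 1 or best[1][0] > best[0][0]):
        return best[0][1]
    return None  # ambiguous or too corrupted

print(assign_barcode("ACGTTGGA", ["ACGT", "TTAG", "GGCA"]))  # -> 'ACGT'
```
The classical algorithm would fix the word length at len(bc); scanning neighbouring prefix lengths is a crude analogue of the paper's word-length redefinition.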
  • 2
    Publication Date: 2013-09-24
    Description: Background: Analysis of global gene expression by DNA microarrays is widely used in experimental molecular biology. However, the complexity of such high-dimensional data sets makes it difficult to fully understand the underlying biological features present in the data. The aim of this study is to introduce a method for DNA microarray analysis that provides an intuitive interpretation of data through dimension reduction and pattern recognition. We present the first "Archetypal Analysis" of global gene expression. The analysis is based on microarray data from five integrated studies of Pseudomonas aeruginosa isolated from the airways of cystic fibrosis patients. Results: Our analysis clustered samples into distinct groups with comprehensible characteristics since the archetypes representing the individual groups are closely related to samples present in the data set. Significant changes in gene expression between different groups identified adaptive changes of the bacteria residing in the cystic fibrosis lung. The analysis suggests a similar gene expression pattern between isolates with a high mutation rate (hypermutators) despite accumulation of different mutations for these isolates. This suggests positive selection in the cystic fibrosis lung environment, and changes in gene expression for these isolates are therefore most likely related to adaptation of the bacteria. Conclusions: Archetypal analysis succeeded in identifying adaptive changes of P. aeruginosa. The combination of clustering and matrix factorization made it possible to reveal minor similarities among different groups of data, which other analytical methods failed to identify. We suggest that this analysis could be used to supplement current methods used to analyze DNA microarray data.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 3
    Publication Date: 2013-10-03
    Description: Background: The development of new therapies for orphan genetic diseases represents an extremely important medical and social challenge. Drug repositioning, i.e. finding new indications for approved drugs, could be one of the most cost- and time-effective strategies to cope with this problem, at least in a subset of cases. Therefore, many computational approaches based on the analysis of high throughput gene expression data have so far been proposed to reposition available drugs. However, most of these methods require gene expression profiles directly relevant to the pathologic conditions under study, such as those obtained from patient cells and/or from suitable experimental models. In this work we have developed a new approach for drug repositioning, based on identifying known drug targets showing conserved anti-correlated expression profiles with human disease genes, which is completely independent of the availability of 'ad hoc' gene expression data sets. Results: By analyzing available data, we provide evidence that the genes displaying conserved anti-correlation with drug targets are antagonistically modulated in their expression by treatment with the relevant drugs. We then identified clusters of genes associated to similar phenotypes and showing conserved anticorrelation with drug targets. On this basis, we generated a list of potential candidate drug-disease associations. Importantly, we show that some of the proposed associations are already supported by independent experimental evidence. Conclusions: Our results support the hypothesis that the identification of gene clusters showing conserved anticorrelation with drug targets can be an effective method for drug repositioning and provide a wide list of new potential drug-disease associations for experimental validation.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
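A toy version of the anti-correlation screen described in the record above, assuming expression profiles for drug targets and disease genes measured over the same conditions; all names and matrices below are fabricated for illustration, not the authors' pipeline.
```python
# Flag drug-target / disease-gene pairs whose expression profiles are
# strongly anti-correlated across shared conditions.
import numpy as np

rng = np.random.default_rng(0)
n_cond = 50                                           # shared conditions
target_expr = {"TARGET1": rng.normal(size=n_cond)}    # hypothetical drug target
disease_expr = {
    "DISEASE_GENE_A": rng.normal(size=n_cond),                        # unrelated
    "DISEASE_GENE_B": -target_expr["TARGET1"] + 0.2 * rng.normal(size=n_cond),
}

for target, t in target_expr.items():
    for gene, g in disease_expr.items():
        r = np.corrcoef(t, g)[0, 1]
        if r < -0.8:  # conserved anti-correlation -> candidate drug-disease link
            print(f"candidate: {target} vs {gene} (r = {r:.2f})")
```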
  • 4
    Publication Date: 2013-10-04
    Description: Background: RNAi screening is a powerful method to study the genetics of intracellular processes in metazoans. Technically, the approach has been largely inspired by techniques and tools developed for compound screening, including those for data analysis. However, by contrast with compounds, RNAi-inducing agents can be linked to a large body of gene-centric, publicly available data. Yet the currently available software applications to analyze RNAi screen data usually lack the ability to visualize associated gene information in an interactive fashion. Results: Here, we present ScreenSifter, an open-source desktop application developed to facilitate the storage, statistical analysis, and rapid, intuitive biological mining of RNAi screening datasets. The interface facilitates meta-data acquisition and long-term safe storage, while the graphical user interface helps define a hit list and visualize biological modules among the hits, through Gene Ontology and protein-protein interaction analyses. The application also allows the visualization of screen-to-screen comparisons. Conclusions: Our software package, ScreenSifter, can accelerate and facilitate screen data analysis and enable discovery by providing unique biological data visualization capabilities.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 5
    Publication Date: 2013-10-05
    Description: Background: Fundamental cellular processes such as cell movement, division or food uptake critically depend on cells being able to change shape. Fast acquisition of three-dimensional image time series has now become possible, but we lack efficient tools for analysing shape deformations in order to understand the real three-dimensional nature of shape changes. Results: We present a framework for 3D+time cell shape analysis. The main contribution is three-fold: First, we develop a fast, automatic random walker method for cell segmentation. Second, a novel topology fixing method is proposed to fix segmented binary volumes without spherical topology. Third, we show that algorithms used for each individual step of the analysis pipeline (cell segmentation, topology fixing, spherical parameterization, and shape representation) are closely related to the Laplacian operator. The framework is applied to the shape analysis of neutrophil cells. Conclusions: The method we propose for cell segmentation is faster than the traditional random walker method or the level set method, and performs better on 3D time-series of neutrophil cells, which are comparatively noisy as stacks have to be acquired fast enough to account for cell motion. Our method for topology fixing outperforms the tools provided by SPHARM-MAT and SPHARM-PDM in terms of their successful fixing rates. The different tasks in the presented pipeline for 3D+time shape analysis of cells can be solved using Laplacian approaches, opening the possibility of eventually combining individual steps in order to speed up computations.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 6
    Publication Date: 2013-10-05
    Description: Background: In recent years, high-throughput microscopy has emerged as a powerful tool to analyze cellular dynamics in an unprecedentedly highly resolved manner. The amount of data that is generated, for example in long-term time-lapse microscopy experiments, requires automated methods for processing and analysis. Available software frameworks are well suited for high-throughput processing of fluorescence images, but they often do not perform well on bright field image data that varies considerably between laboratories, setups, and even single experiments. Results: In this contribution, we present a fully automated image processing pipeline that is able to robustly segment and analyze cells with ellipsoid morphology from bright field microscopy in a high-throughput, yet time-efficient manner. The pipeline comprises two steps: (i) Image acquisition is adjusted to obtain optimal bright field image quality for automatic processing. (ii) A concatenation of fast-performing image processing algorithms robustly identifies single cells in each image. We applied the method to a time-lapse movie consisting of ~315,000 images of differentiating hematopoietic stem cells over 6 days. We evaluated the accuracy of our method by comparing the number of identified cells with manual counts. Our method is able to segment images with varying cell density and different cell types without parameter adjustment and clearly outperforms a standard approach. By computing population doubling times, we were able to identify three growth phases in the stem cell population throughout the whole movie, and validated our result with cell cycle times from single cell tracking. Conclusions: Our method allows fully automated processing and analysis of high-throughput bright field microscopy data. The robustness of cell detection and fast computation time will support the analysis of high-content screening experiments, on-line analysis of time-lapse experiments as well as development of methods to automatically track single-cell genealogies.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 7
    Publication Date: 2013-10-05
    Description: Background: For the analysis of spatio-temporal dynamics, various automated processing methods have been developed for nuclei segmentation. These methods tend to be complex for segmentation of images with crowded nuclei, preventing the simple reapplication of the methods to other problems. Thus, it is useful to evaluate the ability of simple methods to segment images with various degrees of crowded nuclei. Results: Here, we selected six simple methods from among the watershed-based and local-maxima-detection-based methods that are frequently used for nuclei segmentation, and evaluated their segmentation accuracy for each developmental stage of Caenorhabditis elegans. We included a 4D noise filter, in addition to 2D and 3D noise filters, as a pre-processing step to evaluate the potential of simple methods as widely as possible. By applying the methods to image data between the 50- and 500-cell developmental stages at 50-cell intervals, the error rate for nuclei detection could be reduced to
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
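For readers who want to try one of the "simple methods" this record evaluates, a generic local-maxima-plus-watershed segmentation with scikit-image looks roughly like the sketch below; it is not the paper's code, and `img` is assumed to be a 2D grayscale nuclei image already loaded as a NumPy array.
```python
# Generic local-maxima + watershed nuclei segmentation (scikit-image, scipy).
import numpy as np
from scipy import ndimage as ndi
from skimage.filters import threshold_otsu, gaussian
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def segment_nuclei(img: np.ndarray, min_sep: int = 5) -> np.ndarray:
    smoothed = gaussian(img, sigma=2)               # 2D noise filter
    mask = smoothed > threshold_otsu(smoothed)      # foreground mask
    dist = ndi.distance_transform_edt(mask)         # distance to background
    peaks = peak_local_max(dist, min_distance=min_sep, labels=mask)
    markers = np.zeros(img.shape, dtype=int)        # one seed per detected peak
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
    return watershed(-dist, markers, mask=mask)     # label image of nuclei

# labels = segment_nuclei(img); labels.max() gives the nuclei count.
```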
  • 8
    Publication Date: 2013-10-05
    Description: We briefly identify several critical issues in current computational neuroscience, and present our opinions on potential solutions based on bioimage informatics, especially automated image computing.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 9
    Publication Date: 2013-10-05
    Description: Background: Gene perturbation experiments in combination with fluorescence time-lapse cell imaging are a powerful tool in reverse genetics. High content applications require tools for the automated processing of the large amounts of data. These tools include in general several image processing steps, the extraction of morphological descriptors, and the grouping of cells into phenotype classes according to their descriptors. This phenotyping can be applied in a supervised or an unsupervised manner. Unsupervised methods are suitable for the discovery of formerly unknown phenotypes, which are expected to occur in high-throughput RNAi time-lapse screens. Results: We developed an unsupervised phenotyping approach based on Hidden Markov Models (HMMs) with multivariate Gaussian emissions for the detection of knockdown-specific phenotypes in RNAi time-lapse movies. The automated detection of abnormal cell morphologies allows us to assign a phenotypic fingerprint to each gene knockdown. By applying our method to the Mitocheck database, we show that a phenotypic fingerprint is indicative of a gene's function. Conclusion: Our fully unsupervised HMM-based phenotyping is able to automatically identify cell morphologies that are specific for a certain knockdown. Beyond the identification of genes whose knockdown affects cell morphology, phenotypic fingerprints can be used to find modules of functionally related genes.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
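The core model in the record above, an HMM with multivariate Gaussian emissions over per-cell morphology tracks, can be prototyped with hmmlearn; the sketch below uses fabricated data and is not the authors' implementation.
```python
# HMM phenotyping sketch: hidden states stand in for morphology classes,
# and per-knockdown state occupancy forms a simple "phenotypic fingerprint".
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(1)
# Fake data: 20 cells, each a (time, features) morphology-descriptor track.
tracks = [rng.normal(size=(40, 5)) for _ in range(20)]
X = np.concatenate(tracks)              # stacked observations
lengths = [len(t) for t in tracks]      # per-sequence lengths

hmm = GaussianHMM(n_components=4, covariance_type="full", n_iter=50)
hmm.fit(X, lengths)

# Fraction of time spent in each morphology state = a toy fingerprint.
states = hmm.predict(X, lengths)
fingerprint = np.bincount(states, minlength=4) / len(states)
print(fingerprint)
```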
  • 10
    Publication Date: 2013-10-05
    Description: Background: Pattern recognition algorithms are useful in bioimage informatics applications such as quantifying cellular and subcellular objects, annotating gene expressions, and classifying phenotypes. To provide effective and efficient image classification and annotation for the ever-increasing microscopic images, it is desirable to have tools that can combine and compare various algorithms, and build customizable solutions for different biological problems. However, current tools often offer limited solutions for generating user-friendly and extensible tools for annotating higher-dimensional images that correspond to multiple complicated categories. Results: We develop the BIOimage Classification and Annotation Tool (BIOCAT). It is able to apply pattern recognition algorithms to two- and three-dimensional biological image sets as well as regions of interest (ROIs) in individual images for automatic classification and annotation. We also propose a 3D anisotropic wavelet feature extractor for extracting textural features from 3D images with xy-z resolution disparity. The extractor is one of the about 20 built-in algorithms of feature extractors, selectors and classifiers in BIOCAT. The algorithms are modularized so that they can be "chained" in a customizable way to form adaptive solutions for various problems, and the plugin-based extensibility gives the tool an open architecture to incorporate future algorithms. We have applied BIOCAT to classification and annotation of images and ROIs of different properties with applications in cell biology and neuroscience. Conclusions: BIOCAT provides a user-friendly, portable platform for pattern recognition based biological image classification of two- and three-dimensional images and ROIs. We show, via diverse case studies, that different algorithms and their combinations have different suitability for various problems. The customizability of BIOCAT is thus expected to be useful for providing effective and efficient solutions for a variety of biological problems involving image classification and annotation. We also demonstrate the effectiveness of 3D anisotropic wavelet in classifying both 3D image sets and ROIs.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 11
    Publication Date: 2013-09-18
    Description: Background: Pattern recognition receptors of the immune system have key roles in the regulation of pathways after the recognition of microbial- and danger-associated molecular patterns in vertebrates. Members of the NOD-like receptor (NLR) family typically function intracellularly. The NOD-like receptor family CARD domain containing 5 (NLRC5) is the largest member of this family and also contains the largest number of leucine-rich repeats (LRRs). Due to the lack of crystal structures of full-length NLRs, projects have been initiated with the aim of modeling certain or all members of the family, but systematic studies did not model the full-length NLRC5 due to its unique domain architecture. Our aim was to analyze the LRR sequences of NLRC5 and some NLRC5-related proteins and to build a model for the full-length human NLRC5 by homology modeling. Results: LRR sequences of NLRC5 were aligned and compared with the consensus pattern of the ribonuclease inhibitor protein (RI)-like LRR subfamily. Two types of alternating consensus patterns previously identified for RI repeats were also found in NLRC5. A homology model for full-length human NLRC5 was prepared and, besides the closed conformation of monomeric NLRC5, a heptameric platform was also modeled for NLRC5 monomers in the open conformation. Conclusions: Identification of consensus patterns of leucine-rich repeat sequences helped to identify LRRs in NLRC5 and to predict their number and position within the protein. In spite of the lack of fully adequate template structures, the presence of an atypical CARD domain, and the unusually high number of LRRs in NLRC5, we were able to construct a homology model for both the monomeric and homo-heptameric full-length human NLRC5 protein.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 12
    Publication Date: 2013-09-18
    Description: Background: Many Single Nucleotide Polymorphism (SNP) calling programs have been developed to identify Single Nucleotide Variations (SNVs) in next-generation sequencing (NGS) data. However, low sequencing coverage presents challenges to accurate SNV identification, especially in single-sample data. Moreover, commonly used SNP calling programs usually include several metrics in their output files for each potential SNP. These metrics are highly correlated in complex patterns, making it extremely difficult to select SNPs for further experimental validations. Results: To explore solutions to the above challenges, we compare the performance of four SNP calling algorithms, SOAPsnp, Atlas-SNP2, SAMtools, and GATK, in a low-coverage single-sample sequencing dataset. Without any post-output filtering, SOAPsnp calls more SNVs than the other programs since it has fewer internal filtering criteria. Atlas-SNP2 has stringent internal filtering criteria; thus it reports the fewest SNVs. The numbers of SNVs called by GATK and SAMtools fall between those of SOAPsnp and Atlas-SNP2. Moreover, we explore the values of key metrics related to SNV quality in each algorithm and use them as post-output filtering criteria to filter out low quality SNVs. Under different coverage cutoff values, we compare the four algorithms and calculate the empirical positive calling rate and sensitivity. Our results show that: 1) the overall agreement of the four calling algorithms is low, especially in non-dbSNPs; 2) the agreement of the four algorithms is similar when using different coverage cutoffs, except that the non-dbSNPs agreement level tends to increase slightly with increasing coverage; 3) SOAPsnp, SAMtools, and GATK have a higher empirical calling rate for dbSNPs compared to non-dbSNPs; and 4) overall, GATK and Atlas-SNP2 have a relatively higher positive calling rate and sensitivity, but GATK calls more SNVs. Conclusions: Our results show that the agreement between different calling algorithms is relatively low. Thus, more caution should be used in choosing algorithms, setting filtering parameters, and designing validation studies. For reliable SNV calling results, we recommend that users employ more than one algorithm and use metrics related to calling quality and coverage as filtering criteria.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
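At its simplest, the agreement analysis this record describes reduces to set arithmetic over normalized variant keys; a toy illustration with invented calls follows.
```python
# Compare call sets from multiple SNV callers via (chrom, pos, alt) keys.
callers = {
    "SOAPsnp":    {("chr1", 1001, "T"), ("chr1", 2050, "G"), ("chr2", 77, "A")},
    "Atlas-SNP2": {("chr1", 1001, "T")},
    "SAMtools":   {("chr1", 1001, "T"), ("chr2", 77, "A")},
    "GATK":       {("chr1", 1001, "T"), ("chr1", 2050, "G")},
}

consensus = set.intersection(*callers.values())
union = set.union(*callers.values())
print(f"called by all 4: {len(consensus)} / {len(union)} "
      f"({len(consensus) / len(union):.0%} overall agreement)")

for name, calls in callers.items():
    private = calls - set.union(*(c for n, c in callers.items() if n != name))
    print(name, "private calls:", private)
```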
  • 13
    Publication Date: 2013-09-27
    Description: Background: Ontologies and catalogs of gene functions, such as the Gene Ontology (GO) and MIPS-FUN, assume that functional classes are organized hierarchically, that is, general functions include more specific ones. This has recently motivated the development of several machine learning algorithms for gene function prediction that leverage this hierarchical organization, where instances may belong to multiple classes. In addition, it is possible to exploit relationships among examples, since it is plausible that related genes tend to share functional annotations. Although these relationships have been identified and extensively studied in the area of protein-protein interaction (PPI) networks, they have not received much attention in hierarchical and multi-class gene function prediction. Relations between genes introduce autocorrelation in functional annotations and violate the assumption that instances are independently and identically distributed (i.i.d.), which underlies most machine learning algorithms. Although the explicit consideration of these relations brings additional complexity to the learning process, we expect substantial benefits in predictive accuracy of learned classifiers. Results: This article demonstrates the benefits (in terms of predictive accuracy) of considering autocorrelation in multi-class gene function prediction. We develop a tree-based algorithm for considering network autocorrelation in the setting of Hierarchical Multi-label Classification (HMC). We empirically evaluate the proposed algorithm, called NHMC (Network Hierarchical Multi-label Classification), on 12 yeast datasets using each of the MIPS-FUN and GO annotation schemes and exploiting 2 different PPI networks. The results clearly show that taking autocorrelation into account improves the predictive performance of the learned models for predicting gene function. Conclusions: Our newly developed method for HMC takes into account network information in the learning phase: when used for gene function prediction in the context of PPI networks, the explicit consideration of network autocorrelation increases the predictive performance of the learned models. Overall, we found that this holds for different gene features/descriptions, functional annotation schemes, and PPI networks: best results are achieved when the PPI network is dense and contains a large proportion of function-relevant interactions.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 14
    Publication Date: 2013-10-02
    Description: Background: Protein-protein docking, which aims to predict the structure of a protein-protein complex from its unbound components, remains an unresolved challenge in structural bioinformatics. An important step is the ranking of docked poses using a scoring function, for which many methods have been developed. There is a need to explore the differences and commonalities of these methods with each other, as well as with functions developed in the fields of molecular dynamics and homology modelling. Results: We present an evaluation of 115 scoring functions on an unbound docking decoy benchmark covering 118 complexes for which a near-native solution can be found, yielding top 10 success rates of up to 58%. Hierarchical clustering is performed, so as to group together functions which identify near-natives in similar subsets of complexes. Three set theoretic approaches are used to identify pairs of scoring functions capable of correctly scoring different complexes. This shows that functions in different clusters capture different aspects of binding and are likely to work together synergistically. Conclusions: All functions designed specifically for docking perform well, indicating that functions are transferable between sampling methods. We also identify promising methods from the field of homology modelling. Further, differential success rates by docking difficulty and solution quality suggest a need for flexibility-dependent scoring. Investigating pairs of scoring functions, the set theoretic measures identify known scoring strategies as well as a number of novel approaches, indicating promising augmentations of traditional scoring methods. Such augmentation and parameter combination strategies are discussed in the context of the learning-to-rank paradigm.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
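Grouping scoring functions by which complexes they solve, as in the hierarchical clustering described above, can be sketched with SciPy; the binary success data below is illustrative, not the benchmark's.
```python
# Cluster scoring functions by the similarity of their "success sets":
# Jaccard distance between binary success vectors + average linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
n_funcs, n_complexes = 10, 118
success = rng.random((n_funcs, n_complexes)) < 0.4  # True = near-native in top 10

dist = pdist(success, metric="jaccard")             # dissimilarity of success sets
tree = linkage(dist, method="average")
clusters = fcluster(tree, t=3, criterion="maxclust")
print(clusters)  # cluster label per scoring function

# Functions from different clusters succeed on different complexes, so
# pairing them is a natural candidate for synergistic combination.
```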
  • 15
    Publication Date: 2013-10-03
    Description: Background: A number of different statistics are used for detecting natural selection using DNA sequencing data, including statistics that are summaries of the frequency spectrum, such as Tajima's D. These statistics are now often being applied in the analysis of Next Generation Sequencing (NGS) data. However, estimates of frequency spectra from NGS data are strongly affected by low sequencing coverage; the inherent technology dependent variation in sequencing depth causes systematic differences in the value of the statistic among genomic regions. Results: We have developed an approach that accommodates the uncertainty of the data when calculating site frequency based neutrality test statistics. A salient feature of this approach is that it implicitly solves the problems of varying sequencing depth, missing data and avoids the need to infer variable sites for the analysis and thereby avoids ascertainment problems introduced by a SNP discovery process. Conclusion: Using an empirical Bayes approach for fast computations, we show that this method produces results for low-coverage NGS data comparable to those achieved when the genotypes are known without uncertainty. We also validate the method in an analysis of data from the 1000 genomes project. The method is implemented in a fast framework which enables researchers to perform these neutrality tests on a genome-wide scale.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
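For reference, the statistic being generalized in the record above, Tajima's D, is computed from known genotypes as in the textbook sketch below; the paper's point is precisely that these counts are uncertain in low-coverage NGS data, so the statistic must be computed while accommodating that uncertainty.
```python
# Standard Tajima's D from biallelic sites with known genotypes.
def tajimas_d(n: int, derived_counts: list[int]) -> float:
    """n = number of sampled chromosomes; derived_counts = per-site count
    of the derived (or minor) allele among the n samples."""
    S = sum(1 for c in derived_counts if 0 < c < n)   # segregating sites
    pi = sum(2 * c * (n - c) / (n * (n - 1)) for c in derived_counts)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1, e2 = c1 / a1, c2 / (a1**2 + a2)
    var = e1 * S + e2 * S * (S - 1)
    return (pi - S / a1) / var**0.5

print(tajimas_d(10, [1, 4, 5, 9, 2, 7]))
```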
  • 16
    Publication Date: 2013-10-03
    Description: Background: Dendritic spines serve as key computational structures in brain plasticity. Much remains to be learned about their spatial and temporal distribution among neurons. Our aim in this study was to perform exploratory analyses based on the population distributions of dendritic spines with regard to their morphological characteristics and period of growth in dissociated hippocampal neurons. We fit a log-linear model to the contingency table of spine features such as spine type and distance from the soma to first determine which features were important in modeling the spines, as well as the relationships between such features. A multinomial logistic regression was then used to predict the spine types using the features suggested by the log-linear model, along with neighboring spine information. Finally, an important variant of Ripley's K-function applicable to linear networks was used to study the spatial distribution of spines along dendrites. Results: Our study indicated that in the culture system, (i) dendritic spine densities were "completely spatially random", (ii) spine type and distance from the soma were independent quantities, and most importantly, (iii) spines had a tendency to cluster with other spines of the same type. Conclusions: Although these results may vary with other systems, our primary contribution is the set of statistical tools for morphological modeling of spines which can be used to assess neuronal cultures following gene manipulation such as RNAi, and to study induced pluripotent stem cells differentiated to neurons.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 17
    Publication Date: 2013-10-05
    Description: Background: Molecular recognition features (MoRFs) are short binding regions located in longer intrinsically disordered protein regions. Although these short regions lack a stable structure in the natural state, they readily undergo disorder-to-order transitions upon binding to their partner molecules. MoRFs play critical roles in the molecular interaction network of a cell, and are associated with many human genetic diseases. Therefore, identification of MoRFs is an important step in understanding functional aspects of these proteins and in finding applications in drug design. Results: Here, we propose a novel method for identifying MoRFs, named MFSPSSMpred (Masked, Filtered and Smoothed Position-Specific Scoring Matrix-based Predictor). Firstly, a masking method is used to calculate the average local conservation scores of residues within a masking-window length in the position-specific scoring matrix (PSSM). Then, the scores below the average are filtered out. Finally, a smoothing method is used to incorporate the features of flanking regions for each residue to prepare the feature sets for prediction. Our method employs no predicted results from other classifiers as input, i.e., all features used in this method are extracted from the sequence's PSSM only. Experimental results show that, compared with other methods tested on the same datasets, our method achieves the best performance: achieving 0.004~0.079 higher AUC than other methods when tested on TEST419, and achieving 0.045~0.212 higher AUC than other methods when tested on TEST2012. In addition, when tested on an independent membrane proteins-related dataset, MFSPSSMpred significantly outperformed the existing predictor MoRFpred. Conclusions: This study suggests that: 1) amino acid composition and physicochemical properties in the flanking regions of MoRFs are very different from those in the general non-MoRF regions; 2) MoRFs contain both highly conserved residues and highly variable residues and, on the whole, are highly locally conserved; and 3) combining contextual information with local conservation information of residues facilitates the prediction of MoRFs.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
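The mask-filter-smooth feature preparation described in the record above can be sketched in a few lines of NumPy; here it is applied to a toy per-residue conservation vector derived from a PSSM, and the window sizes are arbitrary, not the paper's tuned values.
```python
# Mask -> filter -> smooth over a per-residue conservation-score vector.
import numpy as np

def mfs_features(scores: np.ndarray, mask_w: int = 5, smooth_w: int = 3):
    # 1) Mask: average local conservation in a sliding window.
    kernel = np.ones(mask_w) / mask_w
    local_avg = np.convolve(scores, kernel, mode="same")
    # 2) Filter: zero out residues scoring below their local average.
    filtered = np.where(scores >= local_avg, scores, 0.0)
    # 3) Smooth: blend each residue with its flanking context.
    kernel2 = np.ones(smooth_w) / smooth_w
    return np.convolve(filtered, kernel2, mode="same")

scores = np.array([0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.9, 0.3])
print(mfs_features(scores))
```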
  • 18
    Publication Date: 2013-10-05
    Description: Background: Comprehensive protein-protein interaction (PPI) maps are a powerful resource for uncovering the molecular basis of genetic interactions and providing mechanistic insights. Over the past decade, high-throughput experimental techniques have been developed to generate PPI maps at proteome scale, first using yeast two-hybrid approaches and more recently via affinity purification combined with mass spectrometry (AP-MS). Unfortunately, data from both protocols are prone to both high false positive and false negative rates. To address these issues, many methods have been developed to post-process raw PPI data. However, with few exceptions, these methods only analyze binary experimental data (in which each potential interaction tested is deemed either observed or unobserved), neglecting quantitative information available from AP-MS such as spectral counts. Results: We propose a novel method for incorporating quantitative information from AP-MS data into existing PPI inference methods that analyze binary interaction data. Our approach introduces a probabilistic framework that models the statistical noise inherent in observations of co-purifications. Using a sampling-based approach, we model the uncertainty of interactions with low spectral counts by generating an ensemble of possible alternative experimental outcomes. We then apply the existing method of choice to each alternative outcome and aggregate results over the ensemble. We validate our approach on three recent AP-MS data sets and demonstrate performance comparable to or better than state-of-the-art methods. Additionally, we provide an in-depth discussion comparing the theoretical bases of existing approaches and identify common aspects that may be key to their performance. Conclusions: Our sampling framework extends the existing body of work on PPI analysis using binary interaction data to apply to the richer quantitative data now commonly available through AP-MS assays. This framework is quite general, and many enhancements are likely possible. Fruitful future directions may include investigating more sophisticated schemes for converting spectral counts to probabilities and applying the framework to direct protein complex prediction methods.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
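The ensemble idea in the record above, resampling uncertain low-count interactions and aggregating downstream results, can be sketched as follows; the count-to-probability map is a simple saturating function chosen only for illustration, not the authors' model.
```python
# Resample binary "observed / not observed" outcomes from spectral counts
# and aggregate a downstream analysis over the ensemble.
import numpy as np

rng = np.random.default_rng(3)
spectral_counts = {("bait1", "preyA"): 25, ("bait1", "preyB"): 2,
                   ("bait2", "preyA"): 1}

def p_true(count: int, k: float = 5.0) -> float:
    return count / (count + k)   # saturates toward 1 for large counts

def sample_outcome():
    return {pair: rng.random() < p_true(c) for pair, c in spectral_counts.items()}

# Any binary-input PPI method could be applied per sampled outcome; here we
# just estimate how often each interaction survives resampling.
n = 1000
freq = {pair: 0 for pair in spectral_counts}
for _ in range(n):
    for pair, present in sample_outcome().items():
        freq[pair] += present
print({pair: f / n for pair, f in freq.items()})
```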
  • 19
    Publication Date: 2013-10-05
    Description: Background: Gene expression in the Drosophila embryo is controlled by functional interactions between a large network of protein transcription factors (TFs) and specific sequences in DNA cis-regulatory modules (CRMs). The binding site sequences for any TF can be experimentally determined and represented in a position weight matrix (PWM). PWMs can then be used to predict the location of TF binding sites in other regions of the genome, although there are limitations to this approach as currently implemented. Results: In this proof-of-principle study, we analyze 127 CRMs and focus on four TFs that control transcription of target genes along the anterior-posterior axis of the embryo early in development. For all four of these TFs, there is some degree of conserved flanking sequence that extends beyond the predicted binding regions. A potential role for these conserved flanking sequences may be to enhance the specificity of TF binding, as the abundance of these sequences is greatly diminished when we examine only predicted high-affinity binding sites. Conclusions: Expanding PWMs to include sequence context-dependence will increase the information content in PWMs and facilitate a more efficient functional identification and dissection of CRMs.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
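The PWM machinery the record above builds on is the standard log-odds scan; below is a self-contained sketch with an invented 4-position matrix (not a real Drosophila TF motif).
```python
# Textbook PWM scan: log-odds score of every window against a uniform background.
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}
pwm = np.log2(np.array([   # rows A,C,G,T; columns = motif positions
    [0.7, 0.1, 0.1, 0.6],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.7, 0.1, 0.2],
    [0.1, 0.1, 0.1, 0.1]]) / 0.25)

def scan(seq: str, threshold: float = 4.0):
    w = pwm.shape[1]
    for i in range(len(seq) - w + 1):
        score = sum(pwm[BASES[b], j] for j, b in enumerate(seq[i:i + w]))
        if score >= threshold:
            yield i, score   # predicted binding site start and score

print(list(scan("TTACGATTAGCAT")))  # reports the AGCA match at position 8
```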
  • 20
    Publication Date: 2013-10-05
    Description: Background: Segmenting electron microscopy (EM) images of cellular and subcellular processes in the nervous system is a key step in many bioimaging pipelines involving classification and labeling of ultrastructures. However, fully automated techniques to segment images are often susceptible to noise and heterogeneity in EM images (e.g. different histological preparations, different organisms, different brain regions, etc.). Supervised techniques to address this problem are often helpful but require large sets of training data, which are often difficult to obtain in practice, especially across many conditions. Results: We propose a new, principled unsupervised algorithm to segment EM images using a two-step approach: edge detection via salient watersheds followed by robust region merging. We performed experiments to gather EM neuroimages of two organisms (mouse and fruit fly) using different histological preparations and generated manually curated ground-truth segmentations. We compared our algorithm against several state-of-the-art unsupervised segmentation algorithms and found superior performance using two standard measures of under- and over-segmentation error. Conclusions: Our algorithm is general and may be applicable to other large-scale segmentation problems for bioimages.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 21
    Publication Date: 2013-06-07
    Description: Background: The use of the knowledge produced by sciences to promote human health is the main goal of translational medicine. To make this feasible, we need computational methods to handle the large amount of information that arises from bench to bedside and to deal with its heterogeneity. A computational challenge that must be faced is to promote the integration of clinical, socio-demographic and biological data. In this effort, ontologies play an essential role as a powerful artifact for knowledge representation. Chado is a modular ontology-oriented database model that gained popularity due to its robustness and flexibility as a generic platform to store biological data; however it lacks support for representing clinical and socio-demographic information. Results: We have implemented an extension of Chado -- the Clinical Module -- to allow the representation of this kind of information. Our approach consists of a framework for data integration through the use of a common reference ontology. The design of this framework has four levels: data level, to store the data; semantic level, to integrate and standardize the data by the use of ontologies; application level, to manage clinical databases, ontologies and data integration process; and web interface level, to allow interaction between the user and the system. The clinical module was built based on the Entity-Attribute-Value (EAV) model. We also proposed a methodology to migrate data from legacy clinical databases to the integrative framework. A Chado instance was initialized using a relational database management system. The Clinical Module was implemented and the framework was loaded using data from a factual clinical research database. Clinical and demographic data as well as biomaterial data were obtained from patients with head and neck tumors. We implemented the IPTrans tool, which is a complete environment for data migration, comprising: the construction of a model to describe the legacy clinical data, based on an ontology; the Extraction, Transformation and Load (ETL) process to extract the data from the source clinical database and load it in the Clinical Module of Chado; and the development of a web tool and a Bridge Layer to adapt the web tool to Chado, as well as other applications. Conclusions: Open-source computational solutions currently available for translational science do not have a model to represent biomolecular information and are not integrated with existing bioinformatics tools. On the other hand, existing genomic data models do not represent clinical patient data. A framework was developed to support translational research by integrating biomolecular information coming from different "omics" technologies with patients' clinical and socio-demographic data. This framework should present some features: flexibility, compression and robustness. The experiments performed on a use case demonstrated that the proposed system meets requirements of flexibility and robustness, leading to the desired integration. The Clinical Module can be accessed at http://dcm.ffclrp.usp.br/caib/pg=iptrans.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 22
    Publication Date: 2013-06-07
    Description: Background: A large-scale, highly accurate, machine-understandable drug-disease treatment relationship knowledge base is important for computational approaches to drug repurposing. The large body of published biomedical research articles and clinical case reports available on MEDLINE is a rich source of FDA-approved drug-disease indications as well as drug-repurposing knowledge that is crucial for applying FDA-approved drugs to new diseases. However, much of this information is buried in free text and not captured in any existing databases. The goal of this study is to extract a large number of accurate drug-disease treatment pairs from published literature. Results: In this study, we developed a simple but highly accurate pattern-learning approach to extract treatment-specific drug-disease pairs from 20 million biomedical abstracts available on MEDLINE. We extracted a total of 34,305 unique drug-disease treatment pairs, the majority of which are not included in existing structured databases. Our algorithm achieved a precision of 0.904 and a recall of 0.131 in extracting all pairs, and a precision of 0.904 and a recall of 0.842 in extracting frequent pairs. In addition, we have shown that the extracted pairs strongly correlate with both drug target genes and therapeutic classes, and therefore may have high potential in drug discovery. Conclusions: We demonstrated that our simple pattern-learning relationship extraction algorithm is able to accurately extract many drug-disease pairs from the free text of biomedical literature that are not captured in structured databases. The large-scale, accurate, machine-understandable drug-disease treatment knowledge base resulting from our study, in combination with pairs from structured databases, will have high potential in computational drug repurposing tasks.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
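A deliberately tiny, regex-based flavor of the pattern approach described in the record above; the real system learns patterns over 20 million MEDLINE abstracts, while the seed pattern and abstracts here are invented.
```python
# Extract (drug, disease) pairs with one hand-written treatment pattern.
import re

pattern = re.compile(
    r"\b(?P<drug>\w+)\s+(?:in|for)\s+the\s+treatment\s+of\s+(?P<disease>[\w\s]+?)[.,]")

abstracts = [
    "We evaluated methotrexate in the treatment of rheumatoid arthritis.",
    "Aspirin for the treatment of fever, revisited.",
]

pairs = set()
for text in abstracts:
    for m in pattern.finditer(text):
        pairs.add((m.group("drug").lower(), m.group("disease").strip().lower()))
print(pairs)
# {('methotrexate', 'rheumatoid arthritis'), ('aspirin', 'fever')}
```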
  • 23
    Publication Date: 2013-06-08
    Description: Background: Normal Mode Analysis is one of the most successful techniques for studying motions in proteins and macromolecules. It can provide information on the mechanisms of protein function, be used to aid crystallography and NMR data reconstruction, and be used to calculate protein free energies. Results: ΔΔPT is a toolbox allowing the calculation of elastic network models and principal component analysis. It allows the analysis of PDB files or trajectories taken from Gromacs, Amber, and DL_POLY. As well as calculating the normal modes, it also allows comparison of the modes with experimental protein motion, variation of modes with mutation or ligand binding, and calculation of molecular dynamic entropies. Conclusions: This toolbox makes the respective tools available to a wide community of potential NMA users, and gives them an unrivalled ability to analyse normal modes using a variety of techniques and current software.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 24
    Publication Date: 2013-06-08
    Description: Background: Data visualization is critical for interpreting biological data. However, in practice it can prove to be a bottleneck for non-trained researchers; this is especially true for three-dimensional (3D) data representation. Whilst existing software can provide all necessary functionalities to represent and manipulate biological 3D datasets, very few are easily accessible (browser based), cross-platform and accessible to non-expert users. Results: An online HTML5/WebGL based 3D visualisation tool has been developed to allow biologists to quickly and easily view interactive and customizable three-dimensional representations of their data along with multiple layers of information. Using the WebGL library Three.js written in Javascript, bioWeb3D allows the simultaneous visualisation of multiple large datasets inputted via a simple JSON, XML or CSV file, which can be read and analysed locally thanks to HTML5 capabilities. Conclusions: Using basic 3D representation techniques in a technologically innovative context, we provide a program that is not intended to compete with professional 3D representation software, but that instead enables a quick and intuitive representation of reasonably large 3D datasets.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 25
    Publication Date: 2013-06-08
    Description: Background: Graph-based notions are increasingly used in biomedical data mining and knowledge discovery tasks. In this paper, we present a clique-clustering method to automatically summarize graphs of semantic predications produced from PubMed citations (titles and abstracts). Results: SemRep is used to extract semantic predications from the citations returned by a PubMed search. Cliques were identified from frequently occurring predications with highly connected arguments filtered by degree centrality. Themes contained in the summary were identified with a hierarchical clustering algorithm based on common arguments shared among cliques. The validity of the clusters in the summaries produced was compared to the Silhouette-generated baseline for cohesion, separation and overall validity. The theme labels were also compared to a reference standard produced with major MeSH headings. Conclusions: For 11 topics in the testing data set, the overall validity of clusters from the system summary was 10% better than the baseline (43% versus 33%). When compared to the reference standard from MeSH headings, the results for recall, precision and F-score were 0.64, 0.65, and 0.65, respectively.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
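The graph side of the summarization pipeline above, cliques over predication arguments filtered by degree centrality, sketches easily with networkx; the predications below are invented stand-ins for SemRep output.
```python
# Build an argument co-occurrence graph, then keep well-connected cliques.
import networkx as nx

predications = [("aspirin", "TREATS", "headache"),
                ("aspirin", "INHIBITS", "COX-2"),
                ("ibuprofen", "INHIBITS", "COX-2"),
                ("ibuprofen", "TREATS", "headache")]

G = nx.Graph()
for subj, _, obj in predications:
    G.add_edge(subj, obj)

centrality = nx.degree_centrality(G)
cliques = [c for c in nx.find_cliques(G)
           if all(centrality[n] > 0.3 for n in c)]  # degree-centrality filter
print(cliques)
```
A full pipeline would then cluster these cliques by shared arguments to recover the summary's themes.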
  • 26
    Publication Date: 2013-06-11
    Description: Background: Somatic mutation-calling based on DNA from matched tumor-normal patient samples is one of the key tasks carried out by many cancer genome projects. One such large-scale project is The Cancer Genome Atlas (TCGA), which is now routinely compiling catalogs of somatic mutations from hundreds of paired tumor-normal DNA exome-sequence data. Nonetheless, mutation calling is still very challenging. TCGA benchmark studies revealed that even relatively recent mutation callers from major centers showed substantial discrepancies. Evaluation of the mutation callers or understanding the sources of discrepancies is not straightforward, since for most tumor studies, validation data based on independent whole-exome DNA sequencing is not available; only partial validation data exist for a selected (ascertained) subset of sites. Results: To provide guidelines for comparing outputs from multiple callers, we have analyzed two sets of mutation-calling data from the TCGA benchmark studies and their partial validation data. Various aspects of the mutation-calling outputs were explored to characterize the discrepancies in detail. To assess the performances of multiple callers, we introduce four approaches utilizing the external sequence data to varying degrees, ranging from having independent DNA-seq pairs, RNA-seq for tumor samples only, the original exome-seq pairs only, or none of those. Conclusions: Our analyses provide guidelines for visualizing and understanding the discrepancies among the outputs from multiple callers. Furthermore, applying the four evaluation approaches to the whole exome data, we illustrate the challenges and highlight the various circumstances that require extra caution in assessing the performances of multiple callers.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 27
    Publication Date: 2013-06-11
    Description: Background: ChIPx (i.e., ChIP-seq and ChIP-chip) is increasingly used to map genome-wide transcription factor (TF) binding sites. A single ChIPx experiment can identify thousands of TF bound genes, but typically only a fraction of these genes are functional targets that respond transcriptionally to perturbations of TF expression. To identify promising functional target genes for follow-up studies, researchers usually collect gene expression data from TF perturbation experiments to determine which of the TF targets respond transcriptionally to binding. Unfortunately, approximately 40% of ChIPx studies do not have accompanying gene expression data from TF perturbation experiments. For these studies, genes are often prioritized solely based on the binding strengths of ChIPx signals in order to choose follow-up candidates. ChIPXpress is a novel method that improves upon this ChIPx-only ranking approach by integrating ChIPx data with large amounts of Publicly available gene Expression Data (PED). Results: We demonstrate that PED does contain useful information to identify functional TF target genes despite its inherent heterogeneity. A truncated absolute correlation measure is developed to better capture the regulatory relationships between TFs and their target genes in PED. By integrating the information from ChIPx and PED, ChIPXpress can significantly increase the chance of finding functional target genes responsive to TF perturbation among the top ranked genes. ChIPXpress is implemented as an easy-to-use R/Bioconductor package. We evaluate ChIPXpress using 10 different ChIPx datasets in mouse and human and find that ChIPXpress rankings are more accurate than rankings based solely on ChIPx data and may result in substantial improvement in prediction accuracy, irrespective of which peak calling algorithm is used to analyze the ChIPx data. Conclusions: ChIPXpress provides a new tool to better prioritize TF bound genes from ChIPx experiments for follow-up studies when investigators do not have their own gene expression data. It demonstrates that the regulatory information from PED can be used to boost ChIPx data analyses. It also represents an important step towards more fully utilizing the valuable, but highly heterogeneous data contained in public gene expression databases.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 28
    Publication Date: 2013-06-07
    Description: Background: Detailed study of genetic variation at the population level in humans and other species is now possible due to the availability of large sets of single nucleotide polymorphism data. Alleles at two or more loci are said to be in linkage disequilibrium (LD) when they are correlated or statistically dependent. Current efforts to understand the genetic basis of complex phenotypes are based on the existence of such associations, making study of the extent and distribution of linkage disequilibrium central to this endeavour. The objective of this paper is to develop methods to study fine-scale patterns of allelic association using probabilistic graphical models. Results: An efficient, linear-time forward-backward algorithm is developed to estimate chromosome-wide LD models by optimizing a penalized likelihood criterion, and a convenient way to display these models is described. To illustrate the methods they are applied to data obtained by genotyping 8341 pigs. It is found that roughly 20% of the porcine genome exhibits complex LD patterns, forming islands of relatively high genetic diversity. Conclusions: The proposed algorithm is efficient and makes it feasible to estimate and visualize chromosome-wide LD models on a routine basis.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
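The elementary quantity behind the chromosome-wide LD models in the record above is pairwise r²; below is a minimal sketch from phased 0/1 haplotypes (the paper's penalized forward-backward model fitting goes far beyond this).
```python
# Pairwise linkage disequilibrium (r^2) between two biallelic sites.
import numpy as np

def r_squared(h1: np.ndarray, h2: np.ndarray) -> float:
    """h1, h2: 0/1 allele vectors over the same phased haplotypes."""
    pA, pB = h1.mean(), h2.mean()     # allele frequencies at each site
    pAB = (h1 & h2).mean()            # joint haplotype frequency
    D = pAB - pA * pB                 # classical LD coefficient
    return D**2 / (pA * (1 - pA) * pB * (1 - pB))

h1 = np.array([1, 1, 0, 0, 1, 0, 1, 0])
h2 = np.array([1, 1, 0, 0, 1, 0, 1, 1])
print(r_squared(h1, h2))   # ~0.6: these two sites are in moderate LD
```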
  • 29
    Publication Date: 2013-06-07
    Description: Background: Interpretation of gene expression microarray data in the light of external information on both columns and rows (experimental variables and gene annotations) facilitates the extraction of pertinent information hidden in these complex data. Biologists classically interpret genes of interest after retrieving functional information for a subset of those genes. Transcription factors play an important role in orchestrating the regulation of gene expression. Their activity can be deduced by examining the presence of putative transcription factor binding sites in the gene promoter regions. Results: In this paper we present the multivariate statistical method RLQ, which aims to analyze microarray data where additional information is available on both genes and samples. As an illustrative example, we applied RLQ methodology to analyze transcription factor activity associated with the time-course effect of steroids on the growth of primary human lung fibroblasts. RLQ could successfully predict transcription factor activity, and could integrate various other sources of external information in the main frame of the analysis. The approach was validated by means of alternative statistical methods and biological validation. Conclusions: RLQ provides an efficient way of extracting and visualizing structures present in a gene expression dataset by directly modeling the link between experimental variables and gene annotations.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 30
    Publication Date: 2013-06-07
    Description: Background: The development of genotyping arrays containing hundreds of thousands of rare variants across the genome and advances in high-throughput sequencing technologies have made feasible empirical genetic association studies to search for rare disease susceptibility alleles. As single variant testing is underpowered to detect associations, the development of statistical methods to combine analysis across variants -- so-called "burden tests" -- is an area of active research interest. We previously developed a method, the admixture maximum likelihood test, to test multiple, common variants for association with a trait of interest. We have extended this method, called the rare admixture maximum likelihood test (RAML), for the analysis of rare variants. In this paper we compare the performance of RAML with six other burden tests designed to test for association of rare variants. Results: We used simulation testing over a range of scenarios to test the power of RAML compared to the other rare variant association testing methods. These scenarios modelled differences in effect variability, the average direction of effect and the proportion of associated variants. We evaluated the power for all the different scenarios. RAML tended to have the greatest power for most scenarios where the proportion of associated variants was small, whereas SKAT-O performed a little better for the scenarios with a higher proportion of associated variants. Conclusions: The RAML method makes no assumptions about the proportion of variants that are associated with the phenotype of interest or the magnitude and direction of their effect. The method is flexible and can be applied to both dichotomous and quantitative traits and allows for the inclusion of covariates in the underlying regression model. The RAML method performed well compared to the other methods over a wide range of scenarios. Generally, power was moderate in most of the scenarios, underlining the need for large sample sizes in any form of association testing.
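    For readers unfamiliar with burden tests, the following Python sketch shows the basic idea the methods compared above build on: collapsing the rare variants of a region into one per-individual score and testing it in a regression. It is a generic burden test, not RAML's admixture maximum likelihood machinery.

```python
import numpy as np
import statsmodels.api as sm

def burden_test(genotypes, phenotype):
    """Generic burden test: per-individual rare-allele count tested in a
    logistic regression (genotypes: n x p minor-allele counts, 0/1/2)."""
    burden = genotypes.sum(axis=1)
    X = sm.add_constant(burden)
    fit = sm.Logit(phenotype, X).fit(disp=0)
    return fit.params[1], fit.pvalues[1]   # burden effect and p-value

rng = np.random.default_rng(0)
G = (rng.random((500, 30)) < 0.01).astype(int)          # rare variants
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * G.sum(1) - 2))))
print(burden_test(G, y))
```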
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 31
    Publication Date: 2013-06-09
    Description: Background: Miniature inverted repeat transposable elements (MITEs) are abundant non-autonomous elements, playing important roles in shaping gene and genome evolution. Their characteristic structural features are suitable for automated identification by computational approaches; however, de novo MITE discovery at the genomic level is still resource-expensive, so efficient and accurate computational tools are desirable. Results: Existing algorithms process every member of a MITE family; therefore, a major portion of the computing task is redundant. In this study, redundant computing steps were analyzed and a novel algorithm emphasizing the reduction of such redundant computing was implemented in MITE Digger. It completed processing the whole rice genome sequence database in ~15 hours and produced 332 MITE candidates with low false positive (1.8%) and false negative (0.9%) rates. MITE Digger was also tested for genome wide MITE discovery with four other genomes. Conclusions: MITE Digger is efficient and accurate for genome wide retrieval of MITEs. Its user-friendly interface further facilitates genome wide analyses of MITEs on a routine basis. The MITE Digger program is available at: http://labs.csb.utoronto.ca/yang/MITE Digger.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 32
    Publication Date: 2013-06-09
    Description: Background: Next Generation Sequencing technologies have revolutionized many fields in biology by reducing the time and cost required for sequencing. As a result, large amounts of sequencing data are being generated. A typical sequencing data file may occupy tens or even hundreds of gigabytes of disk space, prohibitively large for many users. This data consists of both the nucleotide sequences and per-base quality scores that indicate the level of confidence in the readout of these sequences. Quality scores account for about half of the required disk space in the commonly used FASTQ format (before compression), and therefore the compression of the quality scores can significantly reduce storage requirements and speed up analysis and transmission of sequencing data. Results: In this paper, we present a new scheme for the lossy compression of the quality scores, to address the problem of storage. Our framework allows the user to specify the rate (bits per quality score) prior to compression, independent of the data to be compressed. Our algorithm can work at any rate, unlike other lossy compression algorithms. We envisage our algorithm as being part of a more general compression scheme that works with the entire FASTQ file. Numerical experiments show that we can achieve a better mean squared error (MSE) for small rates (bits per quality score) than other lossy compression schemes. For the organism PhiX, whose assembled genome is known and assumed to be correct, we show that it is possible to achieve a significant reduction in size with little compromise in performance on downstream applications (e.g., alignment). Conclusions: QualComp is an open source software package, written in C and freely available for download at https://sourceforge.net/projects/qualcomp. It is designed to lossily compress the quality scores presented in a FASTQ file. Given a model for the quality scores, we use rate-distortion results to optimally allocate the available bits in order to minimize the MSE. This metric allows us to compare different lossy compression algorithms for quality scores without depending on downstream applications that may use the quality scores in very different ways.
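    As a toy illustration of fixed-rate lossy compression of quality scores, the Python sketch below uses a plain uniform quantizer to expose the rate/MSE trade-off; QualComp itself allocates bits optimally from a statistical model of the scores rather than quantizing uniformly.

```python
import numpy as np

def quantize_qualities(q, bits):
    """Quantize quality scores at a fixed rate of `bits` per score,
    using 2**bits uniformly spaced reconstruction levels."""
    levels = 2 ** bits
    lo, hi = q.min(), q.max()
    step = (hi - lo) / levels
    idx = np.clip(((q - lo) / step).astype(int), 0, levels - 1)
    recon = lo + (idx + 0.5) * step        # mid-point reconstruction
    return recon, np.mean((q - recon) ** 2)

q = np.random.default_rng(1).integers(2, 41, 10000).astype(float)
for b in (1, 2, 3):
    print(b, "bits/score -> MSE", round(quantize_qualities(q, b)[1], 2))
```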
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 33
    Publication Date: 2013-04-05
    Description: Background: Pyrrolysine (the 22nd amino acid) is, in certain organisms and under certain circumstances, encoded by the amber stop codon, UAG. The circumstances driving pyrrolysine translation are not well understood. The involvement of a predicted mRNA structure in the region downstream of UAG has been suggested, but the structure does not seem to be present in all pyrrolysine incorporating genes. Results: We propose a strategy to predict pyrrolysine encoding genes in genomes of archaea and bacteria. We cluster open reading frames interrupted by the amber codon based on sequence similarity. We rank these clusters according to several features that may influence pyrrolysine translation. The ranking effects of different features are assessed and we propose a weighted combination of these features which best explains the currently known pyrrolysine incorporating genes. We devote special attention to the effect of structural conservation and provide further substantiation to support that structural conservation may be influential -- but is not a necessary factor. Finally, from the weighted ranking, we identify a number of potentially pyrrolysine incorporating genes. Conclusions: We propose a method for prediction of pyrrolysine incorporating genes in genomes of bacteria and archaea leading to insights about the factors driving pyrrolysine translation and identification of new gene candidates. The method predicts known conserved genes with high recall and predicts several other promising candidates for experimental verification. The method is implemented as a computational pipeline which is available on request.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 34
    Publication Date: 2013-04-05
    Description: Background: Next generation transcriptome sequencing (RNA-Seq) is emerging as a powerful experimental tool for the study of alternative splicing and its regulation, but requires ad-hoc analysis methods and tools. PASTA (Patterned Alignments for Splicing and Transcriptome Analysis) is a splice junction detection algorithm specifically designed for RNA-Seq data, relying on a highly accurate alignment strategy and on a combination of heuristic and statistical methods to identify exon-intron junctions with high accuracy. Results: Comparisons against TopHat and other splice junction prediction software on real and simulated datasets show that PASTA exhibits high specificity and sensitivity, especially at lower coverage levels. Moreover, PASTA is highly configurable and flexible, and can therefore be applied in a wide range of analysis scenarios: it is able to handle both single-end and paired-end reads, it does not rely on the presence of canonical splicing signals, and it uses organism-specific regression models to accurately identify junctions. Conclusions: PASTA is a highly efficient and sensitive tool to identify splicing junctions from RNA-Seq data. Compared to similar programs, it has the ability to identify a higher number of real splicing junctions, and provides highly annotated output files containing detailed information about their location and characteristics. Accurate junction data in turn facilitates the reconstruction of the splicing isoforms and the analysis of their expression levels, which will be performed by the remaining modules of the PASTA pipeline, still under development. Use of PASTA can therefore enable the large-scale investigation of transcription and alternative splicing.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 35
    Publication Date: 2013-04-11
    Description: Contributing reviewersThe editors of BMC Bioinformatics would like to thank all our reviewers who have contributed to the journal in volume 13 (2012).
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 36
    Publication Date: 2013-04-06
    Description: Background: The random forest (RF) method is a commonly used tool for classification with high dimensional data as well as for ranking candidate predictors based on the so-called random forest variable importance measures (VIMs). However, the classification performance of RF is known to be suboptimal in the case of strongly unbalanced data, i.e. data where response class sizes differ considerably. Suggestions were made to obtain better classification performance based either on sampling procedures or on cost sensitivity analyses. However, to our knowledge, the performance of the VIMs has not yet been examined in the case of unbalanced response classes. In this paper we explore the performance of the permutation VIM for unbalanced data settings and introduce an alternative permutation VIM based on the area under the curve (AUC) that is expected to be more robust towards class imbalance. Results: We investigated the performance of the standard permutation VIM and of our novel AUC-based permutation VIM for different class imbalance levels using simulated data and real data. The results suggest that the new AUC-based permutation VIM outperforms the standard permutation VIM for unbalanced data settings while both permutation VIMs have equal performance for balanced data settings. Conclusions: The standard permutation VIM loses its ability to discriminate between associated predictors and predictors not associated with the response for increasing class imbalance. It is outperformed by our new AUC-based permutation VIM for unbalanced data settings, while the performance of both VIMs is very similar in the case of balanced classes. The new AUC-based VIM is implemented in the R package party for the unbiased RF variant based on conditional inference trees. The codes implementing our study are available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html
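    The following Python sketch shows the general idea of an AUC-based permutation importance on any probabilistic classifier (here a scikit-learn random forest); the published VIM itself is implemented in the R package party for conditional inference forests.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def auc_permutation_vim(model, X, y, seed=0):
    """Importance of predictor j = drop in AUC when column j is permuted;
    unlike the error rate, AUC is insensitive to class imbalance."""
    rng = np.random.default_rng(seed)
    base = roc_auc_score(y, model.predict_proba(X)[:, 1])
    imp = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        imp[j] = base - roc_auc_score(y, model.predict_proba(Xp)[:, 1])
    return imp

X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.9, 0.1], random_state=1)
rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)
print(auc_permutation_vim(rf, X, y).round(3))
```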
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 37
    Publication Date: 2013-04-11
    Description: Background: Since peak alignment in metabolomics has a huge effect on the subsequent statistical analysis, it is considered a key preprocessing step and many peak alignment methods have been developed. However, existing peak alignment methods do not produce satisfactory results. Indeed, the lack of accuracy results from the fact that peak alignment is done separately from other preprocessing steps such as identification. Therefore, a post-hoc approach, which integrates both identification and alignment results, is urgently needed to increase the accuracy of peak alignment. Results: The proposed post-hoc method was validated with three datasets: a mixture of compound standards, metabolite extract from mouse liver, and metabolite extract from wheat. Compared to the existing methods, the proposed approach improved peak alignment in terms of various performance measures. The post-hoc approach was also verified to improve peak alignment by manual inspection. Conclusions: The proposed approach, which combines the information of metabolite identification and alignment, clearly improves the accuracy of peak alignment in terms of several performance measures. An R package and examples using a dataset are available at http://mrr.sourceforge.net/download.html.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 38
    Publication Date: 2013-09-09
    Description: Background: DNA pooling constitutes a cost-effective alternative in genome wide association studies. In DNA pooling, equimolar amounts of DNA from different individuals are mixed into one sample and the frequency of each allele at each position is observed in a single genotyping experiment. The identification of haplotype frequencies from pooled data, in addition to single locus analysis, is of separate interest within these studies, as haplotypes could increase statistical power and provide additional insight. Results: We developed a method for maximum-parsimony haplotype frequency estimation from pooled DNA data based on the sparse representation of the DNA pools in a dictionary of haplotypes. Extensions to scenarios where data is noisy or even missing are also presented. The resulting method is first applied to simulated data based on the haplotypes and their associated frequencies of the AGT gene. We further evaluate our methodology on datasets consisting of SNPs from the first 7 Mb of the HapMap CEU population. Noise and missing data were further introduced in the datasets in order to test the extensions of the proposed method. Both HIPPO and HAPLOPOOL were also applied to these datasets to compare performances. Conclusions: We evaluate our methodology on scenarios where pooling is more efficient relative to individual genotyping; that is, in datasets that contain pools with a small number of individuals. We show that in such scenarios our methodology outperforms state-of-the-art methods such as HIPPO and HAPLOPOOL.
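    A minimal sketch of the sparse-representation idea follows, assuming the problem can be posed as non-negative least squares over a dictionary of candidate haplotypes with a soft sum-to-one constraint; the paper's exact maximum-parsimony formulation and its noise/missing-data extensions are not reproduced.

```python
import numpy as np
from scipy.optimize import nnls

# Dictionary of candidate haplotypes (rows) over three SNPs (columns).
H = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 0],
              [1, 1, 1]], dtype=float)

# Observed per-SNP allele frequencies in the pool.
y = np.array([0.5, 0.3, 0.6])

# min ||A f - b||_2 with f >= 0; the appended row of ones softly
# enforces sum(f) = 1. NNLS tends to return sparse (parsimonious) f.
A = np.vstack([H.T, np.ones(H.shape[0])])
b = np.append(y, 1.0)
f, _ = nnls(A, b)
print(dict(zip(["001", "011", "100", "111"], f.round(3))))
# -> frequencies 0.3, 0.2, 0.4, 0.1 reproduce the observed y exactly
```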
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 39
    Publication Date: 2013-09-09
    Description: Background: Many problems in computational biology require alignment-free sequence comparisons. One of the common tasks involving sequence comparison is sequence clustering. Here we apply methods of alignment-free comparison (in particular, comparison using sequence composition) to the challenge of sequence clustering. Results: We study several centroid-based algorithms for clustering sequences based on word counts. Study of their performance shows that using the k-means algorithm, with or without data whitening, is efficient from the computational point of view. A higher clustering accuracy can be achieved using the soft expectation maximization method, whereby each sequence is attributed to each cluster with a specific probability. We implement an open source tool for alignment-free clustering. It is publicly available from github: https://github.com/luscinius/afcluster. Conclusions: We show the utility of alignment-free sequence clustering for high-throughput sequencing analysis despite its limitations. In particular, it allows one to perform assembly with reduced resources and a minimal loss of quality. The major factor affecting the performance of alignment-free read clustering is the length of the read.
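    A small Python sketch of centroid-based clustering on word (k-mer) counts, the family of approaches studied above; soft expectation-maximization clustering would replace the hard k-means assignment with per-cluster membership probabilities.

```python
import itertools
import numpy as np
from sklearn.cluster import KMeans

def kmer_profile(seq, k=3):
    """Normalized k-mer count vector: the sequence 'composition'."""
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    v = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:                 # skip k-mers containing N etc.
            v[j] += 1
    return v / max(v.sum(), 1)

reads = ["ACGTACGTGCAACGT", "TTTTAACGTTTTTAA",
         "ACGTACGGGCAACGA", "TTTAAACGTTATTTA"]
X = np.array([kmer_profile(r) for r in reads])
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))
```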
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 40
    Publication Date: 2013-09-09
    Description: Background: Current advances of the next-generation sequencing technology have revealed a large number of un-annotated RNA transcripts. Comparative study of the RNA structurome is an important approach to assess their biological functionalities. Due to the large sizes and abundance of the RNA transcripts, an efficient and accurate RNA structure-structure alignment algorithm is urgently needed to facilitate the comparative study. Despite the importance of the RNA secondary structure alignment problem, there are no computational tools available that provide high computational efficiency and accuracy. In this case, designing and implementing such an efficient and accurate RNA secondary structure alignment algorithm is highly desirable. Results: In this work, through incorporating the sparse dynamic programming technique, we implemented an algorithm that has an O(n^3) expected time complexity, where n is the average number of base pairs in the RNA structures. This complexity, which can be shown assuming the polymer-zeta property, is confirmed by our experiments. The resulting new RNA secondary structure alignment tool is called ERA. Benchmark results indicate that ERA can significantly speed up RNA structure-structure alignments compared to other state-of-the-art RNA alignment tools, while maintaining high alignment accuracy. Conclusions: Using the sparse dynamic programming technique, we are able to develop a new RNA secondary structure alignment tool that is both efficient and accurate. We anticipate that the new alignment algorithm ERA will significantly promote comparative RNA structure studies. The program, ERA, is freely available at http://genome.ucf.edu/ERA.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 41
    Publication Date: 2013-09-10
    Description: Background: Chromatin immunoprecipitation coupled with hybridization to a tiling array (ChIP-chip) is a cost-effective and routinely used method to identify protein-DNA interactions or chromatin/histone modifications. The robust identification of ChIP-enriched regions is frequently complicated by noisy measurements. This identification can be improved by accounting for dependencies between adjacent probes on chromosomes and by modeling of biological replicates. Results: MultiChIPmixHMM is a user-friendly R package to analyse ChIP-chip data, modeling spatial dependencies between directly adjacent probes on a chromosome and enabling a simultaneous analysis of replicates. It is based on a linear regression mixture model, designed to perform a joint modeling of immunoprecipitated and input measurements. Conclusion: We show the utility of MultiChIPmixHMM by analyzing histone modifications of Arabidopsis thaliana. MultiChIPmixHMM is implemented in R, including functions in C, and is freely available from the CRAN web site: http://cran.r-project.org.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 42
    Publication Date: 2013-09-13
    Description: Background: Gene regulatory network inference remains a challenging problem in systems biology despite the numerous approaches that have been proposed. When substantial knowledge on a gene regulatory network is already available, supervised network inference is appropriate. Such a method builds a binary classifier able to assign a class (Regulation/No regulation) to an ordered pair of genes. Once learnt, the pairwise classifier can be used to predict new regulations. In this work, we explore the framework of Markov Logic Networks (MLN) that combine features of probabilistic graphical models with the expressivity of first-order logic rules. Results: We propose to learn a Markov Logic network, i.e. a set of weighted rules that conclude on the predicate "regulates", starting from a known gene regulatory network involved in the proliferation/differentiation switch of keratinocyte cells, a set of experimental transcriptomic data and various descriptions of genes, all encoded into first-order logic. As the training data are unbalanced, we use asymmetric bagging to learn a set of MLNs. The prediction of a new regulation can then be obtained by averaging the predictions of individual MLNs. As a side contribution, we propose three in silico tests to assess the performance of any pairwise classifier in various network inference tasks on real datasets. The first test consists of measuring the average performance on a balanced edge prediction problem; the second deals with the ability of the classifier, once enhanced by asymmetric bagging, to update a given network. Finally, our main result concerns a third test that measures the ability of the method to predict regulations with a new set of genes. As expected, MLN, when provided with only numerical discretized gene expression data, does not perform as well as a pairwise SVM in terms of AUPR. However, when a more complete description of gene properties is provided by heterogeneous sources, MLN achieves the same performance as a black-box model such as a pairwise SVM while providing relevant insights on the predictions. Conclusions: The numerical studies show that MLN achieves very good predictive performance while opening the door to some interpretability of the decisions. Besides the ability to suggest new regulations, such an approach allows experimental data to be cross-validated against existing knowledge.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 43
    Publication Date: 2013-09-19
    Description: Background: Recent increases in the number of deposited membrane protein crystal structures necessitate the use of automated computational tools to position them within the lipid bilayer. Identifying the correct orientation allows us to study the complex relationship between sequence, structure and the lipid environment, which is otherwise challenging to investigate using experimental techniques due to the difficulty in crystallising membrane proteins embedded within intact membranes. Results: We have developed a knowledge-based membrane potential, calculated by the statistical analysis of transmembrane protein structures, coupled with a combination of genetic and direct search algorithms, and demonstrate its use in positioning proteins in membranes, refinement of membrane protein models and in decoy discrimination. Conclusions: Our method is able to quickly and accurately orientate both alpha-helical and beta-barrel membrane proteins within the lipid bilayer, showing closer agreement with experimentally determined values than existing approaches. We also demonstrate both consistent and significant refinement of membrane protein models and the effective discrimination between native and decoy structures. Source code is available under an open source license from http://bioinf.cs.ucl.ac.uk/downloads/memembed/.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 44
    Publication Date: 2013-09-23
    Description: Background: Dynamic visualisation interfaces are required to explore the multiple microbial genome data now available, especially those obtained by high-throughput sequencing --- a.k.a. "Next-Generation Sequencing" (NGS) --- technologies; they would also be useful for "standard" annotated genomes whose chromosome organizations may be compared. Although various software systems are available, few offer an optimal combination of feature-rich capabilities, non-static user interfaces and multi-genome data handling. Results: We developed SynTView, a comparative and interactive viewer for microbial genomes, designed to run as either a web-based tool (Flash technology) or a desktop application (AIR environment). The basis of the program is a generic genome browser with sub-maps holding information about genomic objects (annotations). The software is characterised by the presentation of syntenic organisations of microbial genomes and the visualisation of polymorphism data (typically Single Nucleotide Polymorphisms --- SNPs) along these genomes; these features are accessible to the user in an integrated way. A variety of specialised views are available and are all dynamically inter-connected (including linear and circular multi-genome representations, dot plots, phylogenetic profiles, SNP density maps, and more). SynTView is not linked to any particular database, allowing users to plug their own data into the system seamlessly, and use external web services for added functionalities. SynTView has now been used in several genome sequencing projects to help biologists make sense out of huge data sets. Conclusions: The most important assets of SynTView are: (i) the interactivity due to the Flash technology; (ii) the capabilities for dynamic interaction between many specialised views; and (iii) the flexibility allowing various user data sets to be integrated. It can thus be used to investigate massive amounts of information efficiently at the chromosome level. This innovative approach to data exploration could not be achieved with most existing genome browsers, which are more static and/or do not offer multiple views of multiple genomes. Documentation, tutorials and demonstration sites are available at the URL: http://genopole.pasteur.fr/SynTView.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 45
    Publication Date: 2013-09-23
    Description: Background: High-throughput RNA sequencing (RNA-Seq) is a revolutionary technique to study the transcriptome of a cell under various conditions at a systems level. Despite the wide application of RNA-Seq techniques to generate experimental data in the last few years, few computational methods are available to analyze this huge amount of transcription data. The computational methods for constructing gene regulatory networks from RNA-Seq expression data of hundreds or even thousands of genes are particularly lacking and urgently needed. Results: We developed an automated bioinformatics method to predict gene regulatory networks from the quantitative expression values of differentially expressed genes based on RNA-Seq transcriptome data of a cell in different stages and conditions, integrating transcriptional, genomic and gene function data. We applied the method to the RNA-Seq transcriptome data generated for soybean root hair cells in three different development stages of nodulation after rhizobium infection. The method predicted a soybean nodulation-related gene regulatory network consisting of 10 regulatory modules common to all three stages, and 24, 49 and 70 modules specific to the first, second and third stages, respectively, each containing both a group of co-expressed genes and several transcription factors collaboratively controlling their expression under different conditions. Eight of the 10 common regulatory modules were validated by at least two kinds of validation, such as independent DNA-binding motif analysis, gene function enrichment tests, and previous experimental data in the literature. Conclusions: We developed a computational method to reliably reconstruct gene regulatory networks from RNA-Seq transcriptome data. The method can generate valuable hypotheses for interpreting biological data and designing biological experiments such as ChIP-Seq, RNA interference, and yeast two hybrid experiments.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 46
    Publication Date: 2013-09-24
    Description: Background: Finding peaks in ChIP-seq is an important process in biological inference. In some cases, such as positioning nucleosomes with specific histone modifications or finding transcription factor binding specificities, the precision of the detected peak plays a significant role. There are several applications for finding peaks (called peak finders) based on different algorithms (e.g. MACS, Erange and HPeak). Benchmark studies have shown that the existing peak finders identify different peaks for the same dataset and it is not known which one is the most accurate. We present the first meta-server called Peak Finder MetaServer (PFMS) that collects results from several peak finders and produces consensus peaks. Our application accepts three standard ChIP-seq data formats: BED, BAM, and SAM. Results: Sensitivity and specificity of seven widely used peak finders were examined. For the experiments we used three previously studied Transcription Factors (TF) ChIP-seq datasets and identified three of the selected peak finders that returned results with high specificity and very good sensitivity compared to the remaining four. We also ran PFMS using the three selected peak finders on the same TF datasets and achieved higher specificity and sensitivity than the peak finders individually. Conclusions: We show that combining outputs from up to seven peak finders yields better results than individual peak finders. In addition, three of the seven peak finders outperform the remaining four, and running PFMS with these three returns even more accurate results. Another added value of PFMS is a separate report of the peaks returned by each of the included peak finders.
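    A minimal Python sketch of consensus peak calling by interval sweeping: regions covered by at least `min_support` of the individual peak finders are reported. PFMS's actual combination strategy may differ in its details.

```python
def consensus_peaks(peak_sets, min_support=2):
    """Consensus regions on one chromosome supported by at least
    min_support peak finders, via a start/end event sweep."""
    events = []
    for peaks in peak_sets:              # one (start, end) list per finder
        for start, end in peaks:
            events.append((start, +1))
            events.append((end, -1))
    events.sort()
    consensus, depth, region_start = [], 0, None
    for pos, delta in events:
        depth += delta
        if depth >= min_support and region_start is None:
            region_start = pos
        elif depth < min_support and region_start is not None:
            consensus.append((region_start, pos))
            region_start = None
    return consensus

finder_a = [(100, 200), (500, 650)]
finder_b = [(120, 210), (560, 700)]
finder_c = [(90, 190)]
print(consensus_peaks([finder_a, finder_b, finder_c]))  # [(100, 200), (560, 650)]
```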
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 47
    Publication Date: 2013-09-26
    Description: Background: Concept recognition is an essential task in biomedical information extraction, presenting several complex and unsolved challenges. The development of such solutions is typically performed in an ad-hoc manner or using general information extraction frameworks, which are not optimized for the biomedical domain and normally require the integration of complex external libraries and/or the development of custom tools. Results: This article presents Neji, an open source framework optimized for biomedical concept recognition built around four key characteristics: modularity, scalability, speed, and usability. It integrates modules for biomedical natural language processing, such as sentence splitting, tokenization, lemmatization, part-of-speech tagging, chunking and dependency parsing. Concept recognition is provided through dictionary matching and machine learning with normalization methods. Neji also integrates an innovative concept tree implementation, supporting overlapped concept names and respective disambiguation techniques. The most popular input and output formats, namely Pubmed XML, IeXML, CoNLL and A1, are also supported. On top of the built-in functionalities, developers and researchers can implement new processing modules or pipelines, or use the provided command-line interface tool to build their own solutions, applying the most appropriate techniques to identify heterogeneous biomedical concepts. Neji was evaluated against three gold standard corpora with heterogeneous biomedical concepts (CRAFT, AnEM and NCBI disease corpus), achieving high performance results on named entity recognition (F1-measure for overlap matching: species 95%, cell 92%, cellular components 83%, gene and proteins 76%, chemicals 65%, biological processes and molecular functions 63%, disorders 85%, and anatomical entities 82%) and on entity normalization (F1-measure for overlap name matching and correct identifier included in the returned list of identifiers: species 88%, cell 71%, cellular components 72%, gene and proteins 64%, chemicals 53%, and biological processes and molecular functions 40%). Neji provides fast and multi-threaded data processing, annotating up to 1200 sentences/second when using dictionary-based concept identification. Conclusions: Considering the provided features and underlying characteristics, we believe that Neji is an important contribution to the biomedical community, streamlining the development of complex concept recognition solutions. Neji is freely available at http://bioinformatics.ua.pt/neji.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 48
    Publication Date: 2013-09-26
    Description: Background: Identifying genetic variants associated with complex human diseases is a great challenge in genomewide association studies (GWAS). Single nucleotide polymorphisms (SNPs) arising from genetic background are often dependent. The existing methods, i.e., local index of significance (LIS) and pooled local index of significance (PLIS), were both proposed for modeling SNP dependence and assumed that the whole chromosome follows a hidden Markov model (HMM). However, the fact that SNP data are often collected from separate heterogeneous regions of a single chromosome encourages different chromosomal regions to follow different HMMs. In this research, we developed a data-driven penalized criterion combined with a dynamic programming algorithm to find change points that divide the whole chromosome into more homogeneous regions. Furthermore, we extended PLIS to analyze the dependent tests obtained from multiple chromosomes with different regions for GWAS. Results: The simulation results show that our new criterion can improve the performance of the model selection procedure and that our region-specific PLIS (RSPLIS) method is better than PLIS at detecting disease-associated SNPs when there are multiple change points along a chromosome. Our method has been used to analyze the Daly study, and compared with PLIS, RSPLIS yielded results that more accurately detected disease-associated SNPs. Conclusions: The genomic rankings based on our method differ from the rankings based on PLIS. Specifically, for the detection of genetic variants with weak effect sizes, the RSPLIS method was able to rank them more efficiently and with greater power.
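    The Python sketch below illustrates penalized change-point detection by dynamic programming on a generic numeric sequence (squared-error segment cost plus a constant penalty per segment); the paper's data-driven criterion for delimiting HMM regions is more elaborate.

```python
import numpy as np

def changepoints(x, penalty):
    """Optimal segmentation minimizing within-segment squared error
    plus `penalty` per segment, by O(n^2) dynamic programming."""
    n = len(x)
    c1 = np.cumsum(np.insert(x, 0, 0.0))
    c2 = np.cumsum(np.insert(x ** 2, 0, 0.0))

    def cost(i, j):                        # SSE of segment x[i:j]
        s, s2, m = c1[j] - c1[i], c2[j] - c2[i], j - i
        return s2 - s * s / m

    best = np.zeros(n + 1)
    last = np.zeros(n + 1, dtype=int)
    for j in range(1, n + 1):
        cands = [best[i] + cost(i, j) + penalty for i in range(j)]
        last[j] = int(np.argmin(cands))
        best[j] = cands[last[j]]
    cps, j = [], n                         # backtrack changepoint positions
    while j > 0:
        j = last[j]
        if j > 0:
            cps.append(j)
    return sorted(cps)

rng = np.random.default_rng(1)
x = np.concatenate([np.zeros(50), 3 * np.ones(50), np.zeros(50)])
print(changepoints(x + rng.normal(0, 0.3, x.size), penalty=5.0))  # ~[50, 100]
```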
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 49
    Publication Date: 2013-09-26
    Description: Background: The use of Gene Ontology (GO) data in protein analyses has largely contributed to the improved outcomes of these analyses. Several GO semantic similarity measures have been proposed in recent years and provide tools that allow the integration of biological knowledge embedded in the GO structure into different biological analyses. There is a need for a unified tool that provides the scientific community with the opportunity to explore these different GO similarity measure approaches and their biological applications. Results: We have developed DaGO-Fun, an online tool available at http://web.cbio.uct.ac.za/ITGOM, which incorporates many different GO similarity measures for exploring, analyzing and comparing GO terms and proteins within the context of GO. It uses GO data and UniProt proteins with their GO annotations as provided by the Gene Ontology Annotation (GOA) project to precompute GO term information content (IC), enabling rapid response to user queries. Conclusions: The DaGO-Fun online tool presents the advantage of integrating all the relevant IC-based GO similarity measures, including topology- and annotation-based approaches, to facilitate effective exploration of these measures, thus enabling users to choose the most relevant approach for their application. Furthermore, this tool includes several biological applications related to GO semantic similarity scores, including the retrieval of genes based on their GO annotations, the clustering of functionally related genes within a set, and term enrichment analysis.
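    As a toy Python example of the IC-based family of measures such a tool integrates, here is Resnik's similarity (the information content of the most informative common ancestor) on a hypothetical four-term corpus; term names and counts are made up for illustration.

```python
import math

# Hypothetical annotation counts per GO term (annotations inherited
# from descendant terms are assumed to be already propagated).
counts = {"GO:root": 100, "GO:metab": 60,
          "GO:lipid_metab": 15, "GO:transport": 30}
TOTAL = counts["GO:root"]

def ic(term):
    """Information content: -log of the term's annotation probability."""
    return -math.log(counts[term] / TOTAL)

def resnik(common_ancestors):
    """Resnik similarity: IC of the most informative common ancestor."""
    return max(ic(a) for a in common_ancestors)

print(resnik({"GO:root"}))                 # unrelated terms -> 0.0
print(resnik({"GO:root", "GO:metab"}))     # shared ancestor -> ~0.51
```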
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 50
    Publication Date: 2013-09-26
    Description: Background: Existing tools to model cell growth curves do not offer a flexible integrative approach to manage large datasets and automatically estimate parameters. Due to the increase of experimental time-series from microbiology and oncology, there is a crucial need for software that allows researchers to easily organize experimental data and simultaneously extract relevant parameters in an efficient way. Results: BGFit provides a web-based unified platform where a rich set of dynamic models can be fitted to experimental time-series data, further allowing the results to be managed efficiently in a structured and hierarchical way. The data management system allows users to organize project, experiment and measurement data and also to define teams with different editing and viewing permissions. Several dynamic and algebraic models are already implemented, such as polynomial regression, Gompertz, Baranyi, Logistic and Live Cell Fraction models, and the user can easily add new models, thus expanding the current ones. Conclusions: BGFit allows users to easily manage their data and models in an integrated way, even if they are not familiar with databases or existing computational tools for parameter estimation. BGFit is designed with a flexible architecture that focuses on extensibility and leverages free software with existing tools and methods, allowing different data modeling techniques to be compared and evaluated. The application is described in the context of bacterial and tumor cell growth data fitting, but it is also applicable to any type of two-dimensional data, e.g. physical chemistry and macroeconomic time series, being fully scalable to a high number of projects, data and model complexity.
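    As a small illustration of the fitting task such a platform automates, the Python sketch below fits the Zwietering parameterisation of the Gompertz model to a simulated growth curve.

```python
import numpy as np
from scipy.optimize import curve_fit

def gompertz(t, A, mu, lam):
    """Zwietering-parameterised Gompertz curve: A = asymptote,
    mu = maximum specific growth rate, lam = lag time."""
    return A * np.exp(-np.exp(mu * np.e / A * (lam - t) + 1))

t = np.linspace(0, 24, 25)
y = gompertz(t, 1.8, 0.4, 4.0) + np.random.default_rng(0).normal(0, 0.02, t.size)
(A, mu, lam), _ = curve_fit(gompertz, t, y, p0=[1.5, 0.3, 3.0])
print(f"A={A:.2f}, mu={mu:.2f}, lambda={lam:.2f}")
```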
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 51
    Publication Date: 2014-12-14
    Description: Background: Biomedical ontologies are increasingly instrumental in the advancement of biological research primarily through their use to efficiently consolidate large amounts of data into structured, accessible sets. However, ontology development and usage can be hampered by the segregation of knowledge by domain that occurs due to independent development and use of the ontologies. The ability to infer data associated with one ontology to data associated with another ontology would prove useful in expanding information content and scope. We here focus on relating two ontologies: the Gene Ontology (GO), which encodes canonical gene function, and the Mammalian Phenotype Ontology (MP), which describes non-canonical phenotypes, using statistical methods to suggest GO functional annotations from existing MP phenotype annotations. This work is in contrast to previous studies that have focused on inferring gene function from phenotype primarily through lexical or semantic similarity measures. Results: We have designed and tested a set of algorithms that represents a novel methodology to define rules for predicting gene function by examining the emergent structure and relationships between the gene functions and phenotypes rather than inspecting the terms semantically. The algorithms inspect relationships among multiple phenotype terms to deduce if there are cases where they all arise from a single gene function. We apply this methodology to data about genes in the laboratory mouse that are formally represented in the Mouse Genome Informatics (MGI) resource. From the data, 7444 rule instances were generated from five generalized rules, resulting in 4818 unique GO functional predictions for 1796 genes. Conclusions: We show that our method is capable of inferring high-quality functional annotations from curated phenotype data. As well as creating inferred annotations, our method has the potential to allow for the elucidation of unforeseen, biologically significant associations between gene function and phenotypes that would be overlooked by a semantics-based approach. Future work will include the implementation of the described algorithms for a variety of other model organism databases, taking full advantage of the abundance of available high quality curated data.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 52
    Publication Date: 2014-12-18
    Description: Background: Identification of individual components in complex mixtures is an important and sometimes daunting task in several research areas like metabolomics and natural product studies. NMR spectroscopy is an excellent technique for analysis of mixtures of organic compounds and gives a detailed chemical fingerprint of most individual components above the detection limit. For the identification of individual metabolites in metabolomics, correlation or covariance between peaks in 1H NMR spectra has previously been successfully employed. Similar correlation of 2D 1H-13C Heteronuclear Single Quantum Correlation spectra was recently applied to investigate the structure of heparin. In this paper, we demonstrate how a similar approach can be used to identify metabolites in human biofluids (post-prostatic palpation urine). Results: From 50 1H-13C Heteronuclear Single Quantum Correlation spectra, 23 correlation plots resembling pure metabolites were constructed. The identities of these metabolites were confirmed by comparing the correlation plots with reported NMR data, mostly from the Human Metabolome Database. Conclusions: Correlation plots prepared by statistically correlating 1H-13C Heteronuclear Single Quantum Correlation spectra from human biofluids provide unambiguous identification of metabolites. The correlation plots highlight cross-peaks belonging to each individual compound and, unlike conventional NMR experiments, are not limited by the range of magnetization transfer.
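    A toy Python sketch of the statistical correlation step: cross-peaks originating from the same metabolite co-vary in intensity across the set of spectra, so thresholding the peak-to-peak correlation matrix groups them; the simulated intensities and the 0.9 threshold are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
conc_a, conc_b = rng.lognormal(size=50), rng.lognormal(size=50)  # 50 spectra

# Four cross-peaks: two proportional to metabolite A, two to metabolite B.
peaks = np.vstack([conc_a, 0.5 * conc_a, conc_b, 2.0 * conc_b])
peaks += rng.normal(0, 0.05, peaks.shape)   # measurement noise

corr = np.corrcoef(peaks)        # peak-by-peak correlation across spectra
print((corr > 0.9).astype(int))  # blocks of 1s = peaks of one compound
```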
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 53
    Publication Date: 2014-12-18
    Description: Background: Alternative Splicing (AS) as a post-transcriptional regulation mechanism is an important application of RNA-seq studies in eukaryotes. A number of software packages and computational methods have been developed for detecting AS. Most of the methods, however, are designed and tested on animal data, such as human and mouse. Plant genes differ from those of animals in many ways, e.g., the average intron size and preferred AS types. These differences may require different computational approaches and raise questions about their effectiveness on plant data. The goal of this paper is to benchmark existing computational differential splicing (or transcription) detection methods so that biologists can choose the most suitable tools to accomplish their goals. Results: This study compares eight popular, publicly available software packages for differential splicing analysis using both simulated and real Arabidopsis thaliana RNA-seq data. All packages are freely available. The study examines the effect of varying AS ratio, read depth, dispersion pattern, AS types, sample sizes and the influence of annotation. Using real data, the study examines the consistency between the packages and verifies a subset of the detected AS events by PCR studies. Conclusions: No single method performs best in all situations. The accuracy of annotation has a major impact on which method should be chosen for AS analysis. DEXSeq performs well on the simulated data when the AS signal is relatively strong and the annotation is accurate. Cufflinks achieves a better tradeoff between precision and recall and turns out to be the best when incomplete annotation is provided. Some methods perform inconsistently for different AS types. Complex AS events that combine several simple AS events pose problems for most methods, especially for MATS. MATS stands out in the analysis of real RNA-seq data when all the AS events being evaluated are simple AS events.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 54
    Publication Date: 2014-11-09
    Description: Background: The rapid accumulation of whole-genome data has renewed interest in the study of using gene-order data for phylogenetic analyses and ancestral reconstruction. Current software and web servers typically do not support duplication and loss events along with rearrangements. Results: MLGO (Maximum Likelihood for Gene-Order Analysis) is a web tool for the reconstruction of phylogeny and/or ancestral genomes from gene-order data. MLGO is based on likelihood computation and shows advantages over existing methods in terms of accuracy, scalability and flexibility. Conclusions: To the best of our knowledge, it is the first web tool for analysis of large-scale genomic changes including not only rearrangements but also gene insertions, deletions and duplications. The web tool is available from http://www.geneorder.org/server.php.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 55
    Publication Date: 2014-12-16
    Description: Background: Genomic selection (GS) promises to improve accuracy in estimating breeding values and genetic gain for quantitative traits compared to traditional breeding methods. Its reliance on high-throughput genome-wide markers and statistical complexity, however, is a serious challenge in data management, analysis, and sharing. A bioinformatics infrastructure for data storage and access, and user-friendly web-based tool for analysis and sharing output is needed to make GS more practical for breeders. Results: We have developed a web-based tool, called solGS, for predicting genomic estimated breeding values (GEBVs) of individuals, using a Ridge-Regression Best Linear Unbiased Predictor (RR-BLUP) model. It has an intuitive web-interface for selecting a training population for modeling and estimating genomic estimated breeding values of selection candidates. It estimates phenotypic correlation and heritability of traits and selection indices of individuals. Raw data is stored in a generic database schema, Chado Natural Diversity, co-developed by multiple database groups. Analysis output is graphically visualized and can be interactively explored online or downloaded in text format. An instance of its implementation can be accessed at the NEXTGEN Cassava breeding database, http://cassavabase.org/solgs. Conclusions: solGS enables breeders to store raw data and estimate GEBVs of individuals online, in an intuitive and interactive workflow. It can be adapted to any breeding program.
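    A compact Python sketch of the RR-BLUP prediction underlying the tool: ridge-shrunken marker effects are estimated in the training population and then applied to the markers of selection candidates. The ridge parameter and the simple centering used here are illustrative simplifications.

```python
import numpy as np

def rr_blup_gebv(Z_train, y_train, Z_new, lam):
    """Marker-effect BLUPs u = (Z'Z + lam*I)^-1 Z'y, then GEBVs Z_new @ u.
    Z_*: centered marker matrices; lam ~ sigma_e^2 / sigma_u^2."""
    p = Z_train.shape[1]
    u = np.linalg.solve(Z_train.T @ Z_train + lam * np.eye(p),
                        Z_train.T @ (y_train - y_train.mean()))
    return Z_new @ u

rng = np.random.default_rng(2)
Z = rng.integers(0, 3, (200, 500)).astype(float)   # 0/1/2 marker dosages
Z -= Z.mean(axis=0)
y = Z @ rng.normal(0, 0.1, 500) + rng.normal(0, 1, 200)
print(rr_blup_gebv(Z[:150], y[:150], Z[150:], lam=100.0)[:5].round(2))
```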
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 56
    Publication Date: 2014-12-16
    Description: Background: According to Regulation (EU) No 619/2011, trace amounts of non-authorised genetically modified organisms (GMO) in feed are tolerated within the EU if certain prerequisites are met. Tolerable traces must not exceed the so-called 'minimum required performance limit' (MRPL), which was defined according to the mentioned regulation to correspond to 0.1% mass fraction per ingredient. Therefore, not yet authorised GMO (and some GMO whose approvals have expired) have to be quantified at very low level following the qualitative detection in genomic DNA extracted from feed samples. As the results of quantitative analysis can imply severe legal and financial consequences for producers or distributors of feed, the quantification results need to be utterly reliable. Results: We developed a statistical approach to investigate the experimental measurement variability within one 96-well PCR plate. This approach visualises the frequency distribution as zygosity-corrected relative content of genetically modified material resulting from different combinations of transgene and reference gene Cq values. One application of it is the simulation of the consequences of varying parameters on measurement results. Parameters could be, for example, replicate numbers or baseline and threshold settings; measurement results could be, for example, the median (class) and relative standard deviation (RSD). All calculations can be done using the built-in functions of Excel without any need for programming. The developed Excel spreadsheets are available (see section 'Availability of supporting data' for details). In most cases, the combination of four PCR replicates for each of the two DNA isolations already resulted in a relative standard deviation of 15% or less. Conclusions: The aims of the study are scientifically based suggestions for minimisation of uncertainty of measurement especially in (but not limited to) the field of GMO quantification at low concentration levels. Four PCR replicates for each of the two DNA isolations seem to be a reasonable minimum number to narrow down the possible spread of results.
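    The core arithmetic can be sketched in a few lines of Python, assuming 100% PCR efficiency (a factor of 2 per cycle) and a simple multiplicative zygosity correction factor z; both assumptions, and the direction of the correction, are illustrative rather than the spreadsheet's exact formulas.

```python
def gm_content_percent(cq_transgene, cq_reference, z=2.0):
    """Relative GM content from one transgene/reference Cq pair,
    assuming doubling per cycle and a zygosity correction factor z."""
    return 100.0 * z * 2.0 ** (cq_reference - cq_transgene)

# A delta Cq of 11 cycles with z = 2 corresponds to roughly the
# 0.1% level at which the MRPL is set.
print(f"{gm_content_percent(35.0, 24.0):.3f} %")   # ~0.098 %
```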
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 57
    Publication Date: 2014-12-16
    Description: Background: The latest generations of Single Nucleotide Polymorphism (SNP) arrays allow copy-number variations to be studied in addition to genotyping measures. Results: MPAgenomics, standing for multi-patient analysis (MPA) of genomic markers, is an R package devoted to: (i) efficient segmentation and (ii) selection of genomic markers from multi-patient copy number and SNP data profiles. It provides wrappers for commonly used packages to streamline their repeated (sometimes difficult) manipulation, offering an easy-to-use pipeline for beginners in R. The segmentation of successive multiple profiles (finding losses and gains) is performed with an automatic choice of the parameters involved in the wrapped packages. Considering multiple profiles at the same time, MPAgenomics wraps efficient penalized regression methods to select relevant markers associated with a given outcome. Conclusions: MPAgenomics provides an easy tool to analyze data from SNP arrays in R. The R package MPAgenomics is available on CRAN.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 58
    Publication Date: 2014-12-16
    Description: Background: With the ever increasing use of computational models in the biosciences, the need to share models and reproduce the results of published studies efficiently and easily is becoming more important. To this end, various standards have been proposed that can be used to describe models, simulations, data or other essential information in a consistent fashion. These constitute various separate components required to reproduce a given published scientific result. Results: We describe the Open Modeling EXchange format (OMEX). Together with the use of other standard formats from the Computational Modeling in Biology Network (COMBINE), OMEX is the basis of the COMBINE Archive, a single file that supports the exchange of all the information necessary for a modeling and simulation experiment in biology. An OMEX file is a ZIP container that includes a manifest file, listing the content of the archive, an optional metadata file adding information about the archive and its content, and the files describing the model. The content of a COMBINE Archive consists of files encoded in COMBINE standards whenever possible, but may include additional files defined by an Internet Media Type. Several tools that support the COMBINE Archive are available, either as independent libraries or embedded in modeling software. Conclusions: The COMBINE Archive facilitates the reproduction of modeling and simulation experiments in biology by embedding all the relevant information in one file. Having all the information stored and exchanged at once also helps in building activity logs and audit trails. We anticipate that the COMBINE Archive will become a significant help for modellers, as the domain moves to larger, more complex experiments such as multi-scale models of organs, digital organisms, and bioengineering.
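    The container structure is easy to sketch in Python: a ZIP file holding a manifest.xml that lists every entry with a format identifier, alongside the model and simulation files themselves. The manifest below is simplified and the model/simulation contents are placeholders.

```python
import zipfile

manifest = """<?xml version="1.0" encoding="UTF-8"?>
<omexManifest xmlns="http://identifiers.org/combine.specifications/omex-manifest">
  <content location="." format="http://identifiers.org/combine.specifications/omex"/>
  <content location="./model.xml"
           format="http://identifiers.org/combine.specifications/sbml"/>
  <content location="./simulation.sedml"
           format="http://identifiers.org/combine.specifications/sed-ml"/>
</omexManifest>
"""

with zipfile.ZipFile("experiment.omex", "w") as archive:
    archive.writestr("manifest.xml", manifest)
    archive.writestr("model.xml", "<sbml>...</sbml>")           # placeholder
    archive.writestr("simulation.sedml", "<sedML>...</sedML>")  # placeholder
```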
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 59
    Publication Date: 2014-12-15
    Description: Background: With the advent of low-cost, fast sequencing technologies, metagenomic analyses are made possible. The large data volumes gathered by these techniques and the unpredictable diversity captured in them are still, however, a challenge for computational biology. Results: In this paper we address the problem of rapid taxonomic assignment with small and adaptive data models (<5 MB) and present the accelerated k-mer explorer (AKE). Acceleration in AKE's taxonomic assignments is achieved by a special machine learning architecture, which is well suited to model data collections that are intrinsically hierarchical. We report reasonably good classification accuracy for ranks down to order, observed in a study on real-world data (Acid Mine Drainage, Cow Rumen). Conclusion: We show that the execution time of this approach is orders of magnitude shorter than that of competitive approaches and that accuracy is comparable. The tool is presented to the public as a web application (url: https://ani.cebitec.uni-bielefeld.de/ake/, username: bmc, password: bmcbioinfo).
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 60
    Publication Date: 2014-12-15
    Description: Background: Next generation sequencing produces base calls with low quality scores that can affect the accuracy of identifying simple nucleotide variation calls, including single nucleotide polymorphisms and small insertions and deletions. Here we compare the effectiveness of two data preprocessing methods, masking and trimming, and the accuracy of simple nucleotide variation calls on whole-genome sequence data from Caenorhabditis elegans. Masking substitutes low quality base calls with 'N's (undetermined bases), whereas trimming removes low quality bases, resulting in shorter read lengths. Results: We demonstrate that masking is more effective than trimming in reducing the false-positive rate in single nucleotide polymorphism (SNP) calling. However, neither of the preprocessing methods affected the false-negative rate in SNP calling with statistical significance compared to the data analysis without preprocessing. False-positive and false-negative rates for small insertions and deletions did not show differences between masking and trimming. Conclusions: We recommend masking over trimming as a more effective preprocessing method for next generation sequencing data analysis, since masking reduces the false-positive rate in SNP calling without sacrificing the false-negative rate, although trimming is currently more commonly used in the field. The perl script for masking is available at http://code.google.com/p/subn/. The sequencing data used in the study were deposited in the Sequence Read Archive (SRX450968 and SRX451773).
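    A minimal Python sketch of the masking step, assuming Phred+33-encoded quality strings and a Q20 cutoff; read length is preserved by substituting 'N' rather than trimming.

```python
def mask_read(seq, qual, min_q=20):
    """Replace bases whose Phred+33 quality is below min_q with 'N'."""
    return "".join(
        base if (ord(q) - 33) >= min_q else "N"
        for base, q in zip(seq, qual)
    )

# '#' encodes Q2, well below the cutoff, so those bases become N.
print(mask_read("ACGTACGT", "IIII##II"))   # -> ACGTNNGT
```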
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 61
    Publication Date: 2014-12-01
Description: Background: The identification of new diagnostic or prognostic biomarkers is one of the main aims of clinical cancer research. Technologies like mass spectrometry are commonly used in proteomic research. Mass spectrometry signals show the proteomic profiles of the individuals under study at a given time. These profiles correspond to the recording of a large number of proteins, far larger than the number of individuals, and these variables supplement or complement classical clinical variables. The objective of this study is to evaluate and compare the predictive ability of new and existing models combining mass spectrometry data and classical clinical variables. This study was conducted in the context of binary prediction. Results: To achieve this goal, simulated data as well as a real dataset dedicated to the selection of proteomic markers of steatosis were used to evaluate the methods. The proposed methods meet the challenge of high-dimensional data and the selection of predictive markers by using penalization methods (Ridge, Lasso) and dimension reduction techniques (PLS), as well as a combination of both strategies through sparse PLS, in the context of binary class prediction. The methods were compared in terms of mean classification rate and their ability to select the true predictive values. These comparisons were made on clinical-only models, mass-spectrometry-only models and combined models. Conclusions: It was shown that models which combine both types of data can be more efficient than models that use only clinical or mass spectrometry data when the sample size of the dataset is large enough.
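As a rough illustration of the penalization side of this comparison, the sketch below fits an L1-penalized (Lasso-type) logistic regression on simulated clinical variables concatenated with high-dimensional spectral features; sparse PLS would require a dedicated package and is omitted, and all data and parameter choices are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
clinical = rng.normal(size=(n, 5))      # a few classical clinical variables
spectra = rng.normal(size=(n, 1000))    # many m/z intensities (p >> n)
y = (clinical[:, 0] + spectra[:, 0] + rng.normal(size=n) > 0).astype(int)

X = np.hstack([clinical, spectra])      # the "combined" model
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```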
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 62
    Publication Date: 2014-12-01
Description: Background: In order to extract meaningful information from electronic medical records, such as signs and symptoms, diagnoses, and treatments, it is important to take into account the contextual properties of the identified information: negation, temporality, and experiencer. Most work on the automatic identification of these contextual properties has been done on English clinical text. This study presents ContextD, an adaptation of the English ConText algorithm to the Dutch language, and a Dutch clinical corpus. We created a Dutch clinical corpus containing four types of anonymized clinical documents: entries from general practitioners, specialists' letters, radiology reports, and discharge letters. Using a Dutch list of medical terms extracted from the Unified Medical Language System, we identified medical terms in the corpus with exact matching. The identified terms were annotated for negation, temporality, and experiencer properties. To adapt the ConText algorithm, we translated English trigger terms to Dutch and added several general and document-specific enhancements, such as negation rules for general practitioners' entries and a regular-expression-based temporality module. Results: The ContextD algorithm utilized 41 unique triggers to identify the contextual properties in the clinical corpus. For the negation property, the algorithm obtained an F-score from 87% to 93% for the different document types. For the experiencer property, the F-score was 99% to 100%. For the historical and hypothetical values of the temporality property, F-scores ranged from 26% to 54% and from 13% to 44%, respectively. Conclusions: ContextD showed good performance in identifying negation and experiencer property values across all Dutch clinical document types. Accurate identification of the temporality property proved to be difficult and requires further work. The anonymized and annotated Dutch clinical corpus can serve as a useful resource for further algorithm development.
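The core ConText idea, that a trigger term negates concepts within a limited scope after it, can be sketched in a few lines. The Dutch triggers and the five-token scope below are illustrative assumptions, not ContextD's actual trigger list or rules:

```python
import re

NEGATION_TRIGGERS = {"geen", "niet", "zonder"}   # "no", "not", "without"

def negated_concepts(sentence, concepts, scope=5):
    """Return the concepts that fall inside a negation trigger's scope."""
    tokens = re.findall(r"\w+", sentence.lower())
    negated = set()
    for i, tok in enumerate(tokens):
        if tok in NEGATION_TRIGGERS:
            window = tokens[i + 1 : i + 1 + scope]
            negated |= {c for c in concepts if c in window}
    return negated

print(negated_concepts("Patient heeft geen koorts", {"koorts", "hoest"}))
# {'koorts'} -- "fever" lies inside the scope of the trigger "geen"
```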
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 63
    Publication Date: 2012-12-29
Description: Background: RNA interference (RNAi) is becoming an increasingly important and effective genetic tool for studying the function of target genes by suppressing specific genes of interest. This systems approach helps identify signaling pathways and cellular phase types by tracking intensity and/or morphological changes of cells. The traditional RNAi screening scheme, in which one siRNA is designed to knock down one specific mRNA target, needs a large library of siRNAs and turns out to be time-consuming and expensive. Results: In this paper, we propose a conceptual model, called compressed sensing RNAi (csRNAi), which employs unique combinations of small interfering RNAs (siRNAs) to knock down a much larger number of genes. This strategy is based on the fact that one gene can be partially bound by several siRNAs and, conversely, one siRNA can bind to a few genes with distinct binding affinities. This model constructs a many-to-many correspondence between siRNAs and their targets, with far fewer siRNAs than mRNA targets compared with the conventional scheme. Mathematically, this problem involves an underdetermined system of equations (linear or nonlinear), which is ill-posed in general. However, the recently developed compressed sensing (CS) theory can solve this problem. We present a mathematical model of the csRNAi system based on both CS theory and biological considerations. To build this model, we first search for nucleotide motifs in a target gene set. Then we propose a machine learning based method to find effective siRNAs using novel features, such as image features and speech features, to describe an siRNA sequence. Numerical simulations show that we can reduce the siRNA library to one third of that in the conventional scheme. In addition, the features used to describe siRNAs substantially outperform existing ones. Conclusions: This csRNAi system is very promising for saving both time and cost in large-scale RNAi screening experiments, which may benefit biological research with respect to cellular processes and pathways.
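The compressed-sensing step can be illustrated with a toy recovery problem: knockdown measurements y are observed through an underdetermined siRNA-gene affinity matrix A, and the sparse gene-effect vector is recovered by L1-regularized regression. Everything below is simulated and stands in for, rather than reproduces, the csRNAi model:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n_pools, n_genes = 30, 100          # far fewer measurements than genes
# random sparse binding pattern with per-pair affinities
A = rng.binomial(1, 0.1, size=(n_pools, n_genes)) \
    * rng.uniform(0.5, 1.0, size=(n_pools, n_genes))
x_true = np.zeros(n_genes)
x_true[[3, 42, 77]] = 1.0           # a sparse set of affected genes
y = A @ x_true                      # observed knockdown signal per pool

x_hat = Lasso(alpha=0.01, positive=True).fit(A, y).coef_
print(np.nonzero(x_hat > 0.1)[0])   # should be close to [3, 42, 77]
```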
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 64
    Publication Date: 2012-12-29
Description: Background: Copy number variations (CNVs) are genomic structural variants that are found in healthy populations and have been observed to be associated with disease susceptibility. Existing methods for CNV detection are often performed on a sample-by-sample basis, which is not ideal for large datasets where common CNVs must be estimated by comparing the frequency of CNVs in the individual samples. Here we describe a simple and novel approach to locate genome-wide CNVs common to a specific population, using human ancestry as the phenotype. Results: We utilized our previously published Genome Alteration Detection Analysis (GADA) algorithm to identify common ancestry CNVs (caCNVs) and built a caCNV model to predict population structure. We identified a 73-caCNV signature using a training set of 225 healthy individuals from European, Asian, and African ancestry. The signature was validated on an independent test set of 300 individuals with similar ancestral background. The error rate in predicting ancestry in this test set was 2% using the 73-caCNV signature. Among the caCNVs identified, several were previously confirmed experimentally to vary by ancestry. Our signature also contains a caCNV region with a single microRNA (MIR270), which represents the first reported variation of microRNA by ancestry. Conclusions: We developed a new methodology to identify common CNVs and demonstrated its performance by building a caCNV signature to predict human ancestry with high accuracy. The utility of our approach could be extended to large case-control studies to identify CNV signatures for other phenotypes such as disease susceptibility and drug response.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 65
    Publication Date: 2013-01-17
Description: Background: Cross-species comparisons of gene neighborhoods (also called genomic contexts) in microbes may provide insight into determining functionally related or co-regulated sets of genes, suggest annotations of previously un-annotated genes, and help to identify horizontal gene transfer events across microbial species. Existing tools to investigate genomic contexts, however, lack features for dynamically comparing and exploring genomic regions from multiple species. As DNA sequencing technologies improve and the number of whole sequenced microbial genomes increases, a user-friendly genome context comparison platform designed for use by a broad range of users promises to satisfy a growing need in the biological community. Results: Here we present JContextExplorer: a tool that organizes genomic contexts into branching diagrams. We implement several alternative context-comparison and tree rendering algorithms, and allow for easy transitioning between different clustering algorithms. To facilitate genomic context analysis, our tool implements GUI features, such as text search filtering, point-and-click interrogation of individual contexts, and genomic visualization via a multi-genome browser. We demonstrate a use case of our tool by attempting to resolve annotation ambiguities between two highly homologous yet functionally distinct genes in a set of 22 alpha and gamma proteobacteria. Conclusions: JContextExplorer should enable a broad range of users to analyze and explore genomic contexts. The program has been tested on Windows, Mac, and Linux operating systems, and is implemented both as an executable JAR file and as a Java WebStart application. Program executables, source code, and documentation are available at http://www.bme.ucdavis.edu/facciotti/resources_data/software/
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 66
    Publication Date: 2013-01-17
    Description: Background: Negation occurs frequently in scientific literature, especially in biomedical literature. It has previously been reported that around 13% of sentences found in biomedical research articles contain negation. Historically, the main motivation for identifying negated events has been to ensure their exclusion from lists of extracted interactions. However, recently, there has been a growing interest in negative results, which has resulted in negation detection being identified as a key challenge in biomedical relation extraction. In this article, we focus on the problem of identifying negated bio-events, given gold standard event annotations. Results: We have conducted a detailed analysis of three open access bio-event corpora containing negation information (i.e., GENIA Event, BioInfer and BioNLP'09 ST), and have identified the main types of negated bio-events. We have analysed the key aspects of a machine learning solution to the problem of detecting negated events, including selection of negation cues, feature engineering and the choice of learning algorithm. Combining the best solutions for each aspect of the problem, we propose a novel framework for the identification of negated bio-events. We have evaluated our system on each of the three open access corpora mentioned above. The performance of the system significantly surpasses the best results previously reported on the BioNLP'09 ST corpus, and achieves even better results on the GENIA Event and BioInfer corpora, both of which contain more varied and complex events. Conclusions: Recently, in the field of biomedical text mining, the development and enhancement of event-based systems has received significant interest. The ability to identify negated events is a key performance element for these systems. We have conducted the first detailed study on the analysis and identification of negated bio-events. Our proposed framework can be integrated with state-of-the-art event extraction systems. The resulting systems will be able to extract bio-events with attached polarities from textual documents, which can serve as the foundation for more elaborate systems that are able to detect mutually contradicting bio-events.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 67
    Publication Date: 2013-01-17
Description: Background: Detection of low abundance metabolites is important for de novo mapping of metabolic pathways related to diet, microbiome or environmental exposures. Multiple algorithms are available to extract m/z features from liquid chromatography-mass spectral data in a conservative manner, which tends to preclude detection of low abundance chemicals and chemicals found in small subsets of samples. The present study provides software to enhance such algorithms for feature detection, quality assessment, and annotation. Results: xMSanalyzer is a set of utilities for automated processing of metabolomics data. The utilities can be classified into four main modules to: 1) improve feature detection for replicate analyses by systematic re-extraction with multiple parameter settings and data merger to optimize the balance between sensitivity and reliability, 2) evaluate sample quality and feature consistency, 3) detect feature overlap between datasets, and 4) characterize high-resolution m/z matches to small molecule metabolites and biological pathways using multiple chemical databases. The package was tested with plasma samples and shown to more than double the number of features extracted while improving quantitative reliability of detection. MS/MS analysis of a random subset of peaks that were exclusively detected using xMSanalyzer confirmed that the optimization scheme improves detection of real metabolites. Conclusions: xMSanalyzer is a package of utilities for data extraction, quality control assessment, detection of overlapping and unique metabolites in multiple datasets, and batch annotation of metabolites. The program was designed to integrate with existing packages such as apLCMS and XCMS, but the framework can also be used to enhance data extraction for other LC/MS data software.
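One building block of such annotation, matching measured m/z features against database masses within a ppm tolerance, can be sketched as follows (the tolerance and the two-entry "database" are illustrative):

```python
def ppm_match(mz, database, tol_ppm=10.0):
    """Return database entries whose mass lies within tol_ppm of mz."""
    return [(name, mass) for name, mass in database
            if abs(mz - mass) / mass * 1e6 <= tol_ppm]

metabolites = [("glucose [M+H]+", 181.0707), ("citrate [M-H]-", 191.0197)]
print(ppm_match(181.0712, metabolites))   # glucose matches at ~2.8 ppm
```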
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 68
    Publication Date: 2013-01-17
Description: Background: Kernel-based classification is the current state-of-the-art for extracting pairs of interacting proteins (PPIs) from free text. Various proposals have been put forward, which diverge especially in the specific kernel function, the type of input representation, and the feature sets. These proposals are regularly compared to each other regarding their overall performance on different gold standard corpora, but little is known about their respective performance on the instance level. Results: We report on a detailed analysis of the shared characteristics and the differences between 13 current methods using five PPI corpora. We identified a large number of rather difficult (misclassified by most methods) and easy (correctly classified by most methods) PPIs. We show that kernels using the same input representation perform similarly on these pairs and that building ensembles using dissimilar kernels leads to significant performance gains. However, our analysis also reveals that characteristics shared between difficult pairs are few, which lowers the hope that new methods, if built along the same lines as current ones, will deliver breakthroughs in extraction performance. Conclusions: Our experiments show that current methods do not seem to do very well in capturing the shared characteristics of positive PPI pairs, which must also be attributed to the heterogeneity of the (still very few) available corpora. Our analysis suggests that performance improvements should be sought in novel feature sets rather than in novel kernel functions.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 69
    Publication Date: 2013-01-17
    Description: Background: Structure-based clustering is commonly used to identify correct protein folds among candidate folds (also called decoys) generated by protein structure prediction programs. However, traditional clustering methods exhibit a poor runtime performance on large decoy sets. We hypothesized that a more efficient "partial" clustering approach in combination with an improved scoring scheme could significantly improve both the speed and performance of existing candidate selection methods. Results: We propose a new scheme that performs rapid but incomplete clustering on protein decoys. Our method detects structurally similar decoys (measured using either Calpha RMSD or GDT-TS score) and extracts representatives from them without assigning every decoy to a cluster. We integrated our new clustering strategy with several different scoring functions to assess both the performance and speed in identifying correct or near-correct folds. Experimental results on 35 Rosetta decoy sets and 40 I-TASSER decoy sets show that our method can improve the correct fold detection rate as assessed by two different quality criteria. This improvement is significantly better than two recently published clustering methods, Durandal and Calibur-lite. Speed and efficiency testing shows that our method can handle much larger decoy sets and is up to 22 times faster than Durandal and Calibur-lite. Conclusions: The new method, named HS-Forest, avoids the computationally expensive task of clustering every decoy, yet still allows superior correct-fold selection. Its improved speed, efficiency and decoy-selection performance should enable structure prediction researchers to work with larger decoy sets and significantly improve their ab initio structure prediction performance.
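The flavor of "partial" clustering can be conveyed with a greedy leader-selection sketch: a decoy becomes a new representative only if it is farther than a cutoff from every existing representative, so no all-vs-all clustering is performed. This illustrates the general idea rather than HS-Forest's actual algorithm, and a scalar toy distance stands in for RMSD:

```python
def leader_representatives(decoys, distance, cutoff):
    """Greedy one-pass selection of cluster representatives."""
    reps = []
    for d in decoys:
        if all(distance(d, r) > cutoff for r in reps):
            reps.append(d)
    return reps

scores = [1.0, 1.1, 5.0, 5.2, 9.7]          # toy 1-D "structures"
print(leader_representatives(scores, lambda a, b: abs(a - b), 1.0))
# [1.0, 5.0, 9.7] -- near-duplicates are absorbed by their leaders
```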
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 70
    Publication Date: 2013-01-17
    Description: Background: Traditional methods for computational motif discovery often suffer from poor performance. In particular, methods that search for sequence matches to known binding motifs tend to predict many non-functional binding sites because they fail to take into consideration the biological state of the cell. In recent years, genome-wide studies have generated a lot of data that has the potential to improve our ability to identify functional motifs and binding sites, such as information about chromatin accessibility and epigenetic states in different cell types. However, it is not always trivial to make use of this data in combination with existing motif discovery tools, especially for researchers who are not skilled in bioinformatics programming. Results: Here we present MotifLab, a general workbench for analysing regulatory sequence regions and discovering transcription factor binding sites and cis-regulatory modules. MotifLab supports comprehensive motif discovery and analysis by allowing users to integrate several popular motif discovery tools as well as different kinds of additional information, including phylogenetic conservation, epigenetic marks, DNase hypersensitive sites, ChIP-Seq data, positional binding preferences of transcription factors, transcription factor interactions and gene expression. MotifLab offers several data-processing operations that can be used to create, manipulate and analyse data objects, and complete analysis workflows can be constructed and automatically executed within MotifLab, including graphical presentation of the results. Conclusions: We have developed MotifLab as a flexible workbench for motif analysis in a genomic context. The flexibility and effectiveness of this workbench has been demonstrated on selected test cases, in particular two previously published benchmark data sets for single motifs and modules, and a realistic example of genes responding to treatment with forskolin. MotifLab is freely available at http://www.motiflab.org.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 71
    Publication Date: 2013-01-17
Description: Background: Recent studies of transcription activator-like (TAL) effector domains fused to nucleases (TALENs) demonstrate enormous potential for genome editing. Effective design of TALENs requires a combination of selecting appropriate genetic features, finding pairs of binding sites based on a consensus sequence, and, in some cases, identifying endogenous restriction sites for downstream molecular genetic applications. Results: We present the web-based program Mojo Hand for designing TAL and TALEN constructs for genome editing applications (www.talendesign.org). We describe the algorithm and its implementation. The features of Mojo Hand include (1) automatic download of genomic data from the National Center for Biotechnology Information, (2) analysis of any DNA sequence to reveal pairs of binding sites based on a user-defined template, (3) selection of restriction-enzyme recognition sites in the spacer between the TAL monomer binding sites, including options for the selection of restriction enzyme suppliers, and (4) output files designed for subsequent TALEN construction using the Golden Gate assembly method. Conclusions: Mojo Hand enables the rapid identification of TAL binding sites for use in TALEN design. The assembly of TALEN constructs is also simplified by using the TAL-site prediction program in conjunction with a spreadsheet-based aid for managing reagent concentrations and TALEN formulation. Mojo Hand enables scientists to more rapidly deploy TALENs for genome editing applications.
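The essence of the binding-site search can be sketched as follows: TAL effector sites conventionally begin with a 5' T, so a TALEN pair corresponds to a forward-strand site plus a reverse-strand site (which appears on the forward strand as a segment ending in A) separated by a spacer. Site and spacer lengths below are illustrative assumptions, not Mojo Hand's defaults:

```python
import re

def find_talen_pairs(seq, site_len=15, spacer=(14, 18)):
    """Scan the forward strand for candidate TALEN half-site pairs."""
    pairs = []
    for m in re.finditer("T", seq):          # forward half-sites start with T
        start = m.start()
        left = seq[start:start + site_len]
        if len(left) < site_len:
            continue
        for gap in range(spacer[0], spacer[1] + 1):
            r0 = start + site_len + gap
            right = seq[r0:r0 + site_len]
            if len(right) == site_len and right.endswith("A"):
                pairs.append((start, left, seq[start + site_len:r0], right))
    return pairs

demo = "TGGATCCAGGATTACAGGTACCTAGGATCGATCCTAGGAACCTTGGAA"
for start, left, spacer_seq, right in find_talen_pairs(demo)[:1]:
    print(start, left, spacer_seq, right)
```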
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 72
    Publication Date: 2013-01-17
    Description: Background: The digitization of biodiversity data is leading to the widespread application of taxon names that are superfluous, ambiguous or incorrect, resulting in mismatched records and inflated species numbers. The ultimate consequences of misspelled names and bad taxonomy are erroneous scientific conclusions and faulty policy decisions. The lack of tools for correcting this 'names problem' has become a fundamental obstacle to integrating disparate data sources and advancing the progress of biodiversity science. Results: The TNRS, or Taxonomic Name Resolution Service, is an online application for automated and user-supervised standardization of plant scientific names. The TNRS builds upon and extends existing open-source applications for name parsing and fuzzy matching. Names are standardized against multiple reference taxonomies, including the Missouri Botanical Garden's Tropicos database. Capable of processing thousands of names in a single operation, the TNRS parses and corrects misspelled names and authorities, standardizes variant spellings, and converts nomenclatural synonyms to accepted names. Family names can be included to increase match accuracy and resolve many types of homonyms. Partial matching of higher taxa combined with extraction of annotations, accession numbers and morphospecies allows the TNRS to standardize taxonomy across a broad range of active and legacy datasets. Conclusions: We show how the TNRS can resolve many forms of taxonomic semantic heterogeneity, correct spelling errors and eliminate spurious names. As a result, the TNRS can aid the integration of disparate biological datasets. Although the TNRS was developed to aid in standardizing plant names, its underlying algorithms and design can be extended to all organisms and nomenclatural codes. The TNRS is accessible via a web interface at http://tnrs.iplantcollaborative.org/ and as a RESTful web service and application programming interface. Source code is available at https://github.com/iPlantCollaborativeOpenSource/TNRS/.
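The fuzzy-matching step at the heart of name resolution can be illustrated with Python's difflib standing in for the TNRS's dedicated parsing and matching engine; the reference list and cutoff are illustrative:

```python
import difflib

reference = ["Quercus robur", "Quercus rubra", "Fagus sylvatica"]

def resolve(name, cutoff=0.8):
    """Return the closest reference name above the similarity cutoff."""
    hits = difflib.get_close_matches(name, reference, n=1, cutoff=cutoff)
    return hits[0] if hits else None

print(resolve("Quercus robor"))   # -> 'Quercus robur'
```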
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 73
    Publication Date: 2013-01-17
Description: Background: Gene fusions are the result of chromosomal aberrations and encode chimeric RNAs (fusion transcripts) that play an important role in cancer genesis. Recent advances in high-throughput transcriptome sequencing have given rise to computational methods for new fusion discovery. The ability to simulate fusion transcripts is essential for testing and improving those tools. Results: To address this need, we developed FUSIM (FUsion SIMulator), a software tool for simulating fusion transcripts. The simulation of events known to create fusion genes and their resulting chimeric proteins is supported, including inter-chromosome translocation, trans-splicing, complex chromosomal rearrangements, and transcriptional read-through events.
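The most basic event such a simulator supports, joining the 5' portion of one transcript to the 3' portion of another at chosen breakpoints, can be sketched as follows (a toy illustration that ignores exon structure, not FUSIM's implementation):

```python
def simulate_fusion(tx_a, tx_b, break_a, break_b):
    """Return a chimeric transcript: tx_a[:break_a] + tx_b[break_b:]."""
    return tx_a[:break_a] + tx_b[break_b:]

tx_a = "ATGGCCAAGTTT"     # toy donor transcript
tx_b = "CCCGGGTTTAAA"     # toy acceptor transcript
print(simulate_fusion(tx_a, tx_b, 6, 3))   # ATGGCC + GGGTTTAAA
```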
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 74
    Publication Date: 2013-01-17
Description: Background: A standard graphical notation is essential to facilitate exchange of network representations of biological processes. Towards this end, the Systems Biology Graphical Notation (SBGN) has been proposed, and it is already supported by a number of tools. However, support for SBGN in Cytoscape, one of the most widely used platforms in biology to visualise and analyse networks, is limited, and in particular it is not possible to import SBGN diagrams. Results: We have developed CySBGN, a Cytoscape plug-in that extends the use of Cytoscape visualisation and analysis features to SBGN maps. CySBGN adds support for Cytoscape users to visualize any of the three complementary SBGN languages: Process Description, Entity Relationship, and Activity Flow. The interoperability with other tools (CySBML plug-in and Systems Biology Format Converter) was also established, allowing automated generation of SBGN diagrams based on previously imported SBML models. The plug-in was tested using a suite of 53 different test cases that covers almost all possible entities, shapes, and connections. A rendering comparison with other tools that support SBGN was performed. To illustrate the interoperability with other Cytoscape functionalities, we present two analysis examples, shortest path calculation and motif identification in a metabolic network. Conclusions: CySBGN imports, modifies and analyzes SBGN diagrams in Cytoscape, and thus allows the application of the large palette of tools and plug-ins in this platform to networks and pathways in SBGN format.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 75
    Publication Date: 2013-01-18
Description: Background: Amyloids are proteins capable of forming fibrils. Many of them underlie serious diseases, like Alzheimer's disease, and the number of amyloid-associated diseases is constantly increasing. Recent studies indicate that amyloidogenic properties can be associated with short segments of amino acids, which transform the structure when exposed. A few hundred such peptides have been found experimentally, but experimental testing of all possible amino acid combinations is currently not feasible; instead, they can be predicted by computational methods. The 3D profile is a physicochemical method that has generated the largest such dataset, ZipperDB. However, it is computationally very demanding. Here, we show that dataset generation can be accelerated. Two methods to increase the classification efficiency of amyloidogenic candidates are presented and tested: simplified 3D profile generation and machine learning methods. Results: We generated a new dataset of hexapeptides using a modified 3D profile algorithm, which showed very good classification overlap with ZipperDB (93.5%). The new part of our dataset contains 1779 segments, with 204 classified as amyloidogenic. The dataset of 6-residue sequences with their binary classification, based on the energy of the segment, was used for training machine learning methods. A separate set of sequences from ZipperDB was used as a test set. The most effective methods were Alternating Decision Tree and Multilayer Perceptron. Both methods obtained an area under the ROC curve of 0.96, accuracy of 91%, true positive rate of ca. 78%, and true negative rate of 95%. A few other machine learning methods also achieved good performance. The computational time was reduced from 18-20 CPU-hours (full 3D profile) to 0.5 CPU-hours (simplified 3D profile) to seconds (machine learning). Conclusions: We showed that the simplified profile generation method does not introduce an error with regard to the original method, while increasing the computational efficiency. Our new dataset proved representative enough to use simple statistical methods for testing amyloidogenicity based only on six-letter sequences. Statistical machine learning methods such as Alternating Decision Tree and Multilayer Perceptron can replace the energy-based classifier, with the advantage of very significantly reduced computational time and simplicity of analysis. Additionally, a decision tree provides a set of very easily interpretable rules.
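The learning setup can be sketched with one-hot-encoded hexapeptides fed to a small neural network; scikit-learn's MLP stands in for the Weka Multilayer Perceptron used in the paper, and the four peptides and labels below are toy examples (VQIVYK and KLVFFA are known amyloidogenic segments):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

AA = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(hexapeptide):
    """Encode a 6-residue peptide as a flat 6 x 20 one-hot vector."""
    vec = np.zeros((6, len(AA)))
    for i, aa in enumerate(hexapeptide):
        vec[i, AA.index(aa)] = 1.0
    return vec.ravel()

peptides = ["VQIVYK", "KLVFFA", "GSGSGS", "PEPEPE"]   # toy training set
labels = [1, 1, 0, 0]             # 1 = amyloidogenic (illustrative labels)
X = np.array([one_hot(p) for p in peptides])
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X, labels)
print(clf.predict([one_hot("VQIVYK")]))
```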
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 76
    Publication Date: 2013-01-18
Description: Background: The Sequence Read Archive (SRA) is the largest public repository of sequencing data from the next generation of sequencing platforms, including Illumina (Genome Analyzer, HiSeq, MiSeq, etc.), the Roche 454 GS System, the Applied Biosystems SOLiD System, Helicos Heliscope, PacBio RS, and others. Results: SRAdb is an attempt to make querying the metadata associated with SRA submissions, studies, samples, experiments and runs more robust and precise, and to make access to sequencing data in the SRA easier. We have parsed all the SRA metadata into a SQLite database that is routinely updated and can be easily distributed. The SRAdb R/Bioconductor package then utilizes this SQLite database for querying and accessing metadata. Full text search functionality makes querying metadata very flexible and powerful. Fastq files associated with query results can be downloaded easily for local analysis. The package also includes an interface from R to a popular genome browser, the Integrative Genomics Viewer. Conclusions: The SRAdb Bioconductor package provides a convenient and integrated framework to query and access SRA metadata quickly and powerfully from within R.
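The kind of metadata query SRAdb enables can be sketched directly against the SQLite file it distributes, here with Python's sqlite3 in place of the R interface. The file name follows SRAdb's conventions, but the table and column names should be treated as assumptions:

```python
import sqlite3

# assumes SRAmetadb.sqlite has been downloaded locally beforehand
conn = sqlite3.connect("SRAmetadb.sqlite")
rows = conn.execute(
    "SELECT run_accession, study_title FROM sra "
    "WHERE study_title LIKE ? LIMIT 5",
    ("%transcriptome%",),
).fetchall()
for run, title in rows:
    print(run, title)
conn.close()
```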
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 77
    Publication Date: 2013-01-19
Description: Background: Protein pairs that have the same secondary structure packing arrangement but different topologies have attracted much attention in terms of both the evolution and the physical chemistry of protein structures. Further investigation of such protein relationships would give us a hint as to how proteins can change their fold in the course of evolution, as well as an insight into the physico-chemical properties of secondary structure packing. For this purpose, highly accurate sequence order independent structure comparison methods are needed. Results: We have developed a novel protein structure alignment algorithm, MICAN (a structure alignment algorithm that can handle Multiple-chain complexes, Inverse direction of secondary structures, Ca-only models, Alternative alignments, and Non-sequential alignments). The algorithm was designed to identify the best structural alignment between protein pairs by disregarding the connectivity between secondary structure elements (SSEs). One of the key features of the algorithm is the use of a multiple vector representation for each SSE, which enables us to correctly treat the bent or twisted nature of long SSEs. We compared MICAN with 9 other publicly available structure alignment programs, using both reference-dependent and reference-independent evaluation methods on a variety of benchmark test sets which include both sequential and non-sequential alignments. We show that MICAN outperforms the other existing methods in reproducing reference alignments of non-sequential test sets. Further, although MICAN does not specialize in sequential structure alignment, it showed top-level performance on the sequential test sets. We also show that MICAN is the fastest non-sequential structure alignment program among all the programs we examined here. Conclusions: MICAN is the fastest and the most accurate of the non-sequential alignment programs we examined. These results suggest that MICAN is a highly effective tool for automatically detecting non-trivial structural relationships of proteins, such as circular permutations and segment-swapping, many of which have so far been identified manually by human experts. The source code of MICAN is freely downloadable at http://www.tbp.cse.nagoya-u.ac.jp/MICAN.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 78
    Publication Date: 2013-01-20
    Description: Background: Maximum Likelihood (ML)-based phylogenetic inference using Felsenstein's pruning algorithm is a standard method for estimating the evolutionary relationships amongst a set of species based on DNA sequence data, and is used in popular applications such as RAxML, PHYLIP, GARLI, BEAST, and MrBayes. The Phylogenetic Likelihood Function (PLF) and its associated scaling and normalization steps comprise the computational kernel for these tools. These computations are data intensive but contain fine grain parallelism that can be exploited by coprocessor architectures such as FPGAs and GPUs. A general purpose API called BEAGLE has recently been developed that includes optimized implementations of Felsenstein's pruning algorithm for various data parallel architectures. In this paper, we extend the BEAGLE API to a multiple Field Programmable Gate Array (FPGA)-based platform called the Convey HC-1. Results: The core calculation of our implementation, which includes both the phylogenetic likelihood function (PLF) and the tree likelihood calculation, has an arithmetic intensity of 130 floating-point operations per 64 bytes of I/O, or 2.03 ops/byte. Its performance can thus be calculated as a function of the host platform's peak memory bandwidth and the implementation's memory efficiency, as 2.03 x peak bandwidth x memory efficiency. Our FPGA-based platform has a peak bandwidth of 76.8 GB/s and our implementation achieves a memory efficiency of approximately 50%, which gives an average throughput of 78 Gflops. This represents a ~40X speedup when compared with BEAGLE's CPU implementation on a dual Xeon 5520 and 3X speedup versus BEAGLE's GPU implementation on a Tesla T10 GPU for very large data sizes. The power consumption is 92 W, yielding a power efficiency of 1.7 Gflops per Watt. Conclusions: The use of data parallel architectures to achieve high performance for likelihood-based phylogenetic inference requires high memory bandwidth and a design methodology that emphasizes high memory efficiency. To achieve this objective, we integrated 32 pipelined processing elements (PEs) across four FPGAs. For the design of each PE, we developed a specialized synthesis tool to generate a floating-point pipeline with resource and throughput constraints to match the target platform. We have found that using low-latency floating-point operators can significantly reduce FPGA area and still meet timing requirement on the target platform. We found that this design methodology can achieve performance that exceeds that of a GPU-based coprocessor.
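The stated throughput follows directly from the model; a quick check of the arithmetic:

```python
# Verifying the throughput model quoted above:
# throughput = arithmetic intensity x peak bandwidth x memory efficiency.
ops_per_byte = 130 / 64        # ~2.03 flops per byte of I/O
peak_bw_gbs = 76.8             # GB/s peak bandwidth on the Convey HC-1
efficiency = 0.50              # measured memory efficiency

gflops = ops_per_byte * peak_bw_gbs * efficiency
print(f"{gflops:.0f} Gflops")  # ~78 Gflops, matching the reported figure
```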
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 79
    Publication Date: 2013-01-18
Description: Background: Most phylogeny analysis methods based on molecular sequences use multiple alignment, where the quality of the alignment, which depends on the alignment parameters, determines the accuracy of the resulting trees. Different parameter combinations chosen for the multiple alignment may result in different phylogenies. A new non-alignment-based approach, the Relative Complexity Measure (RCM), has been introduced to tackle this problem and proven to work on fungi and mitochondrial DNA. Results: In this work, we present an application of the RCM method to reconstruct robust phylogenetic trees using sequence data for the genus Galanthus obtained from different regions in Turkey. Phylogenies were analyzed using nuclear and chloroplast DNA sequences. Results showed that the tree obtained from nuclear ribosomal RNA gene sequences was more robust, while the tree obtained from the chloroplast DNA showed a higher degree of variation. Conclusions: Phylogenies generated by the Relative Complexity Measure were found to be robust, and the results of RCM were more reliable than those of the compared techniques. In particular, RCM seems to be a reasonable way to overcome MSA-based problems and a good alternative to MSA-based phylogenetic analysis. We believe our method will become a mainstream phylogeny construction method, especially for highly variable sequence families where the accuracy of the MSA heavily depends on the alignment parameters.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 80
    Publication Date: 2013-02-22
Description: Background: The active subnetwork learning problem involves finding subnetworks of a bio-molecular network that are active in a particular condition. Many approaches integrate observation data (e.g., gene expression) with the network topology to find candidate subnetworks. Increasingly, pathway databases contain additional annotation information that can be mined to improve prediction accuracy, such as interaction mechanism annotations (e.g., transcription, microRNA, cleavage). We introduce a mechanism-based approach to active subnetwork recovery which exploits such annotations. We suggest that neighboring interactions in a network tend to be co-activated in a way that depends on the "correlation" of their mechanism annotations; e.g., neighboring phosphorylation and de-phosphorylation interactions may be more likely to be co-activated than neighboring phosphorylation and covalent bonding interactions. Results: Our method iteratively learns the mechanism correlations and finds the most likely active subnetwork. We use a probabilistic graphical model with a Markov Random Field component, which creates dependencies between the states (active or non-active) of neighboring interactions and incorporates a mechanism-based component into the objective function. We apply a heuristic EM-based algorithm suited to the problem. We validated our method's performance, using simulated data in networks downloaded from GeneGO, against the same approach without the mechanism-based component and against two other existing methods, assessing how correctly each recovers (1) the true interaction states and (2) global network properties of the original network. We applied our method to networks generated from time-course gene expression studies in angiogenesis and lung organogenesis and validated the findings from a biological perspective against current literature. Discussion: The advantage of our mechanism-based approach is best seen in networks composed of connected regions with a large number of interactions annotated with a subset of mechanisms, e.g., a regulatory region of transcription interactions, or a cleavage cascade region. When applied to real datasets, our method recovered novel and biologically meaningful putative interactions, e.g., interactions from an integrin signaling pathway using the angiogenesis dataset, and a group of regulatory microRNA interactions in an organogenesis network.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 81
    Publication Date: 2013-02-22
Description: Background: For the last 25 years, species delimitation in prokaryotes (Archaea and Bacteria) was to a large extent based on DNA-DNA hybridization (DDH), a tedious lab procedure designed in the early 1970s that served its purpose astonishingly well in the absence of deciphered genome sequences. With the rapid progress in genome sequencing, the time has come to directly use the now available and easy to generate genome sequences for the delimitation of species. GBDP (Genome Blast Distance Phylogeny) infers genome-to-genome distances between pairs of entirely or partially sequenced genomes, a digital, highly reliable estimator for the relatedness of genomes. Its application as an in-silico replacement for DDH was recently introduced. The main challenge in the implementation of such an application is to produce digital DDH values that mimic the wet-lab DDH values as closely as possible, to ensure consistency in the prokaryotic species concept. Results: Correlation and regression analyses were used to determine the best-performing methods and the most influential parameters. GBDP was further enriched with a set of new features, such as confidence intervals for intergenomic distances obtained via resampling or via the statistical models for DDH prediction, and an additional family of distance functions. As in previous analyses, GBDP obtained the highest agreement with wet-lab DDH among all tested methods, but improved models led to a further increase in the accuracy of DDH prediction. Confidence intervals yielded stable results when inferred from the statistical models, whereas those obtained via resampling showed marked differences between the underlying distance functions. Conclusions: Despite the high accuracy of GBDP-based DDH prediction, inferences from limited empirical data are always associated with a certain degree of uncertainty. It is thus crucial to enrich in-silico DDH replacements with confidence-interval estimation, enabling the user to statistically evaluate the outcomes. Such methodological advancements, easily accessible through the web service at http://ggdc.dsmz.de, are crucial steps towards a consistent and truly genome sequence-based classification of microorganisms.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 82
    Publication Date: 2013-02-23
Description: Background: Population stratification is a systematic difference in allele frequencies between subpopulations. This can lead to spurious association findings in case-control genome-wide association studies (GWASs) used to identify single nucleotide polymorphisms (SNPs) associated with disease-linked phenotypes. Methods such as self-declared ancestry, ancestry informative markers, genomic control, structured association, and principal component analysis are used to assess and correct for population stratification, but each has limitations. We provide an alternative technique to address population stratification. Results: We propose a novel machine learning method, ETHNOPRED, which uses the genotype and ethnicity data from the HapMap project to learn ensembles of disjoint decision trees capable of accurately predicting an individual's continental and sub-continental ancestry. To predict an individual's continental ancestry, ETHNOPRED produced an ensemble of 3 decision trees involving a total of 10 SNPs, with a 10-fold cross-validation accuracy of 100% on the HapMap II dataset. We extended this model to 29 disjoint decision trees over 149 SNPs, and showed that this ensemble has an accuracy of at least 99.9%, even if some of those 149 SNP values are missing. On an independent dataset, predominantly of Caucasian origin, our continental classifier showed 96.8% accuracy and improved genomic control's lambda from 1.22 to 1.11. We next used the HapMap III dataset to learn classifiers to distinguish European subpopulations (North-Western vs. Southern), East Asian subpopulations (Chinese vs. Japanese), African subpopulations (Eastern vs. Western), North American subpopulations (European vs. Chinese vs. African vs. Mexican vs. Indian), and Kenyan subpopulations (Luhya vs. Maasai). In these cases, ETHNOPRED produced ensembles of 3, 39, 21, 11, and 25 disjoint decision trees, involving 31, 502, 526, 242 and 271 SNPs, respectively, with 10-fold cross-validation accuracies of 86.5% +/- 2.4%, 95.6% +/- 3.9%, 95.6% +/- 2.1%, 98.3% +/- 2.0%, and 95.9% +/- 1.5%. However, ETHNOPRED was unable to produce a classifier that can accurately distinguish Chinese in Beijing vs. Chinese in Denver. Conclusions: ETHNOPRED is a novel technique for producing classifiers that can identify an individual's continental and sub-continental heritage, based on a small number of SNPs. We show that its learned classifiers are simple, cost-efficient, accurate, transparent, flexible, fast, applicable to large-scale GWASs, and robust to missing values.
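The ensemble structure, small decision trees over disjoint SNP subsets voting on ancestry, can be sketched as follows; genotypes are coded 0/1/2 and the data are simulated, not HapMap:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n, n_snps = 300, 30
X = rng.integers(0, 3, size=(n, n_snps))                # 0/1/2 genotypes
y = (X[:, 0] + X[:, 10] + X[:, 20] > 3).astype(int)     # toy "ancestry"

blocks = [range(0, 10), range(10, 20), range(20, 30)]   # disjoint SNP sets
trees = [DecisionTreeClassifier(max_depth=3, random_state=0)
         .fit(X[:, b], y) for b in blocks]

# each tree votes; the majority vote is the ensemble prediction
votes = np.array([t.predict(X[:, b]) for t, b in zip(trees, blocks)])
majority = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy:", (majority == y).mean())
```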
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 83
    Publication Date: 2013-02-24
Description: Background: PAM, a nearest shrunken centroid method (NSC), is a popular classification method for high-dimensional data. ALP and AHP are NSC algorithms that were proposed to improve upon PAM. The NSC methods base their classification rules on shrunken centroids; in practice, the amount of shrinkage is estimated by minimizing the overall cross-validated (CV) error rate. Results: We show that when data are class-imbalanced, the three NSC classifiers are biased towards the majority class. The bias is larger when the number of variables or the class imbalance is larger and/or the differences between classes are smaller. To diminish the class-imbalance problem of the NSC classifiers, we propose to estimate the amount of shrinkage by maximizing the CV geometric mean of the class-specific predictive accuracies (g-means). Conclusions: The results obtained on simulated and real high-dimensional class-imbalanced data show that our approach outperforms the currently used strategy based on the minimization of the overall error rate when NSC classifiers are biased towards the majority class. The number of variables included in the NSC classifiers when using our approach is much smaller than with the original approach. This result is supported by experiments on simulated and real high-dimensional class-imbalanced data.
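The proposed tuning criterion is easy to state: instead of minimizing the overall error, choose the shrinkage that maximizes the geometric mean of the class-specific accuracies. The sketch below shows why this matters under class imbalance, where a majority-class predictor scores 90% overall accuracy but a g-mean of zero:

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of the per-class predictive accuracies."""
    classes = np.unique(y_true)
    accs = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.prod(accs) ** (1.0 / len(classes)))

y_true = np.array([0] * 90 + [1] * 10)   # class-imbalanced labels
y_maj = np.zeros(100, dtype=int)         # always predict the majority class
print(g_mean(y_true, y_maj))             # 0.0, despite 90% overall accuracy
```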
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 84
    Publication Date: 2013-02-24
Description: Background: Worldwide structural genomics projects continue to release new protein structures at an unprecedented pace, so far nearly 6000, but only about 60% of these proteins have any sort of functional annotation. Results: We explored a range of features that can be used for the prediction of functional residues given a known three-dimensional structure. These features include various centrality measures of nodes in graphs of interacting residues: closeness, betweenness and PageRank centrality. We also analyzed the distance of functional amino acids to the general center of mass (GCM) of the structure, relative solvent accessibility (RSA), and the use of relative entropy as a measure of sequence conservation. From the selected features, neural networks were trained to identify catalytic residues. We found that using distance to the GCM together with amino acid type provides a good discriminant function when combined independently with sequence conservation. Using an independent test set of 29 annotated protein structures, the method returned 411 of the initial 9262 residues as the most likely to be involved in function. These 411 residues contain 70 of the 111 annotated catalytic residues. This represents an approximately 14-fold enrichment of catalytic residues over the entire input set (corresponding to a sensitivity of 63% and a precision of 17%), a performance competitive with that of other state-of-the-art methods. Conclusions: We found that several of the graph-based measures utilize the same underlying feature of protein structures, which can be captured more simply and effectively with the distance-to-GCM definition, which also has the added advantage of simplicity and easy implementation. Meanwhile, sequence conservation remains by far the most influential feature in identifying functional residues. We also found that, due to the rapid changes in the size and composition of sequence databases, conservation calculations must be recalibrated for specific reference databases.
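The distance-to-GCM feature itself is straightforward to compute from C-alpha coordinates; a minimal sketch with an unweighted center and toy coordinates:

```python
import numpy as np

coords = np.array([[0.0, 0.0, 0.0],    # one C-alpha per residue (toy data)
                   [3.8, 0.0, 0.0],
                   [3.8, 3.8, 0.0],
                   [0.0, 3.8, 3.8]])

gcm = coords.mean(axis=0)                          # geometric center of mass
dist_to_gcm = np.linalg.norm(coords - gcm, axis=1)
print(dist_to_gcm)   # catalytic residues tend to sit closer to the GCM
```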
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 85
    Publication Date: 2013-02-27
Description: Background: A phylogeny postulates shared ancestry relationships among organisms in the form of a binary tree. Phylogenies attempt to answer an important question posed in biology: what are the ancestor-descendent relationships between organisms? At the core of every biological problem lies a phylogenetic component. The patterns that can be observed in nature are the product of complex interactions, constrained by the template that our ancestors provide. The problem of simultaneous tree and alignment estimation under Maximum Parsimony is known in combinatorial optimization as the Generalized Tree Alignment Problem (GTAP). The GTAP is the Steiner Tree Problem for the sequence edit distance. Like many biologically interesting problems, the GTAP is NP-Hard. Typically the Steiner Tree is presented under the Manhattan or the Hamming distances. Results: Experimentally, the accuracy of the GTAP has been subjected to evaluation. Results show that phylogenies selected using the GTAP from unaligned sequences are competitive with the best methods and algorithms available. Here, we implement and explore experimentally existing and new local search heuristics for the GTAP using simulated and real data. Conclusions: The methods presented here improve execution time by more than three orders of magnitude over the best local search heuristics existing to date when applied to real data.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 86
    Publication Date: 2013-02-27
Description: Background: Proteins are the key elements on the path from genetic information to the development of life. The roles played by the different proteins are difficult to uncover experimentally, as this process involves complex procedures such as genetic modifications, injection of fluorescent proteins, gene knock-out methods and others. The knowledge learned from each protein is usually annotated in databases through different methods, such as that proposed by The Gene Ontology (GO) consortium. Different methods have been proposed in order to predict GO terms from primary structure information, but very few are available for large-scale functional annotation of plants, and reported success rates are much lower than those reported for non-plant predictors. This paper explores the predictability of GO annotations on proteins belonging to the Embryophyta group from a set of features extracted solely from their primary amino acid sequence. Results: High predictability of several GO terms was found for Molecular Function and Cellular Component. As expected, a lower degree of predictability was found for Biological Process ontology annotations, although a few biological processes were easily predicted. Proteins related to transport and transcription were particularly well predicted from primary structure information. The most discriminant features for prediction were those related to the electric charges of the amino-acid sequence and hydropathicity-derived features. Conclusions: An analysis of GO-slim term predictability in plants was carried out in order to determine single categories or groups of functions that are most related to primary structure information. For each highly predictable GO term, the features responsible for such success were identified and discussed. In contrast to most published studies, which focus on a few categories or single ontologies, the results in this paper comprise a complete landscape of GO predictability from primary structure, encompassing 75 GO terms at the molecular, cellular and phenotypical level. Thus, it provides a valuable guide for researchers interested in further advances in protein function prediction for Embryophyta plants.
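The two most discriminant feature families, net charge and hydropathicity, can be computed from the primary sequence alone; the sketch below uses the standard Kyte-Doolittle scale (the GRAVY average) and a simple K/R-minus-D/E charge count as illustrative stand-ins for the paper's exact features:

```python
# Kyte-Doolittle hydropathy scale
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def gravy(seq):
    """Average hydropathicity (GRAVY score) of a protein sequence."""
    return sum(KD[a] for a in seq) / len(seq)

def net_charge(seq):
    """Crude net charge: positive (K, R) minus negative (D, E) residues."""
    return sum(seq.count(a) for a in "KR") - sum(seq.count(a) for a in "DE")

seq = "MKKLLDEAIVR"
print(gravy(seq), net_charge(seq))
```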
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 87
    Publication Date: 2013-02-27
    Description: Background: The rapid growth of short read datasets poses a new challenge to the short read mapping problem in terms of sensitivity and execution speed. Existing methods often use a restrictive error model for computing the alignments to improve speed, whereas more flexible error models are generally too slow for large-scale applications. A number of short read mapping software tools have been proposed. However, designs based on hardware are relatively rare. Field programmable gate arrays (FPGAs) have been successfully used in a number of specific application areas, such as the DSP and communications domains due to their outstanding parallel data processing capabilities, making them a competitive platform to solve problems that are "inherently parallel". Results: We present a hybrid system for short read mapping utilizing both FPGA-based hardware and CPU-based software. The computation intensive alignment and the seed generation operations are mapped onto an FPGA. We present a computationally efficient, parallel block-wise alignment structure (Align Core) to approximate the conventional dynamic programming algorithm. The performance is compared to the multi-threaded CPU-based GASSST and BWA software implementations. For single-end alignment, our hybrid system achieves faster processing speed than GASSST (with a similar sensitivity) and BWA (with a higher sensitivity); for pair-end alignment, our design achieves a slightly worse sensitivity than that of BWA but has a higher processing speed. Conclusions: This paper shows that our hybrid system can effectively accelerate the mapping of short reads to a reference genome based on the seed-and-extend approach. The performance comparison to the GASSST and BWA software implementations under different conditions shows that our hybrid design achieves a high degree of sensitivity and requires less overall execution time with only modest FPGA resource utilization. Our hybrid system design also shows that the performance bottleneck for the short read mapping problem can be changed from the alignment stage to the seed generation stage, which provides an additional requirement for the future development of short read aligners.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 88
    Publication Date: 2013-02-26
    Description: Background: Characterising genetic diversity through the analysis of massively parallel sequencing (MPS) data offers enormous potential to significantly improve our understanding of the genetic basis for observed phenotypes, including predisposition to and progression of complex human disease. Great challenges remain in resolving which genetic variants are genuinely associated with disease among the millions of 'bystanders' and artefactual signals. Results: FAVR is a suite of new methods designed to work with commonly used MPS analysis pipelines to assist in the resolution of some of these issues, with a focus on relatively rare genetic variants. To the best of our knowledge, no equivalent has previously been described. The most important and novel aspect of FAVR is the use of signatures in comparator sequence alignment files during variant filtering, and the annotation of variants potentially shared between individuals. The FAVR methods use these signatures to facilitate filtering of (i) platform-specific artefacts, (ii) common genetic variants, and, where relevant, (iii) artefacts derived from imbalanced paired-end sequencing, as well as the annotation of genetic variants based on evidence of co-occurrence in individuals. By comparing conventional variant calling with and without downstream processing by the FAVR methods on whole-exome sequencing datasets, we demonstrate a 3-fold smaller rare single nucleotide variant shortlist with no detected reduction in sensitivity. This analysis included Sanger sequencing of rare variant signals not evident in dbSNP131, assessment of known variant signal preservation, and comparison of observed and expected rare variant numbers across a range of first-cousin pairs. The principles described herein were applied in our recent publication identifying XRCC2 as a new breast cancer risk gene and have been made publicly available as a suite of software tools. Conclusions: FAVR is a platform-agnostic suite of methods that significantly enhances the analysis of large volumes of sequencing data for the study of rare genetic variants and their influence on phenotypes.
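    The comparator-signature idea can be sketched abstractly: discard a candidate rare variant when read-level evidence for the same allele recurs across unrelated comparator samples, since recurrence suggests a platform artefact or a common polymorphism. The data structures and thresholds below are invented for illustration and are not the FAVR implementation.
        # Hypothetical comparator-based variant filter in the spirit of FAVR:
        # discard a candidate rare variant if read evidence for the same
        # allele recurs across unrelated comparator samples.
        from typing import NamedTuple

        class Variant(NamedTuple):
            chrom: str
            pos: int
            ref: str
            alt: str

        def evidence_in_comparator(var: Variant, comparator_pileup: dict,
                                   min_alt_reads: int = 2) -> bool:
            """comparator_pileup maps (chrom, pos) -> {base: read count}."""
            counts = comparator_pileup.get((var.chrom, var.pos), {})
            return counts.get(var.alt, 0) >= min_alt_reads

        def filter_variants(variants, comparators, max_hits: int = 1):
            """Keep variants whose allele recurs in at most `max_hits` comparators."""
            kept = []
            for var in variants:
                hits = sum(evidence_in_comparator(var, c) for c in comparators)
                if hits <= max_hits:
                    kept.append(var)
            return kept

        calls = [Variant("chr7", 152345, "G", "A"), Variant("chr2", 9981, "T", "C")]
        comparators = [
            {("chr7", 152345): {"G": 28, "A": 3}},   # artefact-like recurrence
            {("chr7", 152345): {"G": 30, "A": 2}},
            {("chr2", 9981): {"T": 25}},
        ]
        print(filter_variants(calls, comparators, max_hits=1))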
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 89
    Publication Date: 2012-11-09
    Description: Background: The robust identification of isotope patterns originating from peptides being analyzed through mass spectrometry (MS) is often significantly hampered by noise artifacts and the interference of overlapping patterns arising e.g. from post-translational modifications. As the classification of the recorded data points into either 'noise' or 'signal' lies at the very root of essentially every proteomic application, the quality of the automated processing of mass spectra can significantly influence the way the data might be interpreted within a given biological context. Results: We propose non-negative least squares/non-negative least absolute deviation regression to fit a raw spectrum by templates imitating isotope patterns. In a carefully designed validation scheme, we show that the method exhibits excellent performance in pattern picking. It is demonstrated that the method is able to disentangle complicated overlaps of patterns. Conclusions: We find that regularization is not necessary to prevent overfitting and that thresholding is an effective and user-friendly way to perform feature selection. The proposed method avoids problems inherent in regularization-based approaches, comes with a set of well-interpretable parameters whose default configuration is shown to generalize well without the need for fine-tuning, and is applicable to spectra of different platforms. The R package IPPD implements the method and is available from the Bioconductor platform (http://bioconductor.fhcrc.org/help/bioc-views/devel/bioc/html/IPPD.html).
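    The fitting step can be illustrated with an off-the-shelf non-negative least squares solver: assemble a matrix whose columns are isotope-pattern templates at candidate positions, solve for non-negative amplitudes, and threshold the coefficients as feature selection. This toy sketch uses a stylized envelope rather than realistic peptide isotope distributions and is not the IPPD implementation.
        # Toy illustration of fitting a spectrum as a non-negative combination
        # of isotope-pattern templates, with thresholding as feature selection.
        import numpy as np
        from scipy.optimize import nnls

        n_bins = 60
        pattern = np.array([0.5, 1.0, 0.7, 0.3, 0.1])   # stylized isotope envelope

        # One template per candidate position: the envelope shifted along m/z bins.
        templates = np.zeros((n_bins, n_bins - len(pattern)))
        for p in range(templates.shape[1]):
            templates[p:p + len(pattern), p] = pattern

        # Simulate a spectrum with two overlapping patterns plus noise.
        rng = np.random.default_rng(0)
        truth = np.zeros(templates.shape[1])
        truth[10], truth[12] = 4.0, 2.5                 # overlap by three bins
        spectrum = templates @ truth + rng.normal(0, 0.05, n_bins).clip(0)

        beta, _ = nnls(templates, spectrum)             # non-negative amplitudes
        picked = np.flatnonzero(beta > 0.5)             # threshold, no regularizer
        print(picked, beta[picked].round(2))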
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 90
    Publication Date: 2012-11-10
    Description: Background: The inference of homologies among DNA sequences, that is, positions in multiple genomes that share a common evolutionary origin, is a crucial, yet difficult task facing biologists. Its computational counterpart is known as the multiple sequence alignment problem. There are various criteria and methods available to perform multiple sequence alignments, and among these, the minimization of the overall cost of the alignment on a phylogenetic tree is known in combinatorial optimization as the Tree Alignment Problem. This problem typically occurs as a subproblem of the Generalized Tree Alignment Problem, which looks for the tree with the lowest alignment cost among all possible trees. This is equivalent to the Maximum Parsimony problem when the input sequences are not aligned, that is, when phylogeny and alignments are simultaneously inferred. Results: For large data sets, a popular heuristic is Direct Optimization (DO). DO provides a good tradeoff between speed, scalability, and competitive scores, and is implemented in the computer program POY. All other (competitive) algorithms have greater time complexities than DO. Here, we introduce a new algorithm, Affine-DO, to accommodate the indel (alignment gap) models commonly used in phylogenetic analysis of molecular sequence data, and present experiments evaluating it. Affine-DO has the same time complexity as DO, but is correctly suited for the affine gap edit distance. We demonstrate its performance with more than 330,000 experimental tests. These experiments show that the solutions of Affine-DO are close to the lower bound inferred from a linear programming solution. Moreover, iterating over a solution produced using Affine-DO shows little improvement. Conclusions: Our results show that Affine-DO is likely to produce near-optimal solutions, with approximations within 10% for sequences with small divergence, and within 30% for random sequences, for which Affine-DO produced the worst solutions. The Affine-DO algorithm has the necessary scalability and optimality to be a significant improvement in the real-world phylogenetic analysis of sequence data.
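    Affine-DO itself is a tree-level heuristic, but the cost model it targets is the standard affine gap edit distance. For background, a compact Gotoh-style dynamic program for the pairwise affine-gap alignment cost is sketched below; a gap of length L costs open + L * extend, and all parameter values are illustrative.
        # Gotoh dynamic programming for pairwise alignment cost under an
        # affine gap model: a gap of length L costs open_ + L * extend.
        def affine_alignment_cost(a, b, mismatch=1, open_=2, extend=1):
            INF = float('inf')
            n, m = len(a), len(b)
            M = [[INF] * (m + 1) for _ in range(n + 1)]   # match/mismatch state
            X = [[INF] * (m + 1) for _ in range(n + 1)]   # gap in b (consumes a)
            Y = [[INF] * (m + 1) for _ in range(n + 1)]   # gap in a (consumes b)
            M[0][0] = 0
            for i in range(1, n + 1):
                X[i][0] = open_ + i * extend
            for j in range(1, m + 1):
                Y[0][j] = open_ + j * extend
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    sub = 0 if a[i - 1] == b[j - 1] else mismatch
                    M[i][j] = sub + min(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1])
                    # Direct X<->Y switches are omitted: with these costs an
                    # adjacent insertion/deletion pair never beats a substitution.
                    X[i][j] = min(M[i-1][j] + open_ + extend, X[i-1][j] + extend)
                    Y[i][j] = min(M[i][j-1] + open_ + extend, Y[i][j-1] + extend)
            return min(M[n][m], X[n][m], Y[n][m])

        print(affine_alignment_cost("ACGTTACG", "ACGACG"))  # one 2-base gap -> 4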
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 91
    Publication Date: 2012-11-15
    Description: Background: Time-course gene expression data, such as yeast cell cycle data, may be periodically expressed. To cluster such data, the currently used Fourier series approximations of periodic gene expression have been found insufficiently adequate to model the complexity of time-course data, partly because they ignore the dependence between the expression measurements over time and the correlation among gene expression profiles. We further investigate the advantages and limitations of the models available in the literature and propose a new mixture model with autoregressive random effects of the first order for the clustering of time-course gene-expression profiles. Simulations and real examples are given to demonstrate the usefulness of the proposed model. Results: We illustrate the applicability of our new model using synthetic and real time-course datasets. We show that our model outperforms existing models, providing more reliable and robust clustering of time-course data. Our model provides superior results when genetic profiles are correlated, and gives comparable results when the correlation between the gene profiles is weak. In the applications to real time-course data, relevant clusters of co-regulated genes are obtained, which are supported by gene-function annotation databases. Conclusions: Our new model, under our extension of the EMMIX-WIRE procedure, is more reliable and robust for clustering time-course data because it adopts a random effects model that allows for the correlation among observations at different time points. It postulates gene-specific random effects with an auto-correlation variance structure that models co-regulation within the clusters. The developed R package is flexible in its specification of the random effects through user-input parameters, which enables improved modelling and consequent clustering of time-course data.
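    The distinguishing ingredient of the model is a random effect with first-order autoregressive (AR(1)) correlation across time points. A minimal sketch of that covariance structure, and of evaluating one cluster's Gaussian log-likelihood under it, follows; the parameter names are illustrative and this is not the EMMIX-WIRE implementation.
        # Sketch of an AR(1) random-effects covariance for time-course profiles:
        # Sigma = sigma_b2 * R(rho) + sigma_e2 * I, with R(rho)[s, t] = rho**|s-t|.
        import numpy as np
        from scipy.stats import multivariate_normal

        def ar1_covariance(T, rho, sigma_b2, sigma_e2):
            idx = np.arange(T)
            R = rho ** np.abs(idx[:, None] - idx[None, :])
            return sigma_b2 * R + sigma_e2 * np.eye(T)

        def cluster_loglik(profiles, mean, rho, sigma_b2, sigma_e2):
            """Log-likelihood of gene profiles under one cluster's model."""
            cov = ar1_covariance(profiles.shape[1], rho, sigma_b2, sigma_e2)
            return multivariate_normal(mean, cov).logpdf(profiles).sum()

        rng = np.random.default_rng(1)
        T, mean = 8, np.sin(np.linspace(0, np.pi, 8))
        cov = ar1_covariance(T, rho=0.6, sigma_b2=1.0, sigma_e2=0.2)
        profiles = rng.multivariate_normal(mean, cov, size=20)
        print(cluster_loglik(profiles, mean, 0.6, 1.0, 0.2))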
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 92
    Publication Date: 2012-11-16
    Description: Background: 454 pyrosequencing is a commonly used massively parallel DNA sequencing technology with a wide variety of application fields, such as epigenetics, metagenomics and transcriptomics. A well-known problem of this platform is its sensitivity to base-calling insertion and deletion errors, particularly in the presence of long homopolymers. In addition, the base-call quality scores are not informative with respect to whether an insertion or a deletion error is more likely. Surprisingly, not much effort has been devoted to the development of improved base-calling methods and more intuitive quality scores for this platform. Results: We present HPCall, a 454 base-calling method based on a weighted hurdle Poisson model. HPCall uses a probabilistic framework to call the homopolymer lengths in the sequence by modeling well-known 454 noise predictors. Base-calling quality is assessed based on estimated probabilities for each homopolymer length, which are easily transformed into useful quality scores. Conclusions: Using a reference data set of the Escherichia coli K-12 strain, we show that HPCall produces superior quality scores that are very informative with respect to possible insertion and deletion errors, while maintaining a base-calling accuracy better than that of the current base-caller. Given the generality of the framework, HPCall has the potential to be adapted to other homopolymer-sensitive sequencing technologies.
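    A hurdle Poisson model separates the probability that a homopolymer length is zero from a zero-truncated Poisson over positive lengths. The generic sketch below shows how such a probability mass function yields a called length and a simple quality score; it omits HPCall's weighting and flow-value covariates, which the fitted model would supply.
        # Generic hurdle Poisson probability mass function: P(L = 0) = pi0,
        # and positive lengths follow a zero-truncated Poisson(lam).
        import math

        def hurdle_poisson_pmf(length: int, pi0: float, lam: float) -> float:
            if length == 0:
                return pi0
            truncation = 1.0 - math.exp(-lam)          # P(Poisson > 0)
            poisson = math.exp(-lam) * lam ** length / math.factorial(length)
            return (1.0 - pi0) * poisson / truncation

        def call_homopolymer(pi0: float, lam: float, max_len: int = 12):
            """Return the most probable length and a simple Phred-style score."""
            probs = [hurdle_poisson_pmf(k, pi0, lam) for k in range(max_len + 1)]
            best = max(range(max_len + 1), key=probs.__getitem__)
            phred = -10.0 * math.log10(max(1.0 - probs[best], 1e-10))
            return best, round(phred, 1)

        # Strong signal suggesting a 3-mer: low zero probability, lam near 3.
        print(call_homopolymer(pi0=0.02, lam=3.1))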
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 93
    Publication Date: 2012-12-09
    Description: Background: Multivariate approaches have been successfully applied to genome wide association studies. Recently, a Partial Least Squares (PLS) based approach was introduced for mapping yeast genotype-phenotype relations, where background information such as gene function classification, gene dispensability, recent or ancient gene copy number variations and the presence of premature stop codons or frameshift mutations in reading frames was used post hoc to explain selected genes. One of the latest advancements in PLS, named L-Partial Least Squares (L-PLS), where 'L' refers to the shape of the data structure used, enables the use of background information at the modeling level. Here, a modification of L-PLS with variable importance on projection (VIP) was implemented, using a stepwise regularized procedure for the selection of genes and background information. Results were compared to PLS-based procedures in which no background information was used. Results: Applying the proposed methodology to yeast Saccharomyces cerevisiae data, we found the genotype-phenotype relationship to be easier to interpret. Phenotypic variations were explained by the variations of relatively stable genes and stable background variations. The suggested procedure provides an automatic way of performing genotype-phenotype mapping. The selected phenotype-influencing genes were evolving 29% faster than non-influential genes, and the current results are supported by a recently conducted study. Further power analysis on simulated data verified that the proposed methodology selects relevant variables. Conclusions: A modification of L-PLS with VIP in a stepwise regularized elimination procedure can improve the understandability and stability of selected genes and background information. The approach is recommended for genome wide association studies where background information is available.
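    Variable importance on projection (VIP) scores can be computed from any fitted PLS model. The sketch below uses scikit-learn's PLSRegression and the standard VIP formula; the paper's L-PLS additionally incorporates the background-information matrix, which this simple two-block version omits. A stepwise regularized elimination would repeatedly drop the lowest-VIP variables and refit.
        # Standard VIP scores from a fitted two-block PLS model.
        import numpy as np
        from sklearn.cross_decomposition import PLSRegression

        def vip_scores(pls: PLSRegression) -> np.ndarray:
            W = pls.x_weights_              # (p, A) weight vectors
            T = pls.x_scores_               # (n, A) latent scores
            Q = pls.y_loadings_             # (q, A) y-loadings
            p, A = W.shape
            # Variance of y explained by each latent component.
            ss = np.sum(T ** 2, axis=0) * np.sum(Q ** 2, axis=0)
            Wnorm2 = np.sum(W ** 2, axis=0)
            return np.sqrt(p * ((W ** 2 / Wnorm2) @ ss) / ss.sum())

        rng = np.random.default_rng(2)
        X = rng.normal(size=(100, 12))
        y = 2 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(0, 0.3, 100)
        pls = PLSRegression(n_components=3).fit(X, y)
        print(np.argsort(vip_scores(pls))[::-1][:3])   # variables 0 and 3 rank first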
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 94
    Publication Date: 2012-12-09
    Description: Background: Biomarker panels derived separately from genomic and proteomic data, and with a variety of computational methods, have demonstrated promising classification performance in various diseases. An open question is how to create effective proteo-genomic panels. The framework of ensemble classifiers has been applied successfully in various analytical domains to combine classifiers so that the performance of the ensemble exceeds the performance of the individual classifiers. Using blood-based diagnosis of acute renal allograft rejection as a case study, we address the following question in this paper: can acute rejection classification performance be improved by combining individual genomic and proteomic classifiers in an ensemble? Results: The first part of the paper presents a computational biomarker development pipeline for genomic and proteomic data. The pipeline begins with data acquisition (e.g., from bio-samples to microarray data) and proceeds through quality control, statistical analysis and mining of the data, and finally various forms of validation. The pipeline ensures that the various classifiers to be combined later in an ensemble are diverse and adequate for clinical use. Five mRNA genomic and five proteomic classifiers were developed independently using single time-point blood samples from 11 acute-rejection and 22 non-rejection renal transplant patients. The second part of the paper examines five ensembles ranging in size from two to ten individual classifiers. Performance of the ensembles is characterized by area under the curve (AUC), sensitivity, and specificity, as derived from the probability of acute rejection for the individual classifiers in the ensemble in combination with one of two aggregation methods: (1) Average Probability or (2) Vote Threshold. One ensemble demonstrated superior performance and was able to improve sensitivity and AUC beyond the best values observed for any of the individual classifiers in the ensemble, while staying within the range of observed specificity. The Vote Threshold aggregation method achieved improved sensitivity for all five ensembles, but typically at the cost of decreased specificity. Conclusions: Proteo-genomic biomarker ensemble classifiers show promise in the diagnosis of acute renal allograft rejection and can improve classification performance beyond that of individual genomic or proteomic classifiers alone. Validation of our results in an international multicenter study is currently underway.
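    The two aggregation rules are straightforward to state in code. This sketch assumes each individual classifier outputs a probability of acute rejection for a sample; the names and thresholds are illustrative, not the study's calibrated values. With a low vote threshold a single confident classifier can trigger a rejection call, which matches the observation above that Vote Threshold improves sensitivity at the cost of specificity.
        # The two aggregation rules from the abstract, over per-classifier
        # probabilities of acute rejection for one patient sample.
        def average_probability(probs, threshold=0.5):
            """Call rejection if the mean probability crosses the threshold."""
            return sum(probs) / len(probs) >= threshold

        def vote_threshold(probs, threshold=0.5, min_votes=1):
            """Call rejection if at least `min_votes` classifiers cross it."""
            votes = sum(p >= threshold for p in probs)
            return votes >= min_votes

        panel = [0.62, 0.41, 0.77, 0.35, 0.58]   # e.g. 3 genomic + 2 proteomic
        print(average_probability(panel), vote_threshold(panel, min_votes=2))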
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 95
    Publication Date: 2012-12-10
    Description: Background: Co-expression measures are often used to define networks among genes. Mutual information (MI) is often used as a generalized correlation measure. It is not clear how much MI adds beyond standard (robust) correlation measures or regression model based association measures. Further, it is important to assess what transformations of these and other co-expression measures lead to biologically meaningful modules (clusters of genes). Results: We provide a comprehensive comparison between mutual information and several correlation measures in eight empirical data sets and in simulations. We also study different approaches for transforming an adjacency matrix, e.g. using the topological overlap measure. Overall, we confirm close relationships between MI and correlation in all data sets, which reflects the fact that most gene pairs satisfy linear or monotonic relationships. We discuss rare situations in which the two measures disagree. We also compare correlation and MI based approaches when it comes to defining co-expression network modules. We show that a robust measure of correlation (the biweight midcorrelation transformed via the topological overlap transformation) leads to modules that are superior to MI based modules and maximal information coefficient (MIC) based modules in terms of gene ontology enrichment. We present a function that relates correlation to mutual information, which can be used to approximate the mutual information from the corresponding correlation coefficient. We propose the use of polynomial or spline regression models as an alternative to MI for capturing non-linear relationships between quantitative variables. Conclusions: The biweight midcorrelation outperforms MI in terms of elucidating gene pairwise relationships. Coupled with the topological overlap matrix transformation, it often leads to more significantly enriched co-expression modules. Spline and polynomial networks form attractive alternatives to MI in case of non-linear relationships. Our results indicate that MI networks can safely be replaced by correlation networks when it comes to measuring co-expression relationships in stationary data.
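    For reference, here is a sketch of the two favoured ingredients: the biweight midcorrelation between two expression vectors and the unsigned topological overlap transformation of an adjacency matrix, following the standard Tukey-biweight and WGCNA-style formulas. It is a self-contained illustration, not the published implementation.
        # Biweight midcorrelation and unsigned topological overlap (TOM).
        import numpy as np

        def bicor(x, y):
            """Biweight midcorrelation (standard Tukey-biweight definition)."""
            def biweight(v):
                med = np.median(v)
                mad = np.median(np.abs(v - med))
                u = (v - med) / (9 * mad)
                w = (1 - u ** 2) ** 2 * (np.abs(u) < 1)
                return (v - med) * w
            a, b = biweight(x), biweight(y)
            return np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))

        def topological_overlap(adj):
            """Unsigned TOM: shared-neighbour similarity of an adjacency matrix."""
            A = adj - np.diag(np.diag(adj))     # zero the diagonal
            L = A @ A                           # counts of shared neighbours
            k = A.sum(axis=1)                   # connectivities
            tom = (L + A) / (np.minimum.outer(k, k) + 1 - A)
            np.fill_diagonal(tom, 1.0)
            return tom

        rng = np.random.default_rng(3)
        x = rng.normal(size=50); y = 0.8 * x + rng.normal(0, 0.5, 50)
        print(round(bicor(x, y), 3))
        expr = rng.normal(size=(50, 6))
        adj = np.abs(np.corrcoef(expr.T)) ** 6  # soft-thresholded adjacency
        print(topological_overlap(adj).round(2))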
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 96
    Publication Date: 2012-12-12
    Description: Background: Illumina BeadArray technology includes non-specific negative control features that allow a precise estimation of the background noise. As an alternative to the background subtraction proposed in BeadStudio, which leads to a considerable loss of information by generating negative values, a background correction method modeling the observed intensities as the sum of an exponentially distributed signal and normally distributed noise has been developed. Nevertheless, Wang and Ye (2012) present a kernel-based estimator of the signal distribution on Illumina BeadArrays and suggest that a gamma distribution would better model the signal density. Hence, the normal-exponential modeling may not be appropriate for Illumina data, and background corrections derived from this model may lead to incorrect estimates. Results: We propose a more flexible modeling based on a gamma distributed signal and normally distributed background noise, and develop the associated background correction, implemented in the R package NormalGamma. Our model proves to be markedly more accurate for Illumina BeadArrays: on the one hand, it is shown on two types of Illumina BeadChips that this model offers a better fit of the observed intensities. On the other hand, the comparison of the operating characteristics of several background correction procedures on spike-in and on normal-gamma simulated data shows high similarities, reinforcing the validation of the normal-gamma modeling. The performance of the background corrections based on the normal-gamma and normal-exponential models is compared on two dilution data sets, through testing procedures which represent various experimental designs. Surprisingly, we observe that the implementation of a more accurate parametrisation in the model-based background correction does not increase the sensitivity. These results may be explained by the operating characteristics of the estimators: the normal-gamma background correction offers an improvement in terms of bias, but at the cost of a loss in precision. Conclusions: This paper addresses the lack of fit of the usual normal-exponential model by proposing a more flexible parametrisation of the signal distribution, as well as the associated background correction. This new model proves to be considerably more accurate for Illumina microarrays, but the improvement in terms of modeling does not lead to a higher sensitivity in differential analysis. Nevertheless, this realistic modeling paves the way for future investigations, in particular to examine the characteristics of pre-processing strategies.
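    The model decomposes an observed intensity into a gamma-distributed signal plus normally distributed noise. The toy sketch below recovers the gamma parameters by the method of moments, using negative-control probes for the noise moments; the NormalGamma package's actual estimator differs.
        # Toy normal-gamma model: observed = Gamma(k, theta) signal + Normal noise.
        # Noise moments come from negative-control probes; the gamma moments
        # then follow by subtraction (method of moments, not the package's fit).
        import numpy as np

        rng = np.random.default_rng(4)
        signal = rng.gamma(shape=2.0, scale=150.0, size=20000)
        noise_mu, noise_sd = 100.0, 20.0
        observed = signal + rng.normal(noise_mu, noise_sd, size=20000)
        controls = rng.normal(noise_mu, noise_sd, size=2000)   # negative controls

        mu_hat, var_hat = controls.mean(), controls.var()
        sig_mean = observed.mean() - mu_hat      # E[S] = E[X] - E[B]
        sig_var = observed.var() - var_hat       # Var[S] = Var[X] - Var[B]
        theta_hat = sig_var / sig_mean           # Gamma: mean k*theta, var k*theta^2
        k_hat = sig_mean / theta_hat
        print(round(k_hat, 2), round(theta_hat, 1))   # close to (2.0, 150.0)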
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 97
    Publication Date: 2013-03-05
    Description: Background: All sequenced eukaryotic genomes have been shown to possess at least a few introns, including those of unicellular organisms that were previously suspected to be intron-less. Therefore, gene splicing must have been present at least in the last common ancestor of the eukaryotes. To explain the evolution of introns, two mutually exclusive concepts have been developed. The introns-early hypothesis holds that even the very first protein-coding genes contained introns, while the introns-late concept asserts that eukaryotic genes gained introns only after the emergence of the eukaryotic lineage. A very important aspect in this respect is the conservation of intron positions within homologous genes of different taxa. Results: GenePainter is a standalone application for mapping gene structure information onto protein multiple sequence alignments. Based on the multiple sequence alignments, the gene structures are aligned down to single nucleotides. GenePainter accounts for variable lengths in exons and introns, respects split codons at intron junctions, and is able to handle sequencing and assembly errors, which are possible reasons for frame-shifts in exons and gaps in genome assemblies. Thus, even the gene structures of considerably divergent proteins can properly be compared, as is needed in phylogenetic analyses. Conserved intron positions can also be mapped to user-provided protein structures. For their visualization, GenePainter provides scripts for the molecular graphics system PyMol. Conclusions: GenePainter is a tool for analysing gene structure conservation that provides various visualization options. A stable version of GenePainter for all operating systems, as well as documentation and example data, is available at http://www.motorprotein.de/genepainter.html.
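    The core coordinate bookkeeping, projecting an intron position in an ungapped protein sequence onto its column in a gapped alignment row, can be sketched as follows. GenePainter additionally handles split codons, frame-shifts and nucleotide-level offsets, which this illustration omits.
        # Sketch of the coordinate mapping at the heart of gene-structure
        # comparison: project an intron's residue position in an ungapped
        # protein sequence onto its column in the gapped alignment row.
        def residue_to_column(aligned_row: str, residue_index: int) -> int:
            """0-based residue index in the ungapped sequence -> alignment column."""
            seen = -1
            for col, ch in enumerate(aligned_row):
                if ch != '-':
                    seen += 1
                    if seen == residue_index:
                        return col
            raise IndexError("residue index beyond sequence end")

        def shared_introns(rows, introns):
            """Alignment columns where every sequence carries an intron."""
            col_sets = [{residue_to_column(r, i) for i in pos}
                        for r, pos in zip(rows, introns)]
            return set.intersection(*col_sets)

        rows = ["MK-TAYIA", "MKQTA-IA"]
        introns = [[3], [4]]    # intron before the given residue in each protein
        print(shared_introns(rows, introns))   # both map to alignment column 4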
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 98
    Publication Date: 2012-09-25
    Description: Background: Biologists are elucidating complex collections of genetic regulatory data for multiple organisms. Software is needed to manage, integrate, and visualize such regulatory network data. Results: The Pathway Tools software provides a comprehensive environment for manipulating molecular regulatory interactions that integrates regulatory data with an organism's genome and metabolic network. The Pathway Tools regulation ontology captures transcriptional and translational regulation, substrate-level regulation of enzyme activity, post-translational modifications, and regulatory pathways. Curated collections of regulatory data are available for Escherichia coli, Bacillus subtilis, and Shewanella oneidensis. Regulatory visualizations include a novel diagram that summarizes all regulatory influences on a gene; a transcription-unit diagram; and an interactive visualization of a full transcriptional regulatory network that can be painted with gene expression data to probe correlations between gene expression and regulatory mechanisms. We introduce a novel type of enrichment analysis that asks whether a gene-expression dataset is over-represented for known regulators. We present algorithms for ranking the degree of regulatory influence of genes, and for computing the net positive and negative regulatory influences on a gene.
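    The regulator enrichment analysis can be phrased as a standard over-representation test. The sketch below uses a hypergeometric survival function; the gene identifiers are invented, and Pathway Tools' own statistic may differ.
        # Generic over-representation test: is the target set (regulon) of a
        # regulator enriched in a differentially expressed gene list?
        from scipy.stats import hypergeom

        def regulon_enrichment(genome_size, de_genes, regulon):
            overlap = len(de_genes & regulon)
            # P(X >= overlap) when drawing |regulon| genes from the genome.
            return hypergeom.sf(overlap - 1, genome_size,
                                len(de_genes), len(regulon))

        de = {f"g{i}" for i in range(40)}                  # 40 DE genes
        crp_regulon = {f"g{i}" for i in range(30, 60)}     # 10 overlap with DE
        print(regulon_enrichment(4000, de, crp_regulon))   # small p => enriched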
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 99
    Publication Date: 2012-09-26
    Description: Background: Inverted repeat genes encode precursor RNAs characterized by hairpin structures. These RNA hairpins are then metabolized by biosynthetic pathways to produce functional small RNAs. In eukaryotic genomes, short non-autonomous transposable elements can have similar sizes and hairpin structures as non-coding precursor RNAs. This resemblance leads to problems annotating small RNAs. Methods: We mapped all microRNA precursors from miRBase to several genomes and studied the repetition and dispersion of the corresponding loci. We then searched for repetitive elements overlapping these loci. Results: We developed an automatic method called ncRNAclassifier to classify pre-ncRNAs according to their relationship with transposable elements (TEs). We show that the number of scattered occurrences of ncRNA precursor candidates is correlated with the presence of TEs. We applied ncRNAclassifier to six chordate genomes and report our findings. Among the 1,426 human and 721 mouse pre-miRNAs of miRBase, we identified 235 and 68 mis-annotated pre-miRNAs, respectively, corresponding completely to TEs. Conclusions: We provide a tool enabling the identification of repetitive elements in precursor ncRNA sequences. ncRNAclassifier is available at http://EvryRNA.ibisc.univ-evry.fr
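    At its core, the classification asks, for each genomic copy of a pre-ncRNA candidate, whether the locus overlaps an annotated transposable element. The schematic interval-overlap count below illustrates this; it ignores chromosomes and strand, and it is not the ncRNAclassifier feature set.
        # Schematic TE-overlap check for pre-ncRNA candidate loci: count how
        # many genomic copies of a candidate fall inside annotated TEs.
        def overlaps(a_start, a_end, b_start, b_end):
            return a_start < b_end and b_start < a_end

        def classify_candidate(loci, te_annotations, te_fraction=0.5):
            """Flag a candidate whose copies mostly coincide with TEs."""
            hits = sum(any(overlaps(s, e, ts, te) for (ts, te) in te_annotations)
                       for (s, e) in loci)
            return "TE-derived" if hits / len(loci) >= te_fraction else "ncRNA-like"

        loci = [(1000, 1085), (52000, 52085), (90310, 90395)]   # scattered copies
        tes = [(950, 1200), (51900, 52300)]
        print(classify_candidate(loci, tes))   # 2/3 copies in TEs -> "TE-derived"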
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 100
    Publication Date: 2012-10-14
    Description: Background: New computational resources are needed to manage the increasing volume of biological data from genome sequencing projects. One fundamental challenge is the ability to maintain a complete and current catalog of protein diversity. We developed a new approach for the identification of protein families that focuses on the rapid discovery of homologous protein sequences. Results: We implemented fully automated and high-throughput procedures to de novo cluster proteins into families based upon global alignment similarity. Our approach employs an iterative clustering strategy in which homologs of known families are sifted out of the search for new families. The resulting reduction in computational complexity enables us to rapidly identify novel protein families found in new genomes and to perform efficient, automated updates that keep pace with genome sequencing. We refer to protein families identified through this approach as "Sifting Families," or SFams. Our analysis of ~10.5 million protein sequences from 2,928 genomes identified 436,360 SFams, many of which are not represented in other protein family databases. We validated the quality of SFam clustering through statistical as well as network topology-based analyses. Conclusions: We describe the rapid identification of SFams and demonstrate how they can be used to annotate genomes and metagenomes. The SFam database catalogs protein-family quality metrics, multiple sequence alignments, hidden Markov models, and phylogenetic trees. Our source code and database are publicly available and will be subject to frequent updates (http://edhar.genomecenter.ucdavis.edu/sifting_families/).
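    The sifting strategy, recruit new sequences into known families first and cluster only the leftovers into new families, can be sketched generically. The similarity function below is a crude k-mer Jaccard stand-in for the global alignments and HMMs used by the real pipeline, and the threshold is illustrative.
        # Generic 'sifting' loop: recruit sequences into known families first,
        # then greedily cluster the remainder into new families.
        def kmer_set(seq, k=3):
            return {seq[i:i + k] for i in range(len(seq) - k + 1)}

        def similarity(a, b, k=3):
            sa, sb = kmer_set(a, k), kmer_set(b, k)
            return len(sa & sb) / len(sa | sb)

        def sift(sequences, families, threshold=0.4):
            leftovers = []
            for seq in sequences:            # phase 1: sift out known homologs
                best = max(families,
                           key=lambda f: similarity(seq, families[f][0]),
                           default=None)
                if best and similarity(seq, families[best][0]) >= threshold:
                    families[best].append(seq)
                else:
                    leftovers.append(seq)
            for seq in leftovers:            # phase 2: greedy new families
                placed = False
                for members in families.values():
                    if similarity(seq, members[0]) >= threshold:
                        members.append(seq)
                        placed = True
                        break
                if not placed:
                    families[f"SFam{len(families) + 1}"] = [seq]
            return families

        fams = {"SFam1": ["MKTAYIAKQRQISFVK"]}
        new = ["MKTAYIAKQRQISFVR", "GGSLWPVVQQHQQAAA"]
        print(sift(new, fams))   # first joins SFam1, second founds SFam2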
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central