ALBERT

All Library Books, journals and Electronic Records Telegrafenberg

  • 1
    Publication Date: 2014-12-14
    Description: Background: Biomedical ontologies are increasingly instrumental in the advancement of biological research primarily through their use to efficiently consolidate large amounts of data into structured, accessible sets. However, ontology development and usage can be hampered by the segregation of knowledge by domain that occurs due to independent development and use of the ontologies. The ability to infer data associated with one ontology to data associated with another ontology would prove useful in expanding information content and scope. We here focus on relating two ontologies: the Gene Ontology (GO), which encodes canonical gene function, and the Mammalian Phenotype Ontology (MP), which describes non-canonical phenotypes, using statistical methods to suggest GO functional annotations from existing MP phenotype annotations. This work is in contrast to previous studies that have focused on inferring gene function from phenotype primarily through lexical or semantic similarity measures. Results: We have designed and tested a set of algorithms that represents a novel methodology to define rules for predicting gene function by examining the emergent structure and relationships between the gene functions and phenotypes rather than inspecting the terms semantically. The algorithms inspect relationships among multiple phenotype terms to deduce if there are cases where they all arise from a single gene function. We apply this methodology to data about genes in the laboratory mouse that are formally represented in the Mouse Genome Informatics (MGI) resource. From the data, 7444 rule instances were generated from five generalized rules, resulting in 4818 unique GO functional predictions for 1796 genes. Conclusions: We show that our method is capable of inferring high-quality functional annotations from curated phenotype data. As well as creating inferred annotations, our method has the potential to allow for the elucidation of unforeseen, biologically significant associations between gene function and phenotypes that would be overlooked by a semantics-based approach. Future work will include the implementation of the described algorithms for a variety of other model organism databases, taking full advantage of the abundance of available high quality curated data.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
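
The rule-based inference idea above can be illustrated with a toy sketch: if, across annotated genes, every phenotype of a gene consistently co-occurs with the same GO function, that function is proposed for a phenotype-only gene. This is a deliberately simplified stand-in for the paper's five generalized rules; all identifiers below are hypothetical.

# Toy sketch of rule-based function inference from phenotype co-annotation.
# Assumption: a GO function shared by *all* of a gene's phenotypes (across
# the training genes) is predicted for that gene. Illustrative only.

# Hypothetical training data: gene -> (GO functions, MP phenotypes)
training = {
    "geneA": ({"GO:0005102"}, {"MP:0001262", "MP:0002083"}),
    "geneB": ({"GO:0005102"}, {"MP:0001262"}),
    "geneC": ({"GO:0016301"}, {"MP:0002083"}),
}

# For each phenotype, collect every function it co-occurs with.
pheno_to_funcs = {}
for funcs, phenos in training.values():
    for p in phenos:
        pheno_to_funcs.setdefault(p, set()).update(funcs)

def predict_functions(phenotypes):
    """Predict GO terms shared by all of a gene's phenotypes."""
    candidate_sets = [pheno_to_funcs.get(p, set()) for p in phenotypes]
    return set.intersection(*candidate_sets) if candidate_sets else set()

# A new gene annotated only with phenotypes:
print(predict_functions({"MP:0001262", "MP:0002083"}))  # {'GO:0005102'}
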
  • 2
    Publication Date: 2014-12-18
    Description: Background: Identification of individual components in complex mixtures is an important and sometimes daunting task in several research areas like metabolomics and natural product studies. NMR spectroscopy is an excellent technique for analysis of mixtures of organic compounds and gives a detailed chemical fingerprint of most individual components above the detection limit. For the identification of individual metabolites in metabolomics, correlation or covariance between peaks in 1H NMR spectra has previously been successfully employed. Similar correlation of 2D 1H-13C Heteronuclear Single Quantum Correlation spectra was recently applied to investigate the structure of heparin. In this paper, we demonstrate how a similar approach can be used to identify metabolites in human biofluids (post-prostatic palpation urine). Results: From 50 1H-13C Heteronuclear Single Quantum Correlation spectra, 23 correlation plots resembling pure metabolites were constructed. The identities of these metabolites were confirmed by comparing the correlation plots with reported NMR data, mostly from the Human Metabolome Database. Conclusions: Correlation plots prepared by statistically correlating 1H-13C Heteronuclear Single Quantum Correlation spectra from human biofluids provide unambiguous identification of metabolites. The correlation plots highlight cross-peaks belonging to each individual compound, not limited by long-range magnetization transfer as conventional NMR experiments.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
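
The correlation step described above can be sketched in a few lines of Python: cross-peak intensities that rise and fall together across the 50 spectra are grouped as one putative metabolite. The data below are simulated and the 0.9 threshold is an arbitrary choice.

import numpy as np

# Minimal sketch of statistical correlation of 2D NMR cross-peaks across
# samples. Shapes are illustrative; real HSQC data would be peak-picked
# intensities from each spectrum.
rng = np.random.default_rng(0)

n_spectra, n_peaks = 50, 200          # e.g. 50 HSQC spectra, 200 cross-peaks
intensities = rng.lognormal(size=(n_spectra, n_peaks))

# Make peaks 10-14 co-vary, as if they belong to one metabolite whose
# concentration differs between samples.
conc = rng.lognormal(size=n_spectra)
intensities[:, 10:15] = conc[:, None] * rng.uniform(0.5, 2.0, size=5)

corr = np.corrcoef(intensities, rowvar=False)   # peak-by-peak correlation

seed = 12                                       # a peak of interest
partners = np.where(corr[seed] > 0.9)[0]        # peaks rising/falling together
print("cross-peaks likely from the same compound:", partners)
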
  • 3
    Publication Date: 2014-12-18
    Description: Background: Alternative splicing (AS), a post-transcriptional regulation mechanism, is an important subject of RNA-seq studies in eukaryotes. A number of software packages and computational methods have been developed for detecting AS. Most of the methods, however, are designed and tested on animal data, such as human and mouse. Plant genes differ from those of animals in many ways, e.g., the average intron size and preferred AS types. These differences may require different computational approaches and raise questions about the methods' effectiveness on plant data. The goal of this paper is to benchmark existing computational differential splicing (or transcription) detection methods so that biologists can choose the most suitable tools to accomplish their goals. Results: This study compares eight popular publicly available software packages for differential splicing analysis using both simulated and real Arabidopsis thaliana RNA-seq data. All software is freely available. The study examines the effect of varying AS ratio, read depth, dispersion pattern, AS types, sample sizes and the influence of annotation. Using real data, the study looks at the consistency between the packages and verifies a subset of the detected AS events using PCR studies. Conclusions: No single method performs best in all situations. The accuracy of annotation has a major impact on which method should be chosen for AS analysis. DEXSeq performs well on the simulated data when the AS signal is relatively strong and annotation is accurate. Cufflinks achieves a better tradeoff between precision and recall and turns out to be the best one when incomplete annotation is provided. Some methods perform inconsistently for different AS types. Complex AS events that combine several simple AS events pose problems for most methods, especially for MATS. MATS stands out in the analysis of real RNA-seq data when all the AS events being evaluated are simple AS events.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 4
    Publication Date: 2014-11-09
    Description: Background: The rapid accumulation of whole-genome data has renewed interest in the study of using gene-order data for phylogenetic analyses and ancestral reconstruction. Current software and web servers typically do not support duplication and loss events along with rearrangements. Results: MLGO (Maximum Likelihood for Gene-Order Analysis) is a web tool for the reconstruction of phylogeny and/or ancestral genomes from gene-order data. MLGO is based on likelihood computation and shows advantages over existing methods in terms of accuracy, scalability and flexibility. Conclusions: To the best of our knowledge, it is the first web tool for analysis of large-scale genomic changes including not only rearrangements but also gene insertions, deletions and duplications. The web tool is available from http://www.geneorder.org/server.php.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 5
    Publication Date: 2014-12-16
    Description: Background: Genomic selection (GS) promises to improve accuracy in estimating breeding values and genetic gain for quantitative traits compared to traditional breeding methods. Its reliance on high-throughput genome-wide markers and statistical complexity, however, is a serious challenge in data management, analysis, and sharing. A bioinformatics infrastructure for data storage and access, and user-friendly web-based tool for analysis and sharing output is needed to make GS more practical for breeders. Results: We have developed a web-based tool, called solGS, for predicting genomic estimated breeding values (GEBVs) of individuals, using a Ridge-Regression Best Linear Unbiased Predictor (RR-BLUP) model. It has an intuitive web-interface for selecting a training population for modeling and estimating genomic estimated breeding values of selection candidates. It estimates phenotypic correlation and heritability of traits and selection indices of individuals. Raw data is stored in a generic database schema, Chado Natural Diversity, co-developed by multiple database groups. Analysis output is graphically visualized and can be interactively explored online or downloaded in text format. An instance of its implementation can be accessed at the NEXTGEN Cassava breeding database, http://cassavabase.org/solgs. Conclusions: solGS enables breeders to store raw data and estimate GEBVs of individuals online, in an intuitive and interactive workflow. It can be adapted to any breeding program.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
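
As a rough illustration of the RR-BLUP idea behind solGS (not its server-side implementation, where shrinkage comes from estimated variance components), this sketch fits ridge-regression marker effects and predicts GEBVs for hypothetical candidates; the data and the penalty lam are made up.

import numpy as np

# Minimal RR-BLUP-style sketch: ridge regression of phenotypes on genome-wide
# markers, then GEBV prediction for selection candidates. lam is an assumed
# constant here, not a variance-component estimate.
rng = np.random.default_rng(1)

n_train, n_cand, n_markers = 100, 20, 1000
Z = rng.integers(0, 3, size=(n_train, n_markers)).astype(float)  # 0/1/2 genotypes
true_effects = rng.normal(scale=0.05, size=n_markers)
y = Z @ true_effects + rng.normal(scale=1.0, size=n_train)       # phenotypes

lam = 1.0
# Ridge solution for marker effects: (Z'Z + lam I)^-1 Z'y
beta = np.linalg.solve(Z.T @ Z + lam * np.eye(n_markers), Z.T @ y)

Z_cand = rng.integers(0, 3, size=(n_cand, n_markers)).astype(float)
gebv = Z_cand @ beta                # genomic estimated breeding values
print("top candidates:", np.argsort(gebv)[::-1][:5])
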
  • 6
    Publication Date: 2014-12-16
    Description: Background: According to Regulation (EU) No 619/2011, trace amounts of non-authorised genetically modified organisms (GMO) in feed are tolerated within the EU if certain prerequisites are met. Tolerable traces must not exceed the so-called 'minimum required performance limit' (MRPL), which was defined according to the mentioned regulation to correspond to 0.1% mass fraction per ingredient. Therefore, not-yet-authorised GMO (and some GMO whose approvals have expired) have to be quantified at very low levels following qualitative detection in genomic DNA extracted from feed samples. As the results of quantitative analysis can imply severe legal and financial consequences for producers or distributors of feed, the quantification results need to be utterly reliable. Results: We developed a statistical approach to investigate the experimental measurement variability within one 96-well PCR plate. This approach visualises the frequency distribution as zygosity-corrected relative content of genetically modified material resulting from different combinations of transgene and reference gene Cq values. One application of it is the simulation of the consequences of varying parameters on measurement results. Parameters could be, for example, replicate numbers or baseline and threshold settings; measurement results could be, for example, the median (class) and relative standard deviation (RSD). All calculations can be done using the built-in functions of Excel without any need for programming. The developed Excel spreadsheets are available (see section 'Availability of supporting data' for details). In most cases, the combination of four PCR replicates for each of the two DNA isolations already resulted in a relative standard deviation of 15% or less. Conclusions: The aims of the study are scientifically based suggestions for minimisation of uncertainty of measurement, especially in (but not limited to) the field of GMO quantification at low concentration levels. Four PCR replicates for each of the two DNA isolations seem to be a reasonable minimum number to narrow down the possible spread of results.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
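
The delta-Cq arithmetic that such a spreadsheet simulation rests on can be sketched as follows. The replicate values and the perfect amplification efficiency of 2 are assumptions, and the zygosity correction applied by the paper's spreadsheets is omitted.

import numpy as np
from itertools import product

# Sketch of a delta-Cq-style calculation: every combination of transgene and
# reference-gene Cq replicates yields one relative-content estimate, and the
# spread of those estimates is summarized by the RSD. Values are invented.
cq_transgene = np.array([33.1, 33.4, 33.2, 33.6])   # 4 PCR replicates
cq_reference = np.array([23.2, 23.3, 23.1, 23.4])

estimates = np.array([
    100.0 * 2.0 ** (ref - tg)                       # % GM, assuming E = 2
    for tg, ref in product(cq_transgene, cq_reference)
])

median = np.median(estimates)
rsd = 100.0 * estimates.std(ddof=1) / estimates.mean()
print(f"median content: {median:.3f} %  RSD: {rsd:.1f} %")
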
  • 7
    Publication Date: 2014-12-16
    Description: Background: The latest generations of Single Nucleotide Polymorphism (SNP) arrays allow the study of copy-number variations in addition to genotyping measures. Results: MPAgenomics, standing for multi-patient analysis (MPA) of genomic markers, is an R package devoted to: (i) efficient segmentation and (ii) selection of genomic markers from multi-patient copy number and SNP data profiles. It provides wrappers around commonly used packages to streamline their repeated (sometimes difficult) manipulation, offering an easy-to-use pipeline for beginners in R. The segmentation of successive multiple profiles (finding losses and gains) is performed with an automatic choice of parameters involved in the wrapped packages. Considering multiple profiles at the same time, MPAgenomics wraps efficient penalized regression methods to select relevant markers associated with a given outcome. Conclusions: MPAgenomics provides an easy tool to analyze data from SNP arrays in R. The R package MPAgenomics is available on CRAN.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 8
    Publication Date: 2014-12-16
    Description: Background: With the ever increasing use of computational models in the biosciences, the need to share models and reproduce the results of published studies efficiently and easily is becoming more important. To this end, various standards have been proposed that can be used to describe models, simulations, data or other essential information in a consistent fashion. These constitute various separate components required to reproduce a given published scientific result. Results: We describe the Open Modeling EXchange format (OMEX). Together with the use of other standard formats from the Computational Modeling in Biology Network (COMBINE), OMEX is the basis of the COMBINE Archive, a single file that supports the exchange of all the information necessary for a modeling and simulation experiment in biology. An OMEX file is a ZIP container that includes a manifest file, listing the content of the archive, an optional metadata file adding information about the archive and its content, and the files describing the model. The content of a COMBINE Archive consists of files encoded in COMBINE standards whenever possible, but may include additional files defined by an Internet Media Type. Several tools that support the COMBINE Archive are available, either as independent libraries or embedded in modeling software. Conclusions: The COMBINE Archive facilitates the reproduction of modeling and simulation experiments in biology by embedding all the relevant information in one file. Having all the information stored and exchanged at once also helps in building activity logs and audit trails. We anticipate that the COMBINE Archive will become a significant help for modellers, as the domain moves to larger, more complex experiments such as multi-scale models of organs, digital organisms, and bioengineering.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
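
The archive layout described above (a ZIP container holding a manifest plus model files) is easy to illustrate with Python's standard zipfile module. This is a hand-rolled sketch rather than the official libCombine API; consult the COMBINE specifications for the exact manifest format identifiers.

import zipfile

# Minimal sketch of building a COMBINE Archive (OMEX): a ZIP container with a
# manifest listing its contents. 'model.xml' is a placeholder model file.
manifest = """<?xml version="1.0" encoding="UTF-8"?>
<omexManifest xmlns="http://identifiers.org/combine.specifications/omex-manifest">
  <content location="." format="http://identifiers.org/combine.specifications/omex"/>
  <content location="./manifest.xml"
           format="http://identifiers.org/combine.specifications/omex-manifest"/>
  <content location="./model.xml"
           format="http://identifiers.org/combine.specifications/sbml"/>
</omexManifest>
"""

with zipfile.ZipFile("experiment.omex", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("manifest.xml", manifest)
    zf.writestr("model.xml", "<sbml><!-- model content goes here --></sbml>")

print(zipfile.ZipFile("experiment.omex").namelist())
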
  • 9
    Publication Date: 2014-12-15
    Description: Background: With the advent of low-cost, fast sequencing technologies, metagenomic analyses have become possible. The large data volumes gathered by these techniques and the unpredictable diversity captured in them are still, however, a challenge for computational biology. Results: In this paper we address the problem of rapid taxonomic assignment with small and adaptive data models (< 5 MB) and present the accelerated k-mer explorer (AKE). Acceleration in AKE's taxonomic assignments is achieved by a special machine learning architecture, which is well suited to model data collections that are intrinsically hierarchical. We observe reasonably good classification accuracy for ranks down to order in a study on real-world data (Acid Mine Drainage, Cow Rumen). Conclusion: We show that the execution time of this approach is orders of magnitude shorter than that of competing approaches and that accuracy is comparable. The tool is presented to the public as a web application (url: https://ani.cebitec.uni-bielefeld.de/ake/, username: bmc, password: bmcbioinfo).
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
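
The k-mer representation that such classifiers consume can be sketched compactly; k = 4 is an arbitrary choice here.

from collections import Counter

# Sketch of a k-mer profile: each read becomes a vector of k-mer frequencies
# that a (hierarchical) classifier can consume.
def kmer_profile(seq, k=4):
    """Return relative k-mer frequencies of a DNA sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

read = "ACGTACGTGGCCAACGT"
profile = kmer_profile(read)
print(sorted(profile.items(), key=lambda kv: -kv[1])[:3])
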
  • 10
    Publication Date: 2014-12-15
    Description: Background: Next-generation sequencing produces base calls with low quality scores that can affect the accuracy of identifying simple nucleotide variation calls, including single nucleotide polymorphisms and small insertions and deletions. Here we compare the effectiveness of two data preprocessing methods, masking and trimming, and the accuracy of simple nucleotide variation calls on whole-genome sequence data from Caenorhabditis elegans. Masking substitutes low quality base calls with 'N's (undetermined bases), whereas trimming removes low quality bases, resulting in shorter read lengths. Results: We demonstrate that masking is more effective than trimming in reducing the false-positive rate in single nucleotide polymorphism (SNP) calling. However, neither preprocessing method affected the false-negative rate in SNP calling with statistical significance compared to the data analysis without preprocessing. False-positive and false-negative rates for small insertions and deletions did not differ between masking and trimming. Conclusions: We recommend masking over trimming as a more effective preprocessing method for next-generation sequencing data analysis, since masking reduces the false-positive rate in SNP calling without sacrificing the false-negative rate, although trimming is currently more commonly used in the field. The perl script for masking is available at http://code.google.com/p/subn/. The sequencing data used in the study were deposited in the Sequence Read Archive (SRX450968 and SRX451773).
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
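
Both preprocessing strategies are easy to state precisely. A sketch on a single read, assuming Phred+33 quality encoding and a quality cutoff of 20 (the study's actual cutoff is not given in the abstract):

# Sketch of the two preprocessing strategies compared above, on one FASTQ
# record. Phred+33 encoding and a cutoff of 20 are assumptions.
CUTOFF = 20

def quals(qual_line):
    return [ord(c) - 33 for c in qual_line]          # Phred+33

def mask(seq, qual_line):
    """Replace low-quality base calls with 'N', keeping read length."""
    return "".join(b if q >= CUTOFF else "N" for b, q in zip(seq, quals(qual_line)))

def trim(seq, qual_line):
    """Truncate the read at the first low-quality base (one simple scheme)."""
    for i, q in enumerate(quals(qual_line)):
        if q < CUTOFF:
            return seq[:i]
    return seq

seq, qual = "ACGTACGTAC", "IIIII##III"               # '#' = Phred 2
print(mask(seq, qual))   # ACGTANNTAC -> length preserved
print(trim(seq, qual))   # ACGTA      -> shorter read
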
  • 11
    Publication Date: 2014-12-01
    Description: Background: The identification of new diagnostic or prognostic biomarkers is one of the main aims of clinical cancer research. Technologies like mass spectrometry are commonly used in proteomic research. Mass spectrometry signals show the proteomic profiles of the individuals under study at a given time. These profiles correspond to the recording of a large number of proteins, much larger than the number of individuals. These variables come in addition to, or complement, classical clinical variables. The objective of this study is to evaluate and compare the predictive ability of new and existing models combining mass spectrometry data and classical clinical variables. This study was conducted in the context of binary prediction. Results: To achieve this goal, simulated data as well as a real dataset dedicated to the selection of proteomic markers of steatosis were used to evaluate the methods. The proposed methods meet the challenge of high-dimensional data and the selection of predictive markers by using penalization methods (Ridge, Lasso) and dimension reduction techniques (PLS), as well as a combination of both strategies through sparse PLS, in the context of binary class prediction. The methods were compared in terms of mean classification rate and their ability to select the truly predictive variables. These comparisons were done on clinical-only models, mass-spectrometry-only models and combined models. Conclusions: It was shown that models which combine both types of data can be more efficient than models that use only clinical or mass spectrometry data when the sample size of the dataset is large enough.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
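
A sketch of this kind of comparison with scikit-learn on simulated "wide" data; the real study also uses sparse PLS and dedicated simulation designs not reproduced here.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Penalized models and PLS on data with many features and few samples, for a
# binary outcome. Data and hyperparameters are placeholders.
rng = np.random.default_rng(2)
n, p = 80, 500                              # 80 patients, 500 spectral features
X = rng.normal(size=(n, p))
y = (X[:, :5].sum(axis=1) + rng.normal(size=n) > 0).astype(int)  # 5 true markers

for name, model in [
    ("ridge (L2)", LogisticRegression(penalty="l2", C=0.1, max_iter=5000)),
    ("lasso (L1)", LogisticRegression(penalty="l1", C=0.1, solver="liblinear")),
]:
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: CV accuracy ~ {acc:.2f}")

# PLS as dimension reduction: regress y on a few latent components, threshold.
pls = PLSRegression(n_components=3).fit(X, y)
pls_acc = ((pls.predict(X).ravel() > 0.5).astype(int) == y).mean()
print(f"PLS (3 components), training accuracy ~ {pls_acc:.2f}")
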
  • 12
    Publication Date: 2014-12-01
    Description: Background: In order to extract meaningful information from electronic medical records, such as signs and symptoms, diagnoses, and treatments, it is important to take into account the contextual properties of the identified information: negation, temporality, and experiencer. Most work on automatic identification of these contextual properties has been done on English clinical text. This study presents ContextD, an adaptation of the English ConText algorithm to the Dutch language, and a Dutch clinical corpus. We created a Dutch clinical corpus containing four types of anonymized clinical documents: entries from general practitioners, specialists' letters, radiology reports, and discharge letters. Using a Dutch list of medical terms extracted from the Unified Medical Language System, we identified medical terms in the corpus with exact matching. The identified terms were annotated for negation, temporality, and experiencer properties. To adapt the ConText algorithm, we translated English trigger terms to Dutch and added several general and document-specific enhancements, such as negation rules for general practitioners' entries and a regular-expression-based temporality module. Results: The ContextD algorithm utilized 41 unique triggers to identify the contextual properties in the clinical corpus. For the negation property, the algorithm obtained an F-score from 87% to 93% for the different document types. For the experiencer property, the F-score was 99% to 100%. For the historical and hypothetical values of the temporality property, F-scores ranged from 26% to 54% and from 13% to 44%, respectively. Conclusions: ContextD showed good performance in identifying negation and experiencer property values across all Dutch clinical document types. Accurate identification of the temporality property proved to be difficult and requires further work. The anonymized and annotated Dutch clinical corpus can serve as a useful resource for further algorithm development.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
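
A toy ConText-style negation check conveys the trigger-and-scope idea. The Dutch triggers and the two-token scope below are simplifications; ContextD uses curated trigger lists plus rules that terminate a scope.

# Toy ConText-style negation check: a trigger term opens a negation scope
# over the next couple of tokens. Simplified for illustration.
NEG_TRIGGERS = {"geen", "niet", "zonder"}        # Dutch: no / not / without
SCOPE = 2                                        # tokens a trigger reaches

def is_negated(text, concept):
    tokens = text.lower().split()
    for i, tok in enumerate(tokens):
        if concept in tok:                       # crude concept match
            window = tokens[max(0, i - SCOPE):i]
            if NEG_TRIGGERS & set(window):
                return True
    return False

note = "Patient heeft geen koorts, wel hoofdpijn."
print(is_negated(note, "koorts"))     # True  ("geen koorts" = no fever)
print(is_negated(note, "hoofdpijn"))  # False ("geen" is out of scope here)
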
  • 13
    Publication Date: 2014-01-14
    Description: Background: Gene selection is an important part of microarray data analysis because it provides information that can lead to a better mechanistic understanding of an investigated phenomenon. At the same time, gene selection is very difficult because of the noisy nature of microarray data. As a consequence, gene selection is often performed with machine learning methods. The Random Forest method is particularly well suited for this purpose. In this work, four state-of-the-art Random Forest-based feature selection methods were compared in a gene selection context. The analysis focused on the stability of selection because, although it is necessary for determining the significance of results, it is often ignored in similar studies. Results: The comparison of post-selection accuracy in the validation of Random Forest classifiers revealed that all investigated methods were equivalent in this context. However, the methods substantially differed with respect to the number of selected genes and the stability of selection. Of the analysed methods, the Boruta algorithm predicted the most genes as potentially important. Conclusions: The post-selection classifier error rate, which is a frequently used measure, was found to be a potentially deceptive measure of gene selection quality. When the number of consistently selected genes was considered, the Boruta algorithm was clearly the best. Although it was also the most computationally intensive method, the Boruta algorithm's computational demands could be reduced to levels comparable to those of other algorithms by replacing the Random Forest importance with a comparable measure from Random Ferns (a similar but simplified classifier). Despite their design assumptions, the minimal-optimal selection methods were found to select a high fraction of false positives.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
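
Boruta's core trick, comparing real features against permuted "shadow" copies, can be sketched in a single iteration with scikit-learn; the actual algorithm repeats this with statistical testing.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# One iteration of the shadow-feature idea: permute each feature column to
# create an uninformative copy, then keep real features whose random-forest
# importance beats the best shadow importance.
rng = np.random.default_rng(3)
n, p = 200, 20
X = rng.normal(size=(n, p))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

shadows = rng.permuted(X, axis=0)             # column-wise permutation
X_aug = np.hstack([X, shadows])

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_aug, y)
imp = rf.feature_importances_
threshold = imp[p:].max()                     # best shadow importance
selected = np.where(imp[:p] > threshold)[0]
print("tentatively important features:", selected)   # expect ~ [0, 1]
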
  • 14
    Publication Date: 2014-01-15
    Description: Background: The Kruskal-Wallis test is a popular non-parametric statistical test for identifying expression quantitative trait loci (eQTLs) from genome-wide data due to its robustness against variations in the underlying genetic model and expression trait distribution, but testing billions of marker-trait combinations one by one can become computationally prohibitive. Results: We developed kruX, an algorithm implemented in Matlab, Python and R that uses matrix multiplications to simultaneously calculate the Kruskal-Wallis test statistic for several millions of marker-trait combinations at once. kruX is more than ten thousand times faster than computing associations one by one on a typical human dataset. We used kruX and a dataset of more than 500k SNPs and 20k expression traits measured in 102 human blood samples to compare eQTLs detected by the Kruskal-Wallis test to eQTLs detected by the parametric ANOVA and linear model methods. We found that the Kruskal-Wallis test is more robust against data outliers and heterogeneous genotype group sizes and detects a higher proportion of non-linear associations, but is more conservative for calling additive linear associations. Conclusion: kruX enables the use of robust non-parametric methods for massive eQTL mapping without the need for a high-performance computing infrastructure and is freely available from http://krux.googlecode.com.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
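
The matrix trick is concrete enough to sketch: the Kruskal-Wallis H statistic needs only per-group rank sums and group sizes, so an indicator-matrix product scores one expression trait against all markers at once. This toy version omits the tie correction that a full implementation such as kruX would include.

import numpy as np
from scipy.stats import rankdata, chi2

# Vectorized Kruskal-Wallis across markers: per-genotype rank sums via one
# indicator-matrix multiplication per genotype class. Simulated toy data.
rng = np.random.default_rng(4)
n_samples, n_markers = 102, 5000
geno = rng.integers(0, 3, size=(n_markers, n_samples))   # genotypes 0/1/2
expr = rng.normal(size=n_samples)                        # one expression trait

r = rankdata(expr)                                       # ranks, 1..N
N = n_samples
H = np.zeros(n_markers)
for g in (0, 1, 2):
    G = (geno == g).astype(float)                        # indicator matrix
    n_g = G.sum(axis=1)                                  # group sizes
    rank_sums = G @ r                                    # all markers at once
    with np.errstate(invalid="ignore", divide="ignore"):
        H += np.where(n_g > 0, rank_sums**2 / n_g, 0.0)
H = 12.0 / (N * (N + 1)) * H - 3.0 * (N + 1)

pvals = chi2.sf(H, df=2)      # assumes all 3 genotype groups are present
print("smallest p-value:", pvals.min())
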
  • 15
    Publication Date: 2014-01-19
    Description: Background: Glioblastoma is the most aggressive primary central nervous system tumor and carries a very poor prognosis. Invasion precludes effective treatment and virtually assures tumor recurrence. In the current study, we applied analytical and bioinformatics approaches to identify a set of microRNAs (miRs) from several different human glioblastoma cell lines that exhibit significant differential expression between migratory (edge) and migration-restricted (core) cell populations. The hypothesis of the study is that differential expression of miRs provides an epigenetic mechanism to drive cell migration and invasion. Results: Our research data comprise gene expression values for a set of 805 human miRs collected from matched pairs of migratory and migration-restricted cell populations from seven different glioblastoma cell lines. We identified 62 down-regulated and 2 up-regulated miRs that exhibit significant differential expression in the migratory (edge) cell population compared to matched migration-restricted (core) cells. We then conducted target prediction and pathway enrichment analysis with these miRs to investigate potential associated gene and pathway targets. Several miRs in the list appear to directly target apoptosis-related genes. The analysis identifies a set of genes that are predicted by 3 different algorithms, further emphasizing the potential role of these miRs in promoting glioblastoma. Conclusions: The results of this study identify a set of miRs with potentially decreased expression in invasive glioblastoma cells. The verification of these miRs and their associated target proteins provides new insights for further investigation into therapeutic interventions. The methodological approaches employed here could be applied to the study of other diseases to provide biomedical researchers and clinicians with increased opportunities for therapeutic interventions.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 16
    Publication Date: 2014-01-21
    Description: Background: The comparative modeling approach to protein structure prediction inherently relies on a template structure. Before a model can be built, such a template protein has to be found and aligned with the query sequence. Any error made at this stage may dramatically affect the quality of the result. There is a need, therefore, to develop accurate and sensitive alignment protocols. Results: The BioShell threading software is a versatile tool for aligning protein structures, protein sequences or sequence profiles and query sequences to template structures. The software is also capable of suboptimal alignment generation. It can be executed as an application from the UNIX command line, or as a set of Java classes called from a script or a Java application. The implemented Monte Carlo search engine greatly facilitates the development and benchmarking of new alignment scoring schemes, even when the functions exhibit non-deterministic polynomial-time complexity. Conclusions: Numerical experiments indicate that the new threading application offers good template detection abilities and provides much better alignments than other methods. The package along with documentation and examples is available at: http://bioshell.pl/threading3d
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 17
    Publication Date: 2014-01-16
    Description: Background: Independent data sources can be used to augment post-marketing drug safety signal detection. The vast amount of publicly available biomedical literature contains rich side effect information for drugs at all clinical stages. In this study, we present a large-scale signal boosting approach that combines over 4 million records in the US Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS) and over 21 million biomedical articles. Results: The datasets comprise 4,285,097 records from FAERS and 21,354,075 MEDLINE articles. We first extracted all drug-side effect (SE) pairs from FAERS. Our study implemented a total of seven signal ranking algorithms. We then compared these different ranking algorithms before and after they were boosted with signals from MEDLINE sentences or abstracts. Finally, we manually curated all drug-cardiovascular (CV) pairs that appeared in both data sources and investigated whether our approach can detect many true signals that have not been included in FDA drug labels. We extracted a total of 2,787,797 drug-SE pairs from FAERS with a low initial precision of 0.025. The ranking algorithm combining signals from both FAERS and MEDLINE significantly improved the precision from 0.025 to 0.371 for top-ranked pairs, representing a 13.8-fold elevation in precision. We showed by manual curation that drug-SE pairs that appeared in both data sources were highly enriched with true signals, many of which have not yet been included in FDA drug labels. Conclusions: We have developed an efficient and effective drug safety signal ranking and strengthening approach. We demonstrate that combining information from FAERS and the biomedical literature at large scale can significantly contribute to drug safety surveillance.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
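
The abstract does not name the seven ranking algorithms, so the proportional reporting ratio (PRR), a classic disproportionality statistic for spontaneous-report data, stands in here as a representative example with invented counts.

# Sketch of one classic signal-ranking statistic for spontaneous-report
# data. Not necessarily one of the paper's seven algorithms.
def prr(a, b, c, d):
    """
    2x2 contingency counts for a drug-side-effect pair:
      a: reports with drug and SE      b: drug, other SEs
      c: other drugs with this SE      d: other drugs, other SEs
    """
    return (a / (a + b)) / (c / (c + d))

# Hypothetical counts: 40 reports of the SE among 1,000 for the drug,
# versus 200 among 100,000 for all other drugs.
score = prr(40, 960, 200, 99800)
print(f"PRR = {score:.1f}")   # 20.0: the SE is reported 20x more often
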
  • 18
    Publication Date: 2014-01-16
    Description: Background: Computational methods for the prediction of protein features from sequence are a long-standing focus of bioinformatics. A key observation is that several protein features are closely inter-related, that is, they are conditioned on each other. Researchers invested a lot of effort into designing predictors that exploit this fact. Most existing methods leverage inter-feature constraints by including known (or predicted) correlated features as inputs to the predictor, thus conditioning the result. Results: By including correlated features as inputs, existing methods only rely on one side of the relation: the output feature is conditioned on the known input features. Here we show how to jointly improve the outputs of multiple correlated predictors by means of a probabilistic-logical consistency layer. The logical layer enforces a set of weighted first-order rules encoding biological constraints between the features, and improves the raw predictions so that they least violate the constraints. In particular, we show how to integrate three stand-alone predictors of correlated features: subcellular localization (Loctree [J Mol Biol 348:85-100, 2005]), disulfide bonding state (Disulfind [Nucleic Acids Res 34:W177-W181, 2006]), and metal bonding state (MetalDetector [Bioinformatics 24:2094-2095, 2008]), in a way that takes into account the respective strengths and weaknesses, and does not require any change to the predictors themselves. We also compare our methodology against two alternative refinement pipelines based on state-of-the-art sequential prediction methods. Conclusions: The proposed framework is able to improve the performance of the underlying predictors by removing rule violations. We show that different predictors offer complementary advantages, and our method is able to integrate them using non-trivial constraints, generating more consistent predictions. In addition, our framework is fully general, and could in principle be applied to a vast array of heterogeneous predictions without requiring any change to the underlying software. On the other hand, the alternative strategies are more specific and tend to favor one task at the expense of the others, as shown by our experimental evaluation. The ultimate goal of our framework is to seamlessly integrate full prediction suites, such as Distill [BMC Bioinformatics 7:402, 2006] and PredictProtein [Nucleic Acids Res 32:W321-W326, 2004].
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 19
    Publication Date: 2014-01-14
    Description: Background: Logos are commonly used in molecular biology to provide a compact graphical representation of the conservation pattern of a set of sequences. They render the information contained in sequence alignments or profile hidden Markov models by drawing a stack of letters for each position, where the height of the stack corresponds to the conservation at that position, and the height of each letter within a stack depends on the frequency of that letter at that position. Results: We present a new tool and web server, called Skylign, which provides a unified framework for creating logos for both sequence alignments and profile hidden Markov models. In addition to static image files, Skylign creates a novel interactive logo plot for inclusion in web pages. These interactive logos enable scrolling, zooming, and inspection of underlying values. Skylign can avoid sampling bias in sequence alignments by down-weighting redundant sequences and by combining observed counts with informed priors. It also simplifies the representation of gap parameters, and can optionally scale letter heights based on alternate calculations of the conservation of a position. Conclusion: Skylign is available as a website, a scriptable web service with a RESTful interface, and as a software package for download. Skylign's interactive logos are easily incorporated into a web page with just a few lines of HTML markup. Skylign may be found at http://skylign.org.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
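
For reference, the conventional information-content calculation behind logo letter heights (for a sequence alignment; Skylign extends this to profile HMMs, priors, and alternative scalings) looks like this:

import math

# Standard sequence-logo heights: the stack height of an alignment column is
# its information content, and each letter's height is its frequency times
# that stack height.
def stack_heights(column_freqs, alphabet_size=4):
    entropy = -sum(f * math.log2(f) for f in column_freqs.values() if f > 0)
    info = math.log2(alphabet_size) - entropy     # bits of information
    return {letter: f * info for letter, f in column_freqs.items()}

# A DNA column that is 70% A, 10% each of C, G, T:
print(stack_heights({"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}))
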
  • 20
    Publication Date: 2014-01-15
    Description: Background: Gene set analysis (GSA) is useful in deducing the biological significance of gene lists using a priori defined gene sets such as gene ontology (GO) or pathways. Phenotypic annotation is sparse for human genes, but is far more abundant for other model organisms such as mouse, fly, and worm. Often, GSA needs to be done highly interactively by combining or modifying gene lists or inspecting gene-gene interactions in a molecular network. Description: We developed gsGator, a web-based platform for functional interpretation of gene sets with useful features such as cross-species GSA, simultaneous analysis of multiple gene sets, and a fully integrated network viewer for visualizing both GSA results and molecular networks. An extensive set of gene annotation information is amassed, including GO & pathways, genomic annotations, protein-protein interaction, transcription factor-target (TF-target), miRNA targeting, and phenotype information for various model organisms. By combining the functionalities of Set Creator, Set Operator and Network Navigator, users can perform highly flexible and interactive GSA by creating new gene lists from any combination of existing gene sets (intersection, union and difference) or by expanding genes interactively along molecular networks such as protein-protein interaction and TF-target. We also demonstrate the utility of the interactive and cross-species GSA implemented in gsGator with several usage examples for interpreting genome-wide association study (GWAS) results. gsGator is freely available at http://gsGator.ewha.ac.kr. Conclusions: Interactive and cross-species GSA in gsGator greatly extends the scope and utility of GSA, leading to novel insights via conserved functional gene modules across different species.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 21
    Publication Date: 2014-01-15
    Description: Background: Interpretation of the binding modes of protein-small ligand complexes from 3D structure data is essential for understanding selective ligand recognition by proteins. It is often performed by visual inspection and sometimes largely depends on a priori knowledge about typical interactions such as hydrogen bonds and pi-pi stacking. Because this can introduce biases due to scientists' subjective perspectives, more objective viewpoints considering a wide range of interactions are required. Description: In this paper, we present a web server for analyzing protein-small ligand interactions on the basis of patterns of atomic contacts, or "interaction patterns", obtained from the statistical analyses of 3D structures of protein-ligand complexes in our previous study. This server can guide visual inspection by providing information about interaction patterns for each atomic contact in 3D structures. Users can visually investigate which atomic contacts in user-specified 3D structures of protein-small ligand complexes are statistically overrepresented. The server consists of two main components: "Complex Analyzer" and "Pattern Viewer". The former provides a 3D structure viewer with annotations of interacting amino acid residues, ligand atoms, and interacting pairs of these. In the annotations of interacting pairs, the assignment of each contact to an interaction pattern and the statistical preferences of the patterns are presented. "Pattern Viewer" provides details of each interaction pattern. Users can see visual representations of the probability density functions of interactions, and a list of protein-ligand complexes showing similar interactions. Conclusions: Using our new web server, GIANT, which is freely available at http://giant.hgc.jp/, users can interactively analyze protein-small ligand binding modes with statistically determined interaction patterns rather than relying on a priori knowledge.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 22
    Publication Date: 2014-01-16
    Description: Background: Different methods have been proposed for analyzing differentially expressed (DE) genes in microarray data. Methods based on statistical tests that incorporate expression level variability are used more commonly than those based on fold change (FC). However, FC-based results are more reproducible and biologically relevant. Results: We propose a new method based on fold change rank ordering statistics (FCROS). We exploit the variation in calculated FC levels using combinatorial pairs of biological conditions in the datasets. A statistic is associated with the ranks of the FC values for each gene, and the resulting probability is used to identify the DE genes within an error level. The FCROS method is deterministic, requires a low computational runtime and also solves the multiple-testing problem that usually arises with microarray datasets. Conclusion: We compared the performance of FCROS with those of other methods using synthetic and real microarray datasets. We found that FCROS is well suited for DE gene identification from noisy datasets when compared with existing FC-based methods.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
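
A minimal sketch of the fold-change rank ordering idea on simulated data; the closing normal approximation is a simplification, not the paper's actual probability calculation.

import numpy as np
from itertools import product
from scipy.stats import rankdata, norm

# Rank genes by fold change in every control/treatment sample pair, then
# test how extreme each gene's average rank is.
rng = np.random.default_rng(5)
n_genes, n_ctrl, n_trt = 1000, 4, 4
ctrl = rng.lognormal(size=(n_genes, n_ctrl))
trt = rng.lognormal(size=(n_genes, n_trt))
trt[:10] *= 4.0                                  # 10 truly up-regulated genes

# Rank genes (1..n_genes) by fold change for every combinatorial sample pair.
ranks = [
    rankdata(trt[:, j] / ctrl[:, i])
    for i, j in product(range(n_ctrl), range(n_trt))
]
avg_rank = np.mean(ranks, axis=0) / n_genes      # scaled to (0, 1]

# Genes whose average rank is extreme on either tail are called DE.
z = (avg_rank - avg_rank.mean()) / avg_rank.std()
de = np.where(norm.sf(np.abs(z)) * 2 < 0.01)[0]
print("DE candidates:", de[:20])
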
  • 23
    Publication Date: 2014-01-23
    Description: Background: The interest of the scientific community in investigating the impact of rare variants on complex traits has stimulated the development of novel statistical methodologies for association studies. The fact that many of the recently proposed methods for association studies suffer from low power to identify a genetic association motivates the incorporation of prior knowledge into statistical tests. Results: In this article we propose a methodology to incorporate prior information into the region-based score test. Within our framework prior information is used to partition variants within a region into several groups, following which asymptotically independent group statistics are constructed and then combined into a global test statistic. Under the null hypothesis the distribution of our test statistic has fewer degrees of freedom compared with that of the region-based score statistic. Theoretical power comparison, population genetics simulations and results from analysis of the GAW17 sequencing data set suggest that under some scenarios our method may perform as well as or outperform the score test and other competing methods. Conclusions: An approach which uses prior information to improve the power of the region-based score test is proposed. Theoretical power comparison, population genetics simulations and the results of GAW17 data analysis showed that for some scenarios the power of our method is comparable to or higher than that of the score test and other methods.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 24
    Publication Date: 2014-01-24
    Description: Background: An ion mobility (IM) spectrometer coupled with a multi-capillary column (MCC) measures volatile organic compounds (VOCs) in the air or in exhaled breath. This technique is utilized in several biotechnological and medical applications. Each peak in an MCC/IM measurement represents a certain compound, which may be known or unknown. For clustering and classification of measurements, the raw data matrix must be reduced to a set of peaks. Each peak is described by its coordinates (retention time in the MCC and reduced inverse ion mobility) and shape (signal intensity, further shape parameters). This fundamental step is referred to as peak extraction. It is the basis for identifying discriminating peaks, and hence putative biomarkers, between two classes of measurements, such as a healthy control group and a group of patients with a confirmed disease. Current state-of-the-art peak extraction methods require human interaction, such as hand-picking approximate peak locations, assisted by a visualization of the data matrix. In a high-throughput context, however, it is preferable to have robust methods for fully automated peak extraction. Results: We introduce PEAX, a modular framework for automated peak extraction. The framework consists of several steps in a pipeline architecture. Each step performs a specific sub-task and can be instantiated by different methods implemented as modules. We provide open-source software for the framework and several modules for each step. Additionally, an interface that allows easy extension by a new module is provided. Combining the modules in all reasonable ways leads to a large number of peak extraction methods. We evaluate all combinations using intrinsic error measures and by comparing the resulting peak sets with an expert-picked one. Conclusions: Our software PEAX is able to automatically extract peaks from MCC/IM measurements within a few seconds. The automatically obtained results keep up with the results provided by current state-of-the-art peak extraction methods. This opens up high-throughput applications for the MCC/IM field. Our software is available at http://www.rahmannlab.de/research/ims.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
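
The pipeline-of-modules architecture can be sketched generically. The step names below only loosely echo PEAX's stages, and the module bodies are stubs.

from typing import Callable, List

# Each module transforms a shared state dict; the pipeline is just an
# ordered list of interchangeable modules.
Module = Callable[[dict], dict]

def denoise(state: dict) -> dict:
    # a real module would smooth state["matrix"] here
    return state

def detect_candidates(state: dict) -> dict:
    state["candidates"] = [(42.0, 0.55)]         # (retention time, mobility) stubs
    return state

def model_peaks(state: dict) -> dict:
    state["peaks"] = [{"rt": rt, "mobility": m, "intensity": 1.0}
                      for rt, m in state["candidates"]]
    return state

def run_pipeline(modules: List[Module], state: dict) -> dict:
    for m in modules:                            # swap in another detector, etc.
        state = m(state)
    return state

result = run_pipeline([denoise, detect_candidates, model_peaks],
                      {"matrix": [[0.0]]})
print(result["peaks"])
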
  • 25
    Publication Date: 2014-03-12
    Description: Contributing reviewers. The editors of BMC Bioinformatics would like to thank all our reviewers who have contributed their time to the journal in Volume 14 (2013).
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 26
    Publication Date: 2014-03-13
    Description: Background: Mandatory deposit of raw microarray data files for public access, prior to study publication, provides significant opportunities to conduct new bioinformatics analyses within and across multiple datasets. Analysis of raw microarray data files (e.g. Affymetrix .cel files) can be time consuming, complex, and requires fundamental computational and bioinformatics skills. The development of analytical workflows to automate these tasks simplifies the processing of, improves the efficiency of, and serves to standardize multiple and sequential analyses. Once installed, workflows facilitate the tedious steps required to run rapid intra- and inter-dataset comparisons. Results: We developed a workflow to facilitate and standardize Meta-Analysis of Affymetrix Microarray Data (MAAMD) in Kepler. Two freely available stand-alone software tools, R and AltAnalyze, were embedded in MAAMD. The inputs of MAAMD are user-editable csv files, which contain sample information and parameters describing the locations of input files and required tools. MAAMD was tested by analyzing 4 different GEO datasets from mice and Drosophila. MAAMD automates data downloading, data organization, data quality control assessment, differential gene expression analysis, clustering analysis, pathway visualization, gene-set enrichment analysis, and cross-species orthologous-gene comparisons. MAAMD was utilized to identify gene orthologues responding to hypoxia or hyperoxia in both mice and Drosophila. The entire set of analyses for 4 datasets (34 total microarrays) finished in about one hour. Conclusions: MAAMD saves time, minimizes the required computer skills, and offers a standardized procedure for users to analyze microarray datasets and make new intra- and inter-dataset comparisons.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 27
    Publication Date: 2014-03-15
    Description: Background: Modeling high-dimensional data involving thousands of variables is particularly important for gene expression profiling experiments; nevertheless, it remains a challenging task. One of the challenges is to implement an effective method for selecting a small set of relevant genes buried in high-dimensional noise. RELIEF is a popular and widely used approach for feature selection owing to its low computational cost and high accuracy. However, RELIEF-based methods suffer from instability, especially in the presence of noisy and/or high-dimensional outliers. Results: We propose an innovative feature weighting algorithm, called LHR, to select informative genes from highly noisy data. LHR is based on RELIEF for feature weighting using classical margin maximization. The key idea of LHR is to estimate the feature weights through local approximation rather than global measurement, which is typically used in existing methods. The weights obtained by our method are very robust to degradation from noisy features, even those with vast dimensions. To demonstrate the performance of our method, extensive experiments involving classification tests have been carried out on both synthetic and real microarray benchmark datasets by combining the proposed technique with standard classifiers, including the support vector machine (SVM), k-nearest neighbor (KNN), hyperplane k-nearest neighbor (HKNN), linear discriminant analysis (LDA) and naive Bayes (NB). Conclusion: Experiments on both synthetic and real-world datasets demonstrate the superior performance of the proposed feature selection method combined with supervised learning in three aspects: 1) high classification accuracy, 2) excellent robustness to noise and 3) good stability across various classification algorithms.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
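
LHR itself is not reproduced here, but the classic RELIEF weight update it builds on is short enough to sketch on toy data: a feature gains weight when it separates an instance from its nearest miss better than from its nearest hit.

import numpy as np

# Classic RELIEF: for each instance, find the nearest same-class neighbor
# (hit) and nearest other-class neighbor (miss), then update feature weights
# by |x - miss| - |x - hit|.
rng = np.random.default_rng(6)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = (X[:, 0] > 0).astype(int)              # only feature 0 is informative

w = np.zeros(p)
for i in range(n):
    d = np.abs(X - X[i]).sum(axis=1)       # L1 distances to all instances
    d[i] = np.inf                          # exclude the instance itself
    hit = np.argmin(np.where(y == y[i], d, np.inf))
    miss = np.argmin(np.where(y != y[i], d, np.inf))
    w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])

print("feature weights:", np.round(w / n, 2))   # feature 0 should dominate
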
  • 28
    Publication Date: 2014-03-15
    Description: Background: Detecting transposition events of transposable elements (TEs) in the genome using short reads from next-generation sequencing (NGS) is difficult, because the nucleotide sequence of a TE is itself repetitive, making it hard for NGS alignment programs to identify the locations of insertions. We have developed a program with a new algorithm to detect transpositions from NGS data. Results: In the process of tool development, we used NGS data of derivative lines (ttm2 and ttm5) of japonica rice cv. Nipponbare, regenerated through cell culture. The new program, called transposon insertion finder (TIF), was applied to detect the de novo transpositions of Tos17 in the regenerated lines. TIF searched 300 million reads of a line within 20 min, identifying 4 and 12 de novo transpositions in the ttm2 and ttm5 lines, respectively. All of the transpositions were confirmed by PCR/electrophoresis and sequencing. Using the program, we also detected new transposon insertions of the P element from NGS data of Drosophila melanogaster. Conclusion: TIF can find the transpositions of any element provided that target site duplications (TSDs) are generated by its transposition.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
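
A toy illustration of the target-site-duplication signature that TIF exploits; the coordinates and sequence are invented, and real TIF derives the two junction positions from reads that match the TE head and tail.

# TSD signature: reads containing the TE head map up to some reference
# position, reads containing the TE tail resume a few bases *earlier*, and
# the overlap between the two junction positions is the TSD.
reference = "TTACGGATCATGCAGGTTAACC"

head_junction = 12   # last ref base before the TE in head-junction reads
tail_junction = 7    # first ref base after the TE in tail-junction reads

if tail_junction < head_junction:
    tsd = reference[tail_junction:head_junction]
    print(f"insertion at ~{tail_junction}, TSD = {tsd!r} "
          f"({len(tsd)} bp duplicated on both sides of the TE)")
else:
    print("no TSD overlap: junctions do not indicate a transposition")
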
  • 29
    Publication Date: 2014-05-04
    Description: Background: The continued democratization of DNA sequencing has sparked a new wave of development of genome assembly and assembly validation methods. As individual research labs, rather than centralized centers, begin to sequence the majority of new genomes, it is important to establish best practices for genome assembly. However, recent evaluations such as GAGE and the Assemblathon have concluded that there is no single best approach to genome assembly. Instead, it is preferable to generate multiple assemblies and validate them to determine which is most useful for the desired analysis; this is a labor-intensive process that is often infeasible. Results: To encourage best practices supported by the community, we present iMetAMOS, an automated ensemble assembly pipeline; iMetAMOS encapsulates the process of running, validating, and selecting a single assembly from multiple assemblies. iMetAMOS packages several leading open-source tools into a single binary that automates parameter selection and execution of multiple assemblers, scores the resulting assemblies based on multiple validation metrics, and annotates the assemblies for genes and contaminants. We demonstrate the utility of the ensemble process on 225 previously unassembled Mycobacterium tuberculosis genomes as well as a Rhodobacter sphaeroides benchmark dataset. On these real data, iMetAMOS reliably produces validated assemblies and identifies potential contamination without user intervention. In addition, intelligent parameter selection produces assemblies of R. sphaeroides comparable to or exceeding the quality of those from the GAGE-B evaluation, affecting the relative ranking of some assemblers. Conclusions: Ensemble assembly with iMetAMOS provides users with multiple, validated assemblies for each genome. Although computationally limited to small or mid-sized genomes, this approach is the most effective and reproducible means for generating high-quality assemblies and enables users to select an assembly best tailored to their specific needs.
    Electronic ISSN: 1471-2105
    Topics: Biology, Computer Science
    Published by BioMed Central
  • 30
    Publication Date: 2014-05-03
    Description: Background: Computational discovery of microRNAs (miRNA) is based on pre-determined sets of features from miRNA precursors (pre-miRNA). Some feature sets are composed of sequence-structure patterns commonly found in pre-miRNAs, while others are a combination of more sophisticated RNA features. In this work, we analyze the discriminant power of seven feature sets, which are used in six pre-miRNA prediction tools. The analysis is based on the classification performance achieved with these feature sets for the training algorithms used in these tools. We also evaluate feature discrimination through the F-score and feature importance in the induction of random forests. Results: Small or non-significant differences were found among the estimated classification performances of classifiers induced using sets with diversification of features, despite the wide differences in their dimension. Inspired by these results, we obtained a lower-dimensional feature set, which achieved a sensitivity of 90% and a specificity of 95%. These estimates are within 0.1% of the maximal values obtained with any feature set (SELECT, Section 'Results and discussion') while it is 34 times faster to compute. Even compared to another feature set (FS2, see Section 'Results and discussion'), which is the computationally least expensive feature set of those from the literature which perform within 0.1% of the maximal values, it is 34 times faster to compute. The results obtained by the tools used as references in the experiments carried out showed that five out of these six tools have lower sensitivity or specificity. Conclusion: In miRNA discovery the number of putative miRNA loci is in the order of millions. Analysis of putative pre-miRNAs using a computationally expensive feature set would be wasteful or even unfeasible for large genomes. In this work, we propose a relatively inexpensive feature set and explore most of the learning aspects implemented in current ab-initio pre-miRNA prediction tools, which may lead to the development of efficient ab-initio pre-miRNA discovery tools. The material to reproduce the main results from this paper can be downloaded from http://bioinformatics.rutgers.edu/Static/Software/discriminant.tar.gz.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 31
    Publication Date: 2014-05-07
    Description: Background: It is currently accepted that the perturbation of complex intracellular networks, rather than the dysregulation of a single gene, is the basis for phenotypical diversity. High-throughput gene expression data make it possible to investigate changes in gene expression profiles among different conditions. Recently, many efforts have been made to identify which biological pathways are perturbed, given a list of differentially expressed genes (DEGs). In order to understand these mechanisms, it is necessary to unveil the variation of genes in relation to each other, considering the different phenotypes. In this paper, we illustrate a pipeline, based on Structural Equation Modeling (SEM), that allows pathway modules to be investigated, considering not only deregulated genes but also the connections between the perturbed ones. Results: The procedure was tested on microarray experiments relative to two neurological diseases: frontotemporal lobar degeneration with ubiquitinated inclusions (FTLD-U) and multiple sclerosis (MS). Starting from DEGs and dysregulated biological pathways, a model for each pathway was generated using database information from STRING and KEGG, in order to describe how DEGs are connected in a causal structure. Subsequently, SEM analysis tested whether pathways differ globally, between groups, and for specific path relationships. The results confirmed the importance of certain genes in the analyzed diseases, and unveiled which connections among them are modified. Conclusions: We propose a framework to perform differential gene expression analysis on microarray data based on SEM, which is able to: 1) find relevant genes and perturbed biological pathways; 2) investigate putative sub-pathway models based on the concept of a disease module; 3) test and improve the generated models; 4) identify differential expression of a single gene and differential connections between pairs of genes. This could shed light not only on the mechanisms affecting variations in gene expression, but also on the causes of gene-gene relationship modifications in diseased phenotypes.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 32
    Publication Date: 2014-03-20
    Description: Background: Chromothripsis, a newly discovered type of complex genomic rearrangement, has been implicated in the evolution of several types of cancers. To date, it has been described in bone cancer, SHH-medulloblastoma and acute myeloid leukemia, amongst others; however, there are still no formal or automated methods for detecting or annotating it in high-throughput sequencing data. As such, findings of chromothripsis are difficult to compare and many cases likely escape detection altogether. Results: We introduce ShatterProof, a software tool for detecting and quantifying chromothriptic events. ShatterProof takes structural variation calls (translocations, copy-number variations, short insertions and loss of heterozygosity) produced by any algorithm and, using an operational definition of chromothripsis, performs robust statistical tests to accurately predict the presence and location of chromothriptic events. Validation of our tool was conducted using clinical data sets, including matched normal and prostate cancer samples, in addition to the colorectal cancer and SCLC data sets used in the original description of chromothripsis. Conclusions: ShatterProof is computationally efficient, having low memory requirements and near-linear computation time. This allows it to become a standard component of sequencing analysis pipelines, enabling researchers to routinely and accurately assess samples for chromothripsis. Source code and documentation can be found at http://search.cpan.org/~sgovind/Shatterproof.
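ShatterProof's actual statistics are more elaborate, but the core intuition, testing whether structural-variant breakpoints cluster more densely in a region than a uniform background would allow, can be sketched as follows. The breakpoint positions, window and genome length are invented.

```python
from scipy.stats import poisson

def breakpoint_cluster_pvalue(breakpoints, window_start, window_end,
                              genome_length):
    """Toy test: P-value of seeing >= the observed number of breakpoints in a
    window if breakpoints fell uniformly (Poisson) across the genome.
    Not ShatterProof's actual statistics, just the flavour of the idea."""
    in_window = sum(window_start <= b < window_end for b in breakpoints)
    rate = len(breakpoints) * (window_end - window_start) / genome_length
    return poisson.sf(in_window - 1, rate)  # P(X >= in_window)

bps = [1_000, 5_000, 5_200, 5_400, 5_600, 5_900, 80_000]
print(breakpoint_cluster_pvalue(bps, 5_000, 6_000, 100_000))
```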
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 33
    Publication Date: 2014-03-20
    Description: Background: Metagenomics, based on culture-independent sequencing, is a well-suited approach to provide insights into the composition, structure and dynamics of environmental viral communities. Following recent advances in sequencing technologies, new challenges arise for existing bioinformatic tools dedicated to viral metagenome (i.e. virome) analysis as (i) the number of viromes is rapidly growing and (ii) large genomic fragments can now be obtained by assembling the huge amount of sequence data generated for each metagenome. Results: To face these challenges, a new version of Metavir was developed. First, all Metavir tools have been adapted to support comparative analysis of viromes in order to improve the analysis of multiple datasets. In addition to the sequence comparison previously provided, viromes can now be compared through their k-mer frequencies, their taxonomic compositions, recruitment plots and phylogenetic trees containing sequences from different datasets. Second, a new section has been specifically designed to handle assembled viromes made of thousands of large genomic fragments (i.e. contigs). This section includes an annotation pipeline for uploaded viral contigs (gene prediction, similarity search against reference viral genomes and protein domains) and an extensive comparison between contigs and reference genomes. Contigs and their annotations can be explored on the website through specifically developed dynamic genomic maps and interactive networks. Conclusions: The new features of Metavir 2 allow users to explore and analyze viromes composed of raw reads or assembled fragments through a set of adapted tools and a user-friendly interface.
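As a toy illustration of the k-mer frequency comparison that Metavir 2 adds, the sketch below builds normalized tetranucleotide profiles for two placeholder "viromes" and compares them with cosine similarity; the sequences are invented, and real viromes would be processed read by read.

```python
from collections import Counter
from itertools import product
import math

def kmer_profile(seq, k=4):
    """Normalized k-mer frequency vector over a fixed k-mer ordering."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(1, len(seq) - k + 1)
    return [counts[km] / total for km in kmers]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

virome_a = "ACGTACGTGGCCTTAAGGCCACGT" * 20   # placeholder reads, concatenated
virome_b = "TTGGCCAAACGTACGTACGTTTAA" * 20
print(cosine(kmer_profile(virome_a), kmer_profile(virome_b)))
```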
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 34
    Publication Date: 2014-03-12
    Description: Background: Information about drug-target relations is at the heart of drug discovery. There are now dozens of databases providing drug-target interaction data with varying scope and focus; moreover, due to the large chemical space, the overlap of the different data sets is surprisingly small. As searching through these sources manually is cumbersome, time-consuming and error-prone, integrating all the data is highly desirable. Despite a few attempts, integration has been hampered by the diversity of descriptions of compounds, and by the fact that the reported activity values, coming from different data sets, are not always directly comparable due to the usage of different metrics or data formats. Description: We have built Drug2Gene, a knowledge base which combines the compound/drug-gene/protein information from 19 publicly available databases. A key feature is our rigorous unification and standardization process, which makes the data truly comparable on a large scale, allowing for the first time effective data mining in such a large knowledge corpus. As of version 3.2, Drug2Gene contains 4,372,290 unified relations between compounds and their targets, most of which include reported bioactivity data. We extend this set with putative (i.e. homology-inferred) relations where sufficient sequence homology between proteins suggests they may bind similar compounds. Drug2Gene provides powerful search functionalities, very flexible export procedures, and a user-friendly web interface. Conclusions: Drug2Gene v3.2 has become a mature and comprehensive knowledge base providing unified, standardized drug-target related information gathered from publicly available data sources. It can be used to integrate proprietary data sets with publicly available data sets. Its main goal is to be a 'one-stop shop' for identifying tool compounds targeting a given gene product or for finding all known targets of a drug. Drug2Gene, with its integrated data set of public compound-target relations, is freely accessible without restrictions at http://www.drug2gene.com.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 35
    Publication Date: 2014-03-05
    Description: Background: Many biomedical relation extraction systems are machine-learning based and have to be trained on large annotated corpora that are expensive and cumbersome to construct. We developed a knowledge-based relation extraction system that requires minimal training data, and applied the system for the extraction of adverse drug events from biomedical text. The system consists of a concept recognition module that identifies drugs and adverse effects in sentences, and a knowledge-base module that establishes whether a relation exists between the recognized concepts. The knowledge base was filled with information from the Unified Medical Language System. The performance of the system was evaluated on the ADE corpus, consisting of 1644 abstracts with manually annotated adverse drug events. Fifty abstracts were used for training, the remaining abstracts were used for testing. Results: The knowledge-based system obtained an F-score of 50.5%, which was 34.4 percentage points better than the co-occurrence baseline. Increasing the training set to 400 abstracts improved the F-score to 54.3%. When the system was compared with a machine-learning system, jSRE, on a subset of the sentences in the ADE corpus, our knowledge-based system achieved an F-score that is 7 percentage points higher than the F-score of jSRE trained on 50 abstracts, and still 2 percentage points higher than jSRE trained on 90% of the corpus. Conclusion: A knowledge-based approach can be successfully used to extract adverse drug events from biomedical text without need for a large training set. Whether use of a knowledge base is equally advantageous for other biomedical relation-extraction tasks remains to be investigated.
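A minimal sketch of the two-module design described above, with a toy dictionary recognizer and a toy knowledge base standing in for the UMLS-derived resources; all terms and the knowledge-base pair are invented.

```python
# Toy knowledge-based extraction: recognize concepts with a dictionary,
# then accept a relation only if the pair exists in the knowledge base.
DRUGS = {"ibuprofen", "warfarin"}
EFFECTS = {"nausea", "bleeding"}
KNOWLEDGE_BASE = {("warfarin", "bleeding")}   # hypothetical curated pairs

def extract_adverse_events(sentence):
    tokens = {t.strip(".,").lower() for t in sentence.split()}
    found_drugs, found_effects = tokens & DRUGS, tokens & EFFECTS
    return [(d, e) for d in found_drugs for e in found_effects
            if (d, e) in KNOWLEDGE_BASE]

print(extract_adverse_events("Warfarin was stopped after severe bleeding."))
```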
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 36
    Publication Date: 2014-04-27
    Description: Background: Complex designs are common in (observational) clinical studies. Sequencing data for such studies are produced more and more often, implying challenges for the analysis, such as an excess of zeros, the presence of random effects, and multi-parameter inference. Moreover, when sample sizes are small, inference is likely to be too liberal when an inappropriate prior is applied in a Bayesian setting, or to lack power when information is not carefully borrowed across features. Results: We show on microRNA sequencing data from a clinical cancer study how our software ShrinkBayes tackles the aforementioned challenges. In addition, we illustrate its comparatively good performance on multi-parameter inference for groups using a data-based simulation. Finally, in the small sample size setting, we demonstrate its high power and improved FDR estimation by use of Gaussian mixture priors that include a point mass. Conclusion: ShrinkBayes is a versatile software package for the analysis of count-based sequencing data, which is particularly useful for studies with small sample sizes or complex designs.
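The abstract refers to Gaussian mixture priors that include a point mass. The paper's exact parametrization is not given here, but such spike-and-slab priors generically take the form below, mixing a point mass at zero with K Gaussian components.

```latex
% Generic spike-and-slab mixture prior with a point mass at zero;
% ShrinkBayes' exact parametrization may differ.
\pi(\beta) \;=\; p_0\,\delta_0(\beta) \;+\; (1-p_0)\sum_{k=1}^{K} w_k\,
\mathcal{N}\!\bigl(\beta;\,\mu_k,\,\tau_k^{2}\bigr),
\qquad \sum_{k=1}^{K} w_k = 1 .
```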
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 37
    Publication Date: 2014-04-27
    Description: A common class of biomedical analysis is to explore expression data from high-throughput experiments for the purpose of uncovering functional relationships that can lead to a hypothesis about mechanisms of a disease. We call this analysis expression-driven -omics hypothesizing. In it, scientists use interactive data visualizations and read deeply in the research literature. Little is known, however, about the actual flow of reasoning and behaviors (sensemaking) that scientists enact in this analysis, end-to-end. Understanding this flow is important because if bioinformatics tools are to be truly useful they must support it. Sensemaking models of visual analytics in other domains have been developed and used to inform the design of useful and usable tools. We believe they would be helpful in bioinformatics. To characterize the sensemaking involved in expression-driven -omics hypothesizing, we conducted an in-depth observational study of one scientist as she engaged in this analysis over six months. From the findings, we abstracted a preliminary sensemaking model. Here we describe its stages and suggest guidelines for developing visualization tools that we derived from this case. A single case cannot be generalized, but we offer our findings, sensemaking model and case-based tool guidelines as a first step toward increasing interest and further research in the bioinformatics field on scientists' analytical workflows and their implications for tool design.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 38
    Publication Date: 2014-04-28
    Description: Background: Periodic proteins, characterized by the presence of multiple repeats of short motifs, form an interesting and seldom-studied group. Due to often extreme divergence in sequence, detection and analysis of such motifs is performed more reliably on the structural level. Yet, few algorithms have been developed for the detection and analysis of structures of periodic proteins. Results: ConSole recognizes modularity in protein contact maps, allowing for precise identification of repeats in solenoid protein structures, an important subgroup of periodic proteins. Tests on benchmarks show that ConSole has higher recognition accuracy as compared to Raphael, the only other publicly available solenoid structure detection tool. As a next step of ConSole analysis, we show how detection of solenoid repeats in structures can be used to improve sequence recognition of these motifs and to detect subtle irregularities of repeat lengths in three solenoid protein families. Conclusions: The ConSole algorithm provides a fast and accurate tool to recognize solenoid protein structures as a whole and to identify individual solenoid repeat units from a structure. ConSole is available as a web-based, interactive server and is available for download at http://console.sanfordburnham.org.
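ConSole's module detection operates on protein contact maps. As background, a contact map can be derived from C-alpha coordinates as sketched below; the cutoff and the toy coil geometry are illustrative choices, not ConSole's internals.

```python
import numpy as np

def contact_map(ca_coords, cutoff=8.0):
    """Binary residue-residue contact map from C-alpha coordinates (N x 3).
    ConSole works on such maps; its module detection itself is more involved."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    return (dist < cutoff) & ~np.eye(len(ca_coords), dtype=bool)

# A toy helix-like coil of 30 residues.
t = np.arange(30) * 100.0 * np.pi / 180.0
coords = np.stack([2.3 * np.cos(t), 2.3 * np.sin(t), 1.5 * np.arange(30)], axis=1)
print(contact_map(coords).sum(), "contacts")
```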
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 39
    Publication Date: 2014-04-28
    Description: Background: The identification of functionally important residue positions is an important task of computational biology. Methods of correlation analysis allow for the identification of pairs of residue positions whose occupancy is mutually dependent due to constraints imposed by protein structure or function. A common measure assessing these dependencies is the mutual information, which is based on Shannon's information theory and utilizes probabilities only. Consequently, such approaches do not consider the similarity of residue pairs, which may degrade the algorithm's performance. One typical algorithm is H2r, which characterizes each individual residue position k by the conn(k)-value, the number of significantly correlated pairs it belongs to. Results: To improve the specificity of H2r, we developed a revised algorithm, named H2rs, which is based on the von Neumann entropy (vNE). To compute the corresponding mutual information, a matrix A is required, which assesses the similarity of residue pairs. We determined A by deducing substitution frequencies from contacting residue pairs observed in the homologs of 35,809 proteins, whose structure is known. By analogy with H2r, the enhanced algorithm computes a normalized conn(k)-value. Within the framework of H2rs, only statistically significant vNE values were considered. To decide on significance, the algorithm calculates a p-value by performing a randomization test for each individual pair of residue positions. The analysis of a large in silico testbed demonstrated that specificity and precision were higher for H2rs than for H2r and two other methods of correlation analysis. The gain in prediction quality is further confirmed by a detailed assessment of five well-studied enzymes. The outcome of H2rs and of a method that predicts contacting residue positions (PSICOV) overlapped only marginally. H2rs can be downloaded from www-bioinf.uni-regensburg.de. Conclusions: Considering substitution frequencies for residue pairs by means of the von Neumann entropy and a p-value improved the success rate in identifying important residue positions. The integration of proven statistical concepts and normalization allows for an easier comparison of results obtained with different proteins. Comparing the outcomes of the local method H2rs and the global method PSICOV indicates that such methods supplement each other and have different scopes of application.
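For readers unfamiliar with the von Neumann entropy at the heart of H2rs, the sketch below computes S = -Tr(rho log rho) from the eigenvalues of a trace-normalized symmetric matrix. How H2rs assembles that matrix from substitution frequencies is not reproduced here; the toy matrix is invented.

```python
import numpy as np

def von_neumann_entropy(rho):
    """S = -Tr(rho log rho) for a density-like matrix rho (symmetric, trace 1).
    Computed from the eigenvalue spectrum; 0 * log 0 is taken as 0."""
    eigvals = np.linalg.eigvalsh(rho)
    eigvals = eigvals[eigvals > 1e-12]
    return float(-(eigvals * np.log2(eigvals)).sum())

A = np.array([[0.5, 0.1],
              [0.1, 0.5]])          # toy symmetric matrix
rho = A / np.trace(A)               # normalize to trace 1
print(von_neumann_entropy(rho))     # about 0.971 bits
```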
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 40
    Publication Date: 2014-04-29
    Description: Background: It is important to predict the quality of a protein structural model before its native structure is known. Methods that can predict the absolute local quality of individual residues in a single protein model are rare, yet such predictions are particularly needed for using, ranking and refining protein models. Results: We developed a machine learning tool (SMOQ) that can predict the distance deviation of each residue in a single protein model. SMOQ uses support vector machines (SVM) with protein sequence and structural features (i.e. a basic feature set), including amino acid sequence, secondary structures, solvent accessibilities, and residue-residue contacts, to make predictions. We also trained an SVM model with two additional features (profiles and SOV scores) on 20 CASP8 targets and found that including them only improves performance when the real deviations between the native structure and the model are higher than 5 Å. The released SMOQ tool uses the basic feature set trained on 85 CASP8 targets. Moreover, SMOQ implements a way to convert predicted local quality scores into a global quality score. SMOQ was tested on the 84 CASP9 single-domain targets. The average difference between the residue-specific distance deviation predicted by our method and the actual distance deviation on the test data is 2.637 Å. The global quality prediction accuracy of the tool is comparable to other good tools on the same benchmark. Conclusion: SMOQ is a useful tool for protein single-model quality assessment. Its source code and executable are available at: http://sysbio.rnet.missouri.edu/multicom_toolbox/.
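Converting per-residue deviations into a single model score can be done in several ways; one common transform (used, e.g., in S-score/GDT-style measures) maps each deviation d_i to 1/(1+(d_i/d0)^2) and averages. SMOQ's exact conversion may differ, so treat this as a sketch; the deviations and d0 are illustrative.

```python
def global_quality(deviations, d0=3.8):
    """Convert per-residue distance deviations (in Angstroms) into a single
    score in (0, 1] via the common S-score transform s_i = 1/(1+(d_i/d0)^2).
    This is one standard conversion, not necessarily SMOQ's."""
    scores = [1.0 / (1.0 + (d / d0) ** 2) for d in deviations]
    return sum(scores) / len(scores)

print(global_quality([0.5, 1.2, 2.0, 8.0, 15.0]))  # one bad region lowers the score
```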
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 41
    Publication Date: 2014-03-04
    Description: Background: Whole-genome sequencing represents a powerful experimental tool for pathogen research. We present methods for the analysis of small eukaryotic genomes, including a streamlined system (called Platypus) for finding single nucleotide and copy number variants as well as recombination events. Results: We have validated our pipeline using four sets of Plasmodium falciparum drug resistant data containing 26 clones from 3D7 and Dd2 background strains, identifying an average of 11 single nucleotide variants per clone. We also identify 8 copy number variants with contributions to resistance, and report for the first time that all analyzed amplification events are in tandem. Conclusions: The Platypus pipeline provides malaria researchers with a powerful tool to analyze short read sequencing data. It provides an accurate way to detect SNVs using known software packages, and a novel methodology for detection of CNVs, though it does not currently support detection of small indels. We have validated that the pipeline detects known SNVs in a variety of samples while filtering out spurious data. We bundle the methods into a freely available package.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 42
    Publication Date: 2014-04-30
    Description: Background: RNA-binding proteins interact with specific RNA molecules to regulate important cellular processes. It is therefore necessary to identify the RNA interaction partners in order to understand the precise functions of such proteins. Protein-RNA interactions are typically characterized using in vivo and in vitro experiments, but these may not detect all binding partners. Therefore, computational methods that capture the protein-dependent nature of such binding interactions could help to predict potential binding partners in silico. Results: We have developed three methods to predict whether an RNA can interact with a particular RNA-binding protein using support vector machines and different features based on the sequence (the Oli method), the motif score (the OliMo method) and the secondary structure (the OliMoSS method). We applied these approaches to different experimentally derived datasets and compared the predictions with RNAcontext and RPISeq. Oli outperformed OliMoSS and RPISeq, confirming our protein-specific predictions and suggesting that tetranucleotide frequencies are appropriate discriminative features. Oli and RNAcontext were the most competitive methods in terms of the area under the curve. A precision-recall curve analysis achieved higher precision values for Oli. On a second experimental dataset including real negative binding information, Oli outperformed RNAcontext with a precision of 0.73 vs. 0.59. Conclusions: Our experiments showed that features based on primary sequence information are sufficiently discriminating to predict specific RNA-protein interactions. Sequence motifs and secondary structure information were not necessary to improve these predictions. Finally we confirmed that protein-specific experimental data concerning RNA-protein interactions are valuable sources of information that can be used for the efficient training of models for in silico predictions. The scripts are available upon request to the corresponding author.
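The Oli idea of classifying RNAs by tetranucleotide frequencies with an SVM can be sketched directly; the tiny bound/unbound training sequences below are invented and far smaller than any real training set.

```python
import numpy as np
from itertools import product
from sklearn.svm import SVC

KMERS = ["".join(p) for p in product("ACGU", repeat=4)]

def tetranucleotide_features(rna):
    """256-dimensional tetranucleotide frequency vector, as in the Oli idea."""
    counts = np.zeros(len(KMERS))
    for i in range(len(rna) - 3):
        counts[KMERS.index(rna[i:i + 4])] += 1
    return counts / max(1, len(rna) - 3)

# Hypothetical bound/unbound RNAs for one protein (real training sets are larger).
bound = ["AUGGCUAUGGCUAUGGCU", "GCUAUGGCUAUGGCUAUG"]
unbound = ["CCCCCGGGGGCCCCCGGG", "GGGGGCCCCCGGGGGCCC"]
X = np.array([tetranucleotide_features(s) for s in bound + unbound])
y = np.array([1, 1, 0, 0])
clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([tetranucleotide_features("AUGGCUAUGGCUAUGCCC")]))
```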
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 43
    Publication Date: 2014-04-30
    Description: Background: Tandem mass spectrometry-based database searching is currently the main method for protein identification in shotgun proteomics. The explosive growth of protein and peptide databases, a result of genome translations, enzymatic digestions, and post-translational modifications (PTMs), is making computational efficiency in database searching a serious challenge. Profile analysis shows that most search engines spend 50%-90% of their total time on the scoring module, and that the spectrum dot product (SDP) based scoring module is the most widely used. As general-purpose, high-performance parallel hardware, graphics processing units (GPUs) are promising platforms for speeding up database searches in the protein identification process. Results: We designed and implemented a parallel SDP-based scoring module on GPUs that exploits the efficient use of GPU registers, constant memory and shared memory. Compared with the CPU-based version, we achieved a 30 to 60 times speedup using a single GPU. We also implemented our algorithm on a GPU cluster and achieved a further favorable speedup. Conclusions: Our GPU-based SDP algorithm can significantly improve the speed of the scoring module in mass spectrometry-based protein identification. The algorithm can be easily implemented in many database search engines such as X!Tandem, SEQUEST, and pFind. A software tool implementing this algorithm is available at http://www.comp.hkbu.edu.hk/~youli/ProteinByGPU.html
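The spectrum dot product itself is a small computation, which is exactly why it parallelizes well on GPUs. A plain NumPy sketch of binning two peak lists and taking their normalized dot product follows; the bin width, mass range, and peak lists are illustrative.

```python
import numpy as np

def spectrum_dot_product(peaks_a, peaks_b, bin_width=1.0, max_mz=2000.0):
    """Bin two peak lists {m/z: intensity} onto a common grid and take the
    normalized dot product -- the core of SDP-based scoring. The GPU version
    in the paper parallelizes this inner product across many spectra."""
    n_bins = int(max_mz / bin_width)
    a, b = np.zeros(n_bins), np.zeros(n_bins)
    for mz, inten in peaks_a.items():
        a[int(mz / bin_width)] += inten
    for mz, inten in peaks_b.items():
        b[int(mz / bin_width)] += inten
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

observed = {175.1: 30.0, 263.1: 100.0, 376.2: 55.0}
theoretical = {175.1: 1.0, 263.1: 1.0, 489.3: 1.0}
print(spectrum_dot_product(observed, theoretical))
```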
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 44
    Publication Date: 2014-04-30
    Description: Background: RNA-seq and its variant differential RNA-seq (dRNA-seq) are today routine methods for transcriptome analysis in bacteria. While expression profiling and transcriptional start site prediction are standard tasks today, the problem of identifying transcriptional units in a genome-wide fashion is still not solved for prokaryotic systems. Results: We present RNASEG, an algorithm for the prediction of transcriptional units based on dRNA-seq data. A key feature of the algorithm is that, based on the data, it distinguishes between transcribed and un-transcribed genomic segments. Furthermore, the program provides many different predictions in a single run, which can be used to infer the significance of transcriptional units in a consensus procedure. We show the performance of our method based on a well-studied dRNA-seq data set for Helicobacter pylori. Conclusions: With our algorithm it is possible to identify operons and 5'- and 3'-UTRs in an automated fashion. This alleviates the need for labour intensive manual inspection and enables large-scale studies in the area of comparative transcriptomics.
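RNASEG's segmentation is driven by a statistical model of the dRNA-seq signal, not a fixed cutoff, but the kind of output it produces, contiguous transcribed segments along the genome, can be illustrated with simple coverage thresholding; the coverage vector and threshold are invented.

```python
import numpy as np

def transcribed_segments(coverage, threshold=5):
    """Toy segmentation: contiguous runs of per-base read coverage above a
    threshold become 'transcribed' segments (half-open [start, end) bounds)."""
    above = np.asarray(coverage) >= threshold
    edges = np.flatnonzero(np.diff(above.astype(int)))
    bounds = np.concatenate(([0], edges + 1, [len(above)]))
    return [(s, e) for s, e in zip(bounds[:-1], bounds[1:]) if above[s]]

cov = [0, 0, 9, 12, 14, 11, 0, 0, 0, 7, 8, 6, 0]
print(transcribed_segments(cov))   # -> [(2, 6), (9, 12)]
```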
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 45
    Publication Date: 2014-03-20
    Description: Background: CA_C2195 from Clostridium acetobutylicum is a protein of unknown function. Sequence analysis predicted that part of the protein contained a metallopeptidase-related domain. There are over 200 homologs of similar size in large sequence databases such as UniProt, with pairwise sequence identities in the range of ~40-60%. CA_C2195 was chosen for crystal structure determination for structure-based function annotation of novel protein sequence space. Results: The structure confirmed that CA_C2195 contained an N-terminal metallopeptidase-like domain. The structure revealed two extra domains: an alpha+beta domain inserted in the metallopeptidase-like domain and a C-terminal circularly permuted winged-helix-turn-helix domain. Conclusions: Based on our sequence and structural analyses using the crystal structure of CA_C2195 we provide a view into the possible functions of the protein. From contextual information from gene-neighborhood analysis, we propose that rather than being a peptidase, CA_C2195 and its homologs might play a role in biosynthesis of a modified cell-surface carbohydrate in conjunction with several sugar-modification enzymes. These results provide the groundwork for the experimental verification of the function.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 46
    Publication Date: 2014-03-20
    Description: Background: Recent efforts in HIV-1 vaccine design have focused on immunogens that evoke potent neutralizing antibody responses to a broad spectrum of viruses circulating worldwide. However, the development of effective vaccines will depend on the identification and characterization of the neutralizing antibodies and their epitopes. We developed bioinformatics methods to predict epitope networks and antigenic determinants using structural information, as well as corresponding genotypes and phenotypes generated by a highly sensitive and reproducible neutralization assay. A total of 282 clonal envelope sequences from a multiclade panel of HIV-1 viruses were tested in viral neutralization assays with an array of broadly neutralizing monoclonal antibodies (mAbs: b12, PG9/16, PGT121-128, PGT130-131, PGT135-137, PGT141-145, and PGV04). We correlated IC50 titers with the envelope sequences, and used this information to predict antibody epitope networks. Structural patches were defined as amino acid groups based on solvent accessibility, radius, atomic depth, and interaction networks within 3D envelope models. We applied a boosted algorithm consisting of multiple machine-learning and statistical models to evaluate these patches as possible antibody epitope regions, evidenced by strong correlations with the neutralization response for each antibody. Results: We identified patch clusters with significant correlation to IC50 titers as sites that impact neutralization sensitivity and therefore are potentially part of the antibody binding sites. Predicted epitope networks were mostly located within the variable loops of the envelope glycoprotein (gp120), particularly in V1/V2. Site-directed mutagenesis experiments involving residues identified as epitope networks across multiple mAbs confirmed the association of these residues with loss or gain of neutralization sensitivity. Conclusions: Computational methods were implemented to rapidly survey protein structures and predict epitope networks associated with response to individual monoclonal antibodies, which resulted in the identification and deeper understanding of immunological hotspots targeted by broadly neutralizing HIV-1 antibodies.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 47
    Publication Date: 2014-01-25
    Description: Background: The introduction of next-generation sequencing (NGS) technology has made it possible to detect genomic alterations within tumor cells on a large scale. However, most applications of NGS show the genetic content of mixtures of cells. Recently developed single cell sequencing technology can identify variation within a single cell. Characterization of multiple samples from a tumor using single cell sequencing can potentially provide information on the evolutionary history of that tumor. This may facilitate understanding how key mutations accumulate and evolve in lineages to form a heterogeneous tumor. Results: We provide a computational method to infer an evolutionary mutation tree based on single cell sequencing data. Our approach differs from traditional phylogenetic tree approaches in that our mutation tree directly describes temporal order relationships among mutation sites. Our method also accommodates sequencing errors. Furthermore, we provide a method for estimating the proportion of time from the earliest mutation event of the sample to the most recent common ancestor of the sample of cells. Finally, we discuss current limitations on modeling with single cell sequencing data and possible improvements under those limitations. Conclusions: Inferring the temporal ordering of mutational sites using current single cell sequencing data is a challenge. Our proposed method may help elucidate relationships among key mutations and their role in tumor progression.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 48
    Publication Date: 2014-01-28
    Description: Background: Next-generation sequencing (NGS) technologies have resulted in petabytes of scattered data, decentralized in archives, databases and sometimes in isolated hard disks which are inaccessible for browsing and analysis. It is expected that curated secondary databases will help organize some of this Big Data, thereby allowing users to better navigate, search and compute on it. Results: To address the above challenge, we have implemented an NGS biocuration workflow and are analyzing short-read sequences and associated metadata from cancer patients to better understand the human variome. Curation of variation and other related information from control (normal tissue) and case (tumor) samples will provide comprehensive background information that can be used in genomic medicine research and application studies. Our approach includes a CloudBioLinux Virtual Machine which is used upstream of an integrated High-performance Integrated Virtual Environment (HIVE) that encapsulates a Curated Short Read archive (CSR) and a proteome-wide variation effect analysis tool (SNVDis). As a proof of concept, we have curated and analyzed control and case breast cancer datasets from the NCI cancer genomics program - The Cancer Genome Atlas (TCGA). Our efforts include reviewing and recording in CSR the available clinical information on patients, mapping the reads to the reference, identifying non-synonymous Single Nucleotide Variations (nsSNVs), and integrating the data with tools that allow analysis of the effects of nsSNVs on the human proteome. Furthermore, we have also developed a novel phylogenetic analysis algorithm that uses SNV positions and can be used to classify the patient population. The workflow described here lays the foundation for analysis of short-read sequence data to identify rare and novel SNVs that are not present in dbSNP, and therefore provides a more comprehensive understanding of the human variome. Variation results for single genes as well as the entire study are available from the CSR website (hive.biochemistry.gwu.edu/tools/csr/SRARecords_Curated.php). Conclusions: The availability of thousands of sequenced samples from patients provides a rich repository of sequence information that can be utilized to identify individual-level SNVs and their effect on the human proteome beyond what the dbSNP database provides.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 49
    Publication Date: 2014-01-25
    Description: Background: Discovering novel interactions between HIV-1 and human proteins would greatly contribute to different areas of HIV research. Identification of such interactions leads to greater insight into drug target prediction. Some recent studies have been conducted for computational prediction of new interactions based on the experimentally validated information stored in an HIV-1-human protein-protein interaction database. However, these techniques do not predict any regulatory mechanism between HIV-1 and human proteins, because they do not consider interaction types and the direction of regulation of interactions. Results: Here we present an association rule mining technique based on biclustering for discovering a set of rules among human and HIV-1 proteins using the publicly available HIV-1-human PPI database. These rules are subsequently utilized to predict some novel interactions among HIV-1 and human proteins. For prediction purposes, both the interaction types and the direction of regulation of interactions (i.e., virus-to-host or host-to-virus) are considered here to provide important additional information about the regulation pattern of interactions. We have also studied the biclusters and analyzed the significant GO terms and KEGG pathways in which the human proteins of the biclusters participate. Moreover, the predicted rules have been analyzed to discover regulatory relationships between some human proteins in the course of HIV-1 infection. Experimental evidence for some of our predicted interactions was found by searching the recent literature in PubMed. We have also highlighted some human proteins that are likely to act against the HIV-1 attack. Conclusions: We pose the problem of identifying new regulatory interactions between HIV-1 and human proteins, based on the existing PPI database, as an association rule mining problem using a biclustering algorithm. We discover some novel regulatory interactions between HIV-1 and human proteins, a significant number of which are supported by recent literature.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 50
    Publication Date: 2014-02-01
    Description: Background: Motif searching is an important step in the detection of rare events occurring in a set of DNA or protein sequences. One formulation of the problem is known as (l, d)-motif search or Planted Motif Search (PMS). In PMS we are given two integers l and d and n biological sequences. We want to find all sequences of length l that appear in each of the input sequences with at most d mismatches. The PMS problem is NP-complete. PMS algorithms are typically evaluated on certain instances considered challenging. Despite ample research in the area, a considerable performance gap exists because many state-of-the-art algorithms have large runtimes even for moderately challenging instances. Results: This paper presents a fast exact parallel PMS algorithm called PMS8. PMS8 is the first algorithm to solve the challenging (l, d) instances (25, 10) and (26, 11). PMS8 is also efficient on instances with larger l and d such as (50, 21). We include a comparison of PMS8 with several state-of-the-art algorithms on multiple problem instances. This paper also presents necessary and sufficient conditions for 3 l-mers to have a common d-neighbor. The program is freely available at http://engr.uconn.edu/~man09004/PMS8/. Conclusions: We present PMS8, an efficient exact algorithm for Planted Motif Search. PMS8 introduces novel ideas for generating common neighborhoods. We have also implemented a parallel version of this algorithm. PMS8 can solve instances not solved by any previous algorithms.
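The (l, d) problem statement above translates directly into a brute-force reference implementation: enumerate the d-neighborhood of every l-mer of the first sequence and keep the candidates that occur in all remaining sequences within d mismatches. This is only practical for toy instances; PMS8's contribution is exactly the algorithmics that make large (l, d) instances tractable.

```python
from itertools import combinations, product

def neighbors(lmer, d, alphabet="ACGT"):
    """All strings within Hamming distance <= d of lmer."""
    out = {lmer}
    for k in range(1, d + 1):
        for positions in combinations(range(len(lmer)), k):
            for repl in product(alphabet, repeat=k):
                s = list(lmer)
                for pos, ch in zip(positions, repl):
                    s[pos] = ch
                out.add("".join(s))
    return out

def pms_bruteforce(seqs, l, d):
    """Exact (l, d) planted motif search by exhausting the d-neighborhood of
    every l-mer of the first sequence; occurrence in seqs[0] is guaranteed."""
    def occurs(motif, seq):
        return any(sum(a != b for a, b in zip(motif, seq[i:i + l])) <= d
                   for i in range(len(seq) - l + 1))
    candidates = set()
    for i in range(len(seqs[0]) - l + 1):
        candidates |= neighbors(seqs[0][i:i + l], d)
    return sorted(m for m in candidates if all(occurs(m, s) for s in seqs[1:]))

print(pms_bruteforce(["ACGTTGCA", "CCGTTGCT", "AGGTAGCA"], l=5, d=1))
```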
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 51
    Publication Date: 2014-02-25
    Description: Background: True date palms (Phoenix dactylifera L.) are impressive trees and have served as an indispensable source of food for mankind in tropical and subtropical countries for centuries. The aim of this study is to differentiate date palm tree varieties by analysing leaflet cross sections with technical/optical methods and artificial neural networks (ANN). Results: Fluorescence microscopy images of leaflet cross sections were taken from a set of five date palm tree cultivars (Hewlat al Jouf, Khlas, Nabot Soltan, Shishi, Um Raheem). After feature extraction from the images, the obtained data were fed into a multilayer perceptron ANN with a backpropagation learning algorithm. Conclusions: Overall, accurate prediction and differentiation of date palm tree cultivars was achieved: the average prediction accuracy in tenfold cross-validation was 89.1%, and reached 100% for the best-performing ANN.
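A multilayer perceptron with tenfold cross-validation, as used in the study, can be sketched with scikit-learn; the feature matrix below is random placeholder data standing in for the image-derived features, so the resulting accuracy is not the paper's.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Placeholder stand-in for features extracted from leaflet cross-section
# images of five cultivars; the study used fluorescence microscopy features.
rng = np.random.default_rng(42)
X = rng.normal(size=(150, 12)) + np.repeat(np.arange(5), 30)[:, None] * 0.8
y = np.repeat(np.arange(5), 30)   # five cultivar labels, 30 samples each

# A multilayer perceptron trained with backpropagation, as in the paper.
mlp = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
scores = cross_val_score(mlp, X, y, cv=10)    # tenfold cross-validation
print(scores.mean())
```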
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 52
    Publication Date: 2014-02-27
    Description: Background: Ontological concepts are useful for many different biomedical tasks. Concepts are difficult to recognize in text due to a disconnect between what is captured in an ontology and how the concepts are expressed in text. There are many recognizers for specific ontologies, but a general approach for concept recognition is an open problem. Results: Three dictionary-based systems (MetaMap, NCBO Annotator, and ConceptMapper) are evaluated on eight biomedical ontologies in the Colorado Richly Annotated Full-Text (CRAFT) Corpus. Over 1,000 parameter combinations are examined, and the best-performing parameters for each system-ontology pair are presented. Conclusions: Baselines for concept recognition by three systems on eight biomedical ontologies are established (F-measures range from 0.14 to 0.83). Of the three systems we tested, ConceptMapper is generally the best-performing; it produces the highest F-measure on seven of the eight ontologies. Default parameters are not ideal for most systems on most ontologies; by changing parameters, F-measure can be increased by up to 0.4. We present not only the best-performing parameters but also suggestions for choosing parameters based on ontology characteristics.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 53
    Publication Date: 2014-02-27
    Description: Background: Binding free energy and binding hot spots at protein-protein interfaces are two important research areas for understanding protein interactions. Computational methods have been developed previously for accurate prediction of the binding free energy change upon mutation for interfacial residues. However, a large number of interrupted and unimportant atomic contacts were used in the training phase, which caused a loss of accuracy. Results: This work proposes a new method, βACVASA, to predict the change of binding free energy after alanine mutations. βACVASA integrates accessible surface area (ASA) and our newly defined β contacts together into an atomic contact vector (ACV). A β contact between two atoms is a direct contact that is not interrupted by any other atom between them. A β contact's potential contribution to protein binding is assumed to be inversely proportional to its ASA, following the water exclusion hypothesis of binding hot spots. Tested on a dataset of 396 alanine mutations, our method is found to be superior in classification performance to many other methods, including Robetta, FoldX, HotPOINT, an ACV method of β contacts without ASA integration, and ACVASA methods (similar to βACVASA but based on distance-cutoff contacts). Based on our data analysis and results, we draw the following conclusions: (i) our method is powerful in the prediction of binding free energy change after alanine mutation; (ii) β contacts are better than distance-cutoff contacts for modeling the well-organized protein-binding interfaces; (iii) β contacts usually constitute only a small fraction of the distance-based contacts; and (iv) water exclusion is a necessary condition for a residue to become a binding hot spot. Conclusions: βACVASA is designed using the advantages of both β contacts and water exclusion. It is an excellent tool to predict binding free energy changes and binding hot spots after alanine mutation.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 54
    Publication Date: 2014-03-01
    Description: Background: Protein-coding DNA sequences and their corresponding amino acid sequences are routinely used to study relationships between sequence, structure, function, and evolution. The rapidly growing size of sequence databases increases the power of such comparative analyses, but it makes it more challenging to prepare high-quality sequence data sets with control over redundancy, quality, completeness, formatting, and labeling. Software tools for some individual steps in this process exist, but manual intervention remains a common and time-consuming necessity. Description: CDSbank is a database that stores both the protein-coding DNA sequence (CDS) and amino acid sequence for each protein annotated in GenBank. CDSbank also stores GenBank feature annotation, a flag to indicate incomplete 5′ and 3′ ends, full taxonomic data, and a heuristic to rank the scientific interest of each species. This rich information allows fully automated data set preparation with a level of sophistication that aims to meet or exceed manual processing. Defaults ensure ease of use for typical scenarios while allowing great flexibility when needed. Access is via a free web server at http://hazeslab.med.ualberta.ca/CDSbank/. Conclusions: CDSbank presents a user-friendly web server to download, filter, format, and name large sequence data sets. Common usage scenarios can be accessed via pre-programmed default choices, while optional sections give full control over the processing pipeline. Particular strengths are: extracting protein-coding DNA sequences just as easily as amino acid sequences, full access to taxonomy for labeling and filtering, awareness of incomplete sequences, and the ability to take one protein sequence and extract all synonymous CDS or identical protein sequences in other species. Finally, CDSbank can also create labeled property files to, for instance, annotate or re-label phylogenetic trees.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 55
    Publication Date: 2014-03-01
    Description: Background: RNA molecules, especially non-coding RNAs, play vital roles in the cell and their biological functions are mostly determined by structural properties. Often, these properties are related to dynamic changes in the structure, as in the case of riboswitches, and thus the analysis of RNA folding kinetics is crucial for their study. Exact approaches to kinetic folding are computationally expensive and, thus, limited to short sequences. In a previous study, we introduced a position-specific abstraction based on helices, which we termed helix index shapes (hishapes), and a hishape-based algorithm for near-optimal folding pathway computation, called HiPath. The combination of these approaches provides an abstract view of the folding space that offers information about the global features. Results: In this paper we present HiKinetics, an algorithm that can predict RNA folding kinetics for sequences up to several hundred nucleotides long. This algorithm is based on RNAHeliCes, which decomposes the folding space into abstract classes, namely hishapes, and an improved version of HiPath, namely HiPath2, which estimates plausible folding pathways that connect these classes. Furthermore, we analyse the relationship of hishapes to locally optimal structures, the results of which strengthen the use of the hishape abstraction for studying folding kinetics. Finally, we show the application of HiKinetics to the folding kinetics of two well-studied RNAs. Conclusions: HiKinetics can calculate kinetic folding based on a novel hishape decomposition. HiKinetics, together with HiPath2 and RNAHeliCes, is available for download at http://www.cyanolab.de/software/RNAHeliCes.htm.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 56
    Publication Date: 2014-02-12
    Description: Background: Simple peak-picking algorithms, such as those based on lineshape fitting, perform well when peaks are completely resolved in multidimensional NMR spectra, but often produce wrong intensities and frequencies for overlapping peak clusters. For example, NOESY-type spectra have considerable overlaps, leading to significant peak-picking intensity errors which can result in erroneous structural restraints. Precise frequencies are critical for unambiguous resonance assignments. Results: To alleviate this problem, a more sophisticated peak decomposition algorithm, based on non-negative matrix factorization (NMF), was developed. We produce peak shapes from Fourier-transformed NMR spectra. Apart from its main goal of deriving components from spectra and producing peak lists automatically, the NMF approach can also be applied if the positions of some peaks are known a priori, e.g. from consistently referenced spectral dimensions of other experiments. Conclusions: Application of the NMF algorithm to a three-dimensional peak list of the 23 kDa bi-domain section of the RcsD protein (RcsD-ABL-HPt, residues 688-890), as well as to synthetic HSQC data, shows that peaks can be picked accurately also in spectral regions with strong overlap.
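The decomposition principle, expressing overlapped spectra as non-negative mixtures of non-negative component shapes, can be sketched with scikit-learn's NMF on synthetic 1-D "spectra"; the Lorentzian shapes and mixing weights are invented, and the paper's algorithm is adapted to real multidimensional NMR data.

```python
import numpy as np
from sklearn.decomposition import NMF

# Two overlapping Lorentzian 'peaks' mixed into several 1-D traces; NMF
# recovers non-negative components and their weights, the same principle
# applied in the paper to overlapped NMR peak clusters.
x = np.linspace(0, 10, 400)
lorentz = lambda c, w: 1.0 / (1.0 + ((x - c) / w) ** 2)
components = np.stack([lorentz(4.8, 0.3), lorentz(5.2, 0.3)])     # true shapes
weights = np.array([[1.0, 0.2], [0.6, 0.9], [0.1, 1.0]])          # mixing
spectra = weights @ components + 0.01 * np.random.default_rng(1).random((3, 400))

model = NMF(n_components=2, init="nndsvd", max_iter=1000)
W = model.fit_transform(spectra)          # estimated weights per spectrum
H = model.components_                     # estimated peak shapes
print(W.shape, H.shape)                   # (3, 2) (2, 400)
```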
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 57
    Publication Date: 2014-02-22
    Description: Background: Along with the improvement of high-throughput sequencing technologies, the genetics community is showing marked interest in the rare variants/common diseases hypothesis. While sequencing can still be prohibitive for large studies, commercially available genotyping arrays targeting rare variants prove to be a reasonable alternative. A technical challenge of array-based methods is the task of deriving genotype classes (homozygous or heterozygous) by clustering intensity data points. The performance of clustering tools for common polymorphisms is well established, while their performance when conducted with a large proportion of rare variants (where data points are sparse for genotypes containing the rare allele) is less known. We have compared the performance of four clustering tools (GenCall, GenoSNP, optiCall and zCall) for the genotyping of over 10,000 samples using Illumina's HumanExome BeadChip, which includes 247,870 variants, 90% of which have a minor allele frequency below 5% in a population of European ancestry. Different reference parameters for GenCall and different initial parameters for GenoSNP were tested. Genotyping accuracy was assessed using data from the 1000 Genomes Project as a gold standard, and agreement between tools was measured. Results: Concordance of GenoSNP's calls with the gold standard was below expectations and was increased by changing the tool's initial parameters. While the four tools provided concordance with the gold standard above 99% for common alleles, some of them performed poorly for rare alleles. The reproducibility of genotype calls for each tool was assessed using experimental duplicates, which provided concordance rates above 99%. The inter-tool agreement of genotype calls was high for approximately 95% of variants. Most tools yielded similar error rates (approximately 0.02), except for zCall, which performed better with a 0.00164 mean error rate. Conclusions: The GenoSNP clustering tool could not be run straight "out of the box" with the HumanExome BeadChip, as modification of hard-coded parameters was necessary to achieve optimal performance. Overall, GenCall marginally outperformed the other tools for the HumanExome BeadChip. The use of experimental replicates provided a valuable quality control tool for genotyping projects with rare variants.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 58
    Publication Date: 2014-02-22
    Description: Background: Principal component analysis (PCA) has been widely used to visualize high-dimensional metabolomic data in a two- or three-dimensional subspace. In metabolomics, some metabolites (e.g., the top 10 metabolites) have been subjectively selected on the basis of factor loading in PCA, and biological inferences are then made for these metabolites. However, this approach may lead to biased biological inferences because the metabolites are not selected with objective statistical criteria. Results: We propose a statistical procedure that selects metabolites by statistical hypothesis testing of the factor loading in PCA and makes biological inferences about these significant metabolites with a metabolite set enrichment analysis (MSEA). This procedure depends on the fact that the eigenvector in PCA for autoscaled data is proportional to the correlation coefficient between the PC score and each metabolite level. We applied this approach to two sets of metabolomic data from mouse liver samples: 136 of 282 metabolites in the first case study and 66 of 275 metabolites in the second case study were statistically significant. This result suggests that fixing the number of metabolites before the analysis is inappropriate, because the number of significant metabolites differs in each study when factor loading is used in PCA. Moreover, when an MSEA of these significant metabolites was performed, significant metabolic pathways were detected, which were acceptable in terms of previous biological knowledge. Conclusions: It is essential to select metabolites statistically to make unbiased biological inferences from metabolomic data when using factor loading in PCA. We propose a statistical procedure to select metabolites via statistical hypothesis testing of the factor loading in PCA, and to draw biological inferences about these significant metabolites with MSEA. We have developed an R package "mseapca" to facilitate this approach; the "mseapca" package is publicly available at the CRAN website.
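The stated fact, that for autoscaled data the PCA loading of a variable is proportional to the correlation between the PC score and that variable, is easy to verify numerically, and the correlation's p-value is what the proposed test exploits. A sketch with invented data (in Python rather than the paper's R package):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 6))
X[:, 0] += X[:, 1]                          # correlated 'metabolites'
Z = (X - X.mean(0)) / X.std(0, ddof=1)      # autoscaling

# PCA via SVD of the autoscaled matrix.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
pc1 = Z @ Vt[0]                             # first PC score

# For autoscaled data, each variable's loading on PC1 is proportional to the
# correlation between the PC1 score and that variable.
for j in range(3):
    r, p = pearsonr(pc1, Z[:, j])           # correlation and its p-value
    print(f"metabolite {j}: r = {r:+.3f}, p = {p:.3g}")
```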
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 59
    Publication Date: 2014-02-22
    Description: Background: Copy Number Variations (CNVs) are usually inferred from Single Nucleotide Polymorphism (SNP) arrays by software packages implementing particular calling algorithms. However, there is no clear understanding of the performance of these software packages; it is therefore difficult to select one or several of them for CNV detection on a given SNP array platform. We selected four publicly available software packages designed for CNV calling from Affymetrix SNP arrays: Birdsuite, dChip, Genotyping Console (GTC) and PennCNV. A publicly available dataset generated by Array-based Comparative Genomic Hybridization (CGH), with a resolution of 24 million probes per sample, was considered the "gold standard". Against this gold standard, the success rate, average stability rate, sensitivity, consistency and reproducibility of the four software packages were assessed. We also compared the efficiency of detecting CNVs simultaneously by two, three or all four of the software packages with that of a single software package. Results: In terms of the raw number of detected CNVs, Birdsuite detected the most while GTC detected the least. We found that Birdsuite and dChip showed obvious detection bias, and GTC seemed inferior because of the small number of CNVs it detected. We then investigated the consistency between the calls produced by each software package and the other three: the consistency of dChip was the lowest while that of GTC was the highest. Compared with the CGH-based CNV calls, GTC called the most matching CNVs, with PennCNV-Affy ranking second; in the non-overlapping group, GTC called the fewest CNVs. With regard to the reproducibility of CNV calling, larger CNVs were usually replicated better; PennCNV-Affy showed the best consistency while Birdsuite showed the poorest. Conclusion: We found that PennCNV outperformed the other three packages in the sensitivity and specificity of CNV calling. Each calling method has its own limitations and advantages for different data analyses. Therefore, optimized calling might be achieved by using multiple algorithms and evaluating the concordance and discordance of their SNP array-based CNV calls.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 60
    Publication Date: 2014-02-25
    Description: Background: In the past decade, the field of molecular biology has become increasingly quantitative; the rapid development of new technologies enables researchers to investigate fundamental issues quickly and efficiently that were once impossible to address. Among these technologies, the DNA microarray provides methodology for many applications such as gene discovery, disease diagnosis, drug development and toxicological research, and it has been used increasingly since it first emerged. Multiple tools have been developed to interpret the high-throughput data produced by microarrays. However, less consideration has been given to the fact that an extensive and effective interpretation requires close interplay between the bioinformaticians who analyze the data and the biologists who generate it. To bridge this gap and to simplify the usability of such tools, we developed Eureka-DMA, an easy-to-operate graphical user interface that allows bioinformaticians and bench biologists alike to initiate analyses and to investigate the data produced by DNA microarrays. Results: In this paper, we describe Eureka-DMA, a user-friendly software package that comprises a set of methods for the interpretation of gene expression arrays. Eureka-DMA includes methods for the identification of genes with differential expression between conditions; it searches for enriched pathways and gene ontology terms and combines them with other relevant features. It thus enables full understanding of the data for subsequent testing as well as for generating new hypotheses. Here we present two analyses demonstrating how Eureka-DMA can be used and its capability to produce relevant and reliable results. Conclusions: We have integrated several elementary expression analysis tools to provide a unified interface for their implementation. Eureka-DMA's simple graphical user interface provides an effective and efficient framework in which the investigator has the full set of tools for visualization and interpretation of the data, with the option of exporting the analysis results for later use in other platforms. Eureka-DMA is freely available for academic users and can be downloaded at http://blue-meduza.org/Eureka-DMA
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 61
    Publication Date: 2014-02-27
    Description: Background: Molecular data, e.g. arising from microarray technology, are often used for predicting survival probabilities of patients. For multivariate risk prediction models on such high-dimensional data, there are established techniques that combine parameter estimation and variable selection. One big challenge is to incorporate interactions into such prediction models. In this feasibility study, we present building blocks for evaluating and incorporating interaction terms in high-dimensional time-to-event settings, especially for settings in which it is computationally too expensive to check all possible interactions. Results: We use a boosting technique for estimation of effects and the following building blocks for pre-selecting interactions: (1) resampling, (2) random forests and (3) orthogonalization as a data pre-processing step. In a simulation study, the strategy that uses all building blocks is able to detect true main effects and interactions with high sensitivity in different kinds of scenarios. The main challenge is interactions composed of variables that do not represent main effects, but our findings are promising in this regard as well. Results on real-world data illustrate that effect sizes of interactions frequently may not be large enough to improve prediction performance, even though the interactions are potentially of biological relevance. Conclusion: Screening interactions through random forests is feasible and useful when one is interested in finding relevant two-way interactions. The other building blocks also contribute considerably to an enhanced pre-selection of interactions. We determined the limits of interaction detection in terms of necessary effect sizes. Our study emphasizes the importance of making full use of existing methods in addition to establishing new ones. (A toy sketch of random-forest interaction screening follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
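A hedged sketch of the random-forest screening idea: rank candidate two-way interactions by the importance of their product terms. The paper's actual building blocks (resampling, orthogonalization, boosting for time-to-event data) are not reproduced; this toy uses a continuous outcome for brevity.

```python
# Naive random-forest screen for two-way interactions (illustrative only).
import itertools
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
# Toy outcome with a main effect (x0) and an interaction (x1 * x2).
y = 1.5 * X[:, 0] + 2.0 * X[:, 1] * X[:, 2] + rng.normal(size=n)

pairs = list(itertools.combinations(range(p), 2))
prods = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])
rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(np.hstack([X, prods]), y)

# Importances of the product columns serve as interaction scores.
scores = rf.feature_importances_[p:]
best = sorted(zip(scores, pairs), reverse=True)[:3]
print(best)  # the (1, 2) pair should rank near the top
```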
  • 62
    Publication Date: 2014-03-22
    Description: Background: Identifying differentially expressed genes (DEG) is a fundamental step in studies that perform genome-wide expression profiling. Typically, DEG are identified by univariate approaches such as Significance Analysis of Microarrays (SAM) or Linear Models for Microarray Data (LIMMA) for processing cDNA microarrays, and differential gene expression analysis based on the negative binomial distribution (DESeq) or Empirical analysis of Digital Gene Expression data in R (edgeR) for RNA-seq profiling. Results: Here we present a new geometrical multivariate approach to identify DEG called the Characteristic Direction. We demonstrate that the Characteristic Direction method is significantly more sensitive than existing methods for identifying DEG in the context of transcription factor (TF) and drug perturbation responses over a large number of microarray experiments. We also benchmarked the Characteristic Direction method using synthetic data as well as RNA-Seq data. A large collection of microarray expression data from TF perturbations (73 experiments) and drug perturbations (130 experiments) extracted from the Gene Expression Omnibus (GEO), as well as an RNA-Seq study that profiled genome-wide gene expression and STAT3 DNA binding in two subtypes of diffuse large B-cell lymphoma, were used to benchmark the method on real data. ChIP-Seq data identifying DNA binding sites of the perturbed TFs, as well as known targets of the perturbing drugs, were used as a prior-knowledge silver standard for validation. In all cases the Characteristic Direction DEG calling method outperformed other methods. We find that when drugs are applied to cells in various contexts, the proteins that interact with the drug targets are differentially expressed, and more of the corresponding genes are discovered by the Characteristic Direction method. In addition, we show that the Characteristic Direction conceptualization can be used to perform improved gene set enrichment analyses when compared with gene-set enrichment analysis (GSEA) and the hypergeometric test. Conclusions: The application of the Characteristic Direction method may shed new light on relevant biological mechanisms that would have remained undiscovered by the current state-of-the-art DEG methods. The method is freely accessible via open source implementations in four popular programming languages: R, Python, MATLAB and Mathematica, all available at: http://www.maayanlab.net/CD. (A hedged sketch of the core geometric idea follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
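A minimal sketch of the geometric idea described in the abstract: a single multivariate direction separating two conditions, whose components rank genes. It is implemented here as shrinkage-regularized linear discriminant analysis; the published method's exact regularization and normalization may differ, so treat this as an approximation, not the authors' algorithm.

```python
# Hedged sketch: a "characteristic direction" as the unit normal of a
# shrinkage-regularized LDA hyperplane separating control vs experiment.
import numpy as np

def characteristic_direction(ctrl, expt, shrink=0.9):
    """ctrl, expt: (samples x genes) matrices; returns a unit vector."""
    mu_diff = expt.mean(axis=0) - ctrl.mean(axis=0)
    pooled = np.vstack([ctrl - ctrl.mean(axis=0), expt - expt.mean(axis=0)])
    cov = np.cov(pooled, rowvar=False)
    # Shrink toward a scaled identity so the matrix is invertible.
    target = np.eye(cov.shape[0]) * np.trace(cov) / cov.shape[0]
    sigma = shrink * target + (1 - shrink) * cov
    b = np.linalg.solve(sigma, mu_diff)
    return b / np.linalg.norm(b)

rng = np.random.default_rng(1)
ctrl = rng.normal(size=(6, 50))
expt = rng.normal(size=(6, 50))
expt[:, 0] += 3.0  # gene 0 is up-regulated
b = characteristic_direction(ctrl, expt)
print(int(np.argmax(np.abs(b))))  # expect 0
```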
  • 63
    Publication Date: 2014-03-24
    Description: Background: RNA-seq data is currently underutilized, in part because it is difficult to predict the functional impact of alternate transcription events. Recent software improvements in full-length transcript deconvolution prompted us to develop spliceR, an R package for classification of alternative splicing and prediction of coding potential. Results: spliceR uses the full-length transcript output from RNA-seq assemblers to detect single or multiple exon skipping, alternative donor and acceptor sites, intron retention, alternative first or last exon usage, and mutually exclusive exon events. For each of these events spliceR also annotates the genomic coordinates of the differentially spliced elements, facilitating downstream sequence analysis. For each transcript, isoform fraction values are calculated to identify transcript switching between conditions. Lastly, spliceR predicts the coding potential, as well as the potential nonsense-mediated decay (NMD) sensitivity, of each transcript. Conclusions: spliceR is an easy-to-use tool that extends the usability of RNA-seq and assembly technologies by allowing greater depth of annotation of RNA-seq data. spliceR is implemented as an R package and is freely available from the Bioconductor repository (http://www.bioconductor.org/packages/2.13/bioc/html/spliceR.html). (A toy sketch of isoform fraction values follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
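A minimal sketch of isoform fraction (IF) values: each isoform's share of its parent gene's total expression, compared between two conditions to flag candidate switches. spliceR itself is an R package; the dIF cutoff below is an arbitrary illustration, not its default.

```python
# Isoform fraction (IF) and delta-IF between two conditions (toy data).
from collections import defaultdict

def isoform_fractions(expr: dict, gene_of: dict) -> dict:
    """expr: isoform -> expression; gene_of: isoform -> gene id."""
    totals = defaultdict(float)
    for iso, val in expr.items():
        totals[gene_of[iso]] += val
    return {iso: (val / totals[gene_of[iso]] if totals[gene_of[iso]] else 0.0)
            for iso, val in expr.items()}

gene_of = {"iso1": "geneA", "iso2": "geneA"}
cond1 = isoform_fractions({"iso1": 8.0, "iso2": 2.0}, gene_of)
cond2 = isoform_fractions({"iso1": 3.0, "iso2": 7.0}, gene_of)
for iso in gene_of:
    dIF = cond2[iso] - cond1[iso]
    if abs(dIF) > 0.25:  # flag a candidate isoform switch
        print(iso, round(dIF, 2))
```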
  • 64
    Publication Date: 2014-03-25
    Description: Background: Transient protein-protein interactions (PPIs), which underlie most biological processes, are a prime target for therapeutic development. Immense progress has been made towards computational prediction of PPIs using methods such as protein docking and sequence analysis. However, docking generally requires high-resolution structures of both binding partners, and sequence analysis requires that a significant number of recurrent patterns exist for the identification of a potential binding site. Researchers have turned to machine learning to overcome some of these restrictions by generalising interface sites with sets of descriptive features. Best practices for dataset generation, features, and learning algorithms have not yet been identified or agreed upon, and an analysis of the overall efficacy of machine learning based PPI predictors is due, in order to highlight potential areas for improvement. Results: The presence of unknown interaction sites in the testing set, a result of limited knowledge about protein interactions, dramatically reduces prediction accuracy. Greater accuracy in labelling the data, achieved by enforcing higher interface site rates per domain, resulted in an average 44% improvement across multiple machine learning algorithms. A set of 10 biologically unrelated proteins that were consistently predicted with high accuracy emerged through our analysis. We identify seven features with the most predictive power over multiple datasets and machine learning algorithms. Through our analysis, we created a new predictor, RAD-T, that outperforms existing non-structurally-specializing machine learning protein interface predictors, with an average 59% increase in MCC score on a dataset with a high number of interactions. Conclusion: Current methods of evaluating machine-learning based PPI predictors tend to undervalue their performance, which may be artificially decreased by the presence of unidentified interaction sites. Changes to predictors' training sets will be integral to the future progress of interface prediction by machine learning methods. We reveal the need for a larger test set of well-studied proteins, or for domain-specific scoring algorithms, to compensate for poor interaction site identification on proteins in general. (A sketch of the MCC metric quoted above follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
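Since the record's headline number is an MCC improvement, here is the Matthews correlation coefficient computed from raw confusion-matrix counts; the counts below are invented for illustration.

```python
# Matthews correlation coefficient (MCC) from confusion-matrix counts.
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy interface-residue predictions: 40 TP, 900 TN, 30 FP, 30 FN.
print(round(mcc(40, 900, 30, 30), 3))  # ~0.539
```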
  • 65
    Publication Date: 2014-03-27
    Description: Background: It is well known that the development of cancer is caused by the accumulation of somatic mutations within the genome. For oncogenes specifically, current research suggests that there is a small set of "driver" mutations that are primarily responsible for tumorigenesis. Further, due to some recent pharmacological successes in treating these driver mutations and their resulting tumors, a variety of methods have been developed to identify potential driver mutations using approaches such as machine learning and mutational clustering. We propose a novel methodology that increases our power to identify mutational clusters by taking into account protein tertiary structure via a graph theoretical approach. Results: We have designed and implemented GraphPAC (Graph Protein Amino acid Clustering) to identify mutational clustering while considering protein spatial structure. Using GraphPAC, we are able to detect novel clusters in proteins that are known to exhibit mutation clustering as well as identify clusters in proteins without evidence of prior clustering based on current methods. Specifically, by utilizing the spatial information available in the Protein Data Bank (PDB) along with the mutational data in the Catalogue of Somatic Mutations in Cancer (COSMIC), GraphPAC identifies new mutational clusters in well known oncogenes such as EGFR and KRAS. Further, by utilizing graph theory to account for the tertiary structure, GraphPAC discovers clusters in DPP4, NRP1 and other proteins not identified by existing methods. The R package is available at: http://bioconductor.org/packages/release/bioc/html/GraphPAC.html. Conclusion: GraphPAC provides an alternative to iPAC and an extension to current methodology when identifying potential activating driver mutations by utilizing a graph theoretic approach while considering protein tertiary structure. (A sketch of a residue distance graph follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
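A hedged sketch of the graph-theoretic ingredient named in this record: building a residue graph from C-alpha coordinates (edges below a contact cutoff) and computing graph distances. GraphPAC's actual traversal-order construction for clustering is more involved; coordinates and cutoff below are invented.

```python
# Residue contact graph from 3-D coordinates plus Dijkstra distances.
import heapq
import math

coords = {1: (0.0, 0.0, 0.0), 2: (3.8, 0.0, 0.0),
          3: (7.6, 0.0, 0.0), 4: (3.8, 3.8, 0.0)}

def build_graph(coords: dict, cutoff: float = 6.0) -> dict:
    graph = {i: {} for i in coords}
    for i in coords:
        for j in coords:
            if i < j:
                d = math.dist(coords[i], coords[j])
                if d <= cutoff:  # residues in contact
                    graph[i][j] = d
                    graph[j][i] = d
    return graph

def dijkstra(graph: dict, src) -> dict:
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue
        for v, w in graph[u].items():
            if d + w < dist.get(v, math.inf):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

g = build_graph(coords)
print(dijkstra(g, 1))  # graph distances from residue 1
```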
  • 66
    Publication Date: 2014-03-27
    Description: Background: Twenty years of improved technology and growing sequence data now render residue-residue contact constraints in large protein families, derived from correlated mutations, accurate enough to drive de novo predictions of protein three-dimensional structure. The method EVfold broke new ground using mean-field Direct Coupling Analysis (EVfold-mfDCA); the method PSICOV applied a related concept by estimating a sparse inverse covariance matrix. Both methods are publicly available, but both require too much CPU time for interactive applications, and EVfold-mfDCA additionally depends on proprietary software. Results: Here, we present FreeContact, a fast, open-source implementation of EVfold-mfDCA and PSICOV. On a test set of 140 proteins, FreeContact was almost eight times faster than PSICOV without decreasing prediction performance. The EVfold-mfDCA implementation of FreeContact was over 220 times faster than PSICOV with negligible performance decrease. The original EVfold-mfDCA was unavailable for testing due to its dependency on proprietary software. FreeContact is implemented as the free C++ library "libfreecontact", complete with the command line tool "freecontact", as well as Perl and Python modules. All components are available as Debian packages. FreeContact supports the BioXSD format for interoperability. Conclusions: FreeContact provides the opportunity to compute reliable contact predictions in any environment (desktop or cloud). (A toy sketch of the sparse inverse covariance idea follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
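A hedged toy version of the PSICOV concept referenced here: estimate a sparse precision (inverse covariance) matrix over alignment columns, where large off-diagonal entries suggest coupled, potentially contacting positions. Real PSICOV works on 21-state amino-acid frequency data; the numeric columns below are a stand-in.

```python
# Sparse inverse covariance as a coupling detector (illustrative only).
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(2)
n_seqs, n_cols = 300, 8
X = rng.normal(size=(n_seqs, n_cols))
X[:, 5] = X[:, 2] + 0.3 * rng.normal(size=n_seqs)  # couple columns 2 and 5

model = GraphicalLasso(alpha=0.05).fit(X)
P = np.abs(model.precision_)
np.fill_diagonal(P, 0.0)  # ignore self-coupling
i, j = np.unravel_index(np.argmax(P), P.shape)
print(i, j)  # expect the coupled pair (2, 5)
```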
  • 67
    Publication Date: 2014-03-27
    Description: Background: New experimental methods must be developed to study interaction networks in systems biology. To reduce biological noise, individual subjects, such as single cells, should be analyzed using high-throughput approaches. The measurement of several correlative physical properties would further improve data consistency. Accordingly, a considerable quantity of data must be acquired, correlated, catalogued and stored in a database for subsequent analysis. Results: We have developed openBEB (open Biological Experiment Browser), a software framework for data acquisition, coordination, annotation and synchronization with database solutions such as openBIS. OpenBEB consists of two main parts: a core program and a plug-in manager. Whereas the data-type independent core of openBEB maintains a local container of raw data and metadata and provides annotation and data management tools, all data-specific tasks are performed by plug-ins. The open architecture of openBEB enables the fast integration of plug-ins, e.g., for data acquisition or visualization. A macro-interpreter allows the automation and coordination of the different modules. An update and deployment mechanism keeps the core program, the plug-ins and the metadata definition files in sync with a central repository. Conclusions: The versatility, the simple deployment and update mechanism, and the scalability in terms of module integration offered by openBEB make this software interesting for a large scientific community. OpenBEB targets three types of researcher, ideally working closely together: (i) engineers and scientists developing new methods and instruments, e.g., for systems biology, (ii) scientists performing biological experiments, and (iii) theoreticians and mathematicians analyzing data. The design of openBEB enables the rapid development of plug-ins, which inherently benefit from the "housekeeping" abilities of the core program. We report the use of openBEB to combine live cell microscopy, microfluidic control and visual proteomics. In this example, measurements from diverse, complementary techniques are combined and correlated.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 68
    Publication Date: 2014-03-28
    Description: Background: The accumulation of protein structural data occurs more rapidly than it can be characterized by traditional laboratory means. This has motivated widespread efforts to predict enzyme function computationally. The most accurate strategies employed to date are based on the detection of motifs in novel structures that correspond to a specific function. Functional residues are critical components of predictively useful motifs. We have implemented a novel method, complementing current approaches, which detects motifs solely on the basis of distance restraints between catalytic residues. Results: ProMOL is a plugin for the PyMOL molecular graphics environment that can be used to create active site motifs for enzymes. A library of 181 active site motifs has been created with ProMOL, based on definitions published in the Catalytic Site Atlas (CSA). Searches with ProMOL produce better than 50% useful Enzyme Commission (EC) class suggestions for level 1 searches in EC classes 1, 4 and 5, and produce some useful results for other classes. A further 261 motifs have been automatically translated from Jonathan Barker's JESS motif set [Bioinformatics 19:1644-1649, 2003], and a set of NMR motifs is under development. Alignments are evaluated by visual superposition, Levenshtein distance and root-mean-square deviation (RMSD) and are reasonably consistent with related search methods. Conclusion: The ProMOL plugin for PyMOL provides ready access to template-based local alignments. Recent improvements to ProMOL, including the expanded motif library, RMSD calculations and output selection formatting, have greatly increased the program's usability and speed, and have improved the way that results are presented. (A sketch of an RMSD evaluation follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
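A minimal sketch of the RMSD evaluation mentioned in this record: optimal superposition of two matched coordinate sets via the Kabsch algorithm, then root-mean-square deviation. The four-atom test case is invented.

```python
# Kabsch superposition + RMSD between two coordinate sets.
import numpy as np

def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """P, Q: (n_atoms, 3) arrays of matched coordinates."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                       # covariance of the two point sets
    U, S, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])        # guard against improper rotations
    R = Vt.T @ D @ U.T                # optimal rotation mapping P onto Q
    diff = P @ R.T - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))

P = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
theta = np.pi / 6
Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
               [np.sin(theta), np.cos(theta), 0], [0, 0, 1]])
print(round(kabsch_rmsd(P, P @ Rz.T), 6))  # ~0: same shape, rotated
```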
  • 69
    Publication Date: 2014-03-28
    Description: Background: Different high-dimensional regression methodologies exist for the selection of variables to predict a continuous variable. To improve variable selection when clustered observations are present in the training data, an extension towards mixed-effects modeling (MM) is needed, but it may not always be straightforward to implement. In this article, we developed such a MM extension (GA-MM-MMI) for automated variable selection by a linear-regression-based genetic algorithm (GA) using multi-model inference (MMI). We exemplify our approach by training a linear regression model for prediction of resistance to the integrase inhibitor Raltegravir (RAL) on a genotype-phenotype database, with many integrase mutations as candidate covariates. The genotype-phenotype pairs in this database were derived from a limited number of subjects, with multiple data points from the same subject and an intra-class correlation of 0.92. Results: In generating the RAL model, we took computational efficiency into account by optimizing the GA parameters one by one and by using tournament selection. To derive the main GA parameters we used 3 times 5-fold cross-validation. The number of integrase mutations to be used as covariates in the mixed-effects models was 25 (chrom.size). A GA solution was found when R2MM > 0.95 (goal.fitness). We tested three different MMI approaches to combine the results of 100 GA solutions into one GA-MM-MMI model. When evaluating GA-MM-MMI performance on two unseen data sets, a more parsimonious and interpretable model was found (GA-MM-MMI TOP18: a mixed-effects model containing the 18 most prevalent mutations in the GA solutions, refitted on the training data) with better predictive accuracy (R2) in comparison to GA-ordinary least squares (GA-OLS) and the Least Absolute Shrinkage and Selection Operator (LASSO). Conclusions: We have demonstrated improved performance when using GA-MM-MMI for selection of mutations on a genotype-phenotype data set. As we largely automated the setting of the GA parameters, the method should be applicable to similar datasets with clustered observations. (A toy sketch of a tournament-selection GA follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
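A hedged toy version of two GA ingredients named in this record: fixed-size binary chromosomes (here, sets of column indices) and tournament selection. The mixed-model fitness (R2MM) and MMI averaging of the actual GA-MM-MMI are not reproduced; plain OLS R-squared stands in, and the mutation-only evolution loop is a simplification.

```python
# Tournament-selection GA for variable selection (illustrative only).
import numpy as np

rng = np.random.default_rng(3)
n, p, chrom_size = 120, 40, 5
X = rng.normal(size=(n, p))
y = X[:, 0] + 2 * X[:, 7] + rng.normal(scale=0.5, size=n)

def fitness(subset) -> float:
    """OLS R^2 on the selected columns (stand-in for the paper's R2MM)."""
    Xs = np.column_stack([np.ones(n), X[:, list(subset)]])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ beta
    return 1 - resid.var() / y.var()

def tournament(pop, k=3):
    contenders = rng.choice(len(pop), size=k, replace=False)
    return pop[max(contenders, key=lambda i: fitness(pop[i]))]

pop = [tuple(rng.choice(p, size=chrom_size, replace=False)) for _ in range(30)]
for _ in range(200):                       # mutation-only evolution, for brevity
    child = list(tournament(pop))
    child[int(rng.integers(chrom_size))] = int(rng.integers(p))  # point mutation
    if len(set(child)) == chrom_size:      # keep chromosomes duplicate-free
        pop[int(rng.integers(len(pop)))] = tuple(child)
best = max(pop, key=fitness)
print(sorted(best), round(fitness(best), 3))  # likely includes columns 0 and 7
```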
  • 70
    Publication Date: 2014-03-28
    Description: Background: Differential RNA sequencing (dRNA-seq) is a high-throughput screening technique designed to examine the architecture of bacterial operons in general and the precise position of transcription start sites (TSS) in particular. Hitherto, dRNA-seq data were analyzed by visualizing the sequencing reads mapped to the reference genome and manually annotating reliable positions. This is very labor intensive and, due to its subjectivity, biased. Results: Here, we present TSSAR, a tool for automated de novo TSS annotation from dRNA-seq data that respects the statistics of dRNA-seq libraries. TSSAR uses the premise that the number of sequencing reads starting at a certain genomic position within a transcriptionally active region follows a Poisson distribution, with a parameter that depends on the local strength of expression. The difference of two dRNA-seq library counts thus follows a Skellam distribution. This provides a statistical basis to identify significantly enriched primary transcripts. Conclusions: Having an automated and efficient tool for analyzing dRNA-seq data facilitates the use of the dRNA-seq technique and promotes its application to more sophisticated analyses. For instance, monitoring the plasticity and dynamics of the transcriptomal architecture triggered by different stimuli and growth conditions becomes possible. The main asset of a novel tool for dRNA-seq analysis that reaches out to a broad user community is usability. As such, we provide TSSAR both as an intuitive RESTful Web service (http://rna.tbi.univie.ac.at/TSSAR), together with a set of post-processing and analysis tools, and as a stand-alone version for use in high-throughput dRNA-seq data analysis pipelines. (A sketch of the Skellam test follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
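A minimal sketch of the statistical idea in this record: if read-start counts in each library are Poisson, the count difference between the two libraries follows a Skellam distribution, so positions with improbably large differences are candidate TSS. The rate parameters and observed count below are illustrative, not TSSAR's estimation procedure.

```python
# Skellam-based enrichment test for a single genomic position.
from scipy.stats import skellam

mu_plus, mu_minus = 4.0, 3.0   # local expected read starts per library
obs_diff = 15                  # observed count difference at a position

# One-sided p-value for an enrichment at least this large: P(X >= obs_diff).
p = skellam.sf(obs_diff - 1, mu_plus, mu_minus)
print(f"{p:.2e}")  # tiny p-value -> significantly enriched primary transcript
```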
  • 71
    Publication Date: 2014-03-29
    Description: Background: Metagenomics is the genomic study of uncultured environmental samples, which has been greatly facilitated by the advent of shotgun-sequencing technologies. One of the main focuses of metagenomics is the discovery of previously uncultured microorganisms, which makes the assignment of sequences to a particular taxon a challenge and a crucial step. Recently, several methods have been developed to perform this task, based on different methodologies such as sequence composition or sequence similarity. The sequence composition methods have the ability to completely assign the whole dataset. However, their use in metagenomics and the study of their performance with real data is limited. In this work, we assess the consistency of three different methods (BLAST + Lowest Common Ancestor, Phymm, and Naive Bayesian Classifier) in assigning real and simulated sequence reads. Results: In both real and simulated data, BLAST + Lowest Common Ancestor (BLAST + LCA), Phymm, and the Naive Bayesian Classifier consistently assign a larger number of reads at higher taxonomic levels than at lower levels. However, discrepancies increase at lower taxonomic levels. In simulated data, assignments consistent across all three methods showed greater precision than assignments based on Phymm or the Bayesian Classifier alone, with the BLAST + LCA algorithm performing best among the individual methods. In addition, assignment consistency in real data increased with sequence read length, in agreement with previously published simulation results. Conclusions: The use and combination of different approaches is advisable when assigning metagenomic reads. Although sensitivity may be reduced, reliability can be increased by using only reads consistently assigned to the same taxa by at least two methods, and by training the programs using all available information.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 72
    Publication Date: 2014-03-30
    Description: Background: High-throughput sequencing is now regularly used for studies of the transcriptome (RNA-seq), particularly for comparisons among experimental conditions. For the time being, a limited number of biological replicates are typically considered in such experiments, leading to low detection power for differential expression. As sequencing costs continue to decrease, it is likely that additional follow-up studies will be conducted to re-address the same biological question. Results: We demonstrate how p-value combination techniques previously used for microarray meta-analyses can be used for the differential analysis of RNA-seq data from multiple related studies. These techniques are compared to a negative binomial generalized linear model (GLM) including a fixed study effect, on simulated data and on real data from human melanoma cell lines. The GLM with fixed study effect performed well for low inter-study variation and small numbers of studies, but was outperformed by the meta-analysis methods for moderate to large inter-study variability and larger numbers of studies. Conclusions: The p-value combination techniques illustrated here are a valuable tool for performing differential meta-analyses of RNA-seq data, appropriately accounting for biological and technical variability within studies as well as additional study-specific effects. An R package, metaRNASeq, is available on R-Forge. (A sketch of p-value combination follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
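A minimal sketch of the p-value combination idea: per-study differential expression p-values for one gene merged by Fisher's method and by the inverse-normal (Stouffer) method. The package itself is in R; Python and the toy p-values below are used purely for illustration.

```python
# Combining per-study p-values for one gene across three RNA-seq studies.
from scipy.stats import combine_pvalues

per_study_p = [0.04, 0.10, 0.03]  # illustrative values

stat_f, p_fisher = combine_pvalues(per_study_p, method="fisher")
stat_s, p_stouffer = combine_pvalues(per_study_p, method="stouffer")
print(round(p_fisher, 5), round(p_stouffer, 5))  # pooled evidence
```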
  • 73
    Publication Date: 2014-04-01
    Description: Background: Next-generation sequencing (NGS) has advanced the application of high-throughput sequencing technologies in genetic and genomic variation analysis. Due to its large dynamic range of expression levels, RNA-seq is more prone to detect transcripts with low expression. It is clear that genes with no mapped reads are not expressed; however, there is ongoing debate about the level of abundance that constitutes biologically meaningful expression. To date, there is no consensus on the definition of low expression. Since random variation is high in regions with low expression, and distributions of transcript expression are affected by numerous experimental factors, methods to differentiate low and high expressed data in a sample are critical to interpreting classes of abundance levels in RNA-seq data. Results: A data-adaptive approach was developed to estimate the lower bound of high expression for RNA-seq data. The Kolmogorov-Smirnov statistic and multivariate adaptive regression splines were used to determine the optimal cutoff value for separating transcripts with high and low expression. Results from the proposed method were compared to results obtained by estimating the theoretical cutoff of a fitted two-component mixture distribution. The robustness of the proposed method was demonstrated by analyzing different RNA-seq datasets that varied by sequencing depth, species, scale of measurement, and empirical density shape. Conclusions: The analysis of real and simulated data presented here illustrates the need to employ data-adaptive methodology in lieu of arbitrary cutoffs to distinguish low expressed RNA-seq data from high expression. Our results also show the drawbacks of characterizing the data by a two-component mixture distribution when classes of gene expression are not well separated. The ability to ascertain stably expressed RNA-seq data is essential in the filtering step of data analysis, and methodologies that consider the underlying data structure demonstrate superior performance in preserving most of the interpretable and meaningful data. The proposed algorithm for classifying low and high regions of transcript abundance promises wide-ranging application in the continuing development of RNA-seq analysis. (A sketch of the mixture-based cutoff used for comparison follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
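A hedged sketch of the comparison baseline this record mentions: fit a two-component Gaussian mixture to log expression values and take the point where posterior component membership flips as the low/high cutoff. The paper's own KS/MARS procedure is not shown, and the simulated data below are invented.

```python
# Two-component mixture cutoff for low vs high expression (baseline idea).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
log_expr = np.concatenate([rng.normal(0.5, 0.6, 4000),   # "low" component
                           rng.normal(4.0, 1.0, 6000)])  # "high" component

gm = GaussianMixture(n_components=2, random_state=0).fit(log_expr.reshape(-1, 1))
grid = np.linspace(log_expr.min(), log_expr.max(), 2000).reshape(-1, 1)
post = gm.predict_proba(grid)
low = int(np.argmin(gm.means_.ravel()))       # which component is "low"
cutoff = grid[np.argmax(post[:, low] < 0.5)][0]  # first point past the boundary
print(round(float(cutoff), 2))  # estimated low/high expression cutoff
```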
  • 74
    Publication Date: 2014-04-02
    Description: Background: In this study we consider DNA sequences as mathematical strings. Total and reduced alignments between two DNA sequences have been considered in the literature to measure their similarity, and explicit representations of some alignments have already been obtained. Results: We present exact, explicit and computable formulas for the number of different possible alignments between two DNA sequences, and a new formula for a class of reduced alignments. Conclusions: A unified approach for a wide class of alignments between two DNA sequences has been provided. The formula is computable and, if complemented by software development, will provide deeper insight into the theory of sequence alignment and give rise to new comparison methods. (An illustrative alignment-counting recurrence follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
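A hedged illustration of one classical, computable alignment count: the number of global alignments of sequences of lengths m and n, which satisfies a Delannoy-style recurrence because each alignment column is a match, an insertion, or a deletion. The paper's unified class of reduced alignments is not reproduced here.

```python
# Counting global alignments via the Delannoy recurrence.
from functools import lru_cache

@lru_cache(maxsize=None)
def alignments(m: int, n: int) -> int:
    """f(m,n) = f(m-1,n) + f(m,n-1) + f(m-1,n-1); f(m,0) = f(0,n) = 1."""
    if m == 0 or n == 0:
        return 1
    return alignments(m - 1, n) + alignments(m, n - 1) + alignments(m - 1, n - 1)

print(alignments(3, 3))    # 63
print(alignments(10, 10))  # 8097453 -- the count grows explosively
```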
  • 75
    Publication Date: 2014-04-02
    Description: Background: Amino acid sequences and features extracted from such sequences have been used to predict many protein properties, such as subcellular localization or solubility, using classifier algorithms. Although software tools are available for both feature extraction and classifier construction, their application is not straightforward, requiring users to install various packages and to convert data into different formats. This lack of easily accessible software hampers quick, explorative use of sequence-based classification techniques by biologists. Results: We have developed the web-based software tool SPiCE for exploring sequence-based features of proteins in predefined classes. It offers data upload/download, sequence-based feature calculation, data visualization, and protein classifier construction and testing in a single integrated, interactive environment. To illustrate its use, two example datasets are included, showing the identification of differences in amino acid composition between proteins yielding low and high production levels in fungi and low and high expression levels in yeast, respectively. Conclusions: SPiCE is an easy-to-use online tool for extracting and exploring sequence-based features of sets of proteins, allowing non-experts to apply advanced classification techniques. The tool is available at http://helix.ewi.tudelft.nl/spice. (A sketch of one such feature follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
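A minimal sketch of the simplest sequence-based feature this record describes, amino acid composition, which can serve directly as classifier input. The example sequence is arbitrary.

```python
# Amino acid composition as a 20-dimensional feature vector.
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq: str) -> list:
    """Fraction of each of the 20 standard amino acids in the sequence."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / total for a in AMINO_ACIDS]

features = aa_composition("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print([round(f, 3) for f in features])
```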
  • 76
    Publication Date: 2014-04-03
    Description: Background: Protein structures are flexible and often show conformational changes upon binding to other molecules to exert biological functions. As protein structures correlate with characteristic functions, structure comparison allows classification and prediction of proteins of undefined function. However, most comparison methods treat proteins as rigid bodies and cannot retrieve similarities of proteins with large conformational changes effectively. Results: In this paper, we propose a novel descriptor, local average distance (LAD), based on either geodesic distances (GDs) or Euclidean distances (EDs), for pairwise flexible protein structure comparison. The proposed method was compared with 7 structural alignment methods and 7 shape descriptors on two datasets comprising hinge bending motions from the MolMovDB, and the results show that our method outperformed all other methods in retrieving similar structures in terms of precision-recall curve, retrieval success rate, R-precision, mean average precision and F1-measure. Conclusions: Both ED- and GD-based LAD descriptors are effective for searching deformed structures and overcome the problems of self-connection caused by large bending motions. We have also demonstrated that the ED-based LAD is more robust than the GD-based descriptor. The proposed algorithm provides an alternative approach for searching structure databases, discovering previously unknown conformational relationships, and reorganizing protein structure classification.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 77
    Publication Date: 2014-04-04
    Description: Background: The identification of functionally or structurally important non-conserved residue sites in protein MSAs is an important challenge for understanding the structural basis and molecular mechanism of protein functions. Despite the rich literature on compensatory mutations as well as sequence conservation analysis for the detection of those important residues, previous methods often rely on classical information-theoretic measures. However, these measures usually do not take into account dis/similarities of amino acids, which are likely to be crucial for those residues. In this study, we present a new method, the Quantum Coupled Mutation Finder (QCMF), which incorporates significant dis/similar amino acid pair signals in the prediction of functionally or structurally important sites. Results: The result of this study is twofold. First, using the essential sites of two human proteins, namely epidermal growth factor receptor (EGFR) and glucokinase (GCK), we tested the QCMF method. QCMF includes two metrics based on quantum Jensen-Shannon divergence to measure both sequence conservation and compensatory mutations. We found that QCMF achieves improved performance in identifying essential sites from MSAs of both proteins, with a significantly higher Matthews correlation coefficient (MCC) value in comparison to previous methods. Second, using a data set of 153 proteins, we made a pairwise comparison between QCMF and three conventional methods. This comparison study strongly suggests that QCMF complements the conventional methods for the identification of correlated mutations in MSAs. Conclusions: QCMF utilizes the notion of entanglement, which is a major resource of quantum information, to model significant dissimilar and similar amino acid pair signals in the detection of functionally or structurally important sites. Our results suggest that, on the one hand, QCMF significantly outperforms the previous method, which mainly focuses on dissimilar amino acid signals, in detecting essential sites in proteins. On the other hand, it is complementary to the existing methods for the identification of correlated mutations. The QCMF method is computationally intensive; to ensure a feasible computation time, its algorithm leverages Compute Unified Device Architecture (CUDA). The QCMF server is freely accessible at http://qcmf.informatik.uni-goettingen.de/. (A sketch of the quantum Jensen-Shannon divergence follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
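A hedged sketch of the divergence underlying the QCMF metrics: for density matrices rho and sigma, QJSD(rho, sigma) = S((rho+sigma)/2) - S(rho)/2 - S(sigma)/2, with S the von Neumann entropy. How QCMF builds density matrices from alignment columns is not reproduced; the 2x2 matrices below are toy examples.

```python
# Quantum Jensen-Shannon divergence between two density matrices.
import numpy as np

def von_neumann_entropy(rho: np.ndarray) -> float:
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]          # 0 * log(0) := 0
    return float(-(evals * np.log2(evals)).sum())

def qjsd(rho: np.ndarray, sigma: np.ndarray) -> float:
    mix = 0.5 * (rho + sigma)
    return von_neumann_entropy(mix) - 0.5 * (
        von_neumann_entropy(rho) + von_neumann_entropy(sigma))

# Toy 2x2 density matrices (Hermitian, positive semidefinite, trace 1).
rho = np.array([[1.0, 0.0], [0.0, 0.0]])
sigma = np.array([[0.5, 0.5], [0.5, 0.5]])
print(round(qjsd(rho, sigma), 4))  # ~0.6; QJSD lies in [0, 1] for qubits
```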
  • 78
    Publication Date: 2014-04-05
    Description: Background: Over the last decade, metabolomics has evolved into a mainstream enterprise utilized by many laboratories globally. Like other "omics" data, metabolomics data share the characteristic of a small sample size relative to the number of features evaluated. Thus the selection of an optimal subset of features with a supervised classifier is imperative. We extended an existing feature selection algorithm, threshold gradient descent regularization (TGDR), to handle multi-class classification of "omics" data, and proposed two such extensions, referred to as multi-TGDR. Both multi-TGDR frameworks were used to analyze a metabolomics dataset that compares the metabolic profiles of hepatocellular carcinoma (HCC) infected with hepatitis B (HBV) or C virus (HCV) with those of cirrhosis induced by HBV/HCV infection; the goal was to improve early-stage diagnosis of HCC. Results: We applied the two multi-TGDR frameworks to the HCC metabolomics data, determining TGDR thresholds either globally across classes or locally for each class. The multi-TGDR global model selected 45 metabolites with a 0% misclassification rate (the error rate on the training data) and a 3.82% 5-fold cross-validation (CV-5) predictive error rate. Multi-TGDR local selected 48 metabolites with a 0% misclassification rate and a 5.34% CV-5 error rate. Conclusions: One important advantage of multi-TGDR local is that it allows inference on which features are related to specific classes. We therefore recommend multi-TGDR local, because it has similar predictive performance and requires the same computing time as multi-TGDR global, but may provide class-specific inference.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 79
    Publication Date: 2014-09-16
    Description: Background: De novo genome assembly of next-generation sequencing data is one of the most important current problems in bioinformatics, essential in many biological applications. In spite of a significant amount of work in this area, better solutions are still very much needed. Results: We present a new program, SAGE, for de novo genome assembly. As opposed to most assemblers, which are de Bruijn graph based, SAGE uses the string-overlap graph. SAGE builds upon existing work on string-overlap graphs and maximum-likelihood assembly, introducing a number of new ideas, such as the efficient computation of the transitive reduction of the string-overlap graph, the use of (generalized) edge multiplicity statistics for more accurate estimation of read copy counts, and the improved use of mate pairs and min-cost flow for supporting edge merging. The assemblies produced by SAGE for several short and medium-size genomes compared favourably with those of existing leading assemblers. Conclusions: SAGE benefits from innovations in almost every aspect of the assembly process: error correction of input reads, string-overlap graph construction, read copy count estimation, overlap graph analysis and reduction, contig extraction, and scaffolding. We hope that these new ideas will help advance the current state-of-the-art in an essential area of genomics research. (A toy sketch of transitive reduction follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
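A hedged toy version of one SAGE ingredient named in this record: the transitive reduction of an overlap graph, which removes an edge u->w whenever a path u->v->w explains the same overlap. SAGE's actual algorithm is engineered for efficiency on genome-scale graphs; this cubic loop is only for illustration.

```python
# Transitive reduction of a small directed overlap graph.
def transitive_reduction(edges: dict) -> dict:
    reduced = {u: set(vs) for u, vs in edges.items()}
    for u in edges:
        for v in edges[u]:
            for w in edges.get(v, ()):  # u->v->w makes u->w redundant
                reduced[u].discard(w)
    return reduced

# Reads A,B,C where A overlaps B, B overlaps C, and A also overlaps C.
overlaps = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
print(transitive_reduction(overlaps))  # the redundant A->C edge is removed
```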
  • 80
    Publication Date: 2014-09-17
    Description: Background: Various methods have been developed to computationally predict hotspot residues at novel protein-protein interfaces, but obtaining accurate predictions remains challenging. We have developed a novel method that uses different aspects of protein structure and sequence space at the residue level to highlight interface residues crucial for protein-protein complex formation. Results: The ECMIS (Energetic Conservation Mass Index and Spatial Clustering) algorithm outperformed existing hotspot identification methods, achieving around 80% accuracy with a substantial increase in sensitivity. The method is sensitive even towards hotspot residues contributing only small-scale hydrophobic interactions. Conclusion: The combination of diverse features of the protein, viz. energy contribution, extent of conservation, location and surrounding environment, along with optimized weighting of each feature, was the key to the success of the algorithm. The academic version of the algorithm is available at http://caps.ncbs.res.in/download/ECMIS/ECMIS.zip
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 81
    Publication Date: 2014-09-20
    Description: Background: DNA methylation is a widely studied epigenetic phenomenon; alterations in methylation patterns influence human phenotypes and risk of disease. As part of the Atherosclerosis Risk in Communities (ARIC) study, the Illumina Infinium HumanMethylation450 (HM450) BeadChip was used to measure DNA methylation in peripheral blood obtained from ~3000 African American study participants. Over 480,000 cytosine-guanine (CpG) dinucleotide sites were surveyed on the HM450 BeadChip. To evaluate the impact of technical variation, 265 technical replicates from 130 participants were included in the study. Results: For each CpG site, we calculated the intraclass correlation coefficient (ICC), which ranges between 0 and 1, to compare variation of methylation levels within- and between-replicate pairs. We modeled the distribution of ICC as a mixture of censored or truncated normal and normal distributions using an EM algorithm. The CpG sites were clustered into low- and high-reliability groups according to the calculated posterior probabilities. We also demonstrated the performance of this clustering when applied to a study of association between methylation levels and smoking status. Of the CpG sites showing genome-wide significant association with smoking status, most (~96%) were sites in the high-reliability cluster. Conclusions: We suggest that CpG sites with low ICC may be excluded from subsequent association analyses, or that extra caution be taken for associations at such sites. (A sketch of a one-way ICC calculation follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
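A minimal sketch of a one-way ANOVA intraclass correlation coefficient, ICC(1), for technical replicate pairs at a single CpG site. The exact variance decomposition used in the paper may differ in detail; the simulated methylation values are invented.

```python
# ICC(1) = (MS_between - MS_within) / (MS_between + (k-1) * MS_within).
import numpy as np

def icc_oneway(pairs: np.ndarray) -> float:
    """pairs: (n_subjects, k) replicate measurements per subject."""
    n, k = pairs.shape
    grand = pairs.mean()
    ms_between = k * ((pairs.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_within = ((pairs - pairs.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

rng = np.random.default_rng(5)
subject_level = rng.normal(0.5, 0.1, size=(100, 1))        # true methylation
pairs = subject_level + rng.normal(0, 0.01, size=(100, 2))  # technical noise
print(round(icc_oneway(pairs), 3))  # close to 1: a reliable CpG site
```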
  • 82
    Publication Date: 2014-09-20
    Description: Background: Viruses are typically characterized by high mutation rates, which allow them to quickly develop drug-resistant mutations. Mining relevant rules from mutation data can be extremely useful to understand the virus adaptation mechanism and to design drugs that effectively counter potentially resistant mutants. Results: We propose a simple statistical relational learning approach for mutant prediction where the input consists of mutation data with drug-resistance information, either as sets of mutations conferring resistance to a certain drug, or as sets of mutants with information on their susceptibility to the drug. The algorithm learns a set of relational rules characterizing drug resistance and uses them to generate a set of potentially resistant mutants. Learning a weighted combination of rules allows generated mutants to be assigned a resistance score, as predicted by the statistical relational model, so that only the highest-scoring ones are selected. Conclusions: Promising results were obtained in generating resistant mutations for both nucleoside and non-nucleoside HIV reverse transcriptase inhibitors. The approach can be generalized quite easily to learning mutants characterized by more complex rules correlating multiple mutations.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 83
    Publication Date: 2014-09-20
    Description: Background: Non-small cell lung cancer (NSCLC), the most common type of lung cancer, is a serious disease causing death in both men and women. Computer-aided diagnosis and survival prediction of NSCLC are of great importance in assisting diagnosis and personalizing therapy planning for lung cancer patients. Results: In this paper we propose an integrated framework for NSCLC computer-aided diagnosis and survival analysis using novel image markers. The biomedical imaging informatics framework consists of cell detection, segmentation, classification, discovery of image markers, and survival analysis. A robust seed-detection-guided cell segmentation algorithm is proposed to accurately segment each individual cell in digital images. Based on the cell segmentation results, a set of extensive cellular morphological features are extracted using efficient feature descriptors. Next, eight classification techniques that can handle high-dimensional data were evaluated and compared for computer-aided diagnosis. The results show that random forests and AdaBoost offer the best classification performance for NSCLC. Finally, a Cox proportional hazards model is fitted by component-wise likelihood-based boosting. Significant image markers were discovered using bootstrap analysis, and the survival prediction performance of the model was evaluated. Conclusions: The proposed model has been applied to a lung cancer dataset containing 122 cases with complete clinical information. The classification performance exhibits high correlation between the discovered image markers and the subtypes of NSCLC, and the survival analysis demonstrates strong prediction power of the statistical model built from the discovered image markers.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 84
    Publication Date: 2014-09-20
    Description: Background: High-throughput transcriptome sequencing (RNA-seq) technology promises to discover novel protein-coding and non-coding transcripts, particularly the identification of long non-coding RNAs (lncRNAs) from de novo sequencing data. This requires tools that are not restricted by prior gene annotations, genomic sequences and high-quality sequencing. Results: We present an alignment-free tool called PLEK (predictor of long non-coding RNAs and messenger RNAs based on an improved k-mer scheme), which uses a computational pipeline based on an improved k-mer scheme and a support vector machine (SVM) algorithm to distinguish lncRNAs from messenger RNAs (mRNAs) in the absence of genomic sequences or annotations. The performance of PLEK was evaluated on well-annotated mRNA and lncRNA transcripts. 10-fold cross-validation tests on human RefSeq mRNAs and GENCODE lncRNAs indicated that our tool could achieve accuracy of up to 95.6%. We demonstrated the utility of PLEK on transcripts from other vertebrates using the model built from human datasets; PLEK attained >90% accuracy on most of these datasets. PLEK also performed well on a simulated dataset and two real de novo assembled transcriptome datasets (sequenced on the PacBio and 454 platforms) with relatively high indel sequencing errors. In addition, PLEK is approximately eightfold faster than a newly developed alignment-free tool, the Coding-Non-Coding Index (CNCI), and 244 times faster than the most popular alignment-based tool, the Coding Potential Calculator (CPC), when run single-threaded. Conclusions: PLEK is an efficient alignment-free computational tool to distinguish lncRNAs from mRNAs in RNA-seq transcriptomes of species lacking reference genomes. PLEK is especially suitable for PacBio or 454 sequencing data and large-scale transcriptome data. Its open-source software can be freely downloaded from https://sourceforge.net/projects/plek/files/. (A toy sketch of k-mer features feeding an SVM follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
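A hedged toy version of the PLEK-style pipeline: k-mer usage frequencies as alignment-free features feeding an SVM classifier. Only k=3 and synthetic GC-/AT-biased sequences are used; PLEK's improved k-mer scheme and calibration are not reproduced.

```python
# k-mer frequency features + SVM classification (illustrative only).
from itertools import product

import numpy as np
from sklearn.svm import SVC

K = 3
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]

def kmer_freqs(seq: str) -> list:
    counts = {k: 0 for k in KMERS}
    for i in range(len(seq) - K + 1):
        kmer = seq[i:i + K]
        if kmer in counts:
            counts[kmer] += 1
    total = max(len(seq) - K + 1, 1)
    return [counts[k] / total for k in KMERS]

# Toy training data: GC-rich "mRNA-like" vs AT-rich "lncRNA-like" strings.
rng = np.random.default_rng(6)
make = lambda probs, n: "".join(rng.choice(list("ACGT"), size=n, p=probs))
X = [kmer_freqs(make([0.15, 0.35, 0.35, 0.15], 300)) for _ in range(50)] + \
    [kmer_freqs(make([0.35, 0.15, 0.15, 0.35], 300)) for _ in range(50)]
y = [1] * 50 + [0] * 50

clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([kmer_freqs(make([0.15, 0.35, 0.35, 0.15], 300))]))  # [1]
```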
  • 85
    Publication Date: 2014-10-03
    Description: Background: In the past, a number of methods have been developed for predicting post-translational modifications in proteins. In contrast, limited attempts have been made to understand post-transcriptional modifications. Recently it has been shown that tRNA modifications play a direct role in genome structure and codon usage. This study is an attempt to understand kingdom-wise tRNA modifications, particularly uridine modifications (UMs), as the majority of modifications are uridine-derived. Results: A three-step strategy was applied to develop an efficient method for the prediction of UMs. In the first step, we developed a common prediction model for all kingdoms using a dataset from MODOMICS-2008. Support Vector Machine (SVM) based prediction models were developed and evaluated by five-fold cross-validation. Different approaches were tried, and a hybrid approach combining binary and structural information achieved the highest area under the curve (AUC) of 0.936. In the second step, we used newly added tRNA sequences from MODOMICS-2012 as an independent dataset to evaluate the kingdom-wise prediction performance of the common model developed in the first step, achieving AUC values between 0.910 and 0.949. In the third and last step, we used different datasets from MODOMICS-2012 to develop kingdom-wise individual prediction models, achieving AUC values between 0.915 and 0.987. Conclusions: The hybrid approach is efficient not only in predicting kingdom-wise modifications but also in classifying them into the two most prominent UMs: pseudouridine (Y) and dihydrouridine (D). A web server called tRNAmod (http://crdd.osdd.net/raghava/trnamod/) has been developed, which predicts UMs from both tRNA sequences and whole genomes.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 86
    Publication Date: 2014-10-05
    Description: Background: Proteins dynamically interact with each other to perform their biological functions. The dynamic operations of protein-protein interaction (PPI) networks are also reflected in the dynamic formation of protein complexes. Existing protein complex detection algorithms usually overlook the inherent temporal nature of protein interactions within PPI networks. Systematically analyzing temporal protein complexes can not only improve the accuracy of protein complex detection, but also strengthen our biological knowledge of the dynamic protein assembly processes underlying cellular organization. Results: In this study, we propose a novel computational method to predict temporal protein complexes. In particular, we first construct a series of dynamic PPI networks by joint analysis of time-course gene expression data and protein interaction data. Then a Time Smooth Overlapping Complex Detection model (TS-OCD) is proposed to detect temporal protein complexes from these dynamic PPI networks. TS-OCD can naturally capture the smoothness of networks between consecutive time points and detect overlapping protein complexes at each time point. Finally, a nonnegative matrix factorization based algorithm is introduced to merge very similar temporal complexes across different time points. Conclusions: Extensive experimental results demonstrate that the proposed method is more effective at detecting temporal protein complexes than state-of-the-art complex detection techniques.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 87
    Publication Date: 2014-10-04
    Description: Background: Neural stem cells are motile and proliferative cells that undergo mitosis, dividing to produce daughter cells and ultimately generating differentiated neurons and glia. Understanding the mechanisms controlling neural stem cell proliferation and differentiation will play a key role in the emerging fields of regenerative medicine and cancer therapeutics. Stem cell studies in vitro from 2-D image data are well established. Visualizing and analyzing large three-dimensional images of intact tissue is a challenging task. It becomes more difficult as the dimensionality of the image data increases to include time and additional fluorescence channels. There is a pressing need for 5-D image analysis and visualization tools to study cellular dynamics in the intact niche and to quantify the role that environmental factors play in determining cell fate. Results: We present an application that integrates visualization and quantitative analysis of 5-D (x, y, z, t, channel) and large montage confocal fluorescence microscopy images. The image sequences show stem cells together with blood vessels, enabling quantification of the dynamic behaviors of stem cells in relation to their vascular niche, with applications in developmental and cancer biology. Our application automatically segments, tracks, and lineages the image sequence data and then allows the user to view and edit the results of automated algorithms in a stereoscopic 3-D window while simultaneously viewing the stem cell lineage tree in a 2-D window. Using the GPU to store and render the image sequence data enables a hybrid computational approach. An inference-based approach utilizing user-provided edits to automatically correct related mistakes executes interactively on the system CPU while the GPU handles 3-D visualization tasks. Conclusions: By exploiting commodity computer gaming hardware, we have developed an application that can be run in the laboratory to facilitate rapid iteration through biological experiments. We combine unsupervised image analysis algorithms with an interactive visualization of the results. Our validation interface allows for each data set to be corrected to 100% accuracy, ensuring that downstream data analysis is accurate and verifiable. Our tool is the first to combine all of these aspects, leveraging the synergies obtained by utilizing validation information from stereo visualization to improve the low-level image processing tasks.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 88
    Publication Date: 2014-10-05
    Description: Background: Guide-trees are used as part of an essential heuristic to enable the calculation of multiple sequence alignments. They have been the focus of much method development, but there has been little effort at determining systematically which guide-trees, if any, give the best alignments. Some guide-tree construction schemes are based on pair-wise distances amongst unaligned sequences. Others try to emulate an underlying evolutionary tree and involve various iteration methods. Results: We explore all possible guide-trees for a set of protein alignments of up to eight sequences. We find that pairwise distance based default guide-trees sometimes outperform evolutionary guide-trees, as measured by structure-derived reference alignments. However, default guide-trees fall well short of the optimum attainable scores. On average, chained guide-trees perform better than balanced ones but are not better than default guide-trees for small alignments. Conclusions: Alignment methods that use consistency or hidden Markov models to make alignments are less susceptible to sub-optimal guide-trees than simpler methods that basically use conventional sequence alignment between profiles. The latter appear to be affected positively by evolutionary based guide-trees for difficult alignments and negatively for easy alignments. One phylogeny-aware alignment program can strongly discriminate between good and bad guide-trees. The results for randomly chained guide-trees improve with the number of sequences.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 89
    Publication Date: 2014-10-05
    Description: Background: Respiratory epithelial cells are the primary target of influenza virus infection in humans. However, the molecular mechanisms of airway epithelial cell responses to viral infection are not fully understood. Revealing genome-wide transcriptional and post-transcriptional regulatory relationships can further advance our understanding of this problem, which motivates the development of novel and more efficient computational methods to simultaneously infer the transcriptional and post-transcriptional regulatory networks. Results: Here we propose a novel framework named SITPR to investigate the interactions among transcription factors (TFs), microRNAs (miRNAs) and target genes. Briefly, a background regulatory network on a genome-wide scale (~23,000 nodes and ~370,000 potential interactions) is constructed from curated knowledge and algorithm predictions, to which the identification of transcriptional and post-transcriptional regulatory relationships is anchored. To reduce the dimension of the associated computing problem down to an affordable size, several topological and data-based approaches are used. Furthermore, we propose the constrained LASSO formulation and combine it with the dynamic Bayesian network (DBN) model to identify the activated regulatory relationships from time-course expression data. Our simulation studies on networks of different sizes suggest that the proposed framework can effectively determine the genuine regulations among TFs, miRNAs and target genes; we also compare SITPR with several selected state-of-the-art algorithms to further evaluate its performance. By applying the SITPR framework to mRNA and miRNA expression data generated from human lung epithelial A549 cells in response to A/Mexico/InDRE4487/2009 (H1N1) virus infection, we are able to detect the activated transcriptional and post-transcriptional regulatory relationships as well as the significant regulatory motifs. Conclusion: Compared with other representative state-of-the-art algorithms, the proposed SITPR framework can more effectively identify the activated transcriptional and post-transcriptional regulations simultaneously from a given background network. The idea of SITPR is generally applicable to the analysis of gene regulatory networks in human cells. The results obtained for human respiratory epithelial cells suggest the importance of transcriptional and post-transcriptional regulations, as well as their synergies, in the innate immune responses against IAV infection.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
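    The constrained LASSO mentioned in the abstract above restricts regression coefficients to agree with prior knowledge of regulation direction. Below is a minimal, self-contained sketch of that idea on simulated data; it is not the SITPR implementation, whose constraints and DBN coupling are considerably more elaborate:

        import numpy as np
        from scipy.optimize import minimize

        rng = np.random.default_rng(0)
        n, p = 50, 6                     # time points x candidate regulators (toy sizes)
        X = rng.normal(size=(n, p))
        true_beta = np.array([1.5, 0.0, -0.8, 0.0, 0.6, 0.0])
        y = X @ true_beta + 0.1 * rng.normal(size=n)

        signs = [+1, +1, -1, -1, +1, +1]  # assumed prior directions (activator/repressor)
        lam = 0.5                         # arbitrary L1 penalty strength

        def objective(beta):
            # least squares + L1 penalty (smooth inside the sign-constrained region)
            return 0.5 * np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))

        # Encode each sign constraint as a box bound on its coefficient.
        bounds = [(0, None) if s > 0 else (None, 0) for s in signs]
        res = minimize(objective, np.zeros(p), bounds=bounds, method="L-BFGS-B")
        print(np.round(res.x, 2))         # sparse estimates respecting the signs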
  • 90
    Publication Date: 2014-10-03
    Description: Background: To determine which changes in the host cell genome are crucial for cervical carcinogenesis, a longitudinal in vitro model system of HPV-transformed keratinocytes was profiled in a genome-wide manner. Four cell lines affected with either HPV16 or HPV18 were assayed at 8 sequential time points for gene expression (mRNA) and gene copy number (DNA) using high-resolution microarrays. Available methods for temporal differential expression analysis are not designed for integrative genomic studies. Results: Here, we present a method that allows for the identification of differential gene expression associated with DNA copy number changes over time. The temporal variation in gene expression is described by a generalized linear mixed model employing low-rank thin-plate splines. Model parameters are estimated with an empirical Bayes procedure, which exploits the integrated nested Laplace approximation for fast computation. Posteriors of hyperparameters and model parameters are estimated iteratively. The empirical Bayes procedure shrinks multiple dispersion-related parameters. Shrinkage leads to more stable estimates of the model parameters, better control of false positives and improved reproducibility. In addition, to make estimates of the DNA copy number more stable, model parameters are also estimated in a multivariate way using triplets of features, imposing a spatial prior for the copy number effect. Conclusion: With the proposed method for the analysis of time-course multilevel molecular data, more profound insight may be gained through the identification of temporal differential expression induced by DNA copy number abnormalities. In particular, in the analysis of an integrative oncogenomics study with a time-course set-up, our method finds genes previously reported to be involved in cervical carcinogenesis. Furthermore, the proposed method yields improvements in sensitivity, specificity and reproducibility compared to existing methods. Finally, the proposed method is able to handle count (RNA-seq) data from time-course experiments, as shown on a real data set. A toy sketch of the low-rank thin-plate spline basis follows this record.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
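    The temporal trend model in the abstract above rests on a low-rank thin-plate spline basis. A minimal one-dimensional sketch follows, with toy data and an arbitrary fixed ridge penalty standing in for the paper's empirical Bayes machinery; the knot placement is an assumption:

        import numpy as np

        t = np.linspace(0, 7, 8)                    # 8 sequential time points
        knots = np.quantile(t, [0.25, 0.5, 0.75])   # assumed knot placement

        X = np.column_stack([np.ones_like(t), t])       # fixed effects: intercept, slope
        Z = np.abs(t[:, None] - knots[None, :]) ** 3    # 1-D thin-plate (cubic radial) basis

        rng = np.random.default_rng(1)
        y = np.sin(t) + 0.1 * rng.normal(size=t.size)   # toy expression profile

        # Penalized least squares: only the spline coefficients are shrunk,
        # mimicking the mixed-model (random-effects) view of the spline part.
        C = np.hstack([X, Z])
        lam = 1.0
        P = np.diag([0, 0] + [lam] * Z.shape[1])
        coef = np.linalg.solve(C.T @ C + P, C.T @ y)
        print(np.round(C @ coef, 2))                    # smoothed temporal trend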
  • 91
    Publication Date: 2014-10-05
    Description: Background: The function of a protein can be deciphered with higher accuracy from its structure than from its amino acid sequence. Given the huge gap between the available protein sequence and structural space, tools that can generate functionally homogeneous clusters using only sequence information hold great importance. Traditional alignment-based tools work well in most cases, with clustering performed on the basis of sequence similarity. However, in the case of multi-domain proteins, alignment quality might be poor due to the varied lengths of the proteins, domain shuffling or circular permutations. Multi-domain proteins are ubiquitous in nature, hence alignment-free tools, which overcome the shortcomings of alignment-based protein comparison methods, are required. Further, existing tools classify proteins using only domain-level information and hence miss the information encoded in the tethered regions or accessory domains. Our method, on the other hand, takes into account the full-length sequence of a protein, consolidating the complete sequence information to understand a given protein better. Results: Our web server, CLAP (Classification of Proteins), is one such alignment-free software for automatic classification of protein sequences. It utilizes a pattern-matching algorithm that assigns local matching scores (LMS) to residues that are part of the matched patterns between two sequences being compared. CLAP works on full-length sequences and does not require prior domain definitions. Pilot studies undertaken previously on protein kinases and immunoglobulins have shown that CLAP yields clusters with high functional and domain-architectural similarity. Moreover, parsing at a statistically determined cut-off resulted in clusters that agreed with the sub-family-level classification of that particular domain family. Conclusions: CLAP is a useful protein-clustering tool, independent of domain assignment, domain order, sequence length and domain diversity. Our method can be used for any set of protein sequences, yielding functionally relevant clusters with high domain-architectural homogeneity. The CLAP web server is freely available for academic use at http://nslab.mbu.iisc.ernet.in/clap/. A toy alignment-free comparison sketch follows this record.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
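    CLAP's local-matching-score algorithm is not reproduced here, but the following sketch shows the general alignment-free principle it shares: comparing full-length sequences without an alignment, so that domain shuffling or circular permutation barely affects the score. The k-mer profile comparison below is a generic stand-in, not CLAP's actual method:

        from collections import Counter

        def kmer_profile(seq, k=3):
            return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

        def similarity(a, b, k=3):
            """Fraction of shared k-mers: a crude alignment-free score in [0, 1]."""
            pa, pb = kmer_profile(a, k), kmer_profile(b, k)
            shared = sum((pa & pb).values())
            return 2 * shared / (sum(pa.values()) + sum(pb.values()))

        # A circular permutation of the same sequence keeps a high score,
        # where an alignment-based measure would often suffer badly:
        print(similarity("MKLVDDAAGHTRW" * 3, "GHTRWMKLVDDAA" * 3))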
  • 92
    Publication Date: 2014-10-06
    Description: Background: Signatures are short sequences that are unique within a database and not similar to any other sequence in it; they can be used as the basis for identifying different species. Even though several signature discovery algorithms have been proposed in the past, these algorithms require the entire database to be loaded into memory, restricting the amount of data they can process and leaving them unable to handle databases with large amounts of data. Moreover, those algorithms use sequential models and have slower discovery speeds, meaning that their efficiency can be improved. Results: In this research, we introduce a divide-and-conquer strategy for signature discovery and propose a parallel signature discovery algorithm for computer clusters. The algorithm applies the divide-and-conquer strategy to overcome the inability of existing algorithms to process large databases and uses a parallel computing mechanism to effectively improve the efficiency of signature discovery. Even when run with only the memory of regular personal computers, the algorithm can still process large databases, such as the human whole-genome EST database, that the existing algorithms could not. Conclusions: The algorithm proposed in this research is not limited by the amount of usable memory and can rapidly find signatures in large databases, making it useful in applications such as Next Generation Sequencing and other large-scale database analysis and processing. The implementation of the proposed algorithm is available at http://www.cs.pu.edu.tw/~fang/DDCSDPrograms/DDCSD.htm. A toy divide-and-conquer counting sketch follows this record.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
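    The divide-and-conquer strategy in the abstract above can be sketched as: split the database into chunks that fit in memory, count candidate signature l-mers per chunk in parallel, then merge the partial counts. The toy Python version below keeps only l-mers seen exactly once; the real algorithm additionally enforces a dissimilarity margin against all other sequences, which is omitted here:

        from collections import Counter
        from multiprocessing import Pool

        L = 5  # signature length (toy value)

        def count_chunk(chunk):
            c = Counter()
            for seq in chunk:
                for i in range(len(seq) - L + 1):
                    c[seq[i:i + L]] += 1
            return c

        if __name__ == "__main__":
            database = ["ACGTACGTGG", "TTGACCAGTA", "ACGTACGAAC"]
            chunks = [database[:2], database[2:]]        # divide
            with Pool(2) as pool:
                partial = pool.map(count_chunk, chunks)  # conquer in parallel
            total = sum(partial, Counter())              # merge
            signatures = [mer for mer, n in total.items() if n == 1]
            print(sorted(signatures))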
  • 93
    Publication Date: 2014-09-02
    Description: Background: The sensitivity of Optical Coherence Tomography (OCT) images for distinguishing retinal tissue morphology characterized by early neural loss from that of normal healthy eyes is tested by calculating structural information and fractal dimension. OCT data from 74 healthy eyes and 43 eyes with type 1 diabetes mellitus with mild diabetic retinopathy (MDR) on biomicroscopy were analyzed using a custom-built algorithm (OCTRIMA) to measure intraretinal layer thickness locally. A power spectrum method was used to calculate the fractal dimension in intraretinal regions of interest identified in the images. ANOVA followed by Newman-Keuls post-hoc analyses was used to test for differences between pathological and normal groups. A modified p value of … A toy power-spectrum fractal-dimension sketch follows this record.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
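    The power spectrum method for fractal dimension named in the abstract above fits the slope β of log-power versus log-frequency and converts it to a dimension; for one-dimensional fractional-Brownian-like profiles, D = (5 − β)/2. A minimal sketch on a simulated profile (not OCT data):

        import numpy as np

        rng = np.random.default_rng(2)
        profile = np.cumsum(rng.normal(size=1024))   # toy 1-D intensity profile

        spectrum = np.abs(np.fft.rfft(profile)) ** 2
        freqs = np.fft.rfftfreq(profile.size)

        mask = freqs > 0                             # drop the DC component
        slope, _ = np.polyfit(np.log(freqs[mask]), np.log(spectrum[mask]), 1)
        beta = -slope                                # P(f) ~ f^(-beta)
        print("fractal dimension ~", (5 - beta) / 2) # ~1.5 for Brownian motion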
  • 94
    Publication Date: 2014-10-14
    Description: Background: Quantification of a transcriptional profile is a useful way to evaluate the activity of a cell at a given point in time. Although RNA-Seq has revolutionized transcriptional profiling, the costs of RNA-Seq are still significantly higher than microarrays, and often the depth of data delivered by RNA-Seq is in excess of what is needed for simple transcript quantification. Digital Gene Expression (DGE) is a cost-effective, sequence-based approach for simple transcript quantification: by sequencing one read per molecule of RNA, this technique can be used to efficiently count transcripts while obviating the need for transcript-length normalization and reducing the total number of reads necessary for accurate quantification. Here, we present trieFinder, a program specifically designed to rapidly map, parse, and annotate DGE tags of various lengths against cDNA and/or genomic sequence databases. Results: The trieFinder algorithm maps DGE tags in a two-step process. First, it scans FASTA files of RefSeq, UniGene, and genomic DNA sequences to create a database of all tags that can be derived from a predefined restriction site. Next, it compares the experimental DGE tags to this tag database, taking advantage of the fact that the tags are stored as a prefix tree, or "trie," which allows for linear-time searches for exact matches. DGE tags with mismatches are analyzed by recursive calls in the data structure. We find that, in terms of alignment speed, the mapping functionality of trieFinder compares favorably with Bowtie. Conclusions: trieFinder can quickly provide the user with an annotation of the DGE tags from three sources simultaneously, simplifying transcript quantification and novel transcript detection, delivering the data in a simple parsed format and obviating the need to post-process the alignment results. trieFinder is available at https://github.com/grenaud/trieFinder. A minimal trie insert/lookup sketch follows this record.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
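    The prefix tree at the heart of trieFinder stores every reference-derived tag once and answers exact-match queries in time linear in the tag length. A minimal insert/lookup sketch follows; the annotation string is hypothetical, and trieFinder's mismatch handling via recursive calls into the structure is not shown:

        class Trie:
            def __init__(self):
                self.root = {}

            def insert(self, tag, annotation):
                node = self.root
                for base in tag:
                    node = node.setdefault(base, {})
                node["$"] = annotation            # terminal marker

            def lookup(self, tag):
                """Exact match in O(len(tag)); returns None if absent."""
                node = self.root
                for base in tag:
                    if base not in node:
                        return None
                    node = node[base]
                return node.get("$")

        trie = Trie()
        trie.insert("CATGAAGTCGATCGGAT", "GeneA (hypothetical annotation)")
        print(trie.lookup("CATGAAGTCGATCGGAT"))   # exact match found
        print(trie.lookup("CATGAAGTCGATCGGAA"))   # None: would trigger mismatch search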
  • 95
    Publication Date: 2014-10-27
    Description: Background: As time series experiments in higher eukaryotes usually obtain data from different individuals collected at different time points, a time series sample itself is not equivalent to a true biological replicate but is, rather, a combination of several biological replicates. The analysis of expression data derived from a time series sample is therefore often performed with a low number of replicates due to budget limitations or limitations in sample availability. In addition, most algorithms developed to identify specific patterns in time series datasets do not consider biological variation in samples collected under the same conditions. Results: Using artificial time course datasets, we show that resampling considerably improves the accuracy of transcripts identified as rhythmic. In particular, the number of false positives can be greatly reduced while the number of true positives is maintained in the range of other methods currently used to determine rhythmically expressed genes. Conclusions: The resampling approach described here therefore increases the accuracy of time series expression data analysis and furthermore emphasizes the importance of biological replicates in identifying oscillating genes. Resampling can be used for any time series expression dataset as long as the samples are acquired from independent individuals at each time point. A toy replicate-resampling sketch follows this record.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
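    The resampling idea in the abstract above can be sketched as: at each time point draw replicates with replacement, re-fit a rhythmicity model on every resampled series, and keep only transcripts that test rhythmic in most resamples. The toy version below uses a cosinor (cosine) fit with an assumed 24-hour period and an arbitrary amplitude cut-off, neither of which comes from the paper:

        import numpy as np

        rng = np.random.default_rng(3)
        times = np.arange(0, 48, 4)          # hours; 12 time points
        reps = 3
        period = 24.0                         # assumed period

        # Simulated rhythmic transcript: 3 biological replicates per time point.
        data = (np.cos(2 * np.pi * times / period)[:, None]
                + 0.3 * rng.normal(size=(times.size, reps)))

        def rhythm_amplitude(y):
            """Least-squares cosinor fit; returns the fitted amplitude."""
            X = np.column_stack([np.ones_like(times),
                                 np.cos(2 * np.pi * times / period),
                                 np.sin(2 * np.pi * times / period)])
            beta = np.linalg.lstsq(X, y, rcond=None)[0]
            return np.hypot(beta[1], beta[2])

        hits, n_boot = 0, 200
        for _ in range(n_boot):
            # Draw one replicate (with replacement) per time point.
            pick = rng.integers(0, reps, size=(times.size, 1))
            resampled = data[np.arange(times.size)[:, None], pick].ravel()
            hits += rhythm_amplitude(resampled) > 0.5    # assumed cut-off
        print("fraction rhythmic across resamples:", hits / n_boot)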
  • 96
    Publication Date: 2014-10-31
    Description: Background: Because of the difficulties involved in learning and using 3D modeling and rendering software, many scientists hire programmers or animators to create models and animations. This both slows the discovery process and provides opportunities for miscommunication. Working with multiple collaborators, a tool was developed (based on a set of design goals) to enable scientists to directly construct models and animations. Results: SketchBio is presented, a tool that incorporates state-of-the-art bimanual interaction and drop shadows to enable rapid construction of molecular structures and animations. It includes three novel features: crystal-by-example, pose-mode physics, and spring-based layout, which accelerate operations common in the formation of molecular models. Design decisions and their consequences are presented, including cases where iterative design was required to produce effective approaches. Conclusions: The design decisions, novel features, and inclusion of state-of-the-art techniques enabled SketchBio to meet all of its design goals. These features and decisions can be incorporated into existing and new tools to improve their effectiveness.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 97
    Publication Date: 2014-10-23
    Description: Background: Viral integration into a host genome is defined by two chimeric junctions that join viral and host DNA. Recently, computational tools have been developed that utilize NGS data to detect chimeric junctions. These methods identify individual viral-host junctions but do not associate chimeric pairs as an integration event. Without knowing the chimeric boundaries of an integration, its genetic content cannot be determined. Results: SummonChimera is a Perl program that associates chimera pairs to infer the complete viral genomic integration event to the nucleotide level within single- or paired-end NGS data. SummonChimera integration prediction was verified on a set of single-end IonTorrent reads from a purified Salmonella bacterium with an integrated bacteriophage. Furthermore, SummonChimera predicted integrations from experimentally verified Hepatitis B Virus chimeras within a paired-end Whole Genome Sequencing hepatocellular carcinoma tumor database. Conclusions: SummonChimera identified all experimentally verified chimeras detected by current computational methods. Further, SummonChimera integration inference precisely predicted bacteriophage integration. The application of SummonChimera to cancer NGS data accurately identifies deletion of host and viral sequence during integration. The precise nucleotide determination of an integration allows prediction of viral and cellular gene transcription patterns. A toy junction-pairing sketch follows this record.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
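    The junction-pairing inference described above can be caricatured as follows: two viral-host junctions on the same chromosome, flanking opposite sides of the viral insert and lying within a plausible distance of each other, are paired into one integration event. The coordinates and distance cut-off below are toy assumptions, not SummonChimera's actual rules:

        junctions = [
            # (chromosome, host position, side of the viral insert)
            ("chr5", 100_200, "left"),
            ("chr5", 100_950, "right"),
            ("chr7", 55_000, "left"),     # unpaired: no matching right junction
        ]

        MAX_GAP = 10_000                  # assumed maximum junction separation

        def pair_integrations(juncs):
            events = []
            lefts = [j for j in juncs if j[2] == "left"]
            rights = [j for j in juncs if j[2] == "right"]
            for chrom, pos, _ in lefts:
                for chrom2, pos2, _ in rights:
                    if chrom == chrom2 and 0 < pos2 - pos <= MAX_GAP:
                        # pos..pos2 bounds the host sequence spanned by the insert
                        events.append((chrom, pos, pos2, pos2 - pos))
            return events

        print(pair_integrations(junctions))   # [('chr5', 100200, 100950, 750)]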
  • 98
    Publication Date: 2014-10-25
    Description: Background: Amino acid replacement rate matrices are a crucial component of many protein analysis systems such as sequence similarity search, sequence alignment, and phylogenetic inference. Ideally, the rate matrix reflects the mutational behavior of the actual data under study; however, estimating amino acid replacement rate matrices requires large protein alignments and is computationally expensive and complex. As a compromise, sub-optimal pre-calculated generic matrices are typically used for protein-based phylogeny. Sequence availability has now grown to a point where problem-specific rate matrices can often be calculated if the computational cost can be controlled. Results: The most time-consuming step in estimating rate matrices by maximum likelihood is building maximum likelihood phylogenetic trees from protein alignments. We propose a new procedure, called FastMG, to overcome this obstacle. The key innovation is the alignment-splitting algorithm, which splits alignments with many sequences into non-overlapping sub-alignments prior to estimating amino acid replacement rates. Experiments with different large data sets showed that the FastMG procedure was an order of magnitude faster than estimation without splitting. Importantly, there was no apparent loss in matrix quality when an appropriate splitting procedure was used. Conclusions: FastMG is a simple, fast and accurate procedure for estimating amino acid replacement rate matrices from large data sets. It enables researchers to study the evolutionary relationships of specific groups of proteins or taxa with optimized, data-specific amino acid replacement rate matrices. The programs, data sets, and the new mammalian mitochondrial protein rate matrix are available at http://fastmg.codeplex.com. A toy alignment-splitting sketch follows this record.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
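    The alignment-splitting step of FastMG partitions a large alignment into non-overlapping sub-alignments before rate estimation. The sketch below shows only this partitioning, using a random split as one possible strategy; the paper evaluates several splitting procedures, and the subset size here is an assumption:

        import random

        def split_alignment(names, subset_size=5, seed=0):
            """Randomly partition sequence names into non-overlapping subsets."""
            rng = random.Random(seed)
            shuffled = names[:]
            rng.shuffle(shuffled)
            return [shuffled[i:i + subset_size]
                    for i in range(0, len(shuffled), subset_size)]

        seqs = ["seq%d" % i for i in range(10)]
        for sub in split_alignment(seqs):
            print(sub)    # each subset would get its own ML tree and rate estimate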
  • 99
    Publication Date: 2014-10-31
    Description: Background: The complexity of biological data related to the genetic origins of tumour cells poses significant challenges for gleaning valuable knowledge that can be used to predict therapeutic responses. In order to discover a link between gene expression profiles and drug responses, a computational framework based on Consensus p-Median clustering is proposed. The main goal is to simultaneously predict (in silico) anticancer responses by extracting common patterns among tumour cell lines, selecting genes that could potentially explain the therapy outcome, and finally learning a probabilistic model able to predict the therapeutic responses. Results: The experimental investigation performed on the NCI60 dataset highlights three main findings: (1) Consensus p-Median is able to create groups of cell lines that are highly correlated both in terms of gene expression and drug response; (2) from a biological point of view, the proposed approach enables the selection of genes that are strongly involved in several cancer processes; (3) the final prediction of drug responses, built upon Consensus p-Median and the selected genes, represents a promising step towards identifying potentially useful drugs. Conclusion: The proposed learning framework represents a promising approach for predicting drug responses in tumour cells. A toy p-median clustering sketch follows this record.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
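    The p-median objective behind the clustering step selects p cell lines as "medians" so that the total distance from every cell line to its nearest median is minimal. The greedy heuristic below is a common approximation, not necessarily the paper's solver, and the consensus layer that aggregates many clusterings is not shown:

        import numpy as np

        def greedy_p_median(D, p):
            """D: symmetric distance matrix; returns indices of p chosen medians."""
            n = D.shape[0]
            medians = []
            for _ in range(p):
                best, best_cost = None, np.inf
                for cand in range(n):
                    if cand in medians:
                        continue
                    # cost if 'cand' is added: each point goes to its nearest median
                    cost = D[:, medians + [cand]].min(axis=1).sum()
                    if cost < best_cost:
                        best, best_cost = cand, cost
                medians.append(best)
            return medians

        rng = np.random.default_rng(4)
        X = rng.normal(size=(12, 5))                 # toy expression profiles
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        meds = greedy_p_median(D, p=3)
        print("medians:", meds, "assignment:", D[:, meds].argmin(axis=1))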
  • 100
    Publication Date: 2014-11-04
    Description: Background: Gene set testing has become an important analysis technique in high-throughput microarray and next generation sequencing studies for uncovering patterns of differential expression of various biological processes. Often, the large number of gene sets tested simultaneously requires multiplicity correction. This work provides a substantial computational improvement to an existing familywise error rate controlling multiplicity approach (the Focus Level method) for gene set testing in high-throughput microarray and next generation sequencing studies using Gene Ontology graphs, which we call the Short Focus Level. Results: The Short Focus Level procedure, which performs a shortcut of the full Focus Level procedure, is achieved by extending the reach of graphical weighted Bonferroni testing to closed testing situations where restricted hypotheses are present, such as in Gene Ontology graphs. The Short Focus Level multiplicity adjustment can perform the full top-down approach of the original Focus Level procedure, overcoming a significant disadvantage of the otherwise powerful Focus Level multiplicity adjustment. The computational and power differences between the Short Focus Level procedure and the original Focus Level procedure are demonstrated both through simulation and using real data. Conclusions: The Short Focus Level procedure shows a significant increase in computation speed over the original Focus Level procedure (as much as ~15,000 times faster). The Short Focus Level should be used in place of the Focus Level procedure whenever the logical assumptions of the Gene Ontology graph structure are appropriate for the study objectives and when either no a priori focus level of interest can be specified or the focus level is selected at a higher level of the graph, where the Focus Level procedure is computationally intractable. A toy graphical weighted Bonferroni sketch follows this record.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
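    Graphical weighted Bonferroni testing, the engine extended by the Short Focus Level, gives each hypothesis a share of α and passes that share along graph edges when a hypothesis is rejected. The toy GO-like example below, with hypothetical p-values and edge weights, illustrates only these mechanics, not the authors' implementation or the restricted-hypothesis extension:

        alpha = 0.05
        weights = {"root": 1.0, "childA": 0.0, "childB": 0.0}   # initial alpha shares
        edges = {"root": [("childA", 0.5), ("childB", 0.5)],    # weight-transfer graph
                 "childA": [], "childB": []}
        pvals = {"root": 0.001, "childA": 0.02, "childB": 0.4}  # hypothetical p-values

        rejected = set()
        progress = True
        while progress:
            progress = False
            for h, w in list(weights.items()):
                if h not in rejected and pvals[h] <= w * alpha:
                    rejected.add(h)
                    for child, frac in edges[h]:    # redistribute the freed alpha
                        weights[child] += frac * weights[h]
                    weights[h] = 0.0
                    progress = True
        print(sorted(rejected))    # ['childA', 'root']: childB fails its 0.025 budget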