ALBERT

All Library Books, journals and Electronic Records Telegrafenberg

Filter

Collection
  • Articles (1,326)
Years
  • 2010-2014 (1,326)
  • 1980-1984
  • 1925-1929
Year
  • 2013 (1,326)
  • 1929
Journal
  • BioMed Central (1,326)
  • American Chemical Society
  • Copernicus
Topics
  • Computer Science (1,326)
  • 1
    Publication Date: 2013-09-09
    Description: Background: The process of creating and designing Virtual Patients for teaching students of medicine is an expensive and time-consuming task. In order to explore potential methods of mitigating these costs, our group began exploring the possibility of creating Virtual Patients based on electronic health records. This review assesses the usage of electronic health records in the creation of interactive Virtual Patients for teaching clinical decision-making. Methods: The PubMed database was accessed programmatically to find papers relating to Virtual Patients. The returned citations were classified and the relevant full text articles were reviewed to find Virtual Patient systems that used electronic health records to create learning modalities. Results: A total of n = 362 citations were found on PubMed and subsequently classified, of which n = 28 full-text articles were reviewed. Few articles used unformatted electronic health records other than patient CT or MRI scans. The use of patient data, extracted from electronic health records or otherwise, is widespread. The use of unformatted electronic health records in their raw form is less frequent. Patient data use is broad and spans several areas, such as teaching, training, 3D visualisation, and assessment. Conclusions: Virtual Patients that are based on real patient data are widespread, yet the use of unformatted electronic health records, abundant in hospital information systems, is reported less often. The majority of teaching systems use reformatted patient data gathered from electronic health records, and do not use these electronic health records directly. Furthermore, many systems were found that used patient data in the form of CT or MRI scans. Much potential research exists regarding the use of unformatted electronic health records for the creation of Virtual Patients.
    Electronic ISSN: 1472-6947
    Topics: Computer Science , Medicine
    Published by BioMed Central
  • 2
    Publication Date: 2013-09-11
    Description: Background: Human resources are an important building block of the health system. During the last decade, enormous investment has gone into information systems to manage human resources, but due to the lack of a clear vision, policy, and strategy, the results of these efforts have not been very visible. No reliable information portal captures the actual state of human resources in Pakistan's health sector. The World Health Organization (WHO) has provided technical support for the assessment of the existing system and the development of a comprehensive Human Resource Information System (HRIS) in Pakistan. Methods: The questions in the WHO-HRIS assessment tool were distributed into five thematic groups. Purposively selected representatives (n=65) from the government, private sector, and development partners participated in this cross-sectional study, based on their programmatic affiliations. Results: Fifty-five percent of organizations and departments have an independent Human Resources (HR) section managed by an establishment branch and fully equipped with functional computers. Forty-five organizations (70%) had HR rules, regulations and coordination mechanisms, yet these are not implemented. Data reporting is mainly in paper form, on prescribed forms (51%), in registers (3%) or even on plain paper (20%). Data analysis does not feed into the decision-making process, and dissemination of information is quite erratic. Most of the organizations had no feedback mechanism for cross-checking the HR data, rendering it unreliable. Conclusion: Pakistan lacks appropriate HRIS management. The current HRIS indeed has a multitude of problems. In the wake of the 2011 reforms within the health sector, provinces have an even greater need to plan their respective health department services and must address the deficiencies and inefficiencies of their HRIS so that the gaps and HR needs are better aligned for reaching the 2015 UN Millennium Development Goals (MDGs) targets.
    Electronic ISSN: 1472-6947
    Topics: Computer Science , Medicine
    Published by BioMed Central
  • 3
    Publication Date: 2013-09-12
    Description: Background: Decision support systems for differential diagnosis have traditionally been evaluated on the basis of how sensitively and specifically they identify the correct diagnosis established by expert clinicians. Discussion: This article questions whether evaluation criteria pertaining to identifying the correct diagnosis are the most appropriate or useful. Instead, it advocates evaluating decision support systems for differential diagnosis on the criterion of maximizing the value of information. Summary: This approach quantitatively and systematically integrates several important clinical management priorities, including avoiding serious diagnostic errors of omission and avoiding harmful or expensive tests.
    Electronic ISSN: 1472-6947
    Topics: Computer Science , Medicine
    Published by BioMed Central
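Record 3 argues for judging diagnostic decision support by expected value of information rather than diagnostic accuracy alone. The following is a minimal numeric sketch of that criterion in Python, comparing the expected utility of treating on the prior with the expected utility after the diagnosis is revealed; all probabilities and utilities are invented for illustration and are not taken from the article.

```python
# Toy expected-value-of-information calculation (illustrative numbers only).
# Two candidate diagnoses, two possible treatments, utilities on a 0-1 scale.
prior = {"disease_A": 0.7, "disease_B": 0.3}          # prior P(diagnosis)
utility = {                                            # U(treatment, diagnosis)
    ("treat_A", "disease_A"): 0.9, ("treat_A", "disease_B"): 0.2,
    ("treat_B", "disease_A"): 0.3, ("treat_B", "disease_B"): 0.8,
}
treatments = ["treat_A", "treat_B"]

# Expected utility of acting on the prior alone (pick the single best treatment).
eu_without_info = max(
    sum(prior[d] * utility[(t, d)] for d in prior) for t in treatments
)

# Expected utility if the true diagnosis were revealed before choosing (perfect information):
# for each diagnosis pick the best treatment, then average over the prior.
eu_with_info = sum(
    prior[d] * max(utility[(t, d)] for t in treatments) for d in prior
)

value_of_information = eu_with_info - eu_without_info
print(f"EU without info: {eu_without_info:.3f}")          # 0.690
print(f"EU with perfect info: {eu_with_info:.3f}")        # 0.870
print(f"Expected value of information: {value_of_information:.3f}")  # 0.180
```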
  • 4
    Publication Date: 2013-09-12
    Description: Background: High-throughput sequencing technologies are improving in quality, capacity and cost, providing versatile applications in DNA and RNA research. For small genomes or fractions of larger genomes, DNA samples can be mixed and loaded together on the same sequencing track. This so-called multiplexing approach relies on a specific DNA tag or barcode that is attached to the sequencing or amplification primer and hence appears at the beginning of the sequence in every read. After sequencing, each sample read is identified on the basis of the respective barcode sequence. Alterations of DNA barcodes during synthesis, primer ligation, DNA amplification, or sequencing may lead to incorrect sample identification unless the error is revealed and corrected. This can be accomplished by implementing error-correcting algorithms and codes. This barcoding strategy increases the total number of correctly identified samples, thus improving overall sequencing efficiency. Two popular sets of error-correcting codes are Hamming codes and Levenshtein codes. Results: Levenshtein codes operate only on words of known length. Since a DNA sequence with an embedded barcode is essentially one continuous long word, application of the classical Levenshtein algorithm is problematic. In this paper we demonstrate the decreased error-correction capability of Levenshtein codes in a DNA context and suggest an adaptation of Levenshtein codes that is proven to efficiently correct nucleotide errors in DNA sequences. In our adaptation we take the DNA context into account and redefine the word length whenever an insertion or deletion is revealed. In simulations we show the superior error-correction capability of the new method compared to traditional Levenshtein and Hamming based codes in the presence of multiple errors. Conclusions: We present an adaptation of Levenshtein codes to DNA contexts capable of correcting a pre-defined number of insertion, deletion, and substitution mutations. Our improved method is additionally capable of recovering the new length of the corrupted codeword and of correcting on average more random mutations than traditional Levenshtein or Hamming codes. As part of this work we prepared software for the flexible generation of DNA codes based on our new approach. To adapt codes to specific experimental conditions, the user can customize sequence filtering, the number of correctable mutations and barcode length for highest performance.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
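Record 4 concerns error-correcting DNA barcodes for multiplexed sequencing. As a point of reference only, here is a minimal Python sketch of the simpler substitution-only (Hamming-style) demultiplexing step; the barcode set and reads are invented, and the paper's adapted Levenshtein codes, which also recover insertions and deletions, are not reproduced.

```python
from typing import Optional

BARCODES = ["ACGTAC", "TGCATG", "GATCGA", "CTAGCT"]   # example barcode set (invented)
MAX_DIST = 1                                           # correctable substitutions per barcode

def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def assign_barcode(read: str) -> Optional[str]:
    """Assign a read to the unique barcode within MAX_DIST substitutions, else None."""
    prefix = read[:len(BARCODES[0])]
    hits = sorted((hamming(prefix, bc), bc) for bc in BARCODES)
    best_dist, best_bc = hits[0]
    # Reject prefixes that are too distant or ambiguous instead of guessing.
    if best_dist > MAX_DIST or (len(hits) > 1 and hits[1][0] == best_dist):
        return None
    return best_bc

if __name__ == "__main__":
    for read in ["ACGTACGGTTAA", "ACCTACGGTTAA", "AGCTAGGGTTAA"]:
        print(read[:6], "->", assign_barcode(read))
```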
  • 5
    Publication Date: 2013-09-17
    Description: Background: Patient Data Management Systems (PDMS) support clinical documentation at the bedside and have demonstrated effects on the completeness of patient charting and the time spent on documentation. These systems are costly and raise the question of whether such a major investment pays off. We tried to answer the following questions: How do the costs and revenues of an intensive care unit develop before and after the introduction of a PDMS? Can higher revenues be obtained with improved PDMS documentation? Can we demonstrate cost savings attributable to the PDMS? Methods: Retrospective analysis of cost and reimbursement data of a 25-bed Intensive Care Unit at a German University Hospital, three years before (2004-2006) and three years after (2007-2009) PDMS implementation. Results: Costs and revenues increased continuously over the years. The profit of the investigated ICU fluctuated over the years and seemingly depended on other factors as well. We found a small increase in profit in the year after the introduction of the PDMS, but not in the following years. Profit per case peaked at €1,039 in 2007, but subsequently dropped to €639 per case. We found no clear evidence of cost savings after the PDMS introduction. Our cautious calculation did not consider the additional labour costs for IT staff needed for system maintenance. Conclusions: The introduction of a PDMS probably has minimal or no effect on reimbursement. In our case the observed increase in profit was too small to amortize the total investment in the PDMS implementation. This may add some counterweight to the literature, where expectations for tools such as the PDMS can be quite unreasonable.
    Electronic ISSN: 1472-6947
    Topics: Computer Science , Medicine
    Published by BioMed Central
  • 6
    Publication Date: 2013-09-24
    Description: Background: Analysis of global gene expression by DNA microarrays is widely used in experimental molecular biology. However, the complexity of such high-dimensional data sets makes it difficult to fully understand the underlying biological features present in the data. The aim of this study is to introduce a method for DNA microarray analysis that provides an intuitive interpretation of data through dimension reduction and pattern recognition. We present the first "Archetypal Analysis" of global gene expression. The analysis is based on microarray data from five integrated studies of Pseudomonas aeruginosa isolated from the airways of cystic fibrosis patients. Results: Our analysis clustered samples into distinct groups with comprehensible characteristics since the archetypes representing the individual groups are closely related to samples present in the data set. Significant changes in gene expression between different groups identified adaptive changes of the bacteria residing in the cystic fibrosis lung. The analysis suggests a similar gene expression pattern between isolates with a high mutation rate (hypermutators) despite accumulation of different mutations for these isolates. This suggests positive selection in the cystic fibrosis lung environment, and changes in gene expression for these isolates are therefore most likely related to adaptation of the bacteria. Conclusions: Archetypal analysis succeeded in identifying adaptive changes of P. aeruginosa. The combination of clustering and matrix factorization made it possible to reveal minor similarities among different groups of data, which other analytical methods failed to identify. We suggest that this analysis could be used to supplement current methods used to analyze DNA microarray data.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 7
    Publication Date: 2013-10-03
    Description: Background: The development of new therapies for orphan genetic diseases represents an extremely important medical and social challenge. Drug repositioning, i.e. finding new indications for approved drugs, could be one of the most cost- and time-effective strategies to cope with this problem, at least in a subset of cases. Therefore, many computational approaches based on the analysis of high throughput gene expression data have so far been proposed to reposition available drugs. However, most of these methods require gene expression profiles directly relevant to the pathologic conditions under study, such as those obtained from patient cells and/or from suitable experimental models. In this work we have developed a new approach for drug repositioning, based on identifying known drug targets showing conserved anti-correlated expression profiles with human disease genes, which is completely independent from the availability of 'ad hoc' gene expression data-sets. Results: By analyzing available data, we provide evidence that the genes displaying conserved anti-correlation with drug targets are antagonistically modulated in their expression by treatment with the relevant drugs. We then identified clusters of genes associated to similar phenotypes and showing conserved anticorrelation with drug targets. On this basis, we generated a list of potential candidate drug-disease associations. Importantly, we show that some of the proposed associations are already supported by independent experimental evidence. Conclusions: Our results support the hypothesis that the identification of gene clusters showing conserved anticorrelation with drug targets can be an effective method for drug repositioning and provide a wide list of new potential drug-disease associations for experimental validation.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 8
    Publication Date: 2013-10-03
    Description: Background: Clinical decision support (CDS) for electronic prescribing systems (computerized physician order entry) should help prescribers in the safe and rational use of medicines. However, the best ways to alert users to unsafe or irrational prescribing are uncertain. Specifically, CDS systems may generate too many alerts, producing unwelcome distractions for prescribers, or too few alerts running the risk of overlooking possible harms. Obtaining the right balance of alerting to adequately improve patient safety should be a priority. Methods: A workshop funded through the European Regional Development Fund was convened by the University Hospitals Birmingham NHS Foundation Trust to assess current knowledge on alerts in CDS and to reach a consensus on a future research agenda on this topic. Leading European researchers in CDS and alerts in electronic prescribing systems were invited to the workshop. Results: We identified important knowledge gaps and suggest research priorities including (1) the need to determine the optimal sensitivity and specificity of alerts; (2) whether adaptation to the environment or characteristics of the user may improve alerts; and (3) whether modifying the timing and number of alerts will lead to improvements. We have also discussed the challenges and benefits of using naturalistic or experimental studies in the evaluation of alerts and suggested appropriate outcome measures. Conclusions: We have identified critical problems in CDS, which should help to guide priorities in research to evaluate alerts. It is hoped that this will spark the next generation of novel research from which practical steps can be taken to implement changes to CDS systems that will ultimately reduce alert fatigue and improve the design of future systems.
    Electronic ISSN: 1472-6947
    Topics: Computer Science , Medicine
    Published by BioMed Central
  • 9
    Publication Date: 2013-10-03
    Description: Background: Over-sampling methods based on the Synthetic Minority Over-sampling Technique (SMOTE) have been proposed for classification problems of imbalanced biomedical data. However, the existing over-sampling methods achieve slightly better or sometimes worse results than the simplest SMOTE. In order to improve the effectiveness of SMOTE, this paper presents a novel over-sampling method using codebooks obtained by learning vector quantization. In general, even when an existing SMOTE is applied to a biomedical dataset, its empty feature space is still so large that most classification algorithms would not perform well in estimating borderlines between classes. To tackle this problem, our over-sampling method generates synthetic samples that occupy more feature space than the other SMOTE algorithms. Briefly, our over-sampling method generates useful synthetic samples by referring to actual samples taken from real-world datasets. Results: Experiments on eight real-world imbalanced datasets demonstrate that our proposed over-sampling method performs better than the simplest SMOTE on four of five standard classification algorithms. Moreover, the performance of our method increases if the latest SMOTE variant, MWMOTE, is used in our algorithm. Experiments on datasets for β-turn type prediction show some important patterns that have not been seen in previous analyses. Conclusions: The proposed over-sampling method generates useful synthetic samples for the classification of imbalanced biomedical data. In addition, the proposed over-sampling method is compatible with basic classification algorithms and with the existing over-sampling methods.
    Electronic ISSN: 1756-0381
    Topics: Biology , Computer Science
    Published by BioMed Central
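Record 9 builds on SMOTE-style over-sampling. The sketch below shows only the basic SMOTE interpolation step, generating synthetic minority samples on line segments between minority neighbours; it is not the authors' codebook/learning-vector-quantization variant, and the toy data and parameter choices are assumptions.

```python
import numpy as np

def smote(minority: np.ndarray, n_synthetic: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Basic SMOTE: interpolate between a minority sample and one of its k nearest
    minority neighbours. Returns an array of synthetic samples."""
    rng = np.random.default_rng(seed)
    n = len(minority)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]         # indices of k nearest neighbours
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(n)                           # pick a minority sample
        j = rng.choice(neighbours[i])                 # and one of its neighbours
        gap = rng.random()                            # random point on the segment
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

if __name__ == "__main__":
    minority = np.random.default_rng(1).normal(size=(20, 4))   # 20 samples, 4 features
    print(smote(minority, n_synthetic=40).shape)               # (40, 4)
```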
  • 10
    Publication Date: 2013-10-04
    Description: Background: RNAi screening is a powerful method to study the genetics of intracellular processes in metazoans. Technically, the approach has been largely inspired by techniques and tools developed for compound screening, including those for data analysis. By contrast with compounds, however, RNAi-inducing agents can be linked to a large body of gene-centric, publicly available data. However, the currently available software applications for analyzing RNAi screen data usually lack the ability to visualize associated gene information in an interactive fashion. Results: Here, we present ScreenSifter, an open-source desktop application developed to facilitate the storage, statistical analysis, and rapid and intuitive biological data mining of RNAi screening datasets. The interface facilitates metadata acquisition and long-term safe storage, while the graphical user interface helps with the definition of a hit list and the visualization of biological modules among the hits, through Gene Ontology and protein-protein interaction analyses. The application also allows the visualization of screen-to-screen comparisons. Conclusions: Our software package, ScreenSifter, can accelerate and facilitate screen data analysis and enable discovery by providing unique biological data visualization capabilities.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 11
    Publication Date: 2013-10-05
    Description: Background: Fundamental cellular processes such as cell movement, division or food uptake critically depend on cells being able to change shape. Fast acquisition of three-dimensional image time series has now become possible, but we lack efficient tools for analysing shape deformations in order to understand the real three-dimensional nature of shape changes. Results: We present a framework for 3D+time cell shape analysis. The main contribution is three-fold: First, we develop a fast, automatic random walker method for cell segmentation. Second, a novel topology fixing method is proposed to fix segmented binary volumes without spherical topology. Third, we show that algorithms used for each individual step of the analysis pipeline (cell segmentation, topology fixing, spherical parameterization, and shape representation) are closely related to the Laplacian operator. The framework is applied to the shape analysis of neutrophil cells. Conclusions: The method we propose for cell segmentation is faster than the traditional random walker method or the level set method, and performs better on 3D time-series of neutrophil cells, which are comparatively noisy as stacks have to be acquired fast enough to account for cell motion. Our method for topology fixing outperforms the tools provided by SPHARM-MAT and SPHARM-PDM in terms of their successful fixing rates. The different tasks in the presented pipeline for 3D+time shape analysis of cells can be solved using Laplacian approaches, opening the possibility of eventually combining individual steps in order to speed up computations.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 12
    Publication Date: 2013-10-05
    Description: Background: In recent years, high-throughput microscopy has emerged as a powerful tool for analyzing cellular dynamics in an unprecedentedly highly resolved manner. The amount of data generated, for example in long-term time-lapse microscopy experiments, requires automated methods for processing and analysis. Available software frameworks are well suited for high-throughput processing of fluorescence images, but they often do not perform well on bright field image data that varies considerably between laboratories, setups, and even single experiments. Results: In this contribution, we present a fully automated image processing pipeline that is able to robustly segment and analyze cells with ellipsoid morphology from bright field microscopy in a high-throughput, yet time-efficient manner. The pipeline comprises two steps: (i) image acquisition is adjusted to obtain optimal bright field image quality for automatic processing; (ii) a concatenation of fast image processing algorithms robustly identifies single cells in each image. We applied the method to a time-lapse movie consisting of ~315,000 images of differentiating hematopoietic stem cells over 6 days. We evaluated the accuracy of our method by comparing the number of identified cells with manual counts. Our method is able to segment images with varying cell density and different cell types without parameter adjustment and clearly outperforms a standard approach. By computing population doubling times, we were able to identify three growth phases in the stem cell population throughout the whole movie, and validated our result with cell cycle times from single-cell tracking. Conclusions: Our method allows fully automated processing and analysis of high-throughput bright field microscopy data. The robustness of cell detection and the fast computation time will support the analysis of high-content screening experiments, on-line analysis of time-lapse experiments, as well as the development of methods to automatically track single-cell genealogies.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 13
    Publication Date: 2013-10-05
    Description: Background: For the analysis of spatio-temporal dynamics, various automated processing methods have been developed for nuclei segmentation. These methods tend to be complex for the segmentation of images with crowded nuclei, preventing their simple reapplication to other problems. Thus, it is useful to evaluate the ability of simple methods to segment images with various degrees of nuclear crowding. Results: Here, we selected six simple methods from the various watershed-based and local-maxima-detection-based methods that are frequently used for nuclei segmentation, and evaluated their segmentation accuracy for each developmental stage of Caenorhabditis elegans. We included a 4D noise filter, in addition to 2D and 3D noise filters, as a pre-processing step to evaluate the potential of simple methods as widely as possible. By applying the methods to image data between the 50- and 500-cell developmental stages at 50-cell intervals, the error rate for nuclei detection could be reduced to
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 14
    Publication Date: 2013-10-05
    Description: We briefly identify several critical issues in current computational neuroscience, and present our opinions on potential solutions based on bioimage informatics, especially automated image computing.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 15
    Publication Date: 2013-10-05
    Description: Background: Gene perturbation experiments in combination with fluorescence time-lapse cell imaging are a powerful tool in reverse genetics. High content applications require tools for the automated processing of the large amounts of data. These tools include in general several image processing steps, the extraction of morphological descriptors, and the grouping of cells into phenotype classes according to their descriptors. This phenotyping can be applied in a supervised or an unsupervised manner. Unsupervised methods are suitable for the discovery of formerly unknown phenotypes, which are expected to occur in high-throughput RNAi time-lapse screens. Results: We developed an unsupervised phenotyping approach based on Hidden Markov Models (HMMs) with multivariate Gaussian emissions for the detection of knockdown-specific phenotypes in RNAi time-lapse movies. The automated detection of abnormal cell morphologies allows us to assign a phenotypic fingerprint to each gene knockdown. By applying our method to the Mitocheck database, we show that a phenotypic fingerprint is indicative of a gene's function. Conclusion: Our fully unsupervised HMM-based phenotyping is able to automatically identify cell morphologies that are specific for a certain knockdown. Beyond the identification of genes whose knockdown affects cell morphology, phenotypic fingerprints can be used to find modules of functionally related genes.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
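Record 15 fits HMMs with multivariate Gaussian emissions to per-cell time series of morphological descriptors. A minimal sketch of that model class is shown below using the third-party hmmlearn package; the package choice, feature layout and number of hidden states are assumptions for illustration, not details taken from the paper.

```python
import numpy as np
from hmmlearn import hmm   # assumed dependency; pip install hmmlearn

# Toy data: morphology descriptors (e.g. area, eccentricity, intensity) for two
# tracked cells over time; rows are frames, columns are descriptors.
rng = np.random.default_rng(0)
cell_1 = rng.normal(loc=0.0, scale=1.0, size=(120, 3))
cell_2 = rng.normal(loc=1.0, scale=1.0, size=(80, 3))
X = np.vstack([cell_1, cell_2])
lengths = [len(cell_1), len(cell_2)]      # per-track lengths for the concatenated series

# Fit an HMM with Gaussian emissions; each hidden state is a candidate morphology class.
model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=50, random_state=0)
model.fit(X, lengths)

# The decoded state sequence gives a per-frame morphology label; the fraction of time
# spent in each state is a simple "phenotypic fingerprint" for a condition.
states = model.predict(X, lengths)
fingerprint = np.bincount(states, minlength=3) / len(states)
print("state occupancy:", np.round(fingerprint, 3))
```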
  • 16
    Publication Date: 2013-10-05
    Description: Background: Pattern recognition algorithms are useful in bioimage informatics applications such as quantifying cellular and subcellular objects, annotating gene expression, and classifying phenotypes. To provide effective and efficient image classification and annotation for the ever-increasing volume of microscopic images, it is desirable to have tools that can combine and compare various algorithms and build customizable solutions for different biological problems. However, current tools are often limited in generating user-friendly and extensible solutions for annotating higher-dimensional images that correspond to multiple complicated categories. Results: We developed the BIOimage Classification and Annotation Tool (BIOCAT). It is able to apply pattern recognition algorithms to two- and three-dimensional biological image sets, as well as regions of interest (ROIs) in individual images, for automatic classification and annotation. We also propose a 3D anisotropic wavelet feature extractor for extracting textural features from 3D images with xy-z resolution disparity. The extractor is one of about 20 built-in algorithms for feature extraction, selection and classification in BIOCAT. The algorithms are modularized so that they can be "chained" in a customizable way to form adaptive solutions for various problems, and the plugin-based extensibility gives the tool an open architecture for incorporating future algorithms. We have applied BIOCAT to the classification and annotation of images and ROIs of different properties, with applications in cell biology and neuroscience. Conclusions: BIOCAT provides a user-friendly, portable platform for pattern-recognition-based biological image classification of two- and three-dimensional images and ROIs. We show, via diverse case studies, that different algorithms and their combinations have different suitability for various problems. The customizability of BIOCAT is thus expected to be useful for providing effective and efficient solutions for a variety of biological problems involving image classification and annotation. We also demonstrate the effectiveness of the 3D anisotropic wavelet in classifying both 3D image sets and ROIs.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 17
    Publication Date: 2013-09-18
    Description: Background: Pattern recognition receptors of the immune system play key roles in the regulation of pathways after the recognition of microbial- and danger-associated molecular patterns in vertebrates. Members of the NOD-like receptor (NLR) family typically function intracellularly. The NOD-like receptor family CARD domain containing 5 (NLRC5) is the largest member of this family and also contains the largest number of leucine-rich repeats (LRRs). Due to the lack of crystal structures of full-length NLRs, projects have been initiated with the aim of modeling certain or all members of the family, but systematic studies did not model the full-length NLRC5 because of its unique domain architecture. Our aim was to analyze the LRR sequences of NLRC5 and some NLRC5-related proteins and to build a model of the full-length human NLRC5 by homology modeling. Results: LRR sequences of NLRC5 were aligned and compared with the consensus pattern of the ribonuclease inhibitor protein (RI)-like LRR subfamily. Two types of alternating consensus patterns previously identified for RI repeats were also found in NLRC5. A homology model for full-length human NLRC5 was prepared and, besides the closed conformation of monomeric NLRC5, a heptameric platform was also modeled for the opened-conformation NLRC5 monomers. Conclusions: Identification of consensus patterns of leucine-rich repeat sequences helped to identify LRRs in NLRC5 and to predict their number and position within the protein. In spite of the lack of fully adequate template structures, the presence of an untypical CARD domain and the unusually high number of LRRs in NLRC5, we were able to construct a homology model for both the monomeric and the homo-heptameric full-length human NLRC5 protein.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 18
    Publication Date: 2013-09-18
    Description: Background: Many Single Nucleotide Polymorphism (SNP) calling programs have been developed to identify Single Nucleotide Variations (SNVs) in next-generation sequencing (NGS) data. However, low sequencing coverage presents challenges to accurate SNV identification, especially in single-sample data. Moreover, commonly used SNP calling programs usually include several metrics in their output files for each potential SNP. These metrics are highly correlated in complex patterns, making it extremely difficult to select SNPs for further experimental validation. Results: To explore solutions to the above challenges, we compare the performance of four SNP calling algorithms, SOAPsnp, Atlas-SNP2, SAMtools, and GATK, on a low-coverage single-sample sequencing dataset. Without any post-output filtering, SOAPsnp calls more SNVs than the other programs since it has fewer internal filtering criteria. Atlas-SNP2 has stringent internal filtering criteria and thus reports the smallest number of SNVs. The numbers of SNVs called by GATK and SAMtools fall between those of SOAPsnp and Atlas-SNP2. Moreover, we explore the values of key metrics related to SNV quality in each algorithm and use them as post-output filtering criteria to filter out low-quality SNVs. Under different coverage cutoff values, we compare the four algorithms and calculate the empirical positive calling rate and sensitivity. Our results show that: 1) the overall agreement of the four calling algorithms is low, especially for non-dbSNPs; 2) the agreement of the four algorithms is similar when using different coverage cutoffs, except that the non-dbSNP agreement level tends to increase slightly with increasing coverage; 3) SOAPsnp, SAMtools, and GATK have a higher empirical calling rate for dbSNPs compared to non-dbSNPs; and 4) overall, GATK and Atlas-SNP2 have a relatively higher positive calling rate and sensitivity, but GATK calls more SNVs. Conclusions: Our results show that the agreement between different calling algorithms is relatively low. Thus, more caution should be used in choosing algorithms, setting filtering parameters, and designing validation studies. For reliable SNV calling results, we recommend that users employ more than one algorithm and use metrics related to calling quality and coverage as filtering criteria.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
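Record 18 compares the agreement of four SNP callers. Once each caller's output is reduced to a set of (chromosome, position, alternate allele) tuples, agreement statistics such as those discussed above can be computed with plain set operations; the call sets below are invented and VCF parsing is omitted.

```python
from itertools import combinations

# Variant calls per algorithm as sets of (chrom, pos, alt) tuples (toy data).
calls = {
    "SOAPsnp":    {("chr1", 100, "A"), ("chr1", 250, "T"), ("chr2", 40, "G"), ("chr2", 90, "C")},
    "Atlas-SNP2": {("chr1", 100, "A"), ("chr2", 40, "G")},
    "SAMtools":   {("chr1", 100, "A"), ("chr1", 250, "T"), ("chr2", 40, "G")},
    "GATK":       {("chr1", 100, "A"), ("chr1", 250, "T"), ("chr2", 40, "G"), ("chr3", 5, "T")},
}

# Pairwise Jaccard agreement between callers.
for a, b in combinations(calls, 2):
    inter = calls[a] & calls[b]
    union = calls[a] | calls[b]
    print(f"{a} vs {b}: Jaccard = {len(inter) / len(union):.2f}")

# Consensus call set (variants reported by every algorithm), a common filtering strategy
# when, as the record recommends, more than one caller is used.
consensus = set.intersection(*calls.values())
print("called by all four:", sorted(consensus))
```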
  • 19
    Publication Date: 2013-09-27
    Description: Background: Ontologies and catalogs of gene functions, such as the Gene Ontology (GO) and MIPS-FUN, assume that functional classes are organized hierarchically, that is, general functions include more specific ones. This has recently motivated the development of several machine learning algorithms for gene function prediction that leverage this hierarchical organization, where instances may belong to multiple classes. In addition, it is possible to exploit relationships among examples, since it is plausible that related genes tend to share functional annotations. Although these relationships have been identified and extensively studied in the area of protein-protein interaction (PPI) networks, they have not received much attention in hierarchical and multi-class gene function prediction. Relations between genes introduce autocorrelation in functional annotations and violate the assumption that instances are independently and identically distributed (i.i.d.), which underlies most machine learning algorithms. Although the explicit consideration of these relations brings additional complexity to the learning process, we expect substantial benefits in the predictive accuracy of the learned classifiers. Results: This article demonstrates the benefits (in terms of predictive accuracy) of considering autocorrelation in multi-class gene function prediction. We develop a tree-based algorithm for considering network autocorrelation in the setting of Hierarchical Multi-label Classification (HMC). We empirically evaluate the proposed algorithm, called NHMC (Network Hierarchical Multi-label Classification), on 12 yeast datasets using each of the MIPS-FUN and GO annotation schemes and exploiting two different PPI networks. The results clearly show that taking autocorrelation into account improves the predictive performance of the learned models for predicting gene function. Conclusions: Our newly developed method for HMC takes network information into account in the learning phase. When used for gene function prediction in the context of PPI networks, the explicit consideration of network autocorrelation increases the predictive performance of the learned models. Overall, we found that this holds for different gene features/descriptions, functional annotation schemes, and PPI networks: the best results are achieved when the PPI network is dense and contains a large proportion of function-relevant interactions.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 20
    Publication Date: 2013-10-01
    Description: Background: Quality assessment and continuous quality feedback to the staff is crucial for safety and efficiency of teleconsultation and triage. This study evaluates whether it is feasible to use an already existing telephone triage protocol to assess the appropriateness of point-of-care and time-to-treat recommendations after teleconsultations. Methods: Based on electronic patient records, we retrospectively compared the point-of-care and time-to-treat recommendations of the paediatric telephone triage protocol with the actual recommendations of trained physicians for children with abdominal pain, following a teleconsultation. Results: In 59 of 96 cases (61%) these recommendations were congruent with the paediatric telephone protocol. Discrepancies were either of organizational nature, due to factors such as local referral policies or gatekeeping insurance models, or of medical origin, such as milder than usual symptoms or clear diagnosis of a minor ailment. Conclusions: A paediatric telephone triage protocol may be applicable in healthcare systems other than the one in which it has been developed, if triage rules are adapted to match the organisational aspects of the local healthcare system.
    Electronic ISSN: 1472-6947
    Topics: Computer Science , Medicine
    Published by BioMed Central
  • 21
    Publication Date: 2013-10-02
    Description: Background: Protein-protein docking, which aims to predict the structure of a protein-protein complex from its unbound components, remains an unresolved challenge in structural bioinformatics. An important step is the ranking of docked poses using a scoring function, for which many methods have been developed. There is a need to explore the differences and commonalities of these methods with each other, as well as with functions developed in the fields of molecular dynamics and homology modelling. Results: We present an evaluation of 115 scoring functions on an unbound docking decoy benchmark covering 118 complexes for which a near-native solution can be found, yielding top 10 success rates of up to 58%. Hierarchical clustering is performed, so as to group together functions which identify near-natives in similar subsets of complexes. Three set theoretic approaches are used to identify pairs of scoring functions capable of correctly scoring different complexes. This shows that functions in different clusters capture different aspects of binding and are likely to work together synergistically. Conclusions: All functions designed specifically for docking perform well, indicating that functions are transferable between sampling methods. We also identify promising methods from the field of homology modelling. Further, differential success rates by docking difficulty and solution quality suggest a need for flexibility-dependent scoring. Investigating pairs of scoring functions, the set theoretic measures identify known scoring strategies as well as a number of novel approaches, indicating promising augmentations of traditional scoring methods. Such augmentation and parameter combination strategies are discussed in the context of the learning-to-rank paradigm.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
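Record 21 uses set-theoretic measures to find pairs of docking scoring functions that succeed on different complexes and would therefore combine well. A minimal sketch of that idea on invented success sets (the actual benchmark has 115 functions and 118 complexes):

```python
from itertools import combinations

N_COMPLEXES = 10   # size of the toy benchmark

# For each scoring function, the set of benchmark complexes on which it places a
# near-native pose in its top 10 (toy data only).
success = {
    "func_A": {0, 1, 2, 3, 5},
    "func_B": {0, 1, 2, 3, 4},
    "func_C": {5, 6, 7, 8},
    "func_D": {0, 2, 4, 6, 8},
}

# Rank pairs by the success rate of their union: pairs that solve *different*
# complexes are the most promising candidates for combination.
pairs = []
for a, b in combinations(success, 2):
    union_rate = len(success[a] | success[b]) / N_COMPLEXES
    overlap = len(success[a] & success[b])
    pairs.append((union_rate, overlap, a, b))

for union_rate, overlap, a, b in sorted(pairs, reverse=True):
    print(f"{a}+{b}: combined success {union_rate:.0%}, shared successes {overlap}")
```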
  • 22
    Publication Date: 2013-10-03
    Description: Background: A number of different statistics are used for detecting natural selection using DNA sequencing data, including statistics that are summaries of the frequency spectrum, such as Tajima's D. These statistics are now often being applied in the analysis of Next Generation Sequencing (NGS) data. However, estimates of frequency spectra from NGS data are strongly affected by low sequencing coverage; the inherent technology dependent variation in sequencing depth causes systematic differences in the value of the statistic among genomic regions. Results: We have developed an approach that accommodates the uncertainty of the data when calculating site frequency based neutrality test statistics. A salient feature of this approach is that it implicitly solves the problems of varying sequencing depth, missing data and avoids the need to infer variable sites for the analysis and thereby avoids ascertainment problems introduced by a SNP discovery process. Conclusion: Using an empirical Bayes approach for fast computations, we show that this method produces results for low-coverage NGS data comparable to those achieved when the genotypes are known without uncertainty. We also validate the method in an analysis of data from the 1000 genomes project. The method is implemented in a fast framework which enables researchers to perform these neutrality tests on a genome-wide scale.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
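Record 22 concerns frequency-spectrum neutrality statistics such as Tajima's D computed from low-coverage NGS data. The sketch below computes only the standard, genotypes-known form of Tajima's D from an unfolded site frequency spectrum, following Tajima's 1989 formulas; the paper's empirical Bayes handling of genotype uncertainty is not reproduced, and the example spectrum is invented.

```python
import numpy as np

def tajimas_d(sfs: np.ndarray) -> float:
    """Standard Tajima's D from an unfolded site frequency spectrum.
    sfs[i-1] = number of segregating sites where the derived allele occurs in i
    of the n sampled chromosomes, for i = 1 .. n-1."""
    n = len(sfs) + 1                      # number of sampled chromosomes
    i = np.arange(1, n)
    S = sfs.sum()                         # total number of segregating sites
    if S == 0:
        return 0.0
    # Average pairwise difference (pi) and Watterson's estimator (theta_W).
    pi = np.sum(i * (n - i) * sfs) * 2.0 / (n * (n - 1))
    a1 = np.sum(1.0 / i)
    a2 = np.sum(1.0 / i**2)
    theta_w = S / a1
    # Constants for the variance of (pi - theta_W), following Tajima (1989).
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n**2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1, e2 = c1 / a1, c2 / (a1**2 + a2)
    return float((pi - theta_w) / np.sqrt(e1 * S + e2 * S * (S - 1)))

if __name__ == "__main__":
    sfs = np.array([12, 6, 4, 3, 2, 2, 1, 1, 1])   # toy spectrum for n = 10 chromosomes
    print(f"Tajima's D = {tajimas_d(sfs):.3f}")
```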
  • 23
    Publication Date: 2013-10-03
    Description: Background: Dendritic spines serve as key computational structures in brain plasticity. Much remains to be learned about their spatial and temporal distribution among neurons. Our aim in this study was to perform exploratory analyses based on the population distributions of dendritic spines with regard to their morphological characteristics and period of growth in dissociated hippocampal neurons. We fit a loglinear model to the contingency table of spine features such as spine type and distance from the soma to first determine which features were important in modeling the spines, as well as the relationships between such features. A multinomial logistic regression was then used to predict the spine types using the features suggested by the log-linear model, along with neighboring spine information. Finally, an important variant of Ripley's K-function applicable to linear networks was used to study the spatial distribution of spines along dendrites. Results: Our study indicated that in the culture system, (i) dendritic spine densities were "completely spatially random", (ii) spine type and distance from the soma were independent quantities, and most importantly, (iii) spines had a tendency to cluster with other spines of the same type. Conclusions: Although these results may vary with other systems, our primary contribution is the set of statistical tools for morphological modeling of spines which can be used to assess neuronal cultures following gene manipulation such as RNAi, and to study induced pluripotent stem cells differentiated to neurons.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
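Record 23 applies a variant of Ripley's K-function for linear networks to dendritic spine positions. The sketch below is a heavily simplified stand-in: it estimates a naive 1D K-function along a single unbranched dendrite, with no edge correction and no network geometry, using invented spine positions.

```python
import numpy as np

def linear_k(positions: np.ndarray, length: float, radii: np.ndarray) -> np.ndarray:
    """Naive 1D Ripley's K along a segment of given length (no edge correction).
    K(r) is the expected number of other spines within distance r of a typical
    spine, divided by the overall intensity; for complete spatial randomness
    on a line, K(r) is approximately 2 * r."""
    n = len(positions)
    dists = np.abs(positions[:, None] - positions[None, :])
    np.fill_diagonal(dists, np.inf)
    intensity = n / length
    return np.array([(dists <= r).sum() / (n * intensity) for r in radii])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    length = 100.0                                   # dendrite length (micrometres)
    random_spines = rng.uniform(0, length, size=60)  # CSR-like pattern
    clustered = np.concatenate([rng.normal(m, 1.0, size=15) for m in (20, 45, 70, 90)])
    radii = np.linspace(0.5, 10, 20)
    for name, pos in [("random", random_spines), ("clustered", clustered)]:
        k = linear_k(pos, length, radii)
        print(name, "K(5) =", round(float(np.interp(5.0, radii, k)), 2), "(CSR expectation ~ 10)")
```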
  • 24
    Publication Date: 2013-10-03
    Description: Background: Physician notes routinely recorded during patient care represent a vast and underutilized resource for human disease studies on a population scale. Their use in research is primarily limited by the need to separate confidential patient information from clinical annotations, a process that is resource-intensive when performed manually. This study seeks to create an automated method for de-identifying physician notes that does not require large amounts of private information: in addition to training a model to recognize Protected Health Information (PHI) within private physician notes, we reverse the problem and train a model to recognize non-PHI words and phrases that appear in public medical texts. Methods: Public and private medical text sources were analyzed to distinguish common medical words and phrases from Protected Health Information. Patient identifiers are generally nouns and numbers that appear infrequently in medical literature. To quantify this relationship, term frequencies and part of speech tags were compared between journal publications and physician notes. Standard medical concepts and phrases were then examined across ten medical dictionaries. Lists and rules were included from the US census database and previously published studies. In total, 28 features were used to train decision tree classifiers. Results: The model successfully recalled 98% of PHI tokens from 220 discharge summaries. Cost sensitive classification was used to weight recall over precision (98% F10 score, 76% F1 score). More than half of the false negatives were the word "of" appearing in a hospital name. All patient names, phone numbers, and home addresses were at least partially redacted. Medical concepts such as "elevated white blood cell count" were informative for de-identification. The results exceed the previously approved criteria established by four Institutional Review Boards. Conclusions: The results indicate that distributional differences between private and public medical text can be used to accurately classify PHI. The data and algorithms reported here are made freely available for evaluation and improvement.
    Electronic ISSN: 1472-6947
    Topics: Computer Science , Medicine
    Published by BioMed Central
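Record 24 de-identifies physician notes partly by exploiting the fact that patient identifiers are rare in public medical literature while clinical vocabulary is common. The toy sketch below illustrates that single frequency-ratio feature; the published system combines 28 features in decision tree classifiers, and both corpora here are invented snippets.

```python
import re
from collections import Counter

def term_freqs(text: str) -> Counter:
    """Relative frequency of each lowercase alphabetic token in a text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return Counter({t: c / total for t, c in counts.items()})

# Public medical text (journal-style) vs. a physician-note-style snippet (both invented).
public = term_freqs("elevated white blood cell count was observed the patient was febrile "
                    "white blood cell count normalized after antibiotics were administered")
note = term_freqs("mr jonesworth seen today elevated white blood cell count noted "
                  "call 555 0142 if febrile again")

# Tokens that occur in the note but are absent from the public corpus are PHI candidates.
for token, f_note in note.most_common():
    f_public = public.get(token, 0.0)
    flag = "PHI?" if f_public == 0.0 else "common"
    print(f"{token:12s} note_freq={f_note:.3f} public_freq={f_public:.3f} -> {flag}")
```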
  • 25
    Publication Date: 2013-10-05
    Description: Background: Molecular recognition features (MoRFs) are short binding regions located within longer intrinsically disordered protein regions. Although these short regions lack a stable structure in the natural state, they readily undergo disorder-to-order transitions upon binding to their partner molecules. MoRFs play critical roles in the molecular interaction network of a cell and are associated with many human genetic diseases. Therefore, identification of MoRFs is an important step in understanding functional aspects of these proteins and in finding applications in drug design. Results: Here, we propose a novel method for identifying MoRFs, named MFSPSSMpred (Masked, Filtered and Smoothed Position-Specific Scoring Matrix-based Predictor). Firstly, a masking method is used to calculate the average local conservation scores of residues within a masking-window length in the position-specific scoring matrix (PSSM). Then, the scores below the average are filtered out. Finally, a smoothing method is used to incorporate the features of flanking regions for each residue to prepare the feature sets for prediction. Our method employs no predicted results from other classifiers as input; that is, all features used in this method are extracted from the PSSM of the sequence only. Experimental results show that, compared with other methods tested on the same datasets, our method achieves the best performance, with an AUC 0.004-0.079 higher than other methods on TEST419 and 0.045-0.212 higher on TEST2012. In addition, when tested on an independent membrane protein-related dataset, MFSPSSMpred significantly outperformed the existing predictor MoRFpred. Conclusions: This study suggests that: 1) amino acid composition and physicochemical properties in the flanking regions of MoRFs are very different from those in general non-MoRF regions; 2) MoRFs contain both highly conserved residues and highly variable residues and, on the whole, are highly locally conserved; and 3) combining contextual information with local conservation information of residues facilitates the prediction of MoRFs.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
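Record 25 describes its feature preparation as three operations on per-residue PSSM conservation scores: masking (local window averaging), filtering out below-average scores, and smoothing over flanking residues. A minimal numpy sketch of those three steps on a toy score vector follows; the window sizes and the test data are assumptions, not the published parameters.

```python
import numpy as np

def mask_filter_smooth(scores: np.ndarray, mask_win: int = 5, smooth_win: int = 7) -> np.ndarray:
    """Toy version of a masked/filtered/smoothed per-residue feature preparation."""
    # 1) Masking: replace each residue's score by the average within a local window.
    kernel_m = np.ones(mask_win) / mask_win
    masked = np.convolve(scores, kernel_m, mode="same")
    # 2) Filtering: zero out residues whose masked score falls below the sequence average.
    filtered = np.where(masked >= masked.mean(), masked, 0.0)
    # 3) Smoothing: incorporate flanking-region context with a second moving average.
    kernel_s = np.ones(smooth_win) / smooth_win
    return np.convolve(filtered, kernel_s, mode="same")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy per-residue conservation scores for a 60-residue disordered region,
    # with a more conserved stretch (a MoRF-like segment) in the middle.
    scores = rng.normal(0.0, 0.3, size=60)
    scores[25:35] += 1.0
    features = mask_filter_smooth(scores)
    print("peak feature position:", int(features.argmax()))   # expected near residue 30
```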
  • 26
    Publication Date: 2013-10-05
    Description: Background: Comprehensive protein-protein interaction (PPI) maps are a powerful resource for uncovering the molecular basis of genetic interactions and providing mechanistic insights. Over the past decade, high-throughput experimental techniques have been developed to generate PPI maps at proteome scale, first using yeast two-hybrid approaches and more recently via affinity purification combined with mass spectrometry (AP-MS). Unfortunately, data from both protocols are prone to both high false positive and high false negative rates. To address these issues, many methods have been developed to post-process raw PPI data. However, with few exceptions, these methods only analyze binary experimental data (in which each potential interaction tested is deemed either observed or unobserved), neglecting quantitative information available from AP-MS such as spectral counts. Results: We propose a novel method for incorporating quantitative information from AP-MS data into existing PPI inference methods that analyze binary interaction data. Our approach introduces a probabilistic framework that models the statistical noise inherent in observations of co-purifications. Using a sampling-based approach, we model the uncertainty of interactions with low spectral counts by generating an ensemble of possible alternative experimental outcomes. We then apply the existing method of choice to each alternative outcome and aggregate results over the ensemble. We validate our approach on three recent AP-MS data sets and demonstrate performance comparable to or better than state-of-the-art methods. Additionally, we provide an in-depth discussion comparing the theoretical bases of existing approaches and identify common aspects that may be key to their performance. Conclusions: Our sampling framework extends the existing body of work on PPI analysis using binary interaction data to the richer quantitative data now commonly available through AP-MS assays. This framework is quite general, and many enhancements are likely possible. Fruitful future directions may include investigating more sophisticated schemes for converting spectral counts to probabilities and applying the framework to direct protein complex prediction methods.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 27
    Publication Date: 2013-10-05
    Description: Background: Gene expression in the Drosophila embryo is controlled by functional interactions between a large network of protein transcription factors (TFs) and specific sequences in DNA cis-regulatory modules (CRMs). The binding site sequences for any TF can be experimentally determined and represented in a position weight matrix (PWM). PWMs can then be used to predict the location of TF binding sites in other regions of the genome, although there are limitations to this approach as currently implemented. Results: In this proof-of-principle study, we analyze 127 CRMs and focus on four TFs that control transcription of target genes along the anterio-posterior axis of the embryo early in development. For all four of these TFs, there is some degree of conserved flanking sequence that extends beyond the predicted binding regions. A potential role for these conserved flanking sequences may be to enhance the specificity of TF binding, as the abundance of these sequences is greatly diminished when we examine only predicted high-affinity binding sites. Conclusions: Expanding PWMs to include sequence context-dependence will increase the information content in PWMs and facilitate a more efficient functional identification and dissection of CRMs.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
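Record 27 works with position weight matrices built from known transcription factor binding sites. The sketch below shows the standard PWM construction and log-odds scanning step on invented sites and sequence; the paper's analysis of conserved flanking sequence is not included.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=0.5, background=0.25):
    """Log-odds PWM from aligned binding sites of equal length."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        row = {}
        for base in BASES:
            freq = (column.count(base) + pseudocount) / (len(sites) + 4 * pseudocount)
            row[base] = math.log2(freq / background)
        pwm.append(row)
    return pwm

def scan(sequence, pwm):
    """Score every window of the sequence; higher scores indicate better PWM matches."""
    width = len(pwm)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        yield start, window, score

if __name__ == "__main__":
    sites = ["TTATGA", "TTATGC", "TAATGA", "TTATTA"]          # aligned example sites
    pwm = build_pwm(sites)
    sequence = "CGCTTATGACGGATCCTAATGATT"
    best = sorted(scan(sequence, pwm), key=lambda x: -x[2])[:3]
    for start, window, score in best:
        print(f"pos {start:2d} {window} score {score:.2f}")
```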
  • 28
    Publication Date: 2013-10-05
    Description: Background: Segmenting electron microscopy (EM) images of cellular and subcellular processes in the nervous system is a key step in many bioimaging pipelines involving classification and labeling of ultrastructures. However, fully automated techniques to segment images are often susceptible to noise and heterogeneity in EM images (e.g. different histological preparations, different organisms, different brain regions, etc.). Supervised techniques to address this problem are often helpful but require large sets of training data, which are often difficult to obtain in practice, especially across many conditions. Results: We propose a new, principled unsupervised algorithm to segment EM images using a two-step approach: edge detection via salient watersheds followed by robust region merging. We performed experiments to gather EM neuroimages of two organisms (mouse and fruit fly) using different histological preparations and generated manually curated ground-truth segmentations. We compared our algorithm against several state-of-the-art unsupervised segmentation algorithms and found superior performance using two standard measures of under- and over-segmentation error. Conclusions: Our algorithm is general and may be applicable to other large-scale segmentation problems for bioimages.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 29
    Publication Date: 2013-10-05
    Description: Background: As adolescents with hemophilia approach adulthood, they are expected to assume responsibility for their disease management. A bilingual (English and French) Internet-based self-management program, "Teens Taking Charge: Managing Hemophilia Online," was developed to support adolescents with hemophilia in this transition. This study explored the usability of the website and resulted in refinement of the prototype. Methods: A purposive sample (n=18; age 13--18; mean age 15.5 years) was recruited from two tertiary care centers to assess the usability of the program in English and French. Qualitative observations using a "think aloud" usability testing method and semi-structured interviews were conducted in four iterative cycles, with changes to the prototype made as necessary following each cycle. This study was approved by research ethics boards at each site. Results: Teens responded positively to the content and appearance of the website and felt that it was easy to navigate and understand. The multimedia components (videos, animations, quizzes) were felt to enrich the experience. Changes to the presentation of content and the website user-interface were made after the first, second and third cycles of testing in English. Cycle four did not result in any further changes. Conclusions: Overall, teens found the website to be easy to use. Usability testing identified end-user concerns that informed improvements to the program. Usability testing is a crucial step in the development of Internet-based self-management programs to ensure information is delivered in a manner that is accessible and understood by users.
    Electronic ISSN: 1472-6947
    Topics: Computer Science , Medicine
    Published by BioMed Central
  • 30
    Publication Date: 2013-06-07
    Description: Background: The use of the knowledge produced by the sciences to promote human health is the main goal of translational medicine. To make this feasible, we need computational methods to handle the large amount of information that arises from bench to bedside and to deal with its heterogeneity. A computational challenge that must be faced is to promote the integration of clinical, socio-demographic and biological data. In this effort, ontologies play an essential role as a powerful artifact for knowledge representation. Chado is a modular ontology-oriented database model that has gained popularity due to its robustness and flexibility as a generic platform to store biological data; however, it lacks support for representing clinical and socio-demographic information. Results: We have implemented an extension of Chado, the Clinical Module, to allow the representation of this kind of information. Our approach consists of a framework for data integration through the use of a common reference ontology. The design of this framework has four levels: the data level, to store the data; the semantic level, to integrate and standardize the data by the use of ontologies; the application level, to manage clinical databases, ontologies and the data integration process; and the web interface level, to allow interaction between the user and the system. The Clinical Module was built based on the Entity-Attribute-Value (EAV) model. We also proposed a methodology to migrate data from legacy clinical databases to the integrative framework. A Chado instance was initialized using a relational database management system. The Clinical Module was implemented and the framework was loaded using data from a factual clinical research database. Clinical and demographic data, as well as biomaterial data, were obtained from patients with tumors of the head and neck. We implemented the IPTrans tool, a complete environment for data migration, which comprises: the construction of a model to describe the legacy clinical data, based on an ontology; the Extraction, Transformation and Load (ETL) process to extract the data from the source clinical database and load it into the Clinical Module of Chado; and the development of a web tool and a Bridge Layer to adapt the web tool to Chado, as well as other applications. Conclusions: Open-source computational solutions currently available for translational science do not have a model to represent biomolecular information and are not integrated with existing bioinformatics tools. On the other hand, existing genomic data models do not represent clinical patient data. A framework was developed to support translational research by integrating biomolecular information coming from different "omics" technologies with patients' clinical and socio-demographic data. This framework should present some features: flexibility, compression and robustness. The experiments conducted on a use case demonstrated that the proposed system meets the requirements of flexibility and robustness, leading to the desired integration. The Clinical Module can be accessed at http://dcm.ffclrp.usp.br/caib/pg=iptrans.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 31
    Publication Date: 2013-06-07
    Description: Background: A large-scale, highly accurate, machine-understandable drug-disease treatment relationship knowledge base is important for computational approaches to drug repurposing. The large body of published biomedical research articles and clinical case reports available on MEDLINE is a rich source of FDA-approved drug-disease indications as well as drug-repurposing knowledge that is crucial for applying FDA-approved drugs to new diseases. However, much of this information is buried in free text and not captured in any existing databases. The goal of this study is to extract a large number of accurate drug-disease treatment pairs from the published literature. Results: In this study, we developed a simple but highly accurate pattern-learning approach to extract treatment-specific drug-disease pairs from 20 million biomedical abstracts available on MEDLINE. We extracted a total of 34,305 unique drug-disease treatment pairs, the majority of which are not included in existing structured databases. Our algorithm achieved a precision of 0.904 and a recall of 0.131 in extracting all pairs, and a precision of 0.904 and a recall of 0.842 in extracting frequent pairs. In addition, we have shown that the extracted pairs strongly correlate with both drug target genes and therapeutic classes, and therefore may have high potential in drug discovery. Conclusions: We demonstrated that our simple pattern-learning relationship extraction algorithm is able to accurately extract many drug-disease pairs from the free text of the biomedical literature that are not captured in structured databases. The large-scale, accurate, machine-understandable drug-disease treatment knowledge base resulting from our study, in combination with pairs from structured databases, will have high potential in computational drug repurposing tasks.
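    As a rough illustration of what pattern-based extraction of treatment pairs looks like, the sketch below matches a couple of hand-written textual patterns against toy sentences. The published method learns its patterns from MEDLINE and normalizes the extracted entities; the patterns and sentences here are hypothetical.

```python
import re

# Hand-written, naive treatment patterns for illustration only; the paper's
# approach learns its patterns from the MEDLINE corpus.
PATTERNS = [
    re.compile(r"(?P<drug>\w[\w\-]*) is used to treat (?P<disease>\w[\w\- ]*)", re.IGNORECASE),
    re.compile(r"(?P<drug>\w[\w\-]*) (?:in|for) the treatment of (?P<disease>\w[\w\- ]*)", re.IGNORECASE),
]

def extract_pairs(sentences):
    """Return the set of (drug, disease) pairs matched by any pattern."""
    pairs = set()
    for sentence in sentences:
        for pattern in PATTERNS:
            for match in pattern.finditer(sentence):
                pairs.add((match.group("drug").strip().lower(),
                           match.group("disease").strip().lower()))
    return pairs

abstracts = [
    "Metformin is used to treat hyperglycemia.",
    "We report the use of imatinib in the treatment of chronic myeloid leukemia.",
]
print(extract_pairs(abstracts))
```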
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 32
    Publication Date: 2013-06-08
    Description: Background: Normal Mode Analysis is one of the most successful techniques for studying motions in proteins and macromolecules. It can provide information on the mechanisms of protein functions, be used to aid crystallography and NMR data reconstruction, and calculate protein free energies. Results: ΔΔPT is a toolbox allowing the calculation of elastic network models and principal component analysis. It allows the analysis of PDB files or trajectories taken from Gromacs, Amber, and DL_POLY. As well as calculating the normal modes, it also allows comparison of the modes with experimental protein motion, variation of the modes with mutation or ligand binding, and calculation of molecular dynamics entropies. Conclusions: This toolbox makes the respective tools available to a wide community of potential NMA users, and gives them an unrivalled ability to analyse normal modes using a variety of techniques and current software.
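    The elastic-network calculation underlying such a toolbox can be sketched in a few lines of NumPy: build an anisotropic network model (ANM) Hessian from Cα coordinates and diagonalize it to obtain the normal modes. This is a generic ANM sketch on toy coordinates, not the toolbox's own implementation; the cutoff and spring constant are arbitrary.

```python
import numpy as np

def anm_hessian(coords, cutoff=15.0, gamma=1.0):
    """Build the 3N x 3N Hessian of an anisotropic elastic network model."""
    n = len(coords)
    hessian = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = coords[j] - coords[i]
            dist2 = d @ d
            if dist2 > cutoff ** 2:
                continue
            block = -gamma * np.outer(d, d) / dist2
            hessian[3*i:3*i+3, 3*j:3*j+3] = block
            hessian[3*j:3*j+3, 3*i:3*i+3] = block
            hessian[3*i:3*i+3, 3*i:3*i+3] -= block
            hessian[3*j:3*j+3, 3*j:3*j+3] -= block
    return hessian

# Toy C-alpha coordinates; a real run would parse them from a PDB file.
coords = np.random.default_rng(0).normal(scale=5.0, size=(20, 3))
eigvals, eigvecs = np.linalg.eigh(anm_hessian(coords))
# The six near-zero eigenvalues correspond to rigid-body motions; the next
# ones are the lowest-frequency internal normal modes.
print("lowest internal-mode eigenvalues:", np.round(eigvals[6:9], 4))
```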
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 33
    Publication Date: 2013-06-08
    Description: Background: Data visualization is critical for interpreting biological data. However, in practice it can prove to be a bottleneck for non-trained researchers; this is especially true for three-dimensional (3D) data representation. Whilst existing software can provide all necessary functionalities to represent and manipulate biological 3D datasets, very few are easily accessible (browser based), cross-platform and accessible to non-expert users. Results: An online HTML5/WebGL based 3D visualisation tool has been developed to allow biologists to quickly and easily view interactive and customizable three-dimensional representations of their data along with multiple layers of information. Using the WebGL library Three.js written in Javascript, bioWeb3D allows the simultaneous visualisation of multiple large datasets inputted via a simple JSON, XML or CSV file, which can be read and analysed locally thanks to HTML5 capabilities. Conclusions: Using basic 3D representation techniques in a technologically innovative context, we provide a program that is not intended to compete with professional 3D representation software, but that instead enables a quick and intuitive representation of reasonably large 3D datasets.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 34
    Publication Date: 2013-06-08
    Description: Background: Graph-based notions are increasingly used in biomedical data mining and knowledge discovery tasks. In this paper, we present a clique-clustering method to automatically summarize graphs of semantic predications produced from PubMed citations (titles and abstracts). Results: SemRep is used to extract semantic predications from the citations returned by a PubMed search. Cliques were identified from frequently occurring predications with highly connected arguments filtered by degree centrality. Themes contained in the summary were identified with a hierarchical clustering algorithm based on common arguments shared among cliques. The validity of the clusters in the summaries produced was compared to the Silhouette-generated baseline for cohesion, separation and overall validity. The theme labels were also compared to a reference standard produced with major MeSH headings. Conclusions: For 11 topics in the testing data set, the overall validity of clusters from the system summary was 10% better than the baseline (43% versus 33%). When compared to the reference standard from MeSH headings, recall, precision and F-score were 0.64, 0.65, and 0.65, respectively.
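    A minimal sketch of the clique-and-cluster idea is given below: build a graph from predication arguments, keep highly connected arguments via degree centrality, enumerate maximal cliques, and group cliques that share arguments with hierarchical clustering. The toy predications, thresholds and distance measure are illustrative only and do not reproduce the published pipeline.

```python
import itertools
import networkx as nx
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical predications (subject, predicate, object); the real system
# extracts these from PubMed citations with SemRep.
predications = [
    ("metformin", "TREATS", "diabetes"), ("metformin", "AFFECTS", "insulin"),
    ("insulin", "INTERACTS_WITH", "glucose"), ("diabetes", "ASSOCIATED_WITH", "obesity"),
    ("statins", "TREATS", "hyperlipidemia"), ("statins", "AFFECTS", "cholesterol"),
    ("cholesterol", "ASSOCIATED_WITH", "hyperlipidemia"), ("diabetes", "ASSOCIATED_WITH", "glucose"),
]

graph = nx.Graph()
graph.add_edges_from((s, o) for s, _, o in predications)

# Keep only highly connected arguments (degree-centrality filter).
centrality = nx.degree_centrality(graph)
core = graph.subgraph(n for n, c in centrality.items() if c >= 0.2)

cliques = [set(c) for c in nx.find_cliques(core) if len(c) >= 2]

# Cluster cliques hierarchically by the arguments they share (Jaccard distance).
def jaccard_distance(a, b):
    return 1.0 - len(a & b) / len(a | b)

dist = [jaccard_distance(a, b) for a, b in itertools.combinations(cliques, 2)]
labels = fcluster(linkage(np.array(dist), method="average"), t=0.8, criterion="distance")
for clique, label in zip(cliques, labels):
    print(label, sorted(clique))
```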
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 35
    Publication Date: 2013-06-11
    Description: Background: Somatic mutation calling based on DNA from matched tumor-normal patient samples is one of the key tasks carried out by many cancer genome projects. One such large-scale project is The Cancer Genome Atlas (TCGA), which is now routinely compiling catalogs of somatic mutations from hundreds of paired tumor-normal DNA exome-sequence data. Nonetheless, mutation calling is still very challenging. TCGA benchmark studies revealed that even relatively recent mutation callers from major centers showed substantial discrepancies. Evaluating the mutation callers or understanding the sources of the discrepancies is not straightforward, since for most tumor studies validation data based on independent whole-exome DNA sequencing are not available; only partial validation data exist for a selected (ascertained) subset of sites. Results: To provide guidelines for comparing outputs from multiple callers, we have analyzed two sets of mutation-calling data from the TCGA benchmark studies and their partial validation data. Various aspects of the mutation-calling outputs were explored to characterize the discrepancies in detail. To assess the performance of multiple callers, we introduce four approaches utilizing external sequence data to varying degrees, ranging from independent DNA-seq pairs, through RNA-seq for tumor samples only and the original exome-seq pairs only, to none of those. Conclusions: Our analyses provide guidelines for visualizing and understanding the discrepancies among the outputs from multiple callers. Furthermore, applying the four evaluation approaches to the whole exome data, we illustrate the challenges and highlight the various circumstances that require extra caution in assessing the performance of multiple callers.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 36
    Publication Date: 2013-06-11
    Description: Background: ChIPx (i.e., ChIP-seq and ChIP-chip) is increasingly used to map genome-wide transcription factor (TF) binding sites. A single ChIPx experiment can identify thousands of TF bound genes, but typically only a fraction of these genes are functional targets that respond transcriptionally to perturbations of TF expression. To identify promising functional target genes for follow-up studies, researchers usually collect gene expression data from TF perturbation experiments to determine which of the TF targets respond transcriptionally to binding. Unfortunately, approximately 40% of ChIPx studies do not have accompanying gene expression data from TF perturbation experiments. For these studies, genes are often prioritized solely based on the binding strengths of ChIPx signals in order to choose follow-up candidates. ChIPXpress is a novel method that improves upon this ChIPx-only ranking approach by integrating ChIPx data with large amounts of Publicly available gene Expression Data (PED). Results: We demonstrate that PED does contain useful information to identify functional TF target genes despite its inherent heterogeneity. A truncated absolute correlation measure is developed to better capture the regulatory relationships between TFs and their target genes in PED. By integrating the information from ChIPx and PED, ChIPXpress can significantly increase the chance of finding functional target genes responsive to TF perturbation among the top ranked genes. ChIPXpress is implemented as an easy-to-use R/Bioconductor package. We evaluate ChIPXpress using 10 different ChIPx datasets in mouse and human and find that ChIPXpress rankings are more accurate than rankings based solely on ChIPx data and may result in substantial improvement in prediction accuracy, irrespective of which peak calling algorithm is used to analyze the ChIPx data. Conclusions: ChIPXpress provides a new tool to better prioritize TF bound genes from ChIPx experiments for follow-up studies when investigators do not have their own gene expression data. It demonstrates that the regulatory information from PED can be used to boost ChIPx data analyses. It also represents an important step towards more fully utilizing the valuable, but highly heterogeneous data contained in public gene expression databases.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 37
    Publication Date: 2013-06-06
    Description: Background: Within the field of record linkage, numerous data cleaning and standardisation techniques are employed to ensure the highest quality of links. While these facilities are common in record linkage software packages and are regularly deployed across record linkage units, little work has been published demonstrating the impact of data cleaning on linkage quality. Methods: A range of cleaning techniques was applied to both a synthetically generated dataset and a large administrative dataset previously linked to a high standard. The effect of these changes on linkage quality was investigated using pairwise F-measure to determine quality. Results: Data cleaning made little difference to the overall linkage quality, with heavy cleaning leading to a decrease in quality. Further examination showed that decreases in linkage quality were due to cleaning techniques typically reducing the variability -- although correct records were now more likely to match, incorrect records were also more likely to match, and these incorrect matches outweighed the correct matches, reducing quality overall. Conclusion: Data cleaning techniques have minimal effect on linkage quality. Care should be taken during the data cleaning process.
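    The pairwise F-measure used to score linkage quality can be computed directly from the sets of predicted and true record pairs, as in the sketch below; the toy link sets merely illustrate how aggressive cleaning can add both correct and incorrect matches, with the incorrect ones dominating.

```python
def pairwise_f_measure(predicted_links, true_links):
    """Precision, recall and F-measure over record pairs, the metric used to score linkage runs."""
    predicted, truth = set(predicted_links), set(true_links)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Hypothetical record-pair links before and after an aggressive cleaning step:
# cleaning recovers the missed true pair but introduces several false matches.
truth = {(1, 101), (2, 102), (3, 103), (4, 104)}
links_raw = {(1, 101), (2, 102), (3, 103), (5, 105)}
links_cleaned = {(1, 101), (2, 102), (3, 103), (4, 104),
                 (5, 105), (6, 104), (7, 102), (8, 101)}
for name, links in [("raw", links_raw), ("heavily cleaned", links_cleaned)]:
    p, r, f = pairwise_f_measure(links, truth)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F={f:.2f}")
```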
    Electronic ISSN: 1472-6947
    Topics: Computer Science , Medicine
    Published by BioMed Central
  • 38
    Publication Date: 2013-06-07
    Description: Background: Detailed study of genetic variation at the population level in humans and other species is now possible due to the availability of large sets of single nucleotide polymorphism data. Alleles at two or more loci are said to be in linkage disequilibrium (LD) when they are correlated or statistically dependent. Current efforts to understand the genetic basis of complex phenotypes are based on the existence of such associations, making study of the extent and distribution of linkage disequilibrium central to this endeavour. The objective of this paper is to develop methods to study fine-scale patterns of allelic association using probabilistic graphical models. Results: An efficient, linear-time forward-backward algorithm is developed to estimate chromosome-wide LD models by optimizing a penalized likelihood criterion, and a convenient way to display these models is described. To illustrate the methods they are applied to data obtained by genotyping 8341 pigs. It is found that roughly 20% of the porcine genome exhibits complex LD patterns, forming islands of relatively high genetic diversity. Conclusions: The proposed algorithm is efficient and makes it feasible to estimate and visualize chromosome-wide LD models on a routine basis.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 39
    Publication Date: 2013-06-07
    Description: Background: Interpretation of gene expression microarray data in the light of external information on both columns and rows (experimental variables and gene annotations) facilitates the extraction of pertinent information hidden in these complex data. Biologists classically interpret such experiments by retrieving functional information for a subset of genes of interest. Transcription factors play an important role in orchestrating the regulation of gene expression. Their activity can be deduced by examining the presence of putative transcription factor binding sites in gene promoter regions. Results: In this paper we present the multivariate statistical method RLQ, which aims to analyze microarray data where additional information is available on both genes and samples. As an illustrative example, we applied the RLQ methodology to analyze transcription factor activity associated with the time-course effect of steroids on the growth of primary human lung fibroblasts. RLQ could successfully predict transcription factor activity, and could integrate various other sources of external information into the main frame of the analysis. The approach was validated by means of alternative statistical methods and biological evidence. Conclusions: RLQ provides an efficient way of extracting and visualizing structures present in a gene expression dataset by directly modeling the link between experimental variables and gene annotations.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 40
    Publication Date: 2013-06-07
    Description: Background: The development of genotyping arrays containing hundreds of thousands of rare variants across the genome and advances in high-throughput sequencing technologies have made empirical genetic association studies that search for rare disease susceptibility alleles feasible. As single-variant testing is underpowered to detect associations, the development of statistical methods to combine analysis across variants -- so-called "burden tests" -- is an area of active research interest. We previously developed a method, the admixture maximum likelihood test, to test multiple common variants for association with a trait of interest. We have extended this method to the analysis of rare variants; the extension is called the rare admixture maximum likelihood test (RAML). In this paper we compare the performance of RAML with six other burden tests designed to test for association of rare variants. Results: We used simulation testing over a range of scenarios to compare the power of RAML with the other rare variant association testing methods. These scenarios modelled differences in effect variability, the average direction of effect and the proportion of associated variants. We evaluated the power for all the different scenarios. RAML tended to have the greatest power for most scenarios where the proportion of associated variants was small, whereas SKAT-O performed a little better for the scenarios with a higher proportion of associated variants. Conclusions: The RAML method makes no assumptions about the proportion of variants that are associated with the phenotype of interest or the magnitude and direction of their effect. The method is flexible and can be applied to both dichotomous and quantitative traits, and allows for the inclusion of covariates in the underlying regression model. The RAML method performed well compared to the other methods over a wide range of scenarios. Generally, power was moderate in most of the scenarios, underlining the need for large sample sizes in any form of association testing.
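    For readers unfamiliar with burden tests, the sketch below shows the simplest collapsing variant: sum an individual's rare alleles into a single burden score and assess the case-control difference with a permutation p-value. RAML itself maximizes an admixture likelihood over the proportion and effect of associated variants; this baseline, on simulated genotypes, is only meant to make the general idea concrete.

```python
import numpy as np

def burden_test(genotypes, phenotype, n_permutations=10_000, seed=0):
    """Simple burden test: collapse rare-variant counts per individual and
    compare the mean burden between cases and controls with a permutation p-value.
    (This is the baseline collapsing idea, not the RAML likelihood.)"""
    burden = genotypes.sum(axis=1)                 # rare-allele count per individual
    observed = burden[phenotype == 1].mean() - burden[phenotype == 0].mean()
    rng = np.random.default_rng(seed)
    null = np.empty(n_permutations)
    for i in range(n_permutations):
        shuffled = rng.permutation(phenotype)
        null[i] = burden[shuffled == 1].mean() - burden[shuffled == 0].mean()
    return observed, float(np.mean(np.abs(null) >= abs(observed)))

rng = np.random.default_rng(42)
phenotype = np.repeat([1, 0], 500)                  # 500 cases, 500 controls
genotypes = rng.binomial(1, 0.01, size=(1000, 50))  # 50 rare variants
genotypes[phenotype == 1] |= rng.binomial(1, 0.01, size=(500, 50))  # enrich cases
print(burden_test(genotypes, phenotype))
```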
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 41
    Publication Date: 2013-06-09
    Description: Background: Miniature inverted repeat transposable elements (MITEs) are abundant non-autonomous elements that play important roles in shaping gene and genome evolution. Their characteristic structural features are suitable for automated identification by computational approaches; however, de novo MITE discovery at the genomic level is still resource-intensive. Results: Efficient and accurate computational tools are desirable. Existing algorithms process every member of a MITE family; therefore, a major portion of the computing task is redundant. In this study, redundant computing steps were analyzed and a novel algorithm emphasizing the reduction of such redundant computing was implemented in MITE Digger. It completed processing the whole rice genome sequence database in ~15 hours and produced 332 MITE candidates with low false positive (1.8%) and false negative (0.9%) rates. MITE Digger was also tested for genome-wide MITE discovery with four other genomes. Conclusions: MITE Digger is efficient and accurate for genome-wide retrieval of MITEs. Its user-friendly interface further facilitates genome-wide analyses of MITEs on a routine basis. The MITE Digger program is available at: http://labs.csb.utoronto.ca/yang/MITE Digger.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 42
    Publication Date: 2013-06-09
    Description: Background: Next Generation Sequencing technologies have revolutionized many fields in biology by reducing the time and cost required for sequencing. As a result, large amounts of sequencing data are being generated. A typical sequencing data file may occupy tens or even hundreds of gigabytes of disk space, prohibitively large for many users. This data consists of both the nucleotide sequences and per-base quality scores that indicate the level of confidence in the readout of these sequences. Quality scores account for about half of the required disk space in the commonly used FASTQ format (before compression), and therefore the compression of the quality scores can significantly reduce storage requirements and speed up analysis and transmission of sequencing data. Results: In this paper, we present a new scheme for the lossy compression of the quality scores, to address the problem of storage. Our framework allows the user to specify the rate (bits per quality score) prior to compression, independent of the data to be compressed. Our algorithm can work at any rate, unlike other lossy compression algorithms. We envisage our algorithm as being part of a more general compression scheme that works with the entire FASTQ file. Numerical experiments show that we can achieve a better mean squared error (MSE) for small rates (bits per quality score) than other lossy compression schemes. For the organism PhiX, whose assembled genome is known and assumed to be correct, we show that it is possible to achieve a significant reduction in size with little compromise in performance on downstream applications (e.g., alignment). Conclusions: QualComp is an open source software package, written in C and freely available for download at https://sourceforge.net/projects/qualcomp. It is designed to lossily compress the quality scores presented in a FASTQ file. Given a model for the quality scores, we use rate-distortion results to optimally allocate the available bits in order to minimize the MSE. This metric allows us to compare different lossy compression algorithms for quality scores without depending on downstream applications that may use the quality scores in very different ways.
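    The storage/fidelity trade-off described above can be made concrete with a fixed-rate quantizer: map each quality score to one of 2^b reconstruction levels and measure the resulting mean squared error. QualComp allocates its bit budget with a rate-distortion model of the scores rather than the uniform grid used in this sketch; the toy score distribution below is invented.

```python
import numpy as np

def quantize_qualities(qualities, bits_per_score):
    """Uniformly quantize Phred quality scores to 2**bits levels and report the MSE.

    This is only a simple fixed-rate quantizer to illustrate the rate/MSE
    trade-off; the published method allocates bits via rate-distortion theory.
    """
    qualities = np.asarray(qualities, dtype=float)
    levels = 2 ** bits_per_score
    edges = np.linspace(qualities.min(), qualities.max(), levels + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    indices = np.clip(np.digitize(qualities, edges[1:-1]), 0, levels - 1)
    reconstructed = centers[indices]
    mse = float(np.mean((qualities - reconstructed) ** 2))
    return reconstructed, mse

rng = np.random.default_rng(1)
scores = np.clip(rng.normal(34, 4, size=10_000), 2, 40)  # toy Illumina-like scores
for bits in (1, 2, 4):
    _, mse = quantize_qualities(scores, bits)
    print(f"{bits} bit(s)/score -> MSE {mse:.2f}")
```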
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 43
    Publication Date: 2013-04-03
    Description: Background: While the genomes of hundreds of organisms have been sequenced and good approaches exist for finding protein-encoding genes, an important remaining challenge is predicting the functions of the large fraction of genes for which there is no annotation. Large gene expression datasets from microarray experiments already exist and many of these can be used to help assign potential functions to these genes. We have applied Support Vector Machines (SVM), a sigmoid fitting function and a stratified cross-validation approach to analyze a large microarray experiment dataset from Drosophila melanogaster in order to predict possible functions for previously un-annotated genes. A total of approximately 5043 different genes, or about one-third of the predicted genes in the D. melanogaster genome, are represented in the dataset, and 1854 (or 37%) of these genes are un-annotated. Results: 39 Gene Ontology Biological Process (GO-BP) categories were found with a precision value equal to or larger than 0.75 when recall was fixed at the 0.4 level. For two of those categories, we have provided additional support for assigning given genes to the category by showing that the majority of transcripts for the genes belonging to a given category have a similar localization pattern during embryogenesis. Additionally, by assessing the predictions using a confidence score, we have been able to provide a putative GO-BP term for 1422 previously un-annotated genes, or about 77% of the un-annotated genes represented on the microarray and about 19% of all of the un-annotated genes in the D. melanogaster genome. Conclusions: Our study successfully employs a number of SVM classifiers, accompanied by detailed calibration and validation techniques, to generate a number of predictions for new annotations for D. melanogaster genes. Applying probabilistic analysis to the SVM output improves the interpretability of the prediction results and the objectivity of the validation procedure.
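    A small sketch of the general recipe (SVM scores calibrated with a sigmoid, assessed with stratified cross-validation, precision reported at a fixed recall) is shown below using scikit-learn on synthetic data. It is not the authors' pipeline, and the data are not the Drosophila expression profiles; in the study one such classifier is trained per GO-BP category.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import LinearSVC

# Synthetic stand-in for expression features of genes in/out of one GO-BP category.
X, y = make_classification(n_samples=400, n_features=60, n_informative=10,
                           weights=[0.8, 0.2], random_state=0)

# Platt-style sigmoid fitting maps SVM decision values to calibrated probabilities.
model = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
probabilities = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]

# Fix recall near 0.4 by choosing the matching probability threshold, then report precision.
threshold = np.quantile(probabilities[y == 1], 0.6)
predicted = (probabilities >= threshold).astype(int)
print("recall:", round(recall_score(y, predicted), 2),
      "precision:", round(precision_score(y, predicted), 2))
```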
    Electronic ISSN: 1756-0381
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 44
    Publication Date: 2013-04-05
    Description: Background: Pyrrolysine (the 22nd amino acid) is, in certain organisms and under certain circumstances, encoded by the amber stop codon, UAG. The circumstances driving pyrrolysine translation are not well understood. The involvement of a predicted mRNA structure in the region downstream of the UAG has been suggested, but the structure does not seem to be present in all pyrrolysine-incorporating genes. Results: We propose a strategy to predict pyrrolysine-encoding genes in the genomes of archaea and bacteria. We cluster open reading frames interrupted by the amber codon based on sequence similarity. We rank these clusters according to several features that may influence pyrrolysine translation. The ranking effects of the different features are assessed and we propose a weighted combination of these features which best explains the currently known pyrrolysine-incorporating genes. We devote special attention to the effect of structural conservation and provide further substantiation that structural conservation may be influential -- but is not a necessary factor. Finally, from the weighted ranking, we identify a number of potentially pyrrolysine-incorporating genes. Conclusions: We propose a method for the prediction of pyrrolysine-incorporating genes in the genomes of bacteria and archaea, leading to insights about the factors driving pyrrolysine translation and to the identification of new gene candidates. The method predicts known conserved genes with high recall and predicts several other promising candidates for experimental verification. The method is implemented as a computational pipeline which is available on request.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 45
    Publication Date: 2013-04-05
    Description: Background: Next generation transcriptome sequencing (RNA-Seq) is emerging as a powerful experimental tool for the study of alternative splicing and its regulation, but requires ad-hoc analysis methods and tools. PASTA (Patterned Alignments for Splicing and Transcriptome Analysis) is a splice junction detection algorithm specifically designed for RNA-Seq data, relying on a highly accurate alignment strategy and on a combination of heuristic and statistical methods to identify exon-intron junctions with high accuracy. Results: Comparisons against TopHat and other splice junction prediction software on real and simulated datasets show that PASTA exhibits high specificity and sensitivity, especially at lower coverage levels. Moreover, PASTA is highly configurable and flexible, and can therefore be applied in a wide range of analysis scenarios: it is able to handle both single-end and paired-end reads, it does not rely on the presence of canonical splicing signals, and it uses organism-specific regression models to accurately identify junctions. Conclusions: PASTA is a highly efficient and sensitive tool to identify splicing junctions from RNA-Seq data. Compared to similar programs, it has the ability to identify a higher number of real splicing junctions, and provides highly annotated output files containing detailed information about their location and characteristics. Accurate junction data in turn facilitates the reconstruction of the splicing isoforms and the analysis of their expression levels, which will be performed by the remaining modules of the PASTA pipeline, still under development. Use of PASTA can therefore enable the large-scale investigation of transcription and alternative splicing.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 46
    Publication Date: 2013-04-11
    Description: Contributing reviewers. The editors of BMC Bioinformatics would like to thank all our reviewers who have contributed to the journal in volume 13 (2012).
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 47
    Publication Date: 2013-04-06
    Description: Background: The random forest (RF) method is a commonly used tool for classification with high-dimensional data as well as for ranking candidate predictors based on the so-called random forest variable importance measures (VIMs). However, the classification performance of RF is known to be suboptimal in the case of strongly unbalanced data, i.e. data where response class sizes differ considerably. Suggestions were made to obtain better classification performance based either on sampling procedures or on cost sensitivity analyses. However, to our knowledge, the performance of the VIMs has not yet been examined in the case of unbalanced response classes. In this paper we explore the performance of the permutation VIM for unbalanced data settings and introduce an alternative permutation VIM based on the area under the curve (AUC) that is expected to be more robust towards class imbalance. Results: We investigated the performance of the standard permutation VIM and of our novel AUC-based permutation VIM for different class imbalance levels using simulated data and real data. The results suggest that the new AUC-based permutation VIM outperforms the standard permutation VIM for unbalanced data settings, while both permutation VIMs have equal performance for balanced data settings. Conclusions: The standard permutation VIM loses its ability to discriminate between associated predictors and predictors not associated with the response for increasing class imbalance. It is outperformed by our new AUC-based permutation VIM for unbalanced data settings, while the performance of both VIMs is very similar in the case of balanced classes. The new AUC-based VIM is implemented in the R package party for the unbiased RF variant based on conditional inference trees. The code implementing our study is available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html
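    The contrast between an error-rate-based and an AUC-based permutation importance can be sketched with scikit-learn's permutation_importance on strongly unbalanced synthetic data. Note that the paper computes the importance on each tree's out-of-bag samples inside the forest, whereas this simplified sketch permutes features on a held-out set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Strongly unbalanced toy data (5% positives); with shuffle=False the first
# few columns are the informative features.
X, y = make_classification(n_samples=2000, n_features=12, n_informative=3,
                           weights=[0.95, 0.05], shuffle=False, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

forest = RandomForestClassifier(n_estimators=300, random_state=1).fit(X_train, y_train)

# Error-rate-based versus AUC-based permutation importance on held-out data.
for scoring in ("accuracy", "roc_auc"):
    result = permutation_importance(forest, X_test, y_test, scoring=scoring,
                                    n_repeats=20, random_state=1)
    top = result.importances_mean.argsort()[::-1][:3]
    print(scoring, "-> top features:", top.tolist())
```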
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 48
    Publication Date: 2013-04-11
    Description: Background: Since peak alignment in metabolomics has a huge effect on the subsequent statistical analysis, it is considered a key preprocessing step and many peak alignment methods have been developed. However, existing peak alignment methods do not produce satisfactory results. Indeed, the lack of accuracy results from the fact that peak alignment is done separately from other preprocessing steps such as identification. Therefore, a post-hoc approach, which integrates both identification and alignment results, is urgently needed to increase the accuracy of peak alignment. Results: The proposed post-hoc method was validated with three datasets: a mixture of compound standards, a metabolite extract from mouse liver, and a metabolite extract from wheat. Compared to the existing methods, the proposed approach improved peak alignment in terms of various performance measures. The post-hoc approach was also verified by manual inspection to improve peak alignment. Conclusions: The proposed approach, which combines the information from metabolite identification and alignment, clearly improves the accuracy of peak alignment in terms of several performance measures. An R package and examples using a dataset are available at http://mrr.sourceforge.net/download.html.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 49
    Publication Date: 2013-04-11
    Description: Background: Algorithms to identify screening colonoscopies in administrative databases would be useful for monitoring colorectal cancer (CRC) screening uptake, tracking health resource utilization, and quality assurance. Previously developed algorithms based on expert opinion were insufficiently accurate. The purpose of this study was to develop and evaluate the accuracy of model-based algorithms to identify screening colonoscopies in health administrative databases. Methods: Patients aged 50-75 were recruited from endoscopy units in Montreal, Quebec, and Calgary, Alberta. Physician billing records and hospitalization data were obtained for each patient from the provincial administrative health databases. The indication for colonoscopy was derived using Bayesian latent class analysis informed by endoscopist and patient questionnaire responses. Two modeling methods were used to fit the data, multivariate logistic regression and recursive partitioning, and the accuracy of these models was assessed. Results: 689 patients from Montreal and 541 from Calgary participated (January to March 2007). The latent class model identified 554 screening exams. Multivariate logistic regression predictions yielded an area under the curve of 0.786. Recursive partitioning using the latent outcome had a sensitivity and specificity of 84.5% (95% CI: 81.5-87.5) and 63.3% (95% CI: 59.7-67.0), respectively. Conclusions: Model-based algorithms using administrative data failed to identify screening colonoscopies with sufficient accuracy. Nevertheless, the approach of constructing a latent reference standard against which model-based algorithms are evaluated may be useful for validating administrative data in other contexts where a gold standard is lacking.
    Electronic ISSN: 1472-6947
    Topics: Computer Science , Medicine
    Published by BioMed Central
  • 50
    Publication Date: 2013-09-09
    Description: Background: DNA pooling constitutes a cost effective alternative in genome wide association studies. In DNA pooling, equimolar amounts of DNA from different individuals are mixed into one sample and the frequency of each allele in each position is observed in a single genotype experiment. The identification of haplotype frequencies from pooled data in addition to single locus analysis is of separate interest within these studies as haplotypes could increase statistical power and provide additional insight. Results: We developed a method for maximum-parsimony haplotype frequency estimation from pooled DNA data based on the sparse representation of the DNA pools in a dictionary of haplotypes. Extensions to scenarios where data is noisy or even missing are also presented. The resulting method is first applied to simulated data based on the haplotypes and their associated frequencies of the AGT gene. We further evaluate our methodology on datasets consisting of SNPs from the first 7Mb of the HapMap CEU population. Noise and missing data were further introduced in the datasets in order to test the extensions of the proposed method. Both HIPPO and HAPLOPOOL were also applied to these datasets to compare performances. Conclusions: We evaluate our methodology on scenarios where pooling is more efficient relative to individual genotyping; that is, in datasets that contain pools with a small number of individuals. We show that in such scenarios our methodology outperforms state-of-the-art methods such as HIPPO and HAPLOPOOL.
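    The core estimation problem can be sketched as recovering a sparse, non-negative frequency vector f from pooled allele frequencies y ≈ Hf, where the columns of H are candidate haplotypes. The sketch below uses plain non-negative least squares with a sum-to-one constraint as a stand-in for the paper's sparse-representation (maximum-parsimony) formulation; the haplotypes and frequencies are invented, not the AGT-gene data.

```python
import numpy as np
from scipy.optimize import nnls

# Columns of H are candidate haplotypes over 4 SNPs (0/1 alleles); rows are SNPs.
H = np.array([[0, 1, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 0, 0, 1]], dtype=float)
true_freq = np.array([0.5, 0.3, 0.2, 0.0])

# A pool only measures per-SNP allele frequencies: y = H @ f (+ noise).
rng = np.random.default_rng(3)
y = H @ true_freq + rng.normal(scale=0.01, size=H.shape[0])

# Non-negative least squares with a sum-to-one constraint added as an extra
# row; unused haplotypes receive (near-)zero weight, giving a parsimonious fit.
A = np.vstack([H, np.ones(H.shape[1])])
b = np.append(y, 1.0)
estimate, _ = nnls(A, b)
print(np.round(estimate, 3))
```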
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 51
    Publication Date: 2013-09-09
    Description: Background: Many problems in computational biology require alignment-free sequence comparisons. One of the common tasks involving sequence comparison is sequence clustering. Here we apply methods of alignment-free comparison (in particular, comparison using sequence composition) to the challenge of sequence clustering. Results: We study several centroid based algorithms for clustering sequences based on word counts. Study of their performance shows that using k-means algorithm with or without the data whitening is efficient from the computational point of view. A higher clustering accuracy can be achieved using the soft expectation maximization method, whereby each sequence is attributed to each cluster with a specific probability. We implement an open source tool for alignment-free clustering. It is publicly available from github: https://github.com/luscinius/afcluster. Conclusions: We show the utility of alignment-free sequence clustering for high throughput sequencing analysis despite its limitations. In particular, it allows one to perform assembly with reduced resources and a minimal loss of quality. The major factor affecting performance of alignment-free read clustering is the length of the read.
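    A minimal version of word-count (k-mer) clustering is easy to write down: compute a normalized k-mer profile per read, optionally whiten it, and run k-means. The reads, k and cluster count below are toy values, and the sketch omits the soft expectation-maximization variant discussed in the paper.

```python
from itertools import product
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def kmer_profile(sequence, k=3):
    """Normalized k-mer (word) count vector for one read or sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(sequence) - k + 1):
        word = sequence[i:i + k]
        if word in index:
            counts[index[word]] += 1
    total = counts.sum()
    return counts / total if total else counts

# Toy reads; a real run would stream them from FASTA/FASTQ files.
reads = ["ACGTACGTACGTACGT", "ACGTACGAACGTACGT", "TTTTGGGGTTTTGGGG", "TTTTGGGCTTTTGGGG"]
profiles = np.array([kmer_profile(r) for r in reads])

# Whitening (zero mean, unit variance per word) followed by k-means.
whitened = StandardScaler().fit_transform(profiles)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(whitened)
print(labels)
```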
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 52
    Publication Date: 2013-09-09
    Description: Background: Current advances in next-generation sequencing technology have revealed a large number of un-annotated RNA transcripts. Comparative study of the RNA structurome is an important approach to assessing their biological functionality. Due to the large sizes and abundance of the RNA transcripts, an efficient and accurate RNA structure-structure alignment algorithm is urgently needed to facilitate such comparative studies. Despite the importance of the RNA secondary structure alignment problem, no computational tools are available that provide both high computational efficiency and accuracy, so designing and implementing such an algorithm is highly desirable. Results: In this work, by incorporating the sparse dynamic programming technique, we implemented an algorithm that has an O(n^3) expected time complexity, where n is the average number of base pairs in the RNA structures. This complexity, which can be shown assuming the polymer-zeta property, is confirmed by our experiments. The resulting new RNA secondary structure alignment tool is called ERA. Benchmark results indicate that ERA can significantly speed up RNA structure-structure alignments compared to other state-of-the-art RNA alignment tools, while maintaining high alignment accuracy. Conclusions: Using the sparse dynamic programming technique, we are able to develop a new RNA secondary structure alignment tool that is both efficient and accurate. We anticipate that the new alignment algorithm ERA will significantly promote comparative RNA structure studies. The program, ERA, is freely available at http://genome.ucf.edu/ERA.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 53
    Publication Date: 2013-09-10
    Description: Background: Chromatin immunoprecipitation coupled with hybridization to a tiling array (ChIP-chip) is a cost-effective and routinely used method to identify protein-DNA interactions or chromatin/histone modifications. The robust identification of ChIP-enriched regions is frequently complicated by noisy measurements. This identification can be improved by accounting for dependencies between adjacent probes on chromosomes and by modeling of biological replicates. Results: MultiChIPmixHMM is a user-friendly R package to analyse ChIP-chip data, modeling spatial dependencies between directly adjacent probes on a chromosome and enabling a simultaneous analysis of replicates. It is based on a linear regression mixture model, designed to perform a joint modeling of immunoprecipitated and input measurements. Conclusion: We show the utility of MultiChIPmixHMM by analyzing histone modifications of Arabidopsis thaliana. MultiChIPmixHMM is implemented in R, including functions in C, and is freely available from the CRAN web site: http://cran.r-project.org.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 54
    Publication Date: 2013-09-12
    Description: Background: General Feature Format (GFF) files are used to store genome features such as genes, exons, introns, primary transcripts etc. Although many software packages (e.g. ab initio gene prediction programs) can annotate features by using such a standard, only a small number of tools have been developed to extract the corresponding sequence information from the original genome. However, the present tools offer neither quality control nor a customizable filter of the annotated features. Findings: gff2sequence is a program that extracts nucleotide/protein sequences from a genomic multifasta by using the information provided by a general feature format file. While a graphical user interface makes this software very easy to use, a C++ algorithm allows high performance together with low hardware demand. The software also allows the extraction of genic portions such as the untranslated and coding sequences. Moreover, a highly customizable quality control pipeline can be used to deal with anomalous splicing sites, incorrect open reading frames and non-canonical characters within the retrieved sequences. Conclusions: gff2sequence is a user-friendly program that allows the generation of highly customizable sequence datasets by processing a general feature format file. The presence of a wide range of quality filters makes this tool also suitable for refining ab initio gene predictions.
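    The basic operation such a tool performs, i.e. pulling each annotated feature's sequence out of the genome (and reverse-complementing minus-strand features), can be sketched as follows. The quality-control filters that distinguish gff2sequence are omitted, and the input file names are hypothetical.

```python
# Minimal GFF-driven sequence extraction: read a multi-FASTA genome, then
# yield the sub-sequence of every feature of a chosen type.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def read_fasta(path):
    """Return {sequence_id: sequence} for a (small) multi-FASTA file."""
    sequences, name, chunks = {}, None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name:
                    sequences[name] = "".join(chunks)
                name, chunks = line[1:].split()[0], []
            else:
                chunks.append(line)
    if name:
        sequences[name] = "".join(chunks)
    return sequences

def extract_features(fasta_path, gff_path, feature_type="CDS"):
    genome = read_fasta(fasta_path)
    with open(gff_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            seqid, _, ftype, start, end, _, strand, _, attributes = line.rstrip("\n").split("\t")[:9]
            if ftype != feature_type:
                continue
            subseq = genome[seqid][int(start) - 1:int(end)]   # GFF is 1-based, inclusive
            yield attributes, subseq if strand == "+" else reverse_complement(subseq)

# Hypothetical file names; any genome FASTA plus matching GFF would do.
for attributes, sequence in extract_features("genome.fa", "annotation.gff"):
    print(attributes, sequence[:30])
```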
    Electronic ISSN: 1756-0381
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 55
    Publication Date: 2013-09-13
    Description: Background: Gene regulatory network inference remains a challenging problem in systems biology despite the numerous approaches that have been proposed. When substantial knowledge on a gene regulatory network is already available, supervised network inference is appropriate. Such a method builds a binary classifier able to assign a class (Regulation/No regulation) to an ordered pair of genes. Once learnt, the pairwise classifier can be used to predict new regulations. In this work, we explore the framework of Markov Logic Networks (MLN), which combine features of probabilistic graphical models with the expressivity of first-order logic rules. Results: We propose to learn a Markov Logic Network, i.e. a set of weighted rules that conclude on the predicate "regulates", starting from a known gene regulatory network involved in the proliferation/differentiation switch of keratinocyte cells, a set of experimental transcriptomic data and various descriptions of genes, all encoded into first-order logic. As the training data are unbalanced, we use asymmetric bagging to learn a set of MLNs. The prediction of a new regulation can then be obtained by averaging the predictions of the individual MLNs. As a side contribution, we propose three in silico tests to assess the performance of any pairwise classifier in various network inference tasks on real datasets. The first test consists of measuring the average performance on a balanced edge prediction problem; the second deals with the ability of the classifier, once enhanced by asymmetric bagging, to update a given network. Finally, our main result concerns a third test that measures the ability of the method to predict regulations with a new set of genes. As expected, MLN, when provided with only numerical discretized gene expression data, does not perform as well as a pairwise SVM in terms of AUPR. However, when a more complete description of gene properties is provided by heterogeneous sources, MLN achieves the same performance as a black-box model such as a pairwise SVM while providing relevant insights on the predictions. Conclusions: The numerical studies show that MLN achieves very good predictive performance while opening the door to some interpretability of the decisions. Besides the ability to suggest new regulations, such an approach allows one to cross-validate experimental data with existing knowledge.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 56
    Publication Date: 2013-09-14
    Description: Background: Blindness due to diabetic retinopathy (DR) is the major disability in diabetic patients. Although early management has shown to prevent vision loss, diabetic patients have a low rate of routine ophthalmologic examination. Hence, we developed and validated sparse learning models with the aim of identifying the risk of DR in diabetic patients. Methods: Health records from the Korea National Health and Nutrition Examination Surveys (KNHANES) V-1 were used. The prediction models for DR were constructed using data from 327 diabetic patients, and were validated internally on 163 patients in the KNHANES V-1. External validation was performed using 562 diabetic patients in the KNHANES V-2. The learning models, including ridge, elastic net, and LASSO, were compared to the traditional indicators of DR. Results: Considering the Bayesian information criterion, LASSO predicted DR most efficiently. In the internal and external validation, LASSO was significantly superior to the traditional indicators by calculating the area under the curve (AUC) of the receiver operating characteristic. LASSO showed an AUC of 0.81 and an accuracy of 73.6% in the internal validation, and an AUC of 0.82 and an accuracy of 75.2% in the external validation. Conclusion: The sparse learning model using LASSO was effective in analyzing the epidemiological underlying patterns of DR. This is the first study to develop a machine learning model to predict DR risk using health records. LASSO can be an excellent choice when both discriminative power and variable selection are important in the analysis of high-dimensional electronic health records.
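    A sketch of the modelling step, i.e. an L1-penalized (LASSO) logistic regression evaluated by AUC on a held-out set, is shown below with scikit-learn on synthetic data standing in for the KNHANES records; the regularization strength, feature set and sample sizes are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the survey features; the real study fits the model
# on KNHANES V-1 and validates it externally on KNHANES V-2.
X, y = make_classification(n_samples=490, n_features=30, n_informative=6,
                           weights=[0.8, 0.2], random_state=7)
X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=1/3,
                                              stratify=y, random_state=7)

# The L1 penalty (LASSO) keeps only a sparse subset of predictors.
lasso = make_pipeline(StandardScaler(),
                      LogisticRegression(penalty="l1", solver="liblinear", C=0.5))
lasso.fit(X_dev, y_dev)

n_selected = (lasso[-1].coef_ != 0).sum()
auc = roc_auc_score(y_val, lasso.predict_proba(X_val)[:, 1])
print(f"selected predictors: {n_selected}, validation AUC: {auc:.2f}")
```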
    Electronic ISSN: 1472-6947
    Topics: Computer Science , Medicine
    Published by BioMed Central
  • 57
    Publication Date: 2013-09-19
    Description: Background: Recent increases in the number of deposited membrane protein crystal structures necessitate the use of automated computational tools to position them within the lipid bilayer. Identifying the correct orientation allows us to study the complex relationship between sequence, structure and the lipid environment, which is otherwise challenging to investigate using experimental techniques due to the difficulty in crystallising membrane proteins embedded within intact membranes. Results: We have developed a knowledge-based membrane potential, calculated by the statistical analysis of transmembrane protein structures, coupled with a combination of genetic and direct search algorithms, and demonstrate its use in positioning proteins in membranes, refinement of membrane protein models and in decoy discrimination. Conclusions: Our method is able to quickly and accurately orientate both alpha-helical and beta-barrel membrane proteins within the lipid bilayer, showing closer agreement with experimentally determined values than existing approaches. We also demonstrate both consistent and significant refinement of membrane protein models and the effective discrimination between native and decoy structures. Source code is available under an open source license from http://bioinf.cs.ucl.ac.uk/downloads/memembed/.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 58
    Publication Date: 2013-09-22
    Description: Background: Paper-based Aboriginal and Torres Strait Islander health checks have promoted a preventive approach to primary care and provided data to support research at the Inala Indigenous Health Service, south-west Brisbane, Australia. Concerns about the limitations of paper-based health checks prompted us to change to a computerised system to realise potential benefits for clinical services and research capability. We describe the rationale, implementation and anticipated benefits of computerised Aboriginal and Torres Strait Islander health checks in one primary health care setting. Methods: In May 2010, the Inala Indigenous Health Service commenced a project to computerise Aboriginal and Torres Strait Islander child, adult, diabetic, and antenatal health checks. The computerised health checks were launched in September 2010 and then evaluated for staff satisfaction, research consent rate and uptake. Ethical approval for health check data to be used for research purposes was granted in December 2010. Results: Three months after the September 2010 launch date, all but two health checks (378 out of 380, 99.5%) had been completed using the computerised system. Staff gave the system a median mark of 8 out of 10 (range 5-9), where 10 represented the highest level of overall satisfaction. By September 2011, 1099 child and adult health checks, 138 annual diabetic checks and 52 of the newly introduced antenatal checks had been completed. These numbers of computerised health checks are greater than for the previous year (2010) of paper-based health checks with a risk difference of 0.07 (95% confidence interval 0.05, 0.10). Additionally, two research projects based on computerised health check data were underway. Conclusions: The Inala Indigenous Health Service has demonstrated that moving from paper-based Aboriginal and Torres Strait Islander health checks to a system using computerised health checks is feasible and can facilitate research. We expect computerised health checks will improve clinical care and continue to enable research projects using validated data, reflecting the local Aboriginal and Torres Strait Islander community's priorities.
    Electronic ISSN: 1472-6947
    Topics: Computer Science , Medicine
    Published by BioMed Central
  • 59
    Publication Date: 2013-09-23
    Description: Background: Dynamic visualisation interfaces are required to explore the multiple microbial genome data now available, especially those obtained by high-throughput sequencing --- a.k.a. "Next-Generation Sequencing" (NGS) --- technologies; they would also be useful for "standard" annotated genomes whose chromosome organizations may be compared. Although various software systems are available, few offer an optimal combination of feature-rich capabilities, non-static user interfaces and multi-genome data handling. Results: We developed SynTView, a comparative and interactive viewer for microbial genomes, designed to run as either a web-based tool (Flash technology) or a desktop application (AIR environment). The basis of the program is a generic genome browser with sub-maps holding information about genomic objects (annotations). The software is characterised by the presentation of syntenic organisations of microbial genomes and the visualisation of polymorphism data (typically Single Nucleotide Polymorphisms --- SNPs) along these genomes; these features are accessible to the user in an integrated way. A variety of specialised views are available and are all dynamically inter-connected (including linear and circular multi-genome representations, dot plots, phylogenetic profiles, SNP density maps, and more). SynTView is not linked to any particular database, allowing the user to plug his own data into the system seamlessly, and use external web services for added functionalities. SynTView has now been used in several genome sequencing projects to help biologists make sense out of huge data sets. Conclusions: The most important assets of SynTView are: (i) the interactivity due to the Flash technology; (ii) the capabilities for dynamic interaction between many specialised views; and (iii) the flexibility allowing various user data sets to be integrated. It can thus be used to investigate massive amounts of information efficiently at the chromosome level. This innovative approach to data exploration could not be achieved with most existing genome browsers, which are more static and/or do not offer multiple views of multiple genomes. Documentation, tutorials and demonstration sites are available at the URL: http://genopole.pasteur.fr/SynTView.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 60
    Publication Date: 2013-09-23
    Description: Background: High-throughput RNA sequencing (RNA-Seq) is a revolutionary technique to study the transcriptome of a cell under various conditions at a systems level. Despite the wide application of RNA-Seq techniques to generate experimental data in the last few years, few computational methods are available to analyze this huge amount of transcription data. The computational methods for constructing gene regulatory networks from RNA-Seq expression data of hundreds or even thousands of genes are particularly lacking and urgently needed. Results: We developed an automated bioinformatics method to predict gene regulatory networks from the quantitative expression values of differentially expressed genes based on RNA-Seq transcriptome data of a cell in different stages and conditions, integrating transcriptional, genomic and gene function data. We applied the method to the RNA-Seq transcriptome data generated for soybean root hair cells in three different development stages of nodulation after rhizobium infection. The method predicted a soybean nodulation-related gene regulatory network consisting of 10 regulatory modules common for all three stages, and 24, 49 and 70 modules separately for the first, second and third stage, each containing both a group of co-expressed genes and several transcription factors collaboratively controlling their expression under different conditions. 8 of 10 common regulatory modules were validated by at least two kinds of validations, such as independent DNA binding motif analysis, gene function enrichment test, and previous experimental data in the literature. Conclusions: We developed a computational method to reliably reconstruct gene regulatory networks from RNA-Seq transcriptome data. The method can generate valuable hypotheses for interpreting biological data and designing biological experiments such as ChIP-Seq, RNA interference, and yeast two hybrid experiments.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 61
    Publication Date: 2013-09-23
    Description: Background: With the increasing prevalence of Picture Archiving and Communication Systems (PACS) in healthcare institutions, there is a growing need to measure their success. However, there is a lack of published literature emphasizing the technical and social factors underlying a successful PACS. Methods: An updated Information Systems Success Model was utilized by radiology technologists (RTs) to evaluate the success of PACS at a large medical center in Taiwan. A survey, consisting of 109 questionnaires, was analyzed by Structural Equation Modeling. Results: Socio-technical factors (including system quality, information quality, service quality, perceived usefulness, user satisfaction, and PACS dependence) were proven to be effective measures of PACS success. Although the relationship between service quality and perceived usefulness was not significant, other proposed relationships amongst the six measurement parameters of success were all confirmed. Conclusions: Managers have an obligation to improve the attributes of PACS. At the onset of its deployment, RTs will have formed their own subjective opinions with regards to its quality (system quality, information quality, and service quality). As these personal concepts are either refuted or reinforced based on personal experiences, RTs will become either satisfied or dissatisfied with PACS, based on their perception of its usefulness or lack of usefulness. A satisfied RT may play a pivotal role in the implementation of PACS in the future.
    Electronic ISSN: 1472-6947
    Topics: Computer Science , Medicine
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 62
    Publication Date: 2013-09-24
    Description: Background: Finding peaks in ChIP-seq is an important process in biological inference. In some cases, such as positioning nucleosomes with specific histone modifications or finding transcription factor binding specificities, the precision of the detected peak plays a significant role. There are several applications for finding peaks (called peak finders) based on different algorithms (e.g. MACS, Erange and HPeak). Benchmark studies have shown that the existing peak finders identify different peaks for the same dataset and it is not known which one is the most accurate. We present the first meta-server called Peak Finder MetaServer (PFMS) that collects results from several peak finders and produces consensus peaks. Our application accepts three standard ChIP-seq data formats: BED, BAM, and SAM. Results: Sensitivity and specificity of seven widely used peak finders were examined. For the experiments we used three previously studied Transcription Factors (TF) ChIP-seq datasets and identified three of the selected peak finders that returned results with high specificity and very good sensitivity compared to the remaining four. We also ran PFMS using the three selected peak finders on the same TF datasets and achieved higher specificity and sensitivity than the peak finders individually. Conclusions: We show that combining outputs from up to seven peak finders yields better results than individual peak finders. In addition, three of the seven peak finders outperform the remaining four, and running PFMS with these three returns even more accurate results. Another added value of PFMS is a separate report of the peaks returned by each of the included peak finders.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
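    A minimal sketch of the consensus idea behind PFMS as described in item 62: peaks reported by several peak finders are pooled and only regions supported by a minimum number of finders are kept. This is a generic majority-vote intersection over BED-style intervals, not the actual PFMS implementation; the coordinates and support threshold are hypothetical.

      # Generic consensus over BED-style intervals: keep bases supported by >= MIN_SUPPORT finders.
      from collections import defaultdict

      peak_sets = {                      # hypothetical output of three peak finders (chrom, start, end)
          "finderA": [("chr1", 100, 200), ("chr1", 500, 650)],
          "finderB": [("chr1", 120, 210), ("chr1", 900, 950)],
          "finderC": [("chr1", 110, 190), ("chr1", 520, 600)],
      }
      MIN_SUPPORT = 2

      coverage = defaultdict(int)
      for peaks in peak_sets.values():
          for chrom, start, end in peaks:
              for pos in range(start, end):          # fine for a sketch; use interval trees at scale
                  coverage[(chrom, pos)] += 1

      supported = sorted(p for p, c in coverage.items() if c >= MIN_SUPPORT)
      consensus, current = [], None
      for chrom, pos in supported:
          if current and current[0] == chrom and pos == current[2]:
              current[2] = pos + 1                   # extend the open interval
          else:
              if current:
                  consensus.append(tuple(current))
              current = [chrom, pos, pos + 1]
      if current:
          consensus.append(tuple(current))

      print(consensus)   # e.g. [('chr1', 110, 200), ('chr1', 520, 600)]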
  • 63
    Publication Date: 2013-09-26
    Description: Background: Concept recognition is an essential task in biomedical information extraction, presenting several complex and unsolved challenges. The development of such solutions is typically performed in an ad-hoc manner or using general information extraction frameworks, which are not optimized for the biomedical domain and normally require the integration of complex external libraries and/or the development of custom tools. Results: This article presents Neji, an open source framework optimized for biomedical concept recognition built around four key characteristics: modularity, scalability, speed, and usability. It integrates modules for biomedical natural language processing, such as sentence splitting, tokenization, lemmatization, part-of-speech tagging, chunking and dependency parsing. Concept recognition is provided through dictionary matching and machine learning with normalization methods. Neji also integrates an innovative concept tree implementation, supporting overlapped concept names and respective disambiguation techniques. The most popular input and output formats, namely Pubmed XML, IeXML, CoNLL and A1, are also supported. On top of the built-in functionalities, developers and researchers can implement new processing modules or pipelines, or use the provided command-line interface tool to build their own solutions, applying the most appropriate techniques to identify heterogeneous biomedical concepts. Neji was evaluated against three gold standard corpora with heterogeneous biomedical concepts (CRAFT, AnEM and NCBI disease corpus), achieving high performance results on named entity recognition (F1-measure for overlap matching: species 95%, cell 92%, cellular components 83%, gene and proteins 76%, chemicals 65%, biological processes and molecular functions 63%, disorders 85%, and anatomical entities 82%) and on entity normalization (F1-measure for overlap name matching and correct identifier included in the returned list of identifiers: species 88%, cell 71%, cellular components 72%, gene and proteins 64%, chemicals 53%, and biological processes and molecular functions 40%). Neji provides fast and multi-threaded data processing, annotating up to 1200 sentences/second when using dictionary-based concept identification. Conclusions: Considering the provided features and underlying characteristics, we believe that Neji is an important contribution to the biomedical community, streamlining the development of complex concept recognition solutions. Neji is freely available at http://bioinformatics.ua.pt/neji.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 64
    Publication Date: 2013-09-26
    Description: Background: Identifying genetic variants associated with complex human diseases is a great challenge in genome-wide association studies (GWAS). Single nucleotide polymorphisms (SNPs) arising from genetic background are often dependent. The existing methods, i.e., local index of significance (LIS) and pooled local index of significance (PLIS), were both proposed for modeling SNP dependence and assumed that the whole chromosome follows a hidden Markov model (HMM). However, the fact that SNP data are often collected from separate heterogeneous regions of a single chromosome suggests that different chromosomal regions may follow different HMMs. In this research, we developed a data-driven penalized criterion combined with a dynamic programming algorithm to find change points that divide the whole chromosome into more homogeneous regions. Furthermore, we extended PLIS to analyze the dependent tests obtained from multiple chromosomes with different regions for GWAS. Results: The simulation results show that our new criterion can improve the performance of the model selection procedure and that our region-specific PLIS (RSPLIS) method is better than PLIS at detecting disease-associated SNPs when there are multiple change points along a chromosome. Our method has been used to analyze the Daly study, and compared with PLIS, RSPLIS yielded results that more accurately detected disease-associated SNPs. Conclusions: The genomic rankings based on our method differ from the rankings based on PLIS. Specifically, for the detection of genetic variants with weak effect sizes, the RSPLIS method was able to rank them more efficiently and with greater power.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
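    Item 64 describes a penalized criterion combined with dynamic programming to split a chromosome into homogeneous regions. The sketch below shows a generic penalized change-point dynamic program over a one-dimensional score sequence (segment cost = within-segment sum of squared deviations, fixed penalty per change point); it only illustrates that class of algorithm and is not the authors' data-driven criterion. The data and penalty value are made up.

      # Generic penalized change-point detection by dynamic programming (O(n^2)).
      import numpy as np

      x = np.array([0.1, 0.2, 0.0, 0.15, 2.1, 1.9, 2.2, 2.0, 0.05, 0.1])
      PENALTY = 1.0
      n = len(x)

      def seg_cost(i, j):                 # cost of segment x[i:j]
          seg = x[i:j]
          return float(((seg - seg.mean()) ** 2).sum())

      best = np.full(n + 1, np.inf)       # best[j] = optimal cost of x[:j]
      best[0] = -PENALTY                  # so the first segment is not penalised
      back = np.zeros(n + 1, dtype=int)
      for j in range(1, n + 1):
          for i in range(j):
              c = best[i] + seg_cost(i, j) + PENALTY
              if c < best[j]:
                  best[j], back[j] = c, i

      # Recover change points by walking back through the table.
      cuts, j = [], n
      while j > 0:
          cuts.append(back[j])
          j = back[j]
      cuts = sorted(c for c in cuts if c > 0)
      print(cuts)   # [4, 8] -> segments x[0:4], x[4:8], x[8:10]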
  • 65
    Publication Date: 2013-09-26
    Description: Background: The use of Gene Ontology (GO) data in protein analyses has largely contributed to the improved outcomes of these analyses. Several GO semantic similarity measures have been proposed in recent years and provide tools that allow the integration of biological knowledge embedded in the GO structure into different biological analyses. There is a need for a unified tool that provides the scientific community with the opportunity to explore these different GO similarity measure approaches and their biological applications. Results: We have developed DaGO-Fun, an online tool available at http://web.cbio.uct.ac.za/ITGOM, which incorporates many different GO similarity measures for exploring, analyzing and comparing GO terms and proteins within the context of GO. It uses GO data and UniProt proteins with their GO annotations as provided by the Gene Ontology Annotation (GOA) project to precompute GO term information content (IC), enabling rapid response to user queries. Conclusions: The DaGO-Fun online tool presents the advantage of integrating all the relevant IC-based GO similarity measures, including topology- and annotation-based approaches, to facilitate effective exploration of these measures, thus enabling users to choose the most relevant approach for their application. Furthermore, this tool includes several biological applications related to GO semantic similarity scores, including the retrieval of genes based on their GO annotations, the clustering of functionally related genes within a set, and term enrichment analysis.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
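    DaGO-Fun (item 65) is built around information-content (IC) based GO term similarity. As a small illustration of one classic member of that family, the sketch below computes Resnik similarity (the IC of the most informative common ancestor) on a tiny hand-made ontology; the term identifiers, ancestry relations and annotation counts are hypothetical, and this is not the DaGO-Fun code.

      # Resnik-style semantic similarity on a toy ontology.
      # IC(t) = -log p(t), where p(t) is the fraction of annotations at or below t.
      import math

      ancestors = {                       # each term mapped to itself plus all its ancestors (toy data)
          "GO:A": {"GO:A", "GO:root"},
          "GO:B": {"GO:B", "GO:A", "GO:root"},
          "GO:C": {"GO:C", "GO:A", "GO:root"},
          "GO:root": {"GO:root"},
      }
      annotation_counts = {"GO:root": 100, "GO:A": 40, "GO:B": 10, "GO:C": 5}
      TOTAL = annotation_counts["GO:root"]

      def ic(term):
          return -math.log(annotation_counts[term] / TOTAL)

      def resnik(t1, t2):
          common = ancestors[t1] & ancestors[t2]
          return max(ic(t) for t in common)

      print(round(resnik("GO:B", "GO:C"), 3))   # IC of GO:A = -ln(0.4) ~= 0.916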
  • 66
    Publication Date: 2013-09-26
    Description: Background: Existing tools to model cell growth curves do not offer a flexible integrative approach to manage large datasets and automatically estimate parameters. Given the increase of experimental time-series from microbiology and oncology, software that allows researchers to easily organize experimental data and simultaneously extract relevant parameters in an efficient way is crucial. Results: BGFit provides a web-based unified platform where a rich set of dynamic models can be fitted to experimental time-series data, further allowing users to efficiently manage the results in a structured and hierarchical way. The data management system allows users to organize projects, experiments and measurement data, and also to define teams with different editing and viewing permissions. Several dynamic and algebraic models are already implemented, such as polynomial regression, Gompertz, Baranyi, Logistic and Live Cell Fraction models, and users can easily add new models, thus extending the current set. Conclusions: BGFit allows users to easily manage their data and models in an integrated way, even if they are not familiar with databases or existing computational tools for parameter estimation. BGFit is designed with a flexible architecture that focuses on extensibility and leverages free software with existing tools and methods, allowing users to compare and evaluate different data modeling techniques. The application is described in the context of fitting bacterial and tumor cell growth data, but it is also applicable to any type of two-dimensional data, e.g. physical chemistry and macroeconomic time series, being fully scalable to a large number of projects, data volumes and model complexity.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
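    BGFit (item 66) fits growth models such as Gompertz to time-series data. The snippet below is a minimal, stand-alone sketch of that kind of fit using SciPy's curve_fit with a common three-parameter Gompertz form; the time points, measurements and starting values are invented, and the parameterisation may differ from the one used in BGFit.

      # Fit a three-parameter Gompertz curve y(t) = A * exp(-exp(-k * (t - t0)))
      # to made-up growth measurements using non-linear least squares.
      import numpy as np
      from scipy.optimize import curve_fit

      def gompertz(t, A, k, t0):
          return A * np.exp(-np.exp(-k * (t - t0)))

      t = np.array([0, 2, 4, 6, 8, 10, 12, 14], dtype=float)       # hours
      y = np.array([0.05, 0.1, 0.3, 0.9, 1.6, 2.0, 2.2, 2.25])     # e.g. optical density

      params, _ = curve_fit(gompertz, t, y, p0=[2.0, 0.5, 5.0])
      A, k, t0 = params
      print(f"A={A:.2f}, k={k:.2f}, t0={t0:.2f}")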
  • 67
    Publication Date: 2013-01-17
    Description: Background: Cross-species comparisons of gene neighborhoods (also called genomic contexts) in microbes may provide insight into determining functionally related or co-regulated sets of genes, suggest annotations of previously un-annotated genes, and help to identify horizontal gene transfer events across microbial species. Existing tools to investigate genomic contexts, however, lack features for dynamically comparing and exploring genomic regions from multiple species. As DNA sequencing technologies improve and the number of whole sequenced microbial genomes increases, a user-friendly genome context comparison platform designed for use by a broad range of users promises to satisfy a growing need in the biological community. Results: Here we present JContextExplorer: a tool that organizes genomic contexts into branching diagrams. We implement several alternative context-comparison and tree rendering algorithms, and allow for easy transitioning between different clustering algorithms. To facilitate genomic context analysis, our tool implements GUI features, such as text search filtering, point-and-click interrogation of individual contexts, and genomic visualization via a multi-genome browser. We demonstrate a use case of our tool by attempting to resolve annotation ambiguities between two highly homologous yet functionally distinct genes in a set of 22 alpha and gamma proteobacteria. Conclusions: JContextExplorer should enable a broad range of users to analyze and explore genomic contexts. The program has been tested on Windows, Mac, and Linux operating systems, and is implemented both as an executable JAR file and via Java Web Start. Program executables, source code, and documentation are available at http://www.bme.ucdavis.edu/facciotti/resources_data/software/
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 68
    Publication Date: 2013-01-17
    Description: Background: Negation occurs frequently in scientific literature, especially in biomedical literature. It has previously been reported that around 13% of sentences found in biomedical research articles contain negation. Historically, the main motivation for identifying negated events has been to ensure their exclusion from lists of extracted interactions. However, recently, there has been a growing interest in negative results, which has resulted in negation detection being identified as a key challenge in biomedical relation extraction. In this article, we focus on the problem of identifying negated bio-events, given gold standard event annotations. Results: We have conducted a detailed analysis of three open access bio-event corpora containing negation information (i.e., GENIA Event, BioInfer and BioNLP'09 ST), and have identified the main types of negated bio-events. We have analysed the key aspects of a machine learning solution to the problem of detecting negated events, including selection of negation cues, feature engineering and the choice of learning algorithm. Combining the best solutions for each aspect of the problem, we propose a novel framework for the identification of negated bio-events. We have evaluated our system on each of the three open access corpora mentioned above. The performance of the system significantly surpasses the best results previously reported on the BioNLP'09 ST corpus, and achieves even better results on the GENIA Event and BioInfer corpora, both of which contain more varied and complex events. Conclusions: Recently, in the field of biomedical text mining, the development and enhancement of event-based systems has received significant interest. The ability to identify negated events is a key performance element for these systems. We have conducted the first detailed study on the analysis and identification of negated bio-events. Our proposed framework can be integrated with state-of-the-art event extraction systems. The resulting systems will be able to extract bio-events with attached polarities from textual documents, which can serve as the foundation for more elaborate systems that are able to detect mutually contradicting bio-events.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 69
    Publication Date: 2013-01-17
    Description: Background: Detection of low abundance metabolites is important for de novo mapping of metabolic pathways related to diet, microbiome or environmental exposures. Multiple algorithms are available to extract m/z features from liquid chromatography-mass spectral data in a conservative manner, which tends to preclude detection of low abundance chemicals and chemicals found in small subsets of samples. The present study provides software to enhance such algorithms for feature detection, quality assessment, and annotation. Results: xMSanalyzer is a set of utilities for automated processing of metabolomics data. The utilities can be classified into four main modules to: 1) improve feature detection for replicate analyses by systematic re-extraction with multiple parameter settings and data merger to optimize the balance between sensitivity and reliability, 2) evaluate sample quality and feature consistency, 3) detect feature overlap between datasets, and 4) characterize high-resolution m/z matches to small molecule metabolites and biological pathways using multiple chemical databases. The package was tested with plasma samples and shown to more than double the number of features extracted while improving quantitative reliability of detection. MS/MS analysis of a random subset of peaks that were exclusively detected using xMSanalyzer confirmed that the optimization scheme improves detection of real metabolites. Conclusions: xMSanalyzer is a package of utilities for data extraction, quality control assessment, detection of overlapping and unique metabolites in multiple datasets, and batch annotation of metabolites. The program was designed to integrate with existing packages such as apLCMS and XCMS, but the framework can also be used to enhance data extraction for other LC/MS data software.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 70
    Publication Date: 2013-01-17
    Description: Background: Kernel-based classification is the current state-of-the-art for extracting pairs of interacting proteins (PPIs) from free text. Various proposals have been put forward, which diverge especially in the specific kernel function, the type of input representation, and the feature sets. These proposals are regularly compared to each other regarding their overall performance on different gold standard corpora, but little is known about their respective performance on the instance level. Results: We report on a detailed analysis of the shared characteristics and the differences between 13 current methods using five PPI corpora. We identified a large number of rather difficult (misclassified by most methods) and easy (correctly classified by most methods) PPIs. We show that kernels using the same input representation perform similarly on these pairs and that building ensembles using dissimilar kernels leads to significant performance gain. However, our analysis also reveals that characteristics shared between difficult pairs are few, which lowers the hope that new methods, if built along the same lines as current ones, will deliver breakthroughs in extraction performance. Conclusions: Our experiments show that current methods do not seem to do very well in capturing the shared characteristics of positive PPI pairs, which must also be attributed to the heterogeneity of the (still very few) available corpora. Our analysis suggests that performance improvements should be sought in novel feature sets rather than in novel kernel functions.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 71
    Publication Date: 2013-01-17
    Description: Background: Structure-based clustering is commonly used to identify correct protein folds among candidate folds (also called decoys) generated by protein structure prediction programs. However, traditional clustering methods exhibit a poor runtime performance on large decoy sets. We hypothesized that a more efficient "partial" clustering approach in combination with an improved scoring scheme could significantly improve both the speed and performance of existing candidate selection methods. Results: We propose a new scheme that performs rapid but incomplete clustering on protein decoys. Our method detects structurally similar decoys (measured using either Calpha RMSD or GDT-TS score) and extracts representatives from them without assigning every decoy to a cluster. We integrated our new clustering strategy with several different scoring functions to assess both the performance and speed in identifying correct or near-correct folds. Experimental results on 35 Rosetta decoy sets and 40 I-TASSER decoy sets show that our method can improve the correct fold detection rate as assessed by two different quality criteria. This improvement is significantly better than two recently published clustering methods, Durandal and Calibur-lite. Speed and efficiency testing shows that our method can handle much larger decoy sets and is up to 22 times faster than Durandal and Calibur-lite. Conclusions: The new method, named HS-Forest, avoids the computationally expensive task of clustering every decoy, yet still allows superior correct-fold selection. Its improved speed, efficiency and decoy-selection performance should enable structure prediction researchers to work with larger decoy sets and significantly improve their ab initio structure prediction performance.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 72
    Publication Date: 2013-01-17
    Description: Background: Traditional methods for computational motif discovery often suffer from poor performance. In particular, methods that search for sequence matches to known binding motifs tend to predict many non-functional binding sites because they fail to take into consideration the biological state of the cell. In recent years, genome-wide studies have generated a lot of data that has the potential to improve our ability to identify functional motifs and binding sites, such as information about chromatin accessibility and epigenetic states in different cell types. However, it is not always trivial to make use of this data in combination with existing motif discovery tools, especially for researchers who are not skilled in bioinformatics programming. Results: Here we present MotifLab, a general workbench for analysing regulatory sequence regions and discovering transcription factor binding sites and cis-regulatory modules. MotifLab supports comprehensive motif discovery and analysis by allowing users to integrate several popular motif discovery tools as well as different kinds of additional information, including phylogenetic conservation, epigenetic marks, DNase hypersensitive sites, ChIP-Seq data, positional binding preferences of transcription factors, transcription factor interactions and gene expression. MotifLab offers several data-processing operations that can be used to create, manipulate and analyse data objects, and complete analysis workflows can be constructed and automatically executed within MotifLab, including graphical presentation of the results. Conclusions: We have developed MotifLab as a flexible workbench for motif analysis in a genomic context. The flexibility and effectiveness of this workbench has been demonstrated on selected test cases, in particular two previously published benchmark data sets for single motifs and modules, and a realistic example of genes responding to treatment with forskolin. MotifLab is freely available at http://www.motiflab.org.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 73
    Publication Date: 2013-01-17
    Description: Background: Recent studies of transcription activator-like (TAL) effector domains fused to nucleases (TALENs) demonstrate enormous potential for genome editing. Effective design of TALENs requires a combination of selecting appropriate genetic features, finding pairs of binding sites based on a consensus sequence, and, in some cases, identifying endogenous restriction sites for downstream molecular genetic applications. Results: We present the web-based program Mojo Hand for designing TAL and TALEN constructs for genome editing applications (www.talendesign.org). We describe the algorithm and its implementation. The features of Mojo Hand include (1) automatic download of genomic data from the National Center for Biotechnology Information, (2) analysis of any DNA sequence to reveal pairs of binding sites based on a user-defined template, (3) selection of restriction-enzyme recognition sites in the spacer between the TAL monomer binding sites, including options for the selection of restriction enzyme suppliers, and (4) output files designed for subsequent TALEN construction using the Golden Gate assembly method. Conclusions: Mojo Hand enables the rapid identification of TAL binding sites for use in TALEN design. The assembly of TALEN constructs is also simplified by using the TAL-site prediction program in conjunction with a spreadsheet-based aid for managing reagent concentrations and TALEN formulation. Mojo Hand enables scientists to more rapidly deploy TALENs for genome editing applications.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
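    Mojo Hand (item 73) searches a sequence for pairs of TAL binding sites separated by a suitable spacer. The following sketch shows the general flavour of such a scan, assuming the common convention that each TAL monomer binding site is preceded by a T (so the forward-strand half-site starts after a T, and the reverse-strand half-site is followed by an A on the forward strand); the site length, spacer range and example sequence are hypothetical, and this is not the Mojo Hand algorithm.

      # Toy scan for TALEN half-site pairs with a spacer of allowed length.
      SITE_LEN = 15
      SPACER_RANGE = range(14, 19)

      seq = "TACGATCGGATTACCAGGTTACGATCGATCGATTACGGATCCAATTGGCCAATGCAGGTTACGA"

      pairs = []
      for i, base in enumerate(seq):
          if base != "T":                                   # left half-site must follow a T
              continue
          left = (i + 1, i + 1 + SITE_LEN)
          for spacer in SPACER_RANGE:
              right = (left[1] + spacer, left[1] + spacer + SITE_LEN)
              a_pos = right[1]                              # reverse-strand T shows up as an A here
              if a_pos < len(seq) and seq[a_pos] == "A":
                  pairs.append({"left": left, "spacer": spacer, "right": right})

      print(f"{len(pairs)} candidate site pairs found")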
  • 74
    Publication Date: 2013-01-17
    Description: Background: The digitization of biodiversity data is leading to the widespread application of taxon names that are superfluous, ambiguous or incorrect, resulting in mismatched records and inflated species numbers. The ultimate consequences of misspelled names and bad taxonomy are erroneous scientific conclusions and faulty policy decisions. The lack of tools for correcting this 'names problem' has become a fundamental obstacle to integrating disparate data sources and advancing the progress of biodiversity science. Results: The TNRS, or Taxonomic Name Resolution Service, is an online application for automated and user-supervised standardization of plant scientific names. The TNRS builds upon and extends existing open-source applications for name parsing and fuzzy matching. Names are standardized against multiple reference taxonomies, including the Missouri Botanical Garden's Tropicos database. Capable of processing thousands of names in a single operation, the TNRS parses and corrects misspelled names and authorities, standardizes variant spellings, and converts nomenclatural synonyms to accepted names. Family names can be included to increase match accuracy and resolve many types of homonyms. Partial matching of higher taxa combined with extraction of annotations, accession numbers and morphospecies allows the TNRS to standardize taxonomy across a broad range of active and legacy datasets. Conclusions: We show how the TNRS can resolve many forms of taxonomic semantic heterogeneity, correct spelling errors and eliminate spurious names. As a result, the TNRS can aid the integration of disparate biological datasets. Although the TNRS was developed to aid in standardizing plant names, its underlying algorithms and design can be extended to all organisms and nomenclatural codes. The TNRS is accessible via a web interface at http://tnrs.iplantcollaborative.org/ and as a RESTful web service and application programming interface. Source code is available at https://github.com/iPlantCollaborativeOpenSource/TNRS/.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
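    The TNRS (item 74) standardises plant names by parsing and fuzzy-matching them against reference taxonomies. The snippet below illustrates only the fuzzy-matching step, using Python's standard difflib against a tiny hand-made list of accepted names; real TNRS matching is far more sophisticated (authority parsing, synonymy, multiple reference taxonomies), and the names and cutoff here are invented.

      # Toy fuzzy matching of misspelled plant names against a reference list.
      import difflib

      accepted = ["Quercus robur", "Quercus rubra", "Acer saccharum", "Betula pendula"]
      submitted = ["Quercus robor", "Acer sacharum", "Betula pendula"]

      for name in submitted:
          match = difflib.get_close_matches(name, accepted, n=1, cutoff=0.8)
          best = match[0] if match else None
          score = difflib.SequenceMatcher(None, name, best).ratio() if best else 0.0
          print(f"{name!r:22} -> {best!r:22} (similarity {score:.2f})")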
  • 75
    Publication Date: 2013-01-17
    Description: Background: Gene fusions are the result of chromosomal aberrations and encode chimeric RNA (fusion transcripts) that play an important role in cancer genesis. Recent advances in high throughput transcriptome sequencing have given rise to computational methods for new fusion discovery. The ability to simulate fusion transcripts is essential for testing and improving those tools. Results: To address this need, we developed FUSIM (FUsion SIMulator), a software tool for simulating fusion transcripts. The simulation of events known to create fusion genes and their resulting chimeric proteins is supported, including inter-chromosome translocation, trans-splicing, complex chromosomal rearrangements, and transcriptional read-through events.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 76
    Publication Date: 2013-01-17
    Description: Background: A standard graphical notation is essential to facilitate exchange of network representations of biological processes. Towards this end, the Systems Biology Graphical Notation (SBGN) has been proposed, and it is already supported by a number of tools. However, support for SBGN in Cytoscape, one of the most widely used platforms in biology to visualise and analyse networks, is limited, and in particular it is not possible to import SBGN diagrams. Results: We have developed CySBGN, a Cytoscape plug-in that extends the use of Cytoscape visualisation and analysis features to SBGN maps. CySBGN adds support for Cytoscape users to visualize any of the three complementary SBGN languages: Process Description, Entity Relationship, and Activity Flow. The interoperability with other tools (CySBML plug-in and Systems Biology Format Converter) was also established, allowing an automated generation of SBGN diagrams based on previously imported SBML models. The plug-in was tested using a suite of 53 different test cases that covers almost all possible entities, shapes, and connections. A rendering comparison with other tools that support SBGN was performed. To illustrate the interoperability with other Cytoscape functionalities, we present two analysis examples, shortest path calculation, and motif identification in a metabolic network. Conclusions: CySBGN imports, modifies and analyzes SBGN diagrams in Cytoscape, and thus allows the application of the large palette of tools and plug-ins in this platform to networks and pathways in SBGN format.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 77
    Publication Date: 2013-01-18
    Description: Background: Amyloids are proteins capable of forming fibrils. Many of them underlie serious diseases, such as Alzheimer disease. The number of amyloid-associated diseases is constantly increasing. Recent studies indicate that amyloidogenic properties can be associated with short segments of amino acids, which transform the structure when exposed. A few hundred such peptides have been found experimentally; experimental testing of all possible amino acid combinations is currently not feasible. Instead, they can be predicted by computational methods. The 3D profile is a physicochemistry-based method that has generated the most numerous dataset, ZipperDB. However, it is computationally very demanding. Here, we show that dataset generation can be accelerated. Two methods to increase the classification efficiency of amyloidogenic candidates are presented and tested: simplified 3D profile generation and machine learning methods. Results: We generated a new dataset of hexapeptides using a modified 3D profile algorithm, which showed very good classification overlap with ZipperDB (93.5%). The new part of our dataset contains 1779 segments, with 204 classified as amyloidogenic. The dataset of 6-residue sequences with their binary classification, based on the energy of the segment, was applied for training machine learning methods. A separate set of sequences from ZipperDB was used as a test set. The most effective methods were Alternating Decision Tree and Multilayer Perceptron. Both methods obtained an area under the ROC curve of 0.96, accuracy 91%, true positive rate ca. 78%, and true negative rate 95%. A few other machine learning methods also achieved good performance. The computational time was reduced from 18-20 CPU-hours (full 3D profile) to 0.5 CPU-hours (simplified 3D profile) to seconds (machine learning). Conclusions: We showed that the simplified profile generation method does not introduce an error with regard to the original method, while increasing the computational efficiency. Our new dataset proved representative enough to use simple statistical methods for testing amyloidogenicity based only on six-letter sequences. Statistical machine learning methods such as Alternating Decision Tree and Multilayer Perceptron can replace the energy-based classifier, with the advantage of very significantly reduced computational time and simplicity of analysis. Additionally, a decision tree provides a set of very easily interpretable rules.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 78
    Publication Date: 2013-01-18
    Description: Background: The Sequence Read Archive (SRA) is the largest public repository of sequencing data from the next generation of sequencing platforms including Illumina (Genome Analyzer, HiSeq, MiSeq, etc.), Roche 454 GS System, Applied Biosystems SOLiD System, Helicos Heliscope, PacBio RS, and others. Results: SRAdb is an attempt to make querying the metadata associated with SRA submission, study, sample, experiment and run more robust and precise, and to make access to sequencing data in the SRA easier. We have parsed all the SRA metadata into a SQLite database that is routinely updated and can be easily distributed. The SRAdb R/Bioconductor package then utilizes this SQLite database for querying and accessing metadata. Full text search functionality makes querying metadata very flexible and powerful. Fastq files associated with query results can be downloaded easily for local analysis. The package also includes an interface from R to a popular genome browser, the Integrative Genomics Viewer. Conclusions: The SRAdb Bioconductor package provides a convenient and integrated framework to query and access SRA metadata quickly and powerfully from within R.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
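    SRAdb (item 78) exposes SRA submission metadata through a SQLite file queried from R/Bioconductor. The same file can in principle be opened from any language with a SQLite driver; the Python sketch below shows the general idea only, and the file name, table name and column names are assumptions for illustration that may not match the actual SRAmetadb schema.

      # Illustrative query against a local SQLite copy of SRA metadata.
      # The database path, table and column names below are hypothetical.
      import sqlite3

      conn = sqlite3.connect("SRAmetadb.sqlite")          # assumed local metadata file
      cur = conn.cursor()
      cur.execute(
          """
          SELECT run_accession, study_accession, experiment_title
          FROM sra                                         -- assumed table name
          WHERE experiment_title LIKE ?
          LIMIT 10
          """,
          ("%RNA-Seq%",),
      )
      for run_acc, study_acc, title in cur.fetchall():
          print(run_acc, study_acc, title)
      conn.close()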
  • 79
    Publication Date: 2013-01-19
    Description: Background: Protein pairs that have the same secondary structure packing arrangement but have different topologies have attracted much attention in terms of both evolution and physical chemistry of protein structures. Further investigation of such protein relationships would give us a hint as to how proteins can change their fold in the course of evolution, as well as an insight into the physico-chemical properties of secondary structure packing. For this purpose, highly accurate sequence order independent structure comparison methods are needed. Results: We have developed a novel protein structure alignment algorithm, MICAN (a structure alignment algorithm that can handle Multiple-chain complexes, Inverse direction of secondary structures, Ca only models, Alternative alignments, and Non-sequential alignments). The algorithm was designed so as to identify the best structural alignment between protein pairs by disregarding the connectivity between secondary structure elements (SSEs). One of the key features of the algorithm is the use of a multiple vector representation for each SSE, which enables us to correctly treat the bent or twisted nature of long SSEs. We compared MICAN with 9 other publicly available structure alignment programs, using both reference-dependent and reference-independent evaluation methods on a variety of benchmark test sets which include both sequential and non-sequential alignments. We show that MICAN outperforms the other existing methods for reproducing reference alignments of non-sequential test sets. Further, although MICAN does not specialize in sequential structure alignment, it showed top-level performance on the sequential test sets. We also show that MICAN is the fastest non-sequential structure alignment program among all the programs we examined here. Conclusions: MICAN is the fastest and the most accurate program among the non-sequential alignment programs we examined here. These results suggest that MICAN is a highly effective tool for automatically detecting non-trivial structural relationships of proteins, such as circular permutations and segment-swapping, many of which have so far been identified manually by human experts. The source code of MICAN is freely downloadable at http://www.tbp.cse.nagoya-u.ac.jp/MICAN.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 80
    Publication Date: 2013-01-20
    Description: Background: Maximum Likelihood (ML)-based phylogenetic inference using Felsenstein's pruning algorithm is a standard method for estimating the evolutionary relationships amongst a set of species based on DNA sequence data, and is used in popular applications such as RAxML, PHYLIP, GARLI, BEAST, and MrBayes. The Phylogenetic Likelihood Function (PLF) and its associated scaling and normalization steps comprise the computational kernel for these tools. These computations are data intensive but contain fine grain parallelism that can be exploited by coprocessor architectures such as FPGAs and GPUs. A general purpose API called BEAGLE has recently been developed that includes optimized implementations of Felsenstein's pruning algorithm for various data parallel architectures. In this paper, we extend the BEAGLE API to a multiple Field Programmable Gate Array (FPGA)-based platform called the Convey HC-1. Results: The core calculation of our implementation, which includes both the phylogenetic likelihood function (PLF) and the tree likelihood calculation, has an arithmetic intensity of 130 floating-point operations per 64 bytes of I/O, or 2.03 ops/byte. Its performance can thus be calculated as a function of the host platform's peak memory bandwidth and the implementation's memory efficiency, as 2.03 x peak bandwidth x memory efficiency. Our FPGA-based platform has a peak bandwidth of 76.8 GB/s and our implementation achieves a memory efficiency of approximately 50%, which gives an average throughput of 78 Gflops. This represents a ~40X speedup when compared with BEAGLE's CPU implementation on a dual Xeon 5520 and 3X speedup versus BEAGLE's GPU implementation on a Tesla T10 GPU for very large data sizes. The power consumption is 92 W, yielding a power efficiency of 1.7 Gflops per Watt. Conclusions: The use of data parallel architectures to achieve high performance for likelihood-based phylogenetic inference requires high memory bandwidth and a design methodology that emphasizes high memory efficiency. To achieve this objective, we integrated 32 pipelined processing elements (PEs) across four FPGAs. For the design of each PE, we developed a specialized synthesis tool to generate a floating-point pipeline with resource and throughput constraints to match the target platform. We have found that using low-latency floating-point operators can significantly reduce FPGA area and still meet timing requirement on the target platform. We found that this design methodology can achieve performance that exceeds that of a GPU-based coprocessor.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
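    The throughput estimate quoted in item 80 follows directly from the stated arithmetic intensity, peak memory bandwidth and memory efficiency. The small calculation below simply reproduces that arithmetic with the numbers taken from the abstract:

      # Reproduce the throughput estimate: throughput = intensity * peak bandwidth * efficiency.
      ops_per_byte = 130 / 64          # 130 floating-point operations per 64 bytes of I/O ~= 2.03
      peak_bandwidth_gb_s = 76.8       # GB/s on the FPGA platform
      memory_efficiency = 0.5          # ~50% of peak bandwidth achieved

      throughput_gflops = ops_per_byte * peak_bandwidth_gb_s * memory_efficiency
      print(f"{throughput_gflops:.0f} Gflops")   # ~78 Gflops, matching the abstract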
  • 81
    Publication Date: 2013-01-18
    Description: Background: Most phylogeny analysis methods based on molecular sequences use multiple alignment, where the quality of the alignment, which depends on the alignment parameters, determines the accuracy of the resulting trees. Different parameter combinations chosen for the multiple alignment may result in different phylogenies. A new non-alignment based approach, the Relative Complexity Measure (RCM), has been introduced to tackle this problem and proven to work in fungi and mitochondrial DNA. Results: In this work, we present an application of the RCM method to reconstruct robust phylogenetic trees using sequence data for the genus Galanthus obtained from different regions in Turkey. Phylogenies have been analyzed using nuclear and chloroplast DNA sequences. Results showed that the tree obtained from nuclear ribosomal RNA gene sequences was more robust, while the tree obtained from the chloroplast DNA showed a higher degree of variation. Conclusions: Phylogenies generated by the Relative Complexity Measure were found to be robust, and the results of RCM were more reliable than those of the compared techniques. In particular, RCM seems to be a reasonable way to overcome MSA-based problems and a good alternative to MSA-based phylogenetic analysis. We believe our method will become a mainstream phylogeny construction method, especially for highly variable sequence families where the accuracy of the MSA heavily depends on the alignment parameters.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 82
    Publication Date: 2013-02-22
    Description: Background: The learning active subnetworks problem involves finding subnetworks of a bio-molecular network that are active in a particular condition. Many approaches integrate observation data (e.g., gene expression) with the network topology to find candidate subnetworks. Increasingly, pathway databases contain additional annotation information that can be mined to improve prediction accuracy, such as the interaction mechanism (e.g., transcription, microRNA, cleavage) annotations. We introduce a mechanism-based approach to active subnetwork recovery which exploits such annotations. We suggest that neighboring interactions in a network tend to be co-activated in a way that depends on the "correlation" of their mechanism annotations; for example, neighboring phosphorylation and de-phosphorylation interactions may be more likely to be co-activated than neighboring phosphorylation and covalent bonding interactions. Results: Our method iteratively learns the mechanism correlations and finds the most likely active subnetwork. We use a probabilistic graphical model with a Markov Random Field component, which creates dependencies between the states (active or non-active) of neighboring interactions and incorporates a mechanism-based component into the model. We apply a heuristic EM-based algorithm suited to the problem. We validated our method's performance using simulated data in networks downloaded from GeneGO, against the same approach without the mechanism-based component and against two other existing methods. We assessed how well these methods recover (1) the true interaction states and (2) global network properties of the original network. We applied our method to networks generated from time-course gene expression studies in angiogenesis and lung organogenesis and validated the findings from a biological perspective against the current literature. Discussion: The advantage of our mechanism-based approach is best seen in networks composed of connected regions with a large number of interactions annotated with a subset of mechanisms, e.g., a regulatory region of transcription interactions, or a cleavage cascade region. When applied to real datasets, our method recovered novel and biologically meaningful putative interactions, e.g., interactions from an integrin signaling pathway using the angiogenesis dataset, and a group of regulatory microRNA interactions in an organogenesis network.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 83
    Publication Date: 2013-02-21
    Description: Background: There has been increasing emphasis on evidence-based approaches to improve patient outcomes through rigorous, standardised and well-validated approaches. Clinical guidelines drive this process and are largely developed based on the findings of systematic reviews (SRs). This paper presents a discussion of the SR process in providing decisive information to shape and guide clinical practice, using a purpose-built review database (the Cochrane reviews) and focusing on a highly prevalent medical condition: hypertension. Methods: We searched the Cochrane database and identified 25 relevant SRs incorporating 443 clinical trials. Reviews with the terms 'blood pressure' or 'hypertension' in the title were included. Once selected for inclusion, the abstracts were assessed independently by two authors for their capacity to inform and influence clinical decision-making. The inclusions were independently audited by a third author. Results: Of the 25 SRs that formed the sample, 12 provided conclusive findings to inform a particular treatment pathway. The evidence-based approaches offer the promise of assisting clinical decision-making through clarity, but in the case of management of blood pressure, half of the SRs in our sample highlight gaps in evidence and methodological limitations. Thirteen reviews were inconclusive, and eight, including four of the 12 conclusive SRs, noted the lack of adequate reporting of potential adverse effects or incidence of harm. Conclusions: These findings emphasise the importance of distillation, interpretation and synthesis of information to assist clinicians. This study questions the utility of evidence-based approaches as a uni-dimensional approach to improving clinical care and underscores the importance of standardised approaches to include adverse events, incidence of harm, patients' needs and preferences, and the clinician's expertise and discretion.
    Electronic ISSN: 1472-6947
    Topics: Computer Science , Medicine
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 84
    Publication Date: 2013-02-21
    Description: Background: To reduce the large public health burden of the high prevalence of depression, preventive interventions targeted at people at risk are essential and can be cost-effective. Web-based interventions are able to provide this care, but there is no agreement on how to best develop these applications and often the technology is seen as a given. This seems to be one of the main reasons that web-based interventions do not reach their full potential. The current study describes the development of a web-based intervention for the indicated prevention of depression, employing the CeHRes (Center for eHealth Research and Disease Management) roadmap. The goals are to create a user-friendly application which fits the values of the stakeholders and to evaluate the process of development. Methods: The employed methods are a literature scan and discussion in the contextual inquiry; interviews, rapid prototyping and a requirement session in the value specification stage; and user-based usability evaluation, expert-based usability inspection and a requirement session in the design stage. Results: The contextual inquiry indicated that there is a need for easily accessible interventions for the indicated prevention of depression and web-based interventions are seen as potentially meeting this need. The value specification stage yielded expected needs of potential participants, comments on the usefulness of the proposed features and comments on two proposed designs of the web-based intervention. The design stage yielded valuable comments on the system, content and service of the web-based intervention. Conclusions: Overall, we found that by developing the technology, we successfully (re)designed the system, content and service of the web-based intervention to match the values of stakeholders. This study has shown the importance of a structured development process of a web-based intervention for the indicated prevention of depression because: (1) it allows the development team to clarify the needs that have to be met for the intervention to be of use to the target audience; and (2) it yields feedback on the design of the application that is broader than color and buttons, but encompasses comments on the quality of the service that the application offers.
    Electronic ISSN: 1472-6947
    Topics: Computer Science , Medicine
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 85
    Publication Date: 2013-02-22
    Description: Background: For the last 25 years species delimitation in prokaryotes (Archaea and Bacteria) was to a large extent based on DNA-DNA hybridization (DDH), a tedious lab procedure designed in the early 1970s that served its purpose astonishingly well in the absence of deciphered genome sequences. With the rapid progress in genome sequencing, the time has come to directly use the now available and easy-to-generate genome sequences for delimitation of species. GBDP (Genome Blast Distance Phylogeny) infers genome-to-genome distances between pairs of entirely or partially sequenced genomes, a digital, highly reliable estimator for the relatedness of genomes. Its application as an in-silico replacement for DDH was recently introduced. The main challenge in the implementation of such an application is to produce digital DDH values that mimic the wet-lab DDH values as closely as possible to ensure consistency in the prokaryotic species concept. Results: Correlation and regression analyses were used to determine the best-performing methods and the most influential parameters. GBDP was further enriched with a set of new features such as confidence intervals for intergenomic distances obtained via resampling or via the statistical models for DDH prediction, and an additional family of distance functions. As in previous analyses, GBDP obtained the highest agreement with wet-lab DDH among all tested methods, but improved models led to a further increase in the accuracy of DDH prediction. Confidence intervals yielded stable results when inferred from the statistical models, whereas those obtained via resampling showed marked differences between the underlying distance functions. Conclusions: Despite the high accuracy of GBDP-based DDH prediction, inferences from limited empirical data are always associated with a certain degree of uncertainty. It is thus crucial to enrich in-silico DDH replacements with confidence-interval estimation, enabling the user to statistically evaluate the outcomes. Such methodological advancements, easily accessible through the web service at http://ggdc.dsmz.de, are crucial steps towards a consistent and truly genome sequence-based classification of microorganisms.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 86
    Publication Date: 2013-02-23
    Description: Background: Population stratification is a systematic difference in allele frequencies between subpopulations. This can lead to spurious association findings in the case-control genome wide association studies (GWASs) used to identify single nucleotide polymorphisms (SNPs) associated with disease-linked phenotypes. Methods such as self-declared ancestry, ancestry informative markers, genomic control, structured association, and principal component analysis are used to assess and correct population stratification, but each has limitations. We provide an alternative technique to address population stratification. Results: We propose a novel machine learning method, ETHNOPRED, which uses the genotype and ethnicity data from the HapMap project to learn ensembles of disjoint decision trees, capable of accurately predicting an individual's continental and sub-continental ancestry. To predict an individual's continental ancestry, ETHNOPRED produced an ensemble of 3 decision trees involving a total of 10 SNPs, with 10-fold cross validation accuracy of 100% using the HapMap II dataset. We extended this model to involve 29 disjoint decision trees over 149 SNPs, and showed that this ensemble has an accuracy of >= 99.9%, even if some of those 149 SNP values were missing. On an independent dataset, predominantly of Caucasian origin, our continental classifier showed 96.8% accuracy and improved genomic control's lambda from 1.22 to 1.11. We next used the HapMap III dataset to learn classifiers to distinguish European subpopulations (North-Western vs. Southern), East Asian subpopulations (Chinese vs. Japanese), African subpopulations (Eastern vs. Western), North American subpopulations (European vs. Chinese vs. African vs. Mexican vs. Indian), and Kenyan subpopulations (Luhya vs. Maasai). In these cases, ETHNOPRED produced ensembles of 3, 39, 21, 11, and 25 disjoint decision trees, respectively, involving 31, 502, 526, 242 and 271 SNPs, with 10-fold cross validation accuracy of 86.5% +/- 2.4%, 95.6% +/- 3.9%, 95.6% +/- 2.1%, 98.3% +/- 2.0%, and 95.9% +/- 1.5%. However, ETHNOPRED was unable to produce a classifier that can accurately distinguish Chinese in Beijing vs. Chinese in Denver. Conclusions: ETHNOPRED is a novel technique for producing classifiers that can identify an individual's continental and sub-continental heritage, based on a small number of SNPs. We show that its learned classifiers are simple, cost-efficient, accurate, transparent, flexible, fast, applicable to large scale GWASs, and robust to missing values.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
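    ETHNOPRED (item 86) predicts ancestry with ensembles of decision trees built on disjoint SNP subsets, so that missing genotypes only disable some ensemble members. The sketch below illustrates that general idea with scikit-learn on a tiny synthetic genotype matrix; the data, the number of trees and the SNP partitioning are invented, and this is not the ETHNOPRED method itself.

      # Majority vote over decision trees trained on disjoint SNP subsets (toy data).
      import numpy as np
      from sklearn.tree import DecisionTreeClassifier

      rng = np.random.default_rng(0)
      n_samples, n_snps, n_trees = 60, 12, 3
      X = rng.integers(0, 3, size=(n_samples, n_snps))        # genotypes coded 0/1/2
      y = (X[:, 0] + X[:, 4] + X[:, 8] > 3).astype(int)       # synthetic "ancestry" label

      snp_blocks = np.array_split(np.arange(n_snps), n_trees)  # disjoint SNP subsets
      trees = []
      for block in snp_blocks:
          clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[:, block], y)
          trees.append((block, clf))

      def predict(sample):
          votes = [clf.predict(sample[block].reshape(1, -1))[0] for block, clf in trees]
          return int(round(np.mean(votes)))                   # simple majority vote

      print(predict(X[5]), y[5])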
  • 87
    Publication Date: 2013-02-24
    Description: Background: PAM, a nearest shrunken centroid (NSC) method, is a popular classification method for high-dimensional data. ALP and AHP are NSC algorithms that were proposed to improve upon PAM. The NSC methods base their classification rules on shrunken centroids; in practice the amount of shrinkage is estimated by minimizing the overall cross-validated (CV) error rate. Results: We show that when data are class-imbalanced the three NSC classifiers are biased towards the majority class. The bias is larger when the number of variables or the class imbalance is larger and/or the differences between classes are smaller. To diminish the class-imbalance problem of the NSC classifiers we propose to estimate the amount of shrinkage by maximizing the CV geometric mean of the class-specific predictive accuracies (g-means). Conclusions: The results obtained on simulated and real high-dimensional class-imbalanced data show that our approach outperforms the currently used strategy based on the minimization of the overall error rate when NSC classifiers are biased towards the majority class. The number of variables included in the NSC classifiers when using our approach is much smaller than with the original approach. This result is supported by experiments on simulated and real high-dimensional class-imbalanced data.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
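    Item 87 proposes tuning the shrinkage of NSC classifiers by maximising the cross-validated geometric mean of class-specific accuracies (g-means) instead of minimising overall error. The short sketch below contrasts the two criteria on made-up predictions for an imbalanced two-class problem, illustrating why g-means penalises a classifier that ignores the minority class; the counts are invented.

      # Overall accuracy vs. geometric mean of class-specific accuracies (toy, imbalanced data).
      import numpy as np

      y_true = np.array([0] * 90 + [1] * 10)                 # 90 majority, 10 minority samples
      y_pred = np.array([0] * 90 + [0] * 8 + [1] * 2)        # classifier biased towards class 0

      overall_accuracy = float((y_true == y_pred).mean())
      class_acc = [float((y_pred[y_true == c] == c).mean()) for c in (0, 1)]
      g_mean = float(np.sqrt(class_acc[0] * class_acc[1]))

      print(f"overall accuracy = {overall_accuracy:.2f}")    # 0.92, looks deceptively good
      print(f"per-class accuracy = {class_acc}")             # [1.0, 0.2]
      print(f"g-mean = {g_mean:.2f}")                        # ~0.45, exposes the imbalance problem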
  • 88
    Publication Date: 2013-02-24
    Description: Background: Worldwide structural genomics projects continue to release new protein structures at an unprecedented pace, so far nearly 6000, but only about 60% of these proteins have any sort of functional annotation. Results: We explored a range of features that can be used for the prediction of functional residues given a known three-dimensional structure. These features include various centrality measures of nodes in graphs of interacting residues: closeness, betweenness and page-rank centrality. We also analyzed the distance of functional amino acids to the general center of mass (GCM) of the structure, relative solvent accessibility (RSA), and the use of relative entropy as a measure of sequence conservation. From the selected features, neural networks were trained to identify catalytic residues. We found that using distance to the GCM together with amino acid type provides a good discriminant function when combined independently with sequence conservation. Using an independent test set of 29 annotated protein structures, the method returned 411 of the initial 9262 residues as the most likely to be involved in function. These 411 residues contain 70 of the 111 annotated catalytic residues. This represents an approximately 14-fold enrichment of catalytic residues on the entire input set (corresponding to a sensitivity of 63% and a precision of 17%), a performance competitive with that of other state-of-the-art methods. Conclusions: We found that several of the graph based measures utilize the same underlying feature of protein structures, which can be captured more simply and effectively with the distance-to-GCM definition. This also has the added advantage of simplicity and easy implementation. Meanwhile, sequence conservation remains by far the most influential feature in identifying functional residues. We also found that due to the rapid changes in size and composition of sequence databases, conservation calculations must be recalibrated for specific reference databases.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
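    An illustrative sketch of two of the structure-derived features discussed in the record above, computed from C-alpha coordinates with numpy and networkx: distance of each residue to the general centre of mass, and closeness, betweenness and PageRank centrality on a residue-contact graph. The 8 Å contact cutoff is an assumption for illustration, not a value taken from the paper.

    ```python
    import numpy as np
    import networkx as nx

    def distance_to_gcm(ca_coords):
        """Distance of every residue's C-alpha to the general centre of mass (GCM)."""
        gcm = ca_coords.mean(axis=0)
        return np.linalg.norm(ca_coords - gcm, axis=1)

    def contact_graph_centralities(ca_coords, cutoff=8.0):
        """Centralities on a contact graph connecting residues whose C-alpha atoms
        lie within `cutoff` angstroms of each other."""
        d = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
        g = nx.Graph()
        g.add_nodes_from(range(len(ca_coords)))
        g.add_edges_from((i, j) for i in range(len(d)) for j in range(i + 1, len(d))
                         if d[i, j] < cutoff)
        return (nx.closeness_centrality(g),
                nx.betweenness_centrality(g),
                nx.pagerank(g))
    ```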
  • 89
    Publication Date: 2013-02-27
    Description: Background: A phylogeny postulates shared ancestry relationships among organisms in the form of a binary tree. Phylogenies attempt to answer an important question posed in biology: what are the ancestor-descendent relationships between organisms? At the core of every biological problem lies a phylogenetic component. The patterns that can be observed in nature are the product of complex interactions, constrained by the template that our ancestors provide. The problem of simultaneous tree and alignment estimation under Maximum Parsimony is known in combinatorial optimization as the Generalized Tree Alignment Problem (GTAP). The GTAP is the Steiner Tree Problem for the sequence edit distance. Like many biologically interesting problems, the GTAP is NP-Hard. Typically the Steiner Tree is presented under the Manhattan or the Hamming distances. Results: Experimentally, the accuracy of the GTAP has been subjected to evaluation. Results show that phylogenies selected using the GTAP from unaligned sequences are competitive with the best methods and algorithms available. Here, we implement and explore experimentally existing and new local search heuristics for the GTAP using simulated and real data. Conclusions: The methods presented here improve by more than three orders of magnitude in execution time the best local search heuristics existing to date when applied to real data.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 90
    Publication Date: 2013-02-27
    Description: Background: Proteins are the key elements on the path from genetic information to the development of life. The roles played by the different proteins are difficult to uncover experimentally, as this process involves complex procedures such as genetic modifications, injection of fluorescent proteins, gene knock-out methods and others. The knowledge learned from each protein is usually annotated in databases through different methods, such as the one proposed by The Gene Ontology (GO) consortium. Different methods have been proposed in order to predict GO terms from primary structure information, but very few are available for large-scale functional annotation of plants, and reported success rates are much lower than those reported by non-plant predictors. This paper explores the predictability of GO annotations on proteins belonging to the Embryophyta group from a set of features extracted solely from their primary amino acid sequence. Results: High predictability of several GO terms was found for Molecular Function and Cellular Component. As expected, a lower degree of predictability was found for Biological Process ontology annotations, although a few biological processes were easily predicted. Proteins related to transport and transcription were particularly well predicted from primary structure information. The most discriminant features for prediction were those related to the electric charge of the amino acid sequence and hydropathicity-derived features. Conclusions: An analysis of GO-slim term predictability in plants was carried out in order to determine single categories or groups of functions that are most related to primary structure information. For each highly predictable GO term, the features responsible for this success were identified and discussed. In contrast to most published studies, which focus on a few categories or single ontologies, the results in this paper comprise a complete landscape of GO predictability from primary structure, encompassing 75 GO terms at the molecular, cellular and phenotypical levels. Thus, it provides a valuable guide for researchers interested in further advances in protein function prediction in Embryophyta plants. (See the sketch after this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
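    A hedged sketch of the kind of charge- and hydropathicity-derived primary-structure descriptors highlighted in the record above, using Biopython's ProtParam module. The specific descriptors chosen here are illustrative assumptions, not the paper's exact feature set.

    ```python
    from Bio.SeqUtils.ProtParam import ProteinAnalysis

    def charge_hydropathy_features(seq):
        """A few simple descriptors computed from the primary amino acid sequence alone."""
        pa = ProteinAnalysis(seq)
        return {
            "gravy": pa.gravy(),                          # Kyte-Doolittle hydropathicity
            "isoelectric_point": pa.isoelectric_point(),  # proxy for net charge behaviour
            "frac_charged": sum(seq.count(a) for a in "DEKR") / len(seq),
        }

    print(charge_hydropathy_features("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
    ```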
  • 91
    Publication Date: 2013-02-27
    Description: Background: The rapid growth of short-read datasets poses a new challenge to the short-read mapping problem in terms of sensitivity and execution speed. Existing methods often use a restrictive error model for computing the alignments to improve speed, whereas more flexible error models are generally too slow for large-scale applications. A number of short-read mapping software tools have been proposed; however, designs based on hardware are relatively rare. Field programmable gate arrays (FPGAs) have been successfully used in a number of specific application areas, such as the DSP and communications domains, due to their outstanding parallel data-processing capabilities, making them a competitive platform for solving problems that are "inherently parallel". Results: We present a hybrid system for short-read mapping utilizing both FPGA-based hardware and CPU-based software. The computation-intensive alignment and seed generation operations are mapped onto an FPGA. We present a computationally efficient, parallel, block-wise alignment structure (Align Core) to approximate the conventional dynamic programming algorithm. The performance is compared to the multi-threaded CPU-based GASSST and BWA software implementations. For single-end alignment, our hybrid system achieves faster processing speed than GASSST (with similar sensitivity) and BWA (with higher sensitivity); for paired-end alignment, our design achieves slightly worse sensitivity than BWA but a higher processing speed. Conclusions: This paper shows that our hybrid system can effectively accelerate the mapping of short reads to a reference genome based on the seed-and-extend approach. The performance comparison to the GASSST and BWA software implementations under different conditions shows that our hybrid design achieves a high degree of sensitivity and requires less overall execution time with only modest FPGA resource utilization. Our hybrid system design also shows that the performance bottleneck for the short-read mapping problem can be shifted from the alignment stage to the seed generation stage, which sets an additional requirement for the future development of short-read aligners.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 92
    Publication Date: 2013-02-26
    Description: Background: Characterising genetic diversity through the analysis of massively parallel sequencing (MPS) data offers enormous potential to significantly improve our understanding of the genetic basis for observed phenotypes, including predisposition to and progression of complex human disease. Great challenges remain in resolving which genetic variants are genuinely associated with disease from among the millions of 'bystanders' and artefactual signals. Results: FAVR is a suite of new methods designed to work with commonly used MPS analysis pipelines to assist in the resolution of some of these issues, with a focus on relatively rare genetic variants. To the best of our knowledge, no equivalent has previously been described. The most important and novel aspect of FAVR is the use of signatures in comparator sequence alignment files during variant filtering, and the annotation of variants potentially shared between individuals. The FAVR methods use these signatures to facilitate filtering of (i) platform-specific artefacts, (ii) common genetic variants, and, where relevant, (iii) artefacts derived from imbalanced paired-end sequencing, as well as annotation of genetic variants based on evidence of co-occurrence in individuals. By comparing conventional variant calling with and without downstream processing by the FAVR methods, applied to whole-exome sequencing datasets, we demonstrate a 3-fold smaller rare single-nucleotide variant shortlist with no detected reduction in sensitivity. This analysis included Sanger sequencing of rare variant signals not evident in dbSNP131, assessment of known variant signal preservation, and comparison of observed and expected rare variant numbers across a range of first-cousin pairs. The principles described herein were applied in our recent publication identifying XRCC2 as a new breast cancer risk gene and have been made publicly available as a suite of software tools. Conclusions: FAVR is a platform-agnostic suite of methods that significantly enhances the analysis of large volumes of sequencing data for the study of rare genetic variants and their influence on phenotypes. (See the sketch after this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
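    The comparator-alignment idea described in the record above could look roughly like the pysam-based sketch below, which flags a candidate rare variant when reads carrying the same alternate base are also found in comparator BAM files (suggesting a platform artefact rather than a genuine rare variant). The function name, threshold and overall approach are assumptions for illustration, not FAVR's actual implementation.

    ```python
    import pysam

    def seen_in_comparators(chrom, pos, alt, comparator_bams, min_reads=2):
        """Return True if at least `min_reads` reads supporting `alt` at 1-based
        position `pos` are present in any comparator alignment file."""
        for path in comparator_bams:
            with pysam.AlignmentFile(path, "rb") as bam:
                support = 0
                for col in bam.pileup(chrom, pos - 1, pos, truncate=True):
                    for read in col.pileups:
                        if (read.query_position is not None and
                                read.alignment.query_sequence[read.query_position] == alt):
                            support += 1
                if support >= min_reads:
                    return True
        return False
    ```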
  • 93
    Publication Date: 2013-02-27
    Description: Background: Evidence-informed decision making in health policy development and clinical practice depends on the availability of valid and reliable data. The introduction of interRAI assessment systems in many countries has provided valuable new information that can be used to support case-mix-based payment systems, quality monitoring, outcome measurement and care planning. The Continuing Care Reporting System (CCRS) managed by the Canadian Institute for Health Information has served as a data repository supporting national implementation of the Resident Assessment Instrument (RAI 2.0) in Canada for more than 15 years. The present paper aims to evaluate data quality for the CCRS using an approach that may be generalizable to comparable data holdings internationally. Methods: Data from the RAI 2.0 implementation in Complex Continuing Care (CCC) hospitals/units and Long Term Care (LTC) homes in Ontario were analyzed using various statistical techniques that provide evidence for trends in validity, reliability, and population attributes. Time series comparisons included evaluations of scale reliability, patterns of associations between items and scales that provide evidence about convergent validity, and measures of changes in population characteristics over time. Results: Data quality with respect to reliability, validity, completeness and freedom from logical coding errors was consistently high for the CCRS in both CCC and LTC settings. The addition of logic checks further improved data quality in both settings. The only notable change of concern was a substantial inflation in the percentage of long-term care home residents qualifying for the Special Rehabilitation level of the Resource Utilization Groups (RUG-III) case mix system after the adoption of that system as part of the payment system for LTC. Conclusions: The CCRS provides a robust, high quality data source that may be used to inform policy, clinical practice and service delivery in Ontario. Only one area of concern was noted, and the statistical techniques employed here may be readily used to target organizations with data quality problems in that (or any other) area. There was also evidence that data quality was good in both CCC and LTC settings from the outset of implementation, meaning data may be used from the entire time series. The methods employed here may continue to be used to monitor data quality in this province over time and they provide a benchmark for comparisons with other jurisdictions implementing the RAI 2.0 in similar populations.
    Electronic ISSN: 1472-6947
    Topics: Computer Science , Medicine
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 94
    Publication Date: 2013-03-03
    Description: Background: Distinguishing cases from non-cases in free-text electronic medical records is an important initial step in observational epidemiological studies, but manual record validation is time-consuming and cumbersome. We compared different approaches to develop an automatic case identification system with high sensitivity to assist manual annotators. Methods: We used four different machine-learning algorithms to build case identification systems for two data sets, one comprising hepatobiliary disease patients, the other acute renal failure patients. To improve the sensitivity of the systems, we varied the imbalance ratio between positive and negative cases using under- and over-sampling techniques, and applied cost-sensitive learning with various misclassification costs. Results: For the hepatobiliary data set, we obtained a high sensitivity of 0.95 (on a par with manual annotators, compared to 0.91 for a baseline classifier) with a specificity of 0.56. For the acute renal failure data set, sensitivity increased from 0.69 to 0.89, with a specificity of 0.59. Performance differences between the various machine-learning algorithms were not large. Classifiers performed best when trained on data sets with an imbalance ratio below 10. Conclusions: We were able to achieve high sensitivity with moderate specificity for automatic case identification on two data sets of electronic medical records. Such a highly sensitive case identification system can be used as a pre-filter to significantly reduce the burden of manual record validation. (See the sketch after this record.)
    Electronic ISSN: 1472-6947
    Topics: Computer Science , Medicine
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
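    A brief sketch, under assumed names and parameters, of the two strategies compared in the record above: re-balancing the training data by under-sampling the majority class, and cost-sensitive learning via per-class misclassification weights (scikit-learn's class_weight option here stands in for whichever cost-sensitive learners were actually used).

    ```python
    import numpy as np
    from sklearn.svm import LinearSVC

    def undersample(X, y, ratio=1.0, seed=0):
        """Keep all positive cases and a random subset of negatives so that the
        negative:positive ratio is at most `ratio`."""
        rng = np.random.default_rng(seed)
        pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
        keep = rng.choice(neg, size=min(len(neg), int(ratio * len(pos))), replace=False)
        idx = np.concatenate([pos, keep])
        return X[idx], y[idx]

    # Cost-sensitive alternative: penalise errors on the rare (positive) class more heavily.
    cost_sensitive_clf = LinearSVC(class_weight={0: 1.0, 1: 10.0})
    ```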
  • 95
    Publication Date: 2013-03-05
    Description: Background: All sequenced eukaryotic genomes have been shown to possess at least a few introns, including those of unicellular organisms previously suspected to be intron-less. Therefore, gene splicing must have been present at least in the last common ancestor of the eukaryotes. To explain the evolution of introns, basically two mutually exclusive concepts have been developed. The introns-early hypothesis holds that even the very first protein-coding genes contained introns, while the introns-late concept asserts that eukaryotic genes gained introns only after the emergence of the eukaryotic lineage. A very important aspect in this respect is the conservation of intron positions within homologous genes of different taxa. Results: GenePainter is a standalone application for mapping gene structure information onto protein multiple sequence alignments. Based on the multiple sequence alignments, the gene structures are aligned down to single nucleotides. GenePainter accounts for variable lengths in exons and introns, respects split codons at intron junctions, and is able to handle sequencing and assembly errors, which are possible reasons for frame-shifts in exons and gaps in genome assemblies. Thus, even gene structures of considerably divergent proteins can properly be compared, as is needed in phylogenetic analyses. Conserved intron positions can also be mapped to user-provided protein structures. For their visualization, GenePainter provides scripts for the molecular graphics system PyMol. Conclusions: GenePainter is a tool for analysing gene structure conservation that provides various visualization options. A stable version of GenePainter for all operating systems as well as documentation and example data are available at http://www.motorprotein.de/genepainter.html. (See the sketch after this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
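    A simplified sketch of the core mapping step implied by the record above: projecting an intron position, given in coding-sequence nucleotides, onto a column of the protein multiple sequence alignment. This is an illustrative reconstruction under assumed conventions (0-based positions, '-' for gaps), not GenePainter's code; split-codon handling is reduced to returning the intron phase.

    ```python
    def intron_to_alignment_column(aligned_seq, intron_nt_pos):
        """aligned_seq: one row of the protein MSA (gaps as '-');
        intron_nt_pos: 0-based intron position in that protein's coding sequence.
        Returns (alignment column of the affected residue, intron phase 0/1/2)."""
        residue_index = intron_nt_pos // 3   # residue containing/preceding the intron
        phase = intron_nt_pos % 3            # position of the intron within the codon
        seen = 0
        for column, char in enumerate(aligned_seq):
            if char != "-":
                if seen == residue_index:
                    return column, phase
                seen += 1
        raise ValueError("intron position lies beyond the aligned sequence")

    print(intron_to_alignment_column("MK--TAYI", 9))   # -> (5, 0)
    ```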
  • 96
    Publication Date: 2013-02-08
    Description: Background: Binning 16S rRNA sequences into operational taxonomic units (OTUs) is a crucial initial step in analyzing the large sequence datasets generated to determine microbial community compositions in various environments, including that of the human gut. Various methods have been developed, but most suffer either from inaccuracies or from an inability to handle the millions of sequences generated in current studies. Furthermore, existing binning methods usually require a priori decisions regarding binning parameters, such as a distance level for defining an OTU. Results: We present a novel modularity-based approach (M-pick) to address the aforementioned problems. The new method utilizes ideas from community detection in graphs, where sequences are viewed as vertices on a weighted graph, each pair of sequences is connected by an imaginary edge, and the similarity of a pair of sequences represents the weight of the edge. M-pick first generates a graph based on pairwise sequence distances and then applies a modularity-based community detection technique to the graph, generating OTUs that capture the community structure in the sequence data. To compare the performance of M-pick with that of existing methods, specifically CROP and ESPRIT-Tree, sequence data from different hypervariable regions of 16S rRNA were used and binning results were compared. Conclusions: A new modularity-based clustering method for OTU picking of 16S rRNA sequences is developed in this study. The algorithm does not require a predetermined cut-off level, and our simulation studies suggest that it is superior to existing methods that require specified distance levels to define OTUs. The source code is available at http://plaza.ufl.edu/xywang/Mpick.htm. (See the sketch after this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
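    A rough sketch of the graph-based idea behind M-pick, with networkx's greedy modularity communities standing in for whichever modularity-based community detection M-pick actually uses; the similarity weighting (1 - distance) is an assumption for illustration.

    ```python
    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    def pick_otus(distances, ids):
        """distances: dict {(id_a, id_b): pairwise distance in [0, 1]};
        ids: all sequence identifiers. Returns a list of OTUs (lists of ids)."""
        g = nx.Graph()
        g.add_nodes_from(ids)
        for (a, b), d in distances.items():
            g.add_edge(a, b, weight=1.0 - d)   # use similarity as the edge weight
        communities = greedy_modularity_communities(g, weight="weight")
        return [sorted(c) for c in communities]

    print(pick_otus({("s1", "s2"): 0.02, ("s2", "s3"): 0.03, ("s3", "s4"): 0.40},
                    ["s1", "s2", "s3", "s4"]))
    ```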
  • 97
    Publication Date: 2013-02-08
    Description: Background: Vitamins are important cofactors in various enzymatic reactions. In the past, many inhibitors have been designed against vitamin-binding pockets in order to inhibit vitamin-protein interactions. Thus, it is important to identify vitamin-interacting residues in a protein. It is possible to detect vitamin-binding pockets on a protein if its tertiary structure is known. Unfortunately, tertiary structures are available for only a limited number of proteins. Therefore, it is important to develop in-silico models for predicting vitamin-interacting residues in a protein from its primary structure. Results: In this study, we first compared protein-interacting residues of vitamins with those of other ligands using Two Sample Logo (TSL). It was observed that ATP, GTP, NAD, FAD and mannose preferred {G,R,K,S,H}, {G,K,T,S,D,N}, {T,G,Y}, {G,Y,W} and {Y,D,W,N,E} residues respectively, whereas vitamins preferred {Y,F,S,W,T,G,H} residues for interaction with proteins. Furthermore, compositional information on preferred and non-preferred residues, along with pattern specificity, was also observed within the different vitamin classes. Vitamins A, B and B6 preferred {F,I,W,Y,L,V}, {S,Y,G,T,H,W,N,E} and {S,T,G,H,Y,N} interacting residues respectively. This suggests that the protein-binding patterns of vitamins differ from those of other ligands, and it motivated us to develop separate predictors for vitamins and their sub-classes. Four prediction modules were developed: (i) vitamin-interacting residues (VIRs), (ii) vitamin-A-interacting residues (VAIRs), (iii) vitamin-B-interacting residues (VBIRs) and (iv) pyridoxal-5-phosphate (vitamin B6) interacting residues (PLPIRs). We applied various machine learning techniques, including SVM, BayesNet, NaiveBayes, ComplementNaiveBayes, NaiveBayesMultinomial, RandomForest and IBk, using binary and Position-Specific Scoring Matrix (PSSM) features of protein sequences. Finally, we selected the best-performing SVM modules and obtained maximum MCCs of 0.53, 0.48, 0.61 and 0.81 for VIRs, VAIRs, VBIRs and PLPIRs respectively, using PSSM-based evolutionary information. All the modules developed in this study have been trained and tested on non-redundant datasets and evaluated using five-fold cross-validation. Performance was also evaluated on balanced and on different independent datasets. Conclusions: This study demonstrates that it is possible to predict VIRs, VAIRs, VBIRs and PLPIRs from the evolutionary information of a protein sequence. In order to provide a service to the scientific community, we have developed the web-server and standalone software VitaPred (http://crdd.osdd.net/raghava/vitapred/). (See the sketch after this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
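    A minimal sketch of the PSSM-based, residue-level setup described in the record above: a sliding window over the PSSM yields one feature vector per residue, which a standard SVM can then classify as interacting or not. The window size, zero padding and SVM parameters are assumptions, not those of VitaPred.

    ```python
    import numpy as np
    from sklearn.svm import SVC

    def window_features(pssm, window=17):
        """pssm: (sequence_length, 20) array of PSSM scores.
        Returns one flattened window-centred feature vector per residue."""
        half = window // 2
        padded = np.vstack([np.zeros((half, pssm.shape[1])), pssm,
                            np.zeros((half, pssm.shape[1]))])
        return np.array([padded[i:i + window].ravel() for i in range(len(pssm))])

    # A residue-level classifier in the same spirit (kernel and parameters assumed).
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    ```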
  • 98
    Publication Date: 2013-02-08
    Description: Background: The rate of elective surgeries varies dramatically by geography in the United States. For many of these surgeries, there is no clear evidence of their relative merits over alternative treatment choices, and there are significant tradeoffs in the short- and long-term risks and benefits of selecting one treatment option over another. Conditions and symptoms that lack a single clear evidence-based treatment choice present great opportunities for patient and provider collaboration on decision making; back pain and joint osteoarthritis are two such ailments. A number of decision aids are in active use to encourage this shared decision-making process. Decision aids have been assessed in formal studies that demonstrate increases in patient knowledge, increases in patient-provider engagement, and reductions in surgery rates. These studies have not widely demonstrated the added benefit of health coaching in support of shared decision making, nor have they commonly provided strong evidence of cost reductions. In order to add to this evidence base, we undertook a comparative study testing the relative impact on health utilization and costs of active outreach through interactive voice response technology to encourage health coaching in support of shared decision making, compared with mailed outreach or no outreach. This study focused on individuals with back pain or joint pain. Methods: We conducted four waves of stratified randomized comparisons for individuals at risk for back, hip, or knee surgery who did not have claims-based evidence of one or more of five chronic conditions and were eligible for population care management services within three large regional health plans in the United States. An interactive voice response (IVR) form of outreach that included the capability for individuals to connect directly with health coaches telephonically, known as AutoDialog(R), was compared to a control (mailed outreach or natural levels of inbound calling, depending on the study wave). In total, the study included 24,167 adults with commercial and Medicare Advantage private coverage at three health plans and at risk for lumbar back surgery, hip repair/replacement, or knee repair/replacement. Results: Interactive voice response outreach led to 10.7 times (P-value < .0001) as many inbound calls within 30 days as the control. Over 180 days, the IVR group ("intervention") had 67 percent (P-value < .0001) more health coach communications and agreed to be sent 3.2 times (P-value < .0001) as many DVD- and/or booklet-based decision aids. Targeted surgeries were reduced by 6.7 percent (P-value = .6039). Overall costs were lower by 4.9 percent (P-value = .055). Costs that were not related to maternity, cancer, trauma and substance abuse ("actionable costs") were reduced by 6.5 percent (P-value = .0286). Conclusions: IVR with a transfer-to-health-coach option significantly increased levels of health coaching compared to mailed or no outreach and led to significantly reduced actionable medical costs. Providing high levels of health coaching to individuals with these types of risks appears to have produced important levels of actionable medical cost reductions. We believe this impact resulted from more informed and engaged health care decision making.
    Electronic ISSN: 1472-6947
    Topics: Computer Science , Medicine
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 99
    Publication Date: 2013-02-13
    Description: Background: Digitised monogenean images are usually stored in file system directories in an unstructured manner. In this paper we propose a semantic representation of these images in the form of a Monogenean Haptoral Bar Image (MHBI) ontology, in which images are annotated with taxonomic classification, diagnostic hard part and image properties. The data we used are mainly of monogenean species found in fish, so we built a simple Fish ontology to demonstrate how the host (fish) ontology can be linked to the MHBI ontology. This enables linking of information from the monogenean ontology to the host species found in the fish ontology without changing the underlying schema of either ontology. Results: In this paper, we utilized the Taxonomic Data Working Group Life Sciences Identifier (TDWG LSID) vocabulary to represent our data and defined a new vocabulary specifically for annotating monogenean haptoral bar images, in order to develop the MHBI ontology and a merged MHBI-Fish ontology. These ontologies were successfully evaluated using five criteria: clarity, coherence, extendibility, ontology commitment and encoding bias. Conclusions: In this paper, we show that unstructured data can be represented in a structured form using semantics. In the process, we have developed a new vocabulary for annotating monogenean images with textual information. The proposed monogenean image ontology will form the basis of a monogenean knowledge base to assist researchers in retrieving information for their analysis. (See the sketch after this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
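    A small rdflib sketch of the kind of annotation and cross-ontology linking described in the record above. Every namespace URI, class and property name below is an invented placeholder, not an actual MHBI or TDWG LSID vocabulary term.

    ```python
    from rdflib import Graph, Namespace, Literal
    from rdflib.namespace import RDF

    MHBI = Namespace("http://example.org/mhbi#")   # placeholder namespaces
    FISH = Namespace("http://example.org/fish#")

    g = Graph()
    image = MHBI["image_0001"]
    g.add((image, RDF.type, MHBI.HaptoralBarImage))
    g.add((image, MHBI.depictsSpecies, Literal("Ligophorus sp.")))
    g.add((image, MHBI.foundOnHost, FISH["Mugil_cephalus"]))   # link into the fish ontology
    print(g.serialize(format="turtle"))
    ```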
  • 100
    Publication Date: 2013-02-13
    Description: Background: Multigenic diseases are often associated with protein complexes or interactions involved in the same pathway. We wanted to estimate to what extent this is true given a consolidated protein interaction data set. The study stresses data integration and data representation issues. Results: We constructed 497 multigenic disease groups from OMIM and tested them for overlaps with interaction and pathway data. A total of 159 disease groups had significant overlaps with protein interaction data consolidated by iRefIndex. A further 68 disease overlaps were found only in the KEGG pathway database. No single database contained all significant overlaps, thus stressing the importance of data integration. We also found that disease groups overlapped with all three interaction data types: n-ary, spoke-represented complexes and binary data -- thus stressing the importance of considering each of these data types separately. Conclusions: Almost half of our multigenic disease groups could potentially be explained by protein complexes and pathways. However, the fact that no database or data type was able to cover all disease groups suggests that no single database has systematically covered all disease groups for potentially related complex and pathway data. This survey provides a basis for further curation efforts to confirm and search for overlaps between diseases and interaction data. The accompanying R script can be used to reproduce the work and track progress in this area as databases change. Disease group overlaps can be further explored using the iRefscape plugin for Cytoscape. (See the sketch after this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
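    An illustrative sketch of the kind of overlap test implied by the record above: a one-sided hypergeometric test of whether a multigenic disease group shares more genes with a given complex or pathway than expected by chance. The gene sets and the 20,000-gene universe below are made-up values for illustration, and the paper's own analysis is distributed as an R script rather than this Python sketch.

    ```python
    from scipy.stats import hypergeom

    def overlap_pvalue(disease_genes, pathway_genes, universe_size):
        """P(overlap >= observed) when drawing len(disease_genes) genes at random
        from a universe in which len(pathway_genes) genes are 'successes'."""
        overlap = len(disease_genes & pathway_genes)
        return hypergeom.sf(overlap - 1, universe_size,
                            len(pathway_genes), len(disease_genes))

    p = overlap_pvalue({"BRCA1", "BRCA2", "XRCC2"},
                       {"XRCC2", "RAD51", "BRCA2"},
                       universe_size=20000)
    print(p)
    ```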