ALBERT

All Library Books, journals and Electronic Records Telegrafenberg

  • 1
    Publication Date: 2021-08-21
    Description: Background In most flowering plants, the plastid genome exhibits a quadripartite genome structure, comprising a large and a small single copy as well as two inverted repeat regions. Thousands of plastid genomes have been sequenced and submitted to public sequence repositories in recent years. The quality of sequence annotations in many of these submissions is known to be problematic, especially regarding annotations that specify the length and location of the inverted repeats: such annotations are either missing or portray the length or location of the repeats incorrectly. However, many biological investigations employ publicly available plastid genomes at face value and implicitly assume the correctness of their sequence annotations. Results We introduce airpg, a Python package that automatically assesses the frequency of incomplete or incorrect annotations of the inverted repeats among publicly available plastid genomes. Specifically, the tool automatically retrieves plastid genomes from NCBI Nucleotide under variable search parameters, surveys them for length and location specifications of inverted repeats, and confirms any inverted repeat annotations through self-comparisons of the genome sequences. The package also includes functionality for automatic identification and removal of duplicate genome records and accounts for taxa that genuinely lack inverted repeats. A survey of the presence of inverted repeat annotations among all plastid genomes of flowering plants submitted to NCBI Nucleotide until the end of 2020 using airpg, followed by a statistical analysis of potential associations with record metadata, highlights that release year and publication status of the genome records have a significant effect on the frequency of complete and equal-length inverted repeat annotations. Conclusion The number of plastid genomes on NCBI Nucleotide has increased dramatically in recent years, and many more genomes will likely be submitted over the next decade. airpg enables researchers to automatically access and evaluate the inverted repeats of these plastid genomes as well as their sequence annotations and, thus, contributes to increasing the reliability of publicly available plastid genomes. The software is freely available via the Python package index at http://pypi.python.org/pypi/airpg. (A short code sketch follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
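A minimal sketch of the kind of survey described above, using Biopython's Entrez and SeqIO modules to pull a few flowering-plant plastid genomes from NCBI Nucleotide and count inverted-repeat annotations. The search term and the repeat_region heuristic are illustrative assumptions, not airpg's actual logic:

```python
# Sketch of the kind of survey airpg automates: fetch plastid genomes from
# NCBI Nucleotide and count inverted-repeat (IR) annotations. The query and
# the repeat_region heuristic are illustrative assumptions.
from Bio import Entrez, SeqIO  # pip install biopython

Entrez.email = "you@example.org"  # NCBI requires a contact address (placeholder)

term = ('Magnoliophyta[Organism] AND "complete genome"[Title] '
        'AND chloroplast[Filter]')
handle = Entrez.esearch(db="nucleotide", term=term, retmax=3)
ids = Entrez.read(handle)["IdList"]
handle.close()

for acc in ids:
    fetch = Entrez.efetch(db="nucleotide", id=acc, rettype="gb", retmode="text")
    rec = SeqIO.read(fetch, "genbank")
    fetch.close()
    # IRs are typically annotated as repeat_region features mentioning "inverted".
    irs = [f for f in rec.features
           if f.type == "repeat_region"
           and "inverted" in str(f.qualifiers).lower()]
    print(f"{rec.id}: {len(irs)} inverted-repeat annotation(s)")
```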
  • 2
    Publication Date: 2021-08-21
    Description: Background Drug–drug interactions (DDIs) refer to processes triggered by the administration of two or more drugs leading to side effects beyond those observed when drugs are administered by themselves. Due to the massive number of possible drug pairs, it is nearly impossible to experimentally test all combinations and discover previously unobserved side effects. Therefore, machine learning based methods are being used to address this issue. Methods We propose a Siamese self-attention multi-modal neural network for DDI prediction that integrates multiple drug similarity measures that have been derived from a comparison of drug characteristics including drug targets, pathways and gene expression profiles. Results Our proposed DDI prediction model provides multiple advantages: (1) It is trained end-to-end, overcoming limitations of models composed of multiple separate steps, (2) it offers model explainability via an Attention mechanism for identifying salient input features and (3) it achieves similar or better prediction performance (AUPR scores ranging from 0.77 to 0.92) compared to state-of-the-art DDI models when tested on various benchmark datasets. Novel DDI predictions are further validated using independent data resources. Conclusions We find that a Siamese multi-modal neural network is able to accurately predict DDIs and that an Attention mechanism, typically used in the Natural Language Processing domain, can be beneficially applied to aid in DDI model explainability.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
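The Siamese, attention-based design described in the record above (item 2) can be illustrated compactly. The PyTorch sketch below shows a shared encoder that attention-pools several drug-similarity modalities and scores a drug pair; the layer sizes, the three-modality setup, and the scoring head are assumptions for illustration, not the authors' architecture:

```python
import torch
import torch.nn as nn

class ModalityAttentionEncoder(nn.Module):
    """Attention-pools per-modality feature vectors into one drug embedding."""
    def __init__(self, in_dim: int, hid: int = 64):
        super().__init__()
        self.proj = nn.Linear(in_dim, hid)        # shared projection
        self.score = nn.Linear(hid, 1)            # attention score per modality

    def forward(self, x):                         # x: (batch, n_modalities, in_dim)
        h = torch.tanh(self.proj(x))              # (batch, M, hid)
        a = torch.softmax(self.score(h), dim=1)   # (batch, M, 1), sums to 1 over M
        return (a * h).sum(dim=1), a              # pooled embedding + weights

class SiameseDDI(nn.Module):
    """One shared encoder applied to both drugs (the 'Siamese' part)."""
    def __init__(self, in_dim: int, hid: int = 64):
        super().__init__()
        self.enc = ModalityAttentionEncoder(in_dim, hid)
        self.head = nn.Linear(2 * hid, 1)

    def forward(self, drug_a, drug_b):
        za, attn_a = self.enc(drug_a)
        zb, attn_b = self.enc(drug_b)
        p = torch.sigmoid(self.head(torch.cat([za, zb], dim=-1)))
        return p, (attn_a, attn_b)                # attention weights aid explainability

model = SiameseDDI(in_dim=128)
pairs_a = torch.randn(4, 3, 128)                  # 4 drug pairs, 3 modalities each
pairs_b = torch.randn(4, 3, 128)
prob, _ = model(pairs_a, pairs_b)
print(prob.shape)                                 # torch.Size([4, 1])
```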
  • 3
    Publication Date: 2021-02-25
    Description: Background Matrix factorization methods are linear models, with limited capability to model complex relations. In our work, we use the tropical semiring to introduce non-linearity into matrix factorization models. We propose a method called Sparse Tropical Matrix Factorization (STMF) for the estimation of missing (unknown) values in sparse data. Results We evaluate the efficiency of the method on both synthetic data and biological data in the form of gene expression measurements downloaded from The Cancer Genome Atlas (TCGA) database. Tests on unique synthetic data showed that STMF approximation achieves a higher correlation than non-negative matrix factorization (NMF), which is unable to recover patterns effectively. On real data, STMF outperforms NMF on six out of nine gene expression datasets. While NMF assumes a normal distribution and tends toward the mean value, STMF can better fit extreme values and distributions. Conclusion STMF is the first work that uses a tropical semiring on sparse data. We show that in certain cases semirings are useful because they consider the structure, which is different and simpler to understand than it is with standard linear algebra. (A short code sketch follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
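The non-linearity that distinguishes tropical factorization from ordinary matrix factorization is the semiring product itself, where max plays the role of addition and addition plays the role of multiplication. A minimal NumPy sketch, illustrative only, not the STMF fitting procedure:

```python
# Sketch: the tropical (max-plus) matrix product, where "max" replaces
# addition and "+" replaces multiplication.
import numpy as np

def maxplus(U, V):
    # (U (x) V)[i, j] = max_k (U[i, k] + V[k, j])
    return np.max(U[:, :, None] + V[None, :, :], axis=1)

rng = np.random.default_rng(0)
U = rng.random((5, 2))
V = rng.random((2, 4))
X_hat = maxplus(U, V)        # rank-2 tropical approximation of a 5x4 matrix
print(X_hat.shape)           # (5, 4)
```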
  • 4
    Publication Date: 2021-02-01
    Description: Background Due to the complexity of biological systems, the prediction of potential DNA binding sites for transcription factors remains a difficult problem in computational biology. Genomic DNA sequences and experimental results from parallel sequencing provide available information about the affinity and accessibility of the genome and are commonly used features in binding site prediction. The attention mechanism in deep learning has shown its capability to learn long-range dependencies from sequential data, such as sentences and voices. Until now, no study has applied this approach to binding site inference from massively parallel sequencing data. The successful applications of the attention mechanism in similar input contexts motivate us to build and test new methods that can accurately determine the binding sites of transcription factors. Results In this study, we propose a novel tool (named DeepGRN) for transcription factor binding site prediction based on the combination of two components: a single attention module and a pairwise attention module. The performance of our methods is evaluated on the ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge datasets. The results show that DeepGRN achieves higher unified scores in 6 of 13 targets than any of the top four methods in the DREAM challenge. We also demonstrate that the attention weights learned by the model are correlated with potential informative inputs, such as DNase-Seq coverage and motifs, which provide possible explanations for the predictive improvements in DeepGRN. Conclusions DeepGRN can automatically and effectively predict transcription factor binding sites from DNA sequences and DNase-Seq coverage. Furthermore, the visualization techniques we developed for the attention modules help to interpret how critical patterns from different types of input features are recognized by our model.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 5
    Publication Date: 2021-02-01
    Description: Background IsomiRs are miRNA variants that vary in length and/or sequence when compared to their canonical forms. These differences include additions or deletions of one or more nucleotides (nts) at the 5′ and/or 3′ end, internal editing or untemplated 3′ end additions. Most available tools for small RNA-seq data analysis do not allow the identification of isomiRs and often require advanced knowledge of bioinformatics. To overcome this, we have developed IsomiR Window, a platform that supports the systematic identification, quantification and functional exploration of isomiR expression in small RNA-seq datasets, accessible to users with no computational skills. Methods IsomiR Window enables the discovery of isomiRs and identification of all annotated non-coding RNAs in RNA-seq datasets from animals and plants. It comprises two main components: the IsomiR Window pipeline for data processing; and the IsomiR Window Browser interface. It integrates over ten third-party software tools for the analysis of small RNA-seq data and holds a new algorithm that allows the detection of all possible types of isomiRs. These include 3′ and 5′ end isomiRs, 3′ end tailings, isomiRs with single nucleotide polymorphisms (SNPs) or potential RNA editing, as well as all possible fuzzy combinations. IsomiR Window includes all required databases for analysis and annotation, and is freely distributed as a Linux virtual machine, including all required software. Results IsomiR Window processes several datasets in an automated manner, without restrictions on input file size. It generates high-quality interactive figures and tables which can be exported into different formats. The performance of isomiR detection and quantification was assessed using simulated small RNA-seq data. For correctly mapped reads, it identified different types of isomiRs with high confidence and 100% accuracy. The analysis of small RNA-seq data from basal cell carcinomas (BCCs) using IsomiR Window confirmed that miR-183-5p is up-regulated in nodular BCCs, but revealed that this effect was predominantly due to a novel 5′ end variant. This variant displays a different seed region motif and 1756 isoform-exclusive mRNA targets that are significantly associated with disease pathways, underscoring the biological relevance of isomiR-focused analysis. IsomiR Window is available at https://isomir.fc.ul.pt/.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 6
    Publication Date: 2021-02-01
    Description: Background Generating and analysing single-cell data has become a widespread approach to examine tissue heterogeneity, and numerous algorithms exist for clustering these datasets to identify putative cell types with shared transcriptomic signatures. However, many of these clustering workflows rely on user-tuned parameter values, tailored to each dataset, to identify a set of biologically relevant clusters. Whereas users often develop their own intuition as to the optimal range of parameters for clustering on each data set, the lack of systematic approaches to identify this range can be daunting to new users of any given workflow. In addition, an optimal parameter set does not guarantee that all clusters are equally well-resolved, given the heterogeneity in transcriptomic signatures in most biological systems. Results Here, we illustrate a subsampling-based approach (chooseR) that simultaneously guides parameter selection and characterizes cluster robustness. Through bootstrapped iterative clustering across a range of parameters, chooseR was used to select parameter values for two distinct clustering workflows (Seurat and scVI). In each case, chooseR identified parameters that produced biologically relevant clusters from both well-characterized (human PBMC) and complex (mouse spinal cord) datasets. Moreover, it provided a simple “robustness score” for each of these clusters, facilitating the assessment of cluster quality. Conclusion chooseR is a simple, conceptually understandable tool that can be used flexibly across clustering algorithms, workflows, and datasets to guide clustering parameter selection and characterize cluster robustness.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
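The subsampling idea behind chooseR, from the record above (item 6), can be sketched with scikit-learn: recluster bootstrap subsamples and score how often pairs of points stay together. The cluster count, subsample fraction, and scoring rule below are illustrative assumptions, not chooseR's implementation:

```python
# Sketch of a subsampling-based robustness score: recluster bootstrap
# subsamples and measure how often points co-cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
ref = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

rng = np.random.default_rng(0)
n = len(X)
together = np.zeros((n, n))   # times two points landed in the same cluster
seen = np.zeros((n, n))       # times two points were subsampled together
for _ in range(20):
    idx = rng.choice(n, size=int(0.8 * n), replace=False)
    lab = KMeans(n_clusters=4, n_init=10).fit_predict(X[idx])
    seen[np.ix_(idx, idx)] += 1
    together[np.ix_(idx, idx)] += lab[:, None] == lab[None, :]

co = np.divide(together, seen, out=np.zeros((n, n)), where=seen > 0)
for c in range(4):
    m = ref == c              # per-cluster robustness, cf. chooseR's score
    print(f"cluster {c}: median co-clustering = {np.median(co[np.ix_(m, m)]):.2f}")
```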
  • 7
    Publication Date: 2021-03-31
    Description: Background The most common measure of association between two continuous variables is the Pearson correlation (Maronna et al., Robust Statistics, 2019). When outliers are present, Pearson does not accurately measure association and robust measures are needed. This article introduces three new robust measures of correlation: Taba (T), TabWil (TW), and TabWil rank (TWR). The correlation estimators T and TW measure a linear association between two continuous or ordinal variables, whereas TWR measures a monotonic association. The robustness of these proposed measures in comparison with Pearson (P), Spearman (S), Quadrant (Q), Median (M), and Minimum Covariance Determinant (MCD) is examined through simulation. Taba distance is used to analyze genes, and statistical tests were used to identify those genes most significantly associated with Williams Syndrome (WS). Results Based on the root mean square error (RMSE) and bias, the three proposed correlation measures are highly competitive when compared to classical measures such as P and S as well as robust measures such as Q, M, and MCD. Our findings indicate TBL2 was the most significant gene among patients diagnosed with WS and had the most significant reduction in gene expression level when compared with controls (P value = 6.37E-05). Conclusions Overall, when the distribution is bivariate Log-Normal or bivariate Weibull, TWR performs best in terms of bias and T performs best with respect to RMSE. Under the Normal distribution, MCD performs well with respect to bias and RMSE; but TW, TWR, T, S, and P correlations were in close proximity. The identification of TBL2 may serve as a diagnostic tool for WS patients. A Taba R package has been developed and is available for use to perform all necessary computations for the proposed methods. (A short code sketch follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
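Why robust alternatives to Pearson are needed is easy to demonstrate: a single gross outlier can swamp the Pearson estimate while rank-based measures barely move. A small SciPy sketch (the Taba-family estimators themselves are provided by the authors' R package and are not reimplemented here):

```python
# Sketch: one gross outlier swamps Pearson while rank-based Spearman barely
# moves, motivating robust estimators such as Taba/TabWil.
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.6, size=100)
x[0], y[0] = 8.0, -8.0                      # inject a single gross outlier

print("Pearson :", round(pearsonr(x, y)[0], 2))   # dragged down by the outlier
print("Spearman:", round(spearmanr(x, y)[0], 2))  # nearly unaffected
```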
  • 8
    Publication Date: 2021-03-31
    Description: Background Protein post-translational modification (PTM) is a key issue in investigating the mechanisms of protein function. With the rapid development of proteomics technology, a large amount of protein sequence data has been generated, which highlights the importance of the in-depth study and analysis of PTMs in proteins. Method We proposed a new multi-classification machine learning pipeline, MultiLyGAN, to identify seven types of lysine-modified sites. Using eight different sequential and five structural construction methods, 1497 valid features remained after filtering by Pearson correlation coefficient. To solve the data imbalance problem, Conditional Generative Adversarial Network (CGAN) and Conditional Wasserstein Generative Adversarial Network (CWGAN), two influential deep generative methods, were leveraged and compared to generate new samples for the types with fewer samples. Finally, a random forest algorithm was utilized to predict the seven categories. Results In the tenfold cross-validation, accuracy (Acc) and Matthews correlation coefficient (MCC) were 0.8589 and 0.8376, respectively. In the independent test, Acc and MCC were 0.8549 and 0.8330, respectively. The results indicated that CWGAN better solved the existing data imbalance and stabilized the training error. Additionally, an accumulated feature importance analysis reported that CKSAAP, PWM and structural features were the three most important feature-encoding schemes. MultiLyGAN can be found at https://github.com/Lab-Xu/MultiLyGAN. Conclusions The CWGAN greatly improved the predictive performance in all experiments. Features derived from the CKSAAP, PWM and structure schemes are the most informative and had the greatest contribution to the prediction of PTM.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 9
    Publication Date: 2021-03-24
    Description: Background Given expression data, gene regulatory network (GRN) inference approaches try to determine regulatory relations. However, current inference methods ignore the inherent topological characteristics of GRNs to some extent, leading to structures that lack clear biological explanation. To increase the biophysical meaning of inferred networks, this study performed data-driven module detection before network inference. Gene modules were identified by decomposition-based methods. Results ICA-decomposition-based module detection methods have been used to detect functional modules directly from transcriptomic data. Experiments on time-series expression, curated, and scRNA-seq datasets demonstrated the advantages of the proposed ModularBoost method over established methods, especially in efficiency and accuracy. For scRNA-seq datasets, the ModularBoost method outperformed other candidate inference algorithms. Conclusions As a complicated task, GRN inference can be decomposed into several tasks of reduced complexity. Using identified gene modules as topological constraints, the initial inference problem can be accomplished by inferring intra-modular and inter-modular interactions respectively. Experimental outcomes suggest that the proposed ModularBoost method can improve the accuracy and efficiency of inference algorithms by introducing topological constraints.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 10
    Publication Date: 2021-03-27
    Description: Background Heritability is a central measure in genetics quantifying how much of the variability observed in a trait is attributable to genetic differences. Existing methods for estimating heritability are most often based on random-effect models, typically for computational reasons. The alternative of using a fixed-effect model has received much more limited attention in the literature. Results In this paper, we propose a generic strategy for heritability inference, termed “boosting heritability”, by combining the advantageous features of different recent methods to produce an estimate of the heritability with a high-dimensional linear model. Boosting heritability uses in particular a multiple sample splitting strategy which leads in general to a stable and accurate estimate. We use both simulated data and real antibiotic resistance data from a major human pathogen, Streptococcus pneumoniae, to demonstrate the attractive features of our inference strategy. Conclusions Boosting is shown to offer a reliable and practically useful tool for inference about heritability.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 11
    Publication Date: 2021-03-25
    Description: Background A growing body of evidence has shown the association between tuberculosis (TB) infection and lung cancer. However, the possible effect of strain-specific behavior of the Mycobacterium tuberculosis (M.tb) population, the etiological agent of TB infection, on this association has been neglected. In this context, this study was conducted to investigate this association with consideration of the genetic background of strains in the M.tb population. Results We employed the elastic net penalized logistic regression model, a statistical-learning algorithm for gene selection, to evaluate this association in 129 genes involved in the TLR and NF-κB signaling pathways in response to two different M.tb sub-lineage strains (L3-CAS1 and L4.5). Of the 129 genes, 21 were found to be associated with the two studied M.tb sub-lineages. In addition, the MAPK8IP3 gene was identified as a novel gene, which has not been reported in previous lung cancer studies and may have the potential to be recognized as a novel biomarker in lung cancer investigation. Conclusions This preliminary study provides new insights into the mechanistic association between TB infection and lung cancer. Further mechanistic investigations of this association, with a large number of M.tb strains encompassing the other main M.tb lineages and using the whole transcriptome of the host cell, are needed.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 12
    Publication Date: 2021-03-22
    Description: Background Telomeres, nucleoprotein structures comprising short tandem repeats and delimiting the ends of linear eukaryotic chromosomes, play an important role in the maintenance of genome stability. Therefore, the determination of the length of telomeres is of high importance for many studies. Over the last years, new methods for the analysis of the length of telomeres have been developed, including those based on PCR or analysis of NGS data. Despite that, the terminal restriction fragment (TRF) method remains the gold standard to this day. However, this method lacks a universally accepted and precise tool capable of analysing and statistically evaluating TRF results. Results To standardize the processing of TRF results, we have developed WALTER, an online toolset allowing rapid, reproducible, and user-friendly analysis including statistical evaluation of the data. Given its web-based nature, it provides an easily accessible way to analyse TRF data without any need to install additional software. Conclusions WALTER represents a major upgrade from currently available tools for the image processing of TRF scans. This toolset enables a rapid, highly reproducible, and user-friendly evaluation of almost any TRF scan including in-house statistical evaluation of the data. The WALTER platform, together with a user manual describing the evaluation of TRF scans in detail and presenting tips and troubleshooting, as well as test data to demo the software, is available at https://www.ceitec.eu/chromatin-molecular-complexes-jiri-fajkus/rg51/tab?tabId=125#WALTER and the source code at https://github.com/mlyc93/WALTER.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 13
    Publication Date: 2021-03-22
    Description: Background Polyploidy is very common in plants and can be seen as one of the key drivers in the domestication of crops and the establishment of important agronomic traits. It can be the main source of genomic repatterning and introduces gene duplications, affecting gene expression and alternative splicing. Since fully sequenced genomes are not yet available for many plant species including crops, de novo transcriptome assembly is the basis for understanding molecular and functional mechanisms. However, in complex polyploid plants, de novo transcriptome assembly is challenging, leading to increased rates of fused or redundant transcripts. Since assemblers were developed mainly for diploid organisms, they may not be well suited for polyploids. Also, comparative evaluations of these tools on higher polyploid plants are extremely rare. Thus, our aim was to fill this gap and to provide a basic guideline for choosing the optimal de novo assembly strategy, focusing on autotetraploids, as the scientific interest in this type of polyploidy is steadily increasing. Results We present a comparison of two common (SOAPdenovo-Trans, Trinity) and one recently published transcriptome assembler (TransLiG) on diploid and autotetraploid species of the genera Acer and Vaccinium using Arabidopsis thaliana as a reference. The number of assembled transcripts was up to 11 and 14 times higher, with an increased number of short transcripts, for Acer and Vaccinium, respectively, compared to A. thaliana. In diploid samples, Trinity and TransLiG performed similarly well, while in autotetraploids, TransLiG assembled the most complete transcriptomes with an average of 1916 assembled BUSCOs vs. 1705 BUSCOs for Trinity. Of all three assemblers, SOAPdenovo-Trans performed worst (1133 complete BUSCOs). Conclusion All three assembly tools produced complete assemblies when dealing with the model organism A. thaliana, independently of its ploidy level, but their performances differed extremely when it came to non-model autotetraploids, where specifically TransLiG and Trinity produced a high number of redundant transcripts. The recently published assembler TransLiG had not previously been tested on any plant organism but showed the highest completeness and full-length transcriptomes, especially in autotetraploids. Including such species during the development and testing of new assembly tools is highly recommended as many important crops are polyploid.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 14
    Publication Date: 2021-03-24
    Description: Background A number of predictive models for aquatic toxicity are available; however, the accuracy and ease of use of these in silico tools in risk assessment still need further study. This study evaluated the performance of seven in silico tools for predicting acute toxicity to daphnia and fish: ECOSAR, T.E.S.T., Danish QSAR Database, VEGA, KATE, Read Across and Trent Analysis. 37 Priority Controlled Chemicals in China (PCCs) and 92 New Chemicals (NCs) were used as the validation dataset. Results In the quantitative evaluation on PCCs, with the criterion of a 10-fold difference between experimental and estimated values, the accuracy of VEGA was the highest among all of the models for prediction of both daphnia and fish acute toxicity, with accuracies of 100% and 90% after considering the applicability domain (AD), respectively. The performance of KATE, ECOSAR and T.E.S.T. was similar, with accuracies slightly lower than VEGA's. The accuracy of the Danish QSAR Database was the lowest among the above tools, for which QSAR is the main mechanism. The performance of Read Across and Trent Analysis was lowest among all of the tested in silico tools. The predictive ability of the models for NCs was lower than for PCCs, possibly because these chemicals never appeared in the models' training sets; ECOSAR performed best among the in silico tools. Conclusion QSAR-based in silico tools had greater prediction accuracy than category approaches (Read Across and Trent Analysis) in predicting the acute toxicity of daphnia and fish. Category approaches (Read Across and Trent Analysis) require expert knowledge to be utilized effectively. ECOSAR performs well on both PCCs and NCs, and its application should be promoted in both risk assessment and priority-setting activities. We suggest that the distribution of multiple data and water solubility should be considered when developing in silico models. Both more intelligent in silico tools and testing are necessary to identify the hazards of chemicals.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 15
    Publication Date: 2021-02-02
    Description: Background Transposable elements (TEs) constitute a significant part of eukaryotic genomes. Short interspersed nuclear elements (SINEs) are non-autonomous TEs, which are widely represented in mammalian genomes and also found in plants. After insertion in a new position in the genome, TEs quickly accumulate mutations, which complicate their identification and annotation by modern bioinformatics methods. In this study, we searched for highly divergent SINE copies in the genome of rice (Oryza sativa subsp. japonica) using the Highly Divergent Repeat Search Method (HDRSM). Results The HDRSM considers correlations of neighboring symbols to construct a position weight matrix (PWM) for a SINE family, which is then used to perform a search for new copies. In order to evaluate the accuracy of the method and compare it with the RepeatMasker program, we generated a set of SINE copies containing nucleotide substitutions and indels and inserted them into an artificial chromosome for analysis. The HDRSM showed better results both in terms of the number of identified inserted repeats and the accuracy of determining their boundaries. A search for copies of 39 SINE families in the rice genome produced 14,030 hits; among them, 5704 were not detected by RepeatMasker. Conclusions The HDRSM can find divergent SINE copies, correctly determine their boundaries, and offer a high level of statistical significance. We also found that RepeatMasker is able to find relatively short copies of SINE families with a higher level of similarity, while the HDRSM is able to find more diverged copies. To obtain a comprehensive profile of SINE distribution in the genome, combined application of the HDRSM and RepeatMasker is recommended. (A short code sketch follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
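The PWM at the core of the record above is a standard construction: column-wise symbol frequencies of an alignment turned into log-odds scores. A toy sketch (a plain PWM with pseudocounts; HDRSM's extension to neighboring-symbol correlations is not reproduced here):

```python
# Sketch: a position weight matrix (PWM) from aligned repeat copies, with
# add-one pseudocounts and log-odds against a uniform background.
import numpy as np

alignment = ["ACGTT", "ACGAT", "TCGTT", "ACGTA"]   # toy aligned SINE copies
alphabet = "ACGT"
counts = np.zeros((4, len(alignment[0])))
for seq in alignment:
    for j, base in enumerate(seq):
        counts[alphabet.index(base), j] += 1

freqs = (counts + 1) / (len(alignment) + 4)        # pseudocounts
pwm = np.log2(freqs / 0.25)                        # log-odds score matrix
print(np.round(pwm, 2))
```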
  • 16
    Publication Date: 2021-03-22
    Description: Background Recently, many computational methods have been proposed to predict cancer genes. One typical kind of method is to find the differentially expressed genes between tumour and normal samples. However, there are also some genes, for example, ‘dark’ genes, that play important roles at the network level but are difficult to find by traditional differential gene expression analysis. In addition, network controllability methods, such as the minimum feedback vertex set (MFVS) method, have been used frequently in cancer gene prediction. However, the weights of vertices (or genes) are ignored in the traditional MFVS methods, leading to difficulty in finding the optimal solution because of the existence of many possible MFVSs. Results Here, we introduce a novel method, called weighted MFVS (WMFVS), which integrates the gene differential expression value with MFVS to select the maximum-weighted MFVS from all possible MFVSs in a protein interaction network. Our experimental results show that WMFVS achieves better performance than using traditional bio-data or network-data analyses alone. Conclusion This method balances the advantage of differential gene expression analyses and network analyses, improves the low accuracy of differential gene expression analyses and decreases the instability of pure network analyses. Furthermore, WMFVS can be easily applied to various kinds of networks, providing a useful framework for data analysis and prediction.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 17
    Publication Date: 2021-03-22
    Description: Background Storage of genomic data is a major cost for the Life Sciences, effectively addressed via specialized data compression methods. For the same reasons of abundance in data production, the use of Big Data technologies is seen as the future for genomic data storage and processing, with MapReduce-Hadoop as leaders. Somewhat surprisingly, none of the specialized FASTA/Q compressors is available within Hadoop; indeed, their deployment there is not exactly immediate. Such a state of the art is problematic. Results We provide major advances in two different directions. Methodologically, we propose two general methods, with the corresponding software, that make it very easy to deploy a specialized FASTA/Q compressor within MapReduce-Hadoop for processing files stored on the distributed Hadoop File System, with very little knowledge of Hadoop. Practically, we provide evidence that the deployment of those specialized compressors within Hadoop, not available so far, results in better space savings, and even in better execution times over compressed data, with respect to the use of generic compressors available in Hadoop, in particular for FASTQ files. Finally, we observe that these results hold also for the Apache Spark framework, when used to process FASTA/Q files stored on the Hadoop File System. Conclusions Our methods and the corresponding software substantially contribute to achieving space and time savings for the storage and processing of FASTA/Q files in Hadoop and Spark. Since our approach is general, it is very likely that it can be applied also to FASTA/Q compression methods that will appear in the future. Availability The software and the datasets are available at https://github.com/fpalini/fastdoopc
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 18
    Publication Date: 2021-03-23
    Description: Background A common approach for sequencing studies is to do joint-calling and store variants of all samples in a single file. If new samples are continually added or controls are re-used for several studies, the cost and time required to perform joint-calling for each analysis can become prohibitive. Results We present ATAV, an analysis platform for large-scale whole-exome and whole-genome sequencing projects. ATAV stores variant and per site coverage data for all samples in a centralized database, which is efficiently queried by ATAV to support diagnostic analyses for trios and singletons, as well as rare-variant collapsing analyses for finding disease associations in complex diseases. Runtime logs ensure full reproducibility and the modularized ATAV framework makes it extensible to continuous development. Besides helping with the identification of disease-causing variants for a range of diseases, ATAV has also enabled the discovery of disease-genes by rare-variant collapsing on datasets containing more than 20,000 samples. Analyses to date have been performed on data of more than 110,000 individuals demonstrating the scalability of the framework. To allow users to easily access variant-level data directly from the database, we provide a web-based interface, the ATAV data browser (http://atavdb.org/). Through this browser, summary-level data for more than 40,000 samples can be queried by the general public representing a mix of cases and controls of diverse ancestries. Users have access to phenotype categories of variant carriers, as well as predicted ancestry, gender, and quality metrics. In contrast to many other platforms, the data browser is able to show data of newly-added samples in real-time and therefore evolves rapidly as more and more samples are sequenced. Conclusions Through ATAV, users have public access to one of the largest variant databases for patients sequenced at a tertiary care center and can look up any genes or variants of interest. Additionally, since the entire code is freely available on GitHub, ATAV can easily be deployed by other groups that wish to build their own platform, database, and user interface.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 19
    Publication Date: 2021-03-23
    Description: Background Currently, no proven effective drugs for the novel coronavirus disease COVID-19 exist and, despite widespread vaccination campaigns, we are far short of herd immunity. The number of people who are still vulnerable to the virus is too high to prevent new outbreaks, leading to a compelling need to find new therapeutic options to combat SARS-CoV-2 infection. Drug repurposing represents an effective drug discovery strategy from existing drugs that could shorten the time and reduce the cost compared to de novo drug discovery. Results We developed a network-based tool for drug repurposing, provided as freely available R code, called SAveRUNNER (Searching off-lAbel dRUg aNd NEtwoRk), with the aim to offer a promising framework to efficiently detect putative novel indications for currently marketed drugs against diseases of interest. SAveRUNNER predicts drug–disease associations by quantifying the interplay between the drug targets and the disease-associated proteins in the human interactome through the computation of a novel network-based similarity measure, which prioritizes associations between drugs and diseases located in the same network neighborhoods. Conclusions The algorithm was successfully applied to predict off-label drugs to be repositioned against the new human coronavirus (2019-nCoV/SARS-CoV-2), and it achieved a high accuracy in the identification of well-known drug indications, thus revealing itself as a powerful tool to rapidly detect potential novel medical indications for various drugs that are worthy of further investigation. SAveRUNNER source code is freely available at https://github.com/giuliafiscon/SAveRUNNER.git, along with a comprehensive user guide.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 20
    Publication Date: 2021-03-25
    Description: Background Synthetic long reads (SLR) with long-range co-barcoding information are now widely applied in genomics research. Although several tools have been developed for each specific SLR technique, a robust standalone scaffolder with high efficiency is warranted for hybrid genome assembly. Results In this work, we developed a standalone scaffolding tool, SLR-superscaffolder, to link together contigs in draft assemblies using co-barcoding and paired-end read information. Our top-to-bottom scheme first builds a global scaffold graph based on Jaccard Similarity to determine the order and orientation of contigs, and then locally improves the scaffolds with the aid of paired-end information. We also exploited a screening algorithm to reduce the negative effect of misassembled contigs in the input assembly. We applied SLR-superscaffolder to a human single tube long fragment read sequencing dataset and increased the scaffold NG50 of its corresponding draft assembly 1349 fold. Moreover, benchmarking on different input contigs showed that this approach overall outperformed existing SLR scaffolders, providing longer contiguity and fewer misassemblies, especially for short contigs assembled by next-generation sequencing data. The open-source code of SLR-superscaffolder is available at https://github.com/BGI-Qingdao/SLR-superscaffolder. Conclusions SLR-superscaffolder can dramatically improve the contiguity of a draft assembly by integrating a hybrid assembly strategy.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 21
    Publication Date: 2021-03-30
    Description: Background Hepatocellular carcinoma (HCC), derived from hepatocytes, is the main histological subtype of primary liver cancer and poses a serious threat to human health due to its high incidence and poor prognosis. This study aimed to establish a multigene prognostic model to predict the prognosis of patients with HCC. Results Gene expression datasets (GSE121248, GSE40873, GSE62232) were used to identify differentially expressed genes (DEGs) between tumor and adjacent or normal tissues, and then hub genes were screened by protein–protein interaction (PPI) network and Cytoscape software. Seventeen genes among the hub genes were significantly associated with prognosis and used to construct a prognostic model through Cox hazard regression analysis. The predictive performance of this model was evaluated with TCGA data and was further validated with the independent dataset GSE14520. Six genes (CDKN3, ZWINT, KIF20A, NUSAP1, HMMR, DLGAP5) were involved in the prognostic model, which separated HCC patients from the TCGA dataset into high- and low-risk groups. Kaplan–Meier (KM) survival analysis and risk score analysis demonstrated that the low-risk group had a survival advantage. Univariate and multivariate regression analyses showed the risk score could be an independent prognostic factor. The receiver operating characteristic (ROC) curve showed that the risk score had better predictive power than other clinical indicators. Finally, the results from GSE14520 demonstrated the reliability of this prognostic model to some extent. Conclusion This prognostic model is significant for the prognosis of HCC, and the risk score according to this model may be a better prognostic factor than other traditional clinical indicators.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 22
    Publication Date: 2021-03-25
    Description: Background As the use of nanopore sequencing for metagenomic analysis increases, tools capable of performing long-read taxonomic classification (i.e., determining the composition of a sample) in a fast and accurate manner are needed. Existing tools were either designed for short-read data (e.g., Centrifuge), take days to analyse modern sequencer outputs (e.g., MetaMaps) or suffer from suboptimal accuracy (e.g., CDKAM). Additionally, all tools require command line expertise and do not scale in the cloud. Results We present BugSeq, a novel, highly accurate metagenomic classifier for nanopore reads. We evaluate BugSeq on simulated data, mock microbial communities and real clinical samples. On the ZymoBIOMICS Even and Log communities, BugSeq (F1 = 0.95 at species level) offers better read classification than MetaMaps (F1 = 0.89–0.94) in a fraction of the time. BugSeq significantly improves on the accuracy of Centrifuge (F1 = 0.79–0.93) and CDKAM (F1 = 0.91–0.94) while offering competitive run times. When applied to 41 samples from patients with lower respiratory tract infections, BugSeq produces greater concordance with microbiological culture and qPCR compared with “What’s In My Pot” analysis. Conclusion BugSeq is deployed to the cloud for easy and scalable long-read metagenomic analyses. BugSeq is freely available for non-commercial use at https://bugseq.com/free.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 23
    Publication Date: 2021-03-24
    Description: Background Recent studies have confirmed that N7-methylguanosine (m7G) modification plays an important role in regulating various biological processes and has associations with multiple diseases. Wet-lab experiments are costly and time-consuming for the identification of disease-associated m7G sites. To date, tens of thousands of m7G sites have been identified by high-throughput sequencing approaches and the information is publicly available in bioinformatics databases, which can be leveraged to predict potential disease-associated m7G sites from a computational perspective. Thus, computational methods for m7G-disease association prediction are urgently needed, but none are currently available. Results To fill this gap, we collected association information between m7G sites and diseases, genomic information of m7G sites, and phenotypic information of diseases from different databases to build an m7G-disease association dataset. To infer potential disease-associated m7G sites, we then proposed a heterogeneous network-based model, the m7G Sites and Diseases Associations Inference (m7GDisAI) model. m7GDisAI predicts potential disease-associated m7G sites by applying a matrix decomposition method on heterogeneous networks which integrate comprehensive similarity information of m7G sites and diseases. To evaluate the prediction performance, 10 runs of tenfold cross validation were first conducted, and m7GDisAI achieved the highest AUC of 0.740 (± 0.0024). Then global and local leave-one-out cross validation (LOOCV) experiments were implemented to evaluate the model’s accuracy in global and local situations respectively. An AUC of 0.769 was achieved in global LOOCV, and 0.635 in local LOOCV. A case study was finally conducted to identify the most promising ovarian cancer-related m7G sites for further functional analysis. Gene Ontology (GO) enrichment analysis was performed to explore the complex associations between host genes of m7G sites and GO terms. The results showed that the disease-associated m7G sites identified by m7GDisAI and their host genes are consistently related to the pathogenesis of ovarian cancer, which may provide some clues for the pathogenesis of diseases. Conclusion The m7GDisAI web server can be accessed at http://180.208.58.66/m7GDisAI/, which provides a user-friendly interface to query disease-associated m7G sites. Lists of the top 20 m7G sites predicted to be associated with each of 177 diseases can be retrieved. Furthermore, detailed information about specific m7G sites and diseases is also shown.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 24
    Publication Date: 2021-03-10
    Description: Background Correlation network analysis has become an integral tool to study metabolite datasets. Networks are constructed by omitting correlations between metabolites based on two thresholds, namely the r-value and the associated p-value. While p-value threshold settings follow the rules of multiple hypothesis testing correction, guidelines for r-value threshold settings have not been defined. Results Here, we introduce a method that allows determining the r-value threshold based on an iterative approach, where different networks are constructed and their network topology is monitored. Once the network topology changes significantly, the threshold is set to the corresponding correlation coefficient value. The approach was exemplified on (i) a metabolite and morphological trait dataset from a potato association panel, which was grown under normal irrigation and water recovery conditions, and validated on (ii) a metabolite dataset of hearts of fed and fasted mice. For the potato normal irrigation correlation network a threshold of Pearson’s |r| ≥ 0.23 was suggested, while for the water recovery correlation network a threshold of Pearson’s |r| ≥ 0.41 was estimated. For both mice networks the threshold was calculated as Pearson’s |r| ≥ 0.84. Conclusions Our analysis corrected the previously stated Pearson’s correlation coefficient threshold from 0.4 to 0.41 in the water recovery network and from 0.4 to 0.23 for the normal irrigation network. Furthermore, the proposed method suggested a correlation threshold of 0.84 for both mice networks rather than a threshold of 0.7 as applied earlier. We demonstrate that the proposed approach is a valuable tool for constructing biologically meaningful networks. (A short code sketch follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
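The iterative thresholding idea from the record above can be sketched by scanning |r| cutoffs and watching a topology statistic. The change rule below (largest drop in edge density between consecutive thresholds) and the random data are simplifying assumptions, not the authors' exact procedure:

```python
# Sketch: scan |r| thresholds and monitor network topology; set the threshold
# where topology changes most.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 30))            # 50 samples x 30 metabolites (toy)
R = np.corrcoef(data, rowvar=False)

thresholds = np.arange(0.10, 0.90, 0.05)
density = []
for t in thresholds:
    A = (np.abs(R) >= t) & ~np.eye(30, dtype=bool)
    density.append(nx.density(nx.from_numpy_array(A.astype(int))))

drop = -np.diff(density)                    # topology change between steps
print("suggested |r| threshold:", round(float(thresholds[drop.argmax() + 1]), 2))
```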
  • 25
    Publication Date: 2021-03-18
    Description: Background Historical and updated information provided by time-course data collected during an entire treatment period proves to be more useful than information provided by single-point data. Accurate predictions made using time-course data on multiple biomarkers that indicate a patient’s response to therapy contribute positively to the decision-making process associated with designing effective treatment programs for various diseases. Therefore, the development of prediction methods incorporating time-course data on multiple markers is necessary. Results We proposed new methods that may be used for prediction and gene selection via time-course gene expression profiles. Our prediction method consolidated multiple probabilities calculated using gene expression profiles collected over a series of time points to predict therapy response. Using two data sets collected from patients with hepatitis C virus (HCV) infection and multiple sclerosis (MS), we performed numerical experiments that predicted response to therapy and evaluated their accuracies. Our methods were more accurate than conventional methods and successfully selected genes, the functions of which were associated with the pathology of HCV infection and MS. Conclusions The proposed method accurately predicted response to therapy using data at multiple time points. It showed higher accuracies at early time points compared to those of conventional methods. Furthermore, this method successfully selected genes that were directly associated with diseases.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 26
    Publication Date: 2021-03-13
    Description: Background Trauma-induced coagulopathy (TIC) is a disorder that occurs in one-third of severely injured trauma patients, manifesting as increased bleeding and a 4X risk of mortality. Understanding the mechanisms driving TIC and its clinical risk factors is essential to mitigating this coagulopathic bleeding and therefore to saving lives. In this retrospective, single-hospital study of 891 trauma patients, we investigate and quantify how two prominently described phenotypes of TIC, consumptive coagulopathy and hyperfibrinolysis, affect survival odds in the first 25 h, when deaths from TIC are most prevalent. Methods We employ a joint survival model to estimate the longitudinal trajectories of the protein Factor II (% activity) and the log of the protein fragment D-Dimer (μg/ml), representative biomarkers of consumptive coagulopathy and hyperfibrinolysis respectively, and tie them together with patient outcomes. Joint models have recently gained popularity in medical studies due to the necessity to simultaneously track continuously measured biomarkers as a disease evolves, as well as to associate them with patient outcomes. In this work, we estimate and analyze our joint model using Bayesian methods to obtain uncertainties and distributions over associations and trajectories. Results We find that a unit increase in log D-Dimer increases the risk of mortality by 2.22 [1.57, 3.28] fold, while a unit increase in Factor II only marginally decreases the risk of mortality by 0.94 [0.91, 0.96] fold. This suggests that, while managing consumptive coagulopathy and hyperfibrinolysis both seem to affect survival odds, the effect of hyperfibrinolysis is much greater and more sensitive. Furthermore, we find that the longitudinal trajectories, controlling for many fixed covariates, trend differently for different patients. Thus, a more personalized approach is necessary when considering treatment and risk prediction under these phenotypes. Conclusion This study reinforces the finding that hyperfibrinolysis is linked with poor patient outcomes regardless of factor consumption levels. Furthermore, it quantifies the degree to which measured D-Dimer levels correlate with increased risk. The single-hospital, retrospective nature of the study means the results are specific to this particular hospital's patients and protocol for treating trauma patients. Expanding to a multi-hospital setting would result in better estimates about the underlying relationship of consumptive coagulopathy and hyperfibrinolysis with survival, regardless of protocol. Individual trajectories obtained with these estimates can be used to provide personalized dynamic risk prediction when making decisions regarding management of blood factors. (A short code sketch follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
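The reported fold-changes are hazard ratios, i.e. exp(beta) for a one-unit increase in a covariate. A plain Cox model on synthetic data, using the lifelines package, illustrates that relationship; the paper's Bayesian joint longitudinal-survival model is not attempted here, and the effect sizes below are assumptions:

```python
# Sketch: hazard ratios are exp(beta) per unit covariate increase.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter  # pip install lifelines

rng = np.random.default_rng(0)
n = 500
log_ddimer = rng.normal(size=n)
factor_ii = rng.normal(size=n)
true_hazard = np.exp(0.8 * log_ddimer - 0.06 * factor_ii)  # assumed effects
df = pd.DataFrame({
    "time": rng.exponential(1.0 / true_hazard),  # event times from the hazard
    "event": 1,                                  # all events observed (toy)
    "log_ddimer": log_ddimer,
    "factor_ii": factor_ii,
})

cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
print(np.exp(cph.params_))  # hazard ratios, cf. the reported 2.22 and 0.94
```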
  • 27
    Publication Date: 2021-02-17
    Description: Background Oocyte quality decreases with aging, thereby increasing errors in fertilization, chromosome segregation, and embryonic cleavage. Oocyte appearance also changes with aging, suggesting a functional relationship between oocyte quality and appearance. However, no methods are available to objectively quantify age-associated changes in oocyte appearance. Results We show that statistical image processing of Nomarski differential interference contrast (DIC) microscopy images can be used to quantify age-associated changes in oocyte appearance in the nematode Caenorhabditis elegans. The Max–min value (mean difference between the maximum and minimum intensities within each moving window) quantitatively characterized the difference in oocyte cytoplasmic texture between 1- and 3-day-old adults (Day 1 and Day 3 oocytes, respectively). With an appropriate parameter set, the gray level co-occurrence matrix (GLCM)-based texture feature Correlation (COR) characterized this difference more sensitively than the Max–min value. Manipulating the smoothness of and/or adding irregular structures to the cytoplasmic texture of Day 1 oocyte images reproduced the difference in the Max–min value but not in COR between Day 1 and Day 3 oocytes. Increasing the size of granules in synthetic images recapitulated the age-associated changes in COR. Manual measurements validated that the cytoplasmic granules in oocytes become larger with aging. Conclusions The Max–min value and COR objectively quantify age-related changes in C. elegans oocytes in Nomarski DIC microscopy images. Our methods provide new opportunities for understanding the mechanism underlying oocyte aging. (A short code sketch follows this record.)
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
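Both image features from the record above take only a few lines with scikit-image and SciPy; the window size, distance, and angle below are illustrative stand-ins for the parameters the paper tunes:

```python
# Sketch: the two texture features from the record above, on a random image
# standing in for a DIC crop.
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter
from skimage.feature import graycomatrix, graycoprops

img = (np.random.rand(64, 64) * 255).astype(np.uint8)

# Max-min value: mean of (max - min) intensity within each moving window.
maxmin = (maximum_filter(img, size=5).astype(int)
          - minimum_filter(img, size=5).astype(int)).mean()
print("Max-min value:", round(float(maxmin), 1))

# GLCM-based Correlation (COR) feature.
glcm = graycomatrix(img, distances=[5], angles=[0],
                    levels=256, symmetric=True, normed=True)
print("GLCM correlation:", round(float(graycoprops(glcm, "correlation")[0, 0]), 3))
```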
  • 28
    Publication Date: 2021-03-25
    Description: Background Cumulative evidence from biological experiments has confirmed that miRNAs play significant roles in the diagnosis and treatment of complex diseases. However, traditional medical experiments are time-consuming and costly, and therefore fail to find unconfirmed miRNA and disease interactions. Thus, discovering potential miRNA-disease associations will contribute to reducing disease pathogenesis and benefit disease therapy. Although existing methods using different computational algorithms achieve favorable performance in searching for potential miRNA-disease interactions, there is still room to improve experimental results. Results We present a novel combined embedding model to predict miRNA-disease associations (CEMDA) in this article. The combined embedding information of miRNA and disease is composed of pair embedding and node embedding. Compared with previous heterogeneous network methods that are merely node-centric and simply compute the similarity of miRNA and disease, our method fuses pair embedding to pay more attention to capturing the features behind the relative information, which models the fine-grained pairwise relationship better than the previous case where each node has only a single embedding. First, we construct the heterogeneous network from supported miRNA-disease pairs, disease semantic similarity and miRNA functional similarity. Given the above heterogeneous network, we find all the associated context paths of each confirmed miRNA and disease. Meta-paths are linked by nodes and then input to the gated recurrent unit (GRU) to directly learn more accurate similarity measures between miRNA and disease. Here, the multi-head attention mechanism is used to weight the hidden state of each meta-path, and the similarity information transmission mechanism in a meta-path of miRNA and disease is obtained through multiple network layers. Second, pair embedding of miRNA and disease is fed to the multi-layer perceptron (MLP), which focuses on the more important segments in the pairwise relationship. Finally, we combine meta-path-based node embedding and pair embedding with the cost function to learn and predict miRNA-disease associations. The source code and data sets that verify the results of our research are shown at https://github.com/liubailong/CEMDA. Conclusions The performance of CEMDA in leave-one-out cross validation and fivefold cross validation is 93.16% and 92.03%, respectively. This denotes that, compared with other methods, CEMDA achieves superior performance. Case studies on lung, breast, prostate and pancreatic cancers show that 48, 50, 50 and 50 of the top 50 predicted miRNAs, respectively, are confirmed in HMDD v2.0. This further identifies the feasibility and effectiveness of our method.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 29
    Publication Date: 2021-03-29
Description: Background Exploring the relationship between disease and gene is of great significance for understanding the pathogenesis of disease and developing corresponding therapeutic measures. The prediction of disease-gene associations by computational methods accelerates this process. Results Many existing methods cannot fully utilize multi-dimensional biological entity relationships to predict disease-gene associations because the data are multi-source and heterogeneous. This paper proposes FactorHNE, a factor graph-aggregated heterogeneous network embedding method for disease-gene association prediction, which captures a variety of semantic relationships between heterogeneous nodes by factorization. It produces different semantic factor graphs and effectively aggregates the various semantic relationships, using an end-to-end multi-perspective loss function to optimize the model. It then produces good node embeddings for predicting disease-gene associations. Conclusions Experimental verification and analysis show that FactorHNE has better performance and scalability than existing models. It also has good interpretability and can be extended to large-scale biomedical network data analysis.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 30
    Publication Date: 2021-03-26
Description: Background Coronavirus Disease 2019 (COVID-19) is a viral pandemic disease that may induce severe pneumonia in humans. In this paper, we investigated the putative implication of 12 vaccines, including BCG, OPV and MMR, in protection against COVID-19. Sequences of the main antigenic proteins in the investigated vaccines and SARS-CoV-2 proteins were compared to identify similar patterns. The immunogenic effect of the identified segments was then assessed using a combination of structural and antigenicity prediction tools. Results A total of 14 highly similar segments were identified in the investigated vaccines. Structural and antigenicity prediction analysis showed that, among the identified patterns, three segments in Hepatitis B, Tetanus, and Measles proteins presented antigenic properties that could induce a putative protective effect against COVID-19. Conclusions Our results suggest a possible protective effect of HBV, Tetanus and Measles vaccines against COVID-19, which may explain the variation in disease severity among regions.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 31
    Publication Date: 2021-03-30
Description: Background Women are at more than 1.5-fold higher risk for clinically relevant adverse drug events. While this higher prevalence is partially due to gender-related effects, biological sex differences likely also impact drug response. Publicly available gene expression databases provide a unique opportunity for examining drug response at a cellular level. However, missingness and heterogeneity of metadata prevent large-scale identification of drug exposure studies and limit assessments of sex bias. To address this, we trained organism-specific models to infer sample sex from gene expression data, and used entity normalization to map metadata cell line and drug mentions to existing ontologies. Using this method, we inferred sex labels for 450,371 human and 245,107 mouse microarray and RNA-seq samples from refine.bio. Results Overall, we find a slight female bias (52.1%) in human samples and a male bias (62.5%) in mouse samples; this corresponds to a majority of mixed-sex studies in humans and single-sex studies in mice, split between female-only and male-only (25.8% vs. 18.9% in human and 21.6% vs. 31.1% in mouse, respectively). In drug studies, we find limited evidence for sex-sampling bias overall; however, specific categories of drugs, including human cancer and mouse nervous system drugs, are enriched in female-only and male-only studies, respectively. We leverage our expression-based sex labels to further examine the complexity of cell line sex and assess the frequency of metadata sex label misannotations (2–5%). Conclusions Our results demonstrate limited overall sex bias, while highlighting high bias in specific subfields and underscoring the importance of including sex labels to better understand the underlying biology. We make our inferred and normalized labels, along with flags for misannotated samples, publicly available to catalyze the routine use of sex as a study variable in future analyses.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 32
    Publication Date: 2021-03-25
Description: Background Deep immune receptor sequencing, RepSeq, provides unprecedented opportunities for identifying and studying condition-associated T-cell clonotypes, represented by T-cell receptor (TCR) CDR3 sequences. However, due to the immense diversity of the immune repertoire, identification of condition-relevant TCR CDR3s from total repertoires has mostly been limited to either “public” CDR3 sequences or to comparisons of CDR3 frequencies observed in a single individual. A methodology for the identification of condition-associated TCR CDR3s by direct population-level comparison of RepSeq samples is currently lacking. Results We present a method for direct population-level comparison of RepSeq samples using immune repertoire sub-units (or sub-repertoires) that are shared across individuals. The method first performs unsupervised clustering of CDR3s within each sample. It then finds matching clusters across samples, called immune sub-repertoires, and performs statistical differential abundance testing at the level of the identified sub-repertoires. Finally, it ranks CDR3s in differentially abundant sub-repertoires for relevance to the condition. We applied the method to total TCR CDR3β RepSeq datasets of celiac disease patients, as well as to public datasets of yellow fever vaccination. The method successfully identified celiac-disease-associated CDR3β sequences, as evidenced by considerable agreement of TRBV-gene and positional amino acid usage patterns in the detected CDR3β sequences with previously known CDR3βs specific to gluten in celiac disease. It also recovered significantly higher numbers of previously known condition-relevant CDR3β sequences than would be expected by chance. Conclusion We conclude that immune sub-repertoires of similar immuno-genomic features shared across unrelated individuals can serve as viable units of immune repertoire comparison, serving as a proxy for the identification of condition-associated CDR3s.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 33
    Publication Date: 2021-03-25
Description: Background Large-scale biological data sets are often contaminated by noise, which can impede accurate inferences about underlying processes. Such measurement noise can arise from endogenous biological factors like cell cycle and life history variation, and from exogenous technical factors like sample preparation and instrument variation. Results We describe a general method for automatically reducing noise in large-scale biological data sets. This method uses an interaction network to identify groups of correlated or anti-correlated measurements that can be combined or “filtered” to better recover an underlying biological signal. Similar to the process of denoising an image, a single network filter may be applied to an entire system, or the system may first be decomposed into distinct modules and a different filter applied to each. Applied to synthetic data with known network structure and signal, network filters accurately reduce noise across a wide range of noise levels and structures. Applied to a machine learning task of predicting changes in human protein expression in healthy and cancerous tissues, network filtering prior to training increases accuracy by up to 43% compared to using unfiltered data. Conclusions Network filters are a general way to denoise biological data and can account for both correlation and anti-correlation between different measurements. Furthermore, we find that partitioning a network prior to filtering can significantly reduce errors in networks with heterogeneous data and correlation patterns, and this approach outperforms existing diffusion-based methods. Our results on proteomics data indicate the broad potential utility of network filters for applications in systems biology.
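As a minimal sketch of the idea, assuming the simplest mean-filter variant (the paper also develops module-partitioned and anti-correlation-aware filters not shown here), each node's noisy measurement can be replaced by the average over itself and its network neighbours:

```python
# Minimal network mean filter: average each node's value with its
# neighbours'. The chain graph below is a toy in which neighbours really
# do carry correlated signal, which is when such a filter helps.
import numpy as np

def network_mean_filter(x, A):
    """x: per-node measurements; A: symmetric 0/1 adjacency matrix."""
    A_hat = A + np.eye(len(x))              # include each node's own value
    return (A_hat @ x) / A_hat.sum(axis=1)  # neighbourhood average per node

rng = np.random.default_rng(1)
n = 200
A = np.zeros((n, n))
for i in range(n):                          # link node i to i-2 .. i+2
    for j in range(max(0, i - 2), min(n, i + 3)):
        if i != j:
            A[i, j] = 1
signal = np.sin(np.linspace(0, 6, n))       # smooth underlying signal
noisy = signal + rng.normal(0, 0.5, n)
denoised = network_mean_filter(noisy, A)
print(np.abs(noisy - signal).mean(), np.abs(denoised - signal).mean())
```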
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 34
    Publication Date: 2021-03-25
Description: Background Model averaging has attracted increasing attention in recent years for the analysis of high-dimensional data. By suitably weighting several competing statistical models, model averaging attempts to achieve stable and improved prediction. In this paper, we develop a two-stage model averaging procedure to enhance accuracy and stability in prediction for high-dimensional linear regression. First, we employ a high-dimensional variable selection method such as LASSO to screen redundant predictors and construct a class of candidate models; then, we apply jackknife cross-validation to optimize the model weights for averaging. Results In simulation studies, the proposed technique outperforms commonly used alternative methods in the high-dimensional regression setting, in terms of minimizing the mean squared prediction error. We apply the proposed method to a riboflavin data set; the results show that the method is quite efficient in forecasting the riboflavin production rate when there are thousands of genes and only tens of subjects. Conclusions Compared with a recent high-dimensional model averaging procedure (Ando and Li in J Am Stat Assoc 109:254–65, 2014), the proposed approach enjoys three appealing features and thus has better predictive performance: (1) more suitable methods are applied for model construction and weighting; (2) computational flexibility is retained, since each candidate model and its corresponding weight are determined in the low-dimensional setting and quadratic programming is utilized in the cross-validation; (3) model selection and averaging are combined in the procedure, thus making full use of the strengths of both techniques. As a consequence, the proposed method can achieve stable and accurate predictions in high-dimensional linear models, and can greatly help practical researchers analyze genetic data in medical research.
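A compact sketch of the two stages follows, under illustrative assumptions: nested candidate models are built from the LASSO coefficient ranking, and simplex-constrained weights are fit to leave-one-out predictions with a general-purpose optimizer standing in for a dedicated quadratic-programming routine.

```python
# Two-stage model averaging sketch: LASSO screening, then jackknife
# (leave-one-out) weight optimization over the probability simplex.
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(2)
n, p = 60, 500
X = rng.normal(size=(n, p))
y = X[:, :5] @ np.ones(5) + rng.normal(size=n)

# Stage 1: screen predictors with LASSO and form nested candidate models.
ranking = np.argsort(-np.abs(LassoCV(cv=5).fit(X, y).coef_))[:10]
candidates = [ranking[: k + 1] for k in range(len(ranking))]

# Jackknife matrix: column m holds leave-one-out predictions of model m.
J = np.zeros((n, len(candidates)))
for m, cols in enumerate(candidates):
    for i in range(n):
        keep = np.arange(n) != i
        fit = LinearRegression().fit(X[np.ix_(keep, cols)], y[keep])
        J[i, m] = fit.predict(X[i, cols].reshape(1, -1))[0]

# Stage 2: convex weights minimizing leave-one-out squared error
# (a quadratic program, solved here with a general-purpose optimizer).
w0 = np.full(len(candidates), 1 / len(candidates))
res = minimize(lambda w: np.sum((y - J @ w) ** 2), w0,
               bounds=[(0, 1)] * len(candidates),
               constraints=({"type": "eq", "fun": lambda w: w.sum() - 1},))
print(np.round(res.x, 3))
```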
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 35
    Publication Date: 2021-03-25
    Description: Background Translation is a fundamental process in gene expression. Ribosome profiling is a method that enables the study of transcriptome-wide translation. A fundamental, technical challenge in analyzing Ribo-Seq data is identifying the A-site location on ribosome-protected mRNA fragments. Identification of the A-site is essential as it is at this location on the ribosome where a codon is translated into an amino acid. Incorrect assignment of a read to the A-site can lead to lower signal-to-noise ratio and loss of correlations necessary to understand the molecular factors influencing translation. Therefore, an easy-to-use and accurate analysis tool is needed to accurately identify the A-site locations. Results We present RiboA, a web application that identifies the most accurate A-site location on a ribosome-protected mRNA fragment and generates the A-site read density profiles. It uses an Integer Programming method that reflects the biological fact that the A-site of actively translating ribosomes is generally located between the second codon and stop codon of a transcript, and utilizes a wide range of mRNA fragment sizes in and around the coding sequence (CDS). The web application is containerized with Docker, and it can be easily ported across platforms. Conclusions The Integer Programming method that RiboA utilizes is the most accurate in identifying the A-site on Ribo-Seq mRNA fragments compared to other methods. RiboA makes it easier for the community to use this method via a user-friendly and portable web application. In addition, RiboA supports reproducible analyses by tracking all the input datasets and parameters, and it provides enhanced visualization to facilitate scientific exploration. RiboA is available as a web service at https://a-site.vmhost.psu.edu/. The code is publicly available at https://github.com/obrien-lab/aip_web_docker under the MIT license.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 36
    Publication Date: 2021-03-15
Description: Background Tissues are often heterogeneous in their single-cell molecular expression, and this can govern the regulation of cell fate. For the understanding of development and disease, it is important to quantify heterogeneity in a given tissue. Results We present the R package stochprofML, which uses the maximum likelihood principle to parameterize heterogeneity from the cumulative expression of small random pools of cells. We evaluate the algorithm’s performance in simulation studies and present further application opportunities. Conclusion In stochastic profiling, the computational demixing that pooled samples require is outweighed by savings in experimental cost and effort and by reduced measurement error. The approach offers possibilities for parameterizing heterogeneity, estimating underlying pool compositions and detecting differences in cell populations across samples.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 37
    Publication Date: 2021-03-09
    Description: Background Automated text classification has many important applications in the clinical setting; however, obtaining labelled data for training machine learning and deep learning models is often difficult and expensive. Active learning techniques may mitigate this challenge by reducing the amount of labelled data required to effectively train a model. In this study, we analyze the effectiveness of 11 active learning algorithms on classifying subsite and histology from cancer pathology reports using a Convolutional Neural Network as the text classification model. Results We compare the performance of each active learning strategy using two differently sized datasets and two different classification tasks. Our results show that on all tasks and dataset sizes, all active learning strategies except diversity-sampling strategies outperformed random sampling, i.e., no active learning. On our large dataset (15K initial labelled samples, adding 15K additional labelled samples each iteration of active learning), there was no clear winner between the different active learning strategies. On our small dataset (1K initial labelled samples, adding 1K additional labelled samples each iteration of active learning), marginal and ratio uncertainty sampling performed better than all other active learning techniques. We found that compared to random sampling, active learning strongly helps performance on rare classes by focusing on underrepresented classes. Conclusions Active learning can save annotation cost by helping human annotators efficiently and intelligently select which samples to label. Our results show that a dataset constructed using effective active learning techniques requires less than half the amount of labelled data to achieve the same performance as a dataset constructed using random sampling.
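For concreteness, the two strategies that won on the small dataset can be written in a few lines. Given class probabilities from the classifier, marginal sampling ranks unlabelled samples by the gap between the top two class probabilities and ratio sampling by their quotient; this is a sketch of the selection rules, not the study's implementation.

```python
# Uncertainty-sampling rules: "marginal" (top-1 minus top-2 probability)
# and "ratio" (top-1 / top-2). Probabilities would come from the CNN.
import numpy as np

def marginal_uncertainty(proba):
    """Smaller top-1 minus top-2 gap = more uncertain."""
    part = np.sort(proba, axis=1)
    return part[:, -1] - part[:, -2]

def ratio_uncertainty(proba):
    """Top-1 / top-2 ratio closer to 1 = more uncertain."""
    part = np.sort(proba, axis=1)
    return part[:, -1] / part[:, -2]

def select_batch(proba, k, rule=marginal_uncertainty):
    """Pick the k unlabelled samples the model is least sure about."""
    return np.argsort(rule(proba))[:k]

proba = np.random.default_rng(3).dirichlet(np.ones(4), size=1000)
print(select_batch(proba, 5))
```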
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 38
    Publication Date: 2021-03-09
Description: Background Today an unprecedented amount of genetic sequence data is stored in publicly available repositories. For decades now, mitochondrial DNA (mtDNA) has been the workhorse of genetic studies, and as a result, there is a large volume of mtDNA data available in these repositories for a wide range of species. Indeed, whilst whole genome sequencing is an exciting prospect for the future, for most non-model organisms classical markers such as mtDNA remain widely used. By compiling existing data from multiple original studies, it is possible to build powerful new datasets capable of exploring many questions in ecology, evolution and conservation biology. One key question that these data can help inform is what happened in a species’ demographic past. However, compiling data in this manner is not trivial; there are many complexities associated with data extraction, data quality and data handling. Results Here we present the mtDNAcombine package, a collection of tools developed to manage some of the major decisions associated with handling multi-study sequence data, with a particular focus on preparing sequence data for Bayesian skyline plot demographic reconstructions. Conclusions There is now more genetic information available than ever before, and large multi-study data sets offer great opportunities to explore new and exciting avenues of research. However, compiling multi-study datasets remains a technically challenging prospect. The mtDNAcombine package provides a pipeline to streamline the process of downloading, curating, and analysing sequence data, guiding the process of compiling data sets from the online database GenBank.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 39
    Publication Date: 2021-03-12
    Description: Background Metagenomics is the study of microbial genomes for pathogen detection and discovery in human clinical, animal, and environmental samples via Next-Generation Sequencing (NGS). Metagenome de novo sequence assembly is a crucial analytical step in which longer contigs, ideally whole chromosomes/genomes, are formed from shorter NGS reads. However, the contigs generated from the de novo assembly are often very fragmented and rarely longer than a few kilo base pairs (kb). Therefore, a time-consuming extension process is routinely performed on the de novo assembled contigs. Results To facilitate this process, we propose a new tool for metagenome contig extension after de novo assembly. ContigExtender employs a novel recursive extending strategy that explores multiple extending paths to achieve highly accurate longer contigs. We demonstrate that ContigExtender outperforms existing tools in synthetic, animal, and human metagenomics datasets. Conclusions A novel software tool ContigExtender has been developed to assist and enhance the performance of metagenome de novo assembly. ContigExtender effectively extends contigs from a variety of sources and can be incorporated in most viral metagenomics analysis pipelines for a wide variety of applications, including pathogen detection and viral discovery.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 40
    Publication Date: 2021-03-12
Description: Background Recently, copy number variations (CNVs) impacting genes involved in oncogenic pathways have attracted increasing attention for managing disease susceptibility. CNV is one of the most important somatic aberrations in the genome of tumor cells. Oncogene activation and tumor suppressor gene inactivation are often attributed to copy number gain/amplification or deletion, respectively, in many cancer types and stages. Recent advances in next-generation sequencing protocols allow for the addition of unique molecular identifiers (UMIs) to each read: each targeted DNA fragment is labeled with a unique random nucleotide sequence added to the sequencing primers. UMIs are especially useful for CNV detection by making each DNA molecule in a population of reads distinct. Results Here, we present molecular Copy Number Alteration (mCNA), a new methodology allowing the detection of copy number changes using UMIs. The algorithm is composed of four main steps: construction of UMI count matrices, use of control samples to construct a pseudo-reference, computation of log-ratios, and segmentation followed by statistical inference of abnormal segment breaks. We demonstrate the success of mCNA on a dataset of patients suffering from diffuse large B-cell lymphoma, and we highlight that mCNA results have a strong correlation with comparative genomic hybridization. Conclusion We provide mCNA, a new approach for CNV detection, freely available at https://gitlab.com/pierrejulien.viailly/mcna/ under the MIT license. mCNA can significantly improve the detection accuracy of copy number changes by using UMIs.
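The first steps are straightforward to express. The sketch below is illustrative only (segmentation and break-point inference are omitted): it normalizes UMI counts by library size, builds a median pseudo-reference from control samples, and returns per-region log2 ratios.

```python
# mCNA-style core steps: library-size normalization, median pseudo-
# reference from controls, per-region log2 ratios. Illustrative only.
import numpy as np

def log_ratios(tumour_umi, control_umi):
    """tumour_umi: (regions,) UMI counts; control_umi: (controls, regions)."""
    t = tumour_umi / tumour_umi.sum()
    c = control_umi / control_umi.sum(axis=1, keepdims=True)
    pseudo_ref = np.median(c, axis=0)          # per-region reference
    return np.log2((t + 1e-9) / (pseudo_ref + 1e-9))

rng = np.random.default_rng(4)
controls = rng.poisson(100, size=(8, 50))
tumour = rng.poisson(100, size=50)
tumour[10:20] *= 2                              # simulated copy-number gain
print(np.round(log_ratios(tumour, controls)[5:25], 2))
```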
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 41
    Publication Date: 2021-03-12
Description: Background The identification of protein families is of outstanding practical importance for in silico protein annotation and is at the basis of several bioinformatic resources. Pfam is possibly the best-known protein family database, built over many years of work by domain experts with extensive use of manual curation. This approach is generally very accurate, but it is quite time-consuming and may suffer from a bias generated by the hand-curation itself, which is often guided by the available experimental evidence. Results We introduce a procedure that aims to identify putative protein families automatically. The procedure is based on Density Peak Clustering and uses as input only local pairwise alignments between protein sequences. In the experiment we present here, we ran the algorithm on about 4000 full-length proteins with at least one domain classified by Pfam as belonging to the Pseudouridine synthase and Archaeosine transglycosylase (PUA) clan. We obtained 71 automatically generated sequence clusters with at least 100 members. While our clusters were largely consistent with the Pfam classification, showing good overlap with either single- or multi-domain Pfam family architectures, we also observed some inconsistencies. The latter were inspected using structural and sequence-based evidence, which suggested that the automatic classification captured evolutionary signals reflecting non-trivial features of protein family architectures. Based on this analysis we identified a putative novel pre-PUA domain as well as alternative boundaries for a few PUA or PUA-associated families. As a first indication that our approach was unlikely to be clan-specific, we performed the same analysis on the P53 clan, obtaining comparable results. Conclusions The clustering procedure described in this work takes advantage of the information contained in a large set of pairwise alignments and successfully identifies a set of putative families and family architectures in an unsupervised manner. Comparison with the Pfam classification highlights significant overlap and points to interesting differences, suggesting that our new algorithm could have potential in applications related to automatic protein classification. Testing this hypothesis, however, will require further experiments on large and diverse sequence datasets.
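The core quantities of Density Peak Clustering are easy to state: each point's local density, and its distance to the nearest denser point. The sketch below computes both from a distance matrix, using Euclidean distances on toy points for illustration; in the paper, distances derive from local pairwise sequence alignments.

```python
# Density Peak Clustering core (after Rodriguez & Laio, 2014): rho is the
# number of neighbours within a cutoff dc, delta the distance to the
# nearest point of higher density; cluster centres have high rho * delta.
import numpy as np

def density_peaks(D, dc):
    """D: symmetric distance matrix; dc: density cutoff radius."""
    rho = (D < dc).sum(axis=1) - 1                 # neighbours within dc
    order = np.argsort(-rho)                       # descending density
    delta = np.full(len(D), D.max())               # convention for densest point
    nearest_higher = np.full(len(D), -1)
    for rank, i in enumerate(order[1:], start=1):
        higher = order[:rank]                      # points denser than i
        j = higher[np.argmin(D[i, higher])]
        delta[i], nearest_higher[i] = D[i, j], j
    return rho, delta, nearest_higher

rng = np.random.default_rng(5)
pts = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
D = np.linalg.norm(pts[:, None] - pts[None], axis=-1)
rho, delta, _ = density_peaks(D, dc=1.0)
print(np.argsort(-(rho * delta))[:2])   # candidate cluster centres
```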
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 42
    Publication Date: 2021-03-26
Description: Background Hidden Markov models (HMMs) are a powerful tool for analyzing biological sequences in a wide variety of applications, from profiling functional protein families to identifying functional domains. The standard method for HMM training is either maximum likelihood using counting, when sequences are labelled, or expectation maximization, such as the Baum–Welch algorithm, when sequences are unlabelled. However, there are increasingly situations where sequences are only partially labelled. In this paper, we designed a new training method based on the Baum–Welch algorithm to train HMMs for situations in which only partial labeling is available for certain biological problems. Results Compared with a similar method previously reported for the purpose of active learning in text mining, our method achieves significant improvements in model training, as demonstrated by higher accuracy when the trained models are tested for decoding with both synthetic and real data. Conclusions A novel training method was developed to improve the training of hidden Markov models by utilizing partially labelled data. The method will have an impact on the detection of de novo motifs and signals in biological sequence data. In particular, the method will be deployed in active learning mode in ongoing research on detecting plasmodesmata targeting signals, and its performance will be assessed with validation from wet-lab experiments.
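One standard way to inject partial labels into forward-backward computations, shown below as a hedged sketch and not necessarily the authors' exact constraint scheme, is to zero the emission likelihood of all states that contradict a known label, so that only label-consistent paths carry probability mass in the E-step.

```python
# Constrained (scaled) forward pass for a partially labelled sequence:
# at positions with a known state, emissions of all other states are
# masked to zero. The same masking applies in the Baum-Welch E-step.
import numpy as np

def masked_forward(pi, A, B, obs, labels):
    """labels[t] = known state index at position t, or -1 if unlabelled."""
    n_states = len(pi)
    alpha = np.zeros((len(obs), n_states))
    for t, o in enumerate(obs):
        e = B[:, o].copy()
        if labels[t] >= 0:                  # enforce the partial label
            mask = np.zeros(n_states)
            mask[labels[t]] = 1.0
            e *= mask
        alpha[t] = (pi if t == 0 else alpha[t - 1] @ A) * e
        alpha[t] /= alpha[t].sum()          # scale to avoid underflow
    return alpha

pi = np.array([0.6, 0.4])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.1, 0.9]])     # 2 states x 2 symbols
obs = [0, 1, 1, 0, 1]
labels = [-1, 1, -1, -1, -1]               # only position 1 is labelled
print(masked_forward(pi, A, B, obs, labels))
```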
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 43
    Publication Date: 2021-03-17
Description: Background Identification of features is a critical task in microbiome studies that is complicated by the fact that microbial data are high-dimensional and heterogeneous. Masked by the complexity of the data, the problem of separating signals (differential features between groups) from noise (features that are not differential between groups) becomes challenging and troublesome. For instance, when performing differential abundance tests, multiple testing adjustments tend to be overconservative, as the probability of a type I error (false positive) increases dramatically with the large numbers of hypotheses. Moreover, the grouping effect of interest can be obscured by heterogeneity. These factors can incorrectly lead to the conclusion that there are no differences in the microbiome compositions. Results We translate and represent the problem of identifying features that are differential between two groups (e.g., treatment versus control) as a dynamic process of separating the signal from its random background. More specifically, we progressively permute the grouping factor labels of the microbiome samples and perform multiple differential abundance tests in each scenario. We then compare the signal strength of the most differential features from the original data with their performance in permutations, and will observe a visually apparent decreasing trend if these features are true positives identified from the data. Simulations and applications on real data show that the proposed method creates a U-curve when plotting the number of significant features versus the proportion of mixing. The shape of the U-curve can convey the strength of the overall association between the microbiome and the grouping factor. We also define a fragility index to measure the robustness of the discoveries. Finally, we recommend the identified features by comparing p-values in the observed data with p-values in the fully mixed data. Conclusions We have developed this into a user-friendly and efficient R-shiny tool with visualizations. By default, we use the Wilcoxon rank sum test to compute the p-values, since it is a robust nonparametric test. Our proposed method can also utilize p-values obtained from other testing methods, such as DESeq. This demonstrates the potential of the progressive permutation method to be extended to new settings.
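The progressive permutation loop itself is simple. The sketch below uses illustrative parameters and the Wilcoxon rank-sum test (the paper's default): it mixes an increasing fraction of the group labels and counts the features that remain significant; with real signal, the count drops as mixing increases, and the paper's U-curve emerges from plotting such counts across the full mixing range.

```python
# Progressive permutation sketch: shuffle a growing fraction of labels
# and count features still significant under a Wilcoxon rank-sum test.
import numpy as np
from scipy.stats import ranksums

def n_significant(X, labels, alpha=0.05):
    g0, g1 = X[labels == 0], X[labels == 1]
    pvals = np.array([ranksums(g0[:, j], g1[:, j]).pvalue
                      for j in range(X.shape[1])])
    return (pvals < alpha).sum()

rng = np.random.default_rng(6)
n, p = 40, 200
labels = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[labels == 1, :20] += 1.0                       # 20 truly differential features

for frac in [0.0, 0.25, 0.5, 0.75, 1.0]:
    mixed = labels.copy()
    idx = rng.choice(n, size=int(frac * n), replace=False)
    mixed[idx] = rng.permutation(mixed[idx])     # progressively mix labels
    print(frac, n_significant(X, mixed))
```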
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 44
    Publication Date: 2021-03-17
Description: Background An increasing number of clinical trials require biomarker-driven patient stratification, especially for revolutionary immune checkpoint blockade therapy. Due to the complicated interaction between a tumor and its microenvironment, single biomarkers, such as PDL1 protein level, tumor mutational burden (TMB), and single gene mutation and expression, are far from satisfactory for response prediction or patient stratification. Recently, combinatorial biomarkers were reported to be more precise and powerful for predicting therapy response and identifying potential target populations with superior survival. However, there is a lack of dedicated tools for such combinatorial biomarker analysis. Results Here, we present dualmarker, an R package designed to facilitate data exploration for dual biomarker combinations. Given two biomarkers, dualmarker comprehensively visualizes their association with drug response and patient survival through 14 types of plots, such as boxplots, scatterplots, ROC curves, and Kaplan–Meier plots. Using logistic regression and Cox regression models, dualmarker evaluates the superiority of dual markers over single markers by comparing the data fitness of dual-marker versus single-marker models; this comparison is also utilized for de novo searching for new biomarker pairs. We demonstrated this straightforward workflow and comprehensive capability using public biomarker data from one bladder cancer patient cohort (IMvigor210 study); we confirmed the previously reported biomarker pairs TMB/TGF-beta signature and CXCL13 expression/ARID1A mutation for response and survival analyses, respectively. In addition, dualmarker identified new biomarker partners de novo; for example, in overall survival modelling, the model combining HMGB1 expression and ARID1A mutation had statistically better goodness-of-fit than the model with either HMGB1 or ARID1A as a single marker. Conclusions The dualmarker package is an open-source tool for the visualization and identification of combinatorial dual biomarkers. It streamlines the dual-marker analysis flow into user-friendly functions and can be used for data exploration and hypothesis generation. Its code is freely available on GitHub at https://github.com/maxiaopeng/dualmarker under the MIT license.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 45
    Publication Date: 2021-03-19
Description: Background Non-coding RNA (ncRNA) and protein interactions play essential roles in various physiological and pathological processes. The experimental methods used for predicting ncRNA–protein interactions are time-consuming and labor-intensive. Therefore, there is an increasing demand for computational methods to accurately and efficiently predict ncRNA–protein interactions. Results In this work, we present an ensemble deep learning-based method, EDLMFC, to predict ncRNA–protein interactions using a combination of multi-scale features, including primary sequence features, secondary structure sequence features, and tertiary structure features. Conjoint k-mers were used to extract protein/ncRNA sequence features, which were integrated with tertiary structure features and fed into an ensemble deep learning model that combines a convolutional neural network (CNN), to learn dominant biological information, with a bi-directional long short-term memory network (BLSTM), to capture long-range dependencies among the features identified by the CNN. Compared with other state-of-the-art methods under five-fold cross-validation, EDLMFC shows the best performance, with accuracies of 93.8%, 89.7%, and 86.1% on the RPI1807, NPInter v2.0, and RPI488 datasets, respectively. The results of the independent test demonstrate that EDLMFC can effectively predict potential ncRNA–protein interactions from different organisms. Furthermore, EDLMFC is also shown to successfully predict hub ncRNAs and proteins present in ncRNA–protein networks of Mus musculus. Conclusions In general, our proposed method EDLMFC improves the accuracy of ncRNA–protein interaction predictions and is anticipated to provide helpful guidance for research on ncRNA functions. The source code of EDLMFC and the datasets used in this work are available at https://github.com/JingjingWang-87/EDLMFC.
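As an illustration of conjoint k-mer encoding for proteins, the sketch below uses the common seven-group conjoint-triad reduction of the amino-acid alphabet and a normalized k-mer frequency vector; EDLMFC's exact grouping and choice of k may differ.

```python
# Conjoint k-mer sketch: reduce the 20 amino acids to 7 groups (classic
# conjoint-triad classes), then count k-mers over the reduced alphabet.
from itertools import product
import numpy as np

GROUPS = {aa: str(i) for i, g in enumerate(
    ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]) for aa in g}

def conjoint_kmer(seq, k=3):
    reduced = "".join(GROUPS[aa] for aa in seq if aa in GROUPS)
    kmers = ["".join(t) for t in product("0123456", repeat=k)]
    counts = {km: 0 for km in kmers}
    for i in range(len(reduced) - k + 1):
        counts[reduced[i:i + k]] += 1
    v = np.array([counts[km] for km in kmers], dtype=float)
    return v / max(v.sum(), 1)                 # normalized 7^k-dim vector

print(conjoint_kmer("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ").shape)  # (343,)
```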
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 46
    Publication Date: 2021-03-21
Description: Background Numerous studies have demonstrated that long non-coding RNAs are related to plenty of human diseases. Therefore, it is crucial to predict potential lncRNA-disease associations for disease prognosis, diagnosis and therapy. Dozens of machine learning and deep learning algorithms have been adopted for this problem, yet it is still challenging to learn efficient low-dimensional representations from high-dimensional features of lncRNAs and diseases to predict unknown lncRNA-disease associations accurately. Results We proposed an end-to-end model, VGAELDA, which integrates variational inference and graph autoencoders for lncRNA-disease association prediction. VGAELDA contains two kinds of graph autoencoders. Variational graph autoencoders (VGAE) infer representations from features of lncRNAs and diseases respectively, while graph autoencoders propagate labels via known lncRNA-disease associations. These two kinds of autoencoders are trained alternately by adopting the variational expectation maximization algorithm. The integration of both the VGAE for graph representation learning and the alternate training via variational inference strengthens the capability of VGAELDA to capture efficient low-dimensional representations from high-dimensional features, and hence improves the robustness and precision of predicting unknown lncRNA-disease associations. Further analysis illuminates that the designed co-training framework of lncRNA and disease for VGAELDA solves a geometric matrix completion problem for capturing efficient low-dimensional representations via a deep learning approach. Conclusion Cross-validations and numerical experiments illustrate that VGAELDA outperforms the current state-of-the-art methods in lncRNA-disease association prediction. Case studies indicate that VGAELDA is capable of detecting potential lncRNA-disease associations. The source code and data are available at https://github.com/zhanglabNKU/VGAELDA.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 47
    Publication Date: 2021-03-22
    Description: Background Flow and mass cytometry are important modern immunology tools for measuring expression levels of multiple proteins on single cells. The goal is to better understand the mechanisms of responses on a single cell basis by studying differential expression of proteins. Most current data analysis tools compare expressions across many computationally discovered cell types. Our goal is to focus on just one cell type. Our narrower field of application allows us to define a more specific statistical model with easier to control statistical guarantees. Results Differential analysis of marker expressions can be difficult due to marker correlations and inter-subject heterogeneity, particularly for studies of human immunology. We address these challenges with two multiple regression strategies: a bootstrapped generalized linear model and a generalized linear mixed model. On simulated datasets, we compare the robustness towards marker correlations and heterogeneity of both strategies. For paired experiments, we find that both strategies maintain the target false discovery rate under medium correlations and that mixed models are statistically more powerful under the correct model specification. For unpaired experiments, our results indicate that much larger patient sample sizes are required to detect differences. We illustrate the R package and workflow for both strategies on a pregnancy dataset. Conclusion Our approach to finding differential proteins in flow and mass cytometry data reduces biases arising from marker correlations and safeguards against false discoveries induced by patient heterogeneity.
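The authors provide an R package and workflow; as a rough Python analogue of the mixed-model strategy (with invented column names and a simulated paired design), marker expression can be modelled with a donor-level random intercept using statsmodels:

```python
# Mixed-model analogue: per-sample marker expression modelled by condition
# with a random intercept per donor to absorb inter-subject heterogeneity.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
donors = np.repeat(np.arange(10), 2)             # paired design: 2 samples/donor
condition = np.tile([0, 1], 10)
donor_effect = rng.normal(0, 1, 10)[donors]      # subject heterogeneity
expr = 5 + 0.8 * condition + donor_effect + rng.normal(0, 0.5, 20)

df = pd.DataFrame({"expr": expr, "condition": condition, "donor": donors})
fit = smf.mixedlm("expr ~ condition", df, groups=df["donor"]).fit()
print(fit.params["condition"], fit.pvalues["condition"])
```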
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 48
    Publication Date: 2021-02-06
Description: Background An inverted repeat is a DNA sequence followed downstream by its reverse complement, potentially with a gap in the centre. Inverted repeats are found in both prokaryotic and eukaryotic genomes and have been linked with countless possible functions. Many international consortia provide a comprehensive description of common genetic variation, making alternative sequence representations, such as IUPAC encoding, necessary for leveraging the full potential of such broad variation datasets. Results We present IUPACpal, an exact tool for efficient identification of inverted repeats in IUPAC-encoded DNA sequences, which also allows for potential mismatches and gaps in the inverted repeats. Conclusion Within the parameters that were tested, our experimental results show that IUPACpal compares favourably to a similar application packaged with EMBOSS. We show that IUPACpal identifies many previously unidentified inverted repeats when compared with EMBOSS, and does so with orders-of-magnitude improved speed.
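The matching rule at the heart of IUPAC-aware inverted-repeat detection can be illustrated with a naive quadratic scan (purely didactic; IUPACpal itself relies on far more efficient exact string algorithms): two IUPAC codes can pair if some concrete base of one is the complement of some base of the other.

```python
# IUPAC-aware inverted-repeat toy scanner with fixed arm length and a
# bounded centre gap. Ambiguity codes such as N match permissively.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "S": "CG", "W": "AT", "K": "GT", "M": "AC", "B": "CGT",
         "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}
COMP = {"A": "T", "C": "G", "G": "C", "T": "A"}

def pairs(a, b):
    """True if IUPAC codes a and b can form a complementary base pair."""
    return any(COMP[x] in IUPAC[b] for x in IUPAC[a])

def inverted_repeats(seq, arm=4, max_gap=3):
    """Yield (start, arm, gap) for IUPAC-compatible inverted repeats."""
    n = len(seq)
    for i in range(n):
        for gap in range(max_gap + 1):
            end = i + 2 * arm + gap
            if end > n:
                break
            left, right = seq[i:i + arm], seq[end - arm:end]
            if all(pairs(left[k], right[arm - 1 - k]) for k in range(arm)):
                yield i, arm, gap

print(list(inverted_repeats("ACGTNNNACGT", arm=4, max_gap=3)))
```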
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 49
    Publication Date: 2021-02-08
Description: Background Mass spectrometry imaging (MSI) is a family of acquisition techniques producing images of the distribution of molecules in a sample, without any prior tagging of the molecules. This makes it a very interesting technique for exploratory research. However, the images are difficult to analyze because the data they contain are high-dimensional, and their content does not necessarily reflect the shape of the object of interest. Conversely, magnetic resonance imaging (MRI) scans reflect the anatomy of the tissue. MRI also provides information complementary to MSI, such as the content and distribution of water. Results We propose a new workflow to merge the information from 2D MALDI–MSI and MRI images. Our workflow can be applied to large MSI datasets in a limited amount of time. Moreover, the workflow is fully automated and based on deterministic methods, which ensures the reproducibility of the results. Our methods were evaluated and compared with state-of-the-art methods. Results show that the images are combined precisely and in a time-efficient manner. Conclusion Our workflow reveals molecules that co-localize with water in biological images. It can be applied to any MSI and MRI datasets that satisfy a few conditions: the images must enclose the same regions of the sample shape and have similar intensity distributions.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 50
    Publication Date: 2021-02-09
Description: Background Long noncoding RNAs represent a large class of transcripts with two common features: they exceed an arbitrary length threshold of 200 nt and are assumed not to encode proteins. Although a growing body of evidence indicates that the vast majority of lncRNAs are potentially nonfunctional, hundreds of them have already been revealed to perform essential gene regulatory functions or to be linked to a number of cellular processes, including those associated with the etiology of human diseases. To better understand the biology of lncRNAs, it is essential to perform a more in-depth study of their evolution. In contrast to protein-encoding transcripts, however, they do not show the strong sequence conservation that usually results from purifying selection; therefore, software that is typically used to resolve the evolutionary relationships of protein-encoding genes and transcripts is not applicable to the study of lncRNAs. Results To tackle this issue, we developed lncEvo, a computational pipeline that consists of three modules: (1) transcriptome assembly from RNA-Seq data, (2) prediction of lncRNAs, and (3) conservation study—a genome-wide comparison of lncRNA transcriptomes between two species of interest, including a search for orthologs. Importantly, one can choose to apply lncEvo solely for transcriptome assembly or lncRNA prediction, without calling the conservation-related part. Conclusions lncEvo is an all-in-one tool built with the Nextflow framework, utilizing state-of-the-art software and algorithms with customizable trade-offs between speed and sensitivity, ease of use and built-in reporting functionalities. The source code of the pipeline is freely available for academic and nonacademic use under the MIT license at https://gitlab.com/spirit678/lncrna_conservation_nf.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 51
    Publication Date: 2021-02-08
Description: Background Biological phenomena usually evolve over time, and recent advances in high-throughput microscopy have made it possible to collect multiple 3D images over time, generating 3D+t (or 4D) datasets. To extract useful information, spatial and temporal data on the particles in the images must be obtained, and particle tracking and feature extraction require some form of assistance. Results This manuscript introduces our new freely downloadable toolbox, Visual4DTracker. It is a MATLAB package implementing several useful functionalities to navigate, analyse and proof-read the track of each particle detected in any 3D+t stack. Furthermore, it allows users to proof-read and to evaluate the traces with respect to a given gold standard. The Visual4DTracker toolbox permits users to visualize and save all the generated results through a user-friendly graphical user interface. This tool has been successfully used in three applicative examples. The first processes synthetic data to show all the software functionalities. The second shows how to process a 4D image stack showing the time-lapse growth of Drosophila cells in an embryo. The third example presents the quantitative analysis of insulin granules in living beta-cells, showing that such particles have two main dynamics that coexist inside the cells. Conclusions Visual4DTracker is a software package for MATLAB to visualize, handle and manually track 3D+t stacks of microscopy images containing objects such as cells, granules, etc. With its unique set of functions, it remarkably permits the user to analyze and proof-read 4D data in a friendly 3D fashion. The tool is freely available at https://drive.google.com/drive/folders/19AEn0TqP-2B8Z10kOavEAopTUxsKUV73?usp=sharing
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 52
    Publication Date: 2021-02-09
Description: Background Hub transcription factors, which regulate many target genes in gene regulatory networks (GRNs), play important roles as disease regulators and potential drug targets. However, while numerous methods have been developed to predict individual regulator-gene interactions from gene expression data, few methods focus on inferring these hubs. Results We have developed ComHub, a tool to predict hubs in GRNs. ComHub makes a community prediction of hubs by averaging over predictions by a compendium of network inference methods. Benchmarking ComHub against the DREAM5 challenge data and two independent gene expression datasets showed robust performance of ComHub across all datasets. Conclusions In contrast to other evaluated methods, ComHub consistently scored among the top-performing methods on data from different sources. Lastly, we implemented ComHub to work both with predefined networks and as a stand-alone network inference tool, which makes the method generally applicable.
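The community idea can be sketched in a few lines (an assumption-laden illustration, not ComHub's implementation): rank candidate regulators by out-degree in each method's predicted network, then average the ranks across methods.

```python
# Community hub prediction sketch: out-degree ranks averaged over the
# edge-confidence matrices of several network inference methods.
import numpy as np

def community_hubs(edge_weight_matrices, top_k=5):
    """Each matrix: (regulators x genes) edge confidence from one method."""
    ranks = []
    for W in edge_weight_matrices:
        outdegree = (W > np.quantile(W, 0.95)).sum(axis=1)  # strong edges
        ranks.append(np.argsort(np.argsort(-outdegree)))    # rank per method
    mean_rank = np.mean(ranks, axis=0)
    return np.argsort(mean_rank)[:top_k]                    # best average rank

rng = np.random.default_rng(8)
methods = [rng.random((50, 1000)) for _ in range(4)]
# Boost regulators 0-2 in every method so they behave like true hubs.
methods = [W + np.outer((np.arange(50) < 3) * 0.3, np.ones(1000)) for W in methods]
print(community_hubs(methods))   # regulators 0-2 should surface as hubs
```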
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 53
    Publication Date: 2021-02-09
Description: Background Current high-throughput technologies—i.e. whole genome sequencing, RNA-Seq, ChIP-Seq, etc.—generate huge amounts of data, and their usage becomes more widespread with each passing year. Complex analysis pipelines involving several computationally intensive steps have to be applied to an increasing number of samples. Workflow management systems allow parallelization and more efficient usage of computational power. Nevertheless, this mostly happens by assigning the available cores to the pipeline of a single sample or few samples at a time. We refer to this approach as the naive parallel strategy (NPS). Here, we discuss an alternative approach, which we refer to as the concurrent execution strategy (CES), which equally distributes the available processors across every sample’s pipeline. Results Theoretically, we show that, under loose conditions, the CES results in a substantial speedup, with an ideal gain ranging from 1 to the number of samples. We also observe that the CES yields even faster executions, since parallelly computable tasks scale sub-linearly. Practically, we tested both strategies on a whole exome sequencing pipeline applied to three publicly available matched tumour-normal sample pairs of gastrointestinal stromal tumour. The CES achieved speedups in latency of up to 2–2.4× compared to the NPS. Conclusions Our results hint that if resource distribution is further tailored to fit specific situations, an even greater gain in the performance of multiple-sample pipeline execution could be achieved. For this to be feasible, benchmarking of the tools included in the pipeline would be necessary. In our opinion, these benchmarks should be consistently performed by the tools’ developers. Finally, these results suggest that concurrent strategies might also lead to energy and cost savings by making the usage of low-power machine clusters feasible.
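A back-of-the-envelope Amdahl-style model makes the comparison concrete (a toy model with an assumed parallelizable fraction, not the paper's analysis): with S samples and C cores, NPS runs samples sequentially at C cores each, while CES runs all samples concurrently at C/S cores each.

```python
# Toy latency model for NPS vs CES. The parallelizable fraction f and the
# per-sample serial runtime are invented parameters for illustration.
def pipeline_time(cores, f=0.9, t_serial=1.0):
    """Runtime of one sample's pipeline on `cores` cores (Amdahl's law)."""
    return t_serial * ((1 - f) + f / cores)

def nps_latency(S, C, f=0.9):
    """All C cores on one sample at a time, samples processed in sequence."""
    return S * pipeline_time(C, f)

def ces_latency(S, C, f=0.9):
    """Cores split evenly: all S samples run concurrently on C/S cores."""
    return pipeline_time(C / S, f)

S, C = 3, 24
print(nps_latency(S, C), ces_latency(S, C))   # CES roughly halves latency here
```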
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 54
    Publication Date: 2021-02-08
Description: Background The high variability in envelope regions of some viruses such as HIV allows the virus to establish infection and to escape subsequent immune surveillance. This variability, as well as the increasing incorporation of N-linked glycosylation sites, is fundamental to this evasion. It also creates difficulties for the multiple sequence alignment (MSA) methods that provide the first step in their analysis. Existing MSA tools often fail to properly align highly variable HIV envelope sequences, requiring extensive manual editing that is impractical with even a moderate number of these variable sequences. Results We developed an automated library-building tool, NGlyAlign, that organizes similar N-linked glycosylation sites as block constraints and statistically conserved global sites as single-site constraints to automatically enforce partial columns in consistency-based MSA methods such as Dialign. This combined method accurately aligns variable HIV-1 envelope sequences. We tested the method on two datasets: a set of 156 founder and chronic gp160 HIV-1 subtype B sequences as well as a set of reference sequences of gp120 in the highly variable region 1. On measures such as entropy scores, sum-of-pair scores, column scores, and similarity heat maps, NGlyAlign+Dialign proved superior against methods such as T-Coffee, ClustalOmega, ClustalW, Praline, HIValign and Muscle. The method is scalable to large sequence sets, producing accurate alignments without requiring manual editing. As well as this application to HIV, our method can be used for other highly variable glycoproteins such as the hepatitis C virus envelope. Conclusions NGlyAlign is an automated tool for mapping and building glycosylation motif libraries to accurately align highly variable regions in HIV sequences. It can provide the basis for many studies reliant on single robust alignments. NGlyAlign has been developed as an open-source tool and is freely available at https://github.com/UNSW-Mathematical-Biology/NGlyAlign_v1.0.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 55
    Publication Date: 2021-02-08
Description: Background Identification and selection of protein particles in cryo-electron micrographs is an important step in single particle analysis. In this study, we developed a deep learning-based particle picking network to automatically detect particle centers from cryoEM micrographs. This is a challenging task due to the nature of cryoEM data, which have low signal-to-noise ratios, with variable particle sizes, shapes, distributions and grayscale variations as well as other undesirable artifacts. Results We propose a double convolutional neural network (CNN) cascade for automated detection of particles in cryo-electron micrographs. This approach, entitled Deep Regression Picker Network or “DRPnet”, is simple but very effective in recognizing different particle sizes, shapes, distributions and grayscale patterns corresponding to 2D views of 3D particles. Particles are detected by the first network, a fully convolutional regression network (FCRN), which maps the particle image to a continuous distance map that acts like a probability density function of particle centers. Particles identified by FCRN are further refined to reduce false particle detections by the second classification CNN. DRPnet’s first CNN, pretrained with only a single cryoEM dataset, can be used to detect particles from different datasets without retraining. Compared to RELION template-based autopicking, DRPnet results in better particle picking performance with drastically reduced user interactions and processing time. DRPnet also outperforms the state-of-the-art particle picking networks in terms of the supervised detection evaluation metrics recall, precision, and F-measure. To further highlight the quality of the picked particle sets, we compute and present additional performance metrics assessing the resulting 3D reconstructions, such as the number of 2D class averages, efficiency/angular coverage, Rosenthal–Henderson plots and local/global 3D reconstruction resolution. Conclusion DRPnet shows greatly improved time-savings in generating an initial particle dataset compared to manual picking followed by template-based autopicking. Compared to other networks, DRPnet has equivalent or better performance. DRPnet excels on cryoEM datasets that have low contrast or clumped particles. Evaluating other performance metrics, DRPnet is useful for higher resolution 3D reconstructions with decreased particle numbers or unknown symmetry, detecting particles with better angular orientation coverage.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 56
    Publication Date: 2021-02-08
Description: Background Drug repositioning refers to the identification of new indications for existing drugs. Drug-based inference methods for drug repositioning apply unique features of drugs for new indication prediction. These different features provide complementary information, and it is therefore necessary to integrate them for more accurate in silico drug repositioning. Results In this study, we collect 3 different types of drug features (i.e., chemical, genomic and pharmacological spaces) from public databases. Similarities between drugs are separately calculated based on each of the features. We further develop a fusion method to combine the 3 similarity measurements. We test the inference abilities of the 4 similarity datasets in drug repositioning under the guilt-by-association principle. Leave-one-out cross-validations show that the integrated similarity measurement IntegratedSim achieves the best prediction performance, with the highest AUC value of 0.8451 and the highest AUPR value of 0.2201. Case studies demonstrate that IntegratedSim produces the largest numbers of confirmed predictions in most cases. Moreover, we compare our integration method with 3 other similarity-fusion methods using the datasets in our study. Cross-validation results suggest that our method improves prediction accuracy in terms of AUC and AUPR values. Conclusions Our study suggests that the 3 drug features used in our manuscript are valuable information for drug repositioning. The comparative results indicate that integrating the 3 drug features would improve drug-disease association prediction. Our study provides a strategy for fusing different drug features for in silico drug repositioning.
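A minimal sketch of the overall scheme follows, with plain averaging standing in for the authors' fusion method and invented toy matrices: normalize and combine the similarity matrices, then score drug-disease pairs by guilt-by-association from each drug's nearest neighbours.

```python
# Similarity fusion + guilt-by-association sketch: a drug inherits
# candidate indications from its most similar neighbours.
import numpy as np

def fuse(similarities):
    """Average min-max-normalized drug-drug similarity matrices."""
    normed = [(S - S.min()) / (S.max() - S.min()) for S in similarities]
    return np.mean(normed, axis=0)

def guilt_by_association(S, known, k=5):
    """Score drug-disease pairs from the k most similar drugs' indications."""
    scores = np.zeros_like(known, dtype=float)
    for d in range(S.shape[0]):
        row = S[d].copy()
        row[d] = -np.inf                      # exclude the drug itself
        nn = np.argsort(-row)[:k]             # k nearest neighbours
        scores[d] = row[nn] @ known[nn] / row[nn].sum()
    return scores

rng = np.random.default_rng(9)
chem, genomic, pharm = (rng.random((30, 30)) for _ in range(3))
known = (rng.random((30, 12)) < 0.1).astype(float)   # drug x disease matrix
integrated = fuse([chem, genomic, pharm])
print(guilt_by_association(integrated, known).shape)  # (30, 12)
```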
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 57
    Publication Date: 2021-02-18
Description: Background One component of precision medicine is to construct prediction models with predictive ability as high as possible, e.g. to enable individual risk prediction. In genetic epidemiology, complex diseases like coronary artery disease, rheumatoid arthritis, and type 2 diabetes have a polygenic basis, and a common assumption is that biological and genetic features affect the outcome under consideration via interactions. In the case of omics data, the use of standard approaches such as generalized linear models may be suboptimal, and machine learning methods are appealing for making individual predictions. However, most of these algorithms focus on main or marginal effects of the single features in a dataset. On the other hand, the detection of interacting features is an active area of research in the realm of genetic epidemiology. One large class of algorithms for detecting interacting features is based on multifactor dimensionality reduction (MDR). Here, we further develop the model-based MDR (MB-MDR), a powerful extension of the original MDR algorithm, to enable interaction-empowered individual prediction. Results Using a comprehensive simulation study, we show that our new algorithm (median AUC: 0.66) can use information hidden in interactions and outperforms two other state-of-the-art algorithms, namely Random Forest (median AUC: 0.54) and Elastic Net (median AUC: 0.50), if interactions are present in a scenario of two pairs of two features having small effects. The performance of these algorithms is comparable if no interactions are present. Further, we show that our new algorithm is applicable to real data by comparing the performance of the three algorithms on a dataset of rheumatoid arthritis cases and healthy controls. As our new algorithm is applicable not only to biological/genetic data but to all datasets with discrete features, it may have practical implications in other research fields where interactions between features have to be considered as well; we have made our method available as an R package (https://github.com/imbs-hl/MBMDRClassifieR). Conclusions The explicit use of interactions between features can improve prediction performance and thus should be included in further attempts to move precision medicine forward.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 58
    Publication Date: 2021-02-24
    Description: Background Benchmarking the performance of complex analytical pipelines is an essential part of developing Lab Developed Tests (LDT). Reference samples and benchmark calls published by the Genome in a Bottle (GIAB) consortium have enabled the evaluation of analytical methods. The performance of such methods is not uniform across the different genomic regions of interest and variant types. Several benchmarking tools, such as hap.py, vcfeval, and vcflib, are available to assess the analytical performance characteristics of variant calling algorithms. However, assessing the performance characteristics of an overall LDT assay still requires stringing together several such tools, and experienced bioinformaticians to interpret the results. In addition, these tools depend on the hardware, operating system and other software libraries, making it impossible to reliably repeat the analytical assessment when any of the underlying dependencies change in the assay. Here we present a scalable and reproducible cloud-based benchmarking workflow that is independent of the laboratory, the technician executing the workflow, and the underlying compute hardware, and that can be used to rapidly and continually assess the performance of LDT assays, across their regions of interest and reportable range, using a broad set of benchmarking samples. Results The benchmarking workflow was used to evaluate the performance characteristics of secondary analysis pipelines commonly used by clinical genomics laboratories in their LDT assays, such as GATK HaplotypeCaller v3.7 and the SpeedSeq workflow based on FreeBayes v0.9.10. Five reference sample truth sets generated by the Genome in a Bottle (GIAB) consortium, six samples from the Personal Genome Project (PGP) and several samples with validated clinically relevant variants from the Centers for Disease Control were used in this work. The performance characteristics were evaluated and compared for multiple reportable ranges, such as the whole exome and the clinical exome. Conclusions We have implemented a benchmarking workflow for clinical diagnostic laboratories that generates metrics such as specificity, precision and sensitivity for germline SNPs and InDels within a reportable range using whole exome or genome sequencing data. Combining these benchmarking results with validation using known variants of clinical significance in publicly available cell lines, we were able to establish the performance of variant calling pipelines in a clinical setting.
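    Code sketch: the core benchmarking computation in miniature, assuming variant calls and truth-set calls are already normalized to comparable (chrom, pos, ref, alt) tuples and restricted to a reportable range; production comparisons would use haplotype-aware tools such as hap.py or vcfeval.
      def benchmark(calls, truth, region):
          # Sensitivity and precision of variant calls against a truth set,
          # restricted to a reportable range (here: a set/range of positions).
          calls = {v for v in calls if v[1] in region}
          truth = {v for v in truth if v[1] in region}
          tp, fp, fn = len(calls & truth), len(calls - truth), len(truth - calls)
          return {"TP": tp, "FP": fp, "FN": fn,
                  "sensitivity": tp / (tp + fn) if tp + fn else None,
                  "precision": tp / (tp + fp) if tp + fp else None}

      truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")}
      calls = {("chr1", 100, "A", "G"), ("chr1", 300, "G", "A")}
      print(benchmark(calls, truth, region=range(1, 1000)))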
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 59
    Publication Date: 2021-02-25
    Description: Background Microbes perform fundamental economic, social, and environmental roles in our society. Metagenomics makes it possible to investigate microbes in their natural environments (complex communities) and their interactions. Their activity is usually estimated from the functions they perform in those environments, and their contribution is measured through their genes. Advances in next-generation sequencing technology have facilitated metagenomics research, but they also create a heavy computational burden: large and complex biological datasets are available as never before. Many gene predictors are available to aid the gene annotation process, though they fail to handle the complexities of metagenomic data appropriately. There is also no standard metagenomic benchmark dataset for gene prediction, so gene predictors may inflate their results by reporting deceptively low false discovery rates. Results We introduce geneRFinder, an ML-based gene predictor able to outperform state-of-the-art gene prediction tools across this benchmark by using only one pre-trained Random Forest model. Average prediction rates of geneRFinder differed by 54% and 64%, respectively, from Prodigal and FragGeneScan while handling high-complexity metagenomes. The specificity rate of geneRFinder had the largest distance against FragGeneScan, 79 percentage points, and was 66 percentage points higher than that of Prodigal. According to McNemar's test, all percentage differences between predictor performances are statistically significant for all datasets at a 99% confidence level. Conclusions We provide geneRFinder, an approach for gene prediction in distinct metagenomic complexities, available at gitlab.com/r.lorenna/generfinder and https://osf.io/w2yd6/. We also provide a novel, comprehensive benchmark dataset for gene prediction, based on the Critical Assessment of Metagenome Interpretation (CAMI) challenge and containing labeled data from gene regions, available at https://sourceforge.net/p/generfinder-benchmark.
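    Code sketch: a conceptual stand-in for Random Forest-based gene prediction, assuming candidate ORFs are represented by simple sequence features (length, GC content, third-codon-position composition) and labelled coding or non-coding. The features and the toy data are illustrative; geneRFinder's actual feature set differs.
      import numpy as np
      from sklearn.ensemble import RandomForestClassifier

      def orf_features(seq):
          # Toy features for a candidate ORF: length, GC content and the
          # base composition of third codon positions.
          gc = (seq.count("G") + seq.count("C")) / len(seq)
          third = seq[2::3]
          return [len(seq), gc] + [third.count(b) / max(len(third), 1) for b in "ACGT"]

      rng = np.random.default_rng(1)
      def random_seq(n):
          return "".join(rng.choice(list("ACGT"), n))

      # toy training set: long sequences stand in for coding ORFs (label 1),
      # short ones for spurious ORFs (label 0)
      seqs = [random_seq(300) for _ in range(50)] + [random_seq(90) for _ in range(50)]
      labels = [1] * 50 + [0] * 50
      X = np.array([orf_features(s) for s in seqs])
      clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
      print(clf.predict([orf_features(random_seq(240))]))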
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 60
    Publication Date: 2021-02-18
    Description: Background Competing endogenous RNA (ceRNA) regulation is a newly discovered post-transcriptional regulation mechanism that plays significant roles in physiological and pathological processes. CeRNA networks provide global views that help to understand ceRNA regulation, and they have been widely used to detect survival biomarkers, select candidate regulators of disease genes, and predict long noncoding RNA functions. However, there is no software platform that covers the full workflow from construction to analysis of ceRNA networks. Results To fill this gap, we introduce CeNet Omnibus, an R/Shiny application which provides a unified framework for the construction and analysis of ceRNA networks. CeNet Omnibus enables users to select multiple measurements, such as the Pearson correlation coefficient (PCC), mutual information (MI), and liquid association (LA), to identify ceRNA pairs and construct ceRNA networks. Furthermore, CeNet Omnibus provides a one-stop solution to analyze the topological properties of ceRNA networks, detect modules, and perform gene enrichment analysis and survival analysis. CeNet Omnibus is designed for comprehensiveness, high efficiency, expandability, and user customizability, and it offers a web-based, user-friendly interface that lets users obtain the output intuitively. Conclusion CeNet Omnibus is a comprehensive platform for the construction and analysis of ceRNA networks. It is highly customizable and presents its results in an intuitive and interactive form. We expect that CeNet Omnibus will assist researchers in understanding the properties of ceRNA networks and the associated biological phenomena. CeNet Omnibus is an R/Shiny application based on the Shiny framework developed by RStudio. The R package and a detailed tutorial are available on our GitHub page at https://github.com/GaoLabXDU/CeNetOmnibus.
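    Code sketch: one of the measurements listed above (PCC) applied to ceRNA pair identification, assuming expression vectors and a table of shared miRNAs per candidate pair; the shared-miRNA cutoff, the thresholds and the toy names are illustrative assumptions, not CeNet Omnibus code.
      import numpy as np
      from scipy import stats

      def cerna_pairs(lnc_expr, mrna_expr, shared_mirnas,
                      min_shared=3, min_pcc=0.5, max_p=0.05):
          # Candidate ceRNA pairs: enough shared miRNAs plus a significant
          # positive Pearson correlation of expression across samples.
          pairs = []
          for ln, lv in lnc_expr.items():
              for mn, mv in mrna_expr.items():
                  if shared_mirnas.get((ln, mn), 0) < min_shared:
                      continue
                  r, p = stats.pearsonr(lv, mv)
                  if r >= min_pcc and p <= max_p:
                      pairs.append((ln, mn, round(r, 3), p))
          return pairs

      rng = np.random.default_rng(2)
      base = rng.random(30)                      # shared expression driver
      lnc = {"LINC-A": base + rng.normal(0, 0.1, 30)}
      mrna = {"GENE-B": base + rng.normal(0, 0.1, 30)}
      print(cerna_pairs(lnc, mrna, {("LINC-A", "GENE-B"): 5}))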
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 61
    Publication Date: 2021-02-17
    Description: Background Bioimaging techniques offer a robust tool for studying molecular pathways and morphological phenotypes of cell populations subjected to various conditions. As modern high-resolution 3D microscopy provides access to an ever-increasing amount of high-quality images, there arises a need for their analysis in an automated, unbiased, and simple way. Segmentation of structures within the cell nucleus, which is the focus of this paper, presents a new layer of complexity in the form of dense packing and significant signal overlap. At the same time, the available segmentation tools present a steep learning curve for new users with a limited technical background. This is especially apparent in the bulk processing of image sets, which requires the use of some form of programming notation. Results In this paper, we present PartSeg, a tool for segmentation and reconstruction of 3D microscopy images, optimised for the study of the cell nucleus. PartSeg integrates refined versions of several state-of-the-art algorithms, including a new multi-scale approach for segmentation and quantitative analysis of 3D microscopy images. The features and user-friendly interface of PartSeg were carefully planned with biologists in mind, based on analysis of multiple use cases and difficulties encountered with other tools, to offer an ergonomic interface with a minimal entry barrier. Bulk processing in an ad-hoc manner is possible without the need for programmer support. As the size of datasets of interest grows, such bulk processing solutions become essential for proper statistical analysis of results. Advanced users can use PartSeg components as a library within Python data processing and visualisation pipelines, for example within Jupyter notebooks. The tool is extensible so that new functionality and algorithms can be added by the use of plugins. For biologists, the utility of PartSeg is presented in several scenarios, showing the quantitative analysis of nuclear structures. Conclusions In this paper, we have presented PartSeg, a tool for precise and verifiable segmentation and reconstruction of 3D microscopy images. PartSeg is optimised for cell nucleus analysis and offers multi-scale segmentation algorithms best-suited for this task. PartSeg can also be used for the bulk processing of multiple images, and its components can be reused in other systems or computational experiments.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 62
    Publication Date: 2021-02-03
    Description: Background Differential expression and feature selection analyses are essential steps for the development of accurate diagnostic/prognostic classifiers of complicated human diseases using transcriptomics data. These steps are particularly challenging due to the curse of dimensionality and the presence of technical and biological noise. A promising strategy for overcoming these challenges is the incorporation of pre-existing transcriptomics data in the identification of differentially expressed (DE) genes. This approach has the potential to improve the quality of selected genes, increase classification performance, and enhance biological interpretability. While a number of methods have been developed that use pre-existing data for differential expression analysis, existing methods do not leverage the identities of experimental conditions to create a robust metric for identifying DE genes. Results In this study, we propose a novel differential expression and feature selection method—GEOlimma—which combines pre-existing microarray data from the Gene Expression Omnibus (GEO) with the widely-applied Limma method for differential expression analysis. We first quantify differential gene expression across 2481 pairwise comparisons from 602 curated GEO Datasets, and we convert differential expression frequencies to DE prior probabilities. Genes with high DE prior probabilities show enrichment in cell growth and death, signal transduction, and cancer-related biological pathways, while genes with low prior probabilities were enriched in sensory system pathways. We then applied GEOlimma to four differential expression comparisons within two human disease datasets and performed differential expression, feature selection, and supervised classification analyses. Our results suggest that use of GEOlimma provides greater experimental power to detect DE genes compared to Limma, due to its increased effective sample size. Furthermore, in a supervised classification analysis using GEOlimma as a feature selection method, we observed similar or better classification performance than Limma given small, noisy subsets of an asthma dataset. Conclusions Our results demonstrate that GEOlimma is a more effective method for differential gene expression and feature selection analyses compared to the standard Limma method. Due to its focus on gene-level differential expression, GEOlimma also has the potential to be applied to other high-throughput biological datasets.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 63
    Publication Date: 2021-02-01
    Description: Background The drive to understand how microbial communities interact with their environments has inspired innovations across many fields. The data generated from sequence-based analyses of microbial communities typically are of high dimensionality and can involve multiple data tables consisting of taxonomic or functional gene/pathway counts. Merging multiple high dimensional tables with study-related metadata can be challenging. Existing microbiome pipelines available in R have created their own data structures to manage this problem. However, these data structures may be unfamiliar to analysts new to microbiome data or R and do not allow for deviations from internal workflows. Existing analysis tools also focus primarily on community-level analyses and exploratory visualizations, as opposed to analyses of individual taxa. Results We developed the R package "tidyMicro" to serve as a more complete microbiome analysis pipeline. This open source software provides all of the essential tools available in other popular packages (e.g., management of sequence count tables, standard exploratory visualizations, and diversity inference tools) supplemented with multiple options for regression modelling (e.g., negative binomial, beta binomial, and/or rank based testing) and novel visualizations to improve interpretability (e.g., Rocky Mountain plots, longitudinal ordination plots). This comprehensive pipeline for microbiome analysis also maintains data structures familiar to R users to improve analysts' control over workflow. A complete vignette is provided to aid new users in analysis workflow. Conclusions tidyMicro provides a reliable alternative to popular microbiome analysis packages in R. We provide standard tools as well as novel extensions of standard analyses to improve the interpretability of results, while maintaining object malleability to encourage open source collaboration. The simple examples and full workflow from the package are reproducible and applicable to external data sets.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 64
    Publication Date: 2021-02-10
    Description: Background We present here a computational shortcut to improve a powerful wavelet-based method by Shim and Stephens (Ann Appl Stat 9(2):665–686, 2015. 10.1214/14-AOAS776) called WaveQTL that was originally designed to identify DNase I hypersensitivity quantitative trait loci (dsQTL). Results WaveQTL relies on permutations to evaluate the significance of an association. We applied a recent method by Zhou and Guan (J Am Stat Assoc 113(523):1362–1371, 2017. 10.1080/01621459.2017.1328361) to boost computational speed, which involves calculating the distribution of Bayes factors and estimating the significance of an association by simulations rather than permutations. We called this simulation-based approach “fast functional wavelet” (FFW), and tested it on a publicly available DNA methylation (DNAm) dataset on colorectal cancer. The simulations confirmed a substantial gain in computational speed compared to the permutation-based approach in WaveQTL. Furthermore, we show that FFW controls the type I error satisfactorily and has good power for detecting differentially methylated regions. Conclusions Our approach has broad utility and can be applied to detect associations between different types of functions and phenotypes. As more and more DNAm datasets are being made available through public repositories, an attractive application of FFW would be to re-analyze these data and identify associations that might have been missed by previous efforts. The full R package for FFW is freely available at GitHub https://github.com/william-denault/ffw.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 65
    Publication Date: 2021-02-12
    Description: Background The clustering of data produced by liquid chromatography coupled to mass spectrometry analyses (LC-MS data) has recently gained interest for extracting meaningful chemical or biological patterns. However, recent instrumental pipelines deliver data whose size, dimensionality and expected number of clusters are too large to be processed by classical machine learning algorithms, so that most of the state of the art relies on single-pass linkage-based algorithms. Results We propose a clustering algorithm that solves the powerful but computationally demanding kernel k-means objective function in a scalable way. As a result, it can process LC-MS data in an acceptable time on a multicore machine. To do so, we combine three essential features: a compressive data representation, the Nyström approximation and a hierarchical strategy. In addition, we propose new kernels based on optimal transport, which can be interpreted as intuitive similarity measures between chromatographic elution profiles. Conclusions Our method, referred to as CHICKN, is evaluated on proteomics data produced in our lab, as well as on benchmark data from the literature. From a computational viewpoint, it is particularly efficient on raw LC-MS data. From a data analysis viewpoint, it provides clusters which differ from those resulting from state-of-the-art methods, while achieving similar performances. This highlights the complementarity of differently principled algorithms for extracting the best from complex LC-MS data.
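    Code sketch: scalable kernel k-means via the Nyström approximation using scikit-learn, in the spirit of the approach described above: kernel features are approximated on a landmark subsample, then ordinary k-means runs in the approximate feature space. The RBF kernel and the toy data stand in for CHICKN's compressive representation and optimal-transport kernels.
      import numpy as np
      from sklearn.kernel_approximation import Nystroem
      from sklearn.cluster import KMeans

      rng = np.random.default_rng(3)
      # toy stand-in for elution profiles: two populations, 50 dimensions
      X = np.vstack([rng.normal(m, 1.0, (500, 50)) for m in (0.0, 3.0)])

      # approximate the kernel feature map on 100 landmark points
      feature_map = Nystroem(kernel="rbf", gamma=0.05, n_components=100,
                             random_state=0)
      X_feat = feature_map.fit_transform(X)

      # k-means in the approximate feature space ~ scalable kernel k-means
      labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_feat)
      print(np.bincount(labels))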
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 66
    Publication Date: 2021-02-10
    Description: Background Human Dicer is an enzyme that cleaves pre-miRNAs into miRNAs. Several models have been developed to predict human Dicer cleavage sites, including PHDCleav and LBSizeCleav. Given an input sequence, these models can predict whether the sequence contains a cleavage site. However, they consider each sequence independently and lack interpretability. Therefore, it is necessary to develop an accurate and explainable predictor, which employs relations between different sequences, to enhance the understanding of the mechanism by which human Dicer cleaves pre-miRNA. Results In this study, we develop an accurate and explainable predictor for human Dicer cleavage sites, ReCGBM. We design relational features and class features as inputs to a LightGBM model. Computational experiments show that ReCGBM outperforms existing methods. Further, we find that features in close proximity to the center of the pre-miRNA are more important and contribute significantly to the performance improvement of the developed method. Conclusions The results of this study show that ReCGBM is an interpretable and accurate predictor. The feature importance analyses further suggest that future predictors may benefit from considering informative features close to the center of the pre-miRNA.
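    Code sketch: the gradient-boosting step, assuming each candidate cleavage site has already been encoded as a numeric feature vector (the paper's relational and class features would be computed upstream; random stand-in features are used here).
      import numpy as np
      from lightgbm import LGBMClassifier
      from sklearn.model_selection import train_test_split

      rng = np.random.default_rng(4)
      X = rng.random((500, 20))                   # stand-in feature vectors
      y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # stand-in cleavage labels

      X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
      model = LGBMClassifier(n_estimators=200, learning_rate=0.05)
      model.fit(X_tr, y_tr)
      print("accuracy:", model.score(X_te, y_te))
      # per-feature importances support the explainability claim
      print(model.feature_importances_[:5])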
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 67
    Publication Date: 2021-02-10
    Description: Background Recent cancer genomic studies have generated detailed molecular data on a large number of cancer patients. A key remaining problem in cancer genomics is the identification of driver genes. Results We propose BetweenNet, a computational approach that integrates genomic data with a protein-protein interaction network to identify cancer driver genes. BetweenNet utilizes a measure based on betweenness centrality on patient-specific networks to identify the so-called outlier genes that correspond to dysregulated genes for each patient. Setting up the relationship between the mutated genes and the outliers through a bipartite graph, it employs a random-walk process on the graph, which provides the final prioritization of the mutated genes. We compare BetweenNet against state-of-the-art cancer gene prioritization methods on lung, breast, and pan-cancer datasets. Conclusions Our evaluations show that BetweenNet is better at recovering known cancer genes based on multiple reference databases. Additionally, we show that the GO terms and reference pathways enriched in BetweenNet-ranked genes overlap significantly with those enriched in known cancer genes, compared to the overlaps achieved by the rankings of the alternative methods.
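    Code sketch: the two graph ingredients described above, in networkx, on invented toy data: betweenness centrality flags outlier genes on an interaction network, and a random walk with restart (here via personalized PageRank) over a bipartite mutated-gene/outlier graph ranks the mutated genes. Graph construction and the restart parameter are illustrative, not BetweenNet's exact procedure.
      import networkx as nx

      # toy stand-in for a patient-specific interaction network
      G = nx.karate_club_graph()
      centrality = nx.betweenness_centrality(G)
      outliers = [n for n, c in sorted(centrality.items()) if c > 0.1]

      # bipartite graph linking mutated genes to the outlier genes
      B = nx.Graph()
      B.add_edges_from([("mut-1", o) for o in outliers[:3]])
      B.add_edges_from([("mut-2", o) for o in outliers[1:4]])

      # random walk with restart, here via personalized PageRank
      scores = nx.pagerank(B, alpha=0.7,
                           personalization={"mut-1": 1.0, "mut-2": 1.0})
      print(sorted(((round(scores[m], 3), m) for m in ("mut-1", "mut-2")),
                   reverse=True))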
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 68
    Publication Date: 2021-02-11
    Description: Background The advancement of SMRT technology has opened up new opportunities for genome analysis with its longer read length and low GC bias. Alignment of the reads to their appropriate positions in the respective reference genome is the first but costliest step of any analysis pipeline based on SMRT sequencing. However, state-of-the-art aligners often fail to identify distant homologies due to a lack of conserved regions, caused by frequent genetic duplication and recombination. Therefore, we developed a novel alignment-free method of sequence mapping that is fast and accurate. Results We present a new mapper called S-conLSH that uses Spaced context based Locality Sensitive Hashing. With multiple spaced patterns, S-conLSH facilitates a gapped mapping of noisy long reads to the corresponding target locations of a reference genome. We have examined the performance of the proposed method on 5 different real and simulated datasets. S-conLSH is at least 2 times faster than the recently developed method lordFAST. It achieves a sensitivity of 99%, without using any traditional base-to-base alignment, on simulated human sequence data. By default, S-conLSH provides an alignment-free mapping in PAF format, with an option of generating aligned output as a SAM file if it is required for any downstream processing. Conclusions S-conLSH is one of the first alignment-free reference genome mapping tools to achieve a high level of sensitivity. The spaced context is especially suitable for extracting distant similarities. The variable-length spaced seeds or patterns add flexibility to the proposed algorithm by introducing gapped mapping of the noisy long reads. Therefore, S-conLSH may be considered a prominent direction towards alignment-free sequence analysis.
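    Code sketch: a toy version of spaced-context hashing: a spaced pattern (1 = keep, 0 = skip) projects each sequence window, windows with equal projections share a bucket, and bucket hits vote for a mapping location. The pattern, window length and toy sequences are invented for illustration; S-conLSH combines several such patterns.
      from collections import defaultdict

      PATTERN = "1101101101101101"          # illustrative spaced seed (1 = keep)

      def spaced_key(window, pattern=PATTERN):
          # Project a window onto the '1' positions of the spaced pattern.
          return "".join(b for b, p in zip(window, pattern) if p == "1")

      def index_reference(ref, pattern=PATTERN):
          buckets = defaultdict(list)
          w = len(pattern)
          for i in range(len(ref) - w + 1):
              buckets[spaced_key(ref[i:i + w], pattern)].append(i)
          return buckets

      def map_read(read, buckets, pattern=PATTERN):
          # Bucket hits vote for the implied read start position.
          w, hits = len(pattern), defaultdict(int)
          for i in range(len(read) - w + 1):
              for pos in buckets.get(spaced_key(read[i:i + w], pattern), []):
                  hits[pos - i] += 1
          return max(hits, key=hits.get) if hits else None

      ref = "ACGTACGTTAGCCGATAGGCTTACGATCGATCGGATC" * 3
      read = ref[20:52].replace("A", "G", 1)     # noisy read from position 20
      print(map_read(read, index_reference(ref)))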
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 69
    Publication Date: 2021-02-12
    Description: Background The search for statistically significant relationships between molecular markers and outcomes is challenging when dealing with high-dimensional, noisy and collinear multivariate omics data, such as metabolomic profiles. Permutation procedures allow for the estimation of adjusted significance levels without assuming independence among metabolomic variables. Nevertheless, the complex non-normal structure of metabolic profiles and outcomes may bias the permutation results, leading to overly conservative threshold estimates, i.e. lower than those from a Bonferroni or Sidak correction. Methods Within a univariate permutation procedure, we employ parametric simulation methods based on the multivariate (log-)normal distribution to obtain adjusted significance levels that are consistent across different outcomes while effectively controlling the type I error rate. Next, we derive an alternative closed-form expression for estimating the number of non-redundant metabolic variates based on the spectral decomposition of their correlation matrix. The performance of the method is tested for different model parametrizations and across a wide range of correlation levels of the variates, using both synthetic and real data sets. Results Both the permutation-based formulation and the more practical closed-form expression are found to give an effective indication of the number of independent metabolic effects exhibited by the system, while guaranteeing that the derived adjusted threshold is stable across outcome measures with diverse properties.
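    Code sketch: the closed-form idea, assuming a Li-and-Ji-style effective number of tests computed from the eigenvalues of the correlation matrix, followed by a Sidak-style correction with that number; the paper's exact expression may differ from this variant.
      import numpy as np

      def effective_tests(X):
          # Effective number of independent variates from the eigenvalues of
          # the correlation matrix (a Li-and-Ji-style count).
          eig = np.abs(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))
          return float(sum(int(e >= 1) + (e - int(e)) for e in eig))

      def sidak_threshold(alpha, m_eff):
          return 1 - (1 - alpha) ** (1 / m_eff)

      rng = np.random.default_rng(5)
      latent = rng.normal(size=(200, 5))
      X = latent @ rng.normal(size=(5, 40))      # 40 strongly correlated variates
      m_eff = effective_tests(X)
      print(m_eff, sidak_threshold(0.05, m_eff))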
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 70
    Publication Date: 2021-02-12
    Description: Background Conformational transitions are implicated in the biological function of many proteins. Structural changes in proteins can be described approximately as the relative movement of rigid domains against each other. Despite previous efforts, there is a need to develop new domain segmentation algorithms that are capable of analysing the entire structure database efficiently and do not require the choice of protein-dependent tuning parameters such as the number of rigid domains. Results We develop a graph-based method for detecting rigid domains in proteins. Structural information from multiple conformational states is represented by a graph whose nodes correspond to amino acids. Graph clustering algorithms allow us to reduce the graph and run the Viterbi algorithm on the associated line graph to obtain a segmentation of the input structures into rigid domains. In contrast to many alternative methods, our approach does not require knowledge about the number of rigid domains. Moreover, we identified default values for the algorithmic parameters that are suitable for a large number of conformational ensembles. We test our algorithm on examples from the DynDom database and illustrate our method on various challenging systems whose structural transitions have been studied extensively. Conclusions The results strongly suggest that our graph-based algorithm forms a novel framework to characterize structural transitions in proteins via detecting their rigid domains. The web server is available at http://azifi.tz.agrar.uni-goettingen.de/webservice/.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 71
    Publication Date: 2021-04-19
    Description: Background Genotype–phenotype predictions are of great importance in genetics, as they can help to find genetic mutations causing variation in human beings. There are many approaches for finding such associations, broadly categorized into two classes: statistical techniques and machine learning. Statistical techniques are suited to finding the actual SNPs causing variation, whereas machine learning techniques are suited to classifying people into different categories. In this article, we examined the eye-color and type-2 diabetes phenotypes. The proposed technique is a hybrid approach combining elements of statistical techniques with machine learning. Results The main dataset for the eye-color phenotype consists of 806 people, of whom 404 have blue-green eyes and 402 have brown eyes. After preprocessing, we generated 8 different datasets containing different numbers of SNPs, using the mutation difference and thresholding at individual SNPs. We distinguished three types of mutation at each SNP: no mutation, partial mutation, and full mutation. The data were then transformed for the machine learning algorithms. We used 9 classifiers (Random Forest, extreme gradient boosting, ANN, LSTM, GRU, BiLSTM, 1D-CNN, ensembles of ANN, and ensembles of LSTM), which gave best accuracies of 0.91, 0.9286, 0.945, 0.94, 0.94, 0.92, 0.95, and 0.96, respectively. Stacked ensembles of LSTM outperformed the other algorithms for 1560 SNPs, with an overall accuracy of 0.96, an AUC of 0.98 for brown eyes, and an AUC of 0.97 for blue-green eyes. The main dataset for type-2 diabetes consists of 107 people, with 30 people classified as cases and 74 as controls. We used different linear thresholds to find the optimal number of SNPs for classification; the final model gave an accuracy of 0.97. Conclusion Genotype–phenotype predictions are very useful, especially in forensics, as they can help to identify SNP variant associations with traits and diseases. Given more data, the predictions of machine learning models can be improved. Moreover, the non-linearity of machine learning models and the combination of SNP mutations during training increase the prediction accuracy. We considered binary classification problems, but the proposed approach can be extended to multi-class classification.
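    Code sketch: the general workflow, assuming a genotype matrix already coded by mutation level (0 = none, 1 = partial, 2 = full): SNPs are filtered by a simple per-SNP association test and the survivors feed a classifier. The chi-square filter and Random Forest stand in for the paper's thresholding scheme and deep ensembles.
      import numpy as np
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.feature_selection import SelectKBest, chi2
      from sklearn.model_selection import cross_val_score

      rng = np.random.default_rng(6)
      n, p = 400, 1000
      X = rng.integers(0, 3, size=(n, p))        # 0/1/2 mutation coding per SNP
      y = (X[:, :5].sum(axis=1) + rng.normal(0, 1, n) > 5).astype(int)

      # keep the SNPs most associated with the phenotype, then classify
      # (for simplicity the filter is fit on all data; a real pipeline
      # would nest the selection inside the cross-validation)
      X_sel = SelectKBest(chi2, k=50).fit_transform(X, y)
      clf = RandomForestClassifier(n_estimators=200, random_state=0)
      print("CV accuracy:", cross_val_score(clf, X_sel, y, cv=5).mean())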
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 72
    Publication Date: 2021-04-12
    Description: Background Technological and research advances have produced large volumes of biomedical data. When represented as a network (graph), these data become useful for modeling entities and interactions in biological and similar complex systems. In the fields of network biology and network medicine, there is particular interest in predicting results from drug–drug, drug–disease, and protein–protein interactions to advance the speed of drug discovery. Existing data and modern computational methods make it possible to identify potentially beneficial and harmful interactions, and therefore to narrow drug trials ahead of actual clinical trials. Such automated data-driven investigation relies on machine learning techniques. However, traditional machine learning approaches require extensive preprocessing of the data, which makes them impractical for large datasets. This study presents a wide range of machine learning methods for predicting outcomes from biomedical interactions and evaluates the performance of traditional methods against more recent network-based approaches. Results We applied 32 different network-based machine learning models to five commonly available biomedical datasets and evaluated their performance using three important evaluation metrics, namely AUROC, AUPR, and F1-score. We achieved this by casting the link prediction problem as a binary classification problem, treating existing links as positive examples and randomly sampling negative examples from the set of non-existent links. After experimental evaluation, we found that ProNE, ACT and LRW_5 are the top 3 performers on all five datasets. Conclusions This work presents a comparative evaluation of well-known network-based machine learning algorithms for predicting network links, with applications in the prediction of drug-target and drug–drug interactions. It should help guide researchers in the appropriate selection of machine learning methods for pharmaceutical tasks.
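    Code sketch: the evaluation setup, assuming an undirected interaction graph in networkx: existing edges are positives, an equal number of sampled non-edges are negatives, simple node-pair features feed a logistic classifier, and AUROC is computed on a held-out split. The two features are illustrative, not the 32 benchmarked models.
      import random
      import networkx as nx
      import numpy as np
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import roc_auc_score
      from sklearn.model_selection import train_test_split

      G = nx.barabasi_albert_graph(300, 3, seed=0)
      pos = list(G.edges())
      neg = random.Random(0).sample(list(nx.non_edges(G)), len(pos))

      def pair_features(graph, pairs):
          # two classic topological link-prediction features per node pair
          jc = {(u, v): s for u, v, s in nx.jaccard_coefficient(graph, pairs)}
          ra = {(u, v): s for u, v, s in nx.resource_allocation_index(graph, pairs)}
          return np.array([[jc[p], ra[p]] for p in pairs])

      pairs = pos + neg
      y = np.array([1] * len(pos) + [0] * len(neg))
      X = pair_features(G, pairs)   # note: test edges stay in the graph here;
                                    # a rigorous setup removes them first

      X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
      clf = LogisticRegression().fit(X_tr, y_tr)
      print("AUROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))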
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 73
    Publication Date: 2021-04-12
    Description: Background Clustering is a crucial step in the analysis of single-cell data. Clusters identified in an unsupervised manner are typically annotated to cell types based on differentially expressed genes. In contrast, supervised methods use a reference panel of labelled transcriptomes to guide both clustering and cell type identification. Supervised and unsupervised clustering approaches have their distinct advantages and limitations. Therefore, they can lead to different but often complementary clustering results. Hence, a consensus approach leveraging the merits of both clustering paradigms could result in a more accurate clustering and a more precise cell type annotation. Results We present scConsensus, an R framework for generating a consensus clustering by (1) integrating results from both unsupervised and supervised approaches and (2) refining the consensus clusters using differentially expressed genes. The value of our approach is demonstrated on several existing single-cell RNA sequencing datasets, including data from sorted PBMC sub-populations. Conclusions scConsensus combines the merits of unsupervised and supervised approaches to partition cells with better cluster separation and homogeneity, thereby increasing our confidence in detecting distinct cell types. scConsensus is implemented in R and is freely available on GitHub at https://github.com/prabhakarlab/scConsensus.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 74
    Publication Date: 2021-04-16
    Description: Background Recent clinical advances in cancer immuno-therapeutics underscore the need for improved understanding of the complex relationship between cancer and the multiple, multi-functional, inter-dependent, cellular and humoral mediators/regulators of the human immune system. This interdisciplinary effort applies engineering analysis methods, originally developed to investigate anomalous physical system behaviors, to the exploration of immune system behaviors. Cancer Immune Control Dynamics (CICD), a systems analysis approach, attempts to identify differences between the systemic immune homeostasis of 27 healthy volunteers and that of 14 patients with metastatic malignant melanoma, based on daily serial measurements of conventional peripheral blood biomarkers (15 cell subsets, 35 cytokines). The modeling strategy applies engineering control theory to analyze an individual's immune system based on the biomarkers' dynamic non-linear oscillatory behaviors. The reverse engineering analysis uses a singular value decomposition (SVD) algorithm to solve the inverse problem and identify a solution profile of the active biomarker relationships. Herein, 28,605 biologically possible biomarker interactions are modeled by a set of matrix equations creating a system interaction model. CICD quantifies the model with a participant's biomarker data and then computationally solves it to measure each relationship's activity, allowing a visualization of the individual's current state of immunity. Results CICD results provide initial evidence that this model-based analysis is consistent with identified roles of biomarkers in the systemic immunity of cancer patients versus that of healthy volunteers. The mathematical computations alone identified a plausible network of immune cells, including T cells, natural killer (NK) cells, monocytes, and dendritic cells (DC), with the cytokines MCP-1 [CCL2], IP-10 [CXCL10], and IL-8 playing a role in sustaining the state of immunity in advanced cancer. Conclusions With CICD modeling capabilities, the complexity of the immune system is mathematically quantified through thousands of possible interactions between multiple biomarkers. Therefore, the overall state of an individual's immune system, regardless of clinical status, is modeled as reflected in their blood samples. It is anticipated that CICD-based capabilities will provide tools to specifically address cancer- and treatment-modulated (immune checkpoint inhibitor) parameters of human immunity, revealing clinically relevant biological interactions.
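    Code sketch: the reverse-engineering step, under the simplifying assumption that the interaction model reduces to a linear system A x = b (A encoding candidate biomarker interactions, b the observed dynamics); a truncated-SVD pseudoinverse then yields a stable solution profile. Dimensions and data are invented.
      import numpy as np

      def svd_solve(A, b, rel_tol=1e-8):
          # Truncated-SVD pseudoinverse solve of A x = b, damping
          # near-zero singular values for numerical stability.
          U, s, Vt = np.linalg.svd(A, full_matrices=False)
          s_inv = np.where(s > rel_tol * s.max(), 1.0 / s, 0.0)
          return Vt.T @ (s_inv * (U.T @ b))

      rng = np.random.default_rng(7)
      A = rng.normal(size=(50, 200))   # 50 observations, 200 candidate interactions
      x_true = np.zeros(200)
      x_true[:5] = rng.normal(size=5)  # only a few truly active relationships
      b = A @ x_true
      x_hat = svd_solve(A, b)
      print("residual:", np.linalg.norm(A @ x_hat - b))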
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 75
    Publication Date: 2021-04-20
    Description: Background A major challenge in evaluating quantitative ChIP-seq analyses, such as peak calling and differential binding, is a lack of reliable ground truth data. Accurate simulation of ChIP-seq data can mitigate this challenge, but existing frameworks are either too cumbersome to apply genome-wide or unable to model a number of important experimental conditions in ChIP-seq. Results We present ChIPs, a toolkit for rapidly simulating ChIP-seq data using statistical models of key experimental steps. We demonstrate how ChIPs can be used for a range of applications, including benchmarking analysis tools and evaluating the impact of various experimental parameters. ChIPs is implemented as a standalone command-line program written in C++ and is available from https://github.com/gymreklab/chips. Conclusions ChIPs is an efficient ChIP-seq simulation framework that generates realistic datasets over a flexible range of experimental conditions. It can serve as an important component in various ChIP-seq analyses where ground truth data are needed.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 76
    Publication Date: 2021-04-15
    Description: Background Gene Set Analysis (GSA) is arguably the method of choice for the functional interpretation of omics results. The following paper explores the popularity and the performance of all the GSA methodologies and software published during the 20 years since its inception. "Popularity" is estimated according to each paper's citation counts, while "performance" is based on a comprehensive evaluation of the validation strategies used by papers in the field, as well as the consolidated results from the existing benchmark studies. Results Regarding popularity, data is collected into an online open database ("GSARefDB") which allows browsing bibliographic and method-descriptive information from 503 GSA paper references; regarding performance, we introduce a repository of jupyter workflows and shiny apps for automated benchmarking of GSA methods (“GSA-BenchmarKING”). After comparing popularity versus performance, results show discrepancies between the most popular and the best performing GSA methods. Conclusions The above-mentioned results call our attention towards the nature of the tool selection procedures followed by researchers and raise doubts regarding the quality of the functional interpretation of biological datasets in current biomedical studies. Suggestions for the future of the functional interpretation field are made, including strategies for education and discussion of GSA tools, better validation and benchmarking practices, reproducibility, and functional re-analysis of previously reported data.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 77
    Publication Date: 2021-04-15
    Description: Background The Cox proportional hazards model is commonly used to predict the hazard ratio, i.e. the risk or probability of occurrence of an event of interest. However, the Cox proportional hazards model cannot directly generate an individual survival time. To do this, survival analysis in the Cox model converts the hazard ratio to survival times through distributions such as the exponential, Weibull, Gompertz or log-normal distributions. In other words, to generate the survival time, the Cox model has to assume a specific distribution over time. Results This study presents a method to predict the survival time by integrating a hazard network and a distribution function network. The Cox proportional hazards network from DeepSurv is adapted for the prediction of the hazard ratio, and a distribution function network is applied to generate the survival time. To evaluate the performance of the proposed method, we propose a new evaluation metric that calculates the intersection over union between the predicted curve and the ground truth. To further identify significant prognostic factors, we use the 1D gradient-weighted class activation mapping method to highlight the network activations as a heat-map visualization over the input data. The performance of the proposed method was experimentally verified and the results were compared to those of other existing methods. Conclusions Our results confirm that the combination of the two networks, the Cox proportional hazards network and the distribution function network, can effectively generate accurate survival times.
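    Code sketch: one plausible reading of the proposed intersection-over-union metric, assuming predicted and ground-truth survival curves sampled on a common time grid: the area under the pointwise minimum divided by the area under the pointwise maximum, via trapezoidal integration. The exact definition in the paper may differ.
      import numpy as np

      def area(t, y):
          # trapezoidal area under y(t)
          return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(t)))

      def survival_iou(t, s_pred, s_true):
          # overlap of two survival curves sampled on a shared time grid
          return (area(t, np.minimum(s_pred, s_true))
                  / area(t, np.maximum(s_pred, s_true)))

      t = np.linspace(0, 10, 101)
      s_true = np.exp(-0.30 * t)       # ground-truth exponential survival
      s_pred = np.exp(-0.35 * t)       # slightly over-estimated hazard
      print(round(survival_iou(t, s_pred, s_true), 3))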
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 78
    Publication Date: 2021-04-20
    Description: Background Computational mathematical models of biological and biomedical systems have been successfully applied to advance our understanding of various regulatory processes, metabolic fluxes, effects of drug therapies, and disease evolution and transmission. Unfortunately, despite community efforts leading to the development of SBML and the BioModels database, many published models have not been fully exploited, largely due to a lack of proper documentation or the dependence on proprietary software. To facilitate the reuse and further development of systems biology and systems medicine models, an open-source toolbox that makes the overall process of model construction more consistent, understandable, transparent, and reproducible is desired. Results and discussion We provide an update on the development of modelbase, a free, expandable Python package for constructing and analysing ordinary differential equation-based mathematical models of dynamic systems. It provides intuitive and unified methods to construct and solve these systems. Significantly expanded visualisation methods allow for convenient analysis of the structural and dynamic properties of models. After the user specifies reaction stoichiometries and rate equations, modelbase can automatically assemble the associated system of differential equations. A newly provided library of common kinetic rate laws reduces the repetitiveness of the computer programming code. modelbase is also fully compatible with SBML. Previous versions provided functions for the automatic construction of networks for isotope labelling studies. Now, using user-provided label maps, modelbase v1.2.3 streamlines the expansion of classic models to their isotope-specific versions. Finally, the library of previously published models implemented in modelbase is growing continuously. Ranging from photosynthesis to tumour cell growth to viral infection evolution, all these models are now available in a transparent, reusable and unified format through modelbase. Conclusion With this new Python software package, written in one of the currently most popular programming languages, users can develop new models and actively profit from the work of others. modelbase enables reproducing and replicating models in a consistent, tractable and expandable manner. Moreover, the expansion of models to their isotope-specific versions enables the simulation of label propagation, thus providing quantitative information regarding network topology and metabolic fluxes.
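    Code sketch: a conceptual illustration, not modelbase's actual API, of automatic ODE assembly from reaction stoichiometries and rate laws: each reaction contributes rate times stoichiometric coefficient to its species' derivatives, and the assembled system is handed to a standard integrator.
      import numpy as np
      from scipy.integrate import solve_ivp

      # reactions: stoichiometry {species: coefficient} plus a rate law
      reactions = {
          "v1": ({"S": -1, "P": +1}, lambda c, p: p["k1"] * c["S"]),   # S -> P
          "v2": ({"P": -1}, lambda c, p: p["k2"] * c["P"]),            # P ->
      }
      species = ["S", "P"]
      params = {"k1": 1.0, "k2": 0.5}

      def rhs(t, y):
          # assemble dy/dt: every reaction adds rate * coefficient
          conc = dict(zip(species, y))
          dydt = dict.fromkeys(species, 0.0)
          for stoich, rate in reactions.values():
              v = rate(conc, params)
              for sp, coeff in stoich.items():
                  dydt[sp] += coeff * v
          return [dydt[sp] for sp in species]

      sol = solve_ivp(rhs, (0, 10), [1.0, 0.0])
      print(sol.y[:, -1])              # final concentrations of S and P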
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 79
    Publication Date: 2021-04-20
    Description: Background Drug–target interaction (DTI) prediction plays a vital role in drug discovery. Identifying drug–target interactions in wet-lab experiments is costly, laborious, and time-consuming, so computational methods that predict them are an essential part of the drug discovery process; they can also reduce the search space by proposing potential drugs for subsequent wet-lab validation. Recently, deep learning-based methods for drug-target interaction prediction have received increasing attention. Traditionally, the performance of DTI prediction methods has depended heavily on additional information, such as the protein sequence and the molecular structure of the drug, as well as on deep supervised learning. Results This paper proposes a method based on deep unsupervised learning for drug-target interaction prediction, called AutoDTI++. The proposed method comprises three steps. The first step pre-processes the interaction matrix: since the interaction matrix is sparse, we address its sparsity using drug fingerprints. In the second step, the AutoDTI model is applied. In the third step, we post-process the output of the AutoDTI model. Conclusions Experimental results show improved prediction performance: compared with other algorithms on the same reference datasets, the proposed method achieves good performance with high accuracy across five repetitions of tenfold cross-validation on the golden standard datasets (Nuclear Receptors, GPCRs, Ion channels, and Enzymes).
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 80
    Publication Date: 2021-04-20
    Description: Background Gene annotation in eukaryotes is a non-trivial task that requires meticulous analysis of accumulated transcript data. Challenges include transcriptionally active regions of the genome that contain overlapping genes, genes that produce numerous transcripts, transposable elements and numerous diverse sequence repeats. Currently available gene annotation software applications depend on pre-constructed full-length gene sequence assemblies which are not guaranteed to be error-free. The origins of these sequences are often uncertain, making it difficult to identify and rectify errors in them. This hinders the creation of an accurate and holistic representation of the transcriptomic landscape across multiple tissue types and experimental conditions. Therefore, to gauge the extent of diversity in gene structures, a comprehensive analysis of genome-wide expression data is imperative. Results We present FINDER, a fully automated computational tool that optimizes the entire process of annotating genes and transcript structures. Unlike current state-of-the-art pipelines, FINDER automates the RNA-Seq pre-processing step by working directly with raw sequence reads and optimizes gene prediction from BRAKER2 by supplementing these reads with associated proteins. The FINDER pipeline (1) reports transcripts and recognizes genes that are expressed under specific conditions, (2) generates all possible alternatively spliced transcripts from expressed RNA-Seq data, (3) analyzes read coverage patterns to modify existing transcript models and create new ones, and (4) scores genes as high- or low-confidence based on the available evidence across multiple datasets. We demonstrate the ability of FINDER to automatically annotate a diverse pool of genomes from eight species. Conclusions FINDER takes a completely automated approach to annotate genes directly from raw expression data. It is capable of processing eukaryotic genomes of all sizes and requires no manual supervision—ideal for bench researchers with limited experience in handling computational tools.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 81
    Publication Date: 2021-04-20
    Description: Background Co-expression correlations provide the ability to predict gene functionality within specific biological contexts, such as different tissue and disease conditions. However, current gene co-expression databases generally do not consider biological context. In addition, these tools often implement a limited range of unsophisticated analysis approaches, diminishing their utility for exploring gene functionality and gene relationships. Furthermore, they typically do not provide the summary visualizations necessary to communicate these results, posing a significant barrier to their utilization by biologists without computational skills. Results We present Correlation AnalyzeR, a user-friendly web interface for exploring co-expression correlations and predicting gene functions, gene–gene relationships, and gene set topology. Correlation AnalyzeR provides flexible access to its database of tissue and disease-specific (cancer vs normal) genome-wide co-expression correlations, and it also implements a suite of sophisticated computational tools for generating functional predictions with user-friendly visualizations. In the usage example provided here, we explore the role of BRCA1-NRF2 interplay in the context of bone cancer, demonstrating how Correlation AnalyzeR can be effectively implemented to generate and support novel hypotheses. Conclusions Correlation AnalyzeR facilitates the exploration of poorly characterized genes and gene relationships to reveal novel biological insights. The database and all analysis methods can be accessed as a web application at https://gccri.bishop-lab.uthscsa.edu/correlation-analyzer/ and as a standalone R package at https://github.com/Bishop-Laboratory/correlationAnalyzeR.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 82
    Publication Date: 2021-04-20
    Description: Background Genetic testing is widely used in evaluating a patient’s predisposition to hereditary diseases. In the case of cancer, when a functionally impactful mutation (i.e. genetic variant) is identified in a disease-relevant gene, the patient is at elevated risk of developing a lesion in their lifetime. Unfortunately, as the rate and coverage of genetic testing has accelerated, our ability to assess the functional status of new variants has fallen behind. Therefore, there is an urgent need for more practical, streamlined and cost-effective methods for classifying variants. Results To directly address this issue, we designed a new approach that uses alterations in protein subcellular localization as a key indicator of loss of function. Thus, new variants can be rapidly functionalized using high-content microscopy (HCM). To facilitate the analysis of the large amounts of imaging data, we developed a new software toolkit, named MAPS for machine-assisted phenotype scoring, that utilizes deep learning to extract and classify cell-level features. MAPS helps users leverage cloud-based deep learning services that are easy to train and deploy to fit their specific experimental conditions. Model training is code-free and can be done with limited training images. Thus, MAPS allows cell biologists to easily incorporate deep learning into their image analysis pipeline. We demonstrated an effective variant functionalization workflow that integrates HCM and MAPS to assess missense variants of PTEN, a tumor suppressor that is frequently mutated in hereditary and somatic cancers. Conclusions This paper presents a new way to rapidly assess variant function using cloud deep learning. Since most tumor suppressors have well-defined subcellular localizations, our approach could be widely applied to functionalize variants of uncertain significance and help improve the utility of genetic testing.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 83
    Publication Date: 2021-04-14
    Description: Background Harmonin Homology Domains (HHD) are recently identified orphan domains of about 70 residues folded in a compact five-alpha-helix bundle that have proved to be versatile in terms of function, allowing for direct binding to a partner as well as regulating the affinity and specificity of adjacent domains for their own targets. Given their small size and rather simple fold, HHDs appear as convenient modules to regulate protein–protein interactions in various biological contexts. Surprisingly, only nine HHDs have been detected, in six proteins mainly expressed in sensory neurons. Results Here, we built a profile Hidden Markov Model to screen the entire UniProtKB for new HHD-containing proteins. Every hit was manually annotated using a clustering approach, confirming that only a few proteins contain HHDs. We report the phylogenetic coverage of each protein and build a phylogenetic tree to trace the evolution of HHDs. We suggest that an HHD ancestor is shared with Paired Amphipathic Helix (PAH) domains, a four-helix bundle partially sharing fold and functional properties. We characterized the amino-acid sequences of the various HHDs using pairwise BLASTP scoring coupled with community clustering, and manually assessed sequence features within each individual family. These sequence features were analyzed using reported structures as well as homology models to highlight the structural motifs underlying the HHD fold. We show that functional divergence is carried by subtle differences in sequences that automatized approaches fail to detect. Conclusions We provide the first HHD databases, including sequences and conservation, phylogenetic trees and a list of HHD variants found in the auditory system, which are available to the community. This case study highlights surprising phylogenetic properties found in orphan domains and will assist further studies of HHDs. We unveil the implication of HHDs in their various binding interfaces using conservation across families and a new protein–protein surface predictor. Finally, we discuss the functional consequences of three identified pathogenic HHD variants involved in Hoyeraal-Hreidarsson syndrome and of three newly reported pathogenic variants identified in patients suffering from Usher syndrome.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
  • 84
    Publication Date: 2021-04-15
    Description: Background ChIP-seq combines chromatin immunoprecipitation assays with sequencing and identifies genome-wide binding sites for DNA binding proteins. While many binding sites have strong ChIP-seq ‘peak’ observations and are well captured, there are still regions that are weakly bound by proteins, with relatively low ChIP-seq signal enrichment. These weak binding sites, especially those at promoters and enhancers, are functionally important because they also regulate nearby gene expression. Yet it remains a challenge to accurately identify weak binding sites in ChIP-seq data due to the ambiguity in differentiating them from the amplified background DNA. Results ChIP-BIT2 (http://sourceforge.net/projects/chipbitc/) is a software package for ChIP-seq peak detection. ChIP-BIT2 employs a mixture model integrating protein and control ChIP-seq data and predicts strong or weak protein binding sites at promoters, enhancers, or other genomic locations. For binding sites at gene promoters, ChIP-BIT2 simultaneously predicts their target genes. ChIP-BIT2 has been validated on benchmark regions and tested using large-scale ENCODE ChIP-seq data, demonstrating its high accuracy and wide applicability. Conclusion ChIP-BIT2 is an efficient ChIP-seq peak caller. It provides a better lens to examine weak binding sites and can refine or extend existing binding site collections, providing additional regulatory regions for decoding the mechanism of gene expression regulation.
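    Code sketch: the mixture-model idea in miniature, assuming region read counts follow a two-component Poisson mixture (background versus binding) fitted by EM; each region's posterior for the signal component flags strong and weak sites. ChIP-BIT2's actual model additionally integrates control samples and genomic context.
      import numpy as np
      from scipy import stats

      def poisson_mixture_em(counts, n_iter=100):
          # Two-component Poisson mixture fitted by EM; returns component
          # means, weights and each region's posterior for the signal part.
          lam = np.array([counts.mean() * 0.5, counts.mean() * 2.0])
          w = np.array([0.5, 0.5])
          for _ in range(n_iter):
              log_r = np.log(w) + stats.poisson.logpmf(counts[:, None], lam)
              r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
              r /= r.sum(axis=1, keepdims=True)          # E-step
              w = r.mean(axis=0)                          # M-step
              lam = (r * counts[:, None]).sum(axis=0) / r.sum(axis=0)
          return lam, w, r[:, 1]

      rng = np.random.default_rng(8)
      counts = np.concatenate([rng.poisson(3, 800),      # background regions
                               rng.poisson(15, 200)])    # bound regions
      lam, w, post = poisson_mixture_em(counts)
      print(lam, w, int((post > 0.5).sum()))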
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 85
    Publication Date: 2021-04-29
    Description: Background Amyloid signaling motifs are a class of protein motifs that share basic structural and functional features despite the lack of clear sequence homology. They are hard to detect in large sequence databases, either with alignment-based profile methods (due to their short length and diversity) or with generic amyloid- and prion-finding tools (due to insufficient discriminative power). We propose to address this challenge with a machine learning grammatical model capable of generalizing over diverse collections of unaligned yet related motifs. Results First, we introduce and test improvements to our probabilistic context-free grammar (PCFG) framework for protein sequences that allow for inferring more sophisticated models achieving high sensitivity at low false positive rates. Then, we infer universal grammars for a collection of recently identified bacterial amyloid signaling motifs and demonstrate that the method is capable of generalizing by successfully searching for related motifs in fungi. The results are compared to available alternative methods. Finally, we conduct spectroscopy and staining analyses of selected peptides to verify their structural and functional relationship. Conclusions While profile HMMs remain the method of choice for modeling homologous sets of sequences, PCFGs seem more suitable for building meta-family descriptors and extrapolating beyond the seed sample.
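    The authors' grammars are inferred from motif collections; as a purely didactic stand-in, the toy PCFG below (hand-written, not from the paper) shows how a probabilistic grammar assigns a parse probability to a peptide-like string using NLTK:

```python
# Toy illustration of scoring a short peptide with a probabilistic
# context-free grammar; the grammar and sequence are invented.
import nltk
from nltk.parse.pchart import InsideChartParser

grammar = nltk.PCFG.fromstring("""
    M -> L C R [1.0]
    L -> 'N' [0.6] | 'Q' [0.4]
    C -> 'G' C [0.5] | 'G' [0.5]
    R -> 'N' [0.7] | 'S' [0.3]
""")
parser = InsideChartParser(grammar)
# Parse probability reflects how well the sequence fits the motif grammar
for tree in parser.parse(list("NGGN")):
    print(tree.prob(), tree)
```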
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 86
    Publication Date: 2021-04-19
    Description: An amendment to this paper has been published and can be accessed via the original article.
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 88
    Publication Date: 2021-04-01
    Description: Background Tremor severity assessment is an important step in the diagnosis and treatment decision-making for essential tremor (ET) patients. Traditionally, tremor severity is assessed using questionnaires (e.g., the ETRS and QUEST surveys). In this work we investigate the possibility of assessing tremor severity using sensor data and computerized analyses. The goal is to assess the severity of tremor objectively, to better gauge improvement in ET patients due to deep brain stimulation or other treatments. Methods We collect tremor data by strapping smartphones to the wrists of ET patients. The resulting raw sensor data are then pre-processed to remove any artifact due to the patient's intentional movement. Finally, these data are used to automatically build a transparent, interpretable, and succinct fuzzy model for the severity assessment of ET. For this purpose, we exploit pyFUME, a tool for the data-driven estimation of fuzzy models. It leverages the FST-PSO swarm intelligence meta-heuristic to identify optimal clusters in the data, reducing the possibility of premature convergence to local minima, which would result in a sub-optimal model. pyFUME was also combined with GRABS, a novel methodology for the automatic simplification of fuzzy rules. Results Our model is able to assess the tremor severity of patients suffering from essential tremor, notably without the need for subjective questionnaires or interviews. The fuzzy model improves the mean absolute error (MAE) metric by 78–81% compared to linear models and by 71–74% compared to a model based on decision trees. Conclusion This study confirms that tremor data gathered using smartphones are useful for constructing machine learning models that can support the diagnosis and monitoring of patients who suffer from essential tremor. The model produced by our methodology is easy to inspect and, notably, characterized by a lower error than approaches based on linear models or decision trees.
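    The abstract does not give the exact pre-processing filter; a plausible minimal sketch of the artifact-removal step is a band-pass filter isolating the typical essential-tremor band (roughly 4–12 Hz, an assumption here) from the wrist sensor signal:

```python
# Minimal sketch, assuming a 4-12 Hz tremor band and a 100 Hz sampling
# rate; the paper's actual pre-processing settings are not specified.
import numpy as np
from scipy.signal import butter, filtfilt

def tremor_band(signal, fs, low=4.0, high=12.0, order=4):
    # Band-pass Butterworth filter: suppresses slow voluntary movement
    # (below ~4 Hz) and high-frequency sensor noise.
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, signal)

fs = 100.0                                 # hypothetical sampling rate (Hz)
t = np.arange(0, 10, 1 / fs)
raw = np.sin(2 * np.pi * 0.5 * t) + 0.3 * np.sin(2 * np.pi * 6 * t)
print(tremor_band(raw, fs).std())          # amplitude of the tremor component
```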
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 89
    Publication Date: 2021-04-26
    Description: Background Next-generation sequencing has allowed the discovery of miRNA isoforms, termed isomiRs. Some isomiRs are derived from imprecise processing of pre-miRNA precursors, leading to length variants. Additional variability is introduced by non-templated addition of bases at the ends or by editing of internal bases, resulting in base differences relative to the template DNA sequence. We hypothesized that some component of the isomiR variation reported so far could be due to systematic technical noise and not be real. Results We have developed the XICRA pipeline to analyze small RNA sequencing data at the isomiR level. We exploited its ability to use single or merged reads to compare isomiR results derived from paired-end (PE) reads with those from single reads (SR), to address whether detectable sequence differences relative to canonical miRNAs found in isomiRs are true biological variations or the result of sequencing errors. We have detected non-negligible systematic differences between SR and PE data, which primarily affect putative internally edited isomiRs and, at a much lower frequency, terminal length-changing isomiRs. This is relevant for the identification of true isomiRs in small RNA sequencing datasets. Conclusions We conclude that potential artifacts derived from sequencing errors and/or data processing could result in an overestimation of the abundance and diversity of miRNA isoforms. Efforts in annotating the isomiRnome should take this into account.
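    The core comparison can be pictured as a set difference between isomiR calls from the two read modes; the identifiers below are hypothetical, purely to illustrate the logic:

```python
# Illustrative check in the spirit of the study: isomiRs called from
# single reads (SR) but absent from merged paired-end (PE) calls of the
# same library are candidates for sequencing-error artifacts.
sr_calls = {"miR-21-5p|canonical", "miR-21-5p|iso_snv_central", "let-7a-5p|iso_3p:+1"}
pe_calls = {"miR-21-5p|canonical", "let-7a-5p|iso_3p:+1"}

suspect = sr_calls - pe_calls      # supported only without read merging
print("putative artifacts:", sorted(suspect))
```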
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 90
    Publication Date: 2021-04-28
    Description: Background lncRNAs may be involved in the occurrence, metastasis, and chemotherapy response of hepatocellular carcinoma (HCC) through various autophagy-associated pathways. It is therefore important to identify more autophagy-related lncRNAs, explore their clinical significance, and find new targeted treatment strategies. Methods The corresponding data of HCC patients and autophagy genes were obtained from the TCGA database and the Human Autophagy Database, respectively. Co-expression and Cox regression analyses were used to construct a prognostic prediction signature. Results A signature containing seven autophagy-related lncRNAs (PRRT3-AS1, RP11-479G22.8, RP11-73M18.8, LINC01138, CTD-2510F5.4, CTC-297N7.9, RP11-324I22.4) was constructed. Based on the signature's risk score, overall survival (OS) curves show that the OS of high-risk patients is significantly lower than that of low-risk patients (P = 2.292e−10), and the prognostic prediction accuracy of the risk score (AUC = 0.786) is significantly higher than that of ALBI (0.532), Child-Pugh (0.573), AFP (0.5751), and AJCC stage (0.631). Moreover, multivariate Cox analysis and a nomogram based on the risk score indicate that 1-year and 3-year survival rates are predicted markedly more accurately by the combined analysis of risk score, Child-Pugh, age, M stage, and grade (AUCs of 0.87 and 0.855 at 1 and 3 years, respectively). Remarkably, the seven autophagy-related lncRNAs may participate in the spliceosome, cell cycle, RNA transport, DNA replication, and mRNA surveillance pathways and be related to the biological processes of RNA splicing and mRNA splicing. Conclusion The seven autophagy-related lncRNAs might be promising prognostic and therapeutic targets for HCC.
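    A risk score of this kind is conventionally a weighted sum of expression values, with weights taken from a multivariate Cox fit; the sketch below uses the lifelines package, two of the seven lncRNAs, and placeholder data (the published coefficients are not given in the abstract):

```python
# Minimal sketch, assuming the lifelines package; expression values,
# survival times, and hence the fitted coefficients are placeholders.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "PRRT3_AS1": [2.1, 0.3, 1.8, 0.5, 1.2, 0.8],
    "LINC01138": [1.2, 0.4, 2.0, 0.1, 0.9, 0.6],
    "time":      [14.0, 60.0, 9.0, 72.0, 30.0, 41.0],  # months to event/censoring
    "event":     [1, 0, 1, 1, 0, 1],
})
# A small ridge penalty stabilizes the fit on this tiny toy dataset
cph = CoxPHFitter(penalizer=0.1).fit(df, duration_col="time", event_col="event")
# Risk score = sum of (Cox coefficient x lncRNA expression)
risk = (df[["PRRT3_AS1", "LINC01138"]] * cph.params_).sum(axis=1)
print(risk > risk.median())   # median split into high/low risk, as is common
```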
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 91
    Publication Date: 2021-04-01
    Description: Background Genome-wide reconstructions of metabolism opened the way to thorough investigations of cell metabolism for health care and industrial purposes. However, the predictions offered by Flux Balance Analysis (FBA) can be strongly affected by the choice of flux boundaries, with particular regard to the flux of reactions that sink nutrients into the system. To mitigate possible errors introduced by a poor selection of such boundaries, a rational approach is to focus the modeling effort on the pivotal ones. Methods In this work, we present a methodology for the automatic identification of the key fluxes in genome-wide constraint-based models by means of variance-based sensitivity analysis. The goal is to identify the parameters for which a small perturbation entails a large variation of the model outcomes, also referred to as sensitive parameters. Due to the high number of FBA simulations that are necessary to assess sensitivity coefficients on genome-wide models, our method exploits a master-slave methodology that distributes the computation on massively multi-core architectures. We performed the following steps: (1) we determined the putative parameterizations of the genome-wide metabolic constraint-based model using Saltelli's method; (2) we applied FBA to each parameterized model, distributing the massive amount of calculations over multiple nodes by means of MPI; (3) we then collected and exploited the results of all FBA runs to perform a global sensitivity analysis. Results We show a proof of concept of our approach on the latest genome-wide reconstructions of human metabolism, Recon2.2 and Recon3D. We report that the most sensitive parameters are mainly associated with the intake of essential amino acids in Recon2.2, whereas in Recon3D they are largely associated with phospholipids. We also illustrate that in most cases there is a significant contribution of higher-order effects. Conclusion Our results indicate that interaction effects between different model parameters exist, which should be taken into account especially at the stage of calibration of genome-wide models, supporting the importance of a global strategy of sensitivity analysis.
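    Steps (1) and (3) correspond to Saltelli sampling and Sobol analysis, both available in the SALib package; in the sketch below a toy function stands in for the MPI-distributed FBA evaluations of step (2):

```python
# Sketch using SALib's Saltelli sampling and Sobol analysis; the toy
# function below stands in for an FBA run, and the parameter names
# (nutrient sink fluxes) are hypothetical.
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol

problem = {
    "num_vars": 3,
    "names": ["glc_uptake", "o2_uptake", "gln_uptake"],
    "bounds": [[0, 10], [0, 20], [0, 5]],
}
X = saltelli.sample(problem, 1024)                   # parameterizations, step (1)
Y = np.array([x[0] * x[1] + x[2] ** 2 for x in X])   # stand-in for FBA, step (2)
Si = sobol.analyze(problem, Y)                       # global SA, step (3)
print(Si["S1"], Si["ST"])   # first-order and total-order sensitivity indices
```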
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 92
    Publication Date: 2021-04-01
    Description: Background High-throughput sequencing Chromosome Conformation Capture (Hi-C) allows the study of DNA interactions and 3D chromosome folding at the genome-wide scale. Usually, these data are represented as matrices describing the binary contacts among the different chromosome regions. On the other hand, a graph-based representation can be advantageous to describe the complex topology achieved by the DNA in the nucleus of eukaryotic cells. Methods Here we discuss the use of a graph database for storing and analysing data produced by Hi-C experiments. The main issue is the size of the produced data and, when working with a graph-based representation, the consequent necessity of adequately managing a large number of edges (contacts) connecting nodes (genes), which represent the sources of information. Currently available graph visualisation tools and libraries fall short with Hi-C data. The use of graph databases, instead, supports both the analysis and the visualisation of the spatial pattern present in Hi-C data, in particular for comparing different experiments or for efficiently re-mapping omics data in a space-aware context. In particular, the possibility of describing graphs through statistical indicators, and even more of correlating them through statistical distributions, allows highlighting similarities and differences among different Hi-C experiments, in different cell conditions or cell types. Results These concepts have been implemented in NeoHiC, an open-source and user-friendly web application for the progressive visualisation and analysis of Hi-C networks, based on the use of the Neo4j graph database (version 3.5). Conclusion With the accumulation of more experiments, the tool will provide invaluable support for comparing the neighbourhoods of genes across experiments and conditions, helping to highlight changes in functional domains and to identify new co-organised genomic compartments.
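    A minimal sketch of the underlying graph model, regions as nodes and contacts as weighted edges, using the official Neo4j Python driver (connection details and property names are assumptions, not NeoHiC's actual schema):

```python
# Sketch of a Hi-C contact graph in Neo4j; credentials, node labels,
# and the 'freq' property are hypothetical choices for illustration.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # Store one contact between two genomic regions
    session.run(
        "MERGE (a:Region {id: $a}) MERGE (b:Region {id: $b}) "
        "MERGE (a)-[c:CONTACT]->(b) SET c.freq = $freq",
        a="chr1:0-40000", b="chr1:80000-120000", freq=12,
    )
    # Query the neighbours of a region, strongest contacts first
    result = session.run(
        "MATCH (a:Region {id: $a})-[c:CONTACT]-(b) "
        "RETURN b.id AS region, c.freq AS freq ORDER BY freq DESC",
        a="chr1:0-40000",
    )
    for record in result:
        print(record["region"], record["freq"])
driver.close()
```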
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 93
    Publication Date: 2021-04-01
    Description: Background Unsupervised learning can discover various unseen abnormalities, relying on large-scale unannotated medical images of healthy subjects. Towards this, unsupervised methods reconstruct a 2D/3D single medical image to detect outliers either in the learned feature space or from high reconstruction loss. However, without considering continuity between multiple adjacent slices, they cannot directly discriminate diseases composed of the accumulation of subtle anatomical anomalies, such as Alzheimer's disease (AD). Moreover, no study has shown how unsupervised anomaly detection is associated with disease stages, various (i.e., more than two types of) diseases, or multi-sequence magnetic resonance imaging (MRI) scans. Results We propose the unsupervised medical anomaly detection generative adversarial network (MADGAN), a novel two-step method using GAN-based multiple adjacent brain MRI slice reconstruction to detect brain anomalies at different stages on multi-sequence structural MRI: (Reconstruction) Wasserstein loss with gradient penalty + 100 × ℓ1 loss—trained on 3 healthy brain axial MRI slices to reconstruct the next 3—reconstructs unseen healthy/abnormal scans; (Diagnosis) average ℓ2 loss per scan discriminates them, comparing the ground-truth and reconstructed slices. For training, we use two different datasets composed of 1133 healthy T1-weighted (T1) and 135 healthy contrast-enhanced T1 (T1c) brain MRI scans for detecting AD and brain metastases/various diseases, respectively. Our self-attention MADGAN can detect AD on T1 scans at a very early stage, mild cognitive impairment (MCI), with area under the curve (AUC) 0.727, and AD at a late stage with AUC 0.894, while detecting brain metastases on T1c scans with AUC 0.921. Conclusions Similar to how physicians perform a diagnosis, using massive healthy training data, our first multiple-MRI-slice reconstruction approach, MADGAN, can reliably predict the next 3 slices from the previous 3 only for unseen healthy images. As the first unsupervised various-disease diagnosis, MADGAN can reliably detect the accumulation of subtle anatomical anomalies and hyper-intense enhancing lesions, such as (especially late-stage) AD and brain metastases, on multi-sequence MRI scans.
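    The diagnosis step as described reduces to an average ℓ2 reconstruction loss per scan; a schematic NumPy version (array shapes and noise level are illustrative):

```python
# Sketch of MADGAN's diagnosis step as described in the abstract: an
# average L2 reconstruction loss per scan, with high loss flagging
# anomalies. Shapes and the synthetic data are illustrative only.
import numpy as np

def scan_anomaly_score(real_slices, reconstructed_slices):
    # real/reconstructed: arrays of shape (n_slices, H, W)
    per_slice = np.mean((real_slices - reconstructed_slices) ** 2, axis=(1, 2))
    return float(per_slice.mean())           # average L2 loss over the scan

rng = np.random.default_rng(0)
healthy = rng.normal(size=(40, 64, 64))
recon = healthy + rng.normal(scale=0.05, size=healthy.shape)
print(scan_anomaly_score(healthy, recon))    # low for well-reconstructed scans
```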
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 94
    Publication Date: 2021-04-01
    Description: Background Mass spectrometry remains the privileged method to characterize proteins. Nevertheless, most of the spectra generated by an experiment remain unidentified after their analysis, mostly because of the modifications they carry. Open Modification Search (OMS) methods offer a promising answer to this problem. However, assessing the quality of OMS identifications remains a difficult task. Methods Aiming at better understanding the relationship between (1) the similarity of pairs of spectra provided by OMS methods and (2) the relevance of their corresponding peptide sequences, we used a dataset composed of theoretical spectra only, on which we applied two OMS strategies. We also introduced two appropriately defined measures for evaluating the above-mentioned spectra/sequence relevance in this context: one is a color classification representing the level of difficulty of retrieving the proper sequence of the peptide that generated the identified spectrum; the other, called LIPR, is the proportion of common masses, in a given Peptide Spectrum Match (PSM), that represent dissimilar sequences. These two measures were also considered in conjunction with the False Discovery Rate (FDR). Results According to our measures, the strategy that selects the best candidate by taking the mass difference between two spectra into account yields better-quality results. Besides, although the FDR remains an interesting indicator in OMS methods (as shown by LIPR), it is questionable: indeed, our color classification shows that a non-negligible proportion of relevant spectra/sequence interpretations corresponds to PSMs coming from the decoy database. Conclusions The three above-mentioned measures allowed us to clearly determine which of the two studied OMS strategies outperformed the other, both in terms of the number of identifications and of their accuracy. Even though quality evaluation of PSMs in OMS methods remains challenging, the study of theoretical spectra is a favorable framework for going further in this direction.
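    One schematic reading of LIPR: among fragment masses shared by the two spectra of a PSM, the fraction whose underlying subsequences differ. The (mass, subsequence) layout below is an assumption made for illustration:

```python
# Schematic LIPR-style computation; the data layout (mass paired with
# its fragment subsequence) and tolerance are illustrative assumptions.
def lipr(frags_a, frags_b, tol=0.01):
    # frags_*: list of (mass, subsequence) pairs for each spectrum
    common, dissimilar = 0, 0
    for m_a, seq_a in frags_a:
        for m_b, seq_b in frags_b:
            if abs(m_a - m_b) <= tol:
                common += 1
                if seq_a != seq_b:
                    dissimilar += 1
    return dissimilar / common if common else 0.0

a = [(147.11, "K"), (276.16, "KE"), (389.24, "KEL")]
b = [(147.11, "K"), (276.16, "QD"), (500.30, "QDFA")]
print(lipr(a, b))   # 0.5: two shared masses, one from a different sequence
```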
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 95
    Publication Date: 2021-03-04
    Description: Background Visual exploration of gene product behavior across multiple omic datasets can pinpoint technical limitations in data and reveal biological trends. Still, such exploration is challenging, as there is a need for visualizations that are tailored for the purpose. Results The OmicLoupe software was developed to facilitate visual data exploration and provides more than 15 interactive cross-dataset visualizations for omics data. It expands visualizations to multiple datasets for quality control, statistical comparisons and overlap and correlation analyses, while allowing for rapid inspection and downloading of selected features. The usage of OmicLoupe is demonstrated in three different studies, where it allowed for detection of both technical data limitations and biological trends across different omic layers. An example is an analysis of SARS-CoV-2 infection based on two previously published studies, where OmicLoupe facilitated the identification of gene products with consistent expression changes across datasets at both the transcript and protein levels. Conclusions OmicLoupe provides fast exploration of omics data with tailored visualizations for comparisons within and across data layers. The interactive visualizations are highly informative and are expected to be useful in various analyses of both newly generated and previously published data. OmicLoupe is available at quantitativeproteomics.org/omicloupe
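    One of the cross-dataset comparisons OmicLoupe automates, correlating fold changes across two omic layers, can be sketched in a few lines (gene names and values below are invented):

```python
# Sketch of a cross-layer comparison: correlate fold changes for
# features measured at both the transcript and protein level and flag
# features with consistent changes. All values here are made up.
import pandas as pd
from scipy.stats import spearmanr

rna = pd.Series({"ACE2": 1.8, "IL6": 2.5, "TMPRSS2": -0.4, "IFIT1": 3.1})
prot = pd.Series({"ACE2": 1.2, "IL6": 2.1, "TMPRSS2": -0.1, "IFIT1": 2.6})

shared = rna.index.intersection(prot.index)
rho, p = spearmanr(rna[shared], prot[shared])
consistent = [g for g in shared if rna[g] > 1 and prot[g] > 1]
print(f"rho={rho:.2f}, consistently up: {consistent}")
```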
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 96
    Publication Date: 2021-03-02
    Description: Background Deep learning contributes to uncovering molecular and cellular processes with highly performant algorithms. Convolutional neural networks have become the state-of-the-art tool to provide accurate and fast image data processing. However, published algorithms mostly solve only one specific problem and they typically require a considerable coding effort and machine learning background for their application. Results We have thus developed InstantDL, a deep learning pipeline for four common image processing tasks: semantic segmentation, instance segmentation, pixel-wise regression and classification. InstantDL enables researchers with a basic computational background to apply debugged and benchmarked state-of-the-art deep learning algorithms to their own data with minimal effort. To make the pipeline robust, we have automated and standardized workflows and extensively tested it in different scenarios. Moreover, it allows assessing the uncertainty of predictions. We have benchmarked InstantDL on seven publicly available datasets achieving competitive performance without any parameter tuning. For customization of the pipeline to specific tasks, all code is easily accessible and well documented. Conclusions With InstantDL, we hope to empower biomedical researchers to conduct reproducible image processing with a convenient and easy-to-use pipeline.
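    InstantDL's internals are not detailed in the abstract, but the uncertainty assessment it mentions is commonly obtained via Monte Carlo dropout, repeated stochastic forward passes whose spread estimates predictive uncertainty; a generic Keras sketch (an assumption about the technique, not InstantDL's actual code):

```python
# Generic Monte Carlo dropout sketch for prediction uncertainty; the
# tiny model and random input are placeholders.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(128,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

x = np.random.rand(1, 128).astype("float32")
# training=True keeps dropout active at inference time
preds = np.stack([model(x, training=True).numpy() for _ in range(50)])
print(preds.mean(), preds.std())   # mean prediction and its uncertainty
```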
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 97
    Publication Date: 2021-03-02
    Description: Background VCF formatted files are the lingua franca of next-generation sequencing, whereas HL7 FHIR is emerging as a standard language for electronic health record interoperability. A growing number of FHIR-based clinical genomics applications are emerging. Here, we describe an open source utility for converting variants from VCF format into HL7 FHIR format. Results vcf2fhir converts VCF variants into a FHIR Genomics Diagnostic Report. Conversion translates each VCF row into a corresponding FHIR-formatted variant in the generated report. In scope are simple variants (SNVs, MNVs, Indels), along with zygosity and phase relationships, for autosomes, sex chromosomes, and mitochondrial DNA. Input parameters include VCF file and genome build (‘GRCh37’ or ‘GRCh38’); and optionally a conversion region that indicates the region(s) to convert, a studied region that lists genomic regions studied by the lab, and a non-callable region that lists studied regions deemed uncallable by the lab. Conversion can be limited to a subset of VCF by supplying genomic coordinates of the conversion region(s). If studied and non-callable regions are also supplied, the output FHIR report will include ‘region-studied’ observations that detail which portions of the conversion region were studied, and of those studied regions, which portions were deemed uncallable. We illustrate the vcf2fhir utility via two case studies. The first, 'SMART Cancer Navigator', is a web application that offers clinical decision support by linking patient EHR information to cancerous gene variants. The second, 'Precision Genomics Integration Platform', intersects a patient's FHIR-formatted clinical and genomic data with knowledge bases in order to provide on-demand delivery of contextually relevant genomic findings and recommendations to the EHR. Conclusions Experience to date shows that the vcf2fhir utility can be effectively woven into clinically useful genomic-EHR integration pipelines. Additional testing will be a critical step towards the clinical validation of this utility, enabling it to be integrated in a variety of real world data flow scenarios. For now, we propose the use of this utility primarily to accelerate FHIR Genomics understanding and to facilitate experimentation with further integration of genomics data into the EHR.
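    Basic usage follows the package's documented Python interface; the sketch below shows the typical call pattern (file names and patient identifier are placeholders, and the exact signature should be verified against the project README):

```python
# Usage sketch for the vcf2fhir package; arguments follow its documented
# basic interface, but verify against the repository as the API may change.
import vcf2fhir

converter = vcf2fhir.Converter(
    "sample.vcf",               # input VCF (hypothetical file)
    "GRCh37",                   # genome build: 'GRCh37' or 'GRCh38'
    patient_id="NA12878",       # subject of the FHIR Diagnostic Report
)
converter.convert(output_filename="fhir_diagnostic_report.json")
```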
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 98
    Publication Date: 2021-03-02
    Description: Background Non-targeted cytotoxics with anticancer activity are often developed through preclinical stages using response criteria observed in cell lines and xenografts. The NCI-60 cell line panel is frequently the first screen used to define tumor types that are optimally responsive. Open data on the gene expression of the NCI-60 cell lines provide a unique opportunity to add another dimension to the preclinical development of such drugs by interrogating correlations with gene expression patterns. Machine learning can be used to reduce the complexity of whole-genome gene expression patterns to derive manageable signatures of response. Application of machine learning in early phases of preclinical development is likely to allow better positioning and ultimate clinical success of molecules. LP-184 is a highly potent novel alkylating agent whose preclinical development is being guided by a dedicated machine learning-derived response signature. We show the feasibility and accuracy of such a response signature by accurately predicting the response to LP-184, validated using wet-lab-derived IC50s on a panel of cell lines. Results We applied our proprietary RADR® platform to an NCI-60 discovery dataset encompassing LP-184 IC50s and publicly available gene expression data. We used multiple feature selection layers followed by an XGBoost regression model and reduced the complexity of 20,000 gene expression values to generate a 16-gene signature, leading to the identification of a set of predictive candidate biomarkers which form an LP-184 response gene signature. We further validated this signature and predicted response for an additional panel of cell lines. Considering fold-change differences and the correlation between actual and predicted LP-184 IC50 values as validation performance measures, we obtained 86% accuracy at a four-fold cut-off, and a strong (r = 0.70) and significant (p value 1.36e−06) correlation between actual and predicted LP-184 sensitivity. In agreement with the perceived mechanism of action of LP-184, PTGR1 emerged as the top weighted gene. Conclusion Integration of a machine learning-derived response signature with in vitro assessment of LP-184 efficacy facilitated the derivation of manageable yet robust biomarkers which can be used to predict drug sensitivity with high accuracy and clinical value.
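    The RADR® platform itself is proprietary; a generic version of the described modeling step, feature selection down to a 16-gene signature followed by XGBoost regression on IC50s, might look like this (synthetic data throughout):

```python
# Schematic stand-in for the described pipeline: univariate feature
# selection to 16 genes, then gradient-boosted regression on IC50s.
# Data are synthetic; this is not the proprietary RADR workflow.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20000))                      # 60 cell lines x 20k genes
y = X[:, 10] * 0.8 + rng.normal(scale=0.1, size=60)   # synthetic IC50s

selector = SelectKBest(f_regression, k=16).fit(X, y)  # 16-gene signature
model = XGBRegressor(n_estimators=200, max_depth=3).fit(selector.transform(X), y)
pred = model.predict(selector.transform(X))
print(np.corrcoef(y, pred)[0, 1])    # correlation of actual vs. predicted
```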
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 99
    Publication Date: 2021-03-02
    Description: Background The rapid expansion of the CRISPR toolbox through tagging effector domains to either enzymatically inactive Cas9 (dCas9) or Cas9 nickase (nCas9) has led to several promising new gene editing strategies. Recent additions include CRISPR cytosine or adenine base editors (CBEs and ABEs) and the CRISPR prime editors (PEs), in which a deaminase or a reverse transcriptase, respectively, is fused to nCas9. These tools hold great promise to model and correct disease-causing mutations in animal and plant models. But so far, no widely available tools exist to automate the design of both BE and PE reagents. Results We developed PnB Designer, a web-based application for the design of pegRNAs for PEs and guide RNAs for BEs. PnB Designer makes it easy to design targeting guide RNAs for single or multiple targets on a variant or reference genome from organisms spanning multiple kingdoms. With PnB Designer, we designed pegRNAs to model all known disease-causing mutations available in ClinVar. Additionally, PnB Designer can be used to design guide RNAs to install or revert a SNV, scanning the genome with one CBE and seven different ABE PAM variants and returning the best BE to use. PnB Designer is publicly accessible at http://fgcz-shiny.uzh.ch/PnBDesigner/. Conclusion With PnB Designer we created a user-friendly design tool for CRISPR PE and BE reagents, which should simplify choosing an editing strategy and avoid design complications.
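    PnB Designer is a web application, but the kind of search it automates can be sketched: scanning a sequence for SpCas9 'NGG' PAMs that place a target SNV inside a typical CBE editing window (positions 4–8 of the protospacer, an assumption here; window definitions vary by editor):

```python
# Toy guide-RNA scan, not PnB Designer's actual algorithm: enumerate
# 20-nt protospacers followed by an NGG PAM whose editing window
# (assumed positions 4-8) covers the target SNV.
import re

def candidate_guides(seq, snv_pos, window=(4, 8)):
    guides = []
    for m in re.finditer(r"(?=([ACGT]{21}GG))", seq):
        proto_start = m.start()              # 20-nt protospacer + NGG PAM
        offset = snv_pos - proto_start + 1   # 1-based position in protospacer
        if window[0] <= offset <= window[1]:
            guides.append(m.group(1)[:20])
    return guides

seq = "GATCACGTACGTTACGCATGCAATGGCTAGCTAGG"   # invented sequence
print(candidate_guides(seq, snv_pos=7))
```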
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 100
    Publication Date: 2021-03-20
    Description: Background Significant efforts have been made in building large-scale kinetic models of cellular metabolism in the past two decades. However, most kinetic models published to date remain focused on central carbon pathways or are built around ad hoc reduced models without clear justification of their derivation and usage. Systematic algorithms exist for reducing genome-scale metabolic reconstructions to build thermodynamically feasible and consistently reduced stoichiometric models. However, it is important to study how network complexity affects conclusions derived from large-scale kinetic models built around consistently reduced models before we can apply them to study biological systems. Results We systematically reduced the iJO1366 Escherichia coli genome-scale metabolic reconstruction to build three stoichiometric models of different sizes. Since the reduced models are expansions around the core subsystems for which the reduction was performed, the models are nested. We present a method for scaling up the flux profile and the concentration vector reference steady-states from the smallest model to the larger ones, whilst preserving maximum equivalency. Populations of kinetic models, preserving similarity in kinetic parameters, were built around the reference steady-states, and their metabolic sensitivity coefficients (MSCs) were computed. The MSCs were sensitive to the model complexity. We propose a metric for measuring the sensitivity of MSCs to these structural changes. Conclusions We propose for the first time a workflow for scaling up the size of kinetic models while preserving equivalency between them. Using this workflow, we demonstrate that model complexity, in terms of network size, has a significant impact on the sensitivity characteristics of kinetic models. Therefore, it is essential to account for the effects of network complexity when constructing kinetic models. The presented metric for measuring MSC sensitivity to structural changes can guide modelers and experimentalists in improving model quality, and can inform synthetic biology and metabolic engineering. Our proposed workflow enables testing the suitability of a kinetic model for answering certain study-specific questions. We argue that model-based metabolic design targets that are common across models of different sizes are of higher confidence, while those that differ could be the objective of investigations for model improvement.
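    For intuition, a metabolic sensitivity coefficient of the kind computed here is a scaled derivative of a steady-state quantity with respect to a parameter; a toy finite-difference version on a single Michaelis-Menten reaction (not the paper's workflow):

```python
# Toy sketch of a scaled (logarithmic) sensitivity coefficient,
# d ln v / d ln Km, estimated by central finite differences on a
# one-reaction Michaelis-Menten model. Parameter values are arbitrary.
def flux(vmax, km, s):
    return vmax * s / (km + s)               # Michaelis-Menten rate

def sensitivity(vmax, km, s, rel_step=1e-4):
    dk = km * rel_step
    dv = flux(vmax, km + dk, s) - flux(vmax, km - dk, s)
    return (dv / (2 * dk)) * (km / flux(vmax, km, s))

# Analytically this equals -Km/(Km+S) = -1/3 for the values below
print(sensitivity(vmax=10.0, km=0.5, s=1.0))
```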
    Electronic ISSN: 1471-2105
    Topics: Biology , Computer Science
    Published by BioMed Central
    Location Call Number Expected Availability
    BibTip Others were also interested in ...