ALBERT

All Library Books, journals and Electronic Records Telegrafenberg


  • Articles (571)
  • 1
    Publication Date: 2021-09-03
    Description: Background Missing data are a common issue in fields such as electronics, image processing, medical records and genomics, and can limit or even bias downstream analysis. The data collection process can lead to different distributions, frequencies, and structures of missing data points. Missing data can be classified into four categories: Structurally Missing Data (SMD), Missing Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR). For the latter three, and in the context of genomic data (especially non-coding data), we discuss six imputation approaches using 31,245 variants collected from ClinVar and annotated with 13 genome-wide features. Results Random Forest and kNN algorithms showed the best performance on the evaluated dataset (a minimal kNN-imputation sketch follows this record). Additionally, some features are imputed robustly regardless of the algorithm (e.g. the conservation scores phyloP7 and phyloP20), while other features are imputed poorly across algorithms (e.g. phastCons). We also developed an R package that helps to test which imputation method is best for a particular data set. Conclusions We found that Random Forest and kNN are the best imputation methods for genomic data, including non-coding variants. Since Random Forest is computationally more demanding, kNN remains the more practical approach. Future work on variant prioritization through genomic screening tests could benefit greatly from this methodology.
    Electronic ISSN: 1756-0381
    Topics: Biology , Computer Science
    Published by BioMed Central
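
A minimal, hedged sketch of the kNN imputation the record above favors: values are masked completely at random in a synthetic variant-by-feature matrix and then imputed with scikit-learn's KNNImputer. The matrix shape, masking rate and k are illustrative choices, not the paper's actual ClinVar pipeline.

```python
# Sketch of kNN imputation on a variant-by-feature matrix.
# The data here are synthetic placeholders, not the ClinVar set from the paper.
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 13))        # e.g. 13 genome-wide annotation features
mask = rng.random(X.shape) < 0.1       # ~10% of values missing completely at random
X_missing = np.where(mask, np.nan, X)

imputer = KNNImputer(n_neighbors=5)    # k is a tunable hyperparameter
X_imputed = imputer.fit_transform(X_missing)

# Because the true values are known here, the imputation can be scored directly,
# mirroring the masking-based evaluation idea behind the paper's R package.
rmse = np.sqrt(np.mean((X_imputed[mask] - X[mask]) ** 2))
print(f"RMSE on masked entries: {rmse:.3f}")
```
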
  • 2
    Publication Date: 2021-08-28
    Description: Background The amount of available and potentially significant data describing study subjects keeps growing with the introduction and integration of different registries and data banks. The individual attributes in these data are not always necessary; more often, membership in a specific group (e.g. diet, social ‘bubble’, living area) is enough to build a successful machine learning or data mining model without overfitting it. Therefore, in this article we propose an approach to building taxonomies using clustering, replacing detailed data from large heterogeneous data sets from different sources while improving interpretability. We used the GISTAR study database, which holds exhaustive self-assessment questionnaire data, to demonstrate this approach in the task of differentiating between H. pylori positive and negative study participants and assessing their potential risk factors. We compared the results of taxonomy-based classification to those of classification using raw data. Results Evaluation of our approach was carried out using 6 classification algorithms that induce rule-based or tree-based classifiers. The taxonomy-based classification results show no significant loss of information, with similar and up to 2.5% better classification accuracy. Information held by 10 or more attributes can be replaced by a single attribute indicating membership in a cluster of a hierarchy at a specific cut, as sketched after this record. The clusters created this way can be easily interpreted by researchers (doctors, epidemiologists) and describe the co-occurring features of the group, which is significant for the specific task. Conclusions While there are always features and measurements that must be used in data analysis as they are, describing study subjects with taxonomies in parallel makes it possible to use membership in naturally occurring groups and its impact on an outcome. This can decrease the risk of overfitting (picking attributes and values specific to the training set without explaining the underlying conditions), improve the accuracy of the models, and improve privacy protection of study participants by decreasing the amount of specific information used to identify the individual.
    Electronic ISSN: 1756-0381
    Topics: Biology , Computer Science
    Published by BioMed Central
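
A minimal sketch of the taxonomy-building idea described above: a block of related raw attributes is replaced by membership in a cluster taken from a hierarchy at a chosen cut. The synthetic questionnaire data, Ward linkage and six-cluster cut are assumptions for illustration, not the GISTAR pipeline.

```python
# Sketch: replace a block of raw attributes with membership in a cluster
# taken from a hierarchy at a chosen cut (synthetic data, illustrative only).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
questionnaire = rng.integers(0, 2, size=(500, 12)).astype(float)  # 12 related items

Z = linkage(questionnaire, method="ward")                # build the hierarchy
taxonomy_label = fcluster(Z, t=6, criterion="maxclust")  # cut into 6 groups

# The single 'taxonomy_label' column can now stand in for the 12 raw items
# when training a classifier, trading attribute-level detail for interpretability.
print(taxonomy_label[:10])
```
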
  • 3
    Publication Date: 2021-08-21
    Description: Background Recent advances in sequencing technologies have driven studies identifying the microbiome as a key regulator of overall health and disease in the host. Both 16S amplicon and whole genome shotgun sequencing technologies are currently being used to investigate this relationship; however, the choice of sequencing technology often depends on the nature and experimental design of the study. In principle, the outputs rendered by analysis pipelines are heavily influenced by the data used as input; it is therefore important to consider that the genomic features produced by different sequencing technologies may emphasize different results. Results In this work, we use public 16S amplicon and whole genome shotgun sequencing (WGS) data from the same dogs to investigate the relationship between sequencing technology and the captured gut metagenomic landscape. In our analyses, we compare the taxonomic resolution at the species and phylum levels and benchmark 12 classification algorithms in their ability to accurately identify host phenotype using only taxonomic relative abundance information from 16S and WGS datasets with identical study designs (the setup is sketched after this record). Our best performing model, a random forest trained on the WGS dataset, identified a species (Bacteroides coprocola) that predominantly contributes to the abundance of leuB, a gene involved in branched-chain amino acid biosynthesis and a risk factor for glucose intolerance, insulin resistance, and type 2 diabetes. This trend was not conserved when we trained the model using 16S sequencing profiles from the same dogs. Conclusions Our results indicate that WGS detects a greater taxonomic diversity than 16S sequencing of the same dogs at the species level and for four gut-enriched phyla. This difference in detection does not significantly impact the performance metrics of the machine learning algorithms after down-sampling. Although the important features extracted from our best performing model are not conserved between the two technologies, the important features extracted in either case indicate the utility of machine learning algorithms in identifying biologically meaningful relationships between the host and members of the microbiome community. In conclusion, this work provides the first systematic machine learning comparison of dog 16S and WGS microbiomes derived from identical study designs.
    Electronic ISSN: 1756-0381
    Topics: Biology , Computer Science
    Published by BioMed Central
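
A hedged sketch of the benchmarking setup above: a random forest is scored by cross-validated AUROC on a relative-abundance table, and feature importances point at the taxa driving the prediction (as B. coprocola did in the paper). The abundance table here is synthetic and the model settings are illustrative.

```python
# Sketch: host-phenotype classification from taxonomic relative abundances.
# Synthetic abundance table; the paper benchmarks 12 algorithms on real 16S/WGS data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
counts = rng.poisson(5, size=(60, 200)).astype(float)   # 60 samples x 200 taxa
rel_abund = counts / counts.sum(axis=1, keepdims=True)  # convert to relative abundance
phenotype = rng.integers(0, 2, size=60)                 # binary host phenotype

rf = RandomForestClassifier(n_estimators=500, random_state=0)
print(cross_val_score(rf, rel_abund, phenotype, cv=5, scoring="roc_auc").mean())

# Feature importances highlight the taxa the model relies on.
rf.fit(rel_abund, phenotype)
top = np.argsort(rf.feature_importances_)[::-1][:5]
print("top taxa indices:", top)
```
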
  • 4
    Publication Date: 2021-08-16
    Description: Background Early prediction of hospital mortality is crucial for ICU patients with sepsis. This study aimed to develop a novel blending machine learning (ML) model for hospital mortality prediction in ICU patients with sepsis. Methods Two ICU databases were employed: the eICU Collaborative Research Database (eICU-CRD) and the Medical Information Mart for Intensive Care III (MIMIC-III). All adult patients who fulfilled the Sepsis-3 criteria were identified. Samples from eICU-CRD constituted the training set and samples from MIMIC-III constituted the test set. A stepwise logistic regression model was used for predictor selection. A blending ML model integrating nine base ML models was developed for hospital mortality prediction (a minimal sketch of the blending scheme follows this record). Model performance was evaluated by various measures related to discrimination or calibration. Results A total of 12,558 patients from eICU-CRD were included as the training set, and 12,095 patients from MIMIC-III were included as the test set. Both sets showed a hospital mortality of 17.9%. Maximum and minimum lactate, maximum and minimum albumin, minimum PaO2/FiO2 and age were important predictors identified by both the random forest and extreme gradient boosting algorithms. Blending ML models based on the corresponding set of predictors presented better discrimination than SAPS II (AUROC 0.806 vs. 0.771; AUPRC 0.515 vs. 0.429) and SOFA (AUROC 0.742 vs. 0.706; AUPRC 0.428 vs. 0.381) on the test set. In addition, calibration curves showed that the blending ML models had better calibration than SAPS II. Conclusions The blending ML model is capable of integrating different base ML models efficiently and outperforms conventional severity scores in predicting hospital mortality among septic patients in the ICU.
    Electronic ISSN: 1756-0381
    Topics: Biology , Computer Science
    Published by BioMed Central
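
A minimal sketch of blending as described above: base learners are fit on one split of the training data, and a meta-learner is fit on their held-out predicted probabilities. The three base models, the split sizes and the synthetic class imbalance stand in for the paper's nine base models and real ICU cohorts.

```python
# Sketch of a blending ensemble on synthetic, imbalanced data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=4000, weights=[0.82], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
# Carve a blending holdout out of the training data for the meta-learner.
X_fit, X_blend, y_fit, y_blend = train_test_split(X_tr, y_tr, test_size=0.3,
                                                  random_state=0)

bases = [LogisticRegression(max_iter=1000),
         RandomForestClassifier(random_state=0),
         GradientBoostingClassifier(random_state=0)]
for m in bases:
    m.fit(X_fit, y_fit)

def stack(X_):
    # Base-model probabilities become the meta-learner's features.
    return np.column_stack([m.predict_proba(X_)[:, 1] for m in bases])

meta = LogisticRegression().fit(stack(X_blend), y_blend)   # the blender
print("blend AUROC:", roc_auc_score(y_te, meta.predict_proba(stack(X_te))[:, 1]))
```

Fitting the meta-learner on held-out predictions, rather than on the base models' training data, is what keeps the blend from simply inheriting the base models' overfitting.
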
  • 5
    Publication Date: 2021-08-14
    Description: Background Intrinsically disordered proteins possess flexible 3-D structures, which allows them to play an important role in a variety of biological functions. Molecular recognition features (MoRFs) are an important type of functional region; they are located within longer intrinsically disordered regions and undergo disorder-to-order transitions upon binding their interaction partners. Results We develop a method, MoRFCNN, to predict MoRFs based on sequence properties and convolutional neural networks (CNNs); a toy model in this spirit is sketched after this record. The sequence properties comprise structural and physicochemical properties that describe the differences between MoRFs and non-MoRFs. In particular, to capture the correlation between the target residue and adjacent residues, three windows are selected to preprocess the selected properties. These calculated properties are then combined into a feature matrix, from which the constructed CNN predicts MoRFs. Compared with existing methods, MoRFCNN achieves better performance. Conclusions MoRFCNN is a new standalone MoRF prediction method that uses only protein sequence properties, without evolutionary information. The simulation results show that MoRFCNN is effective and competitive.
    Electronic ISSN: 1756-0381
    Topics: Biology , Computer Science
    Published by BioMed Central
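
A toy 1-D CNN in the spirit of the record above: per-residue property tracks over a sequence window are convolved and mapped to a per-window MoRF probability. The 10-channel input, window length of 25 and layer sizes are illustrative guesses, not MoRFCNN's published architecture.

```python
# Sketch: a tiny 1-D CNN over per-residue property windows (PyTorch).
import torch
import torch.nn as nn

class TinyMoRFNet(nn.Module):
    def __init__(self, n_props=10, window=25):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_props, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 16, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(16 * window, 1))

    def forward(self, x):           # x: (batch, n_props, window)
        # Output: probability that the window's centre residue lies in a MoRF.
        return torch.sigmoid(self.head(self.conv(x)))

x = torch.randn(8, 10, 25)          # 8 windows of 25 residues, 10 properties each
print(TinyMoRFNet()(x).shape)       # -> torch.Size([8, 1])
```
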
  • 6
    Publication Date: 2021-08-13
    Description: Background Although many patients receive good prognoses with standard therapy, 30–50% of diffuse large B-cell lymphoma (DLBCL) cases may relapse after treatment. Statistical or computational intelligence models are powerful tools for assessing prognoses; however, many cannot generate accurate risk (probability) estimates. Thus, probability calibration-based versions of traditional machine learning algorithms are developed in this paper to predict the risk of relapse in patients with DLBCL. Methods Five machine learning algorithms were assessed, namely, naïve Bayes (NB), logistic regression (LR), random forest (RF), support vector machine (SVM) and feedforward neural network (FFNN), and three methods were used to develop probability calibration-based versions of each of the above algorithms, namely, Platt scaling (Platt), isotonic regression (IsoReg) and shape-restricted polynomial regression (RPR); the first two are sketched after this record. Performance comparisons were based on the average results of a stratified hold-out test repeated 500 times. We used the AUC to evaluate the discrimination ability (i.e., classification ability) of each model and assessed model calibration (i.e., risk prediction accuracy) using the Hosmer–Lemeshow (H-L) goodness-of-fit test, the expected calibration error (ECE), the maximum calibration error (MCE) and the Brier score (BS). Results Sex, stage, IPI, KPS, GCB, CD10 and rituximab were significant factors predicting the 3-year recurrence rate of patients with DLBCL. Of the 5 uncalibrated algorithms, the LR (ECE = 8.517, MCE = 20.100, BS = 0.188) and FFNN (ECE = 8.238, MCE = 20.150, BS = 0.184) models were well calibrated, whereas the errors of the initial risk estimates of the NB (ECE = 15.711, MCE = 34.350, BS = 0.212), RF (ECE = 12.740, MCE = 27.200, BS = 0.201) and SVM (ECE = 9.872, MCE = 23.800, BS = 0.194) models were large. With probability calibration, the biased NB, RF and SVM models were well corrected, while the calibration errors of the LR and FFNN models were not further improved regardless of the calibration method. Among the 3 calibration methods, RPR achieved the best calibration for both the RF and SVM models; the benefit of IsoReg was not obvious for the NB, RF or SVM models. Conclusions Although these algorithms all have good classification ability, several cannot generate accurate risk estimates. Probability calibration is an effective way to improve the accuracy of poorly calibrated algorithms. Our risk model of DLBCL demonstrates good discrimination and calibration ability and has the potential to help clinicians make optimal therapeutic decisions to achieve precision medicine.
    Electronic ISSN: 1756-0381
    Topics: Biology , Computer Science
    Published by BioMed Central
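
A minimal sketch of two of the three calibration methods named above, Platt scaling and isotonic regression, via scikit-learn's CalibratedClassifierCV (shape-restricted polynomial regression has no off-the-shelf scikit-learn implementation and is omitted). Data and model settings are synthetic placeholders.

```python
# Sketch: Platt scaling vs. isotonic regression for an SVM, scored by Brier score.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=3000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for method in ("sigmoid", "isotonic"):        # "sigmoid" is Platt scaling
    calibrated = CalibratedClassifierCV(SVC(), method=method, cv=5)
    calibrated.fit(X_tr, y_tr)
    p = calibrated.predict_proba(X_te)[:, 1]
    print(method, "Brier score:", round(brier_score_loss(y_te, p), 4))
```

Lower Brier scores after calibration, with unchanged ranking of cases, reproduce in miniature the paper's finding that calibration fixes risk estimates without altering discrimination.
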
  • 7
    Publication Date: 2021-04-23
    Description: Background As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates, since they allow us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample (the basic Mendelian-consistency check is sketched after this record). Results We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method’s versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that: 1) sequencing error rates between samples in the same dataset can vary by over an order of magnitude; 2) variant-calling performance decreases substantially in low-complexity regions of the genome; 3) variant-calling performance in whole exome sequencing data decreases with distance from the nearest target region; 4) variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood; and 5) whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites. Conclusion Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.
    Electronic ISSN: 1756-0381
    Topics: Biology , Computer Science
    Published by BioMed Central
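
The core observable in the record above is the Mendelian error. Below is a sketch of the consistency check for a biallelic site with genotypes coded as alt-allele counts (0/1/2): a child genotype is an error if it cannot be assembled from one allele transmitted by each parent. How errors are turned into genome-wide precision/recall estimates is the paper's contribution and is not reproduced here.

```python
# Sketch: flag Mendelian errors at biallelic sites from trio genotypes (0/1/2).
def mendelian_error(mother: int, father: int, child: int) -> bool:
    """True if the trio genotypes are inconsistent with Mendelian inheritance."""
    transmissible = {0: {0}, 1: {0, 1}, 2: {1}}     # alleles a parent can pass on
    possible = {m + p
                for m in transmissible[mother]
                for p in transmissible[father]}
    return child not in possible

trios = [(0, 0, 1), (1, 1, 2), (2, 2, 1), (0, 2, 1)]
errors = sum(mendelian_error(*t) for t in trios)
print(f"{errors}/{len(trios)} sites show a Mendelian error")   # -> 2/4
```
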
  • 8
    Publication Date: 2021-04-15
    Description: Background As per the 2017 WHO fact sheet, Coronary Artery Disease (CAD) is the primary cause of death in the world and accounts for 31% of total fatalities. The unprecedented 17.6 million deaths caused by CAD in 2016 underscore the urgent need to facilitate proactive and accelerated pre-emptive diagnosis. Innovative and emerging Machine Learning (ML) techniques can be leveraged to facilitate early detection of CAD, which is a crucial factor in saving lives. Standard techniques like angiography, which provide reliable evidence, are invasive and typically expensive and risky. In contrast, an ML-generated diagnosis is non-invasive, fast, accurate and affordable, so ML algorithms can be used as a supplement or precursor to the conventional methods. This research demonstrates the implementation and comparative analysis of the K Nearest Neighbor (k-NN) and Random Forest ML algorithms to achieve a targeted “At Risk” CAD classification using an emerging set of 35 cytokine biomarkers, strongly indicative predictive variables that are also potential targets for therapy. To ensure better generalizability, mechanisms such as data balancing and repeated k-fold cross-validation for hyperparameter tuning were integrated within the models; the evaluation protocol is sketched after this record. To determine the separability of “At Risk” CAD versus Control achieved by the models, the Area Under the Receiver Operating Characteristic curve (AUROC) metric was used, which characterizes the trade-off between the false positive and true positive rates. Results Two classifiers were developed, both built using the 35 cytokine predictive features. The best AUROC score of 0.99 with a 95% Confidence Interval (CI) of (0.982, 0.999) was achieved by the Random Forest classifier; the second-best AUROC score of 0.954 with a 95% CI of (0.929, 0.979) was achieved by the k-NN model. A p-value of less than 7.481e-10 obtained by an independent t-test validated that the Random Forest classifier was significantly better than the k-NN classifier with regard to the AUROC score. Presently, as large-scale efforts gain momentum to enable early, fast, reliable, affordable, and accessible detection of individuals at risk for CAD, powerful ML algorithms can be leveraged as a supplement to conventional methods such as angiography. Early detection can be further improved by incorporating 65 novel and sensitive cytokine biomarkers. Investigation of the emerging role of cytokines in CAD can materially enhance the detection of risk and the discovery of disease mechanisms that can lead to new therapeutic modalities.
    Electronic ISSN: 1756-0381
    Topics: Biology , Computer Science
    Published by BioMed Central
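
A hedged sketch of the evaluation protocol above: k-NN and Random Forest are compared by AUROC under repeated stratified k-fold cross-validation, with a t-test on the fold scores. The 35-feature data are synthetic stand-ins for the cytokine panel, and the hyperparameters are illustrative.

```python
# Sketch: k-NN vs. Random Forest by AUROC under repeated stratified k-fold CV.
from scipy.stats import ttest_ind
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=400, n_features=35, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
rf = RandomForestClassifier(n_estimators=500, random_state=0)

auc_knn = cross_val_score(knn, X, y, cv=cv, scoring="roc_auc")
auc_rf = cross_val_score(rf, X, y, cv=cv, scoring="roc_auc")
print("kNN:", auc_knn.mean(), "RF:", auc_rf.mean(),
      "p:", ttest_ind(auc_rf, auc_knn).pvalue)
```

Scaling inside the pipeline matters for k-NN, since its distance metric is sensitive to feature magnitudes; Random Forest is scale-invariant and needs no such step.
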
  • 9
    Publication Date: 2021-04-14
    Description: Background Longitudinal gene expression analysis and survival modeling have been shown to add valuable biological and clinical knowledge. This study proposes a novel framework to discover gene signatures and patterns in high-dimensional time-series transcriptomics data and to assess their association with hospital length of stay. Methods We investigated a longitudinal, high-dimensional gene expression dataset from 168 blunt-force trauma patients followed during the first 28 days after injury. To model the length of stay, an initial dimensionality reduction step was performed by applying Cox regression with elastic net regularization to gene expression data from the first hospitalization days (a sketch of this screening step follows the record). We also proposed a novel methodology to impute missing values for the previously selected genes. We then applied multivariate time series (MTS) clustering to analyse gene expression over time and to stratify patients with similar trajectories. The patient partitions obtained by MTS clustering were validated using Kaplan-Meier curves and log-rank tests. Results We identified 22 genes strongly associated with hospital discharge; their expression values in the first days after trauma proved to be good predictors of the length of stay. The proposed mixed imputation method yielded a complete dataset of short time series with minimal loss of information over the 28 days of follow-up. MTS clustering grouped patients with similar gene trajectories and, notably, with similar discharge days. Patients within each cluster have comparable gene trajectories and may have an analogous response to injury. Conclusion The proposed framework was able to tackle the joint analysis of time-to-event information with longitudinal, multivariate, high-dimensional data. The application to length of stay and transcriptomics data revealed a strong relationship between gene expression trajectories and patients’ recovery, which may improve the management of trauma patients by healthcare systems. The proposed methodology can easily be adapted to other medical data, towards more effective clinical decision support systems for health applications.
    Electronic ISSN: 1756-0381
    Topics: Biology , Computer Science
    Published by BioMed Central
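
A minimal sketch of the gene-screening step above: an elastic-net-penalized Cox model fit to length of stay, keeping genes with non-negligible coefficients. The lifelines library is used here as one reasonable implementation; the expression data, penalty strength and selection threshold are illustrative, not the paper's settings.

```python
# Sketch: elastic-net Cox regression as a gene-screening step for length of stay.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(3)
n, p = 168, 50                                   # patients x candidate genes
df = pd.DataFrame(rng.normal(size=(n, p)), columns=[f"g{i}" for i in range(p)])
df["los_days"] = rng.exponential(10, size=n)     # length of stay: the "time-to-event"
df["discharged"] = rng.integers(0, 2, size=n)    # event indicator (discharge observed)

cph = CoxPHFitter(penalizer=0.5, l1_ratio=0.5)   # elastic net: mix of L1 and L2
cph.fit(df, duration_col="los_days", event_col="discharged")

# Keep genes whose penalized coefficients were not shrunk to (near) zero.
selected = cph.params_[cph.params_.abs() > 1e-3].index.tolist()
print(len(selected), "genes retained for the downstream MTS clustering")
```
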
  • 10
    Publication Date: 2021-04-01
    Description: Background Aortic dissection (AD) is one of the most catastrophic aortic diseases and is associated with a high mortality rate. In contrast to the advances seen in most cardiovascular diseases, both the incidence and the in-hospital mortality rate of AD have increased anomalously over the past 20 years, highlighting the need for fresh perspectives on prescreening and in-hospital treatment strategies. Methods Through two cross-sectional studies, we adopt image recognition techniques to identify pre-disease aortic morphology for early diagnosis; assuming that AD has occurred, we employ functional data analysis to determine the optimal timing for blood pressure (BP) and heart rate (HR) interventions to offer the highest possible survival rate. Results Compared with the healthy control group, the aortic centerline is significantly more slumped in the AD group (one possible way to quantify this is sketched after the record). Further, controlling patients’ blood pressure and heart rate according to the likelihood of adverse events can offer the highest possible survival probability. Conclusions The degree of slumpness is introduced to comprehensively describe aortic morphological changes. The morphology-based prediction model is associated with an improvement in the predictive accuracy of AD prescreening. The dynamic model reveals that blood pressure and heart rate variations have strong predictive power for adverse events, confirming this model’s ability to improve AD management.
    Electronic ISSN: 1756-0381
    Topics: Biology , Computer Science
    Published by BioMed Central
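
One plausible, hypothetical proxy for the "slumpness" of an aortic centerline is the maximum sagging depth of the centerline away from the chord joining its endpoints, normalized by the chord length. This metric and the toy arch below are illustrations only, not the paper's published definition.

```python
# Sketch: a hypothetical slumpness proxy for a 2-D aortic centerline.
import numpy as np

def slumpness(points: np.ndarray) -> float:
    """points: (n, 2) centerline coordinates, ordered along the vessel."""
    a, b = points[0], points[-1]
    chord_len = np.linalg.norm(b - a)
    chord = (b - a) / chord_len
    rel = points - a
    # Perpendicular distance of every centerline point from the endpoint chord.
    perp = np.abs(rel[:, 0] * chord[1] - rel[:, 1] * chord[0])
    return float(perp.max() / chord_len)        # normalized maximum sag depth

t = np.linspace(0, np.pi, 100)
arch = np.column_stack([np.cos(t), 0.6 * np.sin(t)])   # a toy aortic arch
print(f"slumpness ~ {slumpness(arch):.2f}")            # -> 0.30
```
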