ALBERT — All Library Books, journals and Electronic Records Telegrafenberg

1

Unknown

Measuring Performance Metrics of Machine Learning Algorithms for Detecting and Classifying Transposable Elements (2020)

Orozco-Arias, Simon ; Piña, Johan S. ; Tabares-Soto, Reinel ; [et al.]

Molecular Diversity Preservation International

In: Processes. 2020; 8(6): 638. Published 2020 May 27. doi: 10.3390/pr8060638.

add to mindlist on the mindlist

Details

Publication Date: 2020-05-27

Description: Because of the promising results obtained by machine learning (ML) approaches in several fields, every day is more common, the utilization of ML to solve problems in bioinformatics. In genomics, a current issue is to detect and classify transposable elements (TEs) because of the tedious tasks involved in bioinformatics methods. Thus, ML was recently evaluated for TE datasets, demonstrating better results than bioinformatics applications. A crucial step for ML approaches is the selection of metrics that measure the realistic performance of algorithms. Each metric has specific characteristics and measures properties that may be different from the predicted results. Although the most commonly used way to compare measures is by using empirical analysis, a non-result-based methodology has been proposed, called measure invariance properties. These properties are calculated on the basis of whether a given measure changes its value under certain modifications in the confusion matrix, giving comparative parameters independent of the datasets. Measure invariance properties make metrics more or less informative, particularly on unbalanced, monomodal, or multimodal negative class datasets and for real or simulated datasets. Although several studies applied ML to detect and classify TEs, there are no works evaluating performance metrics in TE tasks. Here, we analyzed 26 different metrics utilized in binary, multiclass, and hierarchical classifications, through bibliographic sources, and their invariance properties. Then, we corroborated our findings utilizing freely available TE datasets and commonly used ML algorithms. Based on our analysis, the most suitable metrics for TE tasks must be stable, even using highly unbalanced datasets, multimodal negative class, and training datasets with errors or outliers. Based on these parameters, we conclude that the F1-score and the area under the precision-recall curve are the most informative metrics since they are calculated based on other metrics, providing insight into the development of an ML application.

Electronic ISSN: 2227-9717

Topics: Biology , Chemistry and Pharmacology

Published by Molecular Diversity Preservation International

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

PAPER CURRENT

S·F·X

Fulltext

2

Unknown

A Machine Learning-based Pipeline for the Classification of CTX-M in Metagenomics Samples (2019)

Ceballos, Diego ; López-Álvarez, Diana ; Isaza, Gustavo ; [et al.]

Molecular Diversity Preservation International

In: Processes. 2019; 7(4): 235. Published 2019 Apr 24. doi: 10.3390/pr7040235.

add to mindlist on the mindlist

Details

Publication Date: 2019-04-24

Description: Bacterial infections are a major global concern, since they can lead to public health problems. To address this issue, bioinformatics contributes extensively with the analysis and interpretation of in silico data by enabling to genetically characterize different individuals/strains, such as in bacteria. However, the growing volume of metagenomic data requires new infrastructure, technologies, and methodologies that support the analysis and prediction of this information from a clinical point of view, as intended in this work. On the other hand, distributed computational environments allow the management of these large volumes of data, due to significant advances in processing architectures, such as multicore CPU (Central Process Unit) and GPGPU (General Propose Graphics Process Unit). For this purpose, we developed a bioinformatics workflow based on filtered metagenomic data with Duk tool. Data formatting was done through Emboss software and a prototype of a workflow. A pipeline was also designed and implemented in bash script based on machine learning. Further, Python 3 programming language was used to normalize the training data of the artificial neural network, which was implemented in the TensorFlow framework, and its behavior was visualized in TensorBoard. Finally, the values from the initial bioinformatics process and the data generated during the parameterization and optimization of the Artificial Neural Network are presented and validated based on the most optimal result for the identification of the CTX-M gene group.

Electronic ISSN: 2227-9717

Topics: Biology , Chemistry and Pharmacology

Published by Molecular Diversity Preservation International

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

PAPER CURRENT

S·F·X

Fulltext

3

Unknown

TIP_finder: An HPC Software to Detect Transposable Element Insertion Polymorphisms in Large Genomic Datasets (2020)

Orozco-Arias, Simon ; Tobon-Orozco, Nicolas ; Piña, Johan S. ; [et al.]

Molecular Diversity Preservation International

In: Biology. 2020; 9(9): 281. Published 2020 Sep 09. doi: 10.3390/biology9090281.

add to mindlist on the mindlist

Details

Publication Date: 2020-09-09

Description: Transposable elements (TEs) are non-static genomic units capable of moving indistinctly from one chromosomal location to another. Their insertion polymorphisms may cause beneficial mutations, such as the creation of new gene function, or deleterious in eukaryotes, e.g., different types of cancer in humans. A particular type of TE called LTR-retrotransposons comprises almost 8% of the human genome. Among LTR retrotransposons, human endogenous retroviruses (HERVs) bear structural and functional similarities to retroviruses. Several tools allow the detection of transposon insertion polymorphisms (TIPs) but fail to efficiently analyze large genomes or large datasets. Here, we developed a computational tool, named TIP_finder, able to detect mobile element insertions in very large genomes, through high-performance computing (HPC) and parallel programming, using the inference of discordant read pair analysis. TIP_finder inputs are (i) short pair reads such as those obtained by Illumina, (ii) a chromosome-level reference genome sequence, and (iii) a database of consensus TE sequences. The HPC strategy we propose adds scalability and provides a useful tool to analyze huge genomic datasets in a decent running time. TIP_finder accelerates the detection of transposon insertion polymorphisms (TIPs) by up to 55 times in breast cancer datasets and 46 times in cancer-free datasets compared to the fastest available algorithms. TIP_finder applies a validated strategy to find TIPs, accelerates the process through HPC, and addresses the issues of runtime for large-scale analyses in the post-genomic era.

Electronic ISSN: 2079-7737

Topics: Biology

Published by Molecular Diversity Preservation International

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

PAPER CURRENT

S·F·X

Fulltext

4

Unknown

A comparative study of machine learning and deep learning algorithms to classify cancer types based on microarray gene expression data (2020)

Tabares-Soto, Reinel ; Orozco-Arias, Simon ; Romero-Cano, Victor ; [et al.]

PeerJ

In: PeerJ Computer Science. 2020; 6: e270. e270. Published 2020 Apr 13. doi: 10.7717/peerj-cs.270.

add to mindlist on the mindlist

Details

Publication Date: 2020-04-13

Description: Cancer classification is a topic of major interest in medicine since it allows accurate and efficient diagnosis and facilitates a successful outcome in medical treatments. Previous studies have classified human tumors using a large-scale RNA profiling and supervised Machine Learning (ML) algorithms to construct a molecular-based classification of carcinoma cells from breast, bladder, adenocarcinoma, colorectal, gastro esophagus, kidney, liver, lung, ovarian, pancreas, and prostate tumors. These datasets are collectively known as the 11_tumor database, although this database has been used in several works in the ML field, no comparative studies of different algorithms can be found in the literature. On the other hand, advances in both hardware and software technologies have fostered considerable improvements in the precision of solutions that use ML, such as Deep Learning (DL). In this study, we compare the most widely used algorithms in classical ML and DL to classify the tumors described in the 11_tumor database. We obtained tumor identification accuracies between 90.6% (Logistic Regression) and 94.43% (Convolutional Neural Networks) using k-fold cross-validation. Also, we show how a tuning process may or may not significantly improve algorithms’ accuracies. Our results demonstrate an efficient and accurate classification method based on gene expression (microarray data) and ML/DL algorithms, which facilitates tumor type prediction in a multi-cancer-type scenario.

Electronic ISSN: 2376-5992

Topics: Computer Science

Published by PeerJ

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

PAPER CURRENT

S·F·X

Fulltext

5

Unknown

Worldwide co-occurrence analysis of 17 species of the genus Brachypodium using data mining (2019)

Orozco-Arias, Simon ; Núñez-Rincón, Ana María ; Tabares-Soto, Reinel ; [et al.]

PeerJ

In: PeerJ. 2019; 6: e6193. e6193. Published 2019 Jan 14. doi: 10.7717/peerj.6193.

add to mindlist on the mindlist

Details

Publication Date: 2019-01-14

Description: The co-occurrence of plant species is a fundamental aspect of plant ecology that contributes to understanding ecological processes, including the establishment of ecological communities and its applications in biological conservation. A priori algorithms can be used to measure the co-occurrence of species in a spatial distribution given by coordinates. We used 17 species of the genus Brachypodium, downloaded from the Global Biodiversity Information Facility data repository or obtained from bibliographical sources, to test an algorithm with the spatial points process technique used by Silva et al. (2016), generating association rules for co-occurrence analysis. Brachypodium spp. has emerged as an effective model for monocot species, growing in different environments, latitudes, and elevations; thereby, representing a wide range of biotic and abiotic conditions that may be associated with adaptive natural genetic variation. We created seven datasets of two, three, four, six, seven, 15, and 17 species in order to test the algorithm with four different distances (1, 5, 10, and 20 km). Several measurements (support, confidence, lift, Chi-square, and p-value) were used to evaluate the quality of the results generated by the algorithm. No negative association rules were created in the datasets, while 95 positive co-occurrences rules were found for datasets with six, seven, 15, and 17 species. Using 20 km in the dataset with 17 species, we found 16 positive co-occurrences involving five species, suggesting that these species are coexisting. These findings are corroborated by the results obtained in the dataset with 15 species, where two species with broad range distributions present in the previous dataset are eliminated, obtaining seven positive co-occurrences. We found that B. sylvaticum has co-occurrence relations with several species, such as B. pinnatum, B. rupestre, B. retusum, and B. phoenicoides, due to its wide distribution in Europe, Asia, and north of Africa. We demonstrate the utility of the algorithm implemented for the analysis of co-occurrence of 17 species of the genus Brachypodium, agreeing with distributions existing in nature. Data mining has been applied in the field of biological sciences, where a great amount of complex and noisy data of unseen proportion has been generated in recent years. Particularly, ecological data analysis represents an opportunity to explore and comprehend biological systems with data mining and bioinformatics tools.

Electronic ISSN: 2167-8359

Topics: Biology , Medicine

Published by PeerJ

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

PAPER CURRENT

S·F·X

Fulltext

6

Unknown

Inpactor, Integrated and Parallel Analyzer and Classifier of LTR Retrotransposons and Its Application for Pineapple LTR Retrotransposons Diversity and Dynamics (2018)

Orozco-Arias, Simon ; Liu, Juan ; Tabares-Soto, Reinel ; [et al.]

Molecular Diversity Preservation International

In: Biology. 2018; 7(2): 32. Published 2018 May 25. doi: 10.3390/biology7020032.

add to mindlist on the mindlist

Details

Publication Date: 2018-05-25

Electronic ISSN: 2079-7737

Topics: Biology

Published by Molecular Diversity Preservation International

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

PAPER CURRENT

S·F·X

Fulltext

7

Unknown

A systematic review of the application of machine learning in the detection and classification of transposable elements (2019)

Orozco-Arias, Simon ; Isaza, Gustavo ; Guyot, Romain ; [et al.]

PeerJ

In: PeerJ. 2019; 7: e8311. e8311. Published 2019 Dec 18. doi: 10.7717/peerj.8311.

add to mindlist on the mindlist

Details

Publication Date: 2019-12-18

Description: Background Transposable elements (TEs) constitute the most common repeated sequences in eukaryotic genomes. Recent studies demonstrated their deep impact on species diversity, adaptation to the environment and diseases. Although there are many conventional bioinformatics algorithms for detecting and classifying TEs, none have achieved reliable results on different types of TEs. Machine learning (ML) techniques can automatically extract hidden patterns and novel information from labeled or non-labeled data and have been applied to solving several scientific problems. Methodology We followed the Systematic Literature Review (SLR) process, applying the six stages of the review protocol from it, but added a previous stage, which aims to detect the need for a review. Then search equations were formulated and executed in several literature databases. Relevant publications were scanned and used to extract evidence to answer research questions. Results Several ML approaches have already been tested on other bioinformatics problems with promising results, yet there are few algorithms and architectures available in literature focused specifically on TEs, despite representing the majority of the nuclear DNA of many organisms. Only 35 articles were found and categorized as relevant in TE or related fields. Conclusions ML is a powerful tool that can be used to address many problems. Although ML techniques have been used widely in other biological tasks, their utilization in TE analyses is still limited. Following the SLR, it was possible to notice that the use of ML for TE analyses (detection and classification) is an open problem, and this new field of research is growing in interest.

Electronic ISSN: 2167-8359

Topics: Biology , Medicine

Published by PeerJ

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

PAPER CURRENT

S·F·X

Fulltext

8

Unknown

Strategy to improve the accuracy of convolutional neural network architectures applied to digital image steganalysis in the spatial domain (2021)

Tabares-Soto, Reinel ; Arteaga-Arteaga, Harold Brayan ; Mora-Rubio, Alejandro ; [et al.]

PeerJ

In: PeerJ Computer Science. 2021; 7: e451. e451. Published 2021 Apr 09. doi: 10.7717/peerj-cs.451.

add to mindlist on the mindlist

Details

Publication Date: 2021-04-09

Description: In recent years, Deep Learning techniques applied to steganalysis have surpassed the traditional two-stage approach by unifying feature extraction and classification in a single model, the Convolutional Neural Network (CNN). Several CNN architectures have been proposed to solve this task, improving steganographic images’ detection accuracy, but it is unclear which computational elements are relevant. Here we present a strategy to improve accuracy, convergence, and stability during training. The strategy involves a preprocessing stage with Spatial Rich Models filters, Spatial Dropout, Absolute Value layer, and Batch Normalization. Using the strategy improves the performance of three steganalysis CNNs and two image classification CNNs by enhancing the accuracy from 2% up to 10% while reducing the training time to less than 6 h and improving the networks’ stability.

Electronic ISSN: 2376-5992

Topics: Computer Science

Published by PeerJ

Permalink

	Location	Call Number	Expected	Availability

Others were also interested in ...

PAPER CURRENT

S·F·X

Fulltext