ALBERT — All Library Books, journals and Electronic Records Telegrafenberg

1

Improved data sets and evaluation methods for the automatic prediction of DNA-binding proteins (2021)

Zaitzeff, Alexander ; Leiby, Nicholas ; Motta, Francis C ; [et al.]

Oxford University Press

In: Bioinformatics. 2021; Published 2021 Aug 20. doi: 10.1093/bioinformatics/btab603. [early online release]

Publication Date: 2021-08-20

Description: Motivation: Accurate automatic annotation of protein function relies on both innovative models and robust data sets. Due to their importance in biological processes, the identification of DNA-binding proteins directly from protein sequence has been the focus of many studies. However, the data sets used to train and evaluate these methods have suffered from substantial flaws. We describe some of the weaknesses of the data sets used in previous DNA-binding protein literature and provide several new data sets addressing these problems. We suggest new evaluative benchmark tasks that more realistically assess real-world performance for protein annotation models. We propose a simple new model for the prediction of DNA-binding proteins and compare its performance on the improved data sets to two previously published models. Additionally, we provide extensive tests showing how the best models predict across taxonomies. Results: Our new gradient boosting model, which uses features derived from a published protein language model, outperforms the earlier models. Perhaps surprisingly, so does a baseline nearest neighbor model using BLAST percent identity. We evaluate the sensitivity of these models to perturbations of DNA-binding regions and control regions of protein sequences. The successful data-driven models learn to focus on DNA-binding regions. When predicting across taxonomies, the best models are highly accurate across species in the same kingdom and can provide some information when predicting across kingdoms. Code and data availability: The data and results for this paper can be found at https://doi.org/10.5281/zenodo.5153906. The code for this paper can be found at https://doi.org/10.5281/zenodo.5153683. The code, data and results can also be found at https://github.com/AZaitzeff/tools_for_dna_binding_proteins.
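
The baseline mentioned in this abstract, a nearest neighbor model using BLAST percent identity, can be pictured with a minimal Python sketch. It is not the authors' code: the function name, the toy identifiers and the fallback class for queries without hits are all assumptions made for illustration, and the percent identities are assumed to have been parsed from BLAST output beforehand.

```python
from typing import Dict, List, Tuple


def predict_nearest_neighbor(
    hits: List[Tuple[str, float]],
    train_labels: Dict[str, int],
    default: int = 0,
) -> int:
    """Return the label of the training protein with the highest percent identity.

    `hits` holds (training_id, percent_identity) pairs for one query.
    `default` is an assumed fallback class for queries with no usable hit.
    """
    known = [(pident, tid) for tid, pident in hits if tid in train_labels]
    if not known:
        return default
    _, best_id = max(known)  # highest percent identity wins
    return train_labels[best_id]


# Toy data (hypothetical): 1 = DNA-binding, 0 = not DNA-binding.
train_labels = {"P1": 1, "P2": 0, "P3": 1}
query_hits = {
    "Q1": [("P1", 35.2), ("P2", 71.4)],
    "Q2": [("P3", 88.0)],
    "Q3": [],
}
predictions = {
    query: predict_nearest_neighbor(hits, train_labels)
    for query, hits in query_hits.items()
}
print(predictions)
```

In this toy run, Q1 takes the label of P2 because P2 has the higher percent identity, and Q3 falls back to the assumed default class.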

Print ISSN: 1367-4803

Electronic ISSN: 1460-2059

Topics: Biology , Computer Science , Medicine

Published by Oxford University Press

2

Detecting quantitative trait loci and exploring chromosomal pairing in autopolyploids using polyqtlR (2021)

Bourke, Peter M ; Voorrips, Roeland E ; Hackett, Christine A ; [et al.]

Oxford University Press

In: Bioinformatics. 2021; Published 2021 Aug 06. doi: 10.1093/bioinformatics/btab574. [early online release]

Publication Date: 2021-08-06

Description: Motivation: The investigation of quantitative trait loci (QTL) is an essential component in our understanding of how organisms vary phenotypically. However, many important crop species are polyploid (carrying more than two copies of each chromosome), requiring specialized tools for such analyses. Moreover, deciphering meiotic processes at higher ploidy levels is not straightforward, but is necessary to understand the reproductive dynamics of these species, or uncover potential barriers to their genetic improvement. Results: Here, we present polyqtlR, a novel software tool to facilitate such analyses in (auto)polyploid crops. It performs QTL interval mapping in F1 populations of outcrossing polyploids of any ploidy level using identity-by-descent probabilities. The allelic composition of discovered QTL can be explored, enabling favourable alleles to be identified and tracked in the population. Visualization tools within the package facilitate this process, and options to include genetic co-factors and experimental factors are included. Detailed analyses of polyploid meiosis can also be performed, including prediction of multivalent pairing structures, detection of preferential chromosomal pairing and location of double reduction events. Availability and implementation: polyqtlR is freely available from the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/package=polyqtlR. Supplementary information: Supplementary data are available at Bioinformatics online.

Print ISSN: 1367-4803

Electronic ISSN: 1460-2059

Topics: Biology , Computer Science , Medicine

Published by Oxford University Press

3

Comparative evaluation of shape retrieval methods on macromolecular surfaces: an application of computer vision methods in structural bioinformatics (2021)

Machat, Mohamed ; Langenfeld, Florent ; Craciun, Daniela ; [et al.]

Oxford University Press

In: Bioinformatics. 2021; Published 2021 Jul 11. doi: 10.1093/bioinformatics/btab511. [early online release]

Publication Date: 2021-07-11

Description: Motivation: The investigation of the structure of biological systems at the molecular level gives insights into their functions and dynamics. Shape and surface of biomolecules are fundamental to molecular recognition events. Characterizing their geometry can lead to more adequate predictions of their interactions. In the present work, we assess the performance of reference shape retrieval methods from the computer vision community on protein shapes. Results: Shape retrieval methods are efficient in identifying orthologous proteins and tracking large conformational changes. This work illustrates the interest of the protein surface shape as a higher-level representation of the protein structure that (i) abstracts the underlying protein sequence, structure or fold, (ii) allows the use of shape retrieval methods to screen large databases of protein structures to identify surficial homologs and possible interacting partners and (iii) opens an extension of the protein structure–function paradigm toward a protein structure-surface(s)-function paradigm. Availability and implementation: All data are available online at http://datasetmachat.drugdesign.fr. Supplementary information: Supplementary data are available at Bioinformatics online.

Print ISSN: 1367-4803

Electronic ISSN: 1460-2059

Topics: Biology , Computer Science , Medicine

Published by Oxford University Press

4

MMpred: a distance-assisted multimodal conformation sampling for de novo protein structure prediction (2021)

Zhao, Kai-Long ; Liu, Jun ; Zhou, Xiao-Gen ; [et al.]

Oxford University Press

In: Bioinformatics. 2021; Published 2021 Jun 29. doi: 10.1093/bioinformatics/btab484. [early online release]

Publication Date: 2021-06-29

Description: Motivation: The mathematically optimal solution in computational protein folding simulations does not always correspond to the native structure, due to the imperfection of the energy force fields. There is therefore a need to search for more diverse suboptimal solutions in order to identify the states close to the native. We propose a novel multimodal optimization protocol to improve the conformation sampling efficiency and modeling accuracy of de novo protein structure folding simulations. Results: A distance-assisted multimodal optimization sampling algorithm, MMpred, is proposed for de novo protein structure prediction. The protocol consists of three stages. The first is a modal exploration stage, in which a structural similarity evaluation model DMscore is designed to control the diversity of conformations, generating a population of diverse structures in different low-energy basins. The second is a modal maintaining stage, where an adaptive clustering algorithm MNDcluster is proposed to divide the populations and merge the modal by adjusting the annealing temperature to locate the promising basins. In the last stage of modal exploitation, a greedy search strategy is used to accelerate the convergence of the modal. Distance constraint information is used to construct the conformation scoring model to guide sampling. MMpred is tested on a large set of 320 non-redundant proteins, where MMpred obtains models with TM-score≥0.5 on 291 cases, which is 28% higher than that of Rosetta guided with the same set of distance constraints. In addition, on 320 benchmark proteins, the enhanced version of MMpred (E-MMpred) outperforms trRosetta on 167 targets when the best of five models is evaluated. The average TM-score of the best model of E-MMpred is 0.732, which is comparable to trRosetta (0.730). Availability and implementation: The source code and executable are freely available at https://github.com/iobio-zjut/MMpred. Supplementary information: Supplementary data are available at Bioinformatics online.

Print ISSN: 1367-4803

Electronic ISSN: 1460-2059

Topics: Biology , Computer Science , Medicine

Published by Oxford University Press

5

Co-evolutionary distance predictions contain flexibility information (2021)

Schwarz, Dominik ; Georges, Guy ; Kelm, Sebastian ; [et al.]

Oxford University Press

In: Bioinformatics. 2021; Published 2021 Aug 12. doi: 10.1093/bioinformatics/btab562. [early online release]

Publication Date: 2021-08-12

Description: Motivation: Co-evolution analysis can be used to accurately predict residue–residue contacts from multiple sequence alignments. The introduction of machine-learning techniques has enabled substantial improvements in precision and a shift from predicting binary contacts to predicting distances between pairs of residues. These developments have significantly improved the accuracy of de novo prediction of static protein structures. With AlphaFold2 lifting the accuracy of some predicted protein models close to experimental levels, structure prediction research will move on to other challenges. One of those areas is the prediction of more than one conformation of a protein. Here, we examine the potential of residue–residue distance predictions to be informative of protein flexibility rather than simply static structure. Results: We used DMPfold to predict distance distributions for every residue pair in a set of proteins that showed both rigid and flexible behaviour. Residue pairs that were in contact in at least one reference structure were classified as rigid, flexible or neither. The predicted distance distribution of each residue pair was analysed for local maxima of probability indicating the most likely distance or distances between a pair of residues. We found that rigid residue pairs tended to have only a single local maximum in their predicted distance distributions while flexible residue pairs more often had multiple local maxima. These results suggest that the shape of predicted distance distributions contains information on the rigidity or flexibility of a protein and its constituent residues. Supplementary information: Supplementary data are available at Bioinformatics online.
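
The analysis of local maxima described in this abstract can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the distribution is taken to be a plain list of per-bin probabilities, the `min_prob` cut-off is a hypothetical parameter, and the example values are made up rather than taken from DMPfold output.

```python
from typing import List, Sequence


def local_maxima(probs: Sequence[float], min_prob: float = 0.0) -> List[int]:
    """Return indices of bins that are strictly higher than both neighbours.

    A missing neighbour at either end of the distribution counts as lower.
    `min_prob` is an assumed threshold for ignoring very small peaks.
    """
    peaks = []
    for i, p in enumerate(probs):
        left = probs[i - 1] if i > 0 else float("-inf")
        right = probs[i + 1] if i < len(probs) - 1 else float("-inf")
        if p > left and p > right and p >= min_prob:
            peaks.append(i)
    return peaks


# Made-up distance distributions over a handful of distance bins.
unimodal = [0.05, 0.10, 0.60, 0.20, 0.05]        # one preferred distance
bimodal = [0.05, 0.35, 0.10, 0.05, 0.40, 0.05]   # two preferred distances

print(local_maxima(unimodal))  # [2]
print(local_maxima(bimodal))   # [1, 4]
```

Under the abstract's finding, a residue pair with a single peak would tend to be a rigid pair, while one with several peaks would more often be a flexible pair.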

Print ISSN: 1367-4803

Electronic ISSN: 1460-2059

Topics: Biology , Computer Science , Medicine

Published by Oxford University Press

6

DAMA: a method for computing multiple alignments of protein structures using local structure descriptors (2021)

Daniluk, Paweł ; Oleniecki, Tymoteusz ; Lesyng, Bogdan

Oxford University Press

In: Bioinformatics. 2021; Published 2021 Aug 16. doi: 10.1093/bioinformatics/btab571. [early online release]

Publication Date: 2021-08-16

Description: Motivation: The well-known fact that protein structures are more conserved than their sequences forms the basis of several areas of computational structural biology. Methods based on the structure analysis provide more complete information on residue conservation in evolutionary processes. This is crucial for the determination of evolutionary relationships between proteins and for the identification of recurrent structural patterns present in biomolecules involved in similar functions. However, algorithmic structural alignment is much more difficult than multiple sequence alignment. This study is devoted to the development and applications of DAMA, a novel and effective environment capable of computing and analyzing multiple structure alignments. Results: DAMA is based on local structural similarities, using local 3D structure descriptors, and thus accounts for nearest-neighbor molecular environments of aligned residues. It is constrained neither by protein topology nor by its global structure. DAMA is an extension of our previous study (DEDAL) which demonstrated the applicability of local descriptors to pairwise alignment problems. Since the multiple alignment problem is NP-complete, an effective heuristic approach has been developed without imposing any artificial constraints. The alignment algorithm searches for the largest, consistent ensemble of similar descriptors. The new method is capable of capturing most of the biologically significant similarities present in canonical test sets and is discriminatory enough to prevent the emergence of larger, but meaningless, solutions. Tests performed on the test sets, including protein kinases, demonstrate DAMA’s capability of identifying equivalent residues, which should be very useful in discovering the biological nature of protein similarity. Performance profiles show the advantage of DAMA over other methods, in particular when using a strict similarity measure QC, which is the ratio of correctly aligned columns, and when applying the methods to more difficult cases. Availability and implementation: DAMA is available online at http://dworkowa.imdik.pan.pl/EP/DAMA. Linux binaries of the software are available upon request. Supplementary information: Supplementary data are available at Bioinformatics online.

Print ISSN: 1367-4803

Electronic ISSN: 1460-2059

Topics: Biology , Computer Science , Medicine

Published by Oxford University Press

7

OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches (2021)

Rossier, Victor ; Vesztrocy, Alex Warwick ; Robinson-Rechavi, Marc ; [et al.]

Oxford University Press

In: Bioinformatics. 2021; Published 2021 Mar 31. doi: 10.1093/bioinformatics/btab219. [early online release]

Publication Date: 2021-03-31

Description: Motivation: Assigning new sequences to known protein families and subfamilies is a prerequisite for many functional, comparative and evolutionary genomics analyses. Such assignment is commonly achieved by looking for the closest sequence in a reference database, using a method such as BLAST. However, ignoring the gene phylogeny can be misleading because a query sequence does not necessarily belong to the same subfamily as its closest sequence. For example, a hemoglobin which branched out prior to the hemoglobin alpha/beta duplication could be closest to a hemoglobin alpha or beta sequence, whereas it is neither. To overcome this problem, phylogeny-driven tools have emerged but rely on gene trees, whose inference is computationally expensive. Results: Here, we first show that in multiple animal and plant datasets, 18 to 62% of assignments by closest sequence are misassigned, typically to an over-specific subfamily. Then, we introduce OMAmer, a novel alignment-free protein subfamily assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. OMAmer is based on an innovative method using evolutionarily-informed k-mers for alignment-free mapping to ancestral protein subfamilies. We show that OMAmer, whilst able to reject non-homologous family-level assignments, provides better and quicker subfamily-level assignments than approaches relying on the closest sequence, whether inferred exactly by Smith-Waterman or by the fast heuristic DIAMOND. Availability: OMAmer is available from the Python Package Index (as omamer), with the source code and a precomputed database available at https://github.com/DessimozLab/omamer. Supplementary information: Supplementary data are available at Bioinformatics online.
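
To give a feel for what alignment-free, k-mer based matching means in general, the following Python sketch assigns a query to the family that shares the most k-mers with it. This is deliberately much simpler than OMAmer and is not its algorithm or API: it ignores phylogeny entirely, and the function names, family names and sequences are invented for illustration.

```python
from typing import Dict, Set


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_family(query: str, families: Dict[str, Set[str]], k: int = 3) -> str:
    """Pick the family whose k-mer set shares the most k-mers with the query.

    Ties are broken alphabetically by family name so the result is deterministic.
    """
    query_kmers = kmers(query, k)
    return max(sorted(families), key=lambda name: len(query_kmers & families[name]))


# Invented toy reference: one k-mer set per (hypothetical) family.
families = {
    "famA": kmers("MKTAYIAKQR"),
    "famB": kmers("GSHMLEDPVQ"),
}

assigned = best_family("KTAYIAK", families)
print(assigned)  # famA
```

No sequence alignment is computed at any point; the comparison is a set intersection over short substrings, which is what makes approaches of this kind fast on large databases.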

Print ISSN: 1367-4803

Electronic ISSN: 1460-2059

Topics: Biology , Computer Science , Medicine

Published by Oxford University Press

8

VCFShark: how to squeeze a VCF file (2021)

Deorowicz, Sebastian ; Danek, Agnieszka ; Kokot, Marek

Oxford University Press

In: Bioinformatics. 2021; Published 2021 Mar 31. doi: 10.1093/bioinformatics/btab211. [early online release]

Publication Date: 2021-03-31

Description: Summary: VCF files with results of sequencing projects take a lot of space. We propose VCFShark, which is able to compress VCF files up to an order of magnitude better than the de facto standards (gzipped VCF and BCF). The advantage over competitors is the greatest when compressing VCF files containing large amounts of genotype data. Processing speeds of up to 100 MB/s and main memory requirements below 30 GB allow the tool to be used on typical workstations even for large datasets. Availability and implementation: https://github.com/refresh-bio/vcfshark. Supplementary information: Supplementary data are available at publisher’s Web site.

Print ISSN: 1367-4803

Electronic ISSN: 1460-2059

Topics: Biology , Computer Science , Medicine

Published by Oxford University Press

9

ShinyCell: Simple and sharable visualisation of single-cell gene expression data (2021)

Ouyang, John F ; Kamaraj, Uma S ; Cao, Elaine Y ; [et al.]

Oxford University Press

In: Bioinformatics. 2021; Published 2021 Mar 28. doi: 10.1093/bioinformatics/btab209. [early online release]

Publication Date: 2021-03-28

Description: Motivation: As the generation of complex single-cell RNA sequencing datasets becomes more commonplace, it is the responsibility of researchers to provide access to these data in a way that can be easily explored and shared. Whilst data are often deposited for future bioinformatic analysis, many studies do not release their data in a way that is easy to explore by non-computational researchers. Results: In order to help address this, we have developed ShinyCell, an R package that converts single-cell RNA sequencing datasets into explorable and shareable interactive interfaces. These interfaces can be easily customised in order to maximise their usability and can be easily uploaded to online platforms to facilitate wider access to published data. Availability: ShinyCell is available at https://github.com/SGDDNB/ShinyCell and https://figshare.com/projects/ShinyCell/100439.

Print ISSN: 1367-4803

Electronic ISSN: 1460-2059

Topics: Biology , Computer Science , Medicine

Published by Oxford University Press

10

L2,1-norm regularized multivariate regression model with applications to genomic prediction (2021)

Mbebi, Alain J ; Tong, Hao ; Nikoloski, Zoran

Oxford University Press

In: Bioinformatics. 2021; Published 2021 Mar 28. doi: 10.1093/bioinformatics/btab212. [early online release]

Publication Date: 2021-03-28

Description: Motivation: Genomic selection (GS) is currently deemed the most effective approach to speed up breeding of agricultural varieties. It has been recognized that consideration of multiple traits in GS can improve accuracy of prediction for traits of low heritability. However, since GS forgoes statistical testing with the idea of improving predictions, it does not facilitate mechanistic understanding of the contribution of particular single nucleotide polymorphisms (SNP). Results: Here we propose an L2,1-norm regularized multivariate regression model and devise a fast and efficient iterative optimization algorithm, called L2,1-joint, applicable in multi-trait GS. The usage of the L2,1-norm facilitates variable selection in a penalized multivariate regression that considers the relation between individuals, when the number of SNPs is much larger than the number of individuals. The capacity for variable selection allows us to define master regulators that can be used in a multi-trait GS setting to dissect the genetic architecture of the analyzed traits. Our comparative analyses demonstrate that the proposed model is a favorable candidate compared to existing state-of-the-art approaches. Prediction and variable selection with data sets from Brassica napus, wheat and Arabidopsis thaliana diversity panels are conducted to further showcase the performance of the proposed model. Availability and implementation: The model is implemented in the R programming language and the code is freely available from https://github.com/alainmbebi/L21-norm-GS. Supplementary information: Supplementary data are available at Bioinformatics online.
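
For readers unfamiliar with the L2,1-norm named in this abstract, the Python sketch below computes it for a small coefficient matrix using the common row-wise definition: the sum, over rows, of each row's Euclidean norm. The row-wise convention, the reading of rows as SNPs and columns as traits, and the example numbers are assumptions for illustration; the sketch shows only the norm, not the authors' L2,1-joint algorithm (which is implemented in R).

```python
import math
from typing import Sequence


def l21_norm(matrix: Sequence[Sequence[float]]) -> float:
    """Sum of the Euclidean (L2) norms of the rows of `matrix`."""
    return sum(math.sqrt(sum(value * value for value in row)) for row in matrix)


# Hypothetical coefficient matrix: one row per SNP, one column per trait.
B = [
    [3.0, 4.0],    # row norm 5
    [0.0, 0.0],    # row norm 0: this SNP contributes to no trait
    [5.0, 12.0],   # row norm 13
]

print(l21_norm(B))  # 18.0
```

Because the penalty adds up whole-row norms, shrinking it tends to drive entire rows to zero at once, which is why a penalty of this form can act as variable selection across all traits jointly.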

Print ISSN: 1367-4803

Electronic ISSN: 1460-2059

Topics: Biology , Computer Science , Medicine

Published by Oxford University Press
