ALBERT

All Library Books, journals and Electronic Records Telegrafenberg

  • 1
    Publication Date: 2024-02-01
    Description: The Distributed System of Scientific Collections (DiSSCo) is a pan-European Research Infrastructure (RI) initiative. DiSSCo aims to bring together natural science collections from 175 museums, botanical gardens, universities and research institutes across 23 countries in a distributed infrastructure that makes these collections physically and digitally open and accessible for all forms of research and innovation. DiSSCo RI entered the ESFRI roadmap in 2018 and successfully concluded its Preparatory Phase in early 2023. The RI is now transitioning towards the constitution of its legal entity (an ERIC) and the start of its scaled-up construction (implementation) programme. This publication is an abridged version of the successful grant proposal for the DiSSCo Transition Project which has the goal of ensuring the seamless transition of the DiSSCo RI from its Preparatory Phase to the Construction Phase (expected to start in 2025). In this transition period, the Project will address five objectives building on the outcomes of the Preparatory Phase project:
      1) Advance the DiSSCo ERIC process and complete its policy framework, ensuring the smooth early-phase Implementation of DiSSCo;
      2) Engage & support DiSSCo National Nodes to strengthen national commitments;
      3) Advance the development of core e-services to avoid the accumulation of technical debt before the start of the Implementation Phase;
      4) Continue international collaboration on standards & best practices needed for the DiSSCo service provision; and
      5) Continue supporting DiSSCo RI interim governance bodies and transition them to the DiSSCo ERIC formal governance.
    The Project's impact will be measured against the increase in the RI's overall Implementation Readiness Level (IRL). More specifically, we will monitor its impact towards reaching the required level of maturity in four of the five dimensions of the IRL that can benefit from further developments. These include the organisational, financial, technological and data readiness levels.
    Keywords: natural science collections ; natural history collections ; research infrastructure ; global ; natural science ; digitisation ; data standards ; Distributed System of Scientific Collections ; DiSSCo ; Digital Specimen Architecture ; FAIR Data Ecosystem ; FAIR digital objects
    Repository Name: National Museum of Natural History, Netherlands
    Type: info:eu-repo/semantics/article
    Format: application/pdf
  • 2
    Publication Date: 2024-01-12
    Description: We describe an effective approach to automated text digitisation with respect to natural history specimen labels. These labels contain much useful data about the specimen including its collector, country of origin, and collection date. Our approach to automatically extracting these data takes the form of a pipeline. Recommendations are made for the pipeline's component parts based on state-of-the-art technologies.
    Optical Character Recognition (OCR) can be used to digitise text on images of specimens. However, recognising text quickly and accurately from these images can be a challenge for OCR. We show that OCR performance can be improved by prior segmentation of specimen images into their component parts. This ensures that only text-bearing labels are submitted for OCR processing as opposed to whole specimen images, which inevitably contain non-textual information that may lead to false positive readings. In our testing Tesseract OCR version 4.0.0 offers promising text recognition accuracy with segmented images.
    Not all the text on specimen labels is printed. Handwritten text varies much more and does not conform to standard shapes and sizes of individual characters, which poses an additional challenge for OCR. Recently, deep learning has allowed for significant advances in this area. Google's Cloud Vision, which is based on deep learning, is trained on large-scale datasets, and is shown to be quite adept at this task. This may take us some way towards negating the need for humans to routinely transcribe handwritten text.
    Determining the countries and collectors of specimens has been the goal of previous automated text digitisation research activities. Our approach also focuses on these two pieces of information. An area of Natural Language Processing (NLP) known as Named Entity Recognition (NER) has matured enough to semi-automate this task. Our experiments demonstrated that existing approaches can accurately recognise location and person names within the text extracted from segmented images via Tesseract version 4.0.0.
    We have highlighted the main recommendations for potential pipeline components. The paper also provides guidance on selecting appropriate software solutions. These include automatic language identification, terminology extraction, and integrating all pipeline components into a scientific workflow to automate the overall digitisation process.
    Keywords: automated text digitisation ; natural language processing ; named entity recognition ; optical character recognition ; handwritten text recognition ; language identification ; terminology extraction ; scientific workflows ; natural history specimens ; label data
    Repository Name: National Museum of Natural History, Netherlands
    Type: info:eu-repo/semantics/article
    Format: application/pdf