Text Analysis

We have extensive experience with publishing different kinds of digital scholarly text collections, including literary editions, correspondence catalogues, historical manuscripts, and linguistic collections. We usually supplement the collections we publish with descriptive metadata, page images, and other enrichments such as links to structured data and user annotations.

We offer researchers different means to carry out complex searches — for example by using 'fuzzy' text patterns, finding semantically related passages, or filtering texts on the basis of structured data such as the persons or places mentioned in a text. We can also provide the necessary know-how and software to allow project teams to edit and visualise their text collections, to publish them online, and to deposit them in a certified digital repository to ensure their long-term preservation and access.
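The idea behind 'fuzzy' text-pattern search can be illustrated with a small sketch. The word list and query below are invented examples of historical spelling variants, and Python's standard `difflib` is used here as a stand-in for the project's own search tooling, not as the actual implementation:

```python
from difflib import get_close_matches

# Hypothetical word list containing historical spelling variants,
# such as might occur in early modern Dutch sources
word_list = ["Amsteldam", "Amstelredam", "Rotterdam", "Antwerpen", "Amsterdamme"]

# Query with the modern spelling; the cutoff controls how strict
# the approximate match must be (1.0 = identical)
matches = get_close_matches("Amsterdam", word_list, n=3, cutoff=0.8)
print(matches)
```

Historical variants such as "Amsteldam" and "Amstelredam" are found even though they differ from the modern query spelling, while unrelated place names fall below the similarity cutoff.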

In all our projects based on enriched texts, we try to strictly separate raw text from enrichments. Texts, in all their variations and versions, are stored and made available in our open-source text repository. Enrichments find their place in our annotation repository in the form of standardised Web Annotations. Any text fragment in the text repository is thus directly retrievable or annotatable online, independent of the original text format. Applications such as a web frontend or an editor can then make use of the APIs of these two systems. This is how we build our own generic web environment for visualising and searching digital text editions. Our APIs are also directly usable by anyone who wants to query our collection data or build their own applications.
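The separation described above means an enrichment is just a standalone annotation that points at a text fragment by reference. A minimal sketch of such a standardised Web Annotation, following the W3C Web Annotation Data Model, might look as follows; the target URL, character offsets, and annotation text are hypothetical examples, not actual repository data:

```python
import json

# A minimal W3C Web Annotation per the Web Annotation Data Model.
# All identifiers and values here are hypothetical illustrations.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "body": {
        # The enrichment itself: a free-text comment on the fragment
        "type": "TextualBody",
        "value": "Mention of the States General",
        "format": "text/plain",
    },
    "target": {
        # Hypothetical URL of a text version in the text repository
        "source": "https://example.org/textrepo/versions/1234/contents",
        "selector": {
            # Character offsets into the plain text, per the W3C model,
            # so the annotation is independent of the text's format
            "type": "TextPositionSelector",
            "start": 120,
            "end": 148,
        },
    },
}

print(json.dumps(annotation, indent=2))
```

Because the annotation addresses the text only through a URL and character offsets, it can live in a separate annotation store while the raw text stays untouched in the text repository.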


Hennie Brugman, Lead Developer for Team Text (ResearchGate, LinkedIn, Pure)

Related Research Projects

  • Nederlab (Meertens Institute) is an online portal for historical research on Dutch language, literature and culture. On the site, researchers can search, view, and analyse millions of Dutch texts.
  • Republic (Huygens Institute) is an acronym for REsolutions PUBLished In a Computational environment. The goal of the project is to make all of the manuscript and printed resolutions of the Dutch States General (1576-1796) freely available online as full texts and page images.
  • Globalise (Huygens Institute). The NWO Groot-funded Globalise project will develop an online infrastructure that unlocks the key series of VOC reports (c. 4.7M pages) with advanced research methods. The project uses our text repository infrastructure as a hub to synchronise the enrichment and curation of the historical text transcriptions.
  • CLARIAH Plus. We are making several contributions to this national infrastructure, in particular with respect to NLP tools and formats (LaMachine, FoLiA) and software for creating, publishing and sharing annotations of online collections.

Software and Data

  • Text Repository is a backend repository system to store and share text corpora with metadata and versions.
  • LaMachine is a unified software distribution for Natural Language Processing. It integrates numerous open-source NLP tools, programming libraries, web services and web applications into a single virtual research environment that can be installed on a wide variety of machines.
  • analiticcl is a system for spelling correction, normalisation, and post-OCR correction.
  • TextAnnoViz is a flexible and customisable web application for searching and visualising digital (scholarly) text editions.
  • AnnoRepo is our repository for storing and providing W3C Web Annotations. AnnoRepo adheres to W3C standards and also offers additional search capabilities.
  • Dexter is a web application we developed as part of CLARIAH Plus. Researchers can use Dexter to autonomously build, annotate, and share their own virtual research collections.
  • Text-Fabric is a tool to process text corpora plus (large) sets of annotations. It serves as a bridge between researchers and data scientists.

Publications and Presentations