Digital Linguistics: An Interview with Piotr Bański, Ulrich Heid and Laura Herzberg
Nowadays, as huge swathes of our linguistic lives are recorded, measured and analyzed digitally, the question is no longer simply how much language data we have, but how well it is organized.
Digital linguistics may sound technical, but at its heart it asks a simple question: in an unpredictable digital world, how can we make language data useful and reliable? In Harmonizing language data: Standards for linguistic resources, Piotr Bański, Ulrich Heid, and Laura Herzberg bring together a wide-ranging set of standards and best practices for working with texts, lexicons, annotations, metadata, and archives. Their new open-access volume in De Gruyter’s Digital Linguistics series shows why interoperability, transparency and long-term sustainability matter not only for specialists, but for anyone who cares about preserving language as a scientific and cultural resource.
In the following interview, the editors explain what it means to “harmonize” language data, why poor data management can make resources effectively disappear, and how infrastructures, standards, and even large language models are reshaping the field.
Alexandra Hinz: What does it mean to “harmonize” language data and why is it so important for preserving language in the digital world?
Piotr Bański, Ulrich Heid & Laura Herzberg: In the context of the book, “harmonizing” language data means creating and encoding linguistic resources in ways that are compatible and interoperable across projects, tools, and institutions. Harmonization does not imply uniformity, but rather the use of shared standards and formats that allow different datasets to be combined, queried, and preserved together. This is crucial for digital language preservation, because linguistic data is only sustainable if it remains interpretable over time.
Without harmonization, there is a risk of data becoming inaccessible due to undocumented formats or missing metadata. Standards, together with community-driven best practices, ensure transparency, reproducibility, and long-term accessibility. Harmonization therefore directly supports the FAIR principles and enables language data to function as a durable cultural and scientific resource in the digital age.
AH: Can you give us a real-life example of what can happen when language data isn’t properly stored or organized?
“Linguistic data is only sustainable if it remains interpretable over time […] without harmonization, there is a risk of data becoming inaccessible.”
PB, UH & LH: Data storage in formats and locations that do not follow standards or documented guidelines will inevitably lead to data loss. If we imagine a successor in our project or research group finding a USB stick labeled “corpus,” the immediate question will be: what exactly is this? Without information about encoding, structure, annotation principles, or even basic metadata, interpretation becomes guesswork.
Storage in an outdated or undeclared character encoding yields sequences of symbols that are hard to decipher, especially when there is no indication of which encoding was used. Annotations must be interpretable, with guidelines explaining what they mean and under which conditions they apply. And if the text has been broken into unpredictable pieces, it becomes impossible even to reconstruct it.
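The encoding problem the editors describe is easy to reproduce. The following minimal Python sketch (our illustration, not from the book) shows what happens when UTF-8 bytes are read by a tool that assumes a legacy Windows-1252 encoding:

```python
# A Polish surname stored as UTF-8 bytes...
original = "Bański"
raw = original.encode("utf-8")

# ...but opened by a tool that assumes Windows-1252 ("cp1252"):
garbled = raw.decode("cp1252")
print(garbled)  # BaÅ„ski
```

Without metadata recording that the file is UTF-8, a future user sees only the mojibake and must guess which of dozens of legacy encodings produced it.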
By the way, we use this latter device deliberately when creating what are called "derived text formats": reorganizations of text data that make it impossible to reconstruct the original text, e.g. to allow researchers to work with copyrighted material without infringing copyright.
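As a toy illustration of the idea (the derived text formats used in practice are more sophisticated than this sketch, and the function below is our own invention), one can shuffle tokens within fixed-size windows: token frequencies and window-level co-occurrence statistics survive for corpus analysis, while the running text can no longer be reconstructed:

```python
import random

def derived_text_format(text, window=20, seed=0):
    """Toy 'derived text format': shuffle tokens within fixed-size
    windows. Word frequencies and rough local co-occurrence remain
    analyzable, but the original word order is destroyed."""
    rng = random.Random(seed)
    tokens = text.split()
    out = []
    for i in range(0, len(tokens), window):
        chunk = tokens[i:i + window]
        rng.shuffle(chunk)   # in-place scramble of this window
        out.extend(chunk)
    return " ".join(out)
```

Because only the order is perturbed, frequency-based methods (keyword extraction, collocation counts within a window) still work on the derived data.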
AH: What motivated you to initiate and edit a book on harmonizing language data, and were there any insights or debates during the process that particularly stuck with you?
PB, UH & LH: We now have large bodies of corpus data, both spoken and written, and researchers continue to produce new datasets at a rapid pace. Yet, often at the end of a project or a PhD thesis, it is not clear how these data can be preserved or made accessible to others. At the same time, Digital Linguistics is constantly evolving, and multiple proposals for good practices and standards are emerging.
We wanted to bring together experts who are deeply familiar with these developments to provide the field with a comprehensive set of best practices covering the entire lifecycle of language resources, a kind of guide or common thread. Collecting these recommendations in a single, accessible volume seemed both timely and useful. Our goal was to offer guidance that helps researchers preserve, share, and enhance data effectively, ensuring that their work remains interpretable, reusable, and valuable for the wider scientific community.
“Digital Linguistics is constantly evolving, and multiple proposals for good practices and standards are emerging.”
AH: Could you share a few examples of projects in Digital Linguistics that you find particularly exciting or innovative?
PB, UH & LH: People may say that infrastructure is rarely exciting or full of surprises, but it is a necessary part of a research environment. Infrastructural projects, too, can be highly innovative. The European research infrastructure CLARIN ERIC exemplifies this approach by supporting researchers across countries in collecting, curating, documenting, and sharing language resources. In Germany, Text+, operating within the German National Research Data Infrastructure (NFDI), develops sustainable services and standards for text- and language-based data.
These initiatives may appear infrastructural rather than spectacular, yet their innovation lies precisely in enabling reuse and longevity. Individual projects rarely cover the entire data lifecycle; infrastructures provide continuity, governance, and technical stability. By pooling expertise and aligning practices, they transform scattered datasets into a coherent ecosystem that empowers researchers to ask more ambitious and interdisciplinary questions.
AH: How do you see large language models (LLMs), like ChatGPT, and other fast-moving technologies shaping the field of Digital Linguistics?
PB, UH & LH: The rapid rise of LLMs such as ChatGPT has undeniably changed the landscape. We see these systems as powerful tools, particularly for generating fluent text, yet also as reminders of how crucial structured, well-documented data remains. Much of Digital Linguistics involves classification, annotation, and the extraction of interpretable knowledge from texts. For such tasks, transparency and reproducibility are essential.
Fast-moving AI can accelerate exploration and multilingual processing, especially for well-resourced languages, but without harmonized datasets and clear standards, its outputs risk becoming opaque or difficult to verify. Rather than replacing human-curated language resources, LLMs make them even more crucial. These tools emphasize the need for high-quality, interoperable corpora and for expert oversight.
“Future researchers will spend less time reconstructing infrastructure and more time addressing linguistic, cultural, and societal questions.”
AH: Looking ahead, what excites you most about the future of language in the digital age?
PB, UH & LH: Looking ahead, we are optimistic about both language and linguistics in the digital age. Languages will continue to evolve, as they always have, shaped by intensified global exchange and digital communication. For our field, the real promise lies in building durable, harmonized ecosystems of data, tools, and standards. If we succeed, future researchers will spend less time reconstructing infrastructure and more time addressing linguistic, cultural, and societal questions.
We are focused on better documenting under-described languages and on expanding resources for spoken and multimodal communication. We also see great potential in using language data to study historical societies, human behavior in collaboration with psychologists or sociologists, and to organize specialized knowledge in administrations, companies, or broader societal contexts. Structured and well-documented language resources can support research, decision-making, and cultural preservation in multiple domains.
[Title Image by sotopiko/iStock/Getty Images]
