
Strings & Structures: Codes of Sense and Function in Genomics and Linguistics

Examination of function and meaning as they depend on structuring patterns and clusters of elementary textual units


Prof. Dr. Jürgen Rolshoven (PI)
Institute of Linguistics, UoC
Albertus-Magnus-Platz, 50923 Köln

Prof. Dr. Thomas Wiehe (PI)
Institute for Genetics, UoC
Zülpicher Straße 47a, 50674 Köln

Funded by:

German Excellence Initiative

Project Duration:

01/2015 - 04/2017


Marcel Boeing, David Neugebauer, Todor Todorov

Workshop Website:


Project Description:

Bioinformatics and Computational Linguistics are two specialized computer science disciplines, both well established at the University of Cologne. On an international scale, a tradition of close connections, mutual interest and mutual benefit between computational geneticists and linguists reaches back to the 1970s. At our university, however, the potential synergies between the two fields have yet to be tapped (and revived). Walter Doerfler, a Cologne geneticist, wrote in 1982: "Genetics and linguistics share common principles and may have much to learn from each other. Research in either field might be profitably pursued with the idea in mind that DNA sequence and language may be just different expressions of related principles". Today, more than thirty years later, knowledge in both disciplines has grown immensely, and an unprecedented body of digitized data, new and highly efficient string processing algorithms, and computational power are available to address questions that go beyond the primarily syntactic ones posed in the 1980s. Semantic data mining and context analysis have now become key topics in both fields.

Context analysis seeks not only to identify coding units ('words') and to catalogue their occurrences, but also to classify them according to their respective textual contexts. Probabilistic clustering methods play an important role in accomplishing this task. Furthermore, hierarchical clustering methods help to identify structural dependencies (parse trees) and evolutionary genealogies (phylogenetic trees). By decoding these types of trees we aim to uncover possible semantics that may lie hidden in linguistic as well as genetic texts. The ultimate common goal of this project is to discover function and meaning as they depend on structuring patterns and clusters of elementary textual units. In particular, we will

  • reconstruct the duplication history of a large immune system gene family;
  • guided by the genealogical tree, decode grammatical rules relating to function;
  • generate patterns and parse trees from large natural language text corpora; and
  • apply vector analysis to uncover function and semantics.
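
As an illustration of what classifying units by their textual contexts can look like, the following minimal sketch builds a context vector for each token (counts of its neighbours within a small window) and compares two tokens by cosine similarity. It is not the project's actual pipeline: the function names `context_vectors` and `cosine`, the window size and the toy corpus are all assumptions made for this example. The same idea could in principle be applied to short DNA substrings instead of words.

```python
from collections import Counter
from math import sqrt


def context_vectors(tokens, window=1):
    """Map each token to a Counter of the tokens seen within `window` positions of it."""
    vectors = {}
    for i, tok in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        ctx = vectors.setdefault(tok, Counter())
        for j in range(lo, hi):
            if j != i:
                ctx[tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus, chosen only to make the effect visible.
text = "the cat sleeps . the dog sleeps . the cat eats . the dog eats .".split()
vecs = context_vectors(text)

# "cat" and "dog" occur in identical contexts, so they score higher
# than "cat" and "sleeps", which share no neighbours in this corpus.
sim_cat_dog = cosine(vecs["cat"], vecs["dog"])
sim_cat_sleeps = cosine(vecs["cat"], vecs["sleeps"])
print(sim_cat_dog > sim_cat_sleeps)
```

Tokens that fill the same slots in a text end up with similar vectors, which is the basis on which clustering methods can then group them.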

Both PIs use computational methods to transform text data into tree structures from which function and meaning are extracted. Because the sources and structures are similar, the methods and results of this project also offer common ground for other natural and social sciences. Possible neighboring fields are Genetics, Evolution, Immunology, Business Informatics, Linguistics, Classical Studies and Social Media Sciences.
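
To make the step from text data to a tree structure concrete, here is a small sketch of agglomerative (bottom-up) hierarchical clustering with average linkage. It is an assumption-laden toy, not the method used in the project: the Hamming distance, the helper names (`hamming`, `leaves`, `avg_dist`, `cluster`) and the four example sequences are all invented for illustration, and the sequences must have equal length.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def leaves(tree):
    """Flatten a nested-tuple tree into the list of its leaf strings."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def avg_dist(t1, t2):
    """Average pairwise Hamming distance between the leaves of two subtrees."""
    pairs = [(a, b) for a in leaves(t1) for b in leaves(t2)]
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


def cluster(seqs):
    """Repeatedly merge the two closest subtrees until one binary tree remains."""
    trees = list(seqs)
    while len(trees) > 1:
        a, b = min(combinations(trees, 2), key=lambda p: avg_dist(*p))
        trees.remove(a)
        trees.remove(b)
        trees.append((a, b))
    return trees[0]


# Hypothetical toy sequences: two pairs that differ in only one position each.
toy = ["ACGTAC", "ACGTAA", "TTGCAC", "TTGCAA"]
tree = cluster(toy)
print(tree)
```

The nested tuples play the role of a tree: each pair is an inner node and each string a leaf. With a different distance function, the same scheme could group words or phrases rather than sequences.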

The Cologne Center for eHumanities (CCeH) and the Cologne Center of Language Sciences (CCLS) are strategic partners within the University. The interdisciplinary approach outlined here fits very well with the Emerging Group Dynamic Structuring in Language and Communication (DSLC) of the Faculty of Philosophy. In addition, this research project will have strong connections to the Competence Area CA III.