Seminar in Computational Linguistics

  • Date: –15:00
  • Location: Engelska parken 9-3042
  • Lecturer: Harald Hammarström
  • Contact person: Gongbo Tang
  • Seminar

How Well-Described is a Language? Multilingual Measuring of Grammatical Descriptions

Of the ca 7 000 languages on the planet, some are described
in great detail (e.g., Swedish, English), some in a lesser number of
shorter publications (e.g., Betoi --- an extinct language of
Venezuela), and others hardly at all (e.g., Mor --- a minority
language in Papua, Indonesia). At the same time, many languages are
endangered (see, e.g., Moseley 2010). To best prioritize language
documentation, it is important to know the extent of existing
documentation for every language (Hauk & Heaton 2018). Sufficiently
extensive bibliographies for minority languages are collected in
Glottolog (glottolog.org). A primitive way to estimate the answer,
which is still "better than nothing", is simply to count the number of
pages of the respective publications. One of several important
drawbacks is that page counts are not additive in the desired way:
given two different books with similar content, the total amount of
description is not the sum of their page counts.

In a project concerning digitized language descriptions we have access
to a full-text database of ca 30 000 publications with language
descriptions spanning over 6 000 languages, written in over 50
different (meta-)languages. Thanks to the full texts, we can improve on
the page-count estimate by counting terms relating to language
description (e.g., suffix, imperative, plural) in a way that
supports additivity. We will discuss methods to automatically extract
lists of linguistic terms from the collection in a
language-independent way, their enhancement using vector-space
techniques, and their cross-lingual linking. We will show empirically
which way (not) to count terms most closely approximates human judgment
of the "extent of description" contained in the same document.
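
As a rough illustration of what additivity could mean here (a sketch
under our own assumptions, not necessarily the method presented in the
talk), each publication can be represented by the set of description
terms it contains, and the combined extent of several publications can
be measured as the size of the union of those sets. The term list, the
sample texts, and the function names below are all hypothetical.

```python
import re

# Hypothetical, hand-picked term list for illustration only;
# the talk concerns extracting such lists automatically.
TERMS = {"suffix", "imperative", "plural", "prefix", "tense"}


def description_terms(text):
    """Return the set of known description terms occurring in a text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return TERMS.intersection(tokens)


def combined_extent(*texts):
    """Count the distinct description terms covered by the texts together."""
    covered = set()
    for text in texts:
        covered |= description_terms(text)
    return len(covered)


# Invented sample texts: book_a and book_b overlap in content,
# book_c covers different topics.
book_a = "The plural suffix attaches to nouns. The imperative takes no suffix."
book_b = "Nouns take a plural suffix; verbs have an imperative form."
book_c = "Verbs mark tense with a prefix."

print(combined_extent(book_a))          # 3
print(combined_extent(book_a, book_b))  # 3: similar content adds nothing new
print(combined_extent(book_a, book_c))  # 5: different content adds up
```

Unlike page counts, this measure does not grow when a second
publication merely repeats what the first already covers.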

Hauk, Bryn and Raina Heaton. 2018. Triage: Setting Priorities for Endangered Language Research. In Lyle Campbell and Anna Belew (eds.), Cataloguing the World's Endangered Languages, 259-304. London: Routledge.

Moseley, Christopher. 2010. Atlas of the World's Languages in Danger. 3rd edn. Paris: UNESCO Publishing.