23 September 2015

Inside BASE - Language Processing

BASE is one of the world's most voluminous search engines for academic open access web resources. BASE is operated by Bielefeld University Library.

We harvest thousands of repositories worldwide via OAI-PMH from many different OAI server systems. Each OAI server is itself a link to a database holding the metadata that we harvest. Because of this setup there are many ways to arrange and deliver metadata via OAI-PMH, and the quality of the metadata itself varies. Some processing is therefore necessary to obtain reasonably homogeneous metadata for indexing into BASE. This page gives some insight into that processing, using language processing as the example.
You can see this as a kind of "snapshot of the current state" because there is always some development going on to improve the quality of metadata.

The language code set used in BASE is ISO 639-2/B, a 3-letter code. The "B" stands for the code set for bibliographic applications. Many repositories use ISO 639-1 (a 2-letter code), which is mapped to ISO 639-2/B. Many other repositories give the language as a name or as a textual description.
This is why the language processing of the metadata is actually pattern matching and normalisation rather than automated language detection: many records have such sparse metadata that automated detection would give wrong results or be impossible. The language of the metadata may also differ from the language of the content it describes, especially if the document contains multiple languages.
In general, BASE relies on the dc:language delivered with the metadata. Nevertheless, there is still much work involved in detecting and normalizing the content of the dc:language field.
The result is stored in the dclang element.
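As a minimal sketch of this kind of mapping, the following Python snippet normalizes a single value to ISO 639-2/B. It is written in Python for illustration, not in the Perl and XSLT that BASE uses, and the table names, the function name and the handful of table entries are assumptions; the real mapping tables are far larger and are not shown here.

```python
# Illustrative excerpts only; these tables are assumptions, not BASE's real data.
ISO_639_1_TO_2B = {"de": "ger", "en": "eng", "fr": "fre", "la": "lat"}
NAME_TO_2B = {
    "english": "eng",
    "german": "ger",
    "français": "fre",
    "allemand": "ger",
    "łaciński": "lat",
    "英文": "eng",
}
VALID_2B = set(ISO_639_1_TO_2B.values()) | set(NAME_TO_2B.values())


def normalize_language(value: str) -> str:
    """Map a 2-letter code, a 3-letter code or a language name to ISO 639-2/B."""
    key = value.strip().casefold()
    if key in VALID_2B:            # already an ISO 639-2/B code
        return key
    if key in ISO_639_1_TO_2B:     # ISO 639-1 code
        return ISO_639_1_TO_2B[key]
    return NAME_TO_2B.get(key, "unknown")  # language name, else "unknown"
```

A lookup by case-folded key keeps the matching simple; phrases and multi-language values need additional splitting before a lookup like this can work.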

Examples:

from: <dc:language>de</dc:language>
to:   <element name="dclang">ger</element>

from: <dc:language>English (United States)</dc:language>
to:   <element name="dclang">eng</element>

from: <dc:language xml:lang="pl"><![CDATA[łaciński]]></dc:language>
to:   <element name="dclang">lat</element>

from: <dc:language xml:lang="pl"><![CDATA[Français ; Allemand]]></dc:language>
to:   <element name="dclang">fre</element>
      <element name="dclang">ger</element>

from: <dc:language>English; Hebrew; Arabic</dc:language>
to:   <element name="dclang">eng</element>
      <element name="dclang">heb</element>
      <element name="dclang">ara</element>

from: <dc:language>English and Maori language with English language subtitles</dc:language>
to:   <element name="dclang">eng</element>
      <element name="dclang">mao</element>

from: <dc:language>Rukiga</dc:language>
to:   <element name="dclang">bnt</element>

from: <dc:language>英文</dc:language>
to:   <element name="dclang">eng</element>

The examples above are only a small sample of what can arrive as dc:language in the metadata. The value may be a clean 2-letter code, but also a phrase, a sentence, a list of several languages, or text in any script that UTF-8 can encode, as in the last example. It may even contain markup such as HTML or SGML.
There are also cases that cannot be mapped directly, like the "Rukiga" example. "Rukiga" means "language of the Kiga", and Kiga (Chiga) has the code "cgg" in ISO 639-3 but no code in ISO 639-2/B. To solve this, the mapping moves upwards in the language family until it reaches a member of ISO 639-2/B. Here that is "bnt", which stands for "Bantu languages", a collective code that represents a group of individual languages.
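The upward walk through the language family can be sketched as follows. This is only an illustration in Python under stated assumptions: the `PARENT` table, the `ISO_639_2B` set and the function name are hypothetical, and each holds just enough entries to reproduce the "Rukiga" case.

```python
# Hypothetical excerpt of a language-family tree: child code -> parent code.
PARENT = {"cgg": "bnt"}
# Hypothetical excerpt of the codes that exist in ISO 639-2/B.
ISO_639_2B = {"eng", "ger", "bnt"}


def nearest_2b_code(code):
    """Walk up the family tree until a code that exists in ISO 639-2/B is found."""
    seen = set()  # guards against cycles in a badly built table
    while code is not None and code not in seen:
        if code in ISO_639_2B:
            return code
        seen.add(code)
        code = PARENT.get(code)  # None once the top of the tree is reached
    return "unknown"
```

With these tables, `nearest_2b_code("cgg")` climbs one level and returns the collective code for Bantu languages, while a code with no known ancestor ends up as "unknown".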

The processing has the following steps.

  • First, the content of the dc:language field is cleaned up and prepared by a Perl program.
  • Second, the prepared content is transformed into dclang elements with XSLT.
  • Third, duplicate languages are removed from the dclang elements, again by a Perl program.

Every record from every source has to go through this processing chain.
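The three steps above can be sketched as a small chain of functions. This is a simplified Python stand-in, not the Perl and XSLT code used in BASE; the function names, the separator set and the `LOOKUP` table are assumptions made for the example.

```python
import re

# Illustrative lookup table (an assumption; the real mapping is much larger).
LOOKUP = {
    "de": "ger", "german": "ger", "allemand": "ger",
    "english": "eng", "français": "fre",
    "hebrew": "heb", "arabic": "ara",
}


def cleanup(raw):
    """Step 1: drop tag-like markup and split multi-language values."""
    text = re.sub(r"<[^>]+>", " ", raw)
    parts = re.split(r"[;,/]", text)
    return [p.strip().casefold() for p in parts if p.strip()]


def transform(tokens):
    """Step 2: map every recognized token to an ISO 639-2/B code."""
    return [LOOKUP[t] for t in tokens if t in LOOKUP]


def deduplicate(codes):
    """Step 3: remove duplicates, keeping the order of first appearance."""
    unique = list(dict.fromkeys(codes))
    return unique or ["unknown"]


def process(raw):
    """Run one dc:language value through the whole chain."""
    return deduplicate(transform(cleanup(raw)))
```

Keeping the steps as separate functions mirrors the chain described above: each stage can be tested on its own, and a value such as "German; de" collapses to a single code in the last step.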

Finally, the result of language processing is:

  • The original content of dc:language will not change and is stored in a dclanguage element.
  • The processing can produce one or more dclang elements.
  • There is always at least one dclang element. If no language could be found the content is "unknown".
  • The dclang element is always normalized: it contains either an ISO 639-2/B code or "unknown".
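To make these guarantees concrete, here is a hedged sketch that builds the language fields of one record with Python's standard xml.etree.ElementTree module. The `record` wrapper element, the function name and the exact form of the dclanguage element (assumed to follow the same `<element name="...">` pattern as dclang in the examples) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def build_language_fields(original, codes):
    """Keep the original dc:language text and emit at least one dclang element."""
    record = ET.Element("record")  # hypothetical wrapper element
    kept = ET.SubElement(record, "element", {"name": "dclanguage"})
    kept.text = original  # the original content is stored unchanged
    for code in codes or ["unknown"]:  # never emit zero dclang elements
        ET.SubElement(record, "element", {"name": "dclang"}).text = code
    return record


record = build_language_fields("English; Hebrew; Arabic", ["eng", "heb", "ara"])
print(ET.tostring(record, encoding="unicode"))
```

Passing an empty list of codes yields a single dclang element with the content "unknown", which matches the rule that a record never ends up without a dclang element.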


Best regards, Bernd Fehling
 