BASE is one of the world's most voluminous search engines for academic open access web resources. BASE is operated by Bielefeld University Library.
We harvest thousands of repositories worldwide via OAI-PMH, from many different OAI server systems.
Each OAI server is in turn a connecting link to a database that holds the metadata we harvest.
Because of this setup, metadata can be arranged and delivered via OAI-PMH in many different ways, and the quality of the metadata itself varies as well.
Some processing is therefore necessary to obtain reasonably homogeneous metadata for indexing into BASE.
This page gives some insight into that processing, using language processing as the example.
Treat it as a snapshot of the current state, because development to improve the quality of the metadata is always going on.
The language set used in BASE is ISO 639-2/B, a 3-letter code. The "B" stands for the code set for
bibliographic applications. Many repositories use ISO 639-1 (a 2-letter code), which is mapped to ISO 639-2/B.
Many other repositories give the language as a name or as a textual description of the language.
The language processing of the metadata is therefore pattern matching and normalisation rather than automated language detection:
many records have metadata so sparse that automated detection would give wrong results or be impossible altogether.
The language of the metadata may also differ from the language of the content it describes, especially if the document
contains multiple languages.
In general, BASE relies on the dc:language delivered with the metadata.
Nevertheless, much work remains to detect and normalize the content of the dc:language field.
The result is stored in the dclang element.
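The pattern matching and normalisation described above can be sketched roughly as follows. This is a minimal illustration, not the code BASE actually runs: the function name `normalise_dc_language` is hypothetical, and the three lookup tables are tiny excerpts chosen for the example. Values that look like a code are mapped directly, and anything else is searched for known language names.

```python
import re

# Small illustrative excerpts; a real table would cover the full code lists.
ISO_639_1_TO_2B = {"en": "eng", "de": "ger", "fr": "fre", "es": "spa", "nl": "dut"}

# ISO 639-2/T (terminology) codes that differ from the bibliographic (B) code.
ISO_639_2T_TO_2B = {"deu": "ger", "fra": "fre", "nld": "dut"}

# Language names as they might appear in free text (assumed examples).
NAME_TO_2B = {
    "english": "eng",
    "german": "ger",
    "deutsch": "ger",
    "french": "fre",
    "français": "fre",
    "spanish": "spa",
}

VALID_2B = set(ISO_639_1_TO_2B.values())


def normalise_dc_language(value):
    """Return the ISO 639-2/B codes found in a raw dc:language value."""
    codes = []
    # Several languages are often packed into one field, e.g. "en; de".
    for part in re.split(r"[;,/|]", value):
        part = part.strip().lower()
        # Drop a region suffix such as "en-US" or "de_DE".
        base = re.split(r"[-_]", part, maxsplit=1)[0]
        if base in VALID_2B:
            found = [base]
        elif base in ISO_639_2T_TO_2B:
            found = [ISO_639_2T_TO_2B[base]]
        elif base in ISO_639_1_TO_2B:
            found = [ISO_639_1_TO_2B[base]]
        else:
            # Free text: look for known language names word by word.
            words = re.findall(r"[^\W\d_]+", part)
            found = [NAME_TO_2B[w] for w in words if w in NAME_TO_2B]
        for code in found:
            if code not in codes:
                codes.append(code)
    return codes


dclang = normalise_dc_language("Text in English and French")
```

Codes are only recognised when they make up a whole value, so the word "en" in a phrase such as "Texte en français" is not mistaken for the ISO 639-1 code for English.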
Examples:
from | to |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The examples above are only a small sample of what can arrive as dc:language in the metadata.
There are clean 2-letter codes, but also phrases, sentences, multiple languages and text from the full UTF-8 range in any
language, as in the last example. Markup of any kind, such as HTML or SGML, turns up as well.
There are also cases that cannot be mapped directly, like the "Rukiga" example.
"Rukiga" means "language of Kiga", and Kiga (Chiga) has the ISO 639-3 code "cgg" but no ISO 639-2/B code.
To solve this, the mapping moves up the language family until it reaches a member of ISO 639-2/B. Here that is "bnt",
which stands for "Bantu languages" and is a collective language code element representing a group
of individual languages.
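The upward walk through the language family can be sketched like this. It is only an illustration under assumptions: the function name `nearest_iso_639_2b` is hypothetical, the parent table is a hand-made excerpt, and the intermediate node `x-intermediate-group` is a made-up placeholder rather than a real code.

```python
# Excerpt of codes that exist in ISO 639-2/B (illustrative subset).
ISO_639_2B_CODES = {"eng", "ger", "fre", "bnt", "nic"}

# Child -> parent links in the language family tree.
# "x-intermediate-group" is a placeholder label, not a real code.
PARENT = {
    "cgg": "x-intermediate-group",
    "x-intermediate-group": "bnt",
    "bnt": "nic",
}


def nearest_iso_639_2b(code):
    """Walk up the family tree until a code from ISO 639-2/B is reached."""
    seen = set()
    while code is not None and code not in seen:
        if code in ISO_639_2B_CODES:
            return code
        seen.add(code)  # guard against accidental cycles in the table
        code = PARENT.get(code)
    return None  # no ancestor has an ISO 639-2/B code


rukiga = nearest_iso_639_2b("cgg")
```

Stopping at the first ancestor found keeps the result as specific as ISO 639-2/B allows, so Kiga ends up as "bnt" rather than a broader collective code further up the tree.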
The processing has the following steps.
Finally, the result of language processing is:
Best regards, Bernd Fehling