28 September 2015

Inside BASE - dc:type Processing

BASE is one of the world's most voluminous search engines for academic open access web resources. BASE is operated by Bielefeld University Library.

We harvest thousands of repositories worldwide via OAI-PMH from many different OAI server systems. Each OAI server is in turn a link to a database holding the metadata we harvest. Because of this setup, there are many ways to arrange and deliver metadata via OAI-PMH, and the quality of the metadata itself varies. Some processing is therefore necessary to obtain reasonably homogeneous metadata for indexing into BASE. This page gives some insight into that processing, using dc:type as an example.
You can see this as a kind of "snapshot of the current state" because there is always some development going on to improve the quality of metadata.

By analyzing the dc:type field of millions of records, we created a list of terms used in that field. The list also contains the text object typology types from the DRIVER guidelines for repository content and the document and publication types from the DINI recommendations. The terms are normalized and grouped into categories, each identified by a 4-digit code. For normalization we use the Perl module Text::Normalize::NACO, which is based on the NACO rules.
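To make the idea of a normalized, grouped term list more concrete, here is a minimal sketch in Python. It is not our production code: the function normalize_term only approximates the spirit of NACO-style normalization (it is not a port of Text::Normalize::NACO), and the names RAW_TERMS, build_term_index and the sample terms are assumptions chosen for illustration.

```python
import re
import unicodedata


def normalize_term(term: str) -> str:
    """Rough, NACO-inspired normalization (an approximation, not the Perl module)."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", term)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, replace punctuation with spaces, collapse whitespace runs.
    lowered = stripped.lower()
    spaced = re.sub(r"[^\w\s]", " ", lowered)
    return re.sub(r"\s+", " ", spaced).strip()


# Hypothetical sample terms grouped by 4-digit code.
# The real list is derived from millions of harvested records.
RAW_TERMS = {
    "0001": ["Article", "Journal Article", "info:eu-repo/semantics/article"],
    "0004": ["Thesis", "Doctoral Thesis", "Thèse"],
}


def build_term_index(raw_terms: dict) -> dict:
    """Map every normalized term to the code of its category."""
    index = {}
    for code, terms in raw_terms.items():
        for term in terms:
            index[normalize_term(term)] = code
    return index


TERM_INDEX = build_term_index(RAW_TERMS)
```

Storing the terms in normalized form means that spelling variants such as "Doctoral  Thesis." and "doctoral thesis" end up under the same key.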

List of 4 digit codes:

Digit code  Document type
0000        Text
0001        Article, Journal
0002        Book
0003        Report, Paper, Lecture
0004        Thesis
0005        Review
0101        Audio
0102        Video
0103        Image
0104        Map
0105        Software
0106        Primary Data
0107        Sheet Music
9999        Unknown Material

During processing, the dc:type field of each record is normalized in the same way and compared with the list of terms. If the normalized dc:type value matches a normalized list entry, we generate a dctypenorm element with the corresponding 4-digit code.
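The matching step can be sketched roughly as follows. This is an illustration under assumptions, not the actual BASE implementation: TERM_INDEX is a hypothetical excerpt of the term list, normalize is a simplified stand-in for the NACO normalization, the XML serialization of the dctypenorm element is assumed, and returning None for unmatched values is a choice made only for this example.

```python
from typing import Optional

# Hypothetical excerpt of the term list, keyed by normalized term.
TERM_INDEX = {
    "article": "0001",
    "book": "0002",
    "doctoral thesis": "0004",
    "video": "0102",
}


def normalize(value: str) -> str:
    # Simplified stand-in for NACO normalization:
    # lowercase and collapse whitespace only.
    return " ".join(value.lower().split())


def dctypenorm_element(dc_type: str) -> Optional[str]:
    """Return a dctypenorm element for a dc:type value, or None if no term matches."""
    code = TERM_INDEX.get(normalize(dc_type))
    if code is None:
        # Assumption for this sketch: unmatched values yield no element.
        return None
    return f"<dctypenorm>{code}</dctypenorm>"


print(dctypenorm_element("  Doctoral   THESIS "))
print(dctypenorm_element("Poster"))
```

Because both sides go through the same normalization, the lookup is a plain dictionary access and does not need any fuzzy matching.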

Best regards, Bernd Fehling
 