
28 September 2015

Inside BASE - dc:type Processing

BASE is one of the world's most voluminous search engines for academic open access web resources. BASE is operated by Bielefeld University Library.

We harvest thousands of repositories worldwide via OAI-PMH from many different OAI server systems. Each OAI server is itself a connecting link to a database holding the metadata we harvest. Because of this setup, there are many ways to arrange and deliver metadata via OAI-PMH, and the quality of the metadata itself varies. Some processing is therefore necessary to obtain reasonably homogeneous metadata for indexing into BASE. This page gives some insight into that processing, using dc:type as an example.
You can see this as a kind of "snapshot of the current state", because there is always some development going on to improve the quality of the metadata.

By analyzing the dc:type field of millions of records, we created a list of terms used in that field. The list also contains the text object typology types from the DRIVER guidelines for repository content and the document and publication types from the DINI recommendations. The terms are normalized and grouped into categories, each of which has a 4-digit code. For normalization we use the Perl module Text::Normalize::NACO, which is based on the NACO rules.

List of 4-digit codes:

4-digit code	document type
0000	Text
0001	Article, Journal
0002	Book
0003	Report, Paper, Lecture
0004	Thesis
0005	Review
0101	Audio
0102	Video
0103	Image
0104	Map
0105	Software
0106	Primary Data
0107	Sheet Music
9999	Unknown Material

During processing, the dc:type field of each record is normalized in the same way and compared with the list of terms. If the normalized dc:type value matches a normalized list entry, we generate a dctypenorm element with the corresponding 4-digit code.
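
The matching step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the BASE implementation: the term list is a small hypothetical excerpt, the normalize function is a simplified stand-in for the NACO rules applied by Text::Normalize::NACO, and returning nothing for an unmatched value is an assumption, since only the matching case is described here.

```python
import re
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical excerpt of the term list: 4-digit code -> raw terms.
# The entries are illustrative only; the real list is far larger.
TERMS_BY_CODE = {
    "0001": ["Article", "Journal Article", "info:eu-repo/semantics/article"],
    "0002": ["Book", "info:eu-repo/semantics/book"],
    "0004": ["Thesis", "Doctoral Thesis", "Dissertation"],
    "0103": ["Image", "Photograph"],
}


def normalize(term: str) -> str:
    """Simplified stand-in for NACO normalization (not the full rule set)."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", term)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, replace runs of other characters with one space, trim.
    return re.sub(r"[^a-z0-9]+", " ", stripped.lower()).strip()


# Normalized term -> 4-digit code, built once from the grouped list.
LOOKUP = {
    normalize(term): code
    for code, terms in TERMS_BY_CODE.items()
    for term in terms
}


def dctypenorm(dc_type: str):
    """Return a dctypenorm element for a matching dc:type value, else None."""
    code = LOOKUP.get(normalize(dc_type))
    if code is None:
        return None  # assumption: no element is generated without a match
    element = ET.Element("dctypenorm")
    element.text = code
    return element


if __name__ == "__main__":
    element = dctypenorm("  DOCTORAL-THESIS ")
    print(ET.tostring(element, encoding="unicode"))
```

Because both the list entries and the incoming dc:type value pass through the same normalize function, variants that differ only in case, punctuation or diacritics end up with the same code.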

Best regards, Bernd Fehling
