18 September 2015

Inside BASE - OA (open access) Processing

BASE is one of the world's most voluminous search engines for academic open access web resources. BASE is operated by Bielefeld University Library.

We harvest thousands of repositories worldwide via OAI-PMH from many different OAI server systems. Each OAI server is a link to a database holding the metadata we harvest. Because of this setup, there are many ways to arrange and deliver metadata via OAI-PMH, and the quality of the metadata itself varies as well. Some processing is therefore necessary to obtain reasonably homogeneous metadata for indexing into BASE. This page gives some insight into that processing, using the classification of a metadata record as open access as an example.
You can see this as a kind of "snapshot of the current state", because development to improve the quality of the metadata is always going on.

First there is our Admin-Database, where we collect various information about each source (OAI repository) that we will index, have already indexed, or that has disappeared and is offline. This Admin-Database has a field named "oa" which can have the states "oa=0" (source is not open access), "oa=1" (source is completely open access) or "oa=2" (open access status is unknown). In the case of "oa=2" (unknown) there is another field named "oaset" which can hold a string. During processing, the "oaset" value is compared with the "setSpec" fields in the "oai:header" of each record.

Example:

   <record>
     <header>
       <identifier>oai:abc.de-opus4-uni:1</identifier>
       <datestamp>2015-01-27</datestamp>
       <setSpec>doc-type:doctoralthesis</setSpec>
       <setSpec>bibliography:false</setSpec>
       <setSpec>ddc:530</setSpec>
       <setSpec>open_access</setSpec>
     </header>
     <metadata>
       <oai_dc:dc ...
       ...
       </oai_dc:dc>
     </metadata>
   </record>
   
If the "oa" field for that source is set to "oa=2" and the "oaset" field is set to "oaset=open_access", then the processing of that record will produce a new element
   <element name="dcoa">1</element>
in the processing output. If "oaset" does not match any "setSpec", the result will be
   <element name="dcoa">2</element>
in the processing output.
In summary, the "oa" field of the Admin-Database sets the "dcoa" element on repository level, but in combination with the "oaset" field it can be used to set the "dcoa" element on record level.
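The preset step described above can be sketched in a few lines of Python. This is only an illustration, not the actual BASE code: the function name preset_dcoa is made up, the header is the simplified (namespace-free) one from the example, the comparison is assumed to be an exact string match, and it is assumed that "oa=0" and "oa=1" are copied unchanged into "dcoa".

```python
import xml.etree.ElementTree as ET

# Simplified header from the example above (real OAI-PMH responses are namespaced).
HEADER_XML = """
<header>
  <identifier>oai:abc.de-opus4-uni:1</identifier>
  <datestamp>2015-01-27</datestamp>
  <setSpec>doc-type:doctoralthesis</setSpec>
  <setSpec>bibliography:false</setSpec>
  <setSpec>ddc:530</setSpec>
  <setSpec>open_access</setSpec>
</header>
"""


def preset_dcoa(oa, oaset, set_specs):
    """Return the preset dcoa value for one record (hypothetical helper).

    oa        -- repository-level value from the Admin-Database (0, 1 or 2)
    oaset     -- optional string from the Admin-Database, or None
    set_specs -- list of setSpec strings from the record header
    """
    if oa in (0, 1):
        # Assumption: a known repository-level status is passed through as-is.
        return oa
    if oaset and oaset in set_specs:
        # Assumption: exact string match between oaset and a setSpec.
        return 1
    return 2  # still unknown


header = ET.fromstring(HEADER_XML.strip())
set_specs = [el.text.strip() for el in header.iter("setSpec") if el.text]

print(preset_dcoa(2, "open_access", set_specs))
```

With "oa=2" and "oaset=open_access" the record above gets the preset value 1. A record from the same source without the "open_access" setSpec would stay at 2.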
Many repositories don't use this possibility to mark records with a "setSpec" entry in the oai:header. Instead, they use the "dc:rights" element (for oai_dc) inside the metadata element to carry some kind of rights information about the record. The advantage is that this is more flexible: different records can have different rights information, or even several rights statements.

This brings us to the next step in our processing, where we analyze the rights fields of each record. As a result of this processing we either keep the value of our previously generated "dcoa" element or change it to a new value and, if possible, generate another new element "dcrightsnorm" in the processing output. This processing is done by pattern matching on terms-of-use and license information.
First, something about the patterns: they were chosen by analyzing millions of dc:rights fields from records. The patterns are used by programs in the processing chain, are revisited from time to time, and can be enhanced to get better results for terms-of-use/license detection. The patterns are applied in a fixed order, with PDM (public domain mark) first.

The current order is:

  • PDM (public domain mark)
  • CC0 (no rights reserved)
  • CC-BY-NC-ND
  • CC-BY-NC-SA
  • CC-BY-NC
  • CC-BY-ND
  • CC-BY-SA
  • CC-BY

This order starts with the "most open" entries, so PDM comes before CC0, but then continues with the "least open" license, CC-BY-NC-ND, to make sure that the most restrictive Creative Commons license is caught first if there are multiple dc:rights fields. Our processing is even more complicated than that, because we also do phrase detection for terms-of-use and licenses, such as "non commercial" or "no comercial", and many more like "restricted", "embargoed" or "closed". As you can imagine, this has to cover a great variety of languages and is an ongoing process of development.
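To illustrate why the order matters, here is a minimal Python sketch of ordered pattern matching over several dc:rights fields. The regular expressions are invented placeholders and far simpler than anything usable in production, and the function name normalize_rights is hypothetical. Only the order of the labels is taken from the list above.

```python
import re

# Labels in the order given above; the patterns themselves are illustrative only.
ORDERED_PATTERNS = [
    ("PDM", re.compile(r"publicdomain/mark|public\s+domain", re.I)),
    ("CC0", re.compile(r"\bcc0\b|publicdomain/zero", re.I)),
    ("CC-BY-NC-ND", re.compile(r"\bby-nc-nd\b", re.I)),
    ("CC-BY-NC-SA", re.compile(r"\bby-nc-sa\b", re.I)),
    ("CC-BY-NC", re.compile(r"\bby-nc\b", re.I)),
    ("CC-BY-ND", re.compile(r"\bby-nd\b", re.I)),
    ("CC-BY-SA", re.compile(r"\bby-sa\b", re.I)),
    ("CC-BY", re.compile(r"\bcc[ -]by\b|licenses/by\b", re.I)),
]


def normalize_rights(rights_fields):
    """Return the first label (in list order) found in any rights field, else None."""
    for label, pattern in ORDERED_PATTERNS:
        # The outer loop runs over the patterns, so list order decides the result
        # even when a later field carries the more restrictive license.
        if any(pattern.search(field) for field in rights_fields):
            return label
    return None


rights = [
    "http://creativecommons.org/licenses/by/4.0/",
    "http://creativecommons.org/licenses/by-nc-nd/4.0/",
]
print(normalize_rights(rights))
```

Note that the simple pattern for CC-BY-NC would also match a CC-BY-NC-ND URL. Checking CC-BY-NC-ND first is what keeps the result correct, both within one field and across several fields.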

Finally:
First we have the Admin-Database with entries for "oa" and "oaset", which work like a presetting in the processing chain. Later in the chain we come to the dc:rights processing with pattern matching. Assume we start with "oa=2" (unknown) but during processing detect "public domain" in dc:rights. This will change the element dcoa from "dcoa=2" to "dcoa=1" and will also add the element "dcrightsnorm=PDM".
So the rights processing overrules the presetting and is able to change the content of the element dcoa. It will also generate a new element dcrightsnorm, which holds the detected terms-of-use or license in normalized form.
The results of this processing are used in the advanced search in the sections "Terms of Re-use/Licenses" and "Access".
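The overruling can be sketched as follows. Again, this is only an assumption-laden illustration: the function name process_rights and the two reduced pattern sets are made up, and setting "dcoa=0" when a phrase like "embargoed" is found is a guess at one plausible behaviour, not something stated above. Only the "public domain" case (dcoa changes from 2 to 1, dcrightsnorm becomes PDM) is taken from the text.

```python
import re

# Hypothetical, heavily reduced pattern sets for illustration.
LICENSE_PATTERNS = [("PDM", re.compile(r"public\s+domain", re.I))]
RESTRICTED = re.compile(r"\b(restricted|embargoed|closed)\b", re.I)


def process_rights(dcoa, rights_fields):
    """Start from the preset dcoa and let the dc:rights analysis overrule it."""
    record = {"dcoa": dcoa}
    for label, pattern in LICENSE_PATTERNS:
        if any(pattern.search(field) for field in rights_fields):
            record["dcoa"] = 1              # detected open terms overrule the preset
            record["dcrightsnorm"] = label  # normalized form of what was detected
            return record
    if any(RESTRICTED.search(field) for field in rights_fields):
        record["dcoa"] = 0                  # assumption: restrictive phrase means not open access
    return record


print(process_rights(2, ["This item is in the Public Domain."]))
print(process_rights(2, []))
```

A record without any usable rights information simply keeps its preset value and gets no dcrightsnorm element.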


Best regards, Bernd Fehling
 