DFG project Automatic Enhancement of OAI Metadata — English Summary

Project description


Access to high-quality scientific information is an important prerequisite of scholarship. The increasing availability of electronic publications in repositories distributed across the internet, and their aggregation within the framework of the Open Archives Initiative (OAI), already contribute substantially to this. The project "Automatic enrichment of OAI metadata by means of computational linguistics methodology and development of services for content-based integration of repositories", funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG), aims to improve the subject classification of these scientific documents using OAI metadata. The project is a cooperation between Bielefeld University Library, the Text Technology Lab at Frankfurt University (Professor Alexander Mehler), and the NLP Group at the University of Leipzig (Professor Gerhard Heyer).

Within the scope of the project, documents lacking sufficient classificatory information are to be classified automatically by means of computational linguistics methods. Several classification schemes are to be supported, starting with the Dewey Decimal Classification (DDC). The classification information will be integrated into the metadata and can then be used in different contexts: for example, it can be returned to the repositories or included in scientific search engines such as the Bielefeld Academic Search Engine (BASE). The data will be made available to other organizations for further reuse. Eventually, the normalized data is intended to facilitate the semantic integration of distributed repositories. Semantic browsing and search will become feasible and improve the quality of electronic literature search and retrieval.
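To illustrate what "lacking sufficient classificatory information" can mean in practice, the following minimal sketch checks whether an OAI Dublin Core (oai_dc) record already carries a DDC notation. It is not the project's actual pipeline. It assumes that DDC notations appear in dc:subject elements with a "ddc:" prefix, a convention used by some repositories but not guaranteed everywhere. The sample record and the function names ddc_subjects and needs_classification are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace of the Dublin Core elements used inside oai_dc records.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical sample record, for illustration only.
SAMPLE_RECORD = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example thesis</dc:title>
  <dc:subject>Graph theory</dc:subject>
  <dc:subject>ddc:510</dc:subject>
</oai_dc:dc>
"""


def ddc_subjects(record_xml: str) -> List[str]:
    """Return the DDC notations found in the dc:subject elements of a record.

    Assumption: DDC notations are marked with a "ddc:" prefix.
    """
    root = ET.fromstring(record_xml)
    found = []
    for subject in root.iter("{%s}subject" % DC_NS):
        text = (subject.text or "").strip()
        if text.lower().startswith("ddc:"):
            found.append(text[4:].strip())
    return found


def needs_classification(record_xml: str) -> bool:
    """A record is a candidate for automatic classification if it has no DDC notation."""
    return not ddc_subjects(record_xml)


if __name__ == "__main__":
    print(ddc_subjects(SAMPLE_RECORD))
    print(needs_classification(SAMPLE_RECORD))
```

Records for which needs_classification returns True would be the ones passed on to an automatic categorizer. Records that already carry a notation could be left unchanged.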

The project combines the fields of digital libraries and text technology (computational linguistics): the University Library provides access to high-quality document collections through a standardized interface, while text technology delivers the linguistically based classification results.

Results


Better Subject Indexing of Open Access Documents

We raised the number of DDC-classified documents in the BASE index from 429,496 to currently 1,753,712. This number will continue to rise, as documents are now enhanced on a daily basis.

Users can directly benefit from the improved subject indexing by using the BASE DDC browsing interface, which allows the exploration of the BASE index via the hierarchical category tree of the DDC.
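The hierarchical structure that makes such browsing possible can be sketched briefly. At its top three levels, the DDC is organized into ten main classes, one hundred divisions, and one thousand sections, and the level of a notation can be read off its first three digits. The snippet below is only an illustration of that principle, not a description of how the BASE interface is implemented. The function name ddc_path is hypothetical.

```python
from typing import List


def ddc_path(notation: str) -> List[str]:
    """Return the main class, division, and section for a DDC notation.

    Only the three digits before the decimal point are used, so
    "516.3" is placed under 500 (main class), 510 (division), 516 (section).
    """
    digits = notation.strip().split(".")[0]
    if len(digits) != 3 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a notation with three leading digits: %r" % notation)
    return [digits[0] + "00", digits[:2] + "0", digits]


if __name__ == "__main__":
    print(ddc_path("516.3"))
```

A browsing tree can be built by grouping documents under each element of the returned path, from the most general level to the most specific.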

Reuse of the Enhanced Metadata

The benefits of the enhanced metadata are not limited to BASE. Given the increasing amount of categorized data, subject repositories and portals can now directly obtain subject-specific raw metadata subsets from BASE. This is made possible by a newly developed API, which is located at: http://129.70.12.31/mdapi/doc/

Reuse of the Categorizers

Third-party organizations can also use our categorizers. This may be helpful when attempting to categorize large portions of legacy data or for semi-automatic acquisition of metadata in a deposit process.

The categorization API is located at: http://129.70.12.31/clfapi/doc/

The Automatic Classification Toolbox for Digital Libraries (ACT-DL)

ACT-DL is a utility site which allows the use of the DDC categorizers directly from your web browser. It provides tools for classifying different types of documents (text, PDF, and web pages) according to the DDC.

Project management


Bielefeld University Library
Dr. Wolfram Horstmann, CIO Scientific Information at Bielefeld University
Text Technology, Frankfurt am Main
Prof. Dr. Alexander Mehler
NLP Group, University of Leipzig
Prof. Dr. Gerhard Heyer

Team Members

Bielefeld UL
Mathias Loesch Mathias.Loesch@uni-bielefeld.de
Text Technology Frankfurt
Tim vor der Brück vorderBrueck@em.uni-frankfurt.de

Project funding

German Research Foundation (Deutsche Forschungsgemeinschaft, DFG)

Duration

2 years (starting October 2009)

Publications

  • Lösch, M., U. Waltinger, W. Horstmann, and A. Mehler (2011). Building a DDC-annotated Corpus from OAI Metadata. Journal of Digital Information 12(2).
  • Lösch, M. (2011). A Multidisciplinary Search Engine for Scientific Open Access Documents, in: R. Depping, & S. Christiane (Eds.), Elektronische Schriftenreihe der Universitäts- und Stadtbibliothek Köln, 2. Cologne: EBSLG Annual General Conference, 11–15. http://pub.uni-bielefeld.de/publication/2083906
  • Mehler, A. and Waltinger, U. (2009). Enhancing document modeling by means of open topic models: Crossing the frontier of classification schemes in digital libraries by example of the DDC. Library Hi Tech, 27(4):520–539.
  • Mehler, A. (2010). A Quantitative Graph Model of Social Ontologies by Example of Wikipedia. In: Dehmer, M., F. Emmert-Streib and A. Mehler (Eds.), Towards an Information Theory of Complex Networks: Statistical Methods and Applications. Boston/Basel: Birkhäuser (to appear).
  • Summann, F. (2009). Open Access and Institutional Repositories from Local Initiatives to Global Solutions. In: CASLIN 2009: Institutional Online Repositories and Open Access, Pilsen, pp. 39–42.
  • Waltinger, U., A. Mehler, M. Lösch, and W. Horstmann (2011). Hierarchical classification of OAI metadata using the DDC taxonomy. In R. Bernardi, S. Chambers, B. Gottfried, F. Segond, and I. Zaihrayeu (Eds.), Advanced Language Technologies for Digital Libraries, Volume 6699 of Lecture Notes in Computer Science, pp. 29–40. Springer Berlin / Heidelberg.

Web links