27 March 2013

Inside BASE - Solr and the multilingual EuroVoc Thesaurus

BASE is one of the world's most voluminous search engines for academic open access web resources. BASE is operated by Bielefeld University Library.

BASE started years ago with a single server (a self-developed PHP GUI as frontend, FAST FDS as backend). The FAST FDS backend was later extended to a multi-server setup with 6 indexer nodes, 1 search node and 1 processing and control node. Then we had to decide whether to move from FAST FDS to FAST ESP or to switch to Apache Solr. We decided on Apache Solr, and while changing from FAST FDS to Solr we wanted to keep our functionality in frontend and backend, including some self-developed solutions like multilingual search with the EuroVoc Thesaurus.
The self-developed PHP GUI was replaced with VuFind, which is also written in PHP.

What is EuroVoc?
EuroVoc is the multilingual thesaurus of the European Union. It contains terms in over 22 languages.
What is the benefit of using a multilingual thesaurus?
With query-time search term expansion you can search not only with synonyms but also across languages.
Why query-time search term expansion?
It will not bloat your index, but will need some milliseconds for search term processing and expansion.
It is more flexible than index-time term expansion.
Where are the problems?
Mainly in the n-to-m mapping of synonyms (single term to single term, single term to multi term, multi term to single term, multi term to multi term).
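
To make the n-to-m cases concrete, here is a minimal sketch in plain Java. It is only an illustration and not the BASE code: the class name ThesaurusSketch, the method expand and the tiny hand-written map are assumptions standing in for the real EuroVoc data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical in-memory stand-in for a thesaurus; not the real EuroVoc data.
class ThesaurusSketch {
    // Keys and values can each be a single term or a multi term (n-to-m).
    static final Map<String, List<String>> ENTRIES = Map.of(
        "salt", List.of("salz", "sel", "sale"),                // single -> single
        "erwachsenenbildung", List.of("adult education"),      // single -> multi
        "adult education", List.of("erwachsenenbildung")       // multi -> single
    );

    // Query-time expansion: the index is untouched, only the query grows.
    static List<String> expand(String term) {
        String key = term.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        result.add(key);                                       // keep the original term
        result.addAll(ENTRIES.getOrDefault(key, List.of()));   // add its expansions, if any
        return result;
    }
}
```

A term without an entry simply comes back unchanged, which is the cheap case at query time.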

How we did it... and the pitfalls to watch out for.

1. Setting up field and analyzer chain

First we need a "capture all" field which holds all text to be indexed. In schema.xml define a fieldType and a field. The field should be multiValued and only needs to be indexed, not stored. Keep the index analyzer chain of the fieldType simple. I use MappingCharFilterFactory, WhitespaceTokenizerFactory, StopFilterFactory, WordDelimiterFilterFactory and LowerCaseFilterFactory.

The query analyzer chain of the fieldType needs some special setup. I started with MappingCharFilterFactory, WhitespaceTokenizerFactory, LowerCaseFilterFactory and SynonymFilterFactory. Now we are ready to index the test content.

Then use Solr Admin Analysis and start with a simple single query term like salt. You get salz, άλατα, сол, soľ, sol, sale, zout, salt, sal, só, druska, sāls, sel, sůl, suola, sare, sól of type SYNONYM from SynonymFilterFactory. Nice :-)

Now try other single and multi terms and check the results.
Problem: After trying some phrases we see that SynonymFilterFactory can't handle quoted phrases.
Solution: Add a PatternReplaceFilterFactory right after WhitespaceTokenizerFactory to remove the quotation marks.
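
Why the quotation marks get in the way can be shown with a small plain Java sketch. It only imitates whitespace tokenizing followed by a pattern replace on each token; the class and method names are made up for this illustration and it is not Solr code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: whitespace tokenizing followed by quote removal per token.
class QuoteStripSketch {
    static List<String> tokenize(String query) {
        List<String> tokens = new ArrayList<>();
        for (String raw : query.trim().split("\\s+")) {
            // Without this step the tokens would be "adult and education",
            // each with a quotation mark attached, and no synonym entry would match.
            String cleaned = raw.replaceAll("\"", "");
            if (!cleaned.isEmpty()) {
                tokens.add(cleaned);
            }
        }
        return tokens;
    }
}
```

With the quotation marks gone, the tokens of a phrase look like ordinary terms to the synonym lookup.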

2. The query expansion

Single term searches which expand to other single terms are no problem at all, but if they expand to multi terms we get into trouble with the QueryParser.
E.g. erwachsenenbildung --> adult education, educación de adultos, ...
Problem: We actually want to search for the phrase "adult education" and not for the two separate terms adult and education.
Solution: We have to write a little SynonymPostFilterFactory which turns all multi term synonym expansions into phrases.
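
The core idea of such a post filter, sketched in plain Java on strings. A real filter would be a Lucene TokenFilter working on token attributes; the class and method names here are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// String-level stand-in for the post filter idea; not a Lucene TokenFilter.
class PhraseWrapSketch {
    static List<String> toPhrases(List<String> expansions) {
        List<String> out = new ArrayList<>();
        for (String expansion : expansions) {
            // A multi term expansion contains whitespace and must become a phrase.
            boolean multiTerm = expansion.trim().contains(" ");
            out.add(multiTerm ? "\"" + expansion.trim() + "\"" : expansion);
        }
        return out;
    }
}
```

Single term expansions pass through untouched, so only the multi term cases change.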

Now let us do a phrase search like "adult education" and see what happens. Looks good.

Let's try a multi term like adult education. Booom!!!
Problem: We only get results for adult (erwachsener, adulto, ...) and education (bildung, istruzione, ...) but not for adult education!!!
After using the debugger it turns out that the QueryParser also does some tokenizing :-( . We never get the full multi term into our analyzer chain at once.
Attention: Solr Admin Analysis behaves as if it were handed whole phrases to analyse. This is NOT the same input the analyzer gets from the QueryParser.
Solution: Write a SynonymQueryComponent which turns your multi term query into a phrase query. It can also do some pre-parsing to prepare and clean up your new phrase query. In solrconfig.xml, define your new searchComponent, then define a new requestHandler with the name "synonym" and place the component in its first-components section.
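
What the pre-parsing step might look like, reduced to plain Java string handling. This is a sketch under assumptions: it ignores operators and field syntax, and the class and method names are invented for the example. A real component would rewrite the q parameter of the request before the query is parsed.

```java
// Illustration only: turn a multi term query string into a phrase query string.
class QueryPrepSketch {
    static String toPhraseQuery(String q) {
        // Clean up: trim and collapse runs of whitespace.
        String cleaned = q.trim().replaceAll("\\s+", " ");
        boolean alreadyPhrase = cleaned.length() >= 2
                && cleaned.startsWith("\"") && cleaned.endsWith("\"");
        if (alreadyPhrase || !cleaned.contains(" ")) {
            return cleaned;                 // phrase or single term: nothing to do
        }
        // Drop stray quotation marks, then wrap the whole multi term.
        return "\"" + cleaned.replace("\"", "") + "\"";
    }
}
```

This way the analyzer chain sees adult education as one unit instead of two separately parsed terms.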

3. The query

Following the chain of a query through Solr/Lucene we now come to the query itself. Turn debugging on and check which query types we get. A single term query which can't be expanded will result in a TermQuery. After some further tests with different queries (multi term, phrases, with and without expansion) we sometimes also see BooleanQuery and PhraseQuery, but very often MultiPhraseQuery. TermQuery, BooleanQuery and PhraseQuery are no problem. But most of the time, if expansion occurs, we get a MultiPhraseQuery, and that is not doing what we are looking for.
Problem: MultiPhraseQuery gets "confused" by the expansion. Even when using the mm parameter you don't always get what you want.
Solution: Turn the MultiPhraseQuery into a BooleanQuery by writing a SynonymQParserPlugin. If the result of parse(qstring) is an instance of MultiPhraseQuery, we iterate over its term arrays, walk through the list of expanded terms in each and build our BooleanQuery. All expanded terms of one array are ORed together (BooleanClause.Occur.SHOULD). The arrays are then combined in another BooleanQuery according to the q.op parameter or, if that is not set, the default operator. Register your new SynonymQParserPlugin in solrconfig.xml in the "Query Parsers" section. Now use it with your new requestHandler from step 2 by adding the parameter defType=synonym.
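
The shape of that rewrite, sketched in plain Java. To stay self-contained it does not use the Lucene classes: the term arrays are modelled as a list of lists and the resulting boolean query is rendered as a string. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: models the rewrite on plain lists and strings.
class BooleanRewriteSketch {
    // termArrays: one inner list per position, holding the expanded alternatives.
    // op: "AND" or "OR", standing in for q.op or the default operator.
    static String rewrite(List<List<String>> termArrays, String op) {
        List<String> groups = new ArrayList<>();
        for (List<String> alternatives : termArrays) {
            // Alternatives of one position are ORed (the SHOULD clauses).
            groups.add("(" + String.join(" OR ", alternatives) + ")");
        }
        // The groups are then combined with the chosen operator.
        return String.join(" " + op + " ", groups);
    }
}
```

For adult (erwachsener) and education (bildung) with AND this gives (adult OR erwachsener) AND (education OR bildung).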

4. Enhancements

Basically it is working now and we know that we get expansions for adult education both as multi term and as phrase query. But what if the query is adult education organization?
Problem: If there is no entry for the whole multi term then we are lost, even though there are expansions for adult education and organization.
Solution: Add a ShingleFilterFactory to your query analyzer chain between LowerCaseFilterFactory and SynonymFilterFactory. I have checked the EuroVoc for the length of multi terms and decided to set maxShingleSize to 10 and outputUnigrams to true.
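
What shingling produces can be pictured with a few lines of plain Java. This is only a sketch of the idea (word n-grams up to a maximum size, unigrams included), not the Lucene ShingleFilter, and the names are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: word n-grams up to maxSize, including single terms.
class ShingleSketch {
    static List<String> shingles(List<String> tokens, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            int limit = Math.min(tokens.size(), start + maxSize);
            for (int end = start + 1; end <= limit; end++) {
                // end == start + 1 yields the unigram, larger ends the shingles.
                out.add(String.join(" ", tokens.subList(start, end)));
            }
        }
        return out;
    }
}
```

For adult education organization this yields six candidates, among them adult education and organization, each of which can then be looked up in the thesaurus.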

This helps by producing shingles and I get expansion results for adult, education, organization and adult education.
Problem: But it also "spoils" my query with too many unwanted shingles.
Solution: We have to enhance our SynonymPostFilterFactory to remove shingles which are left over and were not expanded with synonyms. What should remain is either an unexpanded single term (from outputUnigrams of the shingling) or synonyms expanded from single or multi term shingles. If typeAtt.type() equals "shingle" in the SynonymPostFilterFactory, clear the term with termAtt.setEmpty().
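
The cleanup rule, sketched in plain Java with a tiny hypothetical Token class in place of the Lucene token attributes. The type names "word", "shingle" and "SYNONYM" are the ones the Analysis page shows. Dropping the token from a list here plays the role of clearing it with termAtt.setEmpty().

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: Token is a stand-in for the term and type attributes.
class ShingleCleanupSketch {
    static final class Token {
        final String text;
        final String type;
        Token(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    // Keep single terms and synonyms, drop tokens still typed "shingle".
    static List<String> keep(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token token : tokens) {
            if (!"shingle".equals(token.type)) {
                out.add(token.text);
            }
        }
        return out;
    }
}
```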

Now it is doing the job.
But wouldn't it be great to get the expanded synonyms of the original query term boosted over the single terms or the synonyms of the single terms?
Definitely yes!
Solution: Add the parameter "syn.boost" to the SynonymQParserPlugin (defaults to 1.0). It can be set in the query as syn.boost=100 or as a requestHandler default parameter in solrconfig.xml as <float name="syn.boost">100</float>.
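
How such a parameter could be read and applied, sketched in plain Java. The request parameters are modelled as a simple map and the boosted clause is rendered as a query string. Only the parameter name syn.boost and its default of 1.0 come from our setup, the rest is assumed for illustration.

```java
import java.util.Map;

// Illustration only: parameter handling and boosting on plain strings.
class SynBoostSketch {
    // Read syn.boost from the parameters, falling back to the default 1.0.
    static float readBoost(Map<String, String> params) {
        return Float.parseFloat(params.getOrDefault("syn.boost", "1.0"));
    }

    // Boost the synonyms of the full original query term.
    static String applyBoost(String clause, float boost) {
        return boost == 1.0f ? clause : "(" + clause + ")^" + boost;
    }
}
```

With the default of 1.0 the clause is left alone, so the boost changes nothing unless it is set explicitly.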

Done.

We hope you enjoyed reading this report.

Best regards, Bernd Fehling
 