Indexing Special Terms Using Solr

Indexing special terms can tricky in Solr. Is that period the end of a sentence, or the start of a technology name? Continue reading to see how to tell the difference and provide stellar, fast solr search results to your customers.

Problem

If you use Solr for any technical corpus, you will soon need to know how to perform indexing special terms using Solr. Solr supplies some really convenient field types in their default schema.xml. If you have used text-general to index any document with special terms (words) you have probably experienced the frustration with missing hyphenated terms. Also, special terms like computer skills are not indexed correctly.

Requirement

Index documents like resumes and job descriptions, which contain terms that include punctuation. Preserve the punctuation on terms that need to be indexed, but remove similar punctuation else where so other terms are also properly indexed.

Solution

Before we look at the solution, let's consider how Solr's fieldType element works. As you would expect, I am going to point you to the official Solr Wiki on Analyzers, Tokenizers and TokenFilters. Although you can put your tokenizers, filters and charFilter elements in any order you choose, Solr will process them in this order.

charFilter
analyzer
filter

text_general Breakdown

Let us break down the out-of-the-box text_general fieldType as copied from the default schema.xml that ships with Solr 4.0.

<!-- A general text field that has reasonable, generic
cross-language defaults: it tokenizes with StandardTokenizer,
removes stop words from case-insensitive "stopwords.txt"
(empty by default), and down cases. At query time only, it
also applies synonyms. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
	<analyzer type="index">
		<tokenizer class="solr.StandardTokenizerFactory"/>
		<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
		<!-- in this example, we will only use synonyms at query time
		<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
		-->
		<filter class="solr.LowerCaseFilterFactory"/>
	</analyzer>
	<analyzer type="query">
		<tokenizer class="solr.StandardTokenizerFactory"/>
		<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
		<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
		<filter class="solr.LowerCaseFilterFactory"/>
	</analyzer>

Index Analyzer

The Index Analyzer defines the path of parsing data to index it. This happens when the document is added to the Solr core. This is the breakdown of how the indexing is performed for text_general.

The solr.StandardTokenizerFactory is used to tokenize (aka break apart) the text into individual elements (tokens).
The solr.StopFilterFactory reads the file stopwords.txt, and removes any tokens that exists in the stopwords.txt file.
The solr.LowerCaseFilterFactory simply converts all the tokens to lower case, as the name implies.

Query analyzer

The solr.StandardTokenizerFactory is used to tokenize (aka break apart) the text into individual elements (tokens).
The solr.StopFilterFactory reads the file stopwords.txt, and removes any tokens that exists in the stopwords.txt file.
The solr.SynonymFilterFactory reads the file synonyms.txt and expands or replaces matched tokens.
The solr.LowerCaseFilterFactory simply converts all the tokens to lower case, as the name implies.

Effects on Index

Figure 1

The reason why this default field type wreaks havoc on our special terms, is because the StandardTokenizerFactory splits (tokenizes) on whitespace and punctuation. Lets go ahead and call out some specific terms and focus on the effects of these analyzers.

C++ and C#

When you need to know how Solr is parsing terms for indexing and querying, you should use the "Analyzer" feature under the individual cores. In this figure, you can see how a field that is text_general reduces the term C# to just "c". Those of you know know your "C's" can attest that C, C++ and C# are not the same thing, and you would not want the three treated the same while searching.

text_tech breakdown

Now we are going to create a new fieldType called text_tech. This will be a fieldType that is more friendly with technical terms, and good at parsing documents with terms like Oracle 8i/9i/10g/11g and DB2/UDB.

<fieldType name="text_tech" class="solr.TextField" positionIncrementGap="100">
	<analyzer type="index">
		<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\.\s)" replacement=" " />
		<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\.$)" replacement="" />
		<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(,)" replacement=" " />
		<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(;)" replacement=" " />
		<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\|)" replacement=" " />
		<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\/)" replacement=" " />
		<tokenizer class="solr.WhitespaceTokenizerFactory"/>
		<filter class="solr.SynonymFilterFactory" synonyms="punctuation-whitelist.txt" ignoreCase="true" expand="false"/>
		<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
		<filter class="solr.LowerCaseFilterFactory"/>
	</analyzer>
	<analyzer type="query">
		<tokenizer class="solr.WhitespaceTokenizerFactory"/>
		<filter class="solr.SynonymFilterFactory" synonyms="punctuation-whitelist.txt" ignoreCase="true" expand="false"/>
		<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
		<filter class="solr.LowerCaseFilterFactory"/>
	</analyzer>
</fieldType>

Contents of punctuation-whitelist.txt

c++ => cplusplus
C# => csharp
.NET => dotnet

Index Analyzer

The Index Analyzer defines the path of parsing data to index it. This happens when the document is added to the Solr core. This is the breakdown of how the indexing is performed for text_tech.

The series of charFilter PatternReplaceCharFilterFactory lines take common punctuation patterns that we know are not patterns
1. (\.\s) matches end of sentences where there is a period followed by a space.
2. (\.$) matches end of sentences where there is a period followed by the end of a line.
3. (,) matches any comma. This is typically used to separate skills in a skill listing.
4. (;) matches semi-colons.
5. (\|) matches a vertical pipe.
6. (\/) matches a front slash.
The solr.WhiteSpaceTokenizerFactory is used to tokenize (aka break apart) the text into individual elements (tokens). This only breaks on white space leaving all punctuation intact.
The solr.SynonymFilterFactory reads the file punctuation-whitelist.txt and expands or replaces matched tokens.
The solr.StopFilterFactory reads the file stopwords.txt, and removes any tokens that exists in the stopwords.txt file.
The solr.LowerCaseFilterFactory simply converts all the tokens to lower case, as the name implies.

Query analyzer

The solr.WhiteSpaceTokenizerFactory is used to tokenize (aka break apart) the text into individual elements (tokens). This only breaks on white space leaving all punctuation intact.
The solr.SynonymFilterFactory reads the file punctuation-whitelist.txt and expands or replaces matched tokens.
The solr.StopFilterFactory reads the file stopwords.txt, and removes any tokens that exists in the stopwords.txt file.
The solr.LowerCaseFilterFactory simply converts all the tokens to lower case, as the name implies.

Effects on Index

Figure 2

The reason why the index and query parser now respect the special characters is attributed mostly to the use of the WhiteSpaceTokenizerFactory since it does not strip out all of the punctuation during the tokenization process.

Since there is not a perfect tokenizer for knowing which punctuation we do and do not want to keep, we help is a little but using the charFilter. We take common punctuation that is not commonly associated with technical terms, and just convert it to whitespace so the tokenizer can do all the heavy lifting. As you can see, using regular expressions goes a long way. Knowing regular expressions is a must if you are to succeed in the world of big data and large text processing.

Figure 3

So C# gets conveniently converted to csharp thanks to puncutation-whitelist.txt and the SynonymFilterFactory. You should know that the synonyms are converted for the process of indexing and querying only. A search in Solr for C# with highlighting enabled accurately highlights the searched and matched term, "C#" in this case, even though our token was converted to "csharp" during indexing and our query term was converted to "csharp" for the actual query.

Conclusion

The ability to define you own fieldTypes in Solr is a very powerful feature. The use of charFilter, tokenizer and filter allow you to slice and dice your text anyway you see fit. Using regular expressions along with the built in Solr filters should allow you to tackle just about any use case you come across.

Solr is a great Lucene wrapper that allows the developer to focus on the system and product being developed, while exerting less energy on the search engine. Is a Solr Implementation easy? It is not. But, it is easier than trying to build your own search algorithms. And, it is really fast once you get it dialed in.

Ready To Get Started

Whether you need help configuring your Java App container connection pool, or require the delivery of a fully functional Solr core, the staff at Prowave Consulting can assist with 1% or 100% of your development needs. Our flexible service offering can work within your framework, from acting as a consulting arm to your development team to providing a statement of work to develop and implement an entire production environment for your organization. For help or to schedule a consultation, we can be reached at 904-302-7734 or by using our Contact Form.