Don’t use TextField for your unique key in Solr


This seems obvious once you think about it: TextField is what you use for fuzzy searches in Solr, and why would anyone want a fuzzy search on a unique value? While I can come up with some oddball use cases, copy fields seem like the more valid approach, and they fit the typical use of Solr, i.e. you filter on strings and query on text.
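As a rough illustration of the copy-field approach, here is a minimal sketch of a Solr schema fragment. It is not the schema from the case described in this post: the field names `id` and `id_text`, the type names, and the analyzer chain are all assumptions picked for the example. The idea is that the unique key stays an untokenized StrField, and a separate TextField copy is there for anyone who really wants fuzzy matching on the key's value.

```xml
<!-- Hypothetical schema.xml fragment; field and type names are illustrative only. -->
<schema name="example" version="1.5">
  <types>
    <!-- StrField indexes the whole value as one untokenized term. -->
    <fieldType name="string" class="solr.StrField"/>
    <!-- TextField runs the value through an analyzer and indexes each token. -->
    <fieldType name="text" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <!-- The unique key: exact match only, suitable for filtering. -->
    <field name="id" type="string" indexed="true" stored="true"/>
    <!-- Assumed search-only copy of the key for text-style queries. -->
    <field name="id_text" type="text" indexed="true" stored="false"/>
  </fields>
  <copyField source="id" dest="id_text"/>
  <uniqueKey>id</uniqueKey>
</schema>
```

With a layout like this, lookups and deletes by key go against `id` as a single term, while text-style queries go against `id_text`. How a copy field destination maps onto Cassandra columns in DataStax Enterprise Search depends on the version, so treat this as a starting point and not a drop-in schema.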

However, people have done this a few times and it throws me for a loop every time. In the case of DataStax Enterprise Search (built on Solr), it creates an interesting split between the index and the data.

Given a Cassandra schema of

A Solr Schema of (important bits in bold):

Initial records never get indexed

I’m assuming this is because the part of indexing that checks whether a record has already been visited is thrown off by the tokens:

First fill up a table

Then turn on indexing

Add one more record

Then query via Solr and…no ‘1235’ or ‘1234’

But Cassandra knows all about them

To recap, ‘1234’ and ‘1235’ never got indexed for some reason, while ‘123’ indexed, and when I later added ‘9999’ it indexed fine too. Later testing showed that as soon as I re-added ‘1234’ it joined the search results, so this only appears to happen to records that were there beforehand.

Deletes can greedily remove LOTS

I delete id ‘1234’

But when I query Solr I find only:

So where did ‘1234 4566’, ‘1235’, and ‘1230’ go? If I query Cassandra directly they’re safe and sound, only now Solr has no idea about them.

To recap, this is just nasty, and the only fixes I’ve found are reindexing or adding the records again.

Summary

Just use a StrField type for your key and everything is happy. Special thanks to J.B. Langston (Twitter: https://twitter.com/jblang3) of DataStax for finding the nooks and crannies and then letting me take credit by posting about it.
