Christopher Berner is a programmer and entrepreneur, and currently is co-founder and CTO of Carsabi, a search engine for used cars. He and his co-founder, Dwight Crow, met while in school at UC Berkeley, and have been hacking on side projects ever since. At previous jobs, Christopher has worked on recommendation engines, and other machine learning algorithms. In his spare time, he likes to travel, and build robots. Christopher has posted 1 posts at DZone. You can read more from them at their website. View Full User Profile

Optimizing Solr - Boosting Your Search Speed by 7x!

  • submit to reddit

Apache Solr powers enterprise search on sites from Ebay to Zappos. It also powers Carsabi, but when we reached 1.8M listings per month (passing Autotrader & our basic installation began to run about as fast as an octogenarian in congealing cement. I’d like to share the basics of Solr optimization, as well as some data on real world gains.

Very briefly, our stack has gone through a few iterations which may be sufficient for your corpus volume – no sense in over-engineering. Postgres tables had to be denormalized at 100k vehicles, and we switched to WebSolr’s extremely convenient Solr solution at 300k – their Heroku plugin will create an installation in minutes for just $20/month. This worked very well until about 1M listings, at which point even their beefiest plan was returning results with >800ms latency.


Hardware: Bigger Is Better. A Lot Better

Our previous Solr-as-a-Service had been hosted on an Amazon EC2 Large instance and returned in 800ms. Fortunately, we had spare capacity on an EC2 Cluster Compute Eight Extra Large, which we use for our webcrawler, and just moving to this machine dropped our query time to 282ms – a speed increase of 2.84x. Notice this corresponds to the processor speed increase of 2.75x between a Large and CC8XL, not the 22x gain in total compute units. Memory appears to be equally irrelevant with both the Large and CC8XL easily keeping our 3GB index in RAM. However, do make sure to give Solr sufficient memory by adjusting the JVM heap size via the -Xmx option.

Software: Shard that Sh*t

282ms is pretty good, but I wanted better - Solr was still responsible for over 50% of our user latency. Google was consulted with surprising results: even if you have just one server, you should still shard your workload. This struck me as odd. Surely Solr is multi-threaded, so why the difference? A quick look at top told me this wasn't the case: Solr's CPU usage never went above 100% (on one core) even though our server had 16 physical cores. A couple hours later, and now with our index spread over 8 shards, our query time was down to 43ms! Also, top showed Solr's CPU usage at 483%, so it was clearly using multiple cores.

I ran some benchmarks, and the following results show solid gains up to 8 shards. If you’re interested in how to shard your own index, I’ve published a brief summary here. The test was run on a representative query from our workload using sorting on two dimensions, a geo bounding box, four numeric range filters, one datetime range filter, and a categorical range filter. I don't know if the fact that it levels off after 8 is due to our specific dataset, or the fact that our server has dual 8-core processors – however, if you have any insight please shoot me a line! *edit* or join in the discussion on HN

Published at DZone with permission of its author, Christopher Berner. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)


Yaron Levy replied on Sun, 2012/06/10 - 10:10am

Thanks for this writeup! We're currently using websolr as well, and so far our indexes are small enough that we're not having performance problems, but this looks like a great guide on ways to scale it out.

Question for you: How much faceting/filtering are you using, and have you seen different scaling properties for facet-heavy queries as compared to pure fulltext queries? Thanks!

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.