We’re big fans of the Lucene search engine at Building Blocks, and in particular Solr and Nutch. We regularly have to set up and integrate new instances, so we documented the process on our intranet, and we think others may find it useful.

The search engine consists of two parts:

Solr – the search engine interface to the Apache Lucene search library
Nutch – the open source web crawler used to index web content.

First off, let’s install Solr and Nutch.

Solr

If you’re on Windows, the easiest thing to do is to get LucidImagination’s pre-built Solr installer from here: http://www.lucidimagination.com/Downloads/Lucidworks-for-Solr/Installer .

If you’re on Mac or Linux, you can grab the latest build of Solr from http://hudson.zones.apache.org/hudson/job/Solr-trunk/lastSuccessfulBuild/artifact/trunk/solr/dist/ .

Follow the setup (or extract the tgz file) and then start Solr:

  • With LucidImagination (Windows) – double click the start icon on your desktop
  • OSX – in a terminal window, navigate to /solr/example/ and run:
java -jar start.jar

Now browse to http://localhost:8983/solr/admin/ and verify that you are up and running; you should see the Solr admin screen. If you get errors, have a look in the console, which should give you some detail.
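If you’d rather script this check than open a browser, a minimal Python sketch along these lines could do it. The function name solr_is_up, the opener parameter and the 5 second timeout are our own choices for illustration; the URL is the admin address given above.

```python
import urllib.error
import urllib.request

# Admin URL from the text above; change it if your Solr runs elsewhere.
SOLR_ADMIN_URL = "http://localhost:8983/solr/admin/"


def solr_is_up(url=SOLR_ADMIN_URL, opener=urllib.request.urlopen, timeout=5):
    """Return True if the URL answers with HTTP 200, False if the request fails.

    `opener` is a hypothetical hook so the check can be exercised without a
    running server; by default it is the standard urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts and HTTP error responses.
        return False
```

Calling solr_is_up() with no arguments checks the default address; a False result means it is time to look at the Solr console.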

Nutch

Grab the latest build of Nutch (make sure you get v1.0 or later) from http://nutch.apache.org/ . (Update – I wrote this post using Nutch 1.1. As of Nutch 1.3 a few things changed, so I’ll be creating an updated version soon.) After unzipping it, you may need to set the JAVA_HOME and NUTCH_JAVA_HOME environment variables. On Windows, just add them in your Environment Variables box in Computer Properties, pointing them to the location of the JRE on your machine. On OSX, issue the following commands in a terminal:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/
export NUTCH_JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/
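
To confirm that both variables are visible and point at a real directory, something like the following Python sketch may help. The helper name missing_java_vars is hypothetical, and it only checks that each path exists, not that it contains a working JRE.

```python
import os

# The two variables mentioned in the text above.
REQUIRED_VARS = ("JAVA_HOME", "NUTCH_JAVA_HOME")


def missing_java_vars(env=None):
    """Return the required variables that are unset or do not name a directory."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not os.path.isdir(env.get(name, ""))]
```

An empty list from missing_java_vars() means both variables are set in the current environment.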

There is some more detailed information about running Nutch on Windows at http://zillionics.com/resources/articles/NutchGuideForDummies.htm

Before indexing any data, you need to set some default properties on Nutch. To do this, open the nutch-site.xml file in the conf directory and add the following:

<?xml version="1.0"?>
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch-solr-integration</value>
</property>
<property>
<name>generate.max.per.host</name>
<value>100</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
</configuration>

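The http.agent.name property sets the User-Agent header that Nutch sends when it requests pages from a site. The generate.max.per.host property caps the number of URLs taken from a single host in each fetch list, and plugin.includes lists the plugins Nutch loads.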
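
A quick way to double check what ended up in the file is to read the properties back. The sketch below uses Python’s standard XML parser; the function name read_nutch_properties and the shortened SAMPLE_CONFIG string are ours, for illustration only.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the configuration above, used here as sample input.
SAMPLE_CONFIG = """<?xml version="1.0"?>
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch-solr-integration</value>
</property>
<property>
<name>generate.max.per.host</name>
<value>100</value>
</property>
</configuration>"""


def read_nutch_properties(xml_text):
    """Map each <property> name to its value (values stay as strings)."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }
```

To check a real installation, you would pass in the contents of conf/nutch-site.xml instead of the sample string.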

Now change to the /nutch/ directory where you extracted the files and make sure that you can run the following command without errors:

bin/nutch

Indexing

Now we’re ready to start indexing our data. Before we can do that, we need to tell Nutch which sites to crawl. This is done by creating a flat file containing the URLs you wish to spider. Create a new directory in your nutch folder called “urls” and then create a text file within it called seed.txt. In that file put a list of websites, e.g.:

http://www.building-blocks.com/

Now you need to amend the conf/crawl-urlfilter.txt file to include your domains. Replace MY.DOMAIN.NAME with your domain name, e.g.:

+^http://([a-z0-9]*\.)*building-blocks.com/
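
The leading + marks the line as an include rule, and the rest is a regular expression tested against each URL. To see which URLs a rule like this lets through, you can try the expression on its own. The sketch below does that in Python, on the assumption that Python and Java regular expressions behave the same for a pattern this simple; the helper name is_included is ours.

```python
import re

# The include rule for building-blocks.com, minus its leading "+".
INCLUDE_PATTERN = re.compile(r"^http://([a-z0-9]*\.)*building-blocks.com/")


def is_included(url):
    """Return True if the URL matches the include rule."""
    return INCLUDE_PATTERN.search(url) is not None
```

With this pattern, http://www.building-blocks.com/ and http://building-blocks.com/about are accepted, while URLs on other domains, or using https, are not.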

Now you can start your crawl. This is done by issuing the following command:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50

The options are as follows:

  • -dir – the directory to put the crawl in
  • -depth – the number of levels to traverse down from the root page (keep this low initially)
  • -topN – the maximum number of pages to fetch at each level of the crawl
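
If you want to kick off crawls from a script, for example on a schedule, the command can be assembled from those options. This is only a sketch: build_crawl_command and run_crawl are hypothetical helper names, and the defaults simply mirror the command shown above.

```python
import os
import subprocess


def build_crawl_command(seed_dir="urls", crawl_dir="crawl", depth=3, top_n=50):
    """Return the crawl command from the text as an argument list."""
    return [
        "bin/nutch", "crawl", seed_dir,
        "-dir", crawl_dir,
        "-depth", str(depth),
        "-topN", str(top_n),
    ]


def run_crawl(nutch_home, **options):
    """Run the crawl from the Nutch directory; raises if Nutch exits with an error."""
    command = build_crawl_command(**options)
    # Use the full path to bin/nutch so the script can be started from anywhere.
    command[0] = os.path.join(nutch_home, command[0])
    subprocess.run(command, cwd=nutch_home, check=True)
```

For a first run you might call run_crawl with depth=1 and a small top_n to keep the crawl short.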

Now Nutch will go off and spider each URL and build a database of the results. You should be able to watch its progress in the console window. Once it’s done, we’re ready to get the data into Solr.

Pushing data into Solr

Solr is built around the concept of schemas; it needs to know the shape of the data it is going to accept. The schemas are defined in a file called schema.xml. For the purposes of this demo we only need to know that you can define a list of fields within the schema and these fields will be filled with data ready to be searched.

When you installed the default Solr build it will have created a directory called “example” – this is a demo implementation of Solr and is what we will use for our search engine. Change to the example directory and then open up solr/conf/schema.xml.

We need to tell Solr about the fields Nutch stores its data in, so add the following to schema.xml:

<field name="digest" type="text" stored="true" indexed="true"/>
<field name="boost" type="text" stored="true" indexed="true"/>
<field name="segment" type="text" stored="true" indexed="true"/>
<field name="host" type="text" stored="true" indexed="true"/>
<field name="site" type="text" stored="true" indexed="true"/>
<field name="content" type="text" stored="true" indexed="true"/>
<field name="tstamp" type="text" stored="true" indexed="false"/>
<field name="url" type="string" stored="true" indexed="true"/>
<field name="anchor" type="text" stored="true" indexed="false" multiValued="true"/>

Now look for an element called “uniqueKey”. If it exists, change its value to “url”; if it’s not already there, add it:

<uniqueKey>url</uniqueKey>
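
Since a missing field tends to show up only later as an exception during indexing, it can be worth checking the edited schema first. The following Python sketch does that under our own assumptions: check_schema is a hypothetical helper, and SAMPLE_SCHEMA is a cut-down stand-in for a real schema.xml.

```python
import xml.etree.ElementTree as ET

# The field names added to schema.xml above.
NUTCH_FIELDS = {
    "digest", "boost", "segment", "host", "site",
    "content", "tstamp", "url", "anchor",
}

# A deliberately incomplete schema, used only as sample input.
SAMPLE_SCHEMA = """<schema name="example" version="1.2">
  <fields>
    <field name="url" type="string" stored="true" indexed="true"/>
    <field name="content" type="text" stored="true" indexed="true"/>
  </fields>
  <uniqueKey>url</uniqueKey>
</schema>"""


def check_schema(schema_xml):
    """Return (sorted list of missing Nutch fields, uniqueKey value)."""
    root = ET.fromstring(schema_xml)
    declared = {field.get("name") for field in root.iter("field")}
    missing = sorted(NUTCH_FIELDS - declared)
    unique_key = (root.findtext("uniqueKey") or "").strip()
    return missing, unique_key
```

For a schema that is ready for Nutch, the first value should be an empty list and the second should be url.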

Solr is now ready to read the data indexed by Nutch, but we still need some way of getting the data into it. Solr exposes itself as a set of web services, and these web services are configured by something called “requestHandlers”.

We need to add a new requestHandler to tell Solr to listen for requests from Nutch. To do this, open up example/solr/conf/solrconfig.xml and add the following requestHandler alongside the others:


<requestHandler name="/nutch" class="solr.SearchHandler" >
<lst name="defaults">
<str name="defType">dismax</str>
<str name="echoParams">explicit</str>
<float name="tie">0.01</float>
<str name="qf">
content^0.5 anchor^1.0 title^1.2
</str>
<str name="pf">
content^0.5 anchor^1.5 title^1.2 site^1.5
</str>
<str name="fl">
url
</str>
<str name="mm">
2&lt;-1 5&lt;-2 6&lt;90%
</str>
<int name="ps">100</int>
<bool name="hl">true</bool>
<str name="q.alt">*:*</str>
<str name="hl.fl">title url content</str>
<str name="f.title.hl.fragsize">0</str>
<str name="f.title.hl.alternateField">title</str>
<str name="f.url.hl.fragsize">0</str>
<str name="f.url.hl.alternateField">url</str>
<str name="f.content.hl.fragmenter">regex</str>
</lst>
</requestHandler>

Now Solr is ready to accept Nutch requests. Before continuing, make sure that Solr is running!

With Solr running, you can push your Nutch data into it by running the following command:

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments

(This assumes you have Solr running at the default location of http://127.0.0.1:8983/solr/)

After running this command, watch your Solr console window and make sure that you don’t see any Java exceptions being thrown. If you do, scroll up and review the error message – it will usually be an error in your Solr config.

If you don’t see any errors, your data has been indexed and you’re ready to search!

Searching

Solr comes with a default web interface which allows you to run test searches. Access it at http://localhost:8983/solr/admin/. Enter some text into the QueryString box and hit “Search”. If your query matched any results you should see an XML file containing the indexed pages of your websites.

That’s it! Now all you have to do is write something to talk to Solr from your application, and you have an enterprise-ready search engine capable of indexing millions of websites on the internet.
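
As a starting point for that application code, here is a rough Python sketch that queries Solr over HTTP and pulls the page URLs out of the response. It assumes the default select handler and the JSON response writer (wt=json) are enabled, as they are in the example configuration we used; the function names and the handler parameter are ours, not part of Solr.

```python
import json
import urllib.parse
import urllib.request

# Default Solr location used throughout this post.
SOLR_BASE_URL = "http://localhost:8983/solr"


def build_search_url(query, rows=10, handler="select", base_url=SOLR_BASE_URL):
    """Build a search URL that asks Solr for a JSON response."""
    params = urllib.parse.urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/{handler}?{params}"


def extract_urls(response_text):
    """Return the url field of every document in a Solr JSON response."""
    payload = json.loads(response_text)
    return [doc["url"] for doc in payload["response"]["docs"]]


def search(query, rows=10):
    """Run the query against a live Solr instance and return matching page URLs."""
    with urllib.request.urlopen(build_search_url(query, rows), timeout=10) as response:
        return extract_urls(response.read().decode("utf-8"))
```

Passing handler="nutch" to build_search_url should target the /nutch request handler defined earlier instead of the default one.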

For more information on Solr and Nutch, we recommend visiting the following sites:

There’s also an excellent book on Solr: Solr 1.4 Enterprise Search Server by David Smiley.
