Solr is a text indexing package. All interaction with it happens by sending GET and POST requests to the service and reading the XML responses.

After you issue the GET that starts an import with Solr's DataImportHandler, you have to poll a status URL. While the import is still running, Solr responds like this:

<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">0</int>
    </lst>
    <lst name="initArgs">
        <lst name="defaults">
            <str name="config">jdbc.xml</str>
        </lst>
    </lst>
    <str name="command">status</str>
    <str name="status">busy</str>
    <str name="importResponse">A command is still running...</str>
    <lst name="statusMessages">
        <str name="Time Elapsed">0:0:4.545</str>
        <str name="Total Requests made to DataSource">1</str>
        <str name="Total Rows Fetched">36262</str>
        <str name="Total Documents Processed">36261</str>
        <str name="Total Documents Skipped">0</str>
        <str name="Full Dump Started">2012-07-11 09:31:03</str>
    </lst>
    <str name="WARNING">This response format is experimental.  It is likely to change in the future.</str>
</response>

After a while, once the import has finished, the same status URL returns a response like this:

<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">0</int>
    </lst>
    <lst name="initArgs">
        <lst name="defaults">
            <str name="config">jdbc.xml</str>
        </lst>
    </lst>
    <str name="command">status</str>
    <str name="status">idle</str>
    <str name="importResponse"/>
    <lst name="statusMessages">
        <str name="Total Requests made to DataSource">1</str>
        <str name="Total Rows Fetched">1000000</str>
        <str name="Total Documents Skipped">0</str>
        <str name="Full Dump Started">2012-07-11 09:23:30</str>
        <str name="">Indexing completed. Added/Updated: 1000000 documents. Deleted 0 documents.</str>
        <str name="Committed">2012-07-11 09:26:01</str>
        <str name="Total Documents Processed">1000000</str>
        <str name="Time taken">0:2:31.95</str>
    </lst>
    <str name="WARNING">This response format is experimental.  It is likely to change in the future.</str>
</response>

But when does it finish? There's no way to tell other than hitting that status URL and watching for it to change. I needed a tool to tell me when importing had finished, so I could use it in my makefile. It just has to check the status until it's completed, and then exit.
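
Comparing the two responses above, the top-level status element goes from "busy" to "idle" and importResponse becomes empty. As a minimal sketch of that check, assuming the (explicitly experimental) response layout stays as shown, the status can be pulled out of the XML like this. The helper name import_status is hypothetical, and the snippet uses REXML, which ships with Ruby, only so that it runs without installing any gems; it is not the library used in the program below.

```ruby
require 'rexml/document'

# Hypothetical helper: returns the text of <str name="status"> directly
# under <response> (for example "busy" or "idle"), or nil if it is missing.
def import_status(xml)
  doc  = REXML::Document.new(xml)
  node = REXML::XPath.first(doc, '/response/str[@name="status"]')
  node && node.text
end

# Trimmed-down versions of the two responses shown above.
busy_xml = '<response><str name="command">status</str><str name="status">busy</str></response>'
idle_xml = '<response><str name="command">status</str><str name="status">idle</str></response>'

puts import_status(busy_xml)
puts import_status(idle_xml)
```

Testing for "idle" rather than for an empty importResponse is just one option; either field changes when the import stops.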

So, I wrote a little program to do the monitoring, using Ruby and the Nokogiri library. Nokogiri is an HTML and XML parser with built-in XPath and CSS selector support; the HTTP fetching itself is done by Ruby's open-uri library.

#!/usr/bin/ruby

require 'rubygems'
require 'nokogiri'
require 'open-uri'

while true
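    # Note: with open-uri on Ruby 3.0 and later, plain open no longer accepts URLs; call URI.open instead.
    doc = Nokogiri::XML( open( 'http://hostname:8080/solr/db/dih?command=status' ) )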

    # While an import is running, importResponse says something like "A command is still running..."
    # It becomes empty once the import has stopped.
    status = doc.xpath( '//response/str[@name="importResponse"]' ).inner_text
    if ( status == '' )
        break
    end

    # Get the import process's elapsed time and record count and display them.
    # Time Elapsed is already formatted as h:m:s, so no unit is appended.
    time_elapsed   = doc.xpath( '//response/lst[@name = "statusMessages"]/str[@name = "Time Elapsed"]' ).inner_text
    docs_processed = doc.xpath( '//response/lst[@name = "statusMessages"]/str[@name = "Total Documents Processed"]' ).inner_text
    puts docs_processed + ' documents in ' + time_elapsed

    sleep(2)
end

I'm not much of a Ruby guy, but this was pretty simple to write. Most of my time went into reading Nokogiri's method listings and reacquainting myself with XPath syntax. The one Ruby gotcha I found was that before Ruby 1.9, if your program uses any Ruby gems, you have to put require 'rubygems' before you require any of them.
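
Since the point of the script is to be called from a makefile, one possible refinement is to give up after a fixed number of checks instead of looping forever. The following is only a sketch of that idea, not part of the program above: wait_until_idle, its interval and max_checks parameters, and the canned status list are all hypothetical, and the HTTP request is replaced by a stand-in so the snippet runs on its own.

```ruby
# Hypothetical helper: calls fetch_status (any callable returning a status
# string such as "busy" or "idle") until it returns "idle" or max_checks
# checks have been made. Returns true if the import went idle, false if not.
def wait_until_idle(fetch_status, interval: 2, max_checks: 1800)
  max_checks.times do
    return true if fetch_status.call == 'idle'
    sleep(interval)
  end
  false
end

# Stand-in for the real status request: reports "busy" twice, then "idle".
canned = %w[busy busy idle]
fake_fetch = -> { canned.shift }

finished = wait_until_idle(fake_fetch, interval: 0)
puts finished ? 'import finished' : 'gave up waiting'
```

In a real script, ending with exit(finished ? 0 : 1) would let make stop the build when the import never completes, because make treats a non-zero exit status as a failure.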