Indexing web sites in Solr with Python

In this post I will show a simple yet effective way of indexing web sites into a Solr index, using Scrapy and Python.

We see a lot of advanced Solr-based applications, with sophisticated custom data pipelines that combine data from multiple sources, or that have large-scale requirements. Equally, we often see people who want to start implementing search in a minimally invasive way, using existing websites as integration points rather than implementing a deep integration with particular CMSes or databases which may be maintained by other groups in an organisation.

While crawling websites sounds fairly basic, you soon find that there are gotchas: with the mechanics of crawling but, more importantly, with the structure of websites. If you simply parse the HTML and index the text, you will index a lot of text that is not actually relevant to the page: navigation sections, headers and footers, ads, links to related pages. Trying to clean that up afterwards is often not effective; you’re much better off preventing that cruft from going into the index in the first place. That involves parsing the content of the web page and extracting information intelligently.

There’s a great tool for doing this: Scrapy. In this post I will give a simple example of its use. See Scrapy’s tutorial for an introduction and further information.

Crawling

My example site will be my personal blog. I write the blog in Markdown, generate HTML with Jekyll, deploy through git, and host on lighttpd and CloudFront; but none of that makes a difference to how we consume the content: we’ll just crawl the website.

First, prepare to run Scrapy in a Python virtualenv:

PROJECT_DIR=~/projects/scrapy
mkdir $PROJECT_DIR
cd $PROJECT_DIR
virtualenv scrapyenv
source scrapyenv/bin/activate
pip install scrapy

Then to create a Scrapy application, named blog:

scrapy startproject blog
cd blog

The items we want to index are the blog posts; I just use title, URL and text fields:

cat > blog/items.py <<EOM
from scrapy.item import Item, Field

class BlogItem(Item):
    title = Field()
    url = Field()
    text = Field()
EOM

Next I create a simple spider which crawls my site, identifies blog posts by URL structure, and extracts the text from each blog post. The cool thing about this is that we can extract specific parts of the page.

cat > blog/spiders/blogspider.py <<EOM
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from blog.items import BlogItem
from scrapy.item import Item
from urlparse import urljoin
import re

class BlogSpider(BaseSpider):
    name = 'blog'
    allowed_domains = ['www.greenhills.co.uk']
    start_urls = ['http://www.greenhills.co.uk/']

    seen = set()

    def parse(self, response):
        if response.url in self.seen:
            # Already handled (e.g. reached again via a redirect): stop here
            self.log('already seen  %s' % response.url)
            return
        self.log('parsing  %s' % response.url)
        self.seen.add(response.url)

        hxs = HtmlXPathSelector(response)
        if re.match(r'http://www.greenhills.co.uk/\d{4}/\d{2}/\d{2}', response.url):
            item = BlogItem()
            item['title'] = hxs.select('//title/text()').extract()
            item['url'] = response.url
            item['text'] = hxs.select('//section[@id="main"]//child::node()/text()').extract()
            self.log("yielding item " + response.url)
            yield item

        for url in hxs.select('//a/@href').extract():
            url = urljoin(response.url, url)
            if url not in self.seen and not re.search(r'\.(pdf|zip|jar)$', url):
                self.log("yielding request " + url)
                yield Request(url, callback=self.parse)
EOM

The main bit of logic here is the matching: my blog URLs all start with dates (/YYYY/MM/DD/), so I use that to identify blog posts, which I then parse using XPath. Gotchas here are that you need to create absolute URLs from relative paths in HTML A tags (with urljoin), and that I skip links to binary types. I could have used the CrawlSpider and defined rules for extracting/parsing, but with the BaseSpider it’s a bit clearer to see what happens.
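
The URL handling described above can be tried in isolation, without Scrapy. The following is a minimal sketch, assuming Python 3 (where urljoin lives in urllib.parse rather than urlparse). The helper names is_blog_post and links_to_follow, and the example paths, are made up for illustration and are not part of the spider.

```python
import re
from urllib.parse import urljoin

# Same patterns as in the spider above (dots escaped here for strictness).
POST_RE = re.compile(r'http://www\.greenhills\.co\.uk/\d{4}/\d{2}/\d{2}')
BINARY_RE = re.compile(r'\.(pdf|zip|jar)$')


def is_blog_post(url):
    """Return True if the URL path starts with the /YYYY/MM/DD date pattern."""
    return POST_RE.match(url) is not None


def links_to_follow(page_url, hrefs, seen):
    """Resolve hrefs against page_url; drop already-seen URLs and binary files."""
    result = []
    for href in hrefs:
        url = urljoin(page_url, href)  # relative path -> absolute URL
        if url not in seen and not BINARY_RE.search(url):
            result.append(url)
    return result


if __name__ == '__main__':
    # Hypothetical page and links, for illustration only
    page = 'http://www.greenhills.co.uk/2013/05/22/post.html'
    print(is_blog_post(page))
    print(links_to_follow(page,
                          ['/about.html', 'slides.pdf', '../23/next.html'],
                          seen={'http://www.greenhills.co.uk/about.html'}))
```

Keeping this logic in plain functions makes it easy to check the patterns against a handful of real URLs before starting a crawl.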

To run the crawl:

scrapy crawl blog -o items.json -t json

which produces a JSON file items.json with a list of items like:

{"url": "http://www.greenhills.co.uk/2013/05/22/installing-distributed-solr-4-with-fabric.html",
"text": ["Installing Distributed Solr 4 with Fabric", "22 May 2013", "I wrote an article ", "\u201cInstalling Distributed Solr 4 with Fabric\u201d", "\nabout deploying SolrCloud with ", "Fabric", ".\nCode is on ", "github.com/LucidWorks/solr-fabric", ".", "My ", "VM strategy", "\nand ", "server", " worked great for developing/testing this!", "\u00a9 2011 Martijn Koster \u2014 ", "terms", "\n", "\n"],
"title": ["Installing Distributed Solr 4 with Fabric"]},

Indexing

Next, to get that into Solr, we could use the JSON Request Handler, and just transform the JSON into the appropriate form. But, seeing as we’re using Python, we’ll just use pysolr.
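
For comparison, the JSON route might look roughly like the following. This is only a sketch under assumptions: Python 3, a Solr 4 style /update/json endpoint that accepts a JSON array of documents, and a hypothetical helper name build_update_request. The example URL and field values are placeholders. The code builds the request but does not send it.

```python
import json
from urllib.request import Request


def build_update_request(solr_url, items):
    """Build (but do not send) a POST of the items to Solr's JSON update handler."""
    docs = []
    for item in items:
        doc = dict(item)
        doc['id'] = doc['url']  # use the page URL as the unique document id
        docs.append(doc)
    body = json.dumps(docs).encode('utf-8')
    return Request(
        solr_url.rstrip('/') + '/update/json?commit=true',
        data=body,
        headers={'Content-Type': 'application/json'},
    )


request = build_update_request(
    'http://vm116.lan:8983/solr/collection1',
    [{'url': 'http://www.greenhills.co.uk/2013/05/22/example.html',
      'title': ['Example'], 'text': ['Example text']}],
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```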

To install pysolr:

pip install pysolr

and write the Python code:

cat > inject.py <<EOM
import argparse
import json

import pysolr

parser = argparse.ArgumentParser(description='load json into Solr.')
parser.add_argument('input_file', metavar='input', type=str, help='json input file')
parser.add_argument('solr_url', metavar='url', type=str, help='solr URL')

args = parser.parse_args()
solr = pysolr.Solr(args.solr_url, timeout=10)

# Load the list of items produced by the crawl
with open(args.input_file) as f:
    items = json.load(f)

# Solr needs a unique key; use the page URL as the document id
for item in items:
    item['id'] = item['url']

# Commit explicitly so the documents are searchable straight away
solr.add(items, commit=True)
EOM

This looks too easy, right? The trick is that the attribute names title/url/text in the JSON file match field definitions in the default Solr schema.xml. Note that the text field is configured to be indexed, but not stored; this means you do not get the page content back with your query, and you can’t do things like highlighting.

We do need an “id” field, and we’ll use the URL for that. I could have set that in the crawler, so it would become part of the JSON file, but seeing as it is a Solr-specific requirement (and so I could illustrate simple field mapping) I chose to do it here.
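
The field mapping can be taken a step further if you want more control over what is sent to Solr. As a sketch only: the helpers join_fragments and to_solr_doc below are hypothetical names, not part of inject.py, and whether you want single strings rather than lists of fragments depends on your schema.

```python
import re


def join_fragments(fragments):
    """Join extracted text nodes into one whitespace-normalised string."""
    return re.sub(r'\s+', ' ', ' '.join(fragments)).strip()


def to_solr_doc(item):
    """Map a crawled item (as found in items.json) to a Solr document."""
    return {
        'id': item['url'],  # unique key, taken from the URL
        'url': item['url'],
        'title': join_fragments(item['title']),
        'text': join_fragments(item['text']),
    }


item = {
    'url': 'http://www.greenhills.co.uk/2013/05/22/installing-distributed-solr-4-with-fabric.html',
    'title': ['Installing Distributed Solr 4 with Fabric'],
    'text': ['I wrote an article ', '\nabout deploying SolrCloud with ', 'Fabric', '.'],
}
print(to_solr_doc(item))
```

Joining with a space can leave a stray space before punctuation, which is harmless for tokenised search but worth knowing if you later store and display the text.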

The URL in the command below points at a Solr 4 server on the host “vm116.lan” on my LAN. Adjust hostname and port number to match yours. See the Solr 4.3.0 tutorial for details on how to run Solr.

To run (change the URL to point to your Solr instance):

python inject.py items.json http://vm116.lan:8983/solr/collection1

and to query, point your browser to http://vm116.lan:8983/solr/collection1/browse and search for Fabric, which should find the Installing Distributed Solr 4 with Fabric post. Or use the lower-level Solr query page http://vm116.lan:8983/solr/#/collection1/query and do a query for text:Fabric.
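
If you would rather query from Python than from the browser, one option is to build a URL for Solr's select handler. This is a sketch assuming Python 3 and the standard /select handler with wt=json; the helper name select_url is made up for illustration.

```python
from urllib.parse import urlencode


def select_url(solr_url, query, rows=10):
    """Build a URL for Solr's select handler that returns JSON results."""
    params = urlencode({'q': query, 'wt': 'json', 'rows': rows})
    return solr_url.rstrip('/') + '/select?' + params


url = select_url('http://vm116.lan:8983/solr/collection1', 'text:Fabric')
print(url)
# To run the query: json.load(urllib.request.urlopen(url))
```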

In this example we’re crawling and indexing in separate stages; you may want to inject directly from the crawler.
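
One way to inject directly from the crawler is a Scrapy item pipeline. The following is a rough sketch, not a tested integration: the class name SolrPipeline, the batch size, and the SOLR_URL setting are all assumptions you would adapt, and the pipeline still has to be enabled through Scrapy's ITEM_PIPELINES setting.

```python
class SolrPipeline(object):
    """Scrapy item pipeline that buffers items and sends them to Solr in batches."""

    def __init__(self, solr, batch_size=100):
        self.solr = solr  # any object with an add(docs, commit=...) method
        self.batch_size = batch_size
        self.buffer = []

    @classmethod
    def from_crawler(cls, crawler):
        import pysolr  # imported lazily so the class can be tested without it
        # SOLR_URL is a hypothetical setting you would add to settings.py
        return cls(pysolr.Solr(crawler.settings.get('SOLR_URL'), timeout=10))

    def process_item(self, item, spider):
        doc = dict(item)
        doc['id'] = doc['url']  # same field mapping as in inject.py
        self.buffer.append(doc)
        if len(self.buffer) >= self.batch_size:
            self.flush()
        return item  # pass the item on to any later pipelines

    def close_spider(self, spider):
        self.flush()  # send whatever is left when the crawl ends

    def flush(self):
        if self.buffer:
            self.solr.add(self.buffer, commit=True)
            self.buffer = []
```

Passing the Solr client into the constructor keeps the class easy to exercise with a stand-in object, while from_crawler builds the real pysolr client when Scrapy runs it.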

License and Disclaimer

The code snippets above are covered by:

The MIT License (MIT)

Copyright (c) 2013 Martijn Koster

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
