Calling Scrapy from a Python script

When you need to do some web scraping in Python, an excellent choice is the Scrapy framework. Not only does it take care of most of the networking (HTTP, SSL, proxies, etc.), but it also facilitates the process of extracting data from the web by providing things such as nifty XPath selectors.
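To give an idea of what those selectors look like, here is a tiny sketch using the HtmlXPathSelector API from Scrapy's 0.x series; the XPath expressions are just illustrative, and response stands for the page Scrapy hands to a spider callback:

# Inside a spider callback, assuming `response` is the downloaded page
from scrapy.selector import HtmlXPathSelector

hxs = HtmlXPathSelector(response)
title = hxs.select('//title/text()').extract()   # page title as a list of strings
links = hxs.select('//a/@href').extract()        # all link hrefs on the page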

Scrapy is built upon the Twisted networking engine. A limitation of its core component, the reactor, is that it cannot be restarted. This can cause us some trouble if we are trying to devise a mechanism to run Scrapy spiders independently from a Python script (and not from the Scrapy shell). Say, for example, that we want to implement a Python function that receives some parameters, performs a search or scrape of some sites, and returns a list of scraped items. A naive solution such as the one sketched below will not work, since each call to the function would need the Twisted reactor to be restarted, and this is unfortunately not possible.
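For reference, the naive approach would look roughly like this (a sketch only; MySpider and my_args stand in for your own spider class and its arguments). The second call fails because the Twisted reactor cannot be started a second time:

from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess

def run_spider(spider):
    # Runs the crawl in the current process: the reactor starts with the
    # crawl and stops when it finishes, after which it cannot be restarted.
    crawler = CrawlerProcess(settings)
    crawler.install()
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()

run_spider(MySpider(my_args))  # first call works
run_spider(MySpider(my_args))  # second call fails: the reactor cannot be restarted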

A workaround for this is to run Scrapy in its own process. After searching around, I could not get any existing solution to work with the latest Scrapy. However, one of them used multiprocessing and came pretty close! Here is an updated version for Scrapy 0.13:

from scrapy import project, signals
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.xlib.pydispatch import dispatcher
from multiprocessing import Queue
import multiprocessing

class CrawlerWorker(multiprocessing.Process):
    """Runs a single spider in its own process, so each crawl gets a fresh
    Twisted reactor."""

    def __init__(self, spider, result_queue):
        multiprocessing.Process.__init__(self)
        self.result_queue = result_queue

        # Set up the Scrapy crawler with the project settings; install() only
        # needs to be called if the project does not have a crawler yet.
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()

        # Collect items as they come out of the item pipeline.
        self.items = []
        self.spider = spider
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        # This runs in the child process: crawl, then hand the collected
        # items back to the parent through the queue.
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        self.result_queue.put(self.items)

One way to invoke this, say inside a function, would be:

result_queue = Queue()
crawler = CrawlerWorker(MySpider(myArgs), result_queue)
crawler.start()
for item in result_queue.get():
    yield item

where MySpider is, of course, the class of the spider you want to run, and myArgs are the arguments you wish to invoke the spider with.
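Putting it all together, a complete wrapper function could look something like this (a sketch under the same assumptions; SearchSpider and its query argument are hypothetical placeholders for your own spider):

from multiprocessing import Queue

def scrape(query):
    # Start the crawl in a separate process so it gets its own reactor,
    # then block until the child puts its list of scraped items on the queue.
    result_queue = Queue()
    crawler = CrawlerWorker(SearchSpider(query), result_queue)
    crawler.start()
    for item in result_queue.get():
        yield item

# Example usage:
items = list(scrape("some search terms"))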

12 thoughts on "Calling Scrapy from a Python script"

  1. You do not set any environment variable?
    I'm just new to Scrapy and still get an error.
    "crawler = CrawlerWorker(MySpider('url=http://www.example.com'), result_queue)"

    What should MySpider be? The class name? The project name? The name of the crawler (name="myspider" in the class)?

    Regards,

  2. It only works with one process running…

    When I run this code for two or more processes concurrently:

    for spider in spiders:
        crawler = CrawlerWorker(spider(myArgs), result_queue)
        crawler.start()

    I get errors from Twisted:

    Unhandled Error
    Traceback (most recent call last):
      File "/usr/lib64/python2.7/site-packages/twisted/python/log.py", line 84, in callWithLogger
        return callWithContext({"system": lp}, func, *args, **kw)
      File "/usr/lib64/python2.7/site-packages/twisted/python/log.py", line 69, in callWithContext
        return context.call({ILogContext: newCtx}, func, *args, **kw)
      File "/usr/lib64/python2.7/site-packages/twisted/python/context.py", line 118, in callWithContext
        return self.currentContext().callWithContext(ctx, func, *args, **kw)
      File "/usr/lib64/python2.7/site-packages/twisted/python/context.py", line 81, in callWithContext
        return func(*args, **kw)
    --- <exception caught here> ---
      File "/usr/lib64/python2.7/site-packages/twisted/internet/posixbase.py", line 631, in _doReadOrWrite
        why = selectable.doWrite()
      File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 1094, in doWrite
        raise RuntimeError, "doWrite called on a %s" % reflect.qual(self.__class__)
    exceptions.RuntimeError: doWrite called on a twisted.internet.tcp.Port

  3. The Twisted errors in the example above were eliminated by setting WEBSERVICE_ENABLED and TELNETCONSOLE_ENABLED to False. Now I can run any number of processes, each with its own spider, without errors.

    1. Hi Serg, I am getting the same error:

      File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 619, in _doReadOrWrite
        why = selectable.doWrite()
      '_SIGCHLDWaker' object has no attribute 'doWrite'

      How do you set WEBSERVICE_ENABLED and TELNETCONSOLE_ENABLED to False, please? Just in the crawler's settings.py, like this?

      WEBSERVICE_ENABLED = False
      TELNETCONSOLE_ENABLED = False

      I tried, but I still randomly get the error when there are many processes, each with its own spider.

  4. Hi Alan,

    I am learning Scrapy and Python; basically, I am a Java developer. I am using the Eclipse PyDev IDE for this development, so I need to install Scrapy in my Eclipse. Please help me out with how to achieve this.

  5. Hey Alan, thanks for the example! I'm building a spider for my site Spoots to crawl all pages and get social stats, but on Mac it is not easy to install :'(

    Luckily, on my server everything goes well.

    Thanks!
    Emiliano
