Web Service

Scrapy comes with a built-in web service for monitoring and controlling a running crawler. The service exposes most resources using the JSON-RPC 2.0 protocol, but there are also other (read-only) resources which just output JSON data.

The web service is extensible and is used to manage a Scrapy process. It’s enabled by the WEBSERVICE_ENABLED setting. The web server listens on the port specified in WEBSERVICE_PORT and logs to the file specified in WEBSERVICE_LOGFILE.

The web service is a built-in Scrapy extension which comes enabled by default, but you can also disable it if you’re running tight on memory.

Web service resources

The web service contains several resources, defined in the WEBSERVICE_RESOURCES setting. Each resource provides a different functionality. See Available JSON-RPC resources for a list of resources available by default.

Although you can implement your own resources using any protocol, there are two kinds of resources bundled with Scrapy:

  • Simple JSON resources - which are read-only and just output JSON data
  • JSON-RPC resources - which provide direct access to certain Scrapy objects using the JSON-RPC 2.0 protocol

Available JSON-RPC resources

These are the JSON-RPC resources available by default in Scrapy:

Crawler JSON-RPC resource

class scrapy.contrib.webservice.crawler.CrawlerResource

Provides access to the main Crawler object that controls the Scrapy process.

Available by default at: http://localhost:6080/crawler

Stats Collector JSON-RPC resource

class scrapy.contrib.webservice.stats.StatsResource

Provides access to the Stats Collector used by the crawler.

Available by default at: http://localhost:6080/stats

Spider Manager JSON-RPC resource

You can access the spider manager JSON-RPC resource through the Crawler JSON-RPC resource at: http://localhost:6080/crawler/spiders

Extension Manager JSON-RPC resource

You can access the extension manager JSON-RPC resource through the Crawler JSON-RPC resource at: http://localhost:6080/crawler/extensions
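
As an illustration of what a JSON-RPC 2.0 call to one of these resources looks like on the wire, the following sketch builds the request body for a call to the get_stats method on the stats resource. The helper name build_jsonrpc_request and the fixed request id are assumptions made for this example; the envelope members (jsonrpc, method, params, id) come from the JSON-RPC 2.0 specification.

```python
import json


def build_jsonrpc_request(method, params=(), request_id=1):
    """Return the JSON text of a JSON-RPC 2.0 request (hypothetical helper)."""
    payload = {
        "jsonrpc": "2.0",        # protocol version marker required by the spec
        "method": method,        # name of the method to call on the wrapped object
        "params": list(params),  # positional arguments for that method
        "id": request_id,        # lets the client match the response to the request
    }
    return json.dumps(payload, sort_keys=True)


# Body that a client would send to http://localhost:6080/stats
body = build_jsonrpc_request("get_stats")
print(body)
```

The same envelope applies to the other JSON-RPC resources; only the URL, the method name, and the parameters change.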

Available JSON resources

These are the JSON resources available by default:

Engine status JSON resource

class scrapy.contrib.webservice.enginestatus.EngineStatusResource

Provides access to engine status metrics.

Available by default at: http://localhost:6080/enginestatus
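
Because simple JSON resources are read-only and just output JSON data, they can be read with any HTTP client. A minimal sketch using the Python 3 standard library follows; resource_url and fetch_json are names made up for this example, and the fetch itself is left commented out because it needs a running crawler.

```python
import json
from urllib.request import urlopen


def resource_url(path, host="localhost", port=6080):
    """Build the URL of a web service resource from its path."""
    return "http://%s:%d/%s" % (host, port, path.lstrip("/"))


def fetch_json(url, timeout=5):
    """GET a URL and decode the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


url = resource_url("enginestatus")
print(url)
# With a crawler running, the engine status could be read with:
# status = fetch_json(url)
```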

Web service settings

These are the settings that control the web service behaviour:

WEBSERVICE_ENABLED

Default: True

A boolean which specifies if the web service will be enabled (provided its extension is also enabled).

WEBSERVICE_LOGFILE

Default: None

A file to use for logging HTTP requests made to the web service. If unset, the log is sent to the standard Scrapy log.

WEBSERVICE_PORT

Default: [6080, 7030]

The port range to use for the web service. If set to None or 0, a dynamically assigned port is used.

WEBSERVICE_HOST

Default: '0.0.0.0'

The interface the web service should listen on.

WEBSERVICE_RESOURCES

Default: {}

The list of web service resources enabled for your project. See Web service resources. These are added to the ones available by default in Scrapy, defined in the WEBSERVICE_RESOURCES_BASE setting.

WEBSERVICE_RESOURCES_BASE

Default:

{
    'scrapy.contrib.webservice.crawler.CrawlerResource': 1,
    'scrapy.contrib.webservice.enginestatus.EngineStatusResource': 1,
    'scrapy.contrib.webservice.stats.StatsResource': 1,
}

The list of web service resources available by default in Scrapy. You shouldn’t change this setting in your project; change WEBSERVICE_RESOURCES instead. If you want to disable a resource, set its value to None in WEBSERVICE_RESOURCES.
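
To make the interaction between the two settings concrete, here is a rough sketch of how a project value could be combined with the base value: project entries override base entries, and entries set to None are dropped. The function effective_resources and the myproject.webservice.MyResource path are hypothetical, and this illustrates the documented behaviour rather than reproducing Scrapy's own code.

```python
WEBSERVICE_RESOURCES_BASE = {
    'scrapy.contrib.webservice.crawler.CrawlerResource': 1,
    'scrapy.contrib.webservice.enginestatus.EngineStatusResource': 1,
    'scrapy.contrib.webservice.stats.StatsResource': 1,
}

# Project setting: add one resource (hypothetical path) and disable another.
WEBSERVICE_RESOURCES = {
    'myproject.webservice.MyResource': 1,
    'scrapy.contrib.webservice.enginestatus.EngineStatusResource': None,
}


def effective_resources(base, project):
    """Merge both dicts and return the enabled resource paths, sorted by name."""
    merged = dict(base)
    merged.update(project)  # project values take precedence over base values
    return sorted(path for path, value in merged.items() if value is not None)


enabled = effective_resources(WEBSERVICE_RESOURCES_BASE, WEBSERVICE_RESOURCES)
for path in enabled:
    print(path)
```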

Writing a web service resource

Web service resources are implemented using the Twisted Web API. See this Twisted Web guide for more information on Twisted web and Twisted web resources.

To write a web service resource you should subclass the JsonResource or JsonRpcResource classes and implement the render_GET method.

class scrapy.webservice.JsonResource

A subclass of twisted.web.resource.Resource that implements a JSON web service resource.

ws_name

The name by which the Scrapy web service will know this resource, and also the path where this resource will listen. For example, assuming the Scrapy web service is listening on http://localhost:6080/ and the ws_name is 'resource1', the URL for that resource will be: http://localhost:6080/resource1/

class scrapy.webservice.JsonRpcResource(crawler, target=None)

This is a subclass of JsonResource for implementing JSON-RPC resources. JSON-RPC resources wrap Python (Scrapy) objects around a JSON-RPC API. The wrapped object must be returned by the get_target() method, which by default returns the target passed to the constructor.

get_target()

Return the object wrapped by this JSON-RPC resource. By default, it returns the object passed to the constructor.
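
The idea of wrapping a Python object behind a JSON-RPC API can be sketched without Scrapy or Twisted. In the example below, dispatch looks up the requested method on a target object and returns a JSON-RPC 2.0 response. FakeStats, dispatch, and the error handling are assumptions for illustration and do not reproduce how JsonRpcResource is implemented.

```python
class FakeStats(object):
    """Stand-in for the object returned by get_target() (hypothetical)."""

    def __init__(self):
        self._stats = {"item_scraped_count": 3}

    def get_value(self, key, default=None):
        return self._stats.get(key, default)


def dispatch(target, request):
    """Call request['method'] on target and build a JSON-RPC 2.0 response."""
    response = {"jsonrpc": "2.0", "id": request.get("id")}
    method = getattr(target, request["method"], None)
    if not callable(method):
        # -32601 is the "Method not found" code from the JSON-RPC 2.0 spec
        response["error"] = {"code": -32601, "message": "Method not found"}
        return response
    response["result"] = method(*request.get("params", []))
    return response


request = {"jsonrpc": "2.0", "method": "get_value",
           "params": ["item_scraped_count"], "id": 1}
print(dispatch(FakeStats(), request))
```

In this sketch the target plays the role of the object returned by get_target(): swapping in a different object changes which methods the resource exposes.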

Examples of web service resources

StatsResource (JSON-RPC resource)

from scrapy.webservice import JsonRpcResource

class StatsResource(JsonRpcResource):

    ws_name = 'stats'

    def __init__(self, crawler):
        JsonRpcResource.__init__(self, crawler, crawler.stats)

EngineStatusResource (JSON resource)

from scrapy.webservice import JsonResource
from scrapy.utils.engine import get_engine_status

class EngineStatusResource(JsonResource):

    ws_name = 'enginestatus'

    def __init__(self, crawler, spider_name=None):
        JsonResource.__init__(self, crawler)
        self._spider_name = spider_name
        self.isLeaf = spider_name is not None

    def render_GET(self, txrequest):
        status = get_engine_status(self.crawler.engine)
        if self._spider_name is None:
            return status
        for sp, st in status['spiders'].items():
            if sp.name == self._spider_name:
                return st

    def getChild(self, name, txrequest):
        # The constructor signature is (crawler, spider_name)
        return EngineStatusResource(self.crawler, name)

Example of web service client

scrapy-ws.py script

#!/usr/bin/env python
"""
Example script to control a Scrapy server using its JSON-RPC web service.

It only provides a reduced functionality as its main purpose is to illustrate
how to write a web service client. Feel free to improve or write your own.

Also, keep in mind that the JSON-RPC API is not stable. The recommended way for
controlling a Scrapy server is through the execution queue (see the "queue"
command).

"""

import sys, optparse, urllib, json
from urlparse import urljoin

from scrapy.utils.jsonrpc import jsonrpc_client_call, JsonRpcError

def get_commands():
    return {
        'help': cmd_help,
        'stop': cmd_stop,
        'list-available': cmd_list_available,
        'list-running': cmd_list_running,
        'list-resources': cmd_list_resources,
        'get-global-stats': cmd_get_global_stats,
        'get-spider-stats': cmd_get_spider_stats,
    }

def cmd_help(args, opts):
    """help - list available commands"""
    print "Available commands:"
    for _, func in sorted(get_commands().items()):
        print "  ", func.__doc__

def cmd_stop(args, opts):
    """stop <spider> - stop a running spider"""
    jsonrpc_call(opts, 'crawler/engine', 'close_spider', args[0])

def cmd_list_running(args, opts):
    """list-running - list running spiders"""
    for x in json_get(opts, 'crawler/engine/open_spiders'):
        print x

def cmd_list_available(args, opts):
    """list-available - list name of available spiders"""
    for x in jsonrpc_call(opts, 'crawler/spiders', 'list'):
        print x

def cmd_list_resources(args, opts):
    """list-resources - list available web service resources"""
    for x in json_get(opts, '')['resources']:
        print x

def cmd_get_spider_stats(args, opts):
    """get-spider-stats <spider> - get stats of a running spider"""
    stats = jsonrpc_call(opts, 'stats', 'get_stats', args[0])
    for name, value in stats.items():
        print "%-40s %s" % (name, value)

def cmd_get_global_stats(args, opts):
    """get-global-stats - get global stats"""
    stats = jsonrpc_call(opts, 'stats', 'get_stats')
    for name, value in stats.items():
        print "%-40s %s" % (name, value)

def get_wsurl(opts, path):
    return urljoin("http://%s:%s/"% (opts.host, opts.port), path)

def jsonrpc_call(opts, path, method, *args, **kwargs):
    url = get_wsurl(opts, path)
    return jsonrpc_client_call(url, method, *args, **kwargs)

def json_get(opts, path):
    url = get_wsurl(opts, path)
    return json.loads(urllib.urlopen(url).read())

def parse_opts():
    usage = "%prog [options] <command> [arg] ..."
    description = "Scrapy web service control script. Use '%prog help' " \
        "to see the list of available commands."
    op = optparse.OptionParser(usage=usage, description=description)
    op.add_option("-H", dest="host", default="localhost", \
        help="Scrapy host to connect to")
    op.add_option("-P", dest="port", type="int", default=6080, \
        help="Scrapy port to connect to")
    opts, args = op.parse_args()
    if not args:
        op.print_help()
        sys.exit(2)
    cmdname, cmdargs, opts = args[0], args[1:], opts
    commands = get_commands()
    if cmdname not in commands:
        sys.stderr.write("Unknown command: %s\n\n" % cmdname)
        cmd_help(None, None)
        sys.exit(1)
    return commands[cmdname], cmdargs, opts

def main():
    cmd, args, opts = parse_opts()
    try:
        cmd(args, opts)
    except IndexError:
        print cmd.__doc__
    except JsonRpcError, e:
        print str(e)
        if e.data:
            print "Server Traceback below:"
            print e.data


if __name__ == '__main__':
    main()
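
The error branch in main() relies on the shape of JSON-RPC 2.0 responses: a response carries either a result member or an error member with a code, a message, and optional data. Below is a minimal sketch of that decoding step, written with only the standard library. RpcError and parse_jsonrpc_response are names invented here and are not part of scrapy.utils.jsonrpc.

```python
import json


class RpcError(Exception):
    """Error returned by the server in a JSON-RPC response (hypothetical class)."""

    def __init__(self, code, message, data=None):
        Exception.__init__(self, message)
        self.code = code
        self.message = message
        self.data = data  # e.g. a server-side traceback, when provided


def parse_jsonrpc_response(body):
    """Return the 'result' of a JSON-RPC 2.0 response or raise RpcError."""
    response = json.loads(body)
    error = response.get("error")
    if error:
        raise RpcError(error["code"], error["message"], error.get("data"))
    return response["result"]


ok = '{"jsonrpc": "2.0", "result": ["spider1"], "id": 1}'
print(parse_jsonrpc_response(ok))

failed = ('{"jsonrpc": "2.0", "error": '
          '{"code": -32601, "message": "Method not found"}, "id": 1}')
try:
    parse_jsonrpc_response(failed)
except RpcError as exc:
    print("%s (code %d)" % (exc.message, exc.code))
```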