Core API

New in version 0.15.

This section documents the Scrapy core API, and it’s intended for developers of extensions and middlewares.

Crawler API

The main entry point to Scrapy API is the Crawler object, passed to extensions through the from_crawler class method. This object provides access to all Scrapy core components, and it’s the only way for extensions to access them and hook their functionality into Scrapy.
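
As a minimal sketch, a hypothetical extension (the class and module names below are illustrative, not part of Scrapy) would typically receive the crawler in from_crawler and use it to hook into signals:

    from scrapy import signals

    class MyExtension(object):

        def __init__(self, crawler):
            self.crawler = crawler

        @classmethod
        def from_crawler(cls, crawler):
            ext = cls(crawler)
            # hook into Scrapy through the crawler's signal manager
            crawler.signals.connect(ext.spider_opened,
                                    signal=signals.spider_opened)
            return ext

        def spider_opened(self, spider):
            spider.log("MyExtension: spider %s opened" % spider.name)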

The Extension Manager is responsible for loading and keeping track of installed extensions, and it's configured through the EXTENSIONS setting, which contains a dictionary of all available extensions and their orders, similar to how you configure the downloader middlewares.
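
For example, a minimal sketch of an EXTENSIONS setting that enables the hypothetical extension from above with an order value:

    # in your project's settings.py
    EXTENSIONS = {
        'myproject.extensions.MyExtension': 500,
    }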

class scrapy.crawler.Crawler(settings)

The Crawler object must be instantiated with a scrapy.settings.Settings object.

settings

The settings manager of this crawler.

This is used by extensions & middlewares to access the Scrapy settings of this crawler.

For an introduction on Scrapy settings see Settings.

For the API see Settings class.

signals

The signals manager of this crawler.

This is used by extensions & middlewares to hook themselves into Scrapy functionality.

For an introduction on signals see Signals.

For the API see SignalManager class.

stats

The stats collector of this crawler.

This is used by extensions & middlewares to record stats of their behaviour, or to access stats collected by other extensions.

For an introduction on stats collection see Stats Collection.

For the API see StatsCollector class.

extensions

The extension manager that keeps track of enabled extensions.

Most extensions won’t need to access this attribute.

For an introduction on extensions and a list of available extensions on Scrapy see Extensions.

spiders

The spider manager which takes care of loading and instantiating spiders.

Most extensions won’t need to access this attribute.

engine

The execution engine, which coordinates the core crawling logic between the scheduler, downloader and spiders.

Some extensions may want to access the Scrapy engine, to inspect or modify the downloader and scheduler behaviour, although this is an advanced use and this API is not yet stable.

configure()

Configure the crawler.

This loads extensions, middlewares and spiders, leaving the crawler ready to be started. It also configures the execution engine.

start()

Start the crawler. This calls configure() if it hasn’t been called yet. Returns a deferred that is fired when the crawl is finished.
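
As a minimal sketch (scheduling a spider to crawl is omitted here, and the exact bootstrapping may vary between Scrapy versions), a crawl could be driven manually like this:

    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings

    crawler = Crawler(Settings())
    crawler.configure()
    d = crawler.start()  # deferred fired when the crawl is finished
    d.addBoth(lambda _: reactor.stop())
    reactor.run()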

Settings API

class scrapy.settings.Settings

This object provides access to Scrapy settings.

overrides

Global overrides are the ones that take the most precedence, and are usually populated by command-line options.

Overrides should be populated before configuring the Crawler object (through the configure() method), otherwise they won’t have any effect. You don’t typically need to worry about overrides unless you are implementing your own Scrapy command.
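
For example, a minimal sketch of setting an override from a custom command, before the crawler is configured (LOG_ENABLED is a standard Scrapy setting):

    crawler.settings.overrides['LOG_ENABLED'] = False
    crawler.configure()  # overrides must be in place before this call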

get(name, default=None)

Get a setting value without affecting its original type.

Parameters:
  • name (string) – the setting name
  • default (any) – the value to return if no setting is found
getbool(name, default=False)

Get a setting value as a boolean. For example, both 1 and '1', and True return True, while 0, '0', False and None return False.

For example, settings populated through environment variables set to '0' will return False when using this method.

Parameters:
  • name (string) – the setting name
  • default (any) – the value to return if no setting is found
getint(name, default=0)

Get a setting value as an int.

Parameters:
  • name (string) – the setting name
  • default (any) – the value to return if no setting is found
getfloat(name, default=0.0)

Get a setting value as a float.

Parameters:
  • name (string) – the setting name
  • default (any) – the value to return if no setting is found
getlist(name, default=None)

Get a setting value as a list. If the setting's original type is a list it will be returned verbatim. If it's a string it will be split by ','.

For example, settings populated through environment variables set to 'one,two' will return a list ['one', 'two'] when using this method.

Parameters:
  • name (string) – the setting name
  • default (any) – the value to return if no setting is found
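
A minimal sketch showing the typed getters in use, assuming settings is a scrapy.settings.Settings instance (e.g. crawler.settings):

    settings.get('BOT_NAME')                       # raw value, original type
    settings.getbool('LOG_ENABLED', default=True)  # '0'/'1' coerced to bool
    settings.getint('CONCURRENT_REQUESTS', 16)     # coerced to int
    settings.getfloat('DOWNLOAD_DELAY', 0.0)       # coerced to float
    settings.getlist('SPIDER_MODULES')             # 'a,b' split on ','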

Signals API

class scrapy.signalmanager.SignalManager
connect(receiver, signal)

Connect a receiver function to a signal.

The signal can be any object, although Scrapy comes with some predefined signals that are documented in the Signals section.

Parameters:
  • receiver (callable) – the function to be connected
  • signal (object) – the signal to connect to
send_catch_log(signal, **kwargs)

Send a signal, catch exceptions and log them.

The keyword arguments are passed to the signal handlers (connected through the connect() method).
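
A minimal sketch of connecting a handler and sending a signal through the crawler's signal manager (the custom signal object below is purely illustrative):

    from scrapy import signals

    def spider_closed_handler(spider, reason, **kwargs):
        spider.log("spider closed (%s)" % reason)

    crawler.signals.connect(spider_closed_handler,
                            signal=signals.spider_closed)

    # any object can act as a signal
    my_signal = object()
    crawler.signals.send_catch_log(my_signal, argument='value')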

send_catch_log_deferred(signal, **kwargs)

Like send_catch_log() but supports returning deferreds from signal handlers.

Returns a deferred that gets fired once all signal handler deferreds have fired.

The keyword arguments are passed to the signal handlers (connected through the connect() method).
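
A minimal sketch, reusing the illustrative my_signal object from above; the handler returns a deferred, and send_catch_log_deferred() waits for it:

    from twisted.internet import defer

    def async_handler(**kwargs):
        # a handler may return a deferred instead of a plain value
        return defer.succeed(None)

    crawler.signals.connect(async_handler, signal=my_signal)
    d = crawler.signals.send_catch_log_deferred(my_signal, argument='value')
    # d fires once every handler's deferred has fired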

disconnect(receiver, signal)

Disconnect a receiver function from a signal. This has the opposite effect of the connect() method, and the arguments are the same.

disconnect_all(signal)

Disconnect all receivers from the given signal.

Parameters:signal (object) – the signal to disconnect from

Stats Collector API

There are several Stats Collectors available under the scrapy.statscol module, and they all implement the Stats Collector API defined by the StatsCollector class (which they all inherit from).

class scrapy.statscol.StatsCollector
get_value(key, default=None)

Return the value for the given stats key or default if it doesn’t exist.

get_stats()

Get all stats from the currently running spider as a dict.

set_value(key, value)

Set the given value for the given stats key.

set_stats(stats)

Override the current stats with the dict passed in stats argument.

inc_value(key, count=1, start=0)

Increment the value of the given stats key by the given count, using the given start value if the key is not set yet.

max_value(key, value)

Set the given value for the given key only if the current value for the same key is lower than value. If there is no current value for the given key, the value is always set.

min_value(key, value)

Set the given value for the given key only if the current value for the same key is greater than value. If there is no current value for the given key, the value is always set.

clear_stats()

Clear all stats.
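
A minimal sketch of the collector in use, assuming stats is the crawler's stats collector (crawler.stats) and the key names are illustrative:

    stats.set_value('myext/start_time', 1328257200)
    stats.inc_value('myext/items_seen')          # 0 -> 1 (start defaults to 0)
    stats.max_value('myext/max_depth', 5)        # kept only if 5 > current value
    stats.min_value('myext/min_latency', 0.2)    # kept only if 0.2 < current value
    value = stats.get_value('myext/items_seen')  # -> 1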

The following methods are not part of the Stats Collector API, but are instead used when implementing custom stats collectors:

open_spider(spider)

Open the given spider for stats collection.

close_spider(spider)

Close the given spider. After this is called, no more spider-specific stats can be accessed or collected.
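
A minimal sketch of a custom collector (the class name is hypothetical) that logs the final stats when a spider closes; it would be enabled through the STATS_CLASS setting:

    from scrapy.statscol import StatsCollector

    class LoggingStatsCollector(StatsCollector):

        def close_spider(self, spider):
            # log the final stats before the spider is closed
            spider.log("final stats: %r" % self.get_stats())
            StatsCollector.close_spider(self, spider)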