Scheduler middleware

The scheduler middleware is a framework of hooks into the Scrapy scheduling mechanism where you can plug in custom functionality to process requests as they are enqueued.

Activating a scheduler middleware

To activate a scheduler middleware component, add it to the SCHEDULER_MIDDLEWARES setting, which is a dict whose keys are the middleware class paths and whose values are the middleware orders.

Here’s an example:

SCHEDULER_MIDDLEWARES = {
    'myproject.middlewares.CustomSchedulerMiddleware': 543,
}

The SCHEDULER_MIDDLEWARES setting is merged with the SCHEDULER_MIDDLEWARES_BASE setting, which is defined in Scrapy and not meant to be overridden. The merged result is then sorted by order to get the final list of enabled middlewares: the first middleware is the one closest to the engine and the last is the one closest to the scheduler.

To decide which order to assign to your middleware see the SCHEDULER_MIDDLEWARES_BASE setting and pick a value according to where you want to insert the middleware. The order does matter because each middleware performs a different action and your middleware could depend on some previous (or subsequent) middleware being applied.

If you want to disable a built-in middleware (one of those defined in SCHEDULER_MIDDLEWARES_BASE and enabled by default), you must define it in your project's SCHEDULER_MIDDLEWARES setting and assign None as its value. For example, if you want to disable the duplicates filter middleware:

SCHEDULER_MIDDLEWARES = {
    'myproject.middlewares.CustomSchedulerMiddleware': 543,
    'scrapy.contrib.schedulermiddleware.duplicatesfilter.DuplicatesFilterMiddleware': None,
}
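
The merge, disable and sort behaviour described above can be sketched in plain Python. This is only an illustration and not Scrapy's actual code: the function name build_middleware_list is hypothetical, and the order value 500 for the duplicates filter is an assumption, since the real contents of SCHEDULER_MIDDLEWARES_BASE are not listed here.

```python
def build_middleware_list(base, project):
    """Return enabled middleware class paths, sorted by order.

    Hypothetical helper: it mirrors the behaviour described in the text,
    not Scrapy's internal implementation.
    """
    merged = dict(base)
    merged.update(project)  # project values override base values
    # A value of None disables the middleware.
    enabled = [(path, order) for path, order in merged.items()
               if order is not None]
    enabled.sort(key=lambda item: item[1])  # lowest order first
    return [path for path, _ in enabled]


DUPEFILTER = ('scrapy.contrib.schedulermiddleware.duplicatesfilter.'
              'DuplicatesFilterMiddleware')
CUSTOM = 'myproject.middlewares.CustomSchedulerMiddleware'

# Assumed base setting; the order value 500 is made up for this example.
SCHEDULER_MIDDLEWARES_BASE = {DUPEFILTER: 500}

# Project setting that adds a custom middleware and disables the built-in one.
SCHEDULER_MIDDLEWARES = {CUSTOM: 543, DUPEFILTER: None}

print(build_middleware_list(SCHEDULER_MIDDLEWARES_BASE, SCHEDULER_MIDDLEWARES))
```

With the assumed values, only the custom middleware remains enabled. If the None entry were left out, the duplicates filter (order 500) would come before the custom middleware (order 543).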

Finally, keep in mind that some middlewares may need to be enabled through a particular setting. See each middleware's documentation for more info.

Writing your own scheduler middleware

Writing your own scheduler middleware is easy. Each middleware component is a single Python class that defines one or more of the following methods:

class scrapy.contrib.schedulermiddleware.SchedulerMiddleware
enqueue_request(spider, request)

Process the given request, which is being enqueued. This method can return None to prevent the request from being scheduled.

enqueue_request() should return either None, a Response object or a Deferred.

Parameters:
  • spider (BaseSpider object) – the spider originating the request
  • request (Request object) – the request to be enqueued

Built-in scheduler middleware reference

This page describes all scheduler middleware components that come with Scrapy.

For a list of the components enabled by default (and their orders) see the SCHEDULER_MIDDLEWARES_BASE setting.

DuplicatesFilterMiddleware

class scrapy.contrib.schedulermiddleware.duplicatesfilter.DuplicatesFilterMiddleware

Filters out already visited URLs.
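
The idea of filtering already visited URLs can be illustrated with a small, self-contained sketch. This is not the implementation used by DuplicatesFilterMiddleware; the SeenUrlTracker class and its is_duplicate method are hypothetical names used only to show the bookkeeping involved.

```python
class SeenUrlTracker:
    """Remember URLs and report whether one has been seen before.

    Hypothetical illustration only, not part of Scrapy.
    """

    def __init__(self):
        self._seen = set()

    def is_duplicate(self, url):
        # The first sighting records the URL; later sightings are duplicates.
        if url in self._seen:
            return True
        self._seen.add(url)
        return False


tracker = SeenUrlTracker()
print(tracker.is_duplicate('http://example.com/page'))  # False
print(tracker.is_duplicate('http://example.com/page'))  # True
```

A set gives constant-time membership checks on average, which matters when many requests are enqueued.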

The DuplicatesFilterMiddleware can be configured through the following settings (see the settings documentation for more info):