Item Loaders
Item Loaders provide a convenient mechanism for populating scraped items. Even though items can be populated directly, Item Loaders provide a much more convenient API for populating them from a scraping process, by automating some common tasks like parsing the raw extracted data before assigning it.
In other words, items provide the container of scraped data, while Item Loaders provide the mechanism for populating that container.
Item Loaders are designed to provide a flexible, efficient and easy mechanism for extending and overriding different field parsing rules, either by spider, or by source format (HTML, XML, etc) without becoming a nightmare to maintain.
Note
Item Loaders are an extension of the itemloaders library that make it easier to work with Scrapy by adding support for responses.
Using Item Loaders to populate items
To use an Item Loader, you must first instantiate it. You can either instantiate it with an item object or without one, in which case an item object is automatically created in the Item Loader __init__ method using the item class specified in the ItemLoader.default_item_class attribute.
Then, you start collecting values into the Item Loader, typically using Selectors. You can add more than one value to the same item field; the Item Loader will know how to “join” those values later using a proper processing function.
Note
Collected data is internally stored as lists, which allows adding several values to the same field. If an item argument is passed when creating a loader, each of the item’s values will be stored as-is if it’s already an iterable, or wrapped in a list if it’s a single value.
Here is a typical Item Loader usage in a Spider, using the Product item declared in the Items chapter:
from scrapy.loader import ItemLoader
from myproject.items import Product
def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath("name", '//div[@class="product_name"]')
    l.add_xpath("name", '//div[@class="product_title"]')
    l.add_xpath("price", '//p[@id="price"]')
    l.add_css("stock", "p#stock")
    l.add_value("last_updated", "today")  # you can also use literal values
    return l.load_item()
By quickly looking at that code, we can see the name field is being extracted from two different XPath locations in the page:
//div[@class="product_name"]
//div[@class="product_title"]
In other words, data is being collected by extracting it from two XPath locations, using the add_xpath() method. This is the data that will be assigned to the name field later.
Afterwards, similar calls are used for the price and stock fields (the latter using a CSS selector with the add_css() method), and finally the last_updated field is populated directly with a literal value (today) using a different method: add_value().
Finally, when all data is collected, the ItemLoader.load_item() method is called, which returns the item populated with the data previously extracted and collected with the add_xpath(), add_css(), and add_value() calls.
Working with dataclass items
By default, dataclass items require all fields to be passed when created. This could be an issue when using dataclass items with item loaders: unless a pre-populated item is passed to the loader, fields will be populated incrementally using the loader’s add_xpath(), add_css() and add_value() methods.
One approach to overcome this is to define items using the field() function, with a default argument:
from dataclasses import dataclass, field
from typing import Optional
@dataclass
class InventoryItem:
    name: Optional[str] = field(default=None)
    price: Optional[float] = field(default=None)
    stock: Optional[int] = field(default=None)
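To see why the defaults matter, the following sketch contrasts a dataclass without defaults, which cannot be created empty, with the InventoryItem above, which can be created empty and then filled in one field at a time. StrictItem and can_create_empty are hypothetical names used only for this comparison.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StrictItem:
    # Hypothetical item with no defaults: every field is required up front.
    name: str
    price: float


@dataclass
class InventoryItem:
    name: Optional[str] = field(default=None)
    price: Optional[float] = field(default=None)
    stock: Optional[int] = field(default=None)


def can_create_empty(item_class) -> bool:
    """Return True if item_class() can be called without arguments."""
    try:
        item_class()
    except TypeError:
        return False
    return True


print(can_create_empty(StrictItem))     # False
print(can_create_empty(InventoryItem))  # True

# Fields can now be assigned incrementally, as a loader would do.
item = InventoryItem()
item.name = "Color TV"
```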
Input and Output processors
An Item Loader contains one input processor and one output processor for each (item) field. The input processor processes the extracted data as soon as it’s received (through the add_xpath(), add_css() or add_value() methods) and the result of the input processor is collected and kept inside the ItemLoader. After collecting all data, the ItemLoader.load_item() method is called to populate and get the populated item object. That’s when the output processor is called with the data previously collected (and processed using the input processor). The result of the output processor is the final value that gets assigned to the item.
Let’s see an example to illustrate how the input and output processors are called for a particular field (the same applies for any other field):
l = ItemLoader(Product(), some_selector)
l.add_xpath("name", xpath1) # (1)
l.add_xpath("name", xpath2) # (2)
l.add_css("name", css) # (3)
l.add_value("name", "test") # (4)
return l.load_item() # (5)
So what happens is:
1. Data from xpath1 is extracted, and passed through the input processor of the name field. The result of the input processor is collected and kept in the Item Loader (but not yet assigned to the item).
2. Data from xpath2 is extracted, and passed through the same input processor used in (1). The result of the input processor is appended to the data collected in (1) (if any).
3. This case is similar to the previous ones, except that the data is extracted from the css CSS selector, and passed through the same input processor used in (1) and (2). The result of the input processor is appended to the data collected in (1) and (2) (if any).
4. This case is also similar to the previous ones, except that the value to be collected is assigned directly, instead of being extracted from an XPath expression or a CSS selector. However, the value is still passed through the input processor. In this case, since the value is a single string rather than a list of values, it is converted to an iterable of a single element before passing it to the input processor, because input processors always receive iterables.
5. The data collected in steps (1), (2), (3) and (4) is passed through the output processor of the name field. The result of the output processor is the value assigned to the name field in the item.
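The flow in the steps above can be modelled with a toy loader. This is a minimal sketch in plain Python, not Scrapy's implementation: TinyLoader, to_iterable and strip_each are hypothetical names, and literal values stand in for the data that selectors would extract.

```python
def to_iterable(value):
    """Wrap a single value (including a string) in a list; keep lists and tuples."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return value
    return [value]


class TinyLoader:
    """Toy model of the collect-then-output flow."""

    def __init__(self, input_processor, output_processor):
        self.input_processor = input_processor
        self.output_processor = output_processor
        self.collected = {}  # field name -> list of processed values

    def add_value(self, field_name, value):
        # The input processor runs immediately and always receives an iterable.
        processed = self.input_processor(to_iterable(value))
        self.collected.setdefault(field_name, []).extend(to_iterable(processed))

    def load_item(self):
        # The output processor runs once per field, on everything collected.
        return {
            name: self.output_processor(values)
            for name, values in self.collected.items()
        }


def strip_each(values):
    return [value.strip() for value in values]


loader = TinyLoader(input_processor=strip_each, output_processor=" ".join)
loader.add_value("name", ["  Color "])  # stands in for extracted data, steps (1)-(3)
loader.add_value("name", "TV")          # step (4): a single literal value
item = loader.load_item()               # step (5)
print(item)  # {'name': 'Color TV'}
```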
It’s worth noticing that processors are just callable objects, which are called with the data to be parsed, and return a parsed value. So you can use any function as input or output processor. The only requirement is that they must accept one (and only one) positional argument, which will be an iterable.
Changed in version 2.0: Processors no longer need to be methods.
Note
Both input and output processors must receive an iterable as their first argument. The output of those functions can be anything. The result of input processors will be appended to an internal list (in the Loader) containing the collected values (for that field). The result of the output processors is the value that will be finally assigned to the item.
The other thing you need to keep in mind is that the values returned by input processors are collected internally (in lists) and then passed to output processors to populate the fields.
Last, but not least, itemloaders comes with some commonly used processors built-in for convenience.
Declaring Item Loaders
Item Loaders are declared using a class definition syntax. Here is an example:
from itemloaders.processors import TakeFirst, MapCompose, Join
from scrapy.loader import ItemLoader
class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
    name_in = MapCompose(str.title)
    name_out = Join()
    price_in = MapCompose(str.strip)
    # ...
As you can see, input processors are declared using the _in suffix while output processors are declared using the _out suffix. You can also declare default input and output processors using the ItemLoader.default_input_processor and ItemLoader.default_output_processor attributes.
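To make the declaration above easier to read, here is a rough sketch of what the three processors do, written as plain functions. These are simplified stand-ins with hypothetical names (take_first, map_compose, join), not the itemloaders classes, and they ignore features such as the loader context.

```python
def take_first(values):
    """Like TakeFirst: return the first value that is not None or an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


def map_compose(*functions):
    """Like MapCompose: pass every value through each function, dropping None results."""
    def processor(values):
        for function in functions:
            next_values = []
            for value in values:
                result = function(value)
                if result is None:
                    continue
                if isinstance(result, (list, tuple)):
                    next_values.extend(result)
                else:
                    next_values.append(result)
            values = next_values
        return list(values)
    return processor


def join(separator=" "):
    """Like Join: concatenate the collected values with a separator."""
    def processor(values):
        return separator.join(values)
    return processor


# Mirrors the ProductLoader declaration above.
name_in, name_out = map_compose(str.title), join()
price_in, default_out = map_compose(str.strip), take_first

name = name_out(name_in(["color tv", "deluxe"]))
price = default_out(price_in([" 1200 ", " 1300"]))
print(name)   # Color Tv Deluxe
print(price)  # 1200
```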
Declaring Input and Output Processors
As seen in the previous section, input and output processors can be declared in the Item Loader definition, and it’s very common to declare input processors this way. However, there is one more place where you can specify the input and output processors to use: in the Item Field metadata. Here is an example:
import scrapy
from itemloaders.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags
def filter_price(value):
    if value.isdigit():
        return value
class Product(scrapy.Item):
    name = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=Join(),
    )
    price = scrapy.Field(
        input_processor=MapCompose(remove_tags, filter_price),
        output_processor=TakeFirst(),
    )
>>> from scrapy.loader import ItemLoader
>>> il = ItemLoader(item=Product())
>>> il.add_value("name", ["Welcome to my", "<strong>website</strong>"])
>>> il.add_value("price", ["€", "<span>1000</span>"])
>>> il.load_item()
{'name': 'Welcome to my website', 'price': '1000'}
The precedence order, for both input and output processors, is as follows:
1. Item Loader field-specific attributes: field_in and field_out (most precedence)
2. Field metadata (input_processor and output_processor key)
3. Item Loader defaults: ItemLoader.default_input_processor and ItemLoader.default_output_processor (least precedence)
See also: Reusing and extending Item Loaders.
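The precedence order can be sketched as a small lookup function. This is only an illustration of the rule, not how Scrapy resolves processors internally; resolve_input_processor, DemoLoader and the three processor functions are hypothetical.

```python
def identity(values):
    return values


def upper_all(values):
    return [value.upper() for value in values]


def strip_all(values):
    return [value.strip() for value in values]


class DemoLoader:
    default_input_processor = identity
    name_in = upper_all  # field-specific attribute on the loader


# Field metadata, as it might be declared on the item.
FIELD_METADATA = {
    "name": {"input_processor": strip_all},
    "price": {"input_processor": strip_all},
    "stock": {},
}


def resolve_input_processor(loader_class, field_name, field_metadata):
    """Pick an input processor following the precedence order described above."""
    # 1. Loader attribute named <field>_in (most precedence).
    specific = getattr(loader_class, f"{field_name}_in", None)
    if specific is not None:
        return specific
    # 2. Field metadata key "input_processor".
    if "input_processor" in field_metadata:
        return field_metadata["input_processor"]
    # 3. Loader-wide default (least precedence).
    return loader_class.default_input_processor


for field_name, metadata in FIELD_METADATA.items():
    chosen = resolve_input_processor(DemoLoader, field_name, metadata)
    print(field_name, chosen.__name__)
# name upper_all
# price strip_all
# stock identity
```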
Item Loader Context
The Item Loader Context is a dict of arbitrary key/values which is shared among all input and output processors in the Item Loader. It can be passed when declaring, instantiating or using an Item Loader, and it is used to modify the behaviour of the input/output processors.
For example, suppose you have a function parse_length which receives a text value and extracts a length from it:
def parse_length(text, loader_context):
    unit = loader_context.get("unit", "m")
    # ... length parsing code goes here ...
    return parsed_length
By accepting a loader_context argument the function is explicitly telling the Item Loader that it’s able to receive an Item Loader context, so the Item Loader passes the currently active context when calling it, and the processor function (parse_length in this case) can thus use it.
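As an illustration, here is one way the body of parse_length could be filled in. The parsing rule and the UNITS_PER_METER table are assumptions made for this sketch, and the context is passed by hand as a plain dict; inside a loader, the Item Loader supplies it.

```python
import re

# Hypothetical conversion table: how many of each unit make up one meter.
UNITS_PER_METER = {"m": 1, "cm": 100, "mm": 1000}


def parse_length(text, loader_context):
    """Extract the first number in text and convert it to meters."""
    unit = loader_context.get("unit", "m")  # falls back to meters
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        return None  # no number found in the text
    return float(match.group()) / UNITS_PER_METER[unit]


print(parse_length("100", {"unit": "cm"}))  # 1.0
print(parse_length("2.5", {}))              # 2.5
```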
There are several ways to modify Item Loader context values:
1. By modifying the currently active Item Loader context (context attribute):
    loader = ItemLoader(product)
    loader.context["unit"] = "cm"
2. On Item Loader instantiation (the keyword arguments of the Item Loader __init__ method are stored in the Item Loader context):
    loader = ItemLoader(product, unit="cm")
3. On Item Loader declaration, for those input/output processors that support instantiating them with an Item Loader context. MapCompose is one of them:
    class ProductLoader(ItemLoader):
        length_out = MapCompose(parse_length, unit="cm")
ItemLoader objects
- class scrapy.loader.ItemLoader(item=None, selector=None, response=None, parent=None, **context)
A user-friendly abstraction to populate an item with data by applying field processors to scraped data. When instantiated with a selector or a response it supports data extraction from web pages using selectors.
- Parameters
item (scrapy.item.Item) – The item instance to populate using subsequent calls to add_xpath(), add_css(), or add_value().
selector (Selector object) – The selector to extract data from, when using the add_xpath(), add_css(), replace_xpath(), or replace_css() method.
response (Response object) – The response used to construct the selector using the default_selector_class, unless the selector argument is given, in which case this argument is ignored.
If no item is given, one is instantiated automatically using the class in default_item_class.
The item, selector, response and remaining keyword arguments are assigned to the Loader context (accessible through the context attribute).
- item
The item object being parsed by this Item Loader. This is mostly used as a property so, when attempting to override this value, you may want to check out default_item_class first.
- default_item_class
An item class (or factory), used to instantiate items when not given in the __init__ method.
- default_input_processor
The default input processor to use for those fields which don’t specify one.
- default_output_processor
The default output processor to use for those fields which don’t specify one.
- default_selector_class
The class used to construct the selector of this ItemLoader, if only a response is given in the __init__ method. If a selector is given in the __init__ method this attribute is ignored. This attribute is sometimes overridden in subclasses.
- selector
The Selector object to extract data from. It’s either the selector given in the __init__ method or one created from the response given in the __init__ method using the default_selector_class. This attribute is meant to be read-only.
- add_css(field_name: Optional[str], css: Union[str, Iterable[str]], *processors: Callable[..., Any], re: Union[str, Pattern[str], None] = None, **kw: Any) -> Self
Similar to ItemLoader.add_value() but receives a CSS selector instead of a value, which is used to extract a list of unicode strings from the selector associated with this ItemLoader.
See get_css() for kwargs.
- Parameters
css (str) – the CSS selector to extract data from
- Returns
The current ItemLoader instance for method chaining.
Examples:
# HTML snippet: <p class="product-name">Color TV</p>
loader.add_css('name', 'p.product-name')
# HTML snippet: <p id="price">the price is $1200</p>
loader.add_css('price', 'p#price', re='the price is (.*)')
- add_jmes(field_name: Optional[str], jmes: str, *processors: Callable[..., Any], re: Union[str, Pattern[str], None] = None, **kw: Any) -> Self
Similar to ItemLoader.add_value() but receives a JMESPath selector instead of a value, which is used to extract a list of unicode strings from the selector associated with this ItemLoader.
See get_jmes() for kwargs.
- Parameters
jmes (str) – the JMESPath selector to extract data from
- Returns
The current ItemLoader instance for method chaining.
Examples:
# JSON snippet: {"name": "Color TV"}
loader.add_jmes('name', 'name')
# JSON snippet: {"price": "the price is $1200"}
loader.add_jmes('price', 'price', TakeFirst(), re='the price is (.*)')
- add_value(field_name: Optional[str], value: Any, *processors: Callable[..., Any], re: Union[str, Pattern[str], None] = None, **kw: Any) -> Self
Process and then add the given value for the given field.
The value is first passed through get_value() by giving the processors and kwargs, and then passed through the field input processor and its result appended to the data collected for that field. If the field already contains collected data, the new data is added.
The given field_name can be None, in which case values for multiple fields may be added. In that case, the processed value should be a dict mapping field names to values.
- Returns
The current ItemLoader instance for method chaining.
Examples:
loader.add_value('name', 'Color TV')
loader.add_value('colours', ['white', 'blue'])
loader.add_value('length', '100')
loader.add_value('name', 'name: foo', TakeFirst(), re='name: (.+)')
loader.add_value(None, {'name': 'foo', 'sex': 'male'})
- add_xpath(field_name: Optional[str], xpath: Union[str, Iterable[str]], *processors: Callable[..., Any], re: Union[str, Pattern[str], None] = None, **kw: Any) -> Self
Similar to ItemLoader.add_value() but receives an XPath instead of a value, which is used to extract a list of strings from the selector associated with this ItemLoader.
See get_xpath() for kwargs.
- Parameters
xpath (str) – the XPath to extract data from
- Returns
The current ItemLoader instance for method chaining.
Examples:
# HTML snippet: <p class="product-name">Color TV</p>
loader.add_xpath('name', '//p[@class="product-name"]')
# HTML snippet: <p id="price">the price is $1200</p>
loader.add_xpath('price', '//p[@id="price"]', re='the price is (.*)')
- get_collected_values(field_name: str) -> List[Any]
Return the collected values for the given field.
- get_css(css: Union[str, Iterable[str]], *processors: Callable[[...], Any], re: Optional[Union[str, Pattern[str]]] = None, **kw: Any) -> Any
Similar to ItemLoader.get_value() but receives a CSS selector instead of a value, which is used to extract a list of unicode strings from the selector associated with this ItemLoader.
Examples:
# HTML snippet: <p class="product-name">Color TV</p>
loader.get_css('p.product-name')
# HTML snippet: <p id="price">the price is $1200</p>
loader.get_css('p#price', TakeFirst(), re='the price is (.*)')
- get_jmes(jmes: Union[str, Iterable[str]], *processors: Callable[[...], Any], re: Optional[Union[str, Pattern[str]]] = None, **kw: Any) -> Any
Similar to ItemLoader.get_value() but receives a JMESPath selector instead of a value, which is used to extract a list of unicode strings from the selector associated with this ItemLoader.
Examples:
# JSON snippet: {"name": "Color TV"}
loader.get_jmes('name')
# JSON snippet: {"price": "the price is $1200"}
loader.get_jmes('price', TakeFirst(), re='the price is (.*)')
- get_output_value(field_name: str) -> Any
Return the collected values parsed using the output processor, for the given field. This method doesn’t populate or modify the item at all.
- get_value(value: Any, *processors: Callable[[...], Any], re: Optional[Union[str, Pattern[str]]] = None, **kw: Any) -> Any
Process the given value by the given processors and keyword arguments.
Available keyword arguments:
- Parameters
re (str or Pattern[str]) – a regular expression to use for extracting data from the given value using the extract_regex() method, applied before processors
Examples:
>>> from itemloaders import ItemLoader
>>> from itemloaders.processors import TakeFirst
>>> loader = ItemLoader()
>>> loader.get_value('name: foo', TakeFirst(), str.upper, re='name: (.+)')
'FOO'
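The order of operations in get_value(), regular expression first and processors afterwards, can be modelled roughly as follows. This is a sketch under simplifying assumptions: get_value_sketch and take_first are hypothetical stand-ins, re.findall replaces the library's own extraction helper, and the pattern is assumed to contain a single capture group.

```python
import re


def take_first(values):
    """Simplified stand-in for TakeFirst."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


def get_value_sketch(value, *processors, regex=None):
    """Toy model: apply the regex first, then each processor in order."""
    if regex is not None:
        values = value if isinstance(value, (list, tuple)) else [value]
        # With one capture group, findall returns the captured strings.
        value = [match for text in values for match in re.findall(regex, text)]
    for processor in processors:
        if value is None:
            break
        value = processor(value)
    return value


result = get_value_sketch("name: foo", take_first, str.upper, regex=r"name: (.+)")
print(result)  # FOO
```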
- get_xpath(xpath: Union[str, Iterable[str]], *processors: Callable[[...], Any], re: Optional[Union[str, Pattern[str]]] = None, **kw: Any) -> Any
Similar to ItemLoader.get_value() but receives an XPath instead of a value, which is used to extract a list of unicode strings from the selector associated with this ItemLoader.
Examples:
# HTML snippet: <p class="product-name">Color TV</p>
loader.get_xpath('//p[@class="product-name"]')
# HTML snippet: <p id="price">the price is $1200</p>
loader.get_xpath('//p[@id="price"]', TakeFirst(), re='the price is (.*)')
- load_item() -> Any
Populate the item with the data collected so far, and return it. The data collected is first passed through the output processors to get the final value to assign to each item field.
- nested_css(css: str, **context: Any) -> Self
Create a nested loader with a CSS selector. The supplied selector is applied relative to the selector associated with this ItemLoader. The nested loader shares the item with the parent ItemLoader so calls to add_xpath(), add_value(), replace_value(), etc. will behave as expected.
- nested_xpath(xpath: str, **context: Any) -> Self
Create a nested loader with an XPath selector. The supplied selector is applied relative to the selector associated with this ItemLoader. The nested loader shares the item with the parent ItemLoader so calls to add_xpath(), add_value(), replace_value(), etc. will behave as expected.
- replace_css(field_name: Optional[str], css: Union[str, Iterable[str]], *processors: Callable[..., Any], re: Union[str, Pattern[str], None] = None, **kw: Any) -> Self
Similar to add_css() but replaces collected data instead of adding it.
- Returns
The current ItemLoader instance for method chaining.
- replace_jmes(field_name: Optional[str], jmes: Union[str, Iterable[str]], *processors: Callable[..., Any], re: Union[str, Pattern[str], None] = None, **kw: Any) -> Self
Similar to add_jmes() but replaces collected data instead of adding it.
- Returns
The current ItemLoader instance for method chaining.
- replace_value(field_name: Optional[str], value: Any, *processors: Callable[..., Any], re: Union[str, Pattern[str], None] = None, **kw: Any) -> Self
Similar to add_value() but replaces the collected data with the new value instead of adding it.
- Returns
The current ItemLoader instance for method chaining.
- replace_xpath(field_name: Optional[str], xpath: Union[str, Iterable[str]], *processors: Callable[..., Any], re: Union[str, Pattern[str], None] = None, **kw: Any) -> Self
Similar to add_xpath() but replaces collected data instead of adding it.
- Returns
The current ItemLoader instance for method chaining.
Nested Loaders
When parsing related values from a subsection of a document, it can be useful to create nested loaders. Imagine you’re extracting details from a footer of a page that looks something like:
Example:
<footer>
<a class="social" href="https://facebook.com/whatever">Like Us</a>
<a class="social" href="https://twitter.com/whatever">Follow Us</a>
<a class="email" href="mailto:whatever@example.com">Email Us</a>
</footer>
Without nested loaders, you need to specify the full xpath (or css) for each value that you wish to extract.
Example:
loader = ItemLoader(item=Item())
# load stuff not in the footer
loader.add_xpath("social", '//footer/a[@class = "social"]/@href')
loader.add_xpath("email", '//footer/a[@class = "email"]/@href')
loader.load_item()
Instead, you can create a nested loader with the footer selector and add values relative to the footer. The functionality is the same but you avoid repeating the footer selector.
Example:
loader = ItemLoader(item=Item())
# load stuff not in the footer
footer_loader = loader.nested_xpath("//footer")
footer_loader.add_xpath("social", 'a[@class = "social"]/@href')
footer_loader.add_xpath("email", 'a[@class = "email"]/@href')
# no need to call footer_loader.load_item()
loader.load_item()
You can nest loaders arbitrarily and they work with either xpath or css selectors. As a general guideline, use nested loaders when they make your code simpler but do not go overboard with nesting or your parser can become difficult to read.
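The idea behind nested loaders, selecting a subsection once and then using paths relative to it, can be tried out with the standard library alone. The sketch below uses xml.etree.ElementTree as an analogy; it does not use Scrapy's selectors or loaders, and the surrounding html and body tags are added here only to make a complete document.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """
<html><body>
<footer>
<a class="social" href="https://facebook.com/whatever">Like Us</a>
<a class="social" href="https://twitter.com/whatever">Follow Us</a>
<a class="email" href="mailto:whatever@example.com">Email Us</a>
</footer>
</body></html>
""".strip()

root = ET.fromstring(DOCUMENT)

# Select the footer once, then query relative to it.
footer = root.find(".//footer")
social = [a.get("href") for a in footer.findall("a[@class='social']")]
email = [a.get("href") for a in footer.findall("a[@class='email']")]

print(social)
print(email)
```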
Reusing and extending Item Loaders
As your project grows bigger and acquires more and more spiders, maintenance becomes a fundamental problem, especially when each spider needs many different parsing rules with a lot of exceptions, yet you still want to reuse the common processors.
Item Loaders are designed to ease the maintenance burden of parsing rules, without losing flexibility and, at the same time, providing a convenient mechanism for extending and overriding them. For this reason Item Loaders support traditional Python class inheritance for dealing with differences of specific spiders (or groups of spiders).
Suppose, for example, that some particular site encloses their product names in three dashes (e.g. ---Plasma TV---) and you don’t want to end up scraping those dashes in the final product names.
Here’s how you can remove those dashes by reusing and extending the default Product Item Loader (ProductLoader):
from itemloaders.processors import MapCompose
from myproject.ItemLoaders import ProductLoader
def strip_dashes(x):
    return x.strip("-")
class SiteSpecificLoader(ProductLoader):
    name_in = MapCompose(strip_dashes, ProductLoader.name_in)
Another case where extending Item Loaders can be very helpful is when you have multiple source formats, for example XML and HTML. In the XML version you may want to remove CDATA occurrences. Here’s an example of how to do it:
from itemloaders.processors import MapCompose
from myproject.ItemLoaders import ProductLoader
from myproject.utils.xml import remove_cdata
class XmlProductLoader(ProductLoader):
    name_in = MapCompose(remove_cdata, ProductLoader.name_in)
And that’s how you typically extend input processors.
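The inheritance pattern can be tried out without Scrapy using a simplified stand-in for MapCompose. In this sketch map_compose is a hypothetical helper, and the two classes are plain Python classes rather than real Item Loaders; the point is only that the subclass wraps the parent's name_in instead of replacing it.

```python
def map_compose(*functions):
    """Simplified stand-in for MapCompose; not the itemloaders class."""
    def processor(values):
        if isinstance(values, str):
            values = [values]  # a single string counts as one value
        for function in functions:
            next_values = []
            for value in values:
                result = function(value)
                if result is None:
                    continue
                if isinstance(result, (list, tuple)):
                    next_values.extend(result)
                else:
                    next_values.append(result)
            values = next_values
        return list(values)
    return processor


class ProductLoader:
    name_in = map_compose(str.title)


def strip_dashes(x):
    return x.strip("-")


class SiteSpecificLoader(ProductLoader):
    # Strip the dashes first, then reuse the parent's processor.
    name_in = map_compose(strip_dashes, ProductLoader.name_in)


print(ProductLoader.name_in(["---plasma tv---"]))       # ['---Plasma Tv---']
print(SiteSpecificLoader.name_in(["---plasma tv---"]))  # ['Plasma Tv']
```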
As for output processors, it is more common to declare them in the field metadata, as they usually depend only on the field and not on each specific site parsing rule (as input processors do). See also: Declaring Input and Output Processors.
There are many other possible ways to extend, inherit and override your Item Loaders, and different Item Loaders hierarchies may fit better for different projects. Scrapy only provides the mechanism; it doesn’t impose any specific organization of your Loaders collection - that’s up to you and your project’s needs.