How to pass a user-defined argument to a Scrapy spider


Modifying a spider’s behaviour using optional parameters is a common task with Scrapy. Maybe you want to pass in a URL subdirectory, a category to crawl, an HTML tag to search for, and so on. As the docs put it, “Some common uses for spider arguments are to define the start URLs or to restrict the crawl to certain sections of the site, but they can be used to configure any functionality of the spider.”

Spider arguments are passed through the crawl command using the -a option. For example:
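A minimal sketch of the invocation, assuming a spider named demo (the spider name and the year/month values are placeholders):

```shell
# Each -a flag passes one keyword argument to the spider's constructor
scrapy crawl demo -a year=2014 -a month=07
```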

Spiders receive arguments in their constructors:

Notice the year=None, month=None in the constructor; defaults like these make the parameters optional. Accessing a parameter is then as simple as referencing the variable name.

If you want to pass the parameter value through to an item then you can do something like the following:

You’ll obviously need to define the item field in your items.py by adding subcategory = scrapy.Field(). If you’re dealing with a lot of spiders or a lot of parameters then it makes sense to store the parameters in a database, something I’ll cover in the next post.

Finally, you can also pass parameters through the Scrapyd schedule.json API. The following example passes both a spider setting, DOWNLOAD_DELAY, and an argument, arg1.
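A sketch of the request, assuming Scrapyd is running locally on its default port (the project name, spider name and argument value are placeholders):

```shell
# Extra parameters on schedule.json become spider arguments;
# setting= passes a Scrapy setting override
curl http://localhost:6800/schedule.json \
     -d project=myproject \
     -d spider=demo \
     -d setting=DOWNLOAD_DELAY=2 \
     -d arg1=val1
```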

You can download a working sample of this spider at: https://github.com/dataisbeautiful/scrapy_samples/tree/master/parameterdemo

