Scrapy is a powerful web scraping framework, and integrating it with Django can help automate data extraction and storage. In this guide, we'll use cookiecutter-django
to set up a Django project with Celery enabled, create a Scrapy spider, and trigger it via Celery tasks. We'll also use Scrapy pipelines to store scraped data in Django models.
Step 1: Setting Up Django with Cookiecutter
First, install cookiecutter:
pip install cookiecutter
Now, create a new Django project using cookiecutter-django:
cookiecutter https://github.com/cookiecutter/cookiecutter-django
During the setup, make sure to:
- Enable Celery (answer yes when prompted for use_celery).
- Choose a database like PostgreSQL or SQLite.
- Set use_docker to yes if you prefer Docker.
Navigate into your project directory:
cd your_project
Install dependencies:
pip install -r requirements/local.txt
Run the initial setup:
docker-compose up -d
Make sure the Celery worker starts (this runs a one-off worker in the foreground):
docker-compose run --rm celeryworker
Step 2: Adding Scrapy to the Django Project
Now, install Scrapy in your Django environment:
pip install scrapy scrapy-djangoitem
Inside your Django project, create a new Scrapy project:
scrapy startproject scraper
Move it inside your Django app:
mv scraper your_project/scraper
Modify your_project/scraper/scraper/settings.py so Scrapy can use Django's ORM. Add the following at the top of the file, above the settings that scrapy startproject generated:
import os
import sys

import django

# Make the Django project importable from the Scrapy process
# (adjust the number of dirname() calls if your layout differs).
sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))

# cookiecutter-django keeps settings under config/settings/; pick the module that matches your environment.
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'config.settings.local')
django.setup()
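For reference, the generated Scrapy settings that stay below this bootstrap look roughly like this (exact contents vary by Scrapy version):
BOT_NAME = "scraper"

SPIDER_MODULES = ["scraper.spiders"]
NEWSPIDER_MODULE = "scraper.spiders"

ROBOTSTXT_OBEY = True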
Step 3: Creating Django Models for Scraped Data
Inside one of your Django apps (e.g., scraping), define a model to store scraped data:
from django.db import models


class Article(models.Model):
    title = models.CharField(max_length=255)
    url = models.URLField(unique=True)
    content = models.TextField()
    published_at = models.DateTimeField(null=True, blank=True)

    def __str__(self):
        return self.title
Run migrations:
python manage.py makemigrations scraping
python manage.py migrate scraping
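If you want to sanity-check the model before involving Scrapy, a quick session in python manage.py shell works (the URL and field values here are just placeholders):
from scraping.models import Article

# update_or_create keyed on the unique url field avoids duplicates on re-runs
Article.objects.update_or_create(
    url="https://example.com/news/sample",
    defaults={"title": "Sample", "content": "Placeholder body"},
)
print(Article.objects.count())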
Step 4: Creating a Scrapy Spider
Create a Scrapy spider inside scraper/scraper/spiders/:
touch scraper/scraper/spiders/news_spider.py
Edit news_spider.py:
import scrapy

from scraper.items import ArticleItem


class NewsSpider(scrapy.Spider):
    name = "news"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/news"]

    def parse(self, response):
        for article in response.css("div.article"):
            yield ArticleItem(
                title=article.css("h2::text").get(),
                url=response.urljoin(article.css("a::attr(href)").get()),
                content=article.css("p::text").get(),
                published_at=None,
            )
Define the Scrapy Item in scraper/scraper/items.py so it maps to the Django model:
import scrapy
from scrapy_djangoitem import DjangoItem

from scraping.models import Article


class ArticleItem(DjangoItem):
    django_model = Article
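At this point you can exercise the parse logic offline. Here is a rough sketch using Scrapy's HtmlResponse with placeholder markup; run it from python manage.py shell (so Django is set up) with the scraper package on your path:
from scrapy.http import HtmlResponse

from scraper.spiders.news_spider import NewsSpider

# Minimal HTML matching the selectors used in parse()
html = b"""
<div class="article">
  <h2>Sample headline</h2>
  <a href="/news/sample">Read more</a>
  <p>Sample body text.</p>
</div>
"""
response = HtmlResponse(url="https://example.com/news", body=html, encoding="utf-8")
items = list(NewsSpider().parse(response))
print(items[0]["title"])  # -> "Sample headline"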
Step 5: Using Scrapy Pipelines to Save Data in Django
Enable Scrapy's pipeline in scraper/scraper/settings.py:
ITEM_PIPELINES = {
    'scraper.pipelines.DjangoPipeline': 300,
}
Now, edit the pipelines.py file that scrapy startproject created in scraper/scraper/:
from scraping.models import Article


class DjangoPipeline:
    def process_item(self, item, spider):
        # Keyed on the unique url field, so re-running the spider updates rows instead of duplicating them
        Article.objects.update_or_create(
            url=item['url'],
            defaults=dict(item),
        )
        return item
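Pipelines are also the natural place to clean and validate items before they reach the database. A minimal sketch (a hypothetical CleanArticlePipeline, not part of the setup above) that you would register in ITEM_PIPELINES with a number below 300 so it runs before DjangoPipeline:
from scrapy.exceptions import DropItem


class CleanArticlePipeline:
    def process_item(self, item, spider):
        # Discard items that would violate the model's required fields
        if not item.get("title") or not item.get("url"):
            raise DropItem("missing title or url")
        item["title"] = item["title"].strip()
        return item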
Step 6: Triggering Scrapy via Celery Tasks
Now, define a Celery task inside one of your Django apps (e.g., scraping/tasks.py):
import os
import sys

from celery import shared_task
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Make the Scrapy project importable and point Scrapy at its settings (adjust the path to your layout)
sys.path.append(os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), "scraper"))
os.environ.setdefault("SCRAPY_SETTINGS_MODULE", "scraper.settings")

from scraper.spiders.news_spider import NewsSpider

@shared_task
def run_news_spider():
    process = CrawlerProcess(get_project_settings())
    process.crawl(NewsSpider)
    process.start()  # blocks until the crawl finishes
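One caveat: CrawlerProcess starts Twisted's reactor, which cannot be restarted inside a long-lived worker process, so a second run of this task in the same worker may fail. A common workaround (sketched here as an alternative, with a placeholder path) is to shell out to scrapy crawl so each run gets a fresh process:
import subprocess

from celery import shared_task


@shared_task
def run_news_spider_subprocess():
    # cwd must be the Scrapy project root (the directory containing scrapy.cfg);
    # the path below is a placeholder - adjust it to your deployment.
    subprocess.run(
        ["scrapy", "crawl", "news"],
        cwd="/app/scraper",
        check=True,
    )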
Run Celery:
docker-compose up -d celeryworker
You can now trigger the spider via Celery. Open a Django shell:
python manage.py shell
Then queue the task:
from scraping.tasks import run_news_spider
run_news_spider.delay()
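Outside the shell, the task can be queued from any Django entry point; for example, a hypothetical view that kicks off a crawl and returns immediately:
from django.http import JsonResponse

from scraping.tasks import run_news_spider


def trigger_scrape(request):
    result = run_news_spider.delay()  # queued; Celery runs it in the background
    return JsonResponse({"task_id": result.id})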
When Celery is enabled, cookiecutter-django also installs django-celery-beat, which gives the Django admin a Periodic Tasks section where you can schedule this task to run at a regular interval, so definitely check that out as well.
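If you would rather define the schedule in code than in the admin, the beat schedule can also be set from Django settings; here is a sketch, assuming cookiecutter-django's CELERY_ settings namespace:
# config/settings/base.py
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "scrape-news-hourly": {
        "task": "scraping.tasks.run_news_spider",
        "schedule": crontab(minute=0),  # top of every hour
    },
}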
Summary
We've successfully integrated Scrapy with Django using Celery. Now, you can:
- Trigger Scrapy spiders asynchronously using Celery.
- Store scraped data in Django models.
- Use Scrapy pipelines to process and clean data before saving.
This setup is scalable and can handle complex scraping tasks efficiently.