Scrapy is a powerful web scraping framework, and integrating it with Django can help automate data extraction and storage. In this guide, we'll use cookiecutter-django
to set up a Django project with Celery enabled, create a Scrapy spider, and trigger it via Celery tasks. We'll also use Scrapy pipelines to store scraped data in Django models.
Step 1: Setting Up Django with Cookiecutter
First, install cookiecutter:
pip install cookiecutter
Now, create a new Django project using cookiecutter-django:
cookiecutter https://github.com/cookiecutter/cookiecutter-django
During the setup, make sure to:
- Enable Celery (answer yes when prompted for use_celery).
- Choose a database like PostgreSQL or SQLite.
- Set use_docker to yes if you prefer Docker.
Navigate into your project directory:
cd your_project
Install dependencies:
pip install -r requirements/local.txt
Run the initial setup:
docker-compose up -d
Make sure the Celery worker starts (this runs a one-off worker in the foreground):
docker-compose run --rm celeryworker
Step 2: Adding Scrapy to the Django Project
Now, install Scrapy in your Django environment:
pip install scrapy scrapy-djangoitem
Inside your Django project, create a new Scrapy project:
scrapy startproject scraper
Move it inside your Django app:
mv scraper your_project/scraper
Modify your_project/scraper/scraper/settings.py so Scrapy can use Django's ORM. Add the following at the top of the file, above the settings that scrapy startproject generated:
import os
import sys

import django

# Make the Django project importable from the Scrapy process
# (adjust the number of dirname() calls if your layout differs).
sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))

# cookiecutter-django keeps settings under config/settings/; pick the module that matches your environment.
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'config.settings.local')
django.setup()
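For reference, the generated Scrapy settings that stay below this bootstrap look roughly like this (exact contents vary by Scrapy version):
BOT_NAME = "scraper"

SPIDER_MODULES = ["scraper.spiders"]
NEWSPIDER_MODULE = "scraper.spiders"

ROBOTSTXT_OBEY = True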
Step 3: Creating Django Models for Scraped Data
Inside one of your Django apps (e.g., scraping), define a model to store scraped data:
from django.db import models


class Article(models.Model):
    title = models.CharField(max_length=255)
    url = models.URLField(unique=True)
    content = models.TextField()
    published_at = models.DateTimeField(null=True, blank=True)

    def __str__(self):
        return self.title
Run migrations:
python manage.py makemigrations scraping
python manage.py migrate scraping
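If you want to sanity-check the model before involving Scrapy, a quick session in python manage.py shell works (the URL and field values here are just placeholders):
from scraping.models import Article

# update_or_create keyed on the unique url field avoids duplicates on re-runs
Article.objects.update_or_create(
    url="https://example.com/news/sample",
    defaults={"title": "Sample", "content": "Placeholder body"},
)
print(Article.objects.count())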
Step 4: Creating a Scrapy Spider
Create a Scrapy spider inside scraper/scraper/spiders/:
touch scraper/scraper/spiders/news_spider.py
Edit news_spider.py:
import scrapy

from scraper.items import ArticleItem


class NewsSpider(scrapy.Spider):
    name = "news"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/news"]

    def parse(self, response):
        for article in response.css("div.article"):
            yield ArticleItem(
                title=article.css("h2::text").get(),
                url=response.urljoin(article.css("a::attr(href)").get()),
                content=article.css("p::text").get(),
                published_at=None,
            )
Define the Scrapy Item in scraper/scraper/items.py so it maps to the Django model:
import scrapy
from scrapy_djangoitem import DjangoItem

from scraping.models import Article


class ArticleItem(DjangoItem):
    django_model = Article
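At this point you can exercise the parse logic offline. Here is a rough sketch using Scrapy's HtmlResponse with placeholder markup; run it from python manage.py shell (so Django is set up) with the scraper package on your path:
from scrapy.http import HtmlResponse

from scraper.spiders.news_spider import NewsSpider

# Minimal HTML matching the selectors used in parse()
html = b"""
<div class="article">
  <h2>Sample headline</h2>
  <a href="/news/sample">Read more</a>
  <p>Sample body text.</p>
</div>
"""
response = HtmlResponse(url="https://example.com/news", body=html, encoding="utf-8")
items = list(NewsSpider().parse(response))
print(items[0]["title"])  # -> "Sample headline"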
Step 5: Using Scrapy Pipelines to Save Data in Django
Enable Scrapy's pipeline in scraper/scraper/settings.py:
ITEM_PIPELINES = {
    'scraper.pipelines.DjangoPipeline': 300,
}
Now, edit the pipelines.py file that scrapy startproject created in scraper/scraper/:
from scraping.models import Article


class DjangoPipeline:
    def process_item(self, item, spider):
        # Keyed on the unique url field, so re-running the spider updates rows instead of duplicating them
        Article.objects.update_or_create(
            url=item['url'],
            defaults=dict(item),
        )
        return item
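Pipelines are also the natural place to clean and validate items before they reach the database. A minimal sketch (a hypothetical CleanArticlePipeline, not part of the setup above) that you would register in ITEM_PIPELINES with a number below 300 so it runs before DjangoPipeline:
from scrapy.exceptions import DropItem


class CleanArticlePipeline:
    def process_item(self, item, spider):
        # Discard items that would violate the model's required fields
        if not item.get("title") or not item.get("url"):
            raise DropItem("missing title or url")
        item["title"] = item["title"].strip()
        return item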
Step 6: Triggering Scrapy via Celery Tasks
Now, define a Celery task inside one of your Django apps (e.g., scraping/tasks.py):
import os
import sys

from celery import shared_task
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Make the Scrapy project importable and point Scrapy at its settings (adjust the path to your layout)
sys.path.append(os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), "scraper"))
os.environ.setdefault("SCRAPY_SETTINGS_MODULE", "scraper.settings")

from scraper.spiders.news_spider import NewsSpider

@shared_task
def run_news_spider():
    process = CrawlerProcess(get_project_settings())
    process.crawl(NewsSpider)
    process.start()  # blocks until the crawl finishes
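One caveat: CrawlerProcess starts Twisted's reactor, which cannot be restarted inside a long-lived worker process, so a second run of this task in the same worker may fail. A common workaround (sketched here as an alternative, with a placeholder path) is to shell out to scrapy crawl so each run gets a fresh process:
import subprocess

from celery import shared_task


@shared_task
def run_news_spider_subprocess():
    # cwd must be the Scrapy project root (the directory containing scrapy.cfg);
    # the path below is a placeholder - adjust it to your deployment.
    subprocess.run(
        ["scrapy", "crawl", "news"],
        cwd="/app/scraper",
        check=True,
    )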
Run Celery:
docker-compose up -d celeryworker
You can now trigger the spider via Celery. Open a Django shell:
python manage.py shell
Then queue the task:
from scraping.tasks import run_news_spider
run_news_spider.delay()
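Outside the shell, the task can be queued from any Django entry point; for example, a hypothetical view that kicks off a crawl and returns immediately:
from django.http import JsonResponse

from scraping.tasks import run_news_spider


def trigger_scrape(request):
    result = run_news_spider.delay()  # queued; Celery runs it in the background
    return JsonResponse({"task_id": result.id})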
When Celery is enabled, cookiecutter-django also installs django-celery-beat, which gives the Django admin a Periodic Tasks section where you can schedule this task to run at a regular interval, so definitely check that out as well.
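If you would rather define the schedule in code than in the admin, the beat schedule can also be set from Django settings; here is a sketch, assuming cookiecutter-django's CELERY_ settings namespace:
# config/settings/base.py
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "scrape-news-hourly": {
        "task": "scraping.tasks.run_news_spider",
        "schedule": crontab(minute=0),  # top of every hour
    },
}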
Summary
We've successfully integrated Scrapy with Django using Celery. Now, you can:
- Trigger Scrapy spiders asynchronously using Celery.
- Store scraped data in Django models.
- Use Scrapy pipelines to process and clean data before saving.
This setup is scalable and can handle complex scraping tasks efficiently.