1. Set Up PostgreSQL
To get started, we first need to set up a PostgreSQL database instance. You can set it up locally on your PC by following the guide here, or, for the easiest option, create a managed database instance on DigitalOcean. Once the database is online, you will need to build a database connection URI.
Your URI should follow this pattern: postgresql://username:password@host:port/database_name?sslmode=require
2. Create Scrapy Project
Start a new Scrapy project or copy our open-source Scrapy project template from GitHub. Add a DATABASE_URI variable to settings.py and set its value to the URI string from step 1.
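For example, the entry in settings.py might look like this (placeholder values):

# settings.py
DATABASE_URI = "postgresql://username:password@host:5432/database_name?sslmode=require"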
3. Define Tables in Your Database
The next step is to add tables to your database. We recommend using the Peewee ORM to handle most database interactions (reads, inserts, updates), so we will use it throughout this guide. Here is the documentation on how to define database tables with Peewee models. You can also use ChatGPT to generate a Peewee model from a JSON structure; based on our experience, the following prompt works reasonably well: generate a peewee model based on the following JSON structure: ...
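As an illustration, here is a small hypothetical JSON structure and a Peewee model it could map to (the field names are made up for this example):

# Input JSON structure (hypothetical):
# {"name": "T-shirt", "price": "19.99", "images": [{"url": "https://example.com/1.jpg"}]}

from peewee import Model, TextField
from playhouse.postgres_ext import JSONField

class ExampleProductModel(Model):
    name = TextField(null=True)    # "T-shirt"
    price = TextField(null=True)   # "19.99"
    images = JSONField(null=True)  # list of {"url": ...} dicts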
Establish Connection to Your Database
This code establishes a (singleton) database connection using Peewee. It ensures only one instance of the database connection is created and reused throughout the project.
from urllib.parse import urlparse, urlunparse

from peewee import DatabaseProxy
from playhouse.db_url import connect
from playhouse.postgres_ext import PostgresqlExtDatabase


class SingletonMeta(type):
    _instance = None

    def __call__(cls, *args, **kwargs):
        if not cls._instance:
            cls._instance = super().__call__(*args, **kwargs)
        if (
            not isinstance(cls._instance.obj, PostgresqlExtDatabase)
            and "db_uri" in kwargs
        ):
            db_uri = kwargs["db_uri"]
            parsed = urlparse(db_uri)
            # Rewrite the URI scheme so connect() returns a PostgresqlExtDatabase
            # (the backend used by the postgres_ext fields such as JSONField).
            if parsed.scheme in ("postgres", "postgresql"):
                parsed = parsed._replace(scheme="postgresext")
                db_uri = urlunparse(parsed)
            conn = connect(db_uri, unquote_password=True)
            cls._instance.initialize(conn)
        return cls._instance


class CustomDatabaseProxy(DatabaseProxy, metaclass=SingletonMeta):
    def __init__(self, db_uri=None):
        super().__init__()
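A minimal usage sketch (the URI is a placeholder): the first call initializes the underlying connection, and every later call returns the same proxy instance:

from scrapy_project.utils.utils import CustomDatabaseProxy

# The first call creates the proxy and initializes the connection from the URI.
db = CustomDatabaseProxy(db_uri="postgresql://username:password@host:5432/database_name?sslmode=require")

# Subsequent calls (e.g. from BaseModel.Meta below) reuse the same instance.
assert CustomDatabaseProxy() is db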
Add a Base Peewee Model
This section sets up a base model class that all other models will inherit from. It centralizes the database connection configuration, ensuring that all models use the same database instance.
from peewee import *

from scrapy_project.utils.utils import CustomDatabaseProxy


class BaseModel(Model):
    class Meta:
        database = CustomDatabaseProxy()
Create Your Own Model Using ProductItemModel
This section defines a specific item model, ProductItemModel, to represent the structure of the product data you will be storing. The table name is explicitly set to product_item in the model's Meta class.
import datetime

from playhouse.postgres_ext import JSONField


class ProductItemModel(BaseModel):
    vendor = TextField(null=False)  # str
    status = TextField(null=False)  # str
    created = DateTimeField(null=False, default=datetime.datetime.now)
    updated = DateTimeField(null=True)

    # General Product Information
    availability = TextField(default="InStock", null=True)  # str
    color = TextField(null=True)  # str
    currency = TextField(null=True)  # str
    currencyRaw = TextField(null=True)  # str
    productId = TextField(null=True)  # str

    # Product Identification Numbers (GTIN)
    gtin = JSONField(null=True)  # List[Dict[str, str]]

    # Product Images
    images = JSONField(null=True)  # List[Dict[str, str]]
    mainImage = JSONField(null=True)  # Dict[str, Any]

    # Product Information
    mpn = TextField(null=True)  # str Manufacturer Part Number
    name = TextField(null=True)  # str Product Name
    available_quantity = TextField(null=True)  # str How many products are available to order
    price = TextField(null=True)  # str Product Price
    regularPrice = TextField(null=True)  # str Regular product price
    size = TextField(null=True)  # str Product Size
    sku = TextField(null=True)  # str Product Article
    style = TextField(null=True)  # str Product Style

    # Additional product properties
    additionalProperties = JSONField(null=True)  # List[Dict[str, Any]]

    # Product URLs
    url = TextField(null=True)  # str
    canonicalUrl = TextField(null=True)  # str

    # Product Rating
    aggregateRating = JSONField(null=True)  # Dict[str, Any]

    # Product Brand Information
    brand = JSONField(null=True)  # Dict[str, Any]

    # Breadcrumbs (navigation path)
    breadcrumbs = JSONField(null=True)  # List[Dict[str, Any]]

    # Product Features
    features = JSONField(null=True)  # List[Dict[str, Any]]

    # Product Description
    description = TextField(null=True)  # str
    descriptionHtml = TextField(null=True)  # str

    # Product Options
    variants = JSONField(null=True)  # List[Dict[str, Any]]

    # Additional metadata
    metadata = JSONField(null=True)  # Dict[str, Any]
    class Meta:
        # Peewee 3 uses "table_name" ("db_table" is the deprecated Peewee 2 spelling).
        table_name = "product_item"
The connect_to_db method will try to create a table in the database if it doesn't exist. Make sure to add your model name to the db_handle.create_tables call. Once you complete these steps and run the connect_to_db method, you will be able to see a new table in your database.
from scrapy_project.models import ProductItemModel
from scrapy_project.utils.utils import CustomDatabaseProxy


def connect_to_db(db_uri):
    # Initialize (or reuse) the singleton proxy with the connection URI.
    db_handle = CustomDatabaseProxy(db_uri=db_uri)
    # Create tables for every listed model if they don't exist yet.
    db_handle.create_tables([ProductItemModel])
    return db_handle
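To verify, you can call connect_to_db once with your URI and check that the table exists (a quick sketch; the URI is a placeholder):

from scrapy_project.models import connect_to_db

db_handle = connect_to_db("postgresql://username:password@host:5432/database_name?sslmode=require")
print(db_handle.table_exists("product_item"))  # True once the table has been created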
4. Set Up ItemPipelines
Reference our project: GitHub
Copy a base item pipeline from the project. The database connection is set up in the pipeline's __init__ method.
Variables:
- items: a list that stores the items before they are written to the database.
- max_items: the maximum allowed number of items for each insert/update command, i.e. the batch size. If you want to add items to the database as soon as they are generated, set this value to 1.
- total: a counter for the current length of the items list.
import copy
import logging

from itemadapter import ItemAdapter

from scrapy_project.models import connect_to_db


class BaseDBPipeline:
    max_items = 10000
    items = []
    total = 0

    def __init__(self, db_uri):
        self.db_handle = connect_to_db(db_uri)

    @classmethod
    def from_settings(cls, settings):
        db_uri = settings.get("DATABASE_URI")
        params = {"db_uri": db_uri}
        return cls(**params)

    @classmethod
    def from_crawler(cls, crawler):
        return cls.from_settings(crawler.settings)

    def process_item(self, item, spider):
        item_for_db = self.prepare_item(copy.deepcopy(item))
        item_for_db.pop("parent_url", None)
        self.total += 1
        self.items.append(ItemAdapter(item_for_db).asdict())
        # Flush once the batch is full (>= so that max_items=1 writes every item immediately).
        if self.total >= self.max_items:
            self.flush_data()
        return item

    def close_spider(self, spider):
        self.flush_data()

    def flush_data(self):
        with self.db_handle.atomic() as transaction:
            try:
                self.insert_to_db(self.items)
            except Exception as e:
                logging.error(f"Error in PostgreSQLItemPipeline: {e}")
            transaction.commit()
        self.items = []
        self.total = 0
        return None

    def insert_to_db(self, items):
        pass

    def prepare_item(self, item):
        return item
5. Dummy Spider
You can reference our project for an example spider: GitHub.
6. Save Scraped Items Into Database
Next, set up an item-specific ItemPipeline. The actual item upload code lives in the insert_to_db method.
Peewee offers the following options for interacting with a PostgreSQL database:
- Inserts: use insert_many to insert multiple items at once (documentation).
- Bulk updates: use bulk_update to update multiple items at once (documentation); see the sketch below this list.
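For illustration, a bulk_update call might look like this (a sketch only; the filter and updated fields are assumptions, not part of the template project):

from scrapy_project.models import ProductItemModel

# Load existing rows, change them in memory, then write the changes in batches.
rows = list(ProductItemModel.select().where(ProductItemModel.vendor == "example-vendor"))
for row in rows:
    row.status = "archived"
ProductItemModel.bulk_update(rows, fields=[ProductItemModel.status], batch_size=100)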
Example of the insert_to_db method:
from scrapy_project.models import ProductItemModel


class ProductDBPipeline(BaseDBPipeline):
    def insert_to_db(self, items):
        ProductItemModel.insert_many(items).execute()
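Finally, enable the pipeline in settings.py so Scrapy actually runs it (the module path below is an assumption based on a typical project layout):

# settings.py
ITEM_PIPELINES = {
    "scrapy_project.pipelines.ProductDBPipeline": 300,
}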
Suggestions
- GUI for PostgreSQL: We recommend using DBeaver, a powerful GUI tool, to manage your PostgreSQL database more easily.