1. Set Up PostgreSQL
To get started, we first need to set up a PostgreSQL database instance. You can set it up locally on your PC by following the guide here, or, for the easiest option, create a managed database instance on DigitalOcean. Once the database is online, you will need to build a database connection URI.
Your URI should follow this pattern: postgresql://username:password@host:port/database_name?sslmode=require
2. Create Scrapy Project
Start a new Scrapy project or copy our open-source Scrapy project template from GitHub. Add a DATABASE_URI variable to settings.py and set its value to the URI string from step 1.
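For example, the entry in settings.py might look like this (placeholder values):

# settings.py
DATABASE_URI = "postgresql://username:password@host:5432/database_name?sslmode=require"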
3. Define Tables in Your Database
The next step is to add tables to your database. We recommend using the Peewee ORM to handle most database interactions (reads, inserts, updates), so we will use it throughout this guide. Here is the documentation on how to define database tables with Peewee models. You can also use ChatGPT to generate a Peewee model from a JSON structure; based on our experience, the following prompt works reasonably well: generate a peewee model based on the following JSON structure: ...
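As an illustration, here is a small hypothetical JSON structure and a Peewee model it could map to (the field names are made up for this example):

# Input JSON structure (hypothetical):
# {"name": "T-shirt", "price": "19.99", "images": [{"url": "https://example.com/1.jpg"}]}

from peewee import Model, TextField
from playhouse.postgres_ext import JSONField

class ExampleProductModel(Model):
    name = TextField(null=True)    # "T-shirt"
    price = TextField(null=True)   # "19.99"
    images = JSONField(null=True)  # list of {"url": ...} dicts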
Establish Connection to Your Database
This code establishes a (singleton) database connection using Peewee. It ensures only one instance of the database connection is created and reused throughout the project.
from urllib.parse import urlparse, urlunparse

from peewee import DatabaseProxy
from playhouse.db_url import connect
from playhouse.postgres_ext import PostgresqlExtDatabase


class SingletonMeta(type):
    _instance = None

    def __call__(cls, *args, **kwargs):
        if not cls._instance:
            cls._instance = super().__call__(*args, **kwargs)
        if (
            not isinstance(cls._instance.obj, PostgresqlExtDatabase)
            and "db_uri" in kwargs
        ):
            db_uri = kwargs["db_uri"]
            parsed = urlparse(db_uri)
            # Rewrite the URI scheme so connect() returns a PostgresqlExtDatabase
            # (the backend used by the postgres_ext fields such as JSONField).
            if parsed.scheme in ("postgres", "postgresql"):
                parsed = parsed._replace(scheme="postgresext")
                db_uri = urlunparse(parsed)
            conn = connect(db_uri, unquote_password=True)
            cls._instance.initialize(conn)
        return cls._instance


class CustomDatabaseProxy(DatabaseProxy, metaclass=SingletonMeta):
    def __init__(self, db_uri=None):
        super().__init__()
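A minimal usage sketch (the URI is a placeholder): the first call initializes the underlying connection, and every later call returns the same proxy instance:

from scrapy_project.utils.utils import CustomDatabaseProxy

# The first call creates the proxy and initializes the connection from the URI.
db = CustomDatabaseProxy(db_uri="postgresql://username:password@host:5432/database_name?sslmode=require")

# Subsequent calls (e.g. from BaseModel.Meta below) reuse the same instance.
assert CustomDatabaseProxy() is db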
Add a Base Peewee Model
This section sets up a base model class that all other models will inherit from. It centralizes the database connection configuration, ensuring that all models use the same database instance.
from peewee import *

from scrapy_project.utils.utils import CustomDatabaseProxy


class BaseModel(Model):
    class Meta:
        database = CustomDatabaseProxy()
Create Your Own Model Using ProductItemModel
This section defines a specific item model, ProductItemModel, to represent the structure of the product data you will be storing. The table name is explicitly set to product_item in the model's Meta class.
import datetime

from playhouse.postgres_ext import JSONField


class ProductItemModel(BaseModel):
    vendor = TextField(null=False)  # str
    status = TextField(null=False)  # str
    created = DateTimeField(null=False, default=datetime.datetime.now)
    updated = DateTimeField(null=True)

    # General Product Information
    availability = TextField(default="InStock", null=True)  # str
    color = TextField(null=True)  # str
    currency = TextField(null=True)  # str
    currencyRaw = TextField(null=True)  # str
    productId = TextField(null=True)  # str

    # Product Identification Numbers (GTIN)
    gtin = JSONField(null=True)  # List[Dict[str, str]]

    # Product Images
    images = JSONField(null=True)  # List[Dict[str, str]]
    mainImage = JSONField(null=True)  # Dict[str, Any]

    # Product Information
    mpn = TextField(null=True)  # str Manufacturer Part Number
    name = TextField(null=True)  # str Product Name
    available_quantity = TextField(null=True)  # str How many products are available to order
    price = TextField(null=True)  # str Product Price
    regularPrice = TextField(null=True)  # str Regular product price
    size = TextField(null=True)  # str Product Size
    sku = TextField(null=True)  # str Product Article
    style = TextField(null=True)  # str Product Style

    # Additional product properties
    additionalProperties = JSONField(null=True)  # List[Dict[str, Any]]

    # Product URLs
    url = TextField(null=True)  # str
    canonicalUrl = TextField(null=True)  # str

    # Product Rating
    aggregateRating = JSONField(null=True)  # Dict[str, Any]

    # Product Brand Information
    brand = JSONField(null=True)  # Dict[str, Any]

    # Breadcrumbs (navigation path)
    breadcrumbs = JSONField(null=True)  # List[Dict[str, Any]]

    # Product Features
    features = JSONField(null=True)  # List[Dict[str, Any]]

    # Product Description
    description = TextField(null=True)  # str
    descriptionHtml = TextField(null=True)  # str

    # Product Options
    variants = JSONField(null=True)  # List[Dict[str, Any]]

    # Additional metadata
    metadata = JSONField(null=True)  # Dict[str, Any]
    class Meta:
        # Peewee 3 uses "table_name" ("db_table" is the deprecated Peewee 2 spelling).
        table_name = "product_item"
The connect_to_db method will try to create a table in the database if it doesn't exist. Make sure to add your model name to the db_handle.create_tables call. Once you complete these steps and run the connect_to_db method, you will be able to see a new table in your database.
from scrapy_project.models import ProductItemModel
from scrapy_project.utils.utils import CustomDatabaseProxy


def connect_to_db(db_uri):
    # Initialize (or reuse) the singleton proxy with the connection URI.
    db_handle = CustomDatabaseProxy(db_uri=db_uri)
    # Create tables for every listed model if they don't exist yet.
    db_handle.create_tables([ProductItemModel])
    return db_handle
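To verify, you can call connect_to_db once with your URI and check that the table exists (a quick sketch; the URI is a placeholder):

from scrapy_project.models import connect_to_db

db_handle = connect_to_db("postgresql://username:password@host:5432/database_name?sslmode=require")
print(db_handle.table_exists("product_item"))  # True once the table has been created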
4. Set Up ItemPipelines
Reference our project: GitHub
Copy a base item pipeline from the project. The database connection is set up in the pipeline's __init__ method.
Variables:
- items: a list that stores the items before they are written to the database.
- max_items: the maximum allowed number of items for each insert/update command, i.e. the batch size. If you want to add items to the database as soon as they are generated, set this value to 1.
- total: a counter for the current length of the items list.
import copy
import logging

from itemadapter import ItemAdapter

from scrapy_project.models import connect_to_db


class BaseDBPipeline:
    max_items = 10000
    items = []
    total = 0

    def __init__(self, db_uri):
        self.db_handle = connect_to_db(db_uri)

    @classmethod
    def from_settings(cls, settings):
        db_uri = settings.get("DATABASE_URI")
        params = {"db_uri": db_uri}
        return cls(**params)

    @classmethod
    def from_crawler(cls, crawler):
        return cls.from_settings(crawler.settings)

    def process_item(self, item, spider):
        item_for_db = self.prepare_item(copy.deepcopy(item))
        item_for_db.pop("parent_url", None)
        self.total += 1
        self.items.append(ItemAdapter(item_for_db).asdict())
        # Flush once the batch is full (>= so that max_items=1 writes every item immediately).
        if self.total >= self.max_items:
            self.flush_data()
        return item

    def close_spider(self, spider):
        self.flush_data()

    def flush_data(self):
        with self.db_handle.atomic() as transaction:
            try:
                self.insert_to_db(self.items)
            except Exception as e:
                logging.error(f"Error in PostgreSQLItemPipeline: {e}")
            transaction.commit()
        self.items = []
        self.total = 0
        return None

    def insert_to_db(self, items):
        pass

    def prepare_item(self, item):
        return item
5. Dummy Spider
You can reference our project for an example spider: GitHub.
6. Save Scraped Items Into Database
Next, set up an item-specific ItemPipeline. The actual item upload code lives in the insert_to_db method.
Peewee offers the following options for interacting with a PostgreSQL database:
- Inserts: use insert_many to insert multiple items at once (documentation).
- Bulk updates: use bulk_update to update multiple items at once (documentation); see the sketch below this list.
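For illustration, a bulk_update call might look like this (a sketch only; the filter and updated fields are assumptions, not part of the template project):

from scrapy_project.models import ProductItemModel

# Load existing rows, change them in memory, then write the changes in batches.
rows = list(ProductItemModel.select().where(ProductItemModel.vendor == "example-vendor"))
for row in rows:
    row.status = "archived"
ProductItemModel.bulk_update(rows, fields=[ProductItemModel.status], batch_size=100)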
Example of the insert_to_db method:
from scrapy_project.models import ProductItemModel


class ProductDBPipeline(BaseDBPipeline):
    def insert_to_db(self, items):
        ProductItemModel.insert_many(items).execute()
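Finally, enable the pipeline in settings.py so Scrapy actually runs it (the module path below is an assumption based on a typical project layout):

# settings.py
ITEM_PIPELINES = {
    "scrapy_project.pipelines.ProductDBPipeline": 300,
}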
Suggestions
- GUI for PostgreSQL: We recommend using DBeaver, a powerful GUI tool, to manage your PostgreSQL database more easily.