The project had two stages:
- Go through the categories on each retailer’s website, discover all product pages, and store their URLs in a database.
- Visit each product page and extract the relevant data points (the discovery stage is sketched below).
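Here is a minimal sketch of the first stage. The retailer URL and all CSS selectors are placeholders; in practice, each site needed its own:

```python
import scrapy


class DiscoverySpider(scrapy.Spider):
    """Stage 1: walk category pages and collect product page URLs."""
    name = "discovery"
    start_urls = ["https://example-retailer.com/categories"]  # placeholder

    def parse(self, response):
        # Follow links into individual category listings (hypothetical selector).
        for href in response.css("a.category::attr(href)").getall():
            yield response.follow(href, callback=self.parse_category)

    def parse_category(self, response):
        # Yield every product page URL; a pipeline persists these to the database.
        for href in response.css("a.product::attr(href)").getall():
            yield {"product_url": response.urljoin(href)}
        # Keep paginating through the category.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_category)
```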
We wanted to collect the following fields for each product (a sketch of the item schema follows the list):
- Product name
- Image URLs
- Dimensions
- Description
- Breadcrumbs
- SKU(s)
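The schema maps naturally onto a Scrapy item. The field names below are our illustration, not the project’s actual code:

```python
import scrapy


class ProductItem(scrapy.Item):
    # One record per product; variants get their own records (see below).
    name = scrapy.Field()
    image_urls = scrapy.Field()
    dimensions = scrapy.Field()
    description = scrapy.Field()
    breadcrumbs = scrapy.Field()
    skus = scrapy.Field()
```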
Decisions we made:
Anti-bot measures: some websites had anti-bot protection in place. Given the one-time nature of the extraction, we decided that Zyte’s Smart Proxy Manager was the cheapest solution to that problem.
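Wiring Smart Proxy Manager into Scrapy is mostly configuration; a sketch using the scrapy-zyte-smartproxy plugin (the API key is a placeholder):

```python
# settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    "scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware": 610,
}
ZYTE_SMARTPROXY_ENABLED = True
ZYTE_SMARTPROXY_APIKEY = "<your-api-key>"  # placeholder
```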
JavaScript rendering: we settled on Splash because the client was already using it.
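With the scrapy-splash plugin, rendering a page means issuing a SplashRequest instead of a plain Request. A self-contained sketch of the extraction stage (the Splash URL, product URL, wait time, and selector are all illustrative):

```python
import scrapy
from scrapy_splash import SplashRequest


class ProductSpider(scrapy.Spider):
    """Stage 2 sketch: render each stored product URL through Splash."""
    name = "products"
    custom_settings = {
        "SPLASH_URL": "http://localhost:8050",  # wherever Splash runs
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_splash.SplashCookiesMiddleware": 723,
            "scrapy_splash.SplashMiddleware": 725,
            "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
        },
        "SPIDER_MIDDLEWARES": {"scrapy_splash.SplashDeduplicateArgsMiddleware": 100},
        "DUPEFILTER_CLASS": "scrapy_splash.SplashAwareDupeFilter",
    }

    def start_requests(self):
        # In the real project these URLs come from the stage-1 database.
        for url in ["https://example-retailer.com/product/1"]:  # placeholder
            yield SplashRequest(url, self.parse_product, args={"wait": 2})

    def parse_product(self, response):
        yield {"name": response.css("h1::text").get()}
```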
Dealing with multiple variants: some products had multiple options (different colors, sizes, shapes) that had to be treated as separate products. Handling those required a custom script for each site (one possible shape is sketched below).
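A sketch of what such a per-site script tends to look like; all selectors and attribute names here are hypothetical:

```python
def parse_product(self, response):
    # Fields shared by all variants, extracted once per page.
    base = {
        "name": response.css("h1::text").get(),
        "breadcrumbs": response.css(".breadcrumb a::text").getall(),
    }
    options = response.css("select.variant option")
    if not options:
        # Simple product: one item, one SKU.
        yield {**base, "skus": [response.css("[itemprop=sku]::text").get()]}
        return
    # Variant product: emit one item per color/size/shape option.
    for option in options:
        yield {
            **base,
            "name": f"{base['name']} ({option.css('::text').get()})",
            "skus": [option.attrib.get("data-sku")],
        }
```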
Image extraction: extracting the highest-quality images was non-trivial. The client asked us to make sure we captured the highest available resolution, which took some effort (see the sketch below).
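One approach that can help here (not necessarily what worked on every site): when a site serves responsive images, the srcset attribute often lists several resolutions, and you can keep the widest candidate. The img.product selector is hypothetical:

```python
def best_image_urls(response):
    """Return the widest srcset candidate for each product image."""
    urls = []
    for srcset in response.css("img.product::attr(srcset)").getall():
        candidates = []
        for entry in srcset.split(","):
            url, _, descriptor = entry.strip().partition(" ")
            # Width descriptors look like "1200w"; treat anything else as width 0.
            width = int(descriptor[:-1]) if descriptor.endswith("w") else 0
            candidates.append((width, url))
        urls.append(max(candidates)[1])  # keep the highest-resolution URL
    return urls
```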
QA instead of code reviews: we calculated that it would be cheaper to do more manual QA than to conduct code reviews.
Backend:
To minimize costs, we ran the project on Scrapy Cloud.
Storing the output: we used several storage solutions at different stages of the project, starting with simple Google Sheets and ending with a PostgreSQL integration (a pipeline sketch follows).
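A sketch of what the PostgreSQL integration can look like as a Scrapy item pipeline; the table name, columns, and connection string are placeholders:

```python
import psycopg2


class PostgresPipeline:
    """Persist scraped items into a `products` table (assumed to exist)."""

    def open_spider(self, spider):
        self.conn = psycopg2.connect("dbname=products user=scraper")  # placeholder DSN
        self.cur = self.conn.cursor()

    def close_spider(self, spider):
        self.conn.commit()
        self.cur.close()
        self.conn.close()

    def process_item(self, item, spider):
        # psycopg2 adapts Python lists (e.g. the SKUs) to PostgreSQL arrays.
        self.cur.execute(
            "INSERT INTO products (name, description, skus) VALUES (%s, %s, %s)",
            (item.get("name"), item.get("description"), item.get("skus")),
        )
        return item
```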