Scraping Data from a Pet Marketplace Using Python: BeautifulSoup4 + Rotating Proxies + a Backoff Decorator

Ridho Aryo Pratama
Oct 7, 2023 · 4 min read


Photo by Ludemeula Fernandes on Unsplash

A little while ago, I embarked on a thrilling project on Upwork. The mission? Extract data from two intriguing product sections: Dogs and Cats. These sections had their own charm, being divided into 30 and 35 product categories, respectively. My task was to unearth data from a whopping 12,000+ products! This is the workflow I used to scrape the site.

The Workflow

I wielded the power of the Python programming language to tackle this challenge. My trusty allies in this adventure were some well-known libraries: NumPy, Pandas, BeautifulSoup4, and Requests.

The Next-page issue

Initially, I tried to decipher the architecture of the website. An interesting discovery was that this site utilized pagination to organize its myriad products. My initial thought was simple: for each category I wished to extract, I'd just need to note the maximum page number and loop through. But the plot thickened when I delved deeper. Using the inspect element tool, I could spot the div tag for pagination. However, when I fetched the page with Requests, the div came back empty, as if it had vanished into thin air. The reason: the pagination links are rendered by JavaScript in the browser, while Requests only sees the initial HTML the server sends. I knew I needed an alternative approach.

Fetching the pagination div with Requests returned an empty tag:

<div class="boost-pfs-filter-bottom-pagination pagination"></div>
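Here is a minimal sketch of that discovery. The HTML string below is a stand-in for what requests.get(url).text returns from the server; the pagination container exists in the markup, but since JavaScript hasn't run, it has no children:

```python
from bs4 import BeautifulSoup

# Stand-in for the server-rendered HTML that Requests receives:
# the pagination container is present but JavaScript hasn't filled it.
html = '<div class="boost-pfs-filter-bottom-pagination pagination"></div>'

soup = BeautifulSoup(html, "html.parser")
pagination = soup.find("div", class_="boost-pfs-filter-bottom-pagination")

print(pagination)           # the tag itself is found...
print(pagination.contents)  # ...but it is empty: []
```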

A keen observation of the URL gave me a hint. When navigating to a specific page, there was a "page=[page number]" query parameter at the tail end of the URL. For instance, if we're exploring a particular category and find ourselves on page 30, the URL reveals itself as "page=30".
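That observation means page URLs can be built directly, with no need for the JavaScript pagination at all. A small sketch (the category path below is illustrative, taken from the product links in the HTML output, not a confirmed site map):

```python
# Build the URL for each page of a category by appending the "page"
# query parameter. The category path is illustrative.
BASE_URL = "https://www.polypet.com.sg/collections/{category}?page={page}"

def page_url(category: str, page: int) -> str:
    """Return the URL for one page of one category."""
    return BASE_URL.format(category=category, page=page)

print(page_url("dog-accessories-toys", 30))
# https://www.polypet.com.sg/collections/dog-accessories-toys?page=30
```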

Requesting page 30 directly through the URL returned the product list:
[<div class="product-item product-item--vertical 1/3--tablet 1/4--lap-and-up"><div class="product-item__label-list"><span class="product-label product-label--on-sale">Save <span>$7.20</span></span></div><!-- Swym Wishlist Plus EPI Button-->
<div style="text-align:right;"><button class="swym-button swym-add-to-wishlist-view-product product_7004987424926" data-product-id="7004987424926" data-product-url="https://www.polypet.com.sg/products/zippypaws-monkey-ropetugz-green" data-swaction="addToWishlist" data-variant-id="40935798341790" data-with-epi="true"></button></div>
<!-- Swym Wishlist Plus EPI Button-->
<a class="product-item__image-wrapper product-item__image-wrapper--with-secondary" href="/collections/dog-accessories-toys/products/zippypaws-monkey-ropetugz-green"><div class="aspect-ratio" style="padding-bottom: 100.0%">
<img alt="Zippypaws Monkey RopeTugz - Green" class="product-item__primary-image lazyload image--fade-in" data-media-id="29639063470312" data-sizes="auto" data-src="//www.polypet.com.sg/cdn/shop/products/ZippypawsMonkeyRopeTugz-Green_{width}x.jpg?v=1645667717" data-widths="[200,300,400,500,600,700,800]"/><img alt="Zippypaws Monkey RopeTugz - Green" class="product-item__secondary-image lazyload image--fade-in" data-sizes="auto" data-src="//www.polypet.com.sg/cdn/shop/products/ZippypawsMonkeyRopeTugzAll_abc88987-ba94-4119-bb79-0cf984ac689a_{width}x.jpg?v=1645667954" data-widths="[200,300,400,500,600,700,800]"/><noscript>
...]

But what happens if the category has only 30 pages and I mischievously tweak the page number to 1000? The div with the product list mysteriously empties. Aha! It seems I was on the brink of cracking the code…

[]

Solution

This meant I could loop through every page without painstakingly noting the maximum page count for each category. The solution? A basic if-else check: if the list of products on a page was empty, move on to the next category; otherwise, extract every product on the page.
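The loop can be sketched like this. Note that fetch_products is a hypothetical helper standing in for the Requests + BeautifulSoup call that returns the product tags on one page:

```python
# Keep requesting pages until one comes back with no products,
# then move on to the next category.
def scrape_category(category, fetch_products, max_pages=1000):
    results = []
    for page in range(1, max_pages + 1):
        products = fetch_products(category, page)
        if len(products) == 0:
            break            # past the last page -- next category
        results.extend(products)
    return results

# Simulated fetcher: pretend the category has 3 pages of 2 products each.
def fake_fetch(category, page):
    return [f"{category}-p{page}-{i}" for i in range(2)] if page <= 3 else []

print(len(scrape_category("dogs", fake_fetch)))  # 6
```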

The Request issue

Yet another challenge loomed. Continuous requests through the same proxy occasionally led to being banned, throwing my code into chaos.

Error when retrieving the data

Solution

To counteract this, I researched potential fixes and found the answer: proxy rotation and backoff.

Proxy rotation is a technique where you continually switch your requests between different proxies, so that no single IP address is the source of all the traffic.
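A minimal sketch of the idea, cycling round-robin through a pool (the proxy addresses below are placeholders, not real endpoints):

```python
from itertools import cycle

# Placeholder proxy pool, cycled round-robin so each request
# goes out through the next proxy in the list.
PROXIES = [
    "http://111.11.11.11:8080",
    "http://122.22.22.22:8080",
    "http://133.33.33.33:8080",
]
proxy_pool = cycle(PROXIES)

def next_proxies() -> dict:
    """Return a proxies dict in the shape Requests expects."""
    proxy = next(proxy_pool)
    return {"http": proxy, "https": proxy}

# Usage with Requests would look like:
#   requests.get(url, proxies=next_proxies(), timeout=10)
print(next_proxies()["http"])  # http://111.11.11.11:8080
```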

Backoff, on the other hand, is a custom function used as a decorator around the main request function. If a connection to the desired URL fails, the code doesn't immediately throw an error, as shown in the previous example. Instead, it pauses for a brief moment, switches the proxy, and then tries the request again.
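A sketch of such a decorator (the names and the flaky_request example are hypothetical, for illustration):

```python
import functools
import time

def backoff(retries=3, delay=1.0, on_retry=None):
    """Retry the wrapped function instead of raising on the first
    failure; between attempts, pause and optionally rotate the proxy."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    print("Failed! Retrying..")
                    if on_retry:
                        on_retry()       # e.g. switch to the next proxy
                    time.sleep(delay)
            raise RuntimeError("all retries exhausted")
        return wrapper
    return decorator

# Simulated flaky request: fails twice, then succeeds.
calls = {"n": 0}

@backoff(retries=5, delay=0.01)
def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("banned")
    return "page 4 done!"

print(flaky_request())  # page 4 done!
```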

The backoff will give us a result like this, instead of an error.

...
inside retry
Failed! Retrying..
1
inside retry
Resting..
page 4 done!
time complete: 3:50:41.053240
...

After addressing all the issues, all that’s left is to run the code and patiently wait for all the products to be extracted. Once that’s done, we can save the results in both .xlsx and .csv formats.
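The export step can be sketched with Pandas like this (the column names and values are illustrative, not the project's actual schema):

```python
import pandas as pd

# Collect the scraped records into a DataFrame, then export to
# both formats. Columns and values here are made up for illustration.
records = [
    {"name": "Zippypaws Monkey RopeTugz - Green", "category": "dogs", "price": 12.90},
    {"name": "Example Cat Toy", "category": "cats", "price": 5.00},
]
df = pd.DataFrame(records)

df.to_csv("products.csv", index=False)
try:
    df.to_excel("products.xlsx", index=False)  # needs openpyxl installed
except ImportError:
    pass  # skip the .xlsx copy if no Excel writer is available
```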

Thank you!
