Predicting the Success of a Product in E-Commerce with Image Classification — Part 1: Data Mining with Selenium

Alejandro White
5 min read · Dec 29, 2020

Hi everyone! In this series of posts, I will explore how to build a Deep Learning approach to predict whether a product will sell successfully on Shopee.

The series consists of two parts. In the first, we will use Data Mining to extract images from the Shopee website using Selenium. In the second, we will use different CNN architectures, including pre-trained models, to predict whether a product will sell well or not. All the code is written in Python.

Let’s begin!

As some of you may know, Shopee is one of the largest e-commerce platforms in South-East Asia. With millions of sellers and products, an important question to ask yourself is: what makes a product successful?

The answer is not straightforward, as many factors can affect whether a product will be successful: the price, reviews, descriptions, and many others. For this post, we will only explore the general image composition of a product and how it can be used to predict its success on Shopee or any other e-commerce platform.

Let’s start by extracting the images!

To extract the images, we will use Selenium.

If you are new to Selenium, the first thing you need to do is to download the ChromeDriver. Here is the link: https://chromedriver.chromium.org/getting-started. Simply put, the driver will allow Selenium to control a Google Chrome screen and respond to your commands.

Let’s go ahead now and import some of the libraries we will use:

import os
import re
import time
from time import sleep

import numpy as np
import pandas as pd
import requests

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

Now let’s set the path of the webdriver:

webdriver_path = 'C:/Users/my_user/desktop/chromedriver.exe' # Enter the file directory of the Chromedriver

Once that’s set, we will now explore some Shopee products that we would like to analyze. For my data set, I will be using makeup products.

Let’s define the link of the Shopee category that I will be using:

shopee_url  = "https://shopee.ph/Makeup-Fragrances-cat.15816?page="

As you can see, the Shopee URL ends with page=. If you append, for example, a 0 after the =, it will direct you to the first page of products. We will use this to iterate through the pages and extract images.
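To see the pattern concretely, here is a quick sketch of how the page URLs are generated (the range of 3 is just for illustration):

```python
shopee_url = "https://shopee.ph/Makeup-Fragrances-cat.15816?page="

# Appending the page index to the base URL yields each results page
page_links = [shopee_url + str(i) for i in range(3)]
print(page_links[0])  # https://shopee.ph/Makeup-Fragrances-cat.15816?page=0
```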

We will now set the webdriver, which will open a new browser. The code is set to allow you to see the browser and what it is doing, but you can also disable it.

# Select custom Chrome options
options = webdriver.ChromeOptions()
# options.add_argument('--headless')  # uncomment to hide the browser window
options.add_argument('--disable-notifications')
options.add_argument('--start-maximized')
options.add_argument('--disable-infobars')
options.add_argument('--disable-extensions')
# Open the Chrome browser
browser = webdriver.Chrome(executable_path=webdriver_path, options=options)

Let's now iterate through 100 pages. At 50 products per page, this will extract a total of 5,000 images. Note that the snippets that follow all run inside this loop.

for i in range(100):
    link = shopee_url + str(i)
    browser.get(link)
    delay = 5
    WebDriverWait(browser, delay)
    print("Page is ready")
    sleep(1)

Now, one of the challenges in Shopee is that it does not load the whole website from the beginning, so you will have to scroll to the bottom each time before getting any data. Nothing to worry about, this next code will help you out with that:

scroll_pause_time = 1
for i in range(10):
    browser.execute_script("window.scrollTo(0, window.scrollY + 500);")
    time.sleep(scroll_pause_time)

html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")

Basically, what's happening here is that we scroll 10 times with a pause of 1 second per scroll. This gives the page enough time to load all the product information. An additional tip: be respectful of companies' websites, as heavy request traffic can slow down their platform and you might end up being blocked.
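If you prefer to scroll until the page height stops growing instead of a fixed 10 times, the loop can be wrapped in a small helper. This is just a sketch; `scroll_until_stable` is a hypothetical name, not part of Selenium:

```python
import time

def scroll_until_stable(browser, pause=1.0, max_scrolls=10):
    """Scroll down repeatedly until document.body.scrollHeight stops changing."""
    last_height = browser.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content loaded; we've reached the bottom
        last_height = new_height
    return last_height
```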

After this, we can use Beautiful Soup to parse the HTML data.

soup = BeautifulSoup(html, "html.parser")

Now, we must find a way to extract the information about our products. A very useful tool for data mining is the browser inspector: in Google Chrome, right-click on an image and then click Inspect. You will see that every product sits under a div with class col-xs-2-4 shopee-search-item-result__item.

You can use the Beautiful Soup find_all method to extract the data of the 50 products on every page.

products = soup.find_all('div', class_='col-xs-2-4 shopee-search-item-result__item')
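As a quick sanity check of how find_all behaves, here it is on a toy HTML snippet. The class names mirror Shopee's markup at the time of writing, but they are illustrative and may change:

```python
from bs4 import BeautifulSoup

toy_html = """
<div class="col-xs-2-4 shopee-search-item-result__item">
  <img src="https://cf.shopee.ph/file/abc123">
</div>
<div class="something-else"></div>
"""

soup = BeautifulSoup(toy_html, "html.parser")
products = soup.find_all('div', class_='col-xs-2-4 shopee-search-item-result__item')
print(len(products))  # 1
```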

Following the same process, you can now extract the image links of the products. We will catch exceptions in case a link is missing, so that it does not stop the process.

img_links = []
for i in range(len(products)):
    try:
        img_links.append(products[i].find('div', class_='_39-Tsj _1tDEiO')
                                    .select('img[src^="https://cf.shopee.ph/file"]')[0]['src'])
    except (AttributeError, IndexError):
        img_links.append(np.nan)

Nice! We now have all the links to the product images. We will also extract the sales counts so that we can use them later as the target variable of our supervised model.

sales = []
for i in range(len(products)):
    try:
        sales.append(products[i].find('div', class_='_18SLBt').text)
    except AttributeError:
        sales.append(np.nan)

We are almost set. Now we just have to download the images and save them to our local disk. We will create a folder named photos_shopee and then fetch the images from the website, skipping any missing links.

os.makedirs("photos_shopee", exist_ok=True)

for index, image in enumerate(img_links):
    if not isinstance(image, str):  # skip missing links (np.nan)
        continue
    img = requests.get(image).content
    with open(os.path.join("photos_shopee", str(index) + '.jpg'), 'wb') as f:
        f.write(img)

That should do the trick. Don't forget to close your browser as a good practice.

browser.close()

Our last step is to clean up the sales data, as it may contain formats that a Machine Learning model cannot read.

sales2 = []
for i in sales:
    i = str(i)                            # guard against np.nan entries
    i = re.sub(r'(?<=\.\d)K', '00', i)    # '1.2K' -> '1.200'
    i = re.sub(r'(?<=\d)K', '000', i)     # '3K'   -> '3000'
    i = re.sub(r'\.', '', i)              # drop the leftover decimal point
    try:
        i = re.match(r'\A\d+', i)[0]
    except TypeError:                     # no leading digits -> no match
        i = 0
    sales2.append(i)
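The same cleaning can be packaged as a small helper, which makes it easy to check the behavior on a few sample strings. This is a sketch; the '1.2K'-style format is an assumption based on how Shopee displayed sales counts at the time of writing:

```python
import re

def clean_sales(text):
    """Convert a Shopee sales string like '1.2K sold' into an integer."""
    text = str(text)
    text = re.sub(r'(?<=\.\d)K', '00', text)   # '1.2K' -> '1.200'
    text = re.sub(r'(?<=\d)K', '000', text)    # '3K'   -> '3000'
    text = text.replace('.', '')               # drop the leftover decimal point
    match = re.match(r'\d+', text)
    return int(match[0]) if match else 0

print(clean_sales("1.2K sold"))  # 1200
```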

And voilà, that's it for this post! In the next part, we will explore how to use this data for predictions with Image Recognition.

I hope you had a good time reading and picked up new ideas for real-life projects with Deep Learning.

Should you have any questions or just want to get in touch, message me on my LinkedIn: https://www.linkedin.com/in/alejandro-white/

Feel free to follow me to be updated with more posts about AI and Machine Learning.


Alejandro White

MSc. Data Science | Creating a better future with AI