Shopify App Reviews Analysis - Volume 1: Scraping the Data

December 5, 2024 · 6 min read

I have always loved doing thematic analysis on text data, either manually or with a tool. Analyzing sentiment and understanding what people are saying about a chosen subject have always fascinated me. I even wrote my master’s thesis using thematic analysis, but that’s a topic for another day.

Fast forward to when I started learning to code: I decided to build a Shopify app. You know, one of my "get rich quick" schemes. I picked Shopify because I knew millions of people were starting e-commerce ventures on the platform, which makes it a great market for developers to monetize their ideas. I still think this way.

The first and most important step in building something for this market is to analyze the market (duh, that’s my inner marketer talking).

So, I decided to build a tool for analyzing Shopify app reviews. I’m not going to pretend to be a seasoned programmer, but let me try. I wrote pseudocode that goes something like this (a rough skeleton follows the list):

  1. Visit the app review pages of a given URL
  2. Automatically visit all pages
  3. Scrape the data (reviewer name, star rating, review text, and date)
  4. Save the data into a CSV file
  5. Analyze the data in some way (wow, very technical, I know)
  6. Generate a textual output of the thematic analysis, including overall sentiment levels, etc.
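To make the plan concrete, here is roughly what those steps look like as Python stubs. This is purely illustrative: the function names are placeholders I made up, not the final code (that comes later in the post).

# Hypothetical skeleton of the pseudocode above; names are placeholders, not the real script.
def fetch_review_page(base_url, page_num):
    """Steps 1-2: visit one review page of the given app URL."""
    ...

def extract_reviews(html):
    """Step 3: scrape reviewer name, star rating, review text, and date."""
    ...

def save_to_csv(reviews, path):
    """Step 4: save the collected data into a CSV file."""
    ...

def analyze_reviews(reviews):
    """Steps 5-6: thematic analysis and sentiment summary (covered in later posts)."""
    ...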

I believe it’s a good practice to write pseudocode before starting to code. Knowing and planning the steps beforehand is important when building something useful. In my journey to create this tool for analyzing Shopify app reviews, the first and most crucial step was data scraping.

This blog post will walk you through how I built a Python-based scraper to fetch reviews from Shopify app pages. The other steps, like the thematic analysis itself, will be covered in later posts.

If you’re curious about the full project and want to jump ahead, try the code or try the tool right away.

Why Scraping?

I’m not sure if there’s an easier or more straightforward way to do this, but I knew I needed to learn scraping. It allows me to gather any data for analysis. Automating this process means I can focus on uncovering insights instead of manually copying and pasting reviews.

I didn’t have much knowledge about web scraping, so I started looking around and experimenting with LLMs. With a bit of Python knowledge and my pseudocode ready, it was a relatively easy project to start. It needed to be: this was my first "big-boy" project.

The Tools

For scraping, I used the following Python libraries (a quick sanity-check snippet follows the list):

  • requests: To fetch HTML content from Shopify app pages.
  • BeautifulSoup (from bs4): To parse and extract specific elements from the HTML.
  • os: To create new directories and handle file paths.
  • csv: To write output data to a CSV file.
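Before wiring everything together, it’s worth confirming that the libraries are installed and that a reviews page can actually be fetched and parsed. This is just a minimal sketch; the app name in the URL is a placeholder, following the URL pattern shown later in the post.

# Minimal sanity check: fetch one reviews page and count the review divs.
# Third-party libraries: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

url = "https://apps.shopify.com/app-name/reviews?page=1"  # placeholder app name
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
print(response.status_code)
print(len(soup.find_all('div', {'data-merchant-review': True})))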

The Code

Here’s how the scraping logic works:

1. Fetching the Page Content

I needed to fetch the URLs to get the data, but I didn’t know the best practices for writing efficient code. So, I frequently shared my code with my chosen LLM and asked for recommendations to improve its speed, error handling, and readability.
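The final script below calls requests.get() directly, but error handling was one of the things I kept asking the LLM about. As a hedged sketch (the fetch_page helper, the timeout, and the status check are my own illustration, not part of the original script), a more defensive fetch could look like this:

# Illustrative helper, not in the original script: fetch and parse a page defensively.
import requests
from bs4 import BeautifulSoup

def fetch_page(url, timeout=10):
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.RequestException as exc:
        print(f"Failed to fetch {url}: {exc}")
        return None
    return BeautifulSoup(response.content, 'html.parser')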

The first step was to extract the reviews from the parsed HTML (the soup) of the app reviews page:

def extract_reviews_from_soup(soup):
    reviews_list = []
    review_divs = soup.find_all('div', {'data-merchant-review': True})
    for review in review_divs:
        review_data = {}
        # gather review data for each div
        reviews_list.append(review_data)
    return reviews_list

This function collects the relevant information from each review div and appends it to a list.
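The actual field extraction is elided behind the comment above. As a taste of what goes there (the full version is in the complete script at the end of this post), the date and star rating come out of Shopify's markup like this, inside that same loop:

# Excerpt from the full script: filling review_data for one review div.
review_data['date'] = review.find('div', class_='tw-text-body-xs tw-text-fg-tertiary').text.strip()
rating = review.find('div', class_='tw-flex tw-relative tw-space-x-0.5 tw-w-[88px] tw-h-md').get('aria-label')
review_data['rating'] = rating.split(' ')[0]  # take the leading number from the aria-label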

2. Scraping Multiple Pages

This project became even easier thanks to Shopify's straightforward URL structure. Review URLs look like this: https://apps.shopify.com/app-name/reviews?page=1.

All I needed to do was change the page=1 part!

Here’s how I handled pagination:

all_reviews = []  # collected across all pages
for page_num in range(page_number, page_number + num_pages):
    response = requests.get(base_url + f"?page={page_num}")
    soup = BeautifulSoup(response.content, 'html.parser')
    all_reviews.extend(extract_reviews_from_soup(soup))

3. Writing Everything to a CSV in the Output Directory

This code loops over all the pages and then writes the collected reviews to a reviews.csv file in the output/ directory.

# same pagination loop as above; the CSV writing happens after it finishes
for page_num in range(page_number, page_number + num_pages):
    response = requests.get(base_url + f"?page={page_num}")
    soup = BeautifulSoup(response.content, 'html.parser')
    all_reviews.extend(extract_reviews_from_soup(soup))

output_dir = 'output'
os.makedirs(output_dir, exist_ok=True)
output_file_path = os.path.join(output_dir, 'reviews.csv')

with open(output_file_path, 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=['reviewer_name', 'content', 'duration_using_app', 'reviewer_country', 'date', 'rating'])
    writer.writeheader()
    for review in all_reviews:
        writer.writerow(review)
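To double-check the output, the same csv module can read the file back. This isn’t part of the scraper, just a quick verification step I’d suggest running afterwards:

# Optional check (not part of the scraper): read the CSV back and count the rows.
with open(output_file_path, newline='', encoding='utf-8') as file:
    rows = list(csv.DictReader(file))
print(f"Wrote {len(rows)} reviews to {output_file_path}")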

When I first started learning Python, understanding loops was challenging. This project was good practice in applying loops effectively (more on that later).

Example Usage

After completing everything, I used ChatGPT to turn this into a web app hosted on Vercel so I could share it with friends. But as a simple Python script, this works in any IDE of your choice (you can even use Replit!).

Here’s how I tested it:

base_url = input("Enter the base URL (excluding page number): ")
num_pages = int(input("How many pages do you want to scrape? "))
page_number = int(input("Starting from which page? "))
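If you’d rather not type the inputs every time, the same three values can simply be hard-coded instead. The app name below is a placeholder following the URL pattern shown earlier:

# Non-interactive alternative (placeholder values for illustration)
base_url = "https://apps.shopify.com/app-name/reviews"
num_pages = 3       # scrape three pages of reviews
page_number = 1     # starting from page 1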

This concludes the code! It's alive! ALIVE!

OK, I calmed down...

Stay tuned for Volume 2, where I’ll cover how I processed and analyzed the scraped reviews using NLP!

Let me know if this works for you!

Cheers, Berkem


Try It Out

import requests
from bs4 import BeautifulSoup
import csv
import os


def extract_reviews_from_soup(soup):
    # Pull every review block out of the parsed page. The selectors target
    # Shopify's current review markup and may break if the page layout changes.
    reviews_list = []
    review_divs = soup.find_all('div', {'data-merchant-review': True})
    for review in review_divs:
        review_data = {}
        review_data['date'] = review.find('div', class_='tw-text-body-xs tw-text-fg-tertiary').text.strip()
        review_data['content'] = ' '.join([p.text for p in review.find('div', {'data-truncate-content-copy': True}).find_all('p')])
        review_data['reviewer_name'] = review.find('div', class_='tw-text-heading-xs tw-text-fg-primary tw-overflow-hidden tw-text-ellipsis tw-whitespace-nowrap').text.strip()
        review_data['reviewer_country'] = review.select_one('div.tw-text-fg-primary.tw-overflow-hidden.tw-text-ellipsis.tw-whitespace-nowrap + div').text.strip()
        rating = review.find('div', class_='tw-flex tw-relative tw-space-x-0.5 tw-w-[88px] tw-h-md').get('aria-label')
        review_data['rating'] = rating.split(' ')[0]
        duration_div = review.select_one('div.tw-text-fg-primary.tw-overflow-hidden.tw-text-ellipsis.tw-whitespace-nowrap + div + div')
        review_data['duration_using_app'] = duration_div.text.strip() if duration_div else 'N/A'
        reviews_list.append(review_data)
    return reviews_list


base_url = input("Enter the base URL (excluding page number): ")
num_pages = int(input("How many pages do you want to scrape? "))
page_number = int(input("Starting from which page? "))

# Paginate through the review pages and collect everything into one list.
all_reviews = []
for page_num in range(page_number, page_number + num_pages):
    response = requests.get(base_url + f"?page={page_num}")
    soup = BeautifulSoup(response.content, 'html.parser')
    all_reviews.extend(extract_reviews_from_soup(soup))

# Write the collected reviews to output/reviews.csv.
output_dir = 'output'
os.makedirs(output_dir, exist_ok=True)
output_file_path = os.path.join(output_dir, 'reviews.csv')

with open(output_file_path, 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=['reviewer_name', 'content', 'duration_using_app', 'reviewer_country', 'date', 'rating'])
    writer.writeheader()
    for review in all_reviews:
        writer.writerow(review)