What is thematic analysis?

Thematic analysis is a qualitative research method used to identify, analyze, and report patterns (themes) within data, such as text or interviews.

Which Python libraries are best for thematic analysis?

Libraries like NLTK, SpaCy, and pandas are commonly used for text analysis. For topic modeling, libraries like Gensim and sklearn are also helpful.

Dec 13, 2024

Shopify App Reviews Analysis - Volume 2: Thematic Analysis


In my previous post, I explored how to scrape Shopify app reviews using Python. Now, it’s time to transform those raw reviews into meaningful insights using advanced data analysis techniques.

I’ve always loved dealing with statistics. My favorite part is that you can either dive deep into intricate analyses or keep it simple based on your needs. Having studied economics, my approach leans toward the analytical side, using things like t-tests and p-values.

When it comes to coding, though, I consider myself moderately experienced. My thematic analysis code relies on combining my maybe-above-average knowledge of statistics and thematic analysis with my maybe-functional Python skills. While I lack advanced Python expertise, I used various LLMs to refine and adapt the pseudocode for this project. This method proved effective, allowing me to create an output I found useful and insightful.

I plan to treat this tool as a dynamic project: I’ll revisit and enhance it as I learn new coding practices or thematic analysis methods. Let’s check the pseudocode that powers the analysis:

Take the CSV file from the output directory and process the content column as strings.
Calculate sentiment polarity and count negative/positive reviews.
Identify and list five major topics.
Extract sample reviews for each major topic.
Calculate the margin of error by finding the standard deviation and z-score.
Conduct a t-test for sentiment analysis.
Calculate Cohen’s D to understand effect size.
Count the number of ratings for each star.
Print warnings if the sample size is under 30 reviews.
Generate a summary message with insights such as sentiment polarity, notable positive and negative reviews, review distribution, and statistical test results.
Output all results into an MDX file.

If you’re curious about the full project and want to jump ahead, try the code in the Try It Out section at the end.

While drafting this pseudocode, I recognized two challenges. The first is handling similar topics and keywords: some keywords show up more than once, in different topics. The second is my lack of experience with tokenization and with advanced topic modeling techniques like NMF, along with the libraries behind them.

That didn’t stop me (famous last words).

The Journey from Data Collection to Analysis

After scraping the reviews, I had a CSV file filled with unstructured text data. My goal was to convert this data into actionable insights by analyzing user sentiment, identifying key themes, and testing whether the results were statistically significant.

The Analytical Toolkit

To achieve this, I used several Python libraries:

pandas for data manipulation
textblob for sentiment analysis
scikit-learn for text processing and topic modeling
scipy for statistical analysis

Script Breakdown: Turning Reviews into Insights

1. Text Preprocessing

def clean_and_tokenize(text):
    if pd.isna(text):
        text = ""
    text = re.sub(r'[^A-Za-z\s]', '', str(text)).lower()
    tokens = text.split()
    return tokens

This function prepares the text for analysis by removing non-alphabetic characters and converting it to lowercase. The reviews included multiple languages, but I processed everything as if it were English and did not exclude non-English reviews. Because the regex keeps only the letters A-Z, accented and non-Latin characters are dropped as well. There is huge room for improvement here.
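As a minimal sketch of one possible improvement (not what this project's script does), the tokenizer could keep letters from any alphabet instead of only A-Z. The function name `clean_and_tokenize_unicode` is hypothetical, missing values are only handled for `None`, and this merely preserves non-English words; it does not detect or filter languages.

```python
import re

def clean_and_tokenize_unicode(text):
    """Lowercase the text and return runs of letters from any script."""
    if text is None:
        return []
    # [^\W\d_] matches word characters except digits and underscores, i.e. letters
    return re.findall(r"[^\W\d_]+", str(text).lower())

print(clean_and_tokenize_unicode("Très bien, 5 étoiles!"))
# ['très', 'bien', 'étoiles']
```

For comparison, the A-Z regex above would turn the same review into "trs", "bien", and "toiles".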

2. Sentiment Analysis

def sentiment_analysis(text):
    return TextBlob(text).sentiment.polarity

I opted for a straightforward sentiment analysis approach: TextBlob assigns each review a polarity score ranging from -1 (very negative) to 1 (very positive). This is again a nice cozy room for improvement, as reviews generally carry emotional hints like support, anger, or joy. Future iterations may incorporate more nuanced emotional analysis that tracks them.
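The pseudocode also calls for counting negative and positive reviews. Here is a rough sketch of how polarity scores could be bucketed and counted. The `label_polarity` helper and the `neutral_band` threshold of 0.05 are my own assumptions for illustration, and the scores are made-up stand-ins for the `sentiment_polarity` column.

```python
from collections import Counter

def label_polarity(score, neutral_band=0.05):
    """Map a polarity score in [-1, 1] to a coarse label."""
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"

# Hypothetical polarity scores, standing in for reviews_df['sentiment_polarity']
scores = [0.8, 0.35, 0.0, -0.4, 0.02, -0.7]
counts = Counter(label_polarity(s) for s in scores)
print(counts["positive"], counts["neutral"], counts["negative"])
# 2 2 2
```

The neutral band keeps scores that sit very close to zero from being counted as clearly positive or negative; where to draw that line is a judgment call.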

3. Topic Modeling with NMF

n_topics = 5
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(reviews_df['content'])
nmf = NMF(n_components=n_topics, random_state=1, max_iter=500).fit(tfidf)
topics = display_topics(nmf, tfidf_vectorizer.get_feature_names_out(), 10)

NMF (non-negative matrix factorization) identifies key themes in the reviews. While effective, it lets the same keywords appear in different topics, which can make the themes look spurious even when the sentiment results are statistically significant, so I plan to explore more advanced models in the future. For later versions, I thought I could feed in some generic topics such as “pricing”, “features”, “support”, and “ease of use” to get more confined results.
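One way to soften the repeated-keyword problem, sketched under my own assumptions rather than taken from the script, is to let each keyword belong only to the topic that weights it most heavily. The helper name `unique_topic_keywords` is hypothetical. It takes any 2D, list-like weight matrix (such as `nmf.components_`) and the matching feature names; the toy weights below are invented for illustration.

```python
def unique_topic_keywords(components, feature_names, top_n=10):
    """Keep each keyword only in the topic where its weight is highest."""
    n_topics = len(components)
    n_terms = len(feature_names)
    # For every term, find the topic that gives it the largest weight
    owner = [max(range(n_topics), key=lambda t: components[t][j]) for j in range(n_terms)]
    topics = {}
    for t in range(n_topics):
        owned = [j for j in range(n_terms) if owner[j] == t]
        # Strongest keywords first
        owned.sort(key=lambda j: components[t][j], reverse=True)
        topics[t] = [feature_names[j] for j in owned[:top_n]]
    return topics

# Toy topic-term weights: 2 topics x 4 terms
weights = [[0.9, 0.5, 0.1, 0.0],
           [0.2, 0.6, 0.8, 0.3]]
terms = ["price", "app", "support", "slow"]
print(unique_topic_keywords(weights, terms))
# {0: ['price'], 1: ['support', 'app', 'slow']}
```

This only changes how the topics are displayed. The underlying model is untouched, so reviews are still assigned to topics exactly as before.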

4. Statistical Validation

My favorite part. Not because it is easy. Because I know what I’m doing here. To make sure the findings are robust, I performed statistical tests. The neat part about reviews is that they are continuous: they keep coming in, so the general stance, or polarity, can change. To understand that change, I’m planning to include Cohen’s d to measure the effect size of each group. For now, the script runs a one-sample t-test that compares the mean sentiment polarity against 0.5:

t_statistic, p_value = stats.ttest_1samp(reviews_df['sentiment_polarity'], 0.5)
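Cohen’s d is on the plan but not in the script yet. As a hedged sketch, a one-sample version that matches the t-test above could divide the gap between the mean polarity and the reference value (0.5 here) by the sample standard deviation. The function name `cohens_d_one_sample` and the sample values are hypothetical.

```python
import statistics

def cohens_d_one_sample(values, reference):
    """Cohen's d for one sample: (mean - reference) / sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    return (mean - reference) / sd

# Made-up polarity scores, standing in for reviews_df['sentiment_polarity']
polarities = [0.1, 0.2, 0.3, 0.4, 0.5]
d = cohens_d_one_sample(polarities, 0.5)
print(round(d, 3))
# -1.265
```

By Cohen’s usual rule of thumb, an absolute d around 0.2 is a small effect, 0.5 is medium, and 0.8 is large. For a one-sample test, the t-statistic equals d multiplied by the square root of the sample size, so d shows how big the gap is regardless of how many reviews there are.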

5. Generating the Final Report

The last part of the script compiles the results into an MDX file, ‘analysis_report.mdx’, providing detailed insights that include the following:

Overall sentiment distribution
Most positive and negative reviews
Thematic analysis with sample reviews
Statistical measures like standard deviation and effect size
Detailed breakdown of topics contributing to negative sentiment

The first iteration just printed the results to the console, but formatting them as markdown simply looks better.

Practical Applications

One project I’m planning is to run the same thematic analysis on other content, such as your competitors’ website content (inner B2B marketer, hehe).

So, this workflow isn’t confined to Shopify app reviews. It can be adapted for:

Product reviews
Customer feedback analysis
Social media sentiment tracking
Academic text analysis

Next Steps and Improvements

I’m planning to keep this code updated with improvements. To enhance the thematic analysis side of this project, I aim to:

Implement advanced NLP techniques
Add interactive visualizations
Enable real-time analysis pipelines
Incorporate machine learning for predictive insights (that is way out of my league, but LLMs are 🚀)

While the current approach provides valuable outputs, this is only the beginning. Future updates will make the process more efficient and insightful, maybe enabling broader applications. This was a fun project to spend some of my time and brain juice on.

Let me know if this works for you!

Cheers, Berkem

Try It Out

import pandas as pd
import re
from collections import Counter
from textblob import TextBlob
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from scipy import stats
import os

# Sample sentiment analysis function
def sentiment_analysis(text):
    return TextBlob(text).sentiment.polarity

# Function to display topics from NMF model
def display_topics(model, feature_names, no_top_words):
    topic_dict = {}
    for topic_idx, topic in enumerate(model.components_):
        topic_dict[topic_idx] = " ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]])
    return topic_dict

# Function to get a sample review for each topic
def get_sample_review_for_topic(df, topic):
    try:
        return df.loc[df['topic'] == topic].sample(1)['content'].values[0]
    except ValueError:
        return None

# Function to clean and tokenize text
def clean_and_tokenize(text):
    if pd.isna(text):
        text = ""
    text = re.sub(r'[^A-Za-z\s]', '', str(text)).lower()
    tokens = text.split()
    return tokens

# Load the CSV file
file_path = './output/reviews.csv'  # Replace with your file path
reviews_df = pd.read_csv(file_path)

# Convert all entries in the 'content' column to strings
# (fill missing values first so NaN does not become the literal text "nan")
reviews_df['content'] = reviews_df['content'].fillna('').astype(str)

# Apply tokenization
reviews_df['tokens'] = reviews_df['content'].apply(clean_and_tokenize)

# Flatten the list of tokens and count the occurrences
all_tokens = [token for tokens in reviews_df['tokens'] for token in tokens]
word_freq = Counter(all_tokens)

# Most common words
most_common_words = word_freq.most_common(10)

# Apply sentiment analysis
reviews_df['sentiment_polarity'] = reviews_df['content'].apply(sentiment_analysis)

# Check if the DataFrame or the 'sentiment_polarity' column is empty
if not reviews_df.empty and 'sentiment_polarity' in reviews_df.columns and not reviews_df['sentiment_polarity'].empty:
    outlier_positive = reviews_df.loc[reviews_df['sentiment_polarity'].idxmax()]
    outlier_negative = reviews_df.loc[reviews_df['sentiment_polarity'].idxmin()]
else:
    outlier_positive = None
    outlier_negative = None
    print("The DataFrame or the 'sentiment_polarity' column is empty.")

# Ensure outlier_negative has an actually negative sentiment
if outlier_negative is not None and outlier_negative['sentiment_polarity'] >= 0:
    outlier_negative_text = "No extremely negative review found."
else:
    outlier_negative_text = outlier_negative['content'] if outlier_negative is not None else "No extremely negative review found."

# Calculating standard deviation and margin of error
std_dev = np.std(reviews_df['sentiment_polarity'])
z_score = 1.96  # For 95% confidence
margin_of_error = z_score * (std_dev / np.sqrt(len(reviews_df)))

# T-Test for sentiment polarity
t_statistic, p_value = stats.ttest_1samp(reviews_df['sentiment_polarity'], 0.5)

# Thematic analysis using NMF (TF-IDF needs at least one review)
if reviews_df.empty:
    raise SystemExit("The DataFrame is empty, cannot apply TfidfVectorizer.")

n_topics = 5
# Ignore terms that appear in more than 95% of reviews or in fewer than 2 reviews
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(reviews_df['content'])
nmf = NMF(n_components=n_topics, random_state=1, max_iter=500).fit(tfidf)  # Higher max_iter to help convergence
topics = display_topics(nmf, tfidf_vectorizer.get_feature_names_out(), 10)

# Assigning topic to each review
reviews_topics = nmf.transform(tfidf)
reviews_df['topic'] = reviews_topics.argmax(axis=1)

# Analyzing sentiment polarity distribution
average_sentiment = reviews_df['sentiment_polarity'].mean()
if average_sentiment >= 0.5:
    sentiment_summary = "which indicates a generally positive sentiment among the reviewers."
elif average_sentiment > 0:
    sentiment_summary = "which indicates a moderately positive sentiment among the reviewers."
else:
    sentiment_summary = "which indicates a generally negative sentiment among the reviewers."

# Generate thematic summary
themes_summary = []
for topic, words in topics.items():
    sample_review = get_sample_review_for_topic(reviews_df, topic)
    themes_summary.append(f"**Topic {topic+1}**\n- **Keywords:** {', '.join(words.split()[:10])} \n> ***Sample Review:*** \"{sample_review}\"\n")

# Identifying topics contributing to negative sentiment based on ratings
negative_threshold = 2
negative_reviews_df = reviews_df[reviews_df['rating'] <= negative_threshold]
negative_topics = negative_reviews_df['topic'].value_counts().index.tolist()
negative_topics_analysis = []
for idx in negative_topics:
    sample_review = get_sample_review_for_topic(negative_reviews_df, idx)
    negative_topics_analysis.append(f"**Topic {idx+1}** \n- **Keywords:** {topics[idx]}  \n> ***Sample Negative Review:*** \"{sample_review}\"\n")

# Counting the ratings
rating_counts = reviews_df['rating'].value_counts().sort_index()
total_reviews = len(reviews_df)

# Check the sample size for statistical significance
sample_size_warning = ""
if total_reviews < 30:
    sample_size_warning = f"**Warning:** The sample size **({total_reviews})** is considered too small for reliable statistical analysis. Interpret the results with caution."

# Prepare final output as markdown
output = """# Thematic Analysis Report

## General Opinion

The average sentiment polarity of the reviews is approximately **{:.3f}**, {}

## Outlier Remarks

### Most Positive Review
"{}"  
*This review suggests a very high level of satisfaction and endorsement.*

### Most Negative Review
"{}"

## Review Distribution

Out of **{}** reviews analyzed:

""".format(
    average_sentiment, 
    sentiment_summary,
    outlier_positive.content if outlier_positive is not None else "No extremely positive review found",
    outlier_negative_text,
    total_reviews
)

# Add rating distribution
for rating in range(1, 6):
    count = rating_counts.get(rating, 0)
    output += f"- {rating}-{'star' if count == 1 else 'stars'} reviews: **{count}**\n"

output += """
## Thematic Analysis

The thematic analysis of the Shopify app reviews using topic modeling has revealed the following key themes and how people are using and perceiving the app:

"""

# Add themes
for theme_summary in themes_summary:
    output += theme_summary + "\n"

output += f"""
## Statistical Analysis

Overall, these themes indicate that the app is generally {('well-received' if average_sentiment > 0 else 'not well-received')} for its various features.

### Sentiment Distribution
- **Standard Deviation:** {std_dev:.3f}
- **Margin of Error (95% CI):** ±{margin_of_error:.3f}
- **True Average Sentiment Range:** {average_sentiment-margin_of_error:.3f} to {average_sentiment+margin_of_error:.3f}

### Statistical Tests

1. **T-Test Results**
   - **T-statistic:** {t_statistic:.3f}
   - **P-value:** {p_value:.3f}
   - **Interpretation:** {("The mean sentiment is ***significantly*** **below** 0.5." if t_statistic < 0 else "The mean sentiment is ***significantly*** **above** 0.5.") if p_value < 0.05 else "The mean sentiment is ***not significantly*** different from 0.5."}

## Negative Sentiment Analysis

The following topics were frequently associated with negative reviews:

"""

# Add negative topics
for negative_topic in negative_topics_analysis:
    output += negative_topic + "\n"

if sample_size_warning:
    output += f"\n> ⚠️ {sample_size_warning}\n"

# Create the thematic-analysis directory if it doesn't exist
os.makedirs('thematic-analysis', exist_ok=True)

# Write the output to a markdown file
with open('thematic-analysis/analysis_report.mdx', 'w', encoding='utf-8') as f:
    f.write(output)

print("Analysis report has been saved to thematic-analysis/analysis_report.mdx")