In my previous post, I explored how to scrape Shopify app reviews using Python. Now, it's time to transform those raw reviews into meaningful insights using advanced data analysis techniques.
I’ve always loved dealing with statistics. My favorite part is that you can either dive deep into intricate analyses or keep it simple based on your needs. Having studied economics, my approach leans toward the analytical side, using things like t-tests and p-values.
When it comes to coding, though, I consider myself moderately experienced. My thematic analysis code relies on combining my maybe-above-average knowledge of statistics and thematic analysis with my maybe-functional Python skills. While I lack advanced Python expertise, I used various LLMs to refine and adapt the pseudocode for this project. This method proved effective, allowing me to create an output I found useful and insightful.
I see this tool as a dynamic project. I’ll revisit and enhance it as I learn new coding practices or thematic analysis methods. Let’s check the pseudocode that powers the analysis:
- Take the CSV file from the output directory and process the content column as strings.
- Calculate sentiment polarity and count negative/positive reviews.
- Identify and list five major topics.
- Extract sample reviews for each major topic.
- Calculate the margin of error by finding the standard deviation and z-score.
- Conduct a t-test for sentiment analysis.
- Calculate Cohen's D to understand effect size.
- Count the number of ratings for each star.
- Print warnings if the sample size is under 30 reviews.
- Generate a summary message with insights such as sentiment polarity, notable positive and negative reviews, review distribution, and statistical test results.
- Output all results into an MDX file.
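The margin-of-error step in the list above can be sketched with the standard library alone. This is a minimal sketch, assuming a 95% confidence level (z = 1.96) and the sample standard deviation; the function name `margin_of_error` and the sample scores are hypothetical and only there for illustration.

```python
import math
import statistics

def margin_of_error(scores, z_score=1.96):
    """Return z * (s / sqrt(n)) for a list of polarity scores.

    Assumes z_score=1.96 (95% confidence) and uses the sample
    standard deviation (n - 1 in the denominator).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate spread")
    return z_score * statistics.stdev(scores) / math.sqrt(n)

# Hypothetical polarity scores, for illustration only
sample_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
moe = margin_of_error(sample_scores)
mean = statistics.mean(sample_scores)
print(f"mean = {mean:.3f}, 95% range = {mean - moe:.3f} to {mean + moe:.3f}")
```

Note that `np.std` defaults to the population standard deviation (ddof=0), so a script that uses it will report a slightly smaller margin than this sketch on small samples.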
If you’re curious about the full project and want to jump ahead, try the code in the Try It Out section at the end of this post.
While drafting this pseudocode, I recognized certain challenges. The first is the difficulty of handling similar topics and keywords: some keywords show up more than once across different topics. The second is my lack of knowledge of tokenization and advanced topic modeling techniques like NMF. Thing is, I don't have any experience with these methods and libraries.
That didn't stop me (famous last words).
The Journey from Data Collection to Analysis
After scraping the reviews, I had a CSV file filled with unstructured text data. My goal was to convert this data into actionable insights by analyzing user sentiment, identifying key themes, and testing whether the results were statistically significant.
The Analytical Toolkit
To achieve this, I used several Python libraries:
- pandas for data manipulation
- textblob for sentiment analysis
- scikit-learn for text processing and topic modeling
- scipy for statistical analysis
Script Breakdown: Turning Reviews into Insights
1. Text Preprocessing
def clean_and_tokenize(text):
    if pd.isna(text):
        text = ""
    text = re.sub(r'[^A-Za-z\s]', '', str(text)).lower()
    tokens = text.split()
    return tokens
This function prepares the text for analysis by removing non-alphabetic characters, converting it to lowercase, and splitting it into tokens on whitespace. While the reviews included multiple languages, I processed everything as if it were English, without manually excluding non-English reviews. There is huge room for improvement here.
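To see why non-English reviews are a weak spot, here is a small sketch that applies the same regex and lowercasing to a few made-up strings. The helper name `strip_non_letters` and the sample reviews are hypothetical; the pandas NaN check is left out so the snippet runs on the standard library alone.

```python
import re

def strip_non_letters(text):
    """Apply the same cleanup as clean_and_tokenize, minus the NaN check."""
    # Keep only ASCII letters and whitespace, lowercase, split on whitespace
    return re.sub(r'[^A-Za-z\s]', '', str(text)).lower().split()

# Hypothetical reviews, for illustration only
samples = ["Great app!!! 10/10", "Très bien, merci", "素晴らしいアプリ"]
for sample in samples:
    print(repr(sample), "->", strip_non_letters(sample))
```

Accented letters are dropped from the middle of words and non-Latin scripts disappear entirely, so filtering or translating non-English reviews before this step would likely give cleaner tokens.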
2. Sentiment Analysis
def sentiment_analysis(text):
    return TextBlob(text).sentiment.polarity
I opted for a straightforward sentiment analysis approach, assigning scores ranging from -1 (very negative) to 1 (very positive). This is again a nice cozy room for improvement: reviews generally carry emotional hints like support, anger, or joy, and future iterations may track those with a more nuanced emotional analysis.
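Once every review has a polarity score, counting negative and positive reviews is a matter of bucketing those scores. A minimal sketch, assuming a cut-off at zero; the function name `bucket_polarity`, the `neutral_band` parameter, and the scores are hypothetical.

```python
from collections import Counter

def bucket_polarity(score, neutral_band=0.0):
    """Label a polarity score in [-1, 1] as negative, neutral, or positive.

    neutral_band widens the neutral zone around zero; 0.0 means only
    an exact zero counts as neutral.
    """
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"

# Hypothetical polarity scores, for illustration only
scores = [0.8, -0.4, 0.0, 0.35, -0.1]
counts = Counter(bucket_polarity(s) for s in scores)
print(counts)
```

Raising `neutral_band` (say to 0.05) would treat weakly positive or negative scores as neutral, which may suit short reviews better.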
3. Topic Modeling with NMF
n_topics = 5
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(reviews_df['content'])
nmf = NMF(n_components=n_topics, random_state=1, max_iter=500).fit(tfidf)
topics = display_topics(nmf, tfidf_vectorizer.get_feature_names_out(), 10)
NMF identifies key themes in the reviews. While effective, it tends to place the same keywords in different topics, which can make the themes look spurious even when the statistical results are significant, so I plan to explore more advanced models in the future. For later versions, I could maybe feed in some generic topics such as "pricing", "features", "support", and "ease of use" and get more confined results.
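One way to put a number on the repeated-keyword problem is to measure how much each pair of topics overlaps. The sketch below computes the Jaccard overlap between keyword sets, assuming the topics come in the same shape `display_topics` returns (topic index mapped to a space-separated keyword string). The function name `topic_overlap` and the example topics are hypothetical.

```python
from itertools import combinations

def topic_overlap(topics):
    """Return the Jaccard overlap for every pair of topics.

    `topics` maps topic index -> space-separated keyword string.
    The result maps (index_a, index_b) -> shared keywords / all keywords.
    """
    keyword_sets = {idx: set(words.split()) for idx, words in topics.items()}
    overlaps = {}
    for a, b in combinations(sorted(keyword_sets), 2):
        union = keyword_sets[a] | keyword_sets[b]
        shared = keyword_sets[a] & keyword_sets[b]
        overlaps[(a, b)] = len(shared) / len(union) if union else 0.0
    return overlaps

# Hypothetical topics, for illustration only
example_topics = {
    0: "app great easy use support",
    1: "support team helpful quick app",
    2: "price expensive plan free trial",
}
overlaps = topic_overlap(example_topics)
for pair, score in overlaps.items():
    print(pair, f"{score:.2f}")
```

Pairs with a high score are candidates for merging, or a hint that a different number of topics might separate the themes better.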
4. Statistical Validation
My favorite part. Not because it is easy. Because I know what I'm doing here. To check that the findings are robust, I ran statistical tests. The neat part about reviews is that they are continuous: new ones keep coming in, so the general stance or polarity can change over time. To understand that change, I'm planning to include Cohen's d, which measures the effect size between groups. For now, the script runs a one-sample t-test on sentiment polarity against a reference value of 0.5:
t_statistic, p_value = stats.ttest_1samp(reviews_df['sentiment_polarity'], 0.5)
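Since Cohen's d is still on the to-do list, here is a minimal sketch of what it could look like. It assumes the one-sample variant, comparing the mean polarity against the same reference value of 0.5 used in the t-test; comparing two groups of reviews would use a pooled standard deviation instead. The function name `cohens_d_one_sample` and the scores are hypothetical.

```python
import statistics

def cohens_d_one_sample(scores, reference):
    """One-sample Cohen's d: (mean - reference) / sample standard deviation."""
    spread = statistics.stdev(scores)
    if spread == 0:
        raise ValueError("all scores are identical; effect size is undefined")
    return (statistics.mean(scores) - reference) / spread

# Hypothetical polarity scores, for illustration only
sample_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
d = cohens_d_one_sample(sample_scores, reference=0.5)
print(f"Cohen's d = {d:.3f}")
```

A common rule of thumb reads an absolute d of about 0.2 as small, 0.5 as medium, and 0.8 as large, which helps put a significant p-value into perspective.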
5. Generating the Final Report
The last part of the script compiles the results into an MDX file, 'analysis_report.mdx', providing detailed insights that include the following:
- Overall sentiment distribution
- Most positive and negative reviews
- Thematic analysis with sample reviews
- Statistical measures like standard deviation and effect size
- Detailed breakdown of topics contributing to negative sentiment
The first iteration just printed results to the console, but formatting with Markdown simply looks better.
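To give a feel for the report-building step, here is a small sketch that turns rating counts into a Markdown list. It is a simplified stand-in for one section of the report, not the full script; the function name `rating_distribution_md` and the counts are hypothetical.

```python
def rating_distribution_md(rating_counts, total_reviews):
    """Build a Markdown bullet list of review counts per star rating.

    `rating_counts` is any mapping from star rating (1-5) to a count;
    ratings missing from the mapping are reported as 0.
    """
    lines = [f"Out of **{total_reviews}** reviews analyzed:", ""]
    for rating in range(1, 6):
        count = rating_counts.get(rating, 0)
        lines.append(f"- {rating}-star reviews: **{count}**")
    return "\n".join(lines)

# Hypothetical counts, for illustration only
example_counts = {1: 3, 2: 1, 4: 6, 5: 20}
report_section = rating_distribution_md(example_counts, sum(example_counts.values()))
print(report_section)
```

Because the function only returns a string, the same section can be printed to the console or appended to the MDX output.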
Practical Applications
One project I'm planning to work on is running the same thematic analysis on other content, such as your competitors' website content (inner B2B marketer, hehe).
So, this workflow isn’t confined to Shopify app reviews. It can be adapted for:
- Product reviews
- Customer feedback analysis
- Social media sentiment tracking
- Academic text analysis
Next Steps and Improvements
I'm planning to keep this code updated with improvements. To enhance the thematic analysis side of this project, I aim to:
- Implement advanced NLP techniques
- Add interactive visualizations
- Enable real-time analysis pipelines
- Incorporate machine learning for predictive insights (that is way out of my league, but LLMs are 🚀)
While the current approach provides valuable outputs, this is only the beginning. Future updates will make the process more efficient and insightful, maybe enabling broader applications. This was a fun project to spend some of my time and brain juice on.
Let me know if this works for you!
Cheers, Berkem
Try It Out
import pandas as pd
import re
from collections import Counter
from textblob import TextBlob
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from scipy import stats
import os

# Sample sentiment analysis function
def sentiment_analysis(text):
    return TextBlob(text).sentiment.polarity

# Function to display topics from NMF model
def display_topics(model, feature_names, no_top_words):
    topic_dict = {}
    for topic_idx, topic in enumerate(model.components_):
        topic_dict[topic_idx] = " ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]])
    return topic_dict

# Function to get a sample review for each topic
def get_sample_review_for_topic(df, topic):
    try:
        return df.loc[df['topic'] == topic].sample(1)['content'].values[0]
    except ValueError:
        return None

# Function to clean and tokenize text
def clean_and_tokenize(text):
    if pd.isna(text):
        text = ""
    text = re.sub(r'[^A-Za-z\s]', '', str(text)).lower()
    tokens = text.split()
    return tokens

# Load the CSV file
file_path = './output/reviews.csv'  # Replace with your file path
reviews_df = pd.read_csv(file_path)

# Convert all entries in the 'content' column to strings
reviews_df['content'] = reviews_df['content'].astype(str)

# Apply tokenization
reviews_df['tokens'] = reviews_df['content'].apply(clean_and_tokenize)

# Flatten the list of tokens and count the occurrences
all_tokens = [token for tokens in reviews_df['tokens'] for token in tokens]
word_freq = Counter(all_tokens)

# Most common words
most_common_words = word_freq.most_common(10)

# Apply sentiment analysis
reviews_df['sentiment_polarity'] = reviews_df['content'].apply(sentiment_analysis)

# Check if the DataFrame or the 'sentiment_polarity' column is empty
if not reviews_df.empty and 'sentiment_polarity' in reviews_df.columns and not reviews_df['sentiment_polarity'].empty:
    outlier_positive = reviews_df.loc[reviews_df['sentiment_polarity'].idxmax()]
    outlier_negative = reviews_df.loc[reviews_df['sentiment_polarity'].idxmin()]
else:
    outlier_positive = None
    outlier_negative = None
    print("The DataFrame or the 'sentiment_polarity' column is empty.")

# Ensure outlier_negative has an actually negative sentiment
if outlier_negative is not None and outlier_negative['sentiment_polarity'] >= 0:
    outlier_negative_text = "No extremely negative review found."
else:
    outlier_negative_text = outlier_negative['content'] if outlier_negative is not None else "No extremely negative review found."

# Calculating standard deviation and margin of error
std_dev = np.std(reviews_df['sentiment_polarity'])
z_score = 1.96  # For 95% confidence
margin_of_error = z_score * (std_dev / np.sqrt(len(reviews_df)))

# T-Test for sentiment polarity
t_statistic, p_value = stats.ttest_1samp(reviews_df['sentiment_polarity'], 0.5)

# Check if the DataFrame is empty before applying TfidfVectorizer
if not reviews_df.empty:
    tfidf_vectorizer = TfidfVectorizer(stop_words='english')
    try:
        tfidf = tfidf_vectorizer.fit_transform(reviews_df['content'])
    except ValueError as e:
        print(f"Error with TfidfVectorizer: {e}")
else:
    print("The DataFrame is empty, cannot apply TfidfVectorizer.")

# Thematic analysis using NMF
n_topics = 5
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(reviews_df['content'])
nmf = NMF(n_components=n_topics, random_state=1, max_iter=500).fit(tfidf)  # Increase max_iter
topics = display_topics(nmf, tfidf_vectorizer.get_feature_names_out(), 10)

# Assigning topic to each review
reviews_topics = nmf.transform(tfidf)
reviews_df['topic'] = reviews_topics.argmax(axis=1)

# Analyzing sentiment polarity distribution
average_sentiment = reviews_df['sentiment_polarity'].mean()
if average_sentiment >= 0.5:
    sentiment_summary = "which indicates a generally positive sentiment among the reviewers."
elif average_sentiment > 0:
    sentiment_summary = "which indicates a moderately positive sentiment among the reviewers."
else:
    sentiment_summary = "which indicates a generally negative sentiment among the reviewers."

# Generate thematic summary
themes_summary = []
for topic, words in topics.items():
    sample_review = get_sample_review_for_topic(reviews_df, topic)
    themes_summary.append(f"**Topic {topic+1}**\n- **Keywords:** {', '.join(words.split()[:10])} \n> ***Sample Review:*** \"{sample_review}\"\n")

# Identifying topics contributing to negative sentiment based on ratings
negative_threshold = 2
negative_reviews_df = reviews_df[reviews_df['rating'] <= negative_threshold]
negative_topics = negative_reviews_df['topic'].value_counts().index.tolist()
negative_topics_analysis = []
for idx in negative_topics:
    sample_review = get_sample_review_for_topic(negative_reviews_df, idx)
    negative_topics_analysis.append(f"**Topic {idx+1}** \n- **Keywords:** {topics[idx]} \n> ***Sample Negative Review:*** \"{sample_review}\"\n")

# Counting the ratings
rating_counts = reviews_df['rating'].value_counts().sort_index()
total_reviews = len(reviews_df)

# Check the sample size for statistical significance
sample_size_warning = ""
if total_reviews < 30:
    sample_size_warning = f"**Warning:** The sample size **({total_reviews})** is considered too small for reliable statistical analysis. Interpret the results with caution."

# Prepare final output as markdown
output = """# Thematic Analysis Report

## General Opinion
The average sentiment polarity of the reviews is approximately **{:.3f}**, {}

## Outlier Remarks
### Most Positive Review
"{}"

*This review suggests a very high level of satisfaction and endorsement.*

### Most Negative Review
"{}"

## Review Distribution
Out of **{}** reviews analyzed:
""".format(
    average_sentiment,
    sentiment_summary,
    outlier_positive.content if outlier_positive is not None else "No extremely positive review found",
    outlier_negative_text,
    total_reviews
)

# Add rating distribution
for rating in range(1, 6):
    count = rating_counts.get(rating, 0)
    output += f"- {rating}-{'star' if count == 1 else 'stars'} reviews: **{count}**\n"

output += """
## Thematic Analysis
The thematic analysis of the Shopify app reviews using topic modeling has revealed the following key themes and how people are using and perceiving the app:

"""

# Add themes
for theme_summary in themes_summary:
    output += theme_summary + "\n"

output += f"""
## Statistical Analysis
Overall, these themes indicate that the app is generally {('well-received' if average_sentiment > 0 else 'not well-received')} for its various features.

### Sentiment Distribution
- **Standard Deviation:** {std_dev:.3f}
- **Margin of Error (95% CI):** ±{margin_of_error:.3f}
- **True Average Sentiment Range:** {average_sentiment-margin_of_error:.3f} to {average_sentiment+margin_of_error:.3f}

### Statistical Tests
1. **T-Test Results**
   - **T-statistic:** {t_statistic:.3f}
   - **P-value:** {p_value:.3f}
   - **Interpretation:** {"This is **negative**. This makes the analysis ***significant***." if t_statistic < 0 and p_value < 0.05 else "This is positive. This makes the analysis ***not significant***."}

## Negative Sentiment Analysis
The following topics were frequently associated with negative reviews:

"""

# Add negative topics
for negative_topic in negative_topics_analysis:
    output += negative_topic + "\n"

if sample_size_warning:
    output += f"\n> ⚠️ {sample_size_warning}\n"

# Create the thematic-analysis directory if it doesn't exist
os.makedirs('thematic-analysis', exist_ok=True)

# Write the output to a markdown file
with open('thematic-analysis/analysis_report.mdx', 'w', encoding='utf-8') as f:
    f.write(output)

print("Analysis report has been saved to thematic-analysis/analysis_report.mdx")