Predicting the "helpfulness" of peer-written product reviews

Some e-commerce sites let customers write reviews of their products, which other customers can then browse when considering buying a product. I know I’ve read product reviews written by my fellow customers to help me figure out if a product would be true to size, last a long time, or contain an ingredient I’m concerned about.

What if a business could predict which reviews its customers would find helpful? Maybe it could put those reviews first on the page so that readers could get the best information sooner. Maybe the business could note which topics come up in those helpful reviews and revise its product descriptions to contain more of that sort of information. Maybe the business could even identify “super reviewers,” users who are especially good at writing helpful reviews, and offer them incentives to review more products.

Using a large collection of product reviews from Amazon, I trained a range of machine learning models to try to identify which reviews readers rated as “helpful.” I tried Random Forests, logistic regression, a Support Vector Machine, GRU networks, and LSTM networks, along with a variety of natural language processing (NLP) techniques for preprocessing my data. As it turns out, predicting helpful reviews is pretty hard, but not impossible! To go straight to the code, check out my GitHub repo. To learn more about how I did it, read on.
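The full project tries several model families, but the general shape of the simplest approach is easy to sketch: turn review text into TF-IDF features and feed them to a linear classifier. Here's a minimal illustration with a few made-up reviews (not the Amazon data) and a binary "was this review voted helpful?" label:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the real data: review text plus a helpfulness label
reviews = [
    "Runs small, order a size up. Held up well after many washes.",
    "Great!!!",
    "Contains soy lecithin, which the label does not make obvious.",
    "love it",
    "Zipper broke after two weeks; the fabric also pills badly.",
    "ok product",
]
helpful = [1, 0, 1, 0, 1, 0]

# TF-IDF features feeding a logistic regression classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reviews, helpful)

predictions = model.predict(reviews)
print(predictions)
```

On the real dataset you'd want a proper train/test split and much more preprocessing, but the pipeline pattern stays the same.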


Evaluating a Random Forest model

The Random Forest is a powerful tool for classification problems, but as with many machine learning algorithms, it can take a little effort to understand exactly what is being predicted and what it means in context. Luckily, Scikit-Learn makes it pretty easy to run a Random Forest and interpret the results. In this post I’ll walk through the process of training a straightforward Random Forest model and evaluating its performance using confusion matrices and classification reports. I’ll even show you how to make a color-coded confusion matrix using Seaborn and Matplotlib. Read on!
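In outline, the workflow looks something like the sketch below, here using synthetic data from `make_classification` in place of a real dataset: fit the forest, then inspect a confusion matrix and classification report, and finally render the matrix as a seaborn heatmap.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for a real dataset
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Text-based evaluation
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))

# Color-coded confusion matrix
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.savefig("confusion_matrix.png")
```

The `annot=True, fmt="d"` arguments print the raw counts inside each cell, which makes the heatmap readable at a glance.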

A path extends into a dense forest of conifers and ferns.
Just some random forest. (The jokes write themselves!)

Time series modeling with Facebook Prophet

When trying to understand time series, there’s so much to think about. What’s the overall trend in the data? Is it affected by seasonality? What kind of model should I use, and how well will it perform?

All these questions can make time series modeling kind of intimidating, but it doesn’t have to be that bad. While working on a project for my data science bootcamp recently, I tried Facebook Prophet, an open-source package for time series modeling developed by … y’know, Facebook. I found it super quick and easy to get it running with my data, so in this post, I’ll show you how to do it and share a function I wrote to do modeling and validation with Prophet all-in-one. Read on!


Hypothesis-testing the discount bump

I bet you know this feeling: an item you need is on sale, so you gleefully add it to your cart and start thinking, “What else can I buy with all this money I just saved?” A few clicks (or turns around the store) later, and you’ve got a lot more in your cart than you came for.

This is such a common phenomenon that some retailers openly exploit it. Amazon has those “add-on” items that are cheaper (or only available) if you add them to an order of a certain size. I’m pretty sure I’ve heard Target ads that crack jokes about the experience of coming to the store for something essential and leaving with a cartful of things you didn’t need but wanted once you saw what a great deal you were getting.

For a project in my data science bootcamp, I was asked to form and test some hypotheses using a database containing product and sales data from a fictional dealer in fine food products. It’s the Northwind database, which Microsoft created as a sample for learning how to use some of its database products. While the data isn’t real, it’s realistic, so most of the time it behaves the way you would expect real sales data to behave. It’s also really, really clean, which is unusual in data science.

In this post I’ll walk you through a hypothesis test using Welch’s t-test to determine whether customers spend more once they have been offered a discount (spoiler alert: they do!), and if so, how much more they spend.
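The core of the test fits in a few lines with scipy. Here's the idea with simulated order totals rather than the Northwind data; `equal_var=False` is what makes `ttest_ind` a Welch's t-test, dropping the assumption that the two groups have equal variance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Simulated order totals: customers without vs. with a discount
no_discount = rng.normal(loc=100, scale=20, size=200)
discount = rng.normal(loc=115, scale=25, size=200)

# Welch's t-test: equal_var=False drops the equal-variance assumption
t_stat, p_value = stats.ttest_ind(discount, no_discount, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A small p-value lets you reject the null hypothesis that the two groups spend the same on average; the sign of the t-statistic tells you which group spends more.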


Cleaning house data

For my first project as a Flatiron School data science bootcamper, I was asked to analyze data about the sale prices of houses in King County, Washington, in 2014 and 2015. The dataset is well known to students of data science because it lends itself to linear regression modeling. You can take a look at the data over at

In this post, I’ll describe my process of cleaning this dataset to prepare for modeling it using multiple linear regression, which allows me to consider the impact of multiple factors on a home’s sale price at the same time.
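The cleaning steps follow the usual pandas pattern. Here's a toy frame (made-up rows, but column names in the style of the King County data) showing the kinds of problems involved and how to handle them: exact duplicates, missing values, and dates stored as text.

```python
import numpy as np
import pandas as pd

# Toy frame with the kinds of problems the real dataset has
df = pd.DataFrame({
    "price": [221900.0, 538000.0, 538000.0, 604000.0],
    "sqft_living": [1180, 2570, 2570, 1960],
    "waterfront": [0.0, np.nan, np.nan, 0.0],
    "date": ["20141013T000000", "20141209T000000",
             "20141209T000000", "20141209T000000"],
})

df = df.drop_duplicates()                      # remove exact duplicate rows
df["waterfront"] = df["waterfront"].fillna(0)  # treat missing as "not waterfront"
df["date"] = pd.to_datetime(df["date"])        # parse the text dates
print(df.dtypes)
```

How to fill missing values is a judgment call that depends on what the column means; filling `waterfront` with 0 assumes an unrecorded value means "not on the waterfront," which you'd want to justify before modeling.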


Seven ways to scatterplot

You know scatterplots—those sprinkles of points that help you get an initial sense for how two variables relate to one another. If you have data to analyze, you’ll probably be making a scatterplot sooner or later. In this post, I’ll run through seven ways to make scatterplots using a variety of tools in Excel, Python, and R. I’ll always use the same data, so you can easily compare and decide what works for you.
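As a quick taste of just one of the seven, here's the plain matplotlib version with a small made-up dataset; the other approaches produce the same kind of figure with different tools.

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Small made-up dataset: two variables to compare
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A basic matplotlib scatterplot")
fig.savefig("scatter.png")
```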


I dig data science

Hi! I’m Jenny, and over the last few years I’ve been slowly pivoting the focus of my work from Roman archaeology to data science. In August 2019, I started a full-time data science bootcamp offered by Flatiron School. At about 50 hours of coursework per week, the program is going to help me build solid skills pretty quickly, and I can’t wait to see where I go next.

If you’re just meeting me through this blog post, maybe you would like to know how a person becomes a Roman archaeologist—and then leaves it for something as different as data science. If you know me already, you may be surprised at this big change. From my point of view, this has been coming for a long time, although I didn’t always know that “data science” was the name of the thing I wanted to do. If you want to hear more about the path that brought me here, read on.


In the toolbox: StoryMap.JS

You may have heard of StoryMap.JS, made by the Knight Lab at Northwestern University. I’ll be using it at work soon, so I decided to give it a spin with some photos from a road trip I took around Sardinia earlier this year.

I loved how easy it was to get started on a StoryMap. I can imagine that it will be easy to get even the tech-queasy to try this out. What I DON’T love is how badly it displays in the narrow text block of this WordPress theme. (If I make the iFrame any shorter, the internal scroll just gets to be too much.) Check it out here for a better experience.

StoryMap would have come in really handy for a talk I gave about this particular trip. I may never look at that PowerPoint again, but a StoryMap might make a nice way to tell the same tale as a user-guided experience.