Evaluating a Random Forest model

The Random Forest is a powerful tool for classification problems, but as with many machine learning algorithms, it can take a little effort to understand exactly what is being predicted and what it means in context. Luckily, Scikit-Learn makes it pretty easy to run a Random Forest and interpret the results. In this post I’ll walk through the process of training a straightforward Random Forest model and evaluating its performance using confusion matrices and classification reports. I’ll even show you how to make a color-coded confusion matrix using Seaborn and Matplotlib. Read on!

[Image: A path extends into a dense forest of conifers and ferns.]
Caption: Just some random forest. (The jokes write themselves!)
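To give a taste of what a confusion matrix actually counts, here is a minimal sketch in plain Python. It is not the code from the full post (which uses Scikit-Learn); the labels, the made-up predictions, and the helper names `confusion_matrix` and `precision_recall` are all assumptions for illustration. Rows are true classes and columns are predicted classes, which is the same layout Scikit-Learn uses.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows are true labels, columns are predictions."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def precision_recall(matrix, index):
    """Precision and recall for the class at position `index` in the label list."""
    true_positives = matrix[index][index]
    predicted_total = sum(row[index] for row in matrix)  # column sum
    actual_total = sum(matrix[index])                    # row sum
    precision = true_positives / predicted_total if predicted_total else 0.0
    recall = true_positives / actual_total if actual_total else 0.0
    return precision, recall


# Hypothetical labels standing in for a classifier's output.
y_true = ["cat", "cat", "dog", "dog", "dog", "cat"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "cat"]
labels = ["cat", "dog"]

matrix = confusion_matrix(y_true, y_pred, labels)
cat_precision, cat_recall = precision_recall(matrix, labels.index("cat"))

for label, row in zip(labels, matrix):
    print(label, row)
print(f"cat precision={cat_precision:.2f}, recall={cat_recall:.2f}")
```

A classification report is essentially these per-class precision and recall figures (plus F1 and support) laid out in a table.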

Time series modeling with Facebook Prophet

When trying to understand time series, there’s so much to think about. What’s the overall trend in the data? Is it affected by seasonality? What kind of model should I use, and how well will it perform?

All these questions can make time series modeling kind of intimidating, but it doesn’t have to be that bad. While working on a project for my data science bootcamp recently, I tried Facebook Prophet, an open-source package for time series modeling developed by … y’know, Facebook. I found it super quick and easy to get it running with my data, so in this post, I’ll show you how to do it and share a function I wrote to do modeling and validation with Prophet all-in-one. Read on!


Hypothesis-testing the discount bump

I bet you know this feeling: an item you need is on sale, so you gleefully add it to your cart and start thinking, “What else can I buy with all this money I just saved?” A few clicks (or turns around the store) later, and you’ve got a lot more in your cart than you came for.

This is such a common phenomenon that some retailers openly exploit it. Amazon has those “add-on” items that are cheaper (or only available) if you add them to an order of a certain size. I’m pretty sure I’ve heard Target ads that crack jokes about the experience of coming to the store for something essential and leaving with a cartful of things you didn’t need but wanted once you saw what a great deal you were getting.

For a project in my data science bootcamp, I was asked to form and test some hypotheses using a database containing product and sales data from a fictional-but-realistic dealer in fine food products. It’s the Northwind database, and Microsoft created it as a sample for learning how to use some of their database products. While the data isn’t really real, it’s realistic, so most of the time it behaves the way you would expect real sales data to behave. It’s also really, really clean, which is unusual in data science.

In this post I’ll walk you through a hypothesis test using Welch’s t-test to determine whether customers spend more once they have been offered a discount (spoiler alert: they do!), and if so, how much more they spend.
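If you have never seen Welch's t-test spelled out, here is a rough sketch of the arithmetic using only the Python standard library. The order totals below are invented for illustration and are not Northwind data, and `welch_t` is a hypothetical helper name. The point of Welch's version is that it does not assume the two groups have equal variances.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, dof


# Made-up order totals: orders with a discount vs. orders at full price.
discounted = [30.0, 42.0, 38.0, 51.0, 45.0]
full_price = [25.0, 31.0, 28.0, 35.0]

t_stat, dof = welch_t(discounted, full_price)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof:.2f}")
```

In practice you would probably let SciPy do this with `scipy.stats.ttest_ind(a, b, equal_var=False)`, which also returns a p-value; the sketch above only shows where the statistic comes from.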


Cleaning house data

For my first project as a Flatiron School data science bootcamper, I was asked to analyze data about the sale prices of houses in King County, Washington, in 2014 and 2015. The dataset is well known to students of data science because it lends itself to linear regression modeling. You can take a look at the data over at Kaggle.com.

In this post, I’ll describe my process of cleaning this dataset to prepare for modeling it using multiple linear regression, which allows me to consider the impact of multiple factors on a home’s sale price at the same time.


Seven ways to scatterplot

You know scatterplots—those sprinkles of points that help you get an initial sense for how two variables relate to one another. If you have data to analyze, you’ll probably be making a scatterplot sooner or later. In this post, I’ll run through seven ways to make scatterplots using a variety of tools in Excel, Python, and R. I’ll always use the same data, so you can easily compare and decide what works for you.


I dig data science

Hi! I’m Jenny, and over the last few years I’ve been slowly pivoting the focus of my work from Roman archaeology to data science. In August 2019, I started a full-time data science bootcamp offered by Flatiron School. At about 50 hours of coursework per week, the program is going to help me build solid skills pretty quickly, and I can’t wait to see where I go next.

If you’re just meeting me through this blog post, maybe you would like to know how a person becomes a Roman archaeologist—and then leaves it for something as different as data science. If you know me already, you may be surprised at this big change. From my point of view, this has been coming for a long time, although I didn’t always know that “data science” was the name of the thing I wanted to do. If you want to hear more about the path that brought me here, read on.


In the toolbox: StoryMap.JS

You may have heard of StoryMap.JS, made by the Knight Lab at Northwestern University. I’ll be using it at work soon, so I decided to give it a spin with some photos from a road trip I took around Sardinia earlier this year.

I loved how easy it was to get started on a StoryMap. I can imagine that it will be easy to get even the tech-queasy to try this out. What I DON’T love is how badly it displays in the narrow text block of this WordPress theme. (If I make the iFrame any shorter, the internal scroll just gets to be too much.) Check it out here for a better experience.

StoryMap would have come in really handy for a talk I gave about this particular trip. I may never look at that PowerPoint again, but a StoryMap might make a nice way to tell the same tale as a user-guided experience.