I dig data science

Hi! I’m Jenny, and over the last few years I’ve been slowly pivoting the focus of my work from Roman archaeology to data science. In August 2019, I started a full-time data science bootcamp offered by Flatiron School. At about 50 hours of coursework per week, the program is going to help me build solid skills pretty quickly, and I can’t wait to see where I go next.

If you’re just meeting me through this blog post, maybe you would like to know how a person becomes a Roman archaeologist—and then leaves it for something as different as data science. If you know me already, you may be surprised at this big change. From my point of view, this has been coming for a long time, although I didn’t always know that “data science” was the name of the thing I wanted to do. If you want to hear more about the path that brought me here, read on.

Indiana Jones and the Table of Doom

My first archaeological job, and my first exposure to the wild world of data wrangling, came in the form of a summer research gig after my first year of college. My advisor had received some heavy packages in the mail, containing huge zippered plastic bags full of excavation records on paper. These were all the records from an excavation done in Carthage, Tunisia, in the late ’80s and early ’90s.

“Can you build a database?” she probably asked.

I probably shrugged in response.

“Well, you’ll learn.”

Armed with a copy of Microsoft Access for Dummies, I set about building my first database and discovering my passion for archaeology and for data. That summer I sifted through thousands of pages of records, trying to reconstruct not just what happened in a suburban cemetery in Carthage 1,700 years ago, but also how the archaeologists working there made the data that were handed down to me. (Find a copy of the publication near you.)

Kit seeks tools

A few years later, I entered a PhD program in classical art and archaeology, ready to dedicate the next seven years of my life to wrangling archaeological data into meaningful stories about the past. Looking back on my first three years of coursework, I can see a pretty clear theme in my projects and seminar papers: how to harness our wealth of data and 21st-century technologies to tell even better archaeological stories.

For a seminar on Achaemenid Persian cylinder seals, I gathered data on social interactions among seal owners and built profile pages for a few of them on a mock social-networking site.

For a course on Roman sculpture, I laid out basic user requirements for a database of all known Roman sarcophagi, complete with tagging to enhance discoverability.

I published an article that used basic statistical analysis of a corpus of epitaphs from Roman catacombs to try to identify implicit markers of gender or age.

Now I know that these are all data-science problems: how to measure connectedness in a social network, how to classify images by content, how to assign people (or inscriptions) to categories based on a few observable characteristics. At the time, I just felt like I had lots of interesting problems but lacked the tools to solve them.

Scaling up

Later in my grad school career, I got to work with data on a bigger scale, and I was fueled by the challenge. I had a summer gig drawing pottery for a University of Cincinnati project at Pompeii, and after a few years, my boss switched me over to a data wrangling role.

The project’s database contained over 6,000 records for stratigraphic units (the basic analytical unit in most modern styles of excavation: essentially, a bit of dirt that is different from the dirt around it). My job was to write a five- to ten-word description of each unit saying what it was and how it related to other units. This was harder than it sounded, because along the way I encountered all the little problems in this “dirty” dataset (yes, pun intended). I found dirt and floors and walls that, due to some small error or other, had gotten lost in time and space, or had jumped out of order and invaded some other part of our site’s story. What could have been an exercise in concise writing turned into a massive data cleaning operation. I. Was. Pumped.

While writing my dissertation, I really began to feel the pain of loving my data and having research questions that gave me energy to go to work every day, yet still lacking the right tools to answer them. So I hacked some solutions. Since I couldn’t build myself a machine learning algorithm, I did descriptive statistics on my data and argued that there was almost certainly something really interesting going on there…to be clarified in the future. I learned to use an open-source network visualization program to model social networks among gravediggers and other workers at the cemeteries I studied. (You can read my dissertation here.)

And I started coding. I spent the first 30 to 90 minutes of my workday doing exercises at Codecademy, hoping that someday I would build the skills I needed to do the cool stuff with data that I was dying to do.

Data are everywhere

In my first job after earning my PhD, I got an unusual entry into the world of data science. I was already very excited about data science by this point; I had bought myself a DataCamp subscription, and I was cramming R and Python courses in the early morning, late evening, and every break at work. I was interning full-time at the Getty Foundation, doing research and administrative work to support grant-making initiatives in several subfields of art history and conservation.

My boss asked me to do some research on computer vision, since that was becoming a hot topic in the world of digital art history. My task was to study the major commercial computer-vision APIs from an art historian’s point of view: how could an art historian use these tools for research? Needless to say, I went all in on this project. I built myself a set of test images to fit various art historical use cases, and I made tables summarizing what each API could do. I ended up giving my presentation multiple times to folks in the Getty Foundation and the Getty Research Institute, where researchers were preparing to use computer vision on Ed Ruscha’s photographs of Los Angeles. A GRI staff member even asked to borrow some of my slides.

This brings me up to the past year, when I decided to commit to moving toward data science as a career. I was working at the University of Oregon as a program/project manager jointly appointed to the University of Oregon Libraries and the campus art museum, the Jordan Schnitzer Museum of Art. In addition to running a small internal grant program, I served as project manager for the projects it funded.

In practical terms, this meant leading virtual teams of 15 to 20 professionals, faculty, and students in developing websites to present each faculty member’s research. (Read more about these projects in my portfolio.) The program was entirely new, and learning how to manage virtual teams while building hefty projects on a short schedule was challenging, to say the least. I was happiest when I could spend a few hours cleaning metadata or building out working mockups. Making data usable and figuring out how to present it on a website were my little oases in a hectic schedule of meetings and deadlines.

On the road again

Now it’s August 2019, and I have just finished the first week of my five-month bootcamp. So far I have strengthened my Python skills, gotten all tangled up in git, and spent some time scheming for future projects. I’m thankful for the opportunity to get this training, and I’m excited to apply my skills to real-world problems soon.