Diving into the Messy Reality of Data


Students put their data science skills to the test in a weekend-long marathon

As data collection explodes, the field of data science is growing at Colby. Here, Nafis Bhuiyan ’26 works on his laptop during a 2023 AI hackathon.
By Abigail Curtis
Photography by Gregory A. Rec
April 24, 2026

In most of her statistics and data science classes, Peggy Jones ’26 works with data sets that have been carefully “cleaned” by professors, meaning that incorrect, corrupted, duplicate, or incomplete data have been removed. 

Professors do this because messy data lead to bad conclusions, and they want students to have good data so they have the best chance of making good analyses. But outside controlled environments like college classrooms, that’s not how data analysis works, which is exactly why it’s critical for Colby to provide students such as Jones the opportunity to work with so-called messy data. 

Toward that goal, Jones, a data science major, and other students spent 48 hours over a recent weekend immersed in real-world data during the American Statistical Association’s DataFest. During the annual event, teams of undergraduates analyze complex data sets to tackle important challenges.

At first, the immense data set seemed overwhelming, Jones said. She and her teammates started looking at it on Friday afternoon, then met on Saturday morning to begin discussing their ideas. They toiled throughout the day to figure out the question they wanted to answer, and then worked until 2 a.m. Sunday. They slept for a few hours, then finalized their work and rehearsed their presentations. 

The study of data science is growing at Colby. In 2023, the College was among the first liberal arts colleges in the country to offer a data science major.

By Sunday afternoon, Jones and her teammates were exhausted after spending so much time exploring the data, drawing insights, and finding the best ways to share those insights with others. They were ready for a nap—but also exhilarated by their efforts. 

“As much as I had been looking at data in different ways, shapes, and forms throughout my college career, I hadn’t really seen an actual data set, or gone out of my way to see how messy and raw an actual data set would look,” Jones said. “I was really glad that I did this. I know that after this competition, when I go out into the real world and am given a data set, be it for a job or graduate school, there’s not going to be any panic or fear. I would go in feeling a lot more confident about what to do.” 

The data revolution

Data collection has exploded. What used to be a manageable stream has become a rushing deluge: the world generates about 400 million terabytes of data every day. Printed out, a single terabyte of text would make a stack of paper 1,700 feet high, taller than the Empire State Building.

As data generation has surged, so has the need for people who can work with the data and make sense of it. The U.S. Bureau of Labor Statistics predicts a 34 percent increase in data science jobs over the next decade, making it one of the fastest-growing occupations in the country.

Data science is growing at Colby. In 2023, the College was among the first liberal arts colleges in the country to offer a data science major. In their coursework, students learn skills to thrive in a field that has undergone profound changes over the past 20 years and continues to evolve, said Assistant Professor of Statistics Annie Tang.

DataFest was founded at UCLA in 2011 to bring together the data science community through a celebration of data. During the event, teams of undergraduates work around the clock to find and share meaning from a large, rich, and complex data set.

“Back then, it was more like analyzing data and giving insights. I think data scientists were more like data analysts,” Tang said. “And now it’s less about analyzing existing data, and more about building things, and thinking about what kind of things should be built.”  

About 50 students, including a few teams from Bowdoin and Bates colleges, finished this year’s DataFest, which took place from Friday, April 10, to Sunday, April 12. Details about the specific data set they worked with must be kept confidential until all participating colleges and universities have held their events. In past years, students have addressed such subjects as tracking pandemic impacts, improving public health, optimizing rugby performance, and analyzing commercial real estate and job markets.

“The idea of the DataFest is that you get a really complicated data set with many different files. You can use the whole thing, or just parts of it, and there are so many different variables,” Tang said. “It was cool to see how people did very different things, and looked at very different things.” 

‘A good springboard’

During most of her time at Colby, Jones was an environmental computation major. Over time, she was drawn to statistics and data science, taking many courses as she pursued her curiosity about the subject. 

“For me, taking these classes gave me a framework,” she said. “I was taking classes I was genuinely interested in, and realized, I really do enjoy being in this space—why not be a major?” 

The field of data science has many possible career paths for students. Here, Ben Kaiser-Bulmash ’27, left, and Benjamin Wintersteen ’27 work on their laptops during a data event.

She is still firming up her post-graduation plans but is drawn to the field of public health after taking a course last fall with Nirav Shah, a visiting professor and epidemiologist who served as director of the Maine Center for Disease Control & Prevention during the Covid-19 pandemic. The course was her introduction to public health.

“I thought it was an incredible class,” Jones said. “Data science is a good springboard. I can look at different data within different fields—whether it be in health, the environment, business—and see what I can do with it.” 

With another DataFest in the books, Colby students are proving that they aren’t just learning to handle the data deluge—they’re learning to lead it.
