Building a Better Statistical Model
Research by a Colby statistics professor and two students has been published in a peer-reviewed journal and downloaded thousands of times
People use statistics to seek answers to questions through data, but if the underlying data are flawed, the answers are likely to be flawed, too.
Getting good data—or accurately accounting for imperfect data—is a critically important piece of the problem-solving process. A Colby professor and two of his recent students believe they have circumvented some of the challenges by finding a better way to develop statistical models, and they have published their work in a peer-reviewed journal.
Assistant Professor of Statistics Jerzy Wieczorek, Cole Guerin ’21, and Thomas McMahon ’21 created algorithms with potentially far-reaching applications, from helping governments and non-governmental organizations survey and predict poverty more effectively to building more accurate machine learning and artificial intelligence algorithms, such as those that teach self-driving cars to drive.
“This helps us understand if we’ve got data that were sampled in a very particular way and we account for it, we would hope that our models are going to be a little bit more cautious,” the professor said. “Our guess about how well this model will generalize will be a more honest assessment.”
Data collection 101
To better explain some of the challenges inherent in data sampling, Wieczorek shared the example that motivated their research in the first place. It was when he learned about a nonprofit organization that sought to use a simple questionnaire to gauge whether a family in a developing country was likely to be under or over the poverty line.
“Each country has its own definition of what counts as being in poverty,” he said. “To actually get a full, correct answer, you’d have to go through a person’s household finances and ask them about all the details of their income and spending.”
Doing this kind of comprehensive interview would take about an hour, something that is clearly impossible to do for every person in a given country. But a shorter questionnaire with a few key questions, including whether a family owns a television, could provide a good estimate of whether a family is likely to be under or over the poverty line.
The answers would give a sense of how many people in a country or region live in poverty, and that in turn could help the organization determine how to run its development programs there, Wieczorek said.
A detail that matters a lot, though, is who is asked to respond to the questionnaire. Was it a simple random sample of the population? Or a stratified sample, which aims to make sure there is a fair mix of people? Or something called a cluster sample, where some villages are chosen at random and every resident there is interviewed? Each has its benefits and reasons why it would be used, but each can yield different results.
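The three sampling designs described above can be sketched with a toy example. This is a hypothetical illustration (the village counts and sample sizes are invented, not from the research), using only Python's standard library:

```python
import random

random.seed(0)

# Hypothetical toy population: 6 villages, 20 households each.
population = [
    {"village": v, "household": h}
    for v in range(6)
    for h in range(20)
]

# 1. Simple random sample: every household equally likely to be chosen.
srs = random.sample(population, 12)

# 2. Stratified sample: draw 2 households from each village,
#    guaranteeing a fair mix across the villages (the strata).
stratified = []
for v in range(6):
    stratum = [p for p in population if p["village"] == v]
    stratified.extend(random.sample(stratum, 2))

# 3. Cluster sample: pick 2 villages at random and interview
#    every household in those villages.
chosen_villages = set(random.sample(range(6), 2))
cluster = [p for p in population if p["village"] in chosen_villages]

print(len(srs), len(stratified), len(cluster))  # 12 12 40
```

All three are legitimate designs, but notice how differently they spread across the population: the stratified sample touches every village, while the cluster sample concentrates all its interviews in just two.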
Similar issues arise beyond household surveys, Wieczorek said, including in the growing field of research on self-driving cars.
“When developing self-driving car algorithms, researchers will oversample data from rare but important conditions such as rainy days,” he said. “This oversampling leads to the same challenges around the statistical interplay between data collection and data analysis.”
Checking the results
Statisticians and researchers know this and typically do something called cross-validation, in which they check the quality of a model against a portion of the data that has been set aside as a holdout set.
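In its most common form, K-fold cross-validation shuffles the data and splits it into K folds, with each fold taking one turn as the held-out test set. A minimal sketch of this generic technique (the helper name is hypothetical; this is not the authors' survey-aware version):

```python
import random

def k_fold_splits(data, k, seed=0):
    """Shuffle the data and split it into k roughly equal folds.
    Each fold takes one turn as the held-out test set while the
    model would be fit on the remaining k-1 folds."""
    items = list(data)
    rng = random.Random(seed)
    rng.shuffle(items)
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Every data point appears in the test set exactly once across the 5 folds.
for train, test in k_fold_splits(range(100), k=5):
    assert len(train) + len(test) == 100
```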
“There’s a lot of nuances in how to do this right,” Wieczorek said. “What Cole, Thomas, and I did was work out how to account for the fact that you might be doing stratified or cluster sampling and how to mimic the strata or the clusters in the data when you’re splitting out the holdout sample here from the real sample.”
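One way to mimic a cluster design when splitting, sketched here with hypothetical names using only the standard library (the authors' actual software package is not shown), is to assign whole clusters, never individual households, to folds:

```python
import random

def cluster_kfold(samples, cluster_key, k, seed=0):
    """Assign whole clusters (e.g., villages) to folds, so a held-out
    fold never shares a cluster with the training data."""
    clusters = sorted({s[cluster_key] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(clusters)
    folds = [clusters[i::k] for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        test = [s for s in samples if s[cluster_key] in held]
        train = [s for s in samples if s[cluster_key] not in held]
        yield train, test

# Toy data: 6 villages, 10 surveyed households each.
samples = [{"village": v, "household": h} for v in range(6) for h in range(10)]
for train, test in cluster_kfold(samples, "village", k=3):
    # No village ever straddles the train/test boundary.
    assert not ({s["village"] for s in train} & {s["village"] for s in test})
```

Keeping clusters intact matters because households in the same village tend to resemble each other; if a village is split between training and holdout data, the model gets an overly optimistic estimate of how well it will generalize to new villages.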
Guerin, who double majored in mathematical sciences and studio art, said he was glad to have the opportunity to work on the research project with Wieczorek and McMahon. The work was interesting and challenging, he said, and came about after he and McMahon had taken a class with Wieczorek that focused on doing a large-scale survey of Waterville residents for the local fire department.
“It was very gratifying to work through a long-term project like that with a team,” Guerin, now a production data analyst for a Boston-based biotechnology company, said. “By that point, I had taken many classes with Jerzy and gotten to know him very well, so Thomas and I approached him to see if there were any other cool projects he was interested in doing with students, and he had this one on his mind.”
Their research, “K-fold Cross-Validation for Complex Sample Surveys,” was published in Stat, an innovative electronic journal for the rapid publication of novel and topical research results. Wieczorek presented their work at the 2021 Symposium on Data Science and Statistics, and the trio also wrote an associated open-source software package that can be shared with other researchers and data analysts.
Getting it published was “pretty cool,” Guerin said. “It felt really rewarding after having worked very hard on it as a group.”
So far, the software has gotten about 200 downloads a month, or around 3,000 downloads since its initial release, the professor said.
“We know people are doing this kind of work, and we want it to be done right,” he said.
Although training self-driving car models is very different from doing door-to-door surveys, both can benefit from using better statistical models. And better statistical models could result in better, safer lives for everyone, he said.