Humanities Database Enhanced by Artificial Intelligence

A cross-disciplinary team creates an online platform for analyzing Chinese magazines

Associate Professor of East Asian Studies Hong Zhang (front) and Assistant Professor of Computer Science Ying Li
By Kardelen Koldas ’15
Photography by Gregory Rec
December 13, 2021

Preparing for her Politics of Satire and Humor course, Associate Professor of East Asian Studies Hong Zhang stumbled upon a website with issues of the Chinese satirical magazine Modern Sketch (时代漫画). But the material covered only a brief period in her course, and comics from other eras were scattered.

So in 2018, she collected these primary sources into one place, the Colby China Studies online database, which she’s since steadily enriched and expanded. 

Now, with support from Colby’s recently launched Davis Institute for Artificial Intelligence, which facilitates cross-disciplinary research, she’s revamping the database with artificial intelligence (AI) to create a one-of-a-kind online platform. In collaboration with Assistant Professor of Computer Science Ying Li and student researchers, Zhang is building a public resource that lets scholars and students more easily access and analyze Chinese magazines and other primary sources.

Besides its unique and expansive content, what will set this platform apart are its robust AI-powered search capabilities and, later on, other features such as translation.

This new digital platform, still in development, will make available hundreds of issues of major state magazines published mostly from 1949 to the present. “These [magazines] are actually pretty representative if we want to study China,” said Zhang. They also complement one another for examining the country’s culture and politics in different eras. 

Included is Nationality Pictorial (民族画报), the only state-run magazine on ethnic minorities, which had not previously been digitized beyond its cover. “This material has never been studied as a primary source,” said Zhang, whose research is branching out into this topic. “I just feel that in our world today, we need to deal with that part of China.”

The landing page of the new database, currently in development. Once launched, the AI-powered site will house hundreds of issues of major Chinese magazines for students and scholars to easily search and use in their research.

To convert each issue from a set of scanned images into a searchable PDF, Ying Li and her students (currently Changling Li ’22, Siyuan Peng ’24, and Yaxuan Ren ’24) use optical character recognition (OCR), an area of artificial intelligence research that converts images of typed, handwritten, or printed text into machine-encoded text. Unlike a regular look-up, a search built on OCR can reach everything appearing in the magazines: titles, captions, and text on images. The platform will allow searches within and across issues and, in the near future, text analysis and data visualization.
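As a rough illustration of what searching within and across issues can look like once OCR has produced plain text, here is a minimal sketch in Python. It is not the team's code: the function name `search_issues`, the issue identifiers, and the sample sentences are all hypothetical, and a real platform would likely use a proper search index rather than scanning every string.

```python
def search_issues(issues, query):
    """Return {issue_id: [offsets]} for every issue whose OCR text contains query.

    `issues` maps a hypothetical issue identifier to the plain text that OCR
    extracted from that issue (titles, captions, and text found on images).
    """
    hits = {}
    for issue_id, text in issues.items():
        offsets = []
        start = text.find(query)
        while start != -1:
            offsets.append(start)
            # Continue one character later so overlapping matches are found too.
            start = text.find(query, start + 1)
        if offsets:
            hits[issue_id] = offsets
    return hits


# Made-up sample text standing in for OCR output from two issues.
issues = {
    "issue-a": "时代漫画是一本讽刺杂志",
    "issue-b": "民族画报介绍少数民族",
}

print(search_issues(issues, "民族"))
```

Because the query is matched as a sequence of characters, the search does not depend on word boundaries, which Chinese text does not mark with spaces.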

“We’re trying to build this to be a one-stop for text analysis,” said Ying Li. And yet, “it’s more difficult to train the machine for Chinese.” 

Chinese poses several challenges for accurate OCR. Unlike Western languages, Chinese doesn’t separate words with spaces. Writing direction also varies: Chinese text can run horizontally from left to right or vertically from right to left. And characters can be traditional or simplified and appear in various fonts.
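To make the missing-spaces problem concrete, here is a minimal sketch of one classic dictionary-based approach, forward maximum matching, which repeatedly takes the longest known word at the current position. This is an illustration only, not the method used by Ying Li's team, which the article does not describe; the function name `segment`, the `max_len` parameter, and the tiny vocabulary are assumptions made for the example.

```python
def segment(text, vocabulary, max_len=4):
    """Split unspaced text into words using forward maximum matching.

    At each position, try the longest candidate first and shrink it until it
    appears in `vocabulary`. A single character is always accepted, so the
    loop is guaranteed to advance.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy vocabulary; a real system needs a far larger word list.
vocabulary = {"民族", "画报", "杂志", "少数民族"}

print(segment("民族画报是杂志", vocabulary))
```

A greedy rule like this fails on ambiguous character sequences, which is one reason segmentation is usually handled with trained models rather than a word list alone.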

Existing models can determine whether a text is horizontal or vertical and whether it is in simplified or traditional Chinese, but they’re not always accurate and need refining. Ying Li and her students are addressing these challenges, with the potential goal of training the machine to detect and segment Chinese text regardless of how it is written. They are also cleaning data, scanning sources, and writing all the code for the work, all of which they’ll make available to others.

Moreover, these models can’t read the handwritten and ornamental characters in these magazines. “To identify all those characters, we had to create our own database and train those modules,” explained Li.

One of the students working on the project is Changling Li, a computer science and astrophysics double major. “Research related to full-stack website development requires you to know every single process from the front- and back-end,” he said. He has also been learning more about his home country of China. This work gives him an opportunity to explore his interest in machine learning and, he believes, will also make him stand out in the job market.

For Ying Li, this has been a chance not only to apply her scholarly skills but also to have students join her at the intersection of humanities, AI, machine learning, and system development.

“This is a really good opportunity to have hands-on experience for students,” she said, “on a cross-disciplinary project.”



