Ed. Magazine

History Rewritten

Assistant professor helps Boston archeologists uncover new way to digitize archives
A data guy walks into a kitchen.

It’s his kitchen, and Nadia Kline, his partner, is sitting there, tediously labeling photographs online. She’s an archeologist for the City of Boston and she and her team are creating a digital database of the city’s historical archives that researchers and students can use.

The process is frustratingly slow: She’s been at it for more than five hours. And it’s like this just about every day.

“I thought, I can help with that,” says Sebastian Munoz-Najar Galvez, the Ed School’s new assistant professor of data science and education. He wrote an optical character recognition (OCR) script, which converts the text in an image into machine-readable text. This was exactly what Kline and the archeology department needed. At the time, they were printing out a label for each artifact that included an archival code and descriptive information such as where the piece was found (beneath Paul Revere’s house, inside a privy on Endicott Street) and what it’s made of. They photographed each artifact with the label. The challenge was that the camera assigned each photo a default file name, such as DSC0001 or DSC0002. Staff had to open each picture by hand, read the label, then replace the camera’s default file name with the artifact’s catalog number. Doing this for tens of thousands of photos in the city’s massive collections was taking forever.

The script Munoz-Najar Galvez wrote finds the archival label in the photo, isolates the catalog number, and then saves the image with the catalog number as the file name.
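For readers curious what such a renaming step might look like, here is a minimal Python sketch. It is not the script described in the article. The catalog-number format (`BOS-0042.015`), the function names, and the choice of the open-source Tesseract engine (through the `pytesseract` and Pillow packages) are all assumptions made for illustration. The sketch also reads the whole photo and picks out the code with a pattern match, rather than first locating the label.

```python
import re
import shutil
from pathlib import Path
from typing import Callable, Optional

# Hypothetical catalog-number format, e.g. "BOS-0042.015".
# The article does not say what the real codes look like.
CATALOG_PATTERN = re.compile(r"\b[A-Z]{2,4}-\d{3,5}(?:\.\d{1,4})?\b")


def extract_catalog_number(label_text: str) -> Optional[str]:
    """Return the first catalog-number-like string in OCR output, or None."""
    match = CATALOG_PATTERN.search(label_text)
    return match.group(0) if match else None


def tesseract_ocr(image_path: Path) -> str:
    """Read all text in a photo with Tesseract (an assumed choice of engine)."""
    # Imported here so the rest of the sketch runs without these packages.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img)


def rename_by_catalog_number(
    image_path: Path,
    out_dir: Path,
    ocr: Callable[[Path], str] = tesseract_ocr,
) -> Optional[Path]:
    """Copy a photo into out_dir, named after the catalog number on its label.

    Returns the new path, or None when no catalog number is found or a file
    with that name already exists (so nothing is silently overwritten).
    """
    catalog_number = extract_catalog_number(ocr(image_path))
    if catalog_number is None:
        return None
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{catalog_number}{image_path.suffix.lower()}"
    if target.exists():
        return None
    shutil.copy2(image_path, target)  # keep the original photo untouched
    return target
```

Copying instead of renaming in place keeps the camera originals intact, and returning `None` for unreadable labels leaves a short list of photos for a person to check by hand.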

“The procedure was so effective that they started changing the typography on the labels so that the automatic procedure works better,” he says. He also helped the team learn basic coding. “Now they have ownership,” he says. “They’re running with it.”

Automating the process has freed up the staff to do more interesting things, such as using laser scanning and photogrammetry to create 3D models of some of the artifacts. These include a Boston Common cowbell and a whizzer (like a colonial-era fidget spinner) found at Town Dock near Faneuil Hall. The public can download the models for free and print them on their own 3D printers.

Munoz-Najar Galvez’s text processing work goes beyond helping the City of Boston. He’s teaching students at the Ed School how to access and make sense of the massive amount of data they will encounter as they move into jobs in school districts and at education departments.

“I’m here at Harvard to teach data science to education researchers,” he says, which includes developing classes in network analysis and text analysis. With students, using text analysis, he looks at documents like school improvement plans and charter school applications — common but complex education documents full of detailed information. “I’m helping students find evidence of schools innovating or different districts coordinating with one another. Within this massive archive of documentation are actually data to find out what’s going on in schools.”

As with the archeologists in Boston, though, doing this manually is too much. “No single human could spend their time fully reading all of these documents,” he says. “For example, we’re currently looking at school improvement plans in Florida, specifically documents where schools outline their plans for a whole year: budgets, strategic goals, description of their challenges. I’m like wow, you document all of this in such careful details, but who’s going to read every plan?” Using text analysis, he says, you can look at the data and find patterns.

“What we’re looking at in Florida is the difference between school improvement plans of very successful schools and schools that need improvement,” he says. “For example, we look at the budget. What are the things successful schools are putting their money toward? You could do this by interviewing people, and there are experts who know what to look for — I’m not trying to replace them — but that would take a long time, so we’re helping to make it more efficient. What I’m putting on the table is, we can do this at scale.”
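As a rough illustration of what comparing two sets of documents “at scale” can mean, the Python sketch below ranks words by how much more often they appear in one group of plans than in another. It is not the method used in the Florida work, which the article does not describe. The function names and the simple scoring rule (difference in the share of documents that mention a word) are assumptions chosen to keep the example short.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase a document and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def document_frequency(docs: Iterable[str]) -> Counter:
    """Count, for each word, how many documents mention it at least once."""
    counts: Counter = Counter()
    for doc in docs:
        counts.update(set(tokenize(doc)))
    return counts


def distinctive_terms(
    group_a: List[str], group_b: List[str], top_n: int = 5
) -> List[Tuple[str, float]]:
    """Rank words by (share of group_a docs) minus (share of group_b docs)."""
    if not group_a or not group_b:
        raise ValueError("both groups need at least one document")
    df_a = document_frequency(group_a)
    df_b = document_frequency(group_b)
    scores = {
        term: df_a[term] / len(group_a) - df_b[term] / len(group_b)
        for term in set(df_a) | set(df_b)
    }
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
```

A ranking like this only points to where the plans differ. Deciding whether a difference matters is still a question for the experts who know what to look for.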

Asked if people are surprised by what can be done with these scripts, he says, “It’s been really fun to have 'a-ha' moments with students and my colleagues.” He says he had his own moment of surprise when he realized the technology could also help the archeologists.

“The real ‘a-ha’ for me was when I identified that problem,” he says. “I realized there’s this process — the archeologists in Boston spend five hours a day doing this tedious task — that I can actually help out with.”

Munoz-Najar Galvez is quick to point out that what he’s doing, whether it’s for a school district digging into a budget document or an archeologist digging into photos of clay pots, isn’t unique.

“All of the tools I use are available in the world — this OCR technology is public and open to everyone,” he says. “In history, there’s a well-established tradition of using OCR to search through old archives. There’s an imaging unit at Harvard Library, for example, that does this with old books.”

But with everything he does, there’s still the human element.

“It’s not difficult to automate a tedious task or run analysis at scale,” Munoz-Najar Galvez says. “What is challenging is learning how to identify worthy puzzles as part of an authentic partnership with experts from other domains.”

The magazine of the Harvard Graduate School of Education
