Institute News

UCLA IPAM Students Present Results of Information Retrieval Project

(L-R: Bin Bi, Ilan Morgenstern, Lingxin Zhou, Qiaoyu Yang, Sam Gustman, Mateo Wirth, Mills Chang)

After two months working with the USC Shoah Foundation, the 2014 Research in Industrial Projects (RIPS) team made great strides in finding a way to link an outside archive to video segments in IWitness. The team presented their findings to USC Shoah Foundation staff on Wednesday.

Each summer, undergraduate students from around the world convene at the UCLA Institute for Pure and Applied Mathematics (IPAM) for the two-month RIPS program. They are placed in small groups, and each group works to solve an applied mathematics problem posed by the sponsor organization it is assigned. Past sponsors have included Intel, HRL and the Los Angeles Police Department. At the end of the program, each group presents its findings to its sponsor. This is the fifth year the USC Shoah Foundation has participated in RIPS.

Bin Bi, a PhD candidate in computer science at UCLA, mentored the four students assigned to the USC Shoah Foundation: Ilan Morgenstern Kaplan (Instituto Tecnologico Autonomo de Mexico), Mateo Wirth (Princeton), Qiaoyu Yang (Reed College), and Lingxin Zhou (Rutgers University).

USC Shoah Foundation technology staff asked the RIPS group this year to experiment with ways to link videos in IWitness, USC Shoah Foundation’s interactive educational website, to other archives – for the purposes of their project, the group worked with Wikipedia. In other words, each video segment would link automatically to a relevant Wikipedia page based on the segment’s keywords, providing another avenue of research and learning for students using IWitness.

This is an example of “information retrieval” – the same task search engines like Google perform – but with the testimonies’ keywords serving as the search queries.
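The article does not describe the team’s models, but keyword-based retrieval of this kind is often sketched as scoring candidate documents against the query with TF-IDF weights and cosine similarity. The example below is a minimal, hypothetical illustration: the article titles and snippets are invented stand-ins for Wikipedia pages, and `best_match` returns a single top-scoring page per segment, mirroring the one-result-per-segment constraint described below.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build TF-IDF weight vectors for a small corpus of tokenized documents."""
    n = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(u.get(t, 0.0) * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy stand-in for a Wikipedia corpus (invented snippets, for illustration only).
articles = {
    "Vilnius": "vilnius lithuania city capital jewish community".split(),
    "Smokin' Aces": "smokin aces film action movie".split(),
    "Immigration to the United States":
        "immigration new york city united states arrival".split(),
}
titles = list(articles)
vectors, idf = tf_idf_vectors(list(articles.values()))

def best_match(keywords):
    """Return the single best-scoring article for a segment's keywords."""
    tf = Counter(keywords)
    query = {t: tf[t] * idf.get(t, 0.0) for t in tf}
    scores = [(cosine(query, v), title) for v, title in zip(vectors, titles)]
    return max(scores)[1]
```

With this toy corpus, a segment keyworded “vilnius lithuania” matches the Vilnius article, while “immigration new york city” matches the immigration article.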

In their presentation, the students first outlined the challenges of their assignment. The keywords in each testimony segment don’t include context, so a keyword such as “concentration camp” could have many different interpretations. Also, some keywords are very vague, such as “photograph.” This made it difficult to retrieve truly relevant results from Wikipedia.

In addition, they only wanted to retrieve one result per segment – not a list of options for the user to choose from like a Google search.

Next, the team explained the different models they tested to see which would return the highest number of relevant Wikipedia pages for a random sample of 200 testimony segments. They scored each model on its precision – the fraction of retrieved Wikipedia pages that were actually relevant to their segments – and its recall, the fraction of relevant pages that were successfully retrieved. The team explained that it was important to balance the two in order to get the highest number of relevant results.
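The two metrics can be stated precisely, as in the short sketch below. It also shows the F1 score, one standard way to balance precision and recall; the article does not say which balancing measure, if any, the team used.

```python
def precision_recall(retrieved, relevant):
    """Precision: what fraction of the retrieved items are relevant?
    Recall: what fraction of the relevant items were retrieved?"""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

def f1(precision, recall):
    """Harmonic mean of precision and recall: a single balanced score."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Toy example with made-up item IDs:
p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 3, 5})
# p = 0.5 (2 of 4 retrieved items are relevant)
# r = 2/3 (2 of 3 relevant items were retrieved)
```

For reference, plugging the team’s reported figures (0.88 precision, 0.47 recall) into the F1 formula gives roughly 0.61.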

Some models worked better than others. For example, one model returned the Smokin’ Aces movie Wikipedia page for a segment in which the survivor discussed his immigration to New York City – not exactly relevant. But, using another model, the Wikipedia page for Vilnius, Lithuania, was retrieved for a segment in which the survivor discussed her background there. This would be considered an accurate retrieval.

Ultimately, the team’s best-performing model had 88 percent precision and 47 percent recall. They said keywords of specific people, events and places were the easiest to match in Wikipedia, but it was more difficult to find Wikipedia pages that could accurately convey abstract, personal keywords like “family life” or “loved ones’ deaths.” They suggested that an archive like Google Books, as opposed to Wikipedia, may be a better match for these more complicated keywords. Google Books could suggest books and novels that deal with broader themes like “loved ones’ deaths” that can’t adequately be conveyed in a Wikipedia page.

USC Shoah Foundation staff were impressed by the presentation and how well the students explained such complex ideas. Technology staff will consider the group’s research as it experiments further with integrating outside archives into the Visual History Archive.