RIPS Math Students Create New and Improved Algorithm for Searching the Visual History Archive

Fri, 08/21/2015 - 5:00pm

On Wednesday, the four undergraduates working with USC Shoah Foundation during UCLA's summer Research in Industrial Projects for Students (RIPS) program presented to staff their new method for returning more relevant search results in the Visual History Archive.

Adam Foster, Georg Maierhofer, Megan Shearer and Hangjian Li have been working for the past eight weeks to create an algorithm for the Visual History Archive that will allow users to receive more relevant testimony clips when they search for keywords in the archive.

In the group’s presentation, they explained how they adapted an existing statistical topic model, Latent Dirichlet Allocation (LDA), which gathers keywords into “topics” and returns search results based on the probability that certain topics contain certain related keywords. When they first applied LDA to their data, it wasn’t as effective as they wanted, so they incorporated the Visual History Archive’s indexing hierarchy, in which keywords are organized from general to specific within subject areas, into LDA to create a new algorithm, mLDA.
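To make the idea of topic-based ranking concrete, here is a minimal sketch of how an LDA-style model could score testimony clips for a search keyword. The topic names, clip IDs and probabilities below are all made up for illustration; they are not taken from the Visual History Archive or from the students' mLDA code, and a real model would learn these distributions from the indexed testimonies rather than have them written by hand.

```python
# Hypothetical p(keyword | topic): how strongly each topic "contains" a keyword.
TOPIC_KEYWORDS = {
    "trials": {"war criminals": 0.30, "war crimes trials": 0.40, "Rudolf Hoess": 0.30},
    "liberation": {"liberation": 0.60, "displaced persons camps": 0.40},
}

# Hypothetical p(topic | clip): how much of each clip belongs to each topic.
CLIP_TOPICS = {
    "clip_A": {"trials": 0.9, "liberation": 0.1},
    "clip_B": {"trials": 0.2, "liberation": 0.8},
}


def keyword_score(keyword, clip_id):
    """Probability of the keyword given the clip: sum over topics of
    p(topic | clip) * p(keyword | topic)."""
    return sum(
        weight * TOPIC_KEYWORDS[topic].get(keyword, 0.0)
        for topic, weight in CLIP_TOPICS[clip_id].items()
    )


def rank_clips(keyword):
    """Return clip IDs ordered from most to least relevant for the keyword."""
    return sorted(
        CLIP_TOPICS,
        key=lambda clip_id: keyword_score(keyword, clip_id),
        reverse=True,
    )


if __name__ == "__main__":
    for clip_id in rank_clips("war criminals"):
        print(clip_id, round(keyword_score("war criminals", clip_id), 3))
```

In this toy setup a clip can rank highly for a keyword because it is dominated by a topic that favors that keyword, whether or not the clip carries that exact tag.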

When they compared mLDA with the existing algorithm, they found that it improved search relevancy in many cases, especially for very general keywords. For example, while a search for “war criminals” in the existing archive returned testimony clips that had simply been tagged with the keyword “war criminals,” mLDA returned testimonies tagged with the names of war criminals, such as Rudolf Hoess, and testimonies about specific war crimes trials. Such results are far more likely to be relevant to someone searching the archive for war criminals, they reasoned.
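The difference between the two kinds of result can be illustrated with a small sketch that contrasts exact-tag matching with a search that also follows a general-to-specific keyword hierarchy. This is plain hierarchy expansion, not the students' mLDA, which combines the hierarchy with a topic model in a way the presentation summary does not spell out. The hierarchy slice, clip IDs, tags and the placeholder trial name are all hypothetical.

```python
# Hypothetical slice of an indexing hierarchy: general keyword -> more specific keywords.
HIERARCHY = {
    "war criminals": ["Rudolf Hoess", "war crimes trials"],
    "war crimes trials": ["trial X (placeholder)"],
}

# Hypothetical clips and the keywords they were tagged with.
CLIP_TAGS = {
    "clip_1": {"war criminals"},
    "clip_2": {"Rudolf Hoess"},
    "clip_3": {"trial X (placeholder)"},
    "clip_4": {"liberation"},
}


def descendants(keyword):
    """Return the keyword together with every more specific keyword beneath it."""
    found = set()
    stack = [keyword]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(HIERARCHY.get(current, []))
    return found


def exact_search(keyword):
    """Clips tagged with exactly this keyword."""
    return sorted(clip for clip, tags in CLIP_TAGS.items() if keyword in tags)


def hierarchy_search(keyword):
    """Clips tagged with the keyword or with anything below it in the hierarchy."""
    wanted = descendants(keyword)
    return sorted(clip for clip, tags in CLIP_TAGS.items() if tags & wanted)


if __name__ == "__main__":
    print(exact_search("war criminals"))
    print(hierarchy_search("war criminals"))
```

Here the exact search finds only the clip carrying the general tag, while the hierarchy-aware search also surfaces clips tagged with a named individual or a specific trial.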

While mLDA wasn’t as effective when they tested it with very specific keywords, like a person’s name, the group said it would be relatively simple to correct this problem within the algorithm.

The group also created a visualization of the keywords in the Visual History Archive. Colored nodes represent different topics, and when a user clicks on a node it branches out into related keywords. This might be useful for someone who wants ideas for keywords that are related to their current search but doesn’t quite know what to search for, or wants to see how different keywords are related to each other.

The staff of USC Shoah Foundation was very impressed by the group’s presentation and findings. Sam Gustman, chief technology officer, said their system is sophisticated, a smarter version of the keyword system that is based on how the testimonies were indexed.

As the Visual History Archive continues to be updated and developed, Gustman said technology staff will explore the possibility of implementing the mLDA algorithm, along with other projects by past RIPS groups.