Translating the Archive: The Czech Language Project
By Daryn Eller
One of the unique features of USC Shoah Foundation’s Visual History Archive is a controlled vocabulary of over 62,000 keywords related to genocide concepts and experience. That not only makes virtually every minute of the archive’s 52,000 testimonies searchable, but gives searchers a very good chance of finding what they’re looking for. The only limitation of the archive’s searchability is that it’s indexed in English—but now USC Shoah Foundation Information Technology Services (ITS) is working to make that limitation a thing of the past.
In partnership with Charles University in Prague, ITS has developed software that enables Czech users to search the archive in their own language. This dual-language function will present VHA users with a simple, easy-to-navigate interface, but behind that user-friendly front will live intricately designed data structures that connect the English keywords with their Czech translations.
“We created a tool that’s enabled the Charles University group to begin filling data tables with translations not only of keywords, but of synonyms, too,” says USC Shoah Foundation lead web developer Michael Russell. Along with the 62,000-plus keywords in the Visual History Archive's controlled vocabulary are approximately 230,000 synonyms, a collection of terms that considerably increases the chance that searchers will get a hit. The Czech language project, initiated in 2012, is now off and running. “As of this week, 3,128 keywords and 12,947 synonyms have been translated,” says Russell.
The tool being used by the Charles University group is part of a suite of applications ITS began developing in 2008. These applications allow researchers located anywhere in the world to collaborate with staff on the USC campus. For instance, using tools from this same suite of applications, staff at Kigali Genocide Memorial (KGM) in Rwanda did not need to travel to USC in 2013 in order to catalogue and index testimonies for the Visual History Archive’s Rwandan Tutsi Genocide collection – they could work from KGM. Similarly, the group working on the Czech project is using the software onsite at Charles University in Prague.
If all goes as planned, Czech users will be able to plug their own language into the search box by the end of 2015. The exact design of the interface is yet to be determined. Right now, ITS is focusing on the infrastructure required to make dual-language searching a straightforward process; a complementary interface design will come in the second phase of development.
Ultimately, the goal is to expand access to the archive through multiple languages. “We want to build out our keyword manager so that it works in multiple locations and languages at the same time,” says Sam Gustman, USC Shoah Foundation chief technology officer. “The Czechs are the first to test it out.” Of the approximately 110,000 hours of testimony in the VHA, about 1,000 are in Czech.
How many languages will the archive eventually be able to accommodate? “We’re only limited by our imagination,” says Russell.
(Charles University project group in photo above, from left: Petra Hoffmanova, project coordinator and language researcher; Jakub Mlynar, language researcher; Martin Smok, USC Shoah Foundation senior international training consultant)