TrebleCLEF - Research

Evaluation, Best Practices and Collaboration for Multilingual Information Access

Research

MLIA is a multidisciplinary area that draws on information retrieval, natural language processing, machine translation and summarization, speech processing, and human-computer interaction. Research was initially confined mainly to textual data, but over the years CLEF has successfully expanded its coverage to other media, such as audio and images, in order to stimulate research into the development of multilingual multimedia retrieval systems.
TrebleCLEF will continue to stimulate research through a set of coordinated actions. In particular, attention will be given to the following key areas:

User-oriented Studies

User studies in the MLIA area have received scarce attention from the scientific community, partly because of their cost (much higher than that of running batch experiments) and partly because of the difficulty of establishing evaluation methodologies that are both realistic (performed in real-world scenarios) and scientifically well-grounded (performed under laboratory-controlled conditions). TrebleCLEF will address the needs of (at least) three types of users with strong interests: a) multilingual system developers; b) companies with a potential interest in MLIA system software (the potential market for system developers); c) end users with information needs that transcend language barriers.
An important part of the user studies will be query log analysis. The logs of operational search engines will be analysed to study users’ patterns of search. Such analyses allow lab-based testing of large search engines based on the interactions of millions of users, a scale of user evaluation that was inconceivable in earlier user study research. The logs of a particular search engine can be analysed and patterns of use determined; changes can then be made to the engine and the patterns of use re-examined to determine the impact of each change. In addition, large sets of logs can be split into training and testing sets. User models can be built by examining user interaction in the training set, and the models can then be used to predict how users will search in the test set.
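The training/testing idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy sessions, the action labels, and the choice of a first-order "most likely next action" model are hypothetical and are not taken from any TrebleCLEF study or real search-engine log.

```python
from collections import Counter, defaultdict

# Hypothetical toy log: one list of user actions per search session.
SESSIONS = [
    ["query", "click", "query", "click"],
    ["query", "reformulate", "click"],
    ["query", "click"],
    ["query", "reformulate", "reformulate", "click"],
    ["query", "click", "click"],
    ["query", "reformulate", "click"],
]


def split_sessions(sessions, n_train):
    """Split whole sessions (not single actions) into training and test sets."""
    return sessions[:n_train], sessions[n_train:]


def build_model(train):
    """Map each action to the action that most often follows it in training."""
    transitions = defaultdict(Counter)
    for session in train:
        for current, following in zip(session, session[1:]):
            transitions[current][following] += 1
    return {
        action: counts.most_common(1)[0][0]
        for action, counts in transitions.items()
    }


def prediction_accuracy(model, test):
    """Fraction of test-set transitions whose next action the model predicts."""
    correct = total = 0
    for session in test:
        for current, following in zip(session, session[1:]):
            total += 1
            if model.get(current) == following:
                correct += 1
    return correct / total if total else 0.0


train, test = split_sessions(SESSIONS, 4)
model = build_model(train)
accuracy = prediction_accuracy(model, test)
print(model)
print(accuracy)
```

Splitting by whole session keeps every transition of a given session on one side of the split, so the model is never evaluated on a session it has partly seen. A real study would more plausibly split by time or by user, and would use a far richer user model than this one.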

Test Collection Creation

Many researchers assume that constructing test collections demands great effort and can be afforded only by rich organisations or through extensive collaboration among large numbers of researchers. Current evaluation campaigns reinforce this belief. However, this view ignores the flood of current research on new methodologies that allow test collections to be built more efficiently, and on new measures that work well with the resulting collections. TrebleCLEF aims to identify and collate the latest research on methods for forming test collections quickly and efficiently, and to identify new evaluation methodologies and metrics specifically designed and tuned for use in a multilingual context.

Grid-Experiments

Individual researchers and small groups can rarely run large-scale, systematic experiments over a large set of experimental collections and resources. Such experiments are needed to improve the understanding of MLIA systems and to gain a comprehensive picture of their behaviour with respect to languages. TrebleCLEF will address this lack of information by promoting and coordinating a series of systematic “grid-experiments”. These will re-use and exploit the valuable resources and experimental collections made available by CLEF in order to gain more insight into the effectiveness of the various weighting schemes and retrieval techniques with respect to the languages, and to disseminate this knowledge to the relevant application communities.
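One way to picture such a grid is as the full cross product of the factors under study. The sketch below only enumerates the experiment configurations and runs no retrieval. The specific languages, weighting schemes, and technique names are illustrative assumptions and are not a list defined by TrebleCLEF or CLEF.

```python
from itertools import product

# Hypothetical factor levels, chosen only to illustrate the grid layout.
LANGUAGES = ["de", "fr", "it"]
WEIGHTING_SCHEMES = ["tf-idf", "bm25"]
TECHNIQUES = ["no-stemming", "stemming"]


def build_grid(languages, schemes, techniques):
    """Return one configuration per (language, scheme, technique) combination."""
    return [
        {"language": lang, "weighting": scheme, "technique": tech}
        for lang, scheme, tech in product(languages, schemes, techniques)
    ]


grid = build_grid(LANGUAGES, WEIGHTING_SCHEMES, TECHNIQUES)
print(len(grid))  # 3 languages x 2 schemes x 2 techniques = 12
for config in grid[:3]:
    print(config)
```

Covering every cell of the grid is what makes the comparison systematic: each weighting scheme and technique is observed under every language, rather than only in the combinations an individual group happened to run.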

Language Resources for MLIA

TrebleCLEF will support the development of high-priority language resources for Multilingual Information Access in a systematic, standards-driven, collaborative learning context. Priority requirements will be assessed through consultations with players in the language industry and the communication sector, and a protocol and roadmap will be established for developing a set of language resources for all technologies related to MLIA.