Crosslingual information retrieval


Cross-lingual information retrieval (CLIR) is the task of retrieving information written in a language different from that of the query. Our goal is to implement a lightweight system, in both unsupervised and supervised variants, that recognizes the translation of a sentence in a large collection of documents in another language. After testing different cross-lingual word-embedding-based and text-based features across a wide range of parameter combinations, we found that our best model, the MLPClassifier, achieved a Mean Average Precision of 0.8459 on our English-German test collection. Our lightweight system also demonstrates zero-shot performance in other languages, such as Italian and Polish. We compare our results to XLM-R, a state-of-the-art but resource-hungry transformer model.

Description

We make available all the code used for this project. It covers data preprocessing, inducing cross-lingual word embeddings, and training and evaluating all models. You can find the code for each part in the following table:

All experiments were written in Jupyter Notebooks, which can be found in this folder.

Furthermore, we make all models available on Drive. All raw and preprocessed data can be downloaded from the following Drive.

Our results are summarized in the following table:

https://github.com/J4K08L4N63N84HN/crosslingual-information-retrieval/blob/main/reports/figures/final_results.png
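The results are reported as Mean Average Precision (MAP). For readers unfamiliar with the metric, the following is a minimal sketch of how MAP can be computed from ranked retrieval results. It is not taken from the repository: the function names `average_precision` and `mean_average_precision` are hypothetical, and the sketch assumes that every relevant document for a query appears somewhere in its ranking. When each query has exactly one correct translation, the average precision of a query reduces to the reciprocal of the rank at which that translation is retrieved.

```python
from typing import Sequence


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Average precision for one query.

    ranked_relevance[i] is True if the document at rank i + 1 is relevant.
    Assumption: all relevant documents appear in the ranking.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank: relevant documents seen so far / rank.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_rankings: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-query average precision values."""
    if not all_rankings:
        return 0.0
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)


if __name__ == "__main__":
    # Three toy queries whose correct translation is found at rank 1, 2 and 4.
    rankings = [
        [True, False, False],
        [False, True, False],
        [False, False, False, True],
    ]
    print(mean_average_precision(rankings))
```

In this toy example the per-query values are 1, 1/2 and 1/4, so the MAP is their mean, 7/12.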

How to Install

To use this code you have to follow these steps:

  1. Start by cloning this Git repository:

$ git clone https://github.com/J4K08L4N63N84HN/crosslingual-information-retrieval.git
$ cd crosslingual-information-retrieval

  2. Continue by creating a new conda environment (Python 3.8):

$ conda create -n animate_logos python=3.8
$ conda activate animate_logos

  3. Install the dependencies:

$ pip install -r requirements.txt

For detailed documentation, you can refer to here or create your own Sphinx documentation with

Credits

The project started in March 2021 as an Information Retrieval project at the University of Mannheim. The project team consists of:

License

This repository is licensed under the MIT License. If you have any enquiries concerning the use of our code, do not hesitate to contact us.