timothychung
Localizing GitHub Repositories with LLMs
Many established software teams work across multiple locations, basing their work on GitHub repositories. Such teams span different languages and cultures, so it is common practice for large repositories to undergo localization: translating documentation files in a way that preserves the context of the work.
This article discusses an application developed by a group of students at Imperial College London, working in collaboration with Microsoft, for translating Markdown files in GitHub repositories using Large Language Models (LLMs) from OpenAI. The application uses a Django backend and a React frontend, and communicates with GitHub through its API as a GitHub App.
The article includes links to GitHub repositories containing the proof-of-concept app, a public notebook for testing translations, and a test repository with results. Note that the Django app is currently not being maintained and was designed as a proof-of-concept application.
The Project Task
Build a solution for translating Markdown files in GitHub repositories using Large Language Models (LLMs) from OpenAI. The application should feature a web backend and frontend, and seamlessly communicate with GitHub through its API as a GitHub App. It should translate text, code blocks, and even images within Markdown files while preserving their original structure and syntax.
The Translation of Code and Comments
For example, if you had the following Python code in English:
Code:
# test the Python print function
print("Hello World")
It would look like this in French:
Code:
# tester la fonction print en Python
print("Bonjour le monde")
While this is a simple example, small nuances require us to pay attention to context. For example, the Python function 'print' cannot be translated into another language, and the hash (#) that denotes a comment must also be left unmodified during translation. Paying attention to such details is not something conventional machine translation methods (used in tools such as Google Translate) are designed for.
Given the nature of this work, localization is not a trivial task. At Microsoft, there are localization teams to produce culturally and professionally accurate versions of documentation, reports, and guides for codebases. Tasks like these can take up to several months to finish.
Fortunately, this task is something large language models (LLMs) excel at compared to machine translation, as they process tokens of text in a context window. LLMs use attention mechanisms to weigh the relevance of each token relative to others in the sequence, allowing them to capture nuanced meanings and contextual dependencies effectively.
Our team at Imperial College London worked with Microsoft to realize the potential of LLMs in the localization process, including noting any interesting behaviors and useful takeaways of what is currently possible with existing technologies provided by Azure and OpenAI.
Introducing a proof-of-concept GitHub App
As part of the project, our team built a simple GitHub App based on Django and React. Django's Python foundation let us quickly build and iterate an app that looks for a configuration file on an installed GitHub repository, and then for Markdown files on the repo.
The app opts for a simple configuration:
- We use English as the universal language (although theoretically, we can have this set to any other language). Any translations made by the app are stored in a separate branch that follows the main branch. Every time main is updated by the user, the app pushes to the translation branch and makes a PR to main.
- A .yml configuration file lists the directory of the documentation folder and the desired languages to be translated into.
- Any README.md files in the repository are automatically translated; the translated READMEs are placed in a folder of translations adjacent to the original README.md file in English.
- All markdown files in the documentation directory are automatically translated. The documentation directory also contains translation folders by language. Each translation folder mirrors the structure of the documentation directory in English.
Code:
/project_root
/docs
/installation.md
/screenshot.png
/api
/services.md
/usage.md
/fr
/installation.md
/usage.md
/api
/services.md
/es
/installation.md
/usage.md
/api
/services.md
/translated_images
/screenshot<hash>.fr.png
/screenshot<hash>.es.png
/src
/readme.md
/readme_img.png
/translations
/readme.fr.md
/readme.es.md
/translated_images
/readme_img<hash>.fr.png
/readme_img<hash>.es.png
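The layout above can be produced mechanically. Below is a minimal sketch of the path mapping, assuming the `docs/` and `translations/` conventions shown in the tree; the function name and exact rules are our illustration, not the app's actual implementation:

```python
from pathlib import PurePosixPath

def translated_md_path(md_path: str, lang: str, docs_root: str = "docs") -> str:
    """Mirror a Markdown file's repo path into its per-language location.

    docs/api/services.md -> docs/fr/api/services.md
    src/readme.md        -> src/translations/readme.fr.md
    """
    p = PurePosixPath(md_path)
    parts = p.parts
    if parts[0] == docs_root:
        # Documentation files mirror the docs tree under docs/<lang>/...
        return str(PurePosixPath(docs_root, lang, *parts[1:]))
    # README-style files go into a sibling "translations" folder
    return str(p.parent / "translations" / f"{p.stem}.{lang}{p.suffix}")
```

For example, `translated_md_path("docs/api/services.md", "fr")` yields `docs/fr/api/services.md`, matching the tree above.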
Properly Translating a Markdown File
There are three aspects to translating a Markdown file:
- Getting the text translations correct
- Getting the code block translations precise
- Translating the images in the Markdown file.
(1) Text Translation
This was a relatively straightforward task requiring minimal prompting – our tests with GPT models were generally very accurate (even with RTL languages like Arabic). At the time of testing, GPT-4o and GPT-4 were accurate most of the time for high-resource languages and performed moderately on low-resource languages. GPT-3.5 performed adequately for common high-resource languages but struggled with low-resource languages.
This article covers a deeper definition of low-resource languages, which tend to be less commonly spoken in the world. Naturally, there is a lack of training data in such languages, which leads to poorer performance on tasks involving low-resource languages compared to high-resource ones. This behavior is observed in many leading LLMs today, such as Claude, Llama, Mistral, and GPT. Initiatives such as Cohere for AI try to address the low-resource language problem with dedicated multilingual models such as Aya, but these are out of scope for this article.
(2) Code Block Translation
To our surprise, translating code blocks in Markdown came naturally to most LLMs. Since a large share of the training data for GPT models consists of code, they perform very well at translating comments while preserving the code syntax in code blocks.
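One way to make this reliable is to split the Markdown into prose and fenced-code segments before prompting, so code blocks can be sent with a "translate comments only" instruction while prose gets the normal translation prompt. A rough sketch of the splitting step (the segmentation approach is our illustration, not necessarily the app's exact implementation):

```python
import re

# Marker for fenced code blocks, built programmatically so this snippet
# can itself live inside a Markdown document.
FENCE = "`" * 3

def split_markdown(md: str):
    """Split Markdown into ("text", ...) and ("code", ...) segments.

    Fenced code blocks are kept intact, so each segment type can be
    prompted differently during translation.
    """
    pattern = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
    segments, pos = [], 0
    for m in pattern.finditer(md):
        if m.start() > pos:
            segments.append(("text", md[pos:m.start()]))
        segments.append(("code", m.group()))
        pos = m.end()
    if pos < len(md):
        segments.append(("text", md[pos:]))
    return segments
```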
(3) Image Translation
Markdown files can contain images; in most cases these images are stored locally within the repository. All the images in the repository are translated and their translated copies stored within the repo; then, after the Markdown files are translated, regex is used to find all image links and replace them with their translated variants.
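The link-rewriting step can be sketched as a regex substitution. The naming scheme below follows the `translated_images/<name><hash>.<lang>.png` pattern from the directory layout earlier, but the hashing choice and helper name are illustrative assumptions, not the app's exact code:

```python
import hashlib
import re

def rewrite_image_links(md: str, lang: str) -> str:
    """Point local Markdown image links at their translated copies."""
    def repl(m):
        alt, path = m.group(1), m.group(2)
        if path.startswith(("http://", "https://")):
            return m.group(0)  # externally linked images are not translated
        stem, _, ext = path.rpartition(".")
        # Short content-independent hash of the original path, mimicking
        # the <hash> suffix in the repo layout (illustrative choice).
        digest = hashlib.md5(path.encode()).hexdigest()[:8]
        name = stem.rsplit("/", 1)[-1]
        return f"![{alt}](translated_images/{name}{digest}.{lang}.{ext})"
    return re.sub(r"!\[([^\]]*)\]\(([^)\s]+)\)", repl, md)
```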
Current state-of-the-art diffusion models cannot be reliably prompted to generate translated versions of images due to their probabilistic nature: they are good at generating new images from scratch but cannot imitate the structure of existing images.
Generating new images would also be computationally expensive, so we developed a novel way of using a mix of OCR services and GPT models to translate images:
- Use Azure’s Image Analysis API to detect text lines and their bounding boxes
- Feed the detected text lines into GPT to be translated
- Replace each text line with translated variant
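The three steps above can be sketched as a small pipeline. The OCR and GPT calls are stubbed out here (both require live Azure and OpenAI credentials), and the data shapes are simplified stand-ins for the real API responses; only the line-ordering logic is concrete:

```python
def reading_order(lines):
    """Sort OCR line results top-to-bottom, then left-to-right.

    Each line is a dict with "text" and "bbox" = (x, y, width, height),
    a simplified stand-in for Azure Image Analysis output.
    """
    return sorted(lines, key=lambda l: (l["bbox"][1], l["bbox"][0]))

def translate_image_text(lines, translate):
    """Translate each detected line and pair it with its bounding box.

    `translate` stands in for a GPT call; the real pipeline would batch
    the ordered lines into one prompt to keep their context together.
    """
    ordered = reading_order(lines)
    return [(l["bbox"], translate(l["text"])) for l in ordered]
```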
This is not a perfect solution, but text replacement works. We also agreed with our supervisor that overlaying text is more legally feasible, because it involves no generation of new images of similar likeness, which would carry copyright implications.
More detailed information about the app
- Django backend, React frontend
Backend
The backend uses Django with a SQLite database. It communicates with GitHub through its API by acting as a GitHub App, subscribing to certain events (installation of the GitHub App, new pushes to a repo) and pushing changes to repositories. It also accesses OpenAI services and other LLMs through Azure.
Translation Process
1. Installation
Users start by installing the GitHub App on selected repositories or an organization. Those repositories are recorded in the SQLite database and displayed on the webpage once you log in. A robot account is invited to those repositories.
2. Initial Translation
Once you activate translation on the webpage, a new branch "co-op-translator" is created; all documentation files and images are read and translated by the backend server, then uploaded to the new branch. A pull request is then created from "co-op-translator" to the main branch. Note that all changes to repositories are performed by the robot account, as it is easier to manage.
3. Tracking Stage
Repositories are then tracked further. The GitHub App is subscribed to events (main branch updates, deletion of pull requests); documents modified or added in a new commit are translated and updated on the "co-op-translator" branch, and the old pull request is replaced by a new one.
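The tracking step can be sketched as a webhook handler that inspects a push payload for changed documentation files. The "commits", "added", and "modified" fields follow GitHub's push-event schema; the `docs_root` convention and function name are illustrative:

```python
def docs_changed_in_push(payload, docs_root="docs"):
    """Collect Markdown files added or modified in a push event payload.

    `payload` mimics the GitHub push webhook: a list of commits, each
    carrying "added" and "modified" file lists. Only documentation
    Markdown and README files trigger retranslation.
    """
    changed = set()
    for commit in payload.get("commits", []):
        for path in commit.get("added", []) + commit.get("modified", []):
            if path.endswith(".md") and (
                path.startswith(docs_root + "/") or path.lower().endswith("readme.md")
            ):
                changed.add(path)
    return sorted(changed)
```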
Edge Cases and Interesting Behaviour
- The GPT model would hallucinate extra content when translating short one-liner Markdown files. INPUT:
# Example Hello World in Markdown
- OUTPUT (translated into Spanish):
Code:
# Ejemplo de "Hello World" en Markdown

Este es un archivo `README.md` simple para demostrar el formato de Markdown en GitHub.

## Características

Este documento incluye:

- Etiquetas de **encabezado** (`#`)
- **Bloque de código** para resaltado de sintaxis
- Formato en **negrita**
- We tried prompting the model with:
- “NEVER ADD ANY EXTRA CONTENT OUTSIDE THE TRANSLATION. TRANSLATE ONLY WHAT IS GIVEN TO YOU. MAINTAIN MARKDOWN FORMAT”
- This worked, but the model would then fail to translate longer files properly. So, we use a dynamic prompt: files with fewer than 2 lines get the strict prompt above, and longer files get the original prompt.
- Issues in image context: some context is not fully captured due to splits when reading left to right across different columns, which mixes up the text.
- Note that there is also a small grammatical error: 'tech company' has been split across lines, so the model loses track of the context. This is because 'tech company' is reversed in French to 'société tech', i.e. 'company tech'.
- Extracted text that the LLM reads, in order (-> French translation):
- Section 1: -> Section 1:
- Section 2: -> Section 2:
- My favourite tech -> Ma technologie préférée
- My favourite fruit -> Mon fruit préféré:
- company: Apple -> : société: Apple
- : Apple -> : Apple? Pomme?
- Expected Output
- Note that the translation of "Apple" could be confused between either the company name in French (also called Apple) or the fruit (pomme).
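The dynamic prompt selection described in the edge cases above can be sketched as a simple length check. The strict prompt is quoted from our experiments; the general prompt text and the constant names are paraphrased assumptions:

```python
SHORT_FILE_PROMPT = (
    "NEVER ADD ANY EXTRA CONTENT OUTSIDE THE TRANSLATION. "
    "TRANSLATE ONLY WHAT IS GIVEN TO YOU. MAINTAIN MARKDOWN FORMAT"
)
LONG_FILE_PROMPT = "Translate the following Markdown file, preserving its format."

def pick_prompt(md: str, threshold: int = 2) -> str:
    """Use the strict prompt for very short files, which are prone to
    hallucinated extra content, and the general prompt otherwise."""
    n_lines = len([ln for ln in md.splitlines() if ln.strip()])
    return SHORT_FILE_PROMPT if n_lines < threshold else LONG_FILE_PROMPT
```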
Examples
Take this example of a Microsoft guide for Generative AI in English. This is the app's translated French version. Note that translations do not work with externally linked images.
Summary
Here are some key points from the article:
1. Text Translation: LLMs were found to be quite accurate for translating text in Markdown files, especially when dealing with high-resource languages. However, performance was less consistent for low-resource languages due to the lack of training data available for these languages.
2. Code Block Translation: LLMs were surprisingly good at translating code blocks while preserving their structure within Markdown files.
3. Image Translation: The authors developed a method using OCR services and GPT models to translate text within images, though this is not a perfect solution and carries legal implications due to copyright concerns.
4. Edge Cases and Interesting Behaviour: While the LLMs performed well in most cases, they sometimes added extra content or failed to fully capture context when dealing with longer files. This could result in grammatical errors and other issues.
5. Future Work: The authors plan to explore using LLMs for code improvements and static code analysis within GitHub repositories.
Code
We have attached our work in the GitHub repositories below, including a public notebook for testing translations, a test repository with results, and the Django proof-of-concept app. Do note that the Django app is not being maintained, and that there is active work on making the Python notebook code more modular and easier for others to use when trying out LLMs to localize repositories via translation.
Source Code: GitHub Repos