Problem/Motivation

Investigate, discuss, and decide the optimal methodology for executing translations via an LLM to ensure overall consistency, including the translation of specific terminology.

Proposed resolution

The following approaches could be applied to maintain translation consistency:
- Utilizing glossaries (bilingual mapping tables).
- Fine-tuning models using existing translated strings.

The specific methodology remains undecided as of the creation of this issue. We intend to discuss the details within this thread and will update the issue summary once a consensus is reached.

Comments

dokumori created an issue. See original summary.

dokumori’s picture


Note to self: Issue descriptions and comments need to be written in English :D

dokumori’s picture

The introduction of a RAG layer appears to be more suitable than fine-tuning for the translation of the core strings.

https://www.drupal.org/project/translation_llm/issues/3564256#comment-16...

dokumori’s picture

Attachment: new, 31.73 KB

Uploading a sample glossary file to be used with a PoC translation tool, which I will share soon.
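For readers who haven't opened the attachment, a bilingual glossary in CSV form might look like the sketch below. The column names are my assumption, not necessarily the attached file's actual layout; the term pairs are taken from the evaluation tables later in this thread.

```python
import csv
import io

# A minimal bilingual glossary in CSV form (column names are assumptions;
# the English/Japanese pairs come from the evaluation tables in this thread).
GLOSSARY_CSV = """source,target
Publish,掲載
Unpublish,非掲載
Migrate,移行
Machine-readable name,システム内部名称
term,ターム
logging,ログ記録
Username,ユーザー名
State,状態
"""

def load_glossary(csv_text: str) -> dict:
    """Parse the CSV into an English -> Japanese mapping."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["source"]: row["target"] for row in reader}

glossary = load_glossary(GLOSSARY_CSV)
```

Any two-column mapping like this is enough for the RAG layer to retrieve exact terminology at translation time.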

dokumori’s picture

Attachment: new, 66.14 KB

...and the .po file containing untranslated strings extracted from Drupal 11.0.6.

dokumori’s picture

After I looked into RAG, I was convinced it was a better approach than fine-tuning an LLM to achieve high quality translations. So I’ve built a PoC and it’s ready to be taken for a spin. 
RAG LLM Translator: https://github.com/dokumori/rag-llm-translator/

It works with any AI provider that supports OpenAI API specifications. Also, amazee.ai is generously providing their LLM resources for this project, so if you are a maintainer of this project, please DM me and I will share the API key.

My plan was to build something simple, but this was my first attempt at RAG and I wanted to make sure the system behaved as expected, so I added various inspection tools and tests. I also had a lot of fun building it with help from an LLM, so I kept adding features. Maybe I've over-engineered it a bit, but hey ;)

Below is a brief comparison of the two approaches: fine-tuning and RAG:

Fine-tuning an LLM
While fine-tuning an LLM appears to be a valid approach to improving translation quality, the main problem is that the result is not portable: the tuning is tied to one model, so whenever we move to a new model, we have to start over.

Providing translation context with RAG
The RAG-based approach, on the other hand, is model-agnostic and works with most modern LLMs. The payload sent to the LLM includes not only the prompt, but also the glossary and translation memory as the context for the LLM to produce translations that are consistent with the existing translations.
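To make the payload idea concrete, here is a hypothetical sketch of what such a request might look like against an OpenAI-compatible chat endpoint. The function and variable names are illustrative only, not the PoC's actual API; the retrieved glossary entries and translation-memory matches are simply injected into the system prompt as context.

```python
# Hedged sketch: inject retrieved glossary hits and translation-memory
# matches into the prompt sent to an OpenAI-compatible chat endpoint.
# Names here are illustrative assumptions, not the PoC's real interface.

def build_messages(source: str,
                   glossary_hits: dict,
                   tm_hits: list) -> list:
    """Assemble a chat-completion message list with RAG context."""
    glossary_block = "\n".join(f"{en} => {ja}" for en, ja in glossary_hits.items())
    tm_block = "\n".join(f"EN: {en}\nJA: {ja}" for en, ja in tm_hits)
    system = (
        "You translate Drupal UI strings from English to Japanese.\n"
        "Use these glossary terms exactly:\n" + glossary_block + "\n"
        "Stay consistent with these existing translations:\n" + tm_block
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": source},
    ]

messages = build_messages(
    "Publish the selected content.",
    {"Publish": "掲載"},
    [("Unpublish content", "コンテンツを非掲載にする")],
)
# The resulting message list can be POSTed to /v1/chat/completions on any
# OpenAI-compatible provider, which is why the approach is model-agnostic.
```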

Just to be clear: while I've built a tool to support a RAG-based approach, it shouldn’t dictate the roadmap of this project. Let’s have an open discussion and decide on the project's direction together. Please don’t hesitate to share your thoughts.

dokumori’s picture

Here are the test results that demonstrate the effectiveness of the system:

I've ingested the existing core translations (11.0.6) into RAG as the translation memory, and glossary.csv as the glossary. The translation process was run on the untranslated strings found in core (11.0.6) against each LLM, once without RAG and again with RAG. While each model makes different 'mistakes', the impact of RAG is consistent.

The terms shown in the charts appeared within translated sentences, and were not processed as standalone vocabulary.

Considerations and observations:
Partly because the existing translations are not perfectly consistent, it's trickier to evaluate the effect of translation memory. I should probably also test the effect of translation memory and the glossary separately. But the overall impact of RAG seems promising.
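For transparency, the consistency rate could be computed along these lines. This is my own rough sketch of the idea, not the PoC's actual evaluation code: for every translated string containing a glossary source term, check whether the expected target term appears in the translation.

```python
# Rough sketch (an assumption, not the PoC's real evaluator): count how often
# a glossary term occurring in a source string is rendered with the expected
# target term in the translation.

def consistency_rate(glossary: dict, translations: dict) -> float:
    """Fraction of glossary-term occurrences translated per the glossary."""
    checked = hits = 0
    for en, ja in translations.items():
        for src, tgt in glossary.items():
            if src.lower() in en.lower():
                checked += 1
                if tgt in ja:
                    hits += 1
    return hits / checked if checked else 0.0

rate = consistency_rate(
    {"Publish": "掲載", "State": "状態"},
    {"Publish this node": "このノードを掲載する",
     "Save state": "ステータスを保存"},
)
# One of the two checked occurrences matches the glossary -> 0.5
```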

If you have any comments on the results or suggestions on improving the testing method, please share!

Haiku-3.5

Rate of consistency w/o RAG : 37.5%

Rate of consistency w/ RAG: 87.5%

| Term (untranslated) | Glossary | w/o RAG | w/ RAG | Consistent with Glossary (w/ RAG only) |
|---|---|---|---|---|
| Publish | 掲載 | 公開 | 掲載 | 1 |
| Unpublish | 非掲載 | 非公開 | 非公開 | 0 |
| Migrate | 移行 | マイグレーション | 移行 | 1 |
| Machine-readable name | システム内部名称 | 機械可読名 | システム内部名称 | 1 |
| term | ターム | 用語 | ターム | 1 |
| logging | ログ記録 | ログ記録 | ログ記録 | 1 |
| Username | ユーザー名 | ユーザー名 | ユーザー名 | 1 |
| State | 状態 | 状態 | 状態 | 1 |

Claude Sonnet 4

Rate of consistency w/o RAG : 37.5%

Rate of consistency w/ RAG: 87.5%

| Term (untranslated) | Glossary | w/o RAG | w/ RAG | Consistent with Glossary (w/ RAG only) |
|---|---|---|---|---|
| Publish | 掲載 | 公開 | 公開 | 0 |
| Unpublish | 非掲載 | 非公開 | 非掲載 | 1 |
| Migrate | 移行 | マイグレーション | 移行 | 1 |
| Machine-readable name | システム内部名称 | 機械可読名 | システム内部名称 | 1 |
| term | ターム | 用語 | ターム | 1 |
| logging | ログ記録 | ログ記録 | ログ記録 | 1 |
| Username | ユーザー名 | ユーザー名 | ユーザー名 | 1 |
| State | 状態 | 状態 | 状態 | 1 |

Claude Opus 4

Rate of consistency w/o RAG : 37.5%

Rate of consistency w/ RAG: 87.5%

| Term (untranslated) | Glossary | w/o RAG | w/ RAG | Consistent with Glossary (w/ RAG only) |
|---|---|---|---|---|
| Publish | 掲載 | 公開 | 掲載 | 1 |
| Unpublish | 非掲載 | 非公開 | 非掲載 | 1 |
| Migrate | 移行 | マイグレーション | 移行 | 1 |
| Machine-readable name | システム内部名称 | マシン読み取り可能 | システム内部名称 | 1 |
| term | ターム | 用語 | ターム | 1 |
| logging | ログ記録 | ログ記録 | ロギング | 0 |
| Username | ユーザー名 | ユーザー名 | ユーザー名 | 1 |
| State | 状態 | 状態 | 状態 | 1 |

Mistral

Rate of consistency w/o RAG : 0%

Rate of consistency w/ RAG: 87.5%

| Term (untranslated) | Glossary | w/o RAG | w/ RAG | Consistent with Glossary (w/ RAG only) |
|---|---|---|---|---|
| Publish | 掲載 | 公開 | 掲載 | 1 |
| Unpublish | 非掲載 | 非公開 | 非掲載 | 1 |
| Migrate | 移行 | migrate | 移行 | 1 |
| Machine-readable name | システム内部名称 | マシンが読み取れる | システム内部名称 | 1 |
| term | ターム | 用語 | ターム | 1 |
| logging | ログ記録 | ロギング | ロギング | 0 |
| Username | ユーザー名 | ユーザーネーム | ユーザー名 | 1 |
| State | 状態 | ステータス | 状態 | 1 |

Note: I've excluded `claude-3-5-sonnet` and `deepseek-r1-v1`, both included in the PoC, for the following reasons:
- three other Claude models have already been tested, so adding a fourth would evaluate the model's capability rather than the impact of RAG
- `deepseek-r1-v1` turned out to be sub-optimal for translation: it simply didn't translate well, if at all, and will be removed from the PoC in the future

dokumori’s picture

The previous evaluation method wasn’t statistically valid in many ways, including the sample size. To conduct better evaluations, I’ve added a feature that uses LLM-as-a-judge to compare RAG-based and non-RAG-based translations (included in release version 1.2.3). This produces quantitative evaluations of the RAG-based translation that are also statistically sound.
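Conceptually, an LLM-as-a-judge comparison works like the sketch below. The prompt wording, verdict labels, and tallying are my assumptions for illustration, not the tool's actual implementation: the judge model sees the source and both candidate translations, returns a verdict per string, and the verdicts are aggregated into the win/tie counts.

```python
from collections import Counter

# Illustrative sketch of an LLM-as-a-judge loop; prompt wording, verdict
# labels, and tallying are assumptions, not the tool's actual code.

JUDGE_PROMPT = (
    "You are judging two Japanese translations of the same English string.\n"
    "Source: {src}\nA: {a}\nB: {b}\n"
    "Answer with exactly one word: A, B, or TIE."
)

def tally(verdicts: list) -> dict:
    """Aggregate per-string judge verdicts into win/tie counts."""
    counts = Counter(v.strip().upper() for v in verdicts)
    return {"wins_a": counts["A"], "wins_b": counts["B"], "ties": counts["TIE"]}

# Each verdict would come from one judge-model call per evaluated string.
summary = tally(["A", "tie", "B", "A", "TIE"])
# -> {"wins_a": 2, "wins_b": 1, "ties": 2}
```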

I’m sharing two sets of reports generated by the feature. These consistently show strong indications that the RAG-based translations align more closely with existing translations (as they should), and also tend to have higher fluency and accuracy rates.

Here's an example of the report:

=========================================
🏆 EVALUATION RESULTS SUMMARY
Completed: 2026-03-29 19:22
=========================================
JUDGE MODEL ⚖️ : kimi-k2.5 (kimi-k2.5)
Total Evaluated: 296
Wins (With RAG): 103
Wins (Without RAG): 39
Ties: 154
-----------------------------------------
Files Compared:
  - With RAG: en-ja_mistral-large-2402-v1_rag_2026-03-28_19-12-16.po (in with_rag/)
  - Without RAG: en-ja_mistral-large-2402-v1_norag_2026-03-28_19-31-43.po (in without_rag/)
-----------------------------------------
Comparative Metrics (vs Non-RAG):
  - Win Ratio: 2.64x (RAG is 2.6x more likely to win)
  - Relative Win Rate: 72.5% (Preference in decided cases)
  - Win Lead: +164.1% (More 'Best' translations produced)
  - Contextual Error Reduction: 44.8% (Closing the gap to perfection)
  - Sub-optimal Rate Reduction: +58.1% (Reduction in scores < 4.0)
  - Net Improvement (Delta): +21.6% (Total win-rate difference)
  - Score Improvement: +10.9% (Average context score boost)
-----------------------------------------
Average Context Adherence Score (Max 5):
  - With RAG: 4.46
  - Without RAG: 4.02
Average Accuracy & Fluency Score (Max 5):
  - With RAG: 4.46
  - Without RAG: 4.28
=========================================
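To make the report's comparative metrics easier to interpret, they can all be reproduced from the raw counts and scores above. The formula names are inferred from the labels; the small discrepancy on the error-reduction figure presumably comes from the report using unrounded scores.

```python
# Reproducing the report's comparative metrics from its raw numbers.
# Formulas are inferred from the metric labels, not taken from the tool.
wins_rag, wins_norag, total = 103, 39, 296
score_rag, score_norag, max_score = 4.46, 4.02, 5.0

win_ratio = wins_rag / wins_norag                        # ~2.64x
relative_win_rate = wins_rag / (wins_rag + wins_norag)   # ~72.5%
win_lead = (wins_rag - wins_norag) / wins_norag          # ~+164.1%
net_improvement = (wins_rag - wins_norag) / total        # ~+21.6%
score_improvement = score_rag / score_norag - 1          # ~+10.9%
# "Closing the gap to perfection": how much of the distance to a perfect
# context score the RAG run removes (~44.8% with unrounded scores).
error_reduction = ((max_score - score_norag) - (max_score - score_rag)) \
                  / (max_score - score_norag)
```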

Here are two sets of samples. I'm uploading all relevant files for the record, but I think you only need to look at the summary reports (.txt) to get the gist of it.

#1:

#2:

Please refer to the documentation for the detailed explanation of the evaluation method (https://github.com/dokumori/rag-llm-translator/blob/main/docs/5_translat...).