Problem/Motivation

Investigate, discuss, and decide the optimal methodology for executing translations via an LLM to ensure overall consistency, including the translation of specific terminology.

Proposed resolution

The following approaches could be applied to maintain translation consistency:
- Utilizing glossaries (bilingual mapping tables).
- Fine-tuning models using existing translated strings.

The specific methodology remains undecided as of the creation of this issue. We intend to discuss the details within this thread and will update the issue summary once a consensus is reached.

Comments

dokumori created an issue. See original summary.

dokumori’s picture


Note to self: Issue descriptions and comments need to be written in English :D

dokumori’s picture

The introduction of a RAG layer appears to be more suitable than fine-tuning for the translation of the core strings.

https://www.drupal.org/project/translation_llm/issues/3564256#comment-16...

dokumori’s picture

Attachment: new, 31.73 KB

Uploading a sample glossary file to be used with a PoC translation tool, which I will share soon.
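For readers who haven't opened the attachment, a bilingual glossary in CSV form might look like the sketch below. The column names are my assumption, not necessarily the attached file's actual layout; the term pairs are taken from the evaluation tables later in this thread.

```python
import csv
import io

# A minimal bilingual glossary in CSV form (column names are assumptions;
# the English/Japanese pairs come from the evaluation tables in this thread).
GLOSSARY_CSV = """source,target
Publish,掲載
Unpublish,非掲載
Migrate,移行
Machine-readable name,システム内部名称
term,ターム
logging,ログ記録
Username,ユーザー名
State,状態
"""

def load_glossary(csv_text: str) -> dict:
    """Parse the CSV into an English -> Japanese mapping."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["source"]: row["target"] for row in reader}

glossary = load_glossary(GLOSSARY_CSV)
```

Any two-column mapping like this is enough for the RAG layer to retrieve exact terminology at translation time.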

dokumori’s picture

Attachment: new, 66.14 KB

...and the .po file containing untranslated strings extracted from Drupal 11.0.6.

dokumori’s picture

After I looked into RAG, I was convinced it was a better approach than fine-tuning an LLM to achieve high quality translations. So I’ve built a PoC and it’s ready to be taken for a spin. 
RAG LLM Translator: https://github.com/dokumori/rag-llm-translator/

It works with any AI provider that supports OpenAI API specifications. Also, amazee.ai is generously providing their LLM resources for this project, so if you are a maintainer of this project, please DM me and I will share the API key.

My plan was to build something simple, but this was my first attempt at RAG and I wanted to make sure the system behaved as expected, so I added various inspection tools and tests. I also had a lot of fun building it with help from an LLM, so I kept adding features. Maybe I've over-engineered it a bit, but hey ;)

Below is a brief comparison of the two approaches: fine-tuning and RAG:

Fine-tuning an LLM
While fine-tuning an LLM appears to be a valid approach to improving translation quality, the main problem is that the result is not portable: the tuning is tied to one model, so whenever we move to a new model, we have to start over.

Providing translation context with RAG
The RAG-based approach, on the other hand, is model-agnostic and works with most modern LLMs. The payload sent to the LLM includes not only the prompt, but also the glossary and translation memory as the context for the LLM to produce translations that are consistent with the existing translations.
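To make the payload idea concrete, here is a hypothetical sketch of what such a request might look like against an OpenAI-compatible chat endpoint. The function and variable names are illustrative only, not the PoC's actual API; the retrieved glossary entries and translation-memory matches are simply injected into the system prompt as context.

```python
# Hedged sketch: inject retrieved glossary hits and translation-memory
# matches into the prompt sent to an OpenAI-compatible chat endpoint.
# Names here are illustrative assumptions, not the PoC's real interface.

def build_messages(source: str,
                   glossary_hits: dict,
                   tm_hits: list) -> list:
    """Assemble a chat-completion message list with RAG context."""
    glossary_block = "\n".join(f"{en} => {ja}" for en, ja in glossary_hits.items())
    tm_block = "\n".join(f"EN: {en}\nJA: {ja}" for en, ja in tm_hits)
    system = (
        "You translate Drupal UI strings from English to Japanese.\n"
        "Use these glossary terms exactly:\n" + glossary_block + "\n"
        "Stay consistent with these existing translations:\n" + tm_block
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": source},
    ]

messages = build_messages(
    "Publish the selected content.",
    {"Publish": "掲載"},
    [("Unpublish content", "コンテンツを非掲載にする")],
)
# The resulting message list can be POSTed to /v1/chat/completions on any
# OpenAI-compatible provider, which is why the approach is model-agnostic.
```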

Just to be clear: while I've built a tool to support a RAG-based approach, it shouldn’t dictate the roadmap of this project. Let’s have an open discussion and decide on the project's direction together. Please don’t hesitate to share your thoughts.

dokumori’s picture

Here are the test results that demonstrate the effectiveness of the system:

I've ingested the existing core translations (11.0.6) into RAG as the translation memory, and glossary.csv as the glossary. The translation process was run on the untranslated strings found in core (11.0.6) against each LLM, once without RAG and again with RAG. While each model makes different 'mistakes', the impact of RAG is consistent.

The terms shown in the charts appeared within translated sentences, and were not processed as standalone vocabulary.

Considerations and observations:
Partly because the existing translations are not perfectly consistent, it's trickier to evaluate the effect of translation memory. I should probably also test the effect of translation memory and the glossary separately. But the overall impact of RAG seems promising.
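For transparency, the consistency rate could be computed along these lines. This is my own rough sketch of the idea, not the PoC's actual evaluation code: for every translated string containing a glossary source term, check whether the expected target term appears in the translation.

```python
# Rough sketch (an assumption, not the PoC's real evaluator): count how often
# a glossary term occurring in a source string is rendered with the expected
# target term in the translation.

def consistency_rate(glossary: dict, translations: dict) -> float:
    """Fraction of glossary-term occurrences translated per the glossary."""
    checked = hits = 0
    for en, ja in translations.items():
        for src, tgt in glossary.items():
            if src.lower() in en.lower():
                checked += 1
                if tgt in ja:
                    hits += 1
    return hits / checked if checked else 0.0

rate = consistency_rate(
    {"Publish": "掲載", "State": "状態"},
    {"Publish this node": "このノードを掲載する",
     "Save state": "ステータスを保存"},
)
# One of the two checked occurrences matches the glossary -> 0.5
```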

If you have any comments on the results or suggestions on improving the testing method, please share!

Haiku-3.5

Rate of consistency w/o RAG : 37.5%

Rate of consistency w/ RAG: 87.5%

| Term (untranslated) | Glossary | w/o RAG | w/ RAG | Consistent with Glossary (w/ RAG only) |
|---|---|---|---|---|
| Publish | 掲載 | 公開 | 掲載 | 1 |
| Unpublish | 非掲載 | 非公開 | 非公開 | 0 |
| Migrate | 移行 | マイグレーション | 移行 | 1 |
| Machine-readable name | システム内部名称 | 機械可読名 | システム内部名称 | 1 |
| term | ターム | 用語 | ターム | 1 |
| logging | ログ記録 | ログ記録 | ログ記録 | 1 |
| Username | ユーザー名 | ユーザー名 | ユーザー名 | 1 |
| State | 状態 | 状態 | 状態 | 1 |

Claude Sonnet 4

Rate of consistency w/o RAG : 37.5%

Rate of consistency w/ RAG: 87.5%

| Term (untranslated) | Glossary | w/o RAG | w/ RAG | Consistent with Glossary (w/ RAG only) |
|---|---|---|---|---|
| Publish | 掲載 | 公開 | 公開 | 0 |
| Unpublish | 非掲載 | 非公開 | 非掲載 | 1 |
| Migrate | 移行 | マイグレーション | 移行 | 1 |
| Machine-readable name | システム内部名称 | 機械可読名 | システム内部名称 | 1 |
| term | ターム | 用語 | ターム | 1 |
| logging | ログ記録 | ログ記録 | ログ記録 | 1 |
| Username | ユーザー名 | ユーザー名 | ユーザー名 | 1 |
| State | 状態 | 状態 | 状態 | 1 |

Claude Opus 4

Rate of consistency w/o RAG : 37.5%

Rate of consistency w/ RAG: 87.5%

| Term (untranslated) | Glossary | w/o RAG | w/ RAG | Consistent with Glossary (w/ RAG only) |
|---|---|---|---|---|
| Publish | 掲載 | 公開 | 掲載 | 1 |
| Unpublish | 非掲載 | 非公開 | 非掲載 | 1 |
| Migrate | 移行 | マイグレーション | 移行 | 1 |
| Machine-readable name | システム内部名称 | マシン読み取り可能 | システム内部名称 | 1 |
| term | ターム | 用語 | ターム | 1 |
| logging | ログ記録 | ログ記録 | ロギング | 0 |
| Username | ユーザー名 | ユーザー名 | ユーザー名 | 1 |
| State | 状態 | 状態 | 状態 | 1 |

Mistral

Rate of consistency w/o RAG : 0%

Rate of consistency w/ RAG: 87.5%

| Term (untranslated) | Glossary | w/o RAG | w/ RAG | Consistent with Glossary (w/ RAG only) |
|---|---|---|---|---|
| Publish | 掲載 | 公開 | 掲載 | 1 |
| Unpublish | 非掲載 | 非公開 | 非掲載 | 1 |
| Migrate | 移行 | migrate | 移行 | 1 |
| Machine-readable name | システム内部名称 | マシンが読み取れる | システム内部名称 | 1 |
| term | ターム | 用語 | ターム | 1 |
| logging | ログ記録 | ロギング | ロギング | 0 |
| Username | ユーザー名 | ユーザーネーム | ユーザー名 | 1 |
| State | 状態 | ステータス | 状態 | 1 |

Note: I've excluded `claude-3-5-sonnet` and `deepseek-r1-v1`, both included in the PoC, for the following reasons:
- three other Claude models have already been tested, so adding a fourth would evaluate the model's capability rather than the impact of RAG
- `deepseek-r1-v1` turned out to be sub-optimal for translation: it simply didn't translate well, if at all, and will be removed from the PoC in the future

dokumori’s picture

The previous evaluation method wasn’t statistically valid in many ways, including the sample size. To conduct better evaluations, I’ve added a feature that uses LLM-as-a-judge to compare RAG-based and non-RAG-based translations (included in release version 1.2.3). This produces quantitative evaluations of the RAG-based translation that are also statistically sound.
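Conceptually, an LLM-as-a-judge comparison works like the sketch below. The prompt wording, verdict labels, and tallying are my assumptions for illustration, not the tool's actual implementation: the judge model sees the source and both candidate translations, returns a verdict per string, and the verdicts are aggregated into the win/tie counts.

```python
from collections import Counter

# Illustrative sketch of an LLM-as-a-judge loop; prompt wording, verdict
# labels, and tallying are assumptions, not the tool's actual code.

JUDGE_PROMPT = (
    "You are judging two Japanese translations of the same English string.\n"
    "Source: {src}\nA: {a}\nB: {b}\n"
    "Answer with exactly one word: A, B, or TIE."
)

def tally(verdicts: list) -> dict:
    """Aggregate per-string judge verdicts into win/tie counts."""
    counts = Counter(v.strip().upper() for v in verdicts)
    return {"wins_a": counts["A"], "wins_b": counts["B"], "ties": counts["TIE"]}

# Each verdict would come from one judge-model call per evaluated string.
summary = tally(["A", "tie", "B", "A", "TIE"])
# -> {"wins_a": 2, "wins_b": 1, "ties": 2}
```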

I’m sharing two sets of reports generated by the feature. These consistently show strong indications that the RAG-based translations align more closely with existing translations (as they should), and also tend to have higher fluency and accuracy rates.

Here's an example of the report:

=========================================
🏆 EVALUATION RESULTS SUMMARY
Completed: 2026-03-29 19:22
=========================================
JUDGE MODEL ⚖️ : kimi-k2.5 (kimi-k2.5)
Total Evaluated: 296
Wins (With RAG): 103
Wins (Without RAG): 39
Ties: 154
-----------------------------------------
Files Compared:
  - With RAG: en-ja_mistral-large-2402-v1_rag_2026-03-28_19-12-16.po (in with_rag/)
  - Without RAG: en-ja_mistral-large-2402-v1_norag_2026-03-28_19-31-43.po (in without_rag/)
-----------------------------------------
Comparative Metrics (vs Non-RAG):
  - Win Ratio: 2.64x (RAG is 2.6x more likely to win)
  - Relative Win Rate: 72.5% (Preference in decided cases)
  - Win Lead: +164.1% (More 'Best' translations produced)
  - Contextual Error Reduction: 44.8% (Closing the gap to perfection)
  - Sub-optimal Rate Reduction: +58.1% (Reduction in scores < 4.0)
  - Net Improvement (Delta): +21.6% (Total win-rate difference)
  - Score Improvement: +10.9% (Average context score boost)
-----------------------------------------
Average Context Adherence Score (Max 5):
  - With RAG: 4.46
  - Without RAG: 4.02
Average Accuracy & Fluency Score (Max 5):
  - With RAG: 4.46
  - Without RAG: 4.28
=========================================
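To make the report's comparative metrics easier to interpret, they can all be reproduced from the raw counts and scores above. The formula names are inferred from the labels; the small discrepancy on the error-reduction figure presumably comes from the report using unrounded scores.

```python
# Reproducing the report's comparative metrics from its raw numbers.
# Formulas are inferred from the metric labels, not taken from the tool.
wins_rag, wins_norag, total = 103, 39, 296
score_rag, score_norag, max_score = 4.46, 4.02, 5.0

win_ratio = wins_rag / wins_norag                        # ~2.64x
relative_win_rate = wins_rag / (wins_rag + wins_norag)   # ~72.5%
win_lead = (wins_rag - wins_norag) / wins_norag          # ~+164.1%
net_improvement = (wins_rag - wins_norag) / total        # ~+21.6%
score_improvement = score_rag / score_norag - 1          # ~+10.9%
# "Closing the gap to perfection": how much of the distance to a perfect
# context score the RAG run removes (~44.8% with unrounded scores).
error_reduction = ((max_score - score_norag) - (max_score - score_rag)) \
                  / (max_score - score_norag)
```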

Here are two sets of samples. I'm uploading all relevant files for the record, but I think you only need to look at the summary reports (.txt) to get the gist of it.

#1:

#2:

Please refer to the documentation for the detailed explanation of the evaluation method (https://github.com/dokumori/rag-llm-translator/blob/main/docs/5_translat...).