Problem/Motivation

Source strings can contain HTML or locale placeholders that the translator shouldn't touch. Later on, this could also be exposed in the UI so that users can mark non-translatable parts. The latter in particular will require support for specific positions, so that we don't ignore/escape too much.

Proposed resolution

Extend the data item structure with an #escape key that looks like this:

$item['#escape'] = array(
  // Escape the @bar that starts at position 5.
  array('string' => '@bar', 'position' => 5),
  // Escape all @foo
  array('string' => '@foo'),
);

Translators can then check for that key and escape those strings accordingly. This is not trivial, as positions can affect each other (escaping one string shifts the positions of the strings after it), so this should be abstracted/generalized as much as possible. I expect something like

$final_text = $this->escapeText($data_item);

For the highest flexibility, I think there should be a $this->getEscapedString($string) with a default implementation that looks like this: return $this->escapeStart . $string . $this->escapeEnd; so that simple escaping patterns can just define those properties.
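A minimal sketch of how these two methods could fit together, not the actual implementation. The class name ExampleEscaper, the '[[' and ']]' markers, and the '#text' key holding the source text are assumptions made for illustration. It resolves every definition to a position in the original text and then works backwards from the end, so inserting markers never shifts a position that still has to be processed. Overlapping strings (for example @foo and @foobar) are not handled.

```php
<?php

/**
 * Hypothetical helper illustrating the proposed escapeText() idea.
 */
class ExampleEscaper {
  // Assumed marker values; a real translator would define its own.
  protected $escapeStart = '[[';
  protected $escapeEnd = ']]';

  // Default implementation: wrap the string in the start and end markers.
  public function getEscapedString($string) {
    return $this->escapeStart . $string . $this->escapeEnd;
  }

  // Applies all #escape definitions of a data item to its text.
  public function escapeText(array $data_item) {
    // Assumption: the source text lives in the '#text' key.
    $text = $data_item['#text'];
    $escape = isset($data_item['#escape']) ? $data_item['#escape'] : array();

    // Resolve every definition to byte positions in the original text.
    $positions = array();
    foreach ($escape as $definition) {
      if ($definition['string'] === '') {
        continue;
      }
      if (isset($definition['position'])) {
        $positions[$definition['position']] = $definition['string'];
        continue;
      }
      // No position given: collect all occurrences of the string.
      $offset = 0;
      while (($found = strpos($text, $definition['string'], $offset)) !== FALSE) {
        $positions[$found] = $definition['string'];
        $offset = $found + strlen($definition['string']);
      }
    }

    // Highest position first, so earlier positions stay valid.
    krsort($positions);
    foreach ($positions as $position => $string) {
      $text = substr_replace($text, $this->getEscapedString($string), $position, strlen($string));
    }
    return $text;
  }
}
```

With these assumptions, a translator with a simple escaping pattern would only override the two marker properties, while one with unusual needs could override getEscapedString() instead.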

Remaining tasks

Write the code, update documentation, add a lot of tests.

Define unescape. Can we just look for the pattern and remove it? Having to care about shifted positions there would be insanely complicated.
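One way the pattern-based approach could look, as a sketch only. It assumes that the markers never occur in regular text and that the translation service returns them intact; the function name and the '[[' and ']]' markers are hypothetical.

```php
<?php

/**
 * Removes escape markers from translated text (hypothetical helper).
 */
function example_unescape_text($text, $escape_start = '[[', $escape_end = ']]') {
  // Match the start marker, the shortest possible content, then the end
  // marker. preg_quote() keeps marker characters from acting as regex syntax.
  $pattern = '/' . preg_quote($escape_start, '/') . '(.+?)' . preg_quote($escape_end, '/') . '/s';
  // Keep only the captured content and drop the markers. No positions are
  // involved, so shifts caused by the translation do not matter.
  return preg_replace($pattern, '$1', $text);
}
```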

User interface changes

None for now.

API changes

Sources can define strings that should be escaped; translators are supposed to handle them.

Gengo: #1676774: Escape HTML from source before sending to be translated
Google: #2064823: Escaping

All other translators will need issues too.

Comment   File                        Size      Author
#2        escaping-2064871-2.patch    9.77 KB   berdir

Comments

berdir

Title: Allow sources to define strings that should be escaped. » Allow sources to define strings that should be escaped
berdir

Status: Active » Needs review
Status: new   Size: 9.77 KB

First implementation.

The #escape definition was changed to always require the position, which is therefore used as the key. This makes actually implementing it fairly easy, which is done here in the test translator for testing purposes. The test translator doesn't actually use it, as it doesn't need to.

The escape definition is implemented for the locale source as far as possible, as discussed; see the inline comments.
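As a rough illustration of what a position-keyed definition for locale placeholders could look like, here is a sketch built around the pattern quoted from the patch in the review below. The function name and the exact shape of each entry are assumptions, not the committed code. Note that PREG_OFFSET_CAPTURE reports byte offsets, which differ from character positions in multibyte strings.

```php
<?php

/**
 * Builds a position-keyed escape definition for locale placeholders.
 *
 * Hypothetical helper; the entry format is assumed for illustration.
 */
function example_locale_escape_definition($text) {
  $escape = array();
  // With PREG_OFFSET_CAPTURE every match is an array of (string, byte offset).
  if (preg_match_all('/([@!%][a-zA-Z0-9_-]+)/', $text, $matches, PREG_OFFSET_CAPTURE)) {
    foreach ($matches[0] as $match) {
      list($string, $position) = $match;
      // The position is unique per match, so it can serve as the key.
      $escape[$position] = array('string' => $string);
    }
  }
  return $escape;
}
```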

Escaping HTML should IMHO be moved to a separate issue. It will not be trivial, and it doesn't need to block getting the API in so that translators can start using it.

miro_dietiker

In addition to the limitations, side effects and problems discussed...

There is a second thought that is similarly crazy:
Instead of pseudo metadata, we could define that sources need to deliver proper ITS.
http://www.w3.org/TR/its20/
See also some approach:
https://drupal.org/project/its

This leads to several problems:
- A source might not want to care about ITS, so the default would have to be a hint that "it's just text!": don't try to interpret it as ITS-tagged content.
- A source needs quite some complexity to parse its content and convert it into ITS content (and back).
- A translator that doesn't support anything (like the current translators) would need to get an ITS-stripped payload.
- A translator that supports escaping (or a subset of ITS) needs to instantiate an ITS parser and interpret the events.

While it is nice to follow a clean standard, it's just crazy complex.
We might provide something like this as V2 with a fallback to placeholder positions like we are doing currently.

blueminds

Status: Needs review » Needs work

Just some very minor things:

+++ b/plugin/tmgmt.plugin.translator.inc
@@ -16,6 +16,20 @@ abstract class TMGMTDefaultTranslatorPluginController extends TMGMTPluginBase im
+   * Characters that indicates the beginning of an escaped string.
+   *
+   * @var string
+   */

indicates -> indicate

+++ b/plugin/tmgmt.plugin.translator.inc
@@ -16,6 +16,20 @@ abstract class TMGMTDefaultTranslatorPluginController extends TMGMTPluginBase im
+  /**
+   * Characters that indicates the end of an escaped string.
+   *
+   * @var string

indicates -> indicate

+++ b/sources/locale/tmgmt_locale.plugin.inc
@@ -145,12 +145,23 @@ class TMGMTLocaleSourcePluginController extends TMGMTDefaultSourcePluginControll
+      if (preg_match_all('/([@!%][a-zA-Z0-9_-]+)/', $text, $matches, PREG_OFFSET_CAPTURE)) {

The last dash does not need to be escaped?

berdir

Status: Needs work » Fixed

Thanks, fixed the indicate thing. Escaping the dash is not necessary, as discussed. Committed and pushed.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.