Problem/Motivation

During migrations, many times I found the need to manipulate in some way fields with html inside, e.g. the node body field.

Some manipulation examples are:

  • Change the src attribute on images to point to a new place where new images are stored.
  • Change a subset of link href's to new places, e.g. one node type is now a media bundle.
  • Changing the way images are handled to use embed module <drupal-entity> html entities instead of <img>.
  • Transformations to cleanup codebase, like #2957933: Clean Up HTML & improve accessibility during migration, but that would mean to move this issue into core instead.

Proposed resolution

Two ideas come to mind to solve this problem: (a) use regular expressions, and (b) use DOM* classes to manipulate.
Based on a past experience during #88183: Relative URLs in feeds should be converted to absolute ones, I think core is leading more towards (b).

For either case, but especially for (b) writing custom migrate process plugins sounds like the better alternative.
Then, any more specific logic can be added via new migrate process plugins.

The main idea here is then to create a dom migrate process plugin that allows:

  • to convert a html string into a DOMDocument
  • then other dom aware plugins can later manipulate that object,
  • finally, convert a DOMDocument into a html string, for other normal process.

One example migrate process plugin using this is #2958285: Allow replacing based on a xpath expression.

Remaining tasks

User interface changes

N.A.

API changes

Not really an API change, but process plugins seem to assume that strings are passed along.
This new process plugin instead produces a DOMDocument on import, but a normal string on export.

Data model changes

N.A.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

marvil07 created an issue. See original summary.

marvil07’s picture

Assigned: marvil07 » Unassigned
Status: Active » Needs review
FileSize
6.12 KB

Here a dom migrate process plugin.

Also, thanks to benjifisher for the original idea, helping to make this idea more concrete, and for some suggestions.

marvil07’s picture

Issue summary: View changes
dww’s picture

Nice! I've got a migration doing all sorts of stuff to a body field. It currently does a callback process to do all the magic directly in PHP in my custom migration module. It's conceivable it could be made to work with this approach, instead. Given my background, I tend to reach for the callback process plugin at the drop of a hat. ;) I'm much more comfortable doing something directly in PHP than trying to chain a bunch of other processes together via the yml glue. But I definitely see value in this.

However, I'm curious how this will work in practice. Namely, how will other process plugins that expect a DOM object as input, not a string, advertise/declare themselves as such? And, for the chaining to really work, we'd need a way for a given DOM-aware plugin to know if it's the last step in the chain and should spit back HTML, or if it's going to keep being used and should pass on the modified DOM.

A brief skim of the patch doesn't reveal an obvious answer. Apologies if I'm missing something.

Thanks!
-Derek

marvil07’s picture

Hey dww!

Namely, how will other process plugins that expect a DOM object as input, not a string, advertise/declare themselves as such? And, for the chaining to really work, we'd need a way for a given DOM-aware plugin to know if it's the last step in the chain and should spit back HTML, or if it's going to keep being used and should pass on the modified DOM.

How to expose it in the context of other plugins is something we may want to think about.
Maybe declaring a new interface for process plugins to implement could be a way to detect that in code, and then later try to intervene in process plugins to raise an exception if the last process plugin is not dom.

For now my approach has been a lot simpler: assume the migration yml author to actually know that dom import is required at the start, and dom export is required at the end.

A full example from the docblock in the patch on the related issue #2958285: Allow replacing based on a xpath expression:

process:
  'body/value':
    -
      plugin: dom
      method: import
      source: 'body/0/value'
    -
      plugin: dom_str_replace
      mode: attribute
      expression: '//a'
      attribute_options:
        name: href
      search: 'foo'
      replace: 'bar'
    -
      plugin: dom_str_replace
      mode: attribute
      expression: '//a'
      attribute_options:
        name: href
      regex: true
      search: '/foo/'
      replace: 'bar'
    -
      plugin: dom
      method: export
dww’s picture

Hi marvil07! It's been a while... ;)

Cool, thanks for clarifying. Makes sense. That'd work for me, but I presume this will need more fancy exception throwing in the general case before it could be officially adopted. Otherwise, I predict fairly interesting and useless error messages (if any) if someone got the details wrong. ;)

mikeryan’s picture

Just an FYI, my goto for HTML parsing has been https://packagist.org/packages/querypath/querypath, it's especially good if you're dealing with old-school HTML (no </p>, etc.).

mikeryan’s picture

A simplified example...


namespace Drupal\example_migrate\Plugin\migrate\process;

use Drupal\migrate\MigrateExecutableInterface;
use Drupal\migrate\ProcessPluginBase;
use Drupal\migrate\Row;

/**
 * Extract transcript HTML from a video node body.
 *
 * @MigrateProcessPlugin(
 *   id = "extract_transcript"
 * )
 */
class ExtractTranscript extends ProcessPluginBase {

  /**
   * {@inheritdoc}
   */
  public function transform($value, MigrateExecutableInterface $migrate_executable, Row $row, $destination_property) {
    // Extract the transcript, and remove the disclaimer.
    $qp_options = [
      'convert_to_encoding' => 'utf-8',
    ];
    $transcript = \QueryPath::withHTML($value, 'div.fpBody', $qp_options);
    $transcript->remove('p.legal');
    $transcript = $transcript->innerHTML();

    return $transcript;
  }

}
marvil07’s picture

@mikeryan: yes, there are some HTML manipulation libraries that can be used. IMHO I would say we could start simple with php dom extension, and later we can decide which library to use, since this will become a new dependency.

Also, adding a parent ticket.

marvil07’s picture

FileSize
7.9 KB

Initial test coverage and fixing a small bug.

marvil07’s picture

FileSize
2.74 KB

Related interdiff.

benjifisher’s picture

Issue summary: View changes

@dww, thanks for your comments!

Namely, how will other process plugins that expect a DOM object as input, not a string, advertise/declare themselves as such?

@marvil07 already answered this. I just want to point out that process plugins generally have no type hinting for the value they are given. It is typically a string or an array: often a nested array of strings. It is already the responsibility of the migration to chain process plugins together so that each provides the input expected from the next.

Otherwise, I predict fairly interesting and useless error messages (if any) if someone got the details wrong.

The import method already has

+    if (!is_string($value)) {
+      throw new MigrateException('Cannot import a non-string value.');
+    }

and the export method has

+    if (!$value instanceof \DOMDocument) {
+      $value_description = (gettype($value) == 'object') ? get_class($value) : gettype($value);
+      throw new MigrateException(sprintf('Cannot export a "%s".', $value_description));
+    }

Can you give a more specific suggestion on how to improve these? Should we do something other than throw MigrateException?

benjifisher’s picture

If you need to do custom processing with the DOMDocument class, then you can write a custom process plugin instead of a custom callback, and let this plugin handle converting from string to object and back.

Our plan is to have some configurable process plugins that do common transformations, like updating href and src attributes during migrations. @marvil07 has already added two other issues as children of the parent issue #2958642: DOM manipulation on process plugins, based on the code we are using on our current project.

joey-santiago’s picture

Having a similar problem and wrote a class extending ProcessPluginBase for this.

I am also using regex in order to "translate" some xml elements into others as i couldn't find a smart way of doing it with DOM classes in php.

Also, the problem i am having is: i have an array of URLs i use to do the migration from... how can i access the "current" URL the transform method is running at?

thanks,

heddn’s picture

Status: Needs review » Needs work
  1. +++ b/src/Plugin/migrate/process/Dom.php
    @@ -0,0 +1,166 @@
    +  public function __construct(array $configuration, $plugin_id, $plugin_definition) {
    +    parent::__construct($configuration, $plugin_id, $plugin_definition);
    

    Let's throw a noisy exception if we don't pass a valid value for method or import or export.

  2. +++ b/src/Plugin/migrate/process/Dom.php
    @@ -0,0 +1,166 @@
    +    // @todo It may be useful to have these parameters as configuration options.
    

    Then let's do this. An optional params option is a great idea.

  3. +++ b/src/Plugin/migrate/process/Dom.php
    @@ -0,0 +1,166 @@
    +    if ($this->nonRoot) {
    

    We should also make this non root markup as an optional argument passed to the constructor.

  4. +++ b/src/Plugin/migrate/process/Dom.php
    @@ -0,0 +1,166 @@
    +      set_error_handler(function(int $errno , string $errstr) use ($migrate_executable) {
    

    I don't think type hinting string and int is safe when using with php 5.6.

  5. +++ b/tests/src/Unit/process/DomTest.php
    @@ -0,0 +1,65 @@
    +class DomTest extends MigrateProcessTestCase {
    

    Let's beef up the test case to include the edge cases I mention in my review previously.

benjifisher’s picture

Status: Needs work » Needs review
Issue tags: +#DrupalCampNJ2019
FileSize
7.32 KB
11 KB

@heddn, thanks for the detailed suggestions.

  • 15.1 Done. I followed the example of the core callback plugin. Should we add a follow-up issue to do something similar in the skip_on_value plugin?
  • 15.2 Done. I called the optional parameters "version" and "encoding", matching the documentation of the DOMDocument constructor. I also copied the descriptions from there. If that is not clear enough, then I could prefix these parameters with "domdocument_" or "xml_". What do you think?
  • 15.3 Done. The default markup (now the html_template parameter) is complicated enough that I decided to add a defaultValues() method and simplify the handling of these defaults. I also added a second example in the documentation block, showing all the defaults.
  • 15.4 I removed the type hints.
  • 15.5 I added test coverage for validating the "method" parameter (15.1). Thanks to @tim.plunkett for his advice at #DrupalCampNJ. Did you have additional tests in mind?

Status: Needs review » Needs work

The last submitted patch, 16: 2958281-16.patch, failed testing. View results
- codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

benjifisher’s picture

Status: Needs work » Needs review
FileSize
1.65 KB
10.86 KB

This patch should fix the coding standards. I am also attaching an interdiff.

Status: Needs review » Needs work

The last submitted patch, 18: 2958281-18.patch, failed testing. View results

benjifisher’s picture

@dww:

In #6, you requested better error handling. I asked for more information in #12, but you did not reply. Maybe you are satisfied with the explanations @marvil07 and I gave, or with the validation I added in #16?

@joey-santiago:

Your question seems off topic for this issue. Maybe I am missing the connection. The transform() method (or the import() and export() methods in this issue) receive a $row parameter. Maybe the information you need is already there. If not, maybe you could add it in hook_migrate_prepare_row().

benjifisher’s picture

Status: Needs work » Needs review
FileSize
2.46 KB
10.93 KB

This should fix the test failures. (Make sure not to rely on optional parameters!) This change also gives migrations the option of specifying a custom html_template that (like the default) uses a token for the encoding.

Status: Needs review » Needs work

The last submitted patch, 21: 2958281-21.patch, failed testing. View results
- codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

benjifisher’s picture

FileSize
925 bytes
10.99 KB

I disagree with CodeSniffer about indentation. At the expense of adding a few lines, I can make a compromise that should satisfy us both.

benjifisher’s picture

Status: Needs work » Needs review
heddn’s picture

Status: Needs review » Needs work
  1. +++ b/tests/src/Unit/process/DomTest.php
    @@ -0,0 +1,84 @@
    +    $document = (new Dom($configuration, 'dom', []))
    ...
    +    $document = (new Dom($configuration, 'dom', []))
    

    I don't think we need to assign to variable $document.

  2. +++ b/tests/src/Unit/process/DomTest.php
    @@ -0,0 +1,84 @@
    +    $this->setExpectedException(MigrateException::class);
    ...
    +    $this->setExpectedException(MigrateException::class);
    

    Nit: Can we test the message for the exception?

benjifisher’s picture

Status: Needs work » Needs review
FileSize
1.92 KB
11.02 KB

@heddn:

Good catches. I also removed a $value = for the same reason as #25.1.

heddn’s picture

I'm doing a review of the doxygen now more thoroughly. Let's see if we can make the use of this process plugin a little easier.

  1. +++ b/src/Plugin/migrate/process/Dom.php
    @@ -0,0 +1,232 @@
    + * - warnings_to_idmap: (optional) When parsing HTML, libxml may trigger
    + *   warnings. If this option is set to true, it will log them as migration
    + *   messages. Otherwise, it will not handle it in a special way. Only used if
    + *   method is 'import'. Defaults to true.
    

    Let's call this log_messages.

  2. +++ b/src/Plugin/migrate/process/Dom.php
    @@ -0,0 +1,232 @@
    + * - version: (optional) The version number of the document as part of the XML
    + *   declaration. Only used if method is 'import'. Defaults to '1.0'.
    + * - encoding: (optional) The encoding of the document as part of the XML
    + *   declaration. Only used if method is 'import'. Defaults to 'UTF-8'.
    

    Is this also only important when non_root is false? Can we combine this with html_template and remove some config options?

  3. +++ b/src/Plugin/migrate/process/Dom.php
    @@ -0,0 +1,232 @@
    + * - html_template: (optional) If non_root is true, then a complete HTML
    + *   document with the token '!value' to be replaced by the input text. You can
    + *   also use the token '!encoding'. Only used if method is 'import'. The
    + *   default has a DOCTYPE line, an <html> element, a <head> element, and a
    + *   <body> element containing the '!value' token.
    

    This is a complicated concept and I would like if we could make this easier to understand. Are these replacement values to make parsing the dom easier? Is there a more intuitive name for the config option than html_template? Like string_replacements, or something else?

benjifisher’s picture

More later, but for now:

In #15.3, you asked for the markup to be configurable. I realized that this was complicated, so I included the second example (specifying all optional parameters as their default values):

 *       html_template: |
 *         <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
 *         <html xmlns="http://www.w3.org/1999/xhtml">
 *         <head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head>
 *         <body>!value</body>
 *         </html>
benjifisher’s picture

#27.1: Done. I made a similar change to the name of the corresponding class property.

If I were writing this from scratch, then I might not bother with the two class properties. I have kept them from the original patch.

#27.2: No, the version and encoding are also passed directly to the DOMDocument constructor:

    $document = new \DOMDocument($this->configuration['version'], $this->configuration['encoding']);

#27.3: I would rather remove this optional parameter than rename it. It is a wrapper for the supplied HTML fragment, so I do not think that string_replacements is a descriptive name.

Besides what I mentioned in #28, note that the import() and export() methods are basically wrappers for Html::load() and Html::serialize(), at least in the usual case where non_root is true (default). The export() method explicitly calls Html::serialize(). The only reason import() re-implements Html::load() instead of calling it is that we want to provide an option to log messages. (@marvil07 gets all the credit for that idea. We find it very helpful in debugging migrations.)

Maybe I should have pushed back on the suggestions from #15.2 and #15.3. We are providing options that may not be very useful and certainly are not provided by Html::load().

heddn’s picture

Status: Needs review » Needs work
+++ b/src/Plugin/migrate/process/Dom.php
@@ -0,0 +1,234 @@
+ * - html_template: (optional) If non_root is true, then a complete HTML

I like you reasoning for removing this. Let's do that. It is complicated enough and if someone does need to adjust it, then can always extend this process plugin and make their own changes. Let's move some of these comments down into code where we are processing the html template and remove this config option.

Thanks for your patience.

marvil07’s picture

Status: Needs work » Needs review
FileSize
10.5 KB
3.92 KB

@heddn, thanks for the reviews!
@benjifisher, thanks for all your iterations here!

I have only made a couple of changes to the code:

  • non_root parameter is used in both import and export methods, so I moved it up on documentation.
  • Move partial to full html building to a separate method. In this way, we both remove the parameter that may be a bit confusing, and we make it a bit easier to override the process plugin on a child class isolating the non-root partial to full html conversion in one new method.
heddn’s picture

Status: Needs review » Reviewed & tested by the community

Looks good now.

  • heddn committed abe81dd on 8.x-4.x authored by marvil07
    Issue #2958281 by benjifisher, marvil07, heddn: Allow manipulating html...
heddn’s picture

Status: Reviewed & tested by the community » Fixed

Thanks for all your contributions here.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.