Allow manipulating html from process plugins [#2958281]

Problem/Motivation

During migrations, many times I found the need to manipulate in some way fields with html inside, e.g. the node body field.

Some manipulation examples are:

Change the src attribute on images to point to a new place where new images are stored.
Change a subset of link href's to new places, e.g. one node type is now a media bundle.
Changing the way images are handled to use embed module <drupal-entity> html entities instead of <img>.
Transformations to cleanup codebase, like #2957933: Clean Up HTML & improve accessibility during migration, but that would mean to move this issue into core instead.

Proposed resolution

Two ideas come to mind to solve this problem: (a) use regular expressions, and (b) use DOM* classes to manipulate.
Based on a past experience during #88183: Relative URLs in feeds should be converted to absolute ones, I think core is leading more towards (b).

For either case, but especially for (b) writing custom migrate process plugins sounds like the better alternative.
Then, any more specific logic can be added via new migrate process plugins.

The main idea here is then to create a dom migrate process plugin that allows:

to convert a html string into a DOMDocument
then other dom aware plugins can later manipulate that object,
finally, convert a DOMDocument into a html string, for other normal process.

One example migrate process plugin using this is #2958285: Allow replacing based on a xpath expression.

Remaining tasks

User interface changes

N.A.

API changes

Not really an API change, but process plugins seem to assume that strings are passed along.
This new process plugin instead produces a DOMDocument on import, but a normal string on export.

Data model changes

N.A.

Comment	File	Size	Author
#31	interdiff.txt	3.92 KB	marvil07
#31	2958281-31.patch	10.5 KB	marvil07
#29	2958281-29.patch	10.93 KB	benjifisher
#29	interdiff-2958281-26-29.txt	3.78 KB	benjifisher
#26	2958281-26.patch	11.02 KB	benjifisher
#26	interdiff-2958281-23-26.txt	1.92 KB	benjifisher
#23	2958281-23.patch	10.99 KB	benjifisher
#23	interdiff-2958281-21-23.txt	925 bytes	benjifisher
#21	2958281-21.patch	10.93 KB	benjifisher
#21	interdiff-2958281-18-21.txt	2.46 KB	benjifisher
#18	2958281-18.patch	10.86 KB	benjifisher
#18	interdiff-2958281-16-18.txt	1.65 KB	benjifisher
#16	2958281-16.patch	11 KB	benjifisher
#16	interdiff-2958281-10-16.txt	7.32 KB	benjifisher
#11	interdiff.txt	2.74 KB	marvil07
#10	2958281-10.patch	7.9 KB	marvil07
#2	0001-Add-a-dom-migrate-process-plugin-that-handles-string.patch	6.12 KB	marvil07

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Comment #1

3 April 2018 at 22:17

marvil07 created an issue. See original summary.

Comment #2

marvil07 CreditAttribution: marvil07 at Isovera for Pegasystems commented 3 April 2018 at 22:22

Assigned:	marvil07	» Unassigned
Status:	Active	» Needs review

File	Size
0001-Add-a-dom-migrate-process-plugin-that-handles-string.patch	6.12 KB

Here a dom migrate process plugin.

Also, thanks to benjifisher for the original idea, helping to make this idea more concrete, and for some suggestions.

Comment #3

marvil07 CreditAttribution: marvil07 at Isovera for Pegasystems commented 3 April 2018 at 22:31

Issue summary:

View changes

Comment #4

dww

we/he/they

CreditAttribution: dww as a volunteer commented 4 April 2018 at 15:29

Nice! I've got a migration doing all sorts of stuff to a body field. It currently does a callback process to do all the magic directly in PHP in my custom migration module. It's conceivable it could be made to work with this approach, instead. Given my background, I tend to reach for the callback process plugin at the drop of a hat. ;) I'm much more comfortable doing something directly in PHP than trying to chain a bunch of other processes together via the yml glue. But I definitely see value in this.

However, I'm curious how this will work in practice. Namely, how will other process plugins that expect a DOM object as input, not a string, advertise/declare themselves as such? And, for the chaining to really work, we'd need a way for a given DOM-aware plugin to know if it's the last step in the chain and should spit back HTML, or if it's going to keep being used and should pass on the modified DOM.

A brief skim of the patch doesn't reveal an obvious answer. Apologies if I'm missing something.

Thanks!
-Derek

Comment #5

marvil07 CreditAttribution: marvil07 at Isovera for Pegasystems commented 4 April 2018 at 15:50

Hey dww!

Namely, how will other process plugins that expect a DOM object as input, not a string, advertise/declare themselves as such? And, for the chaining to really work, we'd need a way for a given DOM-aware plugin to know if it's the last step in the chain and should spit back HTML, or if it's going to keep being used and should pass on the modified DOM.

How to expose it in the context of other plugins is something we may want to think about.
Maybe declaring a new interface for process plugins to implement could be a way to detect that in code, and then later try to intervene in process plugins to raise an exception if the last process plugin is not dom.

For now my approach has been a lot simpler: assume the migration yml author to actually know that dom import is required at the start, and dom export is required at the end.

A full example from the docblock in the patch on the related issue #2958285: Allow replacing based on a xpath expression:

process:
  'body/value':
    -
      plugin: dom
      method: import
      source: 'body/0/value'
    -
      plugin: dom_str_replace
      mode: attribute
      expression: '//a'
      attribute_options:
        name: href
      search: 'foo'
      replace: 'bar'
    -
      plugin: dom_str_replace
      mode: attribute
      expression: '//a'
      attribute_options:
        name: href
      regex: true
      search: '/foo/'
      replace: 'bar'
    -
      plugin: dom
      method: export

Comment #6

dww

we/he/they

CreditAttribution: dww as a volunteer commented 4 April 2018 at 16:14

Hi marvil07! It's been a while... ;)

Cool, thanks for clarifying. Makes sense. That'd work for me, but I presume this will need more fancy exception throwing in the general case before it could be officially adopted. Otherwise, I predict fairly interesting and useless error messages (if any) if someone got the details wrong. ;)

Comment #7

mikeryan

he/him

English

Murphysboro, IL, USA

CreditAttribution: mikeryan as a volunteer and at Virtuoso Performance commented 4 April 2018 at 16:44

Just an FYI, my goto for HTML parsing has been https://packagist.org/packages/querypath/querypath, it's especially good if you're dealing with old-school HTML (no </p>, etc.).

Comment #8

mikeryan

he/him

English

Murphysboro, IL, USA

CreditAttribution: mikeryan as a volunteer and at Virtuoso Performance commented 4 April 2018 at 16:50

A simplified example...


namespace Drupal\example_migrate\Plugin\migrate\process;

use Drupal\migrate\MigrateExecutableInterface;
use Drupal\migrate\ProcessPluginBase;
use Drupal\migrate\Row;

/**
 * Extract transcript HTML from a video node body.
 *
 * @MigrateProcessPlugin(
 *   id = "extract_transcript"
 * )
 */
class ExtractTranscript extends ProcessPluginBase {

  /**
   * {@inheritdoc}
   */
  public function transform($value, MigrateExecutableInterface $migrate_executable, Row $row, $destination_property) {
    // Extract the transcript, and remove the disclaimer.
    $qp_options = [
      'convert_to_encoding' => 'utf-8',
    ];
    $transcript = \QueryPath::withHTML($value, 'div.fpBody', $qp_options);
    $transcript->remove('p.legal');
    $transcript = $transcript->innerHTML();

    return $transcript;
  }

}

Comment #9

marvil07 CreditAttribution: marvil07 at Isovera for Pegasystems commented 4 April 2018 at 23:39

Parent issue:

» #2958642: DOM manipulation on process plugins

@mikeryan: yes, there are some HTML manipulation libraries that can be used. IMHO I would say we could start simple with php dom extension, and later we can decide which library to use, since this will become a new dependency.

Also, adding a parent ticket.

Comment #10

marvil07 CreditAttribution: marvil07 at Isovera for Pegasystems commented 5 April 2018 at 05:38

File	Size
2958281-10.patch	7.9 KB

1 file was hidden/shown/deleted

File	Size
0001-Add-a-dom-migrate-process-plugin-that-handles-string.patch	6.12 KB

Initial test coverage and fixing a small bug.

Comment #11

marvil07 CreditAttribution: marvil07 at Isovera for Pegasystems commented 5 April 2018 at 05:39

File	Size
interdiff.txt	2.74 KB

Related interdiff.

Comment #12

benjifisher

he/him or they/them

English

Boston area

CreditAttribution: benjifisher at Isovera commented 5 April 2018 at 15:59

Issue summary:

View changes

@dww, thanks for your comments!

Namely, how will other process plugins that expect a DOM object as input, not a string, advertise/declare themselves as such?

@marvil07 already answered this. I just want to point out that process plugins generally have no type hinting for the value they are given. It is typically a string or an array: often a nested array of strings. It is already the responsibility of the migration to chain process plugins together so that each provides the input expected from the next.

Otherwise, I predict fairly interesting and useless error messages (if any) if someone got the details wrong.

The import method already has

+    if (!is_string($value)) {
+      throw new MigrateException('Cannot import a non-string value.');
+    }

and the export method has

+    if (!$value instanceof \DOMDocument) {
+      $value_description = (gettype($value) == 'object') ? get_class($value) : gettype($value);
+      throw new MigrateException(sprintf('Cannot export a "%s".', $value_description));
+    }

Can you give a more specific suggestion on how to improve these? Should we do something other than throw MigrateException?

Comment #13

benjifisher

he/him or they/them

English

Boston area

CreditAttribution: benjifisher at Isovera for Pegasystems commented 5 April 2018 at 18:28

If you need to do custom processing with the DOMDocument class, then you can write a custom process plugin instead of a custom callback, and let this plugin handle converting from string to object and back.

Our plan is to have some configurable process plugins that do common transformations, like updating href and src attributes during migrations. @marvil07 has already added two other issues as children of the parent issue #2958642: DOM manipulation on process plugins, based on the code we are using on our current project.

Comment #14

joey-santiago CreditAttribution: joey-santiago commented 27 November 2018 at 10:08

Having a similar problem and wrote a class extending ProcessPluginBase for this.

I am also using regex in order to "translate" some xml elements into others as i couldn't find a smart way of doing it with DOM classes in php.

Also, the problem i am having is: i have an array of URLs i use to do the migration from... how can i access the "current" URL the transform method is running at?

thanks,

Comment #15

heddn

English

Nicaragua

CreditAttribution: heddn at MTech, LLC, Acro Commerce commented 11 December 2018 at 17:29

Status:

Needs review

» Needs work

+++ b/src/Plugin/migrate/process/Dom.php
@@ -0,0 +1,166 @@
+  public function __construct(array $configuration, $plugin_id, $plugin_definition) {
+    parent::__construct($configuration, $plugin_id, $plugin_definition);

Let's throw a noisy exception if we don't pass a valid value for method or import or export.

+++ b/src/Plugin/migrate/process/Dom.php
@@ -0,0 +1,166 @@
+    // @todo It may be useful to have these parameters as configuration options.

Then let's do this. An optional params option is a great idea.

```
+++ b/src/Plugin/migrate/process/Dom.php
@@ -0,0 +1,166 @@
+    if ($this->nonRoot) {
```
We should also make this non root markup as an optional argument passed to the constructor.

+++ b/src/Plugin/migrate/process/Dom.php
@@ -0,0 +1,166 @@
+      set_error_handler(function(int $errno , string $errstr) use ($migrate_executable) {

I don't think type hinting string and int is safe when using with php 5.6.

+++ b/tests/src/Unit/process/DomTest.php
@@ -0,0 +1,65 @@
+class DomTest extends MigrateProcessTestCase {

Let's beef up the test case to include the edge cases I mention in my review previously.

Comment #16

benjifisher

he/him or they/them

English

Boston area

CreditAttribution: benjifisher at Isovera for Pegasystems commented 3 February 2019 at 18:29

Status:	Needs work	» Needs review
Issue tags:		+#DrupalCampNJ2019

File	Size
interdiff-2958281-10-16.txt	7.32 KB
2958281-16.patch	11 KB

@heddn, thanks for the detailed suggestions.

15.1 Done. I followed the example of the core callback plugin. Should we add a follow-up issue to do something similar in the skip_on_value plugin?
15.2 Done. I called the optional parameters "version" and "encoding", matching the documentation of the DOMDocument constructor. I also copied the descriptions from there. If that is not clear enough, then I could prefix these parameters with "domdocument_" or "xml_". What do you think?
15.3 Done. The default markup (now the html_template parameter) is complicated enough that I decided to add a defaultValues() method and simplify the handling of these defaults. I also added a second example in the documentation block, showing all the defaults.
15.4 I removed the type hints.
15.5 I added test coverage for validating the "method" parameter (15.1). Thanks to @tim.plunkett for his advice at #DrupalCampNJ. Did you have additional tests in mind?

Comment #17

3 February 2019 at 18:32

Status:

Needs review

» Needs work

The last submitted patch, 16: 2958281-16.patch, failed testing. View results
- codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

Comment #18

benjifisher

he/him or they/them

English

Boston area

CreditAttribution: benjifisher at Isovera for Pegasystems commented 3 February 2019 at 18:58

Status:

Needs work

» Needs review

File	Size
interdiff-2958281-16-18.txt	1.65 KB
2958281-18.patch	10.86 KB

This patch should fix the coding standards. I am also attaching an interdiff.

Comment #19

3 February 2019 at 19:04

Status:

Needs review

» Needs work

The last submitted patch, 18: 2958281-18.patch, failed testing. View results

Comment #20

benjifisher

he/him or they/them

English

Boston area

CreditAttribution: benjifisher at Isovera for Pegasystems commented 3 February 2019 at 19:11

@dww:

In #6, you requested better error handling. I asked for more information in #12, but you did not reply. Maybe you are satisfied with the explanations @marvil07 and I gave, or with the validation I added in #16?

@joey-santiago:

Your question seems off topic for this issue. Maybe I am missing the connection. The transform() method (or the import() and export() methods in this issue) receive a $row parameter. Maybe the information you need is already there. If not, maybe you could add it in hook_migrate_prepare_row().

Comment #21

benjifisher

he/him or they/them

English

Boston area

CreditAttribution: benjifisher at Isovera for Pegasystems commented 3 February 2019 at 19:38

Status:

Needs work

» Needs review

File	Size
interdiff-2958281-18-21.txt	2.46 KB
2958281-21.patch	10.93 KB

This should fix the test failures. (Make sure not to rely on optional parameters!) This change also gives migrations the option of specifying a custom html_template that (like the default) uses a token for the encoding.

Comment #22

3 February 2019 at 19:48

Status:

Needs review

» Needs work

The last submitted patch, 21: 2958281-21.patch, failed testing. View results
- codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

Comment #23

benjifisher

he/him or they/them

English

Boston area

CreditAttribution: benjifisher at Isovera for Pegasystems commented 3 February 2019 at 20:10

File	Size
interdiff-2958281-21-23.txt	925 bytes
2958281-23.patch	10.99 KB

I disagree with CodeSniffer about indentation. At the expense of adding a few lines, I can make a compromise that should satisfy us both.

Comment #24

benjifisher

he/him or they/them

English

Boston area

CreditAttribution: benjifisher at Isovera for Pegasystems commented 3 February 2019 at 20:21

Status:

Needs work

» Needs review

Comment #25

heddn

English

Nicaragua

CreditAttribution: heddn at MTech, LLC, Acro Commerce commented 4 February 2019 at 14:42

Status:

Needs review

» Needs work

+++ b/tests/src/Unit/process/DomTest.php
@@ -0,0 +1,84 @@
+    $document = (new Dom($configuration, 'dom', []))
...
+    $document = (new Dom($configuration, 'dom', []))

I don't think we need to assign to variable $document.

+++ b/tests/src/Unit/process/DomTest.php
@@ -0,0 +1,84 @@
+    $this->setExpectedException(MigrateException::class);
...
+    $this->setExpectedException(MigrateException::class);

Nit: Can we test the message for the exception?

Comment #26

benjifisher

he/him or they/them

English

Boston area

CreditAttribution: benjifisher at Isovera for Pegasystems commented 5 February 2019 at 19:33

Status:

Needs work

» Needs review

File	Size
interdiff-2958281-23-26.txt	1.92 KB
2958281-26.patch	11.02 KB

@heddn:

Good catches. I also removed a $value = for the same reason as #25.1.

Comment #27

heddn

English

Nicaragua

CreditAttribution: heddn at MTech, LLC commented 5 February 2019 at 20:44

I'm doing a review of the doxygen now more thoroughly. Let's see if we can make the use of this process plugin a little easier.

+++ b/src/Plugin/migrate/process/Dom.php
@@ -0,0 +1,232 @@
+ * - warnings_to_idmap: (optional) When parsing HTML, libxml may trigger
+ *   warnings. If this option is set to true, it will log them as migration
+ *   messages. Otherwise, it will not handle it in a special way. Only used if
+ *   method is 'import'. Defaults to true.

Let's call this log_messages.

+++ b/src/Plugin/migrate/process/Dom.php
@@ -0,0 +1,232 @@
+ * - version: (optional) The version number of the document as part of the XML
+ *   declaration. Only used if method is 'import'. Defaults to '1.0'.
+ * - encoding: (optional) The encoding of the document as part of the XML
+ *   declaration. Only used if method is 'import'. Defaults to 'UTF-8'.

Is this also only important when non_root is false? Can we combine this with html_template and remove some config options?

+++ b/src/Plugin/migrate/process/Dom.php
@@ -0,0 +1,232 @@
+ * - html_template: (optional) If non_root is true, then a complete HTML
+ *   document with the token '!value' to be replaced by the input text. You can
+ *   also use the token '!encoding'. Only used if method is 'import'. The
+ *   default has a DOCTYPE line, an <html> element, a <head> element, and a
+ *   <body> element containing the '!value' token.

This is a complicated concept and I would like if we could make this easier to understand. Are these replacement values to make parsing the dom easier? Is there a more intuitive name for the config option than html_template? Like string_replacements, or something else?

Comment #28

benjifisher

he/him or they/them

English

Boston area

CreditAttribution: benjifisher at Isovera for Pegasystems commented 5 February 2019 at 22:39

More later, but for now:

In #15.3, you asked for the markup to be configurable. I realized that this was complicated, so I included the second example (specifying all optional parameters as their default values):

 *       html_template: |
 *         <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
 *         <html xmlns="http://www.w3.org/1999/xhtml">
 *         <head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head>
 *         <body>!value</body>
 *         </html>

Comment #29

benjifisher

he/him or they/them

English

Boston area

CreditAttribution: benjifisher at Isovera for Pegasystems commented 6 February 2019 at 21:07

File	Size
interdiff-2958281-26-29.txt	3.78 KB
2958281-29.patch	10.93 KB

#27.1: Done. I made a similar change to the name of the corresponding class property.

If I were writing this from scratch, then I might not bother with the two class properties. I have kept them from the original patch.

#27.2: No, the version and encoding are also passed directly to the DOMDocument constructor:

    $document = new \DOMDocument($this->configuration['version'], $this->configuration['encoding']);

#27.3: I would rather remove this optional parameter than rename it. It is a wrapper for the supplied HTML fragment, so I do not think that string_replacements is a descriptive name.

Besides what I mentioned in #28, note that the import() and export() methods are basically wrappers for Html::load() and Html::serialize(), at least in the usual case where non_root is true (default). The export() method explicitly calls Html::serialize(). The only reason import() re-implements Html::load() instead of calling it is that we want to provide an option to log messages. (@marvil07 gets all the credit for that idea. We find it very helpful in debugging migrations.)

Maybe I should have pushed back on the suggestions from #15.2 and #15.3. We are providing options that may not be very useful and certainly are not provided by Html::load().

Comment #30

heddn

English

Nicaragua

CreditAttribution: heddn at MTech, LLC commented 14 February 2019 at 16:17

Status:

Needs review

» Needs work

+++ b/src/Plugin/migrate/process/Dom.php
@@ -0,0 +1,234 @@
+ * - html_template: (optional) If non_root is true, then a complete HTML

I like you reasoning for removing this. Let's do that. It is complicated enough and if someone does need to adjust it, then can always extend this process plugin and make their own changes. Let's move some of these comments down into code where we are processing the html template and remove this config option.

Thanks for your patience.

Comment #31

marvil07 CreditAttribution: marvil07 as a volunteer commented 15 February 2019 at 21:34

Status:

Needs work

» Needs review

File	Size
2958281-31.patch	10.5 KB
interdiff.txt	3.92 KB

14 files were hidden/shown/deleted

File	Size
2958281-10.patch	7.9 KB
interdiff.txt	2.74 KB
interdiff-2958281-10-16.txt	7.32 KB
2958281-16.patch	11 KB
interdiff-2958281-16-18.txt	1.65 KB
2958281-18.patch	10.86 KB
interdiff-2958281-18-21.txt	2.46 KB
2958281-21.patch	10.93 KB
interdiff-2958281-21-23.txt	925 bytes
2958281-23.patch	10.99 KB
interdiff-2958281-23-26.txt	1.92 KB
2958281-26.patch	11.02 KB
interdiff-2958281-26-29.txt	3.78 KB
2958281-29.patch	10.93 KB

@heddn, thanks for the reviews!
@benjifisher, thanks for all your iterations here!

I have only made a couple of changes to the code:

non_root parameter is used in both import and export methods, so I moved it up on documentation.
Move partial to full html building to a separate method. In this way, we both remove the parameter that may be a bit confusing, and we make it a bit easier to override the process plugin on a child class isolating the non-root partial to full html conversion in one new method.

Comment #32

heddn

English

Nicaragua

CreditAttribution: heddn at MTech, LLC commented 18 February 2019 at 14:33

Status:

Needs review

» Reviewed & tested by the community

Looks good now.

Comment #33

18 February 2019 at 14:34

heddn committed abe81dd on 8.x-4.x authored by marvil07

Issue #2958281 by benjifisher, marvil07, heddn: Allow manipulating html...

Comment #34

heddn

English

Nicaragua

CreditAttribution: heddn at MTech, LLC commented 18 February 2019 at 14:35

Status:

Reviewed & tested by the community

» Fixed

Thanks for all your contributions here.

Comment #35

4 March 2019 at 14:39

Status:

Fixed

» Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

Allow manipulating html from process plugins

Problem/Motivation

Proposed resolution

Remaining tasks

User interface changes

API changes

Data model changes

Comments

Change records for this issue

Parent issue

Referenced by