Problem/Motivation
During migrations, many times I found the need to manipulate in some way fields with html inside, e.g. the node body field.
Some manipulation examples are:
- Change the src attribute on images to point to a new place where new images are stored.
- Change a subset of link href's to new places, e.g. one node type is now a media bundle.
- Changing the way images are handled to use embed module
<drupal-entity>
html entities instead of<img>
. - Transformations to cleanup codebase, like #2957933: Clean Up HTML & improve accessibility during migration, but that would mean to move this issue into core instead.
Proposed resolution
Two ideas come to mind to solve this problem: (a) use regular expressions, and (b) use DOM* classes to manipulate.
Based on a past experience during #88183: Relative URLs in feeds should be converted to absolute ones, I think core is leading more towards (b).
For either case, but especially for (b) writing custom migrate process plugins sounds like the better alternative.
Then, any more specific logic can be added via new migrate process plugins.
The main idea here is then to create a dom
migrate process plugin that allows:
- to convert a html string into a
DOMDocument
- then other dom aware plugins can later manipulate that object,
- finally, convert a
DOMDocument
into a html string, for other normal process.
One example migrate process plugin using this is #2958285: Allow replacing based on a xpath expression.
Remaining tasks
User interface changes
N.A.
API changes
Not really an API change, but process plugins seem to assume that strings are passed along.
This new process plugin instead produces a DOMDocument
on import, but a normal string on export.
Data model changes
N.A.
Comment | File | Size | Author |
---|---|---|---|
#31 | interdiff.txt | 3.92 KB | marvil07 |
#31 | 2958281-31.patch | 10.5 KB | marvil07 |
Comments
Comment #2
marvil07 CreditAttribution: marvil07 at Isovera for Pegasystems commentedHere a
dom
migrate process plugin.Also, thanks to benjifisher for the original idea, helping to make this idea more concrete, and for some suggestions.
Comment #3
marvil07 CreditAttribution: marvil07 at Isovera for Pegasystems commentedComment #4
dwwNice! I've got a migration doing all sorts of stuff to a body field. It currently does a callback process to do all the magic directly in PHP in my custom migration module. It's conceivable it could be made to work with this approach, instead. Given my background, I tend to reach for the callback process plugin at the drop of a hat. ;) I'm much more comfortable doing something directly in PHP than trying to chain a bunch of other processes together via the yml glue. But I definitely see value in this.
However, I'm curious how this will work in practice. Namely, how will other process plugins that expect a DOM object as input, not a string, advertise/declare themselves as such? And, for the chaining to really work, we'd need a way for a given DOM-aware plugin to know if it's the last step in the chain and should spit back HTML, or if it's going to keep being used and should pass on the modified DOM.
A brief skim of the patch doesn't reveal an obvious answer. Apologies if I'm missing something.
Thanks!
-Derek
Comment #5
marvil07 CreditAttribution: marvil07 at Isovera for Pegasystems commentedHey dww!
How to expose it in the context of other plugins is something we may want to think about.
Maybe declaring a new interface for process plugins to implement could be a way to detect that in code, and then later try to intervene in process plugins to raise an exception if the last process plugin is not dom.
For now my approach has been a lot simpler: assume the migration yml author to actually know that dom import is required at the start, and dom export is required at the end.
A full example from the docblock in the patch on the related issue #2958285: Allow replacing based on a xpath expression:
Comment #6
dwwHi marvil07! It's been a while... ;)
Cool, thanks for clarifying. Makes sense. That'd work for me, but I presume this will need more fancy exception throwing in the general case before it could be officially adopted. Otherwise, I predict fairly interesting and useless error messages (if any) if someone got the details wrong. ;)
Comment #7
mikeryanJust an FYI, my goto for HTML parsing has been https://packagist.org/packages/querypath/querypath, it's especially good if you're dealing with old-school HTML (no </p>, etc.).
Comment #8
mikeryanA simplified example...
Comment #9
marvil07 CreditAttribution: marvil07 at Isovera for Pegasystems commented@mikeryan: yes, there are some HTML manipulation libraries that can be used. IMHO I would say we could start simple with php dom extension, and later we can decide which library to use, since this will become a new dependency.
Also, adding a parent ticket.
Comment #10
marvil07 CreditAttribution: marvil07 at Isovera for Pegasystems commentedInitial test coverage and fixing a small bug.
Comment #11
marvil07 CreditAttribution: marvil07 at Isovera for Pegasystems commentedRelated interdiff.
Comment #12
benjifisher@dww, thanks for your comments!
@marvil07 already answered this. I just want to point out that process plugins generally have no type hinting for the value they are given. It is typically a string or an array: often a nested array of strings. It is already the responsibility of the migration to chain process plugins together so that each provides the input expected from the next.
The import method already has
and the export method has
Can you give a more specific suggestion on how to improve these? Should we do something other than throw MigrateException?
Comment #13
benjifisherIf you need to do custom processing with the DOMDocument class, then you can write a custom process plugin instead of a custom callback, and let this plugin handle converting from string to object and back.
Our plan is to have some configurable process plugins that do common transformations, like updating
href
andsrc
attributes during migrations. @marvil07 has already added two other issues as children of the parent issue #2958642: DOM manipulation on process plugins, based on the code we are using on our current project.Comment #14
joey-santiago CreditAttribution: joey-santiago commentedHaving a similar problem and wrote a class extending ProcessPluginBase for this.
I am also using regex in order to "translate" some xml elements into others as i couldn't find a smart way of doing it with DOM classes in php.
Also, the problem i am having is: i have an array of URLs i use to do the migration from... how can i access the "current" URL the transform method is running at?
thanks,
Comment #15
heddnLet's throw a noisy exception if we don't pass a valid value for method or import or export.
Then let's do this. An optional params option is a great idea.
We should also make this non root markup as an optional argument passed to the constructor.
I don't think type hinting string and int is safe when using with php 5.6.
Let's beef up the test case to include the edge cases I mention in my review previously.
Comment #16
benjifisher@heddn, thanks for the detailed suggestions.
callback
plugin. Should we add a follow-up issue to do something similar in theskip_on_value
plugin?html_template
parameter) is complicated enough that I decided to add adefaultValues()
method and simplify the handling of these defaults. I also added a second example in the documentation block, showing all the defaults.Comment #18
benjifisherThis patch should fix the coding standards. I am also attaching an interdiff.
Comment #20
benjifisher@dww:
In #6, you requested better error handling. I asked for more information in #12, but you did not reply. Maybe you are satisfied with the explanations @marvil07 and I gave, or with the validation I added in #16?
@joey-santiago:
Your question seems off topic for this issue. Maybe I am missing the connection. The
transform()
method (or theimport()
andexport()
methods in this issue) receive a$row
parameter. Maybe the information you need is already there. If not, maybe you could add it inhook_migrate_prepare_row()
.
Comment #21
benjifisherThis should fix the test failures. (Make sure not to rely on optional parameters!) This change also gives migrations the option of specifying a custom
html_template
that (like the default) uses a token for the encoding.Comment #23
benjifisherI disagree with CodeSniffer about indentation. At the expense of adding a few lines, I can make a compromise that should satisfy us both.
Comment #24
benjifisherComment #25
heddnI don't think we need to assign to variable $document.
Nit: Can we test the message for the exception?
Comment #26
benjifisher@heddn:
Good catches. I also removed a
$value =
for the same reason as #25.1.Comment #27
heddnI'm doing a review of the doxygen now more thoroughly. Let's see if we can make the use of this process plugin a little easier.
Let's call this log_messages.
Is this also only important when non_root is false? Can we combine this with html_template and remove some config options?
This is a complicated concept and I would like if we could make this easier to understand. Are these replacement values to make parsing the dom easier? Is there a more intuitive name for the config option than html_template? Like string_replacements, or something else?
Comment #28
benjifisherMore later, but for now:
In #15.3, you asked for the markup to be configurable. I realized that this was complicated, so I included the second example (specifying all optional parameters as their default values):
Comment #29
benjifisher#27.1: Done. I made a similar change to the name of the corresponding class property.
If I were writing this from scratch, then I might not bother with the two class properties. I have kept them from the original patch.
#27.2: No, the version and encoding are also passed directly to the DOMDocument constructor:
#27.3: I would rather remove this optional parameter than rename it. It is a wrapper for the supplied HTML fragment, so I do not think that
string_replacements
is a descriptive name.Besides what I mentioned in #28, note that the
import()
andexport()
methods are basically wrappers forHtml::load()
andHtml::serialize()
, at least in the usual case wherenon_root
is true (default). Theexport()
method explicitly callsHtml::serialize()
. The only reasonimport()
re-implementsHtml::load()
instead of calling it is that we want to provide an option to log messages. (@marvil07 gets all the credit for that idea. We find it very helpful in debugging migrations.)Maybe I should have pushed back on the suggestions from #15.2 and #15.3. We are providing options that may not be very useful and certainly are not provided by
Html::load()
.Comment #30
heddnI like you reasoning for removing this. Let's do that. It is complicated enough and if someone does need to adjust it, then can always extend this process plugin and make their own changes. Let's move some of these comments down into code where we are processing the html template and remove this config option.
Thanks for your patience.
Comment #31
marvil07 CreditAttribution: marvil07 as a volunteer commented@heddn, thanks for the reviews!
@benjifisher, thanks for all your iterations here!
I have only made a couple of changes to the code:
Comment #32
heddnLooks good now.
Comment #34
heddnThanks for all your contributions here.