We are using this in Demo Framework with the Scenario module, and it's great! However, we often have to install quite a few modules and large amounts of content that risk timing out. It would be great if we could "throttle" this using the Batch API.
Is it possible to integrate Batch API to optionally allow us to import content?
@rlnorthcutt
________________________________________________________________________________________________________________________________
We have a requirement to provide content deployments across environments. We looked into other options, like content_deploy, workspaces (core), but we determined default_content_deploy was the best option to use with minimal changes needed to support our needs.
There are a few issues with the current DCD implementation with this in mind:
- It doesn't use batch API, so it limits the amount of data that can be exported/imported
- It works well for entity reference fields, but not for exporting entity references in text fields (linkit, entity_embed, media, etc...)
- It doesn't give fine-grained control over which type of entities to export as dependencies.
- Errors when exporting entities that do not support UUIDs
With these issues solved, DCD would be perfectly suited to support content deployment setups.
@smulvih2
| Comment | File | Size | Author |
|---|---|---|---|
| #48 | 3102222_rewrite_3.patch | 71.9 KB | mkalkbrenner |
| #47 | 3102222_rewrite_2.patch | 71.91 KB | mkalkbrenner |
| #46 | 3102222_rewrite.patch | 71.91 KB | mkalkbrenner |
| #41 | dcd-batch-3102222-41.patch | 62.44 KB | smulvih2 |
Comments
Comment #2
ivnish: I think it is possible
Comment #3
mkalkbrenner: This task is a bit more difficult. If you import an entire content export, we need to iterate over all files up to three times to retain all entity relations based on UUIDs, because the entities get new IDs during import. This needs to be taken into account in the batches, and we need to transport the in-memory ID mapping array from batch to batch.
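In Drupal's Batch API, state like this ID mapping can be carried between operations in `$context['results']`, which persists across batch requests. A minimal sketch (function names are illustrative, not the module's actual API):

```php
<?php

/**
 * Batch operation: import one exported file, carrying the old ID -> new ID
 * mapping from operation to operation in $context['results'].
 */
function dcd_example_import_file(string $path, array &$context): void {
  // $context['results'] is persisted by the Batch API between operations.
  if (!isset($context['results']['id_map'])) {
    $context['results']['id_map'] = [];
  }

  $data = json_decode(file_get_contents($path), TRUE);

  // Hypothetical import step returning the newly assigned entity ID.
  $new_id = dcd_example_do_import($data, $context['results']['id_map']);

  // Remember the mapping so later operations can rewrite references.
  $context['results']['id_map'][$data['old_id']] = $new_id;
}
```

The key point is that nothing except `$context` survives between batch requests, so any mapping built in memory during one operation must be written there (or to persistent storage) before the operation returns.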
Comment #4
mkalkbrenner
Comment #5
smulvih2: I need batch for export/import as well. I have a site with 2k nodes, and I am not able to export through the UI or using Drush; I either get timeouts or memory issues. I still have 2 large migrations for this site and will have >10k nodes by the time it's ready for hand-off.
I wrote a simple drush command that batches the export, given a content_type and bundle. I am able to use this to export all 2k nodes: https://gist.github.com/smulvih2/c3a406ecca47bf344fd2b36804c7d927
Batching the import is not as straightforward. My workaround for now is to use drush dcdi with some additions to settings.php. With this I am able to import all 2k nodes. It would be nice to have progress updates during the import process so you can gauge time to completion.
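The actual command lives in the gist linked above; as an illustration of the chunking idea only (all function names here are hypothetical), a drush callback can page through entity IDs and reset the static entity cache between chunks to keep memory flat:

```php
<?php

use Drupal\node\Entity\Node;

/**
 * Sketch of a chunked export loop for a drush command callback.
 * Exports nodes of one bundle in pages of 50 to keep memory bounded.
 */
function dcd_example_export_bundle(string $bundle): void {
  $ids = \Drupal::entityQuery('node')
    ->condition('type', $bundle)
    ->accessCheck(FALSE)
    ->execute();

  $storage = \Drupal::entityTypeManager()->getStorage('node');

  foreach (array_chunk($ids, 50, TRUE) as $chunk) {
    foreach (Node::loadMultiple($chunk) as $node) {
      // Hypothetical per-entity export call.
      dcd_example_export_entity($node);
    }
    // Drop loaded entities from the static cache so memory stays flat.
    $storage->resetCache($chunk);
  }
}
```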
Comment #6
smulvih2: I had some time on Friday to look into this and was able to get import working with batch through both drush and the UI. It still needs more testing, and export needs batch support as well. Posting my patch here to capture my progress; I will continue working on this when I have spare time.
Comment #7
smulvih2: Just tested #6 on a migration I am running, to see how the new batch process handles a real-life scenario. It seems to work as expected.
Here is the contents of my export:
I am calling the new importBatch() method programmatically on a custom form submit handler, like this:
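A minimal sketch of such a submit handler, assuming the patched importer is exposed as the default_content_deploy.importer service and that importBatch() queues its own batch via batch_set() (both assumptions from this thread, not verified against the patch):

```php
<?php

use Drupal\Core\Form\FormStateInterface;

/**
 * Form submit handler sketch: kick off the patched batch import.
 */
function dcd_example_form_submit(array &$form, FormStateInterface $form_state): void {
  /** @var object $importer Hypothetical importer service from the patch. */
  $importer = \Drupal::service('default_content_deploy.importer');

  // importBatch() is the method added by the patch in #6; it is assumed
  // to build and enqueue the batch itself, so the form system then runs
  // it and renders the progress bar.
  $importer->importBatch();
}
```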
This gives me the batch progress bar and correctly shows 180 items being processed. After the import, I get all 90 nodes with translations. The term reference fields all work as expected. Even the links to other imported nodes within the body field work as expected.
Here are the patches I have in my project:
Comment #8
smulvih2: After testing the batch import for DCD a bit more, I think we need to add a batch process for the prepareForImport() method as well. With a lot of JSON files in the export directory, decoding all of these files can cause timeouts.
Edit: This might have been an issue with my docker container. I am not running into it today after restarting all containers.
Comment #9
smulvih2: With patch #6 I was getting warnings in the dblog, see below. The new patch fixes this, and batch import now works without any dblog errors/warnings.
Comment #10
mkalkbrenner: Using batches would be a good improvement, but we need to review the patch carefully.
The current process runs three times across the content to deal with things like path aliases, so the context that needs to be shared between the batches is important.
Comment #11
smulvih2: Agreed, although so far it seems to be working well for things like entity references and links to other nodes. I will make sure to test this with path aliases as you suggest. I also need to implement batch for export, since currently I am using a custom drush command to get past the export limitation.
Comment #12
smulvih2: Got batch export working for all three modes (default, reference, all). Adding the patch here to capture the changes, but I will push them to a PR to make review easier.
Remaining tasks:
- exportBatchDefault and exportBatchWithReferences: could extract these into new method(s).
- Test export/import with complex data (complete).
Comment #13
smulvih2: As I suspected in #8, the prepareForImport() method causes timeouts on large data sets due to the amount of processing that occurs per JSON file. The new patch moves prepareForImport() into its own batch process, which then passes the data to the existing batch process that imports the entities. Tested this on a large data set; it works well.
Comment #14
mkalkbrenner: Thanks for the patch, I'll review it ...
Comment #15
smulvih2: Removed a line used for testing.
Comment #17
smulvih2: Created a PR to make the changes easier to review, in line with #15.
Comment #18
smulvih2: There is one issue with my last patch: passing data from the first batch process, prepareForImport(), to importBatch(). I will need to figure out a different solution before this is ready for review.
Comment #19
smulvih2: OK, got the import() process working with batch; it's solid now. Updated patch attached; I will update the PR shortly and add comments.
Comment #20
smulvih2
Comment #21
smulvih2: This patch needs a re-roll after recent changes pushed to the 2.0.x-dev branch. Will work on this over the next few days.
Comment #22
smulvih2: Re-rolled patch to apply against the latest 2.0.x-dev branch.
Comment #23
mkalkbrenner
Comment #24
mkalkbrenner: I wanted to test the patch with a large amount of content locally, but the patch doesn't apply anymore.
Comment #25
smulvih2: Re-rolled patch against the latest 2.0.x-dev branch. Tested all 3 export modes through the UI as well as Drush. Also tested import through both the UI and Drush. Seems to work well. Made a slight adjustment to the --text_dependencies Drush flag so it takes the UI config value if not specified in the Drush command.
Comment #26
smulvih2: Updated the MR to align with the changes in the latest patch (#25) - https://git.drupalcode.org/project/default_content_deploy/-/merge_requests/7/diffs
Comment #27
mkalkbrenner
Comment #28
smulvih2: After updating my containers to PHP 8.3 (from PHP 8.1), I am getting this error when importing content using this patch:
Adding this to the top of the Importer class fixes the deprecation error:
The new patch is tested on PHP 8.3 and fixes the issue.
Comment #29
smulvih2: I'm running a large export of >16k nodes, with references, for a total of >118k entities. Reviewing patch #28, I found a redundant call that loaded the entity a second time in the exportBatchDefault() method. The new patch removes this second call and hopefully will speed things up a bit.
Comment #30
mkalkbrenner: I can just repeat what I commented on the MR.
The current patch removed an essential feature:
All entities with entity references will be imported two times to ensure that all entity references are present and valid. Path aliases will be imported last to have a chance to rewrite them to the new ids of newly created entities.
This strategy is essential to correct references via IDs (not everything works with UUIDs yet).
Especially path aliases are special and break using the proposed patch.
The patch only works for the content in total. But exporting from site A and importing into site B breaks, as the ID collisions in references aren't corrected anymore.
Comment #31
smulvih2: @mkalkbrenner, yes I can reproduce the issue with path_alias entities. I was not running into this before, as path aliases are not exported with references since the relationship runs path_alias -> node instead of node -> path_alias. Now that I specifically export path_alias entities as well, I can reproduce it. I will make sure to account for this in the batch import in subsequent patches, but for now I am looking at how to make the export/import process scalable.
I was able to successfully export all 118,000 entities with the batch export, but I am having issues with the import at this scale. The issue is how I pass the $file to each batch operation, which is the contents of the JSON file in question. Passing the file to the batch operation means its contents are written to the queue table in the database. Of the 118k entities, 50k are serialized files like images, PDF files, etc. Writing the actual file contents to the database exploded the database size, and the import would time out before even starting the batch operation, or run out of disk space.
I will need to rewrite the importer class for this to work: instead of passing the file contents to the batch operations, I will just pass a pointer to the file in the filesystem. Each batch operation can then load the file from the pointer. I will also combine the decodeFile() and importEntity() batch operations into one method, reducing the number of batches by 50% (currently 2 per JSON file). If I ensure path_alias entities are imported last, then I can probably get them working by just swapping the entity IDs.
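A sketch of the pointer-passing idea: the batch operation receives only a path, so batch_set() stores short strings in the queue table instead of serialized file contents (function and directory names are illustrative, not the patch's actual code):

```php
<?php

/**
 * Batch operation sketch: receive only a file path, then load and decode
 * inside the operation, so the queue table never stores file contents.
 */
function dcd_example_import_by_pointer(string $path, array &$context): void {
  // Loading happens here, at run time, not at batch_set() time.
  $decoded = json_decode(file_get_contents($path), TRUE);

  // Hypothetical combined decode + import step (formerly two operations).
  dcd_example_do_import($decoded, $context);
}

// Building the batch: only short path strings go into the queue table.
$batch = ['title' => t('Importing content'), 'operations' => []];
foreach (glob('path/to/export/*.json') as $file) {
  $batch['operations'][] = ['dcd_example_import_by_pointer', [$file]];
}
batch_set($batch);
```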
See my comments from Slack below, for the record (2024-04-16):
Comment #32
mkalkbrenner: A quick explanation of the old algorithm that fixes references via ID:
First run:
Import new and updated entities, but skip path aliases. Store old ID and new ID of newly created entities in a mapping array. Store a list of newly created entities that reference other entities via IDs in another array (NEW).
Second run:
Run updates on all entities stored in array (NEW) and correct the reference IDs according to the mapping array. Still skip the path aliases.
Third run:
Import path aliases. In case of new aliases, adjust the referenced entity IDs according to the mapping array.
I think that this algorithm could be kept. The first batch has to create a second batch of newly created entities that reference other entities via IDs and skip path aliases. For path aliases it has to create the third batch.
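The three runs can be illustrated with plain arrays standing in for entities (a toy model, not DCD code):

```php
<?php

// Exported "entities": references use the *old* IDs from the source site.
$exported = [
  ['old_id' => 10, 'type' => 'node', 'ref' => 11],
  ['old_id' => 11, 'type' => 'node', 'ref' => NULL],
  ['old_id' => 12, 'type' => 'path_alias', 'ref' => 10],
];

$idMap = [];     // old ID -> new ID
$needFixup = []; // newly created entities whose refs still hold old IDs
$imported = [];  // destination "database", keyed by new ID
$nextId = 100;   // destination site hands out fresh IDs

// First run: import everything except path aliases, record the mapping.
foreach ($exported as $e) {
  if ($e['type'] === 'path_alias') {
    continue;
  }
  $newId = $nextId++;
  $idMap[$e['old_id']] = $newId;
  $imported[$newId] = $e;
  if ($e['ref'] !== NULL) {
    $needFixup[] = $newId;
  }
}

// Second run: rewrite references on the recorded entities via the map.
foreach ($needFixup as $newId) {
  $imported[$newId]['ref'] = $idMap[$imported[$newId]['ref']];
}

// Third run: import path aliases, pointing them at the new IDs.
foreach ($exported as $e) {
  if ($e['type'] === 'path_alias') {
    $e['ref'] = $idMap[$e['ref']];
    $imported[$nextId++] = $e;
  }
}
```

In the batch version, each run becomes its own set of operations, and $idMap and the fix-up list travel between them in the batch context.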
Comment #33
smulvih2: @mkalkbrenner thanks for the info! I will refer to this when fixing path aliases.
So I found the major source of my database exploding on import. DCD does a filesystem scan and stores a pointer to each file in the Importer object. This object is then added to each batch item in the database. This means each batch item in the database would have all 118k pointers. I was able to use KeyValueStorage to store this data outside of the object, and now the queue entries look reasonable for each batch item:
With this new patch, I combined both batch operation callbacks into one callback, so we have 50% less batch operations with this method.
Rough numbers, if I have 118,000 entities to import, then all 118,000 entities would be referenced in each of the 118,000 batch operations. Since we had 2 operations per batch that would be x2. So 118k x 118k x2 = 27.87 billion entries. This is one such entry:
If we say this string is 92 bytes, then this would be approx. 2.3 TB of storage required for this.
Now we only have 118,000 batch operations (not times two), and the total data that is stored per operation in the DB is about 1234 bytes. So 1234 bytes times 118,000 is approx. 145MB.
So the import of 118k entities would increase the DB size by about 145MB instead of 2+ TB. This should also significantly reduce the time it takes to write the batch operations to the database when batch_set() is called (before the progress bar is shown).
Uploading patch here and will test the import against the 118k entities and report back.
Comment #34
mkalkbrenner: If we (optionally) use igbinary to serialize this array, I expect that we'll save 80% of this memory.
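For reference, the igbinary extension provides igbinary_serialize()/igbinary_unserialize() as drop-in replacements for serialize()/unserialize(); a quick way to measure the saving on a large mapping array (requires the igbinary extension; the exact saving depends on the data):

```php
<?php

// Compare PHP's native serializer with igbinary on a large mapping array.
$map = [];
for ($i = 0; $i < 100000; $i++) {
  $map["uuid-$i"] = $i;
}

$native = strlen(serialize($map));

if (extension_loaded('igbinary')) {
  $compact = strlen(igbinary_serialize($map));
  printf("native: %d bytes, igbinary: %d bytes (%.0f%% saved)\n",
    $native, $compact, 100 * (1 - $compact / $native));
}
```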
Comment #35
smulvih2: Update: With the latest patch (#33) I was able to trigger an import of 118k entities through the UI, and within 10 seconds it started processing the entities, where before it would take too long and time out. I am now importing the entities using nohup and drush. I definitely need to think of a few ways to optimize the export/import process; I will take a look at igbinary for this!
Comment #36
smulvih2: Coming back to the export: I originally exported 118k entities using the "All content" option, which worked perfectly. The "Content of the entity type" option also worked well. I am now testing "Content with references" and running into a problem. When exporting with references, the queue database table gets populated as expected, but the first batch operation spins until it times out. The culprit is getEntityReferencesRecursive(): it follows references in the body field and spiders out to include hundreds of nodes.
This is where the patch in #3357503 comes in handy, since it lets me exclude nodes from reference calculations. For example, say I want to export all nodes of type page, with references. If nodes are part of the reference calculation, the first batch operation can be massive and time out. With nodes excluded I still get all pages, but they are spread across all batch operations rather than packed into one.
I have updated the patch to include #3357503, and also apply this filter directly in getEntityProcessedTextDependencies() to avoid loading entities that are later filtered out anyway.
Also need #3435979 included to make sure media items on French translations are included in the export.
Comment #37
smulvih2: OK, another major improvement to "export with references". I ran into a node that has 208 entity references (lots of media and files). This was causing memory issues: usage was hitting 800MB and then the process would die. This is because of the current approach to exporting with references.
The current approach is to get all referenced entities recursively into an array, then loop over that array and serialize into another array, then loop over the second array and write the JSON files. With entities that have large amounts of references this starts to kill the memory.
My new approach is to get all referenced entities recursively into an array, then loop over that array to serialize and write to the file system at the same time. This means there is no second large array storing all entities and their serialized content. I tested this against the node with 804 references and it works great, memory usage doesn't go over 50MB. This node even has a PDF file of 28MB being serialized with better_normalizers enabled.
The patch attached provides this update to the exporter class. It removes a method, and simplifies the code, which is always nice :)
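A sketch of the streaming approach described above, assuming a HAL serializer like the one the module uses (method and function names are illustrative, not the patch's actual code):

```php
<?php

/**
 * Streaming export sketch: serialize and write each referenced entity
 * immediately instead of accumulating a second array of serialized blobs,
 * so peak memory stays around one entity at a time.
 */
function dcd_example_export_with_references($entity, $serializer, string $dir): void {
  // Hypothetical recursive reference collector from the exporter.
  $referenced = dcd_example_get_references_recursive($entity);
  $referenced[] = $entity;

  foreach ($referenced as $ref) {
    // Serialize one entity at a time...
    $json = $serializer->serialize($ref, 'hal_json');
    // ...and write it straight to disk rather than into a second array.
    file_put_contents($dir . '/' . $ref->uuid() . '.json', $json);
    unset($json);
  }
}
```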
Comment #38
smulvih2: I've made some major improvements to the importer class. Using a dataset of 2k nodes (12k+ entities), it was initially taking me 80 min to import the full 12k entities, and 60 min to import them when all entities had already been imported (skipped). These numbers were even larger once I hooked up the methods for internal link replacement, updateInternalLinks() and updateTargetRevisionId(). I did a full review of the importer class to fix all performance-related issues. Please find below a list of improvements:
- Reduced the data written to the queue table for each batch operation. I originally passed data like processed UUIDs and entity_id => UUID mappings through the class itself, which delayed the start of the batch process and made the database grow significantly. Then this information was passed using KeyValueStorage on each batch operation. The final solution now uses $context to pass this data to each subsequent batch operation, significantly speeding up the import process.
- An option to skip the updateInternalLinks() method. It loops over each JSON file to look for uri field names and does a str_starts_with to see if internal: or entity: exist in the JSON file. This is not needed on newer sites, since links now use UUIDs and entities can be embedded with `<drupal-entity>` elements. Disabling this option sped up the import process.
- Removed a duplicate call to $this->serializer->decode($this->exporter->getSerializedContent($entity), 'hal_json'). It was first called in importEntity() to determine if there is a diff; then, if there was a diff, the preAddToImport() method performed the comparison again. With this new patch the comparison is only done once, which significantly improved performance.
I also made changes to support path_alias entities. To accommodate this without needing to loop over the entities multiple times, I ensure path_alias entities are processed last, so that the entities referenced in the path field already exist and their entity IDs are available in $context.
I did a complete review of the importer class and fixed things like doc comments, inline comments, and general coding practices.
Now an import of all 12k+ entities takes 28min and an import where it skips all entities takes 4min.
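One way to sketch the "path aliases last" ordering plus the $context-based hand-off described above (the file layout and function names are assumptions, not the patch's actual code):

```php
<?php

// Order the import operations so path_alias files run last, and share the
// old-ID -> new-ID map between operations via $context['results'].
$files = glob('path/to/export/*/*.json');

usort($files, function (string $a, string $b): int {
  // Files under the path_alias directory sort after everything else.
  return (int) str_contains($a, 'path_alias') <=> (int) str_contains($b, 'path_alias');
});

$batch = ['title' => t('Importing content'), 'operations' => []];
foreach ($files as $file) {
  // The operation reads/writes $context['results']['id_map'], so path
  // aliases imported last can be rewritten to the new entity IDs.
  $batch['operations'][] = ['dcd_example_import_file', [$file]];
}
batch_set($batch);
```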
Comment #39
smulvih2: I added output for the exporter class, similar to the importer class: it reports how many entities of each type were exported. While doing this, I found a performance issue with export with references. I had 229 taxonomy terms in total, but the count at the end of the export process was showing thousands. To fix this, I added the already-processed entities to the batch $context, so I can check whether an entity has already been exported and skip it. This prevented thousands of writes to already-written files. Also added batch output for export, so when running in Drush you get an idea of what is happening, same as the importer. Also removed the account switcher from import/export since this is not needed with Drush. Updated patch attached.
Comment #40
smulvih2: I created the related issue for file_entity. Batch export would fail if a file entity references an image/file that was removed from the file system. I was seeing this on a full site export with old test data. Suggest updating file_entity to the latest 2.0-rc6 to avoid this issue.
Comment #41
smulvih2: Added error handling for when export with references is used and the selected options produce no results.
Comment #42
mkalkbrenner: I spent a lot of time on an in-depth review. Some features, like "export changes since", are broken. I have started to rewrite the patch to fix these issues.
Comment #43
eiriksm: With the amount of code this touches, I feel it would first make sense to have the most basic test coverage of these forms in place, which can then be extended while reviewing and implementing the refactoring here.
This is why I opened #3458861: Create tests for import and export forms, which is now NR and has these very basic tests at least. A place to start?
Comment #44
smulvih2: @mkalkbrenner, that is one feature I didn't test (export changes since), so I'm glad you are doing an in-depth review of this! I have this patch on my project and it is working great for moving 10k+ entities from environment to environment, so for the main use case of just exporting/importing it works great!
@eiriksm thanks for adding these tests!!!
Comment #45
mkalkbrenner: Unfortunately I found various issues and can't agree that it is working great, at least not with the content I could use for testing. I had to restructure the code, and after two days of work the export now seems to work with the different modes using drush.
But it is still much slower than without batches.
Don't get me wrong, though: the existing patch is a great starting point.
I will take care of the import next week, and maybe we can talk about performance later, as soon as the content itself is correct.
Comment #46
mkalkbrenner: Here's a first draft of my rewrite.
Comment #47
mkalkbrenner: OK, a small mistake in the import. Here's a fix.
Comment #48
mkalkbrenner: Oh, I uploaded the previous patch again, now ...
Comment #52
mkalkbrenner