Follow-up to #2623012: Implement interfaces and base classes for URL-based sources

The stand-alone JSON plugin had some support for following links from one request to the next where a feed is paginated (remnants still exist in comments). Get this working in general in the fetcher plugins.

Remaining tasks

  • Get tests passing.
  • Add documentation.
CommentFileSizeAuthor
#137 migrate_plus-support_paging-2640516-137.diff33.48 KBnadavoid
#130 migrate_plus-support-paging-2640516-130.patch17.2 KBjacobbell84
#125 migrate_plus-support-paging-2640516-125.patch17.15 KBjacobbell84
#124 interdiff-123-124.txt1021 byteskekkis
#124 migrate_plus-support-paging-2640516-124.patch16.97 KBkekkis
#123 migrate_plus-support-paging-2640516-123.patch16.8 KBchuyenlv
#121 interdiff_119_121.txt3.86 KBjcandan
#121 migrate_plus-support-paging-2640516-121.patch16.81 KBjcandan
#119 migrate_plus-pager_selector_error-2640516-119.patch15.6 KBkekkis
#119 interdiff-118-119.txt1.01 KBkekkis
#118 migrate_plus-pager_selector_error-2640516-118.patch15.39 KBslayne40
#114 migrate_plus-pager_selector_error-2640516.patch15.33 KBquadbyte
#112 migrate_plus-pager_selector_error-2640516-112.patch15.3 KBquadbyte
#109 migrate_plus-pager_selector_error-2640516-108---altered-on-line-202.patch15.08 KBauth
#107 migrate_plus-pager_selector_error-2640516-107.patch15.1 KBmonymirza
#102 migrate_plus-pager_selector_error-2640516-102.patch19.35 KBrob230
#100 interdiff-2640516-98-100.txt1.78 KBrob230
#100 migrate_plus-pager_selector_error-2640516-100.patch0 bytesrob230
#98 migrate_plus-pager_selector_error-2640516-98.patch18.08 KBsleopold
#96 migrate_plus-pager_selector_error-2640516-96.patch17.59 KBKevinVanRansbeeck
#95 migrate_plus-pager_selector_error-2640516.patch912 bytespeter törnstrand
#94 support-paging-through-multiple-requests-2640516-94.patch19.18 KBhmdnawaz
#93 migrate_plus-support_paging-2640516-92.patch19.49 KBscott_euser
#93 interdiff-2640516-90-92.txt329 bytesscott_euser
#91 interdiff-2640516-88-90.txt438 bytesscott_euser
#91 migrate_plus-support_paging-2640516-90.patch19.49 KBscott_euser
#90 interdiff-2640516-88-90.txt438 bytesscott_euser
#88 interdiff-2640516-86-88.txt505 bytesscott_euser
#88 migrate_plus-support_paging-2640516-88.patch19.35 KBscott_euser
#88 2021-09-20_13-38.png6.9 KBscott_euser
#86 reroll_diff_83-86.txt2.77 KBronaldmulero
#86 migrate_plus-support_paging-2640516-86.patch19.36 KBronaldmulero
#83 reroll_diff_81-83.txt5.7 KBronaldmulero
#83 migrate_plus-support_paging-2640516-83.patch18.85 KBronaldmulero
#81 interdiff-81.txt607 bytesszloredan
#81 migrate_plus-support_paging-2640516-81.patch18.79 KBszloredan
#79 interdiff-78.txt694 bytessegi
#79 migrate_plus-support_paging-2640516-78.patch18.64 KBsegi
#76 migrate_plus-support_paging-2640516-76.patch18.53 KBruslan piskarov
#72 migrate_plus-support_paging-2640516-72.patch18.15 KBrobin.ingelbrecht
#70 migrate_plus-support_paging-2640516-70.patch18.15 KBprudloff
#69 interdiff-2640516.68.69.txt2.06 KBpixlkat
#69 migrate_plus-support_paging-2640516-69.patch17.76 KBpixlkat
#68 migrate_plus-support_paging-2640516-68.patch17.2 KBduaelfr
#68 interdiff-2640516.65.68.txt1.58 KBduaelfr
#65 migrate_plus-support_paging-2640516-65.patch16.87 KBduaelfr
#65 interdiff-2640516.63.65.txt2.17 KBduaelfr
#62 interdiff_60-62.txt782 bytesweekbeforenext
#60 migrate_plus-support_paging-2640516-60.patch15.91 KBm.lebedev
#60 interdiff_48-60.txt9.98 KBm.lebedev
#48 migrate_plus-support_paging-2640516-48.patch16.39 KBsdstyles
#48 interdiff_45_48.txt672 bytessdstyles
#45 migrate_plus-support_paging-2640516-45.patch17.32 KBjcandan
#45 interdiff_41_45.txt551 bytesjcandan
#41 interdiff_39_41.txt1.84 KBjcandan
#41 migrate_plus-support_paging-2640516-41.patch17.33 KBjcandan
#39 migrate_plus-support_paging-2640516-39.png386.28 KBjcandan
#39 interdiff_37_39.txt13.64 KBjcandan
#39 migrate_plus-support_paging-2640516-39.patch17.28 KBjcandan
#35 migrate_plus-support_paging-2640516-35.patch13.19 KBm.lebedev
#33 migrate_plus-support_paging-2640516-33.patch12.96 KBm.lebedev
#27 migrate_plus-support_paging-2640516-27.patch12.96 KBberenddeboer
#26 migrate_plus-support_paging-2640516-26.patch13.29 KBberenddeboer
#25 migrate_plus-test-case.patch4.1 KBberenddeboer
#23 migrate_plus-support_paging-2640516-23.patch9 KBmortona2k
#19 migrate_plus-support_paging-2640516-19.patch9.38 KBberenddeboer
#16 interdiff-2640516-12-16.txt752 bytesbadrange
#16 migrate_plus-support_paging-2640516-16.patch8.36 KBbadrange
#12 interdiff-2640516.txt8.44 KBdrclaw
#12 migrate_plus-support_paging-2640516-12.patch8.34 KBdrclaw
#5 support_paging-2640516-5.patch8.08 KBGrayside
#37 migrate_plus-support_paging-2640516-37.patch13.19 KBm.lebedev
#63 migrate_plus-support_paging-2640516-63.patch16.84 KBweekbeforenext
#63 interdiff_37-63.txt13.52 KBweekbeforenext
#62 migrate_plus-support_paging-2640516-62.patch16.44 KBweekbeforenext
#63 interdiff_62-63.txt12.04 KBweekbeforenext
Command icon Show commands

Start within a Git clone of the project using the version control instructions.

Or, if you do not have SSH keys set up on git.drupalcode.org:

Comments

mikeryan created an issue. See original summary.

mikeryan’s picture

Component: Source plugins » Plugins
Grayside’s picture

Assigned: Unassigned » Grayside
Grayside’s picture

Here's my rough game plan going into this. Note that I'm attempting generic approach but I am not going to work on XML/SOAP.

Configuration

  • I think paging should be opt-in
  • A selector to find the "next" link in the document body seems necessary. Raw JSON might use next, HAL would be _links/next/href.

Links Header

DataParserPluginBase::nextSource() should call $this->getNextLinks() and prepend the resulting URLs to $this->urls. getNextLinks() will be abstract.

Links Header: JSON

  • Json::openSourceUrls() can extract link headers from the data fetcher. Should data fetchers in general have a getStreamMetadata() method or should it check for the Http plugin specifically?
  • Paging being so intrinsic to fetching, it makes sense to move more of the Link header parsing into the data fetcher. No need for HTTP structure to leak into JSON parsing.

Links Header: XML (Blocked)

XMLReader is not currently using the HTTP data fetcher at all. Link Header support is dependent on switching to it. (E.g., download the XML doc via fetcher then open from temporary file location).

Links Header: SOAP (Blocked)

See XML.

Next Links

Next Links: JSON

  • Json::openSourceUrls calls getSourceData() to retrieve the data for import, but that might exclude data for the paging process.
  • Building on work done in #2608610-10: Add support for list/item pattern, we could inject a "pager selector" into getSourceData to extract the next link. Should we use static caching pattern to avoid repeatedly decoding?

Next Links: XML

TBD, but appears like it can be approximately similar to JSON in that we could parse out the next link based on a selector.

Next Links: SOAP

TBD, but appears like it can be approximately similar to JSON in that we could parse out the next link based on a selector.

Grayside’s picture

Assigned: Grayside » Unassigned
Status: Active » Needs review
StatusFileSize
new8.08 KB

Here we go. This patch will conflict with #2608610: Add support for list/item pattern as both needed some of the same changes to Json::getSourceData().

mikeryan’s picture

Version: 8.x-1.x-dev » 8.x-2.x-dev
mikeryan’s picture

Thanks for this patch - I'm going to need to think on this bit, just want to let you know I'm not ignoring it as I catch up on reviews...

Grayside’s picture

I suspect the current implementation of this or #2608610: Add support for list/item pattern is disrupting rollbacks. Got this second-hand so I do not have a lot of details, but likely related to item counts.

mikeryan’s picture

Status: Needs review » Needs work
+++ b/src/DataFetcherPluginInterface.php
@@ -46,4 +46,16 @@ interface DataFetcherPluginInterface {
+  public function getNextLinksFromMetadata($url);

Haven't gotten around to a thorough review and testing, but I'd like to see the interface be more general - getNextLinks(). The implementation could, based on configuration, derive the links from headers, selectors, or query strings (e.g., ?page=1)

Grayside’s picture

Sure. I named it getNextLinksFromMetadata() because to my mind, getNextLinks() sounded like specifically the Link HTTP header and metadata is more generic.

Would like to wait on more feedback before diving back in myself.

There is also this spurious warning in the logs under some configuration. Haven't traced which aspect of my usage is triggering it, as I have 2-3 variations of the pager configuration in place.

Notice: Undefined index: pager_selector in Drupal\migrate_plus\Plugin\migrate_plus\data_parser\Json->getNextLinks() (line 234 of /var/www/build/html/modules/contrib/migrate_plus/src/Plugin/migrate_plus/data_parser/Json.php).
hugovk’s picture

This would be a very useful addition. What's the current status? Thanks!

drclaw’s picture

Here's a re-rolled patch for 8.x-4.x from #5 with a few changes and a few additions:

Changes to original patch:

  • I named the next url getter methods getNextUrls which IMO is more technically correct. Also it matches the language we use throughout the fetcher/parser plugins.
  • It looked like original patch was making a second request to the active url to get pager data off the response. I don't think that Guzzle caches responses (at least not without a plugin maybe?) so I added some code to store the active url's source data so that we don't make extraneous requests.
  • Pager config is now nested under the 'pager' key instead of being sibling to it. This is to support different paging types. The paging type in the original patch I've named "urls" since it's pulling urls off of the response.

Example config for the "urls" paging type:

source:
  plugin: url
  data_fetcher_plugin: http
  data_parser_plugin: json
  pager:
    type: urls
    selector: "pager/next"

And the payload might look something like this

{
  "data": [
    {
      ... etc ...
    }
  ],
  "pager": {
    "next": "https://api.example.com/api/v1/collection?page=3",
    "prev": "https://api.example.com/api/v1/collection?page=1"
  }
}

Additions:

Added support for cursor-based paging

Some APIs (like the one I'm using) doesn't return full urls for the next and previous pages. Instead you get a cursor value that you need to insert into the url yourself. The twitter api docs explains this technique pretty well and has some examples https://dev.twitter.com/overview/api/cursoring.

Here's an example of the pager config for this:

source:
  plugin: url
  data_fetcher_plugin: http
  data_parser_plugin: json
  pager:
    type: cursor
    selector: "cursor/next"
    # The key will be used as the parameter name in the url
    key: cursor

And the payload might look like

{
  "data": [
    {
      ... etc ...
    }
  ],
  "cursor": {
    "next": 4,
    "prev": "3"
  }
}
drclaw’s picture

Version: 8.x-2.x-dev » 8.x-4.x-dev
Status: Needs work » Needs review

Ack, metadata :)

badrange’s picture

Happy to see that this patch has progressed! I just tried it out in our project and it seemed to do it's job - running for a long time - until it suddenly stopped with this error message:

Migration failed with source plugin exception: Error message: cURL error 3: <url> malformed (see http://curl.haxx.se/libcurl/c/libcurl-errors.html) at .  [error]
badrange’s picture

And it is not surprising that it fails; the last page in our datasource does return meta/next: null when there are no more items. I guess it would make sense for the pager to stop if the data source returns "null" or "false"?

https://api.hel.fi/linkedevents/v1/event/?end=2017-09-12&include=locatio...

badrange’s picture

A simple check to see if the url is empty might do the trick.

heddn’s picture

Status: Needs review » Needs work
Issue tags: +Needs tests

We could use some tests.

drclaw’s picture

@badrange I think you're right, a simple check for data should do it!

@heddn Yeah tests would be great. I'll try and get to those in the next couple weeks.

berenddeboer’s picture

Status: Needs work » Needs review
StatusFileSize
new9.38 KB

This patch really should be committed guys. Works great. Have extended it to handle the case where paging is done by page numbers. I.e. no previous/next, just ?page=1 kind of style.

Status: Needs review » Needs work

The last submitted patch, 19: migrate_plus-support_paging-2640516-19.patch, failed testing. View results

svendecabooter’s picture

@berenddeboer your patch does not apply, because it is not diffed from the module root directory, but rather from your Drupal installation root.

heddn’s picture

Issue tags: +Needs reroll

I agree this is pretty important. But with the ease of adding a unit or kernel test these days, I'm going to ask we do that. So leaving at NW. Plus there's the need for a re-roll per #21.

mortona2k’s picture

Rerolled the patch.

berenddeboer’s picture

Saying you need a test is easier said then done, definitely doesn't appear to be easy. I.e. it seems you may need an HTTP source for json.

I made a first attempt at such a test using a file uri, see attached file, and the error I get is:

../vendor/bin/phpunit --filter=MigrateHttpJsonCursoringTest
PHPUnit 4.8.36 by Sebastian Bergmann and contributors.

Testing 
F

Time: 9.09 seconds, Memory: 230.00MB

There was 1 failure:

1) Drupal\Tests\migrate_plus\Kernel\MigrateHttpJsonCursoringTest::testTableDestination
Migration failed with source plugin exception: The URI 'file:////home/berend/src/snl/nzad/modules/contrib/migrate_plus/tests/src/Kernel/cursoring.json' is invalid. You must use a valid URI scheme. Use base: for a path, e.g., to a Drupal file that needs the base path. Do not use this for internal paths controlled by Drupal.
Failed asserting that false is true.

/home/berend/src/snl/nzad/core/tests/Drupal/KernelTests/AssertLegacyTrait.php:23
/home/berend/src/snl/nzad/core/modules/migrate/tests/src/Kernel/MigrateTestBase.php:195
/home/berend/src/snl/nzad/core/modules/migrate/src/MigrateExecutable.php:192
/home/berend/src/snl/nzad/modules/contrib/migrate_plus/tests/src/Kernel/MigrateHttpJsonCursoringTest.php:103

FAILURES!
Tests: 1, Assertions: 1, Failures: 1.

Remaining deprecation notices (1)

Drupal\taxonomy\Tests\TaxonomyTestTrait is deprecated in Drupal 8.4.0 and will be removed before Drupal 9.0.0. Instead, use \Drupal\Tests\taxonomy\Functional\TaxonomyTestTrait: 1x
    1x in KernelTestSuite::suite from Drupal\Tests\TestSuites

How on earth do I fix that???

berenddeboer’s picture

StatusFileSize
new4.1 KB

Oops, here the files.

berenddeboer’s picture

StatusFileSize
new13.29 KB

OK, let's do this better. Here a full patch including the test. Test will fail. Have added two comments testing this against a local webserver, and it passes then.

berenddeboer’s picture

And another attempt to get a patch that can be applied.

berenddeboer’s picture

dweidner’s picture

What is holding this feature back? Looking for something similar to process a large HAL+JSON endpoint. I would like to help, even though I have no experience in contributing to drupal projects.

ressa’s picture

Status: Needs work » Needs review

I could also use this feature, maybe the patch just needs a review? I am changing status to Needs Review.

heddn’s picture

Status: Needs review » Needs work
Issue tags: -Needs tests, -Needs reroll

We now have some tests, but those are failing. And I don't see a lot of docs in this patch on how to configure a json migration with paging. Let's beef things up and get the failing test to pass.

ressa’s picture

I wonder if the patch will also add an option of paging through multiple files? This which works fine currently:

urls:
  - 'public://migrate_files/result-1-100.json'
  - 'public://migrate_files/result-101-200.json'
  - 'public://migrate_files/result-201-300.json'
  - 'public://migrate_files/result-301-400.json'
...

(I got the tip from Mike Ryan in From: Using Migrate API with a multi-page / paginated source.)

But if you have hundreds of files, something like this would be better:

urls: 'public://migrate_files/result-($1)-($2).json'
url-offsets: [1,100,4]

Meaning: Start from 1, get 100 records, repeat four times.

m.lebedev’s picture

Status: Needs work » Needs review
StatusFileSize
new12.96 KB

For test.

Status: Needs review » Needs work

The last submitted patch, 33: migrate_plus-support_paging-2640516-33.patch, failed testing. View results
- codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

m.lebedev’s picture

Status: Needs work » Needs review
StatusFileSize
new13.19 KB

Rerolled; changed a path of a test file.

Status: Needs review » Needs work

The last submitted patch, 35: migrate_plus-support_paging-2640516-35.patch, failed testing. View results
- codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

m.lebedev’s picture

Status: Needs work » Needs review
StatusFileSize
new13.19 KB

last try. I take an example from data_parser/JsonTest.php

Besides, I have an other problem
Notice: Undefined index: next in Drupal\migrate_plus\Plugin\migrate_plus\data_parser\Json->getSourceData()

    foreach ($selectors as $selector) {
      if (!empty($selector)) {
        $return = $return[$selector]; // <-- Notice.
      }
    }

I get a data from a resource, which may don't have a next page on last page.

Status: Needs review » Needs work

The last submitted patch, 37: migrate_plus-support_paging-2640516-37.patch, failed testing. View results
- codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

jcandan’s picture

Fixed the failing test, changed names to match test performed, and added appropriate doc-blocks.

Added a new pager type: paginator. This pager type allows one to deal with a paginated data where the next urls cannot be derived from the content. The example configuration below demonstrates how the page_key value would iteratively increase appropriately based on the configured paginator_type (e.g. check api.example.com/v1/?offset=100&limit50).

source:
  . . .
  urls: https://api.example.com/v1?format=json
  pager:
    type: paginator
    paginator_type: starting_item  # page_number [default]
    default_num_items: 50  # [required] 
    page_key: offset  # page [default]
    size_key: limit  # pagesize [default]

Options:

  • paginator_type Use "page_number" when the pager uses page numbers to determine the item to start at, use "starting_item" when the pager uses the item number to start at.
  • default_num_itemsThe first call is made without size specified; one must therefore define the default size. There is a @todo noted to address this.
  • page_key The url parameter key that should be used for incrementing pages
  • size_key The url parameter key that specifies the number of items per page

As I wrote this description, I realized that unless we do add the ability to specify the size on the first run, the default_num_size is unnecessary. I may address this in a follow-up patch.

Need help with TESTS
Though I got the tests passing, I do not feel that the test, as supplied by previous patches, tests the available pager options adequately. I think it would be best to inject a mock http client into the parser and iterate over mock endpoints, as this issue addresses. I made a few attempts, but could not figure out how to accomplish this.

This pager type addition was inspired by External Entities, and met my needs perfectly.

External Entities Pager Settings form

jcandan’s picture

Additionally, I removed this bit of code which allows one to configure the paginator size type. This was also inspired by the External Entities UI (see image above in #39). However, I had a hard time understanding its use-case. If someone feels it necessary to add to the patch, feel free.

  // Use "num_items_per_page" when the pager uses this parameter
  // to determine the amount of items on each page, use "ending_item"
  // when the pager uses this parameter to determine the number of the
  // last item on the page.
  // @todo Handle ending_item pagination size type
  // Set default paginator size type.
  $paginator_size_type_options = ['num_items_per_page', 'ending_item'];
  $paginator_size_type = $paginator_size_type_options[0];
  // Check configured size type.
  if (!empty($this->configuration['pager']['size_type'])) {
    if (!in_array($this->configuration['pager']['size_type'], $paginator_size_type_options)) {
      // Not set to one of the two available options.
      throw new MigrateException(
        'Pager "size_type" must be configured as either "num_items_per_page" or "ending_item" ("num_items_per_page" is default).'
      );
    }
    $paginator_size_type = $this->configuration['pager']['size_type'];
  }
  if ($paginator_size_type === 'ending_item') {
    $next_end = $next_start + $next_end;
  }
jcandan’s picture

Re-rolled patch #39. Fixed coding standards in patched files.

achap’s picture

Hi @jcandan. Thanks for the patch. Is it possible to elaborate on this config key? Maybe an example. It's not really clear to me. "paginator_type Use "page_number" when the pager uses page numbers to determine the item to start at, use "starting_item" when the pager uses the item number to start at."

Otherwise, it is working.

jcandan’s picture

@achap, see the url parameters in the example given in #39. There, the parameter keys are limit and offset. Now, think in terms of how that is paginated: in this case, the offset tells the API endpoint how many items to skip, which is the same as listing the starting item (e.g. offset=0 would show items 0-10, offset=11 would show items 11-20, etc.). In another example, the parameter keys might be p for page, and ps for page size. The pagination increment is representative of pages of data (p=1, p=2, etc.), in which case paginator_type would be set to page. Examples of both of these are https://librivox.org/api/info and http://rijksmuseum.github.io/ respectively. Hope this helps clarify.

achap’s picture

@jcandan yes makes perfect sense. Thank you!

jcandan’s picture

StatusFileSize
new551 bytes
new17.32 KB

Re-rolled patch #41 (a re-roll of #39). Fixed one last missed coding standards item.

jcandan’s picture

jcandan’s picture

sdstyles’s picture

StatusFileSize
new672 bytes
new16.39 KB

Use data fetcher plugin to check if URL is valid, this will work also with URLs which require authorization.

sdstyles’s picture

Status: Needs work » Needs review
audriusb’s picture

I am having an issue with latest patches. They do not return correct data.
see this call:
$data = $this->getSourceData($url, $this->configuration['pager']['selector']);
it has 2 paramaters but the function is
protected function getSourceData($url)
so the second parameter is completely ignored.

the latest working patch with 2nd parameter is #37

nevermind, missed the #39 with instructions. Do initial call require to be without page limit? what if it returns whole set then? I am going back to #37 for now

m.lebedev’s picture

Import does not work starting from #39

source:
pager:
type: urls
selector: "next"

In function getNextUrls:

$data = $this->getSourceData($url, $this->configuration['pager']['selector']);

but function takes one arg
protected function getSourceData($url)

How to set up for the "http://devel.loc/api/v1/request?page=85"?
{
"self": "http://devel.loc/api/v1/request?page=84",
"first": "http://devel.loc/api/v1/request?page=0",
"last": "http://devel.loc/api/v1/request?page=87",
"prev": "http://devel.loc/api/v1/request?page=83",
"next": "http://devel.loc/api/v1/request?page=85",
"list": [
What type of pager to use?

audriusb’s picture

you need to use type: paginatorinstead of type: urls you currently use to work with latest patch. There are additional params needed to be set in #39.

m.lebedev’s picture

ok..

1)
source:
pager:
type: paginator
paginator_type: page_number
default_num_items: 100
page_key: page
size_key: limit

When a parser has try to get next link after last link I get an error:
"Error message: Client error: `GET http://devel.loc/api/v1/request?page=88&limit=100` resulted in a `404 Not Found`"

if this:

// Service may return 404 for last page, ensure next_urls are valid.
foreach ($next_urls as $key => $next_url) {
  $response = $this->getDataFetcherPlugin()->getResponse($next_url);

  if ($response->getStatusCode() !== 200) {
    unset($next_urls[$key]);
  }
}

change on this:

// Service may return 404 for last page, ensure next_urls are valid.
foreach ($next_urls as $key => $next_url) {
  try {
    $response = $this->getDataFetcherPlugin()->getResponse($next_url);

    if ($response->getStatusCode() !== 200) {
      unset($next_urls[$key]);
    }
  }
  catch (\Exception $e) {
    unset($next_urls[$key]);
  }
}

then it will work.

2) what about other types of pager? They don't work now

cameron prince’s picture

I've tested the patch against a SharePoint API and it seems to work very well for importing paginated responses. The one thing that bugs me is that it seems to break drush ms. When I debug the command, I see that it's paging through the API just like the import, trying to get the total number of rows to display.

Do you think listing the paginated migrations with maybe "N/A" or "---" for their values in the status report would be a good solution?

weynhamz’s picture

Just a heads up here, specifically for JSON:API, this issue propose to create a jsonapi data_parser plugin for JSON:API responses, in which also added the support for links/next, will there be any conflict against the work here?

heddn’s picture

Status: Needs review » Needs work

Re #54: there's a skip_count: true option for all source plugins.

  1. +++ b/src/DataFetcherPluginBase.php
    @@ -29,4 +29,11 @@ abstract class DataFetcherPluginBase extends PluginBase implements DataFetcherPl
    +  public function getNextUrls($url) {
    
    +++ b/src/DataFetcherPluginInterface.php
    @@ -47,4 +47,17 @@ interface DataFetcherPluginInterface {
    diff --git a/src/DataParserPluginBase.php b/src/DataParserPluginBase.php
    

    I think we need more docs on how to use this feature in the doxygen.

  2. +++ b/src/DataFetcherPluginInterface.php
    @@ -47,4 +47,17 @@ interface DataFetcherPluginInterface {
    +  public function getNextUrls($url);
    

    I think we need more docs on how to use this feature in the doxygen.

  3. +++ b/tests/src/Kernel/MigrateHttpJsonCursoringTest.php
    @@ -0,0 +1,139 @@
    +  public function testHttpJsonCursoring() {
    

    I don't see where we are using the pager. As the test data only has a single page.

cameron prince’s picture

Thanks @heddn! skip_count: true took care of the issue with the status report.

cameron prince’s picture

I'm seeing a few bugs with this now that we're actually testing things and comparing counts. For instance, if the row limit is set to 50, we're seeing missing records. If we set the row limit down to 10, we get all results.

Also, when adding debugging to the Http class, I see that the same URL is queried multiple times. It seems to go forward two steps, then back one, such as: startrow=10, 20, 10, 20, 30, 20, 30, 40, etc.

cameron prince’s picture

Another thing we've found with this patch is that when a migration is configured with a pager, it will no longer work if you change the data_fetcher_plugin to file. In our case, we were providing a local .json file as the source for a test. It wasn't obvious that the failure was related, but removing the pager resolved the issue.

m.lebedev’s picture

I created a patch based on the #53 comment

hudri’s picture

The last working patch for me is #37. I believe this is because I need to use the pager: type: urls together with item_selector config key.

Given a common data structure like

{
  data: { ...content payload... }
  links: {
    prev: 'remote api provides full uri to prev data',
    next: 'remote api provides full uri to next data'
  }
}

and a migration config like

source:
  plugin: url
  data_fetcher_plugin: http
  data_parser_plugin: json
  urls: '...'
  item_selector: data
  skip_count: true
  pager:
    type: urls
    selector: 'links/next'

then this patch fails on migrate:import with URI must be a string or UriInterface because file src/Plugin/migrate_plus/data_parser/Json.php

  protected function getNextUrls($url) {
    $next_urls = [];

    if (!empty($this->configuration['pager']['type'])) {
       // next-link can never be found, parameter $item_selector from #37 is mandatory
      $data = $this->getSourceData($url);

getSourceData() can't return the next url because it is limited to payload within item_selector

weekbeforenext’s picture

StatusFileSize
new16.44 KB
new782 bytes

The patch from #60 worked for me, but my API endpoint returns a warning message instead of NULL when there are no more results.

This is a patch that avoids the error:

In Json.php line 124:
                                             
  Passed variable is not an array or object 

My configuration for pagination looks like this:

  pager:
    type: paginator
    default_num_items: 20
    paginator_type: starting_item
    page_key: startNum
    size_key: numResults
weekbeforenext’s picture

Status: Needs work » Needs review
StatusFileSize
new16.84 KB
new12.04 KB
new13.52 KB

I had a need for the 'urls' pager type, so then I experienced the errors you all speak of.

I created a new patch where both the 'urls' and 'paginator' pager types work. I've also included interdiffs between patch #37 and patch #62.

weekbeforenext’s picture

duaelfr’s picture

I found out that there was an issue if the current page number were "0".
Here is the fix.

heddn’s picture

I wonder if this piggy-backed on the work in #3040427: Allow callback for Url source, and single item Json plugin if we would be better off? Build a default callback for drupal's json that is callable via a callback.

duaelfr’s picture

@heddn Thanks for your feedback. I believe both features could work together.
In my current case, I need to build 20K+ URLs so the other issue would be perfect but each of these URLs can be paginated and I cannot know if they are before reading them (the page count is in the answer).

duaelfr’s picture

Version: 8.x-4.x-dev » 8.x-5.x-dev
StatusFileSize
new1.58 KB
new17.2 KB

Bumped version to 5.x and added a way to select the max page from the source in "page" mode.
I feel like we should find a cleaner way to handle these use cases. We might need a new plugin type, don't you think?

pixlkat’s picture

In reference to the comment in #58 -- while debugging this, I discovered that if you set the `default_num_items` config variable to anything greater than the default number of rows returned by the API, it will skip the number you specified, so it is possible to skip the difference between your requested size and the default size since the *initial* url is not set with the pager values. I got around this by setting this in my initial URL.

The API we are using pagination with returns the number of rows in the response as part of the JSON. I've updated the patch in 68 to use the `selector` configuration option to return that value and use it to decide if I need to add another URL to the array. I realize this may be somewhat of a snowflake, but it eliminated the need to request the URL to decide if it should be added. Our API will return a 200 whether or not there are any results or even an error in the query.

I would also support the idea of using a callback to enable pager support rather than trying to create a patch that is all things to all people.

prudloff’s picture

The patch in #69 was very useful to us!

However, I found two remaining issues :

  • When using a pager on an API endpoint that implements rel="next" in Link headers, the pager behave strangely, because Json::getNextUrls() and Http::getNextUrls() will both want to add the same pages.
  • When using an item_selector that returns some empty items, it will cause the pager to go to the next page even if the current page is not over. The attached patch fixed this issue for us.
robin.ingelbrecht’s picture

Status: Needs review » Needs work

I found another use case where this patch crashes. I have multiple URL's setup and each of them has multiple pages.
The following line in DataParserPluginBase::addNextUrls() causes the array keys to be unsequential, which results in a NULL url in DataParserPluginBase::nextSource():

      $this->urls = array_unique($this->urls);

Adding the following fixes the problem:

      $this->urls = array_unique($this->urls);
      $this->urls = array_values($this->urls);
robin.ingelbrecht’s picture

Status: Needs work » Needs review
StatusFileSize
new18.15 KB
axel80’s picture

Hi, thanks for the very useful patch.
I made some test on patch #72 with YouTube Data API v3, which use a cursor for pagination.
Cursors available are stored in the following JSON field:

  • "nextPageToken": for next available page
  • "prevPageToken": for the previous available page

The cursor are NOT returned if the corresponding pages is not available (i.e. last API call doesn't have "nextPageToken" in the response)

I made a test running the following configuration

pager:
    type: cursor
    selector: nextPageToken
    key: pageToken

If I run drush command, the import is succesfully executed with the Info message [info] Undefined index: nextPageToken Json.php:83

$ drush migrate:import youtube_video_gam_json --verbose
 [info] Undefined index: nextPageToken Json.php:83

  16/164 [==>-------------------------]   9% 7 secs
  32/164 [=====>----------------------]  19% 10 secs
  48/164 [========>-------------------]  29% 13 secs
  64/164 [==========>-----------------]  39% 17 secs
  80/164 [=============>--------------]  48% 19 secs
  96/164 [================>-----------]  58% 23 secs
 112/164 [===================>--------]  68% 26 secs
 128/164 [=====================>------]  78% 29 secs
 144/164 [========================>---]  87% 32 secs [info] Undefined index: nextPageToken Json.php:83
 [notice] Processed 154 items (154 created, 0 updated, 0 failed, 0 ignored) - done with 'youtube_video_gam_json'

The items not imported are duplicated url that are skipped (unique constraint is not satisfied)

The same test executed from UI is failing after the import of the first 2 items. This is what I get on the page

An AJAX HTTP error occurred.
HTTP Result Code: 200
Debugging information follows.
Path: /en/batch?id=14&op=do_nojs&op=do
StatusText: OK
ResponseText: 
Notice:  Undefined index: nextPageToken in /Users/alessandro.senatore/sites/app4gym/web/modules/contrib/migrate_plus/src/Plugin/migrate_plus/data_parser/Json.php on line 83
{"status":true,"percentage":"1","message":"Migrating \u003Cem class=\u0022placeholder\u0022\u003EYuotube Playlist GAM - JSON\u003C\/em\u003E","label":"Importing \u003Cem class=\u0022placeholder\u0022\u003EYuotube Playlist GAM - JSON\u003C\/em\u003E (1%)."}
azussman’s picture

Is anyone else having issues with paginator. I am trying to map a drupal 7 content type to a drupal 8 content type. I am using Views Datasource to help create a JSON endpoint on my drupal 7 CMS. I've created the endpoint and have successfully mapped the fields into my drupal 8 site. This works great with my test data. But I am looking at a massive migration 300K+ of data, because of that I want to utilize the power of pagination. The JSON endpoint provides this snippet to use:

pager: {
     pages: 2,
     page: 0,
     count: 4,
     limit: 2
}

I tried using this data to map onto the paginator as such:

  pager:
    type: paginator
    paginator_type: starting_item
    default_num_items: 'pager/count'
    page_key: 'pager/page'
    size_key: 'pager/limit'

This returns a warning `A non-numeric value encountered Json.php:255` and actually does return page 1, but it does not fetch page 2. If I try and hardcode the values into all those fields. The migration just stops working all together and just sits idle, returning no error message or feedback of what could be hanging up the request.

Any help or information around this would be greatly appreciated.

ruslan piskarov’s picture

Also the following code not always work:

if ($response->getStatusCode() !== 200) {
  unset($next_urls[$key]);
}

in my case I get the following result with status 200:

{
    "item_type": "asset",
    "total_count": 253,
    "offset": 5200,
    "limit": 100,
    "items": []
}

Need to check if "item_selector" in my case "items" is empty also make unset($next_urls[$key]);

ruslan piskarov’s picture

I have updated #72 patch for the case if an API returns 200 status code, but with empty 'item_selector'.
For some reason, I can't create an interdiff on my local.
My updated the following:

              if ($response->getStatusCode() !== 200) {
                unset($next_urls[$key]);
              }
              else {
                // In some cases, API returns 200 status code,
                // but with empty 'item_selector'.
                $sourceData = $this->getSourceData($next_url, $this->itemSelector);
                if (is_null($sourceData) || empty($sourceData)) {
                  unset($next_urls[$key]);
                }
              }

segi made their first commit to this issue’s fork.

segi’s picture

StatusFileSize
new18.64 KB
new694 bytes

I have fined a bug when I used patch #76.

Notice: Undefined index: next in Drupal\migrate_plus\Plugin\migrate_plus\data_parser\Json->getSourceData() (line 83 of .../modules/contrib/migrate_plus/src/Plugin/migrate_plus/data_parser/Json.php)

I fixed in a new patch, plus I pulled the last changes from original repo to MR.

JeremyFrench’s picture

I ran into an issue using this patch with a target of a D7 restws site. Trying to pull back terms in a vocabulary ended up with an encoding error on the URL.

https://www.example.com/taxonomy_term?vocabulary=5&amp;page=1

In the page, this was being returned in the json as a Unicode code (which may be the issue and nothing to do with this patch).

,"next":"https:\/\/www.example.com\/taxonomy_term?vocabulary=5\u0026page=1",

The pager config is like this

 pager:
    type: urls
    selector: 'next'

Trying to call the page with &amp in the URL gives a 404.

szloredan’s picture

StatusFileSize
new18.79 KB
new607 bytes

I have a case where i need the sourceData parsed from URLs so a getter will be very good to not run the fetch again for same URL. I added a patch for this.

jcandan’s picture

RE: #74, @azussman:

The paginator pager type is meant to specify the parameters passed in the URL, not the item selector path to read pagination metadata. The example given in #39 shows how URL parameters might be used in pagination, e.g. ?offset=100&limit=50 puts you at page three. And the pager configuration example works to iterate through that pagination without the next link supplied as seen in these other pager types.

ronaldmulero’s picture

Re-roll of #81 against the latest 8.x-5.x-dev branch.

Status: Needs review » Needs work

The last submitted patch, 83: migrate_plus-support_paging-2640516-83.patch, failed testing. View results

roam2345’s picture

@ronaldmulero thank you kindly for that, we are working with this issue actively over this week to consume an endpoint that requires this. Looking through the patch I don't see any documentation on the required YAML definition to utilize the patch. Looking through and trying some of the options put forward in this thread im still battling to get it to function.

Could that be added to the readme? Or an example be added some where in this module.

Thanks.
Lathan

ronaldmulero’s picture

@lathan See #39 for @jcandan's example of how to use the patch.

Rerolled the patch to fix that failing test:
see reroll_diff_83-86.txt

Only 2 changes since #83:

  1. @@ -217,9 +217,6 @@ of Json.php: That else block replaces the entire source array with NULL, which can't be right.
  2. @@ -14,7 +14,7 @@ of JsonTest.php: property must be declared protected.
ronaldmulero’s picture

Status: Needs work » Needs review
scott_euser’s picture

Thanks everyone for the work on this - very helpful!

+++ b/src/Plugin/migrate_plus/data_parser/Json.php
@@ -22,43 +25,76 @@ class Json extends DataParserPluginBase implements ContainerFactoryPluginInterfa
     // Otherwise, we're using xpath-like selectors.
-    $selectors = explode('/', trim($this->itemSelector, '/'));
+    $selectors = explode('/', trim($item_selector, '/'));
+    $return = $this->sourceData;
     foreach ($selectors as $selector) {
-      if (!empty($selector) || $selector === '0') {
-        $source_data = $source_data[$selector];
+      if (!empty($selector) && !empty($return[$selector])) {
+        $return = $return[$selector];
       }
     }
-    return $source_data;
+    return $return;

When testing this with a D7 Restful Web Services site https://www.drupal.org/project/restws I noticed that the first page does not have a 'prev' and the last page does not have a 'next'. With the keys missing the entire JSON is returned instead for the pagination. The entire JSON is of course not a URL so the migration will error with "URI must be a string or UriInterface"

Screenshot of error message - URI must be a string or UriInterface

Updated patch to ensure that getSourceData by selector will only return data for that selector and if not found, then returns NULL instead.

Status: Needs review » Needs work

The last submitted patch, 88: migrate_plus-support_paging-2640516-88.patch, failed testing. View results

scott_euser’s picture

Status: Needs work » Needs review
StatusFileSize
new438 bytes

Updated patch to not assume that a selector has been provided.

scott_euser’s picture

Status: Needs review » Needs work

The last submitted patch, 91: migrate_plus-support_paging-2640516-90.patch, failed testing. View results

scott_euser’s picture

Status: Needs work » Needs review
StatusFileSize
new329 bytes
new19.49 KB

Sorry for the spam, obviously need a bit more coffee this morning!

hmdnawaz’s picture

I have version 5.1 installed but the patch in #93 is not applying cleanly.

So I have rerolled the patch for that version.

My migration configurations are:

pager:
    type: urls
    selector: '@odata.nextLink'
peter törnstrand’s picture

If the pager selector contains a path with more then one component the pager URL's returned are wrong.

For example, with a pager selector like _links/next/href the returned value contains all data under _links.

KevinVanRansbeeck’s picture

Re-roll against latest 8.x-5.x branch

Status: Needs review » Needs work

The last submitted patch, 96: migrate_plus-pager_selector_error-2640516-96.patch, failed testing. View results
- codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

sleopold’s picture

Previous patch resulted in an `ArgumentCountError`

apmsooner’s picture

Patch in #98 works pretty well for me except I get an error on the migrations overview page:

Notice: Undefined index: next in Drupal\migrate_plus\Plugin\migrate_plus\data_parser\Json->getSourceData() (line 82 of /var/www/web/modules/composer/migrate_plus/src/Plugin/migrate_plus/data_parser/Json.php)

rob230’s picture

Patch #98 has a few issues arising from parameters being used instead of a class variable - this was not replaced in all areas.

We also have the issue that when there are not multiple pages, the JSON:API doesn't output a links/next section, so I've added a check for this which means it won't try to load a next URL if the xpath for it isn't found. Pagination has to end so eventually there won't be a next URL. This may also fix your undefined index error apmsooner.

rob230’s picture

Status: Needs work » Needs review
rob230’s picture

Not sure why the patch didn't upload before, hopefully this time it works.

Status: Needs review » Needs work

The last submitted patch, 102: migrate_plus-pager_selector_error-2640516-102.patch, failed testing. View results
- codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

samerali’s picture

Thanks for the patch @rob230, any chance we can get this applied to 6.x ? i tried applying it and it would not apply.

apmsooner’s picture

patch in #102 didn't work for me applied against 5.x version. It returned no results. #98 is still working for me aside from the visual errors.

apmsooner’s picture

FYI, if anyone is trying to migrate from wordpress rest apis, i finally worked through a solution for that with example. The pager type is simply "page" and the defaults work.

plugin: url
  data_fetcher_plugin: http
  data_parser_plugin: json
  urls: 'https://www.{thewordpressdomain}.com/wp-json/wp/v2/media?media_type=image&_fields=id,data,guid,modified,slug,title,caption,alt_text,media_type,mime_type,media_details,source_url&per_page=100'
  pager:
    type: page
  item_selector: /

You may also need to set authentication if the rest apis are restricted on wordpress site.

monymirza’s picture

Version: 8.x-5.x-dev » 6.0.x-dev
Status: Needs work » Needs review
StatusFileSize
new15.1 KB

Re-roll against latest 6.0.x dev branch

Status: Needs review » Needs work
auth’s picture

Attaching a modification to the patch in #107 where we use is_numeric instead of is_int to determine if the selector should be used to select depth instead of as an xpath-like selector.

ricovandevin’s picture

Status: Needs work » Needs review

Triggering tests.

ricovandevin’s picture

Status: Needs review » Needs work
quadbyte’s picture

Status: Needs work » Needs review
StatusFileSize
new15.3 KB

Reapplied #109 to latest 6.0.x-dev and improved code for the "paginator" pager type to support non scalar selector.

bramdriesen’s picture

Status: Needs review » Needs work

Patch does not apply

quadbyte’s picture

Fixed patch prefix.

bramdriesen’s picture

Status: Needs work » Needs review

Setting to needs review to trigger tests :)

Status: Needs review » Needs work

The last submitted patch, 114: migrate_plus-pager_selector_error-2640516.patch, failed testing. View results
- codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

heddn’s picture

Issue tags: +Needs tests

This really needs some tests. Any chance we can add some?

slayne40’s picture

For pager type urls, verify if the selector return an url.

kekkis’s picture

I believe the patches starting from #112 contain an error when using the 'paginator' type and when the $selector_data is scalar. The comparison operator between $num_items and $selector_data has shifted from '==' to '!=' for an unknown reason. How this manifests is that only the first page ever gets processed.

In this new patch I try to explain the thinking in the comment a bit better, plus of course fix the operator.

Leaving as Needs work since there still are no tests.

sassafrass’s picture

I am testing using the latest patch. In my yaml, I am using the paginator type because the urls provided in the JSON endpoints are not valid for my use case.

  urls:
    - http://services.baltimorecountymd.gov/api/hub/pets/pets?status=Adoptable
    - http://services.baltimorecountymd.gov/api/hub/pets/pets?status=Lost
  pager:
    type: paginator
    selector: /metaData/
    paginator_type: page_number
    default_num_items: 10
    page_key: page
    size_key: recordsPerPage

The paginated urls are generated as expected. However, it never stops generating urls. Urls being generated are valid but have no records. For example see: https://services.baltimorecountymd.gov/api/hub/pets/pets?status=Adoptabl....

My particular use case has non-scalar $selector_data but the JSON is not in format expected by this segment of code, which always evaluates to true.

     else {
            // If we have an array of rows
            if (count($selector_data) > 0) {
              $next_urls[] = Url::fromUri($path['path'], [
                'query' => $path['query'],
                'fragment' => $path['fragment'],
              ])->toString();
            }
          }
        }
jcandan’s picture

StatusFileSize
new16.81 KB
new3.86 KB

Re-roll #119 for latest 6.0.x-dev.

chuyenlv’s picture

Re-roll #121 for 6.0.x, Drupal 10 and PHP 8.1

kekkis’s picture

StatusFileSize
new16.97 KB
new1021 bytes

The return type of \Drupal\migrate_plus\Plugin\migrate_plus\data_parser\Json::getSourceData cannot be set to array reliably since it might return NULL (per its own code) or any scalar (from json_decode). This is my suggestion how to fix it, even though using mixed might do as well.

jacobbell84’s picture

Status: Needs work » Needs review
StatusFileSize
new17.15 KB

Comment 100 introduced a logic change to the getSourceData function in the Json data parser plugin that I believe wasn't intended.

     // Otherwise, we're using xpath-like selectors.
     $selectors = explode('/', trim($item_selector, '/'));
     $return = $this->sourceData;
     foreach ($selectors as $selector) {
+      if (!isset($return[$selector])) {
+        return NULL;
+      }
       if (!empty($selector) || $selector === '0') {
         $return = $return[$selector];
       }

With that change, it's no longer possible to use the Json data parser without setting an item selector. This is because the 'explode' method will return an array with one empty element if an empty item selector is passed to it, which causes the new check to fail 100% of the time and returns null. I've changed that logic to what's below, which I believe solves that problem and keeps the original intended functionality.

    // Otherwise, we're using xpath-like selectors.
    $return = $this->sourceData;
    if (!empty($item_selector)) {
      $selectors = explode('/', trim($item_selector, '/'));
      foreach ($selectors as $selector) {
        if (!isset($return[$selector])) {
          return NULL;
        }
        $return = $return[$selector];
      }
    } elseif (sizeOf($return) === 0) {
      return NULL;
    }

Status: Needs review » Needs work

The last submitted patch, 125: migrate_plus-support-paging-2640516-125.patch, failed testing. View results
- codesniffer_fixes.patch Interdiff of automated coding standards fixes only.

nadavoid’s picture

Is there a way to use this patch for the paging scenario I have? The things in particular that I'm not sure how to solve:

  1. The token for the next page is unpredictable; it needs to be read from the current page
  2. The identifier for the next page is only a token, not a full URL

My JSON source has this structure:

{
  "response": {
    "docs": ["one", "two", "three"],
    "count": "329",
    "nextPageToken": "qwk9"
}

The URL of the next page is the same as the current page, with the change of a URL param: &pageToken=qwk9

I'm thinking that I'll need to update the patch or extend it using custom code, but wanted to check here first. If updating the patch is there an existing pager type that would be best to model this on? These are the pager types that I see recognized in the patch so far:

  • urls
  • cursor
  • page
  • paginator

UPDATE
The solution is cursor with this configuration:

  pager:
    type: cursor
    key: pageToken
    selector: response/nextPageToken
nadavoid’s picture

Issue summary: View changes

Updating issue summary.

nadavoid’s picture

Commenting with what I understand to be the purpose of some of the pager options for the JSON plugin.

selector: The path to the JSON item identifying the next page.

type

  • urls: selector points to one or more URLs.
  • cursor: selector is just a portion of the next page URL. Set `key` to the URL parameter key that should be passed.
  • page: selector is a current page number, and expected to increment: 1, 2, 3, etc.
  • paginator: not sure

key: Parameter key in the URL, will be given a value that identifies the page to request. If key is set to page, then the resulting generated URL will contain ?page=[value-from-current-json]

More parameters to document:

size_key
page_key
selector_max
default_num_items
paginator_type

jacobbell84’s picture

Fixing some incorrect variable names (A couple spots were still using $this->itemSelector instead of $item_selector) and fixing another edge case where the item_selector is blank.

guiu.rocafort.ferrer made their first commit to this issue’s fork.

guiu.rocafort.ferrer’s picture

I updated the issue fork with the latest 6.0.x branch, so it is mergeable now. I also made a few changes so the Gitlab CI is available and runs de tests, and also fixed some of the tests.

Some of the changes i made, which might be arguable, are:

  • When the item_selector does not exist, the getSourceData returns now an empty array instead of null. This way, the plugin acts as there is no rows to be imported and does nothing.
  • When the item_selector is NULL or it is not defined, it is set to '' by default. In that case i believe the getSourceData method should return the whole sourceData contents.
  • One of the tests failed because the $item_selector parameter for getSourceData was set to type string, but it was passing an integer value. Inside the function, it checked if the parameter is_numeric, to apply the backwards compatibility for depth selection. I changed the method definition to accept also integer values.

My plan is to next write some tests for the pager functionality, and go from there to fix potential issues and make sure it works ok, before merging into the main branch.

guiu.rocafort.ferrer’s picture

The last commit adds test cases for the pager types urls, cursor, and page. The tests for paginator are still missing thought.

I also expanded a little bit the documentation on the pager types here: https://www.drupal.org/docs/8/api/migrate-api/migrate-source-plugins/mig...

guiu.rocafort.ferrer’s picture

Status: Needs work » Needs review
Issue tags: -Needs tests

I am changing the status to "Needs review", since i wrote tests for all the pager types, and also updated the documentation to reflect the new functionality.

Documentation: https://www.drupal.org/docs/8/api/migrate-api/migrate-source-plugins/mig...

nadavoid’s picture

The issue fork has merge conflicts against the main branch, with the recent update of migrate_plus to 6.0.2. The latest patches also no longer apply.

@guiu.rocafort.ferrer Could you merge the latest into your issue fork and resolve conflicts? I'll be happy to test after that, since I'm actively using this, and will continue to for quite a while.

It looks like the conflict is only in .gitlab-ci.yml

<<<<<<< HEAD
# variables:
#   SKIP_ESLINT: '1'
=======
variables:
  _TARGET_CORE: "$CORE_STABLE"
  _SHOW_ENVIRONMENT_VARIABLES: '1'
  OPT_IN_TEST_MAX_PHP: '1'
>>>>>>> 6.0.x
guiu.rocafort.ferrer’s picture

Hi @nadavoid, the merge request should now be good to go.

nadavoid’s picture

Status: Needs review » Reviewed & tested by the community
StatusFileSize
new33.48 KB

Thanks for the quick update @guiu.rocafort.ferrer! I really appreciate it. I've successfully applied the patch from the MR on 6.0.2 and confirmed that the cursor pagination still works in my installation.

The code looks good good to me, and the tests seem comprehensive. So I'm marking this RTBC. Others who have been deeper in this issue are of course also welcome to review. I'm strongly in favor of merging soon, and if adjustments are needed or bugs are found, they can be handled in smaller followup issues.

I'm also uploading the current version of the patch so that people have something stable to use in their builds today.

guiu.rocafort.ferrer’s picture

Issue summary: View changes
longwave’s picture

I'm using this with the paginator pager type and while it works well, it is repeating HTTP requests - and my source is relatively slow to respond, so this is slowing down my migrations.

If I add logging to Http::getResponse() I see that requests are made up to three times for the same URL:

get https://[redacted]?size=500... done
get https://[redacted]?size=500&offset=500... done
get https://[redacted]?size=500... done
get https://[redacted]?size=500&offset=500... done
get https://[redacted]?size=500&offset=1000... done
get https://[redacted]?size=500&offset=500... done
get https://[redacted]?size=500&offset=1000... done
get https://[redacted]?size=500&offset=1500... done
get https://[redacted]?size=500&offset=1000... done

The extra calls come from two places. Firstly:

    return array_merge(parent::getNextUrls($url), $next_urls);

where parent::getNextUrls($url) calls the Http fetcher to look for Link headers; this probably isn't necessary if a pager is configured?

Secondly:

          // Service may return 404 for last page, ensure next_urls are valid.
          foreach ($next_urls as $key => $next_url) {
            try {
              $response = $this->getDataFetcherPlugin()->getResponse($next_url);

This isn't necessary for my source so it would be good to configure this as well.

I hate to say this at this stage but I wonder if the different pagers need to be their own type of plugin? That would mean they could be more easily extended for different cases. As an example my source also returns a count of total rows in each response which should mean I can automatically calculate the final page number without having to look ahead.

guiu.rocafort.ferrer’s picture

Hi @longwave, I understand your concerns about the performance issues, and the redundant http requests, i do believe too that this is not optimal, and there is room for improve.

This issue have been opened since 2015, and it has been difficult to push it forward and make it to merge with the development branch, so i am worried that addressing those issues would delay even more. I personally have a few sites that make use of the patch for a while now, and i believe some other people might be in the same situation.

So i believe it would be better to merge the issue into development, and then create a follow-up issue to improve the performance situation.
What do the other think about this ?

longwave’s picture

That decision is up to the Migrate Plus maintainers, I think it would be fine to defer to followups given this fixes some immediate issues and both removing the additional requests and refactoring to use plugins could be done separately.

googletorp’s picture

I've used this in the past and compared to the alternative (which is to fetch everything in one go) this is a major improvement. This is also an opt-in feature, so while making a lot of requests can cause problems, it's something the developers using this can determine if it's a problem or not. From what I understand if you're not using the feature we won't have any performance change.

Maybe it's better to get it in, solve a lot of people's problems and get real life feedback on which performance issues would make sense to address.

I think this is RTBC.

heddn’s picture

This is likely one of the longest running requests in the migrate plus queue. Let's land it and make incremental improvements on what we have here.

  • heddn committed b79a5735 on 6.0.x authored by jcandan
    Issue #2640516 by jcandan, m.lebedev, berenddeboer, scott_euser, segi,...
heddn’s picture

Status: Reviewed & tested by the community » Fixed

Thank you everyone for sticking with this issue.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

heddn’s picture

ressa’s picture

Yes, thanks everyone for landing this feature with great tenacity. This is one of the things which makes so Drupal so great -- while an issue may take some time, many will eventually be resolved, and make Drupal even greater.

init90’s picture

I've created an issue to address one of the points from #139 - #3466499: Omit an extra request when the pager is used