So Drupal 7 used only numeric IDs to store entities but Drupal 8 is changing and adapting to the philosophy of none numeric IDs. We are building an application and using entities that use URIs as IDs.

The issue came when I tried to integrate with SOLR (I don't know about the generic framework but I tried to solve the issue within it so I'm reporting it here). The indexing and loading of the entities failed with errors regarding the ID.
Looking a bit more into it, we found 2 places within the code that are not enough ID agnostic and actually work when the character ':' and the character '/' are not part of the ID.

So I propose that we should change the way the ID splitting occures in the code. All we need to check is that when splitting by '/', the strpos should definetly be used instead of strrpos and also we should replace explode function in Utilities.php with a preg_split().

Credit also goes to sandervd

Comments

dimilias created an issue. See original summary.

dimilias’s picture

Status: Active » Needs review
StatusFileSize
new1.32 KB

Status: Needs review » Needs work

The last submitted patch, 2: support_for_non_numeric-2671228-2.patch, failed testing.

The last submitted patch, 2: support_for_non_numeric-2671228-2.patch, failed testing.

dimilias’s picture

StatusFileSize
new1.29 KB

Sorry for the previous patch, Filezilla messed the format...

dimilias’s picture

Status: Needs work » Needs review

Status: Needs review » Needs work

The last submitted patch, 5: support_for_non_numeric-2671228-2.patch, failed testing.

The last submitted patch, 5: support_for_non_numeric-2671228-2.patch, failed testing.

pfrenssen’s picture

I ran the test locally, it doesn't output why it fails but this is the reason:

PHPUnit_Framework_Error_Notice: Undefined offset: 1 in src/Plugin/search_api/datasource/ContentEntity.php on line 372

This happens when an item is indexed with the language code "l0".

dimilias’s picture

Status: Needs work » Needs review
StatusFileSize
new1.3 KB

Ok so the previous one was failing because I did not include the upper case and numbers for the preg_split.
I though that language codes are just small case and dashes like "fr" or "en-us". Am I thinking it wrong? Various tests are using generators including both upper case and numbers as well.

pfrenssen’s picture

Status: Needs review » Needs work
Issue tags: +Needs tests

@dimilias looking at ConfigurableLanguageTest::testName() it appears that alphanumeric characters are accepted. Dashes are not tested there, and it only tests 2 characters.

Now I did some more digging, and I found ConfigurableLanguage::createFromLangcode() which points to a long list of supported language codes in LanguageManager::getStandardLanguageList(). They are all consisting of lowercase characters and dashes. Most are 2 characters but some are longer (e.g. 'xx-lolspeak'). None of them contain numbers, but since ConfigurableLanguageTest::testName() actively tests that numbers are supported I would leave them in.

I would like to see a specific test added for this issue. This regex stuff is always a bit tricky, so it would be good to have a failing test that proves the problem exists and is conclusively fixed.

Thanks for working on this!

dimilias’s picture

StatusFileSize
new4.28 KB

I've started on a test and it is not working for some reason (my xdebug is failing to communicate with me)..
Can you test it??

pfrenssen’s picture

  1. +++ b/tests/src/Kernel/EntityStringIdTest.php
    @@ -0,0 +1,146 @@
    +use Drupal\Component\Render\FormattableMarkup;
    +use Drupal\entity_test\Entity\EntityTestMul;
    +use Drupal\KernelTests\KernelTestBase;
    +use Drupal\language\Entity\ConfigurableLanguage;
    +use Drupal\search_api\Entity\Index;
    +use Drupal\search_api\Entity\Server;
    

    Some of these use statements are not used in the current scope.

  2. +++ b/tests/src/Kernel/EntityStringIdTest.php
    @@ -0,0 +1,146 @@
    +    /** @var \Drupal\entity_test\Entity\EntityTestMul $entity_1 */
    +    $entity_1 = EntityTestMul::create(array(
    

    It uses EntityTestMul here of which the schema is not available, it should be EntityTestStringId instead.

drunken monkey’s picture

Title: Support for non Numeric IDs » Support for arbitrary strings as entity IDs

Thanks for reporting this problem and providing a patch!
The title is a bit misleading, though – string entity IDs are already well supported, it's just that it seems we have problems with specific special characters in there. But you're right, we definitely didn't think of that use case! (Actually, I thought content entities are still required to have integer IDs. But apparently I was mistaken.)

  1. +++ b/src/Plugin/search_api/datasource/ContentEntity.php
    @@ -369,7 +369,7 @@ class ContentEntity extends DatasourcePluginBase {
    -      list($entity_id, $langcode) = explode(':', $item_id, 2);
    +      list($entity_id, $langcode) = preg_split('/:(?=[a-zA-Z0-9-]+$)/', $item_id);
    

    There will always be a language code at the end, so why not just use strrpos()?

  2. +++ b/src/Utility.php
    @@ -367,7 +367,8 @@ class Utility {
       public static function splitPropertyPath($property_path, $separate_last = TRUE, $separator = ':') {
    -    $function = $separate_last ? 'strrpos' : 'strpos';
    +    $is_default_separator = $separator != IndexInterface::DATASOURCE_ID_SEPARATOR;
    +    $function = $separate_last && $is_default_separator ? 'strrpos' : 'strpos';
    

    The correct fix for this problem is not to change the splitPropertyPath() method, but not to use it (or to use it in the proper way) in splitCombinedId(). It probably was a bad idea to begin with – it's just that the code was so similar and I apparently was too aggressive in trying to avoid code duplication.

In any case, a test case demonstrating this problem would really be helpful!

dimilias’s picture

Sorry for taking so long. I will try to write the test soon but I'm quite busy at the moment. :/

dimilias’s picture

StatusFileSize
new6.66 KB

Ok, I have built the test. sorry for taking so long. This is also the first test I create so sorry if I am missing some logic.

There are two test methods in the test. The first one tests plain string Ids and the second one is testing uri ids.
The limits of the ID length are also described in the test regarding the db backend.

dimilias’s picture

There will always be a language code at the end, so why not just use strrpos()?

Initially that was also my thought but I think (don't remember quite well) that there were various cases and the checks would need to be more and then again maybe there where use cases that I wouldn't think of. Like initially I thought that the only characters supported in a language code was letters. Then my tests were failing and it was because language codes also get '-' and numbers. So I went with this safe solution instead.

dimilias’s picture

StatusFileSize
new5.02 KB

Here is a better version of the previous test.

pfrenssen’s picture

  1. +++ b/tests/src/Kernel/EntityStringIdTest.php
    @@ -0,0 +1,198 @@
    +/**
    + * @file
    + * Contains \Drupal\Tests\search_api\Kernel\EntityStringIdTest.
    + */
    

    A new coding standards rule was accepted yesterday - these @file docblocks should be removed for namespaced classes. Finally :)

    See #2304909: Relax requirement for @file when using OO Class or Interface per file.

  2. +++ b/tests/src/Kernel/EntityStringIdTest.php
    @@ -0,0 +1,198 @@
    +  /**
    +   * Tests Uris as Ids.
    +   *
    +   * We are testing a uri because it is not matched by the regex character set
    +   * represented by \w because they contain the characters ':' and '/'
    +   * which are used to split the string saved in the index
    +   * and should not affect it.
    +   */
    +  public function testUriStringId() {
    

    This test is very similar to the first one. You can use a @dataProvider to provide a range of IDs to test. That way you can reuse the same test code to test a whole bunch of different IDs.

Edit: This was a review for the previous version of the patch :) Ilias can implement this faster than I can explain it :)

dimilias’s picture

Status: Needs work » Needs review
dimilias’s picture

StatusFileSize
new5.02 KB

@pfrenssen Sorry! I just fixed this really fast. New patch available.

The last submitted patch, 12: 2671228_12.patch, failed testing.

The last submitted patch, 12: 2671228_12.patch, failed testing.

The last submitted patch, 16: 2671228_16_test.patch, failed testing.

The last submitted patch, 16: 2671228_16_test.patch, failed testing.

The last submitted patch, 18: 2671228_18_test.patch, failed testing.

The last submitted patch, 18: 2671228_18_test.patch, failed testing.

Status: Needs review » Needs work

The last submitted patch, 21: 2671228_test.patch, failed testing.

The last submitted patch, 21: 2671228_test.patch, failed testing.

dimilias’s picture

StatusFileSize
new4.93 KB

I'm too clumsy.. Coding standards met.

dimilias’s picture

StatusFileSize
new6.23 KB

Patch with test also submitted now.

borisson_’s picture

back to NR, so the testbot can have a look at the last patch.

borisson_’s picture

Status: Needs work » Needs review

Actually doing #32

pfrenssen’s picture

Strange that #30 is not tested?

pfrenssen’s picture

Assigned: Unassigned » pfrenssen

Ok anyway the patch in #21 contains only the test, this one fails correctly to prove that it can correctly detect the bug. The one in #30 which is not tested only fixes some coding standards. The patch in #31 provides both the patch + the test and this one is green. So the test and the fix are proven to be functional. That's great!

Assigning to me for code review.

pfrenssen’s picture

Status: Needs review » Needs work
Issue tags: -Needs tests

This patch looks really great! I just found some minor code style and comment cleanups to make it perfect.

The two remarks of @drunkenmonkey were not addressed but you answered them. I also think that strrpos() would be a viable
alternative, but the regex also is fine for me.

I agree with the second remark of @drunkenmonkey: splitPropertyPath() should not have been used in the first place in splitCombinedId(). In this patch we now had to implement a workaround because splitPropertyPath() was not used for its intended purpose. The workaround now actually can cause it to return wrong results if the $separator is a different value.

I propose to remove the workaround from splitPropertyPath() and replace splitCombinedId() with the following:

  public static function splitCombinedId($combined_id) {
    return explode(IndexInterface::DATASOURCE_ID_SEPARATOR, $combined_id, 2);
  }

This should satisfy the tests, and we don't need to pollute splitPropertyPath().

Here's the rest of the review:

  1. +++ b/tests/src/Kernel/EntityStringIdTest.php
    @@ -0,0 +1,160 @@
    +/**
    + * Tests entity indexing that are using string IDs.
    + *
    

    Tests indexing entities that are using string IDs.

  2. +++ b/tests/src/Kernel/EntityStringIdTest.php
    @@ -0,0 +1,160 @@
    +  /**
    +   * Modules to enable for this test.
    +   *
    +   * @var string[]
    +   */
    +  public static $modules = array(
    

    This is overriding a property of KernelTestBase, so this docblock should be:

    /**
     * {@inheritdoc}
     */
    
  3. +++ b/tests/src/Kernel/EntityStringIdTest.php
    @@ -0,0 +1,160 @@
    +    // Enable translation for the entity_test module.
    +    \Drupal::state()->set('entity_test_string_id.translation', FALSE);
    

    Are you sure this is correct? This says that it is "enabling" translation, but it sets the value to FALSE, which seems to be disabling translation? Am I seeing this incorrectly?

  4. +++ b/tests/src/Kernel/EntityStringIdTest.php
    @@ -0,0 +1,160 @@
    +    // Create a test server.
    +    $this->server = Server::create(array(
    +      'name' => 'Test Server',
    +      'id' => 'test_server',
    +      'status' => 1,
    +      'backend' => 'search_api_test_backend',
    +    ));
    +    $this->server->save();
    

    This is really minor, but in new code written for D8 we typically use the new short array syntax [] instead of array().

    If you decide to fix this, have a look for other places in the test where this syntax is used.

    Edit: @borisson_ points out that Search API uses the classic array syntax by default.

  5. +++ b/tests/src/Kernel/EntityStringIdTest.php
    @@ -0,0 +1,160 @@
    +  /**
    +   * Tests Uris as Ids.
    +   *
    

    Tests URIs as IDs.

  6. +++ b/tests/src/Kernel/EntityStringIdTest.php
    @@ -0,0 +1,160 @@
    +  public function testUriStringId($entity_id) {
    

    This test is really great! I would never be able to tell this is your first test. Great job!

  7. +++ b/tests/src/Kernel/EntityStringIdTest.php
    @@ -0,0 +1,160 @@
    +    $entity_1 = EntityTestStringId::create(array(
    

    There is only 1 entity in this test, so you can just name the variable $entity.

  8. +++ b/tests/src/Kernel/EntityStringIdTest.php
    @@ -0,0 +1,160 @@
    +    // Test indexing the new entities. One should fail so only one should be indexed.
    

    This documentation is left over from an earlier version of the patch I think. The "One should fail" part is not relevant any more it seems.

  9. +++ b/tests/src/Kernel/EntityStringIdTest.php
    @@ -0,0 +1,160 @@
    +  /**
    +   * Provides string ids to test.
    +   *
    

    Provides string IDs to test.

  10. +++ b/tests/src/Kernel/EntityStringIdTest.php
    @@ -0,0 +1,160 @@
    +   * @return array An array of arrays which contain a list of parameters to be
    +   *   passed to the appropriate tests.
    

    Here the description should be on a second line. I know, the coding standards are complicated ;)

  11. +++ b/tests/src/Kernel/EntityStringIdTest.php
    @@ -0,0 +1,160 @@
    +    return[
    

    Leave a space between 'return' and '['.

pfrenssen’s picture

Assigned: pfrenssen » Unassigned
borisson_’s picture

#36.4 - we always use the long notation in the entire Search API module, so let's keep it to the long notation everywhere in this patch.

  1. +++ b/tests/src/Kernel/EntityStringIdTest.php
    @@ -0,0 +1,160 @@
    + * For this test we are using:
    + *
    + * entity:entity_test_string_id:<string_id>:und
    + * entity = 6 characters.
    + * entity_test_string_id = 21 characters.
    + * und = 3 characters.
    + * 3 x ':' = 3 characters.
    + * 50 - 6 - 21 - 3 - 3 = 17 characters left for the ID.
    

    While this is great information, I don't think we need it in the docblock here (because it'll go out of date when this tests gets updated)

  2. +++ b/tests/src/Kernel/EntityStringIdTest.php
    @@ -0,0 +1,160 @@
    +  function entityStringIdList(){
    

    Should be public function, there also needs to be a space after the brackets: () {

I agree with @pfrenssen that this looks great!

dimilias’s picture

StatusFileSize
new6.21 KB

Implemented latest required changes.

@borisson what is NR?
Also, do I have to select anything specific for my patch to go through testing?

borisson_’s picture

Status: Needs work » Needs review

The issue status should be on "Needs Review".

Status: Needs review » Needs work

The last submitted patch, 39: 2671228.patch, failed testing.

The last submitted patch, 39: 2671228.patch, failed testing.

pfrenssen’s picture

+++ b/src/Utility.php
@@ -365,7 +365,8 @@ class Utility {
-    $function = $separate_last ? 'strrpos' : 'strpos';
+    $is_default_separator = $separator != IndexInterface::DATASOURCE_ID_SEPARATOR;
+    $function = $separate_last && $is_default_separator ? 'strrpos' : 'strpos';

This is not needed any more since we are no longer calling splitPropertyPath() from splitCombinedId().

We should take a look at the failures too.

pfrenssen’s picture

I'm thinking that it would be better if we simplify splitPropertyPath() so that it always gives back all three parts, instead of doing all this magic with returning the first or the last part. That would make the logic more straightforward, and we don't need the additional arguments any more.

If it returns an array with three items and you need only the language then you can do list(,,$language) = splitPropertyPath($path). This is very similar to how the entity info was returned by Entity API in D7.

pfrenssen’s picture

Status: Needs work » Needs review
StatusFileSize
new4.43 KB
new5.64 KB
new1.08 KB

Hopefully this solves the failures. Also addressed my remark from #43. Going to have a look if it is feasible to simplify splitPropertyPath(), but that's probably better handled in a separate ticket. We now also don't have a need for the $separator argument any more. I did not change this.

The last submitted patch, 45: 2671228-45-test-only.patch, failed testing.

The last submitted patch, 45: 2671228-45-test-only.patch, failed testing.

dimilias’s picture

Status: Needs review » Needs work

Hmm nice one.! And I thought I was missing something trivial when I tried to fix this myself.. I did not notice about the null case. :/

borisson_’s picture

Status: Needs work » Reviewed & tested by the community
StatusFileSize
new1.32 KB
new5.61 KB

We can open a followup to simplify splitPropertyPath .

drunken monkey’s picture

StatusFileSize
new3.66 KB
new5.99 KB

Wow, great work in this issue, thanks a lot everyone!

Before the switch to use splitPropertyPath(), splitCombinedId() used the following code:

$pos = strpos($combined_id, IndexInterface::DATASOURCE_ID_SEPARATOR);
if ($pos === FALSE) {
  return array(NULL, $combined_id);
}
return array(substr($combined_id, 0, $pos), substr($combined_id, $pos + 1));

However, the performance benefit is surely negligible, and I guess your new version is slightly better to read, so we can just stick with that.

I don't understand the argument for preg_split(), so I changed the code to use strrpos(), as suggested.
I shortly considered also putting in a $pos === FALSE check as a safeguard, but I actually think throwing

Please see the attached patch for this and some other small nit-picks – if that's still OK with you, I can commit it.
In any case, thanks again!

drunken monkey’s picture

Status: Reviewed & tested by the community » Needs review
mkalkbrenner’s picture

pfrenssen’s picture

+++ b/src/Plugin/search_api/datasource/ContentEntity.php
@@ -368,7 +368,14 @@ public function loadMultiple(array $ids) {
+      // This can only happen if someone passes an invalid ID, since we always
+      // include a language code. Still, no harm in guarding against bad input.
+      if ($pos === FALSE) {
+        continue;
+      }

If this is a side effect of unexpected input, wouldn't it be better to throw an \InvalidArgumentException?

drunken monkey’s picture

Status: Needs review » Fixed

If this is a side effect of unexpected input, wouldn't it be better to throw an \InvalidArgumentException?

No, I don't think so. If you try to load a node with a string as a NID, you will also just get back NULL (or an empty array, if multi-loading). Drupal is generally generous regarding error reporting when loading entities, and it seems natural to mimick that behavior here, I'd say.

So, since this is blocking other issues, if this is the only complaint, I now just committed.
If you still feel we should throw an exception, please either re-open the issue or create a new one!

@ mkalkbrenner: How exactly is that issue related? (Thanks to the link, I noticed it was out-dated, though.)

mkalkbrenner’s picture

@ mkalkbrenner: How exactly is that issue related? (Thanks to the link, I noticed it was out-dated, though.)

Documentation ;-)

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.