Supporting multiple item types per index [#2230931]

Comment #1

Nick_vh

he/him

Ghent

CreditAttribution: Nick_vh commented 15 April 2014 at 09:48

Issue tags:

+sprint

Log in or register to post comments

Comment #2

drunken monkey

he/him

German

Vienna, Austria

CreditAttribution: drunken monkey commented 15 April 2014 at 14:28

See the bottom of #2232253-29: Make the indexing batch process work for my thoughts on how tracking could/should work with multiple types per index.
It therefore probably makes sense to have the tracking stable (again) before implementing this change. I'll discuss this shortly in today's meeting – either someone else implements the now more or less agreed-upon tracking solution, or I'll do it myself, before starting on this issue.

Log in or register to post comments

Comment #3

amateescu CreditAttribution: amateescu commented 16 April 2014 at 10:02

Marking as postponed on #2232253: Make the indexing batch process work for better issue tracking.

Log in or register to post comments

Comment #4

amateescu CreditAttribution: amateescu commented 16 April 2014 at 10:03

Status:

Active

» Postponed

Err, I meant this.

Log in or register to post comments

Comment #5

amateescu CreditAttribution: amateescu commented 16 April 2014 at 13:20

Priority:	Normal	» Critical
Status:	Postponed	» Active

#2232253: Make the indexing batch process work is in!

Log in or register to post comments

Comment #6

drunken monkey

he/him

German

Vienna, Austria

CreditAttribution: drunken monkey commented 16 April 2014 at 15:40

Assigned:

Unassigned

» drunken monkey

OK, then let's see how this could be implemented …

Datasources unchanged
Everywhere else where items are handled (tracker, field-extracted items for indexing, search results), item ID + item type need to be present (item type called "datasource" for now, for consistency – I'd prefer renaming all of it to "item type" later, but that's for then to discuss)
New "datasource" column in our tracking implementation; tracking methods already seem to take the datasource as parameter, but I'd actually like to make that just the datasource ID – tracking should not be dependent on the datasource in any way, we just need the identifier
"Fields" tab split into sections for each datasource (can/should/must be refined later); internally, I think we should split Index::options['fields'] into sub-arrays for each type
To determine: Should it be possible to change the datasources of an existing index? If we want to enable people later to only have a single index per site, then probably yes. (Also, what about the tracker, can admins change that?)
Rewrite the processors to also look at the type of items (where necessary)

Log in or register to post comments

Comment #7

Berdir

German

Switzerland

CreditAttribution: Berdir commented 16 April 2014 at 16:32

Just thinking out lout, what about switching to passing around an object? IndexItem, SearchItem, something like that...

Then you can more or less avoid the concatenated string vs. multiple arguments question and while it's going to be a larger initial change (that should possibly happen as a separate issue, without introducing an further changes) but it would be much easier to add language to it I think.

The biggest problem could be that AFAIK right now, we often pass a list of indexable data arrays around, we can't use an IndexItem object as array key, but switching to type + id would require that to be nested anyway. Maybe an extended object, like IndexItemData?

Could look like this:

interface IndexItemInterface {
  public function getId()/id()/getItemId();
  public function getDataSourceId();
  public function getIndexId();

  // This would be the concatenated string.
  public function getUniqueIdentifier()
}

And when you add language to the mix, you can add a new getLangcode() and extend the unique identifier. Not sure if the unique identifier should be unique within the index or globally, but I assume within the index is enough. Names are just suggestions of course.

And for the data, something like this?

interface IndexItemDataInterface extends IndexItemInterface {
  getAllData();
  setAllData($data);

  getData($field_name);
  setData($field_name, $data);
}

Or you could only have a single interface and define that the data methods only return something in certain cases, but two interfaces might be a nice separation for that?

We unfortunately can't type hint on an array of specific objects but at least PHPStorm understands @param IndexItemDataInterface[] $items just fine.

Log in or register to post comments

Comment #8

drunken monkey

he/him

German

Vienna, Austria

CreditAttribution: drunken monkey commented 16 April 2014 at 17:19

Yes, that's definitely something I wanted to do. Nested arrays are just sooo D7. ;)
I see we don't have an issue for this yet, but yes, basically I completely agree with you there. Could you maybe create an issue with basically just a copy of your above comment (and reference this one)? I do agree, also, that it should be a separate issue. This change is large enough as it is, for now I'll just add another magic property (after #item).

Also, again, I have to apologize to Frédéric here, who already implemented this partly in his initial port.

Btw, I've now dabbled a bit with this and am gonna commit what I have to a new branch soon. If you want, feel free to take a look – though it's of course far, far from finished. If someone wants, you can also work on this until tomorrow morning, I'll just review any new commits when starting again (and ask in IRC whether someone is currently working on it).
I'll also add a temporary "notes" file to that branch, where I keep track of things I shouldn't forget for this change.

Log in or register to post comments

Comment #9

freblasty CreditAttribution: freblasty commented 16 April 2014 at 22:11

@Berdir: Did you create the issue or can I go ahead and create it?

Log in or register to post comments

Comment #10

Berdir

German

Switzerland

CreditAttribution: Berdir commented 17 April 2014 at 07:16

I did not :)

Log in or register to post comments

Comment #11

drunken monkey

he/him

German

Vienna, Austria

CreditAttribution: drunken monkey commented 17 April 2014 at 09:10

The branch is now also pushed to the sandbox. Yesterday, I failed "a little" and pushed it to the normal Search API project instead. Luckily, deleting branches is supported by the Drupal.org repos, otherwise we'd now have several hundred useless commits in the history there (let alone a new branch).

Continuing work on this now, as soon as I've looked at the other new comments in this issue queue.

Log in or register to post comments

Comment #12

drunken monkey

he/him

German

Vienna, Austria

CreditAttribution: drunken monkey commented 17 April 2014 at 11:29

Title:

Supporting multiple entity types per index

» Supporting multiple item types per index

After a little more thought on the matter, I think for the fields a prefixed solution would be better, since it would hide this complexity for systems that don't care about it – e.g., the service classes, or processors.
E.g., the node author's name would then be entity:node|uid:name.

One thing that makes this ugly currently is the possibility that datasource plugin IDs have colons in them, so we can't use a colon to separate the datasource from the rest of the field identifier. But it seems the colon method for derivatives is a core standard, right?
Then, what would be a good separator here? A pipe (|), as suggested above?

Log in or register to post comments

Comment #13

drunken monkey

he/him

German

Vienna, Austria

CreditAttribution: drunken monkey commented 17 April 2014 at 11:49

Unfortunately, I'm done again already for today, busy day. :-/ I'll hopefully have a lot more time tomorrow (weekend probably not so much).
I committed again my current state, feel free to take a look or continue work (e.g., the tasks in the "notes" file, especially adapting the Fields tab). Tomorrow I'll probably finish the index and go on to the tracker – datasource can stay untouched, I think.

Log in or register to post comments

Comment #14

Nick_vh

he/him

Ghent

CreditAttribution: Nick_vh commented 18 April 2014 at 07:54

E.g., the node author's name would then be entity:node|uid:name.

This sounds good to me, gives us some flexibility for the UI also as you said. Hoping to get some more feedback

Log in or register to post comments

Comment #15

amateescu CreditAttribution: amateescu commented 18 April 2014 at 08:00

One thing that makes this ugly currently is the possibility that datasource plugin IDs have colons in them, so we can't use a colon to separate the datasource from the rest of the field identifier.

Is that possibility a historical one (e.g. coming from 7.x)? And even if it is, can't we change it for the 8.x branch?

But it seems the colon method for derivatives is a core standard, right?

Yep, it is.

Is there something else preventing us from using a period instead of a pipe?

Log in or register to post comments

Comment #16

aspilicious CreditAttribution: aspilicious commented 18 April 2014 at 08:04

  /**
   * Decodes derivative id and plugin id from a string.
   *
   * @param string $plugin_id
   *   Plugin identifier that may point to a derivative plugin.
   *
   * @return array
   *   An array with the base plugin id as the first index and the derivative id
   *   as the second. If there is no derivative id it will be null.
   */
  protected function decodePluginId($plugin_id) {
    // Try and split the passed plugin definition into a plugin and a
    // derivative id. We don't need to check for !== FALSE because a leading
    // colon would break the derivative system and doesn't makes sense.
    if (strpos($plugin_id, ':')) {
      return explode(':', $plugin_id, 2);
    }

    return array($plugin_id, NULL);
  }

According to this function, it only gets split on the first colon...
We should test this.

And some other copy paste snippet from irc

aspilicious_home: Nick_vh: maybe it *thinks* it's a derivative if there is any colon in it
[10:03am] aspilicious_home: this all dependes on the usecase
[10:03am] aspilicious_home: if it's always tied to a derivative we are safe
[10:03am] aspilicious_home: if a colon gets used in a single plugin
[10:03am] aspilicious_home: we are screwed
[10:03am] aspilicious_home: *I think*

Log in or register to post comments

Comment #17

drunken monkey

he/him

German

Vienna, Austria

CreditAttribution: drunken monkey commented 18 April 2014 at 10:30

Is that possibility a historical one (e.g. coming from 7.x)? And even if it is, can't we change it for the 8.x branch?

No, on the contrary. I meant that colons are now present in some datasource IDs because they are derivatives.
We didn't have that problem in D7.

Is there something else preventing us from using a period instead of a pipe?

No, of course not. Anything not valid in a datasource ID is OK here. Would you think a period makes more sense?

@ Bram: Thanks for the additional information!

However, it seems to me I have caused some misunderstandings here. (Or maybe not and I'm just misunderstanding.) I'll try to rehash the problem here:

We want both the datasource ID and the "raw" field identifier (as used in D7) in the new field identifiers.
We use colons as separators for the raw field identifiers (e.g., uid:name).
Although "raw" datasource IDs (e.g., entity, as specified in the annotation) should never contain a colon (we don't want to allow that), due to the possibility of having derivatives (that use colons), complete datasource IDs (like entity:node) can contain colons.
We therefore can't use a colon as the separator between datasource ID and raw field identifier.
One possibility would be to make datasource IDs not use colons – e.g., entity.node instead. That's why I asked if derivatives only work with colons.
As it seems they do, we have to use some other separator – which can be any string that will never be part of a datasource ID. I suggested the pipe, just on a whim, but a period would be just as good. Or a dash (or two).

So, a "normal" (non-derivative) plugin ID containing a colon is definitely not the issue here, we don't want to allow that.
I hope I could clear this up.

Continuing work on the branch now.

Log in or register to post comments

Comment #18

amateescu CreditAttribution: amateescu commented 18 April 2014 at 11:24

Thanks, that is indeed much more clear and I do think a period makes sense in this case :)

Log in or register to post comments

Comment #19

drunken monkey

he/him

German

Vienna, Austria

CreditAttribution: drunken monkey commented 18 April 2014 at 15:24

Other votes for pipe, period or other? (I have now extracted the separator into a constant, thus making it trivial to change.)

Furthermore, I have given more thought to the "combined vs. pure item ID" question (i.e., "Should the datasource ID be included in the item IDs, or not?") and am now unsure again. (Reminder: We previously settled on keeping it the pure ID from the datasource, and passing the datasource separately along with the items/IDs.)

A lot of functionality simply shouldn't/doesn't care about the separate parts, it just needs unique IDs. Having those centrally defined makes things a lot easier. Primarily the service classes really shouldn't care at all from which datasource an item comes. They get the extracted fields and should index them.
With "pure" item IDs, most of them would have to construct a combined ID anyways (the DB backend could circumvent this, but that would make deletes uglier), but the Search API couldn't use that. Instead, methods like deleteItems() would need to receive the item IDs plus the datasource they belong to, leading to (conceptually) completely unnecessary changes in the server plugin interface.

On the other hand, if we include the datasource in the item ID, the datasource would need to remove that prefix when loading items. Also, the tracker might want to store the datasource ID separately.

I would therefore suggest that we generally use combined item IDs (maybe just using the same separator as for field names, and renaming it from FIELD_ID_SEPARATOR to DATASOURCE_ID_SEPARATOR, or something), but introduce wrapper methods loadItem() and loadMultipleItems() in the index class which automatically take care of removing this prefix, so the datasource plugins don't have to deal with them. We should then just properly document this, and also come up with clear terminology to refer to the different types of item IDs to avoid confusion (same for the machine names).

The tracker could internally of course do what it wants – extract the datasource ID and store it separately, or don't care about it and use LIKE in queries that make it necessary. (Our implementation should of course still have a separate column, I do think that makes sense.)
However, we could also decide to let the tracker also use the "pure" IDs of the datasource, and continue to, e.g., return remaining item IDs grouped by datasource. When we have two kinds of IDs, ideally only the index would ever have to interact with its datasources and its tracker, so that the complexity of translating between the IDs would be entirely encapsulated in the index class. (Therefore, we'd need to create proxy methods on the index for all generally useful datasource methods, like getItemUrl() and viewItem().)

Any comments?

Log in or register to post comments

Comment #20

drunken monkey

he/him

German

Vienna, Austria

CreditAttribution: drunken monkey commented 18 April 2014 at 19:22

#2244787: Ensure index state invariants always hold contains a new side track of this, as I came across Index::postSave() which didn't really do what it should.

Also, I'm finished for today and, most likely, the week. I sadly probably won't have much time the next days. If someone could, e.g., start the whole UI (fields, index create/edit), that would be great!

Log in or register to post comments

Comment #21

amateescu CreditAttribution: amateescu commented 19 April 2014 at 10:00

Assigned:

drunken monkey

» amateescu

The suggestion from #19 makes sense to me, with the same disclaimer that I'm not yet familiar with the entire codebase :/

That said, I'm going to work on the UI bits today.

Log in or register to post comments

Comment #22

amateescu CreditAttribution: amateescu commented 19 April 2014 at 14:02

Assigned:

amateescu

» Unassigned

Started some work on the Index form, not so much progress but I'll get back to it later in the weekend.

Log in or register to post comments

Comment #23

amateescu CreditAttribution: amateescu commented 22 April 2014 at 07:53

Assigned:

Unassigned

» amateescu

I finished the index form yesterday, going to work on fields and filters today :)

Log in or register to post comments

Comment #24

amateescu CreditAttribution: amateescu commented 22 April 2014 at 14:44

The fields page is done, started some work on processors but those need more thought.

Log in or register to post comments

Comment #25

drunken monkey

he/him

German

Vienna, Austria

CreditAttribution: drunken monkey commented 22 April 2014 at 17:26

Sorry I didn't manage to look at this earlier. Your code so far looks mostly very good, though! Just two notes:

In my opinion, it's not necessary to set processors per datasource. Most have a list of fields to work on anyways, for the others we could introduce a "Datasources" option, where it makes sense.
Why is there validation code for datasource changes if they are forbidden anyways?

In any case, we can discuss more in half an hour!

Log in or register to post comments

Comment #26

amateescu CreditAttribution: amateescu commented 22 April 2014 at 18:49

Assigned:

amateescu

» Unassigned

Thanks for reviewing :) As discussed today, we shouldn't update the filters page to set processors per datasource, so we just need to revert this commit: http://drupalcode.org/sandbox/daeron/2091893.git/commit/63e0ebda1d46133d...

About the validation code for datasources, that's a good question. It was there before and I just made it work.. :)

Log in or register to post comments

Comment #27

drunken monkey

he/him

German

Vienna, Austria

CreditAttribution: drunken monkey commented 22 April 2014 at 19:41

OK, reverted that commit. Now working on rewriting the processors themselves – shouldn't be much, though.

However, I've already hit a snag: I'd imagined Index::getPropertyDefinitions() would now receive the properties for only one datasource at a time. Thus, we'd also only pass the properties of a single datasource at a time to the processors for altering.
What should a processor now do, though, if they are adding a new field for all datasources? For now, they'll need to add it to the properties array for all datasources separately. However, since we currently don't allow datasource-independent fields at all, there wouldn't be an easy way around this anyways, I guess.

Thinking about it, though, having datasource-independent fields is important for several different use cases, so we will probably have to re-add them before merging this into the master branch. Or maybe afterwards, to get this in as soon as possible? Creating a new issue, for now: #2247451: Allow datasource-independent fields.

Also, side note @ Andrei: Why did you comment out the cache set in Index::getFields()?

Log in or register to post comments

Comment #28

drunken monkey

he/him

German

Vienna, Austria

CreditAttribution: drunken monkey commented 22 April 2014 at 19:48

Assigned:

Unassigned

» drunken monkey

Also²: Do the Fields forms already work for you? For me, they just throw fatals when submitting (either with "Save changes" or "Update" for the "Add related fields" subform).
If this isn't done, then that's something you could work at in parallel with me, if you want. As long as you only touch the fields form class, there shouldn't be any conflicts (otherwise, we'd have to coordinate).

Log in or register to post comments

Comment #29

amateescu CreditAttribution: amateescu commented 22 April 2014 at 20:04

Why did you comment out the cache set in Index::getFields()?

Ugh, I committed that? Sorry :( I was debugging why the processors alter step from Index::getPropertyDefinitions() was making getFields() return an empty array and it was easier than clearing the cache all the time.

About the fields page, yep, adding regular and additional fields on a index with two datasources was working for me. I'll test again tomorrow morning and work only in that form if there's something to fix.

Log in or register to post comments

Comment #30

amateescu CreditAttribution: amateescu commented 23 April 2014 at 08:28

I managed to get that fatal error after testing with different entity types and committed a fix for it.

Log in or register to post comments

Comment #31

drunken monkey

he/him

German

Vienna, Austria

CreditAttribution: drunken monkey commented 23 April 2014 at 10:36

Ugh, I committed that? Sorry :( I was debugging why the processors alter step from Index::getPropertyDefinitions() was making getFields() return an empty array and it was easier than clearing the cache all the time.

Ah, OK. It's no problem, I was just wondering. (Personally, I always add "dpm" to lines I temporarily comment out, so I can quickly check for any forgotten debugging code by greping for that.)

Log in or register to post comments

Comment #32

drunken monkey

he/him

German

Vienna, Austria

CreditAttribution: drunken monkey commented 23 April 2014 at 18:37

Assigned:

drunken monkey

» Unassigned

OK, this is now working quite well already, as far as I can see. I also wouldn't know of any areas that are still incompatible with the change.
However, I think there are still some fails in the tests which I couldn't quite figure out. Also, I've run out of time for this week, so it would be great if someone else could finalize this. Otherwise, we can try on Monday, then we should really have enough time I'd say. ;)

(Btw, I figured out some of my local fails. They were due to the wrong usage of paths in the tests, which lead to incorrect links on my installation (which isn't in the root directory of the domain). Such errors also appear at some places in our normal code, I think, at least until recently. Maybe we can fix that, too, in Zürich.)

Log in or register to post comments

Comment #33

amateescu CreditAttribution: amateescu commented 24 April 2014 at 07:23

Assigned:

Unassigned

» amateescu

Fixed SearchApiLocalTasksTest last night and I'm working on fixing the others now.

Log in or register to post comments

Comment #34

amateescu CreditAttribution: amateescu commented 24 April 2014 at 09:40

The last fails I'm hitting are actually caused by a core bug in config entity query: #2248567: Config entity query doesn't work when matching an array property

Log in or register to post comments

Comment #35

amateescu CreditAttribution: amateescu commented 24 April 2014 at 14:16

Assigned:

amateescu

» Unassigned

Aaaaand we have a working branch!

Log in or register to post comments

Comment #36

drunken monkey

he/him

German

Vienna, Austria

CreditAttribution: drunken monkey commented 24 April 2014 at 17:01

Excellent, thanks a lot! Looks great, too.
I only had one minimal code clarity fix, but other than that I think we can merge this now. Does anyone still want to have a look, or has some other objection?