For more information see discussion "#2126979: Overhaul the "Multi-Index Searches" module"

Comments

Nick_vh’s picture

Issue tags: +sprint
drunken monkey’s picture

See the bottom of #2232253-29: Make the indexing batch process work for my thoughts on how tracking could/should work with multiple types per index.
It therefore probably makes sense to have the tracking stable (again) before implementing this change. I'll discuss this shortly in today's meeting – either someone else implements the now more or less agreed-upon tracking solution, or I'll do it myself, before starting on this issue.

amateescu’s picture

Marking as postponed on #2232253: Make the indexing batch process work for better issue tracking.

amateescu’s picture

Status: Active » Postponed

Err, I meant this.

amateescu’s picture

Priority: Normal » Critical
Status: Postponed » Active
drunken monkey’s picture

Assigned: Unassigned » drunken monkey

OK, then let's see how this could be implemented …

  • Datasources unchanged
  • Everywhere else where items are handled (tracker, field-extracted items for indexing, search results), item ID + item type need to be present (item type called "datasource" for now, for consistency – I'd prefer renaming all of it to "item type" later, but that's for then to discuss)
  • New "datasource" column in our tracking implementation; tracking methods already seem to take the datasource as parameter, but I'd actually like to make that just the datasource ID – tracking should not be dependent on the datasource in any way, we just need the identifier
  • "Fields" tab split into sections for each datasource (can/should/must be refined later); internally, I think we should split Index::options['fields'] into sub-arrays for each type
  • To determine: Should it be possible to change the datasources of an existing index? If we want to enable people later to only have a single index per site, then probably yes. (Also, what about the tracker, can admins change that?)
  • Rewrite the processors to also look at the type of items (where necessary)
Berdir’s picture

Just thinking out lout, what about switching to passing around an object? IndexItem, SearchItem, something like that...

Then you can more or less avoid the concatenated string vs. multiple arguments question and while it's going to be a larger initial change (that should possibly happen as a separate issue, without introducing an further changes) but it would be much easier to add language to it I think.

The biggest problem could be that AFAIK right now, we often pass a list of indexable data arrays around, we can't use an IndexItem object as array key, but switching to type + id would require that to be nested anyway. Maybe an extended object, like IndexItemData?

Could look like this:

interface IndexItemInterface {
  public function getId()/id()/getItemId();
  public function getDataSourceId();
  public function getIndexId();

  // This would be the concatenated string.
  public function getUniqueIdentifier()
}

And when you add language to the mix, you can add a new getLangcode() and extend the unique identifier. Not sure if the unique identifier should be unique within the index or globally, but I assume within the index is enough. Names are just suggestions of course.

And for the data, something like this?

interface IndexItemDataInterface extends IndexItemInterface {
  getAllData();
  setAllData($data);

  getData($field_name);
  setData($field_name, $data);
}

Or you could only have a single interface and define that the data methods only return something in certain cases, but two interfaces might be a nice separation for that?

We unfortunately can't type hint on an array of specific objects but at least PHPStorm understands @param IndexItemDataInterface[] $items just fine.

drunken monkey’s picture

Yes, that's definitely something I wanted to do. Nested arrays are just sooo D7. ;)
I see we don't have an issue for this yet, but yes, basically I completely agree with you there. Could you maybe create an issue with basically just a copy of your above comment (and reference this one)? I do agree, also, that it should be a separate issue. This change is large enough as it is, for now I'll just add another magic property (after #item).

Also, again, I have to apologize to Frédéric here, who already implemented this partly in his initial port.

Btw, I've now dabbled a bit with this and am gonna commit what I have to a new branch soon. If you want, feel free to take a look – though it's of course far, far from finished. If someone wants, you can also work on this until tomorrow morning, I'll just review any new commits when starting again (and ask in IRC whether someone is currently working on it).
I'll also add a temporary "notes" file to that branch, where I keep track of things I shouldn't forget for this change.

freblasty’s picture

@Berdir: Did you create the issue or can I go ahead and create it?

Berdir’s picture

I did not :)

drunken monkey’s picture

The branch is now also pushed to the sandbox. Yesterday, I failed "a little" and pushed it to the normal Search API project instead. Luckily, deleting branches is supported by the Drupal.org repos, otherwise we'd now have several hundred useless commits in the history there (let alone a new branch).

Continuing work on this now, as soon as I've looked at the other new comments in this issue queue.

drunken monkey’s picture

Title: Supporting multiple entity types per index » Supporting multiple item types per index

After a little more thought on the matter, I think for the fields a prefixed solution would be better, since it would hide this complexity for systems that don't care about it – e.g., the service classes, or processors.
E.g., the node author's name would then be entity:node|uid:name.

One thing that makes this ugly currently is the possibility that datasource plugin IDs have colons in them, so we can't use a colon to separate the datasource from the rest of the field identifier. But it seems the colon method for derivatives is a core standard, right?
Then, what would be a good separator here? A pipe (|), as suggested above?

drunken monkey’s picture

Unfortunately, I'm done again already for today, busy day. :-/ I'll hopefully have a lot more time tomorrow (weekend probably not so much).
I committed again my current state, feel free to take a look or continue work (e.g., the tasks in the "notes" file, especially adapting the Fields tab). Tomorrow I'll probably finish the index and go on to the tracker – datasource can stay untouched, I think.

Nick_vh’s picture

E.g., the node author's name would then be entity:node|uid:name.

This sounds good to me, gives us some flexibility for the UI also as you said. Hoping to get some more feedback

amateescu’s picture

One thing that makes this ugly currently is the possibility that datasource plugin IDs have colons in them, so we can't use a colon to separate the datasource from the rest of the field identifier.

Is that possibility a historical one (e.g. coming from 7.x)? And even if it is, can't we change it for the 8.x branch?

But it seems the colon method for derivatives is a core standard, right?

Yep, it is.

Is there something else preventing us from using a period instead of a pipe?

aspilicious’s picture

  /**
   * Decodes derivative id and plugin id from a string.
   *
   * @param string $plugin_id
   *   Plugin identifier that may point to a derivative plugin.
   *
   * @return array
   *   An array with the base plugin id as the first index and the derivative id
   *   as the second. If there is no derivative id it will be null.
   */
  protected function decodePluginId($plugin_id) {
    // Try and split the passed plugin definition into a plugin and a
    // derivative id. We don't need to check for !== FALSE because a leading
    // colon would break the derivative system and doesn't makes sense.
    if (strpos($plugin_id, ':')) {
      return explode(':', $plugin_id, 2);
    }

    return array($plugin_id, NULL);
  }

According to this function, it only gets split on the first colon...
We should test this.

And some other copy paste snippet from irc

aspilicious_home: Nick_vh: maybe it *thinks* it's a derivative if there is any colon in it
[10:03am] aspilicious_home: this all dependes on the usecase
[10:03am] aspilicious_home: if it's always tied to a derivative we are safe
[10:03am] aspilicious_home: if a colon gets used in a single plugin
[10:03am] aspilicious_home: we are screwed
[10:03am] aspilicious_home: *I think*

drunken monkey’s picture

Is that possibility a historical one (e.g. coming from 7.x)? And even if it is, can't we change it for the 8.x branch?

No, on the contrary. I meant that colons are now present in some datasource IDs because they are derivatives.
We didn't have that problem in D7.

Is there something else preventing us from using a period instead of a pipe?

No, of course not. Anything not valid in a datasource ID is OK here. Would you think a period makes more sense?

@ Bram: Thanks for the additional information!

However, it seems to me I have caused some misunderstandings here. (Or maybe not and I'm just misunderstanding.) I'll try to rehash the problem here:

  • We want both the datasource ID and the "raw" field identifier (as used in D7) in the new field identifiers.
  • We use colons as separators for the raw field identifiers (e.g., uid:name).
  • Although "raw" datasource IDs (e.g., entity, as specified in the annotation) should never contain a colon (we don't want to allow that), due to the possibility of having derivatives (that use colons), complete datasource IDs (like entity:node) can contain colons.
  • We therefore can't use a colon as the separator between datasource ID and raw field identifier.
  • One possibility would be to make datasource IDs not use colons – e.g., entity.node instead. That's why I asked if derivatives only work with colons.
  • As it seems they do, we have to use some other separator – which can be any string that will never be part of a datasource ID. I suggested the pipe, just on a whim, but a period would be just as good. Or a dash (or two).

So, a "normal" (non-derivative) plugin ID containing a colon is definitely not the issue here, we don't want to allow that.
I hope I could clear this up.

Continuing work on the branch now.

amateescu’s picture

Thanks, that is indeed much more clear and I do think a period makes sense in this case :)

drunken monkey’s picture

Other votes for pipe, period or other? (I have now extracted the separator into a constant, thus making it trivial to change.)

Furthermore, I have given more thought to the "combined vs. pure item ID" question (i.e., "Should the datasource ID be included in the item IDs, or not?") and am now unsure again. (Reminder: We previously settled on keeping it the pure ID from the datasource, and passing the datasource separately along with the items/IDs.)

A lot of functionality simply shouldn't/doesn't care about the separate parts, it just needs unique IDs. Having those centrally defined makes things a lot easier. Primarily the service classes really shouldn't care at all from which datasource an item comes. They get the extracted fields and should index them.
With "pure" item IDs, most of them would have to construct a combined ID anyways (the DB backend could circumvent this, but that would make deletes uglier), but the Search API couldn't use that. Instead, methods like deleteItems() would need to receive the item IDs plus the datasource they belong to, leading to (conceptually) completely unnecessary changes in the server plugin interface.

On the other hand, if we include the datasource in the item ID, the datasource would need to remove that prefix when loading items. Also, the tracker might want to store the datasource ID separately.

I would therefore suggest that we generally use combined item IDs (maybe just using the same separator as for field names, and renaming it from FIELD_ID_SEPARATOR to DATASOURCE_ID_SEPARATOR, or something), but introduce wrapper methods loadItem() and loadMultipleItems() in the index class which automatically take care of removing this prefix, so the datasource plugins don't have to deal with them. We should then just properly document this, and also come up with clear terminology to refer to the different types of item IDs to avoid confusion (same for the machine names).

The tracker could internally of course do what it wants – extract the datasource ID and store it separately, or don't care about it and use LIKE in queries that make it necessary. (Our implementation should of course still have a separate column, I do think that makes sense.)
However, we could also decide to let the tracker also use the "pure" IDs of the datasource, and continue to, e.g., return remaining item IDs grouped by datasource. When we have two kinds of IDs, ideally only the index would ever have to interact with its datasources and its tracker, so that the complexity of translating between the IDs would be entirely encapsulated in the index class. (Therefore, we'd need to create proxy methods on the index for all generally useful datasource methods, like getItemUrl() and viewItem().)

Any comments?

drunken monkey’s picture

#2244787: Ensure index state invariants always hold contains a new side track of this, as I came across Index::postSave() which didn't really do what it should.

Also, I'm finished for today and, most likely, the week. I sadly probably won't have much time the next days. If someone could, e.g., start the whole UI (fields, index create/edit), that would be great!

amateescu’s picture

Assigned: drunken monkey » amateescu

The suggestion from #19 makes sense to me, with the same disclaimer that I'm not yet familiar with the entire codebase :/

That said, I'm going to work on the UI bits today.

amateescu’s picture

Assigned: amateescu » Unassigned

Started some work on the Index form, not so much progress but I'll get back to it later in the weekend.

amateescu’s picture

Assigned: Unassigned » amateescu

I finished the index form yesterday, going to work on fields and filters today :)

amateescu’s picture

The fields page is done, started some work on processors but those need more thought.

drunken monkey’s picture

Sorry I didn't manage to look at this earlier. Your code so far looks mostly very good, though! Just two notes:

  • In my opinion, it's not necessary to set processors per datasource. Most have a list of fields to work on anyways, for the others we could introduce a "Datasources" option, where it makes sense.
  • Why is there validation code for datasource changes if they are forbidden anyways?

In any case, we can discuss more in half an hour!

amateescu’s picture

Assigned: amateescu » Unassigned

Thanks for reviewing :) As discussed today, we shouldn't update the filters page to set processors per datasource, so we just need to revert this commit: http://drupalcode.org/sandbox/daeron/2091893.git/commit/63e0ebda1d46133d...

About the validation code for datasources, that's a good question. It was there before and I just made it work.. :)

drunken monkey’s picture

OK, reverted that commit. Now working on rewriting the processors themselves – shouldn't be much, though.

However, I've already hit a snag: I'd imagined Index::getPropertyDefinitions() would now receive the properties for only one datasource at a time. Thus, we'd also only pass the properties of a single datasource at a time to the processors for altering.
What should a processor now do, though, if they are adding a new field for all datasources? For now, they'll need to add it to the properties array for all datasources separately. However, since we currently don't allow datasource-independent fields at all, there wouldn't be an easy way around this anyways, I guess.

Thinking about it, though, having datasource-independent fields is important for several different use cases, so we will probably have to re-add them before merging this into the master branch. Or maybe afterwards, to get this in as soon as possible? Creating a new issue, for now: #2247451: Allow datasource-independent fields.


Also, side note @ Andrei: Why did you comment out the cache set in Index::getFields()?

drunken monkey’s picture

Assigned: Unassigned » drunken monkey

Also²: Do the Fields forms already work for you? For me, they just throw fatals when submitting (either with "Save changes" or "Update" for the "Add related fields" subform).
If this isn't done, then that's something you could work at in parallel with me, if you want. As long as you only touch the fields form class, there shouldn't be any conflicts (otherwise, we'd have to coordinate).

amateescu’s picture

Why did you comment out the cache set in Index::getFields()?

Ugh, I committed that? Sorry :( I was debugging why the processors alter step from Index::getPropertyDefinitions() was making getFields() return an empty array and it was easier than clearing the cache all the time.

About the fields page, yep, adding regular and additional fields on a index with two datasources was working for me. I'll test again tomorrow morning and work only in that form if there's something to fix.

amateescu’s picture

I managed to get that fatal error after testing with different entity types and committed a fix for it.

drunken monkey’s picture

Ugh, I committed that? Sorry :( I was debugging why the processors alter step from Index::getPropertyDefinitions() was making getFields() return an empty array and it was easier than clearing the cache all the time.

Ah, OK. It's no problem, I was just wondering. (Personally, I always add "dpm" to lines I temporarily comment out, so I can quickly check for any forgotten debugging code by greping for that.)

drunken monkey’s picture

Assigned: drunken monkey » Unassigned

OK, this is now working quite well already, as far as I can see. I also wouldn't know of any areas that are still incompatible with the change.
However, I think there are still some fails in the tests which I couldn't quite figure out. Also, I've run out of time for this week, so it would be great if someone else could finalize this. Otherwise, we can try on Monday, then we should really have enough time I'd say. ;)

(Btw, I figured out some of my local fails. They were due to the wrong usage of paths in the tests, which lead to incorrect links on my installation (which isn't in the root directory of the domain). Such errors also appear at some places in our normal code, I think, at least until recently. Maybe we can fix that, too, in Zürich.)

amateescu’s picture

Assigned: Unassigned » amateescu

Fixed SearchApiLocalTasksTest last night and I'm working on fixing the others now.

amateescu’s picture

The last fails I'm hitting are actually caused by a core bug in config entity query: #2248567: Config entity query doesn't work when matching an array property

amateescu’s picture

Assigned: amateescu » Unassigned

Aaaaand we have a working branch!

drunken monkey’s picture

Excellent, thanks a lot! Looks great, too.
I only had one minimal code clarity fix, but other than that I think we can merge this now. Does anyone still want to have a look, or has some other objection?

amateescu’s picture

I think it's safe to merge it, everyone can express any objection next week at the sprint :)

drunken monkey’s picture

Status: Active » Fixed

OK, did that. So, this is finally fixed, yay! ;)

amateescu’s picture

Issue tags: -sprint

Woot!

Nick_vh’s picture

Awesome work guys!

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.