Use Case

The Search API shall be used to provide search over an external system.
The external system is accessed via SOAP.

Architecture

The application stack would look like this:

- Facet API
	- Search API
		- Search API integration of external system. (Like search_api_solr)
			- Client to handle the connection etc. to the external system. (Like SolrPhpClient)

The external system has special needs and has to be configurable (WSDL, credentials), so a SearchApiService is created and registered with hook_search_api_service_info.
And because the external system is a non-entity data source, a dedicated SearchApiDataSource is defined and registered with hook_search_api_item_type_info.
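
The two registrations described above could look roughly like the following sketch. The module name external_data, the class names, and the labels are hypothetical; only the two hook names come from the description. In a real module the labels would normally be wrapped in t().

```php
<?php
// Sketch only: "external_data" and both class names are hypothetical.

/**
 * Implements hook_search_api_service_info().
 */
function external_data_search_api_service_info() {
  $services = array();
  $services['search_api_external_data_service'] = array(
    'name' => 'External data service',
    'description' => 'Searches an external system via SOAP.',
    // Hypothetical service class; its configuration would hold the WSDL
    // location and the credentials.
    'class' => 'SearchApiExternalDataService',
  );
  return $services;
}

/**
 * Implements hook_search_api_item_type_info().
 */
function external_data_search_api_item_type_info() {
  $types = array();
  $types['external_data'] = array(
    'name' => 'External data',
    // Hypothetical data source controller for the non-entity items.
    'datasource controller' => 'SearchApiExternalDataDataSourceController',
  );
  return $types;
}
```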

Issue - chicken-and-egg problem

The SearchApiDataSource provides metadata via getPropertyInfo() and other methods. In this use case, however, the metadata lives in the external system, so we have to connect to that system to fetch it. The connection settings and methods are kept separately in the specialized SearchApiService and in the index configuration, so the data source has to reach them somehow. Unfortunately, the SearchApiDataSource doesn't know which index it is currently acting on, and without that information it cannot determine which service and configuration to use.

Suggested solution

The only way I see to solve this is to make SearchApiDataSource instances index-aware.
This would change some of the current usage patterns:

  • Always pass a SearchApiIndex to the SearchApiDataSource constructor.
    Remove the type parameter - we can use $index->item_type instead.
  • Remove the index parameter from these methods (and adapt the code that calls them):
    • startTracking()
    • stopTracking()
    • trackItemInsert()
    • trackItemChange()
    • trackItemQueued()
    • trackItemIndexed()
    • trackItemDelete()
    • getChangedItems()
    • getIndexStatus()

To verify that I missed nothing in the description above, I rewrote the whole code according to my suggestion. The attached patch was created against the facetapi branch of the dedicated sandbox.
With all these changes in place, I'm now able to do something like $this->index->server()->ping() in SearchApiDataSource::getPropertyInfo() :)
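
To make the proposed relationship concrete, here is a minimal, self-contained sketch. The three classes are simplified stand-ins, not the real Search API classes or the attached patch; they only model the chain datasource -> index -> server, and the property list is made up.

```php
<?php
// Simplified stand-ins, NOT the real Search API classes.

class SketchServer {
  public $options;

  public function __construct(array $options) {
    $this->options = $options;
  }

  // Stand-in for a connectivity check against the external system.
  public function ping() {
    return !empty($this->options['wsdl']);
  }
}

class SketchIndex {
  public $item_type;
  private $server;

  public function __construct($item_type, SketchServer $server) {
    $this->item_type = $item_type;
    $this->server = $server;
  }

  public function server() {
    return $this->server;
  }
}

class SketchIndexAwareDataSource {
  protected $index;

  // The index replaces the former type parameter.
  public function __construct(SketchIndex $index) {
    $this->index = $index;
  }

  // The type is read from the index instead of being stored separately.
  public function getType() {
    return $this->index->item_type;
  }

  public function getPropertyInfo() {
    if (!$this->index->server()->ping()) {
      return array();
    }
    // A real implementation would ask the external system for its fields
    // here; this single property is invented for the example.
    return array('title' => array('label' => 'Title', 'type' => 'text'));
  }
}

$server = new SketchServer(array('wsdl' => 'http://example.com/service?wsdl'));
$index = new SketchIndex('external_data', $server);
$datasource = new SketchIndexAwareDataSource($index);
```

Because the data source holds the index, tracking methods such as trackItemInsert() would no longer need an index argument in this design.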

Attached file: search_api-chicken-egg-issue.patch (34.32 KB, uploaded by das-peter)

Comments

drunken monkey commented:

Huh. Wow, that's quite some patch …
I'm really reluctant to a) make such a huge API change and b) weaken the decoupling of data source and index. Also, the type is supposed to stay fixed throughout the Search API. It would be weird if the type (or at least all of its properties) could change when you change the server. From an architectural point of view, the data source just shouldn't depend on the index it is used by.

What if you just encoded the connection information (e.g., the server's machine name, or some other mechanism) in the item type name? That would be much cleaner anyway (as the different servers seem to represent different types after all).

das-peter commented:

Thanks for your feedback.
I absolutely understand your perspective.
However let me tell you what my thoughts were before I decided to create this patch.

The index is the glue between the datasource (item type) and the server - and while I can't imagine a use case where the server has to use the index, the datasource makes extensive use of the index.

It uses the index directly in the methods I listed above, but also indirectly via getMetadataWrapper(), which is used in several places in SearchApiAbstractDataSourceController.
And not to forget: getMetadataWrapper() calls getPropertyInfo(), the very place where my chicken-and-egg problem sits.

Looking at this from a higher level, it seems to me that the current design is too Drupal/entity-centric.
As long as we use only the entity datasource, the type is enough information to do everything necessary in the datasource (fetching entity metadata) - but only because we're acting within the framework we build on. There, the only "glue" needed is the type.
Of course, relying on just the type has some advantages in terms of convenience within the framework (creating an array of indexes with the same type and passing them as a parameter).

For me, the external datasource use case reveals the need for a relation instead of a decoupling.
The datasource needs the index in any case - at the moment it's just called type because it's all that's necessary to cover the framework internal use cases.
Switching from type to index won't break the framework internal use case, simply because the necessary type information is embedded in the index.

Changing this "decoupling" to a real relation might look like losing flexibility from a code perspective, but to me it seems more like gaining flexibility in terms of use cases.

drunken monkey commented:

Title: Use case incl. patch for external systems » Make data source controllers index-dependent
Category: bug » feature

Sorry, but I still don't really see why there should be a dependency, or why you can't just encode the information in the type.
It's true that the abstract data source controller calls getMetadataWrapper() in several places, but a) that's only my suggestion for a default implementation, it's not mandatory, and b) what has that got to do with the index? It just calls its own method. If you use external information (from the server, as it seems, not from the index) in there, you are allowed to do so, but you should keep the information/key for that in the item type. That's what the type is there for (identifying the kind of data to index), after all, it wouldn't have any use otherwise.

das-peter commented:

If I understand you correctly, you suggest doing something like this:

function external_data_search_api_item_type_info() {
  $types = array();
  $servers = search_api_server_load_multiple(FALSE, array('class' => 'search_api_external_data_service'));
  foreach ($servers as $server) {
    $types['external_data:' . $server->id] = array(
      'name' => 'External data on ' . $server->name,
      'datasource controller' => 'SearchApiExternalDataDataSourceController',
    );
  }
  return $types;
}

And in the data source:

// end() takes its argument by reference, so the exploded parts need a variable.
$parts = explode(':', $this->type);
$server = search_api_server_load(end($parts));

I agree this is absolutely doable.
A nasty downside I see is that you have to explain to the user that he has to select the item type according to the server he wants to use - and once that's done, there's no way back: at the moment you can't change the item type later on.
Besides that, the advantage of being able to collect indexes with the same item type is gone anyway (albeit only for the external data integration); the same applies to the datasource controller caching in search_api_get_datasource_controller().

In my eyes it's still better to have a dependency in the code than to depend on the knowledge of a user.

drunken monkey commented:

Hm, you are right about the UX there, hadn't thought about that.
However, my thinking was that such external data sources are almost always a customization for a certain site, with the site builders known, or even identical to the developers, and the search configuration maybe already stored in code.
If you need this for more untrained users, though, I agree it's a problem. However, you could rather easily overcome this problem with some slight modifications/altering to the index create and edit forms, so that the user is automatically directed to the right selections.
You certainly have a point, though. I'll have to contemplate this further.

By the way, I mentioned this issue in the project announcements - maybe someone else wants to chime in and convince either of us. ;)

das-peter commented:

Good idea to add this discussion to the announcements - feedback from other devs with different use cases would be really helpful.
Do you know if anyone else has worked with their own data sources?

Regarding your suggestion to alter the configuration form, my objection is this:
Why should we prefer a solution which makes it more complex to extend the functionality instead of changing to a design that has better support for extending and doesn't seem to have other downsides (yeah, I know the decoupling - but as of my description in #2 I consider this argument as invalid ;D )

Hmm, I guess I don't have any argument left - now we need a negotiator :P

Akaoni commented:

I'm not sure I'm quite across what you're trying to do here, but...
Isn't SearchApiDataSource aware of the Item Type which is, in turn, aware of SearchApiService?

I think I'm working on something similar which has an instance of SearchApiDataSource fetching all non-Drupal indexes from the search server. These indexes are then available to users for use as read-only Search API indexes.

My in-progress code for this is:

protected function getPropertyInfo() {
...
  $type_info = search_api_get_item_type_info($this->type);
  $server = search_api_server_load($type_info['server']);
  $database = $type_info['database'];

  $idol_request = 'http://' . $server->options['host'] . ':' . $server->options['port'] . '/?action=query&text=*&maxresults=1&databasematch=' . urlencode($database);
...
}

Note: Obviously, this also involves creating Item Types with hook_search_api_item_type_info().

Useful?

Edit: I just had a proper look into this again and the only reason Item Type is aware of SearchApiService is because I added two non-API values for server and database. Worth adding something similar as optionals to the API?

das-peter commented:

@Akaoni: Thank you very much for your participation. Glad to see someone else has similar needs :)

The approaches in #4 and #7 are quite similar.
At least they suffer from the same problem - they require special knowledge from developers and UI users, as already described in #4.

Akaoni commented:

By gum, very similar indeed!!
Teach me to only half read an issue before posting. Sorry. :/

The only thing mine adds is the ability for one server to have multiple types (external datasets) attached to it.

Will put some thought into the problem you described in #4.

drunken monkey commented:

Thanks a lot for chiming in!

> Edit: I just had a proper look into this again and the only reason Item Type is aware of SearchApiService is because I added two non-API values for server and database. Worth adding something similar as optionals to the API?

I think I already noted in the hook documentation that people are free to add any other keys they want. Your example is, in my opinion, an excellent use of that, encoding the server information that way.
It would also allow changing the server the type is associated with later on (even though I'd consider that a very bad idea).
As a side note, I hope you're using the server's machine name for $type_info['server'], not its ID - this is a slight flaw in #4, which will cause problems when used with Features.

@das-peter:

> Why should we prefer a solution which makes it more complex to extend the functionality instead of changing to a design that has better support for extending and doesn't seem to have other downsides (yeah, I know the decoupling - but as of my description in #2 I consider this argument as invalid ;D )

It does have downsides:
- Worse performance for the "normal" use case (where we can't pass in all indexes at once anymore).
- Bad architecture (which, in my opinion, still stands, and could well cause us some headaches later).

While I do want to support your use case, it's not the traditional one, and it violates the assumption/rule that indexes should be independent of their server (possibly except for defined service class features). This will cause chaos anyway if a user tries to change the index's server, so I guess you need to prevent that in any case. Adding a workaround for half of the problem to the framework itself just masks it, and makes the framework more interdependent along the way.
Consider, for example, someone who just wants some additional information on a type, for whatever reason. With the proposed architectural change they'd have to create a mock-up index just to do that.

But of course, maybe there's a different solution altogether. We're rooting for you, Akaoni! ;)

Akaoni commented:

Most welcome!! ;)

> I hope you're using the server's machine name for $type_info['server'], not its ID

Yep:

$types[$server->machine_name . '|' . $database_id] = array(
  'name' => $database_name,
  'server' => $server->machine_name,
  'database' => $database_id,
  'datasource controller' => 'SearchApiIDOLDataSourceController',
);

Akaoni commented:

I've started a sandbox project called "Search non-Drupal indexes":
http://drupal.org/sandbox/Akaoni/1327976

This is an add-on where each server whose service has supportsFeature('search_api_non_drupal') == TRUE populates an item type for each non-Drupal index on that server. It does this through two new service functions getNonDrupalIndexes() and then getNonDrupalProperties($index_name).
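
As a rough illustration of how such an add-on could populate item types, here is a self-contained sketch. Only supportsFeature(), getNonDrupalIndexes() and getNonDrupalProperties() are named in the comment above; the stub service, the return shapes, the helper function, and the controller name are assumptions and not taken from the sandbox code.

```php
<?php
// Hypothetical sketch; class, helper and return shapes are assumptions.

class SketchNonDrupalService {
  public function supportsFeature($feature) {
    return $feature == 'search_api_non_drupal';
  }

  // Assumed shape: index ID => human-readable name.
  public function getNonDrupalIndexes() {
    return array('news' => 'News archive', 'docs' => 'Documents');
  }

  // Assumed shape: property name => info array.
  public function getNonDrupalProperties($index_name) {
    return array('title' => array('label' => 'Title', 'type' => 'text'));
  }
}

/**
 * Builds one item type per non-Drupal index of each capable server.
 *
 * @param array $services
 *   Service objects keyed by server machine name.
 */
function sketch_non_drupal_item_types(array $services) {
  $types = array();
  foreach ($services as $machine_name => $service) {
    if (!$service->supportsFeature('search_api_non_drupal')) {
      continue;
    }
    foreach ($service->getNonDrupalIndexes() as $index_id => $index_name) {
      // Same key scheme as the item type definition quoted earlier.
      $types[$machine_name . '|' . $index_id] = array(
        'name' => $index_name,
        'server' => $machine_name,
        'database' => $index_id,
        'datasource controller' => 'SketchNonDrupalDataSourceController',
      );
    }
  }
  return $types;
}

$types = sketch_non_drupal_item_types(array('idol_server' => new SketchNonDrupalService()));
```

The data source controller could then look up the 'server' and 'database' keys of its type and call getNonDrupalProperties() on that server's service.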

This is all still pretty basic and the UI is hackish, but it does work.
I'm thinking the next step would be to create a secondary UI specifically for non-Drupal indexes. E.g.:

  1. click Add non-Drupal index
  2. select server and index
  3. select fields to use
  4. etc...

My IDOL search service implements this add-on:
http://drupal.org/sandbox/Akaoni/1240206

Thoughts? Suggestions? Offers to co-maintain?

das-peter commented:

Status: Needs review » Closed (won't fix)

I think this is stuck and I got over it ;) Hope no-one minds if I close with "won't fix".

das-peter commented:

Issue summary updated: fix markup