Hi Stephane,

I'm very interested in working on a plugin for the Virtuoso Open-Source triplestore. The ESIP Drupal Camp is in a few weeks and I'm hoping to do a code sprint there to work on this. I've been looking through the RDF Indexer code, and it looks like I need to write an implementation of SearchApiAbstractService, following the pattern of your RdfIndexerArc2StoreService.

Comments

scor’s picture

Hi Adam, this is great! Could you indicate which methods Virtuoso supports for receiving RDF data via HTTP? SPARQL Update, or something else? If there are different ways of sending RDF data to Virtuoso, it would be good to also compare their pros and cons (in particular in terms of performance) so we can decide which one is the most appropriate for RDF Indexer to use.

ashepherd’s picture

Issue tags: +virtuoso

I believe the version I have locally supports SPARQL Update (http://docs.openlinksw.com/virtuoso/rdfsparql.html). It may support other methods as well, so I will investigate and, if so, write multiple implementations to benchmark them. I imagine that somewhere in the configuration UI, I'll ask the Drupal user for a Virtuoso username/password that has rights to the SPARQL_UPDATE Virtuoso user group, but it looks like I can handle that with the SearchApiAbstractService methods. I'll keep this thread posted on progress, and feel free to get in touch anytime.

ashepherd’s picture

Category: feature » task

I've got a proof of concept working at: https://github.com/ashepherd/rdf_virtuoso

What's the best way to "collaborate"? Should I create a Drupal.org sandbox for this and go down the path of creating a separate module? Soon, I'll get a chance to test it out with some data to see how well it performs.

ashepherd’s picture

Using devel generate, I created 1,000 nodes, which took about 20 seconds on my laptop. I then indexed those and verified they exist in Virtuoso; the indexing took about 20 seconds as well. I have a Drupal site on a larger server with 500,000 nodes, so I'm going to try it out there.

{UPDATE}
My dev server, which runs a bunch of sites and is not optimized, updated a Virtuoso instance on the same server with 583,282 nodes in 2 hours and 12 minutes, averaging about 4,419 nodes per minute.

scor’s picture

These numbers look great. I need to look at your code further, but it looks to me like there is a lot of code duplication. The service class you implement should maybe extend the service class from rdf_indexer. I think I should abstract the current ARC2 service class into an rdf_indexer base class that each backend service class (ARC2, Virtuoso, OWLIM, etc.) would extend.
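
To illustrate the base-class idea, here is a minimal sketch. It is written in Python rather than the module's PHP, and every class and method name in it is hypothetical; the delete-before-insert flow is also an assumption, not a description of what rdf_indexer does.

```python
from abc import ABC, abstractmethod


class BaseRdfService(ABC):
    """Hypothetical base class: shared indexing flow, backend-specific storage."""

    def __init__(self, graph_uri):
        self.graph_uri = graph_uri

    def index_items(self, items):
        """Index a dict of {item_uri: [triple strings]}; return the URIs indexed."""
        indexed = []
        for item_uri, triples in items.items():
            self.delete_item(item_uri)    # assumed: drop stale triples first
            self.insert_triples(triples)  # backend-specific write
            indexed.append(item_uri)
        return indexed

    @abstractmethod
    def insert_triples(self, triples):
        """Write triples to the backend store."""

    @abstractmethod
    def delete_item(self, item_uri):
        """Remove all triples describing one item."""


class RecordingService(BaseRdfService):
    """Stand-in backend that records calls instead of talking to a store."""

    def __init__(self, graph_uri):
        super().__init__(graph_uri)
        self.calls = []

    def insert_triples(self, triples):
        self.calls.append(("insert", len(triples)))

    def delete_item(self, item_uri):
        self.calls.append(("delete", item_uri))
```

With this shape, each backend (ARC2, Virtuoso, and so on) would only override the two storage methods and inherit the shared flow.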

Also, if your code was made as a patch against rdf_indexer, you would just need to register your class in rdf_indexer_search_api_service_info() and not need to repeat the rest of the code. Looks like you managed to get all the code to deal with Virtuoso indexing, so this is great progress!

ashepherd’s picture

I really like your idea of a base class, which would remove a lot of my duplicated code. I'm happy to write the base class and submit a patch including the service.virtuoso.inc file with an updated rdf_indexer_search_api_service_info().

ashepherd’s picture

Here is the patch with a base class and the Virtuoso service.

In testing, I've found that the entity module is causing problems with:

  1. book module via: https://drupal.org/node/1330086
  2. anonymous content when using devel generate via: https://drupal.org/node/1237014

ashepherd’s picture

Here is an updated version of the patch:

It fixes a few Virtuoso issues:

  1. Queries might be longer than an HTTP GET request can handle, so I've converted this service to use HTTP POST.
  2. Virtuoso has a limit on the length of a query (the default is 10,000 lines of code). This only affects INSERT queries, since DELETE queries are small. I've made the module count the number of triples and issue multiple INSERT queries if necessary.
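
The two fixes above can be sketched roughly as follows. This is a Python illustration, not the patch's PHP: the function names, the 1,000-triple batch size, and the `update` form parameter (the name used by the SPARQL 1.1 Protocol; a given Virtuoso endpoint may expect something different) are all assumptions.

```python
from urllib.parse import urlencode
from urllib.request import Request


def chunk_triples(triples, max_per_query):
    """Yield successive slices of at most max_per_query triples."""
    for start in range(0, len(triples), max_per_query):
        yield triples[start:start + max_per_query]


def build_insert_queries(graph_uri, triples, max_per_query=1000):
    """Split one large insert into several INSERT DATA queries."""
    queries = []
    for chunk in chunk_triples(triples, max_per_query):
        body = "\n".join(chunk)
        queries.append(
            f"INSERT DATA {{ GRAPH <{graph_uri}> {{\n{body}\n}} }}"
        )
    return queries


def build_post_request(endpoint, query, param="update"):
    """Build (but do not send) a form-encoded HTTP POST carrying the query."""
    data = urlencode({param: query}).encode("utf-8")
    return Request(
        endpoint,
        data=data,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

Sending the query in the POST body avoids URL length limits, and capping the number of triples per query keeps each statement under the store's size limit.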

scor’s picture

Status: Active » Needs work

Thanks, Adam, for your continued work on this. I haven't reviewed the entire code in the patch, but it looks like it includes LICENSE.txt and rdf_indexer-add_virtuoso_support-2029717-8.patch (which explains the tripling in size).

Also, it's a good idea to have a separate file for each backend (as is the case for Virtuoso), but it would also make sense to have a separate class file for ARC2.

scor’s picture

Category: task » feature

ashepherd’s picture

Status: Needs work » Needs review
Status: new, file size: 10.46 KB

My apologies!

I separated out the ARC2 implementation into its own file, and I modified the logic for when deleteItems is passed $ids = 'all'. Originally, if I saw 'all' I was clearing the entire graph. However, if another index is also writing to the same graph (let's say someone set up separate indexes for nodes, taxonomy terms, or other entity types), it was wiping out the other index's data. I left the clearGraph() line in the code, but commented it out in service.inc.
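
One possible way to express that change, as a Python sketch rather than the module's PHP: instead of clearing the graph on 'all', delete only the subjects this index is known to have written. The function names and the `indexed_uris` bookkeeping are hypothetical, and the patch may handle this differently.

```python
def build_delete_queries(graph_uri, item_uris):
    """One DELETE WHERE per item: removes every triple with that subject."""
    return [
        f"DELETE WHERE {{ GRAPH <{graph_uri}> {{ <{uri}> ?p ?o }} }}"
        for uri in item_uris
    ]


def delete_items(graph_uri, ids, indexed_uris):
    """Resolve 'all' to this index's own items rather than the whole graph.

    ids: the string 'all' or a list of item URIs.
    indexed_uris: URIs this index has written (hypothetical bookkeeping).
    """
    targets = indexed_uris if ids == "all" else ids
    return build_delete_queries(graph_uri, targets)
```

Because no graph-wide clear is ever issued, triples written by another index into the same graph are left untouched.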

ashepherd’s picture

I'm an idiot. Forgot to run 'git add' before creating the patch.

ashepherd’s picture

Status: new, file size: 28.09 KB

Hi Steph,

Here is the latest version of the Virtuoso service. It patches branch 7.x-1.x and adds a base service class, which the ARC2 and Virtuoso services extend.

cheers, Adam

ashepherd’s picture

The latest version of the patch fixes a bug with unescaped forward slashes and adds support for xml:lang.

avpaderno’s picture

Assigned: ashepherd » Unassigned
Issue tags: -virtuoso