Unindexed string field in schema.xml… [#370436]

As you probably very well know, Solr supports unindexed, stored fields. So we could store, say, the path to an image or perhaps the result of some heavy computation in Solr's index for it to be available on the search results without having to generate it in Drupal.

The current Solr schema shipped with this module does not have such a field among its many useful dynamic fields. I propose that we add a 'ssufield' or something like that. String storage should be good for almost anything that doesn't need to be indexed, since it can be converted back to int, deserialised, etc. when it comes back into PHP.
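
To make the proposal concrete, a declaration along the following lines is roughly what is meant. It is only a sketch: the `ssu_*` prefix is a placeholder derived from the 'ssufield' suggestion above, not a name that exists in the shipped schema, and it assumes the schema already defines a `string` field type.

```xml
<!-- Hypothetical field name: the value is stored so it can be returned with
     search results, but it is not indexed, so it cannot be searched,
     filtered or sorted on. -->
<dynamicField name="ssu_*" type="string" indexed="false" stored="true"/>
```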

Although this field might not be used by the ApacheSolr module package itself, it will be very helpful to have a standard stored field to rely on when developing modules that interface with Solr.

Comment	File	Size	Author
#26	370436-26.patch	2.17 KB	nick_vh
#25	370436-25.patch	6.69 KB	nick_vh
#11	add_unindexed_dynamicfields-370436-11.patch	2.26 KB	craigmc
#10	add_unindexed_dynamicfields-370436-10.patch	1.02 KB	craigmc

Comments

Comment #1

pwolanin commented 7 February 2009 at 20:04

Yes, I can see the utility of this. For example, the _image module is storing an image path, which has no need for indexing.

Comment #2

pwolanin commented 8 February 2009 at 19:22

Let's combine with this issue: http://drupal.org/node/370707

Comment #3

pwolanin commented 9 February 2009 at 13:22

Status:	Active	» Closed (duplicate)

Comment #4

mikl commented 18 February 2009 at 06:57

Title:	Standard schema.xml could have a standard unindexed string field…	» Unindexed string field in schema.xml…
Status:	Closed (duplicate)	» Active

Ummm, sorry for not coming back to this earlier, but it appears to me that #370707: compact field names, use analysers for "order only" fields in schema.xml was closed without actually resolving this issue :)

Comment #5

pwolanin commented 18 February 2009 at 13:14

Given that a string field gets really zero processing when it's indexed, it wasn't clear to me that there's actually a useful benefit to storing but not analyzing them.

Comment #6

mikl commented 19 February 2009 at 09:42

Might that not introduce false positives? For example, if the stored data contains the word "lime" without that being relevant to (or present in) the visible node data, my node would show up in places where it doesn't make sense.

You guys probably know a lot more about this stuff than I do, so I might be wrong here, but in my mind, indexing something you don't want to be searchable doesn't make a whole lot of sense. There might also be security implications: if you were indexing data the user really wasn't supposed to be able to see, the user might be able to dig it out by futzing with the search query :)

Comment #7

pwolanin commented 19 February 2009 at 14:41

These string fields are not searched by default - you'd have to be adding them in custom code.

Comment #8

jpmckinney commented 22 July 2010 at 15:52

Status:	Active	» Fixed

It's very likely users will want to customize their Solr config files. I don't think it's the responsibility of this module to include all such customizations. As for the question of indexing image_path, this has been sufficiently well addressed.

Comment #9

5 August 2010 at 16:00

Status:	Fixed	» Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Comment #10

craigmc commented 21 December 2011 at 23:50

Assigned:	Unassigned	» craigmc
Status:	Closed (fixed)	» Needs review

Status	File	Size
new	add_unindexed_dynamicfields-370436-10.patch	1.02 KB

I think this is worth including in the form of dynamic fields.

I've done a lot of work recently with converting over from views-based searches to SOLR-based, and instead of having to re-theme everything, you can just pre-bake the rendered view content for the given field into the SOLR index. No point using the processing to actually index the content, but worth storing in SOLR so you can just output it when you retrieve the node.

Sure you need to create a custom hook_apachesolr_update_index to take advantage of this, but it's so useful, I don't see the harm in including it directly in the schema.xml that comes with this module.

Patch attached below against 6.x-1.x-dev

Comment #11

craigmc commented 21 December 2011 at 23:55

Version:	6.x-1.x-dev	» 7.x-1.x-dev

Status	File	Size
new	add_unindexed_dynamicfields-370436-11.patch	2.26 KB

Added version of schema.xml changes for 7.x-1.x-dev as well

Comment #12

nick_vh commented 22 December 2011 at 00:48

There is already one kind of unindexed field present in the schema:

   <!-- Binary fields can be populated using base64 encoded data. Useful e.g. for embedding
        a small image in a search result using the data URI scheme -->
   <dynamicField name="xs_*"  type="binary"  indexed="false" stored="true" multiValued="false"/>
   <dynamicField name="xm_*"  type="binary"  indexed="false" stored="true" multiValued="true"/>

If you really don't want to have your data analyzed you could use this field, with base64_encode when indexing and base64_decode when you get it back from the index.
The text fields do kind of make sense, but why should we only support string and text fields? Why not long and slong?

As jpmckinney said, nothing stops you from modifying the schema and I guess this case can also be handled with the xs_* field?

Comment #13

craigmc commented 22 December 2011 at 00:59

Nick--
Fair point, although I don't see the benefit of having to base64_decode something vs just reading it directly from the $result object.

As for long and slong, that would be easy enough to add. I figured Strings are the most likely to be useful as non-indexed fields, since number fields are so frequently used for filters and sorts.

Comment #14

pwolanin commented 22 December 2011 at 04:35

Status:	Needs review	» Needs work

Well, there is really no point in using text for a non-indexed field, since what's the point of analyzing it?

Also, since you are never going to sort on a non-indexed field, there is no real need to have a single-valued version (other than for mental clarity I guess).

Comment #15

craigmc commented 22 December 2011 at 06:08

pwolanin--
couple potential use-cases for non-indexed text:

Pre-rendered HTML
file path, or other non-searchable text
Random text, e.g. hash text for API lookup, etc.

Seems like placing text more specifically in SOLR and only indexing on items that will actually be searched on is more performant than the server overhead of SOLR running indexes against fields that will never be searched/filtered on.

Comment #16

nick_vh commented 23 December 2011 at 13:12

Just a comment: I think the prefix for the field should not be noindex. Something more along the lines of ssn (string, single, not indexed)?

The pre-rendered HTML is actually a good use case because I've seen it happen before. It can be used to have a very basic multisite between Drupal 6 and 7 where the search result is saved in Solr as HTML and outputted as a result with links to the other site.

Comment #17

pwolanin commented 23 December 2011 at 13:33

The only reason to use text instead of string is if you are going to search the field.

Comment #18

nick_vh commented 28 December 2011 at 10:33

Status:	Needs work	» Closed (works as designed)

Marking as closed. If you really need this field there is nothing that stops you from adding it to your own schema. If you could prove that the performance benefits are much higher for unindexed strings vs indexed strings/texts we'd love to see this!

Comment #19

craigmc commented 28 December 2011 at 18:31

Nick_vh -- why the change of heart between #16 and #18?

I'm not readily coming up with any benchmarks on amount of processing time to ingest an indexed vs. non-indexed field. Perhaps I can generate some tests one way or another.

I just can't imagine that adding 1-2 extra lines to the schema that SOLR uses vs. the cost of the tokenizing and other transforms that happen to indexed fields vs non-indexed fields would be a wash. When a field is indexed, SOLR treats it in very specific ways, and effectively optimizes delivery of search results along that chain of inquiry. If you know in advance you will never use that chain of inquiry for a given field, why not notify SOLR ahead of time?

This might not make a difference until you're in really high node count territory, 1M+ or whatever, but since we're trying to position Drupal for the enterprise, seems like useful discussion at least.

And yes, I know I can modify the schema.xml myself. The reason I'm trying to participate in this discussion is to argue for including this behavior in the contributed module so other people can use the same lessons learned to help their sites.

This doesn't affect any behavior from the module itself, and will only really apply to people who are already using hook_apachesolr_update_index. It seems like pretty low-hanging fruit to include this. I'm happy to help supplement the documentation to explain how/where this is useful.

Ultimately it's the will of the community, and since this issue doesn't have much action, maybe I'm just advocating for a really fringe use-case, but it still seems like a valid discussion to have.

Comment #20

nick_vh commented 28 December 2011 at 21:52

Status:	Closed (works as designed)	» Active

First of all I really admire your determination in getting this issue forward.

> Nick_vh -- why the change of heart between #16 and #18?

I closed it because of the lack of interest and because pwolanin also expressed his opinion against this. This is a 1% use case and the schema.xml is something that is rarely modified. Adding 4 fields does not cause trouble but it does not help the module either. Nothing is known yet to be using these fields.
Another important point is that you have to remember Solr is used as an indexing and searching engine. If you really need to fetch or process data, this should not happen at the Solr level. It should not become a database. Not adding these fields to some extent prevents this from happening and discourages such use.

> I'm not readily coming up with any benchmarks on amount of processing time to ingest an indexed vs. non-indexed field. Perhaps I can generate some tests one way or another.

That would be great.

> I just can't imagine that adding 1-2 extra lines to the schema that SOLR uses vs. the cost of the tokenizing and other transforms that happen to indexed fields vs non-indexed fields would be a wash. When a field is indexed, SOLR treats it in very specific ways, and effectively optimizes delivery of search results along that chain of inquiry. If you know in advance you will never use that chain of inquiry for a given field, why not notify SOLR ahead of time?
>
> This might not make a difference until you're in really high node count territory, 1M+ or whatever, but since we're trying to position Drupal for the enterprise, seems like useful discussion at least.

Same here: if you are running with 1M+ nodes, wouldn't you try to minimize the data you send to and receive from Solr, rather than add extra weight to it that could cause slowdowns? Just a thought. But don't get me wrong, if it works for your use case then I'm very happy!

> This doesn't affect any behavior from the module itself, and will only really apply to people who are already using hook_apachesolr_update_index. It seems like pretty low-hanging fruit to include this. I'm happy to help supplement the documentation to explain how/where this is useful.

I can't disagree on this, I'd love to see more documentation on how and why this is useful. This could potentially also convince pwolanin.

> Ultimately it's the will of the community, and since this issue doesn't have much action, maybe I'm just advocating for a really fringe use-case, but it still seems like a valid discussion to have.

Then we'll discuss further ;-)

Comment #21

pwolanin commented 29 December 2011 at 14:50

re: #15, you clearly are not understanding string vs. text. There is no length limit to string, the difference is in processing.

Comment #22

craigmc commented 31 December 2011 at 08:13

Nick_vh -- to me this addresses some useful use cases. I think that people are using views too much to provide search capability, and views are a poor model for this. SOLR is a great replacement, and documenting this type of use case is something that will help others in the community to more quickly repurpose their existing codebase and logic for a more scalable and performant solution using SOLR.

I think the main argument I see for pre-rendered HTML being stored in SOLR is that it can basically act as a kind of cache layer. If you're theming search results individually, e.g. using search-result.tpl.php and individual fields, then you have to return all of those fields from SOLR, load the theme layer, perform theming logic, etc. If you've pre-baked that content and are using SOLR to return it to you, you avoid the processing work, and can leverage SOLR not just for the search logic, but as a cache layer as well, to offload further processing from your Drupal site.

To me, having a high speed, non-RDBMS way of retrieving node data is hugely useful. It opens up the possibility of content analysis, retrieval, etc. If you keep the payload as trim as possible for indexed/processed content, and overload the document objects with fields as necessary, then you can support a large amount of business logic without ever having to hit the DB.

pwolanin-- maybe you can chime in on your opinion on this. Seems like in comment #1 you were in agreement with the person that started this thread. Am I misunderstanding here?

#21 pwolanin-- you misunderstood my comment. I was talking about stored non-indexed content vs indexed content. Yes, there are some vagaries of text vs. string fields in SOLR which may warrant further discussion, but I was talking about use cases for storing textual content in a non-indexed fashion. If you re-read my comment in that light, I think it might be a little more useful to the discussion.

Maybe it doesn't make sense to have both stored non-indexed text fields and non-indexed string fields, and one of those field types makes more sense than the other... I'm fine to do more research on my side and figure it out, or pwolanin, please feel free to contribute your thoughts on this matter since you have a clearer grasp of how these two fields are handled differently in SOLR.

Comment #23

nick_vh commented 31 December 2011 at 11:12

For documentation purposes (found in the schema.xml)

The StrField type is not analyzed, but indexed/stored verbatim.
StrField and TextField support an optional compressThreshold which limits compression (if enabled in the derived fields) to values which exceed a certain size (in characters).
Note: indexed="true" for a string field means that the field is searchable, but only on an exact (100%) match. It does not take more space as far as I could understand.

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

A text field that uses WordDelimiterFilter to enable splitting and matching of words on case-change, alpha numeric boundaries, and non-alphanumeric chars, so that a query of "wifi" or "wi fi" could match a document containing "Wi-Fi".
Synonyms and stopwords are customized by external files, and stemming is enabled. Duplicate tokens at the same position (which may result from Stemmed Synonyms or WordDelim parts) are removed.

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
... (lots of processing and filters)
</fieldType>

Maybe we should run some experiments and fill up a Solr index with 10,000 nodes and their respective rendered HTML output, once in an indexed string field and once in a non-indexed string field. If it appears that this does not make a difference I would keep the schema as it is. If the size of the index increases and/or we notice slowdowns (using ApacheBench, ab, and the default solrconfig.xml) we can decide to add those fields.

Sounds like a good idea?

Comment #24

pwolanin commented 31 December 2011 at 19:56

I *think* making the string field indexed will always increase the size of the index, since you must at least maintain the appropriate Lucene inverted index entry for it to look up each doc containing the matching string.

I don't really object to adding non-indexed string fields in our standard schema - when talking last to Nick I suggested we might use e.g. 'zs_*' and 'zm_*' rather than 3 or 4 letter prefixes.
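
For illustration, fields with those prefixes could be declared by mirroring the existing `xs_*` / `xm_*` binary fields quoted in #12 and swapping in the `string` type. This is a sketch of the idea under that assumption, not a copy of any patch attached to this issue.

```xml
<!-- Sketch only: unindexed, stored string fields, single- and multi-valued,
     modelled on the xs_* / xm_* declarations already in the schema. -->
<dynamicField name="zs_*"  type="string"  indexed="false" stored="true" multiValued="false"/>
<dynamicField name="zm_*"  type="string"  indexed="false" stored="true" multiValued="true"/>
```

A module could then write, say, pre-rendered HTML to a field such as `zs_rendered` (a made-up name) and read it straight back from the result document, with no base64 decoding step.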

Comment #25

nick_vh commented 2 January 2012 at 09:42

Status	File	Size
new	370436-25.patch	6.69 KB

So, I've added these fields in analogy to the previous patch (but without the text field) to the schema and molded this into a patch.
This has been quite some discussion so hopefully we can see good and useful use cases from this field in the future.

The patch is also removing some trailing whitespace at line endings, apparently, but I can't see any problem with that so I'll leave it like this.

Comment #26

nick_vh commented 6 January 2012 at 11:02

Status	File	Size
new	370436-26.patch	2.17 KB

Without the whitespace fixes

Comment #27

nick_vh commented 6 January 2012 at 11:04

Status:	Active	» Fixed

And committed. Changed the tag of the schema to beta14 so it will be ready for the next release

Comment #28

nick_vh commented 6 January 2012 at 21:43

Status:	Fixed	» Closed (fixed)
