Just to get a first grip, here's a diff of the two (3.x) schema.xml configs. The schemas probably should not be 100% identical, but let's see what we can learn from each other.

Comments

nick_vh’s picture

Attachment (new): 29.57 KB
nick_vh’s picture

Status: Active » Needs work
+++ b/solr-conf/schema-solr3x.xml
@@ -34,10 +34,12 @@
-    <fieldType name="string" class="solr.StrField" indexed="true" stored="true" sortMissingLast="true" omitNorms="true"/>

This applies to all the field types defined in the Search API schema: it's not necessary to define indexed/stored on the type if that already happens in a dynamic field definition.
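
To illustrate the point about indexed/stored, here is a minimal sketch. The `ss_*` dynamic field name is only an assumed example and is not taken from either schema as quoted in this issue.

```xml
<!-- The type only declares behaviour shared by every field that uses it. -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

<!-- indexed/stored are set where the field is actually declared.
     The "ss_*" pattern is a hypothetical example name. -->
<dynamicField name="ss_*" type="string" indexed="true" stored="true" multiValued="false"/>
```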

+++ b/solr-conf/schema-solr3x.xml
@@ -34,10 +34,12 @@
+    <fieldtype name="binary" class="solr.BinaryField"/>

This was added to allow base64 content. It makes the schema more complete, and Display Suite is even using it.

+++ b/solr-conf/schema-solr3x.xml
@@ -52,15 +54,11 @@
+    <!-- numeric field types that can be sorted, but are not optimized for range queries -->
+    <fieldType name="integer" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
+    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
+    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>

http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-sol...

"Finally, I tried out the Trie stuff by changing the query to use integer_tri. The results are the same as the sortable stuff, but, in my totally informal testing, the Trie stuff is a lot faster (others have done more formal testing, so I feel comfortable with my results.) Good news!"

+++ b/solr-conf/schema-solr3x.xml
@@ -138,9 +136,10 @@
+        <filter class="solr.LowerCaseFilterFactory"/>

Not sure when this was added, but it probably solved an existing problem :-)

+++ b/solr-conf/schema-solr3x.xml
@@ -186,21 +175,22 @@
-        <!-- <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/> -->
-        <!--[[SnowballPorterFilterFactory]]-->

Why was this commented out? It seems like we'd want stemming for English by default?

Have to run now, but I'll do more later

cpliakas’s picture

Thanks Nick. This will be a tricky one. The diff is much appreciated, and thanks for the explanations.

cpliakas’s picture

Component: Code » schema.xml

Adding new component flag.

drunken monkey’s picture

Status: Needs work » Active

Yes, that really is a tough one. I'd have said we should leave that for when we have the solrconfig.xml figured out, but I guess we might as well start now. However, we probably should split this as well?

In any case, I guess having common type definitions might be a good subordinate goal, and an achievable one.
For the actual field definitions, a quick look tells me there won't be much chance of those being unified. The approaches are just too different, as far as I can tell. Maybe unifying some fields, e.g., the required ones, might be possible, but I'm not sure how much of an advantage that would bring us.

+++ b/solr-conf/schema-solr3x.xml
@@ -52,15 +54,11 @@
+    <!-- numeric field types that can be sorted, but are not optimized for range queries -->
+    <fieldType name="integer" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
+    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
+    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>

Search API Solr also uses Trie fields, but we didn't rename the type. This way, people can still use the "normal" types, even though by default only Trie fields are used.
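
For comparison, a sketch of how the two flavours can sit side by side, following the convention of the stock Solr 3.x example schema, where a `t` prefix marks the range-optimized variant. The `tint` name and the precisionStep of 8 are assumptions borrowed from that example schema, not from either module's patch.

```xml
<!-- precisionStep="0": one term per value; sortable, but range queries are not optimized -->
<fieldType name="integer" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>

<!-- precisionStep="8": extra terms indexed per value; faster range queries, larger index -->
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
```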

Not sure when this was added, but it probably solved an existing problem :-)

Well, you'll only very rarely want case-sensitive matching in fulltext searches, so I guess that makes sense. And since this field type isn't used by default, I guess adding it to the Search API side wouldn't hurt, either.
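
A minimal sketch of where such a filter sits in an analyzer chain. The type name `text_example` and the choice of tokenizer are assumptions for illustration only; the actual type in the patch has more filters.

```xml
<fieldType name="text_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split the input on whitespace first -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- then lowercase every token, so "Drupal" and "drupal" match -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the analyzer has no `type` attribute, the same chain is applied at index time and at query time, which is what makes the matching case-insensitive.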

Why was this commented out? It seems like we'd want stemming for English by default?

Why? Why should English be the default?
In the end, no matter what language we choose, most people will have to change it. Maybe some won't need any setting at all, and in any case that will (very probably) be much better than having stemming for a wrong language.
These are roughly the thoughts that led me to that decision.

nick_vh’s picture

Regarding the English stemming: I think we should design for the 90%, and those who want to modify it can do so afterwards. Having a sensible default only seems logical to me. On a side note, I am a native Dutch speaker, but I do expect things to work as fluently as possible in English, with a set of clear instructions somewhere explaining how to modify it to support another language.

The whole multilingual story is still an issue anyhow and is not easily solved. Maybe this module can provide a couple of languages, or we could set up a site that generates your schema, similar to what we discussed at DrupalCon London.
I know that Typo3 has a repository with 20+ schemas (see https://svn.typo3.org/TYPO3v4/Extensions/solr/trunk/resources/solr/typo3...), plus some stopwords per language.

nick_vh’s picture

Status: Active » Needs work
+++ b/solr-conf/schema-solr3x.xml
@@ -383,6 +504,9 @@
+ <!-- field for the QueryParser to use when an explicit fieldname is absent -->
+ <defaultSearchField>content</defaultSearchField>
+
  <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
  <solrQueryParser defaultOperator="AND"/>
 

Adding a defaultSearchField is always useful when you want to debug Solr directly in the admin interface.
We should probably discuss this, since Search API does not have such a field.

+++ b/solr-conf/schema-solr3x.xml
@@ -208,19 +198,55 @@
+    <fieldType name="edge_n2_kw_text" class="solr.TextField" omitNorms="true" positionIncrementGap="100">

"Set to true to omit the norms associated with this field (this disables length normalization and index-time boosting for the field, and saves some memory). Only full-text fields or fields that need an index-time boost need norms." Auto-complete needs neither boosting nor full-text search, so norms can be omitted here.

We'll get back to this after we've done the solrconfig. Changes here are less critical, as cpliakas mentioned.
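
For context, a rough sketch of what an auto-complete type of this kind can look like. Only the opening `fieldType` tag comes from the patch; the analyzer contents and the gram sizes are assumptions.

```xml
<fieldType name="edge_n2_kw_text" class="solr.TextField" omitNorms="true" positionIncrementGap="100">
  <analyzer type="index">
    <!-- keep the whole value as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index prefixes of 2 to 25 characters (sizes are assumed) -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <!-- the typed prefix is lowercased but not split into grams -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```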

drunken monkey’s picture

Regarding the English stemming: I think we should design for the 90%, and those who want to modify it can do so afterwards. Having a sensible default only seems logical to me. On a side note, I am a native Dutch speaker, but I do expect things to work as fluently as possible in English, with a set of clear instructions somewhere explaining how to modify it to support another language.

I highly doubt that 90% is anywhere near the right figure.

Adding a defaultSearchField is always useful when you want to debug Solr directly in the admin interface.
We should probably discuss this, since Search API does not have such a field.

Since there is no field like "content", which will always contain sensible data, in Search API Solr, we cannot easily do that. The best we could do is specify "id" or something like that, to at least avoid errors. Don't know if that makes much sense, though, as the error more clearly shows that you should explicitly specify a field.
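
As a sketch of the fallback mentioned above, assuming an `id` field is present in every document:

```xml
<!-- Assumed fallback: "id" exists in every document, so unqualified
     queries no longer raise an error, although they will rarely
     match anything useful. -->
<defaultSearchField>id</defaultSearchField>
```

Either way, a query typed into the admin interface can always name its field explicitly, in the form `fieldname:term`.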

nick_vh’s picture

I've been talking to a lot of people about this initiative and they all vote yes to have this. Now, I've also asked if a default English would be a problem or not and not a single one of them was against the idea of having a default English schema.

I really want to stress the importance of an English default schema, and of moving the multilanguage question to a separate issue. This way we can enable the stopwords/protwords/... at once and allow an easy integration. A readme could be included that explains the possible configuration options for different languages.

nick_vh’s picture

Attachment (new): 5.98 KB

This diff combines both schemas without cutting any functionality or backwards compatibility. The patch was diffed against the solr-3.x schema of apachesolr.

Last thing I have to check is the geo functionality in search_api_location so it works with this schema. I do have some questions but will post them as a review.

Update on the geo front: I've asked the search_api_location maintainers whether they could manage with the current field sets. From the looks of it, they are just adding unnecessary fields, and I think they can use the dynamic fields that exist in this patch. #1647520: Schema.xml consolidation efforts

nick_vh’s picture

+++ b/solr-conf/schema-solr3x.xml
@@ -443,6 +466,12 @@
+   <dynamicField name="f_ss_*" type="string" multiValued="false" termVectors="true" />

Why are these prefixed with f_ss_*/f_sm_*?

If the prefix is meant to say they are for faceting, it conflicts with the namespace philosophy that the namespace is an abbreviation of the field type (string, int, ...).
Could we rename them somehow? Having both fss and f_ss is not so nice.

Proposal: sts_*, stm_* (string termvector single, string termvector multiple)
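
Spelled out as a sketch, the proposed names would look like this. The attributes are copied from the `f_ss_*` line in the patch; the `stm_*` line assumes the only difference is `multiValued`.

```xml
<!-- Proposed names, not final: string + term vectors, single-valued -->
<dynamicField name="sts_*" type="string" multiValued="false" termVectors="true"/>
<!-- string + term vectors, multi-valued (assumed counterpart) -->
<dynamicField name="stm_*" type="string" multiValued="true" termVectors="true"/>
```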

Also, the default key is still content, but it is not required to have it, so it doesn't harm search api at all.

cpliakas’s picture

This diff combines both schemas without cutting any functionality or backwards compatibility.

Nick, that's awesome!

nick_vh’s picture

Status: Needs work » Needs review

Marking as needs review, because I need input from Tomas/drunkenmonkey

drunken monkey’s picture

Status: Needs review » Needs work

Great work!
It makes the schema pretty cluttered, especially in comparison with the Search API, but I guess if this really has no disadvantages regarding functionality, that's probably worth it. We'll just have to unify the type definitions now, up to a certain extent. Specifically, we seem to use a very different approach regarding strings. We might have to introduce two different dynamic fields for those …
Other than that, how compatible are the types used (or, to what extent are they equal)? If they differ in other aspects, maybe we'll have to generally introduce prefixes for our dynamic fields (i.e., all apachesolr fields start with "a_", all Search API ones with "s_").

However, if we have to change field names, or to what fields what data is indexed, this would necessitate re-indexing for users and also make updating more complicated (on code update, users would have to immediately replace the config files, restart Solr and re-index, or things would break).
So maybe forcefully (completely) unifying the schema, too, won't be worth it after all? Having uniform type definitions (and then maybe use different types for the same fields) would be a good first (and maybe only) step here.

I've been talking to a lot of people about this initiative and they all vote yes to have this. Now, I've also asked if a default English would be a problem or not and not a single one of them was against the idea of having a default English schema.

I really want to stress the importance of an English default schema, and of moving the multilanguage question to a separate issue. This way we can enable the stopwords/protwords/... at once and allow an easy integration. A readme could be included that explains the possible configuration options for different languages.

Hm, I guess it would be OK to do that …
However, at the very least we should make the "English" a variable, so people can easily change the language (and then just replace the stopwords, etc., files).
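
As a sketch of what changing the language amounts to, here is the stemmer line from the diff with only the `language` attribute swapped. Dutch is an arbitrary example, and no particular variable mechanism is implied; protwords.txt and the stopword files would need matching content for the chosen language.

```xml
<!-- language takes the name of a Snowball stemmer; "Dutch" is just an example -->
<filter class="solr.SnowballPorterFilterFactory" language="Dutch" protected="protwords.txt"/>
```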

+++ b/solr-conf/schema-solr3x.xml
@@ -390,15 +407,19 @@
+	 <!-- This field is used to build the spellchecker index -->

Tab.

+++ b/solr-conf/schema-solr3x.xml
@@ -432,6 +453,8 @@
+   <!-- Search Api needs a t_* field for backwards compatibility -->

It's not really for backwards-compatibility if we still use it.
But by the way, is there any merit to distinguishing single-valued from multi-valued text fields? I couldn't really think of any, thus only t_*.

+++ b/solr-conf/schema-solr3x.xml
@@ -443,6 +466,12 @@
+   <!-- These fields are used for facetting on string fields, so case and
+        special characters are correctly displayed. -->

We should probably also note that these are only used by the Search API.

If the prefix is meant to say they are for faceting, it conflicts with the namespace philosophy that the namespace is an abbreviation of the field type (string, int, ...).
Could we rename them somehow? Having both fss and f_ss is not so nice.

So does, e.g., sort_*, and so do the three-letter prefixes Apachesolr uses when compared to the Search API.
We should maybe just make the distinction clearer, which fields are used by both modules and which only by one. This way, f_ss_* and fss_* would be more clearly separated and I wouldn't see such a problem with that.
(That is, provided we choose to unify the field definitions at all.)

nick_vh’s picture

Status: Needs work » Needs review
Attachment (new): 30.83 KB

After some discussion, drunkenmonkey agreed to use the conventions from apachesolr in the schema.xml. The attached patch is the full schema, ready to be used. It is not yet compatible with search_api_solr, so there is some small work to be done there.

nick_vh’s picture

Attachment (new): 30.83 KB

Fixed whitespace issues.

nick_vh’s picture

Committing this file just so it becomes easier to see diffs.

drunken monkey’s picture

Status: Needs review » Needs work

There are two trailing spaces, but apart from that I can live with this. ;)

Amazing to see we really managed to do this!

nick_vh’s picture

Status: Needs work » Fixed

Just checked, there are no trailing spaces in that file anymore :)

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.