Problem/Motivation

Edge n-grams split a word into chunks of characters and are useful for autocomplete and similar features. Elasticsearch/OpenSearch provide both edge n-gram filters and edge n-gram tokenizers. The module's custom edge_ngram_analyzer currently uses an edge_ngram filter with the standard tokenizer, and the standard tokenizer splits text up into whole words. As a result, when requesting highlights from OpenSearch (I'm using the bodybuilder.js library), the current setup highlights only entire words, even when the query matched just a fragment of the word. For example, if the request is "Marou", the highlighted excerpt wraps the whole word "Maroubra" when only the matching prefix "Marou" should be highlighted. I've attached screenshots of the current and expected behaviour (after I made some changes).
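For reference, a minimal highlight request of the kind bodybuilder.js builds looks roughly like this (the index and field names here are illustrative, not the site's actual configuration):

```json
POST /my_index/_search
{
  "query": {
    "match": { "title": "Marou" }
  },
  "highlight": {
    "fields": { "title": {} }
  }
}
```

With the current filter-based analyzer, the returned fragment wraps the whole word ("<em>Maroubra</em>") rather than just the matched prefix ("<em>Marou</em>bra").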

Steps to reproduce

N/A

Proposed resolution

Change EdgeNgram and Ngram to use tokenizers instead of filters. Right now I'm working around this using the AlterSettingsEvent, but the EdgeNgram plugin itself should be changed: instead of an edge_ngram filter, it should use an edge_ngram tokenizer. I've attached screenshots of my current settings.
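A sketch of what the tokenizer-based analyzer settings could look like (analyzer/tokenizer names and gram sizes are illustrative assumptions, not the module's actual output):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "edge_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "edge_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

The key difference from the current setup is that the n-grams are produced by the tokenizer itself, so each token carries the start/end character offsets of just its fragment of the word, which is what the highlighter uses.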

Remaining tasks

N/A

User interface changes

N/A

API changes

N/A

Data model changes

N/A


Comments

achap created an issue. See original summary.

kim.pepper’s picture

Status: Active » Postponed (maintainer needs more info)

Seems like a reasonable change. Are you able to submit a PR?

achap’s picture

Assigned: Unassigned » achap

Yeah I can do it when I get some free time.

kim.pepper’s picture

Can you take a look at #3349179: Add a search_as_you_type data type to see if that is a better fit for your case?

achap’s picture

Thanks for putting that together. From what I'm seeing, it actually has the same issue as the original edge n-gram implementation: it highlights the entire word rather than the n-grams themselves. I'm not sure why, based on the docs: https://opensearch.org/docs/latest/search-plugins/searching-data/highlight/

achap’s picture

Status: Postponed (maintainer needs more info) » Needs review

Switching from a filter to a tokenizer is working for me with edge n-grams. I guess the two plugins can co-exist?

kim.pepper’s picture

Status: Needs review » Postponed (maintainer needs more info)

Yeah they can both exist.

I wonder if you can get the same results with search_as_you_type by just playing with the highlighter options? https://www.elastic.co/guide/en/elasticsearch/reference/current/highligh...

achap’s picture

I have previously played around with those settings on the edge n-gram field (before using a custom tokenizer) and they didn't appear to do anything, but I haven't had a chance to try them with search_as_you_type yet. I imagine it's caused by the same issue: search_as_you_type is probably using the standard tokenizer, which splits tokens on word boundaries rather than individual characters.

This Stack Overflow question appears to solve it the same way for the search_as_you_type implementation (by implementing an edge n-gram tokenizer): https://stackoverflow.com/questions/59677406/how-do-i-get-elasticsearch-to-highlight-a-partial-word-from-a-search-as-you-type
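Roughly, that approach swaps the field's default analyzer for an edge n-gram one. A hedged sketch, assuming an edge_ngram_analyzer has been defined in the index settings:

```json
{
  "mappings": {
    "properties": {
      "title": {
        "type": "search_as_you_type",
        "analyzer": "edge_ngram_analyzer"
      }
    }
  }
}
```

The search_as_you_type field type accepts an analyzer parameter like a text field does, so the custom tokenizer's per-character offsets replace the standard tokenizer's word offsets.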

According to the tokenizer documentation (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html#analysis-tokenizers), a tokenizer is, among other things, responsible for recording:

  • Order or position of each term (used for phrase and word proximity queries)
  • Start and end character offsets of the original word which the term represents (used for highlighting search snippets).
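The token output below can be reproduced with the _analyze API, for example (the index and analyzer names are assumptions):

```json
GET /my_index/_analyze
{
  "analyzer": "edge_ngram_analyzer",
  "text": "This is a title"
}
```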

If I analyze a title field that is using the custom edge ngram tokenizer I get the following token information for the sentence "This is a title":

{
  "tokens" : [
    {
      "token" : "t",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "th",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "thi",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "this",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "i",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "is",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "a",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "t",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "ti",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "tit",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "titl",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "title",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 11
    }
  ]
}

If I analyze a search_as_you_type field I get the following information:

{
  "tokens" : [
    {
      "token" : "this",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "title",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}

So if the offset information is used for highlighting, that explains why only the edge_ngram_tokenizer setup works as expected.

kim.pepper’s picture

OK. Makes sense. Now we just need to decide whether highlighting whole words or tokens should be the default.

achap’s picture

Sorry for not replying, I got a bit side-tracked :D I've been using this patch in production without issues for a while now. In terms of which one should be the default, something to consider is index size and performance. I don't have hard data to back this up, but tokenizing on every character is presumably a lot more expensive than on every word. So for that reason, and to preserve backwards compatibility, maybe it makes sense to keep the filter as the default and add the tokenizer as a new plugin?

kim.pepper’s picture

I'm inclined to push people towards the search_as_you_type approach rather than getting into specific tokenizers and analyzers etc. If people want to build their own custom solutions they can do that.

achap’s picture

No worries I will move this patch into our own codebase :)

kim.pepper’s picture

Status: Postponed (maintainer needs more info) » Closed (won't fix)

OK cool. I'll close this for now then.