I have a client with a lot of older content that contains many base64-encoded images in the body field. The problem is that this produces thousands of warnings like the one below, indexing takes hours with Drush, and it cannot be done from the front end at all because it exhausts the memory limit.

gEIWnEigsEpWFJ9EJZAaRwsSR6FgSDUOioshUNJWGZTLxXDZewMUKORgBGyvi4CRcjISNFrNwEg5OxMHz2UQRjyQVkH1FZJWU4u9L0ynZQRqePZDvNAiCjYJgkyDYJHAahQ6DyBEocuhEDq3QESB0amVOrSRYK3UGiO1qH4dG4QhQOXRqm9bPqrFGWXMLklaWzfhuXs7q2RkrZ6YtK0peUpBYnhNXkRu.
Database search servers currently cannot index such words correctly – the word was therefore trimmed to the allowed length.
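For context, this is the sort of markup that produces such tokens (a hypothetical example, with the payload truncated): an image embedded straight into the body as a base64 data URI, whose payload can end up in the extracted text as one enormous "word".

<?php

// Hypothetical example of a body value that triggers the warning: an
// image embedded inline as a base64 data URI rather than as a file.
// Depending on how the field is indexed, the src payload can reach the
// indexer as a single multi-kilobyte "word" (truncated here).
$body = '<p>Quarterly report</p>'
  . '<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...">';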

Is there any way of ignoring base64 strings, or ignoring content that contains them? It's not feasible to search the thousands of nodes containing these strings and replace them. I tried that, and it turns out there are many more than I thought.

This has only been an issue since upgrading to the latest version of the search_api module.

Comments

drunken monkey

Status: Active » Fixed

You can use the "Tokenizer" processor for that. Just change the "Ignorable characters" setting from ['] to something like [']|\w{30,} (which would remove apostrophes and words with 30 or more characters from the index).
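To make the effect concrete, here is a rough sketch (not the Search API module's actual code) of what stripping that pattern before tokenization does; the sample string is made up:

<?php

// Rough sketch (not Search API's actual code) of what the "Ignorable
// characters" pattern does: every match is removed from the text
// before it is split into tokens.
$ignorable = "[']|\\w{30,}";

// Made-up sample: ordinary words plus a 30+ character base64 fragment.
$text = "The client's report gEIWnEigsEpWFJ9EJZAaRwsSR6FgSDUOioshU covers 2015.";

// Strip ignorable matches, then tokenize on whitespace. The base64
// fragment disappears entirely, so it never reaches the index and
// never triggers the overlong-word warning.
$cleaned = preg_replace('/' . $ignorable . '/', '', $text);
$tokens = preg_split('/\s+/', trim($cleaned), -1, PREG_SPLIT_NO_EMPTY);

print_r($tokens);
// Prints: The, clients, report, covers, 2015.

Note that \w does not match +, / or =, so a real base64 payload may only be split into several shorter runs rather than removed in one piece; a variation like [\w+/=]{30,} might catch those too, though it would need testing against your content.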

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.