Thanks for this awesome module. When indexing an attached PDF (containing no HTML) whose OCR layer is extracted by Tika, one messy graphic is interpreted as containing "<". About seven pages later, another messier graphic is interpreted as containing ">". So they, and everything in between, were omitted from the index.

Is there a way to include the content of "<...>" when it is NOT a known tag?
I have found other cases where only "<" appears and all that follows is omitted.

I'm not much of a programmer, so I'm looking for some help here. Thanks for any tips you can provide.

I found these three relevant functions, the second of which is called at the end of the first one:

/**
 * Extract HTML tag contents from $text and add to boost fields.
 *
 * @param ApacheSolrDocument $document
 * @param string $text
 *   must be stripped of control characters before hand.
 *
 */
function apachesolr_index_add_tags_to_document(ApacheSolrDocument $document, $text) {
  $tags_to_index = _apachesolr_tags_to_index();

  // Strip off all ignored tags.
  $allowed_tags = '<' . implode('><', array_keys($tags_to_index)) . '>';
  $text = strip_tags($text, $allowed_tags);

  preg_match_all('@<(' . implode('|', array_keys($tags_to_index)) . ')[^>]*>(.*)</\1>@Ui', $text, $matches);
  foreach ($matches[1] as $key => $tag) {
    $tag = drupal_strtolower($tag);
    // We don't want to index links auto-generated by the url filter.
    if ($tag != 'a' || !preg_match('@(?:http://|https://|ftp://|mailto:|smb://|afp://|file://|gopher://|news://|ssl://|sslv2://|sslv3://|tls://|tcp://|udp://|www\.)[a-zA-Z0-9]+@', $matches[2][$key])) {
      if (!isset($document->{$tags_to_index[$tag]})) {
        $document->{$tags_to_index[$tag]} = '';
      }
      $document->{$tags_to_index[$tag]} .= ' ' . apachesolr_clean_text($matches[2][$key]);
    }
  }
}

/**
 * Strip html tags and also control characters that cause Jetty/Solr to fail.
 */
function apachesolr_clean_text($text) {
  // Remove invisible content.
  $text = preg_replace('@<(applet|audio|canvas|command|embed|iframe|map|menu|noembed|noframes|noscript|script|style|svg|video)[^>]*>.*</\1>@siU', ' ', $text);
  // Add spaces before stripping tags to avoid running words together.
  $text = filter_xss(str_replace(array('<', '>'), array(' <', '> '), $text), array());
  // Decode entities and then make safe any < or > characters.
  $text = htmlspecialchars(html_entity_decode($text, ENT_QUOTES, 'UTF-8'), ENT_QUOTES, 'UTF-8');
  // Remove extra spaces.
  $text = preg_replace('/\s+/s', ' ', $text);
  // Remove white spaces around punctuation marks probably added
  // by the safety operations above. This is not a world wide perfect solution,
  // but a rough attempt for at least US and Western Europe.
  // Pc: Connector punctuation
  // Pd: Dash punctuation
  // Pe: Close punctuation
  // Pf: Final punctuation
  // Pi: Initial punctuation
  // Po: Other punctuation, including ¿?¡!,.:;
  // Ps: Open punctuation
  $text = preg_replace('/\s(\p{Pc}|\p{Pd}|\p{Pe}|\p{Pf}|!|\?|,|\.|:|;)/s', '$1', $text);
  $text = preg_replace('/(\p{Ps}|¿|¡)\s/s', '$1', $text);
  return $text;
}

function _apachesolr_tags_to_index() {
  $tags_to_index = variable_get('apachesolr_tags_to_index', array(
    'h1' => 'tags_h1',
    'h2' => 'tags_h2_h3',
    'h3' => 'tags_h2_h3',
    'h4' => 'tags_h4_h5_h6',
    'h5' => 'tags_h4_h5_h6',
    'h6' => 'tags_h4_h5_h6',
    'u' => 'tags_inline',
    'b' => 'tags_inline',
    'i' => 'tags_inline',
    'strong' => 'tags_inline',
    'em' => 'tags_inline',
    'a' => 'tags_a'
  ));
  return $tags_to_index;
}
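
For reference, the symptom can be reproduced directly with apachesolr_clean_text(). This is only an illustration (it assumes a site where the module is loaded, e.g. run via drush php-eval, and the "noise" strings are made up):

// OCR noise: one graphic read as "<", several pages of real text, then another graphic read as ">".
$extracted = 'Text from page 1 <noise pages two through seven of real content more noise> text from page 8';
print apachesolr_clean_text($extracted);
// Prints roughly: "Text from page 1 text from page 8" -- everything between
// the stray "<" and ">" is gone.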

Comments

Nick_vh

Interesting issue! I'm not sure I can help quickly. Perhaps we should look at the standards to see how big a tag can be, and if a "<...>" run is bigger than the allowed size, just strip the angle brackets?

Perhaps the better option would be not to strip < and > when content is coming from Tika. I'd love to see some help here.
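
A crude version of that idea, as a sketch only (the helper name is made up, and where exactly to call it in apachesolr_attachments would still need to be worked out), would be to encode the brackets before the extracted text ever reaches the HTML-aware cleaning:

/**
 * Sketch only: neutralize angle brackets in text extracted from files.
 *
 * Text extracted from PDFs/DOCs should never contain real markup, so every
 * "<" and ">" can safely become an entity; filter_xss() then has nothing it
 * can mistake for a tag, and the surrounding text is kept.
 */
function example_escape_extracted_text($text) {
  return str_replace(array('<', '>'), array('&lt;', '&gt;'), $text);
}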

topdillon

Thanks Nick. I am interested in any solution or workaround.

topdillon

This problem is still unsolved. Here's a point that might make it easier to find a solution, for someone other than me: the files that Solr needs to index (mostly PDFs, some .doc files) will never contain "<...>" tags. So could tags be stripped only from the non-file objects it encounters?

j0rd

I have a similar problem, except that I simply need to store HTML in Solr. I run a lot of non-Drupal queries straight against Solr on the front end of my website, and while I'm not searching on this data, I do need to display it.

Zatox

Has anyone found a solution or workaround for this?
I found the culprit: it's filter_xss inside apachesolr_clean_text:
https://api.drupal.org/api/drupal/includes%21common.inc/function/filter_...
If you look at the end of that function, you'll see this was done knowingly and on purpose:

  return preg_replace_callback('%
    (
    <(?=[^a-zA-Z!/])  # a lone <
    |                 # or
    <!--.*?-->        # a comment
    |                 # or
    <[^>]*(>|$)       # a string that starts with a <, up until the > or the end of the string
    |                 # or
    >                 # just a >
    )%x', '_filter_xss_split', $string);

So clearly this is the problem:
<[^>]*(>|$) # a string that starts with a <, up until the > or the end of the string
It does exactly what @topdillon describes.
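
You can see it with the pattern on its own, outside of Drupal (the sample string is made up):

$pattern = '%(
  <(?=[^a-zA-Z!/])  # a lone <
  |                 # or
  <!--.*?-->        # a comment
  |                 # or
  <[^>]*(>|$)       # a string that starts with a <, up until the > or the end of the string
  |                 # or
  >                 # just a >
  )%x';
preg_match($pattern, 'page 1 text <garbled then pages of real text then more garble> page 8 text', $m);
// $m[1] is "<garbled then pages of real text then more garble>" -- the third
// alternative treats the whole span as one tag, and _filter_xss_split() then
// drops it because "garbled" is not an allowed element.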
At this time I have no idea how to fix this cleanly. filter_xss is a generic security function, and I don't foresee it changing just for this small (in terms of the number of people affected) bug. We can't really bypass it in apachesolr_clean_text either: that function is used not just for file text extraction but for all of Solr, and given the nature of Drupal, bypassing the security function in the clean-text function would be way too risky.
Now I guess we could add a flag to the clean function that tells it not to go through filter_xss, and set that flag when calling it from the file extraction code (apachesolr_attachments_get_attachment_text in the apachesolr_attachments module); a rough sketch follows below.
But even that wouldn't be a fix for everyone, because you'd have to trust that the users who upload files aren't going to include malicious code in the PDFs.
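
To make that idea concrete, here is a very rough sketch of what the flag could look like (this is not a patch; the punctuation clean-up from the original function is left out for brevity):

function apachesolr_clean_text($text, $strip_markup = TRUE) {
  if ($strip_markup) {
    // Existing behaviour: remove invisible content, then strip tags via filter_xss().
    $text = preg_replace('@<(applet|audio|canvas|command|embed|iframe|map|menu|noembed|noframes|noscript|script|style|svg|video)[^>]*>.*</\1>@siU', ' ', $text);
    $text = filter_xss(str_replace(array('<', '>'), array(' <', '> '), $text), array());
  }
  // With $strip_markup = FALSE the brackets are only entity-encoded below,
  // so the text between a stray "<" and ">" is kept.
  $text = htmlspecialchars(html_entity_decode($text, ENT_QUOTES, 'UTF-8'), ENT_QUOTES, 'UTF-8');
  // Remove extra spaces.
  $text = preg_replace('/\s+/s', ' ', $text);
  return $text;
}

// apachesolr_attachments_get_attachment_text() (or wherever the extracted
// text gets cleaned) would then pass FALSE:
// $text = apachesolr_clean_text($extracted_text, FALSE);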
Anyone got ideas?