I use a link with an image inside it on one of my static pages, but this module does not parse it correctly and shows the img tag in the block. :(
It would be OK if it looked for the alt or, better, the title attribute, or used the HTML code.
PS: Sorry for my bad English...

Comments

csc4’s picture

This is actually a 4.7.x and 5.x issue

Are there any regex gurus out there who could offer some help? I'm seeing this a lot because I use the amazontools module, and the links generated from the images are horrible:

<h2>Links from Article Text</h2><ul><li><a nicetitle="Matched text: &lt;a href=&quot;http://www.amazon.co.uk/gp/redirect.html%3FASIN=0743275284%26tag=googletag%26lcode=xm2%26cID=2025%26ccmID=165953%26location=/o/ASIN/0743275284%253FSubscriptionId=1XFK01HK9NZWGPENWGG2&quot; target=&quot;_blank&quot;&gt;&lt;img src=&quot;http://ec1.images-amazon.com/images/P/0743275284.01._SCTHUMBZZZ_.jpg&quot; height=&quot;75&quot; width=&quot;50&quot; alt=&quot;cover of The Writing on the Wall: Why We Must Embrace China as a Partner or Face It as an Enemy&quot; /&gt;&lt;/a&gt;" href="http://www.amazon.co.uk/gp/redirect.html%3FASIN=0743275284%26tag=googletag%26lcode=xm2%26cID=2025%26ccmID=165953%26location=/o/ASIN/0743275284%253FSubscriptionId=1XFK01HK9NZWGPENWGG2">&lt;img src="http://ec1.images-amazon.com/images/P/0743275284.01._SCTHUMBZZZ_.jpg" height="75" width="50" alt="cover of The Writing on the Wall: Why We Must Embrace China as a Partner or Face It as an Enemy" /&gt;</a></li><li><a nicetitle="Matched text: &lt;a href=&quot;http://www.amazon.co.uk/gp/redirect.html%3FASIN=0743275284%26tag=googletag%26lcode=xm2%26cID=2025%26ccmID=165953%26location=/o/ASIN/0743275284%253FSubscriptionId=1XFK01HK9NZWGPENWGG2&quot; target=&quot;_blank&quot;&gt;The Writing on the Wall: Why We Must Embrace China as a Partner or Face It as an Enemy&lt;br&gt;&lt;/a&gt;" href="http://www.amazon.co.uk/gp/redirect.html%3FASIN=0743275284%26tag=googletag%26lcode=xm2%26cID=2025%26ccmID=165953%26location=/o/ASIN/0743275284%253FSubscriptionId=1XFK01HK9NZWGPENWGG2">The Writing on the Wall: Why We Must Embrace China as a Partner or Face It as an Enemy&lt;br&gt;</a></li></ul></div>

The original source it is parsing looks like this:

<table class="class=" amazontools_related=""><tbody><tr><td><a href="http://www.amazon.co.uk/gp/redirect.html%3FASIN=0743275284%26tag=googletag%26lcode=xm2%26cID=2025%26ccmID=165953%26location=/o/ASIN/0743275284%253FSubscriptionId=1XFK01HK9NZWGPENWGG2" target="_blank"><img src="http://ec1.images-amazon.com/images/P/0743275284.01._SCTHUMBZZZ_.jpg" alt="cover of The Writing on the Wall: Why We Must Embrace China as a Partner or Face It as an Enemy" height="75" width="50"></a></td><td><a href="http://www.amazon.co.uk/gp/redirect.html%3FASIN=0743275284%26tag=googletag%26lcode=xm2%26cID=2025%26ccmID=165953%26location=/o/ASIN/0743275284%253FSubscriptionId=1XFK01HK9NZWGPENWGG2" target="_blank">The Writing on the Wall: Why We Must Embrace China as a Partner or Face It as an Enemy<br></a>author: Will Hutton<br>asin: 0743275284

I found http://drupal.org/node/53880#comment-101916 which suggested
$output = preg_replace('#<a href="/\?q=glossary[^"]+" title="[^"]+"><img src="/[^"]+" /></a>#', '', $output );

to strip out glossary image links, but I'm not sure how to adapt it to strip the img tags from Related Links. I've tried some things myself but haven't gotten anywhere.

I believe the issue is around line 103:

              if (in_array(RELATEDLINKS_PARSED, variable_get('relatedlinks_types', array(RELATEDLINKS_PARSED)))) {
                // Rather than parsing out only the URI + link text, an attempt is
                // made to retain any other attributes present.
                preg_match_all('#(<a [^>]+>[^<]+</a>)#', $node->body, $matches);
                if (count($matches[1])) {
                  $links = array();
                  // Check URIs for duplicates.
                  foreach ($matches[1] as $index => $link) {
                    preg_match('#href\s*=\s*["]*([^"\s>]*)#', $link, $match);
                    $link = rtrim($match[1], '/');
                    if (!in_array($link, $links)) {
                      $links[] = $link;
                    }
                    else {
                      // Unset duplicate.
                      unset($matches[1][$index]);
                    }
                  }
                  _relatedlinks_add_links($node->nid, $matches[1], RELATEDLINKS_PARSED);
                }
              }

I tried

                  foreach ($matches[1] as $index => $link) {
                    preg_match('#href\s*=\s*["]*([^"\s>]*)#', $link, $match);
                    $link = rtrim($match[1], '/');
                    $link = preg_replace('#<img src="/[^"]+" />#', '', $link);
                    if (!in_array($link, $links)) {
                      $links[] = $link;
                    }
                    else {
                      // Unset duplicate.
                      unset($matches[1][$index]);
                    }
                  }

but I don't seem to be getting anywhere.

Anyone out there good at regex?
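
For what it's worth, two things in the attempt above look like they stop it from working. The preg_replace() call runs after $link has already been overwritten with the URL taken from href, so it never sees the img tag. And the pattern only matches an image whose src starts with "/" and is immediately followed by " />", while the Amazon images use an absolute URL with alt, height and width after it. Below is a minimal sketch of one possible approach: strip nested tags from the link text and fall back to the image's alt attribute when nothing is left. The function name example_link_to_text() is hypothetical and not part of the module, and the patterns assume double-quoted attribute values.

```php
<?php
/**
 * Hypothetical helper (not part of the relatedlinks module): reduce one
 * anchor's HTML to a URL and a plain-text title.
 * Assumes attribute values are double-quoted.
 */
function example_link_to_text($anchor_html) {
  // Capture the href value and everything between <a ...> and </a>.
  if (!preg_match('#<a\s[^>]*href\s*=\s*"([^"]*)"[^>]*>(.*?)</a>#is', $anchor_html, $m)) {
    return NULL;
  }
  $url = rtrim($m[1], '/');
  // Drop nested tags such as <img> or <br>, leaving only the visible text.
  $title = trim(strip_tags($m[2]));
  // Image-only links have no visible text, so fall back to the alt attribute.
  if ($title === '' && preg_match('#<img\s[^>]*alt\s*=\s*"([^"]*)"#i', $m[2], $alt)) {
    $title = trim($alt[1]);
  }
  // Links with neither text nor alt are dropped.
  return $title === '' ? NULL : array('url' => $url, 'title' => $title);
}
```

The result would still have to be wired into the loop that builds the list passed to _relatedlinks_add_links(), and the first preg_match_all() pattern would need to allow tags inside the anchor; neither is shown here.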

csc4’s picture

Is there really no one out there who can help with this regex?

smscotten’s picture

Version: 5.x-1.0-beta » 6.x-1.0-alpha1
Component: Miscellaneous » Code

I took this from the 6.x code, which inherited this same problem. Any time there's a linked image the recommended links block contains HTML code for the image—very ugly behavior. My scorched-earth solution is simply to eliminate all links that contain HTML:

function _relatedlinks_check_links($url_matches, $title_matches) {
  $urls = array();
  $links = array();
  // Check URLs for duplicates.
  foreach ($url_matches as $index => $url) {
    $url = rtrim($url, '/ ');
    if (!in_array($url, $urls) &&
        preg_match('/</', $title_matches[$index]) == 0
        ) {
      $urls[] = $url;
      // The title is trimmed in _relatedlinks_add_links due to
      // inadequacies in the current regex.
      $links[] = array('url' => $url, 'title' => $title_matches[$index]);
    }
  }

  return $links;
}

which changes

    if (!in_array($url, $urls)) {

to

    if (!in_array($url, $urls) &&
        preg_match('/</', $title_matches[$index]) == 0
        ) {

It's a bit extreme—it will exclude links that contain bolded or italic text, but it's a working solution for me. Ideally I guess we'd want something that takes the alt attribute of the image tag and makes a text link out of an image, but for my purposes, this is enough.
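
A less drastic variant, as a rough sketch only: instead of rejecting any title that contains "<", reduce the title to plain text with strip_tags() and skip the link only if nothing is left. The function name example_check_links() is made up for illustration; it mirrors the shape of _relatedlinks_check_links() above but has not been tested against the module.

```php
<?php
/**
 * Hypothetical variant of _relatedlinks_check_links() (illustration only):
 * keeps links whose titles contain inline markup by stripping the tags,
 * and drops only duplicates and links with no text left.
 */
function example_check_links($url_matches, $title_matches) {
  $urls = array();
  $links = array();
  foreach ($url_matches as $index => $url) {
    $url = rtrim($url, '/ ');
    // Reduce the title to plain text instead of rejecting it outright.
    $title = trim(strip_tags($title_matches[$index]));
    // Skip duplicate URLs and titles that are empty once tags are removed
    // (for example, image-only links).
    if ($title !== '' && !in_array($url, $urls)) {
      $urls[] = $url;
      $links[] = array('url' => $url, 'title' => $title);
    }
  }
  return $links;
}
```

With this version a link such as <b>Bold</b> title would be kept as "Bold title". An image-only link is skipped without recording its URL, so a later text link to the same URL still gets through.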

Hope that helps.

Zen’s picture

Version: 6.x-1.0-alpha1 » 6.x-1.x-dev
Status: Active » Fixed

I've committed a patch which strips tags from the text. While ideally, we would be looking at the anchor tag or img tag's title or alt attributes, this should do for the time being.

-K

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

smscotten’s picture

Might this modification eventually find its way into the beta or release version?

smscotten’s picture

Version: 6.x-1.x-dev » 7.x-1.x-dev
Status: Closed (fixed) » Needs work

Zen, would you be so kind as to post the patch and/or other changes you made to fix this in 6.x-1.x-dev? I'd like to be able to change the behavior on a 7.x site.

Thanks!