Currently the module only fixes src/href attributes. There are a few more... It would be great if the module will be extended by the rest of the available attributes. I have found some more in linkchecker... hopefully nearly complete - you may like to take a look.

Examples (some of them depend on context):
1. pluginurl
2. pluginspage
3. data
4. codebase
5. value
6. poster

More funny regexes for this can be found in linkchecker module...

I would also suggest to use or at least take a look to the upcoming http://api.drupal.org/api/function/filter_dom_load/7. When I used this in linkchecker D7 - I was so positively impressed how easy and much more reliable it is without loosing any functionality and becoming a more reliable link extraction process for some tags with subtags like object/param or audio/source, etc. Maybe of interest for pathologic too - the only requirement seems to be PHP5 what may be a show stopper in D6 times...

Comments

Buddharide’s picture

Another attribute: ALT

For example:
Only local images are allowed.

hass’s picture

ALT is free text. Nothing we need to care about.

Garrett Albright’s picture

Except for 1 and 2, which I'm not too concerned about since I think they'll pretty much always be full off-site paths anyway, I've never heard of those attributes, or at least not why they'd have paths (like value). Could you give examples of how they'd be used?

One that I thought of that Pathologic should legitimately support is longdesc.

hass’s picture

@Garrett: All the attributes listed in #0 are used in flash, java, quicktime and other multimedia applets. Please see my linkchecker tests at http://drupalcode.org/viewvc/drupal/contributions/modules/linkchecker/te... to get an idea what I've found myself in near past. I hope the many examples in there gives a good picture...

"longdesc" is something I missed in linkchecker... added this to my to-do list :-)

Garrett Albright’s picture

Hmm, I see what you mean. "value" is not an attribute which is going to be easy to support, since it contains things that aren't paths more often than not…

However, that filter_dom_load() approach looks promising for the D7 branch. I'll keep it in mind.

hass’s picture

Absolutly - value is not that easy. Some others also not. filter_dom_load() is great... we hopefully get rid of nearly all the regexes :-)))

hass’s picture

Version: 6.x-2.x-dev » 7.x-2.x-dev

How about using filter_dom_load? I see the replace in 7.x-2.x is still using the regex and does not support all attributes that can have urls. I added some more new HTML5 tags to linkchecker, too.

Garrett Albright’s picture

The problem with filter_dom_load() is that it "fixes" HTML - rewrites HTML it thinks is faulty. In fact, that's what Drupal core's HTML fixing filter uses. It would be great if Pathologic could use a DOM object instead of regular expressions, but I don't want to do what Pathologic shouldn't be doing (in other words, what the core HTML fixer is doing) when the user doesn't intend it to. Kind of a bad situation, but I think what we have works well enough, and after all, even if I did have a DOM object available to me, how would I use it? I guess by traversing through every node and seeing if it has href, src, action or longdesc attributes - so more or less the same thing I'm doing now. I bet it would be slower, as well.

At any rate, hass, what other attributes does HTML5 add that might have URLs in them? I'm not aware of any, but then again, I haven't checked up on HTML5 in a while.

hass’s picture

I have never seen the filter_dom_load() destroying anything yet. Code that is broken cannot really get "more" broken. I don't think fixing invalid code is wrong. If you can implement below logic with regexes in a faster way - I'm just fine, but I failed to create reliable regexes myself and removed all of them to make it really *reliable* by using the DOM method. Speed is no problem as this is only one filter of many and this code will run only once and the filtered content is cached. If you run markdown of bbcode it will take much more longer.

I noted some attributes in the into of this case and later provided references to my linkchecker code. Here are updated links to linkchecker http://drupalcode.org/project/linkchecker.git/blob/refs/heads/7.x-1.x:/l... _linkchecker_extract_links() and _linkchecker_link_replace() that show, how this can be accomplished. In tests of http://drupalcode.org/project/linkchecker.git/blob/refs/heads/7.x-1.x:/l... you can see most of the examples, but they are currently missing HTML5 track link example.

As discussed earlier, some attributes are context sensitive to tags and their parent tags. It's not the current basic regex that will work here. With below code you are also able to make it conditional e.g. only replace src inside of video/audio elements. I removed my conditions for pathologic. Something that your regex cannot as it is able to destroy any "src" attributes in a bunch like the filter_dom_load() can fix something. :-)

In a summary list, here are all attributes again that can contain links:
1. pluginurl
2. pluginspage
3. data
4. codebase
5. value
6. poster

Code example:

// Load HTML code into DOM
$html_dom = filter_dom_load($text);

// Finds all hyperlinks in the content.
$links = $html_dom->getElementsByTagName('a');
foreach ($links as $link) {
  if (in_array($link->getAttribute('href'), $old_links)) {
	$link->setAttribute('href', $new_html_link);
  }
  // Replace link text, if same like the URL. If a link text contains
  // other child tags like <img> it will be skipped.
  if (in_array($link->nodeValue, $old_links)) {
	$link->nodeValue = $new_html_link;
  }
}

$links = $html_dom->getElementsByTagName('area');
foreach ($links as $link) {
  if (in_array($link->getAttribute('href'), $old_links)) {
	$link->setAttribute('href', $new_html_link);
  }
}

// Finds all audio links in the content.
$audios = $html_dom->getElementsByTagName('audio');
foreach ($audios as $audio) {
  if (in_array($audio->getAttribute('src'), $old_links)) {
	$audio->setAttribute('src', $new_html_link);
  }

  // Finds source tags with links in the audio tag.
  $sources = $audio->getElementsByTagName('source');
  foreach ($sources as $source) {
	if (in_array($source->getAttribute('src'), $old_links)) {
	  $source->setAttribute('src', $new_html_link);
	}
  }
  // Finds track tags with links in the audio tag.
  $tracks = $audio->getElementsByTagName('track');
  foreach ($tracks as $track) {
	if (in_array($track->getAttribute('src'), $old_links)) {
	  $track->setAttribute('src', $new_html_link);
	}
  }
}

// Finds embed tags with links in the content.
$embeds = $html_dom->getElementsByTagName('embed');
foreach ($embeds as $embed) {
  if (in_array($embed->getAttribute('src'), $old_links)) {
	$embed->setAttribute('src', $new_html_link);
  }
  if (in_array($embed->getAttribute('pluginurl'), $old_links)) {
	$embed->setAttribute('pluginurl', $new_html_link);
  }
  if (in_array($embed->getAttribute('pluginspage'), $old_links)) {
	$embed->setAttribute('pluginspage', $new_html_link);
  }
}

// Finds iframe tags with links in the content.
$iframes = $html_dom->getElementsByTagName('iframe');
foreach ($iframes as $iframe) {
  if (in_array($iframe->getAttribute('src'), $old_links)) {
	$iframe->setAttribute('src', $new_html_link);
  }
}

// Finds img tags with links in the content.
$imgs = $html_dom->getElementsByTagName('img');
foreach ($imgs as $img) {
  if (in_array($img->getAttribute('src'), $old_links)) {
	$img->setAttribute('src', $new_html_link);
  }
  if (in_array($img->getAttribute('longdesc'), $old_links)) {
	$img->setAttribute('longdesc', $new_html_link);
  }
}

// Finds object/param tags with links in the content.
$objects = $html_dom->getElementsByTagName('object');
foreach ($objects as $object) {
  if (in_array($object->getAttribute('data'), $old_links)) {
	$object->setAttribute('data', $new_html_link);
  }
  if (in_array($object->getAttribute('codebase'), $old_links)) {
	$object->setAttribute('codebase', $new_html_link);
  }

  // Finds param tags with links in the object tag.
  $params = $object->getElementsByTagName('param');
  foreach ($params as $param) {
	// @todo
	// - Try to replace links in unkown "flashvars" values
	//   (e.g. file=http://, data=http://).
	$names = array('archive', 'filename', 'href', 'movie', 'src', 'url');
	if ($param->hasAttribute('name') && in_array($param->getAttribute('name'), $names)) {
	  if (in_array($param->getAttribute('value'), $old_links)) {
		$param->setAttribute('value', $new_html_link);
	  }
	}

	$srcs = array('movie');
	if ($param->hasAttribute('src') && in_array($param->getAttribute('src'), $srcs)) {
	  if (in_array($param->getAttribute('value'), $old_links)) {
		$param->setAttribute('value', $new_html_link);
	  }
	}
  }
}

// Finds video tags with links in the content.
$videos = $html_dom->getElementsByTagName('video');
foreach ($videos as $video) {
  if (in_array($video->getAttribute('poster'), $old_links)) {
	$video->setAttribute('poster', $new_html_link);
  }
  if (in_array($video->getAttribute('src'), $old_links)) {
	$video->setAttribute('src', $new_html_link);
  }

  // Finds source tags with links in the video tag.
  $sources = $video->getElementsByTagName('source');
  foreach ($sources as $source) {
	if (in_array($source->getAttribute('src'), $old_links)) {
	  $source->setAttribute('src', $new_html_link);
	}
  }
  // Finds track tags with links in the audio tag.
  $tracks = $video->getElementsByTagName('track');
  foreach ($tracks as $track) {
	if (in_array($track->getAttribute('src'), $old_links)) {
	  $track->setAttribute('src', $new_html_link);
	}
  }
}

// Set the updated $text for the calling function.
$text = filter_dom_serialize($html_dom);
Garrett Albright’s picture

Code that is broken cannot really get "more" broken. I don't think fixing invalid code is wrong.

It's not Pathologic's job, though. Pathologic's job is to correct paths, not to fix HTML. But if we load the DOM, then we don't have a choice.

At any rate, it looks like that list you posted is the same one you posted in the OP nearly two and a half years ago… derp, sorry. The code was helpful, though, in giving some context; I would not have thought that "data" or "value" would be attributes we'd usually expect to see URLs in, but then again, I'm not really familiar with the object/param/embed region of HTML. It's food for thought, and maybe I'll change my mind at some point…

Garrett Albright’s picture

Version: 7.x-2.x-dev » 8.x-1.x-dev
Issue summary: View changes

Pathologic 8.x-2.x will use filter_dom_load() and navigate the DOM to find attributes to modify. I'm not entirely sold on greatly expanding the list of attributes to check, though. Maybe I'll go through the HTML 5 spec again and find attributes which will likely contain paths and use those as default, but have another list (possibly set via a variable?) which can be modified to add other attributes to check.

hass’s picture

Above list should be complete. Cannot remember that I changed anything in last 2 years in linkchecker about this.

mgifford’s picture

Issue tags: +longdesc
Anonymous’s picture

Any progress on this? It seems odd that the path of for instance the poster attribute isn't rewritten.

hass’s picture

l also have a case open in linkchecker for svg links. I think this need to be supported here, too.

Garrett Albright’s picture

Priority: Normal » Major

No progress on this yet, no. Sorry. I probably will not start on 8.x-2.x until after Drupal 8.0 is released.

miro_dietiker’s picture

We just started to do another round of review in Pathologic and i have seen this issue and looked into the replacement code.

I'm asking myself why we are not using the D8 core Masterminds\HTML5 parser to deal with the cases semantically... Yeah sure, much slower, but worth considering..

dww’s picture

FYI, #2993935: Support for srcset attribute is now in. So 8.x-1.0-alpha3 will at least support srcset.