Some time to time, after linkchecker test links from node, in body appears "& # 13;" (without spaces) on every line break.
How can i stop it?

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Dandily’s picture

Issue summary: View changes
Haessler’s picture

Permission restrictions deny you access to this broken link on every line. As I saw in the history this error was allready present in Drupal 6. Simply unprofessional.

Dandily’s picture

Are we talking about the same?

In my situation, if in node was a link (in body or in other field), Link checker resave it with body.

And body will change to this:

Some text.& #13;
& #13;
& #13;
Another text.& #13;

If link checker check this node next time, it will be:

Some text.& #13;& #13;
& #13;& #13;
& #13;& #13;
Another text.& #13;& #13;

And so on.

I think this happen when link checker have found -11000 error or other, because not every day he add & #13; in body.

Where need i look to found this error and fix it?

hass’s picture

Status: Active » Postponed (maintainer needs more info)

That is not caused by linkchecker I think. It is your content and filter settings. I guess you copy&pasted the content from somewhere on a apple Mac and thereby added the \r's to the body and now the core content filters are correcting this \r and these are disallowed and therefore replaced by an html entity. What core filters are doing is not in the hand of linkchecker. Check your core filter settings, please. Linkchecker is only a victim as it is the next user that saves the node.

If this may be a linkchecker issue than you need to write a repro case first. I have never seen this myself and have no idea how to repro. I think you may also see this issue with other manual edits. But maybe you need to switch between mac and windows.

http://stackoverflow.com/questions/5082253/what-is-html-entity-13

hass’s picture

Status: Postponed (maintainer needs more info) » Closed (cannot reproduce)
Dandily’s picture

Status: Closed (cannot reproduce) » Active

I'm still dont try to reproduce this from windows with clear drupal installation, but i have about 4 sites on drupal, and i write articles on them from my mac. This problem appear only on site with Linkchecker running and only in nodes with body text with links.

I made some experiments and now i can say, that's happened only if one of the link have response 301 - permanent moved.

I make a custom module page with

drupal_goto ('node/1', array(), 301);

In settings in linkchecker i have "Update permanently moved links : two times"

After i add new link to my custom page in some node body's text and run cron two times (checked link two times) - those symbols appears on every line in body text.

After looking into Linkchecker's code - i found, that those symbols appear in function _linkchecker_link_replace - about 1951 line in linkchecker.module. The variable $text received such changes in line near the end of function:

// Set the updated $text for the calling function.
$text = filter_dom_serialize($html_dom);

Adding after this some lines:

	if (preg_match ('#\&\#13;#is', $text)){
		watchdog('13LINK', 'After link changes - 13 found and replaced', array(), WATCHDOG_ALERT);
		$text = preg_replace ('#\&\#13;#is', '', $text);
	}

... help me to solve my problem, but it's not a best way...

hass’s picture

Can you check if it is also there in the beginning of this function?

Dandily’s picture

I check this before and after


    if (preg_match ('#\&\#13;#is', $text)){
        watchdog('13LINK', 'Before link changes - 13 found and replaced', array(), WATCHDOG_ALERT);
        $text = preg_replace ('#\&\#13;#is', '', $text);
    }

    // Set the updated $text for the calling function.
    $text = filter_dom_serialize($html_dom);

    if (preg_match ('#\&\#13;#is', $text)){
        watchdog('13LINK', 'After link changes - 13 found and replaced', array(), WATCHDOG_ALERT);
        $text = preg_replace ('#\&\#13;#is', '', $text);
    }

Worked only after, if i clear old "13" from node. With old "13" in body - worked two times.

hass’s picture

Sounds like a PHP configuration issue or bug in filter_dom_serialize() or filter_dom_load().

Can I get a file dump of such a before $text, please? So I can try if I get the same result on my box.

hass’s picture

Try this, please. Add it into linkchecker.module line 1984.

$debug = filter_dom_serialize(filter_dom_load($text));
if (preg_match ('#\&\#13;#is', $debug)){
  watchdog('13LINK', 'Corrupted mac line break found', array(), WATCHDOG_ALERT);
}
Dandily’s picture

In clean linkchecker.module i have on line 1984:

1982      // Finds all hyperlinks in the content.
1983      if (variable_get('linkchecker_extract_from_a', 1) == 1) {
1984        $links = $html_dom->getElementsByTagName('a');
1985        foreach ($links as $link) {

Need i place code between 1983 and 1984 or after 1984 ?

hass’s picture

No, earlier. Compare lines with latest dev.

Dandily’s picture

I have 7-1.2 version, but i look i dev version and place this code after:

      $old_links = array_unique($old_links);

And yes - it's work, i see this message in log/journal.

hass’s picture

DEV is stable. Don't feat to install. You can just go back to 7.x-1.2 after the tests.

Does the $text variable in line 1984 already contain the entities? If true, dump $node in line 619 (linkchecker.module) and see if it already contain the HTML entities. I gues it will.

What PHP version are you running?

hass’s picture

Status: Active » Closed (cannot reproduce)
vinmassaro’s picture

Status: Closed (cannot reproduce) » Active

This is an issue for us on nodes that have had links updated with Linkchecker 7.x-1.3. Please let me know any information I can provide to help debug this. Thanks.

hass’s picture

Status: Active » Postponed (maintainer needs more info)

See above. It is still unclear how to reproduce and the root cause need to be identified. If you may help here we may be able to solve the issue.

hass’s picture

Status: Postponed (maintainer needs more info) » Closed (cannot reproduce)

No feedback.

vinmassaro’s picture

If I get time to test, I will reopen. Thanks.

alt36’s picture

I think the problem does indeed come about due to the way filter_dom_serialize() works. Consider the following PHP snippet:

$text = "first line
second line";

print_r(filter_dom_serialize(filter_dom_load($text)));

If the newline in the middle of $text is a CRLF (i.e. 0x0d 0x0a , achieved by "set ff=dos" in vim) then the printed output contains 
 in the middle. If instead the newline is just a LF (i.e. 0x0a , achieved by "set ff=unix" in vim) then there's no 
 in the output.

Looking at the filter_dom_serialize() source, I think this arises in part from their use of saveXML() - continuing my example, if you then

$dom = new DOMDocument();
$dom = filter_dom_load($text);
print($dom->saveHTML());
print($dom->saveXML());

then the generated <body> node is the same in both cases if $text uses a LF, but in the case of CRLF, saveXML() includes the &#13; entity whereas saveHTML() does not.

alt36’s picture

Code to reproduce: create a node with the following fragment (which includes a link that I've checked is returning 301)

<?php

$node = new stdClass();

$node->title = "Test node";
$node->type = "page";

node_object_prepare($node);
$node->language = 'und';

$newline = "0d0a";
$newline = pack('H*', $newline);

$text = '<a href="http://www.phy.cam.ac.uk">Link</a>' . $newline . 'Second line';

$node->body['und'][0]['format'] = 'full_html';
$node->body['und'][0]['value'] = $text;

$node = node_submit($node);
node_save($node);
?>

Then run linkchecker; a &#13; entity will appear in the node body due to the way CRLFs are handled, as described in my previous comment.

vinmassaro’s picture

Status: Closed (cannot reproduce) » Active

Reopening because @alt36 provided some new info. I'm not sure how to work around it.

hass’s picture

That is an interesting analysis. Now we only need to figure out why this issue does not show up in core and add the same code... i suspect there is a regex somewhere that changes all windows/mac line feeds to unix line feeds before its getting saved.

alt36’s picture

Grepping for \r in the code for the core filter module, I find

https://api.drupal.org/api/drupal/modules!filter!filter.module/function/...

which uses a str_replace() to standardise line endings:

  $text = str_replace(array("\r\n", "\r"), "\n", $text);

check_markup() isn't called inside the filter module itself, but it is called from e.g. https://api.drupal.org/api/drupal/modules%21field%21modules%21text%21tex... so it would seem like a plausible thing to add to Linkchecker

hass’s picture

The same is in linkchecker module. See http://cgit.drupalcode.org/linkchecker/tree/linkchecker.module?h=7.x-1.x...

I only use core api to save nodes. So the code should just follow the same code flow like a node form save. Normally code in node forms is never altered. Filters run only on node view.

alt36’s picture

Thanks for the info @hass . The str_replace() call you link to in linkchecker.module happens too late in this instance though, because by that point $text already includes &#13; instead of \r, which happens due to the call to filter_dom_serialize() in _linkchecker_link_replace() . I've not fully traced all the function calls, but if I print some simple debug messages I can see that _linkchecker_link_replace() is being called before _linkchecker_check_markup() .

So perhaps one option would be to the same str_replace() immediately before the filter_dom_serialize() at http://cgit.drupalcode.org/linkchecker/tree/linkchecker.module?h=7.x-1.x... ?

oggsmith’s picture

I'm also seeing the same issue, saving content doesn't add in #13 linkchecker auto-fixing broken links does.

Skabbkladden’s picture

Version: 7.x-1.2 » 7.x-1.x-dev
FileSize
734 bytes

After some testing, it seems that doing the str_replace() call before building the DOM in _linkchecker_link_replace() (around line 2141) solves the problem. See patch.

hass’s picture

Still not clear to me where the entities are added and why. Doing random string replacements are not the right way to solve bugs.

Can you write a test that shows how it breaks, please?