I've been reading up on this for the better part of a week now but can't seem to get a straight answer anywhere, although I've seen several people asking very similar questions.

My confusion centers on the Reports section of Drupal 7 (D7), specifically the 'page not found' entries that relate to images.

When I go to Reports > Recent log messages

I get this:

https://drive.google.com/file/d/0BwedWvh7eyHPMC1KZ01MZzIzNFE/view?usp=sh...

The page is filled with [page not found] entries, all of them for images that used to be on the Drupal file system.

When a link is clicked I am taken to this page.

https://drive.google.com/open?id=0BwedWvh7eyHPY3BObHEtQUtLZHM

Finally, my [top page not found errors] report is full of entries, and they are all links to images.

https://drive.google.com/open?id=0BwedWvh7eyHPTXp4X2VsODZDdEk

When the website was first made it used the internal D7 file system. At its peak the website was getting hundreds of thousands of views a week, and it was running articles similar to BuzzFeed listicles, so it was very image heavy.

The hosting ended up costing too much for me, so I had to close the site down for around a year. Since then I have switched to using Amazon for all image hosting, so there are no images currently stored on the website.

I do not understand why there are still calls being made to these images. Is it due to factors out of my control? EG a lot of people have 'stolen' links to images we used and they are all over sites like pinterest etc

I also do not know what I am doing or how to debug the information in the above screenshots. I cannot find any clear documentation on it. The advice I have read is: find the link, follow the link, and work it out from there. However, I know these images do not exist on the system any more, and from what I can tell there are no calls being made to these images from within my website. For example:

There is a [page not found] link to this image: OLD IMAGE (this image doesn't exist): http:// www.newageman.co.uk/sites/default/files/realised.jpg

I know the above 404 image used to relate to the cover image of this page:

CURRENT PAGE: http://www.newageman.co.uk/22-images-things-i-bet-you-never-fucking-real...

However, it has been a year since the old image was used as the cover image. On the above CURRENT PAGE, there are no calls whatsoever to the old image. That page was, as far as I am aware, the only place that should ever have called that image.

What do I need to do to fix this?

Do I need to fix this? There are so many of these 404 errors, and for most of them I have no idea which image relates to which article.

Could these be from old Facebook scrape data? I have rescraped around 70% of the website, so I can't see it being that.

Is it from external websites linking to images we used to have?

Could it be from old development sites that I had online once upon a time and that probably still exist?

Any help, ideas, thoughts or links to documentation would be a great help.

Comments

Jaypan’s picture

I do not understand why there are still calls being made to these images. Is it due to factors out of my control? EG a lot of people have 'stolen' links to images we used and they are all over sites like pinterest etc

Could be. Or maybe the search engines previously indexed them, and are checking to see if they are there or not. There is no referrer, which I believe means that the images are being directly accessed, rather than embedded on one of your pages.

There isn't really anything you can do about this, and you don't really need to either - if the image doesn't exist, it doesn't exist.

milfguy’s picture

Thank you for taking the time to read and reply Jaypan, you have reassured me considerably.

I have been very worried that it was a site issue with the pages causing the 404 to be logged every time a page has been visited, and thought these would need to be addressed if any considerable number of people visit again.

In the past two hours I have had 31 calls to 404 image errors logged, which just doesn't seem right to me.

Since all these 404 calls are being made to the sites/default/files folder - which is now empty, do you think I could edit the htaccess to do something clever here?...

Jaypan’s picture

Since all these 404 calls are being made to the sites/default/files folder - which is now empty, do you think I could edit the htaccess to do something clever here?...

To what end?

milfguy’s picture

To stop the 404 errors being logged and keep the log page clear, so that only useful items are logged. Since you have suggested there is nothing I can or should do about them, I do not see why Drupal logs these [page not found] entries for JPGs at all; they do not seem to have any real use.

Not highlighting a source for the 404 (when a jpg or similar) doesn't make sense to me. Essentially, "someone tried to find this image but it wasn't there, we are not telling you from where they were looking or what page they were on." Why waste the processing power?

Although the processing power for each error must be small, I know from experience that when an article goes viral, hundreds of thousands of hits make everything expensive, so I would like to stop these from using up any processing power whatsoever.

In the past couple of hours there have been 15 calls to the example image I gave above, [/sites/default/files/realised.jpg], alone. It just doesn't seem right to me.

Since that image hasn't been at that address for over a year - even if crawlers are looking for it, why would they target the same image multiple times a night?

Perhaps blocking access to the folder would stop these errors being reported by Drupal? I'm not sure; this is an area I have very little experience in.

I may install a stock version of drupal for a few days and check the log files and see if the errors are still being reported, that might be more useful.

Jaypan’s picture

Makes sense.

In that case, this should work (assuming you have nothing at all in the sites/default/files folder):

RewriteRule ^sites/default/files/ - [L,R=404]

milfguy’s picture

Thank you again Jaypan. I will try to give this a go over the next few days (unfortunately there are several files in the folder so I will need to sort those out first).

I am also going to try a few other things. I've read in several places that a module may be contributing to these 404s.

Also, looking up the IPs, they all seem to belong to Facebook, which suggests to me that articles are still being shared with the original scraped data. (I have debugged these articles in Facebook and nothing stands out, but possibly there is cached data somewhere. However, the example article I gave was made in 2014, so I am not sure whether cached data going back that far can still be around.)
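One way to cross-check where these requests come from, outside Drupal's own log, might be to tally the web server's access log by user agent. A minimal PHP sketch, assuming Apache's "combined" log format and assuming Facebook's crawler identifies itself with a user agent containing facebookexternalhit; the function name, IP addresses and sample lines are made up for illustration:

```php
<?php
// Sketch only: count 404s under a path prefix, grouped by user agent.
// Assumes Apache "combined" log format. All sample data below is hypothetical.

/**
 * @param string[] $lines  Raw access-log lines.
 * @param string   $prefix Path prefix to match, e.g. "/sites/default/files/".
 * @return array Map of user agent => number of 404 hits, largest first.
 */
function tally_404_by_agent(array $lines, string $prefix): array {
  // host ident user [time] "METHOD path PROTO" status bytes "referer" "agent"
  $re = '/^\S+ \S+ \S+ \[[^\]]+\] "\S+ (\S+)[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"/';
  $counts = [];
  foreach ($lines as $line) {
    if (!preg_match($re, $line, $m)) {
      continue; // Skip lines that are not in the assumed format.
    }
    [, $path, $status, $agent] = $m;
    if ($status === '404' && strpos($path, $prefix) === 0) {
      $counts[$agent] = ($counts[$agent] ?? 0) + 1;
    }
  }
  arsort($counts); // Highest count first; keys (agents) are preserved.
  return $counts;
}

// Hypothetical sample lines using documentation-only IP addresses.
$fb = 'facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)';
$sample = [
  '203.0.113.5 - - [10/Oct/2016:13:55:36 +0000] "GET /sites/default/files/realised.jpg HTTP/1.1" 404 512 "-" "' . $fb . '"',
  '203.0.113.6 - - [10/Oct/2016:13:56:01 +0000] "GET /sites/default/files/realised.jpg HTTP/1.1" 404 512 "-" "' . $fb . '"',
  '198.51.100.7 - - [10/Oct/2016:13:57:12 +0000] "GET /sites/default/files/other.jpg HTTP/1.1" 404 512 "-" "Mozilla/5.0"',
  '198.51.100.7 - - [10/Oct/2016:13:58:40 +0000] "GET /node/1 HTTP/1.1" 200 9000 "-" "Mozilla/5.0"',
];

foreach (tally_404_by_agent($sample, '/sites/default/files/') as $agent => $n) {
  echo $n, "\t", $agent, "\n";
}
```

If most of the hits carry the Facebook crawler's user agent and an empty referrer, that would be consistent with old shared posts rather than with anything on the current pages.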

Anyway, thank you very much for your help. I will report back once I have dug even deeper!

milfguy’s picture

EDIT: Well, I thought I had it, but I don't, still getting a large number of 404 errors. I've kept what I thought was a fix below, back to the drawing board!

I've found the source of the problem and a fix.

The cause of the 404 errors was Facebook posts
Because the articles were posted to Facebook some time ago, the meta data associated with the Facebook posts is out of date. Despite my using Facebook's debugging tool, the actual Facebook posts (as in the historic posts) were not updated; only future posts are updated.

As some of these articles are 'very' popular, they are still being shared 2 years later (which is surprising to me) but these posts are linked to the old meta data. I must assume there are facebook posts floating around being shared with no images associated with them, trying to make calls to the old images.

How to fix in drupal
First I went through all the meta data and made sure it was as it should be across all content types.

How to fix in facebook
Using the debugger (https://developers.facebook.com/tools/debug/), scrape new information on the pages that need it.

This only fixes any future shares.
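If there are many pages to re-scrape, the debugger step might be scripted. Facebook's sharing documentation describes a Graph API POST with scrape=true for this; the sketch below assumes that call is still available and that you have a valid access token. The function names, the token and the URL are placeholders, not anything from my site. Like the debugger, this would only affect future shares.

```php
<?php
// Sketch only: ask Facebook to re-scrape a URL via the Graph API.
// Assumes the documented "scrape=true" POST is still available and that
// a valid access token is supplied. Token and URL below are placeholders.

/** Build the POST fields for one re-scrape request. */
function build_rescrape_fields(string $url, string $token): array {
  return ['id' => $url, 'scrape' => 'true', 'access_token' => $token];
}

/**
 * Send one re-scrape request (requires PHP's curl extension).
 * Returns the raw response body, or false if the request failed.
 */
function rescrape(string $url, string $token) {
  $ch = curl_init('https://graph.facebook.com/');
  curl_setopt($ch, CURLOPT_POST, true);
  curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(build_rescrape_fields($url, $token)));
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  return curl_exec($ch);
}

// Show what would be sent, without making a network call.
echo http_build_query(build_rescrape_fields('http://example.com/some-article', 'PLACEHOLDER_TOKEN')), "\n";
```

To use it, loop over your article URLs and call rescrape() for each one, pausing between calls in case Facebook rate-limits the requests.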

To fix historic posts you need to locate the post that has an error.

Luckily for me all of our posts are initially posted to our facebook page, so for me everything is in one place. I just scroll down to a post with no image.

You need to find a link to that particular post. I did this by clicking the date at the top of that particular post (top of the facebook post).

Once you click the link you should now be on a page showing only that post.

Click the drop-down arrow at the top right of that post and you will see a list of options. Click More Options > Refresh shared attachment.

This will refresh the old post with the new meta data.

You are done!

***

I haven't found a way to do this in bulk and probably won't look into it. Our Facebook page could do with a clean-up, so it's a good excuse for me to go through it, and I'll only have a couple of hundred to do.

If anyone has more to do and finds a more elegant way to do this please post it for future reference.

I cannot see a way of finding a historic post if it has been shared in another way, i.e. when someone else has posted one of your articles, there is no clear way to see where the original post exists.

It is possible that I fixed something by messing about with my meta data, but I don't feel that is what has corrected this issue.

This write up is intended to help anyone else in the future. Thank you Jaypan for your help.

Max_Headroom’s picture

Replying to an old post. This started to annoy me so much that I started looking for a solution. Thanks for your post. Yes, it's the same thing here with Facebook.
But given this site's turnaround and the way FB is used, doing all that is a bit too much, so I did this
(if you don't mind losing all the warnings about images).
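In hook_cron:

function mymodule_cron() { // "mymodule" is a placeholder module name
  $pattern = 'sites/all/files/'; // or sites/[MY SITE]/files/
  // Delete only 'page not found' entries whose logged path contains $pattern.
  $num_deleted = db_delete('watchdog')
    ->condition('type', 'page not found')
    ->condition('message', '%' . db_like($pattern) . '%', 'LIKE')
    ->execute();
}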

Quentin Campbell