So here is a patch that does auto discovery. It needs a little work in the following areas

  1. Unicode regex
  2. little less naive way of handling the found <link> rss

Be nice to add in a tag support as well but holding off on that for now.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Scott Reynolds’s picture

FileSize
5.86 KB

hmm attached is the patch and forgot the mention of throwing an exception when it fails to find a feed.

Scott Reynolds’s picture

Status: Needs work » Needs review
FileSize
7.07 KB

ok got this patch to point where im happy with it. It uses proper unicode regex, most of it taken from SimplePie and it will throw an Exception when autoDiscover() fails to find any feeds in the html document.

I think there might be a potential to use these methods at form validation. Im going to spend some time exploring that so that it creates a more sensible user experience. For instance, if they fail to enter protocol in the form element.

alex_b’s picture

Status: Needs review » Needs work

Scott:

I hate to say it because I see all the work you've done and I'd LOVE to have auto discovery back (it worked in FeedAPI), but:

auto discover needs to go into libraries/http_fetcher.inc . I *really* want to keep non Feeds specific (integrational) functionality outside of the plugins and in libraries, this is functionality that could potentially even live outside of Drupal!

I'm on IRC if you have more questions.

Scott Reynolds’s picture

ahh i get it, there basically is this already although its a monster. hmm ok.

Scott Reynolds’s picture

the problem with that function as it stands currently is that it returns the data. That would mean the batch importer would need to know if it i should auto discover. Which looking at the existing code was not the intention. Hm....

alex_b’s picture

#4 It *is* a monster :-) I've pulled it over basically 1:1 from FeedAPI, FeedAPI used it from Aggregation (if I remember correctly), so it's oold. Would love to rewrite it if I had some time.

That would mean the batch importer would need to know if it i should auto discover.

I'm not sure whether I follow. Couldn't http_request auto detect a feed, download it when found and return its data?

Scott Reynolds’s picture

#6
The interface between the fetcher and the batch importer is "Here is the file name or the url" and the batch importer is the only one to deal with the raw data. And the fetcher is where you would want the setting for auto discovery. Therefore, the fetcher has to find the url not the data and hand it to the batch importer.

It is actually trivial, instead of returning the data, you return the url in the auto discover method.

So what I plan to todo.

Fix up that regex a bit with unicode and clean it WAY up. My patch here did essentially the same thing it just looks and feels cleaner cause its broken up.

Have HTTPFetcher call the function when its suppose to auto discover and hand the url off to the batch importer. Because HTTP responses are statically cached, there is no performance hit really.

Essentially, a "clean this up and turn on the feature".

alex_b’s picture

#7: I follow now. You'd like to essentially avoid doing auto discovery on every import, but rather you'd like to resolve the actual feed URL once and then keep using this actual feed URL rather than the original URL - correct?

Where do you plan to store the resolved URL? Would you replace the original URL with the resolved URL?

Scott Reynolds’s picture

FileSize
13.96 KB

#7: I follow now. You'd like to essentially avoid doing auto discovery on every import, but rather you'd like to resolve the actual feed URL once and then keep using this actual feed URL rather than the original URL - correct?

I decided against it. If the user types in something shouldn't that 'something' remain. Just like if i type

alert('xss')

in a node body that should still be stored.

So here is the massive patch. Sorry about that, I think it probably could be broken up a bit.

Summary of changes

  • Added a factory function to FeedsImportBatch. I didn't like the constructor that behaved differently depending on the parameters so I created a factory function. I do believe that it would be much cleaner to separate out the ImportBatch into two classes and the factory creates the proper class. But I resisted that urge.
  • Auto discovery is fixed up and added to the HTTPFetcher. The Fetcher can now be told to auto discover. @TODO: should this be the default behavior and how does that affect feeds_defaults?
  • The user supplied feed source is validated to check if the fetcher can find a feed there.

I think separating out the Batch class into two classes will provide a cleaner approach to this. The HTTPBatch class could then accept as a parameter $auto_discover. It could then defer auto discovery until getRaw() is called. That would prevent HTTP requests until they are actually required.

Alias, that thought just occurred to me and I need to eat :-D. And I think there is a reason why FeedsImportBatch is one master class instead of two separate ones. Care to enlighten me?

Scott Reynolds’s picture

FileSize
12.06 KB

Updated patch with bullet 1 removed. Seems off topic of this issue.

alex_b’s picture

And I think there is a reason why FeedsImportBatch is one master class instead of two separate ones. Care to enlighten me?

Only historical reasons. I agree that it should be broken into separate classes, just didn't have the time to do it. A @todo would be warranted.

Need more time to review the patch.

Scott Reynolds’s picture

Status: Needs work » Needs review

updating status.

Scott Reynolds’s picture

Scott Reynolds’s picture

FileSize
12.59 KB

Doh! made a minor (but major) mistake and didn't actually used the discovered feed.

In this patch thats fixed, but also I haven't ever like the use of extract(parse_url); so I put the result of parse_url into a variable. This is cleaner and doesn't through php notice's

Scott Reynolds’s picture

Status: Needs review » Needs work

Will need a reroll because of: #617054: PubSubHubbub support

alex_b’s picture

Status: Needs work » Needs review
FileSize
11.39 KB

Just rerolled the patch against HEAD. Not tested.

The patch uses now a checkbox instead of radios which is in line with the choice of the PuSH UI on the same page.

I've vastly simplified the part in FeedsHTTPFetcher and I'd like to discuss whether this covers your use case.

Essentially, the fetcher attempts to resolve the URL on validate if auto_detect_feed is true. What's different is that it modifies the URL once and forever. This keeps things much simpler for the rest of the stack (namely the fetch() method) as there doesn't be an awareness of URLs that need to be resolved before using them. Also edge cases like OPML import where existing URLs are determined by simply looking at the source field of the feeds_source table remain functional.

If the fetcher cannot resolve the URL it fails silently. This is in line with the existing behavior of Feeds where there is no warning or error issued if the feed that is added cannot be downloaded or is not of a specific format (the only error you get is that the title cannot be resolved). URL (input) validation is something that we should keep separate from this patch if possible.

What are your thoughts? I'm available to discuss this on IRC if you'd like to.

I did not review much of http_fetcher.inc so far.

Scott Reynolds’s picture

I resisted that urge (though man didn't realize how much it simplified things!) only because if the actual page provided changed their html output to put a different link tag in there. Sortof in the vain of output formatting.

Preserve what the user entered and filter it only on output. Same concept. (@see #9). Preserve the url they provided but filter and find the real feed.

Does that metaphor hold in this case? Honestly, Im on the fence.

Scott Reynolds’s picture

eehh and im working on an upgrade path from SimpleFeed to Feeds and SimpleFeed never changed the url based on its autodiscovery. So the data I enter in to feeds_source won't have the right url. It will have to autodiscover before inserting into that table.

alex_b’s picture

So I just thought more about this. I think we should only resolve on feed submission (specifically on sourceFormValidate()). It simplifies much and it's not too confusing for the user. Further, resolving shouldn't present a hard problem for the simplefeeds upgrade script. The feeds API should be available when upgrading, so you simply do a $url = http_request_get_common_syndication($url) for every migrated feed - right?

nvanhove’s picture

What I think a good solution would be is use this function on the node-add page. When entering a HTML field, it could detect this, get the first application/rss+xml element, return to the page with a message: "@url is not a valid link. Did you mean @discoveredlink"?

alex_b’s picture

"@url is not a valid link. Did you mean @discoveredlink"?

I'd rather not bother a user with this... If this is desired behavior in a site build, it can be added simply by adding an additional validation step to the node form - right?

nvanhove’s picture

Problem with that is that the autodiscovery really is an educated guess method. For example, my Twitter has the links:

<link rel="alternate" href="http://twitter.com/statuses/friends_timeline.rss" title="Your Twitter Timeline" type="application/rss+xml" /> 
<link rel="alternate" href="http://twitter.com/statuses/replies.rss" title="Your Twitter @nielsz Mentions" type="application/rss+xml" /> 

In this case, the first is the one we probably want, but you can never be sure. So an interface letting the user making the final choice could be a solution.

A other interface could be a ajax call wich displays:

did you mean:
1. Your Twitter Timeline [yes]
2. Your Twitter @nielsz Mentions? [yes]

alex_b’s picture

#22: Good point. Only reservation I'd have would be that I don't want to give users choices like "News as RSS" or "News as Atom". But I guess we could handle that.

I would argue that we should first land the "educated guess" approach that Scott started and then deal with an improved UI for auto discovery in a separate issue.

How does this sound?

The "improved UI" would mean that if the detection mechanism finds equal link entries it escalates them to the user. I'd like to avoid AHAH or AJAX, ideally we can run detection on form validation and then either present the same node form with a choice between the found links or a second form. This mechanism needs to work not only on node forms but on standalone forms, too.

Thoughts?

m3avrck’s picture

Yes, auto-discovery to take the first URL it finds would be a great step forward. Should it replace what the user entered? That seems to be ok since that would dovetail into a new interface (more on that in a sec). Otherwise, I would propose 2 url fields: user_entered and auto_discovered, but that complicates things.

On autodiscovery, a 2nd patch could then say: more than one feed found, which do you want. If no feed found, it can alert the user and they can't proceed. This would address both an improved experience and prevent bad feeds from being entered. Another patch/issue for this would be needed.

alex_b’s picture

Should it replace what the user entered?

Yes. (see #19)

Let's shoot for a two patch approach and go forward with Scott's patch first. Ball is in my court to take a closer look at #16 then.

Scott Reynolds’s picture

Excellent that plan will work. http_request_find_feeds() returns an array of feed urls. good enough abstract to achieve those goals.

I would love to work on a patch that takes the user entered AJAX and come back with the possible choices for the user to select. I don't see anything in this patch that would make that impossible or difficult.

alex_b’s picture

Status: Needs review » Needs work
FileSize
12.61 KB

I just tested #16 and found that it needed slight reordering in hook_nodeapi() to make the validation hook fire before we are looking for a title.

Then I tried this feed:

http://news.google.com/news/search?aq=f&pz=1&cf=all&ned=en_pk&hl=en&q=dr...

to test auto discovery. The feed URLs detected by the http_request library are not HTML decoded and hence don't work.

We should add some tests that verify HTML documents that we expect to work. The documents to test against should be included in feeds.

Scott Reynolds’s picture

Ahh good find. simple fix

$candidate[drupal_strtolower($attribute[1])] = drupal_strtolower(decode_entities($attribute[2]));
Scott Reynolds’s picture

FileSize
12.5 KB

Using the latest patch, and using the quick change i mentioned doesn't work. It finds the URL just fine but setting the value in sourceFormValidate doesn't work. Setting values like this in normal FAPI workflow doesn't work you need to use form_set_value().

I put a dpm() in the nodeapi:validate and in nodeapi:insert/update and I saw the source change to the discovered feed in the validate and change back to the original in the insert/update.

So I went looking for a way to change the one value in the source->config on sourceSave() and didn't find one.

It would be easy to add in the soureForm function a quick element_validate function on the form element and have that do its magik using form_set_value, but I resisted because I think the Feeds API needs to be able to handle this. But if you disagree, its a quick 30 minute fix.

Attached is the patch fixing the url decoding problem. It does not make this actually work though :(

alex_b’s picture

Uff. Did you try if you could use form_set_value() in configFormValidate() if you pass $form into configFormValidate() ? If we can do that, this would be ideal. This would essentially require refactoring all instances of configFormValidate() and sourceFormValidate() which wouldn't be too bad.

If we would also need to pass in the full $form_state array into configFormValidate() we'd have to refactor all instances of this method to expect the full $form_state and not the part which is the configurable's own. I'd rather go with a custom element_validate callback in this case.

Hint, look at feeds_config_form_validate() in FeedsConfigurable.inc to see how the validate methods are being invoked.

Scott Reynolds’s picture

Status: Needs work » Needs review
FileSize
14.28 KB

Well taking a step back, on nodeapi_insert, feeds creates a FeedsSource and then assigns it the config based node->feeds. This config needs to be modified by the Source's importer's plugins.

So it calls its importer's plugin's sourceSave, so why not pass in the $source_config into that method. So thats what this patch does and it works.

alex_b’s picture

#31: does title detection work when an HTML page is given?

A problem with this approach is that $source_config in saveSource() is piercing the current paradigm where we don't mix setting with saving methods. Hm... I need to do more thinking/experimenting with this issue, I really want to get this right.

I won't be in the coming week, I hope this patch works for you in the meantime, I'll get back to it in the week of the 15th.

alexfisher’s picture

Subscribing

Scott Reynolds’s picture

A problem with this approach is that $source_config in saveSource() is piercing the current paradigm where we don't mix setting with saving methods. Hm... I need to do more thinking/experimenting with this issue, I really want to get this right.

Well maybe we go back to the idea of simple form_set_value() via a '#element_validate' function on the element then. Its simple and if #22 is our goal this might actually make the most sense. When the form is submitted the value in that $form_state['values'] should be the exact value that is written to the database. On validate, this will form_set_error(), set the discovered_feeds into the $form_state for the form builder to find on the rebuild and set a select box instead of a textfield.

dead simple.

Scott Reynolds’s picture

FileSize
14.32 KB

rerolled to prevent fatal error. It includes the http_request.inc twice because it didn't use feeds_include_library, it used feeds_include instead. It now uses feeds_include_library

alex_b’s picture

We have a winner.

- Rerolled after no newline cleanup commit broke http_fetcher.inc patch - st00pid CSV :p
- HTTP auto discovery is executed on FeedsHTTPFetcher::sourceFormValidate() now. #740962-5 introduces a fix where changes to $values array in FPlugin::sourceFormValidate() are carried over to subsequent stages by a little workaround in feeds_nodeapi() - the standalone form should just work. Details on #740962.

From my point of view this is RTBC once #740962-5: FileFetcher Attached to Feed Node, Upload Field Not Saving File Path is in.

Scott: could you give this patch a spin and see whether it's working for you? You'll have to apply the patch on #740962-5.

Scott Reynolds’s picture

Do autodiscovery in the validate and the save function? Why is that necessary? Thats two http requests at a minimum

alex_b’s picture

Hm, I guess I shouldn't roll patches so late. The autodiscovery in the save function is completely unnecessary.

[edit: this patch still requires #740962-5: FileFetcher Attached to Feed Node, Upload Field Not Saving File Path to be in place.]

Will White’s picture

Status: Needs review » Fixed

Committed. This is an awesome feature.

http://drupal.org/cvs?commit=348266

alex_b’s picture

Thanks, Will for committing. Thank you Scott for your patience and persistence to see this through.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.