HTTPFetcher autodiscovery [#705872]

Comment	File	Size	Author
#38	705872-38_auto_discovery.patch	11.53 KB	alex_b

#36	705872-36_auto_discovery.patch	11.94 KB	alex_b

#35	feeds_705872-35.patch	14.32 KB	Scott Reynolds

#31	feeds_705872-31.patch	14.28 KB	Scott Reynolds

#29	feeds_705872-29.patch	12.5 KB	Scott Reynolds

#27	705872-27_autodiscovery.patch	12.61 KB	alex_b

#16	705872-16_autodiscovery.patch	11.39 KB	alex_b

#14	feeds_705872-14.patch	12.59 KB	Scott Reynolds

#13	feeds_705872-13.patch	12.45 KB	Scott Reynolds

#10	feeds_705872-10.patch	12.06 KB	Scott Reynolds

#9	feeds_705872-9.patch	13.96 KB	Scott Reynolds

#2	autodiscover_705872-2.patch	7.07 KB	Scott Reynolds

#1	feeds_autodiscover.patch	5.86 KB	Scott Reynolds

Comment #1

Scott Reynolds CreditAttribution: Scott Reynolds commented 5 February 2010 at 18:44

File	Size
feeds_autodiscover.patch	5.86 KB

hmm attached is the patch and forgot the mention of throwing an exception when it fails to find a feed.

Log in or register to post comments

Comment #2

Scott Reynolds CreditAttribution: Scott Reynolds commented 5 February 2010 at 19:51

Status:

Needs work

» Needs review

File	Size
autodiscover_705872-2.patch	7.07 KB

ok got this patch to point where im happy with it. It uses proper unicode regex, most of it taken from SimplePie and it will throw an Exception when autoDiscover() fails to find any feeds in the html document.

I think there might be a potential to use these methods at form validation. Im going to spend some time exploring that so that it creates a more sensible user experience. For instance, if they fail to enter protocol in the form element.

Log in or register to post comments

Comment #3

alex_b CreditAttribution: alex_b commented 5 February 2010 at 20:24

Status:

Needs review

» Needs work

Scott:

I hate to say it because I see all the work you've done and I'd LOVE to have auto discovery back (it worked in FeedAPI), but:

auto discover needs to go into libraries/http_fetcher.inc . I *really* want to keep non Feeds specific (integrational) functionality outside of the plugins and in libraries, this is functionality that could potentially even live outside of Drupal!

I'm on IRC if you have more questions.

Log in or register to post comments

Comment #4

Scott Reynolds CreditAttribution: Scott Reynolds commented 5 February 2010 at 22:56

ahh i get it, there basically is this already although its a monster. hmm ok.

Log in or register to post comments

Comment #5

Scott Reynolds CreditAttribution: Scott Reynolds commented 5 February 2010 at 23:16

the problem with that function as it stands currently is that it returns the data. That would mean the batch importer would need to know if it i should auto discover. Which looking at the existing code was not the intention. Hm....

Log in or register to post comments

Comment #6

alex_b CreditAttribution: alex_b commented 5 February 2010 at 23:27

#4 It *is* a monster :-) I've pulled it over basically 1:1 from FeedAPI, FeedAPI used it from Aggregation (if I remember correctly), so it's oold. Would love to rewrite it if I had some time.

That would mean the batch importer would need to know if it i should auto discover.

I'm not sure whether I follow. Couldn't http_request auto detect a feed, download it when found and return its data?

Log in or register to post comments

Comment #7

Scott Reynolds CreditAttribution: Scott Reynolds commented 6 February 2010 at 01:44

#6
The interface between the fetcher and the batch importer is "Here is the file name or the url" and the batch importer is the only one to deal with the raw data. And the fetcher is where you would want the setting for auto discovery. Therefore, the fetcher has to find the url not the data and hand it to the batch importer.

It is actually trivial, instead of returning the data, you return the url in the auto discover method.

So what I plan to todo.

Fix up that regex a bit with unicode and clean it WAY up. My patch here did essentially the same thing it just looks and feels cleaner cause its broken up.

Have HTTPFetcher call the function when its suppose to auto discover and hand the url off to the batch importer. Because HTTP responses are statically cached, there is no performance hit really.

Essentially, a "clean this up and turn on the feature".

Log in or register to post comments

Comment #8

alex_b CreditAttribution: alex_b commented 6 February 2010 at 17:16

#7: I follow now. You'd like to essentially avoid doing auto discovery on every import, but rather you'd like to resolve the actual feed URL once and then keep using this actual feed URL rather than the original URL - correct?

Where do you plan to store the resolved URL? Would you replace the original URL with the resolved URL?

Log in or register to post comments

Comment #9

Scott Reynolds CreditAttribution: Scott Reynolds commented 7 February 2010 at 02:14

File	Size
feeds_705872-9.patch	13.96 KB

#7: I follow now. You'd like to essentially avoid doing auto discovery on every import, but rather you'd like to resolve the actual feed URL once and then keep using this actual feed URL rather than the original URL - correct?

I decided against it. If the user types in something shouldn't that 'something' remain. Just like if i type

alert('xss')

in a node body that should still be stored.

So here is the massive patch. Sorry about that, I think it probably could be broken up a bit.

Summary of changes

Added a factory function to FeedsImportBatch. I didn't like the constructor that behaved differently depending on the parameters so I created a factory function. I do believe that it would be much cleaner to separate out the ImportBatch into two classes and the factory creates the proper class. But I resisted that urge.
Auto discovery is fixed up and added to the HTTPFetcher. The Fetcher can now be told to auto discover. @TODO: should this be the default behavior and how does that affect feeds_defaults?
The user supplied feed source is validated to check if the fetcher can find a feed there.

I think separating out the Batch class into two classes will provide a cleaner approach to this. The HTTPBatch class could then accept as a parameter $auto_discover. It could then defer auto discovery until getRaw() is called. That would prevent HTTP requests until they are actually required.

Alias, that thought just occurred to me and I need to eat :-D. And I think there is a reason why FeedsImportBatch is one master class instead of two separate ones. Care to enlighten me?

Log in or register to post comments

Comment #10

Scott Reynolds CreditAttribution: Scott Reynolds commented 8 February 2010 at 18:15

File	Size
feeds_705872-10.patch	12.06 KB

Updated patch with bullet 1 removed. Seems off topic of this issue.

Log in or register to post comments

Comment #11

alex_b CreditAttribution: alex_b commented 8 February 2010 at 18:26

And I think there is a reason why FeedsImportBatch is one master class instead of two separate ones. Care to enlighten me?

Only historical reasons. I agree that it should be broken into separate classes, just didn't have the time to do it. A @todo would be warranted.

Need more time to review the patch.

Log in or register to post comments

Comment #12

Scott Reynolds CreditAttribution: Scott Reynolds commented 8 February 2010 at 18:48

Status:

Needs work

» Needs review

updating status.

Log in or register to post comments

Comment #13

Scott Reynolds CreditAttribution: Scott Reynolds commented 20 February 2010 at 20:14

File	Size
feeds_705872-13.patch	12.45 KB

Reroll because of #708228: Break FeedsImportBatch into separate classes

Log in or register to post comments

Comment #14

Scott Reynolds CreditAttribution: Scott Reynolds commented 20 February 2010 at 20:27

File	Size
feeds_705872-14.patch	12.59 KB

Doh! made a minor (but major) mistake and didn't actually used the discovered feed.

In this patch thats fixed, but also I haven't ever like the use of extract(parse_url); so I put the result of parse_url into a variable. This is cleaner and doesn't through php notice's

Log in or register to post comments

Comment #15

Scott Reynolds CreditAttribution: Scott Reynolds commented 23 February 2010 at 17:17

Status:

Needs review

» Needs work

Will need a reroll because of: #617054: PubSubHubbub support

Log in or register to post comments

Comment #16

alex_b CreditAttribution: alex_b commented 23 February 2010 at 20:50

Status:

Needs work

» Needs review

File	Size
705872-16_autodiscovery.patch	11.39 KB

Just rerolled the patch against HEAD. Not tested.

The patch uses now a checkbox instead of radios which is in line with the choice of the PuSH UI on the same page.

I've vastly simplified the part in FeedsHTTPFetcher and I'd like to discuss whether this covers your use case.

Essentially, the fetcher attempts to resolve the URL on validate if auto_detect_feed is true. What's different is that it modifies the URL once and forever. This keeps things much simpler for the rest of the stack (namely the fetch() method) as there doesn't be an awareness of URLs that need to be resolved before using them. Also edge cases like OPML import where existing URLs are determined by simply looking at the source field of the feeds_source table remain functional.

If the fetcher cannot resolve the URL it fails silently. This is in line with the existing behavior of Feeds where there is no warning or error issued if the feed that is added cannot be downloaded or is not of a specific format (the only error you get is that the title cannot be resolved). URL (input) validation is something that we should keep separate from this patch if possible.

What are your thoughts? I'm available to discuss this on IRC if you'd like to.

I did not review much of http_fetcher.inc so far.

Log in or register to post comments

Comment #17

Scott Reynolds CreditAttribution: Scott Reynolds commented 23 February 2010 at 21:25

I resisted that urge (though man didn't realize how much it simplified things!) only because if the actual page provided changed their html output to put a different link tag in there. Sortof in the vain of output formatting.

Preserve what the user entered and filter it only on output. Same concept. (@see #9). Preserve the url they provided but filter and find the real feed.

Does that metaphor hold in this case? Honestly, Im on the fence.

Log in or register to post comments

Comment #18

Scott Reynolds CreditAttribution: Scott Reynolds commented 23 February 2010 at 21:28

eehh and im working on an upgrade path from SimpleFeed to Feeds and SimpleFeed never changed the url based on its autodiscovery. So the data I enter in to feeds_source won't have the right url. It will have to autodiscover before inserting into that table.

Log in or register to post comments

Comment #19

alex_b CreditAttribution: alex_b commented 23 February 2010 at 22:09

So I just thought more about this. I think we should only resolve on feed submission (specifically on sourceFormValidate()). It simplifies much and it's not too confusing for the user. Further, resolving shouldn't present a hard problem for the simplefeeds upgrade script. The feeds API should be available when upgrading, so you simply do a $url = http_request_get_common_syndication($url) for every migrated feed - right?

Log in or register to post comments

Comment #20

nvanhove CreditAttribution: nvanhove commented 25 February 2010 at 21:12

What I think a good solution would be is use this function on the node-add page. When entering a HTML field, it could detect this, get the first application/rss+xml element, return to the page with a message: "@url is not a valid link. Did you mean @discoveredlink"?

Log in or register to post comments

Comment #21

alex_b CreditAttribution: alex_b commented 25 February 2010 at 21:15

"@url is not a valid link. Did you mean @discoveredlink"?

I'd rather not bother a user with this... If this is desired behavior in a site build, it can be added simply by adding an additional validation step to the node form - right?

Log in or register to post comments

Comment #22

nvanhove CreditAttribution: nvanhove commented 26 February 2010 at 08:47

Problem with that is that the autodiscovery really is an educated guess method. For example, my Twitter has the links:

<link rel="alternate" href="http://twitter.com/statuses/friends_timeline.rss" title="Your Twitter Timeline" type="application/rss+xml" /> 
<link rel="alternate" href="http://twitter.com/statuses/replies.rss" title="Your Twitter @nielsz Mentions" type="application/rss+xml" />

In this case, the first is the one we probably want, but you can never be sure. So an interface letting the user making the final choice could be a solution.

A other interface could be a ajax call wich displays:

did you mean:
1. Your Twitter Timeline [yes]
2. Your Twitter @nielsz Mentions? [yes]

Log in or register to post comments

Comment #23

alex_b CreditAttribution: alex_b commented 26 February 2010 at 15:19

#22: Good point. Only reservation I'd have would be that I don't want to give users choices like "News as RSS" or "News as Atom". But I guess we could handle that.

I would argue that we should first land the "educated guess" approach that Scott started and then deal with an improved UI for auto discovery in a separate issue.

How does this sound?

The "improved UI" would mean that if the detection mechanism finds equal link entries it escalates them to the user. I'd like to avoid AHAH or AJAX, ideally we can run detection on form validation and then either present the same node form with a choice between the found links or a second form. This mechanism needs to work not only on node forms but on standalone forms, too.

Thoughts?

Log in or register to post comments

Comment #24

m3avrck CreditAttribution: m3avrck commented 27 February 2010 at 00:12

Yes, auto-discovery to take the first URL it finds would be a great step forward. Should it replace what the user entered? That seems to be ok since that would dovetail into a new interface (more on that in a sec). Otherwise, I would propose 2 url fields: user_entered and auto_discovered, but that complicates things.

On autodiscovery, a 2nd patch could then say: more than one feed found, which do you want. If no feed found, it can alert the user and they can't proceed. This would address both an improved experience and prevent bad feeds from being entered. Another patch/issue for this would be needed.

Log in or register to post comments

Comment #25

alex_b CreditAttribution: alex_b commented 27 February 2010 at 00:50

Should it replace what the user entered?

Yes. (see #19)

Let's shoot for a two patch approach and go forward with Scott's patch first. Ball is in my court to take a closer look at #16 then.

Log in or register to post comments

Comment #26

Scott Reynolds CreditAttribution: Scott Reynolds commented 27 February 2010 at 17:11

Excellent that plan will work. http_request_find_feeds() returns an array of feed urls. good enough abstract to achieve those goals.

I would love to work on a patch that takes the user entered AJAX and come back with the possible choices for the user to select. I don't see anything in this patch that would make that impossible or difficult.

Log in or register to post comments

Comment #27

alex_b CreditAttribution: alex_b commented 3 March 2010 at 16:23

Status:

Needs review

» Needs work

File	Size
705872-27_autodiscovery.patch	12.61 KB

I just tested #16 and found that it needed slight reordering in hook_nodeapi() to make the validation hook fire before we are looking for a title.

Then I tried this feed:

http://news.google.com/news/search?aq=f&pz=1&cf=all&ned=en_pk&hl=en&q=dr...

to test auto discovery. The feed URLs detected by the http_request library are not HTML decoded and hence don't work.

We should add some tests that verify HTML documents that we expect to work. The documents to test against should be included in feeds.

Log in or register to post comments

Comment #28

Scott Reynolds CreditAttribution: Scott Reynolds commented 3 March 2010 at 16:57

Ahh good find. simple fix

$candidate[drupal_strtolower($attribute[1])] = drupal_strtolower(decode_entities($attribute[2]));

Log in or register to post comments

Comment #29

Scott Reynolds CreditAttribution: Scott Reynolds commented 3 March 2010 at 17:36

File	Size
feeds_705872-29.patch	12.5 KB

Using the latest patch, and using the quick change i mentioned doesn't work. It finds the URL just fine but setting the value in sourceFormValidate doesn't work. Setting values like this in normal FAPI workflow doesn't work you need to use form_set_value().

I put a dpm() in the nodeapi:validate and in nodeapi:insert/update and I saw the source change to the discovered feed in the validate and change back to the original in the insert/update.

So I went looking for a way to change the one value in the source->config on sourceSave() and didn't find one.

It would be easy to add in the soureForm function a quick element_validate function on the form element and have that do its magik using form_set_value, but I resisted because I think the Feeds API needs to be able to handle this. But if you disagree, its a quick 30 minute fix.

Attached is the patch fixing the url decoding problem. It does not make this actually work though :(

Log in or register to post comments

Comment #30

alex_b CreditAttribution: alex_b commented 3 March 2010 at 17:52

Uff. Did you try if you could use form_set_value() in configFormValidate() if you pass $form into configFormValidate() ? If we can do that, this would be ideal. This would essentially require refactoring all instances of configFormValidate() and sourceFormValidate() which wouldn't be too bad.

If we would also need to pass in the full $form_state array into configFormValidate() we'd have to refactor all instances of this method to expect the full $form_state and not the part which is the configurable's own. I'd rather go with a custom element_validate callback in this case.

Hint, look at feeds_config_form_validate() in FeedsConfigurable.inc to see how the validate methods are being invoked.

Log in or register to post comments

Comment #31

Scott Reynolds CreditAttribution: Scott Reynolds commented 4 March 2010 at 17:09

Status:

Needs work

» Needs review

File	Size
feeds_705872-31.patch	14.28 KB

Well taking a step back, on nodeapi_insert, feeds creates a FeedsSource and then assigns it the config based node->feeds. This config needs to be modified by the Source's importer's plugins.

So it calls its importer's plugin's sourceSave, so why not pass in the $source_config into that method. So thats what this patch does and it works.

Log in or register to post comments

Comment #32

alex_b CreditAttribution: alex_b commented 8 March 2010 at 04:01

#31: does title detection work when an HTML page is given?

A problem with this approach is that $source_config in saveSource() is piercing the current paradigm where we don't mix setting with saving methods. Hm... I need to do more thinking/experimenting with this issue, I really want to get this right.

I won't be in the coming week, I hope this patch works for you in the meantime, I'll get back to it in the week of the 15th.

Log in or register to post comments

Comment #33

alexfisher CreditAttribution: alexfisher commented 12 March 2010 at 15:12

Subscribing

Log in or register to post comments

Comment #34

Scott Reynolds CreditAttribution: Scott Reynolds commented 17 March 2010 at 16:40

A problem with this approach is that $source_config in saveSource() is piercing the current paradigm where we don't mix setting with saving methods. Hm... I need to do more thinking/experimenting with this issue, I really want to get this right.

Well maybe we go back to the idea of simple form_set_value() via a '#element_validate' function on the element then. Its simple and if #22 is our goal this might actually make the most sense. When the form is submitted the value in that $form_state['values'] should be the exact value that is written to the database. On validate, this will form_set_error(), set the discovered_feeds into the $form_state for the form builder to find on the rebuild and set a select box instead of a textfield.

dead simple.

Log in or register to post comments

Comment #35

Scott Reynolds CreditAttribution: Scott Reynolds commented 17 March 2010 at 17:12

File	Size
feeds_705872-35.patch	14.32 KB

rerolled to prevent fatal error. It includes the http_request.inc twice because it didn't use feeds_include_library, it used feeds_include instead. It now uses feeds_include_library

Log in or register to post comments

Comment #36

alex_b CreditAttribution: alex_b commented 29 March 2010 at 03:54

File	Size
705872-36_auto_discovery.patch	11.94 KB

We have a winner.

- Rerolled after no newline cleanup commit broke http_fetcher.inc patch - st00pid CSV :p
- HTTP auto discovery is executed on FeedsHTTPFetcher::sourceFormValidate() now. #740962-5 introduces a fix where changes to $values array in FPlugin::sourceFormValidate() are carried over to subsequent stages by a little workaround in feeds_nodeapi() - the standalone form should just work. Details on #740962.

From my point of view this is RTBC once #740962-5: FileFetcher Attached to Feed Node, Upload Field Not Saving File Path is in.

Scott: could you give this patch a spin and see whether it's working for you? You'll have to apply the patch on #740962-5.