Embed.ly provider patterns containing * in hostname never match. [#887412]

Some of the Embed.ly schemes that come back from oembedembedly_default_oembedcore_provider() contain patterns (.+) after the period separator in the hostname. These are translated to * and used to select regex patterns in oembedcore_get_provider().

For example, for YouTube:

http://*youtube.com/watch*
http://*.youtube.com/v/*
http://youtu.be/*
http://*.youtube.com/user/*#*
http://*.youtube.com/*#*/*

which are converted to

*youtube.com
youtube.com
youtu.be

Likewise, there are other patterns that are being converted to unusable hosts:

*justin.tv
*amazon.*

I'm working on a patch.

Comment	File	Size	Author
#10	oembed-patterns-0.8-887412-10.patch	3.32 KB	kasperg
#1	oembed-patterns-887412-1.patch	3.5 KB	Anonymous (not verified)

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Comment #1

Anonymous (not verified) CreditAttribution: Anonymous commented 19 August 2010 at 00:31

Status:

Active

» Needs review

File	Size
oembed-patterns-887412-1.patch	3.5 KB

This patch removes all the code that tries to get a hostname out of the scheme. Instead the providers array is indexed by the regular expression itself.

The undocumented hook_oembedprovider_alter() needs to be updated to expect a different kind of array.

Deriving a hostname from the scheme is brittle, and this is fundamental and critical functionality for the module.

Embed.ly does return domain and subdomain information about the services it supports, but I don't see a way for hook_default_oembedcore_provider() to handle this without changing around more code.

Comment #2

voxpelli CreditAttribution: voxpelli commented 25 August 2010 at 11:25

*youtube.com, *justin.tv, *amazon.* seems to be the case of Embed.ly not following the specification. It explicitly states that:

Within the domain portion, wildcards may only be used for subdomains.

Regarding removing the host-indexing - that would result in us needing to check all links against all regular expressions. That's a real performance hit - especially if we have very many regular expressions. If we have the Embed.ly module active we will have 100-300 regular expressions. Checking every link against that in eg. a blog entry is a lot more expensive than just extracting the domain and see if there are any expressions for that domain.

I'm not sure how to fix this - will think about it.

Comment #3

Anonymous (not verified) CreditAttribution: Anonymous commented 25 August 2010 at 11:40

I was not able to measure any performance difference of more than a millisecond when using the field formatter. The outcome is cached and retrieved later using the requested URL's MD5 hash.

I don't use the input filter, so if that's where the performance cost shows up, then the patch is no good. Your point about the spec is true. These are not valid schemas for oEmbed.

Perhaps there could be a way to trap all the invalid schemas so they can be checked for every domain? i.e. your URL matches a schema that is valid for the oEmbed schema, so you pay no cost. If you have 1,000 schemas that are not valid and your URL doesn't match any of the ones that are valid, you'll have to pay the cost of matching against every "trapped" invalid schema? Hmm... maybe not. This still means most URLs (which are not oEmbed) are going to be checked against the invalid schemas...

I think the only solution is to get embed.ly to stick to the standard.

Comment #4

screeley CreditAttribution: screeley commented 25 August 2010 at 13:35

We try to follow the oEmbed spec as much as possible, but it's 5 years old and has not been updated in a long time. There are little things like this that we do not follow to the letter because it's time and cost prohibitive.

* should be interrupted as .* not .+

subdomains:

*youtube.com/watch* matches both http://www.youtube.com/watch?v=weqwer and http://youtube.com/watch?v=123421. If we didn't use the * in the subdomain you would see the number of domain patterns skyrocket because we would have to have two patterns for every one we have now.

*.youtube.com/watch*
youtube.com/watch*

The same goes for amazon

amazon.com*
amazon.co.jp*
amazon.de*
amazon.co.uk*
etc...

If 100-300 domain regexes are taking too long then this seems like contributing to more of a problem then a solution.

The services api does return domain and subdomain information. See http://api.embed.ly/v1/api/services. Now the list of domains and subdomains are not exhaustive. It takes way too much time and effort to hunt down every permutation of a domain and a url. It does however seem like that would be the solution to your problem.

Please feel free to contact us through support.embed.ly. We would be more than happy to work with you guys to resolve this issue and make it work for you.

Sean

Comment #5

Anonymous (not verified) CreditAttribution: Anonymous commented 26 August 2010 at 03:20

It sounds like the reasonable way forward is to let hook_default_oembedcore_provider() return domain information. If the domain information doesn't exist in the provider objects, parse the schema strings for domains.

Comment #6

voxpelli CreditAttribution: voxpelli commented 20 September 2010 at 19:38

Status:

Needs review

» Needs work

I've responded to screeley and suggested that we try and move some of this discussion to the oEmbed mailing list.

Should we do a quick fix for this issue now or should we wait for the conclusion of that discussion?

Comment #7

Anonymous (not verified) CreditAttribution: Anonymous commented 25 January 2011 at 09:02

Status:

Needs work

» Needs review

https://github.com/bangpound/drupal-oembed/commit/5960d900bb3538458cc967...

This is a fix for some of the problems, but some will never be fixable.

Domains like amazon* don't have adequate information from embed.ly to calculate any valid hosts. The domain and subdomains returned by embed.ly are very incomplete.

The long term fix would be to rely on regular expressions throughout and not to try to extract the hostname or do more complicated key matching in the providers array.

Comment #8

voxpelli CreditAttribution: voxpelli commented 26 February 2011 at 19:54

Regarding moving completely to regular expressions - I benchmarked this a bit:

I generated 500 random providers - each with between 1 and 5 patterns - and 10k urls to test of which 1.3 k was urls from a provider. Running tests against the raw regular expressions resulted in over 14 000k regular expressions being executed to find the providers of the 1.3 k urls taking 45 seconds on my machine. Matching the domain names first decreased the number of regular expressions executed to 2.7k and only took around 0.17 second - over 250 times faster.

Due to the input filter in this module we need the kind of performance we get here.

I'm talking once again to the guys at Embed.ly now and looks like we can solve it by making the domain name static while allowing wildcards for both subdomains and tld:s.

Your patch looks like a good solution in the meanwhile though - will commit it the next time I get around to the code. (I'm sorry - I've very much to do right now)

Comment #9

Anonymous (not verified) CreditAttribution: Anonymous commented 28 May 2011 at 20:19

Status:

Needs review

» Needs work

I've committed a fix to 7.x and 6.x. As of today, it expands support for embedded media from these domains:

youtube.com
justin.tv
revision3.com
twitvid.com
revver.com
viddler.com
crackle.com
nfb.ca
meetup.com
imgur.com
dribble.com

These schemes from embed.ly are still unusable.

http://*amazon.*/gp/product/*
http://*amazon.*/*/dp/*
http://*amazon.*/dp/*
http://*amazon.*/o/ASIN/*
http://*amazon.*/gp/offer-listing/*
http://*amazon.*/*/ASIN/*
http://*amazon.*/gp/product/images/*
http://*amazon.*/gp/aw/d/*
http://*yfrog.*/*
http://picasaweb.google.com*/*/*#*
http://picasaweb.google.com*/lh/photo/*
http://picasaweb.google.com*/*/*

Comment #10

kasperg CreditAttribution: kasperg commented 21 July 2011 at 12:46

File	Size
oembed-patterns-0.8-887412-10.patch	3.32 KB

Here's the patch at #1 rerolled against 0.8 in case anyone else needs it.

Comment #11

28 October 2015 at 15:36

bangpound committed c6dc813 on 8.x-1.x

#887412 by bangpound: Identify domains from some broken embed.ly's...

Embed.ly provider patterns containing * in hostname never match.

Comments

Comment #1

Comment #2

Comment #3

Comment #4

Comment #5

Comment #6

Comment #7

Comment #8

Comment #9

Comment #10

Comment #11

Thank you to these Drupal contributors

News items

Our community

Documentation

Drupal code base

Governance of community