Closed (fixed)
Project:
Collect
Version:
8.x-1.x-dev
Component:
Metadata Management
Priority:
Major
Category:
Feature request
Assigned:
Unassigned
Reporter:
Created:
6 Mar 2015 at 11:55 UTC
Updated:
26 Mar 2015 at 08:34 UTC
Jump to comment: Most recent, Most recent file
Comments
Comment #1
miro_dietiker:-)
Comment #2
mbovan commentedAdded new plugin for captured web pages and added controller that generates web page from html/json data so we can get preview of the captured page. Also provided web tests for fetch url schema plugin. The screenshot is displayed below.
Comment #3
miro_dietikerLooks pretty nice. Just some hints.
Unsure about csrf here. This makes the link temporary. We can start with that, but there might be an application where we need stable links to navigate through pages fetched.
Please stay consistent with the keys.
Missing test coverage: Follow the link and check contents. Make sure only the body markup is present and nothing additional.
Comment #4
mbovan commentedChanged key names, extended tests to compare body content stored in the collect container and page content of the generated page.
Comment #5
arla commentedShould be
truein lower-case and without quotes.The ternary expression is unnecessary here. The instanceof expression is enough, it returns bool.
AccessResult::allowedIf() can be used here.
A URL is an address. Does it make sense to use "url" to denote the content available at that address? Maybe better to use "web", "resource" or "http" to base the name off?
is*
(or "are no headers/header fields")
These do not seem to have a purpose?
Why are the values lists? If they always have only one element, is it perhaps nicer to "un-list-ify" it when saving in FetchUrlForm?
Hm, that's a nasty hack to get to the container ID :) Maybe we should rethink the structure here to avoid that... Like pass the whole container to build() or something. Not sure.
This method is supposed to output a brief version of the content, for output in listings like search results etc. It is only used when searching with Search API, which is unfortunately not tested or well documented yet. So I guess it's okay to not provide something useful for now (how to concentrate a web page to a few lines?), just return array().
Should use _ instead of - in form keys.
Comment #6
berdir1. No, it needs quotes and core uses uppercase everywhere, actually.
Discussed with Miro quickly. You don't need to worry about CSRF, you can remove this. That is needed when you have a link that *changes* something, like deleting something.
what you *do* need to worry about is XSS. you're returning random HTML from your own domain. If that contains javascript that does a post request back to the site, it could do all kinds of stuff, like creating new admin users :)
Imagine a scenario where someone fetches a container with a URL that contains such javascript and then for some reason tricks you, as an admin user, to visit that page.
To test it, fetch a patch that contains
alert('BAD')and then open the link, it will display a popup saying BAD. miro mentioned that Arild was doing something similar already for inmal? One approach is that you filter out all
tags before you return it. not sure if that's enough.Comment #7
mbovan commentedActually, it needs to be string as I get error: "Routing requirement for "_csrf_token" must be a string."
Fixed 2, 3, 5, 6, 10.
For 4 we can create follow-up to rename all the names of the classes/forms/methods.
They are returned as array from the getHeaders() method. Probably we can make foreach to make it nicer, but I am not sure if it always returns just one value.
For the similar issue @miro_dietiker suggested to do something like this, and asked for tests for buildTeaser(). I did the same here, but now changed it to return empty array(). :)
Also added container_revision checking as we have container revisions committed.
Comment #8
mbovan commentedAdded doc comment to FetchUrlForm for injected schema manager, removed csrf token from GenerateWebPageController. We faced a problem with #2449705: Capturing web page that doesn't have UTF-8 charset. Because of that also added 'User-Agent' to the request headers.
Comment #9
berdirI wouldn't add that header here, and instead document it in the other issue as a workaround that worked for google, but that's just google.
Comment #10
mbovan commentedRemoved User-Agent as we will work on the problem in another issue. Regarding filtering html content, tried to do XSS:filter but then we don't get page preview properly, so remove it for now.
Comment #11
miro_dietikerProviding some minor fixes. Hope still passes.
Changed a bit the headers / table representation.
Are headers with multiple values possible?
Comment #12
miro_dietikerPassed. Committed.
Unsure if the [0] things need a followup. It's strange to have the index present, although it is always single value.
Comment #14
arla commentedThere will be more than one element in the header values, if the header is set multiple times (
Foo: Aand thenFoo: B), which seems to be equivalent to setting it once and separating values by comma (Foo: A,B). Indeed, Guzzle itself does comma-concatenation in AbstractMessage::getHeader(). According to the Guzzle docs however this is "naive" and not correct in every case. However I think it would be the better thing to do in this case. Followup needed :)Additionally, the header values should be escaped to be output correctly, as shown by attached screenshots.
Full header:

Output interpretation in Chrome:

Rendered HTML:

Comment #15
arla commentedThere is also the flaw that the typed data support of this plugin is completely broken (getStaticPropertyDefinitions and getDataProperty). Not very strange because there are no general tests for that, nor is it easily accessible in the UI. Followup needed to fix it and provide tests.
Comment #16
miro_dietikerFollowup: So we should add a feature ASAP that clearly lists the properties in the UI and allow us to thest it.
Comment #17
arla commentedYes, good. Created #2450829: Property list on schema config form
Edit: But I'd rather have the schema properties tested with kernel tests using TypedDataProvider, than even more web tests! (Web-testing the list itself is fine.)
Edit 2: ... so I also created #2450853: Test and fix properties of all schemas