Problem/Motivation

Since recently, we can fetch URLs.
Still the output is just raw JSON

Proposed resolution

Offer a plugin that suggests itself for the namespace.

It should adapt the request and response header with a table display.
And the body could be offered with a link to display in a popup..
Be carefully with displaying it inline. See the inmail collect integration that displays multipart mail bodies with XSS protection...

Remaining tasks

User interface changes

API changes

Comments

miro_dietiker’s picture

Priority: Normal » Major

:-)

mbovan’s picture

Issue summary: View changes
Status: Active » Needs review
StatusFileSize
new9.22 KB
new122.42 KB

Added new plugin for captured web pages and added controller that generates web page from html/json data so we can get preview of the captured page. Also provided web tests for fetch url schema plugin. The screenshot is displayed below.

miro_dietiker’s picture

Status: Needs review » Needs work

Looks pretty nice. Just some hints.

  1. +++ b/collect.routing.yml
    @@ -105,3 +105,11 @@ collect.settings:
    +    _csrf_token: 'TRUE'
    
    +++ b/src/Plugin/collect/Schema/FetchUrlSchema.php
    @@ -0,0 +1,124 @@
    +    $url = \Drupal::url('collect.generate_page', ['collect_container' => $collect_container_id], ['absolute' => TRUE]);
    ...
    +      '#url' => Url::fromUri($url),
    

    Unsure about csrf here. This makes the link temporary. We can start with that, but there might be an application where we need stable links to navigate through pages fetched.

  2. +++ b/src/Plugin/collect/Schema/FetchUrlSchema.php
    @@ -0,0 +1,124 @@
    +    $output['request'] = array(
    ...
    +    foreach ($data['request-headers'] as $key => $value) {
    ...
    +    $output['response'] = array(
    ...
    +    foreach ($data['response-headers'] as $key => $value) {
    

    Please stay consistent with the keys.

  3. +++ b/src/Tests/CollectWebTest.php
    @@ -127,6 +127,19 @@ class CollectWebTest extends WebTestBase {
    +    $this->assertText(t('Show content in a new window'));
    

    Missing test coverage: Follow the link and check contents. Make sure only the body markup is present and nothing additional.

mbovan’s picture

Status: Needs work » Needs review
StatusFileSize
new9.82 KB
new3.51 KB

Changed key names, extended tests to compare body content stored in the collect container and page content of the generated page.

arla’s picture

Status: Needs review » Needs work
  1. +++ b/collect.routing.yml
    @@ -105,3 +105,11 @@ collect.settings:
    +    _csrf_token: 'TRUE'
    

    Should be true in lower-case and without quotes.

  2. +++ b/src/Controller/GenerateWebPageController.php
    @@ -0,0 +1,71 @@
    +    $has_fetch_url_plugin = $schema instanceof FetchUrlSchema ? TRUE : FALSE;
    

    The ternary expression is unnecessary here. The instanceof expression is enough, it returns bool.

  3. +++ b/src/Controller/GenerateWebPageController.php
    @@ -0,0 +1,71 @@
    +    return $access->andIf($has_fetch_url_plugin ? AccessResult::allowed() : AccessResult::forbidden());
    

    AccessResult::allowedIf() can be used here.

  4. +++ b/src/Plugin/collect/Schema/FetchUrlSchema.php
    @@ -0,0 +1,124 @@
    + * @Schema(
    + *   id = "collect_fetch_url",
    + *   label = @Translation("Collect Fetch URL"),
    + *   patterns = {
    + *     "http://schema.md-systems.ch/collect/0.0.1/url"
    + *   }
    + * )
    + */
    +class FetchUrlSchema extends SchemaBase {
    

    A URL is an address. Does it make sense to use "url" to denote the content available at that address? Maybe better to use "web", "resource" or "http" to base the name off?

  5. +++ b/src/Plugin/collect/Schema/FetchUrlSchema.php
    @@ -0,0 +1,124 @@
    +      '#empty' => $this->t('There are no data.'),
    ...
    +      '#empty' => $this->t('There are no data.'),
    

    is*
    (or "are no headers/header fields")

  6. +++ b/src/Plugin/collect/Schema/FetchUrlSchema.php
    @@ -0,0 +1,124 @@
    +      '#prefix' => '<div id="request-table-wrapper">',
    +      '#suffix' => '</div>',
    ...
    +      '#prefix' => '<div id="response-table-wrapper">',
    +      '#suffix' => '</div>',
    

    These do not seem to have a purpose?

  7. +++ b/src/Plugin/collect/Schema/FetchUrlSchema.php
    @@ -0,0 +1,124 @@
    +        ['#markup' => $value[0]],
    

    Why are the values lists? If they always have only one element, is it perhaps nicer to "un-list-ify" it when saving in FetchUrlForm?

  8. +++ b/src/Plugin/collect/Schema/FetchUrlSchema.php
    @@ -0,0 +1,124 @@
    +    $current_url = explode('/', \Drupal::requestStack()->getCurrentRequest()->getRequestUri());
    +    $collect_container_id = end($current_url);
    +    $url = \Drupal::url('collect.generate_page', ['collect_container' => $collect_container_id], ['absolute' => TRUE]);
    

    Hm, that's a nasty hack to get to the container ID :) Maybe we should rethink the structure here to avoid that... Like pass the whole container to build() or something. Not sure.

  9. +++ b/src/Plugin/collect/Schema/FetchUrlSchema.php
    @@ -0,0 +1,124 @@
    +  public function buildTeaser($data) {
    +    $output = array();
    +    $output['web_page'] = array(
    +      '#type' => 'item',
    +      '#title' => $this->t('Web page data'),
    +      '#markup' => [$this->t('Request headers'), $this->t('Response headers'), $this->t('Body')],
    +    );
    +
    +    return $output;
    +  }
    

    This method is supposed to output a brief version of the content, for output in listings like search results etc. It is only used when searching with Search API, which is unfortunately not tested or well documented yet. So I guess it's okay to not provide something useful for now (how to concentrate a web page to a few lines?), just return array().

  10. +++ b/src/Plugin/collect/Schema/FetchUrlSchema.php
    @@ -47,7 +47,7 @@ class FetchUrlSchema extends SchemaBase {
    +    $output['request-headers'] = array(
    
    @@ -56,13 +56,13 @@ class FetchUrlSchema extends SchemaBase {
    +      $output['request-headers'][$key] = array(
    ...
    +    $output['response-headers'] = array(
    
    @@ -71,7 +71,7 @@ class FetchUrlSchema extends SchemaBase {
    +      $output['response-headers'][$key] = array(
    

    Should use _ instead of - in form keys.

berdir’s picture

1. No, it needs quotes and core uses uppercase everywhere, actually.

Discussed with Miro quickly. You don't need to worry about CSRF, you can remove this. That is needed when you have a link that *changes* something, like deleting something.

what you *do* need to worry about is XSS. you're returning random HTML from your own domain. If that contains javascript that does a post request back to the site, it could do all kinds of stuff, like creating new admin users :)

Imagine a scenario where someone fetches a container with a URL that contains such javascript and then for some reason tricks you, as an admin user, to visit that page.

To test it, fetch a patch that contains

alert('BAD')

and then open the link, it will display a popup saying BAD. miro mentioned that Arild was doing something similar already for inmal? One approach is that you filter out all

tags before you return it. not sure if that's enough.
mbovan’s picture

Status: Needs work » Needs review
StatusFileSize
new11.43 KB
new5.29 KB

1. Should be true in lower-case and without quotes.

Actually, it needs to be string as I get error: "Routing requirement for "_csrf_token" must be a string."

Fixed 2, 3, 5, 6, 10.
For 4 we can create follow-up to rename all the names of the classes/forms/methods.

7. Why are the values lists? If they always have only one element, is it perhaps nicer to "un-list-ify" it when saving in FetchUrlForm?

They are returned as array from the getHeaders() method. Probably we can make foreach to make it nicer, but I am not sure if it always returns just one value.

9. This method is supposed to output a brief version of the content, for output in listings like search results etc. It is only used when searching with Search API, which is unfortunately not tested or well documented yet. So I guess it's okay to not provide something useful for now (how to concentrate a web page to a few lines?), just return array().

For the similar issue @miro_dietiker suggested to do something like this, and asked for tests for buildTeaser(). I did the same here, but now changed it to return empty array(). :)

Also added container_revision checking as we have container revisions committed.

mbovan’s picture

Added doc comment to FetchUrlForm for injected schema manager, removed csrf token from GenerateWebPageController. We faced a problem with #2449705: Capturing web page that doesn't have UTF-8 charset. Because of that also added 'User-Agent' to the request headers.

berdir’s picture

I wouldn't add that header here, and instead document it in the other issue as a workaround that worked for google, but that's just google.

mbovan’s picture

Removed User-Agent as we will work on the problem in another issue. Regarding filtering html content, tried to do XSS:filter but then we don't get page preview properly, so remove it for now.

miro_dietiker’s picture

StatusFileSize
new11.67 KB

Providing some minor fixes. Hope still passes.
Changed a bit the headers / table representation.

  1. +++ b/src/Plugin/collect/Schema/FetchUrlSchema.php
    @@ -0,0 +1,113 @@
    +    foreach ($data['response-headers'] as $key => $value) {
    +      $output['response_headers'][$key] = array(
    ...
    +        ['#markup' => $value[0]],
    

    Are headers with multiple values possible?

miro_dietiker’s picture

Status: Needs review » Fixed

Passed. Committed.
Unsure if the [0] things need a followup. It's strange to have the index present, although it is always single value.

  • miro_dietiker committed f3d8c0a on 8.x-1.x
    Issue #2447237 by mbovan, miro_dietiker: New Plugin: URL
    
arla’s picture

StatusFileSize
new13.97 KB
new17.34 KB
new6.47 KB

There will be more than one element in the header values, if the header is set multiple times (Foo: A and then Foo: B), which seems to be equivalent to setting it once and separating values by comma (Foo: A,B). Indeed, Guzzle itself does comma-concatenation in AbstractMessage::getHeader(). According to the Guzzle docs however this is "naive" and not correct in every case. However I think it would be the better thing to do in this case. Followup needed :)

Additionally, the header values should be escaped to be output correctly, as shown by attached screenshots.

Full header:

Output interpretation in Chrome:

Rendered HTML:

arla’s picture

There is also the flaw that the typed data support of this plugin is completely broken (getStaticPropertyDefinitions and getDataProperty). Not very strange because there are no general tests for that, nor is it easily accessible in the UI. Followup needed to fix it and provide tests.

miro_dietiker’s picture

Followup: So we should add a feature ASAP that clearly lists the properties in the UI and allow us to thest it.

arla’s picture

Yes, good. Created #2450829: Property list on schema config form

Edit: But I'd rather have the schema properties tested with kernel tests using TypedDataProvider, than even more web tests! (Web-testing the list itself is fine.)

Edit 2: ... so I also created #2450853: Test and fix properties of all schemas

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.