Problem/Motivation

Message bodies are often base64-encoded. The Content-Transfer-Encoding header will then contain "base64" (RFC 2045). Base64-encoded text needs to be decoded to be readable.

A concrete example where this is bad is that the bounce reason may be set to a base64-encoded string and presented to admin:

Proposed resolution

Check the Content-Transfer-Encoding header and conditionally base64_decode() the body.

Also for MIME multipart messages. So far there's no splitting the pseudo-headers from the body in each part, but we might introduce that here to facilitate this.

Remaining tasks

User interface changes

API changes

Comments

miro_dietiker’s picture

Version: » 8.x-1.x-dev

Yes, this is quite common for servers that want to make sure no data gets lost.
Internally many things are these days just UTF-8.
Since many servers are 7bit only, the only way to guarantee lossless transport is to do base64.
A server might decide to just always do base64 (not just if there are special chars) and then it doesn't need to do complex decisions about the proper message format when sending.

miro_dietiker’s picture

Component: Code » Mime
miro_dietiker’s picture

Thought about decoding during header decode issue review:
#2389327: Deal with header line length and inline encoding in MIME Parser

I think we should offer a getDecodedBody() method that checks for base64 encoding.
The parser would not waste resources by parsing into uniform representation. Also we have guaranteed byte exactness when going back into toString().
The method still should contain some documentation that it contains the file stream in case the entity was an encoded file... Such a file is not supposed to be output as string.

Also i do remember that in past mails have been sent around with local encodings. Still mail programs can switch encoding when reading to circumvent encoding issues.
As a result, i think we should always use the Unicide::convertToUtf8() with getBody(). Note that this should not be applied when parsing as it leads to character swapping with combination of later toString(). Also note that we only have a one way function into UTF8 and thus could not maintain easily byte exactness.

A getRawBody() might be offered to return unconverted characters (not in a safe UTF-8 space) for raw processing. (We have no usecase yet.)
In any case, we should be clear that all methods return guaranteed UTF-8 converted characters, independent from the mail character encoding domain. All methods that don't guarantee this need to have an explicit hint.

Most importantly, we need a collection of multipart mails generated with different mail clients, containing small attachments. There are many ways to handle attachments a bit differently.

arla’s picture

Status: Active » Needs review
StatusFileSize
new6.49 KB
new90.28 KB

That's better :)

Status: Needs review » Needs work

The last submitted patch, 4: decodebody-2381881-4.patch, failed testing.

miro_dietiker’s picture

Awesome progress.

  1. +++ b/src/MIME/Entity.php
    @@ -103,6 +107,38 @@ class Entity implements EntityInterface {
    +    $decoded = Encodings::decode($body, $this->getContentTransferEncoding());
    ...
    +      $converted = $decoded;
    ...
    +      $converted = Unicode::convertToUtf8($decoded, $charset);
    

    I would just update $body instead of using new variable names.

  2. +++ b/src/MIME/Entity.php
    @@ -103,6 +107,38 @@ class Entity implements EntityInterface {
    +    if (!isset($content_type['parameters']['charset'])) {
    +      return NULL;
    

    Strange, i thought the charset is optional with a default to us-ascii?

  3. +++ b/src/MIME/Entity.php
    @@ -103,6 +107,38 @@ class Entity implements EntityInterface {
    +    if (!Unicode::validateUtf8($converted)) {
    

    Now we validate again after conversion - which i suppose always returns valid stuff? Unsure about the speed of these methods.

  4. +++ b/src/MIME/EntityInterface.php
    @@ -73,8 +73,9 @@ interface EntityInterface {
    +   * decoded body if it is used as text.
    

    Not only used as text. Makes also sense when accessing an attachment content. I'd say for regular access to the payload (instead of the encoded MIME representation)

  5. +++ b/tests/src/Unit/MIME/MultipartEntityTest.php
    @@ -99,6 +99,16 @@ EOF;
    +    $this->assertEquals("日本国", static::getEncodedEntity()->getDecodedBody());
    
    @@ -199,4 +209,17 @@ This is the epilogue.  It is also to be ignored.';
    +      '=E6=97=A5=E6=9C=AC=E5=9B=BD'
    

    I'm a bit confused about this data split. I guess should be different with the common provider pattern if we add more examples ever.

arla’s picture

Status: Needs work » Needs review
StatusFileSize
new14.07 KB
new11.9 KB

Fixed fails and review points.

miro_dietiker’s picture

Status: Needs review » Reviewed & tested by the community

Looks perfect now.

As discussed, please create a followup about challenging the system with a crafted mail message that contains non-valid UTF-8 characters after base64 decoding. There's some risk that this can cause a fatal error or uncaught exception and we like to see this well covered. The processor cron should clearly never fail.

  • Arla committed 823ff07 on 8.x-1.x
    Issue #2381881 by Arla: Decode base64 message body
    
arla’s picture

Status: Reviewed & tested by the community » Fixed

Committed and pushed. Created followup:
#2408353: Add tests for decoding to invalid UTF-8

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.