Decode base64 message body [#2381881]

Problem/Motivation

Message bodies are often base64-encoded. The Content-Transfer-Encoding header will then contain "base64" (RFC 2045). Base64-encoded text needs to be decoded to be readable.

A concrete case where this hurts: the bounce reason may be set to a base64-encoded string and shown to the admin undecoded (see the attached screenshot, Base64 Bounce Reason.png).

Proposed resolution

Check the Content-Transfer-Encoding header and conditionally base64_decode() the body.

The same applies to each part of a MIME multipart message. So far the pseudo-headers are not split from the body in each part, but we might introduce that here to facilitate this.
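As a rough illustration of the proposed check, here is a minimal sketch in plain PHP. The function name decode_transfer_encoding() is hypothetical and is not the module's API; for a multipart message the same call would be made once per part, after that part's pseudo-headers have been split from its body.

```php
<?php

/**
 * Decodes a body according to its Content-Transfer-Encoding header value.
 *
 * Hypothetical helper for illustration only; not part of the module.
 * Returns NULL if the body claims to be base64 but is not valid base64.
 */
function decode_transfer_encoding(string $body, ?string $encoding): ?string {
  // RFC 2045: the value is case-insensitive and defaults to "7bit".
  switch (strtolower(trim($encoding ?? '7bit'))) {
    case 'base64':
      // Remove the line breaks that wrap the encoded data, then decode in
      // strict mode, which returns FALSE on characters outside the alphabet.
      $compact = (string) preg_replace('/\s+/', '', $body);
      $decoded = base64_decode($compact, TRUE);
      return $decoded === FALSE ? NULL : $decoded;

    case 'quoted-printable':
      return quoted_printable_decode($body);

    default:
      // 7bit, 8bit and binary carry the payload unchanged.
      return $body;
  }
}
```

Returning NULL for undecodable input is only one possible choice; the caller could equally fall back to the raw body.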

Remaining tasks

User interface changes

API changes

Comment	File	Size	Author
#7	decodebody-2381881-7.interdiff.txt	11.9 KB	arla
#7	decodebody-2381881-7.patch	14.07 KB	arla
#4	Decoded body.png	90.28 KB	arla
#4	decodebody-2381881-4.patch	6.49 KB	arla
	Base64 Bounce Reason.png	161.96 KB	arla

Comments

Comment #1

miro_dietiker

Switzerland

commented 28 November 2014 at 08:11

Version:

» 8.x-1.x-dev

Yes, this is quite common for servers that want to make sure no data gets lost.
Internally, most things are just UTF-8 these days.
Since many servers are 7-bit only, base64 is a reliable way to guarantee lossless transport.
A server might decide to always use base64 (not only when there are special characters), so that it doesn't need to make complex decisions about the proper message format when sending.

Comment #2

miro_dietiker

Switzerland

commented 10 January 2015 at 09:00

Component:

Code

» Mime

Comment #3

miro_dietiker

Switzerland

commented 12 January 2015 at 22:45

Some thoughts on decoding that came up while reviewing the header decoding issue:
#2389327: Deal with header line length and inline encoding in MIME Parser

I think we should offer a getDecodedBody() method that checks for base64 encoding.
That way the parser does not waste resources converting everything into a uniform representation, and we keep guaranteed byte exactness when going back through toString().
The method's documentation should still note that it returns the file stream if the entity is an encoded file. Such a file is not supposed to be output as a string.

Also, I remember that in the past mails were sent around in local encodings. Mail programs can still switch the encoding when reading to work around encoding issues.
As a result, I think we should always apply Unicode::convertToUtf8() in getBody(). Note that this should not be applied while parsing, as it would lead to character swapping in combination with a later toString(). Also note that we only have a one-way function into UTF-8 and thus could not easily maintain byte exactness.

A getRawBody() method might be offered to return unconverted characters (not in a safe UTF-8 space) for raw processing. (We have no use case yet.)
In any case, we should be clear that all methods return characters guaranteed to be converted to UTF-8, independent of the mail's character encoding. Any method that doesn't guarantee this needs an explicit hint.
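To make the split between raw and decoded access concrete, here is a minimal sketch. SketchEntity is a hypothetical stand-in, not the module's Entity class, and iconv() is used in place of Drupal's Unicode::convertToUtf8() so that the example runs outside Drupal (PHP 8.0 or later is assumed). The raw body is stored untouched, so toString() stays byte exact, and decoding only happens on demand.

```php
<?php

/**
 * Illustrative stand-in for a MIME entity; not the module's Entity class.
 */
final class SketchEntity {

  public function __construct(
    private string $headers,
    private string $rawBody,
    private string $transferEncoding = '7bit',
    private string $charset = 'us-ascii',
  ) {}

  /**
   * Returns the body exactly as received. Not guaranteed to be UTF-8.
   */
  public function getRawBody(): string {
    return $this->rawBody;
  }

  /**
   * Returns the payload, transfer-decoded and converted to UTF-8.
   *
   * Returns NULL if decoding or charset conversion fails.
   */
  public function getDecodedBody(): ?string {
    $body = $this->rawBody;
    if (strtolower($this->transferEncoding) === 'base64') {
      $body = base64_decode($body, TRUE);
      if ($body === FALSE) {
        return NULL;
      }
    }
    // iconv() stands in for Unicode::convertToUtf8() in this sketch.
    $converted = @iconv($this->charset, 'UTF-8', $body);
    return $converted === FALSE ? NULL : $converted;
  }

  /**
   * Reassembles the message from the untouched raw parts.
   */
  public function toString(): string {
    return $this->headers . "\r\n\r\n" . $this->rawBody;
  }

}
```

Because the conversion to UTF-8 is one-way, keeping it out of the stored state is what preserves byte exactness on the way back out.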

Most importantly, we need a collection of multipart mails generated with different mail clients, containing small attachments. Different clients handle attachments in slightly different ways.

Comment #4

arla commented 14 January 2015 at 16:35

Status:

Active

» Needs review

Status	File	Size
new	decodebody-2381881-4.patch	6.49 KB
new	Decoded body.png	90.28 KB

That's better :)

Comment #5

14 January 2015 at 16:43

Status:

Needs review

» Needs work

The last submitted patch, 4: decodebody-2381881-4.patch, failed testing.

Comment #6

miro_dietiker

Switzerland

commented 14 January 2015 at 16:50

Awesome progress.

+++ b/src/MIME/Entity.php
@@ -103,6 +107,38 @@ class Entity implements EntityInterface {
+    $decoded = Encodings::decode($body, $this->getContentTransferEncoding());
...
+      $converted = $decoded;
...
+      $converted = Unicode::convertToUtf8($decoded, $charset);

I would just update $body instead of using new variable names.

+++ b/src/MIME/Entity.php
@@ -103,6 +107,38 @@ class Entity implements EntityInterface {
+    if (!isset($content_type['parameters']['charset'])) {
+      return NULL;

Strange, I thought the charset was optional, with a default of us-ascii?
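For reference, RFC 2045 (section 5.2) does define text/plain; charset=us-ascii as the default when no Content-Type is given. A minimal sketch of a lookup with that fallback follows; charset_from_content_type() is a hypothetical helper working on the raw header value, whereas the patch reads an already parsed $content_type array.

```php
<?php

/**
 * Extracts the charset parameter from a Content-Type header value.
 *
 * Hypothetical helper for illustration; falls back to the RFC 2045 default
 * of us-ascii when the header or the parameter is missing.
 */
function charset_from_content_type(?string $content_type): string {
  if ($content_type !== NULL
    && preg_match('/;\s*charset\s*=\s*"?([^";\s]+)"?/i', $content_type, $matches)) {
    // Charset names are case-insensitive; normalize for later comparisons.
    return strtolower($matches[1]);
  }
  return 'us-ascii';
}
```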

+++ b/src/MIME/Entity.php
@@ -103,6 +107,38 @@ class Entity implements EntityInterface {
+    if (!Unicode::validateUtf8($converted)) {

Now we validate again after conversion, which I suppose always returns valid output? I'm unsure about the speed of these methods.

```
+++ b/src/MIME/EntityInterface.php
@@ -73,8 +73,9 @@ interface EntityInterface {
+   * decoded body if it is used as text.
```
Not only when used as text. It also makes sense when accessing attachment content. I'd say it is for regular access to the payload (instead of the encoded MIME representation).

+++ b/tests/src/Unit/MIME/MultipartEntityTest.php
@@ -99,6 +99,16 @@ EOF;
+    $this->assertEquals("日本国", static::getEncodedEntity()->getDecodedBody());

@@ -199,4 +209,17 @@ This is the epilogue.  It is also to be ignored.';
+      '=E6=97=A5=E6=9C=AC=E5=9B=BD'

I'm a bit confused about this data split. I guess it should be restructured with the common data provider pattern if we ever add more examples.
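A provider-style layout could look roughly like the sketch below. It runs as plain PHP without PHPUnit, and both function names are hypothetical. Each case carries its transfer encoding, the encoded body and the expected decoded text, so adding an example means adding one row.

```php
<?php

/**
 * Provider-style list of cases: [transfer encoding, encoded body, expected].
 */
function encoded_body_provider(): array {
  return [
    'quoted-printable' => ['quoted-printable', '=E6=97=A5=E6=9C=AC=E5=9B=BD', '日本国'],
    'base64' => ['base64', '5pel5pys5Zu9', '日本国'],
  ];
}

/**
 * Decodes one case; only the two encodings used above are handled.
 */
function decode_case(string $encoding, string $body): string {
  return $encoding === 'base64'
    ? (string) base64_decode($body, TRUE)
    : quoted_printable_decode($body);
}

// Walk through all cases, the way a test runner would feed them to a test.
foreach (encoded_body_provider() as $label => [$encoding, $body, $expected]) {
  $result = decode_case($encoding, $body) === $expected ? 'ok' : 'FAIL';
  echo $label, ': ', $result, "\n";
}
```

In PHPUnit the same array would be returned from a method referenced by a @dataProvider annotation on the test.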

Comment #7

arla commented 15 January 2015 at 08:09

Status:

Needs work

» Needs review

Status	File	Size
new	decodebody-2381881-7.patch	14.07 KB
new	decodebody-2381881-7.interdiff.txt	11.9 KB

Fixed the test failures and addressed the review points.

Comment #8

miro_dietiker

Switzerland

commented 16 January 2015 at 01:28

Status:

Needs review

» Reviewed & tested by the community

Looks perfect now.

As discussed, please create a follow-up about challenging the system with a crafted mail message that contains invalid UTF-8 after base64 decoding. There's some risk that this can cause a fatal error or an uncaught exception, and we would like to see this well covered. The processor cron should clearly never fail.
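A minimal sketch of the kind of guard such a follow-up test would exercise, assuming a hypothetical helper safe_decode_base64_text() that reports bad input as NULL instead of throwing. How the module itself should report the problem is left open here.

```php
<?php

/**
 * Decodes base64 text and returns NULL instead of failing on bad input.
 *
 * Hypothetical helper for illustration only; not part of the module.
 */
function safe_decode_base64_text(string $body): ?string {
  // Strict mode returns FALSE on characters outside the base64 alphabet.
  $decoded = base64_decode($body, TRUE);
  if ($decoded === FALSE) {
    return NULL;
  }
  // With the /u modifier, preg_match() returns FALSE for subjects that are
  // not valid UTF-8, so only well-formed text gets through.
  if (preg_match('//u', $decoded) !== 1) {
    return NULL;
  }
  return $decoded;
}

// A crafted body: 0xC3 must be followed by a continuation byte, 0x28 is not one.
$crafted = base64_encode("\xC3\x28");
var_dump(safe_decode_base64_text($crafted));
```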

Comment #9

16 January 2015 at 07:54

Arla committed 823ff07 on 8.x-1.x

Issue #2381881 by Arla: Decode base64 message body

Comment #10

arla commented 16 January 2015 at 08:00

Status:

Reviewed & tested by the community

» Fixed

Committed and pushed. Created followup:
#2408353: Add tests for decoding to invalid UTF-8

Comment #11

30 January 2015 at 08:04

Status:

Fixed

» Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.
