Closed (fixed)
Project:
Inmail
Version:
8.x-1.x-dev
Component:
Mime
Priority:
Normal
Category:
Feature request
Assigned:
Unassigned
Created:
25 Nov 2014 at 15:49 UTC
Updated:
30 Jan 2015 at 08:04 UTC

Comments
Comment #1
miro_dietiker
Yes, this is quite common for servers that want to make sure no data gets lost.
Internally, most things are just UTF-8 these days.
Since many servers are 7bit only, the straightforward way to guarantee lossless transport is base64.
A server might decide to always use base64 (not just when there are special characters), so it doesn't need to make complex decisions about the proper message format when sending.
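To illustrate the point about 7bit transport, here is a minimal Python sketch (not Inmail code; the sample text and the `is_7bit` helper are made up for the illustration). It shows that raw UTF-8 bytes are not 7bit safe, while their base64 form is and still decodes back without loss.

```python
import base64

text = "Grüße aus Zürich"          # sample text with non-ASCII characters
raw = text.encode("utf-8")         # 8-bit bytes, unsafe on a 7bit-only hop
encoded = base64.b64encode(raw)    # ASCII letters, digits, '+', '/', '=' only


def is_7bit(data: bytes) -> bool:
    """Hypothetical helper: True if every byte fits in 7 bits."""
    return all(b < 0x80 for b in data)


# Decoding restores the original text exactly.
round_trip = base64.b64decode(encoded).decode("utf-8")
```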
Comment #2
miro_dietiker
Comment #3
miro_dietiker
Thought about decoding while reviewing the header decoding issue:
#2389327: Deal with header line length and inline encoding in MIME Parser
I think we should offer a getDecodedBody() method that checks for base64 encoding.
That way the parser would not waste resources converting everything into a uniform representation, and we keep guaranteed byte exactness when going back through toString().
The method should still document that it returns the file stream if the entity was an encoded file. Such a file is not supposed to be output as a string.
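A rough Python sketch of the idea, under stated assumptions: the `Entity` class, `serialize()` (standing in for toString()) and `get_decoded_body()` are hypothetical names for illustration, not Inmail's actual API. The raw body is stored untouched and only decoded on request.

```python
import base64


class Entity:
    """Hypothetical stand-in for a parsed MIME entity; not Inmail's class."""

    def __init__(self, headers: dict, body: bytes):
        self.headers = headers
        self.body = body  # kept exactly as received, never rewritten

    def serialize(self) -> bytes:
        """Rebuild the entity; the body bytes are emitted unchanged."""
        head = b"".join(
            f"{name}: {value}\r\n".encode("ascii")
            for name, value in self.headers.items()
        )
        return head + b"\r\n" + self.body

    def get_decoded_body(self) -> bytes:
        """Return the payload, undoing base64 only when it is declared.

        For an encoded attachment this is the file content, which is
        not meant to be printed as a string.
        """
        cte = self.headers.get("Content-Transfer-Encoding", "7bit")
        if cte.strip().lower() == "base64":
            # b64decode ignores the CRLF line wrapping by default.
            return base64.b64decode(self.body)
        return self.body


wrapped = b"SGVsbG8s\r\nIHdvcmxkIQ==\r\n"
entity = Entity({"Content-Transfer-Encoding": "base64"}, wrapped)
```

Because decoding happens lazily, serializing the entity again still yields the original bytes.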
I also remember that in the past, mails were sent around in local encodings. Mail programs can still switch the encoding when reading to work around encoding issues.
As a result, I think we should always use Unicode::convertToUtf8() in getBody(). Note that this should not be applied at parse time, since in combination with a later toString() it would alter characters. Also note that we only have a one-way function into UTF-8 and thus could not easily maintain byte exactness.
A getRawBody() might be offered to return unconverted characters (not in a safe UTF-8 space) for raw processing. (We have no use case yet.)
In any case, we should be clear that all methods return characters guaranteed to be converted to UTF-8, independent of the mail's character encoding. Any method that doesn't guarantee this needs an explicit hint.
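The loss of byte exactness can be sketched in Python (an illustration only; `body_as_utf8` is a hypothetical helper, not the Drupal function): once a body in a local encoding has been converted to UTF-8, the bytes differ from what was received.

```python
def body_as_utf8(raw: bytes, charset: str) -> bytes:
    """Hypothetical helper: re-encode a body from its charset to UTF-8."""
    return raw.decode(charset).encode("utf-8")


# A body as it might arrive in a local encoding.
latin1_bytes = "Zürich".encode("iso-8859-1")
utf8_bytes = body_as_utf8(latin1_bytes, "iso-8859-1")
```

Here the converted body is one byte longer than the original, so serializing it back would not reproduce the received message.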
Most importantly, we need a collection of multipart mails generated with different mail clients, containing small attachments. There are many ways to handle attachments a bit differently.
Comment #4
arla commented
That's better :)

Comment #6
miro_dietiker
Awesome progress.
I would just update $body instead of using new variable names.
Strange, I thought the charset is optional and defaults to us-ascii?
Now we validate again after conversion, which I suppose always returns valid data? I'm unsure about the speed of these methods.
This is not only used as text. It also makes sense when accessing attachment content. I'd say it is for regular access to the payload (instead of the encoded MIME representation).
I'm a bit confused about this data split. I guess it should be different, following the common provider pattern, if we ever add more examples.
Comment #7
arla commented
Fixed fails and review points.
Comment #8
miro_dietiker
Looks perfect now.
As discussed, please create a followup about challenging the system with a crafted mail message that contains invalid UTF-8 characters after base64 decoding. There's some risk that this can cause a fatal error or an uncaught exception, and we would like to see this well covered. The processor cron should clearly never fail.
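A test along these lines could look like the following Python sketch. It assumes a hypothetical `decode_payload_safely` helper, not the module's actual code, and only illustrates that a crafted payload should lead to a handled result rather than an uncaught exception.

```python
import base64
import binascii
from typing import Optional


def decode_payload_safely(encoded: bytes, charset: str = "utf-8") -> Optional[str]:
    """Hypothetical helper: return decoded text, or None if it is invalid."""
    try:
        raw = base64.b64decode(encoded)
    except (binascii.Error, ValueError):
        return None  # malformed base64
    try:
        return raw.decode(charset)
    except (UnicodeDecodeError, LookupError):
        return None  # invalid byte sequence or unknown charset


# Crafted message body: valid base64 that decodes to invalid UTF-8.
crafted = base64.b64encode(b"\xff\xfe\xfd")
well_formed = base64.b64encode("ok ü".encode("utf-8"))
```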
Comment #10
arla commented
Committed and pushed. Created followup:
#2408353: Add tests for decoding to invalid UTF-8