I needed to control the format in which the downloaded CSV would be exported as.

Attached is a patch that gives control over choosing the encoding to the downloader.
The system can also assign the variable 'webform_export_encoding'.
The default is UTF16-LE because it is the encoding prior to this patch.

I had to recreate the patch for 7.x-4.x-dev because of some changes that happened in the recent commits.
So I am supplying two patches.
The first shows my original work for the 7.x-4.0-alpha9.
The second is a rediff against the dev branch.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

quicksketch’s picture

Could you describe your use-case for needing these options? What's the format you're actually needing from Webform? Is the default exporter not importing correctly into an application you're using?

The UTF16-LE encoding is essentially only there for the old Excel exporter, since a UTF16-LE TSV file was the only file Excel would import directly without needing to prompt for any import settings. Now that we have #1140026: Excel-native exporter, fixing new line and UTF-8 importing problems, I don't really feel the need to treat Excel special any more. If you're using Excel you'd use the Excel format. The TSV/CSV files can all just be UTF-8 now.

I definitely don't want to expose an option for encoding in the UI. That's an option that <1% of users are going to even know what it does. If you think encoding options are still necessary, we can make it a site-wide setting, but if UTF-8 is what you're after here, then let's just nix all the encoding transformations.

quicksketch’s picture

Status: Needs review » Needs work

Marking needs work per #1. Including an option for encoding is exposing an option to users that most won't use or understand. It seems like this would be more likely to cause problems than solve them. I'm still interested in hearing your use-case. If we just want to switch all delimited downloads to UTF-8 and remove all the encoding changes and the BOM indicator, I'm open to that approach.

thekevinday’s picture

Status: Needs work » Needs review
FileSize
3.08 KB

I've had a problem with the background_progress module's background_batch submodule preventing webform from downloading it as CSV (or anything...the batch job was not working).
When I finally got a work-around I decided to do some testing.

When I downloaded the test CSV file (not Excel CSV), I did a quick look at the contents with vim.
I saw vim putting extra symbols in front of each character and I thought, oh wide-char, it must be some form of utf16.

I opened it up in different applications and some seem to work properly, others seemed to just print it as if it were utf8 or ascii (causing extra junk) and one application just showed an empty page.

I know for a fact that users want to use their particular product and not some other, so if it doesn't support UTF16, for whatever reason, then they should be able to download it in the encoding they need.

I, personally, prefer to store things in UTF-8 and would want to simply export it as such.

Here is a new patch that just utilizes the settings.php 'webform_export_encoding' variable.
When the variable is undefined, then the original behavior of webform is used so that this should not cause any problems for existing sites.

quicksketch’s picture

I, personally, prefer to store things in UTF-8 and would want to simply export it as such.

Well like I said, the UTF-16 with BOM format was used for compatibility with Excel, which almost certainly trumps any text editor. It's legacy at this point.

I agree that UTF-8 is going to be the best default format now that we have the Excel native exporter. This patch looks pretty good overall, but how about we get rid of the special handling here:

+      else {
+        $row = mb_convert_encoding($row, 'UTF-16LE', 'UTF-8');
+      }

And just set $this->encoding = 'UTF-16LE' in the constructor for the legacy Excel exporter. We can keep the hidden variable if you find it necessary, but my preference would be to just keep everything UTF-8 except for the legacy Excel exporter.

quicksketch’s picture

Status: Needs review » Needs work

Also, I don't think this section is needed at all. You shouldn't be converting the encoding of the BOM:

     if (function_exists('mb_convert_encoding') && $this->delimiter == "\t") {
       $output = chr(255) . chr(254);
+
+      if ($this->encoding) {
+        $encoding_current = mb_detect_encoding($output);
+
+        if ($this->encoding != $encoding_current) {
+          $output = mb_convert_encoding($output, $this->encoding, $encoding_current);
+        }
+      }

Marking needs work per #4. I think the easiest implementation might be simply moving all the encoding changes to the legacy Excel exporter.

quicksketch’s picture

Status: Needs work » Needs review
FileSize
2.71 KB

Here's a set of changes that takes a completely different approach and removes all the UTF-16LE handling from the normal exporter. It's moved to the legacy Excel exporter only.

quicksketch’s picture

Title: Add support for controlling character encoding on export (such as UTF-8 or UTF16-LE) » Export TSV/CSV as UTF-8 instead of UTF-16LE
Status: Needs review » Fixed

I've committed this patch, making UTF-8 the de facto exporting format except if you're using the legacy Excel exporter. Excel is no longer a primary concern now that we export natively in that format.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

manuelBS’s picture

Issue summary: View changes

Is there also a patch for the 3.x branch? The same problem exists there.

DanChadwick’s picture

I would not anticipate much work being done to backport fixes to 7.x-3.x since 7.x-4.x is in release candidate. However, if a patch is made, reviewed, and marked RTBC, I'm happy to commit it.