With this sample CSV:

name,tag,body
foo,bar,foo"bar
foo2,bar2,foobar

It'll get incorrectly parsed as one entry, with the "body" of the one entry containing:

foo"bar
foo2,bar2,foobar

It looks like ParserCSV.inc is not correctly handling quotes with CSV. For example, it checks for escaped quotation marks, but does NOT check for this when it initially decides if it's entering a quoted field (instead just doing a straight out strpos($line, '"', $currentIndex)). It also doesn't care *where* the quotes are, though every CSV i've seen with quoted fields only has quotations right next to a delimiter, something like:

oo,bar2,"quoted field"

instead of:

foo,bar,not a "quoted field

I'm trying to figure out what the best fix for this, as it appears to be multiple use cases that need to be handled better, but this is my best guess, based on past experience with working with CSVs and also according to http://en.wikipedia.org/wiki/Comma-separated_values#Toward_standardization

  1. Quotes should only be checked for right after a delimiter ($nextDelimiterIndex + 1) to determine if the next field is a quoted field
  2. If it's a quoted field, check to make sure the next quotation mark is not escaped either in traditional C fashion or in RFC-proposed standardized CSV fashion (\" or "")
  3. Quoted fields should either be indicated by a ' or a "
  4. Quotes inside a field that is not quoted should be treated literally

Normally, the CSV parser will work fine, it's just when it comes to "bad" (but still valid) data that it starts to choke pretty severely.

I can start working on a patch, though I'm sure the community will have other things to say about my suggested parsing behavior.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Taxoman’s picture

Version: 7.x-2.0-alpha7 » 7.x-2.x-dev
MediaFormat’s picture

+1

When exporting data to csv, the id/key column is not double quoted as it is numeric.

When importing to feeds this throws:

SQLSTATE[22007]: Invalid datetime format: 1292 Truncated incorrect DOUBLE value
tterranigma’s picture

Importing

name,tag,body
foo,bar,foo"bar
foo2,bar2,foobar

with the latest Feeds 7.x version, thefe is only one body imported as

foobar
foo2,bar2,foobar

According to RFC 4180, foo,bar,foo"bar is not valid csv code.
First, since we want double quotes in the field, we should enclose the field in double quotes.
Second, the " should be double quoted (""). Ie

name,tag,body
foo,bar,"foo""bar"
foo2,bar2,foobar

I would prefer if Feeds returned an error in such a violation of the RFC, however as mentioned in the RFC due to the many different implementations prior to the RFC, one should be relaxed about what is considered valid for processing.

As mentioned in the original bug report, what Feeds does is consider any double quote as a start of a field. I am attaching a patch that tries to only consider opening and closing quotes. If an escaped quote is found inside the quoted field, it is 'unescaped'.

I believe that such a change should be committed regardless of whether we decide that Feeds should throw an error or not if there are violations of the RFC, since it is what the RFC mandates and it is the best way to properly support most edge cases.

The submitted patch still has an issue where if a (quoted) field spans to more than two lines, the lines after the second are ignored. This behavior exists on the current version of Feeds, as well.

I am also attaching some csv files help with debugging, but more thorough tests are needed.

Also, note that I created the patch on Windows and I have no idea what is happening with the end of files and carriage returns...

tterranigma’s picture

And I just found another issue with some patches.. Could have saved me some time.