With this sample CSV:
name,tag,body
foo,bar,foo"bar
foo2,bar2,foobar
It'll get incorrectly parsed as one entry, with the "body" of the one entry containing:
foo"bar
foo2,bar2,foobar
It looks like ParserCSV.inc is not correctly handling quotes with CSV. For example, it checks for escaped quotation marks, but does NOT check for this when it initially decides if it's entering a quoted field (instead just doing a straight out strpos($line, '"', $currentIndex)). It also doesn't care *where* the quotes are, though every CSV i've seen with quoted fields only has quotations right next to a delimiter, something like:
oo,bar2,"quoted field"
instead of:
foo,bar,not a "quoted field
I'm trying to figure out what the best fix for this, as it appears to be multiple use cases that need to be handled better, but this is my best guess, based on past experience with working with CSVs and also according to http://en.wikipedia.org/wiki/Comma-separated_values#Toward_standardization
- Quotes should only be checked for right after a delimiter ($nextDelimiterIndex + 1) to determine if the next field is a quoted field
- If it's a quoted field, check to make sure the next quotation mark is not escaped either in traditional C fashion or in RFC-proposed standardized CSV fashion (\" or "")
- Quoted fields should either be indicated by a ' or a "
- Quotes inside a field that is not quoted should be treated literally
Normally, the CSV parser will work fine, it's just when it comes to "bad" (but still valid) data that it starts to choke pretty severely.
I can start working on a patch, though I'm sure the community will have other things to say about my suggested parsing behavior.
Comment | File | Size | Author |
---|---|---|---|
#3 | feeds-better_support_for_rfc_4180-1839626-7.1.x.patch | 7.02 KB | tterranigma |
#3 | csvs.zip | 1.39 KB | tterranigma |
Comments
Comment #1
Taxoman CreditAttribution: Taxoman commentedComment #2
MediaFormat CreditAttribution: MediaFormat commented+1
When exporting data to csv, the id/key column is not double quoted as it is numeric.
When importing to feeds this throws:
Comment #3
tterranigma CreditAttribution: tterranigma commentedImporting
with the latest Feeds 7.x version, thefe is only one body imported as
According to RFC 4180,
foo,bar,foo"bar
is not valid csv code.First, since we want double quotes in the field, we should enclose the field in double quotes.
Second, the " should be double quoted (""). Ie
I would prefer if Feeds returned an error in such a violation of the RFC, however as mentioned in the RFC due to the many different implementations prior to the RFC, one should be relaxed about what is considered valid for processing.
As mentioned in the original bug report, what Feeds does is consider any double quote as a start of a field. I am attaching a patch that tries to only consider opening and closing quotes. If an escaped quote is found inside the quoted field, it is 'unescaped'.
I believe that such a change should be committed regardless of whether we decide that Feeds should throw an error or not if there are violations of the RFC, since it is what the RFC mandates and it is the best way to properly support most edge cases.
The submitted patch still has an issue where if a (quoted) field spans to more than two lines, the lines after the second are ignored. This behavior exists on the current version of Feeds, as well.
I am also attaching some csv files help with debugging, but more thorough tests are needed.
Also, note that I created the patch on Windows and I have no idea what is happening with the end of files and carriage returns...
Comment #4
tterranigma CreditAttribution: tterranigma commentedAnd I just found another issue with some patches.. Could have saved me some time.
Comment #5
MegaChriz CreditAttribution: MegaChriz as a volunteer commentedThis are all CSV related issues for Feeds that I found in the past:
#1446374: Be more forgiving about non-Unix line breaks in CSV files
#1189682: "There are no new nodes" message when import file has Mac EOL format [solved]
#1839626: Incorrect handling of quotes when parsing CSV
#1720724: Fix DQUOTES handling according to RFC 4180
#1429690: Optionaly turn off quotation in CSV parser
#2214603: CSV import stripping all double quotes (") not just text delimiters
#1014678: CSVParser can't parse TSV Files: double quotes still special
#2236589: Importing CSV with special characters gives errors
#1283512: SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\x8Es, fi...' for column 'field_product_description_value' at row
#1141016: FeedsCSVParser issue on blanks terminated columns