The CSV Parser fails to properly parse files whose values are not enclosed in quotes but contain quotes inside the value.
For example (Tab Separated Value File):
Photo Link MLS Number Photo Label Date Last Modified Photo Order Display as Portrait
http://example.com/getMLPhoto.asp?l=8C585D988E6294767FAF6D88ACB8B064846E825C89566D8C 894 Main View 4/27/2009 4:13:48 PM 0 False
http://example.com/getMLPhoto.asp?l=8C585D988E62947681AF6F8A70B2B9908C71815B89586C8D64 896 Tile Shower 4/27/2009 5:03:19 PM 20 False
http://example.com/getMLPhoto.asp?l=8C585D988E62947681AF6F8B70B2B9908C71815B89586C8D64 896 HUGE Closet 4/27/2009 5:03:20 PM 21 False
http://example.com/getMLPhoto.asp?l=8C585D988E62947681AF6F8C70B2B9908C71815B89586C8D64 896 Double Vanity 4/27/2009 5:03:21 PM 22 False
http://example.com/getMLPhoto.asp?l=8C585D988E62947681AF6F8D70B2B9908C71815B89586C8D64 896 Corian Counters 4/28/2009 10:00:10 AM 27 False
http://example.com/getMLPhoto.asp?l=8C585D988E62947681AF6F8E70B2B9908C71815B89586C8D64 896 20" Tile 4/28/2009 10:00:11 AM 28 False
http://example.com/getMLPhoto.asp?l=8C585D988E62947681AF7588ACB8B064846E825C89566D8C 896 Bedroom 1 4/27/2009 4:47:39 PM 8 False
http://example.com/getMLPhoto.asp?l=8C585D988E62947681AF7688ACB8B064846E825C89566D8C 896 Bedroom 2 4/27/2009 4:47:41 PM 9 False
http://example.com/getMLPhoto.asp?l=8C585D988E62947682AF6D88ACB8B064846E825C89566D8C 897 Main View 4/28/2009 12:30:19 AM 0 False
The record:
http://example.com/getMLPhoto.asp?l=8C585D988E62947681AF6F8E70B2B9908C71815B89586C8D64 896 20" Tile 4/28/2009 10:00:11 AM 28 False
fails because of the double quotation mark inside the value.
This formatting is valid for a TSV file, which is supposed to be supported by the CSV Parser.
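To illustrate the kind of failure described above, here is a minimal sketch in Python. It is not the Feeds parser's code; `naive_split` is a hypothetical, simplified parser that treats every double quote as the start or end of an enclosure, and the URL is a shortened placeholder.

```python
def naive_split(line, delimiter="\t", quote='"'):
    """Hypothetical parser: toggles 'inside quotes' on every quote character."""
    fields, current, in_quotes = [], [], False
    for ch in line:
        if ch == quote:
            # A quote anywhere in the field flips the enclosure state.
            in_quotes = not in_quotes
        elif ch == delimiter and not in_quotes:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


# Placeholder record modelled on the failing row (tabs between columns).
record = 'http://example.com/photo\t896\t20" Tile\t4/28/2009 10:00:11 AM\t28\tFalse'
print(len(naive_split(record)))  # 3, although the record has 6 columns
```

The lone quote in `20" Tile` opens an enclosure that never closes, so every following tab is swallowed into the third field.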
I am attaching a replacement parser that uses PHP 4.3's built-in fgetcsv function, although it has only been binary safe since 4.3.5.
The included patch does away with the separate parser class, and is significantly faster. The version numbers will be off in the diff because it comes from my own SVN repository.
| Comment | File | Size | Author |
|---|---|---|---|
| #7 | feeds-tsv_parser-1014678-7.patch | 5.74 KB | kking |
| #5 | TSV_Parser-1014678-5.patch | 792 bytes | BWPanda |
| #3 | feeds-1014678-3.patch | 792 bytes | andrewlevine |
| | FeedsCSVParser.inc_.diff | 3.29 KB | JeromeHollon |
| | sample.txt | 1.17 KB | JeromeHollon |
Comments
Comment #1
David Goode commented: This seems like a cool idea. I'm not sure why Alex went with the custom CSV parser, though I would guess that he was aware of the fgetcsv option. Either way, we should either fix the current CSV parser to not die on open quotes, or apply this patch and replace it. Does the patch apply to the CVS head of 6.x-1.0? If not, could you reroll it? Thanks.
Comment #2
JeromeHollon commented: It applied to the head of 6.x-1.0 when I submitted it. It's been a while so I'm not sure. I thought Drupal emailed me when I got new responses, but evidently it doesn't.
Comment #3
andrewlevine commented: I am attaching a patch against 6.x-1.x-dev which is a less drastic change to the existing code, but still solves the problem of parsing TSV files incorrectly.
The root problem is that TSV doesn't support column enclosing quotes (see http://www.iana.org/assignments/media-types/text/tab-separated-values ), so the patch simply disables this feature of the parser in this special case of TSVs.
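As a rough illustration of the idea (not the patch itself, which changes the module's PHP parser), here is a Python sketch that reads tab-separated text with enclosure handling switched off. `read_tsv` is a hypothetical helper name; `csv.QUOTE_NONE` is the standard library's flag for treating quote characters as ordinary data.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text, treating double quotes as plain characters."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # no enclosure processing at all
    )
    return list(reader)


# Assumed sample rows: one quote mid-field, one at the start of a field.
sample = '896\t20" Tile\t28\n896\t"Quoted start\t29\n'
rows = read_tsv(sample)
```

With enclosures disabled, both rows keep their three columns and the quote characters stay in the values.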
Also marked the priority down to major because this probably doesn't come up too often and corrected the title.
Comment #4
colan commented: Needs to be put into 7 first.
I've created a separate issue for fgetcsv(): #1369874: Don't roll own CSV parser; use PHP's native one
Marking #1368860: TSV import thwarted by odd number of double quotes in a field as a duplicate of this issue.
Comment #5
BWPanda commented: I struggled with this issue all day until I finally found this bug!
On my site the issue manifested itself as seemingly missing data: some things were being imported correctly while others were being skipped.
(^ using keywords to help others find this issue, would have saved me hours of frustration...)
I've attached a patch that fixes this issue in the latest version 7 (essentially the same patch from #3 above).
Comment #6
kking commented: The patch from #5 worked for me with the latest dev branch. Will work on creating a test for this.
Comment #7
kking commented: Attaching the patch from #5 with updated test.
Comment #8
kking commented
Comment #9
hoZt commented: The patch in #8 worked for me.
Thanks!
Comment #10
int_ua commented: Confirming that TSV_Parser-1014678-5.patch from comment #5 fixes 7.x-2.0-alpha7.
Comment #11
druderman commented: I second this!
Comment #13
grahamC commented: This is a problem for comma-separated files too. A generic fix that only checks for quotes directly after a delimiter might be better than special-casing tab-delimited files.
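A rough sketch of what such a generic rule could look like, written in Python rather than the module's PHP. `split_line` is a hypothetical function, not code from any patch in this issue. It assumes a quote opens an enclosure only when it is the first character of a field, and that a doubled quote inside an enclosure stands for a literal quote.

```python
def split_line(line, delimiter=",", quote='"'):
    """Split one line; a quote is an enclosure only at the start of a field."""
    fields, current = [], []
    in_quotes = False
    at_field_start = True
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == quote:
                if i + 1 < len(line) and line[i + 1] == quote:
                    current.append(quote)  # doubled quote means a literal quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                current.append(ch)
        elif ch == quote and at_field_start:
            in_quotes = True  # opening quote directly after a delimiter
        elif ch == delimiter:
            fields.append("".join(current))
            current = []
            at_field_start = True
            i += 1
            continue
        else:
            current.append(ch)  # includes quotes in the middle of a field
        at_field_start = False
        i += 1
    fields.append("".join(current))
    return fields
```

For example, `split_line('896,20" Tile,28')` keeps `20" Tile` intact, while `split_line('a,"b,c",d')` still honours the enclosure. A value that legitimately begins with a quote would still be read as an enclosure under this rule, so it does not cover every TSV case.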