The CSV Parser fails to properly parse files whose values are not enclosed in quotes but contain a quote character inside the value.

For example (Tab Separated Value File):

Photo Link	MLS Number	Photo Label	Date Last Modified	Photo Order	Display as Portrait	
http://example.com/getMLPhoto.asp?l=8C585D988E6294767FAF6D88ACB8B064846E825C89566D8C	894	Main View	4/27/2009 4:13:48 PM	0	False	
http://example.com/getMLPhoto.asp?l=8C585D988E62947681AF6F8A70B2B9908C71815B89586C8D64	896	Tile Shower	4/27/2009 5:03:19 PM	20	False	
http://example.com/getMLPhoto.asp?l=8C585D988E62947681AF6F8B70B2B9908C71815B89586C8D64	896	HUGE Closet	4/27/2009 5:03:20 PM	21	False	
http://example.com/getMLPhoto.asp?l=8C585D988E62947681AF6F8C70B2B9908C71815B89586C8D64	896	Double Vanity 	4/27/2009 5:03:21 PM	22	False	
http://example.com/getMLPhoto.asp?l=8C585D988E62947681AF6F8D70B2B9908C71815B89586C8D64	896	Corian Counters 	4/28/2009 10:00:10 AM	27	False	
http://example.com/getMLPhoto.asp?l=8C585D988E62947681AF6F8E70B2B9908C71815B89586C8D64	896	20" Tile 	4/28/2009 10:00:11 AM	28	False	
http://example.com/getMLPhoto.asp?l=8C585D988E62947681AF7588ACB8B064846E825C89566D8C	896	Bedroom 1	4/27/2009 4:47:39 PM	8	False	
http://example.com/getMLPhoto.asp?l=8C585D988E62947681AF7688ACB8B064846E825C89566D8C	896	Bedroom 2 	4/27/2009 4:47:41 PM	9	False	
http://example.com/getMLPhoto.asp?l=8C585D988E62947682AF6D88ACB8B064846E825C89566D8C	897	Main View	4/28/2009 12:30:19 AM	0	False	

The record:
http://example.com/getMLPhoto.asp?l=8C585D988E62947681AF6F8E70B2B9908C71815B89586C8D64 896 20" Tile 4/28/2009 10:00:11 AM 28 False
Fails because of the double quotation mark inside the value.
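
To illustrate the kind of failure described above, here is a minimal Python sketch. It is not the Feeds parser: `naive_split` is a hypothetical splitter that flips an "inside quotes" flag on every double quote, which is one plausible way a quote-aware parser can misread this record. The shortened URL is also just a placeholder.

```python
def naive_split(line, delimiter="\t"):
    """Hypothetical quote-toggling splitter (an assumption, not the Feeds code)."""
    fields, current, in_quotes = [], [], False
    for ch in line:
        if ch == '"':
            # Every double quote flips the state, even in the middle of a value.
            in_quotes = not in_quotes
        elif ch == delimiter and not in_quotes:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


# Same shape as the failing record, with a placeholder URL.
record = 'http://example.com/photo\t896\t20" Tile \t4/28/2009 10:00:11 AM\t28\tFalse'

naive_fields = naive_split(record)
plain_fields = record.split("\t")

print(len(naive_fields))  # fewer fields than columns: the lone quote swallows the tabs
print(len(plain_fields))  # one field per column
```

Because the quote in `20" Tile` is never closed, every tab after it is treated as part of the value, so the remaining columns collapse into a single field.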

This formatting is valid for a TSV file, which is supposed to be supported by the CSV Parser.
I am attaching a replacement parser that uses PHP 4.3's built-in fgetcsv function, although it has only been binary safe since 4.3.5.

The included patch does away with the separate parser class, and is significantly faster. The version numbers will be off in the diff because it comes from my own SVN repository.


Comments

David Goode’s picture

Status: Active » Needs review

This seems like a cool idea. I'm not sure why Alex went with the custom CSV parser, though I would guess he was aware of the fgetcsv option. Either way, we should either fix the current CSV parser so it does not die on open quotes, or apply this patch and replace it. Does the patch apply to the CVS head of 6.x-1.0? If not, could you reroll it? Thanks.

JeromeHollon’s picture

It applied to the head of 6.x-1.0 when I submitted it. It's been a while so I'm not sure. I thought Drupal emailed me when I got new responses, but evidently it doesn't.

andrewlevine’s picture

Title: CVSParser failed to properly parse CSV Files » CSVParser fails to properly parse TSV Files
Version: 6.x-1.0-beta3 » 6.x-1.x-dev
Priority: Critical » Major

I am attaching a patch against 6.x-1.x-dev which is a less drastic change to the existing code, but still solves the problem of parsing TSV files incorrectly.

The root problem is that TSV doesn't support column enclosing quotes (see http://www.iana.org/assignments/media-types/text/tab-separated-values ), so the patch simply disables this feature of the parser in this special case of TSVs.
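
As a rough illustration of the idea (turn off quote handling when the delimiter is a tab), here is a Python sketch using the standard `csv` module. It is only an analogy for the approach, not the patch itself; `parse_tsv` and the sample data are made up for the example.

```python
import csv
import io

# Sample data modeled on the report; the column subset is an assumption.
tsv_data = (
    "MLS Number\tPhoto Label\tPhoto Order\n"
    '896\t20" Tile \t28\n'
    "896\tBedroom 1\t8\n"
)


def parse_tsv(text):
    """Parse tab-separated text with quote handling disabled."""
    # QUOTE_NONE tells the reader to do no special processing of quote
    # characters, so a stray double quote is kept as ordinary data.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = parse_tsv(tsv_data)
for row in rows:
    print(row)
```

With quoting disabled, the `20" Tile ` value comes through unchanged and each row keeps its three columns.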

Also marked the priority down to major because this probably doesn't come up too often and corrected the title.

colan’s picture

Title: CSVParser fails to properly parse TSV Files » CSVParser can't parse TSV Files: double quotes still special
Version: 6.x-1.x-dev » 7.x-2.x-dev
Assigned: JeromeHollon » Unassigned

Needs to be put into 7 first.

I've created a separate issue for fgetcsv(): #1369874: Don't roll own CSV parser; use PHP's native one

Marking #1368860: TSV import thwarted by odd number of double quotes in a field as a duplicate of this issue.

BWPanda’s picture

I struggled with this issue all day until I finally found this bug!

On my site the issue manifested itself as seemingly missing data: some things were being imported correctly while others were being skipped.
(^ using keywords to help others find this issue, would have saved me hours of frustration...)

I've attached a patch that fixes this issue in the latest version 7 (essentially the same patch from #3 above).

kking’s picture

Status: Needs review » Needs work

The patch from #5 worked for me with the latest dev branch. Will work on creating a test for this.

kking’s picture

Attaching the patch from #5 with updated test.

kking’s picture

Status: Needs work » Needs review

hoZt’s picture

The patch in #8 worked for me.

Thanks!

int_ua’s picture

Confirming TSV_Parser-1014678-5.patch from comment #5 fixes 7.x-2.0-alpha7.

druderman’s picture

I second this!

Confirming TSV_Parser-1014678-5.patch from comment #5 fixes 7.x-2.0-alpha7

Status: Needs review » Needs work

The last submitted patch, feeds-tsv_parser-1014678-7.patch, failed testing.

grahamC’s picture

This is a problem for comma-separated files too. A generic fix to only check for quotes after a delimiter might be better than special-casing tab delimited files.
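
A minimal Python sketch of that generic rule, assuming a quote only opens an enclosed value when it is the first character of a field (at the start of the line or right after a delimiter). `split_record` is a hypothetical helper written for this comment, not code from Feeds, and it handles a single line only.

```python
def split_record(line, delimiter=",", quote='"'):
    """Split one line; a quote is special only at the start of a field."""
    fields, current = [], []
    in_quotes = False
    at_field_start = True
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == quote:
                if i + 1 < len(line) and line[i + 1] == quote:
                    current.append(quote)  # doubled quote is a literal quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                current.append(ch)
        elif ch == quote and at_field_start:
            in_quotes = True  # opening quote directly after a delimiter
        elif ch == delimiter:
            fields.append("".join(current))
            current = []
            at_field_start = True
            i += 1
            continue
        else:
            current.append(ch)  # includes quotes in the middle of a value
        at_field_start = False
        i += 1
    fields.append("".join(current))
    return fields


print(split_record('896,20" Tile,28'))
print(split_record('896,"Tile, large",28'))
print(split_record('896\t20" Tile \t28', delimiter="\t"))
```

The same function covers both comma- and tab-delimited input, so no special case for TSV is needed. A value that legitimately starts with a quote character would still need to be enclosed.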