Cocoa for Scientists (Part XXVI): Parsing CSV Data
Author: Drew McCormack
Web Site: www.maccoremac.com, www.macanics.net
On quite a few occasions, MacResearch readers have posted questions asking how to parse CSV (comma-separated values) data in Cocoa. CSV is a simple format for representing tables; it is used in widely varying fields, from science to finance, basically anywhere a table needs to be stored in a text file.
I’ve recently added CSV import to my flash card application, Mental Case. Before I began, I thought it would be a trivial matter of searching for some Objective-C sample code or an open source library with Google. I found solutions in scripting languages like Python, but nothing Cocoa based. After an hour or two of searching, I realized that if I wanted a Cocoa-native solution, I was going to have to roll my own. In this short tutorial, I will show you what I came up with, and hopefully save you the trouble of doing it yourself.
Simple CSV
Parsing CSV can actually be quite simple, if you know the structure of the data beforehand, and you don’t have to deal with quoted strings. In fact, I addressed this in an earlier tutorial that stored spectra in CSV format.
- (BOOL)readFromURL:(NSURL *)absoluteURL ofType:(NSString *)typeName
              error:(NSError **)outError
{
    NSString *fileString = [NSString stringWithContentsOfURL:absoluteURL
        encoding:NSUTF8StringEncoding error:outError];
    if ( nil == fileString ) return NO;
    NSScanner *scanner = [NSScanner scannerWithString:fileString];
    [scanner setCharactersToBeSkipped:
        [NSCharacterSet characterSetWithCharactersInString:@"\n, "]];
    NSMutableArray *newPoints = [NSMutableArray array];
    float energy, intensity;
    while ( [scanner scanFloat:&energy] && [scanner scanFloat:&intensity] ) {
        [newPoints addObject:
            [NSMutableDictionary dictionaryWithObjectsAndKeys:
                [NSNumber numberWithFloat:energy], @"energy",
                [NSNumber numberWithFloat:intensity], @"intensity",
                nil]];
    }
    [self setPoints:newPoints];
    return YES;
}
The NSScanner class is what you use to do most of your string parsing in Cocoa. In the example above, it has been assumed that the CSV file is in a particular form, namely, that it has exactly two columns, each containing a decimal number. By telling the scanner to skip newlines, commas, and spaces
[scanner setCharactersToBeSkipped:
[NSCharacterSet characterSetWithCharactersInString:@"\n, "]];
the parsing of each line is reduced to a single line
while ( [scanner scanFloat:&energy] && [scanner scanFloat:&intensity] ) {
The scanFloat: method will try to read a floating-point number, returning NO upon failure. So the while loop will continue until the format does not meet expectations.
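To see the loop in isolation, here is a minimal sketch that runs the same scanner logic over an in-memory string rather than a file. The function name ScanTwoColumnCSV is an illustrative assumption and is not part of the original spectrum code.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original tutorial): parses two-column
// numeric CSV text into an array of dictionaries.
static NSArray *ScanTwoColumnCSV(NSString *csvText)
{
    NSScanner *scanner = [NSScanner scannerWithString:csvText];
    // Treat newlines, commas, and spaces as skippable separators.
    [scanner setCharactersToBeSkipped:
        [NSCharacterSet characterSetWithCharactersInString:@"\n, "]];

    NSMutableArray *points = [NSMutableArray array];
    float energy, intensity;
    // Stops at the end of the text, or at the first token that is not a number.
    while ( [scanner scanFloat:&energy] && [scanner scanFloat:&intensity] ) {
        [points addObject:
            [NSDictionary dictionaryWithObjectsAndKeys:
                [NSNumber numberWithFloat:energy], @"energy",
                [NSNumber numberWithFloat:intensity], @"intensity",
                nil]];
    }
    return points;
}
```

Calling ScanTwoColumnCSV(@"1.0, 2.5\n2.0, 3.5\n") should give two dictionaries. Because the loop simply stops at the first non-numeric token, a malformed file yields a truncated result rather than an error; checking [scanner isAtEnd] after the loop is one way to detect that.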
General CSV
As you can see, parsing CSV data can be very easy, but it is not always the case. When you have to deal with general CSV data, things can get quite complicated, because you have to take account of the possibility that strings contain quotations, and can even extend over multiple lines. For example, the following is a valid line of CSV data, containing two columns:
"The quick, brown fox", "jumped over the ""lazy"",
dog"
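In case you haven’t figured it out, each pair of doubled quotation marks inside a quoted string stands for a single literal quotation mark, giving the two strings 'The quick, brown fox' and 'jumped over the "lazy",<new line>dog'. Note that the comma after "lazy" and the line break both fall inside the quotes, so they belong to the second string.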
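To make the difficulty concrete, the sketch below splits a row naively on every comma. The function name NaiveSplit is an illustrative assumption; componentsSeparatedByString: is the standard NSString method.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: splits one line on every comma, with no regard
// for quotation marks.
static NSArray *NaiveSplit(NSString *line)
{
    return [line componentsSeparatedByString:@","];
}
```

For a row such as "The quick, brown fox",42, which holds two columns, NaiveSplit returns three pieces, because the comma inside the quotes is treated as a separator. The quotation marks are also left in the pieces, and a quoted line break would already have split the row in two before this step was reached.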
Parsing this general form of CSV is considerably more difficult than the simple form, and it took me quite a while to come up with some clean code to do it. But I think I succeeded in the end. Here it is: (Update: I have changed this code to properly handle all newline varieties.)
@implementation NSString (ParsingExtensions)

-(NSArray *)csvRows {
    NSMutableArray *rows = [NSMutableArray array];

    // Get newline character set
    NSMutableCharacterSet *newlineCharacterSet = (id)[NSMutableCharacterSet whitespaceAndNewlineCharacterSet];
    [newlineCharacterSet formIntersectionWithCharacterSet:[[NSCharacterSet whitespaceCharacterSet] invertedSet]];

    // Characters that are important to the parser
    NSMutableCharacterSet *importantCharactersSet = (id)[NSMutableCharacterSet characterSetWithCharactersInString:@",\""];
    [importantCharactersSet formUnionWithCharacterSet:newlineCharacterSet];

    // Create scanner, and scan string
    NSScanner *scanner = [NSScanner scannerWithString:self];
    [scanner setCharactersToBeSkipped:nil];
    while ( ![scanner isAtEnd] ) {
        BOOL insideQuotes = NO;
        BOOL finishedRow = NO;
        NSMutableArray *columns = [NSMutableArray arrayWithCapacity:10];
        NSMutableString *currentColumn = [NSMutableString string];
        while ( !finishedRow ) {
            NSString *tempString;
            if ( [scanner scanUpToCharactersFromSet:importantCharactersSet intoString:&tempString] ) {
                [currentColumn appendString:tempString];
            }

            if ( [scanner isAtEnd] ) {
                if ( ![currentColumn isEqualToString:@""] ) [columns addObject:currentColumn];
                finishedRow = YES;
            }
            else if ( [scanner scanCharactersFromSet:newlineCharacterSet intoString:&tempString] ) {
                if ( insideQuotes ) {
                    // Add line break to column text
                    [currentColumn appendString:tempString];
                }
                else {
                    // End of row
                    if ( ![currentColumn isEqualToString:@""] ) [columns addObject:currentColumn];
                    finishedRow = YES;
                }
            }
            else if ( [scanner scanString:@"\"" intoString:NULL] ) {
                if ( insideQuotes && [scanner scanString:@"\"" intoString:NULL] ) {
                    // Replace double quotes with a single quote in the column string.
                    [currentColumn appendString:@"\""];
                }
                else {
                    // Start or end of a quoted string.
                    insideQuotes = !insideQuotes;
                }
            }
            else if ( [scanner scanString:@"," intoString:NULL] ) {
                if ( insideQuotes ) {
                    [currentColumn appendString:@","];
                }
                else {
                    // This is a column separating comma
                    [columns addObject:currentColumn];
                    currentColumn = [NSMutableString string];
                    [scanner scanCharactersFromSet:[NSCharacterSet whitespaceCharacterSet] intoString:NULL];
                }
            }
        }
        if ( [columns count] > 0 ) [rows addObject:columns];
    }

    return rows;
}

@end
(I’m releasing this code into the public domain, so use it as you please.)
This code is designed to be a category on NSString. The idea is that it will parse a string into rows and columns, under the assumption that it is in CSV format. The result is an array of arrays; entries in the containing array represent the rows, and those in the contained arrays represent columns in each row.
The code itself is fairly straightforward: It consists of a big while loop which continues until the whole string is parsed. An inner while loop looks through each row of CSV data, looking for significant landmarks, like an end of line, an opening or closing quotation mark, or a comma. By keeping track of opening and closing quotation marks, it is able to properly deal with commas and newlines embedded in quoted strings.
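To call csvRows from other source files, the category also needs an interface declaration, which the listing does not include. The following is a sketch of what that declaration and a caller might look like. The function DescribeCSV is a hypothetical name, and the code assumes the category implementation given earlier is compiled into the same target.

```objc
#import <Foundation/Foundation.h>

// Declaration matching the category implementation given earlier.
@interface NSString (ParsingExtensions)
- (NSArray *)csvRows;
@end

// Hypothetical caller: builds a one-line summary for each parsed row.
static NSArray *DescribeCSV(NSString *csvText)
{
    NSMutableArray *summaries = [NSMutableArray array];
    // Each element of the outer array is an array of column strings.
    for ( NSArray *columns in [csvText csvRows] ) {
        [summaries addObject:
            [NSString stringWithFormat:@"%lu column(s): %@",
                (unsigned long)[columns count],
                [columns componentsJoinedByString:@" | "]]];
    }
    return summaries;
}
```

For the quoted two-column example row shown earlier, csvRows should return a single row containing two strings, so DescribeCSV would produce one summary line. Note that the parser only adds a row's final column when it is non-empty, so a row ending in a comma comes back one column short; callers that need a fixed column count may want to pad rows themselves.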
Conclusions
NSScanner is a useful class for parsing strings. It may not be quite as powerful as the regular expressions found in scripting languages like Perl and Python, but with just a few methods (e.g., scanString:intoString:, scanUpToCharactersFromSet:intoString:, and scanFloat:) you can achieve an awful lot. If you need to do any basic string parsing in one of your Cocoa projects, give it a look.



Comments
Huge text files
Have you tried that with huge text files? NSString's stringWithContentsOfURL: reads the entire file, does it not? For large files I think you'd have to memory-map parts of the file, which would be more complicated.
Huge text files
True. For very big text files, you would have a problem.
I don't know if memory mapping would help much: if a large file is read in, it is effectively memory-mapped by the system, because it goes into virtual memory.
If you had enormous CSV files, you would probably need to do something more low-level, using NSFileHandle, or maybe the NSStream classes.
Drew
---------------------------
Drew McCormack
http://www.maccoremac.com
http://www.macanics.net
http://www.macresearch.org
iwork numbers
Could you send this to the good folks at Apple who forgot to include this feature in Numbers?
RE: Huge text files
It probably wouldn't matter if you're running on a server but if you're running on just a desktop then you'd probably consume all the physical ram and then it'd get paged out if needed.
File Encodings
I was also surprised, when I needed to add CSV parsing to my app, that there was nothing 'off the shelf'. I eventually rolled my own (similar to what you did), but one thing that really bit me was encoding (my app has a lot of international users).
In your example you assume UTF8 encoding, which is reasonable and the native encoding on the Mac, but the most popular CSV creator in the world (Excel) uses (I believe) UTF16, and there are other apps that put it in MacRoman (I never could figure out who was doing that).
Ultimately, I probably spent more time writing the code to import the file properly regardless of its encoding than I did on the actual parsing.
-jm
Cocoadev
Hi Drew,
Nice code, just wondered whether you had seen this one given your introduction:
http://www.cocoadev.com/index.pl?ReadWriteCSVAndTSV
Cocoadev
Nice link. I don't think I saw that, although I definitely looked on CocoaDev.
I had a bit of a look around just now, but the CocoaDev server is having trouble, so I stopped.
There is sample code there, but like most of the other code I found, it doesn't treat all cases. In particular, a lot of the code you can find does not deal with new-lines in quoted strings, which are allowed (at least by Excel).
If there is a sample that treats all cases, please post the link so that people have a choice.
Drew
New lines
I've looked into the potential new-line problem when using non-UTF8 strings, and came across a few useful tips. First, there is a newlineCharacterSet factory method on NSCharacterSet in Leopard, which could replace the @"\n\r" line in my code.
For pre-leopard, I found a public domain category that achieves the same thing: http://codebeach.org/code/show/1
I might work that into the code above if I get time.
Drew
Update for newlines
I have modified the code slightly to treat different varieties of newlines properly. I also tested it for a number of string types and line endings, including UTF8, UTF16, MacRoman, and Windows, and it seems to work in all cases.
Drew
Big hammer
There is always the high overhead/short version
NSString *fileContent = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:NULL];
NSArray *lines = [fileContent componentsSeparatedByString:@"\n"];
// In 10.5 can use
// [fileContent componentsSeparatedByCharactersInSet:[NSCharacterSet newlineCharacterSet]];
// Then for each line, use
NSArray *entriesInLine = [[lines objectAtIndex:i] componentsSeparatedByString:@","];
// In case you have quotes on the outside, do
[[entriesInLine objectAtIndex:j] stringByTrimmingCharactersInSet:quotes];
// where
quotes = [NSCharacterSet characterSetWithCharactersInString:@"\""];
This method is fairly slow. But for moderate sizes (up to a megabyte or two) you really won't feel it that much. You are using fairly general purpose objects when you use a character set, and a lot of objects are added to the autorelease pool. The main issue is memory use, since you will have multiple copies of the data file + object overhead. To avoid that, create NSAutoreleasePool objects inside the loops, and put about 100 objects in it before you remove it and create another pool. Use "po [NSAutoreleasePool showPools]" in the debugger to monitor how many objects are in the pool.
David
newlines
David, one issue with your code is that you won't be able to have newline characters as part of the cell content of your table. For example, if column 1 is the title of a song and column 2 holds the lyrics, then you want the cells of column 2 to include newline characters.
Otherwise, it is true that the NSString method 'componentsSeparatedByString:' is a very convenient method, worth remembering when performance is not an issue.
Great Series!
I'm new to mac, a mathematician retired from the aerospace industry. I decided a couple of weeks ago to use Cocoa for some projects. If I had found your series sooner, I could have saved an ink cartridge, a whole bunch of paper, and the price of a book.
I was just starting to struggle with a CSV import, export for my current app when I found your earlier article 17, and this one. Even though I've read a good deal of apple documentation, other tutorials, and a fair part of the aforementioned book, each of your articles answers a question I've had, or clarifies something I've only partially understood.
I hope to be able to make a contribution someday.
Thanks
Just want to say thanks for a great series of tutorials. I have studied and been put off by Obj-C for several years, but this set of lessons has finally brought it all together to make sense.
Kudos!