Using LINQ with CSV Files

In my last post, I wrote how you can convert a CSV to XML. I stated that the reason that you would do this is so that you can use LINQ on CSV files.

Using my CsvToXml class you can easily use LINQ on CSV. You can see the source for CsvToXml in my C# CommonLib library.

Let's assume your CSV file looks like the following: 'Product','Price','DateStocked' 'Pepsi','4.50','2010-05-04' 'Coke','3.00','2010-09-22' 'Cheetos','7.25','2009-01-13'

You can then convert like so:

var csv2xml = new CsvToXml("mycsvfile.csv", true);
csv2xml.TextQualifier = '\'';
csv2xml.Convert();

You can then do cool LINQ queries using the "Records" property of the CsvToXml class:

var records = from rec in csv2xml.Records
            where (decimal)rec.Element("Price") > 3.5m
            orderby (string)rec.Element("Product")
            select rec;

or use LINQ like...

var records = from rec in csv2xml.Records
            where Convert.ToDateTime((string)rec.Element("DateStocked")) > new DateTime(2010, 9, 1)
            orderby (string)rec.Element("Product")
            select rec;

Then to access the data of the returned "records" object you can do this:

var myrec = records.ElementAt(0);
var firstProduct = (string)myrec.Element("Product");

Cool, huh? But we can do better. It's annoying to cast your objects all of the time and to call Element("CSV_COLUMN_NAME"). Why not take advantage of .NET 4.0 and use its 'dynamic' data type? Admittedly, I got this idea from this article: Creating Wrappers with DynamicObject.

Now we can use LINQ like this:

var records = from rec in csv2xml.DynamicRecords
            where (decimal)rec.Price > 3.5m
            orderby (string)rec.Product
            select rec;

//OR

records = from rec in csv2xml.DynamicRecords
        where Convert.ToDateTime((string)rec.DateStocked) > new DateTime(2010, 9, 1)
        orderby (string)rec.Product
        select rec;

A bit better. Now you don't have to call "Element" method everytime, you still have to annoyingly cast. But, in each CSV column, we already know that the types should be consistent. Let's define our types up front:

var csv2xml = new CsvToXml("mycsvfile.csv", true);
csv2xml.TextQualifier = '\'';

csv2xml.ColumnTypes.Add("Product", typeof(string));
csv2xml.ColumnTypes.Add("Price", typeof(decimal));
csv2xml.ColumnTypes.Add("DateStocked", typeof(DateTime));

csv2xml.Convert();

Now we can use LINQ like this:

records = from rec in csv2xml.DynamicRecords
            where rec.Price > 3.5m
            orderby rec.Product
            select rec;

//OR
records = from rec in csv2xml.DynamicRecords
            where rec.DateStocked > new DateTime(2010, 9, 1)
            orderby rec.Product
            select rec;

This is truly clean code. All you need to do is define your types for each column and you're set. At this time, the major disadvantage is that you can't lazily evaluate your CSV. You must read it entirely into memory. This class could be modified for lazy evaluation and then you would be able to use LINQ on large CSV files.

If you made it this far, you should follow me on Twitter.

-JP

comments powered by Disqus