A reader that provides fast, non-cached, forward-only access to CSV data.
First and foremost, I thank everyone that is contributing in the discussion forum of this article: it amazes me that users of this library are helping each other in this manner.
So ... I never foresaw this project getting so popular. It was made with a single purpose in mind: performance and one can clearly see that it affected its design, maybe too much in retrospect. Five years have passed since the last update and many people are asking where they can contribute code or post issues. I no longer maintain this library, but Paul Hatcher has set up a GitHub repository and a NuGet package. I invite you to go there if you need to maintain an existing project that uses this library.
If however you are starting a new project or are willing to do so some light refactoring, I also wrote a new library that includes a CSV and fixed width reader/writer which are just as fast as this library, but are much more flexible and handle many more use cases. You can find the CSV reader source in the GitHub repository and download the NuGet package. I will fully maintain this new library and it is already used in many projects in production.
IntroductionOne would imagine that parsing CSV files is a straightforward and boring task. I was thinking that too, until I had to parse several CSV files of a couple GB each. After trying to use the OLEDB JET driver and various Regular Expressions, I still ran into serious performance problems. At this point, I decided I would try the custom class option. I scoured the net for existing code, but finding a correct, fast, and efficient CSV parser and reader is not so simple, whatever platform/language you fancy.
I say correct in the sense that many implementations merely use some splitting method like String.Split()
. This will, obviously, not handle field values with commas. Better implementations may care about escaped quotes, trimming spaces before and after fields, etc., but none I found were doing it all, and more importantly, in a fast and efficient manner.
And, this led to the CSV reader class I present in this article. Its design is based on the System.IO.StreamReader
class, and so is a non-cached, forward-only reader (similar to what is sometimes called a fire-hose cursor).
Benchmarking it against both OLEDB and regex methods, it performs about 15 times faster, and yet its memory usage is very low.
To give more down-to-earth numbers, with a 45 MB CSV file containing 145 fields and 50,000 records, the reader was processing about 30 MB/sec. So all in all, it took 1.5 seconds! The machine specs were P4 3.0 GHz, 1024 MB.
Supported FeaturesThis reader supports fields spanning multiple lines. The only restriction is that they must be quoted, otherwise it would not be possible to distinguish between malformed data and multi-line values.
Basic data-binding is possible via the System.Data.IDataReader
interface implemented by the reader.
You can specify custom values for these parameters:
If the CSV contains field headers, they can be used to access a specific field.
When the CSV data appears to be malformed, the reader will fail fast and throw a meaningful exception stating where the error occurred and providing the current content of the buffer.
A cache of the field values is kept for the current record only, but if you need dynamic access, I also included a cached version of the reader, CachedCsvReader
, which internally stores records as they are read from the stream. Of course, using a cache this way makes the memory requirements way higher, as the full set of data is held in memory.
You can find the code for these benchmarks in the demo project. I tried to be fair and follow the same pattern for each parsing method. The regex used comes from Jeffrey Friedl's book, and can be found at page 271. It doesn't handle trimming and multi-line fields.
The test file contains 145 fields, and is about 45 MB (included in the demo project as a RAR archive).
I also included the raw data from the benchmark program and from the CLR Profiler for .NET 2.0.
Using the CodeThe class design follows System.IO.StreamReader
as much as possible. The parsing mechanism introduced in version 2.0 is a bit trickier because we handle the buffering and the new line parsing ourselves. Nonetheless, because the task logic is clearly encapsulated, the flow is easier to understand. All the code is well documented and structured, but if you have any questions, simply post a comment.
using System.IO; using LumenWorks.Framework.IO.Csv; void ReadCsv() { // open the file "data.csv" which is a CSV file with headers using (CsvReader csv = new CsvReader(new StreamReader("data.csv"), true)) { int fieldCount = csv.FieldCount; string[] headers = csv.GetFieldHeaders(); while (csv.ReadNextRecord()) { for (int i = 0; i < fieldCount; i++) Console.Write(string.Format("{0} = {1};", headers[i], csv[i])); Console.WriteLine(); } } }Simple Data-Binding Scenario (ASP.NET)
using System.IO; using LumenWorks.Framework.IO.Csv; void ReadCsv() { // open the file "data.csv" which is a CSV file with headers using (CsvReader csv = new CsvReader( new StreamReader("data.csv"), true)) { myDataRepeater.DataSource = csv; myDataRepeater.DataBind(); } }Complex Data-Binding Scenario (ASP.NET)
Due to the way both the System.Web.UI.WebControls.DataGrid
and System.Web.UI.WebControls.GridView
handle System.ComponentModel.ITypedList
, complex binding in ASP.NET is not possible. The only way around this limitation would be to wrap each field in a container implementing System.ComponentModel.ICustomTypeDescriptor
.
Anyway, even if it was possible, using the simple data-binding method is much more efficient.
For the curious amongst you, the bug comes from the fact that the two grid controls completely ignore the property descriptors returned by System.ComponentModel.ITypedList
, and relies instead on System.ComponentModel.TypeDescriptor.GetProperties(...)
, which obviously returns the properties of the string array and not our custom properties. See System.Web.UI.WebControls.BoundColumn.OnDataBindColumn(...)
in a disassembler.
using System.IO; using LumenWorks.Framework.IO.Csv; void ReadCsv() { // open the file "data.csv" which is a CSV file with headers using (CachedCsvReader csv = new CachedCsvReader(new StreamReader("data.csv"), true)) { // Field headers will automatically be used as column names myDataGrid.DataSource = csv; } }Custom Error Handling Scenario
using System.IO; using LumenWorks.Framework.IO.Csv; void ReadCsv() { // open the file "data.csv" which is a CSV file with headers using (CsvReader csv = new CsvReader( new StreamReader("data.csv"), true)) { // missing fields will not throw an exception, // but will instead be treated as if there was a null value csv.MissingFieldAction = MissingFieldAction.ReplaceByNull; // to replace by "" instead, then use the following action: //csv.MissingFieldAction = MissingFieldAction.ReplaceByEmpty; int fieldCount = csv.FieldCount; string[] headers = csv.GetFieldHeaders(); while (csv.ReadNextRecord()) { for (int i = 0; i < fieldCount; i++) Console.Write(string.Format("{0} = {1};", headers[i], csv[i] == null ? "MISSING" : csv[i])); Console.WriteLine(); } } }Custom Error Handling Using Events Scenario
using System.IO; using LumenWorks.Framework.IO.Csv; void ReadCsv() { // open the file "data.csv" which is a CSV file with headers using (CsvReader csv = new CsvReader( new StreamReader("data.csv"), true)) { // missing fields will not throw an exception, // but will instead be treated as if there was a null value csv.DefaultParseErrorAction = ParseErrorAction.RaiseEvent; csv.ParseError += new ParseErrorEventHandler(csv_ParseError); int fieldCount = csv.FieldCount; string[] headers = csv.GetFieldHeaders(); while (csv.ReadNextRecord()) { for (int i = 0; i < fieldCount; i++) Console.Write(string.Format("{0} = {1};", headers[i], csv[i])); Console.WriteLine(); } } } void csv_ParseError(object sender, ParseErrorEventArgs e) { // if the error is that a field is missing, then skip to next line if (e.Error is MissingFieldCsvException) { Console.Write("--MISSING FIELD ERROR OCCURRED"); e.Action = ParseErrorAction.AdvanceToNextLine; } }History Version 3.8.1 (2011-11-10)
DefaultHeaderName
property (by default, it is "Column" + column index).MoveTo
in a particular action sequence;RecordEnumerator
caused by reusing the same array over each iteration.CachedCsvReader
into a DataTable
and the CSV has no header.CachedCsvReader
without having read a record first.IDataRecord
implementation where GetValue
/GetValues
should return DBNull.Value
when the field value is empty or null
;SkipEmptyLines
(on by default);CsvReader.ReadNextRecord()
would return false
for a CSV file containing only one line ending with a new line character and no header.SupportsMultiline
is false
;IDataReader
schema column "DataType" returned DbType.String
instead of typeof(string)
.SupportsMultiline
property to help boost performance when multi-line support is not needed;FieldCount
property is accessed before having read any record;ReadNextRecord(...)
by eliminating its recursive behavior when initializing headers;CsvReader
supports 2^63 records;FieldHeader
s).RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4