I have run into a few opportunities to further improve the wonderful read_csv() function. I'm using the latest x64 0.10.1 build from 10-Feb-2013 16:52.
I believe fixing the exceptions described below would simply require ignoring any extra trailing commas. This is how many CSV readers work, for example Excel when it opens a CSV. In my case I regularly work with 100K-line CSVs that occasionally have extra trailing columns, causing read_csv() to fail. Perhaps it's possible to have an option to ignore trailing commas or, even better, an option to ignore/skip any malformed rows without raising a terminal exception. :)
If a CSV has 'n' matched extra trailing columns and you do not specify index_col, the parser will correctly assume that the first 'n' columns are the index. If you set index_col=False, it fails with: IndexError: list index out of range
In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import StringIO

In [4]: pd.__version__
Out[4]: '0.10.1'

In [5]: data = 'a,b,c\n4,apple,bat,,\n8,orange,cow,,'    <-- Matched extra commas

In [6]: data2 = 'a,b,c\n4,apple,bat,,\n8,orange,cow,,,'    <-- Mismatched extra commas

In [7]: print data
a,b,c
4,apple,bat,,
8,orange,cow,,

In [8]: print data2
a,b,c
4,apple,bat,,
8,orange,cow,,,

In [9]: df = pd.read_csv(StringIO.StringIO(data))

In [10]: df
Out[10]:
            a   b   c
4 apple   bat NaN NaN
8 orange  cow NaN NaN

In [11]: df.index
Out[11]:
MultiIndex
[(4, apple), (8, orange)]

In [12]: df2 = pd.read_csv(StringIO.StringIO(data), index_col=False)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-12-4d7ece45eef3> in <module>()
----> 1 df2 = pd.read_csv(StringIO.StringIO(data), index_col=False)

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
    397                     buffer_lines=buffer_lines)
    398
--> 399         return _read(filepath_or_buffer, kwds)
    400
    401     parser_f.__name__ = name

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    213         return parser
    214
--> 215     return parser.read()
    216
    217 _parser_defaults = {

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    629         #     self._engine.set_error_bad_lines(False)
    630
--> 631         ret = self._engine.read(nrows)
    632
    633         if self.options.get('as_recarray'):

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    952
    953         try:
--> 954             data = self._reader.read(nrows)
    955         except StopIteration:
    956             if nrows is None:

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader.read (pandas\src\parser.c:5915)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6132)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6946)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7670)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._get_column_name (pandas\src\parser.c:10545)()

IndexError: list index out of range

In [13]: df3 = pd.read_csv(StringIO.StringIO(data2), index_col=False)
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-13-441bfd4aff6e> in <module>()
----> 1 df3 = pd.read_csv(StringIO.StringIO(data2), index_col=False)

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
    397                     buffer_lines=buffer_lines)
    398
--> 399         return _read(filepath_or_buffer, kwds)
    400
    401     parser_f.__name__ = name

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    213         return parser
    214
--> 215     return parser.read()
    216
    217 _parser_defaults = {

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    629         #     self._engine.set_error_bad_lines(False)
    630
--> 631         ret = self._engine.read(nrows)
    632
    633         if self.options.get('as_recarray'):

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    952
    953         try:
--> 954             data = self._reader.read(nrows)
    955         except StopIteration:
    956             if nrows is None:

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader.read (pandas\src\parser.c:5915)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6132)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6734)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._tokenize_rows (pandas\src\parser.c:6619)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.raise_parser_error (pandas\src\parser.c:17023)()

CParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 6
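
Until something like the requested option exists, the "ignore extra trailing commas" idea can be approximated by cleaning the text before it reaches read_csv(). The following is only a rough sketch of that workaround, not pandas behaviour: trim_to_header_width is a hypothetical helper name, it uses the standard csv module rather than the pandas parser, and it is written for Python 3 (the session above is Python 2). It assumes the first line is the header and that surplus fields are safe to drop only when they are empty.

```python
import csv
import io


def trim_to_header_width(text):
    """Cut every row down to the number of fields in the header row.

    Hypothetical helper, not part of pandas. Only empty surplus fields
    are dropped; a row carrying non-empty surplus data raises ValueError
    so that real values are never discarded silently. Rows shorter than
    the header are passed through unchanged.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    width = len(header)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)

    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        extra = row[width:]
        if any(field != "" for field in extra):
            raise ValueError(
                "line {} has non-empty extra fields: {!r}".format(line_no, extra)
            )
        writer.writerow(row[:width])

    return out.getvalue()


# The same two inputs as in the session above.
data = 'a,b,c\n4,apple,bat,,\n8,orange,cow,,'
data2 = 'a,b,c\n4,apple,bat,,\n8,orange,cow,,,'

cleaned = trim_to_header_width(data)
cleaned2 = trim_to_header_width(data2)
print(cleaned2)
```

The returned text can then be wrapped in a StringIO object and handed to read_csv() in place of the original data. Raising on non-empty surplus fields is a deliberate choice in this sketch: it separates harmless trailing commas from rows that are genuinely malformed.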