Showing content from https://github.com/pandas-dev/pandas/issues/2886 below:

read_csv() & extra trailing comma(s) cause parsing issues. · Issue #2886 · pandas-dev/pandas · GitHub

I have run into a few opportunities to further improve the wonderful read_csv() function. I'm using the latest x64 0.10.1 build from 10-Feb-2013 16:52.

I believe fixing these exceptions would simply require ignoring any extra trailing commas. Many CSV readers already work this way, for example when opening a CSV in Excel. In my case I regularly work with 100K-line CSVs that occasionally have extra trailing columns, which causes read_csv() to fail. Perhaps it's possible to add an option to ignore trailing commas or, even better, an option to skip any malformed rows without raising a terminal exception. :)
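
Until such an option exists, the "ignore extra trailing commas" behaviour can be approximated as a pre-processing step. The following is only a minimal sketch using the standard csv module, not a pandas feature: the function name strip_trailing_empty_fields is hypothetical, the first row is assumed to be the header, and it is written for Python 3 (io.StringIO) rather than the Python 2 session used in this report. It drops only empty fields beyond the header width, so extra non-empty values are left in place for the parser to report.

```python
import csv
import io


def strip_trailing_empty_fields(text):
    """Return CSV text with surplus empty trailing fields removed.

    Hypothetical helper: the first row is treated as the header, and each
    row loses trailing empty fields until it is no wider than the header.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return text
    width = len(rows[0])
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        # Pop only empty fields, and only while the row is wider than the header.
        while len(row) > width and row[-1] == "":
            row.pop()
        writer.writerow(row)
    return out.getvalue()


data2 = "a,b,c\n4,apple,bat,,\n8,orange,cow,,,"
cleaned = strip_trailing_empty_fields(data2)
print(cleaned)
```

The cleaned string can then be wrapped in a StringIO object and passed to read_csv() in place of the raw data.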

If a CSV has 'n' matched extra trailing columns and you do not specify index_col, the parser correctly assumes that the first 'n' columns are the index. If you set index_col=False, however, it fails with: IndexError: list index out of range. If the extra trailing commas are mismatched between rows, it fails with a CParserError instead.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import StringIO

In [4]: pd.__version__
Out[4]: '0.10.1'

In [5]: data = 'a,b,c\n4,apple,bat,,\n8,orange,cow,,'  # matched extra commas

In [6]: data2 = 'a,b,c\n4,apple,bat,,\n8,orange,cow,,,'  # mismatched extra commas

In [7]: print data
a,b,c
4,apple,bat,,
8,orange,cow,,

In [8]: print data2
a,b,c
4,apple,bat,,
8,orange,cow,,,

In [9]: df = pd.read_csv(StringIO.StringIO(data))

In [10]: df
Out[10]:
            a   b   c
4 apple   bat NaN NaN
8 orange  cow NaN NaN

In [11]: df.index
Out[11]: MultiIndex
[(4, apple), (8, orange)]

In [12]: df2 = pd.read_csv(StringIO.StringIO(data), index_col=False)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-12-4d7ece45eef3> in <module>()
----> 1 df2 = pd.read_csv(StringIO.StringIO(data), index_col=False)

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
    397                     buffer_lines=buffer_lines)
    398
--> 399         return _read(filepath_or_buffer, kwds)
    400
    401     parser_f.__name__ = name

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    213         return parser
    214
--> 215     return parser.read()
    216
    217 _parser_defaults = {

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    629             #     self._engine.set_error_bad_lines(False)
    630
--> 631         ret = self._engine.read(nrows)
    632
    633         if self.options.get('as_recarray'):

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    952
    953         try:
--> 954             data = self._reader.read(nrows)
    955         except StopIteration:
    956             if nrows is None:

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader.read (pandas\src\parser.c:5915)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6132)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6946)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7670)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._get_column_name (pandas\src\parser.c:10545)()

IndexError: list index out of range

In [13]: df3 = pd.read_csv(StringIO.StringIO(data2), index_col=False)
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-13-441bfd4aff6e> in <module>()
----> 1 df3 = pd.read_csv(StringIO.StringIO(data2), index_col=False)

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
    397                     buffer_lines=buffer_lines)
    398
--> 399         return _read(filepath_or_buffer, kwds)
    400
    401     parser_f.__name__ = name

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    213         return parser
    214
--> 215     return parser.read()
    216
    217 _parser_defaults = {

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    629             #     self._engine.set_error_bad_lines(False)
    630
--> 631         ret = self._engine.read(nrows)
    632
    633         if self.options.get('as_recarray'):

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    952
    953         try:
--> 954             data = self._reader.read(nrows)
    955         except StopIteration:
    956             if nrows is None:

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader.read (pandas\src\parser.c:5915)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6132)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6734)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._tokenize_rows (pandas\src\parser.c:6619)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.raise_parser_error (pandas\src\parser.c:17023)()

CParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 6
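
The tokenizer error above names only the first offending line. To list every row with an unexpected field count before calling read_csv(), a rough check along these lines may help. It is a sketch using the standard csv module under Python 3, and find_ragged_rows is a hypothetical name. Taking the expected width from the first data row is an assumption chosen so the result lines up with the message above; it is not a description of the parser's internals. Line numbers are only accurate when no quoted field contains an embedded newline.

```python
import csv
import io


def find_ragged_rows(text):
    """Return (line_number, field_count) for rows unlike the first data row.

    Hypothetical helper: line numbers are 1-based and count the header as
    line 1; the reference width is taken from the first data row (line 2).
    """
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) < 2:
        return []
    expected = len(rows[1])
    return [
        (lineno, len(row))
        for lineno, row in enumerate(rows[1:], start=2)
        if len(row) != expected
    ]


data = "a,b,c\n4,apple,bat,,\n8,orange,cow,,"
data2 = "a,b,c\n4,apple,bat,,\n8,orange,cow,,,"
matched_report = find_ragged_rows(data)      # matched extra commas
mismatched_report = find_ragged_rows(data2)  # mismatched extra commas
print(matched_report)
print(mismatched_report)
```

For the two strings from this report, the matched case yields an empty list and the mismatched case flags line 3 with 6 fields.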