Showing content from https://github.com/pandas-dev/pandas/issues/52505 below:

incorrect reading of CSV containing large integers · Issue #52505 · pandas-dev/pandas · GitHub

import pandas as pd
from io import StringIO

scsv = """#SYMBOL,SYSTEM,TYPE,MOMENT,ID,ACTION,PRICE,VOLUME,ID_DEAL,PRICE_DEAL
RIH3,F,S,20230301181139587,1925036343869802844,0,96690.00000,2,,
USDRUBF,F,S,20230301181139587,2023552585717888193,0,75.10000,14,,
USDRUBF,F,S,20230301181139587,2023552585717889863,1,75.00000,14,,
USDRUBF,F,S,20230301181139587,2023552585717889863,2,75.00000,1,2023552585717263358,75.00000
USDRUBF,F,B,20230301181139587,2023552585717882895,2,75.00000,1,2023552585717263358,75.00000
USDRUBF,F,S,20230301181139587,2023552585717889863,2,75.00000,1,2023552585717263359,75.00000
USDRUBF,F,B,20230301181139587,2023552585717888161,2,75.00000,1,2023552585717263359,75.00000
USDRUBF,F,S,20230301181139587,2023552585717889863,2,75.00000,10,2023552585717263360,75.00000
USDRUBF,F,B,20230301181139587,2023552585717889759,2,75.00000,10,2023552585717263360,75.00000
USDRUBF,F,S,20230301181139587,2023552585717889863,2,75.00000,2,2023552585717263361,75.00000
USDRUBF,F,B,20230301181139587,2023552585717889827,2,75.00000,2,2023552585717263361,75.00000
"""

# Count the rows whose ID_DEAL equals one specific deal ID; the data contains exactly two.
for engine in ("c", "pyarrow", "python"):
    test = StringIO(scsv)
    orders = pd.read_csv(test, engine=engine)
    print(engine, len(orders.query("ID_DEAL==2023552585717263360")))

This piece of financial data is read terribly incorrectly. I get the following output:

None of the engines is able to parse the data correctly, I suppose because the integers are pretty big. But pandas produces no warning; it just silently butchers the data. The issue is not new to 2.0; it was present in 1.5 as well.

It seems to be connected to float64 not having enough precision to hold all the digits. int64 would probably be fine, since the previous column is read as int64, but the current column has missing values.
Using the new nullable integer type kind of works, but not for pyarrow:
...
orders = pd.read_csv(test, engine=engine,dtype={"ID_DEAL": "Int64"})
...
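
The float64 explanation can be checked without pandas. The following is a minimal sketch using only the Python standard library (`math.ulp` needs Python 3.9 or later); the four IDs are copied from the ID_DEAL column of the sample data, and the variable names are illustrative only:

```python
import math

# Four distinct deal IDs copied from the ID_DEAL column of the sample CSV.
ids = [
    2023552585717263358,
    2023552585717263359,
    2023552585717263360,
    2023552585717263361,
]

# float64 has a 53-bit significand, so not every integer above 2**53
# has its own float64 value.
as_floats = [float(i) for i in ids]

distinct_ints = len(set(ids))          # how many different IDs we started with
distinct_floats = len(set(as_floats))  # how many survive the conversion

# Gap between neighbouring float64 values in this magnitude range.
spacing = math.ulp(as_floats[0])

print(distinct_ints, distinct_floats, spacing)
```

The IDs lie between 2**60 and 2**61, where adjacent float64 values are 256 apart, so all four IDs collapse to the same float. An equality filter on a float64 column therefore cannot tell these deals apart.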

But the loss of precision certainly needs to be recognized automatically, and either an error should be raised or the Int64 dtype suggested. Otherwise, this bug can lead to catastrophic consequences in decision support systems powered by pandas.
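
Until such a check exists, a pre-flight scan can approximate it. The sketch below is not a pandas feature: `risky_int_columns` is a hypothetical helper built on the standard `csv` module. It assumes, following the reasoning above, that a column is at risk when it has at least one empty field (so it would not be inferred as int64) and at least one integer that float64 cannot represent exactly.

```python
import csv
from io import StringIO


def risky_int_columns(text):
    """Hypothetical helper: list columns that mix empty fields with
    integers that do not survive a round trip through float64."""
    has_missing = set()
    has_inexact = set()
    for row in csv.DictReader(StringIO(text)):
        for name, value in row.items():
            if not value:
                # Empty field: the column cannot be a plain int64 column.
                has_missing.add(name)
            elif value.isdigit() and int(float(value)) != int(value):
                # Integer literal that changes when stored as float64.
                has_inexact.add(name)
    return sorted(has_missing & has_inexact)


# Cut-down sample using column names and values from the data above.
sample = (
    "ID,VOLUME,ID_DEAL\n"
    "2023552585717889863,14,\n"
    "2023552585717889863,1,2023552585717263358\n"
)

print(risky_int_columns(sample))
```

Here only ID_DEAL is reported: ID also holds large integers, but it has no empty fields, which matches the observation that the ID column is read as int64 without trouble. The reported columns are candidates for an explicit dtype such as "Int64" or str.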
