Hi,
I think I ran into a bug in the RLE decompression implementation.
Short description:
String fields with more than 32 consecutive repeated characters are cropped at 32 characters, and the following fields spill over, corrupting the whole DataFrame.
Example:
example.csv with fields of length 50
long_string_field1,long_string_field2,long_string_field3
"00000000000000000000000000000000000000000000000000","11111111111111111111111111111111111111111111111111","aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
Create a CHAR compressed sas7bdat file (system encoding is set to latin1)
options compress=char;
proc import datafile="path\example.csv" out=your_lib.example dbms=csv replace;
    getnames=yes;
run;
import pandas as pd
example = pd.read_sas("./example.sas7bdat", encoding="latin1")
This is what you get:
>>> example['long_string_field1'].values[0]
'00000000000000000000000000000000 111111111111111111'
>>> example['long_string_field2'].values[0]
'11111111111111 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
>>> example['long_string_field3'].values[0]
nan
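The spill-over can be illustrated with a toy model. This is only a sketch, assuming that each run of repeated characters is cut at 32 and that the shortened row is then sliced at the original 50-character field widths. `simulate_cropped_row` is a hypothetical helper written for this illustration, not pandas or SAS code, and it does not reproduce the real decompression logic.

```python
from itertools import groupby

MAX_RUN = 32      # run length at which the fields appear to be cropped
FIELD_WIDTH = 50  # width of each field in example.csv


def simulate_cropped_row(values, max_run=MAX_RUN, width=FIELD_WIDTH):
    """Toy model: crop each run of repeated characters, then re-slice the row."""
    row = "".join(values)
    # Crop every run of identical characters to at most max_run characters.
    cropped = "".join(
        char * min(len(list(run)), max_run) for char, run in groupby(row)
    )
    # Pad with blanks to the full row width (assumes each value is `width` long).
    cropped = cropped.ljust(len(values) * width)
    # Slice the row into fixed-width fields and strip trailing blanks.
    return [
        cropped[i * width:(i + 1) * width].rstrip()
        for i in range(len(values))
    ]


if __name__ == "__main__":
    for field in simulate_cropped_row(["0" * 50, "1" * 50, "a" * 50]):
        print(repr(field))
```

Under these assumptions the first field ends with the start of the second, the second ends with the third, and the third comes out empty, which is the same general pattern as the output above.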
There are a couple of interesting points:
The sas7bdat package works fine.
I would never use any of the things above of my own free will. Sadly, this is an actual case I keep running into when having to deal with SAS... 😢
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.None
pandas : 0.25.3
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 45.1.0.post20200119
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
s3fs : None
scipy : None
sqlalchemy : 1.3.12
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None