I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
import pandas as pd

pd.DataFrame({"mø": [1, 2, 3, 4]}).to_stata("data.dta", version=119)
#.venv/lib/python3.9/site-packages/pandas/io/stata.py:2491: InvalidColumnName:
#Not all pandas column names were valid Stata variable names.
#The following replacements have been made:
#
#    mø   ->   m_
#
#If this is not what you expect, please make sure you have Stata-compliant
#column names in your DataFrame (strings only, max 32 characters, only
#alphanumerics and underscores, no Stata reserved words)
#
#  warnings.warn(ws, InvalidColumnName)

pd.read_stata("data.dta")
#    index  m_
# 0      0   1
# 1      1   2
# 2      2   3
# 3      3   4

Issue Description
All characters in the range 128 <= ord(c) < 256 are replaced with an underscore by StataWriterUTF8, but only those with (128 <= ord(c) < 192) or ord(c) in {215, 247} need to be replaced. The rest (ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ) are perfectly valid in variable names in Stata version >= 118.
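
To make the ranges above concrete, here is a minimal sketch that enumerates the code points from 128 to 255 and keeps only those the report describes as valid. The helper name `must_replace` is made up for illustration and is not part of pandas.

```python
def must_replace(c: str) -> bool:
    """Hypothetical helper: True for code points in 128..255 that this
    report says are not allowed in Stata variable names."""
    cp = ord(c)
    # 128..191 (C1 controls and Latin-1 punctuation), plus the
    # multiplication sign (215) and the division sign (247).
    return 128 <= cp < 192 or cp in {215, 247}


# Everything else in 128..255 should survive unchanged.
valid_latin1 = "".join(
    chr(cp) for cp in range(128, 256) if not must_replace(chr(cp))
)
print(valid_latin1)
print(len(valid_latin1))
```

The printed string should be the same run of accented letters listed above: 64 code points from 192 to 255, minus the two excluded signs.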
Expected Behavior
The characters ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ should not be removed from variable names when saving with StataWriterUTF8. Happy to submit a pull request.
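
As a rough sketch of the expected behavior (not pandas' actual implementation, and not the proposed patch), a name sanitizer could look like the following. The function name `sanitize_stata_name` is hypothetical. It only handles per-character replacement; the other rules from the warning (32-character limit, reserved words) are deliberately left out. Passing code points of 256 and above through unchanged is an assumption, since this report only discusses the range below 256.

```python
def sanitize_stata_name(name: str) -> str:
    """Hypothetical sketch: replace only characters that are invalid
    in Stata >= 118 variable names, per the ranges in this report."""
    out = []
    for c in name:
        cp = ord(c)
        if cp < 128:
            # ASCII: keep letters, digits, and underscore.
            keep = c == "_" or c.isalnum()
        elif cp < 256:
            # Latin-1 Supplement: keep letters, drop 128..191, 215, 247.
            keep = cp >= 192 and cp not in {215, 247}
        else:
            # Assumption: characters beyond Latin-1 are left as they are.
            keep = True
        out.append(c if keep else "_")
    return "".join(out)


print(sanitize_stata_name("mø"))   # the column name from the example
print(sanitize_stata_name("a×b"))  # the multiplication sign is replaced
```

With this rule, the column from the example above would keep its name instead of becoming `m_`.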
pandas : 1.4.2
numpy : 1.22.4
pytz : 2022.1
dateutil : 2.8.2
pip : 22.0.2
setuptools : 59.6.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : 3.5.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 8.0.0
pyreadstat : 1.1.6
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None