This came out of work on #10729.

In the documentation, we mention that:

> There are two ways a np.nan can be represented in categorical data: either the value is not available (“missing value”) or np.nan is a valid category.
In the first case, NaN is not in .categories, and in the second case it is. I think we should only recommend the first.
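
As a minimal sketch of the first representation (assuming a recent pandas release, where a missing value is stored with the sentinel code -1; the variable name `cat` is only illustrative):

```python
import numpy as np
import pandas as pd

# NaN appears among the values but is deliberately left out of the categories.
cat = pd.Categorical(["a", "b", "a", np.nan], categories=["a", "b"])

print(cat.categories)  # only the real categories are listed
print(cat.codes)       # the missing entry is encoded with the sentinel code -1
print(cat.isna())      # missing entries can still be detected
```

With this representation, code that needs to find missing values can check for a code of -1 without also inspecting the categories.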
The option of NaNs in the categories makes the code in #10729 less pleasant than it would be otherwise. I don't think we should error if NaNs are included, just advise against it in the docs. Perhaps a deprecation, but I worry that I'm missing some obvious reason why NaNs were allowed in .categories.
@JanSchulz do you remember the initial reason for allowing either representation?
Some bad things that come out of NaN in .categories:
- nan no longer maps to a code of -1 (it gets code 2 here):

      In [2]: s = pd.Categorical(['a', 'b', 'a', np.nan], categories=['a', 'b', np.nan])

      In [3]: s
      Out[3]:
      [a, b, a, NaN]
      Categories (3, object): [a, b, NaN]

      In [4]: s.categories
      Out[4]: Index(['a', 'b', nan], dtype='object')

      In [5]: s.codes
      Out[5]: array([0, 1, 0, 2], dtype=int8)
- nan in the .categories Index.
- NaN in categories.