
Showing content from https://github.com/pandas-dev/pandas/issues/10748 below:

Deprecate and Advise against having `np.nan` in Categoricals · Issue #10748 · pandas-dev/pandas · GitHub

This came out of work on #10729

In the documentation, we mention that

There are two ways a np.nan can be represented in categorical data: either the value is not available (“missing value”) or np.nan is a valid category.

In the first case, NaN is not in .categories, and in the second case it is. I think we should only recommend the first.
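For reference, a minimal sketch of the first representation (the one I'd recommend). The variable names are only illustrative; the point is that NaN stays out of .categories and shows up as code -1:

```python
import numpy as np
import pandas as pd

# NaN is a value but NOT a category: it is treated as a missing value.
cat = pd.Categorical(["a", "b", "a", np.nan], categories=["a", "b"])

# The missing value gets the sentinel code -1.
codes = cat.codes.tolist()

# .categories holds only the real labels, so it stays a plain string Index.
labels = list(cat.categories)

# Missing values can be located without inspecting .categories at all.
missing = pd.isna(cat).tolist()

print(codes)    # [0, 1, 0, -1]
print(labels)   # ['a', 'b']
print(missing)  # [False, False, False, True]
```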

The option of NaNs in the categories makes the code in #10729 less pleasant than it would be otherwise. I don't think we should error if NaNs are included, just advise against it in the docs. Perhaps a deprecation, but I worry that I'm missing some obvious reason why NaNs were allowed in .categories.

@JanSchulz do you remember the initial reason for allowing either representation?

Some bad things that come out of NaN in .categories:

Can't rely on a value of nan mapping to a code of -1:

In [2]: s = pd.Categorical(['a', 'b', 'a', np.nan], categories=['a', 'b', np.nan])

In [3]: s
Out[3]:
[a, b, a, NaN]
Categories (3, object): [a, b, NaN]

In [4]: s.categories
Out[4]: Index(['a', 'b', nan], dtype='object')

In [5]: s.codes
Out[5]: array([0, 1, 0, 2], dtype=int8)

Potentially have to upcast the index type or mix strings and floats (nan) in the .categories Index.

Extra code if you want to generically handle Categoricals that may or may not have NaN in categories.
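To make the last two points concrete, here is a rough sketch of the kind of extra handling I mean. `missing_mask` is a hypothetical helper, not anything in pandas. It works on raw codes and categories arrays rather than on a Categorical, on the assumption that some pandas versions may refuse to construct a Categorical with NaN in its categories:

```python
import numpy as np
import pandas as pd

def missing_mask(codes, categories):
    """Hypothetical helper: True where a value is missing under EITHER
    representation (code == -1, or a code pointing at a NaN category)."""
    codes = np.asarray(codes)
    cats = np.asarray(categories, dtype=object)
    mask = codes == -1                        # representation 1: sentinel code
    nan_positions = np.flatnonzero(pd.isna(cats))
    if nan_positions.size:                    # representation 2: NaN is a category
        mask |= np.isin(codes, nan_positions)
    return mask

# Same data, two encodings (the second mirrors the session above).
mask_a = missing_mask([0, 1, 0, -1], ["a", "b"])
mask_b = missing_mask([0, 1, 0, 2], ["a", "b", np.nan])

# The upcast issue: adding NaN to integer categories forces a float Index.
int_kind = pd.Index([1, 2]).dtype.kind            # integer dtype
nan_kind = pd.Index([1, 2, np.nan]).dtype.kind    # float dtype
```

If NaN were never allowed in .categories, the second branch would go away and `codes == -1` would be enough on its own.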
