A summary of this behavior, and of the consensus so far that numeric_only will default to False for DataFrameGroupBy in 2.0, can be found here: #42395 (comment).
In #41475, the silent dropping of nuisance columns was deprecated.
In #43154, the behavior was changed so that when numeric_only is unspecified and subsetting to numeric-only columns would leave the DataFrame empty, pandas internally treats numeric_only as False.
Even though there is consensus that numeric_only should default to False, because of the above changes I wanted to make sure there is a consensus on how to go about doing so before proceeding.
For the discussion below, it is useful to have three types of columns in mind:
- Columns that are included with numeric_only=True.
- Columns that are not included with numeric_only=True but can still be successfully aggregated; e.g. strings with sum.
- Columns that are not included with numeric_only=True and cannot be successfully aggregated; e.g. object.

To investigate this on 1.4.x, I have been using the following code. In this code, I am using .sum(). However, any reduction or transform, whether given as a string or a callable, should have the same behavior (though that is not the case today). This includes apply and using axis=1 (for which you may want to tilt your head 90 degrees to the left).
import itertools as it
import warnings

import pandas as pd
from pandas._libs import lib

# One sample column for each of the three column types described above.
numeric = [1, 1]
nonnumeric_noagg = [object, object]
nonnumeric_agg = ["2", "2"]

# Try every combination of column types against every numeric_only setting.
for has_numeric, has_nonnumeric_agg, has_nonnumeric_noagg in it.product([True, False], repeat=3):
    for numeric_only in [True, False, lib.no_default]:
        print(has_numeric, has_nonnumeric_agg, has_nonnumeric_noagg, numeric_only)
        df = pd.DataFrame({"A": [1, 1]})
        if has_numeric:
            df["B"] = numeric
        if has_nonnumeric_agg:
            df["C"] = nonnumeric_agg
        if has_nonnumeric_noagg:
            df["D"] = nonnumeric_noagg
        warning_msg = ""
        try:
            with warnings.catch_warnings(record=True) as w:
                result = df.groupby("A").sum(numeric_only=numeric_only)
            # At most one warning is expected, and it must be a FutureWarning.
            if len(w) > 0:
                assert len(w) == 1
                assert issubclass(w[-1].category, FutureWarning)
                warning_msg = str(w[-1].message)
        except TypeError:
            print(" TypeError")
        else:
            print(" Columns:", result.columns.tolist(), "Warning:", warning_msg[:20])
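As a rough illustration of the three column types described above, the sketch below labels each column of a small frame. The helper name classify_column is hypothetical, and it rests on two assumptions that are not taken from pandas internals: that a numeric or boolean dtype approximates what numeric_only=True keeps, and that a plain Series.sum() raising TypeError is a reasonable proxy for "cannot be aggregated".

```python
import pandas as pd


def classify_column(series: pd.Series) -> str:
    """Label a column as 'numeric', 'nonnumeric_agg' or 'nonnumeric_noagg'."""
    # Assumption: a numeric or boolean dtype approximates "kept by numeric_only=True".
    if pd.api.types.is_numeric_dtype(series.dtype):
        return "numeric"
    # Assumption: a TypeError from Series.sum() stands in for "cannot be aggregated".
    try:
        series.sum()
    except TypeError:
        return "nonnumeric_noagg"
    return "nonnumeric_agg"


# Same sample values as the investigation code; object dtype is forced so the
# string column behaves the same way regardless of the default string dtype.
df = pd.DataFrame(
    {
        "B": [1, 1],
        "C": pd.Series(["2", "2"], dtype=object),
        "D": pd.Series([object, object], dtype=object),
    }
)
labels = {name: classify_column(df[name]) for name in df.columns}
print(labels)
```

This only checks a Series-level sum; a groupby reduction may treat the same column differently depending on the pandas version, which is the inconsistency this issue is about.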
Current and Future behavior
numeric_only=True
Current behavior appears entirely correct and will go unchanged in 1.5/2.0. In particular, when there are no numeric columns in the input, the output is empty as well.
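A minimal sketch of the numeric_only=True case, using the same column names as the investigation code. The variable names are illustrative only, and the result is what I would expect on the versions discussed here rather than a guarantee for every release.

```python
import pandas as pd

# Only the grouping key "A" and a string column "C": nothing numeric to aggregate.
no_numeric = pd.DataFrame({"A": [1, 1], "C": ["2", "2"]})
empty_result = no_numeric.groupby("A").sum(numeric_only=True)

# Adding a numeric column "B" leaves exactly that column in the output.
with_numeric = no_numeric.assign(B=[1, 1])
numeric_result = with_numeric.groupby("A").sum(numeric_only=True)

print(empty_result.columns.tolist())
print(numeric_result.columns.tolist())
```

The first result keeps the group index but has no columns, which is the "empty output" referred to above.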
numeric_only=False
Current behavior appears entirely correct, in that if there are to be any behavior changes in 2.0, we already emit the appropriate FutureWarning today. The only case where there will be a behavior change from 1.4.x to 2.0 is if the frame contains a nonnumeric column that can't be aggregated. 1.4.x drops the column whereas 2.0 will raise a TypeError.
numeric_only unspecified (lib.no_default)
I'll refer to the columns as in the code above:
Columns ['B', 'C', 'D']
numeric_only defaulting to False in 2.0.
Columns ['B', 'C']
numeric_only defaulting to False in 2.0.
Columns ['B', 'D']
numeric_only defaulting to False in 2.0.
Columns ['C', 'D']
Columns ['C']
numeric_only as True.
Columns ['D']
cc @jreback @jbrockmendel @jorisvandenbossche @simonjayhawkins @Dr-Irv