API: use "safe" casting by default in astype()

(note: this has been partly discussed as part of #22384, but I'm opening a dedicated issue here (it's also not limited to extension types); what follows is an attempt to summarize the discussion so far and to provide some more context and examples)

Context

In general, pandas can currently perform silent "unsafe" casting in several cases, both in the constructor (eg Series(.., dtype=..)) and in the explicit astype(..) call.
One typical case is the silent integer overflow in the following example:

>>> pd.Series([1000], dtype="int64").astype("int8")
0   -24
dtype: int8

While I am using the terms "safe" and "unsafe" here, those are not exactly well defined. In the context of this issue, I mean "value / information preserving" or "roundtripping".
In that context, the cast from 1000 to -24 is clearly not a value-preserving or roundtrippable conversion. In contrast, a cast from the float 2.0 to the integer 2 is information preserving (except for the exact type) and roundtrippable. The conversion from Timestamp("2012-01-01") to the string "2012-01-01" can also be considered as such (although those actual values don't evaluate equal).
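To make the roundtripping idea concrete (plain pandas calls, just for illustration):

>>> pd.Series([2.0], dtype="float64").astype("int64").astype("float64")  # value preserving
0    2.0
dtype: float64

>>> pd.Series([pd.Timestamp("2012-01-01")]).astype(str)
0    2012-01-01
dtype: object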

There are a few cases of "unsafe" casting where you can potentially get silently wrong values. I currently think of the following cases (are there others in pandas?):

- Integer overflow (casting to a smaller bit-width or different signed-ness)
- Float truncation (casting float to int, or large int to float)
- Timestamp overflow (casting to a finer resolution with out-of-bounds values)
- Timestamp truncation (casting to a coarser resolution)
- NA / NaN conversion (casting data with missing values to a dtype that cannot hold them)

At the bottom of this post, I give a concrete explanation and examples for each of those cases.

Numpy has a concept of "casting" levels for how permissive data conversions are allowed to be (eg the casting keyword in ndarray.astype), with possible values of "no", "equiv", "safe", "same_kind", "unsafe".
However, I don't think that translates very well to pandas. In numpy, those casting levels are pre-defined for all combinations of data types, while the cases of unsafe casting mentioned above depend on the actual values, not strictly the dtypes.

For example, casting int64 to int8 is considered "unsafe" in numpy ("same_kind" to be precise, so not "safe"). But if all your int64 integers are actually within the int8 range, doing this cast is safe in practice (at runtime), so IMO we shouldn't raise an error about it by default.
On the other hand, casting int64 to float64 is considered "safe" by numpy, but in practice you can have very large integers that cannot actually be faithfully cast to float. Similarly, casting datetime64[s] to datetime64[ns] is also considered safe by numpy, but you can have out-of-bounds values that won't fit in the nanosecond range in practice.
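To illustrate with plain numpy calls how the dtype-level classification can diverge from what happens at runtime:

>>> np.can_cast("int64", "int8", casting="safe")
False
>>> np.can_cast("int64", "float64", casting="safe")
True
>>> # yet a "safe" int64 -> float64 cast can still lose information:
>>> np.array([2**53 + 1], dtype="int64").astype("float64").astype("int64")
array([9007199254740992])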

Therefore, I think for pandas it's more useful to look at the "safety at run-time": don't decide upfront about safe vs unsafe casts based on the dtypes, but handle runtime errors (out-of-bounds values, values that would overflow or get truncated, etc). This way, I would only consider two cases (a minimal sketch of such a runtime check follows after the notes below):

  1. Casts that are simply not supported and will directly raise a TypeError.
    (e.g. pandas (in contrast to numpy) disallows casting datetime64 to timedelta64)
  2. Casts that are generally supported, but could result in an unsafe cast / raise a ValueError during execution depending on the actual values.

Note 1: this is basically the current situation in pandas, except that for the supported casts we don't have a consistent rule about cast safety or a consistent way to deal with it (i.e. that is what this issue is about)

Note 2: we can also have a lot of discussion about which casts to allow and which not (eg do we want to support casting datetime to int? -> #45034). But let's keep those cases for separate issues, and focus the discussion here on the cast safety aspect for casts we clearly agree are supported.
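To sketch what such a value-based (runtime) check could look like, here is a minimal roundtrip-based helper. This is just an illustration: is_value_safe_cast is a hypothetical name, not pandas API, and a real implementation would use cheaper per-dtype checks instead of a full roundtrip:

import numpy as np

def is_value_safe_cast(values, dtype):
    """Return True if casting ``values`` to ``dtype`` roundtrips losslessly."""
    try:
        cast = values.astype(dtype)
    except (TypeError, ValueError):
        return False
    # Cast back and compare: any value changed by overflow or truncation
    # makes the cast unsafe, regardless of what the dtypes promise.
    return np.array_equal(cast.astype(values.dtype), values)

is_value_safe_cast(np.array([1, 2], dtype="int64"), "int8")   # True
is_value_safe_cast(np.array([1000], dtype="int64"), "int8")   # False (overflow)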

Proposal

The proposal is to move towards safe casting by default in pandas, consistently in both the constructor and the explicit astype.

Quoting @TomAugspurger (#22384 (comment)), who proposes to agree on a couple of principles and work from those:

  1. pandas should be consistent with itself between Series(values, dtype=dtype) and values.astype(dtype=dtype).
  2. pandas should by default error at runtime when type casting causes loss of equality / information (integer overflow, float -> int truncation, ...), with the option to disable that check (since it will be expensive).

I am including the first point because this "safe casting or not" issue is relevant for both the constructors and astype. But I would keep the practical aspect of this point (how we achieve this consistency in the code) for a separate discussion, and keep the focus here on the principle and on the second point about the default safety.
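Purely as an illustration of what principle 2 could mean in practice (the keyword name and error message below are made up for this example; nothing of that is decided here):

>>> pd.Series([1000], dtype="int64").astype("int8")
...
ValueError: cannot losslessly cast values to int8 (integer overflow)

>>> # hypothetical opt-out of the (potentially expensive) value check
>>> pd.Series([1000], dtype="int64").astype("int8", check_safety=False)
0   -24
dtype: int8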

Some assorted general considerations / questions:

cc @pandas-dev/pandas-core

Concrete examples

Integer overflow

This can happen when casting to different bit-width or signed-ness. Generally, in astype, we don't check for this and silently overflow (following numpy behaviour):

>>> pd.Series([1000], dtype="int64").astype("int8")
0   -24
dtype: int8

In the Series constructor, we already added a deprecation warning about changing this in the future:

>>> pd.Series([1000], dtype="int8")
FutureWarning: Values are too large to be losslessly cast to int8. In a future version this
will raise OverflowError. To retain the old behavior, use pd.Series(values).astype(int8)
0   -24
dtype: int8

Another example casting a negative number to unsigned integer:

>>> pd.Series([-1000], dtype="int64").astype("uint64")
0    18446744073709550616
dtype: uint64
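As an aside, such an overflow check at runtime can be done cheaply by comparing the values against the bounds of the target dtype. A minimal sketch (not pandas internals; check_int_bounds is a made-up helper):

import numpy as np

def check_int_bounds(values, dtype):
    # Compare the actual values against the representable range of the
    # target integer dtype instead of classifying by dtype alone.
    info = np.iinfo(dtype)
    if values.min() < info.min or values.max() > info.max:
        raise OverflowError(f"values do not fit in {np.dtype(dtype)}")

check_int_bounds(np.array([1000], dtype="int64"), "int8")     # raises OverflowError
check_int_bounds(np.array([-1000], dtype="int64"), "uint64")  # raises OverflowError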

Float truncation

This typically happens when casting floats to integer when your floating point numbers are not fully rounded. Following numpy, the behaviour of our astype and constructors is to truncate the floats:

>>> pd.Series([0.5, 1.5], dtype="float64").astype("int64")
0    0
1    1
dtype: int64

Many might find this the expected behaviour, but I want to point out that it can actually be better to explicitly round/ceil/floor first, as this "truncation" is not the same as rounding (which I think users would naively expect). For example, round gives different numbers than the astype shown above:

>>> pd.Series([0.5, 1.5], dtype="float64").round()
0    0.0
1    2.0
dtype: float64
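An explicit round-then-cast makes that intent unambiguous and gives the rounded values users likely expect:

>>> pd.Series([0.5, 1.5], dtype="float64").round().astype("int64")
0    0
1    2
dtype: int64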

In the constructor, when not starting from a numpy array, we actually already raised an error for float truncation in older versions (on master this seems to ignore the dtype and give float as the result):

>>> pd.Series([1.0, 2.5], dtype="int64") 
...
ValueError: Trying to coerce float values to integers

The truncation can also happen in the cast the other way around, from integer to float: large integers cannot always be faithfully represented in the float range. For example:

>>> pd.Series([1_100_100_100_100], dtype="int64").astype("float32")
0    1.100100e+12
dtype: float32

# the repr above is not clear about the truncation, but casting back to integer shows it
>>> pd.Series([1_100_100_100_100], dtype="int64").astype("float32").astype("int64")
0    1100100141056
dtype: int64
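A roundtrip comparison makes this loss visible programmatically (plain pandas calls, for illustration):

>>> s = pd.Series([1_100_100_100_100], dtype="int64")
>>> s.astype("float32").astype("int64").equals(s)
False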

Timestamp overflow

Numpy is known to silently overflow for out-of-bounds timestamps when casting to a different resolution, eg:

>>> np.array(["2300-01-01"], dtype="datetime64[s]").astype("datetime64[ns]")
array(['1715-06-13T00:25:26.290448384'], dtype='datetime64[ns]')

We already check for this, and e.g. raise in the constructor:

>>> pd.Series(np.array(["2300-01-01"], dtype="datetime64[s]"), dtype="datetime64[ns]")
...
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2300-01-01 00:00:00
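For context, the nanosecond resolution bounds the representable range to roughly the years 1677 to 2262:

>>> pd.Timestamp.min, pd.Timestamp.max
(Timestamp('1677-09-21 00:12:43.145224193'), Timestamp('2262-04-11 23:47:16.854775807'))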

When we support multiple resolutions, this will also apply to astype.
(and the same also applies to timedelta data)

Timestamp truncation

Related to the above, but now going to a coarser resolution, where you can lose information. Numpy will also silently truncate in this case:

>>> np.array(["2022-01-01 00:00:00.01"], dtype="datetime64[ns]").astype("datetime64[s]")
array(['2022-01-01T00:00:00'], dtype='datetime64[s]')

In pandas you can see a similar behaviour (the result is truncated, but the return value still has nanosecond resolution):

>>> pd.Series(["2022-01-01 00:00:00.01"], dtype="datetime64[ns]").astype("datetime64[s]")
0   2022-01-01
dtype: datetime64[ns]

When we support multiple resolutions, this will become more relevant. And similar to the float truncation above, it might be more explicit to round/ceil/floor first (see the example below).
(and the same also applies to timedelta data)
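For example, explicitly flooring to seconds with the existing .dt.floor accessor makes the intended loss of the sub-second part explicit:

>>> pd.Series(["2022-01-01 00:00:00.01"], dtype="datetime64[ns]").dt.floor("s")
0   2022-01-01
dtype: datetime64[ns]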

Sidenote: something similar can happen for Period data, but there we don't support rounding as an alternative.

NA / NaN conversion

One additional case that is somewhat pandas-specific, because not all dtypes support missing values, is casting data with missing values to integer dtype (not sure if there are actually other dtypes affected?).

Again, numpy silently gives wrong numbers:

>>> np.array([1.0, np.nan], dtype="float64").astype("int64")
array([                   1, -9223372036854775808])

In pandas, in most cases, we actually already have safe casting for this case, and raise an error. For example:

>>> pd.Series([1.0, np.nan], dtype="float64").astype("int64") 
...
IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

There are some cases, however, where we still silently convert the NaN / NaT to a number:

>>> pd.array(["2012-01-01", "NaT"], dtype="datetime64[ns]").astype("int64") 
array([ 1325376000000000000, -9223372036854775808])

Note that this is actually broader than NaN in the float->int case, as we also get the same error when casting inf to int. So also for the nullable integer dtype (which can hold NA but not inf), the "non-finite" values are still a relevant case.
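For completeness: casting to the nullable Int64 dtype already provides a value-preserving alternative for NaN (though not for inf):

>>> pd.Series([1.0, np.nan], dtype="float64").astype("Int64")
0       1
1    <NA>
dtype: Int64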

