Showing content from https://github.com/pandas-dev/pandas/issues/49473 below:

CoW: Use the "lazy copy" (with Copy-on-Write) optimization in more methods where appropriate · Issue #49473 · pandas-dev/pandas · GitHub

With the Copy-on-Write implementation (see #36195 / proposal described in more detail in https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit, and overview follow up issue #48998), we can avoid doing an actual copy of the data in DataFrame and Series methods that typically return a copy / new object.
A typical example is the following:

df2 = df.rename(columns=str.lower)

By default, the rename() method returns a new object (DataFrame) with a copy of the data of the original DataFrame (and thus, mutating values in df2 never mutates df). With CoW enabled (pd.options.mode.copy_on_write = True), we can still return a new object, but one that points to the same data under the hood, avoiding the initial copy. The observed behaviour is preserved: df2 still acts as a copy, and mutating df2 does not mutate df. This works through the CoW mechanism, which only copies the data in df2 when actually needed upon mutation, i.e. a delayed or lazy copy.
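
As a rough illustration of the behaviour described above, the following sketch checks memory sharing before and after a mutation. It assumes a pandas version in which the `mode.copy_on_write` option exists (1.5 or later) and uses `np.shares_memory` purely as a diagnostic; the frame contents and variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# Assumption: pandas >= 1.5, where this option is available.
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df2 = df.rename(columns=str.lower)

# Right after rename(), df2 is expected to point at the same buffer as df.
shares_before = np.shares_memory(df["A"].to_numpy(), df2["a"].to_numpy())

# Mutating df2 triggers the delayed copy; df must stay untouched.
df2.loc[0, "a"] = 100

shares_after = np.shares_memory(df["A"].to_numpy(), df2["a"].to_numpy())
original_value = int(df.loc[0, "A"])
mutated_value = int(df2.loc[0, "a"])

print(shares_before, shares_after, original_value, mutated_value)
```

With CoW active, the data should be shared only until the write to df2, at which point df2 gets its own copy.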

The way this is done in practice for a method like rename() or reset_index() is by using the fact that copy(deep=None) will mean a true deep copy (current default behaviour) if CoW is not enabled, and this "lazy" copy when CoW is enabled. For example:

if inplace:
    new_obj = self
else:
    new_obj = self.copy(deep=None)
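
To make the pattern concrete without touching pandas internals, here is a toy sketch in plain Python. Everything in it is hypothetical: `ToyFrame`, the `COW_ENABLED` flag (a stand-in for the pandas option), and the shared reference counter are illustrative only and do not reflect how pandas tracks references.

```python
import copy

COW_ENABLED = True  # hypothetical stand-in for pd.options.mode.copy_on_write


class ToyFrame:
    """Hypothetical miniature container; not pandas' real implementation."""

    def __init__(self, columns, values):
        self.columns = list(columns)
        self._values = values  # list of column lists, possibly shared
        self._refs = [1]       # shared counter of objects using _values

    def copy(self, deep=True):
        # deep=None: true deep copy without CoW, lazy copy with CoW.
        if deep is None:
            deep = not COW_ENABLED
        if deep:
            return ToyFrame(self.columns, copy.deepcopy(self._values))
        new_obj = ToyFrame(self.columns, self._values)
        self._refs[0] += 1
        new_obj._refs = self._refs
        return new_obj

    def rename(self, func, inplace=False):
        if inplace:
            new_obj = self
        else:
            new_obj = self.copy(deep=None)
        # Only the labels change; the values are left as they are.
        new_obj.columns = [func(name) for name in new_obj.columns]
        return None if inplace else new_obj

    def set_value(self, row, column, value):
        # Copy-on-write: detach from shared data before mutating.
        if self._refs[0] > 1:
            self._refs[0] -= 1
            self._values = copy.deepcopy(self._values)
            self._refs = [1]
        self._values[self.columns.index(column)][row] = value

    def get_value(self, row, column):
        return self._values[self.columns.index(column)][row]


df = ToyFrame(["A", "B"], [[1, 2, 3], [4, 5, 6]])
df2 = df.rename(str.lower)
shared_after_rename = df2._values is df._values

df2.set_value(0, "a", 100)
shared_after_write = df2._values is df._values
```

The method body stays identical whether or not CoW is on; only the meaning of `copy(deep=None)` changes, which is why the same two-branch pattern can be reused across many methods.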

The initial CoW implementation in #46958 only added this logic to a few methods (to ensure this mechanism was working): rename, reset_index, reindex (when reindexing the columns), select_dtypes, to_frame and copy itself.
But there are more methods that can make use of this mechanism, and this issue is meant to serve as the overview issue to summarize and keep track of the progress on this front.

There is a class of methods that perform an actual operation on the data and return newly calculated data (eg typically reductions or the methods wrapping binary operators) that don't have to be considered here. It's only methods that can (potentially, in certain cases) return the original data that could make use of this optimization.
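
One informal way to tell the two classes of methods apart is to check whether a result still shares memory with its input. The helper below is a hypothetical diagnostic (the name `shares_data` is not a pandas API), assuming pandas 1.5 or later with CoW enabled and frames with the same number of columns.

```python
import numpy as np
import pandas as pd

# Assumption: pandas >= 1.5, where this option is available.
pd.options.mode.copy_on_write = True


def shares_data(left, right):
    """Hypothetical helper: True if any positionally matching column shares memory."""
    return any(
        np.shares_memory(left.iloc[:, i].to_numpy(), right.iloc[:, i].to_numpy())
        for i in range(left.shape[1])
    )


df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# rename() only changes labels, so it can hand back the original data lazily.
renamed_shares = shares_data(df, df.rename(columns=str.lower))

# A binary operation computes new values, so there is nothing to share.
computed_shares = shares_data(df, df * 2)

print(renamed_shares, computed_shares)
```

A method for which such a check could report shared data in at least some cases is a candidate for the lazy copy; one that always computes new values is not.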

Series / DataFrame methods to update (I added a ? for the ones I wasn't directly sure about, have to look into what those exactly do to be sure, but left them here to keep track of those, can remove from the list once we know more):

Top-level functions:

Want to contribute to this issue?

Pull requests tackling one of the bullet points above are certainly welcome!

