centering data with dask.array · Issue #874 · dask/dask · GitHub

A common use case for many modeling problems (e.g., in machine learning or climate science) is to center data by subtracting an average of some kind over a given axis. The dask scheduler currently falls flat on its face when attempting to schedule these types of problems.

Here's a simple example of such a failure case:

import dask.array as da
x = da.ones((8, 200, 200), chunks=(1, 200, 200))  # e.g., a large stack of image data
mad = abs(x - x.mean(axis=0)).mean()
mad.visualize()

The scheduler will keep each of the initial chunks in memory after using them to compute the mean, because they will be used again later as arguments to sub. To avoid blowing up memory, the appropriate way to handle this graph would instead be to compute the initial chunks twice.
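A manual workaround along these lines is to materialize the intermediate mean eagerly, so that the second pass over x loads (or recomputes) its chunks instead of pinning them all in memory. A minimal sketch, assuming only the public dask.array API:

```python
import dask.array as da

x = da.ones((8, 200, 200), chunks=(1, 200, 200))

# Materialize the mean up front; this is now a concrete numpy array,
# so the chunks of x used to compute it can be released.
mean = x.mean(axis=0).compute()

# Build a fresh graph against the concrete mean. The chunks of x are
# produced a second time here rather than held over from the first pass.
mad = abs(x - mean).mean()
result = mad.compute()  # 0.0 for this all-ones example
```

This trades a second pass over the data for bounded memory, which is exactly the scheduling decision the issue argues dask should be able to make by default.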

I know that in principle this could be avoided by using an on-disk cache. But that seems wasteful, because the initial values are often already sitting in a file on disk anyway.

This is a pretty typical use case for dask.array (one of the first things people try with xray), so it's worth seeing if we can come up with a solution that works by default.
