A common use case for many modeling problems (e.g., in machine learning or climate science) is to center data by subtracting an average of some kind over a given axis. The dask scheduler currently falls flat on its face when attempting to schedule these types of problems.
Here's a simple example of such a failing case:

```python
import dask.array as da

x = da.ones((8, 200, 200), chunks=(1, 200, 200))  # e.g., a large stack of image data
mad = abs(x - x.mean(axis=0)).mean()
mad.visualize()
```
The scheduler will keep each of the initial chunks in memory after using them to compute the mean, because each chunk is needed again later as an argument to `sub`. In contrast, the appropriate way to handle this graph without blowing up memory would be to compute the initial chunks twice.
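Today this behavior can be forced by hand: compute the mean eagerly so that the subtraction's graph re-reads the source chunks rather than pinning them. This is only a minimal sketch of the manual workaround, not a scheduler fix; it makes each chunk of `x` get loaded (or recomputed) twice, which is exactly what we'd want the scheduler to do by default:

```python
import dask.array as da

x = da.ones((8, 200, 200), chunks=(1, 200, 200))

# Compute the mean eagerly; the result is a small in-memory numpy array.
mean = x.mean(axis=0).compute()

# This graph re-reads each chunk of x instead of holding it as an
# argument for `sub`, so peak memory stays at a few chunks.
mad = abs(x - mean).mean()
```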
I know that in principle this could be avoided by using an on-disk cache. But that seems like a waste, because the initial values are often already sitting in a file on disk anyway.
This is a pretty typical use case for dask.array (one of the first things people try with xray), so it's worth seeing if we can come up with a solution that works by default.