tsflex.chunking
Utilities for chunking time-series data before feeding it to the operators.
"""Utilities for chunking time-series data before feeding it to the operators.
"""
__author__ = "Jonas Van Der Donckt"
from .chunking import chunk_data
__all__ = ["chunk_data"]
.chunking
(Advanced) tsflex utilities for chunking sequence data.
def chunk_data(data, fs_dict=None, chunk_range_margin=None, min_chunk_dur=None, max_chunk_dur=None, sub_chunk_overlap=None, copy=True, verbose=False)
def chunk_data(
    data: Union[
        pd.Series,
        pd.DataFrame,
        List[Union[pd.Series, pd.DataFrame]],
        Dict[str, pd.DataFrame],
    ],
    fs_dict: Optional[Dict[str, float]] = None,
    chunk_range_margin: Optional[Union[float, str, pd.Timedelta]] = None,
    min_chunk_dur: Optional[Union[float, str, pd.Timedelta]] = None,
    max_chunk_dur: Optional[Union[float, str, pd.Timedelta]] = None,
    sub_chunk_overlap: Optional[Union[float, str, pd.Timedelta]] = None,
    copy: bool = True,
    verbose: bool = False,
) -> List[List[pd.Series]]:
    """Divide the time-series `data` into same time/sequence-range chunks.

    Does 2 things:

    1. Detects gaps in the `data`(-list) sequence series.
    2. Divides the `data` into chunks, according to the parameter
       configuration and the detected gaps.

    Notes
    -----
    * When you set `fs_dict`, the assumption is made that **each item** in `data`
      has a **fixed sample frequency**. If you do not set `fs_dict`, it is
      populated with 1 / max time-diff of each series as key-value pairs.
    * All subsequent series-chunks are matched against the time-ranges of the first
      series. This implies that **the first item in `data` serves as a reference**
      for gap-matching.
    * The term `sub-chunk` refers to the chunks that exceed the `max_chunk_dur`
      parameter and are therefore further divided into sub-chunks.

    Example
    -------
    ```python
    df_acc  # cols ['ACC_x', 'ACC_y', 'ACC_z', 'ACC_SMV'] - 32 Hz
    df_gyro  # cols ['gyro_x', 'gyro_y', 'gyro_z', 'gyro_area'] - 100 Hz
    chunk_data({'acc': df_acc, 'g': df_gyro}, fs_dict={'acc': 32, 'g': 100})
    ```
    <br>

    .. Note::
        If `chunk_range_margin` / `min_chunk_dur` / `max_chunk_dur` /
        `sub_chunk_overlap` is an int/float, it will be interpreted as a numerical
        sequence range and numerically-indexed `data` will be assumed.
        **These arguments must all be either time-based or numerical and must
        match the dtype of the `data` index.**

    Parameters
    ----------
    data : Union[pd.Series, pd.DataFrame, List[Union[pd.Series, pd.DataFrame]], Dict[str, pd.DataFrame]]
        The sequence data which will be chunked. Each item in `data` must have a
        monotonically increasing index. We assume that each item in `data`
        has a _nearly-constant_ sample frequency (when there are no gaps) and that
        all indices have the same dtype.
    fs_dict : Dict[str, float], optional
        The sample frequency dict. If set, this dict must contain at least a key
        for every item in `data`.

        .. note::
            If you pass a **_DataFrame-dict_** (i.e., a dict with key=str;
            value=DataFrame) to `data`, then you can **use** the **corresponding
            DataFrame str-key** in `fs_dict` to describe the `fs` of all columns
            of that DataFrame. See also the example above.
    chunk_range_margin : Union[float, str, pd.Timedelta], optional
        The allowed margin on the start and end time of each `ts` chunk for it to
        be considered the same time-range chunk as those of the other `ts`.
        If `None`, the margin will be set as:

            2 / min(fs_dict.intersection(data.names).values())

        which is equivalent to twice the largest sample period (i.e., the period
        of the lowest sample frequency) of the passed `data`, by default None.\n
        * if `pd.Timedelta`, it will be interpreted as a time-range margin
        * if `int` or `float`, it will be interpreted as a numerical range margin
    min_chunk_dur : Union[float, str, pd.Timedelta], optional
        The minimum duration of a chunk, by default None.
        Chunks with durations smaller than this will be discarded (and not returned).\n
        * if `pd.Timedelta`, it will be interpreted as a time-range duration
        * if `int` or `float`, it will be interpreted as a numerical range duration
    max_chunk_dur : Union[float, str, pd.Timedelta], optional
        The maximum duration of a chunk, by default None.
        Chunks with durations larger than this will be split into smaller
        `sub_chunks`, each with a maximum duration of `max_chunk_dur`.\n
        * if `pd.Timedelta`, it will be interpreted as a time-range duration
        * if `int` or `float`, it will be interpreted as a numerical range duration
    sub_chunk_overlap : Union[float, str, pd.Timedelta], optional
        The sub-chunk boundary overlap. If set, **this
        margin / 2 will be added to either side of the `sub_chunk`**.
        This is especially useful to avoid losing inter-`sub_chunk` data (as each
        `sub_chunk` is in fact a continuous chunk) when window-based aggregations
        are performed on these same time range output (sub_)chunks.
        This argument is only relevant if `max_chunk_dur` is set.\n
        * if `pd.Timedelta`, it will be interpreted as a time-range overlap
        * if `int` or `float`, it will be interpreted as a numerical range overlap
    copy : bool, optional
        If True, a copy of the chunked data is returned (on which you won't get a
        `SettingWithCopyWarning` if you change the content), by default True.
    verbose : bool, optional
        If set, will print more verbose output, by default False.

    Returns
    -------
    List[List[pd.Series]]
        A list of same time range chunks.

    """
    if isinstance(data, dict):
        if isinstance(fs_dict, dict):
            # Propagate the `fs` of each DataFrame key to all columns of that DataFrame
            out_dict = {}
            for k, fs in fs_dict.items():
                if k in data and isinstance(data[k], pd.DataFrame):
                    out_dict.update({c_name: fs for c_name in data[k].columns})
            fs_dict.update(out_dict)
        # make `data` convertible by `to_series_list`
        data = list(data.values())

    # Convert the input data
    series_list = to_series_list(data)

    # Assert that there are no duplicate series names
    assert len(series_list) == len(set([s.name for s in series_list]))
    # Assert that the index increases monotonically
    assert all(s.index.is_monotonic_increasing for s in series_list)

    return _dtype_to_chunk_method[AttributeParser.determine_type(data)](
        series_list,
        fs_dict,
        chunk_range_margin,  # type: ignore[arg-type]
        min_chunk_dur,  # type: ignore[arg-type]
        max_chunk_dur,  # type: ignore[arg-type]
        sub_chunk_overlap,  # type: ignore[arg-type]
        copy,
        verbose,
    )
Divide the time-series `data` into same time/sequence-range chunks.

Does 2 things:

1. Detects gaps in the `data`(-list) sequence series.
2. Divides the `data` into chunks, according to the parameter configuration and the detected gaps.

Notes

* When you set `fs_dict`, the assumption is made that each item in `data` has a fixed sample frequency. If you do not set `fs_dict`, it is populated with 1 / max time-diff of each series as key-value pairs.
* All subsequent series-chunks are matched against the time-ranges of the first series. This implies that the first item in `data` serves as a reference for gap-matching.
* The term `sub-chunk` refers to the chunks that exceed the `max_chunk_dur` parameter and are therefore further divided into sub-chunks.

Example

    df_acc  # cols ['ACC_x', 'ACC_y', 'ACC_z', 'ACC_SMV'] - 32 Hz
    df_gyro  # cols ['gyro_x', 'gyro_y', 'gyro_z', 'gyro_area'] - 100 Hz
    chunk_data({'acc': df_acc, 'g': df_gyro}, fs_dict={'acc': 32, 'g': 100})
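
The gap detection and chunking described above can be illustrated on a numerically-indexed sequence. This is a minimal sketch in plain Python, not tsflex's implementation: the helper name `split_on_gaps`, its `gap_factor` parameter, and the choice to treat any step larger than two sample periods as a gap are all assumptions made for illustration.

```python
from typing import List, Sequence


def split_on_gaps(
    index: Sequence[float], fs: float, gap_factor: float = 2.0
) -> List[List[float]]:
    """Split a monotonically increasing index wherever two samples are too far apart.

    Hypothetical helper: a step larger than `gap_factor` sample periods
    (i.e. `gap_factor / fs`) is treated as a gap.
    """
    if not index:
        return []
    max_step = gap_factor / fs
    chunks = [[index[0]]]
    for prev, cur in zip(index, index[1:]):
        if cur - prev > max_step:
            chunks.append([cur])  # gap detected: start a new chunk
        else:
            chunks[-1].append(cur)  # no gap: extend the current chunk
    return chunks


# A 4 Hz sequence (period 0.25) with a gap between 0.5 and 3.0
index = [0.0, 0.25, 0.5, 3.0, 3.25, 3.5, 3.75]
chunks = split_on_gaps(index, fs=4)
print(chunks)
```

With several series, the same idea applies per series; according to the notes above, the ranges found in the first series then act as the reference against which the other series are matched.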

Note
If `chunk_range_margin` / `min_chunk_dur` / `max_chunk_dur` / `sub_chunk_overlap` is an int/float, it will be interpreted as a numerical sequence range and numerically-indexed `data` will be assumed. These arguments must all be either time-based or numerical and must match the dtype of the `data` index.
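
The "all time-based or all numerical" rule can be checked before calling `chunk_data`. The sketch below is a hypothetical pre-flight check, not part of tsflex: the names `range_kind` and `check_same_kind` are invented here, and treating `str` values as time-based (for example `"5min"`) is an assumption. It relies on the fact that `pd.Timedelta` subclasses `datetime.timedelta`, so only the standard library is needed.

```python
import datetime as dt
from numbers import Real


def range_kind(value):
    """Classify a range argument as 'numeric', 'time', or None (unset)."""
    if value is None:
        return None
    if isinstance(value, bool):
        raise TypeError("booleans are not valid range arguments")
    if isinstance(value, Real):
        return "numeric"
    if isinstance(value, (str, dt.timedelta)):
        # Assumption: strings are timedelta-like, e.g. "5min"
        return "time"
    raise TypeError(f"unsupported range argument: {value!r}")


def check_same_kind(**range_args):
    """Return the shared kind of all set arguments, or raise if they are mixed."""
    kinds = {range_kind(v) for v in range_args.values()} - {None}
    if len(kinds) > 1:
        raise ValueError(
            f"mixed numeric and time-based arguments: {sorted(range_args)}"
        )
    return kinds.pop() if kinds else None


print(check_same_kind(min_chunk_dur=10, max_chunk_dur=60.0))
print(check_same_kind(min_chunk_dur=dt.timedelta(seconds=10), sub_chunk_overlap=None))
```

The returned kind would still have to be compared with the dtype of the data index, which this sketch leaves out.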

Parameters

`data` : Union[pd.Series, pd.DataFrame, List[Union[pd.Series, pd.DataFrame]], Dict[str, pd.DataFrame]]
    The sequence data which will be chunked. Each item in `data` must have a monotonically increasing index. We assume that each item in `data` has a nearly-constant sample frequency (when there are no gaps) and that all indices have the same dtype.
`fs_dict` : Dict[str, float], optional
    The sample frequency dict. If set, this dict must contain at least a key for every item in `data`.

    Note
    If you pass a DataFrame-dict (i.e., a dict with key=str; value=DataFrame) to `data`, then you can use the corresponding DataFrame str-key in `fs_dict` to describe the `fs` of all columns of that DataFrame. See also the example above.
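
How a DataFrame-level key in `fs_dict` is propagated to the individual columns can be sketched without pandas. In this illustration each DataFrame is represented only by its list of column names, and `expand_fs_dict` is a hypothetical helper name, not a tsflex function.

```python
from typing import Dict, List


def expand_fs_dict(
    columns_per_frame: Dict[str, List[str]], fs_dict: Dict[str, float]
) -> Dict[str, float]:
    """Return a copy of `fs_dict` extended with one entry per DataFrame column."""
    expanded = dict(fs_dict)  # copy, so the caller's dict is left untouched
    for key, fs in fs_dict.items():
        # Keys that do not name a DataFrame simply contribute no columns
        for col in columns_per_frame.get(key, []):
            expanded[col] = fs
    return expanded


columns = {
    "acc": ["ACC_x", "ACC_y", "ACC_z", "ACC_SMV"],
    "g": ["gyro_x", "gyro_y", "gyro_z", "gyro_area"],
}
fs_per_series = expand_fs_dict(columns, {"acc": 32, "g": 100})
print(fs_per_series["ACC_x"], fs_per_series["gyro_area"])
```

Unlike the source shown earlier, which updates the passed `fs_dict` in place, this sketch returns a new dict; callers who reuse their `fs_dict` after calling `chunk_data` may want to keep that difference in mind.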
`chunk_range_margin` : Union[float, str, pd.Timedelta], optional
    The allowed margin on the start and end time of each `ts` chunk for it to be considered the same time-range chunk as those of the other `ts`. If `None`, the margin will be set as:

        2 / min(fs_dict.intersection(data.names).values())

    which is equivalent to twice the largest sample period (i.e., the period of the lowest sample frequency) of the passed `data`, by default None.

    * if `pd.Timedelta`, it will be interpreted as a time-range margin
    * if `int` or `float`, it will be interpreted as a numerical range margin
`min_chunk_dur` : Union[float, str, pd.Timedelta], optional
    The minimum duration of a chunk, by default None. Chunks with durations smaller than this will be discarded (and not returned).

    * if `pd.Timedelta`, it will be interpreted as a time-range duration
    * if `int` or `float`, it will be interpreted as a numerical range duration
`max_chunk_dur` : Union[float, str, pd.Timedelta], optional
    The maximum duration of a chunk, by default None. Chunks with durations larger than this will be split into smaller `sub_chunks`, each with a maximum duration of `max_chunk_dur`.

    * if `pd.Timedelta`, it will be interpreted as a time-range duration
    * if `int` or `float`, it will be interpreted as a numerical range duration
`sub_chunk_overlap` : Union[float, str, pd.Timedelta], optional
    The sub-chunk boundary overlap. If set, this margin / 2 will be added to either side of the `sub_chunk`. This is especially useful to avoid losing inter-`sub_chunk` data (as each `sub_chunk` is in fact a continuous chunk) when window-based aggregations are performed on these same time range output (sub_)chunks. This argument is only relevant if `max_chunk_dur` is set.

    * if `pd.Timedelta`, it will be interpreted as a time-range overlap
    * if `int` or `float`, it will be interpreted as a numerical range overlap
`copy` : bool, optional
    If True, a copy of the chunked data is returned (on which you won't get a `SettingWithCopyWarning` if you change the content), by default True.
`verbose` : bool, optional
    If set, will print more verbose output, by default False.

Returns

List[List[pd.Series]]
    A list of same time range chunks.
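
The interplay of `max_chunk_dur` and `sub_chunk_overlap` can be made concrete with a small numerical sketch. It is not tsflex's algorithm: `sub_chunk_ranges` is a hypothetical helper, and both the greedy left-to-right split and the clipping of the widened ranges to the original chunk boundaries are assumptions made for illustration.

```python
from typing import List, Tuple


def sub_chunk_ranges(
    start: float,
    end: float,
    max_chunk_dur: float,
    sub_chunk_overlap: float = 0.0,
) -> List[Tuple[float, float]]:
    """Split [start, end] into ranges of at most `max_chunk_dur`, then widen each
    range by `sub_chunk_overlap / 2` on both sides, clipped to [start, end]."""
    if max_chunk_dur <= 0:
        raise ValueError("max_chunk_dur must be positive")
    half = sub_chunk_overlap / 2
    ranges = []
    lo = start
    while lo < end:
        hi = min(lo + max_chunk_dur, end)
        # Widen the sub-chunk, but never beyond the original chunk boundaries
        ranges.append((max(lo - half, start), min(hi + half, end)))
        lo = hi
    return ranges


# A chunk spanning 0..250 with max_chunk_dur=100 and sub_chunk_overlap=10
ranges = sub_chunk_ranges(0, 250, max_chunk_dur=100, sub_chunk_overlap=10)
print(ranges)
```

In this sketch neighbouring sub-chunks share 10 index units around each internal boundary, so a window-based aggregation that straddles a boundary still sees contiguous data in at least one of the two sub-chunks.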