# Releases · huggingface/datasets
## 4.0.0

### New Features

- Add `IterableDataset.push_to_hub()` by @lhoestq in #7595

```python
# Build streaming data pipelines in a few lines of code!
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
ds = ds.map(...).filter(...)
ds.push_to_hub(...)
```
- Add `num_proc=` to `.push_to_hub()` (Dataset and IterableDataset) by @lhoestq in #7606

```python
# Faster push to Hub! Available for both Dataset and IterableDataset
ds.push_to_hub(..., num_proc=8)
```
- New `Column` object

```python
# Syntax:
ds["column_name"]  # datasets.Column([...]) or datasets.IterableColumn(...)

# Iterate on a column:
for text in ds["text"]:
    ...

# Load one cell without bringing the full column in memory
first_text = ds["text"][0]  # equivalent to ds[0]["text"]
```
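The point of the new object is that `ds["text"][0]` fetches one cell without materializing the column. A rough pure-Python sketch of that indexing behavior (the `LazyColumn` class is invented for illustration; the real `datasets.Column` is backed by Arrow data):

```python
class LazyColumn:
    """Toy stand-in for datasets.Column: fetches cells on demand."""
    def __init__(self, dataset, name):
        self.dataset = dataset  # a list of row dicts here; an Arrow table in reality
        self.name = name

    def __getitem__(self, i):
        return self.dataset[i][self.name]  # one cell, not the whole column

    def __iter__(self):
        return (row[self.name] for row in self.dataset)

rows = [{"text": "a"}, {"text": "b"}, {"text": "c"}]
col = LazyColumn(rows, "text")
first = col[0]         # "a", without reading the other rows
all_texts = list(col)  # ["a", "b", "c"]
```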
- Torchcodec decoding by @TyTodd in #7616

```python
# Don't download full audios/videos when it's not necessary
# Now with torchcodec it only streams the required ranges/frames:
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
for example in ds:
    video = example["video"]
    frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames
```

  Torchcodec decoding requires `torch>=2.7.0` and FFmpeg >= 4 (not available in `datasets<4.0`).

  `AudioDecoder`:

```python
audio = dataset[0]["audio"]  # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)
samples.data  # tensor([[ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 2.3447e-06, -1.9127e-04, -5.3330e-05]])
samples.sample_rate  # 16000

# old syntax is still supported
array, sr = audio["array"], audio["sampling_rate"]
```

  `VideoDecoder`:

```python
video = dataset[0]["video"]  # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
first_frame = video.get_frame_at(0)
first_frame.data.shape  # (3, 240, 320)
first_frame.pts_seconds  # 0.0
frames = video.get_frames_in_range(0, 6, 1)
frames.data.shape  # torch.Size([5, 3, 240, 320])
```
### Breaking changes

- Remove scripts altogether by @lhoestq in #7592
  - `trust_remote_code` is no longer supported
- Torchcodec decoding by @TyTodd in #7616
- Replace Sequence by List by @lhoestq in #7634

  New `List` type:

```python
from datasets import Features, List, Value

features = Features({
    "texts": List(Value("string")),
    "four_paragraphs": List(Value("string"), length=4),
})
```

  `Sequence` was a legacy type from TensorFlow Datasets which converted lists of dicts to dicts of lists. It is no longer a type; it is now a utility that returns a `List` or a `dict` depending on the subfeature:

```python
from datasets import Sequence

Sequence(Value("string"))  # List(Value("string"))
Sequence({"texts": Value("string")})  # {"texts": List(Value("string"))}
```
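To make the retired behavior concrete: `Sequence` over a dict feature stored data as a dict of lists rather than a list of dicts. A small illustration of that transposition in plain Python (unrelated to the library's internals):

```python
def transpose_rows(rows):
    """Convert a list of dicts to a dict of lists (the old Sequence-of-dict layout)."""
    keys = rows[0].keys()
    return {k: [row[k] for row in rows] for k in keys}

rows = [{"text": "a", "score": 1}, {"text": "b", "score": 2}]
transpose_rows(rows)  # {"text": ["a", "b"], "score": [1, 2]}
```

With the new `List` type, a list of dicts simply stays a list of dicts, which matches how the data is written and read.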
### Other improvements and bug fixes

- Fix `Dataset.map` to reuse cache files mapped with different `num_proc` by @ringohoffman in #7434
- `RepeatExamplesIterable` by @SilvanCodes in #7581
- Update `_dill.py` to use `co_linetable` for Python 3.10+ in place of `co_lnotab` by @qgallouedec in #7609

Full Changelog: 3.6.0...4.0.0
## 3.6.0

## 3.5.1

## 3.5.0

### Datasets Features

- Pdf support:

```python
>>> from datasets import load_dataset, Pdf
>>> repo = "path/to/pdf/folder"  # or username/dataset_name on Hugging Face
>>> dataset = load_dataset(repo, split="train")
>>> dataset[0]["pdf"]
<pdfplumber.pdf.PDF at 0x1075bc320>
>>> dataset[0]["pdf"].pages[0].extract_text()
...
```

### What's Changed

Full Changelog: 3.4.1...3.5.0
## 3.4.1

## 3.4.0

### Dataset Features

- Faster folder based builder + parquet support + allow repeated media + use torchvision by @lhoestq in #7424

  This replaces `decord` with `torchvision` to read videos, since `decord` is not maintained anymore and isn't available for recent Python versions; see the video dataset loading documentation for more details. The `Video` type is still marked as experimental in this version.

```python
from datasets import load_dataset, Video

dataset = load_dataset("path/to/video/folder", split="train")
dataset[0]["video"]  # <torchvision.io.video_reader.VideoReader at 0x1652284c0>
```

  This also adds support for `metadata.parquet`, in addition to `metadata.csv` or `metadata.jsonl`, for the metadata of image/audio/video files.

- Add `IterableDataset.decode` with multithreading by @lhoestq in #7450

```python
dataset = dataset.decode(num_threads=num_threads)
```
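Multithreaded decoding helps because decoding is dominated by I/O and codec work, so threads can overlap the waiting. A hedged sketch of the general pattern with `concurrent.futures` (this is not the library's internal implementation; `fake_decode` is a placeholder for a real image/audio decode call):

```python
from concurrent.futures import ThreadPoolExecutor

def fake_decode(path):
    # placeholder for an actual decode call (e.g. opening and decoding an image)
    return {"path": path, "decoded": True}

paths = [f"img_{i}.png" for i in range(8)]

# decode items with a pool of worker threads; pool.map preserves input order
with ThreadPoolExecutor(max_workers=4) as pool:
    decoded = list(pool.map(fake_decode, paths))
```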
- Fix `string_to_dict` to return `None` if there is no match instead of raising `ValueError` by @ringohoffman in #7435
- `ds.set_epoch(new_epoch)` by @lhoestq in #7451

Full Changelog: 3.3.2...3.4.0
## 3.3.2

## 3.3.1

## 3.3.0

### Dataset Features

- Support async functions in `map()` by @lhoestq in #7384

```python
prompt = "Answer the following question: {question}. You should think step by step."

async def ask_llm(example):
    return await query_model(prompt.format(question=example["question"]))

ds = ds.map(ask_llm)
```
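Async `map` pays off when the function spends most of its time awaiting a remote model. A self-contained sketch of the concurrency win with a mocked model call (`query_model` here is a stand-in for a network request, not an API of `datasets`):

```python
import asyncio

async def query_model(prompt):
    await asyncio.sleep(0.01)  # stand-in for network latency
    return f"answer to: {prompt}"

async def main():
    prompts = [f"question {i}" for i in range(5)]
    # awaited concurrently, so total wall time is ~one call, not five
    return await asyncio.gather(*(query_model(p) for p in prompts))

answers = asyncio.run(main())
```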
- Support faster processing using pandas or polars functions in `IterableDataset.map()` by @lhoestq in #7370

```python
ds = load_dataset("ServiceNow-AI/R1-Distill-SFT", "v0", split="train", streaming=True)
ds = ds.with_format("polars")
expr = pl.col("solution").str.extract("boxed\\{(.*)\\}").alias("value_solution")
ds = ds.map(lambda df: df.with_columns(expr), batched=True)
```
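For readers not using polars: the expression above pulls the final answer out of a `\boxed{...}` span in each solution. The same extraction in plain Python, for illustration (a greedy `(.*)` mirroring the polars pattern):

```python
import re

def extract_boxed(solution):
    """Return the content of the first boxed{...} span, or None if absent."""
    m = re.search(r"boxed\{(.*)\}", solution)
    return m.group(1) if m else None

extract_boxed(r"... so the result is \boxed{42}")  # "42"
```

The polars version does the same work column-wise on whole batches, which is where the speedup over per-example Python comes from.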
- Apply formatting after `iter_arrow` to speed up format -> map, filter for iterable datasets by @alex-hh in #7207

Full Changelog: 3.2.0...3.3.0
## 3.2.0

### Dataset Features

- Support `filters` when loading parquet datasets, including in streaming mode:

```python
from datasets import load_dataset

filters = [('date', '>=', '2023')]
ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)
```
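The `filters` argument takes pyarrow-style `(column, op, value)` predicates. A toy evaluation of such a predicate list over plain dict rows, just to show the semantics (this is not how the library applies them; real filtering happens at the parquet level before rows are decoded):

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
       ">": operator.gt, "<": operator.lt}

def apply_filters(rows, filters):
    """Keep rows matching every (column, op, value) predicate (AND semantics)."""
    return [
        row for row in rows
        if all(OPS[op](row[col], val) for col, op, val in filters)
    ]

rows = [{"date": "2022-05"}, {"date": "2023-11"}, {"date": "2024-01"}]
apply_filters(rows, [("date", ">=", "2023")])  # keeps the 2023 and 2024 rows
```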
- `ClassLabel` by @sergiopaniego in #7293

Full Changelog: 3.1.0...3.2.0