# Releases · huggingface/datasets
## 4.0.0

### New Features

- Add `IterableDataset.push_to_hub()` by @lhoestq in #7595

```python
# Build streaming data pipelines in a few lines of code!
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
ds = ds.map(...).filter(...)
ds.push_to_hub(...)
```
- Add `num_proc=` to `.push_to_hub()` (Dataset and IterableDataset) by @lhoestq in #7606

```python
# Faster push to Hub! Available for both Dataset and IterableDataset
ds.push_to_hub(..., num_proc=8)
```
- New `Column` object

```python
# Syntax:
ds["column_name"]  # datasets.Column([...]) or datasets.IterableColumn(...)

# Iterate on a column:
for text in ds["text"]:
    ...

# Load one cell without bringing the full column in memory
first_text = ds["text"][0]  # equivalent to ds[0]["text"]
```
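The point of the new object is that `ds["text"][0]` fetches one cell without materializing the column. A rough pure-Python sketch of that indexing behavior (the `LazyColumn` class is invented for illustration; the real `datasets.Column` is backed by Arrow data):

```python
class LazyColumn:
    """Toy stand-in for datasets.Column: fetches cells on demand."""
    def __init__(self, dataset, name):
        self.dataset = dataset  # a list of row dicts here; an Arrow table in reality
        self.name = name

    def __getitem__(self, i):
        return self.dataset[i][self.name]  # one cell, not the whole column

    def __iter__(self):
        return (row[self.name] for row in self.dataset)

rows = [{"text": "a"}, {"text": "b"}, {"text": "c"}]
col = LazyColumn(rows, "text")
first = col[0]         # "a", without reading the other rows
all_texts = list(col)  # ["a", "b", "c"]
```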
- Torchcodec decoding by @TyTodd in #7616

```python
# Don't download full audios/videos when it's not necessary
# Now with torchcodec it only streams the required ranges/frames:
from datasets import load_dataset

ds = load_dataset(..., streaming=True)
for example in ds:
    video = example["video"]
    frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames
```

  Torchcodec decoding requires `torch>=2.7.0` and FFmpeg >= 4 (not available in `datasets<4.0`).

  `AudioDecoder`:

```python
audio = dataset[0]["audio"]  # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)
samples.data  # tensor([[ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 2.3447e-06, -1.9127e-04, -5.3330e-05]])
samples.sample_rate  # 16000

# old syntax is still supported
array, sr = audio["array"], audio["sampling_rate"]
```

  `VideoDecoder`:

```python
video = dataset[0]["video"]  # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
first_frame = video.get_frame_at(0)
first_frame.data.shape  # (3, 240, 320)
first_frame.pts_seconds  # 0.0
frames = video.get_frames_in_range(0, 6, 1)
frames.data.shape  # torch.Size([5, 3, 240, 320])
```
### Breaking changes

- Remove scripts altogether by @lhoestq in #7592
  - `trust_remote_code` is no longer supported
- Torchcodec decoding by @TyTodd in #7616
- Replace Sequence by List by @lhoestq in #7634

  New `List` type:

```python
from datasets import Features, List, Value

features = Features({
    "texts": List(Value("string")),
    "four_paragraphs": List(Value("string"), length=4),
})
```

  `Sequence` was a legacy type from TensorFlow Datasets which converted lists of dicts to dicts of lists. It is no longer a type; it is now a utility that returns a `List` or a `dict` depending on the subfeature:

```python
from datasets import Sequence

Sequence(Value("string"))  # List(Value("string"))
Sequence({"texts": Value("string")})  # {"texts": List(Value("string"))}
```
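To make the retired behavior concrete: `Sequence` over a dict feature stored data as a dict of lists rather than a list of dicts. A small illustration of that transposition in plain Python (unrelated to the library's internals):

```python
def transpose_rows(rows):
    """Convert a list of dicts to a dict of lists (the old Sequence-of-dict layout)."""
    keys = rows[0].keys()
    return {k: [row[k] for row in rows] for k in keys}

rows = [{"text": "a", "score": 1}, {"text": "b", "score": 2}]
transpose_rows(rows)  # {"text": ["a", "b"], "score": [1, 2]}
```

With the new `List` type, a list of dicts simply stays a list of dicts, which matches how the data is written and read.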
### Other improvements and bug fixes

- Fix `Dataset.map` to reuse cache files mapped with different `num_proc` by @ringohoffman in #7434
- `RepeatExamplesIterable` by @SilvanCodes in #7581
- Update `_dill.py` to use `co_linetable` for Python 3.10+ in place of `co_lnotab` by @qgallouedec in #7609

Full Changelog: 3.6.0...4.0.0
## 3.6.0

## 3.5.1

## 3.5.0

### Datasets Features

- Pdf support:

```python
>>> from datasets import load_dataset, Pdf
>>> repo = "path/to/pdf/folder"  # or username/dataset_name on Hugging Face
>>> dataset = load_dataset(repo, split="train")
>>> dataset[0]["pdf"]
<pdfplumber.pdf.PDF at 0x1075bc320>
>>> dataset[0]["pdf"].pages[0].extract_text()
...
```

### What's Changed

Full Changelog: 3.4.1...3.5.0
## 3.4.1

## 3.4.0

### Dataset Features

- Faster folder based builder + parquet support + allow repeated media + use torchvision by @lhoestq in #7424

  This replaces `decord` with `torchvision` to read videos, since `decord` is not maintained anymore and isn't available for recent Python versions; see the video dataset loading documentation for more details. The `Video` type is still marked as experimental in this version.

```python
from datasets import load_dataset, Video

dataset = load_dataset("path/to/video/folder", split="train")
dataset[0]["video"]  # <torchvision.io.video_reader.VideoReader at 0x1652284c0>
```

  This also adds support for `metadata.parquet`, in addition to `metadata.csv` or `metadata.jsonl`, for the metadata of image/audio/video files.

- Add `IterableDataset.decode` with multithreading by @lhoestq in #7450

```python
dataset = dataset.decode(num_threads=num_threads)
```
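Multithreaded decoding helps because decoding is dominated by I/O and codec work, so threads can overlap the waiting. A hedged sketch of the general pattern with `concurrent.futures` (this is not the library's internal implementation; `fake_decode` is a placeholder for a real image/audio decode call):

```python
from concurrent.futures import ThreadPoolExecutor

def fake_decode(path):
    # placeholder for an actual decode call (e.g. opening and decoding an image)
    return {"path": path, "decoded": True}

paths = [f"img_{i}.png" for i in range(8)]

# decode items with a pool of worker threads; pool.map preserves input order
with ThreadPoolExecutor(max_workers=4) as pool:
    decoded = list(pool.map(fake_decode, paths))
```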
- Fix `string_to_dict` to return `None` if there is no match instead of raising `ValueError` by @ringohoffman in #7435
- `ds.set_epoch(new_epoch)` by @lhoestq in #7451

Full Changelog: 3.3.2...3.4.0
## 3.3.2

## 3.3.1

## 3.3.0

### Dataset Features

- Support async functions in `map()` by @lhoestq in #7384

```python
prompt = "Answer the following question: {question}. You should think step by step."

async def ask_llm(example):
    return await query_model(prompt.format(question=example["question"]))

ds = ds.map(ask_llm)
```
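Async `map` pays off when the function spends most of its time awaiting a remote model. A self-contained sketch of the concurrency win with a mocked model call (`query_model` here is a stand-in for a network request, not an API of `datasets`):

```python
import asyncio

async def query_model(prompt):
    await asyncio.sleep(0.01)  # stand-in for network latency
    return f"answer to: {prompt}"

async def main():
    prompts = [f"question {i}" for i in range(5)]
    # awaited concurrently, so total wall time is ~one call, not five
    return await asyncio.gather(*(query_model(p) for p in prompts))

answers = asyncio.run(main())
```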
- Support faster processing using pandas or polars functions in `IterableDataset.map()` by @lhoestq in #7370

```python
ds = load_dataset("ServiceNow-AI/R1-Distill-SFT", "v0", split="train", streaming=True)
ds = ds.with_format("polars")
expr = pl.col("solution").str.extract("boxed\\{(.*)\\}").alias("value_solution")
ds = ds.map(lambda df: df.with_columns(expr), batched=True)
```
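For readers not using polars: the expression above pulls the final answer out of a `\boxed{...}` span in each solution. The same extraction in plain Python, for illustration (a greedy `(.*)` mirroring the polars pattern):

```python
import re

def extract_boxed(solution):
    """Return the content of the first boxed{...} span, or None if absent."""
    m = re.search(r"boxed\{(.*)\}", solution)
    return m.group(1) if m else None

extract_boxed(r"... so the result is \boxed{42}")  # "42"
```

The polars version does the same work column-wise on whole batches, which is where the speedup over per-example Python comes from.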
- Apply formatting after `iter_arrow` to speed up format -> map, filter for iterable datasets by @alex-hh in #7207

Full Changelog: 3.2.0...3.3.0
## 3.2.0

### Dataset Features

- Support `filters` when loading parquet datasets, including in streaming mode:

```python
from datasets import load_dataset

filters = [('date', '>=', '2023')]
ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)
```
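The `filters` argument takes pyarrow-style `(column, op, value)` predicates. A toy evaluation of such a predicate list over plain dict rows, just to show the semantics (this is not how the library applies them; real filtering happens at the parquet level before rows are decoded):

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
       ">": operator.gt, "<": operator.lt}

def apply_filters(rows, filters):
    """Keep rows matching every (column, op, value) predicate (AND semantics)."""
    return [
        row for row in rows
        if all(OPS[op](row[col], val) for col, op, val in filters)
    ]

rows = [{"date": "2022-05"}, {"date": "2023-11"}, {"date": "2024-01"}]
apply_filters(rows, [("date", ">=", "2023")])  # keeps the 2023 and 2024 rows
```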
- `ClassLabel` by @sergiopaniego in #7293

Full Changelog: 3.1.0...3.2.0