## What is it?

A sparkling update with 1000s of languages.
This is the second iteration of the popular 🍷 FineWeb dataset, bringing high quality pretraining data to over 1000 🗣️ languages.
The 🥂 FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license and extensively validated through hundreds of ablation experiments.
In particular, on the set of 9 diverse languages we used to guide our processing decisions, 🥂 FineWeb2 outperforms other popular multilingual pretraining datasets (such as CC-100, mC4, CulturaX, and HPLT) while being substantially larger, and in some cases it even performs better than datasets curated specifically for a single one of these languages, on our diverse set of carefully selected evaluation tasks: FineTasks.
The data was sourced from 96 CommonCrawl snapshots, spanning the summer of 2013 to April 2024, and processed using 🏭 `datatrove`, our large-scale data processing library. This carefully deduplicated and filtered dataset comprises roughly 20 terabytes across 5 billion documents, with over 3 trillion words (see "How many tokens?" for more details). For PII and opt-out, see "Personal and Sensitive Information" below.
You will find our ablation and evaluation setup in this github repo. We will soon upload model checkpoints from our ablation experiments.
Read our 📝 research paper for details on the dataset creation!
## Languages and available subsets

For English data, please refer to the original 🍷 FineWeb.
Each language is identified by its ISO 639-3 code, and the data is grouped by language-script pairs, since some languages have content in multiple scripts.
In total, we provide filtered data for 1,868 language-script pairs. Of these, 474 have more than 1,000 documents and 203 have more than 10,000 documents of filtered data. Most languages also include a small `test` split, which should not be trained on.
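For example, a quick way to peek at the held-out split of a given language-script subset with the `datasets` library (a minimal sketch; `por_Latn` is just an illustrative choice, and the split names follow the description above):

```python
from datasets import load_dataset

# Most language-script subsets also ship a small held-out "test" split (do not train on it)
test = load_dataset("HuggingFaceFW/fineweb-2", name="por_Latn", split="test", streaming=True)
print(next(iter(test))["text"][:200])
```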
While we tried our best not to overfilter, we know that our filtering isn't perfect, and we wanted to allow the community to easily re-filter the data with their own criteria. We have therefore also uploaded the data that was removed by our filtering pipeline for each language (suffixed with `_removed`). The filtered and removed subsets of each language together represent the entire data for a given language following global deduplication, which means that you do not have to re-deduplicate it yourself. You can find and adapt our filtering code here. The removed data is available through direct download (using `hf_hub_download`, for example) but not through `load_dataset`, as there would otherwise be an excessive number of subsets.
Additionally, we also uploaded data in scripts that the language classifier does not support, or in a supported script but an unidentified language, without any deduplication or filtering. These subsets are prefixed with `und_`.
The following table shows the size of the filtered subset for the 80 largest languages. The full list is available on GitHub.
| ISO 639-3 code | Script | Name | Language Family | Subset | Words | Documents | UTF-8 Bytes | Disk size |
|---|---|---|---|---|---|---|---|---|
| rus | Cyrl | Russian | Indo-European | rus_Cyrl | 588,579,493,780 | 699,083,579 | 5.82TB | 1.81TB |
| cmn | Hani | Mandarin Chinese | Sino-Tibetan | cmn_Hani | 543,543,038,750 | 636,058,984 | 2.42TB | 1.48TB |
| deu | Latn | German | Indo-European | deu_Latn | 262,271,052,199 | 495,964,485 | 1.51TB | 719.08GB |
| jpn | Jpan | Japanese | Japonic | jpn_Jpan | 331,144,301,801 | 400,138,563 | 1.50TB | 667.44GB |
| spa | Latn | Spanish | Indo-European | spa_Latn | 261,523,749,595 | 441,287,261 | 1.32TB | 593.82GB |
| fra | Latn | French | Indo-European | fra_Latn | 220,662,584,640 | 360,058,973 | 1.11TB | 502.82GB |
| ita | Latn | Italian | Indo-European | ita_Latn | 139,116,026,491 | 238,984,437 | 739.24GB | 332.47GB |
| por | Latn | Portuguese | Indo-European | por_Latn | 109,536,087,117 | 199,737,979 | 569.24GB | 256.92GB |
| pol | Latn | Polish | Indo-European | pol_Latn | 73,119,437,217 | 151,966,724 | 432.01GB | 210.35GB |
| nld | Latn | Dutch | Indo-European | nld_Latn | 74,634,633,118 | 147,301,270 | 397.51GB | 176.98GB |
| ind | Latn | Indonesian | Austronesian | ind_Latn | 60,264,322,142 | 100,238,529 | 348.65GB | 141.70GB |
| vie | Latn | Vietnamese | Austro-Asiatic | vie_Latn | 50,886,874,358 | 61,064,248 | 319.83GB | 121.19GB |
| fas | Arab | Persian | Indo-European | fas_Arab | 39,705,799,658 | 58,843,652 | 304.62GB | 95.33GB |
| arb | Arab | Standard Arabic | Afro-Asiatic | arb_Arab | 32,812,858,120 | 61,977,525 | 293.59GB | 98.69GB |
| tur | Latn | Turkish | Turkic | tur_Latn | 41,933,799,420 | 95,129,129 | 284.52GB | 125.53GB |
| tha | Thai | Thai | Kra-Dai | tha_Thai | 24,662,748,945 | 35,897,202 | 278.68GB | 69.91GB |
| ukr | Cyrl | Ukrainian | Indo-European | ukr_Cyrl | 25,586,457,655 | 53,101,726 | 254.86GB | 84.98GB |
| ell | Grek | Modern Greek (1453-) | Indo-European | ell_Grek | 22,827,957,288 | 47,421,073 | 222.05GB | 73.16GB |
| kor | Hang | Korean | Koreanic | kor_Hang | 48,613,120,582 | 60,874,355 | 213.43GB | 98.50GB |
| ces | Latn | Czech | Indo-European | ces_Latn | 35,479,428,809 | 66,067,904 | 206.33GB | 102.38GB |
| swe | Latn | Swedish | Indo-European | swe_Latn | 35,745,969,364 | 59,485,306 | 202.96GB | 88.63GB |
| hun | Latn | Hungarian | Uralic | hun_Latn | 30,919,839,164 | 49,935,986 | 199.69GB | 91.73GB |
| ron | Latn | Romanian | Indo-European | ron_Latn | 35,017,893,659 | 58,303,671 | 186.19GB | 85.37GB |
| nob | Latn | Norwegian Bokmål | Indo-European | nob_Latn | 32,008,904,934 | 38,144,343 | 172.05GB | 78.25GB |
| dan | Latn | Danish | Indo-European | dan_Latn | 28,055,948,840 | 45,391,655 | 150.72GB | 65.74GB |
| bul | Cyrl | Bulgarian | Indo-European | bul_Cyrl | 16,074,326,712 | 25,994,731 | 145.75GB | 45.68GB |
| fin | Latn | Finnish | Uralic | fin_Latn | 20,343,096,672 | 36,710,816 | 143.03GB | 61.94GB |
| hin | Deva | Hindi | Indo-European | hin_Deva | 11,173,681,651 | 22,095,985 | 120.98GB | 31.92GB |
| ben | Beng | Bengali | Indo-European | ben_Beng | 6,153,579,265 | 15,185,742 | 87.04GB | 22.25GB |
| slk | Latn | Slovak | Indo-European | slk_Latn | 14,808,010,769 | 29,991,521 | 85.43GB | 43.00GB |
| heb | Hebr | Hebrew | Afro-Asiatic | heb_Hebr | 8,462,976,117 | 14,491,748 | 68.71GB | 23.15GB |
| lit | Latn | Lithuanian | Indo-European | lit_Latn | 9,132,828,961 | 13,471,965 | 56.50GB | 25.75GB |
| bos | Latn | Bosnian | Indo-European | bos_Latn | 9,086,837,979 | 21,243,255 | 49.18GB | 24.61GB |
| slv | Latn | Slovenian | Indo-European | slv_Latn | 7,688,373,264 | 12,059,130 | 41.80GB | 19.22GB |
| ekk | Latn | Standard Estonian | Uralic | ekk_Latn | 6,564,292,000 | 10,218,587 | 40.82GB | 18.35GB |
| cat | Latn | Catalan | Indo-European | cat_Latn | 8,348,091,726 | 17,136,414 | 40.35GB | 18.52GB |
| tam | Taml | Tamil | Dravidian | tam_Taml | 1,937,150,898 | 5,528,854 | 36.97GB | 8.79GB |
| hrv | Latn | Croatian | Indo-European | hrv_Latn | 6,609,299,440 | 6,195,824 | 35.91GB | 16.36GB |
| lvs | Latn | Standard Latvian | Indo-European | lvs_Latn | 5,371,151,279 | 8,030,316 | 33.36GB | 14.70GB |
| zsm | Latn | Standard Malay | Austronesian | zsm_Latn | 5,648,387,840 | 9,421,248 | 31.94GB | 13.28GB |
| azj | Latn | North Azerbaijani | Turkic | azj_Latn | 3,894,255,826 | 7,291,231 | 26.90GB | 10.49GB |
| srp | Cyrl | Serbian | Indo-European | srp_Cyrl | 2,858,500,314 | 4,146,124 | 26.87GB | 8.64GB |
| kat | Geor | Georgian | Kartvelian | kat_Geor | 1,439,572,993 | 3,706,659 | 25.23GB | 6.33GB |
| npi | Deva | Nepali (individual language) | Indo-European | npi_Deva | 1,642,856,349 | 4,888,163 | 25.13GB | 6.22GB |
| mar | Deva | Marathi | Indo-European | mar_Deva | 1,541,225,070 | 3,912,702 | 22.57GB | 5.85GB |
| mal | Mlym | Malayalam | Dravidian | mal_Mlym | 1,054,187,581 | 3,322,526 | 22.27GB | 5.51GB |
| kaz | Cyrl | Kazakh | Turkic | kaz_Cyrl | 1,876,843,453 | 3,344,366 | 20.67GB | 6.33GB |
| urd | Arab | Urdu | Indo-European | urd_Arab | 2,733,266,493 | 4,809,542 | 19.93GB | 6.40GB |
| als | Latn | Tosk Albanian | Indo-European | als_Latn | 3,454,387,059 | 8,597,826 | 18.18GB | 8.42GB |
| mkd | Cyrl | Macedonian | Indo-European | mkd_Cyrl | 1,611,392,841 | 4,150,902 | 14.99GB | 4.82GB |
| tel | Telu | Telugu | Dravidian | tel_Telu | 891,002,487 | 1,964,395 | 14.42GB | 3.68GB |
| kan | Knda | Kannada | Dravidian | kan_Knda | 748,850,327 | 2,390,982 | 12.91GB | 3.28GB |
| mya | Mymr | Burmese | Sino-Tibetan | mya_Mymr | 854,400,671 | 1,558,304 | 12.35GB | 2.90GB |
| guj | Gujr | Gujarati | Indo-European | guj_Gujr | 934,124,052 | 2,127,094 | 11.71GB | 3.11GB |
| bel | Cyrl | Belarusian | Indo-European | bel_Cyrl | 1,166,541,148 | 2,100,873 | 11.47GB | 3.87GB |
| isl | Latn | Icelandic | Indo-European | isl_Latn | 1,696,354,360 | 3,014,429 | 10.27GB | 4.59GB |
| khm | Khmr | Khmer | Austro-Asiatic | khm_Khmr | 667,495,692 | 1,586,460 | 8.70GB | 2.12GB |
| khk | Cyrl | Halh Mongolian | Mongolic | khk_Cyrl | 824,211,882 | 1,622,882 | 8.52GB | 2.58GB |
| fil | Latn | Filipino | Austronesian | fil_Latn | 1,636,238,017 | 2,349,050 | 8.13GB | 3.34GB |
| ary | Arab | Moroccan Arabic | Afro-Asiatic | ary_Arab | 843,523,994 | 2,365,405 | 7.74GB | 2.67GB |
| afr | Latn | Afrikaans | Indo-European | afr_Latn | 1,598,352,868 | 1,992,040 | 7.69GB | 3.40GB |
| hye | Armn | Armenian | Indo-European | hye_Armn | 634,273,060 | 1,757,415 | 7.17GB | 2.26GB |
| sin | Sinh | Sinhala | Indo-European | sin_Sinh | 512,453,069 | 1,185,323 | 7.05GB | 1.87GB |
| glg | Latn | Galician | Indo-European | glg_Latn | 1,236,233,473 | 2,522,814 | 6.47GB | 2.92GB |
| uzn | Cyrl | Northern Uzbek | Turkic | uzn_Cyrl | 544,866,919 | 1,357,811 | 6.12GB | 1.83GB |
| pan | Guru | Panjabi | Indo-European | pan_Guru | 522,788,467 | 944,160 | 5.64GB | 1.47GB |
| ory | Orya | Odia | Indo-European | ory_Orya | 333,760,951 | 1,298,188 | 4.92GB | 1.28GB |
| uzn | Latn | Northern Uzbek | Turkic | uzn_Latn | 687,002,994 | 1,233,463 | 4.45GB | 1.90GB |
| kir | Cyrl | Kirghiz | Turkic | kir_Cyrl | 397,449,282 | 1,069,582 | 4.36GB | 1.37GB |
| eus | Latn | Basque | Language isolate | eus_Latn | 711,939,889 | 1,569,434 | 4.30GB | 1.90GB |
| lat | Latn | Latin | Indo-European | lat_Latn | 714,764,848 | 1,473,541 | 3.86GB | 1.64GB |
| tgk | Cyrl | Tajik | Indo-European | tgk_Cyrl | 396,209,383 | 688,384 | 3.75GB | 1.15GB |
| gmh | Latn | Middle High German (ca. 1050-1500) | Indo-European | gmh_Latn | 506,396,917 | 84,495 | 3.41GB | 1.28GB |
| swh | Latn | Swahili (individual language) | Niger-Congo | swh_Latn | 569,542,024 | 1,206,300 | 3.08GB | 1.33GB |
| arz | Arab | Egyptian Arabic | Afro-Asiatic | arz_Arab | 345,040,810 | 853,290 | 2.92GB | 1.06GB |
| nno | Latn | Norwegian Nynorsk | Indo-European | nno_Latn | 522,740,774 | 1,214,870 | 2.68GB | 1.30GB |
| cym | Latn | Welsh | Indo-European | cym_Latn | 523,226,616 | 831,878 | 2.50GB | 1.10GB |
| amh | Ethi | Amharic | Afro-Asiatic | amh_Ethi | 239,936,286 | 428,373 | 2.49GB | 848.50MB |
| pbt | Arab | Southern Pashto | Indo-European | pbt_Arab | 337,138,269 | 639,983 | 2.41GB | 816.03MB |
| ckb | Arab | Central Kurdish | Indo-European | ckb_Arab | 236,342,609 | 554,993 | 2.39GB | 783.85MB |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| **Total** | | | | | 3,339,271,691,958 | 5,018,505,566 | 20.78TB | 8.58TB |

### How many tokens?
The number of tokens obtained when tokenizing data in a specific language heavily depends on whether the tokenizer was trained with that language, and its script, in mind. For instance, while employing the `gpt2` tokenizer to tokenize Thai data might result in a very large number of tokens, using a tokenizer explicitly trained for South-East Asian languages would considerably bring down this number.

As such, we chose to only report the total number of documents, the disk size, and the number of words for each language, as reported by the word tokenizer we assigned to each language (not a subword tokenizer such as `gpt2`, but a tool that only splits text into words).
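As an illustration (not part of our reported statistics), you can compare token counts for the same Thai sentence under an English-centric tokenizer and a multilingual one; the two checkpoints below are just example choices:

```python
from transformers import AutoTokenizer

text = "สวัสดีครับ ยินดีต้อนรับ"  # a short Thai greeting

for name in ["gpt2", "xlm-roberta-base"]:  # English-centric vs. multilingual tokenizer
    tok = AutoTokenizer.from_pretrained(name)
    print(name, len(tok(text)["input_ids"]))
# gpt2 typically needs many more tokens for Thai text than a tokenizer
# whose training data covered the Thai script.
```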
## How to download and use 🥂 FineWeb2

Previous versions remain available in branches named after the version. You can access them by passing, for example, `revision="v2.0.0"`.
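For example, to pin an older release when loading a subset (a minimal sketch, assuming the branch naming described above):

```python
from datasets import load_dataset

fw_v2 = load_dataset(
    "HuggingFaceFW/fineweb-2",
    name="hrv_Latn",
    split="train",
    revision="v2.0.0",  # branch/tag of the dataset repository
    streaming=True,
)
```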
See the tables above for the `subset` name of the language and the version (filtered or removed) of the data you want to download.
We currently do not provide smaller `sample` versions, but by setting `limit` or using `streaming=True` you can easily fetch a sample of the data. If there is interest from the community, we might upload smaller sampled versions later on.
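For instance, a small sample can be fetched by streaming and taking the first few documents (a sketch; adjust the subset name and sample size as needed):

```python
from itertools import islice
from datasets import load_dataset

fw = load_dataset("HuggingFaceFW/fineweb-2", name="por_Latn", split="train", streaming=True)
sample = list(islice(fw, 1000))  # keep only the first 1000 documents
print(len(sample), sample[0]["id"])
```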
### Using 🏭 datatrove

```python
from datatrove.pipeline.readers import ParquetReader

data_reader = ParquetReader("hf://datasets/HuggingFaceFW/fineweb-2/data/por_Latn/train", limit=1000)
for document in data_reader():
    # do something with document
    print(document)
```
```python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import ParquetReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.writers import JsonlWriter

pipeline_exec = LocalPipelineExecutor(
    pipeline=[
        ParquetReader("hf://datasets/HuggingFaceFW/fineweb-2/data/por_Latn/train", limit=1000),
        LambdaFilter(lambda doc: "hugging" in doc.text),
        JsonlWriter("some-output-path"),
    ],
    tasks=10,
)
pipeline_exec.run()
```
### Using huggingface_hub

```python
from huggingface_hub import snapshot_download

folder = snapshot_download(
    "HuggingFaceFW/fineweb-2",
    repo_type="dataset",
    local_dir="./fineweb2/",
    allow_patterns=["data/ces_Latn/train/*", "data/ces_Latn_removed/train/*"],
)
```
For faster downloads, make sure to install `hf_transfer` (`pip install huggingface_hub[hf_transfer]`) and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER=1`.
### Using datasets

As mentioned above, `load_dataset` will not work for the `und_` or `_removed` subsets.

```python
from datasets import load_dataset

fw = load_dataset("HuggingFaceFW/fineweb-2", name="hrv_Latn", split="train", streaming=True)
```
## Dataset processing steps

We used the 🏭 `datatrove` library to process the data. You can find a working script that launches the entire processing pipeline here.

The processing pipeline had to be heavily adapted for a multilingual setting. As each language has its own peculiarities, we individually tuned each filter, defining different thresholds and stopwords per language. 📊 These thresholds and stopwords are available in `/configs/{iso3_lang}_{script}.yml` in our GitHub repo.
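As an illustrative sketch of inspecting one of these configs (assuming you have cloned the repo locally; `tur_Latn.yml` is just an example path following the layout above, and the exact keys vary by filter):

```python
import yaml  # pip install pyyaml

# Per-language filtering config, following the /configs/{iso3_lang}_{script}.yml layout
with open("configs/tur_Latn.yml") as f:
    config = yaml.safe_load(f)

# Print the top-level settings defined for this language
for key, value in config.items():
    print(key, "->", value)
```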
The starting point for our dataset was the non-English data (documents with an English score below 0.65) obtained when processing the original 🍷 FineWeb. This data had its text extracted with trafilatura and had already gone through our URL filters (for more info see 🍷 FineWeb). To this data, we applied the following processing steps:
### Language identification 🌍

Language identification was performed using GlotLID, which not only covers a far wider variety of languages (2000+ available labels) than the fastText model used in the original FineWeb (which covers 176 languages), but also identifies the script used in each document. 📜 For each language, we defined a different minimum language classifier confidence score required to keep a document.
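A minimal sketch of running GlotLID on a single piece of text, assuming the fastText model published on the Hugging Face Hub under `cis-lmu/glotlid` (check the GlotLID documentation for exact usage; this is not our production pipeline):

```python
import fasttext
from huggingface_hub import hf_hub_download

# Download the GlotLID fastText model from the Hub
model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")
model = fasttext.load_model(model_path)

labels, scores = model.predict("Il y a 61 ans le match le plus long de l'histoire")
print(labels[0], scores[0])  # e.g. a label of the form __label__fra_Latn and its confidence
```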
### Deduplication 🗃️

Unlike in 🍷 FineWeb, where data was deduplicated per CommonCrawl snapshot, in 🥂 FineWeb2 data is deduplicated per language, globally. However, following our deduplication findings in the original 🍷 FineWeb, while we remove all but one document from each duplicate cluster, we save the size of the cluster in the kept document's metadata, under `minhash_cluster_size`. This allows us to "re-hydrate" the dataset: by upsampling documents based on their cluster size, we see clear performance improvements for some languages, particularly high-resource ones. 📈
We think upsampling weights should be dataset specific, and have therefore used the filtering rates of each duplicate cluster to compute different weights per language. They are available in our GitHub repo, along with sample code to re-hydrate the dataset.
WARNING: If you do not upsample based on these weights, dataset performance may be lower than the results we report.
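As a rough sketch of what such re-hydration could look like, the weight function below is purely hypothetical; the actual per-language weights live in our GitHub repo:

```python
from datasets import load_dataset

def upsampling_weight(cluster_size: int) -> int:
    # Hypothetical mapping from duplicate-cluster size to number of repetitions
    return 1 if cluster_size <= 1 else min(cluster_size, 5)

def rehydrated(dataset):
    # Repeat each document according to the size of its duplicate cluster
    for doc in dataset:
        for _ in range(upsampling_weight(doc["minhash_cluster_size"])):
            yield doc

fw = load_dataset("HuggingFaceFW/fineweb-2", name="por_Latn", split="train", streaming=True)
for doc in rehydrated(fw):
    ...  # feed into your training pipeline
```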
### Data Filtering 🧹

We mostly kept the original 🍷 FineWeb set of filters and did not create new filters targeting individual languages. As such, we had to extensively ablate different ways of adapting the English filters to all the languages we support. 🔍
Based on the results of our experiments, we also disabled some specific filters or changed their global values: we disabled `short_line_thr` and changed `char_dup_ratio` from 0.01 to 0.1. We will soon release more details regarding the reasoning behind each of these decisions in our upcoming blogpost.
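If you want to adapt the filtering yourself, the snippet below is a minimal sketch of a custom per-language rule with `datatrove`, applied to the removed data; the stopword list and threshold are placeholders, not the values we used:

```python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import ParquetReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.writers import JsonlWriter

STOPWORDS = {"e", "de", "que", "o", "a"}  # placeholder Portuguese stopwords
MIN_STOPWORD_HITS = 2                     # placeholder threshold

def has_enough_stopwords(doc) -> bool:
    words = doc.text.lower().split()
    return sum(w in STOPWORDS for w in words) >= MIN_STOPWORD_HITS

LocalPipelineExecutor(
    pipeline=[
        # re-filter the data our pipeline removed, using your own criteria
        ParquetReader("hf://datasets/HuggingFaceFW/fineweb-2/data/por_Latn_removed/train", limit=1000),
        LambdaFilter(has_enough_stopwords),
        JsonlWriter("refiltered-output"),
    ],
    tasks=4,
).run()
```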
## Dataset performance evaluation and ablations

We chose 9 languages, diverse in script, language family, and resource availability, for our ablation setup: Chinese, French, Arabic, Russian, Thai, Hindi, Turkish, Swahili, and Telugu. We then selected high-signal tasks for these languages out of almost 200 benchmarks. We wrote an entire blogpost about this process, FineTasks, where you will find the full list of tasks we evaluated on, as well as how they were selected. As for metrics, we use normalized probability mass (not accuracies!) for discriminative tasks and F1 for generative tasks, as these metrics have proven far more stable than their alternatives.
We conducted our dataset performance ablations and evaluations by training a series of 1.45B parameter models on ~30 billion tokens, tokenized with the gemma tokenizer. To compare 🥂 FineWeb2 with other datasets, we also trained one of these 1.45B models per target dataset, on 30 billion tokens sampled from it (or on the entire dataset when it contained fewer than 30 billion tokens). We chose 30B tokens because some of the comparison datasets were relatively small for some languages, but we will soon release longer ablation runs.
### Hyper-parameters for ablation models

The detailed configurations for training the models can be found here.
### Comparison with other datasets

Note: the results below use an older version of the dataset; please check our paper for updated results. You will find all the evaluation results in the repo files. The 🥂 FineWeb2 runs were trained on the final data (deduplication + filtering) with re-hydration (see the section on deduplication above), unless explicitly stated otherwise (e.g. Swahili).
We compared 🥂 FineWeb2 with the following multilingual datasets:
And with the following language-specific monolingual datasets:
Expand each individual language to see the corresponding plot. The error bars correspond to one standard deviation of the scores of 4 models trained on different randomly sampled 30B tokens of unfiltered CommonCrawl data.
- Arabic
- French
- Hindi
- Russian
- Swahili: the filtered data (around 1B tokens) performs worse than the deduplicated data (filtered + removed subsets, around 3B tokens). We believe this is due to the small number of remaining tokens.
- Telugu
- Thai
- Turkish
- Chinese: TigerBot and MAP-CC outperform 🥂 FineWeb2, possibly due to filters specifically targeting Chinese.

## Dataset card for 🥂 FineWeb2

### Dataset Summary

This dataset was created by processing 96 CommonCrawl dumps comprising web data crawled from the summer of 2013 to April 2024. 🥂 FineWeb2 includes a variety of domains and topics in a variety of languages and is primarily intended to be used as a research artifact on public data in the context of pretraining datasets for large language models. The CommonCrawl data was carefully processed, deduplicated and filtered with the 🏭 `datatrove` library, resulting in the largest publicly available multilingual clean LLM pretraining dataset.
The following is an example sample from the dataset. It is part of the French (`fra_Latn`) data, originally belonged to the `CC-MAIN-2013-20` CommonCrawl snapshot, and was crawled on `2013-05-19T07:12:36Z`.
```json
{
  "text": "Il y a 61 ans le match le plus long de l'histoire\nLe 6 janvier 1951 les Rochester Royals recevaient les Indianapolis Olympians pour ce qui allait être le match le plus long de l'histoire. Rochester qui sortait d'une victoire face aux Knicks de New York en prolongation étaient sur une série de 7 victoires avant la réception d'Indianapolis. Au final un match remporté au bout de la nuit par les Olympians en 6 prolongations et un tout petit score de 75 à 73. les équipes n'avaient shooté que 23 fois au total des 6 prolongations! (l'horloge de tir n'était pas encore utilisée)\nCe match reste à ce jour le plus long de l'histoire avec 78 minutes de jeu.",
  "id": "<urn:uuid:5013b1b9-5092-40f8-8d79-c517970dd814>",
  "dump": "CC-MAIN-2013-20",
  "url": "http://basket-infos.com/2012/01/06/il-y-a-61-ans-le-match-le-plus-long-de-lhistoire/",
  "date": "2013-05-19T07:12:36Z",
  "file_path": "s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696384213/warc/CC-MAIN-20130516092624-00033-ip-10-60-113-184.ec2.internal.warc.gz",
  "language": "fra",
  "language_script": "Latn",
  "language_score": 0.9994362592697144,
  "minhash_cluster_size": 1,
  "top_langs": "{\"fra_Latn_score\": 0.9994362592697144}"
}
```
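Note that `top_langs` is stored as a JSON-encoded string; a small sketch of decoding it after loading a sample:

```python
import json
from datasets import load_dataset

fw = load_dataset("HuggingFaceFW/fineweb-2", name="fra_Latn", split="train", streaming=True)
doc = next(iter(fw))

top_langs = json.loads(doc["top_langs"])  # e.g. {"fra_Latn_score": 0.999...}
print(doc["language"], doc["language_script"], top_langs)
```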
### Data Fields

- `text` (string): the main text content
- `id` (string): original unique identifier for this sample from CommonCrawl
- `dump` (string): the CommonCrawl dump this sample was a part of
- `url` (string): url to the original page where `text` was present
- `date` (string): crawl date (from CommonCrawl)
- `file_path` (string): s3 path for the individual CommonCrawl warc file containing this sample
- `language` (string): ISO 639-3 code for the language of this sample
- `language_script` (string): script of the `text`, for example `Latn`
- `language_score` (float): language prediction score as reported by the GlotLID classifier
- `top_langs` (string): language-script pairs reported by the language classifier, together with their scores, serialized as a JSON string (see the example above)
- `minhash_cluster_size` (int): number of samples in the minhash cluster of this sample. See the deduplication section to learn why this might be useful

### Data Splits

See "Languages and available subsets" above.
## Dataset Creation

### Curation Rationale

While multiple open-weights models have regularly been released in recent months, these releases often do not include the model's training data. With 🥂 FineWeb2 we aim to provide the open-source community with a very large clean pretraining dataset that can be used to push the envelope on truly open-source models (models whose training data is also released). We also seek to improve the representation of lower-resource (and often ignored) languages, and deliberately chose a language classifier that supports a large number of language labels.
### Source Data

The source data consists of webpages crawled by the CommonCrawl foundation over the 2013-2024 time period. We then extracted the main page text from the HTML of each webpage, identified its language, deduplicated the data per language, and filtered it with specific thresholds adapted to each language.

### Data processing steps

See "Dataset processing steps" above.
### Annotations

We augment the original samples with the `language`, `language_script`, `language_score`, `top_langs` and `minhash_cluster_size` annotations. The language-related annotations are automatically generated by our language filter. `minhash_cluster_size` is computed during the deduplication process, by saving the size of each duplicate cluster before removing all of its documents except one.
### Personal and Sensitive Information

We anonymize email addresses and public IP addresses.

For emails, we apply a regex pattern and replace any occurrence of an email address with either `email@example.com` or `firstname.lastname@example.org`. For IP addresses, we also employ a regex pattern and then further filter to only anonymize IP addresses allocated for public networks. Matched IP addresses are then replaced with one of the following randomly generated IP addresses, which at the time of dataset creation were not responding to ping requests: `22.214.171.124`, `126.96.36.199`, `188.8.131.52`, `184.108.40.206`, `220.127.116.11`, and `18.104.22.168`. We decided against applying regex patterns for phone numbers due to the high false positive rate.
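As an illustrative sketch of the kind of substitution involved (a simplified pattern, not the exact one used in our 🏭 datatrove pipeline):

```python
import random
import re

# Simplified email pattern; the production patterns live in the datatrove codebase
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
REPLACEMENTS = ["email@example.com", "firstname.lastname@example.org"]

def anonymize_emails(text: str) -> str:
    return EMAIL_RE.sub(lambda m: random.choice(REPLACEMENTS), text)

print(anonymize_emails("Contact me at jane.doe@mail.com for details."))
```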
Despite our efforts, given that 🥂 FineWeb2 is sourced from the internet at large, it is very likely that some personally identifiable information (PII) will be present. If you find your own PII in 🥂 FineWeb2 and would like it removed, please fill out our PII removal/opt-out form.
CommonCrawl respects robots.txt at crawl time, but if you are a webmaster and find your website in 🥂 FineWeb2 and would like to have it removed, you may also use the PII removal/opt-out form.
## Considerations for Using the Data

### Social Impact of Dataset

With the release of this dataset we aim to make model training more accessible to the machine learning community at large.
While multiple open-weights models with strong performance have been publicly released in the past, more often than not these releases are not accompanied by the corresponding training dataset. This is unfortunate, as a dataset's specific characteristics have been demonstrated to have a very large impact on a model's performance. Since the creation of a high-quality training dataset is a fundamental requirement for training an LLM capable of excelling at downstream tasks, with 🥂 FineWeb2 we (a) make the dataset creation process more transparent, by sharing our entire processing setup, including the codebase used, and (b) help alleviate the costs of dataset curation, both in time and in compute, for model creators, by publicly releasing our dataset to the community.
While LLM advancements have primarily focused on English, Chinese, and other Western languages, this release prioritizes broader language support. We consulted with practitioners who develop LLMs for diverse languages to address their specific requirements, such as proper word segmentation (particularly for scripts that don't use whitespace separation) and handling language-specific punctuation, ensuring that medium and lower resource languages were not an afterthought.
### Discussion of Biases

Efforts were made to minimize the amount of NSFW and toxic content present in the dataset by employing filtering at the URL level. However, there are still a significant number of documents in the final dataset that could be considered toxic or contain harmful content. As 🥂 FineWeb2 was sourced from the web as a whole, any harmful biases typically present in it may be reproduced in our dataset.
Some filters might disproportionately target specific domains. One such example is poetry: we noticed that the punctuation filter removes a lot of poems.
We deliberately avoided using machine-learning filtering methods that define text quality based on similarity to a "gold" source such as Wikipedia, as well as toxicity classifiers, as these methods have been known to disproportionately remove content in specific dialects and to over-classify text related to specific social identities as toxic, respectively.
### Other Known Limitations

While the language classifier we used, GlotLID, supports over 2000 language labels, its performance is not ideal for all of them. Training data for many languages is hard to obtain and, additionally, the classifier sometimes mistakes closely related languages for one another (for instance, Standard Arabic and Arabic dialects, or Croatian and Bosnian). We tried to mitigate this by curating stopwords for each language, but these might not be effective in all cases.
Due to resource constraints and limited access to native speakers, we couldn't test each language individually. We encourage users to review our filtering approach for their languages of interest and modify the processing if needed. To support this, we've made available all data removed by our filtering pipeline (see "Languages and available subsets" above for more info).
You should also probably consider complementing 🥂 FineWeb2 with specialized curated sources (such as Wikipedia, for example), as they will likely have better formatting than the Wikipedia content included in 🥂 FineWeb2 (we did not tailor the processing to individual websites).
## Additional Information

### Licensing Information

The dataset is released under the Open Data Commons Attribution License (ODC-By) v1.0 license. The use of this dataset is also subject to CommonCrawl's Terms of Use.
### Citation Information

```bibtex
@misc{penedo2025fineweb2pipelinescale,
      title={FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language},
      author={Guilherme Penedo and Hynek Kydlíček and Vinko Sabolčec and Bettina Messmer and Negar Foroutan and Amir Hossein Kargaran and Colin Raffel and Martin Jaggi and Leandro Von Werra and Thomas Wolf},
      year={2025},
      eprint={2506.20920},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.20920},
}
```