Falcon RefinedWeb is a massive English web dataset built by TII and released under an ODC-By 1.0 license.
See the 📓 paper on arXiv for more details.
RefinedWeb is built through stringent filtering and large-scale deduplication of CommonCrawl; we found that models trained on RefinedWeb achieve performance in line with or better than models trained on curated datasets, while relying only on web data.
RefinedWeb is also "multimodal-friendly": it contains links and alt texts for images in processed samples.
This public extract should contain 500-650GT depending on the tokenizer you use, and can be enhanced with the curated corpora of your choosing. It is about 500GB to download and requires 2.8TB of local storage once unpacked.
from datasets import load_dataset
rw = load_dataset("tiiuae/falcon-refinedweb")
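Given the download and storage sizes quoted above, it may be more practical to stream samples than to materialize the whole extract. The following is a minimal sketch, not an official recipe: it assumes the `datasets` library is installed, that the hosted dataset exposes a `train` split (an assumption, since no canonical splits are provided), and the helper names `preview` and `stream_refinedweb` are hypothetical.

```python
from itertools import islice


def preview(samples, n=3, width=80):
    """Return the first `n` samples as short one-line summaries."""
    lines = []
    for sample in islice(samples, n):
        # Collapse runs of whitespace so each summary fits on one line.
        text = " ".join(sample["content"].split())
        lines.append(f"{sample['url']}: {text[:width]}")
    return lines


def stream_refinedweb():
    """Open the dataset lazily instead of downloading it in full."""
    # Imported here so that `preview` works without `datasets` installed.
    from datasets import load_dataset

    # streaming=True fetches samples on demand over the network.
    # The "train" split name is an assumption about the hosted dataset.
    return load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
```

With network access, `preview(stream_refinedweb())` should return a few `url: text` summaries without unpacking the full extract.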
RefinedWeb is the main dataset we have used for training the Falcon LLM models:
Falcon RefinedWeb was created to serve as an English large-scale dataset for the pretraining of large language models. It may be used on its own, or augmented with curated sources (e.g., Wikipedia, StackOverflow).
It was built on top of CommonCrawl, leveraging stringent filtering and extensive deduplication.
Supported Tasks and Leaderboards
RefinedWeb is intended to be used primarily as a pretraining dataset for large language models. Practitioners may leverage it for upstream evaluation with a validation loss, but we do not provide any canonical split.
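Because no canonical split is provided, practitioners who want a validation loss need to carve out their own holdout set. One possible approach, sketched below under assumptions, is to hash each sample's URL so that the assignment is deterministic and reproducible. The function names and the 0.1% default fraction are hypothetical choices, not part of RefinedWeb.

```python
import hashlib


def is_validation(url, fraction=0.001):
    """Deterministically assign a sample to the holdout set from its URL."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # Read the first 8 bytes of the hash as an integer in [0, 2**64)
    # and compare it against the requested fraction of that range.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < fraction * 2**64


def split(samples, fraction=0.001):
    """Partition an iterable of samples into (train, validation) lists."""
    train, validation = [], []
    for sample in samples:
        target = validation if is_validation(sample["url"], fraction) else train
        target.append(sample)
    return train, validation
```

Hashing the URL rather than drawing random numbers means the same page always lands on the same side of the split, regardless of iteration order.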
Languages
RefinedWeb primarily contains English.
Dataset Structure
Data Instances
Each data instance corresponds to an individual web page which has been crawled, processed, and deduplicated against all other instances.
This public extract of RefinedWeb contains about 1B instances (968M individual web pages), for a total of 2.8TB of clean text data.
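The figures quoted in this card imply rough per-page and per-token sizes. The back-of-the-envelope calculation below assumes decimal units (1TB = 10^12 bytes, 1GT = 10^9 tokens) and that the 2.8TB figure covers exactly the 968M pages; both are assumptions, so treat the results as orders of magnitude only.

```python
TOTAL_BYTES = 2.8e12  # 2.8TB of clean text (decimal units assumed)
NUM_PAGES = 968e6     # 968M individual web pages
TOKENS_LOW, TOKENS_HIGH = 500e9, 650e9  # 500-650GT depending on the tokenizer

# Average size of a page, in bytes and in tokens.
bytes_per_page = TOTAL_BYTES / NUM_PAGES
tokens_per_page = (TOKENS_LOW / NUM_PAGES, TOKENS_HIGH / NUM_PAGES)

# More tokens for the same text means fewer bytes per token.
bytes_per_token = (TOTAL_BYTES / TOKENS_HIGH, TOTAL_BYTES / TOKENS_LOW)

print(f"bytes per page:  {bytes_per_page:.0f}")
print(f"tokens per page: {tokens_per_page[0]:.0f} to {tokens_per_page[1]:.0f}")
print(f"bytes per token: {bytes_per_token[0]:.1f} to {bytes_per_token[1]:.1f}")
```

Under these assumptions an average page is a little under 3KB of text, or roughly 520 to 670 tokens, at about 4.3 to 5.6 bytes per token.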
Data Fields
content: the processed and cleaned text contained in the page;
url: the url of the webpage crawled to produce the sample;
timestamp: timestamp of when the webpage was crawled by CommonCrawl;
dump: the CommonCrawl dump the sample is a part of;
segment: the CommonCrawl segment the sample is a part of;
image_urls: a list of elements of the form [image_url, image_alt_text] for all the images found in the content of the sample.
We do not provide any canonical splits for RefinedWeb.
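To make the field layout concrete, the sketch below types a sample as a `TypedDict` and pulls out the image and alt-text pairs that make the dataset "multimodal-friendly". The example values are invented for illustration, the `timestamp` type is left open because the card does not state it, and treating alt text as possibly empty or missing is an assumption.

```python
from typing import List, Optional, TypedDict


class RefinedWebSample(TypedDict):
    content: str
    url: str
    timestamp: object  # exact type is not stated in the card
    dump: str
    segment: str
    image_urls: List[List[Optional[str]]]  # [image_url, image_alt_text] pairs


def images_with_alt_text(sample):
    """Return (image_url, alt_text) pairs that have non-empty alt text."""
    pairs = []
    for image_url, alt_text in sample["image_urls"]:
        if alt_text:  # skip images whose alt text is empty or missing
            pairs.append((image_url, alt_text))
    return pairs


# A made-up sample, for illustration only.
example: RefinedWebSample = {
    "content": "A short page about falcons.",
    "url": "https://example.com/falcons",
    "timestamp": "2021-01-01T00:00:00Z",
    "dump": "hypothetical-dump",
    "segment": "hypothetical-segment",
    "image_urls": [
        ["https://example.com/a.png", "A falcon in flight"],
        ["https://example.com/b.png", None],
    ],
}
```

Calling `images_with_alt_text(example)` keeps only the first image, since the second has no alt text.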
Dataset Creation
Curation Rationale
Falcon RefinedWeb is built on top of CommonCrawl, using the Macrodata Refinement Pipeline, which combines content extraction, filtering heuristics, and deduplication.
In designing RefinedWeb, we abided by the following philosophy:
During its development, we iterated on RefinedWeb by measuring the zero-shot performance of models trained on development versions of the dataset. Our main goal was to maximize the performance obtained, bridging the gap between curated and web data. We also manually audited samples to identify potential filtering improvements.
Source Data
RefinedWeb is built from CommonCrawl dumps. These dumps are constructed from crawling publicly available web pages.
Data Collection and Preprocessing
We applied extensive preprocessing and cleaning of the data, using our Macrodata Refinement Pipeline.
We first filter URLs to remove adult content using a blocklist and a score system. We then use trafilatura to extract content from pages, and perform language identification with the fastText classifier from CCNet (Wenzek et al., 2019). After this first preprocessing stage, we filter data using heuristics from MassiveWeb (Rae et al., 2021) and our own line-wise corrections.
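As a rough illustration of the URL and document filtering stages, the sketch below applies a domain blocklist and a few document-level heuristics in the spirit of those described for MassiveWeb. It is not the Macrodata Refinement Pipeline: the blocklist entry, function names, and every threshold are illustrative assumptions, and the score system, content extraction, language identification, and line-wise corrections are not shown.

```python
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"blocked.example"}  # hypothetical blocklist entry


def url_allowed(url, blocklist=BLOCKED_DOMAINS):
    """Reject URLs whose host is a blocklisted domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return not any(host == d or host.endswith("." + d) for d in blocklist)


def passes_quality_heuristics(text, min_words=50, max_words=100_000,
                              min_mean_len=3, max_mean_len=10,
                              min_alpha_ratio=0.8):
    """Apply simple word-count, word-length, and alphabetic-ratio checks."""
    words = text.split()
    if not words or not (min_words <= len(words) <= max_words):
        return False
    # Very short or very long average word length suggests non-natural text.
    mean_len = sum(len(w) for w in words) / len(words)
    if not (min_mean_len <= mean_len <= max_mean_len):
        return False
    # Require most words to contain at least one alphabetic character.
    alpha = sum(any(c.isalpha() for c in w) for w in words)
    return alpha / len(words) >= min_alpha_ratio
```

A page would be kept only if both `url_allowed(url)` and `passes_quality_heuristics(text)` return `True`.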
Finally, we run extensive deduplication, removing URLs revisited across dumps and subsequently performing fuzzy and exact substring deduplication.
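The deduplication stages can be illustrated at toy scale. The sketch below removes repeated URLs and then drops near-duplicates using exact Jaccard similarity over word 5-grams. This is a stand-in chosen for readability, not the method used to build RefinedWeb: it compares every pair of documents, so it does not scale to web-sized data. The 0.8 threshold is an arbitrary assumption, and exact substring deduplication is not shown.

```python
def dedup_urls(samples):
    """Keep the first sample seen for each URL, e.g. across several dumps."""
    seen, kept = set(), []
    for sample in samples:
        if sample["url"] not in seen:
            seen.add(sample["url"])
            kept.append(sample)
    return kept


def shingles(text, n=5):
    """Return the set of word n-grams in `text` (at least one element)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity between two non-empty sets."""
    return len(a & b) / len(a | b)


def dedup_fuzzy(samples, threshold=0.8, n=5):
    """Drop samples that overlap heavily with an earlier kept sample."""
    kept, kept_shingles = [], []
    for sample in samples:
        current = shingles(sample["content"], n)
        # Quadratic comparison against everything kept so far (toy scale only).
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(sample)
            kept_shingles.append(current)
    return kept
```

Running `dedup_fuzzy(dedup_urls(samples))` mirrors the order described above: URL-level removal first, then content-level comparison.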
Annotations
We provide automatically collected annotations for the source url, the timestamp of the crawl, the original CommonCrawl dump and segment in which the document was found, and the image_urls contained in the page.
As RefinedWeb is built upon publicly available web pages, it may contain sensitive information such as emails, phone numbers, or IP addresses. We believe that deduplication may have helped reduce the prevalence of PII in the dataset, but practitioners working with RefinedWeb should take care.
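As one example of taking care, practitioners could mask obvious identifiers before training. The sketch below uses two deliberately simple regular expressions for emails and IPv4 addresses. The patterns and placeholder tokens are assumptions, they will miss many cases and may produce false positives, and phone numbers are not handled at all.

```python
import re

# Deliberately simple patterns; real PII detection needs far more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact(text):
    """Replace email addresses and IPv4-looking strings with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return IPV4_RE.sub("<IP>", text)
```

For instance, `redact("Mail jane.doe@example.com from 192.168.0.1.")` returns `"Mail <EMAIL> from <IP>."`.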
Considerations for Using the Data
Social Impact of Dataset
With the open-source release of Falcon RefinedWeb, we aim to increase access to high-quality web data, which has typically been held private by model developers. We believe this release will in turn improve the accessibility and the spread of performant large language models.
Discussion of Biases
As toxic or biased data is prevalent on the internet, it is likely our dataset contains such content. Notably, using the Perspective API, we estimated the prevalence of toxic content in the dataset to be similar to The Pile.
Other Known Limitations
Despite our best efforts to filter content that does not qualify as natural language, and to deduplicate documents, our pipeline may let through documents that may be considered erroneous or redundant.
Additional Information
Licensing Information
This public extract is made available under an ODC-By 1.0 license; users should also abide by the CommonCrawl ToU.
Citation Information
@article{refinedweb,
title={The {R}efined{W}eb dataset for {F}alcon {LLM}: outperforming curated corpora with web data, and web data only},
author={Guilherme Penedo and Quentin Malartic and Daniel Hesslow and Ruxandra Cojocaru and Alessandro Cappelli and Hamza Alobeidli and Baptiste Pannier and Ebtesam Almazrouei and Julien Launay},
journal={arXiv preprint arXiv:2306.01116},
eprint={2306.01116},
eprinttype = {arXiv},
url={https://arxiv.org/abs/2306.01116},
year={2023}
}
Opt-out request
RefinedWeb is based on CommonCrawl. Their crawler honors opt-out requests in the robots.txt; see the CC FAQ for details.
To remove a document from RefinedWeb, please message falconllm@tii.ae.
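For site owners, the robots.txt opt-out mentioned above can be checked offline with Python's standard library. The sketch below assumes that CommonCrawl's crawler identifies itself with the `CCBot` user-agent token (consult the CC FAQ to confirm); the helper name `ccbot_allowed` is hypothetical.

```python
from urllib.robotparser import RobotFileParser


def ccbot_allowed(robots_txt, url):
    """Check whether a robots.txt body permits the CCBot user agent for `url`."""
    parser = RobotFileParser()
    # Parse the rules from text directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)


# A robots.txt that opts the whole site out for CCBot (user agent assumed).
OPT_OUT = "User-agent: CCBot\nDisallow: /\n"
```

With this rule set, `ccbot_allowed(OPT_OUT, "https://example.com/page")` returns `False`.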