Showing content from https://huggingface.co/datasets/tiiuae/falcon-refinedweb below:

tiiuae/falcon-refinedweb · Datasets at Hugging Face

📀 Falcon RefinedWeb

Falcon RefinedWeb is a massive English web dataset built by TII and released under an ODC-By 1.0 license.

See the 📓 paper on arXiv for more details.

RefinedWeb is built through stringent filtering and large-scale deduplication of CommonCrawl; we found that models trained on RefinedWeb achieve performance in line with or better than models trained on curated datasets, while relying only on web data.

RefinedWeb is also "multimodal-friendly": it contains links and alt texts for images in processed samples.

This public extract should contain 500-650 GT (billion tokens), depending on the tokenizer you use, and can be enhanced with curated corpora of your choosing. It is about 500GB to download and requires 2.8TB of local storage once unpacked.

from datasets import load_dataset
rw = load_dataset("tiiuae/falcon-refinedweb")
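
Given the download size noted above, it may be preferable to iterate over samples lazily rather than materialize the full extract. The following is a minimal sketch, assuming the `datasets` library's streaming mode and a single `train` split (the split name is an assumption, since the card does not name one). `peek` and `stream_refinedweb` are hypothetical helper names.

```python
from itertools import islice


def peek(samples, n=3):
    """Return the first n items of any iterable as a list."""
    return list(islice(samples, n))


def stream_refinedweb():
    """Open the dataset lazily; samples are fetched only during iteration."""
    # Imported here so that peek() stays usable without the library installed.
    from datasets import load_dataset

    # Assumption: the data is exposed as a single "train" split.
    return load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)


# Usage (requires network access):
# for sample in peek(stream_refinedweb(), n=2):
#     print(sorted(sample.keys()))
```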

RefinedWeb is the main dataset we have used for training the Falcon LLM models:

Dataset card for Falcon RefinedWeb

Dataset Summary

Falcon RefinedWeb was created to serve as a large-scale English dataset for the pretraining of large language models. It may be used on its own, or augmented with curated sources (e.g., Wikipedia, StackOverflow).

It was built on top of CommonCrawl, leveraging stringent filtering and extensive deduplication.

Supported Tasks and Leaderboards

RefinedWeb is intended to be used primarily as a pretraining dataset for large language models. Practitioners may leverage it for upstream evaluation with a validation loss, but we do not provide any canonical split.

Languages

RefinedWeb primarily contains English text.

Dataset Structure

Data Instances

Each data instance corresponds to an individual web page which has been crawled, processed, and deduplicated against all other instances.

This public extract of RefinedWeb contains about 1B instances (968M individual web pages), for a total of 2.8TB of clean text data.

Data Fields

Data Splits

We do not provide any canonical splits for RefinedWeb.
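
Because no canonical split exists, anyone who wants a validation loss has to carve out a held-out set themselves. One common approach, sketched here as an illustration rather than as anything the authors prescribe, is to assign each document to a split by hashing a stable key such as its source URL. The function name `assign_split` and the default fraction are hypothetical.

```python
import hashlib


def assign_split(url: str, validation_fraction: float = 0.001) -> str:
    """Deterministically map a document URL to 'train' or 'validation'."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Documents whose bucket falls below this threshold are held out.
    threshold = int(validation_fraction * 2**64)
    return "validation" if bucket < threshold else "train"
```

Since the assignment depends only on the URL, the same document lands in the same split on every run and on every machine, without storing an index.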

Dataset Creation

Curation Rationale

Falcon RefinedWeb is built on top of CommonCrawl, using the Macrodata Refinement Pipeline, which combines content extraction, filtering heuristics, and deduplication.

In designing RefinedWeb, we abided by the following philosophy:

During its development, we iterated on RefinedWeb by measuring the zero-shot performance of models trained on development versions of the dataset. Our main goal was to maximize the performance obtained, bridging the gap between curated and web data. We also manually audited samples to identify potential filtering improvements.

Source Data

RefinedWeb is built from CommonCrawl dumps. These dumps are constructed by crawling publicly available web pages.

Data Collection and Preprocessing

We applied extensive preprocessing and cleaning of the data, using our Macrodata Refinement Pipeline.

We first filter URLs to remove adult content using a blocklist and a score system. We then use trafilatura to extract content from pages, and perform language identification with the fastText classifier from CCNet (Wenzek et al., 2019). After this first preprocessing stage, we filter data using heuristics from MassiveWeb (Rae et al., 2021), and our own line-wise corrections.
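
To make the idea of a blocklist combined with a score system concrete, here is a small sketch of the URL filtering step. It is an illustration under stated assumptions, not the pipeline's actual implementation: the blocked domain, the keyword weights, the threshold, and the function names are all hypothetical placeholders.

```python
from urllib.parse import urlparse

# Hypothetical blocklist of domains that are always rejected.
BLOCKLIST = {"blocked.example"}

# Hypothetical soft signals: each keyword found in the URL adds its weight.
WORD_WEIGHTS = {"casino": 1.0, "xxx": 2.0}


def url_score(url: str) -> float:
    """Sum the weights of all flagged keywords that appear in the URL."""
    lowered = url.lower()
    return sum(weight for word, weight in WORD_WEIGHTS.items() if word in lowered)


def keep_url(url: str, max_score: float = 1.5) -> bool:
    """Reject blocklisted hosts outright, then reject URLs scoring too high."""
    host = urlparse(url).hostname or ""
    # Match the blocked domain itself as well as any of its subdomains.
    if host in BLOCKLIST or any(host.endswith("." + d) for d in BLOCKLIST):
        return False
    return url_score(url) < max_score
```

The blocklist gives a hard rejection for known hosts, while the score lets several weak signals add up to a rejection that no single keyword would trigger on its own.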

Finally, we run extensive deduplication, removing URLs revisited across dumps and then performing fuzzy and exact substring deduplication.
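
The sketch below illustrates two of these ideas on a toy scale: dropping revisited URLs, then dropping near-duplicate texts. It is not the pipeline's implementation. For the fuzzy step it swaps in a plain Jaccard similarity over word 3-grams, compared pairwise, which would not scale to web-sized data, and it does not attempt exact substring deduplication at all. The function names and the 0.8 threshold are assumptions made for illustration.

```python
def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams used as a cheap document fingerprint."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs, threshold: float = 0.8):
    """Keep the first occurrence of each URL, then drop near-duplicate texts.

    docs is an iterable of (url, text) pairs. The comparison is quadratic in
    the number of kept documents, so this only suits small inputs.
    """
    seen_urls = set()
    kept = []  # (url, text, fingerprint) triples
    for url, text in docs:
        if url in seen_urls:
            continue  # URL revisited, for example in a later dump
        seen_urls.add(url)
        fingerprint = shingles(text)
        if any(jaccard(fingerprint, other) >= threshold for _, _, other in kept):
            continue  # near-duplicate of a document already kept
        kept.append((url, text, fingerprint))
    return [(url, text) for url, text, _ in kept]
```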

Annotations

We provide automatically collected annotations for the source URL, the timestamp of the crawl, the original CommonCrawl dump and segment in which the document was found, and the image_urls contained in the page.

Personal and Sensitive Information

As RefinedWeb is built upon publicly available web pages, it may contain sensitive information such as emails, phone numbers, or IP addresses. We believe that deduplication may have helped reduce the prevalence of PII in the dataset, but practitioners working with RefinedWeb should take care.
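
For practitioners who want a first line of defense, a simple pattern-based scrubber can mask the three PII types named above. The following is a rough sketch only: the regular expressions are deliberately simple assumptions, will miss many real cases, and will also flag non-PII such as dotted version numbers. The name `redact` and the placeholder tokens are hypothetical.

```python
import re

# Simplified patterns; illustrative, not exhaustive or validated.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace emails, IPv4 addresses, and phone-like digit runs with tags."""
    text = EMAIL.sub("[EMAIL]", text)
    # IPv4 must run before PHONE, whose pattern would also match dotted quads.
    text = IPV4.sub("[IP]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


print(redact("Contact jane@example.org or 192.168.0.1"))
# prints: Contact [EMAIL] or [IP]
```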

Considerations for Using the Data

Social Impact of Dataset

With the open-source release of Falcon RefinedWeb, we aim to increase access to high-quality web data, which has typically been held private by model developers. We believe this release will in turn improve the accessibility and the spread of performant large language models.

Discussion of Biases

As toxic or biased data is prevalent on the internet, it is likely that our dataset contains such content. Notably, using the Perspective API, we estimated the prevalence of toxic content in the dataset to be similar to that of The Pile.

Other Known Limitations

Despite our best efforts to filter content that does not qualify as natural language, and to deduplicate documents, our pipeline may let through documents that could be considered erroneous or redundant.

Additional Information

Licensing Information

This public extract is made available under an ODC-By 1.0 license; users should also abide by the CommonCrawl ToU.

Citation Information
@article{refinedweb,
  title={The {R}efined{W}eb dataset for {F}alcon {LLM}: outperforming curated corpora with web data, and web data only},
  author={Guilherme Penedo and Quentin Malartic and Daniel Hesslow and Ruxandra Cojocaru and Alessandro Cappelli and Hamza Alobeidli and Baptiste Pannier and Ebtesam Almazrouei and Julien Launay},
  journal={arXiv preprint arXiv:2306.01116},
  eprint={2306.01116},
  eprinttype = {arXiv},
  url={https://arxiv.org/abs/2306.01116},
  year={2023}
}
Opt-out request

RefinedWeb is based on CommonCrawl. Their crawler honors opt-out requests in robots.txt; see the CC FAQ for details.

To remove a document from RefinedWeb, please message falconllm@tii.ae.

Contact

falconllm@tii.ae
