Showing content from https://www.techtarget.com/whatis/definition/data-set below:

What is a data set?

A data set, sometimes spelled dataset, is a collection of related data that's usually organized in a standardized format. Data sets are used for analytics, business intelligence, artificial intelligence (AI) model training and a variety of other use cases. Data sets can vary significantly in both size and type of data. For example, a data set might contain information about tree species, ocean temperatures, regional sales totals, fruit prices, lottery winners, diseases or just about any other type of data.

Although formats differ from one data set to another, their underlying organization can often be conceptualized as columns and rows, such as those found in spreadsheets or database tables. Each column represents a variable that describes the data, and each row represents a record that contains a related set of variable values. A single value within a data set is sometimes referred to as a datum or data point.

Many data sets are freely available online. They can be used to develop and test applications, train AI models, perform analytics or carry out other projects. For example, the figure below shows the air quality data set from Data.gov, which offers a wide range of free data sets. The air quality data set contains air quality surveillance data for New York City.

Example of a data set: Air quality surveillance data in New York City displayed in Microsoft Excel.

In the figure, the air quality data set is displayed in a Microsoft Excel spreadsheet. However, the data originated as a comma-separated values (CSV) file downloaded from Data.gov. The data set includes columns such as Unique ID, Geo Place Name and Time Period, which are three of the data set's variables.

The data set also includes rows for each air quality measurement, specific to a place and time. That is, each row is a record of a specific air quality measurement. The record is made up of a set of related values, with each value corresponding to a column, i.e., variable. For example, the value in the Start_Date column for the first record is 12/1/2010.
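The column-and-row structure described above can be sketched in Python with the standard csv module. This is a minimal illustration, not the way Data.gov or Excel processes the file. The record is the one with unique ID 172653 that appears later in this article. The header line is partly an assumption: Unique ID, Geo Place Name, Time Period and Start_Date are named in the text, while the remaining column names are hypothetical stand-ins.

```python
import csv
import io

# Header line: four names come from the article; the rest are assumed
# for illustration and may not match the real CSV file.
HEADER = ("Unique ID,Indicator ID,Name,Measure,Measure Info,Geo Type Name,"
          "Geo Join ID,Geo Place Name,Time Period,Start_Date,Data Value")

# One record from the air quality data set, as quoted in the article.
RECORD = ("172653,375,Nitrogen dioxide (NO2),Mean,ppb,UHF34,203,"
          "Bedford Stuyvesant – Crown Heights,Annual Average 2011,12/01/2010,25.3")


def read_records(csv_text):
    """Parse CSV text into a list of dicts, one dict per row (record)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


records = read_records(HEADER + "\n" + RECORD + "\n")
first = records[0]

print(list(first.keys()))        # the variables (columns)
print(first["Geo Place Name"])   # one value (data point) within the record
print(first["Start_Date"])       # another value from the same record
```

Each dictionary key corresponds to a column (variable), and each dictionary corresponds to a row (record).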

Data set vs. database

The term data set is sometimes confused with the term database, but the two have different meanings. A database is used to store and manage data. It is part of a larger management platform that includes features for securing, accessing, updating and otherwise working with and protecting data. A data set is simply a file or other structure that contains the data values in a specific format. A database might contain the data from one or more data sets, but the two are not the same.
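To make the distinction concrete, the following sketch loads data set values into a database and then queries them. It uses Python's built-in sqlite3 module with an in-memory database. The table name, the column names and the choice of four fields are assumptions made for illustration; the values come from the air quality record with unique ID 172653 quoted later in this article.

```python
import sqlite3

# Values taken from one air quality record; the selection of fields
# and their names are illustrative assumptions.
rows = [
    (172653, "Nitrogen dioxide (NO2)", "Bedford Stuyvesant – Crown Heights", 25.3),
]


def load_into_database(records):
    """Create an in-memory SQLite database and insert the data set's records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE air_quality ("
        "unique_id INTEGER PRIMARY KEY, name TEXT, "
        "geo_place_name TEXT, data_value REAL)"
    )
    conn.executemany("INSERT INTO air_quality VALUES (?, ?, ?, ?)", records)
    conn.commit()
    return conn


conn = load_into_database(rows)

# The database, not the data set, provides the querying capability.
value = conn.execute(
    "SELECT data_value FROM air_quality WHERE unique_id = ?", (172653,)
).fetchone()[0]
print(value)
```

The list of tuples plays the role of the data set, while SQLite supplies the storage and access features that the data set alone does not have.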

Data set formats

Data sets are available in a variety of formats, such as JavaScript Object Notation (JSON) and Extensible Markup Language (XML). Such formats provide a standardized structure for sharing data across multiple platforms and applications. The data itself is usually written in plain text, so it can be easily filtered, updated and in other ways transformed to meet specific requirements.

Some data sets are available in more than one format. For example, the air quality data set shown above can be downloaded from Data.gov as a CSV, JSON, XML or Resource Description Framework (RDF) file. When a data set is available in multiple formats, the expectation is that each file contains the same set of records, with each record formatted according to the applicable standard.

A good way to demonstrate how this works is to look at the same air quality record in each of the four formats. For instance, one of the records has a unique ID value of 172653, which distinguishes that record from all other records. The following four samples show the record in each format:

CSV record:

172653,375,Nitrogen dioxide (NO2),Mean,ppb,UHF34,203,Bedford Stuyvesant – Crown Heights,Annual Average 2011,12/01/2010,25.3

JSON record:

[ "row-frzi_7bar_4cbg", "00000000-0000-0000-AF08-C339B5581012", 0, 1698955938, null, 1698955938, null, "{ }", "172653", "375", "Nitrogen dioxide (NO2)", "Mean", "ppb", "UHF34", "203", "Bedford Stuyvesant – Crown Heights", "Annual Average 2011", "2010-12-01T00:00:00", "25.30", null ]

XML record:

<row _id="row-frzi_7bar_4cbg" _uuid="00000000-0000-0000-AF08-C339B5581012" _position="0" _address="https://data.cityofnewyork.us/resource/c3uy-2p5r/172653"><unique_id>172653</unique_id><indicator_id>375</indicator_id><name>Nitrogen dioxide (NO2)</name><measure>Mean</measure><measure_info>ppb</measure_info><geo_type_name>UHF34</geo_type_name><geo_join_id>203</geo_join_id><geo_place_name>Bedford Stuyvesant – Crown Heights</geo_place_name><time_period>Annual Average 2011</time_period><start_date>2010-12-01T00:00:00</start_date><data_value>25.30</data_value></row>

RDF record:

<rdf:Description rdf:about="https://data.cityofnewyork.us/resource/c3uy-2p5r/172653">

Each format provides the same core information but does so in a way different from the others. When a data set is available in multiple formats, data scientists and other users can choose whichever format best meets their needs for a specific project or environment. Because the formats are standardized, users can load the data into a system that supports the format, making it relatively simple to view, modify and manipulate data from multiple sources.
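As a rough check that the formats carry the same core information, the sketch below parses the CSV and XML records shown above with Python's standard library and compares a few normalized fields. The positional indexes used for the CSV fields are inferred from the order of the XML elements, and the helper names from_csv and from_xml are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET
from datetime import datetime

CSV_RECORD = ("172653,375,Nitrogen dioxide (NO2),Mean,ppb,UHF34,203,"
              "Bedford Stuyvesant – Crown Heights,Annual Average 2011,12/01/2010,25.3")

XML_RECORD = (
    '<row _id="row-frzi_7bar_4cbg" _uuid="00000000-0000-0000-AF08-C339B5581012" '
    '_position="0" _address="https://data.cityofnewyork.us/resource/c3uy-2p5r/172653">'
    '<unique_id>172653</unique_id><indicator_id>375</indicator_id>'
    '<name>Nitrogen dioxide (NO2)</name><measure>Mean</measure>'
    '<measure_info>ppb</measure_info><geo_type_name>UHF34</geo_type_name>'
    '<geo_join_id>203</geo_join_id>'
    '<geo_place_name>Bedford Stuyvesant – Crown Heights</geo_place_name>'
    '<time_period>Annual Average 2011</time_period>'
    '<start_date>2010-12-01T00:00:00</start_date>'
    '<data_value>25.30</data_value></row>'
)


def from_csv(line):
    """Normalize selected fields of the CSV record (positions are assumed)."""
    fields = next(csv.reader(io.StringIO(line)))
    return {
        "unique_id": int(fields[0]),
        "name": fields[2],
        "geo_place_name": fields[7],
        "start_date": datetime.strptime(fields[9], "%m/%d/%Y").date(),
        "data_value": float(fields[10]),
    }


def from_xml(text):
    """Normalize the same fields from the XML record."""
    row = ET.fromstring(text)
    return {
        "unique_id": int(row.findtext("unique_id")),
        "name": row.findtext("name"),
        "geo_place_name": row.findtext("geo_place_name"),
        "start_date": datetime.fromisoformat(row.findtext("start_date")).date(),
        "data_value": float(row.findtext("data_value")),
    }


print(from_csv(CSV_RECORD) == from_xml(XML_RECORD))
```

The date and numeric fields are written differently in each format (12/01/2010 versus 2010-12-01T00:00:00, 25.3 versus 25.30), so the comparison converts them to common types before testing equality.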

Types of data sets

Data sets can be categorized in different ways. One common approach, which is often used in statistics, is to break them down into the following categories:

The term data set originated with IBM, where its meaning was similar to that of file. In an IBM mainframe operating system, a data set is a named group of records that contains individual data units formatted in an IBM-prescribed way and accessed by a specific access method based on the data set format. Format types include sequential, relative sequential, indexed sequential and partitioned. Access methods include the Virtual Storage Access Method (VSAM) and the Indexed Sequential Access Method (ISAM).

A data set is also an older and now deprecated term for a modem.

Working with numerical data

Numerical data within a data set is often characterized by specific measures that are used in statistics and analytics to describe the properties of a statistical distribution. Such a distribution reflects the set of possible values within the target data. The most common measures include the following:

To better understand how these measures work, consider the following numerical data set:

{2,4,4,6,8,10,13,14,16,18,20,22}

This is a very small numerical data set that contains 12 values, with only one value repeated. All of the values are integers. When the measures are applied to the data, they return the following properties:

If the data set had contained another pair of duplicate numbers, such as two instances of 10, there would have been two modes: 4 and 10. However, if there had been three instances of 4 and only two instances of 10, 4 would have been the only mode.
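The mode behavior described above can be reproduced with Python's statistics module. This is a small sketch using the 12-value data set from this section. The mode cases follow the text directly; the mean, median and range are included as commonly used measures, on the assumption that they are among the measures the section refers to.

```python
import statistics

# The numerical data set from this section.
data = [2, 4, 4, 6, 8, 10, 13, 14, 16, 18, 20, 22]

mean = statistics.mean(data)        # sum of the values divided by their count
median = statistics.median(data)    # average of the two middle values, 10 and 13
modes = statistics.multimode(data)  # every value that occurs most often
value_range = max(data) - min(data) # difference between largest and smallest

# Variations described in the text.
two_modes = statistics.multimode(data + [10])     # a second instance of 10
one_mode = statistics.multimode(data + [4, 10])   # three 4s, two 10s

print(mean, median, modes, value_range)
print(two_modes, one_mode)
```

The multimode function is used rather than mode because it returns every value tied for the highest frequency, which matches the two-mode case in the text.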
