
Numerical data: Normalization | Machine Learning

After examining your data through statistical and visualization techniques, you should transform your data in ways that will help your model train more effectively. The goal of normalization is to transform features to be on a similar scale. For example, consider two features, X and Y, that span very different ranges. Normalization might manipulate X and Y so that they span a similar range, perhaps 0 to 1.

Normalization helps models converge more quickly during training and helps the model learn an appropriate weight for each feature, rather than overweighting features simply because they span wider ranges.

We recommend normalizing numeric features covering distinctly different ranges (for example, age and income). We also recommend normalizing a single numeric feature that covers a wide range, such as city population.

Warning: If you normalize a feature during training, you must also normalize that feature when making predictions.
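One common way to follow this rule is to compute the normalization parameters from the training set and store them for reuse at prediction time. The following Python sketch illustrates the idea using made-up values and linear scaling (described later on this page); the variable names are purely illustrative:

  import numpy as np

  # Hypothetical training values for a single numeric feature.
  train_values = np.array([120.0, 250.0, 310.0, 480.0, 620.0])

  # Compute the scaling parameters from the training data only.
  x_min, x_max = train_values.min(), train_values.max()
  train_scaled = (train_values - x_min) / (x_max - x_min)

  # At prediction time, reuse the stored parameters rather than
  # recomputing them from the serving data.
  new_value = 400.0
  new_value_scaled = (new_value - x_min) / (x_max - x_min)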

Consider two features, Feature A and Feature B, that both have relatively narrow spans; however, Feature B's span is 10 times wider than Feature A's span. The overall damage due to not normalizing these two features will be relatively small, but we still recommend normalizing Feature A and Feature B to the same scale, perhaps -1.0 to +1.0.

Now consider two features, Feature C and Feature D, whose ranges differ far more dramatically. If you don't normalize Feature C and Feature D, your model will likely be suboptimal. Furthermore, training will take much longer to converge, or may even fail to converge entirely.

This section covers three popular normalization methods:

- Linear scaling
- Z-score scaling
- Log scaling

This section additionally covers clipping. Although not a true normalization technique, clipping does tame unruly numerical features into ranges that produce better models.

Linear scaling

Linear scaling (more commonly shortened to just scaling) means converting floating-point values from their natural range into a standard range—usually 0 to 1 or -1 to +1.


Use the following formula to scale to the standard range 0 to 1, inclusive:

$$ x' = (x - x_{min}) / (x_{max} - x_{min}) $$

where:

- $x'$ is the scaled value.
- $x$ is the original value.
- $x_{min}$ and $x_{max}$ are the minimum and maximum values of the feature.

For example, consider a feature named quantity whose natural range spans 100 to 900. Suppose the natural value of quantity in a particular example is 300. Therefore, you can calculate the normalized value of 300 as follows:

x' = (300 - 100) / (900 - 100)
x' = 200 / 800
x' = 0.25
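
The same calculation can be expressed as a short Python sketch; the function name and values are only for illustration:

  def linear_scale(x, x_min, x_max):
      """Linearly scales x from [x_min, x_max] into the range [0, 1]."""
      return (x - x_min) / (x_max - x_min)

  # The quantity feature spans a natural range of 100 to 900.
  print(linear_scale(300, x_min=100, x_max=900))  # 0.25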

Linear scaling is a good choice when all of the following conditions are met:

- The lower and upper bounds of your data don't change much over time.
- The feature contains few or no outliers, and any outliers aren't extreme.
- The feature is approximately uniformly distributed across its range.

Suppose human age is a feature. Linear scaling is a good normalization technique for age because:

- The approximate lower and upper bounds (roughly 0 to 100) are stable and well known.
- age contains relatively few outliers, and those outliers aren't extreme.
- Although some ages are more common than others, the values are reasonably well spread across the range.

Note: Most real-world features do not meet all of the criteria for linear scaling. Z-score scaling is typically a better normalization choice than linear scaling.

Exercise: Check your understanding

Suppose your model has a feature named net_worth that holds the net worth of different people. Would linear scaling be a good normalization technique for net_worth? Why or why not?


Answer: Linear scaling would be a poor choice for normalizing net_worth. This feature contains many outliers, and the values are not uniformly distributed across its primary range. Most people would be squeezed within a very narrow band of the overall range.

Z-score scaling

A Z-score is the number of standard deviations a value is from the mean. For example, a value that is 2 standard deviations greater than the mean has a Z-score of +2.0. A value that is 1.5 standard deviations less than the mean has a Z-score of -1.5.

Representing a feature with Z-score scaling means storing that feature's Z-score in the feature vector. For example, the following figure shows two histograms:

Figure 4. Raw data (left) versus Z-score (right) for a normal distribution.

Z-score scaling is also a good choice for data like that shown in the following figure, which has only a vaguely normal distribution.

Figure 5. Raw data (left) versus Z-score scaling (right) for a non-classic normal distribution.

Use the following formula to normalize a value, $x$, to its Z-score:

$$ x' = (x - \mu) / \sigma $$

where:

- $x'$ is the Z-score.
- $x$ is the raw value.
- $\mu$ is the mean of the feature.
- $\sigma$ is the standard deviation of the feature.

For example, suppose:

- The mean is 100.
- The standard deviation is 20.
- The raw value is 130.

Therefore:

  Z-score = (130 - 100) / 20
  Z-score = 30 / 20
  Z-score = +1.5
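
Here is a minimal Python sketch of the same computation. The sample values are hypothetical; in practice you would estimate the mean and standard deviation from your training data:

  import numpy as np

  def z_score(x, mean, std_dev):
      """Returns how many standard deviations x lies from the mean."""
      return (x - mean) / std_dev

  print(z_score(130, mean=100, std_dev=20))  # 1.5

  # Estimating the parameters from (hypothetical) training data:
  values = np.array([80.0, 95.0, 100.0, 105.0, 130.0])
  scaled = (values - values.mean()) / values.std()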

In a classic normal distribution:

- About 68% of values fall within 1 standard deviation of the mean.
- About 95% of values fall within 2 standard deviations of the mean.
- About 99.7% of values fall within 3 standard deviations of the mean.
- About 99.99% of values fall within 4 standard deviations of the mean.

So, data points with a Z-score less than -4.0 or more than +4.0 are rare, but are they truly outliers? Since "outliers" is a concept without a strict definition, no one can say for sure. Note that a dataset with a sufficiently large number of examples will almost certainly contain at least a few of these "rare" examples. For example, a feature with one billion examples conforming to a classic normal distribution could have roughly 60,000 examples with a Z-score outside the range -4.0 to +4.0.
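
As a rough sanity check of that figure, the following sketch (assuming SciPy is available) estimates how many examples out of one billion a classic normal distribution places outside the range -4.0 to +4.0:

  from scipy.stats import norm

  # Probability that a standard normal value falls outside [-4, +4].
  p_outside = 2 * norm.cdf(-4.0)

  # Expected count in a dataset of one billion examples.
  print(p_outside * 1e9)  # about 63,000, in line with the ballpark above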

Z-score scaling is a good choice when the data follows a normal distribution or a distribution somewhat like a normal distribution.

Note that some distributions might be normal within the bulk of their range, but still contain extreme outliers. For example, nearly all of the points in a net_worth feature might fit neatly within 3 standard deviations, but a few examples of this feature could be hundreds of standard deviations away from the mean. In these situations, you can combine Z-score scaling with another form of normalization (usually clipping).

Exercise: Check your understanding

Suppose your model trains on a feature named height that holds the adult heights of ten million women. Would Z-score scaling be a good normalization technique for height? Why or why not?


Answer: Z-score scaling would be a good normalization technique for height because this feature conforms to a normal distribution. Ten million examples implies a lot of outliers—probably enough outliers for the model to learn patterns on very high or very low Z-scores.

Log scaling

Log scaling computes the logarithm of the raw value. In theory, the logarithm could be any base; in practice, log scaling usually calculates the natural logarithm (ln).


Use the following formula to normalize a value, $x$, to its log:

$$ x' = \ln(x) $$

where $x'$ is the natural logarithm of the original value $x$.

For example, suppose the original value is 54.598. Then the log of the original value is about 4.0:

  4.0 = ln(54.598)

Log scaling is helpful when the data conforms to a power law distribution. Casually speaking, in a power law distribution most examples have low values of the feature, while a small number of examples have extremely high values, forming a long tail.

Movie ratings are a good example of a power law distribution: a few movies receive an enormous number of ratings, while most movies receive very few. In the following figure, notice how differently the raw values and their logs are distributed.

Log scaling changes the distribution, which helps train a model that will make better predictions.

Figure 6. Comparing a raw distribution to its log.

As a second example, book sales conform to a power law distribution: most books sell only a handful of copies, some sell a moderate number, and only a few bestsellers sell millions of copies.

Suppose you are training a linear model to find the relationship of, say, book covers to book sales. A linear model training on raw values would have to find something about the covers of books that sell a million copies that is 10,000 times more powerful than the covers of books that sell only 100 copies. However, log scaling all the sales figures makes the task far more feasible. For example, the log of 100 is:

  ~4.6 = ln(100)

while the log of 1,000,000 is:

  ~13.8 = ln(1,000,000)

So, the log of 1,000,000 is only about three times larger than the log of 100. You probably could imagine a bestseller book cover being about three times more powerful (in some way) than a tiny-selling book cover.
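
A minimal sketch of log scaling with NumPy, using hypothetical sales figures:

  import numpy as np

  # Hypothetical yearly sales for three books: a flop, a modest seller,
  # and a bestseller.
  sales = np.array([100, 10_000, 1_000_000])

  log_sales = np.log(sales)  # natural logarithm
  print(log_sales)           # approximately [4.6, 9.2, 13.8]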

Clipping

Clipping is a technique to minimize the influence of extreme outliers. In brief, clipping usually caps (reduces) the value of outliers to a specific maximum value. Clipping is a strange idea, and yet, it can be very effective.

For example, imagine a dataset containing a feature named roomsPerPerson, which represents the number of rooms (total rooms divided by number of occupants) for various houses. The following plot shows that over 99% of the feature values conform to a normal distribution (roughly, a mean of 1.8 and a standard deviation of 0.7). However, the feature contains a few outliers, some of them extreme:

Figure 7. Mainly normal, but not completely normal.

How can you minimize the influence of those extreme outliers? Well, the histogram is not a uniform distribution, a normal distribution, or a power law distribution. What if you simply cap or clip the maximum value of roomsPerPerson at an arbitrary value, say 4.0?

Figure 8. Clipping feature values at 4.0.

Clipping the feature value at 4.0 doesn't mean that your model ignores all values greater than 4.0. Rather, it means that all values that were greater than 4.0 now become 4.0. This explains the peculiar hill at 4.0. Despite that hill, the clipped feature set is now more useful than the original data.

Wait a second! Can you really reduce every outlier value to some arbitrary upper threshold? When training a model, yes.

You can also clip values after applying other forms of normalization. For example, suppose you use Z-score scaling, but a few outliers have absolute values far greater than 3. In this case, you could:

- Clip any Z-score greater than 3 to be exactly 3.
- Clip any Z-score less than -3 to be exactly -3.

Clipping prevents your model from overindexing on unimportant data. However, some outliers are actually important, so clip values carefully.
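
The following NumPy sketch, using made-up roomsPerPerson values, shows both ideas: capping raw values at 4.0, and clipping Z-scores to the range -3 to +3:

  import numpy as np

  # Hypothetical roomsPerPerson values; the last two are extreme outliers.
  rooms_per_person = np.array([1.2, 1.8, 2.5, 3.9, 28.0, 55.0])

  # Cap every value greater than 4.0 at exactly 4.0.
  clipped = np.clip(rooms_per_person, None, 4.0)

  # Clipping can also follow Z-score scaling: force Z-scores into [-3, +3].
  z_scores = (rooms_per_person - rooms_per_person.mean()) / rooms_per_person.std()
  z_clipped = np.clip(z_scores, -3.0, 3.0)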

Summary of normalization techniques

The best normalization technique is one that works well in practice, so try new ideas if you think they'll work well on your feature distribution.

| Normalization technique | Formula | When to use |
| --- | --- | --- |
| Linear scaling | $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$ | The feature is mostly uniformly distributed across its range (flat-shaped). |
| Z-score scaling | $x' = \frac{x - \mu}{\sigma}$ | The feature is roughly normally distributed, with a peak close to the mean (bell-shaped). |
| Log scaling | $x' = \ln(x)$ | The feature distribution is heavily skewed, with a long tail on at least one side (heavy tail-shaped). |
| Clipping | If $x > max$, set $x' = max$; if $x < min$, set $x' = min$ | The feature contains extreme outliers. |

Exercise: Test your knowledge

Which technique would be most suitable for normalizing a feature with the following distribution?

Z-score scaling

The data points generally conform to a normal distribution, so Z-score scaling will map nearly all of them into roughly the range -3 to +3.

Linear scaling

Review the discussions of the normalization techniques on this page, and try again.

Log scaling

Review the discussions of the normalization techniques on this page, and try again.

Clipping

Review the discussions of the normalization techniques on this page, and try again.

Suppose you are developing a model that predicts a data center's productivity based on the temperature measured inside the data center. Almost all of the temperature values in your dataset fall between 15 and 30 (Celsius), with the following exceptions:

- A handful of legitimate readings between 31 and 45.
- A few readings of 1,000, which are clearly mistakes.

Which would be a reasonable normalization technique for temperature?

Clip the outlier values between 31 and 45, but delete the outliers with a value of 1,000

The values of 1,000 are mistakes, and should be deleted rather than clipped.

The values between 31 and 45 are legitimate data points. Clipping would probably be a good idea for these values, assuming the dataset doesn't contain enough examples in this temperature range to train the model to make good predictions. Note, however, that during inference the clipped model would make the same prediction for a temperature of 45 as for a temperature of 35.

Clip all the outliers

Review the discussions of the normalization techniques on this page, and try again.

Delete all the outliers

Review the discussions of the normalization techniques on this page, and try again.

Delete the outlier values between 31 and 45, but clip the outliers with a value of 1,000.

Review the discussions of the normalization techniques on this page, and try again.

