
Errors-in-variables model


Regression models accounting for possible errors in independent variables

Illustration of regression dilution (or attenuation bias) by a range of regression estimates in errors-in-variables models. Two regression lines (red) bound the range of linear regression possibilities. The shallow slope is obtained when the independent variable (or predictor) is on the x-axis. The steeper slope is obtained when the independent variable is on the y-axis. By convention, with the independent variable on the x-axis, the shallower slope is obtained. Green reference lines are averages within arbitrary bins along each axis. Note that the steeper green and red regression estimates are more consistent with smaller errors in the y-axis variable.

In statistics, an errors-in-variables model or a measurement error model is a regression model that accounts for measurement errors in the independent variables. In contrast, standard regression models assume that those regressors have been measured exactly, or observed without error; as such, those models account only for errors in the dependent variables, or responses.[citation needed]

In the case when some regressors have been measured with errors, estimation based on the standard assumption leads to inconsistent estimates, meaning that the parameter estimates do not tend to the true values even in very large samples. For simple linear regression the effect is an underestimate of the coefficient, known as the attenuation bias. In non-linear models the direction of the bias is likely to be more complicated.[1][2][3]

Motivating example

Consider a simple linear regression model of the form

$$y_t = \alpha + \beta x_t^{*} + \varepsilon_t, \qquad t = 1, \ldots, T,$$

where $x_t^{*}$ denotes the true but unobserved regressor. Instead, we observe this value with an error:

$$x_t = x_t^{*} + \eta_t,$$

where the measurement error $\eta_t$ is assumed to be independent of the true value $x_t^{*}$.
A practical application is the standard school science experiment for Hooke's law, in which one estimates the relationship between the weight added to a spring and the amount by which the spring stretches.
If the $y_t$'s are simply regressed on the $x_t$'s (see simple linear regression), then the estimator for the slope coefficient is

$$\hat{\beta}_x = \frac{\tfrac{1}{T}\sum_{t=1}^{T}(x_t - \bar{x})(y_t - \bar{y})}{\tfrac{1}{T}\sum_{t=1}^{T}(x_t - \bar{x})^2}\,,$$

which converges as the sample size $T$ increases without bound:

$$\hat{\beta}_x \xrightarrow{p} \frac{\operatorname{Cov}[\,x_t, y_t\,]}{\operatorname{Var}[\,x_t\,]} = \frac{\beta\,\sigma_{x^{*}}^{2}}{\sigma_{x^{*}}^{2} + \sigma_{\eta}^{2}} = \frac{\beta}{1 + \sigma_{\eta}^{2}/\sigma_{x^{*}}^{2}}\,.$$
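This limit follows directly from the model's assumptions. Writing $x_t = x_t^{*} + \eta_t$ and $y_t = \alpha + \beta x_t^{*} + \varepsilon_t$, and assuming (as in the classical setting) that $\varepsilon_t$ is uncorrelated with both $x_t^{*}$ and $\eta_t$,

$$\operatorname{Cov}[\,x_t, y_t\,] = \operatorname{Cov}[\,x_t^{*} + \eta_t,\ \alpha + \beta x_t^{*} + \varepsilon_t\,] = \beta\,\sigma_{x^{*}}^{2}, \qquad \operatorname{Var}[\,x_t\,] = \operatorname{Var}[\,x_t^{*} + \eta_t\,] = \sigma_{x^{*}}^{2} + \sigma_{\eta}^{2},$$

so the ratio equals the attenuated value $\beta\,\sigma_{x^{*}}^{2}/(\sigma_{x^{*}}^{2}+\sigma_{\eta}^{2})$ shown above.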

This is in contrast to the "true" effect of $\beta$, estimated using the $x_t^{*}$:

$$\hat{\beta} = \frac{\tfrac{1}{T}\sum_{t=1}^{T}(x_t^{*} - \bar{x}^{*})(y_t - \bar{y})}{\tfrac{1}{T}\sum_{t=1}^{T}(x_t^{*} - \bar{x}^{*})^2}\,.$$

Variances are non-negative, so that in the limit the estimated $\hat{\beta}_x$ is smaller than $\hat{\beta}$, an effect which statisticians call attenuation or regression dilution.[4] Thus the 'naïve' least squares estimator $\hat{\beta}_x$ is an inconsistent estimator for $\beta$. However, $\hat{\beta}_x$ is a consistent estimator of the parameter required for a best linear predictor of $y$ given the observed $x_t$: in some applications this may be what is required, rather than an estimate of the 'true' regression coefficient $\beta$, although that would assume that the variance of the errors in the estimation and prediction is identical. This follows directly from the result quoted immediately above, and the fact that the regression coefficient relating the $y_t$'s to the actually observed $x_t$'s, in a simple linear regression, is given by

$$\beta_x = \frac{\operatorname{Cov}[\,x_t, y_t\,]}{\operatorname{Var}[\,x_t\,]}.$$

It is this coefficient, rather than $\beta$, that would be required for constructing a predictor of $y$ based on an observed $x$ which is subject to noise.
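The attenuation factor $1/(1+\sigma_{\eta}^{2}/\sigma_{x^{*}}^{2})$ is easy to check numerically. The following is a minimal simulation sketch (not part of the article; the parameter values and variable names are purely illustrative) in which $y$ is regressed on the error-contaminated $x$ and the naive slope is compared with the theoretical limit.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100_000                              # large sample, so the estimate sits near its probability limit
alpha, beta = 1.0, 2.0                   # illustrative "true" parameters
sigma_xstar, sigma_eta, sigma_eps = 1.0, 0.5, 0.3

x_star = rng.normal(0.0, sigma_xstar, T)             # latent regressor x*
x = x_star + rng.normal(0.0, sigma_eta, T)           # observed regressor, contaminated by eta
y = alpha + beta * x_star + rng.normal(0.0, sigma_eps, T)

# naive OLS slope of y on the observed x
beta_hat_x = np.cov(x, y, bias=True)[0, 1] / np.var(x)

# theoretical probability limit: beta / (1 + sigma_eta^2 / sigma_xstar^2) = 1.6 here
beta_limit = beta / (1.0 + sigma_eta**2 / sigma_xstar**2)

print(beta_hat_x, beta_limit)            # both well below the true beta = 2.0
```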

It can be argued that almost all existing data sets contain errors of different nature and magnitude, so that attenuation bias is extremely frequent (although in multivariate regression the direction of bias is ambiguous[5]). Jerry Hausman sees this as an iron law of econometrics: "The magnitude of the estimate is usually smaller than expected."[6]

Usually, measurement error models are described using the latent variables approach. If $y$ is the response variable and $x$ are the observed values of the regressors, then it is assumed there exist some latent variables $y^{*}$ and $x^{*}$ which follow the model's "true" functional relationship $g(\cdot)$, and such that the observed quantities are their noisy observations:

$$\begin{cases} y^{*} = g(x^{*}\!, w \mid \theta), \\ y = y^{*} + \varepsilon, \\ x = x^{*} + \eta, \end{cases}$$

where $\theta$ is the model's parameter and $w$ are those regressors which are assumed to be error-free (for example, when linear regression contains an intercept, the regressor which corresponds to the constant certainly has no "measurement errors"). Depending on the specification, these error-free regressors may or may not be treated separately; in the latter case it is simply assumed that the corresponding entries in the variance matrix of the $\eta$'s are zero.

The variables $y$, $x$, $w$ are all observed, meaning that the statistician possesses a data set of $n$ statistical units $\{y_i, x_i, w_i\}_{i=1,\dots,n}$ which follow the data generating process described above; the latent variables $x^{*}$, $y^{*}$, $\varepsilon$, and $\eta$ are not observed, however.

This specification does not encompass all the existing errors-in-variables models. For example, in some of them the function $g(\cdot)$ may be non-parametric or semi-parametric. Other approaches model the relationship between $y^{*}$ and $x^{*}$ as distributional instead of functional; that is, they assume that $y^{*}$ conditional on $x^{*}$ follows a certain (usually parametric) distribution.
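For the parametric functional specification above, the data-generating process is straightforward to simulate. The sketch below uses a hypothetical choice $g(x^{*}, w \mid \theta) = \theta_0 + \theta_1 (x^{*})^2 + \theta_2 w$ and arbitrary noise levels, purely to illustrate which quantities the statistician observes and which remain latent.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
theta = (0.5, 2.0, -1.0)                 # hypothetical parameter vector theta

w = rng.uniform(-1.0, 1.0, n)            # error-free regressor w (observed exactly)
x_star = rng.normal(0.0, 1.0, n)         # latent regressor x* (never observed)

# "true" functional relationship y* = g(x*, w | theta); the quadratic g is illustrative only
y_star = theta[0] + theta[1] * x_star**2 + theta[2] * w

y = y_star + rng.normal(0.0, 0.3, n)     # observed response:  y = y* + eps
x = x_star + rng.normal(0.0, 0.5, n)     # observed regressor: x = x* + eta

observed = np.column_stack([y, x, w])    # the data set {y_i, x_i, w_i}, i = 1, ..., n
```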

Terminology and assumptions

Linear errors-in-variables models were studied first, probably because linear models were so widely used and they are easier than non-linear ones. Unlike standard least squares regression (OLS), extending errors-in-variables regression (EiV) from the simple to the multivariable case is not straightforward, unless one treats all variables in the same way, i.e., assumes equal reliability.[10]

Simple linear model

The simple linear errors-in-variables model was already presented in the "motivation" section:

$$\begin{cases} y_t = \alpha + \beta x_t^{*} + \varepsilon_t, \\ x_t = x_t^{*} + \eta_t, \end{cases}$$

where all variables are scalar. Here α and β are the parameters of interest, whereas σε and ση—standard deviations of the error terms—are the nuisance parameters. The "true" regressor x* is treated as a random variable (structural model), independent of the measurement error η (classic assumption).

This model is identifiable in two cases: either (1) the latent regressor $x^{*}$ is not normally distributed, or (2) $x^{*}$ is normally distributed, but neither $\varepsilon_t$ nor $\eta_t$ is divisible by a normal distribution.[11] That is, the parameters $\alpha$, $\beta$ can be consistently estimated from the data set $(x_t, y_t)_{t=1}^{T}$ without any additional information, provided the latent regressor is not Gaussian.

Before this identifiability result was established, statisticians attempted to apply the maximum likelihood technique by assuming that all variables are normal, and then concluded that the model is not identified. The suggested remedy was to assume that some of the parameters of the model are known or can be estimated from an outside source. Such estimation methods include[12]

Estimation methods that do not assume knowledge of some of the parameters of the model include
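One classical example of the first family (methods that assume a parameter is known from an outside source) is Deming regression, which takes the ratio $\delta = \sigma_{\varepsilon}^{2}/\sigma_{\eta}^{2}$ of the two error variances as known. A minimal sketch, with the data arrays and the value of $\delta$ supplied by the caller (the function name and defaults are illustrative, not from the article):

```python
import numpy as np

def deming_fit(x, y, delta=1.0):
    """Deming regression for y_t = alpha + beta*x_t* + eps_t, x_t = x_t* + eta_t,
    assuming delta = var(eps)/var(eta) is known (delta = 1 gives orthogonal regression)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxx = np.var(x)                                  # sample variance of observed x
    syy = np.var(y)                                  # sample variance of y
    sxy = np.cov(x, y, bias=True)[0, 1]              # sample covariance
    beta = (syy - delta * sxx
            + np.sqrt((syy - delta * sxx) ** 2 + 4.0 * delta * sxy ** 2)) / (2.0 * sxy)
    alpha = y.mean() - beta * x.mean()
    return alpha, beta
```

Applied to the simulated data from the attenuation sketch above with delta = sigma_eps**2 / sigma_eta**2, the recovered slope is close to the true β rather than the attenuated value.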

Multivariable linear model

The multivariable model looks exactly like the simple linear model, only this time β, ηt, xt and x*t are k×1 vectors.

$$\begin{cases} y_t = \alpha + \beta' x_t^{*} + \varepsilon_t, \\ x_t = x_t^{*} + \eta_t. \end{cases}$$

In the case when (εt, ηt) is jointly normal, the parameter β is not identified if and only if there is a non-singular k×k block matrix [a A], where a is a k×1 vector, such that a′x* is distributed normally and independently of A′x*. In the case when εt, ηt1, ..., ηtk are mutually independent, the parameter β is not identified if and only if, in addition to the conditions above, some of the errors can be written as the sum of two independent variables, one of which is normal.[15]

Some of the estimation methods for multivariable linear models are
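One widely used technique in this setting is total least squares, which treats every variable as measured with error and can be computed from a singular value decomposition of the augmented data matrix. A minimal sketch, under the strong (and here purely illustrative) assumption that all measurement errors have comparable variance, so that no rescaling of the columns is needed:

```python
import numpy as np

def tls_fit(X, y):
    """Total least squares for y ~ alpha + X @ beta, treating every column of X
    and y as error-prone with errors of comparable variance."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    Xc = X - X.mean(axis=0)                  # centering absorbs the intercept alpha
    yc = y - y.mean()
    Z = np.column_stack([Xc, yc])            # augmented matrix [X  y]
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    v = vt[-1]                               # right singular vector for the smallest singular value
    beta = -v[:-1] / v[-1]                   # [X  y] @ [beta; -1] is (approximately) zero
    alpha = y.mean() - X.mean(axis=0) @ beta
    return alpha, beta
```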

A generic non-linear measurement error model takes the form

$$\begin{cases} y_t = g(x_t^{*}) + \varepsilon_t, \\ x_t = x_t^{*} + \eta_t. \end{cases}$$

Here the function g can be either parametric or non-parametric. When g is parametric, it will be written as g(x*, β).

For a general vector-valued regressor x* the conditions for model identifiability are not known. However, in the case of scalar x* the model is identified unless the function g is of the "log-exponential" form[20]

$$g(x^{*}) = a + b\ln\!\big(e^{c x^{*}} + d\big)$$

and the latent regressor x* has density

$$f_{x^{*}}(x) = \begin{cases} A e^{-B e^{Cx} + CDx}\,(e^{Cx} + E)^{-F}, & \text{if } d > 0, \\ A e^{-Bx^{2} + Cx}, & \text{if } d = 0, \end{cases}$$

where constants A,B,C,D,E,F may depend on a,b,c,d.

Despite this optimistic result, as of now no methods exist for estimating non-linear errors-in-variables models without any extraneous information. However, there are several techniques which make use of some additional data: either the instrumental variables, or repeated observations.

Instrumental variables methods

Repeated observations

In this approach two (or maybe more) repeated observations of the regressor x* are available. Both observations contain their own measurement errors; however, those errors are required to be independent:

$$\begin{cases} x_{1t} = x_t^{*} + \eta_{1t}, \\ x_{2t} = x_t^{*} + \eta_{2t}, \end{cases}$$

where $x^{*} \perp \eta_{1} \perp \eta_{2}$. Variables η1, η2 need not be identically distributed (although if they are, the efficiency of the estimator can be slightly improved). With only these two observations it is possible to consistently estimate the density function of x* using Kotlarski's deconvolution technique.[22]
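Short of full deconvolution, the repeated measurement already delivers a consistent slope estimate in the linear case through a standard instrumental-variables argument: because $\eta_{1}$ and $\eta_{2}$ are independent of each other and of $x^{*}$, $\operatorname{Cov}[x_{2t}, y_t] = \beta\sigma_{x^{*}}^{2}$ and $\operatorname{Cov}[x_{2t}, x_{1t}] = \sigma_{x^{*}}^{2}$, so their ratio converges to $\beta$. A minimal sketch of this (standard, not article-specific) estimator:

```python
import numpy as np

def iv_slope_repeated(x1, x2, y):
    """Use the second noisy measurement x2 of x* as an instrument for the first, x1.
    Consistent for beta in y = alpha + beta*x* + eps when the measurement errors
    eta1, eta2 and x* are mutually independent."""
    x1, x2, y = (np.asarray(a, float) for a in (x1, x2, y))
    return np.cov(x2, y, bias=True)[0, 1] / np.cov(x2, x1, bias=True)[0, 1]
```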

  1. ^ Griliches, Zvi; Ringstad, Vidar (1970). "Errors-in-the-variables bias in nonlinear contexts". Econometrica. 38 (2): 368–370. doi:10.2307/1913020. JSTOR 1913020.
  2. ^ Chesher, Andrew (1991). "The effect of measurement error". Biometrika. 78 (3): 451–462. doi:10.1093/biomet/78.3.451. JSTOR 2337015.
  3. ^ Carroll, Raymond J.; Ruppert, David; Stefanski, Leonard A.; Crainiceanu, Ciprian (2006). Measurement Error in Nonlinear Models: A Modern Perspective (Second ed.). CRC Press. ISBN 978-1-58488-633-4.
  4. ^ Greene, William H. (2003). Econometric Analysis (5th ed.). New Jersey: Prentice Hall. Chapter 5.6.1. ISBN 978-0-13-066189-0.
  5. ^ Wansbeek, T.; Meijer, E. (2000). "Measurement Error and Latent Variables". In Baltagi, B. H. (ed.). A Companion to Theoretical Econometrics. Blackwell. pp. 162–179. doi:10.1111/b.9781405106764.2003.00013.x. ISBN 9781405106764.
  6. ^ Hausman, Jerry A. (2001). "Mismeasured variables in econometric analysis: problems from the right and problems from the left". Journal of Economic Perspectives. 15 (4): 57–67 [p. 58]. doi:10.1257/jep.15.4.57. JSTOR 2696516.
  7. ^ Fuller, Wayne A. (1987). Measurement Error Models. John Wiley & Sons. p. 2. ISBN 978-0-471-86187-4.
  8. ^ Hayashi, Fumio (2000). Econometrics. Princeton University Press. pp. 7–8. ISBN 978-1400823833.
  9. ^ Koul, Hira; Song, Weixing (2008). "Regression model checking with Berkson measurement errors". Journal of Statistical Planning and Inference. 138 (6): 1615–1628. doi:10.1016/j.jspi.2007.05.048.
  10. ^ Tofallis, C. (2023). "Fitting an Equation to Data Impartially". Mathematics. 11 (18): 3957. https://ssrn.com/abstract=4556739 https://doi.org/10.3390/math11183957
  11. ^ Reiersøl, Olav (1950). "Identifiability of a linear relation between variables which are subject to error". Econometrica. 18 (4): 375–389 [p. 383]. doi:10.2307/1907835. JSTOR 1907835. A somewhat more restrictive result was established earlier by Geary, R. C. (1942). "Inherent relations between random variables". Proceedings of the Royal Irish Academy. 47: 63–76. JSTOR 20488436. He showed that under the additional assumption that (ε, η) are jointly normal, the model is not identified if and only if x*s are normal.
  12. ^ Fuller, Wayne A. (1987). "A Single Explanatory Variable". Measurement Error Models. John Wiley & Sons. pp. 1–99. ISBN 978-0-471-86187-4.
  13. ^ Pal, Manoranjan (1980). "Consistent moment estimators of regression coefficients in the presence of errors in variables". Journal of Econometrics. 14 (3): 349–364 (pp. 360–361). doi:10.1016/0304-4076(80)90032-9.
  14. ^ Xu, Shaoji (2014-10-02). "A Property of Geometric Mean Regression". The American Statistician. 68 (4): 277–281. doi:10.1080/00031305.2014.962763. ISSN 0003-1305.
  15. ^ Ben-Moshe, Dan (2020). "Identification of linear regressions with errors in all variables". Econometric Theory. 37 (4): 1–31. arXiv:1404.1473. doi:10.1017/S0266466620000250. S2CID 225653359.
  16. ^ Dagenais, Marcel G.; Dagenais, Denyse L. (1997). "Higher moment estimators for linear regression models with errors in the variables". Journal of Econometrics. 76 (1–2): 193–221. CiteSeerX 10.1.1.669.8286. doi:10.1016/0304-4076(95)01789-5. In the earlier paper Pal (1980) considered a simpler case when all components in vector (ε, η) are independent and symmetrically distributed.
  17. ^ Fuller, Wayne A. (1987). Measurement Error Models. John Wiley & Sons. p. 184. ISBN 978-0-471-86187-4.
  18. ^ Erickson, Timothy; Whited, Toni M. (2002). "Two-step GMM estimation of the errors-in-variables model using high-order moments". Econometric Theory. 18 (3): 776–799. doi:10.1017/s0266466602183101. JSTOR 3533649. S2CID 14729228.
  19. ^ Tofallis, C. (2023). "Fitting an Equation to Data Impartially". Mathematics. 11 (18): 3957. https://ssrn.com/abstract=4556739 https://doi.org/10.3390/math11183957
  20. ^ Schennach, S.; Hu, Y.; Lewbel, A. (2007). "Nonparametric identification of the classical errors-in-variables model without side information". Working Paper.
  21. ^ Newey, Whitney K. (2001). "Flexible simulated moment estimation of nonlinear errors-in-variables model". Review of Economics and Statistics. 83 (4): 616–627. doi:10.1162/003465301753237704. hdl:1721.1/63613. JSTOR 3211757. S2CID 57566922.
  22. ^ Li, Tong; Vuong, Quang (1998). "Nonparametric estimation of the measurement error model using multiple indicators". Journal of Multivariate Analysis. 65 (2): 139–165. doi:10.1006/jmva.1998.1741.
  23. ^ Li, Tong (2002). "Robust and consistent estimation of nonlinear errors-in-variables models". Journal of Econometrics. 110 (1): 1–26. doi:10.1016/S0304-4076(02)00120-3.
  24. ^ Schennach, Susanne M. (2004). "Estimation of nonlinear models with measurement error". Econometrica. 72 (1): 33–75. doi:10.1111/j.1468-0262.2004.00477.x. JSTOR 3598849.
  25. ^ Schennach, Susanne M. (2004). "Nonparametric regression in the presence of measurement error". Econometric Theory. 20 (6): 1046–1093. doi:10.1017/S0266466604206028. S2CID 123036368.
