Probstat/notes/regression

This is part of probstat.

In this section, we shall discuss linear regression. We shall focus on one-variable linear regression.

Contents

1 Model
2 The least squares estimators
3 Distribution of regression parameters
4 Statistical tests on regression parameters

Model

We consider two variables $X$ and $Y$ where $Y$ depends on $X$ . We refer to $X$ as the independent (or input) variable, and to $Y$ as the dependent variable. We consider a linear relationship between the independent variable and the dependent variable. We assume that there exist unknown parameters $\alpha$ and $\beta$ such that

$Y=\alpha +\beta \cdot X+e,$

where $e$ is a random error. We further assume that the error is unbiased, i.e., $E[e]=0$ , and that it is independent of $X$ .

Input: As an input to the regression process, we are given a set of $n$ data points: $(x_{1},y_{1}),(x_{2},y_{2}),\ldots ,(x_{n},y_{n})$ generated from the previous equation.

Goal: We want to estimate $\alpha$ and $\beta$ .
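To make the setup concrete, here is a minimal sketch that generates data points from the model. The function name `simulate_points` is hypothetical, and drawing the error from a normal distribution with standard deviation `sigma` is an assumption made only for illustration (the model so far only requires $E[e]=0$ ).

```python
import random

def simulate_points(alpha, beta, xs, sigma, seed=0):
    """Generate (x_i, y_i) pairs from y = alpha + beta * x + e.

    Assumption: the error e is drawn from Normal(0, sigma^2).
    A fixed seed makes the output reproducible.
    """
    rng = random.Random(seed)
    return [(x, alpha + beta * x + rng.gauss(0.0, sigma)) for x in xs]

# Five data points scattered around the line y = 1 + 2x.
points = simulate_points(alpha=1.0, beta=2.0, xs=[0, 1, 2, 3, 4], sigma=0.5)
for x, y in points:
    print(x, y)
```

In the regression problem, only the points are observed; the values of `alpha` and `beta` used to generate them are what we try to recover.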

The least squares estimators

Denote our estimate for $\alpha$ as $A$ and for $\beta$ as $B$ . Using $A$ and $B$ as estimates, the error at data point $(x_{i},y_{i})$ is

$y_{i}-(A+Bx_{i})=y_{i}-A-Bx_{i}$ .

We focus on the sum of squared errors, i.e.,

$SS=\sum _{i=1}^{n}(y_{i}-A-Bx_{i})^{2}$ .

The method of least squares uses the parameters that minimize the sum of squared errors as estimators. Therefore, we want to find $A$ and $B$ that minimize $SS$ . To do so, we partially differentiate $SS$ with respect to $A$ and $B$ :

${\frac {\partial }{\partial A}}SS=-2\sum _{i=1}^{n}(y_{i}-A-Bx_{i})$ (Eq1)

${\frac {\partial }{\partial B}}SS=-2\sum _{i=1}^{n}x_{i}(y_{i}-A-Bx_{i})$ (Eq2)

We set these two derivatives to zero to find the minimum, which gives the following two equations that we have to solve.

$\sum _{i=1}^{n}y_{i}=nA+B\sum _{i=1}^{n}x_{i}$

$\sum _{i=1}^{n}x_{i}y_{i}=A\sum _{i=1}^{n}x_{i}+B\sum _{i=1}^{n}x_{i}^{2}$

Before solving these two equations, let's define

${\bar {y}}=\sum _{i=1}^{n}y_{i}/n,\ \ \ \ {\bar {x}}=\sum _{i=1}^{n}x_{i}/n.$

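We start by dividing the first of these equations (the one obtained from Eq1) by $n$ and rewriting it as

$A={\bar {y}}-B{\bar {x}},$ (Eq3)

and substitute it into the second equation (the one obtained from Eq2) to get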

$\sum _{i=1}^{n}x_{i}y_{i}=({\bar {y}}-B{\bar {x}})\sum _{i=1}^{n}x_{i}+B\sum _{i=1}^{n}x_{i}^{2}$ .

With some calculation, we get

$B={\frac {\sum _{i=1}^{n}x_{i}y_{i}-n{\bar {x}}{\bar {y}}}{\sum _{i=1}^{n}x_{i}^{2}-n{\bar {x}}^{2}}}$ .
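The calculation can be spelled out as follows. Using $\sum _{i=1}^{n}x_{i}=n{\bar {x}}$ , the substituted equation becomes

$\sum _{i=1}^{n}x_{i}y_{i}=n{\bar {x}}{\bar {y}}-Bn{\bar {x}}^{2}+B\sum _{i=1}^{n}x_{i}^{2}$ ,

so that

$\sum _{i=1}^{n}x_{i}y_{i}-n{\bar {x}}{\bar {y}}=B\left(\sum _{i=1}^{n}x_{i}^{2}-n{\bar {x}}^{2}\right)$ .

Dividing both sides by $\sum _{i=1}^{n}x_{i}^{2}-n{\bar {x}}^{2}$ gives the formula for $B$ . This step assumes that the $x_{i}$ 's are not all equal, since otherwise the divisor is zero.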

To find $A$ , we can just use equation (Eq3).

Estimated regression parameters. Using the least squares method, we obtain the following estimates

$B={\frac {\sum _{i=1}^{n}x_{i}y_{i}-n{\bar {x}}{\bar {y}}}{\sum _{i=1}^{n}x_{i}^{2}-n{\bar {x}}^{2}}}$ ,

and

$A={\bar {y}}-B{\bar {x}}$ .
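The two formulas translate directly into code. The following is a minimal sketch, not a reference implementation; the function name `least_squares` and the choice to raise an error when all $x_{i}$ are equal are assumptions made for this example.

```python
def least_squares(points):
    """Return the estimates (A, B) for a list of (x_i, y_i) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n

    # Numerator and denominator of the formula for B.
    s_xy = sum(x * y for x, y in points) - n * x_bar * y_bar
    s_xx = sum(x * x for x, _ in points) - n * x_bar ** 2
    if s_xx == 0:
        raise ValueError("all x values are identical; B is undefined")

    b = s_xy / s_xx
    a = y_bar - b * x_bar  # this is (Eq3)
    return a, b

# Points lying exactly on the line y = 1 + 2x.
a, b = least_squares([(0, 1), (1, 3), (2, 5), (3, 7)])
print(a, b)
```

Because these example points have no error, the estimates coincide with the intercept and slope of the line; with noisy data they would only be close to them.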

Distribution of regression parameters

Although the estimators $A$ and $B$ are least-squares estimators, we are not sure if they are good estimators. In this section, we shall discuss various properties of these estimators.

We make an assumption on the error: it is normally distributed with variance $\sigma ^{2}$ , independently for each data point. Therefore,

$y_{i}\sim Normal(\alpha +\beta x_{i},\sigma ^{2})$ .

We shall start with $B$ . First, note that the $x_{i}$ 's are inputs and are not random. Therefore, if we look at the formula for $B$ , we see that $B$ is actually a linear combination of the $y_{i}$ 's, which are independent normal random variables. This implies that $B$ is a normal random variable. If we can find its mean and its variance, we have complete information about the distribution of $B$ .

We can calculate the expectation and variance of $B$ as follows.

$\mathrm {E} [B]=\beta$ .

$Var(B)={\frac {\sigma ^{2}}{\sum _{i=1}^{n}x_{i}^{2}-n{\bar {x}}^{2}}}$ .
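A sketch of how these two values can be derived, writing $S_{xx}=\sum _{i=1}^{n}x_{i}^{2}-n{\bar {x}}^{2}$ and assuming that the $y_{i}$ 's are independent. Since $\sum _{i=1}^{n}x_{i}y_{i}-n{\bar {x}}{\bar {y}}=\sum _{i=1}^{n}(x_{i}-{\bar {x}})y_{i}$ , we can write

$B={\frac {\sum _{i=1}^{n}(x_{i}-{\bar {x}})y_{i}}{S_{xx}}}$ .

Using $E[y_{i}]=\alpha +\beta x_{i}$ , $\sum _{i=1}^{n}(x_{i}-{\bar {x}})=0$ and $\sum _{i=1}^{n}(x_{i}-{\bar {x}})x_{i}=S_{xx}$ , we get

$E[B]={\frac {\alpha \sum _{i=1}^{n}(x_{i}-{\bar {x}})+\beta \sum _{i=1}^{n}(x_{i}-{\bar {x}})x_{i}}{S_{xx}}}=\beta$ .

Using $Var(y_{i})=\sigma ^{2}$ and $\sum _{i=1}^{n}(x_{i}-{\bar {x}})^{2}=S_{xx}$ , we get

$Var(B)={\frac {\sum _{i=1}^{n}(x_{i}-{\bar {x}})^{2}\sigma ^{2}}{S_{xx}^{2}}}={\frac {\sigma ^{2}}{S_{xx}}}$ .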

We can also calculate the expectation and variance of $A$ .

$E[A]=\alpha$ .

$Var(A)={\frac {\sigma ^{2}\sum _{i=1}^{n}x_{i}^{2}}{n(\sum _{i=1}^{n}x_{i}^{2}-n{\bar {x}}^{2})}}$ .
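These formulas can be checked empirically. The sketch below repeatedly generates data from the model with fixed inputs, computes $A$ and $B$ each time, and compares the sample mean and variance of the estimates with the formulas. The helper names, the parameter values, the number of trials and the seed are all arbitrary choices for this illustration.

```python
import random
import statistics

def least_squares(xs, ys):
    """Return (A, B) from the closed-form least squares formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    b = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / s_xx
    return y_bar - b * x_bar, b

def simulate_estimates(alpha, beta, sigma, xs, trials, seed=1):
    """Draw fresh normal errors `trials` times and collect the estimates."""
    rng = random.Random(seed)
    a_vals, b_vals = [], []
    for _ in range(trials):
        ys = [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]
        a, b = least_squares(xs, ys)
        a_vals.append(a)
        b_vals.append(b)
    return a_vals, b_vals

xs = [0.0, 1.0, 2.0, 3.0, 4.0]          # fixed, non-random inputs
alpha, beta, sigma = 1.0, 2.0, 1.0      # true parameters (known only in a simulation)
a_vals, b_vals = simulate_estimates(alpha, beta, sigma, xs, trials=20000)

# Theoretical variances from the formulas above.
n = len(xs)
s_xx = sum(x * x for x in xs) - n * (sum(xs) / n) ** 2
var_b_theory = sigma ** 2 / s_xx
var_a_theory = sigma ** 2 * sum(x * x for x in xs) / (n * s_xx)

print("mean of B:", statistics.fmean(b_vals), "formula:", beta)
print("var of B: ", statistics.pvariance(b_vals), "formula:", var_b_theory)
print("mean of A:", statistics.fmean(a_vals), "formula:", alpha)
print("var of A: ", statistics.pvariance(a_vals), "formula:", var_a_theory)
```

The simulated values should be close to, but not exactly equal to, the values given by the formulas; the agreement improves as the number of trials grows.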

Statistical tests on regression parameters

We focus on how to test the null hypothesis:

$H_{0}:\ \ \ \beta =0.$

Since $B$ is normal with mean $\beta$ and variance $\sigma ^{2}/(\sum _{i=1}^{n}x_{i}^{2}-n{\bar {x}}^{2})$ , we know that the statistic

${\frac {B-\beta }{\sigma /{\sqrt {(\sum _{i=1}^{n}x_{i}^{2}-n{\bar {x}}^{2})}}}}$

is standard normal, and it is possible to perform various statistical tests on $\beta$ based on the estimated value $B$ if we know the parameter $\sigma ^{2}$ . However, usually, we do not.

We end up with a situation similar to when we perform sampling on populations with unknown variances. Another key quantity in this case is the sum of squares of the residuals:

$SS_{R}=\sum _{i=1}^{n}(y_{i}-A-Bx_{i})^{2}$

Note that if we substitute $A$ with $\alpha$ and $B$ with $\beta$ , the term $y_{i}-\alpha -\beta x_{i}$ is exactly the error at data point $i$ (which is normally distributed with mean 0 and variance $\sigma ^{2}$ ). This motivates the fact that $SS_{R}$ can be used to estimate $\sigma ^{2}$ :

$E\left[{\frac {SS_{R}}{n-2}}\right]=\sigma ^{2}$ .

Moreover, it can be shown that

${\frac {SS_{R}}{\sigma ^{2}}}\sim \chi _{n-2}^{2}$ ,

and $SS_{R}$ is independent of $B$ . These two facts imply that

${\frac {\left({\frac {B-\beta }{\sigma /{\sqrt {(\sum _{i=1}^{n}x_{i}^{2}-n{\bar {x}}^{2})}}}}\right)}{\sqrt {\frac {SS_{R}}{\sigma ^{2}(n-2)}}}}={\frac {\left({\frac {B-\beta }{1/{\sqrt {(\sum _{i=1}^{n}x_{i}^{2}-n{\bar {x}}^{2})}}}}\right)}{\sqrt {\frac {SS_{R}}{(n-2)}}}}={\frac {B-\beta }{\sqrt {\frac {SS_{R}}{(n-2)S_{xx}}}}}\sim t_{n-2},$

where $S_{xx}=\sum _{i=1}^{n}x_{i}^{2}-n{\bar {x}}^{2}$ .

Therefore, to test the hypothesis $\beta =0$ , we set $\beta =0$ in the statistic above, compute ${\frac {B}{\sqrt {\frac {SS_{R}}{(n-2)S_{xx}}}}}$ , and reject the hypothesis if this value is too extreme for a t-distribution with $n-2$ degrees of freedom.

Note: This is fairly similar to the use of the t-distribution for the sample mean when the variance of the population is unknown, where the quantity $SS_{R}/(n-2)$ plays the role of $S^{2}$ in that case.
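The test of $\beta =0$ can be sketched in code as follows. The function name `slope_t_statistic` and the sample data are made up for this illustration. The critical value 3.182 is the two-sided 5% point of the t-distribution with 3 degrees of freedom, taken from a standard t-table; for other sample sizes or significance levels a different table value is needed.

```python
import math

def slope_t_statistic(xs, ys):
    """Return (T, B, SS_R), where T is the test statistic for beta = 0."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least 3 data points for n - 2 degrees of freedom")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    if s_xx == 0:
        raise ValueError("all x values are identical; B is undefined")

    # Least squares estimates.
    b = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / s_xx
    a = y_bar - b * x_bar

    # Sum of squares of the residuals.
    ss_r = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    if ss_r == 0:
        raise ValueError("perfect fit: SS_R is 0, so the statistic is undefined")

    # Statistic under the hypothesis beta = 0.
    t = b / math.sqrt(ss_r / ((n - 2) * s_xx))
    return t, b, ss_r

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
t, b, ss_r = slope_t_statistic(xs, ys)

critical = 3.182  # two-sided 5% critical value for n - 2 = 3 degrees of freedom
reject = abs(t) > critical
print("B =", b, "SS_R =", ss_r, "T =", t, "reject beta = 0:", reject)
```

For this data the statistic is far larger than the critical value, so the hypothesis $\beta =0$ is rejected at the 5% level.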
