Probstat/notes/regression

This is part of probstat.

In this section, we shall discuss linear regression. We shall focus on one-variable linear regression.

Model

We consider two variables $x$ and $y$, where $y$ is a function of $x$. We refer to $x$ as the independent or input variable, and to $y$ as the dependent variable. We consider a linear relationship between the independent variable and the dependent variable. We assume that there exist hidden parameters $\alpha$ and $\beta$ such that

$$y = \alpha + \beta x + e,$$

where $e$ is a random error. We further assume that the error is unbiased, i.e., $E[e] = 0$, and is independent of $x$.

Input: As an input to the regression process, we are given a set of $n$ data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ generated from the previous equation.

Goal: We want to estimate $\alpha$ and $\beta$.
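To make the setting concrete, here is a minimal Python sketch that generates data points from the model. The function name `simulate_data`, the parameter values, and the choice of normally distributed errors are assumptions made for illustration only; at this point the model itself only requires the error to have mean zero.

```python
import random


def simulate_data(alpha, beta, sigma, xs, seed=None):
    """Return a list of (x, y) pairs with y = alpha + beta * x + e.

    Assumption of this sketch: each error e is drawn independently from a
    normal distribution with mean 0 and standard deviation sigma.
    """
    rng = random.Random(seed)
    return [(x, alpha + beta * x + rng.gauss(0.0, sigma)) for x in xs]


# Hypothetical parameter values, chosen only for illustration.
points = simulate_data(alpha=1.0, beta=2.0, sigma=0.5, xs=[0, 1, 2, 3, 4], seed=42)
for x, y in points:
    print(x, round(y, 3))
```

In a real regression problem only the pairs in `points` are observed; `alpha`, `beta`, and `sigma` stay hidden and have to be estimated.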

The least squares estimators

Denote our estimate for $\alpha$ as $A$ and for $\beta$ as $B$. Using these two values as estimators, the error at data point $i$ is

$$y_i - A - B x_i.$$

We focus on the sum of squared errors, i.e.,

$$SS = \sum_{i=1}^{n} (y_i - A - B x_i)^2.$$

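The method of least squares uses the parameters that minimize the sum of squared errors as estimators. Therefore, we want to find $A$ and $B$ that minimize $SS$. To do so, we partially differentiate $SS$ with respect to $A$ and $B$:

$$\frac{\partial SS}{\partial A} = -2 \sum_{i=1}^{n} (y_i - A - B x_i) \qquad \text{(Eq1)}$$

$$\frac{\partial SS}{\partial B} = -2 \sum_{i=1}^{n} x_i (y_i - A - B x_i) \qquad \text{(Eq2)}$$

We set these two derivatives to zero to find the minimum and obtain the two equations we have to solve:

$$\sum_{i=1}^{n} y_i = nA + B \sum_{i=1}^{n} x_i, \qquad \sum_{i=1}^{n} x_i y_i = A \sum_{i=1}^{n} x_i + B \sum_{i=1}^{n} x_i^2.$$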

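Before solving these two equations, let's define

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i.$$

We start by rewriting the first equation (Eq1), set to zero, as

$$A = \bar{y} - B\bar{x} \qquad \text{(Eq3)}$$

and put it in (Eq2), also set to zero, to get

$$\sum_{i=1}^{n} x_i y_i = (\bar{y} - B\bar{x})\, n\bar{x} + B \sum_{i=1}^{n} x_i^2.$$

Collecting the terms that contain $B$ gives $B \left( \sum x_i^2 - n\bar{x}^2 \right) = \sum x_i y_i - n\bar{x}\bar{y}$, so we get

$$B = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}.$$

To find $A$, we can just use equation (Eq3).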

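Estimated regression parameters. Using the least squares method, we obtain the following estimates:

$$B = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2},$$

and

$$A = \bar{y} - B\bar{x}.$$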
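The two formulas translate directly into code. The following is a minimal sketch in plain Python; the function name `least_squares_fit` and the sample data are assumptions chosen for illustration, and the check for identical $x$ values is an addition of this sketch (the slope formula divides by zero in that case).

```python
def least_squares_fit(points):
    """Return (A, B), the least squares estimates of intercept and slope.

    points is a sequence of (x, y) pairs.
    """
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    # Numerator and denominator of the slope formula.
    s_xy = sum(x * y for x, y in points) - n * x_bar * y_bar
    s_xx = sum(x * x for x, _ in points) - n * x_bar ** 2
    if s_xx == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    B = s_xy / s_xx
    A = y_bar - B * x_bar  # (Eq3)
    return A, B


# Points lying exactly on y = 1 + 2x, so the fit recovers A = 1 and B = 2.
A, B = least_squares_fit([(0, 1), (1, 3), (2, 5), (3, 7)])
print(A, B)  # prints 1.0 2.0
```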

Distribution of regression parameters

Although $A$ and $B$ are least-squares estimators, we do not yet know whether they are good estimators. In this section, we shall discuss various properties of these estimators.

We make an assumption on the error, namely that $e$ is normally distributed with variance $\sigma^2$. Therefore,

$$y_i \sim \mathrm{Normal}(\alpha + \beta x_i,\ \sigma^2).$$

We shall start with $B$. First, note that the $x_i$'s are inputs and are not random. Therefore, if we look at the formula for $B$, we see that $B$ is actually a linear combination of the independent normal random variables $y_i$. This implies that $B$ is a normal random variable. If we can find its mean and its variance, we have complete information about the distribution of $B$.

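We can calculate the expectation and variance of $B$ as follows.

  • $E[B] = \beta$.
  • $\mathrm{Var}(B) = \dfrac{\sigma^2}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}$.

We can also calculate the expectation and variance of $A$.

  • $E[A] = \alpha$.
  • $\mathrm{Var}(A) = \dfrac{\sigma^2 \sum_{i=1}^{n} x_i^2}{n \left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right)}$.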
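As a sketch of where these values come from, the calculation for $B$ can be written out as follows. It assumes that the errors at different data points are independent, and it uses the shorthand $S_{xx} = \sum_i x_i^2 - n\bar{x}^2$, introduced here only for brevity.

```latex
\begin{align*}
B &= \frac{\sum_i x_i y_i - n\bar{x}\bar{y}}{S_{xx}}
   = \frac{\sum_i (x_i - \bar{x})\, y_i}{S_{xx}},
   \qquad S_{xx} = \sum_i x_i^2 - n\bar{x}^2 = \sum_i (x_i - \bar{x})^2, \\
E[B] &= \frac{\sum_i (x_i - \bar{x})(\alpha + \beta x_i)}{S_{xx}}
      = \frac{\alpha \sum_i (x_i - \bar{x}) + \beta \sum_i (x_i - \bar{x})\, x_i}{S_{xx}}
      = \frac{0 + \beta S_{xx}}{S_{xx}} = \beta, \\
\mathrm{Var}(B) &= \frac{\sum_i (x_i - \bar{x})^2\, \mathrm{Var}(y_i)}{S_{xx}^2}
      = \frac{\sigma^2 S_{xx}}{S_{xx}^2} = \frac{\sigma^2}{S_{xx}}.
\end{align*}
```

The expectation of $A$ then follows from $A = \bar{y} - B\bar{x}$: since $E[\bar{y}] = \alpha + \beta\bar{x}$ and $E[B] = \beta$, we get $E[A] = \alpha + \beta\bar{x} - \beta\bar{x} = \alpha$.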

Statistical tests on regression parameters

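We focus on how to test the null hypothesis

$$H_0: \beta = 0.$$

Since $B$ is normal with mean $\beta$ and variance $\sigma^2 / \left( \sum x_i^2 - n\bar{x}^2 \right)$, we know that the statistic

$$\frac{B - \beta}{\sqrt{\sigma^2 / \left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right)}}$$

is unit normal, and it is possible to perform various statistical tests on $\beta$ based on the estimated value $B$ if we know the parameter $\sigma$. However, usually, we do not.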

We end up in a situation similar to sampling from a population with unknown variance. Another key quantity in this case is the sum of squares of the residuals:

$$SS_R = \sum_{i=1}^{n} (y_i - A - B x_i)^2.$$

Note that if we substitute $A$ with $\alpha$ and $B$ with $\beta$, the term $y_i - \alpha - \beta x_i$ is exactly the error at data point $i$ (which is normally distributed with mean 0 and variance $\sigma^2$). This motivates the fact that $SS_R$ can be used to estimate $\sigma^2$:

$$E\left[ \frac{SS_R}{n-2} \right] = \sigma^2.$$

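Moreover, it can be shown that

$$\frac{SS_R}{\sigma^2} \sim \chi^2_{n-2},$$

and $SS_R$ is independent of $B$. These two facts imply that

$$\sqrt{\frac{(n-2)\, S_{xx}}{SS_R}}\,(B - \beta) \sim t_{n-2},$$

where $S_{xx} = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$.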

Therefore, if we want to test the hypothesis $\beta = 0$, we can check whether the statistic $\sqrt{(n-2)\, S_{xx} / SS_R}\; B$, with $S_{xx} = \sum x_i^2 - n\bar{x}^2$, is too extreme to plausibly come from the t-distribution with $n-2$ degrees of freedom.

Notes: This is fairly similar to the use of the t-distribution for the sample mean when the variance of the population is unknown, where the quantity $SS_R/(n-2)$ acts as the sample variance $S^2$ in that case.
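The test can be sketched in a few lines of plain Python. The function name `slope_t_statistic` and the sample data are assumptions made for illustration; the sketch only computes the statistic and the degrees of freedom, and the critical value still has to be looked up in a t-table or obtained from a statistics library.

```python
import math


def slope_t_statistic(points):
    """Return (t, dof) for testing the null hypothesis beta = 0.

    t is sqrt((n - 2) * S_xx / SS_R) * B and dof is n - 2.
    """
    n = len(points)
    if n < 3:
        raise ValueError("need at least 3 data points")
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xx = sum(x * x for x, _ in points) - n * x_bar ** 2
    if s_xx == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    B = (sum(x * y for x, y in points) - n * x_bar * y_bar) / s_xx
    A = y_bar - B * x_bar
    # Sum of squares of the residuals.
    ss_r = sum((y - A - B * x) ** 2 for x, y in points)
    if ss_r == 0:
        raise ValueError("perfect fit; the statistic is undefined")
    return math.sqrt((n - 2) * s_xx / ss_r) * B, n - 2


# Made-up data, for illustration only.
t, dof = slope_t_statistic([(1, 1), (2, 3), (3, 2), (4, 5), (5, 4)])
print(round(t, 4), dof)
```

For this made-up data set the statistic is about 2.31 with 3 degrees of freedom. A two-sided test at the 5% level compares $|t|$ with the tabulated critical value $t_{0.025,3} \approx 3.182$, so in this example the hypothesis $\beta = 0$ would not be rejected.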