'तरी न्यून ते पुरते | अधिक ते सरते |' ('what is lacking is made complete; what is in excess falls away') - Sant Dnyaneshwar

Regression: Concept, meaning, definition, lines of regression.

 

Regression

 

The term 'regression' literally means 'stepping back towards the average'. It was first used by the British biometrician Sir Francis Galton (1822-1911). He found that although tall parents tend to have tall children, the average height of those children is less than the average height of their parents; likewise, although short parents tend to have short children, the average height of those children is more than the average height of their parents. In other words, the average height of the children of tall or short parents regresses, or goes back, towards the average height of the population (echoing Sant Dnyaneshwar's verse: 'what is lacking is made complete; what is in excess falls away'). Galton described this phenomenon as 'regression'.

Definition: Regression is the method of estimating the value of one variable when the value of the other is known and the two variables are known to be correlated.

Lines of regression:-

We know that if two correlated variables are plotted on graph paper, the points in the scatter diagram will cluster around some curve, called the curve of regression. If this curve is a straight line, it is called the line of regression and the regression between the two variables is linear; otherwise the regression is said to be curvilinear.

The line of regression is the line which gives the best estimate of the value of one variable for any specified value of the other variable. Thus the line of regression is the line of best fit.

In order to obtain a line of regression, we have to find a line such that the distances, or deviations, of the points from that line are minimum. We can measure these deviations i) vertically or ii) horizontally: we get one line when the distances are minimized vertically and a second line when the distances are minimized horizontally. Thus we get two lines of regression.

Line of regression of y on x:

If we minimize the distances or deviations of the points from the line measured along the y-axis, we get a line called the line of regression of y on x. Its equation is written in the form y = a + bx. This line is used for estimating the value of y for a given value of x; the vertical distance 'd' of each point from the line is what is minimized.

       
[Figures: line of regression of y on x (vertical deviations minimized) and line of regression of x on y (horizontal deviations minimized)]

Line of regression of x on y:

If we minimize the distances or deviations of the points from the line measured along the x-axis, we get a line called the line of regression of x on y. Its equation is written in the form x = a + by. This line is used for estimating the value of x for a given value of y; the horizontal distance 'd' of each point from the line is what is minimized.

There are two methods of obtaining the lines of regression.

                                           i) The method of scatter diagram &

                                         ii) The method of least squares
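As a sketch of the second method, here is a minimal Python computation of the line of regression of y on x by least squares, using the small (x, y) dataset from the worked table later in these notes; the variable names are my own.

```python
# Sample bivariate data (same values as the worked table in these notes).
x = [60, 70, 80, 85, 95]
y = [70, 65, 70, 95, 85]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope b = S_xy / S_xx and intercept a = ybar - b * xbar.
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * mean_x * mean_y
s_xx = sum(xi * xi for xi in x) - n * mean_x * mean_x
b = s_xy / s_xx
a = mean_y - b * mean_x

print(f"y = {a:.3f} + {b:.3f} x")   # fitted line of regression of y on x
```

With these values this prints y = 26.781 + 0.644 x, which reproduces the fitted values ŷ in the worked table (for example, ŷ = 65.411 at x = 60).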

3. Lines of Regression

In simple linear regression, we can have:

  1. Regression of Y on X

     Y = a + bX + ε

     • Y → dependent variable
     • X → independent variable
     • a → intercept (value of Y when X = 0)
     • b → regression coefficient of Y on X (rate of change of Y for a unit change in X)
     • ε → error term (random deviation)

  2. Regression of X on Y

     X = c + dY + ε

     • c → intercept
     • d → regression coefficient of X on Y (rate of change of X for a unit change in Y)

Note: The two regression lines are generally different (unless correlation is ±1).
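To illustrate the note above, here is a small sketch (using the same illustrative data as elsewhere in these notes) that computes both regression coefficients; a standard identity is that their product b_yx · b_xy equals r², so the two lines coincide only when r = ±1.

```python
# Same illustrative (x, y) data as elsewhere in these notes.
x = [60, 70, 80, 85, 95]
y = [70, 65, 70, 95, 85]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * mx * my
s_xx = sum(xi * xi for xi in x) - n * mx * mx
s_yy = sum(yi * yi for yi in y) - n * my * my

b_yx = s_xy / s_xx   # coefficient of the regression of y on x
b_xy = s_xy / s_yy   # coefficient of the regression of x on y

r = s_xy / (s_xx * s_yy) ** 0.5   # correlation coefficient
print(b_yx * b_xy, r * r)         # the product of the two coefficients is r^2
```

Here r² ≈ 0.48, well away from 1, so the two regression lines are clearly distinct.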

Fitting of regression lines by the method of least squares

Let us suppose that in a bivariate distribution (xi, yi), i = 1, 2, …, n, Y is the dependent and X the independent variable. Let the line of regression of y on x be

y = a + bx  ----- (*)

According to the principle of least squares, the normal equations for estimating a and b are given by

Σy = na + bΣx --- (1)

(taking the summation on both sides of (*)), and

Σxy = aΣx + bΣx² --- (2)

(multiplying (*) by x and taking the summation on both sides).

Dividing (1) by n, we get

ȳ = a + b x̄.

Thus the line of regression of y on x passes through the point (x̄, ȳ).
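The normal equations (1) and (2) form a 2×2 linear system in a and b, which can be solved directly; a minimal sketch in Python, with illustrative data, using Cramer's rule:

```python
# Solve the normal equations
#   Σy  = n·a + b·Σx     --- (1)
#   Σxy = a·Σx + b·Σx²   --- (2)
# for a and b using Cramer's rule on the 2x2 system.
x = [60, 70, 80, 85, 95]   # illustrative data
y = [70, 65, 70, 95, 85]
n = len(x)

sx, sy = sum(x), sum(y)
sxx = sum(xi * xi for xi in x)
sxy = sum(xi * yi for xi, yi in zip(x, y))

det = n * sxx - sx * sx
a = (sy * sxx - sxy * sx) / det
b = (n * sxy - sx * sy) / det

# Check: the fitted line passes through the mean point (xbar, ybar).
assert abs((a + b * (sx / n)) - sy / n) < 1e-9
print(f"y = {a:.3f} + {b:.3f} x")
```

The assertion verifies numerically the result derived above: the line of regression of y on x passes through (x̄, ȳ).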

***

Concept of Residual

A residual is the difference between the observed value of the dependent variable (y) and the predicted value of the dependent variable (ŷ). Residuals are sometimes also called "errors".

Residual =Observed value of y - Predicted value of y.

That is,   e= y -ŷ

The data points usually do not fall exactly on the regression line; they are scattered around it. A residual is the vertical distance between a data point and the regression line. Each data point has one residual. A residual is positive if the point lies above the regression line, negative if it lies below the regression line, and zero if the regression line passes exactly through the point. For a least-squares line, the sum of the residuals, and hence the mean of the residuals, is zero.
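A quick numerical check of this zero-sum property, using the observed y and predicted ŷ values from the worked table in these notes:

```python
# Observed and predicted values, taken from the worked table in these notes.
y     = [70, 65, 70, 95, 85]
y_hat = [65.411, 71.849, 78.288, 81.507, 87.945]

residuals = [yi - yh for yi, yh in zip(y, y_hat)]
print(residuals)        # one residual per data point
print(sum(residuals))   # the residuals cancel out (≈ 0)
```

The sum is zero up to the decimal rounding of the tabulated ŷ values.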

    
Mean Residual Sum of Squares

Let yᵢ be the i-th observed value and ŷᵢ the corresponding i-th predicted value of the dependent variable Y. Then the Mean Residual Sum of Squares (MRSS) is defined as

MRSS = (1/n) Σ (yᵢ - ŷᵢ)².

In other words, the MRSS is the mean of the squared errors (yᵢ - ŷᵢ)².
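The definition translates directly into code; a short sketch using the observed and predicted values from the worked table in these notes:

```python
# MRSS: the mean of the squared residuals.
y     = [70, 65, 70, 95, 85]                        # observed values
y_hat = [65.411, 71.849, 78.288, 81.507, 87.945]    # predicted values

mrss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) / len(y)
print(round(mrss, 3))
```

A smaller MRSS indicates that the fitted line sits closer to the data points overall.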

RESIDUAL PLOT

A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis. If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.

The table below shows the inputs and outputs from a simple linear regression analysis.

 x     y     ŷ         e
 60    70    65.411     4.589
 70    65    71.849    -6.849
 80    70    78.288    -8.288
 85    95    81.507    13.493
 95    85    87.945    -2.945
  
A residual plot, obtained by plotting the residuals (e) versus the independent variable (x), is as given below.

The residual plot shows a fairly random pattern - the first residual is positive, the next two are negative, the fourth is positive, and the last residual is negative. This random pattern indicates that a linear model provides a decent fit to the data.
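The sign pattern described above can be checked mechanically; a small sketch using the residuals from the table:

```python
# Residuals from the table above; the signs alternate without a systematic
# (e.g. U-shaped) pattern, consistent with a linear model being adequate.
residuals = [4.589, -6.849, -8.288, 13.493, -2.945]

signs = ['+' if e > 0 else '-' for e in residuals]
print(''.join(signs))   # +--+-
```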

Below, the residual plots show three typical patterns. The first plot shows a random pattern, indicating a good fit for a linear model.

 
The other plot patterns are non-random (U-shaped and inverted U), suggesting a better fit for a nonlinear model.                          


***





