
Linear regression

You can find more information about the sample data used in the article here

In addition, linear regression is the basis for many advanced procedures, such as logistic or multinomial regression, multilevel regression, and methods for panel analysis. ANOVA and ANCOVA models can also be represented as linear regressions.

A linear regression analysis can be used to examine the influence of one or more variables on a continuous (metric) outcome variable. The following question serves as an example:

Do students get higher scores in a German test if they spend more time studying and / or if they sleep longer the night before the test?

Linear regression analyses are used in practice for a variety of purposes, including:

  • Describing relationships:

    "By how many points does the result in the German test increase on average for every additional hour of sleep?"

  • Guarding a relationship between two variables against the influence of third variables:

    "Is the effect of the duration of sleep perhaps only due to the fact that the more diligent students go to bed earlier?"

  • Testing whether a relationship found in a sample generalizes to the population:

    "Does the relationship found only exist for the examined students, or can it be assumed that such a relationship exists for all students from whom the examined sample was selected?"

  • Making empirically based predictions:

    "What score will a student get if she studies 4.5 hours and sleeps 7 hours?"

For non-metric outcome variables, there are further procedures that build on the basic principle of linear regression, such as logistic regression for binary outcomes (or proportions). Examples of non-metric variables are "which party is chosen", "test passed / failed", or "patient has no / slight / severe pain".

What does "relationship" mean?

As a reminder: two variables are related if the values of one variable depend on the values of the other - which of course also applies in the other direction.

In a regression analysis, we are interested in the influence of one or more variables on another variable; we thus define a "direction of effect". The affected (to-be-explained) variable is called the dependent variable; the influencing (explanatory) variables are called independent variables.

  • In our example, score is the dependent variable, sleep time and study time are our independent variables.

To illustrate a relationship between two variables in a scatter plot, it is common to plot the dependent variable on the y-axis and the independent variable on the x-axis:

Bivariate relationships: scatter plots

Description of the procedure

How is the relationship mapped in a linear regression?

To explain the procedure, we will first restrict ourselves to the bivariate case with one dependent and one independent variable. The basic idea of linear regression analysis is to place a straight line in the point cloud in such a way that it reflects the relationship as well as possible.

  • To define a straight line, we only need to determine the slope \(b\) and the y-intercept \(a\): \(y = a + b \cdot x\).
  • The linear regression model can be written as: \(y = a + b \cdot x + e\). With the "error term" \(e\) we take into account that the observations (the points in the scatter plot) do not all lie on the straight line, but deviate from the values determined by \(y = a + b \cdot x\).

With which values for the slope and the y-intercept can the relationship best be described in the following example?

The regression model estimated in our example is exactly:

\(y = 12.7 + 9.8 \cdot x + e\)
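Such an estimated equation can be evaluated directly in code. Below is a minimal Python sketch of fitted values and error terms; the \((x, y)\) pairs are made up for illustration and are not the article's sample data:

```python
# Illustration of the estimated model y = 12.7 + 9.8*x + e from the text.
# The (x, y) observations below are hypothetical, NOT the article's sample.
a, b = 12.7, 9.8

def predict(x):
    """Value on the regression line for a given x (hours of sleep)."""
    return a + b * x

observations = [(5.0, 65.0), (6.0, 70.0), (7.0, 84.0)]
for x, y in observations:
    e = y - predict(x)          # error term: observed minus fitted value
    print(f"x={x}: fitted={predict(x):.1f}, observed={y}, error={e:.1f}")
```

The error term `e` is exactly the vertical distance between each point and the line, as described above.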

This section explains the underlying linear model in more detail.

A linear regression analysis generally uses the following model to trace a (metric) variable back to \(k\) explanatory variables:

\[y = b_{0} + b_{1} \cdot x_{1} + b_{2} \cdot x_{2} + \ldots + b_{k} \cdot x_{k} + e\]

The \(y\) values result from a linear combination of the values of the explanatory variables \(x_1\) to \(x_k\) and an error term \(e\).

A note on notation: we often also write

\[y_i = b_{0} + b_{1} \cdot x_{1i} + b_{2} \cdot x_{2i} + \ldots + b_{k} \cdot x_{ki} + e_{i}\]

By adding the index \(i\) for the observations, we emphasize that the \(y\) values of concrete observations result from the associated \(x\) values. The term \(a\) is often used for the intercept \(b_{0}\), as in the search for the "best straight line" above.

What is a linear combination?

The right-hand side of the regression function without the error term \(e_{i}\) is a linear combination of the form \(b_{0} + b_{1} \cdot x_{1} + b_{2} \cdot x_{2} + \ldots + b_{k} \cdot x_{k}\). It consists of the coefficients \(b_0\) to \(b_k\) and the variables \(x_1\) to \(x_k\). The coefficients are fixed values; the variables can take on all possible values. If we choose certain values for the coefficients, then by inserting all possible value combinations of the \(x\) variables we obtain:

  • for functions of the form \(y = b_{0} + b_{1} \cdot x_{1}\): a straight line in a two-dimensional plane,
  • for functions of the form \(y = b_{0} + b_{1} \cdot x_{1} + b_{2} \cdot x_{2}\): a surface in three-dimensional space, and in general:
  • for functions of the form \(y = b_{0} + b_{1} \cdot x_{1} + b_{2} \cdot x_{2} + \ldots + b_{k} \cdot x_{k}\): a (hyper)plane in a higher-dimensional space.

The assumed relationship between the number of points and the duration of sleep can be described using a linear combination as follows:

\[\hat{\text{points}} = b_{0} + b_{1} \cdot \text{sleep time}\]

In addition, we suspect that the study time is also related to the number of points:

\[\hat{\text{points}} = b_{0} + b_{1} \cdot \text{sleep time} + b_{2} \cdot \text{study time}\]

Why does the model need an "error term"?

As a rule, the values observed in our data do not, of course, all lie on one "surface" in multidimensional space. The difference between the values \(\hat{y}\) predicted by the model and the actually observed values \(y\) is expressed in the error term. Each observed value is thus composed of the systematic part of the linear combination (identical for all observations with the same combination of \(x\) values) and the error term, which is interpreted as random (individual for each observation).

\[\hat{y} = E(y \mid x_1, x_2, \ldots, x_k) = b_{0} + b_{1} \cdot x_{1} + b_{2} \cdot x_{2} + \ldots + b_{k} \cdot x_{k}\]

The error term thus captures the influence of other, unobserved variables and/or random processes. In our example, "learner type", "linguistic talent", or "motivation", among many other influences, are not taken into account.

In practical terms, we can imagine that a linear regression model tries to estimate the mean values of the variable \(y\) for all groups defined by different combinations of values of the \(x\) variables, assuming that all these mean values are arranged on a line or surface.

This does not mean that the conditional expected values ("group means") actually lie on a line / surface / ... in empirical reality. By using a linear model, however, we assume this structure of empirical reality! If we estimate the parameters \(b_0, b_1, \ldots\) of the model, we obtain the parameter values that are optimal for our data under this model. Whether the results and interpretations are meaningful depends on whether our assumptions were plausible.

Determination of the regression line

How do you determine what the "best" straight line is?

If we look for the "best" straight line by eye, several straight lines will probably seem equally well suited to describe the relationship. How can we decide which straight line is best? Linear regression analysis uses a specific procedure here, the method of least squares (ordinary least squares, OLS). If the data meet some computational requirements, this method always determines exactly one straight line that best describes the relationship.

This section explains how the least squares method works.

How do we find the coefficients \(b_0\) to \(b_k\) for which the estimated values \(\hat{y}\) correspond to the observed values \(y\) "as closely as possible"? The method used in linear regression to determine the parameters is based on the error terms \(e_i\) and is called the least squares method or ordinary least squares (OLS) criterion:

The parameters should be chosen so that the sum of the squared error terms - i.e. the squared distances of the observed values \(y_i\) from the estimated values \(\hat{y}_i\) - becomes as small as possible:

\[\min \sum_{i=1}^n e_i^2 = \min \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \min \sum_{i=1}^n \left( y_i - (b_0 + b_1 \cdot x_{1i} + b_2 \cdot x_{2i} + \ldots + b_k \cdot x_{ki}) \right)^2\]

With this minimization condition, exactly one set of parameter values can always be determined if the data meet some basic requirements.

If we have a given set of observation data for the variables \(y, x_1, \ldots, x_k\), the value of the objective function to be minimized depends only on the values chosen for the parameters \(b_0, b_1, \ldots, b_k\):

\[\sum_{i=1}^n \left( y_i - (b_0 + b_1 \cdot x_{1i} + b_2 \cdot x_{2i} + \ldots + b_k \cdot x_{ki}) \right)^2\]

For a model with one independent variable, we can graph the objective function as a function of the values used for \(a\) and \(b\):

The function has the shape of an upwardly open parabola or "bowl", so there is a unique minimum. The parameter values \(b_0, b_1, b_2, \ldots, b_k\) we are looking for are exactly the values at which the function takes its lowest value.
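This "bowl" shape can be checked numerically: every deviation from the least-squares solution increases the sum of squared errors. A pure-Python sketch with made-up data (not the article's sample), using the closed-form bivariate OLS estimates:

```python
# Check numerically that the closed-form OLS solution minimizes the sum of
# squared errors (the objective function). Toy data, for illustration only.
xs = [4.0, 5.0, 6.0, 7.0, 8.0]
ys = [55.0, 68.0, 69.0, 80.0, 91.0]

def ssr(a, b):
    """Objective function: sum of squared errors for a given intercept/slope."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Closed-form bivariate OLS estimates
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
a_hat = y_bar - b_hat * x_bar

# Any perturbation of (a_hat, b_hat) should increase the objective function
for da, db in [(1, 0), (-1, 0), (0, 0.5), (0, -0.5), (2, -1)]:
    assert ssr(a_hat + da, b_hat + db) >= ssr(a_hat, b_hat)
```

Because the objective is a convex paraboloid in \((a, b)\), no perturbed pair can do better than the least-squares pair.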

In our example, the regression function is:

\[\text{points} = 12.7 + 9.8 \cdot \text{sleep time} + e\]

If we also take the study time into account, we get the following regression equation:

\[\text{points} = 13.3 + 4.8 \cdot \text{sleep time} + 4.5 \cdot \text{study time} + e\]

It is immediately noticeable that the regression coefficient for sleep time changes when we take the influence of the study time into account. This is an example of third-variable control.

In our example, the study time is related not only to the number of points but also to the sleep time: the students who have prepared well for the exam also sleep longer (perhaps they take the exam more seriously, or they can sleep more calmly...). This relationship leads to the effect of sleep duration being overestimated if we do not control for study time. The coefficient of sleep duration then reflects not only the effect of sleep duration but partly also the effect of study time.
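This third-variable effect can be reproduced in a small simulation. The data below are constructed so that sleep time and study time are positively correlated; only the coefficients 4.8 and 4.5 are taken from the text, everything else is made up:

```python
# Sketch of omitted-variable bias: if sleep time and study time are
# positively correlated and we omit study time, the sleep coefficient
# absorbs part of the study-time effect. Constructed data; the "true"
# coefficients 13.3, 4.8, 4.5 are the ones estimated in the text.
sleep = [5.0, 6.0, 7.0, 8.0]
study = [1.0, 3.0, 5.0, 7.0]              # rises together with sleep time
score = [13.3 + 4.8 * sl + 4.5 * st for sl, st in zip(sleep, study)]

def slope(xs, ys):
    """Bivariate OLS slope."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
           / sum((x - x_bar) ** 2 for x in xs)

b_sleep_only = slope(sleep, score)
# True sleep effect is 4.8, but without controlling for study time we get
# 4.8 + 4.5 * slope(study on sleep), i.e. a clearly inflated coefficient.
print(b_sleep_only)   # ≈ 13.8 here, far above the true 4.8
```

In this constructed example, study time rises by two hours per extra hour of sleep, so the uncontrolled sleep coefficient picks up \(4.5 \cdot 2 = 9\) extra points per hour.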

This section shows how to find a computational solution to the OLS condition.

Since we can rule out a maximum, the parameter values can be determined from the first derivatives of the objective function. Specifically, we form the partial derivatives with respect to the parameters to be determined and set them to zero. This yields a linear system of equations; in our example with two independent variables:

\[\sum_{i=1}^n (y_i - b_0 - b_1 \cdot x_{1i} - b_2 \cdot x_{2i}) = 0\]
\[\sum_{i=1}^n x_{1i} \cdot (y_i - b_0 - b_1 \cdot x_{1i} - b_2 \cdot x_{2i}) = 0\]
\[\sum_{i=1}^n x_{2i} \cdot (y_i - b_0 - b_1 \cdot x_{1i} - b_2 \cdot x_{2i}) = 0\]

The system of equations derived in this way can - for any number of independent variables - be written in matrix notation as

\[X^{T} \cdot (y - X \cdot b) = 0\]

For a data set with four observed cases, the equation with the matrices written out looks like this:

\[\begin{bmatrix} 1 & 1 & 1 & 1 \\ x_{11} & x_{12} & x_{13} & x_{14} \\ x_{21} & x_{22} & x_{23} & x_{24} \end{bmatrix} \cdot \left( \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} - \begin{bmatrix} 1 & x_{11} & x_{21} \\ 1 & x_{12} & x_{22} \\ 1 & x_{13} & x_{23} \\ 1 & x_{14} & x_{24} \end{bmatrix} \cdot \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix} \right) = 0\]

Here \(y_1, y_2, y_3\) and \(y_4\) are the \(y\) values observed for the four cases; \(x_{11}, \ldots, x_{14}\) and \(x_{21}, \ldots, x_{24}\) are the corresponding observed values of the two \(x\) variables.

The equation can be solved for \(b\):

\[b = (X^{T} \cdot X)^{-1} \cdot X^{T} \cdot y\]

Here \(b\) is the vector containing the coefficients we are looking for (the intercept \(b_0\) and the slope coefficients). This gives us a formula with which we can directly calculate the parameter values that satisfy the OLS criterion.

(For the partial derivatives and the rearrangement of the equation for \(b\), see Wolf / Best 2010: 614f.)
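The matrix solution can be checked with a small pure-Python sketch (made-up data, two explanatory variables). Instead of inverting \(X^T X\) explicitly, the sketch solves the normal equations \(X^T X \cdot b = X^T y\) by Gaussian elimination:

```python
# Solve the OLS normal equations (X^T X) b = (X^T y) for a small, made-up
# data set with two explanatory variables. Pure Python, no libraries.
X = [[1, 5.0, 1.0],    # each row: [1, x1_i, x2_i]
     [1, 6.0, 2.0],
     [1, 7.0, 2.0],
     [1, 8.0, 4.0]]
y = [40.0, 50.0, 55.0, 70.0]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

Xt = [list(col) for col in zip(*X)]          # X transposed
XtX = matmul(Xt, X)                          # 3x3 matrix
Xty = [sum(r * v for r, v in zip(row, y)) for row in Xt]

def solve(A, v):
    """Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            M[r] = [mr - f * mi for mr, mi in zip(M[r], M[i])]
    b = [0.0] * n
    for i in reversed(range(n)):
        b[i] = (M[i][n] - sum(M[i][j] * b[j] for j in range(i + 1, n))) / M[i][i]
    return b

b = solve(XtX, Xty)                          # [b0, b1, b2]
# Check the normal equations: residuals are orthogonal to every column of X
resid = [yi - sum(xij * bj for xij, bj in zip(row, b)) for row, yi in zip(X, y)]
for col in Xt:
    assert abs(sum(c * e for c, e in zip(col, resid))) < 1e-8
```

The final check is exactly the system \(X^T (y - Xb) = 0\) from above: each of the three sums must vanish at the OLS solution.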

Interpretation of the regression equation

The best way to understand the interpretation of this regression equation is to substitute some values for the independent variable and calculate the estimated \(y\) values:

  • The estimated slope of the regression line is \(b = 9.8\), i.e.: if the value of the independent variable increases by one unit (here: one hour of sleep), the estimated value of the dependent variable increases by 9.8 units (here: test points).
  • The intercept \(a = 12.7\) can be interpreted as the estimated value for students with 0 hours of sleep.

One often reads the interpretation "if the value of the \(x\) variable rises by one unit, the value of the \(y\) variable rises by \(b\) units" (here: if a student sleeps one hour longer, he / she will achieve a test result that is 9.8 points higher). This causal, process-oriented interpretation is only permissible under very far-reaching assumptions. We did not observe in our data how a person's test results differ when that person sleeps for different lengths of time. The estimate of the slope coefficient \(b\) is based only on comparisons between groups of people who slept for different lengths of time. In the context of regression models, however, we can try to approximate a causal interpretation through the control of third variables.

This section explains in more detail how to interpret the regression equation.

Interpretation of the regression coefficients \(b_1\) to \(b_k\)

This section deepens the interpretation of the regression coefficients.

When interpreting the regression coefficients \(b_1\) to \(b_k\), we use the fact that they determine the slope of the regression line (or, in the multivariate case, the slope in the direction of the respective \(x\) dimension):

If \(x_k\) increases by one \(x\) unit, the value estimated for \(y\) changes by \(b_k\) \(y\) units, provided the values of the other independent variables in the model remain constant.

In our example (model with sleep time and study time):

  • Pupils who study one hour longer score on average 4.5 points higher in the German test if they sleep for the same time.
  • Pupils who sleep one hour longer score on average 4.8 points higher if the study time remains the same.

A very clean interpretation would be:

  • If we compare two groups of students who differ in their study time by one hour but have the same sleep time, we expect a difference in the mean score of 4.5 points.
  • If we compare two groups of students who differ in their sleep time by one hour but have the same study time, we expect a difference in the mean score of 4.8 points.

Interpretation of the intercept \(b_0\)

For the interpretation of the intercept \(b_0 = 13.3\), we must remember that this is the \(y\) value at which the regression line intersects the y-axis. We can therefore interpret it as the predicted \(y\) value for observations with the value 0 on all \(x\) variables. Since nobody in our example slept 0 hours, the value itself cannot be meaningfully interpreted here (but it is required to determine the regression line).

To obtain a meaningfully interpretable intercept, we could center the \(x\) variables. For this purpose, the mean of the variable is subtracted from each observed value: \(x_{i,\text{centered}} = x_i - \bar{x}\). If we estimate the model with variables centered in this way, the intercept can be interpreted as the predicted value for a case with the mean value on all explanatory variables.
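A small sketch with made-up bivariate data confirms this: centering leaves the slope unchanged and turns the intercept into the mean of \(y\) (the predicted value at the mean of \(x\)):

```python
# Centering the x variable: the slope stays the same, but the intercept
# becomes the predicted value at the mean of x, i.e. the mean of y.
# Made-up data for illustration.
xs = [4.0, 5.0, 7.0, 8.0]
ys = [50.0, 62.0, 74.0, 90.0]

def fit(xs, ys):
    """Bivariate OLS: returns (intercept, slope)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
    return y_bar - b * x_bar, b

x_bar = sum(xs) / len(xs)
xs_centered = [x - x_bar for x in xs]

a_raw, b_raw = fit(xs, ys)
a_centered, b_centered = fit(xs_centered, ys)

assert abs(b_raw - b_centered) < 1e-9               # slope unchanged
assert abs(a_centered - sum(ys) / len(ys)) < 1e-9   # intercept = mean of y
```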

As a rule, we will be interested not only in the type of relationship (how many additional test points can I expect for an extra hour of study time?) but also in the strength of the effect. It is natural to use the coefficients for this, which is why they are also called regression weights. A simple comparison of the size of the coefficients can be problematic for two reasons:

  1. The size of the coefficients depends on the units of the variables. In our example this is not a problem; both variables were measured in hours. If, for example, we had recorded the study time in minutes, we would get \(b_{\text{study time}} = 4.5 / 60 = 0.075\). It becomes even more problematic if we want to compare variables on completely different scales (for example: chocolate consumed while studying, in kg).
  2. Whether a variable has a substantively meaningful effect also depends on its empirical distribution. In our example, the pupils differ in their sleep time only between \(4.2\) and \(7.8\) hours, but in their study time between \(0.7\) and \(12.2\) hours. This puts the similar size of the two coefficients into perspective: the "maximum effect" of the sleep time is only \((7.8 - 4.2) \cdot 4.8 = 17.3\) points, while the "maximum effect" of the study time is \((12.2 - 0.7) \cdot 4.5 = 51.75\) points.
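The unit dependence from point 1 can be checked numerically; in this made-up example, rescaling hours to minutes shrinks the slope by the factor 60 while the fitted values stay identical:

```python
# Point 1 above: rescaling a variable rescales its coefficient.
# Made-up bivariate data (hours of study vs. test points).
hours = [1.0, 2.0, 4.0, 5.0]
points = [20.0, 30.0, 44.0, 58.0]

def slope(xs, ys):
    """Bivariate OLS slope."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
           / sum((x - x_bar) ** 2 for x in xs)

b_hours = slope(hours, points)
b_minutes = slope([h * 60 for h in hours], points)   # same data in minutes
assert abs(b_minutes - b_hours / 60) < 1e-9          # coefficient / 60
```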

Use the regression equation to make predictions

From the graph it is also clear that we can use the estimated regression equation to calculate a value of the dependent variable \(y\) for every desired value of the independent variable \(x\), regardless of whether our data contain a case with this \(x\) value or not. In this way, a regression model can also be used to forecast values of the dependent variable for certain values of the independent variable.

We should be careful about making predictions for ranges of the independent variables for which we have no observations. We can easily use the estimated regression equation to predict the score for students who slept for 48 hours before the test. However, it is obviously unrealistic to expect a score of \(12.7 + 9.8 \cdot 48 = 483.1\) in such a case. The same applies to the interpretation of the intercept: since our data contain no students who stayed up all night before the test (i.e. slept 0 hours), we should not use this value as a prediction.
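The extrapolation problem can be illustrated in a few lines, using the bivariate equation estimated in the text:

```python
# Prediction with the estimated bivariate equation y = 12.7 + 9.8 * x.
# Inside the observed range (roughly 4-8 hours of sleep) this is reasonable;
# far outside it, the linear form produces absurd values.
a, b = 12.7, 9.8

def predict(hours_sleep):
    return a + b * hours_sleep

print(predict(7))    # within the observed range: plausible
print(predict(48))   # extrapolation: over 480 points, clearly unrealistic
```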

Several explanatory variables

How can other variables be taken into account for the explanation?

Note: In the Level 2 sections we have already shown a model with more than one independent variable.

The great advantage of regression models is that the linear equation \(a + b \cdot x + e\) used to "explain" the dependent variable \(y\) can easily be expanded to include additional variables. If the study time is to be taken into account in our example, the equation is:

\[\text{score} = a + b_1 \cdot \text{sleep time} + b_2 \cdot \text{study time} + e\]

A graphical representation of the relationship now looks like this:

We are now no longer looking for the best possible straight line through a 2D point cloud, but for the best surface in a 3D point cloud. The surface is described by two slopes: the slope along the "sleep time" axis and the slope along the "study time" axis.

The estimated regression equation in our example is:

  • \(\text{score} = 13.3 + 4.8 \cdot \text{sleep time} + 4.5 \cdot \text{study time} + e\).

The interpretation of the slope coefficients \(b_1\) and \(b_2\) is now:

  • For a student with one hour more sleep, we expect a test result that is \(b_1 = 4.8\) points better if the study time remains the same.
  • For a student with one hour more study time, we expect a test result that is \(b_2 = 4.5\) points better if the sleep time remains the same.

The regression equation can easily be expanded to include further explanatory variables. A graphical representation is then no longer possible, since we would be defining 4D or higher-dimensional spaces, but with the OLS method we can still estimate the corresponding slope coefficients without any problems.

The terms "explained" and "explanatory" variables suggest that causal relationships can be investigated with a regression model. This is not the case! It is important to realize that a regression model can only describe relationships in observed data; it is not a statistical "trick" that can conjure up causal relationships.

However, regression models are often used in practice to make causal relationships plausible on the basis of observational data by ruling out possible competing influences through third-variable controls. Such a procedure can only make a causal interpretation more plausible, never fully establish it.

When formulating the interpretation of the relationships found, we should therefore be very careful not to suggest that we have found a causal effect.

Wolf, Christof / Best, Henning (2010): Linear regression analysis. In: Wolf, Christof / Best, Henning (eds.), Handbuch der Sozialwissenschaftlichen Datenanalyse, pp. 607-638. Wiesbaden: VS Verlag für Sozialwissenschaften.