Unlike descriptive statistics, which are used to describe the characteristics (i.e. distribution, central tendency, and dispersion) of a single variable, inferential statistics are used to make inferences about the larger population based on the sample. Since a sample is a small subset of the larger population (or sampling frame), the inferences are necessarily error prone. That is, we cannot say with 100% confidence that the characteristics of the sample accurately reflect the characteristics of the larger population (or sampling frame). Hence, only qualified inferences can be made, within a degree of certainty, which is often expressed in terms of probability (e.g., 90% or 95% probability that the sample reflects the population).
Typically, inferential statistics deals with analyzing two (called BIVARIATE analysis) or more (called MULTIVARIATE analysis) variables. In this discussion, we will limit ourselves to 2 variables, i.e. BIVARIATE ANALYSIS.
Several types of inferential statistics are available, and the choice depends on the type of variable (i.e. NOMINAL, ORDINAL, INTERVAL/RATIO). While the type of statistical analysis is different for these variables, the main idea is the same: we try to determine how one variable compares to another. Values of one variable could be systematically higher than, lower than, or the same as the other (e.g., men's and women's wages). Alternatively, there could be a relationship between the two (e.g. age and wages), in which case we find the correlation between them. The different types of analysis can be summarized as below:
| Type of Variables | Inferential statistics |
|---|---|
| Nominal (e.g. GENDER, male and female) | Compare the DISTRIBUTION, CENTRAL TENDENCY [Carry out separate test to check the validity (i.e. margin of error) of above comparison, in which DISPERSION measures are used] |
| Ordinal (e.g. class grades) | Beyond scope [should be taught in Statistics class] |
| Ratio/Interval (e.g. AGE and WAGE) | Regression Analysis |
Often, we need to compare two categories of a nominal variable. For example, we might want to find out if MEN earn more than WOMEN. In this example, MEN and WOMEN are nominal values of GENDER. We are comparing their EARNINGS, which has RATIO values. Hence, in this case, we might compare the CENTRAL TENDENCY for MEN's EARNINGS to WOMEN's EARNINGS. What measure of CENTRAL TENDENCY (i.e. mean, median, or mode) do you think will be most appropriate to compare in this case? [Hint: Obviously, this will depend on the FREQUENCY distribution of EARNINGS. Typically, it is a SKEWED distribution.] As mentioned earlier, we cannot be 100 percent confident of this comparison. To verify if the comparison is valid, we need to calculate the t-statistic [should be covered in Statistics class].
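The comparison described above can be sketched in Python. This is only an illustration under stated assumptions: the earnings figures are made up, the function name `welch_t` is hypothetical, and the t-statistic shown is Welch's unequal-variance form, which is one common choice and not necessarily the version taught in your Statistics class.

```python
import math
import statistics

# Hypothetical hourly earnings (made-up numbers, for illustration only).
# Each group contains one very high earner, so both distributions are skewed.
men = [18.0, 20.0, 22.0, 25.0, 27.0, 30.0, 95.0]
women = [16.0, 18.0, 19.0, 21.0, 24.0, 26.0, 80.0]

def welch_t(a, b):
    """Welch's t-statistic for the difference between two group means."""
    var_a = statistics.variance(a)  # sample variance (a DISPERSION measure)
    var_b = statistics.variance(b)
    std_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / std_error

# Compare the CENTRAL TENDENCY of the two groups.
for name, group in (("MEN", men), ("WOMEN", women)):
    print(name, "mean:", round(statistics.mean(group), 2),
          "median:", statistics.median(group))

print("t-statistic:", round(welch_t(men, women), 3))
```

In skewed data like this, the mean is pulled upward by the few high earners while the median is not, which is why the median is often the more representative measure to compare.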
Regression analysis is used to measure the degree of relationship between two or more RATIO variables. Consider any two RATIO variables, for example AGE and WAGES. One might reasonably expect that WAGES might increase as AGE increases, based on the hypothesis that one's experience increases with age. Thus, consider the following hypothesis:
Hypothesis: WAGES are positively related to AGE. [That is, the higher the AGE, the higher the WAGES; the lower the AGE, the lower the WAGES.]
Of course, AGE is not the only factor that determines WAGES. There might be other factors. GENDER is often such a factor (Census Bureau figures reveal that women earn less than men); EDUCATION might be another; and so on. Despite such other factors, we may reasonably be inclined to test the above hypothesis to see if it is indeed true. This is a BIVARIATE analysis since we are using only 2 variables. We could use regression analysis to find out the relationship between AGE and WAGES, i.e. test whether there is indeed a relationship between the two variables.
In the above example, clearly, AGE is the INDEPENDENT variable, and WAGES is the DEPENDENT variable. In regression analysis, the DEPENDENT variable is generically denoted by Y, and the INDEPENDENT variable is denoted by X. [Below, whenever I refer to X, it is the independent variable; Y is the dependent variable.]
The first step in the regression analysis is to chart the X and Y values graphically to see if there is indeed a relationship between X and Y. X is typically on the horizontal (x) axis; Y is typically on the vertical (y) axis. This chart of plotted values is called a scatterplot. The scatterplot should give you a good visual clue as to whether X and Y are related or not. A POSITIVE association between AGE and WAGES would have an upward trend (positive slope), where higher WAGES correspond to higher AGE and lower WAGES correspond to lower AGE. A NEGATIVE association would be indicated by the opposite effect (negative slope), where the older individuals (i.e. higher AGE) have lower WAGES than the younger individuals (i.e. lower AGE) (this could arguably apply in computer programming, which is a relatively young field). A RANDOM association (i.e. zero association) is one where the scatterplot does not indicate any trend (i.e. either positive or negative). In this case, young as well as old individuals may have high or low earnings (i.e. the trend would be flat). There are, however, many cases where the relationship between X and Y is not linear; the relationship may be curvilinear, e.g., U or reverse U. For example, WAGES might rise with AGE up to a certain number of years (say, retirement), and decrease after that (a reverse U). All of this information can be visually gleaned from the scatterplot. Examine the following scatterplots.
[Figures 1-4: scatterplots showing 1. Positive Correlation; 2. Negative Correlation; 3. Random (i.e. NO) correlation; 4. Non-linear (reverse U) correlation]
Obviously, if the hypothesis stated above is true, we should expect to see Figure 1 if we drew a scatterplot of AGE and WAGES. If we somehow get any of the other scatterplots, understandably, the hypothesis may not be true.
The above scatterplots give a good idea of the overall type of relationship between the X (Independent) and Y (Dependent) variables. Yet, they do not give us a precise (i.e. mathematically accurate) idea of the relationship between the two variables. Hence, the second step is to test the relationship mathematically. We will deal only with LINEAR relationships here. In a linear relationship, if you recall high school mathematics, the relationship between X and Y can be described by a single line. A line is given by the equation:
| Y = A + B * X, where Y = Dependent variable; X = Independent variable; A = intercept (the value of Y when X = 0); B = slope of the line |
I will not get into the statistical procedures for how to calculate the values for A and B; these are covered in the class on statistics [You can simply calculate this using Excel, as shown in class]. Here, my interest is more in explaining and interpreting what these values mean. From the scatterplot and the regression line, you should be able to more precisely understand the relationship between X and Y. There are several likely scenarios:
(a) the line is at 45 degrees (i.e. B = 1), which means that for a 1 unit increase in X, there is a corresponding 1 unit increase in Y. That means our hypothesis of a positive relationship holds. However, an exact one-to-one relationship is rarely the case in most social science studies;
(b) the line is inclined but is off from 45 degrees (i.e. B is not 0 and not 1), which means X and Y are indeed related (i.e. for 1 unit increase in X, Y changes by B units, which may be a fraction). If there is a positive slope (i.e. the line is inclined upward), the hypothesis holds true; if there is a negative slope, the hypothesis does not hold true. This is the more likely case on many occasions.
(c) the line is horizontal (i.e. B = 0), which means Y does not change as X changes, so X and Y are not linearly related. This means our hypothesis is not true.
Thus the value B tells much about the relationship between the Independent and Dependent variables.
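For readers who want to see where A and B come from without Excel, here is a minimal sketch using the ordinary least squares formulas. The function name `fit_line` and the AGE/WAGE figures are hypothetical, chosen only for illustration; the procedure itself is the standard one covered in a Statistics class.

```python
def fit_line(xs, ys):
    """Return (A, B) for the least squares line Y = A + B * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # B = sum of cross-deviations divided by sum of squared X deviations.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    # The fitted line always passes through the point (mean_x, mean_y).
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data: AGE in years, WAGE in dollars per hour.
ages = [22, 28, 35, 41, 50, 58]
wages = [14.0, 19.5, 26.0, 30.5, 39.0, 46.5]

a, b = fit_line(ages, wages)
print("A (intercept):", round(a, 2))
print("B (slope):", round(b, 2))
if b > 0:
    print("Positive slope: consistent with the hypothesis.")
elif b < 0:
    print("Negative slope: not consistent with the hypothesis.")
else:
    print("Flat line: no linear relationship.")
```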
The regression equation is really useful in predicting the value of Y for a given value of X. That is, in the above example of relationship between AGE and WAGE, you will be able to predict what WAGE one will earn at a particular AGE, when the values of A and B are given. Thus, suppose the regression equation between AGE and WAGE is given as (A= -6; B= 0.9):
WAGE = -6 + 0.9 * AGE [WAGE is hourly; AGE is in years]
Then, at AGE 45, the person could expect to receive: -6 + 0.9 * 45 = -6 + 40.5 = $34.50 per hour.
[The value A is the value of Y when X = 0. This value is of no statistical use unless X can actually take values near 0.]
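The prediction step can be written as a small function. This is a sketch only: the name `predict_wage` is hypothetical, and the default values of A and B are simply the ones from the example equation above.

```python
def predict_wage(age, a=-6.0, b=0.9):
    """Predicted hourly WAGE from the line WAGE = A + B * AGE."""
    return a + b * age

# Prediction for the worked example (AGE = 45).
print("Predicted hourly wage at AGE 45:", predict_wage(45))

# At AGE = 0 the line returns A itself, a negative wage, which is why
# A has no practical meaning when X cannot realistically be near 0.
print("Value of the line at AGE 0:", predict_wage(0))
```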
Obviously, from the scatterplot and regression equation, you should now be able to tell if there is indeed any relationship between the Independent and Dependent variables. The third step tells you how much of an effect the Independent variable has on the Dependent variable. Here, we calculate the Correlation coefficient. This coefficient, also called Pearson's R, gives the strength of relationship between the two variables. [Again, I am not describing how to calculate; this should be covered in Statistics class; you can simply do this using Excel as shown in class]. The value of Pearson's R can range anywhere between -1 and 1: the sign gives the direction of the relationship (positive or negative), and the magnitude gives its strength. Generally, in social science, a magnitude of R above 0.6 indicates a strong relationship between the two variables. A magnitude between 0.3 and 0.6 indicates a moderate relationship. Anything below 0.3 indicates a weak relationship.
More generally, the value of R-squared (i.e. the squared value of Pearson's R) is calculated to give the proportion of the variation in the dependent variable that is explained by the independent variable. The R-squared value is always between 0 and 1. Let's say in the above example, the Pearson's R is 0.7. This value indicates that there is a strong relationship between AGE and WAGES. The R-squared value is 0.7 * 0.7 = 0.49. This means that AGE explains 49% of the variation in WAGES. [The other 51 percent is due to other factors, such as education, etc.]
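As a rough illustration of how R and R-squared are obtained, the sketch below computes Pearson's R from its textbook definition. The function name `pearson_r` and the data are hypothetical (made-up numbers), and the strength labels simply reuse the social science thresholds given above.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient R between two lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def strength(r):
    """Label the relationship using the thresholds from the text."""
    size = abs(r)  # the sign of R only indicates direction
    if size > 0.6:
        return "strong"
    if size >= 0.3:
        return "moderate"
    return "weak"

# Hypothetical data: AGE in years, WAGE in dollars per hour.
ages = [22, 28, 35, 41, 50, 58]
wages = [15.0, 17.5, 28.0, 27.5, 41.0, 43.5]

r = pearson_r(ages, wages)
print("Pearson's R:", round(r, 3), "(" + strength(r) + ")")
print("R-squared:", round(r ** 2, 3))
```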
There are additional steps required to test if the values of R and R-squared above are indeed reliable; these should be covered in your Statistics class.