1. A regression analysis includes the effect of Age, which is categorised into five levels. How many dummy variables should be defined for including it in the analysis?

2. A regression analysis includes Sex as an explanatory variable. How many dummy variables are required to be defined?

3. A regression analysis has:

- five continuous variables;
- sex;
- ethnic group with five levels; and
- education at four levels.

What are the degrees of freedom of the regression sum of squares in the ANOVA table?

4. When point A is removed from a regression analysis the correlation coefficient increases. Which one of the following must be correct?
Select one:
a. A corresponds to a large residual value.
b. A is an outlier.
c. A should not be removed from the regression.
d. A is a point of high leverage.

5. When point A is removed from a regression analysis the correlation coefficient decreases. Which one of the following must be correct?
Select one:
a. A corresponds to a low residual value.
b. A is an outlier.
c. A is a point of high leverage.
d. A should not be removed from the regression.

6. In a multiple regression the first full model is fitted with all the explanatory variables. Several of the variables are not significant. Which one of the following is the best way to proceed?
Select one:
a. Remove all the non-significant variables from the model.
b. Remove the variable with the lowest p-value first.
c. Remove the non-significant variable with the largest p-value first.
d. There is no need to remove any variables, as this will reduce the coefficient of variation.


1. To include the effect of Age, categorised into five levels, you need to define four dummy variables. In general, a categorical variable with k levels requires k-1 dummy variables: one level serves as the reference category, and each dummy variable captures the effect of one of the other levels relative to that reference.
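The k-1 coding described above can be sketched in plain Python. The level labels and helper name below are purely illustrative, not part of the question:

```python
def make_dummies(levels, value):
    """Return k-1 indicator values for a k-level factor,
    using the first level as the reference category."""
    reference, *others = levels
    return [int(value == lvl) for lvl in others]

# Hypothetical five-level Age variable
age_levels = ["18-29", "30-44", "45-59", "60-74", "75+"]

print(make_dummies(age_levels, "45-59"))  # [0, 1, 0, 0]
print(len(make_dummies(age_levels, "18-29")))  # 4 dummies; all zero for the reference level
```

Note that an observation in the reference category is coded as all zeros, which is how its effect is absorbed into the intercept.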

2. Sex is a binary variable with two categories (e.g. male or female), so only one dummy variable is needed. Either category can be chosen as the reference, and the dummy variable captures the effect of the other category relative to it.

3. The degrees of freedom (df) of the regression sum of squares in the ANOVA table equal the total number of parameters fitted for the explanatory variables (excluding the intercept). Here the model includes:
- five continuous variables, contributing 5 × 1 = 5 degrees of freedom;
- sex, a binary variable, contributing 2 - 1 = 1 degree of freedom;
- ethnic group with five levels, contributing 5 - 1 = 4 degrees of freedom; and
- education at four levels, contributing 4 - 1 = 3 degrees of freedom.

Therefore, the total degrees of freedom for the regression sum of squares would be 5 + 1 + 4 + 3 = 13.

4. When the correlation coefficient increases after point A is removed from a regression analysis, this suggests that A is an outlier: an observation that does not follow the general pattern of the data and has a large residual relative to the fitted line. Such a point weakens the apparent linear relationship, so removing it can improve the overall fit and increase the correlation coefficient.

5. When the correlation coefficient decreases after point A is removed from a regression analysis, this suggests that A is a point of high leverage. Leverage measures how strongly an individual observation, typically one with an extreme value of the explanatory variable, can pull the regression line towards itself. A high-leverage point that lies close to the trend of the remaining data strengthens the apparent linear relationship, so removing it reduces the correlation coefficient.
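The leverage effect can be illustrated with made-up numbers: a small noisy cloud plus one far-out point that follows the trend. The data below are invented for this sketch:

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Noisy cloud plus one high-leverage point (20, 20.5) that follows the trend
xs = [1, 2, 3, 4, 5, 20]
ys = [2.1, 1.4, 3.2, 2.6, 3.9, 20.5]

r_with = pearson_r(xs, ys)
r_without = pearson_r(xs[:-1], ys[:-1])
print(r_with > r_without)  # True: removing the leverage point lowers r
```

With the leverage point included, r is close to 1 because that single point dominates the sums of squares; without it, r falls back to what the noisy cloud alone supports.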

6. When several variables in a multiple regression model are not significant, the best way to proceed depends on the goals of the analysis, but a common approach (backward elimination) is to remove the non-significant variable with the largest p-value first and then refit the model. The p-value is the probability of observing data at least as extreme as that obtained, assuming the null hypothesis of no relationship between the explanatory variable and the response is true, so the variable with the largest p-value has the weakest evidence of an effect. Variables are removed one at a time because the remaining p-values change after each refit; removing all non-significant variables at once can discard variables that would be significant in the reduced model. As always, consider the context and interpret the results carefully before removing any variable.
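The elimination loop can be sketched as follows. The p-values here are invented and held fixed for illustration; in a real analysis the model would be refitted and the p-values recomputed after each removal:

```python
ALPHA = 0.05  # significance threshold

# Hypothetical p-values for four explanatory variables
p_values = {"x1": 0.001, "x2": 0.42, "x3": 0.03, "x4": 0.18}

removed = []
while p_values:
    # Find the variable with the largest p-value
    variable, p = max(p_values.items(), key=lambda item: item[1])
    if p <= ALPHA:
        break  # all remaining variables are significant
    removed.append(variable)  # drop the least significant variable first...
    del p_values[variable]    # ...then (in practice) refit and recompute p-values

print(removed)           # ['x2', 'x4']
print(sorted(p_values))  # ['x1', 'x3']
```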