Question 3 [17 Marks)

Question

Question 3 [17 Marks)

There are five (5) set of data; A to E is provided for this question. Each group will be assigned to analyze
ONE dataset only and answer the questions below. The assigned dataset will be determined by your
lecturer in separate announcement in MMLS.
Below are the descriptions for each of the dataset:
Dataset A : The high level of serum cholesterol may result in the risk of having cardiovascular disease.
There are several researches which found that the body weight is closely related to the cardiovascular
disease. In this study, we would like to investigate if the body weight (in pound) may influence the level
of serum cholesterol (in mg) of people in order to identify the prevention strategy towards
cardiovascular disease.
Dataset B: This data gives the values of food price index and house price measure for a particular state
in UK from 1971 to 1989. In this study, we would like to investigate if the food price index (in %) may has
some significant effect on the average price (in $100) of houses in the state.
Dataset C: This data provides list of standardized mortality ratio (SMR) of deaths from lung cancer for 25
groups of job among male employees. Also given are smoking ratios for each job group. In this study, we
would like to identify if the smoking ratio may influence the lung cancer SMR.
Dataset D: This data come from a study in Britain of possible influences on the extent to which parents
consult a doctor when their child is ill. The parents were asked of how often each child had been ill in
the past two weeks and reported to the doctor. Ages of the child were also recorded. Does age have an
effect on the rate of reporting illness to doctors?
Dataset E: This data consist of 36 consecutive monthly sales and monthly advertising expenditures for a
dietary weight control product. This study would like to investigate if there is any significant effect of the
total expenditures of advertising towards total sales.
Using the dataset in EXCEL spreadsheet , answer the following questions.
a. Determine the dependent variable, Y and independent variable, X for your dataset based on the
description provided. [2 marks]
b. Generate the simple linear regression model using EXCEL function between dependent variable,
Y and the independent, X. [3 marks]
c. Determine the estimated regression equation from output in (b) and interpret the coefficients.
[3 marks]
d. Can we infer that the dependent variable has significant association with the independent
variable? [4 marks]
e. State the coefficient of determination and interpret its value. [2 marks]
f. Graph the relationship between the two variables as a scatter plot. Discuss the relationship of
the variables based on the pattern of straight line of the scatter plot. [3 marks]

Answer 1

a. To determine the dependent and independent variables for the assigned dataset, we need to analyze the description provided.

For Dataset A: The dependent variable is the level of serum cholesterol (in mg), represented by Y. The independent variable is the body weight (in pounds), represented by X.

b. To generate the simple linear regression model using Excel function, we will use the built-in regression analysis tool.

Step 1: Select the dataset in Excel that contains both the dependent variable (Y - level of serum cholesterol) and the independent variable (X - body weight).
Step 2: Go to the Data tab in Excel and click on the Data Analysis button.
Step 3: In the Data Analysis dialog box, select Regression and click OK.
Step 4: In the Regression dialog box, specify the input range for the dependent variable (Y) and the independent variable (X).
Step 5: Check the box for Labels if your data contains column headers, and select an output range for the regression analysis results.
Step 6: Click OK to generate the simple linear regression model.

c. The output of the regression analysis will provide the estimated regression equation. Interpretation of the coefficients can be done based on the output.

The estimated regression equation will be in the form of Y = a + bX, where a is the intercept coefficient and b is the slope coefficient.

d. To determine if there is a significant association between the dependent variable and the independent variable, we need to analyze the p-values of the coefficients in the regression analysis output.

If the p-value is less than the significance level (e.g., 0.05), we can infer that there is a significant association between the dependent variable and the independent variable.

e. The coefficient of determination (R-squared) represents the proportion of the variation in the dependent variable that can be explained by the independent variable(s).

To determine the coefficient of determination, we need to look at the output of the regression analysis.

Interpretation of the coefficient of determination:
- If R-squared is close to 1, it indicates that a high proportion of the variation in the dependent variable is explained by the independent variable(s).
- If R-squared is close to 0, it indicates that the independent variable(s) have little or no effect in explaining the variation in the dependent variable.

f. To graph the relationship between the two variables as a scatter plot, we can use Excel's charting capabilities.

Step 1: Select the two columns of data representing the dependent variable (Y) and the independent variable (X).
Step 2: Go to the Insert tab in Excel and click on the Scatter chart button.
Step 3: Choose the scatter plot option that best represents the relationship between the variables (e.g., scatter plot with straight lines).
Step 4: Excel will generate the scatter plot based on the selected data.

Based on the pattern of the straight line in the scatter plot, we can observe the relationship between the variables. If the plot shows a clear upward or downward trend, it indicates a positive or negative relationship between the variables, respectively. If the plot doesn't show a clear trend, it suggests a weak or no relationship between the variables.