Clickfraud has become a major concern as more and more companies advertise on the Internet. When Google places an ad for a company with its search results, the company pays a fee to Google each time someone clicks on the link. That’s fine when it’s a person who’s interested in buying a product or service, but not so good when it’s a computer program pretending to be a customer. Analysis of 1200 clicks coming into a business during a week identified 175 of these clicks as fraudulent.

Question

• What is the difference between a critical value and a test statistic? How do you decide which test statistic to use?
• When is it appropriate to use a one-tailed test versus a two-tailed test? Does the direction of the test affect statistical significance? Explain.
• What is a P-value? What does a P-value of 0.0000001 mean?
• What are degrees of freedom? Illustrate with one new example.
• Under what conditions does it make sense to treat these 1200 clicks as a sample? What would be the population?
• Show the 95% confidence interval for the population proportion of fraudulent clicks in a form suitable for sharing with a nontechnical audience.
• Based on the scenario above, if a company pays Google $4.50 for each click, give a confidence interval (again, to presentation precision) for the mean cost due to fraud per click.

Answer 1

To answer your questions, let's break them down one by one:

1. Difference between a critical value and a test statistic:
- A critical value is a value used in hypothesis testing that defines the boundaries for the region of rejection. It helps determine if we should reject the null hypothesis or not.
- On the other hand, a test statistic is a quantity calculated from a sample that is used in hypothesis testing. It measures the strength of evidence against the null hypothesis.

When it comes to deciding which test statistic to use, it depends on the specific hypothesis you are testing, the type of data you have, and the assumptions of the statistical test you are performing. Different tests have different test statistics associated with them.
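As a rough illustration of the distinction, the sketch below computes a one-proportion z test statistic from the click data and compares it with a critical value. The null value `p0 = 0.12` and the significance level `alpha = 0.05` are assumptions made up for this example; the scenario does not state a hypothesized fraud rate.

```python
from math import sqrt
from statistics import NormalDist

n, fraud = 1200, 175      # counts from the scenario
p_hat = fraud / n         # sample proportion of fraudulent clicks

p0 = 0.12                 # hypothetical null value (assumption, not from the scenario)
alpha = 0.05              # assumed significance level

# Test statistic: computed from the sample data.
z_stat = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Critical value: depends only on alpha and the reference distribution.
z_crit = NormalDist().inv_cdf(1 - alpha)   # upper one-tailed cutoff

# Reject the null when the test statistic falls beyond the critical value.
reject_null = z_stat > z_crit
print(f"z = {z_stat:.2f}, critical value = {z_crit:.3f}, reject: {reject_null}")
```

The test statistic changes with every new sample, while the critical value stays fixed once the significance level and the test are chosen.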

2. One-tailed test versus two-tailed test:
- It is appropriate to use a one-tailed test when the alternative hypothesis specifies a direction (e.g., the proportion of fraudulent clicks is higher than a stated value).
- A two-tailed test, on the other hand, is used when the alternative hypothesis does not specify a direction (e.g., the proportion of fraudulent clicks is different from a stated value).

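The direction of the test does affect statistical significance. A one-tailed test counts extreme values in only one direction, whereas a two-tailed test counts both. For the same data and a symmetric test statistic (such as z or t), the two-tailed p-value is twice the one-tailed p-value when the observed effect lies in the hypothesized direction, so a result can be significant in a one-tailed test but not in a two-tailed test at the same significance level. The direction should be chosen before looking at the data.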
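A small sketch of how the tail choice changes the p-value for the same z statistic. The value `z = 1.8` is a hypothetical number chosen only to make the contrast visible at the 0.05 level.

```python
from statistics import NormalDist

z = 1.8                      # hypothetical test statistic (assumption)
std_normal = NormalDist()    # standard normal reference distribution

# One-tailed (upper): probability of a value at least this large.
p_one_tailed = 1 - std_normal.cdf(z)

# Two-tailed: probability of a value at least this far from zero in either direction.
p_two_tailed = 2 * (1 - std_normal.cdf(abs(z)))

print(f"one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.4f}")
```

With this z, the one-tailed p-value falls below 0.05 while the two-tailed p-value does not.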

3. P-value:
- The p-value is the probability of obtaining a test statistic as extreme or more extreme than the one observed, assuming the null hypothesis is true. It measures the strength of evidence against the null hypothesis.
- A p-value of 0.0000001 means that there is a very low probability (0.00001%) of obtaining a test statistic as extreme or more extreme than the one observed, assuming the null hypothesis is true. In other words, it suggests strong evidence against the null hypothesis.

4. Degrees of freedom:
- Degrees of freedom (df) is the number of independent pieces of information available to estimate a quantity. It determines the shape of sampling distributions such as t and chi-square.
- For example, with a sample of size n, the sample variance (and standard deviation) has n-1 degrees of freedom. The deviations from the sample mean must sum to zero, so once n-1 of them are known, the nth is determined.
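The idea that one piece of information is "used up" by the sample mean can be checked numerically. The four data values below are made up for illustration.

```python
from statistics import mean, variance

data = [4.0, 7.0, 9.0, 12.0]          # hypothetical sample (assumption)
m = mean(data)

# Deviations from the sample mean always sum to zero ...
deviations = [x - m for x in data]

# ... so the last deviation can be recovered from the other n-1.
last_from_others = -sum(deviations[:-1])

# The sample variance therefore divides by n-1, its degrees of freedom.
sample_var = sum(d * d for d in deviations) / (len(data) - 1)

print(last_from_others == deviations[-1], sample_var == variance(data))
```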

5. Treating the 1200 clicks as a sample:
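- It makes sense to treat the 1200 clicks as a sample if they can be regarded as representative (ideally random) draws from a larger population of clicks, for example if this week is typical of other weeks. In this case, the population would be all the clicks that the company receives over the longer period of interest.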

6. 95% confidence interval for the population proportion of fraudulent clicks:
- To calculate the confidence interval, you can use the formula:
confidence interval = sample proportion ± margin of error
margin of error = critical value * standard error
standard error = √(p̂(1 − p̂) / n)

Here p̂ = 175/1200 ≈ 0.146, the standard error is √(0.146 × 0.854 / 1200) ≈ 0.0102, and the 95% critical value is 1.96, so the margin of error is about 0.020 and the interval runs from about 0.126 to 0.166.

To share this with a nontechnical audience, you can say something like: "Based on our sample of 1200 clicks, we are 95% confident that between about 13% and 17% of all clicks are fraudulent."
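The interval can be worked out from the scenario's counts (175 fraudulent clicks out of 1200). This is a minimal sketch using the normal-approximation (Wald) interval, which is an assumption about the intended method; other intervals, such as Wilson's, would give slightly different endpoints.

```python
from math import sqrt
from statistics import NormalDist

n, fraud = 1200, 175                       # counts from the scenario
p_hat = fraud / n                          # sample proportion

z_crit = NormalDist().inv_cdf(0.975)       # 95% two-sided critical value (about 1.96)
std_error = sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
margin = z_crit * std_error

lower, upper = p_hat - margin, p_hat + margin
print(f"{lower:.1%} to {upper:.1%}")       # prints "12.6% to 16.6%"
```

For a nontechnical audience, rounding to whole percentages (about 13% to 17%) is usually precise enough.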

7. Confidence interval for mean costs due to fraud per click:
- A fraudulent click costs the company $4.50 and a legitimate click costs nothing in fraud, so the mean cost due to fraud per click is $4.50 × (proportion of fraudulent clicks).
- The general formula, confidence interval = sample mean ± (critical value * standard deviation / √(sample size)), applies to the per-click fraud cost, but here it is simpler to rescale the interval for the proportion.

Multiplying both ends of the 95% interval for the proportion (about 0.126 to 0.166) by $4.50 gives a mean cost due to fraud of about $0.57 to $0.75 per click, centered near $0.66.
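A short sketch of the conversion: because the fraud cost per click is $4.50 times a 0/1 indicator, the interval for the mean cost is the proportion interval scaled by $4.50. The normal-approximation interval for the proportion is assumed here.

```python
from math import sqrt
from statistics import NormalDist

n, fraud = 1200, 175          # counts from the scenario
cost_per_click = 4.50         # dollars, from the scenario

p_hat = fraud / n
margin = NormalDist().inv_cdf(0.975) * sqrt(p_hat * (1 - p_hat) / n)

# Scale both endpoints of the proportion interval by the cost per click.
lower_cost = cost_per_click * (p_hat - margin)
upper_cost = cost_per_click * (p_hat + margin)

print(f"${lower_cost:.2f} to ${upper_cost:.2f} per click")   # prints "$0.57 to $0.75 per click"
```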

Answer 2

1. The difference between a critical value and a test statistic:

- A critical value is a value used in hypothesis testing that determines the boundary beyond which a hypothesis is rejected. It is based on the significance level chosen for the test.
- A test statistic is a numerical summary of the data used to make inference about the population parameter. It is used to calculate the p-value and compare it to the significance level to make a decision about the hypothesis.

The choice of which test statistic to use depends on the type of data (e.g., categorical or continuous), the nature of the research question, and the statistical test being performed.

2. One-tailed vs. two-tailed tests:
- A one-tailed test is appropriate when the research question has a specific directional hypothesis. It tests whether the population mean or proportion is greater than (or less than) a specified value.
- A two-tailed test is appropriate when the research question has a non-directional hypothesis. It tests whether the population mean or proportion is different from a specified value.

The direction of the test may affect statistical significance because the critical region is divided differently for one-tailed and two-tailed tests. In a one-tailed test, all the significance level is allocated to one tail, while in a two-tailed test, it is split between both tails.

3. P-value:
- The p-value is a measure of the evidence against the null hypothesis. It is the probability of obtaining a test statistic as extreme as, or more extreme than, the one observed, assuming the null hypothesis is true.
- A p-value of 0.0000001 (or 1e-7) means that the probability of a test statistic at least as extreme as the one observed, under the null hypothesis, is very small: 0.00001%, or about 1 in 10 million. This indicates strong evidence against the null hypothesis and suggests statistical significance.
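To get a feel for how extreme such a result is, the sketch below converts a two-tailed p-value of 1e-7 back into the z statistic that would produce it. Treating the p-value as two-tailed and the reference distribution as standard normal are both assumptions; the question does not say which test produced it.

```python
from statistics import NormalDist

p_value = 1e-7                      # the p-value from the question

# For a two-tailed z test, half of the p-value sits in each tail.
z_equivalent = -NormalDist().inv_cdf(p_value / 2)

odds_against = 1 / p_value          # "about 1 in this many" under the null

print(f"z is about {z_equivalent:.2f}; roughly 1 in {odds_against:,.0f} under the null")
```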

4. Degrees of freedom:
- Degrees of freedom (df) represent the number of values that are free to vary in the calculation of a statistic. It is used to determine critical values from probability distributions.
- For example, in a pooled (equal-variance) t-test comparing the means of two independent samples, the degrees of freedom are calculated as the sum of the sample sizes minus two (df = n1 + n2 - 2), where n1 and n2 are the sample sizes of the two groups being compared.
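The n1 + n2 - 2 rule applies to the pooled (equal-variance) version of the two-sample t-test. As a hedged comparison, the sketch below also computes the Welch-Satterthwaite approximation used when the variances are not assumed equal. The function names, sample sizes, and standard deviations are all hypothetical.

```python
def pooled_df(n1: int, n2: int) -> int:
    """Degrees of freedom for the pooled two-sample t-test."""
    return n1 + n2 - 2


def welch_df(s1: float, n1: int, s2: float, n2: int) -> float:
    """Welch-Satterthwaite approximate degrees of freedom (unequal variances)."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))


# Hypothetical groups: sizes 12 and 15, standard deviations 2.0 and 3.5.
print(pooled_df(12, 15))                     # prints 25
print(round(welch_df(2.0, 12, 3.5, 15), 1))
```

The Welch value is never larger than the pooled value, and the two agree when the groups have equal sizes and equal standard deviations.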

5. Treating the 1200 clicks as a sample and population:
- It makes sense to treat the 1200 clicks as a sample when it is not possible or practical to collect data on the entire population of clicks.
- The population, in this case, would be all the potential clicks that could occur on Google ads placed by the company during the specified time period.

6. 95% confidence interval for the population proportion of fraudulent clicks:
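- A suitable 95% confidence interval to share with a nontechnical audience would be:
"The estimated proportion of fraudulent clicks in the population is between about 13% and 17%, with 95% confidence."
(The sample proportion is 175/1200 ≈ 14.6%, with a margin of error of about 2 percentage points.)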

7. Confidence interval for the mean costs due to fraud per click:
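- In general, the confidence interval for a mean requires the sample mean (X̄), sample standard deviation (s), sample size (n), and the desired level of confidence (e.g., 95%).

Here $4.50 is the price of every click, not the mean cost due to fraud. Each fraudulent click costs $4.50 in fraud and each legitimate click costs $0, so the sample mean fraud cost per click is $4.50 × 175/1200 ≈ $0.66, and the standard deviation follows from the same 0-or-$4.50 data. Equivalently, multiply the 95% interval for the proportion (about 12.6% to 16.6%) by $4.50: the mean cost due to fraud is about $0.57 to $0.75 per click.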
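The sample mean and standard deviation of the fraud cost can be reconstructed from the scenario, since each of the 1200 clicks costs either $4.50 (fraudulent) or $0 (legitimate) in fraud. The sketch below applies the mean ± critical value × s/√n formula to that reconstructed data; using a z critical value instead of t is an assumption that makes little difference with n = 1200.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Per-click fraud cost: $4.50 for each of the 175 fraudulent clicks, $0 otherwise.
costs = [4.50] * 175 + [0.0] * (1200 - 175)

n = len(costs)
x_bar = mean(costs)                      # sample mean fraud cost per click
s = stdev(costs)                         # sample standard deviation (n-1 divisor)

z_crit = NormalDist().inv_cdf(0.975)     # 95% two-sided critical value
margin = z_crit * s / sqrt(n)

lower_cost, upper_cost = x_bar - margin, x_bar + margin
print(f"${lower_cost:.2f} to ${upper_cost:.2f} per click")   # prints "$0.57 to $0.75 per click"
```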