I'm currently working on a project to see which data imputation method works best with a dataset I have.

Question

I have the complete dataset.

Dependent variable: Yield of the crop
Independent variables: Year, Season, Production per hectare

So I'm planning to apply data imputation methods such as multiple linear regression, KNN, and polynomial interpolation.

My method is to randomly remove some Yield values (the test set), impute them with each of the above techniques trained on the rest of the dataset, and compare the imputed values with the original Yield values.
Then I plan to select the imputation method that works best for this dataset.

I'm doing this in Python (Google Colab environment).

So far I've coded up to the part where I train the model using an 80:20 train:test split.

I've computed the linear regression coefficients, and the model's predicted Yield values have already been inserted into my test dataset.

Since I need graphical and statistical evidence of the efficiency and accuracy of each model, how should I impute Yield values for the whole dataset and compare them with the original Yield values?

Do I have to manually write out the equation of the linear model, substitute the independent variables, compute the Yield values, and then compare them with the original Yield values?

Is there any code that automatically adds a column with the Yield values derived from the linear regression model for the whole dataset? Any method that gives estimates for all the Yield values in the dataset would do.

Answer 1

To impute the Yield values for the entire dataset using the linear regression model, you can use the coefficients and the intercept obtained from training the model.

1. First, fit your linear regression model to the training data, as you mentioned you have already done.
2. Retrieve the coefficients and the intercept of the linear regression model. In Python (assuming scikit-learn), these are the `coef_` and `intercept_` attributes of the fitted model object.
3. Multiply each independent variable in your complete dataset by its corresponding coefficient obtained from the model.
4. Sum up the products for each row in the dataset and add the intercept to obtain the estimated Yield value.

Here's an example code snippet that demonstrates how to perform these steps:

```python
# Assumes `model` is your already-fitted linear regression (e.g. sklearn's
# LinearRegression) and `data` is a pandas DataFrame holding the complete
# dataset, including Year, Season and Production per hectare.

# Select the independent variables in the same column order used for training
X = data[['Year', 'Season', 'Production per hectare']]

# Multiply each independent variable by its coefficient, sum per row,
# and add the intercept. Season must already be numerically encoded.
imputed_yield = (X * model.coef_).sum(axis=1) + model.intercept_

# Add a new column with the imputed Yield values to the DataFrame
data['Imputed Yield'] = imputed_yield
```

By following these steps, you will have a new column in your dataset called 'Imputed Yield' that contains the estimated Yield values derived from the linear regression model.
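If the model is a scikit-learn `LinearRegression` (an assumption, since the question does not name the library), `predict` computes the same weighted sum plus the intercept in one call, so there is no need to write the equation by hand. Below is a minimal sketch on made-up data. The toy values, the 0/1 encoding of Season, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real dataset
rng = np.random.default_rng(0)
n = 50
data = pd.DataFrame({
    "Year": rng.integers(2000, 2020, size=n),
    "Season": rng.integers(0, 2, size=n),  # assumed already encoded as 0/1
    "Production per hectare": rng.uniform(1.0, 5.0, size=n),
})
data["Yield"] = (
    0.1 * (data["Year"] - 2000)
    + 0.5 * data["Season"]
    + 2.0 * data["Production per hectare"]
    + rng.normal(0.0, 0.1, size=n)
)

features = ["Year", "Season", "Production per hectare"]
X, y = data[features], data["Yield"]

# 80:20 split, as described in the question
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Estimate Yield for every row (training and test rows alike)
data["Imputed Yield"] = model.predict(X)

# The hand-written equation gives the same numbers
manual = (X * model.coef_).sum(axis=1) + model.intercept_
```

Keep in mind that the model has already seen the training rows, so its estimates there will usually look better than on the held-out 20%. For a fair comparison between imputation methods, the error is best measured on the held-out rows.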

You can then compare the imputed Yield values with the original Yield values in your dataset to evaluate the accuracy and efficiency of the linear regression model.
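For the statistical side of that comparison, a few standard error metrics are usually enough. The sketch below uses a hypothetical helper, `imputation_report`, written only with NumPy. The four sample values are invented to show the call and are not results from the real dataset.

```python
import numpy as np
import pandas as pd

def imputation_report(actual, imputed):
    """Return MAE, RMSE and R^2 between original and imputed values.

    Hypothetical helper for illustration; not part of any library.
    """
    actual = np.asarray(actual, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    residuals = actual - imputed
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((actual - actual.mean()) ** 2)   # total variation
    return {
        "MAE": float(np.mean(np.abs(residuals))),
        "RMSE": float(np.sqrt(np.mean(residuals ** 2))),
        "R2": float(1.0 - ss_res / ss_tot),
    }

# Invented example values, only to demonstrate the call
data = pd.DataFrame({
    "Yield": [2.0, 3.0, 4.0, 5.0],
    "Imputed Yield": [2.5, 3.0, 3.5, 5.0],
})
report = imputation_report(data["Yield"], data["Imputed Yield"])
print(report)
```

Running the same helper on the output of each imputation method (regression, KNN, interpolation) gives directly comparable numbers. For the graphical evidence, a scatter plot of original against imputed Yield with a 45-degree reference line, for example with matplotlib, shows at a glance how far the estimates stray.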