This is an open-access article distributed under the terms of the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Missing data, or missing values, are information that is not available for a subject (case). Missing data occur because some information on the subject was not given, is difficult to find, or does not exist. If missing data are ignored, it becomes difficult to obtain high classification accuracy even when the most reliable classification algorithm is used. One way of handling missing data is imputation. Several imputation methods can be used to replace missing data, including constant-value, hot deck, regression, expectation maximization, and multiple imputation.

To analyze and compare the hot deck and regression methods and determine which is the better method for imputing missing data.

The data are from respondents who practice family planning in the town of Pasuruan, East Java, Indonesia; the variable used is age. Values of the age variable were deleted to simulate missing data and then imputed by hot deck or regression. The original data will be compared with the imputed data using

Imputation results on the simulated age data show that the regression method is better than the hot deck method for handling missing data in health science.

The best method views from the results are not significant

Missing data, or missing values, are information that is not available for a subject (case). Missing data occur because some information on the subject was not given, is difficult to find, or does not exist.

Methods for handling missing data in a statistical analysis include procedures based on completely recorded units, model-based procedures, weighting procedures, and imputation-based procedures. Several imputation methods can be used to replace missing data, including constant-value, hot deck, regression, expectation maximization, and multiple imputation. Some research shows that handling missing data by imputation can increase classification accuracy compared with no imputation.

This research compares two imputation methods: hot deck and regression. Hot deck replaces missing data with an average value taken from similar cases, although it tends to underestimate standard errors. Before using this method, the data must first be sorted by the variables that are related to the item with missing data. Cases that fall in the same cluster are then placed in the same file. The weakness of hot deck is that when missing data are repeatedly filled with the same value, the resulting predictions will be biased.
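The clustering step described above can be illustrated with a minimal sketch. It assumes a random-donor variant of hot deck, in which a missing value is filled with an observed value from a case in the same cluster; this is one common variant and is not necessarily the exact donor rule used in this study. The field name `parity_group` and all record values are hypothetical.

```python
import random
from collections import defaultdict


def hot_deck_impute(records, cluster_key, target_key, rng):
    """Fill missing target values with a donor value from the same cluster."""
    # Collect the observed (non-missing) values in each cluster as donors.
    donors = defaultdict(list)
    for rec in records:
        if rec[target_key] is not None:
            donors[rec[cluster_key]].append(rec[target_key])

    imputed = []
    for rec in records:
        new_rec = dict(rec)  # copy so the input records are left untouched
        if new_rec[target_key] is None:
            pool = donors.get(new_rec[cluster_key])
            if not pool:
                raise ValueError(f"no donor in cluster {new_rec[cluster_key]!r}")
            # Random-donor rule (an assumption of this sketch).
            new_rec[target_key] = rng.choice(pool)
        imputed.append(new_rec)
    return imputed


# Hypothetical records; 'parity_group' is an invented clustering variable.
records = [
    {"parity_group": "0-1", "age": 24},
    {"parity_group": "0-1", "age": None},
    {"parity_group": "0-1", "age": 27},
    {"parity_group": "2+", "age": 35},
    {"parity_group": "2+", "age": None},
    {"parity_group": "2+", "age": 38},
]
filled = hot_deck_impute(records, "parity_group", "age", random.Random(0))
for rec in filled:
    print(rec)
```

Because each donor pool is small, the same donor value can be reused many times, which is the source of the bias mentioned above.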

The data used in this study are monitoring data on fertile couples from a mini survey conducted in Indonesia in 2014. A mini survey is a research method for collecting and analyzing simple data quantitatively, cheaply, and quickly.

This is non-reactive research, a type of research that uses secondary data.

The data used are the age variable of 80 respondents, from which 15%, 10%, and 5% of the values were removed at random, with each removal repeated three times. In the 15% missing-data sets, 12 values are removed; in the 10% sets, 8 values; and in the 5% sets, 4 values. After removal, the empty values are filled in by hot deck and regression imputation. Because the removal was repeated three times, the imputation was also repeated three times. In total, the 15% missing data yield 36 values, the 10% missing data yield 24 values, and the 5% missing data yield 12 values. The imputation results are compared with the original data below.
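The deletion design and its counts can be checked with a short simulation. The sketch below is illustrative only: the ages and the predictor are synthetic stand-ins (the survey data are not reproduced here), the predictor variable is hypothetical, and simple least-squares regression on one predictor is an assumed form of regression imputation.

```python
import random

N_RESPONDENTS = 80
RATES = (0.15, 0.10, 0.05)
REPETITIONS = 3


def delete_at_random(values, rate, rng):
    """Return a copy with round(rate * n) entries set to None, plus their indices."""
    n_missing = round(rate * len(values))
    missing_idx = sorted(rng.sample(range(len(values)), n_missing))
    out = list(values)
    for i in missing_idx:
        out[i] = None
    return out, missing_idx


def regression_impute(x, y):
    """Fill None entries of y with predictions from the least-squares line y = a + b*x."""
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mean_x = sum(xi for xi, _ in observed) / len(observed)
    mean_y = sum(yi for _, yi in observed) / len(observed)
    sxx = sum((xi - mean_x) ** 2 for xi, _ in observed)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in observed)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return [yi if yi is not None else a + b * xi for xi, yi in zip(x, y)]


rng = random.Random(2014)
# Synthetic stand-ins: a hypothetical predictor and ages loosely related to it.
predictor = [rng.randint(0, 5) for _ in range(N_RESPONDENTS)]
age = [20 + 3 * p + rng.randint(-3, 3) for p in predictor]

totals = {}
for rate in RATES:
    total = 0
    for _ in range(REPETITIONS):
        age_missing, idx = delete_at_random(age, rate, rng)
        age_imputed = regression_impute(predictor, age_missing)
        total += len(idx)
    totals[rate] = total

# Total number of deleted (and then imputed) values per missing-data rate.
print(totals)
```

With 80 respondents and three repetitions, the totals per rate are 36, 24, and 12 values, matching the counts stated above.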

From

Results of imputation of missing data

H_{0}: There is no difference between the original data and the data after imputation.

H_{1}: There is a difference between the original data and the data after imputation.
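As an illustration of how these hypotheses are tested, the following sketch computes the paired t statistic from the differences between original and imputed values. The two lists are hypothetical and are not taken from the study; the critical value quoted in the comment is the standard two-sided 5% value for 7 degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(original, imputed):
    """Return (t, degrees of freedom) for the paired differences."""
    diffs = [o - i for o, i in zip(original, imputed)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 in the denominator)
    # Note: sd_diff is zero if all differences are identical, so t is undefined.
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical ages before deletion and after imputation.
original = [25, 31, 28, 40, 35, 22, 30, 27]
imputed = [27, 30, 29, 37, 36, 24, 29, 28]

t, df = paired_t_statistic(original, imputed)
# For df = 7, the two-sided 5% critical value is about 2.365;
# |t| below that value means H_0 is not rejected.
print(f"t = {t:.3f}, df = {df}")
```

A non-significant result here is the desirable outcome: it means the imputed data do not differ detectably from the original data.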

Results paired

Paired

The correlation test is used to determine the strength of the relationship between the original data and the data after imputation. The closer r is to +1, the stronger the positive relationship; values close to 0 indicate a weak relationship, and values close to −1 indicate a strong inverse relationship.
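A minimal sketch of the Pearson correlation coefficient follows, written with the standard library only. All values are hypothetical; the second comparison series is constructed as 65 minus the original values so that it is a perfect inverse of the original.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical original ages and two hypothetical imputation outcomes.
original = [25, 31, 28, 40, 35]
close_to_original = [26, 30, 29, 39, 36]
inverse_of_original = [65 - v for v in original]

r_close = pearson_r(original, close_to_original)
r_inverse = pearson_r(original, inverse_of_original)
print(f"r (close imputation)   = {r_close:.3f}")
print(f"r (inverse of original) = {r_inverse:.3f}")
```

An imputation that tracks the original values gives r near +1, which is the pattern the comparison between methods looks for.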

For 15%, 10%, and 5% missing data, the method whose r value is close to +1 is the regression method, which means that the original data and the data imputed by regression have a strong relationship. Pearson correlation test is not only judged by the value of

Results of the paired t-test

Results of the Pearson correlation test

Results of RMSE

A lower RMSE value indicates that the values produced by a model are closer to the observed values. The lower the RMSE, the better the resulting data.

Of the two methods, the regression method has the smallest RMSE value. Besides the RMSE values themselves, the pattern of RMSE across repetitions is also a consideration in determining the best method. For 15% missing data, no method has a stable value across all repetitions, but the regression method is stable in the second and third imputations. For 10% missing data, no method has stable RMSE values across all repetitions, but the regression method is stable in the first and second imputations. For 5% missing data, the regression method has the most stable RMSE, meaning that the first, second, and third imputations produced RMSE values that differ little.

The Pearson correlation test is used to assess the strength of the relationship between the original data and the data after imputation. The overall result of the Pearson correlation test in the 5%, 10%, and 15% missing-data groups showed that the regression method produces

The RMSE of the imputation results also needs to be determined in order to know whether the imputation has a large error; the smaller the RMSE, the better the resulting data. RMSE is the square root of the mean of the squared differences between the data after imputation and the data before imputation: the larger the differences between the data before and after imputation, the larger the RMSE, and vice versa. This is why the regression method has the smallest RMSE values compared with the other imputation method.
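The RMSE calculation can be sketched in a few lines. The values below are hypothetical and only illustrate that an imputation closer to the original data yields a smaller RMSE.

```python
import math


def rmse(original, imputed):
    """Root mean square error between original and imputed values."""
    if len(original) != len(imputed):
        raise ValueError("sequences must have the same length")
    squared_errors = [(o - i) ** 2 for o, i in zip(original, imputed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical original ages and two hypothetical sets of imputed values.
original = [25, 31, 28, 40]
near = [26, 30, 28, 41]  # small deviations from the original
far = [30, 25, 35, 33]   # large deviations from the original

print(f"RMSE (near) = {rmse(original, near):.3f}")
print(f"RMSE (far)  = {rmse(original, far):.3f}")
```

Because the errors are squared before averaging, a few large deviations raise the RMSE more than many small ones.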

In conclusion, the best method views from the results are not significant