Tomato Maturity Classification using Naive Bayes Algorithm and Histogram Feature Extraction

Tomatoes are a source of vitamins and minerals, with nutritional content that is very beneficial for human health. Tomato classification plays an important role in many aspects of tomato distribution and sales. Classification can be performed on images by extracting features and then classifying them with a particular method. This research proposes a classification technique using histogram feature extraction and a Naïve Bayes Classifier. Histogram feature extraction is widely used and plays an important role in classification results. Naïve Bayes is proposed because it has high accuracy and high computational speed when applied to large databases, is robust to isolated noise points, and requires only a small amount of training data to estimate the parameters needed for classification. The proposed classification is divided into three classes, namely raw, mature, and rotten. Experiments using 75 training images and 25 testing images yielded an accuracy of 76%.


INTRODUCTION
Tomatoes are native to Central and South America, from Mexico to Peru, originating in the highlands of the west coast of South America. Tomatoes have a short life cycle and can grow 1-3 meters tall [1]. Tomatoes can be green, yellow, or red and can be used as a vegetable in cooking or eaten directly. The tomato is a highly nutritious fruit; in addition to being consumed fresh, it is also used as a flavoring ingredient in the food and beverage industry. Shipping tomatoes between traders can also affect their maturity because tomatoes have a short life cycle [2].
Along with the development of information technology, it has become possible to identify fruit maturity with the help of computers [3] [4]. Identification can be done by classifying tomato images with various methods such as K-Nearest Neighbor (KNN) [4] [5], Random Forest [6], Support Vector Machine (SVM) [7] [8], and Naïve Bayes [7] [8] [9] [10] [11] [12] [13]. In this study, the maturity of a set of tomatoes is identified using the Naïve Bayes algorithm. Naïve Bayes was chosen because it has high accuracy and speed when applied to large databases, is robust to isolated noise points, and requires only a small amount of training data to estimate the parameters needed for classification.
However, the Naïve Bayes algorithm also has a weakness: if a conditional probability is zero, the predicted probability will be zero. Therefore, to increase accuracy, histogram feature extraction is added. The histogram features capture the pixel intensity values of the image, the relative frequency of occurrence of each intensity, brightness, contrast, and so on [14]. The Naïve Bayes algorithm then learns from the values obtained through histogram feature extraction and can determine the maturity level of tomatoes more accurately.

RESEARCH METHOD
In this research, tomato image classification based on maturity is proposed using histogram feature extraction and a Naïve Bayes Classifier. Before the image is classified, several stages are carried out, namely tomato image data collection and acquisition, image preprocessing (resizing and cropping), image color conversion, feature extraction, classification, and testing, as described in Figure 1.

Data Gathering and Acquisition
In this study, tomato images were captured directly using the camera of an ASUS smartphone (camera model Z00AD). The image data used were 100 tomato images with a size of 4096 × 3072 pixels, consisting of 34 raw, 33 mature, and 33 rotten tomatoes. The image data were then divided into 75 training images and 25 testing images. Figure 2 shows a sample of the tomato images used.

Pre-processing
At this stage, the image data to be classified are cropped and resized to 100 × 100 pixels, with the aim of normalizing the images and speeding up computation during the training and testing stages. Figure 3 shows a sample of preprocessed images.
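The cropping and resizing step can be sketched as follows. This is a minimal illustration using NumPy with a center crop and nearest-neighbour resampling; it is not the authors' actual code, and the dummy array merely stands in for a real photo.

```python
import numpy as np

def center_crop(img):
    """Crop an H x W x 3 array to a centered square."""
    h, w = img.shape[:2]
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    return img[top:top + s, left:left + s]

def resize_nearest(img, size=100):
    """Nearest-neighbour resize of a square image to size x size."""
    s = img.shape[0]
    idx = np.arange(size) * s // size
    return img[idx][:, idx]

# A dummy array standing in for a 4096 x 3072 tomato photo.
dummy = np.zeros((3072, 4096, 3), dtype=np.uint8)
out = resize_nearest(center_crop(dummy))
print(out.shape)  # (100, 100, 3)
```

In practice a library resizer (e.g. Pillow or OpenCV) with bilinear interpolation would give smoother results; the sketch only shows the normalization to 100 × 100 pixels described above.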

Convert Image to Grayscale
After the preprocessing stage, the image is converted from RGB to grayscale using formula (1):

gray = 0.2989 · R + 0.5870 · G + 0.1140 · B (1)

where gray is the grayscale channel, R is the red channel, G is the green channel, and B is the blue channel.
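Formula (1) applied per pixel can be written in a few lines of NumPy; this is a straightforward sketch of the conversion, not the authors' implementation:

```python
import numpy as np

def to_grayscale(rgb):
    """Apply gray = 0.2989*R + 0.5870*G + 0.1140*B to an H x W x 3 array."""
    weights = np.array([0.2989, 0.5870, 0.1140])
    return rgb[..., :3] @ weights

red_pixel = np.array([[[255.0, 0.0, 0.0]]])  # a single pure-red pixel
print(to_grayscale(red_pixel)[0, 0])  # 0.2989 * 255 ≈ 76.2195
```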

Histogram Feature Extraction
A histogram-based method obtains texture features from the histogram of an image. An image histogram is a graph describing the distribution of pixel intensity values of an image or a particular part of it. From the histogram, the relative frequency of occurrence of each intensity in the image can be seen [15]. The histogram method is a first-order statistical method for obtaining texture features. The features obtained from the histogram are mean intensity, standard deviation, energy, entropy, smoothness, and skewness [16].
The first feature is the mean intensity, which can be calculated with formula (2):

m = Σ i · p(i) (2)

where m is the mean intensity, obtained by summing the products of i and p(i) over all gray levels; i is the gray level value of the image and p(i) is the probability of occurrence of the value i.

The second feature is the standard deviation, which indicates the contrast of an image. It is obtained by summing the squared differences (i − m)² weighted by p(i) and taking the square root, where i is the gray level value, p(i) is the probability of occurrence of i, and m is the mean intensity. The standard deviation can be calculated with formula (3):

σ = √( Σ (i − m)² · p(i) ) (3)

The third feature is energy, often called the uniformity of an image, which has a maximum value of 1. An image with many distinct gray levels has a lower energy value than an image with few gray levels. Energy is obtained by summing the squares of the probabilities p(i) and can be calculated with formula (4):

E = Σ p(i)² (4)

The fourth feature is entropy, which indicates the complexity of an image: the higher the entropy value, the more complex the image. Entropy is obtained by summing the products of p(i) and log₂ p(i) and negating the result, where p(i) is the probability of occurrence of the value i. Entropy can be calculated with formula (5):

H = −Σ p(i) · log₂ p(i) (5)

The fifth feature is smoothness, obtained by subtracting from 1 the quantity 1 divided by (1 plus the squared standard deviation). A value close to 0 indicates an image of smooth (near-constant) intensity, while larger values indicate rougher texture. Smoothness can be calculated with formula (6):

R = 1 − 1 / (1 + σ²) (6)

The sixth feature is skewness, often referred to as the third-order moment: a negative value indicates that the brightness distribution is skewed to the left of the mean, and a positive value indicates that it is skewed to the right. Skewness is obtained by summing the cubed differences (i − m)³ weighted by p(i), where i is the gray level value, m is the mean intensity, and p(i) is the probability of occurrence of the value i. Skewness can be calculated with formula (7):

S = Σ (i − m)³ · p(i) (7)
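The six features above can be computed together from a normalized 256-bin histogram. The following is a minimal sketch of that computation (NumPy, unnormalized gray levels 0–255), not the authors' code:

```python
import numpy as np

def histogram_features(gray):
    """First-order histogram features (2)-(7) of an 8-bit grayscale image."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()                        # p(i): relative frequency
    i = np.arange(256)
    mean = np.sum(i * p)                         # (2) mean intensity
    std = np.sqrt(np.sum((i - mean) ** 2 * p))   # (3) standard deviation
    energy = np.sum(p ** 2)                      # (4) energy / uniformity
    nz = p[p > 0]                                # avoid log2(0)
    entropy = -np.sum(nz * np.log2(nz))          # (5) entropy
    smoothness = 1 - 1 / (1 + std ** 2)          # (6) smoothness
    skewness = np.sum((i - mean) ** 3 * p)       # (7) skewness
    return mean, std, energy, entropy, smoothness, skewness

# A flat image: mean 100, zero spread, maximal uniformity.
flat = np.full((100, 100), 100, dtype=np.uint8)
m, s, e, h, r, sk = histogram_features(flat)
print(m, s, e)  # 100.0 0.0 1.0
```

For the flat test image, entropy, smoothness, and skewness are all 0, as expected for a constant-intensity region.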

Naïve Bayes Classifier
The Naïve Bayes algorithm is a classification algorithm that requires a learning process first. The algorithm predicts future probabilities based on past experience. Naïve Bayes uses probability methods and simple statistics, assuming that each feature is independent of the others [17]. Naïve Bayes has two main processes, namely learning and testing [18]. The learning process uses existing data to estimate the parameters of the probability distribution, under the assumption of independence between features. The testing process uses the model built in the learning process to calculate the posterior probabilities and then assigns the data to the class with the largest posterior probability [17]. The advantages of the Naïve Bayes algorithm are: (a) it does not require numerical optimization, matrix operations, and the like, so it is simpler; (b) the training and classification processes are more efficient; (c) it can use binary or polynomial data; and (d) it can be applied to many kinds of datasets because the features are assumed to be independent.
The general form of the Naïve Bayes theorem for categorical data can be calculated with formula (8) [10]:

P(Y|X) = P(Y) · Π P(Xi|Y) / P(X) (8)

where P(Y|X) is the probability that data with feature vector X belongs to class Y, P(Y) is the prior probability of class Y, P(Xi|Y) is the conditional probability of each feature Xi in vector X given class Y, and P(X) is the probability of X.
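A tiny numeric illustration of formula (8), using hypothetical probabilities not taken from the paper, shows why only the numerator matters when comparing classes:

```python
# Hypothetical probabilities for two classes a, b and a test vector
# X = (x1, x2) with independent features (made-up values for illustration).
p_a, p_x1_a, p_x2_a = 0.6, 0.5, 0.4
p_b, p_x1_b, p_x2_b = 0.4, 0.2, 0.9

score_a = p_a * p_x1_a * p_x2_a   # numerator of P(a | X) in formula (8)
score_b = p_b * p_x1_b * p_x2_b   # numerator of P(b | X)

# P(X) is the same denominator for both classes, so comparing the
# numerators is enough: score_a ≈ 0.12 > score_b ≈ 0.072, predict class a.
print(score_a > score_b)
```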

RESULTS AND DISCUSSION
Before performing the testing process on the test images, each input image first passes through the preprocessing and feature extraction stages. From the processed test image, the classification procedure is then performed using the Naïve Bayes algorithm. The stages of the classification process can be explained as follows:
1. Calculate the prior probability P(Ci) of the training data for each class. P(Ci) is the number of training samples in a class divided by the total number of training samples. In this study three classes are used, namely raw, mature, and rotten, each containing 25 training images, so the value of P(Ci) for each class is:
P(Ci = "raw") = 25 / 75 = 0.33
P(Ci = "mature") = 25 / 75 = 0.33
P(Ci = "rotten") = 25 / 75 = 0.33
2. Calculate the mean (μ) and standard deviation (σ) of each attribute in the training data per class, using equations (9) and (10).
μ = (Σ X) / n (9)
σ = √( Σ (X − μ)² / (n − 1) ) (10)
where X is the attribute value in the training data and n is the number of training samples in each class. Based on the training database and equations (9) and (10), the calculation is performed on all attributes of the training data for each class. The overall results of the μ and σ calculations for each attribute of the rotten, mature, and raw classes can be seen in Tables 1, 2, and 3. These μ and σ values are then used in the third stage to calculate P(Xk.uji | Ci) for each class.
3. The third stage calculates P(Xk.uji | Ci) for each attribute of the test data per class.
Based on the mean (μ) and standard deviation (σ) of each attribute of the training data per class, together with the feature extraction results of the test image obtained earlier, the value of P(Xk.uji | Ci) for each attribute of the test data per class can be calculated using equation (11):
P(Xk.uji | Ci) = (1 / (σ √(2π))) · e^( −(Xk.uji − μ)² / (2σ²) ) (11)
where σ and μ are the per-class standard deviation and mean of each attribute computed in the previous stage, π = 3.14, and e = 2.718282. The calculation results of P(Xk.uji | Ci) can be seen in Table 4.
4. The fourth stage calculates the posterior probability P(Ci | Xk.uji) for each class. P(Ci | Xk.uji) is the probability of hypothesis Ci given the test data X. From the previous stages, the value of P(Ci | Xk.uji) for each class can be calculated using equation (12).
P(Ci | Xk.uji) = P(Xk.uji | Ci) · P(Ci) / P(Xk.uji) (12)
where P(Ci | Xk.uji) is the probability of hypothesis Ci given the test data X (posterior probability), P(Xk.uji | Ci) is the probability of the test data X given hypothesis Ci, P(Ci) is the prior probability of hypothesis Ci, and P(Xk.uji) is the probability of the test data X.
5. The final step searches for the highest value of P(Ci | Xk.uji) among the classes calculated with equation (12); this value determines the predicted class for a data entry. For the sample test image:
P(Ci = "raw" | Xk.uji) = 0.0090104
P(Ci = "mature" | Xk.uji) = 6.8935e-17
P(Ci = "rotten" | Xk.uji) = 0.002849227124
From the results above, the highest P(Ci | Xk.uji) value is 0.0090104, which belongs to the posterior probability P(Ci = "raw" | Xk.uji). This indicates that the test image is classified into the raw class. The same classification calculation was carried out on the 25 test images to evaluate the correctness of the system. The overall classification results of the 25 test images using the Naïve Bayes algorithm are shown in Table 5. Based on the accuracy calculation, an accuracy of 76% was obtained for the classification of tomato maturity using the Naïve Bayes method. This accuracy is still lower than that of similar studies such as [18] [17]. Therefore, future research should experiment with different features or better preprocessing stages to improve the accuracy of tomato image classification.
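The five stages above amount to a Gaussian Naïve Bayes classifier. The following is a compact sketch of the procedure on toy two-feature data (the real inputs would be the six histogram features); it is illustrative, not the authors' implementation, and a small constant is added to σ to avoid division by zero:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Steps 1-2: prior P(Ci), per-attribute mean mu (9) and std sigma (10)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),                 # prior P(Ci)
                     Xc.mean(axis=0),                  # mu, eq. (9)
                     Xc.std(axis=0, ddof=1) + 1e-9)    # sigma, eq. (10)
    return params

def predict(params, x):
    """Steps 3-5: Gaussian likelihood (11), posterior numerator (12), argmax."""
    posts = {}
    for c, (prior, mu, sigma) in params.items():
        lik = np.prod(np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
                      / (np.sqrt(2 * np.pi) * sigma))
        posts[c] = prior * lik   # P(Xk.uji) omitted: same for every class
    return max(posts, key=posts.get)

# Toy 2-feature training data standing in for the histogram features.
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 5.1]])
y = np.array(["raw", "raw", "rotten", "rotten"])
model = fit_gaussian_nb(X, y)
print(predict(model, np.array([0.1, 0.0])))  # raw
```

As in step 5 of the paper, the predicted class is simply the one with the largest (unnormalized) posterior; dividing by P(Xk.uji) would not change the ranking.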