Gerga Orange Quality Using Naïve Bayes Based on Feature Extraction

Classifying Gerga citrus fruit by quality is necessary to support sales growth. Fruit of differing quality is often mixed in the storage warehouse, which makes the selling price difficult to determine because quality is uneven, so a sorting process is needed. Many citrus sellers and growers still sort fruit manually, which can take a very long time. To address these problems, this study classifies the quality of Gerga oranges automatically from digital images using the Naïve Bayes Classifier algorithm with GLCM feature extraction and HSV color characteristics. In the experiments, the co-occurrence matrix angles of 0°, 45°, and 135° gave the best accuracy, reaching 80%, while the 90° angle gave the lowest accuracy. It was concluded that the Gerga citrus fruit quality classification system using the Naïve Bayes method is categorized as good, with an AUC value of 0.8.


INTRODUCTION
In the era of information technology, digital image processing has been widely used in various fields. In the agricultural sector, disease detection, quality classification, weight determination, and plant species identification have all been implemented as classification processes using image processing [1]-[4]. Classifying Gerga citrus fruit by quality is necessary to support sales growth. Fruit of differing quality is often mixed in the storage warehouse, which makes the selling price difficult to determine because quality is uneven, so a sorting process is needed. Significant parameters in the sorting process include size, shape, color, and texture. In general, the planting and care of citrus fruit receive too little attention, and growers have not implemented a production system that pays attention to quality. The result is uncontrolled and low fruit quality: the skin does not look smooth, the skin color tends to be unevenly green and yellow, and the fruit loses sweetness and tastes slightly bitter, which keeps sales growth below its potential [5]. Many citrus sellers and growers still sort fruit quality manually, which can take a very long time [6]. Given these problems, it is necessary to classify the quality of Gerga oranges automatically by utilizing image processing [7]. Digital image processing [8], [9] is the processing of the pixels of a digital image to extract specific information from the image for a specific purpose.
Image processing was initially intended to improve poor image quality. As computing capacity and speed have increased and the computational sciences have grown and diversified, researchers have become able to retrieve information from an image to support decisions such as classification, identification, and facial recognition. Classification is the grouping of objects according to predetermined variables. In image classification, image data is used to train the system so that it can assign test data to particular groups, so a good feature extraction method is needed to obtain good texture values [3], [10]-[12]. Naïve Bayes is one of the algorithms often used for image classification. It classifies based on probability and Bayes' theorem, assumes that each variable is independent, and can handle both quantitative and discrete data. Naïve Bayes also does not require large amounts of training data for the classification process, which makes the algorithm easy to use [13]. GLCM is a feature extraction algorithm that obtains second-order statistical values by calculating the probability of the neighborhood relationship between two pixels at a certain distance (d) and angle (θ); the extracted features include contrast, correlation, energy, and homogeneity. According to [14], classification using the Naïve Bayes algorithm with GLCM feature extraction produces effective and efficient results. The study in [15] compared the Naïve Bayes algorithm with KNN for identifying types of apples using LBP and HSV feature extraction; the highest result, 97%, was obtained with HSV features and Naïve Bayes.

Gerga Orange
Rimau Gerga Lebong oranges, commonly called Gerga oranges, are superior local oranges that originate from and are cultivated in Lebong district and Rejang Lebong district, Bengkulu province. Gerga oranges are considered a superior crop because they have a wide market across age groups and because fruit is available for harvest throughout the year, as shown in Figure 1. Each tree can produce fruit throughout the year. Gerga oranges are a type of tangerine. Their physical characteristics include large leaves, a fruit weight of 170-350 g, skin that is yellowish-green or yellow-orange when ready for harvest, and orange flesh with a sweet and fresh taste. Total Dissolved Solids (TPT) are between 12-16% Brix, and in terms of chemical specifications, Gerga oranges contain 89.20% water, 0.92% acid, and 18.34 mg/100 g vitamin C.

Gray Level Co-occurrence Matrix (GLCM) Feature Extraction
The Gray Level Co-occurrence Matrix, commonly abbreviated as GLCM, is a feature extraction method based on texture analysis. It builds a matrix that records how often two pixels with given gray levels occur together at a certain distance and angle in an image. Fourteen types of textural features can be extracted with the GLCM method, but this study uses only four main features: contrast, correlation, energy, and homogeneity [14], [16]. The orientation is formed in four angular directions, namely 0°, 45°, 90°, and 135°, and the distance between pixels is set at 1 pixel.
• Contrast
The contrast value shows the difference between the dark and bright parts of an image. If the contrast value is high, the image looks sharper, whereas a lower contrast value reduces the sharpness of the image. Visually, contrast is a measure of the variation between the gray levels in an image area: the higher the image contrast, the higher the value of the contrast feature. Here P(i,j) is the element in the i-th row and j-th column of the co-occurrence matrix.

• Homogeneity
This feature measures the homogeneity, or uniformity, of the gray levels: the more uniform the gray levels, the higher the value of the homogeneity feature. Here P(i,j) is the element in the i-th row and j-th column of the co-occurrence matrix.

• Correlation
This feature indicates whether there is a linear structure in the image by measuring the linear dependence of the gray levels in the image. Here P(i,j) is the element in the i-th row and j-th column of the co-occurrence matrix, μi is the mean for the i-th row, μj is the mean for the j-th column, σi is the standard deviation for the i-th row, and σj is the standard deviation for the j-th column.

• Energy
This feature shows the uniformity between neighboring pixels in terms of gray level: the more similar the pixel values, the higher the energy feature value. Here P(i,j) is the element in the i-th row and j-th column of the co-occurrence matrix.
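The four features can be illustrated with a minimal pure-Python sketch. It is not the study's implementation: the function names `glcm` and `glcm_features` are hypothetical, the matrix is assumed to be symmetric and normalized, the image is assumed to be already quantized to integer gray levels, and homogeneity is assumed to use the 1 + |i - j| denominator, which is one common definition.

```python
import math

# Pixel offsets (row, col) for distance d = 1 at the four angles in the text.
# Assumption: rows grow downward, so 45 degrees is one row up, one column right.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}


def glcm(image, levels, angle):
    """Return a normalized, symmetric co-occurrence matrix P for one angle."""
    dr, dc = OFFSETS[angle]
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                counts[j][i] += 1  # count each pair in both directions
    total = sum(map(sum, counts))  # assumes the image has at least one pair
    return [[v / total for v in row] for row in counts]


def glcm_features(P):
    """Compute contrast, correlation, energy and homogeneity from P."""
    n = len(P)
    cells = [(i, j, P[i][j]) for i in range(n) for j in range(n)]
    mu_i = sum(i * p for i, _, p in cells)
    mu_j = sum(j * p for _, j, p in cells)
    sd_i = math.sqrt(sum((i - mu_i) ** 2 * p for i, _, p in cells))
    sd_j = math.sqrt(sum((j - mu_j) ** 2 * p for _, j, p in cells))
    contrast = sum((i - j) ** 2 * p for i, j, p in cells)
    energy = sum(p * p for _, _, p in cells)
    homogeneity = sum(p / (1 + abs(i - j)) for i, j, p in cells)
    if sd_i == 0 or sd_j == 0:
        correlation = float("nan")  # undefined for a constant image
    else:
        cov = sum((i - mu_i) * (j - mu_j) * p for i, j, p in cells)
        correlation = cov / (sd_i * sd_j)
    return {"contrast": contrast, "correlation": correlation,
            "energy": energy, "homogeneity": homogeneity}
```

Calling `glcm_features(glcm(image, 256, angle))` for each of the angles 0, 45, 90, and 135 would give one four-value texture vector per angle, which matches the per-angle comparison reported in the experiments.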
RGB Color Space
Color images come from a color space that combines the three primary colors, namely red, green, and blue, to produce a variety of different colors. The color combination model also depends on the device used, so each device may detect and produce different color combination values. The color parameters are obtained by normalizing each RGB component of the image.
HSV is a color feature of digital images that represents Hue, Saturation, and Value. Hue is the basic color identity that distinguishes one color from another and is expressed in degrees, with red at 0°, green at 120°, and blue at 240°. For example, oranges are generally yellow whereas mangoes are green; this difference in color identity is the hue. Saturation is the level of density (purity) of a color: the higher the saturation, the closer the color is to the base color, and the lower the saturation, the more faded and gray the color becomes [15]. Value is the brightness level of a color: the brighter the color, the higher the V value. A bright color has more white elements, while a dark color has more black elements.
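The color features described above can be sketched with Python's standard `colorsys` module. The helper names are hypothetical, and the normalization shown (each channel divided by the channel sum) is an assumption about the equation the text refers to, not a confirmed detail of the study.

```python
import colorsys


def normalize_rgb(r, g, b):
    """Assumed normalization: each channel divided by the channel sum."""
    total = r + g + b
    if total == 0:  # pure black carries no color information
        return (0.0, 0.0, 0.0)
    return (r / total, g / total, b / total)


def rgb_to_hsv_degrees(r, g, b):
    """Convert 8-bit RGB to (hue in degrees, saturation, value)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return (h * 360, s, v)  # colorsys returns hue as a fraction of a turn
```

With this conversion, pure red, green, and blue map to hues of about 0°, 120°, and 240°, in line with the description in the text.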

Naïve Bayes
The Naïve Bayes algorithm is one of the algorithms used in classification. It is based on Bayes' theorem, named after the British scientist Thomas Bayes and published in 1763. Naïve Bayes classifies using probability and statistical methods: it predicts the probability of a class for the new data being tested based on variables and previous experience [17]-[19]. The term "naïve" means that all attributes are assumed to be conditionally independent of one another. Classification with this algorithm therefore assumes that a characteristic of a particular class is unrelated to the other characteristics, as shown in equation (9).

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)} \qquad (9)$$
Where X is the data with an unknown class to be tested, H is the hypothesis that data X belongs to a particular class, P(H|X) is the probability of hypothesis H given condition X, P(H) is the probability of hypothesis H, P(X|H) is the probability of X given hypothesis H, and P(X) is the probability of X. The flow of the Naïve Bayes algorithm is as follows:
• Read the test data by parameter.
• Determine the mean and standard deviation of each parameter.
• Calculate the probability of the value appearing for each parameter.
• Calculate the probability of the test data for each class based on the training data.
• Determine the highest probability value.
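The flow above can be sketched as a small Gaussian Naïve Bayes in pure Python. This is a sketch under assumptions rather than the study's code: the use of a normal density for each parameter is inferred from the mean and standard deviation step, and the names `fit`, `log_gaussian`, and `predict` are hypothetical.

```python
import math
from collections import defaultdict


def fit(features, labels):
    """Estimate the prior and the per-parameter mean/std for every class."""
    grouped = defaultdict(list)
    for x, y in zip(features, labels):
        grouped[y].append(x)
    model = {}
    for y, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        stds = [
            # a small floor avoids division by zero for constant parameters
            max(math.sqrt(sum((v - m) ** 2 for v in col) / n), 1e-9)
            for col, m in zip(columns, means)
        ]
        model[y] = (n / len(labels), means, stds)
    return model


def log_gaussian(x, mean, std):
    """Log of the normal density with the given mean and standard deviation."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def predict(model, x):
    """Return the class with the highest log-posterior score."""
    scores = {}
    for y, (prior, means, stds) in model.items():
        likelihood = sum(log_gaussian(v, m, s) for v, m, s in zip(x, means, stds))
        scores[y] = math.log(prior) + likelihood  # P(X) is the same for all classes
    return max(scores, key=scores.get)
```

Working with logarithms keeps the product of many small probabilities numerically stable, and P(X) can be dropped because it does not change which class scores highest.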

Confusion Matrix
The Confusion Matrix is a test method for measuring the accuracy of predictions against the actual values given by an expert. The Single Decision Threshold Confusion Matrix is used to measure how well the predicted classification results agree with the actual quality of Gerga oranges. The Single Decision Threshold method uses the following terms:
• True Positive (TP): the test data is actually positive and the system predicts a positive value.
• True Negative (TN): the test data is actually negative and the system predicts a negative value.
• False Positive (FP): the test data is actually negative but the system predicts a positive value.
• False Negative (FN): the test data is actually positive but the system predicts a negative value.
The system is evaluated with four parameters, namely sensitivity, precision, specificity, and accuracy [15], [16]. Sensitivity is the proportion of actually positive data that is predicted as positive. Precision is the proportion of positive predictions that are actually positive. Accuracy measures the overall success rate of classification by the system. Specificity is the proportion of actually negative data that is predicted as negative. These parameters are given in equations (11) to (14).
$$\text{Sensitivity} = \frac{TP}{TP + FN} \times 100\% \qquad (11)$$
$$\text{Precision} = \frac{TP}{TP + FP} \times 100\% \qquad (12)$$
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% \qquad (13)$$
$$\text{Specificity} = \frac{TN}{TN + FP} \times 100\% \qquad (14)$$
The predicted value is the value generated by the system, while the real value is the value that corresponds to what actually occurred, referred to as the expert value. The Confusion Matrix is described in Table 2. The terms and formulas used for the class labels in this study are as follows, with the formulas given in equation (15).
• TP: Correct prediction, where the expert value and the predicted value of the test image are both best.
• FP1: Incorrect prediction, where the expert value of the test data is fairly good but the predicted value is best.
• FP2: Incorrect prediction, where the expert value of the test data is poor but the predicted value is best.
• TC: Correct prediction, where the expert value and the predicted value of the test image are both fairly good.
• FC1: Incorrect prediction, where the expert value of the test data is best but the predicted value is fairly good.
• FC2: Incorrect prediction, where the expert value of the test data is poor but the predicted value is fairly good.
• TK: Correct prediction, where the expert value and the predicted value of the test image are both poor.
• FK1: Incorrect prediction, where the expert value of the test data is best but the predicted value is poor.
• FK2: Incorrect prediction, where the expert value of the test data is fairly good but the predicted value is poor.
$$\text{Precision} = \frac{TP}{TP + FP1 + FP2} \times 100\%$$
$$\text{Sensitivity}_{\text{fairly good}} = \frac{TC}{TC + FC1 + FC2} \times 100\%$$
$$\text{Sensitivity}_{\text{poor}} = \frac{TK}{TK + FK1 + FK2} \times 100\%$$
$$\text{Accuracy} = \frac{TP + TC + TK}{\text{Sum of Test Data}} \times 100\%$$
The Area Under Curve (AUC) is used as the benchmark for the performance and feasibility of the classification program. Based on this value, it is concluded whether the Gerga citrus quality classification system can be applied properly. The AUC value lies between 0.0 and 1.0: the closer the value is to 1.0, the better and more feasible the classification, and a value below 0.6 indicates that the classification method used is poor or failing [20], [21]. The categories of AUC values, from best to least good, are described in Table 4.
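The metrics in this section can be sketched in a few lines of Python. The function names are hypothetical, the confusion matrix layout (rows for expert class, columns for predicted class) is an assumption, and the counts in the usage note are made up for illustration and are not results from the study.

```python
def binary_metrics(tp, tn, fp, fn):
    """Equations (11) to (14), returned as percentages."""
    return {
        "sensitivity": 100 * tp / (tp + fn),
        "precision": 100 * tp / (tp + fp),
        "accuracy": 100 * (tp + tn) / (tp + tn + fp + fn),
        "specificity": 100 * tn / (tn + fp),
    }


def three_class_scores(cm):
    """cm[e][p] counts test images with expert class e and predicted class p.

    Returns the per-class ratios in the form used by the three-class
    formulas (for example TP / (TP + FP1 + FP2)) and the overall accuracy.
    """
    n = len(cm)
    total = sum(map(sum, cm))
    per_class = []
    for k in range(n):
        predicted_k = sum(cm[e][k] for e in range(n))  # e.g. TP + FP1 + FP2
        per_class.append(100 * cm[k][k] / predicted_k if predicted_k else 0.0)
    accuracy = 100 * sum(cm[k][k] for k in range(n)) / total
    return per_class, accuracy
```

For example, with the illustrative matrix `[[5, 1, 0], [1, 4, 1], [0, 2, 6]]`, 15 of the 20 images lie on the diagonal, so the accuracy is 75%.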

Classification Scheme
In the training process, each training image is labeled with one of three classes, namely best, fairly good, or poor. The features are calculated for each image and stored in the training database as benchmark values for the testing process. The classification system is tested with test images against the benchmark values in the training database, and the system assigns each test image to one of the best, fairly good, or poor classes, as shown in Figure 2.

Figure 2. Training Stages
This process also shows whether the classification system is working properly, and the accuracy value is calculated at this stage. The training stage is the initial stage, which obtains the training features and training targets from the training images. Its final result is a training database model that serves as the reference, based on expert values, when the Naïve Bayes algorithm classifies test images.
Selecting an image is the first step of the classification process in both the training and testing stages. The training set consists of 300 images in .jpg format, grouped into the best, fairly good, and poor classes according to their real condition, with 100 training images per class. The process of loading a training image is shown in Figure 3.
Image resizing aims to reduce image processing time. It is carried out after a training or test image is loaded: the initial image size of 2048 x 1536 pixels is reduced to 256 x 256 pixels.
Image conversion turns the color image into a grayscale image to obtain its gray levels. A color image has three color components, namely red, green, and blue, whereas a grayscale image has a single gray component on a scale of 0 to 255. The grayscale image is used in the GLCM calculations to obtain the texture features of the training and test images.
For the color features, Hue (H) is the basic color of an image, described in degrees: 0 degrees is red, 120 degrees is green, and 240 degrees is blue. Saturation (S) is the level of density of a color: the higher the saturation, the closer the color is to the base color, and the lower the saturation, the more faded and gray the color becomes. Value (V) is the brightness level of a color: the brighter the color, the higher the V value, as shown in Figure 4. A bright color has more white elements, while a dark color has more black elements.
Each image is processed to form a co-occurrence matrix, from which the GLCM features Contrast, Correlation, Energy, and Homogeneity are calculated, so every image from the best, fairly good, and poor classes yields these four features. The GLCM feature extraction process is shown in Figure 5. As shown in Figure 6, which gives the structure of the training database, the features extracted from the training images are stored in a database in .xls format. This database is processed by the Naïve Bayes trainer, and the resulting model is saved as a database in .mat format.
After feature extraction and training on the training data are complete, the next process is testing with the test images. This stage serves as a benchmark for how effective the classification system is with the algorithm and training images used. Tests were carried out with 30 test images, and the angles used in the GLCM feature extraction were 0°, 45°, 90°, and 135°, as shown in Figure 7.
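The resizing and grayscale conversion steps can be sketched in pure Python as follows. The function names are hypothetical, nearest-neighbour sampling is an assumed resizing method, and the luminance weights 0.299, 0.587, and 0.114 are a common convention rather than a detail confirmed by the study.

```python
def resize_nearest(image, new_rows, new_cols):
    """Nearest-neighbour resize of an image stored as a list of pixel rows."""
    rows, cols = len(image), len(image[0])
    return [
        [image[r * rows // new_rows][c * cols // new_cols] for c in range(new_cols)]
        for r in range(new_rows)
    ]


def to_gray(image):
    """Map (R, G, B) pixels to integer gray levels in the range 0 to 255."""
    return [
        [min(255, round(0.299 * r + 0.587 * g + 0.114 * b)) for r, g, b in row]
        for row in image
    ]
```

In this sketch, `to_gray(resize_nearest(rgb_image, 256, 256))` would produce the 256 x 256 grayscale input for the GLCM step, while the HSV features would be taken from the resized color image. A production system would normally use an image library for speed.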

RESULTS AND DISCUSSION
The Graphical User Interface (GUI) is the display that bridges the computer program and its users. A GUI is needed to make it easier for users to operate a system on a computer, in this study the Gerga orange quality classification system. The part implemented in the GUI is the testing phase, and this section explains each part of the GUI. The GUI display is shown in Figure 8. The training database, named NaiveBayes, is loaded with the load function, and classification is carried out with the predict function based on the test features and the training targets stored in the training database. If the classification returns 1, the result is the best class; if it returns 2, the result is the fairly good class; and if it returns 3, the result is the poor class. The classification result is displayed as a String in the Classification Results panel of the GUI.

Figure 8. GUI Application Gerga Orange Quality
The classification results of the Naïve Bayes algorithm are evaluated with two methods, namely the Confusion Matrix and the Area Under Curve (AUC). The Confusion Matrix test determines how accurately the system classifies the quality of the test images, while the AUC test measures the performance of the classification method used in this study, namely classification with the Naïve Bayes algorithm. In the experiments, the co-occurrence matrix angles of 0°, 45°, and 135° gave the best accuracy, reaching 80%, while the 90° angle gave the lowest accuracy, 73.3333%, as shown in Table 4. The AUC test measures how effective the classification method is. The AUC value lies between 0 and 1; the closer the value is to 1, the more effective the method is in classifying images. The AUC is calculated with equation (18).
$$\text{AUC} = \frac{(1 + \text{Sensitivity}) - (1 - \text{Specificity})}{2} \qquad (18)$$
The classification method is tested with the values obtained from the Confusion Matrix test, with the results shown in Table 5. In the tests, several classification results did not agree with the expert values. With the lowest accuracy at 73.3333% and the AUC value reaching 0.8, the system is categorized as good. Most classification errors occurred in the fairly good class, where the highest Specificity value reached only 60%. The analysis of the classification failures points to several causes. Images were taken under poor and inconsistent lighting, which changes the color of the image so that it tends to be dark, reddish, or bluish; light that is too bright also leaves the image background heavily shaded, as shown in Figure 8. An out-of-focus image also looks blurry, which degrades the gray-level conversion and makes the feature extraction results suboptimal, as shown in Figure 9. Poor lighting conditions cause the background of the image to be shaded, which makes the color feature extraction irregular. Under these conditions, the color feature extraction results become exaggerated and inaccurate.
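As a worked illustration of equation (18), the sketch below computes the single-threshold AUC, which simplifies to the mean of sensitivity and specificity. The function names are hypothetical and the inputs in the usage note are illustrative fractions, not values from the study. The accuracy helper reflects plain arithmetic: with 30 test images, the reported 80% and 73.3333% are consistent with 24 and 22 correct classifications.

```python
def auc_single_threshold(sensitivity, specificity):
    """Equation (18), with both inputs as fractions between 0 and 1."""
    return ((1 + sensitivity) - (1 - specificity)) / 2


def accuracy_percent(correct, total):
    """Share of correctly classified test images, as a percentage."""
    return 100 * correct / total
```

For instance, an illustrative sensitivity of 0.75 and specificity of 0.65 give an AUC of 0.7, and `accuracy_percent(22, 30)` gives about 73.3333.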

CONCLUSION
Based on the research that has been done, it can be concluded that the Naïve Bayes classification algorithm based on GLCM feature extraction, applied to the image-based Gerga citrus fruit quality classification system, is categorized as good and successful, with the highest AUC value reaching 0.85. This study still has some shortcomings that need to be addressed, so the following suggestions are offered for further research.
a. When taking training or test images, attention should be paid to the lighting conditions so that the fruit images are not too shaded and the feature extraction process can work properly and optimally.
b. Using more training images and lighting aids is expected to increase the accuracy of the image classification process.
c. The testing process uses only one test image at a time, which would be inefficient for sorting at Gerga orange plantations. It is hoped that a fruit quality classification machine that processes many test images at once can be built in the future based on the method used in this study.
d. This research is expected to help and motivate researchers and farmers of Gerga oranges or other oranges to develop orange quality sorting systems, in the hope of improving the quality of the oranges sold and contributing to the development of orange sales and cultivation in Indonesia.