Sentiment Analyst on Twitter Using the K-Nearest Neighbors (KNN) Algorithm Against Covid-19 Vaccination

The coronavirus (2019-nCoV), commonly known as COVID-19, has been officially designated a global pandemic by the WHO. Twitter is a social media platform used by many people and is popular among internet users for expressing opinions. One COVID-19 issue that caused a stir is the procurement of the COVID-19 vaccine. The procurement prompted a range of opinions in Indonesian society, was widely discussed on Twitter, and became a Trending Topic. A member of the House of Representatives (DPR), Ribka Tjiptaning, also appeared in the Trending Topic list on Twitter for refusing to receive the COVID-19 vaccine. The opinions that appear on Twitter are used as data for the sentiment analysis process. Sentiment analysis is the computational study of opinions, sentiments, and emotions expressed in text. It is also a technique for extracting a person's attitude towards an issue or event by classifying the polarity of a text. This study divides public opinion on Twitter into positive and negative sentiments and uses the K-Nearest Neighbor (KNN) algorithm to classify public opinion about COVID-19 vaccination. Testing with the Confusion Matrix method gives an accuracy of 85%, a precision of 100%, and a recall of 78.94%.


INTRODUCTION
The outbreak of a new disease caused by the coronavirus (2019-nCoV), commonly known as COVID-19, was officially declared a global pandemic by the World Health Organization (WHO) on March 11, 2020. Although the outbreak began at the end of 2019 in the Chinese city of Wuhan, the virus has since spread globally, with more than 41.5 million cases and more than 1.1 million deaths reported. In Indonesia, President Joko Widodo announced the country's first COVID-19 case on March 2, 2020. According to Liu et al., Covid-19 vaccination is one of the effective ways to deal with the spread of the coronavirus. After vaccination, the body develops immunity and is protected from Covid-19 [1], [2]. Vaccination also helps protect other people, which reduces the spread of the Covid-19 virus. Vaccination campaign plans must consider all aspects, from the feasibility of using the vaccine and the risks after use to the stages and procedures that run from vaccination to public outreach [3]- [6].
Twitter is one of the social networking platforms most often used by internet users to share their opinions. Indonesia has a large number of daily active Twitter users. Based on data from Hootsuite, Indonesia is in 8th place with a reach of 10 million users [7]- [9]. The procurement of the Covid-19 vaccine has caused mixed opinions in Indonesian society. On Twitter, the corona vaccine became a trending topic because it was widely discussed by the Indonesian people. The opinions posted on Twitter then become the data for sentiment analysis. One member of the House of Representatives (DPR), Ribka Tjiptaning, became a trending topic on Twitter after refusing to receive the Covid-19 vaccine despite being 63 years old, preferring to pay the fine instead, a government sanction of 5 million rupiah. The stated grounds were that Bio Farma had not finished testing the Covid-19 vaccine, with reference to past incidents involving the polio vaccine and the elephantiasis vaccine in Indonesia. It can therefore be assumed that further evidence is needed before the Covid-19 vaccine is used in humans. Sentiment analysis was applied in this study to analyze the opinions, feelings, views, and preferences of individuals regarding COVID-19 vaccination by collecting data from Twitter users who discussed the topic [10]- [12]. The worldwide spread of the pandemic and the restrictions on social interaction affect people's social conditions and the circulation of issues, both positive and negative, and have even created widespread panic on social networks [7], [13].
The K-Nearest Neighbor algorithm [14], [15] was chosen as the classification algorithm because it is easy to implement. It uses labeled data, so each new item can be assigned to the most appropriate class during classification. Its results are also easy to interpret, and it performs well in terms of accuracy and prediction time. The K-NN algorithm has been shown to match the calculations applied in applications and to achieve good accuracy [6], [14], [16]. Several researchers have reported good classification performance with K-NN, especially for text. One example is the study entitled Twitter Sentiment Analysis Against the 2019 Indonesian Presidential Candidates using the K-NN Method, in which the K-NN method reached an accuracy of 83.33%.

K-Nearest Neighbor (KNN) Algorithm
The K-Nearest Neighbors algorithm finds cases by calculating the proximity between a new case and old cases, based on matching the weights of a number of existing features. It is a pattern recognition approach commonly used for classification and regression [17], [18]. Classification in text mining aims to assign data to several groups. The grouping relies on data whose group or class is already known: data without a group is assigned one by comparing it with data whose group is known. K-Nearest Neighbor (K-NN) is a classification technique that makes predictions on test data by comparing each item with its K nearest neighbors. The parameter K is the number of nearest neighbors involved, and it influences the prediction results.
K-Nearest Neighbor is an instance-based classification method that selects the training objects whose attributes are nearest to the new object. Nearness is obtained by calculating a similarity or dissimilarity value. KNN can use several dissimilarity measures (Euclidean, Manhattan, squared Euclidean, etc.), and the distance between neighbors is usually calculated as the Euclidean distance. KNN chooses the K closest neighbors and counts the class occurrences among them; the class that appears most often becomes the classification result. The best value of K depends on the data. In general, a high value of K reduces the effect of noise on the classification. Good K values can be selected by parameter optimization, for example by cross-validation. Given an unlabeled object, the algorithm searches the search space for identical or neighboring objects and assigns a label to the unlabeled object based on its nearest neighbors. The same concept can also be applied to sequences of observations, such as measurements of current levels. In that case, the K-NN algorithm identifies the K past data sequences that are most similar to the pattern being examined, and the closest sequences are combined to predict the expected future value at a given time step. Closeness generally takes a value between 0 and 1. A value of 0 means that the two cases are not at all similar, and a value of 1 means that the two cases match exactly. The closeness between two cases is calculated using the following equation.
$$\text{similarity}(T, S) = \frac{\sum_{i=1}^{n} f(T_i, S_i) \, w_i}{\sum_{i=1}^{n} w_i}$$
Where:
T = New case
S = Stored case
n = Number of attributes in each case
i = Individual attribute, from 1 to n
f = The similarity function of attribute i between cases T and S
w = The weight assigned to attribute i
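The weighted closeness between a new case and a stored case can be illustrated with a minimal sketch. It assumes the normalized form, in which the weighted sum of attribute similarities is divided by the sum of the weights so that the result stays between 0 and 1. The function names and the `exact_match` similarity function are hypothetical and do not come from the study.

```python
from typing import Callable, Sequence


def weighted_similarity(
    new_case: Sequence,
    stored_case: Sequence,
    weights: Sequence[float],
    attribute_similarity: Callable[[object, object], float],
) -> float:
    """Weighted average of per-attribute similarities between T and S."""
    if not (len(new_case) == len(stored_case) == len(weights)):
        raise ValueError("cases and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Sum f(T_i, S_i) * w_i over all attributes i = 1..n.
    weighted_sum = sum(
        attribute_similarity(t_i, s_i) * w_i
        for t_i, s_i, w_i in zip(new_case, stored_case, weights)
    )
    return weighted_sum / total_weight


def exact_match(a, b) -> float:
    # Hypothetical similarity function f: 1.0 for equal values, else 0.0.
    return 1.0 if a == b else 0.0


# Two of three attributes match; the mismatching attribute has weight 2.
score = weighted_similarity(
    ["a", "b", "c"], ["a", "x", "c"], [1.0, 2.0, 1.0], exact_match
)
print(score)  # 0.5
```

With an attribute similarity function that returns values in the range 0 to 1, the score is 0 when no attribute matches and 1 when every attribute matches.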

Confusion Matrix
The performance of a classification system describes how well the system classifies data. The confusion matrix is a method for measuring the performance of a classification method. It compares the classification results produced by the system with the results that should have been obtained. The confusion matrix can be used to evaluate the performance of Machine Learning (ML) algorithms [3]. It represents the predicted and actual conditions of the data produced by the ML algorithm. Accuracy, Precision, Recall, Specificity, and F1 score can all be determined from the confusion matrix. When the performance of an algorithm is measured with a confusion matrix, four terms represent the results of the classification process: TP (True Positive), TN (True Negative), FP (False Positive), and FN (False Negative). TN is the number of negative data items correctly detected as negative, while FP is the number of negative data items detected as positive. TP is the number of positive data items correctly detected as positive. FN is the opposite of TP: the data is positive but is detected as negative.
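The standard definitions of these measures can be sketched from the four counts as follows. The counts in the example are hypothetical and are not the counts obtained in this study.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Derive the usual performance measures from confusion matrix counts."""

    def ratio(numerator: float, denominator: float) -> float:
        # Return 0.0 instead of failing when a denominator is zero.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=6, tn=7, fp=1, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A precision of 100%, as reported in this study, corresponds to FP = 0: no negative item was detected as positive.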

RapidMiner
RapidMiner is open source software that provides a solution for predictive analysis, text mining, and data mining. RapidMiner uses various descriptive and predictive techniques to give users the knowledge they need to make the best decisions. RapidMiner is stand-alone software for data analysis and can also be used as a data mining engine integrated into other products. RapidMiner has the following properties:
a. Developed using the Java programming language so that it can run on various operating systems.
b. The knowledge discovery process is modeled as operator trees.
c. Internal XML representation to ensure a standard format for data exchange.
d. A scripting language that allows large-scale experiments and automation of experiments.
e. Multi-layer concept to ensure efficient data display and guarantee data handling.
f. It has a GUI, a command line mode, and a Java API that can be called from other programs.

Text Preprocessing
Text preprocessing cleans text or words before further processing is carried out. The raw data is unstructured and still contains noise such as punctuation marks, affixes, numbers, special characters, and others [19], [20]. The text preprocessing stage removes this noise so that the data is ready to be processed in the next phase.
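A minimal sketch of such a cleaning step is given below. It is an illustration under stated assumptions, not the RapidMiner process used in this study: the stop-word list is a hypothetical English placeholder, the removal of URLs, mentions, and hashtags is an assumed extra step for tweets, and affix removal (stemming) is omitted.

```python
import re

# Hypothetical stop-word list; a real list depends on the tweet language.
STOP_WORDS = {"the", "a", "is", "and", "to", "of"}


def preprocess(text: str) -> list[str]:
    """Lowercase a tweet, strip noise, and return the remaining tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)       # remove mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)      # remove numbers and punctuation
    tokens = text.split()
    return [token for token in tokens if token not in STOP_WORDS]


tokens = preprocess(
    "The vaccine is SAFE and free!!! 100% @user https://t.co/abc #covid19"
)
print(tokens)  # ['vaccine', 'safe', 'free']
```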

TF-IDF
The TF-IDF (term frequency-inverse document frequency) stage determines the terms or keywords that characterize a document and distinguish it from the other documents in a corpus [21], [22]. The TF-IDF weight of a word increases in proportion to the number of times the word appears in the document, but it is offset by the number of documents that contain the same word. In text mining, feature selection is a key stage with a significant effect on the accuracy of text analytics, because it removes irrelevant features from the dataset. The four most widely used approaches in feature selection are Document Frequency (DF), Term Frequency (TF), Inverse Document Frequency (IDF), and Term Frequency-Inverse Document Frequency (TF-IDF).
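The weighting can be sketched as follows. The sketch assumes raw term counts for TF and the common form idf = log10(N / df), where N is the number of documents and df is the number of documents containing the term. Other variants exist, and the exact variant used in this study is not stated here. The sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: tf * idf} mapping per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in documents for term in set(doc))
    idf = {term: math.log10(n_docs / count) for term, count in df.items()}
    weights = []
    for doc in documents:
        tf = Counter(doc)  # raw count of each term in this document
        weights.append({term: freq * idf[term] for term, freq in tf.items()})
    return weights


# Hypothetical tokenized documents.
docs = [
    ["vaccine", "safe"],
    ["vaccine", "refuse"],
    ["vaccine", "safe", "safe"],
]
weights = tf_idf(docs)
for number, doc_weights in enumerate(weights, start=1):
    print(f"D{number}", doc_weights)
```

In this example "vaccine" appears in every document, so its IDF and its weight are 0, while "refuse" appears in only one document and receives the largest IDF.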

Data Preparation
The data was taken from Twitter using RapidMiner software. A total of 300 data items were collected and divided into two equal groups: 150 positive opinions and 150 negative opinions. Table 1 shows several sample words that are used as markers of whether a sentence is positive or negative. A sentence that contains none of the positive marker words is not included in the positive class, and the same rule applies to negative words.
After preprocessing is complete, the next step is to calculate the TF and IDF of each term or word that represents each document. The frequency of a word in a particular document indicates its importance in that document. The number of documents containing the word indicates how common the word is. Consequently, if a word appears frequently in a document and few documents in the set contain that word, the weight of the word-document relationship is high. The IDF is calculated from the total number of documents and the number of documents containing the term, commonly written as idf = log(N / df). After the IDF value is obtained, the TF-IDF value is calculated by multiplying the frequency of occurrence of the word in each document (TF) by the weight of the word across all documents (IDF). The results of multiplying TF and IDF are in Table 4. Once the document vectors are obtained, the length of each vector is calculated by squaring the weight of each term, summing the squares for each document, and taking the square root of the sum, as shown in Table 5. After the vector lengths are obtained, the next step is to calculate the cosine similarity, which compares the similarity between documents. Cosine similarity is calculated from the scalar product between each document (D1, D2, D3, ..., Dn) and the query (Dx), divided by the product of their vector lengths.
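The vector length and cosine similarity steps can be sketched as follows, using the standard definition of cosine similarity (dot product divided by the product of the vector lengths). The weight values and document names are hypothetical and are not taken from Tables 4 and 5.

```python
import math


def vector_length(vector: dict[str, float]) -> float:
    """Square each term weight, sum the squares, and take the square root."""
    return math.sqrt(sum(weight * weight for weight in vector.values()))


def cosine_similarity(query: dict[str, float], document: dict[str, float]) -> float:
    """Dot product of two weight vectors divided by the product of lengths."""
    dot = sum(weight * document.get(term, 0.0) for term, weight in query.items())
    denominator = vector_length(query) * vector_length(document)
    return dot / denominator if denominator else 0.0


# Hypothetical TF-IDF weights for a query Dx and two training documents.
query = {"vaccine": 1.0, "safe": 2.0}
corpus = {
    "D1": {"vaccine": 2.0, "safe": 4.0},
    "D2": {"refuse": 3.0},
}

# Sort the documents from the largest similarity to the smallest.
ranked = sorted(
    ((name, cosine_similarity(query, doc)) for name, doc in corpus.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, similarity in ranked:
    print(name, round(similarity, 4))
```

D1 points in the same direction as the query and therefore ranks first, while D2 shares no term with the query and has a similarity of 0.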
The cosine similarity results are sorted from the largest value to the smallest. Each result is then given a class according to the class labels assigned at the beginning, and the cosine similarity results are presented in a table.

KNN Algorithm
To classify with the K-NN method, the first step is to determine the value of K. The K highest similarity results are then taken. From these K results, it can be determined whether the document belongs to the Positive or the Negative class.
The test data D5, "let's fight covid with the second Sinovac vaccine even though it has a fever", is classified as a positive sentiment. Among the k = 3 highest similarity results, the probability of the Positive class is greater than that of the Negative class, so D5 is included in the Positive class.
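The voting step can be sketched as follows, assuming a simple majority vote among the K training documents with the highest cosine similarity to the test document. The similarity values and labels are hypothetical and are not the values obtained for D5.

```python
from collections import Counter


def knn_vote(similarities: list[tuple[float, str]], k: int = 3) -> str:
    """Return the majority class among the k most similar training documents."""
    if k <= 0 or not similarities:
        raise ValueError("k must be positive and similarities must not be empty")
    # Take the k pairs with the highest similarity value.
    top_k = sorted(similarities, key=lambda pair: pair[0], reverse=True)[:k]
    votes = Counter(label for _, label in top_k)
    return votes.most_common(1)[0][0]


# Hypothetical (cosine similarity, class label) pairs for one test document.
scored = [
    (0.62, "Positive"),
    (0.10, "Negative"),
    (0.48, "Positive"),
    (0.55, "Negative"),
    (0.05, "Negative"),
]
predicted = knn_vote(scored, k=3)
print(predicted)  # Positive
```

With two classes, an odd K such as 3 avoids tied votes. In this example the three nearest neighbors give two Positive votes and one Negative vote.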

Confusion Matrix Test
Accuracy testing measures how accurately the system classifies public opinion about COVID-19 vaccination using the K-Nearest Neighbor algorithm. The test used the Confusion Matrix method with 87 testing data and 224 training data, and it produced an accuracy of 85%, a precision of 100%, and a recall of 78.94%.

CONCLUSION
Accuracy testing with the confusion matrix method gave an accuracy of 85%, a precision of 100%, and a recall of 78.94% from 83 testing data and 224 training data. The K-Nearest Neighbors algorithm can capture public opinion about COVID-19 vaccines on Twitter, including public discussion of vaccines, halal certification of vaccines, proper use of vaccines, and vaccine prices, as well as more general public conversation such as the functions and objects of vaccination.