Comparison of String Similarity Algorithm in post-processing OCR

Al Birr Karim Susanto, Nuraziz Muliadi, Bagus Nugroho, Muljono Muljono

Abstract


A common problem in Optical Character Recognition (OCR) is that the input image contains noise that partially covers the letters of a word. This can cause misspellings when the words in the image are recognized. After the OCR step, the recognized words therefore need post-processing to correct them, and we perform this correction with a string similarity algorithm. To determine which algorithm works best, we compared Levenshtein distance, Hamming distance, Jaro-Winkler, and the Sørensen-Dice coefficient. In our tests, the Sørensen-Dice coefficient was the most effective, with a value of 0.88 for precision, recall, and F1 score.
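To make the comparison concrete, the following is a minimal Python sketch of three of the measures named in the abstract and of how one of them could be used to correct an OCR word. It is not the authors' implementation: the abstract does not say how the strings are tokenized or where the candidate words come from, so the use of character bigrams for the Sørensen-Dice coefficient, the function names, and the small vocabulary in the usage note are all assumptions made for illustration.

```python
from collections import Counter


def bigrams(word: str) -> Counter:
    """Multiset of adjacent character pairs of a lowercased word (assumed tokenization)."""
    w = word.lower()
    return Counter(w[i:i + 2] for i in range(len(w) - 1))


def dice_coefficient(a: str, b: str) -> float:
    """Sørensen-Dice coefficient over bigram multisets: 2|X ∩ Y| / (|X| + |Y|)."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Both strings have fewer than two characters: fall back to equality.
        return 1.0 if a.lower() == b.lower() else 0.0
    overlap = sum((x & y).values())  # multiset intersection
    return 2.0 * overlap / total


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def hamming(a: str, b: str) -> int:
    """Number of differing positions; only defined for strings of equal length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def correct_word(ocr_word: str, vocabulary: list[str]) -> str:
    """Return the vocabulary entry most similar to the OCR output (hypothetical helper)."""
    return max(vocabulary, key=lambda cand: dice_coefficient(ocr_word, cand))
```

For example, with a hypothetical vocabulary of "recognition", "recreation", and "cognition", calling correct_word("recogn1tion", vocabulary) selects "recognition", because the two strings share 8 of their 10 bigrams each. Note that Hamming distance cannot compare words of different lengths at all, which limits it when OCR noise inserts or drops characters.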






DOI: https://doi.org/10.33633/jais.v8i1.7079


Journal of Applied Intelligent System (e-ISSN: 2502-9401, p-ISSN: 2503-0493) is published by the Department of Informatics, Universitas Dian Nuswantoro, Semarang, and IndoCEISS.

  

 



This journal is licensed under a Creative Commons Attribution 4.0 International License.
