Comparison of String Similarity Algorithm in post-processing OCR

Authors

  • Al Birr Karim Susanto, Teknik Informatika, Universitas Dian Nuswantoro
  • Nuraziz Muliadi, Teknik Informatika, Universitas Dian Nuswantoro
  • Bagus Nugroho, Teknik Informatika, Universitas Dian Nuswantoro
  • Muljono Muljono, Teknik Informatika, Universitas Dian Nuswantoro, Semarang

DOI:

https://doi.org/10.33633/jais.v8i1.7079

Abstract

A common problem in Optical Character Recognition (OCR) is that the input image contains noise that partially covers the letters of a word. This can cause misspellings during word recognition or detection. After the OCR step, post-processing is therefore needed to correct the recognized words, and in this work the words are corrected using a string similarity algorithm. To find out which algorithm works best, we compared the Levenshtein distance, the Hamming distance, Jaro-Winkler, and the Sørensen-Dice coefficient. In our tests, the most effective algorithm was the Sørensen-Dice coefficient, which scored 0.88 for precision, recall, and F1 score.
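To make the comparison concrete, the following is a minimal Python sketch of three of the four measures named in the abstract (Jaro-Winkler is left out for brevity) and of how a similarity score might be used to correct an OCR word. It is not the authors' implementation: the use of character bigrams for the Sørensen-Dice coefficient, the `correct` helper, and the sample lexicon are all assumptions made for illustration.

```python
from collections import Counter
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def hamming(a: str, b: str) -> int:
    """Number of differing positions; only defined for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))


def bigrams(s: str) -> Counter:
    """Multiset of adjacent character pairs (an assumed tokenization)."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice(a: str, b: str) -> float:
    """Sørensen-Dice coefficient: 2 * |X ∩ Y| / (|X| + |Y|) over bigrams."""
    ba, bb = bigrams(a), bigrams(b)
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:  # both strings shorter than two characters
        return 1.0 if a == b else 0.0
    overlap = sum((ba & bb).values())
    return 2 * overlap / total


def correct(word: str, lexicon: Sequence[str]) -> str:
    """Hypothetical helper: pick the lexicon entry most similar to `word`."""
    return max(lexicon, key=lambda candidate: dice(word, candidate))


# Illustrative lexicon; a real system would use a full dictionary.
lexicon = ["recognition", "reception", "cognition"]
print(correct("recogn1tion", lexicon))
```

Note that the Hamming distance cannot compare strings of different lengths, so it cannot handle OCR errors that insert or drop a character; the other measures can.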

Author Biography

Muljono Muljono, Teknik Informatika, Universitas Dian Nuswantoro, Semarang

Profile: Scopus ID: 7409884994; Google Scholar: dp36ibQAAAAJ; Sinta ID: 5975037

Published

2023-02-17