SentiGEN: Synthetic Data Generator for Sentiment Analysis

Authors

  • Pushpika Sundarreson, University of Westminster
  • Sapna Kumarapathirage, University of Westminster

DOI:

https://doi.org/10.62411/jcta.10480

Keywords:

Data Quality, Machine Learning, Optimization, Sentiment Analysis, Synthetic Data

Abstract

Obtaining high-quality, diverse, and accurate datasets for sentiment analysis has long been a significant challenge. Traditional approaches rely on human annotators, who may introduce bias into the datasets and whose work is time-consuming and expensive. Such datasets may also lack the variety needed to train robust, generalizable sentiment analysis models. This study addresses the problem with a novel combination of techniques. The proposed system, SentiGEN, uses a T5 transformer, fine-tuned and optimized with an evolutionary algorithm, to generate high-quality, diverse, and accurate data for sentiment analysis. The generated data is validated with XLNet to ensure high sentiment accuracy. The approach was evaluated by training multiple models, from complex transformers such as BERT to simpler approaches such as KNN. Models trained on the synthetic data outperformed their counterparts trained on real data, and this improvement in predictive accuracy was observed on benchmark datasets such as SST-2 and Yelp. These results indicate that SentiGEN can generate high-quality, diverse, accurate, and realistic data for sentiment analysis.
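
The generate, validate, and evolve loop described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation. The word lists, the generate and validate functions, the single "temperature" parameter, and the fitness formula (validated fraction times distinct fraction) are all hypothetical stand-ins chosen so the example runs with the standard library only. In SentiGEN the generator role is played by a fine-tuned T5 model and the validator role by XLNet, and the abstract does not state which parameters the evolutionary algorithm tunes.

```python
import random

# Toy stand-ins (assumptions): the paper uses T5 to generate and XLNet to validate.
POSITIVE = ["great", "lovely", "excellent", "enjoyable", "superb"]
NEGATIVE = ["awful", "boring", "terrible", "poor", "dreadful"]
NOUNS = ["film", "meal", "service", "plot", "staff"]


def generate(label, temperature, rng):
    """Hypothetical generator: a higher temperature widens the vocabulary
    but also makes off-label words more likely."""
    on_label, off_label = (POSITIVE, NEGATIVE) if label == "positive" else (NEGATIVE, POSITIVE)
    size = 1 + int(temperature * (len(on_label) - 1))
    words = []
    for _ in range(3):
        pool = off_label if rng.random() < temperature / 2 else on_label
        words.append(rng.choice(pool[:size]))
    return f"the {rng.choice(NOUNS)} was {words[0]}, {words[1]} and {words[2]}"


def validate(text, label):
    """Hypothetical validator: majority vote over the sentiment lexicon."""
    tokens = text.replace(",", "").split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return ("positive" if score > 0 else "negative") == label


def fitness(temperature, n=40, seed=0):
    """Generate n samples, keep those the validator accepts, and score the
    setting by (accepted fraction) * (distinct fraction). Assumed formula."""
    rng = random.Random(seed)  # fixed seed keeps the score deterministic
    kept = []
    for i in range(n):
        label = "positive" if i % 2 == 0 else "negative"
        text = generate(label, temperature, rng)
        if validate(text, label):
            kept.append((text, label))
    accuracy = len(kept) / n
    diversity = len({text for text, _ in kept}) / n
    return accuracy * diversity, kept


def evolve(generations=10, pop_size=8, seed=1):
    """Minimal evolutionary loop: keep the best half, add mutated copies."""
    rng = random.Random(seed)
    population = [rng.random() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda t: fitness(t)[0], reverse=True)
        parents = ranked[: pop_size // 2]
        children = [min(1.0, max(0.0, p + rng.gauss(0, 0.1))) for p in parents]
        population = parents + children
    return max(population, key=lambda t: fitness(t)[0])


best_temperature = evolve()
best_score, synthetic_data = fitness(best_temperature)
print(f"temperature={best_temperature:.2f} fitness={best_score:.2f} kept={len(synthetic_data)}")
```

In this toy setting a very low temperature yields accurate but repetitive text and a very high one yields varied but mislabeled text, so the search has to balance the two. That trade-off between diversity and sentiment accuracy is the one the abstract says SentiGEN targets.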

Published

2024-04-27

How to Cite

Sundarreson, P., & Kumarapathirage, S. (2024). SentiGEN: Synthetic Data Generator for Sentiment Analysis. Journal of Computing Theories and Applications, 1(4), 461–477. https://doi.org/10.62411/jcta.10480