A Machine Learning Model for Evaluation of the Corrosion Inhibition Capacity of Quinoxaline Compounds

ABSTRACT


INTRODUCTION
Inhibitor technology is a straightforward, practical, and economical way to control corrosion [1], [2], and the use of inhibitors is a well-established and efficient method of preventing corrosion damage [3], [4]. Corrosion inhibitors hinder charge and mass transfer by forming a protective layer on the metal surface that shields the metal from the corrosive environment [5], [6], [7]. Typically, this layer acts as a barrier against the oxidation reactions that cause corrosion on the metal surface [8], [9].
Because quinoxaline compounds can inhibit corrosion in a wide range of environments, they have drawn considerable attention among organic inhibitors. The higher performance of quinoxaline-based corrosion inhibitors has been linked to the functional groups, conjugated double bonds, and aromatic rings in their molecular structure. To determine the electronic and structural characteristics pertinent to inhibition efficiency, researchers have generally used theoretical methods such as quantum chemical analyses and atomistic simulations [10], [11]. Furthermore, several investigations have explained the inhibition mechanism using the results of theoretical computations such as density functional theory (DFT) and molecular simulations [12], [13].
Since there is a quantifiable relationship between a compound's structure and its molecular properties and activity, machine learning (ML) can be used to evaluate a compound's performance in inhibiting corrosion [14], [15]. Several algorithms, including ensemble methods, Bayesian approaches, decision trees, gradient boosting machines, deep neural networks, and clustering algorithms, have been employed and combined to create ML models that assess inhibitor performance [16], [17], [18], [19], [20], [21].
The main challenge in ML development is building models that predict accurately, so that the results provide relevant information and describe the actual properties of the material being tested. Therefore, in this study, we tested an XGBoost model, with ensemble-based models serving as validation, in predicting the corrosion inhibition efficiency (CIE) of quinoxaline derivative inhibitors.

ML Model
Preprocessing is the first stage in building an ML model. It begins with data normalization using MinMax scaling, which rescales each feature to a common range and so lowers sensitivity to features with large magnitudes. The data are then divided using k-fold cross-validation. This strategy, which trains the model repeatedly in search of the lowest statistical error, was chosen to reduce bias and variance [26], [27]. In this study, k = 10: in each iteration, one fold serves as the test set while the remaining nine folds serve as the training set. Although the appropriate value of k depends on the properties of the data, k = 5 or k = 10 is generally used [28], [29].
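The preprocessing steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names `min_max_scale` and `k_fold_indices`, the random seed, and the toy feature values are all hypothetical.

```python
import random


def min_max_scale(rows):
    """Rescale each feature column of `rows` (a list of lists) to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary assumption
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Every fold except fold i goes into the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical toy data: 20 samples with 2 descriptors each.
features = [[float(i), float(2 * i + 1)] for i in range(20)]
scaled = min_max_scale(features)
splits = list(k_fold_indices(len(scaled), k=10))
```

With k = 10 and 20 samples, each of the ten iterations holds out 2 samples for testing and trains on the remaining 18, mirroring the one-fold-versus-nine-folds split described in the text.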
During the modeling phase, we assess the prediction performance of the XGBoost model against ensemble-based models, namely bagging (BAG), AdaBoost (ADA), and random forest (RF). Regression metrics, namely the coefficient of determination (R²), root mean square error (RMSE), and mean absolute percentage error (MAPE), are used to assess the prediction models. The optimal model has an R² value near 1 and low RMSE and MAPE values [30].

RESULTS AND DISCUSSION
R², RMSE, and MAPE are commonly used to evaluate regression models. Each captures a different aspect of predictive accuracy, which makes them useful for comparing models. R² measures the proportion of the variance in the dependent variable that is predictable from the independent variables; it typically lies between 0 and 1, where 1 indicates a perfect fit, and higher values imply better predictive performance. RMSE is the square root of the average squared difference between predicted and observed values; it measures the average magnitude of the errors, and lower values indicate better accuracy. MAPE is the average absolute percentage difference between predicted and observed values; it is expressed as a percentage, and lower values again signify better accuracy. Table 1 presents the R², RMSE, and MAPE values for the XGBoost, ADA, BAG, and RF models.
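The three metrics described above can be written out directly from their definitions. The sketch below uses only the Python standard library; the function names and the observed and predicted values are hypothetical and chosen for illustration, not taken from the study's dataset.

```python
import math


def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean square error, in the same units as the target."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(observed, predicted):
    """Mean absolute percentage error, in percent (observed values must be nonzero)."""
    ratios = [abs((o - p) / o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical CIE values in percent, for illustration only.
observed = [80.0, 90.0, 100.0]
predicted = [82.0, 88.0, 100.0]

r2_value = r2_score(observed, predicted)
rmse_value = rmse(observed, predicted)
mape_value = mape(observed, predicted)
```

For these toy values the residual sum of squares is 8 and the total sum of squares is 200, so R² is 0.96, while RMSE is the square root of 8/3.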
Table 1 shows the prediction performance of each model in terms of R², RMSE, and MAPE. The XGBoost model has the highest R² value, 0.97, indicating that it explains approximately 97% of the variance in the data. ADA follows with 0.88, BAG with 0.86, and RF with 0.83. XGBoost also achieves the lowest RMSE, 2.48, followed by ADA with 2.65, BAG with 2.72, and RF with 3.01, so it has the smallest average prediction error among the models considered. Finally, XGBoost achieves the lowest MAPE, 2.87, followed by ADA with 3.23, BAG with 3.56, and RF with 3.79, so it also yields the smallest percentage errors on average.
Table 1 therefore shows that the XGBoost model outperforms ADA, BAG, and RF on all three assessment measures: it captures more of the variance in the data and produces smaller absolute and percentage prediction errors. Figure 1 supports these findings visually by showing the distribution of data points relative to the prediction lines of the models. The data points lie closer to the prediction line of the XGBoost model than to those of the other models, indicating a better fit to the actual data. This reinforces the suitability of XGBoost for the prediction task.
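The comparison across metrics can also be checked mechanically. The short sketch below encodes the values reported for Table 1 in the text and selects the best model per metric; the dictionary layout and variable names are assumptions made for illustration.

```python
# Metric values as reported for Table 1 in the text.
table_1 = {
    "XGBoost": {"R2": 0.97, "RMSE": 2.48, "MAPE": 2.87},
    "ADA": {"R2": 0.88, "RMSE": 2.65, "MAPE": 3.23},
    "BAG": {"R2": 0.86, "RMSE": 2.72, "MAPE": 3.56},
    "RF": {"R2": 0.83, "RMSE": 3.01, "MAPE": 3.79},
}

# Higher is better for R2; lower is better for RMSE and MAPE.
best_r2 = max(table_1, key=lambda name: table_1[name]["R2"])
best_rmse = min(table_1, key=lambda name: table_1[name]["RMSE"])
best_mape = min(table_1, key=lambda name: table_1[name]["MAPE"])

# Models ordered from smallest to largest RMSE.
ranking_by_rmse = sorted(table_1, key=lambda name: table_1[name]["RMSE"])
```

The same model is selected under all three criteria, and the ordering by RMSE matches the ordering by R² and by MAPE.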

Figure 1. Scatter plot of data points and model predictions for (a) XGBoost and (b) ADA

CONCLUSION
The ability of ML models to predict the CIE value of quinoxaline compounds has been investigated by comparing XGBoost with ensemble-based models. Based on the R², RMSE, and MAPE metrics, the XGBoost model was more accurate than the ADA, BAG, and RF models: its higher R² indicates better variance capture, and its lower RMSE and MAPE reflect smaller prediction errors. This conclusion is further supported by visual inspection of the data distribution relative to the model predictions, which confirms XGBoost's better fit to the actual data. This research offers insights into practical and efficient material exploration techniques that can help industry develop corrosion-inhibiting materials.

Table 1. Model prediction performances
Model      R²     RMSE   MAPE
XGBoost    0.97   2.48   2.87
ADA        0.88   2.65   3.23
BAG        0.86   2.72   3.56
RF         0.83   3.01   3.79