ARIMA MODEL IN PREDICTING OF COVID-19 EPIDEMIC FOR THE SOUTHERN AFRICA REGION

Background: Coronavirus pandemic, a serious global public health threat, affects the Southern African countries more than any other country on the continent. The region has become the epicenter of the coronavirus with South Africa accounting for the most cases. To cap the deadly effect caused by the pandemic, we apply a statistical modelling approach to investigate and predict COVID-19 incidence. Methods: Using secondary data on the daily confirmed COVID-19 cases per million for Southern Africa Development Community (SADC) member states from March 5, 2020, to July 15, 2021, we model and forecast the spread of coronavirus in the region. We select the best ARIMA model based on the log-likelihood, AIC, and BIC of the fitted models. Results: The ARIMA (11,1,11) model for the complete data set was finally selected among ARIMA models based upon the parameter test and the Box–Ljung test. The ARIMA(11,1,9) was the best candidate for the training set. A 15-day forecast was also made from the model, which shows a perfect fit with the testing set. Conclusion: The number of new COVID-19 cases per million for the SADC shows a downward trend, but the trend is characterized by peaks from time to time. Tightening up of the preventive measures continuously needs to be adapted in order to eradicate the coronavirus epidemic from the population.


Introduction
Sixteen countries constitute the Southern Africa Development Community (SADC) Member States; namely Angola, Botswana, Eswatini, Comoros, Democratic Republic of Congo (DRC), Lesotho, Madagascar, Malawi, Mauritius, Mozambique, Namibia, Seychelles, South Africa, Tanzania, Zambia, and Zimbabwe. The first case of COVID-19 in the SADC region was recorded in early March 2020 in South Africa and the numbers have been increasing exponentially (WHO, 2020). With the exception of Comoro and Lesotho, other member states had been affected by the epidemic by the 15 th of April 2020 (WHO, 2020).
The coronavirus pandemic has resulted in complex challenges across the world, and the SADC region has not been spared. Various measures have been undertaken by the SADC Member states and these include preparedness and response mechanisms, awareness programs, suspension of inbound and outbound flights, suspension of business and tourism travel, set up of border and in-country testing centres, social distancing, and cancellation of gatherings, adoption of self-isolation and mandatory quarantine for a minimum of fourteen days, and treatment for those that test positive (WHO, 2020).
The Southern Africa region has been hit hardest by the COVID-19 pandemic in Africa, thus the epicenter of the coronavirus in the African continent (Massinga et al. 2020). By February 2021 the SADC region had accounted for half of the reported cases in Africa. Of the five African countries accounting for close to 76% of new infections, three of them are members of the SADC namely South Africa, Zambia, and Namibia (UN Economic Commission for Africa, 2020).
With COVID-19 becoming one of the most serious global public health threats, investigating and predicting COVID-19 incidence contributes to the control of its spread. Modelling a future forecast that estimates the regular number of confirmed cases enhances the implementation of rules aimed at controlling the spread of COVID-19. Statistical forecast models play a role in predicting future epidemic threats, managing of societal, economic, cultural, and public health matters. Katoch and Sidhu (2021) predicted the spread and the final size of the COVID-19 epidemic in India using the ARIMA model. Singh et al. (2020) predicted the daily confirmed COVID-19 cases for Malaysia using the ARIMA model.
In Bangladesh, Kundu et al. (2021) forecasted the expected number of total confirmed cases, total confirmed new cases, total deaths, and total new deaths in Bangladesh.
The ARIMA model, generally known as the Box-Jenkins methodology is used for forecasting and analysis in the time series approach. We model the number of daily COVID-19 cases per million in the SADC region using the ARIMA (p,d,q) models. First, we model for all combined SADC countries than for each of the member countries. We finally forecast the spread of the pandemic using data from the three SADC countries (South Africa, Zambia, and Namibia) reporting high cases of the COVID-19 pandemic.

ARIMA Model
Oftentimes, the ARIMA models are used in the analysis and forecasting of the time series, focusing on the random side of the time series. The acronym ARIMA (p,d,q) consist of three main sections: 1) Autoregressive models, AR (p) which express the present value as a linear function in the lagged values of the variable. Thus, where denotes the parameters of the autoregressive, denotes the lag operator, denotes the error terms, and denotes the constant.
2) Integration, I (d) which indicates the degree to which the variable is stationary. Thus, where denote the parameters of the moving average and denotes the expectation of (often assumed to equal 0).

Ethical Consideration
No formal ethical approval was obtained for this study because all the data used were from a secondary source.

Estimation of parameters of ARIMA Model Process in Stages
The ARIMA models estimation method involves several stages undertaken before making predictions namely the identification, estimation, forecasting, and forecasting validation stages. (a) Identification stage -where the degree of stationary of the variable is determined using Augmented Dickey-Fuller (ADF). Based on the autocorrelation function (ACF) and the partial autocorrelation function (PAC), we first define (d), followed by (p) and then (q). (b) Estimation stagewere using the Maximum likelihood estimation method and based on Akaike information criterion (AIC), the appropriate model is selected after comparing possible models. (c) Forecasting stage -where the prediction is made using the final model. (d) Forecasting validation stage -where the model is validated on the prediction based on several indicators that represent the deviation of the calculated values from the actual values, which are the mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE). The ACF plot ascertains the existence of autocorrelation between the residual values. The validity of the prediction is checked using the plots of the difference between the actual and the forecast. Table 1 presents the definition of indicators used in assessing the validity of the forecasts and associated formulae.

Test of stationarity
We test for stationary time series using an Augmented Dickey-Fuller (ADF) test (Dickey and Fuller, 1979). For the transformed nonstationary time series into a stationary time series, we adapt the difference or logarithmic transformation. Achieving stationary is a precondition for establishing an ARIMA model.

Model identification
Appropriate values of p, d, and q of the ARIMA model were identified as part of model identification. We identified the value of d by the number of differentials, the AR and MA from the autocorrelation function (ACF), and the partial autocorrelation function (PACF) plot against the lag length, respectively. In addition, some model selection criteria such as the Akaike information criterion (AIC), and Bayesian information criteria (BIC) were used. The optimal model was chosen based on the smallest value of AIC and BIC (Akaike, 1974).
After the identification of appropriate values of p and q, the next step was to estimate the parameter of the autoregressive and moving average terms included in the model. This was done using the maximum likelihood estimation method.

Diagnostic checking
The identification and fitting of the ARIMA model and estimation of the parameters preceded the checking of whether the residual series was a white noise using the Ljung-Box Q test. Failure to reject the null hypothesis of white noise led to accepting the fitted model. The Ljung-Box Q statistics is defined as where, T is the number of observations, s is the length of coefficients to test autocorrelation, r is the autocorrelation coefficient for lag k. The Ljung-Box Q statistics follows approximately the chi-square distribution with (k-q) degrees of freedom, where q is the number of parameters to be estimated (Ljung and Box, 1978).

Descriptive statistics
In this study, we used an openly available daily number of confirmed cases of COVID -19 per million reported by Our World in Data (www.ourworldindata/coronavirus-source-data) from 7 March 2020 to 3 August 2021. Table 1 presents the summary statistics (including the mean and median). Results from Table 2 shows that the number of reported daily COVID-19 cases per million in the SADC region ranged from 0 to 18663.85 with an average of 760. 03 per million cases reported each day from the 7 th of March, 2020 to the 3 rd of August, 2021. Figure 1 presents a plot of the daily cases, the density plot, normal Q-Q plot and boxplot. We use these plots to check for normality of the time series data. The density plot and the box plot show that the data is skewed to the right and therefore not normality distributed. The normality Q-Q plot also shows some deviation from normality.

Modelling new COVID-19 cases per million for the SADC region
We start by testing for the stationarity of the original time series data and also that of the differenced time series data. We do this using the augmented Dickey-Fuller (ADF) test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test at 5% level of significance. Table 2 presents results of the ADF and KPSS tests where ADF test for non-stationary and nonseasonal and KPSS test for stationary and non-seasonal in the time series data. However, the KPSS test shows that the original data is non-stationary. The first difference of the data is performed and both the ADF and KPSS test show that the differenced series is stationary in its mean and variance at a 5% level. Therefore we adopt d = 1 for ARIMA (p,d,q) model.
We assessed the performance of a number of fitted ARIMA models based on the ME, RMSE, MAE, and MASE. We started off by automatically fitting the ARIMA model for the data. Table 3 presents the result which suggests the candidate ARIMA models.  Table 4 results, the ARIMA (11,1,11) has the lowest RMSE hence the best candidate to explain the daily COVID-19 cases per million for the SADC region. We further investigate the fitted model by examining the residuals plot and performing the Box-Ljung test. Figure 3 presents residual diagnosis results for the ARIMA (11,1,11) model. From Figure 2, the ACF plot of the residuals from the ARIMA(11,1,11) model shows all autocorrelations to fall within the threshold limits, an indication that the residuals are white noise. The Box-Ljung test returns a large p-value ( df = 20, p-value = 0.9973), also suggesting that the residuals are white noise. We opt to use a training and a test set rather than time series cross-validation because the series is long.

Training and testing sets
Using the ARIMA models fitted for the cross-validation time series, we set the criteria for the training data at the first 90% (observations 1 to 498) and for the test data at the last 10% (observations 499 to 515). Table 4 present results from the fitted models ARIMA (9,1,9), ARIMA(11, 1,9) and ARIMA(11,1,11), using the training data. The corresponding ME, RMSE, MAE, MASE, and ACF1 for each of the fitted models are included. Results in Table 5 show that the three models meet the requirements of white noise for the residual time series (pvalue > 0.05), thus the RMSE values were compared. An ARIMA(11, 1,9) has the lowest RMSE, and hence the best candidate for the forecasting. The model has high accuracy and outperforms the forecasting accuracy of the other two models. Table 6 presents the coefficients (estimate, standard error, z-value and p-value) for the ARIMA (11,1,9) model for p=1 to 11 and q=1 to 9, fitted using the training data. We test the accuracy of the estimated parameters at 5% significance level. The estimated parameters for the fitted models have a magnitude less than one ( Table 6). The majority of the fitted models were significant at a 5 % significance level, except for the AR(1), AR(2), AR(7), AR(11), MA(2), and MA (6). The standard errors for the estimated parameters were relatively small indicating less variation.

Forecasting using the ARIMA (11, 1, 9) model
We carried out forecasting for all the time series data using ARIMA (11,1,9) model. First, we divided the data into two sets of the year, "1991" to"2014" as the training data and from "7 march 2021" to "16 July 2021" as the testing data. The data from "17 July 2021" to "3 August 2021" were used as an out-sample forecast. The residuals of the chosen models were stationary and white noise. Table 6 presents a 15 days forecasting for the daily COVID-19 per million cases using ARIMA (11, 1,9) model. We present the point forecast and the corresponding 80 % and 95% confidence intervals, for the observation numbers 499 to 513.  Figure 3 shows that all forecasted lines for the training set (ARIMA(11, 1,9)) are close to the actual values from the test set, which emphasises the quality of the selected models. The up and down trend characterizes the daily COVID-19 cases per million.

Discussion
The first COVID-19 case in the SADC region was recorded on the 5 th of March 2020 in South Africa. We predict the spread of COVID-19 in the SADC region using reported daily cases per million from the 7 th of March 2020 to the 3 rd of August 2021. Using time-series data for the 515 observations we provide an appropriate ARIMA model to predict the spread of COVID-19 in the SADC region. Candidate models are designed by observing the spikes from the autocorrelation function (ACF) and partial autocorrelation function (PACF) charts. The Fitting of several ARIMA models resulted in a prediction model whose accuracy in the performance was assessed using RMSEs. We used a training and test set rather than the time series cross-validation because the time series was relatively long.
The use of all data in the analysis provides ARIMA (11,1,11) models that are good for predicting the spread of COVID-19 in the SADC region. Further refinement considering the training set (first 90% of the set) led to an ARIMA (11, 1,9) model as the best model. Use of an ARIMA(11, 1,9) model in forecasting 15 days in advance fits well with the test (observed) set. An ARIMA (11,1,9) model has a smaller RMSE compared to the ARIMA(11,1,11) model.

Conclusion
Amongst all the ARIMA models, ARIMA (11, 1,9) best predicts the spread of COVID-19 in the SADC region when cases per million are used. A downward trend characterized by peaks from time to time shows the number of new COVID-19 cases per million for the SADC region. The trend suggests a need to continue tightening up the measures that help in mitigating the epidemic. These measures include social distancing, wearing of masks, reducing the number of people in social gatherings, reducing the movement of people between the borders, and so on.

Conflict of interest
The authors declare that there is no conflict of interest associated with this study.