We propose a framework for finding cost-effective SD interventions that balance the health outcomes (averted DALY losses) against the costs associated with SD intervention measures. Decisions on SD interventions are taken with respect to a maximal SD level, determining a population fraction complying with the stay-at-home restrictions aimed at controlling COVID-19 transmission in a typical Australian town. Our framework comprises three main components: (i) a method to evaluate the cost-effectiveness of SD intervention measures, (ii) an agent-based model (ABM) to simulate the effect of these interventions on the progression of COVID-19, and (iii) an RL algorithm to optimise an adaptive SD intervention simulated within the ABM and evaluated in terms of cost-effectiveness. The following sections describe these components in more detail. Our study did not involve experiments on humans/human data or the use of human tissue samples. The anonymised census data, which relate to the construction of the ABM, are publicly available from the Australian Bureau of Statistics.

### Net health benefit

In order to evaluate the cost-effectiveness of NPI interventions, we quantify the net health benefit (NHB)^{46,67}. The NHB captures the difference between the health effect of a new intervention and the comparative health effect, given the associated cost incurred at some pre-defined cost-effectiveness threshold. The cost and the health effect of the new intervention are measured against the "null" intervention, that is, in the presence of some baseline interventions which are not subject to evaluation^{47}. In our study, the null set comprises only the base interventions, i.e., case isolation (CI), home quarantine (HQ), and travel (border control) restrictions (TR). Hence, we evaluate the cost-effectiveness of the NPIs shaped by social distancing (SD), beyond that of the CI, HQ and TR interventions. The rate modulating the comparison of health effects is called "willingness to pay" (WTP), defined as a maximum monetary threshold that the society accepts as the cost of an additional unit of health gained due to the new intervention. The NHB of the SD intervention is defined as follows:

$$\begin{aligned} \text{NHB} = \mu_{E_{SD}} - \frac{\mu_{C_{SD}}}{\lambda} \end{aligned}$$

(1)

where \(\mu_{E_{SD}}\) is the mean of the health effect \(E_{SD}\) produced by the SD intervention, \(\mu_{C_{SD}}\) is the mean of the cost \(C_{SD}\) incurred by this intervention, and \(\lambda\) is the WTP set by policy makers or public health programs.
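In code, Eq. (1) amounts to a comparison of sample means; the sketch below (with purely illustrative numbers, not values from the study) is one way to compute the NHB from repeated ABM runs:

```python
import numpy as np

def net_health_benefit(effects_daly, costs, wtp):
    """NHB = mean(E_SD) - mean(C_SD) / lambda, per Eq. (1).

    effects_daly: per-run DALY losses averted by the SD intervention
    costs:        per-run monetary costs incurred by the intervention
    wtp:          willingness-to-pay threshold (dollars per DALY)
    """
    return np.mean(effects_daly) - np.mean(costs) / wtp

# Illustrative only: 100 averted DALYs at a cost of $2M,
# evaluated at a WTP of $50,000 per DALY.
nhb = net_health_benefit([100.0], [2_000_000.0], 50_000.0)  # 100 - 40 = 60.0
```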

The corresponding health effect \(E_{SD}\) of the SD intervention is computed by comparing the health losses averted by the evaluated intervention to the losses of the null intervention:

$$\begin{aligned} E_{SD} = L_{0} - L_{SD} \end{aligned}$$

(2)

where \(L_{0}\) and \(L_{SD}\) are the health losses for the null and SD interventions, respectively (see Fig. 7 for illustration). In this study, we quantify health losses using the Disability-Adjusted Life Year (DALY) approach recommended by the World Health Organization (WHO)^{40,68}. Specifically, the years of life lost due to premature mortality (YLL) are combined with the years of life lived with disability (YLD), producing a single quantity expressing the burden of disease in time units:

$$\begin{aligned} \text{DALY} = \text{YLL} + \text{YLD} \end{aligned}$$

(3)

For each infected individual (represented by an agent in the ABM), YLL is calculated as the difference between the life expectancy and the year of death if this agent dies due to COVID-19. The second term, YLD, is measured by the duration of the disease within an infected agent who recovers from COVID-19 (adjusted by a disability weight representing the disease severity). For non-infected agents, \(\text{YLL} = 0\) and \(\text{YLD} = 0\), under the assumption that COVID-19 has not affected their health conditions. In this study, we also assumed that a life year lost due to a COVID-19-related death and an impacted year lived with disease for non-fatal cases are equally important (that is, we set the disability weight equal to 1). In addition, no age weighting and no discounting of future health benefits^{68} were applied in our DALY calculation, following^{69} and^{70}. The health effects were calculated at the population level, accumulating the individual measurements from all agents in our ABM.
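A minimal sketch of this per-agent bookkeeping, with an illustrative life expectancy and illness duration (the model itself uses age-dependent quantities):

```python
def agent_daly(infected, died, age_at_death=None, life_expectancy=83.0,
               illness_years=0.0, disability_weight=1.0):
    """DALY = YLL + YLD (Eq. 3) for a single agent.

    Non-infected agents contribute zero. A fatal case contributes years
    of life lost (YLL); a non-fatal case contributes years lived with
    disability (YLD), scaled by the disability weight (set to 1 in this
    study). No age weighting or discounting is applied.
    """
    if not infected:
        return 0.0
    yll = max(life_expectancy - age_at_death, 0.0) if died else 0.0
    yld = 0.0 if died else disability_weight * illness_years
    return yll + yld

# Population-level burden: accumulate over all agents.
agents = [
    dict(infected=False, died=False),                           # unaffected
    dict(infected=True, died=True, age_at_death=70.0),          # fatal case
    dict(infected=True, died=False, illness_years=14 / 365),    # 14-day illness
]
total_daly = sum(agent_daly(**a) for a in agents)
```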

Furthermore, the cost of the evaluated intervention is estimated under the assumption of an equal distribution of the total cost across the agents. When an SD intervention is imposed over a population fraction defined by some SD compliance level, the corresponding cost is assumed to be proportional to this fraction. For example, an intervention with an SD level of 50% is assumed to cost half as much as a full lockdown at the SD level of 100%. A scaling in proportion to the number of impacted individuals is applied in approximating the weekly intervention costs for a typical town, given the intervention costs estimated at $1.4 billion per week for the entire Australian economy (i.e., the entire population)^{71}.
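Under this proportionality assumption, the weekly cost for a town can be sketched as follows; the national population figure below is our own illustrative assumption, not a value taken from the paper:

```python
NATIONAL_WEEKLY_COST = 1.4e9      # AU$ per week for a full national lockdown (Ref. 71)
NATIONAL_POPULATION = 25_700_000  # assumed 2021 Australian population (illustrative)

def weekly_sd_cost(town_population, sd_level):
    """Weekly cost of an SD intervention scaled by town size and SD level.

    A 50% SD level is assumed to cost half as much as a full (100%)
    lockdown, and costs are distributed equally across individuals.
    """
    per_capita = NATIONAL_WEEKLY_COST / NATIONAL_POPULATION
    return per_capita * town_population * sd_level

cost_half = weekly_sd_cost(town_population=2393, sd_level=0.5)
```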

The NHB approach allows us to comparatively evaluate the cost-effectiveness of various interventions which may differ significantly in their costs and corresponding health effects. Consequently, it enables the derivation of adaptive SD interventions by gradually changing the SD levels in a direction that increases the cost-effectiveness. Thus, the NHB can be readily used by a reinforcement learning process exploring the search space for more cost-effective interventions.

### Willingness to pay

Prior studies considered a broad range of WTP levels. For example, the cost per quality-adjusted life year (QALY) can be estimated from the probability that a respondent will reject the bid values^{72,73}. The estimates obtained by this study were: JPY 5.0 million in Japan (US$41,000 per QALY), KRW 608 million in the Republic of Korea (US$74,000 per QALY), NT$2.1 million in Taiwan (US$77,000 per QALY), £23,000 in the UK (US$36,000 per QALY), AU$64,000 in Australia (US$47,000 per QALY), and US$62,000 per QALY in the USA.

Another approach determined that, on average, the cost per DALY averted was related to the Gross Domestic Product (GDP) per capita. For instance, the cost was 0.34 times the GDP per capita in low Human Development Index (HDI) countries, 0.67 times the GDP per capita in medium HDI countries, 1.22 times the GDP per capita in high HDI countries, and 1.46 times the GDP per capita in very high HDI countries^{74}. For Australia, this would correspond to a cost in the range between AU$93,197.9 = US$67,735.6 (or, 1.22 × US$55,521.0) and AU$111,531.9 = US$81,060.7 (or, 1.46 × US$55,521.0). These estimates are produced using data from the World Bank^{75} and the International Monetary Fund^{76} for the average 5-year GDP per capita and the USD-AUD exchange rate, respectively.
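The Australian range quoted above follows from direct multiplication of the cited five-year average GDP per capita by the HDI-based multipliers:

```python
# Cited five-year average GDP per capita for Australia (US$, Refs. 75-76).
gdp_per_capita_usd = 55_521.0

# WTP per DALY averted as a multiple of GDP per capita (Ref. 74):
wtp_high_hdi = 1.22 * gdp_per_capita_usd       # ~US$67,735.6
wtp_very_high_hdi = 1.46 * gdp_per_capita_usd  # ~US$81,060.7
```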

Another accepted approach is to characterise the WTP threshold as the (consumption) value that a society attaches to a QALY^{77}. This societal perspective was adopted by the contingent valuation approach which valued QALYs under uncertainty for the Dutch population, producing the range from €52,000 to €83,000 (approximately, AU$82,409.9 – AU$131,538.8).

A recent study in the Australian context used a range of WTP up to US$300,000 (or AU$412,771.9) per health-adjusted life year (HALY). It specified preferable COVID-19 intervention policies in three ranges: (i) up to US$20,000 (AU$27,518.1), (ii) from US$30,000 (AU$41,277.2) to US$240,000 (AU$330,217.4), and (iii) above US$240,000^{50}.

These studies informed the choice of the WTP thresholds used in our analysis. In particular, we considered three WTP thresholds: $10K per DALY, $50K per DALY, and $100K per DALY.

### Agent-based model for COVID-19 transmission and control

In order to simulate transmission and control of the COVID-19 pandemic in Australia, we used a well-established ABM^{19,31,78}, calibrated to the Delta variant (B.1.617.2), and modified to capture a fluctuating adherence to social distancing as well as a more refined vaccination coverage. The original ABM included numerous individual agents representing the entire population of Australia in terms of demographic attributes, such as age, residence and workplace or education. In re-calibrating and validating this model, we used a surrogate population of New South Wales (7,485,860 agents), whereas the primary simulations, coupled with the RL algorithm, employed a surrogate population of 2393 agents representing the population of a small Australian local area (e.g., a town), generated to match key characteristics of the Australian census carried out in 2016. The ABM is described in detail in Supplementary Material: Agent-based Model, and here we only summarise its main features.

Each agent belongs to a number of mixing contexts (household, community, workplace, school, etc.) and follows commuting patterns between the areas of residence and work/education. The commuting patterns are obtained from the Australian census and other datasets available from the Australian Bureau of Statistics (ABS)^{79,80,81}. The transmission of the disease is simulated in discrete time steps, with two steps per day: daytime for work/education interactions, and nighttime for all residential and community interactions. The contact and transmission probabilities differ across contexts and ages.

The disease progression within an agent is simulated over several disease-related states, including Susceptible, Infectious (Asymptomatic or Symptomatic), and Removed. All agents are initialised as Susceptible. When an agent is Infectious, other susceptible agents sharing some mixing context with the agent may become infected, and infectious after some latent period. An age-dependent fraction of agents progresses through the disease asymptomatically. The transmission probabilities are determined at each step, given the agents' mixing contexts, as well as their symptomaticity. The probability of transmission from an Infectious agent varies with the time since exposure, rising to a peak of infectivity and then declining during the agent's recovery. At the end of the infectious period, the agents change their state to Removed (i.e., recovered or deceased), which excludes the agent from the Susceptible population. Thus, re-infections are not simulated, given that the simulated timeframe is relatively short (19 weeks following the first week during which the social distancing intervention is triggered, as mentioned below).
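The progression described above can be sketched as a small state machine; the durations and asymptomatic fraction below are placeholders (the calibrated model uses age-dependent values, and here the latent period is simply folded into a day counter):

```python
import enum
import random

class State(enum.Enum):
    SUSCEPTIBLE = 0
    INFECTIOUS = 1   # asymptomatic or symptomatic
    REMOVED = 2      # recovered or deceased; re-infection is not modelled

class Agent:
    """Toy per-agent disease progression; parameter values are placeholders."""

    def __init__(self, age, p_asymptomatic=0.3, latent_days=2, infectious_days=10):
        self.age = age
        self.p_asymptomatic = p_asymptomatic
        self.latent_days = latent_days
        self.infectious_days = infectious_days
        self.state = State.SUSCEPTIBLE
        self.symptomatic = None
        self.days_since_exposure = None

    def expose(self, rng):
        """Infection via a contact in a shared mixing context."""
        if self.state is State.SUSCEPTIBLE:
            self.state = State.INFECTIOUS
            self.symptomatic = rng.random() >= self.p_asymptomatic
            self.days_since_exposure = 0

    def advance_day(self):
        """Progress the disease by one day; remove the agent at the end."""
        if self.state is State.INFECTIOUS:
            self.days_since_exposure += 1
            if self.days_since_exposure > self.latent_days + self.infectious_days:
                self.state = State.REMOVED

agent = Agent(age=35)
agent.expose(random.Random(0))
for _ in range(13):
    agent.advance_day()
```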

A pandemic scenario is simulated by infecting some agents. During calibration and validation, these agents are chosen ("seeded") in proximity to an international airport^{19,31}. During the primary simulations of each outbreak in a small Australian town, we seeded all initial cases within this area, in accordance with a binomial sampling process, described in Supplementary Material: Simulation Scenarios and Seeding Strategy. The seeding process is terminated when cumulative cases exceed a predefined threshold, simulating an imposition of travel restrictions around the town. At this point, the infections may continue to spread only through local transmission.
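The seeding-and-termination logic can be sketched as follows; all parameter values here are placeholders, as the actual binomial scheme is specified in the paper's supplementary material:

```python
import random

def seed_outbreak(daily_trials, p_seed, stop_threshold, rng):
    """Seed initial cases day by day via binomial sampling.

    Each day, Binomial(daily_trials, p_seed) new infections are seeded
    in the town; once cumulative cases exceed stop_threshold, seeding
    stops, simulating the imposition of travel restrictions, after which
    only local transmission continues. All parameters are illustrative.
    """
    cumulative = 0
    while cumulative <= stop_threshold:
        new_seeds = sum(rng.random() < p_seed for _ in range(daily_trials))
        cumulative += new_seeds
        # ...mark `new_seeds` randomly chosen susceptible agents as infected
    return cumulative

total_seeded = seed_outbreak(daily_trials=50, p_seed=0.1,
                             stop_threshold=20, rng=random.Random(1))
```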

A vaccination rollout scheme is implemented in two modes: (i) a progressive rollout mode (i.e., reactive vaccination) used to validate the model against the actual data from the Sydney outbreak during June–November 2021, and (ii) a pre-emptive mode used to simulate pandemic scenarios managed by NPIs, assuming that some population immunity has already developed in response to previous vaccination campaigns. Both modes assume hybrid vaccinations with two vaccines: Oxford/AstraZeneca (ChAdOx1 nCoV-19) and Pfizer/BioNTech (BNT162b2), concurring with the Australian campaigns during 2021^{25,31}.

Different NPIs are simulated: case isolation, home quarantine, and social distancing interventions^{19,31}. Case isolation and home quarantine are assumed to be the baseline interventions, activated from the simulation onset. Social distancing (i.e., "stay-at-home" restrictions) is only triggered when cumulative cases surpass a specific threshold. Unlike previous implementations of the ABM, the compliance of agents, bounded by a given SD level, is simulated heterogeneously and dynamically, with Bernoulli sampling used to determine whether an agent is compliant with the SD intervention at any given simulation step (within the overall limit on the fraction of compliant agents). Vaccination states and compliance with NPIs modify the transmission probabilities in the corresponding mixing contexts, thus affecting the spread of the outbreak.
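A sketch of the per-step Bernoulli compliance sampling (a simplification of the model's capped, heterogeneous scheme):

```python
import random

def sample_compliance(n_agents, sd_level, rng):
    """Bernoulli-sample which agents comply with SD at one simulation step.

    Each agent complies with probability sd_level, so the set of compliant
    agents changes from step to step (heterogeneous and dynamic), while
    the expected compliant fraction matches the SD level.
    """
    return [rng.random() < sd_level for _ in range(n_agents)]

rng = random.Random(42)
compliant = sample_compliance(n_agents=2393, sd_level=0.7, rng=rng)
fraction = sum(compliant) / len(compliant)  # close to 0.7 on average
```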

Importantly, the health effects resulting from the COVID-19 pandemic are captured by aggregating the high-resolution data simulated at the agent level. Unlike other studies which estimate the outcomes only at the end, we quantify the health losses after every simulation day, by measuring probable age-dependent deaths and total probable impacted days for newly infected agents. This temporal resolution allows us to construct a decision process evaluating social distancing interventions in a manner compatible with the RL methodology. Specifically, each decision point includes a state (i.e., information describing the current pandemic situation across all agents), an action (e.g., a decision setting a level of compliance with social distancing below the limit \(SD_{max}\)), and the associated outcome.

### Reinforcement learning-based search for cost-effective NPIs

Our framework for optimising the cost-effectiveness of SD interventions consists of two typical RL components: a decision-maker and an environment, as shown in Fig. 6. The decision-maker is configured as a neural network^{82,83,84,85} that can make decisions on the SD compliance levels (within the limit \(SD_{max}\)), given the decision-maker's observation of the environment. The environment comprises the ABM which simulates the effects of these decisions on the transmission and control of COVID-19 within a typical Australian town, as described in the previous section. Our objective is to train the decision-making neural network based on the interactions between the decision-maker and the environment, so that the cost-effectiveness of the SD intervention is maximised.

In our setting, once the outbreak begins, the decisions are assumed to be made every week, concurring with the time resolution adopted in other studies^{55,56,86}. At a decision point *t*, the decision-maker takes a (partial) observation of the environment (denoted by \(o_t\)), and selects its action \(a_t\) aiming to cost-effectively control the ongoing outbreak. An observation characterises the current pandemic state, including the detected incidence (asymptomatic and symptomatic), prevalence, and the counts of recoveries and fatalities. Once decision \(a_t\) is made, setting the SD compliance for the next week, the ABM environment simulates the SD intervention associated with \(a_t\) and its effects during the period from the decision point *t* to the next decision point \(t+1\). This simulation determines the monetary costs incurred during the period (i.e., one week) and the associated health losses (averted DALYs). These quantities constitute the reward signal, providing feedback to the decision-maker. At the next decision point this feedback is used to evaluate the choice of \(a_t\).
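The weekly decision cycle can be sketched as a standard RL interaction loop; `ToyEnv` below is a stand-in for the ABM environment and the lambda a stand-in for the decision neural network, not the actual implementation:

```python
def run_episode(env, policy, n_weeks=19):
    """One episode of the weekly decision cycle: observe, act, simulate.

    `env.step(action)` is assumed to run the ABM for one week under the
    SD compliance level chosen by `action`, returning the next partial
    observation (detected incidence, prevalence, recoveries, fatalities)
    and the reward combining averted DALY losses and the scaled cost.
    """
    trajectory = []
    obs = env.reset()
    for _ in range(n_weeks):
        action = policy(obs)               # SD compliance level for the week
        next_obs, reward = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
    return trajectory

class ToyEnv:
    """Stand-in for the ABM environment, with trivial dynamics."""
    def reset(self):
        self.week = 0
        return (0, 0, 0, 0)
    def step(self, action):
        self.week += 1
        return (self.week, 0, 0, 0), 1.0 - action  # dummy observation, reward

trajectory = run_episode(ToyEnv(), policy=lambda obs: 0.5)
```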

The interactions between the decision-maker and the environment start when the number of cumulative cases exceeds a threshold triggering the SD interventions, and continue until the end of the simulation period (e.g., comprising \(N = 19\) decision points \(t \in \{0, 1, 2, \ldots, N\}\)). The interactions form a sequence of observations, actions, and rewards, registered at successive decision points. The RL algorithm samples from this sequence, performs its optimisation, and updates the decision neural network. In general, this sampling step can be carried out at every decision point as the new data are collected, or be delayed depending on the algorithm.

All interactions between the decision-maker and the environment form an "episode" in the learning process of optimal SD interventions. The learning process involves multiple episodes independently following one another. The total reward generated during an episode, or the episodic reward, is expected to grow as the learning process continues, evidencing that the learned NPIs generate a higher NHB. The learning process continues until only minimal improvements in the episodic reward are observed, marking convergence of the RL algorithm. The resultant decision neural network can then be used unchanged during the evaluation phase. The results and analysis presented in section "Results" are based on the evaluation phase.

The process of decision-making follows a Markov Decision Process (MDP) represented by a tuple \(\langle S, O, P, A, R \rangle\). The set *S* contains possible states of the environment. The decision-maker is assumed to observe these states only partially, and the set *O* contains all partial observations. The set *A* contains all possible actions \(a_t\) available to the decision-maker at each time step *t*. Each action \(a_t \in A\) determines the corresponding SD compliance level \(f(a_t)\) for the SD intervention, imposed over the population between the time step *t* and the next time step \(t+1\). Formally, \(f: A \rightarrow [0,1]\), e.g., \(f(a^0_t) = 0\) for the zero SD intervention \(a^0\) at any time *t*. Note that \(a_t\) defines the SD intervention applied in addition to baseline interventions, such as CI, HQ and TR, which are always enabled by default.

In general, the decision-maker can determine its action according to a stochastic policy \(\pi: O \times A \rightarrow [0,1]\), or a deterministic policy \(\pi: O \rightarrow A\). In this study, we configure the decision-maker to follow a stochastic policy described by a probability distribution \(\pi(o,a)\). Given observation \(o_t\) obtained at time step *t*, the action \(a_t\) can be sampled from the policy distribution, denoted as \(a_t \sim \pi(o_t, \cdot)\).

Unlike other studies^{52,56}, which discretise the range of social distancing percentages, we defined \(f(a_t)\) to be continuous in the interval \([0, SD_{max}]\), for some limit \(0 \le SD_{max} \le 1\). The execution of an action \(a_t\) at the state \(s_t \in S\) constrains the environment dynamics developing between the time steps *t* and \(t+1\). The state transition probability, denoted by \(P(s' | s,a): S \times A \times S \rightarrow [0,1]\), quantifies the chance of a transition from the current state *s* to the next state \(s'\), following the execution of the action *a*. Thus, the probability \(P(s' | s,a)\) reflects the pandemic dynamics controlled by the interventions. After the action \(a_t\) is executed, the environment produces the reward signal \(r_{t+1} \in R\) so that the agent can reinforce its policy at the next time step \(t+1\). Each reward \(r_{t+1}\) is given by the corresponding health effects attained during the simulated period.

In order to optimise the SD interventions by maximising their NHB estimates over the entire simulation period of *N* weeks, we use a period-wise approach maximising the following objective function (see Supplementary Material: Period-wide NHB Objective Function for further details):

$$\begin{aligned} \max_{\pi_{\theta}} \mathop{{\mathbb{E}}}_{\begin{array}{c} a_t \sim \pi_{\theta}(o_t, \cdot) \\ (s_t, a_t, s_{t+1}) \sim \tau \\ (s^0_t, a^0, s^0_{t+1}) \sim \tau^0 \end{array}} \sum_{t=0}^{N} \left[ L(s^0_t, a^0) - L(s_t, a_t) - \frac{f(a_t) C^1}{\lambda} \right], \end{aligned}$$

(4)

where \(\pi_{\theta}\) is the policy shaped by parameters \(\theta\); *L*(*s*, *a*) is the health loss, measured in DALYs, resulting when the intervention at the SD level *a* is applied to the environment at state *s*; action \(a_t\) is sampled from policy \(\pi_{\theta}(o_t, \cdot)\) based on the environmental observation \(o_t\); the transition from \(s_t\) to \(s_{t+1}\) belongs to the trajectory \(\tau\) controlled by SD interventions \(a_t\) (i.e., is sampled from a distribution of trajectories); the transition from \(s^0_t\) to \(s^0_{t+1}\) belongs to the uncontrolled trajectory \(\tau^0\) shaped by the null action \(a^0\); and \(C^1\) is the mean cost of the full 100% SD intervention between two consecutive time steps, with the full cost scaled down by the factor \(f(a_t) \in [0, SD_{max}]\). The difference in the health losses between the trajectories \(\tau^0\) and \(\tau\), representing the health effects of the simulated SD intervention, is illustrated in Fig. 7.

In order to maximise the objective function expressed by Eq. 4, we specify the reward signal for the action \(a_t\) as follows:

$$\begin{aligned} r(s_t, a_t | s^0_t) = L(s^0_t, a^0) - L(s_t, a_t) - \frac{f(a_t) C^1}{\lambda} \end{aligned}$$

(5)

Maximising the total received rewards along the trajectory \(\tau\) is equivalent to maximising the objective expressed in Eq. 4, yielding the optimal decision-making policy \(\pi^*\).
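Equation (5) translates directly into code; the sketch below uses illustrative inputs in place of the simulated DALY losses and costs:

```python
def reward(loss_null, loss_sd, sd_level, full_cost, wtp):
    """Weekly reward r(s_t, a_t | s_t^0), per Eq. (5).

    loss_null: DALY losses over the week on the uncontrolled trajectory
    loss_sd:   DALY losses over the week under the SD intervention
    sd_level:  f(a_t), the compliance fraction in [0, SD_max]
    full_cost: C^1, the weekly cost of a full (100%) SD intervention
    wtp:       lambda, the willingness-to-pay threshold per DALY
    """
    return (loss_null - loss_sd) - sd_level * full_cost / wtp

# Illustrative: 6 DALYs averted by a half-strength SD week, with a $200K
# full weekly cost, at a WTP of $50K per DALY -> reward = 6 - 2 = 4.
r = reward(loss_null=10.0, loss_sd=4.0, sd_level=0.5,
           full_cost=200_000.0, wtp=50_000.0)
```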

The ABM simulation is inherently stochastic, and hence, we use a discounted version of the accumulated rewards:

$$\begin{aligned} \max_{\pi_{\theta}} \mathop{{\mathbb{E}}}_{\begin{array}{c} a_t \sim \pi_{\theta}(o_t, \cdot) \\ (s_t, a_t, s_{t+1}) \sim \tau \\ (s^0_t, a^0, s^0_{t+1}) \sim \tau^0 \end{array}} \sum_{t=0}^{N} \gamma^{t} r(s_t, a_t | s^0_t) \end{aligned}$$

(6)

where \(\gamma \in (0,1)\) is the discount factor, and *r* is the reward function defined by Eq. 5.
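The discounted objective replaces the plain sum of weekly rewards with a \(\gamma\)-weighted one; as a brief sketch:

```python
def discounted_return(rewards, gamma=0.95):
    """Sum of gamma**t * r_t over an episode (the inner sum in Eq. 6)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

value = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
```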

The policy \(\pi_{\theta}\) is determined by a set of parameters \(\theta\) which specify the weights of the decision neural network. A parameterised policy \(\pi_{\theta}\) can be optimised by maximising a policy performance measure function \(J(\theta)\). A canonical update for the parameters \(\theta\) at each learning step *k* follows the gradient ascent method^{82}, seeking to maximise the performance function \(J(\theta)\):

$$\begin{aligned} \theta_{k+1} = \theta_{k} + \alpha \widehat{\nabla J(\theta_k)} \end{aligned}$$

(7)

where \(\alpha\) is the learning rate for the update, and \(\widehat{\nabla J(\theta_k)}\) is the estimate of the gradient of the performance function with respect to \(\theta_k\).
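A toy illustration of the gradient ascent update in Eq. (7), with a simple quadratic standing in for the actual performance function \(J(\theta)\):

```python
import numpy as np

def gradient_ascent_step(theta, grad_estimate, alpha=0.05):
    """One update: theta_{k+1} = theta_k + alpha * grad-hat J(theta_k) (Eq. 7)."""
    return theta + alpha * grad_estimate

# Toy performance function J(theta) = -(theta - 2)^2, whose exact gradient
# is -2 * (theta - 2); repeated ascent steps drive theta toward the
# maximiser theta = 2. In practice the gradient is a noisy estimate.
theta = np.array([0.0])
for _ in range(200):
    theta = gradient_ascent_step(theta, -2.0 * (theta - 2.0))
```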

In our study, we used the Proximal Policy Optimisation (PPO) algorithm^{87} (see Supplementary Material: PPO Algorithm), aiming to avoid the "destructively large policy updates" reported when the discounted objective function, defined by Eq. 6, is optimised directly^{87}. Specifically, we utilised the implementation of PPO for continuous actions provided by the Stable-Baselines3 library^{88}. The convergence of the training for SD intervention policies, evidenced by improvement of the accumulated rewards over training episodes, is presented in Supplementary Material: Empirical Convergence in the Training of SD Policies.