A compartmental model for smoking dynamics in Italy: a pipeline for inference, validation, and forecasting under hypothetical scenarios - BMC Medical Research Methodology

A compartmental model for smoking dynamics in Italy: a pipeline for inference, validation, and forecasting under hypothetical scenarios – BMC Medical Research Methodology

Posted by Admin July 13, 2024 Fitness

Data

The analyses relied on data from heterogeneous sources. We used data from the National Institute of Statistics (ISTAT) Multipurpose Surveys “Aspect of Daily Life” (AVQ) (www.istat.it/it/archivio/91926), which every year collect fundamental information related to the daily life of individuals and families in Italy, enrolling about 25,000 families distributed in about 800 Italian municipalities of different population sizes. Specifically, we obtained from the ISTAT AVQ surveys an estimate of the distribution by smoking habit (never, current, and former smokers) of the population residing in Tuscany for each year from 1993 to 2019, separately for males and females and by age class (14-17, 18-19, 20-24, 25-34, 35-44, 45-54, 55-59, 60-64, 65-74, 75+). We obtained from the same surveys the smoking intensity distribution for current smokers, by sex and age class.

We used data from the ISTAT Multipurpose Surveys European Health Survey (EHIS) (www.istat.it/it/archivio/167485), a survey on the main aspects of public health carried out every 5 years from 1980 in all member states of the European Union, to obtain an estimate of the smoking intensity distribution among former smokers, as well as information about time since smoking cessation, separately for males and females and by the same age classes reported above. In particular, we considered the surveys for 1994, 1999, 2004, and 2013.

We obtained the size of the Tuscany population on January 1st 1993 and January 1st 2005, by age and sex, from the ISTAT website (www.istat.it). From the same website, we got the mortality rates by age and sex and the number of new births in Tuscany for the period 1993-2019. The relative risks (RRs) of death for smokers and ex-smokers versus never smokers are those reported in the Appendix of [32].

Model specification

We specified a compartmental model for smoking habit dynamics in the population, which we call Smoking Habits Compartmental (SHC) model. In order to better present the SHC model adopted for the analysis, we first introduce a simpler version of it, and then proceed step by step, adding elements of complexity.

The starting model assumes that at each time the alive population is divided into the following non-overlapping compartments: never (N), current (C), and former (F) smokers. We consider only cigarette smoking. Never smokers can become current smokers, current smokers can become former smokers, and former smokers may restart smoking (smoking relapse). The compartments C and F are further divided into sub-compartments denoted by $C_i$ and $F_i$, where $i\in \{l,m,h\}$ indicates the level of smoking intensity, corresponding to low ($\ge$10 and $\ge$20 cigarettes/day) smoking intensity, respectively. During their life, individuals can change their smoking status, but, for the sake of simplicity, we assume that they cannot change their level of smoking intensity. The model admits deaths and new births. From each compartment, subjects can transit to a deceased compartment denoted by the letter D and a subscript corresponding to the compartment of origin. New births ($\nu (t)$ is the number of new births at time t) increase the size of the compartment N. Transitions of the individuals from a given compartment to another one determine flows regulated by the transition parameters, among which the rates of starting smoking ($\gamma _i^*$), stopping smoking ($\epsilon _i^*$), and relapsing into smoking after having stopped ($\eta _i^*$). Note that these rates can depend on the level of smoking intensity i. Death happens with different rates for never ($\delta ^*_N$), current ($\delta ^*_{C_i}$), and former ($\delta ^*_{F_i}$) smokers. For current and former smokers, the mortality rates may depend also on smoking intensity. This compartmental model, graphically represented in Fig. 1, is expressed by the following system of differential equations for each $i\in \{l,m,h\}$:

$$\begin{aligned} \left\{ \begin{array}{l} \frac{dN(t)}{dt}=-N(t) \left( 1-\delta ^*_N\right) \gamma ^*-N(t)\delta ^*_N+\nu (t) \\ \frac{dC_i(t)}{dt}=-C_i(t) \left( 1-\delta ^*_{C_i}\right) \epsilon _i^*-C_i(t)\delta ^*_{C_i}+N(t)\left( 1-\delta ^*_N\right) \gamma ^*_i+F_i(t)\left( 1-\delta ^*_{F_i}\right) \eta _i^*\\ \frac{dF_i(t)}{dt}=-F_i(t)\left( 1-\delta ^*_{F_i}\right) \eta _i^*-F_i(t)\delta ^*_{F_i}+C_i(t)\left( 1-\delta ^*_{C_i}\right) \epsilon _i^*\\ \frac{dD_N(t)}{dt}=N(t)\delta ^*_N\\ \frac{dD_{C_i}(t)}{dt}=C_i(t)\delta ^*_{C_i}\\ \frac{dD_{F_i}(t)}{dt}=F_i(t)\delta ^*_{F_i}, \end{array}\right. \end{aligned}$$

(1)

where $\gamma ^*=\gamma ^*_l+\gamma ^*_m+\gamma ^*_h$ is the overall transition rate from the status of never smoker to the status of current smoker. The initial conditions of the system, i.e. the sizes of the compartments at time 0, set to the 1$^{st}$ of January 1993, are $N(0)=n_0$, $D_N(0)=0$, $F_i(0)=f^i_0$, $C_i(0)=c^i_0$ and $D_{C_i}(0)=D_{F_i}(0)=0$ $\forall i \in \{l,m,h\}$, where $n_0$ is the number of never smokers in the considered population on the first day of the study period, and $f^i_0$ and $c^i_0$ are the number of ex-smokers and current smokers with smoking intensity i.

Fig. 1

Smoking Habits Compartmental model in its simplest form

For computational reasons, it is convenient to discretise the system of differential equations in Eq. (1), assuming that the size of the compartments is constant during 1-year time steps. Hereafter, t will denote discrete time, with the year as a time-unit ($t\in \{1,…,T\}$), and we replace the system in Eq. (1) with a system of difference equations, where the annual probability of stopping smoking ($\epsilon _i$), and the annual probabilities of smoking relapse ($\eta _i$) are derived from the corresponding rates in Eq. (1), as well as the annual probabilities of death for never ($\delta _{N}$), current ($\delta _{C_i}$), and former ($\delta _{F_i}$) smokers. In particular, $\delta _N=1-\exp (-\delta ^*_{N})$, and $\epsilon _i=1-\exp (-\epsilon _i^*)$, $\eta _i=1-\exp (-\eta _i^*)$, $\delta _{C_i}=1-\exp (-\delta ^*_{C_i})$, $\delta _{F_i}=1-\exp (-\delta ^*_{F_i})$ with $i \in \{l,m,h\}$. Regarding the probabilities of starting smoking for never smokers, the overall annual probability $\gamma$ comes from the corresponding rate, $\gamma =1-\exp (-\gamma ^*)$, while $\gamma _i=\pi _{C_i}\gamma$, where $\varvec{\pi }=(\pi _{C_l}, \pi _{C_m}, \pi _{C_h})$ is the distribution of the level of smoking intensity among the new current smokers. Notice that, if $\lambda$ is the rate of occurrence of an event, the probability of experiencing at least one event in the time unit is $1-\exp (-\lambda )$. The resulting system of discretised equations for each $i\in \{l,m,h\}$ and $t\in \{1,…, T\}$ is:

$$\begin{aligned} \left\{ \begin{array}{l} N(t)=N(t-1)\left( 1-\delta _N\right) \left( 1-\gamma \right) +\nu (t)\\ C_i(t)=C_i(t-1)\left( 1-\delta _{C_i}\right) \left( 1-\epsilon _i\right) +N(t-1)\left( 1-\delta _N\right) \gamma _i+F_i(t-1)\left( 1-\delta _{F_i}\right) \eta _i\\ F_i(t)=F_i(t-1)\left( 1-\delta _{F_i}\right) \left( 1-\eta _i\right) +C_i(t-1)\left( 1-\delta _{C_i}\right) \epsilon _i\\ D_N(t)=D_N(t-1)+N(t-1)\delta _N\\ D_{C_i}(t)=D_{C_i}(t-1)+C_i(t-1)\delta _{C_i}\\ D_{F_i}(t)=D_{F_i}(t-1)+F_i(t-1)\delta _{F_i}, \end{array}\right. \end{aligned}$$

(2)

where $\nu (t)$ denotes the newborns in the year t. The initial conditions of the system coincide with those of the previous model in Eq. (1).

The SHC model extends the system in Eq. (2) to account for two additional discrete time axes: age and time since smoking cessation. The final model is a compartmental model with separate compartments for each discrete age (a), where also a stratification by years since smoking cessation (c) is introduced for former smokers. Two separate SHC models are specified by sex. The final SHC model is defined by the following system of equations for each $i\in \{l,m,h\}$ and $t\in \{1,…, T\}$:

$$\begin{aligned} \left\{ \begin{array}{ll} N(t;a)=\nu (t) &{} \text {if }a=0\\ N(t;a)=N(t-1;a-1)\left( 1-\delta _N(a-1)\right) \left( 1-\gamma (a-1)\right) &{} \text {if }a>0\\ C_i(t;a)=0 &{} \text {if }a=0\\ C_i(t;a)=C_i(t-1;a-1)\left( 1-\delta _{C_i}(a-1)\right) \left( 1-\epsilon (a-1)\right) + \\ \qquad \qquad N(t-1;a-1)\left( 1-\delta _N(a-1)\right) \pi _{C_i}\gamma (a-1) + &{} \\ \qquad \qquad \sum \limits _{c>0}F_i(t-1;a-1,c-1)\left( 1-\delta _{F_i}(a-1,c-1)\right) \eta (c-1) &{} \text {if }a>0\\ F_i(t;a,c)=0 &{} \text {if }a=0,\;c\ge 0\\ F_i(t;a,c)=C_i(t-1;a-1)\left( 1-\delta _{C_i}(a-1)\right) \epsilon (a-1) &{} \text {if }a>0,\;c=0\\ F_i(t;a,c)=F_i(t-1;a-1,c-1)\left( 1-\delta _{F_i}(a-1,c-1)\right) \left( 1-\eta (c-1)\right) &{} \text {if }a>0,\;c>0\\ D_N(t;a)=0 &{} \text {if }a=0\\ D_N(t;a)=D_N(t-1;a)+N(t-1;a-1)\delta _N(a-1) &{} \text {if }a>0\\ D_{C_i}(t;a)=0 &{} \text {if }a=0\\ D_{C_i}(t;a)=D_{C_i}(t-1;a)+C_i(t-1;a-1)\delta _{C_i}(a-1) &{} \text {if }a>0\\ D_{F_i}(t;a,c)=0 &{} \text {if }a=0,\;c\ge 0\\ D_{F_i}(t;a,c)=0 &{} \text {if }a>0,\;c=0\\ D_{F_i}(t;a,c)=D_{F_i}(t-1;a,c)+F_i(t-1;a-1,c-1)\delta _{F_i}(a-1,c-1) &{} \text {if }a>0,\;c>0. \end{array}\right. \end{aligned}$$

(3)

The initial conditions of the system are obtained by generalizing those of the model in Eq. (2), to take into account the stratification by age for current smokers, and the stratification by age and time since cessation for former smokers.

For simplicity, $\nu (t)$ was assumed to be constant over time. The age a takes values from 0 to 100. We set $\gamma (a)$ to 0 until 13 and from 35 years of age, and, in order to account for the possible non-linearity between 14 and 34, we modelled the logit transformation of $\gamma (a)$ through a natural cubic regression spline of age, with 2 equidistant internal knots. Similarly, we set $\epsilon (a)$ to 0 until 19 years of age; we introduced a natural cubic regression spline with 2 equidistant internal knots to model non-linearity for $a\ge 20$. The resulting functions are the following:

$$\begin{aligned} \gamma (a)= \left\{ \begin{array}{ll} 0 &{} \;0\le a\le 13\cup \;a\ge 35\\ \frac{\exp (s(a;\varvec{\psi }))}{1+\exp (s(a;\varvec{\psi }))}&{}\;14\le a\le 34 \end{array}\right. \qquad \epsilon (a)= \left\{ \begin{array}{ll} 0&{}\;0\le a\le 19\\ \frac{\exp (s(a;\varvec{\phi }))}{1+\exp (s(a;\varvec{\phi }))}&{}\;a\ge 20, \end{array}\right. \end{aligned}$$

where $\varvec{\psi }=(\psi _0,\psi _1,\psi _2,\psi _3)$ and $\varvec{\phi }=(\phi _0,\phi _1,\phi _2,\phi _3)$ are vectors of unknown parameters governing the probabilities of starting and quitting smoking, respectively. The relapsing rate, $\eta ^*(c)$, was modelled as a negative exponential function of the time since cessation, with parameters $\varvec{\omega }=(\omega _0,\omega _1$):

$$\begin{aligned} \eta ^*(c)= \left\{ \begin{array}{ll} 0&{}\;c=0\\ \omega _0\omega _1\exp (-\omega _1c)&{}\;1\le c\le 15\\ \omega _0\omega _1\exp (-\omega _115)&{}\;c\ge 16, \end{array}\right. \end{aligned}$$

where $\omega _0$ governs the lifetime probability of no relapse and $\omega _1$ tunes how fast the rate of smoking relapse declines with the time from cessation [12, 23, 28, 33]. Both $\omega _0$ and $\omega _1$ are assumed to be positive so that $\eta ^*(c)$ is a positive, decreasing function of c. The assumptions on which the SHC model is based are summarized in Section Model assumptions, Supplemental Material.

Estimation strategy

An important issue in compartmental models concerns parameter identifiability [30]. Complex models with many compartments, such as the model in Eq. (3), have many parameters governing the admitted transitions, but unfortunately observed data are often insufficient to estimate all of them. To overcome this problem we fixed some of the parameters to values from the literature or external data, leaving as unknown the mortality risks and the spline coefficients $\varvec{\phi }$ and $\varvec{\psi }$, and $\varvec{\omega }$. Regarding the initial size of the compartments, it was obtained by combining the population size at the beginning of the study period with estimated prevalences arising from the ISTAT AVQ and EHIS surveys. Details on the values assigned to the fixed parameters and the initial size of the compartments are provided in Section S2, Supplemental Material. The unknown parameters have been estimated following the two step-procedure described in the next section.

Two-step estimation

In order to estimate the unknown parameters, we adopted a two-step procedure. Both steps use as observed data the prevalence of never, current and former smokers from ISTAT AVQ, here denoted by $p^{obs}(t;a^*)=\left( p^{obs}_C(t;a^*),p^{obs}_N(t;a^*),p^{obs}_F(t;a^*)\right)$, where t denotes the year and $a^*$ the age class. In particular, we considered years from 1993 to 2019 and age classes $a^*\in \{14-17, 18-19, 20-24, 25-34, 35-44, 45-54, 55-59, 60-64, 65-74, 75+\}$. According to the ISTAT AVQ survey, current smokers are defined as individuals who reported being smokers at the time of the interview, while former smokers as those who reported having quitted.

First step.

We estimated the age-specific risks of mortality for never smokers $\delta _N(a)$ using the prevalence values, as well as relative risks coming from the literature. In particular, the age-specific risks of dying for current and former smokers in the population at time t are respectively $\delta _C(t;a)=RR_{C}\times \delta _{N}(t;a)$ and $\delta _F(t;a)=RR_{F}\times \delta _{N}(t;a)$, with $RR_{C}$ and $RR_{F}$ the relative risks of dying for current smokers and former smokers versus never smokers. Let $p(t;a)=\left( p_N(t;a), p_C(t;a), p_F(t;a)\right)$ be the distribution of never, current and former smokers in the population. The overall mortality at age a in the year t, $\delta _{pop}(t;a)$, is a weighted average of $\delta _N(t;a)$, $\delta _C(t;a)$, and $\delta _F(t;a)$ with weights p(t; a). Thus, $\delta _N(t;a)$ can be derived as the ratio:

$$\begin{aligned} \delta _N(t;a)=\frac{\delta _{pop}(t;a)}{p_N(t;a)+RR_Cp_C(t;a)+RR_Fp_F(t;a)}. \end{aligned}$$

(4)

Therefore, separately for each year t in the period 1993-2019, we obtained an estimate of $\delta _N(t;a)$, plugging into Eq. (4) the mortality risk at age a reported for Tuscany, the relative risks for current and former smokers versus never smokers [32], and the observed age-specific prevalence of never, current and former smokers $p^{obs}(t;a^*)$. Finally, we averaged the year-specific $\hat{\delta }_N(t;a)$ over t, obtaining the overall estimate $\hat{\delta }_N(a)$. The risks of dying for current and former smokers by i and c were then derived as:

$$\begin{aligned} \hat{\delta }_{C_{i}}(a)=RR_{C_i}\times \hat{\delta }_{N}(a) \qquad \hat{\delta }_{F_i}(a,c)=RR_{F_i}(c)\times \hat{\delta }_{N}(a). \end{aligned}$$

Second step.

After fixing the mortality risks to the values computed at the first step, $\hat{\varvec{\delta }}(a,c)$, we calibrated the model on the observed prevalence $p^{obs}(t;a^*)$ to estimate the vector of parameters which were still unknown, $\varvec{\theta }=(\varvec{\psi },\varvec{\phi },\varvec{\omega })$. Let $p(t;a^*,\varvec{\theta })=\left( p_{C}(t;a^*,\varvec{\theta }),p_{N}(t;a^*,\varvec{\theta }),p_{F}(t;a^*,\varvec{\theta })\right)$ be the vector of the prevalence of never, current and former smokers belonging to the class of age $a^*$ at time t, calculated on the population predicted by the model in Eq. (3), given a specific value of $\varvec{\theta }$. With calibration, we searched for the value of $\varvec{\theta }$ that leads to predicted prevalences as close as possible to the observed ones. To compare observed and simulated trajectories, we considered the following objective function, where $H(\cdot ,\cdot )$ denotes the Hellinger distance [34] between two discrete probability distributions:

$$\begin{aligned} Obj(\varvec{\theta }){} & {} =\dfrac{1}{T\times A^*} \sum \limits _{t,a^*} H \left( p(t;a^*,\varvec{\theta }),p^{obs}(t;a^*) \right) \nonumber \\{} & {} = \dfrac{1}{T\times A^*\times \sqrt{2}} \sum \limits _{t,a^*} \sqrt{\sum \limits _{k\in \{C,N,F\}} \left( \sqrt{p_k(t;a^*,\varvec{\theta })}-\sqrt{p^{obs}_k(t;a^*)} \right) ^2}, \end{aligned}$$

(5)

where $A^*$ is the number of age classes $a^*$. We minimized the objective function in Eq. (5) over $\varvec{\theta }$ via a global optimization procedure, resorting to the JULIA package Optim.jl [35]. It is well-known that, in the context of compartmental models, optimization results often depend on the chosen starting points of the algorithm [36, 37]. To avoid the problem of getting stuck in local minima, we performed several optimizations using different starting points, then we selected the solution that brought to the minimum Hellinger distance [30, 36]. The two-step procedure was performed separately by sex, obtaining different estimates for males and females and sex-specific evolution of the compartment sizes. We estimated the compartment sizes up to 2043 by projecting the model dynamics, assuming that parameters and model structure do not change after 2019.

Parametric bootstrap procedure

We quantified the sampling variability around point estimates and projections by using a parametric bootstrap procedure [29, 30]. Let $\hat{\varvec{\theta }}$ be the vector of parameters minimizing the objective function in Eq. (5) and $p(t;a^*,\hat{\varvec{\theta }})$ the corresponding estimated vector of prevalence for never, current and former smokers of age $a^*$ in the population at time t. Let $n(t;a^*)$ be the number of subjects belonging to the age class $a^*$, enrolled in the ISTAT AVQ in the year t in Tuscany (i.e. the denominator of the observed prevalence $p^{obs}(t;a^*)$). The bootstrap procedure consisted of the following steps:

1.

for each $a^*$ and t, we sampled a vector of prevalence from a Dirichlet distribution:

$$\begin{aligned} p^b(t;a^*)\sim Dirichlet \left( p_C(t;a^*,\hat{\varvec{\theta }})n(t;a^*),p_N(t;a^*,\hat{\varvec{\theta }})n(t;a^*),p_F(t;a^*,\hat{\varvec{\theta }})n(t;a^*)\right) ; \end{aligned}$$

(6)
2.

we considered the collection of these sampled vectors as the observed values and performed the two-step estimation, computing the vector $\varvec{\delta }^b(a,c)$ and finding $\varvec{\theta }^b$ that minimized the objective function;
3.

we repeated the previous two steps $B=1000$ times, collecting a sample of B bootstrap estimates of $\varvec{\delta }(a,c)$ and $\varvec{\theta }$ to be used to estimate as many curves describing the transition parameters and compartment size trajectories;
4.

we calculated the $90\%$ confidence intervals for the quantities of interest as the 5$^{th}$ and 95$^{th}$ percentiles of the bootstrap estimates; pointwise confidence intervals were calculated for the curves.

Model validation

In order to evaluate the predictive performance of the estimation procedure described in Estimation strategy section, we applied cross-validation (CV) on a rolling basis. We started defining the first 3 years of the period 1993-2019 as the training set, and the subsequent q years as the test set. Then, we calibrated the compartmental model in Eq. (3) on the training set and used the estimated model to forecast the prevalence of never, current and former smokers in the years belonging to the test time window. The discrepancy between observed and projected prevalence was evaluated in terms of absolute percentage error. Then, we progressively extended the length of the training set by adding one year at a time, and we obtained the projections for the q subsequent years every time. We stopped when the last training set considered the years between 1993 to 2019-q. We finally computed the Mean Absolute Percentage Error (MAPE), by averaging the absolute percentage errors across different types of smokers over time, age classes, and training sets. Note that in general, for a set of n observations, MAPE is defined as $\frac{100}{n}\sum \nolimits _{i=1}^{n}{\frac{|O_i-E_i|}{O_i}}$, where $O_i$ is the observed value and $E_i$ is the expected one for unit i. We calculated the MAPE for different forecasting horizons by setting $q=3,6,9,12$ years.

Sensitivity analysis

A key assumption of our model is that the dynamics of the studied phenomenon, particularly the transition probabilities between compartments, remain constant from 1993 to 2019 (and continue to do so until 2043). To verify its appropriateness, we conducted two separate analyses, first calibrating the model on the period 1993-2004 and then on the period 2005-2019, and compared the results. Notice that in the analysis 2005-2019 the initial sizes of the compartments were set to values obtained from 2005 surveys (see Section Details on the fixed parameters, Supplemental Material).

Another crucial point concerns the fact that the inference results could be affected by the model parameters assumed as fixed. To address this issue, we utilized a variance-based approach to Global Sensitivity Analysis (GSA) [31]. Given $K_X$ mutually independent inputs $(X_1, X_2,…, X_{K_X})$ and a model which, given the inputs, returns $K_Y$ outputs $(Y_1, Y_2,…, Y_{K_Y})$, this approach quantifies the relative importance of each input to the model’s outcomes by propagating uncertainty from the inputs to the outputs and computing variance indices. In our application, given the model in Eq. (3), we considered as inputs all the parameters, both fixed and unknown, and the Hellinger distance in Eq. (5) as the output Y. Note that, considering the Hellinger distance as the output, we directly measure the influence of the inputs on the discrepancy between observed and predicted data, thus, ultimately, on the inference results. Then we calculated, for each input $X_i$, the so-called total variance index, which $S^{tot}_i$ measures the overall effect of the i-th input on the output Y, including all the interactions of $X_i$ with the other inputs. This index corresponds to the expected variance of Y that would be left on average when all the parameters but $X_i$, $X_{\sim i}$, are fixed:

$$\begin{aligned} S^{tot}_i=\frac{E_{X\sim i}(Var_{X_i}(Y|X_{\sim i}))}{Var(Y)}. \end{aligned}$$

A total variance index close to zero indicates that the parameter $X_i$ does not influence Y, and therefore, the inference results. Conversely, a large total variance index indicates that the parameter does have an impact on them. In the former case, the parameter can be fixed without affecting our estimates, or in other words, our model and data do not provide information on this parameter. The computation of $S^{tot}_i$ relies on Monte Carlo simulations [38]. We simulated $K=10,000$ different combinations of the model inputs, then, for each of them, we predicted the prevalence values via the model in Eq. (3) and calculated the corresponding Hellinger distance. Specifically, we draw the model parameters from the distributions reported in Table S3.1, Supplemental Material, adopting a quasi-random numbers sampling which provides a more efficient exploration of the sample space [39, 40]. On the basis of the simulated Hellinger distance and the combination of the parameters, we computed the total variance indices as described in [38]. It is worth noting that in the GSA we did not include the age-specific mortality rates, $\delta _{pop}(t;a)$, among the model inputs. It is reasonable, as done elsewhere [23], to treat these parameters as not affected by uncertainty, given that they were estimated based on the entire population.

Health impact assessment

The impact of smoking on population health was quantified in terms of attributable deaths. We calculated the Smoking-Attributable Deaths (SADs) in the year t as the difference between the number of deaths occurring in that year under the actual scenario, i.e. the number of deaths predicted by the model in Eq. (3) given $\hat{\varvec{\theta }}$ and $\hat{\varvec{\delta }}(a)$, and the deaths we would observe under a specific counterfactual condition. We considered the counterfactual condition where current smokers and former smokers in the year t were never smokers. Therefore, for each age a, we applied to the size of the compartments of smokers or ex-smokers the excess risk relative to never-smokers. The excess risk is defined as the difference between risks. For example, the excess risk of current smokers of age a and smoking intensity i relative to never-smokers is $\delta _{C_i}(a,c)-\delta _N(a)$. Thus, for the year t, the number of SADs among people of age a was calculated as:

$$\begin{aligned} \text {SAD}(t;a)=\sum \limits _iC_i(t,a; \hat{\varvec{\theta }})(\hat{\delta }_{C_i}(a)-\hat{\delta }_N(a))+ \sum \limits _i\sum \limits _cF_i(t,a,c; \hat{\varvec{\theta }})(\hat{\delta }_{F_i}(a,c)-\hat{\delta }_N(a)). \end{aligned}$$

The age-specific $\text {SAD}(t;a)$ can be summed over a to obtain the total number of attributable deaths in population or in a certain class of age: $\text {SAD}(t)=\sum \limits _a\text {SAD}(t;a)$. The impact of smoking on population health can be expressed also in terms of Population Attributable Fraction (PAF), defined as the proportion of deaths that would be avoided if all current and former smokers in the population or in a subset of it were never smokers [41]. For details, see Section Population Attributable Fraction computation, Supplemental Material. We calculated SADs and PAFs over the period 1993-2043, separately by sex and for the ages 35+ and 65+.

Impact of future hypothetical policies

In order to illustrate the use of the compartmental model to assess the impact of hypothetical TCPs on SAD, we focused on three policies acting on the rates of starting and stopping smoking, $\gamma ^*(a)$ and $\epsilon ^*(a)$. We assumed that all the defined policies are implemented in 2023 and that, in the absence of policies, the smoking habit dynamics would not change.

Taking inspiration from [42], and from a recent policy introduced in New Zealand (www.bbc.com/news/world-asia-63954862) we defined the following hypothetical TCPs starting from 2023:

TCP1, a policy able to reduce the rate of starting smoking by 25% in 10 years for subjects between 14 and 34 years of age; for simplicity, we assumed a linear decrease, starting with a decrease of 2.5% the first year, a decrease of 5% the second one and so on, up to a final decrease of 25% after 10 years;
TCP2, a policy able to increase the rate of stopping smoking by 25% in 10 years for subjects between 25 and 100 years of age; for simplicity, we assumed a linear growth, starting with an increase of 2.5% the first year, an increase of 5% the second one and so on, up to a final increase of 25% after 10 years;
TCP3, a policy that imposes a complete smoking ban on cohorts born since 2009.

For each policy, we calculated the evolution of smoking prevalence and the number of avoided deaths expected from its implementation, taking the scenario without policies as a reference (TCP0). To better appreciate the impact of policies in terms of SAD, limited to this analysis, we extended the projections up to 2063.