# Statistical Analyses

### Statistical methods used in ISAAC: Phase One

The two age groups (6 & 7 years and 13 & 14 years) were analysed separately. Symptom prevalences in each centre were calculated by dividing the number of positive responses to each question by the number of completed questionnaires for the written and video questionnaires separately. Thus, apparent inconsistencies between responses to the stem and branch questions were accepted and not recoded. Country and regional level prevalence estimates were calculated in the same manner. All the positive responses within the country (or region) were divided by the number of completed questionnaires from the same geographical area.
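As a minimal sketch of the calculation described above, the following Python snippet computes centre-level and country-level prevalences. The centre names and counts are hypothetical and serve only to illustrate the arithmetic.

```python
def prevalence(positives, completed):
    """Proportion of completed questionnaires with a positive response."""
    if completed <= 0:
        raise ValueError("completed must be a positive count")
    return positives / completed


# Hypothetical counts for two centres in one country: (positives, completed)
centres = {"Centre A": (450, 3000), "Centre B": (300, 3100)}

# Centre-level prevalence for one question
centre_prev = {name: prevalence(p, n) for name, (p, n) in centres.items()}

# Country-level prevalence: pool the positive responses and the completed
# questionnaires across centres, rather than averaging centre prevalences
country_prev = prevalence(
    sum(p for p, _ in centres.values()),
    sum(n for _, n in centres.values()),
)
```

Pooling numerators and denominators means that larger centres contribute more to the country estimate than smaller ones.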

The main variables reported are defined as:

- Wheeze: “Have you/your child had wheezing or whistling in the chest in the last 12 months?”
- Severe wheeze: “Have you/your child had wheezing or whistling in the chest in the last 12 months?” and one of “4 or more attacks of wheeze” or “sleep been disturbed due to wheezing on average once or more per week” or “had wheezing severe enough to limit speech to only one or two words at a time between breaths”.
- Reported asthma: “Have you/your child ever had asthma?”
- Rhinoconjunctivitis: “In the past 12 months, have you had a problem with sneezing, or a runny, or a blocked nose when you DID NOT have a cold or the flu? If yes: in the past 12 months, has this nose problem been accompanied by itchy-watery eyes?”
- Hay Fever ever: “Have you/your child ever had hayfever?”
- Eczema: “Have you ever had an itchy rash which was coming and going for at least 6 months? If yes: Have you had this itchy rash at any time in the last 12 months? If yes: Has this itchy rash at any time affected any of the following places: the folds of the elbows, behind the knees, in front of the ankles, under the buttocks, or around the neck, ears, or eyes?”
- Reported eczema: “Have you/your child ever had eczema?”
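The composite definitions above can be sketched as simple derived variables. This is an illustrative Python sketch only: the dictionary keys are hypothetical names for the questionnaire items, and missing answers are treated as "no", which is an assumption rather than a documented ISAAC rule.

```python
def severe_wheeze(r):
    """Current wheeze plus at least one of the three severity indicators."""
    return bool(
        r.get("wheeze_12m", False)
        and (
            r.get("attacks_4_or_more", False)
            or r.get("sleep_disturbed_weekly", False)
            or r.get("speech_limited", False)
        )
    )


def rhinoconjunctivitis(r):
    """Nose problem without a cold or flu, accompanied by itchy-watery eyes."""
    return bool(r.get("nose_problem_12m", False) and r.get("itchy_watery_eyes_12m", False))


def eczema(r):
    """Itchy rash for 6 months, present in the last 12 months, at a typical site."""
    return bool(
        r.get("itchy_rash_6m", False)
        and r.get("itchy_rash_12m", False)
        and r.get("rash_flexural_sites", False)
    )


# Hypothetical response for one child
child = {"wheeze_12m": True, "sleep_disturbed_weekly": True, "nose_problem_12m": True}
flags = (severe_wheeze(child), rhinoconjunctivitis(child), eczema(child))
```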

In centres where a random sample of schools was taken, the effect of cluster sampling by schools was examined by calculating the design effects [Rao 1992]. The effects of cluster sampling were generally small but have been incorporated in analyses involving tests of significance.
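The text does not give the exact formula used. As a hedged illustration, one common ratio-estimator form of the design effect for clustered binary data compares the cluster-based variance of the prevalence with the simple binomial variance. The school counts below are hypothetical.

```python
def design_effect(schools):
    """Design effect for a prevalence estimated from school clusters.

    schools: list of (positives, children) pairs, one per school.
    Returns the ratio of the cluster-based variance of the overall
    prevalence to the variance expected under simple random sampling.
    """
    m = len(schools)
    if m < 2:
        raise ValueError("at least two schools are needed")
    n = sum(children for _, children in schools)
    p = sum(pos for pos, _ in schools) / n
    # Sum of squared school-level residuals around the overall prevalence
    resid_ss = sum((pos - children * p) ** 2 for pos, children in schools)
    var_cluster = m / (m - 1) * resid_ss / n ** 2
    var_binomial = p * (1 - p) / n
    return var_cluster / var_binomial


# Hypothetical centre with four schools of 100 children each
schools = [(10, 100), (20, 100), (15, 100), (15, 100)]
deff = design_effect(schools)
effective_n = sum(children for _, children in schools) / deff
```

A design effect above 1 indicates that children within the same school are more alike than children from different schools, so the effective sample size is smaller than the number of children surveyed.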

Basic descriptive summaries of the data were compiled by centre and country, in both age groups, along with Spearman correlations between variables. These summaries have often been displayed as ranked plots (see example right). A variety of analytic methods have been used in papers; some are described below.

The within-country and between-country variances were estimated using a generalised linear mixed model in which country, and centre within country, are random effects [Wolfinger 1993]. With this model, the ratio of the 95% CI of prevalences (between country to within country) was calculated.

### Statistical methods used in ISAAC: Phase Two

Definitions for the key outcome variables in Phase Two followed the conventions set in Phase One. Sample sizes in most of the Phase Two centres were smaller than in Phase One, typically in the region of 1000 children, so clustering at the level of school within centres was not considered in the analysis.

An important feature of the Phase Two design was the restriction of more expensive or invasive measurements to a subsample of children within each centre, selected according to history of wheezing in the last year. This stratified sampling design required statistical analyses for many of the variables to be weighted (using “survey weights” inversely proportional to the sampling fractions for wheezers and non-wheezers). The SAS procedures SURVEYREG and SURVEYLOGISTIC were used for this purpose (in Stata, svy: commands perform the same survey-weighted analysis).
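To illustrate why weighting is needed, the following Python sketch computes a survey-weighted prevalence from a stratified subsample. It shows only the weighting principle, not the SAS or Stata procedures themselves, and all counts and field names are hypothetical.

```python
def weighted_prevalence(strata):
    """Survey-weighted prevalence across sampling strata.

    Each stratum is a dict with:
      eligible - children in the stratum in the full centre sample
      sampled  - children selected for the subsample
      positive - sampled children positive for the measurement
    """
    numerator = 0.0
    denominator = 0.0
    for s in strata:
        weight = s["eligible"] / s["sampled"]  # inverse of the sampling fraction
        numerator += weight * s["positive"]
        denominator += weight * s["sampled"]
    return numerator / denominator


# Hypothetical centre: all wheezers sampled, one in nine non-wheezers sampled
strata = [
    {"name": "wheezers", "eligible": 100, "sampled": 100, "positive": 50},
    {"name": "non-wheezers", "eligible": 900, "sampled": 100, "positive": 20},
]
weighted = weighted_prevalence(strata)
unweighted = sum(s["positive"] for s in strata) / sum(s["sampled"] for s in strata)
```

In this example the unweighted estimate (0.35) overstates the prevalence because wheezers are over-represented in the subsample; the weighted estimate (0.23) restores their share in the full sample.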

The general approach adopted for Phase Two data analysis was to fit separate models for each centre and then pool the resulting regression coefficients in a random-effects meta-analysis. The random-effects pooling allowed for possible heterogeneity of risk factor associations between centres. In many analyses, a separate pooling within two groups of centres (more affluent, and less affluent, defined by national GNI per capita) proved to be informative.
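The pooling step can be sketched as follows. The text does not state which random-effects estimator was used; this sketch assumes the widely used DerSimonian-Laird method, and the centre-specific coefficients are hypothetical.

```python
import math


def random_effects_pool(estimates, variances):
    """Pool centre-specific coefficients (DerSimonian-Laird, assumed here).

    estimates: centre-specific regression coefficients (e.g. log odds ratios)
    variances: their squared standard errors
    Returns (pooled estimate, its standard error, between-centre variance tau2).
    """
    k = len(estimates)
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * b for wi, b in zip(w, estimates)) / sw
    # Cochran's Q measures the spread of estimates around the fixed-effect mean
    q = sum(wi * (b - fixed) ** 2 for wi, b in zip(w, estimates))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add the between-centre variance to each variance
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * b for wi, b in zip(w_star, estimates)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2


# Hypothetical log odds ratios from three centres with equal precision
pooled, se, tau2 = random_effects_pool([0.1, 0.5, 0.9], [0.01, 0.01, 0.01])
```

When the centre-specific estimates agree closely, the estimated between-centre variance is zero and the result coincides with a fixed-effect pooling.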

This two-step approach to analysis of risk factor associations in Phase Two contrasts with the single-step approach adopted in Phase Three, where a fixed-effect pooling of regression coefficients was implemented along with random centre-level intercepts, using PROC GLIMMIX in SAS. Such a single-step approach could not be implemented for many of the outcomes in Phase Two, since the necessary survey-weighted regression cannot be combined with the multi-level model structure within PROC GLIMMIX.

However, for Phase Two outcomes which were ascertained on all subjects, multi-level models were developed in SAS (PROC GLIMMIX) and Stata (xtmelogit) to explore random effects both for intercepts (i.e. centre-level prevalences) and slopes (i.e. risk factor associations).

### Statistical methods used in ISAAC: Phase Three prevalence maps and time trend analyses

The approaches used for global comparisons of prevalence in Phase Three followed those adopted in Phase One. However, for analysis of time trends between Phase One and Phase Three a number of additional statistical issues arose:

- Whether to use absolute or relative change in prevalence: the former was chosen.
- Calculation of change per year to address the variable time period between studies.
- Use of mean prevalence (average of Phase One and Phase Three), rather than Phase One prevalence, to assess change in relation to prevalence. This followed the approach of Bland and Altman which avoids the problem of “regression to the mean” leading to a spurious correlation between initial level of a measurement and change over time.
- Adjustment for the cluster sample design by adjustment to the effective sample size of the prevalence estimates. Since most centres selected a sample of schools and then studied all children of the eligible age within those schools, there is a theoretical “design effect” due to the greater correlation of asthma and allergy prevalence within schools than between schools. This “design effect” was accounted for in analyses which involved significance tests by decreasing the sample size of each prevalence estimate by a factor derived for each outcome, centre, age-group and ISAAC phase, representing the effective sample size, relative to the actual sample size, adjusting for clustering at the school level. In most centres, the effect of this adjustment was small.
- Tolerance of minor differences in fieldwork procedures between Phase One and Phase Three. This is discussed in greater detail under “Quality Assurance”.
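The first four points above can be sketched with a few small helper functions. This is an illustrative Python sketch; the function names and the example figures are hypothetical.

```python
def annual_change(prev_phase_one, prev_phase_three, years_between):
    """Absolute change in prevalence (percentage points) per year."""
    if years_between <= 0:
        raise ValueError("years_between must be positive")
    return (prev_phase_three - prev_phase_one) / years_between


def mean_prevalence(prev_phase_one, prev_phase_three):
    """Average of the two phases, used in place of the Phase One value
    when relating change to prevalence level."""
    return (prev_phase_one + prev_phase_three) / 2


def effective_sample_size(n, design_effect):
    """Sample size after allowing for clustering at the school level."""
    return n / design_effect


# Hypothetical centre: prevalence rose from 10.0% to 13.5% over 7 years
change = annual_change(10.0, 13.5, 7)       # percentage points per year
level = mean_prevalence(10.0, 13.5)         # plotted against change
n_eff = effective_sample_size(3000, 1.2)    # used in significance tests
```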

### Statistical methods used in ISAAC: Phase Three risk factor analyses

Outcome definition and assessment of within-centre clustering followed the conventions set in the prevalence comparisons. For each outcome, centre and age-group, a single design-effect-adjustment variable was generated, representing the effective sample size for that age-group, centre and outcome. This set of design-effect adjustment factors was derived before merging in the risk factor (EQ) data, so it is a common set for all Phase Three risk factor analyses.

Centres with fewer than 500 children (except for centres representing a complete census of the population), and centres with more than 30% missing data for the risk factor and covariates of interest, were excluded from the analysis. Frequency tabulations of the outcome, risk factor of interest, and specified individual-level covariates were prepared for each centre and combined into a single dataset for each outcome and age group. The frequency counts were then adjusted downwards in proportion to the design-effect adjustment factors for the outcome in question, for each centre and age group.
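The exclusion rule and the downward adjustment of counts can be sketched as below. This is a hedged illustration: it assumes that the adjustment factor is the ratio of effective to actual sample size (a value of at most 1), and the table cells and numbers are hypothetical.

```python
def centre_eligible(n_children, missing_fraction, is_census=False):
    """Apply the inclusion rule for Phase Three risk factor analyses.

    A centre is kept if it has at least 500 children (or is a complete
    census of the population) and no more than 30% missing data for the
    risk factor and covariates of interest.
    """
    large_enough = n_children >= 500 or is_census
    return large_enough and missing_fraction <= 0.30


def adjust_counts(table, adjustment_factor):
    """Scale every cell of a frequency table by the adjustment factor.

    adjustment_factor is assumed here to be effective n / actual n.
    """
    if not 0 < adjustment_factor <= 1:
        raise ValueError("adjustment_factor is expected to lie in (0, 1]")
    return {cell: count * adjustment_factor for cell, count in table.items()}


# Hypothetical 2x2 table for one centre: (outcome, risk factor) -> count
table = {
    ("case", "exposed"): 100,
    ("case", "unexposed"): 200,
    ("non-case", "exposed"): 500,
    ("non-case", "unexposed"): 2200,
}
adjusted = adjust_counts(table, 0.8)
```

Scaling all cells by the same factor leaves the proportions and odds ratios unchanged but widens the confidence intervals, which is the intended effect of allowing for clustering.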

These design-effect-adjusted frequency tabulations provided the input for SAS DATA/PROC... (conversion procedure to individual-level data? – equivalent procedure in Stata is “expand”) and were analysed in PROC GLIMMIX specifying random intercepts at the centre level, but common slopes for the individual-level risk factors and covariates. Region, language and GNI per capita were included as standard centre-level covariates. Sex was always included as an individual-level covariate. Analyses were performed for all centres combined, for subgroups of centres defined by region, language and GNI, and for boys and girls separately. Additional individual-level covariates and interactions were included in the models, as appropriate for specific risk factor analyses.

### Statistical methods used in ISAAC: Centre-level differences adjusted for individual-level risk factors

Two approaches have been used for investigating between-centre differences in prevalence, adjusting for individual-level risk factors. The first approach is analogous to direct standardisation of routine statistics such as national mortality rates. The second applies multi-level modelling techniques to evaluate simultaneously the associations at the individual and the centre level.

Direct standardisation:

- Separate regression models are fitted for each study centre, to obtain centre-specific slopes for each explanatory (x-)variable. Since the main outcomes of interest are dichotomous, our outcome (y-)variable is logit(p) where p is the proportion of “cases” (affected individuals). Thus, the parameter estimates from these centre-specific models are in the form of log-odds-ratios and the linear predictions derived from them (“xb” in SAS/Stata terminology) are in the form of log-prevalence-odds: ln[p/(1-p)].
- For each centre, a prediction (xb) and its standard error (stdp) are derived at the level of each explanatory variable which corresponds to its mean in the global (all-centres) dataset. (This is analogous to directly standardising centre-specific death rates for each age-sex group by applying them to a global distribution of age and sex).
- The standardised (risk-factor-adjusted) prevalence log-odds for each centre, and their corresponding variances, can then be considered as units in a conventional meta-analysis, deriving measures of heterogeneity including Cochran’s Q and Higgins’ I². They can also be used as the outcome variable in ecological analyses of disease prevalence at the centre level.
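The final step can be sketched in Python as follows, taking the standardised log-odds and their variances as given. The centre values are hypothetical, and the I² formula used is the standard one based on Q and its degrees of freedom.

```python
import math


def inv_logit(xb):
    """Convert a log-prevalence-odds back to a prevalence."""
    return 1.0 / (1.0 + math.exp(-xb))


def heterogeneity(log_odds, variances):
    """Cochran's Q and I-squared for centre-level standardised log-odds.

    log_odds:  standardised prediction (xb) for each centre
    variances: squared standard error (stdp ** 2) for each centre
    Returns (inverse-variance-weighted mean, Q, I-squared).
    """
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, log_odds)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_odds))
    df = len(log_odds) - 1
    # I-squared: share of total variation attributable to heterogeneity
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i2


# Hypothetical standardised log-odds for three centres (stdp = 0.1 each)
xb = [-2.0, -1.6, -1.2]
pooled, q, i2 = heterogeneity(xb, [0.01, 0.01, 0.01])
standardised_prevalences = [inv_logit(v) for v in xb]
```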

Multi-level modelling:

- All centres are modelled in a single dataset with a categorical indicator variable for each centre and centre-level covariates (such as language, or GNI per capita) match-merged by centre.
- Multi-level modelling procedures such as PROC GLIMMIX in SAS, and xtmelogit in Stata, offer options for analysing either the centre-level intercepts, or the centre-specific risk factor associations (regression slopes), or both, as “random effects” (i.e. drawn from a hypothetical distribution of intercepts or slopes, with the usual assumption being that this distribution is Gaussian).
- The approach used in Phase Three risk factor analyses specified random intercepts and common slopes. This is equivalent to a fixed-effect (inverse-variance-weighted) pooling of the risk factor associations across study centres.
- The approach used in exploratory Phase Two analyses specifies random intercepts and random slopes.
- The two-step meta-analytical approach used in standard Phase Two publications is broadly equivalent to fixed centre-level intercepts and random slopes.
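The three specifications listed above can be written out as follows. This is a notational sketch rather than the fitted models: the symbols are introduced here for illustration, with i indexing children, j indexing centres, and x a single individual-level risk factor. Centre-level covariates and any covariance between random intercepts and slopes are omitted for brevity.

```latex
\begin{align*}
&\text{Random intercepts, common slope (Phase Three):}
  & \operatorname{logit}(p_{ij}) &= \alpha_j + \beta x_{ij},
  & \alpha_j &\sim N(\alpha, \sigma_\alpha^2) \\
&\text{Random intercepts, random slopes (Phase Two, exploratory):}
  & \operatorname{logit}(p_{ij}) &= \alpha_j + \beta_j x_{ij},
  & \alpha_j &\sim N(\alpha, \sigma_\alpha^2),\ \beta_j \sim N(\beta, \sigma_\beta^2) \\
&\text{Fixed intercepts, random slopes (Phase Two, two-step):}
  & \operatorname{logit}(p_{ij}) &= \alpha_j + \beta_j x_{ij},
  & \beta_j &\sim N(\beta, \tau^2)
\end{align*}
```

In the third form each intercept is estimated separately within its own centre model, and only the slopes are pooled across centres in the meta-analysis.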

### Statistical methods used in ISAAC: Ecological analyses at the centre level

A series of ISAAC papers were based on ecological data (data gleaned from external sources). These papers correlated the prevalence rates observed in ISAAC centres or countries with information available elsewhere. An example was the relationship of the prevalence levels to the per capita gross national product (GNP) for each of the countries. The GNP information came from the World Bank website. We assumed a linear relationship between the prevalence of the various symptom measures in each country and the GNP of that country. The data were modelled using a generalised linear mixed model that allowed each centre to be considered as if randomly selected from within its country (not a very good assumption in some cases). The model used a binomial error but assumed the identity link so there was a simple linear association between the outcome measure and the ecological variable. All ecological analyses (subsequent to the one in which GNP was the focus) included GNP in the model as a potential confounder.
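One way to write down the model described above is given below. The notation is introduced here for illustration: y is the number of positive responses among n children in centre j of country k, and the country-level random effect u is an assumed reading of "each centre considered as if randomly selected from within its country".

```latex
\begin{align*}
y_{jk} &\sim \operatorname{Binomial}(n_{jk},\, p_{jk}) \\
p_{jk} &= \beta_0 + \beta_1\,\mathrm{GNP}_k + u_k, \qquad u_k \sim N(0, \sigma_u^2)
\end{align*}
```

Because the link is the identity rather than the logit, the coefficient on GNP is read directly as the absolute change in prevalence per unit of GNP per capita.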

References

Rao JNK, Scott AJ. A simple method for the analysis of clustered binary data. Biometrics 1992; 48: 577-585.

Wolfinger R, O'Connell M. Generalized linear mixed models: a pseudo-likelihood approach. J Statist Comput Simul 1993; 48: 233-243.