Thursday, November 10, 2005

Multiple Testing

The goal of this post is to generate reactions on how we should tackle the multiple testing that takes place in biosurveillance systems. The notes are based on personal opinions and questions, and on conversations with Dr. Howard Burkom from the Johns Hopkins Applied Physics Lab.

Multiplicity occurs at multiple levels within a biosurveillance system:

  1. Regional level -- when monitoring multiple regions. This is also true within a region, where we monitor multiple locations (e.g., hospitals, offices, stores).
  2. Source level -- within a region we monitor multiple data sources (OTC, ER, …).
  3. Series level -- within each data source we monitor multiple series. Sometimes multiple series are created from a single series by stratifying the data by age group or gender.
  4. Algorithm level -- within a single series, we may use multiple algorithms (e.g., for detecting changes in different parameters) or even a method such as wavelets that breaks a single series down into multiple series.

The multiplicity actually plays a slightly different role in each case, because we have different numbers and sets of hypotheses.

Howard Burkom et al. (2005) coin the terms “parallel monitoring” vs. “consensus monitoring” to distinguish between the case of multiple hypotheses being tested simultaneously by multiple independent data (“parallel”) and the case of monitoring multiple data sources for testing a single hypothesis (“consensus”). According to this distinction we have parallel testing at the regional level, but consensus monitoring at the source-, series-, and algorithm-level.

Are the two types of multiplicity conceptually different? Should the multiple results (e.g., p-values) be combined in the same way?

Regional level
-- Each region has a separate null hypothesis. For region i we have
H0: no outbreak in region i
H1: outbreak in region i
Therefore we have multiple sets of hypotheses.
If we consider isolated bioterrorist attacks, then these sets of hypotheses (and the tests) are independent.
If we expect a coordinated terrorist attack at multiple locations simultaneously, then there is positive dependence between the tests. The same is true for an epidemic, which spreads across neighboring regions.
Source level -- even if we limit ourselves to a certain geographic location and, say, a single zipcode, then we have multiple data sources. In this case we have a single conceptual null hypothesis for all sources:
H0: no outbreak in this region
H1: outbreak in this region

However, we should really treat the outbreak occurrence as a hidden event. We are using the syndromic data as a proxy for measuring the hidden Bernoulli variable, and in fact testing source-specific hypotheses:
H0: no (outbreak-related) increase in OTC sales
H1: (outbreak-related) increase in OTC sales
When we test this we ignore the (outbreak-related) part: we only search for increases in OTC sales or ER admissions, etc., and try to eliminate as many external factors as possible (promotions, day-of-week effects, etc.). To see the added level of uncertainty introduced by using proxy information, consider the following diagram:

When there is no outbreak we might still be getting false alarms that we would not have received had we been measuring a direct manifestation of the outbreak (such as laboratory results). For example, a new pharmacy opening up in a monitored chain would show an increase in medication sales, which might cause an alert. So we should expect a much higher false alarm rate.

On the other hand, in the presence of an outbreak we are likely to miss it if it does not get manifested in the data.

So the underlying assumptions when monitoring syndromic data are
(1) The probability of outbreak-related anomalies manifesting themselves in the data is high (removing red nodes from tree)
(2) The probability of an alarm due to non-outbreak reasons is minimal (removing blue nodes from tree)

Based on these two assumptions, most algorithms are designed to test:
H0: No change in parameter of syndromic data
H1: change in parameter of syndromic data
With respect to assumption (2), it has been noted that there are many cases where non-outbreak reasons lead to alarms. Thus, controlling tightly for those is important. Alternatively, the false alarm rate (and correct detection rate) should be adjusted to account for these additional factors. The same issues arise in series- and algorithm-level monitoring.
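As a concrete sketch of an algorithm testing for a change in a parameter of the syndromic data, here is a minimal one-sided CUSUM in Python. The counts, target mean, slack k, and threshold h are all invented for illustration, not taken from any deployed system:

```python
def cusum_upper(series, target_mean, k=0.5, h=5.0):
    """One-sided upper CUSUM: accumulate positive deviations from the
    target mean (minus a slack k) and alarm when the sum exceeds h."""
    s = 0.0
    alarms = []
    for t, x in enumerate(series):
        s = max(0.0, s + (x - target_mean - k))
        if s > h:
            alarms.append(t)
            s = 0.0  # restart monitoring after an alarm
    return alarms

# Hypothetical daily counts: a stable baseline followed by a sustained shift
counts = [10, 11, 9, 10, 10, 9, 11, 10, 14, 15, 14, 16, 15]
alarms = cusum_upper(counts, target_mean=10.0)
```

Note that each run of such an algorithm on each series is one more test, which is exactly where the multiplicity accumulates.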

Series level -- multiple series within a data source are usually collected for monitoring different syndromes. For instance: cough medication/cc, fever medication/cc, etc. This is also how the CDC thinks about multiple series, grouping ICD-9 codes into 11 categories by symptoms. If we treat each symptom group separately, then we have 11 tests going on.
H0: no increase in syndrome j
H1: increase in syndrome j

Ideally, we’d look at the specific combination of symptoms (i.e., a syndrome) that increases, to better understand which disease is spreading. Also, we believe that an outbreak will lead to an increase in multiple symptoms, so these hypotheses are really related. Again, the conditional part comes in: whether there is an outbreak, plus the additional uncertainty about if and how the syndromic data will show it.
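To illustrate how the 11 syndrome-level tests could be corrected, here is a sketch of the Benjamini-Hochberg FDR step-up procedure in Python; the p-values are hypothetical:

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return the indices of hypotheses rejected at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0  # largest rank k with p_(k) <= (k/m) * q
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * q:
            k_max = rank
    return sorted(order[:k_max])

# Hypothetical p-values for the 11 syndrome groups on a given day
pvals = [0.001, 0.20, 0.03, 0.45, 0.004, 0.60, 0.08, 0.70, 0.01, 0.90, 0.35]
rejected = benjamini_hochberg(pvals, q=0.05)
```

With these numbers, FDR at level 0.05 rejects the three smallest p-values, whereas a plain Bonferroni cutoff of 0.05/11 would reject only the two smallest; but note that BH assumes independence (or positive dependence), which is questionable for related syndromes.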

Algorithm level -- multiple algorithms running on the same series might be used because each algorithm is looking for a different type of signal. This gives a single H0 but multiple H1:
H0: no change in the series mean
H1: change of type k in the series mean
Or, if we have algorithms that monitor different parameters (a cusum for the mean and an F-statistic for the variance), then we also have a multiplicity of H0. Finally, algorithms such as wavelet-based multiscale (MS) SPC break the series down into multiple resolutions. Then, if we test at each resolution, we have a multiplicity. Whether the resolutions are correlated or not depends on the algorithm (e.g., whether it uses downsampling or not).

There are several methods for handling multiple testing, ranging from Bonferroni-type corrections to Benjamini & Hochberg’s False Discovery Rate (and its variants). There are also Bayesian methods aimed at tackling this problem. Each of these methods has its limitations: Bonferroni is considered over-conservative; FDR corrections depend on the number of hypotheses and are problematic with too few hypotheses; and Bayesian methods are sensitive to the choice of prior, and it is unclear how to choose one. Howard Burkom et al. (2005) consider these methods for correcting for "parallel monitoring" (multiple hypotheses with independent data streams). For "consensus monitoring" they consider a different set of methods for combining the multiple p-values, drawn from the world of clinical trials. These include Fisher’s statistic and Edgington’s method (whose sum-of-p-values statistic can be approximated by a normal distribution when the number of tests is large). Burkom et al. (2005) discuss the advantages and disadvantages of these two methods in the context of the ESSENCE surveillance system.
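A sketch of the two consensus-style combination rules mentioned above, in Python with hypothetical p-values; Fisher's statistic is referred to a chi-square with 2n degrees of freedom, and for Edgington's sum-of-p-values the normal approximation is used:

```python
import math

def fisher_statistic(pvalues):
    """Fisher's combined statistic, -2 * sum(log p_i);
    chi-square with 2n degrees of freedom under the joint null."""
    return -2.0 * sum(math.log(p) for p in pvalues)

def edgington_pvalue(pvalues):
    """Edgington's method sums the p-values; under the joint null the sum
    is approximately N(n/2, n/12) for a large number of tests n."""
    n = len(pvalues)
    z = (sum(pvalues) - n / 2.0) / math.sqrt(n / 12.0)
    return 0.5 * math.erfc(-z / math.sqrt(2.0))  # P(sum <= observed)

# Hypothetical p-values from four sources monitoring the same region
pvals = [0.01, 0.04, 0.10, 0.30]
fisher = fisher_statistic(pvals)    # about 22.7; chi2_8 at 0.05 is 15.5
combined = edgington_pvalue(pvals)  # about 0.004
```

Fisher's method is driven by the smallest p-values (the log blows up near 0), while Edgington's sum is more democratic across sources; this difference in emphasis is part of the trade-off Burkom et al. (2005) discuss.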

But should we really use different methods for accounting for the multiplicity? What is the link between the actual corrections and the conceptual differences?

  1. Rate the quality of data sources: signals from more reliable sources should be weighted more heavily.
  2. Evaluate the risk level of the different regions: alarms in higher-risk regions should be taken more seriously (like Vicky Bier’s arguments about investing in higher-risk cases).
  3. “The more the merrier” is probably not a good strategy when it comes to the number of data sources. It is better to invest in a few reliable data sources than in multiple less-reliable ones. Along the same lines, the series chosen for monitoring should also be carefully screened according to their real contribution and their reliability. With respect to regions, it is better to monitor riskier regions (in the context of bioterrorist attacks or epidemics).
  4. Solutions should depend on who the monitoring body is: national surveillance systems (e.g., BioSense by the CDC) face the regional issue more than local systems do.
  5. The choice of symptom grouping and syndrome definitions, which is currently based on medical considerations, would benefit from incorporating statistical considerations as well.


Burkom, H. S., Murphy, S., Coberly, J., and Hurt-Mullen, K. “Public Health Monitoring Tools for Multiple Data Streams”, MMWR, Aug 26, 2005, 54(Suppl), pp. 55-62.

Marshall, C., Best, N., Bottle, A., and Aylin, P. “Statistical Issues in the Prospective Monitoring of Health Outcomes Across Multiple Sources”, JRSS A, 2004, vol. 167 (3), pp. 541-559.


Blogger Henry Rolka said...

Reaction to Thoughts on Multiple Testing: A Thought on the ‘Hypothesis Testing’ Model in Discussing Health Threat Surveillance Issues

There is a fundamental issue that makes the multiple comparisons problem even more difficult to interpret as presented, in the hypothesis testing context. This is an open question for consideration in advancing probabilistic methods for surveillance data analysis in general, and it relates to the Type I and Type II error concepts. If we consider the null condition to be the assumption of ‘no event of importance in progress’, and the alternative to be supported when there is sufficient data to conclude that escalation toward a countermeasure response is needed, then the Type I error is falsely concluding that a response is needed when in fact it is not necessary. This seems like the less important ‘mistake’: if something were occurring that warranted a reaction and we did not respond, lives would be lost and precious time would have passed in stopping an event of threat importance. Thus, our general approach of controlling the Type I error using ‘alpha’ for threshold setting is questionable in this setting.

On the other hand, being overly conservative at the expense of allowing too many false alerts may fatigue readiness, resulting in an inability to respond when truly needed. The goal is to strike an informed balance between maintaining sensitivity and tolerating false alerts. Currently implemented surveillance systems in public health are based on inferential concepts that use p-values for thresholds, under the null assumption that the situation is as expected given the temporal and/or geographical context. Given the situational consequences of failing to alert to true events and of too frequently alerting to unimportant events, more refined bases for conclusions may be established as standard operating procedures using decision-theoretic approaches and specifying risk and utility functions.

Monday, November 14, 2005  
Blogger Galit Shmueli said...

This is by Yajun Mei (via an email to me):

The multiple testing problems you mentioned also occur in the signal processing (or, more broadly, engineering) literature. "Parallel monitoring" seems to be closely related to "multi-channel signal detection", where signals appear in one of M noisy channels. "Consensus monitoring" seems to be related to "decentralized or distributed detection" in "sensor networks", or "information fusion". One of my papers dealing with the latter case was published in IEEE Transactions on Information Theory, vol. 51, issue 7, July 2005, pp. 2669-2681. Professor H. V. Poor at Princeton, Professor Venugopal V. Veeravalli at UIUC, and Professor Alexander Tartakovsky at the University of Southern California are several of the leading researchers in this field.

Wednesday, November 30, 2005  
Anonymous deepak agarwal said...

A mathematical framework which I have found useful to think about the multiple testing problem is as follows (all symbols are in latex).

The main idea is that we want to adjust for multiple comparisons and also for things that are global and affect a set of time series. This will provide a more concise description of anomalies to the subject matter expert.

Let e_{st} denote the standardized residuals (these could be normal scores of p-values) for the s^{th} series at time t. Consider the model

e_{st} = \Delta_{st} + \mu_{st}(\theta_{t}) + \epsilon_{st}

where \epsilon_{st} ~ N(0,1), and \mu_{st} is an adjustment factor derived from some model which depends on a parameter \theta_{t} to be estimated from the e_{st}s.
We assume \Delta_{st} ~ P_{t}1_{0} + (1-P_{t})N(0,\tau) (a point mass at 0 with probability P_{t}, and a normal otherwise) and declare anomalies by looking at the posterior of \Delta_{st}.

How to get \mu_{st}?
First note that if \mu_{st}=0, this is exactly the multiple testing model that Susie talked about in her presentation. The main goal is to define \mu_{st} so that one can adjust for things that happened at time t (but are not incorporated in the baseline) and that are already known to the analyst. It might be possible to formalize this concept using a minimum description length argument, but I've not done that yet.
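A sketch of how the anomaly posterior might be computed for a single residual under this mixture model, assuming \mu_{st} is already known and reading \tau as the variance of the non-null component; the values of P_{t} and \tau are invented:

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def posterior_anomaly_prob(e, mu, p0=0.95, tau=9.0):
    """Posterior probability that Delta_st != 0, given the adjusted
    residual r = e - mu, under the mixture prior
    Delta ~ p0 * (point mass at 0) + (1 - p0) * N(0, tau), eps ~ N(0, 1)."""
    r = e - mu
    null = p0 * normal_pdf(r, 1.0)               # Delta = 0:  r ~ N(0, 1)
    alt = (1.0 - p0) * normal_pdf(r, 1.0 + tau)  # Delta != 0: r ~ N(0, 1 + tau)
    return alt / (null + alt)
```

A residual near 0 gets a small posterior probability of anomaly, while a residual of around 4 standard units gets a large one; thresholding these posteriors is what replaces the per-series p-value cutoffs.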

Some examples:
Example 1:

If most of the series experience a change in the same direction due to some common effect at time t (let's say sales of most items at a Walmart store drop due to bad weather), then \mu_{st}=\beta_{t} is a good model, which will replace a whole bunch of drop alerts with a single alert saying there was a global drop (indicated by a \beta_{t} estimate significantly different from 0).

Example 2: Let's say we are monitoring the number of emergency room visits by syndromic group and zip code (i.e., a 2-d contingency model). Then, replacing s by ij (i^{th} syndrome, j^{th} zip code), one can assume
\mu_{ijt} = r_{it} + c_{jt}.
This will help produce anomalies on the interactions and will adjust for factors that affect entire rows or columns.
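A sketch of this row-plus-column adjustment, estimating r_{it} and c_{jt} by simple row and column means at a fixed time t and returning the interaction residuals (the matrix below is made up for illustration):

```python
def row_col_adjust(e):
    """Fit mu_ij = grand + row_i + col_j by simple means and return the
    interaction residuals e_ij - mu_ij (anomalies not explained by a
    whole-row or whole-column effect)."""
    n_rows, n_cols = len(e), len(e[0])
    grand = sum(sum(row) for row in e) / (n_rows * n_cols)
    row_eff = [sum(row) / n_cols - grand for row in e]
    col_eff = [sum(e[i][j] for i in range(n_rows)) / n_rows - grand
               for j in range(n_cols)]
    return [[e[i][j] - grand - row_eff[i] - col_eff[j] for j in range(n_cols)]
            for i in range(n_rows)]

# Hypothetical syndrome-by-zip residual matrix: row 0 is uniformly elevated
# (a row effect), while cell (2, 2) is a genuine local anomaly
resid = row_col_adjust([[1, 1, 1],
                        [0, 0, 0],
                        [0, 0, 5]])
```

After the adjustment, the uniform elevation of row 0 is absorbed into the row effect, and the largest remaining interaction residual sits at cell (2, 2), which is the one worth flagging.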

Example 3:
Let's say the analyst knows that items in the electronics department are on sale at a Walmart store at time t. In such a case, it is desirable to produce anomalies after adjusting for this fact. If we define a binary indicator x_{st} which is 1 for all items s in the electronics department and 0 otherwise, an appropriate model is
\mu_{st} = x_{st}\beta_{t}.

Friday, December 02, 2005  
