Tuesday, February 14, 2006

Monitoring Complex Data

I. Relevance
A main challenge in biosurveillance is that pre-diagnostic data are much more complicated than traditional diagnostic data. They arrive more frequently (daily rather than weekly or monthly), are noisier, and contain a lot of irrelevant variability. Many other challenges arise on top of this, but the "ugly" data are a fundamental issue that must be accounted for.

II. Current State
Many current monitoring methods, which may have worked for traditional data, appear too simplistic for monitoring modern pre-diagnostic data. The major effort has therefore been to design more sophisticated algorithms. This approach is also popular in other fields where modern data are of a different nature than previously collected data. In a forthcoming paper (see references below), Stephen Fienberg and I survey advanced monitoring methods in other fields. The idea was to see what we can learn from the way other fields tackle the new data era. We found methods ranging from singular spectrum analysis and wavelets to exponential smoothing and data depth. These more sophisticated methods are designed for monitoring noisy, frequent, and often non-stationary data.

III. Alternative Approach
An alternative to designing more complicated algorithms is to simplify the data. This is actually standard statistical thinking, if we consider the transformations and other pre-processing techniques that are common in statistical analyses. However, unlike traditional pre-processing, practical biosurveillance systems require a much more automated way of simplifying data. We cannot afford to have a statistician tweak every data stream separately, because there is an abundance of data sources, data types, and data streams, and all of these are subject to frequent changes. So unless we want to develop a new population of "data tweakers", finding automated ways sounds more reasonable.
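To make "automated simplification" concrete, here is a minimal sketch of one possible pre-processing step: removing a day-of-week pattern from a daily count series and rescaling what is left. It is only an illustration, not the method our group uses. The function name remove_weekly_pattern, the 7-day period, and the choice of a simple per-weekday mean are all assumptions made for the example.

```python
from statistics import mean, pstdev


def remove_weekly_pattern(counts, period=7):
    """Subtract each weekday's average count, then rescale the residuals.

    Hypothetical helper for illustration: `counts` is a list of daily
    counts, and `period` is the assumed length of the repeating cycle.
    """
    # Average level for each position in the cycle (e.g. all Mondays).
    phase_means = [mean(counts[p::period]) for p in range(period)]

    # What remains after the weekday effect is taken out.
    residuals = [c - phase_means[i % period] for i, c in enumerate(counts)]

    # Rescale to roughly unit spread; fall back to 1.0 for a flat series.
    scale = pstdev(residuals) or 1.0
    return [r / scale for r in residuals]


if __name__ == "__main__":
    weekly = [120, 110, 105, 100, 95, 60, 55]  # made-up weekday profile
    series = weekly * 4
    series[10] += 50                            # inject an unusual day
    simplified = remove_weekly_pattern(series)
    print(max(range(len(simplified)), key=lambda i: simplified[i]))
```

The step needs no per-stream tuning beyond the period, which is the property that matters here: the same function can be applied to every incoming stream without a statistician in the loop.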

By "simplification" I mean bringing the data to a form that can then be fed into standard monitoring techniques. I am sure that the end users, mainly epidemiologists and public health experts, would prefer using CuSum and EWMA charts to learning complicated new monitoring methods or treating the system as a black box.
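For readers less familiar with these charts, the following is a minimal sketch of one-sided (upper) EWMA and CuSum charts applied to a series that has already been simplified to have mean 0 and standard deviation 1. The function names and the parameter values (lam=0.3, L=3.0, k=0.5, h=4.0) are illustrative assumptions, not recommendations for any particular data stream.

```python
import math


def ewma_alarms(z, lam=0.3, L=3.0):
    """Indices where the EWMA statistic exceeds its upper control limit.

    Assumes z is standardized (in-control mean 0, standard deviation 1).
    """
    alarms = []
    s = 0.0
    for t, x in enumerate(z, start=1):
        s = lam * x + (1 - lam) * s
        # Exact standard deviation of the EWMA statistic at time t.
        sigma = math.sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * t)))
        if s > L * sigma:
            alarms.append(t - 1)
    return alarms


def cusum_alarms(z, k=0.5, h=4.0):
    """Indices where the upper CuSum exceeds threshold h.

    k is the reference value (slack); the sum is reset after each alarm.
    """
    alarms = []
    c = 0.0
    for i, x in enumerate(z):
        c = max(0.0, c + x - k)
        if c > h:
            alarms.append(i)
            c = 0.0
    return alarms


if __name__ == "__main__":
    z = [0.0] * 20 + [3.0] * 5   # made-up series with a sustained shift
    print(ewma_alarms(z))
    print(cusum_alarms(z))
```

Both charts assume a well-behaved input, which is why the simplification step has to come first: fed raw pre-diagnostic counts, the same charts would mostly be reacting to weekday effects and other irrelevant variability.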

Our research group, together with Howard Burkom and Sean Murphy from Johns Hopkins APL, has been focusing on this goal. I plan to post a few papers on some results soon. I'd be interested in hearing about similar efforts.

IV. References

Shmueli, G. and Fienberg, S. E. (2006), Current and Potential Statistical Methods for Monitoring Multiple Data Streams for Bio-Surveillance, in Statistical Methods in Counter-Terrorism, Eds: A. Wilson and D. Olwell, Springer, to appear.