Limits on the discrimination possible with discrete valued data, with application to medical risk prediction
D. R. Lovell, C. R. Dance, M. Niranjan, R. W. Prager, and K. J. Dalton
We describe an upper bound on the accuracy (in the ROC sense)
attainable in two-alternative forced-choice risk prediction for a
specific set of data represented by discrete features. By accuracy, we
mean the probability that a risk prediction system will correctly rank a
randomly chosen high-risk case and a randomly chosen low-risk case.
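
As an illustration of this definition of accuracy, the following minimal sketch estimates the pairwise ranking probability from two lists of risk scores. The function name `pairwise_accuracy` and the convention that a tied pair counts as half correct (a random guess in the forced choice) are assumptions made for this example and are not taken from the paper.

```python
from itertools import product


def pairwise_accuracy(high_scores, low_scores):
    """Fraction of (high-risk, low-risk) pairs that the scores rank correctly.

    Assumption for this sketch: a tied pair counts as half correct,
    as if the forced choice were settled by a fair coin.
    """
    if not high_scores or not low_scores:
        raise ValueError("need at least one score in each group")
    total = 0.0
    for h, l in product(high_scores, low_scores):
        if h > l:
            total += 1.0   # high-risk case scored above low-risk case
        elif h == l:
            total += 0.5   # tie: credited as a coin flip
    return total / (len(high_scores) * len(low_scores))
```

For example, `pairwise_accuracy([0.9, 0.8], [0.1, 0.8])` considers four pairs, three ranked correctly and one tied, and so returns 0.875.
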
We also present methods for estimating the maximum accuracy we can
expect to attain using a given set of discrete features to represent
data sampled from a given population.
These techniques allow an experimenter to calculate the maximum
performance that could be achieved without applying any specific
risk prediction method. Furthermore, these
techniques can be used to rank discrete features in order of their
effect on maximum attainable accuracy.
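
One way such a bound can be computed for a fixed data set is sketched below. It is a hedged illustration, not the authors' stated procedure: it assumes that cases sharing the same discrete feature values cannot be told apart (so they tie, and a tie counts as half correct), and that the best attainable ranking orders each feature combination by its fraction of high-risk cases. The function name `max_accuracy` and the record layout are hypothetical.

```python
from collections import defaultdict


def max_accuracy(records, feature_idx):
    """Upper bound on pairwise ranking accuracy using only the chosen features.

    records: list of (features, is_high_risk) pairs, where features is a
             tuple of discrete values (hypothetical layout for this sketch).
    feature_idx: indices of the features used to represent each case.
    """
    # Count high- and low-risk cases in each distinct feature combination.
    counts = defaultdict(lambda: [0, 0])
    for features, is_high in records:
        cell = tuple(features[i] for i in feature_idx)
        counts[cell][0 if is_high else 1] += 1

    n_high = sum(h for h, _ in counts.values())
    n_low = sum(l for _, l in counts.values())
    if n_high == 0 or n_low == 0:
        raise ValueError("need at least one high-risk and one low-risk case")

    # Order cells by their fraction of high-risk cases, lowest first.
    ordered = sorted(counts.values(), key=lambda c: c[0] / (c[0] + c[1]))

    correct = 0.0
    lows_below = 0
    for h, l in ordered:
        # Each high-risk case in this cell beats every low-risk case in a
        # lower-ranked cell and ties (half credit) with those in its own cell.
        correct += h * (lows_below + 0.5 * l)
        lows_below += l
    return correct / (n_high * n_low)
```

Calling the function with different choices of `feature_idx`, for instance leaving out one feature at a time, gives one possible way to compare features by their effect on the bound. Note that this computes the bound for the sample at hand; estimating the bound for the underlying population is a separate problem that this sketch does not address.
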