Evaluation of studies of assessment and screening tools, and diagnostic tests
- Andrew Jull, RN, MA
In nursing, diagnostic testing is commonly a regulated activity performed by nurse practitioners and midwives. However, assessment and screening of patients for further testing (case finding) are central elements of nursing. The number of studies evaluating assessments and tests is increasing, but overall, the methodological quality of these studies has been poor.1 Thus, nurses should be able to critically appraise evidence from such studies to ensure that the highest quality assessment and screening tools are used. The tools of assessment, case finding (or screening), and diagnosis are evaluated using different criteria from those applied to studies investigating preventive or therapeutic interventions, although the 3 basic questions of critical appraisal are the same: are the results valid? What are the results? Will the results help me in caring for my patients? In this article, we outline a framework by Sackett et al2 to critique studies that evaluate a screening tool to assess patients for depression. The same framework can also be applied to studies of assessment tools such as fall risk assessments or pressure sore risk scoring, as well as studies of diagnostic tests.
You are a district nurse attending a 68 year old man with a diabetic ulcer in his home. He feels that his ulcer is taking forever to heal and that he will never be well again. You know from previous conversations that his wife died several years ago and his 2 children live outside of the area or overseas. When he could drive, he was socially active, but since becoming reliant on others to assist him, he doesn't get out much now. He reports he is eating okay, and that his glucose concentrations are kept within normal range with his hypoglycaemic medications. He also says he feels tired all the time. You have noticed he is not taking as much care with his appearance recently and that he seems much less interested in world events than when you first started dressing his ulcer. Although you are aware that malaise is a side effect of some hypoglycaemics, his drug regimen has not been changed recently. On further questioning, it seems unlikely your patient is anaemic, so you begin to consider whether he might be depressed. You wonder if there are simple assessments that could help you to screen patients for appropriate referral.
Unlike questions about preventive or therapeutic interventions, which are best answered by randomised controlled trials or systematic reviews of such trials, questions about the effectiveness of screening and assessment tools in clinical practice are best answered by cross sectional studies. Studies of diagnostic tests and screening tools can be quite difficult to locate. The keyword phrase sensitivity and specificity is useful, but studies cannot always be located with that phrase. If that is the case, then the subheading /di (for diagnosis) can be used. Specialist databases such as the Cochrane Library do not include studies of diagnostic testing, so you do the following search on Medline (OVID):
(1 OR 2) AND 3
303 articles are identified. Adding the text words primary care, and limiting the search to English language papers with abstracts published in the past 5 years produces a more manageable 24 citations, one of which is a study of screening tools, including a simple 2 question instrument (“During the past month have you often been bothered by feeling down, depressed, or hopeless?” and “During the past month have you often been bothered by little interest or pleasure in doing things?”).3 Whooley et al tested 7 screening tools on 542 consecutive patients attending an urgent care clinic. 97% of participants were men, the majority of whom were unemployed. Prevalence of depression was 18%. All screening tools returned similar results, but the investigators recommended the 2 question instrument for use in primary care because of its simplicity. Before accepting the authors' conclusions, readers need to assure themselves that the study is valid. This requires answering 4 questions.
Are the results of the study valid?
WAS THERE AN INDEPENDENT BLIND COMPARISON WITH A REFERENCE (GOLD) STANDARD OF DIAGNOSIS?
There are 2 aspects to this question. Firstly, the accuracy of any tool is best determined by comparing its results with those obtained from a widely accepted reference test. This is also referred to as a gold standard test and is often more invasive than the initial test. Thus palpation of a child's forehead for fever could be compared with a reading from a mercury thermometer to obtain a true estimation as to whether a child has a major fever. Similarly, the ankle-brachial pressure index (ABPI) could be compared with the gold standard of venography if testing ABPI as a screen for arterial disease in the leg. Readers need to be assured that the reference test is the gold standard test for the condition. Comparing palpation with a measure less accurate than a mercury thermometer (such as tympanic thermometry) would provide an inaccurate estimation of how many patients actually had a fever.4 Even if the study used mercury thermometry or a device with similar accuracy, the reader must still be reassured that an acceptable technique of thermometry was applied. Thus, even if mercury thermometry was used, an axillary route is likely to provide an unreliable estimate of temperature. If >1 assessor was used, the study should also provide an estimate of the level of agreement between assessors.
The second aspect of the question guards against expectation bias. Although in most clinical situations, healthcare workers have access to patient records, it is important, when evaluating an instrument, that clinicians form their own determinations of the patient's condition. Prior knowledge of the presence or absence of a disorder could influence a clinician's assessment. Therefore, it is imperative that clinicians making an assessment using the gold standard test are separate from those using the other instrument, and that the 2 groups of clinicians are blinded to each other's assessments. A methodological study of evaluations of diagnostic tests has found that unblinded assessments overestimate correct diagnoses by as much as 30% compared with blinded studies.5
WAS THE DIAGNOSTIC TEST EVALUATED IN AN APPROPRIATE SPECTRUM OF PATIENTS (LIKE THOSE WE WOULD MEET IN CLINICAL PRACTICE)?
The main challenge when evaluating a case finding or screening instrument is to apply it to the indicated population.6 Tests are often developed by the quick and dirty method of using an accessible population of patients known to have the target disorder and a group of healthy controls. If an instrument does not discriminate between those with and without the disorder at this stage of development, then it is unlikely to be clinically useful. But the value of an assessment lies in its ability to distinguish the full spectrum of presenting patients with the disorder (as well as those who present with similar symptoms arising from different disorders) from those who do not have the condition. Diagnostically, it is easier to identify patients with florid presentation from those without the condition than it is to identify those with a mild presentation. Only if the instrument can differentiate those likely to have the disorder from those who do not in a real clinical population can it then be deemed useful. Evaluations of new tests often omit the essential developmental stage of evaluation in a real clinical population. For example, one use of ABPI is assessing patients with leg ulcers to screen for those with peripheral arterial disease, which would rule out treatment with high compression bandaging. In the late 1960s the normal values of an ABPI were established by testing 110 patients with known occlusive peripheral arterial disease and comparing their test values with those of 25 healthy controls.7 It is only recently that the utility of the ABPI has been tested in community populations similar to those in which it is commonly used by district nurses.8 It is generally accepted that it is good practice for studies to enrol consecutive patients who have agreed to participate (minimising the potential for selection bias), although non-consecutive enrolment has not been found to have any significant effect on study results.5
WAS THE REFERENCE STANDARD APPLIED REGARDLESS OF THE TEST RESULTS?
To avoid verification or workup bias, participants need to receive both tests regardless of the outcome of the first test. If the first test is negative and the participant does not receive the gold standard test to verify this result, then the study results will be distorted. In some instances, participants who have had a negative test may decide not to have the gold standard test, especially if the gold standard test is an invasive procedure such as venography. Rather than exclude these participants, investigators can follow them up over an appropriate time period and monitor for symptoms of the target disorder.
WAS THE TEST VALIDATED IN A SECOND, INDEPENDENT GROUP OF PATIENTS?
For a reader to be reassured that the study findings are accurate and not the result of idiosyncrasies in the initial cohort of participants or the individual skills of the assessors, the tool should be evaluated in a second independent group of patients.2 If the findings are replicated, healthcare providers can have more confidence in the accuracy of the test results. For example, the combined use of gram staining and acridine-orange leucocyte cytospin testing to rapidly diagnose catheter related bloodstream infections without removing the central venous device has been favourably reported, but the study did not evaluate the test on a second group of patients;9 hence, the call for confirmation studies before the test is widely accepted.10
Answering the original question
The study by Whooley et al probably met 3 of the 4 validity criteria. Firstly, the investigators compared the case finding instruments with an acceptable reference standard, a computerised version of the Diagnostic Interview Schedule (DIS). This is a 20 minute interview with a sensitivity of 80% and a specificity of 84% when compared with DSM III criteria for depression. Three trained interviewers who administered the reference test were blinded to the results of the screening tools. There was a high level of inter-rater agreement between the 3 interviewers with respect to the results of the reference test (κ=0.88). Secondly, the study sample was a real clinical population in a primary care setting, with patients representing the full spectrum of depressive histories: recent episodes of depression, lifetime history of depression, and no history of depression. Cautious consideration needs to be given to some of the sample's features (such as the high prevalence of depression, the ratio of men to women, and the high number of unemployed people), but these can be addressed when considering the applicability of the study to your own patients. Thirdly, all 7 screening instruments and the reference test were administered to 542 consecutive participants attending an urgent care clinic at a Veterans Administration medical centre, although the results from 7 participants were excluded from analysis because of missing data. The results of the case finding instruments did not appear to influence whether the DIS was done. However, the study does not meet the fourth criterion for validity, as the study findings were not evaluated in a second group. No other evaluation of the instrument seems to have been done, although one is currently under way in general practice populations in New Zealand (personal communication, B Arroll).
What are the results?
When patients present to healthcare providers, they have a probability of having particular disorders. This probability is the baseline prevalence of each disorder in the community. But each patient is different. Think about 2 patients, both presenting with a small ulcer involving the medial malleolus, with ankle flare, presence of haemosiderin pigmentation, and a history of varicose veins. One is a 71 year old woman who is otherwise healthy and the second is a 55 year old man with type 2 diabetes. Although venous aetiology accounts for up to 70% of all leg ulcers,11 an experienced clinician knows that the baseline or pretest probability of the ulcer being venous for these 2 patients is different. For the first patient, who has an uncomplicated presentation, the pretest probability of having a venous ulcer is likely to be between 50% and 70%. Following simple assessments of her blood supply to rule out other causes, the experienced clinician is likely to recommend that the patient start compression treatment. However, the pretest probability for venous ulceration is likely to be considerably lower in the second patient. Venous disease only causes 6% to 9% of leg ulcers in patients with diabetes.12 Simple assessments to rule out other causes of the ulcer may not convince the clinician that the ulcer is venous. Treatment for venous ulceration involves applying high compression bandaging to the patient's affected limb, but the bandaging can create an ischaemic leg if the patient has arterial insufficiency. The clinical hazard of misdiagnosis and ischaemia has increased the threshold for beginning treatment and the low pretest probability means that the clinician may prefer to refer the patient for further testing before being reassured that compression is safe.
The above example illustrates that no matter what the outcome of an assessment or test is, it cannot tell a clinician whether or not the patient has the disorder. It can only reveal the probability of having or not having the disorder.13 The ability to discriminate between people likely to have a disorder and those less likely to have a disorder is determined by a test's likelihood ratio (LR). With respect to screening for depression, Whooley et al found several instruments with similar results, but the simplest instrument was the 2 question instrument. The reference test indicated that 97 patients had depression. The 2 question case finding instrument correctly identified 94 of these 97 patients (94/97 or 0.97) as likely to be depressed. However, the instrument also incorrectly classified 189 patients as likely to be depressed from the 439 patients (189/439 or 0.43) whom the reference test ruled out as not depressed (table 1). The ratio between these 2 likelihoods is the LR. When considering LRs, it is the percentages or proportions of patients that the test correctly and incorrectly identifies as having the disorder that are considered, not the actual numbers of patients. Thus, the ratio of true positive results (ie, those that the instrument correctly identifies as being depressed) to false positive results (those that the instrument incorrectly identifies as being depressed) is 0.97/0.43, or 2.25. This is the likelihood ratio for a positive test result (+LR) being correct. From the +LR 2.25, we can infer that a positive result from the 2 question instrument is only about 2 times more likely to be a true positive than a false positive result. If this instrument were used to diagnose depression, clinicians would be wrong quite often. Clearly, the 2 question case finding instrument is not very effective at diagnosing whether a patient is depressed.
Just as the +LR can be calculated, the likelihood of the instrument being wrong when it returns a negative result can also be calculated. The 2 question instrument missed 3 of the 97 depressed patients (3/97 or 0.03), but correctly identified 250 patients as unlikely to be depressed out of the 439 patients (250/439 or 0.57) in whom depression was absent. The ratio of false negative results (ie, those that the instrument incorrectly identifies as not being depressed) to true negative results (those that the instrument correctly identifies as not being depressed) is 0.03/0.57, or 0.05. This is the likelihood ratio for a negative test result (–LR) being wrong. From the –LR 0.05, we can infer that very few patients are likely to be depressed when the case finding instrument returns a negative result.
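The arithmetic behind both likelihood ratios can be checked with a short script. The following is a minimal sketch in Python, using the four counts quoted above for the 2 question instrument; the function and variable names are illustrative only and do not come from the study.

```python
# Counts for the 2 question instrument versus the reference test,
# as quoted in the text (97 depressed, 439 not depressed).
true_pos = 94    # depressed, instrument positive
false_neg = 3    # depressed, instrument negative
false_pos = 189  # not depressed, instrument positive
true_neg = 250   # not depressed, instrument negative


def likelihood_ratios(tp, fn, fp, tn):
    """Return (+LR, -LR) from the four cells of a 2x2 table."""
    with_disorder = tp + fn
    without_disorder = fp + tn
    tp_rate = tp / with_disorder      # proportion correctly called positive
    fn_rate = fn / with_disorder      # proportion wrongly called negative
    fp_rate = fp / without_disorder   # proportion wrongly called positive
    tn_rate = tn / without_disorder   # proportion correctly called negative
    return tp_rate / fp_rate, fn_rate / tn_rate


pos_lr, neg_lr = likelihood_ratios(true_pos, false_neg, false_pos, true_neg)
print(f"+LR = {pos_lr:.2f}")
print(f"-LR = {neg_lr:.2f}")
```

Working from the unrounded proportions reproduces the values given in the text, a +LR of about 2.25 and a –LR of about 0.05.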
The usefulness of LRs is revealed when we look at their ability to shift a patient from a pretest probability to a post-test probability, and in doing so, help reduce the clinical uncertainty associated with case finding, screening, or diagnosis. A rough guide to the magnitude of LRs and their effect on post-test probability is shown in table 2.
The challenge in working out the changes in probability of a patient having a disorder after a test is eased by a simple nomogram (figure).14 By running a straight line through the pretest probability (left hand column) and the LR (centre column), the post-test probability can be determined from the point at which the line intersects the right hand column. A pretest probability could simply be the prevalence of depression in the community, which has been estimated to be 5% of the adult population in Great Britain.15 If the patient in our scenario answers yes to both questions, we can extend a line from a pretest probability of 5% through approximately 2 (+LR 2.25) to obtain a post-test probability of a little more than 10% that our patient actually has depression. However, if our patient answers no to both questions, we can extend a line from 5% through 0.05 (–LR) to obtain a post-test probability of approximately 0.3% of being wrong if we accept our patient is not depressed. Thus, we can be confident that if a patient answers no to the 2 questions, he or she is very unlikely to be depressed. On the other hand, a post-test probability of approximately 10% might not be high enough even to consider referral for further testing unless there is no other likely explanation for the patient's symptoms. However, the tool may be useful at determining whether further testing is desirable in settings where the pretest probability is higher.
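The nomogram is a graphical shortcut for a calculation on odds: the pretest probability is converted to odds, multiplied by the LR, and converted back to a probability. A minimal sketch of that calculation in Python follows, using the 5% pretest probability and the LRs from the text; the function name is an illustrative assumption.

```python
def post_test_probability(pretest_probability, likelihood_ratio):
    """Convert a pretest probability to a post-test probability via odds."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    post_test_odds = pretest_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


pretest = 0.05  # community prevalence of depression quoted in the text

after_positive = post_test_probability(pretest, 2.25)  # yes to both questions
after_negative = post_test_probability(pretest, 0.05)  # no to both questions

print(f"Post-test probability after a positive result: {after_positive:.1%}")
print(f"Post-test probability after a negative result: {after_negative:.2%}")
```

A positive result gives a little more than 10%, as read from the nomogram, and a negative result leaves only a fraction of 1%. A calculation like this is useful when the nomogram is not to hand or when the line is hard to read accurately at very low probabilities.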
Whooley et al provided the LRs for each of the case finding instruments. Older studies often do not report LRs, but instead report the sensitivity and specificity of the tests. The sensitivity of a test is the proportion of patients with the target disorder who have a positive test result, whereas the specificity is the proportion of patients without the target disorder who have a negative test result. LRs are easily obtained if the sensitivity and specificity of a test are known. The sensitivity and specificity of the 2 question case finding instrument are 0.97 and 0.57, respectively. A +LR is obtained by the following formula:
+LR = sensitivity / (1 - specificity)
Similarly, –LR can be obtained using a slightly different formula:
–LR = (1 - sensitivity) / specificity
Sometimes sensitivity and specificity are presented as percentages (ie, 97% and 57%). The same formulas can be used, substituting 100 for 1 when subtracting. For further explanation of how sensitivity and specificity are calculated, see Sackett et al2 or any text on clinical epidemiology.
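For readers who prefer to check the conversion by computer, the sketch below derives both LRs from a reported sensitivity and specificity. It assumes the standard relations (+LR is sensitivity divided by 1 minus specificity, and –LR is 1 minus sensitivity divided by specificity); the function name and the `scale` parameter are illustrative assumptions, with `scale=100` standing in for "substituting 100 for 1" when the values are percentages.

```python
def lrs_from_accuracy(sensitivity, specificity, scale=1):
    """Return (+LR, -LR); pass scale=100 when the inputs are percentages."""
    if not (0 < sensitivity < scale and 0 < specificity < scale):
        raise ValueError("sensitivity and specificity must lie strictly "
                         "between 0 and scale")
    positive_lr = sensitivity / (scale - specificity)
    negative_lr = (scale - sensitivity) / specificity
    return positive_lr, negative_lr


# The 2 question instrument, first as proportions and then as percentages.
as_proportions = lrs_from_accuracy(0.97, 0.57)
as_percentages = lrs_from_accuracy(97, 57, scale=100)

print(f"+LR = {as_proportions[0]:.2f}, -LR = {as_proportions[1]:.2f}")
print(f"+LR = {as_percentages[0]:.2f}, -LR = {as_percentages[1]:.2f}")
```

Both calls give the same answer. Because the sensitivity and specificity are already rounded to 2 decimal places, the +LR comes out at about 2.26 rather than the 2.25 obtained from the raw counts; differences of this size are not clinically important.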
Can I apply this test to my patient?
We have determined that the study by Whooley et al is probably valid and decided that the results indicate that the instrument (1) may be useful for identifying patients who may benefit from referral for further testing when the patient responds positively to the questions, and (2) is useful for ruling out depression as a possibility when the patient responds negatively to the questions. The next step is to determine whether it can be used with your presenting patient or group of patients. Answering 3 questions will assist this decision.
IS THE TEST AVAILABLE, AFFORDABLE, ACCURATE, AND PRECISE IN YOUR SETTING?
Obviously if a test is not available, or the costs are similar to equally accurate and usable alternatives, then it is unlikely to be used. Similarly, we need to be assured that a test will maintain its accuracy in the clinical setting in which we work. LRs can be stable, but they are derived from selections of patients, and thus may not be as accurate for patients who are selected in different ways. In an earlier question about validity, we needed to be assured that the instrument was tested in patients with mild, moderate, and severe conditions as well as those without the disorder. Now we need to be assured of the similarity of the study population to that in our own setting. It is uncommon to find a report that exactly describes a population of patients like our own, so we need to examine the demography of the study participants to decide whether they are so dissimilar from our own as to rule out using the study. Another concern about the accuracy of a test is that many instruments are reported as having only one +LR and one –LR, although a test can behave differently depending on the severity of the disorder. Higher LRs are found with florid conditions and lower ones with earlier presentations of the disorder. Some tests make this distinction by reporting LRs for different presentations of the disorder, but this would be unusual for screening or case finding tools.
CAN WE GENERATE CLINICALLY SENSIBLE ESTIMATES OF PATIENTS' PRETEST PROBABILITIES?
Pretest probability is the probability that a presenting patient has a particular disorder. Sackett et al identify 5 different sources for estimating pretest probability: clinical experience, prevalence statistics, practice databases, studies specifically focused on determining pretest probabilities, and the original study itself.2 Clinical experience will generate what is essentially a “guesstimate”, and several false heuristics may influence such an estimate. However, in the absence of other sources, this method can still be useful. Prevalence statistics can be drawn from regional or national morbidity data, or from studies investigating the prevalence of a disorder, but these estimates are only as good as the sources of the data or the settings of the prevalence studies. Databases that rely on voluntary reporting can have inaccurate data. If the prevalence study is set in an acute care setting, the results can be misleading if applied to primary care settings. Practice databases, whether local, regional, or national, are also only as good as their data sources. Studies investigating pretest probabilities are few in number and difficult to retrieve from databases. Finally, the prevalence of the disorder in the study being critically appraised can be used.
WILL THE RESULTING POST-TEST PROBABILITIES AFFECT PATIENT MANAGEMENT?
The major concern here is whether the results will move a patient across a threshold that would stop further testing for the suspected disorder. This would occur when a disorder has been ruled out, when a referral for further testing or treatment is made, or when treatment is initiated. For example, if the pretest probability for depression is 5%, and the patient response to the 2 question case finding instrument is negative, the post-test probability would be so low that depression could be abandoned as an explanation for the patient's symptoms. However, if the patient response was positive (and remembering that the post-test probability was slightly >10%), it would still be too low to move the patient over a treatment threshold, or perhaps even to referral for further testing. On the other hand, if the pretest probability was higher, perhaps because of a higher prevalence of depression in people with diabetes, then referral for further investigation might be warranted.
Resolution of the scenario
Although you accept that the study by Whooley et al has reasonably strong validity, you have reservations about using the 2 question case finding instrument with all of your patients. The study sample was primarily men (97%), 71% were unemployed, and the setting was an urgent care medical clinic where patients had a high prevalence of depression (18%) rather than a community population where the prevalence is probably much lower. Thus, you decide to continue looking for a case finding instrument that has been evaluated in different populations and may be more generally applicable to your patients. Until you find such a tool, the 2 question instrument could be useful with this particular patient. You have a feeling that depression is more frequent in people with diabetes than in the general population, and a quick search of the literature reveals a systematic review of the prevalence of depression in people with diabetes that confirms this view.16 The mean rate of current depression (as opposed to lifetime history of depression) in controlled studies is reported as 14%, almost 3 times that of the general population. You decide to use this as your pretest probability. A quick check of the nomogram shows that if the patient answers yes to both questions, then the post-test probability of depression will be approximately 27%, which is high enough to suggest the need for further testing. Given that the 2 question tool is no less an invasion of privacy than obtaining a blood sample for testing, you decide to discuss with your patient the possibility of a clinical cause for his symptoms the next time you visit to change his ulcer dressing. If he is willing, you resolve to use the 2 question case finding instrument to screen for depression.
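The effect of the starting point on the final estimate can be illustrated by repeating the nomogram calculation for each prevalence mentioned in this article. The sketch below is illustrative only: it applies the +LR of 2.25 to the 3 pretest probabilities quoted in the text (5%, 14%, and 18%), and the labels and function name are assumptions made for the example.

```python
POSITIVE_LR = 2.25  # +LR of the 2 question instrument, from the text


def post_test_probability(pretest_probability, likelihood_ratio):
    """Convert a pretest probability to a post-test probability via odds."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    post_test_odds = pretest_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


# Pretest probabilities quoted in the article for different populations.
pretest_by_setting = {
    "general adult population": 0.05,
    "people with diabetes": 0.14,
    "urgent care clinic in the study": 0.18,
}

post_test_by_setting = {
    setting: post_test_probability(pretest, POSITIVE_LR)
    for setting, pretest in pretest_by_setting.items()
}

for setting, probability in post_test_by_setting.items():
    print(f"{setting}: {probability:.0%} after a positive result")
```

With a pretest probability of 14%, a positive result gives a post-test probability of about 27%, matching the nomogram reading in the scenario. The same positive answer carries less weight in a low prevalence population and more in the study clinic.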