Evidence at the Crossroads Pt. 6: Evidence is Only as Good as What You Do With It
We are living in a time of unprecedented systematic research on the effectiveness of interventions intended to produce better outcomes. This effort is producing a substantial volume of intervention research, but a critical question remains: what do we do with these studies?
The federal government has made massive investments to stimulate and support research on intervention effectiveness. One form of this support that should not be neglected is the grant programs administered through the National Institutes of Health, the Institute of Education Sciences, the National Institute of Justice, and other such agencies that support field-initiated intervention research in their respective topic areas. And one of the most notable developments, as has been well described in previous posts in this series, is the launch of a number of federal initiatives that provide tiered funding to support programs differentially according to the strength of supporting evidence. Most important for present purposes is the common requirement that the funded programs be evaluated, adding still further research to our stock of evidence about what works.
These initiatives will report a range of findings on a diversity of outcomes—some positive, some inconclusive, some possibly negative. How do we extract from these findings insights that can contribute to our understanding of what works and what doesn’t? And how do we determine which findings have general applicability and thus identify programs and practices that could be effectively scaled up, and which ones are specific only to the situations and circumstances in which a particular study was done?
It will be tempting to try to interpret each study individually. If such a study shows positive effects, we might then conclude that the intervention it tested must be effective and we should promote its wider use. That would be a mistake. Any individual study has idiosyncrasies of method, circumstances, participants, intervention particulars, local support, and the like, along with a dose of happenstance, that make it uncertain whether the same results would occur if the study were replicated, and even more uncertain that those results would generalize to other settings, participants, and so on. This is not to say that each study does not make an important contribution to knowledge. It does, but we must be careful about overinterpreting the implications of the findings of any one study.
If we are to draw conclusions with practical implications from intervention research, multiple studies are needed to provide a sufficiently broad base of evidence to support generalization beyond the idiosyncrasies of a single study. There is much talk of replication these days, and replication of the results of an intervention study is informative. But if what we want to know is how confident we can be that the findings generalize to other contexts, then multiple studies with realistic variations will be more informative than attempts at exact replication. Relatively consistent findings across different participant groups, implementation contexts, and variants of the intervention itself provide evidence that the intervention is robust to the kinds of variation likely to occur in actual practice. Inconsistent findings, on the other hand, provide clues to where the boundaries of generalization are—the circumstances under which the intervention may not work so well.
With multiple studies, of course, extracting and interpreting the information useful for practice and policy presents an even bigger challenge. One very intuitive approach would be to categorize the studies according to whether they find positive, null, or negative effects and then try to figure out what differentiates these groups. That too would be a mistake. By what criteria would we categorize the findings of each study? Individual studies universally characterize their effects according to their statistical significance, and it is that indicator that is typically relied upon to judge whether a positive effect was found. That makes sense for a single study—it provides some protection against claiming effects that are likely to have occurred only by chance. But statistical significance is a joint function of the size of the sample and the magnitude of the intervention effect. When reviewing multiple studies, it is necessary to focus on the magnitude of the effects so that smaller studies that find large effects are not overlooked, and very large studies that find small effects are not over-interpreted.
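The interplay of sample size and effect magnitude can be illustrated with a short, hypothetical calculation. The sketch below uses invented numbers and a simple normal approximation (rather than an exact t-test) to show a small study with a large standardized mean difference failing to reach the conventional 0.05 significance threshold, while a large study with a much smaller effect clears it easily:

```python
import math

def smd_p_value(d, n1, n2):
    """Approximate two-sided p-value for a standardized mean
    difference d (Cohen's d) with group sizes n1 and n2,
    using a normal approximation to the sampling distribution."""
    se = math.sqrt(1 / n1 + 1 / n2)       # standard error in SD units
    z = d / se                            # test statistic
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return p

# Hypothetical small study, large effect (d = 0.50, 20 per group):
p_small = smd_p_value(d=0.50, n1=20, n2=20)    # ~0.11, "not significant"

# Hypothetical large study, small effect (d = 0.10, 2000 per group):
p_large = smd_p_value(d=0.10, n1=2000, n2=2000)  # ~0.002, "significant"
```

By significance alone, the second study "works" and the first does not, even though the first reports an effect five times as large; a review keyed to effect magnitude rather than p-values avoids that inversion.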
As many readers will recognize, we have a well-developed procedure for integrating the findings of multiple intervention studies with a focus on the size of the effects rather than their statistical significance, and for exploring the characteristics of those studies that are associated with the effects they find. It is meta-analysis.
In meta-analysis, statistical metrics for the magnitude of the effects on each outcome (effect sizes) are systematically extracted from each study along with a profile of study characteristics related to methods, participant samples, intervention characteristics, implementation, setting, and the like. Analysis is then conducted to describe the distributions of effect sizes on different outcomes and, most important, to explore the relationships between the study and intervention characteristics and the nature and magnitude of the effects. The results, then, characterize the findings of a body of research, not just individual studies, and do so in a differentiated way that attempts to identify the factors associated with differential outcomes.
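To make the pooling step concrete, here is a minimal sketch of random-effects pooling in the DerSimonian-Laird style, one common way of combining effect sizes while allowing for between-study variation. The effect sizes and variances are invented for illustration, and a full meta-analysis would go on to model study characteristics as moderators of the effects:

```python
import math

def random_effects_pool(effects, variances):
    """DerSimonian-Laird random-effects pooling of effect sizes.
    Returns the pooled effect, its standard error, and tau^2
    (the estimated between-study variance)."""
    w = [1 / v for v in variances]                     # inverse-variance weights
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    # Q statistic: weighted squared deviations from the fixed-effect mean
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                      # between-study variance
    w_star = [1 / (v + tau2) for v in variances]       # re-weight with tau^2
    pooled = sum(wi * y for wi, y in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return pooled, se, tau2

# Four hypothetical studies with heterogeneous effect sizes:
pooled, se, tau2 = random_effects_pool(
    effects=[0.60, 0.05, 0.45, 0.15],
    variances=[0.02, 0.01, 0.05, 0.015],
)
```

A nonzero tau^2 here is the statistical signal of the heterogeneity discussed above: the studies are not all estimating one common effect, which is precisely what invites moderator analysis of the study profiles.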
Meta-analysis of broader and narrower bodies of intervention research is quite common, usually undertaken by academic researchers and promoted by such organizations as the Campbell Collaboration and the Cochrane Collaboration. However, it has not been integrated into the plans of many of the government agencies sponsoring the various tiered-evidence initiatives that are underway as a method for integrating and interpreting the findings of the many studies those initiatives are producing. There are exceptions, and some indications of movement in that direction. The Office of Adolescent Health, for instance, is sponsoring a meta-analysis of the studies of teen pregnancy prevention it, and some of its sister agencies, have initiated. Similarly, the Centers for Medicare and Medicaid Services is undertaking a meta-analysis of the research done under its Health Care Innovation Awards initiative. Other agencies may be making similar efforts or contemplating them, but this perspective is not widespread.
Initiating a meta-analysis of the studies developed under any of these initiatives, however, is something best planned from the beginning, not decided after the studies are underway. To support the most informative meta-analysis, there are some advantageous features that can be built into the initiative at the start. For example, the distribution of studies is important. It will be more informative to have multiple studies on practical variations of each intervention in a selected set of interventions than one study on each of a set of distinctly different interventions.
Most important, however, is the opportunity to specify in advance the kind of detailed information that each local evaluator should collect and report in order to provide a rich set of variables for the meta-analysis to explore. These should include many particulars related to the implementation of the intervention, the participants who receive it, the organizational and service delivery context, and other such factors that are often not reported in sufficient detail to support the most informative meta-analysis. Agencies sponsoring (and paying for) multiple studies under a single initiative can require a level of consistency in the way those details are reported that is rarely attained when study authors are left on their own to decide what to report.
The promise of this approach is not simply that the most effective interventions will be identified in a systematic and methodologically credible way, though that will be one result. That form of knowledge allows dissemination of program and practice models expected to be effective if they are implemented with fidelity, that is, in the same way they were in the supporting research. Having a repertoire of such models is indeed a big step forward in the quest to find and scale up effective programs and practices. However, despite the evidence, there will be many reasons why providers and practitioners will not adopt those models or, if they do adopt them, will do so with adaptations that change them from the original evidence-based version.
The larger promise from sufficient bodies of evidence and differentiated meta-analysis is identification of the principles that make the respective programs and practices effective. Knowing why something works, and not just that it works, provides an explanation—a theory if you will—that can guide effective practice in flexible ways amenable to local adaptations and practical constraints, so long as those variants preserve the underlying change mechanism that makes the intervention work.
We need a cookbook full of recipes for effective practice, but even better is knowing how to create recipes for effective practice from the ingredients on hand in the local kitchen.
In “Evidence at the Crossroads,” we seek to provoke discussion and debate about the state of evidence use in policy, specifically federal efforts to build and use evidence of What Works. We start with the premise that research evidence can improve public policies and programs, but fulfilling that potential will require honest assessments of current initiatives, coming to terms with outsize expectations, and learning ways to improve social interventions and public systems. Read other posts in the series, and join the conversation by tweeting #EvidenceCrossroads.