Blog Post

Research Grantee David Yeager is Using a National Sample to Examine Variability in Intervention Effects

Research grantee David Yeager is studying whether an exercise that instills in students the idea that intelligence can be developed over time can reduce disparities in math achievement. While numerous studies have focused on such “growth mindset” interventions, which encourage students to think in ways that support learning, Yeager’s project contributes to the literature by looking beneath the surface and examining variability in effects across settings and contexts.

Yeager is testing the effectiveness of a growth mindset intervention in a double-blind randomized controlled trial conducted in a nationally representative sample of high schools, each providing a census of its 9th-graders. This design can help get to the bottom of how, why, and for whom the intervention may yield positive effects.

Ultimately, the team’s findings may increase our understanding of programs that could be applied across the country.

Here, Foundation president Adam Gamoran talks to Yeager in detail about his study’s design and methodology, including the importance of a national sample and the focus on representative populations.

AG: How does a study on the scale of yours, which looks at students across the country, add to the existing research on the effects of growth mindset interventions?

DY: While evidence of effects exists, almost any effect in education, no matter how established, needs to be studied in more heterogeneous samples. As Tony Bryk is apt to say, once science has established that there is an effect of some factor, then variability of the effect becomes the problem to solve. Studying cross-site variability can make several contributions to the existing literature on growth mindset interventions.

First, we need to study variability in treatment effects across schools and students in part because it’s central to how and why the interventions affect students. A growth mindset intervention doesn’t teach math—the math teacher does. But a growth mindset intervention might reduce some of the worries that students have when they work on hard math problems, sustaining students’ motivation in the face of challenging tasks and making the benefits of the teacher’s pedagogy more apparent. It follows that the benefits that emerge from a growth mindset intervention depend on the resources—the quality of curriculum and instruction—available in a context.

So, with this study, if we can identify schools that have foundational elements in place—adequate curriculum or instruction—but an environment that isn’t supportive of motivation, then a growth mindset intervention may be able to unleash some of the school’s latent potential, and allow students in that school to catch up, thereby reducing inequality in academic outcomes. Second, without evidence about variability of effects, or where growth mindset interventions are more or less effective, we run the risk of making these interventions seem “magical.” This is a problem because magical thinking can lead to overzealous scale-ups on the one hand, or premature rejection of promising ideas on the other. We’d like growth mindset interventions to be used appropriately and effectively.

Third and finally, our study uses a national probability sample, and it represents the most rigorous replication of growth mindset effects to date. As more and more experiments in the behavioral sciences have come under intense scrutiny for failures to replicate, it’s important to put the prominent phenomenon of growth mindset to its toughest test so far.

AG: What is the specific importance of a national sample? Isn’t that unusual for a psychological study?

DY: There’s a scientific reason and a personal reason for launching a national probability sample experiment.

The scientific reason for probability sampling is what’s called site selection bias: schools that volunteer to be in early trials may be especially likely to show effects. This could be because the school partner is willing to comply with procedures, or because all the elements are in place for the treatment to be effective. Of course, in a first study, it would be unwise to do a large evaluation of a novel intervention in a place where everyone expected no effects. But the result is that initial effect sizes can be overestimated, and we don’t come to understand the facilitating conditions for the effects.

Later trials conducted in volunteer samples may underestimate treatment effects because researchers may be probing for boundary conditions: they deliberately pick sites that are especially unlikely to show benefits, in order to ask, “could it work even there?” What sometimes happens next is that someone runs a meta-analysis that averages these two kinds of trials—the early effects (which are often outliers) and the later, smaller effects (which can also be outliers). The resulting average doesn’t mean much, because the underlying effects can be truly different.

In mindset research, you could imagine the same kinds of problems. Schools that volunteered for our initial trials might have known a lot about Carol Dweck’s book, Mindset, and might have been implementing programming to support it. This kind of programming could have tilled the soil for our online treatment, resulting in larger treatment effects. The only way to get an estimate of effect size that applies to the whole country is to take a random sample of schools, try to recruit as many as we can, and then apply survey weights to make the sample nationally representative.

The personal reason for the probability sample, though, comes from my secret life as a survey methodologist. With my collaborator Jon Krosnick at Stanford University, a psychologist and political scientist, I’ve been working on projects over the last decade that evaluate the quality of data obtained from large national convenience samples (such as online surveys that recruit people through ads on websites) compared to large nationally representative sample surveys (which recruit participants through address-based methods or telephone). In general, we find that convenience samples are less accurate when estimating population statistics, even when good survey weights are applied, but probability sample surveys are surprisingly accurate even when they have response rates below 10%, provided that good weights are applied. I always wanted to translate that work on survey accuracy into the experimental setting—to bridge my secret survey life with my less-secret life as a developmental psychologist. Our study was a great chance to do this (as a side note, Jon Krosnick was instrumental in helping us launch the study back in 2013).
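
To make the weighting idea concrete, here is a toy post-stratification example (the group shares and outcome values are invented purely for illustration): a sample that over-represents one group yields a biased estimate of the population mean, and weighting each respondent by the ratio of population share to sample share corrects it.

```python
# Hypothetical illustration of survey weights (all numbers invented).
# Suppose 60% of the population is group A and 40% is group B, but our
# respondents are 80% A / 20% B. Weighting each respondent by
# (population share / sample share) recovers the population mean.

pop_share = {"A": 0.60, "B": 0.40}    # true population composition
group_mean = {"A": 70.0, "B": 50.0}   # hypothetical outcome means

# 100 respondents, over-representing group A
respondents = [("A", group_mean["A"])] * 80 + [("B", group_mean["B"])] * 20
sample_share = {"A": 0.80, "B": 0.20}

# Naive estimate: biased toward the over-represented group
unweighted = sum(y for _, y in respondents) / len(respondents)

# Post-stratification weights: population share / sample share
weights = {g: pop_share[g] / sample_share[g] for g in pop_share}
weighted = (sum(weights[g] * y for g, y in respondents)
            / sum(weights[g] for g, _ in respondents))

true_population_mean = sum(pop_share[g] * group_mean[g] for g in pop_share)

print(unweighted)            # 66.0: pulled toward group A
print(weighted)              # 62.0: matches the population mean
print(true_population_mean)  # 62.0
```

The same logic, applied with weights built from known school characteristics, is what lets a probability sample with a modest response rate still represent the national population of schools.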

AG: What can we learn from an experiment that occurs with a representative population that we cannot learn from the more typical samples of voluntary participants?

DY: The random assignment aspect has obvious benefits for causal inference: we can safely attribute any difference in outcomes between the treatment and control groups to the treatment exercise, because random assignment makes the two groups equivalent, on average, at baseline.

Our goal is to estimate the differences between the treatment and control groups. A probability sample strengthens that estimate in two ways: it lets us estimate population average effects, and it helps us understand moderators.

First, there is a distinction between what’s called the sample average treatment effect (SATE) and the population average treatment effect (PATE). If you take a sample of volunteers, the study is designed to tell us whether there was a causal effect in the sample of participants that we happened to pick—a sample average treatment effect. This can be interesting, but in policy evaluation it is not usually what we want to know. What we usually want to know is: what effect could I expect if I gave the treatment to a much larger group of people—to a population? The easiest way to get the population average treatment effect is to take a random sample of the population (or just assume the treatment effect is homogeneous, but it usually isn’t).
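
The SATE/PATE distinction can be shown with a small simulation (all numbers are invented for illustration; the “readiness” factor and the volunteer-selection rule are hypothetical): when treatment effects vary across schools and the decision to volunteer correlates with the effect, a volunteer sample overstates the population effect, while a random sample lands near it.

```python
import random

random.seed(0)

# Hypothetical population of 10,000 schools. Each school's true
# treatment effect depends on a "readiness" factor (e.g., a supportive
# climate), so effects are heterogeneous across sites.
population = []
for _ in range(10_000):
    readiness = random.random()          # 0 = unready, 1 = very ready
    effect = 0.05 + 0.20 * readiness     # invented effect sizes
    population.append((readiness, effect))

# PATE: the average effect over the whole population
pate = sum(e for _, e in population) / len(population)

# Volunteer sample: high-readiness schools are far more likely to opt
# in, so the sample over-represents favorable sites.
volunteers = [s for s in population if random.random() < s[0] ** 2]
sate_volunteer = sum(e for _, e in volunteers) / len(volunteers)

# Probability sample: every school has the same chance of selection.
prob_sample = random.sample(population, 500)
sate_random = sum(e for _, e in prob_sample) / len(prob_sample)

print(f"PATE (true population effect): {pate:.3f}")
print(f"Volunteer-sample estimate:     {sate_volunteer:.3f}")  # biased upward
print(f"Probability-sample estimate:   {sate_random:.3f}")     # near the PATE
```

The volunteer estimate here is inflated not because the treatment changed, but because volunteering is correlated with the conditions that make the treatment work—exactly the site selection bias described above.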

In the case of growth mindset treatments, there could be many moderators of the treatment effect, both known and unknown. And they may exist in unknown proportions across the country. The national probability sample design ensures that, regardless of whether we can name and measure the source of the heterogeneity, we can be reasonably confident that we have the right proportions in our sample.

The second contribution of the national probability sampling design is its ability to let us study moderators—that is, factors that explain why the treatment effect is larger or smaller for some groups of students than for others.

The search for moderators is really the primary goal of the national study—we don’t just want to know whether a growth mindset treatment is effective but where and why it is effective. Specifically, we’re interested in studying whether the growth mindset intervention works in schools with low, medium, and high levels of achievement, and whether it works when there is a supportive versus an unsupportive peer climate.

In line with the question about typical samples, readers might ask, “Won’t any low-achieving or high-achieving school do? As long as you have enough schools from each group to do a comparison, then isn’t that enough?” No, for basically the same reason explained above. A group of schools with the same achievement level might still vary a lot in ways that are relevant to treatment effect sizes. Suppose, for instance, that the only low-achieving schools that volunteered to be in a growth mindset study were leading charter schools with principals who had read Carol Dweck’s Mindset. Perhaps the treatment would work unexpectedly well there. And suppose that the high-achieving schools that volunteered were looking for a quick fix for student motivation—a brief online exercise—rather than looking at deeper issues. Perhaps the treatment would not work well there. If a study of volunteers found stronger effects in the former schools relative to the latter, it might mean that these selection factors were the real moderator, not the school achievement level. The end result would be inaccurate theory about the intervention and misguided advice for policy.

If we knew something about all of these types of conditions and had information about them in advance, then it might be possible to recruit volunteer samples that match the population on those conditions, provided we assume that nothing else matters. But that assumption is probably not warranted. Moreover, studies have almost never measured these kinds of moderators in generalizable samples in the first place, so we don’t yet have the knowledge to construct better volunteer samples.

Often, the safest thing to do will be to randomly sample schools. At a minimum, we can expect that all of the known and unknown moderators are distributed in our sample similarly to how they are distributed in the population, within sampling error.

David Yeager is an Associate Professor of Psychology at the University of Texas at Austin. Read more about the work of David and his team at the Mindset Scholars Network.

More about research grants
Proposals for our research grants program are evaluated on the basis of their fit with a given focus area; the strength and feasibility of their designs, methods, and analyses; and their potential to inform change and contribute to bodies of knowledge that can improve the lives of young people. Although we have highlighted a particular design and methodology here, we always begin application reviews by looking at the research questions or hypotheses, then evaluating whether the proposed research designs and methods will provide empirical evidence on those questions. Learn more about applying for research grants.
