Monday, December 31, 2012

Another claim of gaming benefits

Yesterday I wrote an extended piece about the lack of strong evidence supporting benefits of videogame training. In today's Guardian/Observer, Vaughan Bell makes the same claim that the Scientific American Mind article did (see also his MindHacks blog post). Unlike that article, he did mention that some researchers have raised questions about the methods in that literature, but then claimed that there have since been other better-designed studies. The article provides no citations to better-designed studies, and I know of exactly zero that have adequately addressed the Boot et al. critiques.

Other than mentioning that some people suggest there are problems, he gave no coverage to the types of problems undermining claims of gaming benefits. We need more critical coverage of these claims of benefits, not more hyping of the claims. He gave much more critical discussion of the videogame/violence claims. How about giving the same level of critical thought to the gaming/benefits claims? It seems to be a trend to use the cognitive benefits claim as an unquestioned way to offset the "games are bad for you" claim.

Sunday, December 30, 2012

Think video games make you smarter? Not so fast...

This month's issue of Scientific American Mind ran an article that uncritically touted the power of first-person video games to improve cognition in general (broad transfer from the games to basic perception/attention measures). The article neglected to mention any of the failures to replicate those effects (e.g., link, link), it ignored critiques of the evidence underlying claims of broad transfer, it ignored the implications of potential conflicts of interest among those promoting the benefits of game training, etc. This neglect of contrary evidence and method problems isn't a new trend. All too often, new papers use designs with known problems or draw unjustified conclusions, and media coverage of them largely ignores the shortcomings (as, for some reason, do reviewers and editors). Given the visibility of this latest article, it seemed appropriate to resurrect my commentary from September 13, 2011, a variant of which was originally posted on my invisiblegorilla.com blog.

Try to spot the flaw in this study. A scientist recruits a group of subjects to test the effectiveness of a new drug that purportedly improves attention. After giving subjects a pre-test to measure their attention, the experimenter tells the subjects all about the exciting new pill, and after they take the pill, the experimenter re-tests their attention. The subjects show significantly better performance the second time they’re tested.
This study would never pass muster in the peer review process: the flaws are too glaring. First, the subjects are not blind to the hypothesis (the experimenter told them about the exciting new drug), so they could be motivated to try harder the second time they take the test. The experimenter isn’t blind to the hypothesis either, so they might influence subject performance as well. There’s also no placebo control condition to account for the typical improvement people make when performing a task for the second time. In fact, this study lacks all of the gold-standard controls needed in a clinical intervention.
A couple years ago, Walter Boot, Daniel Blakely and I wrote a paper in Frontiers in Psychology that describes serious flaws in many of the studies underlying the popular notion that playing action video games enhances cognitive abilities. The flaws are sometimes more subtle, but they’re remarkably common: None of the existing studies include all the gold-standard controls necessary to draw a firm conclusion about the benefits of gaming on cognition. When coupled with publication biases that exclude failures to replicate from the published literature, these flaws raise doubts about whether the cumulative evidence supports any claim of a benefit.
The evidence in favor of a benefit from video games on cognition takes two forms: (a) expert/novice differences and (b) training studies.
The majority of studies compare the performance of experienced gamers to non-gamers, and many (although not all) show that gamers outperform non-gamers on measures of attention, processing speed, etc. (e.g., Bialystok, 2006; Chisholm et al., 2010; Clark, Fleck, & Mitroff, 2011; Colzato et al., 2010; Donohue, Woldorff, & Mitroff, 2010; Karle, Watter, & Shedden, 2010; West et al., 2008). Such expert/novice comparisons are useful and informative, but they do not permit any causal claim about the effects of video games on cognition. In essence, they are correlational studies rather than causal ones. Perhaps the experienced gamers took up gaming because they were better at those basic cognitive tasks. That is, gamers might just be better at those cognitive tasks in general, and their superior cognitive skills are what made them successful gamers. Or, some third factor such as intelligence or reaction times might contribute to interest in gaming and performance on the cognitive tasks.
Fortunately, only a few researchers make the mistake of drawing causal conclusions from a comparison of experts and novices. Yet, almost all mistakenly infer the existence of an underlying cognitive difference from a performance difference without controlling for other factors that could lead to performance differences. Experts in these studies are recruited because they are gamers. Many are familiar with claims that gamers outperform non-gamers on cognition and perception tasks. They are akin to a drug treatment group that has been told how wonderful the drug is. In other words, they are not blind to their condition, and they likely will be motivated to perform well. The only way around this motivation effect is to recruit subjects with no mention of gaming and only ask them about their gaming experience after they have completed the primary cognitive tasks, but only a handful of studies have done that. And, even with blind recruiting, gamers might still be more motivated to perform well because they are asked to perform a game-like task on a computer. In other words, any expert-novice performance differences might reflect different motivation and not different cognitive abilities.
Even if we accept the claim that gamers have superior cognitive abilities, such differences do not show that games affect cognition. Only by measuring cognitive improvements before and after game training might a causal conclusion be justified (e.g., Green & Bavelier, 2003; 2006a; 2006b; 2007). Such training studies are expensive and time-consuming to conduct, and only a handful of labs have even attempted them. And, at least one large-scale training study has failed to replicate a benefit from action game training (Boot et al., 2008). Yet, these studies are the sole basis for claims that games benefit cognition. In our Frontiers paper, we discuss a number of problems with the published training studies. Taken together, they raise doubts about the validity of claims that games improve cognition:
  1. The studies are not double-blind. The experimenters know the hypotheses and could subtly influence the experiment outcome.
  2. The subjects are not blind to their condition. A truly blind design is impossible because subjects know which game they are playing. And, if they see a connection between their game and the cognitive tasks, then such expectations could lead to improvements via motivation (a placebo effect). Unless a study controls for differential expectations between the experimental condition and the control condition, then it does not have adequate control for a placebo effect explanation for any differences. To date, no study has controlled for differential expectations.
  3. Almost all of the published training studies showing a benefit of video game training relative to a control group show no test-retest effect in the control group. That's bizarre. The control group should show improvement from the pre-test to the post-test—people should get better with practice. The lack of improvement in the baseline condition raises the concern that the “action” in these studies comes not from a benefit of action game training but from some unusual cost in the control condition.
  4. It is unclear how many independent findings of training benefits actually exist. Many of the papers touting the benefits of training for cognition only discuss the results of one or two outcome measures. It would be prohibitively expensive to do 30-50 hours of training with just one chance to find a benefit. In reality, such studies likely included many outcome measures but reported only a couple. If so, there's a legitimate possibility that the reported results reflect p-hacking. Those papers often note that participants also completed "unrelated experiments," but it's not clear what those are or whether they actually were the same experiment but different outcome measures. Based on the game scores noted in some of these papers, it appears that data from different outcome measures with some of the same trained subjects were reported in separate papers. That is, the groups of subjects tested in separate papers might have overlapped. If so, then the papers do not constitute independent tests of the benefits of gaming. If we don't know whether or not these separate papers constitute separate studies, any meta-analytic estimate of the existence and effect size for game-training benefits is impossible. Together with the known failures to replicate training benefits and possible file drawer issues, it is unclear whether the accumulated evidence supports any claim of game training benefits at all.
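To make the p-hacking concern in point 4 concrete, here is a rough sketch in Python of how quickly the chance of a spurious "benefit" grows with the number of outcome measures. It assumes the measures are statistically independent and each is tested at alpha = .05. Both are simplifying assumptions of mine (real outcome measures are usually correlated, so the exact numbers would differ), and the function name is just for illustration.

```python
def familywise_error_rate(num_measures: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `num_measures` independent tests comes out
    significant at level `alpha` when there is no true effect on any of them."""
    return 1.0 - (1.0 - alpha) ** num_measures


# Assumed, illustrative numbers of outcome measures in a training study.
for k in (1, 2, 5, 10, 20):
    rate = familywise_error_rate(k)
    print(f"{k:2d} outcome measures -> P(at least one false positive) = {rate:.2f}")
```

Under these assumptions, a study with ten outcome measures has roughly a 40% chance of producing at least one "significant" benefit even if training does nothing, which is why knowing the full set of measures matters.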
Given that expert/novice studies tell us nothing about a causal benefit of video games for cognition and that the evidence for training benefits is mixed and uncertain, we should hesitate to promote game training as a cognitive elixir. In some ways, the case that video games can enhance the mind is the complement to recent fear mongering that the internet is making us stupid. In both cases, the claim is that technology is altering our abilities. And, in both cases, the claims seem to go well beyond the evidence. The cognitive training literature shows that we can enhance cognition, but the effects of practice tend to be narrowly limited to the tasks we practice (see Ball et al., 2002; Hertzog, Kramer, Wilson, & Lindenberger, 2009; Owen et al., 2010; Singley & Anderson, 1989; for examples and discussion). Practicing crossword puzzles will make you better at crossword puzzles, but it won’t help you recall your friend’s name when you meet him on the street. None of the gaming studies provide evidence that the benefits, to the extent that they exist at all, actually transfer to anything other than simple computer-based laboratory tasks.
If you enjoy playing video games, by all means do so. Just don’t view them as an all-purpose mind builder. There’s no reason to think that gaming will help your real world cognition any more than would just going for a walk. If you want to generalize your gaming prowess to real-world skills, you could always try your hand at paintball. Or, if you like Mario, you could spend some time as a plumber and turtle-stomper.
Other Sources Cited:
  • Ball, K., Berch, D. B., Helmers, K. F., Jobe, J. B., Leveck, M. D., Marsiske, M., et al. (2002). Effects of cognitive training interventions with older adults: A randomized controlled trial. JAMA: Journal of the American Medical Association, 288(18), 2271-2281.
  • Bialystok, E. (2006). Effect of bilingualism and computer video game experience on the Simon task. Canadian Journal of Experimental Psychology, 60, 68-79.
  • Boot, W. R., Blakely, D. P., & Simons, D. J. (2011). Do action video games improve perception and cognition? Frontiers in Psychology, 2, 226. doi: 10.3389/fpsyg.2011.00226
  • Chisholm, J. D., Hickey, C., Theeuwes, J., & Kingstone, A. (2010). Reduced attentional capture in video game players. Attention, Perception, & Psychophysics, 72, 667-671.
  • Clark, K., Fleck, M. S., & Mitroff, S. R. (2011). Enhanced change detection performance reveals improved strategy use in avid action video game players. Acta Psychologica, 136, 67-72.
  • Colzato, L. S., van Leeuwen, P. J. A., van den Wildenberg, W. P. M., & Hommel, B. (2010). DOOM’d to switch: superior cognitive flexibility in players of first person shooter games. Frontiers in Psychology, 1, 1-5.
  • Donohue, S. E., Woldorff, M. G., & Mitroff, S. R. (2010). Video game players show more precise multisensory temporal processing abilities. Attention, Perception, & Psychophysics, 72, 1120-1129.
  • Green, C. S. & Bavelier, D. (2003). Action video game modifies visual selective attention. Nature, 423, 534-537.
  • Green, C.S. & Bavelier, D. (2006a). Effect of action video games on the spatial distribution of visuospatial attention. Journal of Experimental Psychology: Human Perception and Performance, 1465-1468.
  • Green, C. S. & Bavelier, D. (2006b). Enumeration versus multiple object tracking: the case of action video game players. Cognition, 101, 217-245.
  • Green, C.S. & Bavelier, D. (2007). Action video game experience alters the spatial resolution of attention. Psychological Science, 18, 88-94.
  • Hertzog, C., Kramer, A. F., Wilson, R. S., & Lindenberger, U. (2009). Enrichment effects on adult cognitive development. Psychological Science in the Public Interest, 9, 1–65.
  • Irons, J. L., Remington, R. W., & McLean, J. P. (2011). Not so fast: Rethinking the effects of action video games on attentional capacity. Australian Journal of Psychology, 63. doi: 10.1111/j.1742-9536.2011.00001.x
  • Karle, J. W., Watter, S., & Shedden, J. M. (2010). Task switching in video game players: Benefits of selective attention but not resistance to proactive interference. Acta Psychologica, 134, 70-78.
  • Murphy, K. & Spencer, A. (2009). Playing video games does not make for better visual attention skills. Journal of Articles in Support of the Null Hypothesis, 6, 1-20.
  • Owen, A.M., Hampshire, A., Grahn, J.A., Stenton, R., Dajani, S., Burns, A.S., Howard, R.J., & Ballard, C.G. (2010).  Putting brain training to the test.  Nature, 465, 775-779.
  • Singley, M. K., & Anderson, J. R. (1989). The transfer of cognitive skill. Cambridge, MA: Harvard University Press.
  • West, G. L., Stevens, S. S., Pun, C., & Pratt, J. (2008). Visuospatial experience modulates attentional capture: Evidence from action video game players. Journal of Vision, 8, 1-9.

Wednesday, December 26, 2012

Journals of null results and the goal of replication

Here is my response to the following question that Gary Marcus forwarded me from one of his readers:
Is there a place in experimental science for a journal dedicated to publishing "failed" experiments? Or would publication in a failed-studies journal be so ignominious for the scientists involved as not to be worthwhile? Does a "failed-studies" journal have any chance of success (no pun intended)?

Over the years, there have been a number of attempts to form "null results" journals. Currently, the field has the Journal of Articles in Support of the Null Hypothesis (there may well be others). As a rule, such journals are not terribly successful. They tend to become an outlet for studies that can't get published anywhere else. And, given that there are many reasons for failed replications, people generally don't devote much attention to them.

Journals like PLoS One have been doing a better job than many others in publishing direct replication attempts. They emphasize experimental accuracy over theoretical contributions, which fits the goal of a journal that publishes replication attempts whether or not they work. There also are websites now that compile replication attempts (psychfiledrawer.org). The main goal of that site is to make researchers aware of existing replication attempts.

For me, there's a bigger problem with null results journals and websites: They treat replications as an attempt to make a binary decision about the existence of an effect. The replication either succeeds or fails, and there's no intermediary state of the world. Yet, in my view, the goal of replication should be to provide a more accurate estimate of the true effect, not to decide whether a replication is a failure or success.

Few replication attempts should lead to a binary succeed/fail judgment. Some will show the original finding to be a true false positive with no actual effect, but most will just find that the original study overestimated the size of the effect (I say "most" because publication bias ensures that many reported effects overestimate the true effects). The goal of replication should be to achieve greater and greater confidence in the estimate of the actual effect. Only with repeated replication can we zero in on the actual estimate. The greater the size of the new study (e.g., more subjects tested), the better the estimate.

The initiatives I'm pushing behind the scenes (more on those soon) aim to encourage multiple replications using identical protocols in order to achieve a better estimate of the true effect. One failure to replicate is no more informative than one positive effect: both results could be false. With repeated replication, though, we get a sense of what the real effect actually is.
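As a rough illustration of how repeated replication sharpens an estimate, here is a minimal Python sketch that pools effect sizes by inverse-variance weighting (a standard fixed-effect meta-analytic approach, which I am using only as an example; it is not the protocol of any particular initiative). The effect sizes and standard errors are invented for illustration.

```python
import math


def pooled_estimate(effects, std_errors):
    """Fixed-effect (inverse-variance) pooled effect and its standard error.

    Each study is weighted by 1 / SE^2, so more precise (typically larger)
    studies count for more.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * d for w, d in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical data: one inflated original finding followed by four replications.
effects = [0.60, 0.15, 0.25, 0.10, 0.20]
std_errors = [0.25, 0.20, 0.20, 0.20, 0.20]

pooled, pooled_se = pooled_estimate(effects, std_errors)
print(f"pooled effect = {pooled:.2f}, standard error = {pooled_se:.2f}")
```

With these made-up numbers, the pooled estimate lands well below the original 0.60, and its standard error is smaller than that of any single study. No individual replication "succeeded" or "failed"; together they just give a better estimate.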

Monday, December 24, 2012

Recommendations for improving psychology

In a new blog post in The New Yorker,  makes some excellent suggestions for improving the state of psychology publishing. I personally prefer not to cluster fraud cases together with results that do not replicate. The former are revealed only via whistleblowing or investigation. The latter inspire the "science is self correcting" refrain: later work can "fix" claims that don't hold up.

The broader problem, though, is that science often is not self correcting. Science can't correct itself if nobody tries to replicate published claims or if those replications are relegated to the file drawer. The bias against publishing negative findings provides a disincentive for attempting direct replications in the first place. And, even when replication attempts are published, they often are ignored both in the literature and in the public eye (just as newspaper errata are buried). 

The field needs to change the incentive structure that governs scientific publishing in order to encourage and support direct replications. Initiatives like 's reproducibility project get us part of the way there, but only when the academic societies and journals encourage direct replication can the field really change. I'm involved in a project to implement just such a change, and I'll be writing more about it soon.

Tuesday, December 18, 2012

The demographics of surveys: Phone vs. Mechanical Turk

Last year, Chris Chabris and I published the results of a national survey of beliefs about memory (in PLoS One). We found that many people agreed with statements about memory and the mind that experts roundly dismiss as wrong. We conducted the survey in 2009 using the polling firm SurveyUSA, with a nominal sample of 1500 respondents representing the population of the United States. SurveyUSA uses random-digit dialing of landline phone numbers, and respondents press keys on their keypad to answer questions from a recorded voice. Their method is fairly typical of so-called "robotic" polling.

Last summer, we repeated the same survey using a sample drawn from Amazon Mechanical Turk, with the restriction that respondents be from the United States. On Mechanical Turk, workers decide whether they would like to participate and are paid a small amount for a completed survey. Unlike random digit dialing, the sample on Mechanical Turk is entirely self-selected.  The results of that survey and a comparison to our earlier survey just appeared in PLoS One on December 18, 2012.

Just recently I wrote an extended blog post about the nature of survey demographics. To compare these two surveys, we weighted both to a nominal sample of 750 respondents according to the population demographics from the 2010 US Census. Reassuringly, the pattern of results was roughly comparable. In essence, we replicated our generalization to the national population, with many people endorsing mistaken beliefs about memory.

In writing the paper and re-weighting the samples, I discovered something interesting about who responds to these sorts of surveys. Although both could be weighted to a nationally representative sample, the raw demographics of the samples were vastly different. They were roughly comparable on most dimensions (e.g., income, education, region of the country), but their ages differed dramatically.

[Figure: age demographics from multiple survey methods]

The yellow bars represent data from the 2010 US Census. Note that about half of the adult US population is under 44 and half is over 44. Now look at the blue bars from the SurveyUSA sample of about 1840 people. The first thing to notice is that the phone survey massively oversamples older people and massively undersamples younger ones. In order to generalize to the larger population, each response from a young subject is weighted to count many times that of an old subject. The pattern is almost exactly the opposite for our Mechanical Turk sample (of just under 1000 people). Mechanical Turk respondents were disproportionately young. The extent of age bias in each sample was roughly comparable and in opposite directions, and neither was anywhere close to the actual demographics of the US population.

For me, this figure was eye opening. I wasn't surprised that an online Mechanical Turk sample would be disproportionately younger, and I assumed that phone surveys would oversample the elderly, but I had no idea how extreme that bias would be. What that means is that any national survey conducted by phone is mostly contacting older people. Unless the sample is adequately large, the number of young respondents will be minuscule, meaning that the weighting for those respondents will be huge. If a small survey happened to get a few oddball younger respondents, it could dramatically alter the total estimate. 
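To see how much a handful of heavily weighted respondents can move an estimate, here is a minimal Python sketch. All of the numbers are made up for illustration (100 respondents, only 5 of them young, and a population split 50/50 by age), and the function name is hypothetical. It is not the weighting procedure that SurveyUSA or any other pollster actually uses.

```python
def weighted_estimate(groups, population_share):
    """Post-stratified estimate: each group's endorsement rate counts in
    proportion to its share of the population, not its share of the sample.

    groups maps a group name to (number of respondents, number endorsing).
    """
    return sum(
        population_share[name] * (endorsing / respondents)
        for name, (respondents, endorsing) in groups.items()
    )


# Hypothetical population: half young, half old.
population_share = {"young": 0.5, "old": 0.5}

# Hypothetical phone sample: 5 young respondents and 95 old ones.
typical_sample = {"young": (5, 1), "old": (95, 38)}
oddball_sample = {"young": (5, 3), "old": (95, 38)}  # two young answers differ

print(f"typical:  {weighted_estimate(typical_sample, population_share):.2f}")
print(f"oddball:  {weighted_estimate(oddball_sample, population_share):.2f}")
```

In this toy sample, changing the answers of just two young respondents moves the weighted national estimate from 30% to 50%, because those five people stand in for half the population.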

As I discussed in this earlier post, pollsters almost never report their raw demographics, essentially hiding how much weighting their survey needed to make it nationally representative. But, that information is crucial, especially if the survey compares demographic groups. If you compared young and old subjects in our SurveyUSA study, you would be comparing a small sample to a huge one, and the generalization from the older subjects would be much safer than from the younger ones. Without knowing that, you might assume that each generalization was equally valid.

In our paper, we compared the two surveys by weighting each to match the population, making each into a representative sample with a nominal size of 750 people (i.e., weighting to match what would be expected for a sample of 750 people; see the paper for details, but we basically dichotomized the age category given the sample sizes). Fortunately, despite these huge deviations from the actual population statistics, each "nationally representative" sample produced comparable results. In a sense, that is the ideal situation: Two samples with vastly different demographics produce comparable results when weighted. That means that the different sampling methods and weightings did not dramatically change the pattern of results.

That finding also has practical implications for anyone interested in conducting surveys. One approach to obtaining a more representative sample would be to combine phone and Mechanical Turk samples, counting on Mechanical Turk for younger respondents and the phone survey for older ones.

The next time you read about a survey, ask yourself: Was the sample truly representative and, if not, was the sample large enough to trust the conclusions about different demographic subgroups?

The hidden secrets of polling

Pollsters and survey researchers often report the results of "nationally representative" surveys, but in what sense are such surveys representative? The answer, it turns out, is more complicated than it might seem. And, the way surveys are reported obscures a potentially important secret.

The 2012 elections in the United States likely were the most heavily polled in history. Not surprisingly, polls varied in their accuracy, and different polls of the same race often produced discrepant results. Election polling is particularly tricky because there is no ground truth until after the election: pollsters are trying to predict what people are going to do, an inherently noisy process since you have no way to know for certain if they will follow through on what they say they will do. The challenge in polling elections is to adjust for the likelihood that people will do what they say they will, and many polling discrepancies are due to differences in that "likely voter" model. (If you are a political junkie like me, that was why some people thought it was necessary to "unskew" the polls; basically, that meant adjusting the likely voter model.)

If your goal, instead, is just to generalize from your sample to the population as a whole, you just need to know the characteristics of the population, a much easier (but still challenging) problem. Suppose I want to know what percentage of the US population owns a bicycle. If I had infinite resources and could work infinitely fast, I could ask every US citizen and tally their answers to get the percentage (much as a classic Census would). Far more efficient and cost effective, though, would be to sample from the population and estimate the population characteristics from the sample. The bigger the sample, the more likely the sample will be representative of the population as a whole, assuming the sample is random. But there's the rub. In practice, no sampling method provides a truly random sample of the population.

For a random sample, we can assume that any one individual is as likely as any other to be included in our survey. That means, with a large sample, the relative proportions of men and women, old people and young, rich and poor, will match those of the population. Roughly 1% of your sample will be in the top 1% of income earners in the USA and 99% won't be. But, if a sample is not truly random, some people will be sampled more than others. That leaves you two options:
  1. Assume that your sample is representative enough and generalize to the full population anyway. That approach actually forms the basis of most generalization from experimental research. People conduct a study by testing a group of college students (known as a sample of convenience) and assume that their sample is representative enough of a larger population. Whether or not that generalization is appropriate depends on the nature of the question. If you sample a group of college students and find that all of them breathe air, generalizing to all humans would be justified. If you sample a group of college students and find that none of them are married, you wouldn't be justified in generalizing to all Americans and concluding that Americans don't marry. Few published journal articles explicitly address the limits of their generalizability and the assumptions made when generalizing, but they should (more on that soon).
  2. Adjust your estimate to account for known discrepancies between your sample and the full population. That's how polling firms solve the problem. They weight the responses they did get to account for oversampling of some groups or undersampling of others. If 10% of your sample falls in the top 1% of incomes in the USA, you would need to weight those responses less and weight the responses of the respondents reporting lower incomes more heavily. If you know that Democrats outnumber Republicans in a region, but your sample includes more Republicans than Democrats, you need to weight the sample to account for the discrepancy. That's where much of the fighting emerges in political polling (were the percentages of each party appropriate? Did the pollsters accurately weight for how likely each group was to vote?). 
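The weighting described in option 2 can be sketched in a few lines of Python. This is a simplified, single-variable version using the income example above (10% of the sample falling in the top 1% of incomes); the sample counts and the function name are assumptions for illustration, and real polling firms weight on several demographic variables at once.

```python
def poststratification_weights(sample_counts, population_share):
    """Weight for each group = its share of the population divided by
    its share of the sample. Oversampled groups get weights below 1."""
    total = sum(sample_counts.values())
    return {
        group: population_share[group] / (count / total)
        for group, count in sample_counts.items()
    }


# Hypothetical sample of 1000 people in which the top 1% of earners
# make up 10% of respondents.
sample_counts = {"top 1%": 100, "everyone else": 900}
population_share = {"top 1%": 0.01, "everyone else": 0.99}

weights = poststratification_weights(sample_counts, population_share)
for group, weight in weights.items():
    print(f"{group}: each response counts as {weight:.2f} people")
```

Each oversampled high earner counts as about a tenth of a respondent, and everyone else counts slightly more than once, so the weighted sample still adds up to 1000 people.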

National surveys are not truly random samples of the population. Most operate by selecting area codes (to make sure they sample the right geographical region) and dialing random numbers in that area code. Some call cell phones, but most call landlines from lists of registered voters or other calling registries. If everyone had a phone, answered it with equal probability, and responded to the survey with equal frequency, then a random-digit-dialed survey would be a random sample of the population. None of those conditions hold in reality. Many people no longer have land lines, especially younger people. Relatively few people respond to surveys (often, the response rates are well under 10%), and those who do respond might differ systematically from those who don't.

Given that polls are not random samples, pollsters weight their samples to match their beliefs about the characteristics of the population as a whole. For political polls, that means weighting to match the demographics of registered voters or likely voters. For surveys of other sorts (e.g., owning a bicycle), that means weighting the sample to match the demographics of the broader population of interest. With a large enough sample, weighting allows you to make your sample conform to the demographics of your target population. If your goal is to generalize to the population, you must adjust your sample to match it. If you do that, your sample will be representative of those population demographics. That does not mean the poll will be accurate, though. Perhaps the old people who did respond were unusual or differed from other old people in systematic ways. If so, then your sampled old people might be a poor stand-in for old people in general, and the inference you draw from your poll might be inaccurate.

The first secret of poll reporting: The size of the poll is a convenient fiction. When the media reports that a poll surveyed 2000 people, that is misleading. They almost certainly surveyed more than that and then weighted their poll to match what would be representative of the population with a sample of 2000. The reason that they would have to survey more than 2000 is that some groups are so underrepresented in polling that it would take more than 2000 people to get enough respondents in those groups to estimate how that group as a whole would respond. If you cut the demographic categories too finely, you won't have any respondents from some groups (for any group constituting less than 1/4000 of the target population, you would not expect to sample any respondents). The number of respondents reported is a "nominal" sample size, not an actual sample size. Pollsters decide in advance what nominal sample size they want and then poll until they obtain a large enough sample in each of their critical demographic groups to be able to weight the responses. Polling firms typically do not report how many responses they need from each demographic group, and they rarely report the total number of people sampled to achieve their nominal sample. And that hints at the second problem.
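The 1/4000 figure above is easy to check with a little arithmetic. The sketch below assumes an idealized simple random sample in which each of the 2000 respondents is drawn independently (which, as discussed, real polls are not), and the function names are mine.

```python
def expected_respondents(sample_size: int, group_share: float) -> float:
    """Average number of respondents from a group that makes up
    `group_share` of the population."""
    return sample_size * group_share


def prob_no_respondents(sample_size: int, group_share: float) -> float:
    """Chance that a simple random sample contains nobody from the group."""
    return (1.0 - group_share) ** sample_size


share = 1 / 4000  # a group that is 1/4000 of the target population
n = 2000          # the poll's sample size

print(f"expected respondents from the group: {expected_respondents(n, share):.2f}")
print(f"chance of sampling none of them:     {prob_no_respondents(n, share):.2f}")
```

Under these assumptions, the poll would include half a person from that group on average and would miss the group entirely about 60% of the time.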

The second secret of poll reporting: Pollsters almost never report the raw sample demographics that went into the polling estimate. Instead, they report the results as if their sample were representative. They might report cross-tabs, in which they break down the percentages of each group responding in a particular way (e.g., what percentage of women own bicycles), but they don't report how many women were in their sample or how heavily they had to weight individual responses to make those estimates. In some cases, they might generalize to an entire demographic category from the responses of only a handful of people. Critically, the actual demographics of the sample almost never match the demographics of the target population, and in some cases, they can be dramatically different. That means the pollsters must use fairly extreme weights to achieve a representative sample.
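One way to quantify what extreme weights cost is Kish's approximate "effective sample size," which is my addition here rather than anything pollsters report. The Python sketch below uses an invented sample of 1000 people (50 young, 950 old) weighted to a population that is half young and half old.

```python
def effective_sample_size(weights):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights).
    With equal weights this equals the actual number of respondents."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


# Hypothetical sample: 50 young and 950 old respondents, weighted so that
# each age group counts as half of a nominal sample of 1000.
young_weight = 500 / 50    # each young respondent counts as 10 people
old_weight = 500 / 950     # each old respondent counts as about 0.53 people
weights = [young_weight] * 50 + [old_weight] * 950

print(f"actual respondents:    {len(weights)}")
print(f"effective sample size: {effective_sample_size(weights):.0f}")
```

In this made-up case, 1000 actual respondents carry roughly the statistical information of 190 equally weighted ones, which is the kind of detail the raw demographics would reveal.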

In an upcoming post, I will provide an example of how wildly sample demographics can vary even when they produce comparable results. In a sense, that is the most reassuring case: when widely different samples are weighted to a common standard and lead to the same conclusion, we can be more confident that the result wasn't just due to the sample demographics. Still, whenever you see conclusions about a demographic group in a poll, you should ask yourself (and the pollster) how many people likely formed the basis for that generalization. It might well be a smaller number than you think.