Thursday, February 9, 2023

What Drives Sex Toy Popularity? A Morphological Examination of Vaginally-Insertable Products Sold by the World’s Largest Sexual Wellness Company

What Drives Sex Toy Popularity? A Morphological Examination of Vaginally-Insertable Products Sold by the World’s Largest Sexual Wellness Company. Sarah E. Johns & Nerys Bushnell. The Journal of Sex Research, Feb 7 2023. https://doi.org/10.1080/00224499.2023.2175193

Abstract: There is limited research into the morphology of sex toys, and specifically into (the often phallic-shaped) vibrators and dildos and what they may represent in terms of user preferences for male genital morphology. This study provides insight into consumer preference around vaginally insertable sex toys, their features, and what contributes to their popularity. Using a data set compiling information from the world’s largest online sexual wellness retailer Lovehoney, we examined the dimensions, price, and morphological features of 265 sex toys designed for vaginal insertion to determine what contributes to item popularity. Using regression models, we found that realistic features did not predict item popularity, whereas price (p < .001) and circumference (p = .01) significantly predicted the overall popularity of a toy. It appears that consumers show a preference for insertable sex toys that are not direct replicas of the male penis, which suggests they are not seeking a realistic partner substitute. Further, we found that the length of the toy did not significantly predict popularity which is consistent with other work showing that women do not place considerable emphasis on large phallus size. Our results can contribute to future product design and marketing, as well as reveal preferences toward particular characteristics of the phallus (whether real or toy).

Discussion

Contrary to what we expected (and contrary to Döring et al., 2022), we found no preference for products with realistic morphological features, other than the presence of veiny texture, when not controlling for price. This may be a consequence of levels of stigma and taboo still associated with specifically insertive (thus penis substitute) sex toy use by women (Fahs & Swank, 2013; Minge & Zimmerman, 2009; Waskul & Anklan, 2020), and recent research has concluded that “perceived stigma” among users is higher when the erotic technology being engaged with is more human-like (Dubé et al., 2022). Highly realistic products may make women (and potentially their partners (Ronen, 2021)) feel less comfortable given they are truly “penis substitutes” rather than being a fun, vaginally insertable toy; women have reported that they most often chose sex toys which were specifically intended to not resemble a penis (Fahs & Swank, 2013). There has been a move away from the marketing of such toys since the ensuing popularity of the (more abstract and less “obscene”) “Rabbit” during the 1990s (Devlin, 2018). A nonrealistic phallus might also be more acceptable and less threatening for men wanting to integrate a toy into their sex lives (Ronen, 2021), possibly related to the prevalent media narrative that only sad, lonely men would have sex with an “artificial partner” (Dubé et al., 2022), and also the concept of “dildo-envy” (Reich, 1992), whereby the dildo is viewed as superior to the “flawed organic penis” (Hamming, 2001, p. 331). However, the variable stigma between types of erotic technology and the relation of this to gender has yet to be empirically quantified (Dubé et al., 2022).

Product popularity was also significantly predicted by price. More expensive items were found to be less popular when accounting for a range of other morphological features. Realistic features of a product (a natural skin color) were related to its price, with this feature increasing cost. The sex toy industry is part of capitalist consumer culture (Döring, 2021), so it perhaps comes as little surprise that price is influential in consumer choice. It appears that customer choice is not based on morphological attributes alone and that item cost is a considerable factor. If realistic features on models predict a higher price, this may further deter customer purchase of anatomically realistic toys.

As described in a study by Gallup et al. (2003), sex toys have been previously used in research into the evolution of human sexual anatomy, where they were employed as a proxy for human male genitals in an experimental condition to test whether the presence of a coronal ridge contributes to efficient removal of (a purported rival) semen from the vagina. Interestingly, in our study, the presence of the penile glans or a coronal ridge was not a feature which significantly predicted sex toy popularity, suggesting that users are not concerned with matching their sex toy to certain aspects of the penis which could play a role in sperm competition. Our results may suggest that penile glans/coronal ridge does not appear to have an influential role in sexual satisfaction or preference as it was not a favorable morphological feature. Such a feature also increases product realism, which again, might be less desirable for women wanting an insertable fun toy rather than a realistic partner substitute.

Our results further highlight that women may not be simply seeking a large phallus size as could be assumed given the sociocultural influences around this being a desirable trait (Sharp & Oates, 2019). Certainly, men often feel anxiety around, and dissatisfaction with their penis, citing societal pressures around the idea that bigger is better (Francken et al., 2002). This has led some men to consider or use surgical processes to enhance penis size (Mondaini et al., 2002). However, we found that for toys at least, although circumference was influential in predicting product popularity, insertable toys of a larger girth in our sample were less popular, while length was non-significant. There appears to be an emphasis on offering slightly larger than average phallus products, and yet products with larger circumferences were not as popular as less girthy models. In our sample, the 5 most popular products had a mean circumference of 4.85 inches which is just above the average circumference for real penises. Our findings are very similar to results reported by Herbenick et al. (2015) who noted that insertable sex toys replicated, on average, real penile dimensions. It may be the case that consumers prefer a slightly larger than average phallus circumference when purchasing an insertable sex toy online but there is a cutoff point where extremely large models are more niche than the average user desires. The findings of this study suggest that online sex shops could consider offering a greater selection of insertable sex toys in average, to slightly above average, sizes given that larger toy circumference predicted reduced item popularity.
Research on female attitudes to penis dimensions supports this: when asked whether penis length and girth were important, only 20.6% of women believed them to be, with the remainder considering them unimportant (Francken et al., 2002), while another study found that 85% of the women they asked were satisfied with the size of their partner’s penis (Lever et al., 2006). Generally, it appears that women do not place considerable emphasis on very large penis size, with women preferring penises to be only slightly larger than average (Prause et al., 2015).

We were also surprised that an additional vibrating functionality did not predict item popularity. Women who use vibrating toys are able to incorporate direct clitoral stimulation into their sexual activity to help them to achieve orgasm (Döring, 2021). Vibrator use is also positively linked with improving sexual function by increasing lubrication, arousal, and orgasm, and can help with pain reduction (Herbenick et al., 2009). Given vibrators (sensu lato) are the most commonly used sex toy, have a long history, and are frequently publicly endorsed given the known orgasm gap (Mahar et al., 2020) and the difficulty of many women to achieve orgasm through vaginal penetration alone (Lloyd, 2005), they are perhaps more socially acceptable than dildos solely designed for insertion. It is possible that purchasers of penetrative toys preferred items with straightforward insertable functionality rather than a combined insertable phallic-shaped vibrator, given that there are specific vibrating products designed only for clitoral stimulation but that can be used in conjunction, if desired, with insertable phallus-shaped models.

Overall, our results show that consumers prefer sex toys which, although suitable for vaginal insertion, are not a direct proxy of a penis. This supports previously reported feminist views of phallic-shaped sex toy use – women can simultaneously reclaim penetrative sex without having this suggested symbol of patriarchal power in their possession (e.g., Fahs & Swank, 2013). These results are also potentially supported by the emergence of the “personified” sex toy market (which incorporates sex dolls and sex robots) which may give users a more emotional and expansive masturbatory experience compared to a disembodied phallus, although most research in this area has focused on a narrow demographic of consumers (Hanson & Locatelli, 2022).

We also must acknowledge, in light of our framing and results, that sex toy manufacturers might not be truly interested in women’s anatomy or true preferences. Manufacturers could be using outdated and stereotyped aspects of the female body to inform the design and marketing of products. This may shape what is available to purchase, or indeed the purchases themselves, with women thinking they “ought” to like something (for example, a toy to stimulate the G Spot – which may not exist as a defined anatomical structure (Hines, 2001)). However, given our robust method of considering three different, consumer-led measures, we would argue that item popularity in this study is a measure of “true” enjoyment from use, albeit from items that are currently commercially available. That said, the range of insertable products (both realistic and not) available through the Lovehoney UK website is quite varied and unrestricted. We would, however, urge caution and consideration here for any similar research conducted outside of Europe, in more restricted markets, as location will likely impact the availability of some toys, and thus influence, or indeed mask, preference.

Limitations and Future Directions

There were a few limitations to this study. Firstly, our average penis size comparison measurement relied only on one source (Veale et al., 2015), and we recognize that this study itself might not be accurate given the many methodological difficulties in determining average penis size (e.g., no standardization across studies, possible inaccuracy in self-reports) so some caution is perhaps warranted when considering presented results comparing our sample to “reality.” Future studies in this area could be made more informative by collating data from multiple websites, so that the results could be applied globally; as the data were collected from a UK-based domain, we were likely only seeing the preferences of a British population (although, as mentioned above, the UK sex toy market is highly unrestricted). We also cannot be sure that all consumers contributing to the popularity rating were women (and indeed women with an attraction to penises), as men are also able to purchase and review dildos. However, women are the primary consumer focus and Ronen (2021) reported that men are rarely users of sex toys. A scan of the first 3 pages of user comments below the 5 most popular items suggested that women were the primary reviewers. A textual analysis of such reviews on Lovehoney (similar to that carried out by Döring et al., 2022) could be an interesting follow-up study. The consumer demographic is also unknown so factors such as age, sexual orientation and background were not accounted for and may be influential to customer choice. Therefore, we have made some heteronormative assumptions about users that may not be entirely inclusive or accurate given lesbian women and men are also insertable sex toy users and consumers.

Wednesday, February 8, 2023

Large array of triangulating evidence from 12 studies & over 8,000 participants from the U.S. and over 66,000 participants world-wide strongly suggests that left-wing authoritarianism is much closer to a reality than a myth

Is the myth of left-wing authoritarianism itself a myth? Lucian Gideon Conway III, Alivia Zubrod, Linus Chan, James D. McFarland and Evert Van de Vliert. Front. Psychol., February 8 2023, Volume 13 - 2022. https://doi.org/10.3389/fpsyg.2022.1041391

Abstract: Is left-wing authoritarianism (LWA) closer to a myth or a reality? Twelve studies test the empirical existence and theoretical relevance of LWA. Study 1 reveals that both conservative and liberal Americans identify a large number of left-wing authoritarians in their lives. In Study 2, participants explicitly rate items from a recently-developed LWA measure as valid measurements of authoritarianism. Studies 3–11 show that persons who score high on this same LWA scale possess the traits associated with models of authoritarianism: LWA is positively related to threat sensitivity across multiple areas, including general ecological threats (Study 3), COVID disease threat (Study 4), Belief in a Dangerous World (Study 5), and Trump threat (Study 6). Further, high-LWA persons show more support for restrictive political correctness norms (Study 7), rate African-Americans and Jews more negatively (Studies 8–9), and show more cognitive rigidity (Studies 10 and 11). These effects hold when controlling for political ideology and when looking only within liberals, and further are similar in magnitude to comparable effects for right-wing authoritarianism. Study 12 uses the World Values Survey to provide cross-cultural evidence of Left-Wing Authoritarianism around the globe. Taken in total, this large array of triangulating evidence from 12 studies comprised of over 8,000 participants from the U.S. and over 66,000 participants world-wide strongly suggests that left-wing authoritarianism is much closer to a reality than a myth.

8. General discussion

Is left-wing authoritarianism a viable construct that predicts important real-world phenomena? Across 12 studies spanning over 8,000 participants in the U.S. and over 66,000 participants worldwide, our data consistently reveal the answer is yes. These data reveal that (1) both liberal and conservative American participants identify a large number of left-wing authoritarians in their everyday lives (Study 1), and (2) both liberal and conservative participants rate a common Left-Wing Authoritarianism scale as measuring authoritarianism (Study 2). Further, this same LWA scale (3) consistently predicts key phenomena that major authoritarianism theories suggest it should predict, including (3a) threat sensitivity (Studies 3–6), (3b) restrictive communication norms (Study 7), (3c) negative ratings of minority groups (Studies 8–9), and (3d) dogmatism (Studies 10 and 11). Further, we used multiple methods to help overcome the double-barreled measurement problem inherent in any authoritarianism measurement, including controlling directly for ideology (Studies 3–11) and performing analyses only on liberals (Studies 3–11). Finally, we (4) found evidence of left-wing authoritarianism in an expansive world-wide sample (Study 12). Each of these approaches has offsetting strengths and weaknesses, and yet they all point to the same conclusion: This wide array of triangulating evidence provides consistent support for the idea that left-wing authoritarianism is indeed a widespread everyday reality.

Below, we place this array of evidence into the existing literature on authoritarianism and ideology, discuss limitations of our work, and offer a brief set of concluding thoughts.

8.1. The authoritarianism debate

The present studies have multiple implications for the ongoing debate about the nature of authoritarianism. Nilsson and Jost (2020) have argued that prior evidence based on Conway et al.’s (2018) LWA scale was due to its overlap with liberal ideology, and thus it did not provide empirical evidence of liberal authoritarianism. The issue raised by this critique is important. What do more focused empirical tests – tests based in long-accepted scientific practice – reveal? Our multi-method evidence here suggests that, in fact, the scale is measuring something beyond mere liberalism. Almost all key effects across Studies 3–11 remain when controlling for political ideology. Further, in a similar fashion, almost all key effects remain within-liberals: Thus, when comparing liberal authoritarians to liberal non-authoritarians, high-LWA persons show conceptually-expected correlations. As a result, the scale differentiates one kind of liberal from another kind, and thus cannot be reduced to mere ideology.

This array of evidence overwhelmingly suggests that, contrary to critics’ claims, there is something beyond mere ideology captured by the LWA scale. What is that something beyond? Consistent with a long line of research on RWA, by far the most parsimonious answer to that question is that the something beyond is authoritarianism. And indeed, using standard content validity approaches also used in other authoritarianism work (e.g., Funke, 2005; Dunwoody and Funke, 2016), Study 2 showed that participants evaluate the items in Conway’s LWA scale as measurements of authoritarianism. This strong empirical evidence is echoed in the judgments of researchers Fasce and Avendaño (2020, p. 3), who commented that the items on Conway et al.’s LWA scale “are not merely statements of liberal ideology; they univocally reflect an extremely authoritarian attitude, opposed to liberal commitments such as equality among citizens, freedom of expression, and tolerance toward political and cultural diversity.”

Taken together, this array of triangulating evidence points to the conclusion that – as is the case for the scientific consensus on the Altemeyer RWA scale on which it was based – Conway et al.’s LWA scale is a valid measurement of authoritarianism.

8.2. Limitations

Like all studies, the present study has limitations. First, although employing much larger and more diverse samples than most previous work on authoritarianism, Studies 1–11 (like much prior authoritarianism research) are nonetheless limited to the United States and should not be taken to generalize beyond that region.

Further, as other researchers have noted (Nilsson and Jost, 2020), the Conway et al. (2018) scale on which Studies 2–11 are based is not perfect. However, essentially all critiques of individual items on the scale hinge on the argument that these items do not measure anything beyond left-wing ideology. As such, all these smaller critiques are best addressed with triangulating empirical evidence that the whole collection of items – used in the way originally intended by the authors of the scale, as a total summative measure – is in fact capturing something beyond mere ideology. Evidence that the whole scale is valid suggests at a minimum that the collection of items as a whole is valid – and thus directly suggests there is no systemic problem with items interfering with the validity of the scale. It is just that kind of whole-scale validity evidence that has been supplied across multiple studies in the present package. This empirical approach mirrors the approach in other domains when critiques arise of the empirical validity of particular theoretical constructs (e.g., Banaji et al., 2004).

However, we acknowledge that Conway et al.’s (2018) LWA scale, like all scales, is not perfect and thus does of course have room for improvement (Conway, 2020). But saying a scale is imperfect is not the same as saying a scale is invalid. All measurements contain imperfections and all studies contain messiness, and yet that should not deter us from bigger-picture research conclusions (Cooper, 2016). Thus, we acknowledge the facts that (a) like virtually every scale, the LWA scale could be improved, and (b) as a scale designed to parallel the most widely-used RWA scale, it inherited some of that scale’s weaknesses. However, this lack of perfection should not be confused with the larger, big-picture issue of the degree that it can be construed as a valid measurement of left-wing authoritarianism. The overwhelming amount of evidence across multiple studies speaks clearly: It can be accurately viewed as a measurement of left-wing authoritarianism.

Adolescents today are just as politically polarized as adults; & they are much more likely to share their parents' political orientation than they were 4 decades ago

Learning to Dislike Your Opponents: Political Socialization in the Era of Polarization. Matthew Tyler, Shanto Iyengar. American Political Science Review, Volume 117, Issue 1, February 2023, online May 4 2022, pp. 347 - 354. https://doi.org/10.1017/S000305542200048X

Abstract: Early socialization research dating to the 1960s showed that children could have a partisan identity without expressing polarized evaluations of political leaders and institutions. We provide an update to the socialization literature by showing that adolescents today are just as polarized as adults. We compare our findings to a landmark 1980 socialization study and show that distrust in the opposing party has risen sharply among adolescents. We go on to show that the onset of polarization in childhood is predicted by parental influence; adolescents who share their parents’ identity and whose parents are more polarized are apt to voice polarized views.

CONCLUSION

We have shown that the onset of partisan polarization occurs early in the life cycle. Today, high levels of in-group favoritism and out-group distrust are in place well before early adulthood. In fact, the absence of age differences in our 2019 results suggests that the learning curve for polarization plateaus by the age of 11. This is very unlike the developmental pattern that held in the 1970s and 1980s, when early childhood was characterized by blanket positivity toward political leaders and partisanship gradually intruded into the political attitudes of adolescents before peaking in adulthood.

When we considered the antecedents of children’s trust in the parties, our findings confirm the earlier literature documenting the primacy of the family as an agent of socialization (Jennings and Niemi 1968; Jennings, Stoker, and Bowers 2009; Tedin 1974). Polarized parents seem to transmit not only their partisanship but also their animus toward opponents. It is striking that the least-polarized youth respondents in 2019 are those who have not adopted their parental partisan loyalty.

In closing, our findings have important implications for the study of political socialization. Fifty years ago, political socialization was thought to play a stabilizing role important to the perpetuation of democratic norms and institutions. In particular, children’s adoption of uncritical attitudes toward political leaders helped to legitimize the entire democratic regime. Indeed, researchers cited this “functional” role of socialization in justifying the study of political attitudes in childhood (Kinder and Sears 1985; van Deth, Abendschön, and Vollmar 2011).

In the current era, it seems questionable whether the early acquisition of out-party animus fosters democratic norms and civic attitudes. Extreme polarization is now associated with rampant misinformation (Peterson and Iyengar 2021) and, as indicated by the events that occurred in the aftermath of the 2020 election, with willingness to reject the outcome of free and fair electoral procedures. The question for future research is how to transmit party attachments, as occurred in the prepolarization era, without the accompanying distrust and disdain for political opponents.

Tuesday, February 7, 2023

Females had higher dating intention with men who shared COVID-19 vaccine similarity while men didn't care about female vaccination status

Lovesick: The Effects of Political Partisanship and COVID-19 Vaccine Perceptions on Online Romantic Partner Selection. Caleb R. Seymour. M.A. Thesis, University of Missouri-Columbia, Jul 2022. https://mospace.umsystem.edu/xmlui/bitstream/handle/10355/94028/SeymourCalebResearch.pdf

Abstract: Many studies have reported the positive relationship of perceived political similarity with dating intention in the world of online dating. However, there are currently no studies which consider this relationship alongside coronavirus disease (COVID-19) vaccine status and their combined influence on romantic consideration. In this study, we conduct a posttest-only design with a 2 (vaccinated) x 2 (political affiliation) x 2 (gender) online experiment, including variables such as vaccine perceptions, party identification, sensation seeking, and dating intention. Participants (N=97) were shown four avatar profiles of the opposite sex; each profile was displayed as vaccinated or unvaccinated and Democrat or Republican. Once exposed to these dating profiles, subjects answered a survey to determine how individual dating intention differed in relation to the subject’s own political affiliation and “vaccination status.” The results indicate that males and females have higher dating intention with partners that have political similarity. However, females have higher dating intention with partners who share vaccine similarity while males show no relationship between vaccine similarity and dating intention. The implications of these findings may suggest that the formation of romantic relationships is currently influenced by personal health decisions compared to the decisions of potential online partners; this being a symptom of a much larger degree of affective polarization in the United States which continues to grow.


Many things we think we know about the placebo effect are actually unsubstantiated

Replication crisis and placebo studies: rebooting the bioethical debate. Charlotte Blease, Ben Colagiuri, Cosima Locher. Journal of Medical Ethics, Jan 6 2023. https://jme.bmj.com/content/early/2023/01/05/jme-2022-108672

Abstract: A growing body of cross-cultural survey research shows high percentages of clinicians report using placebos in clinical settings. One motivation for clinicians using placebos is to help patients by capitalising on the placebo effect’s reported health benefits. This is not surprising, given that placebo studies are burgeoning, with increasing calls by researchers to ethically harness placebo effects among patients. These calls propose placebos/placebo effects offer clinically significant benefits to patients. In this paper, we argue many findings in this highly cited and ‘hot’ field have not been independently replicated. Evaluating the ethicality of placebo use in clinical practice involves first understanding whether placebos are efficacious clinically. Therefore, it is crucial to consider placebo research in the context of the replication crisis and what can be learnt to advance evidence-based knowledge of placebos/placebo effects and their clinical relevance (or lack thereof). In doing so, our goal in this paper is to motivate both increased awareness of replication issues and to help pave the way for advances in scientific research in the field of placebo studies to better inform ethical evidence-based practice. We argue that, only by developing a rigorous evidence base can we better understand how, if at all, placebos/placebo effects can be harnessed ethically in clinical settings.


Sunday, February 5, 2023

Rolf Degen summarizing... The average effect sizes in a “null field” such as homeopathy are a good indicator of the extent to which the tunnel vision of the researchers involved alone can conjure up positive results

Homeopathy can offer empirical insights on treatment effects in a null field. Matthew K. Sigurdson, Kristin L. Sainani & John P.A. Ioannidis. Journal of Clinical Epidemiology, February 01, 2023. https://doi.org/10.1016/j.jclinepi.2023.01.010

Abstract

Objectives: A “null field” is a scientific field where there is nothing to discover and where observed associations are thus expected to simply reflect the magnitude of bias. We aimed to characterize a null field using a known example, homeopathy (a pseudoscientific medical approach based on using highly diluted substances), as a prototype.

Study design: We identified 50 randomized placebo-controlled trials of homeopathy interventions from highly-cited meta-analyses. The primary outcome variable was the observed effect size in the studies. Variables related to study quality or impact were also extracted.

Results: The mean effect size for homeopathy was 0.36 standard deviations (Hedges’ g; 95% CI: 0.21, 0.51) better than placebo, which corresponds to an odds ratio of 1.94 (95% CI: 1.69, 2.23) in favor of homeopathy. 80% of studies had positive effect sizes (favoring homeopathy). Effect size was significantly correlated with citation counts from journals in the Directory of Open Access Journals and CiteWatch. We identified common statistical errors in 25 studies.

Conclusion: A null field like homeopathy can exhibit large effect sizes, high rates of favorable results, and high citation impact in the published scientific literature. Null fields may represent a useful negative control for the scientific process.
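
A note on the arithmetic in the Results above: a standardized mean difference such as Hedges' g is commonly converted to an odds ratio with the logistic approximation ln(OR) = g × π / √3. The abstract does not say which conversion the authors used, so the sketch below is only an assumption about how the two figures relate; the function name is made up for illustration.

```python
import math


def smd_to_odds_ratio(d: float) -> float:
    """Convert a standardized mean difference to an odds ratio.

    Assumes the logistic-distribution approximation
    ln(OR) = d * pi / sqrt(3); this is an illustrative choice,
    not necessarily the method used in the paper.
    """
    return math.exp(d * math.pi / math.sqrt(3))


# Mean effect size as rounded in the abstract.
g = 0.36
print(round(smd_to_odds_ratio(g), 2))
```

With the rounded g = 0.36 this gives roughly 1.92, close to the reported 1.94; the small gap may simply reflect rounding of g before publication.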


While overall income inequality rose over the past 5 decades, the rise in overall consumption inequality was small; the declining quality of income data likely contributes to these differences for the bottom of the distribution

Consumption and Income Inequality in the United States Since the 1960s. Bruce D. Meyer and James X. Sullivan. Journal of Political Economy, Feb 2023. https://doi.org/10.1086/721702

Abstract: Recent research concludes that the rise in consumption inequality mirrors, or even exceeds, the rise in income inequality. We revisit this finding, constructing improved measures of consumption, focusing on its well-measured components that are reported at a high and stable rate relative to national accounts. While overall income inequality rose over the past 5 decades, the rise in overall consumption inequality was small. The declining quality of income data likely contributes to these differences for the bottom of the distribution. Asset price changes likely account for some of the differences in recent years for the top of the distribution.


Messages generated by AI are persuasive across a number of policy issues, including weapon bans, a carbon tax, and a paid parental-leave program; participants rated the author of AI messages as being more factual and logical, but less angry & unique

Bai, Hui, Jan G. Voelkel, Johannes C. Eichstaedt, and Robb Willer. 2023. “Artificial Intelligence Can Persuade Humans on Political Issues.” OSF Preprints. February 5. doi:10.31219/osf.io/stakv

Abstract: The emergence of transformer models that leverage deep learning and web-scale corpora has made it possible for artificial intelligence (AI) to tackle many higher-order cognitive tasks, with critical implications for industry, government, and labor markets in the US and globally. Here, we investigate whether the currently most powerful, openly-available AI model – GPT-3 – is capable of influencing the beliefs of humans, a social behavior recently seen as a unique purview of other humans. Across three preregistered experiments featuring diverse samples of Americans (total N=4,836), we find consistent evidence that messages generated by AI are persuasive across a number of policy issues, including an assault weapon ban, a carbon tax, and a paid parental-leave program. Further, AI-generated messages were as persuasive as messages crafted by lay humans. Compared to the human authors, participants rated the author of AI messages as being more factual and logical, but less angry, unique, and less likely to use story-telling. Our results show the current generation of large language models can persuade humans, even on polarized policy issues. This work raises important implications for regulating AI applications in political contexts, to counter its potential use in misinformation campaigns and other deceptive political activities.


Continuing education workshops do not produce sustained skill development—quite the opposite; any modest improvement in performance erodes over time without further coaching

The implications of the Dodo bird verdict for training in psychotherapy: prioritizing process observation. Henny A. Westra. Psychotherapy Research, Dec 16 2022. https://doi.org/10.1080/10503307.2022.2141588

Abstract: Wampold et al.’s 1997 meta-analysis found that the true difference between bona fide psychotherapies is zero, supporting the Dodo bird conjecture that “All have won and must have prizes”. Two and a half decades later, the field continues to be slow to absorb this and similar uncomfortable discoveries. For example, entirely commensurate with Wampold’s conclusion is the meta-analytic finding that adherence to a given model of psychotherapy is unrelated to therapy outcomes (Webb et al., 2010). Despite the clear implication that theoretical models should not be the main lens through which psychotherapy is viewed if we are aiming to improve outcomes, therapists continue to identify themselves primarily by their theoretical orientation. And a major corollary of Wampold’s conclusions is that despite the evidence for non-superiority of a given model, our focus in training continues to be model-driven. This article seeks to elaborate the training implications of Wampold et al.’s conclusion, with a rationale and appeal to incorporate process-centered training.

Consider these similarly uncomfortable findings regarding the state of training. We assume, rather than verify, the efficacy of our training programs. Yet, there is no evidence that continuing education workshops, for example, produce sustained skill development—quite the opposite. Large effects on self-report are found but any modest improvement in performance erodes over time without further coaching (Madson et al., 2019). Perhaps most concerning, psychotherapists do not appear to improve with experience and in fact, the evidence suggests that skills may decline slightly over time (Goldberg et al., 2016). Not surprisingly then, while the number of model-based treatments has proliferated, the rate of client improvement has not followed suit (Miller et al., 2013). Could stagnant training methods be related to stagnant patient outcomes?

We need innovations in training that better align our training foci and methods with factors empirically supported as influencing client outcomes. Process researchers have long observed that trained process coders (typically for research purposes) make better therapists due to their enhanced attunement (e.g., Binder & Strupp, 1997). While such training is not yet available in training programs, it arguably should be based on emerging developments in the science of expertise (Ericsson & Pool, 2016) and the urgent need to bring outcome information forward in real time so that it can be used to make responsive adjustments to the process of therapy. In fact, such information could be considered “routine outcome monitoring in real time” (Westra & Di Bartolomeo, 2022).

To elaborate, Tracey et al. (2014) provocatively argued that acquiring expertise in psychotherapy may not even be possible. This is because the ability to predict outcomes is crucial to shaping effective performance. Yet there is a lack of feedback available to therapists regarding the outcomes of their interventions and such information, if it comes at all, comes too late to make a difference in the moment. Therapists are essentially like blind archers attempting to shoot at a target. The development of Routine Outcome Monitoring (ROM) measures capable of forecasting likely outcomes is a major advance in correcting this blindness and improving predictive capacity. However, in order to be effective for skill development, feedback needs to occur more immediately so that the relationship between the therapist action and the client response (or nonresponse) can be quickly ascertained and adjustments made in real time. Interestingly, while ROM has been helpful in improving failing cases, it has not been effective in enhancing clinical skills more generally (Miller et al., 2013).

Learning to preferentially attend to, extract and continuously integrate empirically supported process data may prove to be the elusive immediate feedback that has been lacking in psychotherapy training but that is crucial to developing expertise. Observable process data that has been validated through process science as differentiating good from poor patient outcomes, could be considered “little outcomes”; which in turn are related to session outcomes and ultimately treatment outcome (Greenberg, 1986). Moreover, thin-slicing research supports that it is possible to make judgements about important outcomes from even tiny slices of expressive behavior (Ambady & Rosenthal, 1992). If one considers real time process information as micro-outcomes, properly trained clinicians, just like expert-trained process coders, may no longer have to be blind. For example, a therapist trained to identify and monitor resistance and signals of alliance ruptures, can be continuously tracking these important phenomena and responsively adjusting to safeguard the alliance. Or a therapist who is sensitive to markers of low and high levels of experiencing (Pascual-Leone & Yeryomenko, 2017) and client ambivalence (Westra & Norouzian, 2018) can not only optimize the timing of their interventions but also continuously watch the client for feedback on the success of their ongoing efforts.

Being steeped in process research gives one a unique perspective on the promise of process observation to advance clinical training. Our lab recently took our first foray into studying practicing community therapists. As we coded the session videotapes, we became aware that we possessed a unique skill set that was absent in therapists’ test interviews. Therapists seemed to be guided solely by some model of how to bring about change but failed to simultaneously appreciate the ebb and flow of the relational context of the work. They seemed absorbed in their own moves (their model) but not aware that they were in a dance and must continually track and coordinate the process with their partner. It seemed that we had incidentally trained ourselves to detect and use these process signals. Our training was different and very unique; it was more akin to deliberate practice focused on discrimination training for detecting empirically supported processes.

In short, information capable of diagnosing the health of the process and critically, of forecasting eventual outcomes is arguably hiding in plain sight if one can acquire the requisite observational capacity to harvest it. And transforming an unpredictable environment into a predictable one makes expertise possible to acquire (Kahneman, 2011). Importantly, extracting such vital information relies on observational skill, rather than patient report, end of session measures, or longer-term outcome; thus, such real time data extraction is immediately accessible and can complement existing outcome monitoring (Westra & Di Bartolomeo, 2022). Moreover, process markers are often opaque; requiring systematic observational training for successful detection. Without proper discrimination and perceptual acuity training, this gilded information remains obscured. Thus, heeding Wampold et al.’s call to refocus our efforts must include innovations in training; innovations that harness outcome information. We need more process research to further uncover the immediately observable factors capable of differentiating poor and good outcomes, but existing process science gives us a good start. And since process-centered training is transtheoretical, it can exist alongside models of therapy—learning to see while doing (Binder & Strupp, 1997). Training in psychotherapy has primarily prioritized intervention (models) and now it may be time to emphasize observation.

Psychotherapeutic experience seems to be unrelated to patients’ change in pathology

Germer, S., Weyrich, V., Bräscher, A.-K., Mütze, K., & Witthöft, M. (2022). Does practice really make perfect? A longitudinal analysis of the relationship between therapist experience and therapy outcome: A replication of Goldberg, Rousmaniere, et al. (2016). Journal of Counseling Psychology, 69(5), 745–754. Jan 2023. https://doi.org/10.1037/cou0000608

Abstract: Experience is often regarded as a prerequisite of high performance. In the field of psychotherapy, research has yielded inconsistent results regarding the association between experience and therapy outcome. However, this research was mostly conducted cross-sectionally. A longitudinal study from the U.S. recently indicated that psychotherapists’ experience was not associated with therapy outcomes. The present study aimed at replicating the Goldberg, Rousmaniere, et al. (2016) study in the German healthcare system. Using routine evaluation data of a large German university psychotherapy outpatient clinic, the effect of N = 241 therapists’ experience on the outcomes of their patients (N = 3,432) was assessed longitudinally using linear and logistic multilevel modeling. Experience was operationalized using the number of days since the first patient of a therapist as well as using the number of patients treated beforehand. Outcome criteria were defined as change in general psychopathology as well as response, remission, and early termination. Several covariates (number of sessions per case, licensure, and main diagnosis) were also examined. Across all operationalizations of experience (time since first patient and number of cases treated) and therapy outcome (change in psychopathology, response, remission, and early termination), results largely suggest no association between therapists’ experience and therapy outcome. Preliminary evidence suggests that therapists need fewer sessions to achieve the same outcomes when they gain more experience. Therapeutic experience seems to be unrelated to patients’ change in psychopathology. This lack of findings is of importance for improving postgraduate training and the quality of psychotherapy in general.


Using the opinionated language model affected the opinions expressed in participants' writing and shifted their opinions in the subsequent attitude survey

Co-Writing with Opinionated Language Models Affects Users' Views. Maurice Jakesch, Advait Bhat, Daniel Buschek, Lior Zalmanson, Mor Naaman. arXiv Feb 1 2023. https://arxiv.org/abs/2302.00560


Abstract: If large language models like GPT-3 preferably produce a particular point of view, they may influence people's opinions on an unknown scale. This study investigates whether a language-model-powered writing assistant that generates some opinions more often than others impacts what users write - and what they think. In an online experiment, we asked participants (N=1,506) to write a post discussing whether social media is good for society. Treatment group participants used a language-model-powered writing assistant configured to argue that social media is good or bad for society. Participants then completed a social media attitude survey, and independent judges (N=500) evaluated the opinions expressed in their writing. Using the opinionated language model affected the opinions expressed in participants' writing and shifted their opinions in the subsequent attitude survey. We discuss the wider implications of our results and argue that the opinions built into AI language technologies need to be monitored and engineered more carefully.


Saturday, February 4, 2023

Above a threshold level of wage, an increase in intelligence is no longer associated with higher earnings

The plateauing of cognitive ability among top earners. Marc Keuschnigg, Arnout van de Rijt, Thijs Bol. European Sociological Review, jcac076, January 28 2023. https://doi.org/10.1093/esr/jcac076

Abstract: Are the best-paying jobs with the highest prestige done by individuals of great intelligence? Past studies find job success to increase with cognitive ability, but do not examine how, conversely, ability varies with job success. Stratification theories suggest that social background and cumulative advantage dominate cognitive ability as determinants of high occupational success. This leads us to hypothesize that among the relatively successful, average ability is concave in income and prestige. We draw on Swedish register data containing measures of cognitive ability and labour-market success for 59,000 men who took a compulsory military conscription test. Strikingly, we find that the relationship between ability and wage is strong overall, yet above €60,000 per year ability plateaus at a modest level of +1 standard deviation. The top 1 per cent even score slightly worse on cognitive ability than those in the income strata right below them. We observe a similar but less pronounced plateauing of ability at high occupational prestige.

Discussion

The empirical results lend support to our argument that cognitive ability plateaus at high levels of occupational success. Precisely in the part of the wage distribution where cognitive ability can make the biggest difference, its right tail, cognitive ability ceases to play any role. Cognitive ability plateaus around €60,000 at under a standard deviation above the mean. In terms of occupational prestige, it plateaus at a similar level above a job prestige of 70: The differences in the prestige between accountants, doctors, lawyers, professors, judges, and members of parliament are unrelated to their cognitive abilities.

A limitation of our study is that we do not account for effort or non-cognitive capacities—motivation, social skills, creativity, mental stability, and physical ability (Borghans et al., 2016). Cognitive ability is more relevant for some occupations than for others, and academia, for which it is arguably most relevant, is neither the best-paid nor the most prestigious professional field. Our results thus raise the question to what degree top wages are indicative of other, unobserved dimensions of ability. However, omission of effort and non-cognitive ability from the analysis is only problematic for our conclusions about the relationship between ability and success if there are theoretical arguments to be made that their effects dominate luck in the production of top income and prestige, either because their distributions have many extreme values or if there are strongly increasing returns.

Our analysis, further, is limited to a single country. Sweden may be seen as a conservative testing ground. In countries where higher education is less inclusive, one would expect an overall weaker relationship between labour-market success and ability (Breen and Jonsson, 2007). Namely, less income redistribution and steep tuition barriers to elite colleges may impede the flow of gifted individuals from lower classes into top jobs. On the other hand, higher net wages and greater social status at the top may attract more talent, and greater differentiation in college prestige elsewhere may allow firms to select on cognitive skills among those with a college degree by using elite affiliations as a proxy. Future research on different countries may seek to evaluate to what extent our findings generalize.

Third, we limit our analyses to native-born men. This is an unavoidable restriction of the data (women and immigrants were not enrolled in the military), and it is important to learn whether our findings generalize to the full working population. We invite further research that includes women and citizens from different ethnic backgrounds, and we call for careful adjustments in measuring occupational success for different cohorts in light of marked increases in female labour-force participation over time as well as in the share of the immigrant workforce and the varying disadvantages they face along different career paths in many countries. Such research could also explore potential variation in meritocracy regimes across social groups, connecting debates on gender equality and integration to quantitative studies of the relationship between success and ability.

Finally, our analysis was descriptive in nature and did not assess the proposed theoretical mechanism. An additional mechanism that may drive the plateauing of the success–ability relation at high wages is that brighter individuals select into more poorly remunerated occupational groups, even if within these groups the brightest are rewarded the highest wages. If these worse-paying jobs are of higher prestige, this could explain the weaker patterns we observed for the relationship between wage and occupational prestige. While we could not effectively explore the operation of this possible mechanism, future studies may be able to disentangle competing mechanisms through longitudinal analysis of educational and labour market trajectories.

Recent years have seen much academic and public discussion of rising inequality (e.g. Mankiw, 2013; Piketty, 2014; Alvaredo et al., 2017). In debates about interventions against large wage discrepancies, a common defence of top earners is the superior merit inferred from their job-market success using human capital arguments (Murray, 2003; Mankiw, 2013). However, along an important dimension of merit—cognitive ability—we find no evidence that those with top jobs that pay extraordinary wages are more deserving than those who earn only half those wages. The main takeaway of our analysis is thus the identification, both theoretically and empirically, of two regimes of stratification in the labour market. The bulk of citizens earn normal salaries that are clearly responsive to individual cognitive capabilities. Above a threshold level of wage, cognitive-ability levels are above average but play no role in differentiating wages. With relative incomes of top earners steadily growing in Western countries (Alvaredo et al., 2017), an increasing share of aggregate earnings may be allocated under the latter regime.

Listening to one’s most disliked music evokes a stress response that makes the whole body revolt

Merrill, Julia, Taren-Ida Ackermann, and Anna Czepiel. 2023. “The Negative Power of Music: Effects of Disliked Music on Psychophysiology.” PsyArXiv. February 2. doi:10.31234/osf.io/6escn

Abstract: While previous research has shown the positive effects of music listening in response to one’s favorite music, the negative effects of one’s most disliked music have not gained much attention. Contra to studies on musical chills, in the current study, participants listened to three self-selected disliked musical pieces which evoked highly unpleasant feelings. As a contrast, three musical pieces were individually selected for each participant based on neutral liking ratings they provided on other participants’ music. During music listening, real-time ratings of subjective (dis)pleasure and simultaneous recordings of peripheral measures were obtained. Results show that compared to neutral music, listening to disliked music evokes physiological reactions reflecting higher arousal (heart rate, skin conductance response, body temperature), disgust (levator labii muscle), anger (corrugator supercilii muscle), distress and grimacing (zygomaticus major muscle). The differences between conditions were most prominent during “very unpleasant” real-time ratings, showing peak responses for the disliked music. Hence, disliked music leads to a strong response of physiological arousal and facial expression, reflecting the listener’s attitude toward the music and the physiologically strenuous effect of listening to one’s disliked music.


Rolf Degen summarizing... Unlike a machine, in which dedicated components are entrusted with fixed functions, the brain operates more like a complex dynamic system in which changing coalitions of neurons can perform varying tasks depending on the context

Improving the study of brain-behavior relationships by revisiting basic assumptions. Christiana Westlin et al. Trends in Cognitive Sciences, February 2 2023. https://doi.org/10.1016/j.tics.2022.12.015

Highlights

The study of brain-behavior relationships has been guided by several foundational assumptions that are called into question by empirical evidence from human brain imaging and neuroscience research on non-human animals.

Neural ensembles distributed across the whole brain may give rise to mental events rather than localized neural populations. A variety of neural ensembles may contribute to one mental event rather than one-to-one mappings. Mental events may emerge as a complex ensemble of interdependent signals from the brain, body, and world rather than from neural ensembles that are context-independent.

A more robust science of brain-behavior relationships awaits if research efforts are grounded in alternative assumptions that are supported by empirical evidence and which provide new opportunities for discovery.


Abstract: Neuroimaging research has been at the forefront of concerns regarding the failure of experimental findings to replicate. In the study of brain-behavior relationships, past failures to find replicable and robust effects have been attributed to methodological shortcomings. Methodological rigor is important, but there are other overlooked possibilities: most published studies share three foundational assumptions, often implicitly, that may be faulty. In this paper, we consider the empirical evidence from human brain imaging and the study of non-human animals that calls each foundational assumption into question. We then consider the opportunities for a robust science of brain-behavior relationships that await if scientists ground their research efforts in revised assumptions supported by current empirical evidence.


Keywords: brain-behavior relationships; whole-brain modeling; degeneracy; complexity; variation


Concluding remarks

Scientific communities tacitly agree on assumptions about what exists (called ontological commitments), what questions to ask, and what methods to use. All assumptions are firmly rooted in a philosophy of science that need not be acknowledged or discussed but is practiced nonetheless. In this article, we questioned the ontological commitments of a philosophy of science that undergirds much of modern neuroscience research and psychological science in particular. We demonstrated that three common commitments should be reconsidered, along with a corresponding course correction in methods (see Outstanding questions). Our suggestions require more than merely improved methodological rigor for traditional experimental design (Box 1). Such improvements are important, but may aid robustness and replicability only when the ontological assumptions behind those methods are valid. Accordingly, a productive way forward may be to fundamentally rethink what a mind is and how a brain works. We have suggested that mental events arise from a complex ensemble of signals across the entire brain, as well as from the sensory surfaces of the body that inform on the states of the inner body and outside world, such that more than one signal ensemble maps to a single instance of a single psychological category (maybe even in the same context [51,56]). To this end, scientists might find inspiration by mining insights from adjacent fields, such as evolution, anatomy, development, and ecology (e.g., [123,124]), as well as cybernetics and systems theory (e.g., [125,126]). At stake is nothing less than a viable science of how a brain creates a mind through its constant interactions with its body, its physical environment, and with the other brains-in-bodies that occupy its social world.

Outstanding questions

Well-powered brain-wide analyses imply that meaningful signals exist in brain regions that are considered nonsignificant in studies with low within-subject power, but is all of the observed brain activity necessarily supporting a particular behavior? By thresholding out weak yet consistent effects, are we removing part of the complex ensemble of causation? What kinds of technical innovations or novel experimental methods would allow us to make progress in answering this question?

How might we incorporate theoretical frameworks, such as a predictive processing framework, to better understand the involvement of the whole-brain in producing a mental event? Such an approach hypothesizes the involvement of the whole-brain as a general computing system, without implying equipotentiality (i.e., that all areas of the brain are equally able to perform the same function).

Why are some reported effects (e.g., the Stroop effect) seemingly robust and replicable if psychological phenomena are necessarily degenerate? These effects should be explored to determine if they remain replicable outside of constrained laboratory contexts and to understand what makes them robust.

Given that measuring every signal in a complex system is unrealistic given the time and cost constraints of a standard neuroimaging experiment, how can we balance the measurement of meaningful signals in the brain, body, and world with the practical realities of experimental constraints?

Is the study of brain-behavior relationships actually in a replication crisis? And if so, is it merely a crisis of method? Traditional assumptions suggest that scientists should replicate sample summary statistics and tightly control variation in an effort to estimate a population summary statistic, but perhaps this goal should be reconsidered.

Friday, February 3, 2023

Within the internet there exists the 90-9-1 principle (also called the 1% rule), which dictates that a vast majority of user-generated content in any specific community comes from the top 1% active users, with most people only listening in

Vuorio, Valtteri, and Zachary Horne. 2023. “A Lurking Bias: Representativeness of Users Across Social Media and Its Implications for Sampling Bias in Cognitive Science.” PsyArXiv. February 2. doi:10.31234/osf.io/n5d9j

Abstract: Within the internet there exists the 90-9-1 principle (also called the 1% rule), which dictates that a vast majority of user-generated content in any specific community comes from the top 1% active users, with most people only listening in. When combined with other demographic biases among social media users, this casts doubt as to how well these users represent the wider world, which might be problematic considering how user-generated content is used in psychological research and in the wider media. We conduct three computational studies using pre-existing datasets from Reddit and Twitter; we examine the accuracy of the 1% rule and what effect this might have on how user-generated content is perceived by performing and comparing sentiment analyses between user groups. Our findings support the accuracy of the 1% rule, and we report a bias in sentiments between low- and high-frequency users. Limitations of our analyses will be discussed.


Contrary to this ideal, we found a negative association between media coverage of a paper and the paper’s likelihood of replication success = deciding a paper’s merit based on its media coverage is unwise

A discipline-wide investigation of the replicability of Psychology papers over the past two decades. Wu Youyou, Yang Yang, and Brian Uzzi. Proceedings of the National Academy of Sciences, January 30, 2023, 120 (6) e2208863120. https://doi.org/10.1073/pnas.2208863120


Significance: The number of manually replicated studies falls well below the abundance of important studies that the scientific community would like to see replicated. We created a text-based machine learning model to estimate the replication likelihood for more than 14,000 published articles in six subfields of Psychology since 2000. Additionally, we investigated how replicability varies with respect to different research methods, authors’ productivity, citation impact, and institutional prestige, and a paper’s citation growth and social media coverage. Our findings help establish large-scale empirical patterns on which to prioritize manual replications and advance replication research.


Abstract: Conjecture about the weak replicability in social sciences has made scholars eager to quantify the scale and scope of replication failure for a discipline. Yet small-scale manual replication methods alone are ill-suited to deal with this big data problem. Here, we conduct a discipline-wide replication census in science. Our sample (N = 14,126 papers) covers nearly all papers published in the six top-tier Psychology journals over the past 20 y. Using a validated machine learning model that estimates a paper’s likelihood of replication, we found evidence that both supports and refutes speculations drawn from a relatively small sample of manual replications. First, we find that a single overall replication rate of Psychology poorly captures the varying degree of replicability among subfields. Second, we find that replication rates are strongly correlated with research methods in all subfields. Experiments replicate at a significantly lower rate than do non-experimental studies. Third, we find that authors’ cumulative publication number and citation impact are positively related to the likelihood of replication, while other proxies of research quality and rigor, such as an author’s university prestige and a paper’s citations, are unrelated to replicability. Finally, contrary to the ideal that media attention should cover replicable research, we find that media attention is positively related to the likelihood of replication failure. Our assessments of the scale and scope of replicability are important next steps toward broadly resolving issues of replicability.

Discussion

This research uses a machine learning model that quantifies the text in a scientific manuscript to predict its replication likelihood. The model enables us to conduct the first replication census of nearly all of the papers published in Psychology’s top six subfield journals over a 20-y period. The analysis focused on estimating replicability for an entire discipline with an interest in how replication rates vary by subfield, experimental and non-experimental methods, and the other characteristics of research papers. To remain grounded in human expertise, we verified the results with available manual replication data whenever possible. Together, the results further provide insights that can advance replication theories and practices.
A central advantage of our approach is its scale and scope. Prior speculations about the extent of replication failure are based on relatively small, selective samples of manual replications (21). Analyzing more than 14,000 papers in multiple subfields, we showed that replication success rates differ widely by subfields. Hence, not one replication failure rate estimated from a single replication project is likely to characterize all branches of a diverse discipline like Psychology. Furthermore, our results showed that subfield rates of replication success are associated with research methods. We found that experimental work replicates at significantly lower rates than non-experimental methods for all subfields, and subfields with less experimental work replicate relatively better. This finding is worrisome, given that Psychology’s strong scientific reputation is built, in part, on its proficiency with experiments.
Analyzing replicability alongside other metrics of a paper, we found that while replicability is positively correlated with researchers’ experience and competence, other proxies of research quality, such as an author’s university prestige and the paper’s citations, showed no association with replicability in Psychology. The findings highlight the need for both academics and the public to be cautious when evaluating research and scholars using pre- and post-publication metrics as proxies for research quality.
We also correlated media attention with a paper’s replicability. The media plays a significant role in creating the public’s image of science and democratizing knowledge, but it is often incentivized to report on counterintuitive and eye-catching results. Ideally, the media would have a positive relationship (or a null relationship) with replication success rates in Psychology. Contrary to this ideal, however, we found a negative association between media coverage of a paper and the paper’s likelihood of replication success. Therefore, deciding a paper’s merit based on its media coverage is unwise. It would be valuable for the media to remind the audience that new and novel scientific results are only food for thought before future replication confirms their robustness.
We envision two possible applications of our approach. First, the machine learning model could be used to estimate replicability for studies that are difficult or impossible to manually replicate, such as longitudinal investigations and special or difficult-to-access populations. Second, predicted replication scores could begin to help prioritize manual replications of certain studies over others in the face of limited resources. Every year, individual scholars and organizations like Psychological Science Accelerator (67) and Collaborative Replication and Education Project (68) encounter the problem of choosing from an abundance of Psychology studies which ones to replicate. Isager and colleagues (69) proposed that to maximize gain in replication, the community should prioritize replicating studies that are valuable and uncertain in their outcomes. The value of studies could be readily approximated by citation impact or media attention, but the uncertainty part is yet to be adequately measured for a large literature base. We suggest that our machine learning model could provide a quantitative measure of replication uncertainty.
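As a rough illustration of the second application, the sketch below ranks candidate studies for manual replication by combining a value proxy with an uncertainty term. It is not the authors' procedure, nor the one proposed by Isager and colleagues: the `Study` fields, the use of binary entropy as the uncertainty measure, and the product scoring rule are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Study:
    name: str               # hypothetical identifier
    citations: int          # assumed proxy for the study's value
    predicted_score: float  # assumed model output in [0, 1]


def uncertainty(p: float) -> float:
    """Binary entropy in bits: 1.0 at p = 0.5, 0.0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def priority(study: Study) -> float:
    """Illustrative rule: log-scaled value times outcome uncertainty."""
    return math.log1p(study.citations) * uncertainty(study.predicted_score)


def rank_for_replication(studies):
    """Return studies sorted from highest to lowest priority."""
    return sorted(studies, key=priority, reverse=True)


# Hypothetical inputs, not data from the paper.
studies = [
    Study("A", citations=500, predicted_score=0.95),
    Study("B", citations=500, predicted_score=0.50),
    Study("C", citations=20, predicted_score=0.50),
]
ranked = [s.name for s in rank_for_replication(studies)]
```

Under these assumptions, a highly cited study with a confident prediction (A) ranks below a little-cited study whose outcome is maximally uncertain (C), which reflects the idea that replication effort is best spent where the outcome is both valuable and hard to foresee.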
We note that our findings were limited in several ways. First, all papers we made predictions about came from top-tier journal publications. Future research could examine papers from lower-rank journals and how their replicability associates with pre- and post-publication metrics (70). Second, the estimates of replicability are only approximate. At the subfield level, five out of six subfields in our analysis were represented by only one top journal. A single journal does not capture the scope of the entire subfield. Future research could expand the coverage to multiple journals for one subfield or cross-check the subfield pattern derived using other methods (e.g., prediction markets). Third, the training sample used to develop the model used nearly all the manual replication data available, yet still lacked direct manual replication for certain psychology subfields. While we conducted a series of transfer learning analyses to ensure the model’s applicability beyond the scope of the training sample, implementation of the model in the subfields of Clinical Psychology and Developmental Psychology, where actual manual replication studies are scarce, should be done judiciously. For example, when estimating a paper’s replicability, we advise users to review a paper’s other indicators of replicability, like original study statistics, aggregated expert forecasts, or prediction markets. Nevertheless, our model can continue to be improved as more manual replication results become available.
Future research could go in several directions: 1) our replication scores could be combined with other methods like prediction markets (16) or non-text-based machine learning models (27, 28) to further refine estimates for Psychology studies; 2) the design of the study could be repeated to conduct replication censuses in other disciplines; and 3) the replication scores could be further correlated with other metrics of interest.
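To make the first direction concrete, here is a minimal sketch of one way a text-based replication score might be pooled with a prediction-market price. The paper does not specify a combination method; weighted averaging in log-odds space is a common pooling choice assumed here, and the function names, weights, and input values are hypothetical.

```python
import math


def logit(p: float) -> float:
    """Convert a probability to log-odds."""
    return math.log(p / (1 - p))


def sigmoid(x: float) -> float:
    """Convert log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))


def combine(estimates, weights=None, eps=1e-6):
    """Weighted average of probability estimates in log-odds space."""
    if not estimates:
        raise ValueError("at least one estimate is required")
    if weights is None:
        weights = [1.0] * len(estimates)
    if len(weights) != len(estimates):
        raise ValueError("weights and estimates must have equal length")
    # Clip away from 0 and 1 so the logit stays finite.
    clipped = [min(max(p, eps), 1 - eps) for p in estimates]
    pooled = sum(w * logit(p) for w, p in zip(weights, clipped)) / sum(weights)
    return sigmoid(pooled)


# Hypothetical values, not results from the paper.
text_model_score = 0.30
market_price = 0.60
combined = combine([text_model_score, market_price])
```

With equal weights the pooled estimate falls between the two inputs; unequal weights could express more trust in whichever source has the better track record, though how to set them is an open question that this sketch does not address.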
The replicability of science, which is particularly constrained in social science by variability, is ultimately a collective enterprise improved by an ensemble of methods. In his book The Logic of Scientific Discovery, Popper argued that “we do not take even our own observations quite seriously, or accept them as scientific observations, until we have repeated and tested them” (1). However, as true as Popper’s insight about repetition and repeatability is, it must be recognized that tests come with a cost of exploration. Machine learning methods paired with human acumen present an effective approach for developing a better understanding of replicability. The combination balances the costs of testing with the rewards of exploration in scientific discovery.