Friday, November 29, 2019

Hard Problems in Cryptocurrency: Cryptographic (expected to be solvable with purely mathematical techniques), consensus theory (improvements to proof of work and proof of stake), and economic

Hard Problems in Cryptocurrency: Five Years Later. Vitalik Buterin. Nov 22 2019. https://vitalik.ca/general/2019/11/22/progress.html

[Check original post for lots of links]

Special thanks to Justin Drake and Jinglan Wang for feedback

In 2014, I made a post and a presentation with a list of hard problems in math, computer science and economics that I thought were important for the cryptocurrency space (as I then called it) to be able to reach maturity. In the last five years, much has changed. But exactly how much progress on what we thought then was important has been achieved? Where have we succeeded, where have we failed, and where have we changed our minds about what is important? In this post, I'll go through the 16 problems from 2014 one by one, and see just where we are today on each one. At the end, I’ll include my new picks for hard problems of 2019.

The problems are broken down into three categories: (i) cryptographic, and hence expected to be solvable with purely mathematical techniques if they are to be solvable at all, (ii) consensus theory, largely improvements to proof of work and proof of stake, and (iii) economic, and hence having to do with creating structures involving incentives given to different participants, and often involving the application layer more than the protocol layer. We see significant progress in all categories, though some more than others.

Cryptographic problems

1  Blockchain Scalability

One of the largest problems facing the cryptocurrency space today is the issue of scalability ... The main concern with [oversized blockchains] is trust: if there are only a few entities capable of running full nodes, then those entities can conspire and agree to give themselves a large number of additional bitcoins, and there would be no way for other users to see for themselves that a block is invalid without processing an entire block themselves.

Problem: create a blockchain design that maintains Bitcoin-like security guarantees, but where the maximum size of the most powerful node that needs to exist for the network to keep functioning is substantially sublinear in the number of transactions.

Status: Great theoretical progress, pending more real-world evaluation.

Scalability is one technical problem on which we have made a huge amount of theoretical progress. Five years ago, almost no one was thinking about sharding; now, sharding designs are commonplace. Aside from Ethereum 2.0, we have OmniLedger, LazyLedger, Zilliqa and research papers seemingly coming out every month. In my own view, further progress at this point is incremental. Fundamentally, we already have a number of techniques that allow groups of validators to securely come to consensus on much more data than an individual validator can process, as well as techniques that allow clients to indirectly verify the full validity and availability of blocks even under 51% attack conditions.

These are probably the most important technologies:

.  Random sampling, allowing a small randomly selected committee to statistically stand in for the full validator set: https://github.com/ethereum/wiki/wiki/Sharding-FAQ#how-can-we-solve-the-single-shard-takeover-attack-in-an-uncoordinated-majority-model

.  Fraud proofs, allowing individual nodes that learn of an error to broadcast its presence to everyone else: https://bitcoin.stackexchange.com/questions/49647/what-is-a-fraud-proof

.  Proofs of custody, allowing validators to probabilistically prove that they individually downloaded and verified some piece of data: https://ethresear.ch/t/1-bit-aggregation-friendly-custody-bonds/2236

.  Data availability proofs, allowing clients to detect when the bodies of blocks that they have headers for are unavailable: https://arxiv.org/abs/1809.09044. See also the newer coded Merkle trees proposal.

There are also other smaller developments like cross-shard communication via receipts, as well as "constant-factor" enhancements such as BLS signature aggregation.
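To make the random sampling idea from the list above concrete, here is a minimal sketch of why a small committee can statistically stand in for the full validator set. It assumes a simple binomial model (each committee seat is filled independently, with an attacker with probability equal to the attacker's share of all validators); the function name and the one-half takeover threshold are illustrative assumptions, not part of any particular sharding spec.

```python
from math import comb, floor


def takeover_probability(committee_size: int,
                         attacker_fraction: float,
                         threshold: float = 0.5) -> float:
    """Probability that attackers hold MORE than `threshold` of the seats
    in a randomly sampled committee, under a binomial model."""
    # Smallest number of attacker seats that strictly exceeds the threshold.
    min_bad = floor(committee_size * threshold) + 1
    p = attacker_fraction
    return sum(
        comb(committee_size, k) * p**k * (1 - p) ** (committee_size - k)
        for k in range(min_bad, committee_size + 1)
    )


if __name__ == "__main__":
    # An attacker with one third of all validators: the chance of
    # controlling a committee majority shrinks quickly as the committee grows.
    for size in (10, 100, 400):
        print(size, takeover_probability(size, 1 / 3))
```

The point of the sketch is only the trend: with the attacker's share held fixed below the threshold, the takeover probability falls off exponentially in the committee size, which is what lets a few hundred sampled validators vouch for a shard.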

That said, fully sharded blockchains have still not been seen in live operation (the partially sharded Zilliqa has recently started running). On the theoretical side, there are mainly disputes about details remaining, along with challenges having to do with stability of sharded networking, developer experience and mitigating risks of centralization; fundamental technical possibility no longer seems in doubt. But the challenges that do remain are challenges that cannot be solved by just thinking about them; only developing the system and seeing Ethereum 2.0 or some similar chain running live will suffice.


2  Timestamping

Problem: create a distributed incentive-compatible system, whether it is an overlay on top of a blockchain or its own blockchain, which maintains the current time to high accuracy. All legitimate users have clocks in a normal distribution around some "real" time with standard deviation 20 seconds ... no two nodes are more than 20 seconds apart. The solution is allowed to rely on an existing concept of "N nodes"; this would in practice be enforced with proof-of-stake or non-sybil tokens (see #9). The system should continuously provide a time which is within 120s (or less if possible) of the internal clock of >99% of honestly participating nodes. External systems may end up relying on this system; hence, it should remain secure against attackers controlling < 25% of nodes regardless of incentives.

Status: Some progress.

Ethereum has actually survived just fine with a 13-second block time and no particularly advanced timestamping technology; it uses a simple technique where a client does not accept a block whose stated timestamp is later than the client's local time. That said, this has not been tested under serious attacks. The recent network-adjusted timestamps proposal tries to improve on the status quo by allowing the client to determine the consensus on the time in the case where the client does not locally know the current time to high accuracy; this has not yet been tested. But in general, timestamping is not currently at the foreground of perceived research challenges; perhaps this will change once more proof of stake chains (including Ethereum 2.0 but also others) come online as real live systems and we see what the issues are.
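A local timestamp sanity check of this kind can be sketched in a few lines. The sketch assumes the rule is to reject blocks whose timestamp lies too far in the future relative to the client's own clock, plus the usual requirement that a block be newer than its parent; the 15-second tolerance and all names here are hypothetical choices for illustration, not values taken from the post.

```python
import time
from typing import Optional

# Hypothetical tolerance for clock drift between honest nodes.
MAX_FUTURE_DRIFT_SECONDS = 15


def accept_timestamp(block_timestamp: int,
                     parent_timestamp: int,
                     local_time: Optional[float] = None) -> bool:
    """Return True if a block's stated timestamp passes the local checks."""
    now = time.time() if local_time is None else local_time
    # A block must be strictly newer than its parent.
    if block_timestamp <= parent_timestamp:
        return False
    # Reject blocks that claim to come from too far in the future.
    return block_timestamp <= now + MAX_FUTURE_DRIFT_SECONDS


if __name__ == "__main__":
    print(accept_timestamp(1013, 1000, local_time=1010))  # within tolerance
    print(accept_timestamp(1100, 1000, local_time=1010))  # too far ahead
```

Note that the check trusts the local clock entirely, which is exactly the weakness the network-adjusted timestamps proposal is meant to address.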


3  Arbitrary Proof of Computation

Problem: create programs POC_PROVE(P,I) -> (O,Q) and POC_VERIFY(P,O,Q) -> { 0, 1 } such that POC_PROVE runs program P on input I and returns the program output O and a proof-of-computation Q and POC_VERIFY takes P, O and Q and outputs whether or not Q and O were legitimately produced by the POC_PROVE algorithm using P.

Status: Great theoretical and practical progress.

This is basically saying, build a SNARK (or STARK, or SHARK, or...). And we've done it! SNARKs are now increasingly well understood, and are even already being used in multiple blockchains today (including tornado.cash on Ethereum). And SNARKs are extremely useful, both as a privacy technology (see Zcash and tornado.cash) and as a scalability technology (see ZK Rollup, STARKDEX and STARKing erasure coded data roots).

There are still challenges with efficiency; making arithmetization-friendly hash functions (see here and here for bounties for breaking proposed candidates) is a big one, and efficiently proving random memory accesses is another. Furthermore, there's the unsolved question of whether the O(n * log(n)) blowup in prover time is a fundamental limitation or if there is some way to make a succinct proof with only linear overhead as in bulletproofs (which unfortunately take linear time to verify). There are also ever-present risks that the existing schemes have bugs. In general, the problems are in the details rather than the fundamentals.


4  Code Obfuscation

The holy grail is to create an obfuscator O, such that given any program P the obfuscator can produce a second program O(P) = Q such that P and Q return the same output if given the same input and, importantly, Q reveals no information whatsoever about the internals of P. One can hide inside of Q a password, a secret encryption key, or one can simply use Q to hide the proprietary workings of the algorithm itself.

Status: Slow progress.

In plain English, the problem is saying that we want to come up with a way to "encrypt" a program so that the encrypted program would still give the same outputs for the same inputs, but the "internals" of the program would be hidden. An example use case for obfuscation is a program containing a private key where the program only allows the private key to sign certain messages.

A solution to code obfuscation would be very useful to blockchain protocols. The use cases are subtle, because one must deal with the possibility that an on-chain obfuscated program will be copied and run in an environment different from the chain itself, but there are many possibilities. One that personally interests me is the ability to remove the centralized operator from collusion-resistance gadgets by replacing the operator with an obfuscated program that contains some proof of work, making it very expensive to run more than once with different inputs as part of an attempt to determine individual participants' actions.

Unfortunately this continues to be a hard problem. There is ongoing work in attacking the problem, one side making constructions (eg. this) that try to reduce the number of assumptions on mathematical objects that we do not know practically exist (eg. general cryptographic multilinear maps) and another side trying to make practical implementations of the desired mathematical objects. However, all of these paths are still quite far from creating something viable and known to be secure. See https://eprint.iacr.org/2019/463.pdf for a more general overview of the problem.


5  Hash-Based Cryptography

Problem: create a signature algorithm relying on no security assumption but the random oracle property of hashes that maintains 160 bits of security against classical computers (ie. 80 vs. quantum due to Grover's algorithm) with optimal size and other properties.

Status: Some progress.

There have been two strands of progress on this since 2014. SPHINCS, a "stateless" (meaning, using it multiple times does not require remembering information like a nonce) signature scheme, was released soon after this "hard problems" list was published, and provides a purely hash-based signature scheme of size around 41 kB. Additionally, STARKs have been developed, and one can create signatures of similar size based on them. The fact that not just signatures, but also general-purpose zero knowledge proofs, are possible with just hashes was definitely something I did not expect five years ago; I am very happy that this is the case. That said, size continues to be an issue, and ongoing progress (eg. see the very recent DEEP FRI) is continuing to reduce the size of proofs, though it looks like further progress will be incremental.

The main not-yet-solved problem with hash-based cryptography is aggregate signatures, similar to what BLS aggregation makes possible. It's known that we can just make a STARK over many Lamport signatures, but this is inefficient; a more efficient scheme would be welcome. (In case you're wondering if hash-based public key encryption is possible, the answer is, no, you can't do anything with more than a quadratic attack cost)
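For readers who have not seen one, the Lamport signatures mentioned above are the simplest purely hash-based scheme and show where the size problem comes from. The following is a minimal sketch, assuming SHA-256 as the hash and 256-bit message digests; the function names are illustrative. It is a one-time scheme: signing two different messages with the same key reveals enough secret values to allow forgeries.

```python
import hashlib
import secrets


def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def lamport_keygen():
    """Secret key: 256 pairs of random 32-byte values. Public key: their hashes."""
    sk = [[secrets.token_bytes(32) for _ in range(2)] for _ in range(256)]
    pk = [[H(value) for value in pair] for pair in sk]
    return sk, pk


def message_bits(message: bytes):
    """The 256 bits of SHA-256(message), most significant bit first."""
    digest = H(message)
    return [(digest[i // 8] >> (7 - i % 8)) & 1 for i in range(256)]


def lamport_sign(sk, message: bytes):
    # For each digest bit, reveal the secret value selected by that bit.
    return [sk[i][bit] for i, bit in enumerate(message_bits(message))]


def lamport_verify(pk, message: bytes, signature) -> bool:
    if len(signature) != 256:
        return False
    bits = message_bits(message)
    return all(H(sig) == pk[i][bit]
               for i, (sig, bit) in enumerate(zip(signature, bits)))


if __name__ == "__main__":
    sk, pk = lamport_keygen()
    sig = lamport_sign(sk, b"hello")
    print(lamport_verify(pk, b"hello", sig))    # valid signature
    print(lamport_verify(pk, b"goodbye", sig))  # wrong message
    print(sum(len(part) for part in sig), "bytes per signature")
```

With these parameters a single signature is 256 values of 32 bytes each, i.e. 8 kB, and nothing in the construction lets many signatures be merged into one short object the way BLS aggregation does.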


Consensus theory problems

6  ASIC-Resistant Proof of Work

One approach at solving the problem is creating a proof-of-work algorithm based on a type of computation that is very difficult to specialize ... For a more in-depth discussion on ASIC-resistant hardware, see https://blog.ethereum.org/2014/06/19/mining/.

Status: Solved as far as we can.

About six months after the "hard problems" list was posted, Ethereum settled on its ASIC-resistant proof of work algorithm: Ethash. Ethash is known as a memory-hard algorithm. The theory is that random-access memory in regular computers is well-optimized already and hence difficult to improve on for specialized applications. Ethash aims to achieve ASIC resistance by making memory access the dominant part of running the PoW computation. Ethash was not the first memory-hard algorithm, but it did add one innovation: it uses pseudorandom lookups over a two-level DAG, allowing for two ways of evaluating the function. First, one could compute it quickly if one has the entire (~2 GB) DAG; this is the memory-hard "fast path". Second, one can compute it much more slowly (still fast enough to check a single provided solution quickly) if one only has the top level of the DAG; this is used for block verification.

Ethash has proven remarkably successful at ASIC resistance; after three years and billions of dollars of block rewards, ASICs do exist but are at best 2-5 times more power and cost-efficient than GPUs. ProgPoW has been proposed as an alternative, but there is a growing consensus that ASIC-resistant algorithms will inevitably have a limited lifespan, and that ASIC resistance has downsides because it makes 51% attacks cheaper (eg. see the 51% attack on Ethereum Classic).

I believe that PoW algorithms that provide a medium level of ASIC resistance can be created, but such resistance is limited-term and both ASIC and non-ASIC PoW have disadvantages; in the long term the better choice for blockchain consensus is proof of stake.


7  Useful Proof of Work

[M]aking the proof of work function something which is simultaneously useful; a common candidate is something like Folding@home, an existing program where users can download software onto their computers to simulate protein folding and provide researchers with a large supply of data to help them cure diseases.

Status: Probably not feasible, with one exception.

The challenge with useful proof of work is that a proof of work algorithm requires many properties:

.  Hard to compute
.  Easy to verify
.  Does not depend on large amounts of external data
.  Can be efficiently computed in small "bite-sized" chunks

Unfortunately, there are not many computations that are useful that preserve all of these properties, and most computations that do have all of those properties and are "useful" are only "useful" for far too short a time to build a cryptocurrency around them.
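The first two properties in the list, hard to compute but easy to verify, are easiest to see in a plain hash-based puzzle. This is a toy sketch in the style of hashcash, assuming SHA-256 and a difficulty expressed as a number of leading zero bits; it is not Ethash or any production algorithm, and the names are hypothetical. It also shows why such work is not "useful": the only output is a nonce.

```python
import hashlib
from itertools import count


def pow_hash(data: bytes, nonce: int) -> int:
    """Hash of the data plus an 8-byte nonce, interpreted as an integer."""
    digest = hashlib.sha256(data + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big")


def solve_pow(data: bytes, difficulty_bits: int) -> int:
    """Search for the first nonce whose hash has `difficulty_bits` leading zero bits.
    Expected cost: about 2**difficulty_bits hash evaluations."""
    target = 1 << (256 - difficulty_bits)
    for nonce in count():
        if pow_hash(data, nonce) < target:
            return nonce


def verify_pow(data: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Checking a claimed solution costs a single hash evaluation."""
    return pow_hash(data, nonce) < (1 << (256 - difficulty_bits))


if __name__ == "__main__":
    header = b"block header"
    nonce = solve_pow(header, 12)
    print(nonce, verify_pow(header, nonce, 12))
```

The puzzle also satisfies the other two properties: it needs no external data, and each nonce attempt is an independent bite-sized chunk, which is what makes it hard to replace with a computation that is also valuable in its own right.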

However, there is one possible exception: zero-knowledge-proof generation. Zero knowledge proofs of aspects of blockchain validity (eg. data availability roots for a simple example) are difficult to compute, and easy to verify. Furthermore, they are durably difficult to compute; if proofs of "highly structured" computation become too easy, one can simply switch to verifying a blockchain's entire state transition, which becomes extremely expensive due to the need to model the virtual machine and random memory accesses.

Zero-knowledge proofs of blockchain validity provide great value to users of the blockchain, as they can substitute for the need to verify the chain directly; Coda is doing this already, albeit with a simplified blockchain design that is heavily optimized for provability. Such proofs can significantly assist in improving the blockchain's safety and scalability. That said, the total amount of computation that realistically needs to be done is still much less than the amount that's currently done by proof of work miners, so this would at best be an add-on for proof of stake blockchains, not a full-on consensus algorithm.


8  Proof of Stake

Another approach to solving the mining centralization problem is to abolish mining entirely, and move to some other mechanism for counting the weight of each node in the consensus. The most popular alternative under discussion to date is "proof of stake" - that is to say, instead of treating the consensus model as "one unit of CPU power, one vote" it becomes "one currency unit, one vote".

Status: Great theoretical progress, pending more real-world evaluation.

Near the end of 2014, it became clear to the proof of stake community that some form of "weak subjectivity" is unavoidable. To maintain economic security, nodes need to obtain a recent checkpoint extra-protocol when they sync for the first time, and again if they go offline for more than a few months. This was a difficult pill to swallow; many PoW advocates still cling to PoW precisely because in a PoW chain the "head" of the chain can be discovered with the only data coming from a trusted source being the blockchain client software itself. PoS advocates, however, were willing to swallow the pill, seeing the added trust requirements as not being large. From there the path to proof of stake through long-duration security deposits became clear.

Most interesting consensus algorithms today are fundamentally similar to PBFT, but replace the fixed set of validators with a dynamic list that anyone can join by sending tokens into a system-level smart contract with time-locked withdrawals (eg. a withdrawal might in some cases take up to 4 months to complete). In many cases (including Ethereum 2.0), these algorithms achieve "economic finality" by penalizing validators that are caught performing actions that violate the protocol in certain ways (see here for a philosophical view on what proof of stake accomplishes).

As of today, we have (among many other algorithms):

Casper FFG: https://arxiv.org/abs/1710.09437
Tendermint: https://tendermint.com/docs/spec/consensus/consensus.html
HotStuff: https://arxiv.org/abs/1803.05069
Casper CBC: https://vitalik.ca/general/2018/12/05/cbc_casper.html

There continues to be ongoing refinement (eg. here and here). Eth2 phase 0, the chain that will implement FFG, is currently under implementation and enormous progress has been made. Additionally, Tendermint has been running, in the form of the Cosmos chain, for several months. Remaining arguments about proof of stake, in my view, have to do with optimizing the economic incentives, and further formalizing the strategy for responding to 51% attacks. Additionally, the Casper CBC spec could still use concrete efficiency improvements.


9  Proof of Storage

A third approach to the problem is to use a scarce computational resource other than computational power or currency. In this regard, the two main alternatives that have been proposed are storage and bandwidth. There is no way in principle to provide an after-the-fact cryptographic proof that bandwidth was given or used, so proof of bandwidth should most accurately be considered a subset of social proof, discussed in later problems, but proof of storage is something that certainly can be done computationally. An advantage of proof-of-storage is that it is completely ASIC-resistant; the kind of storage that we have in hard drives is already close to optimal.

Status: A lot of theoretical progress, though still a lot to go, as well as more real-world evaluation.

There are a number of blockchains planning to use proof of storage protocols, including Chia and Filecoin. That said, these algorithms have not been tested in the wild. My own main concern is centralization: will these algorithms actually be dominated by smaller users using spare storage capacity, or will they be dominated by large mining farms?


Economics

10  Stable-value cryptoassets

One of the main problems with Bitcoin is the issue of price volatility ... Problem: construct a cryptographic asset with a stable price.

Status: Some progress.

MakerDAO is now live, and has been holding stable for nearly two years. It has survived a 93% drop in the value of its underlying collateral asset (ETH), and there is now more than $100 million in DAI issued. It has become a mainstay of the Ethereum ecosystem, and many Ethereum projects have or are integrating with it. Other synthetic token projects, such as UMA, are rapidly gaining steam as well.

However, while the MakerDAO system has survived tough economic conditions in 2019, the conditions were by no means the toughest that could happen. In the past, Bitcoin has fallen by 75% over the course of two days; the same may happen to ether or any other collateral asset some day. Attacks on the underlying blockchain are an even larger untested risk, especially if compounded by price decreases at the same time. Another major challenge, and arguably the larger one, is that the stability of MakerDAO-like systems is dependent on some underlying oracle scheme. Different attempts at oracle systems do exist (see #16), but the jury is still out on how well they can hold up under large amounts of economic stress. So far, the collateral controlled by MakerDAO has been lower than the value of the MKR token; if this relationship reverses MKR holders may have a collective incentive to try to "loot" the MakerDAO system. There are ways to try to protect against such attacks, but they have not been tested in real life.


11  Decentralized Public Goods Incentivization

One of the challenges in economic systems in general is the problem of "public goods". For example, suppose that there is a scientific research project which will cost $1 million to complete, and it is known that if it is completed the resulting research will save one million people $5 each. In total, the social benefit is clear ... [but] from the point of view of each individual person contributing does not make sense ... So far, most problems to public goods have involved centralization

Additional Assumptions And Requirements: A fully trustworthy oracle exists for determining whether or not a certain public good task has been completed (in reality this is false, but this is the domain of another problem)

Status: Some progress.

The problem of funding public goods is generally understood to be split into two problems: the funding problem (where to get funding for public goods from) and the preference aggregation problem (how to determine what is a genuine public good, rather than some single individual's pet project, in the first place). This problem focuses specifically on the former, assuming the latter is solved (see the "decentralized contribution metrics" section below for work on that problem).

In general, there haven't been large new breakthroughs here. There are two major categories of solutions. First, we can try to elicit individual contributions, giving people social rewards for doing so. My own proposal for charity through marginal price discrimination is one example of this; another is the anti-malaria donation badges on Peepeth. Second, we can collect funds from applications that have network effects. Within blockchain land there are several options for doing this:

.  Issuing coins
.  Taking a portion of transaction fees at protocol level (eg. through EIP 1559)
.  Taking a portion of transaction fees from some layer-2 application (eg. Uniswap, or some scaling solution, or even state rent in an execution environment in Ethereum 2.0)
.  Taking a portion of other kinds of fees (eg. ENS registration)

Outside of blockchain land, this is just the age-old question of how to collect taxes if you're a government, and charge fees if you're a business or other organization.


12  Reputation systems

Problem: design a formalized reputation system, including a score rep(A,B) -> V where V is the reputation of B from the point of view of A, a mechanism for determining the probability that one party can be trusted by another, and a mechanism for updating the reputation given a record of a particular open or finalized interaction.

Status: Slow progress.

There hasn't really been much work on reputation systems since 2014. Perhaps the best is the use of token curated registries to create curated lists of trustable entities/objects; the Kleros ERC20 TCR (yes, that's a token-curated registry of legitimate ERC20 tokens) is one example, and there is even an alternative interface to Uniswap (http://uniswap.ninja) that uses it as the backend to get the list of tokens and ticker symbols and logos from. Reputation systems of the subjective variety have not really been tried, perhaps because there is just not enough information about the "social graph" of people's connections to each other that has already been published to chain in some form. If such information starts to exist for other reasons, then subjective reputation systems may become more popular.


13  Proof of excellence

One interesting, and largely unexplored, solution to the problem of [token] distribution specifically (there are reasons why it cannot be so easily used for mining) is using tasks that are socially useful but require original human-driven creative effort and talent. For example, one can come up with a "proof of proof" currency that rewards players for coming up with mathematical proofs of certain theorems

Status: No progress, problem is largely forgotten.

The main alternative approach to token distribution that has instead become popular is airdrops; typically, tokens are distributed at launch either proportionately to existing holdings of some other token, or based on some other metric (eg. as in the Handshake airdrop). Verifying human creativity directly has not really been attempted, and with recent progress on AI the problem of creating a task that only humans can do but computers can verify may well be too difficult.


15 [sic]. Anti-Sybil systems

A problem that is somewhat related to the issue of a reputation system is the challenge of creating a "unique identity system" - a system for generating tokens that prove that an identity is not part of a Sybil attack ... However, we would like to have a system that has nicer and more egalitarian features than "one-dollar-one-vote"; arguably, one-person-one-vote would be ideal.

Status: Some progress.

There have been quite a few attempts at solving the unique-human problem. Attempts that come to mind include (incomplete list!):

.  HumanityDAO: https://www.humanitydao.org/
.  Pseudonym parties: https://bford.info/pub/net/sybil.pdf
.  POAP ("proof of attendance protocol"): https://www.poap.xyz/
.  BrightID: https://www.brightid.org/

With the growing interest in techniques like quadratic voting and quadratic funding, the need for some kind of human-based anti-sybil system continues to grow. Hopefully, ongoing development of these techniques and new ones can come to meet it.


14 [sic]. Decentralized contribution metrics

Incentivizing the production of public goods is, unfortunately, not the only problem that centralization solves. The other problem is determining, first, which public goods are worth producing in the first place and, second, determining to what extent a particular effort actually accomplished the production of the public good. This challenge deals with the latter issue.

Status: Some progress, some change in focus.

More recent work on determining value of public-good contributions does not try to separate determining tasks and determining quality of completion; the reason is that in practice the two are difficult to separate. Work done by specific teams tends to be non-fungible and subjective enough that the most reasonable approach is to look at relevance of task and quality of performance as a single package, and use the same technique to evaluate both.

Fortunately, there has been great progress on this, particularly with the discovery of quadratic funding. Quadratic funding is a mechanism where individuals can make donations to projects, and then based on the number of people who donated and how much they donated, a formula is used to calculate how much they would have donated if they were perfectly coordinated with each other (ie. took each other's interests into account and did not fall prey to the tragedy of the commons). The difference between amount would-have-donated and amount actually donated for any given project is given to that project as a subsidy from some central pool (see #11 for where the central pool funding could come from). Note that this mechanism focuses on satisfying the values of some community, not on satisfying some given goal regardless of whether or not anyone cares about it. Because of the complexity of values problem, this approach is likely to be much more robust to unknown unknowns.
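As a worked example of the mechanism just described, the sketch below uses the standard quadratic funding rule: a project's "perfectly coordinated" funding level is the square of the sum of the square roots of its individual contributions, and the subsidy is the gap between that and what was actually donated. Scaling subsidies down proportionally when the central pool is too small is an assumption made here for illustration, as are the function names.

```python
from math import sqrt
from typing import Dict, List


def qf_subsidy(contributions: List[float]) -> float:
    """Would-have-donated amount minus the amount actually donated."""
    ideal = sum(sqrt(c) for c in contributions) ** 2
    return ideal - sum(contributions)


def allocate(projects: Dict[str, List[float]], pool: float) -> Dict[str, float]:
    """Split a finite matching pool across projects.
    Assumption: if the raw subsidies exceed the pool, scale them all down
    by the same factor."""
    raw = {name: qf_subsidy(cs) for name, cs in projects.items()}
    total = sum(raw.values())
    scale = min(1.0, pool / total) if total > 0 else 0.0
    return {name: subsidy * scale for name, subsidy in raw.items()}


if __name__ == "__main__":
    projects = {
        "many_small": [1.0] * 100,  # 100 donors giving $1 each
        "one_large": [100.0],       # 1 donor giving $100
    }
    for name, cs in projects.items():
        print(name, qf_subsidy(cs))
    print(allocate(projects, pool=990.0))
```

Both projects raised $100 directly, but the one backed by 100 separate donors has a raw subsidy of $9,900 while the single-donor project gets nothing. That is the sense in which the formula rewards breadth of support, and also why it depends on the anti-sybil and anti-collusion work discussed elsewhere in the post.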

Quadratic funding has even been tried in real life with considerable success in the recent Gitcoin quadratic funding round. There has also been some incremental progress on improving quadratic funding and similar mechanisms; particularly, pairwise-bounded quadratic funding to mitigate collusion. There has also been work on specification and implementation of bribe-resistant voting technology, preventing users from proving to third parties who they voted for; this prevents many kinds of collusion and bribe attacks.


16  Decentralized success metrics

Problem: come up with and implement a decentralized method for measuring numerical real-world variables ... the system should be able to measure anything that humans can currently reach a rough consensus on (eg. price of an asset, temperature, global CO2 concentration)

Status: Some progress.

This is now generally just called "the oracle problem". The largest known instance of a decentralized oracle running is Augur, which has processed outcomes for millions of dollars of bets. Token curated registries such as the Kleros TCR for tokens are another example. However, these systems still have not seen a real-world test of the forking mechanism (search for "subjectivocracy" here) either due to a highly controversial question or due to an attempted 51% attack. There is also research on the oracle problem happening outside of the blockchain space in the form of the "peer prediction" literature; see here for a very recent advancement in the space.

Another looming challenge is that people want to rely on these systems to guide transfers of quantities of assets larger than the economic value of the system's native token. In these conditions, token holders in theory have the incentive to collude to give wrong answers to steal the funds. In such a case, the system would fork and the original system token would likely become valueless, but the original system token holders would still get away with the returns from whatever asset transfer they misdirected. Stablecoins (see #10) are a particularly egregious case of this. One approach to solving this would be a system that assumes that altruistically honest data providers do exist, creates a mechanism to identify them, and only allows them to churn slowly, so that if malicious ones start getting voted in, the users of systems that rely on the oracle can first complete an orderly exit. In any case, more development of oracle tech is very much an important problem.


New problems

If I were to write the hard problems list again in 2019, some would be a continuation of the above problems, but there would be significant changes in emphasis, as well as significant new problems. Here are a few picks:

.  Cryptographic obfuscation: same as #4 above

.  Ongoing work on post-quantum cryptography: both hash-based as well as based on post-quantum-secure "structured" mathematical objects, eg. elliptic curve isogenies, lattices...

.  Anti-collusion infrastructure: ongoing work and refinement of https://ethresear.ch/t/minimal-anti-collusion-infrastructure/5413, including adding privacy against the operator, adding multi-party computation in a maximally practical way, etc.

.  Oracles: same as #16 above, but removing the emphasis on "success metrics" and focusing on the general "get real-world data" problem

.  Unique-human identities (or, more realistically, semi-unique-human identities): same as what was written as #15 above, but with an emphasis on a less "absolute" solution: it should be much harder to get two identities than one, but making it impossible to get multiple identities is both impossible and potentially harmful even if we do succeed

.  Homomorphic encryption and multi-party computation: ongoing improvements are still required for practicality

.  Decentralized governance mechanisms: DAOs are cool, but current DAOs are still very primitive; we can do better

.  Fully formalizing responses to PoS 51% attacks: ongoing work and refinement of https://ethresear.ch/t/responding-to-51-attacks-in-casper-ffg/6363

.  More sources of public goods funding: the ideal is to charge for congestible resources inside of systems that have network effects (eg. transaction fees), but doing so in decentralized systems requires public legitimacy; hence this is a social problem along with the technical one of finding possible sources

.  Reputation systems: same as #12 above

In general, base-layer problems are slowly but surely decreasing, but application-layer problems are only just getting started.

Neuroticism (negative), extraversion, agreeableness, and to a lesser extent conscientiousness predicted wellbeing; the hypothesis that self-enhancement is beneficial for wellbeing is doubtful

An integrated model of social psychological and personality psychological perspectives on personality and wellbeing. Ulrich Schimmack, Hyunji Kim. Journal of Research in Personality, Volume 84, February 2020, 103888. https://doi.org/10.1016/j.jrp.2019.103888

Highlights
•    Largest sample size for multi-method studies of self-enhancement.
•    No support for benefits of positive illusions on wellbeing.
•    Multi-method evidence that personality influences well-being.

Abstract: This article uses multi-rater data from 458 triads (students, mother, father, total N = 1374) to examine the relationship of personality ratings with wellbeing ratings, using a multi-method approach to separate accurate perceptions (shared across raters) from biased perceptions of the self (rater-specific variance). The social-psychological perspective predicts effects of halo bias in self-ratings on wellbeing, whereas the personality-psychological perspective predicts effects of personality traits on wellbeing. Results are more consistent with the personality perspective in that neuroticism (negative), extraversion, agreeableness, and to a lesser extent conscientiousness predicted wellbeing, whereas positive illusions about the self were only weakly and not significantly related to wellbeing. These results cast doubt on the hypothesis that self-enhancement is beneficial for wellbeing.

4. Discussion

The main contribution of this article was to examine wellbeing from an integrated personality and social psychological perspective. While personality psychologists focused on the contribution of actual traits, social psychologists focused on biases in self-perceptions of traits. Multi-method measurement models were used to separate valid trait variance from illusory perceptions of personality in self-ratings and ratings of other family members. The results show that actual personality traits are more important for wellbeing than positive biases in self-perceptions. In fact, the most important finding was that positive illusions about the self were unrelated to wellbeing impressions that are shared across informants. This finding challenges Taylor and Brown's (1988) influential and highly controversial claim that positive illusions not only foster higher wellbeing, but are a sign of optimal and normal functioning. Subsequently, we discuss the implications of our findings for the future of wellbeing science and for individuals’ pursuit of wellbeing.

4.1. Positive illusions and public wellbeing

The social psychological perspective on wellbeing is grounded in the basic assumption that human information processing is riddled with errors. Taylor and Brown (1988) quote Fiske and Taylor's (1984) book about social cognitions to support this assumption. “Instead of a naïve scientist entering the environment in search of the truth, we find the rather unflattering picture of a charlatan trying to make the data come out in a manner most advantageous to his or her already-held theories” (p. 88). Thirty years later, it has become apparent that human information processing is more accurate than Fiske and Taylor (1984) assumed (Funder, 1995; Jussim, 1991; McCrae and Costa, 1991; Schimmack and Oishi, 2005). Thus, Taylor and Brown's (1988) model of wellbeing is based on outdated evidence and needs to be revised.
The vast majority of studies have relied on self-ratings of wellbeing to measure benefits of wellbeing. This is problematic because self-ratings of wellbeing can be inflated by the very same processes that inflate self-ratings of personality (Humberg et al., 2019). There have been only a handful of studies with valid illusion measures and informant ratings of wellbeing and these studies have found similar weak results (Dufner et al., 2019).
The lack of evidence for benefits of positive illusions is not for a lack of trying. Taylor, Lerner, Sherman, Sage, and McDowell (2003) claimed that effects of positive illusions are not limited to self-ratings. “We conducted a study with multiple measures of self-enhancement along with multiple measures and judges of mental health, comprehensively assessing their relationship. The results indicated that self-enhancement is positively associated with multiple indicators of mental health” (p. 165). Contrary to this claim, Table 5 shows correlations of various self-enhancement measures with peer-rated mental health ranging from r = −0.13 to 0.09. None of these correlations were significant, in part due to the low statistical power of the study (N = 55). Thus, even Taylor and colleagues never provided positive evidence that positive illusions increase wellbeing in ways that can be measured with a method other than self-reports. The social cognitive model of wellbeing also faces other problems. One problem is causality. Even if there were a small correlation between positive illusions about the self and wellbeing, it is not clear that it is causal. It is equally plausible that happiness distorts self-perceptions. Thirty years of research have failed to address this problem (cf. Humberg et al., 2019). Another problem is that third variables produce a spurious correlation between illusions about the self and wellbeing. For example, relationship researchers have shown that illusions about a partner predict relationship satisfaction (see Weidmann, Ledermann, & Grob, 2016, for a review), and Kim et al. (2012) showed that individuals with positive illusions about the self also tend to have positive illusions about others. Thus, it is possible that positive illusions about others, not the self, are beneficial for social relationships and wellbeing. Future research needs to include measures of positive illusions about the self and others to examine this question. 
Given these problems, we question broad conclusions about the benefits of positive illusions for wellbeing (Dufner et al., 2019; Humberg et al., 2019).

4.2. Positive illusions and private wellbeing

The present study replicated the finding that positive illusions predict unique variance in self-ratings of wellbeing. That is, individuals who claim to be more extraverted and more agreeable than others perceive them also claim to be happier than others perceive them to be (Dufner et al., 2019; Humberg et al., 2019; Taylor et al., 2003). As noted in the introduction, there are two possible explanations for this finding. One explanation is that positive illusions enhance wellbeing in a way that is not observable to others. The challenge for this model is to explain how positive illusions foster private wellbeing and to provide empirical evidence for this model. To explain why informants are unable to see the happiness of individuals with positive illusions, we have to assume that the illusion-based happiness is not visible to others. This requires a careful examination of the variance in self-ratings of wellbeing that is not shared with informants (Schneider & Schimmack, 2010).
The private-wellbeing illusion model also faces an interesting contradiction in assumptions about the validity of personality and wellbeing judgments. To allow for effects of positive illusions on private wellbeing, the model assumes that people have illusions about their personality, while their self-ratings of wellbeing are highly accurate and trustworthy. In contrast, social psychologists have argued that wellbeing judgments are highly sensitive to context effects and provide little valid information about individuals’ wellbeing (Schwarz & Strack, 1999). In contrast, personality psychologists have pointed to self-informant agreement in wellbeing judgments as evidence for the validity of self-ratings of wellbeing. If informant ratings validate self-ratings, then we would expect predictors of wellbeing also to be related to self-ratings of wellbeing and to informant ratings of wellbeing. Our main contribution is to show that this is not the case for positive illusions, or at least, that the effect size is small. No single study can resolve deep philosophical questions, but our study suggests that hundreds of studies that relied on self-ratings of wellbeing to demonstrate the benefits of positive illusions may have produced illusory evidence of these benefits.

4.3. Positive illusions as halo bias

Evidence for halo biases in personality ratings is nearly 100 years old (Thorndike, 1920). Ironically, some of the strongest evidence for the pervasiveness of halo biases stems from social psychology (Nisbett & Wilson, 1977). Given the evidence that halo biases in ratings are pervasive, halo bias provides a simple and parsimonious explanation for the finding that positive illusions are only related to the unique variance in self-ratings and not to informant ratings of wellbeing. One explanation for halo bias is that many trait concepts have a denotative and a connotative (evaluative) meaning (Osgood, Suci, & Tannenbaum, 1957). While denotative meaning and valid information produce agreement between raters, ratings are also biased by the connotative meaning of words and liking of a target. For example, lazy has a denotative meaning of not putting a lot of effort into tasks and a negative connotation. Ratings of laziness will be enhanced by dislike and attenuated by liking of an individual independent of the objective effort targets exert (Leising, Erbs, & Fritz, 2010). It seems plausible that halo bias also influences ratings of desirable attributes like happiness and having a good life. Thus, halo bias offers a plausible explanation for our results that is also consistent with heuristic and bias models in social psychology.

4.4. Personality and wellbeing

The present study provided new evidence on the relationship between personality and wellbeing from a multi-rater perspective. Results confirmed that neuroticism is the strongest predictor of wellbeing and that the influence on wellbeing is mediated by hedonic balance. This finding is consistent with the hypothesis that neuroticism is a broad disposition to experience more unpleasant mood states (Costa and McCrae, 1980; Schimmack, Radhakrishnan, Oishi et al., 2002; Watson and Tellegen, 1985). As experiencing unpleasant mood is undesirable, it lowers wellbeing independent of actual life-circumstances. Twin studies suggest that individual differences in neuroticism are partially heritable and that the genetic variance in neuroticism accounts for a considerable portion of the shared variance between neuroticism and wellbeing (Nes et al., 2013).
In comparison, the other personality traits explain relatively small amounts of variance in wellbeing. While the effects of extraversion and agreeableness were also mediated by hedonic balance, the results for conscientiousness suggested a unique influence on life evaluations. Future research needs to go beyond demonstrating effects of the Big Five on wellbeing and start to investigate the causal processes that link personality to wellbeing. McCrae and Costa (1991) proposed that agreeableness is beneficial for more harmonious social relationships, while conscientiousness is beneficial for work, but there have been few attempts to test these predictions. One way to test potential mediators is to use integrated top-down bottom-up models with domain satisfaction as mediators (Brief et al., 1993; Schimmack, Diener, Oishi, 2002). It is important to use multi-method measurement models to separate top-down effects from halo bias (Schneider & Schimmack, 2010). It is also important to examine the relationship of personality and wellbeing with a more detailed assessment of personality traits. While the Big Five have the advantage of covering a broad range of personality traits with a few, largely orthogonal dimensions, the disadvantage is that they cannot represent all of the variation in personality. Some studies showed that the depression facet of neuroticism and the cheerfulness facet of extraversion explain additional variance in wellbeing (Allik et al., 2018; Schimmack et al., 2004). More research with narrow personality traits is needed to specify the precise personality traits that are related to wellbeing.

Pseudo-profound bullshit titles make the art grow profounder

Bullshit makes the art grow profounder. Martin Harry Turpin et al. Judgment and Decision Making, Vol. 14, No. 6, November 2019, pp. 658-670. http://journal.sjdm.org/19/190712/jdm190712.html

Abstract: Across four studies participants (N = 818) rated the profoundness of abstract art images accompanied with varying categories of titles, including: pseudo-profound bullshit titles (e.g., The Deaf Echo), mundane titles (e.g., Canvas 8), and no titles. Randomly generated pseudo-profound bullshit titles increased the perceived profoundness of computer-generated abstract art, compared to when no titles were present (Study 1). Mundane titles did not enhance the perception of profoundness, indicating that pseudo-profound bullshit titles specifically (as opposed to titles in general) enhance the perceived profoundness of abstract art (Study 2). Furthermore, these effects generalize to artist-created abstract art (Study 3). Finally, we report a large correlation between profoundness ratings for pseudo-profound bullshit and “International Art English” statements (Study 4), a mode and style of communication commonly employed by artists to discuss their work. This correlation suggests that these two independently developed communicative modes share underlying cognitive mechanisms in their interpretations. We discuss the potential for these results to be integrated into a larger, new theoretical framework of bullshit as a low-cost strategy for gaining advantages in prestige awarding domains.

Keywords: pseudo-profound bullshit, impression management, abstract art, meaning, social navigation

The complex relation between receptivity to pseudo-profound bullshit and political ideology. Nilsson, Artur; Erlandsson, Arvid and Västfjäll, Daniel (2018) In Personality and Social Psychology Bulletin, Jan 2019. https://www.bipartisanalliance.com/2019/01/bullshit-receptivity-robustly.html

Check also the first author's MA Thesis... Bullshit Makes the Art Grow Profounder: Evidence for False Meaning Transfer Across Domains. Martin Harry Turpin. MA Thesis, Waterloo Univ., Ontario. https://www.bipartisanalliance.com/2018/10/pairing-abstract-art-pieces-with.html

And Bullshit-sensitivity predicts prosocial behavior. Arvid Erlandsson et al. PLOS, https://www.bipartisanalliance.com/2018/08/bullshit-receptivity-perceived.html

Non-believers: Reflection increases belief in God through self-questioning

Reflection increases belief in God through self-questioning among non-believers. Onurcan Yilmaz, Ozan Isler. Judgment and Decision Making, Vol. 14, No. 6, November 2019, pp. 649-657. http://journal.sjdm.org/19/190605/jdm190605.html

The dual-process model of the mind predicts that religious belief will be stronger for intuitive decisions, whereas reflective thinking will lead to religious disbelief (i.e., the intuitive religious belief hypothesis). While early research found intuition to promote and reflection to weaken belief in God, more recent attempts found no evidence for the intuitive religious belief hypothesis. Many of the previous studies are underpowered to detect small effects, and it is not clear whether the cognitive process manipulations used in these failed attempts worked as intended. We investigated the influence of intuitive and reflective thought on belief in God in two large-scale preregistered experiments (N = 1,602), using well-established cognitive manipulations (i.e., time-pressure with incentives for compliance) and alternative elicitation methods (between and within-subject designs). Against our initial hypothesis based on the literature, the experiments provide first suggestive then confirmatory evidence for the reflective religious belief hypothesis. Exploratory examination of the data suggests that reflection increases doubts about beliefs held regarding God’s existence. Reflective doubt exists primarily among non-believers, resulting in an overall increase in belief in God when deciding reflectively.

Keywords: reflection, intuition, analytic cognitive style, belief, belief in God or gods


4  Discussion

In both experiments, we found that reflection increases belief in God and that the effect is stronger among non-believers. Exploratory analysis suggested that the overall increase in religious belief is likely due to the religious self-questioning (i.e., reflective doubt) of non-believers who tended to revise their responses on the scale towards the middle point (i.e., “not sure”). The results also showed that those who make greater use of their reflective capacities (as measured by CRT-2) are less likely to endorse belief in God or gods. These results provide evidence against the hypothesis that intuition fosters and that reflection dampens religious belief (Gervais & Norenzayan, 2012; Shenhav et al., 2012; Yilmaz et al., 2016), but they converge with the longstanding correlational results demonstrating that the tendency for reflective thinking is negatively associated with religious belief (e.g., Bahçekapili & Yilmaz, 2017; Gervais et al., 2018; Pennycook et al., 2016; Stagnaro et al., 2018; Stagnaro, Ross, Pennycook & Rand, 2019).
Why does reflection increase belief in God in the current research? Our exploratory analysis strongly suggests that reflection, rather than directly increasing belief in God, increases doubt about one’s initial and intuitively held belief regarding God’s existence. It is likely that reflection increased religious belief in our overall sample because religious self-questioning is stronger among non-believers than among believers. On the other hand, we show that endorsement of agnosticism, deism, and polytheism is associated with both increase and decrease in belief in God, which may drive reflective doubt. Future research should try to experimentally distinguish this reflective religious doubt hypothesis implicated by our exploratory analysis from the reflective religious belief hypothesis. Nevertheless, we expect the effect of reflection on religious belief to be small because the belief in God question, as regularly used in the literature, will tend to probe stable opinions. Having answered the same question numerous times over the course of one’s life, participants are likely to know, as a defining characteristic of their personal identity, whether and to what extent they believe in God.
We also hypothesized but found no strong evidence that Pascal’s Wager may motivate a religious belief. Accordingly, reflected evaluation of the possibility of God’s existence could highlight the potentially infinite benefits of belief and costs of disbelief, hence questioning religious disbelief through a rational utility calculus. Although plausible, the tendency in our sample to agree with Pascal’s Wager did not clearly explain the reflected change in religious belief. However, our test was limited by the fact that religious believers (i.e., those with already high levels of belief) agreed with the Wager more than non-believers as well as by the fact that there were fewer atheists and agnostics in our sample.
An alternative explanation of the positive effect of reflection on religious belief may be that reflection makes people less extreme in their beliefs in general (i.e., religious and non-religious) but that openness to such self-criticism may be stronger among non-believers since they also tend to be reflective thinkers (Pennycook et al., 2016). Comparing religious and secular belief change among non-believers can therefore provide an explanation for our main finding. Likewise, Pascal’s Wager can be tested using improved methods, for example, by studying the effect of Pascal’s argument as an experimental manipulation. Finally, the two-stage procedure used in Experiment 2 was more insightful to studying religious belief change than the standard between-subject design of Experiment 1. The two-stage technique can be used in future studies of cooperation and morality in order to dissociate dual cognitive processes.
We also suggest that these experimental manipulations might have more influence on less stable beliefs or on those who are less confident about the existence of God. A similar distinction has been made in the field of political psychology (Talhelm, 2018; Talhelm et al., 2015; Yilmaz & Saribay, 2016, 2017). Activating reflective thinking did not have an impact on political opinions when they were measured by standard scale items based on identity labels (e.g., liberal or conservative), but it led to a significant change in less stable contextualized opinions (e.g., forming opinions about a newspaper article; Yilmaz & Saribay, 2017). A similar distinction can be made in the field of cognitive science of religion. For example, while belief in God, reflecting relatively stable opinions, may be more resistant to cognitive process manipulations, the relative reliance on natural vs. supernatural explanations for an uncertain event (e.g., the disappearance of airplanes in the Bermuda Triangle) may be more open to the influence of intuitive and reflective thinking. This possibility should be examined in future research.
A surprising contrast emerges from our data: the positive causal effect of reflection on belief in God vs. the negative correlation between individual tendency for reflected thinking and religious belief. While it is not clear why experimental and correlational tests lead to different conclusions, one may conjecture that the two approaches capture separate psychological mechanisms occurring across distinct time-frames. In particular, correlational measures may reflect self-selection of intuitively inclined people to religious belief (a long-term process of identity formation), while promoting reflection may isolate the possibly short-term effects of questioning one’s own and already established beliefs. While correlational findings are prevalent in the literature, there is a need for more experimental research on this topic. In particular, the generalizability of our results across cultures (e.g., using multi-lab experiments) is an open question.
In sum, recent failures to support the intuitive religious belief hypothesis suggested that the early evidence supporting the hypothesis is not easily reproducible. Using stronger manipulations and two large-scale experiments, we found that the effect of reflection and intuition on belief in God is in fact the opposite of what the intuitive religious belief hypothesis predicts. Our results suggest that reflection on God’s existence may promote religious self-questioning, especially among non-believers.

Wronging past rights: The sunk cost bias distorts moral judgment

Wronging past rights: The sunk cost bias distorts moral judgment. Ethan A. Meyers et al. Judgment and Decision Making, Vol. 14, No. 6, November 2019, pp. 721-727. http://journal.sjdm.org/19/190909b/jdm190909b.html

When people have invested resources into an endeavor, they typically persist in it, even when it becomes obvious that it will fail. Here we show this bias extends to people’s moral decision-making. Across two preregistered experiments (N = 1592) we show that people are more willing to proceed with a futile, immoral action when costs have been sunk (Experiment 1A and 1B). Moreover, we show that sunk costs distort people’s perception of morality by increasing how acceptable they find actions that have received past investment (Experiment 2). We find these results in contexts where continuing would lead to no obvious benefit and only further harm. We also find initial evidence that the bias has a larger impact on judgment in immoral compared to non-moral contexts. Our findings illustrate a novel way that the past can affect moral judgment. Implications for rational moral judgment and models of moral cognition are discussed.

Keywords: sunk costs, morality, decision-making, judgment, open data, open materials, preregistered

4  General Discussion

We found that the sunk cost bias extends to moral judgments. When costs were sunk, participants were more willing to proceed with a futile, immoral action compared to when costs were not sunk. For example, they were more willing to sacrifice monkeys to develop a medical cure when some monkeys had already been sacrificed than when none had been. Moreover, people judged these actions as more acceptable when costs were sunk. Importantly, these effects occurred even though the benefit of the proposed immoral action was eliminated.
Our findings illustrate a novel way that the past can impact moral judgment. Moral research conducted to-date has focused extensively on future consequences (e.g., Baez et al., 2017; Miller & Cushman, 2013). Although this makes normative sense as only the future should be relevant to decisions, it is well known that choice is affected by irrelevant factors like past investment (Kahneman, 2011; Kahneman, Slovic & Tversky, 1982; Szaszi, Palinkas, Palfi, Szollosi & Aczel, 2018; Tversky & Kahneman, 1974). As such, our findings show that as is true with other (non-moral) judgments, people’s moral judgments are affected by factors that rational agents “should” ignore when making them.
Further, our findings show that a major decision bias (i.e., the sunk cost effect) extends to moral judgment. This finding is broadly consistent with research showing that moral judgments are affected by such biases. This earlier work shows that when making moral judgments, people are sensitive to how options are framed (e.g., Shenhav & Greene, 2010) and prefer acts of omission over commission (e.g., Bostyn & Roets, 2016). For example, people make different moral judgments when the decision is presented in a gain frame than when it is presented in a loss frame, even though these two decisions are logically identical (Kern & Chugh, 2009). Likewise, people judge lying to the police about who is at fault in a car accident (a harmful commission), to be more immoral than not informing the police precisely who is at fault (a harmful omission) (Spranca, Minsk & Baron, 1991). However, unlike most of these previous demonstrations, our findings directly compare the presence of decision-making biases across moral and non-moral contexts (also see Cushman & Young, 2011).
In our first experiment, we also found that the sunk cost bias may be stronger in moral decision-making than in other situations. This is surprising. In non-moral cases proceeding with a futile course of action is wasteful. But in our moral version of the scenarios, proceeding is wasteful, harmful to others, and morally wrong. Yet, there was a greater discrepancy between willingness to act in response to sunk costs in the immoral condition. Increasing the reasons to not proceed with the action amplified the sunk cost bias. One potential explanation for this is that people are unwilling to admit their prior investments were in vain (Brockner, 1992). People succumb to the sunk cost bias in part because they feel a need to justify their past decisions as correct (Ku, 2008; also see Staw, 1976). Likewise, moral judgments seem to generate a much greater need to provide reasons to justify past decisions (Haidt, 2012). Thus, those making decisions in an immoral context might have additional pressures to justify their previous choice that stem from the nature of moral judgment itself.
Another explanation is that the initial investment was of a larger magnitude in the immoral compared to the non-moral condition. In both cases, participants incurred an economic cost, but only in one did participants incur an additional moral cost. People are more likely to succumb to the sunk cost bias when initial investments are large (Arkes & Ayton, 1999; Arkes & Blumer, 1985; Sweis et al., 2018). Perhaps sunk costs exerted a greater effect in the immoral condition because the past investments were greater (i.e., of two kinds: economic and moral, rather than just one: economic). However, as we do not know if the economic resources (e.g., pine trees and lab monkeys) were of comparable value, the discrepancy between moral conditions may entirely stem from the lab monkeys being valued higher and thus larger in investment magnitude. Thus, we are hesitant to draw any strong conclusion from this finding. The difference in sunk cost magnitude could stem from differences in financial costs between the immoral and non-moral contexts.
Our finding that moral violations led to increased willingness to act is reminiscent of the “what the hell” effect, in which people who violate their diet then give up on it and continue to overindulge (Cochran & Tesser, 1996; Polivy, Herman & Deo, 2010). We see this as similar to persisting in an immoral course of action after costs have been sunk. After engaging in a morally equivocal act, people may feel disinhibited and willing to continue the act even when its immorality becomes clear. Likewise, people may persist in an attempt to maintain the status quo (Kahneman, Knetsch & Thaler, 1991; Samuelson & Zeckhauser, 1988). These accounts, though, may not explain why sunk costs changed people’s moral perceptions. One possibility is that this resulted from cognitive dissonance between people’s actions and their moral code (Aronson, 1969; Festinger, 1957; Harmon-Jones & Mills, 1999). For example, sacrificing monkeys to develop a cure may cause dissonance between not wanting to harm but having done so. To resolve this, people might change their moral perceptions, molding their moral code to fit their behavior.
We close by considering a broader implication of this work. The extension of decision biases to moral judgment has been previously construed as supporting domain-general accounts of morality that suggest moral judgment operates similarly to ordinary judgment (Osman & Wiegmann, 2017; Greene, 2015). This is because if morality is not unique, one could reasonably expect that a factor that affects ordinary judgment would likewise affect moral judgment. Thus, if information irrelevant to the decision at hand (e.g., past investments) influences whether we continue to bulldoze land to build a highway, so too should it influence the same bulldoze decision that requires confiscating the land. This is not conclusive however, and our findings could be interpreted to support domain-specific accounts instead (e.g., Mikhail, 2011). For instance, the sunk cost bias was demonstrably larger in moral judgments. Nevertheless, an interpretation of our results as evidence for a domain-general account of morality must explain how the varying effect of past investment on judgment is a difference in degree but not kind.

Thursday, November 28, 2019

Switzerland: Those exposed to civil conflict/mass killing during childhood are 35 pct more prone to violent crime; effect is mostly confined to co-nationals, consistent with inter-group hostility persisting over time

Couttenier, Mathieu, Veronica Petrencu, Dominic Rohner, and Mathias Thoenig. 2019. "The Violent Legacy of Conflict: Evidence on Asylum Seekers, Crime, and Public Policy in Switzerland." American Economic Review, 109 (12): 4378-4425. DOI: 10.1257/aer.20170263

Abstract: We study empirically how past exposure to conflict in origin countries makes migrants more violence-prone in their host country, focusing on asylum seekers in Switzerland. We exploit a novel and unique dataset on all crimes reported in Switzerland by the nationalities of perpetrators and of victims over 2009–2016. Our baseline result is that cohorts exposed to civil conflict/mass killing during childhood are 35 percent more prone to violent crime than the average cohort. This effect is particularly strong for early childhood exposure and is mostly confined to co-nationals, consistent with inter-group hostility persisting over time. We exploit cross-region heterogeneity in public policies within Switzerland to document which integration policies are best able to mitigate the detrimental effect of past conflict exposure on violent criminality. We find that offering labor market access to asylum seekers eliminates two-thirds of the effect.

Could they be lying?: Vegetarian women reported that they are more prosocially motivated to follow their diet & adhere to their diet more strictly (i.e., are less likely to cheat & eat meat)

Gender Differences in Vegetarian Identity: How Men and Women Construe Meatless Dieting. Daniel L. Rosenfeld. Food Quality and Preference, November 28 2019, 103859. https://doi.org/10.1016/j.foodqual.2019.103859

Highlights
• This research evaluated psychological differences between vegetarian men and women.
• Women are more prosocially motivated to follow a vegetarian diet than men are.
• Women adhere to their vegetarian diet more strictly than men do.

Abstract: Meat is deeply associated with masculine identity. As such, it is unsurprising that women are more likely than men are to become vegetarian. Given the gendered nature of vegetarianism, might men and women who become vegetarian express distinct identities around their diets? Through two highly powered preregistered studies (Ns = 890 and 1,775) of self-identified vegetarians, combining both frequentist and Bayesian approaches, I found that men and women differ along two dimensions of vegetarian identity: (1) dietary motivation and (2) dietary adherence. Compared to vegetarian men, vegetarian women reported that they are more prosocially motivated to follow their diet and adhere to their diet more strictly (i.e., are less likely to cheat and eat meat). By considering differences in how men and women construe vegetarian dieting, investigators can generate deeper insights into the gendered nature of eating behavior.

Keywords: vegetarianism; food choice; dieting; gender; identity


About lies and prosociality in women, nonreligion is socially risky, atheism is more socially risky than other forms of nonreligion, & women and members of other marginalized groups avoid the most socially risky forms of nonreligion: From Existential to Social Understandings of Risk: Examining Gender Differences in Nonreligion. Penny Edgell, Jacqui Frost, Evan Stewart. Social Currents, Dec 2018. https://www.bipartisanalliance.com/2018/12/nonreligion-is-socially-risky-atheism.html

Check also Taste and health concerns trump anticipated stigma as barriers to vegetarianism. Daniel L. Rosenfeld, A. Janet Tomiyama. Appetite, Volume 144, January 1 2020, 104469. https://www.bipartisanalliance.com/2019/09/vegetarian-diets-may-be-perceived-as.html

And Relationships between Vegetarian Dietary Habits and Daily Well-Being. John B. Nezlek, Catherine A. Forestell & David B. Newman. Ecology of Food and Nutrition, https://www.bipartisanalliance.com/2018/10/vegetarians-reported-lower-self-esteem.html

And Psychology of Men & Masculinity: Eating meat makes you sexy / Conformity to dietary gender norms and attractiveness. Timeo, S., & Suitner, C. (2018). Eating meat makes you sexy: Conformity to dietary gender norms and attractiveness. Psychology of Men & Masculinity, 19(3), 418-429. https://www.bipartisanalliance.com/2018/06/psychology-of-men-masculinity-eating.html


Great interest exists in identifying methods to predict neuropsychiatric disease states and treatment outcomes from high-dimensional data, including neuroimaging and genomics data; best practices are discussed

Establishment of Best Practices for Evidence for Prediction: A Review. Russell A. Poldrack, Grace Huckins, Gael Varoquaux. JAMA Psychiatry, November 27, 2019. doi:https://doi.org/10.1001/jamapsychiatry.2019.3671

Abstract
Importance  Great interest exists in identifying methods to predict neuropsychiatric disease states and treatment outcomes from high-dimensional data, including neuroimaging and genomics data. The goal of this review is to highlight several potential problems that can arise in studies that aim to establish prediction.

Observations  A number of neuroimaging studies have claimed to establish prediction while establishing only correlation, which is an inappropriate use of the statistical meaning of prediction. Statistical associations do not necessarily imply the ability to make predictions in a generalized manner; establishing evidence for prediction thus requires testing of the model on data separate from those used to estimate the model’s parameters. This article discusses various measures of predictive performance and the limitations of some commonly used measures, with a focus on the importance of using multiple measures when assessing performance. For classification, the area under the receiver operating characteristic curve is an appropriate measure; for regression analysis, correlation should be avoided, and median absolute error is preferred.

Conclusions and Relevance  To ensure accurate estimates of predictive validity, the recommended best practices for predictive modeling include the following: (1) in-sample model fit indices should not be reported as evidence for predictive accuracy, (2) the cross-validation procedure should encompass all operations applied to the data, (3) prediction analyses should not be performed with samples smaller than several hundred observations, (4) multiple measures of prediction accuracy should be examined and reported, (5) the coefficient of determination should be computed using the sums of squares formulation and not the correlation coefficient, and (6) k-fold cross-validation rather than leave-one-out cross-validation should be used.

---
Excerpts (full paper, references, etc., at the DOI above):

Introduction

The development of biomarkers for disease is attracting increasing interest in many domains of biomedicine. Interest is particularly high in neuropsychiatry owing to the current lack of biologically validated diagnostic or therapeutic measures.1 An essential aspect of biomarker development is demonstration that a putative marker is predictive of relevant behavioral outcomes,2 disease prognosis,3 or therapeutic outcomes.4 As the size and complexity of data sets have increased (as in neuroimaging and genomics studies), it has become increasingly common that predictive analyses have been performed using methods from the field of machine learning, with techniques that are purpose-built for generating accurate predictions on new data sets.

Despite the potential utility of prediction-based research, its successful application in neuropsychiatry—and medicine more generally—remains challenging. In this article, we review a number of challenges in establishing evidence for prediction, with the goal of providing simple recommendations to avoid common errors. Although most of these challenges are well known within the machine learning and statistics communities, awareness is less widespread among research practitioners.

We begin by outlining the meaning of the concept of prediction from the standpoint of machine learning. We highlight the fact that predictive accuracy cannot be established by using the same data both to fit and test the model, which our literature review found to be a common error in published claims of prediction. We then turn to the question of how accuracy should be quantified for categorical and continuous outcome measures. We outline the ways in which naive use of particular predictive accuracy measures and cross-validation methods can lead to biased estimates of predictive accuracy. We conclude with a set of best practices to establish valid claims of successful prediction.

Code to reproduce all simulations and figures is available at https://github.com/poldrack/PredictionCV.

Association vs Prediction

A claim of prediction is ultimately judged by its ability to generalize to new situations; the term implies that it is possible to successfully predict outcomes in data sets other than the one used to generate the claim. When a statistical model is applied to data, the goodness of fit of that model to those data will in part reflect the underlying data-generating mechanism, which should generalize to new data sets sampled from the same population, but it will also include a contribution from noise (ie, unexplained variation or randomness) that is specific to the particular sample.5 For this reason, a model will usually fit better to the sample used to estimate it than it will to a new sample, a phenomenon known in machine learning as overfitting and in statistics as shrinkage.

Because of overfitting, it is not possible to draw useful estimates of predictive accuracy simply from a model’s goodness of fit to a data set; such estimates will necessarily be inflated, and their degree of optimism will depend on many factors, including the complexity of the statistical model and the size of the data set. The fit of a model to a specific data set can be improved by increasing the number of parameters in the model; any data set can be fit with 0 error if the model has as many parameters as data points. However, as the model becomes more complex than the process that generates the data, the fit of the model starts to reflect the specific noise values in the data set. A sign of overfitting is that the model fits well to the specific data set used to estimate the model but fits poorly to new data sets sampled from the same population. Figure 1 presents a simulated example, in which increasing model complexity results in decreased error for the data used to fit the model, but the fit to new data becomes increasingly poor as the model grows more complex than the true data-generating process.

Because we do not generally have a separate test data set to assess generalization performance, the standard approach in machine learning to address overfitting is to assess model fit via cross-validation, a process that uses subsets of the data to iteratively train and test the predictive performance of the model. The simplest form of cross-validation is known as leave-one-out, in which the model is successively fit on every data point but 1 and is then tested on that left-out point. A more general cross-validation approach is known as k-fold cross-validation, in which the data are split into k different subsets, or folds. The model is successively trained on every subset but 1 and is then tested on the held-out subset. Cross-validation can also help discover the model that will provide the best predictive performance on a new sample (Figure 1).

One might ask how badly inflated the in-sample association is as an estimate of out-of-sample prediction; if the inflation is small, or only occurs with complex models, then perhaps it can be ignored for practical purposes. Figure 2 shows an example of how the optimism of in-sample fits depends on the complexity of the statistical model; in this case, we use a simple linear model but vary the number of irrelevant independent variables in the model. As the number of variables increases, the fit of the model to the sample increases owing to overfitting. However, even for a single predictor in the model, the fit of the model is inflated compared with new data or cross-validation. The optimism of in-sample fits is also a function of sample size (Figure 2). This example demonstrates the utility of using cross-validation to estimate predictive accuracy on a new sample.
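
The gap between in-sample fit and cross-validated fit can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' simulation code: the data are simulated so that the outcome is unrelated to the single predictor, and the helper names (fit_line, kfold_indices, cv_mse) are made up for this example.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def mse(xs, ys, a, b):
    """Mean squared error of the line a + b*x on the given points."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def kfold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_mse(xs, ys, k, rng):
    """k-fold cross-validated MSE: fit on k-1 folds, test on the held-out fold."""
    sq_err = 0.0
    for test in kfold_indices(len(xs), k, rng):
        held = set(test)
        train = [i for i in range(len(xs)) if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_err += sum((ys[i] - (a + b * xs[i])) ** 2 for i in test)
    return sq_err / len(xs)

rng = random.Random(0)
n = 30
xs = [rng.gauss(0, 1) for _ in range(n)]
ys = [rng.gauss(0, 1) for _ in range(n)]   # assumption: outcome is pure noise

a, b = fit_line(xs, ys)
in_sample = mse(xs, ys, a, b)              # error on the data used for fitting
out_of_sample = cv_mse(xs, ys, k=5, rng=rng)
print(in_sample, out_of_sample)
```

For a least-squares fit, the held-out error cannot be smaller than the in-sample error, so the second number is at least as large as the first even with a single predictor.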


Statistical Significance vs Useful Prediction

A second reason that significant statistical association does not imply practically useful prediction is exemplified by the psychiatric genetic literature. Large genome-wide association studies have now identified significant associations between genetic variants and mental illness diagnoses. For example, Ripke et al6 compared more than 21 000 patients with schizophrenia with more than 38 000 patients without schizophrenia and found 22 genetic variants significant at a genome-wide level (P = 5 × 10^−8), the strongest of which (rs9268895) had a combined P value of 9.14 × 10^−14. However, this strongest association would be useless on its own as a predictor of schizophrenia. The combined odds ratio for this risk variant was 1.167; assuming a population prevalence of schizophrenia of 1 in 196 individuals as the baseline risk,7 possessing the risk allele for this strongest variant would raise an individual’s risk to 1 in 167. Such an effect is far from clinically actionable. In fact, the increased availability of large samples has made clear the point that Meehl8 raised more than 50 years ago, which stated that in the context of null hypothesis testing, as samples become larger, even trivial associations become statistically significant.

A more general challenge exists regarding the prediction of uncommon outcomes, such as a diagnosis of schizophrenia. Consider the case in which a researcher has developed a test for schizophrenia that has 99% sensitivity (ie, a 99% likelihood that the test will return a positive result for someone with the disease) and 99% specificity (ie, a 99% likelihood that the test will return a negative result for someone without the disease). These are performance levels that any test developer would be thrilled to obtain; in comparison, mammography has a sensitivity of 87.8% and a specificity of 90.5% for the detection of breast cancer.9 If this test for schizophrenia were used to screen 1 million people, it would detect 99% of those with schizophrenia (5049 individuals) but would also incorrectly detect 9949 individuals without schizophrenia; thus, even with exceedingly high sensitivity and specificity, the predictive value of a positive test result remains well below 50%. As we can straightforwardly deduce from Bayes' theorem, false alarm rates will usually be high when testing for events with low baseline rates of occurrence.
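
The screening arithmetic can be checked directly. The sketch below assumes a prevalence of 0.51% (roughly the 1 in 196 quoted earlier); that rounding is an assumption made here because it reproduces the counts in the text, and the function name screening_counts is hypothetical.

```python
def screening_counts(n_people, prevalence, sensitivity, specificity):
    """Return (true positives, false positives) for a screened population."""
    diseased = round(n_people * prevalence)
    healthy = n_people - diseased
    true_pos = round(diseased * sensitivity)          # cases correctly flagged
    false_pos = round(healthy * (1 - specificity))    # non-cases wrongly flagged
    return true_pos, false_pos

# Assumption: prevalence rounded to 0.51%, about 1 in 196.
tp, fp = screening_counts(1_000_000, 0.0051, 0.99, 0.99)

# Positive predictive value: the share of positive results that are true cases.
ppv = tp / (tp + fp)
print(tp, fp, round(ppv, 3))   # 5049 9949 0.337
```

Under these assumptions only about one positive result in three belongs to someone with the disease, which is the sense in which the predictive value stays well below 50%.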


Misinterpretation of Association as Prediction

A significant statistical association is insufficient to establish a claim of prediction. However, in our experience, it is common for investigators in the functional neuroimaging literature to use the term prediction when describing a significant in-sample statistical association. To quantify the prevalence of this practice, we identified 100 published studies between December 24, 2017, and October 30, 2018, in PubMed by using the search terms fMRI prediction and fMRI predict. For each study, we identified whether the purported prediction was based on a statistical association, such as a significant correlation or regression effect, or whether the researchers used a statistical procedure specifically designed to measure prediction, such as cross-validation or out-of-sample validation. We only included studies that purported to predict an individual-level outcome based on fMRI data and excluded other uses of the term prediction, such as studies examining reward prediction error. A detailed description of these studies is presented in the eTable in the Supplement.

Of the 100 studies assessed, 45 reported an in-sample statistical association as the sole support for the claims of prediction, suggesting that the conflation of statistical association and predictive accuracy is common.10 The remaining studies used a mixture of cross-validation strategies, as shown in Figure 3.


Factors That Can Bias Assessment of Prediction

Although performing some type of assessment of an out-of-sample prediction is essential, it is also clear that cross-validation still leaves room for errors when establishing predictive validity. We now turn to issues that can affect the estimation of predictive accuracy even when using appropriate predictive modeling methods.

- Small Samples

The use of cross-validation with small samples can lead to highly variable estimates of predictive accuracy. Varoquaux11 noted that a general decrease in the level of reported prediction accuracy can be observed as sample sizes increase. Given the flexibility of analysis methods12 and publication bias for positive results, such that only the top tail of accuracy measures is reported, the high variability of estimates with small samples can lead to a body of literature with inflated estimates of predictive accuracy.

Our literature review found a high prevalence of small samples, with more than half of the samples comprising fewer than 50 people and 15% of the studies with samples comprising fewer than 20 people (Figure 3). Most studies that use small samples are likely to exhibit highly variable estimates. This finding suggests that many of the claims of predictive accuracy in the neuroimaging literature may be exaggerated and/or not valid.

- Leakage of Test Data

To give a valid measure of predictive accuracy, cross-validation needs to build on a clean isolation of the test data during the fitting of models to the training data. If information leaks from the testing set into the model-fitting procedure, then estimates of predictive accuracy will be inflated, sometimes wildly. For example, any variable selection that is applied to the data before application of cross-validation will bias the results if the selection involves knowledge of the variable being predicted. Of the 57 studies in our review that used cross-validation procedures, 10 may have applied dimensionality reduction methods that involved the outcome measure (eg, thresholding based on correlation) to the entire data set. This lack of clarity raises concerns regarding the level of methodological reporting in these studies.13

In addition, any search across analytic methods, such as selecting the best model or the model parameters, must be performed using nested cross-validation, in which a second cross-validation loop is used within the training data to determine the optimal method or parameters. The best practice is to include all processing operations within the cross-validation loop to prevent any potential for leakage. This practice is increasingly possible using cross-validation pipeline tools, such as those available within the scikit-learn software package (scikit-learn Developers).14
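
A rough way to see the leakage problem is to run variable selection once on the full data set and once inside each training fold. The sketch below is an illustration only, written in plain Python rather than with scikit-learn pipelines: the data are simulated noise with arbitrary labels, the classifier is a simple nearest-centroid rule, and all function names are hypothetical.

```python
import random

def top_features(X, y, rows, m):
    """Rank features by |mean(class 1) - mean(class 0)| using only `rows`."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    def gap(j):
        return abs(sum(X[i][j] for i in pos) / len(pos)
                   - sum(X[i][j] for i in neg) / len(neg))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:m]

def centroid_predict(X, y, train, feats, i):
    """Assign row i to the class whose training centroid is nearest."""
    best, best_d = None, None
    for label in (0, 1):
        members = [r for r in train if y[r] == label]
        d = sum((X[i][j] - sum(X[r][j] for r in members) / len(members)) ** 2
                for j in feats)
        if best_d is None or d < best_d:
            best, best_d = label, d
    return best

def cv_accuracy(X, y, folds, m, select_inside):
    """Cross-validated accuracy with selection inside or outside the loop."""
    all_rows = list(range(len(X)))
    # Leaky variant: features are chosen once, using the test rows as well.
    fixed = None if select_inside else top_features(X, y, all_rows, m)
    correct = 0
    for test in folds:
        held = set(test)
        train = [i for i in all_rows if i not in held]
        feats = top_features(X, y, train, m) if select_inside else fixed
        correct += sum(centroid_predict(X, y, train, feats, i) == y[i]
                       for i in test)
    return correct / len(X)

rng = random.Random(0)
n, p = 40, 200
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]  # pure noise
y = [i % 2 for i in range(n)]                                # arbitrary labels
idx = list(range(n))
rng.shuffle(idx)
folds = [idx[i::5] for i in range(5)]

leaky = cv_accuracy(X, y, folds, m=10, select_inside=False)
clean = cv_accuracy(X, y, folds, m=10, select_inside=True)
print(leaky, clean)
```

Because the features carry no real signal, the estimate with selection inside the loop would be expected to stay near chance, while the leaky estimate will typically look much better than it should.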

- Model Selection Outside of Cross-validation

Selecting a predictive method based on the data creates an opportunity for bias that could involve the potential use of a number of different classifiers, hyperparameters for those classifiers, or various preprocessing methods. As in standard data analysis, there is a potential garden of forking paths,15 such that data-driven modeling decisions can bias the resulting outcomes even if there is no explicit search for methods providing the best results. The outcomes are substantially more biased if an explicit search for the best methods is performed without a held-out validation set.

As reported in studies by Skocik et al16 using simulations and Varoquaux11 using fMRI data, it is possible to obtain substantial apparent predictive accuracy from data without any true association if a researcher capitalizes on random fluctuations in classifier performance and searches across a large parameter space. A true held-out validation sample is a good solution to this problem. A more general solution to the problem of analytic flexibility is the preregistration of analysis plans before any analysis, as is increasingly common in other areas of science.17

- Nonindependence Between Training and Testing Sets

Like any statistical technique, the use of cross-validation to estimate predictive accuracy involves assumptions, the failure of which can undermine the validity of the results. An important assumption of cross-validation is that observations in the training and testing sets are independent. While this assumption is often valid, it can break down when there are systematic relationships between observations. For example, the Human Connectome Project data set includes data from families, and it is reasonable to expect that family members will be closer to each other in brain structure and function than will individuals who are not biologically related.

Similarly, data collected as a time series will often exhibit autocorrelation, such that observations closer in time are more similar. In these cases, there are special cross-validation strategies that must be used to address this structure. For example, in the presence of family structure, such as the sample used in the Human Connectome Project, a researcher might cross-validate across families (ie, leave-k-families-out) rather than individuals to address the nonindependence potentially induced by family structure.18
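
Cross-validating across families rather than individuals amounts to assigning whole groups to folds. The following is a minimal sketch of that idea; the family labels are invented and the function name leave_groups_out is hypothetical.

```python
from collections import defaultdict

def leave_groups_out(groups, k):
    """Yield (train, test) index lists so that no group is split across them."""
    members = defaultdict(list)
    for i, g in enumerate(groups):
        members[g].append(i)
    # Assign whole groups to folds, largest first, always to the emptiest fold.
    folds = [[] for _ in range(k)]
    for g in sorted(members, key=lambda g: len(members[g]), reverse=True):
        min(folds, key=len).extend(members[g])
    for test in folds:
        held = set(test)
        train = [i for i in range(len(groups)) if i not in held]
        yield train, sorted(test)

# Hypothetical family identifiers, one per participant.
family = ["A", "A", "B", "B", "B", "C", "D", "D", "E", "F"]

for train, test in leave_groups_out(family, k=3):
    train_families = {family[i] for i in train}
    test_families = {family[i] for i in test}
    # Relatives never sit on both sides of the split.
    assert not train_families & test_families
    print(sorted(test_families))
```

scikit-learn offers a GroupKFold splitter in sklearn.model_selection for the same purpose; the hand-written version here is only meant to make the logic visible.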

- Quantification of Predictive Accuracy

Two main categories of problems occur in predictive modeling. The first, classification accuracy, involves the prediction of discrete class membership, such as the presence or absence of a disease diagnosis; the second, regression accuracy, involves the prediction of a continuous outcome variable, such as a test score or disease severity measure. In our literature review, we found that 37 studies performed classification while 64 performed regression to determine predictive accuracy. These strategies generally involve different methods for quantification of accuracy, but in each case, potential problems can arise through the naive use of common methods.

- Quantifying Classification Accuracy

In a classification problem, we aim to quantify our ability to accurately predict class membership, such as the presence of a disease or a cognitive state. When the number of members in each class is equal, then average accuracy (ie, the proportion of correct classifications, as used in the examples in Figure 2) is a reasonable measure of predictive accuracy. However, if any imbalance exists between the frequencies of the different classes, then average accuracy is a misleading measure. Consider the example of a predictive model for schizophrenia, which has a prevalence of 0.5% in the population; the classifier can achieve average accuracy of 99.5% across all cases by predicting that no one has the disease, simply owing to the low frequency of the disease.

A standard method to address the class imbalance problem is to use the receiver operating characteristic curve from signal detection theory.19 A receiver operating characteristic curve can be constructed given any continuous measure of evidence, as provided by most classification models. A threshold is then applied to this measure of evidence, systematically ranging from low (in which most cases will be assigned to the positive class, and the number of false positives will be high) to high (in which most cases will be assigned to the negative class, and the number of false positives will be low). The area under the curve can then be used as an integrated measure of classification accuracy. A perfect prediction leads to an area under the curve of 1.0, while a fully random prediction leads to an area under the curve of 0.5. Importantly, the area under the curve value of 0.5 expected by chance is not biased by imbalanced frequencies of positive and negative cases in the way that simple measures of accuracy would be. It is also useful to separately present the sensitivity (ie, the proportion of positive cases correctly identified as positive) and specificity (ie, the proportion of negative cases correctly identified as negative) of the classifier, to allow assessment of the relative balance of false positives and false negatives.
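
The contrast between raw accuracy and the area under the curve can be made concrete with a toy sample at 0.5% prevalence. This sketch computes the area under the curve from its rank interpretation (the probability that a random positive case scores above a random negative case, with ties counted as one half); the function names are chosen for this example.

```python
def accuracy(y_true, y_pred):
    """Proportion of cases classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """Probability that a random positive case scores above a random negative
    case (ties count one half); equals the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    n_pos = sum(y_true)
    return tp / n_pos, tn / (len(y_true) - n_pos)

y_true = [1] * 5 + [0] * 995          # 0.5% prevalence, as in the text
always_negative = [0] * 1000          # classifier that never predicts disease

print(accuracy(y_true, always_negative))                  # 0.995
print(roc_auc(y_true, [0.0] * 1000))                      # 0.5
print(sensitivity_specificity(y_true, always_negative))   # (0.0, 1.0)
```

The uninformative classifier scores 99.5% on accuracy but sits at the chance level of 0.5 on the area under the curve, and its sensitivity of 0 shows that it never identifies a single case.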

- Quantifying Regression Accuracy

It is increasingly common to apply predictive modeling in cases in which the outcome variable is continuous rather than discrete, that is, in regression rather than classification problems. For example, a number of studies in cognitive neuroscience have attempted to predict phenotypic measures, such as age,20 personality,21 or behavioral outcomes.22 For continuous predictions, accuracy can be quantified either by the relation between the predicted and actual values, relative to perfect prediction, or by a measure of the absolute difference between predicted and actual values (ie, the error). A relative measure is useful because its value can easily be related to the success of the prediction. For this purpose, a useful measure is the fraction of explained variance, often called the coefficient of determination or R². If a model makes perfect predictions, its associated R² value will be 1.0, whereas a model making random predictions should have an R² value of approximately 0. If a model is particularly poor, to the point that its predictions are less accurate than they would be if the model simply returned the mean value for the data set, the R² value can be negative, despite the fact that it is called R². The disadvantage of this measure is that it does not support comparisons of the quality of predictions across different data sets because the variance of the outcome variable may differ between one data set and another. For this purpose, absolute error measurements, such as the mean absolute error, which has the benefit of quantifying error in the units of the original measure (such as IQ points), are useful.

It is common in the literature to use the correlation between predicted and actual values as a measure of predictive performance; of the 64 studies in our literature review that performed prediction analyses on continuous outcomes, 30 reported such correlations as a measure of predictive performance. This reporting is problematic for several reasons. First, correlation is not sensitive to scaling of the data; thus, a high correlation can exist even when predicted values are discrepant from actual values. Second, correlation can sometimes be biased, particularly in the case of leave-one-out cross-validation. As demonstrated in Figure 4, the correlation between predicted and actual values can be strongly negative when no predictive information is present in the model. A further problem arises when the variance explained (R²) is incorrectly computed by squaring the correlation coefficient. Although this computation is appropriate when the model is obtained using the same data, it is not appropriate for out-of-sample testing23; instead, the amount of variance explained should be computed using the sum-of-squares formulation (as implemented in software packages such as scikit-learn).
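
A small worked example shows why the two ways of computing explained variance can disagree out of sample. The numbers are hypothetical test scores chosen for illustration, and the function names are made up for this sketch; the sum-of-squares form is R^2 = 1 - SS_residual / SS_total.

```python
def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def r2_sum_of_squares(actual, predicted):
    """R^2 = 1 - SS_residual / SS_total, with SS_total taken about mean(actual)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

def mean_absolute_error(actual, predicted):
    """Average unsigned error, in the units of the outcome."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)

actual = [90.0, 95.0, 100.0, 105.0, 110.0]   # hypothetical test scores
predicted = [y + 20.0 for y in actual]       # right ordering, each value 20 too high

r = pearson_r(actual, predicted)
print(r ** 2)                                    # 1.0
print(r2_sum_of_squares(actual, predicted))      # -7.0
print(mean_absolute_error(actual, predicted))    # 20.0
```

Squaring the correlation suggests a perfect model, while the sum-of-squares R^2 is strongly negative because every prediction is off by 20 points, which the mean absolute error reports in the original units.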

As discussed previously in this section, leave-one-out cross-validation is problematic because it allows for the possibility of negative R² values. For classification settings, the effect is the same; in a perfectly balanced data set, leave-one-out cross-validation creates a testing set comprising a single observation that is in the minority class of the training set. A simple prediction rule, such as majority vote, would thus lead to predictions that would be incorrect.24 Rather, the preferred method of performing cross-validation is to leave out 10% to 20% of the data, using k-fold or shuffle-split techniques that repeatedly split the data randomly. Larger testing sets enable a good computation of measurements, such as the coefficient of determination or area under the receiver operating characteristic curve.
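
Both leave-one-out pathologies can be reproduced with models that contain no predictive information at all. The sketch below is an illustration under simple assumptions (a mean-only regression model and a majority-vote classifier, on made-up data); it is not the simulation behind Figure 4, and the function names are hypothetical.

```python
def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def loo_mean_predictions(ys):
    """Leave-one-out predictions from a model that knows nothing: each
    held-out point is predicted by the mean of the remaining points."""
    total, n = sum(ys), len(ys)
    return [(total - y) / (n - 1) for y in ys]

def loo_majority_accuracy(labels):
    """Leave-one-out accuracy of a majority-vote rule."""
    correct = 0
    for i, held_out in enumerate(labels):
        train = labels[:i] + labels[i + 1:]
        majority = max(set(train), key=train.count)
        correct += majority == held_out
    return correct / len(labels)

ys = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]     # arbitrary outcome values
preds = loo_mean_predictions(ys)
r = pearson_r(ys, preds)
print(round(r, 6))                                # -1.0

balanced = [0, 1] * 10                            # perfectly balanced classes
print(loo_majority_accuracy(balanced))            # 0.0
```

Removing a point lowers the training mean when the point is high and raises it when the point is low, so the predictions are perfectly anticorrelated with the outcomes; likewise, removing one case from a balanced sample always makes its class the minority, so majority vote is wrong every time.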


Best Practices for Predictive Modeling

We have several suggestions for researchers engaged in predictive modeling to ensure accurate estimates of predictive validity:

• In-sample model fit indices should not be reported as evidence for predictive accuracy because they can greatly overstate evidence for prediction and take on positive values even in the absence of true generalizable predictive ability.

• The cross-validation procedure should encompass all operations applied to the data. In particular, predictive analyses should not be performed on data after variable selection if the variable selection was informed to any degree by the data themselves (ie, post hoc cross-validation). Otherwise, estimated predictive accuracy will be inflated owing to circularity.25

• Prediction analyses should not be performed with samples smaller than several hundred observations, based on the finding that predictive accuracy estimates with small samples are inflated and highly variable.26

• Multiple measures of prediction accuracy should be examined and reported. For regression analyses, measures of variance, such as R², should be accompanied by measures of unsigned error, such as mean squared error or mean absolute error. For classification analyses, accuracy should be reported separately for each class, and a measure of accuracy that is insensitive to relative class frequencies, such as area under the receiver operating characteristic curve, should be reported.

• The coefficient of determination should be computed by using the sums-of-squares formulation rather than by squaring the correlation coefficient.

• k-fold cross-validation, with k in the range of 5 to 10,27 should be used rather than leave-one-out cross-validation because the testing set in leave-one-out cross-validation is not representative of the whole data and is often anticorrelated with the training set.


Author Contributions: Dr Poldrack and Ms Huckins had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.

Concept and design: Poldrack, Varoquaux.

Acquisition, analysis, or interpretation of data: Poldrack, Huckins.

Drafting of the manuscript: Poldrack.

Critical revision of the manuscript for important intellectual content: Huckins, Varoquaux.

Statistical analysis: All authors.

Administrative, technical, or material support: Poldrack.