Most* small probabilities aren’t pascalian

2023-12-152023-11-03 ~ Gregory Lewis ~ Leave a comment

Summary: We routinely act to prevent, mitigate, or insure against risks with P = ‘one-in-a-million’. Risks similarly or more probable than this should not prompt concerns about ‘pascal’s mugging’ etc.

Motivation

Reckless appeals to astronomical stakes often prompt worries about pascal’s mugging or similar. Sure, a 10^-20 chance of 10^40 has the same expected value as 10^20 with P = 1, but treating them as equivalent when making decisions is counter-intuitive. Thus one can (perhaps should) be wary of lines which amount to “The scale of the longterm future is so vast we can basically ignore the probability – so long as it is greater than 10^-lots – to see x-risk reduction is the greatest priority.”

Most folks who work on (e.g.) AI safety do not think the risks they are trying to reduce are extremely (nor astronomically) remote. Pascalian worries are unlikely to apply to attempts to reduce a baseline risk of 1/10 or 1/100. They also are unlikely to apply if the risk is a few orders of magnitude less (or a few orders of magnitude less tractable to reduce) than some suppose.

Despite this, I sometimes hear remarks along the lines of “I only think this risk is 1/1000 (or 1/10 000, or even ‘a bit less than 1%’) so me working on this is me falling for Pascal’s wager.” This is mistaken: an orders-of-magnitude lower risk (or likelihood of success) makes, all else equal, something orders of magnitude less promising, but it does not mean it can be dismissed out-of-hand.

Exactly where the boundary should be drawn for pascalian probabilities is up for grabs (10^-10 seems reasonably pascalian, 10^-2 definitely not). I suggest a very conservative threshold at ‘1 in a million’: human activity in general (and our own in particular) is routinely addressed to reduce, mitigate, or insure against risks between 1/1000 and 1/1 000 000, and we typically consider these activities ‘reasonable prudence’ rather than ‘getting mugged by mere possibility’.

Illustrations

Among many other things:

Aviation and other ‘safety critical’ activities

One thing which can go wrong when flying an airliner is an engine stops working. Besides all the engineering and maintenance to make engines reliable, airlines take many measures to mitigate this risk:

Airliners have more than one engine, and are designed and operated so that they are able to fly and land at a nearby airport ‘on the other engine’ should one fail at any point in the flight.
Pilots practice in initial and refresher simulator training how to respond to emergencies like an engine failure (apparently engine failure just after take-off is the riskiest)
Pilots also make a plan before each flight what to do ‘just in case’ an engine fails whilst they are taking off.

This risk is very remote: the rate of (jet) engine failure is something like 1 per 400 000 flight hours. So for a typical flight, maybe the risk is something like 10^-4 to 10^-5. The risk of an engine failure resulting in a fatal crash is even more remote: the most recent examples I could find happened in the 90s. Given the millions of airline flights a year, ‘1 in a million flights’ is comfortable upper bound.

Similarly, the individual risk-reduction measures mentioned above are unlikely to be averting that many micro(/nano?) crashes. A pilot who (somehow) manages to skive off their recurrency training or skip the pre-flight briefing may still muddle through if the risk they failed to prepare for realises. I suspect most consider the diligent practice by pilots for events they are are unlikely to ever see in their career admirable rather than getting suckered by Pascal’s mugging.

Aviation is the poster child of safety engineering, but it is not unique. Civil engineering disasters (think building collapses) share similar properties to aviation ones: the absolute rate is very low (plausibly of the order of ‘1 per million structure years’ or lower); this low risk did not happen by magic, but rather through concerted efforts across design, manufacture, maintenance, and operation; by design, a failure in one element should not lead to disaster (cf. the famous ‘swiss cheese model‘); it also means the marginal effect of each contribution to avoiding disaster is very small. A sloppy inspection or shoddy design does not guarantee disaster, yet thorough engineers and inspectors are lauded rather than ridiculed. The same points can be made can be made across most safety and many security fields.

Voting (and similar collective actions)

There’s a well worn discussion about the rationality of voting, given the remote likelihood a marginal vote is decisive. A line popular in EA-land is considering voting as rational charity: although the expected value to you if your vote is decisive to get the better party in government might only be cents, if one is voting for party one believes best improves overall welfare across the electorate, this expected value (roughly) gets multiplied by population, and so climbs into the thousands of dollars.

Yet this would not be rational if in fact individuals should dismiss motivating reasons based on 1/1 000 000 probabilities as ‘Pascal’s mugging’. Some initial work making this case suggested the key probability of being the decisive vote was k/ n(voters in election), with k between 1 and 10 depending how close the election was expected to be. So voting in larger elections (>10M voters) could not be justified by this rationale.

The same style of problem applies to other collective activity: it is similarly unlikely you will make the decisive signature for a petition to be heeded, the decisive attendance for a protest to gain traction, nor the decisive vegan which removes a granule of animal product production. It is doubtful they are irrational in virtue of Pascal’s mugging.

Asteroid defence

Perhaps the OG x-risk reducers are those who work on planetary defence: trying to look for asteroids which could hit Earth; and planning how, if one was headed our way, how a collision could be avoided. These efforts have been ongoing for decades, and have included steadily improvements in observation and detection, and recently included tests of particular collision avoidance methods.

The track record of asteroid impacts (especially the most devastating) indicate this risk is minute, and still lower when conditioned on diligent efforts to track near-earth objects and finding none of the big ones are on a collision course. Once again, the rate of a ‘planet killer’ collision is somewhere around 10^-6 to 10^-7 per century. Also once again, I think most are glad this risk is being addressed rather than ignored.

Conclusion

Three minor points, then one major one.

One is that Pascal’s mugging looks wrong at 1 in a million does not make the worry generally misguided; even if ‘standard rules apply’ at 10^-6, maybe something different is called for at (say) 10^-60. I only argue pascalian worries are inapposite at or above the 10^-6 level.

Two is you can play with interval or aggregation to multiply up or down the risk. Even if ‘not looking both ways before you cross the road’ once incurs only a minute risk of serious injury (as a conservative BOTEC: ~2000 injuries/year in the UK, ~70M population, cross a road 1/day on average, RR = 100 for not looking both ways ~~ 8/ million per event?) following this as a policy across one’s lifetime increases the risk a few orders of magnitude – and everyone following this policy would significantly increase the number of pedestrians struck and killed by cars each year.

This highlights a common challenge for naive (and some not so naive) efforts to ‘discount low probabilities’: we can slice up some composite risk reduction measures such that individually every one should be rejected (“the risk of getting hit by a car but-for looking both ways as you cross the road is minuscule enough to the subject of Pascal’s mugging”), yet we endorse the efforts as a whole given their aggregate impact. Perhaps the main upshot was 1/1 000 000 risk as the ‘don’t worry about Pascal’s mugging’ was too conservative – maybe more ‘at least 1/1 000 000 on at least one reasonable way to slice it‘

Three is we might dispute how reasonable the above illustrations are. Maybe we err risk-intolerant or over-insure ourselves; maybe the cost savings of slightly laxer airline safety would be worth an aviation disaster or two a year; maybe the optimal level of political engagement should be lower than the status quo; maybe asteroid impacts are so remote its practitioners should spend their laudable endeavour elsewhere. Yet even if they are unreasonable, they are unreasonable because the numbers are not adding up (/multiplying together) per orthodox expected utility theory, and not because they should have been ruled out in principle by a ‘pascal’s mugging’ style objection.

Finally, returning to x-risk. The examples above were also chosen to illustrate a different ‘vibe’ that could apply to x-risk besides ‘impeding disaster and heroic drama’. Safety engineering is non-heroic by design: a saviour snatching affairs from the jaws of disaster indicates an intolerable single point of failure. Rather, success is a team effort which is resilient to an individual’s mistake, and their excellence only slightly notches down the risk even further. Yet this work remains both laudable and worthwhile: a career spent investigating not-so-near misses to tease out human factors to make them even more distant has much to celebrate, even if not much of a highlight reel.

‘Existential safety’ could be something similar. Risks of AI, nukes, pandemics etc. should be at least as remote of those of a building collapsing, a plane crashing, or a nuclear power plant melting down. Hopefully these risks are similarly remote, and hopefully one’s contribution amounts to a slight incremental reduction. Only the vainglorious would wish otherwise.

Not all hopes are expectations, and facts don’t care about how well we vibe with them. Most people working on x-risk (including myself) think the risks they work on are much more likely than an airliner crashing. Yet although the scale of the future may be inadequate stakes for pascalian gambles, its enormity is sufficient to justify most non-pascalian values. Asteroid impacts, although extremely remote, still warrant some attention. If it transpired all other risks were similarly unlikely, I’d still work on mine.

Whether x-risk reduction is the best thing one can do if the risks are more ‘bridge collapse’ than ‘Russian roulette’ turns on questions like how should one price the value of the future, and how it stacks up versus other contributions to other causes. If you like multiplying things at approximate face value as much as I do, the answer plausibly still remains ‘yes’. But if ‘no’, pascal’s mugging should not be the reason why.

Terminate deliberation based on resilience, not certainty

2023-12-082023-11-03 ~ Gregory Lewis ~ Leave a comment

BLUF:

“We should ponder until it no longer feels right to ponder, and then to choose one of the acts it feels most right to choose… If pondering comes at a cost, we should ponder only if it seems we will be able to separate better options from worse options quickly enough to warrant the pondering” – Trammell

Introduction

Many choices are both highly uncertain, highly consequential, and difficult to develop a highly-confident impression of which option is best. In EA-land, the archetypal example is career choice, but similar dilemmas are common in corporate decision-making (e.g. grant-making, organisational strategy) and life generally (e.g. “Should I marry Alice?” “Should I have children with Bob?”).

Thinking carefully about these choices is wise, and my impression is people tend to err in the direction of too little contemplation – stumbling over important thresholds that set the stage for the rest of their lives. One of the (perhaps the main) key messages of effective altruism is our altruistic efforts err in a similar direction. Pace more cynical explanations, I think (e.g.) the typical charitable donor (or typical person) is “using reason to try and do good”. Yet they use reason too little and decide too soon whilst the ‘returns to (further) reason’ remain high, and so typically fall far short of what they could have accomplished.

One can still have too much of a good thing: ‘reasonableness’ ‘prudence’ or even ‘caution’ can be wise, but ‘indecisiveness’ less so. I suspect many can recall occasions of “analysis paralysis”, or even when they suspect prolonged fretting worsened the quality of the decision they finally made. I think folks in EA-land tend to err in this opposite direction, and I find myself giving similar sorts of counsel on these themes to those who (perhaps unwisely) seek my advice. So I write.

Certainty, resilience, and value of (accessible) information

It is widely appreciated that we can be more certain (or confident) of some things than others: I can be essentially certain I have two eyes, whilst guessing with little confidence whether it will rain tomorrow. We can often (and should even oftener) use numbers and probability to express different levels of certainty – e.g. P(I have two eyes) ~ 1 (- 10^-(lots)); P(rain tomorrow) ~ 0.5.

Less widely appreciated is the idea that our beliefs can vary not only in their certainty but how much we expect this certainty to change. This ‘second-order certainty’ sometimes goes under the heading ‘credal (¬)fragility’ or credal resilience. These can come apart, especially in cases of uncertainty.[1] I might think the chances of ‘The next coin I flip will land heads’ and ‘Rain tomorrow’ are 50/50, but I’d be much more surprised if my confidence in the former changed to 90% than the latter: coins tend fair, whilst real weather forecasts I consult often surprise my own guesswork, and prove much more reliable. For coins, I have resilient uncertainty; for the weather, non-resilient uncertainty.

Credal resilience is – roughly – a forecast of the range or spread of our future credences. So when applied to ourselves, it is partly a measure of what resources we will access to improve our guesswork. My great-great-great-grandfather, sans access to good meteorology, probably had much more resilient uncertainty about tomorrow’s weather. However the ‘resource’ need not be more information, sometimes it is simply more ‘thinking time’: a ‘snap judgement’ I make on a complex matter may also be one I expect to shift markedly if I take time to think it through more carefully, even if I only spend that time trying to better weigh information I already possess.

Thus credal resilience is one part of value of information (or value of contemplation): all else equal, information (or contemplation) applied to a less resilient belief is more valuable than a more resilient one, as there’s a higher expected ‘yield’ in terms of credences changing (hopefully/expectedly for the better). Of course, all else is seldom equal; accuracy on some beliefs can be much more important than others: if I was about to ‘call the toss’ for a high profile sporting event, eking out even a miniscule ‘edge’ might be worth much more than the much easier to obtain, and much more informative, weather forecast for tomorrow.

Decision making under (resilient?) uncertainty

Most ‘real life’ decision-making under uncertainty (so most real life decision-making in general) has the hidden bonus option of ‘postpone the decision to think about your options more’. When is taking the bonus option actually a bonus, rather than a penalty?

Simply put: “Go with your best guess when you’re confident enough” is the wrong answer, and “Go with your best guess when you’re resilient enough” is the right one.

The first approach often works, and its motivation is understandable. For low-stakes decisions (“should I get this or that for dinner?”), I might be happy only being 60% confident I am taking the better option. For many high-stakes decisions (e.g. “what direction should I take my career for the next several years?”) I’d want to be at least 90% sure I’m making the right call.

But you can’t always get what you want, and the world offers no warranty you can make all (or any) of your important decisions with satisfying confidence. Thus this approach runs aground in cases of resilient uncertainty: you want to be 90% sure, but you are 60% sure, and try as you might you stay 60% sure. Yet you keep trying in the increasingly forlorn hope the next attempt will excavate the crucial nugget of information which makes the decision clear.

Deciding once you are resilient enough avoids getting stuck in this way. In this approach, the ‘decision to decide’ is not based on achieving some aspirational level of certainty, but comparing the marginal benefits of better decision accuracy versus earlier decision execution. If my credence for “I should – all things considered[2] – go to medical school” is 0.6, but I thought it would change by 0.2 on average if I spent a gap year to think it over (e.g. a distribution of Beta(3,2)), maybe waiting a year makes sense: there’s maybe a 30% chance I would think medical school wasn’t the right choice after all, and averting that risk may be worth more than the 70% chance further reflection gets the same result and I delay the decision for a year.

Suppose I take that year, and now my credence is 0.58, but I think this would only change by +/- 0.05 if I took another year out (e.g. Beta(55, 40)): my deliberation made me more uncertain, but this uncertainty is much more resilient. Taking another gap year to deliberate looks less reasonable: although I think there’s a 42% chance this decision is the wrong call, there’s only a 6% chance I would change my mind with another year of agonising. My best guess is resilient enough; I should take the plunge.[3]

Track records are handy in practically assessing credal resilience

This is all easier said than done. One challenge is applying hard numbers to a felt sense of uncertainty is hard. Going further and (like the toy example) applying hard numbers to forecast the distribution of what this felt sense of uncertainty would be after some further hypothesised information/contemplation to assess credal resilience looks nigh-impossible. “I’m not sure, then I thought about it for a bit, now I’m still not sure. What now?”

However, this sort of ‘auto-epistemic-forecasting’, like geopolitical forecasting, is not as hard as it first seems. In the latter, base-rates, reference classes, and trend extrapolation are often enough to at least get one in the right ballpark. In a similar way, tracking one’s credal volatility over previous deliberation can give you a good idea about the resilience of one’s uncertainty, and the value of further deliberation.

Imagine a graph which has ‘Confidence I should do X’ on the Y-axis (sorry), and some metric of amount of deliberation (e.g. hours, number of people you’ve called for advice, pages in your option-assessment google doc) on the X-axis. What might this graph look like?

The easy case is where one can see confidence (red line) trend in a direction with volatility (blue dashed envelope) steadily decreasing. This suggests one’s belief is already high confidence and high-resilience, and – extrapolating forward – spending a lot more time deliberating is unlikely to change this picture much.

Per previous remarks, the same applies even if deliberation resolves to a lesser degree of confidence. This suggests resilience and maybe decision time – as yet further deliberation can be expected to result in yet more minor perturbations around your (uncertain) best guess.

Alternatively, if the blue lines haven’t converged much (but are still converging), this suggests non-resilient uncertainty, and further deliberation can reasonably be expected to give further resolution, and this might be worth the further cognitive investment.

The most difficult cases are ‘stable (or worsening) vacillation’. After perhaps some initial progress, one finds one’s confidence keeps hopping up and down in a highly volatile way, and this volatility does not appear to be settling despite more and more time deliberating: one morning you’re 70% confident, that evening 40%, next week 55%, and so on.

Given the post’s recommendation (in graphical terms) is ‘look at how quickly the blue lines come together’, probably the best option here is to decide when it is clear they are no longer converging. The returns to deliberation seem to have capped out, and although your best guess now is unlikely to be the same as your best guess at time t+n, it is also unlikely to be worse in expectation either.

Naturally, few of us steadily track our credences re. some proposition over time as we deliberate further upon it (perhaps a practice worth attempting). Yet memory might be good enough for us to see what kind of picture applies. If I recall “I thought I should go to medical school, and after each ‘thing’ I did to inform this decision (e.g. do some research, shadow a doctor, work experience in a hospital) this impression steadily resolved”, perhaps it is the first or second graph. If I recall instead “I was certain I should become a doctor, but then I thought I should not because of something I read on the internet, but then I talked to my friend who still thought I should, and now I don’t know”, perhaps it is more the third graph.

This approach has other benefits. It can act as a check on the amounts of deliberation we are amassing, both in general and with respect to particular deliberative activities. In a general sense, we can benchmark how long quasi-idealized versions ourselves would mull over a decision (at least bounded to some loose order of magnitude – e.g. “more than a hour, but less than a month”). Exceeding these bounds in either direction can be tripwire to consider we are being too reckless or diffident. It can also be applied to help gauge the value of particular types of information or deliberation: if my view on taking a job changed markedly on talking to an employee, maybe it would be worth me talking to a couple more; if my decision has stayed at mediocre % as my deliberation google doc increased from 10 pages to 30, maybe not much is likely to change if I add another 20 pages.

Irresilient consolation

Although the practical challenges of deploying resilience to decision-making are not easy, I don’t think it is the main problem. In my experience, folks spending too long deliberating on a decision with resilient uncertainty are not happily investigating their dilemma, looking forward to figuring it out, yet blissfully unaware that their efforts are unlikely to pay off and they would be better off just going with their best guess. Rather, their dilemma causes them unhappiness and anxiety, they find further deliberation aversive and stressful, and they often have some insight this agonised deliberation is at least sub-optimal, if not counter-productive, if not futile. You at-least-kind-of know you should just decide, but resort to festinating (and festering) rumination to avoid confronting the uncertain dangers of decision.

One of the commoner tragedies of the human condition is that insight is not sufficient for cure. I may know that airliners are extremely safe means of travel, that I plausibly took greater risks driving to the airport than on the flight itself, yet nonetheless check the wings haven’t fallen off after every grumble of turbulence. Yet although ‘knowing is less than half the battle’, it remains a useful ally – in the same way cognitive-behavioural therapy is more than, but also partly, ‘getting people to argue themselves out of their depression and anxiety’.

The stuff above is my attempt to provide one such argument. There are others I’d often be minded to offer too. In ascending order of generality and defensibility.

“It’s not that big a deal”: Folks often overestimate the importance and reversibility of their decisions: in the same way the anticipated impact of an adverse life event on wellbeing tends to be greater than its observed impact, the costs of making the wrong decision may be inflated. Of my medical school cohort, several became disenchanted with medicine at various points, often after committing multiple years to the vocation; others simply failed the course and had to leave. None of this is ideal, but it was seldom (~never) a disaster that soured their life forevermore.

Similarly, big decisions can often be (largely) unwound. Typical ‘exits’ from medical school are (if very early) leaving to re-apply to university; (if after a couple of years) converting their study into a quasi-improvised bachelor’s in biological science; (if later) graduating as ‘doctor’ but working in another career afterwards. This is again not ideal: facially, spending multiple years preparing for a career one will not go on to practice is unlikely to be optimal – but the expected costliness is not astronomical.

Although I often say this, I don’t always. I think typical ‘EA dilemmas’ like “Should I take this job or that one?” “What should I major in?” are lower-stakes than the medical school examples above,[4] but that doesn’t mean all are. The world also offers no warranty that you’ll never face a decision where picking the wrong option is indeed hugely costly and very difficult to reverse. Doctrinal assertions otherwise are untrustworthy ‘therapy’ and unwise advice.

“Suck it and see”: trying an option can have much higher value of information than further deliberation. Experience is not only the teacher of fools: sometimes one has essentially exhausted all means to inform a decision in advance, leaving committing to one option and seeing how it goes the only means left. Another way of looking at it: it might be more time efficient to trial doing an option for X months rather than deliberating on whether to take it for X + n months.[5]

Obviously this consideration doesn’t apply for highly irreversible choices. But for choices across most of the irreversibility continuum, it can weigh in favour (and my impression is the “You also get information (and maybe the best available remaining information) from trying something directly, not just further deliberation” consideration is often neglected).

Another caveat is when the option being trialled is some sort of co-operative endeavour: others might hope to rely on your commitment, so may not be thrilled at (e.g.) a 40% risk you’ll back out after a couple of months. However, if you’re on the same team, this seems apt for transparent negotiation. Maybe they’re happy to accept this risk, or maybe there’s a mutually beneficial deal between you which can be struck: e.g. you commit to this option for at least Y months so even if you realise it is not for you earlier you will stick ‘stick it out’ a while to give them time to plan and mitigate for you backing out.[6]

“Inadequate amounts ventured, suboptimal amounts gained (in expectation)”: optimal strategy is reconciled to some exposure to the risks of mistake. As life is a risky prospect, you can seldom guarantee for yourself the best outcome: morons can get lucky, and few plans, no matter how wise, are proof from getting wrecked by sufficiently capricious fortune. You also can’t promise yourself to always adopt the optimal ex ante strategy either, but this is a much more realistic and prudentially worthwhile ambition.

One element of an ‘ideal strategy’ is an ideal level of risk tolerance. Typically, some available risk reductions are costly, so there comes a point where accepting the remaining risk can be better than further attempts to eliminate it – cf. ‘Umeshisms’ like “If you never miss a flight, you are spending too much time at airports”.

Although the ideal level of risk tolerance varies (cf. “If you’re never in a road traffic accident, you’re driving too cautiously”) it is ~always some, and for altruistic benefit the optimal level of (pure) risk aversion is typically ~zero. Thus optimal policy will calculate many substantial risks as nonetheless worth taking.

This can suck (worse, can be expected to suck in advance), especially in longtermist-land where the impact usually has some conjuncts which are “very uncertain, but plausibly very low probability”: maybe the best thing for you to do is devote yourself to some low probability hedge or insurance, where the likeliest ex post outcome was “this didn’t work” or (perhaps worse from the point of view of your own reflection) “this all was just tilting at windmills”.

Perhaps one can reconcile oneself to the point of view of the universe: it matters what rather than who, so best complementing the pre-existing portfolio is the better metric to ‘keep score’; or perhaps one could remember that one really should hope the risks one devotes oneself to are illusory (and ideally, all of them are); or perhaps one should get over it: “angst over inadequate self-actualization” is neither an important cause area nor an important decision criterion. But ultimately, you should do the right thing, even if it sucks.

High levels of certainty imply relatively high levels of credal resilience: the former places lower bounds on the latter. To motivate: suppose I was 99% sure of rain on Friday. Standard doctrine is my current confidence should also be the expected value of my future confidence. So in the same way 99% confidence is inconsistent with ‘I think there’s a 10% chance I will believe in on saturday P(rain friday) = 0% (i.e. actually 10% chance it doesn’t rain after all), it is also inconsistent with ‘I think there’s a 20% chance in an hour I will be 60% confident it will rain on Friday. Even if the rest of my probability mass was at 100%, this nets out to 92% overall (0.8*1 + 0.2*0.6).
Including, for example, all the (many) other options I have besides going to medical school.
One family of caveats (mentioned here for want of a better location) are around externalities. Maybe my own resilient confidence is not always sufficient to justify choices I make that could adversely affect others. Likewise a policy of action in these cases may go wrong when decision-makers are poorly calibrated and do not consult with one another enough (cf. unilateralists curse). I do not think this changes the main points, which apply to something like the resilience of one’s ‘reasonable, all things considered’ credence (including peer disagreement, option value, and sundry other considerations). Insofar as folks tend to err in the direction of being headstrong because they neglect their propensity to be overconfident, or fail to attend to their impact on others, it is worth stressing these as general reminders.
In terms of multiplier stacking (q.v.), if not absolute face value (but the former matters more here).
A related piece of advice I often give is to run ‘trial’ and ‘evaluation’ in series rather than in parallel. So, rather than commit to try X, but spend your time during this trial simultaneously agonising (with the ‘benefit’ of up-to-the-minute signals from how things are going so far) about whether you made the right choice, commit instead to try X for n months, and postpone evaluation (and the decision to keep going, stop, or whatever else) to time set aside afterwards.
I guess the optimal policy of ‘flakiness’ versus ‘stickiness’ is very context dependent. I can think of some people who kept ‘job hopping’ rapidly who I suspect would have been better served if they stuck with each option a while longer (leave alone the likely ‘transaction costs’ on their colleagues). However, surveying my anecdata I think I see more often folks sticking around too long in suboptimal (or just bad) activity. One indirect indicator is I hear too frequently misguided reasons for sticking around like “I don’t want to make my boss sad by leaving”, or “I don’t want to make life difficult for my colleagues filling in for me and finding a replacement”: employment contracts are not marriage vows, and your obligations here are (rightly) fully discharged by giving fair notice and helping transition/handover.

Rational predictions often update predictably*

2023-12-012023-11-03 ~ Gregory Lewis ~ Leave a comment

BLUF: One common supposition is a rational forecast, because it ‘prices in’ anticipated evidence, should follow a (symmetric) random walk. Thus one should not expect a predictable trend in rational forecasts (e.g. 10% day 1, 9% day 2, 8% day 3, etc.), nor commonly see this pattern when reviewing good forecasting platforms. Yet one does, and this is because the supposition is wrong: ‘pricing in’ and reflective equilibrium constraints only entail the present credence is the expected value of a future credence. Skew in the anticipated distribution can give rise to the commonly observed “steady pattern of updates in one direction”. Such skew is very common: it is typical in ‘will event happen by date’ questions, and one’s current belief often implies skew in the expected distribution of future credences. Thus predictable directions in updating are unreliable indicators of irrationality.

Introduction

Forecasting is common (although it should be commoner) and forecasts (whether our own or others) change with further reflection or new evidence. There are standard metrics to assess how good someone’s forecasting is, like accuracy and calibration. Another putative metric is something like ‘crowd anticipation’: if I predict P(X) = 0.8 when the consensus is P(X) = 0.6, but over time this consensus moves to P(X) = 0.8, regardless of how the question resolves, I might take this to be evidence I was ‘ahead of the curve’ in assessing the right probability which should have been believed given the evidence available.

This leads to scepticism about the rationality of predictors which show a pattern of ‘steadily moving in one direction’ for a given question: e.g. P(X) = 0.6, then 0.63, 0.69, 0.72 … and then the question resolves affirmatively. Surely a more rational predictor, observing this pattern, would try and make forecasts an observer couldn’t reliably guess to be higher or lower in the future, so forecast values follow something like a (symmetrical) random walk. Yet forecast aggregates (and individual forecasters) commonly show these directional patterns if tracking a given question.

Scott Alexander noted curiosity about this behaviour; Eliezer Yudkowsky has confidently asserted it is an indicator of sub-par Bayesian updating.[1] Yet the forecasters regularly (and predictably) notching questions up or down as time passes are being rational. Alexander’s curiosity was satisfied by various comments on his piece, but I write here as this understanding may be tacit knowledge to regular forecasters, yet valuable to explain to a wider audience.

Reflective equilibrium, expected value, and skew

One of the typical arguments against steadily directional patterns is they suggest a violation of reflective equilibrium. In the same way if I say P(X) = 0.8, I should not expect to believe P(X) = 1 (i.e. it happened) more than 80% of the time, I shouldn’t expect to believe P(X) [later] > P(X) [now] more likely than not. If I did, surely I should start ‘pricing that in’ and updating my current forecast upwards. Market analogies are commonly appealed to in making this point: if we know the stock price of a company is more likely than not to go up, why haven’t we bid up the price already?

This motivation is mistaken. Reflective equilibrium only demands one’s current forecast is the expected value of one’s future credence. So although the mean of P(X) [later] should equal P(X) [now], there are no other constraints on the distribution. If it is skewed, then you can have predictable update direction without irrationality: e.g. from P(X) = 0.8, I may think in a week I will – most of the time – have notched this forecast slightly upwards, but less of the time notching it further downwards, and this averages out to E[P(X) [next week]] = 0.8.

One common scenario for this sort of skew to emerge are ‘constant hazard’ forecast questions of the form “Will this event [which could happen ‘at any time’, e.g. a head of state dying, a conflict breaking out] occur before this date?”. As time passes, the window of opportunity for the question resolving positively closes: ‘tomorrow’s prediction’ is very likely to be slightly lower than todays, but with a small chance of being much higher.[2] Thus the common observation of forecasters tracking a question to be notching their forecasts down periodically ‘for the passage of time’. Forecasters trying to ‘price this in early’ by going to 0 or 1 will find they are overconfident rather than more accurate (more later).

Current forecasts generally imply skew (and predictable direction) for future forecasts

Although these sorts of questions are common on forecasting platforms, the predictable directionality they result in might be dismissed as a quirk of question structure: they have a set deadline, and thus the passage of time itself is providing information – you wouldn’t get this mechanistic effect if the question was “what date will X happen?” rather than “will X happen by [date]?”.

But in fact skew is much more general: ‘directionally predictable patterns’ are expected behaviour when a) future forecasts are expected to have better resolution than current forecasts (e.g. because you will have more information, or will have thought about it more), and b) your current prediction is not one of equipoise (e.g. not 50/50 for a binary event which happens or not). In a soundbyte: your current credence governs the bias in its expected future updates.

Suppose I was trying to predict who would win the French presidential election in January 2022 (cf.), and (like the Metaculus aggregate at the time) I thought Macron is a 60% favourite to be reelected. There’s no pure ‘passage of time’ effects on election results, but I would expect my future predictions to be more accurate than my current one: opinion polls a week before the election are better correlated to results than those 5 months before, economic conditions would be clearer, whether any crises or scandals emerge etc. Even without more data, if I plan to spend a lot more time forecasting this question, I might expect my forecasts to improve with more careful contemplation.

As I have Macron as the favourite, this is (hopefully) based on the balance of considerations I have encountered so far tending to favour his re-election versus not. All else equal,[3] this implies a prediction that the next consideration I encounter is more likely to favour Macron than not too: I think I’m more likely to be in a world where Macron is re-elected in the future, and if so I expect the pre-election signs to generally point to this future too.

If the next consideration I come across indeed favours Macron like I expected, I should update a bit more upwards on his reelection chances (say 60% to 63%). This is not a failure of reflective equilibrium: although I should have priced in the expectation for favourable evidence, I only priced in this was what I was likely to see, so confirmation I actually saw it gives further confidence in my mainline Macron-reelected model.[4] If this consideration went the other way, my greater surprise would have pushed me more in the opposite direction (e.g. 60% to 54%, cf.). If this pattern continues, I should be progressively more confident that Macron will be re-elected, and progressively more confident future evidence will provide further support for the same. So, taking an epistemic diary of my earlier predictions, these would often show a pattern of ‘steady updates in one direction’.

The same point can be made graphically. Let the Y axis be credence, and the X axis to be ‘degree of omniscience’. If you’re at 60% given your current state of knowledge, you would expect to end up at 1 more often than 0 if you were to become omniscient. As there’s a 60:40 directional bias for the ‘random walk’ of a single step from here to omniscience, there’s also a similar expected bias for a smaller step from ‘here to a little more omniscient’, or for the net result of multitude of smaller sequential steps: you are expecting to ultimately resolve to 1 or 0, but to 1 more often than 0, so in aggregate the path is biased upwards from the present estimate. Again reflective equilibrium is maintained by the skew of the distribution, which gets steadily more extreme the closer you are to 0 or 1 already. The typical path when gaining understanding is more resilient confidence, and so major jolts in the opposite direction are more and more surprising.

You are the grey dot, with 60% credence in something, you know something about that something, but you could know more, and are not omniscient (main panel). Your ‘metaforecast’ of “What probability would you assign if you suddenly became omniscient?” should be a probability mass function with 60% at P = 1 and 40% at P=0 (red, upper right panel).

Now consider “Where would you expect your credence to end up if you knew some more (but not everything) about this something? Your credence could walk all over the place across ‘epistemic time’, but as at the limit you are at the pmf for omniscience, these paths are more likely to be tending towards P=1 than P=0, thus higher than your current estimate. Thus the intermediate pdf forms some skewed distribution with a mean of 0.6, but a mode and median somewhat higher (blue, upper middle panel).

What about forecasting quantities?

For quantitative instead of binary predictions (e.g. ‘how much X?’ or ‘what year X?’), skew is not guaranteed by an initial estimate, but remains common nonetheless. Many quantities of interest are not distributed symmetrically (e.g. exponential for time-to-event, lognormal for conjunctive multipliers, power-law for more network-y generators), kind-of owed to the fact there are hard minimums (e.g. “X happened two years ago” is a very remote resolution to most “when will X happen?” questions; a stock price can’t really go negative).[5]

This skew again introduces asymmetry in update direction: ignoring edge cases like multi-modality, I expect to find myself closer to the modal scenario, so anticipate most of the time the average for my new distribution to track towards the mode of my previous one, but with a smaller chance of new evidence throwing me out along the long tail. E.g. my forecast for a market cap of an early stage start-up is probably log-normalish, and the average likely to go down than up, as I anticipate it to steadily fall in the modal case evidence mounts it is following the modal trajectory (i.e. to the modal outcome of zero), but more rarely it becomes successful and my expected value shoots upwards much higher.

Conclusion

‘Slowly(?) updating in one direction’ can be due to underconfidence or sluggishness. Yet, as demonstrated above, this appearance can also emerge from appropriate and well calibrated forecasts and updating. Assessing calibration can rule in or out predictor underconfidence: 80% events should occur roughly 80% of the time, regardless of whether 80% was my initial forecast or my heavily updated one. Sluggish updating can also be suggested by a trend towards greater underconfidence with prediction number: i.e. in aggregate, my 2nd to nth predictions on a question get steadily more underconfident whilst my initial ones are well-calibrated (even if less accurate). Both require a significant track record (across multiple questions) to determine. But I’d guess on priors less able forecasters tend to err in the opposite directions: being overconfident, and over-updating on recent information.

In contrast, rational predictions which do follow a symmetric random walk over time are constrained to particular circumstances: where you’re basically at equipoise, where you’re forecasting a quantity which is basically normally distributed, etc. These are fairly uncommon on prediction platforms, thus monotonic traces and skewed forecast distributions are the commoner observations.

Aside: ‘anticipating consensus’ poorly distinguishes overconfidence from edge

As ‘steady updates in one direction’ is a very unreliable indicator of irrationality, so too ‘observing the crowd steadily move towards your position’ is a very unreliable indicator that you are a superior forecaster. Without proper scoring, inferior (overconfident) predictors look similar to superior (accurate) ones.

Suppose one is predicting 10 questions where the accurate and well-calibrated consensus is at 90% for all of them. You’re not more accurate, but you are overconfident, so you extremise to 99% for all of them. Consistent with the consensus estimate, 9/10 happen but 1 doesn’t. In terms of resolution, you see 9 times out of 10 you were on the ‘better side of maybe’ versus the crowd; if there was intermediate evidence being tracked (so 9 of them steadily moved to 1, whilst 1 bucked the trend and went to 0), you’d also see 9 times out of 10 you were anticipating the consensus, which moved in your direction.

Qualitatively, you look pretty impressive, especially if only a few items have resolved, and all in your favour (“This is the third time you thought I was being too confident about something, only for us to find my belief was closer to the truth than yours. How many more times does this have to happen before you recognise you’d be less wrong if you agreed with me?”) Yet in fact you are just a worse forecaster by the lights of any proper scoring rule:[6] the pennies you won going out on a limb nine times does not compensate you for the one time you got steamrollered.

From a (public) facebook post (so I take the etiquette to be ‘fair game to quote, but not to make a huge deal out of’):To be a slightly better Bayesian is to spend your entire life watching others slowly update in excruciatingly predictable directions that you jumped ahead of 6 years earlier so that your remaining life could be a random epistemic walk like a sane person with self-respect.I wonder if a Metaculus forecast of “what this forecast will look like in 3 more years” would be saner. Is Metaculus reflective, does it know what it’s doing wrong?
The relevant financial analogy here would be options, which typically exhibit ‘time decay‘
Obviously, you might have a richer model where although the totality of evidence favours Macron’s re-election, some families of considerations are adverse. Maybe I think he is going to win despite domestic economic headwinds (which I take to harm his chances). So P(next consideration favours Macron) can be >0.5, yet P(next consideration favours Macron|it’s about domestic economics) <0.5.
So, strictly speaking, you should still be updating off evidence you were expecting to see, so long as you weren’t certain it would turn out like you anticipated.
This is not a foolproof diagnostic. Although it is impossible for a woman to have negative height, and ‘possible’ for her to be 10 ft tall, human height distributions by sex are basically normal distributions, not skewed ones.
E.g. 0.196 versus 0.180 for your brier score versus the consensus in this toy example (lower is better). The distance between you will vary depending on the scoring rule, but propriety ensures all will rank you worse

Complex cluelessness as credal fragility

2023-12-012023-11-03 ~ Gregory Lewis ~ Leave a comment

Introduction

In sketch, the challenge of consequentialist cluelessness is the consequences of our actions ramify far into the future (and thus – at least at first glance – far beyond our epistemic access). Although we know little about them, we know enough to believe it unlikely these unknown consequences will neatly ‘cancel out’ to neutrality – indeed, they are likely to prove more significant than those we can assess. How, then, can we judge some actions to be better than others?

For example (which we shall return to), even if we can confidently predict the short-run impact of donations to the Against Malaria Foundation are better than donations to Make-a-Wish, the final consequences depend on a host of recondite matters (e.g. Does reducing child mortality increase or decrease population size? What effect does a larger population have on (among others) economic growth, scientific output, social stability? What effect do these have on the margin in terms of how the future of humankind goes?) In aggregate, these effects are extremely unlikely to neatly cancel out, and their balance will likely be much larger than the short run effects.

Hillary Greaves, in presenting this issue (Greaves 2016), notes that there is an orthodox subjective Bayesian ‘answer’ to it: in essence, one should offer (precise) estimates for all the potential long-term ramifications, ‘run the numbers’ to give an estimate of the total value, and pick the option that is best in expectation. Calling this the ‘sceptical response’, she writes (p.12):

This sceptical response may in the end be the correct one. But since it at least appears that something deeper is going on in cases like the one discussed in section 5 [‘complex cluelessness’], it is worth exploring alternatives to the sceptical response.

Andreas Mogensen, in subsequent work (Mogensen 2019), goes further, suggesting this is a ‘Naive Response’. Our profound uncertainty across all the considerations which bear upon the choice-worthiness of AMF cannot be reasonably summarized into a precise value or probability distribution. Both Greaves and Mogensen explore imprecise credences as an alternative approach to this uncertainty.

I agree with both Greaves and Mogensen there’s a deeper issue here, and an orthodox reply along the lines of, “So be it, I assign numbers to all these recondite issues”, is too glib. Yet I do not think imprecise credences are the best approach to tackle this problem. Also, I do not think the challenge of cluelessness relies on imprecise credences: even if we knew for sure our credences should be precise, the sense of a deeper problem still remains.

I propose a different approach, taking inspiration from Amanda Askell and Phil Trammell, where we use credal fragility, rather than imprecision, to address decision making with profound uncertainty. We can frame issues of ‘simple cluelessness’ (e.g. innocuous variations on our actions which scramble huge numbers of future conceptions) as considerations where we are resiliently uncertain of their effect, and so reasonably discount them as issues to investigate further to improve our best guess (which is commonly, but not necessarily, equipoise). By contrast, complex cluelessness are just those cases where the credences we assign to the considerations which determine the long-run value of our actions are fragile, and we have reasonable prospects to make our best guess better.

Such considerations seem crucial to investigate further: even though it may be even harder to (for example) work out the impact of population size on technological progress than it is the effect size of AMF’s efforts on child mortality, the much greater impact of the former than the latter on the total impact of AMF donations makes this topic the better target of investigation on the margin to improve our estimate of the value of donating to AMF.[1]

My exploration of this approach suggests it has some attractive dividends: it preserves features of orthodox theory most find desirable, avoids the costs of imprecise credences, and – I think – articulates well the core problem of cluelessness which Greaves, Mogensen, myself, and others perceive. Many of the considerations regarding the influence we can have on the deep future seem extremely hard, but not totally intractable, to investigate. Offering naive guestimates for these, whilst lavishing effort to investigate easier but less consequential issues, is a grave mistake. The EA community has likely erred in this direction.

On (and mostly contra) imprecise credences

Rather than a single probability function which will give precise credences, Greaves and Mogensen suggest the approach of using a set of probability functions (a representor). Instead of a single credence for some proposition p, we instead get a set of credences, arising from each probability function within the representor.

Although Greaves suggests imprecise credences as an ‘alternative line’ to orthadox subjective Bayesianism, Mogensen offers a stronger recommendation of the imprecise approach over the ‘just take the expected value’ approach, which he deems a naive response (p. 6):

I call this ‘the Naïve Response’ because it is natural to object that it fails to take seriously the depth of our uncertainty. Not only do we not have evidence of a kind that allows us to know the total consequences of our actions, we seem often to lack evidence of a kind that warrants assigning precise probabilities to relevant states. Consider, for example, the various sources of uncertainty about the indirect effects of saving lives by distributing anti-malarial bed-nets noted by Greaves (2016). We have reason to expect that saving lives in this way will have various indirect effects related to population size. We have some reason to think that the effect will be to increase the future population, but also some reason to think that it will be to decrease the net population (Roodman 2014; Shelton 2014). It is not clear how to weigh up these reasons. It is even harder to compare the relative strength of the reasons for believing that increasing the population is desirable on balance against those that support believing that population decrease is desirable at the margin. That the distribution of bed-nets is funded by private donors as opposed to the local public health institutions may also have indirect political consequences that are hard to assess via the tools favoured by the evidence-based policy movement (Clough 2015). To suppose that our uncertainty about the indirect effects of distributing anti-malarial bed-nets can be summarized in terms of a perfectly precise probability distribution over the relevant states seems to radically understate the depth of our uncertainty.

I fear I am too naive to be moved by Mogensen’s appeal. By my lights, although any of the topics he mentions are formidably complicated, contemplating them, weighing up the considerations that bear upon them, and offering a (sharp) ‘best guess’ alongside the rider that I am deeply uncertain (whether cashed out in terms of resilience or ‘standard error’ – cf., and more later) does not seem to pretend false precision nor do any injustice to how uncertain I should be.

Trading intuitions offers little: it is commonplace in philosophy for some to be indifferent to a motivation others find compelling. I’m not up to the task of giving a good account of the de/merits of im/precise approaches (see eg. here). Yet I can say more on why imprecise credences poorly articulate the phenomenology of this shared sense of uncertainty – and more importantly, seem to fare poorly as means to aid decision-making:[2] the application of imprecise approaches gives results (generally of excessive scepticism) which seem inappropriate.

Incomparability

Unlike ‘straight ticket expected value maximization’, decision rules on representors tend to permit incomparability: if elements within one’s representor disagree on which option is better, it typically offers no overall answer. Mogensen’s account of one such decision rule (the ‘maximality rule’) notes this directly:

When there is no consensual ranking of a and a`, the agent’s preference with respect to these options is indeterminate: she neither prefers a to a`, nor a` to a, nor does she regard them as equally good.

I think advocates of imprecise credences consider this a feature rather than a bug,[3] but I do not think I am alone in having the opposite impression. The articulation would be something along the lines of decision rules on representors tending ‘radically anti-majoritarian’: a single member of one’s credal committee is enough to ‘veto’ a comparison of a versus a`. This doesn’t seem the right standard in cases of consequentialist cluelessness where one wants to make judgements like ‘on balance better in expectation’.

This largely bottoms out in the phenomenology of uncertainty:[4] if, in fact, one’s uncertainty is represented by a set of credence functions, and, in fact, one has no steers on the relative plausibility of the elements of this set compared to one another, then responding with indeterminacy when there is no consensus across all elements seems a rational response (i.e. even if the majority of my representor favours action a over alternative action a`, without steers on how to weigh a-favouring elements over others, there seems little room to surpass a ‘worst-case performance’ criterion like the maximality rule).

Yet I aver that in most-to-all cases (including those of consequentialist cluelessness) we do, in fact, have steers about the relative plausibility of different elements. I may think P(rain tomorrow) could be 0.05, and it could be 0.7 (and other things besides), yet I also have impressions (albeit imprecise) on one being more reasonable than the other. The urge of the orthodox approach is we do better trying to knit these imprecise impressions into a distribution to weigh the spectrum of our uncertainty – even though it is an imperfect representation,[5] rather than deploying representors which often unravel anything downstream of them as indeterminate.

To further motivate, this (by my lights, costly) incomparability may prove pervasive and recalcitrant in cases of consequentialist cluelessness, for reasons related to the challenge of belief inertia.

A classical example of belief inertia goes like this: suppose a coin of unknown bias. It seems rationality permissible for one’s representor on the probability of said coin landing heads to be (0,1).[6] Suppose one starts flipping this coin. No matter the number of coin flips (and how many land heads), the representor on the posterior seems stuck to (0,1): for any element in this posterior representor, for any given sequence of observations, we can find an element in the prior representor which would update to it.[7]

This problem is worse for propositions where we anticipate receiving very limited further evidence. Suppose (e.g.) we take ourselves to be deeply uncertain on the proposition “AMF increases population growth”. Following the spirit of imprecise approach, we offer a representor with at least one element either side of 0.5. Suppose one study, then several more, then a systematic review emerges which all find that AMF lives saved do translate into increased population growth. There’s no guarantee all elements of our representor, on this data, will rise above 0.5 – it looks permissible for us to have included in our representor a credence function which would not do this (studies and systematic reviews are hardly infallible). If this proposition lies upstream of enough of the expected impact, such a representor entails we will never arrive at an answer as to whether donations to AMF are better than nothing.

For many of the other propositions subject to cluelessness (e.g. “A larger population increases existential risk”) we can only hope to acquire much weaker evidence than sketched above. Credence functions that can remain resiliently ‘one side or the other’ of 0.5[8] in the face of this evidence again seems at least permissible (if not reasonable) to include in our representor.[9] Yet doing so makes for pervasive and persistent incomparability: including a few mildly stubborn credence functions in some judiciously chosen representors can entail effective altruism from the longtermist perspective is a fool’s errand. Yet this seems false – or, at least, if it is true, it is not true for this reason.[10]

Through a representor darkly

A related challenge is we have very murky access to what our representor either is or should be. A given state of (imprecise) uncertainty could be plausibly described by very large numbers of candidate representors. As decision rules on representors tend to be exquisitely sensitive to which elements they contain, it may be commonplace where (e.g.) action a is recommended over a` given representor R, but we can counter-propose R`, no less reasonable by our dim lights of introspective access, which hold a and a` to be incomparable.[11]

All the replies here look costly to me. One could ‘go meta’ and apply the decision rules to the superset of all credence functions that are a member of at least one admissible representor[12] (or perhaps devise some approach to aggregate across a family of representors), but this seems likely to amplify the problems of incomparability and murky access that apply to the composition of a single representor.[13] Perhaps theory will be able to offer tools to supplement internal plausibility to assist us in picking the ‘right’ representor (although this seems particularly far off for ‘natural language’ propositions cluelessness tends to concern). Perhaps we can work backwards from our intuitions about when actions should be incomparable or not to inform what our representor should look like, although reliance on working backwards like this raises the question as to what value – at least prudentially – imprecise credences have as a response to uncertainty.

Another alternative for the imprecise approach is to go on the offensive: orthodoxy faces a similar problem. Any given sharp representation I offer to represent my uncertainty is also susceptible to counter-proposals which will seem similarly appropriate, and in some of these cases the overall judgement will prove sensitive to which representation I use. Yet although a similar problem, it is much less in degree: even heavy-tailed distributions are much less sensitive to ‘outliers’ than representors, and orthodox approaches have more resources available to aggregate and judge between a family of precise representations.

Minimally clueless forecasting

A natural test for approaches to uncertainty is to judge them by their results. For consequentialist cluelessness, this is impossible: there is no known ground truth of long-run consequences to judge against. Yet we can assess performance in nearby domains, and I believe this assessment can be adduced in favour of orthodox approaches versus imprecise ones.

Consider a geopolitical forecasting question, such as: “Before 1 January 2019, will any other EU member state [besides the UK] schedule a referendum on leaving the EU or the eurozone?” This question opened on the Good Judgement Open on 16 Dec 2017. From more than a year out, there would seem plenty to recommend imprecision: there would be 27 countries, each with their complex political context, several of whom with elections during this period, and a year is a long time in politics. Given all of this, would an imprecise approach not urge us it is unwise to give a (sharp) estimate of the likelihood of this event occurring within a year?

Yet (despite remaining all-but-ignorant of these particulars) I still offered an estimate of 10% on Dec 17. As this didn’t happen, my Brier score was a (pretty good) 0.02 – although still fractionally worse than the median forecaster for this question (0.019).[14] If we instrumented this with a bet (e.g. “Is it better for me to bet donation money on this not occuring at a given odds?”), I would fare better than my “imprecise/incomparible” counterpart. Depending on how wide their representor was, they could not say taking a bet at evens or 5:1 (etc.) was better than not doing so. By tortured poker analogy, my counterpart’s approach leads them to play much too tightly, leaving value on the table an orthodox approach can reliably harvest.[15]

Some pre-emptive replies:

First, showing a case where I (or the median) forecaster got it right means little: my counterpart may leave ‘easy money’ on the table when the precisification was right, yet not get cleaned out when the precisification was wrong. Yet across my forecasts (on topics including legislation in particular countries, election results, whether people remain in office, and property prices – _all _of which I know very little about), I do somewhat better than the median forecaster, and substantially better than chance (Brier ~ 0.23). Crucially, the median forecaster also almost always does better than chance too (~ 0.32 for those who answered the same questions as I) – which seems the analogous consideration for cluelessness given our interest is in objective rather than relative accuracy.[16] That the imprecise/incomparible approach won’t recommend taking ‘easy money’ seems to be a general trend rather than a particular case.

Second, these uncertainties are ‘easier’ than the examples of consequentialist cluelessness. Yet I believe the analogy is fairly robust, particularly with respect to ‘snap judgements’. There are obvious things one can investigate to get a better handle on the referendum question above: one could look at the various political parties standing in each country, see which wanted to hold such a referendum, and look at their current political standing (and whether this was likely to change over the year). My imprecise/incomparible counterpart, on consulting these, could winnow their representor so (by the maximality rule) they may be recommended to take rather than refrain even-odds or tighter bets.

Yet I threw out an estimate without doing any of these (and I suspect the median forecaster was not much more diligent). Without this, there seems much to recommend including in my representor at least one element with P>0.5 (e.g. “I don’t know much about the Front Nationale in France, but they seem a party who may want to leave the EU, and one which has had political success, and I don’t know when the next presidential election is – and what about all the other countries I know even less about?”). As best as I can tell, these snap judgements (especially from superforecasters,[17] but also from less trained or untrained individuals) still comfortably beat chance.

Third, these geopolitical forecasting questions generally have accessible base rates or comparison classes (the reason I threw out 10% the day after the referendum question opened was mostly this). Not so for consequentialist cluelessness – all of the questions about trends in long-term consequences are yet to resolve,[18] and so we have no comparison class to rely on for (say) whether greater wealth helps the future go better or otherwise. Maybe orthodox approaches are fine when we can moor ourselves to track records, but they are inadequate when we cannot and have to resort to extrapolation fuelled by analogy and speculation – such as cases of consequentialist cluelessness.

Yet in practice forecasting often takes one far afield from accessible base rates. Suppose one is trying to work out (in early 2019) whether Mulvaney will last the year as White House Chief of staff. One can assess turnover of white house staff, even turnover of staff in the Trump administration – but how to balance these to Mulvaney’s case in particular (cf. ‘reference class tennis’)? Further suppose one gets new information (e.g. media rumours of him being on ‘shaky ground’): this should change one’s forecast, but by how much? (from 10% to 11%, or to 67%?) There seems to be a substantially similar ‘extrapolation step’ here.

In sum: when made concrete, the challenges of geopolitical forecasting are similar to those in consequentialist cluelessness. In both:

The information we have is only connected to what we care about through a chain of very uncertain inferences (e.g.[19] “This report suggests that Trump is unhappy with Mulvaney, but what degree of unhappiness should we infer given this is filtered to us from anonymous sources routed through a media report? And at what rate of exchange should this nebulous ‘unhappiness’ be cashed out into probability of being dismissed?”).
There are multiple such chains (roughly corresponding to considerations which bear upon the conclusion), which messily interact with one another (e.g. “This was allegedly prompted by the impeachment hearing, so perhaps we should think Trump is especially likely to react given this speaks to Mulvaney’s competence at an issue which is key to Trump – but maybe impeachment itself will distract or deter Trump from staff changes?”)
There could be further considerations we are missing which are more important than those identified.
Nonetheless, we aggregate all of these considerations to give an all-things considered crisp expectation.

Empirically, with forecasting, people are not clueless. When they respond to pervasive uncertainty with precision, their crisp estimates are better than chance, and when they update (from one crisp value to another) based on further information (however equivocal and uncertain it may be) their accuracy tends to improve.[20] Cases of consequentialist cluelessness may differ in degree, but not (as best as I can tell) in kind. In the same way our track record of better-than-chance performance warrants us to believe our guesses on hard geopolitical forecasts, it also warrants us to believe a similar cognitive process will give ‘better than nothing’ guesses on which actions tend to be better than others, as the challenges are similar between both.[21]

Credal resilience and ‘value of contemplation’

Suppose I, on seeing the evidence AMF is one of the most effective charities (as best as I can tell) for saving lives, resolve to make a donation to it. Andreas catches me before I make the fateful click and illustrates all the downstream stakes I, as a longtermist consequentialist, should also contemplate – as these could make my act of donation much better or much worse than I first supposed.

A reply along the lines of, “I have no idea what to think about these things, so let’s assume they add up to nothing” is unwise. It is unwise as (accord Mogensen and others) I cannot simply ‘assume away’ considerations on which the choiceworthiness of my action is sensitive to. I have to at least think about them.

Suppose instead I think about it for a few minutes and reply, “On contemplating these issues (and the matter more generally), my best guess is the downstream consequences in aggregate are in perfect equipoise – although I admit I am highly uncertain on this – thus the best guess for the expected value of my donation remains the same.”

This is also unwise, but for a different reason. The error is not (pace cluelessness) on giving a crisp estimate for a matter I should still be profoundly uncertain about: if I hopped into a time machine and spent a thousand years mulling these downstream consequences over, and made the same estimate (and was no less uncertain about it); or for some reason I could only think about it for those few minutes, I should go with my all-things-considered best guess. Yet the error is that in these circumstances, it seems I should be thinking about these downsteam consequences for much longer than a few minutes.

This rephrases ‘value of information’ (perhaps better ‘contemplation’).[22] Why this seems acutely relevant here is precisely the motivation for consequentialist cluelessness: when the overall choiceworthiness of our action proves very sensitive to a considerations we are very uncertain about, the expected marginal benefit of reducing our uncertainty here will often be a better use of our efforts than acting on our best guess.

Yet in making wise decisions about how to allocate time to contemplate things further, we should factor in ‘tractability’. Some propositions, although high-stakes, might be those in which we are resiliently uncertain, and so effort trying to improve our guesswork is poorly spent. Others, where our credences are fragile (‘non-resiliently uncertain’) are more worthwhile targets for investigation.[23]

I take the distinction between ‘simple’ and ‘complex’ cluelessness to be mainly this. Although vast consequences could hang in the balance with trivial acts (e.g. whether I click ‘donate now’ now or slightly later could be the archetypal ‘butterfly wing’ which generates a hurricane, or how trivial variations in behaviour might change the identity of a future child – and then many more as this different child and their descendants ‘scramble’ the sensitive identity-determining factors of other conceptions), we rule out further contemplation of ‘simple cluelessness’ because it is intractable. We know enough to predict that the ex post ‘hurricane minimizing trivial movements’ won’t be approximated by simple rules (“By and large, try and push the air molecules upwards”), but instead exquisitely particularized in a manner we could never hope to find out ex ante. Likewise, for a given set of ‘nearby conceptions’ to the actual one, we know enough to reasonably believe we will never have steers on which elements are better, or how – far causally upstream – we can send ripples in just the right way to tilt the odds in favour of better results.[24]

Yet the world is not only causal chaos. The approximate present can approximately determine the future: physics allows us to confidently predict a plane will fly without modelling to arbitrary precision all the air molecules of a given flight. Similarly, we might discover moral ‘levers on the future’ which, when pulled in the right direction, systematically tend to make the world better rather than worse in the long run.

‘Complex cluelessness’ draws our attention to such levers, where we know little about which direction they are best pulled, but (crucially) it seems we can improve our guesswork. Whether (say) economic growth is good in the long-run is a formidably difficult question to tackle.[25] Yet although it is harder than (say) how many child deaths a given antimalarial net distributions can be expected to avert, it is not as hopelessly intractable as the best way to move my body to minimize the expected number of hurricanes. So even though the expected information per unit effort on the long-run impacts of economic growth will be lower than evaluating a charity like AMF, the much greater importance of his consideration (both in terms of its importance and its wide applicability) makes this the better investment of our attention.[26]

Conclusion

Our insight into the present is poor, and deteriorates inexorably as we try to scry further and further into the future. Yet longtermist consequentialism urges us to evaluate our actions based on how we forecast this future to differ. Although we know little, we know enough that our actions are apt to have important long term ramifications, but we know very little about what those will precisely be, or how they will precisely matter. What ought we to do?

Yet although we know little, we are not totally clueless. I am confident the immediate consequences of donations to AMF are better than those to make a wish. I also believe the consequences simpliciter are better for AMF donations than Make-a-wish donations, although this depends very little on the RCT-backed evidence base, and much more on a weighted aggregate of poorly-educated guesswork on ‘longtermist’ considerations (e.g. that communities with fewer child deaths fare better in naive terms, and the most notable dividends of this with longterm relevance – such as economic growth in poorer countries – systematically tend to push the longterm future in a better direction) which I am much less confident in.

This guesswork, although uncertain and fragile, nonetheless warrants my belief that AMF-donations are better than Make-a-wish ones. The ‘standard of proof’ for consequentialist decision making is not ‘beyond a reasonable doubt’, but a balance of probabilities, and no matter how lightly I weigh my judgement aggregating across all these recondite matters, it is sufficient to tip the scales if I have nothing else to go on. Withholding judgement will do worse if, to any degree, my tentative guesswork tends towards the truth. If I had a gun to head to allocate money to one or the other right now, I shouldn’t flip a coin.

Without the gun, another option I have is to try and improve my guesswork. To best improve my guesswork, I should allocate my thinking time to uncertainties which have the highest yield – loosely, a multiple of their ‘scale’ (how big an influence they play on the total value) and ‘tractability’ (how much I can improve my guess per unit effort).

Some uncertainties, typically those of ‘simple’ cluelessness, score zero by the latter criterion: I can see from the start I will not find ways to intervene on causal chaos to make it (in expectation) better, and so I leave them as high variance (but expectation neutral) terms which I am indifferent towards. If I were clairvoyant on these matters, I expect I would act very differently, but I know I can’t get any closer than I already am.

Yet others, those of complex cluelessness, do not score zero on ‘tractability’. My credence in “economic growth in poorer countries is good for the longterm future” is fragile: if I spent an hour (or a week, or a decade) mulling it over, I would expect my central estimate to change, although my remaining uncertainty to be only a little reduced. Given this consideration has much greater impact on what I ultimately care about, time spent on this looks better than time further improving the estimate of immediate impacts like ‘number of children saved’. It would be unwise to continue delving into the latter at the expense of the former. Would that we do otherwise.

Acknowledgements

Thanks to Andreas Mogensen, Carl Shulman, and Phil Trammel for helpful comments and corrections. Whatever merits this piece has are mainly owed to their insight. It’s failures are owed to me (and many to lacking time and ability to address their criticism better).

[1] Obviously, the impact of population size on technological progress also applies to many other questions besides the utility of AMF donations.

[2] (Owed to/inspired by Mogensen) One issue under the hood here is whether precision is descriptively accurate or pragmatically valuable. Even if (in fact) we cannot describe our own uncertainty as (e.g.) a distribution to arbitrary precision, we fare better if we act as-if this were the case (the opposite is also possible). My principal interest is the pragmatic one: that agents like ourselves make better decisions by attempting to EV-maximization with precisification than they would with imprecise approaches.

[3] One appeal I see is the intuition that for at least some deeply mysterious propositions, we should be rationally within our rights to say, “No, really, I don’t have any idea”, rather than – per the orthodox approach – taking oneself to be rationally obliged to offer a distribution or summary measure to arbitrary precision.︎

[4] This point, like many of the other (good) ones, I owe to conversation with Phil Trammell.

[5] And orthodox approaches can provide further resources to grapple with this problem: we can perform sensitivity analyses with respect to factors implied by our putative distribution. In cases where we get a similar answer ‘almost all ways you (reasonably) slice it’, we can be more confident in making our decision in the teeth of our uncertainty (and vice versa). Similarly, this quantitative exercise in setting distributions and central measures can be useful for approaching reflective equilibrium on the various propositions which bear upon a given conclusion.︎

[6] One might quibble we might have reasons to rule out extremely strong biases (e.g. P(heads) = 10^-20), but nothing important turns on this point.

[7] Again the ‘engine’ of this problem is the indeterminate weights on the elements in the set. Orthodoxy has also to concede that, given a prior ranging between (0,1) for bias, given any given sequence of heads and tails, any degree of bias (towards heads or tails) is still possible. Yet, as the prior expressed the relative plausibility of different elements, it can say which values(/intervals) have gotten more or less likely. Shorn of this, it is also inert.

[8] I don’t think the picture substantially changes if the proposition changes from “Does a larger population increase existential risk” to “What is the effect size of a larger population on existential risk?”

[9] Winnowing our representor based on its resilience to the amount of evidence we can expect to receive looks ad hoc. ︎

[10] A further formidable challenge is how to address ‘nested’ imprecision: what to do if we believe we should be imprecise about P(A) and P(B|A). One family of cases that spring to mind is moral uncertainty: our balance of credence across candidate moral theories, and (conditional on a given theory) which option is the best will seem to be things that ‘fit the bill’ for an imprecise approach. Naive approaches seem doomed to incomparability, as the total range of choice-worthiness will inevitably expand the longer the conditional chain – and, without density, we seem stuck in the commonplace where these ranges overlap.

[11] Earlier remarks on the challenge of mildly resilient representor elements provide further motivation. Consider Morgensen’s account of how imprecision and maximality will not lead to a preference to AMF over Make-A-Wish given available evidence. After elaborating the profound depth of our uncertainty around the long-run impacts of AMF, he writes (p. 16-17):

[I]t was intended to render plausible the view that the evidence is sufficiently ambiguous that the probability values assigned by the functions in the representor of a rational agent to the various hypotheses that impact on the long-run impact of her donations ought to be sufficiently spread out that some probability function in her representor assigns greater expected moral value to donating to the Make-A-Wish Foundation.

[…]

Proving definitively that your credences must be so spread out as to include a probability function of a certain kind is therefore not in general within our powers. Accordingly, my argument rests in large part on an appeal to intuition. But the intuition to which I am appealing strikes me as sufficiently forceful and sufficiently widely shared that we should consider the burden of proof to fall on those who deny it.

Granting the relevant antecedents, I share this intuition. Yet how this intuitive representor should update in response to further evidence is mysterious to me. I can imagine some that would no longer remain ‘sufficiently spread out’ if someone took a month or so to consider all the recondite issues Morgensen raises, and concludes, all things considered, that AMF has better overall long-run impact than Make-A-Wish; I can imagine others that would remain ‘sufficiently spread out’ even in the face of person-centuries of work on each issue raised. This (implicit) range spans multitudes of candidate representors.

[12] Morgensen notes (fn.4) the possibility a representor could be a fuzzy set (whereby membership is degree-valued), which could be a useful resource here. One potential worry is this invites an orthodox-style approach: one could weigh each element by its degree of membership, and aggregate across them.

[13] Although recursive criticisms are somewhat of a philosophical ‘cheap shot’, I note that “What representor should I use for p?” seems (for many ‘p’s) the sort of question which the motivations of imprecise credences (e.g. the depth of our uncertainty, (very) imperfect epistemic access) recommend an imprecise answer.

[14] I think what cost me was I didn’t update this forecast as time passed (and so the event became increasingly more unlikely). In market terms, I set a good price on Dec 17, but I kept trading at this price for the rest of the year.

[15] Granted, the imprecise/incomparable approach is not recommending refraining from the bet, but saying it has no answer as to whether taking the bet or not doing so is the better option. Yet I urge one should take these bets, and so this approach fails to give the right answer.

[16] We see the same result with studies on wholly untrained cohorts (e.g. undergraduates). E.g. Mellers et al. (2017).

[17] In conversation, one superforecaster I spoke to suggested they take around an hour to give an initial forecast on a question: far too little time to address the recondite matters that bear upon a typical forecasting question.

[18] Cf. The apocryphal remark about it being ‘too early to tell’ what the impact of the French Revolution was.

[19] For space I only illustrate for my example – Morgensen (2019) ably explains the parallel issues in the ‘AMF vs. Make-a-wish’ case.

[20] Although this is a weak consideration, I note approaches ‘in the spirit’ of an imprecise/incomparible approach, when applied to geopolitical forecasting, are associated with worse performance: Rounding estimates (e.g. from a percentage to n bands across the same interval) degrades accuracy – especially for the most able forecasters; those who commit to making (precise) forecasts and ‘keeping score’ improve more than those who do less of this. Cf. Tetlock’s fireside chat:

Some people would say to us, “These are unique events. There’s no way you’re going to be able to put probabilities on such things. You need distributions. It’s not going to work.” If you adopt that attitude, it doesn’t really matter how high your fluid intelligence is. You’re not going to be able to get better at forecasting, because you’re not going to take it seriously. You’re not going to try. You have to be willing to give it a shot and say, “You know what? I think I’m going to put some mental effort into converting my vague hunches into probability judgments. I’m going to keep track of my scores, and I’m going to see whether I gradually get better at it.” The people who persisted tended to become superforecasters. ︎

[21] Skipping ahead, Trammell (2019) also sketches a similar account to ‘value of contemplation’ which I go on to graffiti more poorly below (p 6-7):

[W]e should ponder until it no longer feels right to ponder, and then to choose one of the the acts it feels most right to choose. Lest that advice seem as vacuous as “date the person who maximizes utility”, here is a more concrete implication. If pondering comes at a cost, we should ponder only if it seems that we will be able to separate better options from worse options quickly enough to warrant the pondering—and this may take some time. Otherwise, we should choose immediately. When we do, we will be choosing literally at random; but if we choose after a period of pondering that has not yet clearly separated better from worse, we will also be choosing literally at random.

The standard Bayesian model suggests that if we at least take a second to write down immediate, baseless “expected utility” numbers for soap and pasta, these will pick the better option at least slightly more often than random. The cluelessness model sketched above predicts (a falsifiable prediction!) that there is some period—sometimes thousandths of a second, but perhaps sometimes thousands of years—during which these guesses will perform no better than random.

I take the evidence from forecasting to give evidence that orthodoxy can meet the challenge posed in the second paragraph. As it seems people (especially the most able) can make ‘rapid fire’ guesses on very recondite matters that are better than chance, we can argue by analogy that the period before guesswork manages to tend better than chance, even for the hard questions of complex cluelessness, tends to be more towards the ‘thousands of a second’ than ‘thousands of years’.

[22] In conversation Phil Trammell persuades me it isn’t strictly information in the classic sense of VoI which is missing. Although I am clueless with respect to poker, professional players aren’t, and yet VoI is commonplace in their decision making (e.g. factoring in whether they want to ‘see more cards’ in terms of how aggressively to bet). Contrariwise I may be clueless whilst having all the relevant facts, if I don’t know how to ‘put them together’.

[23] Naturally, our initial impressions here might be mistaken, but can be informed by the success of our attempts to investigate (cf. multi-armed bandit problems).

[24] Perhaps one diagnostic for ‘simple cluelessness’ could be this sensitivity to ‘causal jitter’.

[25] An aside on generality and decomposition – relegated to a footnote as the natural responses seem sufficient to me: We might face a trade-off between broader and narrower questions on the same topic: is economic growth in country X a good thing versus economic growth generally, for example. In deciding to focus on one or the other we should weigh up their relative tractability (on its face, perhaps the former question is easier) applicability to the decision (facially, if the intervention is in country X, the more particular question is more important to answer – even if economic growth was generally good or bad, country X could be an exception), and generalisability (the latter question offers us a useful steer for other decisions we might have to make).

[26] If it turned out it was hopelessly intractable after all, that, despite appearances, there are not trends or levers – or perhaps we find there are many such levers connected to one another in an apparently endless ladder of plausibly sign-inverting crucial considerations – then I think we can make the same reply (roughly indifference) that we make to matters of simple cluelessness.

Take care with notation for uncertain quantities

2023-11-242023-11-03 ~ Gregory Lewis ~ Leave a comment

BLUF: 3 000 and 3⋅103 are equivalent; so too a probability of 10% and 9-to-1 odds; and so too 0.002 and 1/500. Despite their objective equivalence, they can look very different to our mind’s eye. Different notation implies particular scales or distributions are canonical, and so can frame our sense of uncertainty about them. As the wrong notation is can bewitch us, it should be chosen with care.

Introduction

I presume it is common knowledge that our minds are woeful with numbers in general and any particular use of them in particular.

I highlight one example of clinical risk communication, where I was drilled in medical school to present risks in terms of absolute risks or numbers needed to treat (see). Communication of relative risks was strongly discouraged, as:

Risk factor X increases your risk of condition Y by 50%

Can be misinterpreted as:

With risk factor X, your risk of condition Y is 50% (+ my baseline risk)

This is mostly base-rate neglect: if my baseline risk was 1 in a million, increasing this to 1.5 in a million (i.e. an absolute risk increase of 1 in 2 million, and a number needed to treat of 2 million) probably should not worry me.

But it is also partly an issue of notation: if risk factor X increased my risk of Y by 100% or more, people would be less likely to make the mistake above. Likewise if one uses different phrases than ‘increases risk by Z%’ (e.g. “multiples your risk by 1.5”).

I think this general issue applies elsewhere

Implicit scaling and innocent overconfidence

A common exercise to demonstrate overconfidence is to ask participants to write down 90% credible intervals for a number of ‘fermi-style’ estimation questions (e.g. “What is the area of the Atlantic Ocean?”, “How many books are owned by the British Library?”). Almost everyone finds the true value lies outside their 90% interval much more often than 10% of the time.

I do not think poor calibration here is solely attributable to colloquial-sense overconfidence (i.e. an epistemic vice of being more sure of yourself than you should be). I think a lot it is a more innocent ineptitude at articulating their uncertainty numerically.

I think most people will think about an absolute value for (e.g.) how many books the British Library owns (“A million?”). Although they recognise they have basically no idea, life in general has taught them variance (subjective or objective) around absolute values will be roughly symmetrical, and not larger than the value itself (“they can’t have negative books, and if I think I could be overestimating by 500k, I should think I could be underestimating by about the same”). So they fail to translate (~appropriate) subjective uncertainty into corresponding bounds: 1 000 000 ± (some number less than 1 million)

I think they would fare better if they put their initial guess in scientific notation: 1⋅106. This emphasises the leading digit and exponent, and naturally prompts one to think in orders of magnitude (e.g. “could it be a thousand? A billion? 100 million?”), and that one could be out by a multiple or order of magnitude rather than a particular value (compare to the previous approach – it is much less natural to think you should be unsure about how many digits one should write down). 1⋅106±2 would include the true value of 25 million.

Odds are odds are better than probabilities for small chances

Although there are some reasons (especially for Bayesians) to prefer odds to probability in general, reasons parallel to the above recommend them for weighing up small chances – especially sub-percentile chances. I suspect two pitfalls:

Translating appropriate levels of uncertainty is easier: more natural to contemplating adding zeros to the denominator of 1/x than adding zeros after the decimal for 0.x.
Percentages are not that useful notation for sub-percentile figures (e.g. 0.001% might make one wonder whether the ‘%’ is a typo). Using percentages for probabilities ‘by reflex’ risks anchoring one to the [1%, 99%] range (“well, it’s not impossible, so it can’t be 0%, but it is very unlikely…”)

Also similar to the above, although people especially suck with rare events, I think they may suck less if they develop the habit of at least cross-checking with odds. As a modest claim, I think people can at least discriminate 1 in 1000 versus 1 in 500000 better than 0.1% and 0.002%. There is expression in the ratio scale of odds that the absolute scale of probability compresses out.

The ‘have you tried fractions?’ solution to inter-theoretic value comparison

Consider this example from an Open Philanthropy report [h/t Ben Pace]:

The “animal-inclusive” vs. “human-centric” divide could be interpreted as being about a form of “normative uncertainty”: uncertainty between two different views of morality. It’s not entirely clear how to create a single “common metric” for adjudicating between two views. Consider:

Comparison method A: say that “a human life improved” is the main metric valued by the human-centric worldview, and that “a chicken life improved” is worth >1% of these (animal-inclusive view) or 0 of these (human-centric view). In this case, a >10% probability on the animal-inclusive view would lead chickens to be valued >0.1% as much as humans, which would likely imply a great deal of resources devoted to animal welfare relative to near-term human-focused causes.

Comparison method B: say that “a chicken life improved” is the main metric valued by the animal-inclusive worldview, and that “a human life improved” is worth <100 of these (animal-inclusive view) or an astronomical number of these (human-centric view). In this case, a >10% probability on the human-inclusive view would be effectively similar to a 100% probability on the human-centric view.

These methods have essentially opposite practical implications.

There is a typical two-envelopes style problem here (q.v. Tomasik): if you take the ‘human value’ or ‘chicken value’ envelope first, method A and method B (respectively) recommend switching despite saying roughly the same thing. But something else is also going on. Although method A and method B have similar verbal descriptions, their mathematical sketches imply very different framings of the problem.

Consider the trade ratio of chicken value: human value. Method A uses percentages, which implies we we are working in absolute values, and something like the real number line should be the x axis for our distribution. Using this as the x-axis also suggests our uncertainty should be distributed non-crazily across it: the probability density is not going up or down by multiples of zillions as we traverse it, and 0.12 is more salient than 10−7. It could look something like this:

One possible implicit distribution of method A

By contrast, method B uses a ratio – furthermore a potentially astronomical ratio. It suggests the x-axis has to be on something like a 10−x scale (roughly unbounded below, although it does rule out zero or 10−∞ – it is an astronomical ratio, not an infinite one). This axis, alongside the intuition the distribution should not look crazy, suggests our uncertainty should be smeared over orders of magnitude. 10−6 is salient, whilst 1.2⋅10−1 is not. It could look something like:

One possible implicit distribution for method B

Yet these intuitive responses do look very crazy if plotted against one another on a shared axis. On method A’s implied scale, method B naturally leads us to concentrate a lot of probability density into tiny intervals near zero – e.g. much higher density at the pixel next to the origin (<10−3) than ranges like [0.2, 0.3]. On method B’s implied scale, method A places infinite density infinitely far along the axis, virtually no probability mass on most of the range, and a lot of mass towards the top (e.g much more between 2⋅10−1 and 3⋅10−3 than 10−3 and 10−9):

Which one is better? I think most will find method A’s framing leads them to more intuitive results on reflection. Contra method B, we think there is some chance chickens have zero value (so humans count ‘infinitely more’ as a ratio), and we think there’s essentially no chance of them being between (say) a billionth and a trillionth of a human, where they would ‘only’ count astronomically more. In this case, method B’s framing of an unbounded ratio both misleads our intuitive assignment, and its facially similar description to method A conceals the stark differences between them.

Conclusions

It would be good to have principles to guide us, beyond collecting particular examples. Perhaps:

One can get clues for which notation to use from one’s impression of the distribution (or generator). If the quantity results from multiplying (rather than adding) a lot of things together, you’re probably working on a log scale.
If one is not sure, one can compare and contrast by mentally converting into equivalent notation. One can then cross-check these against ones initial intuition to see which is a better fit.

Perhaps all this is belabouring the obvious, as all of this arises as pedestrian corollaries from other ideas in and around EA-land. We often talk about heavy-tailed distributions, managing uncertainty, the risk of inadvertently misleading with information, and so on. Yet despite being acquainted with all this since medical school, developing these particular habits of mind only struck me in the late-middle-age of my youth. By the principle of mediocrity – despite being exceptionally mediocre – I suspect I am not alone.

Challenges in evaluating forecaster performance

2023-11-172023-11-03 ~ Gregory Lewis ~ Leave a comment

Briefly: There are some good reasons to assess how good you or others are at forecasting (alongside some less-good ones). This is much harder than it sounds, even with a fairly long track record on something like Good Judgement Open to review: all the natural candidates are susceptible to distortion (or ‘gaming’). Making comparative judgements more prominent also have other costs. Perhaps all of this should have been obvious at first glance. But it wasn’t to me, hence this post.

Motivation

I’m a fan of cultivating better forecasting performance in general and in the EA community in particular. Perhaps one can break down the benefits this way.

Training: Predicting how the future will go seems handy for intervening upon it to make it go better. Given forecasting seems to be a skill that can improve with practice, practising it could be a worthwhile activity.

Community accuracy: History augurs poorly for those who claim to know the future. Although even unpracticed forecasters typically beat chance, they tend inaccurate and overconfident. My understanding (tendentiously argued elsewhere) is taking aggregates of these forecasts – ditto all other beliefs we have (ibid.) – allows us to fare better than we would each out on our own. Forecasting platforms are one useful way to coordinate in such an exercise, and so participation supplies a common epistemic good.[1] Although this good is undersupplied throughout the intellectual terrain, it may be particularly valuable for more ‘in house’ topics of the EA community, given less in the way of ‘outside interest’.

Self-knowledge/’calibration’: Knowing one’s ability as a forecast be useful piece of self-knowledge. It can inform how heavily we should weigh our own judgement in those rare cases where our opinion comprises a non-trivial proportion of the opinions we are modestly aggregating (ibid. ad nauseam). Sometimes others ask us for forecasts, often under the guise of advice (I have been doing quite a lot of this with ongoing COVID-19 pandemic): our accuracy (absolute or relative) would be useful to provide alongside our forecast, so our advice can be weighed appropriately by its recipient.

Epistemic peer evaluation: It has been known for some to offer their opinion despite their counsel not being invited. In such cases, public disagreement can result. We may be more accurate in adjudicating these disagreements by weighing the epistemic virtue of the opposing ‘camps’ instead of the balance of argument as it appears to us (ibid. – peccavi).

Alas, direct measures of epistemic accuracy can be elusive: people are apt to better remember (and report) their successes over their failures, and track records from things like prop betting or publicly registered predictions tend low-resolution. Other available proxy measures for performance – subject matter expertise, social status, a writing style suffused with fulminant candenzas of melismatic and mellifluous (yet apropos and adroit) limerence of language [sic] – are inaccurate. Forecasting platforms allow people to make a public track record, and paying greater attention to these track records likely improves whatever rubbish approach is the status quo for judging others’ judgement.

Challenges

The latter two objectives require some means of comparing forecasters to one another.[2] This evaluation is tricky for a few reasons:

1. Metrics which allow good inter-individual comparison can interfere with the first two objectives, alongside other costs.

2. Probably in principle (and certainly in practice) natural metrics for this introduce various distortions.

3. (In consequence, said metrics are extremely gameable and vulnerable to Goodhart’s law).

Forecasting and the art of slaking one’s fragile and rapacious ego

Suppose every EA started predicting on a platform like Metaculus. Also suppose there was a credible means to rank all of them by their performance (more later). Finally, suppose this ‘Metaculus rank’ became an important metric used in mutual evaluation.

Although it goes without saying effective altruists almost perfectly act to further the common good, all-but-unalloyed with any notion of self-regard, insofar as this collective virtue is not adamantine, perverse incentives arise. Such as:

Fear of failure has a mixed reputation as an aid to learning. Prevalent worry about ‘tanking ones rank’ could slow learning and improvement, and result in poorer individual and collective performance.
People can be reluctant to compete when they believe they are guaranteed to lose. Whoever finds themselves in the bottom 10% may find excusing themselves from forecasting more appealing than continuing to broadcast their inferior judgement (even your humble author [sic – ad nau– nvm] may not have written this post if he was miles below-par on Good Judgement Open). This is bad for these forecasters (getting better in absolute terms still matters), and for the forecasting community (relatively poorer forecasters still provide valuable information).
Competing over relative rank is zero-sum. To win in zero-sum competition, it is not enough that you succeed – all others must fail. Good reasoning techniques and new evidence are better jealously guarded as ‘trade secrets’ rather than publicly communicated. Yet the latter helps one another to get better, and for the ‘wisdom of the crowd’ to be wiser.

Places like GJO and Metaculus are aware of these problems, and so do not reward relative accuracy alone, either through separate metrics (badges for giving your rationale, ‘upvotes’ on comments, etc.) or making their ‘ranking’ measures composite metrics of accuracy and other things like activity (more later).[3]

These composite metrics are often better. Alice, who starts off a poor forecaster but through diligent practice becomes a good (but not great) and regular contributor to a prediction platform has typically done something more valuable and praiseworthy than Bob, who was naturally brilliant but only stuck around long enough to demonstrate a track record to substantiate his boasting. Yet, as above, sometimes we really do (and really should) care about performance alone, and would value Bob’s judgement over Alice’s.

Hacking scoring metrics for minimal fun and illusory profit

Even if we ignore the above, constructing a good metric of relative accuracy is much easier said than done. Even if we want to (as Tetlock recommends) ‘keep score’ of our performance, essentially all means of keeping score either introduce distortions, are easy to Goodhart, or are uninterpretable. To illustrate, I’ll use all the metrics available for participating in Good Judgement Open as examples (I’m not on Metaculus, but I believe similar things apply).

Incomplete evaluation and strategic overconfidence: Some measures are only reported for single questions or a small set of questions (‘forecast challenges’ in the GJO). This can inadvertently reward overconfidence. ‘Best performers’ for a single question are typically overconfident (and typically inaccurate) forecasters who maxed out their score by betting 0/100% the day a question opened and got lucky.

Sets of questions (‘challenges’) do a bit better (good forecasters tend to find themselves frequently near the top of the leader board), but their small number still allows a lot of volatility. My percentile across question sets on GJO varies from top 0.1% to significantly below average. The former was on a set where I was on the ‘right’ side of the crowd for all dozen of the questions in the challenge. Yet for many of these I was at something like 20% whilst the crowd was at 40% – even presuming I had edge rather than overconfidence, I got lucky that none of these happened. Contrariwise, being (rightly) less highly confident than the crowd will pay out in the long run, but the modal result in a small question-set is getting punished. The latter was a set where I ‘beat the crowd’ on most of the low probabilities, but tanked on an intermediate probability one – Brier scoring and non-normalized adding of absolute difference means this question explained most of the variance in performance across the set.[4]

If one kept score by ones ‘best rankings’, ones number of ‘top X finishes’, or similar, this measure would reward overconfidence, as although this costs you in the long run, it amplifies good fortune.

Activity loading: The leaderboard for challenges isn’t ranked by Brier score (more later), but accuracy, essentially your Brier – crowd Brier. GJO evaluates each day, and ‘carries forward’ forecasts made before (i.e. if you say 52% on Monday, you are counted as forecasting 52% on Tuesday, Wednesday, and every day until the question closes unless you change it). Thus – if you are beating the crowd – your ‘accuracy score’ is also partly an activity score, as answering all the questions, having active forecasts as soon questions open (etc) all improve ones score (presuming one is typically beating the crowd) without being measures of good judgement per se.

Metaculus ranks all users by a point score which (like this forum’s karma system) rewards a history of activity rather than ‘present performance’: even if Alice was more accurate than all current Metaculus users, if she joined today it would take her a long time to rise to the top them.

Raw performance scores are meaningless without a question set: Happily, GJO uses a fairly pure ‘performance metric’ front and centre: Brier score across all of your forecasts. Although there are ways to ‘grind’ activity into accuracy (updating very frequently, having a hair trigger to update on news a couple of days before others get around to it, etc.) it loads much more heavily on performance. It also (at least in the long run) punishes overconfidence.

The problem is that this measure has little meaning on its own. A Brier score of (e.g.) 0.3 may be good or bad depending on how hard it was to forecast your questions – and one can get arbitrarily close to 0 by ‘forecasting the obvious’ (i.e. putting 100% on ‘No’ the day before a question closes and the event has not happened yet). One can fairly safely say your performance isn’t great if you’re underperforming a coin flip, but more than that is hard to say.

Comparative performance is messy with disjoint evaluation sets: To help with this, GJO provides a ‘par score’ as a benchmark, composed of median performance across other forecasts on the same questions and time periods as one’s own. This gives a fairly reliable single bit of information: a Brier score of 0.3 is good if it is lower than this median, as it suggests one has outperformed the typical forecasts on these questions (and vice versa).

However, it is hard to say much more than that. If everyone answered the same questions across the same time periods, one could get a sense of (e.g.) ‘how much better than the median’, or which of Alice and Bob (who are both ‘better than average’) is better. But when (as is typically the case on forecasting questions) people forecast disjoint sets of questions, this goes up in the air again. ‘How much you beat the median by’ can depend as much on ‘picking the right questions’ as ‘forecasting well’:

If the median forecast is already close to the forecast you would make, making a forecast can harm your ‘average outperformance’ even if you are right and the median is wrong (at the extreme, making a forecast identical to the median inevitably drags one closer to par). This can bite a lot for low likelihood questions: if the market is at 2% but you think it should be 8%, even if you’re right you are a) typically likely to lose in the short run, and b) even in the long run the yield, when aggregated, can make you look more similar to the average.
On GJO the more political questions tend to attract less able forecasters, and the niche/technical ones more able ones. For questions where the comments are often “I know X is a fighter, they will win this!” or “Definitely Y, as [conspiracy theory]”, the bar to ‘beat the crowd’ is much lower than for questions where the comments are more “I produced this mathematical model for volatility to give a base-rate for exceeding the cut-off value which does well back-tested on the last 2 years – of course, back-testing on training data risks overfitting (schoolboy error, I know), but this also corresponds with the probability inferred from the price of this exotic financial instrument”.

This is partly a good thing, as it incentives forecasters to prioritise questions where the current consensus forecast is less accurate. But it remains a bad one for the purposes discussed here: even if Alice in some platonic sense is a superior forecaster, Bob may still appear better if he happens to (or strategizes to) benefit from these trends more than her.[5]

Admiring the problem: an Ode to Goodhart’s law

One might have hoped as well describing this problem, I would also have some solutions to propose. Sadly this is beyond me.

The problem seems fundamentally tied to Goodhart’s law. (q.v.) People may have a mix of objectives for forecasting, and this mix may differ between people. Even a metric closely addressed to one objective probably will not line up perfectly, and the incentives to ‘game’ would amplify the divergence. With respect to other important objectives, one expects a greater mismatch: a metric that rewards accuracy can discourage participation (see above); a metric that rewards participation can encourage people to ‘do a lot’ without necessarily ‘adding much’ (or trying to get much better). Composite metrics can help with this problem, but provide another one in turn: difficulty isolating and evaluating individual aspects of the composite.

Perhaps best is a mix of practical wisdom and balance – taking the metrics as useful indicators but not as targets for monomaniacal focus. Some may be better at this than others.

[1]: Aside: one skill in forecasting which I think is neglected is formulating good questions. Typically our convictions are vague gestalts rather than particular crisp propositions. Finding useful ‘proxy propositions’ which usefully inform these broader convictions is an under-appreciated art.

[2]: Relative performance is a useful benchmark for peer weighting as well as self-knowledge. If Alice tends worse than the average person at forecasting, she would be wise to be upfront about this lest her ‘all things considered’ judgements inadvertently lead others (who would otherwise aggregate better discounting her view rather than giving it presumptive equal weight) astray.

[3]: I imagine it is also why (e.g.) the GJP doesn’t spell out exactly how it selects the best GJO users for superforecaster selection.

[4]: One can argue the toss about whether there are easy improvements. One could make a scoring rule more sensitive to accuracy on rare events (Brier is infamously insensitive), or do some intra-question normalisation of accuracy. The downside would be this is intensely gameable, encouraging a ‘pick the change up in front of the steamroller’ strategy – overconfidently predicting rare events definitely won’t happen will typically net one a lot of points, with the occasional massive bust.

[5]: One superforecaster noted taking a similar approach (but see).

Use resilience, instead of imprecision, to communicate uncertainty

2023-11-102023-11-03 ~ Gregory Lewis ~ Leave a comment

BLUF: Suppose you want to estimate some important X (e.g. risk of great power conflict this century, total compute in 2050). If your best guess for X is 0.37, but you’re very uncertain, you still shouldn’t replace it with an imprecise approximation (e.g. “roughly 0.4”, “fairly unlikely”), as this removes information. It is better to offer your precise estimate, alongside some estimate of its resilience, either subjectively (“0.37, but if I thought about it for an hour I’d expect to go up or down by a factor of 2”), or objectively (“0.37, but I think the standard error for my guess to be ~0.1”).

‘False precision’

Imprecision often has a laudable motivation – to avoid misleading your audience into relying on your figures more than they should. If 1 in 7 of my patients recover with a new treatment, I shouldn’t just report this proportion, without elaboration, to 5 significant figures (14.285%).

I think a similar rationale is often applied to subjective estimates (forecasting most salient in my mind). If I say something like “I think there’s a 12% chance of the UN declaring a famine in South Sudan this year”, this could imply my guess is accurate to the nearest percent. If I made this guess off the top of my head, I do not want to suggest such a strong warranty – and others might accuse me of immodest overconfidence (“Sure, Nostradamus – 12% exactly“). Rounding off to a number (“10%”), or just a verbal statement (“pretty unlikely”) seems both more reasonable and defensible, as this makes it clearer I’m guessing.

In praise of uncertain precision

One downside of this is natural language has a limited repertoire to communicate degrees of uncertainty. Sometimes ’round numbers’ are not meant as approximations: I might mean “10%” to be exactly 10% rather than roughly 10%. Verbal riders (e.g. roughly X, around X, X or so, etc.) are ambiguous: does roughly 1000 mean one is uncertain about the last three digits, or the first, or how many digits in total? Qualitative statements are similar: people vary widely in their interpretation of words like ‘unlikely’, ‘almost certain’, and so on.

The greatest downside, though, is precision: you lose half the information if you round percents to per-tenths. If, as is often the case in EA-land, one is constructing some estimate ‘multiplying through’ various subjective judgements, there could also be significant ‘error carried forward’ (cf. premature rounding). If I’m assessing the value of famine prevention efforts in South Sudan, rounding status quo risk to 10% versus 12% infects downstream work with a 1/6th directional error.

There are two natural replies one can make. Both are mistaken.

High precision is exactly worthless

First, one can deny the more precise estimate is any more accurate than the less precise one. Although maybe superforecasters could expect ’rounding to the nearest 10%’ would harm their accuracy, others thinking the same are just kidding themselves, so nothing is lost. One may also have some of Tetlock’s remarks in mind about ’rounding off’ mediocre forecasters doesn’t harm their scores, as opposed to the best.

I don’t think this is right. Combining the two relevant papers (1, 2), you see that everyone, even mediocre forecasters, have significantly worse Brier scores if you round them into seven bins. Non-superforecasters do not see a significant loss if rounded to the nearest 0.1. Superforecasters do see a significant loss at 0.1, but not if you rounded more tightly to 0.05.

Type 2 error (i.e. rounding in fact leads to worse accuracy, but we do not detect it statistically), rather than the returns to precision falling to zero, seems a much better explanation. In principle:

If a measure has signal (and in aggregate everyone was predicting better than chance) shaving bits off it should reduce it; it also definitely shouldn’t increase it, setting the upper bound of whether this helps or harms reliably to zero.
Trailing bits of estimates can be informative even if discrimination between them is unreliable. It’s highly unlikely superforecasters can reliably discriminate (e.g.) p=0.42 versus 0.43, yet their unreliable discrimination can still tend towards the truth (X-1% forecasts happen less frequently than X% forecasts, even if one lacks the data to demonstrate this for any particular X). Superforecaster callibration curves, although good, are imperfect, yet I aver the transformation to perfect calibration will be order preserving rather than ‘stair-stepping’.
Rounding (i.e. undersampling) would only help if we really did have small-n discrete values for our estimates across the number line, and we knew variation under this “estimative nyquist limit” was uninformative jitter.
Yet that we have small-n discrete values (often equidistant on the probability axis, and shared between people), and increased forecasting skill leads n to increase is implausible. That we just have some estimation error (the degree of which is lower for better forecasters) has much better face validity. Yet if there’s no scale threshold which variation below is uninformative, taking the central estimate (rather than adding some arbitrary displacement to get it to the nearest ’round number’) should fare better.
Even on the n-bin model, intermediate values can be naturally parsed as estimation anti-aliasing when one remains unsure which bin to go with (e.g. “Maybe it’s 10%, but maybe it’s 15% – I’m not sure, but maybe more likely 10% than 15%, so I’ll say 12%”). Aliasing them again should do worse.

In practice:

The effect sizes for ‘costs to rounding’ increase both with degree of rounding (you tank Brier more with 3 bins than 7), and with underlying performance (i.e. you tank superforecaster scores more with 7 bins than untrained forecasters). This lines up well with T2 error: I predict even untrained forecasters are numerically worse with 0.1 (or 0.05) rounding, but as their accuracy wasn’t great to start with, this small decrement won’t pass hypothesis testing (but rounding superforecasters to the same granularity generate a larger and so detectable penalty).
Superforecasters themselves are prone to offering intermediate values. If they really only have 0.05 bins (e.g. events they say are 12% likely are really 10% likely, events they say are 13% likely are really 15% likely), this habit worsens their performance. Further, this habit would be one of the few things they do worse than typical forecasters: a typical forecaster jittering over 20 bins when they only have 10 levels is out by a factor of two; a ‘superforecaster’, jittering over percentages when they only have twenty levels, is out by a factor of five.

The rounding/granularity assessments are best seen as approximate tests of accuracy. The error processes which would result in rounding being no worse (or an improvement) labour under very adverse priors, and ‘not shown to be statistically worse’ should not convince us of them.

Precision is essentially (although not precisely) pointless

Second, one may assert the accuracy benefit of precision may be greater than zero, but less than any non-trivial value. For typical forecasters, the cost of rounding into seven bins is a barely perceptible percent or so of Brier score. If (e.g.) whether famine prevention efforts are a good candidate intervention proves sensitive to whether we use a subjective estimate of 12% or round it to 10%, this ‘bottom line’ seems too volatile to take seriously. So rounding is practically non-inferior with respect to accuracy, and so the benefits noted before tilt the balance of considerations in favour.

Yet this reply conflates issues around value of information (q.v.). If I’m a program officer weighing up whether to recommend famine prevention efforts in South Sudan, and I find my evaluation is very sensitive to this ‘off the top of my head’ guess on how likely famine is on the status quo, said guess looks like an important thing to develop if I want to improve my overall estimate.

Suppose this cannot be done – say for some reason I need to make a decision right now, or, despite careful further study, I remain just as uncertain as I was before. In these cases I should decide on the basis of my unrounded-estimate: my decision is better in expectation (if only fractionally) if I base them on (in expectation) fractionally more accurate estimates.

Thus I take precision, even when uncertain – or very uncertain – to be generally beneficial. It would be good if there was some ‘best of both worlds’ way to concisely communicate uncertainty without sacrificing precision. I have a suggestion.

Resilience

One underlying challenge is natural language poorly distinguishes between aleatoric and epistemic uncertainty. I am uncertain (aleatoric sense) whether a coin will land heads, but I’m fairly sure the likelihood will be close to 50% (coins tend approximately fair). I am also uncertain whether local time is before noon in [place I’ve never heard of before], but this uncertainty is essentially inside my own head. I might initially guess 50% (modulo steers like ‘sounds more a like a place in this region on the planet’), but expect this guess to shift to ~0 or ~1 after several seconds of internet enquiry.

This distinction can get murky (e.g. isn’t all the uncertainty about whether there will be a famine ‘inside our heads’?), but the moral that we want to communicate our degree of epistemic uncertainty remains. Some folks already do this by giving a qualitative ‘epistemic status’. We can do the same thing, somewhat more quantitatively, by guessing how resilient our guesses are.

There are a couple of ways I try to do this:

Give a standard error or credible interval: “I think the area of the Mediterranean sea is 300k square kilometers, but I expect to be off by an order of magnitude”; “I think Alice is 165 cm tall (95 CI: 130-190)”. I think it works best when we expect to get access to the ‘true value’ – or where there’s a clear core of non-epistemic uncertainty even a perfect (human) cognizer would have to grapple with.

Give an expected error/CI relative to some better estimator – either a counterpart of yours (“I think there’s a 12% chance of a famine in South Sudan this year, but if I spent another 5 hours on this I’d expect to move by 6%”); or a hypothetical one (“12%, but my 95% CI for what a superforecaster median would be is [0%-45%]”). This works better when one does not expect to get access to the ‘true value’ (“What was the ‘right’ ex ante probability Trump wins the 2016 election?”)

With either, one preserves precision, and communicates a better sense of uncertainty (i.e. how uncertain, rather than that one is uncertain), at a modest cost of verbiage. Another minor benefit is many of these can be tracked for calibration purposes: the first method is all-but a calibration exercise; for the latter, one can review how well you predicted what your more thoughtful self thinks.

Conclusion

All that said, sometimes precision has little value: some very rough sense of uncertainty around a rough estimate is good enough, and careful elaboration is a waste of time. “I’m going into town, I think I’ll be back in around 13 minutes, but with an hour to think more about it I’d expect my guess would change by 3 minutes on average”, seems overkill versus “Going to town, back in quarter of an hour-ish”, as typically the marginal benefit to my friend believing “13 [10-16]” versus (say) “15 [10-25]” is minimal.

Yet not always; some numbers are much more important than others, and worth traversing a very long way along a concision/precision efficient frontier. “How many COVID-19 deaths will be averted if we adopt costly policy X versus less costly variant X`?” is the sort of question where one basically wants as much precision as possible (e.g. you’d probably want to be a lot verbose on spread – or just give the distribution with subjective error bars – rather than a standard error for the mean, etc.)

In these important cases, one is hamstrung if one only has ‘quick and dirty’ ways to communicate uncertainty in one’s arsenal: our powers of judgement are feeble enough without saddling them with lossy and ambiguous communication too. Important cases are also the ones EA-land is often interested in.

Reality is often underpowered

2023-04-06 ~ Gregory Lewis ~ 1 Comment

Introduction

When I worked as a doctor, we had a lecture by a paediatric haematologist, on a condition called Acute Lymphoblastic Leukaemia. I remember being impressed that very large proportions of patients were being offered trials randomising them between different treatment regimens, currently in clinical equipoise, to establish which had the edge. At the time, one of the areas of interest was, given the disease tended to have a good prognosis, whether one could reduce treatment intensity to reduce the long term side-effects of the treatment whilst not adversely affecting survival.

On a later rotation I worked in adult medicine, and one of the patients admitted to my team had an extremely rare cancer,¹with a (recognised) incidence of a handful of cases worldwide per year. It happened the world authority on this condition worked as a professor of medicine in London, and she came down to see them. She explained to me that treatment for this disease was almost entirely based on first principles, informed by a smattering of case reports. The disease unfortunately had a bleak prognosis, although she was uncertain whether this was because it was an aggressive cancer to which current medical science has no answer, or whether there was an effective treatment out there if only it could be found.

I aver that many problems EA concerns itself with are closer to the second story than the first. That in many cases, sufficient data is not only absent in practice but impossible to obtain in principle. Reality is often underpowered for us to wring the answers from it we desire.

Big units of analysis, small samples

The main driver of this problem for ‘EA topics’ is that the outcomes of interest have units of analysis for which the whole population (leave alone any sample from it) is small-n: e.g. outcomes at the level of a whole company, or a whole state, or whole populations. For these big unit of analysis/small sample problems, RCTs face formidable in principle challenges:

Even if by magic you could get (e.g.) all countries on earth to agree to randomly allocate themselves to policy X or Y, this is merely a sample size of ~200. If you’re looking at companies relevant to cage-free campaigns, or administrative regions within a given state, this can easily fall another order of magnitude.
These units of analysis tend highly heterogeneous, almost certainly in ways that affect the outcome of interest. Although the key ‘selling point’ of the RCT is it implicitly controls for all confounders (even ones you don’t know about), this statistical control is a (convex) function of sample size, and isn’t hugely impressive at ~ 100 per arm: it is well within the realms of possibility for the randomisation happen to give arms with unbalanced allocation of any given confounding factor.
‘Roughly’ (in expectation) balanced intervention arms are unlikely to be good enough in cases where the intervention is expected to have much less effect on the outcome than other factors (e.g. wealth, education, size, whatever), thus an effect size that favours one arm or the other can be alternatively attributed to one of these.
Supplementing this raw randomisation by explicitly controlling for confounders you suspect (cf. block randomisation, propensity matching, etc.) has limited value when don’t know all the factors which plausibly ‘swamp’ the likely intervention effect (i.e. you don’t have a good predictive model for the outcome but-for the intervention tested). In any case, they tend to trade-off against the already scarce resource of sample size.

These ‘small sample’ problems aren’t peculiar to RCTs, but endemic to all other empirical approaches. The wealth of econometric and quasi-experimental methods (e.g. IVs, regression discontinuity analysis), still run up against hard data limits, as well those owed to in whatever respect they fall short of the ‘ideal’ RCT set-up (e.g. imperfect instrumentation, omitted variable bias, nagging concerns about reverse causation). Qualitative work (case studies, etc.) have the same problems even if other ones (e.g. selection) loom larger.

Value of information and the margin of common-sense

None of this means such work has zero value – big enough effect sizes can still be reliably detected, and even underpowered studies still give us information. But we may learn very little on the margin of common sense. Suppose we are interested in ‘what makes social movements succeed or fail?’ and we retrospectively assess a (somehow) representative sample of social movements. It seems plausible the results of this investigation is the big (and plausibly generalisable) hits may prove commonsensical (e.g. “Social movements are more likely to grow if members talk to other people about the social movement”), whilst the ‘new lessons’ remain equivocal and uncertain.

We should expect to see this if we believe the distribution of relevant effect sizes is heavy-tailed, with most of the variance in (say) social movement success owed to a small number of factors, with the rest comprised of a large multitude of smaller effects. In such case, modest increases in information (e.g. from small sample data) may bring even more modest increases in either explaining the outcome or identifying what contributes to it:

Toy example, where we propose a roughly pareto distribution of effect size among contributory factors. The largest factors (which nonetheless explain a minority of the variance) may prove to be obvious to the naked eye (blue). Adding in the accessible data may only slightly lower detection threshold, with modest impacts on identifying further factors (green) and overall accuracy. The great bulk of the variance remains in virtue of a large ensemble of small factors which cannot be identified (red). Note that detection threshold tends to have diminishing returns with sample size.

The scientific revolution for doing good?

The foregoing should not be read as general scepticism to using data. The triumphs of evidence-based medicine, although not unalloyed, have been substantial, and there remain considerable gains that remain on the table (e.g. leveraging routine clinical practice). The ‘randomista’ trend in international development is generally one to celebrate, especially (as I understand) it increasingly aims to isolate factors that have credible external validity. The people who run cluster-randomised, stepped-wedge, and other study designs with big units of analysis are not ignorant of their limitations, and can deploy these judiciously.

But it should temper our enthusiasm about how many insights we can glean by getting some data and doing something sciency to it.² The early successes of EA in global health owes a lot to this being one of the easier areas to get crisp, intersubjective and legible answers from a wealth of available data. For many to most other issues, data-driven demonstration of ‘what really works’ will never be possible.

We see that people do better than chance (or better than others) in terms of prediction and strategic judgement. Yet, at least judging by the superforecasters (this writeup by AI impacts is an excellent overview), how they do is much more indirectly data-driven: one may have to weigh between several facially-relevant ‘base rates’, adjusting these rates by factors where the coefficient may be estimated by role in loosely analogous cases, and so forth.³ Although this process may be informed by statistical and numerical literacy (e.g. decomposition, ‘fermi-ization’), it seems to me the main action going on ‘under the hood’ is developing a large (and implicit, and mostly illegible) set of gestalts and impressions to determine how to ‘weigh’ relevant data that is nonetheless fairly remote to the question at issue.⁴

Three final EA takeaways:

Most who (e.g.) write up a case study or a small-sample analysis tend to be well aware of the limitations of their work. Nonetheless I think it is worth paying more attention to how these bear on overall value of information before one embarks on these pieces of work. Small nuggets of information may not be worth the time to excavate even when the audience are ideal reasoners. As they aren’t, one risks them (or yourself) over-weighing their value when considering problems which should demand tricky aggregation of a multitude of data sources.
There can be good reasons why expert communities in some areas haven’t tried to use data explicitly to answer problems in their field. In these cases, the ‘calling card’ of EA-style analysis of doing this anyway can be less of a disruptive breakthrough and more a stigma of intellectual naivete.
In areas where ‘being driven by the data’ isn’t a huge advantage, it can be hard to identify an ‘edge’ that the EA community has. There are other candidates: investigating topics neglected by existing work, better aligned incentives, etc. We should be sceptical of stories which boil down a generalized ‘EA exceptionalism’.

Footnotes

Its name escapes me, although arguably including it would risk deductive disclosure. To play it safe I’ve obfuscated some details.
And statistics and study design generally prove hard enough that experts often go wrong. Given the EA community’s general lack of cultural competence in these areas, I think their (generally amateur) efforts at the same have tended to fare worse.
I take as supportive evidence a common feature among superforecasters is they read a lot – not just in areas closely relevant to their forecasts, but more broadly across history, politics, etc.
Something analogous happens in other areas of ‘expert judgement’, whereby experts may not be able to explain why they made a given determination. We know that this implicit expert judgement can be outperformed by simple ‘reasoned rules’. My suspicion, however, is it still performs better than chance (or inexpert judgement) when such rules are not available.

Is old art obsolete?

2019-02-032019-02-03 ~ Gregory Lewis ~ 2 Comments

Summary

The present is much better than the past in almost all respects, and producing great works of art should not be one of exceptions: in principle modernity has a much larger ‘talent pool’ than the past, and in circumstances which much better allow this talent to flow into great accomplishments; when we look at elite performance in areas with ‘harder’ indicators, they show a trend of improvement from the past to present; and aggregate consumption of older versus new art provides a ‘wisdom of crowds’ consideration.

If we want to experience the ‘best’ artistic work, we should turn our attention away from the ‘classics’, and towards the modern body of work which all-but-inevitably surpasses them. Old art is obsolete.

Such a view runs counter to received wisdom, whereby the ‘canon’ of great art is dominated by classic works wrought by old masters. There’s a natural debunking argument that this view derives from snobbish signalling than genuine merit. I nonetheless explore other hypotheses that make old art non-obsolescent after all.

In concert, they may be enough to secure a place for old art. Yet the do not defray the key claim that the modern artistic community is profoundly stronger than any time before, and posterity should find our art to have been greater than any time before. For this, and many other things besides, there has never been a better time to be a Whig historian.

Continue reading “Is old art obsolete?” →

Notes on Deep Work

2018-10-10 ~ Gregory Lewis ~ 1 Comment

Introduction

Cal Newport has written a book on ‘Deep Work’. I’m usually archly sceptical of ‘productivity gurus’ or ‘productivity guides’, but this book was recommended and I found it was very good – so much so I made notes on the way through.

I put them here in case others find them valuable: I’ve done a bit of rearranging of the line of argument, as well as some commentary on the areas I take to be less well-supported. Nonetheless, the notes follow the general structure in the book of section 1: Why deep work is great; section 2: How to do it. Those after the armamentarium of suggestions Newport makes on how to do it can skip down to the third section. Avoiding anything in italics spares you most of my editorialising. Continue reading “Notes on Deep Work” →