<!DOCTYPE html>
<html><head><title>niplav</title>
<link href="./favicon.png" rel="shortcut icon" type="image/png"/>
<link href="main.css" rel="stylesheet" type="text/css"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<style type="text/css">
code.has-jax {font: inherit; font-size: 100%; background: inherit; border: inherit;}
</style>
<script async="" src="./mathjax/latest.js?config=TeX-MML-AM_CHTML" type="text/javascript">
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true,
skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
},
"HTML-CSS": { availableFonts: ["TeX"] }
});
</script>
<script>
document.addEventListener('DOMContentLoaded', function () {
// Change the title to the h1 header
var title = document.querySelector('h1')
if(title) {
var title_elem = document.querySelector('title')
title_elem.textContent=title.textContent + " – niplav"
}
});
</script>
</head><body><h2 id="home"><a href="./index.html">home</a></h2>
<p><em>author: niplav, created: 2024-04-06, modified: 2024-04-06, language: english, status: in progress, importance: 5, confidence: likely</em></p>
<blockquote>
<p><strong>Question decomposition is unreasonably effective, despite good
counter-arguments.</strong></p>
</blockquote><div class="toc"><div class="toc-title">Contents</div><ul><li><a href="#Classification_and_Improvements">Classification and Improvements</a><ul></ul></li><li><a href="#Using_LLMs">Using LLMs</a><ul><li><a href="#Small_Experiment">Small Experiment</a><ul><li><a href="#Description">Description</a><ul></ul></li><li><a href="#Conclusion">Conclusion</a><ul></ul></li></ul></li></ul></li><li><a href="#Discussions">Discussions</a><ul></ul></li></ul></div>
<h1 id="On_The_Unreasonable_Effectiveness_Of_Question_Decomposition"><a class="hanchor" href="#On_The_Unreasonable_Effectiveness_Of_Question_Decomposition">On The Unreasonable Effectiveness Of Question Decomposition</a></h1>
<p>If we say "<code>$X$</code> will happen if and only if <code>$Y_1$</code> and <code>$Y_2$</code> and
<code>$Y_3$</code>… <em>all</em> happen, so we estimate <code>$P(Y_1)$</code> and <code>$P(Y_2|Y_1)$</code>
and <code>$P(Y_3|Y_1, Y_2)$</code> &c, and then multiply them together to estimate
<code>$P(X)=P(Y_1)·P(Y_2|Y_1)·P(Y_3|Y_2,Y_1)·\dots$</code>", do we usually get a
probability that is close to <code>$P(X)$</code>? And does this <em>improve</em> on forecasts where
one tries to estimate <code>$P(X)$</code> directly?</p>
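<p>As a minimal illustration of the arithmetic (with made-up numbers, not
a real forecast), the chain-rule product looks like this:</p>
<pre><code> # Chain-rule product for a multiplicative decomposition;
# the conditional probabilities below are purely illustrative.
from math import prod

# P(Y_1), P(Y_2|Y_1), P(Y_3|Y_1,Y_2), P(Y_4|Y_1,Y_2,Y_3)
conditionals = [0.9, 0.8, 0.7, 0.95]

p_x = prod(conditionals)
print(p_x)  # ≈ 0.4788
</code></pre>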
<p>This type of question decomposition (which one could
call <strong>multiplicative decomposition</strong>) appears to be a
relatively common method for forecasting, see <a href="https://forum.effectivealtruism.org/posts/ARkbWch5RMsj6xP5p/transformative-agi-by-2043-is-less-than-1-likely">Allyn-Feuer & Sanders
2023</a>,
<a href="http://fivethirtyeight.com/features/donald-trumps-six-stages-of-doom/">Silver
2016</a>,
<a href="https://www.jefftk.com/p/breaking-down-cryonics-probabilities">Kaufman
2011</a>,
<a href="https://arxiv.org/abs/2206.13353">Carlsmith 2022</a> and <a href="https://www.overcomingbias.com/p/break-cryonics-downhtml">Hanson
2011</a>,
but there have been conceptual arguments against this technique, see
<a href="https://arbital.com/p/multiple_stage_fallacy/">Yudkowsky 2017</a>, <a href="https://www.lesswrong.com/posts/kmZkCmz6AiJntjWDG/multiple-stages-of-fallacy-justifications-and-non">AronT
2023</a>
and <a href="https://gwern.net/forking-path">Gwern 2019</a>, which all argue that
it reliably underestimates the probability of events.</p>
<p>What is the empirical evidence for decomposition being a technique that
<em>improves</em> forecasts?</p>
<p><a href="https://www.sciencedirect.com/science/article/abs/pii/S0169207006000501">Lawrence et al.
2006</a>
summarize the state of research on the question:</p>
<blockquote>
<p>Decomposition methods are designed to improve accuracy by splitting
the judgmental task into a series of smaller and cognitively less
demanding tasks, and then combining the resulting judgements. <a href="https://www.researchgate.net/publication/267198099_The_Forecasting_Dictionary">Armstrong
(2001)</a>
distinguishes between decomposition, where the breakdown of
the task is multiplicative (e.g. sales forecast=market size
forecast×market share forecast), and segmentation, where it is
additive (e.g. sales forecast=Northern region forecast+Western
region forecast+Central region forecast), but we will use the
term for both approaches here. <strong>Surprisingly, there has been
relatively little research over the last 25 years into the value
of decomposition and the conditions under which it is likely to
improve accuracy. In only a few cases has the accuracy of forecasts
resulting from decomposition been tested against those of control
groups making forecasts holistically.</strong> One exception is <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/for.3980090403">Edmundson
(1990)</a>
who found that for a time series extrapolation task, obtaining separate
estimates of the trend, seasonal and random components and then combining
these to obtain forecasts led to greater accuracy than could be obtained
from holistic forecasts. Similarly, <a href="https://www.sciencedirect.com/science/article/abs/pii/S0169207004001049">Webby, O’Connor and Edmundson
(2005)</a>
showed that, when a time series was disturbed in some periods by several
simultaneous special events, accuracy was greater when forecasters were
required to make separate estimates for the effect of each event, rather
than estimating the combined effects holistically. <a href="https://core.ac.uk/download/pdf/76362507.pdf">Armstrong and Collopy
(1993)</a> also constructed
more accurate forecasts by structuring the selection and weighting
of statistical forecasts around the judge’s knowledge of separate
factors that influence the trends in time series (causal forces).
Many other proposals for decomposition methods have been based on an
act of faith that breaking down judgmental tasks is bound to improve
accuracy or upon the fact that decomposition yields an audit trail
and hence a defensible rationale for the forecasts (<a href="https://d1wqtxts1xzle7.cloudfront.net/49278189/0169-2070_2891_2990004-f20161001-25533-1nihj7p-libre.pdf?1475369537=&response-content-disposition=inline%3B+filename%3DUsing_belief_networks_to_forecast_oil_pr.pdf&Expires=1693237828&Signature=JtQssSZv0KaUbWLf3fPA70ho1ECj9zYkBC~EnNVIrFfIgcQ5dDVeK5stSWj1tR7OQrcur7PG~y8wHNuAorqrPAjqHwEq3T88klt23BzmzXwMWUNR~ZPKimTrcDTGgrj0WcC~~gM51fzvvCJrK2hO7oPsmc-mQsgvBL5VIywRLw6-GpQjBbpILXJk90c3-JTXwWeUwhwt1zv3h6U-WAyQn-Y88tZg~R7AUFJBRAdbwV8A67o7mHcCZNbKLdluGYDgG9uC516BWr4lckSd7VcoqzfywkjpxZWTjBEFLvmJoWuSRwNvqak3SzHBO5Hv86zZ4oJtWXbxwTdsVw61JGmt6Q__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA">Abramson & Finizza,
1991</a>;
<a href="https://www.researchgate.net/profile/George-Wright-11/publication/227446292_Interaction_of_Judgemental_and_Statistical_Forecasting_Methods_Issues_Analysis/links/0c9605375ff870d3c9000000/Interaction-of-Judgemental-and-Statistical-Forecasting-Methods-Issues-Analysis.pdf">Bunn & Wright,
1991</a>;
<a href="https://www.sciencedirect.com/science/article/abs/pii/0169207092900277">Flores, Olson, & Wolfe,
1992</a>;
<a href="https://link.springer.com/book/9789401579544">Saaty &
Vargas, 1991</a>; <a href="https://d1wqtxts1xzle7.cloudfront.net/56667412/0040-1625_2894_2900050-720180527-12155-ptp70l-libre.pdf?1527455229=&response-content-disposition=inline%3B+filename%3DDecomposition_in_the_assessment_of_judgm.pdf&Expires=1693237988&Signature=Np4MDK~nFPb3xPknH2QaBnyOnnYT8FPgpsx7PTKkZEhmPVRQ5RTKSKzOQ7j9KDstvWfF~X7pIQdd~OJxn4OntioCsEPCPxRzLtOUscn3~UuGBnWYNsZ4JO8iBaREvH2N~DL0um~6moufhk69-lNkSjV~x2MLC5KMDBGJUwbxSwZmTp0sx3vANfZGpq~~f5ojnSkSfVJ1NYvWr82KK5UUxtU08HtGsSqOKlBB8NA7~IxsTcJnUKONHm5lczVeWq8KBEMGaNLI0GBr1y4e2bPA~Y8aKcCqnDbsOriQ0f7rNclqsY-cEEarUmd8UXRFJZc6vtPjgdF5Xv0CgrPEGIh0xA__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA">Salo & Bunn, 1995</a>;
<a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/for.3980090407">Wolfe & Flores,
1990</a>).
Yet, as <a href="https://d1wqtxts1xzle7.cloudfront.net/46103856/0169-2070_2893_2990001-420160531-2191-1mlbayr-libre.pdf?1464718211=&response-content-disposition=inline%3B+filename%3DImproving_judgmental_time_series_forecas.pdf&Expires=1693238053&Signature=PDCGfnyMHtluH1q9RsZffGSGZU02oBJZvEFChvofGx0nzBDrpCnlErCwx5OFUv0rXIRsULnJL~LA57rWsRXBEXcAbUtaObpC6rJTmAqe1RJLkDE59eD7787zBpqxYCkBHx5-uOou2gPpBCrxpMzc9JS3zDt4HXSs3eiXMzhzw0jPHkPyYGPwIFK5Xae1JVOkmZccnBe-9QwZhwyIcLEqoEWIoAr34d2EW19zendk~9NA182Kaf4MgKXaUCzMxSwcyMIWfoJ5K~VdfWr5Cf1LOToCb638Nn354gpcOtTX~gwCfaK0lwWcqc9Ew-3AJ7w7EBtStETgW3rSbOvuSV9WrA__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA">Goodwin and Wright
(1993)</a>
point out, decomposition is not guaranteed to improve accuracy and may
actually reduce it when the decomposed judgements are psychologically more
complex or less familiar than holistic judgements, or where the increased
number of judgements required by the decomposition induces fatigue.</p>
</blockquote>
<p>(Emphasis mine).</p>
<p>The types of decomposition described here seem quite different from
the ones used in the sources above: Decomposed time series are quite
dissimilar to multiplied probabilities for binary predictions, and in
combination with the conceptual counter-arguments the evidence appears
quite weak.</p>
<p>It appears as if a team of a few (let's say 4) dedicated forecasters could
run a small experiment to determine whether multiplicative decomposition
for binary forecasts is a good method, by randomly spending 20 minutes either
making explicitly decomposed forecasts or control forecasts (although
the exact method for the control needs to be elaborated on). Working in
parallel, making 70 forecasts should take $70 \text{ forecasts} \cdot \frac{1 \text{ hr}}{3 \text{ forecasts}} \cdot \frac{1}{4}
\approx 5.8 \text{ hr}$, i.e. less than 6 hours, although it'd be useful to search for
more recent literature on the question first.</p>
<ul>
<li>Would decomposition work better if one were operating with log-odds instead of probabilities? (A conversion sketch follows below.)</li>
</ul>
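<p>For concreteness, "operating with log-odds" would mean eliciting each
estimate on the logit scale and converting back at the end; a minimal
conversion sketch (the aggregation scheme itself is left open):</p>
<pre><code> # Probability ↔ log-odds conversion, for eliciting estimates
# on the logit scale instead of the probability scale.
from math import log, exp

def logit(p):
    return log(p / (1 - p))

def expit(l):
    return 1 / (1 + exp(-l))

print(logit(0.75))         # ≈ 1.0986
print(expit(logit(0.75)))  # ≈ 0.75 (round-trip)
</code></pre>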
<h2 id="Classification_and_Improvements"><a class="hanchor" href="#Classification_and_Improvements">Classification and Improvements</a></h2>
<p>The description of such decomposition in <a href="#On_The_Unreasonable_Effectiveness_Of_Question_Decomposition">the
section above</a> is, of course, lacking: A
<em>better</em> way of decomposition would be, for a specific outcome,
to find a set of preconditions for <code>$X$</code> that are <a href="https://en.wikipedia.org/wiki/Mutually-exclusive">mutually
exclusive</a>
and <a href="https://en.wikipedia.org/wiki/Collectively_exhaustive">collectively
exhaustive</a>, find
a chain that precedes them (or another MECE decomposition), and iterate
until a whole (possibly interweaving) tree of options has been found.</p>
<p>Thus one can define three types of question decomposition (a small code sketch follows the list):</p>
<ol>
<li><strong>Multiplicative Decomposition</strong>: Given an event <code>$X$</code>, find conditions <code>$Y_1, \dots, Y_n$</code> so that <code>$X$</code> happens if and only if all of <code>$Y_1, \dots, Y_n$</code> happen. Estimate <code>$P(Y_1)$</code> and <code>$P(Y_2|Y_1)$</code> and <code>$P(Y_3|Y_1, Y_2)$</code> &c, and then multiply them together to estimate <code>$P(X)=P(Y_1)·P(Y_2|Y_1)·P(Y_3|Y_2,Y_1) \dots P(Y_n | Y_{n-1}, \dots, Y_2, Y_1)$</code>.</li>
<li><strong>Additive Decomposition</strong> or <strong><a href="https://en.wikipedia.org/wiki/Mutually-exclusive">ME</a><a href="https://en.wikipedia.org/wiki/Collectively_Exhaustive_Events">CE</a> Decomposition</strong>: Given an event <code>$X$</code>, find a set of scenarios <code>$Y_1, \dots, Y_n$</code>
such that <code>$X$</code> happens if any <code>$Y_i$</code> happens, and only then, and no two distinct <code>$Y_k, Y_l$</code> have <code>$P(Y_k \cap Y_l)>0$</code>. Estimate <code>$P(Y_1), P(Y_2), \dots, P(Y_n)$</code> and then estimate <code>$P(X)=\sum_{i=1}^n P(Y_i)$</code>.</li>
<li><strong>Recursive Decomposition</strong>: For each scenario <code>$X'$</code>, decide to pursue one of the following strategies:
1. Estimate <code>$P(X')$</code> directly
2. Multiplicative decomposition of <code>$P(X')$</code>
1. Find a multiplicative decomposition <code>$Y_1', \dots Y_n'$</code> for <code>$X'$</code>
2. Estimate <code>$P(Y_1'), \dots P(Y_n' | Y_1', \dots Y_{n-1}')$</code> each via recursive decomposition
3. Determine <code>$P(X')=P(Y_1')·P(Y_2'|Y_1')·P(Y_3'|Y_2', Y_1') \dots P(Y_n' | Y_{n-1}', \dots, Y_2', Y_1')$</code>.
3. Additive decomposition of <code>$P(X')$</code>
1. Find an additive decomposition <code>$Y_1', \dots Y_n'$</code> for <code>$X'$</code>
2. Estimate <code>$P(Y_1'), \dots P(Y_n')$</code> each via recursive decomposition
3. Determine <code>$P(X')=P(Y_1')+P(Y_2')+\dots+P(Y_n')$</code>.</li>
</ol>
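<p>A minimal sketch of these three estimators (function names, tree encoding
and all probabilities are hypothetical, purely for illustration):</p>
<pre><code> # Sketch of the three estimators; all probabilities are hypothetical
# inputs a forecaster would supply.
from math import prod

def multiplicative(conditionals):
    # P(X) = P(Y_1)·P(Y_2|Y_1)·…·P(Y_n|Y_1,…,Y_{n-1})
    return prod(conditionals)

def additive(scenarios):
    # P(X) = Σ P(Y_i) for mutually exclusive, collectively exhaustive Y_i
    return sum(scenarios)

# Recursive decomposition: a node is either a direct estimate (a number)
# or a ('mul'/'add', children) pair whose children are combined recursively.
def recursive(node):
    if isinstance(node, (int, float)):
        return node  # strategy 1: direct estimation
    kind, children = node
    parts = [recursive(child) for child in children]
    return prod(parts) if kind == 'mul' else sum(parts)

print(recursive(('mul', [0.9, ('add', [0.2, 0.5]), 0.8])))  # 0.9·0.7·0.8 = 0.504
</code></pre>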
<p>A keen reader will notice that recursive decomposition is similar to
<a href="https://en.wikipedia.org/wiki/Bayesian_Network">Bayes nets</a>. True,
though it doesn't deal as well with <a href="https://en.wikipedia.org/wiki/Conditional_Probability">conditional
probabilities</a>.</p>
<h2 id="Using_LLMs"><a class="hanchor" href="#Using_LLMs">Using LLMs</a></h2>
<p>This is a scenario where large language models are quite useful, and
we have a testable hypothesis: Does question decomposition (or MECE
decomposition) improve language model forecasts by any amount?</p>
<p>Frontier LLMs are <a href="https://dynomight.net/predictions/">at</a>
<a href="https://www.lesswrong.com/posts/c3cQgBN3v2Cxpe2kc/getting-gpt-3-to-predict-metaculus-questions">best</a>
<a href="https://arxiv.org/pdf/2206.15474v1">mediocre</a> at
forecasting real-world events, but, similar to how <a href="https://arxiv.org/pdf/2305.14975v2">asking
for calibration</a>
improves performance,
<a href="https://blog.research.google/2022/05/language-models-perform-reasoning-via.html">chain-of-thought</a>-like
question decomposition might improve (or reduce) their performance (and
thereby give us reason to believe that similar practices will (or
won't) work for human forecasters).</p>
<p>Direct:</p>
<pre><code> Provide your best probabilistic estimate for the following
question. Give ONLY the probability, no other words or
explanation. For example:
0.1
Give the most likely guess, as short as possible; not a complete
sentence, just the guess!
The question is: ${QUESTION}.
${RESOLUTION_CRITERIA}.
</code></pre>
<p>Multiplicative decomposition:</p>
<pre><code> Provide your best probabilistic estimate for a question.
Your output should be structured in three parts.
First, determine a list of factors X₁, …, X_n that are necessary
and sufficient for the question to be answered "Yes". You can choose
any number of factors.
Second, for each factor X_i, estimate and output the conditional
probability P(X_i|X₁, X₂, …, X_{i-1}), the probability that X_i
will happen, given all the previous factors *have* happened. Then, arrive
at the probability for Q by multiplying the conditional probabilities
P(X_i):
P(Q)=P(X₁)*P(X₂|X₁)…P(X_n|X₁, X₂, …, X_{n-1}).
Third and finally, in the last line, report P(Q), WITHOUT ANY ADDITIONAL
TEXT. Just write the probability, and nothing else.
Example (Question: "Will my wife get bread from the bakery today?"):
Necessary factors:
1. My wife remembers to get bread from the bakery.
2. The car isn't broken.
3. The bakery is open.
4. The bakery still has bread.
1. P(My wife remembers to get bread from the bakery)=0.75
2. P(The car isn't broken|My wife remembers to get bread from the bakery)=0.99
3. P(The bakery is open|The car isn't broken, My wife remembers to get bread from the bakery)=0.7
4. P(The bakery still has bread|The bakery is open, The car isn't broken, My wife remembers to get bread from the bakery)=0.9
Multiplying out the probabilities: 0.75*0.99*0.7*0.9=0.467775
0.467775
(End of output)
The question is: ${QUESTION}.
${RESOLUTION_CRITERIA}
</code></pre>
<h3 id="Small_Experiment"><a class="hanchor" href="#Small_Experiment">Small Experiment</a></h3>
<p>Using the Metaculus data in <a href="./iqisa.html">iqisa</a>, I test whether
gpt-4-0613 performs better with multiplicative decomposition than with
direct estimation.</p>
<p><strong>I find that, on a sample of 100 questions from after the end
of training, <em>multiplicative decomposition</em> outperforms <em>direct
estimation</em>.</strong></p>
<p>Forecasts produced by direct estimation have a log-score of -0.439,
while forecasts produced by multiplicative decomposition have a log-score
of -0.206.</p>
<p>This goes against <a href="https://fatebook.io/q/multiplicative-decomposition-of-a--cluns2mz10003l408adz3aw37">my
prediction</a>
that direct estimation would outperform multiplicative
decomposition<sub><a href="https://fatebook.io/q/multiplicative-decomposition-of-a--cluns2mz10003l408adz3aw37">75%</a></sub>.</p>
<h4 id="Description"><a class="hanchor" href="#Description">Description</a></h4>
<p>Code <a href="https://github.com/niplav/decomposer">here</a>.</p>
<p>I load the forecasting data in iqisa, and filter to select questions
which resolved after the release of GPT-4 in March 2023:</p>
<pre><code> from datetime import datetime
from iqisa import metaculus  # import path is an assumption; see the linked repository

questions=metaculus.load_questions(data_dir='../iqisa/data')
# questions resolving after GPT-4's release can't have their
# resolutions in its training data
end_training_data=datetime.fromisoformat('2023-04-30')
not_in_training_data=questions.loc[(questions['q_status']=='resolved') & (questions['resolve_time']>=end_training_data)]
</code></pre>
<p>I then loop over <code>not_in_training_data</code>, replace <code>${QUESTION}</code> in
the prompts with the question text and <code>${RESOLUTION_CRITERIA}</code> with the
resolution criteria, send the resulting prompt over the OpenAI API,
and save the result in a separate file. The resulting files can be found
<a href="https://github.com/niplav/decomposer/tree/main/completions">here</a>;
the file-name format is <code>${QUESTION_ID}_${METHOD}_completion</code>.</p>
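<p>A sketch of that loop (the column names, the <code>TEMPLATE</code> variable,
and the output path are assumptions; the actual code is in the linked
repository):</p>
<pre><code> # Sketch of the querying loop, using the legacy (pre-1.0)
# openai-python interface that was current for gpt-4-0613.
import openai

for _, question in not_in_training_data.iterrows():
    # 'title'/'description'/'question_id' are assumed column names
    prompt = TEMPLATE.replace('${QUESTION}', str(question['title']))
    prompt = prompt.replace('${RESOLUTION_CRITERIA}', str(question['description']))
    completion = openai.ChatCompletion.create(
        model='gpt-4-0613',
        messages=[{'role': 'user', 'content': prompt}],
    )
    text = completion.choices[0].message.content
    with open(f"completions/{question['question_id']}_direct_completion", 'w') as f:
        f.write(text)
</code></pre>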
<p>I then <a href="https://github.com/niplav/decomposer/blob/main/analyser.py">load the last line from each
file</a>, and
calculate the <a href="https://en.wikipedia.org/wiki/Log_scoring_rule">log-score</a>
of the individual forecasts.</p>
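<p>For a binary question, this comes out to <code>$\log(p)$</code> if the
question resolved positively and <code>$\log(1-p)$</code> otherwise, so
scores are negative and higher (closer to zero) is better:</p>
<pre><code> # Log-score of a single forecast on a binary question.
from math import log

def log_score(p, resolved_yes):
    return log(p) if resolved_yes else log(1 - p)

print(log_score(0.8, True))   # ≈ -0.223
print(log_score(0.8, False))  # ≈ -1.609
</code></pre>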
<h4 id="Conclusion"><a class="hanchor" href="#Conclusion">Conclusion</a></h4>
<p>I have changed my mind as a result of this experiment: I previously
believed that multiplicative decomposition would result in worse
forecasts, namely ones that are too close to 0% with respect to good
calibration.</p>
<p>My best guess now is that multiplicative decomposition <em>has that
effect</em>, but that it also encourages thinking more deeply and clearly
about the question, and that this outweighs the effect from pushing
the probabilities towards zero. This should also hold for truth-seeking
humans.</p>
<p>It also suggests another experiment: if my understanding is correct,
instructing the language
model to "think about the question step by step", without
any specific instructions on how to do so, would outperform multiplicative
decomposition<sub><a href="https://fatebook.io/q/instructing-to-think-about-the-question--clunyjnwf0001l708pz9k1xv7">55%</a></sub>.</p>
<h2 id="Discussions"><a class="hanchor" href="#Discussions">Discussions</a></h2>
<ul>
<li><a href="https://www.lesswrong.com/posts/YjZ8sJmkGJQhNcjHj/the-evidence-for-question-decomposition-is-weak">LessWrong</a></li>
<li><a href="https://forum.effectivealtruism.org/posts/beRtXkMpCT39y8bPj/there-is-little-evidence-on-question-decomposition">Effective Altruism Forum</a></li>
</ul>
</body></html>