% Gemini theme
% https://github.com/anishathalye/gemini
\documentclass[final]{beamer}
% ====================
% Packages
% ====================
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage[orientation=portrait,size=a0,scale=1.0]{beamerposter}
\usetheme{gemini}
\usecolortheme{gemini}
\usepackage{graphicx}
\usepackage{svg}
\usepackage{booktabs}
\usepackage{tikz}
\usepackage{pgfplots}
\usepackage{pifont}
\usepackage{tcolorbox}
\usepackage{xcolor}
\usepackage{forest}
\usepackage{wrapfig}
\usepackage{multicol}
\usepackage{hyperref}
\usepackage{natbib}
\renewcommand{\bibpreamble}{\setlength{\columnsep}{30pt}\begin{multicols}{3}}
\renewcommand{\bibpostamble}{\end{multicols}}
\usetikzlibrary{calc,fit,shapes.arrows,positioning}
\usetikzlibrary{graphs}
\usetikzlibrary{graphdrawing}
\usetikzlibrary{trees,matrix}
\usegdlibrary{trees}
% ====================
% Lengths
% ====================
% If you have N columns, choose \sepwidth and \colwidth such that
% (N+1)*\sepwidth + N*\colwidth = \paperwidth
\newlength{\sepwidth}
\newlength{\colwidth}
\setlength{\sepwidth}{0.0333\paperwidth}
\setlength{\colwidth}{0.4\paperwidth}
\newcommand{\separatorcolumn}{\begin{column}{\sepwidth}\end{column}}
\pgfplotsset{compat=1.16}
\colorlet{darkyellow}{yellow!60!black}
\colorlet{lightblue}{blue!65}
\colorlet{darkred}{red!60!black}
\definecolor{mycolor}{rgb}{0.122, 0.435, 0.698}
\newtcbox{\mybox}{on line,
colframe=mycolor,colback=mycolor!10!white,arc=4pt,boxsep=0pt,left=6pt,right=6pt,boxrule=0pt,top=6pt,bottom=6pt}
% \renewcommand{\emph}[1]{\mybox{\emph{#1}}}
\newcommand{\realemph}[1]{\mybox{\textit{#1}}}
% ====================
% Title
% ====================
\newcommand\textbox[1]{%
\parbox{.333\textwidth}{#1}%
}
\title{
\noindent
\makebox[0pt][l]{}%
\makebox[\textwidth][c]{AbcRanger}%
\makebox[0pt][r]{\raisebox{1.5ex}{\footnotesize{Code and binaries available at \url{http://github.com/diyabc/abcranger}\hspace{2cm}}}}\\
\LARGE{A fast and scalable random forest library for ABC model choice and parameter estimation}
}
% \title{\RaggedLeft \footnotesize{Code and binaries available at \url{http://github.com/diyabc/abcranger}}\\
% \Centering \Huge{AbcRanger}\\
% \LARGE{A fast and scalable random forest library for ABC model choice and parameter estimation}}
\author{F.-D. Collin \inst{2} A. Estoup \inst{1} J.-M. Marin \inst{2} L. Raynal \inst{2} }
\institute[shortinst]{\inst{1} CBGP, INRA, CIRAD, IRD, Montpellier SupAgro, Univ. Montpellier \samelineand \inst{2} Université de Montpellier, CNRS, IMAG UMR 5149}
% \author{Alyssa P. Hacker \inst{1} \and Ben Bitdiddle \inst{2} \and Lem E. Tweakit \inst{2}}
% \institute[shortinst]{\inst{1} Some Institute \samelineand \inst{2} Another Institute}
% ====================
% Body
% ====================
\begin{document}
\begin{frame}[t]
\begin{columns}[t]
\separatorcolumn
\begin{column}{\colwidth}
\begin{block}{First building block: ABC simulations}
\begin{figure}
\centering
\includesvg[width=35cm]{abc-explication}
\caption{ABC details}
\end{figure}
Given an observed dataset, the basic idea of ABC, \emph{Approximate Bayesian Computation} \cite{marin2012approximate}, is to approximate the likelihood of a parametrized model by selecting the simulations whose computed \emph{summary statistics} are closest to those of the observed data. The table of summary statistics for the simulated data is called \realemph{the reference table}.
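In standard notation (prior $\pi$, likelihood $f$, summary statistics $\eta$, distance $\rho$, tolerance $\varepsilon$; the symbols are illustrative, not fixed by this poster), rejection ABC targets
\[
\pi_{\varepsilon}\big(\theta \,\big|\, \eta(y_{\mathrm{obs}})\big) \;\propto\; \int \pi(\theta)\, f(y \mid \theta)\, \mathbf{1}\big\{\rho\big(\eta(y),\eta(y_{\mathrm{obs}})\big) \le \varepsilon\big\}\, \mathrm{d}y .
\]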
\end{block}
\begin{block}{ABC posterior methodologies}
\begin{description}
\item[\color{darkred}\sffamily \textbf{Model choice:}] Simulate data for several models and \emph{choose the model that best fits the observed data}
\item[\color{darkred}\sffamily \textbf{Parameter estimation:}] Simulate data for one model and \emph{infer one or several parameters for this model given the observed data}
\end{description}
A sensible workflow is to first choose a model and then infer its parameters.
\begin{figure}
\centering
\includesvg[width=35cm]{methodologies}
\caption{ABC workflow with AbcRanger}
\end{figure}
\end{block}
\begin{alertblock}{Challenges of ABC}
In the context of recent advances in population genetics:
\begin{description}
\item[\color{darkyellow}\sffamily \textbf{Number of simulations:}] can exceed $100\,000$
\item[\color{darkyellow}\sffamily \textbf{Number of summary statistics:}] can range from several hundred to tens of thousands (scenarios with several populations lead to a combinatorial ``explosion''): how do we select the \emph{\color{darkred} meaningful} ones?
\end{description}
Classical methods for ABC ($k$-nn and other local methods) do not cope well with this situation.
\end{alertblock}
\begin{block}{Our solution}
\cite{pudlo2015reliable} and \cite{raynal2016abc} proposed a novel approach relying on \emph{Random Forests} to provide both model choice and parameter estimation methodologies.
\end{block}
\end{column}
\separatorcolumn
\begin{column}{\colwidth}
\begin{block}{Second building block: Random Forests}
\begin{block}{CART}
Random Forests are based on the CART (\emph{Classification and Regression Trees}) algorithm \cite{breiman:etal:1984}.
\begin{wrapfigure}{l}{0.7\textwidth}
\begin{center}
\begin{tikzpicture}[scale=1.0, every node/.style={scale=1, minimum height=0.8cm}, thick]
\useasboundingbox (-1,-1) rectangle (11.5,11);
%\draw[very thin, gray] (0,0) grid[step=1] (10,10);
\coordinate (c1) at (0,0);
\coordinate (c2) at (0,10);
\coordinate (c3) at (10,10);
\coordinate (c4) at (10,0);
\coordinate (X1) at (5,0);
\coordinate (X2) at (0,5);
\coordinate (cut1) at ($ (c1) + (0,4) $);
\coordinate (cut1_bis) at ($ (c4) + (0,4) $);
\coordinate (cut2) at ($ (c1) + (6,0) $);
\coordinate (cut2_bis) at ($ (cut1) + (6,0) $);
\coordinate (cut3) at ($ (cut1) + (3,0) $);
\coordinate (cut3_bis) at ($ (c2) + (3,0) $);
\coordinate (cut4) at ($ (cut3) + (0,4) $);
\coordinate (cut4_bis) at ($ (cut1_bis) + (0,4) $);
\coordinate (ret1) at (3,2);
\coordinate (ret2) at (8,2);
\coordinate (ret3) at (1.5,7);
\coordinate (ret4) at (6.5,6);
\coordinate (ret5) at (6.5,9);
\coordinate (s1) at (0,4);
\coordinate (s2) at (6,0);
\coordinate (s3) at (3,0);
\coordinate (s4) at (0,8);
\draw (c1) -- (c2) -- (c3) -- (c4) -- (c1);
\draw [color=lightblue,line width=1mm] (cut1) -- (cut1_bis);
\draw [color=darkyellow,line width=1mm] (cut2) -- (cut2_bis);
\draw [color=green!80!black,line width=1mm] (cut3) -- (cut3_bis);
\draw [color=red,line width=1mm] (cut4) -- (cut4_bis);
\draw (ret1) node {$\hat{y}_1$};
\draw (ret2) node {$\hat{y}_2$};
\draw (ret3) node {$\hat{y}_3$};
\draw (ret4) node {$\hat{y}_4$};
\draw (ret5) node {$\hat{y}_5$};
\draw [color=lightblue] (s1) node[left] {$s_1$};
\draw [color=darkyellow] (s2) node[below] {$s_2$};
\draw [color=green!80!black] (s3) node[below] {$s_3$};
\draw [color=red] (s4) node[left] {$s_4$};
\draw (s1) -- ++(-0.15,0);
\draw (s2) -- ++(0,-0.15);
\draw (s3) -- ++(0,-0.15);
\draw (s4) -- ++(-0.15,0);
\draw (X1) node[below, minimum height = 2.5cm] {$X_1$};
\draw (X2) node[left, minimum width = 2.5cm] {$X_2$};
\end{tikzpicture}%
\begin{tikzpicture}[scale=1.5,
edge from parent/.style={draw, thick, edge from parent path=
{(\tikzparentnode.south)--+(0,-1.5pt)-| (\tikzchildnode)}},
noeud/.style ={minimum size=1.2em, inner sep=4pt},
level 1/.style={sibling distance=3cm},
level 2/.style={sibling distance=2cm}]
\tikzset{level distance=65pt}
\node [noeud,color=lightblue] {$X_2 \leq s_1$}
child {node [noeud,color=darkyellow] {$X_1 \leq s_2$}
child {node [noeud] {$\hat{y}_1$} }
child {node [noeud] {$\hat{y}_2$} }
}
child {node [noeud,color=green!80!black] {$X_1 \leq s_3$}
child {node [noeud] {$\hat{y}_3$} }
child {node [noeud,color=red] {$X_2 \leq s_4$}
child {node [noeud] {$\hat{y}_4$} }
child {node [noeud] {$\hat{y}_5$} }
}
};
\end{tikzpicture}
\end{center}
\caption{An example of CART and the associated partition of the two-dimensional predictor space. Each splitting condition takes the form $X_j \leq s$ and the prediction at a leaf is denoted $\hat{y}_\ell$.}
\label{chap2:fig:CART-tree}
\end{wrapfigure}
A CART is a \emph{machine learning algorithm} that iteratively partitions the predictor space into disjoint subspaces; each subspace is assigned a prediction value, used for any test data falling into it.\\
Once the partitioning is done, we have a binary tree structure that predicts outcomes (classes or continuous values) for new input data.
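For regression, for instance, the split variable $X_j$ and threshold $s$ at each node are classically chosen to minimize the within-leaf squared error (the standard CART criterion, stated here for illustration in our own notation):
\[
\min_{j,\,s}\; \sum_{i\,:\,x_i \in R_1(j,s)} \big(y_i - \bar{y}_{R_1}\big)^2 + \sum_{i\,:\,x_i \in R_2(j,s)} \big(y_i - \bar{y}_{R_2}\big)^2,
\]
where $R_1(j,s)=\{x : X_j \le s\}$, $R_2(j,s)=\{x : X_j > s\}$ and $\bar{y}_R$ is the mean response in region $R$.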
\end{block}
\begin{block}{Random Forests}
\begin{figure}[ht]
\centering
\includesvg[width=35cm]{RF}
\caption{Random Forest}
\end{figure}
\begin{multicols}{2}
\textbf{Random Forests \cite{breiman:2001} are a three-pronged extension of CART:}
\begin{itemize}
\item[\textbf{Ensemble method}] Train a \emph{set} of CARTs (not just one) and take the majority vote (resp.\ the mean) for classification (resp.\ regression); see the aggregation formula below
\item[\textbf{Bootstrapping}] Training data is randomly sampled (with replacement) for \emph{each} tree
\item[\textbf{Feature bagging}] At each node of a growing tree, find the best split on a random subset of the features
\end{itemize}
\columnbreak
\centering
\textbf{Advantages in an ABC setting:}
\justify
\begin{itemize}
\item robust to noise
\item (almost) free variable importance
\item free (out-of-bag) cross-validation procedure
\item easy parallelization
\item good scaling properties (samples and features)
\item classifier and regressor (both are used)
\end{itemize}
\end{multicols}
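In standard notation (ours, for illustration): with $B$ trees $T_1,\dots,T_B$, the forest predicts
\[
\hat{y}(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x) \quad \text{(regression)}, \qquad
\hat{y}(x) = \operatorname*{arg\,max}_{c}\, \#\{\, b : T_b(x) = c \,\} \quad \text{(classification)}.
\]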
\end{block}
\end{block}
\begin{alertblock}{Computational challenges with ABC/Random Forests}
With about $100\,000$ rows and more than $10\,000$ summary statistics, a single tree can exceed one gigabyte of memory. Since 500 or 1\,000 trees are typically needed for good predictive performance, a full in-memory forest would require hundreds of gigabytes; even with state-of-the-art RF packages like \cite{wright2015ranger}, such memory constraints prevent the training from completing.
\end{alertblock}
\begin{block}{A new implementation of Random Forest for ABC}
Since ABC procedures only apply trained Random Forests to a known set of observations, we have altered the random forest training so that only a window of trees is kept in memory at a time, accumulating the required outcomes (predictions and statistics) before each tree is discarded. The memory footprint is vastly improved, at no performance cost.
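As a back-of-the-envelope illustration (notation ours): with $B$ trees of size at most $M$ bytes and a window of $w$ trees kept in memory, the footprint drops from $O(B \cdot M)$ to $O(w \cdot M)$; for $B = 500$, $M \approx 1$\,GB and a small window, this means a few gigabytes instead of roughly $500$\,GB.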
\begin{figure}[ht]
\centering
\includesvg[width=35cm]{RF-optim}
\caption{Window of growing trees}
\end{figure}
\justifying The ongoing project \realemph{LeafLitter} intends to pursue this line even further: for a growing tree, only the leaves encountered so far are stored. The memory footprint of the trees thus becomes negligible, and their growth can finally be parallelized at full scale.
\end{block}
\end{column}
\separatorcolumn
\end{columns}
\begin{block}{\justifying References}
\nocite{*}
\footnotesize{\bibliographystyle{unsrt}\bibliography{poster}}
\end{block}
\end{frame}
\end{document}