---
layout: reveal
title: Git Yo'self Some Data
author: ODI Tech Team
description: How we can (and can't) use open source tooling for open data
author: James Smith
twitter: floppy
---
{% include odi_logo.html %}
{% include standard_title.html %}
<section>
<h2>WTF is the<br/>Open Data Institute?</h2>
<ul class="centred-list">
<li>non-profit, non-partisan</li>
<li>founded 2012 by Tim Berners-Lee and Nigel Shadbolt</li>
<li>"helping others be successful with open data"</li>
<li>economic, social and environmental value</li>
</ul>
<aside class="notes"></aside>
</section>
<section>
<h2>WTF is<br/>open data?</h2>
<aside class="notes"></aside>
</section>
<section>
<aside class="notes">the simplest definition of open data is this one...</aside>
<p>
<blockquote>Open data is information that is available for anyone to use, for any purpose,<br/>at no cost.</blockquote>
— <a href="http://theodi.org/guide/what-open-data">http://theodi.org/guide/what-open-data</a>
</p>
</section>
<section>
<aside class="notes"></aside>
<ul class="centred-list">
<li><strong>open data</strong><br/>must have a <em>licence</em> to say it is open</li>
<li><strong>the licence</strong><br/>may impose some constraints:<br/><em>attribution</em> and/or <em>share-alike</em></li>
</ul>
</section>
<section>
<aside class="notes"></aside>
<p>
<blockquote>A piece of data or content is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.</blockquote>
— <a href="http://opendefinition.org/">http://opendefinition.org/</a>
</p>
</section>
<section>
<h2>So What?</h2>
<aside class="notes">why should you care? 1) because open data frees you up to build cool things without having to pay for the data or collect it yourself; 2) because your clients/customers will probably start to care.</aside>
</section>
<section>
<p><img src='2013-09-bristol-brug-open-dev-open-data/OSMinecraft.png'></p>
<small><a href='http://www.ordnancesurvey.co.uk/innovate/developers/minecraft-map-britain.html'>http://www.ordnancesurvey.co.uk/innovate/developers/minecraft-map-britain.html</a></small>
<aside class="notes">so thinking about some examples of interesting data sets and how these have been used...</aside>
</section>
<section>
<aside class="notes"></aside>
<p><img src='2013-09-bristol-brug-open-dev-open-data/prescribinganalytics.png'></p>
<small><a href='http://prescribinganalytics.com/'>http://prescribinganalytics.com/</a></small>
</section>
<section>
<aside class="notes"></aside>
<p><img src='2013-09-bristol-brug-open-dev-open-data/smtm.png'></p>
<small><a href='http://smtm.labs.theodi.org/'>http://smtm.labs.theodi.org/</a></small>
</section>
<section>
<aside class="notes"></aside>
<p><img src='2013-09-bristol-brug-open-dev-open-data/telefonicafootfall.png'></p>
<small><a href='http://dynamicinsights.telefonica.com/488/smart-steps'>http://dynamicinsights.telefonica.com/488/smart-steps</a></small>
</section>
<section>
<aside class="notes"></aside>
<h2>Data you can get<br/>!= Open Data</h2>
<ul>
<li>Twitter Firehose</li>
<li>Google Maps</li>
<li>... and most others</li>
</ul>
</section>
<section>
<h2>Good Open Data</h2>
<ul class="centred-list">
<li><strong>can be linked to</strong><br/>so that it can be easily shared and talked about</li>
<li><strong>is available in a standard, structured format</strong><br/>so that it can be easily processed</li>
<li><strong>has guaranteed availability and consistency over time</strong><br/>so that others can rely on it</li>
<li><strong>is traceable, through any processing</strong><br/>so others can work out whether to trust it</li>
</ul>
<aside class="notes"></aside>
</section>
<section>
<h2>Open Data Certificates</h2>
<aside class="notes"></aside>
</section>
<section>
<p><img src="2013-09-bristol-brug-open-dev-open-data/certificates.png"></p>
<small><a href="https://certificates.theodi.org/">https://certificates.theodi.org/</a></small>
<aside class="notes"></aside>
</section>
<section id='shared-resources'>
<h2>Open Data enables...</h2>
<ul>
<li>cooperation</li>
<li>collaboration</li>
<li>building shared resources</li>
<li>public goods</li>
</ul>
<h3>...or at least the <em>idea</em> does</h3>
</section>
<section id="open-data-management">
<h2>Data Collaboration</h2>
<aside class="notes">
Consider the world of Open Data, where we have a load of data, but very little collaboration.
Most data is dropped into central datastores, and that's it.
If I use your dataset and find an error, the only way to get it fixed is to tell you about it,
and hope you can be bothered to sort it out.
</aside>
<img src='open-data-flow/collaboration.jpg'><br/>
<small>Image from <a href='http://blog.mindjet.com/2013/05/collaboration-mistakes-and-how-to-avoid-them/'>MindJet</a></small>
</section>
<section id="open-source">
<aside class="notes">
collaboration; reuse; serendipity!
</aside>
<h1>I ♥ Open Source!</h1>
</section>
<section id="git-and-github">
<aside class="notes">
Git and GitHub. How GitHub has revolutionised the process of contributing to OSS projects.
If you have some code, I can fork it, make my own changes, then hit a single button to merge those
changes back upstream. This makes contribution incredibly simple, so that the admin overhead of doing
this becomes almost zero. This allows projects to draw on a wider pool of contributors than otherwise
would have been available.
</aside>
<img src='open-data-flow/github.png' height='500'><br/>
<small><a href='http://octodex.github.com/notocat/'>Not Octocat</a> by <a href='https://github.com/cameronmcefee'>Cameron McEfee</a></small>
</section>
<section id="github-flow">
<aside class="notes">
GitHub use a collaboration process they call 'GitHub Flow' (not to be confused with 'Git flow', which is
more complex).
Wouldn't it be great if we could do this with open data?
</aside>
<h3>GitHub Flow</h3>
<img src='open-data-flow/github_flow.png' height='400'><br/>
<small>from <a href='http://zachholman.com/talk/how-github-uses-github-to-build-github/'>How GitHub uses GitHub to build GitHub</a> by <a href='http://github.com/holman'>Zach Holman</a></small>
</section>
<section id="sourceforge">
<aside class="notes">
Unfortunately, compared to the open source world, open data is still at a pre-SourceForge level of
collaboration. We can do better.
</aside>
<img src='git-some-data/sourceforge.png' height='500'><br/>
<small><a href='http://sourceforge.net'>SourceForge.net</a> (in 2000)</small>
</section>
<section id="github-for-x">
<aside class="notes">
This often gets referred to as 'git for data', though in my view git is unimportant. It's all about flow.
GitHub's revolution was not that they used git - it's that they built powerful, simple workflow tools on
top of it.
This is not a new idea; people have been talking about it for years. The catch is that git, and other systems
designed for source code, have a few problems when it comes to handling data.
I think we can get a long way with existing tools, however. If we can bend git to our will,
and use it to work with simple data in useful ways, then we can get this revolution started.
</aside>
<h2>GitHub ALL the things!</h2>
<pre><code>
> %w{teachers accountants governments dogs cats hamsters DATA}.each do |x|
> puts "GitHub for #{x}!"
> end
GitHub for teachers!
GitHub for accountants!
GitHub for governments!
GitHub for dogs!
GitHub for cats!
GitHub for hamsters!
GitHub for DATA!
</code></pre>
</section>
<section id="test-data">
<aside class="notes">
</aside>
<img src='git-some-data/test_data.png'/>
</section>
<section id="test-data-edited">
<aside class="notes">
</aside>
<img src='git-some-data/test_data_edited.png'/>
</section>
<section id="git-is-line-oriented">
<aside class="notes">
So we know what we're up against, let's look at some problems with git when working with things like CSVs.
First, it's line-oriented, built for source code. This is OK when adding a row, or changing a few cells,
but add a column and suddenly you have a change on every line.
Let's look at a bit of test data. This is a small CSV file, and I've made some changes. First thing that you can
see is that I've obviously added a column. However, it's hard to see if anything else has changed, because the diff is
utterly useless.
</aside>
<img src='open-data-flow/naive_diff.png' height='500px'/>
<h1>er...</h1>
</section>
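<section id="line-diff-sketch">
<aside class="notes">
A minimal sketch of the problem described above, not the actual test file from the slides: the rows, the column
names and the helper changed_lines are all made up for illustration, and Python's difflib stands in for git's
line-oriented diff. Adding one row touches one line; adding one column touches every line, which hides the single
cell that was also edited.
</aside>

```python
import difflib

# Hypothetical test data: the same small CSV before and after an edit.
before = [
    "id,name,city",
    "1,Alice,London",
    "2,Bob,Bristol",
    "3,Carol,Leeds",
]
# One column ("age") added, plus one cell changed (Bob's city).
after = [
    "id,name,age,city",
    "1,Alice,34,London",
    "2,Bob,27,Bath",
    "3,Carol,41,Leeds",
]


def changed_lines(old, new):
    """Count the lines a line-oriented diff reports as removed and added."""
    diff = list(difflib.unified_diff(old, new, lineterm="", n=0))
    # Skip the "---" / "+++" file header lines, count only content lines.
    removed = sum(1 for line in diff
                  if line.startswith("-") and not line.startswith("---"))
    added = sum(1 for line in diff
                if line.startswith("+") and not line.startswith("+++"))
    return removed, added


removed, added = changed_lines(before, after)
print(f"{removed} of {len(before)} lines removed, {added} of {len(after)} lines added")
```

</section>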
<section id="git-cli">
<aside class="notes">
There are some things we can improve here though. First thing to realise is that we don't really care what git does
internally; how it stores our changes, and so on. As long as we can see what's going on, git can do what it wants inside.
That means that this is a tooling problem, so we can tinker with the tooling around the edges to try to fix our problems.
Let's start with the git tool that *everyone* has; the command line.
</aside>
<h2>Git CLI</h2>
<ul>
<li><code>git diff --word-diff</code></li>
<li>~/.config/git/attributes
<pre><code>*.csv diff=csv</code></pre>
</li>
<li>~/.gitconfig
<pre><code>[color]
ui = true
[alias]
diffcsv = diff --word-diff
[diff "csv"]
wordRegex = ...?</code></pre>
</li>
</ul>
</section>
<section id="word-diff">
<img src='open-data-flow/word_regex_dot.png'/>
<p>
<code>wordRegex=.</code>
</p>
</section>
<section id="csv-diff">
<img src='open-data-flow/csv_diff.png'/>
<p>
<code>wordRegex=[^,\n]+[,\n]|[,]</code>
</p>
</section>
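<section id="word-regex-sketch">
<aside class="notes">
A rough illustration of why the second wordRegex works better than the first. It shows how each pattern cuts a
CSV row into the 'words' that a word diff compares. Git uses POSIX regular expressions, so Python's re module is
only an approximation here, and the sample row and the helper name words are assumptions made for this sketch.
</aside>

```python
import re

# The two patterns from the slides. Python's re module is used only as an
# approximation of the POSIX regular expressions that git itself applies.
PER_CHARACTER = r"."
PER_CELL = r"[^,\n]+[,\n]|[,]"


def words(pattern, text):
    """Split text into the non-overlapping matches a word diff would compare."""
    return re.findall(pattern, text)


# Hypothetical row with one empty cell.
row = "1,Alice,,London\n"

# Every character becomes its own word, so the diff is very noisy.
print(words(PER_CHARACTER, row))

# Each cell, together with its trailing delimiter, becomes one word;
# the empty cell is represented by its lone comma.
print(words(PER_CELL, row))
```

</section>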
<section id="csv-my-git">
<h2>csv-my-git</h2>
<p>
Automatically configure your local git installation for CSV
</p>
<pre><code>curl -L http://theodi.github.io/csv-my-git/install.sh | bash
git diffcsv test.csv</code></pre>
<p>
<a href='https://github.com/theodi/csv-my-git'>https://github.com/theodi/csv-my-git</a>
</p>
</section>
<section id="gitlab">
<aside class="notes">
This is all very well, but what makes GitHub Flow really usable is... GitHub. How can we get CSV diffs
into GitHub? Unfortunately, their core display code isn't open source, but we have the next best thing: GitLab.
</aside>
<h2>GitLab</h2>
<p>
Open Source GitHub-alike
<br/>
<img src='open-data-flow/gitlab_logo.png'/>
</p><p>
<a href='http://gitlab.org/'>http://gitlab.org/</a>
</p>
</section>
<section id="gitlab-views">
<h2>File & diff views</h2>
</section>
<section id="coophx">
<aside class="notes">
So, all we need to do is change the views for files and diffs to add CSV support. Files are pretty easy, but diffs are harder,
mainly because just working out what has changed between two versions of a table is non-trivial. Coopyhx to the rescue.
</aside>
<img src='open-data-flow/coopyhx.png' height='500px'/>
<br/>
<small><a href='http://paulfitz.github.io/coopyhx/'>http://paulfitz.github.io/coopyhx/</a></small>
</section>
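<section id="cell-diff-sketch">
<aside class="notes">
To give a feel for what a table-aware diff has to work out, here is a deliberately naive sketch. It is not the
algorithm coopyhx uses: it assumes every row has a unique key column (called id here), ignores added, removed and
reordered rows, and the sample data and function names are invented for illustration. Even so, it separates the
added column from the one cell that really changed.
</aside>

```python
import csv
import io


def read_table(text, key):
    """Parse CSV text into (column names, rows indexed by the key column)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = {row[key]: row for row in reader}
    return reader.fieldnames, rows


def cell_changes(old_text, new_text, key="id"):
    """Return (added columns, changed cells) between two CSV strings.

    Naive illustration only: rows are matched on a key column, and
    rows present in just one of the two tables are ignored.
    """
    old_cols, old_rows = read_table(old_text, key)
    new_cols, new_rows = read_table(new_text, key)
    added_cols = [col for col in new_cols if col not in old_cols]
    changed = []
    for row_key, new_row in new_rows.items():
        old_row = old_rows.get(row_key)
        if old_row is None:
            continue  # added rows are out of scope for this sketch
        for col in old_cols:
            if col in new_row and old_row[col] != new_row[col]:
                changed.append((row_key, col, old_row[col], new_row[col]))
    return added_cols, changed


# Hypothetical data: one column added, one cell edited.
old = "id,name,city\n1,Alice,London\n2,Bob,Bristol\n3,Carol,Leeds\n"
new = "id,name,age,city\n1,Alice,34,London\n2,Bob,27,Bath\n3,Carol,41,Leeds\n"

print(cell_changes(old, new))
```

</section>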
<section id="gitlab-diffs">
<aside class="notes">
With the coopyhx JavaScript library doing all the hard work, adding diff rendering is actually really easy.
</aside>
<img src='open-data-flow/gitlab_diff.png' height='500px'/>
</section>
<section id="github-csvs">
<aside class="notes">
We published this on our blog a few weeks ago, and a few days later, GitHub announced CSV support in their web interface!
</aside>
<h2>GitHub</h2>
<img src='open-data-flow/github_csv.png'>
</section>
<section id="github-csv-filtering">
<img src='open-data-flow/github_csv_filter.png'>
</section>
<section id="github-csv-diffs">
<img src='open-data-flow/github_csv_diff.png'>
</section>
<section id="winning">
<h1>Winning!</h1>
</section>
<section id="standards">
<h1>Standards</h1>
<h3>(de facto or otherwise)</h3>
</section>
<section id="dataprotocols">
<img src='git-some-data/dataprotocols.png'>
<br/>
<a href='http://dataprotocols.org'>http://dataprotocols.org</a>
</section>
<section id="git-viewer">
<img src='open-data-flow/git_viewer.png'>
</section>
<section id="data-kitten">
<h2>Data Kitten</h2>
<img src='git-some-data/data_kitten.gif'>
<br/>
<small><a href='https://github.com/theodi/data_kitten'>https://github.com/theodi/data_kitten</a></small>
</section>
<section id="data-ecosystem">
<h2>Data Ecosystem</h2>
<ul>
<li>Dependency Tracking</li>
<li>Validation & Testing</li>
<li>Quality Metrics</li>
<li>Visualisation</li>
<li>Conversion & Decoration</li>
</ul>
</section>
<section id="crowdsourcing">
<h2>Crowdsourcing!</h2>
<img src='git-some-data/github_forms.png' style='height: 400px'/>
<br/>
<a href='https://github.com/benbalter/github-forms'>https://github.com/benbalter/github-forms</a>
</section>
<section id="datasets">
<img src='git-some-data/datasets.png'>
<br/>
<a href='https://github.com/datasets'>https://github.com/datasets</a>
</section>
<section id="chicago">
<img src='git-some-data/chicago.png'>
<br/>
<a href='https://github.com/Chicago'>https://github.com/Chicago</a>
</section>
<section id="san-francisco">
<img src='git-some-data/san_francisco.png'>
<br/>
<a href='http://sfmoci.github.io/openlaw/'>http://sfmoci.github.io/openlaw/</a>
</section>
<section id="zomg">
<h1>ZOMG GIT FIXES EVERYTHING</h1>
</section>
<section id="limitations">
<h1>Limitations</h1>
<p>
Adding a large file (50m lines)
</p>
<h2>13m 40s</h2>
<p>
Changing a single line
</p>
<h2>8m 30s</h2>
<small>figures by <a href='http://maxogden.github.io/slides/okcon'>Max Ogden</a></small>
</section>
<section id="dat">
<h1>Dat</h1>
<pre><code>
# make a new dat store
dat init
# put a JSON object into dat
echo '{"hello": "world"}' | dat
# stream the most recent of all rows
dat cat
# pipe dat into itself (increments revisions)
dat cat | dat
# start a dat server
dat serve
# delete the dat folder (removes all data + history)
rm -rf .dat
</code></pre>
<a href='https://github.com/maxogden/dat'>https://github.com/maxogden/dat</a>
</section>
<section id="rawbase">
<h2>R&Wbase</h2>
<img src='git-some-data/rawbase.png' style='height: 350px'/><br/>
<a href='http://rawbase.github.io/'>http://rawbase.github.io/</a>
</section>
<section id="issues">
<h2>Where Next?</h2>
<ul>
<li>Server-side diff calculation</li>
<li>Merging</li>
<li>Conflict resolution</li>
<li>CSV dialect support</li>
<li>More tools!</li>
</ul>
</section>
<section id="contribute">
<h2>Contribute!</h2>
<ul>
<li>ODI blog post:
<ul>
<li><a href='http://theodi.org/blog/adapting-git-simple-data'>http://theodi.org/blog/adapting-git-simple-data</a></li>
</ul>
</li>
<li>Gitlab fork:
<ul>
<li><a href='http://github.com/theodi/gitlabhq'>http://github.com/theodi/gitlabhq</a></li>
</ul>
</li>
<li>Git CLI configurator:
<ul>
<li><a href='http://github.com/theodi/csv-my-git'>http://github.com/theodi/csv-my-git</a></li>
</ul>
</li>
</ul>
</section>
{% include odi_tech_team.html %}