<!DOCTYPE html>
<html><head><title>niplav</title>
<link href="./favicon.png" rel="shortcut icon" type="image/png"/>
<link href="main.css" rel="stylesheet" type="text/css"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<style type="text/css">
code.has-jax {font: inherit; font-size: 100%; background: inherit; border: inherit;}
</style>
<script async="" src="./mathjax/latest.js?config=TeX-MML-AM_CHTML" type="text/javascript">
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true,
skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
},
"HTML-CSS": { availableFonts: ["TeX"] }
});
</script>
<script>
document.addEventListener('DOMContentLoaded', function () {
// Change the title to the h1 header
var title = document.querySelector('h1')
if(title) {
var title_elem = document.querySelector('title')
title_elem.textContent=title.textContent + " – niplav"
}
});
</script>
</head><body><h2 id="home"><a href="./index.html">home</a></h2>
<p><em>author: niplav, created: 2022-07-07, modified: 2024-07-08, language: english, status: maintenance, importance: 2, confidence: log</em></p>
<blockquote>
<p><strong>Notes for myself on the data I track, how to transform it into a
usable shape, data quality and other random assortments.</strong></p>
</blockquote><div class="toc"><div class="toc-title">Contents</div><ul><li><a href="#Anki">Anki</a><ul></ul></li><li><a href="#Meditation">Meditation</a><ul></ul></li><li><a href="#Daygame">Daygame</a><ul></ul></li><li><a href="#Fitbit_Biometrics">Fitbit Biometrics</a><ul></ul></li><li><a href="#Others">Others</a><ul><li><a href="#Masturbation">Masturbation</a><ul></ul></li><li><a href="#Mood">Mood</a><ul></ul></li><li><a href="#Substances">Substances</a><ul></ul></li><li><a href="#Weight">Weight</a><ul></ul></li><li><a href="#Daily_Performance_Metrics">Daily Performance Metrics</a><ul></ul></li><li><a href="#Bag_Spreading">Bag Spreading</a><ul></ul></li><li><a href="#Phone_Data">Phone Data</a><ul></ul></li><li><a href="#Forecasting_Performance">Forecasting Performance</a><ul></ul></li></ul></li></ul></div>
<h1 id="Types__Methods_of_Data_Collection_I_Use"><a class="hanchor" href="#Types__Methods_of_Data_Collection_I_Use">Types & Methods of Data Collection I Use</a></h1>
<p>I've always collected some data about myself and the world around me,
but never used or analyzed it, because of a chronic "I'll get around to it
<em>eventually</em>" syndrome. Which is a shame, because it means I've been
putting in a reasonably large amount of effort and have nothing to show
for it, a one-legged stool:</p>
<blockquote>
<p>The QS cycle is straightforward and flexible: </p>
<ol>
<li>Have an idea<br/></li>
<li>Gather data<br/></li>
<li>Test the data<br/></li>
<li>Make a change; GOTO 1</li>
</ol>
<p>Any of these steps can overlap: you may be collecting sleep data long
before you have the idea (in the expectation that you will have an idea),
or you may be making the change as part of the data in an experimental
design, or you may inadvertently engage in a “natural experiment”
before wondering what the effects were (perhaps the baby wakes you up
on random nights and lets you infer the costs of poor sleep). </p>
<p>The point is not publishable scientific rigor. If you are the sort of
person who wants to run such rigorous self-experiments, fantastic! The
point is making your life better, for which scientific certainty is not
necessary: imagine you are choosing between equally priced sleep pills
of equal safety; the first sleep pill will make you go to sleep faster
by 1 minute and has been validated in countless scientific trials, and
while the second sleep pill has in the past week ended the sweaty
nightmares that have plagued you every few days since childhood but alas
has only a few small trials in its favor—which would you choose? I
would choose the second pill! […]</p>
<p>One failure mode which is particularly dangerous for QSers is
to overdo the data collection and collect masses of data they
never use. Famous computer entrepreneur & mathematician <a href="https://en.wikipedia.org/wiki/Stephen_Wolfram">Stephen
Wolfram</a>
exemplified this for me in March 2012 with his
lengthy blog post <a href="https://writings.stephenwolfram.com/2012/03/the-personal-analytics-of-my-life/">“The Personal Analytics of My
Life”</a>
in which he did some impressive graphing and exploration
of data from 1989 to 2012: a third of a million (!) emails,
full keyboard logging, calendar, phone call logs (with missed
calls included), a pedometer, revision history of his tome <a href="https://www.amazon.com/New-Kind-Science-Stephen-Wolfram/dp/1579550088/?tag=gwernnet-20">A New Kind of
Science</a>,
file types accessed per date, parsing scanned documents for dates,
a treadmill, and perhaps more he didn’t mention. […]</p>
<p>One thinks of <a href="https://deming.org/index.cfm?content=653">a saying</a> of
<a href="https://en.wikipedia.org/wiki/W._Edwards_Deming">W. Edwards Deming</a>:
“Experience by itself teaches nothing.” Indeed. A QS experiment is a
4-legged beast: if any leg is far too short or far too long, it can’t
carry our burdens.</p>
</blockquote>
<p><em>—<a href="https://gwern.net">Gwern</a>, <a href="https://www.gwern.net/Zeo">“Zeo sleep self-experiments”</a>, 2018</em></p>
<p>At least I now know that I'm falling into this trap: "Selbsterkenntnis
ist der erste Schritt zur Besserung" (self-knowledge is the first step
towards improvement). And the second step is to bring all of your data
into a usable format.</p>
<h2 id="Anki"><a class="hanchor" href="#Anki">Anki</a></h2>
<p>I use spaced repetition, and plan to take it as a proxy for cognitive
performance in QS experiments.</p>
<p>The data can be found in the helpfully named <code>collection.anki2</code>, which
is actually an <a href="https://en.wikipedia.org/wiki/SQLite">SQLite</a> database
in disguise.</p>
<p>The <a href="https://docs.ankiweb.net/stats.html#manual-analysis">Anki manual</a>
helpfully informs us that the most important table is <code>revlog</code>; one can
then export the data to CSV with the following command:</p>
<pre><code>echo -e '.headers on \n select * from revlog;' |
sqlite3 anki_2022-07-04T08:43:00.db |
tr '|' ',' >anki_2022-07-04T08:43:00.csv
</code></pre>
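<p><code>sqlite3</code> also has a built-in CSV mode that quotes fields properly,
which the <code>tr</code>-based approach above does not; something like this should
work as well (a sketch, not what I originally used):</p>
<pre><code>sqlite3 -csv -header anki_2022-07-04T08:43:00.db 'select * from revlog;' >anki_2022-07-04T08:43:00.csv
</code></pre>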
<p>The header is to be interpreted as follows:</p>
<blockquote>
<p>The most important table for statistics is the revlog table, which
stores an entry for each review that you conduct. The columns are
as follows:</p>
<p>id </p>
<p>The time at which the review was conducted, as the number of milliseconds
that had passed since midnight UTC on January 1, 1970. (This is sometimes
known as Unix epoch time, especially when in straight seconds instead
of milliseconds.)</p>
<p>cid </p>
<p>The ID of the card that was reviewed. You can look up this value in
the id field of the cards table to get more information about the card,
although note that the card could have changed between when the revlog
entry was recorded and when you are looking it up. It is also the
millisecond timestamp of the card’s creation time.</p>
<p>usn </p>
<p>This column is used to keep track of the sync state of reviews and
provides no useful information for analysis.</p>
<p>ease </p>
<p>Which button you pressed at the end of the review (1 for Again, 4 for Easy).</p>
<p>ivl </p>
<p>The new interval that the card was pushed to after the review. Positive
values are in days; negative values are in seconds (for learning cards).</p>
<p>lastIvl </p>
<p>The interval the card had before the review. Cards introduced for the
first time have a last interval equal to the Again delay.</p>
<p>factor </p>
<p>The new ease factor of the card in permille (parts per thousand). If
the ease factor is 2500, the card’s interval will be multiplied by
2.5 the next time you press Good.</p>
<p>time </p>
<p>The amount of time (in milliseconds) you spent on the question and answer
sides of the card before selecting an ease button.</p>
<p>type </p>
<p>This is 0 for learning cards, 1 for review cards, 2 for relearn cards,
and 3 for "cram" cards (cards being studied in a filtered deck when they
are not due).</p>
</blockquote>
<p><em>— Anki developers, <a href="https://docs.ankiweb.net/stats.html#manual-analysis">“Manual Analysis”</a> in <a href="https://docs.ankiweb.net/stats.html">“Graphs and Statistics”</a>, year unknown</em></p>
<p>The CSV of the data can be found <a href="./data/anki_reviews.csv">here</a>.</p>
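<p>As a quick sanity check on the export (a sketch, assuming GNU awk for
<code>strftime</code> and the column order documented above), reviews per day and
mean answer time in seconds:</p>
<pre><code>awk -F, 'NR > 1 {
    day = strftime("%Y-%m-%d", int($1 / 1000))  # id is a millisecond Unix timestamp
    n[day]++
    t[day] += $8                                # time spent, also in milliseconds
}
END {
    for (d in n)
        printf "%s,%d,%.1f\n", d, n[d], t[d] / n[d] / 1000
}' anki_2022-07-04T08:43:00.csv | sort
</code></pre>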
<h2 id="Meditation"><a class="hanchor" href="#Meditation">Meditation</a></h2>
<p>Similarly, one can export meditation data from Medativo (if one has
coughed up 5€ for the premium version, which I decided was worth it
for the data, after having locked myself in :-|):</p>
<pre><code>echo -e '.headers on \n select * from History;' |
sqlite3 meditation_2022-07-02T20:00:00.db |
tr '|' ',' >meditations.csv
</code></pre>
<p>The names for the columns are exceedingly obvious and need no further
explanation.</p>
<p>I didn't rate my sessions in the beginning (and manually inserted
data from meditation retreats with unrated sessions), which left those
sessions with a very optimistic default rating of 4.0 for both mindfulness
and "concentration" (better called absorption, I claim). So we remove
those default ratings, using the <a href="https://plan9.io/sys/doc/sam/sam.html">sam</a> language in
<a href="https://github.com/martanne/vis">vis</a>:</p>
<pre><code>,/^1,/;/^860,/
x/4\.0,4\.0,/c/,/
,/^1210,/;/^1308,/
x/4\.0,4\.0,/c/,/
,/^1594,/;/^1615,/
x/4\.0,4\.0,/c/,/
</code></pre>
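<p>A quick check that the edit caught everything (any remaining matches are
either sessions I genuinely rated 4.0 on both scales, or defaults I missed):</p>
<pre><code>grep -c '4\.0,4\.0,' meditations.csv
</code></pre>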
<p>The CSV of the meditation data can be found <a href="./data/meditations.csv">here</a>.</p>
<p><code>mindfulness_ranking</code> and <code>concentration_ranking</code> are both subjective
impressions directly after meditation, where "mindfulness" describes the
degree of sensory clarity, and "concentration" (better called "absorption"
or "rest") describes my ability to rest on a specific sensory object.</p>
<h2 id="Daygame"><a class="hanchor" href="#Daygame">Daygame</a></h2>
<!--TODO: Clean up as per https://claude.ai/chat/f2735ad5-dfd6-4d0c-aaaa-6e1a43f96498-->
<p>Sanitizing the sessions file by converting the datetime to
<a href="https://en.wikipedia.org/wiki/ISO-8601">ISO-8601</a> using <a href="./doc/cs/structural_regular_expressions_pike_1990.pdf" title="Structural Regular Expressions">structural
regular
expressions</a>,
and some other minor fixes:</p>
<pre><code>,x/([0-9]+)\/([0-9]+)\/([0-9]+) /c/\3-\1-\2T/
,x/; /c/;/
,x/(T[0-9]+:[0-9]+),/c/\1:00,/
,x/-([0-9])-/c/-0\1-/
</code></pre>
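<p>For the record, the same normalisation as a <code>sed</code> sketch (assuming GNU
sed and the same M/D/Y input format; the sam version above is what I actually
ran):</p>
<pre><code>sed -E -e 's|([0-9]+)/([0-9]+)/([0-9]+) |\3-\1-\2T|g' \
       -e 's|; |;|g' \
       -e 's|(T[0-9]+:[0-9]+),|\1:00,|g' \
       -e 's|-([0-9])-|-0\1-|g' daygame_sessions.csv
</code></pre>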
<p>Formatting the approaches file:</p>
<pre><code>,x/ ,/c/,/
</code></pre>
<p>Finding incorrectly written locations:</p>
<pre><code>$ awk -F, '{ print($2) }' &lt;daygame_approaches.csv | sort | uniq
</code></pre>
<p>and manually correct them (this is useful for the other fields as well,
just to check consistency).</p>
<p>Anonymizing locations and the names of the women:</p>
<pre><code>$ awk -F, 'BEGIN {
    FS = OFS = ","
    # Read the existing id,"location" mapping; strip the quotes so that
    # lookups against the (unquoted) location names below succeed.
    while ((getline &lt; "admn/daygame/locations") > 0) {
        gsub(/"/, "", $2)
        loc[$2] = $1
    }
    close("admn/daygame/locations")
}
{
    # Assign every newly seen name a random numeric pseudonym.
    if (name[$8] == "" && $8 != "Name") {
        name[$8] = int(100000 * rand())
    }
    if ($2 != "Location") {
        original_location = $2
        gsub(/"/, "", original_location)
        if (loc[original_location] == "") {
            # Unknown location: assign it a new id and persist the mapping.
            print "Warning: Location '" $2 "' not found in locations file" > "/dev/stderr"
            loc[original_location] = int(100000 * rand())
            printf "%d,\"%s\"\n", loc[original_location], original_location >> "admn/daygame/locations"
        }
        $2 = loc[original_location]
    }
    if ($8 != "Name") {
        $8 = name[$8]
    }
    print $0
}' &lt;daygame_approaches.csv >daygame_approaches_anon.csv
$ mv daygame_approaches.csv daygame_approaches_deanon.csv
$ mv daygame_approaches_anon.csv daygame_approaches.csv
</code></pre>
<p>The approaches file can be found <a href="./data/daygame_approaches.csv">here</a>,
the sessions file can be found <a href="./data/daygame_sessions.csv">here</a>.</p>
<p>Approaches file datapoints (in CSV):</p>
<ul>
<li>Approach index number</li>
<li>Datetime</li>
<li>Location</li>
<li>Blowout</li>
<li>Contact info ∈{number,instagram,facebook,skype,snapchat etc.,other}</li>
<li>Idate length (minutes)</li>
<li>Idate cost (euro)</li>
<li>Flake before 1st date (boolean)</li>
<li>Date before first sex [1..10] cost (euro)</li>
<li>Date before first sex [1..10] length (minutes)</li>
<li>Sex number of times (approximately)</li>
<li>Attractiveness (∈[1..10])</li>
</ul>
<p>Sessions file:</p>
<ul>
<li>Datetime start</li>
<li>Datetime end</li>
<li>Approaches index number range</li>
<li>Number of approaches</li>
</ul>
<h2 id="Fitbit_Biometrics"><a class="hanchor" href="#Fitbit_Biometrics">Fitbit Biometrics</a></h2>
<p>I use the <a href="https://en.wikipedia.org/wiki/List_of_Fitbit_products#Fitbit_Inspire_3">Fitbit Inspire
3</a>,
mostly to track my sleep, because Fitbit is one of the few (the only?)
companies whose products allow exporting the data; maybe I'll also
get some mileage out of the heart rate, step, glucose and temperature tracking.</p>
<p>Sleep data <a href="./data/sleep.json">here</a>.</p>
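<p>To pull the basics out of the JSON (a sketch; I'm assuming the export uses
the usual Fitbit sleep-log fields <code>dateOfSleep</code> and <code>minutesAsleep</code>,
adjust to whatever the file actually contains):</p>
<pre><code>jq -r '.[] | [.dateOfSleep, .minutesAsleep] | @csv' sleep.json
</code></pre>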
<h2 id="Others"><a class="hanchor" href="#Others">Others</a></h2>
<!--TODO: light.csv, islight.csv, ispomodoro.csv, pomodoros.csv-->
<p>Other metrics I track don't deserve as much elaboration.</p>
<h3 id="Masturbation"><a class="hanchor" href="#Masturbation">Masturbation</a></h3>
<p>I track when I masturbate & how good it feels & the type of
pornography in <a href="./data/masturbations.csv">this file</a> via <a href="./data/mstrbt">this
script</a>. Data quality is pretty high.</p>
<ul>
<li><code>t</code> stands for text</li>
<li><code>a</code> stands for audio</li>
<li><code>i</code> stands for image</li>
<li><code>v</code> stands for video</li>
</ul>
<h3 id="Mood"><a class="hanchor" href="#Mood">Mood</a></h3>
<p>I track my mood via the excellent <a href="https://play.google.com/store/apps/details?id=info.moodpatterns.moodpatterns&hl=en&gl=US">Mood
Patterns</a>,
which performs <a href="https://en.wikipedia.org/wiki/Experience_sampling_method">experience
sampling</a> and allows
swift CSV export of the data. They even changed the <em>annoying</em>
"hitting a block of wood with a hammer" notification sound to the OS
default. No post-processing needed, the data <em>is just there</em>. An app by
programmers, for programmers.</p>
<p>But there is still <em>some</em> data cleanup to do:</p>
<pre><code>sort -n mood.csv >>~/proj/site/data/mood.csv
</code></pre>
<p>Finally I rename the mood columns simply to "happy", "content", "relaxed",
and "horny".</p>
<p>CSV <a href="./data/mood.csv">here</a>. The data quality is mediocre: long
stretches of not responding to questions, increasingly conservative
(closer to 50) answers over time, and activities only tracked starting
around July 2022 (and not used for what they were intended for: if the
activity is "Nothing", it means I carried on with my day as normal
afterwards, and if the activity is "Mindfulness", it means I spent a
couple of seconds in a more mindful state). Also, I use the "interested
— uninterested" metric to track horniness (higher means hornier).</p>
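<p>To see how patchy the responses are (a sketch, assuming the first column is
a Unix timestamp, which is what the <code>sort -n</code> above suggests, and GNU awk
for <code>strftime</code>), responses per month:</p>
<pre><code>awk -F, '{ n[strftime("%Y-%m", $1)]++ }
END { for (m in n) print m "," n[m] }' mood.csv | sort
</code></pre>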
<h3 id="Substances"><a class="hanchor" href="#Substances">Substances</a></h3>
<p>I track which substances I take
(<a href="./nootropics.html">nootropics</a>/melatonin/drugs) in <a href="./data/substances.csv">this
file</a> via <a href="./data/cnsm">this script</a>. Data quality
is good, but there are fairly few entries. At the moment I am mostly using it to
perform self-blinded RCTs.</p>
<h3 id="Weight"><a class="hanchor" href="#Weight">Weight</a></h3>
<p>Tracking weight, mostly for exercise purposes. Data
<a href="./data/weights.csv">here</a>, collected with <a href="./data/weight">this script</a>.</p>
<h3 id="Daily_Performance_Metrics"><a class="hanchor" href="#Daily_Performance_Metrics">Daily Performance Metrics</a></h3>
<p>Productivity, creativity and the subjective length of the
day. Collected with <a href="./data/mental">this script</a> into <a href="./data/mental.csv">this
file</a>. Started collecting subjective length of the
day on 2023-08-21.</p>
<h3 id="Bag_Spreading"><a class="hanchor" href="#Bag_Spreading">Bag Spreading</a></h3>
<p>Data on bag spreading on public transport, in <a href="./data/bag_spreading.csv">this
file</a>. Data quality is horrible: probably prone
to multiple biases on my part, collected in different locations, with no
tracking of location or datetime… maybe I should just delete this one.</p>
<h3 id="Phone_Data"><a class="hanchor" href="#Phone_Data">Phone Data</a></h3>
<p>Via <a href="https://play.google.com/store/apps/details?id=com.kelvin.sensorapp&hl=en_US">Sensor
Logger</a>,
I use my phone as an easy way to collect large amounts of data. I don't want to
make the files public, as they contain information that could de-pseudonymise me.</p>
<h3 id="Forecasting_Performance"><a class="hanchor" href="#Forecasting_Performance">Forecasting Performance</a></h3>
<p>In principle it should be possible for me to track my forecasting
performance on Manifold, Fatebook/PredictionBook and Metaculus,
given that all of them have APIs over which data can be exported and
analyzed. In practice I haven't done so yet, but it might be a good
(albeit slow-to-evaluate) proxy for cognitive performance.</p>
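<p>A starting point for the Manifold part might look like this (a sketch; the
endpoint and parameters are from the Manifold API docs as far as I remember
them, and the username is a placeholder):</p>
<pre><code># Fetch my most recent bets from Manifold and tally the outcomes I bet on.
curl -s 'https://api.manifold.markets/v0/bets?username=MY_USERNAME&limit=1000' |
    jq -r '.[].outcome' | sort | uniq -c
</code></pre>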
</body></html>