`ein tool query` - an experimental query engine for git repositories #767

Byron · 2023-03-06T16:46:48Z

Byron
Mar 6, 2023
Maintainer

The Problem

Extracting intelligence from git repositories is slow even for medium-sized projects, especially once one wants information about the file changes of a commit. That said, obtaining information about the commits themselves is typically fast. Here we use the Linux kernel to jump right to a worst-case scenario for commit-level intelligence.

How many full years of 8h workdays does it take to write the Linux kernel? (373.6)

linux (2f5065a)
❯ time ein tool hours
 16:29:26 traverse commit graph done 1.1M commits in 7.96s (142.9k commits/s)
 16:29:26        estimate-hours Extracted and organized data from 1137234 commits in 12.364375ms (91976664 commits/s)
total hours: 1091040.00
total 8h days: 136380.00
total commits = 1137234
total authors: 30889
total unique authors: 23526 (23.84% duplication)
ein tool hours  9.56s user 0.56s system 122% cpu 8.280 total
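The figure in the heading follows from the totals in this output. A minimal sketch of the arithmetic, assuming a "full year" means 365 workdays with no weekends or holidays (the post does not spell this out, but it is the reading that reproduces 373.6):

```python
# Totals copied from the `ein tool hours` output above.
total_hours = 1091040.00
hours_per_workday = 8

workdays = total_hours / hours_per_workday

# Assumption: a "full year" is 365 workdays, without weekends or holidays.
years = workdays / 365

print(f"{workdays:.2f} workdays, {years:.1f} years")
# prints "136380.00 workdays, 373.6 years"
```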

8 seconds to extract a bunch of commit-related information isn't too bad at all, but one can imagine this getting annoying if there were more invocations like this.

It gets dicey when we ask for diff information as well.

linux (2f5065a)
❯ time ein tool hours --file-stats --line-stats
 16:34:23 traverse commit graph done 1.1M commits in 12.48s (91.1k commits/s)
 16:50:24         extract stats done 1.0M commits in 974.00s (1.1k commits/s)
 16:50:24          find changes done 6.2M modified files in 974.00s (6.4k modified files/s)
 16:50:24          find changes done 113.9M diff lines in 974.00s (116.9k diff lines/s)
 16:50:25        estimate-hours Extracted and organized data from 1137234 commits in 272.93175ms (4166734 commits/s)
total hours: 1091040.00
total 8h days: 136380.00
total commits = 1137234
total authors: 30889
total files added/removed/modified/remaining: 127139/66660/2226986/60479
total lines added/removed/remaining: 70941024/42963427/27977597
total unique authors: 23526 (23.84% duplication)
stats omitted for 84307 merge commits
ein tool hours --file-stats --line-stats  9102.31s user 33.11s system 937% cpu 16:14.27 total

Nearly 17 minutes on 8.5 cores! Now we are knee-deep in the realm of too-slow-to-use, and repeat invocations with different query parameters are definitely not something one would casually do.

What if the information we are interested in, per-commit file changes along with statistics about changed lines, were readily available in some sort of cache? And what if that cache could be updated incrementally?

Meet `ein tool query` - incremental `git` to `sqlite`

Enter ein tool query, an experiment to see what can be done when interesting information is readily available in a sqlite database. It's quite easy to use. The first time it runs, it obtains information for each commit in the repository at a cost similar to the one demonstrated in the ein tool hours example, or roughly 17 minutes on the linux kernel. Afterwards, all updates are incremental and complete nearly instantly.

linux (2f5065a)
❯ time ein tool query
 17:07:48 db cache up to date
Choose a command for the query engine
ein tool query  0.20s user 0.02s system 98% cpu 0.231 total
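The idea of an incrementally updated cache can be sketched in a few lines. This is not how ein tool query is implemented: the table names, the columns, and the `open_cache`, `update_cache` and `diff_commit` names are all hypothetical, and the real schema may look quite different. The sketch only shows why a second run is cheap: the expensive diff is computed solely for commits that are not in the database yet.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) the cache. The schema is hypothetical, for illustration only."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS commits (hash TEXT PRIMARY KEY);
        CREATE TABLE IF NOT EXISTS file_changes (
            commit_hash   TEXT NOT NULL REFERENCES commits(hash),
            path          TEXT NOT NULL,
            lines_added   INTEGER NOT NULL,
            lines_removed INTEGER NOT NULL
        );
        """
    )
    return conn


def update_cache(conn, commit_ids, diff_commit):
    """Store stats for commits that are not cached yet; return how many were added."""
    known = {row[0] for row in conn.execute("SELECT hash FROM commits")}
    new_ids = [commit_id for commit_id in commit_ids if commit_id not in known]
    with conn:  # a single transaction for the whole update
        for commit_id in new_ids:
            conn.execute("INSERT INTO commits (hash) VALUES (?)", (commit_id,))
            conn.executemany(
                "INSERT INTO file_changes VALUES (?, ?, ?, ?)",
                [(commit_id, path, added, removed)
                 for path, added, removed in diff_commit(commit_id)],
            )
    return len(new_ids)


def fake_diff(commit_id):
    # Stand-in for the expensive tree and line diff of one commit.
    return [("README", 1, 0)]


conn = open_cache()
first = update_cache(conn, ["c1", "c2"], fake_diff)         # initial, expensive run
second = update_cache(conn, ["c1", "c2", "c3"], fake_diff)  # one new commit
third = update_cache(conn, ["c1", "c2", "c3"], fake_diff)   # nothing to do
print(first, second, third)  # prints "2 1 0"
```

The third call corresponds to the "db cache up to date" case: nothing is diffed, so only the cost of finding out that there is nothing new remains.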

Query Performance

How fast is the actual query? ~250ms for figuring out that there is nothing to update isn't too bad, leaving about 750ms on the table for use in an actual query. What about learning about the origin of a file?

linux (2f5065a)
❯ time ein tool query trace-file include/dt-bindings/pinctrl/pinctrl-starfive-jh7100.h
 17:18:49 db cache up to date
 17:18:50 run sql query done 2.0 round in 0.29s (6.0 round/s)
+         | 2022-10-04 | ba99b756da17 ➡ include/dt-bindings/pinctrl/pinctrl-starfive.h ➡ include/dt-bindings/pinctrl/pinctrl-starfive-jh7100.h
++++++++++| 2021-12-16 | 3021114b3d17 + include/dt-bindings/pinctrl/pinctrl-starfive.h
ein tool query trace-file   0.46s user 0.07s system 99% cpu 0.537 total

~500ms for an incremental update and a single query, showing the (not very interesting) history of a renamed file (one of nearly 35 thousand by the way).

There are also a couple of copies (just 1500 that could be tracked as copies from the set of changed files), which can serve as an example where multiple SQL queries are performed to get one answer:

linux (2f5065a)
❯ time ein tool query trace-file arch/s390/kernel/vdso32/note.S
 17:26:45 db cache up to date
 17:26:46 run sql query done 4.0 round in 0.60s (6.0 round/s)
          | 2021-07-08 | 779df2248739 ⏸ arch/s390/kernel/vdso64/note.S ➡ arch/s390/kernel/vdso32/note.S
----------| 2019-12-01 | 2115fbf7210b - arch/s390/kernel/vdso32/note.S
+         | 2017-11-02 | b24413180f56 Δ arch/s390/kernel/vdso32/note.S
+         | 2017-12-05 | 9fa1db4c7511 Δ arch/s390/kernel/vdso64/note.S
          | 2008-12-25 | b020632e40c3 ⏸ arch/x86/vdso/vdso-note.S ➡ arch/s390/kernel/vdso64/note.S
          | 2007-10-11 | 7648b1330c33 ➡ arch/x86_64/vdso/vdso-note.S ➡ arch/x86/vdso/vdso-note.S
++++++++++| 2007-07-21 | 2aae950b21e4 + arch/x86_64/vdso/vdso-note.S
1 file(s) were found in history that are not reachable from HEAD
ein tool query trace-file arch/s390/kernel/vdso32/note.S  0.73s user 0.12s system 99% cpu 0.848 total
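One way to read the "round" counter in these outputs is one SQL query per path in the chain of renames and copies. The sketch below follows that reading. It is an assumption, not the actual implementation: the `file_changes` table, its columns and the `trace_file` function are hypothetical, and the sample rows are copied from the first trace-file output above.

```python
import sqlite3


def trace_file(conn, path):
    """Follow `path` back through renames and copies, one SQL query per round."""
    history, rounds, seen = [], 0, set()
    while path is not None and path not in seen:
        seen.add(path)
        rounds += 1
        rows = conn.execute(
            "SELECT commit_date, commit_hash, kind, source_path, path "
            "FROM file_changes WHERE path = ? ORDER BY commit_date DESC",
            (path,),
        ).fetchall()
        history.extend(rows)
        # Continue with the oldest rename or copy source found in this round, if any.
        sources = [row[3] for row in rows if row[3] is not None]
        path = sources[-1] if sources else None
    return history, rounds


# Hypothetical schema, filled with the two changes shown in the first example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_changes ("
    "commit_date TEXT, commit_hash TEXT, kind TEXT, source_path TEXT, path TEXT)"
)
conn.executemany(
    "INSERT INTO file_changes VALUES (?, ?, ?, ?, ?)",
    [
        ("2022-10-04", "ba99b756da17", "rename",
         "include/dt-bindings/pinctrl/pinctrl-starfive.h",
         "include/dt-bindings/pinctrl/pinctrl-starfive-jh7100.h"),
        ("2021-12-16", "3021114b3d17", "add", None,
         "include/dt-bindings/pinctrl/pinctrl-starfive.h"),
    ],
)

history, rounds = trace_file(
    conn, "include/dt-bindings/pinctrl/pinctrl-starfive-jh7100.h"
)
print(rounds)  # prints "2"
for commit_date, commit_hash, kind, source_path, path in history:
    print(commit_date, commit_hash, kind, source_path or "", path)
```

With this reading, the s390 example needs four rounds because four distinct paths are visited: vdso32/note.S, vdso64/note.S, arch/x86/vdso/vdso-note.S and arch/x86_64/vdso/vdso-note.S. The sketch looks up changes by path only, so it ignores the case where an unrelated file is later created under a previously used name.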

Conclusion (or "Is this all?!")

☺️ Indeed, there isn't very much this prototype can do yet, but it may serve as an example of how gitoxide enables new kinds of tools that previously were prohibitively expensive or hard to implement.

With the query engine that ein provides, one can easily implement answers to more interesting questions - your imagination (and possibly SQL skills) are the limit!
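As an illustration of such a question, here is a sketch of a "which files change the most" query of the kind a heat map would need. The `file_changes` table and its columns are hypothetical and the rows are made-up sample data, so the query would have to be adapted to the schema the tool actually writes.

```python
import sqlite3

# Hypothetical schema with made-up sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_changes ("
    "commit_hash TEXT, path TEXT, lines_added INTEGER, lines_removed INTEGER)"
)
conn.executemany(
    "INSERT INTO file_changes VALUES (?, ?, ?, ?)",
    [
        ("c1", "Makefile", 10, 0),
        ("c2", "Makefile", 2, 1),
        ("c2", "README", 5, 0),
        ("c3", "Makefile", 0, 4),
    ],
)

# Rank files by churn, the total number of added and removed lines.
top = conn.execute(
    """
    SELECT path,
           COUNT(*) AS commits,
           SUM(lines_added + lines_removed) AS churn
    FROM file_changes
    GROUP BY path
    ORDER BY churn DESC, path
    LIMIT 10
    """
).fetchall()
print(top)  # prints "[('Makefile', 3, 17), ('README', 1, 5)]"
```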

Q & A

How big are these `sqlite` databases, and where are they stored?

They are stored at .git/ein.query and weigh 281MB for the linux kernel, or ~250 bytes per commit.
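Since the cache is an ordinary sqlite file, its real schema can be inspected without guessing at table names. A minimal sketch using Python's standard `sqlite3` module; the `list_tables` helper is hypothetical, and `sqlite_master` is sqlite's own catalog table:

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return (name, CREATE statement) for every table in a sqlite file."""
    # Open read-only so that inspecting the cache can never modify it.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        return conn.execute(
            "SELECT name, sql FROM sqlite_master "
            "WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
```

Calling `list_tables(".git/ein.query")` from the root of a repository in which ein tool query has run would list the tables to write queries against. Opening a file that does not exist raises `sqlite3.OperationalError` because of the read-only mode.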

Why is diffing so slow?

Generally, accessing the object database is slow: it can only deliver about 120k objects/s per core, or much less depending on the number of deltas that need to be resolved to obtain an object. However, the main cost of doing any of that is the venerable zlib, which always dominates any interaction with the object database.
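To see where zlib enters the picture, consider git's loose object format: a header of the form `<type> <size>\0` followed by the content, all of it zlib-compressed, with the object id being the SHA-1 of the uncompressed bytes. The sketch below round-trips one object with Python's standard library. It covers loose objects only, not packs and their deltas, and the two helper functions are hypothetical names that have nothing to do with gitoxide's code.

```python
import hashlib
import zlib


def encode_loose_object(kind, content):
    """Return (object id, zlib-compressed bytes) in git's loose object format."""
    raw = kind.encode() + b" " + str(len(content)).encode() + b"\0" + content
    return hashlib.sha1(raw).hexdigest(), zlib.compress(raw)


def decode_loose_object(compressed):
    """Inflate a loose object and split it into (kind, content)."""
    raw = zlib.decompress(compressed)  # every object read pays for this step
    header, _, content = raw.partition(b"\0")
    kind, size = header.split(b" ")
    assert int(size) == len(content), "size in header must match content length"
    return kind.decode(), content


oid, compressed = encode_loose_object("blob", b"")
print(oid)  # prints "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391", git's empty blob
print(decode_loose_object(compressed))  # prints "('blob', b'')"
```

A diff over the entire history has to inflate every tree and blob it touches in this way, which is why decompression shows up so prominently.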

Can `git` object databases get faster?

With zlib at its core, it's unlikely we will be seeing considerable improvements. With gitoxide, it could be possible though to re-create a pack using a faster compressor like lzma or zstd to study the performance improvements. My guess is that typical algorithms will easily be 4 times faster per core to the point where the actual algorithm, like the diffing itself, will dominate the workload. These special packs could just be ignored by git due to an unknown pack version, but gitoxide could prefer using them over the zlib compressed packs to unleash their potential.

Would that solve a real problem though? Probably not, as it would still be too slow to run diff-based queries on the entire history, at least compared to SQL queries that finish in a second or less.

Credits

ein tool query grew out of work done for the Git-Heat-Map-db-gen program, which inspired having a tool like it in the ein tool suite. Lastly, it helped to find a fatal flaw in the tree diff implementation that made rename tracking (and some diffs) incorrect.
