ein tool query - an experimental query engine for git repositories
#767
Byron started this conversation in Show and tell
The Problem
Extracting intelligence from git repositories is slow even for medium-sized projects, especially once one wants information about the file changes of a commit. It's fair to say though that obtaining information about the commits themselves is typically fast. Here we use the linux kernel to jump right to a worst-case scenario for obtaining commit-level intelligence.
How many full years of 8h workdays does it take to write the Linux kernel? (373,6)
8 seconds to extract a bunch of commit-related information isn't too bad at all, but one can imagine this getting annoying if there were more invocations like it.
It gets dicey when we ask for diff information as well.
Nearly 17 minutes on 8.5 cores! Now we are knee-deep in the realm of too-slow-to-use, and repeat invocations with different query parameters are definitely nothing one would casually do.
What if the information we are interested in, per-commit file changes along with statistics about changed lines, were readily available in some sort of cache? What if that cache could be incrementally updated?
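To make the idea of an incrementally updated cache concrete, here is a minimal sketch in Python using the standard `sqlite3` module. Everything in it is an assumption made for illustration: the table layout, the function names `open_cache` and `update_cache`, and the `diff_commit` callback are hypothetical and are not how `ein tool query` is implemented. It also assumes a linear history that is walked newest-first, so it can stop at the first commit it already knows.

```python
import sqlite3

# Hypothetical schema for illustration only, not the one used by `ein tool query`.
SCHEMA = """
CREATE TABLE IF NOT EXISTS commits (
    hash TEXT PRIMARY KEY
);
CREATE TABLE IF NOT EXISTS file_changes (
    commit_hash   TEXT NOT NULL REFERENCES commits(hash),
    path          TEXT NOT NULL,
    lines_added   INTEGER NOT NULL,
    lines_removed INTEGER NOT NULL
);
"""


def open_cache(path=":memory:"):
    """Open (or create) the cache database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def update_cache(conn, history, diff_commit):
    """Diff only commits that are not cached yet; return how many were added.

    `history` yields commit hashes newest-first (linear history assumed).
    `diff_commit(hash)` stands in for the expensive step and returns
    (path, lines_added, lines_removed) tuples.
    """
    pending = []
    for commit in history:
        known = conn.execute(
            "SELECT 1 FROM commits WHERE hash = ?", (commit,)
        ).fetchone()
        if known:
            break  # everything older was cached by a previous run
        pending.append(commit)

    # Insert oldest-first inside a single transaction, so an interrupted
    # run leaves the cache unchanged instead of half-updated.
    with conn:
        for commit in reversed(pending):
            conn.execute("INSERT INTO commits (hash) VALUES (?)", (commit,))
            conn.executemany(
                "INSERT INTO file_changes VALUES (?, ?, ?, ?)",
                [(commit, p, a, r) for p, a, r in diff_commit(commit)],
            )
    return len(pending)
```

The first call pays the full diffing cost, while a later call with an unchanged history only performs a single lookup before returning. A real repository has merges, so stopping at the first known commit is a simplification.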
Meet ein tool query - incremental git to sqlite
Enter ein tool query, an experiment to see what can be done when interesting information is readily available in a sqlite database. It's quite easy to use. The first time it runs, it obtains information for each commit in the repository at a cost similar to the one demonstrated in the ein tool hours example, or roughly 17 minutes on the linux kernel. Afterwards, all updates are incremental and complete nearly instantly.
Query Performance
How fast is the actual query? ~250ms for figuring out that there is nothing to update isn't all too bad, leaving about 750ms on the table for use in an actual query. What about learning about the origin of a file?
~500ms for an incremental update and a single query, showing the (not very interesting) history of a renamed file (one of nearly 35 thousand by the way).
There are also a couple of copies (just 1500 that could be tracked as copies from the set of changed files), which can be used as an example where multiple SQL queries are performed to get one answer.
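As a rough illustration of what "multiple SQL queries for one answer" can look like, the sketch below follows rename and copy records backwards, one query per hop. The `file_changes` table, its columns (`commit_seq`, `path`, `source_path`) and the function `trace_origin` are hypothetical names chosen for this example and say nothing about the actual database layout of `ein tool query`.

```python
import sqlite3

# Hypothetical table: one row per changed file and commit. `source_path` is
# only set when the file was recorded as a rename or copy of another path,
# and `commit_seq` grows with each newer commit.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file_changes (
    commit_seq  INTEGER NOT NULL,
    path        TEXT NOT NULL,
    source_path TEXT
);
"""


def trace_origin(conn, path):
    """Return the chain of paths from `path` back to its earliest known origin."""
    chain = [path]
    before = 2**63 - 1  # largest sqlite integer: consider every commit at first
    while True:
        row = conn.execute(
            "SELECT source_path, commit_seq FROM file_changes "
            "WHERE path = ? AND source_path IS NOT NULL AND commit_seq < ? "
            "ORDER BY commit_seq DESC LIMIT 1",
            (chain[-1], before),
        ).fetchone()
        if row is None:
            return chain  # no older rename or copy record: origin reached
        # Continue from the source path, looking only at older commits,
        # which also guarantees that the loop terminates.
        chain.append(row[0])
        before = row[1]
```

Each hop is a cheap lookup, so even a file that moved several times is resolved with a handful of queries against the cache rather than by walking and diffing the history again.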
Conclusion (or "Is this all?!")
gitoxide enables new kinds of tools that previously were prohibitively expensive or hard to implement.
With the query engine that ein provides, one can easily implement answers to more interesting questions - your imagination (and possibly SQL skills) are the limit!
Q & A
How big are these sqlite databases, and where are they stored?
They are stored at .git/ein.query and weigh 281MB for the linux kernel, or ~250 bytes per commit.
Why is diffing so slow?
Generally, accessing the object database is slow: it can only handle about 120k objects/s per core, or much less depending on the number of deltas that need to be resolved to obtain the object. However, the main cost of doing any of that is the venerable zlib, which always dominates any interaction with the object database.
Can git object databases get faster?
With zlib at its core, it's unlikely we will be seeing considerable improvements. With gitoxide, it could be possible though to re-create a pack using a faster compressor like lzma or zstd to study the performance improvements. My guess is that typical algorithms will easily be 4 times faster per core, to the point where the actual algorithm, like the diffing itself, will dominate the workload. These special packs could just be ignored by git due to an unknown pack version, but gitoxide could prefer using them over the zlib compressed packs to unleash their potential.
Would that solve a real problem though? Probably not, as it would still be too slow to run diff-based queries on the entire history, at least compared to sql queries that finish in a second or less.
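The figures quoted in this post can be combined into a quick back-of-the-envelope check. The sketch below only does arithmetic on those numbers, and it rests on generous assumptions that are mine, not measurements: that "281MB" means decimal megabytes, that every core-second of the 17 minute run was spent reading objects at the peak rate, and that the guessed 4x speedup would apply to the whole run.

```python
def core_seconds(minutes, cores):
    """Total CPU time spent, in core-seconds."""
    return minutes * 60 * cores


# Figures quoted in the text.
DIFF_MINUTES = 17                      # "nearly 17 minutes"
CORES = 8.5                            # "on 8.5 cores"
OBJECTS_PER_SECOND_PER_CORE = 120_000  # "about 120k objects/s per core"
DB_BYTES = 281 * 10**6                 # "281MB", assumed to be decimal megabytes
BYTES_PER_COMMIT = 250                 # "~250 bytes per commit"
SPEEDUP_GUESS = 4                      # "easily be 4 times faster per core"

# Upper bound on object reads if the whole run were spent at the peak rate.
budget = core_seconds(DIFF_MINUTES, CORES)
max_object_reads = budget * OBJECTS_PER_SECOND_PER_CORE

# Number of commits implied by the database size.
implied_commits = DB_BYTES / BYTES_PER_COMMIT

# Best case for a faster compressor: the entire run speeds up by the guess.
best_case_minutes = DIFF_MINUTES / SPEEDUP_GUESS

print(f"core-seconds spent:     {budget:,.0f}")
print(f"max object reads:       {max_object_reads:,.0f}")
print(f"implied commits:        {implied_commits:,.0f}")
print(f"best-case diff minutes: {best_case_minutes}")
```

Even in this best case a full diff pass would still take minutes rather than the second or less that a query against the cache needs, which is in line with the answer above.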
Credits
ein tool query was spawned off work done for the Git-Heat-Map-db-gen program, which inspired having a tool like it in the ein tool suite. Lastly, it helped to find a fatal flaw in the tree diff implementation that made rename tracking (and some diffs) incorrect.