-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rebase to v2.47.0 #5196
Rebase to v2.47.0 #5196
Conversation
In anticipation of a few planned applications, introduce the most basic form of a path-walk API. It currently assumes that there are no UNINTERESTING objects, and does not include any complicated filters. It calls a function pointer on groups of tree and blob objects as grouped by path. This only includes objects the first time they are discovered, so an object that appears at multiple paths will not be included in two batches. There are many future adaptations that could be made, but they are left for future updates when consumers are ready to take advantage of those features. Signed-off-by: Derrick Stolee <[email protected]>
Add some tests based on the current behavior, doing interesting checks for different sets of branches, ranges, and the --boundary option. This sets a baseline for the behavior and we can extend it as new options are introduced. Signed-off-by: Derrick Stolee <[email protected]>
We add the ability to filter the object types in the path-walk API so the callback function is called fewer times. This adds the ability to ask for the commits in a list, as well. Future changes will add the ability to visit annotated tags. Signed-off-by: Derrick Stolee <[email protected]>
In anticipation of using the path-walk API to analyze tags or include them in a pack-file, add the ability to walk the tags that were included in the revision walk. Signed-off-by: Derrick Stolee <[email protected]>
The sparse tree walk algorithm was created in d5d2e93 (revision: implement sparse algorithm, 2019-01-16) and involves using the mark_trees_uninteresting_sparse() method. This method takes a repository and an oidset of tree IDs, some of which have the UNINTERESTING flag and some of which do not. Create a method that has an equivalent set of preconditions but uses a "dense" walk (recursively visits all reachable trees, as long as they have not previously been marked UNINTERESTING). This is an important difference from mark_tree_uninteresting(), which short-circuits if the given tree has the UNINTERESTING flag. A use of this method will be added in a later change, with a condition set whether the sparse or dense approach should be used. Signed-off-by: Derrick Stolee <[email protected]>
This option causes the path-walk API to act like the sparse tree-walk algorithm implemented by mark_trees_uninteresting_sparse() in list-objects.c. Starting from the commits marked as UNINTERESTING, their root trees and all objects reachable from those trees are UNINTERSTING, at least as we walk path-by-path. When we reach a path where all objects associated with that path are marked UNINTERESTING, then do no continue walking the children of that path. We need to be careful to pass the UNINTERESTING flag in a deep way on the UNINTERESTING objects before we start the path-walk, or else the depth-first search for the path-walk API may accidentally report some objects as interesting. Signed-off-by: Derrick Stolee <[email protected]>
This will be helpful in a future change. Signed-off-by: Derrick Stolee <[email protected]>
In order to more easily compute delta bases among objects that appear at the exact same path, add a --path-walk option to 'git pack-objects'. This option will use the path-walk API instead of the object walk given by the revision machinery. Since objects will be provided in batches representing a common path, those objects can be tested for delta bases immediately instead of waiting for a sort of the full object list by name-hash. This has multiple benefits, including avoiding collisions by name-hash. The objects marked as UNINTERESTING are included in these batches, so we are guaranteeing some locality to find good delta bases. After the individual passes are done on a per-path basis, the default name-hash is used to find other opportunistic delta bases that did not match exactly by the full path name. RFC TODO: It is important to note that this option is inherently incompatible with using a bitmap index. This walk probably also does not work with other advanced features, such as delta islands. Getting ahead of myself, this option compares well with --full-name-hash when the packfile is large enough, but also performs at least as well as the default in all cases that I've seen. RFC TODO: this should probably be recording the batch locations to another list so they could be processed in a second phase using threads. RFC TODO: list some examples of how this outperforms previous pack-objects strategies. (This is coming in later commits that include performance test changes.) Signed-off-by: Derrick Stolee <[email protected]>
There are many tests that validate whether 'git pack-objects' works as expected. Instead of duplicating these tests, add a new test environment variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default when specified. This was useful in testing the implementation of the --path-walk implementation, especially in conjunction with test such as: - t0411-clone-from-partial.sh : One test fetches from a repo that does not have the boundary objects. This causes the path-based walk to fail. Disable the variable for this test. - t5306-pack-nobase.sh : Similar to t0411, one test fetches from a repo without a boundary object. - t5310-pack-bitmaps.sh : One test compares the case when packing with bitmaps to the case when packing without them. Since we disable the test variable when writing bitmaps, this causes a difference in the object list (the --path-walk option adds an extra object). Specify --no-path-walk in both processes for the comparison. Another test checks for a specific delta base, but when computing dynamically without using bitmaps, the base object it too small to be considered in the delta calculations so no base is used. - t5316-pack-delta-depth.sh : This script cares about certain delta choices and their chain lengths. The --path-walk option changes how these chains are selected, and thus changes the results of this test. - t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of the --sparse option and how it combines with --path-walk. - t5332-multi-pack-reuse.sh : This test verifies that the preferred pack is used for delta reuse when possible. The --path-walk option is not currently aware of the preferred pack at all, so finds a different delta base. - t7406-submodule-update.sh : When using the variable, the --depth option collides with the --path-walk feature, resulting in a warning message. Disable the variable so this warning does not appear. I want to call out one specific test change that is only temporary: - t5530-upload-pack-error.sh : One test cares specifically about an "unable to read" error message. Since the current implementation performs delta calculations within the path-walk API callback, a different "unable to get size" error message appears. When this is changed in a future refactoring, this test change can be reverted. Signed-off-by: Derrick Stolee <[email protected]>
Since 'git pack-objects' supports a --path-walk option, allow passing it through in 'git repack'. This presents interesting testing opportunities for comparing the different repacking strategies against each other. Add the --path-walk option to the performance tests in p5313. For the microsoft/fluentui repo [1] checked out at a specific commit [2], the results are very interesting: Test this tree ------------------------------------------------------------------ 5313.2: thin pack 0.40(0.47+0.04) 5313.3: thin pack size 1.2M 5313.4: thin pack with --full-name-hash 0.09(0.10+0.04) 5313.5: thin pack size with --full-name-hash 22.8K 5313.6: thin pack with --path-walk 0.08(0.06+0.02) 5313.7: thin pack size with --path-walk 20.8K 5313.8: big pack 2.16(8.43+0.23) 5313.9: big pack size 17.7M 5313.10: big pack with --full-name-hash 1.42(3.06+0.21) 5313.11: big pack size with --full-name-hash 18.0M 5313.12: big pack with --path-walk 2.21(8.39+0.24) 5313.13: big pack size with --path-walk 17.8M 5313.14: repack 98.05(662.37+2.64) 5313.15: repack size 449.1K 5313.16: repack with --full-name-hash 33.95(129.44+2.63) 5313.17: repack size with --full-name-hash 182.9K 5313.18: repack with --path-walk 106.21(121.58+0.82) 5313.19: repack size with --path-walk 159.6K [1] https://github.com/microsoft/fluentui [2] e70848ebac1cd720875bccaa3026f4a9ed700e08 This repo suffers from having a lot of paths that collide in the name hash, so examining them in groups by path leads to better deltas. Also, in this case, the single-threaded implementation is competitive with the full repack. This is saving time diffing files that have significant differences from each other. A similar, but private, repo has even more extremes in the thin packs: Test this tree -------------------------------------------------------------- 5313.2: thin pack 2.39(2.91+0.10) 5313.3: thin pack size 4.5M 5313.4: thin pack with --full-name-hash 0.29(0.47+0.12) 5313.5: thin pack size with --full-name-hash 15.5K 5313.6: thin pack with --path-walk 0.35(0.31+0.04) 5313.7: thin pack size with --path-walk 14.2K Notice, however, that while the --full-name-hash version is working quite well in these cases for the thin pack, it does poorly for some other standard cases, such as this test on the Linux kernel repository: Test this tree -------------------------------------------------------------- 5313.2: thin pack 0.01(0.00+0.00) 5313.3: thin pack size 310 5313.4: thin pack with --full-name-hash 0.00(0.00+0.00) 5313.5: thin pack size with --full-name-hash 1.4K 5313.6: thin pack with --path-walk 0.00(0.00+0.00) 5313.7: thin pack size with --path-walk 310 Here, the --full-name-hash option does much worse than the default name hash, but the path-walk option does exactly as well. Signed-off-by: Derrick Stolee <[email protected]>
Users may want to enable the --path-walk option for 'git pack-objects' by default, especially underneath commands like 'git push' or 'git repack'. This should be limited to client repositories, since the --path-walk option disables bitmap walks, so would be bad to include in Git servers when serving fetches and clones. There is potential that it may be helpful to consider when repacking the repository, to take advantage of improved deltas across historical versions of the same files. Much like how "pack.useSparse" was introduced and included in "feature.experimental" before being enabled by default, use the repository settings infrastructure to make the new "pack.usePathWalk" config enabled by "feature.experimental" and "feature.manyFiles". Signed-off-by: Derrick Stolee <[email protected]>
Repositories registered with Scalar are expected to be client-only repositories that are rather large. This means that they are more likely to be good candidates for using the --path-walk option when running 'git pack-objects', especially under the hood of 'git push'. Enable this config in Scalar repositories. Signed-off-by: Derrick Stolee <[email protected]>
Previously, the --path-walk option to 'git pack-objects' would compute deltas inline with the path-walk logic. This would make the progress indicator look like it is taking a long time to enumerate objects, and then very quickly computed deltas. Instead of computing deltas on each region of objects organized by tree, store a list of regions corresponding to these groups. These can later be pulled from the list for delta compression before doing the "global" delta search. This presents a new progress indicator that can be used in tests to verify that this stage is happening. The current implementation is not integrated with threads, but could be done in a future update. Since we do not attempt to sort objects by size until after exploring all trees, we can remove the previous change to t5530 due to a different error message appearing first. Signed-off-by: Derrick Stolee <[email protected]>
Adapting the implementation of ll_find_deltas(), create a threaded version of the --path-walk compression step in 'git pack-objects'. This involves adding a 'regions' member to the thread_params struct, allowing each thread to own a section of paths. We can simplify the way jobs are split because there is no value in extending the batch based on name-hash the way sections of the object entry array are attempted to be grouped. We re-use the 'list_size' and 'remaining' items for the purpose of borrowing work in progress from other "victim" threads when a thread has finished its batch of work more quickly. Using the Git repository as a test repo, the p5313 performance test shows that the resulting size of the repo is the same, but the threaded implementation gives gains of varying degrees depending on the number of objects being packed. (This was tested on a 16-core machine.) Test HEAD~1 HEAD ------------------------------------------------------------- 5313.6: thin pack with --path-walk 0.01 0.01 +0.0% 5313.7: thin pack size with --path-walk 475 475 +0.0% 5313.12: big pack with --path-walk 1.99 1.87 -6.0% 5313.13: big pack size with --path-walk 14.4M 14.3M -0.4% 5313.18: repack with --path-walk 98.14 41.46 -57.8% 5313.19: repack size with --path-walk 197.2M 197.3M +0.0% Signed-off-by: Derrick Stolee <[email protected]>
When adding tree objects, we are very careful to avoid adding the same tree object more than once. There was one small gap in that logic, though: when adding a root tree object. Two refs can easily share the same root tree object, and we should still not add it more than once. Signed-off-by: Johannes Schindelin <[email protected]>
In anticipation of implementing 'git backfill', populate the necessary files with the boilerplate of a new builtin. RFC TODO: When preparing this for a full implementation, make sure it is based on the newest standards introduced by [1]. [1] https://lore.kernel.org/git/[email protected]/T/#m606036ea2e75a6d6819d6b5c90e729643b0ff7f7 [PATCH 1/3] builtin: add a repository parameter for builtin functions Signed-off-by: Derrick Stolee <[email protected]>
The default behavior of 'git backfill' is to fetch all missing blobs that are reachable from HEAD. Document and test this behavior. The implementation is a very simple use of the path-walk API, initializing the revision walk at HEAD to start the path-walk from all commits reachable from HEAD. Ignore the object arrays that correspond to tree entries, assuming that they are all present already. Signed-off-by: Derrick Stolee <[email protected]>
Users may want to specify a minimum batch size for their needs. This is only a minimum: the path-walk API provides a list of OIDs that correspond to the same path, and thus it is optimal to allow delta compression across those objects in a single server request. We could consider limiting the request to have a maximum batch size in the future. Signed-off-by: Derrick Stolee <[email protected]>
One way to significantly reduce the cost of a Git clone and later fetches is to use a blobless partial clone and combine that with a sparse-checkout that reduces the paths that need to be populated in the working directory. Not only does this reduce the cost of clones and fetches, the sparse-checkout reduces the number of objects needed to download from a promisor remote. However, history investigations can be expensie as computing blob diffs will trigger promisor remote requests for one object at a time. This can be avoided by downloading the blobs needed for the given sparse-checkout using 'git backfill' and its new '--sparse' mode, at a time that the user is willing to pay that extra cost. Note that this is distinctly different from the '--filter=sparse:<oid>' option, as this assumes that the partial clone has all reachable trees and we are using client-side logic to avoid downloading blobs outside of the sparse-checkout cone. This avoids the server-side cost of walking trees while also achieving a similar goal. It also downloads in batches based on similar path names, presenting a resumable download if things are interrupted. This augments the path-walk API to have a possibly-NULL 'pl' member that may point to a 'struct pattern_list'. This could be more general than the sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently the only consumer. Be sure to test this in both cone mode and not cone mode. Cone mode has the benefit that the path-walk can skip certain paths once they would expand beyond the sparse-checkout. Signed-off-by: Derrick Stolee <[email protected]>
Start work on a new 'git survey' command to scan the repository for monorepo performance and scaling problems. The goal is to measure the various known "dimensions of scale" and serve as a foundation for adding additional measurements as we learn more about Git monorepo scaling problems. The initial goal is to complement the scanning and analysis performed by the GO-based 'git-sizer' (https://github.com/github/git-sizer) tool. It is hoped that by creating a builtin command, we may be able to take advantage of internal Git data structures and code that is not accessible from GO to gain further insight into potential scaling problems. Co-authored-by: Derrick Stolee <[email protected]> Signed-off-by: Jeff Hostetler <[email protected]> Signed-off-by: Derrick Stolee <[email protected]>
The previous change introduced the '--[no-]sparse' option for the 'git backfill' command, but did not assume it as enabled by default. However, this is likely the behavior that users will most often want to happen. Without this default, users with a small sparse-checkout may be confused when 'git backfill' downloads every version of every object in the full history. However, this is left as a separate change so this decision can be reviewed independently of the value of the '--[no-]sparse' option. Add a test of adding the '--sparse' option to a repo without sparse-checkout to make it clear that supplying it without a sparse-checkout is an error. Signed-off-by: Derrick Stolee <[email protected]>
By default we will scan all references in "refs/heads/", "refs/tags/" and "refs/remotes/". Add command line opts let the use ask for all refs or a subset of them and to include a detached HEAD. Signed-off-by: Jeff Hostetler <[email protected]> Signed-off-by: Derrick Stolee <[email protected]>
This is a highly useful command, and we want it to get some testing "in the wild". However, the patches have not yet been reviewed on the Git mailing list, and are therefore subject to change. By marking the command as experimental, users will be warned to pay attention to those changes. Signed-off-by: Johannes Schindelin <[email protected]>
When 'git survey' provides information to the user, this will be presented in one of two formats: plaintext and JSON. The JSON implementation will be delayed until the functionality is complete for the plaintext format. The most important parts of the plaintext format are headers specifying the different sections of the report and tables providing concreted data. Create a custom table data structure that allows specifying a list of strings for the row values. When printing the table, check each column for the maximum width so we can create a table of the correct size from the start. The table structure is designed to be flexible to the different kinds of output that will be implemented in future changes. Signed-off-by: Derrick Stolee <[email protected]>
At the moment, nothing is obvious about the reason for the use of the path-walk API, but this will become more prevelant in future iterations. For now, use the path-walk API to sum up the counts of each kind of object. For example, this is the reachable object summary output for my local repo: REACHABLE OBJECT SUMMARY ======================== Object Type | Count ------------+------- Tags | 1343 Commits | 179344 Trees | 314350 Blobs | 184030 Signed-off-by: Derrick Stolee <[email protected]>
Now that we have explored objects by count, we can expand that a bit more to summarize the data for the on-disk and inflated size of those objects. This information is helpful for diagnosing both why disk space (and perhaps clone or fetch times) is growing but also why certain operations are slow because the inflated size of the abstract objects that must be processed is so large. Signed-off-by: Derrick Stolee <[email protected]>
Signed-off-by: Derrick Stolee <[email protected]>
In future changes, we will make use of these methods. The intention is to keep track of the top contributors according to some metric. We don't want to store all of the entries and do a sort at the end, so track a constant-size table and remove rows that get pushed out depending on the chosen sorting algorithm. Co-authored-by: Jeff Hostetler <[email protected]> Signed-off-by; Jeff Hostetler <[email protected]> Signed-off-by: Derrick Stolee <[email protected]>
Since we are already walking our reachable objects using the path-walk API, let's now collect lists of the paths that contribute most to different metrics. Specifically, we care about * Number of versions. * Total size on disk. * Total inflated size (no delta or zlib compression). This information can be critical to discovering which parts of the repository are causing the most growth, especially on-disk size. Different packing strategies might help compress data more efficiently, but the toal inflated size is a representation of the raw size of all snapshots of those paths. Even when stored efficiently on disk, that size represents how much information must be processed to complete a command such as 'git blame'. Since the on-disk size is likely to be fragile, stop testing the exact output of 'git survey' and check that the correct set of headers is output. Signed-off-by: Derrick Stolee <[email protected]>
The 'git survey' builtin provides several detail tables, such as "top files by on-disk size". The size of these tables defaults to 100, currently. Allow the user to specify this number via a new --top=<N> option or the new survey.top config key. Signed-off-by: Derrick Stolee <[email protected]>
Git documentation refers to $HOME and $XDG_CONFIG_HOME often, but does not specify how or where these values come from on Windows where neither is set by default. The new documentation reflects the behavior of setup_windows_environment() in compat/mingw.c. Signed-off-by: Alejandro Barreto <[email protected]>
This is the recommended way on GitHub to describe policies revolving around security issues and about supported versions. Signed-off-by: Johannes Schindelin <[email protected]>
These are Git for Windows' Git GUI and gitk patches. We will have to decide at some point what to do about them, but that's a little lower priority (as Git GUI seems to be unmaintained for the time being, and the gitk maintainer keeps a very low profile on the Git mailing list, too). Signed-off-by: Johannes Schindelin <[email protected]>
…dvice clean: suggest using `core.longPaths` if paths are too long to remove
Signed-off-by: Johannes Schindelin <[email protected]>
This was pull request git-for-windows#1645 from ZCube/master Support windows container. Signed-off-by: Johannes Schindelin <[email protected]>
…ws#4527) With this patch, Git for Windows works as intended on mounted APFS volumes (where renaming read-only files would fail). Signed-off-by: Johannes Schindelin <[email protected]>
Specify symlink type in .gitattributes
Signed-off-by: Johannes Schindelin <[email protected]>
This patch introduces support to set special NTFS attributes that are interpreted by the Windows Subsystem for Linux as file mode bits, UID and GID. Signed-off-by: Johannes Schindelin <[email protected]>
Handle Ctrl+C in Git Bash nicely Signed-off-by: Johannes Schindelin <[email protected]>
Switch to batched fsync by default
A fix for calling `vim` in Windows Terminal caused a regression and was reverted. We partially un-revert this, to get the fix again. Signed-off-by: Johannes Schindelin <[email protected]>
This topic branch re-adds the deprecated --stdin/-z options to `git reset`. Those patches were overridden by a different set of options in the upstream Git project before we could propose `--stdin`. We offered this in MinGit to applications that wanted a safer way to pass lots of pathspecs to Git, and these applications will need to be adjusted. Instead of `--stdin`, `--pathspec-from-file=-` should be used, and instead of `-z`, `--pathspec-file-nul`. Signed-off-by: Johannes Schindelin <[email protected]>
Originally introduced as `core.useBuiltinFSMonitor` in Git for Windows and developed, improved and stabilized there, the built-in FSMonitor only made it into upstream Git (after unnecessarily long hemming and hawing and throwing overly perfectionist style review sticks into the spokes) as `core.fsmonitor = true`. In Git for Windows, with this topic branch, we re-introduce the now-obsolete config setting, with warnings suggesting to existing users how to switch to the new config setting, with the intention to ultimately drop the patch at some stage. Signed-off-by: Johannes Schindelin <[email protected]>
…updates Start monitoring updates of Git for Windows' component in the open
Add a README.md for GitHub goodness. Signed-off-by: Johannes Schindelin <[email protected]>
Ooops. Must not risk a segmentation fault in a partial clone missing trees... Signed-off-by: Johannes Schindelin <[email protected]>
I briefly considered backporting gitgitgadget#1812 because the link is broken: But I decided against backporting given that there is no easy |
/git-artifacts The The |
/release The |
d53e464
into
git-for-windows:main
Range-diff relative to
errno
is set correctly when socket operations failgetcwd()
git add
issue with NTFS junctionsgit.exe
to be used instead of the "Git wrapper"strbuf_realpath()
.git/branches/
in the templatescontrib/subtree
test
targetparse_interpreter()
contrib/subtree
tests in CI buildswindows.appendAtomically
NO_PERL
again-pedantic
option.vcxproj
filesruns_on_pool
prove
cacheGETTEXT_POISON
jobapt-get
callsLinux32
jobgit-<command>
for built-inswindows.appendAtomically
in more caseslocaltime_r()
is declared even in i686 builds--full-name-hash
as experimentalgit_terminal_prompt
with more terminalscore.longPaths
if paths are too long to removesymlink
attributeiconv
iconv
is unavailable, usetest-helper --iconv
builtin pwd -W
when availableNothing stands out, except maybe that the numbers reflect that I moved a couple of patches into the
ready-for-upstream
thicket.