@@ -27,83 +27,119 @@ This edition covers what happened during the months of December 2023 and January
2727
2828### Support
2929
30- * [ Git Rename Detection Bug ] ( https://public-inbox.org/git/LO6P265MB6736043BE8FB607DB671D21EFAAAA@LO6P265MB6736.GBRP265.PROD.OUTLOOK.COM/ )
30+ * [ Git Rename Detection Surprises ] ( https://public-inbox.org/git/LO6P265MB6736043BE8FB607DB671D21EFAAAA@LO6P265MB6736.GBRP265.PROD.OUTLOOK.COM/ )
3131
32- Jeremy Pridmore reported a bug to the Git mailing list. He used
32+ Jeremy Pridmore reported an issue to the Git mailing list. He used
3333[ ` git bugreport ` ] ( https://git-scm.com/docs/git-bugreport ) , so his
3434 message looks like a filled-out form with questions and answers.
3535
3636 He was trying to cherry-pick changes from one repo A to another B,
3737 while both A and B came from the same original TFS server but with
3838 different sets of changes. He was disappointed though because some
39- files that had been moved in repo A weren't matched by the rename
40- detection mechanism to the original files in repo B, and he wondered
41- if the reason for this was the new 'ort' merge strategy described in
42- a [ blog post by Elijah Newren] ( https://blog.palantir.com/optimizing-gits-merge-machinery-1-127ceb0ef2a1 ) .
39+ files that had been moved in repo A were matched up by the rename
40+ detection mechanism to files other than what he expected in repo B,
41+ and he wondered if the reason for this was the new 'ort' merge
42+ strategy described in a [ blog post by Elijah
43+ Newren] ( https://blog.palantir.com/optimizing-gits-merge-machinery-1-127ceb0ef2a1 ) .
44+
45+ While not obvious at first, Jeremy's primary problem specifically
46+ centered around cases where there were multiple files with 100%
47+ identical content. Perhaps an example would help. There could have
48+ originally been an ` orig/foo.txt ` file, while one side of history
49+ does not have that file anymore but instead has both a
50+ ` dir2/foo.txt ` and a ` dir3/foo.txt ` ; further, both of the new
51+ foo.txt files are identical to the original ` orig/foo.txt ` . So, Git
52+ has to figure out which foo.txt file the ` orig/foo.txt ` was renamed
53+ to, whether to ` dir2/foo.txt ` or ` dir3/foo.txt ` .
4354
4455 Elijah replied to Jeremy explaining extensively how rename detection
45- works in Git. He said that the new 'ort' merge strategy, which he
46- implemented, and which replaced the old 'recursive' strategy, uses
47- the same rename detection rules as that old strategy. He suggested
48- adding the ` -s recursive ` option to the cherry-pick command to check
49- if it worked differently using the old 'recursive' strategy.
50-
51- Elijah mentioned especially that "exact renames" are detected first
52- when performing rename detection, and if files have different names
53- they are matched randomly as renames.
54-
55- Jeremy replied to Elijah saying that he observed a similar
56- behavior. He gave examples of some issues he was seeing, and he
57- suggested to match files using a "difference value" between the paths
58- and filenames of the different files. He also said he wrote a script
59- to help him resolve conflicts.
60-
61- Elijah replied to Jeremy with further explanations about the fact
62- that renames are just a help for developers as they are not
63- recorded but computed from scratch in response to user commands. He
64- also asked for clarification about some points, and suggested that
65- some files Jeremy had issues with had been added in both repos A
66- and B, which created conflicts but were not rename issues.
67- Similarly, when a file has been removed in both repo A and B, there is
68- no rename issue. The file should just be deleted.
69-
70- About the idea of matching files using a "difference value" between
71- the paths and filenames of the different files, Elijah replied that
72- he had tried similar ideas, but found that in practice it could take
73- significant time and would not provide much benefit.
74-
75- Elijah also discussed the case of having a "base" version with a
76- directory named "library-x-1.7/", while a "stable" version has many
77- changes in that directory and a "development" branch has removed
78- that directory but has added both a "library-x-1.8/" and a
79- "library-x-1.9/" directory with many changes compared to
80- "library-x-1.7/". This case would be somewhat similar to Jeremy's
81- case, and Elijah suggested a hack to workaround rename detection in
82- such cases.
56+ works in Git. Elijah pointed out that Jeremy's problem, as
57+ described, did not involve directory rename detection (despite
58+ looking kind of like a directory rename detection problem). Also,
59+ since Jeremy pointed out that the "mis-detected" renames had
60+ contents identical to the files they were paired with, that
61+ meant that only exact renames were involved. Because of these two
62+ factors, Elijah said that the new 'ort' merge strategy, which he
63+ implemented, and which replaced the old 'recursive' strategy, should
64+ use the same rename detection rules as that old strategy for
65+ Jeremy's problem. Elijah suggested retrying the cherry-pick with
66+ the old 'recursive' strategy (` --strategy=recursive `, since ` -s `
67+ means ` --signoff ` for cherry-pick) to check whether it behaved differently.
68+
69+ Elijah also pointed out that for exact renames in a setup like this,
70+ Git gives a preference to files with the same basename, but if
71+ several candidates with identical content remain, then it will
72+ just pick one essentially at random.
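
The behavior described above (exact content match first, a preference for the same basename, then an essentially arbitrary pick) can be sketched as follows. This is a minimal illustration and not Git's actual implementation: the function name `match_exact_renames` is hypothetical, and hashing file contents with SHA-1 is only a stand-in for comparing blob identities.

```python
import hashlib
import posixpath
from collections import defaultdict


def match_exact_renames(deleted, added):
    """Pair each deleted path with an added path of identical content.

    `deleted` and `added` map path -> file content (bytes).
    Returns a dict {old_path: new_path}. Hypothetical helper, not Git code.
    """
    # Group candidate destinations by a hash of their content.
    by_hash = defaultdict(list)
    for path, content in added.items():
        by_hash[hashlib.sha1(content).hexdigest()].append(path)

    renames = {}
    for old_path, content in deleted.items():
        candidates = by_hash.get(hashlib.sha1(content).hexdigest(), [])
        if not candidates:
            continue  # no identical content; left to inexact detection
        # Prefer a candidate that has the same basename as the old path.
        old_name = posixpath.basename(old_path)
        same_name = [p for p in candidates if posixpath.basename(p) == old_name]
        # If several equally good candidates remain, the choice is arbitrary;
        # this sketch simply takes whichever was listed first.
        chosen = (same_name or candidates)[0]
        candidates.remove(chosen)  # a destination is used at most once
        renames[old_path] = chosen
    return renames


# The situation from the example: one original, two identical copies.
deleted = {"orig/foo.txt": b"hello\n"}
added = {"dir2/foo.txt": b"hello\n", "dir3/foo.txt": b"hello\n"}
print(match_exact_renames(deleted, added))
```

Both `dir2/foo.txt` and `dir3/foo.txt` are equally valid answers here, which is why a user expecting one particular pairing can be surprised.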
73+
74+ Jeremy replied to Elijah saying that this sounded like what he was
75+ observing. He gave some more examples, showing that when there are
76+ multiple 100% matches, Git didn't always pair up the files the way
77+ he wanted. Jeremy suggested that
78+ filename similarity (beyond just basename matching) be added as a
79+ secondary criterion to content similarity for rename detection, since
80+ it would help in his case.
81+
82+ Elijah replied that he had tried a few filename similarity ideas,
83+ and added a "same basename" criterion for inexact renames in the
84+ ` ort ` merge strategy along these lines. However, he said other
85+ filename similarity measurements he tried didn't work out so well.
86+ He mentioned that they risk being repository-specific (in a way
87+ where they help with merges in some repositories but actually hurt
88+ in others). He also mentioned a rather counter-intuitive result:
89+ the cost of filename comparisons could rival that of content
90+ comparisons, which means such measurements could adversely affect
91+ performance and possibly even throw a monkey wrench into several of
92+ the existing performance optimizations in the current merge
93+ algorithm.
94+
95+ The thread also involved additional explanations about various facts
96+ involving rename detection. This included details about how renames
97+ are just a help for developers as they are not recorded, but are
98+ instead computed from scratch in response to user commands. It also
99+ included details about what things like "added by both" mean
100+ (namely that both sides added the same filename but with different
101+ contents), why you never see "deleted by both" as a conflict status
102+ (there is no conflict; the file can just be deleted), and other
103+ minor points.
104+
105+ Elijah also brought up a slightly more common case that mirrors the
106+ problems Jeremy saw, where users could be surprised by the per-file
107+ content similarity matching that Git does. This more general case
108+ arises from having multiple copies of a versioned library. For
109+ example, you may have a "base" version with a directory named
110+ "library-x-1.7/", and a "stable" version has many changes in that
111+ directory, while a "development" branch has removed that directory
112+ but has added both a "library-x-1.8/" and a "library-x-1.9/"
113+ directory which both have changes compared to "library-x-1.7/". In
114+ such a case, if you are trying to cherry-pick a commit involving
115+ several files modified under "library-x-1.7/", where do the changes
116+ get applied? Some users might expect the changes in that commit to
117+ get applied to "library-x-1.8/", while others might expect them to
118+ get applied to "library-x-1.9/". In practice, though, it would not
119+ be uncommon for Git to apply the changes from some of the files in
120+ the commit to "library-x-1.8/" and changes from other files in the
121+ commit to "library-x-1.9/". Elijah explained why this happens and
122+ suggested a hack for users dealing with this particular kind of case
123+ to work around rename detection.
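
To see how per-file matching can split one commit across two directories, here is a small simulation. It rests on loud assumptions: the file names and contents are invented for illustration, and the line-based Jaccard index in `line_similarity` is only a stand-in for Git's real content similarity measure, which also applies thresholds and other heuristics not modeled here.

```python
def line_similarity(a, b):
    """Share of lines two texts have in common (Jaccard index on line sets)."""
    lines_a, lines_b = set(a.splitlines()), set(b.splitlines())
    if not lines_a and not lines_b:
        return 1.0
    return len(lines_a & lines_b) / len(lines_a | lines_b)


def best_matches(old_files, new_files):
    """For each old path, independently pick the most similar new path."""
    return {
        old: max(new_files, key=lambda new: line_similarity(text, new_files[new]))
        for old, text in old_files.items()
    }


# Hypothetical contents: each 1.7 file happens to survive mostly intact
# in a different one of the two newer directories.
base = {
    "library-x-1.7/parse.c": "init\nparse\nfree\n",
    "library-x-1.7/emit.c": "open\nemit\nclose\n",
}
development = {
    "library-x-1.8/parse.c": "init\nparse\nfree\nlog\n",       # close to 1.7
    "library-x-1.9/parse.c": "setup\nparse2\nrelease\nlog\n",  # heavily reworked
    "library-x-1.8/emit.c": "start\nemit2\nstop\n",            # heavily reworked
    "library-x-1.9/emit.c": "open\nemit\nclose\nflush\n",      # close to 1.7
}

matches = best_matches(base, development)
for old, new in matches.items():
    print(old, "->", new)
```

Because every file is matched on its own, `parse.c` lands in "library-x-1.8/" while `emit.c` lands in "library-x-1.9/", even though a user would probably think of the directory as a single unit.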
83124
84125 Philip Oakley then chimed into the discussion to suggest using
85126 "BLOBSAME" for exact renames in the same way as "TREESAME" is used
86- in ` git log ` for history simplification.
87-
88- Elijah replied to Philip that he thinks that 'exact rename' already
89- works. He then discussed the possible simplifications in the rename
90- detection algorithm that can be made when 'exact rename' happens for
91- a file or a directory.
92-
93- Junio C Hamano, the Git maintainer, then chimed into the discussion
94- saying that "TREESAME" is a property of commits, not trees. So he
95- suggested using different words than "BLOBSAME" and "TREESAME" in
96- the context of rename detection.
97-
98- Philip and Elijah discussed terminology again, agreeing that a good
99- one could help people coming from an "old centralised VCS" make the
100- mind shift to understand Git's model. They didn't find something
101- better than 'exact rename' to help in this case though.
102-
103- As Elijah used the "spanhash representation" words, Philip asked for
104- more information about this way of computing file content
105- similarity. As for rename detection, Elijah explained it
106- comprehensively and supported with a number of arguments his claim
127+ in ` git log ` for history simplification. Elijah replied to Philip
128+ that he thought 'exact rename' already works. Junio C Hamano,
129+ the Git maintainer, then chimed in to say that
130+ "TREESAME" is a property of commits, not trees. So he suggested
131+ using different words than "BLOBSAME" and "TREESAME" in the context
132+ of rename detection.
133+
134+ Philip and Elijah discussed terminology at greater length, agreeing
135+ that good terminology can sometimes help people coming from an "old
136+ centralised VCS" make the mind shift to understand Git's model, but
137+ they didn't find anything that would help in this case.
138+
139+ Finally, Philip requested more information about how Git computes
140+ file content similarity (for inexact rename detection), referencing
141+ Elijah's mention of "spanhash representation". Elijah explained the
142+ internal data structure in detail, and supported his earlier claim
107143 that "comparison of filenames can rival cost of file content
108144 similarity".
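
For readers curious what a span-hash style comparison might look like, the following is a simplified sketch of the general idea: cut each file into short spans, hash them, and count how many bytes the two files share. The chunking rule (a span ends at a newline or after 64 bytes), the use of CRC32, and the scoring formula are illustrative assumptions here and should not be read as a description of Git's internal data structure.

```python
import zlib
from collections import Counter


def span_counts(data, max_span=64):
    """Map span hash -> number of bytes covered by spans with that hash.

    A span ends at a newline or after `max_span` bytes (assumed rule).
    """
    counts = Counter()
    start = 0
    for i, byte in enumerate(data):
        if byte == 0x0A or i - start + 1 == max_span:
            span = data[start:i + 1]
            counts[zlib.crc32(span)] += len(span)
            start = i + 1
    if start < len(data):  # trailing span without a final newline
        span = data[start:]
        counts[zlib.crc32(span)] += len(span)
    return counts


def similarity(src, dst):
    """Bytes shared by src and dst, as a fraction of the larger file."""
    a, b = span_counts(src), span_counts(dst)
    copied = sum(min(n, b[h]) for h, n in a.items())
    biggest = max(len(src), len(dst))
    return copied / biggest if biggest else 1.0


print(similarity(b"a\nb\nc\nd\n", b"a\nb\nx\ny\n"))  # 0.5
print(similarity(b"a\nb\nc\n", b"a\nb\nc\n"))        # 1.0
```

In a representation like this, each file is summarized once and every pairwise comparison only walks two small tables, which gives some intuition for why a filename comparison is not automatically cheaper than a content comparison.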
109145