@@ -78,6 +78,253 @@ This edition covers what happened during the month of July 2019.
7878### Support
7979-->
8080
81+ ## An Introduction to git-filter-repo
82+
83+ There is a new tool available for surgery on git repositories:
84+ [git-filter-repo](https://github.com/newren/git-filter-repo). It
85+ claims to have [many new unique
86+ features](https://github.com/newren/git-filter-repo#design-rationale-behind-filter-repo-why-create-a-new-tool),
87+ [good
88+ performance](https://public-inbox.org/git/CABPp-BGOz8nks0+Tdw5GyGqxeYR-3FF6FT5JcgVqZDYVRQ6qog@mail.gmail.com/),
89+ and an ability to scale -- from making simple history rewrites
90+ trivial, to facilitating the creation of entirely new tools which
91+ leverage existing capabilities to handle more complex cases.
92+
93+ You can read more about [common use cases and base capabilities of
94+ filter-repo](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/Documentation/git-filter-repo.txt#L17-L55),
95+ but in this article, I'd like to focus on two things: providing a simple
96+ example to give a brief flavor of git-filter-repo usage, and answering a
97+ few likely questions about its purpose and rationale (including a short
98+ comparison to other tools). I will provide several links along the way for
99+ curious folks to learn more.
100+
101+ ### A simple example
102+
103+ Let's start with a simple example that has come up a lot for me:
104+ extracting a piece of an existing repository and preparing it to be
105+ merged into some larger monorepository. So, we want to:
106+
107+ * extract the history of a single directory, src/. This means that only
108+ paths under src/ remain in the repo, and any commits that only touched
109+ paths outside this directory will be removed.
110+ * rename all files to have a new leading directory, my-module/ (e.g. so that
111+ src/foo.c becomes my-module/src/foo.c)
112+ * rename any tags in the extracted repository to have a 'my-module-'
113+ prefix (to avoid any conflicts when we later merge this repo into
114+ something else)
115+
116+ Doing this with filter-repo is as simple as the following command:
117+ ``` shell
118+ git filter-repo --path src/ --to-subdirectory-filter my-module --tag-rename '':'my-module-'
119+ ```
120+ (the single quotes are unnecessary, but make it clearer to a human that we
121+ are replacing the empty string as a prefix with `my-module-`)
122+
123+ By contrast, filter-branch comes with a pile of caveats even once you
124+ figure out the necessary (OS-dependent) invocation(s):
125+
126+ ``` shell
127+ git filter-branch --index-filter 'git ls-files | grep -v ^src/ | xargs git rm -q --cached; git ls-files -s | sed "s-$(printf \\t)-&my-module/-" | git update-index --index-info; git ls-files | grep -v ^my-module/ | xargs git rm -q --cached' --tag-name-filter 'echo "my-module-$(cat)"' --prune-empty -- --all
128+ git clone file://$(pwd) newcopy
129+ cd newcopy
130+ git for-each-ref --format="delete %(refname)" refs/tags/ | grep -v refs/tags/my-module- | git update-ref --stdin
131+ git gc --prune=now
132+ ```
133+
134+ BFG is not capable of this type of rewrite, and this type of rewrite is
135+ difficult to perform safely using fast-export and fast-import directly.
136+
137+ You can find a lot more examples in [filter-repo's
138+ manpage](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/Documentation/git-filter-repo.txt#L434).
139+ (If you are curious about the "pile of caveats" mentioned above or the
140+ reasons for the extra steps for filter-branch, you can [read more
141+ details about this
142+ example](https://github.com/newren/git-filter-repo#example-usage-comparing-to-filter-branch).)
143+
144+ ### Why a new tool instead of contributing to other tools?
145+
146+ There are two well-known tools in the repository rewriting space:
147+
148+ * [git-filter-branch](https://git-scm.com/docs/git-filter-branch)
149+ * [BFG Repo Cleaner](https://rtyley.github.io/bfg-repo-cleaner/)
150+
151+ and two lesser-known tools:
152+
153+ * [reposurgeon](http://www.catb.org/~esr/reposurgeon/reposurgeon.html)
154+ * [git-fast-export](https://git-scm.com/docs/git-fast-export) and
155+ [git-fast-import](https://git-scm.com/docs/git-fast-import)
156+
157+ (While fast-export and fast-import themselves are well known, they are
158+ usually thought of as export-to-another-VCS or import-from-another-VCS
159+ tools, though they also work for git->git transitions.)
160+
161+ I will briefly discuss each.
162+
163+ #### filter-branch and BFG
164+
165+ It's natural to ask why, if these well-known tools lacked features I
166+ wanted, they could not have been extended instead of creating a new tool.
167+ In short, they were philosophically the wrong starting point for extension
168+ and they also had the wrong architecture or design to support such an
169+ effort.
170+
171+ From the philosophical angle:
172+
173+ * BFG: easy-to-use flags for some common cases, but not extensible
174+ * filter-branch: relatively versatile capability via user-specified
175+ shell commands, but rapidly becomes very difficult to use beyond
176+ trivial cases, especially as its usability defaults increasingly
177+ conflict and cause problems.
178+
179+ I wanted something that made the easy cases simple like BFG, but which
180+ would scale up to more difficult cases and have versatility beyond that
181+ which filter-branch provides.
182+
183+ From the technical architecture/design angle:
184+
185+ * BFG: works on packfiles and packed-refs, directly rewriting tree and
186+ blob objects; Roberto proved you can get a lot done with this design
187+ with his work on the BFG (as many people who have used his tool can
188+ attest), but this design does not permit things like differentiating
189+ paths in different directories with the same basename nor could it be
190+ used to allow renaming of paths (except within the same directory).
191+ Further, this design sadly runs into a
192+ [lot](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/contrib/filter-repo-demos/bfg-ish#L32-L39)
193+ [of](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/contrib/filter-repo-demos/bfg-ish#L29-L31)
194+ [roadblocks](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/contrib/filter-repo-demos/bfg-ish#L23-L26)
195+ [and](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/contrib/filter-repo-demos/bfg-ish#L64-L66)
196+ [limitations](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/contrib/filter-repo-demos/bfg-ish#L27-L28)
197+ even within its intended use case of removing big or sensitive content.
198+
199+ * filter-branch: performance really shouldn't matter for a one-shot
200+ tool, but filter-branch can turn a few-hour rewrite
201+ (allowing an overnight downtime) into an intractable three-month
202+ wait. Further, its design architecture leaks through every level
203+ of the interface, making it nearly impossible to change anything
204+ about the slow design without having backward compatibility
205+ issues. These issues are well known, but what is less well known
206+ is that even ignoring performance, [the usability choices in
207+ filter-branch rapidly become increasingly conflicting and
208+ problematic](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/contrib/filter-repo-demos/filter-lamely#L9-L61)
209+ for users with larger repos and more involved rewrites,
210+ difficulties that again cannot be ameliorated without breaking
211+ backward compatibility.
212+
213+ #### reposurgeon
214+
215+ Some brief impressions about reposurgeon:
216+
217+ * Appears to be
218+ [almost](http://www.catb.org/~esr/reposurgeon/features.html)
219+ [exclusively](http://www.catb.org/~esr/reposurgeon/dvcs-migration-guide.html)
220+ focused on transitioning between different version control systems
221+ (cvs, svn, hg, bzr, git, etc.), and in particular handling the myriad
222+ edge and corner cases that arise in transitioning from CVS or SVN to a
223+ DVCS.
224+ * Provides very thorough reference-style documentation; if you read all
225+ reposurgeon documentation, you will likely feel as though you can take
226+ an existing example and modify it in many ways.
227+ * [Absolutely no full-fledged
228+ examples](https://public-inbox.org/git/CAA01Csq0eX2L5cKpjjySs+4e0Sm+vp=10C_SAkE6CLpCHBWZ8g@mail.gmail.com/)
229+ [or user-guide style
230+ documentation](https://public-inbox.org/git/CAA01Csp+RpCXO4noewBOMY6qJiBy=Gshv3rUh83ZY2RJ5Th3Ww@mail.gmail.com/)
231+ are provided for getting started.
232+ * Appears to not have any facilities for quick (in terms of time spent by
233+ the user) conversions similar to filter-branch, BFG, or filter-repo.
234+ Users who want such capabilities are likely to be frustrated by
235+ reposurgeon and give up.
236+ * Strikes me as "GDB for history rewriting": it has lots of facilities
237+ for manually inspecting and editing, but is not intended for the
238+ first-time or casual history spelunker. Only those who view history
239+ spelunking as a frequent hobby or job are likely to dive in. And it's
240+ not quite clear whether it is only useful to those transitioning from
241+ CVS/SVN or whether the facilities would also be useful to others.
242+ * Built on top of fast-export and fast-import, which I contend is the
243+ right architecture for a history filtering tool (see below).
244+
245+ I have read the reposurgeon documentation multiple times over the years,
246+ and am almost at a point where I feel like I know how to get started with
247+ it. I haven't had a need to convert a CVS or SVN repo in over a decade; if
248+ I had such a need, perhaps I'd persevere and learn more about it. I
249+ suspect it has some good ideas I could apply to filter-repo. But I haven't
250+ managed to get started with reposurgeon, so clearly my impressions of it
251+ should be taken with a grain of salt.
252+
253+ #### fast-export and fast-import
254+
255+ Finally, fast-export and fast-import can be used with a little editing of
256+ the fast-export output to handle a number of history rewriting cases. I
257+ have done this many times, but it has some
258+ [drawbacks](https://public-inbox.org/git/CABPp-BGL-3_nhZSpt0Bz0EVY-6-mcbgZMmx4YcXEfA_ZrTqFUw@mail.gmail.com/):
259+
260+ * Dangerous for programmatic edits: It's tempting to use sed or perl
261+ one-liners to e.g. try to modify filenames, but you risk accidentally
262+ also munging unrelated data such as commit messages, file contents, and
263+ branch and tag names.
264+ * Easy to miss corner cases: for example, fast-export only quotes
265+ filenames when necessary; as such, your attempt to rename a directory
266+ might leave files with spaces or UTF-8 characters in their original
267+ location.
268+ * Difficult to directly provide higher level facilities: for example,
269+ rewriting (possibly abbreviated) commit hashes in commit messages to
270+ refer to the new commit hashes, or stripping of non-merge commits which
271+ become empty or merge commits which become degenerate and empty.
272+ * Misses a lot of pieces needed to round things out into a usable
273+ tool.
274+
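The first two drawbacks can be made concrete with a small sketch. The stream below is a hand-written fragment that only resembles fast-export output (the file names and commit message are made up, and blob, author, and committer lines are omitted); it is not taken from a real repository. A naive sed rename also rewrites the commit message (which, as a further hazard, would no longer match its `data` byte count), while a more careful, anchored pattern leaves the message alone but skips the path that fast-export quoted because it contains a space.

```shell
# Hand-written fragment resembling fast-export output (illustrative only).
stream=$(cat <<'EOF'
commit refs/heads/master
mark :3
data 29
Move helpers into src/util.c

M 100644 :1 src/util.c
M 100644 :2 "src/my notes.txt"
EOF
)

# Naive rename: also touches the commit message, not just the file paths.
naive=$(printf '%s\n' "$stream" | sed 's|src/|my-module/src/|')

# Anchored rename: only matches unquoted paths on filemodify ("M") lines,
# so the quoted path with a space in it stays in its original location.
anchored=$(printf '%s\n' "$stream" \
  | sed 's|^\(M [0-7]* [^ ]* \)src/|\1my-module/src/|')

echo "--- naive ---"
printf '%s\n' "$naive"
echo "--- anchored ---"
printf '%s\n' "$anchored"
```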
275+ However, fast-export and fast-import are the right architecture for
276+ building a repository filtering tool on top of; they are fast, provide
277+ access to almost all aspects of a repository in a very machine-parseable
278+ format, and will continue to gain features and capabilities over time
279+ (e.g. when replace refs were added, fast-export and fast-import immediately
280+ gained support). To create a full repository surgery tool, you "just" need
281+ to [combine fast-export and fast-import together with a whole lot of
282+ parsing and
283+ glue](https://github.com/newren/git-filter-repo#how-filter-repo-works),
284+ which, in a nutshell, is what filter-repo is.
285+
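As a rough illustration of that architecture, the sketch below pipes fast-export straight into fast-import with no filtering in between, which is the degenerate case of such a tool. It assumes git is installed; the temporary directories, identity, and file name are all made up for the example. A filtering tool would sit in the middle of the pipe, parsing and rewriting the stream.

```shell
set -eu

# Throwaway identity and directories (hypothetical, for illustration only).
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"
work=$(mktemp -d)

# Create a tiny source repository with a single commit.
git init -q "$work/orig"
echo hello > "$work/orig/a.txt"
git -C "$work/orig" add a.txt
git -C "$work/orig" commit -q -m "initial commit"

# Export the full history and import it, unmodified, into an empty repo.
git init -q "$work/copy"
git -C "$work/orig" fast-export --all | git -C "$work/copy" fast-import --quiet

# With nothing rewritten in between, the imported commit should be identical.
orig_head=$(git -C "$work/orig" rev-parse HEAD)
copy_head=$(git -C "$work/copy" rev-parse HEAD)
echo "orig: $orig_head"
echo "copy: $copy_head"
```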
286+ #### Upstream improvements
287+
288+ But to circle back to the question of improving existing tools, during the
289+ development of filter-repo and its predecessor, lots of [improvements to
290+ both fast-export and
291+ fast-import](https://github.com/newren/git-filter-repo/tree/develop#upstream-improvements)
292+ were submitted and included in git.git.
293+
294+ (Also, [filter-repo started in early 2009 as
295+ git_fast_filter.py](https://public-inbox.org/git/[email protected]/)
296+ and therefore technically predates both BFG and reposurgeon.)
297+
298+ ### Why not a builtin command?
299+
300+ One could ask why this new command is not written in C like most of git.
301+ While that would have several advantages, it doesn't meet the necessary
302+ design requirements. See the ["VERSATILITY" section of the
303+ manpage](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/Documentation/git-filter-repo.txt#L306-L326)
304+ or see the "Versatility" section under the [Design Rationale of the
305+ README](https://github.com/newren/git-filter-repo#design-rationale-behind-filter-repo-why-create-a-new-tool).
306+
307+ Technically, we could perhaps provide a mechanism for people to write
308+ and compile plugins that a builtin command could load, but having users
309+ write filtering functions in C sounds suboptimal, and requiring gcc for
310+ filter-repo sounds more onerous than using python.
311+
312+ ### Where to from here?
313+
314+ This was just a quick intro to filter-repo, and I've provided a lot of
315+ links above if you want to learn more. Just a few more that might be of
316+ interest:
317+
318+ * [Ramifications of repository
319+ rewrites](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/Documentation/git-filter-repo.txt#L340-L350),
320+ including
321+ [some](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/Documentation/git-filter-repo.txt#L376-L410)
322+ [tips](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/Documentation/git-filter-repo.txt#L426-L431)
323+ (not specific to filter-repo)
324+ * [Finding big objects/directories/extensions (and renames) in your
325+ repo](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/Documentation/git-filter-repo.txt#L356-L361)
326+ (can be used together with tools other than filter-repo too)
327+ * [Creating new history rewriting tools](https://github.com/newren/git-filter-repo/tree/master/contrib/filter-repo-demos)
81328
82329## Developer Spotlight: Jean-Noël Avila
83330