Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 1 | = My First Object Walk |
| 2 | |
| 3 | == What's an Object Walk? |
| 4 | |
| 5 | The object walk is a key concept in Git - this is the process that underpins |
| 6 | operations like object transfer and fsck. Beginning from a given commit, the |
| 7 | list of objects is found by walking parent relationships between commits (commit |
| 8 | X based on commit W) and containment relationships between objects (tree Y is |
| 9 | contained within commit X, and blob Z is located within tree Y, giving our |
| 10 | working tree for commit X something like `y/z.txt`). |
| 11 | |
| 12 | A related concept is the revision walk, which is focused on commit objects and |
| 13 | their parent relationships and does not delve into other object types. The |
| 14 | revision walk is used for operations like `git log`. |
| 15 | |
| 16 | === Related Reading |
| 17 | |
| 18 | - `Documentation/user-manual.txt` under "Hacking Git" contains some coverage of |
| 19 | the revision walker in its various incarnations. |
Junio C Hamano | 2267da5 | 2019-12-18 23:09:43 | [diff] [blame] | 20 | - `revision.h` |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 21 | - https://eagain.net/articles/git-for-computer-scientists/[Git for Computer Scientists] |
| 22 | gives a good overview of the types of objects in Git and what your object |
| 23 | walk is really describing. |
| 24 | |
| 25 | == Setting Up |
| 26 | |
| 27 | Create a new branch from `master`. |
| 28 | |
| 29 | ---- |
| 30 | git checkout -b revwalk origin/master |
| 31 | ---- |
| 32 | |
| 33 | We'll put our fiddling into a new command. For fun, let's name it `git walken`. |
| 34 | Open up a new file `builtin/walken.c` and set up the command handler: |
| 35 | |
| 36 | ---- |
| 37 | /* |
| 38 | * "git walken" |
| 39 | * |
| 40 | * Part of the "My First Object Walk" tutorial. |
| 41 | */ |
| 42 | |
| 43 | #include "builtin.h" |
| 44 | |
| 45 | int cmd_walken(int argc, const char **argv, const char *prefix) |
| 46 | { |
| 47 | trace_printf(_("cmd_walken incoming...\n")); |
| 48 | return 0; |
| 49 | } |
| 50 | ---- |
| 51 | |
| 52 | NOTE: `trace_printf()` differs from `printf()` in that it can be turned on or |
| 53 | off at runtime. For the purposes of this tutorial, we will write `walken` as |
| 54 | though it is intended for use as a "plumbing" command: that is, a command which |
| 55 | is used primarily in scripts, rather than interactively by humans (a "porcelain" |
| 56 | command). So we will send our debug output to `trace_printf()` instead. When |
| 57 | running, enable trace output by setting the environment variable `GIT_TRACE`. |
| 58 | |
| 59 | Add usage text and `-h` handling, like all subcommands should consistently do |
| 60 | (our test suite will notice and complain if you fail to do so). |
Junio C Hamano | 59a32b0 | 2021-12-10 22:53:38 | [diff] [blame] | 61 | We'll need to include the `parse-options.h` header. |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 62 | |
| 63 | ---- |
Junio C Hamano | 59a32b0 | 2021-12-10 22:53:38 | [diff] [blame] | 64 | #include "parse-options.h" |
| 65 | |
| 66 | ... |
| 67 | |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 68 | int cmd_walken(int argc, const char **argv, const char *prefix) |
| 69 | { |
| 70 | const char * const walken_usage[] = { |
| 71 | N_("git walken"), |
| 72 | NULL, |
Junio C Hamano | 59a32b0 | 2021-12-10 22:53:38 | [diff] [blame] | 73 | }; |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 74 | struct option options[] = { |
| 75 | OPT_END() |
| 76 | }; |
| 77 | |
| 78 | argc = parse_options(argc, argv, prefix, options, walken_usage, 0); |
| 79 | |
| 80 | ... |
| 81 | } |
| 82 | ---- |
| 83 | |
| 84 | Also add the relevant line in `builtin.h` near `cmd_whatchanged()`: |
| 85 | |
| 86 | ---- |
| 87 | int cmd_walken(int argc, const char **argv, const char *prefix); |
| 88 | ---- |
| 89 | |
| 90 | Include the command in `git.c` in `commands[]` near the entry for `whatchanged`, |
| 91 | maintaining alphabetical ordering: |
| 92 | |
| 93 | ---- |
| 94 | { "walken", cmd_walken, RUN_SETUP }, |
| 95 | ---- |
| 96 | |
| 97 | Add it to the `Makefile` near the line for `builtin/worktree.o`: |
| 98 | |
| 99 | ---- |
| 100 | BUILTIN_OBJS += builtin/walken.o |
| 101 | ---- |
| 102 | |
| 103 | Build and test out your command, without forgetting to ensure the `DEVELOPER` |
| 104 | flag is set, and with `GIT_TRACE` enabled so the debug output can be seen: |
| 105 | |
| 106 | ---- |
| 107 | $ echo DEVELOPER=1 >>config.mak |
| 108 | $ make |
| 109 | $ GIT_TRACE=1 ./bin-wrappers/git walken |
| 110 | ---- |
| 111 | |
| 112 | NOTE: For a more exhaustive overview of the new command process, take a look at |
| 113 | `Documentation/MyFirstContribution.txt`. |
| 114 | |
| 115 | NOTE: A reference implementation can be found at |
| 116 | https://github.com/nasamuffin/git/tree/revwalk. |
| 117 | |
| 118 | === `struct rev_cmdline_info` |
| 119 | |
| 120 | The definition of `struct rev_cmdline_info` can be found in `revision.h`. |
| 121 | |
| 122 | This struct is contained within the `rev_info` struct and is used to reflect |
| 123 | parameters provided by the user over the CLI. |
| 124 | |
| 125 | `nr` represents the number of `rev_cmdline_entry` present in the array. |
| 126 | |
Junio C Hamano | 2267da5 | 2019-12-18 23:09:43 | [diff] [blame] | 127 | `alloc` is used by the `ALLOC_GROW` macro. Check `cache.h` - this variable is |
| 128 | used to track the allocated size of the list. |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 129 | |
| 130 | Per entry, we find: |
| 131 | |
| 132 | `item` is the object provided upon which to base the object walk. Items in Git |
| 133 | can be blobs, trees, commits, or tags. (See `Documentation/gittutorial-2.txt`.) |
| 134 | |
| 135 | `name` is the object ID (OID) of the object - a hex string you may be familiar |
| 136 | with from using Git to organize your source in the past. Check the tutorial |
| 137 | mentioned above towards the top for a discussion of where the OID can come |
| 138 | from. |
| 139 | |
| 140 | `whence` indicates some information about what to do with the parents of the |
| 141 | specified object. We'll explore this flag more later on; take a look at |
| 142 | `Documentation/revisions.txt` to get an idea of what could set the `whence` |
| 143 | value. |
| 144 | |
| 145 | `flags` are used to hint the beginning of the revision walk and are the first |
| 146 | block under the `#include`s in `revision.h`. The most likely ones to be set in |
| 147 | the `rev_cmdline_info` are `UNINTERESTING` and `BOTTOM`, but these same flags |
| 148 | can be used during the walk, as well. |
| 149 | |
| 150 | === `struct rev_info` |
| 151 | |
| 152 | This one is quite a bit longer, and many fields are only used during the walk |
| 153 | by `revision.c` - not configuration options. Most of the configurable flags in |
| 154 | `struct rev_info` have a mirror in `Documentation/rev-list-options.txt`. It's a |
| 155 | good idea to take some time and read through that document. |
| 156 | |
| 157 | == Basic Commit Walk |
| 158 | |
| 159 | First, let's see if we can replicate the output of `git log --oneline`. We'll |
| 160 | refer back to the implementation frequently to discover norms when performing |
| 161 | an object walk of our own. |
| 162 | |
| 163 | To do so, we'll first find all the commits, in order, which preceded the current |
| 164 | commit. We'll extract the name and subject of the commit from each. |
| 165 | |
| 166 | Ideally, we will also be able to find out which ones are currently at the tip of |
| 167 | various branches. |
| 168 | |
| 169 | === Setting Up |
| 170 | |
| 171 | Preparing for your object walk has some distinct stages. |
| 172 | |
| 173 | 1. Perform default setup for this mode, and others which may be invoked. |
| 174 | 2. Check configuration files for relevant settings. |
| 175 | 3. Set up the `rev_info` struct. |
| 176 | 4. Tweak the initialized `rev_info` to suit the current walk. |
| 177 | 5. Prepare the `rev_info` for the walk. |
| 178 | 6. Iterate over the objects, processing each one. |
| 179 | |
| 180 | ==== Default Setups |
| 181 | |
| 182 | Before examining configuration files which may modify command behavior, set up |
| 183 | default state for switches or options your command may have. If your command |
| 184 | utilizes other Git components, ask them to set up their default states as well. |
| 185 | For instance, `git log` takes advantage of `grep` and `diff` functionality, so |
| 186 | its `init_log_defaults()` sets its own state (`decoration_style`) and asks |
| 187 | `grep` and `diff` to initialize themselves by calling each of their |
| 188 | initialization functions. |
| 189 | |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 190 | ==== Configuring From `.gitconfig` |
| 191 | |
| 192 | Next, we should have a look at any relevant configuration settings (i.e., |
| 193 | settings readable and settable from `git config`). This is done by providing a |
| 194 | callback to `git_config()`; within that callback, you can also invoke methods |
| 195 | from other components you may need that need to intercept these options. Your |
| 196 | callback will be invoked once per each configuration value which Git knows about |
| 197 | (global, local, worktree, etc.). |
| 198 | |
| 199 | Similarly to the default values, we don't have anything to do here yet |
| 200 | ourselves; however, we should call `git_default_config()` if we aren't calling |
| 201 | any other existing config callbacks. |
| 202 | |
Junio C Hamano | 59a32b0 | 2021-12-10 22:53:38 | [diff] [blame] | 203 | Add a new function to `builtin/walken.c`. |
| 204 | We'll also need to include the `config.h` header: |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 205 | |
| 206 | ---- |
Junio C Hamano | 59a32b0 | 2021-12-10 22:53:38 | [diff] [blame] | 207 | #include "config.h" |
| 208 | |
| 209 | ... |
| 210 | |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 211 | static int git_walken_config(const char *var, const char *value, void *cb) |
| 212 | { |
| 213 | /* |
| 214 | * For now, we don't have any custom configuration, so fall back to |
| 215 | * the default config. |
| 216 | */ |
| 217 | return git_default_config(var, value, cb); |
| 218 | } |
| 219 | ---- |
| 220 | |
| 221 | Make sure to invoke `git_config()` with it in your `cmd_walken()`: |
| 222 | |
| 223 | ---- |
| 224 | int cmd_walken(int argc, const char **argv, const char *prefix) |
| 225 | { |
| 226 | ... |
| 227 | |
| 228 | git_config(git_walken_config, NULL); |
| 229 | |
| 230 | ... |
| 231 | } |
| 232 | ---- |
| 233 | |
| 234 | ==== Setting Up `rev_info` |
| 235 | |
| 236 | Now that we've gathered external configuration and options, it's time to |
| 237 | initialize the `rev_info` object which we will use to perform the walk. This is |
| 238 | typically done by calling `repo_init_revisions()` with the repository you intend |
| 239 | to target, as well as the `prefix` argument of `cmd_walken` and your `rev_info` |
| 240 | struct. |
| 241 | |
Junio C Hamano | 59a32b0 | 2021-12-10 22:53:38 | [diff] [blame] | 242 | Add the `struct rev_info` and the `repo_init_revisions()` call. |
| 243 | We'll also need to include the `revision.h` header: |
| 244 | |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 245 | ---- |
Junio C Hamano | 59a32b0 | 2021-12-10 22:53:38 | [diff] [blame] | 246 | #include "revision.h" |
| 247 | |
| 248 | ... |
| 249 | |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 250 | int cmd_walken(int argc, const char **argv, const char *prefix) |
| 251 | { |
| 252 | /* This can go wherever you like in your declarations.*/ |
| 253 | struct rev_info rev; |
| 254 | ... |
| 255 | |
| 256 | /* This should go after the git_config() call. */ |
| 257 | repo_init_revisions(the_repository, &rev, prefix); |
| 258 | |
| 259 | ... |
| 260 | } |
| 261 | ---- |
| 262 | |
| 263 | ==== Tweaking `rev_info` For the Walk |
| 264 | |
| 265 | We're getting close, but we're still not quite ready to go. Now that `rev` is |
| 266 | initialized, we can modify it to fit our needs. This is usually done within a |
| 267 | helper for clarity, so let's add one: |
| 268 | |
| 269 | ---- |
| 270 | static void final_rev_info_setup(struct rev_info *rev) |
| 271 | { |
| 272 | /* |
| 273 | * We want to mimic the appearance of `git log --oneline`, so let's |
| 274 | * force oneline format. |
| 275 | */ |
| 276 | get_commit_format("oneline", rev); |
| 277 | |
| 278 | /* Start our object walk at HEAD. */ |
| 279 | add_head_to_pending(rev); |
| 280 | } |
| 281 | ---- |
| 282 | |
| 283 | [NOTE] |
| 284 | ==== |
| 285 | Instead of using the shorthand `add_head_to_pending()`, you could do |
| 286 | something like this: |
| 287 | ---- |
| 288 | struct setup_revision_opt opt; |
| 289 | |
| 290 | memset(&opt, 0, sizeof(opt)); |
| 291 | opt.def = "HEAD"; |
| 292 | opt.revarg_opt = REVARG_COMMITTISH; |
| 293 | setup_revisions(argc, argv, rev, &opt); |
| 294 | ---- |
| 295 | Using a `setup_revision_opt` gives you finer control over your walk's starting |
| 296 | point. |
| 297 | ==== |
| 298 | |
| 299 | Then let's invoke `final_rev_info_setup()` after the call to |
| 300 | `repo_init_revisions()`: |
| 301 | |
| 302 | ---- |
| 303 | int cmd_walken(int argc, const char **argv, const char *prefix) |
| 304 | { |
| 305 | ... |
| 306 | |
| 307 | final_rev_info_setup(&rev); |
| 308 | |
| 309 | ... |
| 310 | } |
| 311 | ---- |
| 312 | |
| 313 | Later, we may wish to add more arguments to `final_rev_info_setup()`. But for |
| 314 | now, this is all we need. |
| 315 | |
| 316 | ==== Preparing `rev_info` For the Walk |
| 317 | |
| 318 | Now that `rev` is all initialized and configured, we've got one more setup step |
| 319 | before we get rolling. We can do this in a helper, which will both prepare the |
| 320 | `rev_info` for the walk, and perform the walk itself. Let's start the helper |
| 321 | with the call to `prepare_revision_walk()`, which can return an error without |
| 322 | dying on its own: |
| 323 | |
| 324 | ---- |
| 325 | static void walken_commit_walk(struct rev_info *rev) |
| 326 | { |
| 327 | if (prepare_revision_walk(rev)) |
| 328 | die(_("revision walk setup failed")); |
| 329 | } |
| 330 | ---- |
| 331 | |
| 332 | NOTE: `die()` prints to `stderr` and exits the program. Since it will print to |
| 333 | `stderr` it's likely to be seen by a human, so we will localize it. |
| 334 | |
| 335 | ==== Performing the Walk! |
| 336 | |
| 337 | Finally! We are ready to begin the walk itself. Now we can see that `rev_info` |
| 338 | can also be used as an iterator; we move to the next item in the walk by using |
| 339 | `get_revision()` repeatedly. Add the listed variable declarations at the top and |
| 340 | the walk loop below the `prepare_revision_walk()` call within your |
| 341 | `walken_commit_walk()`: |
| 342 | |
| 343 | ---- |
| 344 | static void walken_commit_walk(struct rev_info *rev) |
| 345 | { |
| 346 | struct commit *commit; |
| 347 | struct strbuf prettybuf = STRBUF_INIT; |
| 348 | |
| 349 | ... |
| 350 | |
| 351 | while ((commit = get_revision(rev))) { |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 352 | strbuf_reset(&prettybuf); |
| 353 | pp_commit_easy(CMIT_FMT_ONELINE, commit, &prettybuf); |
| 354 | puts(prettybuf.buf); |
| 355 | } |
| 356 | strbuf_release(&prettybuf); |
| 357 | } |
| 358 | ---- |
| 359 | |
| 360 | NOTE: `puts()` prints a `char*` to `stdout`. Since this is the part of the |
| 361 | command we expect to be machine-parsed, we're sending it directly to stdout. |
| 362 | |
| 363 | Give it a shot. |
| 364 | |
| 365 | ---- |
| 366 | $ make |
| 367 | $ ./bin-wrappers/git walken |
| 368 | ---- |
| 369 | |
| 370 | You should see all of the subject lines of all the commits in |
| 371 | your tree's history, in order, ending with the initial commit, "Initial revision |
| 372 | of "git", the information manager from hell". Congratulations! You've written |
| 373 | your first revision walk. You can play with printing some additional fields |
| 374 | from each commit if you're curious; have a look at the functions available in |
| 375 | `commit.h`. |
| 376 | |
| 377 | === Adding a Filter |
| 378 | |
| 379 | Next, let's try to filter the commits we see based on their author. This is |
| 380 | equivalent to running `git log --author=<pattern>`. We can add a filter by |
| 381 | modifying `rev_info.grep_filter`, which is a `struct grep_opt`. |
| 382 | |
Junio C Hamano | 992fbdc | 2020-12-09 00:14:29 | [diff] [blame] | 383 | First some setup. Add `grep_config()` to `git_walken_config()`: |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 384 | |
| 385 | ---- |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 386 | static int git_walken_config(const char *var, const char *value, void *cb) |
| 387 | { |
| 388 | grep_config(var, value, cb); |
| 389 | return git_default_config(var, value, cb); |
| 390 | } |
| 391 | ---- |
| 392 | |
| 393 | Next, we can modify the `grep_filter`. This is done with convenience functions |
| 394 | found in `grep.h`. For fun, we're filtering to only commits from folks using a |
| 395 | `gmail.com` email address - a not-very-precise guess at who may be working on |
| 396 | Git as a hobby. Since we're checking the author, which is a specific line in the |
| 397 | header, we'll use the `append_header_grep_pattern()` helper. We can use |
| 398 | the `enum grep_header_field` to indicate which part of the commit header we want |
| 399 | to search. |
| 400 | |
| 401 | In `final_rev_info_setup()`, add your filter line: |
| 402 | |
| 403 | ---- |
| 404 | static void final_rev_info_setup(int argc, const char **argv, |
| 405 | const char *prefix, struct rev_info *rev) |
| 406 | { |
| 407 | ... |
| 408 | |
| 409 | append_header_grep_pattern(&rev->grep_filter, GREP_HEADER_AUTHOR, |
| 410 | "gmail"); |
| 411 | compile_grep_patterns(&rev->grep_filter); |
| 412 | |
| 413 | ... |
| 414 | } |
| 415 | ---- |
| 416 | |
| 417 | `append_header_grep_pattern()` adds your new "gmail" pattern to `rev_info`, but |
| 418 | it won't work unless we compile it with `compile_grep_patterns()`. |
| 419 | |
| 420 | NOTE: If you are using `setup_revisions()` (for example, if you are passing a |
| 421 | `setup_revision_opt` instead of using `add_head_to_pending()`), you don't need |
| 422 | to call `compile_grep_patterns()` because `setup_revisions()` calls it for you. |
| 423 | |
| 424 | NOTE: We could add the same filter via the `append_grep_pattern()` helper if we |
| 425 | wanted to, but `append_header_grep_pattern()` adds the `enum grep_context` and |
| 426 | `enum grep_pat_token` for us. |
| 427 | |
| 428 | === Changing the Order |
| 429 | |
| 430 | There are a few ways that we can change the order of the commits during a |
| 431 | revision walk. Firstly, we can use the `enum rev_sort_order` to choose from some |
| 432 | typical orderings. |
| 433 | |
| 434 | `topo_order` is the same as `git log --topo-order`: we avoid showing a parent |
| 435 | before all of its children have been shown, and we avoid mixing commits which |
| 436 | are in different lines of history. (`git help log`'s section on `--topo-order` |
| 437 | has a very nice diagram to illustrate this.) |
| 438 | |
| 439 | Let's see what happens when we run with `REV_SORT_BY_COMMIT_DATE` as opposed to |
| 440 | `REV_SORT_BY_AUTHOR_DATE`. Add the following: |
| 441 | |
| 442 | ---- |
| 443 | static void final_rev_info_setup(int argc, const char **argv, |
| 444 | const char *prefix, struct rev_info *rev) |
| 445 | { |
| 446 | ... |
| 447 | |
| 448 | rev->topo_order = 1; |
| 449 | rev->sort_order = REV_SORT_BY_COMMIT_DATE; |
| 450 | |
| 451 | ... |
| 452 | } |
| 453 | ---- |
| 454 | |
| 455 | Let's output this into a file so we can easily diff it with the walk sorted by |
| 456 | author date. |
| 457 | |
| 458 | ---- |
| 459 | $ make |
| 460 | $ ./bin-wrappers/git walken > commit-date.txt |
| 461 | ---- |
| 462 | |
| 463 | Then, let's sort by author date and run it again. |
| 464 | |
| 465 | ---- |
| 466 | static void final_rev_info_setup(int argc, const char **argv, |
| 467 | const char *prefix, struct rev_info *rev) |
| 468 | { |
| 469 | ... |
| 470 | |
| 471 | rev->topo_order = 1; |
| 472 | rev->sort_order = REV_SORT_BY_AUTHOR_DATE; |
| 473 | |
| 474 | ... |
| 475 | } |
| 476 | ---- |
| 477 | |
| 478 | ---- |
| 479 | $ make |
| 480 | $ ./bin-wrappers/git walken > author-date.txt |
| 481 | ---- |
| 482 | |
| 483 | Finally, compare the two. This is a little less helpful without object names or |
| 484 | dates, but hopefully we get the idea. |
| 485 | |
| 486 | ---- |
| 487 | $ diff -u commit-date.txt author-date.txt |
| 488 | ---- |
| 489 | |
| 490 | This display indicates that commits can be reordered after they're written, for |
| 491 | example with `git rebase`. |
| 492 | |
| 493 | Let's try one more reordering of commits. `rev_info` exposes a `reverse` flag. |
| 494 | Set that flag somewhere inside of `final_rev_info_setup()`: |
| 495 | |
| 496 | ---- |
| 497 | static void final_rev_info_setup(int argc, const char **argv, const char *prefix, |
| 498 | struct rev_info *rev) |
| 499 | { |
| 500 | ... |
| 501 | |
| 502 | rev->reverse = 1; |
| 503 | |
| 504 | ... |
| 505 | } |
| 506 | ---- |
| 507 | |
| 508 | Run your walk again and note the difference in order. (If you remove the grep |
| 509 | pattern, you should see the last commit this call gives you as your current |
| 510 | HEAD.) |
| 511 | |
| 512 | == Basic Object Walk |
| 513 | |
| 514 | So far we've been walking only commits. But Git has more types of objects than |
| 515 | that! Let's see if we can walk _all_ objects, and find out some information |
| 516 | about each one. |
| 517 | |
| 518 | We can base our work on an example. `git pack-objects` prepares all kinds of |
| 519 | objects for packing into a bitmap or packfile. The work we are interested in |
| 520 | resides in `builtins/pack-objects.c:get_object_list()`; examination of that |
| 521 | function shows that the all-object walk is being performed by |
| 522 | `traverse_commit_list()` or `traverse_commit_list_filtered()`. Those two |
| 523 | functions reside in `list-objects.c`; examining the source shows that, despite |
| 524 | the name, these functions traverse all kinds of objects. Let's have a look at |
Junio C Hamano | 67fef49 | 2022-03-27 17:31:23 | [diff] [blame] | 525 | the arguments to `traverse_commit_list()`. |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 526 | |
Junio C Hamano | 67fef49 | 2022-03-27 17:31:23 | [diff] [blame] | 527 | - `struct rev_info *revs`: This is the `rev_info` used for the walk. If |
| 528 | its `filter` member is not `NULL`, then `filter` contains information for |
| 529 | how to filter the object list. |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 530 | - `show_commit_fn show_commit`: A callback which will be used to handle each |
| 531 | individual commit object. |
| 532 | - `show_object_fn show_object`: A callback which will be used to handle each |
| 533 | non-commit object (so each blob, tree, or tag). |
| 534 | - `void *show_data`: A context buffer which is passed in turn to `show_commit` |
| 535 | and `show_object`. |
Junio C Hamano | 67fef49 | 2022-03-27 17:31:23 | [diff] [blame] | 536 | |
| 537 | In addition, `traverse_commit_list_filtered()` has an additional paramter: |
| 538 | |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 539 | - `struct oidset *omitted`: A linked-list of object IDs which the provided |
| 540 | filter caused to be omitted. |
| 541 | |
Junio C Hamano | 67fef49 | 2022-03-27 17:31:23 | [diff] [blame] | 542 | It looks like these methods use callbacks we provide instead of needing us |
| 543 | to call it repeatedly ourselves. Cool! Let's add the callbacks first. |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 544 | |
| 545 | For the sake of this tutorial, we'll simply keep track of how many of each kind |
| 546 | of object we find. At file scope in `builtin/walken.c` add the following |
| 547 | tracking variables: |
| 548 | |
| 549 | ---- |
| 550 | static int commit_count; |
| 551 | static int tag_count; |
| 552 | static int blob_count; |
| 553 | static int tree_count; |
| 554 | ---- |
| 555 | |
| 556 | Commits are handled by a different callback than other objects; let's do that |
| 557 | one first: |
| 558 | |
| 559 | ---- |
| 560 | static void walken_show_commit(struct commit *cmt, void *buf) |
| 561 | { |
| 562 | commit_count++; |
| 563 | } |
| 564 | ---- |
| 565 | |
| 566 | The `cmt` argument is fairly self-explanatory. But it's worth mentioning that |
| 567 | the `buf` argument is actually the context buffer that we can provide to the |
| 568 | traversal calls - `show_data`, which we mentioned a moment ago. |
| 569 | |
| 570 | Since we have the `struct commit` object, we can look at all the same parts that |
| 571 | we looked at in our earlier commit-only walk. For the sake of this tutorial, |
| 572 | though, we'll just increment the commit counter and move on. |
| 573 | |
| 574 | The callback for non-commits is a little different, as we'll need to check |
| 575 | which kind of object we're dealing with: |
| 576 | |
| 577 | ---- |
| 578 | static void walken_show_object(struct object *obj, const char *str, void *buf) |
| 579 | { |
| 580 | switch (obj->type) { |
| 581 | case OBJ_TREE: |
| 582 | tree_count++; |
| 583 | break; |
| 584 | case OBJ_BLOB: |
| 585 | blob_count++; |
| 586 | break; |
| 587 | case OBJ_TAG: |
| 588 | tag_count++; |
| 589 | break; |
| 590 | case OBJ_COMMIT: |
| 591 | BUG("unexpected commit object in walken_show_object\n"); |
| 592 | default: |
| 593 | BUG("unexpected object type %s in walken_show_object\n", |
| 594 | type_name(obj->type)); |
| 595 | } |
| 596 | } |
| 597 | ---- |
| 598 | |
| 599 | Again, `obj` is fairly self-explanatory, and we can guess that `buf` is the same |
| 600 | context pointer that `walken_show_commit()` receives: the `show_data` argument |
| 601 | to `traverse_commit_list()` and `traverse_commit_list_filtered()`. Finally, |
| 602 | `str` contains the name of the object, which ends up being something like |
| 603 | `foo.txt` (blob), `bar/baz` (tree), or `v1.2.3` (tag). |
| 604 | |
| 605 | To help assure us that we aren't double-counting commits, we'll include some |
| 606 | complaining if a commit object is routed through our non-commit callback; we'll |
| 607 | also complain if we see an invalid object type. Since those two cases should be |
| 608 | unreachable, and would only change in the event of a semantic change to the Git |
| 609 | codebase, we complain by using `BUG()` - which is a signal to a developer that |
| 610 | the change they made caused unintended consequences, and the rest of the |
| 611 | codebase needs to be updated to understand that change. `BUG()` is not intended |
| 612 | to be seen by the public, so it is not localized. |
| 613 | |
| 614 | Our main object walk implementation is substantially different from our commit |
| 615 | walk implementation, so let's make a new function to perform the object walk. We |
| 616 | can perform setup which is applicable to all objects here, too, to keep separate |
| 617 | from setup which is applicable to commit-only walks. |
| 618 | |
| 619 | We'll start by enabling all types of objects in the `struct rev_info`. We'll |
| 620 | also turn on `tree_blobs_in_commit_order`, which means that we will walk a |
| 621 | commit's tree and everything it points to immediately after we find each commit, |
| 622 | as opposed to waiting for the end and walking through all trees after the commit |
| 623 | history has been discovered. With the appropriate settings configured, we are |
| 624 | ready to call `prepare_revision_walk()`. |
| 625 | |
| 626 | ---- |
| 627 | static void walken_object_walk(struct rev_info *rev) |
| 628 | { |
| 629 | rev->tree_objects = 1; |
| 630 | rev->blob_objects = 1; |
| 631 | rev->tag_objects = 1; |
| 632 | rev->tree_blobs_in_commit_order = 1; |
| 633 | |
| 634 | if (prepare_revision_walk(rev)) |
| 635 | die(_("revision walk setup failed")); |
| 636 | |
| 637 | commit_count = 0; |
| 638 | tag_count = 0; |
| 639 | blob_count = 0; |
| 640 | tree_count = 0; |
| 641 | ---- |
| 642 | |
| 643 | Let's start by calling just the unfiltered walk and reporting our counts. |
Junio C Hamano | 59a32b0 | 2021-12-10 22:53:38 | [diff] [blame] | 644 | Complete your implementation of `walken_object_walk()`. |
| 645 | We'll also need to include the `list-objects.h` header. |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 646 | |
| 647 | ---- |
Junio C Hamano | 59a32b0 | 2021-12-10 22:53:38 | [diff] [blame] | 648 | #include "list-objects.h" |
| 649 | |
| 650 | ... |
| 651 | |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 652 | traverse_commit_list(rev, walken_show_commit, walken_show_object, NULL); |
| 653 | |
| 654 | printf("commits %d\nblobs %d\ntags %d\ntrees %d\n", commit_count, |
| 655 | blob_count, tag_count, tree_count); |
| 656 | } |
| 657 | ---- |
| 658 | |
| 659 | NOTE: This output is intended to be machine-parsed. Therefore, we are not |
| 660 | sending it to `trace_printf()`, and we are not localizing it - we need scripts |
| 661 | to be able to count on the formatting to be exactly the way it is shown here. |
| 662 | If we were intending this output to be read by humans, we would need to localize |
| 663 | it with `_()`. |
| 664 | |
| 665 | Finally, we'll ask `cmd_walken()` to use the object walk instead. Discussing |
| 666 | command line options is out of scope for this tutorial, so we'll just hardcode |
| 667 | a branch we can change at compile time. Where you call `final_rev_info_setup()` |
| 668 | and `walken_commit_walk()`, instead branch like so: |
| 669 | |
| 670 | ---- |
| 671 | if (1) { |
| 672 | add_head_to_pending(&rev); |
| 673 | walken_object_walk(&rev); |
| 674 | } else { |
| 675 | final_rev_info_setup(argc, argv, prefix, &rev); |
| 676 | walken_commit_walk(&rev); |
| 677 | } |
| 678 | ---- |
| 679 | |
| 680 | NOTE: For simplicity, we've avoided all the filters and sorts we applied in |
| 681 | `final_rev_info_setup()` and simply added `HEAD` to our pending queue. If you |
| 682 | want, you can certainly use the filters we added before by moving |
| 683 | `final_rev_info_setup()` out of the conditional and removing the call to |
| 684 | `add_head_to_pending()`. |
| 685 | |
| 686 | Now we can try to run our command! It should take noticeably longer than the |
| 687 | commit walk, but an examination of the output will give you an idea why. Your |
| 688 | output should look similar to this example, but with different counts: |
| 689 | |
| 690 | ---- |
| 691 | Object walk completed. Found 55733 commits, 100274 blobs, 0 tags, and 104210 trees. |
| 692 | ---- |
| 693 | |
| 694 | This makes sense. We have more trees than commits because the Git project has |
| 695 | lots of subdirectories which can change, plus at least one tree per commit. We |
| 696 | have no tags because we started on a commit (`HEAD`) and while tags can point to |
| 697 | commits, commits can't point to tags. |
| 698 | |
| 699 | NOTE: You will have different counts when you run this yourself! The number of |
| 700 | objects grows along with the Git project. |
| 701 | |
| 702 | === Adding a Filter |
| 703 | |
| 704 | There are a handful of filters that we can apply to the object walk laid out in |
| 705 | `Documentation/rev-list-options.txt`. These filters are typically useful for |
| 706 | operations such as creating packfiles or performing a partial clone. They are |
| 707 | defined in `list-objects-filter-options.h`. For the purposes of this tutorial we |
| 708 | will use the "tree:1" filter, which causes the walk to omit all trees and blobs |
| 709 | which are not directly referenced by commits reachable from the commit in |
| 710 | `pending` when the walk begins. (`pending` is the list of objects which need to |
| 711 | be traversed during a walk; you can imagine a breadth-first tree traversal to |
| 712 | help understand. In our case, that means we omit trees and blobs not directly |
| 713 | referenced by `HEAD` or `HEAD`'s history, because we begin the walk with only |
| 714 | `HEAD` in the `pending` list.) |
| 715 | |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 716 | For now, we are not going to track the omitted objects, so we'll replace those |
| 717 | parameters with `NULL`. For the sake of simplicity, we'll add a simple |
Junio C Hamano | 67fef49 | 2022-03-27 17:31:23 | [diff] [blame] | 718 | build-time branch to use our filter or not. Preface the line calling |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 719 | `traverse_commit_list()` with the following, which will remind us which kind of |
| 720 | walk we've just performed: |
| 721 | |
| 722 | ---- |
| 723 | if (0) { |
| 724 | /* Unfiltered: */ |
| 725 | trace_printf(_("Unfiltered object walk.\n")); |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 726 | } else { |
| 727 | trace_printf( |
| 728 | _("Filtered object walk with filterspec 'tree:1'.\n")); |
Junio C Hamano | 67fef49 | 2022-03-27 17:31:23 | [diff] [blame] | 729 | CALLOC_ARRAY(rev->filter, 1); |
| 730 | parse_list_objects_filter(rev->filter, "tree:1"); |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 731 | } |
Junio C Hamano | 67fef49 | 2022-03-27 17:31:23 | [diff] [blame] | 732 | traverse_commit_list(rev, walken_show_commit, |
| 733 | walken_show_object, NULL); |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 734 | ---- |
| 735 | |
Junio C Hamano | 67fef49 | 2022-03-27 17:31:23 | [diff] [blame] | 736 | The `rev->filter` member is usually built directly from a command |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 737 | line argument, so the module provides an easy way to build one from a string. |
| 738 | Even though we aren't taking user input right now, we can still build one with |
| 739 | a hardcoded string using `parse_list_objects_filter()`. |
| 740 | |
| 741 | With the filter spec "tree:1", we are expecting to see _only_ the root tree for |
| 742 | each commit; therefore, the tree object count should be less than or equal to |
| 743 | the number of commits. (For an example of why that's true: `git commit --revert` |
| 744 | points to the same tree object as its grandparent.) |
| 745 | |
| 746 | === Counting Omitted Objects |
| 747 | |
| 748 | We also have the capability to enumerate all objects which were omitted by a |
| 749 | filter, like with `git log --filter=<spec> --filter-print-omitted`. Asking |
| 750 | `traverse_commit_list_filtered()` to populate the `omitted` list means that our |
| 751 | object walk does not perform any better than an unfiltered object walk; all |
| 752 | reachable objects are walked in order to populate the list. |
| 753 | |
| 754 | First, add the `struct oidset` and related items we will use to iterate it: |
| 755 | |
| 756 | ---- |
| 757 | static void walken_object_walk( |
| 758 | ... |
| 759 | |
| 760 | struct oidset omitted; |
| 761 | struct oidset_iter oit; |
| 762 | struct object_id *oid = NULL; |
| 763 | int omitted_count = 0; |
| 764 | oidset_init(&omitted, 0); |
| 765 | |
| 766 | ... |
| 767 | ---- |
| 768 | |
| 769 | Modify the call to `traverse_commit_list_filtered()` to include your `omitted` |
| 770 | object: |
| 771 | |
| 772 | ---- |
| 773 | ... |
| 774 | |
Junio C Hamano | 67fef49 | 2022-03-27 17:31:23 | [diff] [blame] | 775 | traverse_commit_list_filtered(rev, |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 776 | walken_show_commit, walken_show_object, NULL, &omitted); |
| 777 | |
| 778 | ... |
| 779 | ---- |
| 780 | |
| 781 | Then, after your traversal, the `oidset` traversal is pretty straightforward. |
| 782 | Count all the objects within and modify the print statement: |
| 783 | |
| 784 | ---- |
| 785 | /* Count the omitted objects. */ |
| 786 | oidset_iter_init(&omitted, &oit); |
| 787 | |
| 788 | while ((oid = oidset_iter_next(&oit))) |
| 789 | omitted_count++; |
| 790 | |
Junio C Hamano | 7d6f46e | 2021-09-10 19:54:21 | [diff] [blame] | 791 | printf("commits %d\nblobs %d\ntags %d\ntrees %d\nomitted %d\n", |
Junio C Hamano | 8ac8a3d | 2019-11-11 04:33:46 | [diff] [blame] | 792 | commit_count, blob_count, tag_count, tree_count, omitted_count); |
| 793 | ---- |
| 794 | |
| 795 | By running your walk with and without the filter, you should find that the total |
| 796 | object count in each case is identical. You can also time each invocation of |
| 797 | the `walken` subcommand, with and without `omitted` being passed in, to confirm |
| 798 | to yourself the runtime impact of tracking all omitted objects. |
| 799 | |
| 800 | === Changing the Order |
| 801 | |
| 802 | Finally, let's demonstrate that you can also reorder walks of all objects, not |
| 803 | just walks of commits. First, we'll make our handlers chattier - modify |
| 804 | `walken_show_commit()` and `walken_show_object()` to print the object as they |
| 805 | go: |
| 806 | |
| 807 | ---- |
| 808 | static void walken_show_commit(struct commit *cmt, void *buf) |
| 809 | { |
| 810 | trace_printf("commit: %s\n", oid_to_hex(&cmt->object.oid)); |
| 811 | commit_count++; |
| 812 | } |
| 813 | |
| 814 | static void walken_show_object(struct object *obj, const char *str, void *buf) |
| 815 | { |
| 816 | trace_printf("%s: %s\n", type_name(obj->type), oid_to_hex(&obj->oid)); |
| 817 | |
| 818 | ... |
| 819 | } |
| 820 | ---- |
| 821 | |
| 822 | NOTE: Since we will be examining this output directly as humans, we'll use |
| 823 | `trace_printf()` here. Additionally, since this change introduces a significant |
| 824 | number of printed lines, using `trace_printf()` will allow us to easily silence |
| 825 | those lines without having to recompile. |
| 826 | |
| 827 | (Leave the counter increment logic in place.) |
| 828 | |
| 829 | With only that change, run again (but save yourself some scrollback): |
| 830 | |
| 831 | ---- |
| 832 | $ GIT_TRACE=1 ./bin-wrappers/git walken | head -n 10 |
| 833 | ---- |
| 834 | |
| 835 | Take a look at the top commit with `git show` and the object ID you printed; it |
| 836 | should be the same as the output of `git show HEAD`. |
| 837 | |
| 838 | Next, let's change a setting on our `struct rev_info` within |
| 839 | `walken_object_walk()`. Find where you're changing the other settings on `rev`, |
| 840 | such as `rev->tree_objects` and `rev->tree_blobs_in_commit_order`, and add the |
| 841 | `reverse` setting at the bottom: |
| 842 | |
| 843 | ---- |
| 844 | ... |
| 845 | |
| 846 | rev->tree_objects = 1; |
| 847 | rev->blob_objects = 1; |
| 848 | rev->tag_objects = 1; |
| 849 | rev->tree_blobs_in_commit_order = 1; |
| 850 | rev->reverse = 1; |
| 851 | |
| 852 | ... |
| 853 | ---- |
| 854 | |
| 855 | Now, run again, but this time, let's grab the last handful of objects instead |
| 856 | of the first handful: |
| 857 | |
| 858 | ---- |
| 859 | $ make |
| 860 | $ GIT_TRACE=1 ./bin-wrappers git walken | tail -n 10 |
| 861 | ---- |
| 862 | |
| 863 | The last commit object given should have the same OID as the one we saw at the |
| 864 | top before, and running `git show <oid>` with that OID should give you again |
| 865 | the same results as `git show HEAD`. Furthermore, if you run and examine the |
| 866 | first ten lines again (with `head` instead of `tail` like we did before applying |
| 867 | the `reverse` setting), you should see that now the first commit printed is the |
| 868 | initial commit, `e83c5163`. |
| 869 | |
| 870 | == Wrapping Up |
| 871 | |
| 872 | Let's review. In this tutorial, we: |
| 873 | |
| 874 | - Built a commit walk from the ground up |
| 875 | - Enabled a grep filter for that commit walk |
| 876 | - Changed the sort order of that filtered commit walk |
| 877 | - Built an object walk (tags, commits, trees, and blobs) from the ground up |
| 878 | - Learned how to add a filter-spec to an object walk |
| 879 | - Changed the display order of the filtered object walk |