File-level error recovery for keksum

2024-03-09


Filed under: Hash functions, Software — Jacob Welsh @ 21:03

Practical experience using and teaching with my keksum utility showed that its main deficiency was not in fact the lack of an equivalent to GNU's old md5sum -c option to verify a file collection against provided hashes, since this was easy enough to do using the existing Unix diff with a temporary file for the computed hashes. Instead, it was the overly simplistic error handling, where any external failure reported by a system call would result in overall program termination. Sure, it worked as designed to ensure all errors were reported and never passed silently; but in multi-file usage, it makes much more sense to report the error and then carry on with the larger job by proceeding to the next file in the list.

For example, suppose you want to list the hashes of all regular files in a directory, notwithstanding that it also contains subdirectories. The easiest approach (by keystroke count if nothing else) would be to run keksum * and ignore the errors coming from its attempts to read the subdirectories as regular files. Thanks to the separation of standard output and error streams in Unix, you can also redirect the output of that command to save the useful hashes while only the possibly unexpected errors display on screen.

Implementing this required changing the interface of keksum's low-level input/output library, indeed making it a bit more complicated by passing the requirement for error checking on up to the caller. Thus I only made the change to the one such routine that required it, read_all; still it had to be propagated up through the sponge function and ultimately brought changes to all source files in the project. This is a downside of not having an exceptions facility in the language; alternatively, one could rig up a rudimentary substitute using longjmp but that hardly seemed worth it here.
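To make the shape of that change concrete, here is a minimal sketch of a read wrapper that hands errors back to its caller instead of exiting, and of a caller that passes the status further up. It is not the actual keksum code: the signatures, the 136-byte block size and the byte-counting stand-in for the sponge are all assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical signature, not keksum's real one.
 * Reads until buf is full or end of file is reached.
 * Returns the number of bytes read, or -1 with errno set if read() fails,
 * rather than terminating the program. */
ssize_t read_all(int fd, unsigned char *buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = read(fd, buf + got, len - got);
        if (n < 0) {
            if (errno == EINTR) continue;  /* interrupted: just retry */
            return -1;                     /* let the caller decide */
        }
        if (n == 0) break;                 /* end of file */
        got += (size_t)n;
    }
    return (ssize_t)got;
}

/* Stand-in for the sponge function: a real one would absorb each block
 * into the hash state. Here the bytes are only counted, to show how the
 * error status travels upward. Returns 0 on success, -1 on read error. */
int sponge(int fd, unsigned long *total)
{
    unsigned char block[136];  /* block size chosen arbitrarily for the sketch */
    ssize_t n;

    *total = 0;
    for (;;) {
        n = read_all(fd, block, sizeof block);
        if (n < 0) return -1;              /* propagate, do not exit */
        *total += (unsigned long)n;
        if ((size_t)n < sizeof block) return 0;  /* short read means EOF */
    }
}
```

Every layer between the system call and the main loop has to check and forward the status in this style, which is why a change to one routine ends up touching the whole call chain.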

The only other recoverable errors are those coming from the open call; this was simpler as there's no wrapper involved and the call is done right in the main loop which can directly jump to the continuation point.
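A main loop along those lines might look like the following sketch. The names hash_fd and hash_files are hypothetical, the hashing step is replaced by a placeholder that merely drains the file, and returning a failure count (so the caller can exit non-zero) is an assumption rather than something the post specifies. The "jump to the continuation point" is written here as a plain continue.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical per-file step standing in for the real hashing code:
 * it drains the descriptor and returns 0 on success, -1 on a read error. */
int hash_fd(int fd)
{
    char buf[4096];
    ssize_t n;

    while ((n = read(fd, buf, sizeof buf)) != 0) {
        if (n < 0) {
            if (errno == EINTR) continue;  /* retry the read */
            return -1;
        }
    }
    return 0;
}

/* Process each path in turn. Failures are reported on stderr and the loop
 * moves on to the next file. Returns the number of files that failed. */
int hash_files(int count, char **paths)
{
    int i, fd, failed = 0;

    for (i = 0; i < count; i++) {
        fd = open(paths[i], O_RDONLY);
        if (fd < 0) {
            fprintf(stderr, "%s: open: %s\n", paths[i], strerror(errno));
            failed++;
            continue;                      /* next file in the list */
        }
        if (hash_fd(fd) < 0) {
            fprintf(stderr, "%s: read: %s\n", paths[i], strerror(errno));
            failed++;
        } else {
            /* placeholder for the real output line */
            printf("(hash of %s would be printed here)\n", paths[i]);
        }
        close(fd);
    }
    return failed;
}
```

Because results go to standard output and complaints to standard error, redirecting the former still leaves the latter visible on the terminal.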

Evidently no one but me noticed the "-l" option was mislabeled "-n" in the brief synopsis text. In any case it's fixed now, and the tease that a "-c" option is on the way is removed to better reflect current reality.⁽ⁱ⁾

The patch builds on an earlier one from July 2020 which doesn't seem to have got any explicit mention here yet: I redid the "genesis" patch yet again, this time to follow tree structuring conventions amid a batch of similars. Perhaps it was waiting on some more substantive change such as this one to ride along with.

Finally, I wrapped up this batch of work with a reformatting patch dedicated to the memory of Mircea Popescu, for his insistence that linefeed characters in a text are for expressing auctorial intent and not for working around broken tools that can't adequately handle long lines. The line-un-breaking affects comments as well as the help text displayed to the terminal; previously I'd been sticking to a "punch card" era (or perhaps VT100 era) 80-column discipline.

[Image: IBM punch card stack]

Patch listing

Patch	Seals	Tree
keksum_subdir_genesis.vpatch	jfw	Browse
keksum_error_recovery_and_usage.vpatch	jfw	Browse
keksum_softwrap.vpatch	jfw	Browse

This work will also be included in the keksum version distributed as part of the next fetch-bitcoind release.

⁽ⁱ⁾ I did have some fun though thinking about how a "-c" option could be implemented while preserving the current constant memory usage yet still not introducing artificial limits from static buffer sizes. Namely: a one-time sbrk call safely allocates up front an arbitrarily large buffer to read the hash field based on the -l option; over-length hashes are truncated and under-length ones reported as errors; arbitrarily long file paths are handled by iterated chdir, exploiting the fact that each individual component of the path (file name) can be reliably limited to 255 characters for most if not all filesystems; and an fchdir restores the original working directory for the next entry.
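The iterated chdir idea can be sketched roughly as follows. This is only an illustration of the technique under stated assumptions, not code from keksum: the helper name open_long_path is hypothetical, it takes the whole path as an in-memory string for simplicity (a constant-memory version would instead read one component at a time), and it temporarily overwrites each slash in that string while walking it.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: open `path` read-only by entering one directory
 * component at a time, so that no single system call is handed the whole
 * path. The original working directory is restored with fchdir before
 * returning. Returns a file descriptor, or -1 with errno set. */
int open_long_path(char *path)
{
    char *p = path, *slash;
    int home, r, saved, fd = -1;

    home = open(".", O_RDONLY);        /* remember where we started */
    if (home < 0) return -1;

    if (*p == '/') {                   /* absolute path: start at the root */
        if (chdir("/") < 0) goto out;
        while (*p == '/') p++;
    }
    while ((slash = strchr(p, '/')) != NULL) {
        *slash = '\0';                 /* isolate one component */
        r = (*p != '\0') ? chdir(p) : 0;  /* empty component from "//": skip */
        *slash = '/';                  /* put the string back as it was */
        if (r < 0) goto out;
        p = slash + 1;
    }
    fd = open(p, O_RDONLY);            /* final component: the file itself */
out:
    saved = errno;
    if (fchdir(home) < 0) {            /* cannot get back: treat as failure */
        saved = errno;
        if (fd >= 0) close(fd);
        fd = -1;
    }
    close(home);
    errno = saved;
    return fd;
}
```

Each chdir and the final open see only a single component, so the per-name limit of the filesystem is the only length that matters, and the fchdir makes relative paths in the following entries resolve from the original directory again.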

2 Comments »

[...] sixth fetch-bitcoind release that wraps these up along with the recent keksum work is up in the canonical place. Besides the new patches, it updates the base URLs for my change of [...]

Pingback by Dropping BDB locking, bitcoind finally follows the Bitcoin protocol « Fixpoint — 2024-03-22 @ 05:40
[...] changed, as lower levels still might have Vivian Sporepress: I also noticed busybox tar has the abort-on-first-error behavior whereas gnu tar will print the warning, continue then return the error status. thoughts on whether [...]

Pingback by Regrinding Busybox archive extraction, fixing directory timestamps, symlink attacks, a buffer overflow and more « Fixpoint — 2024-05-04 @ 03:30


Fixpoint
