Watching a new Bitcoin node getting up to speed this week (a publicly listening one this time!), I observed the usual situation, that it's unable to consistently maintain a flow of inbound blocks from its network peers during initial download or re-sync, despite any number of supposed connections. And mind you I'm not talking about some lofty standard of "maximally utilizing available resources" but merely keeping it moving at all. It annoyed me enough that I went to have a closer look at the relevant code. This wasn't the first time either, but whether due to my recent mapping of the nearby swamps or simply due to the repeat effort, I made much more headway on this pass. In brief, it's a wonder the thing works even to the extent it does, but while not exactly trivial, it looks quite fixable and without particularly invasive changes. As a side benefit, I also see just fine now what was up with that runaway sendbuffer DoS episode back in 2020 (archived).
One key breakthrough was figuring out the threading structure of the p2p network code. It consists primarily of two OS-level threads (i.e. independent, whether preemptively scheduled or truly concurrent), each of which runs a polled event loop and attends to various tasks in succession (which could be viewed as subprocesses). It might help to think of the threads as two workers in a somewhat elaborate mail-room, making their rounds to keep everything moving. Various data structures (mailboxes) are employed to pass data between the two, and consequently require locking so the two don't interfere with each other. In fact, each connected peer (represented by a CNode object) gets its own set of such structures, though they're all attended by the same threads.
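To make the mail-room picture a bit more concrete, here's roughly the kind of per-peer arrangement involved, as a minimal sketch in my own names rather than the actual CNode layout:

    #include <mutex>
    #include <vector>

    // Illustrative stand-in for the per-peer plumbing: one buffer filled by
    // the socket thread and drained by the message thread, one flowing the
    // other way, and a lock so the two threads don't step on each other.
    struct PeerBuffers {
        std::mutex cs;                        // guards both buffers
        std::vector<unsigned char> recvBuf;   // bytes in from the socket, awaiting decoding
        std::vector<unsigned char> sendBuf;   // serialized messages awaiting delivery to the socket
    };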
The first and simpler of the two is ThreadSocketHandler. Its delay is set so it loops at around 20 Hz, though socket read or write availability is discovered without delay due to a select call. Its entire function is as a middleman: for each peer, it reads incoming byte sequences from the kernel's buffer (socket), transferring them to an internal receiving buffer ; similarly it checks for bytes in an internal sending buffer, transferring them to the kernel for external delivery.(i) Why bother with this byte-shuffling bureaucracy when the socket already had perfectly good buffers of its own? The extra layer provides at least a few observable functions : disconnection of inactive peers so as to reclaim unused resources ; allowing (in theory) arbitrarily large buffers independent of what's provided by the socket(ii) ; and a rather dubious "flood control" mechanism. Conceivably the polling aspect also serves to obscure the fine-grained timing information of outbound messages, though I have little reason to suspect this was actually the intent and if so it's not too robust. Of course, there's no reason any of these functions would actually require a separate thread ; but hey, I guess we have a hammer so everything looks like a thumb or how did it go.
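In outline the loop comes to something like the following - a sketch of the pattern only, using the PeerBuffers above, with connection acceptance, disconnection and error handling all left out:

    #include <algorithm>
    #include <mutex>
    #include <sys/select.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <vector>

    struct Peer { int fd; PeerBuffers bufs; };   // PeerBuffers as sketched earlier

    void SocketHandlerLoop(std::vector<Peer*>& peers)
    {
        char buf[4096];
        for (;;) {
            // Ask the kernel which sockets are readable, and which are
            // writable among those we actually have queued bytes for.
            fd_set fdRead, fdWrite;
            FD_ZERO(&fdRead);
            FD_ZERO(&fdWrite);
            int fdMax = 0;
            for (Peer* p : peers) {
                FD_SET(p->fd, &fdRead);
                std::lock_guard<std::mutex> lock(p->bufs.cs);
                if (!p->bufs.sendBuf.empty())
                    FD_SET(p->fd, &fdWrite);
                fdMax = std::max(fdMax, p->fd);
            }
            timeval timeout = {0, 50000};        // 50 ms cap on the wait, i.e. the ~20 Hz floor
            if (select(fdMax + 1, &fdRead, &fdWrite, nullptr, &timeout) <= 0)
                continue;

            for (Peer* p : peers) {
                std::lock_guard<std::mutex> lock(p->bufs.cs);
                // Socket -> internal receiving buffer.
                if (FD_ISSET(p->fd, &fdRead)) {
                    ssize_t n = recv(p->fd, buf, sizeof(buf), 0);   // sockets assumed non-blocking
                    if (n > 0)
                        p->bufs.recvBuf.insert(p->bufs.recvBuf.end(), buf, buf + n);
                }
                // Internal sending buffer -> socket.
                if (FD_ISSET(p->fd, &fdWrite) && !p->bufs.sendBuf.empty()) {
                    ssize_t n = send(p->fd, p->bufs.sendBuf.data(), p->bufs.sendBuf.size(), 0);
                    if (n > 0)
                        p->bufs.sendBuf.erase(p->bufs.sendBuf.begin(), p->bufs.sendBuf.begin() + n);
                }
            }
        }
    }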
Oh, that "flood control" thingy? It doesn't work reliably, because it's enforced on the outgoing rather than incoming side of each supposedly guarded buffer, that is, after bytes are removed rather than before they're added! And it has no time element, so it may serve to increase fragility by forcing connections to drop just because some delayed packets drove the instantaneous load above the limit. If it did work properly, in the sense of enforcing the maximum buffer sizes, it would rule out the runaway getdata response along with all others of its kind, though it might break other things.(iii)
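To spell out the distinction in miniature (invented helpers, nothing here is the original code): the check as described above happens only on the draining side, whereas actually bounding memory means refusing the growth up front.

    #include <cstddef>
    #include <vector>

    // What's described above amounts to this: by the time the limit is
    // consulted, the oversized buffer has already been built.
    bool WithinLimitAfterDrain(const std::vector<unsigned char>& sendBuf, size_t nMax)
    {
        return sendBuf.size() <= nMax;
    }

    // Whereas enforcement that actually bounds memory checks before appending
    // a newly serialized message, on the filling side of the buffer.
    bool QueueMessageBounded(std::vector<unsigned char>& sendBuf,
                             const std::vector<unsigned char>& msg, size_t nMax)
    {
        if (sendBuf.size() + msg.size() > nMax)
            return false;            // defer, drop or disconnect, as policy dictates
        sendBuf.insert(sendBuf.end(), msg.begin(), msg.end());
        return true;
    }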
As to the second thread, it's called ThreadMessageHandler and its polling frequency is lower at up to 10 Hz, perhaps on the idea that it does heavier stuff. It does the "real work", decoding messages from the receiving buffer, dispatching to command handlers, which push objects as needed into yet more temporary collections (structured buffers essentially), which then get picked up and serialized into bytes in the sending buffer. There's various further weirdness and questions coming out of it all, but I'm aiming to stick to the originally burning question of why the block download keeps stalling out.
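The dispatch step, in outline (the command names are the protocol's own; everything else here is invented, with header parsing and checksum verification left out):

    #include <string>
    #include <vector>

    struct Peer;   // per-peer state, as in the earlier sketches

    // Skeleton dispatcher: handlers read from the decoded payload and queue
    // any replies into the peer's sending buffer.
    void DispatchMessage(Peer& /*peer*/, const std::string& command,
                         const std::vector<unsigned char>& /*payload*/)
    {
        if (command == "inv") {
            // note the announced hashes; perhaps queue a "getdata"
        } else if (command == "getdata") {
            // look up the requested objects; queue "block" / "tx" replies
        } else if (command == "block") {
            // validate and connect; a "bastard block" prompts a "getblocks"
        } else if (command == "getblocks") {
            // walk the chain from the locator's fork point; queue an "inv" batch
        }
        // ... and so on for "tx", "addr", "getheaders", etc.
    }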
It looks to me that the core of it is a fundamentally passive nature of that download process. It asks for blocks on only two occasions : on first connecting to a peer (and even that only due to the "aggression" patch which I guess we could now call... passive-aggression?), which accounts for why restarting the node tends to get it going again ; and on receiving a block with missing antecedent ("bastard block"). In any case the response to this getblocks query is a list of block hashes ("inventory" objects), with length limited by the sender so that the anticipated followup getdata won't trigger the sender's flood control. Then the sender is responsible for remembering what's going on (through a hashContinue field) and triggering the receiving peer to request the next batch once the last getdata arrives. That's of course assuming it in fact arrives, and it looks to me that there's multiple ways it might not. For one, due to the details of how the starting point for the transfer is negotiated (CBlockLocator) it can end up substantially behind where the requesting node is already at, so none of the resulting inventory generates a getdata. For another, perhaps the block could be meanwhile received from another peer. Yet another factor could be flaky yet not outright malicious peer behavior : perhaps from implementations that "prune" old blocks or simply drop the send-side continuation mechanism.
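To make the dance concrete, here's a toy model of the sending side, with block hashes reduced to plain integers and every name invented; the batch limit is likewise just an assumed figure:

    #include <vector>

    // Toy sketch of the sender-side continuation described above: answer a
    // getblocks with at most one batch of hashes past the requester's
    // position, bookmarking the last one (the hashContinue role) if more
    // remain; when that bookmarked block is later fetched via getdata,
    // signal that the requester should be prompted for its next getblocks.
    struct ContinuationSender {
        int tip = 0;             // sender's best block, as a toy height
        int bookmark = -1;       // plays the role of hashContinue

        std::vector<int> OnGetBlocks(int from, int batchLimit = 500) {
            std::vector<int> inv;
            for (int h = from + 1; h <= tip && (int)inv.size() < batchLimit; ++h)
                inv.push_back(h);
            if (!inv.empty() && inv.back() < tip)
                bookmark = inv.back();   // batch cut short: remember where to pick up
            return inv;
        }

        bool ShouldPromptForMore(int requested) {
            if (requested != bookmark)
                return false;
            bookmark = -1;
            return true;         // time to nudge the requester toward another getblocks
        }
    };

Even at this toy level the fragility shows: all of the continuation state lives with the sender, so if that last getdata never arrives, or the prompt gets lost along the way, nothing on the receiving side is scheduled to ask again.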
So the correct approach from what I can see, requiring neither changes on the other end nor "checkpoint" shenanigans, will be for the downloading node to "man up" and ask for what it wants. And what it wants is for all of its peers to be informing it of all of their inventory of interest, all of the time.
Concretely, this seems to mean noticing when it's reached the end of a series of blocks from a given peer - even if they were just hashes for blocks it already had - and refreshing the getblocks request from that point, rather than from the node's current best block since the two may be on different branches. Thus at any given time, for each of its peers, the node is either in process of downloading, or has a request out for new inventory that makes forward progress over the previous, or is all caught up and waiting for the simple gossip broadcasting of new blocks. There may be increased duplication where the same blocks are fetched around the same time from multiple peers ; but that strikes me as far superior to the current situation where bandwidth gets spent anyway on unusable blocks while time is wasted.
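In sketch form, the per-peer bookkeeping might amount to something like this (names invented, hashes reduced to strings; "end of batch" taken as wherever a walked run of inventory from that peer stops):

    #include <string>

    // Per-peer fetch state for the active approach described above: remember
    // where our last getblocks to this peer started, and when a run of its
    // inventory has been walked to the end - even if every block in it was
    // already known - refresh the request from that point, provided it's
    // forward progress over the previous one.
    struct PeerFetchState {
        std::string lastGetBlocksFrom;   // starting hash of the previous getblocks to this peer

        // batchTail: the last block hash in the run just walked. Returns the
        // hash to anchor the next getblocks locator on, or "" to do nothing.
        std::string OnBatchEnd(const std::string& batchTail) {
            if (batchTail.empty() || batchTail == lastGetBlocksFrom)
                return "";               // no forward progress: don't spin on the same request
            lastGetBlocksFrom = batchTail;
            return batchTail;
        }
    };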
Unfortunately there's no explicit indication in the protocol of this "final block in sequence" ; however, the pushed inventory objects are grouped into batches of up to 1000, so simply reaching the end of such a message might be good enough. At the extreme, a getblocks could be sent for every new block incoming or known block "inv". Care would need to be taken that this doesn't result in exponential amplification of messages back and forth (I don't think it would but need to confirm). Also in its present implementation getblocks is kind of heavy, on both sending and receiving ends, so such frequent polling could be construed as abusive ; but there's substantial room to optimize there.
- The send/recv calls make the not-strictly-portable assumption that they don't need to check for EAGAIN and EWOULDBLOCK as distinct error codes. Log messages use raw error codes rather than bothering to decode them. Both warts probably stem from the code's evident WinSock origins; a sketch of the more careful handling follows these notes. [^]
- For a reasonable instance, you wouldn't want to have to look up and fetch a 1 MB block from the database repeatedly just because it wouldn't fit all at once in the socket, nor block other activity while waiting for it to drain. [^]
- The getheaders request in particular can easily generate oversized responses in the intended usage : think 80 MB for a million-block chain (at 80 bytes per header). Now, getheaders isn't actually issued anywhere by the reference implementation. I'm guessing it was added to support "SPV" clients and later picked up for "headers-first sync". The latter in particular doesn't seem like such a bad idea to me so I don't know that we'd want to just get rid of the command. [^]
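As promised under note (i), a sketch of the more defensive handling (my own names; not the existing code):

    #include <cerrno>
    #include <cstdio>
    #include <cstring>

    // Treat EAGAIN and EWOULDBLOCK as potentially distinct values rather than
    // assuming one covers the other, and decode anything else into readable
    // text for the log instead of printing the raw number.
    bool SendErrorIsRetryable(int err)
    {
        if (err == EAGAIN || err == EWOULDBLOCK)
            return true;                 // not a failure: the socket is just full for now
        std::fprintf(stderr, "socket send error %d: %s\n", err, std::strerror(err));
        return false;                    // genuine failure: caller should disconnect
    }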
[...] getting into the deeper code study this week, I had made some more basic changes which are suitable for production use, and updated [...]
Pingback by Two patches and a new fetch-bitcoind release « Fixpoint — 2022-05-05 @ 21:39
These are such a pleasure to read, thank you. Combined with the releases, they really make the case for giving this another go as node-code that you actively maintain so I guess it'll soon get to the point where I'll have to do it, huh.
Bwahaha, a very fitting description, I say!
Comment by Diana Coman — 2022-05-06 @ 10:09
@Diana Coman: yeee.
Comment by Jacob Welsh — 2022-05-06 @ 14:53
Also worth a link, more for the prior discussion than the specific patch attempted: A cursory look at the infamous TRB "wedge" bug.
Comment by Jacob Welsh — 2022-05-09 @ 01:41
[...] Bitcoin code reading efforts led naturally to another round of editing, in a one-two ratchet towards the object of full [...]
Pingback by Sending buffer overflows fixed in bitcoind, and other cleanups « Fixpoint — 2022-05-21 @ 22:33