[core-rs] Accounts Tree Prefix Encodings in Nimiq 2.0

I noticed that core-rs stores state tree prefixes as ASCII hex strings, both on-chain and in the database. So instead of storing the nibble path 0x0000…, it stores the bytes 0x30303030… (the ASCII codes of the hex characters).

While this is a simple way of storing a sequence of nibbles, it is obviously not as space-efficient as it could be. For example, the in-memory representation of “AddressNibbles” uses a contiguous sequence of nibbles with an 8-bit length indicator. Ethereum goes even further and uses a 4-bit odd/even indicator.
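To illustrate the overhead, here is a minimal sketch (not core-rs code, names made up) comparing the current one-ASCII-character-per-nibble key with a packed two-nibbles-per-byte key:

// Illustrative only: compare the current hex-character key
// with a packed two-nibbles-per-byte key.
fn main() {
    let nibbles: Vec<u8> = vec![0x0, 0x0, 0x0, 0x0];

    // Current scheme: one ASCII hex character (one byte) per nibble.
    let hex_key: Vec<u8> = nibbles
        .iter()
        .map(|&n| std::char::from_digit(n as u32, 16).unwrap() as u8)
        .collect();
    assert_eq!(hex_key, b"0000".to_vec()); // stored as 0x30303030

    // Packed scheme: two nibbles per byte, half the key size.
    let packed: Vec<u8> = nibbles
        .chunks(2)
        .map(|p| (p[0] << 4) | p.get(1).copied().unwrap_or(0))
        .collect();
    assert_eq!(packed, vec![0x00, 0x00]);
    assert_eq!(packed.len() * 2, hex_key.len());
}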

In this post, I’m going to explore a few ways we could improve efficiency.
If time allows, I’ll write up some proof-of-concept Go code.

A more efficient encoding for the database

I can imagine historical reasons for storing prefixes with hex encoding. Perhaps browser engines imposed restrictions on core-js that allowed only ASCII-compatible keys to be stored in the database. Since all the database code now directly depends on LMDB, I don’t think this is a problem anymore.

Whether changing the key encoding on Nimiq 1.0 nodes is worth doing is debatable. core-rs nodes have hit production usage already and such a change would require a full re-sync or complex upgrading code.

However, for Nimiq 2.0 (core-rs-albatross) it seems to make sense. There is no blockchain to lose and it’s in heavy development anyway. Cutting the key size in half could allow for some good performance gains.

A comment by @paberr in the core-rs database code supports this, by the way :wink:

impl AsDatabaseBytes for AddressNibbles {
    fn as_database_bytes(&self) -> Cow<[u8]> {
        // TODO: Improve AddressNibbles, so that no serialization is needed.
        let v = Serialize::serialize_to_vec(&self);
        Cow::Owned(v)
    }
}

Here are a few encoding suggestions.

Prefixing a 4-bit odd/even indicator

  • The first nibble (“prefix”) is the indicator.
  • Prefix 0b0000 / 0x0 means an odd number of nibbles follows,
  • Prefix 0b1000 / 0x8 means an even number of nibbles follows.
  • A single 0x0 nibble is used as padding at the end when needed to fill the last byte.

Example encodings:

  • [] => 80
  • [1] => 01
  • [8, 4] => 8840
  • [8, 8, 4] => 0884

Converting between the compressed form and the nibble list is fairly simple,
and this encoding is space-efficient. However, the indicator in the first nibble
creates an unnatural split in the key space (all odd-length prefixes sort before
all even-length ones), which makes in-order iteration impossible.
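A minimal sketch of how this could look (function names are made up and this is not core-rs code; nibbles are u8 values in 0..16):

fn encode_prefixed(nibbles: &[u8]) -> Vec<u8> {
    // 0x8 indicator for an even number of nibbles, 0x0 for an odd number.
    let indicator = if nibbles.len() % 2 == 0 { 0x8 } else { 0x0 };
    let mut all = Vec::with_capacity(nibbles.len() + 2);
    all.push(indicator);
    all.extend_from_slice(nibbles);
    if all.len() % 2 != 0 {
        all.push(0x0); // pad to a whole byte
    }
    all.chunks(2).map(|p| (p[0] << 4) | p[1]).collect()
}

fn decode_prefixed(bytes: &[u8]) -> Vec<u8> {
    let mut nibbles: Vec<u8> =
        bytes.iter().flat_map(|&b| [b >> 4, b & 0x0f]).collect();
    let even = nibbles.remove(0) == 0x8;
    if even {
        nibbles.pop(); // drop the 0x0 padding nibble
    }
    nibbles
}

fn main() {
    assert_eq!(encode_prefixed(&[]), vec![0x80]);
    assert_eq!(encode_prefixed(&[0x1]), vec![0x01]);
    assert_eq!(encode_prefixed(&[0x8, 0x4]), vec![0x88, 0x40]);
    assert_eq!(encode_prefixed(&[0x8, 0x8, 0x4]), vec![0x08, 0x84]);
    assert_eq!(decode_prefixed(&[0x88, 0x40]), vec![0x8, 0x4]);
}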

Suffixing a 4-bit odd/even indicator

  • The indicator is appended at the end (“suffix”).
  • An odd number of nibbles is followed by a single 0x0 suffix nibble,
  • an even number of nibbles is followed by a full 0b11111111 / 0xFF suffix byte.

Example encodings:

  • [] => FF
  • [1] => 10
  • [8, 4] => 84FF
  • [8, 8, 4] => 8840

Conversion is slightly more complicated in this case.
I like this one the most.
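Again only a sketch under the same assumptions (made-up function names, not core-rs code):

fn encode_suffixed(nibbles: &[u8]) -> Vec<u8> {
    // Pack the nibbles; for an odd count the trailing 0x0 written here
    // already acts as the terminator.
    let mut out: Vec<u8> = nibbles
        .chunks(2)
        .map(|p| (p[0] << 4) | p.get(1).copied().unwrap_or(0x0))
        .collect();
    if nibbles.len() % 2 == 0 {
        out.push(0xFF); // even count: terminate with a full 0xFF byte
    }
    out
}

fn decode_suffixed(bytes: &[u8]) -> Vec<u8> {
    let even = *bytes.last().unwrap() == 0xFF;
    let content = if even { &bytes[..bytes.len() - 1] } else { bytes };
    let mut nibbles: Vec<u8> =
        content.iter().flat_map(|&b| [b >> 4, b & 0x0f]).collect();
    if !even {
        nibbles.pop(); // drop the 0x0 terminator nibble
    }
    nibbles
}

fn main() {
    assert_eq!(encode_suffixed(&[]), vec![0xFF]);
    assert_eq!(encode_suffixed(&[0x1]), vec![0x10]);
    assert_eq!(encode_suffixed(&[0x8, 0x4]), vec![0x84, 0xFF]);
    assert_eq!(encode_suffixed(&[0x8, 0x8, 0x4]), vec![0x88, 0x40]);
    assert_eq!(decode_suffixed(&[0x88, 0x40]), vec![0x8, 0x8, 0x4]);
}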

Prefixing the 8-bit length

This encoding is a bit more exotic, but it would allow efficiently iterating the list of accounts/leaves in order, without touching branches (select by prefix 0x28..., since a full address is 40 nibbles).
It’s probably also less efficient than the other proposed encodings.
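A sketch of this variant as well (made-up names, not core-rs code); the assumption that a full address is 20 bytes / 40 nibbles is what makes 0x28 the common leaf prefix:

fn encode_length_prefixed(nibbles: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(1 + (nibbles.len() + 1) / 2);
    out.push(nibbles.len() as u8); // 8-bit length indicator
    out.extend(
        nibbles
            .chunks(2)
            .map(|p| (p[0] << 4) | p.get(1).copied().unwrap_or(0x0)),
    );
    out
}

fn main() {
    // A 3-nibble branch prefix...
    assert_eq!(
        encode_length_prefixed(&[0x8, 0x8, 0x4]),
        vec![0x03, 0x88, 0x40]
    );
    // ...and a full 40-nibble address key, which starts with the byte 0x28.
    let address_nibbles = vec![0x1; 40];
    assert_eq!(encode_length_prefixed(&address_nibbles)[0], 0x28);
}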

Changing the “on-chain” tree

Changing the accounts tree itself is probably not productive.
However, it might decrease the amount of data sent over the wire when doing a full light sync.

2 Likes

I personally don’t think this forum is a good platform for such discussions. Why didn’t you open an issue on GitHub instead, where other issues around the core code are discussed?

As I understand, Nimiq 2.0 is written exclusively in Rust. Do you think that proof-of-concept Go code is suitable to give insights on performance in the Rust implementation?

Have you checked back with @paberr or any other team member whether there are already plans and you are not duplicating work?

As I understand, only the hash of the accounts tree is part of the chain, not the accounts tree itself. Thus it is possible to change the format of data sent over the wire when syncing the accounts tree without having any impact on the on-chain data, as long as there is a way to translate the on-wire format to the on-chain hash format.

1 Like

I personally don’t think this forum is a good platform for such discussions. Why didn’t you open an issue on GitHub instead, where other issues around the core code are discussed?

Feel free to open an issue. The forum has a subsection dedicated to technical discussions, so I figured it’s a fitting space for protocol discussions.

As I understand, Nimiq 2.0 is written exclusively in Rust. Do you think that proof-of-concept Go code is suitable to give insights on performance in the Rust implementation?

I hope so, for the accounts tree code at least. I’m working on a re-implementation of the Nimiq Patricia-Merkle tree and the accounts logic in Go. It is based on RocksDB with in-memory cache overlays. I’d love to see how that compares to the LMDB code.

Regarding the prefix encodings, I think it’s best to just implement that in core-rs directly.

Either way, it needs a sensible benchmark, like committing the first 500k blocks.

Have you checked back with @paberr or any other team member whether there are already plans and you are not duplicating work?

I’m not aware of any existing GitHub issues or forum posts on this, which is why I started this thread.

Changing the “on-chain” tree

As I understand, only the hash of the accounts tree is part of the chain, not the accounts tree itself.

That is correct: the accounts tree is not embedded in the chain itself, but is authenticated by the chain through the Merkle root hash. I just used “on-chain” for lack of a better term :slight_smile:

Thus it is possible to change the format of data sent over the wire when syncing the accounts tree without having any impact on the on-chain data, as long as there is a way to translate the on-wire format to the on-chain hash format.

Afaik, the wire serialization is the same as the on-chain/hash format in most parts of the Nimiq protocol. I don’t think the minimal network traffic savings would justify a special case here; WebSocket compression could get close to the same savings anyway.

1 Like

You’re mostly right here. When we were still using LevelDB it was necessary to have string keys (the browser indexedDB actually supports Uint8Arrays :slight_smile: ), so we stuck to this implementation in Nimiq 1.0 even after switching to LMDB.

When implementing the Rust node, as my comment suggests, we thought about using a different format for the database. If I remember correctly, the main reason for not pursuing this any further was that the current format was simple to implement and allows iterating the accounts tree easily. Plus, no other format addressed my comment in a way where “no serialization is needed”: even your suggestions will require serialization in the form of the conversion.

It indeed might still be worthwhile to change the encoding in order to save space.
I don’t expect any large performance gains though.

1 Like

I prefer GitHub as well, but as there is a “Technical Discussion” category in the forum, this place probably isn’t totally wrong either. :wink:

That’s true, by the way.
Oh, and @terorie: currently, no one from the team is working on this.

1 Like

If I remember correctly, the main reason for not pursuing this any further was that the current format was simple to implement and allows iterating the accounts tree easily.

I hadn’t considered iteration. That makes sense; this is probably required for accounts tree chunks?

It indeed might still be worthwhile to change the encoding in order to save space.
I don’t expect any large performance gains though.

I’ve seen about a 30% improvement in throughput with a RocksDB-based (CockroachDB) Ethereum block explorer before, when syncing from zero. We switched from hex-checksum serialized addresses to binary addresses on the transaction log table.

I don’t know if LMDB would behave the same, since the architecture and access patterns are quite different. It’s worth trying, I guess :thinking: I’m going to implement one of the aforementioned serialization schemes in core-rs and run some benchmarks.