Committing to large datasets using Merkle trees in PyTeal

January 10, 2022

Overview

Introduction

Storing data on-chain can be very costly as each data piece is duplicated across all nodes. On Algorand specifically, you can theoretically store a very large amount of state per smart contract by using the local storage of many accounts. However, the cost (in terms of minimum balance) can become quickly prohibitive.

As you may guess, there is no magic solution. This article is suggesting a trade-off: the data needs to be stored off-chain and provided as argument to the smart contract (with additional information used to verify this data). For many applications using large amount of data, this trade-off is very reasonable.

This solution is here to provide a way to allow smart contracts to store a small value that represents (or commits to) a potentially very large amount of data using Merkle trees.

A Merkle tree is a tree data structure (typically binary) in which every leaf node stores a cryptographic hash of a data block, and every non-leaf node stores a cryptographic hash of the joint of its child nodes. This structure allows efficient and secure verification of the contents of large data structures.

The image below shows a Merkle tree of height 2 committing to the data L1, L2, L3, L4.

L1, L2, L3, L4 can be arbitrary data stored off-chain.
The root of the tree (aka top hash) uniquely defines (or commits to) the value L1, L2, L3, L4. So it is sufficient to just store this root in the smart contract.

From Wikipedia :

In this solution, we will demonstrate how to create and use Merkle trees using Algorand’s Smart Contracts generated with PyTeal (TEAL v5).

Design overview

Assumptions

In this solution we assume the tree height is fixed and each empty leaf holds the hash value of an empty string. For example in the image above, when we initialize the tree, L1, L2, L3 and L4 are empty strings and the tree height is fixed to 2.

State

Our smart-contract holds 2 global state variables:
1. Root - a byte slice holding the Top Hash of the tree.
2. Size - the number of elements in the tree.

At the beginning, Size is set to 0 and Root is computed as if the tree leaves are all hashes of empty byte slices (e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 when tree height is 2).
This means that the Root value depends on the tree height.

Operations

Our Merkle tree supports 3 operations:

append - add a new data block to the end of the data sequence.
verify - verify that a data block is stored in the tree.
update - update an existent data block.

The above operations expects to receive input via the application arguments array, and all share the same schematics:
1. First argument: command-name - append, verify or update
2. Second argument: value - the value we wish to append/verify/update (in the image above, Li for i in [1,2,3,4])
3. Third argument and so on: sibling1, sibling2… - a list of siblings in the tree. More details below.

In addition, the update command also expects the last argument to be the new value to update.

- append command

This operation expects to receive the new data block to append, and its siblings along the path to the root. For example in the above diagram, in order to append L1 (assuming the tree is empty) the user would provide L1, Hash 0-1 and Hash 1.
The smart-contract will verify the position of L1 is indeed empty and will then update the Root with the new value and increment the Size.
Note that at this point L2, L3 and L4 are set to empty string and Hash 0-1 is the hash of the empty string.

- verify command

Similar to the append operation, verify expects a value and siblings.
For example in the above diagram, in order to verify L1 the user would provide L1, Hash 0-1 and Hash 1.
The smart-contract will verify the position of L1 actually holds its expected data block.

- update command

This operation can be thought of as a composition of the above operations.
In this case 2 values are expected: old value and new value.
The smart-contract will first verify the old value is in the correct position in the tree and will then compute and update the Root using the new value.

Passing the list of siblings

As mentioned above, all three commands expect to receive a list of siblings where each sibling is a byte-slice containing the hash value of its node in the tree.
Since Merkle trees are binary, we need to mark each sibling if it’s right or left.
To do so, each sibling will hold its orientation as the first byte of the byte slice: 0xaa for right and 0xbb for left.

The first sibling in the list is the 1st order sibling, the second is the 2nd order (i.e. sibling of parent) etc.

With this method we assure that the position of value to append/verify/update is deterministic.

PyTeal code

Now, let’s get into the actual implementation.

We’ll be using PyTeal to generate a TEAL v5 script that will be executed on the AVM 1.0 as a smart-contract.

Helpful subroutines

Since computing the root is core, we’ll introduce a TEAL subroutine just for that:

TREE_HEIGHT = Int(2)
FIRST_SIBLING_INDEX = Int(2)
LAST_SIBLING_INDEX = FIRST_SIBLING_INDEX + TREE_HEIGHT - Int(1)

LEFT_SIDE = Int(170)  # Bytes("base16", "aa")


@Subroutine(TealType.bytes)
def hash_concat(left: TealType.bytes, right: TealType.bytes):
    return Sha256(Concat(left, right))


@Subroutine(TealType.bytes)
def calc_root(init_value: Expr):
    i = ScratchVar(TealType.uint64)
    result = ScratchVar(TealType.bytes)

    init = i.store(FIRST_SIBLING_INDEX)
    cond = i.load() <= LAST_SIBLING_INDEX
    itr = i.store(i.load() + Int(1))
    return Seq(
        result.store(init_value),
        For(init, cond, itr).Do(
            result.store(
                If(GetByte(Txn.application_args[i.load()], Int(0)) == LEFT_SIDE)
                .Then(
                    hash_concat(
                        result.load(),
                        Extract(Txn.application_args[i.load()], Int(1), Int(32)),
                    )
                )
                .Else(
                    hash_concat(
                        Extract(Txn.application_args[i.load()], Int(1), Int(32)),
                        result.load(),
                    )
                )
            )
        ),
        result.load(),
    )

calc_root simply iterates over the list of siblings, concatenates them according to each orientation and hash. The init_value is a TEAL Expression that evaluates to the hash of leaf value we wish the start our computation from.

Note that the For loop boundaries are constant: FIRST_SIBLING_INDEX which is set to 2 and the LAST_SIBLING_INDEX which is set to FIRST_SIBLING_INDEX + TREE_HEIGHT - 1.

The next subroutine is for computing the initial Root when the tree is empty:

@Subroutine(TealType.bytes)
def calc_init_root(tree_height: Expr):
    i = ScratchVar(TealType.uint64)
    result = ScratchVar(TealType.bytes)

    init = i.store(Int(0))
    cond = i.load() < tree_height
    itr = i.store(i.load() + Int(1))
    return Seq([
        result.store(Sha256(Bytes(''))),
        For(init, cond, itr).Do(
            result.store(Sha256(Concat(result.load(), result.load())))
        ),
        result.load()
    ])

That will be used by the creation sequence:

on_creation = Seq([
        App.globalPut(Bytes('Root'), calc_init_root(TREE_HEIGHT)),  # init root
        App.globalPut(Bytes('Size'), Int(0)),  # init size
        Approve()
    ])

Note that the initial Root could actually be pre-computed but since the opcode budget is sufficient we can do it in TEAL.

append sequence

The append sequence first validates that the correct number of application arguments were received and that the value to append is not an empty byte-slice.
It then validates that the position (derived from the siblings’ orientations) is empty by computing the expected root from the specific position and comparing it to the actual Root.

In case all assertions pass, the Root and Size state variables are updated to the new state.

append = Seq([
        Assert(Txn.application_args.length() == Int(NUM_APP_ARGS_APPEND)),
        Assert(Txn.application_args[VALUE_INDEX] != Bytes('')),
        Assert(App.globalGet(Bytes('Root')) == calc_root(
            Sha256(Bytes(''))
        )),
        App.globalPut(Bytes('Root'), calc_root(
            Sha256(
                Txn.application_args[VALUE_INDEX]
            )
        )),
        App.globalPut(Bytes('Size'), App.globalGet(Bytes('Size')) + Int(1)),
        Int(1)
    ])

Note we have the following constants in the code:
- NUM_APP_ARGS_APPEND - the number of expected arguments for the append operation.
- VALUE_INDEX - index of the value in the application arguments array.

verify sequence

The verify sequence first validates that the correct number of application arguments were received.
It then computes the expected root according to the value and siblings array and compares it to the actual Root.

verify = Seq([
        Assert(Txn.application_args.length() == Int(NUM_APP_ARGS_VERIFY)),
        Assert(App.globalGet(Bytes('Root')) == calc_root(
            Sha256(Txn.application_args[VALUE_INDEX])
        )),
        Int(1)
    ])

NUM_APP_ARGS_VERIFY - the number of expected arguments for the verify operation.

update sequence

The update sequence first validates that the correct number of application arguments were received and that the value to update is not an empty byte-slice.
It then computes the expected root according to the value and siblings array and compares it to the actual Root.

In case all assertions pass, the Root state variables is updated to the new state.

update = Seq([
        Assert(Txn.application_args.length() == Int(NUM_APP_ARGS_UPDATE)),
        Assert(Txn.application_args[UPDATE_VALUE_INDEX] != Bytes('')),
        Assert(App.globalGet(Bytes('Root')) == calc_root(
            Sha256(
                Txn.application_args[VALUE_INDEX]
            )
        )),
        App.globalPut(Bytes('Root'), calc_root(
            Sha256(
                Txn.application_args[UPDATE_VALUE_INDEX]
            )
        )),
        Int(1)
    ])

Note we have the following constants in the code:
- NUM_APP_ARGS_UPDATE - the number of expected arguments for the update operation.
- VALUE_INDEX - index of the old value in the application arguments array.
- UPDATE_VALUE_INDEX - index of the new value in the application arguments array.

The approval program

The final approval program is simply a router to all the above sequences in addition to allowing OptIn to all
and restricting DeleteApplication and UpdateApplication from anyone other than the application creator.

program = Cond(
        [
            Txn.application_id() == Int(0),
            on_creation
        ],
        [
            Txn.on_completion() == OnComplete.OptIn,
            Approve()
        ],
        [
            Txn.on_completion() == OnComplete.DeleteApplication, 
            Return(Txn.sender() == Global.creator_address())
        ],
        [
            Txn.on_completion() == OnComplete.UpdateApplication,
            Return(Txn.sender() == Global.creator_address())
        ],
        [
            Txn.on_completion() == OnComplete.NoOp,
            Cond(
                [
                    Txn.application_args[COMMAND_INDEX] == Bytes('verify'), 
                    Return(verify)
                ],
                [
                    Txn.application_args[COMMAND_INDEX] == Bytes('append'), 
                    Return(append)
                ],
                [
                    Txn.application_args[COMMAND_INDEX] == Bytes('update'),                
                    Return(update)
                ]
            )
        ]
    )

Usage

Compile the smart contract

goal app create --creator $creator \
    --approval-prog ./mt_approval.teal --clear-prog ./mt_clear.teal \
    --global-byteslices 1 --global-ints 1 --local-byteslices 0 --local-ints 0

mt_approval.teal is the TEAL script generated from the above pyteal code.
mt_clear.teal is a simple TEAL script that returns 1 for every call.

Both can be easily generated using the pyteal code in the Github repository.

Appending

goal app call \
    --app-id $app_id \
    --app-arg "str:append" \
    --app-arg "str:record0" \ # L1
    --app-arg "b64:quOwxEKY/BwUmvv0yJlvuSQnrkHkZJuTTKSVmRt4UrhV" \ # Hash 0-1
    --app-arg "b64:qi26XbwznnMWrqJoP6+DnBt7HuIxPbeSESWIEY3wZqo1" \ # Hash 1
    --from $creator

Note that since Hash 0-1 and Hash 1 are oriented to the right, the have the prefix byte 0xaa.

Verifying

goal app call \
    --app-id $app_id \
    --app-arg "str:verify" \
    --app-arg "str:record0" \ # L1
    --app-arg "b64:quOwxEKY/BwUmvv0yJlvuSQnrkHkZJuTTKSVmRt4UrhV" \ # Hash 0-1
    --app-arg "b64:qi26XbwznnMWrqJoP6+DnBt7HuIxPbeSESWIEY3wZqo1" \ # Hash 1
    --from $creator

Updating

goal app call \
    --app-id $app_id \
    --app-arg "str:update" \
    --app-arg "str:record0" \ # L1
    --app-arg "b64:quOwxEKY/BwUmvv0yJlvuSQnrkHkZJuTTKSVmRt4UrhV" \ # Hash 0-1
    --app-arg "b64:qi26XbwznnMWrqJoP6+DnBt7HuIxPbeSESWIEY3wZqo1" \ # Hash 1
    --app-arg "str:new_record0" \ # new L1 value
    --from $creator

some caveats

Using this smart-contract implementation with large trees might exceed the operational cost limit allowed smart-contracts on the AVM.

Luckily, we can use AVM’s pooled opcode budget feature by grouping our original application calls with “Dummy” application calls and thus increasing our operational budget.

For example, we could create the following approval code to use as a “Dummy” application:

def dummy_approval():
    return Approve()

The above has the minimal operation cost for a TEAL program and thus grouping calls to it with other “expensive” calls will substantially increase our operational budget.

Comments and Extensions

This solution is intended for learning purposes only. It does not cover error checking and other edge cases. The smart contract(s) in this solution have NOT been audited. Therefore, it should not be used as a production application.
There are many ways to tweak the design for different purposes. For example, the smart contract may also compute the index of the value (from the sibling orientations 0xaa/0xbb) and output it as a scratch pad variable for another transaction.
In production, it is recommended to use domain separation for hashing, that is prepending values to be hashed by some unique prefix (and a different prefix for the data Hash like Hash 0-0 and the “internal” hashes like Hash 0 and root)

Create Publication