The Challenge of Large File Checksums


How often do you use a website's provided checksum to verify that a file you downloaded wasn't corrupted, or worse, tampered with in a man-in-the-middle attack?

These days, I suspect most people never check the checksum values a website publishes to make sure these things don't happen. We simply assume the download is correct, and if it isn't, we just press the download button again.

Most of the time we are shielded from these issues because we rely on transport protocols to guarantee packet delivery.

Well, as someone who sometimes has too much time on their hands, I like to run a checksum to make sure what I'm downloading is exactly what the vendor published.

Every time I run a checksum on a large file, anything more than a few gigabytes, it starts to feel slow, taking a few extra seconds longer than normal. So I decided to take the issue into my own hands and save myself those few extra seconds.

What is a Merkle Tree?

A Merkle tree is a way of computing a hash over data through a tree-like structure:

https://en.wikipedia.org/wiki/Merkle_tree

In this case, you chop the file up into chunks (data blocks) and hash each one individually, creating the leaf nodes of the tree.

You then build on the leaf nodes by hashing each block's hash together with its neighbor's. Repeating the process recursively builds the full Merkle tree.

Here is what it will look like (see the sketch after this list):

  1. Split the large file into chunks
  2. Compute the hash of each file chunk
  3. Keep hashing the resulting nodes, layer by layer, until the final root node is reached
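To make the steps above concrete, here is a minimal, sequential sketch in Go: it splits a file into fixed-size chunks, hashes each chunk with SHA-256 to form the leaves, then hashes neighboring pairs level by level until only the root remains. The 1 MiB chunk size and the promote-the-odd-node rule are assumptions made for illustration, not the settings my library uses.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

// chunkSize is an arbitrary 1 MiB leaf size chosen for this sketch.
const chunkSize = 1 << 20

// hashFileChunks reads the file sequentially and returns one SHA-256
// digest per chunk; these are the leaf nodes of the tree.
func hashFileChunks(path string) ([][]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var leaves [][]byte
	buf := make([]byte, chunkSize)
	for {
		n, err := io.ReadFull(f, buf)
		if n > 0 {
			sum := sha256.Sum256(buf[:n])
			leaves = append(leaves, sum[:])
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break
		}
		if err != nil {
			return nil, err
		}
	}
	return leaves, nil
}

// merkleRoot folds one level into the next: each parent is the hash of its
// two children concatenated; a trailing odd node is promoted unchanged.
func merkleRoot(nodes [][]byte) []byte {
	if len(nodes) == 0 {
		return nil
	}
	for len(nodes) > 1 {
		var next [][]byte
		for i := 0; i < len(nodes); i += 2 {
			if i+1 == len(nodes) {
				next = append(next, nodes[i])
				continue
			}
			pair := make([]byte, 0, len(nodes[i])+len(nodes[i+1]))
			pair = append(pair, nodes[i]...)
			pair = append(pair, nodes[i+1]...)
			sum := sha256.Sum256(pair)
			next = append(next, sum[:])
		}
		nodes = next
	}
	return nodes[0]
}

func main() {
	leaves, err := hashFileChunks(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%x\n", merkleRoot(leaves))
}
```

Running this against a file prints a single root digest that changes if any chunk of the file changes, which is exactly the property that makes the root usable as a fingerprint.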

Here is a concurrent Merkle Tree program I wrote in Golang:

GitHub - epicseven-cup/mark: Go library & CLI for fast parallel file chunking, hashing and fingerprinting. Supports configurable worker count, chunk size, buffer-pool memory reuse, large file integrity, deduplication and fingerprint generation.
Go library & CLI for fast parallel file chunking, hashing and fingerprinting. Supports configurable worker count, chunk size, buffer-pool memory reuse, large file integrity, deduplication and f…

When writing this program, I had to decide between a fork-join method, where the nodes are hashed concurrently and the results joined back together, and a pipeline method, where each level of the tree is handled by a separate goroutine responsible for hashing the nodes at that level, all the way up to the final level (the root).

I ultimately went with the pipeline method because I had a few other ideas I wanted to work on, and the pipeline approach seemed more straightforward to implement.
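To give a sense of what the pipeline method means in Go, here is a rough sketch, not the actual code from the repo: each level of the tree gets its own goroutine, which pairs up the digests arriving on its input channel, hashes them, and forwards the parents to the next level's channel. The buffered channel sizes and the odd-node promotion rule are assumptions made to keep the example small.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/bits"
)

// hashPair concatenates two child digests and hashes them into a parent.
func hashPair(left, right []byte) []byte {
	buf := make([]byte, 0, len(left)+len(right))
	buf = append(buf, left...)
	buf = append(buf, right...)
	sum := sha256.Sum256(buf)
	return sum[:]
}

// levelWorker consumes one level of the tree from in and produces the level
// above it on out. A trailing odd node is promoted to the next level as-is.
func levelWorker(in <-chan []byte, out chan<- []byte) {
	var pending []byte
	for node := range in {
		if pending == nil {
			pending = node
			continue
		}
		out <- hashPair(pending, node)
		pending = nil
	}
	if pending != nil {
		out <- pending
	}
	close(out)
}

// merkleRootPipeline wires up one goroutine per tree level and streams the
// leaf digests through the chain, returning the root digest.
func merkleRootPipeline(leaves [][]byte) []byte {
	if len(leaves) == 0 {
		return nil
	}
	levels := bits.Len(uint(len(leaves) - 1)) // number of levels above the leaves

	in := make(chan []byte, len(leaves))
	for _, leaf := range leaves {
		in <- leaf
	}
	close(in)

	var cur <-chan []byte = in
	for i := 0; i < levels; i++ {
		out := make(chan []byte, len(leaves))
		go levelWorker(cur, out)
		cur = out
	}
	return <-cur
}

func main() {
	// Toy leaves: in practice these would be the per-chunk hashes of a file.
	var leaves [][]byte
	for i := 0; i < 5; i++ {
		sum := sha256.Sum256([]byte{byte(i)})
		leaves = append(leaves, sum[:])
	}
	fmt.Printf("%x\n", merkleRootPipeline(leaves))
}
```

The appeal of this layout is that the stages overlap: while one goroutine is still combining leaves, the goroutine for the level above can already start hashing the parents it has received. Whether that beats a fork-join split in practice depends on the chunk size and how many cores the hashing can actually keep busy.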

Reference:

Merkle tree - Wikipedia: https://en.wikipedia.org/wiki/Merkle_tree