The Challenge of Large File Checksums

Share
The Challenge of Large File Checksums
Photo by Alexandros Giannakakis / Unsplash

How often do you use a website's provided checksum to verify that a file you downloaded wasn't corrupted or worse, what if it was tampered with during a man in the middle attack?

In the modern day of internet, I suppose you won't be using the checksum values on the website to make sure these things doesn't happen. Since we just all assume that the download is correct and if not we will just press the download button again.

We are shielded from these issues most of the time as we often relay on data transfer protocols to guarantee packet delivery.

Well, as a person who have too much time on their hand sometimes, I like to use the checksum function to make sure what I'm downloading is what the vendors share.

Every time when I try to use run checksum on large files over a few Gigabyte large, it started to feel slow, takes a few extra second than normal. So I decided I needed to take this issue into my own hand, and save myself a few extra second.

What is Merkle Tree

A Merkle Tree is when you try to compute a hash object though a tree like structure:

https://en.wikipedia.org/wiki/Merkle_tree

In this case you will be chopping up the files into chunks or data blocks, which you can individually hash them creating the leaf node of the tree.

Which than you can build upon the leaf nodes by hashing each data blocks with it's neighbor. Repeating the process recursively and it will build into a full Merkle tree

Here is what it will look like:

  1. Spiting the large file into chunks
file chunks
  1. Computer hash of each file chunk
compute hash of each file chunk
  1. Computing hash of the next layer continuously until final node is reached
computing hash for each chunk and continuously hash the resulting hash nodes until final node is reach

Here is a concurrent Merkle Tree program I wrote in Golang:

GitHub - epicseven-cup/mark: Go library & CLI for fast parallel file chunking, hashing and fingerprinting. Supports configurable worker count, chunk size, buffer-pool memory reuse, large file integrity, deduplication and fingerprint generation.
Go library & CLI for fast parallel file chunking, hashing and fingerprinting. Supports configurable worker count, chunk size, buffer-pool memory reuse, large file integrity, deduplication and f…

When writing this program, I was trying to decide if I should use the fork-join method to hash the nodes concurrently, or if I should implement a pipeline method, where each level of the tree would be handled by a separate thread, responsible for hashing the nodes at that level until reaching the final level (root).

I ultimately went with the pipeline method because I had a few other ideas I wanted to work on, and the pipeline approach seemed more straightforward to implement.

Reference:

Merkle tree - Wikipedia

Read more

蜡笔小新《大人帝国的反击》Nostalgia, and Why I Almost Built DeepSeek

蜡笔小新《大人帝国的反击》Nostalgia, and Why I Almost Built DeepSeek

Thanks for Chatgpt voice for fixing my awful chinese 碎碎念 grammar 所以这个是中文的博客,很久没写了,所以想要看一下怎么写。 大概有三四年没写了吧,有时候最近越来越少用中文了,感觉可以再加努力一下。 所以这是想说一下以前看的小动画片吧,这边有蜡笔小新:大人帝国的反击,然后下面连了一个视频,YouTube的视频(油管的视频),可以看一下,是比较详细的介绍吧,宣传一下。如果论知名度的话,看字幕的话大概知道它。 如果单说这部电影的话,我个人觉得它是我小时候看过最好的动画电影之一。《蜡笔小新》系列其实几乎每年都会出一部剧场版,但我大概从 2016 年左右开始就慢慢不看了,主要是因为太忙了,然后就是基本看不到 这次重新想起来,是因为最近又开始偶尔看看《蜡笔小新》,当作“下饭番”还挺轻松的。虽然不算特别认真地看,但还是会被一些情节触动。 那我对蜡笔小新《大人帝国的反击》具体的印象就是,大概大人全部走掉,然后小孩子的天堂吧。

By Tomato