It’s been a while, and I’m back with a new article. I recently heard about an interesting tool that can analyze and compare two binary files. In most cases we use cryptographic hashes, such as MD5 or SHA256, to identify malware, but malware may be modified, or may modify itself, to avoid detection by traditional hashes. A way to compare two files and measure their similarity would therefore be more useful than an exact-match hash. Fuzzy hashing is such a way, and it works differently from traditional hashes.
0x01 Cryptographic Hash and Fuzzy Hash
What is a fuzzy hash? Context-Triggered Piecewise Hashing (CTPH), the technique SSDeep implements, is one type of fuzzy hash. A cryptographic hash is used to identify data; it answers the question “is file A exactly the same as file B?”. Changing even one bit of the file produces a completely different result. A CTPH fuzzy hash is instead produced by splitting the file into segments, with the boundaries chosen by a rolling hash over the content rather than at fixed offsets, and hashing each segment separately, so a local change only affects part of the result. It answers the question “is file A similar to file B?”. In today’s discussion, I will use SSDeep for the demonstrations.
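To make the piecewise idea concrete, here is a minimal toy sketch in Python. It is not SSDeep’s actual algorithm: the window size, the block size, the simple byte-sum rolling checksum and the name toy_ctph are all my own assumptions, picked only to keep the example short. It shows how content-defined boundaries keep most of the signature stable when a few bytes are appended.

```python
import hashlib

B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
WINDOW = 7       # toy value: bytes looked at by the rolling checksum
BLOCK_SIZE = 64  # toy value: a boundary fires roughly once per 64 bytes


def _chunk_char(chunk: bytes) -> str:
    """Reduce one chunk to a single base64 character."""
    return B64[hashlib.sha256(chunk).digest()[0] % 64]


def toy_ctph(data: bytes) -> str:
    """Toy piecewise hash: one base64 character per content-defined chunk."""
    signature = []
    chunk_start = 0
    for i in range(len(data)):
        # The checksum depends only on the last WINDOW bytes, so a boundary
        # is decided by local content, not by the absolute file offset.
        rolling = sum(data[max(0, i - WINDOW + 1):i + 1])
        if rolling % BLOCK_SIZE == BLOCK_SIZE - 1:
            signature.append(_chunk_char(data[chunk_start:i + 1]))
            chunk_start = i + 1
    if chunk_start < len(data):
        # Whatever is left after the last boundary forms the final chunk.
        signature.append(_chunk_char(data[chunk_start:]))
    return "".join(signature)


# A deterministic pseudo-random 4096-byte "file", and a copy with "AA\n" appended.
original = b"".join(hashlib.sha256(bytes([n])).digest() for n in range(128))
modified = original + b"AA\n"

print(hashlib.sha256(original).hexdigest())
print(hashlib.sha256(modified).hexdigest())
print(toy_ctph(original))
print(toy_ctph(modified))
```

The two SHA-256 digests have nothing in common, while the two toy signatures are identical except possibly at the very end, because only the last chunk ever sees the appended bytes.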
0x02 Demonstration
Let’s use the standard Linux command ls as an example. Here is the original ls, copied from the /bin folder:
15:35:32 [jieliau@jie:~/workspace]
$ ./ls
tmp
15:36:08 [jieliau@jie:~/workspace]
$ sha256sum ./ls
8696974df4fc39af88ee23e307139afc533064f976da82172de823c3ad66f444 ./ls
Next, use echo to append a few extra bytes to ls, check that it still runs, and then check its SHA256 value again:
15:51:37 [jieliau@jie:~/workspace]
$ cp ls ls.back
15:52:38 [jieliau@jie:~/workspace]
$ echo "AA" >> ls
15:52:43 [jieliau@jie:~/workspace]
$ ./ls
tmp
15:52:46 [jieliau@jie:~/workspace]
$ sha256sum ./ls
71b3181eea8f91e9195a4878c36d9226d2e97da5bce245124f408c7c8a544e45 ./ls
15:52:59 [jieliau@jie:~/workspace]
$ sha256sum ./ls.back
8696974df4fc39af88ee23e307139afc533064f976da82172de823c3ad66f444 ./ls.back
From the result, you can see that both copies of ls run successfully but have completely different SHA256 values. This is very similar to what happens with variants in a malware family. Now let’s see what we get with SSDeep:
From the output, SSDeep produces very similar fuzzy hashes for the two files. So a fuzzy hash helps you answer questions like “are these two binaries similar, and is one a variant of the other?” You can also use the Python ssdeep module to get a comparison score (from 0 to 100, where higher means more similar):
16:07:08 [jieliau@jie:~/workspace]
$ python3
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ssdeep
>>> hash1 = ssdeep.hash_from_file("./ls")
>>> hash2 = ssdeep.hash_from_file("./ls.back")
>>> ssdeep.compare(hash1, hash2)
99
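If you have more than two samples, the same two calls can be wrapped in a small helper that ranks a set of files by similarity to a known sample. This is only a sketch: the name rank_by_similarity and the default threshold of 50 are my own illustrative choices, not part of SSDeep, and it assumes the Python ssdeep module from the session above is installed. The hash and compare functions are passed in as parameters so that a different fuzzy hash could be swapped in.

```python
def rank_by_similarity(sample, candidates, hash_file=None, compare=None, threshold=50):
    """Return (score, path) pairs for candidates scoring >= threshold, best first.

    hash_file and compare default to ssdeep.hash_from_file and ssdeep.compare;
    the threshold of 50 is an arbitrary starting point, not a recommended value.
    """
    if hash_file is None or compare is None:
        import ssdeep  # third-party module, the same one used in the session above
        hash_file = hash_file or ssdeep.hash_from_file
        compare = compare or ssdeep.compare

    sample_hash = hash_file(str(sample))
    scored = []
    for path in candidates:
        score = compare(sample_hash, hash_file(str(path)))
        if score >= threshold:
            scored.append((score, str(path)))
    # Highest similarity first.
    return sorted(scored, reverse=True)
```

For example, rank_by_similarity("./ls.back", ["./ls"]) should report ./ls with the same score as the interactive session, as long as that score is at or above the threshold.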
That’s all for this article. I hope you liked it and that it helps your malware analysis in some way. See you next time.