Using SHA Checksums

An SHA checksum is a string of letters and numbers that represents a very long number, also known as a hash code.   A checksum is a number computed from the contents of a file by an algorithm that uses only the actual bytes in the file: the file's name, date, and location play no part in the computation.    A good checksum algorithm is fast to compute even for large files and, for all practical purposes, generates a different checksum for each different file.

 

In this topic, by SHA we mean SHA256, an extremely rigorous form.  For all practical purposes, no two different files will have the same SHA checksum.  That allows using the SHA checksum for a file to identify that file and to guard against changes to the file.  
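To make this concrete, here is a minimal sketch in Python using the standard hashlib module. The sample byte strings are made up for illustration and have nothing to do with any Manifold file. It shows that the checksum depends only on the bytes, and that changing a single byte produces a different checksum.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA256 checksum of a byte string as 64 hexadecimal characters."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical sample contents, used only for illustration.
original = b"The quick brown fox jumps over the lazy dog"
same_bytes = bytes(original)  # a separate copy holding identical bytes
one_byte_changed = b"The quick brown fox jumps over the lazy cog"

print(sha256_hex(original))
# Identical bytes give an identical checksum.
print(sha256_hex(same_bytes) == sha256_hex(original))
# One changed byte gives a different checksum.
print(sha256_hex(one_byte_changed) == sha256_hex(original))
```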

 

Once we compute an SHA checksum for a file we can use that checksum to verify whether the file was altered or damaged over time by re-computing the SHA checksum for the file.  If the earlier and later SHA checksums are the same, we know that not one byte in that file has been changed.  If the SHA checksums are different, we know that the two files are different, even if they have the same name and are exactly the same size.   At least one of the bytes has been changed.
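The re-computation described above can be sketched in Python as follows. This is only an illustration, not a Manifold tool: the function names and the chunk size are assumptions chosen for the example.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA256 checksum of a file, reading it in chunks so
    that large files do not have to fit in memory all at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:  # binary mode: hash the raw bytes
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: str, earlier_checksum: str) -> bool:
    """Return True if the file's current checksum equals the one recorded earlier."""
    return sha256_of_file(path) == earlier_checksum.strip().lower()
```

For example, the value returned by sha256_of_file for a file could be saved today, and is_unchanged could be called with that saved value at any later time to confirm that no byte has changed.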

 

SHA technology was developed for specialized cryptographic purposes but now is also used for the simple purpose of identifying files.

 

 

SHA Checksums as Anti-Virus Tools

If we compute the SHA checksum for a new executable or installation file immediately when it is created by Visual Studio or other development tool we can keep that SHA checksum on hand to thereafter verify the file has not been changed.  When the file is copied onto servers for download by others we can verify the SHA checksum of the file on the servers is exactly the same as the SHA checksum for the file when it was first created.   

 

Likewise, after downloading the file to our computer from the server if we are worried the file might have been damaged or changed in any way we can verify the SHA checksum for the file.  If it is the same as the SHA checksum published for that file, we know that the file we have on our computer is exactly the same file, unchanged in any way from the moment it was first created.  

 

An independent ability to verify files are safe using SHA checksums is important because many consumer grade anti-virus programs will frequently issue false alarms about perfectly clean software, claiming that clean software contains a virus, a Trojan or some other malware.   The way to tell if such a warning is a false alarm is to verify the SHA checksum for the file.   If it matches what is on the server we know the warning was a false alarm.

Tools to Compute SHA Checksums

There are many tools available online that will compute SHA256 checksums.   The easiest one for modern Windows users to apply is Microsoft's certutil command line utility that is built into Windows 10 and similar recent Windows editions.  

Example

This example uses the 64-bit installation file for Manifold Release 9.   SHA checksums work the same way for other Manifold products, such as SQL for ArcGIS Pro, Manifold Viewer, and Release 8.

 

 

We download the 64-bit Release 9 installation file manifold-9.0.169-x64.exe and place it in a folder called C:\files.  

 

To verify the SHA256 checksum for the file we have downloaded we launch a Command Prompt window and navigate into the folder where the files are stored.

 

 

To compute the SHA checksum for the manifold-9.0.169-x64.exe file we enter the command line

 

certutil -hashfile manifold-9.0.169-x64.exe SHA256

 

We press Enter.

 

 

Certutil goes to work and dutifully reports the SHA256 checksum for that file.    We can highlight the checksum in the Command Prompt window and then press Ctrl-C to Copy it to the Clipboard.

 

 

We can compare the SHA checksum computed for that file by certutil to the SHA checksum published on the downloads page by opening Notepad and Pasting the checksum reported by certutil.   We can then Copy the SHA checksum from the web page and Paste that into Notepad to make it easy to compare the two checksums.

 

In this case we can see the SHA checksum computed for the file we downloaded is the same as the SHA checksum published by Manifold for that file.  We know for sure the file has not been changed in any way since it was first created by Manifold.  We know it has not been infected by a virus or other malware.   If some anti-virus package says the file is infected we know the anti-virus package has made a mistake and has reported a false alarm.
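Comparing two 64-character strings by eye is error-prone, so the comparison can also be done by a short script. The following Python sketch is an illustration only; the function names are hypothetical. It assumes that the two checksums may differ in letter case or contain stray spaces and line breaks, since some tools print checksums in uppercase or with spaces between byte pairs, and it ignores those differences.

```python
HEX_DIGITS = set("0123456789abcdef")


def normalize_checksum(text: str) -> str:
    """Remove all whitespace and lowercase the text so that checksums
    copied from different tools can be compared directly."""
    return "".join(text.split()).lower()


def is_sha256_hex(text: str) -> bool:
    """A SHA256 checksum is exactly 64 hexadecimal characters."""
    return len(text) == 64 and all(c in HEX_DIGITS for c in text)


def checksums_match(computed: str, published: str) -> bool:
    """Return True only if both strings are well-formed SHA256 checksums
    and are identical after normalization."""
    a = normalize_checksum(computed)
    b = normalize_checksum(published)
    return is_sha256_hex(a) and is_sha256_hex(b) and a == b
```

Calling checksums_match with the text copied from the Command Prompt window and the text copied from the downloads page returns True only when every character agrees.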

Notes

Somebody told me two different files can have the same SHA checksum so I cannot count on that as a means to identify when a file has been changed. Is that true?   No.  Make a note of who told you that and remember never to trust them on technical matters.   Unfortunately, there is so much misinformation repeated on the Internet by inexpert commentators that these notes must expend a few paragraphs to explain why the use of SHA checksums, such as the SHA256 checksums published by Manifold, is totally reliable.  

 

It is one of those mathematical issues that separate people who have a grasp of math from those who do not.   In pure theory, yes, two different files could have the same SHA number, just as in pure theory if the universe existed in infinite time a room full of monkeys keyboarding at random would eventually re-create the works of Shakespeare.  In fact, if we really know what "infinity" means we know that those monkeys would re-create the works of Shakespeare an infinite number of times.  People with common sense know that does not happen in real life.

 

Likewise, the one-in-may-as-well-be-infinity chances of two files having the same SHA checksum, even for a less rigorous SHA1 checksum, are 1 in a number far greater than the number of seconds the Universe has existed.  How much bigger?  Even if you waited around for the number of seconds the Universe was around, and then when the Universe ended you waited all over again for a second lifetime of the Universe, and then you did that again and again for a billion lifetimes of the Universe, you still wouldn't be close.  You would have to count all the seconds in the Universe for a billion lifetimes of the Universe and then do all of that again a million times.  That is, a million billion lifetimes of the Universe worth of seconds.     The possibility that the far more rigorous SHA256 checksum used by Manifold might be duplicated by chance is even less likely by so many more factors of unlikelihood that humans can't really grasp how unlikely that is.
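The arithmetic behind that comparison can be checked with a few lines of Python. This is a rough sketch under stated assumptions: the age of the Universe is taken as about 13.8 billion years, a year as 365.25 days, and every possible checksum value is treated as equally likely, which is an idealization. An SHA1 checksum has 160 bits and an SHA256 checksum has 256 bits.

```python
# Assumed, approximate figures used only for this estimate.
SECONDS_PER_YEAR = 31_557_600          # 365.25 days of 86,400 seconds
UNIVERSE_AGE_YEARS = 13_800_000_000    # about 13.8 billion years

seconds_in_one_lifetime = UNIVERSE_AGE_YEARS * SECONDS_PER_YEAR
# "A million billion lifetimes of the Universe worth of seconds."
million_billion_lifetimes = seconds_in_one_lifetime * 10**6 * 10**9

# Number of distinct checksum values for each algorithm.
sha1_values = 2**160
sha256_values = 2**256

print(f"seconds in one lifetime:        {seconds_in_one_lifetime:.2e}")
print(f"seconds in a million billion:   {million_billion_lifetimes:.2e}")
print(f"possible SHA1 checksums:        {sha1_values:.2e}")
print(f"possible SHA256 checksums:      {sha256_values:.2e}")

# How many times larger the SHA1 space is than the seconds figure.
print(f"SHA1 values per counted second: {sha1_values / million_billion_lifetimes:.2e}")
```

Under these assumptions, the number of possible SHA1 checksums is still more than a million billion times larger than the number of seconds in a million billion lifetimes of the Universe, and the number of possible SHA256 checksums is larger again by a factor of 2 to the power 96.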

 

Believing that two different files could have the same SHA checksum is like believing that random quantum fluctuations could momentarily transform you into a cat.  People with common sense know that is not going to happen, you are not going to be transformed into a cat, and no two different files will ever have the same SHA checksum.

 

In other words, if anybody tells you that two different files randomly could have the same SHA checksum you know for sure they don't know what they are talking about.

 

But leaving random collisions aside, could two files be deliberately constructed to have the same SHA checksum? Yes, given enough resources a cryptographic attack could succeed against SHA256, if you had many lifetimes of the universe to make your calculations.   As computing speeds increase, at some distant future point even SHA256 might become vulnerable.  At that point everyone will simply start using SHA512 or some other, even longer checksum.

 

But a matching SHA checksum does not prove the original file was free of viruses, true?  Not in Manifold's case, no.  That would be true only if you are a conspiracy theorist who believes tools like compilers from Microsoft are designed to inject viruses into software. In the real world such tools create clean files.  Once a file is created as a clean file it can, of course, be infected later in its lifetime.  A file could be infected when placed on a download server, for example.  But if you grab the SHA checksum for the file the moment it is created, you know for sure that if the file still has the same SHA checksum at any future point, it is still clean and not infected.

 

The SHA256 checksum is absurdly long.  Isn't it overkill?   Why not use shorter SHA1 checksums that are easier to compare?  Yes, SHA256 checksums are absurdly long and yes, they are absurd overkill for this purpose.   SHA1 checksums are easier to compare and by any practical standard in the real world will guard against viruses in installation files.   

 

So why does Manifold use SHA256?  Unfortunately, Google has described a procedure to create SHA1 collisions that in theory could be used to create two Manifold installation files that have the same SHA1 checksum, with one of them, in theory, containing a virus package. The Google attack does not work against counter-cryptanalysis hardened SHA1 and it is impractical in real life virus injection scenarios.  But there is no point in arguing with fear not based on practical issues.  It is easier to eliminate fear by applying massively secure technology like SHA256.

 

Could SHA256 be vulnerable to the Google attack?  No.  The Google attack on SHA1 used 6,500 years of single-CPU computations and 110 years of GPU computations to create two different, limited PDF files (and not executables the size of Manifold's installation) with the same SHA1 checksum.  Creating malware that could take over a single computer and in less than a few thousand years inject itself into an installation file, while concealing itself using the same SHA1 checksum for the original file, is a far more difficult task.    And that's just SHA1.   SHA256 checksums are 96 bits longer than SHA1 checksums, which means there are roughly 79 billion billion billion times more possible SHA256 checksums than SHA1 checksums.  Attempting to do any of that with SHA256 would require more time than many lifetimes of the universe.