HASHCMP

Author: Dan Mares, dmares @ maresware . com (you will be asked for e-mail address confirmation)
Portions Copyright © 1998-2021 by Dan Mares and Mares and Company, LLC
Phone: 678-427-3275
Last update: 7/16/2021 (hashcmp)
Current hashcmp.exe version: 21.05.04.09.41, MD5: 9594C0F1FE38619A90850199ADEF495B

One liner: Compares two hash files created by Maresware hash.exe for hash matches.

All programs are command line programs.
MUST be run within a command window as administrator.

PURPOSE

HASHCMP (like all Maresware) is a 32 bit program and can only be run in a DOS window under NT/XP/WIN10.

VERY VERY important to remember. The default operation of the program compares the ENTIRE record. For most instances, only the hash value is wanted in the comparison. For this reason, consider adding the -h option (or -d -l options) if only the hash value comparison is needed. This -d (displacement), -l (length) pair, or the -h (hash field only) is useful when you have hashed folder D:\X and then hashed its copy on F:\Y. Because the outputs of hash will obviously create differing paths in the output files, comparing the entire record will result in 100% differences. BUT comparing on only the displacement -d and the length -l of the hash field will properly compare hashes.

The program HASHCMP is designed to display the differences in output files produced by the HASH program. Its design depends on the unique fixed length record output of the hash.exe program. Using it with variable length records will NOT work. Using it on records of different formats, or where the hash is located at a different displacement within the record, WILL NOT work (use compare for that situation). There is a 2nd version that will also compare the outputs of md5sum and md5deep, where the hash is the first item on the line and the records are variable length. See the information on HASHCMPV.

The Maresware HASH program will create a listing of MD5, SHA or CRC32 values for files. These output files can be analyzed by HASHCMP to determine if there are any records in "file1" that are not in "file2".

This analysis can be used to show which files have been altered between the time the 1st and 2nd outputs were created, indicating possible file alteration. But remember, the output records have to be IDENTICAL in format.

HASHCMP REQUIRES the input files to be made of IDENTICAL fixed length records. This is accomplished using the hash.exe program. Under certain circumstances HASHCMP can also compare 2 outputs of any other program which produces a fixed length record output. Two such programs are: DISKCAT, CRCKIT. Other programs which produce fixed records can also create files which hashcmp can use as inputs (such as the MD5 program).

Actually, if you think about it (I know, thinking gives you a headache), if the two records are identical in format, and you have a single field you wish to compare on, it can be done using the -d -l pair. It may take a little practice, but hashcmp has been independently tested on files with over 4 million records.

FILE COPY COMPARE
If you want to compare the hashes of files in an original location with the hashes of the same files after they were copied to a secondary storage location, the paths will most probably not be identical, and so the requirement of identical record formats would not be met. In some cases a simple text edit of the source drive from, say, K: to L: might work if the rest of the path is identical. But don't bet on it. If you wish to compare hashes of two different locations, possibly with two different record lengths/formats, I suggest you use the combination of disksort and compare. This combination is more generic.

HOWEVER
If you want to make sure your forensic copy is good by comparing the original with a destination copy, merely use the UPCOPY program with the hash compare option. I'll leave it up to you to read the upcopy manual on the copy compare capability. It's better than sliced bread.

If you have fixed length records where the record lengths are not identical and the hash values are not in the same position, you should use the disksort.exe and compare.exe programs to perform the appropriate comparisons. The compare program is designed to be a generic field compare program, while hashcmp was designed specifically to work with the fixed length output of the hash.exe program. Got it?

HASHCMP basically can be used to compare the contents, line by line, of two files with identically formatted records. When it finds records in one file that do not have a match in the other file, the program displays the mismatch to the screen. Each line MUST be 100% identical. HASHCMP, except in very special circumstances (see options: -d, -h or -l), will not parse the record for comparisons, and uses the entire record.

Assuming HASH was run on a disk drive at two different times, in order to determine if any files had changed you would want to compare the specific HASH records for each file. On some systems this could mean as many as 100,000 or more files. HASHCMP is designed to take the outputs of two different runs which have essentially identical information, record for record, and compare the two files. If any record found in one file does not 100% match a record in the other file, the record is printed to the screen (and, with the appropriate option, printed to an output file).

Because HASHCMP expects the records in both input files to be identical in format, it can be used to compare records in files that were created by different programs, provided the records are identical in size and content. HASHCMP will then compare the records and show which ones don't match.

The current version of hashcmp ( 21.05.04.09.41 ) even though it is a 32 bit version, currently has a 4.9 million record limit. (reports have been returned that it has been tested successfully on just over 4 million records). If you need more capability let me know. However, when you get to that volume, may I suggest you use a combination of disksort and compare. This combination is more verbose and generic than the hashcmp program. There is a version called hashcmp64 which has an 8 million record limit. But it is currently untested for that volume of data.

If the records are variable length you may wish to review the help file of the hashcmpv version, which allows for variable length records. But test it first.

If the files you have are not identically formatted, but contain a single field (ie: hash value) that you wish to compare on, OR if one or both of the files contain more than 4.9 million records, you should consider using the more generic Maresware program called COMPARE. It is designed as a generic compare program which can handle any number of FIXED length records. But the records must be sorted on the key field, which many persons may have difficulty accomplishing.

OPERATION

The default program operation is to show mismatches from both input files using the ENTIRE record. Meaning it will show all lines in file 1 not found in file 2, and it will also show all lines in file 2 not found in file 1. This capability is used to see if any of the original files were altered at some time. For example: on January 1 you perform a hash of C:\CASES, then six months later you perform another hash of the identical C:\CASES tree. Since both runs were conducted on the same tree and you only wish to see which files, if any, were altered, the default operation would suffice. However, in the real world (that's where most people live), the first and second run would probably have been conducted on two different trees. So the records would contain, at a minimum, different paths. In this case you should match on the hash field only by adding the -h or (-d -l) options.

There is an option to cause it to show only records contained in file 1 and not in file 2. (the -1 option, that’s a one, not ell)

There is also an option to show only records contained in file 2 and not in file 1. (the -2 option. can you guess what the -2 means.)

Depending on the needs and reasons for running the HASH program, any of the three above comparisons could be used.

The actual process:
1. File one is read into memory and sorted. The entire record length (unless appropriate options are used) is used for the sort.
2. Then file two is read into memory and sorted on the hash field.
3. Then the two files are compared: first file 1 is compared to file 2, and next file 2 is compared to file 1. This cross comparison produces output which shows records existing in either file and NOT existing in the other file.
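
The three steps above can be sketched in Python. This is only an illustration of the sort and cross-compare idea, not the program's actual code. The function name cross_compare and the sample records are made up for the example, and the set lookup is a convenience of the sketch.

```python
def cross_compare(records1, records2):
    """Return (only_in_1, only_in_2) using whole-record comparison."""
    # Steps 1 and 2: hold each file's records in memory, sorted.
    sorted1 = sorted(records1)
    sorted2 = sorted(records2)
    # Step 3: compare file 1 against file 2, then file 2 against file 1.
    set1, set2 = set(sorted1), set(sorted2)
    only_in_1 = [r for r in sorted1 if r not in set2]
    only_in_2 = [r for r in sorted2 if r not in set1]
    return only_in_1, only_in_2


# Hypothetical records: B.TXT has a different hash in the second file.
file1 = ["C:\\A.TXT  1111", "C:\\B.TXT  2222"]
file2 = ["C:\\A.TXT  1111", "C:\\B.TXT  9999"]

only1, only2 = cross_compare(file1, file2)
print("In 1 not in 2:", only1)
print("In 2 not in 1:", only2)
```

Note that the one changed file shows up twice, once in each direction.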

Don't forget, if we have the same filename in both files being compared, but the hashes have changed, we would get a mismatch reported on both passes. This is because the record from the first run, with its hash value, can't be found in the second run, and likewise the record for that filename in the second run, with its different hash value, can't be found in the first run. So 2 mismatches are output, even though they reference the same filename.

Since the entire record is compared, any mismatch on any part of the record would cause output to be generated. For forensic purposes this is what is most recommended.

NOTE: read and understand the *(&^ Note, please.

It is possible that the filenames in the two files being compared contain differing upper and lower case characters, since Windows 95 and Windows NT both maintain case in file names. If this is the case, HASHCMP will normally view the different cases as a mismatch and put the mismatch to the screen. If this is not acceptable, and the comparison is needed without regard to case of the records, use the -i (ignore case) option.
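
As a rough picture of what ignoring case means for a comparison, here is a minimal Python sketch. The function name same_record and the two records are hypothetical, and this is not taken from the hashcmp source.

```python
def same_record(a, b, ignore_case=False):
    """Compare two records, optionally without regard to case."""
    if ignore_case:
        return a.upper() == b.upper()
    return a == b


# Same file and same hash, but the path case differs between the two runs.
rec1 = "C:\\Tmp\\Cleanup.bat  FC717D598864C37A45C8140160E9754B"
rec2 = "C:\\TMP\\CLEANUP.BAT  FC717D598864C37A45C8140160E9754B"

print(same_record(rec1, rec2))                    # default, case matters
print(same_record(rec1, rec2, ignore_case=True))  # like adding -i
```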

If comparing output files that were created on two different file systems, it is possible that the first character of the record (the Drive letter) may be different. This first character (drive designation) will cause the paths to be mismatched. In this case, HASHCMP will consider this a mismatch and put the mismatch to the screen. If this is not acceptable, and the comparison is needed without regard to the first few characters of the record (the drive letter section, and/or the path), then use the -d option with -d 2 to allow the program to pass over the first 2 characters. Adjust the value (-d 2) to however many characters you wish to pass. If you wish to ignore the entire path, you would probably use something around 80, which is where the hash field starts in a default record. Or use the -h option, which tells the program to search for and use only the hash field. It is generally smart enough to be able to pick out the hash field without human intervention.

Some people have come to want both the MD5 and the SHA value in the same file. To do this, you would run the hash program 2 times: first creating the default MD5 value, and then again with the -s option creating the SHA value. You should also consider using the -O option, which will cause the 2nd run to append to the first. The problem that this creates for hashcmp is this: the records contained in the file have the digests beginning at two different locations. See the sample below:

 -------- BEGIN PROCESSING MD5 -----------
D:\TMP\CLEANUP.BAT           FC717D598864C37A45C8140160E9754B 
D:\TMP\CLEANUP.BAT   489E320E1327C2D60C9A5F4FECE2CFAF3353F6F0 
 -------- END PROCESSING SHA -----------

When hashcmp sees files with these two formats it might have problems, depending on a lot of other factors, ie: filenames, sizes, dates, times, etc. Anyway, if hashcmp starts indicating too many mismatches, the way around this is to use the -d and -l (that’s an ell) options. The -d would point to the start of the SHA values (in the above situation it's 21, so -d 21), and the -l (length option) would force the comparison to a length of 40, which is the length of the SHA value, instead of the rest of the record (-l 40). I generally use 41 or 42, just to be certain everything is accounted for.

The use of the -d and -l option eliminates possible erroneous mismatches caused by mismatched dates and times. However, if you are getting a lot of incorrectly identified mismatches, then merely use the -d option which points to the start of the SHA value.
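
To see why -d 21 -l 40 fits the sample above, the two records can be rebuilt in Python and sliced. This is a sketch only: compare_field is a made-up helper, and the displacement is counted from 0. The padding (11 spaces before the MD5 value, 3 before the SHA value) is copied from the sample.

```python
def compare_field(record, displacement, length):
    """Return the part of the record used for comparison (start counted from 0)."""
    return record[displacement:displacement + length]


path = "D:\\TMP\\CLEANUP.BAT"                     # 18 characters
md5 = "FC717D598864C37A45C8140160E9754B"          # 32 characters
sha = "489E320E1327C2D60C9A5F4FECE2CFAF3353F6F0"  # 40 characters

# The digests are right-aligned, so both end at the same column.
md5_record = path + " " * 11 + md5
sha_record = path + " " * 3 + sha

print(repr(compare_field(sha_record, 21, 40)))  # the whole SHA value
print(repr(compare_field(md5_record, 21, 40)))  # padding plus the MD5 value
```

The same slice covers the digest in both record types, and nothing after the digest (size, date, time) takes part in the comparison.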

A SIMPLE HASHCMP PROCEDURE

1. Run HASH on the system and create a "reference" file. The reference file will have a header and trailer line containing the following information:

----- BEGIN MD5 PROCESSING

----- END MD5 PROCESSING

Actual output file format. Spacing reduced for legibility.

Started Mon Jan 18 12:51:18 2021 GMT, 08:51 Eastern Standard Time (EST/EDT:UTC-4*)
C:\UTILS\NTUTILS\hash.exe *.bat -d | -o reference.txt 

 -------- BEGIN PROCESSING MD5 -----------
  PATH                                              |                                      MD5|  SIZE |MDATE     |MTIME | TZ
C:\TMP\TEST_USB\PRIVATE\WHATS_IN.BAT                |         4ADAECDE0D1F608D1B77C9BB92616FFD|    490|04/08/2020|12:22:22:447w|EST
C:\TMP\TEST_USB\PRIVATE\ZZ_ADD_BATCHES.BAT          |         1813CD11CB365B1147B2E006DC80CBAF|    782|01/29/2020|07:26:07:064w|EST
C:\TMP\TEST_USB\PRIVATE\ZZ_ADD_SUMMARY_EXE.BAT      |         C10F612DC0D5217244F8F26F461B69CF|    267|01/29/2020|07:27:47:294w|EST
C:\TMP\TEST_USB\PRIVATE\ZZ_BUILD_ARTICLES_EXE.BAT   |         2C006A60147D84DA9C183EE2A288C83F|    200|01/05/2020|09:53:47:935w|EST

 -------- END PROCESSING MD5 -----------

  245 directories, 120 files, 79,488 bytes: 
 (Includes 7 Alternate Data Streams )
  Elapsed:  0 hrs. 0 mins. 3 secs.

  Last Access Date UPDATE is: turned ON

This beginning and ending header is what HASHCMP triggers on as the start and finish of the data records to analyse. Everything between these headers is considered data to be analysed. The headers are made up of 5 dashes - - - - -, followed by a space, followed by the word BEGIN. This header MUST begin in the first character of the line.

Because of this, the output of DISKCAT, CRCKIT or any other file can be processed by inserting similar headers and trailers at the proper locations.

Practice and inventiveness can greatly expand HASHCMP's capability to do other analysis of file differences.

2. Run HASH on the system at a later time to create a "current" file. This reflects the state of the system at the current time. If any file has been altered in any way, the HASH value will change and show up in the output.

3. Run HASHCMP to compare the reference file with the current file.

HASHCMP will show on the screen its progress, and indicate which lines in file #1 (current) are not found in file #2 (reference), and vice versa.

Here is a sample of the output. This is a single mismatch. The record has been truncated for display here.

***************************

Found the following entry in A not in B

A:\directory\of\file\hashcmp.mk  FBC1FE234827A7574E9A7912FA79C5D7
***********************

It shows the record that was found in file 1 that is not in file 2. The record format has been changed and truncated for display purposes only.

Here is a sample batch file to accomplish the above.

@echo off
rem  To obtain a reference file or a test file
rem  replace the -p C:\TOP_LEVEL  with a correct top level path of the source or reference directory.
rem  replace the -p C:\TOP_LEVEL2  with a correct top level path of the 2nd source.

rem  replace the REFERENCE.TXT with an output filename for the 1st run
rem  replace the TEST.TXT with an output filename for the 2nd run

rem  the first run should probably be a reference and
rem  the next run should probably be a testing run
rem  the two runs are shown below

hash -p C:\TOP_LEVEL  -w 350 -v -d "|" -AT3 -8840E -r -o REFERENCE.TXT

hash -p C:\TOP_LEVEL2 -w 350 -v -d "|" -AT3 -8840E -r -o TEST.TXT

rem  ****************************

rem NOW to find hash matched or mismatched, run one of the following commands
rem you don't need all of them
rem assume REFERENCE.TXT is the first hash set, and
rem TEST.TXT is the 2nd hash set


rem the simplest is, asking hashcmp to ONLY compare the -h hash field.
rem the program ATTEMPTS to locate the hash field. IF it can't it uses the entire record. Test it on small sample.

rem     hashcmp  REFERENCE.TXT  TEST.TXT  -h -o DIFF_FILES.TXT                          

rem if, you are not sure the -h is sufficient, a better command line would be 
rem to use the -d and -l options which specifically identify the hash field. Especially if the source folders would create path differences.
rem the -d counts from 0 not 1.


rem hashcmp  REFERENCE.TXT  TEST.TXT  -d 360  -l  32  -x > DIFF_FILES.TXT
rem hashcmp  REFERENCE.TXT  TEST.TXT  -d 360  -l  32  -x -1 > FILES_ON_1_NOT_ON2.TXT
rem hashcmp  REFERENCE.TXT  TEST.TXT  -d 360  -l  32  -x -2 > FILES_ON_2_NOT_ON1.TXT
rem if you want an output compatible with Excel, use the -o option
rem hashcmp  REFERENCE.TXT  TEST.TXT  -d 360  -l  32  -x    -o DIFF_FILES.TXT
rem hashcmp  REFERENCE.TXT  TEST.TXT  -d 360  -l  32  -x -1 -o FILES_ON_1_NOT_ON2.TXT
rem hashcmp  REFERENCE.TXT  TEST.TXT  -d 360  -l  32  -x -2 -o FILES_ON_2_NOT_ON1.TXT

Don't forget, (UNLESS YOU USE THE -d -l PAIR) the HASHCMP program compares the ENTIRE record not just the hash value or filename. For this reason, both files should be of identical format.

Processing Hints

Here are some scenarios that you might want to follow or adopt.

Scenario 1. Forensic analysis. -

The object here is to be able to testify that the contents of files were not altered from the time of seizure to court. You would run HASH on the suspect system as soon as possible after the seizure. This creates a reference file. (reference.fil). Then, at a later time you run HASH again on the system and create a current output file. (current.fil). This records the state of the files now. Then run HASHCMP against both files.

C:>HASHCMP  current.fil  reference.fil

You should not see any differences in the output.

Scenario 2. Your own system references (virus detection/file alteration)-

Run HASH at some point to get a reference output.

At later times run HASH against either the entire system, or selected files. Then run HASHCMP with the -1 or -2 option (depending on which file you input first on the command line).

This will show you if the current files have been changed. If the files have been changed, the output will reflect a different hash value, and you might need to investigate the reason for alteration.

Scenario 3. General File Comparison

You must have two files with the exact same record layout (line structure).

You can then add the

----- BEGIN and

----- END

header and trailer to the file(s).

Run HASHCMP on the two files (with or without the -12 option) to see which lines show up in one and not the other.
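
The way a header and trailer fence off the data records can be pictured with a short Python sketch. It assumes a marker line is a run of five or more dashes, a space, then BEGIN or END. It also tolerates leading spaces and skips blank lines, which is an assumption of the sketch and not documented program behavior. The name data_records and the sample listing are made up.

```python
import re

# Assumption: optional leading spaces, five or more dashes, a space,
# then the word BEGIN or END.
MARKER = re.compile(r"^\s*-{5,} (BEGIN|END)\b")


def data_records(lines):
    """Return the lines found between the BEGIN header and the END trailer."""
    records = []
    inside = False
    for line in lines:
        match = MARKER.match(line)
        if match:
            if match.group(1) == "END":
                break          # trailer reached, stop collecting
            inside = True      # header reached, start collecting
            continue
        if inside and line.strip():
            records.append(line)
    return records


listing = [
    "Started Mon Jan 18 12:51:18 2021",
    "----- BEGIN",
    "record one",
    "record two",
    "----- END",
    "2 files",
]
print(data_records(listing))
```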

TRICK TO COMPARE OTHER DATA OUTPUTS

Because computers are not the brightest things going, and hashcmp is a simple comparison program, it can be tricked into comparing the "fixed length" outputs of other programs, such as diskcat.

Hashcmp knows nothing of record format or layout, it defaults to try and find the hash value, and work with that. However, with the -x, -d, -l options, you can trick it into thinking that the data record is a hash output, and only compare on columns -d x to -l xx.

D:\WORK\C70\LIB70\LIB70\Release\BuildLog.htm         ...rest of the record
D:\WORK\C70\LIB70\LIB70\Release\Crc_stuf7.obj        ...rest of the record
D:\WORK\C70\LIB70\LIB70\Release\Fmat_i64.obj         ...rest of the record

Above we have the output of a diskcat run. Assume the ...rest of the record indicator stands in for the rest of the record, and that we only want to compare on filenames. (tricky analogy, ha!).

Now, we run a catalog on another drive and get the output below.

F:\WORK\C70\LIB70\LIB70\Release\BuildLog.htm         ...rest of the record
F:\WORK\C70\LIB70\LIB70\Release\Crc_stuf8.obj        ...rest of the record
F:\WORK\C70\LIB70\LIB70\Release\Fmat_i64.obj         ...rest of the record

See that the data records are identical except for the drive letter, and the one filename of Crc_stuf7.obj is now Crc_stuf8.obj.

To get hashcmp to compare on the filename only, first figure out the length of the filename field. Let's assume it is 40. (You could just as easily include the file size, date/time, etc. fields in the length.)

The hashcmp compare command line would be:
hashcmp input1 input2 -o outputfile -x -d 2 -l 40

Notice I used a -d 2 to bypass the drive letter. We don't want the drive letter confusing the comparison.

The output file will contain the two records where the filenames are different. As they used to say, try it, you'll like it.
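
For those who want to see the idea before trying it, the following Python sketch imitates a compare on columns 2 through 41 of the two catalogs above. It is not how hashcmp is written. The helper names key and make_record are invented, and the 53 column path field is an assumed layout.

```python
def key(record, displacement=2, length=40):
    """Slice used for the comparison: skip the drive letter, keep the name field."""
    return record[displacement:displacement + length]


def make_record(path):
    # Hypothetical fixed-width layout: path padded to 53 columns.
    return path.ljust(53) + "...rest of the record"


tree = "\\WORK\\C70\\LIB70\\LIB70\\Release\\"
names_1 = ["BuildLog.htm", "Crc_stuf7.obj", "Fmat_i64.obj"]
names_2 = ["BuildLog.htm", "Crc_stuf8.obj", "Fmat_i64.obj"]
input1 = [make_record("D:" + tree + name) for name in names_1]
input2 = [make_record("F:" + tree + name) for name in names_2]

keys1 = {key(r) for r in input1}
keys2 = {key(r) for r in input2}
only_in_1 = [r for r in input1 if key(r) not in keys2]
only_in_2 = [r for r in input2 if key(r) not in keys1]

for record in only_in_1 + only_in_2:
    print(record)
```

With these long paths a length of 40 stops partway through the file name, but it still reaches the one character that differs. Measure your own field before settling on a length.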

HASHCMPV

HASHCMPV is used to process variable length records with hash values and filenames. Most people find it useful to process the outputs of md5sum or md5deep. Both of these programs produce outputs where the hash is the first field, and the filename is a variable length field at the end of the record.

HASHCMPV was designed because these records are variable in size.

It will default to process only the hash field but can be optioned (-lw) to process the entire record which would include the filename.

If you have a need for it, or wish to try it out, download the HASHCMPV and test it out. The help screen is gotten with a hashcmpv -? option.
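
Comparing on a leading hash field in variable length records can be sketched in Python as below. This only illustrates the idea and is not the HASHCMPV code. The helper hash_field and the file names in the sample lines are made up, and the lines follow the usual md5sum layout of hash, spaces, filename.

```python
def hash_field(line):
    """Return the first whitespace-delimited item on the line (the hash)."""
    return line.split(None, 1)[0]


# Hypothetical md5sum-style listings; the paths differ in length.
list1 = [
    "fc717d598864c37a45c8140160e9754b  cleanup.bat",
    "4adaecde0d1f608d1b77c9bb92616ffd  whats_in.bat",
]
list2 = [
    "fc717d598864c37a45c8140160e9754b  /mnt/copy/some/longer/path/cleanup.bat",
]

hashes2 = {hash_field(line) for line in list2}
missing = [line for line in list1 if hash_field(line) not in hashes2]
print(missing)
```

Because only the first field is compared, the different path lengths do not matter.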

HASHCMP OPTIONS

-1 (that’s a one, not ELL) Only show output of lines in file 1 that do not appear in file 2.

-2 Only show output of lines in file 2 that do not appear in file 1. (note the option -12 is the same as the default of show both file mismatches)

-i: 'I'gnore the case of the records. This is useful if comparing two files created with Windows 95 or Windows NT, since both these operating systems maintain case in their paths.

-d + #: Replace # with a number from 1-xx. Where # is the displacement in the record to start the comparison. Use this if the two files have different drive letters as the 1st section of the path. (i.e. D:\path\....., C:\path\.....).

-l + #: (that’s an ELL, not one) Restrict length of compare field to this many characters. This can be used to restrict the length of the field to the correct number of characters of the MD5 or SHA field length. Otherwise, the entire record from the first character, or the character identified in the -d option is used. If the filesize, date, time, are not important, and only the hash or SHA is needed, this is a very useful option. It is best used with the -d, pointing to the first character of the SHA or MD5 field. (ie: -d 80 -l 40 ). See the -h option next for a shortcut to the -d -l option.

-h This indicates that the comparison is to be done ONLY using the hash field. (or the CRC or SHA field). This is a good option to compare only the calculated values regardless of filename, path, dates etc. After all, who really cares what the name is if the files are identical. This is sort of a combination of the -d and -l options. Except it attempts to find the hash field automatically. This can cause problems if the files being checked have mixed MD5 and SHA values, as their hash fields will begin in different positions. In that case, it is advised to use the -d and -l options.

-x Indicates there is (N)o header (i.e. ----- BEGIN ) in the files. The entire file is assumed to be data. And the 1st record MUST be of a length equal to or greater than the data records. You can use this option if you are processing DISKCAT output, or any other fixed length files.

-[oO] + outputfile: Write output to "outputfile" creating a log of process. If uppercase (-O) is used then output file is also sent to default printer 75 characters per line with no formatting.
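
How the -h option actually locates the hash field is not described here. Purely as an illustration of the idea, one plausible approach is to look for a standalone run of 32 hexadecimal characters (MD5) or 40 (the SHA length given above). The Python sketch below does that. The name find_hash, the pattern, and the second sample record are all assumptions, and CRC values are not handled.

```python
import re

# Assumption: the hash field is a standalone run of exactly 40 or 32
# hexadecimal characters, with no hex character on either side.
HEX = "[0-9A-Fa-f]"
HASH_RUN = re.compile(
    "(?<!" + HEX + ")(?:" + HEX + "{40}|" + HEX + "{32})(?!" + HEX + ")"
)


def find_hash(record):
    """Return the first 32 or 40 character hex run in the record, or None."""
    match = HASH_RUN.search(record)
    return match.group(0) if match else None


# Same hash value, different drive, path and date (second record is made up).
rec_a = "C:\\TMP\\TEST_USB\\PRIVATE\\WHATS_IN.BAT |  4ADAECDE0D1F608D1B77C9BB92616FFD|    490|04/08/2020"
rec_b = "F:\\COPY\\WHATS_IN.BAT |  4ADAECDE0D1F608D1B77C9BB92616FFD|    490|05/01/2021"

print(find_hash(rec_a))
print(find_hash(rec_a) == find_hash(rec_b))
```

A file name that happens to look like a hash would fool a search like this, which is one more reason to test on a small sample first.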

COMMAND LINES

C:>HASHCMP file1.new file2.ref
compare file1.new to file2.ref

To get the output to go to an output file, redirect the output (or use the -o option described under HASHCMP OPTIONS).
C:>HASHCMP file1.new file2.ref >> output.fle
compare file1.new to file2.ref and redirect output to output.fle

C:>HASHCMP file1.new file2.ref -1
compare same 2 files, and show only items in file1 not in file2

C:>HASHCMP file1.new file2.ref -2
compare same 2 files, and show only items in file2 not in file1

C:>HASHCMP file1.new file2.ref -2 -d 2 >> output.fle
compare same 2 files, and show only items in file2 not in file1 without regard to drive letter in each record. Notice in this instance, a redirected output file was chosen.

C:>HASHCMP file1.ref file2.new -2 -d 80 -l 40
This is starting the compare at character 80 (assuming that is the start of the SHA field), and compares for only 40 characters, the length of the SHA field. This causes the program to ignore all other parts of the record.

RELATED PROGRAMS

CRCKIT
HASH
DISKCAT
MD5