HASHCMP



Author: Dan Mares, dmares @ maresware . com (you will be asked for e-mail address confirmation)
Portions Copyright © 1998-2014 by Mares and Company, LLC
Phone: (770)242-6687 X 119
Last update: 5/16/2013 (hashcmp64)

All programs are command line programs.
MUST be run within a command window as administrator.



PURPOSE

HASHCMP (like all Maresware) is a 32 bit program and can only be run in a DOS window under NT/XP.

The program HASHCMP is designed to display the differences in output files produced by the HASH program. There is a second version, hashcmpV, that will also compare the outputs of md5sum and md5deep, where the hash is the first item on the line and the records are variable length. See the information on HASHCMPV.

The HASH program will create a listing of MD5, SHA or CRC32 totals for files. These output files can be analyzed by HASHCMP to determine if there are any records in "file1" that are not in "file2".

This analysis can be used to show which files have been altered between the time the 1st and 2nd outputs were created, indicating possible file alteration.

HASHCMP REQUIRES the input files to be made of IDENTICAL fixed length records. This is accomplished using the hash.exe program. Under certain circumstances HASHCMP can also compare 2 outputs of any other program which produces a fixed length record output. Two such programs are: DISKCAT, CRCKIT. Other programs which produce fixed records can also create files which hashcmp can use as inputs (such as the MD5 program).

HASHCMP, basically, can be used to compare the contents, line by line, of two files with identically formatted records. When it finds records in one file that do not have a match in the other file, the program displays the mismatch to the screen. Each line MUST be 100% identical. HASHCMP, except in very special circumstances (see options: -d, -h or -l), will not parse the record for comparisons, and uses the entire record.

Assuming HASH was run on a disk drive at two different times, in order to determine if any files had changed you would want to compare the specific HASH records for each file. On some systems this could mean as many as 100,000 or more files. HASHCMP is designed to take the outputs of two different runs which have essentially identical information, record for record, and compare the two files. If any record is found in one file that does not 100% match a record in the other file, the record is printed to the screen (and, with the appropriate option, printed to an output file).

Because HASHCMP expects the records in both input files to be identical in format, it can be used to compare records in files that were created by different programs, provided the records are identical in size and content. HASHCMP will then compare the records and show which ones don't match.

Because it attempts to handle both files in memory, there is an arbitrary limit of 150,000 records in each input file. If you have more than 150,000 files/records you should use hashcmp64 which currently has a 1.5 million record limit. If you need more capability let me know.

 


 

If the files you have are not identically formatted, but contain a single field (i.e. hash value) that you wish to compare on, OR if one or both of the files contain more than 1.5 million records, you should consider using the more generic Maresware program called COMPARE. It is designed as a generic compare program which can handle any number of FIXED length records. However, the records must be sorted on the key field, which many people may have difficulty accomplishing.

 



OPERATION

The default program operation is to show mismatches from both input files. Meaning it will show all lines in file 1 not found in file 2, and it will also show all lines in file 2 not found in file 1.

There is an option to cause it to show only records contained in file 1 and not in file 2. (the -1 option, that’s a one, not ell)

There is also an option to show only records contained in file 2 and not in file 1. (the -2 option)

Depending on the needs and reasons for running the HASH program, any of the three above comparisons could be used.

First, file 1 is read into memory and sorted. The entire record length is used for the sort.

Then file 2 is read into memory and sorted.

Then the two files are compared: first file 1 is compared to file 2, and next file 2 is compared to file 1. This cross comparison produces output which shows records existing in either file and NOT existing in the other file.

Don't forget, if we have the same filename in both files being compared, but the hashes have changed, we would get an error for both passes. This is because the hash value of the first run, can't be found in the second run, and appropriately so, the hash value associated with the filename in the second run is not identical to the one in the first run. So there are 2 errors output, while they are referencing the same filename.

Since the entire record is compared, any mismatch on any part of the record would cause output to be generated. For forensic purposes this is what is most recommended.
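
The read, sort and cross-compare steps described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual HASHCMP code; the function names and the sample records (including the hash values) are made up for the example.

```python
from bisect import bisect_left

def not_found_in(source, sorted_target):
    """Return the records of `source` with no exact match in `sorted_target`."""
    missing = []
    for record in source:
        i = bisect_left(sorted_target, record)   # binary search in the sorted list
        if i == len(sorted_target) or sorted_target[i] != record:
            missing.append(record)
    return missing

def cross_compare(file1_records, file2_records):
    """Sort both record lists on the entire record, then compare in both directions."""
    a = sorted(file1_records)
    b = sorted(file2_records)
    return not_found_in(a, b), not_found_in(b, a)

# Hypothetical records: B.TXT has a different hash in the second run.
run1 = [
    "C:\\DATA\\A.TXT  11111111111111111111111111111111",
    "C:\\DATA\\B.TXT  22222222222222222222222222222222",
]
run2 = [
    "C:\\DATA\\A.TXT  11111111111111111111111111111111",
    "C:\\DATA\\B.TXT  33333333333333333333333333333333",
]

only_in_1, only_in_2 = cross_compare(run1, run2)
print("In file 1, not in file 2:", only_in_1)
print("In file 2, not in file 1:", only_in_2)
```

Because the whole record is the comparison key, the changed B.TXT appears once in each direction, which is the "2 errors for one filename" behavior noted above.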

NOTE: read and understand the *(&^ Note, please.

If using Windows 95 or Windows NT, it is possible that the filenames in the two files being compared contain a mix of upper and lower case characters (since both these operating systems maintain case in file names). If this is the case, HASHCMP will normally view the different cases as a mismatch and put the mismatch to the screen. If this is not acceptable, and the comparison is needed without regard to case of the records, use the -i (ignore case) option.

If comparing output files that were created on two different file systems, it is possible that the first character of the record (the Drive letter) may be different. This first character (drive designation) will cause the paths to be mismatched. In this case, HASHCMP will consider this a mismatch and put the mismatch to the screen. If this is not acceptable, and the comparison is needed without regard to the first few characters of the record (the drive letter section, and/or the path), then use the -d option with -d 2 to allow the program to pass over the first 2 characters. Adjust the value (-d 2) to however many characters you wish to pass. If you wish to ignore the entire path, you would probably use something around 80, which is where the hash field starts in a default record. Or use the -h option, which tells the program to search for and use only the hash field. It is generally smart enough to be able to pick out the hash field without human intervention.

Some people have come to want both the hash MD5 and the SHA value in the same file. To do this, you would run the hash program 2 times. First creating the default MD5 value, and then run it again with the -s creating the SHA value. You should also consider using the -O option which will cause the 2nd run to append to the first. The problem that this creates for hashcmp is this: the records contained in the file have the digests beginning at two different locations: see the sample below:

 -------- BEGIN PROCESSING MD5 -----------
D:\TMP\CLEANUP.BAT           FC717D598864C37A45C8140160E9754B 
D:\TMP\CLEANUP.BAT   489E320E1327C2D60C9A5F4FECE2CFAF3353F6F0 
 -------- END PROCESSING SHA -----------

When hashcmp sees files with these two formats it might have problems, depending on a lot of other factors (i.e. filenames, sizes, dates, times, etc.). If hashcmp starts indicating too many mismatches, the way around this is to use the -d and -l (that’s an ell) options. The -d would point to the start of the SHA values (in the above situation it's 21: -d 21), and the -l (length option) would force the comparison to a length of 40, which is the length of the SHA value, instead of the rest of the record (-l 40). I generally use 41 or 42, just to be certain everything is accounted for.

The use of the -d and -l option eliminates possible erroneous mismatches caused by mismatched dates and times. However, if you are getting a lot of incorrectly identified mismatches, then merely use the -d option which points to the start of the SHA value.
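
To make the effect of -i, -d and -l concrete, here is a rough Python sketch of how a comparison key could be cut out of a record. It assumes the displacement is simply the number of leading characters to pass over (which matches the -d 2 and -d 21 examples above). The function name compare_key and its parameters are invented for this illustration, and this is not how HASHCMP is necessarily written.

```python
def compare_key(record, displacement=0, length=None, ignore_case=False):
    """Cut the part of the record that takes part in the comparison.

    displacement : characters to pass over at the start (like -d)
    length       : number of characters to compare, None = rest of record (like -l)
    ignore_case  : fold case before comparing (like -i)
    """
    end = None if length is None else displacement + length
    key = record[displacement:end]
    return key.upper() if ignore_case else key

# Rebuild the two sample records above; the column positions are taken from that sample.
name = "D:\\TMP\\CLEANUP.BAT"
md5_line = name.ljust(29) + "FC717D598864C37A45C8140160E9754B "
sha_line = name.ljust(21) + "489E320E1327C2D60C9A5F4FECE2CFAF3353F6F0 "

# -d 21 -l 40 isolates the SHA value and nothing else.
print(compare_key(sha_line, displacement=21, length=40))

# -d 2 with -i: drive letter and case no longer matter.
same = (compare_key("D:\\Tmp\\Cleanup.bat", 2, ignore_case=True)
        == compare_key("C:\\TMP\\CLEANUP.BAT", 2, ignore_case=True))
print(same)
```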

A SIMPLE HASHCMP PROCEDURE

1. Run HASH on the system and create a "reference" file. The reference file will have a header and trailer line containing the following information:

----- BEGIN MD5 PROCESSING

----- END MD5 PROCESSING

This beginning and ending header is what HASHCMP triggers on as the start and finish of the data records to analyse. Everything between these headers is considered data to be analysed. The headers are made up of 5 dashes - - - - -, followed by a space, followed by the word BEGIN. This header MUST begin in the first character of the line.

Because of this, the output of DISKCAT, CRCKIT or any other file can be processed by inserting similar headers and trailers at the proper locations.

Practice and inventiveness can greatly expand HASHCMP's capability to do other analysis of file differences.

2. Run HASH on the system at a later time to create a "current" file. This reflects the state of the system at the current time. If any file has been altered in any way, the HASH value will change and show up in the output.

3. Run HASHCMP to compare the reference file with the current file.

HASHCMP will show on the screen its progress, and indicate which lines in file #1 (current) are not found in file #2 (reference), and vice versa.

Here is a sample of the output. This is a single mismatch. The record has been truncated for display here.

***************************

Found the following entry in A not in B

A:\directory\of\file\hashcmp.mk  FBC1FE234827A7574E9A7912FA79C5D7
***********************

It shows the record that was found in file 1 that is not in file 2. The record format has been changed and truncated for display purposes only.
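
The header and trailer convention from step 1 can be illustrated with a short Python sketch that keeps only the lines between the two markers. It assumes the trailer starts with the same 5 dashes and a space followed by the word END (by analogy with the BEGIN header); the function name and the sample lines are hypothetical.

```python
def data_records(lines):
    """Return the lines found between the BEGIN header and the END trailer."""
    records = []
    inside = False
    for line in lines:
        if line.startswith("----- BEGIN"):    # header must start in the first column
            inside = True
        elif line.startswith("----- END"):    # assumed trailer form
            break
        elif inside:
            records.append(line)
    return records

sample = [
    "some line written before the header",
    "----- BEGIN MD5 PROCESSING",
    "record one",
    "record two",
    "----- END MD5 PROCESSING",
    "some line written after the trailer",
]
print(data_records(sample))
```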

 


Here is a sample batch file to accomplish the above.

@echo off
rem  To obtain a reference file or a test file
rem  replace the -p C:\TOP_LEVEL  with a correct top level path of the source
rem  replace the REFERENCE.TXT with an output filename
rem  the first run should probably be a reference and
rem  the next run should probably be a testing run

hash -p C:\TOP_LEVEL -w 350 -v -d "|" -AT3 -8840E -r -o REFERENCE.TXT

rem NOW to find hash matched or mismatched, run one of the following commands
rem you don't need all of them
rem assume REFERENCE.TXT is the first hash set, and
rem TEST.TXT is the 2nd hash set
rem if you don't have hashcmp64, use hashcmp
rem hashcmp64 has a higher file limit of 1.5 million

rem hashcmp64  REFERENCE.TXT  TEST.TXT  -d 360  -l  32  -x > DIFF_FILES.TXT
rem hashcmp64  REFERENCE.TXT  TEST.TXT  -d 360  -l  32  -x -1 > FILES_ON_1_NOT_ON2.TXT
rem hashcmp64  REFERENCE.TXT  TEST.TXT  -d 360  -l  32  -x -2 > FILES_ON_2_NOT_ON1.TXT
rem if you want an output compatible with Excel, use the -o option
rem hashcmp64  REFERENCE.TXT  TEST.TXT  -d 360  -l  32  -x    -o DIFF_FILES.TXT
rem hashcmp64  REFERENCE.TXT  TEST.TXT  -d 360  -l  32  -x -1 -o FILES_ON_1_NOT_ON2.TXT
rem hashcmp64  REFERENCE.TXT  TEST.TXT  -d 360  -l  32  -x -2 -o FILES_ON_2_NOT_ON1.TXT

Don't forget, the HASHCMP program compares the entire record not just the hash value or filename. For this reason, both files should be of identical format. (i.e. record layout structure)


Processing Hints

Here are some scenarios that you might want to follow or adopt.

Scenario 1. Forensic analysis. -

The object here is to be able to testify that the contents of files were not altered from the time of seizure to court. You would run HASH on the suspect system as soon as possible after the seizure. This creates a reference file. (reference.fil). Then, at a later time you run HASH again on the system and create a current output file. (current.fil). This records the state of the files now. Then run HASHCMP against both files.

C:>HASHCMP  current.fil  reference.fil

You should not see any differences in the output.

Scenario 2. Your own system references (virus detection/file alteration)-

Run HASH at some point to get a reference output.

At later times run HASH against either the entire system, or selected files. Then run HASHCMP with the -1 or -2 option (depending on which file you input first on the command line).

This will show you if the current files have been changed. If the files have been changed, the output will reflect a different hash value, and you might need to investigate the reason for alteration.

Scenario 3. General File Comparison

You must have two files with the exact same record layout (line structure).

You can then add the

----- BEGIN and

----- END

header and trailer to the file(s).

Run HASHCMP on the two files (with or without the -12 option) to see which lines show up in one and not the other.

TRICK TO COMPARE OTHER DATA OUTPUTS

Because computers are not the brightest things going, and hashcmp is a simple comparison program, it can be tricked into comparing the "fixed length" outputs of other programs, such as diskcat.

Hashcmp knows nothing of record format or layout, it defaults to try and find the hash value, and work with that. However, with the -x, -d, -l options, you can trick it into thinking that the data record is a hash output, and only compare on columns -d x to -l xx.

D:\WORK\C70\LIB70\LIB70\Release\BuildLog.htm         ...rest of the record
D:\WORK\C70\LIB70\LIB70\Release\Crc_stuf7.obj        ...rest of the record
D:\WORK\C70\LIB70\LIB70\Release\Fmat_i64.obj         ...rest of the record

Above we have the output of a diskcat run. Assume the "...rest of the record" indicator stands in for the rest of the record, and that we only want to compare on filenames. (tricky analogy, ha!).

Now, we run a catalog on another drive and get the output below.

F:\WORK\C70\LIB70\LIB70\Release\BuildLog.htm         ...rest of the record
F:\WORK\C70\LIB70\LIB70\Release\Crc_stuf8.obj        ...rest of the record
F:\WORK\C70\LIB70\LIB70\Release\Fmat_i64.obj         ...rest of the record

See that the data records are identical except for the drive letter, and that the one filename Crc_stuf7.obj is now Crc_stuf8.obj.

To get hashcmp to compare on the filename only, first figure out the length of the filename field. Let's assume it is 40. (You could just as easily include the file size, date/time, etc. fields in the length.)

The hashcmp compare command line would be:
hashcmp input1 input2 -o outputfile -x -d 2 -l 40

Notice I used a -d 2 to bypass the drive letter. We don't want the drive letter confusing the comparison.

The output file will contain the two records where the filenames are different. As they used to say, try it, you'll like it.
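
For anyone who wants to see why only the Crc_stuf records come out, here is a small Python sketch of the same comparison run on the sample records above. The text after each path is a made-up placeholder for the elided rest of the record, and a set lookup is used instead of a sort for brevity. The names are invented for the example and this is not the hashcmp source.

```python
def window(record, displacement, length):
    """Characters that take part in the comparison (like -d and -l)."""
    return record[displacement:displacement + length]

def unmatched(records, others, displacement, length):
    """Records whose comparison window does not appear anywhere in `others`."""
    other_keys = {window(r, displacement, length) for r in others}
    return [r for r in records if window(r, displacement, length) not in other_keys]

base = "\\WORK\\C70\\LIB70\\LIB70\\Release\\"
# Pad each path to column 53, then add placeholder text for the rest of the record.
catalog1 = [("D:" + base + f).ljust(53) + "placeholder-A"
            for f in ("BuildLog.htm", "Crc_stuf7.obj", "Fmat_i64.obj")]
catalog2 = [("F:" + base + f).ljust(53) + "placeholder-B"
            for f in ("BuildLog.htm", "Crc_stuf8.obj", "Fmat_i64.obj")]

# Equivalent in spirit to: -x -d 2 -l 40
in_1_not_2 = unmatched(catalog1, catalog2, 2, 40)
in_2_not_1 = unmatched(catalog2, catalog1, 2, 40)
print(in_1_not_2)
print(in_2_not_1)
```

With a displacement of 2 and a length of 40, the window ends just past the digit that differs in Crc_stuf7/Crc_stuf8, so the drive letter and the trailing fields play no part in the match.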


HASHCMPV

HashcmpV is used to process variable length records with hash values and filenames. Most people find it useful to process the outputs of md5sum or md5deep. Both of these programs produce outputs where the hash is the first field, and the filename is a variable length field at the end of the record.

HashcmpV was designed because these records are variable in size.

It will default to process only the hash field but can be optioned (-lw) to process the entire record which would include the filename.
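
As a rough illustration of the difference between comparing on the hash field only and comparing on the whole record, here is a Python sketch that works on md5sum-style lines (hash first, then whitespace, then the filename). The function names, the whole_record switch, the paths and the hash values are all made up for the example; this is not the HashcmpV code.

```python
def compare_field(line, whole_record=False):
    """Return the hash (first field) or, if requested, the entire line."""
    line = line.rstrip("\r\n")
    if whole_record:
        return line
    return line.split(None, 1)[0]       # first whitespace-delimited field

def only_in_first(lines_a, lines_b, whole_record=False):
    """Lines of A whose comparison field is not present in B (blank lines skipped)."""
    keys_b = {compare_field(l, whole_record) for l in lines_b if l.strip()}
    return [l for l in lines_a
            if l.strip() and compare_field(l, whole_record) not in keys_b]

# Hypothetical hash values and paths.
h1, h2, h3 = "1" * 32, "2" * 32, "3" * 32
old_run = [h1 + "  /data/report.doc", h2 + "  /data/notes.txt"]
new_run = [h1 + "  /backup/report.doc", h3 + "  /backup/notes.txt"]

print(only_in_first(old_run, new_run))                     # hash field only
print(only_in_first(old_run, new_run, whole_record=True))  # entire record
```

On the hash field alone, report.doc matches even though its path differs; on the whole record, every line is a mismatch because the paths are different.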

If you have a need for it, or wish to try it out, download HASHCMPV and test it. The help screen is displayed with the hashcmpv -? option.



HASHCMP OPTIONS

-1  (that’s a one, not ELL) Only show output of lines in file 1 that do not appear in file 2.

-2  Only show output of lines in file 2 that do not appear in file 1. (note the option -12 is the same as the default of show both file mismatches)

-i:  'I'gnore the case of the records. This is useful if comparing two files created with Windows 95 or Windows NT, since both these operating systems maintain case in their paths.

-d + #:  Replace # with a number from 1-xx. Where # is the displacement in the record to start the comparison. Use this if the two files have different drive letters as the 1st section of the path. (i.e. D:\path\....., C:\path\.....).

-l + #:  (that’s an ELL, not one) Restrict length of compare field to this many characters. This can be used to restrict the length of the field to the correct number of characters of the MD5 or SHA field length. Otherwise, the entire record from the first character, or from the character identified in the -d option, is used. If the filesize, date and time are not important, and only the MD5 or SHA value is needed, this is a very useful option. It is best used with the -d, pointing to the first character of the SHA or MD5 field (i.e. -d 80 -l 40). See the -h option next for a shortcut to the -d -l option.

-h  This indicates that the comparison is to be done ONLY using the hash field (or the CRC or SHA field). This is a good option to compare only the calculated values regardless of filename, path, dates etc. After all, who really cares what the name is if the files are identical. This is sort of a combination of the -d and -l options, except it attempts to find the hash field automatically. This can cause problems if the files being checked have mixed MD5 and SHA values, as their hash fields will begin in different positions. In that case, it is advised to use the -d and -l options.

-x  Indicates there is (N)o header (i.e. ----- BEGIN ) in the files. The entire file is assumed to be data. And the 1st record MUST be of a length equal to or greater than the data records. You can use this option if you are processing DISKCAT output, or any other fixed length files.

-[oO] + outputfile:   Write output to "outputfile" creating a log of process. If uppercase (-O) is used then output file is also sent to default printer 75 characters per line with no formatting.



COMMAND LINES

C:>HASHCMP file1.new file2.ref
compare file1.new to file2.ref

To send the output to a file, you can redirect it (the -o option described under OPTIONS also writes an output file).
C:>HASHCMP file1.new file2.ref >> output.fle
compare file1.new to file2.ref and redirect output to output.fle

C:>HASHCMP file1.new file2.ref -1
compare same 2 files, and show only items in file1 not in file2

C:>HASHCMP file1.new file2.ref -2
compare same 2 files, and show only items in file2 not in file1

C:>HASHCMP file1.new file2.ref -2 -d 2 >> output.fle
compare same 2 files, and show only items in file2 not in file1 without regard to drive letter in each record. Notice in this instance, a redirected output file was chosen.

C:>HASHCMP file1.ref file2.new -2 -d 80 -l 40
This is starting the compare at character 80 (assuming that is the start of the SHA field), and compares for only 40 characters, the length of the SHA field. This causes the program to ignore all other parts of the record.



RELATED PROGRAMS

CRCKIT
HASH
DISKCAT
MD5