DATABASE RECORD EXTRACTION
Detective Bill Moylan,
Nassau County, NY Police Dept
Computer Crime Section
In December 1998, I posted a question to the IACIS list requesting help with
sorting and extracting data from a large database. I received several
suggestions that unfortunately did not work out, and I struggled with SQL
queries until I was cross-eyed. The timing of my posting was faulty,
however, because it was right around that time that Dan Mares was busy
retiring from Civil Service and entering the private arena, and he missed
my first request for assistance. When I had occasion to speak to Dan directly,
he was able to describe a procedure using some Maresware programs that
accomplished exactly what I wanted to do. I have used the procedure several
times since, and decided to write it down because I cannot easily call it to
mind when I need it (Alzheimer’s, I guess).
The problem involved an 8-mb master database file, containing business records. I
wanted to extract only the records of transactions in a set of 1100 zip codes.
The database contained 7,000 records, and the zip codes were not contiguous,
so that any kind of Excel sort was not helpful. I was able to export the
database to a DBASE type file, which is a FIXED-LENGTH RECORD type database
file, which is necessary for this procedure.
The first step was to isolate the data, by removing the first part of the file,
which contained the database header, leaving the actual data records in the
file. This was accomplished using Maresware
Filsplit. To extract the
data,
Filsplit is run using parameters for the offset value (
-d)
(number of bytes to pass before the start of data), the record length,
(
-r) and the number of bytes to copy (This parameter is not exact,
the value in this case should exceed the total number of bytes in the file).
In order to find the record length, I used Maresware
Hexdump to display
the data in HEX mode, and determined the structure of each record, including
the record length, and also the position within the record of the zip code
field
Filsplit input.fil output.fil –r 1545 -d 286 –D 9000000
Where
–r = record length
–d = displacement to start of data
-D = number of bytes to copy
Once the data is isolated, it must be sorted on any one field. In this case I
used the first field, which contained the customer number. I imported the data
into Excel, and sorted on the first field, saving the file as a text file.
This file MUST be properly sorted in order for the remaining processes to
work correctly. In order to check that the file is properly sorted, I ran
Maresware
Sortchek, inputting the record length, the position in the
record of the first character of the sort field and the number of characters
to check.
Sortchek input.fil –r 1545 –p 0 –l 6
Where
–r = record length
-p = pass this many characters before comparing (in this case, 0 bytes)
-l = number of characters to compare (in this case, 6 character customer number)
When executed this program will display the number of records being checked, and
will show if the records are properly sorted. It will display the location of
the first missorted record, if any.
Once the file is properly sorted. The actual extraction can be done.
The program used for the extraction is Maresware
Search, and is
executed using a parameter file to hold the option values for the extraction.
In this case a parameter file "param.txt" was prepared using the
following values. (They must be entered in this order)
15450 (number of bytes to work on at a time, it is a multiple of record length)
1545 (record length)
265 (location in record of first character to compare{zip code field})
5 (length of field to be compared {zip code=5)
11101
12802
Another
Another
Another
Once the parameter file is properly prepared the command line is simply:
SEARCH input.fil output.fil param.fil
The parameter file contained the above values and the 1100+ zip codes
(sorted ascending) to be searched for. This program jumped to the location in
each record to check, compared the next 5 bytes against each of the 1100+ zip
codes, and if a match was found, exported the entire record to the output file.
The result was a database file containing only the records pertaining to the
zip codes that I was interested in(about 750 records). Once this file was
extracted, the
Filsplit program was used once again, this time to
copy only the first 286 bytes of data containing the header information. That
resulting 286 byte file was replaced on the new data file using DOS
COPY hdr.fil + data.fil final.fil.
Now the extracted file was in DBASE format and could be imported back into
the original program, and could be viewed, and records and invoices printed
in their original format.
This procedure is only one of the ways that these programs can be used to
manipulate large data records. The .pdf files that accompany Maresware will
explain each of these (and other) program utilities in detail. I have dealt
recently with files that are just to large to be handled by any spreadsheet
or other kinds of Windows programs, and must be handled by specialized tools
in a DOS environment. I have had some of these Maresware programs in my
possession for years, and never realized how powerful these tools are.
I have purchased the new commercial version of the Maresware utilities,
and recommend that they be made a part of everyone’s software toolbox.
My thanks to Dan Mares, who patiently explained to me (several times) the
concepts and procedures needed to solve this problem. I have used the same
procedure several times since he first explained it to me and each time have
been impressed with the speed and accuracy of the results.