Tools
extract_heritrix.pl
name
extract_heritrix.pl - extract HTML contents from a Heritrix Archive file.
usage:
perl extract_heritrix.pl INPUT_ARCFILE LOGFILE OUTPUT_DIR DATE_STR
example:
% perl extract_heritrix.pl \
IAH-20051128151845-00000-fraublucher.sslmit.unibo.it.arc \
test.log \
RESULT \
200511
- INPUT_ARCFILE: archive file name generated by Heritrix
- LOGFILE: log listing the URL of each HTML document extracted from INPUT_ARCFILE
  - 1st field: output filename
  - 2nd field: URL
- OUTPUT_DIR: directory for output files
- DATE_STR: (part of) the date string for when the ARC file was crawled, e.g. 200511. This string is used to find the header of each Heritrix record.
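
To illustrate how a date string can identify record headers, here is a minimal Python sketch. It is not the code of extract_heritrix.pl, whose actual matching logic is not shown on this page. It assumes the ARC version 1 record header layout (URL, IP address, 14-digit archive date, content type, length, separated by single spaces) and assumes DATE_STR is matched as a prefix of the archive date. The function name is_arc_header and the sample lines are hypothetical.

```python
import re


def is_arc_header(line, date_str):
    """Return True if `line` looks like an ARC v1 record header whose
    archive date starts with `date_str` (prefix matching is an assumption)."""
    fields = line.rstrip("\r\n").split(" ")
    # ARC v1 header: URL, IP address, archive date, content type, length
    if len(fields) != 5:
        return False
    _url, _ip, archive_date, _content_type, length = fields
    return (
        re.fullmatch(r"\d{14}", archive_date) is not None  # YYYYMMDDhhmmss
        and archive_date.startswith(date_str)
        and length.isdigit()
    )


# Hypothetical sample lines, for illustration only
header = "http://www.example.com/ 192.0.2.1 20051128151900 text/html 1234\n"
body = "<html><head><title>crawled in 200511</title></head>\n"

print(is_arc_header(header, "200511"))
print(is_arc_header(body, "200511"))
```

A longer DATE_STR (for example 20051128 instead of 200511) makes it less likely that a line of page content is mistaken for a record header.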