webaves warc dump
Synopsis
dump [-h|--help] [-o|--output] [--overwrite] <input>
Description
Transform WARC files to JSON-formatted output.
Options
- -h, --help
Print help information
- -o, --output [default: -]
Path to output file
- --overwrite [default: false]
Allow overwriting existing files
- <input>
Path to WARC file
Example
webaves --verbose warc dump input_file.warc.gz --output output_file.json
Format
The output is a sequence of JSON documents, one document per line.
For each record in the WARC file, the command outputs three types of documents:
- The header portion of the record.
- One or more block portions of the record.
- An end-of-record indicator.
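Example header (shown across several lines for readability; in the actual output each document is on a single line):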
{
  "Header": {
    "version": "WARC/1.1",
    "fields": [
      {
        "name": {
          "text": "Field-Name"
        },
        "value": {
          "text": "Field value"
        }
      }
    ]
  }
}
Example part of a block:
{
  "Block": {
    "data": [ 72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33 ]
  }
}
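The values in "data" look like the raw bytes of the block written as integers; the numbers in the example above are the ASCII codes for "Hello world!". A minimal Python sketch, assuming every element is a byte value between 0 and 255, turns one such document back into bytes:

```python
import json

# One "Block" document as it would appear on a single output line.
line = '{"Block": {"data": [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33]}}'

doc = json.loads(line)

# Assumption: each element of "data" is an integer byte value (0-255).
payload = bytes(doc["Block"]["data"])
print(payload)  # b'Hello world!'
```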
End of record indicator:
"EndOfRecord"
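The three document types can be consumed line by line. The following Python sketch groups them back into records. It assumes that the header arrives before its blocks, that block data are byte values, and that every field object carries a "text" key as in the example above. The function name `iter_records` is hypothetical and not part of webaves.

```python
import io
import json


def iter_records(lines):
    """Yield (header, body) pairs from an iterable of JSON lines.

    Hypothetical helper: assumes a Header document, then Block
    documents, then the "EndOfRecord" string for each record.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        doc = json.loads(line)
        if doc == "EndOfRecord":
            # Record is complete: join the block parts into one body.
            yield header, b"".join(chunks)
            header, chunks = None, []
        elif isinstance(doc, dict) and "Header" in doc:
            header = doc["Header"]
        elif isinstance(doc, dict) and "Block" in doc:
            # Assumption: "data" is a list of integer byte values.
            chunks.append(bytes(doc["Block"]["data"]))


# Small in-memory sample in the format described above.
sample = io.StringIO(
    '{"Header": {"version": "WARC/1.1", "fields": '
    '[{"name": {"text": "Field-Name"}, "value": {"text": "Field value"}}]}}\n'
    '{"Block": {"data": [72, 101, 108, 108, 111]}}\n'
    '{"Block": {"data": [32, 119, 111, 114, 108, 100, 33]}}\n'
    '"EndOfRecord"\n'
)

records = list(iter_records(sample))
for header, body in records:
    fields = {f["name"]["text"]: f["value"]["text"] for f in header["fields"]}
    print(header["version"], fields, body)
```

To read a real dump, an open file object can be passed in place of the sample, for instance `open("output_file.json", encoding="utf-8")`. Processing one line at a time keeps memory use bounded by the size of a single record rather than the whole file.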