webaves warc dump

Synopsis

dump [-h|--help] [-o|--output] [--overwrite] <input>

Description

Transforms WARC files into JSON-formatted output.

Options

-h, --help

Print help information

-o, --output [default: -]

Path to output file

--overwrite [default: false]

Allow overwriting existing files

<input>

Path to WARC file

Example

webaves --verbose warc dump input_file.warc.gz --output output_file.json

Format

The output is a sequence of JSON documents, one document per line. The examples in this section are pretty-printed across several lines for readability.

For each record in the WARC file, the command outputs three types of documents:

  1. Header portion of the record.

  2. Multiple block portions of the record.

  3. End of a record indicator.

Example header:

{
    "Header": {
        "version": "WARC/1.1",
        "fields": [
            {
                "name": {
                    "text": "Field-Name"
                },
                "value": {
                    "text": "Field value"
                }
            }
        ]
    }
}

Example part of a block:

{
    "Block": {
        "data": [ 72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33 ]
    }
}
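
The numbers in "data" look like raw byte values, so a consumer can turn a block document back into bytes. The following Python sketch assumes that each array element is one byte (0 to 255) and that the document arrives as a single line of text; the variable names are illustrative only.

```python
import json

# One "Block" document as it would appear on a single line of output.
line = '{"Block": {"data": [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33]}}'

document = json.loads(line)

# Assumption: every element of "data" is a single byte value (0-255),
# so bytes() rebuilds the raw payload of this block portion.
payload = bytes(document["Block"]["data"])

print(payload)  # b'Hello world!'
```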

End of record indicator (a bare JSON string):

"EndOfRecord"
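
The three document types can be regrouped into whole records by a small reader. The Python sketch below is one possible approach, not part of webaves itself. It assumes that documents arrive in the order header, blocks, end of record, that "data" holds byte values, and that concatenating the block portions in order yields the record body. The function name iter_records and the keys of the returned dictionaries are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Group dump output lines into one dict per WARC record (hypothetical helper)."""
    header = None
    body = bytearray()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        document = json.loads(line)
        if document == "EndOfRecord":
            # The record is complete: hand it to the caller and reset.
            yield {"header": header, "body": bytes(body)}
            header, body = None, bytearray()
        elif isinstance(document, dict) and "Header" in document:
            header = document["Header"]
        elif isinstance(document, dict) and "Block" in document:
            # Assumption: block portions concatenate, in order, into the body.
            body.extend(document["Block"]["data"])


# Sample input written by hand to mirror the examples above.
sample = [
    '{"Header": {"version": "WARC/1.1", "fields": '
    '[{"name": {"text": "Field-Name"}, "value": {"text": "Field value"}}]}}',
    '{"Block": {"data": [72, 101, 108, 108, 111, 32]}}',
    '{"Block": {"data": [119, 111, 114, 108, 100, 33]}}',
    '"EndOfRecord"',
]

records = list(iter_records(sample))
for record in records:
    print(record["header"]["version"], record["body"])
```

To read a real dump, an open file object can be passed in place of the sample list, since iterating over a text file yields one line at a time.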