Warctika: Python library for processing WARC files through Apache Tika



Nicholls, Tom (ORCID: 0000-0002-6971-8614) (2014) Warctika: Python library for processing WARC files through Apache Tika. [Software]


Abstract

This library processes web crawl data fetched with the Heritrix web crawler (or any other tool that produces WARC files): it extracts the plain text from structured formats and re-saves the data as WARC "conversion" records. Its primary use is to extract text from web crawl data sets for machine learning and supervised classification work. WARC (Web ARChive) is a file format for storing web crawls (see http://bibnum.bnf.fr/WARC/). The library was originally based on the "warc" library by the Internet Archive and others, but it now relies on the hanzo warctools and has no code in common with the original library. The hanzo library on which this code depends can be installed with 'pip install warctools'. Beware that several old versions are available in the package index under different names.
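
As an illustration of what a WARC "conversion" record looks like, the sketch below serialises already-extracted plain text as a single WARC/1.0 record using only the Python standard library. It is not Warctika's own code and does not use the hanzo warctools API; the function name build_conversion_record, the example URI and the placeholder record ID are hypothetical, and the text is assumed to have been extracted elsewhere (for example by Apache Tika).

```python
import uuid
from datetime import datetime, timezone


def build_conversion_record(target_uri, refers_to, text):
    """Serialise extracted plain text as one WARC/1.0 'conversion' record.

    Hypothetical helper for illustration only; not part of Warctika
    or the hanzo warctools.
    """
    payload = text.encode("utf-8")
    headers = [
        ("WARC-Type", "conversion"),
        # Every WARC record carries its own unique identifier.
        ("WARC-Record-ID", "<urn:uuid:{}>".format(uuid.uuid4())),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        # URI of the resource whose content was converted.
        ("WARC-Target-URI", target_uri),
        # Record ID of the original record the text was derived from.
        ("WARC-Refers-To", refers_to),
        ("Content-Type", "text/plain; charset=utf-8"),
        # Length of the record block in bytes, not characters.
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(
        "{}: {}\r\n".format(name, value) for name, value in headers
    )
    # A blank line separates headers from the block; two CRLFs end the record.
    return head.encode("utf-8") + b"\r\n" + payload + b"\r\n\r\n"


# Assumed inputs: a made-up URI and a placeholder ID for the original record.
record = build_conversion_record(
    "http://example.org/report.pdf",
    "<urn:uuid:00000000-0000-0000-0000-000000000000>",
    "Text extracted from the original document.",
)
```

Appending the returned bytes to a file opened in binary mode yields an uncompressed WARC file containing one conversion record per call.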

Item Type: Software
Divisions: Faculty of Humanities and Social Sciences > School of the Arts
Depositing User: Symplectic Admin
Date Deposited: 15 Jul 2021 08:11
Last Modified: 06 Oct 2022 06:17
DOI: 10.5281/zenodo.12183
Open Access URL: http://doi.org/10.5281/zenodo.12183
URI: https://livrepository.liverpool.ac.uk/id/eprint/3130016