Nicholls, Tom ORCID: 0000-0002-6971-8614 (2014) Warctika: Python library for processing WARC files through Apache Tika. [Software]
Abstract
This library processes web crawl data fetched with the Heritrix web crawler (or other tools that produce WARC files): it extracts the plain text from structured formats and saves the result as WARC "conversion" records. Its primary use is to extract text from web crawl data sets for machine learning and supervised classification work. WARC (Web ARChive) is a file format for storing web crawls (see http://bibnum.bnf.fr/WARC/). The library was originally based on the "warc" library by the Internet Archive and others, but now relies on the hanzo warctools and shares no code with the original library. The hanzo library on which this code depends can be installed with 'pip install warctools'. Beware that several old versions are listed in the package index under different names.
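The extract-and-resave step described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not Warctika's actual code or API: Python's `html.parser` stands in for Apache Tika, the function names `extract_text` and `make_conversion_record` are hypothetical, and the record layout (version line, named fields, blank line, content block, two trailing CRLFs) is an assumption based on the WARC 1.0 format.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser
import uuid


class _TextExtractor(HTMLParser):
    """Stand-in for Apache Tika (assumption): collects visible text from HTML."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the plain text of an HTML document (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def make_conversion_record(original_record_id, target_uri, text):
    """Serialise extracted text as a WARC 'conversion' record (bytes).

    WARC-Refers-To links the new record back to the record it was
    derived from; Content-Length counts the bytes of the payload.
    """
    payload = text.encode("utf-8")
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        ("WARC-Type", "conversion"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", now),
        ("WARC-Refers-To", original_record_id),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % h for h in headers) + "\r\n"
    return head.encode("utf-8") + payload + b"\r\n\r\n"
```

Calling `make_conversion_record(record_id, uri, extract_text(html))` for each response record in a crawl, and appending the returned bytes to an output file, would give a text-only companion archive in the spirit of the one described above. A real pipeline would read the input with a WARC library such as warctools and hand the payload to Tika, which handles many more formats than HTML.
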
Item Type: | Software |
---|---|
Divisions: | Faculty of Humanities and Social Sciences > School of the Arts |
Depositing User: | Symplectic Admin |
Date Deposited: | 15 Jul 2021 08:11 |
Last Modified: | 18 Jan 2023 21:36 |
DOI: | 10.5281/zenodo.12183 |
Open Access URL: | http://doi.org/10.5281/zenodo.12183 |
URI: | https://livrepository.liverpool.ac.uk/id/eprint/3130016 |