RISJbot: a scrapy project to extract the text and metadata of articles from news websites



Nicholls, Tom ORCID: 0000-0002-6971-8614
(2018) RISJbot: a scrapy project to extract the text and metadata of articles from news websites. [Software]


Abstract

RISJbot should provide much of the structure and parsing code needed to fetch articles from arbitrary news websites. It may work out-of-the-box on some or all of the sites with specific spiders already written (see below), but be aware that web scrapers are by their nature somewhat brittle: they depend on the underlying format and structure of each site's pages, and when these change they tend to break. Although RISJbot has a fallback scraper that does a reasonable job with arbitrary news pages, it is not a substitute for a hand-tailored spider. Some experience with Python would be very helpful: if sites update their templates, or you want to add a new site to the collection, some coding will be necessary. I've tried to ensure that the existing code is well commented, and the Scrapy docs are themselves quite good if you need to understand what is going on behind the scenes. You should be aware that this was written to support the author's academic research into online news. It is still actively (if slowly) developed for that purpose, but it is not production-level code and comes with even fewer guarantees than most Free software.
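
The brittleness described above can be illustrated with a minimal sketch. This is not RISJbot's code and does not use Scrapy; it uses only the Python standard library, and every name in it (ArticleParser, extract, the CSS class names, the sample pages) is hypothetical. It assumes a site-specific rule keyed on a tag and class, plus a generic fallback that takes any h1 and p text when the site-specific rule stops matching.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"br", "img", "meta", "link", "hr", "input"}


class ArticleParser(HTMLParser):
    """Group text by the (tag, class) of its innermost enclosing element."""

    def __init__(self):
        super().__init__()
        self._stack = []        # currently open (tag, class) pairs
        self.text_by_key = {}   # (tag, class) -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        open_tags = [t for t, _ in self._stack]
        if tag in open_tags:
            # Drop the most recent matching tag and anything left open above it.
            idx = len(open_tags) - 1 - open_tags[::-1].index(tag)
            del self._stack[idx:]

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            self.text_by_key.setdefault(self._stack[-1], []).append(text)


def extract(html, headline_key, body_key):
    """Try the site-specific rule first, then a generic fallback."""
    parser = ArticleParser()
    parser.feed(html)
    headline = parser.text_by_key.get(headline_key)
    body = parser.text_by_key.get(body_key)
    if headline and body:
        return {"headline": headline[0], "bodytext": " ".join(body),
                "method": "tailored"}
    # Fallback: any <h1> and every <p>, whatever their class.
    # This is less precise and may pick up promotional text.
    items = parser.text_by_key.items()
    h1 = [t for (tag, _), texts in items if tag == "h1" for t in texts]
    paras = [t for (tag, _), texts in items if tag == "p" for t in texts]
    return {"headline": h1[0] if h1 else None, "bodytext": " ".join(paras),
            "method": "fallback"}


# Hypothetical page before and after a template redesign.
OLD_TEMPLATE = (
    '<html><body><h1 class="story-headline">Council approves budget</h1>'
    '<p class="story-body">The vote was close.</p>'
    '<p class="story-body">Critics objected.</p>'
    '<p class="promo">Subscribe now</p></body></html>'
)
NEW_TEMPLATE = (
    '<html><body><h1 class="hed">Council approves budget</h1>'
    '<p class="txt">The vote was close.</p>'
    '<p class="txt">Critics objected.</p>'
    '<p class="promo">Subscribe now</p></body></html>'
)

RULE = (("h1", "story-headline"), ("p", "story-body"))
old_result = extract(OLD_TEMPLATE, *RULE)
new_result = extract(NEW_TEMPLATE, *RULE)
print(old_result["method"], "|", old_result["bodytext"])
print(new_result["method"], "|", new_result["bodytext"])
```

On the old template the site-specific rule matches and returns only the article paragraphs. After the redesign the same rule matches nothing, so the generic fallback runs and also sweeps in the "Subscribe now" promo, which is the kind of imprecision a hand-tailored rule avoids until the template changes again.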

Item Type: Software
Divisions: Faculty of Humanities and Social Sciences > School of the Arts
Depositing User: Symplectic Admin
Date Deposited: 15 Jul 2021 08:11
Last Modified: 18 Jan 2023 21:36
DOI: 10.5281/ZENODO.1341873
Open Access URL: https://doi.org/10.5281/zenodo.1341873
URI: https://livrepository.liverpool.ac.uk/id/eprint/3130014