Combining Textual and Visual Information for Typed and Handwritten Text Separation in Legal Documents



Torrisi, Alessandro, Bevan, Robert, Atkinson, Katie ORCID: 0000-0002-5683-4106, Bollegala, Danushka ORCID: 0000-0003-4476-7003 and Coenen, Frans ORCID: 0000-0003-1026-6649
(2019) Combining Textual and Visual Information for Typed and Handwritten Text Separation in Legal Documents.

Torrisi et al camera ready.pdf - Author Accepted Manuscript

Abstract

A paginated legal bundle is an indexed version of all the evidence documents considered relevant to a court case. The pagination process requires all documents to be analysed by an expert and sorted accordingly. This is a time-consuming and expensive task. Automated pagination is complicated by the fact that the constituent documents can contain both typed and handwritten text. A successful auto-pagination system must recognise the different text types and treat them accordingly. In this paper we compare methods for determining the type of text data contained within paginated bundle pages. Specifically, we classify pages as containing typed data only, handwritten data only, or a mixture of the two. For this purpose, we compare text classification methods, image classification methods, and ensemble methods using both textual and visual information. We find that the text-based and image-based approaches provide complementary information, and that combining the two produces a powerful document classifier.
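
As an illustration of how textual and visual predictions might be combined, the sketch below uses late fusion: a weighted average of the class probabilities from a text classifier and an image classifier over the three page classes named in the abstract. This is an assumption for illustration only; the abstract does not say which ensemble method the paper uses. The names LABELS, fuse, predict and text_weight are hypothetical, and the probability values in the usage note are made up.

```python
from typing import Dict

# The three page classes described in the abstract.
LABELS = ("typed", "handwritten", "mixed")


def fuse(text_probs: Dict[str, float],
         image_probs: Dict[str, float],
         text_weight: float = 0.5) -> Dict[str, float]:
    """Late fusion: weighted average of two per-class probability dicts.

    text_weight is a hypothetical tuning parameter, not taken from the paper.
    """
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    fused = {
        label: text_weight * text_probs[label]
        + (1.0 - text_weight) * image_probs[label]
        for label in LABELS
    }
    total = sum(fused.values())
    if total <= 0.0:
        raise ValueError("fused probabilities sum to zero")
    # Renormalise so the fused scores form a probability distribution.
    return {label: p / total for label, p in fused.items()}


def predict(text_probs: Dict[str, float],
            image_probs: Dict[str, float],
            text_weight: float = 0.5) -> str:
    """Return the label with the highest fused probability."""
    fused = fuse(text_probs, image_probs, text_weight)
    return max(fused, key=fused.get)
```

For example, with made-up scores where the text classifier favours "typed" (0.6 typed, 0.1 handwritten, 0.3 mixed) and the image classifier favours "mixed" (0.2 typed, 0.1 handwritten, 0.7 mixed), equal weighting gives "mixed", while text_weight=1.0 reduces to the text classifier alone and gives "typed".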

Item Type: Conference or Workshop Item (Unspecified)
Uncontrolled Keywords: Pagination of Legal Bundles, Image Classification, Text Classification
Depositing User: Symplectic Admin
Date Deposited: 06 Apr 2020 11:25
Last Modified: 18 Jan 2023 23:56
DOI: 10.3233/FAIA190329
URI: https://livrepository.liverpool.ac.uk/id/eprint/3081429