This article presents the results of a multidisciplinary project aimed at better understanding the impact of different digitization strategies on computational text analysis. More specifically, it describes an effort to automatically discern the authorship of Jacob and Wilhelm Grimm in a body of uncorrected correspondence processed by HTR (Handwritten Text Recognition) and OCR (Optical Character Recognition), reporting on the effect this noise has on the analyses necessary to computationally identify the differing writing styles of the two brothers. In summary, our findings show that OCR digitization serves as a reliable proxy for the more painstaking process of manual digitization, at least when it comes to authorship attribution. Our results suggest that attribution is viable even when using training and test sets from different digitization pipelines. With regard to HTR, this research demonstrates that even though automated transcription significantly increases the risk of text misclassification when compared to OCR, a transcription cleanliness above ≈ 20% is already sufficient to achieve a higher-than-chance probability of correct binary attribution.
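The abstract does not specify the attribution model used, but binary authorship attribution of the kind described is commonly approached with character n-gram profiles, which degrade gracefully under OCR/HTR noise because individual recognition errors corrupt only a few n-grams. The following is a minimal illustrative sketch of that general technique, not the project's actual pipeline; the function names and the nearest-profile (cosine similarity) decision rule are assumptions for the example.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Relative-frequency profile of character n-grams (whitespace-normalized)."""
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(p[g] * q[g] for g in set(p) & set(q))
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def attribute(test_text, author_profiles):
    """Assign the test text to the candidate author with the most similar profile."""
    test_profile = char_ngrams(test_text)
    return max(author_profiles, key=lambda a: cosine(test_profile, author_profiles[a]))
```

In a setting like the one described, one profile would be built from each brother's training letters (possibly from a different digitization pipeline than the test letters), and each uncorrected HTR/OCR transcript would then be attributed by nearest profile.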
I work at the intersection of computing, philology, and linguistics, both as an independent scholar and as a software developer collaborating with other scholars on digital humanities projects. My interests include morphology (theoretical, computational, and historical), Indo-European linguistics, Linguistic Linked Open Data, text encoding and annotation of historical language corpora (especially Ancient Greek but also Old English and Old Norse), machine-actionable language description, and computer-aided historical language learning (especially Ancient Greek but also Old English and Old Norse).
I am a tenure-track researcher at the Meertens Institute of the Royal Netherlands Academy of Arts and Sciences. My research is interdisciplinary, applying computational methods to the humanities, in particular folkloristics. My research interests lie in the development of computational text analysis methods in the context of ethnology, anthropology, literary theory, and cultural evolution (see my résumé for further details). Drop me a line or follow me on Twitter or GitHub.
Guest presentation in Projects in Rare Book Digitization course (Pratt University, LIS 666) on analyzing digitized books and printed materials with digital humanities methods (primarily text analysis).
Increasing numbers of primary and secondary source texts have been digitized in recent years. Scholars who want to study these new collections in depth need computational assistance because of their large scale. The text analysis tools currently available to non-programmers operate at the word level, and they show tables of counts and lists of occurrences, but rarely interactive visualizations. We propose to build a text analysis tool that includes visualizations and works on the grammatical structure and stylistic features of text, applying highly accurate technology from computational linguistics and authorship identification to extract this information. We will develop our tool for a collection of slave narratives whose authorship is ambiguous. In doing so, we will find out whether visualizations of grammatical and stylistic features are useful to literary scholars, and whether this information allows them to make satisfying large-scale analyses of their texts.
Melissa Terras is Director of the UCL Centre for Digital Humanities, Professor of Digital Humanities in UCL’s Department of Information Studies, and Vice Dean of Research in UCL’s Faculty of Arts and Humanities. With a background in Classical Art History, English Literature, and Computing Science, her doctorate (Engineering, University of Oxford) examined how to use advanced information engineering technologies to interpret and read Roman texts. Her publications include “Image to Interpretation: Intelligent Systems to Aid Historians in the Reading of the Vindolanda Texts” (Oxford University Press, 2006) and “Digital Images for the Information Professional” (Ashgate, 2008), and she has co-edited various volumes such as “Digital Humanities in Practice” (Facet, 2012) and “Defining Digital Humanities: A Reader” (Ashgate, 2013). She currently serves on the Board of Curators of the University of Oxford Libraries and the Board of the National Library of Scotland, and is a Fellow of the Chartered Institute of Library and Information Professionals and a Fellow of the British Computer Society. Her research focuses on the use of computational techniques to enable research in the arts and humanities that would otherwise be impossible. You can generally find her on Twitter @melissaterras.
I currently work as Head of Film Access at the Bundesarchiv in Berlin. Between 2016 and 2018 I was the administrative head and a researcher at the Brandenburg Center for Media Studies in Potsdam. From 2010 to September 2016 I worked as a researcher, curator, and archivist at the Austrian Film Museum in Vienna. My main areas of expertise include database development and metadata structures, as well as the publication of archival films on DVD and the internet (e.g. Kinonedelja – Online Edition). I obtained my PhD in Russian Studies and a master’s degree in Comparative Literature from the Universities of Innsbruck and Vienna. In 2016 I also completed a degree in Library and Information Science at the Humboldt University in Berlin. I am the author of the book Kollision der Kader. Dziga Vertovs Filme, die Visualisierung ihrer Strukturen und die Digital Humanities (2016) and have published on Russian cinema, archival collections, and the visualization of filmic structures.
This essay describes the popular Bechdel Test—a measure of women’s dialogue in films—in terms of social network analysis within fictional narrative. It argues that this form of vernacular criticism arrives at a productive convergence with contemporary academic critical methodologies in surface and postcritical reading practices, on the one hand, and digital humanities, on the other. The data-oriented character of the Bechdel Test, which a text rigidly passes or fails, stands in sharp contrast to identification- or recognition-based evaluations of a text’s feminist orientation, particularly because the former does not prescribe the content, but merely the social form, of women’s agency. This essay connects the Bechdel Test and a lineage of feminist and early queer theory to current work on social network analysis within literary texts, and it argues that the Bechdel Test offers the beginnings of a measured approach to understanding agency within actor networks.
This syllabus was my fourth version of a course aimed at introducing the digital humanities at an undergraduate level. The course was organized around four projects, each of which was oriented by a theoretical reading: mapping a novel; text analysis with archival sources; creating composite images from digital films; and text digitization, extraction, and analysis from a large, in-copyright corpus.