Compounded Mediation: A Data Archaeology of the Newspaper Navigator Dataset

    Author(s):
    Benjamin Lee
    Date:
    2020
    Subject(s):
    Critical data studies, Digital humanities, Digital libraries, Machine learning, Science and technology studies (STS)
    Item Type:
    Article
    Tag(s):
    Chronicling America, data archaeology, digitized newspapers, library of congress, Newspaper Navigator
    Permanent URL:
    http://dx.doi.org/10.17613/k9gt-6685
    Abstract:
    The increasing role of machine learning in the construction of cultural heritage and humanities datasets necessitates critical examination of the myriad biases introduced by machines, algorithms, and the humans who build and deploy them. From image classification to OCR, the effects of decisions ostensibly made by machines compound through the digitization pipeline and redouble in each step, mediating our interactions with digitally-rendered artifacts through the search and discovery process. Here, I consider the Library of Congress’s Newspaper Navigator dataset, which I created as part of the Library of Congress’s Innovator-in-Residence program. The dataset consists of visual content extracted from 16 million historic newspaper pages in the Chronicling America database using machine learning. In this data archaeology, I examine the ways in which a Chronicling America newspaper page is transmuted and decontextualized during its journey from a physical artifact to a series of probabilistic photographs, illustrations, maps, comics, cartoons, headlines, and advertisements in the Newspaper Navigator dataset. I consider the digitization journeys of four different pages in Black newspapers in Chronicling America that reproduce the same photograph of W.E.B. Du Bois. In tracing the pages’ journeys, I unpack how each step in the pipelines, such as the imaging process and the construction of training data, not only imprints bias on the resulting Newspaper Navigator dataset but also propagates the bias via the machine learning algorithms employed. I investigate the limitations of the Newspaper Navigator dataset and machine learning as it relates to cultural heritage, from marginalization and erasure via algorithmic bias to unfair labor practices in the construction of commonly-used datasets. I argue that any use of machine learning with cultural heritage must be done with an understanding of the broader socio-technical ecosystems in which the algorithms have been utilized.
    Metadata:
    Status:
    Published
    Last Updated:
    2 months ago
    License:
    All-Rights-Granted

    Downloads

    Item Name: bcgl-nn-data-archaeology.pdf
    Downloads: 168