A group for any and all who are interested in using online crowdsourcing for research, or researching the practice of crowdsourcing for research in the humanities. Practitioners, participants, enthusiasts and skeptics welcome. This is a group for information, discussion, and sharing resources (projects, toolkits, analysis methods, publications, etc.).

The Decade in Crowdsourcing (of Transcription and other tasks)

      Ben Brumfield (@benwbrum)

      I recently posted a review of the major developments I saw in crowdsourced transcription during the 2010s, and was wondering what other practitioners’ opinions were about what the 2020s might bring, both in transcription and in other crowdsourcing tasks.

      In particular, I’m curious what other people see as the big challenges to be addressed or the progress they project over the next decade.  My own ideas about the 2020s are excerpted below:

      Quality control

      Quality control methodologies remain a hot topic, as wiki-like platforms like FromThePage experiment with assigned review and double-keying platforms like Zooniverse experiment with collaborative approaches. This remains a challenge not because volunteers don’t produce good work, but because quality control methods require projects to carefully balance volunteer effort, staff labor, suitability to the material, and usefulness of results.
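
      To make that balancing act concrete, here’s a minimal sketch (in Python, with an illustrative similarity threshold; not how FromThePage, Zooniverse, or any other platform actually implements it) of double-keyed reconciliation: lines where two independent transcriptions agree are accepted automatically, so only disagreements consume review effort.

      ```python
      from difflib import SequenceMatcher

      def reconcile(transcript_a: str, transcript_b: str, threshold: float = 0.95):
          """Compare two independent transcriptions of the same page, line by line.

          Near-identical lines are accepted automatically; the rest are flagged
          for staff or volunteer review. The 0.95 threshold is illustrative.
          """
          accepted, needs_review = [], []
          for line_a, line_b in zip(transcript_a.splitlines(), transcript_b.splitlines()):
              similarity = SequenceMatcher(None, line_a, line_b).ratio()
              if similarity >= threshold:
                  accepted.append(line_a)
              else:
                  needs_review.append((line_a, line_b, similarity))
          return accepted, needs_review
      ```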

      Unruly materials

      I think we’ll also see more progress on tools for source material that is currently poorly served. Free-text tools like FromThePage now support structured-data (i.e. form-based) transcription, and structured transcription tools like Zooniverse can support free text, so there has been some progress on this front during the last few years. Despite that progress, tabular documents like ledgers or census records remain hard to transcribe in a scalable way. Audio transcription seems like an obvious next step for many platforms; the Smithsonian Transcription Center has already begun with TC Sound. Linking transcribed text to linked data resources will require new user interfaces and complex data flows. Finally, while OCR correction seems like it should be a solved problem (and is for Trove), it continues to present massive challenges in layout analysis for newspapers and volunteer motivation for everything else.
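
      As an illustration of why tabular material is hard to scale, here’s a hedged sketch of per-cell aggregation: every cell of a ledger or census row is effectively its own micro-task, and each needs its own consensus decision. The field shapes and agreement threshold below are hypothetical, not drawn from any existing platform.

      ```python
      from collections import Counter

      def aggregate_cell(entries: list[str]) -> tuple[str, float]:
          """Pick the most common volunteer reading of one table cell and
          report what fraction of volunteers agreed on it."""
          counts = Counter(entry.strip() for entry in entries)
          value, votes = counts.most_common(1)[0]
          return value, votes / len(entries)

      def aggregate_row(volunteer_rows: list[dict[str, str]], min_agreement: float = 0.66):
          """Aggregate several volunteers' transcriptions of a single row,
          cell by cell, flagging low-agreement cells for review."""
          consensus, flagged = {}, []
          for field in volunteer_rows[0]:
              value, agreement = aggregate_cell([row[field] for row in volunteer_rows])
              consensus[field] = value
              if agreement < min_agreement:
                  flagged.append(field)
          return consensus, flagged
      ```

      A ledger page multiplies this by dozens of rows and columns, which is where both the interface design and the aggregation get painful.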

      Artificial intelligence

      The Transcribe Bentham team has led the way on integrating crowdsourced transcription with handwritten text recognition as part of the READ project, and the Transkribus HTR platform has built a crowdsourcing component into its software. That’s solid progress towards integrating AI techniques with crowdsourcing, but we can expect a lot more flux this decade as the boundaries shift between the kinds of tasks computers do well and those which only humans can do. If experience with OCR is any indication, one of the biggest challenges will be finding ways to use machine learning to make humans more productive without replacing or demotivating them.
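
      One common pattern for that kind of integration is confidence-based routing, sketched below. The (text, confidence) pairs are an assumed export format, not the actual Transkribus API; the point is the division of labor, where the recognizer keeps the easy lines and humans get the ones that genuinely need them.

      ```python
      def route_lines(htr_lines: list[tuple[str, float]], threshold: float = 0.90):
          """Split HTR output into lines accepted automatically and lines
          sent to volunteers for correction, based on recognizer confidence.

          `htr_lines` is assumed to be (text, confidence) pairs such as an
          HTR engine might export; the 0.90 threshold is illustrative.
          """
          auto_accepted, for_volunteers = [], []
          for text, confidence in htr_lines:
              if confidence >= threshold:
                  auto_accepted.append(text)
              else:
                  for_volunteers.append(text)
          return auto_accepted, for_volunteers
      ```

      Where to set that threshold is as much a motivation question as an accuracy one: set it too high and volunteers spend their time rubber-stamping lines the machine nearly got right; set it too low and errors slip through untouched.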

      Samantha Blickhan (@snblickhan)

      Thanks, Ben! I’d certainly agree with your assessment above.

      We’re easing into some audio transcription projects this year on Zooniverse (which is very exciting!), but there’s still a ton of work to be done. I know I’d love to hear more about others’ experiences (you all are starting some audio efforts on FtP, if you haven’t already, yes?) and certainly more about the Smithsonian’s process of branching out into the audio world.

      Tabular data is also high on our list of data formats to tackle, because it’s *so* common. People have approached it on Zooniverse using a variety of tools, but we don’t yet have an elegant solution.

      Machine learning and support for OCR/HTR integration is a huge one for us, too. But this raises the very good question posed by Mia Ridge & others in recent months: is crowdsourcing a data creation/processing task, or one for engagement with collections/heritage materials? OCR verification/editing may be more efficient, but is it as fulfilling? How do we strike a balance between realistic project lifecycles and ethical practices with volunteer communities? These are the questions that keep me up at night, anyway…
