Thursday, August 24, 2023

IWCP 2023, San Jose

Thanks to Isabelle Marthot-Santaniello, I was able to attend the IWCP workshop on computational paleography at the ICDAR 2023 conference in San Jose, CA today. It was great to finally meet in person many with whom I had only interacted online in the past and to hear about the exciting work that is being done in the area of digital paleography.

Dominique Stutzmann opened with a keynote overview of the state of the art, in which he suggested that many of the basic problems in the field are essentially already solved (e.g., writer identification, style classification, dating). He then pointed towards more difficult challenges that remain (e.g., degraded and/or small corpora, other difficulties in providing reliable labelled data). Momina Moetesum discussed a project on computerized restoration of broken ink strokes in Greek papyri, where they trained a network to reconstruct Greek letters that had been artificially degraded. Vasiliki Kougia gave an update on a project using dated Greek literary hands to automatically classify (and date) Greek manuscripts. And Isabelle Marthot-Santaniello and Marie Beurton-Aimar presented their work on stylistic clustering based on clips of individual characters.

After the break, Julius Tabin discussed a project to capture and annotate Hieratic characters from facsimiles of Egyptian texts in a way that can be used to illuminate style development. Anguelos Nicolaou described an approach to quickly (manually) label regions on medieval charters to identify features for further analysis. In this presentation, he made the interesting observation that the more detail and discussion needed for a classification/label, the more likely it is that labellers will have difficulty providing consistent and clear labelling, so we should be careful of too much precision in labels. Of course, it is precisely the difficult, transitional, and contested areas that are generally of most interest to humanities scholars, as Dominique Stutzmann pointed out in the discussion. And Sojung Lucia Kim talked about a project using deep learning to classify Korean records in Chinese characters.

In the final discussion, two questions dominated. The first was whether manuscript dating is a problem of classification into discrete categories (e.g., date ranges by century) or regression (i.e., placing on a continuous timeline). The general consensus was that it depends upon the nature of the labelled data and what it allows. I pointed out that radiocarbon dating has much potential to change the nature of the data and allow for more regression models in corpora for which this has not been possible in the past, but this is still only a dream for many corpora.
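
To make the classification/regression distinction a little more concrete, here is a minimal sketch in Python. It is only an illustration and not any presenter's method: the function names, the century-wide bins, and the example years are all assumptions chosen for the sake of the example.

```python
# Illustrative sketch only: names, century bins, and example years are
# assumptions, not taken from any project presented at the workshop.

def century_label(year: int) -> int:
    """Map a CE year to its century, e.g. 201-300 -> 3."""
    if year < 1:
        raise ValueError("this sketch only handles CE years")
    return (year - 1) // 100 + 1


def classification_error(true_year: int, predicted_year: int) -> int:
    """0/1 loss on century labels: every wrong century counts the same."""
    return int(century_label(true_year) != century_label(predicted_year))


def regression_error(true_year: int, predicted_year: int) -> int:
    """Absolute error in years on a continuous timeline."""
    return abs(true_year - predicted_year)


# (true year, predicted year) pairs, invented for illustration
examples = [(299, 301), (210, 290), (150, 450)]

for true_year, predicted_year in examples:
    print(
        true_year,
        predicted_year,
        "class error:", classification_error(true_year, predicted_year),
        "years off:", regression_error(true_year, predicted_year),
    )
```

In this toy setup, a prediction that is two years off can count as a full classification error if it crosses a century boundary, while a prediction eighty years off inside the same century counts as correct. Regression avoids that, but only if the labelled data carry dates precise enough to train on, which is where the consensus about "the nature of the labelled data" comes in.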

The second question was whether it is better to create "end-to-end" products that move directly from input to the final desired result or "modular" products that break the process into various (intermediate) steps, each with its own data records. The modular approach actually (perhaps counterintuitively) decreases accuracy, but allows for the explication of various stages in the analysis that may be useful for humanities scholars. All in all, it was a great day of meeting and learning, and I'm very grateful to have been able to attend.