What will philology become in the wake of the digital revolution? How can computer vision, handwritten text recognition, natural language processing, deep neural networks and/or other forms of machine learning refine the arsenal of techniques for studying premodern evidence?
This works-in-progress symposium will feature six teams of Princeton scholars who are applying machine learning to manuscripts, rare books, archives, inscriptions, coins, and other pre-1600 sources. Presentations will include projects on materials in Syriac, Hebrew, Latin, Greek, Chinese and English. David Smith (Computer Science, Northeastern) will offer remarks.
This event is co-organized by the Center for Digital Humanities and the Manuscript, Rare Book and Archive Studies Initiative, with support from the Center for Statistics and Machine Learning. This symposium is intended as the first of a pair; the second will take place in 2023–24 and solicit proposals from beyond the Princeton community.
Questions? Please email [email protected] and a member of the coordinating committee will get back to you.
Friday, December 9, 2022, 9am-5pm
Center for Digital Humanities (B Floor, Firestone Library; also available as a Zoom webinar).
George Kiraz (Research Associate, Institute for Advanced Study)
Syriac, like Arabic, is a cursive script, which has hindered the development of usable OCR engines for Syriac. User-friendly portals such as eScriptorium and Transkribus, among others, have now allowed us to remove the “language” and “script” elements from the equation. This talk will explore the process of building HTR and OCR models for the various Syriac scripts and how these models were used to build, within a short period, a corpus (https://simtho.bethmardutho.org) of more than 13 million words. The talk will also explore methods for post-HTR/OCR processing to obtain electronic texts.
Moderator: Jack Tannous (History, Princeton University)
Marina Rustow (Near Eastern Studies, Princeton University) and Daniel Stökl Ben Ezra (École Pratique des Hautes Études)
The Cairo Geniza is one of the main sources for the history of northern Africa, western Asia and the Indian Ocean basin in the tenth–thirteenth centuries. Yet fewer than 5,000 of the estimated 30,000 documentary geniza fragments have been published, and even that modest number has taken the field approximately 130 years to transcribe. At this rate, transcribing the rest of the corpus would take centuries. To address the urgent need for more geniza transcriptions, we undertook a project called HTR4PGP, which uses the eScriptorium API and, as ground truth, the Princeton Geniza Project’s electronic corpus of 4,000 transcriptions. The project aims to decipher an additional 12,000 geniza documents automatically, thereby quadrupling the corpus of geniza transcriptions. But we’ve faced some practical challenges along the way, chief among them the heterogeneity of our training data. The documents are a single page long, written by many scribes, often in informal hands. Most are torn or have holes. More than half the documents have complex layouts, with text-blocks laid out at multiple angles. Because of the complexity of the corpus, we set 90% accuracy as our benchmark — not as good as a human with a PhD, but good enough to make the texts findable and more usable for philologists and historians who wish to improve them and continue to work with them. We will present the interim results of the project and some possible future directions for research, and also reflect on the implications of HTR work for the future of philology.
Moderator: George Kiraz (Research Associate, Institute for Advanced Study)
Helmut Reimitz (History, Princeton University), Tim Geelhaar (University of Bielefeld), Jan Odstrčilík (Austrian Academy of Sciences, Vienna)
This presentation will discuss a project to use HTR for ninth–twelfth century Latin manuscripts containing the massive Ten Books of Histories by Gregory of Tours. This work, composed in the sixth century, was a bestseller in the Middle Ages: there are more than fifty extant manuscripts dating from the seventh to the fifteenth centuries. But the text was less stable than modern editions and translations suggest. A subset of manuscripts contains a reworking of the text; while they follow a common template, the compilers’ cut-and-paste job resulted in different arrangements of the text. Our project used HTR technology, first, to develop the first ground truth model for Carolingian minuscule, and, second, to build digital tools to compare versions of the text and to tag them. Doing so allows us to look over the shoulders of medieval historians and compilers and to understand the complex reworkings of a large corpus.
Moderator: Barbara Graziosi (Classics, Princeton University)
Creston Brooks (Princeton University), Charlie Cowen-Breen (Cambridge), Barbara Graziosi (Classics, Princeton University), Johannes Haubold (Classics, Princeton University)
Ancient and medieval texts survive in manuscripts full of gaps and scribal errors. Generations of highly expert scholars, backed by institutional investment, have filled lacunae and corrected errors in classical Greco-Roman texts. But other periods of Greek literature have been less well-served. How can one produce editions with well-founded emendations and thereby ensure the future of a global archive of premodern texts even as philological expertise and investment are rapidly diminishing? To address this concern, the project LOGION has developed a deep neural network for textual restoration and emendation, and a system for tracking historical developments in language and style. Its architecture focuses on collaboration and decision-support for philologists. This presentation will illustrate the cooperative potential between artificial intelligence and human philology, taking the letters of the Byzantine author Psellos as our case study. First, we simulate gaps and compare results that have been achieved by philologists alone, by LOGION working by itself, and by philologists working in collaboration with LOGION. Second, we discuss some actual textual problems. Third, we outline next steps to improve the performance of LOGION on our current dataset and to support expansion to other archives and languages.
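The gap-simulation experiment described above can be sketched in a few lines. This is not LOGION itself (whose neural architecture is not detailed here); it only illustrates the evaluation protocol: hide known words behind lacuna markers, collect proposals from a philologist, a model, or both, and score them against the hidden ground truth. The tokenized example text, marker string, and function names are illustrative assumptions.

```python
import random

def simulate_gaps(tokens, n_gaps, seed=0):
    """Replace n_gaps randomly chosen tokens with a lacuna marker,
    returning the damaged text and the hidden ground-truth words."""
    rng = random.Random(seed)
    positions = rng.sample(range(len(tokens)), n_gaps)
    damaged = list(tokens)
    answers = {}
    for p in positions:
        answers[p] = damaged[p]
        damaged[p] = "[...]"
    return damaged, answers

def restoration_accuracy(answers, proposals):
    """Fraction of lacunae for which a proposed restoration
    matches the hidden token exactly."""
    correct = sum(proposals.get(p) == w for p, w in answers.items())
    return correct / len(answers)

# Toy corpus standing in for a Psellos letter.
text = "the wrath sing goddess of the son of Peleus".split()
damaged, answers = simulate_gaps(text, 2)
print(" ".join(damaged))
# Proposals from any source (human, model, or both) are scored the same way:
print(restoration_accuracy(answers, dict(answers)))  # perfect proposals score 1.0
```

In practice one would score not only exact matches but also ranked candidate lists, since a model that places the right word among its top few suggestions is still useful for decision support.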
Moderator: Will Noel (Special Collections, Princeton University Library)
Gian Duri Rominger (East Asian Studies, Princeton University) and Nick Budak (Software Developer, Stanford University Libraries)
This presentation will illustrate how algorithms and machine learning techniques can help to detect poetic features in pre-modern Chinese texts and thereby stratify and date them. The Chinese writing system has been stable over the centuries and allows readers of Classical Chinese to access texts spanning more than two millennia. But it also masks the phonological developments of the underlying language, leading to the strange phenomenon that ancient Chinese texts are now read using modern pronunciation. DIRECT (Digital Intertextual Resonances in Early Chinese Texts) was designed to address this problem using a grapheme-to-phoneme (g2p) conversion tool for historical Chinese texts, a feature currently being optimized through Natural Language Processing pipelines. Our project uses DIRECT for glosses from the 6th-century annotation collection Jingdian Shiwen 經典釋文 (Elucidation of Classical Texts). By rendering texts as phonological reconstructions, DIRECT detects poetic features such as rhyme, and Regular Expression (Regex) searches can help extract known poetic patterns. These, in turn, can allow the texts to be dated by applying findings from historical linguistics. While this pipeline is tailored to third–first century BCE texts, it can be extended to other premodern texts. This presentation will explore how algorithmic thinking can apply a distinct philological method to ancient and medieval Chinese texts, and illustrate limitations and possibilities of applying computational thinking to philological inquiry.
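The pipeline above — g2p conversion followed by pattern searches over the reconstructed phonology — can be illustrated with a minimal sketch. The finals table here is a tiny hypothetical stand-in for the reconstruction data DIRECT derives (its real tables, built from Jingdian Shiwen glosses, are far larger and more nuanced), and the function names are invented for illustration.

```python
import re

# Hypothetical reconstructed rhyme finals for a handful of graphs;
# a real table would come from a historical g2p resource.
FINALS = {"東": "uŋ", "公": "uŋ", "中": "uŋ", "天": "en", "年": "en", "人": "in"}

def line_final_rhymes(text):
    """Split on Chinese punctuation and return the reconstructed
    final of each line-ending graph ('?' if unknown)."""
    lines = [l for l in re.split(r"[。，、]", text) if l]
    return [FINALS.get(l[-1], "?") for l in lines]

def rhyming_pairs(finals):
    """Indices of line pairs sharing a known reconstructed final."""
    return [(i, j) for i in range(len(finals))
            for j in range(i + 1, len(finals))
            if finals[i] == finals[j] and finals[i] != "?"]

verse = "往東。見公。思天。經年。"
fin = line_final_rhymes(verse)
print(fin)                 # ['uŋ', 'uŋ', 'en', 'en']
print(rhyming_pairs(fin))  # [(0, 1), (2, 3)]
```

Rhyme schemes recovered this way can then be checked against the rhyme categories established by historical linguistics for particular periods, which is what permits the stratification and dating the presentation describes.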
Gabriel Swift (Special Collections, Princeton University Library), Seth Perry (Religion, Princeton University), Kurt Lemai (Princeton Class of 2025)
This project uses the text recognition software Transkribus to investigate an “unreadable” handwritten sermon by Samuel Phillips, Sr. (1625–96), a pastor in Rowley, Massachusetts who was involved in religious controversies such as the Salem witchcraft delusion. Phillips is one of the least-studied early Puritan leaders, likely due to the difficulty of reading his handwriting. We used Transkribus to create a handwriting model to decipher one of Phillips’s sermons, written in an impenetrable, minuscule hand on the margins and verso of a 1680 broadside preserved in Princeton’s Scheide Library. To create the training dataset, we used Python to scrape 431 images and transcriptions from Phillips’s diary as transcribed by Helen Gelinas and Lori Stokes, using these for an initial alignment of image and text and then building progressively larger training datasets. Our initial trial run produced a character error rate (CER) of 26.78%, but the final run improved the CER to 8.10%. We then ran this model on Princeton’s broadside, “unlocking” the text and making strong contextual guesses as to which characters or words the model had read incorrectly. The next phase of this project will apply the AI model to transcribe multiple Phillips sermons held at the Peabody Essex Museum and the New England Historic Genealogical Society. We hope it will support the first serious academic study of Phillips that will situate his intellectual output within the history of American Puritanism.
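The character error rate (CER) cited above is a standard HTR metric: the minimum number of single-character edits (insertions, deletions, substitutions) needed to turn the model’s output into the ground-truth transcription, divided by the length of the ground truth. A minimal self-contained sketch of the computation (the example strings are invented, not from Phillips’s sermon):

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: edit distance / reference length."""
    return levenshtein(reference, hypothesis) / len(reference)

# Toy example: one wrong character in a 16-character reference.
print(round(cer("the grace of god", "the grace of gad"), 4))  # 0.0625
```

Tools such as Transkribus report this figure automatically; a drop from roughly 27% to 8% CER means that on average fewer than one character in twelve needs manual correction.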
Moderator: Meredith Martin (English & Center for Digital Humanities, Princeton University)