Using eScriptorium and kraken as our infrastructure, we developed a simple but highly efficient procedure that reduces the human labor needed to create large amounts of segmentation ground truth for documents with highly complex layouts, i.e., documents comprising regions with lines at eight different angles. Our project deals with medieval documents from the Cairo Genizah, written in Hebrew script in Judeo‑Arabic, Aramaic and Hebrew, including letters, legal documents, lists, notes and accounts. There are about 40,000 documentary texts from the Genizah, of which only about 5,000 have been transcribed. Our current aim is therefore to create enough data to train a global segmentation model with a very large number of classes, so that it can segment complex layouts in a single step.
The event is part of the 'Documents anciens et reconnaissance automatique des écritures manuscrites' colloquium, which is taking place on June 23 and 24 in Paris.