Collaborative Research: RI: Small: Unsupervised Islamicate Manuscript Transcription via Lacunae Reconstruction
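Basic information
- Award number: 2200333
- Principal investigator:
- Amount: $300,000
- Host institution:
- Host institution country: United States
- Award type: Standard Grant
- Fiscal year: 2022
- Funding country: United States
- Project period: 2022-07-01 to 2025-06-30
- Project status: Ongoing
- Source:
- Keywords: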
Project abstract
This award tackles handwritten text recognition (HTR, the task of automatically transcribing images of handwritten manuscripts into symbolic text) for Islamicate manuscripts, a domain that encompasses Persian and Arabic written traditions originating in the premodern Islamic world (7th-19th centuries). HTR for modern text is itself a challenging problem that has received substantial attention from the fields of machine learning (ML) and artificial intelligence (AI). However, the predominance of modern text in HTR research is, to some extent, waning: current techniques are relatively robust on modern data, and contemporary written media production is already almost entirely digital. In contrast, historical manuscripts have received comparatively little attention from ML and AI, yet they represent both an exceptional opportunity for impact and a set of unique challenges for ML techniques. Specifically, the written traditions of the Islamicate world together form one of the largest -- if not the largest -- archives of human cultural production of the premodern world. Scanning and digitization efforts over the last decade have made images of Islamicate manuscripts in a large number of collections available to the public. However, this data remains 'locked' for most scholarly uses because it has not been transcribed into symbolic text, which is required for many types of analysis. In fact, the script styles used in Islamicate manuscripts -- 'scribal hands' -- vary so widely and differ so substantially from modern forms that even manual close reading of these texts requires expert training and is thus limited to a small subset of researchers. The primary outcome of this project will be new techniques that 'unlock' the Islamicate written tradition by accurately transcribing it.
As a result, this project has the potential to be transformative for humanities disciplines such as Islamic and Near Eastern Studies by enabling libraries to accurately transcribe entire collections and, further, by allowing individual researchers to accurately transcribe manuscripts outside the western canon. Finally, this research will also support interdisciplinary training of a diverse set of graduate students at the University of California San Diego and the University of Maryland.
Current techniques for HTR require large amounts of in-domain supervised training data in order to produce highly accurate transcriptions. The neural architectures behind these modern methods are capable of generalizing, to some degree, across modern handwriting styles when trained on larger and more diverse collections of transcribed data. However, their limitations make these techniques impractical for large-scale transcription of Islamicate texts for two reasons: (1) scribal hand variation across Islamicate manuscripts is much more pronounced than stylistic variation in modern handwriting; and (2) transcriptions of Islamicate manuscripts that can be used as supervised training data are extremely scarce because accurate manual transcription requires expert training. This project will develop a new unsupervised learning framework for Islamicate HTR centered around a novel pretraining task: lacuna reconstruction. The new approach trains a neural encoder for images of manuscript text lines by learning to reconstruct masked regions -- i.e., lacunae -- of unlabeled manuscript images. This completely unsupervised training criterion implicitly incentivizes the model to discover and encode discrete
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
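The lacuna reconstruction objective described in the abstract can be illustrated with a small sketch. This is not the project's implementation: the function names, the column-wise masking scheme, and the linear-interpolation stand-in for a neural encoder are all assumptions made for illustration. It shows only the shape of the training signal: blank out a span of an unlabeled text-line image, predict the missing pixels, and score the prediction on the masked span alone.

```python
import random


def sample_lacuna(width, span, rng):
    """Pick the [start, end) column range of one lacuna that is `span` columns wide."""
    start = rng.randrange(0, width - span + 1)
    return start, start + span


def mask_columns(line_image, start, end, fill=0.0):
    """Return a copy of the line image with columns [start, end) blanked out."""
    return [
        [fill if start <= x < end else pixel for x, pixel in enumerate(row)]
        for row in line_image
    ]


def interpolate_fill(masked, start, end):
    """Hypothetical stand-in for a learned model: fill the gap in each row by
    linear interpolation between the pixels bordering the gap.
    Assumes the gap does not cover the whole line."""
    filled = []
    for row in masked:
        left = row[start - 1] if start > 0 else row[end]
        right = row[end] if end < len(row) else row[start - 1]
        new_row = list(row)
        for x in range(start, end):
            t = (x - start + 1) / (end - start + 1)
            new_row[x] = left + t * (right - left)
        filled.append(new_row)
    return filled


def lacuna_loss(predicted, original, start, end):
    """Mean squared error computed only over the masked columns."""
    total, count = 0.0, 0
    for pred_row, orig_row in zip(predicted, original):
        for x in range(start, end):
            total += (pred_row[x] - orig_row[x]) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy 2x8 "text line" with made-up grayscale values in place of a scan.
    line = [[x / 7 for x in range(8)], [1 - x / 7 for x in range(8)]]
    start, end = sample_lacuna(width=8, span=3, rng=rng)
    masked = mask_columns(line, start, end)
    prediction = interpolate_fill(masked, start, end)
    print("lacuna columns:", (start, end))
    print("loss:", lacuna_loss(prediction, line, start, end))
```

In an actual pretraining setup, a trainable encoder would take the place of `interpolate_fill`, and the masked-span loss would be minimized by gradient descent over many unlabeled line images.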
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants by Taylor Berg-Kirkpatrick
CAREER: Modeling Language Evolution via Deep Probabilistic Factorization
- Award number: 2146151
- Fiscal year: 2022
- Amount: $300,000
- Award type: Continuing Grant
RI: Small: Print and Probability - A Statistical Approach to Analysis of Clandestine Publication
- Award number: 1936155
- Fiscal year: 2019
- Amount: $300,000
- Award type: Standard Grant
RI: Small: Print and Probability - A Statistical Approach to Analysis of Clandestine Publication
- Award number: 1816311
- Fiscal year: 2018
- Amount: $300,000
- Award type: Standard Grant
RI: Small: Collaborative Research: Unsupervised Transcription of Early Modern Documents
- Award number: 1618044
- Fiscal year: 2016
- Amount: $300,000
- Award type: Standard Grant
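Similar grants from the National Natural Science Foundation of China
Study of how the extracellular domain of the transmembrane protein LRP5 regulates the membrane receptor TβRI to promote homing and differentiation of BMSCs on titanium surfaces
- Award number: 82301120
- Year approved: 2023
- Amount: CNY 300,000
- Award type: Young Scientists Fund
Mechanism by which Dectin-2 exacerbates asthma attacks by promoting FcεRI aggregation and mast cell activation
- Award number: 82300022
- Year approved: 2023
- Amount: CNY 300,000
- Award type: Young Scientists Fund
Role and mechanism of UFMylation of TβRI in regulating the TGF-β signaling pathway and breast cancer metastasis
- Award number: 32200568
- Year approved: 2022
- Amount: CNY 300,000
- Award type: Young Scientists Fund
Discovery of β-carboline alkaloid TβRI inhibitors from the Tibetan medicinal plant Arenaria kansuensis and the mechanism of their anti-pulmonary-fibrosis effect
- Award number:
- Year approved: 2022
- Amount: CNY 300,000
- Award type: Young Scientists Fund
Discovery of β-carboline alkaloid TβRI inhibitors from the Tibetan medicinal plant Arenaria kansuensis and the mechanism of their anti-pulmonary-fibrosis effect
- Award number: 82204762
- Year approved: 2022
- Amount: CNY 300,000
- Award type: Young Scientists Fund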
Similar overseas grants
Collaborative Research: RI: Medium: Principles for Optimization, Generalization, and Transferability via Deep Neural Collapse
- Award number: 2312841
- Fiscal year: 2023
- Amount: $300,000
- Award type: Standard Grant
Collaborative Research: RI: Medium: Principles for Optimization, Generalization, and Transferability via Deep Neural Collapse
- Award number: 2312842
- Fiscal year: 2023
- Amount: $300,000
- Award type: Standard Grant
Collaborative Research: RI: Small: Foundations of Few-Round Active Learning
- Award number: 2313131
- Fiscal year: 2023
- Amount: $300,000
- Award type: Standard Grant
Collaborative Research: RI: Medium: Lie group representation learning for vision
- Award number: 2313151
- Fiscal year: 2023
- Amount: $300,000
- Award type: Continuing Grant
Collaborative Research: RI: Medium: Principles for Optimization, Generalization, and Transferability via Deep Neural Collapse
- Award number: 2312840
- Fiscal year: 2023
- Amount: $300,000
- Award type: Standard Grant