CRII: AF: RUI: Faster and Cache-Efficient Similarity Filters and Searches for Big Data
CRII:AF:RUI:更快且高效缓存的相似性过滤器和大数据搜索
基本信息
- 批准号:1755791
- 负责人:
- 金额:$ 17.5万
- 依托单位:
- 依托单位国家:美国
- 项目类别:Standard Grant
- 财政年份:2018
- 资助国家:美国
- 起止时间:2018-02-15 至 2022-08-31
- 项目状态:已结题
- 来源:
- 关键词:
项目摘要
Analysing and organising the massive amount of data collected everyday is one of the biggest challenges faced by computer scientists. This impacts almost every field of modern human life, such as health care, military applications, smart cities, transportation, networks, etc. To quickly analyse and organise such huge amounts of data, one needs to develop space and time efficient algorithms, that accelerate the search for information while minimising memory requirements. This project aims to develop space-efficient similarity filters, which quickly filter out queries that have very little similarity from the records existing in the database. This has applications in security, databases, machine learning, file systems, vision, pattern recognition, compression and networks. Apart from the targeted impact of these applications to society, this project also aims at producing the next generation of researchers and promoting under-represented groups in science and technology. Minority undergraduate and high school students will be involved in this project. Queens College, CUNY is a diverse institution, representing a wide range of ethnic minorities and has 28% Hispanic students. In an effort to engage more undergraduate students and women in research (Queens College has roughly 56% female students), the PI plans to offer a new course that develops an understanding of approximate membership and similarity search data structures in the setting of big data.Similarity searching is a fundamental problem in this context, where a massive data of records needs to be organised so as to answer quickly queries of the form: given a record, is it similar to some record in the data? Similarity search, modeled as near(est) neighbor search (NNS) is an important and well-studied problem with numerous applications: Given a set P of n points lying in some d-dimensional metric space, the goal is to preprocess P so as to return for a given query point q the point p in P that is closest to q. Exact versions of this problem are hard to solve, and all solutions to the approximate version of the problem require superlinear (more than nd) space. This space usage is prohibitive for many big data settings, where not just n, but d could be huge (e.g., the number of pixels in an image). This project is aimed at developing the theory and applications of Similarity Filters, which are decision (yes/no) based data structures. Given a query q, a (c, r)-similarity filter always returns yes if q is within a distance r of some input point, and returns no with a small error probability (thus some false positives are allowed) if all input points are at least cr away from the query. Recent results from the researcher shows that such filters can be built with sublinear (asymptotically less than nd) space. The project takes on two tasks: 1) to understand the approximation factor-space-query tradeoff for similarity filters, developing tight upper and lower bounds, especially in the sublinear space regime, and 2) to develop I/O efficient Similarity Filters and nearest neighbor data structures, akin to quotient or cascade filters for the approximate membership problem (where the in-RAM Bloom Filter is the standard data structure, used in many networking applications). The overarching goal is to use similarity filters on RAM in conjunction with a disk-based data structure: faraway queries will be filtered quickly, and nearby queries will have their neighbor(s) returned following a slower but cache friendly search. The availability of such a two-layered space-and-time-efficient data structure will greatly boost the scope of applications when the amount of data is massive, especially in databases, networks, security and machine learning.
分析和组织每天收集的大量数据是计算机科学家面临的最大挑战之一。这影响了现代人类生活的几乎每个领域,例如医疗保健,军事应用,智能城市,运输,网络等,以快速分析和组织如此庞大的数据,人们需要开发空间和时间效率的算法,以加速搜索信息,同时最大程度地减少记忆需求。该项目旨在开发空间有效的相似性过滤器,该过滤器快速滤除与数据库中存在的记录相似性几乎没有相似性的查询。这在安全性,数据库,机器学习,文件系统,视觉,模式识别,压缩和网络方面具有应用程序。除了这些应用对社会的目标影响外,该项目还旨在生产下一代研究人员,并促进科学和技术的代表性不足的群体。少数族裔本科生和高中生将参与该项目。 CUNY皇后学院是一个多元化的机构,代表着众多少数民族,拥有28%的西班牙裔学生。 In an effort to engage more undergraduate students and women in research (Queens College has roughly 56% female students), the PI plans to offer a new course that develops an understanding of approximate membership and similarity search data structures in the setting of big data.Similarity searching is a fundamental problem in this context, where a massive data of records needs to be organised so as to answer quickly queries of the form: given a record, is it similar to some record in the data?相似性搜索(以接近(EST)的邻居搜索(NNS)为模型是一个重要且研究充分的问题:给定许多应用程序:鉴于位于某些d维度量空间中的n个点的一组p,目标是预处理P,以便返回给定查询点Q q返回最接近Q的P中P中的P点P中的P点P。此问题的确切版本很难解决,并且所有问题的解决方案都需要超线性(超过ND)空间。对于许多大数据设置,此空间的使用量不高,不仅是n,而且D可能是巨大的(例如,图像中的像素数)。该项目旨在开发基于决策(是/否)数据结构的相似性过滤器的理论和应用。给定查询Q,如果Q在某些输入点的距离内,则Q(c,r)类似物过滤器总是返回是,并且如果所有输入点至少都远离查询,则以较小的误差概率返回no(因此允许使用某些误报)。研究人员的最新结果表明,这种过滤器可以使用sublinear(渐近少于ND)的空间构建。该项目执行了两项任务:1)了解相似性过滤器的近似因素空间 - 质量折衷方案,开发紧密的上限和下限,尤其是在sublinear空间制度中,以及2)),以及2)),以及有效的相似性过滤器以及有效的邻居数据结构,类似于商品或casacade滤波器的近似成员国问题(在此近似网络中,该网络都在范围内进行了过滤到标准的数据,这是许多标准的数据,是许多标准数据,是在许多标准数据结构中。总体目标是将RAM上的相似性过滤器与基于磁盘的数据结构结合使用:遥远的查询将迅速过滤,附近的查询将使他们的邻居在较慢但缓存友好型搜索后返回。当数据量庞大时,尤其是在数据库,网络,安全性和机器学习中时,这种两层的两层空间和时间效率数据结构的可用性将大大提高应用程序的范围。
项目成果
期刊论文数量(4)
专著数量(0)
科研奖励数量(0)
会议论文数量(0)
专利数量(0)
A Manifold View of Adversarial Risk
- DOI:10.48550/arxiv.2203.13277
- 发表时间:2022-03
- 期刊:
- 影响因子:0
- 作者:Wen-jun Zhang;Yikai Zhang;Xiaoling Hu;Mayank Goswami;Chao Chen;Dimitris N. Metaxas
- 通讯作者:Wen-jun Zhang;Yikai Zhang;Xiaoling Hu;Mayank Goswami;Chao Chen;Dimitris N. Metaxas
A Topological Filter for Learning with Label Noise
- DOI:
- 发表时间:2020-12
- 期刊:
- 影响因子:0
- 作者:Pengxiang Wu;Songzhu Zheng;Mayank Goswami;Dimitris N. Metaxas;Chao Chen
- 通讯作者:Pengxiang Wu;Songzhu Zheng;Mayank Goswami;Dimitris N. Metaxas;Chao Chen
On the I/O Complexity of the k-Nearest Neighbors Problem
- DOI:10.1145/3375395.3387649
- 发表时间:2020-02
- 期刊:
- 影响因子:0
- 作者:Mayank Goswami;R. Jacob;R. Pagh
- 通讯作者:Mayank Goswami;R. Jacob;R. Pagh
Error-Bounded Correction of Noisy Labels
- DOI:
- 发表时间:2020-07
- 期刊:
- 影响因子:0
- 作者:Songzhu Zheng;Pengxiang Wu;A. Goswami;Mayank Goswami;Dimitris N. Metaxas;Chao Chen
- 通讯作者:Songzhu Zheng;Pengxiang Wu;A. Goswami;Mayank Goswami;Dimitris N. Metaxas;Chao Chen
{{
item.title }}
{{ item.translation_title }}
- DOI:
{{ item.doi }} - 发表时间:
{{ item.publish_year }} - 期刊:
- 影响因子:{{ item.factor }}
- 作者:
{{ item.authors }} - 通讯作者:
{{ item.author }}
数据更新时间:{{ journalArticles.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ monograph.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ sciAawards.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ conferencePapers.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ patent.updateTime }}
Mayank Goswami其他文献
Weakly confined silicon nano-structures as a THz absorbing material
弱约束硅纳米结构作为太赫兹吸收材料
- DOI:
10.1063/5.0204896 - 发表时间:
2024 - 期刊:
- 影响因子:0
- 作者:
Pooja Sudha;Mayank Goswami;Arup Samanta - 通讯作者:
Arup Samanta
Missing Motility: High-resolution in vivo imaging of retinal microglia reveals stationary ramified cells.
运动性缺失:视网膜小胶质细胞的高分辨率体内成像揭示了静止的分支细胞。
- DOI:
- 发表时间:
2016 - 期刊:
- 影响因子:0
- 作者:
Eric B. Miller;Pengfei Zhang;Mayank Goswami;R. Zawadzki;E. Pugh;M. E. Burns - 通讯作者:
M. E. Burns
Pattern-Avoiding Access in Binary Search Trees
二叉搜索树中避免模式访问
- DOI:
10.1109/focs.2015.32 - 发表时间:
2015 - 期刊:
- 影响因子:0
- 作者:
Parinya Chalermsook;Mayank Goswami;L. Kozma;K. Mehlhorn;Thatchaphol Saranurak - 通讯作者:
Thatchaphol Saranurak
Adaptive Discretization for Computerized Tomography
计算机断层扫描的自适应离散化
- DOI:
- 发表时间:
2018 - 期刊:
- 影响因子:0
- 作者:
S. Shakya;A. Saxena;P. Munshi;Mayank Goswami - 通讯作者:
Mayank Goswami
Fair subgraph selection for contagion containment (Brief Announcement)
- DOI:
10.1016/j.procs.2023.08.250 - 发表时间:
2023-01-01 - 期刊:
- 影响因子:
- 作者:
Esther M. Arkin;Rezaul A. Chowdhury;Mayank Goswami;Jason Huang;Joseph S.B. Mitchell;Valentin Polishchuk;Rakesh Ravindran - 通讯作者:
Rakesh Ravindran
Mayank Goswami的其他文献
{{
item.title }}
{{ item.translation_title }}
- DOI:
{{ item.doi }} - 发表时间:
{{ item.publish_year }} - 期刊:
- 影响因子:{{ item.factor }}
- 作者:
{{ item.authors }} - 通讯作者:
{{ item.author }}
{{ truncateString('Mayank Goswami', 18)}}的其他基金
AF: Small: RUI: Towards Resolving the Dynamic Optimality Conjecture.
AF:小:RUI:解决动态最优猜想。
- 批准号:
1910873 - 财政年份:2019
- 资助金额:
$ 17.5万 - 项目类别:
Standard Grant
相似国自然基金
H2S介导剪接因子BraU2AF65a的S-巯基化修饰促进大白菜开花的分子机制
- 批准号:32372727
- 批准年份:2023
- 资助金额:50 万元
- 项目类别:面上项目
AF9通过ARRB2-MRGPRB2介导肠固有肥大细胞活化促进重症急性胰腺炎发生MOF的研究
- 批准号:82300739
- 批准年份:2023
- 资助金额:30 万元
- 项目类别:青年科学基金项目
剪接因子U2AF1突变在急性髓系白血病原发耐药中的机制研究
- 批准号:82370157
- 批准年份:2023
- 资助金额:49 万元
- 项目类别:面上项目
线粒体活性氧介导的胎盘早衰在孕期双酚AF暴露致婴幼儿神经发育迟缓中的作用
- 批准号:82304160
- 批准年份:2023
- 资助金额:30 万元
- 项目类别:青年科学基金项目
U2AF2-circMMP1调控能量代谢促进结直肠癌肝转移的分子机制
- 批准号:82303789
- 批准年份:2023
- 资助金额:30 万元
- 项目类别:青年科学基金项目
相似海外基金
CRII: AF: RUI: New Frontiers in Fundamental Error-Correcting Codes
CRII:AF:RUI:基本纠错码的新领域
- 批准号:
2347371 - 财政年份:2024
- 资助金额:
$ 17.5万 - 项目类别:
Standard Grant
CRII: AF: RUI: Algorithmic Fairness for Computational Social Choice Models
CRII:AF:RUI:计算社会选择模型的算法公平性
- 批准号:
2348275 - 财政年份:2024
- 资助金额:
$ 17.5万 - 项目类别:
Standard Grant
CRII: AF: RUI: Breaking Ground on Circulant TSP
CRII:AF:RUI:循环 TSP 破土动工
- 批准号:
2153331 - 财政年份:2022
- 资助金额:
$ 17.5万 - 项目类别:
Standard Grant
CRII: AF: RUI: New Approaches for Space-Efficient Similarity Search
CRII:AF:RUI:节省空间的相似性搜索的新方法
- 批准号:
2103813 - 财政年份:2021
- 资助金额:
$ 17.5万 - 项目类别:
Standard Grant
CRII: AF: RUI: Markov Chains and Random Sampling on Graphs
CRII:AF:RUI:马尔可夫链和图上的随机采样
- 批准号:
2104795 - 财政年份:2021
- 资助金额:
$ 17.5万 - 项目类别:
Standard Grant