Summary

Objectives: Current genomic privacy technologies assume that the identity behind genomic sequence data is protected if personal information, such as demographics, is obscured, removed, or encrypted. While demographic features can directly compromise an individual's identity, recent research demonstrates that such protections are insufficient because sequence data itself is susceptible to re-identification. To counteract this problem, we introduce an algorithm for anonymizing a collection of person-specific DNA sequences.

Methods: The technique is termed DNA lattice anonymization (DNALA) and is based upon the formal privacy protection schema of k-anonymity. Under this model, it is impossible to observe or learn features that distinguish one genetic sequence from k-1 other entries in a collection. To maximize the information retained in protected sequences, we incorporate a concept generalization lattice to learn the distance between two residues in a single nucleotide region. The lattice provides the most similar generalized concept for two residues (e.g., adenine and guanine are both purines).

Results: The method is tested and evaluated with several publicly available human population datasets ranging in size from 30 to 400 sequences. Our findings suggest that the anonymization schema is feasible for protecting the privacy of sequences.

Conclusions: The DNALA method is the first computational disclosure control technique for general DNA sequences. Given the computational nature of the method, guarantees of anonymity can be formally proven. There is room for improvement and validation, though this research provides the groundwork from which future researchers can construct genomic anonymization schemas tailored to specific data-sharing scenarios.
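The lattice idea described in the Methods can be illustrated with a minimal sketch. It is not the DNALA implementation. It assumes a hypothetical three-level lattice (base, then purine or pyrimidine written with the IUPAC symbols R and Y, then any nucleotide N), a hypothetical distance equal to the number of lattice levels each residue must climb to reach the common concept, pre-aligned sequences of equal length, and a simple greedy pairing for k = 2. All function names are illustrative.

```python
# Hypothetical lattice: base -> purine (R) / pyrimidine (Y) -> any nucleotide (N).
PARENT = {"A": "R", "G": "R", "C": "Y", "T": "Y", "R": "N", "Y": "N", "N": None}
LEVEL = {"A": 0, "C": 0, "G": 0, "T": 0, "R": 1, "Y": 1, "N": 2}


def ancestors(symbol):
    """Return the symbol followed by its successive generalizations."""
    chain = []
    while symbol is not None:
        chain.append(symbol)
        symbol = PARENT[symbol]
    return chain


def generalize(a, b):
    """Most specific lattice concept that covers both residues."""
    covering = set(ancestors(b))
    for concept in ancestors(a):
        if concept in covering:
            return concept
    raise ValueError(f"no common concept for {a!r} and {b!r}")


def residue_distance(a, b):
    """Assumed distance: levels climbed by each residue to the common concept."""
    g = generalize(a, b)
    return (LEVEL[g] - LEVEL[a]) + (LEVEL[g] - LEVEL[b])


def sequence_distance(s1, s2):
    """Sum of residue distances over two aligned, equal-length sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to equal length")
    return sum(residue_distance(a, b) for a, b in zip(s1, s2))


def generalize_sequences(s1, s2):
    """Position-wise generalization of two aligned sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to equal length")
    return "".join(generalize(a, b) for a, b in zip(s1, s2))


def pair_anonymize(seqs):
    """Greedy k = 2 sketch: pair each sequence with its nearest unpaired
    neighbour and replace both with their shared generalization."""
    if len(seqs) % 2 != 0:
        raise ValueError("this sketch assumes an even number of sequences")
    remaining = list(range(len(seqs)))
    result = list(seqs)
    while remaining:
        i = remaining.pop(0)
        j = min(remaining, key=lambda r: sequence_distance(seqs[i], seqs[r]))
        remaining.remove(j)
        result[i] = result[j] = generalize_sequences(seqs[i], seqs[j])
    return result


if __name__ == "__main__":
    print(pair_anonymize(["ACGT", "GCGT", "ACCT", "ACTT"]))
```

In this sketch every output sequence is identical to at least one other, which is the k = 2 case of the k-anonymity requirement. Greedy pairing is chosen only for brevity and does not necessarily minimize the total information lost.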