Communications in Cryptology IACR CiC

On the Privacy of Sublinear-Communication Jaccard Index Estimation via Min-hash

Authors

Mingyu Liang, Seung Geol Choi, Dana Dachman-Soled, Linsheng Liu, Arkady Yerukhimovich
Mingyu Liang ORCID
University of Maryland, USA
George Washington University, USA
mliang5 at gmu dot edu
Seung Geol Choi ORCID
United States Naval Academy, USA
choi at usna dot edu
Dana Dachman-Soled ORCID
University of Maryland, USA
danadach at umd dot edu
Linsheng Liu ORCID
George Washington University, USA
lls at gwu dot edu
Arkady Yerukhimovich ORCID
George Washington University, USA
arkady at gwu dot edu

Abstract

The min-hash sketch is a well-known technique for low-communication approximation of the Jaccard index between two input sets. Moreover, there is a folklore belief that min-hash sketch-based protocols protect the privacy of the inputs. In this paper, we consider variants of private min-hash sketch based-protocols and investigate this folklore to quantify the privacy of the min-hash sketch.

We begin our investigation by presenting a highly-efficient two-party protocol for estimating the Jaccard index while ensuring differential privacy. This protocol adds Laplacian noise to the min-hash sketch counts to provide privacy protection.

Then, we aim to understand what privacy, if any, is guaranteed if the results of the min-hash are released without any additional noise, such as in the case of historical data. We begin our investigation by considering the privacy of min-hash in a centralized setting where the hash functions are chosen by the min-hash functionality and are unknown to the participants. We show that in this case the min-hash output satisfies the standard definition of differential privacy (DP) without any additional noise.

We next consider a more practical distributed setting, where the hash function must be shared among all parties and is typically public.

Unfortunately, we show that in this public hash function setting, the min-hash output is no longer DP. We therefore consider the notion of distributional differential privacy (DDP) introduced by Bassily et al. (FOCS 2013). We show that if the honest party's set has sufficiently high min-entropy, the min-hash output achieves DDP without requiring noise.

Our findings provide guidance on how to use the min-hash sketch for private Jaccard index estimation and clarify the extent to which min-hash protocols protect input privacy, refining the common belief in their privacy guarantees.

References

[ABS20]
Martin Aumüller, Anders Bourgeat, and Jana Schmurr. Differentially Private Sketches for Jaccard Similarity Estimation. In SISAP 2020, pages 18–32. 2020. Springer. DOI: 10.1007/978-3-030-60936-8_2
[ACSS23]
Idan Attias, Edith Cohen, Moshe Shechner, and Uri Stemmer. A Framework for Adversarial Streaming via Differential Privacy and Difference Estimators. In Yael Tauman Kalai, editor, ITCS 2023: 14th Innovations in Theoretical Computer Science Conference, volume 251, pages 8:1–8:19, Cambridge, MA, USA. 2023. Leibniz International Proceedings in Informatics (LIPIcs). DOI: 10.4230/LIPIcs.ITCS.2023.8
[BBDS12]
Jeremiah Blocki, Avrim Blum, Anupam Datta, and Or Sheffet. The Johnson-Lindenstrauss Transform Itself Preserves Differential Privacy. In 53rd Annual Symposium on Foundations of Computer Science, pages 410–419, New Brunswick, NJ, USA. 2012. IEEE Computer Society Press. DOI: 10.1109/FOCS.2012.67
[BCG14]
Carlo Blundo, Emiliano De Cristofaro, and Paolo Gasti. EsPRESSO: Efficient privacy-preserving evaluation of sample set similarity. J. Comput. Secur., 22(3):355–381, 2014. DOI: 10.3233/jcs-130482
[BGKS13]
Raef Bassily, Adam Groce, Jonathan Katz, and Adam Smith. Coupled-Worlds Privacy: Exploiting Adversarial Uncertainty in Statistical Data Privacy. In 54th Annual Symposium on Foundations of Computer Science, pages 439–448, Berkeley, CA, USA. 2013. IEEE Computer Society Press. DOI: 10.1109/FOCS.2013.54
[BGMZ97]
Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic Clustering of the Web. Comput. Networks, 29(8-13):1157–1166, 1997. DOI: 10.1016/s0169-7552(97)00031-7
[BJWY22]
Omri Ben-Eliezer, Rajesh Jayaram, David P. Woodruff, and Eylon Yogev. A Framework for Adversarially Robust Streaming Algorithms. J. ACM, 69(2):17:1–17:33, 2022. DOI: 10.1145/3498334
[BLV19]
Elette Boyle, Rio LaVigne, and Vinod Vaikuntanathan. Adversarially Robust Property-Preserving Hash Functions. In Avrim Blum, editor, ITCS 2019: 10th Innovations in Theoretical Computer Science Conference, volume 124, pages 16:1–16:20, San Diego, CA, USA. 2019. Leibniz International Proceedings in Informatics (LIPIcs). DOI: 10.4230/LIPIcs.ITCS.2019.16
[BNO08]
Amos Beimel, Kobbi Nissim, and Eran Omri. Distributed Private Data Analysis: Simultaneously Solving How and What. In Crypto 2008, pages 451–468. 2008. Springer. DOI: 10.1007/978-3-540-85174-5_25
[BNSGT17]
Raef Bassily, Kobbi Nissim, Uri Stemmer, and Abhradeep Guha Thakurta. Practical locally private heavy hitters. Advances in Neural Information Processing Systems, 30, 2017. DOI: 10.5555/3294771.3294989
[BO13]
Gilles Barthe and Federico Olmedo. Beyond Differential Privacy: Composition Theorems and Relational Logic for f-divergences between Probabilistic Programs. In Fedor V. Fomin, Rusins Freivalds, Marta Z. Kwiatkowska, and David Peleg, editors, ICALP 2013: 40th International Colloquium on Automata, Languages and Programming, Part II, volume 7966 of Lecture Notes in Computer Science, pages 49–60, Riga, Latvia. 2013. Springer Berlin Heidelberg, Germany. DOI: 10.1007/978-3-642-39212-2_8
[Bro97]
A. Broder. On the Resemblance and Containment of Documents. In Compression and Complexity of Sequences, International Conference on, pages 21. 1997. IEEE Computer Society. DOI: 10.1109/sequen.1997.666900
[BS15]
Raef Bassily and Adam Smith. Local, private, efficient protocols for succinct histograms. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 127–135. 2015. DOI: 10.1145/2746539.2746632
[CCMS19]
T.-H. Hubert Chan, Kai-Min Chung, Bruce M. Maggs, and Elaine Shi. Foundations of Differentially Oblivious Algorithms. In Timothy M. Chan, editor, 30th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2448–2467, San Diego, CA, USA. 2019. ACM-SIAM. DOI: 10.1137/1.9781611975482.150
[CDSKY20]
Seung Geol Choi, Dana Dachman-Soled, Mukul Kulkarni, and Arkady Yerukhimovich. Differentially-private multi-party sketching for large-scale statistics. Proceedings on Privacy Enhancing Technologies, 2020(3):153–174, 2020. DOI: 10.2478/popets-2020-0047
[CFGT12]
Emiliano De Cristofaro, Sky Faber, Paolo Gasti, and Gene Tsudik. Genodroid: are privacy-preserving genomic tests ready for prime time?. In WPES 2012, pages 97–108. 2012. ACM. DOI: 10.1145/2381966.2381980
[CGT12]
Emiliano De Cristofaro, Paolo Gasti, and Gene Tsudik. Fast and Private Computation of Cardinality of Set Intersection and Union. In CANS 2012, volume 7712, pages 218–231. 2012. Springer. DOI: 10.1007/978-3-642-35404-5_17
[COK22]
Wei-Ning Chen, Ayfer Ozgur, and Peter Kairouz. The Poisson Binomial Mechanism for Unbiased Federated Learning with Secure Aggregation. In ICML 2022, volume 162, pages 3490–3506. 2022. PMLR.
[DKZ18]
Stefan Dziembowski, Tomasz Kazana, and Maciej Zdanowicz. Quasi chain rule for min-entropy. Inf. Process. Lett., 134:62–66, 2018. DOI: 10.1016/j.ipl.2018.02.007
[DMNS06]
Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography (TCC 2006), pages 265–284. 2006. Springer. DOI: 10.1007/11681878_14
[Doe18]
Benjamin Doerr. Probabilistic Tools for the Analysis of Randomized Optimization Heuristics. CoRR, abs/1801.06733, 2018.
[DORS08]
Yevgeniy Dodis, Rafail Ostrovsky, Leonid Reyzin, and Adam Smith. Fuzzy Extractors: How to Generate Strong Keys from Biometrics and Other Noisy Data. SIAM Journal on Computing, 38(1):97–139, January 2008. DOI: 10.1137/060651380
[DR+14]
Cynthia Dwork, Aaron Roth, and others. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014. DOI: 10.1561/9781601988195
[DTT22]
Charlie Dickens, Justin Thaler, and Daniel Ting. Order-invariant cardinality estimators are differentially private. Advances in Neural Information Processing Systems, 35:15204–15216, 2022. DOI: 10.5555/3600270.3601376
[Dwo06]
Cynthia Dwork. Differential privacy. In Automata, Languages and Programming: 33rd International Colloquium, ICALP 2006, Venice, Italy, July 10-14, 2006, Proceedings, Part II 33, pages 1–12. 2006. Springer. DOI: 10.1007/11787006_1
[Fab16]
Sky Faber. Variants of Privacy Preserving Set Intersection and their Practical Applications. PhD Thesis, 2016.
[FIM+01]
Joan Feigenbaum, Yuval Ishai, Tal Malkin, Kobbi Nissim, Martin Strauss, and Rebecca N. Wright. Secure Multiparty Computation of Approximations. In Fernando Orejas, Paul G. Spirakis, and Jan van Leeuwen, editors, ICALP 2001: 28th International Colloquium on Automata, Languages and Programming, volume 2076 of Lecture Notes in Computer Science, pages 927–938, Heraklion, Crete, Greece. 2001. Springer Berlin Heidelberg, Germany. DOI: 10.1007/3-540-48224-5_75
[FLS22]
Nils Fleischhacker, Kasper Green Larsen, and Mark Simkin. Property-Preserving Hash Functions for Hamming Distance from Standard Assumptions. In Orr Dunkelman and Stefan Dziembowski, editors, Advances in Cryptology – EUROCRYPT 2022, Part II, volume 13276 of Lecture Notes in Computer Science, pages 764–781, Trondheim, Norway. 2022. Springer, Cham, Switzerland. DOI: 10.1007/978-3-031-07085-3_26
[FS21]
Nils Fleischhacker and Mark Simkin. Robust Property-Preserving Hash Functions for Hamming Distance and More. In Anne Canteaut and François-Xavier Standaert, editors, Advances in Cryptology – EUROCRYPT 2021, Part III, volume 12698 of Lecture Notes in Computer Science, pages 311–337, Zagreb, Croatia. 2021. Springer, Cham, Switzerland. DOI: 10.1007/978-3-030-77883-5_11
[GKLX22]
S. Dov Gordon, Jonathan Katz, Mingyu Liang, and Jiayu Xu. Spreading the Privacy Blanket: - Differentially Oblivious Shuffling for Differential Privacy. In Giuseppe Ateniese and Daniele Venturi, editors, ACNS 22: 20th International Conference on Applied Cryptography and Network Security, volume 13269 of Lecture Notes in Computer Science, pages 501–520, Rome, Italy. 2022. Springer, Cham, Switzerland. DOI: 10.1007/978-3-031-09234-3_25
[GRR19]
Adam Groce, Peter Rindal, and Mike Rosulek. Cheaper Private Set Intersection via Differentially Private Leakage. Proc. Privacy Enhancing Technologies (PETS), 2019(3):6–25, 2019. DOI: 10.2478/popets-2019-0034
[HKKN01]
Shai Halevi, Robert Krauthgamer, Eyal Kushilevitz, and Kobbi Nissim. Private approximation of NP-hard functions. In 33rd Annual ACM Symposium on Theory of Computing, pages 550–559, Crete, Greece. 2001. ACM Press. DOI: 10.1145/380752.380850
[HLTW22]
Justin Holmgren, Minghao Liu, LaKyah Tyner, and Daniel Wichs. Nearly Optimal Property Preserving Hashing. In Yevgeniy Dodis and Thomas Shrimpton, editors, Advances in Cryptology – CRYPTO 2022, Part III, volume 13509 of Lecture Notes in Computer Science, pages 473–502, Santa Barbara, CA, USA. 2022. Springer, Cham, Switzerland. DOI: 10.1007/978-3-031-15982-4_16
[HMFS17]
Xi He, Ashwin Machanavajjhala, Cheryl J. Flynn, and Divesh Srivastava. Composing Differential Privacy and Secure Computation: A Case Study on Scaling Private Record Linkage. In Bhavani M. Thuraisingham, David Evans, Tal Malkin, and Dongyan Xu, editors, ACM CCS 2017: 24th Conference on Computer and Communications Security, pages 1389–1406, Dallas, TX, USA. 2017. ACM Press. DOI: 10.1145/3133956.3134030
[HQYC22]
Ziyue Huang, Yuan Qiu, Ke Yi, and Graham Cormode. Frequency estimation under multiparty differential privacy: one-shot and streaming. Proc. VLDB Endow., 15(10):2058–2070, June 2022. DOI: 10.14778/3547305.3547312
[HTC23]
Jonathan Hehir, Daniel Ting, and Graham Cormode. Sketch-Flip-Merge: Mergeable Sketches for Private Distinct Counting. In ICML 2023, volume 202, pages 12846–12865. 2023. PMLR. DOI: 10.5555/3618408.3618930
[Jac01]
Paul Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bull Soc Vaudoise Sci Nat, 37:547–579, 1901. DOI: 10.5169/seals-266450
[JKWC22]
Bo Jiang, Hamid Krim, Tianfu Wu, and Derya Cansever. Refining Self-Supervised Learning in Imaging: Beyond Linear Metric. In ICIP 2022, pages 76–80. 2022. IEEE. DOI: 10.1109/icip46576.2022.9897745
[KWS+20]
Benjamin Kreuter, Craig William Wright, Evgeny Sergeevich Skvortsov, Raimundo Mirisola, and Yao Wang. Privacy-Preserving Secure Cardinality and Frequency Estimation. Technical report, Google, LLC. 2020.
[LLSS19]
Tian Li, Zaoxing Liu, Vyas Sekar, and Virginia Smith. Privacy for free: Communication-efficient learning with differential privacy using sketches. arXiv preprint arXiv:1911.00972, 2019. DOI: 10.48550/arXiv.1911.00972
[LOZ12]
Ping Li, Art B. Owen, and Cun-Hui Zhang. One Permutation Hashing. In In Proceedings of the 26th Annual Conference on Neural Information Processing Systems 2012., pages 3122–3130. 2012. DOI: 10.5555/2999325.2999482
[MDDC16]
Luca Melis, George Danezis, and Emiliano De Cristofaro. Efficient Private Statistics with Succinct Sketches. In Proceedings 2016 Network and Distributed System Security Symposium. 2016. Internet Society. DOI: 10.14722/ndss.2016.23175
[MG18]
Sahar Mazloom and S. Dov Gordon. Secure Computation with Differentially Private Access Patterns. In David Lie, Mohammad Mannan, Michael Backes, and XiaoFeng Wang, editors, ACM CCS 2018: 25th Conference on Computer and Communications Security, pages 490–507, Toronto, ON, Canada. 2018. ACM Press. DOI: 10.1145/3243734.3243851
[MLRG20]
Sahar Mazloom, Phi Hung Le, Samuel Ranellucci, and S. Dov Gordon. Secure parallel computation on national scale volumes of data. In Srdjan Capkun and Franziska Roesner, editors, USENIX Security 2020: 29th USENIX Security Symposium, pages 2487–2504. 2020. USENIX Association.
[MMNW11]
Darakhshan Mir, Shan Muthukrishnan, Aleksandar Nikolov, and Rebecca N Wright. Pan-private algorithms via statistics on sketches. In ACM SIGMOD-SIGACT-SIGART, pages 37–48. 2011. DOI: 10.1145/1989284.1989290
[NvVT20]
Saskia Nuñez von Voigt and Florian Tschorsch. Rrtxfm: Probabilistic counting for differentially private statistics. In Digital Transformation for a Sustainable Society in the 21st Century: I3E 2019 IFIP WG 6.11 International Workshops, pages 86–98. 2020. Springer. DOI: 10.1007/978-3-030-39634-3_9
[PS21]
Rasmus Pagh and Nina Mesing Stausholm. Efficient Differentially Private $F_0$ Linear Sketching. In Ke Yi and Zhewei Wei, editors, 24th International Conference on Database Theory (ICDT 2021), volume 186 of Leibniz International Proceedings in Informatics (LIPIcs), pages 18:1–18:19, Dagstuhl, Germany. 2021. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. DOI: 10.4230/LIPIcs.ICDT.2021.18
[PT22]
Rasmus Pagh and Mikkel Thorup. Improved utility analysis of private countsketch. Advances in Neural Information Processing Systems, 35:25631–25643, 2022. DOI: 10.5555/3600270.3602128
[RCS+19]
M. Sadegh Riazi, Beidi Chen, Anshumali Shrivastava, Dan Wallach, and Farinaz Koushanfar. Sub-Linear Privacy-Preserving Near-Neighbor Search. Cryptology ePrint Archive, Report 2019/1222. 2019.
[SCRS17]
Elaine Shi, T.-H. Hubert Chan, Eleanor Gilbert Rieffel, and Dawn Song. Distributed Private Data Analysis: Lower Bounds and Practical Constructions. ACM Trans. Algorithms, 13(4):50:1–50:38, 2017. DOI: 10.1145/3146549
[Sk{\'o}19]
Maciej Skórski. Strong chain rules for min-entropy under few bits spoiled. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 1122–1126. 2019. IEEE. DOI: 10.1109/isit.2019.8849240
[SNY17]
Rade Stanojevic, Mohamed Nabeel, and Ting Yu. Distributed cardinality estimation of set operations with differential privacy. In 2017 IEEE Symposium on Privacy-Aware Computing (PAC), pages 37–48. 2017. IEEE. DOI: 10.1109/pac.2017.43
[SSGT20]
Adam Smith, Shuang Song, and Abhradeep Guha Thakurta. The flajolet-martin sketch itself preserves differential privacy: Private counting with minimal space. NeurIPS 2020, 33:19561–19572, 2020. DOI: 10.5555/3495724.3497365
[STS18]
Hagen Sparka, Florian Tschorsch, and Björn Scheuermann. P2KMV: A privacy-preserving counting sketch for efficient and accurate set intersection cardinality estimations. Cryptology ePrint Archive, 2018. DOI: 10.14279/DEPOSITONCE-8374
[SV15]
Igal Sason and Sergio Verdú. Bounds among f-divergences. CoRR, abs/1508.00335, 2015.
[TBK07]
Chayant Tantipathananandh, Tanya Y. Berger-Wolf, and David Kempe. A framework for community identification in dynamic social networks. In KDD, pages 717–726. 2007. ACM. DOI: 10.1145/1281192.1281269
[TL24]
Yang Tan and Bo Lv. Break Two PSI-CA Protocols in Polynomial Time. In Proceedings of the 2024 16th International Conference on Machine Learning and Computing, pages 65–72, New York, NY, USA. 2024. Association for Computing Machinery. DOI: 10.1145/3651671.3651682
[WPS22]
Lun Wang, Iosif Pinelis, and Dawn Song. Differentially Private Fractional Frequency Moments Estimation with Polylogarithmic Space. In International Conference on Learning Representations. 2022.
[WWT+19]
Huijun Wu, Chen Wang, Yuriy Tyshetskiy, Andrew Docherty, Kai Lu, and Liming Zhu. Adversarial Examples for Graph Data: Deep Insights into Attack and Defense. In IJCAI 2019, pages 4816–4823. 2019. ijcai.org. DOI: 10.24963/ijcai.2019/669
[YLL+17]
Ziqi Yan, Jiqiang Liu, Gang Li, Zhen Han, and Shuo Qiu. PrivMin: Differentially Private MinHash for Jaccard Similarity Computation. CoRR, abs/1705.07258, 2017. DOI: 10.48550/arXiv.1705.07258
[YWR+19]
Ziqi Yan, Qiong Wu, Meng Ren, Jiqiang Liu, Shaowu Liu, and Shuo Qiu. Locally private Jaccard similarity estimation. Concurr. Comput. Pract. Exp., 31(24), 2019. DOI: 10.1002/cpe.4889
[ZQR+22]
Fuheng Zhao, Dan Qiao, Rachel Redberg, Divyakant Agrawal, Amr El Abbadi, and Yu-Xiang Wang. Differentially private linear sketches: Efficient implementations and applications. NeurIPS 2022, 35:12691–12704, 2022. DOI: 10.5555/3600270.3601192
[Ös02]
Ferdinand Österreicher. Csiszar’s f-divergence-basic properties. RGMIA Research Report Collection, 2002.

PDFPDF Open access

History
Submitted: 2024-10-07
Accepted: 2024-12-03
Published: 2025-01-13
How to cite

Mingyu Liang, Seung Geol Choi, Dana Dachman-Soled, Linsheng Liu, and Arkady Yerukhimovich, On the Privacy of Sublinear-Communication Jaccard Index Estimation via Min-hash. IACR Communications in Cryptology, vol. 1, no. 4, Jan 13, 2025, doi: 10.62056/ak2i5w7sf.

License

Copyright is held by the author(s)

This work is licensed under a Creative Commons Attribution (CC BY) license.