On the Privacy of Sublinear-Communication Jaccard Index Estimation via Min-hash

Mingyu Liang; Seung Geol Choi; Dana Dachman-Soled; Linsheng Liu; Arkady Yerukhimovich

doi:10.62056/ak2i5w7sf

On the Privacy of Sublinear-Communication Jaccard Index Estimation via Min-hash

Authors

Mingyu Liang, Seung Geol Choi, Dana Dachman-Soled, Linsheng Liu, Arkady Yerukhimovich

Mingyu Liang

University of Maryland, USA
George Washington University, USA
mliang5 at gmu dot edu

Seung Geol Choi

United States Naval Academy, USA
choi at usna dot edu

Dana Dachman-Soled

University of Maryland, USA
danadach at umd dot edu

Linsheng Liu

George Washington University, USA
lls at gwu dot edu

Arkady Yerukhimovich

George Washington University, USA
arkady at gwu dot edu

Keywords: Differential Privacy MPC Sublinear Communication Sketching Min-hash Jaccard Index

Abstract

The min-hash sketch is a well-known technique for low-communication approximation of the Jaccard index between two input sets. Moreover, there is a folklore belief that min-hash sketch-based protocols protect the privacy of the inputs. In this paper, we consider variants of private min-hash sketch based-protocols and investigate this folklore to quantify the privacy of the min-hash sketch.

We begin our investigation by presenting a highly-efficient two-party protocol for estimating the Jaccard index while ensuring differential privacy. This protocol adds Laplacian noise to the min-hash sketch counts to provide privacy protection.

Then, we aim to understand what privacy, if any, is guaranteed if the results of the min-hash are released without any additional noise, such as in the case of historical data. We begin our investigation by considering the privacy of min-hash in a centralized setting where the hash functions are chosen by the min-hash functionality and are unknown to the participants. We show that in this case the min-hash output satisfies the standard definition of differential privacy (DP) without any additional noise.

We next consider a more practical distributed setting, where the hash function must be shared among all parties and is typically public.

Unfortunately, we show that in this public hash function setting, the min-hash output is no longer DP. We therefore consider the notion of distributional differential privacy (DDP) introduced by Bassily et al. (FOCS 2013). We show that if the honest party's set has sufficiently high min-entropy, the min-hash output achieves DDP without requiring noise.

Our findings provide guidance on how to use the min-hash sketch for private Jaccard index estimation and clarify the extent to which min-hash protocols protect input privacy, refining the common belief in their privacy guarantees.

References

[ABS20]

Martin Aumüller, Anders Bourgeat, and Jana Schmurr. Differentially Private Sketches for Jaccard Similarity Estimation. In SISAP 2020, pages 18–32. 2020. Springer. DOI: 10.1007/978-3-030-60936-8_2

Google Scholar ePrint

[ACSS23]

Idan Attias, Edith Cohen, Moshe Shechner, and Uri Stemmer. A Framework for Adversarial Streaming via Differential Privacy and Difference Estimators. In Yael Tauman Kalai, editor, ITCS 2023: 14th Innovations in Theoretical Computer Science Conference, volume 251, pages 8:1–8:19, Cambridge, MA, USA. 2023. Leibniz International Proceedings in Informatics (LIPIcs). DOI: 10.4230/LIPIcs.ITCS.2023.8

Google Scholar ePrint

[BBDS12]

Jeremiah Blocki, Avrim Blum, Anupam Datta, and Or Sheffet. The Johnson-Lindenstrauss Transform Itself Preserves Differential Privacy. In 53rd Annual Symposium on Foundations of Computer Science, pages 410–419, New Brunswick, NJ, USA. 2012. IEEE Computer Society Press. DOI: 10.1109/FOCS.2012.67

Google Scholar ePrint

[BCG14]

Carlo Blundo, Emiliano De Cristofaro, and Paolo Gasti. EsPRESSO: Efficient privacy-preserving evaluation of sample set similarity. J. Comput. Secur., 22(3):355–381, 2014. DOI: 10.3233/jcs-130482

Google Scholar ePrint

[BGKS13]

Raef Bassily, Adam Groce, Jonathan Katz, and Adam Smith. Coupled-Worlds Privacy: Exploiting Adversarial Uncertainty in Statistical Data Privacy. In 54th Annual Symposium on Foundations of Computer Science, pages 439–448, Berkeley, CA, USA. 2013. IEEE Computer Society Press. DOI: 10.1109/FOCS.2013.54

Google Scholar ePrint

[BGMZ97]

Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic Clustering of the Web. Comput. Networks, 29(8-13):1157–1166, 1997. DOI: 10.1016/s0169-7552(97)00031-7

Google Scholar ePrint

[BJWY22]

Omri Ben-Eliezer, Rajesh Jayaram, David P. Woodruff, and Eylon Yogev. A Framework for Adversarially Robust Streaming Algorithms. J. ACM, 69(2):17:1–17:33, 2022. DOI: 10.1145/3498334

Google Scholar ePrint

[BLV19]

Elette Boyle, Rio LaVigne, and Vinod Vaikuntanathan. Adversarially Robust Property-Preserving Hash Functions. In Avrim Blum, editor, ITCS 2019: 10th Innovations in Theoretical Computer Science Conference, volume 124, pages 16:1–16:20, San Diego, CA, USA. 2019. Leibniz International Proceedings in Informatics (LIPIcs). DOI: 10.4230/LIPIcs.ITCS.2019.16

Google Scholar ePrint

[BNO08]

Amos Beimel, Kobbi Nissim, and Eran Omri. Distributed Private Data Analysis: Simultaneously Solving How and What. In Crypto 2008, pages 451–468. 2008. Springer. DOI: 10.1007/978-3-540-85174-5_25

Google Scholar ePrint

[BNSGT17]

Raef Bassily, Kobbi Nissim, Uri Stemmer, and Abhradeep Guha Thakurta. Practical locally private heavy hitters. Advances in Neural Information Processing Systems, 30, 2017. DOI: 10.5555/3294771.3294989

Google Scholar ePrint

[BO13]

Gilles Barthe and Federico Olmedo. Beyond Differential Privacy: Composition Theorems and Relational Logic for f-divergences between Probabilistic Programs. In Fedor V. Fomin, Rusins Freivalds, Marta Z. Kwiatkowska, and David Peleg, editors, ICALP 2013: 40th International Colloquium on Automata, Languages and Programming, Part II, volume 7966 of Lecture Notes in Computer Science, pages 49–60, Riga, Latvia. 2013. Springer Berlin Heidelberg, Germany. DOI: 10.1007/978-3-642-39212-2_8

Google Scholar ePrint

[Bro97]

A. Broder. On the Resemblance and Containment of Documents. In Compression and Complexity of Sequences, International Conference on, pages 21. 1997. IEEE Computer Society. DOI: 10.1109/sequen.1997.666900

Google Scholar ePrint

[BS15]

Raef Bassily and Adam Smith. Local, private, efficient protocols for succinct histograms. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 127–135. 2015. DOI: 10.1145/2746539.2746632

Google Scholar ePrint

[CCMS19]

T.-H. Hubert Chan, Kai-Min Chung, Bruce M. Maggs, and Elaine Shi. Foundations of Differentially Oblivious Algorithms. In Timothy M. Chan, editor, 30th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2448–2467, San Diego, CA, USA. 2019. ACM-SIAM. DOI: 10.1137/1.9781611975482.150

Google Scholar ePrint

[CDSKY20]

Seung Geol Choi, Dana Dachman-Soled, Mukul Kulkarni, and Arkady Yerukhimovich. Differentially-private multi-party sketching for large-scale statistics. Proceedings on Privacy Enhancing Technologies, 2020(3):153–174, 2020. DOI: 10.2478/popets-2020-0047

Google Scholar ePrint

[CFGT12]

Emiliano De Cristofaro, Sky Faber, Paolo Gasti, and Gene Tsudik. Genodroid: are privacy-preserving genomic tests ready for prime time?. In WPES 2012, pages 97–108. 2012. ACM. DOI: 10.1145/2381966.2381980

Google Scholar ePrint

[CGT12]

Emiliano De Cristofaro, Paolo Gasti, and Gene Tsudik. Fast and Private Computation of Cardinality of Set Intersection and Union. In CANS 2012, volume 7712, pages 218–231. 2012. Springer. DOI: 10.1007/978-3-642-35404-5_17

Google Scholar ePrint

[COK22]

Wei-Ning Chen, Ayfer Ozgur, and Peter Kairouz. The Poisson Binomial Mechanism for Unbiased Federated Learning with Secure Aggregation. In ICML 2022, volume 162, pages 3490–3506. 2022. PMLR.

Google Scholar ePrint

[DKZ18]

Stefan Dziembowski, Tomasz Kazana, and Maciej Zdanowicz. Quasi chain rule for min-entropy. Inf. Process. Lett., 134:62–66, 2018. DOI: 10.1016/j.ipl.2018.02.007

Google Scholar ePrint

[DMNS06]

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography (TCC 2006), pages 265–284. 2006. Springer. DOI: 10.1007/11681878_14

Google Scholar ePrint

[Doe18]

Benjamin Doerr. Probabilistic Tools for the Analysis of Randomized Optimization Heuristics. CoRR, abs/1801.06733, 2018.

Google Scholar ePrint

[DORS08]

Yevgeniy Dodis, Rafail Ostrovsky, Leonid Reyzin, and Adam Smith. Fuzzy Extractors: How to Generate Strong Keys from Biometrics and Other Noisy Data. SIAM Journal on Computing, 38(1):97–139, January 2008. DOI: 10.1137/060651380

Google Scholar ePrint

[DR⁺14]

Cynthia Dwork, Aaron Roth, and others. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014. DOI: 10.1561/9781601988195

Google Scholar ePrint

[DTT22]

Charlie Dickens, Justin Thaler, and Daniel Ting. Order-invariant cardinality estimators are differentially private. Advances in Neural Information Processing Systems, 35:15204–15216, 2022. DOI: 10.5555/3600270.3601376

Google Scholar ePrint

[Dwo06]

Cynthia Dwork. Differential privacy. In Automata, Languages and Programming: 33rd International Colloquium, ICALP 2006, Venice, Italy, July 10-14, 2006, Proceedings, Part II 33, pages 1–12. 2006. Springer. DOI: 10.1007/11787006_1

Google Scholar ePrint

[Fab16]

Sky Faber. Variants of Privacy Preserving Set Intersection and their Practical Applications. PhD Thesis, 2016.

Google Scholar ePrint

[FIM⁺01]

Joan Feigenbaum, Yuval Ishai, Tal Malkin, Kobbi Nissim, Martin Strauss, and Rebecca N. Wright. Secure Multiparty Computation of Approximations. In Fernando Orejas, Paul G. Spirakis, and Jan van Leeuwen, editors, ICALP 2001: 28th International Colloquium on Automata, Languages and Programming, volume 2076 of Lecture Notes in Computer Science, pages 927–938, Heraklion, Crete, Greece. 2001. Springer Berlin Heidelberg, Germany. DOI: 10.1007/3-540-48224-5_75

Google Scholar ePrint

[FLS22]

Nils Fleischhacker, Kasper Green Larsen, and Mark Simkin. Property-Preserving Hash Functions for Hamming Distance from Standard Assumptions. In Orr Dunkelman and Stefan Dziembowski, editors, Advances in Cryptology – EUROCRYPT 2022, Part II, volume 13276 of Lecture Notes in Computer Science, pages 764–781, Trondheim, Norway. 2022. Springer, Cham, Switzerland. DOI: 10.1007/978-3-031-07085-3_26

Google Scholar ePrint

[FS21]

Nils Fleischhacker and Mark Simkin. Robust Property-Preserving Hash Functions for Hamming Distance and More. In Anne Canteaut and François-Xavier Standaert, editors, Advances in Cryptology – EUROCRYPT 2021, Part III, volume 12698 of Lecture Notes in Computer Science, pages 311–337, Zagreb, Croatia. 2021. Springer, Cham, Switzerland. DOI: 10.1007/978-3-030-77883-5_11

Google Scholar ePrint

[GKLX22]

S. Dov Gordon, Jonathan Katz, Mingyu Liang, and Jiayu Xu. Spreading the Privacy Blanket: - Differentially Oblivious Shuffling for Differential Privacy. In Giuseppe Ateniese and Daniele Venturi, editors, ACNS 22: 20th International Conference on Applied Cryptography and Network Security, volume 13269 of Lecture Notes in Computer Science, pages 501–520, Rome, Italy. 2022. Springer, Cham, Switzerland. DOI: 10.1007/978-3-031-09234-3_25

Google Scholar ePrint

[GRR19]

Adam Groce, Peter Rindal, and Mike Rosulek. Cheaper Private Set Intersection via Differentially Private Leakage. Proc. Privacy Enhancing Technologies (PETS), 2019(3):6–25, 2019. DOI: 10.2478/popets-2019-0034

Google Scholar ePrint

[HKKN01]

Shai Halevi, Robert Krauthgamer, Eyal Kushilevitz, and Kobbi Nissim. Private approximation of NP-hard functions. In 33rd Annual ACM Symposium on Theory of Computing, pages 550–559, Crete, Greece. 2001. ACM Press. DOI: 10.1145/380752.380850

Google Scholar ePrint

[HLTW22]

Justin Holmgren, Minghao Liu, LaKyah Tyner, and Daniel Wichs. Nearly Optimal Property Preserving Hashing. In Yevgeniy Dodis and Thomas Shrimpton, editors, Advances in Cryptology – CRYPTO 2022, Part III, volume 13509 of Lecture Notes in Computer Science, pages 473–502, Santa Barbara, CA, USA. 2022. Springer, Cham, Switzerland. DOI: 10.1007/978-3-031-15982-4_16

Google Scholar ePrint

[HMFS17]

Xi He, Ashwin Machanavajjhala, Cheryl J. Flynn, and Divesh Srivastava. Composing Differential Privacy and Secure Computation: A Case Study on Scaling Private Record Linkage. In Bhavani M. Thuraisingham, David Evans, Tal Malkin, and Dongyan Xu, editors, ACM CCS 2017: 24th Conference on Computer and Communications Security, pages 1389–1406, Dallas, TX, USA. 2017. ACM Press. DOI: 10.1145/3133956.3134030

Google Scholar ePrint

[HQYC22]

Ziyue Huang, Yuan Qiu, Ke Yi, and Graham Cormode. Frequency estimation under multiparty differential privacy: one-shot and streaming. Proc. VLDB Endow., 15(10):2058–2070, June 2022. DOI: 10.14778/3547305.3547312

Google Scholar ePrint

[HTC23]

Jonathan Hehir, Daniel Ting, and Graham Cormode. Sketch-Flip-Merge: Mergeable Sketches for Private Distinct Counting. In ICML 2023, volume 202, pages 12846–12865. 2023. PMLR. DOI: 10.5555/3618408.3618930

Google Scholar ePrint

[Jac01]

Paul Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bull Soc Vaudoise Sci Nat, 37:547–579, 1901. DOI: 10.5169/seals-266450

Google Scholar ePrint

[JKWC22]

Bo Jiang, Hamid Krim, Tianfu Wu, and Derya Cansever. Refining Self-Supervised Learning in Imaging: Beyond Linear Metric. In ICIP 2022, pages 76–80. 2022. IEEE. DOI: 10.1109/icip46576.2022.9897745

Google Scholar ePrint

[KWS⁺20]

Benjamin Kreuter, Craig William Wright, Evgeny Sergeevich Skvortsov, Raimundo Mirisola, and Yao Wang. Privacy-Preserving Secure Cardinality and Frequency Estimation. Technical report, Google, LLC. 2020.

Google Scholar ePrint

[LLSS19]

Tian Li, Zaoxing Liu, Vyas Sekar, and Virginia Smith. Privacy for free: Communication-efficient learning with differential privacy using sketches. arXiv preprint arXiv:1911.00972, 2019. DOI: 10.48550/arXiv.1911.00972

Google Scholar ePrint

[LOZ12]

Ping Li, Art B. Owen, and Cun-Hui Zhang. One Permutation Hashing. In In Proceedings of the 26th Annual Conference on Neural Information Processing Systems 2012., pages 3122–3130. 2012. DOI: 10.5555/2999325.2999482

Google Scholar ePrint

[MDDC16]

Luca Melis, George Danezis, and Emiliano De Cristofaro. Efficient Private Statistics with Succinct Sketches. In Proceedings 2016 Network and Distributed System Security Symposium. 2016. Internet Society. DOI: 10.14722/ndss.2016.23175

Google Scholar ePrint

[MG18]

Sahar Mazloom and S. Dov Gordon. Secure Computation with Differentially Private Access Patterns. In David Lie, Mohammad Mannan, Michael Backes, and XiaoFeng Wang, editors, ACM CCS 2018: 25th Conference on Computer and Communications Security, pages 490–507, Toronto, ON, Canada. 2018. ACM Press. DOI: 10.1145/3243734.3243851

Google Scholar ePrint

[MLRG20]

Sahar Mazloom, Phi Hung Le, Samuel Ranellucci, and S. Dov Gordon. Secure parallel computation on national scale volumes of data. In Srdjan Capkun and Franziska Roesner, editors, USENIX Security 2020: 29th USENIX Security Symposium, pages 2487–2504. 2020. USENIX Association.

Google Scholar ePrint

[MMNW11]

Darakhshan Mir, Shan Muthukrishnan, Aleksandar Nikolov, and Rebecca N Wright. Pan-private algorithms via statistics on sketches. In ACM SIGMOD-SIGACT-SIGART, pages 37–48. 2011. DOI: 10.1145/1989284.1989290

Google Scholar ePrint

[NvVT20]

Saskia Nuñez von Voigt and Florian Tschorsch. Rrtxfm: Probabilistic counting for differentially private statistics. In Digital Transformation for a Sustainable Society in the 21st Century: I3E 2019 IFIP WG 6.11 International Workshops, pages 86–98. 2020. Springer. DOI: 10.1007/978-3-030-39634-3_9

Google Scholar ePrint

[PS21]

Rasmus Pagh and Nina Mesing Stausholm. Efficient Differentially Private $F_0$ Linear Sketching. In Ke Yi and Zhewei Wei, editors, 24th International Conference on Database Theory (ICDT 2021), volume 186 of Leibniz International Proceedings in Informatics (LIPIcs), pages 18:1–18:19, Dagstuhl, Germany. 2021. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. DOI: 10.4230/LIPIcs.ICDT.2021.18

Google Scholar ePrint

[PT22]

Rasmus Pagh and Mikkel Thorup. Improved utility analysis of private countsketch. Advances in Neural Information Processing Systems, 35:25631–25643, 2022. DOI: 10.5555/3600270.3602128

Google Scholar ePrint

[RCS⁺19]

M. Sadegh Riazi, Beidi Chen, Anshumali Shrivastava, Dan Wallach, and Farinaz Koushanfar. Sub-Linear Privacy-Preserving Near-Neighbor Search. Cryptology ePrint Archive, Report 2019/1222. 2019.

Google Scholar ePrint

[SCRS17]

Elaine Shi, T.-H. Hubert Chan, Eleanor Gilbert Rieffel, and Dawn Song. Distributed Private Data Analysis: Lower Bounds and Practical Constructions. ACM Trans. Algorithms, 13(4):50:1–50:38, 2017. DOI: 10.1145/3146549

Google Scholar ePrint

[Sk{\'o}19]

Maciej Skórski. Strong chain rules for min-entropy under few bits spoiled. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 1122–1126. 2019. IEEE. DOI: 10.1109/isit.2019.8849240

Google Scholar ePrint

[SNY17]

Rade Stanojevic, Mohamed Nabeel, and Ting Yu. Distributed cardinality estimation of set operations with differential privacy. In 2017 IEEE Symposium on Privacy-Aware Computing (PAC), pages 37–48. 2017. IEEE. DOI: 10.1109/pac.2017.43

Google Scholar ePrint

[SSGT20]

Adam Smith, Shuang Song, and Abhradeep Guha Thakurta. The flajolet-martin sketch itself preserves differential privacy: Private counting with minimal space. NeurIPS 2020, 33:19561–19572, 2020. DOI: 10.5555/3495724.3497365

Google Scholar ePrint

[STS18]

Hagen Sparka, Florian Tschorsch, and Björn Scheuermann. P2KMV: A privacy-preserving counting sketch for efficient and accurate set intersection cardinality estimations. Cryptology ePrint Archive, 2018. DOI: 10.14279/DEPOSITONCE-8374

Google Scholar ePrint

[SV15]

Igal Sason and Sergio Verdú. Bounds among f-divergences. CoRR, abs/1508.00335, 2015.

Google Scholar ePrint

[TBK07]

Chayant Tantipathananandh, Tanya Y. Berger-Wolf, and David Kempe. A framework for community identification in dynamic social networks. In KDD, pages 717–726. 2007. ACM. DOI: 10.1145/1281192.1281269

Google Scholar ePrint

[TL24]

Yang Tan and Bo Lv. Break Two PSI-CA Protocols in Polynomial Time. In Proceedings of the 2024 16th International Conference on Machine Learning and Computing, pages 65–72, New York, NY, USA. 2024. Association for Computing Machinery. DOI: 10.1145/3651671.3651682

Google Scholar ePrint

[WPS22]

Lun Wang, Iosif Pinelis, and Dawn Song. Differentially Private Fractional Frequency Moments Estimation with Polylogarithmic Space. In International Conference on Learning Representations. 2022.

Google Scholar ePrint

[WWT⁺19]

Huijun Wu, Chen Wang, Yuriy Tyshetskiy, Andrew Docherty, Kai Lu, and Liming Zhu. Adversarial Examples for Graph Data: Deep Insights into Attack and Defense. In IJCAI 2019, pages 4816–4823. 2019. ijcai.org. DOI: 10.24963/ijcai.2019/669

Google Scholar ePrint

[YLL⁺17]

Ziqi Yan, Jiqiang Liu, Gang Li, Zhen Han, and Shuo Qiu. PrivMin: Differentially Private MinHash for Jaccard Similarity Computation. CoRR, abs/1705.07258, 2017. DOI: 10.48550/arXiv.1705.07258

Google Scholar ePrint

[YWR⁺19]

Ziqi Yan, Qiong Wu, Meng Ren, Jiqiang Liu, Shaowu Liu, and Shuo Qiu. Locally private Jaccard similarity estimation. Concurr. Comput. Pract. Exp., 31(24), 2019. DOI: 10.1002/cpe.4889

Google Scholar ePrint

[ZQR⁺22]

Fuheng Zhao, Dan Qiao, Rachel Redberg, Divyakant Agrawal, Amr El Abbadi, and Yu-Xiang Wang. Differentially private linear sketches: Efficient implementations and applications. NeurIPS 2022, 35:12691–12704, 2022. DOI: 10.5555/3600270.3601192

Google Scholar ePrint

[Ös02]

Ferdinand Österreicher. Csiszar’s f-divergence-basic properties. RGMIA Research Report Collection, 2002.

Google Scholar ePrint

PDF Open access

DOI: https://doi.org/10.62056/ak2i5w7sf

History

Submitted: 2024-10-07
Accepted: 2024-12-03
Published: 2025-01-13

How to cite

Mingyu Liang, Seung Geol Choi, Dana Dachman-Soled, Linsheng Liu, and Arkady Yerukhimovich, On the Privacy of Sublinear-Communication Jaccard Index Estimation via Min-hash. IACR Communications in Cryptology, vol. 1, no. 4, Jan 13, 2025, doi: 10.62056/ak2i5w7sf.

License

Copyright is held by the author(s)

This work is licensed under a Creative Commons Attribution (CC BY) license.

@article{10.62056/ak2i5w7sf,
         author={Mingyu Liang and Seung Geol Choi and Dana Dachman-Soled and Linsheng Liu and Arkady Yerukhimovich},
         title={On the Privacy of Sublinear-Communication Jaccard Index Estimation via Min-hash},
         volume={1},
         number={4},
         year={2025},
         date={2025-01-13},
         issn={3006-5496},
         doi={10.62056/ak2i5w7sf},
         journal={{IACR} Communications in Cryptology},
         publisher={International Association for Cryptologic Research}
}

TY  - JOUR
AU  - Mingyu Liang
AU  - Seung Geol Choi
AU  - Dana Dachman-Soled
AU  - Linsheng Liu
AU  - Arkady Yerukhimovich
PY  - 2025
TI  - On the Privacy of Sublinear-Communication Jaccard Index Estimation via Min-hash
JF  - IACR Communications in Cryptology
JA  - CIC
VL  - 1
IS  - 4
DO  - 10.62056/ak2i5w7sf
UR  - https://doi.org/10.62056/ak2i5w7sf
AB  - <p>   The min-hash sketch is a well-known technique for low-communication approximation of the Jaccard index between two input sets.  Moreover, there is a folklore belief that min-hash sketch-based protocols protect the privacy of the inputs.  In this paper, we consider variants of private min-hash sketch based-protocols and investigate this folklore to quantify the privacy of the min-hash sketch.</p><p>   We begin our investigation by presenting a highly-efficient two-party protocol for estimating the Jaccard index while ensuring differential privacy. This protocol adds Laplacian noise to the min-hash sketch counts to provide privacy protection.</p><p>   Then, we aim to understand what privacy, if any, is guaranteed if the results of the min-hash are released without any additional noise, such as in the case of historical data. We begin our investigation by considering the privacy of min-hash in a centralized setting where the hash functions are chosen by the min-hash functionality and are unknown to the participants. We show that in this case the min-hash output satisfies the standard definition of differential privacy (DP) without any additional noise.</p><p>   We next consider a more practical distributed setting, where the hash function must be shared among all parties and is typically public.</p><p>   Unfortunately, we show that in this public hash function setting, the min-hash output is no longer DP.  We therefore consider the notion of distributional differential privacy (DDP) introduced by Bassily et al. (FOCS 2013). We show that if the honest party's set has sufficiently high min-entropy, the min-hash output achieves DDP without requiring noise.</p><p>   Our findings provide guidance on how to use the min-hash sketch for private Jaccard index estimation and clarify the extent to which min-hash protocols protect input privacy, refining the common belief in their privacy guarantees. </p>
ER  -

Mingyu Liang, Seung Geol Choi, Dana Dachman-Soled, Linsheng Liu, and Arkady Yerukhimovich, On the Privacy of Sublinear-Communication Jaccard Index Estimation via Min-hash. IACR Communications in Cryptology, vol. 1, no. 4, Jan 13, 2025, doi: 10.62056/ak2i5w7sf.