Abstrakt | Clustering is an unsupervised machine learning technique that divides a data
set into groups. The manifold applications of clustering range from segmenting
a market using consumer preferences to grouping photos of diseased organs
in medical imaging. These scenarios inevitably require sensitive data such
as business secrets or personal health records. We consider multiple mutually
distrusting parties, e.g., corporations or hospitals that do not want to or are
not permitted to share their clients’ or patients’ data due to legislation such as
GDPR or HIPAA or to protect their market position and business secrets.
Nevertheless, these parties wish to cluster their joint data, since a larger data
pool usually results in a better result providing more insights.
A growing body of research proposes privacy-preserving variants of clus-
tering algorithms. These algorithms operate in a semi-honest security model,
which assumes that all parties honestly follow the protocol. However, exist-
ing implementations are impractical for real-world usage due to immense
computational costs or privacy-compromising data leakage during the cluster-
ing process. To the best of our knowledge, the few clustering schemes
that fully preserve privacy focus on the simple and efficient k-means
algorithm; however, the effectiveness of k-means depends on many assumptions.
For example, the number of clusters must be known in advance, and the input
data cannot include nominal variables or outliers. These assumptions often do
not hold for real-world clustering applications, limiting k-means’ practicality.
In our work, we explore the suitability of advanced clustering algorithms
for efficient crypto-oriented clustering. We identify affinity propagation to
be the most promising algorithm candidate. Affinity propagation and k-means
have comparable computational complexities of O(n2), and affinity propagation
provides more flexibility in terms of tolerance to outliers, compatibility with var-
ious data types, and the selection of tunable parameters. Furthermore, affinity
propagation involves only operations that can be implemented relatively effi-
ciently using secure computation techniques, such as addition and comparison.
The algorithm was also successfully applied for privacy-relevant use cases, e.g.,
the detection of genes from chromosomal data.
Based on our insights, we design the first privacy-preserving affinity propa-
gation clustering protocol using secure computation techniques. Moreover, our
protocol was implemented and benchmarked with the SPDZ framework
and enables private clustering in a malicious security model for the first time,
providing protection from strong adversaries that deviate from the protocol. |
---|