Example of a Background Knowledge Attack [17]: Alice has another friend, Tanaka, whose medical information also appears in the table (i.e., Fig. 9.2.1 (b)). Tanaka is a 21-year-old Japanese female living in the 13068 area, so Alice can infer that Tanaka's record number is between 1 and 4 and that her health problem is either heart disease or viral infection. Because it is well known that Japanese people rarely suffer from heart disease, Alice concludes that Tanaka has a viral infection.
From the above examples, we can see that although k-anonymity is an effective way to preserve record privacy, in some cases it fails to protect sensitive information. To tackle these issues, l-diversity [17, 18] was proposed, which preserves the diversity of sensitive attribute values in order to hide sensitive information.
Machanavajjhala et al. [17, 18] proposed the diversity principle, i.e., l-diversity, to protect sensitive information from malicious attacks. To fulfill this purpose, it requires that each quasi-identifier group contain at least l "well-represented" values of the sensitive attribute, which makes the tuples indistinct. To define how well the attribute values are represented, several possible models can be applied. The simplest one requires that at least l distinct values exist in each group. If l = k, a table satisfying this requirement also satisfies k-anonymity, since each group must then contain at least l tuples. This model is mentioned in [22]. However, this simple implementation can still be attacked, e.g., by a probabilistic inference attack: some values are more frequent than others in the group, and it is not difficult to deduce the frequent ones from the distribution of the values. To tackle this issue, several more elaborate principles based on l-diversity have been introduced.
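Before turning to those refinements, here is a minimal sketch of the simplest model, distinct l-diversity. The table layout, column names, and values are illustrative assumptions, not taken from [17] or [22]:

```python
from collections import defaultdict

def satisfies_distinct_l_diversity(records, qid_cols, sensitive_col, l):
    """Distinct l-diversity: every quasi-identifier group must contain
    at least l distinct values of the sensitive attribute."""
    groups = defaultdict(set)
    for row in records:
        key = tuple(row[c] for c in qid_cols)   # group key = generalized QI values
        groups[key].add(row[sensitive_col])     # collect distinct sensitive values
    return all(len(vals) >= l for vals in groups.values())

# Toy anonymized table: two quasi-identifier groups, each with 2 diseases.
table = [
    {"zip": "130**", "age": "2*", "disease": "heart disease"},
    {"zip": "130**", "age": "2*", "disease": "viral infection"},
    {"zip": "148**", "age": "3*", "disease": "cancer"},
    {"zip": "148**", "age": "3*", "disease": "flu"},
]
print(satisfies_distinct_l_diversity(table, ["zip", "age"], "disease", l=2))  # True
```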
Entropy l-diversity [17]: For entropy l-diversity, each quasi-identifier group must satisfy −∑_{s∈S} P(qid, s) log P(qid, s) ≥ log(l), where S is the domain of the sensitive attribute and P(qid, s) is the fraction of tuples in the quasi-identifier group that take the sensitive value s. By the properties of entropy, a larger entropy value indicates that the sensitive values are distributed more evenly in the group, which makes the tuples more indistinct. However, entropy l-diversity still cannot prevent all attacks. Moreover, it has the drawback that the entropy value is difficult for users to interpret; users prefer probability-based explanations, e.g., "according to the current l-diversity setting, a malicious attacker has a 20% chance of learning that Bob has cancer."
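The check is direct to implement from the formula above. The following is a minimal sketch (the disease names are made up) that computes the entropy of a group's sensitive values and compares it against log(l):

```python
import math
from collections import Counter

def satisfies_entropy_l_diversity(sensitive_values, l):
    """Entropy l-diversity for one quasi-identifier group:
    -sum_s P(qid, s) * log P(qid, s) >= log(l)."""
    n = len(sensitive_values)
    counts = Counter(sensitive_values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy >= math.log(l)

# A group dominated by one value has low entropy and fails the check:
print(satisfies_entropy_l_diversity(["flu"] * 3 + ["cancer"], l=2))
# False: entropy ~ 0.56 < log 2 ~ 0.69
print(satisfies_entropy_l_diversity(["flu", "cancer", "asthma", "hepatitis"], l=2))
# True: entropy = log 4 >= log 2
```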
Recursive (c, l)-diversity [17]: Similar to entropy l-diversity, the key idea of recursive (c, l)-diversity is to ensure that the sensitive values are distributed as evenly as possible, so that the most frequent values are not too frequent and the rare values are not too rare. In a given quasi-identifier group qn, let r_i denote the number of times the i-th most frequent sensitive value appears in qn, and let m be the number of distinct sensitive values in qn. Given a constant c, qn satisfies recursive (c, l)-diversity if r_1 < c(r_l + r_{l+1} + ... + r_m). A table satisfies recursive (c, l)-diversity if every quasi-identifier group satisfies it. Note that 1-diversity is always satisfied.
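The sketch below (again with invented data) implements this check directly from the definition, sorting the frequency counts r_1 ≥ ... ≥ r_m and testing r_1 < c(r_l + ... + r_m):

```python
from collections import Counter

def satisfies_recursive_cl_diversity(sensitive_values, c, l):
    """Recursive (c, l)-diversity for one quasi-identifier group:
    with r_1 >= r_2 >= ... >= r_m the sorted frequency counts of the
    sensitive values, require r_1 < c * (r_l + r_(l+1) + ... + r_m)."""
    counts = sorted(Counter(sensitive_values).values(), reverse=True)
    if l > len(counts):
        return False                    # fewer than l distinct values
    return counts[0] < c * sum(counts[l - 1:])

# Counts (6, 2, 2): 6 < 2 * (2 + 2) holds, but 6 < 2 * 2 does not.
group = ["flu"] * 6 + ["cancer"] * 2 + ["asthma"] * 2
print(satisfies_recursive_cl_diversity(group, c=2, l=2))  # True
print(satisfies_recursive_cl_diversity(group, c=2, l=3))  # False
```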
Some other principles based on l-diversity have also been introduced, e.g., positive disclosure-recursive (c, l)-diversity and negative/positive disclosure-recursive (c_1, c_2, l)-diversity [17, 18]. The basic idea of these principles is that even if background knowledge lets an attacker eliminate some values from a group, he still cannot recognize the sensitive information. From another viewpoint, some work estimates the maximum disclosure risk of the published data [19] under different privacy-preserving metrics.
Note that the above-mentioned issues rest on the assumption that only one sensitive attribute exists. If there are multiple sensitive attributes, the l-diversity problem becomes more challenging. Some works have explored this issue, e.g., [18, 25]. However, all of them suffer from the curse of dimensionality.
9.3 The t-Closeness Method
Li et al. [16] identified an intrinsic drawback of l-diversity: leakage of sensitive information can still happen when the overall distribution of a sensitive attribute is skewed. The reason is that although the l-diversity requirement ensures "diversity" of sensitive values in each group, it takes into account neither the semantic closeness of these values nor their distribution in the whole table. For example, suppose we have a patient table where 90% of the tuples have headache and 10% have cancer. A quasi-identifier group containing 50% headache and 50% cancer satisfies 2-diversity. Nevertheless, this group still faces a privacy problem, because it is easy to infer that any person in the group has a 50% chance of having cancer, whereas over the whole table this probability is only 10%. The obvious gap between these two conclusions makes the l-diversity principle lose its effect.
To defend against the attack in the above example, Li et al. [16] introduced the t-closeness principle. The key idea is that t-closeness requires the distribution of the sensitive attribute in every quasi-identifier group to be similar to its distribution in the overall table. Moreover, in [16] the authors introduced a distance metric, the Earth Mover's Distance (EMD), to estimate the closeness between two distributions; a constant t serves as the threshold that every group must satisfy. Despite its advantages, this interesting principle raises several issues: (1) it is not easy to protect privacy according to different security levels; (2) the introduced distance metric, i.e., EMD, lacks the flexibility to cope with numerical attributes [15]; and (3) the utility of the published data may be largely sacrificed, because forcing all group distributions to resemble the overall one is a very strict rule. Several solutions have been introduced to tackle some (if not all) of these issues [9].
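To make the principle concrete, the sketch below (our illustration, not code from [16]) checks t-closeness for a categorical sensitive attribute; under the equal ground distance used in [16] for categorical values, the EMD reduces to the variational distance, i.e., half the L1 distance between the two distributions. It reuses the headache/cancer example above:

```python
from collections import Counter

def emd_equal_distance(group_values, table_values):
    """EMD between a group's sensitive-value distribution and the overall
    table's, under the equal ground distance for categorical attributes;
    in that case EMD reduces to half the L1 distance (the variational
    distance) between the two distributions."""
    domain = set(group_values) | set(table_values)
    g, t = Counter(group_values), Counter(table_values)
    p = {s: g[s] / len(group_values) for s in domain}
    q = {s: t[s] / len(table_values) for s in domain}
    return 0.5 * sum(abs(p[s] - q[s]) for s in domain)

def satisfies_t_closeness(group_values, table_values, t):
    """A group satisfies t-closeness if its distance from the overall
    distribution is at most the threshold t."""
    return emd_equal_distance(group_values, table_values) <= t

# The skewness example above: a 90%/10% table versus a 50%/50% group.
table = ["headache"] * 9 + ["cancer"]
group = ["headache", "cancer"]
print(emd_equal_distance(group, table))            # ~0.4
print(satisfies_t_closeness(group, table, t=0.2))  # False
```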
9.4 Discussion and Challenges
In this chapter, we presented the main strategies for privacy-preserving data mining. Several issues deserve mention. The first is how to keep a good balance between different evaluation metrics, such as privacy and utility. Intuitively, the safest strategy for protecting privacy is to publish as little data as possible, though this approach leads to low utility. For some principles (e.g., entropy l-diversity), the semantic meaning of the parameter setting is difficult to explain, which makes them challenging to apply in real applications. Another main issue for privacy preservation is the curse of dimensionality. The work in [1] states that to preserve privacy, a large number of attributes may need to be suppressed or generalized, which also reduces the data's utility. More seriously, some methods become infeasible to implement as more dimensions are taken into account.
9.5 Chapter Summary
In this chapter, we introduced the basic concepts and main techniques of privacy-preserving data mining. We presented a variety of data transformation strategies, such as k-anonymity, l-diversity, and t-closeness based methods. Furthermore, we gave some concrete examples to illustrate the advantages and disadvantages of these approaches and analyzed them thoroughly. Some related issues, namely the curse of dimensionality and the balance between utility and privacy, were also discussed.
References
[1] C. C. Aggarwal. On k-anonymity and the curse of dimensionality. In: Proceedings of the
31st International Conference on Very Large Data Bases, pp. 901–909, 2005.
[2] C. C. Aggarwal and P. S. Yu. A condensation approach to privacy preserving data mining.
In: Proceedings of International Conference on Extending Database Technology, pp. 183–199,
2004.
[3] G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas and A. Zhu.
Achieving anonymity via clustering. In: Proceedings of the Twenty-fifth ACM SIGMOD-
SIGACT-SIGART Symposium on Principles of Database Systems, pp. 153–162, 2006.
[4] G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas and
A. Zhu. Anonymizing tables. In: T. Eiter and L. Libkin, editors, ICDT, pp. 246–258, 2005.
[5] R. J. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In:
Proceedings of the 21st International Conference on Data Engineering, pp. 217–228, 2005.
[6] V. Ciriani, S. D. C. di Vimercati, S. Foresti and P. Samarati. k-Anonymity, Vol. 33 of Advances
in Information Security. Springer, 2007.
[7] C. Clifton. Using sample size to limit exposure to data mining. J. Comput. Secur., 8:
281–307, December 2000.
[8] J. Domingo-Ferrer and J. M. Mateo-Sanz. Practical data-oriented microaggregation for
statistical disclosure control. IEEE Trans. on Knowl. and Data Eng., 14: 189–201, January
2002.
[9] J. Domingo-Ferrer and V. Torra. A critique of k-anonymity and some of its enhancements.
In: Proceedings of the 2008 Third International Conference on Availability, Reliability and
Security, pp. 990–993, 2008.
[10] B. C. M. Fung, K. Wang and P. S. Yu. Top-down specialization for information and privacy
preservation. In: Proceedings of the 21st International Conference on Data Engineering, pp.
205–216, 2005.
[11] V. S. Iyengar. Transforming data to satisfy privacy constraints. In: Proceedings of the
Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
pp. 279–288, 2002.
[12] L. V. S. Lakshmanan, R. T. Ng and G. Ramesh. To do or not to do: the dilemma of
disclosing anonymized data. In: Proceedings of the 2005 ACM SIGMOD International
Conference on Management of Data, pp. 61–72, 2005.
[13] K. LeFevre, D. J. DeWitt and R. Ramakrishnan. Incognito: efficient full-domain
k-anonymity. In: Proceedings of the 2005 ACM SIGMOD International Conference on
Management of Data, pp. 49–60, 2005.
[14] K. LeFevre, D. J. DeWitt and R. Ramakrishnan. Mondrian multidimensional k-anonymity.
In: Proceedings of the 22nd International Conference on Data Engineering, 2006.
[15] J. Li, Y. Tao and X. Xiao. Preservation of proximity privacy in publishing numerical
sensitive data. In: Proceedings of the 2008 ACM SIGMOD International Conference on
Management of Data, pp. 473–486, 2008.
[16] N. Li, T. Li and S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity
and l-diversity. In: Proceedings of the 21st International Conference on Data Engineering,
pp. 106–115, 2007.
[17] A. Machanavajjhala, D. Kifer, J. Gehrke and M. Venkitasubramaniam. l-diversity: Privacy
beyond k-anonymity. In: Proceedings of International Conference on Data Engineering, 2006.
[18] A. Machanavajjhala, D. Kifer, J. Gehrke and M. Venkitasubramaniam. L-diversity: Privacy
beyond k-anonymity. ACM Trans. Knowl. Discov. Data, 1, March 2007.
[19] D. J. Martin, D. Kifer, A. Machanavajjhala, J. Gehrke and J. Y. Halpern. Worst-case
background knowledge in privacy. In: Proceedings of International Conference on Data
Engineering, pp. 126–135, 2007.
[20] A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In: Proceedings
of the Twenty-third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database
Systems, pp. 223–228, 2004.
[21] P. Samarati. Protecting respondents’ identities in microdata release. IEEE Trans. on Knowl.
and Data Eng., 13: 1010–1027, November 2001.
[22] T. M. Truta and B. Vinay. Privacy protection: p-sensitive k-anonymity property. In:
Proceedings of the 22nd International Conference on Data Engineering Workshops, 2006.
[23] K. Wang, P. S. Yu and S. Chakraborty. Bottom-up generalization: A data mining solution
to privacy protection. In: Proceedings of the Fourth IEEE International Conference on Data
Mining, pp. 249–256, 2004.
[24] W. E. Winkler. Using simulated annealing for k-anonymity, 2002.
[25] X. Xiao and Y. Tao. Anatomy: simple and effective privacy preservation. In: Proceedings
of the 32nd International Conference on Very Large Data Bases, pp. 139–150, 2006.
[26] C. Yao, X. S. Wang and S. Jajodia. Checking for k-anonymity violation by views. In:
Proceedings of the 31st International Conference on Very Large Data Bases, pp. 910–921, 2005.