Example of a Background Knowledge Attack [17]: Alice has another friend, Tanaka, whose medical information also appears in the table (i.e., Fig. 9.2.1 (b)). Tanaka is a 21-year-old Japanese female living in the 13068 area, so Alice can infer that Tanaka's record number is between 1 and 4 and that her health problem is either heart disease or viral infection. Because it is well known that Japanese people rarely suffer from heart disease, Alice concludes that Tanaka has a viral infection.
From the above examples, we can see that although k-anonymity is an effective way to preserve record privacy, in some cases it fails to protect sensitive information. To tackle these issues, l-diversity [17, 18] was proposed, which preserves the diversity of sensitive attribute values in order to hide sensitive information.
Machanavajjhala et al. [17, 18] proposed the diversity principle, i.e., l-diversity, to protect sensitive information from malicious attacks. To fulfill this purpose, it requires that each quasi-identifier group contain at least l "well-represented" values of the sensitive attribute, which makes the tuples indistinct. To define how well the attribute values are represented, several possible models can be applied. The simplest one requires that at least l distinct values exist in each group. If l = k, a table satisfying this requirement also satisfies k-anonymity, since each group must then contain at least l tuples. This model is mentioned in [22]. However, this simple implementation can still be attacked, e.g., by a probabilistic inference attack: some values are more frequent than others in the group, and it is not difficult to deduce the frequent ones from the distribution of the values. To tackle this issue, several more elaborate principles based on l-diversity have been introduced.
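Before turning to those refinements, here is a minimal sketch of the simplest model, distinct l-diversity. The table layout, column names, and values are illustrative assumptions, not taken from [17] or [22]:

```python
from collections import defaultdict

def satisfies_distinct_l_diversity(records, qid_cols, sensitive_col, l):
    """Distinct l-diversity: every quasi-identifier group must contain
    at least l distinct values of the sensitive attribute."""
    groups = defaultdict(set)
    for row in records:
        key = tuple(row[c] for c in qid_cols)   # group key = generalized QI values
        groups[key].add(row[sensitive_col])     # collect distinct sensitive values
    return all(len(vals) >= l for vals in groups.values())

# Toy anonymized table: two quasi-identifier groups, each with 2 diseases.
table = [
    {"zip": "130**", "age": "2*", "disease": "heart disease"},
    {"zip": "130**", "age": "2*", "disease": "viral infection"},
    {"zip": "148**", "age": "3*", "disease": "cancer"},
    {"zip": "148**", "age": "3*", "disease": "flu"},
]
print(satisfies_distinct_l_diversity(table, ["zip", "age"], "disease", l=2))  # True
```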
Entropy l-diversity [17]: For entropy l-diversity, each quasi-identifier group must satisfy −∑_{s∈S} P(qid, s) log P(qid, s) ≥ log(l), where S is the domain of the sensitive attribute and P(qid, s) is the fraction of tuples in the quasi-identifier group that take the sensitive value s. By the properties of entropy, a larger entropy value indicates that the sensitive values are distributed more evenly in the group, which makes the tuples more indistinct. However, entropy l-diversity still cannot prevent all attacks. Moreover, it has the drawback that the entropy value is difficult for users to interpret; users prefer probability-based explanations, e.g., "according to the current l-diversity setting, a malicious attacker has a 20% chance of learning that Bob has cancer."
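The check is direct to implement from the formula above. The following is a minimal sketch (the disease names are made up) that computes the entropy of a group's sensitive values and compares it against log(l):

```python
import math
from collections import Counter

def satisfies_entropy_l_diversity(sensitive_values, l):
    """Entropy l-diversity for one quasi-identifier group:
    -sum_s P(qid, s) * log P(qid, s) >= log(l)."""
    n = len(sensitive_values)
    counts = Counter(sensitive_values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy >= math.log(l)

# A group dominated by one value has low entropy and fails the check:
print(satisfies_entropy_l_diversity(["flu"] * 3 + ["cancer"], l=2))
# False: entropy ~ 0.56 < log 2 ~ 0.69
print(satisfies_entropy_l_diversity(["flu", "cancer", "asthma", "hepatitis"], l=2))
# True: entropy = log 4 >= log 2
```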
Recursive (c, l)-diversity [17]: Similar to entropy l-diversity, the key idea of recursive (c, l)-diversity is to ensure that the sensitive values are distributed as evenly as possible, so that the most frequent values are not too frequent and the rare values are not too rare. In a given quasi-identifier group qn, let r_i denote the number of times the i-th most frequent sensitive value appears in qn, and let m be the number of distinct sensitive values in qn. Given a constant c, qn satisfies recursive (c, l)-diversity if r_1 < c(r_l + r_{l+1} + ... + r_m). A table satisfies recursive (c, l)-diversity if every quasi-identifier group satisfies it. Note that 1-diversity is always satisfied.
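The sketch below (again with invented data) implements this check directly from the definition, sorting the frequency counts r_1 ≥ ... ≥ r_m and testing r_1 < c(r_l + ... + r_m):

```python
from collections import Counter

def satisfies_recursive_cl_diversity(sensitive_values, c, l):
    """Recursive (c, l)-diversity for one quasi-identifier group:
    with r_1 >= r_2 >= ... >= r_m the sorted frequency counts of the
    sensitive values, require r_1 < c * (r_l + r_(l+1) + ... + r_m)."""
    counts = sorted(Counter(sensitive_values).values(), reverse=True)
    if l > len(counts):
        return False                    # fewer than l distinct values
    return counts[0] < c * sum(counts[l - 1:])

# Counts (6, 2, 2): 6 < 2 * (2 + 2) holds, but 6 < 2 * 2 does not.
group = ["flu"] * 6 + ["cancer"] * 2 + ["asthma"] * 2
print(satisfies_recursive_cl_diversity(group, c=2, l=2))  # True
print(satisfies_recursive_cl_diversity(group, c=2, l=3))  # False
```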
Some other principles based on l-diversity have also been introduced, e.g., positive disclosure-recursive (c, l)-diversity and negative/positive disclosure-recursive (c_1, c_2, l)-diversity [17, 18]. The basic idea of these principles is that even if background knowledge lets an attacker eliminate some values from a group, he still cannot recognize the sensitive information. From another viewpoint, some work estimates the maximum disclosure risk of the published data [19] under different privacy-preserving metrics.
Note that the above-mentioned issues rest on the assumption that only one sensitive attribute exists. If there are multiple sensitive attributes, the l-diversity problem becomes more challenging. Some works have explored this issue, e.g., [18, 25]. However, all of them suffer from the curse of dimensionality.
9.3 The t-Closeness Method
Li et al. [16] identified an intrinsic drawback of l-diversity: leakage of sensitive information can still happen when the overall distribution of a sensitive attribute is skewed. The reason is that although the l-diversity requirement ensures "diversity" of sensitive values in each group, it takes into account neither the semantic closeness of these values nor their distribution in the whole table. For example, suppose we have a patient table where 90% of the tuples have headache and 10% have cancer. A quasi-identifier group containing 50% headache and 50% cancer satisfies 2-diversity. Nevertheless, this group still faces a privacy problem, because it is easy to infer that any person in the group has a 50% chance of having cancer, whereas over the whole table this probability is only 10%. The obvious gap between these two conclusions makes the l-diversity principle lose its effect.
To defend against the attack in the above example, Li et al. [16] introduced the t-closeness principle. The key idea is that t-closeness requires the distribution of the sensitive attribute in every quasi-identifier group to be similar to its distribution in the overall table. Moreover, in [16] the authors introduced a distance metric, the Earth Mover's Distance (EMD), to estimate the closeness between two distributions; a constant t serves as the threshold that every group must satisfy. Despite its advantages, this interesting principle raises several issues: (1) it is not easy to protect privacy according to different security levels; (2) the introduced distance metric, i.e., EMD, lacks the flexibility to cope with numerical attributes [15]; and (3) the utility of the published data may be largely sacrificed, because forcing all group distributions to resemble the overall one is a very strict rule. Several solutions have been introduced to tackle some (if not all) of these issues [9].
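To make the principle concrete, the sketch below (our illustration, not code from [16]) checks t-closeness for a categorical sensitive attribute; under the equal ground distance used in [16] for categorical values, the EMD reduces to the variational distance, i.e., half the L1 distance between the two distributions. It reuses the headache/cancer example above:

```python
from collections import Counter

def emd_equal_distance(group_values, table_values):
    """EMD between a group's sensitive-value distribution and the overall
    table's, under the equal ground distance for categorical attributes;
    in that case EMD reduces to half the L1 distance (the variational
    distance) between the two distributions."""
    domain = set(group_values) | set(table_values)
    g, t = Counter(group_values), Counter(table_values)
    p = {s: g[s] / len(group_values) for s in domain}
    q = {s: t[s] / len(table_values) for s in domain}
    return 0.5 * sum(abs(p[s] - q[s]) for s in domain)

def satisfies_t_closeness(group_values, table_values, t):
    """A group satisfies t-closeness if its distance from the overall
    distribution is at most the threshold t."""
    return emd_equal_distance(group_values, table_values) <= t

# The skewness example above: a 90%/10% table versus a 50%/50% group.
table = ["headache"] * 9 + ["cancer"]
group = ["headache", "cancer"]
print(emd_equal_distance(group, table))            # ~0.4
print(satisfies_t_closeness(group, table, t=0.2))  # False
```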
9.4 Discussion and Challenges
In this chapter, we presented the main strategies for privacy-preserving data mining. Several issues deserve mention. The first is how to keep a good balance between different evaluation metrics, such as privacy and utility. Intuitively, the safest strategy for protecting privacy is to publish as little data as possible, though this approach leads to low utility. For some principles (e.g., entropy l-diversity), the semantic meaning of the parameter setting is difficult to explain, which makes them challenging to apply in real applications. Another main issue for privacy preservation is the curse of dimensionality. The work in [1] states that to preserve privacy, a large number of attributes may need to be suppressed or generalized, which also reduces the data's utility. More seriously, some methods become infeasible to implement as more dimensions are taken into account.
9.5 Chapter Summary
In this chapter, we introduced the basic concepts and main techniques of privacy-preserving data mining. We presented a variety of data transformation strategies, such as k-anonymity, l-diversity, and t-closeness based methods. Furthermore, we gave some concrete examples to illustrate the advantages and disadvantages of these approaches and analyzed them thoroughly. Some related issues, namely the curse of dimensionality and the balance between utility and privacy, were also discussed.
References
[1] C. C. Aggarwal. On k-anonymity and the curse of dimensionality. In: Proceedings of the
31st International Conference on Very Large Data Bases, pp. 901–909, 2005.
[2] C. C. Aggarwal and P. S. Yu. A condensation approach to privacy preserving data mining.
In: Proceedings of International Conference on Extending Database Technology, pp. 183–199,
2004.
[3] G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas and A. Zhu.
Achieving anonymity via clustering. In: Proceedings of the Twenty-fifth ACM SIGMOD-
SIGACT-SIGART Symposium on Principles of Database Systems, pp. 153–162, 2006.
[4] G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas and
A. Zhu. Anonymizing tables. In: T. Eiter and L. Libkin, editors, ICDT, pp. 246–258, 2005.
[5] R. J. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In:
Proceedings of the 21st International Conference on Data Engineering, pp. 217–228, 2005.
[6] V. Ciriani, S. D. C. di Vimercati, S. Foresti and P. Samarati. k-Anonymity, Vol. 33 of Advances
in Information Security. Springer, 2007.
[7] C. Clifton. Using sample size to limit exposure to data mining. J. Comput. Secur., 8:
281–307, December 2000.
[8] J. Domingo-Ferrer and J. M. Mateo-Sanz. Practical data-oriented microaggregation for
statistical disclosure control. IEEE Trans. on Knowl. and Data Eng., 14: 189–201, January
2002.
[9] J. Domingo-Ferrer and V. Torra. A critique of k-anonymity and some of its enhancements.
In: Proceedings of the 2008 Third International Conference on Availability, Reliability and
Security, pp. 990–993, 2008.
[10] B. C. M. Fung, K. Wang and P. S. Yu. Top-down specialization for information and privacy
preservation. In: Proceedings of the 21st International Conference on Data Engineering, pp.
205–216, 2005.
[11] V. S. Iyengar. Transforming data to satisfy privacy constraints. In: Proceedings of the
Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
pp. 279–288, 2002.
[12] L. V. S. Lakshmanan, R. T. Ng and G. Ramesh. To do or not to do: the dilemma of
disclosing anonymized data. In: Proceedings of the 2005 ACM SIGMOD International
Conference on Management of Data, pp. 61–72, 2005.
[13] K. LeFevre, D. J. DeWitt and R. Ramakrishnan. Incognito: efficient full-domain
k-anonymity. In: Proceedings of the 2005 ACM SIGMOD International Conference on
Management of Data, pp. 49–60, 2005.
[14] K. LeFevre, D. J. DeWitt and R. Ramakrishnan. Mondrian multidimensional k-anonymity.
In: Proceedings of the 22nd International Conference on Data Engineering, 2006.
[15] J. Li, Y. Tao and X. Xiao. Preservation of proximity privacy in publishing numerical
sensitive data. In: Proceedings of the 2008 ACM SIGMOD International Conference on
Management of Data, pp. 473–486, 2008.
[16] N. Li, T. Li and S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity
and l-diversity. In: Proceedings of the 21st International Conference on Data Engineering,
pp. 106–115, 2007.
[17] A. Machanavajjhala, D. Kifer, J. Gehrke and M. Venkitasubramaniam. l-diversity: Privacy
beyond k-anonymity. In: Proceedings of International Conference on Data Engineering, 2006.
[18] A. Machanavajjhala, D. Kifer, J. Gehrke and M. Venkitasubramaniam. L-diversity: Privacy
beyond k-anonymity. ACM Trans. Knowl. Discov. Data, 1, March 2007.
[19] D. J. Martin, D. Kifer, A. Machanavajjhala, J. Gehrke and J. Y. Halpern. Worst-case
background knowledge in privacy. In: Proceedings of International Conference on Data
Engineering, pp. 126–135, 2007.
[20] A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In: Proceedings
of the Twenty-third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database
Systems, pp. 223–228, 2004.
[21] P. Samarati. Protecting respondents’ identities in microdata release. IEEE Trans. on Knowl.
and Data Eng., 13: 1010–1027, November 2001.
[22] T. M. Truta and B. Vinay. Privacy protection: p-sensitive k-anonymity property. In:
Proceedings of the 22nd International Conference on Data Engineering Workshops, 2006.
[23] K. Wang, P. S. Yu and S. Chakraborty. Bottom-up generalization: A data mining solution
to privacy protection. In: Proceedings of the Fourth IEEE International Conference on Data
Mining, pp. 249–256, 2004.
[24] W. E. Winkler. Using simulated annealing for k-anonymity, 2002.
[25] X. Xiao and Y. Tao. Anatomy: simple and effective privacy preservation. In: Proceedings
of the 32nd International Conference on Very Large Data Bases, pp. 139–150, 2006.
[26] C. Yao, X. S. Wang and S. Jajodia. Checking for k-anonymity violation by views. In:
Proceedings of the 31st International Conference on Very Large Data Bases, pp. 910–921, 2005.