Database Anonymization. David Sánchez

Series: Synthesis Lectures on Information Security, Privacy, and Trust
ISBN: 9781681731988

Brandeis, back in 1890, in an article [103] published in the Harvard Law Review. These authors presented laws as dynamic systems for the protection of individuals whose evolution is triggered by social, political, and economic changes. In particular, the conception of the right to privacy was triggered by the technical advances and new business models of the time. Quoting Warren and Brandeis:

      Instantaneous photographs and newspaper enterprise have invaded the sacred precincts of private and domestic life; and numerous mechanical devices threaten to make good the prediction that what is whispered in the closet shall be proclaimed from the house-tops.

      Warren and Brandeis argued that the “right to privacy” already existed in many areas of the common law; they only gathered these scattered legal concepts and brought them into focus under their common denominator. Within the legal framework of the time, the “right to privacy” was part of the right to life, one of the three fundamental individual rights recognized by the U.S. Constitution.

      Privacy concerns were revived by the invention of computers [31] and information exchange networks, which caused information collection, storage, and processing capabilities to skyrocket. The generalization of population surveys was a consequence. The focus was then on data protection.

      Nowadays, privacy is widely considered a fundamental right, and it is supported by international treaties and many constitutional laws. For example, the Universal Declaration of Human Rights (1948) devotes its Article 12 to privacy. In fact, privacy has gained worldwide recognition and it applies to a wide range of situations such as: avoiding external meddling at home, limiting the use of surveillance technologies, controlling processing and dissemination of personal data, etc.

      As far as the protection of individuals’ data is concerned, privacy legislation is based on several principles [69, 101]: collection limitation, purpose specification, use limitation, data quality, security safeguards, openness, individual participation, and accountability. However, with the appearance of big data, it is unclear whether any of these principles is really effective [93].

      Among all the aspects that relate to data privacy, we are especially interested in data dissemination. Dissemination is, for instance, the primary task of National Statistical Institutes. These aim at offering an accurate picture of society; to that end, they collect and publish statistical data on a wide range of aspects such as economy, population, etc. Legislation usually equates privacy violations in data dissemination with individual identifiability [1, 2]; for instance, Title 13, Chapter 1.1 of the U.S. Code states that “no individual should be re-identifiable in the released data.”

      For a more comprehensive review of the history of privacy, see [43]. A more visual perspective of privacy is given by the timelines [3, 4]. In [3], key privacy-related events between 1600 (when it was a civic duty to keep an eye on your neighbors) and 2008 (after the U.S. Patriot Act and the inception of Facebook) are listed. In [4], key moments that have shaped privacy-related laws are depicted.

      The type of data being released determines the potential threats to privacy as well as the most suitable protection methods. Statistical databases come in three main formats.

      • Microdata. The term “microdata” refers to a record that contains information related to a specific individual (a citizen or a company). A microdata release aims at publishing raw data, that is, a set of microdata records.

      • Tabular data. Cross-tabulated values showing aggregate values for groups of individuals are released. The term “contingency (or frequency) table” is used when counts are released, and the term “magnitude table” is used for other aggregate magnitudes. This type of data is the classical output of official statistics.

      • Queryable databases, that is, interactive databases to which the user can submit statistical queries (sums, averages, etc.).
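
      As a rough illustration of how the first two formats relate, the following Python sketch derives a frequency table and a magnitude table from a tiny microdata set. The records, the attribute names (`region`, `sex`, `income`), and the two helper functions are invented for this example and do not come from the text.

```python
from collections import Counter, defaultdict

# Hypothetical microdata: one record per individual (all values are made up).
microdata = [
    {"region": "North", "sex": "F", "income": 30000},
    {"region": "North", "sex": "M", "income": 42000},
    {"region": "South", "sex": "F", "income": 35000},
    {"region": "South", "sex": "F", "income": 25000},
]

def frequency_table(records, row_attr, col_attr):
    """Count the individuals that fall in each (row, column) cell."""
    return Counter((r[row_attr], r[col_attr]) for r in records)

def magnitude_table(records, row_attr, col_attr, value_attr):
    """Sum a numerical attribute over the individuals in each cell."""
    totals = defaultdict(int)
    for r in records:
        totals[(r[row_attr], r[col_attr])] += r[value_attr]
    return dict(totals)

counts = frequency_table(microdata, "region", "sex")
totals = magnitude_table(microdata, "region", "sex", "income")
```

      Both tables are computed from the same records, which hints at why a microdata release is the most flexible format: a user holding the records can build any such table, whereas the tables alone do not give back the individual records.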

      Our focus in subsequent chapters is on microdata releases. Microdata offer the greatest level of flexibility among all types of data releases: data users are not confined to a specific prefixed view of data; they are able to carry out any kind of custom analysis on the released data. However, microdata releases are also the most challenging for the privacy of individuals.

      A microdata set can be represented as a table (matrix) where each row refers to a different individual and each column contains information regarding one of the attributes collected. We use X to denote the collected microdata file. We assume that X contains information about n respondents and m attributes. We use $x_i$ to refer to the record contributed by respondent i, and $x^j$ (or $X^j$) to refer to attribute j. The value of attribute j for respondent i is denoted by $x_i^j$.
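
      The matrix view of a microdata file can be sketched in a few lines of Python. This is only an illustration under assumed data: the attribute names, the values, and the helper functions `record`, `attribute`, and `value` are hypothetical, and indices are 0-based here for convenience.

```python
# Hypothetical microdata file X with n = 3 respondents and m = 3 attributes.
attributes = ["age", "zip_code", "salary"]
X = [
    [34, "43001", 28000],
    [51, "43002", 41000],
    [29, "43001", 33000],
]

n = len(X)           # number of respondents (rows)
m = len(attributes)  # number of attributes (columns)

def record(X, i):
    """Record contributed by respondent i (row i, 0-based)."""
    return X[i]

def attribute(X, j):
    """Values of attribute j across all respondents (column j, 0-based)."""
    return [row[j] for row in X]

def value(X, i, j):
    """Value of attribute j for respondent i."""
    return X[i][j]
```

      For instance, `record(X, 1)` returns the whole row of the second respondent, while `attribute(X, 2)` returns the salary column.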

      The attributes in a microdata set are usually classified in the following non-exclusive categories.

      • Identifiers. An attribute is an identifier if it provides unambiguous re-identification of the individual to which the record refers. Some examples of identifier attributes are the social security number, the passport number, etc. If a record contains an identifier, any sensitive information contained in other attributes may immediately be linked to a specific individual. To avoid direct re-identification of an individual, identifier attributes must be removed or encrypted. In the following chapters, we assume that identifier attributes have previously been removed.

      • Quasi-identifiers. Unlike an identifier, a quasi-identifier attribute alone does not lead to record re-identification. However, in combination with other quasi-identifier attributes, it may allow unambiguous re-identification of some individuals. For example, [99] shows that 87% of the population in the U.S. can be unambiguously identified by combining a 5-digit ZIP code, birth date, and sex. Removing quasi-identifier attributes, as proposed for the identifiers, is not possible, because quasi-identifiers are most of the time required to perform any useful analysis of the data. Deciding whether a specific attribute should be considered a quasi-identifier is a thorny issue. In practice, any information an intruder has about an individual can be used in record re-identification. For uninformed intruders, only the attributes available in an external non-anonymized data set should be classified as quasi-identifiers; in the presence of informed intruders any attribute may potentially be a quasi-identifier. Thus, in the strictest case, to make sure all potential quasi-identifiers have been removed, one ought to remove all attributes (!).

      • Confidential attributes. Confidential attributes hold sensitive information on the individuals who took part in the data collection process (e.g., salary, health condition, sexual orientation, etc.). The primary goal of microdata protection techniques is to prevent intruders from learning confidential information about a specific individual. This goal involves not only preventing the intruder from determining the exact value that a confidential attribute takes for some individual, but also preventing accurate inferences on the value of that attribute (such as bounding it).

      • Non-confidential attributes. Non-confidential attributes are those that do not belong to any of the previous categories. As they do not contain sensitive information about individuals and cannot be used for record re-identification, they do not affect our discussion on disclosure limitation for microdata sets. Therefore, we assume that none of the attributes in X belong to this category.
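
      The role of each category can be made concrete with a small Python sketch. It removes the identifier attributes and then lists the records whose combination of quasi-identifier values is unique in the data set, since those are the ones most exposed to re-identification through linkage. The attribute classification, the records, and the function names are assumptions made for this example, not part of the text.

```python
from collections import Counter

# Hypothetical classification of the attributes (an assumption for this sketch).
IDENTIFIERS = {"ssn"}
QUASI_IDENTIFIERS = ("zip_code", "birth_date", "sex")

# Made-up records; "salary" plays the role of the confidential attribute.
records = [
    {"ssn": "000-00-0001", "zip_code": "43001", "birth_date": "1980-05-02",
     "sex": "F", "salary": 28000},
    {"ssn": "000-00-0002", "zip_code": "43001", "birth_date": "1980-05-02",
     "sex": "F", "salary": 41000},
    {"ssn": "000-00-0003", "zip_code": "43002", "birth_date": "1975-11-20",
     "sex": "M", "salary": 33000},
]

def remove_identifiers(records, identifiers):
    """Drop identifier attributes from every record."""
    return [{k: v for k, v in r.items() if k not in identifiers}
            for r in records]

def unique_on_quasi_identifiers(records, quasi_identifiers):
    """Return the records whose quasi-identifier combination appears once."""
    def key(r):
        return tuple(r[a] for a in quasi_identifiers)
    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] == 1]

released = remove_identifiers(records, IDENTIFIERS)
at_risk = unique_on_quasi_identifiers(released, QUASI_IDENTIFIERS)
```

      In this toy data set the first two records share the same quasi-identifier values, so only the third record is unique. An intruder holding an external data set with ZIP code, birth date, and sex could link that record to a person and learn the salary, even though the identifier has been removed.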

      A first attempt to come up with a formal definition of privacy was made by Dalenius in [14]. He stated that access to the released data should not allow any attacker to increase his knowledge about confidential information related to a specific individual. In other words, the prior and the posterior beliefs about an individual in the database should be similar. Because the ultimate goal in privacy is to keep the secrecy of sensitive information about specific individuals, this is a natural definition of privacy. However, Dalenius’ definition is too strict to be useful in practice. This was illustrated with two examples [29]. The first one considers an adversary whose prior view is that everyone has two left feet. By accessing a statistical database,