Write about the potential ethical concerns of de-identification techniques.
One of the ethical issues associated with suppression is the further underrepresentation of historically underrepresented groups. For example, the MOOC dataset contains only one row for a user from Antarctica (continent). Similarly, the following countries appear only once or twice in the entire dataset: San Marino, Falkland Islands, and Brunei. Because suppression removes records whose identifying values are shared by too few users, the records from these geographically underrepresented places would be eliminated entirely. This presents an ethical paradox: on one hand, we wish to preserve k-anonymity and protect the privacy of users from underrepresented regions; on the other hand, to reach that goal we would have to erase their existence from the dataset and reinforce the pre-existing underrepresentation. Another example is the year-of-birth column. Young people are overrepresented in the MOOC user population, whereas users over 80 years old make up less than 1% of the dataset. If we were to use suppression to achieve k-anonymity, almost every piece of information associated with users over 80 would have to be discarded. Again, the elderly have been historically underrepresented in education, yet by using suppression we protect their privacy at the expense of their representation.
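
The effect described above can be illustrated with a minimal sketch. It assumes that suppression means dropping every row whose combination of quasi-identifier values occurs fewer than k times. The function name `suppress_rare`, the column names, and the toy rows are hypothetical and are not taken from the actual MOOC dataset.

```python
from collections import Counter

def suppress_rare(rows, quasi_identifiers, k):
    """Drop every row whose quasi-identifier combination occurs fewer than k times."""
    def key(row):
        return tuple(row[col] for col in quasi_identifiers)

    counts = Counter(key(row) for row in rows)
    return [row for row in rows if counts[key(row)] >= k]

# Hypothetical toy data: two common groups and two rare ones.
rows = (
    [{"country": "United States", "age_group": "20-29"}] * 5
    + [{"country": "India", "age_group": "20-29"}] * 4
    + [{"country": "San Marino", "age_group": "20-29"}]
    + [{"country": "United States", "age_group": "80+"}]
)

kept = suppress_rare(rows, ["country", "age_group"], k=3)

# Values that were present before suppression but are absent afterwards.
lost_countries = {r["country"] for r in rows} - {r["country"] for r in kept}
lost_age_groups = {r["age_group"] for r in rows} - {r["age_group"] for r in kept}

print(lost_countries)   # {'San Marino'}
print(lost_age_groups)  # {'80+'}
```

In this toy example the remaining rows satisfy 3-anonymity, but the only user from San Marino and the only user over 80 no longer appear in the data at all, which is the trade-off between privacy and representation discussed above.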

This loss of representation caused by suppression is detrimental in several respects and therefore deserves attention. Because the MOOC platform is deprived of relevant information on these users, the company will lose sight of the potential demand that these underrepresented groups have for online education and will fail to reach them through marketing and advertising. Besides limiting the platform's user base and profitability, this would worsen these groups' already limited access to online education. Moreover, inferences drawn from the suppressed dataset could reinforce the societal bias that certain groups, such as the elderly, are not open to education, when in reality we have simply erased the records of the elderly users who actively engage with online education platforms. As a result, the elderly could feel further distanced from mainstream media and technology. Therefore, using suppression alone to achieve k-anonymity within a dataset, while effective, raises the ethical issue of underrepresentation.