<!--
@license
Copyright 2020 Google. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

<!DOCTYPE html>

<html>
<head>
	<meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">

  <link rel="apple-touch-icon" sizes="180x180" href="https://pair.withgoogle.com/images/favicon/apple-touch-icon.png">
  <link rel="icon" type="image/png" sizes="32x32" href="https://pair.withgoogle.com/images/favicon/favicon-32x32.png">
  <link rel="icon" type="image/png" sizes="16x16" href="https://pair.withgoogle.com/images/favicon/favicon-16x16.png">
  <link rel="mask-icon" href="https://pair.withgoogle.com/images/favicon/safari-pinned-tab.svg" color="#00695c">
  <link rel="shortcut icon" href="https://pair.withgoogle.com/images/favicon.ico">

  <script>
    !(function(){
      var url = window.location.href
      if (url.split('#')[0].split('?')[0].slice(-1) != '/' && !url.includes('.html')) window.location = url + '/'
    })()
  </script>

  <title>How randomized response can help collect sensitive information responsibly</title>
  <meta property="og:title" content="How randomized response can help collect sensitive information responsibly">
  <meta property="og:url" content="https://pair.withgoogle.com/explorables/anonymization/">

  <meta property="og:description" content="The availability of giant datasets and faster computers is making it harder to collect and study private information without inadvertently violating people's privacy.">
  <meta property="og:image" content="https://pair.withgoogle.com/explorables/images/anonymization.png">
  <meta name="twitter:card" content="summary_large_image">
  
	<link rel="stylesheet" type="text/css" href="../style.css">

  <link href='https://fonts.googleapis.com/css?family=Roboto+Slab:400,500,700|Roboto:700,500,300' rel='stylesheet' type='text/css'>  
  <link href="https://fonts.googleapis.com/css?family=Google+Sans:400,500,700" rel="stylesheet">

	<meta name="viewport" content="width=device-width">
</head>
<body>
  <div class='header'>
    <div class='header-left'>
      <a href='https://pair.withgoogle.com/'>
        <img src='../images/pair-logo.svg' style='width: 100px'>
      </a>
      <a href='../'>Explorables</a> 
    </div>
  </div>
  
  <h1 class='headline'>How randomized response can help collect sensitive information responsibly</h1>
  <div class="post-summary">Giant datasets are revealing new patterns in <a href='https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5070532/'>cancer</a>, <a href='https://opportunityinsights.org/national_trends/'>income inequality</a> and other important areas. However, the widespread availability of fast computers that can cross-reference public data is making it harder to collect private information without inadvertently violating people's privacy. Modern randomization techniques can help preserve anonymity. </div>
  <link rel="stylesheet" href="style.css">
<link rel="stylesheet" href="style-graph-scroll.css">

<div id='container' class='container-1'>
<div id='graph'></div>
<div id='sections'>
<div>

<h3>Anonymous Data</h3>

<p>Let's pretend we're analysts at a small college, looking at anonymous survey data about plagiarism.

<p>We've gotten responses from the entire student body, reporting whether they've ever <span class='highlight purple'>plagiarized</span> or <span class='highlight grey'>not</span>. To encourage them to respond honestly, names were not collected.
<p>

<p class='note'>The data here has been randomly generated</p>
</div>


<div>
<p>On the survey, students also report a few details about themselves, like their age...
</div>


<div>
<p>...and what state they're from. 

<p>This additional information is critical to finding potential patterns in the data—why have so many first-years from New Hampshire plagiarized?  
</div>


<div>
<h3>Revealed Information</h3>
<p>But granular information comes with a cost. 

<p>One student has a <span class='highlight box square orange'>unique</span> age/home state combination. By searching another student database for a 19-year-old from Vermont, we can identify one of the plagiarists in the supposedly anonymous survey data.
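<p>A minimal sketch of how such a uniqueness check could work, assuming each survey response is a plain object. The function name, field names and sample rows are hypothetical and are not taken from this page's data or scripts.</p>

```javascript
// Return the rows whose combination of `keys` appears exactly once.
// `rows` and `keys` are illustrative; any identifying columns work.
function findUniqueRows(rows, keys) {
  const comboOf = row => keys.map(key => row[key]).join('|');
  const counts = new Map();
  for (const row of rows) {
    counts.set(comboOf(row), (counts.get(comboOf(row)) || 0) + 1);
  }
  return rows.filter(row => counts.get(comboOf(row)) === 1);
}

// Made-up sample responses (no names collected).
const sampleRows = [
  { age: 19, state: 'VT', season: 'fall', plagiarized: true },
  { age: 19, state: 'NH', season: 'spring', plagiarized: false },
  { age: 19, state: 'NH', season: 'winter', plagiarized: true },
  { age: 20, state: 'NH', season: 'fall', plagiarized: false },
  { age: 20, state: 'NH', season: 'fall', plagiarized: true },
];

// Age and state alone single out the 19-year-old from Vermont.
console.log(findUniqueRows(sampleRows, ['age', 'state']).length); // 1
// Adding a finer detail, such as birth season, makes more students unique.
console.log(findUniqueRows(sampleRows, ['age', 'state', 'season']).length); // 3
```

<p>Anyone holding a second database with the same columns could run the same check and join the unique rows back to names.</p>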
</div>


<div>
<p>Increasing granularity exacerbates the problem. If the students reported slightly more about their ages by including what season they were born in, we'd be able to <span class='highlight box square orange'>identify</span> about a sixth of them. 

<p>This isn't just a hypothetical: a <a href="https://cpg.doc.ic.ac.uk/individual-risk/">birthday / gender / zip code combination</a> uniquely identifies 83% of the people in the United States.

<p>With the spread of large datasets, it is increasingly difficult to release detailed information without inadvertently revealing someone's identity. A week of a person's location data could <a href='https://www.nytimes.com/interactive/2018/12/10/business/location-data-privacy-apps.html'>reveal</a> a home and work address—possibly enough to find a name using public records.
</div>


<div>
<h3>Randomization</h3>
<p>One solution is to randomize responses so each student has plausible deniability. This lets us buy privacy at the cost of some uncertainty in our estimation of plagiarism rates.

<p><b>Step 1:</b> Each student flips a coin and looks at it without showing anyone.
</div>


<div>
<p><b>Step 2:</b> Students who flip heads <span class='highlight purple-box box'>report plagiarism</span>, even if they haven't plagiarized. 

<p>Students who flip tails report the truth, secure in the knowledge that even if their response is linked back to their name, they can claim they flipped heads.
</div>


<div>
<p>With a little bit of math, we can approximate the rate of plagiarism from these randomized responses. Only students who flip tails can report not plagiarizing, and about half of the students flip tails, so doubling the reported non-plagiarism rate gives a good estimate of the actual non-plagiarism rate.

<p class='rand-text'></p>

<div class='button-outer'>
<div class='button-container flip-coins-once'>
Flip coins
</div>
</div>
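<p>The two steps and the doubling estimate can be sketched in a few lines of JavaScript. This is an illustration under stated assumptions, not the code behind the chart: the function names are hypothetical, and <code>rng</code> is any function returning a number in [0, 1). With a fair coin, P(report no) = 1/2 * P(did not plagiarize), so doubling the reported "no" fraction recovers the non-plagiarism rate on average.</p>

```javascript
// Steps 1 and 2: each student privately flips a fair coin.
// Heads: report plagiarism no matter what. Tails: report the truth.
function randomizeResponses(truths, rng = Math.random) {
  return truths.map(plagiarized => {
    const heads = rng() < 0.5;
    return heads ? true : plagiarized;
  });
}

// Only students who flip tails (about half) can report "no",
// so doubling the reported "no" fraction estimates the true "no" rate.
function estimatePlagiarismRate(reports) {
  const reportedNo = reports.filter(report => !report).length / reports.length;
  return 1 - 2 * reportedNo;
}

// Hypothetical student body: 300 of 1,000 students have plagiarized.
const studentTruths = Array.from({ length: 1000 }, (_, i) => i < 300);
const studentReports = randomizeResponses(studentTruths);
console.log(estimatePlagiarismRate(studentReports)); // near 0.3, varies per run
```

<p>The estimate can dip below zero when more than half of the students happen to report "no", so a real analysis might clamp it to the range 0 to 1.</p>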

</div>


<div>  
<h3>How far off can we be?</h3>

<p>If we simulate this coin flipping lots of times, we can see the distribution of errors. 

<p>The estimates are close most of the time, but errors can be quite large.  

<div class='button-outer'>
<div class='button-container flip-coins'>
Flip coins 200 times
</div>
</div>
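<p>A rough sketch of the repeated simulation, assuming a fair coin and the doubling estimate described earlier. The function names and the 200-student population are hypothetical, and <code>rng</code> is any function returning a number in [0, 1).</p>

```javascript
// Repeat the fair-coin survey `runs` times and record estimate minus truth.
function simulateErrors(truths, runs = 200, rng = Math.random) {
  const actualRate = truths.filter(Boolean).length / truths.length;
  const errors = [];
  for (let run = 0; run < runs; run++) {
    let reportedNo = 0;
    for (const plagiarized of truths) {
      const heads = rng() < 0.5;
      if (!heads && !plagiarized) reportedNo++; // only truthful "no" answers
    }
    const estimate = 1 - (2 * reportedNo) / truths.length;
    errors.push(estimate - actualRate);
  }
  return errors;
}

// Summarize the spread of errors across all runs.
function summarizeErrors(errors) {
  const absErrors = errors.map(error => Math.abs(error));
  const meanAbs = absErrors.reduce((sum, e) => sum + e, 0) / absErrors.length;
  return { meanAbs, maxAbs: Math.max(...absErrors) };
}

// Hypothetical small college: 60 of 200 students have plagiarized.
const collegeTruths = Array.from({ length: 200 }, (_, i) => i < 60);
console.log(summarizeErrors(simulateErrors(collegeTruths)));
```

<p>Comparing <code>meanAbs</code> with <code>maxAbs</code> shows the pattern in the chart: the typical error is modest, while the worst runs are noticeably further off.</p>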

</div>


<div>    
<p>Reducing the random noise (by reducing the number of students who flip heads) increases the accuracy of our estimate, but risks leaking information about students.  

<p>If the coin is heavily weighted towards tails, identified students can't credibly claim they reported plagiarizing because they flipped heads.  

<div class="slider-outer">
<div class="slide-container-heads-prob"></div>
<div class='pointer'><div></div></div>
</div>
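<p>The tradeoff can be made concrete with a small sketch. It assumes the coin lands heads with probability <code>headsProb</code>, so P(report no) = (1 - headsProb) * (1 - rate), and uses Bayes' rule to ask how strongly a "yes" report implicates a student. The function names and the 30% rate are hypothetical.</p>

```javascript
// `headsProb` is the chance the coin lands heads (0.5 for a fair coin).
// P(report "no") = (1 - headsProb) * (1 - rate), solved here for rate.
// Requires headsProb < 1, otherwise nobody ever reports the truth.
function estimateRateWeighted(reportedNoFraction, headsProb) {
  return 1 - reportedNoFraction / (1 - headsProb);
}

// Bayes' rule: how likely is it that a student who reported plagiarism
// actually plagiarized? Values near 1 mean little deniability is left.
function probPlagiarizedGivenYes(rate, headsProb) {
  const probReportYes = headsProb + (1 - headsProb) * rate;
  return rate / probReportYes;
}

// Assumed true plagiarism rate of 30%, for illustration only.
for (const headsProb of [0.5, 0.25, 0.05]) {
  console.log(headsProb, probPlagiarizedGivenYes(0.3, headsProb).toFixed(2));
}
```

<p>Under these assumptions, a "yes" report points to an actual plagiarist about 46% of the time with a fair coin, but about 90% of the time when heads comes up only 5% of the time.</p>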

</div>


<div>    
<p>One surprising way out of this accuracy-privacy tradeoff: carefully collect information from even more people. 

<p>If we got students from other schools to fill out this survey, we could accurately measure plagiarism while protecting everyone's privacy. With enough students, we could even start comparing plagiarism across different age groups again—safely this time.     
 
<div class="slider-outer">
<div class="slide-container-population"></div>
&nbsp;
<div class="slide-container-heads-prob"></div>
</div>
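<p>Why more respondents help can be sketched with the usual binomial standard error. This assumes students answer independently and that we have a rough guess at the true rate; the function names and numbers are hypothetical.</p>

```javascript
// Approximate standard error of the estimated plagiarism rate.
// Assumes independent students; `rate` is a guess at the true rate.
function standardError(rate, headsProb, students) {
  const probNo = (1 - headsProb) * (1 - rate); // chance of a reported "no"
  return Math.sqrt((probNo * (1 - probNo)) / students) / (1 - headsProb);
}

// Roughly how many students are needed to reach `targetError`.
function studentsNeeded(rate, headsProb, targetError) {
  const probNo = (1 - headsProb) * (1 - rate);
  const scaledTarget = targetError * (1 - headsProb);
  return Math.ceil((probNo * (1 - probNo)) / (scaledTarget * scaledTarget));
}

// Fair coin, assumed 30% true rate: error shrinks like 1 / sqrt(students).
for (const students of [100, 1000, 10000, 100000]) {
  console.log(students, standardError(0.3, 0.5, students).toFixed(4));
}
```

<p>Because the error falls with the square root of the number of students, quadrupling the respondents halves the error without making the coin any less noisy.</p>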
</div>



</div>
</div>

<h3>Conclusion</h3>

<p>Aggregate statistics about private information are valuable, but can be risky to collect. We want researchers to be able to study things like the connection between demographics and health outcomes without revealing our entire medical history to our neighbors. The coin flipping technique in this article, called <a href='https://en.wikipedia.org/wiki/Randomized_response'>randomized response</a>, makes it possible to safely study private information.  

<p>You might wonder if coin flipping is the only way to do this. It's not: <a href='https://desfontain.es/privacy/differential-privacy-in-more-detail.html'>differential privacy</a> can add targeted bits of random noise to a dataset and guarantee privacy. It is more flexible than randomized response, and the 2020 Census will use it to <a href='https://www.youtube.com/watch?v=pT19VwBAqKA'>protect respondents' privacy</a>. In addition to randomizing responses, differential privacy also limits the impact any one response can have on the released data.


<h3>Credits</h3>

<p>Adam Pearce and Ellen Jiang // September 2020

<p>Thanks to Carey Radebaugh, Fernanda Viégas, Emily Reif, Hal Abelson, Jess Holbrook, Kristen Olson, Mahima Pushkarna, Martin Wattenberg, Michael Terry, Miguel Guevara, Rebecca Salois, Yannick Assogba, Zan Armstrong and our other colleagues at Google for their help with this piece.

</div>


<h3>More Explorables</h3>

<p id='recirc'></p>

<div id='end'></div>

<script src='../third_party/seedrandom.min.js'></script>
<script src='../third_party/d3_.js'></script>
<script src='../third_party/swoopy-drag.js'></script>
<script src='../third_party/misc.js'></script>
<script src='annotations.js'></script>


<script src='make-axii.js'></script>
<script src='make-students.js'></script>
<script src='make-sel.js'></script>
<script src='make-estimates.js'></script>
<script src='make-sliders.js'></script>
<script src='make-slides.js'></script>
<script src='make-gs.js'></script>
<script src='init.js'></script>

<script src='../third_party/recirc.js'></script>

</body>

<script async src="https://www.googletagmanager.com/gtag/js?id=UA-138505774-1"></script>
<script>
  if (window.location.origin === 'https://pair.withgoogle.com'){
    window.dataLayer = window.dataLayer || [];
    function gtag(){dataLayer.push(arguments);}
    gtag('js', new Date());
    gtag('config', 'UA-138505774-1');
  }
</script>

<script>
  // Tweaks for displaying in an iframe
  if (window !== window.parent){
    
    // Open links in a new tab
    Array.from(document.querySelectorAll('a'))
      .forEach(e => {
        // skip anchor links
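        // (read the raw attribute; e.href is the resolved absolute URL and never starts with '#')
        if ((e.getAttribute('href') || '')[0] == '#') return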

        e.setAttribute('target', '_blank')
        e.setAttribute('rel', 'noopener noreferrer')
      })

    // Remove recirc h3
    Array.from(document.querySelectorAll('h3'))
      .forEach(e => {
        if (e.textContent != 'More Explorables') return

        e.parentNode.removeChild(e)
      })

    // Remove recirc container
    var recircEl = document.querySelector('#recirc')
    recircEl.parentNode.removeChild(recircEl)
  }
</script>

</html>