Using Shuffle for Data Anonymization
Brad Schoening, Big Data Blog, published 2020-08-31 (updated 2021-01-26)
<p>Data anonymization is a popular topic today for both enterprise and open public data uses. Two less commonly used techniques are data <b>shuffling</b> and <b>swapping</b>, which work well when the distribution of the data must be retained, for example the age distribution of employees.</p><div style="font-stretch: normal; line-height: normal;"><span face="">Shuffling reorders the values in one or more columns so that each column's statistical distribution stays the same but the shuffled values can no longer be used to re-identify the entities in the rows. Since shuffling doesn’t remove or alter uniquely identifying individual values, it may need to be combined with other techniques to properly anonymize a data set. </span></div><div style="font-stretch: normal; line-height: normal;"><span face=""><br /></span></div><h3>Random Shuffle</h3><div><br /></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">Basic data shuffling randomly permutes a list of elements. Consider, as an example, an open public data set of <a href="https://data.world/brentbrewington/atlanta-city-employee-salaries" target="_blank">City of Atlanta employee salary data in 2015</a>. This data is in the public record, and contains the employees' full names, ages and salaries. </div><div style="font-family: helvetica; font-stretch: normal; line-height: normal; min-height: 13px;"><br /></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">We'll start de-identification by first replacing each employee’s name with a sequential ID. 
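<br /><br />As a rough sketch of this step, the name in each row could be replaced with a zero-padded sequential ID along the following lines. The sample names, the to_sequential_ids helper and the four-digit ID width are illustrative assumptions, not part of the Atlanta data set:

```python
# Hypothetical sample rows: (name, age, annual salary). The names are invented.
records = [
    ("Alice Example", 38, 46000),
    ("Bob Example", 52, 26700),
    ("Carol Example", 44, 46575),
]


def to_sequential_ids(rows, width=4):
    """Return new rows with the name replaced by a zero-padded sequential ID."""
    return [
        (str(i).zfill(width), age, salary)
        for i, (_name, age, salary) in enumerate(rows, start=1)
    ]


deidentified = to_sequential_ids(records)
print(deidentified)
```

If the source rows are sorted by name, the IDs would still reveal that ordering, so it may be worth randomizing the row order before assigning them.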
</div><div style="font-family: helvetica; font-stretch: normal; line-height: normal; min-height: 13px;"><br /></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">Original public data with name replaced:</div><div style="font-family: helvetica; font-size: 11px; font-stretch: normal; line-height: normal; min-height: 13px;"><br /></div><table cellpadding="0" cellspacing="0" style="border-collapse: collapse; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.5px; border: 0.5px solid rgb(0, 0, 0); width: 501.999px;"><tbody><tr><td style="background-color: #b0b3b2; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>Employee</b></div></td><td style="background-color: #b0b3b2; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>Age</b></div></td><td style="background-color: #b0b3b2; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>Annual Salary</b></div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>0001</b></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">38</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 
0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">46,000</div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>0002</b></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">52</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">26,700</div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>0003</b></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">44</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">46,575</div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>0004</b></div></td><td style="border-color: rgb(0, 0, 0); 
border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">42</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">42,867</div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>0005</b></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">32</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">28,035</div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>0006</b></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">44</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">67,800</div></td></tr><tr><td style="background-color: #d4d4d4; 
border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>0007</b></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">33</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">46,378</div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>0008</b></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">28</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">39,328</div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>0009</b></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">58</div></td><td 
style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">125,000</div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>0010</b></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">45</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">60,466</div></td></tr></tbody></table><div style="font-family: helvetica; font-stretch: normal; line-height: normal; min-height: 13px;"><br /></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">To further de-identify the data set while preserving data utility, we might shuffle the age column, since age is a quasi-identifier. <b>Quasi-identifiers</b><span style="font-family: helvetica;"> are attributes which do not uniquely identify an individual on their own, but are sufficiently correlated with that individual that they can sometimes, or when combined with other data, be used to re-identify someone. </span></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><span style="font-family: helvetica;"><br /></span></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><span style="font-family: helvetica;">Shuffling the age column removes the row-level link, and with it the correlation, between age and salary. 
The statistical distributions of age and salary are unaffected with this approach.</span></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><br /></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">With <b>age</b> shuffled:</div><div style="font-family: helvetica; font-size: 11px; font-stretch: normal; line-height: normal; min-height: 13px;"><br /></div><table cellpadding="0" cellspacing="0" style="border-collapse: collapse; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.5px; border: 0.5px solid rgb(0, 0, 0); width: 501.999px;"><tbody><tr><td style="background-color: #b0b3b2; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>Employee</b></div></td><td style="background-color: #b0b3b2; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>Age</b></div></td><td style="background-color: #b0b3b2; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>Annual Salary</b></div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>0001</b></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">44</div></td><td 
style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">46,000</div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>0002</b></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">28</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">26,700</div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>0003</b></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">44</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">46,575</div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: 
normal;"><b>0004</b></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">38</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">42,867</div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>0005</b></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">52</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">28,035</div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>0006</b></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">45</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: 
normal;">67,800</div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>0007</b></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">42</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">46,378</div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>0008</b></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">32</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">39,328</div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>0009</b></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; 
font-stretch: normal; line-height: normal;">58</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">125,000</div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>0010</b></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">33</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">60,466</div></td></tr></tbody></table><div style="font-family: helvetica; font-stretch: normal; line-height: normal; min-height: 13px;"><br /></div><h3 style="font-stretch: normal; line-height: normal; text-align: left;"><span style="font-family: helvetica;">Shuffling Data</span></h3><div><span style="font-family: helvetica;"><br /></span></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">All permutations are equally likely when shuffling, including the original list. 
In this instance, run using Python's random.shuffle, the ages of employees 0003 and 0009 kept their original values after shuffling.</div><div style="font-family: helvetica; font-stretch: normal; line-height: normal; min-height: 13px;"><br /></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">Data shuffling was described by the EU's Article 29 Data Protection Working Party (the predecessor of the European Data Protection Board) in its 2014 opinion on anonymisation techniques, which wrote:</div><div style="font-family: helvetica; font-stretch: normal; line-height: normal; min-height: 13px;"><br /></div><div style="color: #666666; font-family: helvetica; font-stretch: normal; line-height: normal; margin-left: 36px;"><a href="https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf" target="_blank">Shuffling the values of attributes in a table so that some of them are artificially linked to different data subjects, is useful when it is important to retain the exact distribution of each attribute within the dataset</a></div><div style="color: #666666; font-family: helvetica; font-stretch: normal; line-height: normal; margin-left: 36px; min-height: 13px;"><a href="https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf" target="_blank"><br /></a></div><div style="color: #666666; font-family: helvetica; font-stretch: normal; line-height: normal; margin-left: 36px;"><a href="https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf" target="_blank">Similarly to noise addition, permutation may not provide anonymisation by itself and should always be combined with the removal of obvious attributes/quasi-identifiers.</a></div><div style="color: #707070; font-family: helvetica; font-stretch: normal; line-height: normal; margin-left: 36px; min-height: 12px;"><br /></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal; min-height: 13px;"><br /></div><div style="font-family: helvetica; font-stretch: normal; line-height: 
normal;">Many ETL tools provide shuffling as a de-identification technique, including Talend, Informatica, Ab Initio and even Oracle. </div><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><br /></div><h3>History of the Fisher-Yates Shuffle</h3><div style="font-family: helvetica; font-size: 11px; font-stretch: normal; line-height: normal; min-height: 13px;"><br /></div><div style="font-stretch: normal; line-height: normal;"><span face="">The first shuffle algorithm was described by<span style="color: #202122; font-stretch: normal; line-height: normal;"> <a href="https://en.wikipedia.org/wiki/Ronald_Fisher"><span style="color: #0b0080;">Ronald Fisher</span></a> and <a href="https://en.wikipedia.org/wiki/Frank_Yates"><span style="color: #0b0080;">Frank Yates</span></a></span> in their 1938 book <span style="color: #202122; font-stretch: normal; line-height: normal;"><i><a href="https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#cite_note-fisheryates-1" target="_blank">Statistical tables for biological, agricultural and medical research</a></i>.</span> A software version of the algorithm, optimized by Richard <span style="color: #202122; font-stretch: normal; line-height: normal;">Durstenfeld to</span> run in linear O(n) time, was popularized in Donald Knuth’s <a href="https://en.wikipedia.org/wiki/The_Art_of_Computer_Programming#Volume_2_%E2%80%93_Seminumerical_Algorithms" target="_blank">The Art of Computer Programming Volume II</a>. 
</span></div><div style="font-stretch: normal; line-height: normal; min-height: 13px;"><span face=""><br /></span></div><div style="font-stretch: normal; line-height: normal;"><span face="">Below is a Python example of this algorithm:</span></div><div style="font-family: helvetica; font-size: 11px; font-stretch: normal; line-height: normal; min-height: 13px;"><br /></div><div style="font-stretch: normal; line-height: normal; margin-left: 36px;"><span style="font-family: 'courier new', 'courier', monospace;">import random</span></div><div style="font-stretch: normal; line-height: normal; margin-left: 36px;"><span style="font-family: 'courier new', 'courier', monospace;">ages = [38, 52, 44, 42, 32, 44, 33, 28, 58, 45]</span></div><div style="font-stretch: normal; line-height: normal; margin-left: 36px;"><span style="font-family: 'courier new', 'courier', monospace;">print("The original list is : " + str(ages))</span></div><div style="font-stretch: normal; line-height: normal; margin-left: 36px; min-height: 10px;"><br /></div><div style="font-stretch: normal; line-height: normal; margin-left: 36px;"><span style="font-family: 'courier new', 'courier', monospace;"># Fisher–Yates shuffle, walking from the last index down to 1</span></div><div style="font-stretch: normal; line-height: normal; margin-left: 36px;"><span style="font-family: 'courier new', 'courier', monospace;">for i in range(len(ages) - 1, 0, -1):</span></div><div style="font-stretch: normal; line-height: normal; margin-left: 36px;"><span style="font-family: 'courier new', 'courier', monospace;">&nbsp;&nbsp;&nbsp;&nbsp;# Pick a random index from 0 to i; randint includes both endpoints</span></div><div style="font-stretch: normal; line-height: normal; margin-left: 36px;"><span style="font-family: 'courier new', 'courier', monospace;">&nbsp;&nbsp;&nbsp;&nbsp;j = random.randint(0, i)</span></div><div style="font-stretch: normal; line-height: normal; margin-left: 36px; min-height: 10px;"><span style="font-family: 'courier new', 'courier', monospace;"> </span></div><div style="font-stretch: normal; line-height: normal; margin-left: 36px;"><span style="font-family: 'courier new', 'courier', monospace;">&nbsp;&nbsp;&nbsp;&nbsp;# Swap ages[i] with the element at the random index</span></div><div style="font-stretch: normal; line-height: normal; margin-left: 36px;"><span style="font-family: 'courier new', 'courier', monospace;">&nbsp;&nbsp;&nbsp;&nbsp;ages[i], ages[j] = ages[j], ages[i]</span></div><div style="font-stretch: normal; line-height: normal; margin-left: 36px; min-height: 10px;"><span style="font-family: 'courier new', 'courier', monospace;"> </span></div><div style="font-stretch: normal; line-height: normal; margin-left: 36px;"><span style="font-family: 'courier new', 'courier', monospace;">print("The shuffled list is : " + str(ages))</span></div><div style="font-stretch: normal; line-height: normal; min-height: 13px;"><br /></div><div style="font-stretch: normal; line-height: normal;"><span face="">Python and Java, however, provide built-in shuffle methods. Below is Python code that uses random.shuffle to shuffle the employee ages in the example above: </span></div><div style="font-stretch: normal; line-height: normal;"><span face=""><br /></span></div><div style="font-stretch: normal; line-height: normal; margin-left: 36px;"><span style="font-family: Courier New, Courier, monospace;">ages = [38, 52, 44, 42, 32, 44, 33, 28, 58, 45]</span></div><div style="font-stretch: normal; line-height: normal; margin-left: 36px;"><span style="font-family: Courier New, Courier, monospace;">random.shuffle(ages)&nbsp; # shuffles the list in place</span></div><div style="font-stretch: normal; line-height: normal; margin-left: 36px;"><span style="font-family: Courier New, Courier, monospace;">print(ages)&nbsp; # one possible result: [44, 28, 44, 38, 52, 45, 42, 32, 58, 33]</span></div><div style="font-family: helvetica; font-size: 16px; font-stretch: normal; line-height: normal; margin-bottom: 6px; margin-top: 14px;"><h3>Well Known Uses</h3></div><div style="color: #172b4d; font-family: helvetica; font-stretch: normal; line-height: normal; margin-top: 14px;">The U.S. 
Census Bureau began using a variant of data swapping for the 1990 decennial census. The method was tested with extensive simulations, the results were considered a success, and essentially the same methodology was used for the actual data releases. In the Census Bureau’s version, records were swapped between census blocks for individuals or households that had been matched on a predetermined set of k variables. A similar approach was used for the 2000 U.S. census. The Office for National Statistics in the UK applied data swapping as part of its disclosure control procedures for the 2001 UK Census.</div><div style="color: #172b4d; font-family: helvetica; font-stretch: normal; line-height: normal; margin-top: 14px;">In 2002, researchers at the U.S. National Institute of Statistical Sciences (NISS) developed <b>WebSwap</b>, a public web-based tool to perform data swapping in databases of categorical variables.</div><div style="color: #172b4d; font-family: helvetica; font-stretch: normal; line-height: normal; margin-top: 14px;"><br /></div><h3>Other Shuffle and Swapping Techniques</h3><div><br /></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>Group shuffle</b> is used when two or more columns need to be shuffled together. Some data sets have columns with highly correlated data, so shuffling a single column in isolation would diminish the analytical value of the data. </div><div style="font-family: helvetica; font-stretch: normal; line-height: normal; min-height: 13px;"><br /></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">Consider our earlier employee example with the addition of a years-of-service column. 
Since a person's age is related to their potential length of service, it makes sense to shuffle these two columns together, so that each age stays paired with its years of service.</div><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><br /></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>Direct swapping</b> is a non-random approach which handpicks records by finding pairs to swap. For a record set with age and salary, you can swap the salaries of individuals of the same age. For example, take the following rows:</div><div style="font-family: helvetica; font-stretch: normal; line-height: normal; min-height: 13px;"><br /></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal; min-height: 13px;"><table cellpadding="0" cellspacing="0" style="border-collapse: collapse; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.5px; border: 0.5px solid rgb(0, 0, 0); font-family: 'Times New Roman'; width: 501.999px;"><tbody><tr><td style="background-color: #b0b3b2; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>Dept. 
Id</b></div></td><td style="background-color: #b0b3b2; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>Age</b></div></td><td style="background-color: #b0b3b2; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>Annual Salary</b></div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><span style="white-space: pre;">A03031</span></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">44</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">95,000</div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><span style="white-space: pre;">A68002</span></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">44</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 
0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">60,000</div></td></tr></tbody></table><br /></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal; min-height: 13px;">swapping salary where age = 44:</div><div style="font-family: helvetica; font-stretch: normal; line-height: normal; min-height: 13px;"><br /></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal; min-height: 13px;"><table cellpadding="0" cellspacing="0" style="border-collapse: collapse; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.5px; border: 0.5px solid rgb(0, 0, 0); font-family: "Times New Roman"; width: 501.999px;"><tbody><tr><td style="background-color: #b0b3b2; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>Dept. 
Id</b></div></td><td style="background-color: #b0b3b2; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>Age</b></div></td><td style="background-color: #b0b3b2; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>Annual Salary</b></div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><span style="white-space: pre;">A03031</span></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">44</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">60,000</div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><span style="white-space: pre;">A68002</span></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">44</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 
0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">95,000</div></td></tr></tbody></table></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal; min-height: 13px;"><br /></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">Direct swapping works with categorical data (e.g., eye color: green, blue, brown or black) or with variables that take a discrete number of values. In our example, age takes integer values in the range 0..110.</div><div style="font-family: helvetica; font-stretch: normal; line-height: normal; min-height: 13px;"> </div><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>Rank swap</b> is similar, but it swaps pairs of values that are close rather than exactly equal. It works well on continuous variables. For example, records with systolic blood pressure within 5 mmHg of each other could be chosen as pairs to swap:</div><div style="font-family: helvetica; font-stretch: normal; line-height: normal; min-height: 13px;"><br /></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal; min-height: 13px;">original:</div><div style="font-family: helvetica; font-stretch: normal; line-height: normal; min-height: 13px;"><br /></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal; min-height: 13px;"><table cellpadding="0" cellspacing="0" style="border-collapse: collapse; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.5px; border: 0.5px solid rgb(0, 0, 0); font-family: "Times New Roman"; width: 501.999px;"><tbody><tr><td style="background-color: #b0b3b2; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-stretch: normal; line-height: normal;"><b style="font-family: helvetica;">Systolic Blood </b><span style="font-family: helvetica;"><b>Pressure</b></span><b style="font-family: helvetica;"> 
(Hg)</b></div></td><td style="background-color: #b0b3b2; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>Height</b></div></td><td style="background-color: #b0b3b2; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>Weight</b></div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><span style="white-space: pre;">121</span></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">5' 8" </div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">155</div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><span style="white-space: pre;">122</span></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">5' 11"</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 
0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">180</div></td></tr></tbody></table><br /></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal; min-height: 13px;">swapped:</div><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><br /></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><table cellpadding="0" cellspacing="0" style="border-collapse: collapse; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.5px; border: 0.5px solid rgb(0, 0, 0); font-family: "Times New Roman"; width: 501.999px;"><tbody><tr><td style="background-color: #b0b3b2; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-stretch: normal; line-height: normal;"><b style="font-family: helvetica;"><b>Systolic </b>Blood </b><span style="font-family: helvetica;"><b>Pressure</b></span><b style="font-family: helvetica;"> (Hg)</b></div></td><td style="background-color: #b0b3b2; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>Height</b></div></td><td style="background-color: #b0b3b2; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><b>Weight</b></div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><span style="white-space: pre;">121</span></div></td><td style="border-color: rgb(0, 
0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">5' 11" </div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">180</div></td></tr><tr><td style="background-color: #d4d4d4; border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><span style="white-space: pre;">122</span></div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">5' 8"</div></td><td style="border-color: rgb(0, 0, 0); border-style: solid; border-width: 0.8px; border: 0.8px solid rgb(0, 0, 0); padding: 3px;" valign="top"><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">155</div></td></tr></tbody></table></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"><br /></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal;">In practice, both direct and rank swapping typically swap a subset of the records, not the entire set.</div><div style="font-family: helvetica; font-stretch: normal; line-height: normal;"> </div><h3>Summary</h3><div><br /></div><div style="font-family: helvetica; font-stretch: normal; line-height: normal; min-height: 13px;">Shuffle can be applied as one of several valuable techniques for data de-identification. 
As the EU Data Protection Board recommends, it may not be appropriate for stand-alone use, and it always requires careful analysis of the given data set to confirm that obvious identifiers and quasi-identifiers are protected. Consider applying a <a href="https://en.wikipedia.org/wiki/K-anonymity">k-anonymity</a> test to the output data.</div>Brad Schoeninghttp://www.blogger.com/profile/15502390798829156899noreply@blogger.com0tag:blogger.com,1999:blog-4225477928566630692.post-62361530573996112242019-09-17T20:36:00.003-07:002020-08-31T18:06:30.152-07:00How Tokenized Data Protected Capital One When Encryption Failed<br />
<div dir="ltr" style="background-color: white; line-height: 1.92; margin-bottom: 0pt; margin-top: 0pt; padding: 0pt 0pt 12pt;">
<meta name="image" property="og:image" content="https://lh4.googleusercontent.com/bkEqdC9OSTp0Wwn2xxqKkg1aGN1ssgxfgBN_VpgYca9Et3Ls-dJDEG4CTK4phHqzc0XE4XFGymCfjB4ON3387L1KW5XPpWnwnfRQOOrgyZCms1mMGkx6Fip2bXmcDYH_ppv6ewe2">
<span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;">The Capital One data breach disclosed this past July impacted over <a href="https://www.capitalone.com/facts2019/" target="_blank">100M individuals</a>, but it also provided some important lessons and a few silver linings. With so many affected, the FBI’s quick arrest of the perpetrator was met with relief and with questions about what had happened to the data. Although the data had been downloaded and posted privately on GitHub, it was never sold or disseminated. In fact, much of the highly sensitive data, in particular Social Security and account numbers, had been "tokenized" and therefore remained secure.</span></div>
<div dir="ltr" style="background-color: white; line-height: 1.92; margin-bottom: 0pt; margin-top: 0pt; padding: 0pt 0pt 12pt;">
<span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;">Roughly 99% of the Social Security Numbers were protected. Why, then, were some 140,000 SSNs exposed? As it turns out, those that were exposed had not been tokenized. Under Capital One’s policies, all American SSNs should have been, and were, tokenized. However, those policies did not require tokenizing the employee ID field, which in some cases contained an SSN. In addition, the equally sensitive Canadian Social Insurance Numbers, which have a slightly different format from the US numbers, were not tokenized, and over 1M were exposed.</span></div>
<div dir="ltr" style="background-color: white; line-height: 1.92; margin-bottom: 0pt; margin-top: 0pt; padding: 0pt 0pt 12pt;">
<span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;">Data privacy professionals should note the importance of tokenizing data over encryption. The attacker in this instance gained unauthorized access to the encryption keys, and the encrypted data was then successfully decrypted and breached. Tokenized data, however, remained fully protected.</span><br />
<span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;"><br /></span>
<span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;">As illustrated by this incident, the risk of theft or misuse of an encryption key remains a major hurdle to fully securing data with encryption. </span><span style="color: #444444; font-family: "arial"; font-size: 16px; white-space: pre-wrap;">The bank exposed the most sensitive PII (Personally Identifiable Information) tied to some of its US and Canadian customers when encryption failed to protect the data.</span></div>
<h1 dir="ltr" style="background-color: white; line-height: 1.92; margin-bottom: 0pt; margin-top: 0pt; padding: 8pt 0pt 12pt;">
<span style="background-color: transparent; font-family: "arial"; font-size: 20pt; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">Tokenization</span></h1>
<div dir="ltr" style="background-color: white; line-height: 1.92; margin-bottom: 0pt; margin-top: 0pt; padding: 0pt 0pt 12pt;">
<span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;">Tokenization methods vary, but the process is simple: a data field holding a sensitive personal identifier is replaced with a different, synthetic value of the same format. Tokenization is usually performed consistently: within a data set, each unique original value always maps to the same unique synthetic value. Thus, a relational database using SSN as a key can still join on the synthetic SSN.</span></div>
<div dir="ltr" style="background-color: white; line-height: 1.92; margin-bottom: 0pt; margin-top: 0pt; padding: 0pt 0pt 12pt;">
<span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;">Payment vendor <a href="https://squareup.com/us/en/townsquare/what-does-tokenization-actually-mean" target="_blank">Square reports that</a> “</span><span style="color: #3e4348; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;">payment experts are seeing more and more organizations moving from encryption to tokenization as a more cost-effective (and secure) way to protect and safeguard sensitive information.</span><span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;">”</span></div>
<div dir="ltr" style="background-color: white; line-height: 1.92; margin-bottom: 0pt; margin-top: 0pt; padding: 0pt 0pt 12pt;">
<span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;">The most frequently stolen Social Security Number of all time is 078-05-1120. <a href="https://www.ssa.gov/history/ssn/misused.html" target="_blank">The story behind the stolen SSN</a> dates back to 1938, when Douglas Patterson, VP of E.H. Ferree Company, used his secretary’s real SSN on thousands of example Social Security Cards. The purpose of the fake card with a real number was to demonstrate how well it fit into a new line of wallets. Although the sample card was labeled “specimen” and printed in red instead of blue, the SSN was eventually used by over 40,000 people, some of whom thought that the card in their wallet was their own.</span></div>
<div dir="ltr" style="background-color: white; line-height: 1.92; margin-bottom: 0pt; margin-top: 0pt; padding: 0pt 0pt 12pt;">
<span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;">SSN 078-05-1120 is now a retired number, so let’s use it as an example. The tokenization process would transform it into a synthetic number of exactly the same format that no longer represents the original value: </span><br />
<span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;"><span id="docs-internal-guid-b6fca152-7fff-f0b5-f606-ff8354f4f4f7" style="color: black; white-space: normal;"><span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;"><span style="border: none; display: inline-block; height: 261px; overflow: hidden; width: 624px;"><img height="261" src="https://lh4.googleusercontent.com/bkEqdC9OSTp0Wwn2xxqKkg1aGN1ssgxfgBN_VpgYca9Et3Ls-dJDEG4CTK4phHqzc0XE4XFGymCfjB4ON3387L1KW5XPpWnwnfRQOOrgyZCms1mMGkx6Fip2bXmcDYH_ppv6ewe2" style="margin-left: 0px; margin-top: 0px;" width="624" /></span></span></span></span><br />
<span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;"></span></div>
<div dir="ltr" style="background-color: white; line-height: 1.92; margin-bottom: 0pt; margin-top: 0pt; padding: 0pt 0pt 12pt;">
<span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;">Tokenization can have another benefit. When the format is preserved, and the values are constrained (e.g., digits), the output can be indistinguishable from a real value. In the case of Capital One, it’s unlikely the hacker knew that the SSN values were synthetic, and not real values. <a href="https://regmedia.co.uk/2019/07/29/capital_one_paige_thompson.pdf" target="_blank">According to the FBI</a>, she wrote:</span></div>
<div dir="ltr" style="background-color: white; line-height: 1.92; margin-bottom: 0pt; margin-top: 0pt; padding: 0pt 0pt 12pt;">
<span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;"><span style="border: none; display: inline-block; height: 237px; overflow: hidden; width: 624px;"><img height="237" src="https://lh3.googleusercontent.com/0Wz93NW04iM27trE82FldLsZfPJssmKyzBHOdjZSn0__HlEuTedPsVzVp-sZi11s-D0k7sVmeJz8IF2S6iVEDDyH9c7XojWUvpqb6YkOL-o6ua0jNdgVH9IhmIyhaQumWQcIVUXU" style="margin-left: 0px; margin-top: 0px;" width="624" /></span></span></div>
<div dir="ltr" style="background-color: white; line-height: 1.92; margin-bottom: 0pt; margin-top: 0pt; padding: 0pt 0pt 12pt;">
<span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;">Unknown to the hacker, the SSNs were tokenized, so they were synthetic values unrelated to real Social Security Numbers. Had Capital One also tokenized Canadian Social Insurance Numbers and employee ID fields, it’s likely none of these numbers would have been breached.</span></div>
<h1 dir="ltr" style="background-color: white; line-height: 1.92; margin-bottom: 0pt; margin-top: 0pt; padding: 8pt 0pt 12pt;">
<span style="background-color: transparent; font-family: "arial"; font-size: 20pt; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">Random Tokenization</span></h1>
<div dir="ltr" style="background-color: white; line-height: 1.92; margin-bottom: 0pt; margin-top: 0pt; padding: 0pt 0pt 12pt;">
<span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;"><a href="https://www.capitalone.com/facts2019/" target="_blank">Capital One appears to have used a tokenization technique</a> known as Format-Preserving Encryption (FPE), which itself relies on encryption keys. Had the keys used by FPE also been stolen, the tokenized data could have been decrypted.</span></div>
<div dir="ltr" style="background-color: white; line-height: 1.92; margin-bottom: 0pt; margin-top: 0pt; padding: 0pt 0pt 12pt;">
<span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;">A better approach is random tokenization, whereby, as the term implies, the tokens are selected randomly. Random tokenization uses a separate key-value database or token vault to consistently tokenize the input data. Unlike encryption, random tokenization does not rely on a single set of keys and the tokens are not vulnerable to mathematical or brute force attacks.</span></div>
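<div dir="ltr" style="background-color: white; line-height: 1.92; margin-bottom: 0pt; margin-top: 0pt; padding: 0pt 0pt 12pt;">
<span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;">The idea of random tokenization with a token vault can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name TokenVault and its methods are hypothetical, the vault is a plain in-memory map rather than a hardened, access-controlled store, and input values are assumed to contain at least a few digits.</span></div>

```java
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory token vault, for illustration only. */
class TokenVault {
    // Original value -> token. In a real system this mapping is the secret to protect.
    private final Map<String, String> tokenByValue = new HashMap<>();
    // Tokens already handed out, so that two values never share a token.
    private final Set<String> issuedTokens = new HashSet<>();
    private final SecureRandom random = new SecureRandom();

    /** Returns the token for a value, drawing a new random token on first use. */
    public synchronized String tokenize(String value) {
        String existing = tokenByValue.get(value);
        if (existing != null) {
            return existing; // consistent: same input, same token
        }
        String token;
        do {
            token = randomSameFormat(value);
            // Retry if the token equals the input or was already issued.
        } while (token.equals(value) || !issuedTokens.add(token));
        tokenByValue.put(value, token);
        return token;
    }

    /** Replaces each ASCII digit with a random digit and keeps separators such as '-'. */
    private String randomSameFormat(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (char c : value.toCharArray()) {
            boolean isDigit = c >= '0' && c <= '9';
            sb.append(isDigit ? (char) ('0' + random.nextInt(10)) : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        TokenVault vault = new TokenVault();
        String first = vault.tokenize("078-05-1120");
        String second = vault.tokenize("078-05-1120");
        // The token is random, but it has the same format and is stable across calls.
        System.out.println(first + " consistent=" + first.equals(second));
    }
}
```

<div dir="ltr" style="background-color: white; line-height: 1.92; margin-bottom: 0pt; margin-top: 0pt; padding: 0pt 0pt 12pt;">
<span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;">Because each token is drawn at random, nothing in the token itself can be inverted mathematically; the only path back to the original value is the vault mapping, which is why the vault must be stored and guarded separately from the tokenized data.</span></div>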
<div dir="ltr" style="background-color: white; line-height: 1.92; margin-bottom: 0pt; margin-top: 0pt; padding: 0pt 0pt 12pt;">
<span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;">As Jonathan Deveaux wrote in <a href="https://www.paymentsjournal.com/why-tokenization-is-key-when-it-comes-to-security-and-compliance-for-the-modern-psp/" target="_blank">Payments Journal</a>: tokenization "can address the failings of encryption” and “if a hacker was successful and gained access to the tokenized data, it would still be protected as the information would have no exploitable value.”</span></div>
<h1 dir="ltr" style="background-color: white; line-height: 1.92; margin-bottom: 0pt; margin-top: 0pt; padding: 8pt 0pt 12pt;">
<span style="background-color: transparent; font-family: "arial"; font-size: 20pt; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">Summary</span></h1>
<div dir="ltr" style="background-color: white; line-height: 1.92; margin-bottom: 0pt; margin-top: 0pt; padding: 0pt 0pt 12pt;">
<span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;">Tokenization is recognized as a highly effective data protection technique, and the Capital One incident clearly illustrates its value and effectiveness. </span><br />
<span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;"><br /></span>
<span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;">Capital One had policies for tokenizing SSN data. Had Capital One simply tokenized more fields with PII data, neither the SSNs nor the Canadian Social Insurance Numbers would have been breached, and its financial exposure, estimated at $100-$150M, would likely have been lower. </span><span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; white-space: pre-wrap;">In contrast, encryption as a data protection policy remains highly vulnerable to theft or misuse of the necessary encryption keys. </span></div>
<h1 dir="ltr" style="background-color: white; line-height: 1.92; margin-bottom: 0pt; margin-top: 0pt; padding: 8pt 0pt 11pt;">
<span style="background-color: transparent; font-family: "arial"; font-size: 20pt; font-weight: 400; vertical-align: baseline; white-space: pre-wrap;">References</span></h1>
<div dir="ltr" style="background-color: white; line-height: 1.92; margin-bottom: 0pt; margin-top: 0pt; padding: 0pt 0pt 11pt;">
<a href="https://squareup.com/us/en/townsquare/what-does-tokenization-actually-mean" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 12pt; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">Payment Tokenization Explained</span></a></div>
<div dir="ltr" style="background-color: white; line-height: 1.92; margin-bottom: 0pt; margin-top: 0pt; padding: 0pt 0pt 11pt;">
<a href="https://www.paymentsjournal.com/why-tokenization-is-key-when-it-comes-to-security-and-compliance-for-the-modern-psp/" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 12pt; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">Why Tokenization Is Key When It Comes to Security and Compliance for the Modern PSP</span></a><span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;">, Payments Journal, July 25, 2019</span></div>
<div dir="ltr" style="background-color: white; line-height: 1.92; margin-bottom: 0pt; margin-top: 0pt; padding: 0pt 0pt 11pt;">
<a href="https://www.peerlyst.com/posts/capitalone-hack-explained-what-s-in-your-bucket-christopher-gebhardt" style="text-decoration: none;"><span style="background-color: transparent; color: #1155cc; font-family: "arial"; font-size: 12pt; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">Capital One Hack Explained: What’s in your Bucket?</span></a><span style="background-color: transparent; color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;">, Peerlyst, July 30, 2019</span></div>
<a href="https://regmedia.co.uk/2019/07/29/capital_one_paige_thompson.pdf" style="text-decoration: none;"><span style="color: #1155cc; font-family: "arial"; font-size: 12pt; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">United States of America vs Paige A Thompson a/k/a “erratic”</span></a><span style="color: #444444; font-family: "arial"; font-size: 12pt; vertical-align: baseline; white-space: pre-wrap;">, US District Court for the Western District of Washington, July 29, 2019</span>Brad Schoeninghttp://www.blogger.com/profile/15502390798829156899noreply@blogger.com0tag:blogger.com,1999:blog-4225477928566630692.post-89353677247072182972018-12-03T18:31:00.000-08:002018-12-03T18:33:02.591-08:00View link for: <a href="https://docs.google.com/presentation/d/e/2PACX-1vTi3dASFeWW6Ra0IYQ0znH8ADRZbCVJyZ7nJpf7zANkfgnApZIi6O0mM0RVBKMDVejUWQHWCQjIFvMQ/pub?start=false&loop=false&delayms=3000">Cassandra Overview</a><br />
<br />
<br />Brad Schoeninghttp://www.blogger.com/profile/15502390798829156899noreply@blogger.com0tag:blogger.com,1999:blog-4225477928566630692.post-63491381477868039282018-01-07T18:36:00.003-08:002022-05-23T18:33:08.349-07:00<div class="separator" style="clear: both; text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgOZhlfioewdM15yGaligM2s4RYHfU3qggX56PjsHZKmiVna42Q__lam26I9Z7TRO0bzTFz1hZX72KduCUGhbCtzCKzBxEqZN-CLQH-yMVyOlWGGpbIaCO6h9Gn9W0aB2o2KQsV9bWjcNfU-CSmSVMJ5JwDTMz5UlGg6TQeE4XtPiUE2x8Tu8-OXBxcSA" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1293" data-original-width="1663" height="240" src="https://blogger.googleusercontent.com/img/a/AVvXsEgOZhlfioewdM15yGaligM2s4RYHfU3qggX56PjsHZKmiVna42Q__lam26I9Z7TRO0bzTFz1hZX72KduCUGhbCtzCKzBxEqZN-CLQH-yMVyOlWGGpbIaCO6h9Gn9W0aB2o2KQsV9bWjcNfU-CSmSVMJ5JwDTMz5UlGg6TQeE4XtPiUE2x8Tu8-OXBxcSA" width="309" /></a></div><br /><br /></div>
A Mind Map illustrating the core concepts of Apache Cassandra. Brad Schoeninghttp://www.blogger.com/profile/15502390798829156899noreply@blogger.com0tag:blogger.com,1999:blog-4225477928566630692.post-48736603231916779652017-02-27T11:43:00.001-08:002020-08-31T18:09:45.161-07:00<h2>
<span style="color: #0b5394;">Understanding the Cassandra Partitioner</span></h2>
<h3>
<span style="color: #0b5394;"><br /></span></h3>
<h3>
<span style="color: #0b5394;">
Partitioning Data with Hashing</span></h3>
<br />
<span style="font-family: "arial" , "helvetica" , sans-serif;">Cassandra uses a <b>Partitioner</b> to distribute data across all the nodes in a Cassandra cluster. When a row of data is written or read, the partitioner calculates the hash value of the partition key. This hash value is called a <b>Token</b> and is mapped to a node which owns that token value. Each node in a cluster is configured to own a primary token range unique to the cluster. Therefore, once the hash or token value has been calculated, we can determine which node the data belongs to.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">In Cassandra, Replication Factor (RF) is typically greater than one, so replicas of the data are stored on multiple nodes. For the partitioner, this simply means finding the token owner and then distributing replicas among the adjacent token range owners.</span><br />
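<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">The lookup described above can be sketched with a sorted map. This is a simplified illustration, not Cassandra's code: the class TokenRing and its methods are hypothetical, each node is assumed to hold a single token and to own the range ending at that token, and replicas are placed on the next nodes clockwise, ignoring virtual nodes, racks and data centers.</span><br />

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical, simplified token ring with one token per node. */
class TokenRing {
    // Sorted map from each node's token to the node name.
    private final TreeMap<Long, String> nodeByToken = new TreeMap<>();

    public void addNode(long token, String node) {
        nodeByToken.put(token, node);
    }

    /** Returns the token owner followed by the next nodes clockwise, up to rf nodes. */
    public List<String> replicasFor(long token, int rf) {
        List<String> replicas = new ArrayList<>();
        if (nodeByToken.isEmpty()) {
            return replicas;
        }
        // Owner: the first node whose token is >= the key's token,
        // wrapping around to the lowest token at the end of the ring.
        Long start = nodeByToken.ceilingKey(token);
        if (start == null) {
            start = nodeByToken.firstKey();
        }
        Long current = start;
        do {
            String node = nodeByToken.get(current);
            if (!replicas.contains(node)) {
                replicas.add(node);
            }
            // Step clockwise to the next token, wrapping around if needed.
            current = nodeByToken.higherKey(current);
            if (current == null) {
                current = nodeByToken.firstKey();
            }
        } while (replicas.size() < rf && !current.equals(start));
        return replicas;
    }

    public static void main(String[] args) {
        TokenRing ring = new TokenRing();
        ring.addNode(-4611686018427387904L, "node1");
        ring.addNode(0L, "node2");
        ring.addNode(4611686018427387904L, "node3");
        ring.addNode(Long.MAX_VALUE, "node4");
        // Token 42 falls in node3's range; with RF = 3 the replicas continue clockwise.
        System.out.println(ring.replicasFor(42L, 3)); // prints [node3, node4, node1]
    }
}
```

<span style="font-family: "arial" , "helvetica" , sans-serif;">A sorted map keeps the owner lookup logarithmic in the number of tokens, and the wrap-around step is what makes the token space behave like a ring rather than a line.</span><br />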
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">A well partitioned result distributes data evenly and randomly across the nodes in a cluster. The application developer influences evenness through the definition of the partition key. Ensuring randomness requires a good hashing algorithm.</span><br />
<br />
<h3>
<span style="color: #0b5394;">
Hashing</span></h3>
<br />
<span style="font-family: "arial" , "helvetica" , sans-serif;">Cassandra 1.0 used the classic 128-bit <a href="https://en.wikipedia.org/wiki/MD5">MD5</a> hashing algorithm for partitioning. MD5 was designed as a cryptographic hash function: its hash values should appear random and evenly distributed, and it should be hard or impossible to recover the original value from a hash value. </span><br />
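<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">As a rough sketch, an MD5-based token for a partition key could be derived as follows. The class and method names are hypothetical, and taking the absolute value of the digest to get a non-negative token is an assumption made for illustration, not a copy of Cassandra's RandomPartitioner code.</span><br />

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch of an MD5-derived partition token. */
class Md5TokenSketch {

    /** Hashes the key with MD5 and interprets the 16-byte digest as a non-negative integer. */
    public static BigInteger token(String partitionKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(partitionKey.getBytes(StandardCharsets.UTF_8));
            // BigInteger reads the bytes as a signed big-endian number; abs() drops the sign.
            return new BigInteger(digest).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available on this JVM", e);
        }
    }

    public static void main(String[] args) {
        // The same key always produces the same token.
        System.out.println(token("user:42"));
        System.out.println(token("user:42").equals(token("user:42")));
    }
}
```

<span style="font-family: "arial" , "helvetica" , sans-serif;">The partitioner only needs the even, random-looking spread of the hash values; the cryptographic strength of MD5 adds cost without helping data placement.</span><br />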
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Beginning with Cassandra 1.2, <a href="https://en.wikipedia.org/wiki/MurmurHash">Murmur3</a>, a faster, non-cryptographic hash function, replaced MD5 as the default partition hash. The name was coined from the machine language operations multiply (MU) and rotate (R).</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><b>Murmur3Partitioner</b> is now the default and should be used for all new clusters. It remains an option to configure the original RandomPartitioner using MD5 for compatibility with older clusters.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">The published Murmur3 hash algorithm provides three versions optimized for different platforms. Murmur3Partitioner in Cassandra uses the x64 128-bit version, but it truncates the result and uses only the upper 64 bits of the hash value. The token range for Murmur3 in Cassandra is therefore <span style="background-color: white;">-2</span><span class="ph sup" style="background-color: white; box-sizing: border-box; line-height: 0; position: relative; top: -0.5em; vertical-align: baseline;">63</span><span style="background-color: white;"> to +2</span><span class="ph sup" style="background-color: white; box-sizing: border-box; line-height: 0; position: relative; top: -0.5em; vertical-align: baseline;">63</span><span style="background-color: white;">-1.</span></span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white; font-size: 16px;"><br /></span></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white; font-family: "arial" , "helvetica" , sans-serif;">The Murmur3Partitioner.java source code in Cassandra 3.x creates a 128-bit Murmur3 hash with the x64_128 algorithm. The upper 64 bits in hash[0] are returned as the token value and the lower 64 bits in hash[1] are ignored.</span></span><br />
<span style="color: #0b5394;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"> private LongToken getToken(ByteBuffer key, long[] hash)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> if (key.remaining() == 0)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> return MINIMUM;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"> return new LongToken(normalize(hash[0]));</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> }</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"> private long[] getHash(ByteBuffer key)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> long[] hash = new long[2];</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> MurmurHash.hash3_x64_128(key, key.position(), </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> key.remaining(), 0, hash);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> return hash;</span><br />
<span style="color: #0b5394; font-family: "courier new" , "courier" , monospace;"></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> }</span><br />
<span style="color: #0b5394;"><br /></span>
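<span style="font-family: "arial" , "helvetica" , sans-serif;">To make the truncation concrete, here is a minimal JavaScript sketch that takes a 128-bit hash written as 32 hex digits and returns the signed 64-bit token taken from its upper half. The function name <span style="font-family: "courier new" , "courier" , monospace;">tokenFromHash128Hex</span> is hypothetical, and the remapping of the minimum long value reflects my reading of the <span style="font-family: "courier new" , "courier" , monospace;">normalize()</span> call in the source above, so treat that step as an assumption.</span><br />

```javascript
// Bounds of a signed 64-bit long, as used for Murmur3 tokens.
const LONG_MIN = -(2n ** 63n);
const LONG_MAX = 2n ** 63n - 1n;

// Takes a 128-bit hash as 32 hex digits and returns the token as a BigInt.
function tokenFromHash128Hex(hex128) {
  if (!/^[0-9a-fA-F]{32}$/.test(hex128)) {
    throw new Error("expected exactly 32 hex digits");
  }
  // Keep only the upper 64 bits (the first 16 hex digits).
  const upper = BigInt("0x" + hex128.slice(0, 16));
  // Reinterpret the unsigned value as a two's complement signed 64-bit value.
  const signed = BigInt.asIntN(64, upper);
  // Assumed normalize() behavior: the minimum long is remapped to the maximum.
  return signed === LONG_MIN ? LONG_MAX : signed;
}
```

<span style="font-family: "arial" , "helvetica" , sans-serif;">For example, a hash whose upper half is <span style="font-family: "courier new" , "courier" , monospace;">ffffffffffffffff</span> yields the token -1, because all 64 bits set is -1 in two's complement.</span><br />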
<br />
<h3>
<span style="color: #0b5394;">Hashing Examples</span></h3>
<script src="https://peterolson.github.com/BigInteger.js/BigInteger.min.js" type="text/javascript"></script>
<script src="https://rawgit.com/karanlyons/murmurHash3.js/master/murmurHash3.min.js" type="text/javascript"></script>
<span style="font-family: "arial" , "helvetica" , sans-serif;">
Enter a string to be hashed:</span><br />
<input id="s" onkeyup="hashit()" size="80" value="" /> <br />
<br />
<span style="font-family: "arial" , "helvetica" , sans-serif;">128 bits:</span> <input disabled="" id="h" onkeyup="hashit()" size="36" /> <br />
<br />
<span style="font-family: "arial" , "helvetica" , sans-serif;">High 64 bits:</span> <input disabled="" id="hi" size="18" /> <span style="font-family: "arial" , "helvetica" , sans-serif;">as signed integer:</span> <input disabled="" id="hi0" size="24" style="background-color: #dbffff; color: black; font-family: "arial"; font-size: 12px; font-weight: normal;" /> <br />
<span style="font-family: "arial" , "helvetica" , sans-serif;">Low 64 bits:</span> <input disabled="" id="lo" size="18" /> <span style="font-family: "arial" , "helvetica" , sans-serif;">as signed integer:</span> <input disabled="" id="lo0" size="24" /> <br />
<br />
<h3>
<span style="color: #0b5394;">Choosing Token Ranges</span></h3>
<div>
<span style="color: #0b5394; font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Token ranges are configured in <span style="font-family: "courier new" , "courier" , monospace;">cassandra.yaml</span> </span><span style="font-family: "arial" , "helvetica" , sans-serif;">by one of the following:</span><br />
<ul>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Manually by assigning each node a starting range</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Automatically by enabling vnodes</span></li>
</ul>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Vnodes are administratively easier and are recommended for most new environments. The default number of vnodes is 128, which creates 128 token ranges per node. For DSE clusters using Search, the recommended number of vnodes is lower, with recommended values of 16 or 32. <span style="background-color: white; color: #242729;">The primary advantages of vnodes are:</span></span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white; color: #242729;"><br /></span>
</span><br />
<ul style="background-color: white; border: 0px; color: #242729; margin: 0px 0px 1em 30px; padding: 0px;">
<li style="border: 0px; margin: 0px 0px 0.5em; padding: 0px; word-wrap: break-word;"><span style="font-family: "arial" , "helvetica" , sans-serif;">When adding or removing nodes from a cluster, manual rebalancing is not required. </span></li>
<li style="border: 0px; margin: 0px; padding: 0px; word-wrap: break-word;"><span style="font-family: "arial" , "helvetica" , sans-serif;">Faster recovery from node failures or removal. With vnodes, rebuilding can stream data from all online nodes. With manual tokens, by contrast, rebuilding reads from at most four nodes adjacent to the node being replaced. Especially in larger clusters, this can be an important factor in operational agility.</span></li>
</ul>
<span style="font-family: "arial" , "helvetica" , sans-serif;">You must not mix vnodes and manual tokens within a single data center.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Your choice of a partitioner and token range scheme determines where the data resides in a cluster. Changing either of these on a production cluster is operationally difficult and may require migrating all of your data.</span></span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<br />
<h3>
<span style="color: #0b5394;">Calculating Manual Token Ranges</span></h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">When configuring token ranges manually, it helps to use a token range calculator. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Enter the number of nodes below, and it will calculate the starting token offsets beginning with zero for the Murmur3Partitioner. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<br />
<h4>
Token Calculator</h4>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Number of nodes:</span> <input id="n" onkeyup="tcalc()" size="10" value="" /> <br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Initial Tokens: </span><br />
<textarea cols="80" id="tokens" rows="10"></textarea>
<script>
function hashit() {
var x;
x = document.getElementById("s").value;
var hexval = murmurHash3.x64.hash128(x);
document.getElementById("h").value = hexval;
var hi = hexval.substring(0, 16);
var lo = hexval.substring(16, 32);
document.getElementById("hi").value = hi;
document.getElementById("lo").value = lo;
var hival = bigInt(hi,16);
if (hival.geq(bigInt("8000000000000000",16))) {
hival = hival.subtract(bigInt("FFFFFFFFFFFFFFFF",16));
hival = hival.subtract(1);
}
var loval = bigInt(lo, 16);
if (loval.geq(bigInt("8000000000000000",16))) {
loval = loval.subtract(bigInt("FFFFFFFFFFFFFFFF",16));
loval = loval.subtract(1);
}
document.getElementById("hi0").value = hival;
document.getElementById("lo0").value = loval;
}
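function tcalc() {
// Parse the node count; ignore empty, non-numeric, or non-positive input.
var n = parseInt(document.getElementById("n").value, 10);
if (isNaN(n) || n <= 0) return;
// Spacing between tokens: the unsigned 64-bit range divided evenly by n.
var mx = bigInt("FFFFFFFFFFFFFFFF",16).divide(n);
var hi = bigInt("8000000000000000",16);
var arr = [];
for (var i = 0; i < n; i++) {
var token = mx.multiply(i);
if (token.gt(hi)) {
// Stay in bigInt arithmetic: the native "-" operator converts both
// operands to floating-point Numbers and loses the low-order digits.
token = hi.subtract(token);
}
arr.push(token.toString());
}
document.getElementById("tokens").value = arr.toString();
}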
</script>
<br />
<br />
<br />
<div class="p1">
<h3>
<span class="s1" style="color: #0b5394;">Murmur3 Partitioning Example</span></h3>
</div>
<div class="p2">
<span class="s1"></span><br /></div>
<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo; color: #000000; background-color: #ffffff}
p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo; color: #000000; background-color: #ffffff; min-height: 13.0px}
span.s1 {font-variant-ligatures: no-common-ligatures}
</style>
<br />
<div class="p1">
<span class="s1" style="font-family: "arial" , "helvetica" , sans-serif;">As a practical example with real data, we can use the Murmur3 partitioner with the twelve zodiac signs to hash their values and observe how they would be partitioned if used as keys across a Cassandra cluster. I’ve chosen six nodes because it is a fairly common cluster size and an even divisor for the number of keys. A perfectly even distribution would provide two key values per node.</span></div>
<div class="p1">
<span class="s1"><br /></span></div>
<div class="p1">
<span class="s1" style="font-family: "arial" , "helvetica" , sans-serif;"><b>Step One</b>: Calculate token ranges using the calculator above. Using n=6 we have:</span><br />
<span class="s1"><br /></span>
<br />
<div class="p1">
<span style="font-family: "courier new" , "courier" , monospace;"><b><span class="s1">Node </span><span style="font-variant-ligatures: no-common-ligatures;">Starting Range</span></b></span></div>
<div class="p1">
<span style="font-family: "courier new" , "courier" , monospace;"><span class="s1">1 </span><span style="font-variant-ligatures: no-common-ligatures;">0</span></span></div>
<div class="p1">
<span style="font-family: "courier new" , "courier" , monospace;"><span class="s1">2 </span><span style="font-variant-ligatures: no-common-ligatures;">3074457345618258602</span></span></div>
<div class="p1">
<span style="font-family: "courier new" , "courier" , monospace;"><span class="s1">3 </span><span style="font-variant-ligatures: no-common-ligatures;">6148914691236517204</span></span></div>
<div class="p1">
<span style="font-family: "courier new" , "courier" , monospace;"><span class="s1">4 </span><span style="font-variant-ligatures: no-common-ligatures;">9223372036854775806</span></span></div>
<div class="p1">
<span style="font-family: "courier new" , "courier" , monospace;"><span class="s1">5 </span><span style="font-variant-ligatures: no-common-ligatures;">-3074457345618258600</span></span></div>
<div class="p1">
<span style="font-family: "courier new" , "courier" , monospace;"><span class="s1">6 </span><span style="font-variant-ligatures: no-common-ligatures;">-6148914691236517202</span></span></div>
</div>
<div class="p1">
<span class="s1"><br /></span></div>
<div class="p1">
<span class="s1" style="font-family: "arial" , "helvetica" , sans-serif;"><b>Step Two:</b> Calculate the hash value for each of the twelve zodiac sign names and match the hash with a node's token range:</span></div>
<div class="p1">
<span class="s1"><br /></span>
<br />
<div class="p1">
<span style="font-family: "courier new" , "courier" , monospace;"><b><span class="s1">Key Value </span><span style="font-variant-ligatures: no-common-ligatures;">Murmur3 Hash </span><span style="font-variant-ligatures: no-common-ligatures;">Node</span></b></span></div>
<div class="p1">
<span style="font-family: "courier new" , "courier" , monospace;"><span class="s1">Aries </span><span style="font-variant-ligatures: no-common-ligatures;"> </span><span style="font-variant-ligatures: no-common-ligatures;">6446536566984288488 </span><span style="font-variant-ligatures: no-common-ligatures;">3</span></span></div>
<div class="p1">
<span style="font-family: "courier new" , "courier" , monospace;"><span class="s1">Taurus </span><span style="font-variant-ligatures: no-common-ligatures;"> </span><span style="font-variant-ligatures: no-common-ligatures;">4155751160254564535 </span><span style="font-variant-ligatures: no-common-ligatures;">2</span></span></div>
<div class="p1">
<span style="font-family: "courier new" , "courier" , monospace;"><span class="s1">Gemini </span><span style="font-variant-ligatures: no-common-ligatures;"> </span><span style="font-variant-ligatures: no-common-ligatures;">1317029125904582964 </span><span style="font-variant-ligatures: no-common-ligatures;">1</span></span></div>
<div class="p1">
<span style="font-family: "courier new" , "courier" , monospace;"><span class="s1">Cancer </span><span style="font-variant-ligatures: no-common-ligatures;">-8016596991533194765 </span><span style="font-variant-ligatures: no-common-ligatures;">6</span></span></div>
<div class="p1">
<span style="font-family: "courier new" , "courier" , monospace;"><span class="s1">Leo </span><span style="font-variant-ligatures: no-common-ligatures;">-</span><span style="font-variant-ligatures: no-common-ligatures;">8583032252751962986 </span><span style="font-variant-ligatures: no-common-ligatures;">3</span></span></div>
<div class="p1">
<span style="font-family: "courier new" , "courier" , monospace;"><span class="s1">Virgo </span><span style="font-variant-ligatures: no-common-ligatures;">-8041781948673145583 </span><span style="font-variant-ligatures: no-common-ligatures;">6</span></span></div>
<div class="p1">
<span style="font-family: "courier new" , "courier" , monospace;"><span class="s1">Libra </span><span style="font-variant-ligatures: no-common-ligatures;">-2142727802591540075 </span><span style="font-variant-ligatures: no-common-ligatures;">4</span></span></div>
<div class="p1">
<span style="font-family: "courier new" , "courier" , monospace;"><span class="s1">Scorpio </span><span style="font-variant-ligatures: no-common-ligatures;">-5744609807935173055 </span><span style="font-variant-ligatures: no-common-ligatures;">5</span></span></div>
<div class="p1">
<span style="font-family: "courier new" , "courier" , monospace;"><span class="s1">Sagittarius </span><span style="font-variant-ligatures: no-common-ligatures;">-0816785684867175026 </span><span style="font-variant-ligatures: no-common-ligatures;">1</span></span></div>
<div class="p1">
<span style="font-family: "courier new" , "courier" , monospace;"><span class="s1">Capricorn </span><span style="font-variant-ligatures: no-common-ligatures;">-6957124044486481194 </span><span style="font-variant-ligatures: no-common-ligatures;">6</span></span></div>
<div class="p1">
<span style="font-family: "courier new" , "courier" , monospace;"><span class="s1">Aquarius </span><span style="font-variant-ligatures: no-common-ligatures;">-3903387275638502447 </span><span style="font-variant-ligatures: no-common-ligatures;">5</span></span></div>
<div class="p1">
<span style="font-family: "courier new" , "courier" , monospace;"><span class="s1">Pisces </span><span style="font-variant-ligatures: no-common-ligatures;"> </span><span style="font-variant-ligatures: no-common-ligatures;">7634852637572685346 </span><span style="font-variant-ligatures: no-common-ligatures;">3</span></span></div>
</div>
<div class="p1">
<span class="s1"></span><br /></div>
<div class="p2">
<span class="s1" style="font-family: "arial" , "helvetica" , sans-serif;"><b>Result:</b> The result is fairly even for a random partitioner; two nodes have three values, two nodes have two values, and two nodes have one value. </span><br />
<span class="s1"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiEStsMIuSAYJBdKEFECul08ZLjxI6z5r7Me-yHdA5jNZAlN_E5Qyq4vqEESUCmcNg-LEONbI06tasYWuqNh4Z7VP_yZvwzT81dJk_XmWPWJPaKY9fqzZ3m6aVNNIcD7Qi6pbJfnVBJh11/s1600/ZodiakPartitioning.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="815" data-original-width="876" height="593" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiEStsMIuSAYJBdKEFECul08ZLjxI6z5r7Me-yHdA5jNZAlN_E5Qyq4vqEESUCmcNg-LEONbI06tasYWuqNh4Z7VP_yZvwzT81dJk_XmWPWJPaKY9fqzZ3m6aVNNIcD7Qi6pbJfnVBJh11/s640/ZodiakPartitioning.png" width="640" /></a></div>
<h3>
<span class="s1"><br /></span><span class="s1" style="color: #0b5394;">Verify Hashing in Cassandra</span></h3>
<span class="s1"><br /></span>
<span class="s1">We can now verify that the Murmur3 hash values calculated above match what Cassandra returns from its token() function.</span><br />
<span class="s1"><br /></span>
<br />
<div class="p1">
<span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">CREATE TABLE test.zodiac (</span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> sign text,</span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> body text,</span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> PRIMARY KEY (sign)</span></span></div>
<span class="s1" style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">
<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo; color: #000000; background-color: #ffffff}
span.s1 {font-variant-ligatures: no-common-ligatures}
</style>
</span><br />
<div class="p1">
<span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">);</span></span></div>
<div class="p1">
<span class="s1" style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">cqlsh> insert into zodiac (sign, body) values ('Aries', 'Mars');</span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">cqlsh> insert into zodiac (sign, body) values ('Taurus', 'Earth');</span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">cqlsh> insert into zodiac (sign, body) values ('Gemini', 'Mercury');</span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">cqlsh> insert into zodiac (sign, body) values ('Cancer', 'Moon');</span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">cqlsh> insert into zodiac (sign, body) values ('Leo', 'Sun');</span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">cqlsh> insert into zodiac (sign, body) values ('Virgo', 'Mercury');</span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">cqlsh> insert into zodiac (sign, body) values ('Libra', 'Venus');</span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">cqlsh> insert into zodiac (sign, body) values ('Scorpio', 'Pluto');</span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">cqlsh> insert into zodiac (sign, body) values ('Sagittarius', 'Jupiter');</span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">cqlsh> insert into zodiac (sign, body) values ('Capricorn', 'Saturn');</span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">cqlsh> insert into zodiac (sign, body) values ('Aquarius', 'Uranus');</span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">cqlsh> insert into zodiac (sign, body) values ('Pisces', 'Neptune');</span></span></div>
<span class="s1" style="font-size: x-small;"><br /></span>
<br />
<div class="p1">
<span style="font-size: x-small;"><span class="s1">cqlsh> select sign, token(sign) from zodiac</span><span style="font-variant-ligatures: no-common-ligatures;">;</span></span></div>
<div class="p2">
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><span class="s1"></span><br /></span></div>
<div class="p3">
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><span class="s2"> </span><span class="s3"><b>sign</b></span><span class="s2"> | </span><span class="s1"><b>system.token(sign)</b></span></span></div>
<div class="p1">
<span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">-------------+----------------------</span></span></div>
<div class="p4">
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><span class="s4"><b> Leo</b></span><span class="s2"> | </span><span class="s1"><b>-8583032252751962986</b></span></span></div>
<div class="p4">
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><span class="s4"><b> Virgo</b></span><span class="s2"> | </span><span class="s1"><b>-8041781948673145583</b></span></span></div>
<div class="p4">
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><span class="s4"><b> Cancer</b></span><span class="s2"> | </span><span class="s1"><b>-8016596991533194765</b></span></span></div>
<div class="p4">
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><span class="s4"><b> Capricorn</b></span><span class="s2"> | </span><span class="s1"><b>-6957124044486481194</b></span></span></div>
<div class="p4">
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><span class="s4"><b> Scorpio</b></span><span class="s2"> | </span><span class="s1"><b>-5744609807935173055</b></span></span></div>
<div class="p4">
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><span class="s4"><b> Aquarius</b></span><span class="s2"> | </span><span class="s1"><b>-3903387275638502447</b></span></span></div>
<div class="p4">
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><span class="s4"><b> Libra</b></span><span class="s2"> | </span><span class="s1"><b>-2142727802591540075</b></span></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><span class="s4" style="font-size: x-small;"><b> Sagittarius</b></span><span class="s2" style="font-size: x-small;"> | </span><span class="s1" style="font-size: x-small;"><b>-816785684867175026</b></span></span></div>
<div class="p4">
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><span class="s4"><b> Gemini</b></span><span class="s2"> | </span><span class="s1"><b>1721847210301305769</b></span></span></div>
<div class="p4">
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><span class="s4"><b> Taurus</b></span><span class="s2"> | </span><span class="s1"><b>4155751160254564535</b></span></span></div>
<div class="p4">
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><span class="s4"><b> Aries</b></span><span class="s2"> | </span><span class="s1"><b>6446536566984288488 </b></span></span></div>
<div class="p4">
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><span class="s4"><b> Pisces</b></span><span class="s2"> | </span><span class="s1"><b>7634852637572685346</b></span></span></div>
<div class="p2">
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><span class="s1"></span><br /></span></div>
<span class="s1" style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">
<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo; color: #000000; background-color: #ffffff}
p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo; color: #000000; background-color: #ffffff; min-height: 13.0px}
p.p3 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo; color: #d53bd3; background-color: #ffffff}
p.p4 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo; color: #34bc26; background-color: #ffffff}
span.s1 {font-variant-ligatures: no-common-ligatures}
span.s2 {font-variant-ligatures: no-common-ligatures; color: #000000}
span.s3 {font-variant-ligatures: no-common-ligatures; color: #c33720}
span.s4 {font-variant-ligatures: no-common-ligatures; color: #afad24}
</style>
</span><br />
<div class="p1">
<span class="s1"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">(12 rows)</span></span></div>
<div class="p1">
<span class="s1"><span style="font-size: x-small;"><br /></span></span></div>
<div class="p1">
<span class="s1"><span style="font-size: x-small;"><br /></span></span></div>
</div>
<div class="p2">
<span class="s1" style="font-size: x-small;"><br /></span></div>
<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo; color: #000000; background-color: #ffffff}
span.s1 {font-variant-ligatures: no-common-ligatures}
</style>Brad Schoeninghttp://www.blogger.com/profile/15502390798829156899noreply@blogger.com0