Follow up on direct disclosure #243
Replies: 6 comments
-
Peter-Paul: Let me give you some more background to my question. With frequency count tables, I often like to mentally reorder them so that the rows represent the “identifiable groups” and the columns represent the “sensitive variable”. For example, the rows would be defined by region x gender x occupation and the columns by type of disease or type of injury.

One of our rules on frequency count tables states that we should not have group disclosure. That is, the members of an “identifiable group” should not all, or almost all, score on a single sensitive category. But then you could have the strange situation that a detailed table (detailed cancer types as diseases) can be published (not almost all score on a single cancer type), while a less detailed table (only the overall category “cancer”) cannot be published (because almost all score on “cancer”). So that is what you call a “hidden hierarchy”. We call that a “meaningful combination”. Note that you can often find some combination that produces group disclosure, but if you combine unrelated categories, that does not mean anything (e.g. something like “ovarian cancer grouped together with heart failure”). You might even get several hidden hierarchies: different regroupings could all yield meaningful combinations that could lead to group disclosure. This makes it rather complicated and we do not really have a nice way to deal with it (yet).

I need to think a bit more on using hidden hierarchies in finding primaries, but not using them in finding secondaries. But that seems to be exactly what you call the “one and a half” suppression problem. It is actually the reason I often say that the suppression methods available in e.g. tau-argus are more targeted at magnitude tables than at frequency count tables. For frequency count tables I would more often recommend rounding as a protection method.

On something else: I am not familiar with the Gaussian suppression method.
you could end up with a pattern like this
depending on the ordering of the candidate cells you consider?
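To make the group disclosure rule from this comment concrete, here is a minimal Python sketch. The 90% threshold, the function name `group_disclosure`, and all counts are illustrative assumptions, not values taken from any actual rule set or table.

```python
def group_disclosure(counts, groups, threshold=0.9):
    """Return the names of groups holding at least `threshold` of the row total.

    counts: category -> frequency for one identifiable group (one table row).
    groups: group name -> list of categories that form the group.
    threshold: assumed cut-off for "almost all"; 0.9 is only an example.
    """
    total = sum(counts.values())
    if total == 0:
        return []
    flagged = []
    for name, members in groups.items():
        share = sum(counts[m] for m in members) / total
        if share >= threshold:
            flagged.append(name)
    return flagged


# Hypothetical row: one region x gender x occupation group.
row = {"lung cancer": 8, "skin cancer": 6, "breast cancer": 5, "heart failure": 1}

# Detailed table: every disease is its own column.
detailed = {category: [category] for category in row}

# Less detailed table: one meaningful combination of the cancer types.
combined = {"cancer": ["lung cancer", "skin cancer", "breast cancer"]}

print(group_disclosure(row, detailed))  # no single cancer type dominates
print(group_disclosure(row, combined))  # the combined "cancer" category does
```

With these made-up counts the detailed row passes the check while the combined category fails it, which is the situation described above. Passing several entries in `groups` would correspond to several hidden hierarchies being checked at once.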
-
Daniel: Concerning hidden hierarchies: I have a similar way of arranging the tables in my head, with rows being identifiable variables and columns being the sensitive variables. Indeed, the idea for direct disclosure comes from the microdata world, since frequency tables are really just a compressed representation of (aggregated) microdata. So the idea was to translate a (somewhat more complex) notion of l-diversity onto frequency tables, though the actual formalization of direct disclosure as a microdata risk measure is rather technical and not very elegant.

I would say the major benefit of the model matrix representation that Gaussian suppression uses (the X matrix in my slides) is that hierarchies need not be tree-shaped. Each row in X corresponds to an inner cell, and each column represents a possible combination of these cells. In this way, any “meaningful combination” can be represented as a column in the same X matrix. In fact, the proof-of-concept I made last week supports specification of any meaningful combination of the single categories (with an, admittedly, not very user-friendly interface). So, in addition to injured = serious + light, one could, in theory, have “seriousornone” = serious + none as another meaningful combination to be protected within the same hidden “hierarchy”.

I’d like to elaborate a bit more on my view of the one-and-a-half problem (please let me know if you have a more suitable name). The somewhat absurd naming came from the fact that primary suppression suppresses cells that are considered disclosive, and secondary suppression suppresses cells such that one cannot recalculate any suppressed cells. For frequency tables, I believe there should be a step in the middle: suppress extra cells such that membership in the disclosive cells cannot be determined, hence the naming of the problem. This is where the idea of converting l-diversity came from. Regarding the one-and-a-half problem, let us consider the single row
The proof of concept I wrote would primary suppress as follows:
Similar to what you mentioned, I generate virtual cells in the proof-of-concept and use them to determine which (actual) cells to suppress. I have since talked to my colleagues about this (and how the 1.5-problem is not addressed by this suppression pattern), and we have an idea of how this could be solved (at least in the framework used by Gauss suppression), which would result in, for example,
This would be an adequate solution to the 1.5-problem with a hidden hierarchy including “injured”, as well as being a valid secondary suppression. This would entail including the virtual cells in both the primary and secondary suppression methods. We haven’t implemented this yet, but we believe we know how it can be done within the Gauss framework. I can keep you updated once we have more to show. Regarding your comment about singletons using virtual cells, I believe Øyvind has implemented this (and other considerations) in the Gauss framework, but I think he is much more qualified to make statements about this than I am. Finally, concerning your example.
The default behavior of GaussSuppression yields the following table (it prioritizes lower frequency cells for
In fact, the suppression pattern you gave cannot be generated by GaussSuppression. If it is supplied with those suppressions, it finds one more:
This makes sense, since without this last suppression, the cell (k1,B) could be recalculated precisely (unless I am mistaken…I have to admit I used barely-tested code to check here). Øyvind has run a benchmark to see how the candidate order affects the number of secondary suppressions given the three primary cells, and for 10 000 random orders we get the following distribution (first row is the number of secondary suppressed cells, second row is how often that number of secondary cells occurred in the sample).
So it seems at least highly unlikely that the whole table would end up suppressed 😊
-
Øyvind: I’m not 100% sure, but it seems that the JJ-format corresponds to the model matrix. The tau-manual says that JJ “establish a link between the (hierarchical) tables and the structures required…..” One interpretation is that the structures required are the same as the M-dimensional cover table, and that this M-dimensional table is the same as what we call inner cells. In our approach the model matrix is simply a huge dummy matrix with inner cells as rows and the cells to be published as columns. Sparse matrix methodology must be used so that large data can be handled. In the example below the dummy model matrix is a 16*25 matrix. The column representing cell (k1,B) depends linearly on the 19 columns representing non-suppressed cells. Gauss suppression avoids such linear dependency.

In the beginning, I programmed Gram–Schmidt orthogonalization, Householder transformation and Gaussian elimination. All gave the same solution. The latter method could keep sparsity (sparse matrix) and was therefore preferred. The method was first included as an additional method in the R package easySdcTable. Good experience with it, together with many issues with SIMPLEHEURISTIC in sdcTable, has led to GaussSuppression now being preferred as the much better method. Since we needed more flexibility, and since the user interface in the R package SmallCountRounding could easily be re-used, the R package GaussSuppression was made.
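The linear dependency idea can be sketched in a few lines of Python. This is not the GaussSuppression implementation (which, as described above, uses sparse Gaussian elimination); it is a dense NumPy stand-in that only tests whether a suppressed cell's column lies in the span of the published columns. The 2 x 3 table layout and the names `X`, `names` and `is_recoverable` are assumptions made for illustration.

```python
import numpy as np

rows = ["k1", "k2"]
cols = ["none", "light", "serious"]
inner = [(r, c) for r in rows for c in cols]  # 6 inner cells = rows of X

# Every cell of the published table, described by the inner cells it sums.
cells = {(r, c): [(r, c)] for r, c in inner}
for r in rows:
    cells[(r, "Total")] = [(r, c) for c in cols]
for c in cols:
    cells[("Total", c)] = [(r, c) for r in rows]
cells[("Total", "Total")] = list(inner)

names = list(cells)  # column order of X
X = np.zeros((len(inner), len(names)), dtype=int)
for j, name in enumerate(names):
    for cell in cells[name]:
        X[inner.index(cell), j] = 1  # dummy (0/1) model matrix


def is_recoverable(X, names, suppressed, target):
    """True if column `target` is a linear combination of the published columns."""
    published = [j for j, n in enumerate(names) if n not in suppressed]
    P = X[:, published]
    t = X[:, [names.index(target)]]
    return np.linalg.matrix_rank(np.hstack([P, t])) == np.linalg.matrix_rank(P)


# A single suppressed cell can be recalculated from its row total.
print(is_recoverable(X, names, {("k1", "none")}, ("k1", "none")))

# A 2 x 2 block of suppressed cells protects each cell against exact recalculation.
block = {("k1", "none"), ("k1", "light"), ("k2", "none"), ("k2", "light")}
print(is_recoverable(X, names, block, ("k1", "none")))
```

A meaningful combination that is not part of a tree-shaped hierarchy would simply be one more entry in `cells`, and therefore one more column of `X`.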
-
In response to email comments by Peter-Paul:
so not sure yet how to get to
My response:
-
Another question regarding meaningful combinations: Should we protect disclosures within meaningful combinations? That is to say, in
the one seriously injured knows that all other injured are in the light category.
-
I would like to follow up on this discussion. We have come quite a bit further in this work, and wrote a short paper on it intended for PSD (we are in the process of preparing a journal version). Unfortunately, due to a series of misunderstandings on the part of the single reviewer we got, it was rejected. Since we had discussed the work leading to this paper, I thought you might be interested in the recent developments. The paper as submitted to PSD can be found here: link.

The general idea is as follows: as we discussed in Poznan and in the above conversation, it is small differences between certain cells that lead to disclosure, not the frequency of certain cells. So the idea is that, to prevent disclosure, an attacker must not be able to recalculate these differences too precisely. To address this, we primary suppress these differences, not actual table cells. Secondary suppression, using only publishable cells as candidates, then protects these differences from disclosure. In practice, this means constructing the disclosive differences as hidden cells.

This approach addresses everything we've discussed in the above conversation, in particular meaningful combinations (both disclosure of meaningful combinations, and disclosure within meaningful combinations). I'd be happy to discuss this with you, both in Paris if you are going and otherwise.
-
Daniel Lupp:
I’d like to follow up on our brief discussion on whether seriously injured + lightly injured = injured is disclosure. As I mentioned, I do agree with you that this should constitute disclosure, but the published code on CRAN does not support customizing this yet. I did mention an idea on defining hierarchies to solve this problem, so I whipped up a proof-of-concept implementation to show what I mean (the actual code is on the “feature/shadow-hierarchies” branch on https://github.com/statisticsnorway/GaussSuppression, if you’re interested).
The code on that branch can take hierarchical tables into account (though it can’t handle non-disclosive unknown values), and it can also take what I call a shadow hierarchy: a hierarchy where the marginals are not going to be published, but are considered during primary suppression.
Consider the following table (rows are municipalities, columns are levels of injury):
Running the code on CRAN (without using any shadow hierarchy) yields the following primary suppression:
Row one is directly disclosive, but we want to have row two directly disclosive as well (since the one “none” in injury can disclose that everyone else in k2 is injured). For this we define shadow hierarchies for our dataset:
Using this shadow hierarchy in the code, one gets the following primary suppression:
Now the “injured” disclosure in k2 is detected. Note that this shadow hierarchy does not change the shape of the table, it is only used in primary cell detection.
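The effect of a shadow hierarchy on primary detection can be illustrated with a short Python sketch. It does not use the code on the branch mentioned above. The detection rule (a row is flagged for a category group when at most one person in the row falls outside that group), the function name `primary_flags`, and all counts are simplifying assumptions made only for illustration.

```python
# Hypothetical counts: rows are municipalities, columns are levels of injury.
table = {
    "k1": {"none": 0, "light": 0, "serious": 4},
    "k2": {"none": 1, "light": 3, "serious": 2},
    "k3": {"none": 5, "light": 4, "serious": 3},
}

# Shadow hierarchy: "injured" is never published, only checked.
shadow = {"injured": ["light", "serious"]}


def primary_flags(table, shadow=None, max_outside=1):
    """Return (row, group) pairs flagged under the assumed detection rule."""
    categories = next(iter(table.values()))
    groups = {c: [c] for c in categories}  # every published category
    groups.update(shadow or {})            # plus the unpublished shadow groups
    flagged = []
    for row, counts in table.items():
        total = sum(counts.values())
        for name, members in groups.items():
            inside = sum(counts[m] for m in members)
            # Flag when the group is non-empty and at most `max_outside`
            # people in the row are outside it.
            if inside > 0 and total - inside <= max_outside:
                flagged.append((row, name))
    return flagged


print(primary_flags(table))          # only k1 is detected
print(primary_flags(table, shadow))  # k2 is now detected through "injured"
```

The table passed in is the same in both calls, so the shadow hierarchy only changes which disclosures are detected, not the shape of the table.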
Now, an interesting problem arises, which I (somewhat tongue-in-cheek) have called the “one and a half” suppression problem when talking to my colleagues. This was what I originally wanted to talk to you about at the ESTP course. The standard approach to cell suppression in GaussSuppression yields the following suppressed table:
This is a perfectly valid secondary cell suppression, since no cell values can be recalculated. However, it is still very disclosive when considering “injured” a disclosive category. I believe I could game the system (similarly to how I do for the version on CRAN) by tweaking the order in which cells are considered for secondary suppression, but the underlying problem is something that secondary suppression does not address.
This is what I call the “one and a half” suppression problem: how to suppress cells so that membership in primary suppressed cells is hidden? As far as I’m aware, no method for secondary suppression does this; they focus on preventing the recalculation of cell values, which is an entirely different problem. I believe “one and a half” is unique to frequency tables, but if one wishes to construct a method geared towards suppressing disclosure, as opposed to suppressing values, then this should be taken into account. I was wondering whether you have encountered this discussion before?
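A small Python sketch of why a valid value suppression can still disclose membership: within a single row whose total is published, the sum of a group of suppressed cells is known exactly as soon as every cell outside the group is published. The counts, the group definition and the function name `disclosed_groups` are hypothetical, and the check deliberately looks at one row only, ignoring column margins.

```python
def disclosed_groups(counts, suppressed, groups):
    """Return group sums that row arithmetic reveals despite suppression.

    counts: category -> true count for one row (the row total is assumed published).
    suppressed: set of categories whose cells are suppressed in this row.
    groups: group name -> list of categories forming a meaningful combination.
    """
    total = sum(counts.values())
    revealed = {}
    for name, members in groups.items():
        outside = [c for c in counts if c not in members]
        hides_values = any(m in suppressed for m in members)
        outside_known = not any(c in suppressed for c in outside)
        if hides_values and outside_known:
            # group sum = published total - published cells outside the group
            revealed[name] = total - sum(counts[c] for c in outside)
    return revealed


row_k2 = {"none": 1, "light": 3, "serious": 2}  # hypothetical counts
groups = {"injured": ["light", "serious"]}

# Neither "light" nor "serious" can be recalculated, yet "injured" = 5 is exact.
print(disclosed_groups(row_k2, {"light", "serious"}, groups))

# Suppressing a cell outside the group hides the group sum in this row.
print(disclosed_groups(row_k2, {"none", "light"}, groups))
```

In the first pattern each suppressed value is protected, but anyone who knows the one uninjured person in the row learns that everyone else is injured.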
Back in April I tried formalizing this as a MILP, but unfortunately ran out of time due to other obligations and since this was very low priority…so this has been on hold since then.