Follow up on direct disclosure #243

ppdewolf · 2021-12-10T08:28:40Z

ppdewolf
Dec 10, 2021
Maintainer

Daniel Lupp:

I’d like to follow up on our brief discussion on whether seriously injured + lightly injured = injured is disclosure. As I mentioned, I do agree with you that this should constitute disclosure, but the published code on CRAN does not support customizing this yet. I did mention an idea on defining hierarchies to solve this problem, so I whipped up a proof-of-concept implementation to show what I mean (the actual code is on the “feature/shadow-hierarchies” branch on https://github.com/statisticsnorway/GaussSuppression, if you’re interested).

The code on that branch can take hierarchical tables into account (though it can’t handle non-disclosive unknown values). It can also take what I call a shadow hierarchy: a hierarchy whose marginals are not going to be published, but are considered during primary suppression.

Consider the following table (rows are municipalities, columns are levels of injury):

    mun light none serious Total
1    k1     4    0       0     4
2    k2     2    1       5     8
3    k3     2    5       1     8
4    k4     4    4       4    12
5 Total    12   10      10    32

Running the code on CRAN (without using any shadow hierarchy) yields the following primary suppression:

    mun light none serious Total
1    k1    NA    0       0     4
2    k2     2    1       5     8
3    k3     2    5       1     8
4    k4     4    4       4    12
5 Total    12   10      10    32

Row one is directly disclosive, but we want row two to be flagged as directly disclosive as well (since the one person with “none” as injury can infer that everyone else in k2 is injured). For this we define shadow hierarchies for our dataset:

> s.hier2
$mun
  levels codes
1      @ Total
2     @@    k1
3     @@    k2
4     @@    k3
5     @@    k4

$inj
  levels   codes
1      @   Total
2     @@ injured
3    @@@ serious
4    @@@   light
5     @@    none

Using this shadow hierarchy in the code, one gets the following primary suppression:

    mun light none serious Total
1    k1    NA    0       0     4
2    k2    NA    1      NA     8
3    k3     2    5       1     8
4    k4     4    4       4    12
5 Total    12   10      10    32

Now the “injured” disclosure in k2 is detected. Note that this shadow hierarchy does not change the shape of the table, it is only used in primary cell detection.
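The two primary suppression patterns above can be reproduced with a small sketch. This is a plain Python illustration, not the GaussSuppression code. The rule it uses is an assumption chosen because it matches the tables in this post: a category (or shadow combination) is flagged when at most one person in the municipality falls outside it, and only its non-zero published cells are suppressed. The names `primary_cells`, `SHADOW` and `max_outside` are hypothetical, and the Total row is left out for brevity.

```python
# Hypothetical illustration of primary suppression with a shadow hierarchy.
# The threshold rule is an assumption, not the rule used by GaussSuppression.
TABLE = {
    "k1": {"light": 4, "none": 0, "serious": 0},
    "k2": {"light": 2, "none": 1, "serious": 5},
    "k3": {"light": 2, "none": 5, "serious": 1},
    "k4": {"light": 4, "none": 4, "serious": 4},
}

# Shadow combination: never published, only used to detect disclosure.
SHADOW = {"injured": ("light", "serious")}


def primary_cells(table, shadow=None, max_outside=1):
    """Return the set of (municipality, category) cells to primary suppress."""
    # Every published category is a group of its own; shadow groups are added.
    groups = {cat: (cat,) for row in table.values() for cat in row}
    groups.update(shadow or {})
    flagged = set()
    for mun, row in table.items():
        total = sum(row.values())
        for members in groups.values():
            count = sum(row[m] for m in members)
            # Disclosive if (almost) everyone in the municipality is in the group.
            if count > 0 and total - count <= max_outside:
                flagged.update((mun, m) for m in members if row[m] > 0)
    return flagged


without_shadow = primary_cells(TABLE)
with_shadow = primary_cells(TABLE, SHADOW)
print(sorted(without_shadow))
print(sorted(with_shadow))
```

Under these assumptions the shadow hierarchy only adds candidate groups to the detection step; the table that is published keeps its original shape.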

Now, an interesting problem arises, which I (somewhat tongue-in-cheek) have called the “one and a half” suppression problem when talking to my colleagues. This was what I originally wanted to talk to you about at the ESTP course. The standard approach to cell suppression in GaussSuppression yields the following suppressed table:

    mun light none serious Total
1    k1    NA    0      NA     4
2    k2    NA    1      NA     8
3    k3     2    5       1     8
4    k4     4    4       4    12
5 Total    12   10      10    32

A perfectly valid secondary cell suppression, since no cell values can be recalculated. However, it is still very disclosive when considering “injured” a disclosive category. I believe I could game the system (similarly to how I do for the version on CRAN) by tweaking the order in which cells are considered for secondary suppression, but the underlying problem is something that secondary suppression does not address.
This is what I call the “one and a half” suppression problem: how should cells be suppressed so that membership in primary suppressed cells is hidden? As far as I’m aware, no method for secondary suppression does this; they focus on preventing the recalculation of cell values, which is an entirely different problem. I believe “one and a half” is unique to frequency tables, but if one wishes to construct a method geared towards suppressing disclosure, as opposed to suppressing values, then this should be taken into account. I was wondering whether you have encountered this discussion before?
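To make the remaining disclosure concrete, here is a minimal sketch (plain Python, with hypothetical names) that subtracts the published cells of each row of the suppressed table from the row total. Because “light” and “serious” are the only suppressed cells in rows k1 and k2, that remainder is exactly the hidden “injured” count.

```python
# Rows k1 and k2 of the suppressed table; None marks a suppressed cell.
SUPPRESSED = {
    "k1": {"light": None, "none": 0, "serious": None, "Total": 4},
    "k2": {"light": None, "none": 1, "serious": None, "Total": 8},
}


def suppressed_sum(row):
    """Sum of the suppressed cells in a row, derived from published values only."""
    published = sum(v for k, v in row.items() if k != "Total" and v is not None)
    return row["Total"] - published


injured = {mun: suppressed_sum(row) for mun, row in SUPPRESSED.items()}
for mun, count in injured.items():
    total = SUPPRESSED[mun]["Total"]
    print(f"{mun}: {count} of {total} are injured")
```

No individual cell value is recovered here, yet the attacker still learns that everyone in k1, and all but one person in k2, is injured.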

Back in April I tried formalizing this as a MILP, but unfortunately ran out of time due to other obligations and since this was very low priority…so this has been on hold since then.

ppdewolf · 2021-12-10T08:30:10Z

ppdewolf
Dec 10, 2021
Maintainer Author

Peter-Paul:
The discussion you mentioned, in view of your new law, about sensitivity of variables reminded me of the discussion we have at Statistics Netherlands from time to time with our (new) legal staff. According to them there is no such thing as sensitivity when it comes to our publications. According to the law, they say, we need to treat all (personal) variables we get from respondents equally regarding sensitivity. However, they are aware that a disclosure risk of zero does not exist (unless we don’t publish), so they do accept that the policy of SN should be based on some sort of risk management. In the “old days” we made this “easy” by defining sensitive and non-sensitive variables: with sensitive variables we need to be more strict in applying risk measures and SDC methods. However, nowadays they (the new legal staff) are a bit more reluctant to use the word “sensitive”. I usually try to flip the argument: “in view of risk assessment, disclosing certain variables does not have a large impact. So we can be a bit more relaxed in applying risk measures and SDC methods. Those variables I call non-sensitive.”

Let me give you some more background to my question. With frequency count tables, I often like to mentally reorder them in such a way that the rows represent the “identifiable groups” and the columns represent the “sensitive variable”. So e.g. the rows would be defined by region x gender x occupation and the columns would be defined by type of disease or type of injury. One of our rules on frequency count tables states that we should not have group disclosure. That is, looking at an “identifiable group” they should not (almost) all score on a single sensitive category. But then you could have the strange situation that a detailed table (detailed cancer types as diseases) can be published (not almost all score on a single cancer type), but a less detailed table (only the overall category “cancer”) cannot be published (because almost all score on “cancer”). So that is what you call a “hidden hierarchy”. We call that a “meaningful combination”. Note that often you can find some combination such that you end up with group disclosure, but if you combine non-related categories that does not mean anything (e.g. something like “ovarian cancer grouped together with heart failure”). You might even get several hidden hierarchies: different regroupings could all yield meaningful combinations that could lead to group disclosure. This makes it rather complicated and we do not really have a nice way to deal with it (yet).

I need to think a bit more on using hidden hierarchies in finding primaries, but not using it in finding secondaries. But that seems to be exactly what you call the “one and a half” suppression problem. It is actually the reason I often say that the suppression methods as available in e.g. tau-argus are more targeted at magnitude tables, not frequency count tables. For frequency count tables I would more often recommend rounding as a protection method.
It does remind me of a perhaps similar problem in magnitude tables when looking at two singleton cells in a row or column. They could “protect” each other in the sense of safety ranges, but since they are singletons they can disclose each other when no additional cell is suppressed. To tackle this problem we add a “virtual cell” to the table structure. That virtual cell is the sum of the two singleton cells. We then require non-exact disclosure of that virtual cell. Maybe something like that is possible for the hidden hierarchies as well: defining a hidden subtotal as an additional “virtual” cell that needs protection. But to give that protection you can only use “true” cells from the original table (i.e. you cannot use subtotals from the hidden hierarchy as secondary suppressions). We might even get this into the MILP way of secondary suppression that is in the Fischetti-Salazar approach (optimal/modular) using the idea of adding those virtual cells to the MILP problem.

On something else: I am not familiar with the Gaussian suppression method.
Just wondering whether starting from

            A    B    C    D Total
1    k1     6   NA    7    4    18
2    k2    NA    5    6    6    19
3    k3     3    8    5   NA    18
4    k4     8    4    9    7    28
5 Total    19   18   27   19    83

you could end up with a pattern like this

            A    B    C    D Total
1    k1    NA   NA   NA    4    18
2    k2    NA    5   NA    6    19
3    k3     3   NA    5   NA    18
4    k4     8   NA    9   NA    28
5 Total    19   18   27   19    83

depending on the ordering of the candidate cells you consider?

0 replies

ppdewolf · 2021-12-10T08:32:17Z

ppdewolf
Dec 10, 2021
Maintainer Author

Daniel:
Regarding the legal situation, we are currently in a process with our institute’s legal team to try and find a more nuanced understanding of the law. Similarly to your situation, they do not like the differentiation between sensitive and non-sensitive. I like your phrasing/approach, and we will certainly try to convey the same point, though I am not going to get my hopes up that they will come to the same understanding we have.

Concerning hidden hierarchies: I have a similar way of arranging the tables in my head, with rows being identifiable variables and columns being the sensitive variables. Indeed, the idea for direct disclosure comes from the microdata world, since frequency tables are really just a compressed representation of (aggregated) microdata. So the idea was to translate a (somewhat more complex) notion of l-diversity on to frequency tables, though the actual formalization of direct disclosure as a microdata risk measure is rather technical and not very elegant.

I would say the major benefit of the model matrix representation that Gaussian suppression uses (the X matrix in my slides), is that hierarchies need not be tree-shaped. Each row in X corresponds to an inner cell, and each column represents a possible combination of these cells. In this way, any “meaningful combination” can be represented as a column in the same X matrix. In fact, the proof-of-concept I made last week supports specification of any meaningful combination of the single categories (with an, admittedly, not very user-friendly interface). So, in addition to injured = serious + light, one could, in theory, have “seriousornone” = serious + none as another meaningful combination to be protected within the same hidden “hierarchy”.

I’d like to elaborate a bit more on my view of the one-and-a-half problem (please let me know if you have a more suitable name). The somewhat absurd naming came from the fact that primary suppression suppresses cells that are considered disclosive, and secondary suppression suppresses cells such that one cannot recalculate any suppressed cells. For frequency tables, I believe there should be a step in the middle: suppress extra cells such that membership in the disclosive cells cannot be determined, hence the naming of the problem. This is where the idea of converting l-diversity came from.

Regarding the one-and-a-half problem, let us consider the single row

mun light none serious Total
k2    4    1      5     10

The proof of concept I wrote would primary suppress as follows:

mun light none serious Total
k2    NA    1     NA     10

Similar to what you mentioned, I generate virtual cells in the proof-of-concept and use them to determine which (actual) cells to suppress. I have since talked to my colleagues about this (and how the 1.5-problem is not addressed by this suppression pattern), and we have an idea of how this could be solved (at least in the framework used by Gauss suppression), which would result in, for example,

mun light none serious Total
k2    NA   NA     5     10

This would be an adequate solution to the 1.5-problem with a hidden hierarchy including “injured”, as well as being a valid secondary suppression. This would entail including the virtual cells in both the primary and secondary suppression methods. We haven’t implemented this yet, but we believe we know how it can be done within the Gauss framework. I can keep you updated once we have more to show.
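The difference between the two patterns for k2 can be checked with a small linear-algebra sketch. It treats each cell as a 0/1 vector over the inner cells (light, none, serious) and asks whether the hidden “injured” cell lies in the span of the published cells. This is only an illustration of the idea under those assumptions, written in plain Python with exact fractions. It is not the GaussSuppression implementation, and the names `rank` and `recoverable` are hypothetical.

```python
from fractions import Fraction


def rank(vectors):
    """Rank of a list of equal-length vectors via exact Gaussian elimination."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    r = 0
    ncols = len(rows[0]) if rows else 0
    for col in range(ncols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def recoverable(target, published):
    """True if target is a linear combination of the published cell vectors."""
    return rank(published + [target]) == rank(published)


# Cell vectors over the inner cells (light, none, serious).
LIGHT, NONE, SERIOUS, TOTAL = (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)
INJURED = (1, 0, 1)  # hidden (virtual) cell: light + serious

pattern_primary_only = [NONE, TOTAL]   # light and serious suppressed
pattern_proposed = [SERIOUS, TOTAL]    # light and none suppressed

print(recoverable(INJURED, pattern_primary_only))  # injured = Total - none
print(recoverable(INJURED, pattern_proposed))
```

In the first pattern the hidden cell is exactly Total minus “none”; in the second, neither the hidden cell nor the two suppressed cells can be written as a combination of the published ones.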

Regarding your comment about singletons using virtual cells, I believe Øyvind has implemented this (and other considerations) in the Gauss framework, but I think he is much more qualified to make statements about this than I am.

Finally, concerning your example.

            A    B    C    D Total
1    k1     6   NA    7    4    18
2    k2    NA    5    6    6    19
3    k3     3    8    5   NA    18
4    k4     8    4    9    7    28
5 Total    19   18   27   19    83

The default behavior of GaussSuppression yields the following table (it prioritizes lower frequency cells for secondary suppression):

   Var1     A    B    C    D Total
1    k1     6   NA    7   NA    18
2    k2    NA   NA    6    6    19
3    k3    NA    8    5   NA    18
4    k4     8    4    9    7    28
5 Total    19   18   27   19    83

In fact, the suppression pattern you gave cannot be generated by GaussSuppression. If it is supplied with those suppressions, it finds one more:

   Var1     A    B    C    D Total
1    k1    NA   NA   NA    4    18
2    k2    NA    5   NA    6    19
3    k3    NA   NA    5   NA    18
4    k4     8   NA    9   NA    28
5 Total    19   18   27   19    83

This makes sense, since without this last suppression, the cell (k1,B) could be recalculated precisely (unless I am mistaken…have to admit I used barely-tested code to check here).
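The recalculation can also be verified by hand. The sketch below (plain Python, hypothetical names) uses only the published values of the supplied pattern, with (k3,A) still published and 6 for (k2,D) as in the original table. Adding the unknown parts of rows k1 and k2 and subtracting the unknown parts of columns A and C leaves exactly (k1,B).

```python
# Supplied suppression pattern, before the extra suppression of (k3, A).
# None marks a suppressed cell; (k2, D) is taken as 6, as in the original table.
PATTERN = {
    "k1": {"A": None, "B": None, "C": None, "D": 4, "Total": 18},
    "k2": {"A": None, "B": 5, "C": None, "D": 6, "Total": 19},
    "k3": {"A": 3, "B": None, "C": 5, "D": None, "Total": 18},
    "k4": {"A": 8, "B": None, "C": 9, "D": None, "Total": 28},
}
COLUMN_TOTALS = {"A": 19, "B": 18, "C": 27, "D": 19}


def row_unknown(mun):
    """Sum of the suppressed cells in a row (row total minus published cells)."""
    row = PATTERN[mun]
    return row["Total"] - sum(
        v for k, v in row.items() if k != "Total" and v is not None
    )


def col_unknown(cat):
    """Sum of the suppressed cells in a column."""
    published = sum(r[cat] for r in PATTERN.values() if r[cat] is not None)
    return COLUMN_TOTALS[cat] - published


# (k1,A)+(k1,B)+(k1,C) + (k2,A)+(k2,C) - [(k1,A)+(k2,A)] - [(k1,C)+(k2,C)]
k1_B = row_unknown("k1") + row_unknown("k2") - col_unknown("A") - col_unknown("C")
print(k1_B)
```

The result agrees with the value implied by column B of the original table (18 - 5 - 8 - 4), so the extra suppression found by GaussSuppression is indeed needed.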

Øyvind has run a benchmark to see how the candidate order affects number of secondary suppressions given the three primary cells, and for 10 000 random orders we get the following distribution (first row is number of secondary suppressed cells, second row is how often that number of secondary cells occurred in the sample).

   3    4    5    6
2735  907 5513  845

So it seems at least highly unlikely that the whole table would end up suppressed 😊
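For reference, the benchmark distribution can be summarized with a few lines of arithmetic (plain Python; the variable names are only illustrative):

```python
# Number of secondary suppressions -> how often it occurred in the sample.
counts = {3: 2735, 4: 907, 5: 5513, 6: 845}

n_orders = sum(counts.values())
mean_secondary = sum(k * v for k, v in counts.items()) / n_orders
share = {k: v / n_orders for k, v in counts.items()}
# Worst case observed: most secondary suppressions plus the 3 primary cells.
worst_case_suppressed = max(counts) + 3

print(n_orders, round(mean_secondary, 4), worst_case_suppressed)
```

Even in the worst sampled order, 9 of the 25 cells in the table end up suppressed.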

0 replies

ppdewolf · 2021-12-10T08:33:11Z

ppdewolf
Dec 10, 2021
Maintainer Author

Øyvind:
Very interesting discussion. I will just add some comments about Gauss suppression. I will not claim that it leads to correct suppression patterns in all cases. But it suppresses cells so that no suppressed cell can, in general, be recalculated as a linear combination of non-suppressed cells. By “in general” I mean a linear combination that applies independently of the cell values. Integer and non-zero requirements are not treated. But handling of singletons (and of zeros when they are suppressed) is included.

I’m not 100% sure, but it seems that the JJ-format corresponds to the model matrix. The tau-manual says that JJ “establish a link between the (hierarchical) tables and the structures required…..” An interpretation is that the structures required are the same as the M-dimensional cover table and that this M-dimensional table is the same as what we call inner cells. In our approach the model matrix is simply a huge dummy matrix with inner cells as rows and the cells to be published as columns. Sparse matrix methodology must be used so that large data can be handled. In Peter-Paul’s example the dummy model matrix is a 16 × 25 matrix. The column representing cell (k1,B) depends linearly on the 19 columns representing non-suppressed cells. Gauss suppression avoids such linear dependency.
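As an illustration of the dummy model matrix described here, the following sketch builds it for the 4 × 4 example in plain Python, using dense lists rather than the sparse matrices a real implementation would need. The names `covers`, `X` and `column` are hypothetical and not taken from any package.

```python
from itertools import product

MUN = ["k1", "k2", "k3", "k4"]
CAT = ["A", "B", "C", "D"]

# Rows: the 16 inner cells. Columns: the 25 cells to be published.
inner_cells = list(product(MUN, CAT))
published_cells = list(product(MUN + ["Total"], CAT + ["Total"]))


def covers(published, inner):
    """True if the inner cell contributes to the published cell."""
    return published[0] in (inner[0], "Total") and published[1] in (inner[1], "Total")


# Dummy (0/1) model matrix: X[i][j] == 1 if inner cell i is part of published cell j.
X = [[int(covers(p, c)) for p in published_cells] for c in inner_cells]


def column(published):
    """Column of X belonging to one published cell."""
    j = published_cells.index(published)
    return [row[j] for row in X]


print(len(X), len(X[0]))
print(sum(column(("k1", "Total"))), sum(column(("Total", "Total"))))
```

Each published cell is then the corresponding column of X multiplied by the vector of inner-cell counts, which is what makes linear dependence between columns the relevant notion for recalculation.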

In the beginning, I programmed Gram–Schmidt orthogonalization, Householder transformation and Gaussian elimination. All gave the same solution. The last of these could preserve sparsity (sparse matrices) and was therefore preferred. The method was first included as an additional method in the R package easySdcTable. Good experience with it, together with many issues with SIMPLEHEURISTIC in sdcTable, has led to GaussSuppression now being much preferred. Since we needed more flexibility and since the user interface in the R package SmallCountRounding could easily be re-used, the R package GaussSuppression was made.

0 replies

danlupp · 2021-12-10T09:07:25Z

danlupp
Dec 10, 2021

In response to email comments by Peter-Paul:
The l-diversity problem you mention is what we call the group disclosure problem in frequency tables (we use l-diversity in the case of microdata). In your example you would have 90% of the observations in the meaningful combination light+serious. Our rules state that we should not have group disclosure (with some flexible threshold for the percentage, which might depend on the level of sensitivity of the variable), neither in the categories at hand nor in the meaningful combinations. In some sense the meaningful combination light+serious would be primary as well and would need additional suppression. Looking at it in that way, you might end up with

mun light none serious Total
k2    NA   NA     NA     10

so not sure yet how to get to

mun light none serious Total
k2    NA   NA     5     10

My response:
Contrary to my first email (first post in this conversation), I guess I am advocating for not considering primary and secondary protection separately when trying to address the 1.5 problem. So if we have a hidden "injured" cell that is disclosed, this does not necessarily mean that all published cells within this hidden cell ("serious" and "light") must be primary suppressed. Rather, one can primary suppress this hidden cell and only consider publishable cells as possible secondary suppressions. So, to protect the hidden cell, one could secondary suppress "none", and then to protect "none" one could protect either "light" or "serious". In general these secondary cells need to be chosen carefully in order to prevent group disclosures, but I think it is possible. I have another proof-of-concept (I like tinkering, but I must emphasize that it's very hacked, and not "publishable" code) on the branch mentioned above which leads to the above suppression pattern (see the examples in the Roxygen documentation of SuppressDirectDisclosure2), but does not address the 1.5 problem in all cases (I have only changed the primary suppression, I haven't added any "smart" secondary treatment yet).

0 replies

danlupp · 2021-12-10T09:31:51Z

danlupp
Dec 10, 2021

Another question regarding meaningful combinations: Should we protect disclosures within meaningful combinations? That is to say, in

    mun light none serious Total
3    k3     2    5       1     8

the one seriously injured knows that all other injured are in the light category.
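A possible way to phrase this check, sketched in plain Python: apply the same kind of "at most one person outside" test, but with the meaningful combination as the reference group instead of the whole municipality. The function name and the threshold are assumptions for illustration only.

```python
def disclosive_within(row, members, max_outside=1):
    """Members of a combination that hold (almost) everyone in that combination."""
    combo_count = sum(row[m] for m in members)
    return [
        m for m in members
        if row[m] > 0 and combo_count - row[m] <= max_outside
    ]


k3 = {"light": 2, "none": 5, "serious": 1}
k4 = {"light": 4, "none": 4, "serious": 4}

flagged_k3 = disclosive_within(k3, ("light", "serious"))
flagged_k4 = disclosive_within(k4, ("light", "serious"))
print(flagged_k3, flagged_k4)
```

In k3 the category "light" is flagged, because only one injured person is outside it, while k4 raises no flag under this rule.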

0 replies

danlupp · 2022-09-05T07:12:15Z

danlupp
Sep 5, 2022

I would like to follow up on this discussion. We have come quite a bit further in this work, and wrote a short paper intended for PSD on it (we are in the process of preparing a journal version). Unfortunately, due to a series of misunderstandings on the part of the only reviewer we got, it was rejected. Since we had discussed the work leading to this paper, I thought you might be interested in the recent developments. The paper as submitted to PSD can be found here: link.

The general idea is as follows: as we discussed in Poznan and the above conversation, it's small differences between certain cells that lead to disclosure, not the frequency of certain cells. So the idea is that an attacker must not be able to recalculate these differences too precisely in order to prevent disclosure. To address this, we primary suppress these differences, not actual table cells. Then secondary suppression using only publishable cells as candidates protects the disclosure of these differences. In practice, this means the construction of disclosive differences as hidden cells.

This approach addresses everything we've discussed in the above conversation, in particular meaningful combinations (both disclosure of meaningful combinations, and disclosure within meaningful combinations). I'd be happy to discuss this with you, both in Paris if you are going and otherwise.

0 replies
