Skip to content

Ownership vs Access

Kinan Bab edited this page Apr 27, 2023 · 5 revisions

So far, we only looked at OWNED_BY annotations, which express a dual policy decision:

  1. Access: The associated data subject has access rights to the data (i.e. it's returned on GDPR GET).
  2. Deletion: The associated data subject has some effect on the deletion of data (i.e. it's deleted when the last owner is deleted, modulo anonymization). Internally, K9db implements this in the storage layer by storing a copy of the row in the shards associated with each owner data subject.

In this tutorial, we will look at ACCESSED_BY, which expressed the first policy above, but not the second. Internally, K9db does not store a copy of the row in the shards of associated accessing data subjects, which reduces overhead, and also ensures that the deletion of the row is under the control of its other owners, since K9db stores the row in their shards.

Medical chat

Let's look at an application scenario corresponding to a medical application. Specifically, a patient portal where patients and doctors can exchange chat messages.

Schema

In this application, we have two kinds of data subjects: users, and doctors. We can express this by creating two data subject tables, one for each kind.

CREATE DATA_SUBJECT TABLE doctors (
  ID int,
  name text,
  PRIMARY KEY(ID)
);

CREATE DATA_SUBJECT TABLE patients (
  ID int,
  name text,
  PRIMARY KEY(ID)
);

Consider now the chat table that stores messages between patients and doctors. This table has two foreign keys, one to the patient and one to the doctor. These chat messages may contain sensitive medical data about the patients, and thus our policy must provide patients with complete control over it. In particular, the messages should be deleted when the patient requests their data be deleted, regardless of the status of the corresponding doctor.

At the same time, the doctor requesting deletion of their data should have no affect on the messages exchanged with existing patients (with the exception of some non-intrusive anonymization of the doctor's identity). Since it is unacceptable for patients to lose some of their medical history because their doctors retired or moved to a different institution for example. With that said, it is reasonable to give doctors read access to the messages they exchanged with patients.

Thus, this is a scenario where the patient has both access and deletion rights, and thus is the owner of the data, while the doctor only has access rights. We can express this to K9db as below.

CREATE TABLE chat (
  ID int,
  patient_id int,
  doctor_id int,
  message text,
  PRIMARY KEY(ID),
  -- Patient is the sole OWNER.
  FOREIGN KEY (patient_id) OWNED_BY patients(ID),
  -- Doctor only has access
  FOREIGN KEY (doctor_id) ACCESSED_BY doctors(ID)
);

We can insert some data and validate that this indeed works as intended.

INSERT INTO patients VALUES (1, 'Alice');
INSERT INTO patients VALUES (2, 'Bob');

INSERT INTO doctors VALUES (10, 'Carl');
INSERT INTO doctors VALUES (20, 'Dracula');

INSERT INTO chat VALUES (1, 1, 10, 'Msg (1)');
INSERT INTO chat VALUES (2, 1, 10, 'Msg (2)');
INSERT INTO chat VALUES (3, 2, 10, 'Msg (3)');
INSERT INTO chat VALUES (4, 1, 20, 'Msg (4)');
INSERT INTO chat VALUES (5, 2, 20, 'Msg (5)');

When Carl requests a copy of his data, he should see the messages he exchanged with the patients. However, these messages should still exist if Carl decides to delete his data and leave the system. On the other hand, if Alice decides to delete her data, all of her messages should be deleted, regardless if it was exchanged with an existing or a deleted-since doctor.

-- Doctors see the chat
GDPR GET doctors 10;
+------+------+
| ID   | name |
+------+------+
|   10 | Carl |
+------+------+
1 row in set (0.000 sec)

+------+------------+-----------+---------+
| ID   | patient_id | doctor_id | message |
+------+------------+-----------+---------+
|    1 |          1 |        10 | Msg (1) |
|    3 |          2 |        10 | Msg (3) |
|    2 |          1 |        10 | Msg (2) |
+------+------------+-----------+---------+
3 rows in set (0.000 sec)
-- A doctor leaving does not affect the chat
GDPR FORGET doctors 10;
> Query OK, 1 row affected (0.004 sec)

SELECT * FROM chat;
+------+------------+-----------+---------+
| ID   | patient_id | doctor_id | message |
+------+------------+-----------+---------+
|    5 |          2 |        20 | Msg (5) |
|    3 |          2 |        10 | Msg (3) |
|    4 |          1 |        20 | Msg (4) |
|    2 |          1 |        10 | Msg (2) |
|    1 |          1 |        10 | Msg (1) |
+------+------------+-----------+---------+
5 rows in set (0.000 sec)
-- A patient leaving deletes their chat
GDPR FORGET patients 1;
> Query OK, 4 rows affected (0.005 sec)

SELECT * FROM chat;
+------+------------+-----------+---------+
| ID   | patient_id | doctor_id | message |
+------+------------+-----------+---------+
|    5 |          2 |        20 | Msg (5) |
|    3 |          2 |        10 | Msg (3) |
+------+------------+-----------+---------+
2 rows in set (0.000 sec)

Anonymization

While the above policy is reasonable at a high level, it has a few problematic issues:

  1. A doctor that requests a copy to their data can see the ID of the patient they communicated with. Doctors can save a copy of this data for their own record. If a patient then requests to leave the system, their ID remains in the saved copy that their doctors has.
  2. The doctor's identifier is similarly visible to patients that request a copy of their data.
  3. The doctor's identifier remains in the chat table in the database even after they request deletion.

We can address these issues with anonymization, both for when data subject requests deletion, e.g. for (3), and when they request access, e.g. for (1) and (2).

CREATE TABLE chat (
  ID int,
  patient_id int,
  doctor_id int,
  message text,
  PRIMARY KEY(ID),
  FOREIGN KEY (patient_id) OWNED_BY patients(ID),
  FOREIGN KEY (doctor_id) ACCESSED_BY doctors(ID),
  -- addresses (1)
  ON GET doctor_id ANON (patient_id),
  -- addresses (2)
  ON GET patient_id ANON (doctor_id),
  -- addresses (3)
  ON DEL doctor_id ANON (doctor_id)
);

The anonymization pattern we saw above appears to be a common pattern that many applications may rely on. Specifically, when there are multiple owners or accessing data subjects associated with some row, then it is often the case that the following anonymization rules should be applied:

  1. When one of the data subjects requests access, the identifiers corresponding to all other data subjects should be anonymized.
  2. When one of the data subjects requests deletion, their own identifier should be anonymized.

Note that (2) is not needed for patients, since the patient is the sole owner of the chat message, and thus the entire row is deleted when the patient requests deletion without need for anonymization.