-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Add multi-level security awareness to event formatter #30
Conversation
Still needs tests and documentation. |
I've refactored this PR a bit to allow for "multi-level security". |
Okay added tests and documentation. This is ready to merge if we think this is a reasonable approach. |
Oops, forgot some docs. Working on that now. The code/tests should be ready for review. |
Ok added initial docs, but they likely need work. |
@Zsailer this is great work. Thanks for diving into this important area of this work. I opened a new issue #31 with some comments on this, but will also add a note here. The GDPR has a useful description and classification of "personal data": that I think is helpful here. While I like the direction this is going, I am concerned about us picking security levels that are somewhat arbitrary. What if we just had two levels:
|
Another key point - adding a username or identifier to essentially any event causes that entire event to be considered personal data. Because of this, I think usernames/identifiers need special treatment. |
Thanks, @ellisonbg!
What if we filter out the username/identifier properties of the event using this PRs approach? Is the event still considered personal data? |
Where do we put the username/identifier on events. Is that a common property on a base event schema? If not, I think it should be - I don't want people trying to get around these things by calling it something different and pretending they aren't collecting it. You raise a great question though. Let's see if we can find an example of an event that shows how this would work. Let's say there is an event that has:
If username alone is removed, the event would definitely still be personal data. However, if both the usearname and social security number are removed, then I think the event (at least this one!) is no longer personal data. So as long as the filtering logic removes all personal data attributes, I think the event can survive...unless there are a combination of non-personal data that in aggregate becomes personal. This gets into the land of reidentification. When assessing properties as personal data or not, we need to be extremely skeptical about if they could be used to reidentify the individual. The definition of personal data in GDPR does a decent job of covering this, and it would probably be worthwhile for us to have an expert in data privacy law audit our assessment of properties once we get further along. |
I hope nobody puts social security numbers in JSON events! :D |
Link to SELinux where we stole this from: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/security-enhanced_linux/mls |
jupyter_telemetry/formatter.py
Outdated
key = (log_record['__schema__'], log_record['__version__']) | ||
schema = self.logger.schemas[key]['properties'] | ||
# Logging keys that won't be in the schema. | ||
ignored_keys = ['__schema__', '__timestamp__', '__version__', 'message'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably ignore every key that starts with and ends with two underscores, since these will never come from the user schema
We should also put |
For levels, I think we should go from our experience building schemas for things inside the Jupyter project, and be more descriptive. I propose that currently we only have three levels:
We should provide strong guidelines on what kinda data gets set to what level, and encourage people to adopt those. We can enforce that in the events we ship in our projects. However, we can't enforce it on others building custom events, and I hope we limit data collection in the default events we emit to only what's needed. |
docs/pages/schemas.rst
Outdated
@@ -33,6 +40,7 @@ Here is a minimal example of a valid JSON schema for an event. | |||
properties: | |||
name: | |||
title: Name | |||
level: confidential |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I don't know that this is a good example. Since the name
property is required for all events (right?), I think maybe this should always be unclassified. Maybe a two-field example better illustrates it:
name:
title: Name
level: unclassified
description: Name of event
user:
title: User name
level: confidential
description: Name of user who initiated event
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 good point. I'll update this.
docs/pages/sensitive-data.rst
Outdated
|
||
Jupyter Telemetry offers flexibility when logging and handling sensitive data. | ||
|
||
Since events may include data with varying degrees of sensitivity, Jupyter Telemetry uses a multi-level security approach. It exposed four levels of sensitivity: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a default level when unspecified, or is level always required to be explicit for every property?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now, the default is unclassified
if level is not given, but I think it's an open question before merging this. I actually think it makes sense to require the level to be explicit in each property... what are your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be nice to have it be required.
jupyter_telemetry/eventlog.py
Outdated
for handler in self.handlers: | ||
# If event_log is not an attribute of handler, patch in. | ||
if not hasattr(handler, 'event_level'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than modifying the object, we can use
event_level = getattr(handler, 'event_level', 'unclassified')
jupyter_telemetry/formatter.py
Outdated
|
||
super(JsonEventFormatter, self).__init__(*args, **kwargs) | ||
|
||
def setEventLevel(self, event_level): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a public setter API and a protected attribute for the same is not very Pythonic (much of the stdlib logging module is this way for historical reasons, but it doesn't need to be added to). This can be accomplished with a single property, eliminating the setEventLevel
setter:
@property
def event_level(self):
return self._event_level
@event_level.setter
def event_level(self, event_level):
if event_level not in EVENT_LEVELS:
raise Exception("Event level '{}' not understood.".format(event_level))
self._event_level = self.handler.event_level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is true— but in this case, we're overriding the method from the JsonFormatter
parent class (which is an external dependency). The issue is that this method is called by other places in this parent class. Thus, I think we need to keep this method signature consistent with the parent class. I realize this isn't Pythonic, but it's consistent with the method it overrides.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this was recently changed! Ignore my prior comment. :)
Updating to use property setter.
@yuvipanda how would you rank the three levels you mentioned above:
Isn't |
Thanks everyone for contributing to this discussion... The @Zsailer raises an important question about the ordering of the classes. I tend to think of these as more nominal categoricals, rather than ordinal. |
Apologies for having stepped away from this a while, and thank you for picking this back up, @Zsailer and @minrk. I agree with @ellisonbg - these aren't ordinal numerical levels. I think of username as differently sensitive than user-identifiable-information. For example, if the fact that my username is 'yuvipanda' leaks, it is a different issue than if my SSN or whatever leaks. So it is a useful distinction to make, without any comment about wether one is more sensitive or not. So this means admins would have to explicitly opt-in to each level, rather than a 'this level and below' option. |
Agreed that they should probably be treated as categorical rather than hierarchical. Think of it as a list of tags/topics to subscribe to, along the lines of: self.topics = ['user-identifiable', 'location']
def emit(self, record):
for field in fields:
# intersection = any match, use .issubset for all match
if schema[field]['topics'].intersection(self.topics):
include(field) A couple questions that follow from that:
I think it's a good idea to sketch some real events and labels/filters folks could want to see what's the best fit. I'm honestly not sure at this point. |
Min brings up great questions. I think the missing piece here is policies
about how different categories are handled. Some examples of policies that
one might apply to fields/events of a given class:
* Remove
* Hash using different algorithms (salted rotating hash, etc.)
* Allow
* Allow but fuzz using differential privacy
Which of these policies is applied to which classes has an immediate impact
on the bigger picture of data privacy and on legal compliance. I would love
to see our library to the next step and implement policies like this, as
some of them are really difficult to get right. I want to make it really
easy for operators to do the right thing and not make silly mistakes that
compromise data privacy/security.
Given this background, I will try to explore answers to Mins questions...
I think there is room for categories to be a free field that is matched
against. In terms of policies, that would roughly look like "apply this
policy to this tag."
If we want to use these categories to apply policies though, then I don't
think it makes sense to be able to apply multiple tags (or at least it
isn't obvious with the current tags we are discussing). This is mostly the
case because some of the policies are contradictory (allow/remove,
hash/don't hash).
…On Sun, Feb 9, 2020 at 6:50 AM Min RK ***@***.***> wrote:
Agreed that they should probably be treated as categorical rather than
hierarchical. Think of it as a list of tags/topics to subscribe to, along
the lines of:
self.topics = ['user-identifiable', 'location']
def emit(self, record):
for field in fields:
# intersection = any match, use .issubset for all match
if schema[field]['topics'].intersection(self.topics):
include(field)
A couple questions that follow from that:
- is there still a place for level *in addition to* category for
things like user-identifiable, or is category all we really need?
- Are categories a set dictated by this repo, or a free field to match
against for packages? My inclination would be the latter so deployments can
define some of their own categories for events, but that could become a
documentation/standardization challenge if there are too many sources for
events (notebook package, jupyterhub, spawner, local customization, etc.).
- If they are categorical, can an event have more than one category?
If so, does the subscription filter match using 'all' or 'any'? An event's
request ip address might reasonably be labeled 'user-identifiable' and
'location' for instance. Not sure if that would be a good practice or not.
I think it's a good idea to sketch some real events and labels/filters
folks could want to see what's the best fit. I'm honestly not sure at this
point.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#30?email_source=notifications&email_token=AAAGXUHUUXCGKUMZQOCUL33RCAJ2TA5CNFSM4I36SYM2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOELGOM4A#issuecomment-583853680>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAAGXUHMYHJTSHNEREDBKQ3RCAJ2TANCNFSM4I36SYMQ>
.
--
Brian E. Granger
Principal Technical Program Manager, AWS AI Platform ([email protected])
On Leave - Professor of Physics and Data Science, Cal Poly
@ellisonbg on GitHub
|
Great stuff here! Thanks, everyone. 👍 to categorical over hierarchical.
I'm finding that levels are hard to prescribe. Maybe we start with categories+policies first; then, add levels if need arises. I think @minrk is right—sketching out some "real" events around Jupyter apps might help us identify categories, policies, etc. I'll try to draft some example events for the notebook server and post them here. If someone else wants to try to sketching events for jupyterhub, spawner, etc., that would be great! In the meantime, I've been doing a deep dive into various legislation/documentation around data privacy—e.g. GDPR, CCPA, NIST (SP 800-53 Rev. 4), OMB (Memorandum 07-16)—to clearly define "personal data" under different regulatory regimes and what policies appropriate address privacy concerns in each regime. If anyone knows of other documents I should review, feel free to add them here. I'll create a summary document for this repo's docs. |
@Zsailer I think it's probably important, at least initially and for this repo, to only cover those things to the extent that we have suggested/defined categories for the right classes of data (i.e. is PII a single category, or are there more fine-grained classifications that we need). Assigning things to the right categories should be handled in the individual packages where events are being defined. 👍 to only having categories until and unless levels show up as having a real need. |
Closing in favor of #46 |
This PR adds "multi-level security" awareness to the event log.
Background
I've been playing around with different approaches for handling personally identifiable information when event logging. When I've talked to a few different stake-holders, PII alway became a big point of conversation. Some groups requires PII data in their event logs, others are required by law to drop it. It's a challenge to accommodate both groups. Here's what I came up with...
Technical Details
This PR patches the json formatter to look for a log level in the properties of event schemas. If the given handler that's emitting those events has an attribute
handler.event_level
, then properties with levels > event_level will be dropped from the emitted event. We should discuss what are appropriate level names. In this PR, I've just arbitrarily chosen:This requires operators to explicitly add
event_level
to each handler to emit sensitive data from that handler. Each handler can dictate its own level. The same event log can emit sensitive data to one handler and not another. This make it extremely flexible for managing this type of data.This also requires event schemas to include
level
attribute for each property in their JSON schema. If this attribute is missing from a property, it is assumed that the property is not "unclassified". Here's what an example schema might look like. Thename
property for any event with this schema is PII and will not be included iflevel = 'confidential'
on the emitting handler.For operators
It's simple to configure this option. Simply add an attribute to each handler:
When the event log is initialized, each handler will get the new JSON formatter with PII awareness.