Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Add multi-level security awareness to event formatter #30

Closed
wants to merge 8 commits into from

Conversation

Zsailer
Copy link
Member

@Zsailer Zsailer commented Sep 30, 2019

This PR adds "multi-level security" awareness to the event log.

Background

I've been playing around with different approaches for handling personally identifiable information when event logging. When I've talked to a few different stake-holders, PII alway became a big point of conversation. Some groups requires PII data in their event logs, others are required by law to drop it. It's a challenge to accommodate both groups. Here's what I came up with...

Technical Details

This PR patches the json formatter to look for a log level in the properties of event schemas. If the given handler that's emitting those events has an attribute handler.event_level, then properties with levels > event_level will be dropped from the emitted event. We should discuss what are appropriate level names. In this PR, I've just arbitrarily chosen:

  1. unclassified
  2. confidential
  3. secret
  4. top_secret

This requires operators to explicitly add event_level to each handler to emit sensitive data from that handler. Each handler can dictate its own level. The same event log can emit sensitive data to one handler and not another. This make it extremely flexible for managing this type of data.

This also requires event schemas to include level attribute for each property in their JSON schema. If this attribute is missing from a property, it is assumed that the property is not "unclassified". Here's what an example schema might look like. The name property for any event with this schema is PII and will not be included if level = 'confidential' on the emitting handler.

    $id: url.to.event.schema
    version: 1
    title: My Event
    description: |
      All events must have a name property
    type: object
    properties:
      name:
        title: Name
        description: |
            Name of event
        level: confidential
        type: string
    required:
    - name

For operators

It's simple to configure this option. Simply add an attribute to each handler:

import logging 

handler = logging.FileHandler('events.log')
handler.event_level = 'confidential'

eventlog = EventLog(
    handlers = [handler],
    ...
)

When the event log is initialized, each handler will get the new JSON formatter with PII awareness.

@Zsailer Zsailer changed the title Add PII awareness to event formatter [WIP] Add PII awareness to event formatter Sep 30, 2019
@Zsailer
Copy link
Member Author

Zsailer commented Sep 30, 2019

Still needs tests and documentation.

@Zsailer Zsailer changed the title [WIP] Add PII awareness to event formatter [WIP] Add multi-level security awareness to event formatter Oct 1, 2019
@Zsailer
Copy link
Member Author

Zsailer commented Oct 1, 2019

I've refactored this PR a bit to allow for "multi-level security".

@Zsailer
Copy link
Member Author

Zsailer commented Oct 2, 2019

Okay added tests and documentation. This is ready to merge if we think this is a reasonable approach.

@Zsailer
Copy link
Member Author

Zsailer commented Oct 2, 2019

Oops, forgot some docs. Working on that now. The code/tests should be ready for review.

@Zsailer
Copy link
Member Author

Zsailer commented Oct 2, 2019

Ok added initial docs, but they likely need work.

@Zsailer Zsailer changed the title [WIP] Add multi-level security awareness to event formatter [MRG] Add multi-level security awareness to event formatter Oct 2, 2019
@ellisonbg
Copy link
Collaborator

@Zsailer this is great work. Thanks for diving into this important area of this work. I opened a new issue #31 with some comments on this, but will also add a note here. The GDPR has a useful description and classification of "personal data":

https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr/key-definitions/what-is-personal-data/

that I think is helpful here. While I like the direction this is going, I am concerned about us picking security levels that are somewhat arbitrary. What if we just had two levels:

  1. Personal data
  2. Not personal data

@ellisonbg
Copy link
Collaborator

Another key point - adding a username or identifier to essentially any event causes that entire event to be considered personal data. Because of this, I think usernames/identifiers need special treatment.

@Zsailer
Copy link
Member Author

Zsailer commented Oct 3, 2019

Thanks, @ellisonbg!

adding a username or identifier to essentially any event causes that entire event to be considered personal data

What if we filter out the username/identifier properties of the event using this PRs approach? Is the event still considered personal data?

@ellisonbg
Copy link
Collaborator

Where do we put the username/identifier on events. Is that a common property on a base event schema? If not, I think it should be - I don't want people trying to get around these things by calling it something different and pretending they aren't collecting it.

You raise a great question though. Let's see if we can find an example of an event that shows how this would work. Let's say there is an event that has:

  • username
  • social security number
  • NEW NOTEBOOK ACTION WAS TRIGGERED!

If username alone is removed, the event would definitely still be personal data. However, if both the usearname and social security number are removed, then I think the event (at least this one!) is no longer personal data. So as long as the filtering logic removes all personal data attributes, I think the event can survive...unless there are a combination of non-personal data that in aggregate becomes personal.

This gets into the land of reidentification. When assessing properties as personal data or not, we need to be extremely skeptical about if they could be used to reidentify the individual. The definition of personal data in GDPR does a decent job of covering this, and it would probably be worthwhile for us to have an expert in data privacy law audit our assessment of properties once we get further along.

@yuvipanda
Copy link
Collaborator

I hope nobody puts social security numbers in JSON events! :D

@yuvipanda
Copy link
Collaborator

key = (log_record['__schema__'], log_record['__version__'])
schema = self.logger.schemas[key]['properties']
# Logging keys that won't be in the schema.
ignored_keys = ['__schema__', '__timestamp__', '__version__', 'message']
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably ignore every key that starts with and ends with two underscores, since these will never come from the user schema

@yuvipanda
Copy link
Collaborator

We should also put __event_classification__ or `event_level`` in all emitted events, so analysts can look at it and figure out what fields they can expect to be there. So you can figure out what attributes exist with a combination of (schema_name, version, event_level)

@yuvipanda
Copy link
Collaborator

For levels, I think we should go from our experience building schemas for things inside the Jupyter project, and be more descriptive. I propose that currently we only have three levels:

  1. unrestricted (or basic?)
  2. user-identifier (usernames, user id, etc)
  3. user-identifiable-information (uncleaned user agent, IPs, lat/lon (why?), social security number (usernames are not passwords, america!))

We should provide strong guidelines on what kinda data gets set to what level, and encourage people to adopt those. We can enforce that in the events we ship in our projects. However, we can't enforce it on others building custom events, and I hope we limit data collection in the default events we emit to only what's needed.

@@ -33,6 +40,7 @@ Here is a minimal example of a valid JSON schema for an event.
properties:
name:
title: Name
level: confidential
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I don't know that this is a good example. Since the name property is required for all events (right?), I think maybe this should always be unclassified. Maybe a two-field example better illustrates it:

name:
  title: Name
  level: unclassified
  description: Name of event
user:
  title: User name
  level: confidential
  description: Name of user who initiated event

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 good point. I'll update this.


Jupyter Telemetry offers flexibility when logging and handling sensitive data.

Since events may include data with varying degrees of sensitivity, Jupyter Telemetry uses a multi-level security approach. It exposed four levels of sensitivity:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a default level when unspecified, or is level always required to be explicit for every property?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, the default is unclassified if level is not given, but I think it's an open question before merging this. I actually think it makes sense to require the level to be explicit in each property... what are your thoughts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice to have it be required.

docs/pages/sensitive-data.rst Show resolved Hide resolved
for handler in self.handlers:
# If event_log is not an attribute of handler, patch in.
if not hasattr(handler, 'event_level'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than modifying the object, we can use

event_level = getattr(handler, 'event_level', 'unclassified')


super(JsonEventFormatter, self).__init__(*args, **kwargs)

def setEventLevel(self, event_level):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a public setter API and a protected attribute for the same is not very Pythonic (much of the stdlib logging module is this way for historical reasons, but it doesn't need to be added to). This can be accomplished with a single property, eliminating the setEventLevel setter:

@property
def event_level(self):
    return self._event_level

@event_level.setter
def event_level(self, event_level):
    if event_level not in EVENT_LEVELS:
         raise Exception("Event level '{}' not understood.".format(event_level))
    self._event_level = self.handler.event_level

Copy link
Member Author

@Zsailer Zsailer Jan 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is true— but in this case, we're overriding the method from the JsonFormatter parent class (which is an external dependency). The issue is that this method is called by other places in this parent class. Thus, I think we need to keep this method signature consistent with the parent class. I realize this isn't Pythonic, but it's consistent with the method it overrides.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this was recently changed! Ignore my prior comment. :)

Updating to use property setter.

@Zsailer
Copy link
Member Author

Zsailer commented Feb 7, 2020

@yuvipanda how would you rank the three levels you mentioned above:

  • unrestricted
  • user-identifier
  • user-identifiable-information

Isn't user-identifier more sensitive than user-identifiable-information? For example, a user's affiliation (e.g. Project Jupyter) can be personally-identifiable information when combined with other PII data, but alone, it's less identifiable/sensitive than actually using their username (e.g. zsailer), right? It's unclear to me what the sensitivity "level" of these categories would be.

@ellisonbg
Copy link
Collaborator

Thanks everyone for contributing to this discussion...

The user-identifier and user-identifiable-information classes are definitely classes that I see in the wild. From my perspective, the usefulness of having classes is in giving the operator information that can be used to (a) remove the information or (b) hash/anonymize appropriately. Because user identifiers tend to be the link that enables reidentification, I think they deserve special consideration. So, overall, I think the three classes proposed by @yuvipanda make sense.

@Zsailer raises an important question about the ordering of the classes. I tend to think of these as more nominal categoricals, rather than ordinal.

@yuvipanda
Copy link
Collaborator

Apologies for having stepped away from this a while, and thank you for picking this back up, @Zsailer and @minrk.

I agree with @ellisonbg - these aren't ordinal numerical levels. I think of username as differently sensitive than user-identifiable-information. For example, if the fact that my username is 'yuvipanda' leaks, it is a different issue than if my SSN or whatever leaks. So it is a useful distinction to make, without any comment about wether one is more sensitive or not.

So this means admins would have to explicitly opt-in to each level, rather than a 'this level and below' option.

@minrk
Copy link
Member

minrk commented Feb 9, 2020

Agreed that they should probably be treated as categorical rather than hierarchical. Think of it as a list of tags/topics to subscribe to, along the lines of:

self.topics = ['user-identifiable', 'location']

def emit(self, record):
    for field in fields:
		# intersection = any match, use .issubset for all match
        if schema[field]['topics'].intersection(self.topics):
            include(field)

A couple questions that follow from that:

  • is there still a place for level in addition to category for things like user-identifiable, or is category all we really need?
  • Are categories a set dictated by this repo, or a free field to match against for packages? My inclination would be the latter so deployments can define some of their own categories for events, but that could become a documentation/standardization challenge if there are too many sources for events (notebook package, jupyterhub, spawner, local customization, etc.).
  • If they are categorical, can an event have more than one category? If so, does the subscription filter match using 'all' or 'any'? An event's request ip address might reasonably be labeled 'user-identifiable' and 'location' for instance. Not sure if that would be a good practice or not.

I think it's a good idea to sketch some real events and labels/filters folks could want to see what's the best fit. I'm honestly not sure at this point.

@ellisonbg
Copy link
Collaborator

ellisonbg commented Feb 9, 2020 via email

@Zsailer
Copy link
Member Author

Zsailer commented Feb 10, 2020

Great stuff here! Thanks, everyone.

👍 to categorical over hierarchical.

is there still a place for level in addition to category for things like user-identifiable, or is category all we really need?

I'm finding that levels are hard to prescribe. Maybe we start with categories+policies first; then, add levels if need arises.

I think @minrk is right—sketching out some "real" events around Jupyter apps might help us identify categories, policies, etc. I'll try to draft some example events for the notebook server and post them here. If someone else wants to try to sketching events for jupyterhub, spawner, etc., that would be great!

In the meantime, I've been doing a deep dive into various legislation/documentation around data privacy—e.g. GDPR, CCPA, NIST (SP 800-53 Rev. 4), OMB (Memorandum 07-16)—to clearly define "personal data" under different regulatory regimes and what policies appropriate address privacy concerns in each regime.

If anyone knows of other documents I should review, feel free to add them here. I'll create a summary document for this repo's docs.

@minrk
Copy link
Member

minrk commented Feb 13, 2020

@Zsailer I think it's probably important, at least initially and for this repo, to only cover those things to the extent that we have suggested/defined categories for the right classes of data (i.e. is PII a single category, or are there more fine-grained classifications that we need). Assigning things to the right categories should be handled in the individual packages where events are being defined.

👍 to only having categories until and unless levels show up as having a real need.

@Zsailer
Copy link
Member Author

Zsailer commented May 19, 2020

Closing in favor of #46

@Zsailer Zsailer closed this May 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants