NVIDIA FLARE provides built-in federated statistics operators (a controller and executors) that can generate global statistics based on local client-side statistics.
At each client site, there can be one or more datasets (such as "train" and "test" datasets), and each dataset may have many features. For each feature in each dataset, we calculate the local statistics and combine them to produce global statistics for all of the numeric features. The output is the complete statistics for all datasets, both per-client and global.
The statistics supported are the commonly used ones: count, sum, mean, std_dev, and histogram for numerical features. Max and min are not included, as they might violate the client's data privacy. Median is not included due to the complexity of the algorithms. If the statistics sum and count are selected, the mean is calculated from them as mean = sum / count.
A client only needs to implement the selected methods of the Statistics class from statistics_spec.
The result is statistics for all features of all datasets at all sites, as well as global aggregates. The result can be visualized via the visualization utility in the notebook.
Assume that clients will provide the following:
- user needs to provide the target statistics, e.g. count and histogram only
- user needs to provide the local statistics calculation for the target statistics (by implementing the statistics_spec)
- user needs to provide the datasets and dataset features (feature name, data type)
- Note: count is always required, as we use count to enforce the data privacy policy. We only support numerical features, not categorical features; the user can still return all types of features, but the non-numerical ones will be removed.
We provide several examples to demonstrate how the operators should be used.
Please make sure you set up a virtual environment and JupyterLab following the example root README.
The first example calculates statistics for tabular data. The data can be loaded into a Pandas DataFrame and cached in memory, so we can leverage DataFrame and NumPy operations to calculate the local statistics.
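To make the statistics_spec contract concrete, here is a minimal sketch of such a local statistics generator. It assumes the Statistics base class and helper types (Feature, Histogram, Bin, etc.) from nvflare.app_common.abstract.statistics_spec; treat the exact signatures as assumptions to verify against your NVFLARE version, not as the shipped example code.

```python
import numpy as np
import pandas as pd

# assumed imports; check the exact module layout in your NVFLARE version
from nvflare.app_common.abstract.statistics_spec import (
    Bin, DataType, Feature, Histogram, HistogramType, Statistics,
)


class DFStatistics(Statistics):
    """Illustrative DataFrame-based local statistics generator."""

    def __init__(self, data_path: str):
        super().__init__()
        self.data_path = data_path
        self.data = {}  # dataset_name -> pd.DataFrame

    def initialize(self, fl_ctx):
        # load and cache the local datasets once
        self.data = {"train": pd.read_csv(self.data_path)}

    def features(self):
        # report each dataset's features (feature name + data type)
        return {"train": [Feature("Age", DataType.INT)]}

    def count(self, dataset_name: str, feature_name: str) -> int:
        return int(self.data[dataset_name][feature_name].count())

    def sum(self, dataset_name: str, feature_name: str) -> float:
        return float(self.data[dataset_name][feature_name].sum())

    def mean(self, dataset_name: str, feature_name: str) -> float:
        return float(self.data[dataset_name][feature_name].mean())

    def stddev(self, dataset_name: str, feature_name: str) -> float:
        return float(self.data[dataset_name][feature_name].std())

    def histogram(self, dataset_name: str, feature_name: str, num_of_bins: int,
                  global_min_value: float, global_max_value: float) -> Histogram:
        # bin the feature over the (possibly estimated) global range
        values = self.data[dataset_name][feature_name]
        counts, edges = np.histogram(
            values, bins=num_of_bins, range=(global_min_value, global_max_value)
        )
        bins = [Bin(float(edges[i]), float(edges[i + 1]), int(counts[i]))
                for i in range(num_of_bins)]
        return Histogram(HistogramType.STANDARD, bins)

    # if a feature's histogram range is not pre-defined in the server config,
    # min_value()/max_value() would also be needed for the global range estimate
```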
The result is saved to the job workspace in JSON format, which can be loaded into a Pandas DataFrame. In the Jupyter notebook, one can visualize it via the provided visualization utility. For example, this table shows the statistics for a particular feature, "Age", on each site and each dataset.
You can compare global and client feature statistics side by side for each dataset, for all features.
Here is an example of a histogram comparison.
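For instance, the saved statistics can also be loaded and flattened with plain Pandas; the output path and JSON layout below are assumptions that depend on the job configuration and the writer used:

```python
import json

import pandas as pd

# illustrative path -- the actual location depends on the writer's output_path
with open("statistics/stats.json") as f:
    stats = json.load(f)

# assuming the layout statistic -> site -> dataset -> feature -> value,
# flatten one statistic into a DataFrame for side-by-side comparison
rows = [
    {"site": site, "dataset": ds, "feature": feat, "mean": value}
    for site, datasets in stats["mean"].items()
    for ds, feats in datasets.items()
    for feat, value in feats.items()
]
print(pd.DataFrame(rows))
```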
The main steps are:
- provide the server-side configuration to specify the target statistics, their configurations, and the output location (see the sketch after this list)
- implement the local statistics generator (statistics_spec)
- provide the client-side configuration to specify the data input location
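As an illustration of the server-side piece, a configuration might look roughly like the following. The class paths and argument names here are assumptions based on the statistics controller and JSON writer described in this section; check them against the actual example code:

```json
{
  "workflows": [
    {
      "id": "fed_stats_controller",
      "path": "nvflare.app_common.workflows.statistics_controller.StatisticsController",
      "args": {
        "statistic_configs": {
          "count": {},
          "mean": {},
          "histogram": {
            "*": {"bins": 20},
            "Age": {"bins": 10, "range": [0, 120]}
          }
        },
        "writer_id": "stats_writer"
      }
    }
  ],
  "components": [
    {
      "id": "stats_writer",
      "path": "nvflare.app_common.statistics.json_stats_file_persistor.JsonStatsFileWriter",
      "args": {"output_path": "statistics/stats.json"}
    }
  ]
}
```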
The detailed example instructions can be found in Data frame statistics.
The second example is an image histogram example. Unlike the tabular data example, the image example shows the following:
- Since image_statistics.py only needs to calculate the count and histogram target statistics, the user only needs to provide the count, failure_count, and histogram functions; there is no need to implement the other statistics functions (sum, mean, std_dev, etc.). (get_failure_count returns 0 by default.) A sketch of such a generator follows this list.
- Each site's dataset contains several thousand images; the local histogram is the aggregate of all the individual image histograms.
- The image files are large, so we can't load everything into memory and then calculate the statistics; we need to iterate through the files for each calculation. For a single feature, as in this example, that is fine. If there are multiple features (such as multiple channels), reloading the image into memory for each channel's histogram calculation is quite wasteful.
- Unlike the data frame statistics example, the histogram bins' global range is pre-defined by the user as [0, 256], whereas in the data frame statistics example, all features' histogram global bin ranges except "Age" are dynamically estimated based on local min/max values.
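Here is a minimal sketch of such an image statistics generator, under the same statistics_spec assumptions as the earlier DataFrame sketch; the PNG-loading details are purely illustrative:

```python
import glob

import numpy as np
from PIL import Image

# same assumed statistics_spec imports as in the earlier sketch
from nvflare.app_common.abstract.statistics_spec import (
    Bin, DataType, Feature, Histogram, HistogramType, Statistics,
)


class ImageStatistics(Statistics):
    """Illustrative generator implementing only count, failure_count, histogram."""

    def __init__(self, image_dir: str):
        super().__init__()
        self.image_files = sorted(glob.glob(f"{image_dir}/*.png"))
        self.failures = 0

    def features(self):
        return {"train": [Feature("intensity", DataType.FLOAT)]}

    def count(self, dataset_name: str, feature_name: str) -> int:
        return len(self.image_files)

    def failure_count(self, dataset_name: str, feature_name: str) -> int:
        # number of images that failed to load; 0 when not overridden
        return self.failures

    def histogram(self, dataset_name: str, feature_name: str, num_of_bins: int,
                  global_min_value: float, global_max_value: float) -> Histogram:
        # the global range (e.g. [0, 256]) is pre-defined in the server config;
        # aggregate per-image histograms, loading one image at a time
        edges = np.linspace(global_min_value, global_max_value, num_of_bins + 1)
        totals = np.zeros(num_of_bins, dtype=np.int64)
        for path in self.image_files:
            try:
                pixels = np.asarray(Image.open(path)).ravel()
                totals += np.histogram(pixels, bins=edges)[0]
            except OSError:
                self.failures += 1
        bins = [Bin(float(edges[i]), float(edges[i + 1]), int(totals[i]))
                for i in range(num_of_bins)]
        return Histogram(HistogramType.STANDARD, bins)
```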
An example of an image histogram (the underlying image files have only one channel):
The Spleen CT Image Statistics example demonstrates a few more details of federated statistics:
- Instead of calculating the histogram locally on each image, this example shows how to get the local statistics from MONAI via the MONAI FLARE integration.
- To avoid reloading the same image into memory for each feature, this example shows how one can use the pre_run() method to load and cache the externally calculated statistics. The server-side controller passes the target statistics to pre_run() so they can be used to load the statistics; see the sketch below.
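A sketch of this caching pattern follows; the pre_run() signature shown is an assumption modeled on statistics_spec, and compute_all_stats_externally is a hypothetical stand-in for the MONAI-based computation:

```python
from typing import Dict, List, Optional

# assumed import; check against your NVFLARE version
from nvflare.app_common.abstract.statistics_spec import Statistics


def compute_all_stats_externally(statistics, num_of_bins, bin_ranges):
    # hypothetical stand-in for a one-pass, e.g. MONAI-based, computation
    return {}


class CachedImageStatistics(Statistics):
    def __init__(self):
        super().__init__()
        self.stats_cache = None

    def pre_run(self, statistics: List[str],
                num_of_bins: Optional[Dict[str, Optional[int]]],
                bin_ranges: Optional[Dict[str, Optional[List[float]]]]):
        # the controller passes the target statistics here; compute everything
        # in a single pass over the images and cache it, so the later
        # count()/histogram() calls just read from the cache
        self.stats_cache = compute_all_stats_externally(
            statistics, num_of_bins, bin_ranges
        )
```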
Hierarchical statistics involve the analysis of data that is structured in a hierarchical (multi-level) format, such as data grouped by different categories.
This approach allows for more accurate and meaningful analysis by accounting for the nested structure of the data. Hierarchical statistics are essential because they enable us to understand the variability at each level of the hierarchy, improve model accuracy, and make more precise inferences about the relationships within and between groups.
Consider, for example, medical device performance data.
Levels of Hierarchy:
- Manufacturer Level: Different manufacturers produce medical devices.
- Hospital Level: Hospitals use devices from various manufacturers.
- Device Level: Each hospital uses multiple devices.
Why We Need It:
- Understanding Variability: Analyze performance differences between manufacturers, hospitals, and individual devices.
- Improving Accuracy: Account for nested data structure to get precise performance estimates.
- Better Inferences: Identify top-performing manufacturers, hospitals needing support, and underperforming devices for recalibration or replacement.
Hierarchical statistics in this context help improve the reliability and effectiveness of medical devices by providing insights at multiple organizational levels.
As a second example, consider education data.
Levels of Hierarchy:
- State Level: Different states contain multiple universities.
- University Level: Each state has several universities.
- School Level: Each university consists of various schools or departments.
Why We Need It:
- Understanding Variability: Analyze performance differences between states, universities, and individual schools.
- Improving Accuracy: Account for the nested structure to achieve precise performance metrics.
- Better Inferences: Identify top-performing states, universities that need additional resources, and schools that require targeted interventions.
Hierarchical statistics in this context help optimize educational policies and resource allocation by providing detailed insights across different organizational levels.
This example shows how to generate hierarchical statistics for data that can be represented as a Pandas DataFrame.
Hierarchical Data frame statistics
Here is an example of the generated hierarchical statistics for the student data from different universities.
The generated statistics include expandable global and level-wise stats.
Expanding "Global" shows a visualization of the global statistics similar to the following.
The following visualization shows example hierarchy-level global statistics.
The visualization of local stats at the last hierarchical level should look similar to the following.
The main steps are:
- provide the server-side configuration to specify the target statistics, their configurations, and the output location
- implement the local statistics generator (statistics_spec)
- provide the client-side configuration to specify the data input location
- provide a hierarchy specification file giving details of all the clients and their hierarchy (a hypothetical sketch follows this list)
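The exact schema of the hierarchy specification file is defined by the example itself; purely as a hypothetical illustration (all names below are made up), it could describe the client tree along these lines:

```json
{
  "States": [
    {
      "Name": "state-1",
      "Universities": [
        {"Name": "university-1", "Schools": ["site-1", "site-2"]},
        {"Name": "university-2", "Schools": ["site-3", "site-4"]}
      ]
    }
  ]
}
```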
NVFLARE provides data privacy protection through privacy filters (see privacy-management). Each site can have its own privacy policy.
privacy.json provides the local, site-specific privacy policy. The policy is typically set up by the company and implemented by the organization admin for the project; for different types of scopes or categories, there may be different types of policies.
The NVFLARE privacy configuration consists of a set of task data filters and task result filters:
- Task data filters apply before the client executor executes.
- Task result filters apply after the client executor, before results are sent to the server.
- Both data filters and result filters are grouped via scope.
Each job needs a privacy scope. If none is specified, the default scope is used. If the default scope is not defined and the job doesn't specify a privacy scope, the job deployment fails and the job is not executed.
There are different ways to set privacy filters depending on the use case.
You can specify "task_result_filters" in config_fed_client.json to define the privacy control. This is useful while you are developing these filters; an illustrative sketch follows.
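For instance, a job-level filter entry in config_fed_client.json might look roughly like this. The filter name and argument are assumptions based on the StatisticsPrivacyFilter described below, and each id would refer to a cleanser component defined elsewhere in the config:

```json
"task_result_filters": [
  {
    "tasks": ["fed_stats"],
    "filters": [
      {
        "name": "StatisticsPrivacyFilter",
        "args": {
          "result_cleanser_ids": [
            "min_count_cleanser",
            "min_max_cleanser",
            "histogram_bins_cleanser"
          ]
        }
      }
    ]
  }
]
```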
Once the company decides to enforce a certain privacy policy independently of individual jobs, one can copy the privacy.json content from the local directory into the clients' local privacy.json (merge, not overwrite). In this example, since we only have one app, we can simply copy privacy.json from the local directory to:
- site-1/local/privacy.json
- site-2/local/privacy.json
We then need to remove the same filters from the job definition in config_fed_client.json, simply setting "task_result_filters" to an empty list to avoid double filtering:
"task_result_filters": []
Privacy filters are defined within a privacy scope. If a job's privacy scope is defined or the default scope applies, the scope's filters (if any) are applied before the job-specified filters (if any). This rule is enforced at task execution time.
Given these rules, if we have both task result filters and privacy-scoped filters, the privacy filters are applied first, then the job filters.
Statistics privacy filters are task result filters. We have already built one for statistics:
StatisticsPrivacyFilter
The StatisticsPrivacyFilter consists of several StatisticsPrivacyCleansers focused on the statistics sent from client to server. A StatisticsPrivacyCleanser can be considered an interceptor applied before the results are delivered to the server. Currently, we use three StatisticsPrivacyCleansers to guard data privacy. The reason we built StatisticsPrivacyCleansers instead of separate filters is to avoid repeated data de-serialization.
MinCountCleanser
Checks the count returned from the client for each dataset and each feature. If min_count is not satisfied, there is a potential risk of revealing the client's real data, so that feature's statistics are removed from this client's results.
HistogramBinsCleanser
For histogram calculations, the number of bins can't be too large compared to the count: if bins = count, we would again reveal the real data. This check makes sure that the number of bins is less than X percent of the count, where X = max_bins_percent expressed as a percentage (e.g. 10 for 10%). If a histogram's number of bins does not satisfy this condition, the resulting histogram is removed from the statistics before they are sent to the server.
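In simplified form, these two checks amount to something like the following sketch (not the actual cleanser implementations):

```python
def passes_min_count(count: int, min_count: int = 10) -> bool:
    # keep a feature's statistics only if enough samples contributed
    return count >= min_count


def passes_bin_check(num_bins: int, count: int, max_bins_percent: float = 10.0) -> bool:
    # keep the histogram only if the bins are few relative to the sample count;
    # max_bins_percent = 10 means num_bins must stay below 10% of count
    return num_bins < (max_bins_percent / 100.0) * count
```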
AddNoiseToMinMax
For histogram calculations, if the feature's histogram bin range is not specified, we need to use the local data's min and max values to calculate the global min/max values, which are then used as the bin range for the histogram calculation. But sending the local min/max values to the server would reveal the client's real data, so to protect data privacy we add noise to the local min/max values.
A random noise factor between min_noise_level and max_noise_level is used; for example, with noise within (0.1, 0.3), i.e. a 10% to 30% noise level, the reported local min values are made smaller than the true local min values and the reported max values larger than the true local max values. As a result, the estimated global min/max values (i.e. with noise) still bound the true global min/max values, such that:
est. global min value < true global min value < client's min value < client's max value < true global max value < est. global max value
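A simplified sketch of this noise injection; the actual cleanser's handling of edge cases (e.g. zero-width ranges) may differ:

```python
import random


def noised_range(local_min: float, local_max: float,
                 min_noise_level: float = 0.1, max_noise_level: float = 0.3):
    # draw a random noise factor in [min_noise_level, max_noise_level], e.g. 10%-30%
    factor = random.uniform(min_noise_level, max_noise_level)
    spread = abs(local_max - local_min) or 1.0  # guard against a zero-width range
    # report a min below the true local min and a max above the true local max
    return local_min - factor * spread, local_max + factor * spread
```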
Some local statistics (such as count, failure_count, and sum) can be calculated in one round, while other statistics, such as stddev and histogram (if the global bin range is not specified), need two rounds of calculation. We designed a workflow that essentially issues three round trips to the clients:
- pre_run() -- the controller sends the clients the target statistics information
- 1st statistics task -- the controller sends the clients the 1st set of target statistics, and collects local max/min if global min/max estimation is needed
- 2nd statistics task -- based on the aggregated global statistics, the 2nd round calculates the variance (with the global mean) and the histogram over the global bin range (or the estimated global range)
The following sequence diagram provides the details.
sequenceDiagram
participant FileStore
participant Server
participant PrivacyFilter
participant Client
participant Statistics
Server->>Client: pre_run: optional handshake (if enabled), passing targeted statistic configs
Server->>Client: task: Fed_Stats: statistics_task_1: count, sum, mean, std_dev, min, max
loop over dataset and features
Client->>Statistics: local stats calculation
end
Client-->>PrivacyFilter: local statistic
loop over StatisticsPrivacyCleansers
PrivacyFilter->>PrivacyFilter: min_count_cleanser, min_max_cleanser, histogram_bins_cleanser
end
PrivacyFilter-->>Server: filtered local statistic
loop over clients
Server->>Server: aggregation
end
Server->>Client: task: Fed_Stats: statistics_task_2: var with input global_mean, global_count, histogram with estimated global min/max
loop over dataset and features
Client->>Statistics: local stats calculation
end
Client-->>PrivacyFilter: local statistics: var, Histogram, count
loop over StatisticsPrivacyCleansers
PrivacyFilter->>PrivacyFilter: min_count_cleanser, min_max_cleanser, histogram_bins_cleanser
end
PrivacyFilter-->>Server: filtered local statistic
loop over clients
Server->>Server: aggregate var, std_dev, histogram
end
Server->>FileStore: save to file
We provide federated statistics operators that can easily aggregate and visualize the local statistics for different data sites and features. We hope this feature makes it easier to perform federated data analysis.