Category Proposal: Convolutional Neural Networks #3
Comments
Probably some of these should include max pooling, and some could just have convolution layers. |
I think we should specify the activation function here. I presume we intend these to be plain ReLUs. Batch normalisation: I do not see this making a difference in terms of verification, so I would leave this out. Max pooling: should this be a subcategory? Benchmarks: maybe also something in between MNIST and CIFAR? |
Agreed, I think based on the feedback so far, this category is going to make sense to break down further into subcategories. |
Thanks for this feedback. Yes, I'm imagining some ReLU layers given they show up frequently in CNNs, but also some explicit convolutional layers. While batch normalization is easy to handle for verification (as it's linear/affine after training), many tools do not support it, and it can (at least in our experience) drastically influence training, hence showing up in many real networks. As it's probably not widely supported, I think this would be a subcategory, unless networks can be transformed equivalently/soundly to, e.g., fully connected layers to handle these.
To make this clearer, refer e.g. to these networks for MNIST, for which we support all the layers natively in NNV (convolutional layers, batch normalization layers, max pool layers, ReLU layers, fully connected layers, and softmax layers). I imagine various other methods/tools have restrictions that may prevent analysis of these, at least without transformation. If a transformation is needed, that is potentially okay, as I am aware that some methods convert some of these layers to fully connected ones to handle them. We'd want to be sure of the soundness of that transformation, but I presume that would be a meta-theorem in the papers on the tool methods if such a transformation is needed. Questions:
|
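As a side note on the transformation point above: after training, a batch-normalization layer is just an affine map, so it can be folded into the preceding convolutional (or fully connected) layer, which is one way tools without native batch-norm support could still handle these networks. A minimal sketch of that folding is below; the function and parameter names are generic placeholders rather than anything taken from NNV or another tool.

```python
import numpy as np

def fold_batchnorm_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a trained BatchNorm layer into the preceding convolution.

    W: conv weights, shape (out_channels, in_channels, kh, kw)
    b: conv bias, shape (out_channels,)
    gamma, beta, mean, var: BatchNorm parameters, each of shape (out_channels,)

    After training, BatchNorm is the fixed affine map
        y -> gamma * (y - mean) / sqrt(var + eps) + beta,
    so it can be absorbed into the convolution's weights and bias.
    """
    scale = gamma / np.sqrt(var + eps)           # per-output-channel scale
    W_folded = W * scale[:, None, None, None]    # rescale each output channel
    b_folded = (b - mean) * scale + beta         # shift the bias accordingly
    return W_folded, b_folded
```

The same folding applies to a fully connected layer by broadcasting the scale over the rows of the weight matrix instead of the convolution's output channels.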
This is to follow up on Taylor's request. |
I'd like the nnenum tool to participate in this category, although I probably won't add support for pooling layers. |
I'd like to enter the MIPVerify tool for this category as it is currently described. More details on the tool are given in my other comment:
|
I would like to enter our tool ERAN in this category: For convolutional networks, besides intensity-based robustness, it is highly desirable to also have robustness against geometric transformations. Are we going to include this or only norm-based robustness? Cheers, |
Thanks for the feedback. While I think not all methods may support this (see e.g. @vtjeng's comment above), I think it would be great to include such perturbations and see which ones can handle them. Could you please share some details on how the transformations are specified so we can start to think about how to make this interoperable between methods? I think the most extensible way to ease comparison will probably be in terms of the input sets, but I realize this may not be ideal compared to specifying, say, a rotation/translation/etc. by some amount. |
The oval group would like to sign up for this category. Our work will be based on the methods presented in the journal paper 'Branch and Bound for Piecewise Linear Neural Network Verification' and developments thereof. Cheers, |
CNN Category Participants: Oval
If anyone else plans to join this category, please add a comment soon, as the participants in this category need to decide on the benchmarks shortly (by about May 15). |
We would like to enter the ARFramework tool (https://github.com/formal-verification-research/ARFramework) into this category. I apologize for the late response; finishing out this semester has been a little hectic. I hope this does not cause too much inconvenience. On the subject of formatting, the ARFramework tool operates on TensorFlow graphs exported as Google protocol buffers. If we can work out a way to translate the benchmark networks to protocol buffers, this would greatly facilitate our participation. Also, because the ARFramework tool makes use of a modified Fast Gradient Sign Method to generate adversarial examples, access to the gradient of the network with respect to the input is necessary. I would be happy to participate in the network formatting development in order to ensure that this tool and all others have their input requirements met. Also, I was wondering how we are planning to specify the robustness property. Would someone kindly post a formal specification of the property we will be verifying so that we can check compatibility with our tool? That would be greatly appreciated. |
Hi, We would like to enter VeriNet https://vas.doc.ic.ac.uk/software/neural/. The toolkit supports:
Regards, |
Finalized CNN Category Participants: Oval
I will add some discussion on the benchmark plans soon. As indicated earlier, we will decide the benchmarks as a group, and anyone may propose some. However, by default, we will provide some CNNs that we've analyzed with NNV, and likewise anyone else may propose to provide ones they've analyzed. As there is a lot of parameterization here (the network architecture, the data set, the specification, etc.), those are the details we will want to discuss and finalize next, as well as how all of these aspects will be specified to ease interoperability between the tools. |
Hi there, It has been the case that different tools come up with their own neural network formats, which makes it very difficult to reuse those networks in other tools. I think all networks should be provided in ONNX format, as it provides a common format for networks trained in different frameworks, e.g., PyTorch and TensorFlow. We would propose the following networks and specifications:
We should have two types of verification tasks: complete and incomplete verification. For complete verifiers, we can compare speeds on small provably trained and non-provably trained networks. We can use the ground truth from complete verifiers to test the soundness of participating tools, so that one does not win the competition simply by always returning "Verified" :). We can disqualify a tool if it has too many unsound results. For incomplete verifiers, we can compare precision and speed by comparing the number of instances proved within a given time limit. Cheers, |
Thanks for the feedback! Responses inline.
Yes, the plan is to use ONNX for the networks. The other parts (specs in particular, and to some degree data sets) will be harder; part of the goal of VNN-COMP is to generate feedback as VNN-LIB gets created. Some participants (@dlshriver) have developed a translation tool from/to ONNX for various tools: https://github.com/dlshriver/DNNV It has some other features, such as a DSL for specifications, which we can explore using. Thanks for the model pointers; I will look at them and encourage the other participants to do so as well. I think given time constraints it won't be realistic to consider everything, so we will need to identify some reasonable subset, and all participants in the category should agree on it.
Earlier we discussed a different form of categorization based on completeness, but that did not lead to much discussion. I agree the distinction is important, but I would suggest it show up later, when results are aggregated into a report. With respect to the benchmarks, what matters at this stage is ensuring there is sufficient diversity from smaller through larger instances, so that there are benchmarks on which the complete methods may finish in reasonable time. We could consider subdividing the benchmarks on this basis, but I think that is premature; the benchmarks could be attempted by both complete and incomplete methods, where the complete ones will of course hit scalability boundaries (which is one of the things we aim to identify with this initiative).
Regarding the competition aspect: as this is a first instance to get things started, we're not planning to create a formalized ranking (so really more of a challenge than a competition), and we can further discuss how these things would show up. I do not think we should have any plans to disqualify participants, as a goal of this is to foster the community. Of course, ensuring the results of participants on benchmarks are valid is important, but for some of this, basic testing approaches for checking ground truth may be sufficient, particularly for instances where complete methods will not scale.
For reference, how this is typically done in related initiatives (e.g., SV-COMP/SMT-COMP) is that there is a large penalty for a soundness bug (e.g., a score of +1 for each instance correctly proven within a time limit, a score of 0 or -1 for not returning a result in time, and a score of -100 for a soundness bug, such as the tool reporting something is verified when it is false). I don't think we should have such a detailed scoring system this year (for a plethora of reasons, such as the lack of common infrastructure on which to run so comparisons are completely fair, benchmark standardization, etc.), but we should discuss how to report timeouts, correct vs. incorrect results, etc. from the tools, and use this discussion to guide how future iterations may incorporate tool results into a scoring system. |
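Purely as an illustration of the SV-COMP/SMT-COMP-style scoring mentioned above (not a rule adopted for this iteration), a hypothetical per-instance scoring function might look like the following, using the example point values from the comment:

```python
def score_instance(reported, ground_truth):
    """Hypothetical SV-COMP-style score for a single benchmark instance.

    reported: 'sat' (counter-example found), 'unsat' (property verified),
              or None if the tool produced no result within the time limit.
    ground_truth: 'sat' or 'unsat'.
    """
    if reported is None:
        return 0        # no result in time (could also be -1)
    if reported == ground_truth:
        return 1        # correct result within the time limit
    return -100         # soundness bug: wrong verdict
```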
Hi, We would also like to enter Venus (https://vas.doc.ic.ac.uk/software/neural/) in this category. Best, Panagiotis |
I agree that we want to run this in a friendly manner to help the community rather than proclaiming a winner. If, for example, we come out of this with a preferred input format and have raised the profile of the topic, I think these alone will have made the effort worthwhile. Having said this, I think we need to clearly differentiate complete vs. incomplete methods. The issue is orthogonal to the categories (challenge problems) themselves, but each approach comes with advantages and disadvantages, and these ought to be reasonably clear to a potentially interested party trying to understand the results. Maybe we ought to consider grouping tools together as complete or incomplete in the reporting phase. |
On the topic of making comparisons within a time limit, I think it would be valuable to have all verifiers (complete and incomplete) report the amount of time required to verify each property of interest (with a reasonable overall maximum time to spend on each property). This would allow us to:
I agree with @alessiolomuscio's suggestion of the importance of clearly differentiating between complete and incomplete methods. One way we might be able to do so is to differentiate in our final report between cases where we timed out (which could happen to both complete and incomplete methods) and cases where the algorithm runs to completion and reports failure (which only happens to incomplete methods). |
Thanks for the feedback! Yes, I agree we will include the timing information, and can group in the resulting report the complete/incomplete methods results. |
Quick update: we have gone through all the proposed networks and will provide some additional detailed feedback shortly. I think most of the networks can be used, although ideally the proposer of any network will provide an ONNX version of it soon. We have done some conversion for the proposed networks and will share that soon, but there are a variety of intricacies of ONNX that need to be considered (e.g., precision of weights, additional layers added in conversion, etc.), so we would prefer that anyone proposing a network provide it in ONNX to avoid unintentional changes, as you're the ones most familiar with your own networks. We will do this for our MNIST networks discussed above.
Given our time constraints and the need to get everything finalized and distributed very soon, I would suggest we focus only on MNIST for this iteration (and skip e.g. some of the CIFAR proposals). If anyone feels strongly otherwise, please let us know. This is in part because we also need to quickly finalize the other inputs (e.g., specific images for local robustness, local robustness bounds/specifications, etc.) for a specific verification problem. Additionally, from the proposed networks, I think only a couple of tools currently support residual networks, so I would suggest either excluding those for this iteration, or focusing on them later if we have time / by the participants who currently support resnets (I think only ERAN and MIPVerify from the discussion).
We can discuss the submission of results more, but for planning purposes, we're anticipating that participants will submit a Dockerfile and a script to execute everything in the category and produce a LaTeX table with runtimes, verification results, etc. If we have time, we will then run these on a common infrastructure; otherwise we will just report the characteristics of the machine used by each participant. |
A question about submission formats: I would like to submit a Dockerfile so that the work is completely reproducible, but I don't know whether this can be done when using the Gurobi solver. Does anyone else have a setup involving a Dockerfile that works with Gurobi? |
I think others also use Gurobi, so they can likely respond in more detail, but this should be possible simply by not including a license; the user can then install one manually. We do something similar with NNV for its Docker-based runs, given that it uses Matlab. We can discuss this more in depth as we get to that stage; e.g., if the binaries/installation files also aren't openly available, some setup instructions could be provided. From a quick search (looks a little old):
Hi all, We are happy to provide 2 MNIST and 2 CIFAR10 convolutional networks in the ONNX format here: https://github.com/eth-sri/colt/tree/master/trained_models/onnx The epsilon values of the L_oo ball for the MNIST networks are 0.1 and 0.3, while those for the CIFAR10 networks are 2/255 and 8/255. The network names contain the epsilon values. All networks expect input images to first be normalized to the range [0,1]. Next, the standard mean and standard deviation should be applied to the normalized images before passing them to the networks. The mean and standard deviation for MNIST and CIFAR10 can be found here: https://github.com/eth-sri/colt/blob/acea4093d0eebf84b49a7dd68ca9a28a08f86e67/code/loaders.py#L7 Let me know if anyone has any trouble running the networks. Cheers, |
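For anyone wiring this up, here is a minimal sketch of the two preprocessing steps described above (scale raw pixels to [0,1], then standardize with the per-channel mean and standard deviation). The constants below are the commonly used MNIST/CIFAR-10 values and are placeholders only; the authoritative numbers are in the loaders.py file linked above.

```python
import numpy as np

# Placeholder constants; the authoritative values are in the linked loaders.py.
MNIST_MEAN, MNIST_STD = np.array([0.1307]), np.array([0.3081])
CIFAR10_MEAN = np.array([0.4914, 0.4822, 0.4465])
CIFAR10_STD = np.array([0.2023, 0.1994, 0.2010])

def preprocess(image_uint8, mean, std):
    """Scale raw 0..255 pixels to [0,1], then apply per-channel standardization.

    image_uint8: array of shape (channels, height, width); the channel-first
    layout is an assumption and should be checked against the ONNX input shape.
    """
    x = image_uint8.astype(np.float32) / 255.0               # step 1: scale to [0,1]
    return (x - mean[:, None, None]) / std[:, None, None]    # step 2: standardize
```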
@GgnDpSngh which input images are intended for the networks? |
The first 100 images of the MNIST and CIFAR10 test sets should be fine, so that there is a good mix of easy, medium, and hard instances for verification. We provide them here: We can filter out images that are not correctly classified. Also, I was wondering if we should have the challenge live throughout the year, so that if anyone gets better results they can add them to the result table. Similar to this: |
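A minimal sketch of that selection/filtering step, assuming onnxruntime and inputs already preprocessed as described above; the file name, the (N, C, H, W) layout, and the helper itself are assumptions rather than a prescribed interface.

```python
import numpy as np
import onnxruntime as ort

def first_k_correctly_classified(model_path, images, labels, k=100):
    """Return indices (within the first k test images) that the network classifies correctly.

    images: preprocessed inputs of shape (N, C, H, W), float32;
    labels: ground-truth classes of shape (N,);
    model_path: placeholder path to one of the ONNX networks linked above.
    """
    sess = ort.InferenceSession(model_path)
    input_name = sess.get_inputs()[0].name
    keep = []
    for i, (x, y) in enumerate(zip(images[:k], labels[:k])):
        logits = sess.run(None, {input_name: x[None].astype(np.float32)})[0]
        if int(np.argmax(logits)) == int(y):   # drop misclassified images
            keep.append(i)
    return keep
```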
Thanks very much for providing the networks, specs, and inputs! Regarding the continual challenge aspect: I think that's an interesting idea to consider, let's bring this up at the workshop so we can discuss in more detail how that might work. I think as long as people submit Dockerfiles and scripts to re-generate everything, something like that could be feasible without too much effort. Part of what we'll discuss in the workshop is how this first initiative went, what changes could be made going forward, etc. |
@FlorianJaeckle just to confirm, for the images, should we be using the CIFAR-10 test batch? The dataset (http://www.cs.toronto.edu/~kriz/cifar.html) also includes 5 data batches. There was also some strange behavior in the deep100 table, where the index and prop columns were sometimes stored as floating-point numbers instead of ints, but that's a small issue. Also for others: are you just analyzing the 100 image versions of this, or both the 100 and 20 image subsets? |
@FlorianJaeckle I'm having trouble running the networks on the CIFAR dataset. The produced labels don't match the expected CIFAR labels. Are we supposed to scale the pixel values to [0,1]? Is there a normalization we're supposed to be using for each channel? Are the channel orders R, G, B? Is the classification label the max or the min? Can I confirm the images are from the test batch and not one of the other batches? Was anyone else able to execute these networks? |
Hi @stanleybak, |
Hi @FlorianJaeckle, That helped! Now things are classifying correctly. One more question: is the epsilon perturbation applied before normalization, or after? |
@stanleybak, |
@FlorianJaeckle |
@FlorianJaeckle, I've generally seen the epsilon perturbation being applied before normalization (that is, if
To confirm, in your case, this is |
Also, is the timeout 1 hour in total or per property? And are we testing 100 or 20 properties for each network? Cheers, |
@stanleybak, all properties are unsat; i.e., there does not exist a counter-example. In other words, the network is robust around the image. Can you give an example of a property where your method returned an adversarial example?
@vtjeng, given an image x and an epsilon value, the input domain for the CNN is [Normalize(x)-e, Normalize(x)+e].
@GgnDpSngh, the timeout is 1 hour per property. |
For the normalization, if you want, you can convert the epsilon values to pre-normalization values by multiplying by the std deviation (0.225), since it's the same for all channels. I also assume there's trimming at the limits, which wasn't mentioned before. @FlorianJaeckle, I think I found my problem. Since there is a target class, is the correct condition to check (1)
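To make the two views of epsilon concrete, here is a minimal sketch of computing the input bounds for the L_oo ball [Normalize(x)-e, Normalize(x)+e], with the equivalent pre-normalization radius obtained by multiplying by the std as noted above. The optional clipping to valid pixel values reflects the assumption raised in this comment rather than anything confirmed by the benchmark authors, and the mean/std arguments are placeholders.

```python
import numpy as np

def input_bounds_normalized(x01, eps, mean, std, clip_raw=False):
    """Bounds for the L_oo ball [Normalize(x)-eps, Normalize(x)+eps].

    x01: raw image already scaled to [0,1], shape (C, H, W);
    eps: radius in *normalized* space (as specified for these benchmarks);
    mean, std: per-channel normalization constants, each of shape (C,).
    The equivalent pre-normalization radius is eps * std, since the
    normalization map is (x - mean) / std.
    """
    m, s = mean[:, None, None], std[:, None, None]
    z = (x01 - m) / s
    lb, ub = z - eps, z + eps
    if clip_raw:
        # Optional: keep only perturbations whose raw pixel values stay in [0,1]
        # (the "trimming at the limits" assumption mentioned above).
        lb = np.maximum(lb, (0.0 - m) / s)
        ub = np.minimum(ub, (1.0 - m) / s)
    return lb, ub
```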
@stanleybak since they want the target label to never be chosen for perturbed images, I guess the condition to verify should be that output[original_label] > output[target_label] holds. This is how I have been running ERAN, but maybe @FlorianJaeckle can confirm? Cheers, |
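For reference, a minimal sketch of evaluating that condition on a single concrete (already preprocessed) input, e.g., to cross-check a candidate counter-example. onnxruntime and the helper name are assumptions, and the inequality is the one stated in this comment, pending the confirmation asked for here.

```python
import numpy as np
import onnxruntime as ort

def satisfies_targeted_robustness(model_path, x, original_label, target_label):
    """Check output[original_label] > output[target_label] for one input.

    x: a single preprocessed input of shape (C, H, W), float32.
    A perturbed input for which this returns False would be a concrete
    counter-example (SAT witness) for the targeted-robustness property.
    model_path is a placeholder for one of the benchmark ONNX networks.
    """
    sess = ort.InferenceSession(model_path)
    input_name = sess.get_inputs()[0].name
    out = sess.run(None, {input_name: x[None].astype(np.float32)})[0].ravel()
    return bool(out[original_label] > out[target_label])
```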
@stanleybak the first condition is the correct one. |
@FlorianJaeckle I am confused now: are we verifying that output[original_label] < output[target_label] holds, or the other direction that I mentioned? Cheers, |
We are running into a compatibility issue between our tool's requirements and the networks that we are all evaluating on. If you have experience translating between the ONNX format and TensorFlow protocol buffers, we would really appreciate it if you could read on and provide suggestions to overcome this obstacle. |
@GgnDpSngh we have formulated our problem as follows: |
@FlorianJaeckle alright thanks. Cheers, |
So in this category, we are trying the following benchmarks:
Then for the benchmarks from the oval group, we use:
Cheers, |
@GgnDpSngh These are the ones I've been looking at, although for oval there are also pandas tables with 100 properties that are a superset of the 20 properties. Since the timeout is one hour each, though, maybe we should focus on the 20-property version? Also, do you think it makes sense to use a timeout of 5 minutes for mnist_0.1 as well, just so they all have the same timeout? |
@stanleybak the deadline is Friday now, so there might be time to run the 100 properties that the oval group suggests but we wanted to be uniform. It should be fine to use a timeout of 5 minutes for the mnist_0.1 network as well. Cheers, |
@joshua-smith4, I made a Python 3 script to convert .onnx files to TensorFlow 1 .pb files that might help. It prints some warnings, but it seems to produce files that run correctly. The code is here: I'm using Python package versions numpy 1.17.4, onnx 1.7.0, onnxruntime 1.3.0, tensorflow 2.2.0 |
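Not the author's script, but for anyone attempting the same conversion, one common approach uses the onnx-tf package; a minimal sketch is below. Depending on the onnx-tf/TensorFlow versions installed, the exported graph may be a frozen .pb file or a SavedModel directory, so treat this as a starting point rather than a drop-in replacement.

```python
import onnx
from onnx_tf.backend import prepare  # pip install onnx-tf

def onnx_to_tf(onnx_path, out_path):
    """Convert an ONNX model to a TensorFlow graph on disk.

    out_path is where the exported graph is written (a .pb file or a
    SavedModel directory, depending on the onnx-tf/TensorFlow versions).
    """
    model = onnx.load(onnx_path)     # parse the ONNX protobuf
    tf_rep = prepare(model)          # build an equivalent TensorFlow representation
    tf_rep.export_graph(out_path)    # write it out for TF-based tools

# Example usage (placeholder file names):
# onnx_to_tf("mnist_0.1.onnx", "mnist_0.1.pb")
```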
Hi @stanleybak, I got the following images as UNSAT whereas they are listed as UNKNOWN in the table: Should I change them to UNSAT in ggn-all-results.txt? Thanks |
@Neelanjana314, Yes, please update the table with the classification. The result columns are in the
Yes, we need to specify the sat/unsat/unknown result of each tool, right? I think it's okay to have a single result column if all tools mostly agree and there are only a few inconsistencies, but if there are more than a few, we may want to present this another way, e.g., including a result column for each tool. |
My hope is that there are few inconsistencies, as inconsistencies indicate soundness issues. They should also be easy to resolve, since a tool that produces a SAT result can provide a concrete unsafe input which we can all check. If it's just a few cases, maybe we can mark the result as INCONSIS or something similar until it's resolved. Another idea is to use colors along with the times, so that, for example, green would be SAT and red UNSAT. Adding an extra column for each tool when they're in agreement 95% of the time would be a mess, in my opinion. |
Thanks! Yes, for presentation I agree we should not include a result column for each tool as long as they match, since it's not necessary and clutters the presentation, but I like your solution from #8 of having it available so we can easily check across tools. |
Convolutional layers, presumably in networks for classification
Representative benchmark(s): MNIST, CIFAR classifiers
Specifications: robustness
Questions: any further restrictions or should this be broken out further, e.g., based on other layers (batch normalization, max pool, etc.)?
Allow batch normalization?
Allow max pool, avg pool, etc.?