Category Proposal: Convolutional Neural Networks #3
Comments
Probably some of these should include max pooling, and some could just have convolution layers. |
I think we should specify the activation function here. I presume we intend these to be plain ReLUs. Batch normalisation: I do not see this making a difference in terms of verification, so I would leave this out. Max pooling: should this be a subcategory? Benchmarks: maybe also something in between MNIST and CIFAR? |
Agreed, I think based on the feedback so far, this category is going to make sense to break down further into subcategories. |
Thanks for this feedback. Yes, I'm imagining some ReLU layers given they show up frequently in CNNs, but also some explicit convolutional layers. While batch normalization is easy to handle for verification (as it's linear/affine after training), many tools do not support it, and it can (at least in our experience) drastically influence training, hence showing up in many real networks. As it's probably not widely supported, I think this would be a subcategory, unless networks can be transformed equivalently/soundly to, e.g., fully connected layers to handle these.
To make this clearer, refer e.g. to these networks for MNIST, for which we support all the layers natively in NNV (convolutional layers, batch normalization layers, max pool layers, ReLU layers, fully connected layers, and softmax layers). I imagine various other methods/tools have restrictions that may prevent analysis of these, at least without transformation. If a transformation is needed, that is potentially okay, as I am aware that some methods convert some of these layers to fully connected ones to handle them. We'd want to be sure of the soundness of that transformation, but I presume that would be a meta-theorem in the papers on the tool methods if such a transformation is needed. Questions:
|
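As a side note on the transformation point above: after training, a batch-normalization layer is just an affine map, so it can be folded into the preceding convolutional (or fully connected) layer, which is one way tools without native batch-norm support could still handle these networks. A minimal sketch of that folding is below; the function and parameter names are generic placeholders rather than anything taken from NNV or another tool.

```python
import numpy as np

def fold_batchnorm_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a trained BatchNorm layer into the preceding convolution.

    W: conv weights, shape (out_channels, in_channels, kh, kw)
    b: conv bias, shape (out_channels,)
    gamma, beta, mean, var: BatchNorm parameters, each of shape (out_channels,)

    After training, BatchNorm is the fixed affine map
        y -> gamma * (y - mean) / sqrt(var + eps) + beta,
    so it can be absorbed into the convolution's weights and bias.
    """
    scale = gamma / np.sqrt(var + eps)           # per-output-channel scale
    W_folded = W * scale[:, None, None, None]    # rescale each output channel
    b_folded = (b - mean) * scale + beta         # shift the bias accordingly
    return W_folded, b_folded
```

The same folding applies to a fully connected layer by broadcasting the scale over the rows of the weight matrix instead of the convolution's output channels.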
This is to follow up on Taylor's request. |
I'd like the nnenum tool to participate in this category, although I probably won't add support for pooling layers. |
I'd like to enter the MIPVerify tool for this category as it is currently described. More details on the tool are given in my other comment:
|
I would like to enter our tool ERAN in this category: For convolutional networks, besides intensity-based robustness, it is highly desirable to also have robustness against geometric transformations. Are we going to include this or only norm-based robustness? Cheers, |
Thanks for the feedback. While I think not all methods may support this (see e.g. @vtjeng's comment above), I think it would be great to include such perturbations and see which ones can handle them. Could you please share some details on how the transformations are specified so we can start to think about how to make this interoperable between methods? I think the most extensible way to ease comparison will probably be in terms of the input sets, but I realize this may not be ideal compared to specifying, say, a rotation/translation/etc. by some amount. |
The oval group would like to sign up for this category. Our work will be based on the methods presented in the journal paper 'Branch and Bound for Piecewise Linear Neural Network Verification' and developments thereof. Cheers, |
CNN Category Participants: Oval
If anyone else plans to join this category, please add a comment soon, as the participants in this category need to decide on the benchmarks shortly (by about May 15). |
We would like to enter the ARFramework tool (https://github.com/formal-verification-research/ARFramework) into this category. I apologize for the late response; finishing out this semester has been a little hectic. I hope this does not cause too much inconvenience. On the subject of formatting, the ARFramework tool operates on TensorFlow graphs exported as Google protocol buffers. If we can work out a way to translate the benchmark networks to protocol buffers, this would greatly facilitate our participation. Also, because the ARFramework tool makes use of a modified Fast Gradient Sign Method to generate adversarial examples, access to the gradient of the network with respect to the input is necessary. I would be happy to participate in the network formatting development in order to ensure that this tool and all others have their input requirements met. Also, I was wondering how we are planning to specify the robustness property. Would someone kindly post a formal specification of the property we will be verifying so that we can check compatibility with our tool? That would be greatly appreciated. |
Hi, We would like to enter VeriNet https://vas.doc.ic.ac.uk/software/neural/. The toolkit supports:
Regards, |
Finalized CNN Category Participants: Oval
I will add some discussion on the benchmark plans soon. As indicated earlier, we will decide the benchmarks as a group, and anyone may propose some. However, by default, we will provide some CNNs that we've analyzed with NNV, and likewise anyone else may propose to provide ones they've analyzed. As there is a lot of parameterization here (the network architecture, the data set, the specification, etc.), those are the details we will want to discuss and finalize next, as well as how all of these aspects will be specified to ease interoperability between the tools. |
Hi there, It has been the case that different tools come up with their own neural network formats, which makes it very difficult to reuse those networks in other tools. I think all networks should be provided in ONNX format, as it provides a common format for networks trained in different frameworks, e.g., PyTorch and TensorFlow. We would propose the following networks and specifications:
We should have two types of verification tasks: complete and incomplete verification. For complete verifiers, we can compare speeds on small provably trained and non-provably trained networks. We can use the ground truth from complete verifiers to test the soundness of participating tools, so that one does not win the competition simply by always returning "Verified" :). We can disqualify a tool if it has too many unsound results. For incomplete verifiers, we can compare precision and speed by comparing the number of instances proved within a given time limit. Cheers, |
Thanks for the feedback! Responses inline.
Yes, the plan is to use ONNX for the networks. The other parts (specs in particular, and to some degree data sets) will be harder; part of the goal of VNN-COMP is to generate feedback as VNN-LIB gets created. Some participants (@dlshriver) have developed a translation tool from/to ONNX for various tools: https://github.com/dlshriver/DNNV It has some other features, such as a DSL for specifications, which we can explore using. Thanks for the model pointers; I will look at them and encourage the other participants to do so as well. I think given time constraints it won't be realistic to consider everything, so we will need to identify some reasonable subset, and all participants in the category should agree on it.
Earlier we discussed a different form of categorization based on completeness, but that did not lead to much discussion. I agree the distinction is important, but I would suggest it show up later, when results are aggregated into a report. With respect to the benchmarks, what matters at this stage is ensuring there is sufficient diversity from smaller through larger instances, so that there are benchmarks on which the complete methods may finish in reasonable time. We could consider subdividing the benchmarks on this basis, but I think that is premature; the benchmarks could be attempted by both complete and incomplete methods, where the complete ones will of course hit scalability boundaries (which is one of the things we aim to identify with this initiative).
Regarding the competition aspect: as this is a first instance to get things started, we're not planning to create a formalized ranking (so really more of a challenge than a competition), and we can further discuss how these things would show up. I do not think we should have any plans to disqualify participants, as a goal of this is to foster the community. Of course, ensuring the results of participants on benchmarks are valid is important, but for some of this, basic testing approaches for checking ground truth may be sufficient, particularly for instances where complete methods will not scale.
For reference, how this is typically done in related initiatives (e.g., SV-COMP/SMT-COMP) is that there is a large penalty for a soundness bug (e.g., a score of +1 for each instance correctly proven within a time limit, a score of 0 or -1 for not returning a result in time, and a score of -100 for a soundness bug, such as the tool reporting something is verified when it is false). I don't think we should have such a detailed scoring system this year (for a plethora of reasons, such as the lack of common infrastructure on which to run so comparisons are completely fair, benchmark standardization, etc.), but we should discuss how to report timeouts, correct vs. incorrect results, etc. from the tools, and use this discussion to guide how future iterations may incorporate tool results into a scoring system. |
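Purely as an illustration of the SV-COMP/SMT-COMP-style scoring mentioned above (not a rule adopted for this iteration), a hypothetical per-instance scoring function might look like the following, using the example point values from the comment:

```python
def score_instance(reported, ground_truth):
    """Hypothetical SV-COMP-style score for a single benchmark instance.

    reported: 'sat' (counter-example found), 'unsat' (property verified),
              or None if the tool produced no result within the time limit.
    ground_truth: 'sat' or 'unsat'.
    """
    if reported is None:
        return 0        # no result in time (could also be -1)
    if reported == ground_truth:
        return 1        # correct result within the time limit
    return -100         # soundness bug: wrong verdict
```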
Hi, We would also like to enter Venus (https://vas.doc.ic.ac.uk/software/neural/) in this category. Best, Panagiotis |
I agree that we want to run this in a friendly manner to help the community rather than proclaiming a winner. If, for example, we come out of this with a preferred input format and have raised the profile of the topic, I think these alone will have made the effort worthwhile. Having said this, I think we need to clearly differentiate complete vs. incomplete methods. The issue is orthogonal to the categories (challenge problems) themselves, but each approach comes with advantages and disadvantages, and these ought to be reasonably clear to a potentially interested party trying to understand the results. Maybe we ought to consider grouping tools together as complete or incomplete in the reporting phase. |
On the topic of making comparisons within a time limit, I think it would be valuable to have all verifiers (complete and incomplete) report the amount of time required to verify each property of interest (with a reasonable overall maximum time to spend on each property). This would allow us to:
I agree with @alessiolomuscio's suggestion of the importance of clearly differentiating between complete and incomplete methods. One way we might be able to do so is to differentiate in our final report between cases where we timed out (which could happen to both complete and incomplete methods) and cases where the algorithm runs to completion and reports failure (which only happens to incomplete methods). |
Thanks for the feedback! Yes, I agree we will include the timing information, and can group in the resulting report the complete/incomplete methods results. |
Quick update: we have gone through all the proposed networks and will provide some additional detailed feedback shortly. I think most of the networks can be used, although ideally the proposer of any network will provide an ONNX version of it soon. We have done some conversion for the proposed networks and will share that soon, but there are a variety of intricacies of ONNX that need to be considered (e.g., precision of weights, additional layers added in conversion, etc.), so we would prefer that anyone proposing a network provide it in ONNX to avoid unintentional changes, as you're the ones most familiar with your own networks. We will do this for our MNIST networks discussed above.
Given our time constraints and the need to get everything finalized and distributed very soon, I would suggest we focus only on MNIST for this iteration (and skip e.g. some of the CIFAR proposals). If anyone feels strongly otherwise, please let us know. This is in part because we also need to quickly finalize the other inputs (e.g., specific images for local robustness, local robustness bounds/specifications, etc.) for a specific verification problem. Additionally, from the proposed networks, I think only a couple of tools currently support residual networks, so I would suggest either excluding those for this iteration, or focusing on them later if we have time / by the participants who currently support resnets (I think only ERAN and MIPVerify from the discussion).
We can discuss the submission of results more, but for planning purposes, we're anticipating that participants will submit a Dockerfile and a script to execute everything in the category and produce a LaTeX table with runtimes, verification results, etc. If we have time, we will then run these on a common infrastructure; otherwise we will just report the characteristics of the machine used by each participant. |
A question about submission formats: I would like to submit a Dockerfile so that the work is completely reproducible, but I don't know whether this can be done when using the Gurobi solver. Does anyone else have a setup involving a Dockerfile that works with Gurobi? |
I think others also use Gurobi, so they can likely respond in more detail, but this should be possible simply by not including a license; the user can then install one manually. We do something similar with NNV for its Docker-based runs, given that it uses Matlab. We can discuss this more in depth as we get to that stage; e.g., if the binaries/installation files also aren't openly available, some setup instructions could be provided. From a quick search (looks a little old):
Hi all, We are happy to provide 2 MNIST and 2 CIFAR10 convolutional networks in the ONNX format here: https://github.com/eth-sri/colt/tree/master/trained_models/onnx The epsilon values of the L_oo ball for the MNIST networks are 0.1 and 0.3, while those for the CIFAR10 networks are 2/255 and 8/255. The network names contain the epsilon values. All networks expect input images to first be normalized to the range [0,1]. Next, the standard mean and standard deviation should be applied to the normalized images before passing them to the networks. The mean and standard deviation for MNIST and CIFAR10 can be found here: https://github.com/eth-sri/colt/blob/acea4093d0eebf84b49a7dd68ca9a28a08f86e67/code/loaders.py#L7 Let me know if anyone has any trouble running the networks. Cheers, |
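For anyone wiring this up, here is a minimal sketch of the two preprocessing steps described above (scale raw pixels to [0,1], then standardize with the per-channel mean and standard deviation). The constants below are the commonly used MNIST/CIFAR-10 values and are placeholders only; the authoritative numbers are in the loaders.py file linked above.

```python
import numpy as np

# Placeholder constants; the authoritative values are in the linked loaders.py.
MNIST_MEAN, MNIST_STD = np.array([0.1307]), np.array([0.3081])
CIFAR10_MEAN = np.array([0.4914, 0.4822, 0.4465])
CIFAR10_STD = np.array([0.2023, 0.1994, 0.2010])

def preprocess(image_uint8, mean, std):
    """Scale raw 0..255 pixels to [0,1], then apply per-channel standardization.

    image_uint8: array of shape (channels, height, width); the channel-first
    layout is an assumption and should be checked against the ONNX input shape.
    """
    x = image_uint8.astype(np.float32) / 255.0               # step 1: scale to [0,1]
    return (x - mean[:, None, None]) / std[:, None, None]    # step 2: standardize
```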
@GgnDpSngh which input images are intended for the networks? |
The first 100 images of the MNIST and CIFAR10 test sets should be fine, so that there is a good mix of easy, medium, and hard instances for verification. We provide them here: We can filter out images that are not correctly classified. Also, I was wondering if we should have the challenge live throughout the year, so that if anyone gets better results they can add them to the result table. Similar to this: |
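A minimal sketch of that selection/filtering step, assuming onnxruntime and inputs already preprocessed as described above; the file name, the (N, C, H, W) layout, and the helper itself are assumptions rather than a prescribed interface.

```python
import numpy as np
import onnxruntime as ort

def first_k_correctly_classified(model_path, images, labels, k=100):
    """Return indices (within the first k test images) that the network classifies correctly.

    images: preprocessed inputs of shape (N, C, H, W), float32;
    labels: ground-truth classes of shape (N,);
    model_path: placeholder path to one of the ONNX networks linked above.
    """
    sess = ort.InferenceSession(model_path)
    input_name = sess.get_inputs()[0].name
    keep = []
    for i, (x, y) in enumerate(zip(images[:k], labels[:k])):
        logits = sess.run(None, {input_name: x[None].astype(np.float32)})[0]
        if int(np.argmax(logits)) == int(y):   # drop misclassified images
            keep.append(i)
    return keep
```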
Thanks very much for providing the networks, specs, and inputs! Regarding the continual challenge aspect: I think that's an interesting idea to consider, let's bring this up at the workshop so we can discuss in more detail how that might work. I think as long as people submit Dockerfiles and scripts to re-generate everything, something like that could be feasible without too much effort. Part of what we'll discuss in the workshop is how this first initiative went, what changes could be made going forward, etc. |
@FlorianJaeckle just to confirm, for the images, should we be using the CIFAR-10 test batch? The dataset (http://www.cs.toronto.edu/~kriz/cifar.html) also includes 5 data batches. There was also some strange behavior in the deep100 table, where the index and prop columns were sometimes stored as floating-point numbers instead of ints, but that's a small issue. Also for others: are you just analyzing the 100 image versions of this, or both the 100 and 20 image subsets? |
@FlorianJaeckle I'm having trouble running the networks on the CIFAR dataset. The produced labels don't match the expected CIFAR labels. Are we supposed to scale the pixel values to [0,1]? Is there a normalization we're supposed to be using for each channel? Are the channel orders R, G, B? Is the classification label the max or the min? Can I confirm the images are from the test batch and not one of the other batches? Was anyone else able to execute these networks? |
Hi @stanleybak, |
Hi @FlorianJaeckle, That helped! Now things are classifying correctly. One more question: is the epsilon perturbation applied before normalization, or after? |
@stanleybak, |
@FlorianJaeckle |
@FlorianJaeckle, I've generally seen the epsilon perturbation being applied before normalization (that is, if
To confirm, in your case, this is |
Also, is the timeout 1 hour in total or per property? And are we testing 100 or 20 properties for each network? Cheers, |
@stanleybak, all properties are unsat; i.e., there does not exist a counter-example. In other words, the network is robust around the image. Can you give an example of a property where your method returned an adversarial example?
@vtjeng, given an image x and an epsilon value, the input domain for the CNN is [Normalize(x)-e, Normalize(x)+e].
@GgnDpSngh, the timeout is 1 hour per property. |
For the normalization, if you want, you can convert the epsilon values to pre-normalization values by multiplying by the std deviation (0.225), since it's the same for all channels. I also assume there's trimming at the limits, which wasn't mentioned before. @FlorianJaeckle, I think I found my problem. Since there is a target class, is the correct condition to check (1)
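To make the two views of epsilon concrete, here is a minimal sketch of computing the input bounds for the L_oo ball [Normalize(x)-e, Normalize(x)+e], with the equivalent pre-normalization radius obtained by multiplying by the std as noted above. The optional clipping to valid pixel values reflects the assumption raised in this comment rather than anything confirmed by the benchmark authors, and the mean/std arguments are placeholders.

```python
import numpy as np

def input_bounds_normalized(x01, eps, mean, std, clip_raw=False):
    """Bounds for the L_oo ball [Normalize(x)-eps, Normalize(x)+eps].

    x01: raw image already scaled to [0,1], shape (C, H, W);
    eps: radius in *normalized* space (as specified for these benchmarks);
    mean, std: per-channel normalization constants, each of shape (C,).
    The equivalent pre-normalization radius is eps * std, since the
    normalization map is (x - mean) / std.
    """
    m, s = mean[:, None, None], std[:, None, None]
    z = (x01 - m) / s
    lb, ub = z - eps, z + eps
    if clip_raw:
        # Optional: keep only perturbations whose raw pixel values stay in [0,1]
        # (the "trimming at the limits" assumption mentioned above).
        lb = np.maximum(lb, (0.0 - m) / s)
        ub = np.minimum(ub, (1.0 - m) / s)
    return lb, ub
```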
@stanleybak since they want the target label to never be chosen for perturbed images, I guess the condition to verify should be that output[original_label] > output[target_label] holds. This is how I have been running ERAN, but maybe @FlorianJaeckle can confirm? Cheers, |
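For reference, a minimal sketch of evaluating that condition on a single concrete (already preprocessed) input, e.g., to cross-check a candidate counter-example. onnxruntime and the helper name are assumptions, and the inequality is the one stated in this comment, pending the confirmation asked for here.

```python
import numpy as np
import onnxruntime as ort

def satisfies_targeted_robustness(model_path, x, original_label, target_label):
    """Check output[original_label] > output[target_label] for one input.

    x: a single preprocessed input of shape (C, H, W), float32.
    A perturbed input for which this returns False would be a concrete
    counter-example (SAT witness) for the targeted-robustness property.
    model_path is a placeholder for one of the benchmark ONNX networks.
    """
    sess = ort.InferenceSession(model_path)
    input_name = sess.get_inputs()[0].name
    out = sess.run(None, {input_name: x[None].astype(np.float32)})[0].ravel()
    return bool(out[original_label] > out[target_label])
```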
@stanleybak the first condition is the correct one. |
@FlorianJaeckle I am confused now: are we verifying that output[original_label] < output[target_label] holds, or the other direction that I mentioned? Cheers, |
We are running into a compatibility issue between our tool's requirements and the networks that we are all evaluating on. If you have experience translating between the ONNX format and TensorFlow protocol buffers, we would really appreciate it if you could read on and provide suggestions to overcome this obstacle. |
@GgnDpSngh we have formulated our problem as follows: |
@FlorianJaeckle alright thanks. Cheers, |
So in this category, we are trying the following benchmarks:
Then for the benchmarks from the oval group, we use:
Cheers, |
@GgnDpSngh These are the ones I've been looking at, although for oval there are also pandas tables with 100 properties that are a superset of the 20 properties. Since the timeout is one hour each, though, maybe we should focus on the 20-property version? Also, do you think it makes sense to use a timeout of 5 minutes for mnist_0.1 as well, just so they all have the same timeout? |
@stanleybak the deadline is Friday now, so there might be time to run the 100 properties that the oval group suggests but we wanted to be uniform. It should be fine to use a timeout of 5 minutes for the mnist_0.1 network as well. Cheers, |
@joshua-smith4, I made a Python 3 script to convert .onnx files to TensorFlow 1 .pb files that might help. It prints some warnings, but it seems to produce files that run correctly. The code is here: I'm using Python package versions numpy 1.17.4, onnx 1.7.0, onnxruntime 1.3.0, tensorflow 2.2.0 |
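Not the author's script, but for anyone attempting the same conversion, one common approach uses the onnx-tf package; a minimal sketch is below. Depending on the onnx-tf/TensorFlow versions installed, the exported graph may be a frozen .pb file or a SavedModel directory, so treat this as a starting point rather than a drop-in replacement.

```python
import onnx
from onnx_tf.backend import prepare  # pip install onnx-tf

def onnx_to_tf(onnx_path, out_path):
    """Convert an ONNX model to a TensorFlow graph on disk.

    out_path is where the exported graph is written (a .pb file or a
    SavedModel directory, depending on the onnx-tf/TensorFlow versions).
    """
    model = onnx.load(onnx_path)     # parse the ONNX protobuf
    tf_rep = prepare(model)          # build an equivalent TensorFlow representation
    tf_rep.export_graph(out_path)    # write it out for TF-based tools

# Example usage (placeholder file names):
# onnx_to_tf("mnist_0.1.onnx", "mnist_0.1.pb")
```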
Hi @stanleybak, I got the following images as UNSAT whereas they are listed as UNKNOWN in the table: Should I change them to UNSAT in ggn-all-results.txt? Thanks |
@Neelanjana314, Yes, please update the table with the classification. The result columns are in the
Yes, we need to specify the sat/unsat/unknown result of each tool, right? I think it's okay to have a single result column if all tools mostly agree and there are only a few inconsistencies, but if there are more than a few, we may want to present this another way, e.g., including a result column for each tool. |
My hope is that there are few inconsistencies, as inconsistencies indicate soundness issues. They should also be easy to resolve, since a tool that produces a SAT result can provide a concrete unsafe input which we can all check. If it's just a few cases, maybe we can mark the result as INCONSIS or something similar until it's resolved. Another idea is to use colors along with the times, so that, for example, green would be SAT and red UNSAT. Adding an extra column for each tool when they're in agreement 95% of the time would be a mess, in my opinion. |
Thanks! Yes, for presentation I agree we should not include a result column for each tool as long as they match, since it's not necessary and clutters the presentation, but I like your solution from #8 of having it available so we can easily check across tools. |
Convolutional layers, presumably in networks for classification
Representative benchmark(s): MNIST, CIFAR classifiers
Specifications: robustness
Questions: any further restrictions or should this be broken out further, e.g., based on other layers (batch normalization, max pool, etc.)?
Allow batch normalization?
Allow max pool, avg pool, etc.?