
Category Proposal: Convolutional Neural Networks #3

Open · ttj opened this issue Apr 6, 2020 · 86 comments
Labels: category-proposal (Proposed category)
Comments


ttj commented Apr 6, 2020

Convolutional layers, presumably in networks for classification

Representative benchmark(s): MNIST, CIFAR classifiers

Specifications: robustness

Questions: any further restrictions or should this be broken out further, e.g., based on other layers (batch normalization, max pool, etc.)?

Allow batch normalization?

Allow max pool, avg pool, etc.?

ttj added the category-proposal (Proposed category) label on Apr 6, 2020
@stanleybak

Probably some of these should include max pooling, and some could just have convolution layers.

@alessiolomuscio

I think we should specify the activation function here. I presume we intend these to be plain ReLUs.

Batch normalisation: I do not see this making a difference in terms of verification, so I would leave this out.

Max pooling: should this be a subcategory?

Benchmarks: maybe also something in between MNIST and CIFAR?


ttj commented Apr 8, 2020

Probably some of these should include max pooling, and some could just have convolution layers.

Agreed, I think based on the feedback so far, this category is going to make sense to break down further into subcategories.


ttj commented Apr 8, 2020

I think we should specify the activation function here. I presume we intend these to be plain ReLUs.

Batch normalisation: I do not see this making a difference in terms of verification, so I would leave this out.

Max pooling: should this be a subcategory?

Benchmarks: maybe also something in between MNIST and CIFAR?

Thanks for this feedback. Yes, I'm imagining some ReLU layers given they show up frequently in CNNs, but also some explicit convolutional layers.

While batch normalization is easy to handle for verification (as it's linear/affine after training), many tools do not support it, and it can (at least in our experience) drastically influence training, hence it shows up in many real networks. As it's probably not widely supported though, I think this would be a subcategory, unless networks can be transformed equivalently/soundly to e.g. fully connected to handle these.
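(For concreteness, here is a minimal sketch of that folding, in plain NumPy rather than NNV; the function and parameter names are illustrative only.)

```python
import numpy as np

def fold_batchnorm(gamma, beta, running_mean, running_var, eps=1e-5):
    """Collapse a trained batch-norm layer y = gamma * (x - mean) / sqrt(var + eps) + beta
    into an equivalent per-channel affine map y = scale * x + shift."""
    scale = gamma / np.sqrt(running_var + eps)
    shift = beta - scale * running_mean
    return scale, shift

# The per-channel (scale, shift) can then be merged into the weights and bias of the
# preceding convolutional or fully connected layer, so tools without native
# batch-norm support can still analyze an equivalent network.
```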

To make this clearer, refer e.g. to these networks for MNIST, for which we support all the layers natively in NNV (convolutional layers, batch normalization layers, max pool layers, ReLU layers, fully connected layers, and softmax layers).

[Image: mnist_examples]

I imagine though various other methods/tools have restrictions that may prevent analysis of this, at least without transformation. If transformation is needed, that potentially is okay, as I am aware that some methods convert some of these layers to fully connected ones to handle them. We'd want to be sure of the soundness of that transformation, but I presume that would be a meta-theorem in the papers on the tool methods if such a transformation is needed.
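(As a sketch of one such transformation, assuming the convolution is treated as the linear map it is after training: the dense weight matrix can be recovered by pushing the standard basis through the layer. PyTorch is used purely for illustration, and memory grows quickly with input size.)

```python
import torch

def conv_to_fully_connected(conv, in_shape):
    """Return (W, b) such that conv(x).flatten() == W @ x.flatten() + b for a
    torch.nn.Conv2d layer and inputs x of shape in_shape = (C, H, W)."""
    n = int(torch.tensor(in_shape).prod())
    basis = torch.eye(n).reshape(n, *in_shape)        # each basis vector as an image
    with torch.no_grad():
        rows = conv(basis).reshape(n, -1)             # row i holds conv(e_i) flattened
        b = conv(torch.zeros(1, *in_shape)).reshape(-1)
    W = (rows - b).T                                  # subtract the bias, transpose
    return W, b
```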

Questions:

  1. Based on your piecewise/ReLU category feedback, I am inferring that you have some classification networks that only have ReLUs; is that accurate?

  2. We're certainly open to other networks/data sets for benchmarks; do you have any data set in mind? Maybe GTSRB?

@souradeep-111

This is to follow up on Taylor's request.
I would like to sign up for this category with Sherlock.
https://github.com/souradeep-111/sherlock

@stanleybak

I'd like the nnenum tool to participate in this category, although I probably won't add support for pooling layers.


vtjeng commented Apr 24, 2020

I'd like to enter the MIPVerify tool for this category as it is currently described.

More details on the tool as described in my other comment:

  • Non-linearities supported: network must be piecewise linear; ReLU and max units are supported, but tanh and sigmoid layers are not.
  • Linear layers natively supported: fully connected layers, convolution layers, skip connections (as used in ResNets)
  • Requirements on specifications: The tool is currently geared towards verifying local robustness, but I intend to add support for ACAS-Xu style safety specifications as that has been a highly requested feature.
  • Input format
    • Networks: The "native" input format for the tool is a .mat file containing a dictionary with all the parameters. I currently also have a script that can convert saved PyTorch models into the .mat input format.
    • Specifications: Since I currently only verify local robustness, the specification is just the size of the L_infinity robustness radius that is to be verified.

@GgnDpSngh

I would like to enter our tool ERAN in this category:
https://github.com/eth-sri/eran

For convolutional networks, besides intensity-based robustness, it is highly desirable to also have robustness against geometric transformations. Are we going to include this or only norm-based robustness?

Cheers,
Gagandeep Singh


ttj commented Apr 27, 2020

For convolutional networks, besides intensity-based robustness, it is highly desirable to also have robustness against geometric transformations. Are we going to include this or only norm-based robustness?

Thanks for the feedback. While I think not all methods may support this (see e.g. @vtjeng's comment above), I think it would be great to include such perturbations and see which ones can handle them. Could you please share some details on how the transformations are specified so we can start to think about how to make this interoperable between methods? I think the most extensible way to ease comparison will probably be in terms of the input sets, but I realize this may not be ideal vs. specifying, say, a rotation/translation/etc. by some amount.
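(Purely as an illustration of what "in terms of the input sets" could mean, one can enclose a parameterized perturbation family in a per-pixel box by sampling the parameter. Sampling alone is not sound for continuous parameters; a certified enclosure such as the one DeepG computes would be needed for verification, so this only shows the interchange format, not a sound construction.)

```python
import numpy as np

def sampled_box(image, transform, params):
    """Per-pixel interval [lb, ub] covering the sampled perturbed images.
    transform(image, p) returns the perturbed image for parameter value p.
    This only covers the sampled parameter values; it is NOT a sound enclosure
    of a continuous perturbation family."""
    stack = np.stack([transform(image, p) for p in params])
    return stack.min(axis=0), stack.max(axis=0)

# Example: brightness shifts beta in [0, 0.05] on a [0,1]-scaled image x:
# lb, ub = sampled_box(x, lambda im, b: np.clip(im + b, 0.0, 1.0), np.linspace(0.0, 0.05, 11))
```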

@FlorianJaeckle

The oval group would like to sign up for this category. Our work will be based on the methods presented in the journal paper 'Branch and Bound for Piecewise Linear Neural Network Verification' and developments thereof.

Cheers,
Florian Jaeckle


ttj commented May 2, 2020

CNN Category Participants:

Oval
ERAN
MIPVerify
nnenum
NNV
Sherlock

If anyone else plans to join this category, please add a comment soon, as the participants in this category need to decide on the benchmarks soon (by about May 15).

@joshua-smith4

We would like to enter the ARFramework tool (https://github.com/formal-verification-research/ARFramework) into this category. I apologize for the late response. Finishing out this semester has been a little hectic. I hope this does not cause too much inconvenience.

On the subject of formatting, the ARFramework tool operates on Tensorflow graphs exported as google protocol buffers. If we can work out a way to translate the benchmark networks to protocol buffers, this would greatly facilitate our participation. Also, because the ARFramework tool makes use of a modified Fast Gradient Sign Method to generate adversarial examples, access to the gradient of the network with respect to the input is necessary. I would be happy to participate in the network formatting development in order to ensure that this tool and all others have their input requirements met.

Also, I was wondering how we are planning to specify the robustness property. Would someone kindly post a formal specification of the property we will be verifying so that we can check compatibility with our tool? That would be greatly appreciated.
Thanks,
Josh Smith


pat676 commented May 4, 2020

Hi,

We would like to enter VeriNet https://vas.doc.ic.ac.uk/software/neural/.

The toolkit supports:

  • Local robustness properties.

  • Fully connected and convolutional layers.

  • ReLU, Sigmoid and Tanh activation functions.

Regards,
Patrick Henriksen


ttj commented May 8, 2020

Finalized CNN Category Participants:

Oval
ERAN
MIPVerify
nnenum
NNV
Sherlock
VeriNet
ARFramework
Venus

I will add some discussion on the benchmark plans soon. As indicated earlier, we will decide as a group the benchmarks, and anyone may propose some. However, by default, we will provide some CNNs that we've analyzed with NNV, and likewise anyone else may propose to provide ones they've analyzed. As there is a lot of parameterization here (on the network architecture, the data set, the specification, etc.), those are the details we will want to discuss next and finalize, as well as how all of these aspects will be specified to ease interoperability between the tools.

@GgnDpSngh

Hi there,

It has been the case that different tools come up with their own neural network format and this makes it very difficult to reuse those networks in other tools. I think all networks should be provided in ONNX format as it provides a common format for networks trained in different frameworks e.g., PyTorch and Tensorflow. We would propose the following networks and specifications:
Networks:

  1. Networks trained with provable defenses: We could consider state-of-the-art provably trained networks here that are also challenging to prove, i.e., cannot be proved by simple interval analysis, e.g. from https://openreview.net/pdf?id=SJxSDxrKDr. This way the state of the art also benefits from potentially better verification results.
  2. Networks trained with empirical defenses: We can use state-of-the-art networks trained with PGD etc. We can provide the PGD-trained ResNets used in https://files.sri.inf.ethz.ch/website/papers/neurips19-deepg.pdf from our side.
  3. Networks trained normally that achieve state-of-the-art accuracy for MNIST, CIFAR10, ImageNet, etc. Besides the MNIST networks that you proposed, we can also provide our normally trained networks from https://github.com/eth-sri/eran in ONNX format.

Specifications based on robustness, i.e., the correct class has a higher score than all other labels:

  • L_oo-norm: For normally and PGD-trained CIFAR10 networks, meaningful epsilon values have not been proven yet. We can consider the provably trained networks from (1).
  • Brightness: We can consider this type of specification for PGD and normally trained networks, as these are relatively easier to prove than L_oo on the networks from (2) and (3).
  • Geometric: We can use the Polyhedra regions for geometric perturbations, e.g., rotation, translation, available via https://github.com/eth-sri/deepg for all types of networks.

We should have two types of verification tasks: complete and incomplete verification. For complete verifiers, we can compare speeds on small provably and non-provably trained networks. We can use the ground-truth from complete verifiers to test the soundness of participating tools so that one does not win the competition by returning “Verified” always :). We can disqualify a tool if it has too many unsound results. For incomplete verifiers, we can compare precision and speed by comparing the number of instances proved within a given time limit.

Cheers,
Gagandeep Singh


ttj commented May 9, 2020

Thanks for the feedback! Responses inline.

It has been the case that different tools come up with their own neural network format and this makes it very difficult to reuse those networks in other tools. I think all networks should be provided in ONNX format as it provides a common format for networks trained in different frameworks e.g., PyTorch and Tensorflow.

Yes, the plan is to use ONNX for the networks. For the other parts (specs in particular, to some degree data sets), this will be harder. Part of the goal of VNN-COMP is to generate feedback as VNN-LIB gets created:

http://www.vnnlib.org/

Some participants (@dlshriver) have developed a translation tool from/to ONNX for various tools:

https://github.com/dlshriver/DNNV

This has some other features, such as a DSL for specifications, etc. that we can explore whether it's possible to use.

Thanks for the model pointers, I will look at them and encourage the other participants to also do so. I think given time constraints it won't be realistic to consider everything, so we will need to identify some reasonable subset, and all participants in the category should agree on them.

We should have two types of verification tasks: complete and incomplete verification. For complete verifiers, we can compare speeds on small provably and non-provably trained networks. We can use the ground-truth from complete verifiers to test the soundness of participating tools so that one does not win the competition by returning “Verified” always :). We can disqualify a tool if it has too many unsound results. For incomplete verifiers, we can compare precision and speed by comparing the number of instances proved within a given time limit.

Earlier we discussed a different form of categorization based on completeness, but that did not lead to a lot of discussion. I agree the distinction is important, but would suggest how this shows up will be later on as results are aggregated into a report. With respect to the benchmarks, I think how this is important at this stage is just to ensure there is a sufficient diversity of smaller through larger instances so that there are benchmarks on which the complete methods may finish in reasonable time. We could consider a subdivision of the benchmarks based on this, but I think it is premature and that the benchmarks could be attempted by both complete and incomplete, where complete of course will hit scalability boundaries (which is one of the things we aim to identify by doing this initiative).

Regarding the competition aspect: as this is a first instance to get things started and we're not planning to create a formalized ranking (so it's really more of a challenge than a competition), we can further discuss how these things would show up. I do not think we should have any plans to disqualify any participants, as a goal of this is to foster the community, etc. Of course, ensuring the results of participants on benchmarks are valid is important, but for some of this, I think some basic testing approaches for checking ground truth may be sufficient, particularly for those instances where complete methods will not scale. For reference, this is typically handled in related initiatives (e.g., SV-COMP/SMT-COMP) with a large penalty for a soundness bug (e.g., a score of +1 for each instance correctly proven within a time limit, a score of 0 or -1 for not returning a result in time, and a score of -100 if there's a soundness bug, such as the tool returning that something is verified when it is false).
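(Purely as an illustration of such a scheme, using the example point values above; this is not an agreed rule for this iteration.)

```python
def score_instance(result, ground_truth, finished_in_time):
    """Toy SV-COMP-style scoring: +1 for a correct result within the time limit,
    0 for a timeout or an 'unknown' answer, -100 for an unsound (wrong) answer."""
    if not finished_in_time or result == "unknown":
        return 0
    return 1 if result == ground_truth else -100
```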

I don't think we should have such a detailed scoring system this year (for a plethora of reasons, such as common infrastructure on which to run so comparisons are completely fair, benchmark standardization, etc.), but we should discuss how to report timeouts, correct vs. incorrect results, etc. from the tools, and use this discussion to guide how future iterations may incorporate the tool results into a scoring system.

@pkouvaros

Hi

We would also like to enter Venus (https://vas.doc.ic.ac.uk/software/neural/) in this category.

Best

Panagiotis

@alessiolomuscio

I agree that we want to run this in a friendly manner to help the community rather than proclaiming a winner. If, for example, we come out of this with a preferred input format, and have raised the profile of the topic, I think these alone will have made the effort worthwhile.

Having said this, I think we need to clearly differentiate complete vs incomplete methods. The issue is orthogonal to the categories (challenge problems) themselves, but each approach comes with advantages and disadvantages, and these ought to be made clear to a potentially interested party trying to understand the results. Maybe we ought to consider grouping tools together as complete or incomplete in the reporting phase.


vtjeng commented May 11, 2020

On the topic of making comparisons within a time limit, I think it would be valuable to have all verifiers (complete and incomplete) report the amount of time required to verify each property of interest (with a reasonable overall maximum time to spend on each property).

This would allow us to:

  1. Compare the number of properties verified at a range of time limits (perhaps there are tools that are faster for simpler properties, and other tools that are faster for more complex properties)
  2. Identify heterogeneity in verifier performance (perhaps a particular verifier is better at verifying certain classes of properties).

I agree with @alessiolomuscio's suggestion about the importance of clearly differentiating between complete and incomplete methods. One way we might be able to do so is to differentiate in our final report between cases where we timed out (which could happen to both complete and incomplete methods) and cases where the algorithm runs to completion and reports failure (which only happens to incomplete methods).


ttj commented May 12, 2020

Thanks for the feedback! Yes, I agree we will include the timing information, and can group in the resulting report the complete/incomplete methods results.


ttj commented May 29, 2020

Quick update---we have gone through all the proposed networks and will provide some additional detailed feedback shortly. I think most of the networks can be used, although ideally the proposer of any network will provide an ONNX format version of it soon. We have done some conversion for the proposed networks and will share that soon, but there are a variety of intricacies of ONNX that need to be considered (e.g., precision of weights, additional layers added in conversion, etc.), so we would prefer if anyone proposing a network provides it in ONNX to avoid any unintentional changes as you're the ones most familiar with your own networks. We will do this for our MNIST networks discussed above.
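(For proposers working in PyTorch, a minimal export sketch follows; the tiny model and file name below are placeholders, not one of the benchmark networks.)

```python
import torch

# Placeholder stand-in for a trained MNIST classifier (1x28x28 inputs).
model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 4, 3), torch.nn.ReLU(),
    torch.nn.Flatten(), torch.nn.Linear(4 * 26 * 26, 10))

dummy_input = torch.randn(1, 1, 28, 28)
torch.onnx.export(model, dummy_input, "example_mnist.onnx",
                  input_names=["input"], output_names=["output"])
# Re-loading the file with onnx/onnxruntime and comparing outputs against the original
# model is a quick check that weights, precision, and layers survived the conversion.
```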

Given our time constraints and needing to get everything finalized and distributed very soon, I would suggest we only focus on MNIST for this iteration (and skip e.g. some of the CIFAR proposals). If anyone feels strongly otherwise, please let us know. This is in part because we also need to finalize the other inputs (e.g., specific images for local robustness, local robustness bounds / specifications, etc.) for a specific verification problem quickly as well.

Additionally, from the proposed networks, I think only a couple of tools may currently support residual networks, so I would suggest either excluding those for this iteration or focusing on them later if we have time, by the participants who currently support ResNets (I think only ERAN and MIPVerify from the discussion).

We can discuss the submission of results more, but for planning purposes, we're anticipating participants will submit a Dockerfile and a script to execute everything in the category to produce a latex table with runtimes, verification results, etc. If we have time, we will then run these on a common infrastructure, and otherwise will just report the characteristics of the machine used if done by each participant.


vtjeng commented May 29, 2020

A question about submission formats: MIPVerify can use a variety of back-end solvers, both proprietary and not, including Gurobi. (The performance with Gurobi is significantly faster than that with open-source solvers such as Cbc).

I would like to submit a Dockerfile so that the work is completely reproducible, but don't know whether this can be done when using the Gurobi solver --- does anyone else have a setup involving a Dockerfile that works with Gurobi?


ttj commented May 29, 2020

A question about submission formats: MIPVerify can use a variety of back-end solvers, both proprietary and not, including Gurobi. (The performance with Gurobi is significantly faster than that with open-source solvers such as Cbc).

I would like to submit a Dockerfile so that the work is completely reproducible, but don't know whether this can be done when using the Gurobi solver --- does anyone else have a setup involving a Dockerfile that works with Gurobi?

I think others also use Gurobi, so they can likely respond in more detail, but this should be possible simply by not including a license; it can then be installed manually by the user. We do similar things with NNV for its Docker-based runs, given it uses Matlab. We can discuss this more in depth as we get to that stage, e.g., if the binaries/installation files also aren't openly available, some instructions could be provided for setup. From a quick search (it looks a little old):

https://github.com/SG2B/gurobi

@GgnDpSngh

Hi all,

We are happy to provide 2 MNIST and 2 CIFAR10 Convolutional networks in the ONNX format here:

https://github.com/eth-sri/colt/tree/master/trained_models/onnx

The epsilon values of the L_oo ball for the MNIST networks are 0.1 and 0.3, while those for the CIFAR10 networks are 2/255 and 8/255. The network names contain the epsilon values. All networks expect input images to first be normalized to the range [0,1]. Next, the standard mean and standard deviation should be applied to the normalized images before passing them to the networks. The mean and standard deviation for MNIST and CIFAR10 can be found here:

https://github.com/eth-sri/colt/blob/acea4093d0eebf84b49a7dd68ca9a28a08f86e67/code/loaders.py#L7
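(Concretely, the preprocessing might look like the sketch below, using onnxruntime; the mean/std constants here are the usual MNIST values and should be confirmed against the linked loaders.py, and the input shape adjusted to the network's declared input.)

```python
import numpy as np
import onnxruntime as ort

MEAN, STD = 0.1307, 0.3081   # usual MNIST constants; confirm against the linked loaders.py

def predict(onnx_path, image_uint8):
    """Scale a raw 28x28 MNIST image to [0,1], apply mean/std normalization, run the network."""
    x = image_uint8.astype(np.float32) / 255.0    # step 1: normalize to [0, 1]
    x = (x - MEAN) / STD                          # step 2: standard mean/std normalization
    sess = ort.InferenceSession(onnx_path)
    input_name = sess.get_inputs()[0].name
    logits = sess.run(None, {input_name: x.reshape(1, 1, 28, 28)})[0]  # shape assumption
    return int(np.argmax(logits))
```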

Let me know if anyone has any trouble running the networks.

Cheers,
Gagandeep Singh

@stanleybak

@GgnDpSngh which input images are intended for the networks?

@GgnDpSngh

The first 100 images of the MNIST and CIFAR10 test set should be fine so that there is a good mix of easy, medium, and hard instances for verification. We provide them here:
https://github.com/eth-sri/eran/tree/master/data

We can filter out images that are not correctly classified. Also, I was wondering if we should have the challenge live throughout the year so that if anyone gets better results they can add their results to the result table. Similar to this:
https://github.com/MadryLab/cifar10_challenge


ttj commented Jun 4, 2020

We can filter out images that are not correctly classified. Also, I was wondering if we should have the challenge live throughout the year so that if anyone gets better results they can add their results to the result table. Similar to this:
https://github.com/MadryLab/cifar10_challenge

Thanks very much for providing the networks, specs, and inputs!

Regarding the continual challenge aspect: I think that's an interesting idea to consider, let's bring this up at the workshop so we can discuss in more detail how that might work. I think as long as people submit Dockerfiles and scripts to re-generate everything, something like that could be feasible without too much effort. Part of what we'll discuss in the workshop is how this first initiative went, what changes could be made going forward, etc.


stanleybak commented Jul 10, 2020

@FlorianJaeckle just to confirm, for the images, should we be using the CIFAR-10 test batch? The dataset (http://www.cs.toronto.edu/~kriz/cifar.html) also includes 5 data batches.

There was also some strange behavior in the deep100 table, where the index and prop columns were sometimes stored as floating-point numbers instead of ints, but that's a small issue.

Also for others: are you just analyzing the 100 image versions of this, or both the 100 and 20 image subsets?


stanleybak commented Jul 12, 2020

@FlorianJaeckle I'm having trouble running the networks on the cifar dataset. The produced labels don't match the expected cifar label. Are we supposed to scale the pixel values to [0,1]? Is there normalization we're supposed to be using for each channel? Are the channel orders R,G,B? Is the classification label the max or the min? Can I confirm the images are the test batch and not one of the other batches?

Was anyone else able to execute these networks?

@FlorianJaeckle

Hi @stanleybak,
We do use the test batch of the CIFAR-10 dataset.
We normalise the images with the following parameters:
mean=[0.485, 0.456, 0.406]
std=[0.225, 0.225, 0.225].
Apologies, I realise I should have mentioned this before.
The channel order is indeed R,G,B.
The predicted classification label is the max.
I hope this solves your issue.

@stanleybak

Hi @FlorianJaeckle,

That helped! Now things are classifying correctly. One more question: is the epsilon perturbation applied before normalization, or after?

@FlorianJaeckle

@stanleybak,
it's applied after normalisation.

@stanleybak

@FlorianJaeckle
are they all unsat in the 100 instance files? I'm getting some sat instances and want to make sure I'm running things correctly.


vtjeng commented Jul 13, 2020

@FlorianJaeckle, I've generally seen the epsilon perturbation being applied before normalization (that is, if x is the input and e is the perturbation, we have the perturbed input as Normalize(x+e), where e is norm-bounded).

To confirm, in your case, this is Normalize(x)+e instead?

@GgnDpSngh

Also, is the timeout 1 hour in total or per property? And are we testing 100 or 20 properties for each network?

Cheers,

@FlorianJaeckle

@stanleybak, all properties are unsat; i.e., there does not exist a counter-example, in other words the network is robust around the image. Can you give an example of a property where your method returned an adversarial example?

@vtjeng, given an image x and an epsilon value, the input domain for the CNN is [Normalize(x)-e, Normalize(x)+e]
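(So the box constraints handed to a verifier would be built roughly as follows, using the mean/std values given above for the oval CIFAR-10 networks; this is a sketch, not oval's code.)

```python
import numpy as np

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(3, 1, 1)
STD  = np.array([0.225, 0.225, 0.225], dtype=np.float32).reshape(3, 1, 1)

def input_bounds(x_chw, eps):
    """Per-pixel bounds [Normalize(x) - eps, Normalize(x) + eps] for an RGB image
    x_chw in [0,1] with shape (3, 32, 32): normalize first, then widen by eps."""
    z = (x_chw - MEAN) / STD
    return z - eps, z + eps
```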

@GgnDpSngh, the timeout is 1 hour per property.


stanleybak commented Jul 14, 2020

For the normalization, if you want, you can convert the epsilon values to pre-normalized values by multiplying by the standard deviation (0.225), since it's the same for all channels. I also assume there's clipping at the limits, which wasn't mentioned before.

@FlorianJaeckle, I think I found my problem. Since there is a target class, is the correct condition to check (1) output[original_label] < output[target_label], or (2) argmax(output) == target_label?

@GgnDpSngh

@stanleybak since they want the target label not to be chosen for perturbed images, I guess the condition to verify is that output[original_label] > output[target_label] holds; this is how I have been running ERAN, but maybe @FlorianJaeckle can confirm?

Cheers,
Gagandeep Singh

@FlorianJaeckle

@stanleybak the first condition is the correct one.
@GgnDpSngh is right: since we want to check whether the target_label is chosen ahead of the original label, the condition is indeed output[original_label] < output[target_label].

@GgnDpSngh

@FlorianJaeckle I am confused now: are we verifying that output[original_label] < output[target_label] holds, or the other direction that I mentioned?

Cheers,

@joshua-smith4

We are running into a compatibility issue between our tool's requirements and the networks that we are all evaluating on. If you have experience translating between ONNX format and Tensorflow protocol buffers, we would really appreciate you reading on and providing suggestions to overcome this obstacle.
Our tool uses a modification of the FGSM algorithm proposed by Ian Goodfellow to perform some "refutation-based" verification prior to the main verification routine as an optimization. FGSM requires access to the gradient of the loss function with respect to the input. We have written our tool to accept the label of the tensorflow graph node associated with this gradient as an input. Does anyone know of a way that we could modify the provided networks to gain access to this gradient? Any help and suggestions would be greatly appreciated.
Thanks,
Josh


FlorianJaeckle commented Jul 15, 2020

@GgnDpSngh we have formulated our problem as follows:
\exists x' \in [x-\epsilon, x+\epsilon] such that:
f(x')_{target\_label} > f(x')_{original\_label},
where f(x')_{i} is the score given by the CNN f to the output class i, for a given input x'.
We solve this equation by either finding an x' that satisfies it (an adversarial example), or by proving that the equation is UNSAT (and thereby verifying robustness).
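(Any x' reported as SAT can therefore be cross-checked independently of the reporting tool; a minimal sketch, where run_network is a placeholder for the network's forward pass.)

```python
import numpy as np

def is_counterexample(run_network, x_adv, lb, ub, original_label, target_label):
    """Check that x_adv lies inside the input box [lb, ub] and that the target class
    scores higher than the original class, i.e. the formula above is satisfied."""
    inside = np.all(x_adv >= lb - 1e-9) and np.all(x_adv <= ub + 1e-9)
    scores = run_network(x_adv)
    return bool(inside and scores[target_label] > scores[original_label])
```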

@GgnDpSngh

@FlorianJaeckle alright thanks.

Cheers,

@GgnDpSngh

So in this category, we are trying the following benchmarks:
mnist_0.1.onnx, mnist_0.3.onnx, cifar_2_255.onnx, and cifar_8_255.onnx, for robustness on all the images correctly classified among the first 100 in the test set.

Then, for the benchmarks from the oval group, we use base.onnx, wide.onnx, and deep.onnx with the 20 properties in the pandas table. Is that all, or am I missing some benchmarks?

Cheers,


stanleybak commented Jul 15, 2020

@GgnDpSngh These are the ones I've been looking at, although for oval they also have pandas tables with 100 properties that are a superset of the 20 properties. Since the timeout is one hour each, though, maybe we should focus on the 20-property version? Also, do you think it makes sense to use a timeout of 5 minutes for mnist_0.1 as well, just so they all have the same timeout?

@GgnDpSngh

@stanleybak the deadline is Friday now, so there might be time to run the 100 properties that the oval group suggests, but we wanted to be uniform. It should be fine to use a timeout of 5 minutes for the mnist_0.1 network as well.

Cheers,


stanleybak commented Jul 15, 2020

Does anyone know of a way that we could modify the provided networks to gain access to this gradient?

@joshua-smith4, I made a python3 script to convert .onnx files to tensorflow1 .pb files that might help. It prints some warnings but it seems to produce files that run correctly. The code is here:
onnx_to_tensorflow1.py.txt

I'm using python package versions numpy 1.17.4, onnx 1.7.0, onnxruntime 1.3.0, tensorflow 2.2.0
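(An alternative to the attached script, not the same code: the onnx-tf package can perform a similar conversion, though its behavior is version-dependent.)

```python
import onnx
from onnx_tf.backend import prepare   # pip install onnx-tf; behavior varies by version

model = onnx.load("mnist_0.1.onnx")    # one of the benchmark networks
tf_rep = prepare(model)                # build a TensorFlow representation of the graph
tf_rep.export_graph("mnist_0.1.pb")    # older onnx-tf versions emit a frozen .pb,
                                       # newer ones write a SavedModel directory
```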

@Neelanjana314

Hi @stanleybak, I got the following images as UNSAT whereas they are mentioned as UNKNOWN in the table:
Mnist_03: 28, 68, 70, 77
Cifar_8_255: 12, 89

Should I change them to UNSAT in ggn-all-results.txt?

Thanks
Neelanjana

@stanleybak

@Neelanjana314, Yes please update the table with the classification. The result columns are in the data/*-front.txt files. Also point out if there are any inconsistencies.

ttj closed this as completed on Jul 18, 2020
ttj reopened this on Jul 18, 2020

ttj commented Jul 18, 2020

@Neelanjana314, Yes please update the table with the classification. The result columns are in the data/*-front.txt files. Also point out if there are any inconsistencies.

Yes, we need to specify the sat/unsat/unknown result of each tool, right? I think it's okay to have a single result column if all tools mostly agree and there are only a few inconsistencies, but if there are more than a few, we may want to present it another way, e.g., including a result column for each tool.

@stanleybak

My hope is that there are few inconsistencies, as these indicate soundness issues. They should also be easy to resolve, since a tool that produces a SAT result can provide a concrete unsafe input which we can all check. If it's just a few cases, maybe we can mark the result as INCONSIS or something similar until it's resolved. Another idea is to use colors along with the times, so for example green would be SAT and red UNSAT. Adding an extra column for each tool when they're in agreement 95% of the time would be a mess, in my opinion.


ttj commented Jul 18, 2020

My hope is that there are few inconsistencies, as these indicate soundness issues. They should also be easy to resolve, since a tool that produces a SAT result can provide a concrete unsafe input which we can all check. If it's just a few cases, maybe we can mark the result as INCONSIS or something similar until it's resolved. Another idea is to use colors along with the times, so for example green would be SAT and red UNSAT. Adding an extra column for each tool when they're in agreement 95% of the time would be a mess, in my opinion.

Thanks! Yes, I agree that for presentation we should not include a result column for each tool as long as they match, since it's not necessary and clutters the presentation, but I like your solution from #8 of having it available so we can easily check across tools.
