## On reproducing the experiments
This repository is designed to make it easy to qualitatively reproduce the experiments described in the paper, in the sense that it lets you generate data from the exact same distribution used in the paper (i.e. SR(U(10, 40))) and train (almost) exactly the same model with (almost) exactly the same hyper-parameters.
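To make the distribution concrete, here is a minimal sketch of sampling one problem pair from SR(n), following the procedure described in the paper. This is not the repository's actual generator: it uses the python-sat (pysat) package as a stand-in for our minisat bindings, and `sample_sr_pair` is just an illustrative name.

```python
import random
import numpy as np
from pysat.solvers import Minisat22

def sample_sr_pair(n, p_k_2=0.3, p_geo=0.4):
    """Return (unsat_clauses, sat_clauses) for one pair drawn from SR(n)."""
    clauses = []
    with Minisat22() as solver:
        while True:
            # Clause width: 1 or 2 plus a geometric tail.
            k = (1 if random.random() < p_k_2 else 2) + np.random.geometric(p_geo)
            vs = np.random.choice(n, size=min(n, k), replace=False)
            clause = [int(v + 1) if random.random() < 0.5 else -int(v + 1) for v in vs]
            solver.add_clause(clause)
            clauses.append(clause)
            # Keep adding random clauses until the problem first becomes unsat.
            if not solver.solve():
                break
    # Negating one literal of the final clause yields the satisfiable twin:
    # any assignment satisfying the earlier clauses falsifies every literal of
    # the final clause, so flipping one of them satisfies the whole formula.
    unsat_clauses = clauses
    sat_clauses = clauses[:-1] + [[-clauses[-1][0]] + clauses[-1][1:]]
    return unsat_clauses, sat_clauses
```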
The capabilities that are currently missing are:
1. Generating the exact same dataset from SR(U(10, 40)), i.e. with the same random seed that we used. Unfortunately, we neglected to store the seed for the original training data. I would be extremely surprised if the seed made any difference.
2. Using the exact same model. The embedding-initialization code in this repository has been slightly simplified compared to the code we originally used. We saved the exact commit of our internal repository that was used for the run described in the paper, so running the exact same model is possible in principle, though I would be very surprised if the results depended on the subtleties in question.
3. Using the exact same hyper-parameters. We originally generated the hyper-parameters randomly, and in the paper I rounded a few of the floats, e.g. 1.95 x 10^{-5} --> 2 x 10^{-5}. We saved (but did not include in the paper) all the original hyper-parameters, including the numpy and tensorflow random seeds for the weight initializations. As above, I would be extremely surprised if these tiny differences mattered.
Also, we do not provide support for generating the "transfer suite" of graph problems evaluated in the paper. We have not included the generator code only because it has several dependencies. Unfortunately, we also neglected to store the random seed used during data generation. As above, I highly doubt that the seed matters.
The reason I don't think (1)-(3) are a big deal is that the results are robust: even with the caveats listed above, anybody trying to reproduce them should be able to train something with qualitatively similar behavior, perhaps slightly better or worse on any particular metric. There is also nothing magic about the curriculum, or about n=40 -- we have trained variants on many different datasets with many different settings and hyper-parameters and have gotten qualitatively similar behavior on most runs. If you have access to a GPU, you could try modifying the toy scripts to generate SR(U(10, 40)) and SR(40), and then train for several hours on a few million problems using the hyper-parameters listed in the paper. Sometimes the random seed for the weight initialization can doom you, but I would be very surprised if you did not get close to the results in the paper within a few tries at most.
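Concretely, the only difference between the training and evaluation data is how n is drawn: uniformly from {10, ..., 40} for SR(U(10, 40)) and fixed at 40 for SR(40). Using the illustrative `sample_sr_pair` helper sketched above:

```python
import random

def sample_sr_u_pair(n_min=10, n_max=40):
    """One pair from SR(U(n_min, n_max)): draw n uniformly, then sample SR(n)."""
    n = random.randint(n_min, n_max)
    return sample_sr_pair(n)

# Training problems from SR(U(10, 40)); evaluation problems from SR(40).
# (Scale n_pairs up to a few million for a run comparable to the paper.)
n_pairs = 10000
train_pairs = [sample_sr_u_pair(10, 40) for _ in range(n_pairs)]
test_pairs = [sample_sr_pair(40) for _ in range(n_pairs)]
```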
Note that it is not the messiness of the code that makes it tricky to share; it is that the code is tightly integrated with our cloud services. The data generation runs on CPUs on Google Compute Engine (GCE), the data and all the logs and stats are written to and read from a storage bucket, and the training is done on GPUs, also on GCE. The public repository still has most of the utilities we used and is mainly just missing the parts that write to the cloud or interact with a database. Depending on what resources you have access to, you would need to implement these components yourself anyway. It may be possible to parameterize the system by a user's GCE account to make all of this push-button reproducible, but that would likely be a lot of work.
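For what it is worth, the missing pieces are mostly a thin storage layer, so "implementing these components yourself" can be as simple as a local-disk stand-in along the following lines (the class and method names here are purely hypothetical, not code from our internal repository):

```python
import json
import os
import pickle

class LocalStore:
    """Hypothetical local-disk replacement for the bucket/database layer."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def save(self, name, obj):
        # Store datasets and checkpoints as pickles on local disk.
        with open(os.path.join(self.root, name), "wb") as f:
            pickle.dump(obj, f)

    def load(self, name):
        with open(os.path.join(self.root, name), "rb") as f:
            return pickle.load(f)

    def log_stats(self, name, stats):
        # Append one JSON record per line instead of writing to a database.
        with open(os.path.join(self.root, name), "a") as f:
            f.write(json.dumps(stats) + "\n")
```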