- Fixed non-deterministic ordering when running jobs. Now jobs always run starting with the lowest JID again.
- Add the `--stagger` flag for machine setup. This helps avoid resource bottlenecks during setup tasks for many machines.
- Add some more columns to `job stat` for convenience.
- Add `--only_done` flag to `job stat` to only include jobs that are done.
- Add retry limits to `job add` and `job matrix add`.
- All command line arguments that expect a JID now also accept the value `last` to indicate the JID of the last job.
- Renamed `job matrix stat` to `job matrix ls` for better consistency with `job ls`.
- The following have all been removed in favor of the new `job stat` command:
  - The `--output` flags of `job matrix ls` and `job ls`
  - The `job matrix csv` subcommand
  - The `job results` subcommand
- The `job stat` command has gained significant superpowers. It is now vastly more useful for post-processing results and for outputting data in a number of useful formats. See the help message for more info, but here are some examples:
  - Print a plain text table of the given experiments with only the given columns:

        > j job stat --text --jid --time --machine $EXPERIMENT_JIDS
        JID    TIME   MACHINE
        14923  1h34m  clnode199.clemson.cloudlab.us:22
        14951  1h4m   clnode201.clemson.cloudlab.us:22
        14956
        14963  46m2s  clnode212.clemson.cloudlab.us:22

  - Print a JSON of ID and log path for all running jobs:

        > j job stat --json --jid --log --running
        [{"jid":"15778","log":"/path/to/my.log\n"},{"jid":"15781","log":"/path/to/my.log\n"},{"jid":"15787","log":"/path/to/my.log\n"},{"jid":"15792","log":"/path/to/my.log\n"},{"jid":"15798","log":"/path/to/my.log\n"},{"jid":"15831","log":"/path/to/my.log\n"},{"jid":"15832","log":"/path/to/my.log\n"},{"jid":"15833","log":"/path/to/my.log\n"},{"jid":"15834","log":"/path/to/my.log\n"},{"jid":"15835","log":"/path/to/my.log\n"},{"jid":"15836","log":"/path/to/my.log\n"},{"jid":"15837","log":"/path/to/my.log\n"}]

  - Print a CSV generated by mapping each job's info with the given script,
    which takes a JSON of all info about a job:

        > j job stat --id 14740 --jid --results --cmd --csv --mapper /nobackup/extract.py
        Data filename,Huge page,Runtime (s),cpu_clk_unhalted.thread_any,cs,dtlb_load_misses.miss_causes_a_walk,dtlb_load_misses.walk_active,dtlb_store_misses.miss_causes_a_walk,dtlb_store_misses.walk_active,faults,inst_retired.any,migrations
        /nobackup/scratch/page-value/exp_10__bare_metal___hacky_spec17__-transparent_hugepage_huge_addr140721422073856-transparent_hugepage_huge_addr_mode_Less_-2020-10-13-10-25-24-316505881.mmu,TODO,356.387142448,3742290180352,5889,4265220134,116025476938,271985527,8132321876,11494749,5143124399584,6
        /nobackup/scratch/page-value/exp_10__bare_metal___hacky_spec17__-transparent_hugepage_huge_addr140721422073856-transparent_hugepage_huge_addr_mode_Less_-2020-10-13-10-25-24-414504453.mmu,TODO,360.590801957,3756669277437,3031,4273389066,116331409111,277050659,8249010131,11494750,5142777500448,10
        /nobackup/scratch/page-value/exp_10__bare_metal___hacky_spec17__-transparent_hugepage_huge_addr140721422073856-transparent_hugepage_huge_addr_mode_Less_-2020-10-13-10-25-24-822510504.mmu,TODO,356.487874315,3742004270438,5900,4274810097,116482238285,273613875,8186410000,11494750,5142880002204,4
        /nobackup/scratch/page-value/exp_10__bare_metal___hacky_spec17__-transparent_hugepage_huge_addr140721422073856-transparent_hugepage_huge_addr_mode_Less_-2020-10-13-10-27-43-860570478.mmu,TODO,359.158573514,3752229639845,2997,4276891646,116319096098,276830049,8289222935,11494752,5142941529188,9
        /nobackup/scratch/page-value/exp_10__bare_metal___hacky_spec17__-transparent_hugepage_huge_addr140721583554560-transparent_hugepage_huge_addr_mode_Less_-2020-10-13-10-34-03-577381879.mmu,TODO,358.304955228,3751894627436,3188,4267564111,116262044231,274054941,8203832653,11258667,5141843316464,14

  - Print a plain text table of the given experiments, using the given script
    to map over a particular column of the data as plain text:

        > j job stat --running --cmd --cmd_map /tmp/replace_a_with_unk.sh --text
        cmd
        exp00010 --mmu_overheunkd hunkcky_spec17 xunklunkncbmk
        exp00010 --mmu_overheunkd hunkcky_spec17 xunklunkncbmk
        exp00010 --mmu_overheunkd hunkcky_spec17 xunklunkncbmk
        exp00010 --mmu_overheunkd hunkcky_spec17 xunklunkncbmk
        exp00010 --mmu_overheunkd hunkcky_spec17 xunklunkncbmk
        exp00010 --mmu_overheunkd hunkcky_spec17 xunklunkncbmk
        exp00010 --mmu_overheunkd hunkcky_spec17 xunklunkncbmk
        exp00010 --mmu_overheunkd hunkcky_spec17 xunklunkncbmk
        exp00010 --mmu_overheunkd hunkcky_spec17 xunklunkncbmk
        exp00010 --mmu_overheunkd hunkcky_spec17 xunklunkncbmk
        exp00010 --mmu_overheunkd hunkcky_spec17 xunklunkncbmk

  - Combine with shell to restart all failed jobs among the listed experiments:

        > j job restart $(j job stat --text --jid --status $EXPERIMENT_JIDS | grep Failed | awk '{print $1}')

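The `--json` output shown in the examples above is convenient to post-process in a script. As a minimal sketch, assuming the output is a JSON array of objects with `jid` and `log` string fields exactly as in that example, the following Python maps each JID to its log path. The helper name `log_paths_by_jid` is hypothetical and not part of the jobserver client.

```python
import json

# Sample input copied from the `--json --jid --log --running` example.
# Note that each "log" value ends with a newline in that output.
sample = r'[{"jid":"15778","log":"/path/to/my.log\n"},{"jid":"15781","log":"/path/to/my.log\n"}]'


def log_paths_by_jid(stat_json):
    """Map each JID to its log path, dropping the trailing newline.

    Assumes the input has the shape shown in the changelog example;
    other columns or output formats are not handled.
    """
    return {job["jid"]: job["log"].rstrip("\n") for job in json.loads(stat_json)}


if __name__ == "__main__":
    for jid, log in log_paths_by_jid(sample).items():
        print(jid, log)
```

In practice the input would come from the command's standard output rather than from a hard-coded string.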
- Added the `job mvresults` subcommand to copy all files associated with a task to a new location.
- Fixed issue where copying results hangs due to SSH host key verification
  failure. This was a long-standing and annoying issue. Instead of hanging, we
  now detect this case and print a specific error message encouraging the user
  to add the given host to their `known_hosts` file. Additionally, we move the
  host out of the class so that further experiments won't error out wastefully.
  The user can move it back when the host has been added to `known_hosts`.
- Fixes a panic on "narrow" terminals.
- Change the way results files are identified. The runner should now return a common prefix of all files to be copied, and the jobserver will copy all files with that prefix. In contrast, in the past, you had to return a filepath with a glob.
- The client now has some better support for manipulating said prefixes.
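To illustrate the difference between a glob and a common prefix, here is a minimal sketch of prefix-based file selection. It is not the jobserver's implementation: the function name `files_with_prefix` is hypothetical, and it assumes the prefix is a directory plus the leading part of a file name, matched only within that one directory.

```python
import os


def files_with_prefix(prefix):
    """Return all regular files whose path starts with `prefix`.

    Assumption for illustration: the prefix is split into a directory and
    a file-name prefix, and only that directory is searched (no recursion).
    """
    directory, name_prefix = os.path.split(prefix)
    directory = directory or "."
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.startswith(name_prefix)
        and os.path.isfile(os.path.join(directory, name))
    )
```

With this scheme, a runner that writes `exp_1.mmu` and `exp_1.log` into the same directory only needs to report the prefix ending in `exp_1`, and both files are selected without any glob syntax.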
- Added `machine mv` subcommand.
- Fix minor bugs.
- Matrices that have become empty because all of their jobs were forgotten will also be forgotten. This is different from prior behavior, so I'm bumping the major version.
- Added support for timing out jobs.
- Added a shortcut for restarting a job.
- Added `-r` flag to list all running jobs.
- Fix some bugs.
- Bump the optimization level a bit.
- Minor backwards-compatible changes to client-server protocol and vast
  refactoring of client-side printing for `job ls`. These produce a major
  improvement in the format of job listings for matrices.
- Changes to client-side `j machine rm` arguments to allow removing classes of machines more easily. This allows removing expired reservations more easily.
- Add `j job results` subcommand.
- Major improvements to handling of failed/cloned jobs in matrices:
  - When a matrix job is cloned, the clone also ends up in the matrix.
  - Matrix jobs automatically repeat on failure.
- `j job matrix add` now supports the `-x` flag.
- `j job ls` now prints a summary of the printed jobs.
- Internal rearchitecting of the thread that copies results back to the host. This may allow future improvements to handling of failed/timed out/hanging copying tasks.
- Reimplemented the server state serialization for snapshots. This fixes weird
  errors where tasks would become corrupted after a server restart for no
  apparent reason. Unfortunately, this is a breaking change to the format of the
  server snapshots, so tasks that were already in the snapshot will show as
  `Unknown` after restarting the server into version 0.4.
- This is the first version I published on crates.io.