Show some output validator feedback #140

thorehusfeldt · 2019-10-21T08:55:38Z

If the output validator put something in feedbackdir/judgemessage.txt
or feedbackdir/teammessage.txt, then show (a summmary of) that
information, preferring the former.

The summary is just a truncation of the txt-file, except if it
is recognisable as the output from the standard output validator,
in which case it is summarised more carefully.

If the output validator put something in feedbackdir/judgemessage.txt or feedbackdir/teammessage.txt, then show (a summmary of) that information, preferring the former. The summary is just a truncation of the txt-file, except if it is recognisatble as the output from the standard output validator, in which case it is summarised more carefully.

austrin · 2019-11-09T11:40:10Z

Sorry for slow response; various other duties have demanded attention recently.

If I understand correctly, the purpose of this is to show a digest of the validator feedback even for submissions which get their expected judgement? Can you post a screenshot of what it is supposed to look like?

I'm not necessarily against adding this, but am slightly hesitant because if anything I think that the submission summary is already a bit too verbose. But I'm thinking maybe there should be a command-line flag controlling whether or not this is shown?

thorehusfeldt · 2019-11-09T15:22:34Z

Here’s the intended output from an interactive problem, askmarilyn — basically the Monty Hall problem. The judge (or team) messages are shown in square brackets. The interesting part is the WA section, where you can see the (wrong) submissions implementing various broken strategies or protocols, and triggering various judge/team messages. For instance, the WA submission “random door”, that just picks doors at random, receives 346 cars/drinks (out of 1000), which are too few. (More than 600 are needed.) This is helpful for increasing my confidence during problem development that the WA submissions achieve full code coverage in the output verifier, and that the team messages make sense.

root@77c1d7592d08:/katdev/will-code-for-drinks-F2019# verifyproblem askmarilyn
Loading problem askmarilyn
Checking config
Checking statement
Checking validators
Checking graders
Checking data
Checking submissions
   AC sub sl.cpp (C++) OK: AC [Congratulations! You got 650 drinks, CPU: 0.08s @ tc secret/1-0]
   AC sub thore.py (Python 3) OK: AC [Congratulations! You got 697 drinks, CPU: 0.15s @ tc secret/2-1]
   Slowest AC runtime: 0.150, setting timelim to 1 secs, safety margin to 2 secs
   WA sub always_door_A.py (Python 3) OK: WA [Too few drinks: 330, tc secret/1-0, CPU: 0.04s @ tc secret/1-0]
   WA sub break_protocol.py (Python 2 w/PyPy) OK: WA [Round 1(1): Invalid door name: 1, tc secret/1-0, CPU: 0.10s @ tc secret/1-0]
   WA sub first_a.py (Python 3) OK: WA [Too few drinks: 471, tc secret/1-1, CPU: 0.13s @ tc secret/1-0]
   WA sub first_b.py (Python 3) OK: WA [Too few drinks: 469, tc secret/1-2, CPU: 0.15s @ tc secret/1-2]
   WA sub first_c.py (Python 3) OK: WA [Too few drinks: 476, tc secret/1-3, CPU: 0.19s @ tc secret/1-2]
   WA sub ignore_positive_hint.py (Python 3) OK: WA [Too few drinks: 0, tc secret/1-0, CPU: 0.16s @ tc secret/1-0]
   WA sub plays-1001-rounds.py (Python 3) OK: WA [Trailing output, tc secret/1-0, CPU: 0.06s @ tc secret/1-0]
   WA sub plays-1999-halfrounds.py (Python 3) OK: WA [Round 1000(2): No door guessed, tc secret/1-0, CPU: 0.07s @ tc secret/1-0]
   WA sub plays-999-rounds.py (Python 3) OK: WA [Round 1000(1): No door guessed, tc secret/1-0, CPU: 0.06s @ tc secret/1-0]
   WA sub plays-forever.py (Python 3) OK: WA [Trailing output, tc secret/1-0, CPU: 0.16s @ tc secret/1-0]
   WA sub random_door.py (Python 3) OK: WA [Too few drinks: 346, tc secret/1-0, CPU: 0.18s @ tc secret/1-0]
   WA sub silent-death.py (Python 2 w/PyPy) OK: WA [Round 1(1): No door guessed, tc secret/1-0, CPU: 0.03s @ tc secret/1-0]
   WA sub spam.py (Python 3) OK: WA [Round 1(1): Invalid door name: s, tc secret/1-0, CPU: 0.02s @ tc secret/1-0]
   TLE sub forever-silent.py (Python 2 w/PyPy) OK: TLE [tc secret/1-0, CPU: 4.00s @ tc secret/1-0]

thorehusfeldt · 2019-11-09T15:34:51Z

Here’s the output on a simpler, non-interactive problem. The judge message from the standard output generator is shown, in a truncated but still useful way, in square brackets.

Checking submissions
   AC sub sl.cpp (C++) OK: AC [CPU: 0.00s @ tc sample/1]
   AC sub tgif-py3.py (Python 3) OK: AC [CPU: 0.03s @ tc sample/3]
   AC sub tgif-simulate-py3.py (Python 3) OK: AC [CPU: 0.01s @ tc sample/1]
   Slowest AC runtime: 0.030, setting timelim to 1 secs, safety margin to 2 secs
   WA sub tgif_31days-py3.py (Python 3) OK: WA [1/1: judge: "not", team: ":(", tc secret/18, CPU: 0.01s @ tc sample/1]
   WA sub tgif_leapday-py3.py (Python 3) OK: WA [1/1: judge: "TGIF", team: "not", tc secret/32, CPU: 0.02s @ tc secret/14]
   WA sub tgif_too_unsure-py3.py (Python 3) OK: WA [1/1: judge: ":(", team: "not", tc sample/1, CPU: 0.01s @ tc sample/1]
   RTE sub tgif-feb29-py3.py (Python 3) OK: RTE [tc secret/32, CPU: 0.02s @ tc secret/22]
   TLE sub tgif-simulate-buggy-py3.py (Python 3) OK: TLE [tc sample/1, CPU: 2.99s @ tc sample/1]
   TLE sub tgif-simulate-py3.py (Python 3) OK: TLE [tc sample/1, CPU: 3.01s @ tc sample/1]
tgif tested: 0 errors, 0 warnings

thorehusfeldt · 2019-11-09T15:35:56Z

All of these examples show that submission has become sub and test case is abbreviated to tc, following the current conventions for data generator scripts.

thorehusfeldt · 2019-11-09T15:41:31Z

Now we’re here, the way verification summaries are shown on https://github.com/RagnarGrootKoerkamp/BAPCtools (colours, alignment) are a big improvement for readability. I’m happy to submit a PR with a similar result. (But I’m a big fan of not interrupting workflows and UI conventions of experienced users. It’s highly plausible that I’m overlooking things and would prefer to have a lot more experience with actually using verifyproblems. I’m very much a newcomer.)

thorehusfeldt · 2019-11-09T15:46:29Z

As for the command line flag, the cleanest way is an option to verifyproblem that takes a format string, and a sensible default. This avoids overloading the options to verifyproblem. git’s pretty is a good model.

https://git-scm.com/docs/pretty-formats

austrin · 2019-11-10T10:05:39Z

Now we’re here, the way verification summaries are shown on https://github.com/RagnarGrootKoerkamp/BAPCtools (colours, alignment) are a big improvement for readability. I’m happy to submit a PR with a similar result.

That would be great! I've been meaning to do things like that for a long time but never get around to it. I know also @RagnarGrootKoerkamp has been wanting to merge those features of BAPCtools into problemtools.

As for the command line flag, the cleanest way is an option to verifyproblem that takes a format string, and a sensible default. This avoids overloading the options to verifyproblem. git’s pretty is a good model.

I agree, was also thinking that a format string is the way to go. Something along the lines of git's pretty sounds perfectly reasonable to me.

All of these examples show that submission has become sub and test case is abbreviated to tc, following the current conventions for data generator scripts.

I think these conventions you refer to are more localized than you think and I don't really like this part. But of course this becomes near-moot with the format string, then everyone can use whatever abbreviations they want.

thorehusfeldt · 2019-11-11T09:46:50Z

OK, I think I know what to do. The cleanest way would be to implement __format__ for (at least) SubmissionResult and SourceCode, supporting syntax like %e for the string currently in expected_verdict so that for instance %e submission %n (%l) describes what is currently in the string variable desc (namely, AC submission binarysearch.py (Python 3)). Then all the strings created by Submissions.checksubmission() should be given in this mini-language. This rewrite should be its own commit and simply maintain current functionality.

Once that is done, (some of) the formatting strings can become command-line options to verifyproblem.

Once that is done, I can add “my” team and judge message digests to the formatting mini language.

I suggest we close this PR and I task myself with (some version) of the above plan.

thorehusfeldt · 2019-11-11T13:22:37Z

Note to self (or others who understand the code better): the fields SubmissionResult.ac_runtime and SubmissionResult.ac_runtime_testcase seem to be never used and maybe should just be removed. (My guess is that runtime and runtime_testcase contain that information today.)

austrin · 2019-11-11T19:40:17Z

Sounds good. Would probably be good to settle on the details of the formatting notation. It may be more Pythonesque to use {name} notation, e.g. {expected_result} submission {submission_name} ({submission_language}) and then format these using .format()? That would also automatically enable things like writing in a fixed field width to get alignment.

thorehusfeldt · 2019-11-11T20:05:59Z

I’m doing that right now. Here’s a teaser from my current code:

 fmt_string = '   {exp}/{name:<{width}} \033[;32m OK:\033[0;0m {result}'

But I have to play around with this a bit and field-test it before I make a concrete suggestion.

Any comment on ac_runtime_testcase and ac_runtime?

Using __format__ makes immediate sense as a method for SourceCode. Submissions on the other hand are a bit in limbo, not having their own object, and not really knowing their own expected_result. (So the current desc variable can’t really turn into a __format__ method of any object.) But I have to familiarize myself with the code a bit more. It’s very … organic.

thorehusfeldt · 2019-11-12T07:03:29Z

After thinking a bit more: Would you object to either of

Extend SourceCode with self.expected_verdict. Then check_submission in

problemtools/problemtools/verifyproblem.py

Line 1266 in eb7b607

def check_submission(self, sub, args, expected_verdict, timelim, timelim_low, timelim_high):

could lose one argument (because it’s a field of sub), but the primary value would be that SourceCode could get a sensible __format__ method that would default to the string currently in desc

problemtools/problemtools/verifyproblem.py

Line 1267 in eb7b607

desc = '%s submission %s' % (expected_verdict, sub)

i.e., “PAC submission brute_force.py (Python 3)”, but which could also become “partially_accepted/brute_force.py” or “the file brute_force.py which I tentatively assume should produced a PAC result based on its directory partially_accepted and which smells like Python 3 to me” depending on options to __format__.
If you dislike SourceCode objects knowing about their expected verdict, create class Submission(SourceCode) with the desired fields. (I think this is overkill, but it’s quite clean.)

austrin · 2019-11-12T08:11:36Z

Hmm I think I would actually prefer the second option, since there are other things (validators) that are SourceCode instances which don't have an expected verdict, and I can imagine that one at one point wants to display their name.

thorehusfeldt · 2019-11-12T08:55:34Z

Where does the new class Submission live? problemtools/problemtools/run/submission.py? (Going this route would lead to quite some refactoring. I’m not sure I have enough experience with the code to do that alone.)

thorehusfeldt · 2019-11-12T10:32:16Z

1st proposal for syntax of formatting strings:

There are (at least) two formatting strings, SUBMISSION_FORMAT and AC_RESULT_FORMAT.

SUBMISSION_FORMAT has access to expected_result, language, directory and name. The default value is

SUBMISSION_FORMAT = '{expected_result} submission {name} ({language})'

Now for the problem. The format for results is much, much harder to specify, because not all fields are present in all cases. It would make sense to specify the two standard cases (AC and anything else with reason). Even then there would be whitespace issues. Check out these, just to get the ball rolling:

This would reproduce the current behaviour:

AC_RESULT_FORMAT = 'OK: {scored_verdict} CPU: {max_running_time:.2f} @ test case: {max_tc}'

And this would be BAPC-tools:

AC_RESULT_FORMAT = ' \033[;32mOK: {verdict}\033[0;0m sum: {sum_running_time:.2f} max: {max_running_time:.2f} ({max_tc})'

And here is a suggestion with judge or team messages and pythonesque truncation

AC_RESULT_FORMAT = 'OK: {verdict:<5} {team_or_judge_message:.30} {max_running_time:.2f} ({max_tc})'

These examples show some of the fields I have currently in play.

But what about AC submissions with self.runtime == -1 or self.test_case == None? I’m not sure in which contexts these appear, but the current implementation assembles details depending on the presence of these fields in a careful manner. This is very hard to specify in Python’s mini-language, unless we want to live with None in the output.

And what about non-AC submissions? Do we really need REJECTED_RESULT_FORMAT_WITH_REASON and REJECTED_RESULT_WITHOUT_REASON_BUT_POSITIVE_RUNTIME ?

thorehusfeldt · 2019-11-13T08:47:39Z

Here’s a concrete suggestion for which keywords to recognise. Expected usage is a format string like this:

fmt = '{expected_verdict:<3} submission {name:<20}: {verdict_score:<5}, cpu: {max_time:1.2f} ({maxtime_tc:<20}) {team-message:.20}'

      '''Return submission result in the specified format.

        The following keywords are recognised in the format string:

        meaning             type   examples
        verdict           * str    AC
        score               int    12
        verdict_score     * str    AC, AC(12)
        expected_verdict  * str    TLE
        name              * str    binarysearch.py
        directory         * str    accepted
        language          * str    Python 3
        max_time            float  .455411 
        total_time          float  10.125315
        max_time_tc         str    group1/021-n-100000.in
        reason              str    validator produced "score.txt" ...
        reason_tc           str    group2/03-overflow.in
        judge_message       str    l 43: int expected
        team_message        str    Congratulation! 534 drinks
        
        The starred keywords are never None. The others may be None.
        The numerical keywords (score, max_time, total_time) may
        be None. If so, they degrade gracefully to minuses, but still
        respect formatting directives. For instance, {score:03d} will 
        result in '014' if score is 12, and '---' if score is None.
        '''

Related: the current version of verifyproblem seems to be fine with reporting negative running times, in case max_runtime is None. For instance on the input-free problem hello:

Slowest AC runtime: -1.000, setting timelim to 1 secs, safety margin to 2 secs

That’s way easier to implement, of course. Still, something is fishy in that part of the code, and I don’t quite understand the expected semantics of running times being -1, None, or something else.

pehrsoderman · 2021-04-26T12:44:24Z

@thorehusfeldt Can you rebase and clean up this PR as a preparation to get it merged?

thorehusfeldt · 2021-05-07T07:39:40Z

I’d be very happy to take this up again. My own conclusion is that there are two subtasks with well-defined but different goals:

make the output of verifyproblem more readable and flexible (as per the conversation with Austrin in the thread above, from November 2019) by using some kind of formatting notation.
actually show more validator feedback, which was the original motivation for the present PR.

1 precedes 2, and solving 1 well trivialises 2. So I guess the best way to proceed is to close the present pull request and task myself with solving 1 (resulting in a new pull request).

pehrsoderman · 2021-05-07T09:27:11Z

@thorehusfeldt Sounds like a good plan.

thorehusfeldt · 2021-05-07T14:39:19Z

@pehrsoderman : I would need an answer to the question from 12 Nov 2019.

The conversation with @austrin points to the solution of introducing class Submission(SourceCode), which is a good idea in the first place, but requires some refactoring.

Would it be useful have that as a first subgoal? The counterargument is that this would lead to quite a lot of changes to the code. (Thumbs-up is enough.)

thorehusfeldt · 2021-06-09T06:14:29Z

@pehrsoderman : I have now thought through one way of making verifyproblems output both more flexible and more informative. This is by using the already existing, and quite well-thought out, logging framework of Python. Doing so first involves some refactoring of the existing code (basically moving responsibility from the superclass ProblemAspect to a logger.py module) which I have done in PR198 #198 .

Before continuing along this avenue I’d like you (or anybody else) to comment on that suggestion.

(A next step would then be to separate Submission from Submissions; the current architecture would not allow that because Submission(ProblemAspect) would be confused about which log messages to send: (Error in Submissions?) because the logger and the base class are coupled.)

thorehusfeldt mentioned this pull request Oct 21, 2019

askmarilyn: Can she (as the Judge) tell me how much I drank? thorehusfeldt/will-code-for-drinks-F2019#15

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Show some output validator feedback #140

Show some output validator feedback #140

thorehusfeldt commented Oct 21, 2019 •

edited

Loading

austrin commented Nov 9, 2019

thorehusfeldt commented Nov 9, 2019 •

edited

Loading

thorehusfeldt commented Nov 9, 2019

thorehusfeldt commented Nov 9, 2019

thorehusfeldt commented Nov 9, 2019

thorehusfeldt commented Nov 9, 2019 •

edited

Loading

austrin commented Nov 10, 2019

thorehusfeldt commented Nov 11, 2019

thorehusfeldt commented Nov 11, 2019

austrin commented Nov 11, 2019

thorehusfeldt commented Nov 11, 2019 •

edited

Loading

thorehusfeldt commented Nov 12, 2019 •

edited

Loading

austrin commented Nov 12, 2019

thorehusfeldt commented Nov 12, 2019

thorehusfeldt commented Nov 12, 2019 •

edited

Loading

thorehusfeldt commented Nov 13, 2019 •

edited

Loading

pehrsoderman commented Apr 26, 2021

thorehusfeldt commented May 7, 2021

pehrsoderman commented May 7, 2021

thorehusfeldt commented May 7, 2021

thorehusfeldt commented Jun 9, 2021 •

edited

Loading

Show some output validator feedback #140

Are you sure you want to change the base?

Show some output validator feedback #140

Conversation

thorehusfeldt commented Oct 21, 2019 • edited Loading

austrin commented Nov 9, 2019

thorehusfeldt commented Nov 9, 2019 • edited Loading

thorehusfeldt commented Nov 9, 2019

thorehusfeldt commented Nov 9, 2019

thorehusfeldt commented Nov 9, 2019

thorehusfeldt commented Nov 9, 2019 • edited Loading

austrin commented Nov 10, 2019

thorehusfeldt commented Nov 11, 2019

thorehusfeldt commented Nov 11, 2019

austrin commented Nov 11, 2019

thorehusfeldt commented Nov 11, 2019 • edited Loading

thorehusfeldt commented Nov 12, 2019 • edited Loading

austrin commented Nov 12, 2019

thorehusfeldt commented Nov 12, 2019

thorehusfeldt commented Nov 12, 2019 • edited Loading

thorehusfeldt commented Nov 13, 2019 • edited Loading

pehrsoderman commented Apr 26, 2021

thorehusfeldt commented May 7, 2021

pehrsoderman commented May 7, 2021

thorehusfeldt commented May 7, 2021

thorehusfeldt commented Jun 9, 2021 • edited Loading

thorehusfeldt commented Oct 21, 2019 •

edited

Loading

thorehusfeldt commented Nov 9, 2019 •

edited

Loading

thorehusfeldt commented Nov 9, 2019 •

edited

Loading

thorehusfeldt commented Nov 11, 2019 •

edited

Loading

thorehusfeldt commented Nov 12, 2019 •

edited

Loading

thorehusfeldt commented Nov 12, 2019 •

edited

Loading

thorehusfeldt commented Nov 13, 2019 •

edited

Loading

thorehusfeldt commented Jun 9, 2021 •

edited

Loading