More conditional sourcepos #180

MathieuDuponchelle · 2017-01-04T23:30:07Z

There you go, three commits:

The source map code, which generation is now conditional to the SOURCEPOS option, when not enabled make bench results are identical to the results without the patch. Note that I had a bunch of follow-up commits on my end which I ended up fixing up with that patch, mostly fixes except for one patch that defines reference nodes. I also squashed up @nwellnhof's commits in there, please tell me if you mind.
The wrapper code, also updated but nothing major at all
The actual fancy stuff, a reindenting script, tested on every test of the spec with 1 to 80 columns, the test checks that the new AST is identical to the old one, minus softbreaks. The actual test I propose only checks for reindenting on one column, because 80 * 610 starts to take a little time I'm afraid. Note that the script could break in a few more places than it currently does, and certainly be massively optimized, but I think it's already in a state worth merging because, as far as I can tell, it is now provably reliable.

MathieuDuponchelle · 2017-01-05T01:25:59Z

That is a timeout, pretty sure the code is fine, @jgm if you feel like triggering a new run :)

MathieuDuponchelle · 2017-01-05T15:48:42Z

Rebased and pushed, no timeout this time :)

jgm · 2017-01-05T20:08:58Z

Is it possible to break this into smaller, logical chunks? This is currently such a big ball of code, touching so many parts of the code base, that it's hard to survey.

For example, I see changes in buffer.h like this:

-  /* Oversize the buffer by 50% to guarantee amortized linear time
 -   * complexity on append operations. */
 -  bufsize_t new_size = target_size + target_size / 2;
 -  new_size += 1;
 -  new_size = (new_size + 7) & ~7;
 +  // Oversize the buffer by 50% to guarantee amortized linear time
 +  // complexity on append operations.
 +  bufsize_t add = target_size / 2;
 +  // Account for terminating NUL byte.

I wonder: Is that a mistake, going from target_size + target_size / 2 to target_size / 2? If it's intentional, what's the motivation? There is no explanation of this in the commit history.

That's just one example.

Can we have a commit, for example, that just adds reference nodes? After all, that's a single logical change that might make sense completely indepenedently of the "extent" stuff.

As for the demo script, I wasn't sure from your description what it did. But I think it would be better to develop it outside of cmark itself, so we don't get cmark too cluttered up. It certainly shouldn't go in the src/ tree with the other library code.

MathieuDuponchelle · 2017-01-05T20:36:32Z

@jgm, the first change you mention is from the commit from @nwellnhof that you reverted, which you had squashed together from a bunch of his commits ..

MathieuDuponchelle · 2017-01-05T20:40:12Z

My initial branch was also 4 commits that you ended up squashing as one, I had re-extracted the python stuff from it as it clearly needed to be separate, but squashing things together, merging them, reverting them then asking me to re-split them up is a bit frustrating I have to say

MathieuDuponchelle · 2017-01-05T20:42:08Z

As for putting the formatting script in src/, I put it there because the cmark executable is also located in there, even though it's not part of the library, just wanted to be consistent.

jgm · 2017-01-05T20:45:18Z

+++ Mathieu Duponchelle [Jan 05 17 12:36 ]:

***@***.***, the first change you mention is from the commit from ***@***.*** that you reverted, which you had squashed together from a bunch of his commits ..

I shouldn't have done that - GitHub now makes squashing the default, and I sometimes forget that. Looking back at his commit, and looking at the code in more detail, I see that this isn't a change in functionality; it's just that he's operating with the delta rather than the total size (as in the original). Still, could you divide this into some logically separated changes? At the very least we should have the introduction of reference nodes as a separate commit, and nwellnhof's changes a separate commit; the extent changes should not be mixed with these. And as I suggested the demo program should not be in this PR.

jgm · 2017-01-05T20:51:04Z

+++ Mathieu Duponchelle [Jan 05 17 12:40 ]:

My initial branch was also 4 commits that you ended up squashing as one, I had re-extracted the python stuff from it as it clearly needed to be separate, but squashing things together, merging them, reverting them then asking me to re-split them up is a bit frustrating I have to say

I understand that, the frustration is legitimate and I take the blame. Still, I think going forward, for maintainability and for code review, we need smaller, more atomic, commits. (And this has generally what has made it hard for me to process your PRs in the past.) It's unfair to ask you to re-split this, so I won't. But I'm also reluctant to merge the ball-of-twine without some splitting, and I don't know if I'll get around to doing it any time soon, because this isn't a high-priority feature for me.

jgm · 2017-01-05T20:52:27Z

+++ Mathieu Duponchelle [Jan 05 17 12:42 ]:

As for putting the formatting script in src/, I put it there because the cmark executable is also located in there, even though it's not part of the library, just wanted to be consistent.

Yes, that's true. But I think a python script belongs somewhere else. (And really, probably not in this repository...it makes more sense to me for you to have your own repository for that, where you can develop it as you see fit and rapidly.)

MathieuDuponchelle · 2017-01-05T20:53:30Z

And this has generally what has made it hard for me to process your PRs in the past.

Really ?? My other unmerged PR is split into 20 commits, the source pos stuff was initially four commits!

jgm · 2017-01-05T20:55:02Z

+++ Mathieu Duponchelle [Jan 04 17 15:30 ]:

3. The actual fancy stuff, a reindenting script, tested on every test of the spec with 1 to 80 columns, the test checks that the new AST is identical to the old one, minus softbreaks. The actual test I propose only checks for reindenting on one column, because 80 * 610 starts to take a little time I'm afraid. Note that the script could break in a few more places than it currently does, and certainly be massively optimized, but I think it's already in a state worth merging because, as far as I can tell, it is now provably reliable.

What would help me is to hear a bit more about how this differs from what you can do already with cmark's commonmark -> commonmark transitions. I assume the answer is that your script won't make adventitious changes to things like list indentation, list number delimiter, type of header, etc.?

MathieuDuponchelle · 2017-01-05T20:56:51Z

What would help me is to hear a bit more about how this differs from what you can do already with cmark's commonmark -> commonmark transitions. I assume the answer is that your script won't make adventitious changes to things like list indentation, list number delimiter, type of header, etc.?

Yes, but far from only that, the end goal with this is to have a tool similar to clang-format for commonmark, allowing to control the style in large projects with multiple contributors.

jgm · 2017-01-05T20:57:04Z

Let me try out the formatting script and see if it is something that would make sense to keep in the repo as a demo.

MathieuDuponchelle · 2017-01-05T21:03:07Z

Example use case:

Let's say you're the maintainer of "progit", and want to enforce a certain set of rules (no soft breaks, lists have to use * as the marker, that sort of things, you could also want to follow this markdown style guide) , and the last thing you want to do when reviewing contributions is nitpick all day long about style issues.

The solution I'd like to adopt as such a maintainer, is to have a pre-commit hook that runs cmark-format (which could figure out the style from a well-known-named configuration file for example), and be reasonably certain that any commits submitted for inclusion have gone through and be validated by the formatting script.

On the contributor side, you could use the tool to automatically refactor your changes in order to make them comply with the required style, which would thus eliminate tedious and error-prone manual changes.

jgm · 2017-01-05T21:04:48Z

+++ Mathieu Duponchelle [Jan 05 17 12:56 ]:

Yes, but far from only that, the end goal with this is to have a tool similar to clang-format for commonmark, allowing to control the style in large projects with multiple contributors.

Yes, this would be useful and it's certainly something people have wanted.

MathieuDuponchelle · 2017-01-05T21:07:30Z

@jgm, well that's what this "demo" script aims at becoming, it only handles width for now because that was the only difficult thing to do, the rest can easily be added by anyone with minimum python's working knowledge, which is part of the reason why I submitted it now, the other being indeed that I wanted to demo the script to you as I figured you would see the point :)

MathieuDuponchelle · 2017-01-05T21:10:15Z

Another important thing with that script is also that it does not do unnecessary escaping, which is really what bugged me with -t commonmark, ie given:

foo > bar

--width 80 will give you:

foo > bar

whereas --width 1 will correctly give you:

foo
\>
bar

jgm · 2017-01-05T21:19:35Z

If splitting is too difficult, could you at least amend the commit message to include a more detailed accounting of the changes made and their motivations? The changelog will need to contain something like that, in any case, and I usually construct that from commit messages. All API changes, especially, need to be very carefully documented in the changelog. In this case that would include the addition of reference nodes and all the new extent-related API.

MathieuDuponchelle · 2017-01-05T21:25:28Z

If splitting is too difficult, could you at least amend the commit message to include a more detailed accounting of the changes made and their motivations? The changelog will need to contain something like that, in any case, and I usually construct that from commit messages.

Sure, I didn't do that because I figured it would all get squashed anyway, I usually tend to be quite specific about these things, but I understand now that you got tricked by github :)

I'll see what I can do for splitting up the commit anyway.

Added cmark_node_get_label(node). Added cmark_node_set_label(node). API change.

* Improve strbuf guarantees Introduce BUFSIZE_MAX macro and make sure that the strbuf implementation can handle strings up to this size. * Abort early if document size exceeds internal limit * Make parser return NULL on internal index overflow Make S_parser_feed set an error and ignore subsequent chunks if the total input document size exceeds an internal limit. Make cmark_parser_finish return NULL if an error was encountered. Add public API functions to retrieve error code and error message. strbuf overflow in renderers and OOM in parser or renderers still cause an abort.

Its generation is conditioned to the setting of the OPT_SOURCEPOS flag. Add cmark_parser_get_first_source_extent(parser). Add cmark_source_extent_get_start(extent). Add cmark_source_extent_get_stop(extent). Add cmark_source_extent_get_next(extent). Add cmark_source_extent_get_previous(extent). Add cmark_source_extent_get_node(extent). Add cmark_source_extent_get_type(extent. Add cmark_source_extent_get_type_string(extent). API change.

Add cmark_default_mem_free(pointer). API change.

The only implemented feature is reindenting to arbitrary width, but that was also the actually complicated one so there's that.

MathieuDuponchelle · 2017-01-06T21:35:35Z

@jgm, better like that ?

jgm · 2017-01-07T11:06:55Z

Much better, thanks! This looks good, but can you remove the cmark-format part and put it in a separate repository? Two more questions occur to me:

With the addition of reference nodes, the xml output will not match the DTD (CommonMark.dtd in jgm/CommonMark). We could add reference nodes to the DTD, or alternatively we could simply skip them in xml output. What do you think?
Previously we had start/end line/column info on nodes, with an option to print this as attributes in HTML and XML. (The information was also accessible via API functions exposed in cmark.h.) Can you explain what has become of these? It seems to me that they should be obsoleted by your new sourcepos functionality. It does no good to have two separate mechanisms for tracking source pos. Unless you're using your sourcepos information to populate these, I think we should just remove all the legacy sourcecode stuff on nodes, and the code in parsers that populates these.

MathieuDuponchelle · 2017-01-09T17:51:35Z

Much better, thanks! This looks good, but can you remove the cmark-format part and put it in a separate repository?

Of course, but I would really have liked to have it live alongside the library in order to ensure its test suite ran upon new commits / spec changes :) I would not mind at all moving the code outside of src/ , and I'd stick around for maintaining it, do you have any particular reason not to merge it under these conditions? As I explained in earlier comments, I think it would be a valuable tool to offer. By the way, you said you'd give it a try, did you find any issues that made you want to hold back on it?

With the addition of reference nodes, the xml output will not match the DTD (CommonMark.dtd in jgm/CommonMark). We could add reference nodes to the DTD, or alternatively we could simply skip them in xml output. What do you think?

My initial revision did not include it in the xml output, but my opinion is that this format should expose as faithfully as possible the ast, I am mainly using it for ease of development and debugging, and I like to have as much information as possible in that case. If you agree with this, I will update the schema.

Previously we had start/end line/column info on nodes, with an option to print this as attributes in HTML and XML. (The information was also accessible via API functions exposed in cmark.h.) Can you explain what has become of these? It seems to me that they should be obsoleted by your new sourcepos functionality. It does no good to have two separate mechanisms for tracking source pos. Unless you're using your sourcepos information to populate these, I think we should just remove all the legacy sourcecode stuff on nodes, and the code in parsers that populates these.

We will want to update that, however I'm not sure in which manner. Attributes in the html output are definitely helpful for things like side-by-side autoscrolling, syntax highlighting etc, and we want some form of this. A few questions:

The current API advertises line / columns indices, however it is a bit misleading as shown by:

 meh   more-conditional-sourcepos  ~  devel  cmark  130  cat test.md
’
 meh   more-conditional-sourcepos  ~  devel  cmark  ./build/src/cmark --sourcepos test.md 
<p data-sourcepos="1:1-1:3">’</p>
 meh   more-conditional-sourcepos  ~  devel  cmark 

my expectation of a column API would be that utf8 is accounted for, however that is not the case. I would rework this to simply expose byte offsets.

The current line / column API doesn't break down extents for nodes, for example:

 meh   more-conditional-sourcepos  ~  devel  cmark  1  cat test.md
>     foo
>     bar
 meh   more-conditional-sourcepos  ~  devel  cmark  1  ./build/src/cmark --sourcepos test.md 
<blockquote data-sourcepos="1:1-2:9">
<pre data-sourcepos="1:3-2:9"><code>foo
bar
</code></pre>
</blockquote>
 meh   more-conditional-sourcepos  ~  devel  cmark  1 

I think that's actually an interesting level of details, and can complement the more detailed new API by letting blocks overlap, this could be trivially implemented in the add_extent source map method, which would, whenever a new extent is added, update the start offset of the associated node, and recurse up its ancestors to update their stop offset.

MathieuDuponchelle · 2017-01-13T17:27:02Z

@jgm, can you answer that last comment ?

jgm · 2017-01-13T21:13:20Z

Of course, but I would really have liked to have it live alongside the library in order to ensure its test suite ran upon new commits / spec changes :) I would not mind at all moving the code outside of src/ , and I'd stick around for maintaining it, do you have any particular reason not to merge it under these conditions? As I explained in earlier comments, I think it would be a valuable tool to offer. By the way, you said you'd give it a try, did you find any issues that made you want to hold back on it?

I did try it and one thing I remember noticing were lots of line breaks in bad places, e.g. between a word and ). It also seems a bit strange to have it written in Python, in the context of cmark. I haven't had a chance yet to look at it carefully, though, and see what it does.

Question: could some of its aims be achieved by modifying cmark's existing commonmark renderer, or a modification of it? E.g. you could pass in the extents + the original source, and the renderer could peek in the source to see if a backslash escape was used, check indentations, bullet types, and so on, and try to preserve these in the output. One thing the existing renderer does very well I think is line breaks.

My initial revision did not include it in the xml output, but my opinion is that this format should expose as faithfully as possible the ast, I am mainly using it for ease of development and debugging, and I like to have as much information as possible in that case. If you agree with this, I will update the schema.

Yes, I think this makes sense.

my expectation of a column API would be that utf8 is accounted for, however that is not the case. I would rework this to simply expose byte offsets.

Right. That makes more sense, I agree.

The current line / column API doesn't break down extents for nodes, for example:

Wasn't sure what you meant here, can you explain?

MathieuDuponchelle · 2017-01-13T21:59:57Z

I did try it and one thing I remember noticing were lots of line breaks in bad places, e.g. between a word and ).

Weird, can you please give specific examples, cause that is just not the case here, example:

 meh   more-conditional-sourcepos  ~  devel  cmark  1  cat plop.md 
foo(bar)baz
 meh   more-conditional-sourcepos  ~  devel  cmark  ./build/src/cmark-format plop.md --width 1
foo(bar)baz
 meh   more-conditional-sourcepos  ~  devel  cmark 

MathieuDuponchelle · 2017-01-13T22:10:49Z

@jgm, maybe you're thinking about links, but IMHO my tool's behaviour is way better than cmark -t commonmark in that case:

 meh   more-conditional-sourcepos  ~  devel  cmark  130  cat plop.md 
[foo](bar)
 meh   more-conditional-sourcepos  ~  devel  cmark  ./build/src/cmark plop.md --width 1 -t commonmark
[foo](bar)
 meh   more-conditional-sourcepos  ~  devel  cmark  ./build/src/cmark-format plop.md --width 1
[
foo
](
bar
)
 meh   more-conditional-sourcepos  ~  devel  cmark 

ie, my tool tries its best to stick to the requested width, and will break on punctuation, cmark will not

MathieuDuponchelle · 2017-01-13T22:16:38Z

It also seems a bit strange to have it written in Python, in the context of cmark.

Well, the thing is I needed some sort of hash table for example, strict C89 with no dependencies makes things a bit awkward. As this is a developer tool, I considered that speed was not an absolute necessity. By the way, the glib now builds with no issues in Visual Studio ;)

jgm · 2017-01-15T10:13:10Z

my tool tries its best to stick to the requested width, and will break on punctuation, cmark will not

I disagree that this is better behavior. These are cases where both tools are going above the requested width (and such cases are going to be inevitable). cmark avoids the awkward punctuation breaks. I personally think we should never break after the [ or ( or before the ] or ) in a link.

Here's another example where I think cmark does the right thing:

cmark:

It provides a shared library
(`libcmark`) with functions for parsing

your tool:

It provides a shared library (
`libcmark`) with functions for parsing

Surely in this case the line break after the opening ( is undesirable! (I got this with cmark-format --width=40 README.md). Another awkward case:

- **Robust.** The library has been
  extensively fuzz-tested using [
  american fuzzy lop]. The test suite

I don't think anyone will think that's a desirable place to break.

Of course, all these things are easily fixable.

jgm · 2017-01-15T10:17:29Z

I'm still not completely decided about whether it makes sense to keep cmark-format in the cmark repository, but at the very least I think it should be in a separate directory with a separate CMakeLists.txt.

mattpat · 2017-01-31T21:26:42Z

Hey folks, out of curiosity, have there been any updates on this? It looks like the code is pretty much ready to go, and I’d love to be able to use source positions in a project I’m working on pretty soon. Any ideas about when it might land on master, and/or make it into a “released” version?

Thanks!

jgm · 2017-02-01T10:57:27Z

The outstanding issues concern cmark-format and whether it should go in the repository.
If it could be trimmed out, at least for now, I think the rest could be merged without resolving those issues right away.

MathieuDuponchelle · 2017-02-01T23:10:13Z

Sorry, as usual, work comes in unexpectedly :) @jgm, I have no problem with keeping the script for later, it's in a separate commit so it shouldn't be too difficult to filter it out for now :)

mattpat · 2017-02-21T00:49:57Z

Any updates on when this might make it into a release? I'm trying to get it all building correctly locally now, but just for simplicity's sake, we like to update our cmark version only to numbered tags in the Git repo (and splatting this pull request on top of 0.27.1 isn't yielding ideal results 😞).

MathieuDuponchelle force-pushed the more-conditional-sourcepos branch from ef34ee3 to b18c103 Compare January 4, 2017 23:53

MathieuDuponchelle force-pushed the more-conditional-sourcepos branch from b18c103 to 14819a2 Compare January 5, 2017 15:06

MathieuDuponchelle force-pushed the more-conditional-sourcepos branch from 14819a2 to 064df74 Compare January 6, 2017 21:02

MathieuDuponchelle and others added 5 commits January 6, 2017 22:03

Implement and expose reference nodes.

6741c12

Added cmark_node_get_label(node). Added cmark_node_set_label(node). API change.

wrapper.py: more extensive coverage

a44cfaf

Add cmark_default_mem_free(pointer). API change.

cmark-format: Initial work on a formatting tool.

014b781

The only implemented feature is reindenting to arbitrary width, but that was also the actually complicated one so there's that.

MathieuDuponchelle force-pushed the more-conditional-sourcepos branch from 064df74 to 014b781 Compare January 6, 2017 21:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More conditional sourcepos #180

More conditional sourcepos #180

MathieuDuponchelle commented Jan 4, 2017

MathieuDuponchelle commented Jan 5, 2017

MathieuDuponchelle commented Jan 5, 2017

jgm commented Jan 5, 2017

MathieuDuponchelle commented Jan 5, 2017

MathieuDuponchelle commented Jan 5, 2017

MathieuDuponchelle commented Jan 5, 2017

jgm commented Jan 5, 2017 via email

jgm commented Jan 5, 2017 via email

jgm commented Jan 5, 2017 via email

MathieuDuponchelle commented Jan 5, 2017

jgm commented Jan 5, 2017 via email

MathieuDuponchelle commented Jan 5, 2017

jgm commented Jan 5, 2017 via email

MathieuDuponchelle commented Jan 5, 2017 •

edited

Loading

jgm commented Jan 5, 2017 via email

MathieuDuponchelle commented Jan 5, 2017

MathieuDuponchelle commented Jan 5, 2017

jgm commented Jan 5, 2017 via email

MathieuDuponchelle commented Jan 5, 2017

MathieuDuponchelle commented Jan 6, 2017

jgm commented Jan 7, 2017

MathieuDuponchelle commented Jan 9, 2017

MathieuDuponchelle commented Jan 13, 2017

jgm commented Jan 13, 2017

MathieuDuponchelle commented Jan 13, 2017

MathieuDuponchelle commented Jan 13, 2017 •

edited

Loading

MathieuDuponchelle commented Jan 13, 2017

jgm commented Jan 15, 2017

jgm commented Jan 15, 2017

mattpat commented Jan 31, 2017

jgm commented Feb 1, 2017

MathieuDuponchelle commented Feb 1, 2017

mattpat commented Feb 21, 2017

More conditional sourcepos #180

Are you sure you want to change the base?

More conditional sourcepos #180

Conversation

MathieuDuponchelle commented Jan 4, 2017

MathieuDuponchelle commented Jan 5, 2017

MathieuDuponchelle commented Jan 5, 2017

jgm commented Jan 5, 2017

MathieuDuponchelle commented Jan 5, 2017

MathieuDuponchelle commented Jan 5, 2017

MathieuDuponchelle commented Jan 5, 2017

jgm commented Jan 5, 2017 via email

jgm commented Jan 5, 2017 via email

jgm commented Jan 5, 2017 via email

MathieuDuponchelle commented Jan 5, 2017

jgm commented Jan 5, 2017 via email

MathieuDuponchelle commented Jan 5, 2017

jgm commented Jan 5, 2017 via email

MathieuDuponchelle commented Jan 5, 2017 • edited Loading

jgm commented Jan 5, 2017 via email

MathieuDuponchelle commented Jan 5, 2017

MathieuDuponchelle commented Jan 5, 2017

jgm commented Jan 5, 2017 via email

MathieuDuponchelle commented Jan 5, 2017

MathieuDuponchelle commented Jan 6, 2017

jgm commented Jan 7, 2017

MathieuDuponchelle commented Jan 9, 2017

MathieuDuponchelle commented Jan 13, 2017

jgm commented Jan 13, 2017

MathieuDuponchelle commented Jan 13, 2017

MathieuDuponchelle commented Jan 13, 2017 • edited Loading

MathieuDuponchelle commented Jan 13, 2017

jgm commented Jan 15, 2017

jgm commented Jan 15, 2017

mattpat commented Jan 31, 2017

jgm commented Feb 1, 2017

MathieuDuponchelle commented Feb 1, 2017

mattpat commented Feb 21, 2017

MathieuDuponchelle commented Jan 5, 2017 •

edited

Loading

MathieuDuponchelle commented Jan 13, 2017 •

edited

Loading