Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More conditional sourcepos #180

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

MathieuDuponchelle
Copy link
Contributor

There you go, three commits:

  1. The source map code, which generation is now conditional to the SOURCEPOS option, when not enabled make bench results are identical to the results without the patch. Note that I had a bunch of follow-up commits on my end which I ended up fixing up with that patch, mostly fixes except for one patch that defines reference nodes. I also squashed up @nwellnhof's commits in there, please tell me if you mind.

  2. The wrapper code, also updated but nothing major at all

  3. The actual fancy stuff, a reindenting script, tested on every test of the spec with 1 to 80 columns, the test checks that the new AST is identical to the old one, minus softbreaks. The actual test I propose only checks for reindenting on one column, because 80 * 610 starts to take a little time I'm afraid. Note that the script could break in a few more places than it currently does, and certainly be massively optimized, but I think it's already in a state worth merging because, as far as I can tell, it is now provably reliable.

@MathieuDuponchelle MathieuDuponchelle force-pushed the more-conditional-sourcepos branch from ef34ee3 to b18c103 Compare January 4, 2017 23:53
@MathieuDuponchelle
Copy link
Contributor Author

That is a timeout, pretty sure the code is fine, @jgm if you feel like triggering a new run :)

@MathieuDuponchelle MathieuDuponchelle force-pushed the more-conditional-sourcepos branch from b18c103 to 14819a2 Compare January 5, 2017 15:06
@MathieuDuponchelle
Copy link
Contributor Author

Rebased and pushed, no timeout this time :)

@jgm
Copy link
Member

jgm commented Jan 5, 2017

Is it possible to break this into smaller, logical chunks? This is currently such a big ball of code, touching so many parts of the code base, that it's hard to survey.

For example, I see changes in buffer.h like this:

-  /* Oversize the buffer by 50% to guarantee amortized linear time
 -   * complexity on append operations. */
 -  bufsize_t new_size = target_size + target_size / 2;
 -  new_size += 1;
 -  new_size = (new_size + 7) & ~7;
 +  // Oversize the buffer by 50% to guarantee amortized linear time
 +  // complexity on append operations.
 +  bufsize_t add = target_size / 2;
 +  // Account for terminating NUL byte.

I wonder: Is that a mistake, going from target_size + target_size / 2 to target_size / 2? If it's intentional, what's the motivation? There is no explanation of this in the commit history.

That's just one example.

Can we have a commit, for example, that just adds reference nodes? After all, that's a single logical change that might make sense completely indepenedently of the "extent" stuff.

As for the demo script, I wasn't sure from your description what it did. But I think it would be better to develop it outside of cmark itself, so we don't get cmark too cluttered up. It certainly shouldn't go in the src/ tree with the other library code.

@MathieuDuponchelle
Copy link
Contributor Author

@jgm, the first change you mention is from the commit from @nwellnhof that you reverted, which you had squashed together from a bunch of his commits ..

@MathieuDuponchelle
Copy link
Contributor Author

My initial branch was also 4 commits that you ended up squashing as one, I had re-extracted the python stuff from it as it clearly needed to be separate, but squashing things together, merging them, reverting them then asking me to re-split them up is a bit frustrating I have to say

@MathieuDuponchelle
Copy link
Contributor Author

As for putting the formatting script in src/, I put it there because the cmark executable is also located in there, even though it's not part of the library, just wanted to be consistent.

@jgm
Copy link
Member

jgm commented Jan 5, 2017 via email

@jgm
Copy link
Member

jgm commented Jan 5, 2017 via email

@jgm
Copy link
Member

jgm commented Jan 5, 2017 via email

@MathieuDuponchelle
Copy link
Contributor Author

And this has generally what has made it hard for me to process your PRs in the past.

Really ?? My other unmerged PR is split into 20 commits, the source pos stuff was initially four commits!

@jgm
Copy link
Member

jgm commented Jan 5, 2017 via email

@MathieuDuponchelle
Copy link
Contributor Author

What would help me is to hear a bit more about how this differs from what you can do already with cmark's commonmark -> commonmark transitions. I assume the answer is that your script won't make adventitious changes to things like list indentation, list number delimiter, type of header, etc.?

Yes, but far from only that, the end goal with this is to have a tool similar to clang-format for commonmark, allowing to control the style in large projects with multiple contributors.

@jgm
Copy link
Member

jgm commented Jan 5, 2017 via email

@MathieuDuponchelle
Copy link
Contributor Author

MathieuDuponchelle commented Jan 5, 2017

Example use case:

Let's say you're the maintainer of "progit", and want to enforce a certain set of rules (no soft breaks, lists have to use * as the marker, that sort of things, you could also want to follow this markdown style guide) , and the last thing you want to do when reviewing contributions is nitpick all day long about style issues.

The solution I'd like to adopt as such a maintainer, is to have a pre-commit hook that runs cmark-format (which could figure out the style from a well-known-named configuration file for example), and be reasonably certain that any commits submitted for inclusion have gone through and be validated by the formatting script.

On the contributor side, you could use the tool to automatically refactor your changes in order to make them comply with the required style, which would thus eliminate tedious and error-prone manual changes.

@jgm
Copy link
Member

jgm commented Jan 5, 2017 via email

@MathieuDuponchelle
Copy link
Contributor Author

@jgm, well that's what this "demo" script aims at becoming, it only handles width for now because that was the only difficult thing to do, the rest can easily be added by anyone with minimum python's working knowledge, which is part of the reason why I submitted it now, the other being indeed that I wanted to demo the script to you as I figured you would see the point :)

@MathieuDuponchelle
Copy link
Contributor Author

Another important thing with that script is also that it does not do unnecessary escaping, which is really what bugged me with -t commonmark, ie given:

foo > bar

--width 80 will give you:

foo > bar

whereas --width 1 will correctly give you:

foo
\>
bar

@jgm
Copy link
Member

jgm commented Jan 5, 2017 via email

@MathieuDuponchelle
Copy link
Contributor Author

If splitting is too difficult, could you at least amend the commit message to include a more detailed accounting of the changes made and their motivations? The changelog will need to contain something like that, in any case, and I usually construct that from commit messages.

Sure, I didn't do that because I figured it would all get squashed anyway, I usually tend to be quite specific about these things, but I understand now that you got tricked by github :)

I'll see what I can do for splitting up the commit anyway.

@MathieuDuponchelle MathieuDuponchelle force-pushed the more-conditional-sourcepos branch from 14819a2 to 064df74 Compare January 6, 2017 21:02
MathieuDuponchelle and others added 5 commits January 6, 2017 22:03
Added cmark_node_get_label(node).
Added cmark_node_set_label(node).

API change.
* Improve strbuf guarantees

Introduce BUFSIZE_MAX macro and make sure that the strbuf implementation
can handle strings up to this size.

* Abort early if document size exceeds internal limit

* Make parser return NULL on internal index overflow

Make S_parser_feed set an error and ignore subsequent chunks if the
total input document size exceeds an internal limit. Make
cmark_parser_finish return NULL if an error was encountered. Add
public API functions to retrieve error code and error message.

strbuf overflow in renderers and OOM in parser or renderers still
cause an abort.
Its generation is conditioned to the setting of the
OPT_SOURCEPOS flag.

Add cmark_parser_get_first_source_extent(parser).
Add cmark_source_extent_get_start(extent).
Add cmark_source_extent_get_stop(extent).
Add cmark_source_extent_get_next(extent).
Add cmark_source_extent_get_previous(extent).
Add cmark_source_extent_get_node(extent).
Add cmark_source_extent_get_type(extent.
Add cmark_source_extent_get_type_string(extent).

API change.
Add cmark_default_mem_free(pointer).

API change.
The only implemented feature is reindenting to arbitrary width,
but that was also the actually complicated one so there's that.
@MathieuDuponchelle MathieuDuponchelle force-pushed the more-conditional-sourcepos branch from 064df74 to 014b781 Compare January 6, 2017 21:05
@MathieuDuponchelle
Copy link
Contributor Author

@jgm, better like that ?

@jgm
Copy link
Member

jgm commented Jan 7, 2017

Much better, thanks! This looks good, but can you remove the cmark-format part and put it in a separate repository? Two more questions occur to me:

  1. With the addition of reference nodes, the xml output will not match the DTD (CommonMark.dtd in jgm/CommonMark). We could add reference nodes to the DTD, or alternatively we could simply skip them in xml output. What do you think?

  2. Previously we had start/end line/column info on nodes, with an option to print this as attributes in HTML and XML. (The information was also accessible via API functions exposed in cmark.h.) Can you explain what has become of these? It seems to me that they should be obsoleted by your new sourcepos functionality. It does no good to have two separate mechanisms for tracking source pos. Unless you're using your sourcepos information to populate these, I think we should just remove all the legacy sourcecode stuff on nodes, and the code in parsers that populates these.

@MathieuDuponchelle
Copy link
Contributor Author

Much better, thanks! This looks good, but can you remove the cmark-format part and put it in a separate repository?

Of course, but I would really have liked to have it live alongside the library in order to ensure its test suite ran upon new commits / spec changes :) I would not mind at all moving the code outside of src/ , and I'd stick around for maintaining it, do you have any particular reason not to merge it under these conditions? As I explained in earlier comments, I think it would be a valuable tool to offer. By the way, you said you'd give it a try, did you find any issues that made you want to hold back on it?

With the addition of reference nodes, the xml output will not match the DTD (CommonMark.dtd in jgm/CommonMark). We could add reference nodes to the DTD, or alternatively we could simply skip them in xml output. What do you think?

My initial revision did not include it in the xml output, but my opinion is that this format should expose as faithfully as possible the ast, I am mainly using it for ease of development and debugging, and I like to have as much information as possible in that case. If you agree with this, I will update the schema.

Previously we had start/end line/column info on nodes, with an option to print this as attributes in HTML and XML. (The information was also accessible via API functions exposed in cmark.h.) Can you explain what has become of these? It seems to me that they should be obsoleted by your new sourcepos functionality. It does no good to have two separate mechanisms for tracking source pos. Unless you're using your sourcepos information to populate these, I think we should just remove all the legacy sourcecode stuff on nodes, and the code in parsers that populates these.

We will want to update that, however I'm not sure in which manner. Attributes in the html output are definitely helpful for things like side-by-side autoscrolling, syntax highlighting etc, and we want some form of this. A few questions:

  • The current API advertises line / columns indices, however it is a bit misleading as shown by:
 meh   more-conditional-sourcepos  ~  devel  cmark  130  cat test.md
’
 meh   more-conditional-sourcepos  ~  devel  cmark  ./build/src/cmark --sourcepos test.md 
<p data-sourcepos="1:1-1:3">’</p>
 meh   more-conditional-sourcepos  ~  devel  cmark 

my expectation of a column API would be that utf8 is accounted for, however that is not the case. I would rework this to simply expose byte offsets.

  • The current line / column API doesn't break down extents for nodes, for example:
 meh   more-conditional-sourcepos  ~  devel  cmark  1  cat test.md
>     foo
>     bar
 meh   more-conditional-sourcepos  ~  devel  cmark  1  ./build/src/cmark --sourcepos test.md 
<blockquote data-sourcepos="1:1-2:9">
<pre data-sourcepos="1:3-2:9"><code>foo
bar
</code></pre>
</blockquote>
 meh   more-conditional-sourcepos  ~  devel  cmark  1 

I think that's actually an interesting level of details, and can complement the more detailed new API by letting blocks overlap, this could be trivially implemented in the add_extent source map method, which would, whenever a new extent is added, update the start offset of the associated node, and recurse up its ancestors to update their stop offset.

@MathieuDuponchelle
Copy link
Contributor Author

@jgm, can you answer that last comment ?

@jgm
Copy link
Member

jgm commented Jan 13, 2017

Of course, but I would really have liked to have it live alongside the library in order to ensure its test suite ran upon new commits / spec changes :) I would not mind at all moving the code outside of src/ , and I'd stick around for maintaining it, do you have any particular reason not to merge it under these conditions? As I explained in earlier comments, I think it would be a valuable tool to offer. By the way, you said you'd give it a try, did you find any issues that made you want to hold back on it?

I did try it and one thing I remember noticing were lots of line breaks in bad places, e.g. between a word and ). It also seems a bit strange to have it written in Python, in the context of cmark. I haven't had a chance yet to look at it carefully, though, and see what it does.

Question: could some of its aims be achieved by modifying cmark's existing commonmark renderer, or a modification of it? E.g. you could pass in the extents + the original source, and the renderer could peek in the source to see if a backslash escape was used, check indentations, bullet types, and so on, and try to preserve these in the output. One thing the existing renderer does very well I think is line breaks.

My initial revision did not include it in the xml output, but my opinion is that this format should expose as faithfully as possible the ast, I am mainly using it for ease of development and debugging, and I like to have as much information as possible in that case. If you agree with this, I will update the schema.

Yes, I think this makes sense.

my expectation of a column API would be that utf8 is accounted for, however that is not the case. I would rework this to simply expose byte offsets.

Right. That makes more sense, I agree.

The current line / column API doesn't break down extents for nodes, for example:

Wasn't sure what you meant here, can you explain?

@MathieuDuponchelle
Copy link
Contributor Author

I did try it and one thing I remember noticing were lots of line breaks in bad places, e.g. between a word and ).

Weird, can you please give specific examples, cause that is just not the case here, example:

 meh   more-conditional-sourcepos  ~  devel  cmark  1  cat plop.md 
foo(bar)baz
 meh   more-conditional-sourcepos  ~  devel  cmark  ./build/src/cmark-format plop.md --width 1
foo(bar)baz
 meh   more-conditional-sourcepos  ~  devel  cmark 

@MathieuDuponchelle
Copy link
Contributor Author

MathieuDuponchelle commented Jan 13, 2017

@jgm, maybe you're thinking about links, but IMHO my tool's behaviour is way better than cmark -t commonmark in that case:

 meh   more-conditional-sourcepos  ~  devel  cmark  130  cat plop.md 
[foo](bar)
 meh   more-conditional-sourcepos  ~  devel  cmark  ./build/src/cmark plop.md --width 1 -t commonmark
[foo](bar)
 meh   more-conditional-sourcepos  ~  devel  cmark  ./build/src/cmark-format plop.md --width 1
[
foo
](
bar
)
 meh   more-conditional-sourcepos  ~  devel  cmark 

ie, my tool tries its best to stick to the requested width, and will break on punctuation, cmark will not

@MathieuDuponchelle
Copy link
Contributor Author

It also seems a bit strange to have it written in Python, in the context of cmark.

Well, the thing is I needed some sort of hash table for example, strict C89 with no dependencies makes things a bit awkward. As this is a developer tool, I considered that speed was not an absolute necessity. By the way, the glib now builds with no issues in Visual Studio ;)

@jgm
Copy link
Member

jgm commented Jan 15, 2017

my tool tries its best to stick to the requested width, and will break on punctuation, cmark will not

I disagree that this is better behavior. These are cases where both tools are going above the requested width (and such cases are going to be inevitable). cmark avoids the awkward punctuation breaks. I personally think we should never break after the [ or ( or before the ] or ) in a link.

Here's another example where I think cmark does the right thing:

cmark:

It provides a shared library
(`libcmark`) with functions for parsing

your tool:

It provides a shared library (
`libcmark`) with functions for parsing

Surely in this case the line break after the opening ( is undesirable! (I got this with cmark-format --width=40 README.md). Another awkward case:

- **Robust.** The library has been
  extensively fuzz-tested using [
  american fuzzy lop]. The test suite

I don't think anyone will think that's a desirable place to break.

Of course, all these things are easily fixable.

@jgm
Copy link
Member

jgm commented Jan 15, 2017

I'm still not completely decided about whether it makes sense to keep cmark-format in the cmark repository, but at the very least I think it should be in a separate directory with a separate CMakeLists.txt.

@mattpat
Copy link

mattpat commented Jan 31, 2017

Hey folks, out of curiosity, have there been any updates on this? It looks like the code is pretty much ready to go, and I’d love to be able to use source positions in a project I’m working on pretty soon. Any ideas about when it might land on master, and/or make it into a “released” version?

Thanks!

@jgm
Copy link
Member

jgm commented Feb 1, 2017

The outstanding issues concern cmark-format and whether it should go in the repository.
If it could be trimmed out, at least for now, I think the rest could be merged without resolving those issues right away.

@MathieuDuponchelle
Copy link
Contributor Author

Sorry, as usual, work comes in unexpectedly :) @jgm, I have no problem with keeping the script for later, it's in a separate commit so it shouldn't be too difficult to filter it out for now :)

@mattpat
Copy link

mattpat commented Feb 21, 2017

Any updates on when this might make it into a release? I'm trying to get it all building correctly locally now, but just for simplicity's sake, we like to update our cmark version only to numbered tags in the Git repo (and splatting this pull request on top of 0.27.1 isn't yielding ideal results 😞).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants