Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Syntax Highlighting #16

Closed
wants to merge 9 commits into from

Conversation

th0rex
Copy link

@th0rex th0rex commented Dec 13, 2019

I've implemented a basic version of syntax highlighting and am looking for feedback (whether the overall approach is good).

Highlighting with tree sitter:
tree_sitter

Highlighting with font lock:
font_lock

Currently I'm not really taking advantage of the more context sensitive tree sitter information, and am just using the default font lock faces.

I'm not sure how to best distribute this highlights.scm file, currently users need to specify the path to this_repo/grammars, from which the highlights.scm file for the given language is then read.

The only major bug I've noticed, is that while typing, sometimes code isn't immediatetly highlighted. For example, when I type

union X {
    a: i32,
    b: i32;
}
// or
fn foo() {
    let z = 1337;
}

Manually calling tree-sitter-highlight--highlight fixes that however, so I'm thinking it might have something to do with the comment on the after-change function in tree-sitter.el. My tree-sitter-highlight--handle-change isn't being called for unions or lets inside functions, but if I type the same let on the top level it is, I'm not quite sure what's causing that as I'm not familiar with tree sitter.
Edit: I've fixed this, changed_ranges had a bug. Even if this PR isn't merged, that one commit (6bed1e1) should definetly be merged.

My current config for using this looks like this:

(use-package f)

(add-to-list 'load-path "/vm/dev/github.com/th0rex/emacs-tree-sitter/lisp")

(require 'tree-sitter)
(seq-do #'ts-require-language '(c cpp go python rust))

;; Path to <this_repo>/grammars
(setq tree-sitter-highlight-query-dir "/vm/dev/github.com/th0rex/emacs-tree-sitter/grammars")
(require 'tree-sitter-highlight)

(defun enable-ts-hl ()
  (font-lock-mode -1) ;; Disable font-lock mode
  (tree-sitter-highlight-mode))

(add-hook 'c-mode-common-hook #'enable-ts-hl)
(add-hook 'python-mode-hook #'enable-ts-hl)
(add-hook 'rust-mode-hook #'enable-ts-hl)

(defun tree-sitter-highlight--after-change (old-tree)
(tree-sitter-highlight--highlight old-tree (window-start) (window-end)))
(tree-sitter-highlight--highlight old-tree (window-start) (window-end))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, have you considered providing tree-sitter implementations for font-lock-fontify-region-function(and related) ? Thus you will be able to share the logic related to calculating what has to be highlighted.

Copy link
Author

@th0rex th0rex Dec 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we would want to do that we would want to choose jit-lock instead, font-lock relies on that (afaik) for knowing when to fontify what.
I've considered it, but didn't implement it yet, since this way is easier. I can add it as a toggle maybe, to see if it would improve things (mainly that currently I can either delay highlighting in case of errors, or eagerly highlight but have to clear (remove all faces from) the whole window first, which causes the part after the error to not be highlighted (which you technically can't do correctly because there's an error after all) and which also causes "flickering" if you type fast enough when you rapidly produce/remove errors in the syntax tree). However I think it's unlikely that jit-lock will help with that specific problem, because we already know what changed, and if i only clear/highlight what changed, then sometimes things are not highlighted at all.

Copy link
Author

@th0rex th0rex Dec 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: I've tried both font-lock and jit-lock. font-lock seems to work better, so I've only kept the code for that. I was also wrong about it being easier to just implement myself, so thanks for the suggestion.

@jakejx
Copy link

jakejx commented Dec 16, 2019

This PR looks great! I've been playing around with it and it works quite nicely. One thing I have noticed is that for some situations multiple faces are being applied to the same region. For example in Python:

p = Package(...)

The Package should be highlighted as Constructor based on the Python highlights.scm but when I inspect, the word is highlighted with (tree-sitter-variable-face tree-sitter-function-face tree-sitter-constructor-face nil). I will be digging into the code over the coming days and see whether I can help :)

Edit: it seems that this issue is likely due to overlapping queries. This may have to be fixed at the highlights.scm level where queries are ordered based on their specificity.

@ubolonton
Copy link
Collaborator

Thanks, this is great progress!

I changed the parameter order of ts-changed-ranges's and updated its docstring so that it's consistent with the Rust/C APIs.

I also pushed the fixes for ts-make-query, ts-query-captures, ts-query-matches.

I'm not sure how to best distribute this highlights.scm file, currently users need to specify the path to this_repo/grammars, from which the highlights.scm file for the given language is then read.

Ideally this file should be distributed together with the compiled language description (e.g. rust.so). Initially it could be a separate package that bundles multiple languages together. If/when tree-sitter becomes established enough that there are major modes based on it, it will be the job of the major modes to distribute these artifacts.

"Syntax highlighting with tree sitter."
:init-value nil
:lighter "tree-sitter-highlight"
;; TODO: idk if this file should do this.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dependencies between minor modes are tricky: https://emacs.stackexchange.com/questions/36225/global-minor-mode-that-requires-another-during-its-lifetime.

We can either use a single minor mode, or make tree-sitter-mode aware of its possible dependent tree-sitter- minor modes, so that turning them on/off work properly.


(require 'tree-sitter)

(require 'f)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dependency should be declared in Cask.

However, I think tree-sitter is fundamental/low-level enough that fewer external dependencies are desirable. If we just want 1-2 functions from f, we can reimplement them.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added it to Cask for now, I'm just using f-join currently, we can re-implement that before we merge this. It's probably not needed if we bundle all the highlights.scm somewhere.

src/types.rs Outdated Show resolved Hide resolved
@th0rex
Copy link
Author

th0rex commented Dec 16, 2019

This PR looks great! I've been playing around with it and it works quite nicely. One thing I have noticed is that for some situations multiple faces are being applied to the same region. For example in Python:

p = Package(...)

The Package should be highlighted as Constructor based on the Python highlights.scm but when I inspect, the word is highlighted with (tree-sitter-variable-face tree-sitter-function-face tree-sitter-constructor-face nil). I will be digging into the code over the coming days and see whether I can help :)

I know that multiple faces are added, but that's actually intended. For me when I try doing x = Package(...), then the constructor face is added first ((tree-sitter-constructor-face tree-sitter-constant-face tree-sitter-function-face tree-sitter-variable-face nil)).
The reason I'm wanting to add multiple properties is that highlighted regions generally may overlap:

#[derive(Debug)]

the Debug is both a constructor but also inside an attribute, I'm sorting the query results such that larger things are added first (i.e. the attribute in this case), and more specific things later.
If I were using put-text-property (i.e. only having one face per region of text) instead of add-face-text-property, then the [ and ] wouldn't be highlighted with the attribute face (since they currently don't really have an aliased face, they're punctuation, but using put-text-property would thus also clear the attribute face from that. If everything had a proper face and everyone would want punctuation to be highlighted, we could use put-text-property instead, but I don't think thats the case, and instead of having non-highlighted punctuation in this case it's better to have it highlighted with the attribute face instead). And if I wouldn't sort the results at all, the Debug wouldn't get highlighted differently, but I think we want that to be highlighted differently.

At least for the rust implementation of querying the regions to highlight the sort is stable, i.e. if two things apply for the exact same range, the order is kept as given to use by the tree sitter api, which should (I think) reflect the order they're declared in a highlights.scm file.

If you could paste more of the python code for which the faces are being applied in the wrong order (because it doesn't happen in my python files/your code that i just pasted) that would be great. Maybe it's also caused by some incremental highlighting thing, see what happens after you've just openend that buffer and it was highlighted completely, whether the order is the same etc.

Edit:
I've added (and enabled by default for now) some debug output to the rust code, it would be nice if you could start emacs in a terminal and find what it's outputting for you. This is the relevant output for me, and the faces are added in that order (the order defined in highlights.scm):

25344 - 25347 ((776, 8) - (776, 11)) -> constructor
25344 - 25347 ((776, 8) - (776, 11)) -> constant
25344 - 25347 ((776, 8) - (776, 11)) -> function
25344 - 25347 ((776, 8) - (776, 11)) -> variable

@jakejx
Copy link

jakejx commented Dec 17, 2019

I was obtaining a different result as I was using the elisp implementation. Using the Rust implementation, I obtain the expected results.

I believe the difference in behaviour is due to this portion in highlights.rs:

for (i, (node, capture_index)) in vec.into_iter().rev().enumerate() {
    v.set(i, env.vector((capture_index, root_node.map(|_| node)))?)?;
}

The captures are reversed before they are passed to add-face-text-property. In the elisp version, the order of the captures are not reversed. I have tested adding a reverse call to tree-sitter-highlight--lisp at L240 and it seems to have fixed the issue.

I'm quite new to all of this so please let me know if I have misunderstood the code somewhere.

@th0rex
Copy link
Author

th0rex commented Dec 17, 2019

Oh yeah seems like I've just forgotten that in the elisp version then, I only use that to play around with changes anyway because I don't need to restart emacs to reload the rust code.

@rirze
Copy link

rirze commented Dec 31, 2019

Just wondering, what is the state of the branch atm? I tried to load it in this morning, but I'm unable to get updated code to appear formatted by tree-sitter. For example, activating tree-sitter-highlight-mode colors the file but also outputs this in the messages buffer:

helm-M-x: Wrong number of arguments: (2 . 2), 4Invalid face reference: nil
Invalid face reference: nil [727 times]

And then anytime code is updated or even navigated, the Invalid face reference: nil error count increments wildly. Is this a error I can mitigate on my end?

@th0rex
Copy link
Author

th0rex commented Dec 31, 2019

The code works decently well for me, I've been using it for the past two weeks in Rust and C without any huge problems. I've never seen that error, it could be my fault, I don't really know how faces work in emacs, but settings the face to nil should be allowed. In what language are you getting that error? What version of emacs are you using? I've also briefly been using helm/ivy with this and didn't have any problems there.

@rirze
Copy link

rirze commented Dec 31, 2019

Using Emacs 26.3, tried Python highlighting and ran into this error.

I started with my existing setup, downloaded your fork, ran make build and make ensure/python. Both succeeded just fine. Along with the relevant setup code you gave in the first comment in this PR, I was able to load the fork just fine.

When I opened up a Python file however, it would not automatically turn on tree-sitter-highlight-mode even though I added it as a hook. That's fine, I can just turn it on manually. As soon as I do M-x tree-sitter-highlight-mode, I run into those errors above.

I'm not really sure how to debug this.

I'll try testing this on a clean Emacs setup and see if it is Helm related or something else on my end.

@rirze
Copy link

rirze commented Jan 2, 2020

@th0rex I turned on debugging, and this is the trace I get:

Debugger entered--Lisp error: (wrong-number-of-arguments (2 . 2) 4)
  #f(compiled-function (var val) "Set variable VAR to value VAL in current buffer." #<bytecode 0x2238cf>)(font-lock-fontify-buffer-function (lambda nil (tree-sitter-highlight--jit (point-min) (point-max))) font-lock-fontify-region-function (lambda (start end _verbose) (tree-sitter-highlight--jit start end)))
  (setq-local font-lock-fontify-buffer-function (lambda nil (tree-sitter-highlight--jit (point-min) (point-max))) font-lock-fontify-region-function (lambda (start end _verbose) (tree-sitter-highlight--jit start end)))
  tree-sitter-highlight--enable()
  (prog1 (tree-sitter-highlight--enable) (setq err nil))
  (unwind-protect (prog1 (tree-sitter-highlight--enable) (setq err nil)) (if err (progn (setq tree-sitter-highlight-mode nil))))
  (let ((err t)) (unwind-protect (prog1 (tree-sitter-highlight--enable) (setq err nil)) (if err (progn (setq tree-sitter-highlight-mode nil)))))
  (if tree-sitter-highlight-mode (let ((err t)) (unwind-protect (prog1 (tree-sitter-highlight--enable) (setq err nil)) (if err (progn (setq tree-sitter-highlight-mode nil))))) (tree-sitter-highlight--disable))
  (let ((last-message (current-message))) (setq tree-sitter-highlight-mode (if (eq arg (quote toggle)) (not tree-sitter-highlight-mode) (> (prefix-numeric-value arg) 0))) (tree-sitter-mode) (if tree-sitter-highlight-mode (let ((err t)) (unwind-protect (prog1 (tree-sitter-highlight--enable) (setq err nil)) (if err (progn (setq tree-sitter-highlight-mode nil))))) (tree-sitter-highlight--disable)) (run-hooks (quote tree-sitter-highlight-mode-hook) (if tree-sitter-highlight-mode (quote tree-sitter-highlight-mode-on-hook) (quote tree-sitter-highlight-mode-off-hook))) (if (called-interactively-p (quote any)) (progn nil (if (and (current-message) (not (equal last-message (current-message)))) nil (let ((local " in current buffer")) (message "Tree-Sitter-Highlight mode %sabled%s" (if tree-sitter-highlight-mode "en" "dis") local))))))
  tree-sitter-highlight-mode()
  enable-ts-hl()
  run-hooks(change-major-mode-after-body-hook prog-mode-hook python-mode-hook)
  apply(run-hooks (change-major-mode-after-body-hook prog-mode-hook python-mode-hook))
  run-mode-hooks(python-mode-hook)
  python-mode()
  set-auto-mode-0(python-mode nil)
  set-auto-mode()
  normal-mode(t)
  after-find-file(nil t)
  find-file-noselect-1(#<buffer min_spaces_composed_str.py> "~/cart/codereview/min_spaces_composed_str.py" nil nil "~/cart/codereview/min_spaces_composed_str.py" (3937650 66306))
  find-file-noselect("~/cart/codereview/min_spaces_composed_str.py" nil nil t)
  find-file("~/cart/codereview/min_spaces_composed_str.py" t)
  funcall-interactively(find-file "~/cart/codereview/min_spaces_composed_str.py" t)
  call-interactively(find-file nil nil)
  command-execute(find-file)

I'm wondering if I missed some installation step with regards to the rust/elisp code. This is my setup in a clean Emacs launch:

(require 'f)
(add-to-list 'load-path "/somefolder/repos/emacs-tree-sitter")
(require 'tree-sitter)
(ts-require-language 'python)

(setq tree-sitter-highlight-query-dir "/somefolder/repos/emacs-tree-sitter/grammars")
(require 'tree-sitter-highlight)

(defun enable-ts-hl ()
  (tree-sitter-highlight-mode))

(add-hook 'python-mode-hook 'enable-ts-hl)

Before loading this, I do the usual make build, make ensure/python commands in the tree-sitter directory. Any insight would be appreciated!

Edit: I think I may be using the elisp code rather than the rust code-- how do I enable this switch?

@th0rex
Copy link
Author

th0rex commented Jan 2, 2020

Seems like setq-local only supports one variable and value form instead of arbitrary many in all released emacs versions, and I've been using emacs compiled from git directly and thus had support for arbitrary many. That should be fixed now with the latest commit, thanks for finding this.

The elisp version referred to the actual highlighting (which now is all in rust), I've since removed that as I've only used it to quickly experiment with changes.

@rirze
Copy link

rirze commented Jan 2, 2020

@th0rex Thanks for figuring that out! I guess I should eventually upgrade to 27, but that's good to know.

I now have the highlighting working, and am really enjoying it. I'll let you know if I find anything else.

@ubolonton
Copy link
Collaborator

font-lock/jit-lock use after-change-functions hook to compute the region to refontify .

The specific functions are font-lock-extend-after-change-region-function/font-lock-extend-jit-lock-region-after-change. They seem to do very little, relying on the font-lock-default-fontify-region function to safely extend the region, e.g. to cover multiline patterns previously fontified (with regex). In other words, they seem to expect font-lock-fontify-region-function to be stateful.

Since the tree-sitter-based highlighting function is stateless (which is better), I think we should replace these two functions with one that uses ts-changed-ranges, even if we still use the rest of font-lock/jit-lock.

@rirze
Copy link

rirze commented Feb 8, 2020

A suggestion for default behavior-- I prefer operators to be highlighted the same as keywords. Correct me if I'm wrong, but I believe this is how highlighters other this PR would treat them as well. I've made the following change to get the behavior I like:

(defface tree-sitter-operator-face '((default :inherit font-lock-keyword-face))
  "Face used for operator"
  :group 'tree-sitter-highlight-faces)

@ubolonton
Copy link
Collaborator

ubolonton commented Feb 25, 2020

Just a heads-up, if you are still working on this: I reorganized the directory structure a bit, moving all Lisp code to lisp/. You may want to rebase this branch.

I also reworked the types in #26.

@th0rex
Copy link
Author

th0rex commented Feb 26, 2020

Currently I'm not working on this anymore, but I will again in the future (at most ~6 months from now) and I hope to get a better version done then. You were correct regarding the comment about font-lock, in fact sometimes highlighting got out of sync and some regions were left unfontified during the time I was using this but font-lock thought they were fontified, which is likely a result of font-lock and tree-sitter disagreeing on what region needs to be updated.

In general I'm still not quite sure how to clear highlights correctly after a change without causing "flickering", which was the reason for trying to use font/jit-lock. Using after-change-hook and immediately clearing ts-changed-ranges and then re-applying the faces according to the query results is correct, however as soon as your code has syntax errors in it (which happens often during editing), a really large region of text was cleared and the faces were only re-applied to the point at which the syntax error was.

@maxbrunsfeld
Copy link

Using after-change-hook and immediately clearing ts-changed-ranges and then re-applying the faces according to the query results is correct

Yes, this sounds like the right approach to me; you should always be able to highlight based on the current syntax tree, regardless of whether there's a syntax error or what. That's what Atom does. There should be no need for any state besides the tree itself.

however as soon as your code has syntax errors in it (which happens often during editing), a really large region of text was cleared

I'm curious about this. Syntax errors shouldn't cause noticeable problems for syntax highlighting. If they are, you might be hitting a bug in the query system, or a parsing bug. What language are you using? What highlights are you seeing? If you get a chance, could you try to reproduce the problem with the tree-sitter command line tool, either via tree-sitter parse, or tree-sitter web-ui?

@th0rex
Copy link
Author

th0rex commented Feb 26, 2020

I'm curious about this. Syntax errors shouldn't cause noticeable problems for syntax highlighting. If they are, you might be hitting a bug in the query system, or a parsing bug. What language are you using? What highlights are you seeing? If you get a chance, could you try to reproduce the problem with the tree-sitter command line tool, either via tree-sitter parse, or tree-sitter web-ui?

@maxbrunsfeld

I'm fairly certain this isn't a bug but rather my fault, however I wasn't (and still am not) sure how to do this correctly. This was basically the code in question:

(defun tree-sitter-highlight--handle-change (change)
  "Called for each CHANGE after an edit."
  (let* ((start-byte (aref change 0))
         (end-byte   (aref change 1))
         (node       (ts-get-descendant-for-byte-range (ts-root-node tree-sitter-tree) start-byte end-byte))
         (matches    (ts-query-captures tree-sitter-highlight--queries node tree-sitter-highlight--query-cursor nil #'ts-node-text)))
    (set-text-properties (ts-node-start-position node) (ts-node-end-position node) nil)
    (seq-do #'tree-sitter-highlight--apply-match matches)))

(defun tree-sitter-highlight--after-change (old-tree)
  "Called with the OLD-TREE after the buffer has changed."
  (let* ((changes (ts-changed-ranges tree-sitter-tree old-tree)))
    (seq-do #'tree-sitter-highlight--handle-change changes)))

As far as I can remember, this worked well (it was always correct), however was rather slow since on (some) syntax errors in rust (which I tested this most on since I was mostly writing rust) the ts-get-descendant-for-byte-range would return a node that spans the whole buffer or most of the buffer, which resulted in a noticeable slowdown (at least on my fairly old processor) while typing. Thus I tried implementing that in Rust, however that also didn't speed things up too much (maybe emacs isn't too fast with removing and applying faces on these large ranges). I then tried various ways of somehow only highlighting the visible region, but most of them were broken in one way or the other (i.e. after some amount of editing all of them resulted in a region of code always being unhighlighted), this was also my fault.

That is the whole reason why I even tried integrating with jit-lock/font-lock, I hoped they would somehow help improve this, but they also had their own problems.

The syntax errors do not (as far as I noticed) cause any errors in the tree that propagate too far or result in broken queries etc, but they caused me (since I don't know of a better way) to re-highlight most of the buffer most of the time, which was slow. I do not know how to correctly determine which part I should clear and run queries on after changes, that do not result in the whole buffer being rehighlighted on some syntax errors. I've also tried looking at how atom does this but couldn't really find anything.

@maxbrunsfeld
Copy link

maxbrunsfeld commented Feb 26, 2020

@th0rex Thanks for the reply.

I haven't checked how this is exposed to elisp, but I would think that you could use one of these APIs to confine your query to the visible screen range, so that they query would not take too long.

And then, on top of that, I think you could optionally cache the result of highlighting individual lines, and invalidate/splice that cache based on the ts-changed-ranges. That way, for a given frame, you would only run the query on the portion of the syntax tree that is both:

  1. not yet cached
  2. on screen

I'm not familiar with the emacs APIs, so there might be some details that make this difficult that I'm not taking into account.

@th0rex
Copy link
Author

th0rex commented Feb 26, 2020

Ok so I've just retried this, now using the "proper" (I think) way to determine what's visible after scrolling took place. Even without caching the performance (in pure lisp now) seems to be a lot better, and it also seems to always correctly highlight now. I don't really remember why I didn't do this the first time but seems like I should have. Thanks @maxbrunsfeld

I'll still eventually implement caching however.

I've also "rebased".

(puthash key value table)))
tree-sitter-highlight-default-faces)
table))
(let ((x (tree-sitter-highlight--make-queries (alist-get major-mode
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ubolonton not related to this PR but IMO tree-sitter should not be on top of the existing major modes but it should be a replacement for them. So instead of having major-mode -> language mapping IMO it should be "file-regex" -> language.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When there are tree-sitter-based major modes, the mapping mechanism would simply be auto-mode-alist.

However, that's likely too far away in the future. For adoption, I think it's better to piggyback on major modes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just use tree-sitter-language here though, and leave the mapping for tree-sitter-mode.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, maybe we can define a tree-sitter-highlight-queries var, as the replacement of font-lock-keywords.

Language-specific minor modes (e.g. tree-sitter-rust) would then be responsible for loading/populating the queries however they see fit.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to raise one more point: major-mode -> language mapping is not N -> 1 but it is N -> N. We had that issue in lsp-mode - for example, web-mode could be html, javascript, typescript, and so on.

Also, why do you think that it is too far in the future? I mean except highlighting and eventually indentation what are the other things that will be missing to have tree-sitter centric major modes?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to raise one more point: major-mode -> language mapping is not N -> 1 but it is N -> N. We had that issue in lsp-mode - for example, web-mode could be html, javascript, typescript, and so on.

Yeah, that's definitely an issue. I haven't thought about multi-language files in-depth.

Also, why do you think that it is too far in the future? I mean except highlighting and eventually indentation what are the other things that will be missing to have tree-sitter centric major modes?

I think the main technical barrier is that many Emacs distributions have dynamic module support turned off. Other than that, I was mainly thinking about adoption/mindshare.

th0rex added 3 commits March 29, 2020 22:35
This should fix a kind of rare bug, where still some things would not
get highlighted at all.

I've also documented the code more and done some general cleanup.

Additionally, Stefan Monnier notified me per email that I should be
using add-hook and remove-hook for window-scroll-functions.
@th0rex
Copy link
Author

th0rex commented Mar 29, 2020

Just a quick update: I've rebased on the latest master again, cleaned up the code, fixed two other bugs (although this makes scrolling slower now, but its fine if the file is byte compiled) and added changes suggested Stefan Monnier.

th0rex added 4 commits March 29, 2020 22:52
This only applies to scrolling and only in some cases, and highlighting is
usually quickly correct again after the syntax error is fixed.
Thus it's off by default, since it improves performance a lot this way.
@rirze
Copy link

rirze commented Mar 30, 2020

Is it possible to use the compiled language binaries (eg. python.so) instead of the .scm file for highlighting? More generally, is it possible to combine both functions to use the same set of language specifications? It seems like a possible redundancy to distribute both scm and so files in the eventual future of this package.

@maxbrunsfeld
Copy link

The .so files are parsers, and the .scm files are "tree queries" that specify how syntax highlighting should be applied to the syntax tree. They serve two different purposes, and I think both are needed for syntax highlighting.

@rirze
Copy link

rirze commented Mar 30, 2020

Thanks for clarifying, I figured that was the case.

@th0rex
Copy link
Author

th0rex commented Mar 30, 2020

I've now switched to using font-lock again (as per the suggestion of Stefan Monnier), however there are still lags for large buffers where ts-changed-ranges or
ts-get-descendant-for-position-range return a range spanning the whole buffer, but it still is at least always correctly highlighted now. This has nothing to do with font-lock however, and I'm absolutely not sure how to fix this. I can reliably makets-changed-ranges produce the whole buffer as a range by just inserting a ( in one rust source file at line ~1000 of ~1500, and I don't really have the time to find out how to work around that. Everything I've tried is either incorrect or slows down in that file (it's not some kind of stress test file btw, just one of the only ones where I've noticed it until now).
Initially I thought tree-sitter would somehow always emit only "small" changed ranges
(I didn't expect to have to implement caching on top of tree-sitter and thought it would only emit "small" updates when possible, instead of wanting to re-query all 1500 lines). I know this is a hard problem to solve and I don't mean to shit on the tree-sitter devs, but like I said I currently do not have any more time for this.

If anyone wants to continue this work, they're free to use all my code (if it's useful anyway) in any way they'd like. Like I said, the highlighting with font-lock should at least always be correct now, but may be slow for some files while doing some edits. Maybe this is also a problem with how the grammar is specified and not with tree-sitter itself and it would be working perfectly with other files, or I'm still using tree-sitter wrong, I'm not sure.

@th0rex th0rex closed this Mar 30, 2020
@maxbrunsfeld
Copy link

I think the solution would be to only highlight ranges that are both: 1. changed and 2. on screen.

@th0rex
Copy link
Author

th0rex commented Mar 30, 2020

I've tried that in many different variations, the problems were always with things like

r#"

// screen becomes visible here (i.e. this is the first line in the window), because we scrolled down

// we new close the raw string literal:
"#;

As far as I can see I need to run the query on the whole buffer (or on whatever ts-changed-ranges reports, which would be the whole buffer in that case, and also lead us back to the same problem with the lag) to highlight that correctly. Running the query on the window doesn't produce highlights for the raw string literal (I guess because the whole node isn't contained in the window?).

Also to be clear: only highlighting the visible lines isn't the problem, actually running the query on the whole buffer is. It is kind of slow, maybe it's also the interop between rust and emacs, I don't know.

@ubolonton
Copy link
Collaborator

@th0rex Thank you for all your hard work experimenting with this!

I'm sorry to bother you with one more request. I understand that you are frustrated and don't want to work on this anymore. I just want to get more info to continue on this. Can you send me these:

  • The example Rust file with the issue, and the location to insert (. I get the same behavior on one of the project's files, but I want to have as similar a starting point as possible.
  • The patches that Stefan sent you.
  • If possible, the code you tried to limit the ranges to rehighlight.

Running the query on the window doesn't produce highlights for the raw string literal (I guess because the whole node isn't contained in the window?).

I think extending the window range a bit with ts-get-descendant-for-position-range should work.

It is kind of slow, maybe it's also the interop between rust and emacs

This is an issue with Emacs dynamic's modules in general, more so with the Rust binding (because it includes a workaround for a safety bug in Emacs's GC, which will supposedly be fixed in Emacs 27). Calling ts--query-cursor-matches with index-only seems to be significantly faster in my quick test. I'll try making it return indexes and byte offsets only: no capture names, no nodes.

@th0rex
Copy link
Author

th0rex commented Mar 31, 2020

The example Rust file with the issue, and the location to insert (. I get the same behavior on one of the project's files, but I want to have as similar a starting point as possible.

Sadly I can't since it's from a project for work.

The patches that Stefan sent you.

https://gist.github.com/th0rex/64a48204f5129c0cd293bd90932bd018

That patch didn't work immediately, but the current state of this PR should be similar enough to it.

If possible, the code you tried to limit the ranges to rehighlight.

https://github.com/th0rex/emacs-tree-sitter/blob/only_hl_visible_query_reported/lisp/tree-sitter-highlight.el
It does some maybe hacky things related to the issue with multiline string literals I mentioned above, doesn't use font-lock and in my testing also seemed to always be correct, but it still lags for those certain edits. (That branch only highlights what's completely visible plus some more things, i.e. precisely those nodes that also either end or start in our window, or the one node (if it exists) whose start and end lies outside of the window, but which spans the window (i.e. if the raw string literal starts and ends before the window and the window looks into the middle part of it), and runs the queries on what ts-changed-ranges or ts-get-descendant-for-position-range (when scrolling) reports)

Calling ts--query-cursor-matches with index-only seems to be significantly faster in my quick test. I'll try making it return indexes and byte offsets only: no capture names, no nodes.

If this would indeed fix it, it would be very nice.

Thank you for looking into this as well.

@maxbrunsfeld
Copy link

maxbrunsfeld commented Mar 31, 2020

Running the query on the window doesn't produce highlights for the raw string literal (I guess because the whole node isn't contained in the window?).

That's surprising to me. That's not the intended behavior of ts_query_cursor_set_byte_range and ts_query_cursor_set_point_range, and it's not the behavior that I see in the web playground. The TSQueryCursor is supposed to return matches for any nodes that overlap the specified range. The nodes shouldn't need to be fully contained by the range.

It sounds like you're hitting a bug in Tree-sitter, or the Tree-sitter rust binding maybe? If anyone is interested and has time, it would be great to get a reproducible bug report for this.

@ubolonton
Copy link
Collaborator

ubolonton commented Apr 11, 2020

I'm continuing the work here.

I'm using it locally, and haven't noticed any slowdown so far. It feels than font-lock-mode sometimes.

Here are the updates to some of the previous issues.

The main slowness was due to the highlighting code adding the same face multiple times. As a result, the face attribute becomes a very long list. Emacs's display is very slow when this happens.

I can reliably make ts-changed-ranges produce the whole buffer as a range by just inserting a (

I got this with Rust code as well. It seems specific to the Rust grammar. I could never reproduce it in C code. Scoping the queries to the invalidated regions (roughly the visible region) works for me though.

Running the query on the window doesn't produce highlights for the raw string literal (I guess because the whole node isn't contained in the window?).

I couldn't reproduce this. I did get large chunks of text not being fontified, but the cause turned out to be the intricate interactions between Emacs's font-lock and jit-lock. These interactions are gnarly and poorly documented, probably because they have been effectively "internal" for so many years. For example, returning (jit-lock-bounds beg . end) is the correct way, not setting 'fontified directly.

That's surprising to me. That's not the intended behavior of ts_query_cursor_set_byte_range and ts_query_cursor_set_point_range, and it's not the behavior that I see in the web playground. The TSQueryCursor is supposed to return matches for any nodes that overlap the specified range. The nodes shouldn't need to be fully contained by the range.

This is the behavior I've observed so far (no full containment needed). However, I did get some other interesting cases involving ts_tree_get_changed_ranges and ts_query_cursor_set_point_range. They are very edge cases though, as they happen for queries more complex than those currently used in the language repos. One of them happens only under the situation created by this bug in the Rust grammar. I'm still investigating them, and will open an issue on tree-sitter later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants