Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change: (org-ql--tags-at) Support tag hierarchies #146

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

maxchaos
Copy link

@maxchaos maxchaos commented Nov 8, 2020

I think this is a simple solution for #145. It seems to work for my use cases, I haven't noticed an overhead so far and I think it is robust to cases where two tag groups reference one another thus leading to cycles and infinite recursion.

@alphapapa alphapapa force-pushed the master branch 2 times, most recently from e441d88 to 4194456 Compare November 9, 2020 02:07
Copy link
Owner

@alphapapa alphapapa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much for your work on this. Please forgive my nitpicking, but I do try to be consistent. :)

org-ql.el Show resolved Hide resolved
org-ql.el Outdated Show resolved Hide resolved
org-ql.el Outdated Show resolved Hide resolved
org-ql.el Outdated Show resolved Hide resolved
org-ql.el Outdated Show resolved Hide resolved
org-ql.el Outdated Show resolved Hide resolved
org-ql.el Outdated Show resolved Hide resolved
org-ql.el Outdated Show resolved Hide resolved
org-ql.el Outdated Show resolved Hide resolved
@alphapapa
Copy link
Owner

Thanks! I appreciate your attention to detail and clarity. I have a few more suggestions that I'll make as review comments, and then a few more things to mention separately...

org-ql.el Show resolved Hide resolved
org-ql.el Show resolved Hide resolved
org-ql.el Outdated Show resolved Hide resolved
@alphapapa
Copy link
Owner

alphapapa commented Nov 10, 2020

Here are a few more thoughts:

  1. I wonder if we should add a tag-hierarchy cache. IIUC, when calling org-ql--tags-at on, e.g. several level-1 headings in an Org buffer, even if they have the same tags, the new function org-ql--expand-tag-hierarchy will be called on each heading, because we only cache tags at each heading position. In the case of a large file with several file-wide group tags, that would seem to repeat the hierarchy expansion unnecessarily. What do you think?

  2. I ran a quick benchmark (well, it wasn't that quick because I ran it 10 times each) on my own Org files with and without this PR, and I was pleasantly surprised to see that the performance impact seemed negligible. However: a) although I have the org-group-tags boolean set to non-nil, I don't actually have any group tags in any of my Org files, and b) I still feel slightly skeptical about the results.

    Would you mind running the benchmark on some of your files and sharing the results? I used my bench-multi macros from https://github.com/alphapapa/emacs-package-dev-handbook. Here's what I ran:

This is messy, but it ought to be fair.

#+BEGIN_SRC elisp
  (bench-multi-lexical :times 10 :ensure-equal t
    :forms (("without group-tags support"
             (progn
               (setf org-ql-cache (make-hash-table :weakness 'key)
                     org-ql-tags-cache (make-hash-table :weakness 'key)
                     org-ql-node-value-cache (make-hash-table :weakness 'key))
               (defun org-ql--expand-tag-hierarchy (tags &optional excluded)
                 "Return TAGS along with their associated group tags.
  This function recursively searches for groups that each given tag belongs to,
  directly or indirectly, and includes the corresponding group tags to the result.

  TAGS should be a list of tags (i.e., strings).
  If non-nil, EXCLUDED should be a list of group tags that will not be
  automatically added to the results unless they are already in TAGS."
                 (let ((groups (org-tag-alist-to-groups org-current-tag-alist))
                       (excluded (append tags excluded)))
                   (let (group-tags)
                     (dolist (tag tags)
                       (pcase-dolist (`(,group-tag . ,group-members) groups)
                         (when (and (not (member group-tag excluded))
                                    ;; Check if one of the members in the group matches tag.
                                    ;; Notice that each member may be a plain string or
                                    ;; a regexp pattern (enclosed between curly brackets).
                                    (--some (if (string-match-p "^[{].+[}]$" it)
                                                ;; If pattern (it) is a regexp, remove the brackets and
                                                ;; make sure that it either matches the whole tag or not.
                                                (string-match-p (concat "^" (substring it 1 -1) "$") tag)
                                              ;; Check if member (it) is identical to tag.
                                              (string= it tag))
                                            group-members))
                           (push group-tag group-tags))))
                     ;; If group tags not already included have been found,
                     ;; then recursively expand them as well.
                     ;; Notice that by passing (group-tags excluded) to the next call
                     ;; instead of ((append tags group-tags)) ensures that we do not
                     ;; unnecessarily loop over the elements of TAGS more than once.
                     (if group-tags
                         (append tags (org-ql--expand-tag-hierarchy group-tags excluded))
                       tags))))
               (defun org-ql--tags-at (position)
                 "Return tags for POSITION in current buffer.
  Returns cons (INHERITED-TAGS . LOCAL-TAGS)."
                 ;; I'd like to use `-if-let*', but it doesn't leave non-nil variables
                 ;; bound in the else clause, so destructured variables that are non-nil,
                 ;; like found caches, are not available in the else clause.
                 (if-let* ((buffer-cache (gethash (current-buffer) org-ql-tags-cache))
                           (modified-tick (car buffer-cache))
                           (tags-cache (cdr buffer-cache))
                           (buffer-unmodified-p (eq (buffer-modified-tick) modified-tick))
                           (cached-result (gethash position tags-cache)))
                     ;; Found in cache: return them.
                     (pcase cached-result
                       ('org-ql-nil nil)
                       (_ cached-result))
                   ;; Not found in cache: get tags and cache them.
                   (let* ((local-tags (or (when (looking-at org-ql-tag-line-re)
                                            (split-string (match-string-no-properties 2) ":" t))
                                          'org-ql-nil))
                          (inherited-tags (or (when org-use-tag-inheritance
                                                (save-excursion
                                                  (if (org-up-heading-safe)
                                                      ;; Return parent heading's tags.
                                                      (-let* (((inherited local) (org-ql--tags-at (point)))
                                                              (tags (when (or inherited local)
                                                                      (cond ((and (listp inherited)
                                                                                  (listp local))
                                                                             (->> (append inherited local)
                                                                                  -non-nil -uniq))
                                                                            ((listp inherited) inherited)
                                                                            ((listp local) local)))))
                                                        (cl-typecase org-use-tag-inheritance
                                                          (list (setf tags (-intersection tags org-use-tag-inheritance)))
                                                          (string (setf tags (--select (string-match org-use-tag-inheritance it)
                                                                                       tags))))
                                                        (pcase org-tags-exclude-from-inheritance
                                                          ('nil tags)
                                                          (_ (-difference tags org-tags-exclude-from-inheritance))))
                                                    ;; Top-level heading: use file tags.
                                                    org-file-tags)))
                                              'org-ql-nil))
                          (all-tags (list inherited-tags local-tags)))
                     ;; Check caches again, because they may have been set now.
                     ;; TODO: Is there a clever way we could avoid doing this, or is it inherently necessary?
                     (setf buffer-cache (gethash (current-buffer) org-ql-tags-cache)
                           modified-tick (car buffer-cache)
                           tags-cache (cdr buffer-cache)
                           buffer-unmodified-p (eq (buffer-modified-tick) modified-tick))
                     (unless (and buffer-cache buffer-unmodified-p)
                       ;; Buffer-local tags cache empty or invalid: make new one.
                       (setf tags-cache (make-hash-table))
                       (puthash (current-buffer)
                                (cons (buffer-modified-tick) tags-cache)
                                org-ql-tags-cache))
                     (puthash position all-tags tags-cache))))
               (org-ql-select (org-ql-search-directories-files)
                 '(tags "Emacs")
                 :action #'point)))


            ("with group-tags support"
             (progn
               (setf org-ql-cache (make-hash-table :weakness 'key)
                     org-ql-tags-cache (make-hash-table :weakness 'key)
                     org-ql-node-value-cache (make-hash-table :weakness 'key))
               (defun org-ql--expand-tag-hierarchy (tags &optional excluded)
                 "Return TAGS along with their associated group tags.
  This function recursively searches for groups that each given tag belongs to,
  directly or indirectly, and includes the corresponding group tags to the result.

  TAGS should be a list of tags (i.e., strings).
  If non-nil, EXCLUDED should be a list of group tags that will not be
  automatically added to the results unless they are already in TAGS."
                 (let ((groups (org-tag-alist-to-groups org-current-tag-alist))
                       (excluded (append tags excluded)))
                   (let (group-tags)
                     (dolist (tag tags)
                       (pcase-dolist (`(,group-tag . ,group-members) groups)
                         (when (and (not (member group-tag excluded))
                                    ;; Check if one of the members in the group matches tag.
                                    ;; Notice that each member may be a plain string or
                                    ;; a regexp pattern (enclosed between curly brackets).
                                    (--some (if (string-match-p "^[{].+[}]$" it)
                                                ;; If pattern (it) is a regexp, remove the brackets and
                                                ;; make sure that it either matches the whole tag or not.
                                                (string-match-p (concat "^" (substring it 1 -1) "$") tag)
                                              ;; Check if member (it) is identical to tag.
                                              (string= it tag))
                                            group-members))
                           (push group-tag group-tags))))
                     ;; If group tags not already included have been found,
                     ;; then recursively expand them as well.
                     ;; Notice that by passing (group-tags excluded) to the next call
                     ;; instead of ((append tags group-tags)) ensures that we do not
                     ;; unnecessarily loop over the elements of TAGS more than once.
                     (if group-tags
                         (append tags (org-ql--expand-tag-hierarchy group-tags excluded))
                       tags))))
               (defun org-ql--tags-at (position)
                 "Return tags for POSITION in current buffer.
  Returns cons (INHERITED-TAGS . LOCAL-TAGS)."
                 ;; I'd like to use `-if-let*', but it doesn't leave non-nil variables
                 ;; bound in the else clause, so destructured variables that are non-nil,
                 ;; like found caches, are not available in the else clause.
                 (if-let* ((buffer-cache (gethash (current-buffer) org-ql-tags-cache))
                           (modified-tick (car buffer-cache))
                           (tags-cache (cdr buffer-cache))
                           (buffer-unmodified-p (eq (buffer-modified-tick) modified-tick))
                           (cached-result (gethash position tags-cache)))
                     ;; Found in cache: return them.
                     (pcase cached-result
                       ('org-ql-nil nil)
                       (_ cached-result))
                   ;; Not found in cache: get tags and cache them.
                   (let* ((local-tags (or (when (looking-at org-ql-tag-line-re)
                                            (split-string (match-string-no-properties 2) ":" t))
                                          'org-ql-nil))
                          (inherited-tags (or (when org-use-tag-inheritance
                                                (save-excursion
                                                  (if (org-up-heading-safe)
                                                      ;; Return parent heading's tags.
                                                      (-let* (((inherited local) (org-ql--tags-at (point)))
                                                              (tags (when (or inherited local)
                                                                      (cond ((and (listp inherited)
                                                                                  (listp local))
                                                                             (->> (append inherited local)
                                                                                  -non-nil -uniq))
                                                                            ((listp inherited) inherited)
                                                                            ((listp local) local)))))
                                                        (cl-typecase org-use-tag-inheritance
                                                          (list (setf tags (-intersection tags org-use-tag-inheritance)))
                                                          (string (setf tags (--select (string-match org-use-tag-inheritance it)
                                                                                       tags))))
                                                        (pcase org-tags-exclude-from-inheritance
                                                          ('nil tags)
                                                          (_ (-difference tags org-tags-exclude-from-inheritance))))
                                                    ;; Top-level heading: use file tags.
                                                    org-file-tags)))
                                              'org-ql-nil))
                          all-tags)
                     (when org-group-tags
                       (unless (eq local-tags 'org-ql-nil)
                         (setq local-tags (org-ql--expand-tag-hierarchy local-tags)))
                       (unless (eq inherited-tags 'org-ql-nil)
                         (setq inherited-tags (org-ql--expand-tag-hierarchy inherited-tags))))
                     (setq all-tags (list inherited-tags local-tags))
                     ;; Check caches again, because they may have been set now.
                     ;; TODO: Is there a clever way we could avoid doing this, or is it inherently necessary?
                     (setf buffer-cache (gethash (current-buffer) org-ql-tags-cache)
                           modified-tick (car buffer-cache)
                           tags-cache (cdr buffer-cache)
                           buffer-unmodified-p (eq (buffer-modified-tick) modified-tick))
                     (unless (and buffer-cache buffer-unmodified-p)
                       ;; Buffer-local tags cache empty or invalid: make new one.
                       (setf tags-cache (make-hash-table))
                       (puthash (current-buffer)
                                (cons (buffer-modified-tick) tags-cache)
                                org-ql-tags-cache))
                     (puthash position all-tags tags-cache))))
               (org-ql-select (org-ql-search-directories-files)
                 '(tags "Emacs")
                 :action #'point)))
            ))
#+END_SRC

#+RESULTS:
| Form                       | x faster than next | Total runtime | # of GCs | Total GC runtime |
|----------------------------+--------------------+---------------+----------+------------------|
| without group-tags support | 1.01               |     52.832562 |        4 |         1.989522 |
| with group-tags support    | slowest            |     53.425342 |        5 |         2.479128 |

If the impact really is that small, then it should be no problem to merge it. But I would like to know the impact for users who actually use group tags as well.

Thanks very much for your work on this. It seems like you're an experienced Lisper, so you're saving me a lot of work by implementing this feature. :)

@maxchaos
Copy link
Author

maxchaos commented Nov 10, 2020

I was also considering the idea of adding a cache here myself but I have some doubts whether it will improve things. The reasons that come to mind are the following:

  1. We need to check somewhere that the tag group definitions have not changed since the last call, to see if we need to invalidate the cache. In case this has to be done inside org-ql--tags-at, I am not sure if the speed improvement will be considerable over the brute force search.
  2. About the keys, we can use either distinct tags or the lists of tags that appear in headings. The latter case will generally result in many elements and unnecessary evaluations during the creation of the cache (since the combinations may be many) whereas in the former case we need to make sure to use -uniq or something similar to merge the results associated with each individual tag.

What is your opinion on this?

Regardless, I ran a few benchmarks using the source code block you provided (I modified it to search inside a specific file for a tag that should generally result in 3 calls deep recursion) using the benchmarking utilities from you repo (although I had to change it a bit because I think a few stuff got renamed in the meantime, e.g., second -> cl-second). Here are the results of running the benchmark inside a file with >500 entries and another with >5000 entries, respectively.

  | Form                       | x faster than next | Total runtime | # of GCs | Total GC runtime |
  |----------------------------+--------------------+---------------+----------+------------------|
  | without group-tags support | 1.18               |      4.711496 |        1 |         0.080607 |
  | with group-tags support    | slowest            |      5.555277 |        4 |         0.333895 |
  | Form                       | x faster than next | Total runtime | # of GCs | Total GC runtime |
  |----------------------------+--------------------+---------------+----------+------------------|
  | without group-tags support | 1.03               |    273.321681 |        7 |         0.692409 |
  | with group-tags support    | slowest            |    280.282486 |       26 |         2.628378 |

What do you think?

You are welcome. I am afraid that I am below average when it comes to lisp but I am glad I can help.

@alphapapa
Copy link
Owner

  1. We need to check somewhere that the tag group definitions have not changed since the last call, to see if we need to invalidate the cache. In case this has to be done inside org-ql--tags-at, I am not sure if the speed improvement will be considerable over the brute force search.

IIUC, we're already using org-current-tag-alist, which is a per-buffer list of tag groups. So if we were to cache our computation here, we'd have a two-level cache, much like the query cache: first keyed on buffer, and then keyed on that variable's value. And maybe a third-level as well... (see next)

  1. About the keys, we can use either distinct tags or the lists of tags that appear in headings. The latter case will generally result in many elements and unnecessary evaluations during the creation of the cache (since the combinations may be many) whereas in the former case we need to make sure to use -uniq or something similar to merge the results associated with each individual tag.

Yeah, it starts to feel a little complicated.

Looking at your benchmark results, I think we should not add a cache, or at least not at this time. A 3-18% decrease when tag groups are searched for is not that big a penalty, so it probably wouldn't be worth the complexity to add another cache.

Regardless, I ran a few benchmarks using the source code block you provided (I modified it to search inside a specific file for a tag that should generally result in 3 calls deep recursion) using the benchmarking utilities from you repo (although I had to change it a bit because I think a few stuff got renamed in the meantime, e.g., second -> cl-second). Here are the results of running the benchmark inside a file with >500 entries and another with >5000 entries, respectively.

  | Form                       | x faster than next | Total runtime | # of GCs | Total GC runtime |
  |----------------------------+--------------------+---------------+----------+------------------|
  | without group-tags support | 1.18               |      4.711496 |        1 |         0.080607 |
  | with group-tags support    | slowest            |      5.555277 |        4 |         0.333895 |
  | Form                       | x faster than next | Total runtime | # of GCs | Total GC runtime |
  |----------------------------+--------------------+---------------+----------+------------------|
  | without group-tags support | 1.03               |    273.321681 |        7 |         0.692409 |
  | with group-tags support    | slowest            |    280.282486 |       26 |         2.628378 |

What do you think?

Thanks, that's actually a better result than I expected. It does do a lot of GC, maybe because of the string concatting, but if you ran each benchmark for :times 10, that's not too bad as a total, I think.

Would you mind doing one more test, please? What results do you get when you use the same configuration but do a search for a tag that is not part of a tag group?

I am afraid that I am below average when it comes to lisp but I am glad I can help.

You could have fooled me! :)

Thanks.

@maxchaos
Copy link
Author

To be honest, I was also not satisfied with the test cases I considered so I took the liberty of running more benchmarks on this. Particularly, I modified the two files I used before so that every entry has the tag leading to the deepest recursion and these are the results I got for this worst case scenario, corresponding to >500 and >5000 entries, respectively:

  | Form                       | x faster than next | Total runtime | # of GCs | Total GC runtime |
  |----------------------------+--------------------+---------------+----------+------------------|
  | without group-tags support | 1.30               |      4.985834 |        0 |                0 |
  | with group-tags support    | slowest            |      6.470958 |        3 |         0.422573 |
  | Form                       | x faster than next | Total runtime | # of GCs | Total GC runtime |
  |----------------------------+--------------------+---------------+----------+------------------|
  | without group-tags support | 1.04               |    358.349630 |        6 |         0.916658 |
  | with group-tags support    | slowest            |    373.206703 |       32 |         4.969402 |

Also, here are the results I get when searching for a tag not belonging to a tag group.

  | Form                       | x faster than next | Total runtime | # of GCs | Total GC runtime |
  |----------------------------+--------------------+---------------+----------+------------------|
  | without group-tags support | 1.16               |      4.371664 |        1 |         0.118342 |
  | with group-tags support    | slowest            |      5.076697 |        3 |         0.296978 |
  | Form                       | x faster than next | Total runtime | # of GCs | Total GC runtime |
  |----------------------------+--------------------+---------------+----------+------------------|
  | without group-tags support | 1.02               |    442.450478 |        7 |         0.729795 |
  | with group-tags support    | slowest            |    450.854371 |       27 |         2.650569 |

*I guess that the total duration of the benchmarks is greater compared to the first set of benchmarks because I am working on a laptop running on battery this time, otherwise it makes no sense :)

@alphapapa
Copy link
Owner

Thanks. Okay, I have one more benchmark request: would you run the same sets, but without having group tags configured? Specifically:

  • org-group-tags should be t
  • org-current-tag-alist should be nil

That's, perhaps, the most important set of benchmarks we should consider, because it should show how these changes will affect users who don't actually use group tags. If there's a 2-16% penalty for users who do have group tags configured, when they use a (tags ...) query, that's probably acceptable, as long as the impact for users who do not have group tags configured is minimal.

Thanks very much for your diligence and patience. I don't think that many Org users use group tags, but I expect that those who do will appreciate this feature being added.

@maxchaos
Copy link
Author

You are welcome. Here are the results of running the benchmarks on the same set of files after removing all group definitions without changing the value of org-group-tags.

  | Form                       | x faster than next | Total runtime | # of GCs | Total GC runtime |
  |----------------------------+--------------------+---------------+----------+------------------|
  | without group-tags support | 1.02               |      5.036427 |        0 |                0 |
  | with group-tags support    | slowest            |      5.127543 |        0 |                0 |
  |----------------------------+--------------------+---------------+----------+------------------|
  | without group-tags support | 1.03               |    445.970389 |        6 |         0.947121 |
  | with group-tags support    | slowest            |    459.784009 |        6 |         0.953201 |

Disappointing as it may be, it appears that a ~3% overhead is unavoidable without a cache even when no tag groups are defined.

@alphapapa
Copy link
Owner

Disappointing as it may be, it appears that a ~3% overhead is unavoidable without a cache even when no tag groups are defined.

Hm, ok, let's be a little more explicit about methodology:

  1. Checkout the commit to be tested.
  2. Open org-ql.el and ensure the buffer is reverted to the commit's version of it.
  3. M-x emacs-lisp-byte-compile-and-load RET to byte-compile org-ql.el.
  4. Execute the bench-multi-lexical Org Babel block.

In the first benchmark I ran, it showed a 1% decrease in performance. It may be that 1-3% is within the margin of normal variation, and maybe I'm being too picky, but I'd like to avoid a 3% performance decrease when there are no group tags being used.

Maybe we should test org-current-tag-alist as well as org-group-tags to decide whether to call org-ql--expand-tag-hierarchy. As it is, IIUC, if org-group-tags is non-nil, but no group tags are actually configured in a buffer, org-ql--expand-tag-hierarchy will still be called twice for no reason, which could account for some of this apparent performance decrease.

What do you think? Thanks.

@maxchaos
Copy link
Author

maxchaos commented Nov 13, 2020

I took your advice and added a test for org-current-tag-alist right into org-ql--tags-at. To be exact, I also check if (org-tags-alist-to-groups org-current-tags-alist) is nil because it is possible for an org file to declare tags using #+TAGS: ... directives but without defining any groups, in which case org-current-tag-alist would be non-nil but we have no reason to call org-ql--expand-tags-hierarchy. These are the results I get now when running the benchmarks inside the corresponding org-files but with all tag group definitions removed:

  | Form                       | x faster than next | Total runtime | # of GCs | Total GC runtime |
  |----------------------------+--------------------+---------------+----------+------------------|
  | with group-tags support    | 1.01               |      4.861521 |        1 |         0.089599 |
  | without group-tags support | slowest            |      4.917874 |        1 |         0.090685 |
  | Form                       | x faster than next | Total runtime | # of GCs | Total GC runtime |
  |----------------------------+--------------------+---------------+----------+------------------|
  | with group-tags support    | 1.01               |    446.948451 |        9 |         0.902002 |
  | without group-tags support | slowest            |    450.424228 |        9 |         0.914081 |

Additionally, I modified the definition of org-ql--expand-tags-hierarchy so that it does not unnecessarily call (org-tags-alist-to-groups org-current-tags-alist) since this is done now inside org-ql--tags-at. Moreover, the new implementation removes matching groups instead of adding them to some exclusion list and seems to reduce the number of checks overall. I rerun the benchmarks for the worst case scenarios that I considered before with the new implementation and the results I get also indicate this (x1.19 instead of x1.30).

  | Form                       | x faster than next | Total runtime | # of GCs | Total GC runtime |
  |----------------------------+--------------------+---------------+----------+------------------|
  | without group-tags support | 1.19               |      4.493477 |        1 |         0.133011 |
  | with group-tags support    | slowest            |      5.369126 |        2 |         0.269279 |
  | Form                       | x faster than next | Total runtime | # of GCs | Total GC runtime |
  |----------------------------+--------------------+---------------+----------+------------------|
  | without group-tags support | 1.05               |    534.770214 |        7 |         0.927440 |
  | with group-tags support    | slowest            |    561.754649 |       19 |         2.709767 |

What do you think?

org-ql.el Outdated
(setq inherited-tags (org-ql--expand-tag-hierarchy inherited-tags))))
(let ((tag-groups (and org-group-tags
org-current-tag-alist
(org-tag-alist-to-groups org-current-tag-alist))))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this means calling org-tag-alist-to-groups every time this function is called on a heading whose tags are not cached. In a file with many top-level headings (e.g. I have an inbox file with years of captured top-level headings), that would mean calling this function many times, and every time it would be with the same argument and would return the same value.

If I'm correct about that, it means that this function call probably belongs elsewhere and its result should be reused, at least per-search.

I feel like we need to reconsider the design of this part of the code altogether. It might be that this group expansion should be done at another step of the search process. But please see the later comments I'm about to make...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct about that but this is only going to happen if org-current-tag-alist is non-nil. Nonetheless, I agree with you that normally that result should be cached (and I originally thought that this was the purpose of org-tag-groups-alist). So maybe we can add a per-file cache for that?

org-ql.el Outdated
(let ((groups (or groups (org-tag-alist-to-groups org-current-tag-alist)))
matching-group-tags)
;; Remove groups with tags already in the results.
(mapc (lambda (tag) (setq groups (assoc-delete-all tag groups))) tags)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand the values these data structures have at runtime, because I haven't stepped through the code. But it seems to me like this step is something that ought to be avoided. If it seems necessary here, it might be because the design needs to be adjusted. It seems to me like this is work that shouldn't need to be done every time this function is called.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the above snippet, groups essentially holds the current set of group definitions (in the first call, it should be the result of (org-tag-alist-to-groups org-current-tag-alist). In order to avoid iterating over the members of groups whose group-tag is already in the tags passed to this function, I simply delete them from this variable. I think that this is an improvement over the previous implementation because if this function ends up recursing, the number of iterations of the pcase-dolist loop will be definitely less.

@alphapapa
Copy link
Owner

I rerun the benchmarks for the worst case scenarios that I considered before with the new implementation and the results I get also indicate this (x1.19 instead of x1.30).

Thanks, that's a significant improvement. I think we need to go further, though. It seems to me that some of the work being done in this code is being done at the wrong place in the process, which causes it to be done repeatedly, every time that org-ql--tags-at is called, and every time org-ql--expand-tags-hierarchy is called. If a function is being called repeatedly with the same argument and returns the same value, that generally means that it's being called at the wrong place, and the code needs to be redesigned.

Before we go any further, I need your help understanding something: How is searching for group tags supposed to work exactly? What I mean is, if I have defined a group tag hierarchy like:

  • sports
    • baseball
    • football

And I do a search for the tag "baseball", does this code expand "baseball" into ("sports" "baseball" "football")? Or does it return "baseball" for "baseball" and only expand "sports" into ("sports" "baseball" "football")? I thought it was the latter, but re-reading the docstring for this new function now, I'm not sure I understand. How is it supposed to work? How does it work in org.el itself?

In other words, an ECM (minimal working example) would be very helpful here. I think we need a minimal example Org file, with a set of group tags defined, and then we can give example searches with the expected results, and then we can (or I can--I'm sure you've already done this) reason about the steps necessary along the way.

This ECM will also be helpful for writing test cases before merging this, as none of the existing test data sets have group tags (and since their tests already work as-is, I'd prefer not to modify the test data, because that would likely require updating the test results, which is a lot of work).

Apologies for dragging this process on for so long. I appreciate your patience.

@maxchaos
Copy link
Author

Thanks, that's a significant improvement. I think we need to go further, though. It seems to me that some of the work being done in this code is being done at the wrong place in the process, which causes it to be done repeatedly, every time that org-ql--tags-at is called, and every time org-ql--expand-tags-hierarchy is called. If a function is being called repeatedly with the same argument and returns the same value, that generally means that it's being called at the wrong place, and the code needs to be redesigned.

I am certainly open to suggestions.

Before we go any further, I need your help understanding something: How is searching for group tags supposed to work exactly? What I mean is, if I have defined a group tag hierarchy like:

For this hierarchy, what you should get after calling org-ql--expand-tags-hierarchy on baseball is (sports baseball). This should be the case because when someone searches for sports, headings that have baseball in their list of tags should match.

In other words, an ECM (minimal working example) would be very helpful here. I think we need a minimal example Org file, with a set of group tags defined, and then we can give example searches with the expected results, and then we can (or I can--I'm sure you've already done this) reason about the steps necessary along the way.
This ECM will also be helpful for writing test cases before merging this, as none of the existing test data sets have group tags (and since their tests already work as-is, I'd prefer not to modify the test data, because that would likely require updating the test results, which is a lot of work).

As a matter of fact, I was also looking into how to add unit tests for this feature, so should I go ahead and write a simple org-file to use as a test case?

Apologies for dragging this process on for so long. I appreciate your patience.

You are welcome. On the contrary, I am sorry for not being more thorough with this.

@alphapapa
Copy link
Owner

Thanks, that's a significant improvement. I think we need to go further, though. It seems to me that some of the work being done in this code is being done at the wrong place in the process, which causes it to be done repeatedly, every time that org-ql--tags-at is called, and every time org-ql--expand-tags-hierarchy is called. If a function is being called repeatedly with the same argument and returns the same value, that generally means that it's being called at the wrong place, and the code needs to be redesigned.

I am certainly open to suggestions.

Thanks. After we have an ECM with sample data and example searches, I'll be able to understand it better and think about how to integrate into the existing code.

Before we go any further, I need your help understanding something: How is searching for group tags supposed to work exactly? What I mean is, if I have defined a group tag hierarchy like:

For this hierarchy, what you should get after calling org-ql--expand-tags-hierarchy on baseball is (sports baseball). This should be the case because when someone searches for sports, headings that have baseball in their list of tags should match.

Okay, so would it be fair to call the function something like get-parent-tag? Or does it do more than that in some cases? Does it only ever return one additional tag, the parent? Or does it sometimes return more than one additional tag?

This should be the case because when someone searches for sports, headings that have baseball in their list of tags should match.

I may be speaking prematurely, but this hints to me that we may indeed be approaching the problem from the wrong direction. The org-ql--tags-at function is intended to return the tags that a heading has and that it has inherited, so adding this functionality to that function may be wrong. Group tags are a different kind of thing, I think.

Thinking about this example, if a user searched for sports and wanted to get results for headings also tagged baseball, it seems like the solution should be to handle it in the query pre-processing stage, by expanding the search predicate (tags "sports") into (tags "sports" "baseball" "football"), and then the org-ql--tags-at function wouldn't need to be changed, and it wouldn't do any additional work.

However, if group tags are configured per-buffer, expanding it in the pre-processing stage wouldn't work. In that case, it would seem necessary to have an additional, per-buffer query-rewriting stage. I don't feel enthusiastic about adding another complex stage like that, but it would seem like a more "correct" solution. And such per-buffer query-rewriting could be useful in other cases as well, I think.

In other words, an ECM (minimal working example) would be very helpful here. I think we need a minimal example Org file, with a set of group tags defined, and then we can give example searches with the expected results, and then we can (or I can--I'm sure you've already done this) reason about the steps necessary along the way.
This ECM will also be helpful for writing test cases before merging this, as none of the existing test data sets have group tags (and since their tests already work as-is, I'd prefer not to modify the test data, because that would likely require updating the test results, which is a lot of work).

As a matter of fact, I was also looking into how to add unit tests for this feature, so should I go ahead and write a simple org-file to use as a test case?

Yes, please. Take a look at the tests/data2.org file; I think it needn't be more complicated than that.

Apologies for dragging this process on for so long. I appreciate your patience.

You are welcome. On the contrary, I am sorry for not being more thorough with this.

You're more thorough than most any other contributors, so I have no complaints. :) Thanks.

@maxchaos
Copy link
Author

Okay, so would it be fair to call the function something like get-parent-tag? Or does it do more than that in some cases? Does it only ever return one additional tag, the parent? Or does it sometimes return more than one additional tag?

Actually, get-ancestor-tags seems more appropriate. Assuming that a single tag is given, it would return its parent, its parent's parent and so forth.

I may be speaking prematurely, but this hints to me that we may indeed be approaching the problem from the wrong direction. The org-ql--tags-at function is intended to return the tags that a heading has and that it has inherited, so adding this functionality to that function may be wrong. Group tags are a different kind of thing, I think.
Thinking about this example, if a user searched for sports and wanted to get results for headings also tagged baseball, it seems like the solution should be to handle it in the query pre-processing stage, by expanding the search predicate (tags "sports") into (tags "sports" "baseball" "football"), and then the org-ql--tags-at function wouldn't need to be changed, and it wouldn't do any additional work.
However, if group tags are configured per-buffer, expanding it in the pre-processing stage wouldn't work. In that case, it would seem necessary to have an additional, per-buffer query-rewriting stage. I don't feel enthusiastic about adding another complex stage like that, but it would seem like a more "correct" solution. And such per-buffer query-rewriting could be useful in other cases as well, I think.

Well, I am not really certain about this. I agree that there are actually two different ways to approach this but expanding groups when searching does not feel more efficient in terms of computation nor code maintenance. Regarding the first point, notice that each time the user specifies a tag to match, that would have to get expanded into a set (that possibly contains regular expressions) and compare each element of that set with every tag in the headings. In that case, I feel that an extra cache is imperative whereas, with the current implementation, this thing is taken care of once, when the existing cache of tags is built. Regarding the second point, I haven't looked at how the query mechanism is implemented but, in general, you would have to take care of tag expansion at every point where you do a different kind of tag search (e.g., just local tags, just inherited tags, etc). My experience so far tells me that, unless one wishes to be able to also run queries that distinguish between tags explicitly stated in the heading and tags implicitly matched because of the hierarchy, then this approach has more merits.

Yes, please. Take a look at the tests/data2.org file; I think it needn't be more complicated than that.

Great! I'll probably take a look at it some time tomorrow then.

@maxchaos
Copy link
Author

maxchaos commented Nov 13, 2020

My experience so far tells me that, unless one wishes to be able to also run queries that distinguish between tags explicitly stated in the heading and tags implicitly matched because of the hierarchy, then this approach has more merits.

On the other hand, if one decides to toggle the state of org-group-tags after the cache has been built, then they would have to invalidate it in order for their change to take effect 🤔

@alphapapa
Copy link
Owner

Okay, so would it be fair to call the function something like get-parent-tag? Or does it do more than that in some cases? Does it only ever return one additional tag, the parent? Or does it sometimes return more than one additional tag?

Actually, get-ancestor-tags seems more appropriate. Assuming that a single tag is given, it would return its parent, its parent's parent and so forth.

Yes, that name seems even better.

I may be speaking prematurely, but this hints to me that we may indeed be approaching the problem from the wrong direction. The org-ql--tags-at function is intended to return the tags that a heading has and that it has inherited, so adding this functionality to that function may be wrong. Group tags are a different kind of thing, I think.
Thinking about this example, if a user searched for sports and wanted to get results for headings also tagged baseball, it seems like the solution should be to handle it in the query pre-processing stage, by expanding the search predicate (tags "sports") into (tags "sports" "baseball" "football"), and then the org-ql--tags-at function wouldn't need to be changed, and it wouldn't do any additional work.
However, if group tags are configured per-buffer, expanding it in the pre-processing stage wouldn't work. In that case, it would seem necessary to have an additional, per-buffer query-rewriting stage. I don't feel enthusiastic about adding another complex stage like that, but it would seem like a more "correct" solution. And such per-buffer query-rewriting could be useful in other cases as well, I think.

Well, I am not really certain about this. I agree that there are actually two different ways to approach this but expanding groups when searching does not feel more efficient in terms of computation nor code maintenance. Regarding the first point, notice that each time the user specifies a tag to match, that would have to get expanded into a set (that possibly contains regular expressions) and compare each element of that set with every tag in the headings. In that case, I feel that an extra cache is imperative whereas, with the current implementation, this thing is taken care of once, when the existing cache of tags is built. Regarding the second point, I haven't looked at how the query mechanism is implemented but, in general, you would have to take care of tag expansion at every point where you do a different kind of tag search (e.g., just local tags, just inherited tags, etc). My experience so far tells me that, unless one wishes to be able to also run queries that distinguish between tags explicitly stated in the heading and tags implicitly matched because of the hierarchy, then this approach has more merits.

Hm, I don't completely follow you. I think that, once we have concrete examples to reason about, it will be easier. In the meantime, though:

unless one wishes to be able to also run queries that distinguish between tags explicitly stated in the heading and tags implicitly matched because of the hierarchy

Note that there are specialized tags predicates, i.e. tags-local. I also just added tags-regexp, which might be useful for dealing with group tags. I think that the semantics of tags-local should not change when using group tags, i.e. it should still only return tags actually specified in an entry's heading line. IIUC the current code in this PR would change that, which I think we should avoid.

It might be worth considering to have a special predicate to match group tags. Or even if we don't end up making one, thinking about it might help us with designing the solution we end up using.

Yes, please. Take a look at the tests/data2.org file; I think it needn't be more complicated than that.

Great! I'll probably take a look at it some time tomorrow then.

Thanks.

@alphapapa alphapapa force-pushed the master branch 3 times, most recently from f52bef0 to ec0f8c7 Compare November 16, 2020 11:23
@alphapapa alphapapa added this to the 0.6 milestone Nov 16, 2020
@maxchaos
Copy link
Author

maxchaos commented Nov 22, 2020

Hi. Sorry for taking so long to respond. As we discussed last time, I have created a minimal org file (data3.org) with tag group definitions and added a few very basic unit tests in org-ql-test.el for it. I have pushed these changes to this branch, after merging you master on it.

Apart from that, I have pushed on a separate branch from this one (because the changes are more extensive and I think we should discuss them first), a different implementation of this feature. Particularly, I have addressed a few concerns raised earlier, namely a) calling (org-tag-alist-to-groups org-current-tag-alist) repeatedly for no good reason, b) not altering the current behaviour of tags-local and tags-inherited, c) ensure that, even if the user toggles the value of org-group-tags, new queries return results without matching tag groups while using the readily constructed cache of each buffer (i.e., no need to invalidate it) and finally d) the performance issue.

Specifically, in order to address issues (a) and (d), I have introduced a new cache, similar to the one of org-ql-tags-cache, that holds the last valid value of (org-tag-alist-to-groups org-current-tag-alist) and also another hash table that maps tags to previously computed results of org-ql--expand-tag-hierarchy. As such, these are now the results I get after running the benchmarks for the small and large org-files with group definitions and tag hierarchy enabled:

  | Form                       | x faster than next | Total runtime | # of GCs | Total GC runtime |
  |----------------------------+--------------------+---------------+----------+------------------|
  | without group-tags support | 1.02               |      5.001042 |        1 |         0.115724 |
  | with group-tags support    | slowest            |      5.095777 |        1 |         0.114706 |
  | Form                       | x faster than next | Total runtime | # of GCs | Total GC runtime |
  |----------------------------+--------------------+---------------+----------+------------------|
  | without group-tags support | 1.01               |    559.745689 |        6 |         0.752902 |
  | with group-tags support    | slowest            |    565.241730 |        8 |         1.087575 |

In short, now the regression is limited to around 2% instead of ~19%.

Regarding issues (b) and (c), in order to avoid adding yet another cache for local and inherited group tags keys on buffer position, this time I augmented the list of values that the per buffer hash tables of org-ql-tags-cache hold. Particularly, instead of (inherited-tags local-tags), they now hold (inherited-tags local-tags inherited-group-tags local-group-tags). Also, org-ql--tags-at now returns these four distinct lists, where the first two are actually the ones returned before introducing this feature. As such, tags-local and tags-inherited continue to behave as before (the unit tests I included also reflect that) while the implementation of tags and tags-regexp had to be modified slightly to simply merge those pairs (i.e., (append inherited-tags inherited-group-tags) and (append local-tags local-group-tags)) as needed (essentially, depending on whether org-group-tags is nil or not).

As mentioned above, I placed all of these changes on a different branch mainly because I chose to modify the structure of org-ql-tags-cache, which is part of the public API. For the record, this change does not seem to conflict with the unit tests I added nor the existing ones on tags. If the changes I made on that branch (cached-tag-hierarchy-support on my fork) are ok with you, I can go ahead and merge them here.

@alphapapa alphapapa force-pushed the master branch 3 times, most recently from ae83de1 to 890c247 Compare November 24, 2020 09:59
@alphapapa
Copy link
Owner

Hi again,

Thanks for your work on this. It sounds like you've improved the performance a lot over the initial implementation.

Would you link the other branch you're talking about? I'd prefer to avoid adding another global cache if possible, but it might be worth it.

Regarding the (tags) predicate: I'm not necessarily opposed to having this feature be part of it, but considering how much it changes the code to add a new cache and associated logic, it might be best to first expose it via a new (group-tags) predicate. Then we could consider integrating it with (tags) after it's had some time to be tested in the wild. I recently added a new way to easily define predicates; please see https://github.com/alphapapa/org-ql/blob/master/examples/defpred.org.

What do you think? Thanks.

@alphapapa alphapapa linked an issue Nov 25, 2020 that may be closed by this pull request
@maxchaos
Copy link
Author

You are welcome.

Would you link the other branch you're talking about? I'd prefer to avoid adding another global cache if possible, but it might be worth it.

This is the branch with the discussed implementation. I just merged you master on it so it should be easy to compare the changes.

Regarding the (tags) predicate: I'm not necessarily opposed to having this feature be part of it, but considering how much it changes the code to add a new cache and associated logic, it might be best to first expose it via a new (group-tags) predicate. Then we could consider integrating it with (tags) after it's had some time to be tested in the wild. I recently added a new way to easily define predicates; please see https://github.com/alphapapa/org-ql/blob/master/examples/defpred.org.

Sure, sound very reasonable to me. I will take a look at the new mechanism and define a distinct predicate then. Btw, the predicates that I modified in the discussed implementation are actually (tags) and (tags-regexp), since I feel that regular expression searches on group tags also are pretty convenient. Should I define a distinct tag for regular expression queries as well?

@alphapapa
Copy link
Owner

This is the branch with the discussed implementation. I just merged you master on it so it should be easy to compare the changes.

Thanks.

Regarding the (tags) predicate: I'm not necessarily opposed to having this feature be part of it, but considering how much it changes the code to add a new cache and associated logic, it might be best to first expose it via a new (group-tags) predicate. Then we could consider integrating it with (tags) after it's had some time to be tested in the wild. I recently added a new way to easily define predicates; please see https://github.com/alphapapa/org-ql/blob/master/examples/defpred.org.

Sure, sound very reasonable to me. I will take a look at the new mechanism and define a distinct predicate then. Btw, the predicates that I modified in the discussed implementation are actually (tags) and (tags-regexp), since I feel that regular expression searches on group tags also are pretty convenient. Should I define a distinct tag for regular expression queries as well?

Yes, I think we should mirror them, so: (group-tags) and (group-tags-regexp group-tags*).

The code in the new branch looks fairly straightforward, but it is a lot of new code, so I'd like to take a little time to study it. Please let me know when you push the new predicates and I'll look at it again and test it.

Thanks.

@alphapapa
Copy link
Owner

BTW, please rebase on master again as I pushed a couple of important changes, including updates to the lint/test suite.

@maxchaos
Copy link
Author

Hi. I have defined distinct predicates for (tags), (tags-regexp) and (tags-all) and I also added unit tests for those (of course, I have reverted the behaviour of the original tags). Feel free to take a look when you have the time. Also, I have merged the latest version on your master, so it should be up to date.

@alphapapa
Copy link
Owner

Thanks. IIUC we've basically settled on going in the direction of the branch cached-tags-hierarchy-support, so why don't you go ahead and merge it into this PR's branch so we can discuss its changes here.

@maxchaos
Copy link
Author

Done.

@alphapapa
Copy link
Owner

Thanks for your patience. I'll try to look at this again soon.

@alphapapa alphapapa self-assigned this Jun 17, 2021
@alphapapa alphapapa modified the milestones: 0.6, 0.7 Jun 18, 2021
@alphapapa
Copy link
Owner

Since it's been a long time since the last stable release, and some important changes on master have been waiting for one, I'm moving this to 0.7.

@maxchaos
Copy link
Author

No problem. Let me know if you need me to help with anything.

@alphapapa alphapapa force-pushed the master branch 2 times, most recently from 4f5fbc4 to d0acc8c Compare May 30, 2022 16:27
@alphapapa alphapapa modified the milestones: 0.7, 0.8 Mar 10, 2023
@maxchaos maxchaos force-pushed the tag-hierarchy-support branch from 477e9b9 to f300d04 Compare April 9, 2023 07:55
@alphapapa alphapapa removed this from the 0.8 milestone Dec 16, 2023
@alphapapa alphapapa force-pushed the master branch 2 times, most recently from 059b10c to 77b4c2b Compare December 16, 2023 13:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tag hierarchy support
2 participants