Add special grouping functions #255

Martoon-00 · 2022-04-12T16:26:14Z

Description

I often witnessed the need in [(a, b)] -> [(a, [b])] grouping function, so adding it here since we are working on similar things in-parallel.

Related issues(s)

✓ Checklist for your Pull Request

Ideally a PR has all of the checkmarks set.

If something in this list is irrelevant to your PR, you should still set this
checkmark indicating that you are sure it is dealt with (be that by irrelevance).

I made sure my PR addresses a single concern, or multiple concerns which
are inextricably linked. Otherwise I should open multiple PR's.
If your PR fixes/relates to an open issue then the description should
reference this issue. See also auto linking on
github.

Related changes (conditional)

Tests
- If I added new functionality, I added tests covering it.
- If I fixed a bug, I added a regression test to prevent the bug from
  silently reappearing again.
Documentation

I checked whether I should update the docs and did so if necessary:
- README
- Haddock
Record your changes
- I added an entry to the changelog if my changes are visible to the users
  and
- provided a migration guide for breaking changes if possible

Stylistic guide (mandatory)

My commit history is clean (only contains changes relating to my
issue/pull request and no reverted-my-earlier-commit changes) and commit
messages start with identifiers of related issues in square brackets.

Example: [#42] Short commit description

If necessary both of these can be achieved even after the commits have been
made/pushed using rebase and squash.

treeowl · 2022-04-12T17:26:08Z

src/Universum/List/Safe.hs

+groupByKey toKey toVal l =
+  map
+    (\ne -> (toKey (NE.head ne), NE.map toVal ne))
+    (NE.groupWith toKey l)


If there isn't enough inlining into and around this function, then it could leak keys into the values. I think it's best to do something like this instead:

groupByKeyOn :: (Eq k', Foldable f) => (k -> k') -> (a -> (k, v)) -> f a -> [(k, NonEmpty v)]

I can try to write something up.

Here you go:

groupByKeyOn :: (Eq k', Foldable f) => (k -> k') -> (a -> (k, v)) -> f a -> [(k, NonEmpty v)] groupByKeyOn kt split = start . Foldable.toList where start [] = [] start (a : as) | (k, v) <- split a , let (ys, zs) = go (kt k) as = (k, v :| ys) : zs go _ [] = ([], []) go ko' (a : as) | (kn, v) <- split a , let kn' = kt kn = if ko' == kn' then let (vs, ws) = go ko' as in (v : vs, ws) else let (vs, ws) = go kn' as in ([], (kn, v :| vs) : ws) groupByKey :: (Eq k, Foldable f) => (a -> (k, v)) -> f a -> [(k, NonEmpty v)] groupByKey = groupByKeyOn id groupByFst :: (Eq a, Foldable f) => f (a, b) -> [(a, NonEmpty b)] groupByFst = groupByKey id

Thanks for looking into this 👍

If there isn't enough inlining into and around this function, then it could leak keys into the values.

Could you elaborate here?

Your implementation apparently seems more performant though.

My claim may not have been quite precise. You wrote

groupByKey toKey toVal l = map (\ne -> (toKey (NE.head ne), NE.map toVal ne)) (NE.groupWith toKey l)

Suppose we're just looking at pairs (if toKey is expensive, then your version also has the downside of calculating it twice for some values). The groupWith call gives you something like

[ (k1, v11) :| [(k1', v12), ...] , (k2, v21) :| [(k2', v22), ...] , ...]

The prime marks signify values that are == but not pointer equal (these are common).

Let's look at just the first element of the result. The function mapping over it gives us

let ne = (k1, v11) :| [(k1', v12), ...] in (toKey (NE.head ne), NE.map toVal ne)

The first component will start out a thunk, but fortunately it will probably be a selector thunk, so the garbage collector should be able to clean it up.

The second component will be a regular thunk, which will incrementally walk the list dropping keys. Until that's walked, however, those k1', k2', etc., values will be retained. Even where the keys are being dropped, we don't do it directly but rather create selector thunks for the GC to clean up. This is only true with optimization (which is fine), because snd needs to inline into map to make a selector thunk rather than a function application thunk.

So the risk of a serious memory leak is probably low, but it makes a lot of sense that things are kind of slow.

Yeah, that sounds reasonable, thanks!

And you propose the function with extra (k -> k') indirection because you know the use cases when this (k -> k') argument cannot be filled with just id, or because that may help GC to cleanup the keys earlier in some way?

Just that it's often useful to do things like that. I could easily imagine sticking something like Data.Text.toLower in there. It's also quite possible to use a more general k -> k -> Ordering for a groupBy-like thing. Could add that too.

Fair, toLower is a good example.

And the version with (k -> k -> Ordering) (rather with (k -> k -> Bool)?) seems worthy to have, I agree.

Gonna do the change.

Yes, Bool. Sorry.

Thanks again! I changed the impl according to your suggestion, could you review the result?

One minor difference: I went with universum's Container instead of Foldable, because I remember some tickets in this repo dedicated to switching from Foldable.

@treeowl

Problem: it is an often need to have a function of `[(a, b)] -> [(a, [b])]` signature that would group elements by the first element of the pair, and return the element we grouped by and the associated second elements of the pairs. Solution: add this function, and a more generic one that does not necessarily work with pairs. Also, we in-parallel work on making `group` function return `NonEmpty`, not lists, so I implement it via `NonEmpty` here too. Special thanks to @treeowl for providing elaborate implementations and advanced variations for special cases.

treeowl

The documentation for groupByKeyBy should specify the requirements for the comparison function. In particular, that should always define a (computable) equivalence relation. Additionally, I'd like to include groupByKeyOn—it's less general, but it avoids those explicit preconditions by leaning on Eq.

Martoon-00 · 2022-05-20T13:27:55Z

Additionally, I'd like to include groupByKeyOn—it's less general, but it avoids those explicit preconditions by leaning on Eq.

I also like how *On version would go without preconditions on the provided function.

My thought on not including it was that other packages seem to avoid the addition of *On version unless it does something special compared to passing mere on call (i.e. there is no groupOn f = groupBy ((==) 'on' f)).

Other examples: sortOn which has slightly different performance characteristics and non-existing Data.List.NonEmpty.groupOn.

So I think if we go with adding groupByKeyOn, we should be consistent and add mere groupOn too, but this is likely the work for a separate issue.

What do you think?

treeowl · 2022-06-01T08:53:36Z

I would like to have it for all the things. Sorting is weird because of sortOn f vs. sortBy (compare `on` f). I don't know if any of the other common functions have both options. I don't think these ones do.

Problem: previously we added `groupByKeyOn`, and generally we decided that adding `*On` versions of functions would be convenient. Solution: add `groupOn` for consistency.

Martoon-00 · 2022-06-22T22:11:23Z

Okay, added On versions for grouping, and sorting is out of scope of this issue.

@treeowl Requesting the final review 👀

treeowl · 2022-07-05T06:25:55Z

src/Universum/Container/Utils.hs

+-- one as the representer of the equivalence class.
+--
+-- >>> groupByKeyBy (<=) id [(1, 10), (2, 11), (0, 12), (1, 13), (3, 14)]
+-- [(1,10 :| [11]),(0,12 :| [13,14])]


The documentation indicates that the function must define an equivalence relation, but (<=) does not do so! This example is busted (i.e., its result is unspecified).

treeowl · 2022-07-05T06:28:06Z

src/Universum/Container/Utils.hs

+-- | Variation of 'groupByKey' that matches mapped keys on equality.
+--
+-- >>> groupByKeyOn toLower id [("A", 1), ("a", 2), ("b", 3)]
+-- [("A",1 :| [2]),("b",3 :| [])]


This is a good example, but there should also be one or more with non-id projection (probably not the right word) functions.

treeowl · 2022-07-05T06:29:24Z

src/Universum/Container/Utils.hs

+-- for the key to group by and for the value.
+--
+-- >>> groupByKey (\x -> (x `mod` 5, x)) [1, 6, 7, 2, 12, 11]
+-- [(1,1 :| [6]),(2,7 :| [2,12]),(1,11 :| [])]


Docs should specify groupByKey = groupByKeyOn id.

Either the docs here or the docs for groupByFst should explain the relationship between them.

Martoon-00 added the type:feature Something new is added. label Apr 12, 2022

Martoon-00 force-pushed the martoon/advanced-grouping branch from 860b88c to e7e00a6 Compare April 12, 2022 16:27

treeowl reviewed Apr 12, 2022

View reviewed changes

Martoon-00 force-pushed the martoon/advanced-grouping branch from e7e00a6 to 6a3b987 Compare April 23, 2022 13:41

treeowl suggested changes May 18, 2022

View reviewed changes

fixup! Add special grouping functions

383bf54

Martoon-00 added 2 commits June 23, 2022 01:00

fixup! Add special grouping functions

63266fa

Add groupOn

084e3e7

Problem: previously we added `groupByKeyOn`, and generally we decided that adding `*On` versions of functions would be convenient. Solution: add `groupOn` for consistency.

treeowl suggested changes Jul 5, 2022

View reviewed changes

dcastro added this to the v.1.9.0 milestone Jul 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add special grouping functions #255

Add special grouping functions #255

Martoon-00 commented Apr 12, 2022

treeowl Apr 12, 2022 •

edited

Loading

treeowl Apr 12, 2022 •

edited

Loading

Martoon-00 Apr 13, 2022

treeowl Apr 13, 2022

Martoon-00 Apr 14, 2022

treeowl Apr 14, 2022

Martoon-00 Apr 14, 2022

treeowl Apr 14, 2022

Martoon-00 Apr 23, 2022

treeowl left a comment

Martoon-00 commented May 20, 2022

treeowl commented Jun 1, 2022

Martoon-00 commented Jun 22, 2022

treeowl Jul 5, 2022

treeowl Jul 5, 2022

treeowl Jul 5, 2022

treeowl Jul 5, 2022

Add special grouping functions #255

Are you sure you want to change the base?

Add special grouping functions #255

Conversation

Martoon-00 commented Apr 12, 2022

Description

Related issues(s)

✓ Checklist for your Pull Request

Related changes (conditional)

Stylistic guide (mandatory)

treeowl Apr 12, 2022 • edited Loading

Choose a reason for hiding this comment

treeowl Apr 12, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

treeowl left a comment

Choose a reason for hiding this comment

Martoon-00 commented May 20, 2022

treeowl commented Jun 1, 2022

Martoon-00 commented Jun 22, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

treeowl Apr 12, 2022 •

edited

Loading

treeowl Apr 12, 2022 •

edited

Loading