-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add special grouping functions #255
base: master
Are you sure you want to change the base?
Conversation
860b88c
to
e7e00a6
Compare
src/Universum/List/Safe.hs
Outdated
groupByKey toKey toVal l = | ||
map | ||
(\ne -> (toKey (NE.head ne), NE.map toVal ne)) | ||
(NE.groupWith toKey l) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there isn't enough inlining into and around this function, then it could leak keys into the values. I think it's best to do something like this instead:
groupByKeyOn :: (Eq k', Foldable f) => (k -> k') -> (a -> (k, v)) -> f a -> [(k, NonEmpty v)]
I can try to write something up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you go:
groupByKeyOn :: (Eq k', Foldable f) => (k -> k') -> (a -> (k, v)) -> f a -> [(k, NonEmpty v)]
groupByKeyOn kt split = start . Foldable.toList
where
start [] = []
start (a : as)
| (k, v) <- split a
, let (ys, zs) = go (kt k) as
= (k, v :| ys) : zs
go _ [] = ([], [])
go ko' (a : as)
| (kn, v) <- split a
, let kn' = kt kn
= if ko' == kn'
then let (vs, ws) = go ko' as
in (v : vs, ws)
else let (vs, ws) = go kn' as
in ([], (kn, v :| vs) : ws)
groupByKey :: (Eq k, Foldable f) => (a -> (k, v)) -> f a -> [(k, NonEmpty v)]
groupByKey = groupByKeyOn id
groupByFst :: (Eq a, Foldable f) => f (a, b) -> [(a, NonEmpty b)]
groupByFst = groupByKey id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this 👍
If there isn't enough inlining into and around this function, then it could leak keys into the values.
Could you elaborate here?
Your implementation apparently seems more performant though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My claim may not have been quite precise. You wrote
groupByKey toKey toVal l =
map
(\ne -> (toKey (NE.head ne), NE.map toVal ne))
(NE.groupWith toKey l)
Suppose we're just looking at pairs (if toKey
is expensive, then your version also has the downside of calculating it twice for some values). The groupWith
call gives you something like
[ (k1, v11) :| [(k1', v12), ...]
, (k2, v21) :| [(k2', v22), ...]
, ...]
The prime marks signify values that are ==
but not pointer equal (these are common).
Let's look at just the first element of the result. The function mapping over it gives us
let ne = (k1, v11) :| [(k1', v12), ...]
in (toKey (NE.head ne), NE.map toVal ne)
The first component will start out a thunk, but fortunately it will probably be a selector thunk, so the garbage collector should be able to clean it up.
The second component will be a regular thunk, which will incrementally walk the list dropping keys. Until that's walked, however, those k1'
, k2'
, etc., values will be retained. Even where the keys are being dropped, we don't do it directly but rather create selector thunks for the GC to clean up. This is only true with optimization (which is fine), because snd
needs to inline into map
to make a selector thunk rather than a function application thunk.
So the risk of a serious memory leak is probably low, but it makes a lot of sense that things are kind of slow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that sounds reasonable, thanks!
And you propose the function with extra (k -> k')
indirection because you know the use cases when this (k -> k')
argument cannot be filled with just id
, or because that may help GC to cleanup the keys earlier in some way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just that it's often useful to do things like that. I could easily imagine sticking something like Data.Text.toLower
in there. It's also quite possible to use a more general k -> k -> Ordering
for a groupBy
-like thing. Could add that too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair, toLower
is a good example.
And the version with (k -> k -> Ordering)
(rather with (k -> k -> Bool)
?) seems worthy to have, I agree.
Gonna do the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Bool
. Sorry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again! I changed the impl according to your suggestion, could you review the result?
One minor difference: I went with universum's Container
instead of Foldable
, because I remember some tickets in this repo dedicated to switching from Foldable
.
Problem: it is an often need to have a function of `[(a, b)] -> [(a, [b])]` signature that would group elements by the first element of the pair, and return the element we grouped by and the associated second elements of the pairs. Solution: add this function, and a more generic one that does not necessarily work with pairs. Also, we in-parallel work on making `group` function return `NonEmpty`, not lists, so I implement it via `NonEmpty` here too. Special thanks to @treeowl for providing elaborate implementations and advanced variations for special cases.
e7e00a6
to
6a3b987
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation for groupByKeyBy
should specify the requirements for the comparison function. In particular, that should always define a (computable) equivalence relation. Additionally, I'd like to include groupByKeyOn
—it's less general, but it avoids those explicit preconditions by leaning on Eq
.
I also like how My thought on not including it was that other packages seem to avoid the addition of Other examples: So I think if we go with adding What do you think? |
I would like to have it for all the things. Sorting is weird because of |
Problem: previously we added `groupByKeyOn`, and generally we decided that adding `*On` versions of functions would be convenient. Solution: add `groupOn` for consistency.
Okay, added @treeowl Requesting the final review 👀 |
-- one as the representer of the equivalence class. | ||
-- | ||
-- >>> groupByKeyBy (<=) id [(1, 10), (2, 11), (0, 12), (1, 13), (3, 14)] | ||
-- [(1,10 :| [11]),(0,12 :| [13,14])] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation indicates that the function must define an equivalence relation, but (<=)
does not do so! This example is busted (i.e., its result is unspecified).
-- | Variation of 'groupByKey' that matches mapped keys on equality. | ||
-- | ||
-- >>> groupByKeyOn toLower id [("A", 1), ("a", 2), ("b", 3)] | ||
-- [("A",1 :| [2]),("b",3 :| [])] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good example, but there should also be one or more with non-id
projection (probably not the right word) functions.
-- for the key to group by and for the value. | ||
-- | ||
-- >>> groupByKey (\x -> (x `mod` 5, x)) [1, 6, 7, 2, 12, 11] | ||
-- [(1,1 :| [6]),(2,7 :| [2,12]),(1,11 :| [])] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs should specify groupByKey = groupByKeyOn id
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either the docs here or the docs for groupByFst
should explain the relationship between them.
Description
I often witnessed the need in
[(a, b)] -> [(a, [b])]
grouping function, so adding it here since we are working on similar things in-parallel.Related issues(s)
✓ Checklist for your Pull Request
Ideally a PR has all of the checkmarks set.
If something in this list is irrelevant to your PR, you should still set this
checkmark indicating that you are sure it is dealt with (be that by irrelevance).
are inextricably linked. Otherwise I should open multiple PR's.
reference this issue. See also auto linking on
github.
Related changes (conditional)
Tests
silently reappearing again.
Documentation
I checked whether I should update the docs and did so if necessary:
Record your changes
and
Stylistic guide (mandatory)
My commit history is clean (only contains changes relating to my
issue/pull request and no reverted-my-earlier-commit changes) and commit
messages start with identifiers of related issues in square brackets.
Example:
[#42] Short commit description
If necessary both of these can be achieved even after the commits have been
made/pushed using rebase and squash.