Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add special grouping functions #255

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

Martoon-00
Copy link
Member

Description

I often witnessed the need in [(a, b)] -> [(a, [b])] grouping function, so adding it here since we are working on similar things in-parallel.

Related issues(s)

✓ Checklist for your Pull Request

Ideally a PR has all of the checkmarks set.

If something in this list is irrelevant to your PR, you should still set this
checkmark indicating that you are sure it is dealt with (be that by irrelevance).

  • I made sure my PR addresses a single concern, or multiple concerns which
    are inextricably linked. Otherwise I should open multiple PR's.
  • If your PR fixes/relates to an open issue then the description should
    reference this issue. See also auto linking on
    github
    .

Related changes (conditional)

  • Tests

    • If I added new functionality, I added tests covering it.
    • If I fixed a bug, I added a regression test to prevent the bug from
      silently reappearing again.
  • Documentation

    I checked whether I should update the docs and did so if necessary:

  • Record your changes

    • I added an entry to the changelog if my changes are visible to the users
      and
    • provided a migration guide for breaking changes if possible

Stylistic guide (mandatory)

  • My commit history is clean (only contains changes relating to my
    issue/pull request and no reverted-my-earlier-commit changes) and commit
    messages start with identifiers of related issues in square brackets.

    Example: [#42] Short commit description

    If necessary both of these can be achieved even after the commits have been
    made/pushed using rebase and squash.

@Martoon-00 Martoon-00 added the type:feature Something new is added. label Apr 12, 2022
groupByKey toKey toVal l =
map
(\ne -> (toKey (NE.head ne), NE.map toVal ne))
(NE.groupWith toKey l)
Copy link

@treeowl treeowl Apr 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there isn't enough inlining into and around this function, then it could leak keys into the values. I think it's best to do something like this instead:

groupByKeyOn :: (Eq k', Foldable f) => (k -> k') -> (a -> (k, v)) -> f a -> [(k, NonEmpty v)]

I can try to write something up.

Copy link

@treeowl treeowl Apr 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you go:

groupByKeyOn :: (Eq k', Foldable f) => (k -> k') -> (a -> (k, v)) -> f a -> [(k, NonEmpty v)]
groupByKeyOn kt split = start . Foldable.toList
  where
    start [] = []
    start (a : as)
      | (k, v) <- split a
      , let (ys, zs) = go (kt k) as
      = (k, v :| ys) : zs

    go _ [] = ([], [])
    go ko' (a : as)
      | (kn, v) <- split a
      , let kn' = kt kn
      = if ko' == kn'
         then let (vs, ws) = go ko' as
                 in (v : vs, ws)
         else let (vs, ws) = go kn' as
                 in ([], (kn, v :| vs) : ws)

groupByKey :: (Eq k, Foldable f) => (a -> (k, v)) -> f a -> [(k, NonEmpty v)]
groupByKey = groupByKeyOn id

groupByFst :: (Eq a, Foldable f) => f (a, b) -> [(a, NonEmpty b)]
groupByFst = groupByKey id

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this 👍

If there isn't enough inlining into and around this function, then it could leak keys into the values.

Could you elaborate here?

Your implementation apparently seems more performant though.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My claim may not have been quite precise. You wrote

groupByKey toKey toVal l =
  map
    (\ne -> (toKey (NE.head ne), NE.map toVal ne))
    (NE.groupWith toKey l)

Suppose we're just looking at pairs (if toKey is expensive, then your version also has the downside of calculating it twice for some values). The groupWith call gives you something like

[ (k1, v11) :| [(k1', v12), ...]
, (k2, v21) :| [(k2', v22), ...]
, ...]

The prime marks signify values that are == but not pointer equal (these are common).

Let's look at just the first element of the result. The function mapping over it gives us

let ne = (k1, v11) :| [(k1', v12), ...]
in (toKey (NE.head ne), NE.map toVal ne)

The first component will start out a thunk, but fortunately it will probably be a selector thunk, so the garbage collector should be able to clean it up.

The second component will be a regular thunk, which will incrementally walk the list dropping keys. Until that's walked, however, those k1', k2', etc., values will be retained. Even where the keys are being dropped, we don't do it directly but rather create selector thunks for the GC to clean up. This is only true with optimization (which is fine), because snd needs to inline into map to make a selector thunk rather than a function application thunk.

So the risk of a serious memory leak is probably low, but it makes a lot of sense that things are kind of slow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that sounds reasonable, thanks!

And you propose the function with extra (k -> k') indirection because you know the use cases when this (k -> k') argument cannot be filled with just id, or because that may help GC to cleanup the keys earlier in some way?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just that it's often useful to do things like that. I could easily imagine sticking something like Data.Text.toLower in there. It's also quite possible to use a more general k -> k -> Ordering for a groupBy-like thing. Could add that too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair, toLower is a good example.

And the version with (k -> k -> Ordering) (rather with (k -> k -> Bool)?) seems worthy to have, I agree.

Gonna do the change.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Bool. Sorry.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again! I changed the impl according to your suggestion, could you review the result?

One minor difference: I went with universum's Container instead of Foldable, because I remember some tickets in this repo dedicated to switching from Foldable.

Problem: it is an often need to have a function of
`[(a, b)] -> [(a, [b])]` signature that would group elements by the
first element of the pair, and return the element we grouped by and the
associated second elements of the pairs.

Solution: add this function, and a more generic one that does not
necessarily work with pairs.

Also, we in-parallel work on making `group` function return `NonEmpty`,
not lists, so I implement it via `NonEmpty` here too.

Special thanks to @treeowl for providing elaborate implementations and
advanced variations for special cases.
Copy link

@treeowl treeowl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation for groupByKeyBy should specify the requirements for the comparison function. In particular, that should always define a (computable) equivalence relation. Additionally, I'd like to include groupByKeyOn—it's less general, but it avoids those explicit preconditions by leaning on Eq.

@Martoon-00
Copy link
Member Author

Additionally, I'd like to include groupByKeyOn—it's less general, but it avoids those explicit preconditions by leaning on Eq.

I also like how *On version would go without preconditions on the provided function.

My thought on not including it was that other packages seem to avoid the addition of *On version unless it does something special compared to passing mere on call (i.e. there is no groupOn f = groupBy ((==) 'on' f)).

Other examples: sortOn which has slightly different performance characteristics and non-existing Data.List.NonEmpty.groupOn.

So I think if we go with adding groupByKeyOn, we should be consistent and add mere groupOn too, but this is likely the work for a separate issue.

What do you think?

@treeowl
Copy link

treeowl commented Jun 1, 2022

I would like to have it for all the things. Sorting is weird because of sortOn f vs. sortBy (compare `on` f). I don't know if any of the other common functions have both options. I don't think these ones do.

Problem: previously we added `groupByKeyOn`, and generally we decided
that adding `*On` versions of functions would be convenient.

Solution: add `groupOn` for consistency.
@Martoon-00
Copy link
Member Author

Okay, added On versions for grouping, and sorting is out of scope of this issue.

@treeowl Requesting the final review 👀

-- one as the representer of the equivalence class.
--
-- >>> groupByKeyBy (<=) id [(1, 10), (2, 11), (0, 12), (1, 13), (3, 14)]
-- [(1,10 :| [11]),(0,12 :| [13,14])]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation indicates that the function must define an equivalence relation, but (<=) does not do so! This example is busted (i.e., its result is unspecified).

-- | Variation of 'groupByKey' that matches mapped keys on equality.
--
-- >>> groupByKeyOn toLower id [("A", 1), ("a", 2), ("b", 3)]
-- [("A",1 :| [2]),("b",3 :| [])]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good example, but there should also be one or more with non-id projection (probably not the right word) functions.

-- for the key to group by and for the value.
--
-- >>> groupByKey (\x -> (x `mod` 5, x)) [1, 6, 7, 2, 12, 11]
-- [(1,1 :| [6]),(2,7 :| [2,12]),(1,11 :| [])]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs should specify groupByKey = groupByKeyOn id.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either the docs here or the docs for groupByFst should explain the relationship between them.

@dcastro dcastro added this to the v.1.9.0 milestone Jul 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:feature Something new is added.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants