Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add filter_map, dedup and some variants to the stdlib #2120

Merged
merged 4 commits into from
Dec 20, 2024
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -21,9 +21,9 @@ warning: plain functions as contracts are deprecated
= wrap this function using one of the constructors in `std.contract` instead, like `std.contract.from_validator` or `std.contract.custom`

warning: plain functions as contracts are deprecated
┌─ <stdlib/std.ncl>:1549:9
┌─ <stdlib/std.ncl>:1724:9
1549 │ %contract/apply% contract (%label/push_diag% label) value,
1724 │ %contract/apply% contract (%label/push_diag% label) value,
│ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ applied to this term
┌─ [INPUTS_PATH]/errors/subcontract_nested_custom_diagnostics.ncl:3:21
Expand Down
177 changes: 176 additions & 1 deletion core/stdlib/std.ncl
Original file line number Diff line number Diff line change
Expand Up @@ -917,6 +917,181 @@
xs
|> std.array.length
|> std.array.generate (fun i => f i (std.array.at i xs)),

filter_map
: forall a b. (a -> [| 'Some b, 'None |]) -> Array a -> Array b
| doc m%"
Applies a function to every element in the given array, filtering out
`'None` results. `filter_map` combines `std.array.map` and
`std.array.filter` in a single pass.

# Examples

```nickel
["1", "hello", "2", "world"]
|> std.array.filter_map (fun x =>
if std.string.is_match "^[+-]?\\d+$" x then
'Some (std.string.to_number x)
else
'None
)
# => [ 1, 2 ]
```
"%
= fun f array =>
fold_left
(fun acc x =>
f x
|> match {
'Some y => acc @ [y],
'None => acc,
}
)
[]
array,

dedup
: Array Dyn -> Array Dyn
| doc m%"
Removes duplicates from an array.

# Performance

This function relies on equality and has a quadratic complexity in the
length of the array. It might thus be slow for large arrays or if
called repeatedly. Prefer `std.array.sort_dedup` or
`std.array.hash_dedup` if you have efficiency concerns.

# Examples

```nickel multiline
std.array.dedup [ 4, 2, 1, 3, 5, 2, 1, 4 ]
# => [ 4, 2, 1, 3, 5 ]

std.array.dedup [ "hello", "world", "hello" ]
# => [ "hello", "world" ]
```
"%
= fun array =>
let length = %array/length% array in

fold_left
(fun acc x =>
if elem x acc then
acc
else
acc @ [x]
)
[]
array,

sort_dedup
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just brainstorming here, but what if we had dedup_sorted : forall a. -> Array a -> Array a instead (which would do what the inner go function does, basically)? Then you'd use it like xs |> sort cmp |> dedup_sorted, but it feels a bit more composable than sort_dedup because if you know that your array is already sorted then you don't need to sort it again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, we can factor out dedup_sorted. I would still keep sort_dedup as a handy alias, though.

: forall a. (a -> a -> [| 'Lesser, 'Equal, 'Greater |]) -> Array a -> Array a
| doc m%"
Sorts an array based on the provided comparison operator and removes
duplicates.

# Performance

As opposed to `std.array.dedup`, this function has a better time
complexity (`O(n*log(n))` where `n` is the size of the array), which
should improve performance especially on large arrays. If you need to
preserve the original order of the array, see `std.array.hash_dedup`.

# Examples

```nickel multiline
std.array.sort_dedup std.number.compare [ 4, 2, 1, 3, 5, 2, 1, 4 ]
# => [ 1, 2, 3, 4, 5 ]

std.array.sort_dedup std.string.compare [ "world", "hello", "world" ]
# => [ "hello", "world" ]
```
"%
= fun cmp array =>
let sorted = sort cmp array in
let length = %array/length% sorted in

let rec go = fun acc n =>
if n == length then
acc
else
let x = %array/at% sorted n in

# we would prefer to use the primitive equality here instead of
# relying on `cmp` returning `'Equal`. The latter 1. requires that
# the user-provided function is correct and 2. is more costly.
#
# Unfortunately, that would break parametricity and require to
# substitute `a` for `Dyn` in the type of `sort_dedup`, which makes
# it more annoying to use with otherwise statically typed comparison
# function. So for now, we rely on the comparison function.
let acc =
if n == 0
|| cmp (%array/at% sorted (n - 1)) x != 'Equal then
acc @ [x]
else
acc
in

go acc (n + 1)
in

go [] 0,

hash_dedup
: forall a. (a -> String) -> Array a -> Array a
| doc m%"
Removes duplicates from an array efficiently.

# Hash function

`hash_dedup` uses a "hash" function mapping array elements to strings.
This function is used to efficiently check if an element has already
been seen before (linear time amortized in the size of array versus
quadratic for `std.array.dedup`).

The hash function must separate distinct elements with very high
probability. If two elements have the same hash, they will be considered
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the mention of probability misleading, especially when talking about a "hash" function: in a hash table, having a low collision probability only matters for performance, but here you absolutely need two distinct elements to have different "hash"es or else you'll get the wrong answer.

I wonder if it's worth baking in a notion of hashes and sets directly into nickel? It seems like almost as "core" a notion as equality, and we've had a feature request for sets for a while now...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, it's a logic error to not separate unequal keys. In some sense the hash function is the notion of equality at hand. Maybe I shouldn't call this a hash function at all?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though I don't have a better name in mind that would be discoverable as a function name (mathematically what we want is an injection T -> String), but dedup_inject doesn't sound so good.

For the motivation, this function is taken from a real use-case: this is how we check for duplicates uid in our build machine profiles. Also, it's the only way to deduplicate an array efficiently (in this context meaning O(nlog(n)) or less) without re-ordering it

equal and thus deduplicated.

Although we call it a hash function, it doesn't have to produce good
hash values: the output of this function is used internally to generate
record field names, which are themselves properly hashed again by the
Nickel runtime.

# Examples

```nickel multiline
std.array.hash_dedup std.to_string [ 4, 2, 1, 3, 5, 2, 1, 4 ]
# => [ 4, 2, 1, 3, 5 ]

std.array.hash_dedup std.function.id [ "world", "hello", "world" ]
# => [ "world", "hello" ]
```
"%
= fun hash array =>
let length = %array/length% array in

let rec go = fun acc n =>
if n == length then
acc
else
let x = %array/at% array n in
let hashed = hash x in

if %record/has_field% hashed acc.seen then
go acc (n + 1)
else
let acc = {
seen = %record/insert% hashed acc.seen null,
result = acc.result @ [x],
}
in
go acc (n + 1)
in

(go { seen = {}, result = [] } 0).result,
},

contract = {
Expand Down Expand Up @@ -4348,7 +4523,7 @@
%contract/custom% (fun _label _value => 'Error { message = msg }),

fail_with
| String -> Dyn
| String -> (forall a. a)
| doc m%"
Abort the evaluation with the given message. The error will be reported as
a contract violation, as `fail_with` uses `std.FailWith` under the hood.
Expand Down
4 changes: 2 additions & 2 deletions core/tests/integration/main.rs
Original file line number Diff line number Diff line change
Expand Up @@ -27,8 +27,8 @@ fn check_annotated_nickel_file(path: &str) {
let test: TestCase<Test> =
read_annotated_test_case(path).expect("Failed to parse annotated program");

// By default, cargo runs tests with a 2MB stack, which we can overflow in
// debug mode. To avoid this we run the tests with an increased stack size.
// By default, cargo runs tests with a 2MB stack, which is easy to overflow. To avoid this we
// run the tests with an increased stack size.
const STACK_SIZE: usize = 4 * 1024 * 1024;
let path = String::from(project_root().join(path).to_string_lossy());

Expand Down
15 changes: 14 additions & 1 deletion lsp/nls/src/main.rs
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
use std::{fs, io, path::PathBuf};
use std::{fs, io, path::PathBuf, thread};

use anyhow::Result;

Expand Down Expand Up @@ -29,6 +29,10 @@ mod usage;
mod utils;
mod world;

// Default stack size is 1MB on Windows, which is too small. We make it 8MB, which is the default
// size on Linux.
const STACK_SIZE: usize = 8 * 1024 * 1024;

use crate::{config::LspConfig, trace::Trace};

#[derive(clap::Parser, Debug)]
Expand Down Expand Up @@ -59,6 +63,15 @@ struct Options {
}

fn main() -> Result<()> {
let handle = thread::Builder::new()
.stack_size(STACK_SIZE)
.spawn(run)
.unwrap();

handle.join().unwrap()
}

fn run() -> Result<()> {
use clap::Parser;

env_logger::init();
Expand Down
1 change: 1 addition & 0 deletions lsp/nls/src/server.rs
Original file line number Diff line number Diff line change
Expand Up @@ -96,6 +96,7 @@ impl Server {
.send(Message::Response(response))
.unwrap();
}

pub(crate) fn notify(&mut self, notification: Notification) {
trace!("Sending notification: {:#?}", notification);
self.connection
Expand Down
Loading