Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: add documentation for the String.iterator API #3300

Merged
Merged
Show file tree
Hide file tree
Changes from 10 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
72 changes: 67 additions & 5 deletions src/Init/Data/String/Basic.lean
Original file line number Diff line number Diff line change
Expand Up @@ -290,73 +290,135 @@ where go (acc : String) (s : String) : List String → String
| a :: as => go (acc ++ s ++ a) s as
| [] => acc

/-- Iterator for `String`. That is, a `String` and a position in that string. -/
/-- Iterator over the characters (`Char`) of a `String`.

Typically created by `s.iter`, where `s` is a `String`.

An iterator is *valid* if the position `i` is *valid* for the string `s`, meaning `0 ≤ i ≤ s.endPos`
and `i` lies on a UTF8 byte boundary consistently with [std's `String.Pos.Valid`][valid]. If `i =
s.endPos`, the iterator is *at the end (of `s`)*.
AdrienChampion marked this conversation as resolved.
Show resolved Hide resolved

Most operations on iterators return arbitrary values if the iterator is not valid. The functions in
the `String.Iterator` API should prevent you from dealing with invalid iterators, with two
AdrienChampion marked this conversation as resolved.
Show resolved Hide resolved
exceptions:

- `Iterator.next iter` is invalid if `iter` is already at the end of the string (`iter.atEnd` is
true), and
- `Iterator.forward iter n`/`Iterator.nextn iter n` is invalid if `n` is strictly greater than the
number of remaining characters.

[valid]: https://leanprover-community.github.io/mathlib4_docs/Std/Data/String/Lemmas.html#String.Pos.Valid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't link to external generated doc HTML from docstrings - this will be too fragile. We need to have notation for stable doc links, but we don't have it yet, so for now we should just have the names here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[valid]: https://leanprover-community.github.io/mathlib4_docs/Std/Data/String/Lemmas.html#String.Pos.Valid

Copy link
Collaborator

@digama0 digama0 Feb 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we use the same URL as the docs#String.Pos.Valid linkifier, that is https://leanprover-community.github.io/mathlib4_docs/find/?pattern=String.Pos.Valid#doc , that is as good as we are going to get for the near term (it is designed to be a permalink), and removing the links does users a disservice. Moreover, if and when we get a stable doc link syntax, it won't work here anyway because this is a forward reference to a definition in another repository.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to have links to external domains and downstream projects in Lean - it makes everything too fragile. I totally get the concern here! But we should provide enough information for people to understand the validity of the iterator in these docstrings, not put it elsewhere, and I think we do that now. And this is largely your doing - thanks!

Copy link
Collaborator

@digama0 digama0 Feb 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, all doc-gen generated lean documentation is on the leanprover-community web site, for better or worse. If and when we start generating doc-gen for leanprover projects somewhere on the FRO website we can retarget these URLs if desired. (It will be much harder to re-add these links after removing them, compared to retargeting existing URLs.) This is a link from core to std, hardly "external". In fact, ideally users would not even need to know that half of the String API is in Init (in core) and the other half is in Std. Validity is an essential characterizing property of these types, and I think it is important to point to the definition itself by name. (If I could, I would put validity directly into the types themselves and save users the trouble of having to ever deal with invalid iterators. But this has its own issues.)

-/
structure Iterator where
/-- The string the iterator is for. -/
s : String
/-- The current position.

This position is not necessarily legal for the string, for instance if one keeps calling
`Iterator.next` when `Iterator.atEnd` is true. If the position is not a legal one, then the
current character is `(default : Char)`, similar to `String.get` on an illegal position. -/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This position is not necessarily legal for the string, for instance if one keeps calling
`Iterator.next` when `Iterator.atEnd` is true. If the position is not a legal one, then the
current character is `(default : Char)`, similar to `String.get` on an illegal position. -/
This position is not necessarily legal for the string, for instance if one keeps calling
`Iterator.next` when `Iterator.atEnd` is true. If the position is not valid, then the
current character is `(default : Char)`, similar to `String.get` on an invalid position. -/

The other docs for strings refer to positions as "valid" or "invalid" rather than "legal" or "illegal" - I think it's good to be consistent

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can see, it's still using the "legal" terminology...

Copy link
Collaborator

@digama0 digama0 Feb 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the wording here should discourage invalid positions a bit more strongly. There is no proof witness in the type asserting that the position is valid, but it should be unequivocally a logic bug to have such a misaligned iterator and all string functions should be allowed to have arbitrary behavior in such cases.

Suggested change
This position is not necessarily legal for the string, for instance if one keeps calling
`Iterator.next` when `Iterator.atEnd` is true. If the position is not a legal one, then the
current character is `(default : Char)`, similar to `String.get` on an illegal position. -/
An iterator is *valid* if the position `i` is valid for the string `s`. Most operations
on iterators return garbage if the iterator is not valid. If you use the functions in the API here,
you should not have to deal with invalid iterators, with the notable exception of `Iterator.next`,
which produces an invalid iterator if the iterator is already at the end of the string. -/

Copy link
Contributor Author

@AdrienChampion AdrienChampion Feb 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand what you're saying but I find this a bit tricky. I think one expects "position i is valid for string s" to mean that i is in-bounds, but then "".iter is invalid, and so is "1".iter.next although "1".iter.atEnd is false.

So you mean that i is valid for s if i is in-bounds or i is the successor of the index of the last character in s, right?

Copy link
Collaborator

@digama0 digama0 Feb 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A p : String.Pos is valid for s : String when 0 <= p <= s.endPos (aka p is in-bounds) and p lies on a UTF8 byte boundary. This is defined in std. (In fact all of the validity claims in this file are proved in std, so it might be a good idea to double check there to ensure you make the right claim.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there are two options of validity here: valid for character access, and valid for iterator comparison (which also allows atEnd).

Maybe the C++ documentation has some terminology for this distinction that we can copy.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In C++ a pointer is considered valid (not immediate UB) if it is within the allocation or one-past-the-end. I'd rather not borrow the one-past-the-end terminology though, it's very awkward and reinforces the idea that a pointer points to an element rather than a separation point "between characters". Std currently does not have any terminology for "valid and not at the end"; if necessary it would just say that as two things. String.get_of_valid is not stated in a way that directly uses String.Pos.Valid because it needs to know more specifically what is the list of characters on the right of the partition.

Copy link
Collaborator

@digama0 digama0 Feb 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The definition of string position validity should also be moved or copied to the String.Pos docs. Also both should forward reference the String.Pos.Valid / String.Iterator.Valid definitions in std.

Your last point is missing an is invalid:

-- `Iterator.forward iter n`/`Iterator.nextn iter n` if `n` is strictly greater than the number of
+- `Iterator.forward iter n`/`Iterator.nextn iter n` is invalid if `n` is strictly greater than the number of
   remaining characters.

Copy link
Contributor Author

@AdrienChampion AdrienChampion Feb 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Forgot to document the exceptions...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's no need to emphasize *at the end of the string* here - this is an ordinary and expected state for an iterator to have.

How about this minor edit:

An iterator is *valid* if its current position `i` is valid for the string `s`, meaning `0 ≤ i ≤ s.endPos`
and `i` lies on a UTF8 byte boundary. If `i = s.endPos`, the iterator is at the end of the string.

Most operations on iterators return arbitrary values if the iterator is not valid. The functions in
the `String.Iterator` API should rule out the creation of invalid iterators, with two
exceptions:

- `Iterator.next iter` is invalid if `iter` is already at the end of the string (`iter.atEnd` is
  true), and
- `Iterator.forward iter n`/`Iterator.nextn iter n` if `n` is strictly greater than the number of
  remaining characters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, with the link to std4 removed

i : Pos
deriving DecidableEq

/-- Creates an iterator at the beginning of a string. -/
def mkIterator (s : String) : Iterator :=
⟨s, 0⟩

@[inherit_doc mkIterator]
abbrev iter := mkIterator

/-- The size of a string iterator is the number of bytes remaining. -/
instance : SizeOf String.Iterator where
sizeOf i := i.1.utf8ByteSize - i.2.byteIdx

theorem Iterator.sizeOf_eq (i : String.Iterator) : sizeOf i = i.1.utf8ByteSize - i.2.byteIdx :=
rfl

namespace Iterator
def toString : Iterator → String
| ⟨s, _⟩ => s
@[inherit_doc Iterator.s]
def toString := Iterator.s

/-- Number of bytes remaining in the iterator. -/
def remainingBytes : Iterator → Nat
| ⟨s, i⟩ => s.endPos.byteIdx - i.byteIdx

def pos : Iterator → Pos
| ⟨_, i⟩ => i
@[inherit_doc Iterator.i]
def pos := Iterator.i

/-- The character at the current position.

On an invalid position, returns `(default : Char)`. -/
def curr : Iterator → Char
| ⟨s, i⟩ => get s i

/-- Moves the iterator's position forward by one character, unconditionally.

It is only valid to call this function if the iterator is not a the end of the string, *i.e.*
AdrienChampion marked this conversation as resolved.
Show resolved Hide resolved
`Iterator.atEnd` is false; otherwise, the resulting iterator will be invalid. -/
def next : Iterator → Iterator
| ⟨s, i⟩ => ⟨s, s.next i⟩

/-- Decreases the iterator's position.

If the position is zero, this function is the identity. -/
def prev : Iterator → Iterator
| ⟨s, i⟩ => ⟨s, s.prev i⟩

/-- True if the iterator is past the string's last character. -/
def atEnd : Iterator → Bool
| ⟨s, i⟩ => i.byteIdx ≥ s.endPos.byteIdx

/-- True if the iterator is not past the string's last character. -/
def hasNext : Iterator → Bool
| ⟨s, i⟩ => i.byteIdx < s.endPos.byteIdx

/-- True if the position is not zero. -/
def hasPrev : Iterator → Bool
| ⟨_, i⟩ => i.byteIdx > 0

/-- Replaces the current character in the string.
AdrienChampion marked this conversation as resolved.
Show resolved Hide resolved

Does nothing if the iterator is at the end of the string. If the iterator contains the only
reference to its string, this function will mutate the string in-place instead of allocating a new
one. -/
def setCurr : Iterator → Char → Iterator
| ⟨s, i⟩, c => ⟨s.set i c, i⟩

/-- Moves the iterator's position to the end of the string.

Note that `i.toEnd.atEnd` is always true. -/
AdrienChampion marked this conversation as resolved.
Show resolved Hide resolved
def toEnd : Iterator → Iterator
| ⟨s, _⟩ => ⟨s, s.endPos⟩

/-- Extracts the substring between the positions of two iterators.

Returns the empty string if the iterators are for different strings, or if the position of the first
iterator is past the position of the second iterator. -/
def extract : Iterator → Iterator → String
| ⟨s₁, b⟩, ⟨s₂, e⟩ =>
if s₁ ≠ s₂ || b > e then ""
else s₁.extract b e

/-- Moves the iterator's position several characters forward.

Calling this function is only legal if the number of character to skip is less than or equal to the
AdrienChampion marked this conversation as resolved.
Show resolved Hide resolved
number of characters left in the iterator. -/
def forward : Iterator → Nat → Iterator
| it, 0 => it
| it, n+1 => forward it.next n

/-- The remaining characters in an iterator, as a string. -/
def remainingToString : Iterator → String
| ⟨s, i⟩ => s.extract i s.endPos

@[inherit_doc forward]
def nextn : Iterator → Nat → Iterator
| it, 0 => it
| it, i+1 => nextn it.next i

/-- Moves the iterator's position several characters back.

If asked to go back more characters than available, stops at the beginning of the string. -/
def prevn : Iterator → Nat → Iterator
| it, 0 => it
| it, i+1 => prevn it.prev i
Expand Down
6 changes: 6 additions & 0 deletions src/Init/Prelude.lean
Original file line number Diff line number Diff line change
Expand Up @@ -2373,6 +2373,12 @@ Codepoint positions (counting the Unicode codepoints rather than bytes)
are represented by plain `Nat`s instead.
Indexing a `String` by a byte position is constant-time, while codepoint
positions need to be translated internally to byte positions in linear-time.

A byte position `p` is *valid* for a string `s` if `0 ≤ p ≤ s.endPos` and `p`
lies on a UTF8 byte boundary. This notion is properly introduced in [std's
`String.Pos.Valid`][valid].

[valid]: https://leanprover-community.github.io/mathlib4_docs/Std/Data/String/Lemmas.html#String.Pos.Valid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
lies on a UTF8 byte boundary. This notion is properly introduced in [std's
`String.Pos.Valid`][valid].
[valid]: https://leanprover-community.github.io/mathlib4_docs/Std/Data/String/Lemmas.html#String.Pos.Valid
lies on a UTF8 byte boundary.

We shouldn't be linking to external URLs like this before we get a way to keep them working over time, even if it would be nice.

Copy link
Collaborator

@digama0 digama0 Feb 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idem ditto. We've been using docs permalinks on zulip for years.

-/
structure String.Pos where
/-- Get the underlying byte index of a `String.Pos` -/
Expand Down
Loading