Add Lexer#stream() #36
moo.js
Outdated
@@ -273,6 +273,23 @@
this.reset()
}

if (typeof require !== 'undefined') {
Isn't typeof module === 'object' && module.exports the Cool Way to do this?
I suppose that depends on what you're checking. But in this case, probably.
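For reference, a minimal sketch of the check suggested above. The helper name loadTransform is hypothetical and not part of the PR; it only illustrates guarding on module rather than require, so that the Node-only dependency is resolved only under a CommonJS module system.

```javascript
// Hypothetical helper, not from the PR: resolve Transform only when running
// under a CommonJS module system such as Node's.
function loadTransform() {
  if (typeof module === 'object' && module.exports) {
    return require('stream').Transform
  }
  return null // e.g. in a browser <script>, where no module object exists
}

const Transform = loadTransform()
```

Checking module.exports tests for the thing the file actually relies on when exporting, whereas a bare require check can be satisfied by other loaders that define a global require.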
Agreed! This doesn't do anything about token boundaries, does it? I recall you raised some concerns about a stream API from #18:
|
Correct. It is up to the user to push data at token boundaries. |
As for
Probably, yes. We haven't needed to yet, but it would probably make sense for Lexers to keep track of |
Done. |
Are streams compelling? Would you use them? :-) |
It is quite useful to be able to do this:
fs.createReadStream(wherever)
  .pipe(split(/(\n)/))
  .pipe(lexer.clone())
  .pipe(new Parser())
  .on('data', console.log)
(modulo disgusting
But it's somewhat more difficult, or perhaps just not conventional, to write
EDIT: nearley might make this easier since it retains parse state between
Here's an example that parses a sequence of s-expression-like things such as
class Parser extends Transform {
constructor() {
super({objectMode: true})
this.stack = []
this.result = null
}
process(tok) {
switch (tok.type) {
case 'lpar':
const inner = []
if (this.result) {
this.result.push(inner)
this.stack.push(this.result)
}
this.result = inner
break
case 'rpar':
if (!this.stack.length) {
if (!this.result) {
throw new Error("I'm not gonna match your parentheses for you")
}
this.push(this.result)
}
this.result = this.stack.pop()
break
case 'word':
if (!this.result) {
throw new Error("They go inside the parentheses")
}
this.result.push(tok.value)
break
case 'space': break
}
}
flush() {
if (this.result) {
throw new Error("Aren't you forgetting something?")
}
}
_transform(chunk, _, cb) {
try {
this.process(chunk)
cb()
} catch(e) {cb(e)}
}
_flush(cb) {
try {
this.flush()
cb()
} catch(e) {cb(e)}
}
} |
@tjvr Another option for the stream API is to buffer input until we get a regex match that doesn't extend to the end of the buffer (with an optional maximum buffer size like 1MB) and to have a method that signals no more input, or perhaps better a flag to
lexer.feed('some inp', {stream: true})
lexer.next() // --> word some
lexer.next() // --> ws
lexer.next() // --> undefined
lexer.feed('ut some more input', {stream: true})
lexer.next() // --> word input
// ...
lexer.next() // --> word more
lexer.next() // --> ws
lexer.next() // --> undefined
lexer.feed(' the last of the input')
// ...
lexer.next() // --> word input
Using streams that look like this:
const rs = new Readable({read() {}})
rs.push('some inp')
rs.push('ut more input')
rs.push(' the last of the input')
rs.push(null) // signals the end of the input
rs.pipe(lexer.clone()) |
I thought you said that still isn't correct? #18 (comment) |
It's not correct, though it would give the correct result in the example I gave. Where it wouldn't give the correct result is if you had regexes like
The rule here is that for every valid token, all non-empty prefixes of that token must also parse as a single token. As long as that's true, the buffering method works fine. You can always let the number regex match |
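To make the buffering idea concrete, here is a rough sketch under stated assumptions. BufferingLexer and RULES are hypothetical names for a toy lexer and are not moo's API; the {stream: true} flag mirrors the proposal above. A match that reaches the end of the buffer is held back while streaming, because the next chunk might extend it.

```javascript
// Hypothetical toy lexer, not moo's API: it illustrates the buffering rule only.
const RULES = [
  ['word', /[a-z]+/y],
  ['ws', /[ \t]+/y],
]

class BufferingLexer {
  constructor(rules) {
    this.rules = rules
    this.buffer = ''
    this.streaming = false
  }

  // Append input. With {stream: true}, more input may follow later.
  feed(chunk, opts = {}) {
    this.buffer += chunk
    this.streaming = Boolean(opts.stream)
  }

  // Return the next token, or undefined if none can be emitted yet.
  next() {
    if (this.buffer.length === 0) return undefined
    for (const [type, re] of this.rules) {
      re.lastIndex = 0 // sticky regexes match only at the start of the buffer
      const m = re.exec(this.buffer)
      if (!m || m[0].length === 0) continue
      // A match touching the end of the buffer might continue in the next chunk.
      if (this.streaming && m[0].length === this.buffer.length) return undefined
      this.buffer = this.buffer.slice(m[0].length)
      return {type, value: m[0]}
    }
    throw new Error('invalid syntax at ' + JSON.stringify(this.buffer.slice(0, 10)))
  }
}

const lexer = new BufferingLexer(RULES)
lexer.feed('some inp', {stream: true})
lexer.next() // word "some"
lexer.next() // ws
lexer.next() // undefined: "inp" is held back
lexer.feed('ut')
lexer.next() // word "input"
```

The sketch is only correct for rule sets that satisfy the prefix rule stated above: "inp" lexes as a word on its own, so holding it back and retrying with more input loses nothing. A token whose prefixes are not themselves tokens would raise a syntax error before the rest of it arrives.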
Okay. That seems like a good way to do this. :-) |
Actually, this is less reasonable than I first thought :( Doesn't that mean that given the language |
Correct. |
The reason we shouldn't just make Lexer a transform stream is so you can reuse a single lexer on multiple input streams.
However: with this implementation, horrible things will happen if you call stream() again before the first one is closed; specifically, the two streams will be interleaved based on which input stream is readable first. If you don't use states, you might not notice this unless you look at the offset/line/column information.
Should stream() automatically clone the lexer to avoid this behavior? Should we make Lexer a transform stream so you can just call clone() to get a new, independent stream?
stream() automatically calls reset() because switching to streaming halfway through your input doesn't sound like a good idea.
Closes #20.
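As an illustration of the "clone automatically" option raised above, here is a minimal sketch. ToyLexer is a hypothetical stand-in that tracks only an offset; it is not moo's Lexer and not the implementation in this PR. The point is that each stream() call captures its own clone, so position information from concurrent streams cannot interleave.

```javascript
const {Transform} = require('stream')

// Hypothetical stand-in for a lexer: it tracks only an offset, which is the
// state that would be corrupted if two streams shared one instance.
class ToyLexer {
  constructor() { this.offset = 0 }
  clone() { return new ToyLexer() }

  // Split a chunk on whitespace; each token records its offset in the input.
  lex(chunk) {
    const tokens = []
    const re = /\S+/g
    let m
    while ((m = re.exec(chunk)) !== null) {
      tokens.push({value: m[0], offset: this.offset + m.index})
    }
    this.offset += chunk.length
    return tokens
  }

  // Each call returns a Transform backed by its own clone of the lexer.
  stream() {
    const lexer = this.clone()
    return new Transform({
      readableObjectMode: true,
      transform(chunk, _encoding, cb) {
        try {
          for (const tok of lexer.lex(chunk.toString())) this.push(tok)
          cb()
        } catch (e) { cb(e) }
      },
    })
  }
}

const base = new ToyLexer()
const first = base.stream()
const second = base.stream() // independent of `first`
```

Like the PR's stream(), this toy expects chunks to arrive at token boundaries. The trade-off of cloning is that the original lexer object no longer reflects the stream's progress, which may or may not matter to callers who inspect it.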