Add Lexer#stream() #36

nathan · 2017-03-17T18:22:14Z

The reason we shouldn't just make Lexer a transform stream is so you can reuse a single lexer on multiple input streams.

However: with this implementation, horrible things will happen if you call stream() again before the first one is closed; specifically, the two streams will be interleaved based on which input stream is readable first. If you don't use states, you might not notice this unless you look at the offset/line/column information.

Should stream() automatically clone the lexer to avoid this behavior? Should we make Lexer a transform stream so you can just call clone() to get a new, independent stream?

stream() automatically calls reset() because switching to streaming halfway through your input doesn't sound like a good idea.

Closes #20.

tjvr · 2017-03-17T18:31:10Z

moo.js

@@ -273,6 +273,23 @@
    this.reset()
  }

+  if (typeof require !== 'undefined') {


Isn't typeof module === 'object' && module.exports the Cool Way to do this?

I suppose that depends on what you're checking. But in this case, probably.

tjvr · 2017-03-17T18:33:12Z

switching to streaming halfway through your input doesn't sound like a good idea.

Agreed!

This doesn't do anything about token boundaries, does it? I recall you raised some concerns about a stream API from #18:

the API is too generic to allow a useful stream wrapper.

it's incompatible with a stream API that doesn't re-lex the stream when it gets new data.

If you leave the task of pushing data only at token boundaries up to the user, then it's easy to add a stream interface (that's what feed() is, essentially). Packages like stream-split are good for, e.g., pushing the data line-by-line.

It's also possible, I guess, to re-lex the last token when we get new data, but you can still break this if you have two tokens which, when concatenated, form a different token; and this makes the API weird because you need to sometimes return a token twice.

nathan · 2017-03-17T18:34:30Z

This doesn't do anything about token boundaries, does it?

Correct. It is up to the user to push data at token boundaries.

tjvr · 2017-03-17T18:45:25Z

As for

Should stream() automatically clone the lexer to avoid this behavior?

Probably, yes. We haven't needed to you yet, but it probably would make sense for Lexers to keep track of lastIndex themselves; then clone() is cheap since we won't have to re-compile the RegExp.

nathan · 2017-03-18T00:45:50Z

Probably, yes.

Done.

tjvr · 2017-03-18T01:31:36Z

Are streams compelling? Would you use them? :-)

nathan · 2017-03-18T02:27:43Z

It is quite useful to be able to do this:

fs.createReadStream(wherever)
.pipe(split(/(\n)/))
.pipe(lexer.clone())
.pipe(new Parser())
.on('data', console.log)

(modulo disgusting .on('error', …)s everywhere because the stream API is broken)

But it's somewhat more difficult—or perhaps just not conventional—to write Parser like this than it is to write a recursive descent parser.

EDIT: nearley might make this easier since it retains parse state between feed()s.

Here's an example that parses a sequence of s-expression-like things such as (foo (bar baz (qux))), and emits each one as it's parsed.

class Parser extends Transform {
  constructor() {
    super({objectMode: true})
    this.stack = []
    this.result = null
  }
  process(tok) {
    switch (tok.type) {
      case 'lpar':
        const inner = []
        if (this.result) {
          this.result.push(inner)
          this.stack.push(this.result)
        }
        this.result = inner
        break
      case 'rpar':
        if (!this.stack.length) {
          if (!this.result) {
            throw new Error("I'm not gonna match your parentheses for you")
          }
          this.push(this.result)
        }
        this.result = this.stack.pop()
        break
      case 'word':
        if (!this.result) {
          throw new Error("They go inside the parentheses")
        }
        this.result.push(tok.value)
        break
      case 'space': break
    }
  }
  flush() {
    if (this.result) {
      throw new Error("Aren't you forgetting something?")
    }
  }
  _transform(chunk, _, cb) {
    try {
      this.process(chunk)
      cb()
    } catch(e) {cb(e)}
  }
  _flush(cb) {
    try {
      this.flush()
      cb()
    } catch(e) {cb(e)}
  }
}

nathan · 2017-03-18T22:42:09Z

@tjvr Another option for the stream API is to buffer input until we get a regex match that doesn't extend to the end of the buffer—with an optional maximum buffer size like 1MB—and to have a method that signals no more input, or perhaps better a flag to feed() like TextDecoder#decode:

lexer.feed('some inp', {stream: true})
lexer.next() // --> word some
lexer.next() // --> ws
lexer.next() // --> undefined
lexer.feed('ut some more input', {stream: true}
lexer.next() // --> word input
// ...
lexer.next() // --> word more
lexer.next() // --> ws
lexer.next() // --> undefined
lexer.feed(' the last of the input')
// ...
lexer.next() // -> word input

Using streams that look like this:

const rs = new Readable()
rs.push('some inp')
rs.push('ut more input')
rs.end(' the last of the input')

rs.pipe(lexer.clone())

tjvr · 2017-03-20T18:56:50Z

buffer input until we get a regex match that doesn't extend to the end of the buffer

I thought you said that still isn't correct? #18 (comment)

nathan · 2017-03-20T23:54:30Z

I thought you said that still isn't correct?

It's not correct, though it would give the correct result in the example I gave. Where it wouldn't give the correct result is if you had regexes like /[a-z]+/ and /\d+(?:[Ee]\d+)?/ and fed the input as 103e followed by 23, because the latter regex would match 103, which stops before the end of the buffer.

The rule here is that for every valid token, all non-empty prefixes of that token must also parse as a single token. As long as that's true, the buffering method works fine. You can always let the number regex match 103e and do the validation at parse time.

tjvr · 2017-03-21T20:50:30Z

for every valid token, all non-empty prefixes of that token must also parse as a single token

Okay. That seems like a good way to do this. :-)

tjvr · 2017-03-29T20:59:50Z

The rule here is that for every valid token, all non-empty prefixes of that token must also parse as a single token. As long as that's true, the buffering method works fine.

Actually, this is less reasonable than I first thought :(

Doesn't that mean that given the language foo | foobar, we wouldn't correctly lex foob, ar?

nathan · 2017-04-05T13:28:51Z

Doesn't that mean that given the language foo | foobar, we wouldn't correctly lex foob, ar?

Correct.

tjvr reviewed Mar 17, 2017

View reviewed changes

tjvr mentioned this pull request Mar 18, 2017

Don't re-compile RegExp in clone() #37

Merged

nathan added 3 commits March 17, 2017 20:37

Add Lexer#stream()

1943958

Cool

47ac663

Make Lexer a transform stream

ca92f26

nathan force-pushed the streams branch from dc1f70c to ca92f26 Compare March 18, 2017 00:40

Merge remote-tracking branch 'origin/master' into streams

db7c258

tjvr mentioned this pull request Mar 28, 2017

Lexer interface mega-PR kach/nearley#220

Merged

tjvr mentioned this pull request Mar 29, 2017

Remove feed() #43

Closed

tjvr force-pushed the master branch from 82acf8d to 8c3e622 Compare September 18, 2018 12:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Lexer#stream() #36

Add Lexer#stream() #36

nathan commented Mar 17, 2017 •

edited

Loading

tjvr Mar 17, 2017

nathan Mar 17, 2017

tjvr commented Mar 17, 2017 •

edited

Loading

nathan commented Mar 17, 2017

tjvr commented Mar 17, 2017

nathan commented Mar 18, 2017

tjvr commented Mar 18, 2017

nathan commented Mar 18, 2017 •

edited

Loading

nathan commented Mar 18, 2017 •

edited

Loading

tjvr commented Mar 20, 2017

nathan commented Mar 20, 2017 •

edited

Loading

tjvr commented Mar 21, 2017

tjvr commented Mar 29, 2017

nathan commented Apr 5, 2017

Add Lexer#stream() #36

Are you sure you want to change the base?

Add Lexer#stream() #36

Conversation

nathan commented Mar 17, 2017 • edited Loading

tjvr Mar 17, 2017

Choose a reason for hiding this comment

nathan Mar 17, 2017

Choose a reason for hiding this comment

tjvr commented Mar 17, 2017 • edited Loading

nathan commented Mar 17, 2017

tjvr commented Mar 17, 2017

nathan commented Mar 18, 2017

tjvr commented Mar 18, 2017

nathan commented Mar 18, 2017 • edited Loading

nathan commented Mar 18, 2017 • edited Loading

tjvr commented Mar 20, 2017

nathan commented Mar 20, 2017 • edited Loading

tjvr commented Mar 21, 2017

tjvr commented Mar 29, 2017

nathan commented Apr 5, 2017

nathan commented Mar 17, 2017 •

edited

Loading

tjvr commented Mar 17, 2017 •

edited

Loading

nathan commented Mar 18, 2017 •

edited

Loading

nathan commented Mar 18, 2017 •

edited

Loading

nathan commented Mar 20, 2017 •

edited

Loading