1. 67
  1. 8

    This looks really well done! But I’m always compelled, in response to tutorials like these, to advocate for using parser/lexer generator tools instead of hand-writing them. In my experience writing your own parser is sort of a boiling-the-frog experience, where it seems pretty simple at first and then gets steadily hairier as you add more features and deal with bugs.

    Of course if the goal is to learn how parsers work, it’s great to write one from scratch. I wrote a nontrivial one in Pascal back in the day (and that’s part of why I don’t want to do it again!)

    Of the available types of parser generators, I find PEG ones the nicest to use. They tend to unify lexing and parsing, and the grammars are cleaner than the old yacc-type LALR grammars.

    1. 27

      This looks really well done! But I’m always compelled, in response to tutorials like these, to advocate for using parser/lexer generator tools instead of hand-writing them. In my experience writing your own parser is sort of a boiling-the-frog experience, where it seems pretty simple at first and then gets steadily hairier as you add more features and deal with bugs.

      That’s the opposite of my experience. Writing a parser in a parser generator is fine for prototyping and when you don’t care about particularly good error reporting or want something that you can reuse for things like LSP support but then it will hurt you. In contrast, a hand-written recursive descent parser is more effort to write at the start but is then easy to maintain and extend. I don’t think any of the production compilers that I’ve worked on has used a parser generator.

      1. 8

        Also once you know the “trick” to recursive descent (that you are using the function call stack and regular control flow statements to model the grammar) it is pretty straightforward to write a parser in that style, and it’s all code you control and understand, versus another tool to learn.

        1. 3

          What’s the trick for left recursive grammars, like expressions with infix operators?

          1. 4

            Shunting yard?

            1. 3

              The trick is the while loop. If you have something like A = A '.' ident | ident you code this as

              loop {
                ident()
                if !eat('.') { break }
              }
              
              1. 3

                You can’t just blindly do left-recursion for obvious reasons, but infix is pretty easy to deal with with Pratt parsing (shunting yard, precedence climbing - all the same).

            2. 1

              Having used and written both parser generators and hand-written parsers, I agree: parser generators are nice for prototypes of very simple formats, but end up becoming a pain for larger formats.

            3. 7

              Thanks! My thought process is normally: handwritten is simple enough to write, easy to explain, and common in real world software, so why learn a new tool when the fun part is what’s after the parser? I just try to focus on the rest.

              1. 4

                Lua’s grammar is also pretty carefully designed to not put you into weird ambiguous situations, with only one exception I can think of. (The ambiguity of a colon-based method call in certain contexts.)

              2. 6

                In general that is true, but with Lua there are compelling reasons (which I won’t get into here) to handwrite a single step parser for real implementations.

                That’s what the reference implementation does, despite its author being well-known for his research in PEGs (and authoring LPEG).

                1. 2

                  do you have suggestions on PEG parsers that generate JS as well Javascript/Kotlin ?
                  I was searching for something that I can use on a frontend webapp as well as on a backend (which is in Java).

                  1. 2

                    No, sorry; the one I’ve used only generates C.

                2. 5

                  Very nice! I would love to see this become a full Lua 5.4 impl, even if a slow one.

                  1. 8

                    For the most part, projects on my personal Github and blog are purely educational/for teaching rather than something I intend to support or grow on their own. Just minimal examples to help folks learn.

                    But it’s open source if anyone else wants to build on it!

                  2. 4

                    Looking forward to this read. Ever since I learned about CQL through work, I’ve wanted to write a WASM-powered engine in order to make it much more performant in the browser, ideally compiled from Rust. Just the thing to help me learn language interpretation in practice.

                    1. 2

                      Just a superficial observation, but this seems a good example of a program that only deals with natural numbers (indexes, sizes, line numbers and so on). Unless there is a specific reason (which there could be), I think the pattern of representing all these natural numbers as i32 is often an antipattern. If anything, it generates a lot of casting:

                      fp = data.len() as i32;
                      

                      If it was me, I would say that there is usually only one correct integer type natural number type, namely usize in Rust and size_t in C. Of course, there are cases where you subtract two numbers and the result could be negative – offsets in this case. This is an actual integer in the mathematical sense – a legit use of what programmers call signed integers. But even here (depending on what’s correctest and simplest), it may make even more sense to use a natural number representation if you know something is always positive or always negative, or even treat the two cases separately. This just depends on the invariants of the program. A sign of this is when the invalid range of values is artificially split in two by the signed representation, so that you have to check for both the lower and upper limit. I don’t see that here in its explicit form, but the many casts back to usize before using indexes, together with Rust’s implicit range checks, should be equivalent.

                      1. 3

                        Yeah I was a bit torn because this implementation only allows integers anyway and I’m unlikely to follow up on this post.

                        BUT in a real implementation it’s going to be some byte sequence that stores all the datatypes and that won’t be usize either. So all the things that will always be usize I kept as usize but all the things that get stored as data I kept as i32 just kinda hinting that data is its own thing.