This doesn’t feel very different to the cross-compile version to me. To build the WASM version, I need to have the compiler running on some other system. If I can do that, then why can that compiler build a WASM version that targets my platform and architecture, but not target my platform and architecture directly?
The reason for wanting to bootstrap is to avoid trusting-trust issues: to start with a compiler that I trust already and know that it faithfully reproduced the behaviour expressed in the source code of the compiler that I’m compiling. I lose that if there is a blob of WASM in the middle.
The important difference for me is that I can put wasm spec & wasm seed into a spaceship and be reasonably sure that the aliens would figure that out.
I could pack x86 seed and the Intel manual instead, but I would feel embarrassed about that!
But what problem are you actually trying to solve? Can you bootstrap the compiler (written in language X) when no existing compilers exist for language X? If you have to do this by implementing a compiler or interpreter for Y, then this makes sense only if Y is significantly simpler than X or if you can assume that an implementation of Y exists. WebAssembly is quite a complex language with a lot of features that are intended to make it possible to statically verify very coarse-grained memory safety and control-flow integrity properties. You’re also going to need at least WASI to be able to do I/O, so there’s a lot more in the embedding than just WebAssembly. SPARC, MIPS, and RISC-V are all simpler than WebAssembly, so you could just ship a binary for these and a spec for the architecture and a UART that you feed source into and get binaries out of. For any non-aliens use case, you can just run it in QEMU.
That’s never the problem I’m trying to solve with bootstrapping though. The problem that I am trying to solve is how do I get a binary for the compiler that I am reasonably confident that I can trust. If my FreeBSD/AArch64 FooLang compiler needs me to run a Linux/x86-64 FooLang compiler to cross-compile the first version and the Linux version of the compiler needs ten different versions to be built with each other to get from source code to a binary, then there are a lot of places where someone could have inserted a trojan. Going via WebAssembly on the Linux/x86-64 -> FreeBSD/AArch64 step doesn’t address any of that, it still puts every single version of the Linux compiler in my TCB. It also isn’t easier to use.
No one bothers to solve this problem for C (and, increasingly, not for C++) because C/C++ compilers are assumed to be trusted for no good reason, and so the usual dodge is to write a bootstrap compiler in C/C++, then:

1. Compile your real compiler with the bootstrap compiler.
2. Compile your real compiler again with the output of step 1.
3. Compile your real compiler once more with the output of step 2, and check that the outputs of steps 2 and 3 are bit-identical.
If they are, then you can be moderately confident that the bootstrap compiler produces semantically equivalent output to your original.
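That bootstrap-and-compare dodge can be sketched as a small driver. Everything here is a placeholder, not any real project’s build system: the compiler is assumed to take a source path and a `-o` output path, and the stage names are made up.

```python
import filecmp
import os
import subprocess

def fixpoint_check(bootstrap_cc, compiler_src):
    # stage1: the real compiler, built by the (already trusted) bootstrap compiler
    subprocess.run([bootstrap_cc, compiler_src, "-o", "stage1"], check=True)
    os.chmod("stage1", 0o755)
    # stage2: the real compiler, rebuilt by stage1 (its first self-compile)
    subprocess.run(["./stage1", compiler_src, "-o", "stage2"], check=True)
    os.chmod("stage2", 0o755)
    # stage3: the real compiler, rebuilt by stage2
    subprocess.run(["./stage2", compiler_src, "-o", "stage3"], check=True)
    # if stage2 and stage3 are bit-identical, self-compilation has reached a
    # fixpoint, and the binary plausibly corresponds to its own source
    return filecmp.cmp("stage2", "stage3", shallow=False)
```

Note that this only buys the “moderate confidence” described above: a sufficiently clever bootstrap compiler can still insert a self-propagating trojan that survives the fixpoint, which is exactly the trusting-trust problem.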
I liked Andrew Kelley’s article on Zig’s new bootstrap process, which uses WASM+WASI as matklad outlined. WRT Y being significantly simpler (where Y = WASM), they ended up with a wasm2c compiler in 4000 lines of C, which impressed me with its smallness. They skipped the wasm verification and safety features (not needed in this situation) to make it simpler.
Zig still relies on a C compiler for bootstrapping, though. I guess an alternative would be for Zig’s wasm bootstrap compiler to have a wasm backend instead of a C backend; then the stage2 compiler runs on a wasm compiler or interpreter instead of natively via C. Then porting to a new target requires a wasm interpreter or compiler, eg a wasm2asm instead of wasm2c.
We explored this a while back in the Bootstrapping group. That page, especially its “groups” page, lists about every tool we found. I’m linking it in case anything there is useful to you. What I write next is specifically about bootstrapping for trustworthy software.
My own interest goes back to Paul Karger designing a compiler subversion during the MULTICS evaluation. The high-assurance security field that they launched found that attacks would happen on every level at every step, so everything would need to be secured with rigorous design. What does that take for a trustworthy compiler binary?
For source to object code, we would need verified compilers that do the right thing at a design level, code that’s actually reviewable by humans and analyzable by machines, proof that the object code matches its source, exhaustive testing of all of that, high-security repos, trusted distribution, and the ability for users to re-check and re-generate all those artifacts on their own. Most projects, from bootstrapping to reproducible builds, don’t attempt to address the whole thing even minimally. If anything, I believe most projects are just trying to make sure that, in the worst scenario, the potentially-malicious source on one machine will execute the same attacks on another machine. A huge drop from the prior goals.
Some of their work does filter out common attacks, so let’s say it’s beneficial. Let’s say we use bootstrapping with an eye toward a result we can trust. This sub-field’s solutions have been all over the place. Here are a few categories I saw.
Some are about running a long chain of programs to get from simple tools to the binary of a gigantic compiler that nobody can fully review. I think of that as untrustworthy source to untrustworthy source compilation. The source can still be attacked with subversion. It might also already drop security-critical code, such as via optimization. If we build this chain, even most security-focused users are not going to both inspect and re-run it themselves. The reality of the situation is that almost all of them take some expert’s word that the system is secure. That seems similar to just relying on a trusted team to build a compiler. There’s a lot of OSS demand for these designs, though.
Next type: formally-verified or certifying compilers. Examples of formal verification include VLISP, Myreen’s LISP, CompCert, and CakeML. Although CompCert is proprietary, the GPL version could be used for bootstrapping. Proof tools often output ML, which CakeML might handle. Its intermediate languages could be building blocks for non-ML languages and bootstrapping projects. On the certifying side, there were tools like Flint’s for ML and Necula’s for Java. Most people on the demand side neither know nor want to learn these methods. Those who know them mostly won’t build components like these because they often have different interests. That killed it on the demand side.
Next type: build a LISP, add higher-level features, turn it into a real language, and write the compiler in that. I’ve seen so many versions of this that I think it’s the easiest. But most people coding in C, C++, Java, and .NET won’t learn Lisp at all. Even if they would, they don’t want it to be a critical part of their non-Lisp system. So, the highest-productivity option is gone.
Many hand-written interpreters existed for simple languages that could bootstrap the compiler. Like Lisp, they often had features that would make the job easier than what GCC was written in. Others, like Pascal/P to P-code, made the target platform easier, too. Their simplicity also made code review easier. Yet even if they liked the language, most people interested in bootstrapping wanted to use the heavy language the compiler is for, or personal favorites that were less suitable. The demand side again makes the language requirements harder that way.
Conclusion: this is not a technology problem. There are examples of everything I listed, already built in many forms. The resistance was about non-interest in security requirements, specific languages, use of formally-verified components, etc. From the requirements to the implementations, most things are being decided for non-security-centered reasons, which also sometimes lowers productivity. So, I don’t worry about it any more.
I’ll just compliment, trust, and use whatever their teams end up producing. Like I’ve been doing for GCC, LLVM, and recently Python. :)
Yeah, security angle here is fascinating for some people, but not me :0) I already trust soo many things that hacking my compiler isn’t really the easiest way to pwn me.
I am interested in the bootstrapping-from-literally-nothing case, because starting with an arbitrary Turing tarpit and ending up with Rust in O(1) steps with a rather small constant is nifty!
It sounds like you have a more realistic goal! Many in the bootstrapping community were doing it more for fun or an intellectual challenge. One thing we discussed was the balance of expressiveness, ease of implementation, and resulting performance (eg optimizations). We were curious how many steps were needed in between two points to express those ideas cleanly with acceptable performance. Also, what techniques at one layer dramatically reduced the work at another layer. All fascinating stuff.
I’m curious what your goal is. The article opened with “if you start out with nothing.” We almost always have something on one of the platforms. The language designer can also choose portable options in their implementation. From-nothing designs might not be necessary, but maybe they’re still helpful. For example, the WebAssembly design might aid portability. On a higher level, I’m curious what you’re personally aiming for or would like to see in this space, whether for practical or fun reasons.
Practically, I am pretty happy with downloading prebuilt compilers and/or cross compiling.
Theoretically, I am interested in self-contained seeds, which do not rely on the existence of a particular computation device outside (https://250bpm.com/blog:157/index.html)
I guess an important clarification is “self-contained seeds for real software”.
Obviously, if you start from Forth or Lisp, you can get a rather neat chain. But if you want to jump from that to Rust, you’d have an “implement a Rust compiler in Forth” step, which doesn’t seem elegant.
In the linked example, I’d suggest a good chunk of what the DNA encodes is instructions directly to its environment on what and how to produce the thing. More like a series of instructions on a Linux box that depend on the syscalls, ABI, etc. You’d have to re-create whatever environment that information is designed for so it can “run.” In computing, you’d have to either change your instructions to match the target’s environment or your environment to match the instructions. Finding the balance between the two, or how many links in the chain are necessary, is the hobby of many bootstrappers.
Back to compilers. Before I mention my concept, I’ll mention what inspired it: Pascal/P. Wirth’s design philosophy was to stop adding features when the compiler got too hard to build. His work was easy to read and build. For Pascal/P, he wanted non-compiler experts to be able to port the compiler. His P-code interpreter was close to the metal. The Pascal/P system, including libraries, was compiled entirely to P-code. Then, you just port the interpreter to bootstrap the system on a new architecture. They claimed 70 ports or something in two years, on everything from tiny machines to mainframes with unusual word sizes. Proven model.
So, instead of P-code, you start with the primitives that C or Rust is built on. You use their machine-level data types. You use pointers with operations directly on memory. The ability to move what’s in a file back and forth into memory. You’ll probably need lookup storage for names, a tree w/ tree ops, maybe a stack, and so on. Active Oberon suggests objects might be easy to add. Macros or templates might go in tree ops.
All of this is stuff you can encode as bytecodes in an interpreter. The bytecodes of this interpreter are mostly self-contained functions that operate directly on memory or some shared state (eg the interpreter’s). They’re implemented in non-optimized, straightforward assembler. I advise even minimizing which instruction types you use so reviewers can learn just a few.
How do you pick those things? I noticed that people often described C a certain way when explaining what goes on “under the hood.” They might say we’d put the function arguments on a stack, load the address of the pointer from memory, malloc, do a syscall, etc. I imagine Rust’s coders do something similar. The primitive operations should either be those very things or easy to compose into those things: directly execute one, or compose it with function calls. The concept is that you can either build up your interpreter as your language gets more expressive, or more easily convert an existing compiler to simpler source for your interpreter. They match up.
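A minimal sketch of the kind of seed interpreter described above, in Python for brevity (the real thing would be a few pages of straightforward assembler). The opcode set here is hypothetical; the point is that each bytecode is a small, reviewable operation on flat memory and a stack.

```python
# Hypothetical seed interpreter: a handful of opcodes over flat memory,
# an operand stack, and a program counter. Each opcode is one small
# function's worth of work, so a reviewer can audit the whole set.
PUSH, LOAD, STORE, ADD, SUB, JNZ, HALT = range(7)

def run(code, mem_size=256):
    mem = bytearray(mem_size)      # flat memory, addressed directly
    stack = []                     # operand stack, as in the "under the hood" story
    pc = 0
    while True:
        op = code[pc]; pc += 1
        if op == PUSH:             # push the literal that follows the opcode
            stack.append(code[pc]); pc += 1
        elif op == LOAD:           # pop an address, push the byte stored there
            stack.append(mem[stack.pop()])
        elif op == STORE:          # pop an address, then a value; write to memory
            addr = stack.pop()
            mem[addr] = stack.pop() & 0xFF
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == SUB:
            b, a = stack.pop(), stack.pop()
            stack.append(a - b)
        elif op == JNZ:            # pop a flag; jump to the literal target if non-zero
            target = code[pc]; pc += 1
            if stack.pop() != 0:
                pc = target
        elif op == HALT:
            return mem

# compute 2 + 3 and store the result at address 0
mem = run([PUSH, 2, PUSH, 3, ADD, PUSH, 0, STORE, HALT])
# mem[0] == 5
```

Each opcode maps directly onto the “put the arguments on a stack, write to memory” descriptions above, and file I/O, name lookup, and tree ops would be added as a few more opcodes of the same shape.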
You can drop a lot in the compiler, too. Your untrusted compiler can do the type-checking, error handling, standard library, 100+ optimizations, etc. Your source will be in an acceptable state before you mess with bootstrapping. Then, forget all those complex features in the seed compiler. You’re just reading and directly executing the statements in the compiler with no checks, optimizations, etc. That’s simpler. You build it from the start to support this or just trim it into this form in a fork for bootstrapping. This lowers the bar on what your seed tool must support.
You have multiple routes in this last step. It depends on how many links in the chain you want versus how much rewriting of basically the same code. Some people keep adding language features, rewriting the compiler at that level of abstraction, and running the old version through it. Bigger and bigger. Others, like Wirth, just translate the high-level form of the compiler plus standard library into the simpler form one time to produce a full-featured compiler. Then they keep using the current version of the compiler, which produces code for the interpreter.
So, there was my concept. Grow the machine just a bit toward your language. Build it in your platform’s assembler. Then, swap the heavy language for lots of function calls into its simpler counterpart, and keep those alongside your high-level source. Realistically, it could be just another back-end of your compiler, like many projects compiling to C. Alternatively, a transpiler re-uses the existing source of the compiler for a separate pass. That pass ensures compilation is traceable (diffable), more human-readable, and aimed at an easier target.
It effectively gets the same result as cross-compiling, but now you’ve got 1 binary instead of N binaries for every platform, and that 1 binary can be checked into git. And since there’s this blob in the middle, you can optimize the WASM compiler separately or you could let users bring their own WASM+WASI environment to run it.
It seems like the trusting-trust issues weren’t the biggest reason for Zig to switch to this system - iiuc this was the fastest way to get off the old C++-based Zig compiler + Zig-based Zig compiler combo. This also has the added benefit of not requiring a C++ compiler (one less thing you need to trust). I think you currently need to have blind faith that the updates to this WASM blob aren’t malicious. Though, since this is WASM, you could attempt to sandbox it appropriately.
You have the trusting-trust issue if your bootstrap is C too, though, no? And C actually did have a trusting-trust attack from Thompson, whereas there are no known WASM trusting-trust attacks. :-) In either case, to be paranoid you want multiple implementations with independent provenance.
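The multiple-implementations idea is essentially David A. Wheeler’s diverse double-compiling: build the compiler-under-test with two unrelated compilers, let each result rebuild the compiler from the same source, and compare. A rough sketch, with all paths and the `-o` convention hypothetical:

```python
import hashlib
import os
import subprocess

def sha256(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def diverse_double_compile(compiler_src, cc_a, cc_b):
    # stage1: the compiler-under-test, built by two independently sourced compilers
    for cc, stage1 in ((cc_a, "stage1_a"), (cc_b, "stage1_b")):
        subprocess.run([cc, compiler_src, "-o", stage1], check=True)
        os.chmod(stage1, 0o755)
    # stage2: each stage1 rebuilds the compiler from the same source. If both
    # parent compilers are honest and the build is deterministic, the two
    # stage2 binaries are bit-identical even though the stage1 binaries differ.
    for stage1, stage2 in (("stage1_a", "stage2_a"), ("stage1_b", "stage2_b")):
        subprocess.run(["./" + stage1, compiler_src, "-o", stage2], check=True)
    return sha256("stage2_a") == sha256("stage2_b")
```

An attacker now has to subvert both parent compilers consistently, which is why independent provenance matters more than any single trusted binary.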