
A hello from a similar project 👋🏻 #56

Open
federicotdn opened this issue Jan 7, 2024 · 20 comments

Comments

@federicotdn

Hey! I found your project some weeks ago and found it super interesting - I am doing something similar, but in Go: https://github.com/federicotdn/pimacs.

Your project is more active than mine, but I thought it would be interesting to share ideas. Go and Rust are very different languages, of course, but some challenges are more or less the same regardless of the language. For example, I've seen the following topics discussed in this repo, which I've also thought about!

  • Handling of strings: multibyte vs. unibyte, using plain UTF-8 vs. Emacs' custom UTF-8-like encoding
  • Importing Elisp files from the Emacs codebase into Rune
  • Concurrency! (a fun one to tackle in Go, given goroutines + channels)
  • Implementation of an Elisp reader, bytecode interpreter and/or JIT compiler
  • Testing in general (using Emacs' ERT?)

I've written about some of my design decisions here: https://github.com/federicotdn/pimacs/blob/main/etc/design.md. As part of the project, I've written a Python script called extract.py that creates a JSON file with information about all the subroutine declarations in Emacs - perhaps that could be useful for Rune as well! (https://github.com/federicotdn/pimacs/blob/main/test/data/emacs_subroutines.json)

I'm aware that a GitHub issue is not a great medium for these types of discussions, so please feel free to close it, as it is not really an "issue" with Rune itself. Cheers!

@CeleritasCelery
Owner

CeleritasCelery commented Jan 8, 2024

Hi @federicotdn! It is so cool to see your project! I think there could be a lot for me to learn from your implementation. I read through your design doc and ran some sample programs. I think GitHub issues are a fine medium for discussion, so if it is okay with you, I have a few questions/observations.

why go?

Why did you choose Go for this project? Having a built-in garbage collector seems useful, and goroutines can make parallelism easier. However, I feel like the tricky part of adding goroutines to Emacs is not implementing interpreter support for them, but defining the semantics in Elisp. I have written about that here and am curious what your thoughts are about using threads in Emacs. We currently support multi-threaded Elisp in the most basic sense with the go function, but most of the Elisp semantics have yet to be nailed down.

elisp string compatibility

It looks like you are trying to be "bug compatible" with Emacs, since you say that "The current main design guide for Pimacs is to implement an Elisp interpreter that allows for running any Elisp script without modifications". How do you plan to handle mutable strings, given that Go strings are immutable? Also curious what you plan to do about the encoding of "raw bytes" in multibyte strings.

debugger

The interpreter in Rune was initially structured similarly to pimacs, in that we don't use longjmp but instead return error types and propagate them. This was how we handled backtraces and displayed errors. But I observed that in GNU Emacs you can enter the debugger when you get an error, and you can also resume from that error point. That means you can't unwind the stack on error, and you need the context (the specpdl stack) without unwinding. You also need to know whether there is a condition-case to catch the error, again without unwinding first. We are in the process of reworking the runtime to support a debugger, only unwinding the stack when the debugger is not active. Though I do think your use of defer to automatically clean up the specpdl stack is quite clever and will save your project from a lot of bugs.

tagging

How are your lispObjects defined? I don't understand Go well enough to determine that. Are you using some sort of tagging scheme so they will fit in a machine word? How is nil represented?

passing many arguments

When calling a LispFnM, which takes an unbounded number of arguments, where are you storing the arguments for the call? Are you allocating an array each time? I ask because that is something we have been looking at recently.

Emacs subroutines.json

I thought this was a very cool idea. I currently don't have any way to track which subroutines we are missing, or whether we have the wrong number of arguments, etc. Your solution makes it easy to upgrade the Emacs version because you can always find new functions or ones that have changed. I eventually want to have a test runner that can run directly against GNU Emacs to compare functionality.

@federicotdn
Author

(I'm reading a bit of the project code and your blog posts, I'll get back to you!)

@CeleritasCelery
Owner

We also have a design doc, but it is not as well organized as yours.

@federicotdn
Author

federicotdn commented Jan 11, 2024

Ok getting back to you!

Why Go?

Mainly, I find Go programming fun, so I thought this was a cool project to test it out on. I also figured that re-implementing Emacs is such a massive project that I'm going to need all the help I can get. The GC is great to have OOTB, plus the syntax of Go itself is quite easy, so re-writing all the subroutines would (in theory) be a quicker process than with a more complex language. It also has a good library ecosystem, which helps as well. I was definitely focusing on developer efficiency more than anything. The concurrency utilities are of course very handy too, but I wasn't thinking of them so much when I started (given that concurrency in Emacs is not thaaat important after all).

I read your blog post about concurrency, and the design doc! It is definitely a hard problem to nail down. Personally I went with having some things shared among goroutines, and others not:
shared/global:

  • obarray for symbols interned during startup (protected by mutex)
  • global buffers list (protected by mutex)
  • all symbols
  • all other Elisp objects

goroutine-local:

  • the stack(s)
  • the current buffer
  • obarray for symbols interned after goroutine starts

So with this approach, each goroutine also has its own obarray. This allows binding variables like lexical-binding per goroutine. However, this may have been better done via symbol value lookup, though I feel like the logic for that in Emacs is already complex enough. With the current system, a symbol interned in the local obarray shadows the one in the global obarray. Maybe both systems are equivalent...
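The shadowing rule above can be sketched language-neutrally (a hypothetical Python model of the goroutine-local/global obarray split; `intern` and `lookup` are illustrative names, not pimacs API):

```python
class Symbol:
    def __init__(self, name):
        self.name = name
        self.value = None

class Obarray:
    """Maps names to interned Symbol objects."""
    def __init__(self):
        self.symbols = {}

    def intern(self, name):
        # Return the existing symbol, or create and register a new one.
        if name not in self.symbols:
            self.symbols[name] = Symbol(name)
        return self.symbols[name]

# One global obarray (filled during startup), one per "goroutine".
global_obarray = Obarray()
local_obarray = Obarray()

def lookup(name):
    # A symbol interned locally shadows the globally interned one.
    if name in local_obarray.symbols:
        return local_obarray.symbols[name]
    return global_obarray.symbols.get(name)

# Example: lexical-binding interned per goroutine shadows the global one.
global_obarray.intern("lexical-binding").value = False
local_obarray.intern("lexical-binding").value = True
```

A lookup of `lexical-binding` now sees the local binding, while other goroutines (with their own local obarrays) would still see the global one.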

I haven't really tested this design, so I don't know if it would work in practice. Your approach seems much more thought-out (and probably tested as well). Have you looked into how Emacs does it? I believe it does have a per-thread struct where it stores some state, but maybe it's not used for much.

Elisp string compatibility

Yes, they are immutable! One can actually cheat by using the unsafe module (which gives access to the raw data being pointed at), but I decided not to, because Go asserts that all strings must be immutable.

So instead my lispString is basically:

type lispString struct {
	valMut []byte // mutable copy of the data, populated on first mutation
	val    string // immutable value, used by default
	size_  int    // -1: unibyte; 0: multibyte, rune count unknown; >0: cached rune count
}

where val is used by default to store the string value. If the user tries to modify any part of the string, then the value is copied over to valMut (as bytes) and val is set to "" (there's no way of setting a string to null in Go). So it's kind of a copy-on-write.

In Go, string is basically an immutable []byte. The reason I did not use []byte and nothing else is that so many library utilities operate on string that it is actually very convenient to start off with string, as I assume most strings in Elisp are never actually mutated. Converting []byte to string requires copying the bytes, so once the lispString has been mutated, converting it back to a string is a bit expensive.

To keep the struct as small as possible, I use size_ to signal different things: if its value is -1, that means the string is unibyte - i.e. you can get the length by just doing len(val) or len(valMut) (that is, just counting the bytes).

If size_ is 0, it means the number of runes needs to be counted (maybe because the string bytes were modified recently). If size_ is larger than 0, then that is the cached number of runes in the string. In both cases, the string is multibyte. Here I'm always using UTF-8 (the Go stdlib has utilities for it).
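Putting those pieces together, here is a rough Python model of that lispString behavior (copy-on-write plus the size_ convention; the names mirror the Go struct, but nothing here is actual pimacs code):

```python
class LispString:
    UNIBYTE = -1    # size_ == -1: unibyte, length is just the byte count
    UNCOUNTED = 0   # size_ == 0: multibyte, rune count must be recomputed

    def __init__(self, val, unibyte=False):
        self.val = val          # immutable value, used by default
        self.val_mut = None     # mutable bytes, populated on first mutation
        self.size_ = self.UNIBYTE if unibyte else self.UNCOUNTED

    def mutated(self):
        return self.val_mut is not None

    def set_byte(self, i, b):
        # Copy-on-write: move the data into the mutable byte buffer.
        if not self.mutated():
            self.val_mut = bytearray(self.val.encode("utf-8"))
            self.val = ""  # closest thing to "unset" for the string field
        self.val_mut[i] = b
        if self.size_ > 0:
            self.size_ = self.UNCOUNTED  # cached rune count is now stale

    def length(self):
        if self.size_ == self.UNIBYTE:
            return len(self.val_mut) if self.mutated() else len(self.val)
        if self.size_ == self.UNCOUNTED:
            data = bytes(self.val_mut) if self.mutated() else self.val.encode("utf-8")
            self.size_ = len(data.decode("utf-8"))  # count and cache the runes
        return self.size_
```

Reads stay on the cheap immutable path until the first mutation, after which the byte buffer becomes the source of truth and the rune count is lazily recomputed.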

This was as small as I could get the strings without potentially breaking things inside the Go runtime itself. I bet you can do something much more compact with Rust. How would you describe the one in Rune at the moment?

Regarding Emacs' custom UTF-8 variant (let's call it EU8) - I thought about it and decided that it wasn't worth the complexity. By using plain UTF-8 I can use Go's existing libraries (plus, the language uses UTF-8 for string literals as well), so everything fits in nicely.

Do you know how common it is to store raw bytes within multibyte strings? I feel like it can't be too common, right? The Emacs docs also state that:

Emacs extends this range with codepoints in the range #x110000..#x3FFFFF, which it uses for representing characters that are not unified with Unicode and raw 8-bit bytes that cannot be interpreted as characters.

What would be examples of "characters not unified with Unicode"? I wonder if Emacs makes use of this range itself. But in any case, I have a strong feeling that plain UTF-8 should cover the majority of use cases.

But yeah, strings and encodings are a very cool topic to think about.

Debugger

That is an interesting point about the debugger. My Rust knowledge is basic, so I couldn't find how you are currently managing the stack in Rune. But if you say it could be a problem, then you are probably right; I haven't gotten around to implementing a debugger yet.

For catching errors, I do think I managed to get it working, though: when calling throw, the function scans the execution stacks and checks whether a tag was pushed onto them earlier during execution: https://github.com/federicotdn/pimacs/blob/main/core/eval.go#L276
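That scan-for-a-tag mechanism can be sketched like this (a simplified Python stand-in for the linked eval.go logic; the frame layout here is invented for illustration):

```python
class ThrowSignal(Exception):
    """Carries the tag and value of an Elisp (throw TAG VALUE)."""
    def __init__(self, tag, value):
        self.tag = tag
        self.value = value

catch_stack = []  # tags pushed by active `catch` forms, innermost last

def elisp_catch(tag, body):
    catch_stack.append(tag)
    try:
        return body()
    except ThrowSignal as sig:
        if sig.tag == tag:
            return sig.value  # this catch handles the throw
        raise                 # let an outer catch (or top level) handle it
    finally:
        catch_stack.pop()

def elisp_throw(tag, value):
    # Scan the stack of pushed tags *before* unwinding, as pimacs does:
    # throwing to a tag that was never caught is an error in its own right.
    if tag not in catch_stack:
        raise RuntimeError(f"No catch for tag: {tag}")
    raise ThrowSignal(tag, value)
```

For example, `elisp_catch("done", lambda: elisp_throw("done", 42))` returns 42, while a throw through a non-matching inner catch propagates to the matching outer one.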

Thank you for the comment about defer. It makes me miss C++'s RAII features a bit.

Tagging

For this I took a veeeery easy solution, but probably a very memory-wasteful one as well. Basically, all my Elisp objects are structs that also satisfy an interface called lispObject. This post by Russ Cox explains the internals of interfaces a bit: https://research.swtch.com/interfaces . But essentially, interfaces are a two-word pair: one pointer pointing at the underlying data, and another pointing at a structure (itab) containing type information about the underlying data (e.g. lispString).

I pass around lispObject everywhere, which did simplify some things. I believe the Rune equivalent would be Rt<GcObj> right?

I find the Rune approach very cool though, like in:

#[defun(name = "%")]
pub(crate) fn remainder(x: i64, y: i64) -> i64 {
    // TODO: Handle markers
    x % y
}

The fact that you can operate directly with i64 is much more convenient. In Pimacs this would be:

func (ec *execContext) modulo(x, y lispObject) (lispObject, error) {
    if !integerp(x) {
        return ec.wrongTypeArgument(ec.s.integerp, x)
    } else if !integerp(y) {
        return ec.wrongTypeArgument(ec.s.integerp, y)
    }

    return xInteger(x) % xInteger(y), nil
}

Also using tagged pointers is clearly much more memory efficient!

In my case, nil is just a lispSymbol pointing at itself: https://github.com/federicotdn/pimacs/blob/main/core/symbols.go#L124
I also set it as an execContext-level variable, so that it can be accessed from "anywhere" by just writing ec.nil_.
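That self-pointing nil can be modeled in a few lines (illustrative Python, not actual pimacs code):

```python
class LispSymbol:
    def __init__(self, name):
        self.name = name
        self.value = None  # the symbol's variable slot

# nil is just an interned symbol whose value cell points at itself,
# so evaluating the variable `nil` yields nil again.
nil = LispSymbol("nil")
nil.value = nil

def is_nil(obj):
    # Identity comparison suffices: there is exactly one nil object.
    return obj is nil
```

Making nil an ordinary symbol keeps symbol-valued code uniform, at the cost of one pointer dereference compared to a tagged immediate.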

Passing too many arguments

For this I convert the Elisp list of arguments into a Go slice of lispObject ([]lispObject), here: https://github.com/federicotdn/pimacs/blob/main/core/eval.go#L16 . This function itself is called by eval_sub.

This is a very basic approach though, maybe there is a better way of doing it.
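The conversion amounts to a single pass over the cons cells, allocating one fresh argument list per call (an illustrative Python sketch, not the actual eval.go code):

```python
class Cons:
    def __init__(self, car, cdr):
        self.car = car
        self.cdr = cdr

NIL = object()  # stand-in for the nil list terminator

def list_to_args(lst):
    """Flatten an Elisp argument list into a flat sequence,
    allocating a new one each call (like a fresh Go slice)."""
    args = []
    while lst is not NIL:
        args.append(lst.car)
        lst = lst.cdr
    return args

# (1 2 3) as nested cons cells
arglist = Cons(1, Cons(2, Cons(3, NIL)))
```

One obvious refinement (and presumably what Rune is weighing) is reusing a per-call-stack scratch buffer instead of allocating per call.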

emacs_subroutines.json

Thanks. I also figured that I would "tie" Pimacs to a certain Emacs version - this is configured here: https://github.com/federicotdn/pimacs/blob/main/tools/extract/config.json - this file determines the contents of emacs_subroutines.json, but also the version of the Elisp files that I copy over! So everything stays consistent. How does Rune decide which version of Emacs to target?


more topics

UI/UX

I also share your idea of separating the core part of the editor from the UI, like in the xi editor. However, I've looked around extensively and the status of GUI libraries is not great, for Go at least. The problem is that you need an extremely flexible library - after all, you need to implement all Emacs features, which do not really follow the guidelines of any native GUI toolkit.

So then, maybe you can use something a bit more low-level, like an immediate-mode library, but then you need to implement a lot of things yourself. Text rendering in particular seems like a scary topic. I am curious whether you've had any ideas on this yet.

A UI for the terminal seems a bit easier to achieve, and still very useful (thankfully). Maybe that would be a good starting point for both projects.

Gap buffer vs. Rope

I agree that the gap buffer is fast enough. The fact that they are easy to implement is a huge plus. Additionally, I think every design decision that makes Rune/Pimacs internally more similar to Emacs is a good thing in the long term; it can potentially make it easier to port over more code from Emacs (even though, technically, how a buffer is implemented should not really matter, in practice things tend to be more complicated). The Emacs codebase has a lot of surprises.

Bytecode / source interpreter

My understanding is that currently Rune can read Elisp source code, and it can also compile it to bytecode and interpret it from there. Is that correct? Can Rune read .elc files generated by Emacs? Would that even be desirable?

Project philosophy

I am very surprised at how similar the two projects are in terms of design and purpose! It seems like we arrived at very similar ideas independently, using very different tools.

@CeleritasCelery
Owner

I am going to respond in multiple comments as I get time.

threading

You seem to have arrived at a similar conclusion on what should be shared vs. thread-local. In Rune, this is how we split it up.

shared globally

  • interned obarray (mutex)
  • buffer list (mutex)
  • interned symbol functions (atomic)

All interned symbols are shared between all threads. They (currently) have an atomic field where the function pointer is stored, so functions are shared as well. Values of symbols are thread-local and stored in a hashmap. I think properties will be shared as well.

thread local

  • all lisp variables
  • uninterned symbols
  • symbol values
  • the stack

We don't share any lisp objects between threads, because doing so makes reasoning about your program and resolving bugs much harder. We also have a local obarray for uninterned symbols. The value of a symbol is first looked up in the thread's variable cache; if not found, we request the value from the "main" thread. If it is not found there either, it is a void variable.
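That lookup order can be sketched as follows (a Python approximation of the scheme just described; the dictionaries stand in for the per-thread cache and the main thread's store, and all names are illustrative, not Rune API):

```python
class VoidVariable(Exception):
    pass

main_thread_values = {"fill-column": 70}  # values owned by the "main" thread

class LispThread:
    def __init__(self):
        self.local_values = {}  # this thread's variable cache

    def symbol_value(self, name):
        # 1. Check the thread's own cache first.
        if name in self.local_values:
            return self.local_values[name]
        # 2. Fall back to requesting the value from the main thread.
        if name in main_thread_values:
            return main_thread_values[name]
        # 3. Not found anywhere: the variable is void.
        raise VoidVariable(name)

    def set_value(self, name, value):
        # Writes stay thread-local, so no lisp data is shared.
        self.local_values[name] = value
```

Since writes only touch the local cache, a thread that sets a variable diverges from the main thread's copy, which is exactly the "no shared lisp objects" property described above.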

As I mentioned in my post, buffers are kind of unique in that they are shared, but only a single thread can have one open at any time. This is controlled by a mutex. I still have not decided how to handle buffer-locals, though (does each thread get a copy, or does accessing a buffer-local share it with that thread?).

@CeleritasCelery
Owner

CeleritasCelery commented Jan 12, 2024

String type

Strings in Rune are defined very similarly to pimacs.

#[derive(PartialEq, Eq, Trace)]
pub(crate) struct LispString {
    gc: GcMark,
    #[no_trace]
    string: StrType,
}
unsafe impl Sync for LispString {}

#[derive(Debug, PartialEq, Eq)]
enum StrType {
    String(String),
    BString(BString),
}

We have an enum that can either be a UTF-8 string or a raw byte array. I agree with you that I don't think supporting raw bytes is worth the extra complexity.

Do you know how common it is to store raw bytes within multibyte strings?

I don't think it is that common, but Emacs has support for it. If you want to be fully compatible with Emacs Lisp you would need to handle it, though. For example:

(string-to-multibyte (unibyte-string 128))

what would be examples of "characters not unified with Unicode"?

These are the raw bytes. For example, if you go to a buffer and eval (insert-byte 128 1), it will insert a raw byte. Then calling M-x describe-char on that character will report that it is code point #x3fff80, which is outside the valid Unicode range. Emacs actually uses a special encoding so that raw bytes take two bytes instead of four.
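The arithmetic behind that #x3fff80 is simple: raw byte B (in 0x80..0xFF) is represented as codepoint #x3FFF00 + B. A quick sketch (the function names are mine, not Emacs C identifiers):

```python
BYTE8_OFFSET = 0x3FFF00  # raw bytes occupy codepoints #x3FFF80..#x3FFFFF

def byte8_to_char(b):
    """Map a raw byte (0x80..0xFF) to Emacs' extended codepoint for it."""
    assert 0x80 <= b <= 0xFF
    return BYTE8_OFFSET + b

def char_to_byte8(c):
    """Inverse: recover the raw byte from an extended codepoint."""
    assert 0x3FFF80 <= c <= 0x3FFFFF
    return c - BYTE8_OFFSET

def char_is_byte8(c):
    """Is this codepoint one of the raw-byte stand-ins?"""
    return 0x3FFF80 <= c <= 0x3FFFFF
```

So `(insert-byte 128 1)` produces the character `byte8_to_char(128)` = #x3FFF80, matching what describe-char reports.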

tagging

The #[defun] macro is something we lifted from remacs. It is really nice because it automatically handles type checking and conversion of all the arguments and the return value, which lets you write really succinct code like you saw with remainder.

I pass around lispObject everywhere, which did simplify some things. I believe the Rune equivalent would be Rt<GcObj> right?

A LispObject is just a GcObj (which is a type alias for Gc<Object>). Rt represents an object that is reachable from a root (and hence can survive calls to garbage_collect). Since Rust doesn't have a garbage collector, we have to manage all of that ourselves. Having a garbage collector like Go's saves a lot of implementation complexity.

@CeleritasCelery
Owner

CeleritasCelery commented Jan 12, 2024

UI

I also share your idea of separating the core part of the editor from the UI, like in the xi editor. However I've looked around extensively and the status of GUI libraries is not great, for Go at least. The problem is that you need an extremely flexible library - after all, you need to implement all Emacs features, which do not really follow any the guidelines of any native GUI toolkit.

I will admit that I am pretty ignorant about GUIs. The Rust ecosystem does not have great options either. There are some new projects, like xilem, that are exploring this space. That said, most of the requirements for an Emacs GUI are pretty simple: we need fast text layout and rendering in multiple panes (what Emacs would call windows). It doesn't need fancy widgets or graphics or animations. I think using something lower-level (like GTK) is probably the way to go.

As far as TUIs go, I am trying to avoid going down that path first. A lot of the functionality in Emacs is hamstrung by it being a "TUI first" program - for example, treating C-i and TAB as the same key, and treating control keybindings as ASCII inputs in general. I am not opposed to having a TUI, but I would like to make Emacs "GUI first" and undo some of the legacy around being run in a terminal.

Bytecode / source interpreter

Rune is currently at the point where the interpreter can bootstrap enough Elisp to load the byte compiler (written in Lisp) and then compile the functions. The bytecode VM is also implemented and is run as well. We don't load bytecode files (.elc) directly, but it would be trivial to implement that. However, given that Rust is a memory-safe language, I would like to have some way to verify bytecode that we load before executing it (I have crashed Emacs many times playing around with bytecode).

Project philosophy

I do find it interesting that we arrived at the same approach on many things, especially around concurrency. I think it would be so cool to have that capability in Emacs.

@Qkessler
Contributor

This has been cool to read! Thanks both!

@federicotdn
Author

Amazing, thank you for taking the time to reply with so many details. I'm going to be keeping an eye on this project to see how it progresses. If some day I decide to resume learning Rust, maybe I can try submitting a PR here!

@CeleritasCelery
Owner

@federicotdn I have been thinking there might be a way for the projects to collaborate. I have been wanting to create a tool to test the implementation against GNU Emacs, since that is the most accurate measure of correctness. I was planning on creating a tool that can extract the arguments for builtin functions from Emacs and then generate inputs for them. We would then feed these into both GNU Emacs and Rune and ensure that the outputs match. Essentially, property testing against Emacs. That will hopefully help flush out a bunch of issues, and anything that is found can be added to the normal unit tests.

This got me thinking that you already have a great start on this with your extract tool in pimacs! It can already parse the source files of GNU Emacs and format them nicely in JSON. We could build on this to make a tool that reads the JSON, generates the inputs, and then runs them against both GNU Emacs and the other implementation. This could be used for both Rune and pimacs. What do you think?

@federicotdn
Author

Yes! That sounds like a good idea. I'm curious as to how you would generate inputs for each function, though, as you would need to know the type of each argument beforehand, and also have a set of expected outputs for those arguments. So basically you would need hand-written test cases for each function. And at that point, automatic extraction of all functions stops being so useful 🤔 . Unless you have something else in mind.

In Pimacs I was using the JSON file to quickly test that any subroutine defined in Pimacs with a name identical to one in Emacs had the same set of attributes: arity, lack of return value, etc. (well, actually I only got to testing the arity, but the rest would not be so difficult to add). (https://github.com/federicotdn/pimacs/blob/main/core/exec_context_test.go#L75)

But yeah, in any case of course we can bring the tool into Rune 👍🏻

@CeleritasCelery
Owner

CeleritasCelery commented Feb 13, 2024

I'm curious as to how you would generate inputs for each function though, as you would need to know the type of each argument beforehand, and also have a set of expected outputs for those arguments.

The idea is that you could generate random input for a function and then feed it into both Emacs and the alternative implementation and make sure they produce the same output. So the expected output is whatever Emacs returns (including errors). We already know the arity of functions thanks to your script, and that would be enough for a first pass. But Rune also has type annotations on the builtin functions. Those could be fed into the JSON and then used to generate more specific inputs.

For example, if we picked the + function: we could feed it random data and make sure that we return the same thing as Emacs. Most of these inputs would produce wrong-type-argument errors, but some would be valid by chance. If we used the type information from the JSON, we could also limit the inputs to only numbers, which would get much more interesting results. In fact, we could run it in both modes.

I see this as a low effort way to flush out bugs, because we don't need to define manual tests or types.
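As a concrete sketch, the comparison loop could generate one Elisp form per trial and hand the same form to each implementation in batch mode (hypothetical Python glue; the `--batch`/`--eval` invocation in the final comment is only indicative, and any Rune CLI flags are assumptions):

```python
import random

def gen_arg(rng):
    """Generate one random, printable Elisp argument."""
    return rng.choice([
        str(rng.randint(-100, 100)),  # integer
        '"s"',                        # string
        "'sym",                       # quoted symbol
        "nil",
    ])

def build_form(fn, args):
    """Wrap the call so errors become data instead of aborting the run."""
    call = f"({fn} {' '.join(args)})"
    return f"(prin1 (condition-case err {call} (error (cons 'error (car err)))))"

def gen_trial(fn, arity, seed):
    # Seeded RNG, so both implementations receive identical inputs.
    rng = random.Random(seed)
    return build_form(fn, [gen_arg(rng) for _ in range(arity)])

form = gen_trial("+", 2, seed=0)
# The harness would then run something like:
#   emacs --batch --eval FORM    vs.    <other implementation> FORM
# and diff the printed outputs (values or error symbols).
```

Wrapping everything in condition-case means a wrong-type-argument in one implementation is itself comparable output, rather than a crashed trial.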

@federicotdn
Author

Aah ok, I see, then more or less the flow would be:

  • For each subroutine S defined in Rune:
    • Check S against emacs_subroutines.json to check arity and potentially other attributes
    • Generate random inputs based on the Rust parameter types for S
    • Evaluate S(args) and check that it returns the same as in Emacs itself (or errors out in the same way)

To me, the fact that you are already using Emacs itself in this process makes the JSON file less valuable, since you can probably evaluate some Elisp code within Emacs that will return the subroutine metadata anyway.

For example you could do:

  • For each subroutine S defined in Rune:
    • Start up Emacs and look up S. If S is not defined there, skip this iteration
    • If S is found, use help-function-arglist (or equivalent) to check that the Rune <-> Emacs signatures are compatible
    • Generate random inputs based on the Rust parameter types
    • Evaluate S(args) and check that it returns the same as in Emacs itself (or errors out in the same way)

@federicotdn
Author

Or maybe you could do two separate things: one where you use the JSON file as a very quick arity/subroutine-name check, and a slower one using the second method described above. Because I can imagine that for ~1500 subroutines, doing multiple tests for both Rune and Emacs might get slow!

@CeleritasCelery
Owner

You are right that we could do this all from Rust: have it parse the arity and types, generate the input forms, run the Emacsen, and report the results. But the idea is that if you instead write out a JSON file with that information, you can feed it into a Python program that does the generation and comparison of functions. That way, Rune could have a small program to create the JSON, and pimacs could have one to create the JSON. This lets both projects share the common Python test framework for comparing against GNU Emacs.

@federicotdn
Author

federicotdn commented Feb 22, 2024

Sounds good. I can start in the following weeks by bringing over the Python script to this repo and outputting the JSON somewhere. In this regard:

  • Where should I assume the source code for Emacs is?
  • Should we skip some older Emacs files? For example, I was skipping some MS-DOS .c files because they had some quirks in their DEFUNs that I didn't want to waste time on. I can check again what this was.
  • Any preferred output directory that we can .gitignore?

@CeleritasCelery
Owner

I would think that the Emacs source code, as well as the actual Emacs executable, should be arguments to the script. That way, if a project is tracking a different version (we are currently on 29.1), there are no built-in assumptions.

I think it is safe to ignore MS-DOS or anything else that does not seem very useful. I am hoping that as I dive into this, there will be more useless functions that I can just throw out. The easiest code to maintain is code that does not exist!

As far as the output directory goes, Rust usually puts things in target/, but we could create another one called output/ if we wanted.

@gudzpoz

gudzpoz commented Nov 27, 2024

Hello! I have also been working on a tiny Emacs reimplementation project (https://github.com/gudzpoz/Juicemacs) and have learned a lot from the design docs and implementations of rune and pimacs!

strings

When implementing Elisp strings, however, I first used TruffleString (the string type offered by the JIT framework I use) but then ended up reimplementing my own 0 ~ 0x3FFFFF string type. I would like to share a bit about the switch here.

the last straw: "characters not unified with Unicode"

(Well, the haystack under the last straw was that I needed to implement my own regexp engine.)

It was not easy to come up with an example; at least, I didn't find any when searching in English. But following a Japanese blog, I managed to produce an example with the iso-2022-jp coding system.

Basically, when decoded with the iso-2022-jp coding, the byte sequence "\x1b$B='yu?@<R\x1b(B" includes a #x7975 char (in the japanese-jisx0208 charset), which is not yet convertible to Unicode. (The char itself seems to be in Unicode, but I guess the conversion is not yet covered by Emacs or Unicode?)

Edit: just found another example right in Emacs' source code: lisp/language/ethiopic.el. It is encoded in utf-8-emacs and contains a non-Unicode character (codepoint #x1a01ca, file code #xF6 #xA0 #x87 #x8A). (A few more files: tibetan.el, ind-util.el.)

reinventing strings

I don't use the UTF-8-like encoding that Emacs does, however. For example, nowadays Java uses compact strings to store Latin-1 strings as bytes, and only uses 16-bit char[] for strings containing larger codepoints. Similarly, when TruffleString encodes UTF-32 data, it "transparently compacts UTF-32 strings to 8-bit or 16-bit representations, where possible".

Compared to a UTF-8-like encoding, these "transparently compacting" methods simplify things quite a lot. (Also, this makes fetching by codepoint naturally an O(1) operation.) But it might have some problems with Lisp code that depends strictly on a UTF-8 string-bytes implementation, for example.

The switch from TruffleString to a reinvented string type was easier than I expected, maybe partly because I haven't implemented many string built-ins yet. I haven't looked into performance yet, but I do think a bunch of boilerplate (especially for operations involving strings of different compaction levels, like equal) will be necessary to make the performance match the original string types (currently it uses codepoint iterators everywhere).
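The compaction idea boils down to picking the narrowest element width that fits the largest codepoint in the string (a toy Python illustration of the Java/TruffleString approach, not their actual API):

```python
def compaction_width(codepoints):
    """Bytes per element needed to store the string uncompressed:
    1 for Latin-1 only, 2 for BMP only, 4 otherwise."""
    m = max(codepoints, default=0)
    if m <= 0xFF:
        return 1
    if m <= 0xFFFF:
        return 2
    return 4

def char_at(codepoints, i):
    # With fixed-width storage, indexing by codepoint is O(1),
    # unlike UTF-8 where you must scan from the start of the string.
    return codepoints[i]

ascii_s = [ord(c) for c in "hello"]
bmp_s = [ord(c) for c in "héllo\u3042"]   # Latin-1 + a BMP character
astral_s = [0x1F600]                       # emoji outside the BMP
```

The boilerplate mentioned above comes from operations like equal having to compare strings stored at different widths, which fixed-width storage makes mechanical but repetitive.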

auto extraction & codegen

I have been using the subroutines.py script from pimacs for codegen, and I am wondering whether we can also collaborate on code generation.

For example, the initialization of the error-conditions property on various symbols is scattered across the syms_of_<file> functions all over the code base. Personally, I really don't want to manually rewrite all these initialization details, so I have chosen to generate them directly from the Emacs source code (by somehow interpreting C code as Python... through hacky regexps). It is certainly not a nice approach and still requires lots of hard-coding, but, anyway, it seems to work.
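A minimal version of that regexp-based extraction might look like this (a toy Python sketch over an invented C fragment; the real extractor handles far more cases):

```python
import re

# A fragment shaped like the Fput calls found in syms_of_* functions.
c_source = """
  Fput (Qerror, Qerror_conditions, error_tail);
  Fput (Qerror, Qerror_message, build_string ("error"));
  Fput (Qquit, Qerror_message, build_string ("Quit"));
"""

# Capture: target symbol, property symbol, and the raw value expression.
FPUT_RE = re.compile(r"Fput \(Q(\w+), Q(\w+), (.+?)\);")

def extract_fputs(src):
    """Return (symbol, property, value-expression) triples."""
    return [(m.group(1), m.group(2), m.group(3))
            for m in FPUT_RE.finditer(src)]

props = extract_fputs(c_source)
```

Emitting these triples as JSON (or a lispy structure, as suggested below) would let each reimplementation render them into its own initialization code.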

I have also thought about other ways. For example, it might be possible to let Emacs initialize itself and replace its loadup.el with some export-state.el. But this approach might miss some internal initialization details (for example, setup_coding_system).

Anyway, most of the operations in the syms_of_<file> functions seem to be straightforward lispy operations (like Fcons, list1, etc.). If you are considering codegen from the Emacs source, we might be able to share some of the extraction info in some kind of lispy structure. (It suddenly occurs to me that we could even turn (part of) it into Elisp.) What do you think?

Part of the codegen: (https://github.com/gudzpoz/Juicemacs/blob/main/elisp/src/main/java/party/iroiro/juicemacs/elisp/runtime/ELispGlobals.java)

        ERROR.putProperty(ERROR_CONDITIONS, new ELispCons(ERROR, NIL));
        ERROR.putProperty(ERROR_MESSAGE, new ELispString("error"));
        QUIT.putProperty(ERROR_CONDITIONS, new ELispCons(QUIT, NIL));
        QUIT.putProperty(ERROR_MESSAGE, new ELispString("Quit"));
        MINIBUFFER_QUIT.putProperty(ERROR_CONDITIONS, new ELispCons(MINIBUFFER_QUIT, new ELispCons(QUIT, NIL)));
        MINIBUFFER_QUIT.putProperty(ERROR_MESSAGE, new ELispString("Quit"));
        USER_ERROR.putProperty(ERROR_CONDITIONS, new ELispCons(USER_ERROR, new ELispCons(ERROR, NIL)));
        USER_ERROR.putProperty(ERROR_MESSAGE, new ELispString(""));
        WRONG_LENGTH_ARGUMENT.putProperty(ERROR_CONDITIONS, new ELispCons(WRONG_LENGTH_ARGUMENT, new ELispCons(ERROR, NIL)));
        WRONG_LENGTH_ARGUMENT.putProperty(ERROR_MESSAGE, new ELispString("Wrong length argument"));
        WRONG_TYPE_ARGUMENT.putProperty(ERROR_CONDITIONS, new ELispCons(WRONG_TYPE_ARGUMENT, new ELispCons(ERROR, NIL)));
        WRONG_TYPE_ARGUMENT.putProperty(ERROR_MESSAGE, new ELispString("Wrong type argument"));

@CeleritasCelery
Owner

Hi @gudzpoz,
Interesting to see another Emacs project.

When you are talking about "characters not unified with Unicode", I assume you are talking about this. Basically, Emacs uses a custom encoding to handle raw bytes in text. I have not tried to tackle this issue yet because, as you said, it would require reimplementing a bunch of primitives. I am trying to see how long I can avoid it 😁 .

I have never thought about using a script to generate code from the Emacs source, but that would save a bunch of effort for some symbols. If we could export the output into some common format (maybe JSON), then it could be used in multiple projects, and each could handle transforming it into the correct output code.

@gudzpoz

gudzpoz commented Dec 12, 2024

On the raw bytes: Emacs uses #x110000..#x3FFFFF for both "characters that are not unified with Unicode and raw 8-bit bytes", and I just wanted to provide an example of the former case, although codepoints not yet unified are probably much rarer than raw bytes.

As for the Python scripts, I've set up a repo for them here (https://github.com/gudzpoz/emacs-extractor). Currently, it seems to be basically working (tested against some source files in Emacs 30), and I've generated runnable Java code from the extracted info. To extract the initialization logic, it also extracts some #define/enum constants as well as DEFVAR_* variables. I haven't tested against Emacs 29 yet, but I hope the convoluted config file it uses will be enough for adapting it...

(BTW, I just watched your EmacsConf talk, and it was great! It's also very interesting to see three Emacs implementations at EmacsConf. 😃)
