Why Emacs remote editing is janky and potential ways to fix it

I've recently been experimenting with editing remote files with Emacs over SSH. Emacs has a package called TRAMP that abstracts remote "filesystems" so that Emacs doesn't need to care where a file is located. You can edit the file as if it were local, traverse directories, run commands, and so on.

In many cases, authors of plugins such as Magit (a Git UI) or Eglot (an LSP client) don't need to do significant work to make their plugins work over remote connections. They only need to start processes through the file-handler-aware functions (process-file, start-file-process, or make-process with :file-handler t), which run the command either locally or remotely depending on the default-directory of the current buffer.

If you try this kind of remote editing yourself, you'll notice a delay when executing commands this way, whereas running the same commands directly over SSH in a terminal has negligible delay. So clearly something is not quite right with the TRAMP implementation. On top of that, certain combinations of plugins can lead to subtle corruption in the TRAMP logic.

The problem with latency

Digging into tramp-sh.el (the implementation of the TRAMP SSH methods) was illuminating, and I think I now understand how it works. Every time you run make-process, TRAMP allocates a temporary buffer for the command and runs that command over SSH by executing ssh -o ControlMaster=... <user>@<host> <command>. This means that every command launches the SSH client again. The ControlMaster option makes things better: the SSH client doesn't have to reestablish the connection every time and reuses the existing one instead.

Relaunching the ssh client is still problematic, because depending on the SSH configuration it may require initializing things like PAM modules. In practice, on some machines I see it take 0.2 seconds on average, and that latency is added to every Emacs action.
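
To see how much of the delay comes from relaunching the client, one could time a trivial remote command a few times. Here is a minimal sketch in Python; time_command is a hypothetical helper of mine, not part of TRAMP or OpenSSH, and the numbers it produces will depend entirely on the machine and SSH configuration.

```python
import statistics
import subprocess
import time


def time_command(argv, runs=5):
    """Return the mean wall-clock time in seconds of running argv `runs` times."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded; only the launch-to-exit time matters here.
        subprocess.run(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)


# Example call (user@host is a placeholder for a real remote machine):
#   time_command(["ssh", "-o", "BatchMode=yes", "user@host", "true"])
```

With a ControlMaster connection already open, whatever this reports is roughly the per-command cost of starting the client, since the remote command itself does nothing.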

Race conditions in the primary tramp session

make-process is not the only mechanism for running commands, though. For a lot of operations, there's a dedicated shell opened in the background, which is used to query directory contents, executable paths, and so on. The associated buffer is named something like *tramp/ssh <hostname>*. It has an associated process (ssh <user>@<host> /bin/sh), and commands are sent to it with process-send-string.

The problem with process-send-string is that it's not "reentrant". It means that if a timer triggers sending a command via this method while another command is being processed, the results are surprising. Here's roughly what it does:

;; Clean up the buffer.  We cannot call `erase-buffer' because
;; narrowing might be in effect.
(let ((inhibit-read-only t)) (delete-region (point-min) (point-max)))
...
;; This must be protected by the "locked" property.
(with-tramp-locked-connection p
    ...
    )
...

As you can see, the buffer cleanup is not protected by the lock, so in the case of a reentrant process-send-string, another command can destroy part of the output of whichever command came first. This is not an abstract problem: if you enable an advanced completion framework like Corfu and try to run commands in eshell, you'll reproduce it pretty quickly. This is because Corfu uses timers to collect completion candidates in the background so as not to block user input.
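
The failure mode is easier to see in a toy model. The sketch below is Python, not TRAMP's code: ToyConnection, ReentrantCall and demo are names I made up, and it assumes that the lock rejects a nested call with an error and that a timer can fire while the first caller is still waiting for output. With the cleanup outside the lock, the nested call is rejected, but only after it has already wiped the shared buffer.

```python
class ReentrantCall(Exception):
    """Raised when the lock is already held (assumed behaviour of the lock)."""


class ToyConnection:
    """Stand-in for the shared connection buffer; purely illustrative."""

    def __init__(self, protect_cleanup):
        self.protect_cleanup = protect_cleanup
        self.buffer = ""
        self.locked = False
        self.timer = None  # callable fired once while waiting for output

    def _locked(self, body):
        if self.locked:
            raise ReentrantCall("connection is busy")
        self.locked = True
        try:
            return body()
        finally:
            self.locked = False

    def _exchange(self, command):
        self.buffer += f"{command}: chunk 1\n"   # first part of the reply
        timer, self.timer = self.timer, None
        if timer is not None:
            timer()                              # a timer fires mid-wait
        self.buffer += f"{command}: chunk 2\n"   # rest of the reply
        return self.buffer

    def send(self, command):
        if self.protect_cleanup:
            def body():
                self.buffer = ""                 # cleanup inside the lock
                return self._exchange(command)
            return self._locked(body)
        self.buffer = ""                         # cleanup outside the lock
        return self._locked(lambda: self._exchange(command))


def demo(protect_cleanup):
    conn = ToyConnection(protect_cleanup)

    def completion_timer():
        try:
            conn.send("which git")               # reentrant call from a timer
        except ReentrantCall:
            pass                                 # rejected by the lock

    conn.timer = completion_timer
    return conn.send("ls")
```

In this model demo(False) loses the first chunk of the ls reply, while demo(True), which moves the cleanup under the lock, keeps both chunks.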

Potential solution

I'd say that many processes started by Emacs on remote machines don't need their own SSH session, especially if they don't need a TTY. What we could do instead is make the primary TRAMP SSH session able to multiplex several commands running in parallel. This can be done by starting a Python or Perl program on the remote side that spawns a process whenever it receives a command on stdin telling it to do so. The command contains a unique number associated with that particular call, and every line of output from that process is prefixed with the number. The receiver can then tell which output line came from which program.

This multiplexer program can also implement the base set of commands that TRAMP needs, such as listing files, getting file properties, and so on, which avoids the extra shell calls and the parsing overhead.

The protocol of the multiplexer may be as follows:

1 run-text wc -l
1 stdin foo\n
1 stdin bar\n
1 stdin baz\n
1 stdin-close

The server will respond with:

1 stdout 3\n
1 stdout-close
1 exit 0
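
The remote side of an exchange like the one above could look roughly like this. It is a minimal sketch in Python covering only run-text, stdin and stdin-close; serve, escape and unescape are hypothetical names. I'm also assuming that \n in a payload is a literal two-character escape for a newline (with \\ standing for a backslash), which the listing above doesn't pin down, and stderr is left alone.

```python
import re
import subprocess
import sys
import threading


def escape(text):
    # Assumed payload escaping: backslash -> \\ and newline -> \n.
    return text.replace("\\", "\\\\").replace("\n", "\\n")


def unescape(text):
    return re.sub(r"\\(.)", lambda m: "\n" if m.group(1) == "n" else m.group(1), text)


def serve(inp=sys.stdin, out=sys.stdout):
    procs, pumps = {}, []
    write_lock = threading.Lock()  # keeps frames from different calls intact

    def emit(call_id, *fields):
        with write_lock:
            out.write(" ".join((call_id,) + fields) + "\n")
            out.flush()

    def pump(call_id, proc):
        # Forward every output line, tagged with the call number.
        for line in proc.stdout:
            emit(call_id, "stdout", escape(line))
        emit(call_id, "stdout-close")
        emit(call_id, "exit", str(proc.wait()))

    for frame in inp:
        parts = frame.rstrip("\n").split(" ", 2)
        if len(parts) < 2:
            continue  # ignore blank or malformed frames in this sketch
        call_id, verb = parts[0], parts[1]
        payload = parts[2] if len(parts) > 2 else ""
        if verb == "run-text":
            proc = subprocess.Popen(payload, shell=True, text=True,
                                    stdin=subprocess.PIPE, stdout=subprocess.PIPE)
            procs[call_id] = proc
            thread = threading.Thread(target=pump, args=(call_id, proc))
            thread.start()
            pumps.append(thread)
        elif verb == "stdin":
            procs[call_id].stdin.write(unescape(payload))
            procs[call_id].stdin.flush()
        elif verb == "stdin-close":
            procs[call_id].stdin.close()

    # Input ended: close any stdin still open so children can finish.
    for proc in procs.values():
        if not proc.stdin.closed:
            proc.stdin.close()
    for thread in pumps:
        thread.join()
```

One reader thread per process keeps the sketch short; a real implementation would more likely use select or asyncio, and would need error handling for unknown call numbers and for processes that exit before their stdin is closed.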

The 1 at the beginning of each line is the call number sent with the initial run-text. If you were to run this with run-bin instead, the stdin and stdout payloads would need to be base64-encoded and chunked. On the Emacs side, these lines can be read back and distributed across the individual process buffers.
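
The receiving side is mostly bookkeeping. The real thing would live in Emacs Lisp, but here is the idea as a small Python sketch; Demultiplexer is a hypothetical name, the text escaping (\n for newline, \\ for backslash) is my assumption, and for run-bin calls I assume each frame carries an independently base64-encoded chunk.

```python
import base64
import re


class Demultiplexer:
    """Routes tagged server lines into one buffer per call number."""

    def __init__(self):
        self.buffers = {}     # call id -> bytearray of stdout received so far
        self.exit_codes = {}  # call id -> exit status
        self.binary = set()   # call ids that were started with run-bin

    def start(self, call_id, binary=False):
        self.buffers[call_id] = bytearray()
        if binary:
            self.binary.add(call_id)

    def feed(self, line):
        parts = line.rstrip("\n").split(" ", 2)
        call_id, verb = parts[0], parts[1]
        payload = parts[2] if len(parts) > 2 else ""
        if verb == "stdout":
            if call_id in self.binary:
                # Each frame is decoded on its own (assumed chunking scheme).
                data = base64.b64decode(payload)
            else:
                text = re.sub(
                    r"\\(.)",
                    lambda m: "\n" if m.group(1) == "n" else m.group(1),
                    payload,
                )
                data = text.encode()
            self.buffers.setdefault(call_id, bytearray()).extend(data)
        elif verb == "exit":
            self.exit_codes[call_id] = int(payload)
```

Because every line carries its call number, output from several commands can arrive interleaved and still end up in the right buffer, without any command having to wait for another to finish.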