
[ADDED] In-process connections #2360

Merged: 7 commits merged into nats-io:main on Jun 28, 2022

Conversation

Member

@neilalexander neilalexander commented Jul 13, 2021

This PR adds three things:

  1. An InProcessConn() function on Server, which builds a net.Pipe to obtain a connection to the NATS server without using TCP sockets
  2. A DontListen option, which tells the NATS server not to listen on the usual TCP listener
  3. A startupComplete channel, which is closed right before we start AcceptLoop, and which ReadyForConnections will wait for

The main motivation for this is that we have an application that can run either in a monolith (single-process) mode or a polylith (multi-process) mode. We'd like to be able to use NATS for both modes for simplicity, but the monolith mode has to be able to cater for a variety of platforms where opening socket connections either doesn't make sense (mobile) or just isn't possible (WASM). These changes will allow us to use NATS entirely in-process instead.

An accompanying PR nats-io/nats.go#774 adds support to the client side.

This is my first PR to this project so apologies in advance if I've missed anything obvious anywhere.

/cc @nats-io/core

Contributor

@ripienaar ripienaar left a comment

love this, been keen to have something like this for quite some time :)

server/server.go (review thread, resolved)
Member

@kozlovic kozlovic left a comment

Thank you for the contribution!

@@ -184,6 +184,7 @@ type Options struct {
 	ServerName string `json:"server_name"`
 	Host       string `json:"addr"`
 	Port       int    `json:"port"`
+	DontListen bool   `json:"dont_listen"`
Member

Instead of adding a new option, I wonder if we could decide on a port value that disables listening. For instance, a port set to 0 means that we use the default port 4222, and setting it to -1 means that we let the OS pick a random free port. We could say that anything lower than -1 (say -2) means listening is disabled?

You don't have to update the PR for that, just wondering if that would be better or not and let's see what others think.

Member Author

I suppose this could work, although it does create the risk of the API shape having more "magic numbers". I'll be happy to accept direction on this from whoever has strong enough opinions on it.

server/server.go (several review threads, resolved)
Member

@kozlovic kozlovic left a comment

The changes look good to me. Let's see if other maintainers have an opinion on the DontListen vs special port number.

@ripienaar
Contributor

The magic ports seem a bit unclear and hard to discover to me. On the other hand, we have too many settings already, but I think a documented setting might still be better.

@neilalexander
Member Author

Hm, ultimately it doesn't appear as if the ReadyForConnections solution works; I guess that's because none of the chk cases fire in a simple setup, so we return after the first 25ms pause regardless of whether NATS is set up by that point.

It might make sense to re-add the startupComplete channel, close it right before the AcceptLoop would start and then wait for that in ReadyForConnections before checking the sockets.

@neilalexander
Member Author

neilalexander commented Jul 14, 2021

I've gone ahead and done that. There's no new API surface this time: ReadyForConnections will instead wait for the startup channel to close, which happens right before AcceptLoop is called. I hope that's OK; it fixes the problem in my tests.

Member

@kozlovic kozlovic left a comment

I had a go with this change, bringing in the nats.go changes to create a quick test in the server package, and my main concern is the behavior of the net.Pipe when one side closes the connection.

It appears that it results in an internal pipe write timeout, which is not that good. For instance, I tried to connect with a wrong user/password and the server would report a WriteDeadline exceeded (when trying to write the authorization violation error).

Without more testing (which is going to be a bit annoying, since it would require changes from the client that I would not want to commit if we end up rejecting the whole idea), I think we can't merge this PR.
If you could add some tests and figure out what the issue is with those timeouts, then we could reconsider.

server/server.go (review threads, resolved)
@neilalexander
Member Author

neilalexander commented Jul 14, 2021

It appears that it results in an internal pipe write timeout, which is not that good. For instance, I tried to connect with a wrong user/password and the server would report a WriteDeadline exceeded (when trying to write the authorization violation error).

Looking at the source for net.Pipe, the only way it can return an os.ErrDeadlineExceeded is if a deadline has been set and not cleared or reset afterwards. Otherwise it should return an io.ErrClosedPipe.

https://github.com/golang/go/blob/d8f348a589b3df4bb48636023be4ccf9ac96d307/src/net/pipe.go#L183-L188

I haven't made any changes to deadlines anywhere, so I'm wondering if there is some existing deadline being set elsewhere?

If you could add some tests and figure out what the issue is with those timeouts, then we could reconsider.

Have you got an example or a specific test that I can look at? We've been experimenting with in-process conns using these commits in our codebase for the last couple of days and we have yet to see this. We'll ultimately need to abandon NATS if we can't figure out a way to perform in-process connections, which would be a shame, as it's looking to be a good fit so far.

@kozlovic
Member

Have you got an example or a specific test that I can look at?

I was trying to connect with a wrong username/password, so what I see is that the server blocks in flushOutbound() writing the -ERR 'Authorization Violation'\r\n message (up to the write deadline, which defaults to 10 seconds), and the client blocks in writing the CONNECT + PING protocols in nats.go's sendConnect() at this place: if err := nc.bw.writeDirect(cProto, pingProto); err != nil {.

So the server reads the CONNECT (not the PING yet), fails the client by trying to write -ERR .. to it, and that blocks because the client is not reading. The connect process in the client is synchronous; that is, there is no read loop started yet.

So if I have the client send only CONNECT and then PING later, the issue is the same: this time the client blocks writing PING because the server is also blocked writing -ERR (since the client is not reading). Having buffered pipes may help, but only up to the point where the buffer is full. At least that may help pass the CONNECT phase...

@neilalexander
Member Author

Lack of buffering is probably the crux of the issue here — net.Pipe doesn't buffer. As per the godoc:

Reads on one end are matched with writes on the other, copying data directly between the two; there is no internal buffering.

I suppose a custom implementation could be made which wraps net.Pipe with buffered readers/writers, but I suspect that would probably only mask the issue and make for an even more confusing implementation. I'm not entirely sure what the best answer here is, really. We won't have seen this problem because we don't use user/password authentication (there's really no point when we know that the connection is coming from within the process).

@variadico
Contributor

variadico commented Jul 14, 2021

Hm, ultimately it doesn't appear as if the ReadyForConnections solution works

Just curious, what is failing? Is there a test that consistently fails? Sometimes tests can be flaky, so you may need to rerun them a few times.

For example, I did this 537a5bc
and got green tests https://github.com/nats-io/nats-server/actions/runs/1031222424

@kozlovic
Member

@variadico No, it was just that in the case of DontListen (part of this PR), ReadyForConnections() should ignore the client port, so I asked to add ok: s.listener != nil || opts.DontListen. The issue was that a user calling ReadyForConnections() in that case would return immediately with "ok", while the user wanted at least all of Server.Start() (minus AcceptLoop) to be done; hence the use of the startupComplete channel, which is closed just prior to starting the AcceptLoop() (which is blocking).

@kozlovic
Member

@neilalexander We will discuss this PR internally and come up with a decision maybe later next week. Thanks again!

Member

@kozlovic kozlovic left a comment

LGTM. Switching to Draft until we decide internally (so it does not get merged by mistake).

@kozlovic kozlovic marked this pull request as draft July 15, 2021 17:50
@neilalexander
Member Author

Thanks @kozlovic, will wait to hear back on a way forward. Hoping we can come up with something suitable — NATS is shaping up very well for us so far, we just need to be able to get it to work in obscure places like WASM as well as cross-process!

@neilalexander
Member Author

Is there any update on whether this was discussed further? We're optimistically hoping for good news. 😄

@derekcollison
Member

@kozlovic and I plan to discuss this week IIRC.

@kozlovic
Member

Actually beginning of next since a member of the team is on vacation this week and we wanted his inputs.

@kozlovic
Member

kozlovic commented Aug 2, 2021

@neilalexander Sorry for the delay. We just had a discussion today about this, and we feel it is important and part of future kinds of deployments, but we would need a bit more information. The WASM limitation may be temporary; are you following WASI developments?
In what WASM environment are you planning to deploy this?

Finally, would you have 15/20 minutes to speak with @derekcollison? He would like to have a short discussion with you. Thanks!

@derekcollison
Member

We are starting to see more interest in this and we ourselves are very interested in the WASM/WASI ecosystem. Possible to get 20 minutes of your time to chat live via Zoom or GH?

@neilalexander
Member Author

Hey @kozlovic, @derekcollison — more than happy to have a chat. I'm in GMT but please feel free to drop me a mail at neilalexander at matrix.org (or, if you are Matrix natives, the same MXID) to arrange a time.

At a high level, our project is Dendrite, which is a Matrix homeserver implementation. We basically need to be able to run everywhere: within the web browser, on mobile devices, and on servers alike. We see in-process connections as a way of getting around the limitation of not being able to open arbitrary sockets with WASM (this probably still holds true in WASI as of today, though I'm not sure) and also of not opening unnecessary sockets on mobile devices. I can explain more on the call about what we're hoping to do.

@derekcollison
Member

Neil,

Have not forgotten about you, promise.

@neilalexander
Member Author

Hey Derek,

Is there any update? Cheers!

@derekcollison
Member

Still very much on our list, just prioritizing KV and ObjectStore abstractions atm.

@neilalexander
Member Author

Hi all, is there any update on this?

@derekcollison
Member

Unfortunately nothing has changed atm.

Still a nice-to-have for us, I think, but not a high priority.

@derekcollison
Member

@neilalexander If you want to update this to the main branch (rebase etc.), we can see if we can get this one in.

@neilalexander neilalexander marked this pull request as ready for review June 21, 2022 08:43
@neilalexander
Member Author

@derekcollison Have rebased against main!

@derekcollison
Member

Thanks! We still have the hang issues that were mentioned, yes?

@neilalexander
Member Author

neilalexander commented Jun 21, 2022

I'm not entirely sure what issues it might introduce, but we've been using this in Dendrite for months without any particular problems. Admittedly, we don't ever stop NATS or close the connection until the point that the entire process shuts down. We've also seen one or two cases of slow consumers, but I suspect that's more to do with our message sizes, as we've also seen it when running NATS in a separate process via TCP.

With an “experimental” label attached it’s usable as-is, but you’d probably be better qualified to figure out if there’s anything I’m missing.

@derekcollison
Member

I think Ivan had mentioned some tests he was doing that could block on failed auth or something.

Member

@kozlovic kozlovic left a comment

Would you mind rebasing again (sorry about that)? There was a test failure and I tried to recycle it on Travis, but it complains about the script not being found (it was changed during the Go 1.18 PR). Thanks!

server/opts.go (review thread, resolved)
@neilalexander
Member Author

Have rebased again.

server/server.go Outdated
	for time.Now().Before(end) {
		s.mu.RLock()
		chk["server"] = info{ok: s.listener != nil, err: s.listenerErr}
		s.mu.Lock()
Member

You need to change that to an RLock(). This has changed in main, so it is causing an issue now with your PR.

Member Author

Fixed!

server/opts.go (review thread, resolved)
Member

@kozlovic kozlovic left a comment

LGTM

@kozlovic kozlovic changed the title In-process connections [ADDED] In-process connections Jun 28, 2022
@kozlovic kozlovic merged commit cb406ed into nats-io:main Jun 28, 2022
kozlovic added a commit that referenced this pull request Jun 28, 2022
The "InProcess" change makes readyForConnections() possibly return
much faster than it used to, which could cause tests to fail.

Restore the original behavior, but in the case of the DontListen
option wait on the startupComplete gate.

Also fixed some missing checks for leafnode connections.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
derekcollison added a commit that referenced this pull request Jun 29, 2022
neilalexander added a commit to neilalexander/nats.go that referenced this pull request Jul 30, 2022