Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GraphQL over WebSocket #140

Merged
merged 18 commits into from
Jun 6, 2022
Merged

GraphQL over WebSocket #140

merged 18 commits into from
Jun 6, 2022

Conversation

enisdenjo
Copy link
Member

@enisdenjo enisdenjo commented Sep 11, 2020

Read the formatted GraphQL over WebSocket Protocol

Implementations

  • graphql-ws
  • Feel free to add your implementation here!

Copy link
Member

@benjie benjie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this work @enisdenjo 🙌 I've not read this in sufficient detail to be sure there's not any major issues, but I did spot one thing about ConnectionInit you might want to address.

rfcs/GraphQLOverWebSocket.md Show resolved Hide resolved
rfcs/GraphQLOverWebSocket.md Outdated Show resolved Hide resolved
@hoangvvo
Copy link

hoangvvo commented Oct 2, 2020

Just my two cents but I think we should drop connectionInit from the spec. It is mostly used for authentication but that should be a matter of user's implementation in the outer websocket layer, just like the way the current GraphQL over HTTP does not have a say on authentication inside GraphQL layer. ConnectionAck will also be dropped this way.

Something like this would be more optimal since the connection is rejected right before the upgrade (in this spec, the socket is established either way). We should steer user over that direction instead.

For connectionParams, user should send them as query params in the WebSocket URL.

There is no mention of connection_terminate in this spec like the one in https://github.com/apollographql/subscriptions-transport-ws, which is good imo since it is also a redundant one.

The less the better, imo.

@enisdenjo
Copy link
Member Author

enisdenjo commented Oct 3, 2020

A protocol should be about security too. It can help steer decisions with a standardised backbone of staying protected. But, in the same time it should be flexible enough to allow bullet-proofing your system with a custom implementation.

Keeping this in mind, the protocol does not specify how you should authenticate, whether you should use cookies or tokens to uniquely identify clients; it simply specifies the means for communicating authorisation and a concrete way to prevent unwanted access.

The most secure and reliable way to authenticate through HTTP is by using headers. Unfortunately, you cannot pass custom headers in the WebSocket HTTP upgrade request from the browser - so this auth approach is simply not feasible.

Well, there are still cookies which the browser handles and provides through headers regardless of the HTTP request. However, this approach is prone to Cross-Site WebSocket Hijacking (CSWSH). There are measures you can take to protect yourself (which are applicable in general), yet the conclusion is still:

If you don't need to access the web session from the server-side WebSocket counterpart, just separately handle authentication and/or authorization using custom tokens or similar techniques within your WebSocket protocol and avoid relating to the web session via cookies or HTTP-Authentication during the handshake request.

Furthermore, URLs should never be used for transferring secrets. Back in 2012, the OAuth specification provided recommendations on threat mitigations. One of the items in the summary is:

Don't pass bearer tokens in page URLs: Bearer tokens SHOULD NOT be passed in page URLs (for example as query string parameters). Instead, bearer tokens SHOULD be passed in HTTP message headers or message bodies for which confidentiality measures are taken. Browsers, web servers, and other software may not adequately secure URLs in the browser history, web server logs, and other data structures. If bearer tokens are passed in page URLs, attackers might be able to steal them from the history data, logs, or other unsecured locations.

@benjie
Copy link
Member

benjie commented Oct 21, 2020

I don't think that "live queries" should be specified in this spec - there's not an established approach for live queries, they can work in multiple ways depending on the vendor (sometimes as subscription operation, sometimes as query operation with @live directive, sometimes as fragments within a query operation with the same directive) so specifying them at this stage seems premature.

If you're using live to refer to an open iterator (e.g. for subscriptions/stream/defer), this is probably ill-advised due to the confusion with the concept of "live queries". We should find an alternative for this, such as an open or streaming operation.

@enisdenjo
Copy link
Member Author

enisdenjo commented Oct 21, 2020

My intention with this commit was just to elevate the restriction of single emission on execute operations. There is an example named "Live Query operation", though, but I added it just for the sake of clarifications and sharing knowledge that the protocol does support the @defered, @stream and @live directive.

But, you're right! I am misusing the live term to refer to an open iterator. Will put an extra thought in this and push a better fitted alternative.

Thanks @benjie!

@hoangvvo
Copy link

hoangvvo commented Oct 31, 2020

Server triggers the x callback

In my opinion, this should not be part of the spec. It looks more like extra implementations by specific libraries.

For example current GraphQL HTTP servers/libraries often have things like formatError or formatErrorFn API to optionally format the errors in a specific way, but there is nothing like Server triggers formatErrorFn if defined in the HTTP spec.

Likewise Keep-Alive should be user-land implementation. For example the doc of the JavaScript library ws even specify a easy-to-do section for it here instead of having it built-in.

@hoangvvo
Copy link

hoangvvo commented Oct 31, 2020

A protocol should be about security too. It can help steer decisions with a standardised backbone of staying protected. But, in the same time it should be flexible enough to allow bullet-proofing your system with a custom implementation.

Keeping this in mind, the protocol does not specify how you should authenticate, whether you should use cookies or tokens to uniquely identify clients; it simply specifies the means for communicating authorisation and a concrete way to prevent unwanted access.

The most secure and reliable way to authenticate through HTTP is by using headers. Unfortunately, you cannot pass custom headers in the WebSocket HTTP upgrade request from the browser - so this auth approach is simply not feasible.

Well, there are still cookies which the browser handles and provides through headers regardless of the HTTP request. However, this approach is prone to Cross-Site WebSocket Hijacking (CSWSH). There are measures you can take to protect yourself (which are applicable in general), yet the conclusion is still:

If you don't need to access the web session from the server-side WebSocket counterpart, just separately handle authentication and/or authorization using custom tokens or similar techniques within your WebSocket protocol and avoid relating to the web session via cookies or HTTP-Authentication during the handshake request.

Furthermore, URLs should never be used for transferring secrets. Back in 2012, the OAuth specification provided recommendations on threat mitigations. One of the items in the summary is:

Don't pass bearer tokens in page URLs: Bearer tokens SHOULD NOT be passed in page URLs (for example as query string parameters). Instead, bearer tokens SHOULD be passed in HTTP message headers or message bodies for which confidentiality measures are taken. Browsers, web servers, and other software may not adequately secure URLs in the browser history, web server logs, and other data structures. If bearer tokens are passed in page URLs, attackers might be able to steal them from the history data, logs, or other unsecured locations.

Cross-Site WebSocket Hijacking can be mitigated by checking Origin header (which is not mutable). However, if Cookie should not be relied on, query param can still be used. https://tools.ietf.org/html/draft-ietf-oauth-v2-bearer-23#section-2.3

The HTTP request URI query can include other request-specific
   parameters, in which case, the "access_token" parameter MUST be
   properly separated from the request-specific parameters using "&"
   character(s) (ASCII code 38).
---
Because of the security weaknesses associated with the URI method
   (see Section 5), including the high likelihood that the URL
   containing the access token will be logged, it SHOULD NOT be used
   unless it is impossible to transport the access token in the
   "Authorization" request header field or the HTTP request entity-body.

It is true that there are security implications in using query params but this approach is still used in many production products. For example, Spotify creates a WebSocket connection by including access_token in the URL.

image

It is always possible to make the access token short-lived so counter security implications.

We provide connection_init but do not really include any specific instruction/usage anyway, so developers can still fall into security loophole anyway. The best thing we can do as library maintainers is to have some notes or hints inside the library's doc so developers can figure out there own approaches.

Since the current HTTP spec do not specify how we should handle authentication or gate-keeping, this spec should not either. Specific GraphQL library can still implemented such feature as an extra.

My point is that this spec should be inopinionated as possible (like the current HTTP spec which does not include such things).

@enisdenjo
Copy link
Member Author

enisdenjo commented Oct 31, 2020

Great feedback! I liked a lot of your suggestions and added them in the recent commit.

I also somewhat elevated authentication restrictions from the protocol, you should be free to deny the WebSocket upgrade through HTTP however you like before even reaching to the point of communication. However, I left the connection Init -> Ack flow inside, I still think that this is the securest way to promote connection acknowledgment, if you chose to do auth within a socket. Having done this, the protocol doesn't care about authentication that happens before the actual socket is established, but specifies how to acknowledge the connection from within it.

Still, I think much of your points make sense, would like to hear more opinions on this topic.

@morris
Copy link

morris commented Dec 6, 2020

Just wrote some interoperability suggestions, maybe someone wants to jump into the discussion: enisdenjo/graphql-ws#83

@D1plo1d
Copy link

D1plo1d commented Mar 12, 2021

@enisdenjo Sorry about the delay, hope this answers everything.

GraphQL Over WebRTC

This Protocol is curated exactly to WebSockets, if WebRTC (or other similar transport protocols) lacks close events - I'd recommend introducing other GraphQL transport protocols specifically made for transports in question.

Sounds good. I'm not immediately sure how to phrase "All the message types from GraphQL over WebSocket plus a connectionError message" for the spec but that sounds like something that can be iterated in a separate PR.

For reference: A proof of concept GraphQL over WebRTC can be found here.

Live Queries

I am a bit out of the loop, what is this patch field you're mentioning? How does it work and why do we need it in the Next message payload?

So before I get into detail I'm not 100% sure whether this belongs in GraphQL Over WebSocket or the GraphQL spec itself. Perhaps that's something you can provide insight on!

The patch field allows a @live query to stream partial updates using RFC 6902 json patches without incurring the serialization and bandwidth overhead of re-sending the entire (potentially lengthy) response to the query. It is similar to apollo's path responses to @defer but allows for array insertions and for multiple keys to be updated in a single payload.

An example workflow would go like this:

  1. A query containing @live is received by the server and it replies with an initial response:
{
  id: '1',
  type: 'next',
  payload: {
    data: {
      article: {
        title: 'Cat Facts',
        likes: 1,
        likedBy: [
          "bill-the-cat-dude@example.com"
        ],
        markdownFullText: "# Imagine a few pages of markdown here"
      }
    }
  },
}
  1. Another user likes the article and so the server sends a patch to only update the values that have changed (via a server-side diffing mechanism):
{
  id: '1',
  type: 'next',
  payload: {
    patch: [
      {
        "op": "replace",
        "path": "/article/likes",
        "value": 2,
      },
      {
        "op": "add",
        "path": "/article/likedBy/1",
        "value": "bills-friend-who-also-likes-cats@example.com",
      }
    ]
  },
}

An implementation of the patch response payload can be found here

@enisdenjo
Copy link
Member Author

GraphQL Over WebRTC

@D1plo1d sounds cool! However, I am wary of changing the Protocol to support additional real-time transports because as we broaden the transport spectrum, we deviate from the WebSocket's transport itself - leaving much of its intricacies to guessing (or misuse). Your change requirement is a perfect example of this, as soon as we allow usage of ConnectionError to communicate the following close event, misinformed users would rely on it instead of the close event itself when using WebSockets (as they did with subscriptions-transport-ws).

Furthermore, seem like in the WebRTC 1.0 spec you can announce a data channel as closed with an error. Could this potentially be synonymous to the WebSockets close event?

Live Queries

@D1plo1d I'd say this is more of an issue with the GraphQL spec because you're actually changing the ExecutionResult's contents ({ data: { ... } } vs { patch: [ ... ] }). The Protocol simply requires a GraphQL ExecutionResult to be transmitted over the Next's message payload. As this Protocol is tailored to GraphQL, execution changes (such as JSON patches) should start from GraphQL as well. I don't think graphql-js's execute or subscribe give you a patch result? The argument is backed by looking at the provided patch response payload implementation, it requires the patch response field to comply with the Operation from fast-json-patch interface.

@D1plo1d
Copy link

D1plo1d commented Mar 15, 2021

@enisdenjo Sounds good on both fronts. Re: Announcing close WebRTC data channels: I'm reading through this now but I think this is specifying if the underly SCTP transport closes with an error as opposed to an error that is settable from within JavaScript which would be more like websockets. In any case a seperate GraphQL Over WebRTC proposal can follow and build off of this one.

@enisdenjo
Copy link
Member Author

About the client to server ping topic: I hope enisdenjo/graphql-ws#117 (comment) shines some light on how all argumented problems can be solved without any Protocol adjustments.

@enisdenjo
Copy link
Member Author

Heads up everybody, this Protocol now supports bidirectional subprotocol ping and pong message types. More info on the PR: enisdenjo/graphql-ws#201, together with the full discussion: enisdenjo/graphql-ws#117. 😄

@hoangvvo
Copy link

hoangvvo commented Jun 9, 2021

Heads up everybody, this Protocol now supports bidirectional subprotocol ping and pong message types. More info on the PR: enisdenjo/graphql-ws#201, together with the full discussion: enisdenjo/graphql-ws#117.

Why do we choose to include this in the spec? I think it might be better to be a library-specific implementation. Since I think ping-pong is not something GraphQL-specific but is more of a transport-concern.

@enisdenjo
Copy link
Member Author

Also because it is a transport concern. Please read more here: enisdenjo/graphql-ws#117.

@enisdenjo
Copy link
Member Author

To summarise the ping/pong message addition:

  • The keep-alive belongs on the transport level, not the application level. Since this Protocol is strict about message validation, the ping and pong messages have to be explicitly defined.

    About the suggestion on using a custom GQL schema for ping and pongs (in Client to server ping message enisdenjo/graphql-ws#117 (comment)), it is indeed an overkill and not the right way to do it. Some of the reasons are that you have to pollute your schema, or even have a separate one; and, if used for latency metrics, the measurement is not accurate simply because you have the overhead of validation, schema setup and the complete GQL execution.

  • Browsers cannot ping the server; while the HTML spec allows client initiated pings, it does not require exposure of any and most probably never will.

  • Some server side solutions, specifically managed ones, dont expose the WebSocket transport level ping/pong APIs. One example is the AWS API Gateway (Socket idleness detection andyrichardson/subscriptionless#3)

  • Both parties should be able to ping the other party at any time. This addition guarantees that even if the WebSocket API is partial.

  • Suggestions on using browser's online and offline events are partial since they handle only one case - if the client's internet is gone. However, one crucial thing it does not consider are issues/disruptions somewhere between the gateways to the server itself (for example a proxy).

  • Users should not worry about transport level concerns, they'd want connection liveliness and stability guarantees out-of-the-box. Ability to measure latency, drop unresponsive or unstable connections, need to be easy to implement on both sides all while being within the transport's scope.

  • Although TCP/IP in theory notifies you when a socket breaks, in practice, particularly on things like mobile and satellite links, which often "fake" TCP over the air and put headers back on at each end, it's quite possible for a TCP session to "black hole", i.e. it appears to be open still, but in fact is just dumping anything you write to it onto the floor.

    So the keepalive confirms that you really are still talking to a broker (and from the broker end, that a client really is still connected), particularly when you're sitting on a long-connected connection, either subscribed to an infrequently publishing topic, or publishing at qos0 (i.e. no acknowledgement) to a broker.

    Andy Stanford-Clark on the topic "Why is the keep-alive needed?"

@hoangvvo
Copy link

hoangvvo commented Jun 11, 2021

Thanks for such a detailed answer! I understood and it made sense. However notice that a libarary can always handle ping/pong outside of the transport by directly using the underlying WebSockets API. For example:

function someGraphqlWsLibary(ws, options) {
   ws.on('connection', (socket, request) => {
     handleGraphQL(socket, request)
   })
   
   // library will add another listener
     if (options.usePingPong) {
       ws.on('connection', (socket, request) => {
         handlePingPongThing(socket)
       })
   }
 }

This spec should be more about integration of GraphQL on WebSocket. And GraphQL is more about execution/validation of a request rather than connectivity status of a specific transport. Notice that this GraphQL spec is a protocol of a transport, not the entire transport itself: we can more than just GraphQL living on the WebSocket (like some regular WebSocket APIs, some socketio channel, etc.). Delegating the connection stayalive to GraphQL WS handler, (sometimes) a component of the entire WebSocket setup, may not be the best idea.

But I think the use case for adding ping-pong does have their sense too, especially dx. Let's see what others think about this.

@enisdenjo
Copy link
Member Author

enisdenjo commented Jun 11, 2021

I understand your argument; however, it is not possible to handle ping pongs in the same WebSocket using the graphql-transport-ws subprotocol because "receiving a message of a type or format which is not specified in this document will result in an immediate socket closure with the event 4400: <error-message>."

What you're aiming at would require a separate connection, which is an even greater overkill than polluting your GraphQL schema with ping/pong fields (as suggested here: enisdenjo/graphql-ws#117 (comment)).

If using the WebSocket spec level pings and pongs would be an option (by having a consistent implementation across all parties and services), this protocol would've never mentioned them (which was my original path); however, the plain reality is that you cannot use them consistently. Hence, the need of ping/pong message types.

Furthermore, the protocol document simply allows them, and vaguely describes their intent; but, you're absolutely free not to use them at all, there is no requirement. This is exactly the default mode of the graphql-ws library, using subprotocol level ping/pongs is an opt-in.

@glasser
Copy link
Contributor

glasser commented Jul 1, 2021

It's really exciting to see this protocol standardization coming together!

Out of curiosity, is there anything written anywhere that compares this protocol to the old subscriptions-transport-ws protocol? While I trust that you're correct in believing that this protocol improves security and stability and reduces ambiguity, I'd be curious to understand more precisely how it does that.

(Do I understand correctly that the graphql-ws npm package implements the graphql-transport-ws subprotocol and the subscriptions-transport-ws and graphql-transport-ws npm packages implement the graphql-ws subprotocol?)

(For context, I'm the lead maintainer of Apollo Server. We originally built subscriptions-transport-ws at Apollo but as is pretty evident, we have not actively maintained it for a long time. It's great to see a better-supported package and protocol out there in the ecosystem. We are hoping to bring subscription support to the Apollo platform in a more holistic and supported fashion than we did last time, sometime in the next year. Most likely we'll want to build on this newer protocol and library and collaborate with everyone else using it rather than resurrect the old one, so I'm trying to start the process of learning about the differences between the old and new protocol.)

@enisdenjo
Copy link
Member Author

Hey there, I am very excited too! I'll jump straight to the point.

This protocol is designed to accompany WebSockets and implement on top of it. One example of it is that it does not have a "GQL_CONNECTION_TERMINATE" message and it rather uses the WS level close and communicates the reasoning through the close event's code and reason. It removes unnecessary messages and reduces the type name sizes increasing efficiency by a few notches.

It was also originally designed without describing a keep-alive mechanism (allowing the implementor to leverage the WebSocket spec level "heartbeat"), but because of #140 (comment), subprotocol level ping and pongs were introduced. These pings/pongs are much more useful and extensible than unidirectional keep-alive message (comparing subscription-transport-ws's keep-alive message). Allowing payloads and clear responses (pong to a ping, bidirectionally).

Furthermore, security-wise, it is very strict on the authentication process and requires message conformity otherwise kicking the client off immediately. For example, it does not allow invalid messages (messages not defined by the protocol) and it absolutely requires a single connection acknowledgement message which also has a timeout. Read more about the reasoning behind the acknowledgement message here: #140 (comment) and here: #140 (comment).

Also, is a bit more descriptive about what each message transmits and how that looks like. Exactly describing the JSON fields accompanying messages, allowing it to classify everything non-compliant as invalid.

In essence, it is inspired by the initial subscription-transport-ws protocol; however, it aims to refine the rough edges and bring the protocol to a far-reaching level. I discussed the, then current state, of GraphQL over WebSockets at the August 2020 GraphQL WG meeting and got the green light of rewriting the spec (due to tight integration of subscription-transport-ws protocol and the related libraries, and Facebook using a protocol of their own). You can watch the discussion over at YouTube.

Do I understand correctly that the graphql-ws npm package implements the graphql-transport-ws subprotocol and the subscriptions-transport-ws and graphql-transport-ws npm packages implement the graphql-ws subprotocol?

Yes, two reasons: graphql-ws was originally named graphql-transport-ws; and graphql-ws suprotocol name was already reserved by subscriptions-transport-ws.

I hope I didnt overlook anything and managed to answer your questions. 😄

@glasser
Copy link
Contributor

glasser commented Jul 1, 2021

@enisdenjo Thanks, that's great! We're not quite at the point of kicking off the "subscriptions at Apollo, for real this time" project but when we are I'd definitely love to chat more.

@enisdenjo
Copy link
Member Author

enisdenjo commented Jul 1, 2021

@glasser cool, thats fine!

I'd just like to point out that you dont need to rip the bandage off suddenly by kicking off subscriptions completely. You can support both in parallel (server and client side) to ease off the transition because of the distinct subprotocol names.

For the server, you can consult the "ws server usage with subscriptions-transport-ws backwards compatibility" recipe.

And for the client, you simply can do a test-try connect to a server and check if the socket got closed with a 1002. Very similar to the GraphQL Playground graphql-ws support PR: graphql/graphql-playground#1326.

P.S. left a comment at the AS3 issue (apollographql/apollo-server#4960 (comment)) too. Better to discuss the specifics of supporting both there instead.

@fmatosqg
Copy link

My 2 cents, SSE are unidirectional, and if there are tweaks for the Ping/Pong part of the protocol, the same can be used for Graphql over SSE. This is somehting I'm working on for Android / kotlin clients, pls see apollographql/apollo-kotlin#3756

@benjie
Copy link
Member

benjie commented Jun 6, 2022

As this is an RFC, it should be merged. (I cannot merge it currently.)

@ghmcadams ghmcadams merged commit 66e04b9 into graphql:main Jun 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.