
Security Review of TLS1.3 0-RTT #1001

Closed
@colmmacc

Description

Introduction

At the Eurocrypt/IACR TLS:DIV workshop on 2017-04-30, kindly facilitated by the Everest team, I presented the results of a security review of the TLS 1.3 0-RTT section. The security review was performed as part of the process of implementing TLS1.3 in s2n.

The review focused on two known issues: the absence of forward secrecy for all data, and the replayability of 0-RTT data. As it turns out, these issues can be worked around, and it is possible to provide 0-RTT, Forward Secrecy and anti-replayability (save for the Gillmor downgrade attack case) at the same time. Many thanks to Eric Rescorla for identifying how the workaround can be integrated with the existing TLS1.3 scheme, modulo a neat optimization that Eric also came up with.

However, TLS1.3 0-RTT is insecure by default, and based on the current draft it is likely that TLS implementations not using workarounds will create real-world vulnerabilities. I believe that the attacks enabled by these vulnerabilities are more practical, and more serious, than is generally appreciated. Each attack enabled is more severe than other vulnerabilities that have been considered "must upgrade" for TLS implementations.

This issue is intended as a summary of the attacks, their implications, and the mitigations that an implementation may perform, as well as suggested changes to the specification that would reduce the risk related to these issues.

The most serious issues concern replays, and this summary includes five practical real-world attacks against applications using TLS1.3 as described in the draft. However, before discussing replays, it is helpful to understand how TLS1.3 and tickets interact.

TLS1.3 and tickets: STEKs remain a weakness in TLS

To support TLS resumption and 0-RTT a server must know the session parameters to be resumed (PSK, authentication, etc ...). The most common implementations of TLS tickets have the server using Session Ticket Encryption Keys (STEKs) to create an encrypted copy of the session parameters which is then stored by the client. When the client resumes, it supplies this encrypted copy, the server decrypts it, and has the parameters it needs to resume. The server need only remember the STEK.

If a STEK is disclosed to an adversary, then all of the data encrypted by sessions protected by the STEK may be decrypted by an adversary. STEKs are therefore a weak spot in the overall design of TLS; disclosure of a single small key can result in compromising an unbounded amount of data. While it is never possible to secure data that is transmitted during a compromise, it is a regression from forward secrecy that historical data transmitted prior to the compromise is not protected.

STEKs must be synchronized across a large number of hosts. The Akamai CDN, for example, consists of over 200,000 hosts. To be most effective for resumption, a STEK must be accessible on the subset of these hosts responsible for handling a domain. This subset is measurable as at least tens to hundreds of thousands of hosts based on DNS queries and host fingerprinting. In the case of some operators, hosts may also have the STEKs on disk, subject to risk of physical theft or seizure, depending on the architecture of the provider (though generally large providers such as Akamai do not store keys on disk).

This scope presents a large surface area to attackers. A single vulnerable host, or a vulnerability in how STEKs are synchronized, can lead to STEK disclosure. For the most part these security challenges are handled out of view of public audit, and it is difficult to capture how well best practices are applied. There has been some recent work by Springall, Durumeric, and Halderman, "Measuring the Security Harm of TLS Crypto Shortcuts", which quantifies the use of overly long-lived STEKs.

Separate from traditional host security risk, there is also a cryptographic risk. Attackers may record and replay a ticket to a server at will, and tickets are commonly encrypted using AES, an algorithm that is vulnerable to side-channel analysis in some situations. For example, if an attacker can gain a vantage point "close" enough to a non-constant-time, non-constant-memory-access implementation of AES (e.g. software encryption on an old host that does not support hardware AES acceleration) then they may be able to discover the STEK through side-channel analysis. At a minimum, server implementations of TLS would be wise to use an algorithm designed for side-channel resistance for ticket encryption, regardless of the encryption algorithm intended for the session itself.
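To make that last suggestion concrete, here is a minimal sketch of sealing ticket contents with ChaCha20-Poly1305, a cipher designed to run in constant time in software, rather than AES. It assumes the Python `cryptography` package and an opaque `session_parameters` blob; the ticket layout is purely illustrative, not a recommended format.

```python
# Minimal sketch: encrypt session parameters for a ticket with ChaCha20-Poly1305,
# a cipher designed to run in constant time in software, instead of AES.
# The ticket layout here (nonce || ciphertext) is illustrative only.
import os
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

stek = ChaCha20Poly1305.generate_key()          # 32-byte Session Ticket Encryption Key

def seal_ticket(session_parameters: bytes) -> bytes:
    nonce = os.urandom(12)                      # unique per ticket
    ct = ChaCha20Poly1305(stek).encrypt(nonce, session_parameters, b"ticket")
    return nonce + ct                           # opaque blob stored by the client

def open_ticket(ticket: bytes) -> bytes:
    nonce, ct = ticket[:12], ticket[12:]
    return ChaCha20Poly1305(stek).decrypt(nonce, ct, b"ticket")
```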

TLS1.3 is much better, but critical data still lacks forward secrecy

TLS1.3 makes huge strides in improving these security risks. Tickets are now associated with Resumption Pre-Shared Keys (RPSKs), which are not the same as the keys that encrypted the original session. Additionally, upon resumption, TLS1.3 supports a key schedule that means the only user data protected by the RPSK is 0-RTT data, which is optional.

While 0-RTT is intended for a relatively small volume of data at the beginning of a connection, it is unfortunately very likely that this section of data will contain critical credentials: credit card numbers, passwords, cookies and other bearer tokens. Typically these requests, if compromised, can be used to generate the entire response. Thus, with TLS1.3, a large volume of critical user data remains secured ultimately by STEKs - which, as we've seen, are a weak spot. In practice, meaningful forward secrecy is not provided when 0-RTT is enabled.

Suggested mitigation: Support Forward Secrecy for 0-RTT data

An alternative to using STEKs and encrypted tickets is to use tickets as keys for a single-use session cache. When a server issues a ticket, it can store the session parameters in a cache. When a server receives a ticket for use, it looks up this ticket in the cache, which supports an atomic retrieve-and-delete operation.
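A minimal in-process sketch of this arrangement follows. A real deployment would back it with a networked store offering an atomic retrieve-and-delete; the class and method names here are illustrative assumptions, not a reference design.

```python
# Minimal sketch of a single-use session cache: the ticket is a random key into
# the cache, and retrieval atomically deletes the entry so it can never be reused.
import os
import threading

class SingleUseSessionCache:
    def __init__(self):
        self._lock = threading.Lock()
        self._sessions = {}                      # ticket -> session parameters

    def issue_ticket(self, session_parameters) -> bytes:
        ticket = os.urandom(16)                  # opaque lookup key sent to the client
        with self._lock:
            self._sessions[ticket] = session_parameters
        return ticket

    def redeem_ticket(self, ticket):
        # Atomic retrieve-and-delete: a second redemption of the same ticket
        # (i.e. a replay) returns None and the resumption/0-RTT attempt is rejected.
        with self._lock:
            return self._sessions.pop(ticket, None)
```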

This arrangement provides Forward Secrecy for all sessions successfully retrieved from the cache. If the server, or cache, is compromised then generally only data pertaining to future, yet-to-be-used, sessions is disclosed. Conveniently, operational and application errors favor security; for example downtime, crashes, and so on generally result in key erasure. Economic incentives also favor deleting keys (to make room for new ones) over keeping keys (as with STEKs, where a more long lived key is operationally cheaper).

The transactional requirements for a single-use cache and strike registers (an anti-replay mechanism) are also different. With strike registers, it is critical to know when a strike register was available and unavailable, so as to discard tickets from any period during which the strike register may not have recorded observations durably. To do this, all updates to a strike register must be sequential relative to a global checkpoint (i.e. all updates arriving prior to the checkpoint must be committed). A single-use cache is free to make concurrent updates that are unsequenced relative to each other or any checkpoint (though updates and reads to any single key must be sequenced).

If forward secrecy can be provided in this manner, why is this arrangement not common today? The easy answer is that this mode is not efficient in current versions of TLS, where tickets are intended for multiple uses. This arrangement also comes at a cost: a server must operate the cache. At least in my view, the operational costs are worth the security benefit of meaningful Forward Secrecy. The cost of memory has lowered considerably since the inception of tickets. For example, an AWS ElastiCache Redis instance capable of caching millions of sessions costs as little as $0.017 per hour. A custom cache with a dedicated design can be implemented considerably more cheaply again.

The second cost is latency; performing a lookup costs time. Within the data center, this cost too is no longer significant. For example, within the AWS EC2 VPC network it is possible to achieve lookup times measured in tens of microseconds. This time is not significant in the context of a TLS connection that would benefit from 0-RTT optimization, where latencies are usually over three orders of magnitude greater, at tens to hundreds of milliseconds.

The third cost is that it is not feasible to share state across data centers or a wide geographic area (e.g. a global CDN). A ticket issued from a CDN node in one city would not help a user resume a connection if they are later routed to a different city. In my view (having built two CDNs), this is not a significant problem. In practice, when users are routed to different locations it is common for cache misses to occur, and for the related TCP Fast Open optimization (which also speeds up connections) to fail (due to different IP addresses when not using IP anycast). Operators already work hard to maximize user affinity to locations, and any cache misses at the TLS level can be very quickly repaired.

Note also that one of the implications of Krawczyk's 2005 paper on Perfect Forward Secrecy is that some transactionally mutable server-side state is required to provide Forward Secrecy for 0-RTT-like messages (prior to any handshake). For example, puncturable encryption, another technique aimed at providing 0-RTT forward secrecy, requires transactionally mutable server-side state. In the case where a 0-RTT section arrives on the other side of the globe from wherever the store is located, it takes hundreds of milliseconds to complete a transaction, defeating the point of the 0-RTT feature. Forward secrecy and global resumption are likely mutually incompatible in any scheme. We should favor the security of forward secrecy.

Suggested changes to TLS1.3 draft

Make multiple tickets from the same connection cryptographically independent

In today's TLS1.3 draft a server may issue multiple tickets on a connection, but these tickets are not cryptographically independent. Unfortunately this makes it impossible for a single-use session cache to distinguish between a ticket issued pre- and post-authentication, and it prevents servers from issuing meaningfully different tickets in order for a client to build up a pool of tickets.

With #998 , Eric Rescorla has suggested an easy fix for this. I strongly support adopting this change, and thank Eric for the suggestion.

Change clients "SHOULD" use tickets once to "MUST"

The TLS 1.3 draft-20 specifies that:

Clients SHOULD attempt to use each ticket no more than once, with more recent tickets being used first.

To provide meaningful forward secrecy on the client side, clients "MUST" use each ticket no more than once, and "MUST" delete any session parameters stored with the ticket.

There is also another reason for this "MUST": any client that attempts to use a ticket multiple times will also likely leak the obfuscated_ticket_age value, which is intended to be secret.
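A minimal sketch of the client-side behavior this implies, assuming a simple in-memory ticket pool (the class and method names are illustrative, not taken from any particular TLS library):

```python
# Client-side sketch: tickets are held in a pool, the most recent is taken first,
# and taking a ticket removes it (and its session parameters) so it is never reused.
import threading

class ClientTicketPool:
    def __init__(self):
        self._lock = threading.Lock()
        self._tickets = []               # (ticket, session_parameters), newest last

    def store(self, ticket, session_parameters):
        with self._lock:
            self._tickets.append((ticket, session_parameters))

    def take(self):
        # Returns the newest ticket exactly once; afterwards neither the ticket
        # nor its parameters remain on the client, so a later compromise of the
        # client cannot expose sessions already resumed with it.
        with self._lock:
            return self._tickets.pop() if self._tickets else None
```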

Designate or Discern STEK-less tickets

While today STEKs are a common practice, and many operators do have reasonable implementations in place, over time their presence as a weak spot may (hopefully!) lead to their eradication, similar to how non-PFS key agreement has been greatly diminished through tools such as the ssllabs.com ratings.

It is possible for a client (or eventually an ssllabs.com) to validate that a ticket cannot have used a STEK: if the ticket data is smaller than the RPSK, it cannot contain an encrypted copy of the session parameters. However, today it is not possible for clients to ask for such a ticket, only to reject one.

In practice it would be useful for clients to advertise or encourage support for STEK-less tickets, either by advertising a maximum ticket size supported, or by having a designated "STEK-free" ticket type. At a minimum, this prevents STEK-dependent servers from generating tickets that clients have no desire to keep.

Replay is a big problem, Replay is a big problem

A delete-on-use session cache robustly prevents any replay of a ticket (and hence replay of an associated 0-RTT data section), but such a cache is neither common nor required for tickets in TLS 1.3. To the contrary, the draft specification calls out that replay is expected:

There are no guarantees of non-replay between connections. Unless the server takes special measures outside those provided by TLS, the server has no guarantee that the same 0-RTT data was not transmitted on multiple 0-RTT connections (see {{replay-time}} for more details).

This is unworkably insecure and will lead to many practical real-world security issues. As also noted in the draft:

TLS provides a limited mechanism for replay protection for data sent by the client in the first flight.

This suggested mechanism is to validate the ticket age presented by a client. The ticket age presented should correspond to the time since the ticket was issued by the server plus the round-trip time between the client and server.
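Roughly, that check could look like the sketch below. The helper names, the window constants, and the simplified handling of the obfuscated_ticket_age value are assumptions for illustration; the draft's actual computation and tolerances differ in detail.

```python
# Sketch of the draft's ticket-age freshness check: the server compares the age
# reported by the client (after removing the ticket_age_add obfuscation) with the
# age it computes itself, and rejects 0-RTT if the difference exceeds a window
# that must absorb both the round-trip time and client clock error.
import time

MAX_RTT_S = 0.5          # generous allowance for satellite / rural links
CLOCK_SKEW_S = 10.0      # tolerance suggested by the draft

def zero_rtt_fresh(obfuscated_ticket_age_ms: int, ticket_age_add: int,
                   ticket_issued_at: float) -> bool:
    client_age_s = ((obfuscated_ticket_age_ms - ticket_age_add) % (2**32)) / 1000.0
    server_age_s = time.time() - ticket_issued_at
    # The gap between the two ages should be roughly one network flight time.
    return abs(server_age_s - client_age_s) <= MAX_RTT_S + CLOCK_SKEW_S
```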

There are three problems with this mechanism. The first problem is that the maximum bound for a round-trip-time is quite high. RTTs of 500ms are not unusual for satellite internet, or rural broadband schemes, and these are the very use-cases 0-RTT most benefits. In a 500ms window an attacker can easily send tens of thousands of replays.

The second problem is that in the real world there is clock skew. Clocks and timers can drift based on factors such as temperature, CPU power saving, system hibernation and more. This is particularly true of low-power devices, which are increasingly prevalent. Indeed the TLS1.3 draft suggests tolerating up to 10 seconds. In the real world, providers may go higher. The maximum lifetime for a ticket is 7 days. My own private experiment based on requests from low-power devices showed that over 7 days the 99th percentile clock drift was around 2 seconds, but the 99.9th percentile was around 40 seconds. A provider therefore has a natural and strong incentive to increase the window of tolerance, in order to permit more clients to resume and use 0-RTT. Regardless, even with a 10 second window, an attacker can send millions of replays.

The third problem is that both of these cases so far are per-single-destination. If many servers share a STEK, as is common, then it is possible to replay at these rates to each server. With CDNs consisting of hundreds of thousands of nodes, it is likely that the attacker's ability to generate replays is bounded only by the availability of bandwidth. In short: millions to billions of replays are possible within the scheme outlined by the TLS1.3 draft.

In practice, millions of replays is sufficient to exploit measurable cryptographic side-channels, if the underlying implementation is vulnerable to any. As we'll see, it also enables at least five types of serious attack.

Attack 1: HTTP is not replay tolerant

In the evolution of the TLS1.3 specification, it has been stated that web requests must already be replay tolerant, as browsers will retry interrupted connections. While the latter is true, it does not imply the former.

Firstly, not all use of TLS is for HTTP, and users of protocols other than HTTP are likely to desire and enable the benefits of 0-RTT. Secondly, not all HTTP requests are made by web browsers. In fact, to render a single browser request for amazon.com, Facebook, or a Google search often requires hundreds of "behind the scenes" HTTP requests to internal micro-services. These requests sometimes span data centers and can benefit from 0-RTT optimization. As the "Internet of things" develops, these same types of requests are now also common between industrial and home settings and cloud servers accessed via the WAN. These settings include networks that are not well-secured and are subject to relatively easy eavesdropping.

Many clients for these services do not retry by default, or pay painstaking attention to how retries are implemented. This is especially true of requests that implement transactional applications. In some asynchronous transaction schemes, clients need to be careful to give each commit attempt a unique ID, separate from the unique ID for the commit itself. These applications are often careful to preserve an order between related requests to resolve deadlocks and ties.

More common still are clients that follow a "Try, Wait, Read, Retry-if-not-there" cycle to avoid creating duplicate entries. For example, a client may try a request, and if that request times out, it may wait a fixed time period, poll for success of the previous request (maybe it did succeed) and only then try it again. Applications such as these are generally not tolerant of even a single uncontrolled replay. Other applications eliminate retries entirely, and make requests at a fixed constant rate as an important measure designed to reduce the risk of overload or cascading outages during partial failure events.
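As a concrete illustration of that cycle, here is a small sketch; the `submit` and `recorded` callables and the wait period are hypothetical stand-ins for application logic.

```python
# Sketch of a "Try, Wait, Read, Retry-if-not-there" client. It tolerates its own
# controlled retry, but an uncontrolled 0-RTT replay happens outside this loop
# entirely and creates a duplicate the client can never observe or prevent.
import time

def submit_once(payment, submit, recorded, wait_s=5.0):
    # submit(payment) is a hypothetical call that may raise TimeoutError;
    # recorded(payment_id) is a hypothetical poll for whether it actually succeeded.
    try:
        return submit(payment)              # Try
    except TimeoutError:
        time.sleep(wait_s)                  # Wait
        if recorded(payment["id"]):         # Read: maybe the first attempt succeeded
            return payment["id"]
        return submit(payment)              # Retry only if not there
```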

In 2015, Daniel Kahn Gillmor discovered a combined replay and downgrade attack against 0-RTT sections: if an active attacker can block server responses to a 0-RTT request, while also disabling the server's record (strike register) of observed 0-RTT sections (a DoS attack may achieve this), then the server may be forced to refuse 0-RTT data on the subsequent retry. This will force the client to downgrade and repeat the request as regular non-0-RTT data.

In fact, as noted earlier, it is never safe for clients to repeat a ticket if one is concerned about keeping the secrecy of the obfuscated ticket age. So reasonable clients may always retry with a non-0-RTT attempt, or use a different ticket if one is available to them. Although a provider may hash the entire 0-RTT section to derive a key for use with a strike register, this requires buffering, and it is more common to use a key derived from the ticket. Both of these factors make the Gillmor attack even more practical; the strike register is probably irrelevant anyway.

A single-use cache does not mitigate this attack, but notice that in the Gillmor attack the client is made aware of the original failure, and can control the nature of the retry. The client knows that the request may have failed, or may have succeeded. And so careful clients may enforce their retry requirements. Without a 0-RTT anti-replay mechanism, a request may be silently replayed millions of times without the client's knowledge. That is a materially different kind of attack that breaks existing systems in unexpected ways.

Attack 2: Exhausting tolerances

Many applications also fail to take into account fully correct REST design patterns, and implement non-idempotent GET requests. Cloudflare provide a great example in their blog post on Zero-RTT.

For example, HTTP requests can trigger transfers of money. If someone makes a request to their bank to “pay $1000 to Craig” and that request is replayed, it could cause Craig to be paid multiple times. A good deal if you’re Craig.

While it is true that browsers may retry a request like this today, as we've seen in attack 1, it is not true that only browsers make such requests.

More important is that while an attacker may cause a browser to retry this type of request perhaps tens to hundreds of times, that kind of attack is active, consumes the user's bandwidth and CPU, and must pass through the user's firewalls and other controls.

A 0-RTT replay attack, as we've seen, can be performed up to millions of times, and mostly out of band using the attacker's resources (though the attacker must also be passively in-band, to copy the original 0-RTT section). Repeating a request once, as the Gillmor attack permits, may occasionally trigger a manual refund process. Repeating the request millions of times may bankrupt a business. This is a materially different kind of risk.

How practical is it for applications to mitigate attacks 1 and 2?

For the purposes of attacks 1 and 2, consider what is necessary on the application's side to mitigate this kind of issue; it must make the requests themselves replay-safe. One popular approach is to make the request idempotent by adding an explicit, or synthesized, idempotency key that represents the invocation. See the Stripe blog post on just this topic. The key must be committed to a data store that can provide an atomic uniqueness guarantee, and since this commit must be concurrent with the operation itself, it must generally occur in the data store the application is mutating.
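Here is a minimal sketch of that pattern using SQLite, whose unique constraint lets the key commit and the application's own write share one transaction. The table and column names are illustrative.

```python
# Sketch: the idempotency key is committed in the same transaction as the
# application's own write, so a replayed request fails the unique constraint
# and performs no second transfer.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY);
    CREATE TABLE transfers (id INTEGER PRIMARY KEY, payee TEXT, amount INTEGER);
""")

def transfer(idempotency_key: str, payee: str, amount: int) -> bool:
    try:
        with db:   # one atomic transaction for both writes
            db.execute("INSERT INTO idempotency_keys VALUES (?)", (idempotency_key,))
            db.execute("INSERT INTO transfers (payee, amount) VALUES (?, ?)",
                       (payee, amount))
        return True
    except sqlite3.IntegrityError:
        return False    # duplicate (e.g. a replayed 0-RTT request): do nothing

transfer("req-123", "Craig", 1000)   # performs the transfer
transfer("req-123", "Craig", 1000)   # replay: rejected
```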

One immediate problem is that not all applications use such data stores. An eventually consistent data store does not provide these kinds of guarantees, though it may provide a guarantee around when "eventually" the store is consistent. This is why some clients perform the "Try, poll, Try again" cycle.

It can be tempting to suggest that idempotency could be provided by a logically separate component, responsible only for preventing re-occurrences. As it turns out, it is not possible to effectively guarantee uniqueness from "outside" of the application's central data store. Consider a theoretical micro-service designed to provide "idempotency as a service": it could accept idempotency keys and commit them on behalf of the application while refusing duplicates. This naive arrangement breaks when the micro-service accepts and commits the key, but the application's own update to its data store fails. Then the user's operation fails but cannot ever be repeated. To resolve this, the micro-service and the application service must use a coupled and distributed transaction protocol, and things get complicated quickly.

To underscore how subtle and hard a problem this area of idempotency can be, it is worth looking at one of CloudFlare's 0-RTT anti-replay mechanisms. To make things easier for applications, CloudFlare adds an HTTP header to outgoing, proxied, requests that originated in 0-RTT sections:

Cf-0rtt-Unique: 37033bcb6b42d2bcf08af3b8dbae305a

The hexadecimal value is derived from a piece of data called a PSK binder, which is unique per 0-RTT request.

An application can use this value as a convenient uniqueness key, to mitigate 0-RTT replay. However, this isn't quite sufficient. A retry request triggered by the Gillmor attack will not be associated with the same PSK binder, and so application-level idempotency is required anyway. Of course, an application designer who has no idempotency key available to them may decide to use the CloudFlare-provided key pragmatically; at least they are now defended against 0-RTT mass-replays, and this is a sensible use of the neat feature. However, since applications often evolve to eventually include idempotency keys, an application may be left with a transition period where both uniqueness keys are required. Many NoSQL datastores are limited to a single index and do not provide for enforcing multiple uniqueness constraints.

Attack 3: Compromising secrecy with TLS1.3 0-RTT

To provide the security guarantee of secrecy, it is not sufficient that requests are idempotent and replay-tolerant. The requests must also be handled in a manner that is free of any observable side-effects. This is extremely difficult to achieve. It is a core focus of strong cryptography, where errors in side-effect-free programming are an area of constant research and frequent vulnerabilities. Higher-level applications are generally not concerned with this challenge at all, and are poorly prepared for the implications of replays.

Take, for example, a simple side effect: caching. A read-only request is by definition idempotent, but if a cache is present, that cache can affect observable timings and response headers in ways that defeat the secrecy of the request.

Suppose a user fetches a piece of content from a CDN using a 0-RTT request, and that piece of content is prohibited and contrary to the principles of a totalitarian regime. Ordinarily only the size of the download of the content is disclosed to a man-in-the-middle attacker, and as we'll see later, TLS1.3 includes support for measures designed to help defeat traffic-analysis attacks that can use this size to identify the content.

But with replays it is now feasible to probe the CDN caches to determine what the content was. First the attacker copies the 0-RTT section, then replays it to a series of CDN nodes, choosing nodes that are unlikely to have the content already (e.g. in different geographic regions). The CDN will then fetch, and cache, the content.

The attacker can then make probe requests for suspected illicit content and determine if it was cached or not (if it loads quickly, or slowly, or if a cache max-age header lines up with the replayed request). Note that the attacker can take their time with the probes and can spread probe requests over a relatively long time period. Any noise or uncertainty in the process can be countered by using additional replays to more nodes to increase confidence.

This is just one basic example using a typical CDN cache, but applications use caches at many other layers. It is likely that all of these caches can be probed in some way to reveal details of encrypted requests.

Attack 4: Exhausting throttles with TLS1.3 0-RTT

It is a common operational and security measure to throttle application requests. For example, a given customer may be permitted to perform as many as 10,000 requests/second but no more.

To avoid simple spoofing risks, many such systems perform throttling post-authentication. For example, the request may be signed cryptographically (see the AWS SIGv4 signing protocol or the OAuth signing process), and that signature is verified prior to throttling. This post-authentication property is one reason why such protocols are designed to be extremely fast to verify, which often means as much cryptography as possible must be pre-computed, making random nonces infeasible in many cases.

For such systems, 0-RTT data means that legitimately signed requests that were previously considered to be secret and non-spoofable are now replayable by attackers. This enables a new and realistic denial-of-service vulnerability capable of locking customers out of their accounts.
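As a rough sketch of why (HMAC stands in for a SIGv4-style signature; the names, the limit, and the one-second window are assumptions): every replayed copy of a single legitimately signed request still verifies, and still consumes a unit of the customer's quota.

```python
# Sketch: signature verification happens before throttling, so each 0-RTT replay
# of one legitimately signed request consumes a unit of the customer's quota.
import hmac, hashlib, time
from collections import defaultdict

LIMIT_PER_SECOND = 10_000
buckets = defaultdict(int)    # (customer, second) -> request count

def handle(customer: str, key: bytes, body: bytes, signature: bytes) -> str:
    if not hmac.compare_digest(hmac.new(key, body, hashlib.sha256).digest(), signature):
        return "403 bad signature"          # spoofed traffic stops here
    bucket = (customer, int(time.time()))
    buckets[bucket] += 1
    if buckets[bucket] > LIMIT_PER_SECOND:
        return "429 throttled"              # replays of a valid request land here
    return "200 ok"
```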

Attack 5: Enabling easier Traffic Analysis with TLS1.3 0-RTT

With traffic analysis it is often possible for a passive attacker to decloak what content an encrypted session handled. For example, when a user browses Wikipedia, an attacker may be able to determine which page the user is viewing because the combination of HTML, image, and CSS sizes on a particular Wikipedia page is highly likely to be unique, and even though the content is encrypted, the attacker can observe the sizes. This type of attack has become slightly easier with the recent adoption of stream ciphersuites such as AES-GCM and ChaCha20-Poly1305 that do not mask content length to at least a block size.

TLS1.3 includes a record layer padding mechanism, designed to make these kinds of attacks more difficult. However, 0-RTT replay also enables a new kind of traffic analysis attack. Today, traffic analysis is most effective against fixed-size responses, as in the Wikipedia example. With 0-RTT data an attacker can repeat a request millions to billions of times, and by observing variability in response size and response times can gain additional information that may enable the attacker to decloak data.

Violation of layers and separation of actors

As we've seen, the problems of replay tolerance are resolvable only at the application layer, but solutions can be subtle and hard to reason through and test. In my experience, deliberately replaying requests will uncover surprising issues in many systems. Indeed there are race detectors, thrashers, and other testing tools for a variety of languages that have evolved to find these kinds of issues. But these bugs remain common, and many systems make the fair assumption that TLS provides anti-replay properties for their messaging/transport layer.

The approach of the TLS1.3 draft is to say that 0-RTT data is optional, that it should not be enabled without careful analysis, and that the application must be made aware that data was potentially replayed. In my view, in light of the above attacks, this advice is unworkable. It is not simple, and maybe not even possible, to secure all applications against replay and measurable side effects such as cache timing and throttling. Fully-correct idempotency is very difficult and vanishingly rare.

But beyond that, let's examine the advice given:

Protocols MUST NOT use 0-RTT data without a profile that defines its use. That profile needs to identify which messages or interactions are safe to use with 0-RTT. In addition, to avoid accidental misuse, implementations SHOULD NOT enable 0-RTT unless specifically requested. Implementations SHOULD provide special functions for 0-RTT data to ensure that an application is always aware that it is sending or receiving data that might be replayed.

There are also several reasons to believe that even this advice will not always be taken, indeed some existing experimental deployments do not follow it.

The first and strongest reason is that the benefits of 0-RTT are considerable, immediate, and measurable. By turning on 0-RTT a provider can save hundreds of milliseconds, and it's been reported that savings of 100ms can impact revenue by as much as 1%. Providers also exist in a competitive landscape and are constantly trying to beat each other on every conceivable metric. At the same time, the security risks are non-obvious (indeed, this write-up is coming very late in the TLS draft process) and hard to test for. In other words: providers have an extremely large incentive to turn 0-RTT on, de-prioritizing harder-to-measure security concerns.

The second practical problem is that the world of application authors, writing high-level code for websites and applications, is very separate from the worlds of TLS implementors and server administrators. It is predictable that site administrators will enable 0-RTT without an appreciation for the risk to the application, whose authors are likely not even aware of the change. Indeed, acceleration providers are already making this easy in the current experimental deployments of TLS1.3, where 0-RTT support is being offered to websites and applications in a backwards-compatible way, as a single stream of data towards their own servers.

If providers are to render the advice of the TLS draft moot, and to provide a single stream of data anyway, then arguably it would be better if the TLS1.3 RFC defined that as the default mode of operation. Maintaining separate "may be replayed" and "can't be replayed" sections is complexity that can clutter applications and increase the risk of application-level state machine bugs.

Even a fully-aware and conscientious site administrator faces a practical difficulty: applications are often made up of many URLs and request paths. Some may be replay-safe, and others not. But 0-RTT is enabled at a protocol level, for all requests. There are also "Layer 4" TLS proxies which accelerate TLS by terminating it at edge sites (similar to a CDN), or provide security benefits by handling certificate management, but are completely agnostic to the protocol being handled by TLS. Will administrators and providers in these situations resist the temptation to accelerate TLS with 0-RTT mode?

Furthermore, the expectations and guarantees provided by layers are expected to be consistent across providers. A customer may legitimately use an API proxy offered by one provider, in combination with a load balancer offered by another, together with a CDN or edge-accelerator offered by yet another. A change in 0-RTT behavior on the part of any one provider can impact the security assumptions of the others. For example, the CDN layer administrator might enable 0-RTT for what appears to be an idempotent request pattern, without being aware that the API proxy implements request-level throttles. Now an attacker who happens to grab a single 0-RTT data request from an unsecured WiFi network can turn this into a broad denial-of-service attack that may lock the caller out in all locations.

At the core of this problem is that the proposed change with TLS1.3 violates the established layering boundaries of applications and transport protocols, and violates the principle of least surprise. This is somewhat ironic, as TLS itself benefits from important guarantees from its underlying protocol, TCP. For example, the Lucky13 vulnerability was practical against DTLS, but not TLS. This is because UDP-based DTLS tolerates a certain amount of replays, while TLS does not, due to the reliable-transmission guarantees provided by TCP. Suppose the TCP protocol WG were to decide that TCP would sometimes no longer provide reliable transmission, and that data may be missing or duplicated in a stream; would we be happy with that as TLS maintainers?

Suggested changes to TLS1.3 draft

Require implementations to robustly prevent Ticket re-use for 0-RTT

TLS1.3 should require that TLS implementations handling 0-RTT "MUST" provide a mechanism to prevent duplicate tickets from being used for 0-RTT data. Implementations could use strike registers, single-use caches, or other mechanisms such as puncturable encryption to achieve this effect, rejecting 0-RTT sections when uncertain about replay.
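For illustration, a toy strike register might look like the sketch below. The interfaces are assumptions, but it shows the key property: 0-RTT must be rejected both for repeated tickets and for tickets issued during any period in which the register may have missed observations.

```python
# Illustrative strike register: a 0-RTT attempt is accepted only if the ticket was
# issued during a window in which the register was durably recording observations,
# and the exact ticket (or binder) value has not been seen before.
import time

class StrikeRegister:
    def __init__(self):
        self._seen = set()                    # previously observed ticket/binder values
        self._recording_since = time.time()   # start of the current durable window

    def restarted(self):
        # After a crash or outage we cannot be sure what was observed before,
        # so only tickets issued after this point may be accepted for 0-RTT.
        self._seen.clear()
        self._recording_since = time.time()

    def accept_0rtt(self, ticket_id: bytes, ticket_issue_time: float) -> bool:
        if ticket_issue_time < self._recording_since:
            return False                      # issued while observations may be missing
        if ticket_id in self._seen:
            return False                      # replay
        self._seen.add(ticket_id)
        return True
```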

While this does leave open the small window of Gillmor-style attacks, these attacks are different in magnitude and consequence, and can be handled reasonably by clients in a manner that existing clients are used to.

Additionally, if TLS implementations are to provide replay protection as a built-in property, it is simpler for applications to expose all TLS plaintext data as a single stream. This appears to be what applications are doing anyway.

Partial mitigation for Gillmor attacks: deliberately duplicate 0-RTT data

If 0-RTT data and regular data are to remain separate streams, then another way to address Gillmor attacks is to intentionally duplicate 0-RTT sections. If 0-RTT sections are to be replayable, it is better that they be replayed as an ordinary event. TLS implementations should occasionally intentionally duplicate 0-RTT data towards the application. This helps "inoculate" applications against idempotency bugs, triggering them early in a controlled way, before attackers do so in an uncontrolled way.
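In a TLS implementation this might look something like the following sketch; the delivery callback and the duplication rate are assumptions purely for illustration.

```python
# Sketch: with a small probability, deliver the accepted 0-RTT plaintext to the
# application twice, so idempotency bugs surface early in a controlled way.
import random

DUPLICATION_RATE = 0.001   # illustrative; tune to taste

def deliver_early_data(plaintext: bytes, deliver_to_application) -> None:
    deliver_to_application(plaintext)
    if random.random() < DUPLICATION_RATE:
        deliver_to_application(plaintext)   # deliberate, controlled replay
```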

Require TLS proxies to operate 0-RTT transitively

Some, though not all, of the attacks outlined can be lessened by passing full knowledge of 0-RTT state end-to-end to applications. For example, a CDN or TLS accelerator could accept a 0-RTT data request only if the origin also supports 0-RTT. It could then match byte-for-byte the plaintext of the incoming 0-RTT section with an outgoing 0-RTT section. Rather than emulating a single stream, this would allow end applications to reason more precisely about exactly which data was originally replayable.

Conclusion

TLS 1.3 0-RTT is not secure by default, but it is possible to provide both anti-replay and forward-secrecy properties for 0-RTT data with workarounds. As long as TLS 1.3 is not secure by default it is likely to lead to exploitable vulnerabilities that can only be fixed at the application level, distant from the cause. In general, it is also very challenging to fix applications to be idempotent and side-effect free.

Instead of shifting the problem to applications, we should strongly consider modifying the TLS 1.3 draft to make TLS 0-RTT secure by default, at least against replays. While Gillmor-style retry attacks will persist, these attacks may be mitigated with reasonable client behavior, and in many cases the existing client behavior is already fault tolerant.

Lastly, to end on a positive note: TLS1.3 is still a welcome and vast simplification over prior versions of TLS, and it improves the security posture of TLS generally, including much better forward secrecy for all non-0-RTT data.
