David Fifield <david@bamsoftware.com>
I'm going to cover three topics today. The first is what I mean by "Internet censorship," how we model it and what the challenges are. The second is a certain circumvention technique called "domain fronting" and how it meets those challenges. And then, how the world of censorship circumvention has changed in the past few months and what changes are coming on the horizon.
Whenever I think about Internet censorship, this is the picture I have in mind. It's unfair that we use the very generic term "Internet censorship" to refer to this narrow slice of all the topics that could come under that label. We could use a more specific term. Anyway, we have a censored client—that's you, you're being censored. The client sits within a network controlled by a censor. The censor controls all the network links within its network, all the routers, etc. The censor can manipulate traffic however it wants, can block packets, replay them, inject new ones. Just imagine that you own a router. You can run software on it, you can do whatever you want. Outside the censor's network is some destination that the client wants to reach. Sending a message to the destination, despite the censor's controls, is called circumvention. If the client can do it, the client wins; otherwise the censor wins.
But we're missing something. What prevents the censor from winning trivially, by shutting down all communication? That certainly prevents circumvention. The answer is that the censor suffers some cost when it blocks; after all, there's a benefit to Internet connectivity that a total shutdown would forfeit. The censor doesn't want to block everything, only a subset of traffic, only certain enumerated things and nothing more. What inhibits the operation of the censor is a fear of the costs caused by overblocking. We'll come back to this idea.
Some typical censor behaviors are blocking traffic based on destination address, or blocking on the basis of the contents of packets, like keywords. I find this classification useful: blocking by address and blocking by content. The challenge is, you're a censored client, and you need to send out a message, and you know that both the contents of your message and the destination address will be blocked by the censor. How do you do it? As for the contents, you can use encryption: encrypt your traffic and the censor cannot tell what keywords it contains, only that it is encrypted. And for the address, if direct communication with the destination is blocked, then the only alternative is indirect communication. So at minimum, you have to route your message through some third party, which we generally call a "proxy." We'll use the word "proxy" very generically: it's not just a certain kind of server; it could, for example, be a VPN, a special router, a program running on someone's home PC. Anything acting on behalf of the client is a proxy.
So circumvention starts with access via an encrypted proxy, but that's not where it ends. Now you have to worry about second-order censorship. First-order censorship is the direct blocking of things the censor wants to block, like web sites and keywords. Second-order censorship is the blocking of proxies, because a proxy allows the client to get around the first-order censorship. Now the details matter. The encryption between the client and proxy, how does it work? If it's a custom protocol, not used by anything else, then that's something the censor can detect and you can get blocked on that basis. Even more difficult: how do you prevent the censor from blocking the proxy's address? This is actually really hard: you want to build a circumvention system and provide it for the use of the general public, people you don't know. You have to somehow inform them of proxy addresses. But how do you do that, without also informing the censor of those same addresses? The censor can do anything a normal user can do, it can download your software and reverse engineer it, or simply run it and see what addresses it connects to. It's a weird model, right? You have to share secret, trusted information with a group of people you don't have a trust relationship with.
Until a few years ago, the state of the art was to use an encrypted proxy protocol that either imitates some other real-world protocol, or else is totally random such that it doesn't match any protocol on a censor's blacklist. (Why a blacklist? Here we observe empirically that censors prefer to block narrowly rather than broadly when they can, again likely because of an aversion to overblocking.) And then to have many proxies that you can replenish as they get discovered and blocked, with some sort of rate-limiting or reputation-based system for distributing the addresses to limit the rate of their discovery.
I'm going to tell you about an alternative—domain fronting—that avoids this need for secrecy. To understand how it works, we need to take a historical detour through HTTP and TLS and how they evolved together to create HTTPS. Domain fronting is fundamentally based on HTTPS.
[Diagram: the HTTP exchange, shown without TLS and with TLS]
Go back to early HTTP, HTTP/1.0, circa 1995. How does it work? The client sends a GET, and the name of the page it wants, and the server sends back a status code and the content.
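You can reproduce that exchange by hand with netcat. Here example.com is just a stand-in for any web server, and the response shown is abridged and illustrative:

$ printf 'GET / HTTP/1.0\r\n\r\n' | nc example.com 80
HTTP/1.0 200 OK
Content-Type: text/html
...the headers and the HTML of the page follow...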
To turn HTTP into HTTPS, you add a layer of TLS. The client initiates the handshake, and the server responds with its (one and only) certificate. After that, the HTTP exchange is unmodified, except that it happens underneath the TLS encryption and authentication. The layer separation is very clean.
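You can see that clean layering yourself with the openssl command-line tool, which performs the TLS handshake and then passes your bytes through the encrypted channel unmodified (again, example.com is a stand-in):

$ printf 'GET / HTTP/1.0\r\n\r\n' | openssl s_client -quiet -connect example.com:443

The HTTP request and response are exactly the same as before; only the wrapping changed.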
[Diagram: the HTTP/1.1 exchange with the Host header, shown without TLS and with TLS]
One of the things that HTTP/1.1 added was support for virtual hosting. Virtual hosting is when one web server, one IP address, serves several domains. The server needs to know which domain the client intends; the client communicates that using the new required Host header.
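On the wire it's a one-line difference. A hand-written HTTP/1.1 request looks like this, with the Host header telling the server which of its domains you mean (example.com is a stand-in; the header only matters on a server that actually hosts several domains):

$ printf 'GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n' | nc example.com 80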
How does TLS interact with virtual hosting? The client initiates the handshake as before, but there's a problem—what certificate is the server supposed to send? The server has certificates for multiple domains and it needs to know which one to use. We have a chicken-and-egg problem: the client cannot send its desired host until the handshake is complete, but the handshake cannot complete without the server knowing the desired host. So for a long time, about ten years, virtual hosting was just incompatible with HTTPS. If you wanted an HTTPS server, it had to be on its own dedicated IP address, with one and only one certificate.
SNI: Server Name Indication (TLS extension)
The resolution to the impasse was an extension to TLS, called SNI for "server name indication." It solves the problem in about the simplest, stupidest way possible: the client just staples its desired domain, in plaintext, onto its initial handshake message.
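You can watch SNI doing its job with openssl s_client, whose -servername option sets the SNI field. The server uses that field to choose which certificate to present, so the subject of the certificate you get back should correspond to the name you put in -servername (example.com is a stand-in here):

$ openssl s_client -connect example.com:443 -servername example.com </dev/null 2>/dev/null | openssl x509 -noout -subject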
SNI solved the problem of HTTPS virtual hosting, which actually was an acute problem that needed a solution. But a consequence is that you leak your destination domain when you make an HTTPS connection. When you browse to an HTTPS URL, how much of it does an eavesdropper get to see? They see the scheme (https://), because they can infer that from the destination port and the fact that TLS is in use. They see the domain (example.com), because that's attached in plaintext in the SNI. The only part they don't get to see is the path (/path) and anything else that comes after the domain. If you're like me, it's a little disappointing, like, I thought I had this nice encrypted protocol, and look at all the information it's leaking. It means that HTTPS by itself is not very helpful for circumvention, because a censor can still block by address by reading the SNI.
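If you want to convince yourself of the leak, you can play the eavesdropper on your own machine. Something like the following prints the SNI of every HTTPS connection it captures; the field name varies across Wireshark versions, and in newer releases it is tls.handshake.extensions_server_name instead:

$ tshark -f 'tcp port 443' -Y ssl.handshake.extensions_server_name -T fields -e ssl.handshake.extensions_server_name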
Notice something weird in the HTTPS with SNI diagram. The domain "example.com" appears, redundantly, in two different places. This is kind of a historical accident; if you were to design HTTPS from scratch today, it wouldn't have this peculiarity. But the fact is, we have the same name in two places. Does that give you any ideas? What happens if the two names do not match?
When the Host header and SNI names do not match, we call it "domain fronting." You know, like a front organization may be the public face of some other covert organization. I'm not aware of any standard that says exactly what should happen, and in fact implementations differ. But one common behavior is that you get the TLS certificate corresponding to the SNI, but the HTTP contents corresponding to the Host header. Now, this is a little weird, because if you wanted end-to-end TLS security with the Host header web site, you're not getting it from this. If you need that, you need to add it in a separate tunneled layer.
But you can see how this would be useful for circumvention in the case of virtual hosting: you can access one site (presumably blocked by the censor) while appearing to access another site (presumably unblocked). It really is indistinguishable from an ordinary non-fronted visit to the SNI domain, up to things like website fingerprinting. The censor can only prevent circumvention by blocking access to the SNI domain. Why wouldn't the censor do that? Well, we can never be sure that it won't, but it again comes down to the costs associated with overblocking: ideally we find a front domain that is valuable enough that the censor doesn't want to block it.
$ wget -q -O - https://www.google.com/ --header "Host: www.android.com" | grep "<title>"
<title>Android</title>
$ curl -s https://www.google.com/ -H "Host: www.android.com" | grep "<title>"
<title>Android</title>
Here's how you can try domain fronting on the command line. Somewhere there's a server that does virtual hosting for www.google.com and www.android.com (and a ton of other domains). If you construct an HTTPS request for www.google.com, but rewrite the Host header to say www.android.com, you actually get the content of the Android page.
And virtual hosting is extremely common these days, in the form of CDNs (content delivery networks). If we can find one valuable domain on a CDN, that means we can use it as a front to access any other domain on the CDN. But we can actually do a little better. In place of the blocked site, let's become a customer of the CDN and run a proxy on our own server. Now, the client can use domain fronting to reach our proxy, and our proxy can serve as the last mile towards whatever other destination the client may desire. Domain fronting combined with some ancillary proxy tunneling features makes a complete circumvention system. In the context of Tor, we call the system "meek."
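For a concrete picture, this is roughly what the client side of meek looks like in a torrc file. Take the exact plugin path, URL, and front domain as illustrative, since the deployed values change over time; check your Tor Browser's bundled configuration for current ones:

ClientTransportPlugin meek exec /usr/local/bin/meek-client
Bridge meek 0.0.2.0:2 url=https://meek.azureedge.net/ front=ajax.aspnetcdn.com

The url parameter names our proxy, running as a customer of the CDN; front is the valuable domain whose name actually appears in the DNS request and the SNI.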
Notice that there's no secrecy required in this model. We don't have to carefully manage many proxy addresses. There's effectively only one proxy address—and the censor knows it—but it's too valuable to block. Well, actually different censors have different resources, and depending on the specific domain, some may be able to afford blocking it, and some may not. But what I like about this model is that it makes resistance to censorship in some sense quantifiable: instead of saying, the system won't be blocked as long as the censor doesn't know X, we can say, in order to block Y, it will cost the censor Z.
Earlier history (2014–2017)
Domain fronting has served us well for the past five years or so. In the last few months, though, things have started to change. Back in April and May 2018, Amazon and Google announced that they would stop supporting domain fronting. You can see here some articles about it. Me, personally, I'm more ambivalent. I mean, domain fronting is something of a hack, and it depends on specific implementation details, and there's nothing that says those details have to keep working the way they have worked until now. The current situation, as I understand it, is that Google didn't block domain fronting completely (as demonstrated by the www.google.com/www.android.com example earlier); they only blocked access to *.appspot.com, which is a service that lets you run code, such as a proxy, virtually hosted on Google's servers. Amazon hasn't technically blocked domain fronting, as in it will still work, but I understand they will send you nastygrams if they detect you doing it too much. Currently, with Tor, we're still using the Microsoft CDN, called Azure. That's how people in, for example, China, are accessing Tor today. So domain fronting is limping along for now, but its sustainability is in question, and we are going to need something to take its place.
The good news is that there is now an IETF draft for SNI encryption, which is exactly what it sounds like. It's not part of the core TLS 1.3 standard; it will be an extension to TLS, just like SNI was. It works by distributing temporary SNI encryption keys through an out-of-band channel like DNS, which the client then uses to encrypt the SNI field. If encrypted SNI takes off, and I hope it does, it will solve all our current circumvention problems. Basically everything we can do today with domain fronting we'll be able to do with encrypted SNI, in a standards-based way that the cloud operators seem happier with.
The draft standard for encrypted SNI is actually already deployed today. It's present on Cloudflare sites, and the client side is in nightly builds of Firefox.
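You can even fetch the keys yourself. At the time of this talk, Cloudflare publishes them in a DNS TXT record under the _esni label; the record contains a base64-encoded key structure:

$ dig +short TXT _esni.www.cloudflare.com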
Solving our current problems will allow us to focus on other problems. One of the problems I'm thinking about is what will happen as protocols become more and more secure, and censors' favored techniques of the past, namely monitoring at network routers, become more and more ineffective. Censors will be forced to adapt, perhaps by developing improved encrypted-traffic analysis. But what I think is more likely in the near term is that censors will put more and more pressure on network intermediaries, like CDNs, which, because of the security of network protocols, will be the obvious places at which to implement censorship. So we may see things like letters saying, "we want you to drop this customer." It will be a restructuring of power arrangements, with end users depending increasingly on the benevolence of intermediaries.