[Condensed version of this narrative: the news.yc code, prior to the the release of arc3, contains a remotely-exploitable vulnerability permitting account theft. Anyone running a news installation who has not yet upgraded to arc3 should do so.]
Hacker News login cookies are random eight-character strings, stored server-side in a hash table mapping them to user names. I discovered a few weeks ago that these strings were rather less random than they were meant to be, and, through a delightful combination of exploits, could be predicted, enabling an attacker to steal accounts.
Here's the rand-string function from arc.arc, version 2. It gets called with n=8 to generate login cookies, and n=10 for the "fnids" that get used all over the site as hash keys identifying closures.
(def rand-string (n) (with (cap (fn () (+ 65 (rand 26))) sm (fn () (+ 97 (rand 26))) dig (fn () (+ 48 (rand 10)))) (coerce (map [coerce _ 'char] (cons (rand-choice (cap) (sm)) (n-of (- n 1) (rand-choice (cap) (sm) (dig))))) 'string)))
The first thing you might notice about this function is that not all characters are equally probable. Each digit has a 1/30 chance of occuring, while each letter has a 1/78 chance. This alone is no big deal: this distribution means that each character carries 5.826 bits of entropy, versus the 5.954 that a uniform distribution would provide. So for an eight-character string, this bug reduces the effective keyspace by just over a factor of two -- not enough to have any practical implications.
The 'rand' function is an arc primitive, bound directly to mzscheme's 'random':
; need to use a better seed (xdef 'rand random)
The comment seen here is prescient, as we'll see.
This is the C function which implements mzscheme's 'random' function:
static long sch_int_rand(long n, Scheme_Random_State *rs) { double x, q, qn, xq; /* generate result in {0..n-1} using the rejection method */ q = (double)( (unsigned long)(m1 / (double)n) ); qn = q * n; do { x = mrg32k3a(rs); } while (x >= qn); xq = x / q; /* return result */ return (long)xq; }
Where mrg32k3a() is:
static double mrg32k3a(Scheme_Random_State *s) { /*(double), in {0..m1-1}*/ double x10, x20, y; long k10, k20; /* component 1 */ x10 = a12*(s->x11) - a13n*(s->x12); k10 = (long)(x10 / m1); x10 -= k10 * m1; if (x10 < 0.0) x10 += m1; s->x12 = s->x11; s->x11 = s->x10; s->x10 = x10; /* component 2 */ x20 = a21*(s->x20) - a23n*(s->x22); k20 = (long)(x20 / m2); x20 -= k20 * m2; if (x20 < 0.0) x20 += m2; s->x22 = s->x21; s->x21 = s->x20; s->x20 = x20; /* combination of component */ y = x10 - x20; if (y < 0.0) y += m1; return y; }
This, obviously, is not a cryptographically strong PRNG. Is it possible that we could break it, computing its internal state by seeing a few consecutively-generated rand-strings? Probably: it looks as though it could be represented as the solution to a manageable system of diophantine equations. That, though, was more math than I felt like doing, so I went looking for an easier approach.
Where does the RNG seed come from? Ah ha:
rs = scheme_make_random_state(scheme_get_milliseconds());
Where scheme_get_milliseconds is defined, after eliding some preprocessor cruft, as:
long scheme_get_milliseconds(void) { struct timeb now; ftime(&now); return now.time * 1000 + now.millitm; }
In other words, the random seed is merely the number of milliseconds since epoch at the time the seed function was called.
The part of mzscheme that calls the seed function is a bit daunting: it appears that in some cases, the PRNG state can be thread-local and be initialized when the thread is spawned. However, instrumenting sch_int_rand() with some debug output showed that in arc, the same state vector gets used everywhere, and is initialized when the mzscheme runtime starts up.
The millisecond at which news.yc last started is not an immediately simple thing to determine, though it was at least easy to verify the sanity of the system clock, thanks to an open NTP serevr:
dfranke@feanor:~$ sudo ntpdate -q news.ycombinator.com server 174.132.225.106, stratum 2, offset 0.370866, delay 0.08228 17 May 01:45:13 ntpdate[27901]: adjust time server 174.132.225.106 offset 0.370866 sec
So for a start, I thought, perhaps I could determine the server's start time to within a few seconds or minutes. A boring way to go about this would be simply to monitor the server for downtime, and record when it became accessible again. But impatience is one of the three great programmer's virtues, and the best way to predict the future is to create it, and so forth, so I decided on a more proactive approach: crash it!
A couple months ago, PG left this comment after news.yc recovered from some downtime:
HN was down today for around 2 hours. Sorry about that. The News server currently crashes a couple times a day when it runs out of memory. All the comments and stories no longer fit in the 2 GB we can get on a 32 bit machine. We'd been planning to upgrade to a new 64 bit server. In the meantime it was arguably a rather slow form of GC. Unfortunately the process somehow got wedged in the middle of segfaulting. We're not sure why and will probably never know. But that meant the process that usually notices when News is wedged and restarts it was unable to kill it.
(The server had since been upgraded, so these crashes are/were no longer happening.)
I figured that the watchdog works by requesting a page and checking to make sure it gets a response, and that if it doesn't get one, then it assumes the server is wedged and restarts it.
Here's arc2's top-level request handler:
(= srvthreads* nil threadlimit* 50 threadlife* 30) ; Could auto-throttle ips, e.g. if one has more than x% of recent requests. (= requests* 0 requests/ip* (table) throttle-ips* (table) throttle-time* 30) (def handle-request (s (o life threadlife*)) (if (len< (pull dead srvthreads*) threadlimit*) (let (i o ip) (socket-accept s) (++ requests*) (= (requests/ip* ip) (+ 1 (or (requests/ip* ip) 0))) (let th (thread (if (throttle-ips* ip) (sleep (rand throttle-time*))) (handle-request-thread i o ip)) (push th srvthreads*) (thread (sleep life) (unless (dead th) (prn "srv thread took too long")) (break-thread th) (close i o)))) (sleep .2)))
So, there's a limit of 50 concurrent threads, and threads are killed after 30 seconds if they haven't already terminated. So if I were to hold open 50 concurrent connections, and the watchdog were to run during the following 30 seconds, then the server ought to restart.
The watchdog code has not been released, so rather than soil my hat color by DoSing the production server, I decided to continue hacking on my local install on the assumption that I had the ability to determine the server's start time to within one minute.
So, a one-minute interval is 60,000 possible PRNG seeds. If I kept polling to see when the server came back up after the watchdog killed it, then let's very conservatively assume that I could be among the first 50 people to issue an HTTP request. Each page that comes back from the server typically contains 2-3 fnids, so the reply I got would contain some from among first few hundred to be generated, and thus from among the first few thousand iterations of of the PRNG.
This leaves determination of the PRNG seed comfortably within the reach of brute force: run the PRNG for 10,000 iterations for each of the 60,000 possible seeds, and see which one produces the fnids I saw in response to my request. I wrote a program that does just this:
http://dfranke.us/hacknews.c
So now I was able to determine PRNG seeds, but I couldn't conclude my adventure quite yet. Since logging into news.yc is an uncommon operation compared to simply browsing around, only a tiny fraction of rand-strings that the server generates correspond to login cookies. Furthermore, since fnids and login cookies have different lengths, and since the PRNG gets called for a few other purposes at unpredictable times, every individual PRNG iteration begins a candidate login cookie. That's 40 or more false candidates produced for every page view.
Nonetheless, online brute force would still be manageable. If each page view produces an average of 50 candidates, and one in every thousand page views is a login (this might be slightly optimistic), that's 50,000 attempts necessary in order to find a working login. HN gets about 500,000 hits on a busy day, so this could be done in a day or two while likely staying under the radar.
A marginally more efficient approach would be a bit of social engineering:
1. Request a page. Find a generated fnid from the page source and look it up in our candidate list. Call this A.
2. ERC> /join #startups <dfranke> Hey guys, I haven't been able to log in to news.yc since the server restarted a little while ago. Anyone else having problems? <jrandomsucker> dfranke: Works for me. <dfranke> Hmm, weird. I'll just try again later I guess.
3. Request another page, note the fnid, find it in the candidate list. Call this B.
Step 4: Test the cookies that fall between A and B.
If this conversation takes one minute, then this reduces the search to about 17,500 attempts -- less than a day's worth at a modest rate of querying -- and possibly picks up multiple accounts in the process.
Epilogue:
I sent PG a draft of this post. RTM and I wrote a better implementation of rand-string which reads from /dev/urandom and obeys a proper uniform distribution. This new version appears in arc3:
(def rand-string (n) (let c "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" (with (nc 62 s (newstring n) i 0) (w/infile str "/dev/urandom" (while (< i n) (let x (readb str) (unless (> x 247) (= (s i) (c (mod x nc))) (++ i))))) s)))
PG removed the 50-thread concurrency limit and replaced it with a per-IP rate limiter, so the DoS attack described here should no longer work.
Like Pandora, he is so curious, he has to check this out.
Like a rat in a maze, he keeps going looking for the clear path.
Like Sherlock Holmes, he applies logic to determine the next step.
Like General Sherman, he keeps marching, building tools along the way as he needs them.
Like William Randolph Hearst, he defines the landscape. ("You provide the pictures, I'll provide the war.") "so I decided on a more proactive approach: crash it!" (hilarious)
And like any parent, he didn't quit until his baby walked.
Thank you, Daniel. I sure hope you've found a way to channel that talent in your day job.