Amazon Web Services, distributed.net

Amazon Web Services (AWS) Release new g2.8xlarge GPU instance

Today AWS announced a long-awaited upgrade to their G2 family of instances – the g2.8xlarge, the big brother of the 2x, which was hitherto the only GPU-backed instance available on the AWS platform.

Here’s how the two compare in specifications:

Instance    vCPU  ECU  Memory (GiB)  Instance Storage (GB)  EC2 Spot Price ($/hr)
g2.2xlarge  8     26   15            60 SSD                 $0.08
g2.8xlarge  32    104  60            2 x 120 SSD            $0.32

The four GPUs are the same NVIDIA GRID K520 seen in the 2x instance, and as you can see from the numbers, the 8x is exactly four times larger in every respect. The indicative spot price at the time of writing followed suit, at roughly four times the cost.

Following on from my previous post, where I benchmarked the g2.2xlarge instances using the Distributed.net RC5-72 project, I re-ran the same test on an 8x. You will not be surprised to learn that the results show a linear increase in the crunching keyrate, to roughly 1.7 Gkeys/sec (up from 432 Mkeys/sec on the 2x).

Is bigger better?

AWS’ fleet and pricing structure is generally linear: for an instance twice the size, you pay twice the cost, in both spot and on-demand. The major difference that is not very clearly advertised is that network performance is greater for larger instances. AWS are vague about what ‘Low’, ‘Moderate’, and ‘High’ mean in terms of raw speed (many others have tried to benchmark this), but for the largest instances it is explicitly stated as 10 Gigabit. It’s reasonable to assume that a larger box pumping out more data needs a network connection to match – and, conversely, that an instance generating only a quarter as much data will be equally well served by a ‘Moderate’ connection.

A real world use case

In my day job I set up a variety of analyses on genetic data, supplemented by EC2 computation clusters (the recent AWS whitepaper on Genomic Architecting in the Cloud is a really useful resource I can throw at scientists when they have questions). I investigated the viability of G2 instances, and for a specific analysis that was GPU-capable, it did indeed run roughly 3-4 times faster than the same job running on a single CPU core. The problem was memory – each job used roughly 3-5GiB, meaning I couldn’t run more than 3 or 4 jobs on a single g2.2x at once.
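
As a rough sketch of that packing constraint (the instance specs and the roughly-4GiB-per-job figure are the ones quoted above; the helper function itself is purely illustrative):

```python
# Illustrative only: how many single-core jobs fit on an instance before
# memory or cores become the bottleneck. Figures are those quoted in the post.

def max_concurrent_jobs(mem_gib: float, cores: int, job_mem_gib: float) -> int:
    by_memory = int(mem_gib // job_mem_gib)
    return min(by_memory, cores)

# g2.2xlarge: 8 vCPUs, 15 GiB RAM; each job needs roughly 4 GiB
print(max_concurrent_jobs(15, 8, 4))     # -> 3 (memory-bound)

# r3.8xlarge: 32 vCPUs, 244 GiB RAM
print(max_concurrent_jobs(244, 32, 4))   # -> 32 (core-bound, memory to spare)
```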

However, on an r3.8xlarge – a CPU instance with 32 cores and 244GiB of memory – I could run 32 concurrent jobs with memory to spare. Sure, the jobs took 30 minutes each instead of 10, but I could run 32 of them.

Then I drilled down into the cost/benefit. The g2.2x was $0.08 on spot, and the r3.8x was $0.32 – four times as much per hour to run, but with roughly ten times as many jobs running. It ended up being a no-brainer that a CPU instance was the way to go.

Perhaps this is a poor example, because the capabilities of genetic analysis are badly limited by the tools available for the specific job, and it’s reasonably rare to find anything built for multi-threading, let alone something designed specifically to run on GPUs. The implementations of these analysis tools are black boxes and we’re not software developers. Our tests were probably poor exemplars of the power of a GPU instance, but they did show that a mere 15GiB of RAM on the 2x just wasn’t anywhere near enough. 60GiB on the 8x is a little better, but in my use case it still wouldn’t offer any additional benefit, because I wouldn’t be able to leverage all of the GPUs I’m paying for (our software just isn’t good enough). FastROCS, the example cited in Jeff Barr’s AWS Blog announcement about the g2.8x, also mentions the 15GiB of the 2x being the limiting factor, so presumably they’re running jobs that can leverage more GPU power in a single job without a proportional increase in memory usage.

The main benefit of one vertically-scaled box four times the size is speed. If your application can utilise four GPUs simultaneously within the memory limits then you could, for example, transcode a single video extremely quickly. If speed is the main factor in your service delivery then this is the instance for you. If, however, you’re running smaller, less time-critical jobs that the 2x can handle just as well, there’s little benefit here – unless you consider managing four separate GPU instances to be more hassle than running one four times the size. But then all of your GPU eggs are in one basket, and if the single instance goes down, so does all of your capacity.

As with all AWS usage, your chosen method of implementation will depend on what’s right for you. This is going to be great news for a lot of people – but unfortunately not me!

distributed.net, Raspberry Pi

Raspberry Pi 2 Model B with Distributed.net dnetc RC5-72 client

A little while ago I wrote up a summary of the distributed.net RC5-72 project. One of my habits over the years has been to run the good old cow client on every new computer I’ve built just to see how the speed compares.

So when I picked up a new Raspberry Pi 2 Model B this week, the habit held true. For this I installed the ARM/EABI client v2.9110.519 (sadly dated 2012 – there have been very few client updates in the last few years), and it ran without any issues. By default it doesn’t detect the Pi 2’s quad-core architecture, so it requires manually setting the performance options to use 4 cores.

It’s not exactly speedy, but then this is a computer that I can fit in my back pocket.

Results

Four simultaneous crunchers took 1 hour 3 minutes to complete 4 stat units, at a combined keyrate of 4.5Mkeys/sec. At that rate the Pi could crunch around 91 stat units per day. That means that running by itself, the Pi could complete the remaining work on the project in a mere 32 million years. I don’t think the warranty lasts that long.
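
For the curious, here is how those numbers fall out – a quick sketch, assuming (as the keyrates quoted elsewhere in this post imply) that one RC5-72 stat unit is a 2^32-key block, and taking the project’s ~3.4% completion discussed below:

```python
# Back-of-the-envelope check of the Pi 2 figures above.
# Assumption: one RC5-72 stat unit = 2^32 keys.

keyrate = 4.5e6                              # keys/sec, four crunchers combined
units_per_day = keyrate * 86400 / 2 ** 32
print(round(units_per_day))                  # ~91 stat units/day

remaining = 2 ** 40 * (1 - 0.03378)          # ~1.1 trillion units total, 3.378% done
years = remaining / units_per_day / 365.25
print(f"{years / 1e6:.0f} million years")    # ~32 million years
```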

The CPU temperature held fairly steady at 54°C, and I will also note that I didn’t overclock the Pi, so this was running at the default 900MHz.

While the Pi does boast a Broadcom GPU with 1GB of RAM (shared with the CPU), there are no compatible crunching clients for a GPU test, and I suspect the project is in such a lacklustre state now that I don’t foresee any budding development of one soon. This was still a fun little test, and the result isn’t that much slower than a decent home PC would have managed 15 years ago.

Amazon Web Services, distributed.net

Distributed.net RC5-72 on Amazon Web Services (AWS) EC2 Cluster – A modern solution to an old problem

History

Anyone kicking around the internet since the early days will have heard of distributed.net‘s RC5-72 distributed computing project. Arising from RSA Labs’ Secret-Key Challenge, the project sought to use distributed computing power to mount a brute-force attack in an attempt to decrypt a series of secret messages encrypted with the RC5 block cipher. Sponsored by RSA Labs, a $10,000 reward was offered to the participant whose machine found the correct key for each of the challenges, starting at 56-bit encryption and scaling up through 64, 72, 80, 88, 96, 104, 112, and 120 to 128 bits.

I started contributing to the 64-bit instance of the project (termed RC5-64) back in 2001, and with the combined computing power of around 300,000 participants, the key was cracked after almost 5 years of work in July 2002. This project required the testing of 68 billion key blocks (2^64 keys) and found the correct key after searching 82.7% of the total keyspace.

A new project to tackle the next message, encrypted with a 72-bit key, was started on December 2nd 2002. This project required the testing of 1.1 trillion key blocks (2^72 keys) – 256 times larger than the original project that had taken 5 years to complete. After a few years it became apparent that this was going to take ‘a very long time’, and RSA Labs withdrew the challenges in May 2007, along with the $10,000 prizes. Shortly after this news, distributed.net announced that they would continue to run the project and would fund an alternative $4,000 prize.

Today

As of today the project has made it through 3.378% of the keyspace after 12 years of work. At the current rate the project anticipates hitting 100% in a mere 219 years.

You can imagine that so many years of effort have dulled the enthusiasm of its participants. The distributed.net statistics report some 94,000 unique participants in RC5-72 (significantly down on the 300,000 of the previous project), but only 1,200 of them remain active. With the rise of other distributed computing projects in the early 2000s, such as SETI@home, Folding@home and many others, this humble project has been rather forgotten by the internet at large, and the advent of Bitcoin and other e-currencies has led people to turn their spare processing power to more profitable ends.

And yet, the RC5-72 project’s overall keyrate is higher today than it has ever been. The reason is the development and widespread use of powerful GPUs in home computers. I’m the first to admit I don’t really understand the finer points of how computer hardware works, but GPUs turned out to be roughly 1000 times faster than even the fastest commercial CPU at crunching through keys. Traditional CPU crunching now makes up less than 10% of the total daily production; the vast majority comes from ATI Stream GPUs (70%), with the rest from NVIDIA CUDA (5%) and the more recent OpenCL, supported on both ATI and NVIDIA hardware (14%).

As much as I salute distributed.net for continuing to maintain the project, run the supporting key servers that distribute the work, and keep the stats going, I have to say I’ve never seen any active effort to promote it. The website is rather dated and, to my memory, has the exact same design it had in 2001 when I first started. There are no social media buttons, no methods of incentivisation, and even some of the more basic things like keyrate graphs have been broken for months (or years) without anybody worrying about fixing them.

A possible solution

I know that it must be difficult to commit any real effort to something that has been a back-burner operation for almost 10 years, but the completionist in me is desperate to somehow mobilise the modern massive internet to attack the project with gusto and get it to 100% in a month. I know that’s a laughably ridiculous suggestion, because the scale of the work required is massive. If you had a reasonably powerful ATI graphics card that could churn through one billion keys per second, you might be able to crunch around 20,000 work units per day if the GPU was dedicated to the task 24/7. At that rate you could expect to complete the project in around 150,000 years.
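
The arithmetic behind those figures, as a quick sketch (again assuming that one work unit is a 2^32-key block):

```python
# Back-of-the-envelope check of the single-GPU figures above.
# Assumption: one RC5-72 work unit = 2^32 keys.

keyrate = 1e9                                # keys/sec for a fast ATI card
units_per_day = keyrate * 86400 / 2 ** 32
print(round(units_per_day))                  # ~20,000 work units/day

total_units = 2 ** 40                        # ~1.1 trillion units in the full keyspace
years = total_units / units_per_day / 365.25
print(round(years))                          # ~150,000 years
```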

So 1200 people aren’t going to get this done, even though the top 100 of those can pump out almost 10 million work units a day.

The advent of ‘the cloud’ has made potentially unlimited computing power available to anyone in the world – at a cost. Amazon Web Services (AWS) have the largest compute infrastructure of any provider and I’ve spent quite a while familiarising myself with the platform. It occurred to me that a key-crunching test of RC5-72 was in order.

The Test

For the basic test, I provisioned a compute-heavy c3.8xlarge instance running a traditional Linux x86 CPU client, and a GPU g2.2xlarge instance running the CUDA 3.1 distributed.net client, and I latterly also tested the OpenCL client.

The keyrate and cost results were as follows:

Instance Type  dnetc Client             Keyrate (Mkeys/sec)  EC2 Spot Price ($/hr)
c3.8xlarge     v2.9109.518              180                  $0.32
g2.2xlarge     v2.9109.518 (CUDA 3.1)   423                  $0.08
g2.2xlarge     v2.9109.520 (OpenCL)     432                  $0.08

The OpenCL client was the winner, offering 432 million keys/sec for 8 cents an hour. It should be noted that this falls far short of the best recorded speed from a GPU, where an ATI Radeon HD 7970 can manage a stunning 3.6 billion keys/sec – although that benchmark list is at least a year old, so it’s probable there are cards out there that are even more powerful. Compared to that, a mere 0.43 billion keys/sec is only about 12% as fast.

The potential advantage of the AWS cloud is seemingly not its raw speed but its scale. I can’t run 10 graphics cards at home, but I can run 10 instances of the dnetc client. So that’s what I did, for around 36 hours. The cost of 10 instances for one day equated to $0.80 x 24 = $19.20 – not bank-breaking, but quite a lot to achieve a total speed of 4.32 billion keys/sec. That singular 24-hour effort put me at #26 in the top 100 rankings for the day, and the 87,000 work units completed bolstered my grand total to 1.4 million units over the 12 years I’ve been working on the project. A fairly hefty chunk relative to my total effort, but still a tiny drop in an enormous ocean.
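
Those figures reconcile neatly, again assuming a 2^32-key work unit:

```python
# Reconciling the 10-instance cluster numbers quoted above.
# Assumption: one RC5-72 work unit = 2^32 keys.

instances = 10
keyrate_each = 432e6                         # keys/sec per g2.2xlarge (OpenCL client)
spot_price = 0.08                            # $/hr per instance

combined = instances * keyrate_each
units_per_day = combined * 86400 / 2 ** 32
cost_per_day = instances * spot_price * 24

print(f"{combined / 1e9:.2f} Gkeys/sec")     # 4.32 Gkeys/sec
print(round(units_per_day))                  # ~87,000 work units/day
print(f"${cost_per_day:.2f} per day")        # $19.20 per day
```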

Getting to 100%?

After a few calculations, I estimated that I would require 346,000 g2.2xlarge instances like this, running 24/7 for one year, to complete the project. Assuming I could maintain a spot price of 8 cents an hour, it would cost ‘only’ $232 million to provision the cloud for the job. I think it’s pretty unlikely I could crowdsource that from the internet in order to finish an old cryptography project, but there are other, simpler options.
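
A rough reproduction of that estimate (the small difference in the dollar figure comes down to rounding and the exact average spot price assumed):

```python
# Rough reproduction of the 'nodes for a year' estimate above.

keyspace = 2 ** 72                      # total RC5-72 keys
node_keyrate = 432e6                    # keys/sec per g2.2xlarge (OpenCL client)
spot_price = 0.08                       # $/hr per instance

keys_per_node_year = node_keyrate * 86400 * 365.25
nodes_needed = keyspace / keys_per_node_year
annual_cost = nodes_needed * spot_price * 24 * 365.25

print(round(nodes_needed))              # ~346,000 instances
print(f"${annual_cost / 1e6:.0f}M")     # ~$240M, in the same ballpark as above
```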

1) Better GPUs. AWS aren’t the only cloud provider and one of the many others may provide GPU instances with better computational power – but none of them offer AWS’ novel ‘spot pricing’ model to get instances at rock-bottom prices, so this is unlikely to be cheaper.

2) Don’t use the cloud at all. Somehow compel millions of people to start running the client on their high-end graphics cards at home and work too. Might happen… but not without a concerted social media effort, refresh of the website, and some kind of modern fun ‘game’ incentivisation thing. Doing it for the fun of cryptographic curiosity is unlikely to motivate too many casual users.

3) More bespoke supercomputers like this one. The Center for Scientific Computing and Visualization Research at the University of Massachusetts has built a bit of a Frankenstein supercomputer from old PlayStations and AMD Radeon graphics cards. It’s seemingly used for other computational purposes, but the excess capacity is put into the RC5-72 project and has recently been churning out an eye-popping 1.2 million work units a day (more than 10% of the total work of the top 100) – cool huh?

4) Maybe Amazon would like to take on the challenge directly to demonstrate the power of their cloud. They’ve done things like this in the past (such as creating the #72 supercomputer in the world, with 26,496 cores rated at 593 teraflops) and would presumably do it for free as a bit of a boast, but then there would also be the fear that the other tenacious users on the project would feel a bit cheated by having a giant multinational come along and solve the problem without them. But it’s also possible that the joy of having it complete would be worth the disappointment of not having achieved it personally.

This entire experiment and post was a bit of nostalgic indulgence for me. RC5-72 has been a tempting Everest of distributed computing, and back in the day I wanted to be part of the pioneering team that conquered it. Now there are only a few of us left, and we haven’t even reached base camp yet. At this point I’d gladly hop on a helicopter to the summit, except I don’t know where to get one.