Amazon Web Services, DevOps

If you’re using the AWS ACM console to create a certificate, and at the final stage you get this annoying and useless com.amazon.coral.service.InternalFailure, I’ve got the solution for you.

You’re probably working in an AWS Organization with a Service Control Policy (SCP), or as a restricted IAM user, and you’ve been given the acm:* permissions on the assumption that this is enough. Sadly it isn’t: you additionally need to add:

kms:CreateGrant

to your IAM or SCP policy in order to successfully create the certificate request. Given that ACM uses the Key Management Service (KMS) to protect the certificate’s private key, it makes sense that creating a new certificate needs the permission to create a KMS grant for it.
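For example, a minimal IAM policy statement along these lines would do it – illustrative only, and you should scope the Resource down to suit your environment:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "acm:*",
        "kms:CreateGrant"
      ],
      "Resource": "*"
    }
  ]
}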

That’s it, hope this saved you from fruitless googling!

IR35, Recruitment

I’ve read a lot of articles lately talking in doom-laden terms about the Off-Payroll working rules that will apply to private-sector contractor engagements from April 2020. Similar rules have been in force for the public sector since 2017, and many are predicting that private-sector IT firms will either curtail their use of contractors or force them to become permanent employees.

If you somehow haven’t already read an article explaining what all this means: in short, the burden of IR35 status determination shifts from the contractor’s business to the end client or ‘fee-payer’ from April 2020. This is designed to catch ‘shadow employees’ – those who earn money without Tax and National Insurance Contributions deducted – by threatening the fee-payers directly with fines and a large retrospective tax bill if they’re discovered to be using contractors who fall within IR35 – in other words, contractors who are basically employees, and whose use of a limited company is just a means to pay less tax on employment earnings.

The taxman really hates this, and there are certainly situations where this is exactly what’s happening. The simple logic goes that if you were to remove all the indirection that a limited company ‘intermediary’ (or Personal Service Company (PSC) as HMRC like to call it) provides, and still have the same working relationship with the client, would you essentially just be working like an employee? If the answer is yes, then logically you should be taxed like one.

Just an attention-grabbing article headline?

The reason I say IR35 is easily avoided is because…well, it is. A significant majority of contractor engagements legitimately fall outside of IR35, but risk being considered inside because of a fundamental lack of understanding of the rules by clients, agencies, and contractors alike. Agencies often recruit for both permanent employees and contractors, and have somewhat sleep-walked into a tendency to use the same processes for contractor engagement as they do for ‘permies’. They use the same recruitment terminology, and the engagement follows the same routine of sending a CV and having an ‘interview’ with the client. This is not how a B2B relationship works.

Meanwhile clients are usually just looking for talent to fulfil a need, and might first look for a permanent employee but resort to backfilling with a contractor due to specialist demand and contention in the IT market. Their unfortunate mentality is that a contractor is essentially an employee but just paid differently.

And sadly, contractors themselves who aren’t terribly savvy let themselves be carried along in the processes set forth by clients and agencies alike who have got the wrong end of the stick as to how working arrangements and engagement terms should work. They send you a bunch of forms to fill out, and give you a contract, and because you want the work you fill it all out and assume this is how it’s supposed to be.

The combination of these poor-diligence factors from all parties creates a situation where, if HMRC come knocking and want to know about a particular engagement, you’re going to have a heck of a time convincing them that this is a legitimate business-to-business relationship and not just employment by another name.

The Good News

All of this is easy to overcome. Admittedly, it requires a bit of education and a realignment of the way one business engages another to service its requirements. These are my top tips whether you’re a client looking for talent directly, or an agency seeking a contractor:

  • Don’t ask me for my CV. A CV is for an individual, and I’m a business. Ask for my business’ experience profile, and I’ll send you all of the skills my company is able to provide along with an overview of its previous clients and achievements. It’ll contain all of the information you’re looking for.
  • Don’t send me a job description. A JD is for an employee. What your client should have is a project-based scoped requirement for the delivery of work. If this is compressed into a job title, e.g. ‘IT Engineer’, you’re doing it wrong. What does the client actually want achieved within the timeframe they have in mind? Sorry, you’ll need to have a think and fully specify it.
  • Don’t tell me you want to put ‘me’ forward for an ‘interview’. Employees have interviews – I’m just the person representing the business. I’ll have an exploratory discussion with the client to determine whether we think there’s a feasible working relationship our businesses can have. If we both agree, great!
  • Remember that because you’re engaging my business, it won’t necessarily be the same person doing all of the work all of the time. At any time the person my business provides might be substituted out for someone else of equal skill who is able to do the work instead. We’ve written that into the contract, so don’t be surprised if it happens.
  • I don’t want to know about how cool or fun your office workplace is, or how the people who work there go for pizza and beers on a Friday afternoon. That’s great for your employees, but I’m running my own business here and I don’t need to know about that. Similarly I don’t want a company-branded nametag, or t-shirt, or any of the loot that you give your employees. I don’t want to partake of any of your employee benefits. Thanks for the offer though.
  • Don’t tell me who my line manager will be. I don’t have a line manager. I work for my own company. I might, on the other hand, have a senior point of contact in the client’s business. That’s fine with me, I need to know who to speak to after all, but let’s not get confused that I’m under someone’s direction and control. I’m not.
  • Don’t tell me there are opportunities to work from home. Work from home? I’m a business, I’ll be working from my own office (which might happen to be in my home) unless there are reasons to be present on the client site. There might be security constraints that require physical presence, meetings, scheduled updates, etc. But generally speaking you don’t get to control where my employee works. Don’t give ‘me’ my own desk – it’s not my desk, it’s your desk. My business is interested in a productive working relationship, so we’ll discuss the best solution as part of our initial discussion.
  • I don’t really want to use your equipment. I’ve got my own equipment that my business has bought and paid for. So unless there is an impossible constraint that requires me to use your laptop (usually security, restrictions on encryption, migration of data off-site, etc.), I’ll be using my company’s gear. Thanks though!
  • Don’t send my business a contract whose wording has been carefully aligned to sit outside IR35 if you’re not absolutely certain that it properly reflects what the working relationship with the client will actually be. An actual IR35 investigation will be far more interested in the critical working practices described above than in whether the contract sounds correct.
  • After this project (that the client has carefully specified) concludes, we both understand that the client is not obligated to provide any new work to my business, nor is my business obligated to undertake it. If the client has a new project they’d like my business to work on, we’ll put together a separate contract with those specifications embedded as before.

If all of this sounds really unreasonable, or precious on the part of the contracting business, I can only apologise but also reiterate that this is precisely how it’s supposed to be. Anything less than the above and you risk falling foul of IR35 Off-Payroll rules that would rightfully identify you as effectively an employee, or a ‘permitractor’ as we like to say in the industry. You’re part of the problem. Fee-paying clients have full discretion on how they arrange their internal project-based work requirements, and it should be pretty straightforward, and not the enormous burden that everyone is making it out to be, to properly package work into engagements that legitimate contractor businesses can consume. The taxman is relying on businesses ultimately being too fearful or just plain feeble to address these issues confidently.

Some of this will require tweaks to your expectations, your processes, your use of terminology and, more importantly, your mentality. IR35 is designed to target people using limited company intermediaries as a thin shell for tax avoidance. Ultimately it’s pretty trivial to align your terms of engagement so that your search for a contractor is correctly and legitimately outside of IR35, and well beyond April 2020 private-sector businesses can continue to legitimately engage contractors as much as they wish, provided they are mindful and take note of the points above.

It might be possible (but you’d really need to convince me) that your business just cannot comply with the above for one reason or another. At which point you concede and advertise the ‘role’ as inside IR35 and search for contractors on that basis. You pay their Tax and NI and increase your costs. I’d expect them to be harder to find mind you, as their business is probably off working for a client that doesn’t have your problem – although I’ll agree that this is somewhat dependent on industry.

Contractors are businesses – treat them like one. It’s time to get with the programme.

Amazon Web Services

In November AWS announced Reserved Instance Purchase Recommendations in the Cost Explorer.

Great! I thought, finally some free native support for what a lot of cost-analysis companies offer at a price.

And yet when I look at the recommendations, they seem completely wrong to me. Each recommendation provides a link to see the usage that it was based on – but the main issue there is that Cost Explorer only knows your total EC2 usage hours per day.

Why is this important? Because Reserved Instance billing is applied hour by hour – an RI can only be ‘used’ by instances running concurrently within the same hour.

Here’s an example:

I have an autoscaling web app that runs c5.large instances, each with a normalisation factor of 4. It idles at 2 instances outside of working hours, and autoscales up to 6 instances for 8 hours a day.

Over the day my normalised units are:

  • 2 instances x 4 units x 24 hours = 192
  • 4 additional instances x 4 units x 8 hours = 128
  • Total units: 320.

That’s consistent every day. AWS sees 320 units per 24 hours = 13.3 normalised units per hour. At 4 units per instance, that’s a recommendation of 3 c5.large Reserved Instances.

So I buy 3 Reserved Instances. Since I have at least 2 instances running all day, that’s 100% utilisation for two of them. But the remaining one is only used for 33% of the day, in the 8-hour period when I scale above 2.

To recap:

  • I’ve bought 3 Reserved c5.large Instances (no upfront, N. Virginia) at $39.42 each per month, which I’m paying whether I use them or not.
  • If I ran the third instance on demand for 8 hours a day, it would cost me only $20.68 per month. But I’ve bought a Reserved Instance for it as recommended, and it’s now costing me $39.42 instead.

So that’s the problem – the recommendations are completely useless for anything but baseline steady-state load. If you scale during the day – particularly if you have short windows of very high peak usage – the recommendation is based on a large aggregate of normalised unit-hours, which completely fails to map onto how Reserved Instance billing is actually applied, hour by hour.
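Here’s a rough sketch of that arithmetic for the part-time third instance above, using assumed, rounded prices (c5.large on-demand at roughly $0.085/hr and a no-upfront RI at an effective $0.054/hr in N. Virginia – check current pricing for your own region):

# Assumed, rounded prices – substitute current pricing for your region.
awk 'BEGIN {
  ri = 0.054 * 730       # no-upfront RI: billed for every hour of the month, used or not
  od = 0.085 * 730 / 3   # on-demand: billed only for the 8 hrs/day (a third of the month) actually used
  printf "RI:        $%.2f/month\n", ri
  printf "On-demand: $%.2f/month\n", od
  printf "Overspend: $%.2f/month for this one part-time instance\n", ri - od
}'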

That’s a significant loss for this small example, and scaled up across a large estate the recommendations and indicated savings are irresponsibly misleading. To work properly the tool needs to look at a moving hourly window of concurrent instance usage across the day, to work out whether an RI that isn’t fully utilised would still result in an overall saving. Currently, for an average auto-scaled application, I can’t see that it would – but in its current state it’s impossible to calculate with any accuracy.

References

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/apply_ri.html
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts-reserved-instances-application.html
https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/ri-recommendations.html


TL;DR: Recommendations are calculated over an aggregated day, while Reserved Instance billing is applied hour by hour, so they’re worse than useless.

Docker

Recently I was frustrated in a Jenkins build when I was running Docker-in-Docker to build and push a container to AWS Elastic Container Registry (ECR).

The error on push was a familiar `no basic auth credentials`, which means some issue with the credentials stored in ~/.docker/config.json (or ~/.dockercfg in earlier Docker versions).

In this case I initially couldn’t understand the error, as the Jenkins declarative pipeline was using the docker.withRegistry function for the registry login, and the credentials file was being successfully written – so what was going on?

Eventually it occurred to me, although it’s not obvious at first: as we’re running docker-in-docker, you might assume that the credentials are looked up relative to where the Docker daemon is running (i.e. on the host), but actually they’re looked up relative to where the client is calling the daemon from – in this case, within the container. The docker.withRegistry call in Jenkins was creating credentials on the host, not within the container where the client itself was running.

There were two possible solutions here: either ensure you run the docker login command within the client context of the docker-in-docker container, or mount the host’s .docker directory into the container using something like `-v /root/.docker:/root/.docker` (adjusting the paths depending on which user you run your containers as).
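As a rough sketch of both options (the registry URL, image name and $ECR_PASSWORD variable are placeholders for whatever your pipeline actually provides):

# Option 1: log in from inside the docker-in-docker container, so the credentials
# land where the pushing client will actually look for them.
docker login -u AWS -p "$ECR_PASSWORD" 123456789012.dkr.ecr.eu-west-1.amazonaws.com
docker push 123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-app:latest

# Option 2: mount the host's credential store into the build container, so the login
# performed by docker.withRegistry on the host is visible to the inner client.
docker run -v /root/.docker:/root/.docker my-build-image ...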

DevOps, Recruitment

This is a true story, although names and details have been removed or changed to protect the awful. Quotes however are verbatim.

As often happens when you have ‘DevOps’ in your job title, I received some recruitment agency spam from someone I’ll call Jane, c/o LinkedIn, alerting me to a ‘fantastic opportunity’ that I should call to discuss. I replied and asked Jane to send over the job spec. Her reply was ‘Send me your CV first, and I’ll send you the spec’.

Huh. Given that my LinkedIn profile is complete and public, there should be no need to hand over my personal data before I even get to see the job description – bearing in mind that I was cold-contacted first – but in any case I fired it over, along with an indication of my salary expectations as requested. I received the spec along with a warning: ‘Keep the company name to yourself!’ We’ll call the company Sprockets Incorporated (not their real name, but what a great name!)

Jane then called to more formally introduce herself, and asked to meet with me, stressing that she liked to meet everyone she represents, and we agreed a date for a few days’ time. At 4pm the day before the meeting I received an email from Jane which said:

Hope you are well. We have a meeting set up for tomorrow at 11am, unfortunately I’m going be out of the office therefore need to cancel our meeting.

I’ve also had feedback from Sprockets and unfortunately they won’t be progressing to interview.

I’d like to thanks you for your interest and wish you luck in your job search going forward.

Well, what a coincidence! The company didn’t want to proceed to an interview and simultaneously, just by chance, her schedule had suddenly filled up and she could no longer meet me. Almost like the second I wasn’t immediately valuable to her, I wasn’t worth meeting and summarily dropped.

I almost replied to remark as much, but instead politely asked if there was any feedback. It transpires that my salary expectations were slightly higher than Sprockets were looking for, although not by much. I said to Jane that money was not the single most important factor in any job, so I would be happy to negotiate on that point and  have a conversation with them on that basis.

Suddenly, it’s all back on and Jane is my best friend again. She also wants to know if I know any Test Leads that I can put her in touch with because she has other roles to recruit for. Ahem. If you’d like a professional referral you can talk to me about the signing bonuses you’re offering, otherwise I’m not doing your job for you!

I have a telephone call with Sprockets, which converts into a face to face interview arranged for a couple of days later. Jane sends me an interview confirmation for the ‘Test Lead’ role, and I point out I’m interviewing for the DevOps position. To which she says ‘Oops’, and incredibly asks again if I know of any Test Leads that might be looking.

As timing would have it, I’m having another interview the day after Sprockets with a company called Spacemax (still not a real name). The interview with Sprockets goes great – everyone seems really nice and they seem like they’d be a good place to work. During the course of the interview they ask if I’m looking anywhere else, so I admit to having another interview the next day (it can’t hurt). To my surprise, a job offer comes through later that same day. I thank them and ask for some time to consider, at least to allow me to have the other interview, but I assure them that afterwards I’ll get back to them promptly.

Suddenly, I can’t get rid of Jane. She calls me several times, pressuring me to take the offer – telling me Sprockets is great. Not only that, but that her agency has the exclusive contract for Sprockets, and if I pass this up I won’t see the job advertised anywhere else. I already know that’s a lie, of course, because in the meantime I’ve already been contacted on LinkedIn by another agency representing the same role. I repeat that I’d like to have the Spacemax interview and I’ll let her know after that.

Unfortunately for me, the interview with Spacemax goes great too, and they make me an offer on the spot. Now I have a difficult decision to make – I like both companies and there’s not a lot to separate them in terms of salary or benefits. Luckily my phone was on silent mode, because it turns out Jane was calling me several times during the interview to see if I’d made up my mind about Sprockets yet. As I’m walking out of Spacemax, my phone rings again and it’s Jane. I tell her I’ve been made an offer and will need time to think.

This doesn’t go down well. She assures me Sprockets are a great company, with a bright future, and I’d do really well there and she’s not just saying that, and then goes into a little three minute sales pitch about why this place is great and how I should definitely take the job. Incredibly when she’s finished, she says to me: ‘So what do you think now?’ Wait, what? Did she think a couple of minutes of off-the-cuff pressure selling was going to sway me into accepting?

Because I like to keep everyone in the loop, I let both companies know I have competing offers and I’ll be deciding shortly. I make it clear this is not a tactic to play one offer off against the other – I’d just like a little time to consider. It’s damn tricky, I really do like them both.

Jane keeps calling me. Sprockets are keen to convince me, and various senior members of their team call me individually to talk through their take on the business and why it’d be great if I joined them. After each call Jane is back onto me – I am now dreading having to pick up the phone to her – to see if any of it has done the trick in swaying me. She tells me Sprockets is much better than Spacemax, and has a better future ahead. She also tells me that she was so confident I was the right person for the job that I was the only candidate she put forward. I wish I could say this is a bit of embellishment for the sake of the story, but it’s not. If I’m the only candidate she’s put forward, it’s because she couldn’t find anyone else – or perhaps they’d been snapped up by the other, much more competent agency. I wish I had been.

After deliberating, I decide I’m going with Spacemax. I thank Sprockets profusely for their consideration and tell them genuinely that it was an incredibly marginal and difficult choice, and wished them the very best for the future.

I receive the following email from Jane:

Thanks for your email. Really disappointing but you’ve made your decision.

I wish you every success and if anything changes please get in touch.

Do you know of anyone else that may be interested?

Sheesh! Crass and absurd.

But wait, the story isn’t quite over yet. I start with Spacemax a week later (the turnaround on interview and starting was crazy!), and precisely 5 days into my new job I get the following email from Jane:

Hi Pete

How are you? How’s the new role going? Hope it’s all working out.

I’d love the opportunity to work with SpaceMax and wondered if you could help me/introduce me to right person to discuss recruitment?

Speak soon

Sweet Jesus. I ignore this, and a couple of weeks later I see that I have a missed call from Jane. I email her, asking what the call was about. She replies to repeat her request for an introduction to be able to recruit for us. Having now finally had enough, I tell Jane that I felt her approach was completely inappropriate, particularly in light of the fact she was so recently pushing me very hard NOT to work for Spacemax.

I believe that even had I been interacting with a credible, competent agency, I’d have made the same decision, but wow Jane did not make it easy and I can only imagine that Sprockets would have been furious if they’d known they were being represented by someone so unprofessional.

If you ever want to hire anyone, for anything, don’t do this.

On the flip side, here’s what the other recruiter representing Spacemax said to me about my competing offers:

Obviously if you go for Spacemax it is in my interests, but I appreciate it’s a decision that’s entirely up to you and Sprockets are another good company.

Much, much better.

Ansible, DevOps

So you want a simple Slack failure handler for Ansible to ping your alerts channel whenever a deployment fails. Despite a thorough search I couldn’t find any examples that did this adequately, and even the solution here is a little constrained. The basic requirements are:

  1. A Slack notification on any task failure.
  2. The name of the task.
  3. The name of the host.
  4. The error debug message.

Sadly there isn’t one global failure handler configuration in Ansible. I investigated the native Slack Callback plugin available in Ansible 2.x (and not to be confused with the existing Slack module) but this seemed to be more of a generic passthrough for outputting all plays into Slack, failed or not, which isn’t what I wanted. After discarding the idea of a custom callback plugin for my purposes (which could work, but felt overly complex), I settled on a per-playbook failure role.

This is not an actual ‘handler’ as per Ansible parlance, but I was coming from experience of Chef where I could have a global failure handler baked into the Chef-client config.

To make this work we need to use Playbook Blocks (available as of 2.0) and essentially enclose the entire playbook in a block/rescue. The main hassle here is that a block can’t wrap the top-level roles: section (and wouldn’t catch any failures therein), so I had to convert all of my role calls into tasks that use include_role instead.

A simple playbook example looks as follows:

playbooks/playbook.yml

- hosts: "{{ target_host | default('127.0.0.1') }}"
  gather_facts: true

  tasks:
  - block:
    - include_role:
        name: install_app
    - name: Greet the world
      shell: echo "hello world!"
    - fail:
        msg: "I've gone and failed the play!"
    rescue:
      - include_role:
          name: slack_handler
          tasks_from: failure

And in my slack_handler role (for reusability):

roles/slack_handler/tasks/failure.yml

- name: Notify Slack of Playbook Failure
  slack:
    username: 'Ansible'
    color: danger
    token: "{{ slack_webhook.split('https://hooks.slack.com/services/')[1] }}"
    channel: "#deployment-alerts"
    msg: "Ansible failed on *{{ ansible_hostname }} ({{ inventory_hostname }})* \n
    *Task*: {{ ansible_failed_task.name }} \n
    *Action*: {{ ansible_failed_task.action }} \n
    *Error Message*: \n ```{{ ansible_failed_result | to_nice_json }}``` "
  delegate_to: localhost

ansible_failed_task and ansible_failed_result are two currently painfully undocumented (shout-out to Brian Coca for pointing me in the right direction) but delightfully detailed variables that are populated on playbook failure. ansible_failed_task is a map that contains a lot of data, so you may want to add additional debug output for your purposes. The raw error message is single-line JSON, so we prettify it with to_nice_json before sending it via the Slack module. The string split is a bit of an ugly hack to extract the token part from the full webhook URL, which is used elsewhere in my plays and passed in at deploy time. For some reason the Slack module requires only the token rather than the full URL, in contrast to a lot of other integrations that want the whole thing.
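For illustration, assuming the webhook is supplied as an extra var at deploy time (it could equally come from a vars file or vault), the invocation might look something like this – the host group and token below are placeholders:

# Hypothetical invocation – placeholder host group and webhook token.
ansible-playbook playbooks/playbook.yml \
  -e target_host=web-servers \
  -e slack_webhook=https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX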

The remaining hassle is the need to add this to every playbook, but short of a custom callback plugin it seems like the simplest way to achieve this in Ansible currently.

Running

The original ASICS marathon training plans remain among the most popular and best-loved guides for those preparing for their first marathon (or looking to improve on a previous one). Having used these myself, I was slightly frantic recently when I realised that Runners World, who had originally hosted the plans, removed them from their site sometime in August 2017.

To avoid these being lost in the internet ether, I quickly located copies and have rehosted them here for people like me who want a solid reference guide for their marathon training 🙂

ASICS_TRAININGPLANS_Sub 3.00
ASICS_TRAININGPLANS_Sub 3.30

ASICS_TRAININGPLANS_Sub 4.00
ASICS_TRAININGPLANS_Sub 4.30
ASICS_TRAININGPLANS_Sub 5.00

Amazon Web Services

Courtesy of the AWS subreddit, I was alerted to the fact that the recently-updated AWS Service Terms (covering their new 3D Lumberyard game engine) include this specific clause:

57.10 Acceptable Use; Safety-Critical Systems. Your use of the Lumberyard Materials must comply with the AWS Acceptable Use Policy. The Lumberyard Materials are not intended for use with life-critical or safety-critical systems, such as use in operation of medical equipment, automated transportation systems, autonomous vehicles, aircraft or air traffic control, nuclear facilities, manned spacecraft, or military use in connection with live combat. However, this restriction will not apply in the event of the occurrence (certified by the United States Centers for Disease Control or successor body) of a widespread viral infection transmitted via bites or contact with bodily fluids that causes human corpses to reanimate and seek to consume living human flesh, blood, brain or nerve tissue and is likely to result in the fall of organized civilization.

It’s good to know that AWS are accommodating all possible contingencies for social and economic failover.

Amazon Web Services, distributed.net

Amazon Web Services (AWS) Release new g2.8xlarge GPU instance

Today AWS announced a long-awaited upgrade to their G2 family of instances – the g2.8xlarge, big brother of the 2x, which was hitherto the only GPU-backed instance available on the AWS platform.

Here’s how the two compare in specifications:

Instance     vCPU   ECU   Memory (GiB)   Instance Storage (GB)   EC2 Spot Price ($/hr)
g2.2xlarge      8    26             15   1 x 60 SSD              $0.08
g2.8xlarge     32   104             60   2 x 120 SSD             $0.32

The four GPUs are the same NVIDIA GRID K520 seen in the 2x instance, and as you can see by the numbers the 8x is exactly four times larger in every respect. The indicative spot price at the time of writing was also very close at roughly 4x the cost.

Following my previous post, where I benchmarked the g2.2xlarge using the distributed.net RC5-72 project, I re-ran the same test on an 8x. You will not be surprised to learn that the results show a linear increase in the crunching keyrate to roughly 1.7 GKeys/sec (up from 432 MKeys/sec on the 2x).

Is bigger better?

AWS’ fleet and pricing structure is generally linear. For an instance twice the size, you pay twice the cost, both in spot and on-demand. The major difference that is not very clearly advertised is that network performance is greater for larger instances. AWS are vague as to what ‘Low’, ‘Moderate’, and ‘High’ mean in terms of raw speed (many others have tried to benchmark this), but for the largest instances it is explicitly stated as 10 Gigabit. It stands to reason that a larger box pumping out more data needs a network connection to match – but equally, an instance generating only a quarter as much data should be well enough served by a ‘Moderate’ connection.

A real world use case

In my day job I set up a variety of analyses on genetic data that is supplemented by EC2 computation clusters (The recent AWS Whitepaper on Genomic Architecting in the Cloud is a really useful resource I can throw at scientists when they have questions). I investigated the viability of G2 instances and for a specific analysis that was GPU-capable, it did indeed run roughly 3-4 times faster than the same job running on a single CPU core. The problem was memory – each job used roughly 3-5GiB of memory meaning I couldn’t run more than 3 or 4 jobs on a single g2.2x GPU at once.

However, on an r3.8xlarge – a CPU instance with 32 cores and 244GiB of memory – I could run 32 concurrent jobs with memory to spare. Sure, the jobs took 30 minutes each instead of 10, but I could run 32 of them at once.

Then I drilled down on cost/benefit. The G2.2x was $0.08 on spot, and the r3.8x was $0.32. Four times as much per hour to run, but with 10 times as many jobs running. It ended up being a no-brainer that a CPU instance was the way to go.

Perhaps this is a poor example, because the capabilities of genetic analysis are badly limited by the tools available for the specific job, and it’s reasonably rare to find anything that is built for multi-threading, let alone something designed specifically to run on GPUs. The implementations of these analysis tools are black boxes and we’re not software developers. Our tests were probably very poor exemplars of the power of a GPU instance, but they did show that a mere 15GiB of RAM on the 2x just wasn’t anywhere near enough. 60GiB on the 8x is a little better, but in my use case it still wouldn’t offer any additional benefit because I wouldn’t be able to leverage all of the GPUs I’m paying for (our software just isn’t good enough). FastROCS, the example cited in Jeff Barr’s AWS Blog announcement about the g2.8x, also mentions the 15GiB of the 2x being the limiting factor, so presumably they’re running jobs that can leverage more GPU in a single job without a proportional increase in memory usage.

The main benefit of one vertically-scaled box four times the size is speed. If your application can utilise four GPUs simultaneously within the memory limits then you could, for example, transcode a single video extremely quickly. If speed is the main factor in your service delivery then this is the instance for you. If however you’re running smaller, less time-critical jobs that the 2x can handle just as well, there’s little benefit here – unless you consider the logical management of four separate GPU instances to be more hassle than running one four times the size. But then all of your GPU eggs are in one basket, and if the single instance goes down so does all of your capacity.

As with all AWS usage your chosen method of implementation will depend on what’s right for you. This is going to be great news for a lot of people – but unfortunately not me!

Amazon Web Services, cPanel WHM

cPanel Webserver on EC2 (or How I learned to stop worrying and love the cloud)

It’s been a long time coming but my company is now fully based on Amazon Web Services EC2 for our web hosting. It’s been a long journey to get here.

For more than 15 years we’ve cycled between a variety of providers who offered different things in different ways. First our sites were hosted by Gradwell, on a single shared server where we paid a per-domain cost for additional hosting. We left them in 2004 after 6 years and following a very brief flirtation with Heart Internet we moved to a now-defunct IT company called Amard. This gave us our first taste of CPanel as a way of centralising and easily managing our hosted sites. It was great – so great that it made moving to ServerShed (also defunct) in 2005 very easy.

This was our first dedicated server with CPanel and we enjoyed that greater level of control while still having the convenience of a largely self-managing service. In 2007 we hopped ship to PoundHost where we remained until last week. Three or four times in the last 8 years we’ve re-negotiated for a new dedicated server and battled with the various frailties of ageing hardware and changing software. Originally we had no RAID and downloaded backups manually, then software RAID0, then hardware RAID0, then most recently with S3 backups on top.

Historically ‘the cloud’ was a bit of an impenetrable conundrum. You knew it existed in some form, but didn’t really understand what it meant or how its infrastructure was set up. Access to the AWS Cloud in the early days, as far as I understand, was largely command-line (CLI) based and required a lot of knowledge to get going on the platform. It required a lot of mental visualisation and I’m sure the complexity would have been beyond me back then. Some services didn’t exist, others were in their infancy. Everything was harder.

In that respect I almost don’t mind being a bit late to this party. It’s only in the last two or three years that it seems the platform has been opened up to the less-specialised user. Most CLI functions have been abstracted into a pretty, functional web console. Services interact with each other more fluidly. Access controls that govern permissions to every resource in a really granular way have been introduced. Monitoring, billing, automated resource scaling, load balancing and a whole host of other features are now a reality and can boast really solid stability.

Even then migration has not been a one-click process. CPanel do offer a VPS-optimized version of the Web Host Manager (WHM) and that’s the one we’re running. Its main boast is that it uses less memory than the regular version, apparently as a concession to the fact that a Virtual Machine is more likely to exist as a small slice of an existing server and won’t have as many resources allocated to it. It looked to be the best fit.

Then we needed to find a compatible Amazon Machine Image (AMI) to install it. It seemed to be largely a toss-up between CentOS and RedHat Enterprise Linux. We’d used CentOS 6 in the past with good results (and unlike RHEL it requires no paid-up subscription), so we fired it up and slowly started banging the tin into shape. Compared to other set-ups there is very little AWS-specific documentation on setting up a CPanel environment, so we were mostly sustained by a couple of good articles by Rob Scott and Antonie Potgieter. CPanel themselves wrote an article on how to set up the environment but this didn’t quite cover enough and already the screenshots and references there are out of date. I will write a comprehensive overview of ‘How to Set up CPanel on EC2’ in another article, but to conclude here I will talk about what made this transition so tentative for us that we waited almost a year after setting up AWS before we went live with our main server on the platform:

Fear. Non-specific, but mostly of the unknown. It’s irrational, because when you’re buying managed hardware from resellers you never actually get to see the physical box you’re renting. It’s no more tangible to you than an ephemeral VM is, and yet there’s something oddly reassuring in knowing that your server is a physical brick of a thing loaded into a rack somewhere. If something goes wrong, an engineer that you can speak to on the phone can go up to it and plug in a monitor to see what’s happening. It’s not suddenly going to disappear without a trace.

Not so with the cloud. It’s all a logical abstraction. You don’t know, you’re not allowed to know, precisely where the data centres are. You don’t know how the internal configuration works. Infrastructure is suddenly just software configuration and we all know how easy it is to make a mistake. Click the wrong box during the set-up and you might have your root device deleted accidentally, or be able to terminate the server with all of your precious data because you didn’t enable Termination Protection. If you’re a little careless with your access keys and they become public, you’ll find your account has been compromised to run expensive clusters farming bitcoins at your expense.

Terrifying if you don’t know what you’re doing. So you investigate, learn slowly, play on the Free Tier. Make things, break things. Migrate one small site to make sure it doesn’t explode. Back it up, kill it, restore it, and be absolutely confident that this is really going to work.

And it does. For the first time I feel like we’re our own hosting provider. I spec the server, I buy it by the hour, I configure and deploy it, and Amazon is merely the ‘manufacturer’. Except that everything happens in minutes and seconds instead of weeks and days. Hardware failure is handled as simply as stopping and restarting the instance, where it will redeploy on different hardware. If you’ve architected for failure as you should, backups can be spun up in less than ten minutes. If I get nervous about spiralling running costs I can just flip the off switch and the charges stop, or the Trusted Advisor can offer suggestions on how to run my configuration more efficiently.

It’s telling that every company I’ve ever bought a physical server from is now pushing cloud-based offerings to their customers too. The time of procuring your own hardware is passing and being replaced by more dynamic, more durable, more robust solutions.

You’d be forgiven for thinking this whole post was merely a sly advert for AWS, but I really am just a humble end user. Admittedly one that has been completely converted – and I haven’t even enumerated the full list of services that I now use, to say nothing of the others that I haven’t had time to investigate. This is a great time to make the transition, because while adoption has been growing at a huge rate, I think it’s going to skyrocket to even greater heights in the next few years.

If you don’t have a pilot for the journey, it’s time to train one.