Steve: Developing on the Edge
Thoughts on development, Web-services, technology and mountains.
Fri 2 Oct 2009
A Cloud Tools Manifesto

Dedicated to the belief that tooling that works reliably can only be achieved by good designs and adequate testing

There's been lots of discussion about what people who deploy their applications in "the cloud" want -it gets fairly political, as the Open Cloud Manifesto shows. That manifesto contains requirements about portability that are anathema to people hosting applications directly in their infrastructure -Google, Microsoft, Salesforce.com- and I worry about it too. I worry about the HTML source of the manifesto -have you seen it? Scary. Someone has gone to a lot of effort to make a web page look like a document.

Having read the manifesto, I cannot help but smirk when I read the bit about re-using existing standards and judicious creation of new ones. This manifesto came from the same company that gave the world WS-ResourceFramework? Either they have learned the error of their ways, or WS-RF is one of the existing standards they have in mind.

Anyway, I'm not going to comment on it in detail, except to say I'm working at a different level, namely the problem of getting client code to work with different clouds with different machine management APIs. Listing what you want to do with the machines, that's something to worry about in a different place -my concern is what can be done to work with the infrastructures, and what tool authors need from the service providers.

Here, then, is my Cloud Tools Manifesto. I have no signatories for it yet, having only recently written it. I did post a draft to the Typica list, where there's been recent talk of a mock EC2 stack. One of the responses was an invitation to get involved in the Open Cloud Computing Interface (OCCI) group, which is part of the Open Grid Forum. I was most amused, for reasons some people (Savas) should recognise. Time spent with Savas and Jim was the best part of the GGF process.

Cloud Tools Manifesto

API requirements

  1. Provide enough information about library/protocol calls that anyone can implement tools to work with the infrastructure -any vendor specific tooling is the start, not the finish
  2. Licensing of the header files, or any other parts of the specification, should not prevent open source -any license- or closed source implementations.
  3. Do not require that the tooling is implemented in a specific language, using a specific SOAP library, on a specific OS. There may be restrictions on what the infrastructure can run, but that does not need to affect the tools used to get that code working.
  4. Where possible, use underlying protocols/specifications that are, by virtue of their stability in the field or rigorous specification and test suite, highly interoperable.
  5. When XML is used, generate well-formed XML 1.0 in responses and error messages.
  6. Parse XML formatted responses in a proper XML parser; do not be brittle to different XML encodings.
  7. If you add a new authentication method to HTTP, provide the relevant patches and tests for popular libraries such as Apache HttpClient.
  8. Have a structured form for error messages. SoapFaults, ugly as they are, are something you can chuck back on HTTP responses.
  9. Provide stable constants for some failure modes (no auth, no credit, not enough machines), document them online, ideally with a URL rule that lets us take an error name "e_no_auth" and map to some documentation such as "http://cloud.example.org/messages/en/e_no_auth" . Better yet, make the URL the fault constant, as it includes explicit namespace information.
  10. List the constants, in machine parseable XML as well as HTML, so that XSL transforms can generate language-specific error constants
  11. If you adopt someone else's cloud machine management API, retain their error response structure and the error codes. You may need to add new faults, in which case they should be your own URLs. If you do not provide faults that parse the same way as the original API, you have not implemented the API.
  12. Include API/build version data in the error. This is very useful for fielding bugreps against rapidly evolving implementations.
  13. Don't assume the caller's clock is accurate; that assumption makes testing under VMware really tricky.
  14. If the service is somehow sensitive to clock times, provide a documented means for a caller to easily determine your endpoint's view of system time; this can be used to calculate the offset the client needs to apply to its own clock.
  15. Provide some means of contact with people who can help debug interoperability problems. Good: Forums, email, issue trackers. Not acceptable: requiring the developers to travel round the world for meetings and conferences.
  16. Where possible, engage us in discussions about API futures. NDAs and conflicts of interest complicate this, but it would still be useful.
  17. Listen to our feedback. If something is hard for us to test, it probably doesn't work right in our code.
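
To make the error-handling requirements concrete, here is a minimal client-side sketch in Python. It assumes a made-up fault document: the element names (`fault`, `code`, `message`, `version`), the version string and the example.org URL layout are all hypothetical, not any provider's real format. It shows a fault being parsed by a real XML parser, with the fault constant being a URL that doubles as a link to the documentation.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical fault body; the element names are assumptions for illustration.
# The non-UTF-8 declaration is deliberate: a real parser copes, a regexp may not.
SAMPLE_FAULT = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<fault>
  <code>http://cloud.example.org/messages/en/e_no_auth</code>
  <message>Caller is not authenticated</message>
  <version>api-1.2 build-3456</version>
</fault>"""


@dataclass
class CloudFault:
    code: str      # stable fault constant; here a URL pointing at its documentation
    message: str   # human-readable text, for logs and bug reports
    version: str   # API/build version reported by the endpoint


def parse_fault(body: bytes) -> CloudFault:
    """Parse a fault body; raises ET.ParseError if it is not well-formed XML."""
    # Passing bytes lets the parser honour the encoding in the XML declaration.
    root = ET.fromstring(body)
    return CloudFault(
        code=root.findtext("code", default="").strip(),
        message=root.findtext("message", default="").strip(),
        version=root.findtext("version", default="unknown").strip(),
    )


def short_name(fault: CloudFault) -> str:
    """Recover the short error name (the last path segment) from the URL constant."""
    return fault.code.rstrip("/").rsplit("/", 1)[-1]


fault = parse_fault(SAMPLE_FAULT)
print(short_name(fault), "->", fault.code, "reported by", fault.version)
```

Because the constant is a URL, a client can log it as-is and the reader of the log can follow it to the documentation without any lookup table.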

Testing

Mock Endpoint

Provide a mock endpoint that:

  • Has the same API and error responses as the production endpoint
  • Simulates the allocation/release of VMs and other assets, validates all requests
  • Can be set up by a caller to fail for the next request from a specific account, with a specific failure.
  • Is free to use to everyone with an account.
  • Can be used by test accounts whose authentication details aren't required to be kept a secret. This would let us embed the tests in open source releases, run on hudson, etc.
  • If the mock endpoint can be redistributed as a program, a library or a VM Image, provide a means of downloading or hosting it for independent testing.

Note that while we can create our own mock endpoints -and often do- those mock endpoints will contain our assumptions about the API and our beliefs about what the failure messages will be. A mock endpoint provided by the production team would fail in the ways the production team expects things to go wrong, and so be more rigorous.
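
As a rough illustration of the "fail for the next request from a specific account" idea, here is a minimal in-memory sketch in Python. Everything in it is hypothetical: the class and method names are invented, and apart from `e_no_auth` the error names are made up for the example. A real mock would sit behind the same wire protocol as the production endpoint.

```python
import itertools


class MockCloudError(Exception):
    """Carries the same fault code the production endpoint would return."""

    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


class MockEndpoint:
    """In-memory stand-in that simulates VM allocation and release."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._vms = {}            # vm_id -> owning account
        self._next_failure = {}   # account -> fault code for its next request

    def fail_next(self, account: str, code: str) -> None:
        """Arrange for the next request from `account` to fail with `code`."""
        self._next_failure[account] = code

    def _check(self, account: str) -> None:
        # A queued failure is consumed by exactly one request.
        code = self._next_failure.pop(account, None)
        if code is not None:
            raise MockCloudError(code)

    def allocate(self, account: str) -> str:
        self._check(account)
        vm_id = f"vm-{next(self._ids)}"
        self._vms[vm_id] = account
        return vm_id

    def release(self, account: str, vm_id: str) -> None:
        self._check(account)
        # Validate the request: only the owner may release a known VM.
        if self._vms.get(vm_id) != account:
            raise MockCloudError("e_no_such_vm")  # hypothetical error name
        del self._vms[vm_id]


# Usage: inject one failure, then watch the following request succeed.
mock = MockEndpoint()
mock.fail_next("test-account", "e_not_enough_machines")  # hypothetical error name
try:
    mock.allocate("test-account")
except MockCloudError as err:
    print("client saw", err.code)
vm = mock.allocate("test-account")
mock.release("test-account", vm)
```

The point of the injection hook is that a client's failure-handling paths can be exercised on demand, rather than waiting for the production endpoint to run out of machines.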

Production Endpoint

On the production endpoint

  • Provide discounted/free machines to the test tool teams. These can be massively underpowered VMs, as we are normally simulating complex systems, not doing real work; real work is something we can pay for. It's the unit tests that run up our bills, creating and destroying machines all the time.
  • Offer access to forthcoming features/API versions, NDAs permitting

Nightly builds

If the infrastructure team has an automated build process with a staging cluster, consider:

  • Offering the tooling developers access to this endpoint, so that they can report problems sooner rather than later.
  • Running a local copy of the tooling against the development branch's endpoint, as part of the CI process.
  • Adding open source tools build and test runs to the CI server's build and test process. This helps find interop problems with the trunk versions of everyone's code.

Our Obligations

In exchange we agree to:

  1. Read your documentation and look at your examples before getting into trouble.
  2. Write code that usually appears to work.
  3. When XML is needed, generate well-formed XML.
  4. Parse XML formatted responses in a proper XML parser.
  5. Document our client for others to use.
  6. Provide our client identification/version info in an HTTP header.
  7. Write functional tests.
  8. Test our code against your endpoints.
  9. Test our code against your endpoints with a proxy in the way.
  10. Write code that fails in some vaguely useful way when things go wrong.
  11. Write code that provides diagnostics information when things go wrong, so as to help in blame assignment when something does not work. For example, list the endpoint, proxy settings, client code, dump the error response.
  12. Have an option to log interactions with the server in more detail.
  13. Write client applications that can be switched to different endpoints, such as the test clusters, or third-party implementations
  14. Where the far end requires the caller's clock to be roughly similar, get the system time from the far end and use the calculated offset to drive the timestamps.
  15. Not to cache DNS values indefinitely; to assume that hostnames move around.
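
The clock-offset obligation can be sketched as follows, in Python. This assumes the endpoint's view of time is read from the standard HTTP `Date` response header; that is my assumption for the example, and a provider may document some other mechanism. The function and class names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def clock_offset(server_date_header: str, sent: datetime, received: datetime) -> timedelta:
    """Estimate (server clock - local clock) from one request/response pair."""
    server_time = parsedate_to_datetime(server_date_header)
    # Assume the server stamped the response halfway through the round trip.
    local_midpoint = sent + (received - sent) / 2
    return server_time - local_midpoint


class OffsetClock:
    """Produces timestamps in the endpoint's view of time, not the local one."""

    def __init__(self, offset: timedelta):
        self.offset = offset

    def now(self) -> datetime:
        return datetime.now(timezone.utc) + self.offset


# Example with fixed values: the local clock is running behind the server.
sent = datetime(2009, 10, 2, 16, 0, 0, tzinfo=timezone.utc)
received = sent + timedelta(seconds=2)
offset = clock_offset("Fri, 02 Oct 2009 16:05:01 GMT", sent, received)
print("apply offset of", offset, "to request timestamps")

clock = OffsetClock(offset)
print("timestamp to sign with:", clock.now().isoformat())
```

The client never touches the system clock; it only corrects the timestamps it sends, which also keeps the behaviour testable with fixed values as above.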

There, that isn't asking for too much, is it?

Comments

On 2 October 2009 at 16:18 Tim commented:
Why XML & only XML? Several of the implementations are XML/JSON and the one we're doing over at Sun is just JSON. It's not obvious to me that XML's virtues are well-suited to this problem space.
On 2 October 2009 at 17:30 SteveL commented:
good point. How about "We will use a stable JSON parser to parse JSON, not some perl regexp". My main comment about XML was related to some of the handling typica has to do for Eucalyptus faults
On 2 October 2009 at 18:50 Steve Loughran commented:
Tim, I've qualified the XML statements to not say that the protocol must use XML, but rather if it is XML, it must be real XML.
On 2 October 2009 at 16:20 Tim again commented:
I'm really unconvinced of the benefits of structured errors. It seems to me that when something goes wrong, the most important thing is to provide lots of clear human-readable text to empower someone to repair the problem.
I've never actually seen an implementation of anything network-centric that reacted programmatically in a useful way to structured errors.
On 2 October 2009 at 17:32 SteveL commented:
I don't expect the machine to react reliably either, but I do like the far end to send back as much information as it can, in a way that can be logged and printed later. SoapFaults are an option to look at -we used to stick things like stack traces in them too.
I should add "promise not to try and turn your fault into a complex fault hierarchy"
founder jclouds
On 2 October 2009 at 17:34 Adrian commented:
Great stuff!
Another important detail is that we should be able to address the api based on user specified keys.
not
With proper account scoping, resources should feel unique to the user, even if the provider uses global keys beneath the scenes
On 2 October 2009 at 17:52 SteveL commented:
That's an interesting thought, works well with PUT operations. Note how I haven't actually said "be RESTy", so the qualifier should be "RESTful endpoints (which we prefer) should support user constructed paths for PUT operations and such like"
Not just for tools
On 2 October 2009 at 19:49 William commented:
This "manifesto" is not just for tools. Any consumer of a Cloud service (whether it's a tool or an app - where you draw the line is up to you btw) wants this.
I'd pick provider-specific APIs (if all providers respect this manifesto) over a shared standard API (for which this is not provided by implementers because they think that "implementing the standard" is enough). Any day.
In the business world, would you rather deal with a business partner who respects the letter of the law but doesn't care about you or one who cares about you and wants you to succeed? The 1st one nominally implements the standard, the second one follows this manifesto.
From someone who has seen the errors of the "implementation whatever, let's make a cool spec" way from the inside.