If you’ve built an application that calls any kind of remote Web API – be it based on SOAP, REST/HTTP, Thrift, JSON-RPC or otherwise – your code must absolutely expect it to fail. Expecting your remote API calls to always work is a recipe for a late-night call to your phone during a future system disaster. As Steve Vinoski correctly states, the “illusion of RPC” runs deep in the programming world, and engineers who treat remote Web API calls the same way as local function calls do so at their peril.
Here are some of the many ways that a call to a distributed system can fail. It’s not an exhaustive list by any means, but it’s a start.
Failures of Infrastructure
These are the most basic of failures, usually experienced at the network level.
DNS Lookup Failures
The GoDaddy failure of August 2012 caused millions of DNS names to be unreachable. It’s likely that your application code expects healthy, speedy conversion of domain names to IP addresses, but if DNS doesn’t respond then it won’t get past even the first hurdle.
Lack of Network Access
Skilled mobile developers assume that their code will communicate over a cellular network that may not be always available, and handle its intermittent nature gracefully. How many developers assume the same unreliability when creating an application intended for deployment within a data center? Routers, cables, switches and power supplies fail all the time, and a failure of any of those devices could leave your system without the ability to reach the outside world right when you need it most.
Loss of Data Center
Amazon’s popular and affordable AWS services are used by a significant number of SaaS providers. Their EC2 uptime guarantee of 99.95% entices many developers to build Web APIs to be hosted within it. While that SLA sounds good, the 2011 failure of Amazon’s US-East availability zone showed the technology world what happens when during the 0.05% period. A vast majority of the systems hosted there were completely unreachable during that time, causing a major disruption for many. Importantly, those that followed Amazon’s advice to build redundancy across availability zones (such as Netflix) were able to handle the outage without interruption.
SSL Certificate Issues
SSL certificates are the backbone of secure web communications, but they can be the cause of a lot of frustration. If a server’s domain name changes and the certificate presented by it doesn’t match, the SSL handshake will fail (with good reason).
Should a certificate on either side expire, the SSL handshake will fail too. Web server administrators usually do a good job of preventing server-side certificate expiration, but client-side certificates – while required much less frequently – are sometimes deployed by users who are not very tech-savvy and who don’t keep track of when they expire.
Failures in Communication
Even if your application has basic network connectivity, a variety of errors can still be experienced as it tries to communicate with the target Web API.
A web service that is experiencing greater than peak load might take many seconds (or worse, minutes) to respond to your call. A query resulting in a huge amount of data retrieval and analysis might take just as long to process even on a lightly loaded server. Both of these scenarios could lead to your client giving up and throwing a timeout error when it fails to receive a response. Even if your client call is configured to never time out, a firewall in between your application and the server might decide to kill your connection if it looks idle for so long.
If you’re building a browser-based application and expect all your Ajax calls to reach your server, you’re in for a surprise if you ever break one of the same-origin rules as an attempt to make a call to another endpoint will be blocked by almost all browsers. Even changing something as innocuous as the TCP port number in your URL could prevent your call from working, so make sure you only make Ajax calls to your origin web server.
Products that implement older middleware technologies such as SOAP, CORBA and DCOM provide stubs that take care of encoding the entire payload for you, as well as transmitting it to the server and decoding any response. However using newer architectural approaches like REST can mean that the management of the media types and payload formats will be left to the application programmer. While offering great flexibility, that approach does place a bigger burden on the client developer to ensure that it transmits and understands data in the form the server expects.
A number of RESTful Web APIs use a redirection technique to instruct the client to resend their original request to another URI, and the HTTP standard offers response codes like 301 and 302 for that purpose. But if your client code hasn’t been built using a library that handles them automatically then you’ll have to add redirection logic for all API calls to prevent surprises when 300-range response codes are returned. And when you’re working with well-designed RESTful APIs, you should expect them to be returned.
Almost all OAuth tokens and server-side account passwords expire after a limited lifetime. It can be easy to forget this fact when you are in the heat of rolling out your application to production, so make sure that you mark your calendar ahead of their expiration dates, and that you build behaviors into your application to handle a credential failure if it occurs (perhaps by raising a system alert or emailing your IT team).
Unexpected International Content
If your application makes any assumption that the server will always send back ASCII or ISO-Latin-1 characters, you be in for a rough day when the server sends back Unicode content and your code has no idea how to decode it. Using data formats that natively support encodings like UTF-8 (for example, XML and JSON) should help somewhat but it doesn’t mean that all your work is done. You’ll still have to handle multi-byte characters and store or render them appropriately.
Proxy Servers Getting in the Way
Using a hotel or airplane’s Wi-Fi connection can cause some really interesting behavior for your application. Most pay-to-surf hotspots use an HTTP proxy server to intercept web traffic and ensure that payment is made, but sometimes those proxies do more than just ask for your money. I’ve seen some proxy servers force all HTTP 1.1 traffic down to version 1.0, causing difficulties for applications that relied on features in the 1.1 protocol. Thankfully, most invasive HTTP proxy server behavior can be bypassed by moving to HTTPS (because proxies can’t decrypt that kind of traffic between client browsers and the server as long as the server uses a properly-signed certificate).
Most web services do a good job of truncating huge query result sets before transmitting them to the caller, but some do not. If you accidentally run an enormous query on the server side, it might cause a failure for your client application. Receiving a huge amount of data requires at least as much memory to store it, and (depending on the quality of the parser) might even need to be temporarily duplicated in memory in order to parse it. Make sure that you use any pagination or truncation features available in your API to prevent your application from being slammed with a gigantic result set. If those kinds of features are not available, try to craft your queries to prevent enormous responses.
Failures in Conversation
Even if your application successfully communicates with a web service API at first, failures can still occur after that point.
This is one of the most common reasons for web service API failure. If you hit an API hard enough – even one that you’re paying for – you might discover that the vendor offering it has just cut you off. Of course, this could very well happen to your system at the worst possible moment. Some APIs have no traffic limit when they’re first released but apply them later (something Twitter API developers realized recently, much to their surprise and disappointment).
Keep in mind too that an orchestrated DDOS attack on your system (or even an innocuous load test of your own) could lead to you quickly reaching an unexpected limit with the Web APIs your application depends on.
February 29th. Daylight Saving Time. The International Date Line. The Leap Second. Time zones with partial–hour offsets. Even if you believe your application isn’t time-sensitive, any of these temporal oddities could affect the results you get from the Web APIs you are calling. Testing for them ahead of time might be difficult, but it could be time well spent.
When you sign up for an API on the Web, make sure you set your calendar to remind you ahead of the renewal date. Forgetting to renew an API account subscription will leave egg on your face – and nasty errors on the screens of your users.
The API Disappears
It’s rare, but it happens: vendors pull their API out of your market area, or they go out of business completely and shut down their services without notice. Either way the net effect is the same, and you’ll have to quickly scramble to find an alternative service.
Unexpected Payload Format
Yahoo! recently made an unannounced incompatible upgrade to their popular Placefinder API, moving it from version 1.0 to 2.0 overnight. Normally a move like that would be orchestrated in such a way to provide both old and new API versions side-by-side, but the company offered users no way to keep using the older 1.0 data format while they made the switch.
Instead, their API users woke up one day to find that Yahoo! had completely broken their applications, and the only options on the table were to either quickly move to the 2.0 format or to switch to a different service. Worse still, some users quickly made the switch to 2.0, only for Yahoo! to realize their mistake two days later and switch the API back to 1.0 – two incompatible changes in as many days.
If vendors as large as Yahoo! can make accidental incompatible changes to their API services, you have to assume that all vendors could do the same.
What You Can Do
While there are innumerable ways for Web API calls to fail, protecting your application against problems with them can be done by following a few simple guidelines:
Assume that every Web API call you make will fail. Always check that you get a response (and do so within a reasonable timeframe) and that you parse the returned payload carefully. Code very defensively when calling what are essentially unreliable functions. Build monitoring and instrumentation into your system so that your IT team gets called when remote APIs stop responding. And if you’re really confident, inject failure into your production system.
Know what your users will experience when a Web API fails. For core services that you are completely dependent on (for example, PayPal for payment processing) make sure that you fail gracefully and tell the user something useful instead of just throwing a stack trace on the screen. For secondary services (for example, Google Maps for showing store locations) consider having an alternative service available that you can fall back to, especially if you hit a traffic ceiling with your main API provider due to high traffic.
Simulate the failure of each Web API you depend upon. Testing for failure is by far the best way to defend against future surprises. If you can do it in an automatic fashion then all the better, but simply changing your hosts file to make the hostname of your remote Web APIs fail to resolve to a usable IP address will probably uncover a ton of issues that you might not have expected.
Web APIs are great, and developers love mashing them up into something exciting. But if you don’t plan for the failure of those APIs, you’ll just end up frustrating your users and driving them away from your product.