Undelivered events
There might be times when a notification can’t be delivered to your listener endpoint (for example, if your endpoint is offline while undergoing a software upgrade). In the event of a delivery failure, Webhooks v3 automatically schedules delivery attempts based on the following timetable:
- If the first delivery attempt fails, Webhooks v3 waits 3 seconds and then tries again.
- If the second delivery attempt fails, Webhooks v3 waits 30 seconds and then tries again.
- If the third delivery attempt fails, Webhooks v3 waits 5 minutes and then tries again.
- If the fourth delivery attempt fails, Webhooks v3 waits 1 hour and then tries again.
- If the fifth delivery attempt fails, Webhooks v3 waits 24 hours and then tries again.
If delivery can’t be made after six tries, Webhooks v3 gives up and assigns the notification the failure state. After 7 days, events are automatically deleted from the event store.
However, during those 7 days you can use the event redelivery service to schedule redelivery of any failed events.
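To make the timetable concrete, the following Python sketch computes when each of the six delivery attempts would occur and how much time passes between the first and last attempt. It is an illustration only, not Webhooks v3 code: the names RETRY_DELAYS and attempt_times are hypothetical, and the sketch assumes that every attempt fails instantly, so each wait starts at the moment of the previous attempt.

```python
from datetime import datetime, timedelta, timezone

# Waits before attempts 2 through 6, copied from the timetable above.
RETRY_DELAYS = [
    timedelta(seconds=3),
    timedelta(seconds=30),
    timedelta(minutes=5),
    timedelta(hours=1),
    timedelta(hours=24),
]


def attempt_times(first_attempt: datetime) -> list[datetime]:
    """Return the time of every delivery attempt (hypothetical helper).

    Assumes each attempt fails instantly, so the next wait starts
    at the moment of the previous attempt.
    """
    times = [first_attempt]
    for delay in RETRY_DELAYS:
        times.append(times[-1] + delay)
    return times


start = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
schedule = attempt_times(start)
total_wait = schedule[-1] - schedule[0]

for number, when in enumerate(schedule, start=1):
    print(f"attempt {number}: {when.isoformat()}")
print(f"time between first and last attempt: {total_wait}")
```

Under these assumptions, the sixth and final attempt takes place a little more than 25 hours after the first one.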
The retry schedule also depends on the HTTP status code returned from the listener endpoint. For example, suppose you make a first attempt at delivering a notification and the server responds with a 5xx status code. That’s fine: a 5xx error typically refers to a short-lived problem (for example, temporary network congestion). Because the problem is likely to be resolved soon, the event state is changed to awaiting-retry and, after a few seconds, delivery will be attempted again.
However, suppose your first attempt to deliver a notification fails with a 3xx error. A 3xx error means that an additional action, such as a redirection, must be completed before the request can be accepted and processed. A webhook delivery should never require an additional action, and should never be redirected to another server. Consequently, any time a 3xx error is returned, the event state is immediately set to failure and no retries are scheduled.
The following table features a more complete list of the actions taken following a specific event delivery response:
Response | Next state | Notes |
---|---|---|
Malformed endpoint URL | failure | If the endpoint URL is invalid there’s no point trying to deliver an event: where would you even deliver that event to? If this error occurs the customer will have to use the subscription API to assign a new (and valid) endpoint URL. Note that this error is unlikely to occur simply because URLs are validated any time a subscription is created or updated. |
DNS error | failure | Usually the result of a DNS configuration error that has made the endpoint unreachable. If this error is returned you should ping the endpoint URL and see if you get a response. If not, you will need to make a change to your DNS configuration. |
SSL error | failure | Usually a result of the customer having an expired, incorrectly-configured, or otherwise-invalid SSL certificate. Webhook events cannot be delivered unless the customer has a valid and accessible certificate from a public Certificate Authority. |
Network error | awaiting-retry | Error accompanied by messages such as “Connection refused,” or “Connection closed.” Because these types of network issues (e.g., network congestion) are generally sporadic and short-lived, the event will be marked as awaiting-retry and another delivery attempt will be made based on the retry schedule. |
Endpoint responds with HTTP 1xx | failure | The server has accepted the request, and processing is continuing. This is considered a failed request because processing a webhook request should only take a fraction of a second. If an HTTP status code of 1xx is returned, the event state is changed to failure and no further attempts are made to deliver the notification. |
Endpoint responds with HTTP 2xx | success | The notification was successfully delivered. |
Endpoint responds with HTTP 3xx | failure | Further action (often a redirect of some kind) is required before the request can be acted upon. If an HTTP status code of 3xx is returned then the event state is changed to failure and no further attempts are made to deliver the notification. |
Endpoint responds with HTTP 4xx | awaiting-retry | The server rejected delivery due to a problem with the request itself (for example, a parameter name might have been misspelled). If a 4xx error is returned the event will be marked as awaiting-retry and another delivery attempt will be made based on the retry schedule. |
Endpoint responds with HTTP 5xx | awaiting-retry | The initial request was accepted but a server error prevented delivery from being completed; 5xx errors are often the result of transient network congestion or server overload. If a 5xx error is returned the event will be marked as awaiting-retry and another delivery attempt will be made based on the retry schedule. |
Request timeout | awaiting-retry | The server failed to respond to the webhooks request within 10 seconds. In this case, the event will be marked as awaiting-retry and another delivery attempt will be made based on the retry schedule. |
Retry limit reached | failure | Webhooks v3 has made 6 unsuccessful delivery attempts. As a result the event state is changed to failure, and no further delivery attempts will be made. |
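The table can also be read as a small decision function. The Python sketch below is one way to express it, for example when you want to predict how a listener's responses will be treated. It is not the Webhooks v3 implementation: the State enum, the outcome labels (such as "dns_error"), and the next_state function are hypothetical names chosen for this example.

```python
from enum import Enum


class State(str, Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    AWAITING_RETRY = "awaiting-retry"


MAX_ATTEMPTS = 6  # "Retry limit reached" row of the table

# Outcomes that never produce an HTTP status code.
# The keys are hypothetical labels invented for this sketch.
NON_HTTP_OUTCOMES = {
    "malformed_url": State.FAILURE,
    "dns_error": State.FAILURE,
    "ssl_error": State.FAILURE,
    "network_error": State.AWAITING_RETRY,
    "timeout": State.AWAITING_RETRY,
}


def next_state(outcome, attempts_made: int) -> State:
    """Map a delivery outcome to the next event state.

    outcome is either an integer HTTP status code or one of the
    keys of NON_HTTP_OUTCOMES; attempts_made counts the attempt
    that just finished.
    """
    if isinstance(outcome, int):
        if 200 <= outcome < 300:
            return State.SUCCESS
        if 400 <= outcome < 600:
            state = State.AWAITING_RETRY  # 4xx and 5xx are retried
        else:
            state = State.FAILURE  # 1xx and 3xx are never retried
    else:
        state = NON_HTTP_OUTCOMES[outcome]

    # A retryable outcome on the sixth attempt exhausts the schedule.
    if state is State.AWAITING_RETRY and attempts_made >= MAX_ATTEMPTS:
        return State.FAILURE
    return state


print(next_state(503, attempts_made=1).value)
print(next_state(302, attempts_made=1).value)
print(next_state(503, attempts_made=6).value)
```

Note that a 4xx response is retried just like a 5xx response, while a 1xx or 3xx response ends delivery on the first attempt.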
What happens if none of my Webhooks v3 notifications can be delivered?
That's what happens if a single event notification (or 2 or 3 or any other number of notifications) can’t be delivered. But suppose something more catastrophic happens, and none of your event notifications can be delivered. What happens then?
Let’s start by taking a look at failure on the receiving end; for example, what happens if your listener endpoint crashes and can’t be restarted?
At first, nothing will happen: Webhooks v3 will continue to send event notifications and, when those notifications can’t be delivered, will assign each notification to the retry cycle. That continues for 24 hours. If 24 hours have gone by and Webhooks v3 has not been able to deliver any notifications to a subscription, then that subscription will be disabled. After a subscription has been disabled, Webhooks v3 will not deliver notifications for that subscription. However, you can use the /webhooks/subscriptions/{subscriptionId} API endpoint and the PATCH method to re-enable the subscription. After the subscription has been re-enabled, deliveries will restart immediately.
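As a rough illustration of the re-enable call, the Python sketch below builds a PATCH request for the /webhooks/subscriptions/{subscriptionId} endpoint. Only the path and the HTTP method come from the text above. Everything else is an assumption made for the example: the base URL, the subscription ID, the bearer-token authorization header, and in particular the {"enabled": true} request body. Check the subscription API reference for the actual host, authentication scheme, and field name before using anything like this.

```python
import json
import urllib.request

# Hypothetical placeholders: replace with values for your environment.
BASE_URL = "https://example.invalid"
SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
ACCESS_TOKEN = "REPLACE_ME"


def build_reenable_request(
    base_url: str, subscription_id: str, token: str
) -> urllib.request.Request:
    """Build (but do not send) a PATCH request for a subscription."""
    # Assumed payload: the real field name may differ.
    body = json.dumps({"enabled": True}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/webhooks/subscriptions/{subscription_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


request = build_reenable_request(BASE_URL, SUBSCRIPTION_ID, ACCESS_TOKEN)
print(request.get_method(), request.full_url)

# To actually send the request:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

The request is built separately from sending it so that the URL, method, and body can be inspected before any call is made.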
Now, what happens if the problem lies on the Identity Cloud side of things? Here’s a look at the things that could go wrong with Webhooks v3 and what the ramifications are:
The Webhooks API nodes go down
If some of the API nodes go down, the Amazon Web Services (AWS) auto scaler (a service that monitors your application workload and then adds or removes virtual machines as needed to match demand) will create new API nodes to replace the failed nodes. In that case, Webhooks management could, at worst, slow down a bit until the replacement nodes are all up and running.
If all the API nodes go down, organizations will not be able to manage their Webhooks subscriptions until service has been restored. However, even an extreme case like that won’t prevent webhook deliveries: events will still be generated and added to the event queue, and dispatchers will still collect those events and send event notifications, with little discernible impact on performance. Only webhooks management is affected: until service has been restored, organizations won’t be able to make API calls against the webhooks database or create, update, or otherwise manage webhooks subscriptions.
The Webhooks event monitoring nodes go down
If some of the event monitoring nodes go down, the AWS auto scaler will create nodes to replace the failed ones; in that case, notifications could slow down a little (until all the failed nodes have been replaced) but will not stop. If all the nodes fail then no new events will be added to the event store; dispatchers will continue to deliver events already in the store, but after the store is cleared deliveries will stop (because there won’t be anything left that requires delivery). As soon as even one monitoring node is available events will again be added to the store, although that process will be slowed until the monitoring nodes have been restored to their usual number.
To be honest, it’s unlikely that all the nodes will be unavailable at the same time; should that happen, it’s also unlikely that the outage would last very long (the AWS architecture allows new instances of virtual servers to be created in a remarkably short amount of time). If it does happen, however, data will not be available for events generated during the outage.
The Webhooks delivery nodes go down
If some of the delivery nodes go down, the AWS auto scaler will create replacements for the failed nodes; in that case, notifications could slow down a bit (until all the failed nodes have been replaced) but will not be stopped. If all the nodes fail, then event deliveries will be suspended until a replacement is available to deliver notifications. However, while the nodes are down, new events will continue to be added to the event store; as soon as new nodes are in place they can begin delivering those event notifications. There should be no loss of data even if all the nodes go down; at worst, event deliveries might be delayed a bit.
The Webhooks database goes down
Identity Cloud databases rely on the use of “hot standbys” to provide high availability. When an event is written to the events database, copies of that event are instantly synced to one or more hot standbys: mirror databases located in other Amazon Web Services availability zones. Should the primary database fail, one of the hot standbys can take over, with little (if any) disruption in service.
In the unlikely event that neither the primary database nor any of the hot standbys is available, the event service will temporarily stop: new events will not be added to the event queue, dispatchers will not be able to deliver events, and organizations will not be able to use the APIs to manage their Webhooks subscriptions. When database service is restored, Webhooks will immediately restart; however, no data will be available for any events that occurred during the outage.
Updated about 2 years ago