Schedule redelivery for a failed webhook event

Redeliver a failed event


Webhooks v3 was designed, from the start, to guard against single-point-of-failure issues that could arise when delivering webhook notifications. By contrast, Webhooks v2 has a limited ability to deal with notifications that can’t be delivered. In Webhooks v2, the system makes an initial attempt to deliver the notification. If delivery fails (for any number of reasons, ranging from network congestion to the listener endpoint being offline), the system waits 10 seconds and then makes a second delivery attempt. If that effort fails, the system waits 10 more seconds and then makes a third delivery attempt. If that effort fails, the notification is deleted and no further delivery attempts take place.

Ever.

By comparison, Webhooks v3 takes a much more expansive approach to redelivery efforts. If a delivery fails in Webhooks v3 then:

  • Webhooks v3 waits 3 seconds and then tries again.
  • If the second delivery attempt fails, Webhooks v3 waits 30 seconds and then tries again.
  • If the third delivery attempt fails, Webhooks v3 waits 5 minutes and then tries again.
  • If the fourth delivery attempt fails, Webhooks v3 waits 1 hour and then tries again.
  • If the fifth delivery attempt fails, Webhooks v3 waits 24 hours and then tries again.

If, 24 hours after the first attempt, delivery still fails then Webhooks v3 gives up and assigns the notification the failure state. Even then, however, failed deliveries aren’t instantly deleted (like they are in Webhooks v2). Instead, failed delivery attempts are kept in the Webhooks event store (and are available for viewing) for the next 7 days. It’s only after those 7 days have elapsed that events are automatically deleted from the event store, and are no longer available.
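The retry schedule listed above can be modeled as a simple lookup table. The following is a minimal Python sketch, not the actual Webhooks v3 implementation: the names RETRY_WAITS and wait_before_retry are hypothetical, and the waits are copied directly from the list.

```python
from datetime import timedelta

# Waits between delivery attempts, copied from the documented list.
# RETRY_WAITS[0] is the wait that follows the first failed attempt.
RETRY_WAITS = [
    timedelta(seconds=3),
    timedelta(seconds=30),
    timedelta(minutes=5),
    timedelta(hours=1),
    timedelta(hours=24),
]


def wait_before_retry(failed_attempts):
    """Return the wait that follows the Nth failed attempt.

    Returns None once the schedule is exhausted, which is the point
    at which the event would be marked as failed.
    """
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be at least 1")
    if failed_attempts > len(RETRY_WAITS):
        return None
    return RETRY_WAITS[failed_attempts - 1]
```

For instance, wait_before_retry(2) returns a 30-second wait, and a sixth failed attempt returns None because the list contains no further retries.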

More often than not, this system works extremely well. After all, failed deliveries are typically due to temporary problems; for example, your listener endpoint was momentarily offline while undergoing a software upgrade. Assuming that this “glitch” lasts no more than 24 hours, delivery will eventually take place.

As we all know, however, glitches sometimes last longer than 24 hours: for example, your listener endpoint could fail entirely and need to be replaced, or a hurricane could knock out power and leave you offline for days. In cases like that, Webhooks v3 will dutifully run through the delivery process and then, after 24 hours, mark the event as failed. The event remains in the event store for the next 7 days, but admittedly, there’s never been much you can do with it: although you could use the Webhook APIs to view the event, until now there’s never been a way to retrieve the event. For some organizations, in some situations, that’s been a problem.

To help you work around that problem, the Identity Cloud has added a redelivery endpoint to the Webhook APIs. This endpoint enables you to restart the delivery process for an event that has been marked as failed and is still in the event store (note that events are still retained for only 7 days). In other words, suppose the power goes out and your listener endpoint is offline for 3 days. During those 3 days, a number of Webhook events are generated, but none of those event notifications could be delivered (because your listener endpoint was unavailable). You now have the ability to retrieve those failed events … albeit with a few cautions and limitations.


Cautions about, and limitations of, the event redelivery service

It’s important to note, right from the start, that the event redelivery service is meant to complement – not to replace – the basic Webhooks event delivery service. For example, suppose you decide to shut off your listener endpoint and, instead of receiving event notifications in real time, you allow those delivery attempts to fail and be saved to the event store. You do this thinking, “In a few days I’ll just run the /redeliver API and retrieve all the failed events that have accumulated in the past week.” We can tell you right now that an approach like that probably won’t work. And here’s why:

  • At this point in time there’s no API endpoint that can retrieve all your failed events. Instead, the /redeliver endpoint reschedules event deliveries on an event-by-event basis. For example, suppose events A, B, C, and D have all failed. To retrieve those events, you’ll need to make an API call that redelivers event A; make a second API call that redelivers event B; make a third API call that redelivers event C; and so on. If you have 1,000 failed events in the event store you’ll need to make 1,000 separate API calls to retrieve those events. If you have 10,000 failed events ….

  • In addition to allowing only one event delivery scheduling per API call, rate limits have also been imposed on the /redeliver endpoint. In theory, you could write a script that simply calls the /redeliver endpoint over and over again, in rapid-fire succession. However, if you make too many API calls in too short a time (actual limits have not been announced yet) you’ll exceed the rate limits and your script will fail.

    On top of that, making too many API calls could cause your Webhook subscription to be temporarily suspended. That means that not only will you not get any event redeliveries, but you won’t get any real-time event notifications either, at least not until the subscription is re-enabled.

  • Event redeliveries are placed in different, lower-priority queues than events occurring in real time. What does that mean? Well, suppose you have a single failed event (Event A) and you mark that event for redelivery. At the same time, 10 new Webhook events are generated for your domain. The failed event, Event A, is placed in a lower-priority queue than the 10 new events. Because of that, it’s possible that the 10 new events could be delivered before your single failed event is delivered: priority is always placed on new events, and redeliveries take place only when time and resources allow. There is no SLA (service level agreement) saying that redeliveries will occur within X amount of time: a redelivery might arrive a few seconds after you make your API call, or a redelivery might arrive several minutes after you make your API call.

The moral of the story? The event redelivery service is designed as a way to help you retrieve the occasional event that, somehow or another, fell through the cracks and didn’t get delivered. It is not the method by which you should routinely receive your event notifications.
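Because redeliveries are scheduled one event at a time and the endpoint is rate limited, any script that works through a list of failed events should pace itself. The following is a minimal Python sketch of one way to do that. The paced helper is hypothetical, and the interval is an assumption you must choose yourself, since actual rate limits have not been announced.

```python
import time


def paced(items, min_interval_seconds, sleep=time.sleep):
    """Yield items one at a time, pausing between consecutive items.

    min_interval_seconds is an ASSUMED value chosen by the caller; the
    service's real rate limits are not documented here. The sleep
    argument can be swapped out, for example in tests.
    """
    first = True
    for item in items:
        if not first:
            sleep(min_interval_seconds)
        first = False
        yield item
```

A loop such as `for event_id in paced(event_ids, 2.0): ...` would then leave at least two seconds between consecutive API calls. Pacing like this lowers the chance of hitting the rate limits but cannot guarantee that you stay under them.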


How the Webhooks v3 redelivery service works

The Webhooks v3 redelivery service is remarkably simple. As noted previously, it consists of a single API endpoint that can be called to reschedule delivery of a single event notification. Or at least it can do that as long as:

  • The event is marked as failed. Your API call will fail if the event is in any other state (e.g., awaiting-executing). This includes events marked as success (i.e., events that were successfully delivered). Suppose Event A is delivered and, as a result, is marked as success. Suppose you then inadvertently delete Event A from your listener endpoint. In a case like that, you can’t use the /redeliver endpoint to get a “replacement” copy of Event A: there’s no way to redeliver events that have already been delivered.

  • The event is in the event store. Remember, events are automatically deleted from the event store after 7 days: once deleted, those events cannot be retrieved. Let’s assume that, for whatever reason, you were unable to receive Webhook notifications for the past 8 days. Events for the last 7 days will still be in the event store; however, events generated on day 1 (8 days ago) will have already been deleted from the store. Those events can’t be retrieved and they can’t be scheduled for redelivery.


📘

But keep reading: if necessary, there is a way to keep events in the event store indefinitely.
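The two conditions above can be expressed as a local pre-check before calling the API. The following is a minimal Python sketch with hypothetical names. It assumes that failed events carry the state value failure (the value used by the state query parameter) and, more speculatively, that the 7-day retention clock starts at the event's updatedAt timestamp, which this document does not confirm.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)  # retention period for events in the event store


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp such as 2020-01-28T18:16:04.616963Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def can_redeliver(event, now):
    """Return True if the event looks eligible for redelivery.

    ASSUMPTION: the retention period is measured from updatedAt.
    The service remains the final authority; this is only a pre-check.
    """
    if event["state"] != "failure":
        return False
    return now - parse_timestamp(event["updatedAt"]) < RETENTION
```

A check like this only saves you an API call that would fail anyway; the /redeliver endpoint itself decides whether an event can be rescheduled.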


The first step in scheduling event redelivery is to determine the unique IDs of all the failed events currently in the event store; that’s something that can be done using the /events endpoint and the state parameter. For example, this call returns a collection of all the failed events for the Webhooks subscription a6de662c-e93b-4041-96f0-283214de75b6:

curl -X GET \
  'https://v1.api.us.janrain.com/e0a70b4f-1eef-4856-bcdb-f050fee66aae/webhooks/subscriptions/a6de662c-e93b-4041-96f0-283214de75b6/events?state=failure' \
  -H 'Authorization: Bearer Xk7EzdpGq5GPQcsxCWM2SxdlwU_iTsA4i2Px4TEzBrfLIvddjnDVBJxjPDuCARHH'

📘

Yes, you must retrieve failed events on a subscription-by-subscription basis. If you have 4 Webhook subscriptions you’ll need to make the preceding API call against each subscription in order to return all your failed events.
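If you have several subscriptions, the same request can be built once per subscription. The following is a minimal Python sketch that uses only the standard library and mirrors the URL pattern of the curl call above. The function names are hypothetical, and the sketch builds the requests without sending them.

```python
import urllib.request

BASE_URL = "https://v1.api.us.janrain.com"  # host taken from the curl example


def failed_events_request(customer_id, subscription_id, token):
    """Build a GET request for the failed events of one subscription."""
    url = (
        f"{BASE_URL}/{customer_id}/webhooks/subscriptions/"
        f"{subscription_id}/events?state=failure"
    )
    headers = {"Authorization": f"Bearer {token}"}
    return urllib.request.Request(url, headers=headers, method="GET")


def failed_events_requests(customer_id, subscription_ids, token):
    """Build one request per subscription, in the order given."""
    return [
        failed_events_request(customer_id, subscription_id, token)
        for subscription_id in subscription_ids
    ]
```

Each request could then be sent with urllib.request.urlopen and the response body parsed with json.load.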


For each failed event (or at least for each failed event that you’d like redelivered), your next step is to retrieve the event ID. For example:

"_embedded": [{
    "id": "1b8773c6-5f6a-4ba5-8f3b-210732476cd6",
    "createdAt": "2020-01-28T18:16:04.034726Z",
    "updatedAt": "2020-01-28T18:16:04.616963Z",
    "state": "success",
    "attempts": 1,
    "request": {
      "endpoint": "https://webhook.site/46ff3c5e-ae95-43df-b32d-d07bb84746b4",
      "headers": {
        "Accept": "*/*",
        "Content-Length": "1252",
        "Content-Type": "application/secevent+jwt",
        "Host": "webhook.site",
        "User-Agent": "Akamai Identity Cloud Webhooks/v3.0.0"
        },
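Collecting the IDs from a parsed response can be sketched in a few lines of Python. This assumes the response body has already been decoded from JSON into a dictionary and that events are listed under the _embedded key, as in the excerpt above. The function name is hypothetical, and the state filter is a defensive extra in case the response contains events in other states.

```python
def failed_event_ids(response_body):
    """Return the IDs of all events whose state is 'failure'.

    response_body is the decoded JSON response from the /events
    endpoint. A missing _embedded key is treated as an empty list.
    """
    return [
        event["id"]
        for event in response_body.get("_embedded", [])
        if event.get("state") == "failure"
    ]
```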

That event ID is then included in your call to the /redeliver endpoint:

curl -L -X POST \
  'https://v1.api.us.janrain.com/e0a70b4f-1eef-4856-bcdb-f050fee66aae/webhooks/subscriptions/a6de662c-e93b-4041-96f0-283214de75b6/events/1b8773c6-5f6a-4ba5-8f3b-210732476cd6/redeliver' \
  -H 'Authorization: Bearer ELfZB8fwZIKewDiv7iiXdef4CFMtjI5An9N1BI-BzQixRPtRmm9U6lzyPzHHmbdv'

After making your API call, repeat the process with the next event in your list.

Here are a couple of quick notes regarding the /redeliver endpoint. For one, you must use the POST method. For another, you can’t include anything in the request body. As shown in the preceding example, your API call can only include the Authorization header and the endpoint URL.

If your API call succeeds, two things will happen. First, you’ll get back a 202 Accepted HTTP response. Second, the event state for the specified event will be changed from failed to awaiting-executing, the same state assigned to brand-new events. In fact, from that point on the formerly-failed event is treated exactly the same as a brand-new event. (Again, with the caveat that the rescheduled event is in a lower-priority delivery queue.) That means that the regular delivery cycle will be in force; Webhooks will attempt to deliver the event and, if delivery fails, will wait 3 seconds and then make a second attempt. If that second attempt fails, Webhooks will wait 30 seconds and then try a third attempt, and so on. If, 24 hours later, the event still can’t be delivered then the event is marked as failed and is kept in the event store for 7 days. After 7 days, the event is automatically deleted.
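The same call can be sketched in Python with the standard library. The function names below are hypothetical. The sketch builds a body-less POST request that carries only the Authorization header and treats a 202 Accepted response as success.

```python
import urllib.request

BASE_URL = "https://v1.api.us.janrain.com"  # host taken from the curl example


def redeliver_request(customer_id, subscription_id, event_id, token):
    """Build the POST request that reschedules delivery of one event."""
    url = (
        f"{BASE_URL}/{customer_id}/webhooks/subscriptions/"
        f"{subscription_id}/events/{event_id}/redeliver"
    )
    headers = {"Authorization": f"Bearer {token}"}
    # No data argument is passed, so the request has no body.
    return urllib.request.Request(url, headers=headers, method="POST")


def schedule_redelivery(request):
    """Send the request and return True on a 202 Accepted response.

    urlopen raises urllib.error.HTTPError for 4xx and 5xx responses,
    so callers should be prepared to catch that exception.
    """
    with urllib.request.urlopen(request) as response:
        return response.status == 202
```

Keeping request construction separate from sending makes it easy to inspect the URL and headers before anything goes over the network.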

And what if you use the /redeliver endpoint to make a second stab at redelivering the event? In that case, the cycle repeats itself: the event is marked as awaiting-executing and the process starts all over again. There’s no limit on how many times you can run the /redeliver endpoint against a specific event. In theory, you could keep an event in the event store forever, as long as you schedule it for redelivery before the 7-day time period expires.


📘

And assuming that the event hasn’t been delivered. Once an event has been delivered and marked as a success, the event can’t be redelivered and is deleted from the event store after 7 days.


For visual learners, a somewhat-simplified version of the process looks like this: