Understand GTM concepts

It's important to familiarize yourself with these in-depth explanations of GTM functions.

Compute answer to DNS query

When GTM receives a DNS query, it completes these actions to compute the answer.

  • Decides which data center the answer should come from

  • Computes the answer based on the chosen data center

How the data center is chosen depends on the property type. For performance properties, GTM usually chooses the data center closest to the requester (if it is up, and not overloaded). A data center that is considered down is never chosen. If all data centers are down, then GTM considers them all up.

After GTM chooses a data center, it computes the answer based on the chosen data center. In the common case in which you have configured a number of server IPs in the data center, all of the server IPs that are considered up are potentially included in the answer. Because the size of a DNS response packet is limited, if more than eight servers are up, GTM chooses eight IP addresses at random for the answer. This random choice is made again for each query, so over a large number of queries, the answers tend to distribute evenly over all of the live servers in the data center.

📘

Default number of IP addresses

The number eight, in reference to the number of IP addresses, is the default. You can change this number in your configuration.
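
The handout step can be pictured as a fresh random sample of the live server IPs, capped at the handout limit. This is a minimal sketch, not GTM's actual implementation; the function and variable names are illustrative.

```python
import random

def compute_answer(server_ips, is_up, handout_limit=8):
    """Return the server IPs to hand out for one DNS query."""
    # Only servers currently considered up are candidates.
    live = [ip for ip in server_ips if is_up.get(ip, False)]
    if len(live) <= handout_limit:
        return live
    # A fresh random sample per query, so over many queries the answers
    # distribute evenly across all live servers in the data center.
    return random.sample(live, handout_limit)

print(compute_answer(["192.0.2.%d" % i for i in range(1, 13)],
                     {"192.0.2.%d" % i: True for i in range(1, 13)}))
```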

Determine server liveness

There are two distinct parts to determining server liveness: data collection and decision making. Server liveness data is collected using a process called servermonitor that is running on liveness testing agents. Liveness scores are sent from the liveness testing agents to a component called DatacenterStateAgent (DSA) every ten seconds. DSA makes the liveness decisions, and sends updates to GTM nameservers every five seconds.

Each liveness testing agent periodically performs one or more liveness tests on each of your servers. The liveness testing agent then computes a score that is either the download time in seconds or a penalty score if the download request times out or if the download encounters an error such as a 404 error. The default penalty for a timeout is 25 and the default penalty for an error is 75. These penalties are configurable on the back-end. The time to allow before declaring a timeout (by default, 10 seconds) is configurable in the portal (see Manage liveness tests).

📘

Connection timeouts as errors

Connection timeouts are treated as errors (penalty 75), not timeouts. Only timeouts that occur in the data transfer stage receive the timeout penalty (25). This prevents servers that are disconnected or shut off from being preferred over servers that are returning errors or refusing connections.
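
Putting the scoring and penalty rules together, a single test's score can be sketched as below. This is illustrative only; the constants are the documented defaults and the parameter names are assumptions, not servermonitor internals.

```python
TIMEOUT_PENALTY = 25  # default timeout penalty; configurable on the back-end
ERROR_PENALTY = 75    # default error penalty; configurable on the back-end

def test_score(connected, transfer_timed_out, http_error, download_seconds):
    """Score one liveness test from one liveness testing agent."""
    if not connected:
        # Connection timeouts and refused connections are treated as errors,
        # so a disconnected server is never preferred over one returning errors.
        return ERROR_PENALTY
    if transfer_timed_out:
        # Only timeouts during the data transfer stage get the timeout penalty.
        return TIMEOUT_PENALTY
    if http_error:
        # For example, a 404 on the test object.
        return ERROR_PENALTY
    return download_seconds

print(test_score(True, False, False, 1.2))    # healthy download: 1.2
print(test_score(False, False, False, None))  # connection timeout: 75
```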

Each of your servers is tested from a number (usually seven) of liveness testing agents in disparate locations and networks. The scores are collected from each liveness testing agent; for each server, the median of the scores from all liveness testing agents is used for the remainder of the calculation.

In addition to the instantaneous scores, servermonitor also computes and reports an exponentially-decaying average. In the calculation that follows, the score used is the greater of the instantaneous score and the average score. This means that when a server goes down, GTM stops handing it out immediately, but when it comes back up, GTM does not hand it out again until the liveness testing agents have had several successful downloads. If a server is intermittent, it is declared down when several liveness testing agents get errors within a few minutes of each other: even while the averages from earlier errors are still falling, they remain above the cutoff when the new errors occur.
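
One way to picture the score that feeds the cutoff calculation: per agent, take the greater of the instantaneous score and the decaying average, then take the median across agents. The decay factor below and the exact ordering of these steps are assumptions for illustration; the actual constants are not published.

```python
import statistics

DECAY = 0.5  # illustrative decay factor; the real constant isn't published

def update_average(previous_average, new_score):
    """Exponentially-decaying average of one agent's scores for a server."""
    if previous_average is None:
        return new_score
    return DECAY * previous_average + (1 - DECAY) * new_score

def effective_server_score(per_agent_scores, per_agent_averages):
    """Score used for the liveness decision for one server.

    A failure raises the score immediately, but recovery requires several
    good downloads before the decaying average comes back down.
    """
    per_agent = [max(s, a) for s, a in zip(per_agent_scores, per_agent_averages)]
    return statistics.median(per_agent)
```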

A cutoff value for the property is computed from the median server scores across all data centers. Any server with a score over the cutoff value is considered down, and load is not sent to it. The cutoff is computed from the minimum score across all servers (for a given property) and two parameters: health_multiplier and health_threshold. The cutoff is either health_multiplier (default value: 1.5) times the minimum score or the health_threshold (default value: 4), whichever is greater.
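
A minimal sketch of the cutoff rule, using the defaults named above; it reproduces the examples that follow. Names are illustrative.

```python
def compute_cutoff(scores, health_multiplier=1.5, health_threshold=4):
    """Cutoff for a property: the greater of multiplier * best score
    and the health threshold."""
    return max(health_multiplier * min(scores.values()), health_threshold)

def liveness(scores, **kwargs):
    """Mark each server Up or Down from its median score."""
    cutoff = compute_cutoff(scores, **kwargs)
    return {name: ("Up" if score <= cutoff else "Down")
            for name, score in scores.items()}

# Reproduces Example 1 below: cutoff = max(1.5 * 1.0, 4) = 4
print(liveness({"A": 1.0, "B": 1.2, "C": 3.0, "D": 15}))
# {'A': 'Up', 'B': 'Up', 'C': 'Up', 'D': 'Down'}
```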

These examples describe how the cutoff is determined.

Example 1

In this example, server A has the best score of 1.0 seconds so the cutoff is 4. As discussed above, the cutoff is 1.5 times the best score or 4, whichever is greater. Servers A, B, and C are declared up, while server D is declared down, as its score (15) is greater than 4.

Server   Score   Status   Cutoff
A        1.0     Up       4
B        1.2     Up
C        3.0     Up
D        15      Down

Example 2

This example shows a high load situation, where the servers are slow, but still responding. Server A has the best score of 8 seconds so the cutoff is 12 (1.5 * 8). Servers A, B, and D are declared up, while server C is declared down, as its score (15) is greater than 12.

Server   Score   Status   Cutoff
A        8       Up       12
B        11      Up
C        15      Down
D        10      Up

Example 3

In this example, server A has the best score of 25 seconds so the cutoff is 37.5 (25 * 1.5). Server A is declared up, while servers B, C, and D are declared down, as their scores are greater than 37.5.

Server   Score          Status   Cutoff
A        25 (timeout)   Up       37.5
B        75 (error)     Down
C        75 (error)     Down
D        75 (error)     Down

The algorithm is modified slightly if a backup CNAME exists. If the cutoff score computed as described above is greater than 0.9 times the timeout penalty, GTM uses 0.9 times the timeout penalty as the cutoff value instead. This guarantees that, if all servers are timing out or returning errors, GTM declares them down so that the backup CNAME is handed out. (Normally, if all servers are timing out or returning errors, the standard algorithm declares them all up so that they're handed out; you don't want this if there's a backup CNAME.)
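
A small sketch of the capped cutoff under the same defaults; the function name is illustrative.

```python
def cutoff_with_backup_cname(scores, timeout_penalty=25,
                             health_multiplier=1.5, health_threshold=4):
    """Cutoff when a backup CNAME exists: capped at 0.9 * timeout penalty."""
    base = max(health_multiplier * min(scores.values()), health_threshold)
    return min(base, 0.9 * timeout_penalty)

# All servers timing out (score 25): the capped cutoff is 22.5, every score
# is above it, so all servers are down and the backup CNAME is handed out.
print(cutoff_with_backup_cname({"A": 25, "B": 25, "C": 25}))  # 22.5
```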

Determine server liveness with multiple liveness tests

When you configure multiple tests, servermonitor aggregates the scores for a server across the tests to produce a single score. You can configure how these scores are aggregated by choosing one of these methods:

  • mean

  • median

  • worst (Control Center default when creating new properties)

  • best

If the aggregation type is mean, and a server returns an HTTP test object in 2 seconds and an HTTPS test object in 4 seconds, servermonitor reports a score of 3 seconds. More significantly, if one test succeeds in 5 seconds and the other incurs an HTTP error, servermonitor reports the mean of 5 and 75, which is 40.

In the case in which one test succeeds and the other fails, the server is usually considered down. However, if the second test is failing on all servers in the property, they all have a similar mean score, and that test is discounted in the liveness algorithm. If any server is failing both tests, its mean is 75; it is considered down (because its score is more than 1.5 times the best score), while all the servers with scores around 40 are considered up.

Note how the result of the previous example changes if you switch to using worst as the aggregation method. As an example, if one of the tests is failing on all servers, they all have a score of 75. As a result, they all appear to be down. If there's a backup CNAME, GTM hands that out; otherwise, GTM hands them all out (including the server failing both tests). When using this option, it is important that all the tests succeed on live servers. A problem that causes a test to fail on all servers, such as a typo in the configuration of the test object, effectively renders all the liveness tests unusable.
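
The aggregation methods can be sketched as simple reductions over the per-test scores; the numbers reproduce the examples above. This is an illustration, not servermonitor's implementation.

```python
import statistics

AGGREGATORS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "worst": max,   # higher scores are worse
    "best": min,
}

def aggregate(test_scores, method):
    """Combine scores from multiple liveness tests into one server score."""
    return AGGREGATORS[method](test_scores)

print(aggregate([2, 4], "mean"))    # 3   -- HTTP in 2 s, HTTPS in 4 s
print(aggregate([5, 75], "mean"))   # 40  -- one success, one HTTP error
print(aggregate([5, 75], "worst"))  # 75  -- worst makes any failure fatal
```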

Timeout back off

A large number of test object downloads timing out could stall other tests. To prevent this, a test that times out is assigned a back off interval, which is an increment to the test interval. Each time the test times out, the back off interval is increased. The initial value for the back off interval is equal to the test interval, and each time the test times out the back off interval is multiplied by 1.5, up to a maximum of 15 minutes. As an example, if a test with a test interval of three minutes has been timing out for a while, the liveness testing agents perform the test only once every 18 minutes. If a non-timeout error occurs, the back off interval remains unchanged. As soon as the test succeeds, the back off interval is reset to zero.
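
A sketch of the back off schedule described above, using the stated constants (multiplier 1.5, 15 minute maximum); the structure and names are illustrative.

```python
MAX_BACKOFF = 15 * 60  # 15 minutes, in seconds

def next_backoff(current_backoff, test_interval, outcome):
    """Update the back off interval after one test attempt.

    outcome is "success", "timeout", or "error".  The effective gap
    between attempts is test_interval + back off interval.
    """
    if outcome == "success":
        return 0  # back to testing at the normal test interval
    if outcome == "timeout":
        if current_backoff == 0:
            return min(test_interval, MAX_BACKOFF)  # first timeout
        return min(current_backoff * 1.5, MAX_BACKOFF)
    return current_backoff  # non-timeout errors leave the back off unchanged

# With a 3-minute test interval and repeated timeouts, the back off grows
# toward 15 minutes, so tests eventually run only every 3 + 15 = 18 minutes.
```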

📘

Reset timeout back off

If you need to bring a recently resurrected server back into rotation immediately, but this is being delayed by the timeout back off, you can reset the back off. Changing a test parameter, such as the timeout interval, discards the old test and creates a new test with a zero back off. As an example, you could change the timeout from 25 to 26, and then back to 25.

Persistent assignment

Although it is impossible for any DNS-based load balancing system to guarantee that any given user remains mapped ("sticks") to the same server indefinitely, there are some things that can be done to reduce the probability of a user being mapped away. One of these is the Persistent assignment handout mode. Normally, when GTM is computing the answer to return to a query, it maps the request to a data center, and then returns all of the IP addresses of servers in that data center that are considered live.

If the Persistent assignment handout mode is enabled for the property, GTM instead returns just one live IP from among all the live IPs. The IP is chosen based on a hash of the client nameserver's IP address, so that requests from different nameservers may get mapped to different server IPs (thus spreading the load across all of them), but all requests from the same nameserver IP always get the same answer.

Persistent assignment has no effect if a data center has only one server IP address configured; the portal issues a warning to that effect if you try to submit such a configuration.

Persistent assignment does not have any effect on which data center is chosen. It only controls which IPs in the data center are returned, once the data center choice has been made. A user could still be moved from one data center to another, if, for example, load feedback attributes for that data center suddenly go over target.
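
Persistent assignment can be pictured as a hash of the client nameserver IP used to select one live IP. A minimal sketch; the hash and selection scheme GTM actually uses are not published, so treat this as illustration only.

```python
import hashlib

def persistent_handout(nameserver_ip, live_ips):
    """Pick one live server IP deterministically per client nameserver."""
    ordered = sorted(live_ips)
    digest = hashlib.sha256(nameserver_ip.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(ordered)
    return ordered[index]

# The same nameserver IP always gets the same answer (while the live set
# is unchanged); different nameservers spread across the live IPs.
print(persistent_handout("198.51.100.7",
                         ["192.0.2.10", "192.0.2.11", "192.0.2.12"]))
```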

Round-robin prefix

Normally, when GTM answers a DNS query for a property name, it chooses a data center and then returns IP addresses for every live server in the data center.

If you need to be able to perform a single DNS query that returns every server in every data center for a property, regardless of liveness, you can use a round-robin prefix. A round-robin prefix is a string that, when configured, automatically creates a shadow property for each normal property. The shadow property's name is the round-robin prefix followed by an underscore and then the normal property name.

For example, if a property named roundrobin is in the domain test.akadns.net and the round-robin prefix is showall, you can issue a DNS query for showall_roundrobin.test.akadns.net. This query returns all the IP addresses in all data centers for the property.

The handout limit still applies; it caps the maximum number of IP addresses that can be returned.
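
You can build the shadow name yourself and query it directly. The sketch below uses the third-party dnspython library and the hypothetical names from the example above.

```python
import dns.resolver  # dnspython

prefix = "showall"            # round-robin prefix from the example above
property_name = "roundrobin"  # normal property name
domain = "test.akadns.net"

shadow_name = f"{prefix}_{property_name}.{domain}"
answers = dns.resolver.resolve(shadow_name, "A")
for record in answers:
    print(record.address)  # every configured IP, regardless of liveness
```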

To configure a round-robin prefix for a domain, submit a ticket to ​Akamai Technologies, Inc.​ Support.

Failover delay and failback delay examples

GTM properties with liveness tests contain settings for failover delay and failback delay. These settings work together to allow you to switch the traffic between your primary and secondary data centers.

Consider this example: The GTM domain example.com.akadns.net has a property named property1, with a primary data center DC1.example.com and a secondary data center DC2.example.com.

As the primary data center, DC1.example.com is expected to be available at any time. If for some reason DC1 goes down, end users receive error messages. You can resolve the errors by moving traffic from DC1 to another data center. If you are not using GTM, it might take an IT team at least 30 minutes to move the traffic from DC1. GTM can quickly move your traffic from DC1 to another data center based on your property's failover and failback settings.

In this example, the failover delay and failback delay settings are both 300s (5 minutes) for property1. If you have enabled liveness tests, GTM detects the failure as soon as liveness tests start failing for DC1. GTM does an internal calculation of scores returned by the liveness test, and depending on these scores, GTM determines a cut-off score. If the aggregated score of liveness tests from all the liveness testing agents exceeds this cut-off value, DC1 is marked as down.

Because the failover delay in the example is set to 5 minutes, GTM does not mark DC1 as down immediately upon detecting this failure. Instead GTM schedules a time (a 5-minute failover delay time) in the future to mark DC1 as down. After 5 minutes, GTM evaluates the score once again to see whether the situation has changed. If the score remains unchanged, DC1 is marked as down and all the traffic is moved to DC2.

Failback delay works in a similar manner but in reverse, as the traffic is now on DC2. For example, let's say DC1 went down because of a power outage. You switched to backup power and manually checked that DC1 is functioning again. When DC1's liveness test starts returning successful responses, the liveness score improves and falls below the cut-off score. The moment the score falls below the cut-off, GTM schedules a time (5 minutes in the future) to mark DC1 as up. After 5 minutes, GTM verifies whether the improvement is sustained. If it is, DC1 is marked as up and all traffic reverts to DC1.

Failover delay and failback delay both default to zero seconds. You can adjust them as desired to meet your needs.
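
The delay behavior can be sketched as a small state machine: a status change is scheduled when the score first crosses the cut-off, and only applied if the condition still holds when the delay expires. This is an illustration under assumed attribute names, not GTM's implementation.

```python
def evaluate_data_center(dc, now, score, cutoff,
                         failover_delay=300, failback_delay=300):
    """Apply failover/failback delays to one data center's status.

    dc is assumed to carry two attributes: status ("up" or "down") and
    pending, either None or a (target_status, due_time) tuple.
    """
    target = "down" if score > cutoff else "up"

    if target == dc.status:
        dc.pending = None  # condition cleared before the delay expired
        return
    delay = failover_delay if target == "down" else failback_delay
    if dc.pending is None or dc.pending[0] != target:
        dc.pending = (target, now + delay)  # schedule the status change
    elif now >= dc.pending[1]:
        dc.status = target                  # condition sustained; apply it
        dc.pending = None
```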

📘

If liveness tests are failing for both data centers, the traffic remains on the primary data center even if it is down.

Nameserver demand estimation

GTM's function is to return answers for DNS queries based on who is asking. Queries are sent by client nameservers, which are usually operated by the end user's ISP or public recursive providers such as Google Public DNS or OpenDNS. As a DNS-based system, GTM has no way of knowing the IP addresses of the actual end users; it only knows them by the client nameservers they use.

When GTM sends or maps a client nameserver to a data center, it is implicitly directing all end users who use that nameserver to that data center. Client nameservers are usually shared by many users. Some client nameservers might have only a few users, while others might be used by, for example, all the customers of a large cable TV company in a metropolitan area. Thus, some nameservers represent more Internet traffic than others.

To balance load effectively, GTM needs to know whether a nameserver has a lot of traffic demand behind it or a little. To learn this, Akamai continually analyzes the logs of GTM's authoritative nameservers and keeps track of how often a client nameserver sends a query for a particular name. Consider client nameserver A, which repeatedly refreshes a name as soon as its TTL expires. It probably has more load than client nameserver B, which only refreshes the name once an hour. B probably has more load than client nameserver C, which only refreshes the name once a day. By analyzing the logs, GTM can learn the IP addresses of nearly all the nameservers on the Internet and develop an estimate of how much demand is behind each one.

Internet traffic measured across large populations shows characteristic variations. As an example, most nameservers show more load in the middle of the day than in the middle of the night (local time), and the daily peaks tend to be lower on weekends. For this reason, nameserver demand estimates are aggregated and averaged over five to seven days.

Nameserver demand estimates are used by some of GTM's load balancing modes. See Load balancing.

Limitations of DNS-based load balancing

As a DNS-based system, GTM only communicates with client caching nameservers. It has no idea of the actual user IP addresses on behalf of whom the client caching nameservers are acting, unless EDNS0 End-User Client Subnet (ECS) is enabled for the domain and the end user is using a resolver with which ​Akamai​ has an ECS agreement, for example Google Public DNS or OpenDNS. Moreover, GTM cannot keep track of what answers it has given in the past per nameserver, as there are too many client caching nameservers for that to be practical.

One consequence of this is that it is impossible to guarantee stickiness. Using the Weighted Random Load Balancing with Data Center Stickiness property type and the Persistent Assignment handout mode can improve the chances of keeping a user mapped to the same server over time, so as to preserve things like server-side session state, but cannot guarantee stickiness.

Another consequence is that GTM can only direct load represented by client caching nameservers, not individual users. As an extreme example, suppose that all of your users happen to use the same client caching nameserver. It is impossible for GTM to split this load based on the identity of the requester, because all requests come from the same IP. The only way to balance this load would be by weighted random load balancing, which computes a random answer on each request without regard for who is asking.

Nameservers and load balancing

When GTM is load balancing, it computes a new assignment of nameservers to data centers as often as once per minute. There are approximately 6 million active nameservers on the Internet, but most of them have only insignificant demand behind them. To bound the computation required by load balancing, GTM only balances those nameservers that are estimated to represent a substantial portion (95 percent by default, but this is configurable) of the demand on a domain. For most domains with large user bases, this typically runs between 20,000 and 100,000 nameservers. This means that some users, those behind nameservers representing very small demand, can get mapped to a data center even if the data center is considered to be over its load target.
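
A sketch of how a demand-coverage threshold bounds the set of balanced nameservers: sort nameservers by estimated demand and keep them until the configured fraction of total demand is covered. Illustrative only, not Akamai's implementation.

```python
def nameservers_to_balance(demand, coverage=0.95):
    """Select the nameservers that together represent `coverage` of the
    total estimated demand on a domain.

    demand maps nameserver IP -> estimated demand.
    """
    total = sum(demand.values())
    selected, covered = [], 0.0
    for ns, d in sorted(demand.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= coverage * total:
            break
        selected.append(ns)
        covered += d
    return selected

# Nameservers outside this set represent so little demand that their users
# can be mapped to a data center even when it is over its load target.
```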