Most network teams manage their authoritative domain name system (DNS) with a light touch. Administrators will say you shouldn’t mess with success: if the system works and users can reach revenue-generating services, applications, and content, leave it alone.
Unfortunately, that reliability means DNS is often taken for granted. Because it works so well, DNS is easy to dismiss as a “background service”. This “set it and forget it” approach can create blind spots for network teams, leaving performance and reliability problems undiagnosed. If those problems pile up or go unaddressed for long enough, they can quickly become a larger network performance issue.
DNS, just like any other machine or system, requires periodic maintenance. DNS needs to be checked for specific errors, even when it’s working well. This is so that minor problems don’t escalate into more serious issues.
Top 5 DNS Troubleshooting Tips for Network Teams
I’d like to give some pointers to network teams about what they should look for when troubleshooting DNS problems.
Set Baseline DNS Metrics
No two networks are the same. Each has its own quirks, so it’s important to know what “normal” looks like for your network before diagnosing problems.
You can use DNS data to get an idea of your average query volume. For most businesses this number is relatively stable. Seasonal variations are likely (especially in industries such as retail), but they’re usually predictable. As a business’s customer base or service volume grows, the number of queries will increase gradually, and that growth is generally predictable too.
You should also consider how that query volume breaks down. Does most of your DNS traffic go to one domain in particular? How stable (or volatile) are DNS queries across your various back-end resources? The answers are different for each enterprise and can change depending on the network team’s decisions about issues such as load balancing and product resourcing.
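As a rough illustration of what a baseline could look like in practice, here is a minimal Python sketch. It assumes you can export daily query totals and per-domain counts from your DNS provider; the function names, the sample numbers, and the three-standard-deviation threshold are all hypothetical choices, not a recommendation from any particular tool.

```python
from statistics import mean, stdev

def build_baseline(daily_query_counts):
    """Return (mean, standard deviation) for a list of daily query totals."""
    return mean(daily_query_counts), stdev(daily_query_counts)

def is_anomalous(count, baseline, threshold=3.0):
    """True if count sits more than `threshold` standard deviations from the mean."""
    avg, sd = baseline
    if sd == 0:
        return count != avg
    return abs(count - avg) / sd > threshold

def top_domain_share(queries_by_domain):
    """Return (domain, share of total queries) for the busiest domain."""
    total = sum(queries_by_domain.values())
    domain = max(queries_by_domain, key=queries_by_domain.get)
    return domain, queries_by_domain[domain] / total

# Hypothetical week of daily query totals.
history = [1_000_000, 1_020_000, 980_000, 1_010_000, 990_000, 1_005_000, 995_000]
baseline = build_baseline(history)

print(is_anomalous(1_015_000, baseline))  # within normal variation
print(is_anomalous(1_600_000, baseline))  # far outside the baseline
print(top_domain_share({"www.example.com": 700,
                        "api.example.com": 200,
                        "cdn.example.com": 100}))
```

A single mean and standard deviation will not capture seasonality, so a business with strong seasonal swings would probably want a separate baseline per season or per day of the week.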
Monitor NXDOMAIN Responses
An NXDOMAIN response is a clear indicator that something went wrong with a query: the requested name does not exist. Some level of NXDOMAIN responses is normal, though, because of “fat-finger” queries, redirection errors, and user-side problems that are outside the network team’s control.
IBM NS1’s Global Domain Data Report shows that 3-6% of DNS queries receive an NXDOMAIN response for one reason or another. In a “normal” network setup, anything in or around that range should be expected.
If you get to double digits, something big is likely happening. Pay attention to the shape of the pattern. A slow but steady increase in NXDOMAIN responses usually points to a long-standing misconfiguration whose errors simply track overall traffic volume. An abrupt spike in NXDOMAIN responses can be caused by either a localized (but high-impact) misconfiguration or a DDoS attack.
It is important to monitor NXDOMAIN responses in relation to overall query volume. A deviation from the norm is a sign that something is wrong. The next step is to figure out what is wrong and how to fix it. A closer look at the timing and characteristics of the increase will usually provide clues as to why it is happening.
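To make the idea of tracking NXDOMAIN responses against query volume concrete, here is a small Python sketch. The 6% and 10% cut-offs come from the figures above (the 3-6% range and the “double digits” warning sign); the function names, labels, and the spike heuristic are my own assumptions for illustration.

```python
def nxdomain_ratio(total_queries, nxdomain_responses):
    """Share of queries answered with NXDOMAIN (0.0 when there is no traffic)."""
    if total_queries == 0:
        return 0.0
    return nxdomain_responses / total_queries

def classify_ratio(ratio, expected_max=0.06, alarm=0.10):
    """Label a ratio using the 3-6% norm and the double-digit warning sign."""
    if ratio >= alarm:
        return "investigate"
    if ratio > expected_max:
        return "watch"
    return "expected"

def describe_change(ratios, spike_factor=2.0):
    """Rough heuristic: is a series of per-interval ratios spiking or drifting up?"""
    if len(ratios) < 2:
        return "stable"
    if ratios[-1] > spike_factor * ratios[-2]:
        return "spike"  # abrupt jump between the last two intervals
    rising = all(later >= earlier for earlier, later in zip(ratios, ratios[1:]))
    if rising and ratios[-1] > ratios[0]:
        return "gradual rise"  # never decreases and ends higher than it started
    return "stable"

print(classify_ratio(nxdomain_ratio(1000, 45)))    # expected
print(classify_ratio(nxdomain_ratio(1000, 120)))   # investigate
print(describe_change([0.04, 0.04, 0.05, 0.12]))   # spike
print(describe_change([0.04, 0.05, 0.06, 0.07]))   # gradual rise
```

The two labels from `describe_change` map onto the two patterns described above: a gradual rise hints at a long-running misconfiguration, while a spike calls for checking recent changes and possible attack traffic.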
NXDOMAIN responses aren’t always bad. They can even represent a business opportunity. If people keep querying a domain or subdomain under your name and come up empty, that could be a sign that the domain is worth buying or putting to use.
Beware of Exposing Internal DNS Data
One cause of NXDOMAIN responses is particularly alarming: misconfigurations that expose internal DNS zone and record data to the internet. This type of misconfiguration not only affects performance by increasing query volume but also poses a serious security risk.
Stale URL redirects can expose internal records. In the midst of a merger, acquisition, or other major change, systems can be left pointing at properties that have since been repurposed or retired. Those systems keep looking for the old destination and never get an answer. The smaller the workload, the greater the chance that the problem will go unnoticed.
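One way to look for this kind of leak is to scan NXDOMAIN queries on your public DNS for names that look internal. The Python sketch below assumes a query log reduced to `(qname, rcode)` pairs and a list of internal suffixes; both the log shape and the suffixes shown are hypothetical.

```python
from collections import Counter

# Hypothetical internal-only suffixes; replace with your own naming scheme.
INTERNAL_SUFFIXES = ("corp.example.com", "internal.example.com")

def is_internal(qname, suffixes=INTERNAL_SUFFIXES):
    """True if qname equals or sits under one of the internal suffixes."""
    name = qname.rstrip(".").lower()
    return any(name == s or name.endswith("." + s) for s in suffixes)

def leaked_internal_names(log, suffixes=INTERNAL_SUFFIXES):
    """Count NXDOMAIN queries for internal-looking names.

    `log` is an iterable of (qname, rcode) pairs.
    """
    hits = Counter()
    for qname, rcode in log:
        if rcode == "NXDOMAIN" and is_internal(qname, suffixes):
            hits[qname.rstrip(".").lower()] += 1
    return hits

sample_log = [
    ("db01.corp.example.com.", "NXDOMAIN"),
    ("DB01.corp.example.com", "NXDOMAIN"),
    ("www.example.com", "NOERROR"),
    ("typo.example.com", "NXDOMAIN"),
    ("notcorp.example.com", "NXDOMAIN"),
]
print(leaked_internal_names(sample_log))
```

Names that show up here repeatedly are worth tracing back to the system that is still asking for them.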
Pay Attention to Geography
When you establish a baseline of where your traffic comes from, you can more easily detect attacks and misconfigurations. You can even discover broader changes in usage patterns. A sudden increase in traffic to one specific server in a particular region is different from an overall increase in query volume. By tracking your DNS data geographically, you can pinpoint issues and get clues as to how to fix them.
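Here is a minimal Python sketch of that comparison, assuming you already have per-region query counts for a baseline period and for the current period. The region names, the counts, and the 2x factor are hypothetical.

```python
def overall_growth(baseline, current):
    """Ratio of total current volume to total baseline volume."""
    return sum(current.values()) / sum(baseline.values())

def regional_anomalies(baseline, current, factor=2.0):
    """Return {region: growth} for regions at or above `factor` times baseline.

    Regions with no baseline traffic at all are reported with growth of infinity.
    """
    flagged = {}
    for region, count in current.items():
        expected = baseline.get(region, 0)
        if expected == 0:
            if count > 0:
                flagged[region] = float("inf")
        elif count / expected >= factor:
            flagged[region] = count / expected
    return flagged

baseline = {"us-east": 1000, "eu-west": 800, "ap-south": 200}
current = {"us-east": 1050, "eu-west": 820, "ap-south": 900}

print(overall_growth(baseline, current))      # modest overall increase
print(regional_anomalies(baseline, current))  # only ap-south stands out
```

Looking at both numbers together separates a spike concentrated in one region from growth that is spread evenly across all of them.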
Check SERVFAILs for Misconfigured Alias Records
Alias records can be a source of misconfigurations, so they deserve to be audited regularly. It’s not uncommon for me to trace an increase in SERVFAIL responses, whether a sudden spike or a gradual rise, back to problems with alias records.
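A periodic audit can be as simple as walking each alias to its final target. The Python sketch below works on a zone exported into a dictionary of `name -> (record type, value)`; that data shape, the record names, and the hop limit are assumptions for illustration. It checks for two common alias problems, loops and targets that no longer exist. Whether either one actually surfaces as SERVFAIL depends on your DNS server.

```python
def audit_alias(name, records, max_hops=8):
    """Follow alias targets starting at `name`.

    Returns "ok", "dangling" (a name in the chain has no record),
    "loop", or "too long" (more than `max_hops` names in the chain).
    """
    seen = set()
    current = name
    while True:
        if current in seen:
            return "loop"
        seen.add(current)
        rtype, value = records.get(current, (None, None))
        if rtype is None:
            return "dangling"
        if rtype not in ("ALIAS", "CNAME"):
            return "ok"  # reached a terminal record such as A or AAAA
        if len(seen) > max_hops:
            return "too long"
        current = value

# Hypothetical zone data.
zone = {
    "www.example.com": ("CNAME", "lb.example.com"),
    "lb.example.com": ("A", "192.0.2.10"),
    "old.example.com": ("CNAME", "retired.example.net"),
    "a.example.com": ("CNAME", "b.example.com"),
    "b.example.com": ("CNAME", "a.example.com"),
}

for name in ("www.example.com", "old.example.com", "a.example.com"):
    print(name, audit_alias(name, zone))
```

This only inspects records you hold locally. Targets in zones you don't control would need a live lookup instead of a dictionary check.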
NOERROR NODATA? Consider IPv6
NXDOMAIN answers are straightforward: the queried name does not exist. Things get more nuanced when a response comes back as NOERROR but contains no answer. There is no dedicated response code for this case; when the answer count is 0, it is known as NOERROR NODATA (RFC 2308 describes NODATA as a pseudo response code). NOERROR NODATA indicates that the name was found, but it has no record of the type that was requested.
In our experience, if you see a lot of NOERROR NODATA replies, the resolver is usually looking for an AAAA record. If that is the case for you, consider adding support for IPv6.
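To check whether AAAA lookups are behind your NOERROR NODATA responses, you can tally them by query type. This Python sketch assumes a response log reduced to `(qname, qtype, rcode, answer_count)` tuples, which is a hypothetical format. It also simplifies things by treating every NOERROR response with zero answers as NODATA, which ignores referrals.

```python
from collections import Counter

def is_nodata(rcode, answer_count):
    """NOERROR with an empty answer section (simplified: referrals not excluded)."""
    return rcode == "NOERROR" and answer_count == 0

def nodata_by_qtype(responses):
    """Count NOERROR NODATA responses per query type.

    `responses` is an iterable of (qname, qtype, rcode, answer_count).
    """
    counts = Counter()
    for _qname, qtype, rcode, answer_count in responses:
        if is_nodata(rcode, answer_count):
            counts[qtype] += 1
    return counts

sample = [
    ("www.example.com", "A", "NOERROR", 1),
    ("www.example.com", "AAAA", "NOERROR", 0),
    ("api.example.com", "AAAA", "NOERROR", 0),
    ("example.com", "MX", "NOERROR", 0),
    ("nope.example.com", "A", "NXDOMAIN", 0),
]
print(nodata_by_qtype(sample).most_common())
```

If AAAA dominates the tally, the names that appear most often are natural candidates for publishing IPv6 addresses first.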
DNS Cardinality and Security Implications
There are two different types of cardinality in the world of DNS. Resolver cardinality is the number of distinct resolvers that are querying your DNS records. Query name cardinality is the number of distinct DNS names that are queried each minute.
It is important to measure DNS cardinality because it can indicate malicious activity. An increase in query name cardinality may indicate a random-label attack or mass-scale probing of your infrastructure. A sudden increase in resolver cardinality may indicate that you are being targeted by a botnet.
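Both kinds of cardinality are simple to compute from a query log. The Python sketch below assumes log entries of the form `(epoch_seconds, resolver_ip, qname)`, which is a hypothetical format, and counts distinct resolvers and distinct query names per minute.

```python
from collections import defaultdict

def cardinality_per_minute(log):
    """Return {minute: (resolver_cardinality, query_name_cardinality)}.

    `log` is an iterable of (epoch_seconds, resolver_ip, qname);
    `minute` is the epoch timestamp divided by 60, rounded down.
    """
    resolvers = defaultdict(set)
    names = defaultdict(set)
    for ts, resolver, qname in log:
        minute = int(ts) // 60
        resolvers[minute].add(resolver)
        names[minute].add(qname.rstrip(".").lower())  # normalize case and trailing dot
    return {m: (len(resolvers[m]), len(names[m])) for m in sorted(resolvers)}

sample_log = [
    (0, "198.51.100.1", "www.example.com"),
    (10, "198.51.100.1", "WWW.example.com."),
    (30, "198.51.100.2", "api.example.com"),
    (60, "198.51.100.1", "x1.example.com"),
    (61, "198.51.100.1", "x2.example.com"),
    (62, "198.51.100.1", "x3.example.com"),
]
print(cardinality_per_minute(sample_log))
```

In the sample, the second minute shows one resolver asking for three never-before-seen names. At much larger scale, that is the kind of shape a random-label attack might produce.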
These tips should help you understand the impact of DNS query behavior and what steps you can take to restore your DNS to a healthy state. Please feel free to share any other tips that you have learned over the course of your career.