
Measuring Cross-AZ Data in Default VPC Flow Logs

How to Construct a Switch Statement in CloudWatch Logs Insights
Joel Haubold, Trek10 | Aug 16, 2023
5 min read

As part of our CloudOps offering here at Trek10, we monitor our clients' AWS accounts for cost anomalies (sudden or gradual increases or decreases in AWS cost). While investigating a cost-related alert for a client, I determined that there was a significant increase ($100+ per day) in the poorly named cost code USW2-DataTransfer-Regional-Bytes. This is, of course, the charge for data transfer between Availability Zones in the same region.

The client that received this alert had recently made some changes to their application, but with half a dozen different application clusters and another half-dozen database/cache clusters, it wasn't obvious what would be generating the massive increase in traffic. Fortunately, the client had VPC flow logs enabled for all of their VPCs. Unfortunately, those flow logs used the default format, which does not include the Availability Zone ID. Not a problem; I would just need to map the private IP addresses in the flow logs to their subnets. “Great,” I thought, “I'll just use a switch statement with the isIpv4InSubnet function. CloudWatch Logs Insights has a switch or choice function, right?” Wrong.

No problem; we can construct a switch out of existing functions. If we concatenate all of the possible outputs into a single string, we can extract the one we need using the substring function substr. For example, if we need to select between option1, option2, and option3, each seven characters long, we can use a call similar to the following to get the correct option:

substr("option1option2option3", <indexToStartOfStringHere>, 7)


Since we can choose from multiple options using substr, we just need a way to calculate the starting position of the correct result (e.g. 0 for option1, 7 for option2, and 14 for option3 in the example above). CloudWatch Logs Insights allows implicit conversion of boolean true/false values into 1 and 0, respectively. Because of this implicit conversion, we can use the conditions that we would normally use in a switch/choice/ifs statement to calculate the index of the correct result. We can multiply each condition by the index of the output that corresponds to that condition and then sum all of these products. The formula becomes:

substr("output1output2output3", condition1 * 0 + condition2 * 7 + condition3 * 14, 7)


The only caveat is that all of the conditions must be mutually exclusive, or you'll get incorrect results: if more than one condition is true, the resulting index will either land on the offset of a later option or run past the end of the string. For example, if condition1 and condition2 were both true in the formula above, the index would be 0 + 7 = 7 and the output would be output2, even though condition1 also matched; if condition2 and condition3 were both true, the index would be 7 + 14 = 21, past the end of the string, and the output would be empty.
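To make the arithmetic concrete, here is a small sketch of the trick in plain JavaScript, which coerces true to 1 and false to 0 under multiplication just as Logs Insights does. The pick helper and its condition array are my own illustration, not query syntax:

```javascript
// Sketch of the concatenated-options trick. Each option is 7 characters
// wide, so condition i contributes an offset of i * 7 to the index.
function pick(conditions) {
  const options = 'output1output2output3';
  const width = 7;
  // true coerces to 1 and false to 0, so only the offsets of the
  // true conditions survive the sum
  const index = conditions.reduce((sum, cond, i) => sum + cond * i * width, 0);
  return options.slice(index, index + width);
}

// pick([true, false, false]) -> 'output1'
// pick([true, true, false])  -> 'output2' (overlapping conditions misfire)
// pick([false, true, true])  -> ''        (index 21 runs past the string)
```

The last two calls demonstrate the mutual-exclusivity caveat: overlapping conditions either select a later option or produce an empty result.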

You can also add a default at the start of the string. For example, suppose you have a metric N in the logs that you need to categorize into three buckets: micro, small, and large. With a default of other, the options string becomes othermicrosmalllarge (conveniently, all four labels are exactly five characters long). If the threshold between micro and small is 10, and the threshold between small and large is 1000, then the full formula is:

substr("othermicrosmalllarge", (N < 10) * 1 * 5 + (N >= 10 AND N < 1000) * 2 * 5 + (N >= 1000) * 3 * 5, 5)
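The same formula can be sanity-checked in plain JavaScript (an illustrative sketch, not Logs Insights syntax). I use >= 1000 for the large threshold so that N = 1000 is covered, and note that a non-numeric N fails every comparison and falls through to the other default:

```javascript
// Sketch of the bucket formula above; all four labels are exactly
// 5 characters, so no padding is needed.
function bucket(n) {
  const labels = 'othermicrosmalllarge';
  const index =
    (n < 10) * 1 * 5 +              // micro starts at offset 5
    (n >= 10 && n < 1000) * 2 * 5 + // small starts at offset 10
    (n >= 1000) * 3 * 5;            // large starts at offset 15
  return labels.slice(index, index + 5);
}

// bucket(3)    -> 'micro'
// bucket(250)  -> 'small'
// bucket(5000) -> 'large'
// bucket(NaN)  -> 'other'
```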


Now, if you have more than a couple of conditions, this becomes cumbersome to write by hand, so I threw together a script to generate the query programmatically. The script reads all of the subnets and their respective Availability Zones and produces the functions I needed inside my query. This client had no overlapping subnet CIDR ranges, so I could make a few simplifying assumptions. Once written, the script let me copy-paste its output into the AWS Console to build the query in CloudWatch Logs Insights.

#!/usr/bin/env node

const { EC2Client, paginateDescribeSubnets } = require("@aws-sdk/client-ec2");

const ec2Client = new EC2Client();

/**
 * Fetch every subnet visible to the caller in the current region.
 * @returns {Promise<Subnet[]>}
 */
async function getAllSubnets() {
  let subnets = [];
  for await (const page of paginateDescribeSubnets({ client: ec2Client }, {})) {
    subnets = subnets.concat(page.Subnets ?? []);
  }
  return subnets;
}

async function main() {
  let subnets = await getAllSubnets();
  // case statement for ip -> cidr -> azId
  let azIds = ['other'].concat(subnets.map(subnet => subnet.AvailabilityZoneId));
  let longestAzId = Math.max(...azIds.map(a=>a.length));
  let paddedAzIds = azIds.map(a => a.padEnd(longestAzId,' ')).join('');
  let field = 'srcAddr';
  // this assumes the subnet CIDR ranges do not overlap
  let subnetChecks = subnets.map((subnet, index) => `isIpv4InSubnet(${field}, "${subnet.CidrBlock}") * ${index + 1} * ${longestAzId}`).join(' + ');

  let statement = `trim(substr("${paddedAzIds}", ${subnetChecks}, ${longestAzId})) as srcAz`;
  console.log(statement);
  field = 'dstAddr';
  // this assumes the subnet CIDR ranges do not overlap
  subnetChecks = subnets.map((subnet, index) => `isIpv4InSubnet(${field}, "${subnet.CidrBlock}") * ${index + 1} * ${longestAzId}`).join(' + ');

  statement = `trim(substr("${paddedAzIds}", ${subnetChecks}, ${longestAzId})) as dstAz`;
  console.log(statement);
}

main().catch(console.error);
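If you want to spot-check the generated conditions without running a query, a small local stand-in for isIpv4InSubnet is straightforward. This helper is my own sketch; Logs Insights provides the real function server-side:

```javascript
// Local stand-in for the Logs Insights isIpv4InSubnet function,
// handy for checking which CIDR a sample address falls into.
function ipToInt(ip) {
  // fold the four octets into one unsigned 32-bit integer
  return ip.split('.').reduce((acc, octet) => ((acc << 8) | Number(octet)) >>> 0, 0);
}

function isIpv4InSubnet(ip, cidr) {
  const [base, bits] = cidr.split('/');
  const prefix = Number(bits);
  // a /0 matches everything; otherwise keep the top `prefix` bits
  const mask = prefix === 0 ? 0 : (~0 << (32 - prefix)) >>> 0;
  return ((ipToInt(ip) & mask) >>> 0) === ((ipToInt(base) & mask) >>> 0);
}

// isIpv4InSubnet('10.188.0.42', '10.188.0.0/24')  -> true
// isIpv4InSubnet('172.31.17.5', '172.31.16.0/20') -> true
```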


Using this script to generate the AZ-ID formula for each IP address, I was able to finalize my CloudWatch Logs Insights query to see which pairs of IP addresses in different AZs had the most traffic.

stats sum(bytes) as bytesTransferred by srcAddr, dstAddr, 
trim(substr("other   use1-az4use1-az6use1-az4use1-az4use1-az6use1-az4use1-az4use1-az4use1-az6use1-az4use1-az2use1-az1use1-az4use1-az3use1-az4use1-az2use1-az6use1-az6use1-az2use1-az1use1-az6", isIpv4InSubnet(srcAddr, "10.188.0.0/24") * 1 * 8 + isIpv4InSubnet(srcAddr, "10.188.1.0/24") * 2 * 8 + isIpv4InSubnet(srcAddr, "10.188.4.0/24") * 3 * 8 + 
isIpv4InSubnet(srcAddr, "10.0.2.0/24") * 4 * 8 + 
isIpv4InSubnet(srcAddr, "10.188.9.0/24") * 5 * 8 + 
isIpv4InSubnet(srcAddr, "10.0.4.0/24") * 6 * 8 + 
isIpv4InSubnet(srcAddr, "10.188.3.0/24") * 7 * 8 + 
isIpv4InSubnet(srcAddr, "10.188.10.0/24") * 8 * 8 + 
isIpv4InSubnet(srcAddr, "10.0.5.0/24") * 9 * 8 + 
isIpv4InSubnet(srcAddr, "172.16.1.0/24") * 10 * 8 + 
isIpv4InSubnet(srcAddr, "10.188.11.0/24") * 11 * 8 + 
isIpv4InSubnet(srcAddr, "10.188.8.0/24") * 12 * 8 + 
isIpv4InSubnet(srcAddr, "172.31.16.0/20") * 13 * 8 + 
isIpv4InSubnet(srcAddr, "172.31.48.0/20") * 14 * 8 + 
isIpv4InSubnet(srcAddr, "10.0.0.0/24") * 15 * 8 + 
isIpv4InSubnet(srcAddr, "172.16.0.0/24") * 16 * 8 + 
isIpv4InSubnet(srcAddr, "172.31.0.0/20") * 17 * 8 + 
isIpv4InSubnet(srcAddr, "10.0.3.0/24") * 18 * 8 + 
isIpv4InSubnet(srcAddr, "172.31.32.0/20") * 19 * 8 + 
isIpv4InSubnet(srcAddr, "10.188.2.0/24") * 20 * 8 + 
isIpv4InSubnet(srcAddr, "10.0.1.0/24") * 21 * 8, 8)) as srcAz, 
trim(substr("other   use1-az4use1-az6use1-az4use1-az4use1-az6use1-az4use1-az4use1-az4use1-az6use1-az4use1-az2use1-az1use1-az4use1-az3use1-az4use1-az2use1-az6use1-az6use1-az2use1-az1use1-az6", isIpv4InSubnet(dstAddr, "10.188.0.0/24") * 1 * 8 + isIpv4InSubnet(dstAddr, "10.188.1.0/24") * 2 * 8 + isIpv4InSubnet(dstAddr, "10.188.4.0/24") * 3 * 8 + 
isIpv4InSubnet(dstAddr, "10.0.2.0/24") * 4 * 8 + 
isIpv4InSubnet(dstAddr, "10.188.9.0/24") * 5 * 8 + 
isIpv4InSubnet(dstAddr, "10.0.4.0/24") * 6 * 8 + 
isIpv4InSubnet(dstAddr, "10.188.3.0/24") * 7 * 8 + 
isIpv4InSubnet(dstAddr, "10.188.10.0/24") * 8 * 8 + 
isIpv4InSubnet(dstAddr, "10.0.5.0/24") * 9 * 8 + 
isIpv4InSubnet(dstAddr, "172.16.1.0/24") * 10 * 8 + 
isIpv4InSubnet(dstAddr, "10.188.11.0/24") * 11 * 8 + 
isIpv4InSubnet(dstAddr, "10.188.8.0/24") * 12 * 8 + 
isIpv4InSubnet(dstAddr, "172.31.16.0/20") * 13 * 8 + 
isIpv4InSubnet(dstAddr, "172.31.48.0/20") * 14 * 8 + 
isIpv4InSubnet(dstAddr, "10.0.0.0/24") * 15 * 8 + 
isIpv4InSubnet(dstAddr, "172.16.0.0/24") * 16 * 8 + 
isIpv4InSubnet(dstAddr, "172.31.0.0/20") * 17 * 8 + 
isIpv4InSubnet(dstAddr, "10.0.3.0/24") * 18 * 8 + 
isIpv4InSubnet(dstAddr, "172.31.32.0/20") * 19 * 8 + 
isIpv4InSubnet(dstAddr, "10.188.2.0/24") * 20 * 8 + 
isIpv4InSubnet(dstAddr, "10.0.1.0/24") * 21 * 8, 8)) as dstAz
| filter srcAz != dstAz
| sort bytesTransferred desc
| limit 50


This query quickly enabled me to identify which ENIs were causing the extra traffic. It turned out the application was scanning a Redis cluster on every operation instead of fetching individual items.

As the example above shows, you can use this technique to group or categorize your logs for easier analysis in CloudWatch Logs Insights.

Hopefully, you've found this article helpful. If you're looking for assistance with Amazon CloudWatch or other monitoring components for your AWS environment, please feel free to contact us. We'd love to chat!
