Making Harmony with a Step Function Orchestrator

Be the maestro of your step functions with a simple orchestrator state machine.
Jessica Ribeiro | Jun 19 2022

You can use AWS Step Functions to run complex serverless workflows on demand. This extends the utility of AWS Lambda, enabling us to build support for batch jobs, long-running processes, pauses for external processes, and more. However, with complexity, there is also an overhead to both understanding and modifying a workflow.

Have you ever been asked to modify a design late in development because of a late-emerging requirement? How about being asked to dig up an old project and add just a “small” new feature? In the worst cases, this can be a moment of dread. Assumptions baked into a system can make it difficult to modify and extend in certain ways, and the worst of these spider throughout an entire system architecture.

Just like AWS CloudFormation stacks and microservices, with AWS Step Functions it can be important to separate the concerns and responsibilities of a system while simultaneously making it possible to orchestrate their deployment and usage. One approach to this problem with AWS Step Functions is to break loosely related workflows into separate state machines. In order to make these workflows appear to function as a unified system, something is required to orchestrate the launch of all of the component state machines. In this post, we are going to cover the construction of a simple orchestration state machine for running multiple other state machines with some advanced flow controls and debug hooks.

Recently, I needed to simplify the execution of multiple loosely related tasks under a single trigger. Some of these were already built as step functions in the same repository, so I could have combined them into a single state machine, but the resulting machine would have been more difficult to understand and extend in the future. Instead, I kept them separate and introduced an orchestration state machine that runs all of the target state machines from a single trigger and requires no additional AWS Lambda functions. Here is the list of requirements for the orchestrator:

  • The orchestrator needed to have the option to delay on a per-state-machine basis. To make development with this feature easier, a debug flag to skip delays was also needed.
  • It needed to allow for reconfiguration or running the same state machine in parallel with different configurations.
  • For ease of development, it had to allow individual state machines to be disabled easily.

Given those requirements, we can dive into the solution. For reference, this solution was built with v1.36.0 of AWS SAM CLI. The following YAML definition and state diagram represent the orchestrator state machine that is explained throughout the rest of this post.


Comment: An example of combining workflows using a Step Functions StartExecution task
  state with various integration patterns.
StartAt: Inject target state machine data
States:
  Inject target state machine data:
    Comment: Injects ARNs of target state machines
    Type: Pass
    Next: Start in parallel
    Parameters:
      payload.$: "$.payload"
      debug:
        skipDelays: false
      targets:
      - stateMachineArn: "${MyStateMachineArn}"
        disableTarget: false
        delay: 60
        configuration:
          someFeature: true
      - stateMachineArn: "${MyStateMachineArn}"
        disableTarget: false
        configuration:
          someFeature: false
  Start in parallel:
    Comment: Start child state machines in parallel dynamically with map
    Type: Map
    End: true
    ItemsPath: "$.targets"
    MaxConcurrency: 0
    Parameters:
      debug.$: "$.debug"
      payload.$: "$.payload"
      target.$: "$$.Map.Item.Value"
    Iterator:
      StartAt: Skip or Delay or Execute
      States:
        Skip or Delay or Execute:
          Comment: Skip/Delay/Execute
          Type: Choice
          Default: Execute
          Choices:
          - Next: Skip
            And:
            - Variable: "$.target.disableTarget"
              IsPresent: true
            - Variable: "$.target.disableTarget"
              BooleanEquals: true
          - Next: Delay
            And:
            - Variable: "$.target.delay"
              IsPresent: true
            - Variable: "$.target.delay"
              IsNumeric: true
            - Or:
              - Variable: "$.debug.skipDelays"
                IsPresent: false
              - Not:
                  Variable: "$.debug.skipDelays"
                  BooleanEquals: true
        Skip:
          Comment: End
          Type: Pass
          End: true
        Delay:
          Comment: Delay
          Type: Wait
          SecondsPath: "$.target.delay"
          Next: Execute
        Execute:
          Comment: Execute target state machine dynamically from input
          End: true
          Type: Task
          Resource: arn:aws:states:::states:startExecution.sync
          Parameters:
            StateMachineArn.$: "$.target.stateMachineArn"
            Input:
              NeedCallback: false
              AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$: "$$.Execution.Id"
              payload.$: "$.payload"
              configuration.$: "$.target.configuration"

The orchestrator state machine can be broken into a few phases:

  1. Merge input state with target state machine configuration data
  2. Iterate the targets
  3. Skip, delay, or execute each target

The first step uses a Pass state to merge configuration data, including the list of target state machine configurations, into the state. Each state machine configuration is composed of the target state machine’s ARN, the options for disabling or delaying a target state machine’s execution, and configuration specific to the target state machine. A particular state machine ARN, which is supplied via a variable substitution from the SAM template, can be reused in multiple targets with each having its own configuration. In addition to the list of targets, debug settings, such as the option to skip delays, are included in the configuration data. The input to the target state machines is expected to be contained in a payload property on the orchestrator input JSON.
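To make the merge concrete, here is a minimal Python sketch of the data shape the Pass state produces. It only models the result; the real merge is done by Step Functions, and the function name `inject_targets`, the sample ARN, and the sample payload are hypothetical values chosen for illustration.

```python
import copy

# Placeholder standing in for the substituted ${MyStateMachineArn} (made up).
EXAMPLE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:MyStateMachine"

# Static configuration mirroring what the Pass state injects.
STATIC_CONFIG = {
    "debug": {"skipDelays": False},
    "targets": [
        {
            "stateMachineArn": EXAMPLE_ARN,
            "disableTarget": False,
            "delay": 60,
            "configuration": {"someFeature": True},
        },
        {
            "stateMachineArn": EXAMPLE_ARN,
            "disableTarget": False,
            "configuration": {"someFeature": False},
        },
    ],
}


def inject_targets(orchestrator_input: dict) -> dict:
    """Model the Pass state: keep `payload` from the input, add static config."""
    state = copy.deepcopy(STATIC_CONFIG)
    state["payload"] = orchestrator_input["payload"]
    return state


state = inject_targets({"payload": {"customerId": "abc"}})
print(sorted(state))  # ['debug', 'payload', 'targets']
```

Note that the same ARN appears in both targets, each with its own configuration, and that only the first target carries a delay.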

The next step uses a Map state to iterate over the targets in order to decide how to handle each one. The payload, debug settings, and the current target's configuration are passed along to each iteration. Here the MaxConcurrency: 0 property places no limit on concurrency, on the assumption that it is OK to run as many of the targets in parallel as AWS limits will permit. Set a positive value if the workflow requires concurrency to be limited.
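As a rough illustration of what each iteration receives, the sketch below models the Map state's input shaping in Python. It does not model concurrency, and `map_item_inputs` and the sample values are hypothetical names used only for this example.

```python
from typing import Any, Dict, List


def map_item_inputs(state: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Model the Map state's Parameters block: one input object per target."""
    return [
        {
            "debug": state["debug"],      # debug.$: "$.debug"
            "payload": state["payload"],  # payload.$: "$.payload"
            "target": target,             # target.$: "$$.Map.Item.Value"
        }
        for target in state["targets"]    # ItemsPath: "$.targets"
    ]


# Sample state with made-up ARNs, shaped like the output of the first step.
example_state = {
    "payload": {"customerId": "abc"},
    "debug": {"skipDelays": False},
    "targets": [
        {"stateMachineArn": "arn:example:first", "delay": 60},
        {"stateMachineArn": "arn:example:second"},
    ],
}

items = map_item_inputs(example_state)
print(len(items))  # 2
```

Every iteration sees the same payload and debug settings, while `target` differs per item.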

Inside the map iteration, each target is run through a Choice state to decide if it needs to be skipped, executed now, or executed after a delay. It is worth noting that additional decision-making logic, including custom AWS Lambda functions, could be added at this point in the state machine to provide additional control. If the target state machine is executed, it is invoked with the payload property, the target state machine’s configuration property, and two AWS-specific flags: one to specify that the target does not use a callback and another to connect the two Step Functions executions by execution ID. Once all of the target state machines have been executed or skipped, the orchestrator completes its run. Now that we have a complete understanding of the orchestrator state machine, we can look at the AWS CloudFormation needed to deploy it with a set of state machines.
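The Choice rules can be read as a small decision function. The following Python sketch mirrors them under the assumption that each iteration input has the `debug` and `target` keys described above; `decide` is a hypothetical helper for reasoning about the rules, not part of the deployed state machine.

```python
def decide(item: dict) -> str:
    """Return "Skip", "Delay", or "Execute" for one map iteration input."""
    target = item.get("target", {})
    debug = item.get("debug", {})

    # Rule 1: skip when disableTarget is present and true.
    if target.get("disableTarget") is True:
        return "Skip"

    # Rule 2: delay when a numeric delay is present, unless
    # debug.skipDelays is present and true.
    delay = target.get("delay")
    is_numeric = isinstance(delay, (int, float)) and not isinstance(delay, bool)
    if is_numeric and debug.get("skipDelays") is not True:
        return "Delay"

    # Default: execute immediately.
    return "Execute"


print(decide({"target": {"delay": 60}, "debug": {"skipDelays": False}}))  # Delay
print(decide({"target": {"delay": 60}, "debug": {"skipDelays": True}}))   # Execute
print(decide({"target": {"disableTarget": True, "delay": 60}}))           # Skip
```

The order matters: a disabled target is skipped even if it also has a delay, and a delayed target still proceeds to Execute once the wait finishes.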

template.yaml snippet

# Resources:
  OrchestratorStateMachine: # placeholder logical ID; use your own resource name
    Type: AWS::Serverless::StateMachine
    Properties:
      DefinitionUri: statemachine/orchestrator.asl.yaml
      Policies:
        - Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action:
                - states:DescribeExecution
                - states:StopExecution
              Resource: '*'
            - Effect: Allow
              Action:
                - events:PutTargets
                - events:PutRule
                - events:DescribeRule
              Resource: !Sub arn:${AWS::Partition}:events:${AWS::Region}:${AWS::AccountId}:rule/StepFunctionsGetEventsForStepFunctionsExecutionRule
# add one of these execution policy templates for each target state machine
        - StepFunctionsExecutionPolicy:
            StateMachineName: !GetAtt MyStateMachine.Name # Target Name
      DefinitionSubstitutions:
# add one of these variable substitutions for each target state machine
        MyStateMachineArn: !Ref MyStateMachine # Target ARN

With the state machine defined, it just needs to be added to a SAM template and deployed to AWS. The two main things worth highlighting here are the policies and substitutions. The orchestrator's execution role must be allowed to get details about (states:DescribeExecution) and stop (states:StopExecution) executions of the target state machines. Because the Execute task uses the startExecution.sync integration, the role also needs the events:PutTargets, events:PutRule, and events:DescribeRule permissions on the managed rule that Step Functions uses to track child executions. Finally, the role needs to be able to start each target state machine. The SAM built-in policy template StepFunctionsExecutionPolicy grants states:StartExecution against a given state machine by name. Add one of these policies for each target state machine by name (NOT by ARN). In addition, each target state machine ARN that is referenced in the state machine definition needs to be supplied under the DefinitionSubstitutions property, using the same variable name as in the definition.
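Once deployed, the orchestrator expects its input wrapped in a payload property. The sketch below shows one way to start it from Python, assuming a boto3 Step Functions client is passed in; `start_orchestrator` is a hypothetical helper, and the ARN in the usage comment is a made-up placeholder.

```python
import json


def start_orchestrator(sfn_client, orchestrator_arn: str, payload: dict) -> str:
    """Start the orchestrator with the payload wrapped as the definition expects."""
    response = sfn_client.start_execution(
        stateMachineArn=orchestrator_arn,
        input=json.dumps({"payload": payload}),
    )
    return response["executionArn"]


# Usage (requires boto3 and AWS credentials; the ARN is a placeholder):
# import boto3
# execution_arn = start_orchestrator(
#     boto3.client("stepfunctions"),
#     "arn:aws:states:us-east-1:123456789012:stateMachine:Orchestrator",
#     {"customerId": "abc"},
# )
```

Passing the client in as an argument keeps the helper easy to exercise with a stub, without touching AWS.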

With this technique in your toolbox, you can now design and deploy complex, maintainable state machine architectures that are completely code-defined. Modifications and debugging become much easier to reason about and perform. If you are looking for more hands-on help with this, head over to our contact page to talk to Trek10.

Jessica Ribeiro

Team Support Lead Architect