Dead letter queue for errors with Beam, Asgarde, Dataflow and alerting in real time

Mazlum Tosun
Google Cloud - Community
11 min read · Dec 19, 2022


The goal of this article is to show a use case with a Beam pipeline containing a dead letter queue for errors, implemented with the Asgarde library.

For more details about the Asgarde library, you can check this article:

1. Explanation of the use case presented in this article

A dead letter queue is a way of catching errors in data pipelines without interrupting the job. This technique is interesting because errors are not lost and can be sunk to the storage service that best fits the error handling strategy:

  • Errors can be sent to a Pub/Sub topic to be reloaded
  • Errors can be stored in a BigQuery table so that developer teams or partners can analyse the errors and display them in a dashboard
  • Errors can be written to a Cloud Storage file to be reloaded later in batch mode

In this article, storage in a BigQuery table was chosen.

The Beam job contains two sinks:

  • Good outputs written to a BigQuery table
  • Failures written to a BigQuery table and also to Cloud Logging

Failures will be used in the following ways:

  • An email alert is sent in real time when failures are written to Cloud Logging
  • A Data Studio dashboard displays the failures for a date range

Here you can see the use case diagram for this article:

2. Dead letter queue with Beam native

In this section, we will show how to apply a dead letter queue and error handling with native Beam.

2.1 Example with Beam Java

To apply a dead letter queue and error handling, Beam suggests handling side outputs with:

  • TupleTags in a DoFn class
  • Built-in components MapElements and FlatMapElements with the exceptionsInto and exceptionsVia methods, for example:
@Override
public Result<PCollection<TeamStats>, Failure> expand(PCollection<TeamStatsRaw> input) {
    Result<PCollection<TeamStatsRaw>, Failure> res1 = input.apply("Validate fields", MapElements
            .into(of(TeamStatsRaw.class))
            .via(TeamStatsRaw::validateFields)
            .exceptionsInto(of(Failure.class))
            .exceptionsVia(exElt -> Failure.from("Validate fields", exElt)));

    PCollection<TeamStatsRaw> output1 = res1.output();
    PCollection<Failure> failure1 = res1.failures();

    Result<PCollection<TeamStats>, Failure> res2 = output1.apply("Compute team stats", MapElements
            .into(of(TeamStats.class))
            .via(TeamStats::computeTeamStats)
            .exceptionsInto(of(Failure.class))
            .exceptionsVia(exElt -> Failure.from("Compute team stats", exElt)));

    PCollection<TeamStats> output2 = res2.output();
    PCollection<Failure> failure2 = res2.failures();

    final TransformToTeamStatsWithSloganFn toStatsWithSloganFn = new TransformToTeamStatsWithSloganFn(
            "Add team slogan",
            slogansSideInput
    );
    final PCollectionTuple res3 = output2.apply(name,
            ParDo.of(toStatsWithSloganFn)
                    .withOutputTags(toStatsWithSloganFn.getOutputTag(), TupleTagList.of(toStatsWithSloganFn.getFailuresTag()))
                    .withSideInput("slogans", slogansSideInput));

    PCollection<TeamStats> output3 = res3.get(toStatsWithSloganFn.getOutputTag());
    PCollection<Failure> failure3 = res3.get(toStatsWithSloganFn.getFailuresTag());

    PCollection<Failure> allFailures = PCollectionList
            .of(failure1)
            .and(failure2)
            .and(failure3)
            .apply(Flatten.pCollections());

    return Result.of(output3, allFailures);
}

The problem with this approach is that the code is verbose and we have to repeat technical code for error handling.

Link to the GitHub repository:

2.2 Example with Beam Python

In the Python SDK the principle is the same: we use tagged outputs in a DoFn class, for example:

def expand(self, inputs: PCollection[TeamStatsRaw]) -> \
        Tuple[PCollection[TeamStats], PCollection[Failure]]:
    # When.
    outputs_map1, failures_map1 = (
            inputs | VALIDATE_FIELDS >> ParDo(TeamStatsRawFieldsValidationFn(VALIDATE_FIELDS))
            .with_outputs(FAILURES, main='outputs')
    )

    outputs_map2, failures_map2 = (
            outputs_map1 | COMPUTE_TEAM_STATS >> ParDo(TeamStatsMapperFn(COMPUTE_TEAM_STATS))
            .with_outputs(FAILURES, main='outputs')
    )

    final_outputs, failures_map3 = (
            outputs_map2 | ADD_SLOGAN_TEAM_STATS >>
            ParDo(
                TeamStatsWithSloganFn(ADD_SLOGAN_TEAM_STATS),
                AsSingleton(self.slogans_side_inputs)
            )
            .with_outputs(FAILURES, main='outputs')
    )

    result_all_failures = (
            (failures_map1, failures_map2, failures_map3)
            | 'All Failures PCollections' >> beam.Flatten()
    )

    return final_outputs, result_all_failures

As for the Java part, the problem is the same: we have to repeat the technical code for error handling.
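To make the mechanism explicit, here is a minimal sketch of such a DoFn with tagged outputs; the Failure stand-in, the field names and the validation logic are illustrative and not the ones from the repository:

import apache_beam as beam
from apache_beam import pvalue
from dataclasses import dataclass

FAILURES = 'failures'


@dataclass
class Failure:
    # Simplified stand-in for the Failure model used in the repository.
    pipeline_step: str
    input_element: str
    exception: Exception


class ValidateFieldsFn(beam.DoFn):
    """Sends valid elements to the main output and errors to the 'failures' tag."""

    def __init__(self, step_name: str):
        super().__init__()
        self.step_name = step_name

    def process(self, element):
        try:
            # Business logic that may raise an exception (illustrative).
            yield element.validate_fields()
        except Exception as err:
            # Route the failing element to the side output instead of failing the job.
            yield pvalue.TaggedOutput(
                FAILURES,
                Failure(pipeline_step=self.step_name,
                        input_element=str(element),
                        exception=err))

Such a DoFn is then applied with .with_outputs(FAILURES, main='outputs'), exactly as in the snippet above.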

Link to the GitHub repository:

3. Dead letter queue with Asgarde library

In this section, we will show how to apply a dead letter queue and error handling with Beam and the Asgarde library.

3.1 Example with Asgarde Java

Asgarde Java provides a CollectionComposer class that simplifies error handling with less and more expressive code:

@Override
public Result<PCollection<TeamStats>, Failure> expand(PCollection<TeamStatsRaw> input) {
    return CollectionComposer.of(input)
            .apply("Validate fields", MapElements.into(of(TeamStatsRaw.class)).via(TeamStatsRaw::validateFields))
            .apply("Compute team stats", MapElementFn
                    .into(of(TeamStats.class))
                    .via(TeamStats::computeTeamStats)
                    .withStartBundleAction(() -> LOGGER.info("####################Start bundle compute stats")))
            .apply("Add team slogan", MapProcessContextFn
                    .from(TeamStats.class)
                    .into(of(TeamStats.class))
                    .via(c -> addSloganToStats(c, slogansSideInput))
                    .withSetupAction(() -> LOGGER.info("####################Start add slogan")),
                    Collections.singletonList(slogansSideInput))
            .getResult();
}

Link to the GitHub repository:

3.2 Example with Asgarde Kotlin

Asgarde also provides extensions for Kotlin with a more functional programming style:

override fun expand(input: PCollection<TeamStatsRaw>): Result<PCollection<TeamStats>, Failure> {
    return CollectionComposer.of(input)
        .map("Validate fields") { it.validateFields() }
        .mapFn(
            name = "Compute team stats",
            startBundleAction = { LOGGER.info("####################Start bundle compute stats") },
            transform = { TeamStats.computeTeamStats(it) })
        .mapFnWithContext<TeamStats, TeamStats>(
            name = "Add team slogan",
            setupAction = { LOGGER.info("####################Start add slogan") },
            sideInputs = listOf(slogansSideInput),
            transform = { addSloganToStats(it, slogansSideInput) }
        )
        .result
}

Link to the GitHub repository:

3.3 Example with Asgarde Python

Asgarde Python, like the Java library, provides a CollectionComposer class to simplify error handling:

def expand(self, inputs: PCollection[TeamStatsRaw]) -> \
        Tuple[PCollection[TeamStats], PCollection[Failure]]:
    result = (CollectionComposer.of(inputs)
              .map("Validate raw fields", lambda t_raw: t_raw.validate_fields())
              .map("Compute team stats", TeamStats.compute_team_stats)
              .map("Add slogan to team stats",
                   self.add_slogan_to_stats,
                   slogans=AsSingleton(self.slogans_side_inputs),
                   setup_action=lambda: '######### Start add slogan to stats actions')
              )

    return result.outputs, result.failures

Link to the GitHub repository:
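As a quick standalone illustration, the composer can be tried locally with the DirectRunner. This is only a sketch: the import path below is an assumption, so check the project's README for the exact package name.

import apache_beam as beam

from asgarde.collection_composer import CollectionComposer  # assumed import path

with beam.Pipeline() as pipeline:
    inputs = pipeline | 'Create inputs' >> beam.Create(['1', '2', 'not a number'])

    result = (CollectionComposer.of(inputs)
              .map('Parse int', lambda element: int(element)))

    # '1' and '2' reach the good outputs, 'not a number' is caught as a Failure.
    result.outputs | 'Print outputs' >> beam.Map(print)
    result.failures | 'Print failures' >> beam.Map(print)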

4. IO of Beam pipeline for failures

In this example, the pipeline outputs are:

  • Good outputs written to a BigQuery table
  • Failures caught with Beam/Asgarde written to a BigQuery table for analytics and exposure in a dashboard
  • Failures written to Cloud Logging with a specific log pattern containing the input element and the stack trace

The schema of the BigQuery job_failure table is:

To be cost-effective, we added a partition per DAY on the dwhCreationDate field.

We also added clustering on the featureName, jobName, componentType and pipelineStep fields to increase query performance when users filter on them with a WHERE clause.

The JSON schema:

[
  {
    "name": "featureName",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Feature name concerned by error"
  },
  {
    "name": "jobName",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Job name concerned by error"
  },
  {
    "name": "pipelineStep",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Pipeline step concerned by error"
  },
  {
    "name": "inputElement",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Input element when error occurred"
  },
  {
    "name": "exceptionType",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Exception type of error"
  },
  {
    "name": "stackTrace",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "Stack trace of error"
  },
  {
    "name": "componentType",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "GCP Component type (Dataflow, Composer...)"
  },
  {
    "name": "dwhCreationDate",
    "type": "TIMESTAMP",
    "mode": "NULLABLE",
    "description": "Creation date of error"
  }
]
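As a side note, a partitioned and clustered table like this one can be created with the google-cloud-bigquery client. Here is a minimal sketch in which the project, dataset and table identifiers are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("featureName", "STRING"),
    bigquery.SchemaField("jobName", "STRING"),
    bigquery.SchemaField("pipelineStep", "STRING"),
    bigquery.SchemaField("inputElement", "STRING"),
    bigquery.SchemaField("exceptionType", "STRING"),
    bigquery.SchemaField("stackTrace", "STRING"),
    bigquery.SchemaField("componentType", "STRING"),
    bigquery.SchemaField("dwhCreationDate", "TIMESTAMP"),
]

# Placeholder table identifier.
table = bigquery.Table("my-project.my_dataset.job_failure", schema=schema)

# Daily partitioning on dwhCreationDate to keep queries cost effective.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="dwhCreationDate",
)

# Clustering on the fields most often used in WHERE clauses.
table.clustering_fields = ["featureName", "jobName", "componentType", "pipelineStep"]

client.create_table(table)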

4.1 Code of BigQuery IO in Beam Java

public PTransform<PCollection<Failure>, ? extends POutput> write() {
    return BigQueryIO.<Failure>write()
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            .to(failureConf.getOutputDataset() + "." + failureConf.getOutputTable())
            .withFormatFunction(failure -> toFailureTableRow(failureConf, failure))
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND);
}

private static TableRow toFailureTableRow(FailureConf failureConf, Failure failure) {
    val creationDate = Instant.now().toString();

    return new TableRow()
            .set(FEATURE_NAME.getValue(), failureConf.getFeatureName())
            .set(PIPELINE_STEP.getValue(), failure.getPipelineStep())
            .set(JOB_NAME.getValue(), failureConf.getJobName())
            .set(INPUT_ELEMENT.getValue(), failure.getInputElement())
            .set(EXCEPTION_TYPE.getValue(), failure.getException().getClass().getSimpleName())
            .set(STACK_TRACE.getValue(), ExceptionUtils.getStackTrace(failure.getException()))
            .set(COMPONENT_TYPE.getValue(), COMPONENT_TYPE_VALUE)
            .set(DWH_CREATION_DATE.getValue(), creationDate);
}

Link to the GitHub repository:

4.2 Code of Cloud Logging IO in Beam Java

public PDone expand(PCollection<Failure> input) {
    input.apply(STEP_MAP_FAILURE_JSON_STRING, new LogTransform());
    return PDone.in(input.getPipeline());
}

static class LogTransform extends PTransform<PCollection<Failure>, PCollection<String>> {
    public PCollection<String> expand(PCollection<Failure> input) {
        return input
                .apply(STEP_MAP_FAILURE_JSON_STRING, MapElements
                        .into(strings())
                        .via(FailureCloudLoggingWriteTransform::toFailureLogInfo))
                .apply(STEP_LOG_FAILURE, MapElements.into(strings()).via(this::logFailure));
    }

    private String logFailure(String failureAsString) {
        LOGGER.error(failureAsString);
        return failureAsString;
    }
}

private static String toFailureLogInfo(Failure failure) {
    val inputElementInfo = "InputElement : " + failure.getInputElement();
    val stackTraceInfo = "StackTrace : " + ExceptionUtils.getStackTrace(failure.getException());

    return inputElementInfo + "\n" + stackTraceInfo;
}

The link to the GitHub repository:

4.3 Code of BigQuery IO in Beam Python

def expand(self, inputs_failures: PCollection[Failure]):
    return (inputs_failures
            | 'Map to failure table fields' >> beam.Map(self.to_failure_table_fields)
            | "Sink failures to Bigquery" >> beam.io.WriteToBigQuery(
                project=self.pipeline_conf.project_id,
                dataset=self.pipeline_conf.failure_output_dataset,
                table=self.pipeline_conf.failure_output_table,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
            )

def to_failure_table_fields(self, failure: Failure):
    return {
        FailureTableFields.FEATURE_NAME.value: self.pipeline_conf.failure_feature_name,
        FailureTableFields.JOB_NAME.value: self.pipeline_conf.job_type,
        FailureTableFields.PIPELINE_STEP.value: failure.pipeline_step,
        FailureTableFields.INPUT_ELEMENT.value: failure.input_element,
        FailureTableFields.EXCEPTION_TYPE.value: type(failure.exception).__name__,
        FailureTableFields.STACK_TRACE.value: get_failure_error(failure),
        FailureTableFields.COMPONENT_TYPE.value: 'DATAFLOW',
        FailureTableFields.DWH_CREATION_DATE.value: datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")
    }

The link to the GitHub repository:

4.4 Code of Cloud Logging IO in Beam Python

def expand(self, inputs_failures: PCollection[Failure]):
    return (inputs_failures
            | 'Map to failure log info' >> beam.Map(self.to_failure_log_info)
            | "Logs Failure to cloud logging" >> beam.Map(self.log_failure))

def to_failure_log_info(self, failure: Failure):
    input_element_info = f'InputElement : {failure.input_element}'
    stack_trace_info = f'StackTrace : {get_failure_error(failure)}'

    return f'{input_element_info} \n {stack_trace_info}'

def log_failure(self, failure_log_info: str):
    logging.error(failure_log_info)

    return failure_log_info

The link to the GitHub repository:

5. Create an alerting policy in Google Cloud on Beam failures

After writing failures to Cloud Logging, we want to send an email alert in real time. Here are the steps in Google Cloud to achieve this:

  • Create a log-based metric that filters on the failure pattern written by the Dataflow job
  • Create a notification channel; email was chosen for this example
  • Create an alerting policy based on the log-based metric and the notification channel created previously

5.1 Create a log-based metric

We will first create a log-based metric on Dataflow job errors.

Some explanations about the log format in Cloud Logging

An entry in Cloud Logging is represented by the LogEntry format and object:

Here you can see the JSON representation of the LogEntry object:

{
  "logName": string,
  "resource": {
    object (MonitoredResource)
  },
  "timestamp": string,
  "receiveTimestamp": string,
  "severity": enum (LogSeverity),
  "insertId": string,
  "httpRequest": {
    object (HttpRequest)
  },
  "labels": {
    string: string,
    ...
  },
  "metadata": {
    object (MonitoredResourceMetadata)
  },
  "operation": {
    object (LogEntryOperation)
  },
  "trace": string,
  "spanId": string,
  "traceSampled": boolean,
  "sourceLocation": {
    object (LogEntrySourceLocation)
  },
  "split": {
    object (LogSplit)
  },

  // Union field payload can be only one of the following:
  "protoPayload": {
    "@type": string,
    field1: ...,
    ...
  },
  "textPayload": string,
  "jsonPayload": {
    object
  }
  // End of list of possible types for union field payload.
}

We can add filters on the LogEntry fields.

Creation of the Log-based Metric

In the Google Cloud console, we go to the log-based metrics menu and first fill in the metric type, the metric name and the description:

Metric type

We chose the counter metric type because every time a Dataflow job writes errors to the logs with the specific pattern, we want to increment this metric.

The metric name in this example is dataflow_jobs_intercepted_errors

Filter

Then we fill in the filter and the label:

The format used in the filter is the one used in Cloud Logging and the Logs Explorer in Google Cloud.

resource.type=dataflow_step targets only Dataflow jobs and services.

severity=ERROR specifies that the filter concerns only errors in the logs.

Our Dataflow jobs write errors to Cloud Logging with a specific pattern containing the InputElement and StackTrace keywords, which is why the filter adds a criterion based on jsonPayload.message:

jsonPayload.message =~ InputElement.*

  • The =~ operator performs a regular expression match (here it acts like a "contains")
  • The .* matches any sequence of characters
  • This criterion matches log messages containing the InputElement pattern; the combined filter is sketched after this list
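Putting these criteria together, the full filter of the log-based metric should look something like the following sketch (in the Logging query language the regular expression is written between double quotes):

resource.type="dataflow_step"
severity=ERROR
jsonPayload.message =~ "InputElement.*"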

Labels

Labels allow log-based metrics to contain multiple time series.

In this case, we add the label resource.label.job_id because we want to apply the filter to the logs of each Dataflow job rather than to all jobs together.

5.2 Create a notification channel

In the Notification channels menu, we create an email channel.

5.3 Create the alerting policy based on log-based metric

In this section, we will create the alerting policy based on the log-based metric created previously.

First, in the New condition menu, we type Dataflow in the metric search field, then select log-based metrics and the metric created previously, called dataflow_jobs_intercepted_errors.

In the Configure trigger menu, we select Threshold as the condition type, with a threshold value of 0.9.

This means that if there is at least one error for a Dataflow job, an alert is fired (1 is greater than 0.9).

The last menu to configure is Alert details; we set the notification channel created previously, which is an email address.

For each alert, an email will be sent to the specified address.

It’s worth noting that we can specify multiple notification channels (several emails, Slack…).

6. Alerting email example

Here you can see an example email.

Two links are proposed in the email:

  • VIEW INCIDENT links to the current incident in Google Cloud
  • VIEW LOGS links to the error logs in Cloud Logging

Example of error logs for the Dataflow job in Cloud Logging:

You can see:

  • The Cloud Logging Logs Explorer page
  • The filter in the Query menu
  • On the left side, the log fields: RESOURCE TYPE, SEVERITY, JOB_NAME, REGION
  • The query results containing the error logs; they correspond to the InputElement and the StackTrace, which form the pattern written by the Beam/Dataflow job

7. Display failures in a Data Studio dashboard

In this section, we will create a Data Studio dashboard responsible for displaying failures.

This dashboard offers the following filters:

  • A date range on dwhCreationDate
  • A dropdown list on the featureName field
  • A dropdown list on the jobName field
  • An advanced filter on inputElement (equals, contains…)
  • An advanced filter on stackTrace

The rest of the page contains:

  • The resulting failure data in a table
  • A pie chart with the percentage per exceptionType
  • A pie chart with the percentage per jobName
  • A bar chart with the record count per dwhCreationDate and jobName

I will give a deeper explanation of this dashboard in a separate article.

Here are some resources that could be helpful for the components used in this article:

Asgarde Java:

Asgarde Python:

Talk at Beam Summit on Asgarde:

If you like the Asgarde project, support us with a GitHub star ✩

If you like my articles and want to see my posts, follow me on:

- Medium
- Twitter
- LinkedIn
