Batch inference for product insights with Apache Airflow®

Info

This page has not yet been updated for Airflow 3. The concepts shown are relevant, but some code may need to be updated. If you run any examples, take care to update import statements and watch for any other breaking changes.

The Batch inference for product insights repository is a free and open-source reference architecture showing how to use Apache Airflow® and OpenAI to summarize product feedback and generate insights from it. The full source code is available on GitHub.

Screenshot of a Slack message showing a product summary generated by the pipeline.

This reference architecture was created as a learning tool to demonstrate how to use Apache Airflow to orchestrate data ingestion, tagging of feedback with relevant products, and per-product feedback summarization in a batch inference pipeline. You can adapt the pipeline to your use case by ingesting data from other sources and adjusting the LLM prompts to fit your needs.

Architecture

Batch inference reference architecture diagram.

This batch inference pipeline consists of four main components:

Data ingestion and embedding: Product feedback is ingested from a variety of sources. The ingest_zendesk_tickets DAG extracts feedback from Zendesk tickets stored in Snowflake. The ingest_data_apis DAG extracts feedback from the GitHub and StackOverflow APIs, as well as from local files containing G2 reviews.
Product/feature tagging: Using OpenAI, the feedback is tagged with the relevant product or feature.
Create feedback summaries and insights: All feedback relating to one product/feature is aggregated and summarized using GPT-4o. The summaries are posted to a Slack channel.
Executive summary: A final DAG aggregates all product summaries and insights into an executive summary that is posted to a Slack channel.
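The tagging and per-product aggregation steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's code: the keyword matcher in `tag_feedback` stands in for the OpenAI tagging call, `PRODUCT_KEYWORDS` and the sample feedback are hypothetical, and `build_summary_prompt` only assembles the text that would be sent to the model rather than calling it.

```python
from collections import defaultdict

# Hypothetical product tags and keywords; the real pipeline asks OpenAI to tag.
PRODUCT_KEYWORDS = {
    "scheduler": ["schedule", "cron"],
    "ui": ["button", "dashboard"],
}


def tag_feedback(text: str) -> list[str]:
    """Stand-in for the LLM tagging step: return every product whose keyword appears."""
    lowered = text.lower()
    return [
        product
        for product, keywords in PRODUCT_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]


def group_by_product(feedback: list[str]) -> dict[str, list[str]]:
    """Aggregate feedback per product tag; one item may land under several tags."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for item in feedback:
        for product in tag_feedback(item):
            grouped[product].append(item)
    return dict(grouped)


def build_summary_prompt(product: str, items: list[str]) -> str:
    """Assemble the prompt text for one product's summarization request."""
    bullet_list = "\n".join(f"- {item}" for item in items)
    return f"Summarize the following feedback about '{product}':\n{bullet_list}"


feedback = [
    "The dashboard button is hard to find.",
    "Cron schedule syntax is confusing.",
    "I love the new dashboard, but the schedule view is slow.",
]

grouped = group_by_product(feedback)
# One prompt per product tag, mirroring one summarization unit per product.
prompts = {
    product: build_summary_prompt(product, items)
    for product, items in grouped.items()
}
```

Each entry in `prompts` corresponds to one independent unit of summarization work, which is why the per-product step lends itself to running in parallel.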

Airflow features

The DAGs that power the batch inference pipelines highlight several key Airflow best practices and features:

Dynamic task mapping: Dynamic task mapping is used extensively to parallelize tasks throughout the pipeline. For example, feedback summarization and insight generation are parallelized by creating one dynamically mapped task instance per product tag that is analyzed. Custom map indexing is used to make it easier to find specific summaries in the task logs.
Object Storage: Interaction with files in object storage is simplified using the experimental Airflow Object Storage API.
Airflow retries: To protect against transient API failures and rate limits, all tasks are configured to automatically retry after an adjustable delay.
Advanced data-driven scheduling: The DAGs in this reference architecture run on data-driven schedules, including conditional asset scheduling.
Modularization: The ingest_data_apis DAG serves as an example of a high level of modularization. Task functions are stored in the include folder and imported into the DAG file to be used in @task decorators. Ingestion sources are defined in a list of configurations, and a loop generates one parallel ingestion track per source.
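Two of the patterns above, retrying after a delay and generating one ingestion track per configured source, can be illustrated without Airflow. The sketch below is plain Python and makes several assumptions: `SOURCES`, `with_retries`, and `make_ingest` are hypothetical names, and the simulated first-attempt failure exists only to exercise the retry path. In Airflow itself, retries are configured through the `retries` and `retry_delay` task parameters rather than a hand-written decorator.

```python
import time
from functools import wraps
from typing import Callable

# Hypothetical source configurations; names and record contents are illustrative.
SOURCES = [
    {"name": "github", "records": ["issue: scheduler stalls"]},
    {"name": "stackoverflow", "records": ["question: how to map tasks?"]},
    {"name": "g2", "records": ["review: great UI"]},
]


def with_retries(retries: int, delay_seconds: float) -> Callable:
    """Retry a function after a fixed delay, mimicking task-level retries."""

    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return func(*args, **kwargs)
                except ConnectionError:
                    if attempt == retries:
                        raise  # out of retries, surface the failure
                    time.sleep(delay_seconds)

        return wrapper

    return decorator


def make_ingest(source: dict) -> Callable[[], list[str]]:
    """Build one ingestion function per source config.

    Using a factory binds each function to its own source and avoids the
    late-binding pitfall of defining closures directly inside a loop.
    """
    state = {"calls": 0}

    @with_retries(retries=2, delay_seconds=0.01)
    def ingest() -> list[str]:
        state["calls"] += 1
        if state["calls"] == 1:
            # Simulate a transient API failure on the first attempt.
            raise ConnectionError(f"{source['name']} temporarily unavailable")
        return [f"{source['name']}: {record}" for record in source["records"]]

    return ingest


# The loop generates one ingestion track per configured source.
ingest_tracks = {source["name"]: make_ingest(source) for source in SOURCES}
results = {name: ingest() for name, ingest in ingest_tracks.items()}
```

Adding a new source in this sketch means appending one entry to `SOURCES`; the loop and the retry behavior stay unchanged, which is the main benefit of the configuration-driven approach.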

Next steps

Get the Astronomer GenAI cookbook to view more examples of how to use Airflow to build generative AI applications.

If you’d like to build your own batch inference pipeline, feel free to fork the repository and adapt it to your use case. We recommend deploying the Airflow pipelines using a free trial of Astro.