Building an AWS Serverless Pipeline for Speaker Diarization and Transcription at ~$1 per Video
TL;DR
- ~$2/month fixed cost (Secrets Manager + ECR + logs) for a speaker diarization and transcription pipeline
- AWS Step Functions + Lambda fully serverless architecture
- pyannote.audio 3.1 for speaker diarization, faster-whisper for transcription, gpt-4o-mini for LLM analysis
- 8-hour video processed for ~$2.3 (x86, no free tier), about 5x more cost-efficient than AWS Transcribe
- Deep dive into pitfalls like States.DataLimitExceeded and their solutions
Repository: github.com/ekusiadadus/ek-transcript
Introduction
I've been analyzing more and more user interview recordings lately. When evaluating existing solutions:
- ✗ AWS Transcribe: ~$11.52 for 8 hours ($0.024/min), speaker diarization accuracy not great
- ✗ Commercial SaaS: $50–$200/month fixed fees, charged even in months with no usage
- ✗ Always-on GPU server: EC2 g4dn.xlarge costs $380+/month — too expensive for personal use
The biggest problem was fixed monthly costs. I only use this a few times a month, yet I'd be paying every month. I wanted pure pay-per-use pricing with near-zero monthly fixed costs — that was my top priority.
So I decided to build my own pipeline using AWS serverless services.
Requirements
- Zero monthly fixed cost (pay only for what you use)
- Support for long videos up to 8 hours
- Speaker diarization (who said what)
- High-accuracy Japanese transcription
- LLM-powered summarization and analysis
- Low cost (~$1 per video)
- Fully serverless
System Architecture
┌─────────────────────────────────────────────────────────────────────────────────┐
│ AWS Cloud │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌─────────────────┐ │
│ │ Amazon S3 │ │ EventBridge │ │
│ │ (Input) │────▶│ Rule │ │
│ │ uploads/ │ │ (Object Created)│ │
│ └──────────────┘ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────┐ ┌─────────────────┐ │
│ │ Lambda │ │ DynamoDB │ │
│ │ StartPipeline │─────▶│ InterviewsTable │ │
│ └────────┬───────┘ └─────────────────┘ │
│ │ ▲ │
│ ▼ │ │
│ ┌─────────────────────────────────────────────────────┼───────────────────────┐│
│ │ AWS Step Functions │ ││
│ │ │ ││
│ │ ┌─────────────┐ ┌─────────────┐ ┌────────────┐│ ││
│ │ │ Lambda │ │ Lambda │ │ Lambda ││ ││
│ │ │ExtractAudio │──▶│ ChunkAudio │──▶│(Map State) ││ ││
│ │ │ (ffmpeg) │ │ (8min+30s) │ │DiarizeChunk││ ││
│ │ └─────────────┘ └─────────────┘ │ x5 parallel││ ││
│ │ │ │ pyannote ││ ││
│ │ │ └─────┬──────┘│ ││
│ │ ▼ │ │ ││
│ │ ┌──────────┐ ▼ │ ││
│ │ │Amazon S3 │◀───────────────────┬─────────────────┤ ││
│ │ │(Output) │ │ │ ││
│ │ │processed/│ ┌────────────────┴──────┐ │ ││
│ │ │analysis/ │ │ Lambda │ │ ││
│ │ └──────────┘ │ MergeSpeakers │ │ ││
│ │ ▲ │ (embedding clustering)│ │ ││
│ │ │ └───────────┬───────────┘ │ ││
│ │ │ ▼ │ ││
│ │ │ ┌───────────────────────┐ │ ││
│ │ │ │ Lambda │ │ ││
│ │ │ │ SplitBySpeaker │ │ ││
│ │ │ │ (ffmpeg) │ │ ││
│ │ │ └───────────┬───────────┘ │ ││
│ │ │ ▼ │ ││
│ │ │ ┌───────────────────────┐ │ ││
│ │ │ │ Lambda │ │ ││
│ │ │ │ (Map State) │ │ ││
│ │ │ │ Transcribe x10 │ │ ││
│ │ │ │ faster-whisper │ │ ││
│ │ │ └───────────┬───────────┘ │ ││
│ │ │ ▼ │ ││
│ │ │ ┌───────────────────────┐ │ ││
│ │ │ │ Lambda │ │ ││
│ │ ├─────────│ AggregateResults │ │ ││
│ │ │ └───────────┬───────────┘ │ ││
│ │ │ ▼ │ ││
│ │ │ ┌───────────────────────┐ ┌─────────────────┐ ││
│ │ │ │ Lambda │ │Secrets Manager │ ││
│ │ └─────────│ LLMAnalysis │◀──│ OpenAI API Key │ ││
│ │ │ gpt-4o-mini │ └─────────────────┘ ││
│ │ └───────────┬───────────┘ ││
│ │ │ ││
│ └─────────────────────────────┼──────────────────────────────────────────────┘│
│ ▼ │
└─────────────────────────────────────────────────────────────────────────────────┘
Data Flow Details
[Video Upload]
│
▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ S3: ek-transcript-input-{env} │
│ Key: uploads/{interview_id}/video.mp4 │
│ Metadata: x-amz-meta-interview-id, x-amz-meta-original-filename │
└──────────────────────────────────────────────────────────────────────────────┘
│
│ EventBridge (Object Created)
▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ Lambda: StartPipeline │
│ - Create interview record in DynamoDB │
│ - Start Step Functions execution │
└──────────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ Step Functions: ek-transcript-pipeline-{env} │
│ │
│ 1. ExtractAudio: video.mp4 → audio.wav (16kHz mono) │
│ 2. ChunkAudio: audio.wav → chunk_0.wav, chunk_1.wav, ... (8min+30s overlap) │
│ 3. DiarizeChunks (Map x5): pyannote speaker diarization per chunk │
│ 4. MergeSpeakers: Global speaker unification via embedding vectors │
│ 5. SplitBySpeaker: Split audio by speaker segments │
│ 6. TranscribeSegments (Map x10): faster-whisper transcription │
│ 7. AggregateResults: Merge results → transcript.json │
│ 8. LLMAnalysis: gpt-4o-mini structured analysis → analysis.json │
└──────────────────────────────────────────────────────────────────────────────┘
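StartPipeline itself is small: it writes the interview record and kicks off the state machine. Below is a minimal sketch of such a handler; the environment variable names, DynamoDB item fields, and key layout are assumptions for illustration, not the repository's exact code.

import json
import os
import uuid

import boto3

dynamodb = boto3.resource("dynamodb")
sfn = boto3.client("stepfunctions")

def handler(event, context):
    # EventBridge "Object Created" events carry the bucket and key
    bucket = event["detail"]["bucket"]["name"]
    key = event["detail"]["object"]["key"]
    interview_id = key.split("/")[1]  # uploads/{interview_id}/video.mp4

    # 1. Create the interview record in DynamoDB
    table = dynamodb.Table(os.environ["INTERVIEWS_TABLE"])
    table.put_item(Item={"interview_id": interview_id, "status": "PROCESSING"})

    # 2. Start the Step Functions execution
    sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],
        name=f"{interview_id}-{uuid.uuid4().hex[:8]}",
        input=json.dumps({"bucket": bucket, "key": key, "interview_id": interview_id}),
    )
    return {"interview_id": interview_id}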
Why Serverless: Minimizing Monthly Fixed Costs
The key feature of this pipeline is minimizing monthly fixed costs.
| Component | Pricing Model | Monthly Fixed Cost |
|---|---|---|
| Lambda | Per-execution | $0 |
| Step Functions | Per-transition | $0 |
| S3 | Storage + requests | $0+ |
| DynamoDB | On-demand | $0 |
| EventBridge | Per-event | $0 |
| Cognito | Free up to 50K MAU | $0 |
| AppSync | Per-request | $0+ |
| Secrets Manager | Per-secret | $0.80/mo (2 secrets) |
| ECR | Image storage | $0.50–$1.00/mo |
| CloudWatch Logs | Log storage | ~$0.05/mo |
Actual Monthly Fixed Costs
Secrets Manager: $0.80/mo (OpenAI + HuggingFace, 2 secrets)
ECR: $0.50–$1.00/mo (Docker images with ML models, 8 images)
CloudWatch Logs: ~$0.05/mo (Step Functions + Lambda logs)
───────────────────────────────────────────
Total: ~$1.50–$2.00/mo
Only ~$2 even in months with zero usage. Compared to commercial SaaS at $50–$200/month, that's $576–$2,376 in annual savings.
Pricing References (Dec 2025):
• Secrets Manager: $0.40/secret/month
• ECR: $0.10/GB/month (pyannote+whisper models = 5-10GB)
• CloudWatch Logs: $0.50/GB ingestion + $0.03/GB/month storage
Alternative: SSM Parameter Store SecureString is free but has a 4KB limit and no auto-rotation.
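For reference, reading a SecureString from Parameter Store with boto3 is a one-liner; the parameter name below is a placeholder, not one the project actually uses.

import boto3

ssm = boto3.client("ssm")
resp = ssm.get_parameter(Name="/ek-transcript/hf-token", WithDecryption=True)
hf_token = resp["Parameter"]["Value"]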
Design Evolution: From Initial to Current Design
Initial Design: Simple Sequential Processing
[Video] → ExtractAudio → Diarize → SplitBySpeaker → Transcribe → LLMAnalysis
│
(single Lambda processing all audio)
Problems:
- Lambda's 15-minute timeout couldn't complete speaker diarization for 8-hour audio
- pyannote.audio memory usage was massive (10GB+)
- Sequential processing made total processing time too long
Alternative: ECS Fargate Processing
[Video] → ECS Fargate (GPU) → ...
Evaluation:
- GPU instance (g4dn.xlarge) cost was high ($0.526/hour)
- 8-hour video would cost $4+
- Spot instances had reliability concerns
Current Design: Parallel Chunk Processing
┌─ DiarizeChunk_0 ─┐
[Video] → Chunk → ├─ DiarizeChunk_1 ─┤ → Merge → Split → Transcribe(parallel) → LLM
├─ DiarizeChunk_2 ─┤
└─ ... ─┘
Design Points:
- 8-minute chunks + 30-second overlap: Fits within Lambda's 15-minute limit, captures speaker changes at boundaries
- Speaker unification via embedding vectors: Even if SPEAKER_00 is different people across chunks, cosine similarity clustering unifies them
- Map State parallel execution: Speaker diarization x5, transcription x10 parallel processing for speed
Technology Choices and Rationale
| Technology | Reason | vs. Alternatives |
|---|---|---|
| pyannote.audio 3.1 | Latest speaker diarization accuracy, Hugging Face integration | Higher accuracy than AWS Transcribe speaker diarization |
| faster-whisper | 4-8x faster than Whisper, int8 quantization support | OpenAI Whisper API is more expensive |
| gpt-4o-mini | Structured Outputs support, low cost | Claude lacked Structured Outputs (at the time) |
| Lambda + Container | Up to 10GB image, cold start acceptable | ECS Fargate has always-on cost concerns |
| Step Functions | Complex workflow management, error handling | SQS + Lambda makes state management complex |
Component Implementation Details
1. ExtractAudio Lambda
Extracts 16kHz mono WAV from video — Whisper's recommended sample rate.
import subprocess

def extract_audio(input_path: str, output_path: str) -> None:
    """Extract 16kHz mono WAV from video"""
    cmd = [
        "ffmpeg", "-i", input_path,
        "-vn",                   # No video
        "-acodec", "pcm_s16le",  # 16-bit PCM
        "-ar", "16000",          # 16kHz
        "-ac", "1",              # Mono
        "-y", output_path,
    ]
    subprocess.run(cmd, check=True)
2. ChunkAudio Lambda
Splits into 8-minute chunks + 30-second overlap. The overlap captures speaker changes at boundaries accurately.
CHUNK_DURATION = 480    # 8 minutes
OVERLAP_DURATION = 30   # 30-second overlap

chunk_0:   0–510s  (effective:   0–480)
chunk_1: 450–960s  (effective: 480–960)
chunk_2: 900–1410s (effective: 960–1440)
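One way to produce exactly these boundaries from the total duration (obtained with ffprobe, for example); this is an illustrative sketch, not necessarily the repository's implementation:

CHUNK_DURATION = 480   # 8 minutes
OVERLAP_DURATION = 30  # 30-second overlap

def compute_chunks(total_duration: float) -> list[tuple[float, float]]:
    """Each chunk is CHUNK_DURATION + OVERLAP_DURATION long; consecutive
    chunks advance by CHUNK_DURATION - OVERLAP_DURATION so they overlap
    around the boundaries."""
    chunks = []
    start = 0.0
    while start < total_duration:
        end = min(start + CHUNK_DURATION + OVERLAP_DURATION, total_duration)
        chunks.append((start, end))
        start += CHUNK_DURATION - OVERLAP_DURATION
    return chunks

# compute_chunks(1500) -> [(0, 510), (450, 960), (900, 1410), (1350, 1500)]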
3. DiarizeChunk Lambda (Parallel Execution)
Speaker diarization with pyannote.audio 3.1. Extracts embedding vectors for each speaker and saves to S3.
pyannote.audio License Note:
pyannote/speaker-diarization-3.1 requires license agreement on Hugging Face. Visit the model page to agree to the license terms and obtain your HF_TOKEN on first use. Verify licensing requirements for commercial use.
import torch
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    token=hf_token,
)

# Use GPU if available
if torch.cuda.is_available():
    pipeline.to(torch.device("cuda"))

# Run speaker diarization
diarization = pipeline({"waveform": audio_tensor, "sample_rate": sample_rate})

# Extract embedding vectors (used for speaker unification later)
speaker_embeddings = extract_speaker_embeddings(audio_path, segments)
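extract_speaker_embeddings isn't shown above. One way to implement it is with pyannote's pretrained embedding model; the sketch below assumes that approach and averages one vector per local speaker label, so the repository's actual implementation may differ.

import numpy as np
from pyannote.audio import Inference, Model
from pyannote.core import Segment

embedding_model = Model.from_pretrained("pyannote/embedding", token=hf_token)
inference = Inference(embedding_model, window="whole")

def extract_speaker_embeddings(audio_path, segments):
    """Average one embedding per local speaker label over its segments."""
    per_speaker = {}
    for seg in segments:
        # Very short segments can produce unreliable embeddings; skip them
        if seg["end"] - seg["start"] < 1.0:
            continue
        emb = inference.crop(audio_path, Segment(seg["start"], seg["end"]))
        per_speaker.setdefault(seg["speaker"], []).append(emb)
    return {spk: np.mean(embs, axis=0) for spk, embs in per_speaker.items()}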
4. MergeSpeakers Lambda
Clusters speakers across chunks using cosine similarity of embedding vectors.
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity

# Cosine similarity clustering
similarity_matrix = cosine_similarity(all_embeddings)
distance_matrix = 1 - similarity_matrix

clustering = AgglomerativeClustering(
n_clusters=None,
distance_threshold=1 - 0.75, # 75%+ similarity = same speaker
metric="precomputed",
linkage="average",
)
labels = clustering.fit_predict(distance_matrix)
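The cluster labels then become the global speaker IDs. Roughly, assuming the embeddings were collected as (chunk_index, local_label) pairs (the variable names here are illustrative):

# speaker_keys[i] records which (chunk_index, local_label) produced all_embeddings[i]
global_speaker_map = {
    key: f"SPEAKER_{label:02d}" for key, label in zip(speaker_keys, labels)
}

# Relabel every per-chunk segment with its unified global speaker
for seg in all_segments:
    seg["speaker"] = global_speaker_map[(seg["chunk_index"], seg["speaker"])]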
5. Transcribe Lambda (Parallel Execution)
High-speed transcription with faster-whisper (medium model).
from faster_whisper import WhisperModel
model = WhisperModel("medium", device="cpu", compute_type="int8")
segments, info = model.transcribe(audio_path, language="ja", beam_size=5)
text = "".join([seg.text for seg in segments])
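Note that segments is a lazy generator, so if per-utterance timestamps are also wanted for transcript.json, materialize it once and reuse it:

segment_list = list(segments)
text = "".join(seg.text for seg in segment_list)
utterances = [
    {"start": round(seg.start, 2), "end": round(seg.end, 2), "text": seg.text}
    for seg in segment_list
]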
6. LLMAnalysis Lambda
Structured analysis using gpt-4o-mini's Structured Outputs.
from openai import OpenAI

client = OpenAI(api_key=openai_api_key)  # key loaded from Secrets Manager

# Scoring with Structured Outputs
completion = client.beta.chat.completions.parse(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": ANALYSIS_PROMPT},
{"role": "user", "content": f"Analyze this:\n{transcript}"},
],
response_format=AnalysisResult,
)
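AnalysisResult is a Pydantic model that defines the output schema; the fields below are illustrative assumptions, not the repository's actual schema.

from pydantic import BaseModel

class SpeakerSummary(BaseModel):
    speaker: str
    key_points: list[str]

class AnalysisResult(BaseModel):
    summary: str
    speakers: list[SpeakerSummary]
    action_items: list[str]

# The parsed result comes back as a typed object
analysis: AnalysisResult = completion.choices[0].message.parsed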
Cost Breakdown (8-Hour Video Example, Dec 2025)
Assumptions:
- Region: us-east-1
- Lambda: x86_64 (arm64 is ~20% cheaper)
- No free tier, no retries
- Map parallelism: Diarize×5, Transcribe×10
Actual cost breakdown for processing an 8-hour video (~900 segments):
| Service | Calculation | Cost |
|---|---|---|
| Lambda (Diarize) | 10GB × 600s × 6 chunks = 36,000 GB-s | $0.60 |
| Lambda (Transcribe) | 2.94GB × 30s × 900 calls = 79,380 GB-s | $1.32 |
| Lambda (Other) | ExtractAudio, Chunk, Merge, Split, Aggregate, LLM | $0.10 |
| Step Functions | ~6,000 transitions × $0.025/1K | $0.15 |
| S3 | Read/write + temp storage | $0.02 |
| OpenAI API | gpt-4o-mini (300K input + 8K output tokens) | $0.10 |
| Total | | ~$2.3 |
Lambda Pricing Basis (x86_64, us-east-1):
• $0.0000166667/GB-second
• Diarize: 36,000 × $0.0000166667 = $0.60
• Transcribe: 79,380 × $0.0000166667 = $1.32
By fully leveraging pyannote.audio and faster-whisper, this is about 5x more cost-efficient than AWS Transcribe ($0.024/min = $11.52 for 8 hours). Using arm64 drops it to ~$1.9, making it about 6x more efficient.
Implementation Pitfalls and Solutions
1. States.DataLimitExceeded (256KB Limit)
Symptom:
When processing 900+ segments, the Step Functions Map state throws this error:
States.DataLimitExceeded - The state/task returned a result with a size
exceeding the maximum number of bytes service limit.
Cause:
Step Functions has a 256KB payload limit, and accumulating all Map state results exceeds it.
Solution:
// CDK: Discard Map state results
const transcribeSegments = new sfn.Map(this, "TranscribeSegments", {
itemsPath: "$.segment_files",
maxConcurrency: 10,
resultPath: sfn.JsonPath.DISCARD, // ← This is key
});
# Lambda side: Save results to S3
result_key = f"transcribe_results/{segment_name}.json"
s3.put_object(
    Bucket=bucket,
    Key=result_key,
    Body=json.dumps(result_data, ensure_ascii=False),
)

# Return only metadata to Step Functions
return {"bucket": bucket, "result_key": result_key}
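On the AggregateResults side, the per-segment files are then read back from S3 rather than flowing through Step Functions state. A minimal sketch, where the prefix and field names are assumptions:

import json

import boto3

s3 = boto3.client("s3")

def load_transcribe_results(bucket, prefix="transcribe_results/"):
    """Collect every per-segment result written by the Transcribe Lambdas."""
    results = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            results.append(json.loads(body))
    # Sort by segment start time before merging into transcript.json
    return sorted(results, key=lambda r: r.get("start", 0))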
2. PyTorch 2.6+ torch.load Issue
Symptom:
pyannote.audio model loading throws this error:
FutureWarning: You are using `torch.load` with `weights_only=False`
PyTorch 2.6 changed the default to weights_only=True, breaking some model loading.
Solution: Monkey-patch torch.load
import torch

# Disable the PyTorch 2.6+ weights_only=True default.
# pyannote's HuggingFace checkpoints are from trusted sources, so this is safe.
_orig_torch_load = torch.load

def _torch_load_legacy(*args, **kwargs):
    """Always call torch.load with weights_only=False"""
    kwargs["weights_only"] = False
    return _orig_torch_load(*args, **kwargs)

torch.load = _torch_load_legacy  # Apply BEFORE pyannote import
Important: This patch must be applied before from pyannote.audio import Pipeline.
Monkey-patch Risk:
This modifies PyTorch internals and may break in future versions. If possible, wait for pyannote.audio to support safetensors or check for an official workaround. In production, add a unit test that verifies the patch is actually applied before pyannote is imported.
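Such a test can be as simple as checking that importing the handler module swaps torch.load out (the module name here is hypothetical):

import torch

def test_torch_load_patch_applied():
    import diarize_handler  # noqa: F401  (hypothetical handler module that applies the patch)
    assert torch.load.__name__ == "_torch_load_legacy"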
3. Lambda Container Model Download Strategy
Problem:
- Hugging Face models (pyannote, whisper) are several GB
- Lambda's /tmp is 512MB–10GB (configurable)
- Downloading at cold start causes Lambda timeouts
Solution: Include models at build time
# Dockerfile
FROM public.ecr.aws/lambda/python:3.11

# Download Hugging Face models at build time
ENV HF_HOME=/var/task/models
RUN pip install huggingface_hub

# Important: pass HF_TOKEN only at build time; never include it in the final image
ARG HF_TOKEN
RUN python -c "from huggingface_hub import snapshot_download; \
    snapshot_download('pyannote/speaker-diarization-3.1', token='${HF_TOKEN}')"

# Safer: a BuildKit secret mount keeps the token out of the image history entirely
RUN --mount=type=secret,id=hf_token \
    HF_TOKEN=$(cat /run/secrets/hf_token) python download_models.py
4. Secure HF_TOKEN Management
Problem:
- pyannote's Hugging Face model requires authentication
- Putting it directly in Lambda environment variables is a security risk
Solution: AWS Secrets Manager + Build-time Download
import json

import boto3

# Get from Secrets Manager at Lambda runtime
# (the models are baked in at build time, so this is only needed if a token
#  is required at runtime)
# HF_SECRET_ARN is the secret's ARN, e.g. passed in via a Lambda environment variable
secrets_client = boto3.client("secretsmanager")
secret = secrets_client.get_secret_value(SecretId=HF_SECRET_ARN)
hf_token = json.loads(secret["SecretString"])["token"]
5. Choosing 8-Minute Chunk Length
Trial and Error:
| Chunk Length | Result |
|---|---|
| 5 minutes | Speaker diarization accuracy dropped (context too short) |
| 10 minutes | Memory became the constraint (barely fits in Lambda's 10GB maximum) |
| 15 minutes | Exceeded Lambda's 15-minute timeout |
| 8 minutes | Optimal balance of accuracy, memory, and time |
Why 30-second overlap:
- Speaker changes typically have 2-3 second gaps
- 30 seconds reliably captures speaker changes at boundaries
- Longer overlap increases redundant processing and costs
6. Lambda vs ECS Decision Criteria
Why Lambda was chosen:
Processing time < 15min AND Memory < 10GB → Lambda
Processing time > 15min OR GPU required → ECS Fargate
pyannote.audio runs on CPU, and 8-minute chunks fit within Lambda's constraints.
Future Plans: Google Meet Auto-Integration
Planning to use the Auto-Recording feature from Google Meet REST API (added April 2025) for automatic recording and analysis.
Google Calendar (meeting schedule)
│
▼ Cloud Functions (Calendar Webhook)
Google Meet Space (Auto-Recording enabled)
│
▼ Recording complete
Google Drive (recording storage)
│
▼ Workspace Events API + Pub/Sub
EventBridge (Cross-Cloud)
│
▼
Lambda (DownloadRecording)
│
▼
S3 → Step Functions (existing pipeline)
│
▼
DynamoDB + AppSync → Dashboard
Design document: docs/google-meet-integration/
Summary
- ✓ ~$2/month fixed cost (Secrets Manager + ECR + logs) for a speaker diarization and transcription pipeline
- ✓ AWS Step Functions + Lambda fully serverless — pay only for what you use
- ✓ pyannote.audio + faster-whisper + gpt-4o-mini for high quality at low cost
- ✓ 8-hour video for ~$2.3 (~5x more cost-efficient than AWS Transcribe)
- ✓ Parallel chunk processing + embedding clustering handles long audio
- ✓ 256KB payload limit solved with resultPath: DISCARD + S3 passthrough
All code is available on GitHub.
By the way, isn't Secrets Manager at $0.40/secret/month kind of expensive? Storing just 2 API keys costs $9.60/year. SSM Parameter Store SecureString is free... But I guess it's the price for enterprise features like auto-rotation and audit logs. I've accepted it.