Building an AWS Serverless Pipeline for Speaker Diarization and Transcription at ~$1 per Video
TL;DR
- ~$2/month fixed cost (Secrets Manager + ECR + logs) for a speaker diarization and transcription pipeline
- AWS Step Functions + Lambda fully serverless architecture
- pyannote.audio 3.1 for speaker diarization, faster-whisper for transcription, gpt-4o-mini for LLM analysis
- 8-hour video processed for ~$2.3 (x86, no free tier), about 5x more cost-efficient than AWS Transcribe
- Deep dive into pitfalls like States.DataLimitExceeded and their solutions
Repository: github.com/ekusiadadus/ek-transcript
Introduction
I've been analyzing more and more user interview recordings lately. When evaluating existing solutions:
- ✗ AWS Transcribe: ~$11.52 for 8 hours ($0.024/min), speaker diarization accuracy not great
- ✗ Commercial SaaS: $50–$200/month fixed fees, charged even in months with no usage
- ✗ Always-on GPU server: EC2 g4dn.xlarge costs $380+/month — too expensive for personal use
The biggest problem was fixed monthly costs. I only use this a few times a month, yet I'd be paying every month. I wanted pure pay-per-use pricing with near-zero monthly fixed costs — that was my top priority.
So I decided to build my own pipeline using AWS serverless services.
Requirements
- Zero monthly fixed cost (pay only for what you use)
- Support for long videos up to 8 hours
- Speaker diarization (who said what)
- High-accuracy Japanese transcription
- LLM-powered summarization and analysis
- Low cost (~$1 per video)
- Fully serverless
System Architecture
┌─────────────────────────────────────────────────────────────────────────────────┐
│ AWS Cloud │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌─────────────────┐ │
│ │ Amazon S3 │ │ EventBridge │ │
│ │ (Input) │────▶│ Rule │ │
│ │ uploads/ │ │ (Object Created)│ │
│ └──────────────┘ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────┐ ┌─────────────────┐ │
│ │ Lambda │ │ DynamoDB │ │
│ │ StartPipeline │─────▶│ InterviewsTable │ │
│ └────────┬───────┘ └─────────────────┘ │
│ │ ▲ │
│ ▼ │ │
│ ┌─────────────────────────────────────────────────────┼───────────────────────┐│
│ │ AWS Step Functions │ ││
│ │ │ ││
│ │ ┌─────────────┐ ┌─────────────┐ ┌────────────┐│ ││
│ │ │ Lambda │ │ Lambda │ │ Lambda ││ ││
│ │ │ExtractAudio │──▶│ ChunkAudio │──▶│(Map State) ││ ││
│ │ │ (ffmpeg) │ │ (8min+30s) │ │DiarizeChunk││ ││
│ │ └─────────────┘ └─────────────┘ │ x5 parallel││ ││
│ │ │ │ pyannote ││ ││
│ │ │ └─────┬──────┘│ ││
│ │ ▼ │ │ ││
│ │ ┌──────────┐ ▼ │ ││
│ │ │Amazon S3 │◀───────────────────┬─────────────────┤ ││
│ │ │(Output) │ │ │ ││
│ │ │processed/│ ┌────────────────┴──────┐ │ ││
│ │ │analysis/ │ │ Lambda │ │ ││
│ │ └──────────┘ │ MergeSpeakers │ │ ││
│ │ ▲ │ (embedding clustering)│ │ ││
│ │ │ └───────────┬───────────┘ │ ││
│ │ │ ▼ │ ││
│ │ │ ┌───────────────────────┐ │ ││
│ │ │ │ Lambda │ │ ││
│ │ │ │ SplitBySpeaker │ │ ││
│ │ │ │ (ffmpeg) │ │ ││
│ │ │ └───────────┬───────────┘ │ ││
│ │ │ ▼ │ ││
│ │ │ ┌───────────────────────┐ │ ││
│ │ │ │ Lambda │ │ ││
│ │ │ │ (Map State) │ │ ││
│ │ │ │ Transcribe x10 │ │ ││
│ │ │ │ faster-whisper │ │ ││
│ │ │ └───────────┬───────────┘ │ ││
│ │ │ ▼ │ ││
│ │ │ ┌───────────────────────┐ │ ││
│ │ │ │ Lambda │ │ ││
│ │ ├─────────│ AggregateResults │ │ ││
│ │ │ └───────────┬───────────┘ │ ││
│ │ │ ▼ │ ││
│ │ │ ┌───────────────────────┐ ┌─────────────────┐ ││
│ │ │ │ Lambda │ │Secrets Manager │ ││
│ │ └─────────│ LLMAnalysis │◀──│ OpenAI API Key │ ││
│ │ │ gpt-4o-mini │ └─────────────────┘ ││
│ │ └───────────┬───────────┘ ││
│ │ │ ││
│ └─────────────────────────────┼──────────────────────────────────────────────┘│
│ ▼ │
└─────────────────────────────────────────────────────────────────────────────────┘
Data Flow Details
[Video Upload]
│
▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ S3: ek-transcript-input-{env} │
│ Key: uploads/{interview_id}/video.mp4 │
│ Metadata: x-amz-meta-interview-id, x-amz-meta-original-filename │
└──────────────────────────────────────────────────────────────────────────────┘
│
│ EventBridge (Object Created)
▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ Lambda: StartPipeline │
│ - Create interview record in DynamoDB │
│ - Start Step Functions execution │
└──────────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ Step Functions: ek-transcript-pipeline-{env} │
│ │
│ 1. ExtractAudio: video.mp4 → audio.wav (16kHz mono) │
│ 2. ChunkAudio: audio.wav → chunk_0.wav, chunk_1.wav, ... (8min+30s overlap) │
│ 3. DiarizeChunks (Map x5): pyannote speaker diarization per chunk │
│ 4. MergeSpeakers: Global speaker unification via embedding vectors │
│ 5. SplitBySpeaker: Split audio by speaker segments │
│ 6. TranscribeSegments (Map x10): faster-whisper transcription │
│ 7. AggregateResults: Merge results → transcript.json │
│ 8. LLMAnalysis: gpt-4o-mini structured analysis → analysis.json │
└──────────────────────────────────────────────────────────────────────────────┘
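StartPipeline itself is small: it writes the interview record and kicks off the state machine. Below is a minimal sketch of such a handler; the environment variable names, DynamoDB item fields, and key layout are assumptions for illustration, not the repository's exact code.

import json
import os
import uuid

import boto3

dynamodb = boto3.resource("dynamodb")
sfn = boto3.client("stepfunctions")

def handler(event, context):
    # EventBridge "Object Created" events carry the bucket and key
    bucket = event["detail"]["bucket"]["name"]
    key = event["detail"]["object"]["key"]
    interview_id = key.split("/")[1]  # uploads/{interview_id}/video.mp4

    # 1. Create the interview record in DynamoDB
    table = dynamodb.Table(os.environ["INTERVIEWS_TABLE"])
    table.put_item(Item={"interview_id": interview_id, "status": "PROCESSING"})

    # 2. Start the Step Functions execution
    sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],
        name=f"{interview_id}-{uuid.uuid4().hex[:8]}",
        input=json.dumps({"bucket": bucket, "key": key, "interview_id": interview_id}),
    )
    return {"interview_id": interview_id}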
Why Serverless: Minimizing Monthly Fixed Costs
The key feature of this pipeline is minimizing monthly fixed costs.
| Component | Pricing Model | Monthly Fixed Cost |
|---|---|---|
| Lambda | Per-execution | $0 |
| Step Functions | Per-transition | $0 |
| S3 | Storage + requests | $0+ |
| DynamoDB | On-demand | $0 |
| EventBridge | Per-event | $0 |
| Cognito | Free up to 50K MAU | $0 |
| AppSync | Per-request | $0+ |
| Secrets Manager | Per-secret | $0.80/mo (2 secrets) |
| ECR | Image storage | $0.50–$1.00/mo |
| CloudWatch Logs | Log storage | ~$0.05/mo |
Actual Monthly Fixed Costs
Secrets Manager: $0.80/mo (OpenAI + HuggingFace, 2 secrets)
ECR: $0.50–$1.00/mo (Docker images with ML models, 8 images)
CloudWatch Logs: ~$0.05/mo (Step Functions + Lambda logs)
───────────────────────────────────────────
Total: ~$1.50–$2.00/mo
Only ~$2 even in months with zero usage. Compared to commercial SaaS at $50–$200/month, that's $576–$2,376 in annual savings.
Pricing References (Dec 2025):
• Secrets Manager: $0.40/secret/month
• ECR: $0.10/GB/month (pyannote+whisper models = 5-10GB)
• CloudWatch Logs: $0.50/GB ingestion + $0.03/GB/month storage
Alternative: SSM Parameter Store SecureString is free but has a 4KB limit and no auto-rotation.
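For reference, reading a SecureString from Parameter Store with boto3 is a one-liner; the parameter name below is a placeholder, not one the project actually uses.

import boto3

ssm = boto3.client("ssm")
resp = ssm.get_parameter(Name="/ek-transcript/hf-token", WithDecryption=True)
hf_token = resp["Parameter"]["Value"]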
Design Evolution: From Initial to Current Design
Initial Design: Simple Sequential Processing
[Video] → ExtractAudio → Diarize → SplitBySpeaker → Transcribe → LLMAnalysis
│
(single Lambda processing all audio)
Problems:
- Lambda's 15-minute timeout couldn't complete speaker diarization for 8-hour audio
- pyannote.audio memory usage was massive (10GB+)
- Sequential processing made total processing time too long
Alternative: ECS Fargate Processing
[Video] → ECS Fargate (GPU) → ...
Evaluation:
- GPU instance (g4dn.xlarge) cost was high ($0.526/hour)
- 8-hour video would cost $4+
- Spot instances had reliability concerns
Current Design: Parallel Chunk Processing
┌─ DiarizeChunk_0 ─┐
[Video] → Chunk → ├─ DiarizeChunk_1 ─┤ → Merge → Split → Transcribe(parallel) → LLM
├─ DiarizeChunk_2 ─┤
└─ ... ─┘
Design Points:
- 8-minute chunks + 30-second overlap: Fits within Lambda's 15-minute limit, captures speaker changes at boundaries
- Speaker unification via embedding vectors: Even if SPEAKER_00 is different people across chunks, cosine similarity clustering unifies them
- Map State parallel execution: Speaker diarization x5, transcription x10 parallel processing for speed
Technology Choices and Rationale
| Technology | Reason | vs. Alternatives |
|---|---|---|
| pyannote.audio 3.1 | Latest speaker diarization accuracy, Hugging Face integration | Higher accuracy than AWS Transcribe speaker diarization |
| faster-whisper | 4-8x faster than Whisper, int8 quantization support | OpenAI Whisper API is more expensive |
| gpt-4o-mini | Structured Outputs support, low cost | Claude lacked Structured Outputs (at the time) |
| Lambda + Container | Up to 10GB image, cold start acceptable | ECS Fargate has always-on cost concerns |
| Step Functions | Complex workflow management, error handling | SQS + Lambda makes state management complex |
Component Implementation Details
1. ExtractAudio Lambda
Extracts 16kHz mono WAV from video — Whisper's recommended sample rate.
import subprocess

def extract_audio(input_path: str, output_path: str) -> None:
    """Extract 16kHz mono WAV from video"""
    cmd = [
        "ffmpeg", "-i", input_path,
        "-vn",                   # No video
        "-acodec", "pcm_s16le",  # 16-bit PCM
        "-ar", "16000",          # 16kHz
        "-ac", "1",              # Mono
        "-y", output_path,
    ]
    subprocess.run(cmd, check=True)
2. ChunkAudio Lambda
Splits into 8-minute chunks + 30-second overlap. The overlap captures speaker changes at boundaries accurately.
CHUNK_DURATION = 480    # 8 minutes
OVERLAP_DURATION = 30   # 30-second overlap

chunk_0:   0–510s  (effective:   0–480)
chunk_1: 450–960s  (effective: 480–960)
chunk_2: 900–1410s (effective: 960–1440)
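One way to produce exactly these boundaries from the total duration (obtained with ffprobe, for example); this is an illustrative sketch, not necessarily the repository's implementation:

CHUNK_DURATION = 480   # 8 minutes
OVERLAP_DURATION = 30  # 30-second overlap

def compute_chunks(total_duration: float) -> list[tuple[float, float]]:
    """Each chunk is CHUNK_DURATION + OVERLAP_DURATION long; consecutive
    chunks advance by CHUNK_DURATION - OVERLAP_DURATION so they overlap
    around the boundaries."""
    chunks = []
    start = 0.0
    while start < total_duration:
        end = min(start + CHUNK_DURATION + OVERLAP_DURATION, total_duration)
        chunks.append((start, end))
        start += CHUNK_DURATION - OVERLAP_DURATION
    return chunks

# compute_chunks(1500) -> [(0, 510), (450, 960), (900, 1410), (1350, 1500)]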
3. DiarizeChunk Lambda (Parallel Execution)
Speaker diarization with pyannote.audio 3.1. Extracts embedding vectors for each speaker and saves to S3.
pyannote.audio License Note:
pyannote/speaker-diarization-3.1 requires license agreement on Hugging Face. Visit the model page to agree to the license terms and obtain your HF_TOKEN on first use. Verify licensing requirements for commercial use.
import torch
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    token=hf_token,
)

# Use GPU if available
if torch.cuda.is_available():
    pipeline.to(torch.device("cuda"))

# Run speaker diarization
diarization = pipeline({"waveform": audio_tensor, "sample_rate": sample_rate})

# Extract embedding vectors (used for speaker unification later)
speaker_embeddings = extract_speaker_embeddings(audio_path, segments)
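extract_speaker_embeddings isn't shown above. One way to implement it is with pyannote's pretrained embedding model; the sketch below assumes that approach and averages one vector per local speaker label, so the repository's actual implementation may differ.

import numpy as np
from pyannote.audio import Inference, Model
from pyannote.core import Segment

embedding_model = Model.from_pretrained("pyannote/embedding", token=hf_token)
inference = Inference(embedding_model, window="whole")

def extract_speaker_embeddings(audio_path, segments):
    """Average one embedding per local speaker label over its segments."""
    per_speaker = {}
    for seg in segments:
        # Very short segments can produce unreliable embeddings; skip them
        if seg["end"] - seg["start"] < 1.0:
            continue
        emb = inference.crop(audio_path, Segment(seg["start"], seg["end"]))
        per_speaker.setdefault(seg["speaker"], []).append(emb)
    return {spk: np.mean(embs, axis=0) for spk, embs in per_speaker.items()}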
4. MergeSpeakers Lambda
Clusters speakers across chunks using cosine similarity of embedding vectors.
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity

# Cosine similarity clustering
similarity_matrix = cosine_similarity(all_embeddings)
distance_matrix = 1 - similarity_matrix

clustering = AgglomerativeClustering(
n_clusters=None,
distance_threshold=1 - 0.75, # 75%+ similarity = same speaker
metric="precomputed",
linkage="average",
)
labels = clustering.fit_predict(distance_matrix)
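The cluster labels then become the global speaker IDs. Roughly, assuming the embeddings were collected as (chunk_index, local_label) pairs (the variable names here are illustrative):

# speaker_keys[i] records which (chunk_index, local_label) produced all_embeddings[i]
global_speaker_map = {
    key: f"SPEAKER_{label:02d}" for key, label in zip(speaker_keys, labels)
}

# Relabel every per-chunk segment with its unified global speaker
for seg in all_segments:
    seg["speaker"] = global_speaker_map[(seg["chunk_index"], seg["speaker"])]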
5. Transcribe Lambda (Parallel Execution)
High-speed transcription with faster-whisper (medium model).
from faster_whisper import WhisperModel
model = WhisperModel("medium", device="cpu", compute_type="int8")
segments, info = model.transcribe(audio_path, language="ja", beam_size=5)
text = "".join([seg.text for seg in segments])
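Note that segments is a lazy generator, so if per-utterance timestamps are also wanted for transcript.json, materialize it once and reuse it:

segment_list = list(segments)
text = "".join(seg.text for seg in segment_list)
utterances = [
    {"start": round(seg.start, 2), "end": round(seg.end, 2), "text": seg.text}
    for seg in segment_list
]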
6. LLMAnalysis Lambda
Structured analysis using gpt-4o-mini's Structured Outputs.
from openai import OpenAI

client = OpenAI(api_key=openai_api_key)  # key loaded from Secrets Manager

# Scoring with Structured Outputs
completion = client.beta.chat.completions.parse(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": ANALYSIS_PROMPT},
{"role": "user", "content": f"Analyze this:\n{transcript}"},
],
response_format=AnalysisResult,
)
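AnalysisResult is a Pydantic model that defines the output schema; the fields below are illustrative assumptions, not the repository's actual schema.

from pydantic import BaseModel

class SpeakerSummary(BaseModel):
    speaker: str
    key_points: list[str]

class AnalysisResult(BaseModel):
    summary: str
    speakers: list[SpeakerSummary]
    action_items: list[str]

# The parsed result comes back as a typed object
analysis: AnalysisResult = completion.choices[0].message.parsed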
Cost Breakdown (8-Hour Video Example, Dec 2025)
Assumptions:
- Region: us-east-1
- Lambda: x86_64 (arm64 is ~20% cheaper)
- No free tier, no retries
- Map parallelism: Diarize×5, Transcribe×10
Actual cost breakdown for processing an 8-hour video (~900 segments):
| Service | Calculation | Cost |
|---|---|---|
| Lambda (Diarize) | 10GB × 600s × 6 chunks = 36,000 GB-s | $0.60 |
| Lambda (Transcribe) | 2.94GB × 30s × 900 calls = 79,380 GB-s | $1.32 |
| Lambda (Other) | ExtractAudio, Chunk, Merge, Split, Aggregate, LLM | $0.10 |
| Step Functions | ~6,000 transitions × $0.025/1K | $0.15 |
| S3 | Read/write + temp storage | $0.02 |
| OpenAI API | gpt-4o-mini (300K input + 8K output tokens) | $0.10 |
| Total | | ~$2.3 |
Lambda Pricing Basis (x86_64, us-east-1):
• $0.0000166667/GB-second
• Diarize: 36,000 × $0.0000166667 = $0.60
• Transcribe: 79,380 × $0.0000166667 = $1.32
By fully leveraging pyannote.audio and faster-whisper, this is about 5x more cost-efficient than AWS Transcribe ($0.024/min = $11.52 for 8 hours). Using arm64 drops it to ~$1.9, making it about 6x more efficient.
Implementation Pitfalls and Solutions
1. States.DataLimitExceeded (256KB Limit)
Symptom:
When processing 900+ segments, the Step Functions Map state throws this error:
States.DataLimitExceeded - The state/task returned a result with a size
exceeding the maximum number of bytes service limit.
Cause:
Step Functions has a 256KB payload limit, and accumulating all Map state results exceeds it.
Solution:
// CDK: Discard Map state results
const transcribeSegments = new sfn.Map(this, "TranscribeSegments", {
itemsPath: "$.segment_files",
maxConcurrency: 10,
resultPath: sfn.JsonPath.DISCARD, // ← This is key
});
# Lambda side: Save results to S3
result_key = f"transcribe_results/{segment_name}.json"
s3.put_object(
    Bucket=bucket,
    Key=result_key,
    Body=json.dumps(result_data, ensure_ascii=False),
)

# Return only metadata to Step Functions
return {"bucket": bucket, "result_key": result_key}
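On the AggregateResults side, the per-segment files are then read back from S3 rather than flowing through Step Functions state. A minimal sketch, where the prefix and field names are assumptions:

import json

import boto3

s3 = boto3.client("s3")

def load_transcribe_results(bucket, prefix="transcribe_results/"):
    """Collect every per-segment result written by the Transcribe Lambdas."""
    results = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            results.append(json.loads(body))
    # Sort by segment start time before merging into transcript.json
    return sorted(results, key=lambda r: r.get("start", 0))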
2. PyTorch 2.6+ torch.load Issue
Symptom:
pyannote.audio model loading throws this error:
FutureWarning: You are using `torch.load` with `weights_only=False`
PyTorch 2.6 changed the default to weights_only=True, breaking some model loading.
Solution: Monkey-patch torch.load
import torch

# Disable the PyTorch 2.6+ weights_only=True default.
# pyannote's HuggingFace checkpoints are from trusted sources, so this is safe.
_orig_torch_load = torch.load

def _torch_load_legacy(*args, **kwargs):
    """Always call torch.load with weights_only=False"""
    kwargs["weights_only"] = False
    return _orig_torch_load(*args, **kwargs)

torch.load = _torch_load_legacy  # Apply BEFORE pyannote import
Important: This patch must be applied before from pyannote.audio import Pipeline.
Monkey-patch Risk:
This modifies PyTorch internals and may break in future versions. If possible, wait for pyannote.audio to support safetensors or check for an official workaround. In production, add a unit test that verifies the patch is actually applied before pyannote is imported.
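Such a test can be as simple as checking that importing the handler module swaps torch.load out (the module name here is hypothetical):

import torch

def test_torch_load_patch_applied():
    import diarize_handler  # noqa: F401  (hypothetical handler module that applies the patch)
    assert torch.load.__name__ == "_torch_load_legacy"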
3. Lambda Container Model Download Strategy
Problem:
- Hugging Face models (pyannote, whisper) are several GB
- Lambda's /tmp is 512MB–10GB (configurable)
- Downloading at cold start causes Lambda timeouts
Solution: Include models at build time
# Dockerfile
FROM public.ecr.aws/lambda/python:3.11

# Download Hugging Face models at build time
ENV HF_HOME=/var/task/models
RUN pip install huggingface_hub

# Important: pass HF_TOKEN only at build time; never include it in the final image
ARG HF_TOKEN
RUN python -c "from huggingface_hub import snapshot_download; \
    snapshot_download('pyannote/speaker-diarization-3.1', token='${HF_TOKEN}')"

# Safer: a BuildKit secret mount keeps the token out of the image history entirely
RUN --mount=type=secret,id=hf_token \
    HF_TOKEN=$(cat /run/secrets/hf_token) python download_models.py
4. Secure HF_TOKEN Management
Problem:
- pyannote's Hugging Face model requires authentication
- Putting it directly in Lambda environment variables is a security risk
Solution: AWS Secrets Manager + Build-time Download
import json

import boto3

# Get from Secrets Manager at Lambda runtime
# (the models are baked in at build time, so this is only needed if a token
#  is required at runtime)
# HF_SECRET_ARN is the secret's ARN, e.g. passed in via a Lambda environment variable
secrets_client = boto3.client("secretsmanager")
secret = secrets_client.get_secret_value(SecretId=HF_SECRET_ARN)
hf_token = json.loads(secret["SecretString"])["token"]
5. Choosing 8-Minute Chunk Length
Trial and Error:
| Chunk Length | Result |
|---|---|
| 5 minutes | Speaker diarization accuracy dropped (context too short) |
| 10 minutes | Memory became the constraint (barely fits in Lambda's 10GB maximum) |
| 15 minutes | Exceeded Lambda's 15-minute timeout |
| 8 minutes | Optimal balance of accuracy, memory, and time |
Why 30-second overlap:
- Speaker changes typically have 2-3 second gaps
- 30 seconds reliably captures speaker changes at boundaries
- Longer overlap increases redundant processing and costs
6. Lambda vs ECS Decision Criteria
Why Lambda was chosen:
Processing time < 15min AND Memory < 10GB → Lambda
Processing time > 15min OR GPU required → ECS Fargate
pyannote.audio runs on CPU, and 8-minute chunks fit within Lambda's constraints.
Future Plans: Google Meet Auto-Integration
Planning to use the Auto-Recording feature from Google Meet REST API (added April 2025) for automatic recording and analysis.
Google Calendar (meeting schedule)
│
▼ Cloud Functions (Calendar Webhook)
Google Meet Space (Auto-Recording enabled)
│
▼ Recording complete
Google Drive (recording storage)
│
▼ Workspace Events API + Pub/Sub
EventBridge (Cross-Cloud)
│
▼
Lambda (DownloadRecording)
│
▼
S3 → Step Functions (existing pipeline)
│
▼
DynamoDB + AppSync → Dashboard
Design document: docs/google-meet-integration/
Summary
- ✓ ~$2/month fixed cost (Secrets Manager + ECR + logs) for a speaker diarization and transcription pipeline
- ✓ AWS Step Functions + Lambda fully serverless — pay only for what you use
- ✓ pyannote.audio + faster-whisper + gpt-4o-mini for high quality at low cost
- ✓ 8-hour video for ~$2.3 (~5x more cost-efficient than AWS Transcribe)
- ✓ Parallel chunk processing + embedding clustering handles long audio
- ✓ 256KB payload limit solved with resultPath: DISCARD + S3 passthrough
All code is available on GitHub.
By the way, isn't Secrets Manager at $0.40/secret/month kind of expensive? Storing just 2 API keys costs $9.60/year. SSM Parameter Store SecureString is free... But I guess it's the price for enterprise features like auto-rotation and audit logs. I've accepted it.