TL;DR
- ~$2/month fixed cost (Secrets Manager + ECR + logs) for a speaker diarization transcription pipeline
- AWS Step Functions + Lambda fully serverless architecture
- pyannote.audio 3.1 for speaker diarization, faster-whisper for transcription, gpt-4o-mini for LLM analysis
- 8-hour video processing completed for ~$2.3 (x86, no free tier) — about 5x more cost-efficient than AWS Transcribe
- Deep dive into pitfalls like States.DataLimitExceeded and their solutions
Repository: github.com/ekusiadadus/ek-transcript
Introduction
I've been analyzing more and more user interview recordings lately. When evaluating existing solutions:
- AWS Transcribe: ~$1.44/hour ($0.024/min), speaker diarization accuracy not great
- Commercial SaaS: $50–$200/month fixed fees, charged even in months with no usage
- Always-on GPU server: EC2 g4dn.xlarge costs $380+/month — too expensive for personal use
The biggest problem was fixed monthly costs. I only use this a few times a month, yet I'd be paying every month. I wanted pure pay-per-use pricing with near-zero monthly fixed costs — that was my top priority.
So I decided to build my own pipeline using AWS serverless services.
Requirements
- Zero monthly fixed cost (pay only for what you use)
- Support for long videos up to 8 hours
- Speaker diarization (who said what)
- High-accuracy Japanese transcription
- LLM-powered summarization and analysis
- Low cost (~$1 per video)
- Fully serverless
System Architecture
┌─────────────────────────────────────────────────────────────────┐
│                            AWS Cloud                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  S3 (Input) ──▶ EventBridge ──▶ Lambda (StartPipeline)          │
│  uploads/     (Object Created)             │                    │
│                                            ▼                    │
│                               DynamoDB (InterviewsTable)        │
│                                            │                    │
│                                            ▼                    │
│  ┌────────────────── Step Functions ─────────────────────────┐  │
│  │                                                           │  │
│  │  ExtractAudio ──▶ ChunkAudio ──▶ DiarizeChunks (Map x5)   │  │
│  │       │                                     │             │  │
│  │       ▼                                     ▼             │  │
│  │  S3 (Output)     MergeSpeakers ◀────────────┘             │  │
│  │       ▲                │                                  │  │
│  │       │                ▼                                  │  │
│  │       │         SplitBySpeaker ──▶ Transcribe (Map x10)   │  │
│  │       │                                     │             │  │
│  │       │               AggregateResults ◀────┘             │  │
│  │       │                            │                      │  │
│  │       └────────── LLMAnalysis ◀────┘                      │  │
│  │                  (gpt-4o-mini)                            │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Data Flow Details
- ExtractAudio: video.mp4 → audio.wav (16kHz mono)
- ChunkAudio: audio.wav → chunk_0.wav, chunk_1.wav, ... (8min+30s overlap)
- DiarizeChunks (Map x5): pyannote speaker diarization per chunk
- MergeSpeakers: Global speaker unification via embedding vectors
- SplitBySpeaker: Split audio by speaker segments
- TranscribeSegments (Map x10): faster-whisper transcription
- AggregateResults: Merge results → transcript.json
- LLMAnalysis: gpt-4o-mini structured analysis → analysis.json
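The AggregateResults step can be sketched as follows. This is a minimal sketch, not the repository's code: it assumes each transcription call wrote one JSON object under a `transcribe_results/` prefix (the layout used in the States.DataLimitExceeded fix later in this article) and that every object carries a `start` field in seconds, which is a hypothetical schema. `aggregate_results` is an illustrative name, and `s3` is expected to be a boto3 S3 client.

```python
import json
from typing import Any, Dict, List


def aggregate_results(
    s3: Any, bucket: str, prefix: str = "transcribe_results/"
) -> List[Dict[str, Any]]:
    """Collect every per-segment JSON under `prefix`, ordered by start time.

    `s3` is a boto3 S3 client, e.g. boto3.client("s3"). The "start" field is
    an assumed part of each result; the real schema may differ.
    """
    results: List[Dict[str, Any]] = []
    # list_objects_v2 returns at most 1,000 keys per call, so paginate.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when no keys match the prefix.
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            results.append(json.loads(body))
    return sorted(results, key=lambda r: r["start"])
```

Reading from S3 here keeps the state machine payload small, because only object keys, not transcripts, ever pass through Step Functions.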
Why Serverless: Minimizing Monthly Fixed Costs
| Component | Pricing Model | Monthly Fixed Cost |
|---|---|---|
| Lambda | Per-execution | $0 |
| Step Functions | Per-transition | $0 |
| S3 | Storage + requests | $0~ |
| DynamoDB | On-demand | $0 |
| EventBridge | Per-event | $0 |
| Cognito | Free up to 50K MAU | $0 |
| AppSync | Per-request | $0~ |
| Secrets Manager | Per-secret | $0.80/mo (2 secrets) |
| ECR | Image storage | $0.50–$1.00/mo |
| CloudWatch Logs | Log storage | ~$0.05/mo |
Actual Monthly Fixed Costs
Secrets Manager: $0.80/mo (OpenAI + HuggingFace, 2 secrets)
ECR: $0.50–$1.00/mo (Docker images with ML models, 8 images)
CloudWatch Logs: ~$0.05/mo (Step Functions + Lambda logs)
───────────────────────────────────────────
Total: ~$1.50–$2.00/mo
Only ~$2/month. Compared with commercial SaaS at $50–$200/month, that is $576–$2,376 in annual savings.
Design Evolution
Initial Design: Simple Sequential Processing
[Video] → ExtractAudio → Diarize → SplitBySpeaker → Transcribe → LLMAnalysis
                            │
          (single Lambda processing all audio)
Problems:
- Lambda's 15-minute timeout couldn't complete speaker diarization for 8-hour audio
- pyannote.audio memory usage was massive (10GB+)
- Sequential processing made total processing time too long
Alternative: ECS Fargate Processing
Evaluation:
- GPU instance (g4dn.xlarge) cost was high ($0.526/hour)
- 8-hour video would cost $4+
- Spot instances had reliability concerns
Current Design: Parallel Chunk Processing
                  ┌─ DiarizeChunk_0 ─┐
[Video] → Chunk → ├─ DiarizeChunk_1 ─┤ → Merge → Split → Transcribe(parallel) → LLM
                  ├─ DiarizeChunk_2 ─┤
                  └─      ...       ─┘
Design Points:
- 8-minute chunks + 30-second overlap: Fits within Lambda's 15-minute limit, captures speaker changes at boundaries
- Speaker unification via embedding vectors: Even if SPEAKER_00 is different people across chunks, cosine similarity clustering unifies them
- Map State parallel execution: Speaker diarization x5, transcription x10 parallel processing for speed
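To make the speaker-unification point concrete, here is a minimal sketch of the relabeling step that follows clustering. It assumes one embedding per (chunk, local speaker) pair and one cluster label per embedding; `build_global_speaker_map` and the key layout are hypothetical and not taken from the repository.

```python
from typing import Dict, List, Tuple

SpeakerKey = Tuple[int, str]  # (chunk_index, local speaker label)


def build_global_speaker_map(
    keys: List[SpeakerKey], labels: List[int]
) -> Dict[SpeakerKey, str]:
    """Map each (chunk, local speaker) pair to a global speaker name.

    keys[i] identifies the embedding in row i of the clustered matrix and
    labels[i] is the cluster label assigned to that row.
    """
    if len(keys) != len(labels):
        raise ValueError("keys and labels must have the same length")
    cluster_to_global: Dict[int, str] = {}
    mapping: Dict[SpeakerKey, str] = {}
    for key, label in zip(keys, labels):
        # Number global speakers in order of first appearance.
        if label not in cluster_to_global:
            cluster_to_global[label] = f"SPEAKER_{len(cluster_to_global):02d}"
        mapping[key] = cluster_to_global[label]
    return mapping


# Two chunks whose local labels are swapped relative to each other.
keys = [(0, "SPEAKER_00"), (0, "SPEAKER_01"), (1, "SPEAKER_00"), (1, "SPEAKER_01")]
labels = [1, 0, 0, 1]
speaker_map = build_global_speaker_map(keys, labels)
```

In this example chunk 1's local SPEAKER_00 shares a cluster with chunk 0's SPEAKER_01, so both end up with the same global name.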
Technology Choices and Rationale
| Technology | Reason | vs. Alternatives |
|---|---|---|
| pyannote.audio 3.1 | Latest speaker diarization accuracy, Hugging Face integration | Higher accuracy than AWS Transcribe speaker diarization |
| faster-whisper | 4-8x faster than Whisper, int8 quantization support | OpenAI Whisper API is more expensive |
| gpt-4o-mini | Structured Outputs support, low cost | Claude lacked Structured Outputs (at the time) |
| Lambda + Container | Up to 10GB image, cold start acceptable | ECS Fargate has always-on cost concerns |
| Step Functions | Complex workflow management, error handling | SQS + Lambda makes state management complex |
Component Implementation Details
1. ExtractAudio Lambda
Extracts 16kHz mono WAV from video — Whisper's recommended sample rate.
import subprocess

def extract_audio(input_path: str, output_path: str) -> None:
    """Extract 16kHz mono WAV from video"""
    cmd = [
        "ffmpeg", "-i", input_path,
        "-vn",                   # No video
        "-acodec", "pcm_s16le",  # 16-bit PCM
        "-ar", "16000",          # 16kHz
        "-ac", "1",              # Mono
        "-y", output_path,       # Overwrite the output file if it exists
    ]
    subprocess.run(cmd, check=True)  # Raises CalledProcessError if ffmpeg fails
2. ChunkAudio Lambda
Splits into 8-minute chunks + 30-second overlap. The overlap captures speaker changes at boundaries accurately.
CHUNK_DURATION = 480 # 8 minutes
OVERLAP_DURATION = 30 # 30-second overlap
# chunk_0: 0–510s (effective: 0–480)
# chunk_1: 450–960s (effective: 480–960)
# chunk_2: 900–1410s (effective: 960–1440)
3. DiarizeChunk Lambda (Parallel Execution)
Speaker diarization with pyannote.audio 3.1. Extracts embedding vectors for each speaker and saves to S3.
pyannote.audio License Note: pyannote/speaker-diarization-3.1 requires license agreement on Hugging Face. Visit the model page to agree to the license terms and obtain your HF_TOKEN on first use. Verify licensing requirements for commercial use.
import torch
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    token=hf_token,
)
if torch.cuda.is_available():
    pipeline.to(torch.device("cuda"))  # Use the GPU when one is present
diarization = pipeline({"waveform": audio_tensor, "sample_rate": sample_rate})
# One embedding vector per speaker (helper not shown here)
speaker_embeddings = extract_speaker_embeddings(audio_path, segments)
4. MergeSpeakers Lambda
Clusters speakers across chunks using cosine similarity of embedding vectors.
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(all_embeddings)
distance_matrix = 1 - similarity_matrix
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=1 - 0.75,  # 75%+ similarity = same speaker
    metric="precomputed",
    linkage="average",
)
labels = clustering.fit_predict(distance_matrix)
5. Transcribe Lambda (Parallel Execution)
High-speed transcription with faster-whisper (medium model).
from faster_whisper import WhisperModel
model = WhisperModel("medium", device="cpu", compute_type="int8")
segments, info = model.transcribe(audio_path, language="ja", beam_size=5)
text = "".join([seg.text for seg in segments])
6. LLMAnalysis Lambda
Structured analysis using gpt-4o-mini's Structured Outputs.
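from openai import OpenAI

client = OpenAI()  # Uses the OPENAI_API_KEY environment variable unless api_key is passed
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": ANALYSIS_PROMPT},
        {"role": "user", "content": f"Analyze this:\n{transcript}"},
    ],
    # response_format takes the output schema, e.g. a Pydantic model class
    response_format=AnalysisResult,
)
Cost Breakdown (8-Hour Video Example, Dec 2025)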
Assumptions:
- Region: us-east-1
- Lambda: x86_64 (arm64 is ~20% cheaper)
- No free tier, no retries
- Map parallelism: Diarize x5, Transcribe x10
| Service | Calculation | Cost |
|---|---|---|
| Lambda (Diarize) | 10GB x 600s x 6 chunks = 36,000 GB-s | $0.60 |
| Lambda (Transcribe) | 2.94GB x 30s x 900 calls = 79,380 GB-s | $1.32 |
| Lambda (Other) | ExtractAudio, Chunk, Merge, Split, Aggregate, LLM | $0.10 |
| Step Functions | ~6,000 transitions x $0.025/1K | $0.15 |
| S3 | Read/write + temp storage | $0.02 |
| OpenAI API | gpt-4o-mini (300K input + 8K output tokens) | $0.10 |
| Total | | ~$2.3 |
Running pyannote.audio and faster-whisper on Lambda makes this about 5x cheaper than AWS Transcribe ($11.52 for 8 hours at $0.024/min). On arm64 the total drops to ~$1.9, or about 6x cheaper.
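The table's arithmetic can be reproduced with a few lines of Python. This sketch takes every input (memory, duration, invocation counts, and the flat S3, OpenAI, and "other" figures) directly from the table above rather than measuring anything; the only outside number is the x86 Lambda price of $0.0000166667 per GB-second in us-east-1.

```python
LAMBDA_X86_PER_GB_SECOND = 0.0000166667  # USD, us-east-1, x86_64
TRANSCRIBE_PER_MINUTE = 0.024            # USD, i.e. $11.52 / 480 minutes


def lambda_cost(memory_gb: float, seconds: float, invocations: int) -> float:
    """Cost of `invocations` runs, each billed memory_gb * seconds GB-seconds."""
    return memory_gb * seconds * invocations * LAMBDA_X86_PER_GB_SECOND


costs = {
    "lambda_diarize": lambda_cost(10, 600, 6),        # 36,000 GB-s
    "lambda_transcribe": lambda_cost(2.94, 30, 900),  # 79,380 GB-s
    "lambda_other": 0.10,
    "step_functions": 6000 / 1000 * 0.025,            # $0.025 per 1K transitions
    "s3": 0.02,
    "openai": 0.10,
}
pipeline_total = sum(costs.values())
transcribe_total = 8 * 60 * TRANSCRIBE_PER_MINUTE
ratio = transcribe_total / pipeline_total

print(f"pipeline ${pipeline_total:.2f}, Transcribe ${transcribe_total:.2f}, ratio {ratio:.1f}x")
# prints: pipeline $2.29, Transcribe $11.52, ratio 5.0x
```

Changing the invocation counts or durations in `costs` is a quick way to estimate other video lengths under the same assumptions.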
Implementation Pitfalls and Solutions
1. States.DataLimitExceeded (256KB Limit)
Symptom: When processing 900+ segments, the Step Functions Map state throws this error:
States.DataLimitExceeded - The state/task returned a result with a size
exceeding the maximum number of bytes service limit.
Cause: Step Functions has a 256KB payload limit, and accumulating all Map state results exceeds it.
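A rough size estimate shows why this happens. The sketch below is illustrative only: the per-segment result shape (`speaker`, `start`, `end`, `text`) and the 100-character text length are assumptions, not the pipeline's actual schema. Spreading 256 KB over 900 segments leaves about 291 bytes per result, and Japanese text typically takes 3 bytes per character in UTF-8.

```python
import json

STEP_FUNCTIONS_PAYLOAD_LIMIT = 256 * 1024  # bytes, measured on the UTF-8 encoded JSON
SEGMENT_COUNT = 900


def payload_size(results: list) -> int:
    """Size in bytes of the JSON array a Map state would pass to the next state."""
    return len(json.dumps(results, ensure_ascii=False).encode("utf-8"))


# Hypothetical per-segment result: metadata plus 100 characters of Japanese text.
sample_result = {
    "speaker": "SPEAKER_00",
    "start": 12.3,
    "end": 15.8,
    "text": "あ" * 100,
}

inline_size = payload_size([sample_result] * SEGMENT_COUNT)
budget_per_segment = STEP_FUNCTIONS_PAYLOAD_LIMIT // SEGMENT_COUNT

print(f"collected Map output: {inline_size} bytes (limit {STEP_FUNCTIONS_PAYLOAD_LIMIT})")
print(f"budget per segment: {budget_per_segment} bytes")
```

Under these assumptions the collected output is already well past the limit, even though each individual result is small.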
Solution:
// CDK: Discard Map state results
const transcribeSegments = new sfn.Map(this, "TranscribeSegments", {
  itemsPath: "$.segment_files",
  maxConcurrency: 10,
  resultPath: sfn.JsonPath.DISCARD, // ← This is key
});
# Lambda side: Save results to S3
result_key = f"transcribe_results/{segment_name}.json"
s3.put_object(
    Bucket=bucket,
    Key=result_key,
    Body=json.dumps(result_data, ensure_ascii=False),
)
# Return only metadata to Step Functions
return {"bucket": bucket, "result_key": result_key}
2. PyTorch 2.6+ torch.load Issue
Symptom: pyannote.audio model loading throws:
FutureWarning: You are using `torch.load` with `weights_only=False`
Solution: Monkey-patch torch.load
import torch

_orig_torch_load = torch.load

def _torch_load_legacy(*args, **kwargs):
    """Always call torch.load with weights_only=False"""
    kwargs["weights_only"] = False
    return _orig_torch_load(*args, **kwargs)

torch.load = _torch_load_legacy  # Apply BEFORE pyannote import
Monkey-patch Risk: This replaces torch.load for the whole process and may break in future versions. With weights_only=False, torch.load also unpickles arbitrary objects, so only load checkpoints you trust. If possible, wait for pyannote.audio's safetensors support or check for official workarounds.
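One way to narrow the patch is to apply it only while the model loads and restore `torch.load` afterwards. The sketch below is a generic standard-library context manager, not something from the repository: `force_kwarg` is a hypothetical helper name, and it only helps if the loading code looks up `torch.load` on the module at call time.

```python
import contextlib
import functools


@contextlib.contextmanager
def force_kwarg(module, func_name, **forced):
    """Temporarily wrap module.func_name so every call receives `forced` kwargs."""
    original = getattr(module, func_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        kwargs.update(forced)  # Forced values override whatever the caller passed
        return original(*args, **kwargs)

    setattr(module, func_name, wrapper)
    try:
        yield
    finally:
        setattr(module, func_name, original)  # Restore even if loading fails


# Intended use (commented out so the sketch runs without torch installed):
# import torch
# with force_kwarg(torch, "load", weights_only=False):
#     pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", token=hf_token)
```

Outside the `with` block, `torch.load` keeps its normal default, so other code in the same process is unaffected.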
3. Lambda Container Model Download Strategy
Problem: Hugging Face models (pyannote, whisper) are several GB. Downloading at cold start causes Lambda timeout.
Solution: Include models at build time
FROM public.ecr.aws/lambda/python:3.11
ARG HF_TOKEN
ENV HF_HOME=/var/task/models
RUN pip install huggingface_hub
RUN python -c "from huggingface_hub import snapshot_download; \
    snapshot_download('pyannote/speaker-diarization-3.1', token='${HF_TOKEN}')"
A build argument like this stays visible in the image history, so mount HF_TOKEN as a BuildKit secret instead to keep it out of the final image:
RUN --mount=type=secret,id=hf_token \
    HF_TOKEN=$(cat /run/secrets/hf_token) python download_models.py
4. Choosing 8-Minute Chunk Length
| Chunk Length | Result |
|---|---|
| 5 minutes | Speaker diarization accuracy dropped (context too short) |
| 10 minutes | Lambda memory nearly exhausted (barely fits in 10GB) |
| 15 minutes | Exceeded Lambda's 15-minute timeout |
| 8 minutes | Optimal balance of accuracy, memory, and time |
Why 30-second overlap:
- Speaker changes typically have 2-3 second gaps
- 30 seconds reliably captures speaker changes at boundaries
- Longer overlap increases redundant processing and costs
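The chunk boundaries can be computed with a short helper. The example windows in the ChunkAudio section (0–510s, 450–960s, 900–1410s) are consistent with a 510-second window that advances by 450 seconds, so this sketch assumes that layout; `chunk_windows` is a hypothetical name and the repository may compute boundaries differently.

```python
from typing import List, Tuple

CHUNK_DURATION = 480   # 8 minutes
OVERLAP_DURATION = 30  # 30-second overlap


def chunk_windows(total_seconds: int) -> List[Tuple[int, int]]:
    """Return (start, end) windows in seconds that together cover the audio."""
    stride = CHUNK_DURATION - OVERLAP_DURATION  # 450s between window starts (assumed)
    window = CHUNK_DURATION + OVERLAP_DURATION  # 510s per window (assumed)
    windows: List[Tuple[int, int]] = []
    start = 0
    while start < total_seconds:
        end = min(start + window, total_seconds)  # Clip the last window
        windows.append((start, end))
        if end == total_seconds:
            break
        start += stride
    return windows


print(chunk_windows(1410))
# prints: [(0, 510), (450, 960), (900, 1410)]
```

Each tuple can be passed to ffmpeg as a start offset and duration, and every window stays well under the 10-minute length that ran into memory pressure.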
Future Plans: Google Meet Auto-Integration
Planning to use the Auto-Recording feature from Google Meet REST API (added April 2025) for automatic recording and analysis.
Google Calendar (meeting schedule)
│
▼ Cloud Functions (Calendar Webhook)
Google Meet Space (Auto-Recording enabled)
│
▼ Recording complete
Google Drive (recording storage)
│
▼ Workspace Events API + Pub/Sub
EventBridge (Cross-Cloud)
│
▼
Lambda (DownloadRecording)
│
▼
S3 → Step Functions (existing pipeline)
│
▼
DynamoDB + AppSync → Dashboard
Summary
- ~$2/month fixed cost for speaker diarization transcription pipeline
- AWS Step Functions + Lambda fully serverless — pay only for what you use
- pyannote.audio + faster-whisper + gpt-4o-mini for high quality at low cost
- 8-hour video for ~$2.3 (~5x more cost-efficient than AWS Transcribe)
- Parallel chunk processing + embedding clustering handles long audio
- 256KB limit solved with resultPath: DISCARD + S3 passthrough
All code is available on GitHub.