
Building an AWS Serverless Pipeline for Speaker Diarization and Transcription at ~$1 per Video

Tags: aws · serverless · machine-learning · python

TL;DR

  • ~$2/month fixed cost (Secrets Manager + ECR + logs) for a speaker diarization transcription pipeline
  • AWS Step Functions + Lambda fully serverless architecture
  • pyannote.audio 3.1 for speaker diarization, faster-whisper for transcription, gpt-4o-mini for LLM analysis
  • 8-hour video processing completed for ~$2.3 (x86, no free tier) — about 5x more cost-efficient than AWS Transcribe
  • Deep dive into pitfalls like States.DataLimitExceeded and their solutions

Repository: github.com/ekusiadadus/ek-transcript

Introduction

I've been analyzing more and more user interview recordings lately. When evaluating existing solutions:

  • AWS Transcribe: ~$11.52 for 8 hours ($0.024/min), speaker diarization accuracy not great
  • Commercial SaaS: $50–$200/month fixed fees, charged even in months with no usage
  • Always-on GPU server: EC2 g4dn.xlarge costs $380+/month — too expensive for personal use

The biggest problem was fixed monthly costs. I only use this a few times a month, yet I'd be paying every month. I wanted pure pay-per-use pricing with near-zero monthly fixed costs — that was my top priority.

So I decided to build my own pipeline using AWS serverless services.

Requirements

  1. Zero monthly fixed cost (pay only for what you use)
  2. Support for long videos up to 8 hours
  3. Speaker diarization (who said what)
  4. High-accuracy Japanese transcription
  5. LLM-powered summarization and analysis
  6. Low cost (~$1 per video)
  7. Fully serverless

System Architecture

┌─────────────────────────────────────────────────────────────────┐
│                          AWS Cloud                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  S3 (Input)  ──▶  EventBridge  ──▶  Lambda (StartPipeline)     │
│  uploads/         (Object Created)        │                     │
│                                           ▼                     │
│                                    DynamoDB (InterviewsTable)   │
│                                           │                     │
│                                           ▼                     │
│  ┌────────────────── Step Functions ─────────────────────────┐  │
│  │                                                           │  │
│  │  ExtractAudio ──▶ ChunkAudio ──▶ DiarizeChunks (Map x5) │  │
│  │       │                                    │              │  │
│  │       ▼                                    ▼              │  │
│  │  S3 (Output)    MergeSpeakers ◀────────────┘              │  │
│  │       ▲              │                                    │  │
│  │       │              ▼                                    │  │
│  │       │         SplitBySpeaker ──▶ Transcribe (Map x10)  │  │
│  │       │                                    │              │  │
│  │       │              AggregateResults ◀────┘              │  │
│  │       │                    │                              │  │
│  │       └────────── LLMAnalysis ◀────┘                     │  │
│  │                   (gpt-4o-mini)                           │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Data Flow Details

  1. ExtractAudio: video.mp4 → audio.wav (16kHz mono)
  2. ChunkAudio: audio.wav → chunk_0.wav, chunk_1.wav, ... (8min+30s overlap)
  3. DiarizeChunks (Map x5): pyannote speaker diarization per chunk
  4. MergeSpeakers: Global speaker unification via embedding vectors
  5. SplitBySpeaker: Split audio by speaker segments
  6. TranscribeSegments (Map x10): faster-whisper transcription
  7. AggregateResults: Merge results → transcript.json
  8. LLMAnalysis: gpt-4o-mini structured analysis → analysis.json

Why Serverless: Minimizing Monthly Fixed Costs

| Component | Pricing Model | Monthly Fixed Cost |
|---|---|---|
| Lambda | Per-execution | $0 |
| Step Functions | Per-transition | $0 |
| S3 | Storage + requests | $0+ |
| DynamoDB | On-demand | $0 |
| EventBridge | Per-event | $0 |
| Cognito | Free up to 50K MAU | $0 |
| AppSync | Per-request | $0+ |
| Secrets Manager | Per-secret | $0.80/mo (2 secrets) |
| ECR | Image storage | $0.50–$1.00/mo |
| CloudWatch Logs | Log storage | ~$0.05/mo |

Actual Monthly Fixed Costs

Secrets Manager:    $0.80/mo        (OpenAI + HuggingFace, 2 secrets)
ECR:                $0.50–$1.00/mo  (Docker images with ML models, 8 images)
CloudWatch Logs:    ~$0.05/mo       (Step Functions + Lambda logs)
───────────────────────────────────────────
Total:              ~$1.50–$2.00/mo

**Only ~$2 even in months with zero usage.** Compared to commercial SaaS at $50–$200/month, that's $576–$2,376 in annual savings.

Design Evolution

Initial Design: Simple Sequential Processing

[Video] → ExtractAudio → Diarize → SplitBySpeaker → Transcribe → LLMAnalysis

                    (single Lambda processing all audio)

Problems:

  • Lambda's 15-minute timeout couldn't complete speaker diarization for 8-hour audio
  • pyannote.audio memory usage was massive (10GB+)
  • Sequential processing made total processing time too long

Alternative: ECS Fargate Processing

Evaluation:

  • GPU instance (g4dn.xlarge) cost was high ($0.526/hour)
  • 8-hour video would cost $4+
  • Spot instances had reliability concerns

Current Design: Parallel Chunk Processing

                    ┌─ DiarizeChunk_0 ─┐
[Video] → Chunk →   ├─ DiarizeChunk_1 ─┤ → Merge → Split → Transcribe(parallel) → LLM
                    ├─ DiarizeChunk_2 ─┤
                    └─      ...       ─┘

Design Points:

  1. 8-minute chunks + 30-second overlap: Fits within Lambda's 15-minute limit, captures speaker changes at boundaries
  2. Speaker unification via embedding vectors: Even if SPEAKER_00 is different people across chunks, cosine similarity clustering unifies them
  3. Map State parallel execution: Speaker diarization x5, transcription x10 parallel processing for speed

Technology Choices and Rationale

| Technology | Reason | vs. Alternatives |
|---|---|---|
| pyannote.audio 3.1 | Latest speaker diarization accuracy, Hugging Face integration | Higher accuracy than AWS Transcribe speaker diarization |
| faster-whisper | 4-8x faster than Whisper, int8 quantization support | OpenAI Whisper API is more expensive |
| gpt-4o-mini | Structured Outputs support, low cost | Claude lacked Structured Outputs (at the time) |
| Lambda + Container | Up to 10GB image, cold start acceptable | ECS Fargate has always-on cost concerns |
| Step Functions | Complex workflow management, error handling | SQS + Lambda makes state management complex |

Component Implementation Details

1. ExtractAudio Lambda

Extracts 16kHz mono WAV from video — Whisper's recommended sample rate.

extract_audio.py
import subprocess

def extract_audio(input_path: str, output_path: str) -> None:
    """Extract 16kHz mono WAV from video"""
    cmd = [
        "ffmpeg", "-i", input_path,
        "-vn",                    # No video
        "-acodec", "pcm_s16le",   # 16-bit PCM
        "-ar", "16000",           # 16kHz
        "-ac", "1",               # Mono
        "-y", output_path,        # Overwrite output without prompting
    ]
    subprocess.run(cmd, check=True)

2. ChunkAudio Lambda

Splits into 8-minute chunks + 30-second overlap. The overlap captures speaker changes at boundaries accurately.

chunk_audio.py
CHUNK_DURATION = 480      # 8 minutes
OVERLAP_DURATION = 30     # 30-second overlap
 
# chunk_0: 0–510s (effective: 0–480)
# chunk_1: 450–960s (effective: 480–960)
# chunk_2: 900–1410s (effective: 960–1440)
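The boundaries in the comments above follow a simple pattern: each chunk is CHUNK_DURATION + OVERLAP_DURATION (510s) long, and starts every CHUNK_DURATION - OVERLAP_DURATION (450s). A minimal sketch of that calculation (the function name is mine; the actual ffmpeg splitting is omitted):

```python
CHUNK_DURATION = 480      # 8 minutes of effective audio per chunk
OVERLAP_DURATION = 30     # seconds shared with the neighboring chunk

def chunk_bounds(index: int, total_duration: float) -> tuple[float, float]:
    """Return (start, end) seconds for one chunk, including overlap.

    Matches the comments above: chunk_0 = 0-510s, chunk_1 = 450-960s, ...
    """
    start = index * (CHUNK_DURATION - OVERLAP_DURATION)
    end = min(start + CHUNK_DURATION + OVERLAP_DURATION, total_duration)
    return float(start), end
```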

3. DiarizeChunk Lambda (Parallel Execution)

Speaker diarization with pyannote.audio 3.1. Extracts embedding vectors for each speaker and saves to S3.

pyannote.audio License Note: pyannote/speaker-diarization-3.1 requires license agreement on Hugging Face. Visit the model page to agree to the license terms and obtain your HF_TOKEN on first use. Verify licensing requirements for commercial use.

diarize_chunk.py
import torch
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    token=hf_token,
)

# Use GPU if available (CPU on Lambda)
if torch.cuda.is_available():
    pipeline.to(torch.device("cuda"))

diarization = pipeline({"waveform": audio_tensor, "sample_rate": sample_rate})

# Per-speaker embedding vectors, saved to S3 for the MergeSpeakers step
speaker_embeddings = extract_speaker_embeddings(audio_path, segments)
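extract_speaker_embeddings is not shown in the excerpt above. One plausible minimal version mean-pools the per-segment embedding vectors (e.g. from pyannote's embedding model) into a single vector per chunk-local speaker label; the function name and shapes here are illustrative assumptions, not the repository's actual code:

```python
import numpy as np

def pool_speaker_embeddings(
    segment_embeddings: list[np.ndarray],  # one (D,) vector per segment
    segment_speakers: list[str],           # chunk-local label per segment, e.g. "SPEAKER_00"
) -> dict[str, np.ndarray]:
    """Average L2-normalized segment embeddings into one vector per speaker."""
    grouped: dict[str, list[np.ndarray]] = {}
    for emb, spk in zip(segment_embeddings, segment_speakers):
        grouped.setdefault(spk, []).append(emb / np.linalg.norm(emb))
    return {spk: np.mean(vecs, axis=0) for spk, vecs in grouped.items()}
```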

4. MergeSpeakers Lambda

Clusters speakers across chunks using cosine similarity of embedding vectors.

merge_speakers.py
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity
 
similarity_matrix = cosine_similarity(all_embeddings)
distance_matrix = 1 - similarity_matrix
 
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=1 - 0.75,  # 75%+ similarity = same speaker
    metric="precomputed",
    linkage="average",
)
labels = clustering.fit_predict(distance_matrix)

5. Transcribe Lambda (Parallel Execution)

High-speed transcription with faster-whisper (medium model).

transcribe.py
from faster_whisper import WhisperModel
 
model = WhisperModel("medium", device="cpu", compute_type="int8")
segments, info = model.transcribe(audio_path, language="ja", beam_size=5)
text = "".join([seg.text for seg in segments])

6. LLMAnalysis Lambda

Structured analysis using gpt-4o-mini's Structured Outputs.

llm_analysis.py
from openai import OpenAI
from pydantic import BaseModel

class AnalysisResult(BaseModel):
    """Pydantic schema for Structured Outputs (fields abbreviated here)"""
    summary: str
    key_points: list[str]

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": ANALYSIS_PROMPT},
        {"role": "user", "content": f"Analyze this:\n{transcript}"},
    ],
    response_format=AnalysisResult,
)

Cost Breakdown (8-Hour Video Example, Dec 2025)

Assumptions:

  • Region: us-east-1
  • Lambda: x86_64 (arm64 is ~20% cheaper)
  • No free tier, no retries
  • Map parallelism: Diarize x5, Transcribe x10

| Service | Calculation | Cost |
|---|---|---|
| Lambda (Diarize) | 10GB × 600s × 6 chunks = 36,000 GB-s | $0.60 |
| Lambda (Transcribe) | 2.94GB × 30s × 900 calls = 79,380 GB-s | $1.32 |
| Lambda (Other) | ExtractAudio, Chunk, Merge, Split, Aggregate, LLM | $0.10 |
| Step Functions | ~6,000 transitions × $0.025/1K | $0.15 |
| S3 | Read/write + temp storage | $0.02 |
| OpenAI API | gpt-4o-mini (300K input + 8K output tokens) | $0.10 |
| **Total** | | **~$2.30** |

By fully leveraging pyannote.audio and faster-whisper, this is about 5x more cost-efficient than AWS Transcribe ($0.024/min = $11.52 for 8 hours). Using arm64 drops it to ~$1.90, making it about 6x more efficient.
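The Lambda line items in the table follow directly from the per-GB-second price (us-east-1 x86_64 is $0.0000166667/GB-s at the time of writing); a quick sanity check of the arithmetic:

```python
GB_S_PRICE = 0.0000166667  # us-east-1 x86_64 Lambda price per GB-second

diarize = 10 * 600 * 6 * GB_S_PRICE        # 36,000 GB-s -> ~$0.60
transcribe = 2.94 * 30 * 900 * GB_S_PRICE  # 79,380 GB-s -> ~$1.32
sfn = 6000 * 0.025 / 1000                  # 6,000 transitions -> $0.15

print(round(diarize, 2), round(transcribe, 2), round(sfn, 2))  # 0.6 1.32 0.15
```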

Implementation Pitfalls and Solutions

1. States.DataLimitExceeded (256KB Limit)

Symptom: When processing 900+ segments, the Step Functions Map state throws this error:

States.DataLimitExceeded - The state/task returned a result with a size
exceeding the maximum number of bytes service limit.

Cause: Step Functions has a 256KB payload limit, and accumulating all Map state results exceeds it.

Solution:

cdk-stack.ts
// CDK: Discard Map state results
const transcribeSegments = new sfn.Map(this, "TranscribeSegments", {
  itemsPath: "$.segment_files",
  maxConcurrency: 10,
  resultPath: sfn.JsonPath.DISCARD,  // ← This is key
});
transcribe_lambda.py
# Lambda side: save the full result to S3
result_key = f"transcribe_results/{segment_name}.json"
s3.put_object(
    Bucket=bucket,
    Key=result_key,
    Body=json.dumps(result_data, ensure_ascii=False),
)
# Return only lightweight metadata to Step Functions
return {"bucket": bucket, "result_key": result_key}
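On the AggregateResults side, the per-segment results are then listed and read back from S3 rather than passed through the state machine payload. A sketch under the same key layout (the prefix and the `start` field are assumptions; boto3 is imported lazily so the pure merge step stays testable):

```python
import json

def load_segment_results(bucket: str, prefix: str = "transcribe_results/") -> list[dict]:
    """Read every per-segment JSON result back from S3 (bypasses the 256KB limit)."""
    import boto3  # deferred import: only needed when actually talking to AWS
    s3 = boto3.client("s3")
    results = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            results.append(json.loads(body))
    return results

def merge_transcript(results: list[dict]) -> list[dict]:
    """Order segments by start time to rebuild the full transcript."""
    return sorted(results, key=lambda r: r.get("start", 0.0))
```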

2. PyTorch 2.6+ torch.load Issue

Symptom: pyannote.audio model loading throws:

FutureWarning: You are using `torch.load` with `weights_only=False`

Solution: Monkey-patch torch.load

patch_torch.py
import torch
 
_orig_torch_load = torch.load
 
def _torch_load_legacy(*args, **kwargs):
    """Always call torch.load with weights_only=False"""
    kwargs["weights_only"] = False
    return _orig_torch_load(*args, **kwargs)
 
torch.load = _torch_load_legacy  # Apply BEFORE pyannote import

Monkey-patch Risk: This modifies PyTorch internals and may break in future versions. If possible, wait for pyannote.audio's safetensors support or check for official workarounds.

3. Lambda Container Model Download Strategy

Problem: Hugging Face models (pyannote, whisper) are several GB. Downloading at cold start causes Lambda timeout.

Solution: Include models at build time

Dockerfile
FROM public.ecr.aws/lambda/python:3.11

ARG HF_TOKEN
ENV HF_HOME=/var/task/models
RUN pip install huggingface_hub
RUN python -c "from huggingface_hub import snapshot_download; \
    snapshot_download('pyannote/speaker-diarization-3.1', token='${HF_TOKEN}')"

A plain ARG can leak into the image history, so it's safer to pass the token via a BuildKit secret, which never lands in a layer:

RUN --mount=type=secret,id=hf_token \
    HF_TOKEN=$(cat /run/secrets/hf_token) python download_models.py

4. Choosing 8-Minute Chunk Length

| Chunk Length | Result |
|---|---|
| 5 minutes | Speaker diarization accuracy dropped (context too short) |
| 10 minutes | Lambda memory insufficient (barely fits in 10GB) |
| 15 minutes | Exceeded Lambda's 15-minute timeout |
| 8 minutes | Optimal balance of accuracy, memory, and time |

Why 30-second overlap:

  • Speaker changes typically have 2-3 second gaps
  • 30 seconds reliably captures speaker changes at boundaries
  • Longer overlap increases redundant processing and costs
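Because neighboring chunks both see the overlap region, the merge step must deduplicate segments there. One simple rule (a sketch; the segment dict format is an assumption) is to keep only segments whose midpoint falls inside the chunk's effective, non-overlap range:

```python
def dedupe_overlap(segments: list[dict], effective_start: float, effective_end: float) -> list[dict]:
    """Keep only segments whose midpoint lies in [effective_start, effective_end)."""
    kept = []
    for seg in segments:  # each seg: {"start": float, "end": float, "speaker": str}
        mid = (seg["start"] + seg["end"]) / 2
        if effective_start <= mid < effective_end:
            kept.append(seg)
    return kept
```

With 8-minute chunks, chunk 0's effective range is 0–480s and chunk 1's is 480–960s, so a segment straddling 480s is claimed by exactly one chunk.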

Future Plans: Google Meet Auto-Integration

Planning to use the Auto-Recording feature from Google Meet REST API (added April 2025) for automatic recording and analysis.

Google Calendar (meeting schedule)
       ▼ Cloud Functions (Calendar Webhook)
Google Meet Space (Auto-Recording enabled)
       ▼ Recording complete
Google Drive (recording storage)
       ▼ Workspace Events API + Pub/Sub
EventBridge (Cross-Cloud)
       ▼
Lambda (DownloadRecording)
       ▼
S3 → Step Functions (existing pipeline)
       ▼
DynamoDB + AppSync → Dashboard

Summary

  • ~$2/month fixed cost for speaker diarization transcription pipeline
  • AWS Step Functions + Lambda fully serverless — pay only for what you use
  • pyannote.audio + faster-whisper + gpt-4o-mini for high quality at low cost
  • 8-hour video for ~$2.3 (~5x more cost-efficient than AWS Transcribe)
  • Parallel chunk processing + embedding clustering handles long audio
  • 256KB limit solved with resultPath: DISCARD + S3 passthrough

All code is available on GitHub.
