
Building an AWS Serverless Pipeline for Speaker Diarization and Transcription at ~$1 per Video

Tags: aws · serverless · machine-learning · python

TL;DR

  • ~$2/month fixed cost (Secrets Manager + ECR + logs) for a speaker diarization transcription pipeline
  • AWS Step Functions + Lambda fully serverless architecture
  • pyannote.audio 3.1 for speaker diarization, faster-whisper for transcription, gpt-4o-mini for LLM analysis
  • 8-hour video processing completed for ~$2.3 (x86, no free tier) — about 5x more cost-efficient than AWS Transcribe
  • Deep dive into pitfalls like States.DataLimitExceeded and their solutions

Repository: github.com/ekusiadadus/ek-transcript

Introduction

I've been analyzing more and more user interview recordings lately. When evaluating existing solutions:

  • AWS Transcribe: ~$11.52 for 8 hours ($0.024/min), speaker diarization accuracy not great
  • Commercial SaaS: $50–$200/month fixed fees, charged even in months with no usage
  • Always-on GPU server: EC2 g4dn.xlarge costs $380+/month — too expensive for personal use

The biggest problem was fixed monthly costs. I only use this a few times a month, yet I'd be paying every month. I wanted pure pay-per-use pricing with near-zero monthly fixed costs — that was my top priority.

So I decided to build my own pipeline using AWS serverless services.

Requirements

  1. Zero monthly fixed cost (pay only for what you use)
  2. Support for long videos up to 8 hours
  3. Speaker diarization (who said what)
  4. High-accuracy Japanese transcription
  5. LLM-powered summarization and analysis
  6. Low cost (~$1 per video)
  7. Fully serverless

System Architecture

┌─────────────────────────────────────────────────────────────────┐
│                          AWS Cloud                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  S3 (Input)  ──▶  EventBridge  ──▶  Lambda (StartPipeline)     │
│  uploads/         (Object Created)        │                     │
│                                           ▼                     │
│                                    DynamoDB (InterviewsTable)   │
│                                           │                     │
│                                           ▼                     │
│  ┌────────────────── Step Functions ─────────────────────────┐  │
│  │                                                           │  │
│  │  ExtractAudio ──▶ ChunkAudio ──▶ DiarizeChunks (Map x5) │  │
│  │       │                                    │              │  │
│  │       ▼                                    ▼              │  │
│  │  S3 (Output)    MergeSpeakers ◀────────────┘              │  │
│  │       ▲              │                                    │  │
│  │       │              ▼                                    │  │
│  │       │         SplitBySpeaker ──▶ Transcribe (Map x10)  │  │
│  │       │                                    │              │  │
│  │       │              AggregateResults ◀────┘              │  │
│  │       │                    │                              │  │
│  │       └────────── LLMAnalysis ◀────┘                     │  │
│  │                   (gpt-4o-mini)                           │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Data Flow Details

  1. ExtractAudio: video.mp4 → audio.wav (16kHz mono)
  2. ChunkAudio: audio.wav → chunk_0.wav, chunk_1.wav, ... (8min+30s overlap)
  3. DiarizeChunks (Map x5): pyannote speaker diarization per chunk
  4. MergeSpeakers: Global speaker unification via embedding vectors
  5. SplitBySpeaker: Split audio by speaker segments
  6. TranscribeSegments (Map x10): faster-whisper transcription
  7. AggregateResults: Merge results → transcript.json
  8. LLMAnalysis: gpt-4o-mini structured analysis → analysis.json

Why Serverless: Minimizing Monthly Fixed Costs

| Component | Pricing Model | Monthly Fixed Cost |
|---|---|---|
| Lambda | Per-execution | $0 |
| Step Functions | Per-transition | $0 |
| S3 | Storage + requests | $0+ |
| DynamoDB | On-demand | $0 |
| EventBridge | Per-event | $0 |
| Cognito | Free up to 50K MAU | $0 |
| AppSync | Per-request | $0+ |
| Secrets Manager | Per-secret | $0.80/mo (2 secrets) |
| ECR | Image storage | $0.50–$1.00/mo |
| CloudWatch Logs | Log storage | ~$0.05/mo |

Actual Monthly Fixed Costs

Secrets Manager:    $0.80/mo        (OpenAI + HuggingFace, 2 secrets)
ECR:                $0.50–$1.00/mo  (Docker images with ML models, 8 images)
CloudWatch Logs:    ~$0.05/mo       (Step Functions + Lambda logs)
───────────────────────────────────────────
Total:              ~$1.50–$2.00/mo

**Only ~$2 even in months with zero usage.** Compared to commercial SaaS at $50–$200/month, that's $576–$2,376 in annual savings.

Design Evolution

Initial Design: Simple Sequential Processing

[Video] → ExtractAudio → Diarize → SplitBySpeaker → Transcribe → LLMAnalysis

                    (single Lambda processing all audio)

Problems:

  • Lambda's 15-minute timeout couldn't complete speaker diarization for 8-hour audio
  • pyannote.audio memory usage was massive (10GB+)
  • Sequential processing made total processing time too long

Alternative: ECS Fargate Processing

Evaluation:

  • GPU instance (g4dn.xlarge) cost was high ($0.526/hour)
  • 8-hour video would cost $4+
  • Spot instances had reliability concerns

Current Design: Parallel Chunk Processing

                    ┌─ DiarizeChunk_0 ─┐
[Video] → Chunk →   ├─ DiarizeChunk_1 ─┤ → Merge → Split → Transcribe(parallel) → LLM
                    ├─ DiarizeChunk_2 ─┤
                    └─      ...       ─┘

Design Points:

  1. 8-minute chunks + 30-second overlap: Fits within Lambda's 15-minute limit, captures speaker changes at boundaries
  2. Speaker unification via embedding vectors: Even if SPEAKER_00 is different people across chunks, cosine similarity clustering unifies them
  3. Map State parallel execution: Speaker diarization x5, transcription x10 parallel processing for speed

Technology Choices and Rationale

| Technology | Reason | vs. Alternatives |
|---|---|---|
| pyannote.audio 3.1 | Latest speaker diarization accuracy, Hugging Face integration | Higher accuracy than AWS Transcribe speaker diarization |
| faster-whisper | 4-8x faster than Whisper, int8 quantization support | OpenAI Whisper API is more expensive |
| gpt-4o-mini | Structured Outputs support, low cost | Claude lacked Structured Outputs (at the time) |
| Lambda + Container | Up to 10GB image, cold start acceptable | ECS Fargate has always-on cost concerns |
| Step Functions | Complex workflow management, error handling | SQS + Lambda makes state management complex |

Component Implementation Details

1. ExtractAudio Lambda

Extracts 16kHz mono WAV from video — Whisper's recommended sample rate.

extract_audio.py
import subprocess

def extract_audio(input_path: str, output_path: str) -> None:
    """Extract 16kHz mono WAV from video"""
    cmd = [
        "ffmpeg", "-i", input_path,
        "-vn",                    # No video
        "-acodec", "pcm_s16le",   # 16-bit PCM
        "-ar", "16000",           # 16kHz
        "-ac", "1",               # Mono
        "-y", output_path,        # Overwrite output without prompting
    ]
    subprocess.run(cmd, check=True)

2. ChunkAudio Lambda

Splits into 8-minute chunks + 30-second overlap. The overlap captures speaker changes at boundaries accurately.

chunk_audio.py
CHUNK_DURATION = 480      # 8 minutes
OVERLAP_DURATION = 30     # 30-second overlap
 
# chunk_0: 0–510s (effective: 0–480)
# chunk_1: 450–960s (effective: 480–960)
# chunk_2: 900–1410s (effective: 960–1440)
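The boundaries in the comments above follow a simple pattern: each chunk is CHUNK_DURATION + OVERLAP_DURATION (510s) long, and starts every CHUNK_DURATION - OVERLAP_DURATION (450s). A minimal sketch of that calculation (the function name is mine; the actual ffmpeg splitting is omitted):

```python
CHUNK_DURATION = 480      # 8 minutes of effective audio per chunk
OVERLAP_DURATION = 30     # seconds shared with the neighboring chunk

def chunk_bounds(index: int, total_duration: float) -> tuple[float, float]:
    """Return (start, end) seconds for one chunk, including overlap.

    Matches the comments above: chunk_0 = 0-510s, chunk_1 = 450-960s, ...
    """
    start = index * (CHUNK_DURATION - OVERLAP_DURATION)
    end = min(start + CHUNK_DURATION + OVERLAP_DURATION, total_duration)
    return float(start), end
```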

3. DiarizeChunk Lambda (Parallel Execution)

Speaker diarization with pyannote.audio 3.1. Extracts embedding vectors for each speaker and saves to S3.

pyannote.audio License Note: pyannote/speaker-diarization-3.1 requires license agreement on Hugging Face. Visit the model page to agree to the license terms and obtain your HF_TOKEN on first use. Verify licensing requirements for commercial use.

diarize_chunk.py
import torch
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    token=hf_token,
)

# Use GPU if available (CPU on Lambda)
if torch.cuda.is_available():
    pipeline.to(torch.device("cuda"))

diarization = pipeline({"waveform": audio_tensor, "sample_rate": sample_rate})

# Per-speaker embedding vectors, saved to S3 for the MergeSpeakers step
speaker_embeddings = extract_speaker_embeddings(audio_path, segments)
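extract_speaker_embeddings is not shown in the excerpt above. One plausible minimal version mean-pools the per-segment embedding vectors (e.g. from pyannote's embedding model) into a single vector per chunk-local speaker label; the function name and shapes here are illustrative assumptions, not the repository's actual code:

```python
import numpy as np

def pool_speaker_embeddings(
    segment_embeddings: list[np.ndarray],  # one (D,) vector per segment
    segment_speakers: list[str],           # chunk-local label per segment, e.g. "SPEAKER_00"
) -> dict[str, np.ndarray]:
    """Average L2-normalized segment embeddings into one vector per speaker."""
    grouped: dict[str, list[np.ndarray]] = {}
    for emb, spk in zip(segment_embeddings, segment_speakers):
        grouped.setdefault(spk, []).append(emb / np.linalg.norm(emb))
    return {spk: np.mean(vecs, axis=0) for spk, vecs in grouped.items()}
```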

4. MergeSpeakers Lambda

Clusters speakers across chunks using cosine similarity of embedding vectors.

merge_speakers.py
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity
 
similarity_matrix = cosine_similarity(all_embeddings)
distance_matrix = 1 - similarity_matrix
 
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=1 - 0.75,  # 75%+ similarity = same speaker
    metric="precomputed",
    linkage="average",
)
labels = clustering.fit_predict(distance_matrix)

5. Transcribe Lambda (Parallel Execution)

High-speed transcription with faster-whisper (medium model).

transcribe.py
from faster_whisper import WhisperModel
 
model = WhisperModel("medium", device="cpu", compute_type="int8")
segments, info = model.transcribe(audio_path, language="ja", beam_size=5)
text = "".join([seg.text for seg in segments])

6. LLMAnalysis Lambda

Structured analysis using gpt-4o-mini's Structured Outputs.

llm_analysis.py
from openai import OpenAI
from pydantic import BaseModel

class AnalysisResult(BaseModel):
    """Pydantic schema for Structured Outputs (fields abbreviated here)"""
    summary: str
    key_points: list[str]

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": ANALYSIS_PROMPT},
        {"role": "user", "content": f"Analyze this:\n{transcript}"},
    ],
    response_format=AnalysisResult,
)

Cost Breakdown (8-Hour Video Example, Dec 2025)

Assumptions:

  • Region: us-east-1
  • Lambda: x86_64 (arm64 is ~20% cheaper)
  • No free tier, no retries
  • Map parallelism: Diarize x5, Transcribe x10

| Service | Calculation | Cost |
|---|---|---|
| Lambda (Diarize) | 10GB × 600s × 6 chunks = 36,000 GB-s | $0.60 |
| Lambda (Transcribe) | 2.94GB × 30s × 900 calls = 79,380 GB-s | $1.32 |
| Lambda (Other) | ExtractAudio, Chunk, Merge, Split, Aggregate, LLM | $0.10 |
| Step Functions | ~6,000 transitions × $0.025/1K | $0.15 |
| S3 | Read/write + temp storage | $0.02 |
| OpenAI API | gpt-4o-mini (300K input + 8K output tokens) | $0.10 |
| **Total** | | **~$2.30** |

By fully leveraging pyannote.audio and faster-whisper, this is about 5x more cost-efficient than AWS Transcribe ($0.024/min = $11.52 for 8 hours). Using arm64 drops it to ~$1.90, making it about 6x more efficient.
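The Lambda line items in the table follow directly from the per-GB-second price (us-east-1 x86_64 is $0.0000166667/GB-s at the time of writing); a quick sanity check of the arithmetic:

```python
GB_S_PRICE = 0.0000166667  # us-east-1 x86_64 Lambda price per GB-second

diarize = 10 * 600 * 6 * GB_S_PRICE        # 36,000 GB-s -> ~$0.60
transcribe = 2.94 * 30 * 900 * GB_S_PRICE  # 79,380 GB-s -> ~$1.32
sfn = 6000 * 0.025 / 1000                  # 6,000 transitions -> $0.15

print(round(diarize, 2), round(transcribe, 2), round(sfn, 2))  # 0.6 1.32 0.15
```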

Implementation Pitfalls and Solutions

1. States.DataLimitExceeded (256KB Limit)

Symptom: When processing 900+ segments, the Step Functions Map state throws this error:

States.DataLimitExceeded - The state/task returned a result with a size
exceeding the maximum number of bytes service limit.

Cause: Step Functions has a 256KB payload limit, and accumulating all Map state results exceeds it.

Solution:

cdk-stack.ts
// CDK: Discard Map state results
const transcribeSegments = new sfn.Map(this, "TranscribeSegments", {
  itemsPath: "$.segment_files",
  maxConcurrency: 10,
  resultPath: sfn.JsonPath.DISCARD,  // ← This is key
});
transcribe_lambda.py
# Lambda side: save the full result to S3
result_key = f"transcribe_results/{segment_name}.json"
s3.put_object(
    Bucket=bucket,
    Key=result_key,
    Body=json.dumps(result_data, ensure_ascii=False),
)
# Return only lightweight metadata to Step Functions
return {"bucket": bucket, "result_key": result_key}
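On the AggregateResults side, the per-segment results are then listed and read back from S3 rather than passed through the state machine payload. A sketch under the same key layout (the prefix and the `start` field are assumptions; boto3 is imported lazily so the pure merge step stays testable):

```python
import json

def load_segment_results(bucket: str, prefix: str = "transcribe_results/") -> list[dict]:
    """Read every per-segment JSON result back from S3 (bypasses the 256KB limit)."""
    import boto3  # deferred import: only needed when actually talking to AWS
    s3 = boto3.client("s3")
    results = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            results.append(json.loads(body))
    return results

def merge_transcript(results: list[dict]) -> list[dict]:
    """Order segments by start time to rebuild the full transcript."""
    return sorted(results, key=lambda r: r.get("start", 0.0))
```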

2. PyTorch 2.6+ torch.load Issue

Symptom: pyannote.audio model loading throws:

FutureWarning: You are using `torch.load` with `weights_only=False`

Solution: Monkey-patch torch.load

patch_torch.py
import torch
 
_orig_torch_load = torch.load
 
def _torch_load_legacy(*args, **kwargs):
    """Always call torch.load with weights_only=False"""
    kwargs["weights_only"] = False
    return _orig_torch_load(*args, **kwargs)
 
torch.load = _torch_load_legacy  # Apply BEFORE pyannote import

Monkey-patch Risk: This modifies PyTorch internals and may break in future versions. If possible, wait for pyannote.audio's safetensors support or check for official workarounds.

3. Lambda Container Model Download Strategy

Problem: Hugging Face models (pyannote, whisper) are several GB. Downloading at cold start causes Lambda timeout.

Solution: Include models at build time

Dockerfile
FROM public.ecr.aws/lambda/python:3.11

ARG HF_TOKEN
ENV HF_HOME=/var/task/models
RUN pip install huggingface_hub
RUN python -c "from huggingface_hub import snapshot_download; \
    snapshot_download('pyannote/speaker-diarization-3.1', token='${HF_TOKEN}')"

A plain ARG can leak into the image history, so it's safer to pass the token via a BuildKit secret, which never lands in a layer:

RUN --mount=type=secret,id=hf_token \
    HF_TOKEN=$(cat /run/secrets/hf_token) python download_models.py

4. Choosing 8-Minute Chunk Length

| Chunk Length | Result |
|---|---|
| 5 minutes | Speaker diarization accuracy dropped (context too short) |
| 10 minutes | Lambda memory insufficient (barely fits in 10GB) |
| 15 minutes | Exceeded Lambda's 15-minute timeout |
| 8 minutes | Optimal balance of accuracy, memory, and time |

Why 30-second overlap:

  • Speaker changes typically have 2-3 second gaps
  • 30 seconds reliably captures speaker changes at boundaries
  • Longer overlap increases redundant processing and costs
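Because neighboring chunks both see the overlap region, the merge step must deduplicate segments there. One simple rule (a sketch; the segment dict format is an assumption) is to keep only segments whose midpoint falls inside the chunk's effective, non-overlap range:

```python
def dedupe_overlap(segments: list[dict], effective_start: float, effective_end: float) -> list[dict]:
    """Keep only segments whose midpoint lies in [effective_start, effective_end)."""
    kept = []
    for seg in segments:  # each seg: {"start": float, "end": float, "speaker": str}
        mid = (seg["start"] + seg["end"]) / 2
        if effective_start <= mid < effective_end:
            kept.append(seg)
    return kept
```

With 8-minute chunks, chunk 0's effective range is 0–480s and chunk 1's is 480–960s, so a segment straddling 480s is claimed by exactly one chunk.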

Future Plans: Google Meet Auto-Integration

Planning to use the Auto-Recording feature from Google Meet REST API (added April 2025) for automatic recording and analysis.

Google Calendar (meeting schedule)
       ▼ Cloud Functions (Calendar Webhook)
Google Meet Space (Auto-Recording enabled)
       ▼ Recording complete
Google Drive (recording storage)
       ▼ Workspace Events API + Pub/Sub
EventBridge (Cross-Cloud)
       ▼
Lambda (DownloadRecording)
       ▼
S3 → Step Functions (existing pipeline)
       ▼
DynamoDB + AppSync → Dashboard

Summary

  • ~$2/month fixed cost for speaker diarization transcription pipeline
  • AWS Step Functions + Lambda fully serverless — pay only for what you use
  • pyannote.audio + faster-whisper + gpt-4o-mini for high quality at low cost
  • 8-hour video for ~$2.3 (~5x more cost-efficient than AWS Transcribe)
  • Parallel chunk processing + embedding clustering handles long audio
  • 256KB limit solved with resultPath: DISCARD + S3 passthrough

All code is available on GitHub.
