Topics
All the deep-dives, grouped by theme. Each article is a standalone, internals-first read: start anywhere, or follow a cluster end to end. Prefer chronological? The full archive has a timeline and tag filter.
OLAP & Analytical Engines
How columnar warehouses and real-time analytics engines actually execute queries — storage, vectorization, and the trade-offs between them.
- StarRocks vs ClickHouse vs Doris: Which Real-Time OLAP Engine, and When
- StarRocks & Apache Doris Internals: MPP Real-Time OLAP, FE/BE, and the Data Models
- ClickHouse at Scale, Part 3: Insert Performance & Real-Time Streaming from Kafka
- ClickHouse Optimization, Part 2: Engines, ORDER BY, Data Types, and the JOIN Trap
- ClickHouse Internals, Part 1: How a Columnar OLAP Engine Hits Billions of Rows a Second
- DuckDB Internals: The Embedded OLAP Engine That Runs Anywhere
- Apache Arrow & DataFusion: The Columnar In-Memory Standard
- Druid vs Pinot: Real-Time OLAP Serving and Sub-Second Concurrency
- BigQuery Internals: Dremel, Colossus, and Separating Storage from Compute
ClickHouse
The ClickHouse deep-dive series: architecture, schema and query optimization, and real-time ingestion.
- StarRocks vs ClickHouse vs Doris: Which Real-Time OLAP Engine, and When
- ClickHouse at Scale, Part 3: Insert Performance & Real-Time Streaming from Kafka
- ClickHouse Optimization, Part 2: Engines, ORDER BY, Data Types, and the JOIN Trap
- ClickHouse Internals, Part 1: How a Columnar OLAP Engine Hits Billions of Rows a Second
Streaming & Real-Time Data
Event streaming and stream processing — logs, exactly-once, watermarks, CDC, and streaming databases.
- Real-Time Snowflake on AWS: Snowpipe Streaming, Dynamic Tables, and Lessons Learned
- Designing a Data Pipeline: Batch vs Streaming, Idempotency, and Backfills
- Why Kafka Is So Fast — and How It Scales
- ClickHouse at Scale, Part 3: Insert Performance & Real-Time Streaming from Kafka
- Streaming Databases: Materialize, RisingWave, and Incremental View Maintenance
- Amazon Kinesis vs Apache Kafka (MSK): Streaming Data on AWS Without the Regret
- Flink vs Kafka Streams vs Spark Structured Streaming: Choosing a Stream Processor
- Kafka Internals: The Commit Log That Powers Half the Data Stack
- Change Data Capture with Debezium: Log-Based Capture and the Outbox Pattern
- Apache Flink Internals: State, Checkpoints, Watermarks, and Exactly-Once
- Apache Pulsar vs Kafka: Segment Storage, BookKeeper, and Tiered Storage
- Druid vs Pinot: Real-Time OLAP Serving and Sub-Second Concurrency
- The Dataflow Model: Event Time, Windows, Watermarks, and Triggers
- Avro, Protobuf & the Schema Registry: Serialization and Schema Evolution
Kafka & Event Streaming
Kafka internals and the architectures around it — why it's fast, how it scales, and how alternatives differ.
- Why Kafka Is So Fast — and How It Scales
- ClickHouse at Scale, Part 3: Insert Performance & Real-Time Streaming from Kafka
- Amazon Kinesis vs Apache Kafka (MSK): Streaming Data on AWS Without the Regret
- Flink vs Kafka Streams vs Spark Structured Streaming: Choosing a Stream Processor
- Kafka Internals: The Commit Log That Powers Half the Data Stack
- Change Data Capture with Debezium: Log-Based Capture and the Outbox Pattern
- Apache Pulsar vs Kafka: Segment Storage, BookKeeper, and Tiered Storage
- Avro, Protobuf & the Schema Registry: Serialization and Schema Evolution
Snowflake
Snowflake from internals to real-time pipelines, Cortex AI, and regulated data-vault builds.
- Building an AI Assistant with Snowflake Cortex: The Whole Thing, End to End
- Snowflake Cortex AI in 2026: Agents, Analyst, and the Agentic Data Cloud
- End-to-End Data Vault on Snowflake with dbt and Openflow: Engineering, Data Quality, and Governance
- Real-Time Snowflake on AWS: Snowpipe Streaming, Dynamic Tables, and Lessons Learned
- RWE Clinicogenomics on Snowflake, Part 3: Cortex AI, Genomics in Snowpark, and the Analytics Surface
- RWE Clinicogenomics on Snowflake, Part 2: Governance, Tokenization, Clean Rooms & Data Contracts
- Snowflake Internals: How the Three-Layer Architecture Actually Works
- RWE Clinicogenomics on Snowflake, Part 1: The Migration and the TCO Case
- Snowflake Cortex AI Deep Dive: LLMs Inside Your Data Warehouse
- Building a Data Vault 2.0 on Snowflake with dbt: Patterns, Scaling, and Lessons Learned
Databricks, Spark & Lakehouse
The Spark execution model, Photon, the Delta log, performance tuning, and the lakehouse platform.
- Building a HIPAA-Compliant Health Data Lakehouse on Databricks
- Spark Performance Optimization on Databricks: AQE, Shuffle, Skew & Data Layout
- Databricks Internals: Photon, the Delta Log, and How a Query Actually Runs
- The Databricks Data Intelligence Platform: A Practitioner's Overview
- Flink vs Kafka Streams vs Spark Structured Streaming: Choosing a Stream Processor
- Spark Internals (2.x): RDDs, the DAG Scheduler, Catalyst, and Tungsten
Business Intelligence & Semantic Layers
The engines and practices behind BI — Tableau (VizQL, Hyper), Power BI's VertiPaq, Microsoft Fabric, and semantic-model design.
- Text-to-SQL and the Semantic Layer: Why Chat-With-Your-Data Breaks
- AI-Powered Power BI Reporting: Agent Skills, PBIR, and Copilot CLI
- CoddSpeed: Inside Microsoft Fabric's GPU-Accelerated Query Engine (SIGMOD 2026 Best Paper)
- Microsoft Fabric vs Databricks on Azure: The 2025 Decision Guide
- Direct Lake vs Import vs DirectQuery: How to Stop Guessing and Actually Choose
- Microsoft Fabric in the Azure Ecosystem: Migration, Integration, and the Databricks Question
- Power BI & Semantic Models Deep Dive
- Understanding MS Fabric Internals
- VertiPaq Internals: What's Really Happening When Power BI Loads Your Model
- Tableau Best Practices: Extracts, Performance, and Fast Dashboards
- Tableau Internals: VizQL, the Hyper Engine, and How a Viz Renders
RAG & Retrieval
Retrieval-augmented generation end to end — retrieval types, vector search internals, GraphRAG, and cloud builds.
- Building an AI Assistant with Snowflake Cortex: The Whole Thing, End to End
- Building a Clinico-Genomics RAG on AWS: Architecture, Best Practices, and Hard-Won Lessons
- RAG on AWS: Bedrock Knowledge Bases, GraphRAG, and Amazon Neptune
- RAG From the Ground Up: Types, Architecture, and What Actually Moves the Needle
- Building a RAG System on GCP for a Real Estate Agency
- RAG on GCP: From First Corpus to Production — A Practitioner's Guide
- How Vector Search Works: HNSW, IVF, and Product Quantization
AI Agents & LLM Systems
Agents, agent memory, LLM inference and serving, observability, MCP, and the Transformer that started it.
- Evaluating LLM and Agent Systems in Production: Evals That Actually Work
- Text-to-SQL and the Semantic Layer: Why Chat-With-Your-Data Breaks
- LLM Security: Prompt Injection, Data Exfiltration, and Guardrails
- Building Production MCP Servers: Tools, Transports, Auth, and Security
- AI-Powered Power BI Reporting: Agent Skills, PBIR, and Copilot CLI
- AI Strategy: Use-Case Portfolios, Build vs Buy, and the Demo-to-Production Gap
- Building an AI Assistant with Snowflake Cortex: The Whole Thing, End to End
- The 2026 AI Agent Landscape: Hermes vs Claude Code & Cowork vs OpenClaw vs Gemini Spark
- Building a Clinico-Genomics RAG on AWS: Architecture, Best Practices, and Hard-Won Lessons
- Snowflake Cortex AI in 2026: Agents, Analyst, and the Agentic Data Cloud
- RAG on AWS: Bedrock Knowledge Bases, GraphRAG, and Amazon Neptune
- RAG From the Ground Up: Types, Architecture, and What Actually Moves the Needle
- AI Agent Memory: The Infrastructure Layer Nobody Told You About
- LLM Observability in Production: What to Instrument Before Your First Incident
- The Evidence Layer in Healthcare & Biotech AI: HIPAA, 21 CFR Part 11, GxP, GMLP
- Designing Multi-Agent AI Over Sensitive Data: Traceable and Observable by Construction
- State of AI Engineering 2025: Agents in Production, MCP Goes Universal, and the EU Starts Regulating
- Building a RAG System on GCP for a Real Estate Agency
- RAG on GCP: From First Corpus to Production — A Practitioner's Guide
- The Model Context Protocol (MCP) Explained: Architecture and Internals
- GraphRAG: When Your Vector Database Doesn't Know the Whole Story
- Vector Databases Compared: Pinecone vs Weaviate vs Qdrant vs pgvector vs FAISS
- Snowflake Cortex AI Deep Dive: LLMs Inside Your Data Warehouse
- State of AI Engineering 2024: Agents, MCP, and Open-Source Catches Up
- LLM Inference Internals: KV Cache, PagedAttention, and vLLM
- State of AI Engineering 2023: LangChain Explosions, RAG Everywhere, and the Open-Source LLM Surprise
- How Vector Search Works: HNSW, IVF, and Product Quantization
- State of AI Engineering 2022: The Year Before Everything Changed
- The Transformer Explained: Attention, BERT, and the NLP Inflection Point
- From Word2Vec to Embeddings: How Text Became Vectors
MLOps
The ML lifecycle in production — experiment tracking and registries, feature stores, and model serving.
Governance & Compliance
Lineage as regulatory proof, auditable AI over sensitive data, and lakehouse governance with Unity Catalog.
- LLM Security: Prompt Injection, Data Exfiltration, and Guardrails
- AI Strategy: Use-Case Portfolios, Build vs Buy, and the Demo-to-Production Gap
- Data Strategy: A Practitioner's Framework Beyond the Tool List
- The Evidence Layer in Healthcare & Biotech AI: HIPAA, 21 CFR Part 11, GxP, GMLP
- Designing Multi-Agent AI Over Sensitive Data: Traceable and Observable by Construction
- The Evidence Layer: Data Lineage as Regulatory Proof in Banking (BCBS 239, CCAR, SOX)
- End-to-End Data Vault on Snowflake with dbt and Openflow: Engineering, Data Quality, and Governance
- Data Contracts in Practice: ODCS, dbt, Streaming, and the Producer Handshake
- GA4GH Standards: How Genomic Data Is Shared Without Moving It
- Building a HIPAA-Compliant Health Data Lakehouse on Databricks
- RWE Clinicogenomics on Snowflake, Part 2: Governance, Tokenization, Clean Rooms & Data Contracts
NoSQL & Distributed Databases
LSM-tree storage, consistent-hashing rings, inverted indexes, and the CAP trade-offs behind NoSQL stores.
- Cassandra Internals: LSM-Trees, the Ring, Gossip, and Tunable Consistency
- Redis Internals: Single-Threaded Speed, Data Structures, and Persistence
- Neo4j & Graph Databases: Index-Free Adjacency and Cypher
- Elasticsearch Internals: The Inverted Index, Lucene Segments, and Sharding
- HBase Internals: Regions, the LSM Write Path, and Row-Key Design
- MongoDB Internals: The Document Model, WiredTiger, Replica Sets, and Sharding
Cloud Data Platforms (AWS, Azure, GCP)
Platform-specific data services, integrations, and real migration war stories across the three major clouds.
- Building a Clinico-Genomics RAG on AWS: Architecture, Best Practices, and Hard-Won Lessons
- RAG on AWS: Bedrock Knowledge Bases, GraphRAG, and Amazon Neptune
- Building a RAG System on GCP for a Real Estate Agency
- RAG on GCP: From First Corpus to Production — A Practitioner's Guide
- Real-Time Snowflake on AWS: Snowpipe Streaming, Dynamic Tables, and Lessons Learned
- RWE Clinicogenomics on Snowflake, Part 1: The Migration and the TCO Case
- Migrating from Cloudera Hadoop to GCP Dataproc: War Stories and Lessons Learned
- Azure Synapse Analytics vs Azure Databricks: Real Architectural Differences and When to Choose
- Amazon Kinesis vs Apache Kafka (MSK): Streaming Data on AWS Without the Regret
- Azure Data Factory Deep Dive: Pipelines, Data Flows, and When to Stop Using ADF
- AWS Glue Deep Dive: Crawlers, Job Types, Iceberg Integration, and the Cost Traps
- Redshift Physical Schema Design: Distribution Keys, Sort Keys, Skew, and the Lessons That Cost Us Weeks
- BigQuery Internals: Dremel, Colossus, and Separating Storage from Compute
State of the Industry
Annual retrospectives on data and AI engineering — what actually changed each year.
- State of Data Engineering 2025: Agents Take the Wheel (Mostly)
- State of AI Engineering 2025: Agents in Production, MCP Goes Universal, and the EU Starts Regulating
- State of Data Engineering 2024: Open Catalogs, DuckDB Everywhere, and AI Infiltrates the Stack
- State of AI Engineering 2024: Agents, MCP, and Open-Source Catches Up
- State of Data Engineering 2023: The AI Earthquake and the Format Wars
- State of AI Engineering 2023: LangChain Explosions, RAG Everywhere, and the Open-Source LLM Surprise
- State of Data Engineering 2022: Data Mesh Hype, Data Quality Crisis, and a Chatbot Changes Everything
- State of AI Engineering 2022: The Year Before Everything Changed
- State of Data Engineering 2021: The Modern Data Stack Goes Mainstream
Data Architecture & Engineering
Cross-cutting architecture, modeling, and platform-engineering pieces.
- Analytics Engineering with dbt: The Discipline, Not Just the Tool
- Knowledge Graphs and the Semantic Web: RDF, OWL, SPARQL, and SHACL
- Open Table Formats: Iceberg, Delta Lake, and Hudi — The War Nobody Told Your Data Team About
- Unity Catalog Deep Dive: Governance for the Lakehouse (and Beyond)
- Designing a Data API: Serving Analytical Data to Applications
- System Design for Data Engineers: A Framework for Designing Data Systems
- Designing a Data Warehouse: Layers, Modeling, Storage, and Serving
- FinOps for Data Platforms: Real Cost Governance on Snowflake, Databricks, BigQuery, and Fabric
- Apache Iceberg Internals: Metadata Trees, Snapshots, and the Catalog Wars
- dbt Internals and Best Practices: What Happens When You Run dbt run
- Spark Performance Tuning: Stop Guessing, Start Measuring
- Data Mesh: What Actually Works and What Doesn't — Lessons from the Field
- Apache Airflow Internals: The Scheduler, Executors, and the DAG Model
- Postgres Internals for Data Engineers: MVCC, WAL, and the Planner
- Presto Internals: MPP Query Execution and the Coordinator/Worker Model
- Oracle Exadata vs Teradata: The Enterprise Data Warehouse Showdown
- ZooKeeper & Consensus: Paxos, Raft, and How Distributed Systems Agree
- Storage Engines: B-Trees vs LSM-Trees, and the Read/Write Trade-off
- Parquet & ORC Internals: How Columnar Files Actually Store Data
- Apache Hive & the Hive Metastore: SQL on Hadoop and the Catalog That Outlived It
- Hadoop & HDFS Internals: HDFS, YARN, and the MapReduce Model
- Dimensional Modeling: Kimball, Star Schemas, and Slowly Changing Dimensions