# Building a RAG System on GCP for a Real Estate Agency

📚 This is Part 2 of a 2-part series on RAG on Google Cloud

1. [RAG on GCP: From First Corpus to Production](rag-on-gcp)

1. Building a RAG System on GCP for a Real Estate Agency (you are here)

Part 1 covered the GCP RAG toolbox in the abstract. This part is the opposite: one concrete domain, end to end. We'll build the retrieval system a mid-sized residential real estate agency actually needs — the kind with a few thousand active listings, a dozen agents, and a website where buyers ask questions in plain English. Real estate turns out to be a near-perfect stress test for RAG, because it breaks the naive "embed everything and search" approach in four different ways at once.

By the end you'll have a reference architecture on Google Cloud, the data model that makes hybrid retrieval work, the retrieval flow that combines hard filters with semantic search, and — the part everyone forgets until legal calls — the compliance guardrails that keep a property chatbot out of Fair Housing trouble.

## What the Agency Actually Wants

Strip away the buzzwords and there are three concrete products hiding inside "we want AI for our listings":

- **Buyer-facing search assistant** — a website chatbot that answers "show me 3-bed homes under $750k near downtown with a yard and a home office" and returns real, current listings with links and photos.

- **Agent copilot** — an internal tool that drafts listing descriptions, summarizes comparable sales ("comps"), answers questions about disclosures and contracts, and pulls neighborhood facts on demand.

- **Document Q&A** — grounded answers over the agency's pile of unstructured documents: HOA rules, inspection reports, seller disclosures, neighborhood guides, and market reports.

All three are RAG. But the first one is where the design decisions get interesting, because a listing is *half structured data and half prose*, and the query mixes hard constraints with soft preferences.

## Why Naive RAG Fails Here

If you take the tutorial approach — dump every listing into a vector store, embed the query, return top-k — you'll ship something that demos well and falls apart in week one. Four reasons:

| Problem | Why vector-only RAG breaks |
| --- | --- |
| **Hard constraints are non-negotiable** | "Under $750k" and "3 bedrooms" are filters, not similarity. A semantic search will happily return a beautiful $900k 2-bed because the *description* is similar. Price and bedroom count must be exact filters, not vectors. |
| **Freshness is brutal** | Listings change daily — price drops, status flips to "pending" or "sold." A stale index that shows a sold house as available is worse than no answer. The index must track the source of truth in near-real-time. |
| **Geography is structured, not semantic** | "Near downtown" or "in the Eastside school district" is a spatial/categorical filter. Embeddings don't understand a 5-mile radius or a district boundary. |
| **Compliance is a hard wall** | Under the Fair Housing Act, the system must not steer buyers based on protected characteristics (race, religion, family status, etc.) or answer "is this a good neighborhood for [group]." This can't be left to model judgment. |

The fix is the pattern from Part 1's "Advanced RAG": **structured pre-filtering plus semantic ranking, with the structured store as the source of truth.** Hard constraints filter the candidate set deterministically; semantic search ranks what survives by how well it matches the soft preferences.

## The Architecture

Here's the full system on GCP. Two ingestion paths feed one hybrid retrieval store; the query service combines structured filtering, semantic search, and reranking before grounding Gemini.

```mermaid
flowchart TB
    subgraph Ingest["① Ingestion"]
        MLS["MLS feed / listings DB"]
        DOCS["Documents:\ndisclosures, HOA, reports (GCS)"]
        MLS --> CF["Cloud Functions:\nnormalize + enrich"]
        CF --> BQ["AlloyDB:\nstructured fields + pgvector"]
        DOCS --> DF["Document AI +\nchunking"]
        DF --> EMB["Vertex AI Embeddings\n(text-embedding-005)"]
        PHOTO["Listing photos"] --> MM["Multimodal embeddings"]
        EMB --> BQ
        MM --> BQ
    end

    subgraph Serve["② Retrieval & serving (Cloud Run)"]
        Q["Buyer / agent query"] --> GUARD["Fair-Housing guardrail\n+ query parse"]
        GUARD --> FILTER["Structured pre-filter\n(price, beds, geo, status)"]
        FILTER --> VEC["Semantic search\nover survivors"]
        VEC --> RANK["Vertex Ranking API\nrerank top-k"]
        RANK --> GEM["Gemini:\nground + cite listings"]
        GEM --> A["Answer + listing cards"]
    end

    BQ -.serves.-> FILTER
    BQ -.serves.-> VEC
          
```

Real-estate RAG on GCP. AlloyDB holds both the structured listing fields and the pgvector embeddings, so a single store does deterministic filtering and semantic ranking. The guardrail runs before retrieval, not after.

## The Data Model: One Store, Two Jobs

The single most important decision is keeping structured fields and embeddings **in the same row**. On GCP you have two good options for this: AlloyDB (or Cloud SQL) for PostgreSQL with the `pgvector` extension, or BigQuery with vector search. For a transactional, frequently-updated listings catalog with sub-second query needs, AlloyDB is the better fit — it gives you ACID updates for freshness and a vector index in the same table you filter on.

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE listings (
    id              BIGINT PRIMARY KEY,
    status          TEXT NOT NULL,          -- active | pending | sold
    price           INTEGER NOT NULL,
    bedrooms        SMALLINT NOT NULL,
    bathrooms       NUMERIC(3,1) NOT NULL,
    sqft            INTEGER,
    city            TEXT,
    school_district TEXT,
    geo             GEOGRAPHY(Point, 4326), -- lat/long for radius queries
    description     TEXT,                   -- the prose half
    updated_at      TIMESTAMPTZ NOT NULL,
    embedding       vector(768)             -- text-embedding-005 dims
);

-- Index the things we filter and rank on
CREATE INDEX ON listings USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON listings (status, price, bedrooms);
CREATE INDEX ON listings USING gist (geo);
```

The embedding is generated from the prose half — description, neighborhood notes, standout features — not the structured fields. You never want to embed "$749,000" into a vector; you filter on it. Embed the language a buyer uses ("open-concept kitchen, walkable, great natural light") and filter the numbers.

### Generating embeddings on ingest

```python
from vertexai.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("text-embedding-005")

def embed_listing(listing: dict) -> list[float]:
    # Embed only the unstructured, preference-bearing text
    text = " ".join(filter(None, [
        listing["description"],
        listing.get("neighborhood_notes", ""),
        ", ".join(listing.get("features", [])),
    ]))
    return model.get_embeddings([text])[0].values  # 768-dim vector
```

## Hybrid Retrieval: Filter Hard, Rank Soft

The query path is where real estate RAG lives or dies. A buyer asks: *"3-bed under $750k near downtown with a home office and good light."* That decomposes into **hard constraints** (beds = 3, price ≤ 750000, within radius of downtown, status = active) and **soft preferences** ("home office," "good light"). The hard constraints become a SQL `WHERE`; the soft preferences become the embedded query vector that orders the survivors.

```sql
-- :qvec = embedding of "home office, good natural light"
-- :lng/:lat = geocoded "downtown" centroid
SELECT id, price, bedrooms, description,
       embedding <=> :qvec AS distance
FROM listings
WHERE status = 'active'                         -- freshness: never show sold
  AND price <= 750000                            -- hard constraint
  AND bedrooms >= 3                              -- hard constraint
  AND ST_DWithin(geo, ST_MakePoint(:lng,:lat)::geography, 8000)  -- 8km
ORDER BY embedding <=> :qvec                     -- soft ranking
LIMIT 30;                                        -- retrieve wide
```

Two things make this work. First, the structured predicates run *before* the vector ordering, so the expensive similarity computation only touches listings that already satisfy the non-negotiables. Second, we **retrieve wide (30) and rerank narrow**: cosine distance gets us a good candidate set, but a cross-encoder reranker scores the query against each description jointly and surfaces the genuinely best matches.

```python
from google.cloud import discoveryengine_v1 as de

def rerank(query: str, candidates: list[dict], top_n: int = 5):
    client = de.RankServiceClient()
    records = [
        de.RankingRecord(id=str(c["id"]), content=c["description"])
        for c in candidates
    ]
    resp = client.rank(de.RankRequest(
        ranking_config=RANKING_CONFIG,
        model="semantic-ranker-default@latest",
        top_n=top_n,
        query=query,
        records=records,
    ))
    keep = {r.id for r in resp.records}
    return [c for c in candidates if str(c["id"]) in keep]
```

## Grounding Gemini — and Forcing Citations

The reranked top 5 listings go into the Gemini prompt as grounded context. The non-negotiable rule: the model answers **only** from the retrieved listings and must cite the listing ID for every property it mentions. A real estate answer that invents a listing or misquotes a price is a liability, not a bug.

```python
PROMPT = """You are a real estate assistant. Answer using ONLY the listings
provided in CONTEXT. For every property you mention, cite its [listing_id].
If no listing matches, say so plainly — never invent properties or prices.
Do NOT comment on the demographics, religion, family makeup, or any protected
characteristic of a neighborhood or its residents.

CONTEXT:
{listings}

QUESTION: {question}
"""

from vertexai.generative_models import GenerativeModel
model = GenerativeModel("gemini-1.5-flash")
answer = model.generate_content(
    PROMPT.format(listings=format_cards(top5), question=user_q)
).text
```

## Fair Housing: Compliance as Architecture

**This is the part that gets agencies sued.** The Fair Housing Act prohibits discrimination based on race, color, religion, sex, disability, familial status, and national origin. An AI assistant that answers "is this a good area for a family like mine?" or steers buyers toward or away from neighborhoods based on demographics is creating direct legal exposure. You cannot rely on the model to "know better" — guardrails must be structural.

Three structural defenses, none of which depend on the model behaving:

- **Pre-retrieval intent classification.** Before any search runs, a lightweight classifier flags queries that ask about protected characteristics or request steering ("safe neighborhood for...", "mostly [group] area"). Flagged queries get a fixed, compliant response and never reach retrieval.

- **Curated, neutral data only.** The neighborhood corpus contains facts — school ratings, transit, amenities, commute times — not demographic commentary. If steering content isn't in the index, it can't be retrieved.

- **Output filtering.** A final check on the generated answer blocks responses that drifted into protected territory, with full logging for audit.

This is the real-estate instance of a general principle from this series: **encode the rules in the system, don't hope the model infers them.** Compliance, like a price filter, is too important to leave to similarity scores.

## Keeping the Index Fresh

Listings change constantly, and a stale index is the fastest way to lose buyer trust. The pattern that works: treat the MLS/listings database as the source of truth and propagate changes incrementally.

- **Event-driven updates.** A change in the listings DB (price drop, status flip) fires a Pub/Sub message; a Cloud Function updates the row in AlloyDB and re-embeds only if the *description* changed. Price and status updates skip re-embedding entirely — they're just column writes.

- **Status as a filter, never a delete.** Sold and pending listings stay in the table with their status flag; the `WHERE status = 'active'` predicate keeps them out of buyer results while preserving them for comps and analytics.

- **Nightly reconciliation.** A scheduled job diffs the index against the source of truth to catch anything the event stream missed — belt and suspenders for the one thing buyers will notice immediately.

## Measuring Whether It Works

Two failure modes matter in real estate, and they map to the two halves of RAG. **Retrieval failures** (wrong or missing listings) and **generation failures** (hallucinated facts, fair-housing drift). Measure them separately:

| Metric | What it catches | How |
| --- | --- | --- |
| Constraint accuracy | Results that violate a hard filter (over budget, wrong bed count) | Deterministic assertion on result set — should be 100% |
| Retrieval recall | Relevant active listings the system missed | Golden query set with known-good listings, checked weekly |
| Faithfulness | Invented properties, misquoted prices | Vertex AI Evaluation / LLM-as-judge against retrieved context |
| Compliance violation rate | Steering or protected-class commentary | Adversarial query suite run on every deploy — target zero |

**The two numbers to watch from day one:** constraint accuracy (are we ever showing a $900k house to a $750k buyer?) and compliance violation rate (did we ever steer?). Constraint accuracy should be exactly 100% because it's deterministic SQL — if it isn't, your filter logic is broken. Compliance violations should be zero on an adversarial test suite before you ever go live. Everything else is optimization; these two are go/no-go.

## Lessons Learned

- **Embed the prose, filter the numbers.** The single biggest quality win is refusing to put structured fields into the vector. Price, beds, geo, and status are SQL predicates; descriptions and features are embeddings.

- **One store beats two.** Keeping structured columns and embeddings in the same AlloyDB row eliminates the consistency nightmare of syncing a separate vector database against a transactional listings DB. Freshness becomes a normal UPDATE.

- **Compliance is a pre-retrieval gate, not a post-hoc filter.** The cheapest place to stop a fair-housing problem is before retrieval runs. By the time the model has generated text, you're cleaning up rather than preventing.

- **Status-as-filter, not delete.** Sold listings are gold for comps and market reports. Flag them, don't drop them.

- **Gemini Flash is enough.** For grounded listing Q&A, the cheaper, faster model is the right default. Reserve Pro for genuinely multi-hop reasoning like comparing five comps across neighborhoods.

The through-line from Part 1 holds with force in this domain: RAG is a search problem with a language model on the end, and the search half is where the engineering lives. In real estate, the structured constraints, the freshness pipeline, and the compliance gate are all *search-side* decisions. Get those right and Gemini's job becomes easy: read five real listings and answer honestly, with citations.

📚 The series

1. [RAG on GCP: From First Corpus to Production](rag-on-gcp)

1. Building a RAG System on GCP for a Real Estate Agency (this article)
