# Knowledge Graphs and the Semantic Web: RDF, OWL, SPARQL, and SHACL

The problem that finally sold me on semantic technologies wasn't academic. A manufacturer wanted to answer one question: if this raw material has a quality issue, which finished products, customers, and open orders are affected? The data existed — in an ERP, a PLM system, a CRM, and three spreadsheets — and every system had its own ID for "part," its own notion of "supplier," and no shared key. In a relational world this is a quarter of integration work and a join graph that explodes the moment someone asks a follow-up question. The relationships *were* the answer, and the relational model had buried them inside foreign keys nobody had mapped across systems.

Semantic technologies attack exactly this: representing data as a graph of explicitly typed, formally defined relationships, with global identifiers so that "part #4471 in the ERP" and "component PN-4471 in PLM" can be asserted to be the same thing once, with every query benefiting. It's a stack of W3C standards (RDF, RDFS, OWL, SPARQL, SHACL) with a reputation for being academic, and a 2025 reason to care that's anything but: knowledge graphs turned out to be one of the better ways to ground an LLM. This is the working engineer's tour: what each piece does, when it earns its complexity, and how it compares to the property-graph databases you've probably met first.

## What are semantic technologies?

Semantic technologies are a set of standards for representing knowledge as a graph in which both the entities and the relationships between them carry explicit, machine-interpretable meaning — so data from different sources can be integrated, queried, validated, and reasoned over without first being forced into one rigid schema. The defining move is that meaning lives *in the data*, as typed relationships and a shared vocabulary, rather than being implied by column names and enforced only in application code.

A **knowledge graph** is the artifact you build with them: a graph of real-world entities (people, products, genes, accounts) and their relationships, on top of a vocabulary or ontology that defines what those entity types and relationships mean. The semantic-web stack is the standardized way to build one so that two organizations' graphs can actually interlock instead of merely both being "graphs."

## RDF: everything is a triple

RDF (Resource Description Framework) is the foundation, and it has exactly one data structure: the **triple** of subject, predicate, and object. "Aspirin *treats* headache." "Order-88 *placedBy* Customer-12." A set of triples is a graph: the subjects and objects are nodes, the predicates are the labeled, directed edges. That's the whole model. The power isn't in the triple itself; it's in what the terms are. Subjects and predicates are normally **URIs** (globally unique identifiers, usually looking like web addresses; the RDF 1.1 specs call them IRIs), and objects are either URIs or literal values such as strings, numbers, and dates.

URIs are the part that makes integration work, and they're easy to undersell. When the ERP team and the PLM team both refer to a part as `https://acme.example/part/4471` — or assert that their two different URIs are `owl:sameAs` each other — the join is done, globally, for every consumer, forever. There's no per-query mapping table. The identifier *is* the integration. Here's the affected-parts scenario as Turtle (RDF's readable syntax), with two sources contributing triples about the same node:

```text
@base   <https://acme.example/> .
@prefix : <https://acme.example/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# Node identifiers contain "/", which a prefixed name cannot hold unescaped,
# so they are written as relative IRIs resolved against @base.

# from the ERP
<part/4471>   :usedIn        <product/X9> .
<part/4471>   :suppliedBy    <supplier/Nord> .

# from PLM (different local id, asserted identical)
<component/PN-4471>  owl:sameAs  <part/4471> .
<component/PN-4471>  :hasSpec    <spec/heat-rating-A> .

# from CRM
<order/88>    :contains      <product/X9> .
<order/88>    :placedBy      <customer/12> .
```

Now a single graph traversal (supplier → parts → products → orders → customers, with the supplier standing in for the suspect raw material) answers the impact question across three systems, because they share node identity. No system was rebuilt; they just contributed triples to a common graph.
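
To make the "identifier is the integration" point concrete, here is a minimal Python sketch using only the standard library. It is a toy, not a triple store: it assumes triples are plain tuples, that short strings stand in for full URIs, and that `sameAs` clusters can be folded into one representative node with a small union-find. The names `TRIPLES`, `canonical_ids`, and `merge_same_as` are mine, purely for illustration; a real store or OWL reasoner handles `owl:sameAs` in its own way.

```python
# Triples as (subject, predicate, object); short strings stand in for full URIs.
TRIPLES = [
    ("part/4471", "usedIn", "product/X9"),                   # ERP
    ("part/4471", "suppliedBy", "supplier/Nord"),            # ERP
    ("component/PN-4471", "sameAs", "part/4471"),            # PLM
    ("component/PN-4471", "hasSpec", "spec/heat-rating-A"),  # PLM
    ("order/88", "contains", "product/X9"),                  # CRM
    ("order/88", "placedBy", "customer/12"),                 # CRM
]


def canonical_ids(triples):
    """Return a function mapping each identifier to its sameAs representative."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for s, p, o in triples:
        if p == "sameAs":
            # Point the subject's cluster at the object's cluster.
            parent[find(s)] = find(o)
    return find


def merge_same_as(triples):
    """Rewrite every triple onto canonical identifiers and drop the sameAs links."""
    find = canonical_ids(triples)
    return {(find(s), p, find(o)) for s, p, o in triples if p != "sameAs"}


merged = merge_same_as(TRIPLES)
facts_about_part = sorted((p, o) for s, p, o in merged if s == "part/4471")
print(facts_about_part)
```

After the merge, the heat-rating spec contributed by PLM sits on the same node as the ERP's supplier and product facts, which is the whole trick: nothing was joined per query.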

```mermaid
graph LR
    PART["part/4471"] -->|suppliedBy| SUP["supplier/Nord"]
    PART -->|usedIn| PROD["product/X9"]
    PNUM["component/PN-4471"] -->|sameAs| PART
    PNUM -->|hasSpec| SPEC["spec/heat-rating-A"]
    ORD["order/88"] -->|contains| PROD
    ORD -->|placedBy| CUST["customer/12"]
```

Triples from three systems form one graph because they share URIs (and an `owl:sameAs` bridge). The impact query "which customers are exposed if supplier/Nord has a defect?" is a traversal — supplier → part → product → order → customer — not a multi-system join project.
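
The traversal itself is short enough to sketch in plain Python. This is an illustration under assumptions, not how a triple store executes it: triples are tuples, identities are assumed to be already unified, and the second supplier, part, product, order, and customer (`supplier/Sud`, `part/9000`, `product/Z1`, `order/89`, `customer/40`) are hypothetical rows I added so there is something the query should *not* return.

```python
from collections import defaultdict

TRIPLES = [
    ("part/4471", "suppliedBy", "supplier/Nord"),
    ("part/4471", "usedIn", "product/X9"),
    ("order/88", "contains", "product/X9"),
    ("order/88", "placedBy", "customer/12"),
    # Hypothetical unaffected branch, added for contrast.
    ("part/9000", "suppliedBy", "supplier/Sud"),
    ("part/9000", "usedIn", "product/Z1"),
    ("order/89", "contains", "product/Z1"),
    ("order/89", "placedBy", "customer/40"),
]


def affected_customers(triples, supplier):
    """Walk supplier -> part -> product -> order -> customer."""
    by_po = defaultdict(set)  # (predicate, object) -> subjects
    by_sp = defaultdict(set)  # (subject, predicate) -> objects
    for s, p, o in triples:
        by_po[(p, o)].add(s)
        by_sp[(s, p)].add(o)

    parts = by_po[("suppliedBy", supplier)]
    products = {prod for part in parts for prod in by_sp[(part, "usedIn")]}
    orders = {order for prod in products for order in by_po[("contains", prod)]}
    return {cust for order in orders for cust in by_sp[(order, "placedBy")]}


print(affected_customers(TRIPLES, "supplier/Nord"))
```

Note that two of the four hops run against the edge direction (`suppliedBy` and `contains` are followed from object to subject), which is why the sketch keeps an index in each direction.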

## Ontologies, RDFS, and OWL: meaning and reasoning

Triples give you a graph; an **ontology** gives that graph meaning. An ontology is a formal, shared specification of the concepts in a domain and how they relate — the classes (Part, Product, Supplier), the properties (suppliedBy, usedIn), and the rules that constrain them. RDFS (RDF Schema) provides the basics: declare classes, subclasses, properties, and their domains and ranges. **OWL** (the Web Ontology Language) goes much further, and this is where "semantic" earns the name.

OWL (specifically its OWL 2 DL flavor) is grounded in *description logics*, a family of decidable fragments of first-order logic, which means a **reasoner** can derive triples that nobody stated explicitly but that follow logically from what was stated. Declare that `suppliedBy` is the inverse of `supplies`, and the reasoner fills in the reverse edges. Declare `partOf` transitive, and it derives that a bolt in an assembly in a product is part of the product, across however many levels. Declare two classes disjoint, and it flags a node typed as both as a logical contradiction. This inference is the capability relational databases and property graphs simply don't have natively: the schema isn't just validation, it's a set of axioms a machine can compute consequences from.
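
A rough feel for what the reasoner is doing, as a toy forward-chaining loop in plain Python. This is a sketch under loud assumptions: it covers only inverse and transitive properties, it is nothing like a full OWL reasoner, and `bolt/M8` and `assembly/A1` are hypothetical identifiers I made up for the bolt-in-assembly example.

```python
TRIPLES = [
    ("part/4471", "suppliedBy", "supplier/Nord"),
    ("bolt/M8", "partOf", "assembly/A1"),      # hypothetical
    ("assembly/A1", "partOf", "product/X9"),   # hypothetical
]

# Axioms, stated once: inverse pairs (both directions) and transitive properties.
INVERSE_OF = {"suppliedBy": "supplies", "supplies": "suppliedBy"}
TRANSITIVE = {"partOf"}


def infer(triples, inverse_of, transitive):
    """Apply the two rules repeatedly until no new triple appears (a fixpoint)."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            if p in inverse_of:
                new.add((o, inverse_of[p], s))
            if p in transitive:
                for s2, p2, o2 in facts:
                    if p2 == p and s2 == o:
                        new.add((s, p, o2))
        if new <= facts:
            return facts
        facts |= new


facts = infer(TRIPLES, INVERSE_OF, TRANSITIVE)
for triple in sorted(facts - set(TRIPLES)):
    print("inferred:", triple)
```

The inferred triples were never asserted by any source system; they follow from the axioms, which is the difference between a schema and an ontology in one loop.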

**A schema describes what data looks like; an ontology describes what data means, formally enough that a machine can reason about it.** The dividing line is inference. If declaring "X is a subclass of Y" and "a is an X" lets the system conclude "a is a Y" on its own — and catch a contradiction when you assert something impossible — you've crossed from schema into ontology.
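
That dividing line fits in a few lines of Python. Again a hedged sketch rather than real reasoner behavior: the class names `Fastener` and `PhysicalThing`, the `SUBCLASS_OF` and `DISJOINT` tables, and both function names are hypothetical, and only subclass and disjointness axioms are handled.

```python
# Hypothetical ontology fragment: class -> its direct superclasses.
SUBCLASS_OF = {
    "Fastener": {"Part"},
    "Part": {"PhysicalThing"},
}
# Pairs of classes declared disjoint (nothing may be an instance of both).
DISJOINT = [("Part", "Supplier")]


def inferred_types(asserted, subclass_of):
    """Close a set of asserted classes under the subclass hierarchy."""
    types = set(asserted)
    frontier = list(asserted)
    while frontier:
        cls = frontier.pop()
        for parent in subclass_of.get(cls, ()):
            if parent not in types:
                types.add(parent)
                frontier.append(parent)
    return types


def contradictions(types, disjoint):
    """Return every disjoint pair that the given type set violates."""
    return [(a, b) for a, b in disjoint if a in types and b in types]


bolt_types = inferred_types({"Fastener"}, SUBCLASS_OF)
print(sorted(bolt_types))

# Asserting something impossible: a Fastener that is also a Supplier.
bad_types = inferred_types({"Fastener", "Supplier"}, SUBCLASS_OF)
print(contradictions(bad_types, DISJOINT))
```

The contradiction only surfaces because `Part` was inferred first; nobody asserted that the node is a `Part`.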

One conceptual trap worth naming up front: OWL uses the **open-world assumption** — if a fact isn't stated, it's unknown, not false. That's correct for integrating partial knowledge across sources (no single system knows everything), but it surprises people coming from databases, where absent means false. It's also exactly why you need SHACL.

## SPARQL: querying the graph

SPARQL is to RDF what SQL is to relational tables: the standard query language. You write a *graph pattern* — triples with variables — and SPARQL finds every subgraph that matches and binds the variables. The affected-customers question becomes a pattern that reads almost like the sentence:

```text
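BASE   <https://acme.example/>
PREFIX : <https://acme.example/>
# DISTINCT: a customer with several affected orders is listed once
SELECT DISTINCT ?customer WHERE {
  ?part   :suppliedBy  <supplier/Nord> .
  ?part   :usedIn      ?product .
  ?order  :contains    ?product .
  ?order  :placedBy    ?customer .
}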
```

Two things make SPARQL more than "SQL for graphs." Property paths let you traverse variable-length relationships in one line (`:partOf+` follows the chain to any depth — the recursive query that's painful in SQL). And **federated query** via `SERVICE` lets a single query reach out to a remote SPARQL endpoint and join its results in — so you can query your graph and a public one (a drug database, a geographic gazetteer) together, live, without ingesting it. That federation is the original "semantic web" dream, and it's genuinely useful even when scoped to a few internal endpoints.
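
If graph patterns and property paths feel abstract, this toy matcher in plain Python may help. It is a sketch of the idea, not of any SPARQL engine: patterns are tuples whose `?`-prefixed strings are variables, matching is naive nested iteration with no indexes or optimizer, and `bolt/M8` and `assembly/A1` are hypothetical nodes added to exercise the `partOf+` path. Federation is out of scope here.

```python
TRIPLES = [
    ("part/4471", "suppliedBy", "supplier/Nord"),
    ("part/4471", "usedIn", "product/X9"),
    ("order/88", "contains", "product/X9"),
    ("order/88", "placedBy", "customer/12"),
    ("bolt/M8", "partOf", "assembly/A1"),      # hypothetical
    ("assembly/A1", "partOf", "product/X9"),   # hypothetical
]


def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or reject if it is already bound differently.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from match(rest, triples, candidate)


def one_or_more(triples, start, predicate):
    """Nodes reachable from start via one or more predicate edges (like :partOf+)."""
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for s, p, o in triples:
            if s == node and p == predicate and o not in seen:
                seen.add(o)
                frontier.append(o)
    return seen


QUERY = [
    ("?part", "suppliedBy", "supplier/Nord"),
    ("?part", "usedIn", "?product"),
    ("?order", "contains", "?product"),
    ("?order", "placedBy", "?customer"),
]
customers = {b["?customer"] for b in match(QUERY, TRIPLES)}
print(customers)
print(one_or_more(TRIPLES, "bolt/M8", "partOf"))
```

The shared variable names are what do the joining: `?product` appears in two patterns, so only bindings that agree on it survive.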

## SHACL: validation, or the data contract for graphs

Because RDF is open and OWL assumes an open world, you need a separate mechanism to say "for *my* application, a Part must have exactly one supplier and a heat rating, or it's invalid." That's **SHACL** (Shapes Constraint Language). You define *shapes* — closed-world constraints on the graph — and a SHACL engine validates the data against them, reporting precisely what's missing or malformed.

If that sounds familiar, it should: SHACL is the [data contract](data-contracts) for an RDF graph. OWL says what's logically true and infers more; SHACL says what your application requires and rejects what doesn't conform. The two are complementary and constantly confused — the clean way to hold them apart is *OWL infers, SHACL validates*. You want both: OWL to integrate and enrich, SHACL to keep the graph trustworthy enough to build on.
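
Here is the "exactly one supplier and a heat rating" rule as a toy closed-world check in plain Python. To be clear about the assumptions: this is not SHACL syntax and not a SHACL engine, just the same idea of counting what is actually present. The `PART_SHAPE` dictionary layout, the `validate` function, `part/5200`, and `supplier/Sud` are all hypothetical, and the spec is attached directly to `part/4471` on the assumption that identities were already unified.

```python
TYPE = "rdf:type"

TRIPLES = [
    ("part/4471", TYPE, "Part"),
    ("part/4471", "suppliedBy", "supplier/Nord"),
    ("part/4471", "hasSpec", "spec/heat-rating-A"),
    # Hypothetical bad record: two suppliers, no heat rating.
    ("part/5200", TYPE, "Part"),
    ("part/5200", "suppliedBy", "supplier/Nord"),
    ("part/5200", "suppliedBy", "supplier/Sud"),
]

# A shape: which nodes it targets and how many values each property may have.
PART_SHAPE = {
    "target_class": "Part",
    "properties": [
        {"path": "suppliedBy", "min": 1, "max": 1},
        {"path": "hasSpec", "min": 1, "max": None},
    ],
}


def validate(triples, shape):
    """Return (node, path, message) violations; an empty list means conforming."""
    targets = sorted(
        s for s, p, o in triples if p == TYPE and o == shape["target_class"]
    )
    report = []
    for node in targets:
        for prop in shape["properties"]:
            values = {o for s, p, o in triples if s == node and p == prop["path"]}
            count = len(values)
            if count < prop["min"]:
                message = f"expected at least {prop['min']}, found {count}"
                report.append((node, prop["path"], message))
            elif prop["max"] is not None and count > prop["max"]:
                message = f"expected at most {prop['max']}, found {count}"
                report.append((node, prop["path"], message))
    return report


report = validate(TRIPLES, PART_SHAPE)
for violation in report:
    print(violation)
```

The missing heat rating is reported as a violation rather than treated as merely unknown, which is the closed-world stance that OWL on its own will not take.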

## RDF graphs vs labeled property graphs

If you've used a graph database, it was probably [Neo4j](neo4j-graph-databases), which is a **labeled property graph** (LPG), not RDF. Both model nodes and relationships, and they're often pitched as rivals, but they optimize for different things and the choice should follow the use case, not the hype.

|  | RDF / semantic web | Labeled property graph (Neo4j) |
| --- | --- | --- |
| Unit | Triple (subject–predicate–object) | Nodes & relationships with key/value properties |
| Identity | Global URIs — built for cross-org integration | Internal node IDs — local to the database |
| Properties on edges | Indirect (reification / RDF-star) | Native and easy |
| Query | SPARQL (W3C standard) | Cypher (de facto standard; ISO GQL, published in 2024, builds on it) |
| Reasoning | Formal inference via OWL | Not native |
| Sweet spot | Integration, interoperability, shared vocabularies, inference | Operational traversals, developer ergonomics, path queries |

My rule of thumb: reach for **RDF** when the point is integrating across organizations or systems, reusing standard vocabularies, or doing real inference — its global identity and W3C standardization are decisive there. Reach for an **LPG** when you're building one application's graph, want fast path-finding and properties on edges, and value developer velocity over formal semantics. Plenty of teams run both: an RDF layer for the integrated, governed knowledge model and an LPG for an app's hot operational traversals. They're tools, not teams.

## Why knowledge graphs came back: grounding LLMs

Semantic technologies spent a decade as a niche, and then large language models gave them a second act. An LLM is fluent but can be confidently wrong; a knowledge graph is rigid but verifiable. Putting them together, by retrieving facts and relationships from a curated graph to ground a model's answer, is one of the more effective ways to cut hallucination and add traceable provenance, because every retrieved fact traces to an explicit, owned triple rather than to the model's weights. The graph supplies *which entities relate and how*; the LLM supplies fluent language and translates a question into a graph query.
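
The retrieval half of that pairing can be sketched without any model at all. The Python below is illustrative only and assumes a lot: facts are tuples tagged with the source system that asserted them, retrieval is a naive neighborhood walk that ignores edge direction, and the numbered-context format is my own invention rather than any GraphRAG library's API. No LLM is called; the output is just the text you might place in a prompt.

```python
# (subject, predicate, object, source system that asserted the fact)
FACTS = [
    ("part/4471", "suppliedBy", "supplier/Nord", "ERP"),
    ("part/4471", "usedIn", "product/X9", "ERP"),
    ("part/4471", "hasSpec", "spec/heat-rating-A", "PLM"),
    ("order/88", "contains", "product/X9", "CRM"),
    ("order/88", "placedBy", "customer/12", "CRM"),
]


def retrieve(facts, entity, hops=2):
    """Collect facts within `hops` edges of the entity, ignoring edge direction."""
    frontier, seen_nodes, selected = {entity}, {entity}, []
    for _ in range(hops):
        next_frontier = set()
        for fact in facts:
            s, _, o, _ = fact
            if (s in frontier or o in frontier) and fact not in selected:
                selected.append(fact)
                next_frontier.update({s, o} - seen_nodes)
        seen_nodes |= next_frontier
        frontier = next_frontier
    return selected


def grounding_context(facts):
    """Format facts as numbered, source-tagged lines a model can cite."""
    return "\n".join(
        f"[{i}] {s} {p} {o} (source: {src})"
        for i, (s, p, o, src) in enumerate(facts, 1)
    )


nearby = retrieve(FACTS, "supplier/Nord", hops=2)
context = grounding_context(nearby)
print(context)
```

Because every line carries a number and a source, an answer that cites `[2]` can be traced back to a specific triple and the system that owns it.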

This is the engine behind [GraphRAG](graphrag), and it's why I've seen knowledge graphs become the actual product in hard domains — [connecting variants, genes, diseases, and drugs](clinico-genomics-rag-aws) in clinico-genomics, where the relationships are the value and a wrong link is a clinical error, not a typo. The semantic stack's old strengths — explicit meaning, global identity, validation, provenance — turn out to be exactly what you want when an LLM is going to read your data and a human is going to trust the answer.

## The honest trade-offs

I'd be doing you a disservice to pretend this stack is free. What actually bites:

- **The learning curve is real.** URIs, namespaces, the open-world assumption, the OWL-vs-SHACL distinction — there's genuine conceptual overhead before anything ships, and it loses teams that expected "just a graph."

- **Verbosity and tooling.** RDF can feel heavyweight, and the tooling ecosystem is smaller and less polished than the relational or LPG worlds. Triple stores are fewer and the talent pool is thinner.

- **Reasoning has a cost.** OWL inference is powerful but can be computationally expensive, and full reasoning over a large graph can be too slow for interactive use. Most production systems deliberately use a constrained OWL profile (OWL RL, say) rather than the full logic.

- **The boil-the-ocean ontology.** The classic failure: a year modeling the perfect enterprise ontology before delivering a single query anyone asked for. By the time it's "done," the requirements moved.

**The way semantic projects die is starting with the ontology instead of a question.** I've watched teams spend months building an exhaustive, philosophically pure model of their domain — and ship nothing, because there was no user waiting for an answer to pull the work forward. Invert it: pick one valuable question the graph should answer (the affected-customers query, a specific cross-system lookup), model only what that question needs, ship it, and grow the ontology from real demand. A small graph that answers a real question beats a magnificent ontology that answers none. Reuse existing vocabularies (schema.org, domain standards) instead of inventing your own — bespoke ontologies are how you lose the interoperability that was the whole point.

## What to carry away

Semantic technologies represent data as a graph where meaning is explicit and identifiers are global, which is why they're so strong at integration and inference. The stack: **RDF** models everything as triples with URIs as global join keys; **RDFS/OWL** add an ontology and formal reasoning that derives new facts and catches contradictions; **SPARQL** queries the graph with pattern matching, variable-length paths, and live federation; **SHACL** validates it like a data contract. Hold the two logics apart (OWL infers, SHACL validates) and remember OWL's open-world assumption, which is a feature for integration and a surprise for everyone from a database background.

Choose RDF when integration, shared vocabularies, and reasoning are the point; choose a [labeled property graph](neo4j-graph-databases) when you want one app's fast traversals and developer ergonomics. And the modern reason to learn all of this: a curated knowledge graph is one of the best ways to ground an LLM, which is what makes [GraphRAG](graphrag) and graph-backed clinico-genomics systems work. Start from a question, reuse vocabularies, and ship something small; the ontology should grow from demand, never precede it.
