
Veritas

A protocol for storing, sharing, and verifying structured knowledge across AI systems


Abstract

Veritas is a protocol for storing, sharing, and verifying structured knowledge. Knowledge is expressed in a minimal dependent type theory (the kernel), content-addressed by cryptographic hashes, and distributed across independent knowledge stores that an agent can compose and reason over simultaneously. Claims earn confidence through independent attestation: formal proofs for deductive knowledge, empirical evidence and source citations for everything else. Each node computes confidence locally from the attestations it can see.

The protocol serves as a substrate for AI systems. LLMs translate between natural language and formal claims, query the network for grounded answers with quantified confidence, and propose new knowledge that enters the network with zero confidence until independently attested. A single agent can draw from public knowledge, organizational stores, and private memories in one query, with each source controlling its own sharing boundary. The formal representation makes hallucination detectable at the protocol level: every claim is either proven, attested with traceable evidence, or explicitly unverified. An LLM can still mistranslate, but it cannot produce confident nonsense that passes as knowledge.

The kernel is a full CIC, capable of expressing any logical or mathematical truth.

1. The Problem

Large language models approximate reasoning well enough to be useful, especially in domains like programming, but they have no structured representation of what they know.[1] When an LLM states a fact, there is no way to trace that fact to its origin, quantify confidence in it, or determine whether it is consistent with other known facts.

Better training will not fix this. The limitation is architectural.[2] Neural networks are function approximators. They do not have a knowledge representation that supports verification, provenance, or logical consistency checking. No amount of scaling closes this gap. You need an external structured knowledge layer to bridge what an LLM asserts and what is verifiably true.

At the same time, formal knowledge systems (proof assistants, ontologies, knowledge graphs) can represent verified knowledge precisely, but they are inaccessible. Writing a Lean proof or a SPARQL query[3] requires specialized expertise. These systems contain deep, verified knowledge that is locked away from general use.

Veritas bridges both gaps. It provides:

  • A formal language for expressing any logical claim, from “water boils at 373.15 K” to “every Roman emperor after Diocletian ruled from a divided empire.”
  • A zero-trust model where confidence is earned through independent verification, not assumed.
  • Independent knowledge stores that any authorized agent can query together. A public scientific network, a corporate store, and a user’s private memories can all participate in a single query without any source giving up control of its data.
  • An interface layer where LLMs translate between natural language and formal claims. AI systems get access to verifiable knowledge; formal systems get access to natural language.

1.1 Why Independent Knowledge Stores

The protocol is decentralized as a consequence of its core requirement: a single agent must be able to compose knowledge from arbitrary independent sources, and each source must control what it shares.

Consider a persistent personal agent. It needs to reason across:

  • Public knowledge. The scientific literature, formalized mathematics, general encyclopedic facts. Shared openly.
  • Organizational knowledge. A company’s proprietary research, internal processes, strategic analysis. Accessible to authorized agents within the organization, invisible to everyone else.
  • Personal knowledge. The user’s private memories, preferences, history, notes. Accessible only to that user’s agent.

These are three different knowledge stores with three different sharing policies, but the agent needs to query and reason across all of them simultaneously as a unified knowledge environment. The claim “compound X shows promise for condition Y” might be attested by a peer-reviewed paper in the public network, by unpublished screening results in a corporate store, and by the user’s own lab notebook in a personal store.

This requires knowledge stores to be independent and composable. No single store can contain everything (private data must stay private). No single protocol participant controls what enters or leaves another’s store. The protocol defines how stores communicate, not what they must contain or share.

A multinational running a centralized knowledge store for its 100,000 employees is a valid deployment. So is a solo researcher running a node on a laptop. Both speak the same protocol, both can peer with each other (subject to access control), and an agent authorized by both can reason across both simultaneously.

2. Architecture Overview

The system has five layers. Each depends only on the layers below it.

  • Interface Layer. LLMs, programming languages, visual editors, query builders.
  • Semantic Querying. Structural + semantic embeddings over kernel terms.
  • Trust Layer. Attestations, derivations, confidence computation.
  • Network Layer. Content-addressed storage, sync, scoped sharing.
  • Kernel. Dependent type theory (CIC). The formal foundation.

The kernel has no dependencies. The network layer depends on the kernel (to verify object integrity and type-check derivations). The trust layer depends on the network (to store and retrieve attestations). Semantic querying depends on the kernel (term structure) and the network (to route queries across nodes). The interface layer depends on all of the above.

3. The Kernel

The kernel is a Calculus of Inductive Constructions[4] with content-addressed definitions. The companion Veritas Kernel Specification covers it in full.

3.1 Term Language

The entire term language is 7 constructors:

Constructor        Purpose
Var(index)         De Bruijn indexed variable reference
Sort(level)        Universe (Prop, Type₀, Type₁, …)
Pi(A, B)           Dependent function type Π(x : A). B
Lam(A, body)       Lambda abstraction λ(x : A). t
App(f, a)          Application f a
Def(hash, levels)  Reference to a network definition
Let(v, T, body)    Local binding

This is a minimal extended kernel optimized for content-addressing, not a pure CIC (a pure kernel would desugar Let and replace Def with direct substitution). Content-addressed networks need stable references (Def), and local bindings keep terms compact. Everything in the network, from trivial facts to advanced mathematics, is a tree of these 7 nodes. Expressiveness comes not from the kernel but from definitions built on top of it.

3.2 Content Addressing

De Bruijn indices[5] eliminate variable naming: λx. x and λy. y produce the same tree. All terms are encoded deterministically and hashed with the protocol’s configured hash algorithm (BLAKE3[6] in the current version). The same logical content always produces the same hash, regardless of who authored it, when, or on which node.
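A minimal sketch of this pipeline in Python, with SHA-256 standing in for BLAKE3 (which is not in the standard library). The tuple encoding and separator byte are illustrative, not the protocol's actual wire format:

```python
import hashlib

# Minimal term constructors as tagged tuples (a subset of the 7).
def var(i):        return ("Var", i)
def sort(lvl):     return ("Sort", lvl)
def lam(ty, body): return ("Lam", ty, body)

def encode(term):
    """Deterministic byte encoding: tag, then fields, depth-first."""
    out = []
    def walk(t):
        out.append(t[0].encode())
        for field in t[1:]:
            if isinstance(field, tuple):
                walk(field)
            else:
                out.append(str(field).encode())
    walk(term)
    return b"\x00".join(out)

def term_hash(term):
    # The protocol uses BLAKE3; SHA-256 is a stand-in here.
    return hashlib.sha256(encode(term)).hexdigest()

# λx. x and λy. y both become Lam(Type₀, Var(0)) under de Bruijn
# indexing, so they hash identically: the bound name never appears.
identity_x = lam(sort(1), var(0))   # "λx : Type. x"
identity_y = lam(sort(1), var(0))   # "λy : Type. y" — same tree
assert term_hash(identity_x) == term_hash(identity_y)
```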

Def(hash, levels) is the bridge between the kernel and the network. Every mathematical constant, type, axiom, and domain-specific definition is an object identified by hash. When the type checker encounters a Def, it resolves it by fetching the definition from the local store (or requesting it from the network).

3.3 Why CIC

The kernel must express any logical truth. First-order logic covers empirical claims but cannot express higher-order mathematics (“every continuous function on [a,b] is bounded”), self-referential structures (types that contain themselves), or dependent relationships (vectors whose type includes their length).

CIC is the strongest well-studied foundation that remains decidable for type checking. It is the basis of Lean 4[7] and Coq[8], which together have formalized a substantial fraction of known mathematics.[9] The kernel is not an experiment; it is proven technology deployed in a new context.

The cost of CIC over a simpler logic is implementation complexity of the type checker, not protocol complexity. The term language is 7 constructors regardless. For nodes that only deal with first-order empirical claims, the type checker never encounters the higher-order features.

3.4 Inductive Types

Inductive types are not term constructors but declarations: network objects that, when processed, introduce a type, its constructors, and a recursor into the environment. The recursor is the elimination principle, replacing pattern matching with a well-founded recursion scheme.

Example: the natural numbers.

Nat : Type
| zero : Nat
| succ : Nat → Nat

This declaration introduces four definitions: Nat, Nat.zero, Nat.succ, and Nat.rec. All four are content-addressed and stored as network objects. The recursor Nat.rec allows computing over natural numbers by structural recursion.
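To illustrate what computing through a recursor looks like, here is addition defined via `Nat.rec` in Lean 4 (whose kernel is a close relative of this one), with the motive given explicitly. This is an illustration of the elimination principle, not Veritas syntax:

```lean
-- Addition by structural recursion on the second argument.
-- Nat.rec takes the zero case (m) and the successor case.
def add (m : Nat) : Nat → Nat :=
  fun n => Nat.rec (motive := fun _ => Nat) m (fun _ ih => Nat.succ ih) n

example : add 2 3 = 5 := rfl
```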

Example: propositional equality.

Eq.{u} (A : Sort u) (a : A) : A → Prop
| refl : Eq A a a

This gives the standard identity type with the J elimination principle. Two values are provably equal only if a proof term of type Eq A a b can be constructed.

3.5 The Core Prelude

The network requires a shared foundation. The core prelude is a fixed set of definitions whose hashes are part of the protocol:

Logic: True, False, And, Or, Not, Iff, Eq, Exists.

Data types: Nat, Bool, List, Prod, Sigma.

Axioms: propositional_extensionality, function_extensionality, quotient_types, and optionally classical_choice.

The prelude is minimal. It provides the logical connectives and basic types needed to build everything else. Domain-specific definitions are not part of the prelude; they are user-published definitions in the network.

4. The Object Model

The network stores and transmits six types of objects, all content-addressed.

4.1 Definition

Introduces a typed constant into the environment.

Definition {
  type:            Term
  body:            Option Term    // none = axiom/opaque
  universe_params: u32
}

Transparent definitions (with a body) can be unfolded by the type checker. Opaque definitions (axioms) are taken on trust. The core prelude contains a small, fixed set of axioms. Everything else is transparent and verifiable.

4.2 Inductive Declaration

Declares an inductive type. Produces a deterministic set of definitions (type, constructors, recursor) that are themselves stored as network objects.

4.3 Claim

A proposition asserted to hold. The term must have type Prop.

Claim {
  term: Term    // infer([], term) ≡ Sort(0)
}

A claim carries no evidence, no provenance, no metadata. It is a pure proposition. “Water boils at 373.15 K at 1 atm” and “the Riemann hypothesis holds” are both claims. The same proposition always produces the same hash.

4.4 Derivation

A formal proof of a claim.

Derivation {
  claim: Hash
  proof: Term    // infer([], proof) ≡ claim.term
}

The proof is a term whose type is the claim’s proposition. Any node can verify a derivation by type-checking the proof against the claim. This requires no trust, no reputation, no authority. If the proof type-checks, the claim is proven. This is the deductive epistemic mode.

4.5 Attestation

External or empirical evidence for a claim.

Attestation {
  target:    Hash                  // the claim (or another attestation) being attested
  agent:     AgentId               // who is attesting
  result:    Confirmed | Disputed
  evidence:  Option EvidenceType   // Proof, StatisticalFit, PatternMatch, Frequency
  source:    Option SourceInfo     // PeerReviewed, Preprint, Encyclopedia, Web, ...
  inputs:    List Hash             // datasets, premises, or other objects used
  signature: ByteArray             // cryptographic signature by the agent
  timestamp: u64
}

Attestations are the empirical epistemic mode. They express: “I, agent X, have evidence that this claim is true (or false), and here is why.” The evidence and source fields are optional but contribute to confidence computation.

What counts as an attestation is broad. A research group publishes a paper measuring the speed of light; their measurement becomes an attestation against the claim speed_of_light = 299792458 m/s. An encyclopedia editor reviews sources and confirms a historical date; that’s an attestation with source type Encyclopedia. An AI agent runs a statistical analysis over a climate dataset and finds a correlation coefficient of 0.87; it produces an attestation with evidence type StatisticalFit and the dataset hash in its inputs. A second agent reads the same paper the first one cited and checks that the cited figures match the original; that’s an attestation targeting the first attestation. Each of these carries different weight in confidence computation, but they all share the same structure: an agent, a target claim, a result, and whatever evidence supports it.

Attestations can target other attestations, forming an audit chain. “I verified that attestation Y’s cited paper actually supports the claim” is itself an attestation.

4.6 Dataset

Datasets are first-class objects in the network, but they are stored referentially. The network holds a lightweight manifest (schema, metadata, hash of the actual data). The bulk data lives outside the network, hosted on any storage backend.

DatasetManifest {
  schema:       List (String, ColumnType)
  row_count:    u64
  content_hash: Hash             // hash of the canonical data encoding
  source:       Option SourceInfo
  description:  Term             // what this dataset represents, in the kernel language
  retrieval:    List URI         // where the bulk data can be fetched
  parent:       Option Hash      // if derived from another dataset, its manifest hash
  transform:    Option Term      // the transformation applied to the parent
  tags:         List String
}

The content hash is the anchor. hash(raw_data) == content_hash is the verification condition. It does not matter where the data is hosted: S3, IPFS[10], an institutional repository, a government open data portal. If the hash matches, it is the right data.

Why referential. Datasets can be gigabytes or terabytes. Storing them inline in the network would make synchronization impractical and storage costs prohibitive. The manifest is small (kilobytes), content-addressed, and distributable like any other network object. The bulk data is fetched on demand by nodes that need it.

Multiple retrieval URIs. A manifest can list multiple locations for the same data. Nodes can mirror popular datasets locally. If one URI goes down, others remain. The hash guarantees equivalence across mirrors.

Immutable snapshots. A dataset manifest pins a specific immutable snapshot of data. If the underlying data changes (new rows, corrections), that is a new manifest with a new content hash, requiring new attestations. Old attestations remain valid against the old snapshot. There is no silent invalidation of evidence.

Derived datasets. A filtered, aggregated, or joined dataset is a new manifest whose parent field references the source manifest and whose transform field describes the operation. “I filtered dataset X to rows where temperature > 300 K, producing dataset Y” is traceable: the derivation chain lives in the network even though the bulk data does not.

Verification. When an attestation says “I computed a statistical fit against dataset X and got R² = 0.99,” a verifier can:

  1. Look up the dataset manifest in the network.
  2. Fetch the bulk data from any listed retrieval URI.
  3. Confirm hash(data) == content_hash.
  4. Rerun the computation.
  5. Confirm the result matches the attestation’s claimed evidence.

The attestation, the dataset manifest, and the claim are all in the network. The computation is deterministic. The only external dependency is bulk storage, verified by hash. Anyone can reproduce the entire chain.
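The hash check at the core of this chain can be sketched in a few lines of Python. SHA-256 stands in for the protocol's configured hash, and the manifest dict and retrieval URI are illustrative:

```python
import hashlib

def verify_dataset(raw_data: bytes, manifest: dict) -> bool:
    """Confirm fetched bulk data matches the manifest's content hash.
    SHA-256 stands in for the protocol's configured hash (BLAKE3)."""
    return hashlib.sha256(raw_data).hexdigest() == manifest["content_hash"]

data = b"temp_k,pressure_atm\n373.15,1.0\n"
manifest = {
    "content_hash": hashlib.sha256(data).hexdigest(),
    "retrieval": ["https://example.org/boiling.csv"],  # hypothetical mirror
}
assert verify_dataset(data, manifest)             # right data, any mirror
assert not verify_dataset(b"tampered", manifest)  # wrong data is rejected
```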

4.7 Concrete Example

A typical real-world claim in the network:

Claim (hash: 0x8f3a9b2c...):

BoilingPoint(water, atm(1.0), kelvin(373.15))

Attestation by agent:

  • result: Confirmed
  • evidence: StatisticalFit (R² = 0.9997 across 12,847 measurements)
  • source: PeerReviewed (NIST Chemistry WebBook[11], DOI-linked)
  • inputs: [DatasetManifest hash]
  • timestamp: 2025-11-12
  • signature: [valid Ed25519 signature]

Derivation (optional but present): A formal proof term deriving the value from thermodynamic axioms and the definition of phase transitions.

5. The Trust Model

5.1 Local Confidence

The protocol does not define a global truth function. Each node computes its own confidence in each claim from the attestation graph visible to it.

The multi-source architecture demands it. Different nodes have access to different attestations (including private ones), apply different confidence functions, and serve different use cases. A global consensus mechanism would require all nodes to agree on a single assessment, which is incompatible with scoped knowledge and heterogeneous trust policies.

5.2 Two Epistemic Modes

Knowledge enters the network through two different paths:

Deductive. A claim has a Derivation: a proof term that type-checks against the proposition. This is mechanical verification. If the proof type-checks, the claim is proven. No confidence score is needed.

Empirical. A claim has Attestations: agents asserting that evidence supports (or disputes) the claim. The number, independence, and quality of attestations determine confidence. Most real-world knowledge works this way.

Both modes coexist in the same network, on the same claims. A claim might be formally derived from axioms AND attested by experimental data. The derivation and the attestations are independent confirmation through different epistemic paths.

5.3 Confidence Computation

The protocol does not specify how to compute confidence. Each node implements its own function from attestation history to a confidence value. The protocol specifies only the inputs:

  • Evidence type and strength. A StatisticalFit with R² = 0.99 on a large dataset is stronger than a Frequency count. A formal Proof attestation (distinct from a Derivation) indicates someone checked a logical argument.
  • Source quality. A PeerReviewed source with a verifiable DOI carries more weight than a Web source. Citation count provides a rough proxy for community acceptance.
  • Independence. Multiple attestations from different agents, using different evidence, are stronger than repeated attestations from the same agent or source.
  • Disputes. A Disputed attestation reduces confidence. The dispute itself carries evidence and can be attested or counter-disputed.
  • Audit depth. An attestation that has itself been confirmed by an audit attestation (someone verified the cited source actually says what the attestation claims) is more trustworthy.
  • Derivation existence. If a Derivation exists for the claim, confidence is maximal regardless of attestations. A valid proof is stronger than any amount of empirical evidence.

The protocol provides the attestation graph. Confidence computation is implementation-defined.
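One possible confidence function over those inputs, sketched in Python. Every weight below is an illustrative assumption; the protocol deliberately leaves this entirely implementation-defined:

```python
def confidence(attestations: list[dict], has_derivation: bool) -> float:
    if has_derivation:
        return 1.0  # a type-checked proof dominates any empirical evidence
    score = 0.0
    seen_agents = set()
    for att in attestations:
        # Base weight by evidence type (illustrative values).
        w = {"Proof": 0.4, "StatisticalFit": 0.3,
             "PatternMatch": 0.15, "Frequency": 0.1}.get(att["evidence"], 0.05)
        if att["source"] == "PeerReviewed":
            w *= 1.5               # source quality boosts weight
        if att["agent"] in seen_agents:
            w *= 0.25              # repeat attestations count far less
        seen_agents.add(att["agent"])
        if att["result"] == "Disputed":
            w = -w                 # disputes subtract what they would add
        score += w
    return max(0.0, min(1.0, score))

atts = [
    {"agent": "a1", "evidence": "StatisticalFit",
     "source": "PeerReviewed", "result": "Confirmed"},
    {"agent": "a2", "evidence": "Frequency",
     "source": "Web", "result": "Confirmed"},
]
assert confidence(atts, has_derivation=True) == 1.0
assert 0.0 < confidence(atts, has_derivation=False) < 1.0
```

A different node, with different trust priorities, could run an entirely different function over the same attestation graph and arrive at a different score; the protocol guarantees only that both saw the same inputs.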

5.4 Identity and Signatures

Agents sign attestations. A public key identifies each agent. The signature algorithm is a protocol parameter (Ed25519[12] in the current version), upgradeable as described in Section 12.7.

Agent identity is minimal by design. An agent is a key pair, nothing more. Reputation, institutional affiliation, and real-world identity are out of scope for the protocol. Nodes may build reputation systems on top of agent identities, but the protocol does not require or define them.

The network handles key distribution: an agent’s public key is a Definition object. Nodes that trust an agent store its key. There is no PKI, no certificate authority, no web of trust at the protocol level.

5.5 Integrity

All objects are content-addressed. A node receiving an object checks that its hash matches the claimed hash, verifies attestation signatures against the agent’s public key, and verifies Derivations by type-checking the proof term. At no point does the protocol require trusting another node’s word.

6. The Network Protocol

The network protocol handles storage, retrieval, querying, and synchronization of objects across nodes. Specified fully in the companion Veritas Network Protocol Specification.

6.1 Transport

Binary, length-prefixed frames over TCP (reference transport). TLS 1.3[13] for remote connections. 13 message types covering handshake, object exchange, querying, subscriptions, and keepalive.

6.2 Sync

Eventually consistent, pull-based. Nodes exchange manifests (lightweight summaries of their knowledge) and request objects they lack. Content addressing guarantees no conflicts: the same object has the same hash everywhere. Inserting the same object twice is a no-op.
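The pull-based exchange reduces to a set difference over manifests. A minimal sketch in Python, modeling stores as hash-to-object dicts (the real protocol exchanges lightweight manifests and fetches objects in separate messages):

```python
def sync_from(local_store: dict, peer_store: dict) -> list[str]:
    """Pull objects the peer has and we lack. Content addressing makes
    this conflict-free: identical objects have identical hashes."""
    missing = set(peer_store) - set(local_store)   # manifest diff
    fetched = []
    for h in sorted(missing):
        local_store[h] = peer_store[h]             # request + store object
        fetched.append(h)
    return fetched

a = {"h1": "obj1"}
b = {"h1": "obj1", "h2": "obj2"}
assert sync_from(a, b) == ["h2"]
assert sync_from(a, b) == []   # inserting the same object twice is a no-op
```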

6.3 Query Routing

Queries specify variables, skeleton hashes, or domain tags. Nodes search locally first, then forward to peers whose manifests indicate potential matches. TTL limits prevent unbounded propagation. Queries return hashes and relevance scores; the requester fetches full objects separately.

6.4 Dependency Resolution

Objects reference other objects by hash. When a node receives an object, it computes its dependencies, requests any that are missing, and repeats until the transitive closure is resolved. A claim referencing vocabulary definitions from the physics domain will pull in those definitions automatically.
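Resolving the transitive closure is a straightforward graph walk over hash references. A sketch in Python, where objects are modeled as dicts with a `refs` list and `network` stands in for peer requests:

```python
from collections import deque

def resolve(root_hash: str, local: dict, network: dict) -> None:
    """Fetch the transitive closure of an object's hash references."""
    queue = deque([root_hash])
    while queue:
        h = queue.popleft()
        if h in local:
            continue                 # already resolved
        obj = network[h]             # request missing object from peers
        local[h] = obj
        queue.extend(obj["refs"])    # its references may also be missing

# Hypothetical physics-domain objects: a claim pulling in its vocabulary.
network = {
    "claim":       {"refs": ["def_boiling", "def_kelvin"]},
    "def_boiling": {"refs": ["def_temp"]},
    "def_kelvin":  {"refs": []},
    "def_temp":    {"refs": []},
}
local = {}
resolve("claim", local, network)
assert set(local) == {"claim", "def_boiling", "def_kelvin", "def_temp"}
```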

6.5 Subscriptions

Nodes can subscribe to new objects matching a filter (by variable, skeleton, or domain). The network pushes notifications when matching objects are stored.

6.6 Peer Discovery

Bootstrap nodes, mDNS[14] for local networks, and manual configuration. The protocol is transport-agnostic at the specification level; TCP is the reference implementation.

6.7 Network Topology

The network has no fixed topology. There is no required set of nodes, no backbone, no hierarchy. Any node can connect to any other node. The network is whatever set of peer connections currently exists.

In practice, operators connect nodes by domain and trust. A cluster of nodes operated by physics researchers will peer with each other and share physics vocabulary claims. A cluster of medical nodes will do the same. Cross-domain connections form when operators in different clusters peer their nodes. Queries then propagate across domains.

The protocol supports several deployment patterns:

Public nodes. Open to any peer, accept and serve all objects. These are the backbone of the public knowledge network.

Institutional nodes. Operated by a university, company, or research group. May peer selectively, may require TLS client certificates for connection. Serve their institution’s attested knowledge to the broader network while maintaining control over what they store locally.

Personal nodes. A researcher’s laptop or a home server. Stores a working set of knowledge relevant to the user. Syncs with public or institutional nodes as needed. Operates independently when offline.

Offline operation. A node with a populated local store can type-check derivations, compute confidence, and answer queries without any network connectivity. The network is for synchronization, not for basic operation. A node that goes offline does not lose its knowledge or its ability to reason over it. It cannot discover new objects until it reconnects.

6.8 Private Knowledge and Knowledge Scoping

The protocol treats all knowledge stores uniformly at the kernel level, but each store controls its own sharing boundary. A single agent can reason across multiple knowledge sources with different visibility, the core use case from Section 1.1.

Scoping model. Every object in a node’s store has a visibility scope:

  • Public. Included in manifests, served to any peer, fully participates in the shared network.
  • Scoped. Shared only with authorized peers (those presenting the right credentials during handshake). This is how organizational knowledge works: all agents within a company can access the corporate knowledge store, but outside peers cannot.
  • Local. Never leaves the node. Used by the local type checker, confidence computation, and query engine, but invisible to all peers.

From the kernel’s perspective, there is no difference between a public, scoped, or local object. A local claim has a hash, can have local attestations, can be referenced by local derivations. The privacy boundary is at the network layer, not the knowledge layer.
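Enforcing the boundary at the network layer means the scope check happens when a manifest is built for a peer. A sketch in Python; the scope names follow the model above, and the credential strings are illustrative:

```python
def manifest_for(store: list[dict], peer_credentials: set[str]) -> list[str]:
    """Build the manifest a given peer is allowed to see."""
    visible = []
    for obj in store:
        if obj["scope"] == "public":
            visible.append(obj["hash"])        # served to any peer
        elif (obj["scope"] == "scoped"
              and obj["credential"] in peer_credentials):
            visible.append(obj["hash"])        # authorized peers only
        # "local" objects never appear in any manifest
    return visible

store = [
    {"hash": "h_pub", "scope": "public"},
    {"hash": "h_org", "scope": "scoped", "credential": "acme-internal"},
    {"hash": "h_mem", "scope": "local"},
]
assert manifest_for(store, {"acme-internal"}) == ["h_pub", "h_org"]
assert manifest_for(store, set()) == ["h_pub"]
```

The local type checker and query engine see all three objects; only the manifest sent over the wire is filtered.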

Organizational knowledge stores. A company operates a knowledge store (one node or a cluster of nodes) containing proprietary research, internal processes, competitive analysis, and institutional memory. All agents authorized by the organization can connect, query, and contribute. The store peers selectively with the public network, sharing what the organization chooses to publish (attested research results, product specifications) while keeping everything else scoped to authorized agents.

An agent acting on behalf of an employee of this company connects to both the public network and the corporate store. A query about “what compounds interact with target X” searches both simultaneously. The answer draws from public literature and the company’s unpublished screening data, synthesized into a single response. The employee sees everything. An outside agent sees only the public portion.

Personal knowledge. A persistent personal agent maintains a local knowledge store containing the user’s memories, preferences, notes, and context. This is private to the user’s agent by default. “User prefers visual explanations,” “user has a background in molecular biology,” “user discussed compound Y with colleague Z last week” are all local claims, attested by the agent’s own observations, stored locally.

The personal store participates in the same kernel and vocabulary as everything else. The agent can reason across personal context and public knowledge: “Given what I know about the user’s research interests [local] and the latest publications on CRISPR [public], here is what’s relevant.” The personal knowledge shapes the response without ever leaving the user’s node.

Private datasets. The referential dataset model (Section 4.6) supports private data. A dataset manifest can exist locally or in a scoped store, with retrieval URIs pointing to internal storage (a corporate data lake, a local filesystem). Attestations computed against the private dataset stay within the same scope. The claims those attestations support may be published more broadly (if the agent chooses), but the evidence chain into the private data remains hidden.

An agent can publish the claim “compound X reduces tumor growth in cell line Y” as a public claim with a public attestation saying “confirmed by internal experimental data,” without revealing the raw data, the methodology details, or the full statistical analysis. Other agents see the claim and the attestation. They can weigh it according to their own confidence function (which might assign lower confidence to attestations that reference opaque evidence). But the proprietary data itself is never exposed.

Multi-source reasoning. The agent’s local query engine searches across all connected stores: public, organizational, and personal. Confidence computation considers attestations from all sources, weighted by the agent’s own confidence function. Cross-referencing works: a private claim can reference public definitions, a scoped derivation can use both public theorems and organizational lemmas, and a personal memory can contextualize a public claim. Scoping determines what leaves each store, not what any agent can think about.

Access control. The protocol provides the mechanism for scoped sharing through authenticated connections and capability flags during handshake. The specific access control policy (role-based, credential-based, per-object ACLs) is implementation-defined. A simple deployment might use a shared secret for organizational access. A complex one might integrate with an existing identity provider. The protocol does not prescribe a model; it provides the transport-level hooks for any model.

7. Semantic Querying

The retrieval problem: given a question or a concept, find all relevant knowledge in the network, across domains, nodes, and scoping boundaries the agent is authorized to access.

The network protocol (Section 6.3) supports exact matching by variable, skeleton hash, and domain tag. This handles the case where you know what you’re looking for: “find all claims about the boiling point of water” maps directly to a skeleton match. But the interesting queries are the ones where the connection isn’t syntactic: “what else exhibits runaway positive feedback?” should return claims about financial bubbles, bacterial growth, acoustic resonance, and nuclear chain reactions, even though none of these share variables or skeleton hashes.

Semantic querying, retrieval based on meaning and logical structure rather than surface representation, is an open design problem.

7.1 Why Text Embeddings Are Insufficient

Standard text embeddings (the kind used in vector databases and RAG[15] systems) cluster by linguistic co-occurrence. “Financial bubble” is near “market crash” and “stock price.” This is useful but shallow. It misses the structural connection between a financial bubble and bacterial colony growth, both of which are instances of a positive feedback loop leading to instability.

Veritas claims are not strings. They are typed kernel terms with explicit logical structure. A claim about financial bubbles and a claim about bacterial growth might share the same logical skeleton:

∀(x : System). PositiveFeedback(x) → Amplification(x) → Instability(x)

Different concrete types (financial system vs. biological system), same shape. Text embeddings discard this structure. An embedding model for Veritas should preserve it.
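One way to preserve the shape is to hash a term's skeleton: erase the concrete Def hashes and hash what remains. A sketch in Python on tagged-tuple terms; the encoding is illustrative, not the protocol's:

```python
import hashlib

def skeleton(term: tuple) -> tuple:
    """Replace every Def reference with a placeholder, keeping structure."""
    if term[0] == "Def":
        return ("Def", "_")          # erase the concrete definition hash
    return (term[0],) + tuple(
        skeleton(f) if isinstance(f, tuple) else f for f in term[1:]
    )

def skeleton_hash(term: tuple) -> str:
    return hashlib.sha256(repr(skeleton(term)).encode()).hexdigest()

# Same logical shape, different domains: only the Def hashes differ.
financial = ("Pi", ("Def", "hash_market_system"),
             ("App", ("Def", "hash_bubble_instability"), ("Var", 0)))
biological = ("Pi", ("Def", "hash_bacterial_colony"),
              ("App", ("Def", "hash_growth_instability"), ("Var", 0)))
assert skeleton_hash(financial) == skeleton_hash(biological)
```

Structural similarity then becomes exact skeleton-hash matching for identical shapes; an embedding over skeletons would generalize this to approximate shapes.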

7.2 What the Embedding Should Capture

Several signals could contribute to a useful embedding for kernel terms. Which matter most, and how to combine them, I do not yet know.

Term structure. The shape of the kernel term: depth, connective patterns, quantifier structure. Two claims with the same logical skeleton but different Def references would produce similar structural embeddings. This is what would connect “financial bubble” to “feedback loop” across domains.

Definition semantics. The types and relationships of the definitions a claim references. PositiveFeedback and RunawayGrowth have different hashes but similar type signatures and play similar roles in claims. Embeddings that capture this would group semantically related but structurally distinct claims.

Derivation context. Claims derivable from common premises or sharing intermediate lemmas are logically related. If two claims from different domains both follow from a shared theorem about dynamical systems, that relationship is meaningful and could be reflected in the embedding space.

Attestation context. Claims attested by similar evidence types, by the same agents, or against similar datasets share an empirical context.

How to weight these signals against each other is open. Heavy structural weighting would favor cross-domain analogical retrieval. Heavy definitional weighting would favor within-domain precision. The right balance likely depends on the query.

7.3 Protocol Considerations

The protocol must support whatever embedding approach nodes adopt. Several design decisions follow:

Comparability. Embedding-based query routing (comparing a query embedding against manifest centroids) only works if embeddings are comparable across nodes. This requires nodes to either use the same embedding model, or use models with a known projection between them. The protocol should allow nodes to advertise which model produced their centroids, so peers can determine compatibility.

Fallback. Exact matching (by variable, skeleton, domain) must work regardless of embedding model. Semantic querying enhances retrieval but cannot be a hard dependency. Nodes that disagree on embedding models, or that don’t support embeddings at all, must still be able to participate in the network.

Privacy. Embeddings can leak information about the claims they represent. A scoped or local claim’s embedding must not appear in manifests sent to unauthorized peers. The scoping model (Section 6.8) applies to embeddings as well as to objects.

Standardization. Embedding model diversity (different models may be better for different domains) and interoperability pull in opposite directions. One approach: the protocol specifies a reference model (or a small set), and nodes can use alternatives with the understanding that cross-node semantic routing only works with shared models.

7.4 Possible Approaches

None of these approaches are settled.

Text embeddings as a starting point. Embed the human-readable description of each claim (its natural language gloss) using a standard text embedding model. Implementable now with existing tooling. Decent within-domain retrieval, but no structural similarity across domains. A good pragmatic first step.

Tree-structured embeddings. Embed the kernel term AST directly, treating it as a graph or tree. GNN-based approaches16 or recursive neural networks over the term structure could capture the skeletal similarity between structurally analogous claims. The most direct path to cross-domain structural matching, but requires training data (pairs of structurally similar claims across domains) that does not yet exist.

Hybrid approaches. Combine structural features (extracted deterministically from the term) with learned semantic features (from definition descriptions or types). The structural features provide cross-domain signal; the semantic features provide within-domain precision. The combination could be learned or rule-based.

Proof-aware embeddings. Use the derivation graph as a signal: claims that share proof ancestry should be close. Powerful in principle, but it requires a substantial derivation graph, which the early network will not have.

Embedding model as a network object. The embedding model itself could be a content-addressed network object: a Definition describing the function, with weights stored as a referenced dataset. Nodes that adopt the same model (by hash) are guaranteed to produce comparable embeddings. Consistent with the protocol’s design, but adds complexity.

7.5 Phase 0 Implementation

The reference implementation starts with structural features extracted deterministically from the kernel term (quantifier nesting depth, connective skeleton, universe levels), with text embeddings of the natural-language gloss as a fallback. Learned features fine-tuned on cross-domain claim pairs are a likely next step, but not part of the initial implementation.
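The deterministic structural features can be sketched over a toy term AST. The node shapes below (`"forall"`, `"implies"`, `"app"`, `"def"`) are stand-ins for the real kernel grammar, which has binder types and more constructors.

```python
# Hedged sketch: deterministic structural features over a toy kernel-term
# AST. Node shapes are illustrative, not the real kernel grammar.

def features(term) -> dict:
    """Quantifier nesting depth and connective skeleton of a term."""
    kind = term[0]
    if kind == "def":                      # Def(hash) leaf, hash elided
        return {"depth": 0, "skeleton": "_"}
    children = [features(t) for t in term[1:]]
    depth = max((c["depth"] for c in children), default=0)
    if kind == "forall":
        depth += 1                         # count quantifier nesting
    skel = kind + "(" + ",".join(c["skeleton"] for c in children) + ")"
    return {"depth": depth, "skeleton": skel}

# ∀x. NobleGas(x) → Nonreactive(x), with Def leaves collapsed to "_"
term = ("forall", ("implies", ("app", ("def",), ("def",)),
                              ("app", ("def",), ("def",))))
f = features(term)
```

Two claims with the same skeleton but different Def hashes produce the same structural signature, which is the cross-domain signal Section 7.2 describes.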

7.6 Query Routing

Regardless of embedding approach, the basic routing mechanism is the same. The manifest contains a centroids field: a set of embedding vectors summarizing the node’s knowledge. A query with an embedding is compared against peer manifest centroids using cosine similarity, and forwarded to the most relevant peers.

The scenario:

  1. A user asks “what other systems show positive feedback leading to instability?”
  2. The interface translates this into a query embedding.
  3. The local node searches its own store.
  4. It compares the query embedding against peer manifest centroids and forwards to matching peers.
  5. Results are aggregated across nodes and returned.
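Step 4 can be sketched directly. This assumes each peer's manifest centroids are plain vectors; the peer names are invented for illustration.

```python
# Hedged sketch of step 4: rank peers by the best cosine similarity
# between the query embedding and any centroid in their manifest.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def route(query_vec, peer_centroids: dict, k: int = 2) -> list:
    """Return the k peers whose closest centroid best matches the query."""
    best = {peer: max(cosine(query_vec, c) for c in cents)
            for peer, cents in peer_centroids.items()}
    return sorted(best, key=best.get, reverse=True)[:k]

peers = {
    "finance-node": [[0.9, 0.1, 0.0]],
    "bio-node":     [[0.8, 0.3, 0.1]],
    "art-node":     [[0.0, 0.1, 0.9]],
}
selected = route([1.0, 0.2, 0.0], peers)
# selected → ["finance-node", "bio-node"]
```

The comparability caveat from Section 7.3 applies: this only works when the query embedding and the centroids came from the same (or projectable) models.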

If the embeddings capture structural similarity well, the user gets results spanning finance, biology, and physics, connected by shared logical structure, without needing to know which node stored what.

If the embeddings are text-based and only capture surface semantics, the results will be narrower but still useful. The protocol supports both. The quality of cross-domain retrieval depends on the quality of the embedding model, which will improve over time.

8. Vocabularies

The claim BoilingPoint water (atm 1) (kelvin 373.15) uses three definitions: BoilingPoint, atm, kelvin. Without them, the claim is just opaque hashes. A vocabulary bundles related definitions into a named, versioned set: “physics v1.2” contains BoilingPoint, PhysicalQuantity, Unit, Measurement, and everything else an agent needs to express claims about the physical world.

Under the hood, every definition in a vocabulary is an ordinary Def(hash). The vocabulary itself is a manifest: a content-addressed object listing its definitions, their human-readable names, and a version identifier. Adopting a vocabulary means adopting its manifest hash, which pins every definition in the set.

Agents using different definitions for the same concept produce claims with different hashes, but semantic querying (Section 7) can still surface them as related. Hash-level convergence matters for deduplication and exact matching, not for discovery. Shared vocabularies are a convenience for end users: consistent tooling and predictable results. Vocabulary manifests give agents a concrete thing to converge on when they choose to. Anyone can publish a vocabulary manifest for any domain, and multiple competing vocabularies can coexist.

8.1 Structure

A vocabulary manifest lists a set of Definition and InductiveDecl objects. The manifest and its definitions are ordinary network objects, content-addressed and distributable. The manifest’s hash identifies the vocabulary as a whole.
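A minimal sketch of the manifest-as-object idea, assuming a canonical JSON encoding and using SHA-256 as a stand-in for the protocol's BLAKE3 (which is not in the Python standard library). Field names are illustrative.

```python
# Hedged sketch: a vocabulary manifest as a content-addressed object.
# SHA-256 stands in for BLAKE3; field names are illustrative.
import hashlib
import json

def object_hash(obj: dict) -> str:
    """Hash of a canonical (sorted-key, compact) JSON encoding."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

boiling_point_hash = object_hash({"def": "BoilingPoint"})  # stand-in
manifest = {
    "name": "physics",
    "version": "1.2",
    "definitions": {"BoilingPoint": boiling_point_hash},
}
manifest_hash = object_hash(manifest)
# Adopting manifest_hash pins every definition in the set: changing any
# name, version, or definition hash changes the manifest hash.
```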

Example: definitions for a physics domain.

PhysicalQuantity : Type                      // Mass | Length | Time | ...
Unit             : PhysicalQuantity → Type   // Kilogram | Meter | Second | ...
Measurement.{u}  : PhysicalQuantity → Type   // a numeric value with a unit

speed_of_light     : Measurement Velocity
boltzmann_constant : Measurement (Ratio Energy Temperature)

BoilingPoint       : Substance → Measurement Pressure → Measurement Temperature → Prop
Causes             : Event → Event → Prop
Reduces            : Substance → PhysicalProcess → Prop

Each of these is a Definition with a hash. “Water boils at 373.15 K at 1 atm” becomes:

BoilingPoint water (atm 1) (kelvin 373.15)

A term built entirely from Def references and numeric literals. It has a deterministic hash.

8.2 Interfaces

LLMs will be the dominant interface. Most users will interact with Veritas through natural language, which means every natural language statement an LLM produces must trace back to specific claims in the network.

Different interfaces present the same kernel terms differently. The vocabulary definitions provide the shared reference point.

LLMs. An LLM receives a set of definitions as context (names, types, descriptions). Given “aspirin reduces inflammation,” it generates the kernel term App(App(Def(reduces_hash), Def(aspirin_hash)), Def(inflammation_hash)). The term is type-checked by the kernel. If it fails, the LLM retries. The vocabulary determines what the LLM can express and how naturally it can translate.

Programming languages. A Rust or Python library can bind definitions as typed constants. let claim = reduces(aspirin, inflammation); compiles to the same kernel term. The vocabulary becomes a typed API.
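A Python version of that binding might look like the following sketch. The term encoding (nested tuples) and the hashes are invented for illustration; the point is that the library and the LLM interface produce the same kernel term.

```python
# Hedged sketch: vocabulary definitions bound as Python callables that
# build kernel terms. Term encoding and hashes are illustrative.

def make_predicate(def_hash: str):
    """Bind a Definition hash as a callable producing a curried App term."""
    def apply(*args):
        term = ("def", def_hash)
        for a in args:                     # curried application
            term = ("app", term, a)
        return term
    return apply

aspirin      = ("def", "a1f0")            # hypothetical definition hashes
inflammation = ("def", "9c2d")
reduces = make_predicate("4b71")

claim = reduces(aspirin, inflammation)
# Same shape every interface produces:
# App(App(Def(reduces), Def(aspirin)), Def(inflammation))
```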

Visual editors. A node-based editor renders definitions as blocks with typed input/output ports. BoilingPoint becomes a block with three ports: Substance, Pressure, Temperature. The user connects water, atm(1), and kelvin(373.15). The editor produces the kernel term. The vocabulary determines the block library.

Query builders. A structured query interface lists available predicates from loaded definitions. The user selects BoilingPoint, fills in water and _ wildcards, and gets back all matching claims with their attestations.

In all cases, the interface works with kernel terms. The vocabulary provides the human-meaningful surface.

8.3 Subjective Domains: Art

Art is a useful test case because it spans the full range from objective fact to subjective judgment. The same definition mechanism handles all of them.

Objective knowledge is straightforward:

Artist    : Type
Artwork   : Type
Medium    : Type   // Oil, Watercolor, Bronze, Digital, ...
Movement  : Type   // Impressionism, Cubism, Baroque, ...
Technique : Type   // Chiaroscuro, Impasto, Sfumato, ...

CreatedBy     : Artwork → Artist → Prop
CreatedIn     : Artwork → Period → Prop
BelongsTo     : Artwork → Movement → Prop
HousedAt      : Artwork → Location → Prop
InfluencedBy  : Artist → Artist → Prop

CreatedBy starry_night van_gogh
CreatedIn starry_night (year 1889)
BelongsTo starry_night post_impressionism
HousedAt starry_night moma

Attested by museum records, art historical sources, encyclopedia entries.

Analytical knowledge requires expertise but remains attestable:

UseTechnique   : Artwork → Technique → Prop
Depicts        : Artwork → Subject → Prop
EvokesTheme    : Artwork → Theme → Prop
References     : Artwork → Artwork → Prop

UseTechnique starry_night impasto
References picasso_guernica goya_third_of_may

“Guernica references The Third of May” is attested by art historians citing specific visual parallels. The evidence is interpretive rather than measured, but the structure is identical: a claim with attestations carrying sources.

Subjective judgments use the same mechanism by making the subject explicit. “Starry Night is beautiful” is not a proposition about the artwork. It is a proposition about someone’s experience:

Aesthetic  : Type   // Beautiful, Sublime, Disturbing, Kitsch, ...
Considers  : Agent → Artwork → Aesthetic → Prop
Prefers    : Agent → Artwork → Artwork → Prop

Considers alice starry_night beautiful

This is an objective claim about a subjective state. It is either true or false, and Alice can attest it. The network does not represent “Starry Night is beautiful.” It represents “47 agents consider Starry Night beautiful, 3 consider it overrated, 2 consider it kitsch.” Aggregate subjective knowledge (“most people consider Starry Night beautiful”) becomes an empirical claim attestable by survey data or by aggregating Considers claims in the network.
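The aggregation step can be sketched directly. The claim tuples below are a simplification of real Considers claims (which would be attested, hashed objects), but the counting logic is the point.

```python
# Hedged sketch: aggregating per-agent Considers claims into the
# network's actual representation of a "subjective" fact.
from collections import Counter

claims = [
    ("alice", "starry_night", "beautiful"),
    ("bob",   "starry_night", "beautiful"),
    ("carol", "starry_night", "kitsch"),
]

def aggregate(claims, artwork):
    """Count aesthetic judgments about one artwork across agents."""
    return Counter(j for _, art, j in claims if art == artwork)

tally = aggregate(claims, "starry_night")
# tally → Counter({'beautiful': 2, 'kitsch': 1}); "most agents consider
# Starry Night beautiful" is then an empirical claim attestable from it.
```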

Cross-domain reasoning works because all definitions share the same kernel:

RespondsTo     : Artwork → HistoricalEvent → Prop
CreatedDuring  : Artwork → HistoricalEvent → Prop

RespondsTo picasso_guernica spanish_civil_war
CreatedDuring starry_night (stay_at asylum_saint_paul)

An agent querying “how did the Spanish Civil War influence art?” follows formal links from historical events to artworks to techniques to movements, with every link attested.

The same pattern applies to any subjective domain: music, cuisine, film, literature. The definitions do not formalize subjectivity. They formalize claims about subjective experiences, which are objective.

8.4 Convergence

Shared vocabularies improve the experience: exact matching, deduplication, consistent tooling. The protocol makes convergence easy:

  • Vocabulary manifests give agents a single hash to adopt rather than dozens of individual definitions.
  • Widely-used vocabularies are easier to query. A claim using the dominant physics vocabulary will be found by more peers.
  • Interface tooling converges on popular vocabularies. An LLM trained on a vocabulary translates into it.
  • When two vocabularies describe the same concept differently, a Derivation can prove the equivalence. Nodes can maintain equivalence indices across vocabulary versions.

People will publish competing vocabularies, abandon underused ones, and adopt winners, like terminology in scientific fields.

8.5 Dependency Structure

Definitions build on other definitions. A pharmacology definition references biochemistry definitions, which reference chemistry definitions, which reference the core prelude. This forms a dependency graph:

  • Core Prelude (logic, Nat, Bool, List)
    • Mathematics (Real, algebra, analysis, topology)
    • Units and Measurement (SI, derived units, dimensional analysis)
      • Physics (mechanics, thermodynamics, electromagnetism)
      • Chemistry (elements, compounds, reactions)
        • Biochemistry (proteins, enzymes, metabolic pathways)
          • Pharmacology (drugs, dosages, interactions)
    • Geography (locations, regions, coordinates)
      • History (events, periods, causation)
        • Art History (artists, movements, influence)
    • Linguistics (languages, grammar, semantics)
    • Agent and Subjectivity (subjective claims, preferences, experience)
      • Art (aesthetic judgments, cross-references art history)
      • Music (genres, instruments, harmonic analysis)
      • Cuisine (flavors, techniques, cultural context)

The only protocol-level requirement is the core prelude: the fixed set of definitions (logic, basic types, axioms) whose hashes are part of the protocol specification. Everything above the prelude is user-published definitions with no special status. The network’s dependency resolution handles the rest: requesting any claim pulls in every definition it transitively references.
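Transitive dependency resolution is a closure over the hash-reference graph. A minimal sketch, with invented names standing in for content hashes:

```python
# Hedged sketch: dependency resolution as transitive closure over a
# hash -> referenced-hashes graph. Names stand in for content hashes.

def resolve(root: str, deps: dict) -> set:
    """Every definition a claim transitively references, root included."""
    seen, stack = set(), [root]
    while stack:
        h = stack.pop()
        if h in seen:
            continue
        seen.add(h)
        stack.extend(deps.get(h, []))
    return seen

deps = {
    "pharma_claim": ["Drug", "Interaction"],
    "Drug":         ["Compound"],
    "Compound":     ["Element"],
    "Element":      ["prelude:Type"],
    "Interaction":  ["Compound"],
}
needed = resolve("pharma_claim", deps)
# Requesting the pharmacology claim pulls in everything down to the prelude.
```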

9. The Interface Layer

The interface layer bridges the formal protocol to natural language, programming languages, and visual tools.

9.1 LLMs

LLMs are the primary interface today. An LLM does not participate in the trust model: it does not attest to claims, it does not have an agent identity, and its assertions carry no weight. It is a translator between natural language and kernel terms.

The LLM performs four roles:

Translation (natural language to formal claim). A user says “aspirin reduces inflammation.” The LLM, given the active vocabulary, produces the corresponding kernel term. The LLM emits a term that must type-check. Invalid output is rejected by the kernel, not by heuristics.

Query (natural language to formal query, formal results to natural language). A user asks “what reduces inflammation?” The LLM generates a query against the network (searching for claims matching a pattern like Reduces(_, inflammation)), receives results with their attestation histories and confidence scores, and synthesizes a natural language answer grounded in those results.

Grounded response. When the LLM answers a question using network results, it can cite specific claims and their confidence. “Aspirin reduces inflammation (14 attestations, 3 peer-reviewed sources, confidence 0.94).” The LLM is reporting the network’s epistemic state, not expressing confidence in its own assertion.

Claim proposal. The LLM (or an automated system) can propose new claims based on reasoning over existing knowledge. These enter the network with zero attestations and zero confidence. They become knowledge only when independently verified.

9.2 The Translation Problem

Translating natural language to formal claims is the hardest part of the interface layer. It requires:

  • Vocabulary awareness. The LLM must know which definitions are available and select the appropriate ones. This is a context problem: the LLM is given the relevant vocabulary definitions as part of its prompt or fine-tuning.
  • Ambiguity resolution. “Banks can be eroded by rivers” and “banks can be eroded by financial crises” use different vocabularies. The LLM must disambiguate, potentially by asking the user or by checking which interpretation produces a well-typed term.
  • Granularity matching. “Aspirin helps with headaches” can be formalized at many levels of specificity. The LLM must choose an appropriate level, or generate multiple claims at different granularities.
  • Type-checking as validation. Every proposed term is type-checked by the kernel. If it does not type-check, the LLM is informed and must retry. No ill-typed claim enters the network, regardless of how confident the LLM is.
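The type-check-and-retry loop can be sketched as follows. `llm_translate` and `kernel_typecheck` are stand-ins for the real interface and kernel; the toy implementations exist only to exercise the loop.

```python
# Hedged sketch of the translate / type-check / retry loop. The two
# callables are stand-ins for the real LLM interface and kernel.

def formalize(statement, llm_translate, kernel_typecheck, max_attempts=3):
    """Return the first proposed term that type-checks, else None."""
    feedback = None
    for _ in range(max_attempts):
        term = llm_translate(statement, feedback)
        ok, feedback = kernel_typecheck(term)
        if ok:
            return term            # only well-typed terms leave the loop
    return None                    # nothing ill-typed enters the network

# Toy stand-ins: the "LLM" fixes its term once it sees kernel feedback.
def toy_llm(stmt, feedback):
    return ("good_term",) if feedback else ("bad_term",)

def toy_kernel(term):
    ok = term == ("good_term",)
    return (ok, None if ok else "type error: expected Prop")

result = formalize("aspirin reduces inflammation", toy_llm, toy_kernel)
# result → ("good_term",) after one rejected attempt
```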

9.3 Constraining Hallucination

The protocol constrains LLM hallucination structurally:

  1. Claims without attestations are explicitly unverified. The network distinguishes between “this is known” (attested) and “this has been stated” (unattested). An LLM querying the network sees this distinction.
  2. Confidence is computed from evidence, not from the LLM’s weights. The network provides confidence scores; the LLM reports them.
  3. Formal claims are verifiable. A claim is a typed term. It either type-checks or it doesn’t. A derivation either proves the claim or it doesn’t.
  4. The LLM cannot forge attestations. Attestations are signed by agent keys. The LLM does not hold a key (unless explicitly given one by an agent who takes responsibility for LLM-generated attestations).

The LLM can still mistranslate natural language into the wrong formal claim. But the failure mode shifts: instead of “the LLM confidently states something false,” you get “the LLM produces a well-typed claim that doesn’t match the user’s intent.” The second is more tractable.

9.4 LLM-Assisted Verification

Beyond translation, LLMs can assist the verification process:

  • Suggesting attestation strategies. Given a claim, the LLM can suggest what evidence would support it: “this claim could be attested by referencing paper X” or “this is derivable from claims Y and Z.”
  • DOI/source lookup. The LLM can check whether a cited DOI actually supports the claim it’s attached to, producing an audit attestation.
  • Contradiction detection. The LLM can flag when a new claim appears to contradict existing attested claims, prompting human review.
  • Derivation sketching. For claims that should be formally provable, the LLM can sketch a proof that a dedicated prover then verifies and completes.

In all cases, the LLM’s output is validated by the formal layer. It can be wrong; the protocol catches it.

10. The Practical Subset

A reference implementation is under active development, operating in a subset of the kernel’s expressiveness that will expand as tooling matures.

10.1 Supported Claim Patterns

The practical subset covers first-order empirical and relational claims:

Relational. A predicate applied to constants.

Element(gold)
Capital(tokyo, japan)

Quantified. Universal or existential statements over a single type.

∀ (x : Element). NobleGas(x) → Nonreactive(x)
∃ (x : Element). AtomicNumber(x) > 100

Equational. A measurement or property equals a value.

Eq (BoilingPoint water (atm 1)) (kelvin 373.15)
Eq (AtomicNumber gold) 79

Conditional. An implication between propositions.

Administered(aspirin, therapeutic_dose) → Reduces(inflammation)
Pressure_increase(system) → BoilingPoint_increase(system)

Causal. A causal relation (distinct from logical implication).

Causes(smoking, lung_cancer)
Causes(greenhouse_emissions, temperature_increase)

These patterns cover most empirical knowledge expressible in natural language.

10.2 Expansion Path

The subset expands through three mechanisms:

  1. Better LLMs. As language models improve at formal reasoning, and as fine-tuned models specialize in structured knowledge translation, they can generate more complex claim structures and eventually proof sketches.
  2. Integration with proof assistants. Lean 4 and Coq can export proofs as Veritas Derivation objects, bringing formalized mathematics into the network.
  3. Knowledge import. Existing knowledge sources (scientific databases, legal corpora, institutional records) can be ingested into the network. Each import carries proper provenance tagging, with attestation confidence potentially weighted by the credibility of the source.

Each mechanism brings more of the kernel’s expressiveness into active use without protocol changes.

11. Convergence

Veritas rests on a convergence from two directions.

11.1 From Below: Formal Systems Become Accessible

Proof assistants today contain deep, verified mathematical knowledge, but require specialized expertise to use. Veritas, through its LLM interface, makes this knowledge queryable in natural language. A user asks “is every continuous function integrable?” and receives a grounded answer citing a formal proof in the network, not an LLM’s best guess.

As formalization efforts (Lean’s mathlib, Coq’s libraries) grow, and as Veritas can import their results, the network accumulates a growing body of mechanically verified knowledge accessible to anyone through natural language.

11.2 From Above: AI Systems Become More Rigorous

LLMs today generate plausible text without epistemic grounding. Veritas provides the grounding layer. As LLMs improve at formal reasoning, the interface layer becomes thinner: translation becomes more reliable, derivation sketches become more complete, and the gap between natural language and formal representation narrows.

None of this requires changes to the foundation. The kernel is expressive enough. The network protocol handles all object types. The trust model supports both deductive and empirical knowledge. Only the tooling on top changes.

11.3 The End State

At full convergence, the network contains:

  • The entirety of formalized mathematics, mechanically verified.
  • The empirical knowledge of every scientific domain, attested by evidence trails linking claims to papers, datasets, and measurements.
  • Logical bridges between domains, enabling cross-domain reasoning (“this drug targets this protein because of this chemical property, which follows from this physical law”).
  • All of it queryable in natural language, with every answer traceable to its formal justification.

Alongside the public network, organizations maintain their own knowledge stores: corporate research, institutional memory, strategic analysis. Individuals maintain personal stores: memories, preferences, context. Agents reason across all of them simultaneously, with each source controlling its own sharing boundary.

A knowledge graph is flat: subject-predicate-object triples with no logical depth. A database stores facts but cannot reason over them. Veritas is a distributed, content-addressed, incrementally verifiable body of structured knowledge where a derivation can chain from an empirical measurement through a chemical property to a formal mathematical proof.

12. Evolution and Versioning

12.1 The Kernel as an Abstraction Boundary

The kernel is the deepest layer of the protocol, but it is not permanent. CIC is the best available foundation today. It may not be in ten years. Type theory is an active research area; cubical type theory17, for instance, handles equality more elegantly than CIC and may prove superior for a knowledge network where equivalence between claims is a central concern.

The kernel can be replaced without destroying the network. It sits behind an abstraction boundary: the rest of the system (network protocol, trust model, interface layer) interacts with it only through typed, content-addressed objects. The system depends on a kernel that can type-check terms, verify proofs, and produce deterministic content hashes. It does not depend on CIC specifically.

12.2 Protocol Versioning

The protocol version covers the wire format and message types. The kernel version is a separate concern: two nodes can speak the same protocol version while using different kernel versions, as long as they can agree on how to interpret objects.

A kernel version defines:

  • The term grammar (which constructors exist and what they mean).
  • The reduction rules (how terms evaluate).
  • The type-checking algorithm (what is well-typed).
  • The canonical encoding (how terms are serialized for hashing).

Changing any of these changes the content hashes of objects. A claim expressed in kernel v1 and the same claim expressed in kernel v2 will generally have different hashes, even if they mean the same thing.

12.3 Migration

When a new kernel version is introduced, existing objects must be migrated. Migration must be lossless: every well-typed object in kernel vN can be expressed as a well-typed object in kernel vN+1 with identical semantics. The new kernel must be at least as expressive as the old one.

Migration is mechanical. A migration function migrate(v_old, v_new, object) → object transforms the term representation and recomputes the hash. Because the function is deterministic, all nodes migrating the same object produce the same new hash. The network can maintain a mapping from old hashes to new hashes, allowing queries against either version during a transition period.
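A minimal sketch of that mechanism. SHA-256 stands in for the protocol's hash, and the v1-to-v2 rewrite (a constructor rename) is invented purely to show the shape of a deterministic migration plus its hash mapping.

```python
# Hedged sketch: deterministic migration plus an old-hash -> new-hash map.
# SHA-256 stands in for the protocol hash; the rewrite rule is invented.
import hashlib
import json

def h(obj) -> str:
    return hashlib.sha256(
        json.dumps(obj, sort_keys=True).encode()).hexdigest()

def migrate(obj: dict) -> dict:
    """Mechanically rewrite a v1 object into an (assumed) v2 grammar."""
    return {**obj, "kernel": "v2",
            "term": obj["term"].replace("Lam(", "Lambda(")}

store = [{"kernel": "v1", "term": "Lam(x, App(f, x))"}]
mapping = {h(o): h(migrate(o)) for o in store}
# Because migrate is deterministic, every node computes the same mapping,
# so queries can be answered under either hash during the transition.
```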

12.4 Coexistence

During a kernel transition, the network will contain objects from both versions. Nodes must be able to:

  • Verify objects from any supported kernel version. A node running kernel v2 can still type-check a v1 object by running the v1 checker (or by migrating the object to v2 first and checking it there).
  • Translate references across versions. If a v2 claim references a definition via a v1 hash, the node can resolve it through the migration mapping.
  • Serve objects in the version requested. Query results should indicate which kernel version each object uses. A client can request migration on delivery.

The handshake already negotiates protocol version. It can be extended to advertise supported kernel versions, allowing peers to communicate which objects they can verify natively.

12.5 What Cannot Change

Certain properties hold across all kernel versions:

  • Content addressing. Objects are identified by the hash of their canonical encoding. This is a protocol-level property, not a kernel-level one.
  • The object model. Definitions, Claims, Derivations, Attestations, and Datasets exist regardless of kernel. A new kernel might change what a Claim’s term looks like internally, but there is still a Claim type with a term that has type Prop.
  • The trust model. Derivations are mechanically verified (by whatever kernel is in use). Attestations carry external evidence. Confidence is computed locally. None of this depends on which type theory is underneath.
  • The network protocol. Framing, message types, sync, query routing. These are kernel-agnostic.

The kernel is a pluggable component. The protocol is the stable shell around it.

12.6 Definition Stability

Domain definitions are kernel objects, so they are affected by kernel migrations. However, the semantic intent of a definition is stable across versions. BoilingPoint : Substance → Measurement Pressure → Measurement Temperature → Prop means the same thing regardless of how the kernel encodes it.

Definition versioning is independent of kernel versioning. A set of definitions can be updated (adding new definitions, refining types) without a kernel change, and a kernel can be upgraded without altering definition semantics.

12.7 Cryptographic Agility

The same principle applies to cryptographic primitives. The protocol depends on content addressing and digital signatures, not on BLAKE3 and Ed25519 specifically. Both will be superseded.

Hashes are self-describing. Every hash in the network carries a prefix identifying the algorithm and digest length, following the multihash convention18: <algorithm_id><digest_length><digest>. The current default is BLAKE3 (32 bytes). Nodes advertise supported hash algorithms during handshake.

This means:

  • Objects hashed with different algorithms can coexist in the network. A node recognizes which algorithm produced a hash and uses the corresponding verifier.
  • Migration to a new hash algorithm is the same process as kernel migration: rehash all objects, maintain a mapping from old hashes to new hashes. Because the canonical encoding is deterministic, all nodes produce the same new hashes.
  • A hash algorithm transition does not require a flag day. New objects use the new algorithm. Old objects remain valid under their original algorithm indefinitely, or are migrated incrementally.
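The prefix layout can be sketched concretely. This uses SHA-256 (with 0x12, its code in the multihash table) as a stand-in for BLAKE3, which is not in the Python standard library; real multihash prefixes are varints, which coincide with single bytes for these small values.

```python
# Hedged sketch of the self-describing layout
# <algorithm_id><digest_length><digest>. SHA-256 stands in for BLAKE3.
import hashlib

ALGORITHMS = {0x12: hashlib.sha256}        # id -> hash constructor

def encode(algo_id: int, data: bytes) -> bytes:
    digest = ALGORITHMS[algo_id](data).digest()
    return bytes([algo_id, len(digest)]) + digest

def verify(mh: bytes, data: bytes) -> bool:
    """A node reads the prefix, picks the verifier, checks the digest."""
    algo_id, length, digest = mh[0], mh[1], mh[2:]
    return (len(digest) == length
            and ALGORITHMS[algo_id](data).digest() == digest)

mh = encode(0x12, b"BoilingPoint water (atm 1) (kelvin 373.15)")
```

Adding a new algorithm is one entry in the table; old hashes keep verifying under their original prefix, which is the no-flag-day property described above.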

Signatures use the same approach. Attestation signatures carry an algorithm identifier. The current default is Ed25519. If a post-quantum signature scheme becomes necessary, new attestations use it while old attestations remain verifiable under their original scheme.

The protocol depends on properties, not implementations: collision resistance and preimage resistance for hashes, unforgeability for signatures. Any algorithm satisfying these properties is a valid substitution.

13. Open Problems

13.1 Definition Convergence

Vocabulary manifests make convergence easy, but deciding what the “standard” physics vocabulary looks like is a social problem, not a technical one. Possible approaches: curated vocabulary manifests maintained by domain experts, registries with adoption statistics, or organic emergence.

I expect the core physics and chemistry vocabularies to converge quickly once a small group of domain experts seeds them, similar to how IUPAC nomenclature19 became canonical. How quickly is an open question.

13.2 Attestation Quality

How do you prevent low-quality attestations from flooding the network? A bot could generate millions of attestations for arbitrary claims. Possible mitigations: reputation systems on agent keys, stake-based rate limiting, attestation quality metrics. The protocol does not mandate a solution; nodes can filter by whatever criteria they choose.

13.3 Translation Fidelity

How do you know the LLM correctly translated a natural language statement into a formal claim? A human can inspect any individual formal claim, but verifying translation fidelity at scale is hard. Techniques from formal specification and contract-based design may help: the LLM can generate a back-translation, and the user confirms it matches their intent.

13.4 Ontological Commitment

Domain definitions make ontological commitments. Defining Causes : Event → Event → Prop presupposes a specific model of causation. Different philosophical frameworks model causation differently. The protocol allows multiple vocabularies, but this fragments the knowledge base. There may be no clean solution; ontological disagreement is inherent to human knowledge.

13.5 Computational Efficiency

Type-checking dependent type theory terms from untrusted sources is computationally expensive. Maliciously crafted terms can trigger exponential reduction. The kernel spec recommends fuel limits, but concrete bounds and their security properties need analysis.

13.6 Bootstrapping

The network’s value depends on the amount and quality of knowledge in it. Initially, it contains only the core prelude. Concrete bootstrapping steps:

  • Formal mathematics: Export thousands of theorems and proofs from Lean 4’s mathlib as Derivations via an automated bridge. This seeds the network with high-confidence deductive knowledge immediately.
  • Empirical structured data: Import from Wikidata20, PubChem21, and major scientific databases as Claims with Attestations attributed to a “2026 Snapshot Import” agent. This is not a bulk copy. Wikidata triples and PubChem records need to be mapped into the kernel’s type system, which requires vocabulary design decisions for each domain. The mapping is lossy in places (Wikidata’s loose typing does not survive translation into dependent types without editorial choices), and competing mappings will produce incompatible claims. This is engineering and community work, not a script.
  • Targeted attestation campaigns: Run a public program (initially recognition-focused, later possibly with small bounties) for high-value empirical claims such as physical constants, drug interactions, and materials properties. Early contributors include research institutions and individual scientists.

Together, these should bring the network to critical mass (tens of thousands of interconnected claims) within 6 to 12 months.

Direct Influence

  • Lean 4 / mathlib. The kernel is a deliberate subset of Lean’s Calculus of Inductive Constructions (CIC with proof irrelevance, quotient types, and a classical axiom). CIC was chosen because its primitives are the right ones for expressing structured knowledge. A practical bonus: the full expressiveness of Lean and the large body of formalized mathematics in mathlib can be imported directly as Derivations. The content-addressed Def(hash) representation converges with Lean’s general approach to definitions and modular structure, though the mechanism itself was developed independently for the Veritas network.

Independent Convergence

The remaining parallels were discovered after the core design stabilized. That independent lines of thinking converge on similar structures suggests these patterns are natural to the problem.

  • IPFS / content addressing. Veritas uses content hashes22 as universal identifiers for typed logical objects. IPFS applies the same principle to arbitrary files. The designs converge because content addressing is the natural solution whenever identity should follow from content rather than location.
  • Nostr.23 The agent-is-a-keypair identity model and content-addressed event structure parallel Nostr’s design. Nostr applies these to social messaging; Veritas applies them to knowledge. Both arrive at the same primitives because decentralized systems without a central authority need self-certifying identity.
  • Wikidata / knowledge graphs. Wikidata’s structured knowledge with provenance references resembles Veritas’ attestation model. The parallel reflects a shared constraint: knowledge without traceable origin is not trustworthy. Veritas goes further by grounding provenance in formal verification rather than editorial consensus.
  • Semantic Web / RDF. The vision of machine-readable linked knowledge is shared. Veritas rejects RDF’s triple model in favor of dependent type theory for its vastly greater expressiveness, but the underlying goal of making knowledge interoperable across systems is the same one the W3C pursued two decades ago.
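The content-addressing principle shared with IPFS can be sketched in a few lines: identity is the hash of a canonical serialization, so the same logical object always gets the same identifier regardless of where it is stored or how its fields were ordered. The document specifies BLAKE3; the standard library's `blake2b` stands in here, and the JSON canonicalization is an assumption, not the Veritas wire format.

```python
# Minimal sketch of content addressing a typed object, in the spirit
# of Def(hash) identifiers. blake2b is a stdlib stand-in for BLAKE3.
import hashlib
import json

def content_id(obj: dict) -> str:
    # Canonical serialization: sorted keys, no whitespace, so the same
    # logical object always yields the same bytes (and thus the same hash).
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.blake2b(canonical, digest_size=32).hexdigest()

a = content_id({"kind": "Def", "name": "add", "body": "fun n m => n + m"})
b = content_id({"body": "fun n m => n + m", "name": "add", "kind": "Def"})
assert a == b  # identity follows from content, not key order or location
```

This is why the designs converge: once identity follows from content, deduplication, integrity checking, and location-independent retrieval all fall out of the same mechanism.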

Footnotes

  1. Sebastian Farquhar et al., “Detecting hallucinations in large language models using semantic entropy,” Nature 630 (2024), pp. 625–630. DOI: 10.1038/s41586-024-07421-0

  2. Adam Tauman Kalai and Santosh S. Vempala, “Calibrated Language Models Must Hallucinate,” STOC 2024, pp. 160–171. Proves mathematically that calibrated language models have an inherent lower bound on hallucination rate, independent of architecture or training. arXiv: 2311.14648

  3. Richard Cyganiak, David Wood, and Markus Lanthaler, eds., “RDF 1.1 Concepts and Abstract Syntax,” W3C Recommendation, 2014. w3.org/TR/rdf11-concepts — Steve Harris and Andy Seaborne, eds., “SPARQL 1.1 Query Language,” W3C Recommendation, 2013. w3.org/TR/sparql11-query

  4. Thierry Coquand and Gérard Huet, “The Calculus of Constructions,” Information and Computation 76(2–3):95–120, 1988. DOI: 10.1016/0890-5401(88)90005-3 — Extended with inductive types by Christine Paulin-Mohring, TLCA 1993. DOI: 10.1007/BFb0037116

  5. Nicolaas Govert de Bruijn, “Lambda Calculus Notation with Nameless Dummies,” Indagationes Mathematicae 75(5):381–392, 1972. DOI: 10.1016/1385-7258(72)90034-0

  6. Jack O’Connor, Jean-Philippe Aumasson, Samuel Neves, and Zooko Wilcox-O’Hearn, BLAKE3: One Function, Fast Everywhere, 2020. Spec (PDF)

  7. Leonardo de Moura and Sebastian Ullrich, “The Lean 4 Theorem Prover and Programming Language,” CADE 28, pp. 625–635, 2021. DOI: 10.1007/978-3-030-79876-5_37

  8. Yves Bertot and Pierre Castéran, Interactive Theorem Proving and Program Development: Coq’Art, Springer, 2004. DOI: 10.1007/978-3-662-07964-5 — Now renamed to Rocq.

  9. The mathlib Community, “The Lean Mathematical Library,” CPP 2020, pp. 367–381. arXiv: 1910.09336 — As of early 2026, mathlib contains roughly 265,000 theorems and 127,000 definitions across 2 million lines of code.

  10. Juan Benet, “IPFS — Content Addressed, Versioned, P2P File System,” 2014. arXiv: 1407.3561 — Peer-reviewed evaluation: Dennis Trautwein et al., “Design and Evaluation of IPFS,” ACM SIGCOMM 2022.

  11. P.J. Linstrom and W.G. Mallard, eds., NIST Chemistry WebBook, NIST Standard Reference Database Number 69. DOI: 10.18434/T4D303

  12. Daniel J. Bernstein et al., “High-speed high-security signatures,” Journal of Cryptographic Engineering 2(2):77–89, 2012. DOI: 10.1007/s13389-012-0027-1

  13. Eric Rescorla, “The Transport Layer Security (TLS) Protocol Version 1.3,” RFC 8446, August 2018. rfc-editor.org/rfc/rfc8446

  14. Stuart Cheshire and Marc Krochmal, “Multicast DNS,” RFC 6762, February 2013. rfc-editor.org/rfc/rfc6762

  15. Patrick Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” NeurIPS 2020, pp. 9459–9474. arXiv: 2005.11401

  16. Thomas N. Kipf and Max Welling, “Semi-Supervised Classification with Graph Convolutional Networks,” ICLR 2017. arXiv: 1609.02907 — For inductive graph embeddings: William L. Hamilton, Rex Ying, and Jure Leskovec, “Inductive Representation Learning on Large Graphs,” NeurIPS 2017. arXiv: 1706.02216

  17. Cyril Cohen, Thierry Coquand, Simon Huber, and Anders Mörtberg, “Cubical Type Theory: A Constructive Interpretation of the Univalence Axiom,” TYPES 2015, LIPIcs vol. 69, 2018. arXiv: 1611.02108

  18. “Multihash — Self-describing hashes,” Multiformats project (Protocol Labs). multiformats.io/multihash — IETF draft: Juan Benet and Manu Sporny, “The Multihash Data Format,” draft-multiformats-multihash-07, 2023.

  19. Henri A. Favre and Warren H. Powell, Nomenclature of Organic Chemistry: IUPAC Recommendations and Preferred Names 2013, Royal Society of Chemistry, 2014. DOI: 10.1039/9781849733069

  20. Denny Vrandečić and Markus Krötzsch, “Wikidata: A Free Collaborative Knowledgebase,” Communications of the ACM 57(10):78–85, 2014. DOI: 10.1145/2629489

  21. Sunghwan Kim et al., “PubChem 2025 update,” Nucleic Acids Research 53(D1):D1516–D1525, 2025. DOI: 10.1093/nar/gkae1059

  22. Ralph C. Merkle, “A Digital Signature Based on a Conventional Encryption Function,” CRYPTO ’87, pp. 369–378, 1988. DOI: 10.1007/3-540-48184-2_32 — Git’s object store uses the same principle: every object is identified by its content hash.

  23. fiatjaf et al., “NIP-01: Basic Protocol Flow Description,” github.com/nostr-protocol/nips — Defines the core event structure, Schnorr signatures over secp256k1, and WebSocket relay communication.