Every enterprise AI procurement conversation in 2026 ends with the same reassurance from the vendor: "we do not train on your data". The procurement team writes that line into the contract, the security team initials it, and the deal closes. Everyone moves on to the next sprint feeling that the IP risk has been handled. It has not been handled. The reassurance is true, narrowly true, and almost completely beside the point. The serious risk to your intellectual property in 2026 is not that a frontier provider is going to fine-tune the next model on your private repository. The serious risk is that your engineers are going to spend the next eighteen months pasting the architectural soul of your business into a chat window — the constructs, the design principles, the ordering of operations, the names you chose for your abstractions, the trade-offs you made between consistency and latency, the way you decided to model your customers — and that all of this pasting is going to make a competent AI assistant trained on the public web a co-author of your thinking, even when the receiving system dutifully retains not a single token. And what happens after that, what happens when a competitor or a clever solo developer feeds the same assistant a one-paragraph description of what your product does, is the part that the "we do not train on your data" clause does nothing to address.
This is what I want to walk through carefully and at length, because it is the single most under-discussed risk in enterprise AI right now, and the response most companies are going to settle for — "we put the right clauses in the contract" — is the wrong response. The right response is structural. It involves bringing more of the loop in-house, hosting your own model where the stakes warrant it, and treating prompts the way you treat outbound traffic to any other vendor: as something with a budget, a routing policy, and a redaction layer. Most of this post is the argument for why that response is necessary, and the rest is what it looks like to actually do it.
The first thing to understand is what the public training-data debate has done to the conversation. For the last three years the public-facing argument about AI and intellectual property has revolved almost entirely around training data. The lawsuits filed against the major labs are about whether ingestion of copyrighted text into a training corpus constitutes infringement. The contracts negotiated between enterprises and vendors are about whether prompts will be retained and used to fine-tune the next base model. The certifications offered by the labs — zero data retention, no training on customer prompts, regional data residency — are all calibrated to that question. And so a procurement officer who has done their job correctly, by the conventional standards of 2024 and 2025, has secured a contract that prevents the vendor from training on the customer's data, has secured a deletion policy with a defined retention window, and has secured a region-locked deployment so that the bytes do not leave the relevant jurisdiction. They have done good work. They have also, almost entirely, missed what is actually happening to the customer's IP.
What is actually happening is this. Your senior engineer opens an AI coding assistant, pastes in three thousand lines of context — service code, a configuration file, a chunk of the data model, a portion of the test suite — and asks the model to add a feature. The model produces a working implementation. Your engineer accepts it. The model has now seen, in context, a substantial fraction of how your service is structured. The model does not retain that context after the session ends. The vendor does not train on it. The bytes are deleted on schedule. And yet a kind of learning has occurred anyway, in two ways. First, the model itself — the public, base model that everyone shares — already knows, before your engineer ever opened the chat window, the entire universe of design patterns, architectural conventions, framework idioms, and anti-patterns that exist on GitHub and in technical writing. When your engineer pastes in your service code, the model is not learning anything new from those bytes; it is simply recognising which of the public patterns your code happens to be a particular instance of. Second, and this is the part that matters most: when the model produces a feature for your codebase, it does so by projecting your design onto the universe of public patterns it already knows, and choosing the pattern that fits best. That projection is the dangerous step. Because after that projection, the idea of your service — the category of system you are running, the shape of the abstractions you chose, the style of trade-off you make — has been encoded in the model's working memory for the duration of the session. The model does not save that encoding. But the encoding is trivially reproducible: anyone with a competent description of your product can reproduce most of it, in a different programming language, in a different framework, in an afternoon, because the model that they are using already knows how to do it.
This is the part that engineers find difficult to internalise, because it inverts something we have all been taught for thirty years. The thing we were taught is that implementation has value because implementation is hard. The thing that has changed is that implementation is no longer hard, and the thing that is now valuable — design, naming, abstraction choice, the shape of the data model — is the thing that fits cleanly into a single prompt. A senior engineer who could once protect a company's moat simply by being one of the few people on earth who knew how to write a particular kind of system has been replaced, for moat purposes, by the question "how would I describe this system in two paragraphs?". If the answer to that question is two paragraphs that any AI assistant could turn into a working system in a week, then the moat is the two paragraphs, not the system.
The example that crystallised this for a lot of people in the industry was ClawCode. For those who have not followed the story closely: ClawCode is the open-source replica of Claude Code that appeared, in a different programming language, with substantially the same set of features and substantially the same operator experience, only a few months after the original was released. The replica was not built by reverse-engineering the binary, and as far as anyone can tell, it was not built by stealing source code or by any of the things that would constitute conventional intellectual property infringement. It was built by reading the public documentation, watching the developer-focused live streams, and using a competent AI coding assistant to translate the idea of Claude Code — its agent loop, its tool catalogue, its session model, its keybindings, its plan mode, its hook system — into a working implementation. The implementation is not bit-identical. The implementation is behaviourally identical, which is the only kind of equivalence that matters for a tool that operators interact with at the level of features and ergonomics. And the implementation came together fast, because the idea of Claude Code, once written down, is not actually that large.
Now, you can have whatever feelings you want about ClawCode specifically. You can think it was a useful contribution to the ecosystem, or a parasitic cloning, or both. None of that matters for the argument I am making. What matters is the demonstration: a competent developer with a competent AI assistant can, in 2026, replicate the idea of a sophisticated piece of software in weeks, not years, as long as they have a clear description of what the software does. The description is the moat. The implementation is not. And the description fits into the context window of any frontier model that anyone can rent.
What this means for your codebase, specifically, is that any time you give an AI assistant — yours, your competitor's, a contractor's, a former employee's — enough context to understand what you are building and why, you have given it the moat-equivalent of your codebase. The actual code is now optional. The actual code can be regenerated. The thing that cannot be regenerated, the thing that took your team five years to figure out, is the description of what to do, and that description is now small and portable and lives easily in a single prompt.
There is a counter-argument here that goes: "but our domain is so specific, so regulated, so strange, that no AI assistant could possibly figure it out from a description". This counter-argument is, sadly, mostly wrong, and for an unsettling reason. Frontier models are now trained on so much industry-specific text — regulatory filings, post-mortems, technical blogs, architectural diagrams from conference talks, leaked documentation that was indexed before being taken down — that they have a quite passable understanding of most regulated industries. They know how core banking systems are structured, even if they do not know exactly how your core banking system is structured. They know how clinical trial data flows, even if they do not know exactly how your trial data flows. They know how a large insurer's claims pipeline is wired up, even if they do not know exactly how your claims pipeline is wired up. The gap between the public knowledge and your specific implementation is the gap your engineer fills in, every time they paste your context into the chat window. And once your engineer has filled in that gap, the model has, for the duration of the session, the full picture. The vendor still does not retain it. But your engineer has just demonstrated, in real time, that the full picture is constructible from a small number of inputs, and that constructibility is a property of your business now, whether or not any specific model retains the bytes.
There is a second counter-argument that goes: "but our engineers do not paste production code into chat windows; we have policies". This counter-argument also fails, for two reasons that matter in different ways. The first is that policies of this kind have a vanishing half-life inside any engineering org that uses AI assistants seriously. The whole reason engineers are using assistants is that the assistants are most useful when they have more context, not less, and the natural pressure inside any team that takes velocity seriously is to give the assistant enough context to actually be helpful. Policies that say "no production code in chat windows" erode within months, not years. The second reason is that even if your engineers obey the policy perfectly, the structural description of your system — what it does, why, in what order — leaks out in countless other ways. It leaks out in pull-request descriptions that an engineer asks an assistant to clean up. It leaks out in design docs that an engineer asks an assistant to summarise. It leaks out in architectural diagrams that an engineer asks an assistant to re-format. It leaks out in commit messages, in incident reports, in onboarding documentation, in the README files that get pasted into chat windows when a junior engineer is trying to understand a service. The bytes that contain your moat are not exclusively in src/. They are everywhere your engineering org thinks out loud, and your engineering org now thinks out loud through an AI assistant.
I want to spend a moment on the specific kind of leakage that matters most, because it is the one that engineers most consistently underestimate. It is not the leakage of code. It is the leakage of constructs. By construct I mean the small, named decisions that make a system the system it is — the fact that you decided to call the central object a Workflow rather than a Pipeline, the fact that you decided to model retries at the step level rather than the run level, the fact that you decided that an idle session should expire after seven days rather than thirty, the fact that you decided that a tool error is a structured envelope rather than a thrown exception. None of these decisions, taken alone, looks like intellectual property in any conventional sense. None of them would be defensible in a copyright dispute. And yet, taken together, the set of construct-level decisions that you have made over the last five years is the actual shape of your business. It is what makes your system different from the equivalent system at a competitor. It is the part that the AI assistant projects, recognises, and reproduces with no effort. Once that constellation of constructs is in a prompt, the moat is the prompt, and the prompt fits in a tweet.
Let me push the argument harder, because I do not think I can get this point to land without insisting on it. Imagine, for a moment, that you are a competent engineer who has decided to build a competitor to your business from scratch. You do not have access to your code. You do not have access to your repository. You have only the public marketing site, a recent conference talk by your CTO, the job descriptions on your careers page, and a handful of blog posts your engineering team has published. You feed all of this — perhaps four hundred thousand tokens, well within the context window of any frontier model in 2026 — to a coding assistant, and you ask it to build a working prototype. What you will get back, in a few weeks, is a system that does substantially the same thing your system does, that uses substantially the same set of constructs, and that is built on top of substantially the same architectural skeleton. It will be missing your specific data, your specific customer relationships, and the small set of genuinely novel decisions that no amount of public information could reveal. But it will not be missing the shape. It will not be missing the constructs. It will be a working competitor's prototype, built in weeks by a single engineer, using only the public footprint of your business and the publicly available frontier model that everyone has access to. This is not a thought experiment. This is what is actually happening in the open-source ecosystem right now. ClawCode was the first prominent example. There will be many more.
So what is the actual response? The actual response has three layers, and they correspond, roughly, to three different categories of intellectual property exposure.
The first layer is the boundary layer. It is the simplest and the cheapest, and almost no enterprise has implemented it yet. It is the practice of treating prompts to external AI providers as outbound traffic, with a budget, a routing policy, and a redaction layer in front of it. In the same way that you do not allow your application servers to make arbitrary outbound HTTP calls without going through an egress proxy, you should not allow your engineers to make arbitrary chat-window calls without going through a prompt egress proxy. The proxy enforces redaction (customer names, account numbers, internal identifiers, trade secrets that have been registered as such), enforces a routing policy (which prompts go to which vendor, which prompts go to a self-hosted model, which prompts are blocked entirely), and produces an audit trail (so that when you discover six months later that a piece of design documentation was pasted into a chat window, you know who pasted it and what came back). This layer is a tax on engineering velocity, and the tax is real, and it is also unavoidable. The alternative is no oversight at all, which is what most organisations are running with right now.
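To make this concrete, here is a minimal sketch, in Python, of the decision a prompt egress proxy has to make on every outbound request. Every name in it (the redaction patterns, the Route enum, the check_outbound_prompt function, the sensitive-term list) is a hypothetical illustration, and a real proxy would sit inline on the network path rather than run as a library call:

```python
# Minimal sketch of a prompt egress check: redact, route, audit.
# All patterns and names here are illustrative, not a production design.
import json
import re
import time
from dataclasses import dataclass
from enum import Enum

class Route(Enum):
    FRONTIER_VENDOR = "frontier_vendor"  # external API, under contract terms
    INTERNAL_MODEL = "internal_model"    # self-hosted substrate

# Obvious-identifier patterns; a real deployment adds customer dictionaries
# and registered trade-secret terms.
PATTERNS = {
    "account_number": re.compile(r"\b\d{10,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "internal_id": re.compile(r"\b(?:cust|acct|emp)-[0-9a-f]{8}\b"),
}

# Hypothetical internal service names whose mention keeps a prompt in-house.
SENSITIVE_TERMS = ("claims-pipeline", "core-ledger")

@dataclass
class Decision:
    route: Route
    redacted_prompt: str
    redactions: list

def check_outbound_prompt(prompt: str, user: str) -> Decision:
    redacted, hits = prompt, []
    for label, pattern in PATTERNS.items():
        redacted, count = pattern.subn(f"[{label.upper()}]", redacted)
        if count:
            hits.append([label, count])
    sensitive = any(term in prompt for term in SENSITIVE_TERMS)
    route = Route.INTERNAL_MODEL if sensitive else Route.FRONTIER_VENDOR
    # Audit trail: who sent what, where it went, what was redacted.
    print(json.dumps({"ts": time.time(), "user": user,
                      "route": route.value, "redactions": hits}))
    return Decision(route, redacted, hits)

decision = check_outbound_prompt(
    "Summarise the acct-deadbeef incident in the claims-pipeline", "alice")
print(decision.route.value, "->", decision.redacted_prompt)
```

The shape is the point: redact first, route second, log everything. Each of the three steps can get arbitrarily sophisticated over time without the architecture changing.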
The second layer is the substrate layer. This is the layer where you decide that for some categories of work, you do not want a public frontier model in the loop at all. You want a model that runs on infrastructure you control, with weights that do not leave your perimeter, and with no upstream provider in the path. This is what hosting your own model means. It is no longer the painful, expensive, and uncertain undertaking it was in 2023. The open-weight frontier — Llama-class models, DeepSeek-class models, Mistral-class models — has caught up to the closed-weight frontier within the bands that matter for most enterprise work. A self-hosted model is no longer the second-best option for cost reasons; it is, increasingly, the first-best option for any workload where the loss of constructs to an external vendor is unacceptable. The trade-off is operational: you now have to run a model. You have to keep it patched. You have to monitor its performance. You have to rotate its weights when better ones become available. None of this is fun, but it is far from impossible, and the unit cost of doing it has been falling month over month since 2024. The serious question for any enterprise procurement team in 2026 is not whether to host a model internally, but which workloads should be routed to the internal model and which can safely use the frontier vendor through the boundary layer.
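Calling the self-hosted substrate is deliberately boring, which is part of the appeal. Most open-weight serving stacks (vLLM and Ollama among them) expose an OpenAI-compatible endpoint, so application code barely changes; only the host does. A minimal sketch, with a hypothetical internal hostname and a placeholder for whichever open-weight model you maintain:

```python
# Minimal sketch: a chat call that never leaves your perimeter, assuming an
# OpenAI-compatible serving layer (vLLM, Ollama, etc.) on internal infra.
from openai import OpenAI

internal = OpenAI(
    base_url="http://llm.internal.example.com/v1",  # hypothetical internal host
    api_key="unused-by-most-self-hosted-servers",
)

response = internal.chat.completions.create(
    model="llama-3.3-70b-instruct",  # placeholder for the weights you run
    messages=[{"role": "user", "content": "Review this design doc for gaps."}],
)
print(response.choices[0].message.content)
```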
The third layer is the workflow layer, and this is the one that ties the first two together. Boundary policies and self-hosted substrates are necessary, but they are not sufficient by themselves; the place where they have to meet, the place where the redaction policy and the routing policy and the model selection actually get enforced for a specific piece of work, is a workflow orchestration layer that knows what kind of work it is running and chooses the right substrate for it. In other words: you do not let an engineer make a one-off decision about whether to send a prompt to the frontier vendor or the internal model. The orchestrator makes that decision, for them, based on a policy. The orchestrator routes prompts that touch sensitive constructs to the internal model. The orchestrator routes prompts that are clearly generic to the frontier vendor. The orchestrator emits the audit trail. The orchestrator is the place where governance becomes operational rather than performative. Building this orchestrator from scratch is a six-month project. Buying it from a vendor whose entire business is workflow orchestration with model routing, redaction, and policy enforcement is a one-week deployment. This is, not coincidentally, what Swfte is built to do, and I will return to that point at the end of the post.
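A minimal sketch of the routing decision at the heart of that orchestrator, with hypothetical workload fields and policy keys; a real deployment, whether built or bought, would express this as configuration rather than code:

```python
# Minimal sketch of policy-driven substrate selection. The engineer never
# chooses the substrate; the policy does, and the choice is logged.
from dataclasses import dataclass

@dataclass
class Workload:
    kind: str                 # e.g. "code_generation", "doc_summarisation"
    touches_constructs: bool  # does the context include structural IP?
    contains_pii: bool

POLICY = {
    # (touches_constructs, contains_pii) -> substrate
    (True, True): "internal_model",
    (True, False): "internal_model",
    (False, True): "frontier_vendor_redacted",  # via the egress proxy
    (False, False): "frontier_vendor",
}

def select_substrate(workload: Workload) -> str:
    substrate = POLICY[(workload.touches_constructs, workload.contains_pii)]
    print(f"audit: kind={workload.kind} -> {substrate}")  # the audit trail
    return substrate

select_substrate(Workload("code_generation", touches_constructs=True, contains_pii=False))
select_substrate(Workload("doc_summarisation", touches_constructs=False, contains_pii=True))
```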
Let me address the obvious objection now: if I can do all of this with the right policies, why do I need to host my own model at all? The answer, the reason this gets to layer two and not just to layer one, is that policy enforcement is fundamentally a trust-the-vendor proposition. When you send a prompt to a frontier provider, even with the strongest contract in the world, you are trusting that vendor — its employees, its subcontractors, its incident response, its disclosure practices, its government-relations posture in whatever jurisdictions matter to you — to do what they say they will do. Sometimes they will. Often they will. But the residual risk, the "we discovered an issue and your prompts may have been visible during the window", is real and is increasing as these vendors grow. Self-hosting eliminates that residual risk for the workloads that matter. It does not eliminate it for everything; you still have to trust the provenance of the open weights you downloaded. But it eliminates the operational, runtime, multi-tenant version of the risk, which is the version that costs you the most when it materialises.
There is a second reason for the self-hosted layer that I have not seen anyone articulate clearly enough, and it is, I think, the most important reason of all. It is that the things you train into your self-hosted model accrete, in a way that the things you put in a frontier vendor's context window do not. Every time your engineers fine-tune the internal model on internal documentation, every time you run a domain-specific evaluation against it and adjust accordingly, every time you teach it a particular convention by including it in the system prompt, you are building up a private knowledge base that is itself a moat. It is the inverse of the moat-erosion problem. The frontier vendor's model is, by design, the same model your competitor is using. The internal model, if you maintain it well, is yours, and it gets better at your work, in ways your competitor's model does not. Over five years, this gap compounds. Over ten, it is the difference between an organisation that has a private intelligence asset and an organisation that does not. The engineering teams that figure this out in 2026 will look, by 2030, like they have an unfair advantage. The teams that do not will look like they spent five years renting their cognition from a vendor who was renting the same cognition to everyone else.
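Mechanically, that accretion loop is an evaluation suite full of your own constructs, run as a gate on every new checkpoint or prompt change. A deliberately tiny sketch, with hypothetical cases and a stubbed-out model call standing in for the self-hosted API:

```python
# Minimal sketch of a domain-specific eval gate. GOLDEN_CASES and ask() are
# hypothetical stand-ins; a real suite scores real calls to the internal model.
GOLDEN_CASES = [
    {"prompt": "How do we model retries?", "must_mention": "step level"},
    {"prompt": "When does an idle session expire?", "must_mention": "seven days"},
]

def ask(prompt: str) -> str:
    # Stub for a call to the self-hosted model's chat endpoint.
    return "Retries are modelled at the step level, not the run level."

def pass_rate() -> float:
    passed = sum(case["must_mention"] in ask(case["prompt"]) for case in GOLDEN_CASES)
    return passed / len(GOLDEN_CASES)

# Promote a new checkpoint only if the suite does not regress against the
# previous weights; this is how the private knowledge base accretes safely.
print(f"pass rate: {pass_rate():.0%}")
```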
I want to address one more counter-argument before I close, because it is the one that is most often deployed by senior leaders who do not want to invest in this layer. The counter-argument is: "we are too small / too early / too fast-moving to invest in self-hosted infrastructure". There is a kernel of truth in this, and I do not want to dismiss it. The reality is that for a seed-stage startup with twelve engineers and no production traffic, the boundary layer is enough. You write the policy, you put the egress proxy in place, you accept the residual risk, and you build the product. This is correct, in that context. But the kernel of truth becomes a misleading story very quickly as the company scales. By the time you are at fifty engineers with real customers, the cost of not having a self-hosted layer is no longer the cost of building it; it is the cost of every prompt that leaks structural information to a vendor your competitors also use. By the time you are at two hundred engineers, the cost is qualitative: it is the cost of having a moat that is increasingly described in tweets. The window for "we are too small to do this" is real, but it is short, and most enterprises have already missed it.
Let me close with the operational picture, because this is meant to be a piece that helps you act, not just a piece that worries you. The operational picture for a serious enterprise in 2026 looks like this. You have a boundary layer — an egress proxy in front of every AI vendor — that enforces redaction and produces an audit trail. You have a substrate layer — a self-hosted model on infrastructure you control, with weights you maintain — that handles any workload that touches structural IP. You have a workflow layer — an orchestrator with model-routing, redaction enforcement, and policy primitives — that decides, for every piece of work, which substrate it runs against. And you have a governance layer sitting on top of all of this — a small team, often two or three people, who own the policy, review the audit logs, and tune the routing rules quarterly as the threat model evolves. The whole apparatus is roughly the same size as a serious cloud-cost optimisation function. It is not enormous. It is not free. But it is the thing that means, when the next ClawCode-equivalent appears, the description it would need to clone you is not sitting in a chat window somewhere; it is sitting on infrastructure you control, behind a model you own.
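The artefact that governance layer actually owns is correspondingly small: a versioned policy, reviewed quarterly, that the workflow layer enforces. A hypothetical example, rendered as a Python structure for concreteness (in practice it would more likely live as YAML in version control):

```python
# Hypothetical quarterly-reviewed egress policy; every field is illustrative.
# The point is that routing rules live in version control, not in heads.
EGRESS_POLICY = {
    "version": "2026-Q1",
    "owners": ["ai-governance@yourco.example"],
    "substrates": {
        "frontier_vendor": {"allow": ["generic_qa", "public_docs"],
                            "redact": ["pii", "internal_ids"]},
        "internal_model": {"allow": ["*"]},  # anything may run internally
    },
    "blocked_outbound": ["design_docs", "data_model", "incident_reports"],
    "audit": {"retention_days": 365, "review_cadence": "quarterly"},
}
```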
Most companies will not do this in 2026. Most companies will continue to put the "we do not train on your data" clause into their contracts and continue to file the IP-protection question under "handled". Most companies will be fine, in the sense that the modal outcome of any given individual decision is fine. But over a ten-year horizon, the companies that built the substrate layer, the workflow layer, and the governance layer will find that they have something their peers do not have. They will find that they have a private intelligence asset that gets better at their business every year. They will find that their moat is described in their infrastructure, not in their tweets. And they will find that when a competent engineer with a frontier model and a clear description sits down to clone them, the description is not actually available, because the description has been kept inside a substrate that the cloner does not have access to.
That is the response. The procurement clause is not the response. The contract is not the response. The vendor's training policy is not the response. The response is structural, it involves bringing more of the loop in-house, and the part of the loop that has to come in-house first is the part that decides which prompts go where. After that, the model itself follows naturally. After that, the long-term gap between you and a public model trained on the public web starts to compound. After that, you have a moat again, and the moat is the part of your thinking that you decided not to give away.
Swfte is the workflow and orchestration layer for exactly this picture. We built it because we believe — and we have come to believe more firmly every quarter — that the future of enterprise AI is not a single frontier model rented from a single vendor, but a fleet: an internal substrate handling the sensitive workloads, a frontier vendor handling the generic ones, an orchestrator deciding which is which, and a governance layer reviewing the choices. If you are thinking about how to put any of this in place — whether at the boundary layer, the substrate layer, or the workflow layer — that is the conversation we want to have with you. The first decision is rarely the model. The first decision is the orchestrator that decides where the model lives.
Read the related deep-dives: Buy vs Build in the Age of AI Coding Assistants, The AI Workflow Marketplace, and The $50B AI Agent Marketplace Economy. Or talk to us about Swfte's workflow orchestration layer — model-routed, policy-enforced, and built for the boundary-substrate-workflow pattern described above.