Designing applications around AI development agents

Asking ChatGPT to write a Vertex2D library will get you a complete and (likely) functional library in 30 seconds. AI is really good at writing libraries for well-understood problems that fit in its context window.

Easily accessible models these days (for example Claude 3.7 and Gemini 2.5) have context windows ranging from a couple of hundred thousand up to around a million tokens. As of writing this post, all blog posts we’ve written fit in about 250K tokens, which leaves plenty of room to ask questions, discuss the posts, and much more. Modern AI models can load a complete application into context, as long as that application is small enough.

I believe the future of AI-written applications lies in building them out of many small modules, where each module is small enough to fit in the context of an AI agent together with all of its tests, documentation and requirements.

How do you make applications small enough? Easy: Split them up. Delegate implementation details to dependencies. These dependencies? AI can write them for you.

AI Agents: A new employee for every task

Querying a large language model is basically like asking an experienced developer a question on their first day. Sure, they know a lot of things, but they know nothing about your application, business, conventions or expectations. You’ll have to tell them, every time.

Imagine you have a new employee and the only way to communicate with them is via email. To have them perform a task, you send them all relevant documentation and instructions on what to do.

Without thinking about it, the employee reads all the documentation and attempts to solve the problem right away.

To give this employee a chance of succeeding, you’ll have to send pretty detailed instructions, and ideally you also give them access to everything they need.

After the task is completed, the employee leaves. If you have follow-up questions or tasks, a new employee is assigned. This new employee knows nothing and needs to read the previous conversation to attempt to solve your problem.

These employees are your AI agents. The instructions you provide to such an employee are called a prompt. You need a pretty detailed prompt to give these agents a proper chance of producing what you need, but you also shouldn’t provide more than necessary, or they’ll hallucinate, make stuff up, or fix unrelated things.

When these AI agents are tasked with anything non-trivial, you want them to be able to debug their code, and only their code. You want them to be able to write tests and run the test suite against only their code.

Dependencies need to be incredibly simple or need to be mocked predictably. If a test fails, it must be because the AI agent screwed up its own code, not because a dependency is failing.

Testable modules

We want to split our application into smaller modules that can ideally be developed and tested independently of each other. Designing composable modules has always been a good idea, but it has always been too easy to end up with a “big ball of mud” where everything relies on everything else.

I think we will need to define very strong boundaries, almost like you’re designing micro-services, but without the actual micro-services. Instead of literally hosting many slow, inter-communicating HTTP servers, I imagine we can just create a framework that allows you to define boundaries. These boundary definitions can then be used to validate inputs and outputs of modules and isolate errors thrown by a module.

That way, calls between modules can easily be mocked in tests, and it is easier to identify which module is misbehaving when something goes wrong. By validating the boundaries and catching errors around boundaries, we can quickly identify a lot of problems early on, directly at the source.

To aid with validating and enforcing boundaries, I think it’s helpful to strongly prefer functional-style functions that only accept and return primitive values, or objects composed of primitive values. These value objects should be clonable and replaceable by literals; ideally they’re JSON-serializable.
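As a rough sketch of what I mean (none of this is an existing library; the validate functions would be derived from the boundary definitions), a call through a boundary could be wrapped like this:

function bindBoundary(moduleId, fn, { validateInput, validateOutput }) {
  return async (...args) => {
    validateInput(args); // reject calls that violate the declared parameters
    let result;
    try {
      result = await fn(...args);
    } catch (error) {
      // Attribute the failure to the module that threw, not to its caller.
      throw new Error(`[${moduleId}] ${error.message}`, { cause: error });
    }
    validateOutput(result); // a bad return value is this module's fault
    return result;
  };
}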

Let’s start with an example: imagine the structure of a blog-like website.

The Blog Structure

A blog website can be built from modules like PublicWeb (the public HTTP controllers), PublicPosts (read-only access to published posts), AdminPosts (creating and publishing posts), PostsRepo (database access for posts), PublicRoutes (URL generation) and PostsWebComponents (rendering posts to HTML).

Rendering posts

To render the homepage, we’ll render a list of published posts. This involves several of these modules: PublicWeb handles the HTTP request, PublicPosts loads the published posts, and PostsWebComponents turns them into HTML. Let’s zoom in on PublicPosts first.

The PublicPosts boundary can look like this:

{
  "id": "PublicPosts",
  "schemas": {
    "PublicPost": {
      "id": "PublicPost",
      "type": "object",
      "properties": {
        "title": { "type": "string" },
        "publishedAt": { "$ref": "ZonedDateTime" },
        "author": { "$ref": "PublicAuthor" },
        "contentHtml": { "type": "string" },
        "href": { "type": "string" }
      }
    },
    "getLatestPosts": {
      "id": "getLatestPosts",
      "type": "function",
      "parameters": [
        { "id": "page", "type": "integer", "min": 1 }
      ],
      "output": {
        "type": "object",
        "properties": {
          "posts": { "type": "array", "items": { "$ref": "PublicPost" } }
        }
      }
    },
    "getPost": {
      "id": "getPost",
      "type": "function",
      "parameters": [
        { "id": "id", "type": "integer" }
      ],
      "output": {
        "type": "object",
        "properties": {
          "post": { "oneOf": [{ "$ref": "PublicPost" }, { "type": "null" }] },
          "error": { "oneOf": [{ "type": "string" }, { "type": "null" }] }
        }
      }
    }
  },
  "runtimeDependencies": ["PostsRepo", "PublicRoutes"]
}

We can define this boundary using zod; notice how all inputs and outputs are JSON-serializable.
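As a minimal sketch (exactly how these schemas would be registered with the framework is an open question), the PublicAuthor and PublicPost value objects could look like this in zod:

import { z } from "zod";

// Hypothetical zod version of the boundary's value objects.
const PublicAuthor = z.object({
  name: z.string(),
  href: z.string(),
});

const PublicPost = z.object({
  title: z.string(),
  publishedAt: z.string(), // a ZonedDateTime, serialized as a string
  author: PublicAuthor,
  contentHtml: z.string(),
  href: z.string(),
});

This PublicPosts module might be implemented like this (PostsRepo and PublicRoutes are the runtime dependencies declared above; markdown2html is a local helper):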

export async function getLatestPosts(limit, offset) {
  const posts = await PostsRepo.findPublishedPostsDescending(offset, limit);
  return {
    posts: posts.map((post) => renderPublicPost(post))
  };
}

export async function getPost(id) {
  const post = await PostsRepo.findPublishedPostById(id);
  if (!post)
    return { error: "not-found" };
  return { post: renderPublicPost(post) };
}

function renderPublicPost(post) {
  return {
    title: post.title,
    publishedAt: post.publishedAt,
    author: renderPublicAuthor(post.author),
    contentHtml: markdown2html(post.contentMd),
    href: PublicRoutes.postPath(post),
  };
}

function renderPublicAuthor(user) {
  return {
    name: user.name,
    href: PublicRoutes.authorPath(user),
  };
}

Now imagine the PublicWeb module, which depends on PublicPosts:

{
  "id": "PublicWeb",
  "schemas": {
    "getHome": {
      "id": "getHome",
      "type": "function",
      "parameters": [
        { "id": "req", "$ref": "HttpRequest" }
      ],
      "output": { "$ref": "HttpResponse" }
    },
    "getPostDetail": {
      "id": "getPostDetail",
      "type": "function",
      "parameters": [
        { "id": "id", "type": "integer" }
      ],
      "output": { "$ref": "HttpResponse" }
    }
  },
  "runtimeDependencies": [
    "PublicPosts",
    "PostsWebComponents",
    "ErrorWebComponents"
  ]
}

This PublicWeb module might be implemented like this:

export async function getHome(req) {
  const page = parseInt(req.params.page || "1", 10);
  const limit = 30;
  const response = await PublicPosts.getLatestPosts(limit, (page - 1) * limit);
  return res(200, PostsWebComponents.renderIndex(response));
}

export async function getPostDetail(req) {
  const id = parseInt(req.params.id || "", 10);
  const response = await PublicPosts.getPost(id);
  if (response.error == "not-found")
    return res(404, ErrorWebComponents.renderError("Post not found"));
  return res(200, PostsWebComponents.renderDetail(response));
}

function res(status, body) {
  return { status, headers: { contentType: "text/html; charset=utf-8" }, body };
}

In these modules, there’s a clear separation of concerns. Note that a view (PostsWebComponents.renderIndex) can’t access any data that wasn’t explicitly loaded beforehand: no magic auto-loading of relations or unintentional property access here.

The PublicWeb controller can only access PublicPosts, not AdminPosts, because only PublicPosts is listed as a dependency in the schema. Even if AdminPosts were added as a dependency, both the call to AdminPosts.* and the new AdminPosts entry in the dependency list would be clear indicators that the PublicWeb module suddenly uses a module that is only meant for admin-related code.
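Concretely, wiring up PublicWeb could look like this, where inject is the same hypothetical helper used in the test examples further down, and publicPosts, postsWebComponents and errorWebComponents are assumed to have been built the same way:

// The framework only hands PublicWeb its declared dependencies, so AdminPosts
// is simply not reachable from inside this module.
const publicWeb = inject(PublicWeb, {
  PublicPosts: publicPosts,
  PostsWebComponents: postsWebComponents,
  ErrorWebComponents: errorWebComponents,
  // AdminPosts: not declared in the schema, so not injected and not accessible.
});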

In many applications, you can access any module or class from anywhere, even in views. Loading sensitive, admin-only data on the homepage would be just as easy as loading any other data, and you rely on code reviews to catch mistakes or carelessness.

I imagine we will have one agent designing these schemas, and other agents implementing the modules. Because the agent writing a module can’t alter its dependencies, it simply cannot access any data it isn’t supposed to access. If a module agent needs additional dependencies, it has to request them from the schema agent.

Module agents

I don’t intend to give the agents direct access to a couple of source files. Instead, I want to give the module agent access to a virtual file containing an array of functions, tests and references (dependencies, imports), which it manipulates through the initial prompt and a set of tools.
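The exact shape of that virtual file is still open; roughly, I’m picturing something like this:

// Hypothetical shape of the virtual module file the agent edits through its tools.
const publicPostsModuleFile = {
  id: "PublicPosts",
  references: ["PostsRepo", "PublicRoutes"], // the only dependencies the agent can see
  functions: [
    { id: "getLatestPosts", source: "export async function getLatestPosts(limit, offset) { /* ... */ }" },
    { id: "getPost", source: "export async function getPost(id) { /* ... */ }" },
  ],
  tests: [
    { id: "getPost returns an error for unknown ids", source: "/* ... */" },
  ],
};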

The agent will be initialized with all of this context up front; you can imagine this initial prompt will be pretty big.

I want to put a lot of emphasis on test-driven development and getting 100% code coverage.
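How that gets enforced is an implementation detail; assuming a runner like Jest, it could be as blunt as a per-module coverage threshold:

// jest.config.js for a single module (Jest is an assumption here; any runner
// with coverage thresholds would work just as well)
module.exports = {
  collectCoverage: true,
  coverageThreshold: {
    global: { branches: 100, functions: 100, lines: 100, statements: 100 },
  },
};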

Mocking dependencies

In my example, PublicWeb depends on PublicPosts, which depends on PostsRepo, which in turn would depend on SqliteDatabase. At some point we’ll want to mock these dependencies.

I think it works best if runtime dependencies are mocked directly: PublicWeb relies on PublicPosts, so its tests should mock PublicPosts. This makes sure that each module is only concerned with its declared interface, even in tests.
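For example, a unit test for PublicWeb could stub PublicPosts at its declared boundary; apart from the module names, everything in this snippet is made up for illustration:

// Every declared dependency of PublicWeb is replaced by a stub that honours
// the boundary, so a failure here can only mean PublicWeb itself is broken.
const fakePost = {
  title: "Hello",
  publishedAt: "2025-12-25T12:00:00+01:00[Europe/Amsterdam]",
  author: { name: "Toby", href: "/users/422-toby" },
  contentHtml: "<p>Hello</p>",
  href: "/posts/1337-hello",
};

const publicWeb = inject(PublicWeb, {
  PublicPosts: {
    async getLatestPosts(limit, offset) { return { posts: [fakePost] }; },
    async getPost(id) { return id === 1337 ? { post: fakePost } : { error: "not-found" }; },
  },
  PostsWebComponents: {
    renderIndex: () => "<main>index</main>",
    renderDetail: () => "<main>detail</main>",
  },
  ErrorWebComponents: { renderError: (message) => `<main>${message}</main>` },
});

const notFound = await publicWeb.getPostDetail({ params: { id: "999" } });
expect(notFound.status).toEqual(404);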

This leaves a strong demand for integration and end-to-end tests. I can imagine having a separate agent write integration tests, which mock nothing except the modules that touch infrastructure (the database, HTTP calls, external APIs, etc.), and another agent write end-to-end tests, which simply run against a running application.

The integration tests will likely mock the database, or run against a clean in-memory database, and use the real modules to set up the scenario under test.

For example, an integration test for PublicPosts and AdminPosts could look like this:

// Wire the real modules together against an in-memory database.
const publicRoutes = inject(PublicRoutes, { host: "http://localhost:9999/" });
const db = inject(SqliteDatabase, { databasePath: ":memory:" });
const postsRepo = inject(PostsRepo, { SqliteDatabase: db });
const adminPosts = inject(AdminPosts, { PostsRepo: postsRepo });
const publicPosts = inject(PublicPosts,
  { PostsRepo: postsRepo, PublicRoutes: publicRoutes });

const createdPost = await adminPosts.createPost({
  title: "Hello",
  contentMd: "Hello, **world**!"
});
await adminPosts.publishPost(createdPost.id);

// getPost wraps the post in { post }; the author and publish time asserted
// below are assumed to come from the (omitted) test setup.
const { post: publishedPost } = await publicPosts.getPost(createdPost.id);
expect(publishedPost).toEqual({
  href: "http://localhost:9999/posts/1337-hello",
  title: "Hello",
  publishedAt: "2025-12-25T12:00:00+01:00[Europe/Amsterdam]",
  contentHtml: `<p>Hello, <strong>world</strong>!</p>`,
  author: {
    user: "Toby",
    href: "http://localhost:9999/users/422-toby"
  },
});

Schema changes & refactoring across boundaries

For graceful refactoring and schema changes, I’m thinking of adding boundary migrations: a set of functions you can define to transform between old and new versions of a boundary. For example, let’s imagine we’re splitting the name field of PublicAuthor into givenName and familyName. A transform might be defined like this:

{
  id: "PublicAuthorNameTransform",
  target: "PublicAuthor",
  from: {
    name: { type: "string" },
  },
  to: {
    givenName: { type: "string" },
    familyName: { type: "string" },
  },
  transform(author) {
    return {
      ...author,
      givenName: author.name,
      familyName: "",
    };
  }
}

This way, we can focus on our new module with the new version of the schema without breaking any other modules.

Once the old version of the schema has been removed, the transform can be deleted. The framework could automatically detect schema mismatches, apply an appropriate transform, and delete any transform that is no longer required.

Notice how the transform is only concerned with the relevant properties. That way, multiple versions of the same schema can exist at the same time, allowing for graceful migrations between them.
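What applying such a transform could look like inside the framework, again purely as a sketch:

// Hypothetical: when a value still has the old shape but the consumer expects
// the new one, the framework runs the registered transform at the boundary.
function upgradeAtBoundary(value, { from, to, transform }, matchesSchema) {
  if (matchesSchema(value, to)) return value;               // already the new shape
  if (matchesSchema(value, from)) return transform(value);  // upgrade the old shape
  throw new Error("value matches neither version of the schema");
}

// upgradeAtBoundary({ name: "Toby" }, PublicAuthorNameTransform, matchesSchema)
// => { name: "Toby", givenName: "Toby", familyName: "" }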

Implementation

There are a lot of tools today that take on autonomous bugfixing and software development, but these are mostly general purpose tools you can drop into an existing codebase.

I’m currently working on the tools to implement this (not open source, at the moment).

I imagine that by designing the infrastructure, prompts and framework together, we could get better, faster and/or cheaper results. Or it will turn out to be a huge, impractical mess that serves no purpose for real-world applications. We’ll see!