Nullsec Research

Security Review for AI-Generated Software

Why generated applications need a separate verification layer

Summary

  • AI coding systems are increasingly able to generate complete application structures, not just isolated code snippets.

  • This changes the security problem: the volume and speed of software creation increase, while review capacity often remains unchanged.

  • Many generated applications appear functional before their trust boundaries, authorization logic, and runtime behavior have been properly examined.

  • We believe AI-generated software requires a dedicated security review layer that is separate from the model or system that produced the code.

  • Nullsec S1 is our first step toward building a model specifically focused on identifying security risks in AI-generated applications.

Software development is moving from manual implementation toward intent-driven generation. Instead of writing every file directly, developers and, increasingly, non-developers describe the product they want, and AI systems produce much of the application structure around that request.

This changes the economics of building software. A small team can move faster, prototypes can become usable products much earlier, and technical ideas can be tested with less upfront engineering effort. In many cases, this is a positive development. The ability to generate software from natural language lowers the cost of experimentation and gives more people access to creation.

However, the security model around this new workflow has not matured at the same pace.

Most current AI software workflows optimize for completion. The system is asked to produce an application, fix build errors, improve the interface, add features, and make the output feel usable. These are useful capabilities, but they are not the same as security review. A generated application can compile successfully and still contain broken access control, exposed secrets, unsafe API behavior, incorrect wallet verification, insecure database rules, or agent actions that exceed their intended scope.
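As an illustration of code that works and is still unsafe, here is a minimal sketch of broken access control. The names (`INVOICES`, `get_invoice_unsafe`, `get_invoice`) are hypothetical, and the in-memory dictionary stands in for a database. It shows the category of bug, not any particular generated application.

```python
# Hypothetical in-memory store standing in for a database table.
INVOICES = {
    1: {"owner": "alice", "total": 120},
    2: {"owner": "bob", "total": 75},
}


def get_invoice_unsafe(current_user: str, invoice_id: int) -> dict:
    """Works in a demo, but any logged-in user can read any invoice."""
    return INVOICES[invoice_id]


def get_invoice(current_user: str, invoice_id: int) -> dict:
    """Same lookup, with a server-side ownership check."""
    invoice = INVOICES.get(invoice_id)
    if invoice is None or invoice["owner"] != current_user:
        # One error for "missing" and "not yours" avoids revealing which IDs exist.
        raise PermissionError("invoice not found or not accessible")
    return invoice
```

Both functions return the right data when a user requests their own invoice, so a functional test of the happy path would not tell them apart.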

The central issue is not whether AI can generate useful software. It can. The issue is whether generated software should be trusted without an independent review process.

We do not think it should.

The problem with self-review

A model that generates an application is usually optimized to satisfy the original request. It attempts to infer missing context, produce working code, and resolve implementation details in a way that makes the user’s goal possible. This is useful for building, but it makes the model an unreliable reviewer of its own output: the assumptions it made to get the code working are the same assumptions a security review needs to question.

Security review requires a different posture. The reviewer should not primarily ask whether the product matches the requested behavior. It should ask what assumptions the implementation makes, which inputs are attacker-controlled, where identity is verified, which actions require authorization, and whether the system exposes privileges that should remain constrained.

In traditional engineering teams, this distinction is familiar. The person who writes a feature is not always the person who approves its security properties. Code review, threat modeling, static analysis, penetration testing, and production monitoring all exist because functional correctness and security correctness are different categories.

AI-generated software makes this separation more important, not less. When generation becomes faster, more code reaches users for each hour of review available, so the cost of insufficient review increases.

New risks in generated applications

Generated software inherits the same risks as traditional software, including weak authentication, missing authorization, injection flaws, exposed environment variables, unsafe dependencies, and misconfigured backend services. But AI-generated systems also introduce additional failure modes.

Some of these risks come from the way LLMs handle instructions. Applications that accept natural language input may be vulnerable to prompt injection, especially when user-controlled text can influence tools, API calls, database operations, or agent decisions.
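One common mitigation is to decide what a tool call may do from the caller's permissions, not from anything the model wrote. The sketch below assumes a hypothetical role-to-tool mapping (`ALLOWED_TOOLS`) and a hypothetical `ToolCall` shape; real agent frameworks represent tool calls differently.

```python
from dataclasses import dataclass

# Hypothetical mapping from user role to the tools that role may trigger.
ALLOWED_TOOLS = {
    "viewer": {"search_docs"},
    "admin": {"search_docs", "delete_record"},
}


@dataclass
class ToolCall:
    name: str        # tool the model asked to run
    arguments: dict  # arguments the model supplied


def authorize_tool_call(role: str, call: ToolCall) -> bool:
    """Decide from the caller's role, never from text the model produced."""
    return call.name in ALLOWED_TOOLS.get(role, set())
```

If injected text persuades the model to request `delete_record` on behalf of a viewer, the request is still refused, because the decision never consults the model's output beyond the tool name. This limits the damage of a successful injection; it does not prevent the injection itself.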

Other risks come from over-trusting generated output. If LLM output is rendered into an interface, passed into a command, used to construct queries, or allowed to trigger actions, the output itself becomes part of the attack surface.
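In practice this means handling model output the way an application would handle any untrusted user input. A minimal sketch using only the Python standard library follows; the function names and the `users` table are assumptions made for illustration.

```python
import html
import sqlite3


def render_reply(llm_output: str) -> str:
    """Escape model output before placing it in HTML."""
    return "<p>" + html.escape(llm_output) + "</p>"


def find_user(conn: sqlite3.Connection, name_from_llm: str) -> list:
    """Bind model output as a parameter instead of formatting it into SQL."""
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name_from_llm,)
    ).fetchall()
```

Escaping covers the rendering case and parameter binding covers the query case. Output that is passed to a shell command or allowed to trigger actions needs its own controls.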

The risk becomes more serious when generated applications connect to wallets, payment systems, user accounts, private files, deployment environments, or on-chain actions. In these cases, the application is no longer only producing content. It may be controlling access, moving assets, writing to databases, or executing privileged operations.

This is where security review must become part of the generation pipeline itself.

What we are building with Nullsec S1

Nullsec S1 is our first model focused specifically on reviewing AI-generated software.

The goal is not to replace general-purpose coding models. Builder models are good at producing applications, explaining code, and iterating on product requirements. Nullsec S1 is designed for a narrower task: inspect generated code and identify where the implementation may be unsafe.

A useful security model should produce structured findings rather than vague warnings. It should be able to identify the category of risk, explain why it matters, point to evidence in the code, estimate severity, and suggest a concrete remediation. It should distinguish between cosmetic issues and production-blocking vulnerabilities. It should also understand the difference between authentication and authorization, between client-side checks and server-side enforcement, and between a working user flow and a secure trust boundary.
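To make "structured findings" concrete, here is one possible shape for a finding. This is a hypothetical sketch, not the output format of Nullsec S1: the field names, the severity scale, and the rule that HIGH and above block release are all assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    INFO = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass
class Finding:
    category: str      # e.g. "broken-access-control"
    explanation: str   # why the issue matters
    evidence: str      # file and line, or a code excerpt
    severity: Severity
    remediation: str   # concrete suggested fix

    def blocks_release(self) -> bool:
        """Assumed policy: HIGH and above block production."""
        return self.severity >= Severity.HIGH


def triage(findings: list[Finding]) -> list[Finding]:
    """Order findings so the highest-severity items are reviewed first."""
    return sorted(findings, key=lambda f: f.severity, reverse=True)
```

Separating `severity` from `category` is what lets a reviewer tell a cosmetic issue from a production-blocking one without reading every explanation.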

Our working hypothesis is simple: AI-generated applications need a second model with a different objective. The first model builds. The second model challenges.

Toward a verification layer for generated software

We believe the future AI development pipeline will include security review as a default step, not an optional audit after deployment.

A generated application should pass through several stages before it is considered production-ready. It should build successfully, but it should also be inspected for exposed secrets, unsafe routes, weak authorization, insecure database access, dependency risk, wallet verification mistakes, and risky agent behavior. Where possible, the review system should propose targeted patches and verify that the fix does not break the application.
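The staged review described above can be sketched as a list of checks that must all pass. Everything here is illustrative: `run_review` and `scan_for_secrets` are hypothetical names, only one stage (secret scanning) is implemented, and the two patterns are examples, not a usable rule set.

```python
import re
from typing import Callable

# Illustrative patterns only; real secret scanners use far larger rule sets.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key ID
    re.compile(r"(?i)(api_key|secret|password)\s*=\s*['\"][^'\"]{8,}['\"]"),
]

# A check takes {path: file contents} and returns a list of issue strings.
Check = Callable[[dict[str, str]], list[str]]


def scan_for_secrets(files: dict[str, str]) -> list[str]:
    """Report each file and line that matches a secret-like pattern."""
    issues = []
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if any(p.search(line) for p in SECRET_PATTERNS):
                issues.append(f"{path}:{lineno}: possible hard-coded secret")
    return issues


def run_review(files: dict[str, str], checks: list[Check]) -> tuple[bool, list[str]]:
    """Run every check; the application passes only if none report issues."""
    issues = [issue for check in checks for issue in check(files)]
    return (not issues, issues)
```

Further stages (route inspection, authorization checks, dependency scanning) would be added as more functions with the same signature, which keeps each check independently testable.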

This does not remove the need for human engineers. It gives them leverage. Instead of manually reviewing every generated file from scratch, teams can use specialized models to surface the highest-risk areas first.

The long-term direction is not simply “AI writes code.” It is “AI generates software, specialized systems verify it, and humans make better final decisions.”

Limitations

Nullsec S1 is an early step, not a finished answer to software security. Security models can miss issues, overstate severity, or misunderstand project context. Generated findings still require validation, especially in high-risk production environments.

There is also no single model that can fully replace static analysis, dependency scanning, runtime monitoring, manual review, or professional security audits. The right approach is layered.

Our focus is narrower: improving the first line of review for AI-generated software, especially in environments where applications are created quickly and traditional review processes are easy to skip.

Discussion

The security challenge around AI-generated software is not that every generated application is unsafe. The challenge is that generation is becoming cheap, fast, and widely accessible, while verification is still slow and uneven.

That imbalance will matter more as generated applications become more capable. The more software touches real users, real accounts, real wallets, real APIs, and real infrastructure, the more important it becomes to verify what was generated before it is trusted.

Nullsec S1 is built around this premise. If the next generation of software is increasingly generated by AI, then security must also become AI-native.

Not as a replacement for engineering judgment, but as a dedicated verification layer for a new software era.