
AI researchers and labs have made great strides in evaluating AI models on everything from security and compliance to sycophancy and alignment. But companies and developers appear to face a newer, more specific need: ensuring that their AI system behaves as expected within their own product or service.

In an effort to simplify this testing process, Microsoft on Tuesday unveiled ASSERT, short for Adaptive Spec-driven Scoring for Evaluation and Regression Testing.

According to Microsoft, the open source framework makes it easier to evaluate application-specific AI behavior by using AI to transform high-level, natural language descriptions of goals, policies, or intended behaviors into detailed, scored tests that developers can inspect.

ASSERT takes plain language descriptions of an AI model's expected behavior and policies, transforms them into a structured set of acceptable and unacceptable behaviors, generates problem scenarios and test cases, runs them on the target system, and scores the results. It can also record the paths taken by the AI system, including intermediate actions and tool calls, so developers can inspect where failures are occurring.

Developers can also provide context, tools, and system constraints if they want to further customize what the assessments cover.

For example, a developer could specify that an AI search agent should not send emails to people outside the company, should restrict confidential information to senior executives, and should provide concise summaries that take prior context into account. ASSERT then generates test cases from these rules and checks, on an ongoing basis, whether the system adheres to them.
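To make that flow concrete, here is a minimal Python sketch of the idea, not of ASSERT itself: it does not use ASSERT's API, which the announcement does not describe in detail. Every name in it (`ToolCall`, `toy_agent`, `violates_external_email`, the `example.com` domain) is hypothetical, the test prompts are written by hand rather than generated by AI, and only the "no emails outside the company" rule is checked against the recorded tool calls.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One recorded step in the agent's trace (hypothetical structure)."""
    name: str
    args: dict


@dataclass
class Result:
    prompt: str
    passed: bool
    violations: list


# Assumption: the company's email domain, chosen only for illustration.
COMPANY_DOMAIN = "example.com"


def violates_external_email(call: ToolCall) -> bool:
    """Rule: the agent must not email anyone outside the company."""
    if call.name != "send_email":
        return False
    recipients = call.args.get("to", [])
    return any(not r.lower().endswith("@" + COMPANY_DOMAIN) for r in recipients)


def toy_agent(prompt: str) -> list[ToolCall]:
    """Stand-in for the system under test; returns its trace of tool calls."""
    trace = [ToolCall("search", {"query": prompt})]
    if "partner" in prompt:
        trace.append(ToolCall("send_email", {"to": ["bob@partner.org"]}))
    else:
        trace.append(ToolCall("send_email", {"to": ["alice@example.com"]}))
    return trace


def run_case(agent, prompt: str) -> Result:
    """Run one test case and collect every tool call that breaks the rule."""
    trace = agent(prompt)
    violations = [call for call in trace if violates_external_email(call)]
    return Result(prompt, not violations, violations)


def score(results: list[Result]) -> float:
    """Fraction of test cases that passed."""
    return sum(r.passed for r in results) / len(results)


# Hand-written test prompts; a real framework would generate these from the spec.
prompts = [
    "Summarize Q3 results and email them to the team",
    "Send the roadmap to our partner contact",
]
results = [run_case(toy_agent, p) for p in prompts]
pass_rate = score(results)

for r in results:
    print("PASS" if r.passed else "FAIL", "-", r.prompt)
print("pass rate:", pass_rate)
```

Because each result keeps the offending tool calls, a failing case points to the exact step where the rule was broken, which is the kind of trace inspection the article describes.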

According to Microsoft, the framework fills a gap that broader, more general assessments cannot: cases where an AI model is expected to behave in ways shaped by the context, policies, and tools of a specific application or product.

“One of the things we learned is that assessments are absolutely essential to making good decisions,” said Sarah Bird, director of Responsible AI products at Microsoft. “Because if you don't understand the behavior of the AI system, it's really hard to know if it meets the requirements of your organization... What we found is that if you really want to have a reliable system, you have to evaluate a lot of other application-specific dimensions.”

Bird said ASSERT can be used to evaluate systems while they are being built, after they are deployed, and for ongoing monitoring.

This release comes amid a gradual but broader shift in the AI industry. As models become more capable, researchers are focusing on repeatable testing and regression checks, with Stanford's HELM, MLCommons' AILuminate, and evaluation groups like METR deploying benchmark tests to measure how models behave under different conditions.

![New Microsoft tool lets developers start testing AI behavior using text descriptions](https://techcrunch.com/wp-content/uploads/2026/06/GettyImages-172665283.jpg?resize=1200,900)