State of AI Testing

Trends, research, and implications for quality engineering

Overview

This living document tracks the state of AI testing and quality engineering: emerging tools, research, and practices. It is updated monthly by the Research & Literary Agent, which curates discoveries from the portfolio's research pipeline and publishes a new section each month.

What you'll find here

Monthly digests of LLM and Gen AI research relevant to testing and QA: new models, tools, papers, and repos. Each section is generated automatically from the latest discovery feed, so the content stays current without manual publishing.

Monthly updates

Newest first. Sections are added by the Research & Literary Agent on the first of each month.

May 2026

Curated discoveries from the LLM & Gen AI research pipeline relevant to testing and quality engineering.

sglang

SGLang is a high-performance serving framework for large language models and multimodal models.

Link
NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

Link
Skills

A project to improve skills of large language models

Link
ras-commander

The RAS-Commander library provides a Python API for automating HEC-RAS 6.x and accessing HDF data, built with and driven by large language models.

Link
NewsPilot

NewsPilot is an automated intelligence analysis system based on large language models (LLMs), designed to transform massive volumes of global news into personalized, actionable insights. It is not just a news aggregation tool, but a 24/7 intelligence assistant that understands your profession, holdings, and interests.

Link
MAGI

AI system powered by large language models.

Link
AbstractCore

A unified Python library for interaction with multiple Large Language Model (LLM) providers. Write once, run everywhere.

Link
Auto-Pentest-LLM

🔍 Automate penetration testing with an intelligent agent that organizes security assessments, leveraging local LLMs and Kali Linux for effective exploitation.

Link
Recipe-Language-Model

Recipe Language Model (RLM): We introduce a domain-specific RLM developed through seven AI layers and interconnected robotic boxes to drive the evolution of physical AI for defining a new paradigm - Materials Intelligence.

Link
jax-llm-expts

Experiments in building Small/Medium/Large Language Models using JAX

Link
auto-gpt

Auto-GPT is an experimental open-source framework that transforms large language models like GPT into autonomous agents capable of self-directed reasoning, recursive goal execution, and dynamic tool use. Unlike traditional chat interfaces

Link
HowHungryisAIRepo

This repository powers the “How Hungry is AI?” dashboard and provides the code + data release accompanying our paper on inference-phase (operational) environmental footprints of LLMs.

Link
ai-agent-pulse

Daily automated tracking of 19 AI agent frameworks. Pulse Score ranks momentum via star velocity, release freshness, commit activity, and community health.

Link
wiki-doc-skill

Persistent, LLM-maintained 3-level documentation for codebases. AST-driven line numbers, doc-only or edit-enabled doc passes, and project-configured cascade and verification. (The agent has not yet been taught how to read the docs; this is planned for the v1.1 release.)

Link
mcp-nvidia

MCP server that searches across NVIDIA blogs and releases so that LLMs can better answer NVIDIA-specific queries

Link
INT-6940---AI-New-Store-Support-System---phase-1

This project uses Python and Structured Query Language (SQL) with known existing store data to predict how likely a new store opened in the same area would be visited, using the Huff model.

Link
Fake-News-Detection

Real-Time Fake News Detection offers an API-first approach to content moderation and journalistic verification. Powered by advanced Natural Language Processing (NLP) and ensemble tree models, this platform analyzes structure, sentiment, and context to deliver actionable credibility metrics.

Link
Jailbreak-Scaling-Laws-for-Large-Language-Models

[arXiv: 2603.11331] Scaling laws for the attack success rate under prompt-injection-based jailbreak attacks

Link
CutoffDateTesting

Benchmarking the true internal knowledge cutoff dates and factual decay of Large Language Models (LLMs) using notable death records.

Link
rag-chatbot

Chat With Documents is a Streamlit application designed to facilitate interactive, context-aware conversations with large language models (LLMs) by leveraging Retrieval-Augmented Generation (RAG). Users can upload documents or provide URLs, and the app indexes the content using a vector store called Chroma to supply relevant context during chats.

Link

Source: discoveries-2026-05-04.md

Research & Literary Agent – State of AI Testing

April 2026

Curated discoveries from the LLM & Gen AI research pipeline relevant to testing and quality engineering.

sglang

SGLang is a high-performance serving framework for large language models and multimodal models.

Link
bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.

Link
inspect_ai

Inspect: A framework for large language model evaluations

Link
jiuwenclaw

JiuwenClaw is an intelligent AI Agent built on openJiuwen. It extends the powerful capabilities of large language models directly to your fingertips through various communication apps you use daily.

Link
Spatial-SSRL

[CVPR 2026] Official release of "Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning"

Link
rl-injector

Official release of code for the paper "RL Is a Hammer and LLMs Are Nails: A Simple RL Approach to Stronger Prompt Injection Attacks"

Link
COMPASS

INFRA-COMPASS is a tool that leverages Large Language Models (LLMs) to create and maintain an inventory of state and local codes and ordinances applicable to energy infrastructure.

Link
Forgetting-A-New-Mechanism-Towards-Better-Large-Language-Model-Fine-tuning

Code for the paper "FORGETTING: A New Mechanism Towards Better Large Language Model Fine-Tuning". A novel approach to LLM fine-tuning using a token-level forgetting mechanism.

Link
mcp-github-pr-issue-analyser

A Model Context Protocol (MCP) application for automated GitHub PR analysis and issue management. Enables LLMs to fetch PR details, analyse diffs, manage issues, and handle releases through a standardised interface

Link
AI-ML-ENGINEERING-JOURNEY

A structured end-to-end AI/ML engineering journey covering mathematics, machine learning, deep learning, large language models, MLOps, and production-grade projects. Built with a strong focus on fundamentals, implementation, and real-world systems.

Link
AI-Voice-Agent

🗣️ Build an interactive voice agent that leverages Speech-to-Text, a Large Language Model, and Text-to-Speech for real-time voice interactions.

Link
synapz

synapz is a research prototype exploring how large language models can adapt teaching content to different cognitive styles. Built over a 48-hour sprint with a strict $50 API budget, this project implements a scientific framework to test whether adaptive teaching produces measurably better results than static approaches.

Link
pretrain-experiments

Continual pretraining experiments with large language models

Link
agentdeck-preview

AgentDeck is a research platform for studying AI behavior through game scenarios. Run controlled experiments with LLMs, collect comprehensive behavioral data, and replay matches for analysis. 🚧 Preview release - feedback welcome.

Link
Auto-Pentest-LLM

🔍 Automate penetration testing with an intelligent agent that organizes security assessments, leveraging local LLMs and Kali Linux for effective exploitation.

Link
AI-Drug-Repurposing

An end-to-end AI drug repurposing system based on knowledge graphs + large language models. Supported by CrewAI multi-agent collaboration, it quickly and efficiently identifies new indications for existing drugs.

Link
aipyapp

🖥️ Execute Python commands with AIPy, unlocking the potential of large language models to solve complex problems seamlessly.

Link
Source-Code-Security-Audit-Reviewer

intelligent auditing tool powered by large language models, supporting GPT, . It automatically detects security vulnerabilities, performance issues

Link
AI_OS

Local-first workspace for using large language models in a practical, understandable way. It makes reasoning, memory, tools, and workflows visible and reusable, so you can save thinking once, keep control of your data, and move on.

Link
semantic-conflicts-benchmark

Benchmarking the ability of large language models to detect semantic conflicts across domains, documents, and evolving knowledge bases.

Link

Source: discoveries-2026-04-13.md

February 2026

Curated discoveries from the LLM & Gen AI research pipeline relevant to testing and quality engineering.

Engram

Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

Link
Spatial-SSRL

Official release of "Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning"

Link
ChatGPT_on_CTF

This project examines whether ChatGPT or other LLM-based assistants (Microsoft New Bing or Google Bard) can help a user run commands in a test environment to solve CTF problems, that is, whether large language models can understand the challenge questions and capture the flags.

Link
NewAdversarialAttackPaper

A list of recent adversarial attack and defense papers (including those on large language models)

Link
StillMe-Learning-AI-System-RAG-Foundation

Self-Evolving RAG System with ChromaDB for continuous knowledge updates (6x daily), designed to overcome Large Language Model data cutoff limitations.

Link
llm-release-action

GitHub Action that uses an LLM to analyze commits, suggest semantic version bumps, and generate multi-audience changelogs

Link
mcp-github-pr-issue-analyser

A Model Context Protocol (MCP) application for automated GitHub PR analysis and issue management. Enables LLMs to fetch PR details, analyse diffs, manage issues, and handle releases through a standardised interface

Link
LLM-Evaluation

Real-time LLM delay & quality comparison tool built on FastAPI + SSE, tailored for Azure OpenAI. Modular architecture, unified responses/exceptions, env-based config, docs, and automation scripts ready for production release

Link
SSFT

Set Supervised Fine-Tuning (SSFT): Training Large Language Models To Reason In Parallel With Global Forking Tokens (ICLR 2026).

Link
Comfy-Bridge-release

A desktop tool for inspecting and modifying ComfyUI workflows using LLMs.

Link
ExFairS

This repository provides a comprehensive vLLM benchmarking framework for testing large language model performance and fairness across multiple scheduling strategies (ExFairS, VTC, FCFS, Queue-based) with built-in engine management, multi-experiment batch execution, and advanced plotting capabilities.

Link
SEDAC-V7.0-Pre-release-Test-Version

SEDAC is a next-generation framework that dynamically allocates computation during LLM inference. By using entropy-based gating, it routes predictable tokens through shallow subnetworks and sends ambiguous or high-impact tokens to deeper, specialized paths.

Link
agentdeck-preview

AgentDeck is a research platform for studying AI behavior through game scenarios. Run controlled experiments with LLMs, collect comprehensive behavioral data, and replay matches for analysis. 🚧 Preview release - feedback welcome.

Link
dltha_reasoning_v1

This dataset is the first release from DLTHA Labs, focused on enhancing the logical reasoning and step-by-step problem-solving capabilities of Large Language Models (LLMs).

Link
mcp-nvidia

MCP server that searches across NVIDIA blogs and releases so that LLMs can better answer NVIDIA-specific queries

Link
GenAI-CAD-CFD-Studio

🚀 Universal AI-Powered CAD & CFD Platform | Democratizing 3D Design & Simulation | Natural Language → Parametric Models | Build123d + Zoo.dev + Adam.new + OpenFOAM | Solar PV, Test Chambers, Digital Twins & More

Link
LLMfromscratch

Creating a large language model

Link
Academic-Extraction-GenAI-Pipeline

🔍 Extract structured academic metadata from research abstracts using multiple Large Language Models and assess their performance effectively.

Link
Source-Code-Security-Audit-Reviewer

intelligent auditing tool powered by large language models, supporting GPT, . It automatically detects security vulnerabilities, performance issues

Link
dp-fusion-lib

🔒 Enable secure Large Language Model inference with differential privacy for sensitive data protection using DP-Fusion-Lib.

Link

Source: discoveries-2026-01-26.md

March 2026

Kickoff

This is the first monthly section. Future sections will be added automatically with curated discoveries from the llm-discovery pipeline. Run the agent manually from Actions or wait for the monthly schedule.
