State of AI Testing

Monthly updates

Newest first. Sections are added by the Research & Literary Agent on the first of each month.

June 2026

Curated discoveries from the LLM & Gen AI research pipeline relevant to testing and quality engineering.

sglang

SGLang is a high-performance serving framework for large language models and multimodal models.

Link

fortune

Implements the concept of Human-AI Hybrid Intelligence in an intelligent quantitative analysis assistant for financial assets, with monetization capabilities. The system integrates Large Language Model reasoning with Machine Learning prediction models and monitors cryptocurrency, Hong Kong stocks, gold, and other financial markets in real time.

Link

Forgetting-A-New-Mechanism-Towards-Better-Large-Language-Model-Fine-tuning

[TMLR 2026 J2C Certification] FORGETTING: A New Mechanism Towards Better Large Language Model Fine-Tuning

Link

free-llm-api-resources

🤖 Discover free API access and credits for various legitimate large language model services in one curated resource list.

Link

md-bridge

Self-hosted document converter built on hand-written heuristics, not language models. Ships PDF ↔ Markdown; the architecture welcomes new format pairs as contributions land. Deterministic by design: same input, same output, every run. FastAPI + React, multi-arch Docker, MIT.

Link

AGI_HER_LLM

🚀 Adapt large language models continuously with task-agnostic methods, enhancing their performance with efficient benchmarks and algorithmic approaches.

Link

LLM-Filter-Probe

🕵️‍♂️ Analyze and reverse engineer keyword filtering in large language models to enhance compliance and operational insights.

Link

AI-cyber-range

⚔️ Build, break, and secure Large Language Models with our automated OWASP Top 10 cyber range for hands-on AI security training and research.

Link

Mobius-LLM-Fine-tuning-Engine

🔧 Fine-tune large language models locally on your data, export to GGUF, and train on CPU with ease using the Mobius LLM Fine-Tuning Engine.

Link

HowHungryisAIRepo

This repository powers the “How Hungry is AI?” dashboard and provides the code + data release accompanying our paper on inference-phase (operational) environmental footprints of LLMs.

Link

apollo-s2t

Push-to-talk speech-to-text for Windows: hold a key, speak, release: your words land in any text field. Deepgram + OpenRouter, with project-aware F10 prompt profiles.

Link

ai-agent-pulse

Daily automated tracking of 19 AI agent frameworks. Pulse Score ranks momentum via star velocity, release freshness, commit activity, and community health.

Link

navimed-umb

AWQ/W4A16 quantization, benchmarking and public release of Polish LLMs (PLLuM, Bielik) on AMD RDNA 4 (ROCm + vLLM-native) — reproducible methodology and hardware-envelope studies.

Link

Can-Agents-Write-Good-Property-Based-Tests

This research attempts to replicate examples from the paper "Can Large Language Models Write Good Property-Based Tests?" (https://doi.org/10.48550/arXiv.2307.04346) in a modern agent context. It reuses the paper's prompts with Codex and compares the results against basic human-written invariants and strategies built with the Hypothesis library.

Link
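
To make the idea behind the entry above concrete: a property-based test states an invariant and checks it against many generated inputs instead of a few hand-picked cases. The sketch below is not taken from the repository or the paper. It uses only Python's standard library so that it runs without dependencies, whereas Hypothesis adds input strategies and shrinking of failing cases. The run-length encoder under test and the helper name check_properties are hypothetical, chosen for illustration.

```python
import random


def run_length_encode(text: str) -> list[tuple[str, int]]:
    """Hypothetical function under test: collapse runs of equal characters."""
    encoded: list[tuple[str, int]] = []
    for char in text:
        if encoded and encoded[-1][0] == char:
            encoded[-1] = (char, encoded[-1][1] + 1)
        else:
            encoded.append((char, 1))
    return encoded


def run_length_decode(encoded: list[tuple[str, int]]) -> str:
    """Inverse of run_length_encode."""
    return "".join(char * count for char, count in encoded)


def check_properties(trials: int = 500, seed: int = 0) -> int:
    """Check two invariants on randomly generated strings; return trials run."""
    rng = random.Random(seed)  # fixed seed keeps any failure reproducible
    for _ in range(trials):
        # A tiny alphabet makes repeated characters (the interesting case) likely.
        text = "".join(rng.choice("ab ") for _ in range(rng.randint(0, 20)))
        encoded = run_length_encode(text)
        # Invariant 1: decoding inverts encoding (round trip).
        assert run_length_decode(encoded) == text
        # Invariant 2: adjacent runs never share a character.
        assert all(a[0] != b[0] for a, b in zip(encoded, encoded[1:]))
    return trials


print(check_properties())
```

The two invariants (round trip and maximal runs) are the kind of "basic human-written" properties the entry describes as a comparison baseline; an LLM-generated test would be judged on whether it finds invariants of similar strength.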

rd-net

🌊 Stabilize long-form text generation with RD-Net's simple drift mechanism for frozen large language models, reducing repetition collapse effectively.

Link

mcp-code-mode

🤖 Enable AI code generation using the Code Execution MCP Server, integrating large language models with the Model Context Protocol for seamless tool use.

Link

Clinical-LLM-CDSS

AI-powered Clinical Decision Support System integrating OCR, NLP, Large Language Models, and machine learning for automated extraction, analysis, and prediction from electronic health records and laboratory reports.

Link

llmtest-perf

Validate LLM inference performance and catch latency, throughput, and time-to-first-token (TTFT) regressions before release.

Link
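
A regression check of the kind described in the entry above can be sketched as a comparison between a baseline run and a candidate run. This is an illustration only and does not use or reproduce the llmtest-perf API: the PerfSample fields, the find_regressions helper, and the 10% tolerance are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerfSample:
    """Aggregated metrics from one benchmark run (field names are hypothetical)."""
    p95_latency_ms: float      # end-to-end latency, lower is better
    ttft_ms: float             # time to first token, lower is better
    tokens_per_second: float   # throughput, higher is better


def find_regressions(baseline: PerfSample, candidate: PerfSample,
                     tolerance: float = 0.10) -> list[str]:
    """Return the names of metrics that worsened by more than `tolerance`."""
    regressions = []
    # Latency-style metrics regress when they grow.
    for name in ("p95_latency_ms", "ttft_ms"):
        if getattr(candidate, name) > getattr(baseline, name) * (1 + tolerance):
            regressions.append(name)
    # Throughput regresses when it shrinks.
    if candidate.tokens_per_second < baseline.tokens_per_second * (1 - tolerance):
        regressions.append("tokens_per_second")
    return regressions


baseline = PerfSample(p95_latency_ms=800.0, ttft_ms=120.0, tokens_per_second=45.0)
candidate = PerfSample(p95_latency_ms=820.0, ttft_ms=150.0, tokens_per_second=44.0)
print(find_regressions(baseline, candidate))
```

In a release pipeline, a non-empty result would typically fail the build. A relative tolerance is used here because run-to-run noise in latency measurements makes exact comparisons unreliable.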

auto-crawl-tiktok-post-fb

Automate TikTok to Facebook posting, schedule Reels, and manage AI replies, inbox, campaigns, and workers in one dashboard

Link

NeuroLex-AI-powered-Fake-News-Detection-System

NeuroLex is an AI-powered fake news detection system that combines Natural Language Processing (NLP), Cognitive Linguistics, and Transformer-based models to analyze news content, detect misinformation, and provide explainable real-time predictions.

Link

Source: discoveries-2026-06-22.md

Research & Literary Agent – State of AI Testing

May 2026

Curated discoveries from the LLM & Gen AI research pipeline relevant to testing and quality engineering.

sglang

SGLang is a high-performance serving framework for large language models and multimodal models.

Link

NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

Link

Skills

A project to improve skills of large language models

Link

ras-commander

The RAS-Commander library provides a Python API for automating HEC-RAS 6.x and accessing HDF data, built with and driven by large language models.

Link

NewsPilot

NewsPilot is an automated intelligence analysis system based on Large Language Models (LLMs), designed to transform massive volumes of global news into personalized, actionable insights. It is not just a news aggregation tool, but a 24/7 intelligence assistant that understands your profession, holdings, and interests.

Link

MAGI

AI system powered by large language models.

Link

AbstractCore

A unified Python library for interaction with multiple Large Language Model (LLM) providers. Write once, run everywhere.

Link

Auto-Pentest-LLM

🔍 Automate penetration testing with an intelligent agent that organizes security assessments, leveraging local LLMs and Kali Linux for effective exploitation.

Link

Recipe-Language-Model

Recipe Language Model (RLM): We introduce a domain-specific RLM developed through seven AI layers and interconnected robotic boxes to drive the evolution of physical AI for defining a new paradigm - Materials Intelligence.

Link

jax-llm-expts

Experiments in building Small/Medium/Large Language Models using JAX

Link

auto-gpt

Auto-GPT is an experimental open-source framework that transforms large language models like GPT into autonomous agents capable of self-directed reasoning, recursive goal execution, and dynamic tool use. Unlike traditional chat interfaces

Link

HowHungryisAIRepo

This repository powers the “How Hungry is AI?” dashboard and provides the code + data release accompanying our paper on inference-phase (operational) environmental footprints of LLMs.

Link

ai-agent-pulse

Daily automated tracking of 19 AI agent frameworks. Pulse Score ranks momentum via star velocity, release freshness, commit activity, and community health.

Link

wiki-doc-skill

Persistent, LLM-maintained 3-level documentation for codebases. AST-driven line numbers, doc-only or edit-enabled doc passes, project-config'd cascade and verification. (The agent has not yet been taught how to read the docs; that is planned for the v1.1 release.)

Link

mcp-nvidia

MCP server to search across NVIDIA blogs and releases to empower LLMs to better answer NVIDIA specific queries

Link

INT-6940---AI-New-Store-Support-System---phase-1

This project uses Python and Structured Query Language (SQL) with known existing store data to predict, using the Huff model, how likely a new store opened in the same area would be visited.

Link
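
The Huff model named in the entry above estimates the probability that a customer visits a store as that store's utility divided by the summed utility of all competing stores, where utility grows with attractiveness (for example floor area) and falls with distance. The sketch below shows that calculation under assumptions: the function name, the exponents alpha=1 and beta=2, and the sample figures are illustrative and do not come from the project.

```python
def huff_probabilities(attractiveness: list[float], distances: list[float],
                       alpha: float = 1.0, beta: float = 2.0) -> list[float]:
    """Probability that a customer at one location visits each store.

    P_j = (A_j**alpha / d_j**beta) / sum_k (A_k**alpha / d_k**beta)
    """
    if len(attractiveness) != len(distances):
        raise ValueError("need exactly one distance per store")
    if any(d <= 0 for d in distances):
        raise ValueError("distances must be positive")
    utilities = [(a ** alpha) / (d ** beta)
                 for a, d in zip(attractiveness, distances)]
    total = sum(utilities)
    return [u / total for u in utilities]


# Two existing stores plus a proposed third one (made-up sizes and distances).
store_sizes = [1000.0, 2000.0, 1500.0]   # e.g. floor area
store_distances = [1.0, 2.0, 1.0]        # e.g. kilometres from the customer
probabilities = huff_probabilities(store_sizes, store_distances)
print([round(p, 3) for p in probabilities])
```

The last element is the estimated share of visits the proposed store would capture from this customer location. In practice alpha and beta are calibrated against observed visit data, which is where the existing store data would come in.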

Fake-News-Detection

Real-Time Fake News Detection offers an API-first approach to content moderation and journalistic verification. Powered by advanced Natural Language Processing (NLP) and ensemble tree models, this platform analyzes structure, sentiment, and context to deliver actionable credibility metrics.

Link

Jailbreak-Scaling-Laws-for-Large-Language-Models

[arxiv: 2603.11331] Scaling laws for the attack success rate under prompt-injection-based jailbreak attacks

Link

CutoffDateTesting

Benchmarking the true internal knowledge cutoff dates and factual decay of Large Language Models (LLMs) using notable death records.

Link

rag-chatbot

Chat With Documents is a Streamlit application designed to facilitate interactive, context-aware conversations with large language models (LLMs) by leveraging Retrieval-Augmented Generation (RAG). Users can upload documents or provide URLs, and the app indexes the content using a vector store called Chroma to supply relevant context during chats.

Link

Source: discoveries-2026-05-04.md

April 2026

Curated discoveries from the LLM & Gen AI research pipeline relevant to testing and quality engineering.

sglang

SGLang is a high-performance serving framework for large language models and multimodal models.

Link

bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.

Link

inspect_ai

Inspect: A framework for large language model evaluations

Link

jiuwenclaw

JiuwenClaw is an intelligent AI Agent built on openJiuwen. It extends the powerful capabilities of large language models directly to your fingertips through various communication apps you use daily.

Link

Spatial-SSRL

[CVPR 2026] Official release of "Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning"

Link

rl-injector

Official release of code for the paper "RL is a hammer and LLMs are nails: A simple RL approach to stronger prompt injection attacks".

Link

COMPASS

INFRA-COMPASS is a tool that leverages Large Language Models (LLMs) to create and maintain an inventory of state and local codes and ordinances applicable to energy infrastructure.

Link

Forgetting-A-New-Mechanism-Towards-Better-Large-Language-Model-Fine-tuning

Code for "FORGETTING: A New Mechanism Towards Better Large Language Model Fine-Tuning" paper. Novel approach for LLM fine-tuning using token-level forgetting mechanism.

Link

mcp-github-pr-issue-analyser

A Model Context Protocol (MCP) application for automated GitHub PR analysis and issue management. Enables LLMs to fetch PR details, analyse diffs, manage issues, and handle releases through a standardised interface

Link

AI-ML-ENGINEERING-JOURNEY

A structured end-to-end AI/ML engineering journey covering mathematics, machine learning, deep learning, large language models, MLOps, and production-grade projects. Built with a strong focus on fundamentals, implementation, and real-world systems.

Link

AI-Voice-Agent

🗣️ Build an interactive voice agent that leverages Speech-to-Text, a Large Language Model, and Text-to-Speech for real-time voice interactions.

Link

synapz

synapz is a research prototype exploring how large language models can adapt teaching content to different cognitive styles. Built over a 48-hour sprint with a strict $50 API budget, this project implements a scientific framework to test whether adaptive teaching produces measurably better results than static approaches.

Link

pretrain-experiments

Continual pretraining experiments with large language models

Link

agentdeck-preview

AgentDeck is a research platform for studying AI behavior through game scenarios. Run controlled experiments with LLMs, collect comprehensive behavioral data, and replay matches for analysis. 🚧 Preview release - feedback welcome.

Link

Auto-Pentest-LLM

🔍 Automate penetration testing with an intelligent agent that organizes security assessments, leveraging local LLMs and Kali Linux for effective exploitation.

Link

AI-Drug-Repurposing

An end-to-end AI drug repurposing system based on knowledge graphs + large language models. Supported by CrewAI multi-agent collaboration, it quickly and efficiently identifies new indications for existing drugs.

Link

aipyapp

🖥️ Execute Python commands with AIPy, unlocking the potential of large language models to solve complex problems seamlessly.

Link

Source-Code-Security-Audit-Reviewer

intelligent auditing tool powered by large language models, supporting GPT, . It automatically detects security vulnerabilities, performance issues

Link

AI_OS

Local-first workspace for using large language models in a practical, understandable way. It makes reasoning, memory, tools, and workflows visible and reusable, so you can save thinking once, keep control of your data, and move on.

Link

semantic-conflicts-benchmark

Benchmarking the ability of large language models to detect semantic conflicts across domains, documents, and evolving knowledge bases.

Link

Source: discoveries-2026-04-13.md

February 2026

Curated discoveries from the LLM & Gen AI research pipeline relevant to testing and quality engineering.

Engram

Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

Link

Spatial-SSRL

Official release of "Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning"

Link

ChatGPT_on_CTF

We want to see whether ChatGPT or other LLM-based assistants (Microsoft New_Bing or Google Bard) can help a user work in a test environment and run commands to solve CTF problems, that is, whether the models can understand the challenge question and capture its flags.

Link

NewAdversarialAttackPaper

A list of recent adversarial attack and defense papers (including those on large language models)

Link

StillMe-Learning-AI-System-RAG-Foundation

Self-Evolving RAG System with ChromaDB for continuous knowledge updates (6x daily), designed to overcome Large Language Model data cutoff limitations.

Link

llm-release-action

GitHub Action that uses LLM to analyze commits, suggest semantic version bumps, and generate multi-audience changelogs

Link

mcp-github-pr-issue-analyser

Link

LLM-Evaluation

Real-time LLM delay & quality comparison tool built on FastAPI + SSE, tailored for Azure OpenAI. Modular architecture, unified responses/exceptions, env-based config, docs, and automation scripts ready for production release

Link

SSFT

Set Supervised Fine-Tuning (SSFT): Training Large Language Models To Reason In Parallel With Global Forking Tokens (ICLR2026).

Link

Comfy-Bridge-release

A desktop tool for inspecting and modifying ComfyUI workflows using LLMs.

Link

ExFairS

This repository provides a comprehensive vLLM benchmarking framework for testing large language model performance and fairness across multiple scheduling strategies (ExFairS, VTC, FCFS, Queue-based) with built-in engine management, multi-experiment batch execution, and advanced plotting capabilities.

Link

SEDAC-V7.0-Pre-release-Test-Version

SEDAC is a next-generation framework that dynamically allocates computation during LLM inference. By using entropy-based gating, it routes predictable tokens through shallow subnetworks and sends ambiguous or high-impact tokens to deeper, specialized paths.

Link
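
The entropy-based gating mentioned in the entry above can be illustrated with a small routing rule: compute the Shannon entropy of the model's next-token distribution and send low-entropy (predictable) tokens down a cheap path. This is a sketch of the general idea only, not SEDAC's implementation. The function names, the path labels, and the 0.5-nat threshold are assumptions, and the "high-impact token" criterion from the description is not modelled.

```python
import math


def shannon_entropy(probs: list[float]) -> float:
    """Entropy in nats of a probability distribution (zero entries are skipped)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def choose_path(probs: list[float], threshold: float = 0.5) -> str:
    """Route a token: 'shallow' if the distribution is confident, else 'deep'."""
    return "shallow" if shannon_entropy(probs) < threshold else "deep"


confident = [0.97, 0.01, 0.01, 0.01]   # one token dominates: low entropy
ambiguous = [0.25, 0.25, 0.25, 0.25]   # uniform over four tokens: entropy = ln 4

print(choose_path(confident), choose_path(ambiguous))
```

A real system would compute the distribution from intermediate-layer logits and would need to tune the threshold against quality metrics, since an overly aggressive gate trades accuracy for speed.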

agentdeck-preview

Link

dltha_reasoning_v1

This dataset is the first release from DLTHA Labs, focused on enhancing the logical reasoning and step-by-step problem-solving capabilities of Large Language Models (LLMs).

Link

mcp-nvidia

MCP server to search across NVIDIA blogs and releases to empower LLMs to better answer NVIDIA specific queries

Link

GenAI-CAD-CFD-Studio

🚀 Universal AI-Powered CAD & CFD Platform | Democratizing 3D Design & Simulation | Natural Language → Parametric Models | Build123d + Zoo.dev + Adam.new + OpenFOAM | Solar PV, Test Chambers, Digital Twins & More

Link

LLMfromscratch

Creating a large language model

Link

Academic-Extraction-GenAI-Pipeline

🔍 Extract structured academic metadata from research abstracts using multiple Large Language Models and assess their performance effectively.

Link

Source-Code-Security-Audit-Reviewer

intelligent auditing tool powered by large language models, supporting GPT, . It automatically detects security vulnerabilities, performance issues

Link

dp-fusion-lib

🔒 Enable secure Large Language Model inference with differential privacy for sensitive data protection using DP-Fusion-Lib.

Link

Source: discoveries-2026-01-26.md

March 2026

Kickoff

This is the first monthly section. Future sections will be added automatically with curated discoveries from the llm-discovery pipeline. Run the agent manually from Actions or wait for the monthly schedule.
