📓 OzSE 2026 — Trustworthiness as the New Frontier
Australian Summer School in Software Engineering
📍 Melbourne Connect
🗓 9–10 February 2026

Across two days of talks and discussions, a clear shift emerged: AI in software engineering is no longer a performance race — it is a trustworthiness race. The focus has moved from model capability to verification, auditability, and structural guarantees.
Day 1 — Neuro-Symbolic × CPS × Safety
The morning session centered on safety-critical autonomy and how learning-enabled systems can be made formally verifiable.
Xi Zheng — Verifiable Autonomous Systems

Xi represents a classical CPS systems engineering perspective rather than a purely model-driven AI approach.
His core question was: when deep learning enters UAVs and autonomous driving, how can we provide system-level verifiable guarantees?
NeuroStrata Architecture:
- Visual Input
- Foundation Model Segmentation
- Probabilistic Scene Graph
- Datalog / Scallop Rule Reasoning
- Safety Decision
The novelty lies not in segmentation itself, but in:
- Probabilistic semantic scene graphs
- Rule-level attribution
- Traceable safety decision logic
This is structural-level explanation — one abstraction layer above attention heatmaps.
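The rule layer above can be sketched in plain Python. The facts, confidence threshold, and braking rule are invented for illustration (this is not NeuroStrata code), but they show what rule-level attribution buys: the decision arrives together with the exact facts that fired.

```python
# Illustrative sketch only: Datalog-style safety rules over a probabilistic
# scene graph, in plain Python. Facts and the rule are hypothetical.

# Scene graph: (subject, relation, object) -> probability from perception.
scene = {
    ("pedestrian_1", "on", "crosswalk"): 0.92,
    ("ego", "approaching", "crosswalk"): 0.88,
    ("light", "state", "green"): 0.75,
}

THRESHOLD = 0.8  # minimum confidence for a fact to participate in a rule

def decide(scene):
    """Fire safety rules and return (decision, fired_facts) for attribution."""
    fired = [f for f, p in scene.items() if p >= THRESHOLD]
    # Rule: pedestrian on a crosswalk that we are approaching -> BRAKE.
    if ("pedestrian_1", "on", "crosswalk") in fired and \
       ("ego", "approaching", "crosswalk") in fired:
        return "BRAKE", fired
    return "PROCEED", fired

decision, evidence = decide(scene)
print(decision)   # BRAKE
print(evidence)   # the exact facts that justified the decision
```

The attribution list is the point: a safety auditor can trace the braking decision back to specific high-confidence perception facts, one abstraction layer above pixel heatmaps.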
Yulei Sui — Static Analysis for Neural Networks

Sui represents the program analysis tradition. His core logic was precise:
Testing ≠ Verification
One must prove ∀ x ∈ Φ, f(x) ⊨ Ψ: for every input x in the specified domain Φ, the output satisfies the property Ψ, not merely for the finitely many inputs a test suite exercises.
He transfers abstract interpretation to neural network verification, leveraging:
- DeepPoly / CROWN convex relaxations
- Branch-and-Bound exact search
- ACT verification framework
Notably, he is extending this logic to LLM and Agentic workflow verification, signaling that the static analysis community is actively entering the LLM era.
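The proof principle behind these relaxations can be illustrated with the coarsest abstract domain, intervals. The sketch below propagates an input range through one affine neuron and a ReLU; DeepPoly and CROWN use much tighter relational bounds, but the conclusion has the same shape: the property holds for every x in Φ, not just for tested points.

```python
# Minimal interval bound propagation through w*x + b followed by ReLU.
# Intervals are the coarsest abstract domain; DeepPoly/CROWN use tighter
# relational relaxations, but the proof principle is identical.

def affine_bounds(lo, hi, w, b):
    """Sound output bounds for w*x + b when x ranges over [lo, hi]."""
    if w >= 0:
        return w * lo + b, w * hi + b
    return w * hi + b, w * lo + b

def relu_bounds(lo, hi):
    return max(0.0, lo), max(0.0, hi)

# Property Ψ: the output stays below 5 for every x in Φ = [-1, 1].
lo, hi = affine_bounds(-1.0, 1.0, w=2.0, b=0.5)   # -> (-1.5, 2.5)
lo, hi = relu_bounds(lo, hi)                       # -> (0.0, 2.5)
print(hi < 5.0)  # True: proven for ALL of Φ, not just sampled test points
```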
Cristina Cifuentes — Intelligent Application Security

Cristina presented an industry-grounded neuro-symbolic approach.
- LLM-generated vulnerability detectors
- LLM-guided fuzzing
- LLM-generated patches
LLMs cannot replace static analysis.
They must be embedded into the static analysis loop.
The engineering workflow: Static analysis → LLM candidate generation → Re-verification → Human confirmation.
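A minimal sketch of that loop, with stub functions standing in for a real analyzer and a real model (the detector and patch logic below are invented for illustration):

```python
# Hypothetical sketch of the loop: the LLM proposes, static analysis
# re-verifies, and a human confirms. The stubs stand in for a real
# analyzer and a real model; only the control flow is the point.

def static_analysis(code):
    # Stand-in detector: flags string-built SQL as a potential injection.
    return ["sql-injection"] if "'%s'" in code else []

def llm_propose_patch(code, finding):
    # Stand-in for an LLM call; returns a parameterized-query rewrite.
    return code.replace("\"SELECT * FROM users WHERE name = '%s'\" % name",
                        "\"SELECT * FROM users WHERE name = ?\", (name,)")

def repair(code):
    for finding in static_analysis(code):
        candidate = llm_propose_patch(code, finding)
        if not static_analysis(candidate):      # re-verification gate
            return candidate, "awaiting human confirmation"
    return code, "no verified patch"

vulnerable = "cur.execute(\"SELECT * FROM users WHERE name = '%s'\" % name)"
patched, status = repair(vulnerable)
print(status)   # awaiting human confirmation
```

The re-verification gate is what makes the LLM safe to use here: a candidate patch that still trips the analyzer never reaches a human, let alone production.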
Yongqiang Tian — LPR: Language-Agnostic Program Reduction

The core philosophy was clear: LLM provides creativity, while formal specifications provide safety.
LPR leverages language-agnostic transformations combined with LLM-generated reduction strategies, and has discovered 300+ bugs across multiple compilers.
Cost: $0.42 per benchmark.
This represents a rare example of industrially viable LLM + formal method integration — a concrete form of AI-augmented systems engineering.
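The general shape of program reduction can be sketched as a greedy fixed-point loop. LPR's actual transformations are syntax-aware and LLM-guided rather than line-based, so the following is only an illustration of the iteration:

```python
# Generic greedy reduction loop (the shape behind tools like LPR):
# repeatedly drop chunks of the program while an oracle confirms the
# bug still reproduces. This line-based loop is only an illustration;
# LPR operates on language-agnostic, LLM-guided transformations.

def reduce_program(lines, still_fails):
    changed = True
    while changed:
        changed = False
        for i in range(len(lines)):
            candidate = lines[:i] + lines[i+1:]       # try dropping line i
            if candidate and still_fails(candidate):  # oracle: bug persists?
                lines, changed = candidate, True
                break
    return lines

# Toy oracle: the "bug" needs both the bad declaration and the bad call.
program = ["int x;", "int bad;", "use(bad);", "return 0;"]
oracle = lambda ls: "int bad;" in ls and "use(bad);" in ls
print(reduce_program(program, oracle))  # ['int bad;', 'use(bad);']
```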
Sumudu Bambarawanaliyanage — EnsLLM: Reliability in LLM Code Generation

The central insight of EnsLLM is straightforward but important: LLMs optimize probability, not semantic correctness.
Rather than replacing models with stronger ones, EnsLLM introduces a verification layer above generation:
- Multi-model generation
- Similarity-based voting
- Behavioral differential analysis
Correctness is inferred via collective consistency rather than single-model confidence.
This approach feels deeply rooted in software engineering thinking: not model worship, but layered verification.
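The consistency idea can be sketched as behavioral clustering; the candidate functions and probe inputs below are invented for illustration:

```python
# Hypothetical sketch of consistency-based selection: candidates that
# behave identically on probe inputs are clustered, and the largest
# behavioral cluster wins, instead of trusting any single model.
from collections import defaultdict

def select_by_consistency(candidates, probe_inputs):
    clusters = defaultdict(list)
    for fn in candidates:
        signature = tuple(fn(x) for x in probe_inputs)  # behavioral fingerprint
        clusters[signature].append(fn)
    # Majority behavior = best available proxy for semantic correctness.
    return max(clusters.values(), key=len)[0]

# Three "model outputs" for abs(): two agree, one is subtly wrong.
cand_a = lambda x: x if x >= 0 else -x
cand_b = lambda x: (x * x) ** 0.5
cand_c = lambda x: x            # buggy: wrong on negatives

chosen = select_by_consistency([cand_a, cand_b, cand_c], [-2, 0, 3])
print(chosen(-2))  # 2 -- the majority (correct) behavior was selected
```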
Afternoon — LLM Code Generation Reliability
Guowei Yang — EnsLLM

Guowei's talk presented the same EnsLLM work summarized above: a verification layer built from multi-model generation, similarity-based voting, and behavioral differential analysis, with correctness inferred from collective consistency rather than single-model confidence.
Chunhua Liu — Hallucinations in Code Change to Natural Language Generation

This talk examined hallucinations in tasks that translate code changes into natural language — specifically:
- Commit message generation
- Code review comment generation
While hallucinations have been studied separately in natural language generation and code generation, their behavior in structurally complex, context-dependent code-change tasks remains largely unexplored.
Key empirical findings:
- ~50% of generated code reviews contain hallucinations
- ~20% of generated commit messages contain hallucinations
The study evaluated metric-based hallucination detection approaches. While commonly used metrics perform weakly in isolation, combining multiple metrics substantially improves detection performance.
Notably, model confidence and feature attribution metrics show promise for inference-time hallucination detection.
This work reinforces a central theme of the conference: reliability and trustworthiness cannot be assumed — they must be measured, quantified, and systematically detected.
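How combining weak metrics can beat any single one is easy to illustrate; the metric names, weights, and thresholds below are invented for the sketch and are not the study's actual detectors:

```python
# Hypothetical illustration of metric combination for hallucination
# detection: each weak signal votes, and a generated review is flagged
# only when the combined evidence crosses a threshold.

def overlap_score(source_diff, text):
    """Fraction of words in the generated text grounded in the diff."""
    src = set(source_diff.lower().split())
    words = text.lower().split()
    return sum(w in src for w in words) / max(len(words), 1)

def combined_flag(source_diff, text, model_confidence):
    signals = [
        overlap_score(source_diff, text) < 0.3,   # weakly grounded wording
        model_confidence < 0.5,                   # low self-confidence
        len(text.split()) > 50,                   # long outputs drift more
    ]
    return sum(signals) >= 2                      # majority of weak signals

diff = "fix null check in parse_config before dereference"
review = "consider renaming the database table and adding an index"
print(combined_flag(diff, review, model_confidence=0.4))  # True
```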
Naim Rastgoo — AI Tutorial: Building AI Assistants with LangChain and LangGraph

This 40-minute tutorial focused on practical methods for building AI assistants using LangChain and LangGraph.
The session emphasized how agent-based architectures can be engineered through explicit workflow design rather than relying on a single monolithic model.
Key concepts demonstrated:
- Composable LLM pipelines
- Tool integration and external API orchestration
- Stateful multi-step reasoning
- Graph-based agent execution control
LangGraph in particular highlights a shift toward explicit control flow in agent systems, making agent behavior more structured and inspectable.
From a broader perspective, this tutorial reinforced a recurring theme of the conference: Agentic AI is moving from prompt hacking to workflow engineering.
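The explicit-control-flow idea can be illustrated without any library: nodes transform shared state, and edges (one of them conditional) choose the next node. This is a conceptual sketch, not the LangGraph API itself:

```python
# Library-free sketch of the idea LangGraph makes explicit: agent
# behavior as a graph of nodes over shared state, with conditional
# edges deciding control flow. Node names are hypothetical.

def plan(state):
    state["steps"] = ["lookup", "answer"]
    return state

def lookup(state):
    state["evidence"] = f"docs about {state['question']}"
    return state

def answer(state):
    state["answer"] = f"Based on {state['evidence']}: done"
    return state

NODES = {"plan": plan, "lookup": lookup, "answer": answer}
EDGES = {
    "plan": lambda s: "lookup",                       # unconditional edge
    "lookup": lambda s: "answer" if s.get("evidence") else "lookup",
    "answer": lambda s: None,                         # terminal node
}

def run(entry, state):
    node = entry
    while node is not None:          # explicit, inspectable control flow
        state = NODES[node](state)
        node = EDGES[node](state)
    return state

result = run("plan", {"question": "retry policy"})
print(result["answer"])
```

Because the graph is data, every transition can be logged and audited, which is precisely what makes this style "structured and inspectable" compared with a single opaque prompt loop.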
My Talk — XMAS-CQP

I presented XMAS-CQP as a structured, multi-agent system for explainable software quality prediction.
- Role-based multi-agent collaboration
- Structured JSON schema outputs
- Auditable risk reasoning
- Explanations as first-class artefacts
The system emphasizes:
- Structured outputs
- Traceability
- Reproducibility
- Verification-awareness
Rather than scaling a single LLM, the focus is on engineering controllable, auditable agent workflows.
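One way to make structured outputs auditable is a schema gate that every agent response must pass before it enters the pipeline. The fields below are hypothetical, not XMAS-CQP's actual schema:

```python
# Hypothetical sketch of treating agent explanations as structured,
# auditable artefacts: every agent output must match a schema before
# it enters the pipeline. The fields are invented for illustration.
import json

REQUIRED = {"module": str, "risk_level": str, "evidence": list, "rationale": str}

def validate(raw):
    record = json.loads(raw)
    for field, kind in REQUIRED.items():
        if not isinstance(record.get(field), kind):
            raise ValueError(f"schema violation: {field}")
    if record["risk_level"] not in {"low", "medium", "high"}:
        raise ValueError("risk_level out of range")
    return record  # safe to log, diff, and audit downstream

agent_output = json.dumps({
    "module": "auth/session.py",
    "risk_level": "high",
    "evidence": ["churn spike", "3 linked incident reports"],
    "rationale": "recent refactor touched token expiry logic",
})
print(validate(agent_output)["risk_level"])  # high
```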
Software Requirements and Evolution in the Age of Generative AI
This session, chaired by Dr Tingting Bi, explored how generative AI reshapes requirements engineering, software quality assurance, human-oriented SE, and documentation maintenance.
Chetan Arora — Prompt Engineering, LLMs and RAG for Requirements-driven QA

The talk demonstrated how prompt engineering, LLMs, and Retrieval-Augmented Generation (RAG) can operationalize requirements-driven software quality assurance.
- Transforming natural-language requirements into test scenarios
- Automatic generation and refinement of QA artefacts
- RAG grounding using up-to-date specifications
- Prompt engineering techniques for requirements engineering
This presentation highlighted a key trend: grounding LLM outputs in structured requirement knowledge is essential for trustworthy QA pipelines.
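A minimal, library-free sketch of the grounding step: retrieve the requirement snippets most relevant to a QA question and constrain the prompt to them. The scoring scheme and requirement texts are invented for illustration:

```python
# Minimal retrieval-grounding sketch (hypothetical, library-free):
# pick the requirement snippets most relevant to a QA question and
# prepend them to the prompt so the model answers from current specs.

def retrieve(query, documents, k=2):
    q = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

requirements = [
    "REQ-12: sessions expire after 30 minutes of inactivity",
    "REQ-07: passwords must be hashed with a per-user salt",
    "REQ-21: the export job runs nightly at 02:00",
]
question = "what test covers session expiry after inactivity"
context = retrieve(question, requirements)
prompt = "Answer using only:\n" + "\n".join(context) + "\n\nQ: " + question
print(context[0])  # REQ-12: sessions expire after 30 minutes of inactivity
```

Real systems replace the word-overlap score with embedding similarity, but the trust argument is the same: the model is constrained to answer from the retrieved, up-to-date specification text.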
Maria Spichkova — Human-Oriented SE in the Age of GenAI

Maria emphasized that software systems are developed by humans and for humans — yet socio-cultural diversity and human factors are often underrepresented in system design.
With the rise of GenAI, SE methodologies are evolving rapidly, and it becomes critical to ensure that these changes preserve usability, sustainability, and inclusivity.
This talk rebalanced the technical intensity of the day with a reminder: trustworthiness also includes cultural and social dimensions.
Yuqing Xiao — Human-centric Requirements in Aged Care Digital Health

Yuqing presented a systematic review of 69 primary studies on requirements engineering for older adult digital health systems.
Complemented by an empirical survey of developers, caregivers, and older adults, the work identified both functional and non-functional requirements for aged care software.
This integrated evidence base strengthens the foundation for human-centric digital health design.
Shashiwadana Nirmani — Motivation-aware OSS Recommendation

Based on a study of 208 OSS practitioners, this work explored how demographics and motivations influence contributor preferences.
A prototype recommendation system integrated human factors alongside technical attributes to personalize project suggestions.
This research highlights that software ecosystems are sustained not only by code quality, but by human motivation alignment.
Haoyu Gao — Automated Documentation Maintenance

Haoyu addressed the long-standing problem of outdated documentation.
Building upon prior empirical work on README evolution, he introduced an LLM-based agentic system that automates documentation updates within pull requests, incorporating a human-in-the-loop.
Quantitative evaluation, qualitative failure analysis, and case studies with OSS developers collectively demonstrate the feasibility of automated documentation maintenance.
Day 2 — Enhancing Trustworthiness of Agentic AI Systems
Chaired by Dr Yongqiang Tian, this session focused on one central theme: trust cannot be assumed for LLMs or agentic systems — it must be engineered, measured, and verified.
Valerio Terragni — Metamorphic Testing for LLMs: Improving Trust and Quality in AI

The talk tackled a fundamental barrier in testing LLMs: the oracle problem. When labeled ground truth is unavailable, detecting faulty behavior becomes difficult.
Metamorphic Testing (MT) addresses this by using Metamorphic Relations (MRs) — expected relationships between outputs of related inputs — enabling fault detection without explicit oracles.
What stood out was the scale and completeness of the study:
- Reviewed the literature and identified 191 MRs for NLP tasks
- Implemented 36 representative MRs
- Ran over 560,000 metamorphic tests
- Evaluated three popular LLMs
The results highlighted both strengths and limitations of MT — and importantly, positioned MT as a practical bridge between software testing traditions (SE4AI) and AI-augmented software engineering (AI4SE).
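A concrete MR makes the idea tangible: we never need the "correct" answer for either input, only a relation between the two outputs. The toy sentiment scorer below stands in for an LLM and is invented for illustration:

```python
# A metamorphic relation needs no ground-truth labels: we only check a
# relationship between outputs of related inputs. Here a toy sentiment
# scorer stands in for an LLM, and the MR is "adding an intensifier
# must not lower the score".

def toy_sentiment(text):
    pos = {"good", "great", "excellent"}
    neg = {"bad", "awful"}
    words = text.lower().split()
    score = sum(w in pos for w in words) - sum(w in neg for w in words)
    return score + (0.5 if "very" in words else 0.0)

def mr_intensifier_holds(base_text):
    followup = "very " + base_text                # metamorphic transformation
    return toy_sentiment(followup) >= toy_sentiment(base_text)

# Fault detection without knowing the "correct" sentiment of either input:
print(mr_intensifier_holds("great service"))  # True -- MR satisfied
```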
Aldeida Aleti — Trustworthy AI Agents for Software Engineering

This talk framed the core obstacle in agentic SE systems as trustworthiness. Even when benchmark results look impressive, LLM-based agents often remain unreliable: hallucinations, vulnerability misinterpretations, and misleading explanations.
A particularly sharp point was that many failures stem from spurious correlations and shortcut learning — models mimic surface patterns rather than understanding program semantics.
The proposed direction: Trustworthiness Oracles.
These are automated evaluators that measure, explain, and audit trustworthiness in AI-generated artefacts, integrating formal reasoning and interpretability to restore confidence in AI-assisted development.
This concept maps cleanly onto the broader OzSE narrative: evaluation must move from “does it work on benchmarks?” to “can we systematically justify, verify, and audit its behavior?”
Qinghua Lu — Verifiability-First AI Engineering: Building Trust in AIware-Centric Systems

Qinghua’s framing was extremely clear: as systems shift from software-centric architectures to AIware-centric ecosystems, the engineering challenge shifts from producing behavior to verifying it.
Traditional SE assumes business logic lives in human-written code, so testing and analysis can be deterministic. In contrast, foundation models embed logic in weights and learned policies, introducing unprecedented assurance complexity.
Verifiability-first design strategies discussed:
- Decompose tasks into machine-verifiable / human-verifiable components
- Embed constraints into pipelines
- Leverage formal methods and automated testing
- Extend verification beyond correctness to alignment, interpretability, and safety
A key takeaway was the shift in human roles: humans increasingly become designers of constraints and orchestrators of verification, rather than the direct producers of code and tests.
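A verifiability-first gate can be sketched as follows: AI-produced logic is admitted only when machine-checkable properties pass, and everything else is routed to a human. The properties and input domain below are invented for illustration:

```python
# Hypothetical sketch of a verifiability-first gate: AI-produced logic
# is admitted to the pipeline only after machine-verifiable checks
# pass; anything unverifiable is routed to a human instead.

def machine_verify(fn):
    """Property checks a machine can run exhaustively on a small domain."""
    checks = [
        all(fn(n) >= 0 for n in range(-10, 11)),          # non-negativity
        all(fn(n) == fn(-n) for n in range(-10, 11)),     # symmetry
    ]
    return all(checks)

def admit(candidate_fn):
    if machine_verify(candidate_fn):
        return "deployed"
    return "escalated to human review"          # human-verifiable component

generated = lambda x: x * x        # stand-in for model-generated code
print(admit(generated))            # deployed
print(admit(lambda x: x))          # escalated to human review
```

The human's role here is exactly the one Qinghua described: designing the constraints in `machine_verify` and deciding what happens at escalation, rather than writing the candidate code itself.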
The Future of Agentic Software Development
Chaired by A/Prof Patanamon (Pick) Thongtanunam, this session explored how agentic systems can be scaled, evaluated, and integrated into real software development environments — especially under enterprise constraints.
Minwoo Jeong & Jirat Pasuksmit — Rovo Dev: Toward Scaling AI Agents for Software Engineering (Atlassian)


This talk centered on real-world scaling challenges for enterprise agent systems: agents must operate reliably across large, diverse codebases while balancing latency, cost, and accuracy.
A strong engineering message emerged: scaling is not just adding compute — it requires architectural constraints that keep agent behavior predictable and maintain developer trust.
The talk concluded with open challenges in evaluating agent workflows and managing data risks in proprietary settings — a very practical industry perspective.
Gopi Krishnan Rajbahadur — SPICE: An Automated SWE-Bench Labeling Pipeline

SPICE addresses a bottleneck that the whole community feels: high-quality labeled datasets are essential, but manual labeling is prohibitively expensive and slow.
SPICE is a scalable automated pipeline for labeling SWE-bench-style datasets with annotations such as issue clarity, test coverage, and effort estimation — combining context-aware navigation, rationale-driven prompting, and multi-pass consensus.
The cost reduction claim was striking:
- Labeling 1,000 instances: from ~$100,000 manual cost
- Down to only $5.10 using SPICE
This is a “hidden infrastructure” contribution: reliable evaluation requires reliable datasets, and SPICE pushes that frontier in a concrete, scalable way.
Sherlock Licorish — LLMs’ Efficacy for Code Generation and Software Improvement

This talk critically examined the real efficacy of LLMs for code generation, not only in functional correctness but also across broader quality dimensions: security, reliability, readability, and maintainability.
A key practical emphasis was that performance is highly sensitive to prompt designs and hyperparameter configurations, reinforcing that LLM “capability” is not a fixed property — it is an engineered outcome.
The framing was deliberately balanced: LLMs are useful, but the debate remains open, and careful methodology is essential when comparing LLMs with human developers.
Hong Yi Lin — CodeReviewQA: Code Review Comprehension Assessment for LLMs

This work targets a real-world weakness: LLMs may generate code well, but struggle with code review-driven refinement, where comments are often implicit, ambiguous, and colloquial.
Existing evaluations rely on text matching and can be vulnerable to training data contamination. CodeReviewQA proposes a benchmark that supports fine-grained assessment and mitigates contamination risk.
The key design is decomposing refinement into three reasoning steps:
- Change Type Recognition (CTR)
- Change Localisation (CL)
- Solution Identification (SI)
Each step is reformulated as multiple-choice questions with varied difficulty. The evaluation spans 72 recently released LLMs on 900 manually curated examples across nine languages, exposing specific weaknesses disentangled from pure generation performance.
This is another strong signal of the conference trend: trustworthy agentic SE requires benchmarks that measure reasoning, not only outputs.
Building and Managing Generative AI Software
Chaired by Dr Chetan Arora, this session shifted from model- and agent-level trustworthiness toward the operational reality: how GenAI software is built, managed, and sustained in practice — across partnerships, project management, and model reuse in downstream systems.
Scott Barnett — Power Laws and Partnership: Lessons from Serving the Long Tail

Barnett’s talk was an unusually practical reflection on how academic SE research can generate real-world impact in an AI-accelerated industry landscape dominated by a handful of major tech companies.
His core argument: academics can create unique value by partnering with the underserved long tail of organisations — via a non-traditional model built on small engagements, deployable artefacts, contextual understanding, and adaptability.
Lessons from 40 projects and 20 partnerships:
- What enables success: trust, problem-driven work, responsive communication, supportive environments
- What undermines it: treating researchers as developers, building beyond partner capacity, publication-centric timelines
- Common failure modes: shifting focus too often, slow internal processes
The meta-message was clear: industry-focused research is a viable academic path, but only if navigated intentionally — with impact-oriented portfolios, cross-disciplinary fluency, and methodological grounding (e.g., design science, context-driven SE).
Lakshana Assalaarachchi — Redefining Software Project Management in the Era of Evolving AI

This spotlight talk addressed a surprisingly underexplored dimension of agentic AI: software project management (SPM). The framing was that SPM must evolve in parallel with SE — especially as agentic AI becomes embedded in development workflows.
Based on a grey literature review (LinkedIn articles, blogs, and industry reports), she summarized current AI applications in SPM, including:
- Automation of routine tasks (e.g., document generation)
- Predictive analytics
- Communication and meeting support
The stance was pragmatic: project managers are expected to be supported by AI, not replaced. She proposed an upskilling framework mapped to PMI’s talent triangle, and introduced a vision of an agentic PM — an ethical, human-controlled multi-agent system with four working modes that vary autonomy by task complexity and risk.
This talk connected strongly with the conference’s broader “trustworthiness” theme: governance is not only about models — it is also about workflow roles and human control.
Peerachai Banyongrakkul — Challenges and Evolution in Reusing Pre-trained Models

This talk focused on the downstream engineering reality of AI-driven software: developers increasingly reuse pre-trained models (PTMs) rather than training from scratch — but reuse introduces unique, underexplored SE challenges.
Drawing from an ICSME 2025 qualitative empirical study, the work provided a first analysis of challenges faced in OSS projects reusing PTMs, by systematically mining real-world issue reports.
Recurring barriers identified:
- Model usage and integration friction
- Model performance instability
- Software environment and dependency constraints
- Computation resource limitations
- Documentation gaps
A particularly novel challenge was the frequent request for additional model support and model replacement — driven by rapid evolution of upstream model hubs.
The ongoing mixed-method study extends this into the evolutionary dimension: how PTMs are added, migrated, and removed over time within downstream projects — making “model evolution” a first-class concern in modern software maintenance.
Overall Trends — Trustworthiness as the Main Competition
Across both days, a clear consensus emerged: AI in software engineering is no longer a performance race — it is a trustworthiness race.
Three dominant directions stood out:
- Neuro-Symbolic + CPS Safety — structural explanations and verifiable control layers for learning-enabled systems
- LLM + Static Analysis Integration — pulling LLMs into formal verification loops rather than trusting raw generation
- Verifiable Agentic Workflows — evaluating and auditing agents at the workflow level, not just the model level
The overall spirit of OzSE 2026 felt very consistent: large models are being integrated into formal systems — not allowed to swallow them.
My Observations & Research Positioning
My main takeaway is that the static analysis community and the LLM community are converging. At the same time, industry appears far more pragmatic than academia: the most convincing solutions were not “bigger models,” but closed-loop pipelines with verification and human confirmation.
I also believe that verifiable Agentic AI will be one of the most important directions over the next five years.
This is exactly where XMAS-CQP fits: multi-agent collaboration + structured explanations + auditable outputs.
If I continue pushing the framework toward:
- Explanation stability
- Rule-level attribution
- Verification-aware agents
it will naturally align with the broader trajectory highlighted by OzSE 2026: building trustworthy, auditable, and verifiable AI workflows, rather than only improving model capability.
🤝 Acknowledgement

I am deeply grateful to my supervisor, Professor Sherlock Licorish, for his continuous guidance and support throughout this journey.
Presenting at OzSE 2026 was not just an academic milestone, but also a reflection of the mentorship, encouragement, and critical thinking culture he has cultivated in our research group.