📓 OzSE 2026 — Trustworthiness as the New Frontier
Australian Summer School in Software Engineering
📍 Melbourne Connect
🗓 9–10 February 2026

Across two days of talks and discussions, a clear shift emerged: AI in software engineering is no longer a performance race — it is a trustworthiness race. The focus has moved from model capability to verification, auditability, and structural guarantees.
Day 1 — Neuro-Symbolic × CPS × Safety
The morning session centered on safety-critical autonomy and how learning-enabled systems can be made formally verifiable.
Xi Zheng — Verifiable Autonomous Systems

Xi represents a classical CPS systems engineering perspective rather than a purely model-driven AI approach.
His core question was: when deep learning enters UAVs and autonomous driving, how can we provide system-level verifiable guarantees?
NeuroStrata Architecture:
- Visual Input
- Foundation Model Segmentation
- Probabilistic Scene Graph
- Datalog / Scallop Rule Reasoning
- Safety Decision
The novelty lies not in segmentation itself, but in:
- Probabilistic semantic scene graphs
- Rule-level attribution
- Traceable safety decision logic
This is structural-level explanation — one abstraction layer above attention heatmaps.
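The rule layer above can be sketched in plain Python. The facts, confidence threshold, and braking rule are invented for illustration (this is not NeuroStrata code), but they show what rule-level attribution buys: the decision arrives together with the exact facts that fired.

```python
# Illustrative sketch only: Datalog-style safety rules over a probabilistic
# scene graph, in plain Python. Facts and the rule are hypothetical.

# Scene graph: (subject, relation, object) -> probability from perception.
scene = {
    ("pedestrian_1", "on", "crosswalk"): 0.92,
    ("ego", "approaching", "crosswalk"): 0.88,
    ("light", "state", "green"): 0.75,
}

THRESHOLD = 0.8  # minimum confidence for a fact to participate in a rule

def decide(scene):
    """Fire safety rules and return (decision, fired_facts) for attribution."""
    fired = [f for f, p in scene.items() if p >= THRESHOLD]
    # Rule: pedestrian on a crosswalk that we are approaching -> BRAKE.
    if ("pedestrian_1", "on", "crosswalk") in fired and \
       ("ego", "approaching", "crosswalk") in fired:
        return "BRAKE", fired
    return "PROCEED", fired

decision, evidence = decide(scene)
print(decision)   # BRAKE
print(evidence)   # the exact facts that justified the decision
```

The attribution list is the point: a safety auditor can trace the braking decision back to specific high-confidence perception facts, one abstraction layer above pixel heatmaps.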
Yulei Sui — Static Analysis for Neural Networks

Sui represents the program analysis tradition. His core logic was precise:
Testing ≠ Verification
One must prove ∀ x ∈ Φ, f(x) ⊨ Ψ: for every input x in the specified domain Φ, the output satisfies the property Ψ, not merely for the finitely many inputs a test suite exercises.
He transfers abstract interpretation to neural network verification, leveraging:
- DeepPoly / CROWN convex relaxations
- Branch-and-Bound exact search
- ACT verification framework
Notably, he is extending this logic to LLM and Agentic workflow verification, signaling that the static analysis community is actively entering the LLM era.
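The proof principle behind these relaxations can be illustrated with the coarsest abstract domain, intervals. The sketch below propagates an input range through one affine neuron and a ReLU; DeepPoly and CROWN use much tighter relational bounds, but the conclusion has the same shape: the property holds for every x in Φ, not just for tested points.

```python
# Minimal interval bound propagation through w*x + b followed by ReLU.
# Intervals are the coarsest abstract domain; DeepPoly/CROWN use tighter
# relational relaxations, but the proof principle is identical.

def affine_bounds(lo, hi, w, b):
    """Sound output bounds for w*x + b when x ranges over [lo, hi]."""
    if w >= 0:
        return w * lo + b, w * hi + b
    return w * hi + b, w * lo + b

def relu_bounds(lo, hi):
    return max(0.0, lo), max(0.0, hi)

# Property Ψ: the output stays below 5 for every x in Φ = [-1, 1].
lo, hi = affine_bounds(-1.0, 1.0, w=2.0, b=0.5)   # -> (-1.5, 2.5)
lo, hi = relu_bounds(lo, hi)                       # -> (0.0, 2.5)
print(hi < 5.0)  # True: proven for ALL of Φ, not just sampled test points
```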
Cristina Cifuentes — Intelligent Application Security

Cristina presented an industry-grounded neuro-symbolic approach.
- LLM-generated vulnerability detectors
- LLM-guided fuzzing
- LLM-generated patches
LLMs cannot replace static analysis.
They must be embedded into the static analysis loop.
The engineering workflow: Static analysis → LLM candidate generation → Re-verification → Human confirmation.
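A minimal sketch of that loop, with stub functions standing in for a real analyzer and a real model (the detector and patch logic below are invented for illustration):

```python
# Hypothetical sketch of the loop: the LLM proposes, static analysis
# re-verifies, and a human confirms. The stubs stand in for a real
# analyzer and a real model; only the control flow is the point.

def static_analysis(code):
    # Stand-in detector: flags string-built SQL as a potential injection.
    return ["sql-injection"] if "'%s'" in code else []

def llm_propose_patch(code, finding):
    # Stand-in for an LLM call; returns a parameterized-query rewrite.
    return code.replace("\"SELECT * FROM users WHERE name = '%s'\" % name",
                        "\"SELECT * FROM users WHERE name = ?\", (name,)")

def repair(code):
    for finding in static_analysis(code):
        candidate = llm_propose_patch(code, finding)
        if not static_analysis(candidate):      # re-verification gate
            return candidate, "awaiting human confirmation"
    return code, "no verified patch"

vulnerable = "cur.execute(\"SELECT * FROM users WHERE name = '%s'\" % name)"
patched, status = repair(vulnerable)
print(status)   # awaiting human confirmation
```

The re-verification gate is what makes the LLM safe to use here: a candidate patch that still trips the analyzer never reaches a human, let alone production.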
Yongqiang Tian — LPR: Language-Agnostic Program Reduction

The core philosophy was clear: LLM provides creativity, while formal specifications provide safety.
LPR leverages language-agnostic transformations combined with LLM-generated reduction strategies, and has discovered 300+ bugs across multiple compilers.
Cost: $0.42 per benchmark.
This represents a rare example of industrially viable LLM + formal method integration — a concrete form of AI-augmented systems engineering.
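The general shape of program reduction can be sketched as a greedy fixed-point loop. LPR's actual transformations are syntax-aware and LLM-guided rather than line-based, so the following is only an illustration of the iteration:

```python
# Generic greedy reduction loop (the shape behind tools like LPR):
# repeatedly drop chunks of the program while an oracle confirms the
# bug still reproduces. This line-based loop is only an illustration;
# LPR operates on language-agnostic, LLM-guided transformations.

def reduce_program(lines, still_fails):
    changed = True
    while changed:
        changed = False
        for i in range(len(lines)):
            candidate = lines[:i] + lines[i+1:]       # try dropping line i
            if candidate and still_fails(candidate):  # oracle: bug persists?
                lines, changed = candidate, True
                break
    return lines

# Toy oracle: the "bug" needs both the bad declaration and the bad call.
program = ["int x;", "int bad;", "use(bad);", "return 0;"]
oracle = lambda ls: "int bad;" in ls and "use(bad);" in ls
print(reduce_program(program, oracle))  # ['int bad;', 'use(bad);']
```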
Sumudu Bambarawanaliyanage — EnsLLM: Reliability in LLM Code Generation

The central insight of EnsLLM is straightforward but important: LLMs optimize probability, not semantic correctness.
Rather than replacing models with stronger ones, EnsLLM introduces a verification layer above generation:
- Multi-model generation
- Similarity-based voting
- Behavioral differential analysis
Correctness is inferred via collective consistency rather than single-model confidence.
This approach feels deeply rooted in software engineering thinking: not model worship, but layered verification.
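The consistency idea can be sketched as behavioral clustering; the candidate functions and probe inputs below are invented for illustration:

```python
# Hypothetical sketch of consistency-based selection: candidates that
# behave identically on probe inputs are clustered, and the largest
# behavioral cluster wins, instead of trusting any single model.
from collections import defaultdict

def select_by_consistency(candidates, probe_inputs):
    clusters = defaultdict(list)
    for fn in candidates:
        signature = tuple(fn(x) for x in probe_inputs)  # behavioral fingerprint
        clusters[signature].append(fn)
    # Majority behavior = best available proxy for semantic correctness.
    return max(clusters.values(), key=len)[0]

# Three "model outputs" for abs(): two agree, one is subtly wrong.
cand_a = lambda x: x if x >= 0 else -x
cand_b = lambda x: (x * x) ** 0.5
cand_c = lambda x: x            # buggy: wrong on negatives

chosen = select_by_consistency([cand_a, cand_b, cand_c], [-2, 0, 3])
print(chosen(-2))  # 2 -- the majority (correct) behavior was selected
```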
Afternoon — LLM Code Generation Reliability
Guowei Yang — EnsLLM

Guowei's talk presented the same EnsLLM work summarized above: a verification layer built from multi-model generation, similarity-based voting, and behavioral differential analysis, with correctness inferred from collective consistency rather than single-model confidence.
Chunhua Liu — Hallucinations in Code Change to Natural Language Generation

This talk examined hallucinations in tasks that translate code changes into natural language — specifically:
- Commit message generation
- Code review comment generation
While hallucinations have been studied separately in natural language generation and code generation, their behavior in structurally complex, context-dependent code-change tasks remains largely unexplored.
Key empirical findings:
- ~50% of generated code reviews contain hallucinations
- ~20% of generated commit messages contain hallucinations
The study evaluated metric-based hallucination detection approaches. While commonly used metrics perform weakly in isolation, combining multiple metrics substantially improves detection performance.
Notably, model confidence and feature attribution metrics show promise for inference-time hallucination detection.
This work reinforces a central theme of the conference: reliability and trustworthiness cannot be assumed — they must be measured, quantified, and systematically detected.
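How combining weak metrics can beat any single one is easy to illustrate; the metric names, weights, and thresholds below are invented for the sketch and are not the study's actual detectors:

```python
# Hypothetical illustration of metric combination for hallucination
# detection: each weak signal votes, and a generated review is flagged
# only when the combined evidence crosses a threshold.

def overlap_score(source_diff, text):
    """Fraction of words in the generated text grounded in the diff."""
    src = set(source_diff.lower().split())
    words = text.lower().split()
    return sum(w in src for w in words) / max(len(words), 1)

def combined_flag(source_diff, text, model_confidence):
    signals = [
        overlap_score(source_diff, text) < 0.3,   # weakly grounded wording
        model_confidence < 0.5,                   # low self-confidence
        len(text.split()) > 50,                   # long outputs drift more
    ]
    return sum(signals) >= 2                      # majority of weak signals

diff = "fix null check in parse_config before dereference"
review = "consider renaming the database table and adding an index"
print(combined_flag(diff, review, model_confidence=0.4))  # True
```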
Naim Rastgoo — AI Tutorial: Building AI Assistants with LangChain and LangGraph

This 40-minute tutorial focused on practical methods for building AI assistants using LangChain and LangGraph.
The session emphasized how agent-based architectures can be engineered through explicit workflow design rather than relying on a single monolithic model.
Key concepts demonstrated:
- Composable LLM pipelines
- Tool integration and external API orchestration
- Stateful multi-step reasoning
- Graph-based agent execution control
LangGraph in particular highlights a shift toward explicit control flow in agent systems, making agent behavior more structured and inspectable.
From a broader perspective, this tutorial reinforced a recurring theme of the conference: Agentic AI is moving from prompt hacking to workflow engineering.
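The explicit-control-flow idea can be illustrated without any library: nodes transform shared state, and edges (one of them conditional) choose the next node. This is a conceptual sketch, not the LangGraph API itself:

```python
# Library-free sketch of the idea LangGraph makes explicit: agent
# behavior as a graph of nodes over shared state, with conditional
# edges deciding control flow. Node names are hypothetical.

def plan(state):
    state["steps"] = ["lookup", "answer"]
    return state

def lookup(state):
    state["evidence"] = f"docs about {state['question']}"
    return state

def answer(state):
    state["answer"] = f"Based on {state['evidence']}: done"
    return state

NODES = {"plan": plan, "lookup": lookup, "answer": answer}
EDGES = {
    "plan": lambda s: "lookup",                       # unconditional edge
    "lookup": lambda s: "answer" if s.get("evidence") else "lookup",
    "answer": lambda s: None,                         # terminal node
}

def run(entry, state):
    node = entry
    while node is not None:          # explicit, inspectable control flow
        state = NODES[node](state)
        node = EDGES[node](state)
    return state

result = run("plan", {"question": "retry policy"})
print(result["answer"])
```

Because the graph is data, every transition can be logged and audited, which is precisely what makes this style "structured and inspectable" compared with a single opaque prompt loop.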
My Talk — XMAS-CQP

I presented XMAS-CQP as a structured, multi-agent system for explainable software quality prediction.
- Role-based multi-agent collaboration
- Structured JSON schema outputs
- Auditable risk reasoning
- Explanations as first-class artefacts
The system emphasizes:
- Structured outputs
- Traceability
- Reproducibility
- Verification-awareness
Rather than scaling a single LLM, the focus is on engineering controllable, auditable agent workflows.
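One way to make structured outputs auditable is a schema gate that every agent response must pass before it enters the pipeline. The fields below are hypothetical, not XMAS-CQP's actual schema:

```python
# Hypothetical sketch of treating agent explanations as structured,
# auditable artefacts: every agent output must match a schema before
# it enters the pipeline. The fields are invented for illustration.
import json

REQUIRED = {"module": str, "risk_level": str, "evidence": list, "rationale": str}

def validate(raw):
    record = json.loads(raw)
    for field, kind in REQUIRED.items():
        if not isinstance(record.get(field), kind):
            raise ValueError(f"schema violation: {field}")
    if record["risk_level"] not in {"low", "medium", "high"}:
        raise ValueError("risk_level out of range")
    return record  # safe to log, diff, and audit downstream

agent_output = json.dumps({
    "module": "auth/session.py",
    "risk_level": "high",
    "evidence": ["churn spike", "3 linked incident reports"],
    "rationale": "recent refactor touched token expiry logic",
})
print(validate(agent_output)["risk_level"])  # high
```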
Software Requirements and Evolution in the Age of Generative AI
This session, chaired by Dr Tingting Bi, explored how generative AI reshapes requirements engineering, software quality assurance, human-oriented SE, and documentation maintenance.
Chetan Arora — Prompt Engineering, LLMs and RAG for Requirements-driven QA

The talk demonstrated how prompt engineering, LLMs, and Retrieval-Augmented Generation (RAG) can operationalize requirements-driven software quality assurance.
- Transforming natural-language requirements into test scenarios
- Automatic generation and refinement of QA artefacts
- RAG grounding using up-to-date specifications
- Prompt engineering techniques for requirements engineering
This presentation highlighted a key trend: grounding LLM outputs in structured requirement knowledge is essential for trustworthy QA pipelines.
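A minimal, library-free sketch of the grounding step: retrieve the requirement snippets most relevant to a QA question and constrain the prompt to them. The scoring scheme and requirement texts are invented for illustration:

```python
# Minimal retrieval-grounding sketch (hypothetical, library-free):
# pick the requirement snippets most relevant to a QA question and
# prepend them to the prompt so the model answers from current specs.

def retrieve(query, documents, k=2):
    q = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

requirements = [
    "REQ-12: sessions expire after 30 minutes of inactivity",
    "REQ-07: passwords must be hashed with a per-user salt",
    "REQ-21: the export job runs nightly at 02:00",
]
question = "what test covers session expiry after inactivity"
context = retrieve(question, requirements)
prompt = "Answer using only:\n" + "\n".join(context) + "\n\nQ: " + question
print(context[0])  # REQ-12: sessions expire after 30 minutes of inactivity
```

Real systems replace the word-overlap score with embedding similarity, but the trust argument is the same: the model is constrained to answer from the retrieved, up-to-date specification text.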
Maria Spichkova — Human-Oriented SE in the Age of GenAI

Maria emphasized that software systems are developed by humans and for humans — yet socio-cultural diversity and human factors are often underrepresented in system design.
With the rise of GenAI, SE methodologies are evolving rapidly, and it becomes critical to ensure that these changes preserve usability, sustainability, and inclusivity.
This talk rebalanced the technical intensity of the day with a reminder: trustworthiness also includes cultural and social dimensions.
Yuqing Xiao — Human-centric Requirements in Aged Care Digital Health

Yuqing presented a systematic review of 69 primary studies on requirements engineering for older adult digital health systems.
Complemented by an empirical survey of developers, caregivers, and older adults, the work identified both functional and non-functional requirements for aged care software.
This integrated evidence base strengthens the foundation for human-centric digital health design.
Shashiwadana Nirmani — Motivation-aware OSS Recommendation

Based on a study of 208 OSS practitioners, this work explored how demographics and motivations influence contributor preferences.
A prototype recommendation system integrated human factors alongside technical attributes to personalize project suggestions.
This research highlights that software ecosystems are sustained not only by code quality, but by human motivation alignment.
Haoyu Gao — Automated Documentation Maintenance

Haoyu addressed the long-standing problem of outdated documentation.
Building upon prior empirical work on README evolution, he introduced an LLM-based agentic system that automates documentation updates within pull requests, incorporating a human-in-the-loop.
Quantitative evaluation, qualitative failure analysis, and case studies with OSS developers collectively demonstrate the feasibility of automated documentation maintenance.
Day 2 — Enhancing Trustworthiness of Agentic AI Systems
Chaired by Dr Yongqiang Tian, this session focused on one central theme: trust cannot be assumed for LLMs or agentic systems — it must be engineered, measured, and verified.
Valerio Terragni — Metamorphic Testing for LLMs: Improving Trust and Quality in AI

The talk tackled a fundamental barrier in testing LLMs: the oracle problem. When labeled ground truth is unavailable, detecting faulty behavior becomes difficult.
Metamorphic Testing (MT) addresses this by using Metamorphic Relations (MRs) — expected relationships between outputs of related inputs — enabling fault detection without explicit oracles.
What stood out was the scale and completeness of the study:
- Reviewed the literature and identified 191 MRs for NLP tasks
- Implemented 36 representative MRs
- Ran over 560,000 metamorphic tests
- Evaluated three popular LLMs
The results highlighted both strengths and limitations of MT — and importantly, positioned MT as a practical bridge between software testing traditions (SE4AI) and AI-augmented software engineering (AI4SE).
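A concrete MR makes the idea tangible: we never need the "correct" answer for either input, only a relation between the two outputs. The toy sentiment scorer below stands in for an LLM and is invented for illustration:

```python
# A metamorphic relation needs no ground-truth labels: we only check a
# relationship between outputs of related inputs. Here a toy sentiment
# scorer stands in for an LLM, and the MR is "adding an intensifier
# must not lower the score".

def toy_sentiment(text):
    pos = {"good", "great", "excellent"}
    neg = {"bad", "awful"}
    words = text.lower().split()
    score = sum(w in pos for w in words) - sum(w in neg for w in words)
    return score + (0.5 if "very" in words else 0.0)

def mr_intensifier_holds(base_text):
    followup = "very " + base_text                # metamorphic transformation
    return toy_sentiment(followup) >= toy_sentiment(base_text)

# Fault detection without knowing the "correct" sentiment of either input:
print(mr_intensifier_holds("great service"))  # True -- MR satisfied
```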
Aldeida Aleti — Trustworthy AI Agents for Software Engineering

This talk framed the core obstacle in agentic SE systems as trustworthiness. Even when benchmark results look impressive, LLM-based agents often remain unreliable: hallucinations, vulnerability misinterpretations, and misleading explanations.
A particularly sharp point was that many failures stem from spurious correlations and shortcut learning — models mimic surface patterns rather than understanding program semantics.
The proposed direction: Trustworthiness Oracles.
These are automated evaluators that measure, explain, and audit trustworthiness in AI-generated artefacts, integrating formal reasoning and interpretability to restore confidence in AI-assisted development.
This concept maps cleanly onto the broader OzSE narrative: evaluation must move from “does it work on benchmarks?” to “can we systematically justify, verify, and audit its behavior?”
Qinghua Lu — Verifiability-First AI Engineering: Building Trust in AIware-Centric Systems

Qinghua’s framing was extremely clear: as systems shift from software-centric architectures to AIware-centric ecosystems, the engineering challenge shifts from producing behavior to verifying it.
Traditional SE assumes business logic lives in human-written code, so testing and analysis can be deterministic. In contrast, foundation models embed logic in weights and learned policies, introducing unprecedented assurance complexity.
Verifiability-first design strategies discussed:
- Decompose tasks into machine-verifiable / human-verifiable components
- Embed constraints into pipelines
- Leverage formal methods and automated testing
- Extend verification beyond correctness to alignment, interpretability, and safety
A key takeaway was the shift in human roles: humans increasingly become designers of constraints and orchestrators of verification, rather than the direct producers of code and tests.
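A verifiability-first gate can be sketched as follows: AI-produced logic is admitted only when machine-checkable properties pass, and everything else is routed to a human. The properties and input domain below are invented for illustration:

```python
# Hypothetical sketch of a verifiability-first gate: AI-produced logic
# is admitted to the pipeline only after machine-verifiable checks
# pass; anything unverifiable is routed to a human instead.

def machine_verify(fn):
    """Property checks a machine can run exhaustively on a small domain."""
    checks = [
        all(fn(n) >= 0 for n in range(-10, 11)),          # non-negativity
        all(fn(n) == fn(-n) for n in range(-10, 11)),     # symmetry
    ]
    return all(checks)

def admit(candidate_fn):
    if machine_verify(candidate_fn):
        return "deployed"
    return "escalated to human review"          # human-verifiable component

generated = lambda x: x * x        # stand-in for model-generated code
print(admit(generated))            # deployed
print(admit(lambda x: x))          # escalated to human review
```

The human's role here is exactly the one Qinghua described: designing the constraints in `machine_verify` and deciding what happens at escalation, rather than writing the candidate code itself.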
The Future of Agentic Software Development
Chaired by A/Prof Patanamon (Pick) Thongtanunam, this session explored how agentic systems can be scaled, evaluated, and integrated into real software development environments — especially under enterprise constraints.
Minwoo Jeong & Jirat Pasuksmit — Rovo Dev: Toward Scaling AI Agents for Software Engineering (Atlassian)


This talk centered on real-world scaling challenges for enterprise agent systems: agents must operate reliably across large, diverse codebases while balancing latency, cost, and accuracy.
A strong engineering message emerged: scaling is not just adding compute — it requires architectural constraints that keep agent behavior predictable and maintain developer trust.
The talk concluded with open challenges in evaluating agent workflows and managing data risks in proprietary settings — a very practical industry perspective.
Gopi Krishnan Rajbahadur — SPICE: An Automated SWE-Bench Labeling Pipeline

SPICE addresses a bottleneck that the whole community feels: high-quality labeled datasets are essential, but manual labeling is prohibitively expensive and slow.
SPICE is a scalable automated pipeline for labeling SWE-bench-style datasets with annotations such as issue clarity, test coverage, and effort estimation — combining context-aware navigation, rationale-driven prompting, and multi-pass consensus.
The cost reduction claim was striking:
- Labeling 1,000 instances: from ~$100,000 manual cost
- Down to only $5.10 using SPICE
This is a “hidden infrastructure” contribution: reliable evaluation requires reliable datasets, and SPICE pushes that frontier in a concrete, scalable way.
Sherlock Licorish — LLMs’ Efficacy for Code Generation and Software Improvement

This talk critically examined the real efficacy of LLMs for code generation, not only in functional correctness but also across broader quality dimensions: security, reliability, readability, and maintainability.
A key practical emphasis was that performance is highly sensitive to prompt designs and hyperparameter configurations, reinforcing that LLM “capability” is not a fixed property — it is an engineered outcome.
The framing was deliberately balanced: LLMs are useful, but the debate remains open, and careful methodology is essential when comparing LLMs with human developers.
Hong Yi Lin — CodeReviewQA: Code Review Comprehension Assessment for LLMs

This work targets a real-world weakness: LLMs may generate code well, but struggle with code review-driven refinement, where comments are often implicit, ambiguous, and colloquial.
Existing evaluations rely on text matching and can be vulnerable to training data contamination. CodeReviewQA proposes a benchmark that supports fine-grained assessment and mitigates contamination risk.
The key design is decomposing refinement into three reasoning steps:
- Change Type Recognition (CTR)
- Change Localisation (CL)
- Solution Identification (SI)
Each step is reformulated as multiple-choice questions with varied difficulty. The evaluation spans 72 recently released LLMs on 900 manually curated examples across nine languages, exposing specific weaknesses disentangled from pure generation performance.
This is another strong signal of the conference trend: trustworthy agentic SE requires benchmarks that measure reasoning, not only outputs.
Building and Managing Generative AI Software
Chaired by Dr Chetan Arora, this session shifted from model- and agent-level trustworthiness toward the operational reality: how GenAI software is built, managed, and sustained in practice — across partnerships, project management, and model reuse in downstream systems.
Scott Barnett — Power Laws and Partnership: Lessons from Serving the Long Tail

Barnett’s talk was an unusually practical reflection on how academic SE research can generate real-world impact in an AI-accelerated industry landscape dominated by a handful of major tech companies.
His core argument: academics can create unique value by partnering with the underserved long tail of organisations — via a non-traditional model built on small engagements, deployable artefacts, contextual understanding, and adaptability.
Lessons from 40 projects and 20 partnerships:
- What enables success: trust, problem-driven work, responsive communication, supportive environments
- What undermines it: treating researchers as developers, building beyond partner capacity, publication-centric timelines
- Common failure modes: shifting focus too often, slow internal processes
The meta-message was clear: industry-focused research is a viable academic path, but only if navigated intentionally — with impact-oriented portfolios, cross-disciplinary fluency, and methodological grounding (e.g., design science, context-driven SE).
Lakshana Assalaarachchi — Redefining Software Project Management in the Era of Evolving AI

This spotlight talk addressed a surprisingly underexplored dimension of agentic AI: software project management (SPM). The framing was that SPM must evolve in parallel with SE — especially as agentic AI becomes embedded in development workflows.
Based on a grey literature review (LinkedIn articles, blogs, and industry reports), she summarized current AI applications in SPM, including:
- Automation of routine tasks (e.g., document generation)
- Predictive analytics
- Communication and meeting support
The stance was pragmatic: project managers are expected to be supported by AI, not replaced. She proposed an upskilling framework mapped to PMI’s talent triangle, and introduced a vision of an agentic PM — an ethical, human-controlled multi-agent system with four working modes that vary autonomy by task complexity and risk.
This talk connected strongly with the conference’s broader “trustworthiness” theme: governance is not only about models — it is also about workflow roles and human control.
Peerachai Banyongrakkul — Challenges and Evolution in Reusing Pre-trained Models

This talk focused on the downstream engineering reality of AI-driven software: developers increasingly reuse pre-trained models (PTMs) rather than training from scratch — but reuse introduces unique, underexplored SE challenges.
Drawing from an ICSME 2025 qualitative empirical study, the work provided a first analysis of challenges faced in OSS projects reusing PTMs, by systematically mining real-world issue reports.
Recurring barriers identified:
- Model usage and integration friction
- Model performance instability
- Software environment and dependency constraints
- Computation resource limitations
- Documentation gaps
A particularly novel challenge was the frequent request for additional model support and model replacement — driven by rapid evolution of upstream model hubs.
The ongoing mixed-method study extends this into the evolutionary dimension: how PTMs are added, migrated, and removed over time within downstream projects — making “model evolution” a first-class concern in modern software maintenance.
Overall Trends — Trustworthiness as the Main Competition
Across both days, a clear consensus emerged: AI in software engineering is no longer a performance race — it is a trustworthiness race.
Three dominant directions stood out:
- Neuro-Symbolic + CPS Safety — structural explanations and verifiable control layers for learning-enabled systems
- LLM + Static Analysis Integration — pulling LLMs into formal verification loops rather than trusting raw generation
- Verifiable Agentic Workflows — evaluating and auditing agents at the workflow level, not just the model level
The overall spirit of OzSE 2026 felt very consistent: large models are being integrated into formal systems — not allowed to swallow them.
My Observations & Research Positioning
My main takeaway is that the static analysis community and the LLM community are converging. At the same time, industry appears far more pragmatic than academia: the most convincing solutions were not “bigger models,” but closed-loop pipelines with verification and human confirmation.
I also believe that verifiable Agentic AI will be one of the most important directions over the next five years.
This is exactly where XMAS-CQP fits: multi-agent collaboration + structured explanations + auditable outputs.
If I continue pushing the framework toward:
- Explanation stability
- Rule-level attribution
- Verification-aware agents
it will naturally align with the broader trajectory highlighted by OzSE 2026: building trustworthy, auditable, and verifiable AI workflows, rather than only improving model capability.
🤝 Acknowledgement

I am deeply grateful to my supervisor, Professor Sherlock Licorish, for his continuous guidance and support throughout this journey.
Presenting at OzSE 2026 was not just an academic milestone, but also a reflection of the mentorship, encouragement, and critical thinking culture he has cultivated in our research group.