Hello! I am a Ph.D. student at HIT, advised by Junbao Li and Huanyu Liu.
My name is Jiazheng Wen, and my English name is Joshua Wen. In 2017, I received a bachelor's degree in information engineering from Xi'an Jiaotong University. In 2022, I received a master's degree in electronic science and technology from Harbin Engineering University. I am currently pursuing a Ph.D. in the Faculty of Computing at Harbin Institute of Technology. From 2019 to 2022, I worked as a research intern at Focused Loong Technology Co., Ltd. My research focuses on deep reinforcement learning and vision enhancement.
Currently, my main interest is in enhancing the perception of computer vision algorithms through deep reinforcement learning. If you would like to discuss anything research related, please feel free to reach out :)
Email  /  ORCID  /  Google Scholar  /  GitHub
The reinforcement learning (RL) paradigm that employs sequence modeling (SM) optimizes the target policy via supervised learning on trajectory segments guided by returns-to-go (RTG). Nevertheless, these methods face two significant challenges. First, the returns sampled within a single trajectory do not align with the optimal returns obtainable from multiple trajectories. Second, fully supervised trajectory prediction diminishes the agent's capacity to explore beyond the behavior policy. Consequently, these methods often struggle to stitch an optimal trajectory from suboptimal ones, contradicting the foundational principles of RL. Given these considerations, we propose an action-exploration agent, AE-RTG, founded on RTG regularization. The agent leverages the trajectory-modeling capability of the Transformer for both action and RTG prediction, sampling actions according to the evaluation distribution of the action value function to facilitate exploration during training. AE-RTG simultaneously learns the RTG prediction function and the action value function, guiding action generation with the predicted RTG and minimizing the disparity between the behavior policy and the target policy by maximizing the action value function. Extensive evaluations on the D4RL benchmark demonstrate that AE-RTG surpasses conventional RL and SM methods, offering enhanced guidance for agent development within the SM paradigm.
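Sampling actions according to the evaluation distribution of the action value function can be sketched as a Boltzmann (softmax) draw over Q-values. The sketch below is a minimal NumPy illustration under that assumption; the function name, temperature, and toy Q-values are hypothetical, not the paper's implementation:

```python
import numpy as np

def boltzmann_sample(q_values, temperature=1.0, rng=None):
    """Sample an action index from a softmax distribution over action
    values, so higher-valued actions are more likely but exploration
    of lower-valued ones remains possible during training."""
    rng = rng or np.random.default_rng(0)
    z = np.asarray(q_values, dtype=float) / temperature
    z -= z.max()                          # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(probs), p=probs), probs

# Toy example: three candidate actions with illustrative Q-values.
action, probs = boltzmann_sample([1.0, 2.0, 0.5], temperature=0.5)
```

Lowering the temperature sharpens the distribution toward greedy selection; raising it flattens the distribution toward uniform exploration.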
Long-term single-object trackers are designed to maintain robust tracking performance, even when objects temporarily disappear from view, face prolonged occlusions, or undergo abrupt changes in appearance. Most existing methods tackle these challenges by combining baseline trackers and reformulating the long-term tracking task as a decision problem: identifying the most suitable short-term tracker for each frame. However, many of these baseline trackers, not originally intended for long-term tracking, often encounter significant limitations, such as difficulties in adapting to sudden object motion shifts and varying environmental conditions. To bridge this gap, we propose an innovative approach that leverages reinforcement learning within a sequence model. This method not only controls the search area but also incorporates an agent-based decision-making process to assess object presence and automatically select the most applicable baseline tracker. Our proposed solution, TrHelpTr, demonstrates impressive generalization capabilities across various trackers and scenarios in a plug-and-play manner. We observe significant and consistent improvements when applying our method to three representative trackers. Comprehensive evaluations on the LTB50, OTB100, LaSOT, TLP, UAV123 and GOT-10K benchmarks and comparisons with other long-term tracking algorithms on the VOT leaderboard reveal that TrHelpTr achieves superior tracking precision and recall, effectively addressing the critical issue of object loss during re-detection.
With recent significant advancements in large vision-language models (LVLMs), image-text understanding capabilities have substantially improved. However, a notable gap remains in fine-grained region understanding. Moreover, the resource consumption for training and testing large-scale LVLMs is immense, making them less accessible to researchers with limited resources. In this paper, we propose a small-scale LVLM, Seg-LLaVA, which employs a lightweight visual prompting method that leverages a semantic segmenter and a small-scale large language model (LLM). By integrating fine-grained knowledge generated by a specialized instance segmentation model with the original image into a multi-layer linear model, we enable the model to perceive object boundaries and types in the image without significantly increasing the number of training parameters, thereby greatly enhancing its visual understanding capabilities. Additionally, we adopt an efficient training approach, allowing Seg-LLaVA to achieve outstanding performance while further reducing resource requirements. Experimental results show that our model excels across multiple benchmarks and demonstrates strong fine-grained perception capabilities.
We propose a robust and lightweight tracking model, self-adaptive dynamic template Siamese network (SiamSDT).
This paper adopts a progressive-network approach to design an aggregated convolution progressive network (ACP) inpainting model. The model improves the restoration of corrupted regions across diverse image types containing different levels of information.
Recent advances have shown the possibility of jointly optimizing cross-spectral relative poses and neural radiance fields using normalized cross-device coordinates. However, such methods suffer from cross-spectral misalignment when data are collected asynchronously across devices, and they cannot render in real time or handle large scenes. We address these issues by proposing cross-spectral Gaussian Splatting with spatial occupancy consistency, which strictly aligns the cross-spectral scene representation by sharing explicit Gaussian surfaces across spectra and separately optimizing each view's extrinsics with a matching-optimizing pose estimation method.
In this paper, we propose an advanced pan-tilt-zoom (PTZ) camera control method that does not require intrinsic camera parameters. The goal is to accomplish the visual enhancement task of low-confidence objects.
In this work, we propose PTDS-CenterTrack (Pedestrian Tracking in Dense Scenes), an extension of CenterTrack for object center-point detection and tracking.
This study introduces a Task-Risk Consistent Intelligent Detection Framework (TRC-ODF) for object detection in optical remote sensing images.
The purpose of this project is to continuously count moving objects in aisles in a fixed-view video scene, with a specified positive direction of movement: objects moving in the opposite direction are counted down.
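The directional counting rule above can be sketched with a virtual counting line: crossing it in the positive direction increments the count, crossing it the other way decrements it. This is a minimal sketch of one common approach, assuming a horizontal counting line and per-frame object positions; the function and parameter names are illustrative:

```python
def update_count(count, prev_y, curr_y, line_y, positive_dir=+1):
    """Update the running count when an object crosses the counting line.

    An object crossing line_y while moving in the positive direction
    increments the count; crossing in the opposite direction decrements it.
    """
    crossed_forward = prev_y < line_y <= curr_y    # moved past the line downward
    crossed_backward = curr_y <= line_y < prev_y   # moved past the line upward
    if crossed_forward:
        count += positive_dir
    elif crossed_backward:
        count -= positive_dir
    return count

# An object moving down across the line (positive), then back up (negative).
c = update_count(0, prev_y=10, curr_y=30, line_y=20)   # forward crossing
c = update_count(c, prev_y=30, curr_y=10, line_y=20)   # backward crossing
```

In practice the per-object positions would come from a tracker, so each track ID is checked for a crossing once per frame.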
To address the urgent need for deploying small-scale large language models (LLMs) under resource constraints, the proposed method works as follows: a segmentation component generates masks and an object segmentation list from the original image; a visual encoder processes the original image and masks to extract multi-level visual features highlighting object positions/boundaries, which are then refined via layer normalization and MLP layers into final visual features; finally, the masks, segmentation list (as text instructions), and visual features are fed into a vision-language large model (VLLM) for autoregressive semantic generation. This method also enhances VLLMs’ object perception and question-answering abilities without adding extra training parameters.
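The refinement step above (layer normalization followed by MLP layers over multi-level visual features) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the concatenation of levels, the ReLU activation, and all shapes and weight names are illustrative, not the method's actual architecture:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each feature vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def refine_features(multi_level_feats, w1, b1, w2, b2):
    """Combine multi-level visual features, normalize them, and project
    them through a two-layer MLP into the final visual features."""
    x = np.concatenate(multi_level_feats, axis=-1)  # stack feature levels
    x = layer_norm(x)
    h = np.maximum(x @ w1 + b1, 0.0)                # ReLU here for brevity
    return h @ w2 + b2
```

The refined features would then be fed, together with the masks and the segmentation list rendered as text instructions, into the VLLM for autoregressive generation.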
This method addresses the poor reliability of trajectory stitching and policy generalization that existing approaches exhibit in complex tasks. It involves: 1) building sequence-modeled state/action losses for iterative training; 2) designing a weighted squared error-based RTG loss; 3) using a double Q-learning framework (with two Q-functions, conservative Q-learning constraints, and a Boltzmann distribution) to optimize action exploration; 4) integrating state, action, RTG regularization, and Q-value losses into a joint optimization objective; 5) generating diverse action predictions via noise-perturbed RTG candidate sampling; 6) selecting the highest-Q action (evaluated by double conservative Q-functions) for execution. It is primarily used in agent exploration.
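Step 6 above, selecting the highest-Q action under double conservative Q-functions, can be sketched as taking the pessimistic (minimum) of the two Q-estimates per candidate and executing the argmax. This is a minimal sketch, assuming the candidate actions have already been generated from noise-perturbed RTG samples; the function names are illustrative:

```python
import numpy as np

def select_action(candidates, q1, q2):
    """Score each candidate action with two Q-functions, keep the
    pessimistic (min) estimate, and return the best-scoring action.

    Taking the min of the two estimates is the double-Q trick that
    guards against overestimation by either Q-function alone."""
    scores = [min(q1(a), q2(a)) for a in candidates]
    return candidates[int(np.argmax(scores))]
```

With, say, `q1 = lambda a: a` and `q2 = lambda a: 2 * a` as toy Q-functions, the candidate with the highest pessimistic value is executed.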
This invention solves the trade-off between tracking performance and computational complexity in MAV-borne target tracking. Its core steps: 1) Set initial, adjacent, and memory templates; 2) Input current frame search features and all templates into the adaptive template fusion (STF) module to generate the final template; 3) Correlate the final template with the search template to get a response map, judge tracking state, and update adjacent/memory templates; 4) The memory template module uses temporal cascading to integrate historical tracking key info, fitting all history into limited memory; 5) The adaptive fusion module adjusts template weights dynamically across tracking stages via template-search feature similarity matrices. It applies to MAV-borne target tracking.
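The adaptive fusion in step 5 above, weighting templates by template-search feature similarity, can be sketched as a similarity-softmax over template features. This is a minimal NumPy sketch using cosine similarity as an illustrative choice of similarity measure; the actual module operates on similarity matrices of deep features:

```python
import numpy as np

def fuse_templates(templates, search_feat):
    """Weight each template by its similarity to the current search
    features and combine them into a single final template, so the
    template most consistent with the present appearance dominates."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    sims = np.array([cosine(t, search_feat) for t in templates])
    weights = np.exp(sims) / np.exp(sims).sum()   # softmax over similarities
    return sum(w * t for w, t in zip(weights, templates))
```

Because the weights depend on the current frame, the fusion adapts across tracking stages: early on the initial template dominates, while after appearance changes the adjacent and memory templates gain weight.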
This invention addresses poor tracker performance in long-term tracking. Key steps: 1) Build a long-term tracker based on sequence-modeling reinforcement learning, with Transformer-based perception and decision layers (the perception layer's vision Transformer outputs feed the decision Transformer, which feeds action sequences back); 2) The tracker integrates sequence-modeling reinforcement learning to adaptively select baseline short-term trackers; 3) Decisions are made via memory-sequence analysis; 4) Each short-term tracker contributes to the overall result (jointly determined by its visual encoder and tracking method); 5) The decision layer dynamically optimizes the search-region position for tracking. It applies to long-term single-target tracking.
A dense-pedestrian multi-target tracking method based on feature fusion, together with computer equipment and a storage medium, belonging to the field of computer vision tracking technology and addressing the limitations of existing tracking methods for pedestrians in dense scenes.
The present invention belongs to the field of image processing and specifically relates to an intelligent spark plug appearance defect detection system.
Recently, I've been working on an interesting personal project: building an RL environment suitable for turn-based games, based on Sid Meier's Civilization V and VI. It is still in its infancy, and everyone is welcome to discuss and participate!
Template based on Jon Barron's website.