Hello! I am a Ph.D. student at HIT, advised by Junbao Li and Huanyu Liu.
My name is Jiazheng Wen, and my English name is Joshua Wen. In 2017, I received a bachelor's degree in information engineering from Xi'an Jiaotong University. In 2022, I received a master's degree in electronic science and technology from Harbin Engineering University. I am currently pursuing a Ph.D. in the Faculty of Computing at Harbin Institute of Technology. From 2019 to 2022, I worked as a research intern at Focused Loong Technology Co., Ltd. My research focuses on deep reinforcement learning and vision enhancement.
Currently, my main interest is in enhancing the perception of computer vision algorithms through deep reinforcement learning. If you would like to discuss anything research related, please feel free to reach out :)
Email  /  ORCID  /  Google Scholar  /  GitHub
The reinforcement learning (RL) paradigm that employs sequence modeling (SM) optimizes the target policy via supervised learning on trajectory segments guided by returns-to-go (RTG). Nevertheless, these methods face two significant challenges. First, the returns sampled within a single trajectory do not align with the optimal returns obtainable from multiple trajectories. Second, fully supervised trajectory prediction diminishes the agent's capacity to explore beyond the behavior policy. Consequently, these methods often struggle to stitch an optimal trajectory from suboptimal ones, contradicting the foundational principles of RL. Given these considerations, we propose an action-exploration agent, AE-RTG, founded on RTG regularization. The agent leverages the trajectory-modeling capability of the Transformer for both action and RTG prediction, sampling actions according to the evaluation distribution of the action value function to facilitate exploration during training. AE-RTG simultaneously learns the RTG prediction function and the action value function, guiding action generation with the predicted RTG and minimizing the disparity between the behavior policy and the target policy by maximizing the action value function. Extensive evaluations on the D4RL benchmark demonstrate that AE-RTG surpasses conventional RL and SM methods, offering enhanced guidance for agent development within the SM paradigm.
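Sampling actions according to the evaluation distribution of the action value function can be sketched as a Boltzmann (softmax) draw over Q-values. The sketch below is a minimal NumPy illustration under that assumption; the function name, temperature, and toy Q-values are hypothetical, not the paper's implementation:

```python
import numpy as np

def boltzmann_sample(q_values, temperature=1.0, rng=None):
    """Sample an action index from a softmax distribution over action
    values, so higher-valued actions are more likely but exploration
    of lower-valued ones remains possible during training."""
    rng = rng or np.random.default_rng(0)
    z = np.asarray(q_values, dtype=float) / temperature
    z -= z.max()                          # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(probs), p=probs), probs

# Toy example: three candidate actions with illustrative Q-values.
action, probs = boltzmann_sample([1.0, 2.0, 0.5], temperature=0.5)
```

Lowering the temperature sharpens the distribution toward greedy selection; raising it flattens the distribution toward uniform exploration.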
Long-term single-object trackers are designed to maintain robust tracking performance, even when objects temporarily disappear from view, face prolonged occlusions, or undergo abrupt changes in appearance. Most existing methods tackle these challenges by combining baseline trackers and reformulating the long-term tracking task as a decision problem: identifying the most suitable short-term tracker for each frame. However, many of these baseline trackers, not originally intended for long-term tracking, often encounter significant limitations, such as difficulties in adapting to sudden object motion shifts and varying environmental conditions. To bridge this gap, we propose an innovative approach that leverages reinforcement learning within a sequence model. This method not only controls the search area but also incorporates an agent-based decision-making process to assess object presence and automatically select the most applicable baseline tracker. Our proposed solution, TrHelpTr, demonstrates impressive generalization capabilities across various trackers and scenarios in a plug-and-play manner. We observe significant and consistent improvements when applying our method to three representative trackers. Comprehensive evaluations on the LTB50, OTB100, LaSOT, TLP, UAV123 and GOT-10K benchmarks and comparisons with other long-term tracking algorithms on the VOT leaderboard reveal that TrHelpTr achieves superior tracking precision and recall, effectively addressing the critical issue of object loss during re-detection.
With recent significant advancements in large vision-language models (LVLMs), image-text understanding capabilities have substantially improved. However, a notable gap remains in fine-grained region understanding. Moreover, the resource consumption for training and testing large-scale LVLMs is immense, making them less accessible to researchers with limited resources. In this paper, we propose a small-scale LVLM, Seg-LLaVA, which employs a lightweight visual prompting method that leverages a semantic segmenter and a small-scale large language model (LLM). By integrating fine-grained knowledge generated by a specialized instance segmentation model with the original image into a multi-layer linear model, we enable the model to perceive object boundaries and types in the image without significantly increasing the number of training parameters, thereby greatly enhancing its visual understanding capabilities. Additionally, we adopt an efficient training approach, allowing Seg-LLaVA to achieve outstanding performance while further reducing resource requirements. Experimental results show that our model excels across multiple benchmarks and demonstrates strong fine-grained perception capabilities.
We propose a robust and lightweight tracking model, self-adaptive dynamic template Siamese network (SiamSDT).
This paper adopts a progressive-network approach to design an aggregated convolution progressive network (ACP) inpainting model. The model improves the restoration of corrupted regions across diverse image types containing different levels of information.
Recent advances have shown the possibility of jointly optimizing cross-spectral relative poses and neural radiance fields using normalized cross-device coordinates. However, such methods suffer from cross-spectral misalignment when data are collected asynchronously across devices, and they cannot render in real time or handle large scenes. We address these issues by proposing cross-spectral Gaussian Splatting with spatial occupancy consistency, which strictly aligns the cross-spectral scene representation by sharing explicit Gaussian surfaces across spectra and separately optimizing each view's extrinsics with a matching-optimizing pose estimation method.
In this paper, we propose an advanced pan-tilt-zoom (PTZ) camera control method that does not require intrinsic camera parameters. The goal is to accomplish the visual enhancement task of low-confidence objects.
In this work, we propose PTDS-CenterTrack (Pedestrian Tracking in Dense Scenes), an extension of CenterTrack for object center-point detection and tracking.
This study introduces a Task-Risk Consistent Intelligent Detection Framework (TRC-ODF) for object detection in optical remote sensing images.
The purpose of this project is to continuously count moving objects in aisles in a fixed-view video scene, with a specified positive direction of movement: objects moving in the opposite direction are counted down.
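The directional counting rule above can be sketched with a virtual counting line: crossing it in the positive direction increments the count, crossing it the other way decrements it. This is a minimal sketch of one common approach, assuming a horizontal counting line and per-frame object positions; the function and parameter names are illustrative:

```python
def update_count(count, prev_y, curr_y, line_y, positive_dir=+1):
    """Update the running count when an object crosses the counting line.

    An object crossing line_y while moving in the positive direction
    increments the count; crossing in the opposite direction decrements it.
    """
    crossed_forward = prev_y < line_y <= curr_y    # moved past the line downward
    crossed_backward = curr_y <= line_y < prev_y   # moved past the line upward
    if crossed_forward:
        count += positive_dir
    elif crossed_backward:
        count -= positive_dir
    return count

# An object moving down across the line (positive), then back up (negative).
c = update_count(0, prev_y=10, curr_y=30, line_y=20)   # forward crossing
c = update_count(c, prev_y=30, curr_y=10, line_y=20)   # backward crossing
```

In practice the per-object positions would come from a tracker, so each track ID is checked for a crossing once per frame.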
To address the urgent need for deploying small-scale large language models (LLMs) under resource constraints, the proposed method works as follows: a segmentation component generates masks and an object segmentation list from the original image; a visual encoder processes the original image and masks to extract multi-level visual features highlighting object positions/boundaries, which are then refined via layer normalization and MLP layers into final visual features; finally, the masks, segmentation list (as text instructions), and visual features are fed into a vision-language large model (VLLM) for autoregressive semantic generation. This method also enhances VLLMs’ object perception and question-answering abilities without adding extra training parameters.
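The refinement step above (layer normalization followed by MLP layers over multi-level visual features) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the concatenation of levels, the ReLU activation, and all shapes and weight names are illustrative, not the method's actual architecture:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each feature vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def refine_features(multi_level_feats, w1, b1, w2, b2):
    """Combine multi-level visual features, normalize them, and project
    them through a two-layer MLP into the final visual features."""
    x = np.concatenate(multi_level_feats, axis=-1)  # stack feature levels
    x = layer_norm(x)
    h = np.maximum(x @ w1 + b1, 0.0)                # ReLU here for brevity
    return h @ w2 + b2
```

The refined features would then be fed, together with the masks and the segmentation list rendered as text instructions, into the VLLM for autoregressive generation.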
This method addresses the poor reliability of trajectory stitching and policy generalization that existing approaches exhibit in complex tasks. It involves: 1) building sequence-modeled state/action losses for iterative training; 2) designing a weighted squared error-based RTG loss; 3) using a double Q-learning framework (with two Q-functions, conservative Q-learning constraints, and a Boltzmann distribution) to optimize action exploration; 4) integrating state, action, RTG regularization, and Q-value losses into a joint optimization objective; 5) generating diverse action predictions via noise-perturbed RTG candidate sampling; 6) selecting the highest-Q action (evaluated by double conservative Q-functions) for execution. It is primarily used in agent exploration.
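Step 6 above, selecting the highest-Q action under double conservative Q-functions, can be sketched as taking the pessimistic (minimum) of the two Q-estimates per candidate and executing the argmax. This is a minimal sketch, assuming the candidate actions have already been generated from noise-perturbed RTG samples; the function names are illustrative:

```python
import numpy as np

def select_action(candidates, q1, q2):
    """Score each candidate action with two Q-functions, keep the
    pessimistic (min) estimate, and return the best-scoring action.

    Taking the min of the two estimates is the double-Q trick that
    guards against overestimation by either Q-function alone."""
    scores = [min(q1(a), q2(a)) for a in candidates]
    return candidates[int(np.argmax(scores))]
```

With, say, `q1 = lambda a: a` and `q2 = lambda a: 2 * a` as toy Q-functions, the candidate with the highest pessimistic value is executed.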
This invention solves the trade-off between tracking performance and computational complexity in MAV-borne target tracking. Its core steps: 1) Set initial, adjacent, and memory templates; 2) Input current frame search features and all templates into the adaptive template fusion (STF) module to generate the final template; 3) Correlate the final template with the search template to get a response map, judge tracking state, and update adjacent/memory templates; 4) The memory template module uses temporal cascading to integrate historical tracking key info, fitting all history into limited memory; 5) The adaptive fusion module adjusts template weights dynamically across tracking stages via template-search feature similarity matrices. It applies to MAV-borne target tracking.
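The adaptive fusion in step 5 above, weighting templates by template-search feature similarity, can be sketched as a similarity-softmax over template features. This is a minimal NumPy sketch using cosine similarity as an illustrative choice of similarity measure; the actual module operates on similarity matrices of deep features:

```python
import numpy as np

def fuse_templates(templates, search_feat):
    """Weight each template by its similarity to the current search
    features and combine them into a single final template, so the
    template most consistent with the present appearance dominates."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    sims = np.array([cosine(t, search_feat) for t in templates])
    weights = np.exp(sims) / np.exp(sims).sum()   # softmax over similarities
    return sum(w * t for w, t in zip(weights, templates))
```

Because the weights depend on the current frame, the fusion adapts across tracking stages: early on the initial template dominates, while after appearance changes the adjacent and memory templates gain weight.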
This invention addresses poor tracker performance in long-term tracking. Key steps: 1) Build a long-term tracker based on sequence-modeling reinforcement learning, with Transformer-based perception and decision layers (the perception layer's vision Transformer outputs feed the decision Transformer, which feeds action sequences back); 2) The tracker integrates sequence-modeling reinforcement learning to adaptively select baseline short-term trackers; 3) Decisions are made via memory-sequence analysis; 4) Each short-term tracker contributes to the overall result (jointly determined by its visual encoder and tracking method); 5) The decision layer dynamically optimizes the search-region position for tracking. It applies to long-term single-target tracking.
A dense-pedestrian multi-target tracking method based on feature fusion, together with computer equipment and a storage medium, belonging to the field of computer vision tracking technology and addressing the limitations of existing tracking methods for pedestrians in dense scenes.
The present invention belongs to the field of image processing and specifically relates to an intelligent spark plug appearance defect detection system.
Recently, I've been working on an interesting personal project: building an RL environment suitable for turn-based games, based on Sid Meier's Civilization V and VI. It is still in its infancy, and everyone is welcome to discuss and participate!
Template based on Jon Barron's website.