Hello! I am a Ph.D. student at HIT, advised by Junbao Li and Huanyu Liu.
My name is Jiazheng Wen, and my English name is Joshua Wen. In 2017, I received a bachelor's degree in information engineering from Xi'an Jiaotong University. In 2022, I received a master's degree in electronic science and technology from Harbin Engineering University. I am currently pursuing a Ph.D. at the Faculty of Computing, Harbin Institute of Technology. From 2019 to 2022, I worked as a research intern at Focused Loong Technology Co., Ltd. My research focuses on deep reinforcement learning and vision enhancement.
Currently, my main interest is in how to enhance the perception of computer vision algorithms through deep reinforcement learning. If you want to discuss anything research-related, please feel free to reach out :)
Email  /  ORCID  /  Google Scholar  /  GitHub
With recent significant advancements in large vision-language models (LVLMs), image-text understanding capabilities have substantially improved. However, a notable gap remains in fine-grained region understanding. Moreover, the resource consumption for training and testing large-scale LVLMs is immense, making them less accessible to researchers with limited resources. In this paper, we propose a small-scale LVLM, Seg-LLaVA, which employs a lightweight visual prompting method that leverages a semantic segmenter and a small-scale large language model (LLM). By feeding fine-grained knowledge generated by a specialized instance segmentation model, together with the original image, into a multi-layer linear model, we enable the model to perceive object boundaries and types in the image without significantly increasing the number of training parameters, thereby greatly enhancing its visual understanding capabilities. Additionally, we adopt an efficient training approach, allowing Seg-LLaVA to achieve outstanding performance while further reducing resource requirements. Experimental results show that our model excels across multiple benchmarks and demonstrates strong fine-grained perception capabilities.
We propose a robust and lightweight tracking model, the self-adaptive dynamic template Siamese network (SiamSDT).
This paper adopts a progressive-network approach to design an aggregated convolution progressive network (ACP) inpainting model, which improves the inpainting of corrupted regions across various types of images with different levels of information.
Recent advances have shown the possibility of jointly optimizing cross-spectral relative poses and neural radiance fields using normalized cross-device coordinates. However, such methods suffer from cross-spectral misalignment when data are collected asynchronously across devices, and they lack the capability to render in real time or handle large scenes. We address these issues by proposing cross-spectral Gaussian Splatting with spatial occupancy consistency, which strictly aligns the cross-spectral scene representation by sharing explicit Gaussian surfaces across spectra and separately optimizing each view's extrinsics using a matching-optimizing pose estimation method.
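A minimal sketch of the shared-geometry idea, with illustrative parameter names and shapes (my assumptions, not the paper's implementation): the explicit geometry is shared across spectra so spatial occupancy stays consistent, while appearance is per-spectrum and camera extrinsics are per-view.

```python
import torch
import torch.nn as nn

class CrossSpectralGaussians(nn.Module):
    """Sketch: shared Gaussian geometry, per-spectrum appearance,
    per-view extrinsics (all names/shapes are placeholders)."""
    def __init__(self, n_gauss, n_views, spectra=("rgb", "thermal")):
        super().__init__()
        # Shared explicit geometry, common to every spectrum.
        self.means = nn.Parameter(torch.randn(n_gauss, 3))
        self.log_scales = nn.Parameter(torch.zeros(n_gauss, 3))
        self.rotations = nn.Parameter(torch.randn(n_gauss, 4))
        self.opacities = nn.Parameter(torch.zeros(n_gauss, 1))
        # Per-spectrum appearance (e.g., DC color per Gaussian).
        self.colors = nn.ParameterDict(
            {s: nn.Parameter(torch.zeros(n_gauss, 3)) for s in spectra})
        # Per-view extrinsics optimized separately (se(3) vectors).
        self.extrinsics = nn.Parameter(torch.zeros(n_views, 6))

model = CrossSpectralGaussians(n_gauss=10000, n_views=12)
```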
In this paper, we propose an advanced pan-tilt-zoom (PTZ) camera control method that does not require intrinsic camera parameters. The goal is to accomplish visual enhancement of low-confidence objects.
In this work, we propose PTDS (Pedestrian Tracking in Dense Scenes) CenterTrack, built on CenterTrack for object center-point detection and tracking.
This study introduces a Task-Risk Consistent Intelligent Detection Framework (TRC-ODF) for object detection in optical remote sensing images.
The purpose of this project is to continuously count moving objects in aisles in a fixed-view video scene, with a specified positive direction of movement: objects moving in the opposite direction are counted down.
To address the urgent need to deploy small-scale large language models (LLMs) under resource constraints, the proposed method works as follows: a segmentation component generates masks and an object segmentation list from the original image; a visual encoder processes the original image and masks to extract multi-level visual features highlighting object positions/boundaries, which are then refined via layer normalization and MLP layers into final visual features; finally, the masks, segmentation list (as text instructions), and visual features are fed into a vision-language large model (VLLM) for autoregressive semantic generation. This method also enhances the VLLM's object perception and question-answering abilities without adding extra training parameters.
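A minimal sketch of the refinement step, assuming hypothetical feature dimensions and treating the segmenter, visual encoder, and VLLM as black boxes:

```python
import torch
import torch.nn as nn

class MaskAwareVisualPrompt(nn.Module):
    """Sketch: fuse image tokens with mask-derived tokens, then refine
    them via LayerNorm + MLP into final visual features, as described
    above. Dimensions and fusion choice (addition) are assumptions."""
    def __init__(self, dim: int, hidden: int = 2048):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, image_feats: torch.Tensor,
                mask_feats: torch.Tensor) -> torch.Tensor:
        fused = image_feats + mask_feats        # (B, N, dim)
        return self.mlp(self.norm(fused))       # final visual features

# Usage: tokens from the visual encoder for the raw image and for the
# mask rendering are fused; the result, plus the textual segmentation
# list, conditions the VLLM's autoregressive generation.
B, N, D = 1, 256, 1024
visual_tokens = MaskAwareVisualPrompt(D)(torch.randn(B, N, D),
                                         torch.randn(B, N, D))
```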
This method addresses the poor reliability of trajectory stitching and policy generalization that existing approaches exhibit in complex tasks. It involves: 1) building sequence-modeled state/action losses for iterative training; 2) designing a weighted squared-error-based return-to-go (RTG) loss; 3) using a double Q-learning framework (with two Q-functions, conservative Q-learning constraints, and a Boltzmann distribution) to optimize action exploration; 4) integrating state, action, RTG regularization, and Q-value losses into a joint optimization objective; 5) generating diverse action predictions via noise-perturbed RTG candidate sampling; 6) selecting the highest-Q action (evaluated by the double conservative Q-functions) for execution. It is primarily used in agent exploration.
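A minimal sketch of steps 5 and 6, with `policy`, `q1`, and `q2` as stand-in callables (hypothetical interfaces, not the patented implementation):

```python
import torch

def select_action(policy, q1, q2, state, rtg,
                  num_candidates=8, noise_std=0.1):
    # 5) Noise-perturbed RTG candidates -> diverse action predictions.
    rtgs = rtg + noise_std * torch.randn(num_candidates, *rtg.shape)
    actions = torch.stack([policy(state, r) for r in rtgs])

    # 6) Double conservative Q evaluation: keep the pessimistic value
    #    of the two critics, then execute the highest-valued action.
    states = state.unsqueeze(0).expand(num_candidates, -1)
    q_vals = torch.minimum(q1(states, actions), q2(states, actions))
    return actions[q_vals.squeeze(-1).argmax()]

# Dummy usage with stand-in networks:
policy = lambda s, r: torch.tanh(s[:4] * r)               # fake policy head
q_fn = lambda s, a: (s[:, :4] * a).sum(-1, keepdim=True)  # fake critic
action = select_action(policy, q_fn, q_fn,
                       torch.randn(8), torch.tensor(1.0))
```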
This invention resolves the trade-off between tracking performance and computational complexity in MAV-borne target tracking. Its core steps: 1) Set initial, adjacent, and memory templates; 2) Input the current frame's search features and all templates into the adaptive template fusion (STF) module to generate the final template; 3) Correlate the final template with the search features to obtain a response map, judge the tracking state, and update the adjacent/memory templates; 4) The memory template module uses temporal cascading to integrate key information from the tracking history, fitting all history into limited memory; 5) The adaptive fusion module dynamically adjusts template weights across tracking stages via template-search feature similarity matrices. It applies to MAV-borne target tracking.
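A minimal sketch of the similarity-weighted fusion in step 5, assuming simple global average pooling and cosine similarity (illustrative choices, not the patented module):

```python
import torch
import torch.nn.functional as F

def fuse_templates(search_feat, templates):
    """Sketch: weight each template (initial, adjacent, memory) by its
    similarity to the current search features, then fuse them into a
    single final template."""
    # Global descriptors via spatial average pooling: (C,) and (K, C).
    s = search_feat.mean(dim=(-2, -1))
    t = torch.stack([tpl.mean(dim=(-2, -1)) for tpl in templates])

    # Cosine-similarity weights over the K templates.
    weights = F.softmax(F.cosine_similarity(t, s.unsqueeze(0), dim=-1),
                        dim=0)

    # Weighted sum of template feature maps -> final template.
    stacked = torch.stack(templates)                    # (K, C, H, W)
    return (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)

# Usage with random features (C=256, template 8x8, search 16x16):
search = torch.randn(256, 16, 16)
tpls = [torch.randn(256, 8, 8) for _ in range(3)]
final_template = fuse_templates(search, tpls)           # (256, 8, 8)
```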
This invention resolves poor tracker performance in long-term tracking. Key steps: 1) Build a long-term tracker based on sequence-modeling reinforcement learning, with Transformer-based perception and decision layers (the perception layer's visual Transformer outputs feed the decision Transformer, which feeds action sequences back); 2) The tracker uses sequence-modeling reinforcement learning to adaptively select among baseline short-term trackers; 3) Decisions are made by analyzing the memory sequence; 4) Each short-term tracker's contribution to the overall result is jointly determined by its visual encoder and tracking method; 5) The decision layer dynamically optimizes the search-region position for tracking. It applies to long-term single-target tracking.
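A minimal sketch of such a decision layer, with illustrative dimensions and head names (my assumptions, not the patented design): a Transformer reads the memory sequence and emits both a tracker choice and a search-region adjustment.

```python
import torch
import torch.nn as nn

class TrackerSelector(nn.Module):
    """Sketch: a Transformer encoder summarizes the memory sequence,
    then two heads pick a baseline short-term tracker and a search
    region (center/size), as described in steps 2-5."""
    def __init__(self, dim=256, n_trackers=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.tracker_head = nn.Linear(dim, n_trackers)  # which tracker
        self.region_head = nn.Linear(dim, 4)            # (cx, cy, w, h)

    def forward(self, memory_seq):                      # (B, T, dim)
        h = self.encoder(memory_seq)[:, -1]             # last-step summary
        return self.tracker_head(h).argmax(-1), self.region_head(h)

# Usage: choose a tracker and search region from an 8-step memory.
tracker_id, search_box = TrackerSelector()(torch.randn(1, 8, 256))
```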
A dense-scene pedestrian multi-target tracking method based on feature fusion, a computer device, and a storage medium, belonging to the field of computer vision tracking technology, addressing the limitations of existing tracking methods for pedestrians in dense scenes.
The present invention belongs to the field of image processing and specifically relates to an intelligent spark plug appearance defect detection system.
Recently, I've been working on an interesting personal project: building an RL environment for turn-based games, based on Sid Meier's Civilization 5 and 6. It is still in its infancy, and everyone is welcome to discuss and contribute!
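A very rough sketch of the planned environment interface, assuming the standard Gymnasium API; the Civilization bridge (game-state extraction, command injection) does not exist yet, so everything below is a placeholder.

```python
import gymnasium as gym
from gymnasium import spaces

class TurnBasedEnv(gym.Env):
    """Placeholder sketch of the planned turn-based game environment;
    observation/action spaces are invented for illustration."""
    def __init__(self):
        self.observation_space = spaces.Dict({
            "map": spaces.Box(0, 255, shape=(64, 64, 8)),  # tile features
            "turn": spaces.Discrete(500),                  # turn counter
        })
        self.action_space = spaces.Discrete(32)            # placeholder set

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return self.observation_space.sample(), {}

    def step(self, action):
        # A real bridge would forward `action` to the game and read back
        # the new state; here we just sample a random observation.
        obs = self.observation_space.sample()
        return obs, 0.0, False, False, {}
```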
Template based on Jon Barron's website.