Tommaso Galliena

I am a PhD student in the PhD of National Interest in Robotics and Intelligent Machines (phDRIM) at the Italian Institute of Technology, where I work across the Pattern Analysis and Computer Vision and Humanoid Sensing and Perception labs under the supervision of Lorenzo Natale, Alessio Del Bue, and Pietro Morerio.

My research sits at the intersection of 3D scene understanding, vision-language models, and self-supervised learning. I am interested in building embodied agents that can develop persistent, spatially grounded representations of the world, enabling consistent object understanding, reasoning, and navigation across viewpoints and time.

Previously, I was a Visiting Research Student at Simon Fraser University in Vancouver, Canada, where I worked in the 3D Language and Generation lab under the supervision of Prof. Angel Xuan Chang.

Since March 2026, I have been a Research Intern in the Geometric Deep Learning group at NAVER LABS Europe, supervised by Gabriela Csurka and Vassilina Nikoulina, where I work on scalable representations that bridge vision, geometry, and language.

Email / CV / Scholar / Twitter / LinkedIn / GitHub

Publications

Memory-Augmented Vision-Language Agents for Persistent and Semantically Consistent Object Captioning
Tommaso Galliena, Tommaso Apicella, Stefano Rosa, Pietro Morerio, Alessio Del Bue, Lorenzo Natale
Pre-print, 2026
project page / arXiv

We rethink captioning as a memory-driven embodied process, where agents actively explore and refine object descriptions to achieve cross-view semantic consistency.

Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions
Tommaso Galliena, Tommaso Apicella, Stefano Rosa, Pietro Morerio, Alessio Del Bue, Lorenzo Natale
ICCV, 2025 (Highlight)
project page / arXiv

We propose a novel self-supervised learning framework for embodied image captioning, where an agent explores a 3D environment to generate spatially coherent image descriptions and collect challenging training data to fine-tune vision-language models.

Semiautomatic volume measure of kidney vascular territories on CT angiography to plan aortic aneurysm repair in patients with horseshoe kidney
Axel Bartoli, Alberto Colombo, Francesco Pisu, Tommaso Galliena, Chiara Gnasso, Enrico Rinaldi, Germano Melisano, Anna Palmisano, Antonio Esposito
Journal of European Radiology, 2024
Paper

By developing a semiautomatic CTA-based model to measure kidney vascular territories, we enable precise preoperative planning for aortic aneurysm repair in patients with horseshoe kidney, reducing the risk of postoperative renal damage.

Feel free to steal this website's source code. Do not scrape the HTML from this page itself, as it includes analytics tags that you do not want on your own website; use the GitHub code instead. Also, consider using Leonid Keselman's Jekyll fork of this page.