A Dataset and Benchmarks for Deep Learning-Based Optical Microrobot Pose and Depth Perception
Lan Wei and Dandan Zhang
Imperial College London

Introduction

Optical microrobots, manipulated via optical tweezers (OT), have broad applications in biomedicine. However, reliable pose and depth perception remain fundamental challenges due to the transparent, noisy, and dynamic characteristics of the microscale environments in which they operate. An open dataset is crucial for enabling reproducible research, facilitating benchmarking, and accelerating the development of perception models tailored to microscale challenges. Standardized evaluation enables consistent comparison across algorithms, promoting fair assessment and driving progress in the field. Here, we introduce the OpTical MicroRobot dataset (OTMR), the first publicly available dataset designed to support microrobot perception under an optical microscope. OTMR contains 232,881 high-resolution images spanning 18 microrobot types and 176 distinct poses. We benchmarked the performance of eight deep learning models, including architectures derived via neural architecture search (NAS), on two key tasks: pose classification and depth regression. The results indicate that Vision Transformers achieve the highest accuracy in pose classification, while depth regression benefits from deeper architectures. In addition, increasing the size of the training dataset leads to substantial improvements across both tasks, highlighting OTMR’s potential as a foundational resource for robust and generalisable microrobot perception in complex microscale environments.

Dataset

How were the microrobots fabricated?

The optical microrobots used in this study were fabricated using a Nanoscribe 3D printer (Nanoscribe GmbH, Germany) with IP-L 780 photoresist as the printing material. The fabrication process employed two-photon polymerization (2PP). Microrobots were directly printed onto glass substrates and subsequently immersed in deionized (DI) water within a sealed spacer chamber for imaging and experimental use.

Eighteen distinct microrobot designs were fabricated for inclusion in the OTMR dataset. Their CAD models and corresponding focal plane images are shown in Fig. 2. Microrobots 1–6 (top row) were primarily used for pose classification, featuring 176 unique out-of-plane poses generated by varying rotation angles from 0° to 90°. All 18 types were used for depth estimation.

Optical Microrobot Visualization
Fig. 2. Overview of the 18 microrobot types included in the OTMR dataset. For each robot, the left image shows its CAD model, and the right image presents the corresponding experimental image captured at the focus plane under an optical microscope. Microrobots 1–6 (top row) are specifically designed for the pose classification task due to their varied and distinguishable orientations, while all 18 types are used for depth estimation tasks.
Example video of microrobots recorded with the optical tweezer system.
Process Optical Tweezer Video to Single Frames

The example video above shows microrobots recorded with the optical tweezer system. For user convenience, the dataset includes pre-processed frames and labels. To access the original video files, please contact: d.zhang17@imperial.ac.uk or l.wei24@imperial.ac.uk.
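For users working from raw recordings rather than the released frames, the following is a minimal sketch of splitting an optical tweezer video into single frames with OpenCV. The file name, output layout, and sampling stride are illustrative assumptions, not the dataset's actual pre-processing pipeline.

```python
# Minimal sketch: split an optical tweezer recording into single frames with OpenCV.
# File names and output layout are placeholders; the released dataset already ships
# pre-processed frames and labels.
import cv2
from pathlib import Path

def video_to_frames(video_path: str, out_dir: str, stride: int = 1) -> int:
    """Save every `stride`-th frame of `video_path` as a PNG in `out_dir`."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            cv2.imwrite(str(out / f"frame_{saved:06d}.png"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Example (hypothetical file name): video_to_frames("robot3_pose_scan.avi", "frames/robot3", stride=5)
```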

How is the video/image collected?

The data collection system is built around an optical tweezer (OT) platform, exemplified here by the setup from Elliot Scientific (UK), integrated with a nanopositioning stage (Mad City Labs Inc., USA). Microscopic images were captured using a high-speed CCD camera (Basler AG, Germany) mounted on a Nikon Ti microscope with a 100× oil immersion objective. Each image frame has a resolution of 678×488 pixels. While this setup is used in our experiments, the data collection process is broadly compatible with other commercial OT systems and optical manipulation platforms that include a high-resolution microscope, making the dataset applicable across a wide range of micro-manipulation research environments. A schematic of the system is shown in Fig. 3.

During data acquisition, microrobots were fixed to a glass substrate mounted on a piezoelectric stage, enabling precise vertical translation along the z-axis for accurate depth measurements. To generate diverse out-of-plane poses, we fabricated microrobots with systematically varied orientations, which were further manipulated using the piezoelectric drive in either discrete steps or continuous motion. For each pose, over 1,000 image frames were captured to support both pose classification and depth estimation tasks. All data acquisition and processing were conducted offline.
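For illustration only, the sketch below shows one way depth labels could be assigned when the piezoelectric stage is driven in uniform discrete steps along the z-axis. The step size, frame counts, and the assumption of an equal number of frames per step are placeholders, not the exact labelling procedure used for OTMR.

```python
# Illustrative sketch (not the authors' exact pipeline): assign a depth label to each
# frame of a stepwise z-scan, assuming uniform stage steps and a fixed number of
# retained frames per step.
import numpy as np

def depth_labels(n_frames: int, z_start_um: float, z_step_um: float, frames_per_step: int) -> np.ndarray:
    """Return a depth label (in micrometres) for every frame of a stepwise z-scan."""
    step_index = np.arange(n_frames) // frames_per_step
    return z_start_um + step_index * z_step_um

# Example (placeholder numbers): 1,000 frames kept at each of 10 stage positions spaced 1 um apart.
labels = depth_labels(n_frames=10_000, z_start_um=0.0, z_step_um=1.0, frames_per_step=1_000)
```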

Fig. 3. Overview of the experimental platform for data collection.
Fig. 4. Image of the experimental setup.

Benchmark

Table II presents the five-fold cross-validation results for pose classification on two microrobot types, with models trained on an equal number of images per pose. The results indicate that microrobots with more complex structures, such as Robot 3 (which incorporates two distinct types of spherical components), are harder to classify than simpler designs such as Robot 1, which consists of four identical spheres. For instance, the best pitch and roll prediction accuracies for Robot 3 are 3.4% and 2.7% lower, respectively, than those for Robot 1. Among all evaluated architectures, the Vision Transformer (ViT) consistently outperforms the others across microrobot types. This can be attributed to its pretraining on ImageNet, a large-scale classification dataset with over 14 million images, and to its patch-based image decomposition, which captures the local and global features critical for pose recognition in microscale environments.

Table III compares the computational characteristics of the benchmarked models for pose classification: the number of parameters (in MB), inference complexity (in GFLOPs), and real-time processing capability (throughput in images per second). Although ViT has the highest GFLOPs of all models, it still processes over 1,300 images per second, demonstrating strong real-time performance despite its computational cost.
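As a rough illustration of how such throughput figures can be obtained, the sketch below times an ImageNet-pretrained ViT-B/16 with its classification head replaced for pose classes. The ViT variant, number of classes, batch size, and input resolution are assumptions rather than the exact benchmark configuration.

```python
# Hedged sketch of a throughput (images/s) measurement for a pretrained ViT adapted to
# pose classification. Model variant, class count, batch size, and resolution are assumed.
import time
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

n_pose_classes = 176   # placeholder; the per-robot pitch/roll class counts differ
device = "cuda" if torch.cuda.is_available() else "cpu"

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
model.heads = nn.Linear(model.hidden_dim, n_pose_classes)   # replace the ImageNet classifier head
model.eval().to(device)

batch = torch.randn(64, 3, 224, 224, device=device)         # microrobot crops resized to 224x224
with torch.no_grad():
    for _ in range(3):                                       # warm-up iterations
        model(batch)
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(20):
        model(batch)
    if device == "cuda":
        torch.cuda.synchronize()
    print(f"throughput: {20 * batch.shape[0] / (time.time() - t0):.0f} images/s")
```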

TABLE II
POSE CLASSIFICATION FIVE-FOLD CROSS-VALIDATION RESULTS FOR ROBOT 1 AND ROBOT 3 ACROSS ALL MODELS.
TABLE III
COMPARISON OF MODEL SIZE (MB), INFERENCE COST (GFLOPS), AND THROUGHPUT (IMAGES PROCESSED PER SECOND) FOR POSE CLASSIFICATION ACROSS BENCHMARKED METHODS.

Table IV summarizes the depth regression results for Robots 8–18. Similar to pose classification, robots with complex and asymmetric geometries (e.g., Robot 14) are significantly more difficult to regress accurately. The lowest mean squared error (MSE) obtained on Robot 14 is approximately six times higher than that of a simpler design such as Robot 8. Furthermore, for a given robot, deeper architectures (e.g., ResNet50) tend to outperform shallower ones (e.g., ResNet18), highlighting the need for higher model capacity in depth estimation tasks.
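The following is a minimal sketch of the kind of ResNet-50 depth regressor benchmarked here: the ImageNet classification head is replaced by a single-output regression head trained with MSE. The hyperparameters and data pipeline are assumptions, not the paper's exact training setup.

```python
# Minimal sketch of a ResNet-50 depth regressor (single-output head, MSE loss).
# Learning rate, weights, and preprocessing are placeholder assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 1)   # predict a single depth value (e.g., in um)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images: torch.Tensor, depths: torch.Tensor) -> float:
    """One optimisation step on a batch of (image, depth) pairs."""
    optimizer.zero_grad()
    pred = model(images).squeeze(1)
    loss = criterion(pred, depths)
    loss.backward()
    optimizer.step()
    return loss.item()
```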

TABLE IV
DEPTH REGRESSION RESULTS FOR ROBOTS 8-18 ACROSS ALL MODELS.

The results of neural architecture search (NAS) are shown in Table V. The NAS process was applied to CNN-based architectures and trained from scratch. As reported in Tables II and IV, the NAS-optimized models consistently outperform the baseline CNN in all evaluated cases. Notably, the NAS model achieved the best depth regression performance on Robot 16, one of the most complex designs, demonstrating the effectiveness of architecture search in tailoring models to task difficulty. Moreover, the architectures discovered by NAS further reflect the relative difficulty of the tasks. The optimal model for depth regression includes two additional convolutional layers and a larger fully connected layer compared to the model optimized for pose classification, aligning with the inherently more complex nature of continuous-value prediction in regression tasks.
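To make the idea concrete, the sketch below shows a simple random-search loop over a small CNN search space (number of convolutional blocks, channel width, fully connected layer size). The actual NAS strategy and search space used for Table V may differ; treat this purely as an illustration of searching over depth and width.

```python
# Hedged sketch of random-search NAS over a small CNN search space. The search
# strategy, space, and evaluation function are illustrative assumptions.
import random
import torch.nn as nn

def build_cnn(n_conv: int, width: int, fc_size: int, n_out: int) -> nn.Module:
    """Build a CNN with `n_conv` conv blocks, `width` channels, and an FC head of `fc_size`."""
    layers, in_ch = [], 3
    for _ in range(n_conv):
        layers += [nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)]
        in_ch = width
    return nn.Sequential(*layers, nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(width, fc_size), nn.ReLU(), nn.Linear(fc_size, n_out))

search_space = {"n_conv": [2, 3, 4, 5], "width": [32, 64, 128], "fc_size": [128, 256, 512]}

def random_search(evaluate, n_trials: int = 20, n_out: int = 1):
    """`evaluate(model) -> validation score (higher is better)`; returns the best config found."""
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {k: random.choice(v) for k, v in search_space.items()}
        score = evaluate(build_cnn(n_out=n_out, **cfg))
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```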

TABLE V
OPTIMAL ARCHITECTURES FOUND BY NAS FOR POSE CLASSIFICATION AND DEPTH REGRESSION.

Transfer Learning Among Different Robots

To evaluate the generalisation ability of deep learning models, we conducted a transfer learning experiment using the best-performing model—ViT—trained on data from Robot Type 3 for the pose classification task. The trained model was directly tested on Robot Types 1, 4, and 5 without further fine-tuning. The evaluation metric is the average classification accuracy of pitch and roll angles.
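A sketch of this evaluation protocol is given below: a frozen classifier trained on one robot type is scored on frames from another robot type without fine-tuning. How the released frames and pose labels are loaded into dataloaders is an assumption.

```python
# Sketch of cross-robot evaluation: score a frozen pose classifier on another robot's
# labelled frames. Dataloader construction and label layout are assumed.
import torch

@torch.no_grad()
def cross_robot_accuracy(model, loader, device="cuda"):
    """Top-1 accuracy of a frozen model on another robot's pose-labelled frames."""
    model.eval().to(device)
    correct, total = 0, 0
    for images, labels in loader:
        pred = model(images.to(device)).argmax(dim=1).cpu()
        correct += (pred == labels).sum().item()
        total += labels.numel()
    return correct / total

# Example (hypothetical loaders): accs = {name: cross_robot_accuracy(vit_robot3, loaders[name])
#                                         for name in ["robot1", "robot4", "robot5"]}
```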

As shown in Fig. 7, the model achieves the highest accuracy on Robot 3, the training target, as expected. Robots 4 and 5 exhibit higher classification accuracy than Robot 1, likely due to structural similarity to Robot 3. All three robots (3, 4, and 5) share a common feature: they are composed of two distinct types of spherical components along the arms. In contrast, Robot 1 consists of four identical spheres, which differ significantly in geometry and visual features. Furthermore, the spatial orientation of the robot also affects transfer performance. Robot 5 shares the same horizontal configuration as Robot 3, leading to better generalisation and higher accuracy. In contrast, Robot 4 is vertically oriented, which introduces a distribution shift in visual appearance and results in reduced classification performance compared to Robot 5.

Fig. 7. Transfer learning results of the ViT model trained on Robot Type 3 and tested on different robot types without fine-tuning. The evaluation metric is the average classification accuracy of pitch and roll angles.
Model Interpretability

To gain insight into which regions of an input microrobot image influence the model’s predictions, we employ Gradient-weighted Class Activation Mapping (Grad-CAM) to visualize the spatial attention of the CNN during pose classification. As illustrated in Fig. 8, the Grad-CAM heatmaps highlight the areas that contribute most to the model’s decision-making process.
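For reference, the sketch below implements a generic Grad-CAM pass using forward and backward hooks on a chosen convolutional layer. The target layer choice and preprocessing are assumptions and would need to be adapted to the benchmarked CNN.

```python
# Generic Grad-CAM sketch using forward/backward hooks on a target conv layer.
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Return an [H, W] heatmap in [0, 1] for one image tensor of shape [1, 3, H, W]."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    try:
        model.eval()
        logits = model(image)
        idx = logits.argmax(dim=1) if class_idx is None else torch.tensor([class_idx], device=logits.device)
        model.zero_grad()
        logits.gather(1, idx.view(-1, 1)).sum().backward()
        weights = grads["a"].mean(dim=(2, 3), keepdim=True)            # global-average-pooled gradients
        cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))  # weighted sum of feature maps
        cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)       # normalise to [0, 1]
        return cam[0, 0]
    finally:
        h1.remove()
        h2.remove()

# Example (hypothetical layer choice): heatmap = grad_cam(cnn, img_tensor, cnn.layer4[-1])
```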

The visualizations for Robot Types 1 and 3 show that the model consistently attends to the microrobot structure itself, particularly the arms and spherical components, when determining the pose. This confirms that the CNN has learned to focus on the relevant features of the microrobot rather than on background noise, providing interpretability and confidence in the model’s classification behaviour.

Fig. 8. Grad-CAM visualizations for pose classification on Robot Types 1 and 3 using a CNN model. Each pair shows the original microscope image (left) and its corresponding Grad-CAM heatmap (right). The red regions indicate high-importance areas that the model relies on most for its predictions, while blue regions indicate areas of low importance that are largely ignored.
Influence of Data Size

To evaluate the impact of dataset size on depth regression performance, we conduct experiments using Robot Type 8 and the best-performing model identified in Table IV, ResNet50, trained for 10 epochs. The complete dataset for this robot consists of 5,600 images. We train and test the model using varying proportions of the data: 100%, 80%, 60%, 40%, and 20%, while maintaining a fixed train/validation/test split of 8:1:1 in each case. As shown in Fig. 9, increasing the amount of training data consistently reduces the mean squared error (MSE) and improves the R² score, indicating better regression accuracy and stronger predictive reliability. These results emphasize the importance of large-scale data availability for training deep learning models in micro-scale environments and demonstrate the value of the OTMR dataset in enabling robust, data-driven depth estimation.
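A sketch of this ablation protocol is shown below: a fraction of the robot's image set is sampled, split 8:1:1, and the test MSE and R² are reported. The data loading and training routines are assumptions about how such an experiment could be wired up, not the exact implementation.

```python
# Sketch of the data-size ablation: sample a fraction of the (image, depth) pairs,
# split 8:1:1, train, and report test MSE and R^2. Training/prediction callables are assumed.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def run_fraction(samples, fraction, train_fn, predict_fn, seed=0):
    """`samples` is a list of (image, depth) pairs; returns (MSE, R^2) on the test split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples))[: int(len(samples) * fraction)]
    subset = [samples[i] for i in idx]
    n_train, n_val = int(0.8 * len(subset)), int(0.1 * len(subset))
    train, val, test = subset[:n_train], subset[n_train:n_train + n_val], subset[n_train + n_val:]
    model = train_fn(train, val)                         # e.g., ResNet-50 trained for 10 epochs
    y_true = np.array([depth for _, depth in test])
    y_pred = np.array([predict_fn(model, img) for img, _ in test])
    return mean_squared_error(y_true, y_pred), r2_score(y_true, y_pred)

# Example: for f in (1.0, 0.8, 0.6, 0.4, 0.2): print(f, run_fraction(data, f, train, predict))
```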

Fig. 9. Impact of training data size on depth regression performance using ResNet50 for Robot Type 8.

Download

The optical microrobot database is available at:

Optical Microrobot Dataset

If you encounter any issues accessing the files, please feel free to contact us at d.zhang17@imperial.ac.uk or l.wei24@imperial.ac.uk.

When using the datasets, please cite:
@inproceedings{wei2025dataset,
  title     = {A Dataset and Benchmarks for Deep Learning-Based Optical Microrobot Pose and Depth Perception},
  author    = {Wei, L. and Zhang, D.},
  booktitle = {2025 International Conference on Manipulation, Automation and Robotics at Small Scales (MARSS)},
  pages     = {1--8},
  year      = {2025},
  address   = {West Lafayette, IN, USA},
  doi       = {10.1109/MARSS65887.2025.11072739}
}

Code

All experimental implementations are available at:

LannWei / Optical-Microrobot-Database

Publications

  1. Zhang, Dandan, Antoine Barbot, Florent Seichepine, Frank P-W. Lo, Wenjia Bai, Guang-Zhong Yang, and Benny Lo. "Micro-object pose estimation with sim-to-real transfer learning using small dataset." Communications Physics 5, no. 1 (2022): 80.
  2. Zhang, Dandan, Yunxiao Ren, Antoine Barbot, Florent Seichepine, Benny Lo, Zhuo-Chen Ma, and Guang-Zhong Yang. "Fabrication and optical manipulation of micro-robots for biomedical applications." Matter 5, no. 10 (2022): 3135-3160.
  3. Ren, Yunxiao, Meysam Keshavarz, Salzitsa Anastasova, Ghazal Hatami, Benny Lo, and Dandan Zhang. "Machine learning-based realtime localization and automatic trapping of multiple microrobots in optical tweezer." In 2022 International Conference on Manipulation, Automation and Robotics at Small Scales (MARSS), pp. 1-6. IEEE, 2022.
  4. Zhang, Dandan, Frank P-W. Lo, Jian-Qing Zheng, Wenjia Bai, Guang-Zhong Yang, and Benny Lo. "Data-driven microscopic pose and depth estimation for optical microrobot manipulation." ACS Photonics 7, no. 11 (2020): 3003-3014.
  5. Zhang, Dandan, Antoine Barbot, Benny Lo, and Guang-Zhong Yang. "Distributed force control for microrobot manipulation via planar multi-spot optical tweezer." Advanced Optical Materials 8, no. 21 (2020): 2000543.
Note
The open-source resources, including the dataset, deep learning models, benchmarking results, and documentation, will be continuously updated. The presented algorithms are generalisable and can be applied to other types of microscopic imaging data for perception-related tasks. In future work, we plan to integrate more advanced AI techniques to further enhance benchmarking capabilities and expand the applicability of our platform.