Learning Similarity between Scene Graphs and Images with Transformers

Institute for Information Processing, Leibniz University Hannover1
Scene Understanding Group, University of Twente2

Abstract

Scene graph generation is conventionally evaluated by (mean) Recall@K, which measures the ratio of correctly predicted triplets that appear in the ground truth. However, such triplet-oriented metrics cannot capture the global semantic information of scene graphs, nor can they measure the similarity between images and generated scene graphs. The usability of scene graphs in downstream tasks is therefore limited. To address this issue, a framework that can measure the similarity between scene graphs and images is urgently required. Motivated by the successful application of Contrastive Language-Image Pre-training (CLIP), we propose a novel contrastive learning framework consisting of a graph Transformer and an image Transformer to align scene graphs and their corresponding images in a shared latent space. To enable the graph Transformer to comprehend the scene graph structure and extract representative features, we introduce a graph serialization technique that transforms a scene graph into a sequence with structural encoding. Based on our framework, we introduce R-Precision, which measures image retrieval accuracy, as a new evaluation metric for scene graph generation, and establish new benchmarks for the Visual Genome and Open Images datasets. A series of experiments further demonstrates the effectiveness of the graph Transformer, which shows great potential as a scene graph encoder.

Method

We propose GICON, a straightforward and robust contrastive learning framework that connects scene graphs and images and learns their similarity. It consists of a scene graph encoder and an image encoder, both built on the Transformer architecture. The input scene graph can be a location-free scene graph (where nodes only represent entity categories) or a location-bound scene graph (where nodes represent entity categories and bounding boxes).
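To illustrate the alignment objective, the sketch below shows a CLIP-style symmetric contrastive loss over a batch of paired scene-graph and image embeddings. It assumes the two encoders have already produced fixed-size embeddings; the GICON-specific details (graph serialization, structural encoding, the exact Transformer backbones) are not shown, and the temperature value is an assumption.

```python
# Minimal sketch of a CLIP-style contrastive objective that aligns scene-graph
# and image embeddings in a shared latent space. The encoders producing
# `graph_emb` and `image_emb` are assumed, not GICON's actual implementation.
import torch
import torch.nn.functional as F


def contrastive_loss(graph_emb: torch.Tensor,
                     image_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired (scene graph, image) embeddings."""
    # L2-normalize so the dot product equals cosine similarity.
    g = F.normalize(graph_emb, dim=-1)   # (B, D)
    v = F.normalize(image_emb, dim=-1)   # (B, D)

    # Pairwise similarity logits, scaled by temperature.
    logits = g @ v.t() / temperature     # (B, B)
    targets = torch.arange(g.size(0), device=g.device)

    # Matching pairs lie on the diagonal; contrast graphs against images and vice versa.
    loss_g2i = F.cross_entropy(logits, targets)
    loss_i2g = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_g2i + loss_i2g)
```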

Similarity between scene graphs and images

Recall@K (widely-used)

Recall@K is widely used to evaluate scene graph generation methods. It calculates the fraction of ground-truth triplets that appear among the top K most confident triplet predictions. Due to the long-tail issue in scene graph datasets such as Visual Genome, mean Recall@K has been proposed. However, both Recall@K and mean Recall@K have limitations: they strictly compare the predicted triplet set with the ground-truth triplet set and are thus sensitive to noise and bias in the dataset annotations.
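For clarity, a simplified per-image Recall@K computation might look as follows. Triplets are treated here as plain (subject, predicate, object) category tuples; in practice the matching also involves bounding-box overlap, which is omitted in this sketch.

```python
# Illustrative per-image Recall@K: fraction of ground-truth triplets covered by
# the top-K scored predictions. Box-IoU matching is deliberately left out.
def recall_at_k(pred_triplets, pred_scores, gt_triplets, k):
    # Rank predictions by confidence and keep the top K.
    order = sorted(range(len(pred_triplets)),
                   key=lambda i: pred_scores[i], reverse=True)
    top_k = {pred_triplets[i] for i in order[:k]}

    if not gt_triplets:
        return 0.0
    hits = sum(1 for t in gt_triplets if t in top_k)
    return hits / len(gt_triplets)
```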


R-Precision (new!)

Therefore, we propose R-Precision based on GICON for scene graph generation evaluation. R-Precision measures the retrieval accuracy when retrieving the matching image from K image candidates using the generated scene graph as a query. We use the scene graph and image representations provided by GICON to compute the similarity scores for retrieval. Compared to triplet-oriented metrics, R-Precision based on GICON is more robust to perturbations of individual triplets.
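The sketch below shows how such a retrieval check could be scored for a single query, assuming precomputed GICON-style embeddings. The function name and the convention that the matching image sits at `target_idx` are illustrative choices, not part of the paper's implementation.

```python
# Hedged sketch of the per-query R-Precision check: the generated scene graph
# retrieves its matching image among K candidates by cosine similarity.
import torch
import torch.nn.functional as F


def r_precision_hit(graph_emb: torch.Tensor,
                    image_embs: torch.Tensor,
                    target_idx: int = 0) -> float:
    """Return 1.0 if the matching image (index `target_idx`) ranks first among K candidates."""
    g = F.normalize(graph_emb, dim=-1)      # (D,)  query scene-graph embedding
    v = F.normalize(image_embs, dim=-1)     # (K, D) candidate image embeddings
    sims = v @ g                            # cosine similarity to each candidate
    return float(sims.argmax().item() == target_idx)
```

The final metric is the average of these hits over all queries, i.e. the fraction of generated scene graphs that retrieve their own image first.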


New Benchmark of Scene Graph Generation Models

We benchmark different scene graph generation models on the Visual Genome dataset. Six two-stage methods and four one-stage methods are re-evaluated using R-Precision (K = 10/50/100) for both location-free scene graphs (LF Graph) and location-bound scene graphs (LB Graph).

BibTeX

@article{cong2023learning,
  title={Learning Similarity between Scene Graphs and Images with Transformers},
  author={Cong, Yuren and Liao, Wentong and Rosenhahn, Bodo and Yang, Michael Ying},
  journal={arXiv preprint arXiv:2304.00590},
  year={2023}
}