Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model

1Johns Hopkins University, 2Meta, 3University of Toronto, 4University of Central Florida

Primary Contributions


  • VistaLLM: We introduce VistaLLM, a powerful general-purpose vision system that integrates coarse- and fine-grained vision-language reasoning and grounding tasks over single and multiple input images into a unified framework.

  • Novel Sampling Algorithm: To efficiently convert segmentation masks into a sequence, we propose a gradient-aware adaptive contour sampling scheme, which improves over the previously used uniform sampling by 3-4 mIoU points on different segmentation benchmarks (see the sketch after this list).

  • CoinIt Dataset: To train VistaLLM on a diverse set of vision and language tasks, we propose the CoinIt (Coarse-to-fine Instruction-tuning) dataset, which contains 6.8M samples ranging over four broad categories of tasks - single-image coarse-level, single-image region-level, multi-image coarse-level, and multi-image region-level.

  • Novel Task: We address the lack of publicly available multi-image region-level datasets by proposing a novel task, Attribute-level Co-Segmentation (AttCoSeg), which aims to recognize which input images contain objects sharing common attributes (shape, color, size, position) and to segment those objects. AttCoSeg contains 685k training samples and helps VistaLLM gain significant generalizable reasoning and grounding capability over multiple input images.
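The sketch below illustrates one way the gradient-aware adaptive contour sampling could be realized (a minimal NumPy/OpenCV sketch under our own assumptions: the function name `sample_contour_adaptive`, the point budget `n_points`, the `curvature_weight` term, and the use of the gradient of the contour's tangent angle as the "gradient" signal are all illustrative choices, not the paper's released implementation):

```python
import cv2
import numpy as np

def sample_contour_adaptive(mask: np.ndarray, n_points: int = 32,
                            curvature_weight: float = 4.0) -> np.ndarray:
    """Convert a binary mask into a fixed-length contour point sequence,
    placing more points where the contour bends sharply and fewer along
    straight segments."""
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea).squeeze(1).astype(np.float64)  # (M, 2)

    # Tangent direction along the closed contour and its gradient (a curvature proxy).
    diff = np.diff(np.vstack([contour, contour[:1]]), axis=0)
    angles = np.unwrap(np.arctan2(diff[:, 1], diff[:, 0]))
    curvature = np.abs(np.gradient(angles))

    # Importance = uniform term + curvature term, accumulated along the contour.
    importance = 1.0 + curvature_weight * curvature / (curvature.max() + 1e-8)
    cumulative = np.concatenate([[0.0], np.cumsum(importance)])
    cumulative /= cumulative[-1]

    # Pick the contour vertices closest to n_points equally spaced quantiles of the
    # cumulative importance: samples cluster where curvature is high.
    targets = np.linspace(0.0, 1.0, n_points, endpoint=False)
    idx = np.clip(np.searchsorted(cumulative, targets), 0, len(contour) - 1)
    return contour[idx]
```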

VistaLLM


Overview of the proposed system - VistaLLM, which integrates single- and multi-image coarse- and fine-grained vision-language tasks into a unified general-purpose framework. VistaLLM contains three key design modules - (i) an image encoder to extract the global image embedding, (ii) an instruction-guided image tokenizer, which refines and compresses the global image embeddings using the task instruction, enabling the model to filter the visual information required for the current task, and (iii) an LLM (Vicuna)-based decoder to jointly process image and language features and generate the desired output. VistaLLM uses a gradient-aware adaptive sampling technique to efficiently represent segmentation masks as point sequences. All parameters except the image encoder are trained in stage 1, while only the image tokenizer is fine-tuned in stage 2.
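A minimal PyTorch sketch of this three-module flow is given below; the module name, the use of learnable query tokens with cross-attention for instruction-guided compression, and all dimensions are our own illustrative assumptions, not VistaLLM's released code:

```python
import torch
import torch.nn as nn

class InstructionGuidedTokenizer(nn.Module):
    """Compress global image embeddings into a few visual tokens, conditioned on
    the task instruction via cross-attention from learnable query tokens."""
    def __init__(self, dim: int = 1024, n_queries: int = 32, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, image_emb: torch.Tensor, instr_emb: torch.Tensor) -> torch.Tensor:
        # image_emb: (B, N_img, dim), instr_emb: (B, N_txt, dim)
        q = self.queries.unsqueeze(0).expand(image_emb.size(0), -1, -1)
        # Keys/values mix visual features with the instruction so the queries can
        # filter the visual information relevant to the current task.
        kv = torch.cat([image_emb, instr_emb], dim=1)
        out, _ = self.cross_attn(q, kv, kv)
        return self.proj(out)  # (B, n_queries, dim) tokens passed to the LLM decoder

# Sketch of the full flow (placeholders stand in for the frozen encoder and Vicuna):
encoder = nn.Identity()                           # placeholder image encoder
tokenizer = InstructionGuidedTokenizer()
image_emb = encoder(torch.randn(2, 256, 1024))    # global image embeddings
instr_emb = torch.randn(2, 16, 1024)              # embedded task instruction
visual_tokens = tokenizer(image_emb, instr_emb)   # fed to the Vicuna-based decoder
```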

Adaptive Sampling


Visualization of uniform and adaptive sampling strategies. (Left) illustration of sampled points and a comparison of reassembled curves; (right) illustration of sampled points and a comparison of reassembled masks. We efficiently transform binary masks into point sequences with a gradient-aware adaptive contour sampling scheme, which significantly improves over the naive uniform sampling previously used for sequence-to-sequence segmentation tasks.
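For evaluation, the predicted point sequence is reassembled into a binary mask. A minimal sketch of that reassembly is shown below (polygon rasterization with OpenCV is our own choice of primitive, and the assumption that points are normalized to [0, 1] is illustrative):

```python
import cv2
import numpy as np

def points_to_mask(points: np.ndarray, height: int, width: int) -> np.ndarray:
    """Rasterize a normalized (x, y) contour point sequence back into a binary mask."""
    poly = np.stack([points[:, 0] * width, points[:, 1] * height], axis=1)
    poly = np.round(poly).astype(np.int32)
    mask = np.zeros((height, width), dtype=np.uint8)
    cv2.fillPoly(mask, [poly], 1)   # fill the polygon enclosed by the sampled points
    return mask

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union + 1e-8)
```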


Main Results

Multi-round Conversational Example

Multi-round Conversational Ability of VistaLLM-13B. The images are taken from COCO. VistaLLM can address a wide range of grounding and reasoning tasks across single and multiple input images.

Referring Expression Comprehension (REC)

Referring Expression Comprehension (REC) on RefCOCO, RefCOCO+ and RefCOCOg by VistaLLM-13B. REC aims to generate a bounding box around a single object described by a referring expression.
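REC is typically scored as accuracy at an IoU threshold of 0.5 between the predicted and ground-truth boxes. The sketch below shows how a box emitted as text by a sequence model could be parsed and scored; the `(x1,y1),(x2,y2)` string format is our assumption for illustration, not VistaLLM's exact output format:

```python
import re

def parse_box(text: str):
    """Pull the first four numbers out of a generated string like '(0.12,0.30),(0.58,0.91)'."""
    nums = [float(x) for x in re.findall(r"-?\d+\.?\d*", text)]
    return nums[:4] if len(nums) >= 4 else None

def box_iou(a, b) -> float:
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def rec_accuracy(predictions, ground_truths, thresh: float = 0.5) -> float:
    """Fraction of samples whose predicted box overlaps the ground truth by >= thresh."""
    hits = 0
    for text, gt in zip(predictions, ground_truths):
        box = parse_box(text)
        if box is not None and box_iou(box, gt) >= thresh:
            hits += 1
    return hits / len(ground_truths)
```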

Referring Expression Segmentation (RES)

Referring Expression Segmentation (RES) on RefCOCO, RefCOCO+ and RefCOCOg by VistaLLM-13B. RES aims to segment a single object described by a referring expression.

Generalized Referring Expression Comprehension (GREC)

Generalized Referring Expression Comprehension (GREC) on gRefCOCO by VistaLLM-13B. GREC aims to identify all objects described by a referring expression and draw bounding boxes around every referred object. GREC also contains no-target expressions where the output is empty.

Generalized Referring Expression Segmentation (GRES)

Generalized Referring Expression Segmentation (GRES) on gRefCOCO by VistaLLM-13B. GRES aims to identify all objects described by a referring expression and segment every referred object. GRES also contains no-target samples where the output is empty.

Image Captioning

Image Captioning on COCO by VistaLLM-13B, which aims to generate a short holistic description of the input image.

VQAv2

VQAv2 by VistaLLM-13B, which aims to answer direct questions based on an input image.

LookTwice-QA

Box Question Answering (BoxQA) and Point Question Answering (PointQA) on LookTwice-QA by VistaLLM-13B. Given a question about a specified region in the image, indicated by either a point or a box, the model must comprehend that region in the context of the whole image to produce the correct answer.

POPE

Object Hallucination Evaluation of VistaLLM-13B on the POPE benchmark. Given a query about the existence of an object in the image, the model is expected to answer "yes" or "no".

NLVR2

Natural Language for Visual Reasoning (NLVR2) by VistaLLM-13B. Given a pair of input images and a question, the model must reason over both images to produce the correct answer.

CoSeg and AttCoSeg

CoSeg and AttCoSeg by VistaLLM-13B. Given a set of input images, CoSeg aims to find and segment a common object in every image. AttCoSeg is the more generalized scenario where only a pair of images among all inputs contains objects sharing common attributes. VistaLLM must first identify the two images with the common object and then segment that object in both images.
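An AttCoSeg training sample might be organized as below; this is a purely illustrative sketch of the instruction/answer structure implied by the task description, with field names, coordinates, and the answer layout as our assumptions rather than the released data format:

```python
# Hypothetical AttCoSeg sample: four input images, two of which share a
# red, round object; the answer names those images and gives one contour
# point sequence (normalized x, y pairs) per segmented object.
attcoseg_sample = {
    "images": ["img_0.jpg", "img_1.jpg", "img_2.jpg", "img_3.jpg"],
    "instruction": "Which two images contain a red, round object? Segment that object in both.",
    "answer": {
        "matched_images": [1, 3],
        "masks": {
            1: [(0.41, 0.22), (0.55, 0.25), (0.58, 0.40), (0.44, 0.43)],
            3: [(0.12, 0.60), (0.27, 0.58), (0.30, 0.74), (0.15, 0.77)],
        },
    },
}
```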

BibTeX

@article{pramanick2023jack,
      title   = {Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model}, 
      author  = {Shraman Pramanick and Guangxing Han and Rui Hou and Sayan Nag and Ser-Nam Lim and Nicolas Ballas and Qifan Wang and Rama Chellappa and Amjad Almahairi},
      journal = {arXiv preprint arXiv:2312.12423},
      year    = {2023}
}

Acknowledgement

This codebase is built on the LLaVA and Shikra repositories. We would like to thank the respective authors for their help, and the Meta AI team for discussions and feedback. Shraman Pramanick and Rama Chellappa were partially supported by an ONR MURI grant (N00014-20-1-2787). This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The template of this website is borrowed from the Nerfies website.