LAVIS - A One-stop Library for Language-Vision Intelligence
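A minimal sketch of how the library is typically used for image captioning, based on LAVIS's documented load_model_and_preprocess entry point; the model name ("blip_caption"), checkpoint ("base_coco"), and image path are assumptions drawn from the project's published examples, not part of this listing:

    import torch
    from PIL import Image
    from lavis.models import load_model_and_preprocess

    # Pick a device; LAVIS models are plain PyTorch modules.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load a pretrained BLIP captioning model together with its matching
    # image preprocessors (names follow the project's published examples).
    model, vis_processors, _ = load_model_and_preprocess(
        name="blip_caption", model_type="base_coco", is_eval=True, device=device
    )

    raw_image = Image.open("example.jpg").convert("RGB")  # placeholder path
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

    # Generate a natural-language caption for the image.
    print(model.generate({"image": image}))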
Compose multimodal datasets 🎹
This repository was built in association with our position paper, "Multimodality for NLP-Centered Applications: Resources, Advances and Frontiers". As part of this release we share information about recent multimodal datasets that are available for research purposes. We found that although 100+ multimodal language resources are available…
PyTorch implementation of Multimodal Fusion Transformer for Remote Sensing Image Classification.
[NeurIPS 2023 Oral] Quilt-1M: One Million Image-Text Pairs for Histopathology.
A dataset of 500,000 multimodal short videos with baseline models (TensorFlow 2.0).
Code from the paper "Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models"
This repository provides a comprehensive collection of research papers on multimodal representation learning, all of which are cited and discussed in the recently accepted survey: https://dl.acm.org/doi/abs/10.1145/3617833.
Code and data to evaluate LLMs on the ENEM, the main standardized Brazilian university admission exam.
[Paperlist] A curated list of papers on multimodal dialogue, covering methods, datasets, and metrics
Real-world photo sequence question answering system (MemexQA). CVPR'18 and TPAMI'19
[ICCV 2025] Official repository of "Mitigating Object Hallucinations via Sentence-Level Early Intervention".
Million-scale face and human-scene image-text datasets
Collects a multimodal dataset of Wikipedia articles and their images
Official evaluation scripts and baseline prompts for the DocVQA 2026 (ICDAR 2026) Competition on Multimodal Reasoning over Documents.
Data and code of the Findings of EMNLP'23 paper MuG: A Multimodal Classification Benchmark on Game Data with Tabular, Textual, and Visual Fields
Vision-Language Models Toolbox: Your all-in-one solution for multimodal research and experimentation
Towards Explainable Multimodal Depression Recognition for Clinical Interviews
Pre-Processing of Annotated Music Video Corpora (COGNIMUSE and DEAP)
Official Git repository for "Hakimov, S., and Schlangen, D., (2023). Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks. Findings of the Association for Computational Linguistics (ACL 2023 Findings)"