GitHub - THUDM/CogVLM: a state-of-the-art-level open visual language model | a multimodal pretrained model
🚀 Dive into the world of advanced visual language modeling with GitHub's THUDM/CogVLM! 🤖✨ With 10B visual and 7B language parameters, it excels in image understanding & GUI Agent capabilities. 🎨💬 #AI #VisualLanguageModeling #CogVLM
- CogVLM is a visual language model with 10 billion visual parameters and 7 billion language parameters, excelling in image understanding and cross-modal benchmarks.
- CogAgent is an enhanced version of CogVLM with GUI Agent capabilities; it supports image understanding at a resolution of 1120x1120 and outperforms existing models on a range of benchmarks.
- CogVLM-17B and CogAgent-18B achieve top performance on several cross-modal benchmarks and possess advanced features like multi-turn dialogue, visual grounding, and GUI Agent capabilities.
- CogVLM-17B achieves state-of-the-art results on benchmarks such as NoCaps, Flickr30k, and Visual7W, while CogAgent-18B excels on VQAv2, OK-VQA, TextVQA, and more.
- CogVLM supports multi-round chat and VQA, while CogAgent can additionally answer GUI-related questions, returning a plan, the next action, and specific operations for a given GUI screenshot; a minimal chat-inference sketch is included after this list.
- Hardware requirements for model inference range from a single RTX 3090 with INT4 quantization up to an A100 (or dual A100s) for FP16, depending on the model and precision used; a quantized-loading sketch follows the list.
- Finetuning demos are available for CogAgent and CogVLM to adapt the models to specific tasks, such as Captcha recognition, with detailed steps and an evaluation procedure (a generic evaluation sketch appears below).
- OpenAI Vision-format APIs are provided for GPT-4V-style use, enabling image-grounded dialogue and continuous multi-turn interaction with the model; a client-side example appears after the list.
- In the FAQ section, troubleshooting tips are given for issues related to model downloading, tokenizer access, and model saving locations.
- The code in the repository follows the Apache-2.0 license, and the usage of CogVLM model weights must adhere to the specified Model License.
- Citation information and acknowledgements are provided, covering the CogVLM and CogAgent papers by the authors and thanking the datasets used in the fine-tuning phases.
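
As referenced above, here is a minimal chat-inference sketch for CogVLM, modeled on the Hugging Face `cogvlm-chat-hf` usage pattern. The checkpoint names, the Vicuna tokenizer, and the `build_conversation_input_ids` helper are taken from that pattern and may differ from the repository's own CLI/web demos.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

# Vicuna tokenizer and the chat checkpoint, as in the cogvlm-chat-hf usage pattern.
tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm-chat-hf",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda").eval()

# One turn of visual question answering; `history` carries earlier turns for multi-round chat.
image = Image.open("demo.jpg").convert("RGB")
inputs = model.build_conversation_input_ids(
    tokenizer, query="Describe this image.", history=[], images=[image]
)
inputs = {
    "input_ids": inputs["input_ids"].unsqueeze(0).to("cuda"),
    "token_type_ids": inputs["token_type_ids"].unsqueeze(0).to("cuda"),
    "attention_mask": inputs["attention_mask"].unsqueeze(0).to("cuda"),
    "images": [[inputs["images"][0].to("cuda").to(torch.bfloat16)]],
}

with torch.no_grad():
    outputs = model.generate(**inputs, max_length=2048, do_sample=False)
    outputs = outputs[:, inputs["input_ids"].shape[1]:]  # keep only newly generated tokens
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```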
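For the hardware note above, the following sketches 4-bit quantized loading aimed at a single 24 GB RTX 3090. It assumes the Hugging Face checkpoint accepts a standard bitsandbytes `BitsAndBytesConfig`; whether the model's custom code path supports this should be verified against the repository.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, LlamaTokenizer

# NF4 4-bit weights with bfloat16 compute, intended to fit CogVLM-17B inference in ~24 GB.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm-chat-hf",
    quantization_config=bnb_config,  # assumption: the remote-code model accepts this config
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval()
# Unquantized FP16/BF16 inference needs roughly 40 GB, hence an A100-class GPU (or two).
```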
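The finetuning item mentions an evaluation step for the Captcha-recognition demo. As a generic illustration only (not the repository's actual evaluation script), exact-match accuracy over predicted captcha strings could be computed like this:

```python
def captcha_accuracy(predictions, labels):
    """Exact-match accuracy: a captcha counts as correct only if every character matches."""
    assert len(predictions) == len(labels)
    correct = sum(p.strip().lower() == t.strip().lower() for p, t in zip(predictions, labels))
    return correct / len(labels)

# Example: three model outputs against ground-truth captcha strings -> 2/3 correct.
print(captcha_accuracy(["a3x9k", "b7Q2d", "zzzzz"], ["a3x9k", "b7q2d", "m4n8p"]))
```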
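For the OpenAI Vision-format API, here is a client-side sketch using the `openai` Python package pointed at a locally hosted server. The base URL, port, and model name are placeholders and must be adjusted to the actual deployment started from the repository's demo.

```python
import base64
from openai import OpenAI

# Hypothetical endpoint: adjust base_url and model name to match your local deployment.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

# Encode a local image as a data URL, following the OpenAI Vision message format.
with open("demo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="cogvlm-chat-17b",  # assumed model identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```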