GitHub - THUDM/CogVLM: a state-of-the-art-level open visual language model | a multimodal pretrained model
🚀 Dive into the world of advanced visual language modeling with GitHub's THUDM/CogVLM! 🤖✨ With 10B visual and 7B language parameters, it excels in image understanding & GUI Agent capabilities. 🎨💬 #AI #VisualLanguageModeling #CogVLM
- CogVLM is a visual language model with 10 billion visual parameters and 7 billion language parameters, excelling in image understanding and cross-modal benchmarks.
- CogAgent is an enhanced version of CogVLM with GUI Agent capabilities; it supports image understanding at a resolution of 1120x1120 and outperforms existing models on a range of benchmarks.
- CogVLM-17B and CogAgent-18B achieve top performance on several cross-modal benchmarks and possess advanced features like multi-turn dialogue, visual grounding, and GUI Agent capabilities.
- CogVLM-17B achieves state-of-the-art results on benchmarks such as NoCaps, Flickr30k, and Visual7W, while CogAgent-18B excels on VQAv2, OK-VQA, TextVQA, and more.
- CogVLM supports multi-round chat and VQA, while CogAgent can additionally answer GUI-related questions, returning a plan, the next action, and specific operations for a given GUI screenshot; a minimal chat-inference sketch is included after this list.
- Hardware requirements for model inference range from a single RTX 3090 with INT4 quantization up to an A100 (or dual A100s) for FP16, depending on the model and precision used; a quantized-loading sketch follows the list.
- Finetuning demos are available for CogAgent and CogVLM to adapt the models to specific tasks, such as Captcha recognition, with detailed steps and an evaluation procedure (a generic evaluation sketch appears below).
- OpenAI Vision-format APIs are provided for GPT-4V-style use, enabling image-grounded dialogue and continuous multi-turn interaction with the model; a client-side example appears after the list.
- In the FAQ section, troubleshooting tips are given for issues related to model downloading, tokenizer access, and model saving locations.
- The code in the repository follows the Apache-2.0 license, and the usage of CogVLM model weights must adhere to the specified Model License.
- Citation information and acknowledgements are provided, covering the CogVLM and CogAgent papers by the authors and thanking the datasets used in the fine-tuning phases.
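
As referenced above, here is a minimal chat-inference sketch for CogVLM, modeled on the Hugging Face `cogvlm-chat-hf` usage pattern. The checkpoint names, the Vicuna tokenizer, and the `build_conversation_input_ids` helper are taken from that pattern and may differ from the repository's own CLI/web demos.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

# Vicuna tokenizer and the chat checkpoint, as in the cogvlm-chat-hf usage pattern.
tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm-chat-hf",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda").eval()

# One turn of visual question answering; `history` carries earlier turns for multi-round chat.
image = Image.open("demo.jpg").convert("RGB")
inputs = model.build_conversation_input_ids(
    tokenizer, query="Describe this image.", history=[], images=[image]
)
inputs = {
    "input_ids": inputs["input_ids"].unsqueeze(0).to("cuda"),
    "token_type_ids": inputs["token_type_ids"].unsqueeze(0).to("cuda"),
    "attention_mask": inputs["attention_mask"].unsqueeze(0).to("cuda"),
    "images": [[inputs["images"][0].to("cuda").to(torch.bfloat16)]],
}

with torch.no_grad():
    outputs = model.generate(**inputs, max_length=2048, do_sample=False)
    outputs = outputs[:, inputs["input_ids"].shape[1]:]  # keep only newly generated tokens
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```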
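For the hardware note above, the following sketches 4-bit quantized loading aimed at a single 24 GB RTX 3090. It assumes the Hugging Face checkpoint accepts a standard bitsandbytes `BitsAndBytesConfig`; whether the model's custom code path supports this should be verified against the repository.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, LlamaTokenizer

# NF4 4-bit weights with bfloat16 compute, intended to fit CogVLM-17B inference in ~24 GB.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm-chat-hf",
    quantization_config=bnb_config,  # assumption: the remote-code model accepts this config
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval()
# Unquantized FP16/BF16 inference needs roughly 40 GB, hence an A100-class GPU (or two).
```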
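The finetuning item mentions an evaluation step for the Captcha-recognition demo. As a generic illustration only (not the repository's actual evaluation script), exact-match accuracy over predicted captcha strings could be computed like this:

```python
def captcha_accuracy(predictions, labels):
    """Exact-match accuracy: a captcha counts as correct only if every character matches."""
    assert len(predictions) == len(labels)
    correct = sum(p.strip().lower() == t.strip().lower() for p, t in zip(predictions, labels))
    return correct / len(labels)

# Example: three model outputs against ground-truth captcha strings -> 2/3 correct.
print(captcha_accuracy(["a3x9k", "b7Q2d", "zzzzz"], ["a3x9k", "b7q2d", "m4n8p"]))
```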
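For the OpenAI Vision-format API, here is a client-side sketch using the `openai` Python package pointed at a locally hosted server. The base URL, port, and model name are placeholders and must be adjusted to the actual deployment started from the repository's demo.

```python
import base64
from openai import OpenAI

# Hypothetical endpoint: adjust base_url and model name to match your local deployment.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

# Encode a local image as a data URL, following the OpenAI Vision message format.
with open("demo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="cogvlm-chat-17b",  # assumed model identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```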