In this report, we introduce the Gemini 1.5 family of models, the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; and (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state of the art in long-document QA, long-video QA, and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k tokens) and GPT-4 Turbo (128k tokens). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals to complete their tasks, achieving 26% to 75% time savings across 10 different job categories. We also highlight surprising new capabilities of large language models at the frontier: when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.