Hugging Face Papers Top 20 for June 2026: A Quick Read | 东毅居士


Author: XD / Published: June 30, 2026 02:48 / Updated: June 30, 2026 02:50 / Research Notes / Views: 4

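The data comes from the Hugging Face Papers monthly ranking for 2026-06, taking the top 20 papers by the upvotes shown on the page. The main threads this month are clear: world models, agent evaluation, real-time multimodal systems, long-context inference optimization, and tools for professional content generation.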

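Note: the organization field uses the organization shown on the Hugging Face page, which is not the same as the full list of author affiliations. Where it is missing, the entry is marked "No organization listed on the HF page".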

Quick stats

Papers included         20
Total upvotes           4109
Average upvotes         205.45
GitHub coverage         17 / 20
Project page coverage   13 / 20
Sort order              Hugging Face upvotes, descending
Cutoff date             2026/06/30
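The totals above can be re-derived from the per-paper upvote counts listed in the entries of this post. The short check below is only a convenience sketch; the list is copied by hand from the 20 entries, and the variable names are illustrative.

```python
# Upvote counts copied from the 20 entries in this post, in rank order.
upvotes = [485, 469, 370, 250, 236, 208, 206, 192, 171, 170,
           152, 148, 142, 139, 139, 136, 127, 126, 122, 121]

total = sum(upvotes)
mean = total / len(upvotes)

# The list should be in non-increasing order, since the ranking is by upvotes.
is_sorted_desc = all(a >= b for a, b in zip(upvotes, upvotes[1:]))

print(len(upvotes), total, mean, is_sorted_desc)
```

The sum is 4109 over 20 papers, so the mean is 4109 / 20 = 205.45.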

Top20

1. ABot-Earth 0.5: Generative 3D Earth Model

Upvotes: 485
Organization: Alibaba AMAP CV Lab
Keywords: 3D Gaussian Splatting, generative model, satellite imagery, 3D environment synthesis, level-of-detail, real-time visualization, Embodied AI, UAV navigation
Links: HF / arXiv / GitHub (175 stars) / Project

ABot-Earth 0.5 is this month's most upvoted paper. It takes satellite imagery as input and generates large-scale, interactive 3D Earth environments, with 3D Gaussian Splatting as the core representation. The paper emphasizes generating a scene in under 10 minutes per square kilometer, and uses hierarchical LOD to support real-time visualization in web map engines. The highlight is not only the visual quality but the way it connects low-cost 3D reconstruction to Embodied AI, UAV navigation, and digital-earth visualization. The project page and code are both fairly complete, which also helped it spread.

Abstract excerpt: We present ABot-Earth 0.5, a generative 3D framework designed to synthesize vast, seamless 3D environments from ubiquitous, geospatially referenced satellite imagery. To achieve this, we propose a novel generative model formulated directly with the 3D Gaussian Splatting (3DGS) representation. The model is trained on a diverse corpus of existing real-world urban reconstructions, learning to generate realistic geometry and textures. At inference, it synthesizes novel 3D scenes conditioned solely on satellite imagery...

2. Looped World Models

Upvotes: 469
Organization: FaceMind
Keywords: world models, looped architectures, latent environment states, parameter-shared transformer block, adaptive computation, iterative refinement, computational efficiency
Links: HF / arXiv

Looped World Models focuses on compute efficiency in world models. Long-horizon simulation needs deeper computation, but conventionally deepening the model raises deployment cost and compounds errors. The paper proposes repeatedly refining latent environment states with a parameter-shared Transformer block, turning depth into a loopable computation. The abstract cites up to 100x parameter efficiency, which is the eye-catching part: a model does not have to get stronger only through more parameters; a smarter iteration mechanism can also buy deeper reasoning and simulation.

Abstract excerpt: Current world models face a fundamental tension: faithful long-horizon simulation demands deep computation, but deeper models are expensive to deploy and prone to compounding errors. We resolve this by introducing Looped World Models (LoopWM), which are the first looped architectures for world modelling. Our method iteratively refines latent environment states through a parameter-shared transformer block. This yields up to 100x parameter efficiency over conventional approaches with adaptive computation that...
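To make the idea of "depth as a loop" concrete, here is a toy sketch of iterative refinement with one shared block and an early-exit rule. It is not LoopWM's architecture: the block is a plain `tanh` layer rather than a Transformer block, and the names `shared_block`, `looped_refine`, the weights, and the tolerance are all assumptions made for illustration.

```python
import math

def shared_block(state, weights, bias):
    """One application of the single shared block: tanh(W @ state + b)."""
    return [
        math.tanh(sum(w * s for w, s in zip(row, state)) + b_i)
        for row, b_i in zip(weights, bias)
    ]

def looped_refine(state, weights, bias, max_loops=64, tol=1e-4):
    """Reapply the same block until the state stops changing (or max_loops)."""
    loops_used = 0
    for _ in range(max_loops):
        new_state = shared_block(state, weights, bias)
        delta = max(abs(n - s) for n, s in zip(new_state, state))
        state = new_state
        loops_used += 1
        if delta < tol:  # adaptive computation: easy states exit early
            break
    return state, loops_used

# Hypothetical 3-dimensional latent state and small fixed weights.
W = [[0.5, -0.2, 0.1],
     [0.0, 0.3, -0.4],
     [0.2, 0.1, 0.25]]
b = [0.1, -0.1, 0.05]

# The parameter count is fixed no matter how many loops are run.
num_params = sum(len(row) for row in W) + len(b)
refined, used = looped_refine([1.0, -1.0, 0.5], W, b)
print(num_params, used, [round(v, 3) for v in refined])
```

The point of the sketch is the accounting: effective depth is `used` applications of the block, while the parameter count stays at the size of one block.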

3. Agents' Last Exam

Upvotes: 370
Organization: UC Berkeley
Keywords: AI agents, benchmark, real-world tasks, economic value, task taxonomy, industry clusters, O*NET, SOC 2018
Links: HF / arXiv / GitHub (759 stars) / Project

Agents' Last Exam is an agent benchmark aimed at real workflows. It argues that a gap remains between existing benchmarks and actual economic value, so it designs a set of long-horizon, verifiable tasks close to real work, covering 13 industry clusters and 1000+ tasks. Its value lies in moving agent evaluation from "can it answer questions" to "can it complete valuable work". For anyone trying to judge whether agents are really close to deployable, this one is worth a careful read.

Abstract excerpt: Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable...

4. Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs

Upvotes: 250
Organization: No organization listed on the HF page
Keywords: multi-agent harness, figure generation, raster outputs, editable SVGs, CraftBench, PaperBanana-Bench
Links: HF / arXiv / GitHub (135 stars)

Crafter targets scientific figure generation. Many image generators can produce raster images, but the real pain in scientific figures is structured editing: changing labels, adjusting layout, swapping colors, aligning semantic components. Crafter treats a scientific figure as a structured object made of semantic elements, uses a multi-agent harness to handle different inputs, and emphasizes editable SVG output. It is best read alongside work on research writing, paper figures, and automated content production.

Abstract excerpt: Scientific figures are among the most effective means of communicating complex research ideas, yet producing publication-quality illustrations remains one of the most labor-intensive parts of paper preparation. Existing automated systems each target a single figure type under text-only input, leaving the diversity of types and conditions researchers actually use unaddressed; their raster outputs further cannot be locally revised. Because scientific figures are structured compositions of discrete semantic...

5. On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters

Upvotes: 236
Organization: Mind Lab
Keywords: parameter-efficient fine-tuning, full fine-tuning, trainable adapters, shared foundation models, persistent local state, instance-specific behavior, shared priors, adapter identity
Links: HF / arXiv

This PEFT paper's angle is not "low-cost fine-tuning" but "persistent state for personal models". The authors treat small adapters as local state on top of a strong foundation model, carrying preferences, skills, tool habits, and memory-like updates. If every person, task, and organization eventually has its own adapter, the problem becomes managing millions of personal models at scale: how to migrate, compose, isolate, update, and govern that state.

Abstract excerpt: Parameter-efficient fine-tuning (PEFT) is usually treated as a cheaper alternative to full fine-tuning. We study a broader role: small trainable adapters as persistent local state on top of strong shared foundation models. In this framing, the base model provides shared competence while adapters carry instance-specific behavior such as preferences, skills, tool habits, and memory-like updates. We organize the problem around three scaling axes: Scale Up, where stronger shared priors make small local updates more...
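One way to picture "adapters as local state" is a single frozen base layer shared by everyone plus a small per-user update kept in a registry. The sketch below assumes a LoRA-style low-rank update, which the excerpt does not specify; the classes `SharedBase` and `Adapter`, the user ids, and all numbers are hypothetical.

```python
def matvec(matrix, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

class SharedBase:
    """Frozen weights shared by every user (a hypothetical 4x4 linear layer)."""
    def __init__(self, weight):
        self.weight = weight

class Adapter:
    """Low-rank update B @ A held as per-user local state (LoRA-style)."""
    def __init__(self, a, b):
        self.a = a  # shape (rank, d_in)
        self.b = b  # shape (d_out, rank)

    def num_params(self):
        return sum(len(r) for r in self.a) + sum(len(r) for r in self.b)

def forward(base, adapter, x):
    """y = W x + B (A x); the base weights are never modified."""
    y = matvec(base.weight, x)
    delta = matvec(adapter.b, matvec(adapter.a, x))
    return [yi + di for yi, di in zip(y, delta)]

# Identity matrix as a stand-in for the shared base layer.
base = SharedBase([[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)])

# Hypothetical registry: one rank-1 adapter per user id.
adapters = {
    "alice": Adapter(a=[[1.0, 0.0, 0.0, 0.0]], b=[[0.5], [0.0], [0.0], [0.0]]),
    "bob": Adapter(a=[[0.0, 1.0, 0.0, 0.0]], b=[[0.0], [-1.0], [0.0], [0.0]]),
}

x = [2.0, 3.0, 0.0, 1.0]
out_alice = forward(base, adapters["alice"], x)
out_bob = forward(base, adapters["bob"], x)
print(out_alice, out_bob, adapters["alice"].num_params())
```

Each user's behavior differs while the base stays shared, and the per-user state here is 8 numbers against 16 in the base layer. The management questions the paper raises (migrating, composing, isolating adapters) would sit around a registry like `adapters`.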

6. LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling

Upvotes: 208
Organization: No organization listed on the HF page
Keywords: Looped Transformers, parallel loop Transformers, cross-loop position offsets, shared-KV gated sliding-window attention, loop-count selection, LoopCoder-v2, instruction tuning, SWE-bench
Links: HF / arXiv

LoopCoder-v2 studies test-time computation scaling in code models. Looped Transformers can add computational depth by repeatedly applying shared blocks, but sequential looping increases latency and KV-cache cost. The paper analyzes the gains and costs of parallel loop Transformers, noting that an extra loop refines representations but also introduces costs such as positional mismatch, and ends up proposing the efficient choice of "looping only once". It pairs well with Looped World Models.

Abstract excerpt: Looped Transformers scale latent computation by repeatedly applying shared blocks, but sequential looping increases latency and KV-cache memory with the loop count. Parallel loop Transformers (PLT) alleviate this cost through cross-loop position offsets (CLP) and shared-KV gated sliding-window attention, making loop count a practical design choice. We therefore study PLT loop-count selection through a gain--cost view: an extra loop may refine representations, but CLP also introduces a positional mismatch at each...

7. JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

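Upvotes: 206
Organization: JD.com Open Source
Keywords: vision-language model, real-time interaction, autonomous decision-making, vision-triggered responsiveness, time awareness, background model, deployable system, video streaming
Links: HF / arXiv / GitHub (848 stars) / Project

JoyAI-VL-Interaction focuses on real-time vision-language interaction. Most conventional multimodal models are turn-based: the user asks, the model answers. In real scenarios, though, events such as a fire, a change of expression, or a product appearing in a livestream do not wait for someone to ask. JoyAI-VL-Interaction tries to let the model watch the environment continuously and decide on its own when to respond and when to hand off to a background model. It has a strong engineering flavor and suits readers interested in real-time multimodal systems, video-stream understanding, and proactive interaction.

Abstract excerpt: Many moments in the real world do not wait for a user to ask. A fire starts on a security monitor, an expression flickers across a video call, or a product a viewer wants flashes by in a livestream. Yet today's large models remain mostly turn-based by design: they answer only when addressed, and even video-call apps that appear interactive still operate as question-answer systems, reacting only when polled or prompted. We argue for a different paradigm: a model that is present in the world like a person. It...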

8. Kwai Keye-VL-2.0 Technical Report

Upvotes: 192
Organization: Kwai Keye
Keywords: Mixture-of-Experts, multimodal foundation model, DeepSeek Sparse Attention, GQA-based architectures, 256K context processing, heterogeneous ViT-LM parallelism, custom DSA kernels, Cross-Modal Multi-Teacher On-Policy Distillation
Links: HF / arXiv / GitHub (798 stars) / Project

Kwai Keye-VL-2.0 is an open-source MoE multimodal foundation model focused on long-video understanding and agentic intelligence. The key point in the abstract is adapting DeepSeek Sparse Attention to GQA-based multimodal architectures, so the model supports 256K context and handles key frames and long-range dependencies in long videos. It represents the shift of multimodal models from single-image Q&A to long-video, long-context, agentic use cases.

Abstract excerpt: We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. To address the challenges of ultra-long contexts, information redundancy, and prohibitive computational costs inherent in hour-level videos, Keye-VL-2.0 is the first to adapt DeepSeek Sparse Attention (DSA) to GQA-based multimodal architectures, enabling lossless 256K context processing while capturing critical frames and long-range...

9. MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision

Upvotes: 171
Organization: No organization listed on the HF page
Keywords: personalized presentation agents, hierarchical memory framework, long-term memory, working memory, user profile memory, tool memory, round-0 personalization, session constraints
Links: HF / arXiv / GitHub (347 stars) / Project

MemSlides tackles personalized slide generation. Its key idea is the memory design: long-term user preferences, current session constraints, and tool experience are managed separately. The problem is very practical, because a slide deck is not a one-shot artifact and users often make many rounds of local edits. If the agent forgets preferences, or fixing one spot breaks everything around it, the tool will struggle to fit into real office workflows. MemSlides' hierarchical memory framework addresses exactly this pain point.

Abstract excerpt: Personalized presentation generation requires more than conditioning on a current prompt or template: agents must preserve stable user preferences across tasks, retain newly introduced preferences and constraints during multi-turn revision, and carry out local edits reliably. We propose MemSlides, a hierarchical memory framework for personalized presentation agents that separates long-term memory from working memory and further divides long-term memory into user profile memory and tool memory. User profile memory...
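The memory split described above (long-term vs. working, with long-term divided into user profile and tool memory) maps naturally onto a small data structure. The sketch below is only an illustration of that layout, not MemSlides' implementation: the class names, the field contents, and the rule that session constraints override the stored profile are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class LongTermMemory:
    user_profile: dict = field(default_factory=dict)  # stable preferences
    tool_memory: dict = field(default_factory=dict)   # lessons about tools

@dataclass
class WorkingMemory:
    session_constraints: dict = field(default_factory=dict)  # this session only

@dataclass
class SlideAgentMemory:
    long_term: LongTermMemory = field(default_factory=LongTermMemory)
    working: WorkingMemory = field(default_factory=WorkingMemory)

    def effective_style(self):
        """Merge preferences; session constraints win over the stored profile."""
        merged = dict(self.long_term.user_profile)
        merged.update(self.working.session_constraints)
        return merged

    def end_session(self):
        """Drop session state; long-term memory is left untouched."""
        self.working = WorkingMemory()

memory = SlideAgentMemory()
memory.long_term.user_profile.update({"font": "Inter", "theme": "dark"})
memory.long_term.tool_memory["chart_tool"] = "axis labels need explicit units"
memory.working.session_constraints["theme"] = "light"  # a one-off request

during_session = memory.effective_style()
memory.end_session()
after_session = memory.effective_style()
print(during_session, after_session)
```

Keeping the one-off "light" theme out of the profile is what lets the next deck start from the user's stable preferences again.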

10. Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

Upvotes: 170
Organization: No organization listed on the HF page
Keywords: Embodied Foundation Model, embodied cognition, task planning, correction, pointing, data construction pipelines, multi-task balanced RL, Planner-Grounder-Corrector framework
Links: HF / arXiv / GitHub (37 stars) / Project

Embodied-R1.5 is a unified Embodied Foundation Model covering embodied cognition, task planning, correction, and pointing. The abstract says it uses automated data construction pipelines to broaden capability coverage and multi-task balanced RL to ease the training imbalance across tasks. It reads as a systematic push for embodied intelligence: the model has to understand the environment and also plan, correct errors, point, and act.

Abstract excerpt: We introduce Embodied-R1.5, a unified Embodied Foundation Model (EFM) that integrates comprehensive embodied reasoning capabilities, spanning embodied cognition, task planning, correction, and pointing, within a single architecture toward general physical intelligence. Leveraging three automated data construction pipelines to significantly expand the data coverage of critical capabilities, we build a large-scale data system of over 15B tokens, and design a multi-task balanced RL recipe to alleviate heterogeneous...

11. Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Upvotes: 152
Organization: Shanghai Jiao Tong University
Keywords: speculative decoding, autoregressive drafters, parallel drafters, causal dependencies, draft quality, drafting cost, token drafting, parallel backbone
Links: HF / arXiv / GitHub (98 stars)

Domino is an inference-acceleration paper in the speculative decoding line. Current methods usually trade off draft quality against drafting cost: autoregressive drafters give higher quality but are slow, while parallel drafters are fast but model dependencies weakly. Domino's idea is to decouple causal dependency modeling from autoregressive drafting, pairing a parallel backbone with a lightweight causal refinement head to improve real throughput. Recommended for readers who follow LLM serving and inference optimization.

Abstract excerpt: Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost: autoregressive drafters model causal dependencies among draft tokens but incur sequential overhead, while parallel drafters reduce drafting cost but weaken intra-block dependency modeling. In this paper, we propose Domino, a speculative decoding framework that decouples causal...
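For readers new to the topic, the draft-then-verify loop that all of these methods build on can be sketched in a few lines. This is generic greedy speculative decoding with toy stand-in models, not Domino's decoupled design: `target_next`, `draft_next`, and the block size are hypothetical, and a real system would run the verification step as one batched forward pass of the target model.

```python
def target_next(prefix):
    """Stand-in target model: deterministic greedy next token (hypothetical)."""
    return (sum(prefix) * 31 + len(prefix)) % 50

def draft_next(prefix):
    """Stand-in cheap drafter: agrees with the target except on some prefixes."""
    guess = target_next(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50

def greedy_decode(prompt, n_new):
    """Reference: plain token-by-token decoding with the target model."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out

def speculative_decode(prompt, n_new, block=4):
    out = list(prompt)
    target_len = len(prompt) + n_new
    accepted = drafted = 0
    while len(out) < target_len:
        # 1) Draft a block of tokens with the cheap model.
        draft = []
        for _ in range(block):
            draft.append(draft_next(out + draft))
        drafted += len(draft)
        # 2) Verify the draft against the target, left to right.
        for tok in draft:
            if len(out) == target_len:
                break
            expected = target_next(out)
            out.append(expected)  # always emit the target's own token
            if tok != expected:
                break             # the rest of the draft is discarded
            accepted += 1
    return out, accepted, drafted

spec_out, n_accepted, n_drafted = speculative_decode([1, 2, 3], 20)
ref_out = greedy_decode([1, 2, 3], 20)
print(spec_out == ref_out, n_accepted, n_drafted)
```

The output always matches plain greedy decoding; the speedup in practice comes from how many drafted tokens are accepted per target pass, which is exactly the draft quality versus drafting cost trade-off the paper targets.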

12. MiniMax Sparse Attention

Upvotes: 148
Organization: MiniMax
Keywords: sparse attention, Grouped Query Attention, Top-k selection, blockwise sparse attention, tensor-core utilization, prefill, decoding, attention compute
Links: HF / arXiv / GitHub (365 stars)

MiniMax Sparse Attention targets ultra-long context. Agent workflows, repository-level code understanding, and long-term memory all push context to hundreds of thousands or even millions of tokens, where the cost of standard attention becomes hard to bear. MSA builds blockwise sparse attention on top of GQA, uses a lightweight Index Branch to select the key KV blocks, and emphasizes GPU execution efficiency. Its focus is not just proposing a sparsity pattern but pushing long-context capability toward something deployable.

Abstract excerpt: Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attention makes this untenable at deployment scale. We introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention built upon Grouped Query Attention (GQA). A lightweight Index Branch scores key-value blocks and independently selects a...
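The general shape of blockwise top-k sparse attention can be shown with a single query in pure Python. This is a toy sketch, not MSA: the block score here is simply the query dotted with the mean key of each block, standing in for the paper's learned Index Branch, and GQA, batching, and GPU kernels are all left out. Function names and data are made up for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, keys, values):
    """Dense softmax attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def block_sparse_attend(query, keys, values, block_size, top_k):
    """Score each KV block cheaply, keep the top_k blocks, attend only there."""
    n_blocks = math.ceil(len(keys) / block_size)
    scores = []
    for b in range(n_blocks):
        blk = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(col) / len(blk) for col in zip(*blk)]
        scores.append((dot(query, mean_key), b))  # assumed scoring rule
    chosen = sorted(b for _, b in sorted(scores, reverse=True)[:top_k])
    idx = [i for b in chosen
           for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    out = attend(query, [keys[i] for i in idx], [values[i] for i in idx])
    return out, chosen

query = [1.0, 0.0]
keys = [[0.1, 0.0], [0.0, 1.0], [2.0, 0.0], [1.5, 0.5],
        [-1.0, 0.0], [0.0, 0.0], [1.0, 1.0], [0.8, 0.0]]
values = [[float(i), 1.0] for i in range(8)]

sparse_out, chosen_blocks = block_sparse_attend(query, keys, values,
                                                block_size=2, top_k=2)
print(chosen_blocks, sparse_out)
```

With `top_k` equal to the number of blocks the result is the same as dense attention, so the sparsity level is a direct knob on how much of the KV cache each query touches.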

13. EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Upvotes: 142
Organization: Massachusetts Institute of Technology
Keywords: large language model agents, EvoArena, EvoMem, memory evolution, environment changes, progressive updates, terminal domain, software domain
Links: HF / arXiv / GitHub (19 stars) / Project

EvoArena looks at how agent memory evolves in dynamic environments. Many agent benchmarks assume a static environment, but in real deployment the tools, rules, user preferences, and task conditions keep changing. EvoArena models environment changes as progressive updates and proposes the EvoMem memory paradigm. It reads well together with Agents' Last Exam and SWE-Explore: all three break agent evaluation down from the final answer into process-level capabilities.

Abstract excerpt: Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to continually align their knowledge, skills, and behavior with changing environments and updated task conditions. To address this gap, we introduce EvoArena, a benchmark suite that models environment changes as sequences of progressive updates across terminal, software, and social domains. We...

14. Qwen-AgentWorld: Language World Models for General Agents

Upvotes: 139
Organization: Qwen
Keywords: world model, language models, agentic environment simulation, long chain-of-thought reasoning, foundation models, state transition dynamics, next-state-prediction reasoning, reinforcement learning
Links: HF / arXiv / GitHub (662 stars)

Qwen-AgentWorld follows the language world model route. It applies a world model to agent environment simulation, having the model predict the next state from observations and actions, and uses simulation to improve downstream agent performance. Compared with physical world models, it leans toward language environments and task state transitions. The direction matters: if agents can cheaply try things out and predict consequences in a language environment, planning ability becomes easier to scale.

Abstract excerpt: A world model predicts environment dynamics based on current observations and actions, serving as a core cognitive mechanism for reasoning and planning. In this work, we investigate how world modeling based on language models can further push the boundaries of general agents. (i) We first focus on building foundation models for agentic environment simulation. We introduce Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B, the first language world models capable of simulating agentic environments covering 7...

15. Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

Upvotes: 139
Organization: No organization listed on the HF page
Keywords: diffusion backbone, Local-λ Mix Interaction, LλMI block, spatial contexts, global semantic priors, fixed-size linear matrices, representation bottleneck, adaptive multi-granularity distillation
Links: HF / arXiv / GitHub (422 stars) / Project

Moebius is a lightweight image inpainting framework that aims to approach 10B-level industrial models with 0.2B parameters. It uses the Local-λ Mix Interaction block and adaptive multi-granularity distillation to relieve the representation bottleneck that appears after compressing to a small model. Image inpainting is a high-frequency practical task, so if a small model can deliver high quality at low latency, the value shows up directly in deployment cost and interactive experience.

Abstract excerpt: While 10B-level industrial foundation models have pushed the boundaries of image inpainting, their prohibitive computational costs severely hinder practical deployment. Constructing a highly optimized task-specific specialist offers a promising solution; however, extreme structural compression inevitably triggers a severe representation bottleneck. To conquer this, we propose Moebius, a highly efficient lightweight inpainting framework. We systematically reconstruct the diffusion backbone by introducing the...

16. Cosmos 3: Omnimodal World Models for Physical AI

Upvotes: 136
Organization: NVIDIA
Keywords: omnimodal world models, mixture-of-transformers architecture, Physical AI, vision-language models, video generators, world simulators, world-action models, embodied agents
Links: HF / arXiv / GitHub (10717 stars) / Project

Cosmos 3 is NVIDIA's omnimodal world model, supporting unified processing and generation of language, image, video, audio, and action sequences. It puts the VLM, video generator, world simulator, and world-action model into one mixture-of-transformers framework. It ranks only 16th by HF upvotes, but its GitHub star count is by far the highest on this list, which suggests a strong engineering ecosystem and reuse value.

Abstract excerpt: We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a...

17. Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories

Upvotes: 127
Organization: University of Oxford
Keywords: multi-agent framework, data journalism, evidence-grounded, multimodal generation, verifiability, article generation, data storytelling
Links: HF / arXiv / GitHub (131 stars) / Project

Data Journalist Agent targets the automation of data journalism. It does not simply write an article; it organizes data analysis, context understanding, visual presentation, and the evidence chain into a multi-agent workflow. The core of data journalism is trustworthiness: conclusions must trace back to data, charts must be explainable, and the story must have evidence. It belongs to the same trend as Crafter and MemSlides: AI is starting to enter the full pipeline of professional content production.

Abstract excerpt: Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an angle, and designing visuals. Recent agents handle individual steps well: data-science agents close the analysis loop, while design agents synthesize beautiful websites. But can an agent serve as a data journalist end to end? We introduce Data Journalist Agent (Data2Story), a...

18. Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

Upvotes: 126
Organization: University of Washington
Keywords: Vision language models, spatial reasoning, imaginative perception, Imaginative Perception Tokens, perspective taking, path tracing, multiview counting, BAGEL
Links: HF / arXiv / GitHub (99 stars) / Project

Imaginative Perception Tokens addresses spatial reasoning in VLMs. Many spatial problems require imagining unseen viewpoints, inferring occluded paths, or integrating information from multiple views, and pure text reasoning is not always enough. IPT introduces intermediate perceptual representations that externalize what the model "would see" from an alternative viewpoint. Its significance is that spatial reasoning becomes more than a chain of language and gains intermediate states closer to perception.

Abstract excerpt: Vision language models (VLMs) excel at many tasks but still struggle with spatial reasoning when critical information is not directly observable. Many such problems require imaginative perception: inferring what would be seen from an unseen viewpoint, tracing paths through occluded spaces, or integrating partial observations into a coherent spatial representation. We introduce Imaginative Perception Tokens (IPT), intermediate perceptual representations that externalize what a VLM would perceive under alternative...

19. Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

Upvotes: 122
Organization: NLPIR Lab @ RUC
Keywords: autonomous research, long horizons, Hypothesis Tree Refinement, coordinator, executors, worktrees, iterative experimentation, research artifact
Links: HF / arXiv / GitHub (831 stars) / Project

Arbor, the framework this paper introduces, studies autonomous research. It views research as a long-running loop of exploration, experimentation, and abstraction, and proposes Hypothesis Tree Refinement: one long-lived coordinator, multiple short-lived executors, and a persistent hypothesis tree that stores hypotheses, artifacts, evidence, and lessons. The framework fits the AI research assistant direction well, because real research is not a single answer but sustained trial and error and accumulation of evidence.

Abstract excerpt: Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and...

20. SWE-Explore: Benchmarking How Coding Agents Explore Repositories

Upvotes: 121
Organization: Shanghai Jiao Tong University
Keywords: repository exploration, coding agents, SWE-bench, SWE-Explore, line budget, code localization, context retrieval, repository understanding
Links: HF / arXiv / GitHub (22 stars) / Project

SWE-Explore singles out a coding agent's repository exploration ability for evaluation. SWE-bench usually only checks whether the issue ends up resolved, but a failure may come from not finding the right files, poor context retrieval, inaccurate code localization, or a wrong bug diagnosis. SWE-Explore asks the agent to return a ranked list of relevant code regions within a line budget, which evaluates repository exploration more directly. For coding agents this ability is fundamental.

Abstract excerpt: Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding agents. Yet they usually treat coding tasks as a holistic, binary prediction problem (e.g., resolved or unresolved), neglecting fine-grained agent capabilities such as repository understanding, context retrieval, code localization, and bug diagnosis. In this paper, we introduce SWE-Explore, a benchmark that isolates the evaluation of repository exploration, a critical capability of coding agents. Given a...
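To see what "a ranked list of code regions within a line budget" could mean operationally, here is one plausible way to score such a list. The excerpt does not give the benchmark's actual metric, so this is a hypothetical recall-under-budget measure: the function name, the region format, and the rule of stopping at the first region that does not fit are all assumptions.

```python
def recall_within_budget(ranked_regions, gold_lines, line_budget):
    """Score a ranked list of code regions under a total line budget.

    ranked_regions: list of (path, start, end) with inclusive line numbers,
        best guess first.
    gold_lines: set of (path, line) pairs that are actually relevant.
    Returns (recall over gold_lines, number of lines spent).
    """
    seen = set()
    used = 0
    for path, start, end in ranked_regions:
        size = end - start + 1
        if used + size > line_budget:
            break  # stop at the first region that does not fit
        used += size
        seen.update((path, ln) for ln in range(start, end + 1))
    hit = len(seen & gold_lines)
    recall = hit / len(gold_lines) if gold_lines else 0.0
    return recall, used

# Hypothetical agent output and ground truth.
ranked = [("a.py", 10, 19), ("b.py", 1, 5), ("a.py", 100, 149)]
gold = {("a.py", 12), ("a.py", 13), ("b.py", 3), ("c.py", 7)}

loose = recall_within_budget(ranked, gold, line_budget=20)
tight = recall_within_budget(ranked, gold, line_budget=12)
print(loose, tight)
```

A budget like this rewards agents that point at small, precise regions early in the ranking, instead of dumping whole files into context.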

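Summary

This Top 20 reads less like a pure model leaderboard and more like a checklist of system capabilities: world models simulate the environment, agent benchmarks measure long-horizon work, inference optimization makes long context and multi-round computation actually run, and multimodal and content-generation tools push models toward real interaction and professional workflows. If you only pick a few for a deep read, I would recommend ABot-Earth 0.5, Looped World Models, Agents' Last Exam, MiniMax Sparse Attention, Qwen-AgentWorld, and SWE-Explore.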

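Author: XD. Please credit the source when reposting: http://www.eadst.com/blog/333

This site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.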
