Trending Papers

by AK and the research community

Submitted by daixufang

Agent Lightning: Train ANY AI Agents with Reinforcement Learning

Agent Lightning is a flexible RL framework for training the LLMs that power arbitrary AI agents. It uses a hierarchical RL algorithm and decouples agent execution from training to handle complex interactions.

8 authors · Aug 5, 2025
Submitted by taesiri

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

PaddleOCR-VL, a vision-language model combining a NaViT-style visual encoder with an ERNIE-4.5 language model, achieves state-of-the-art performance in document parsing with minimal resource consumption.

PaddlePaddle · Oct 16, 2025
Submitted by akhaliq

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

LlamaFactory is a unified framework enabling efficient fine-tuning of large language models across various tasks using a web-based user interface.

5 authors · Mar 20, 2024
Submitted by taesiri

Kimi Linear: An Expressive, Efficient Attention Architecture

Kimi Linear, a hybrid linear attention architecture that combines Kimi Delta Attention with Multi-Head Latent Attention, outperforms full attention in various scenarios while running more efficiently.

moonshotai Moonshot AI · Oct 30, 2025

OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

A novel GPT-based model, OmniFlatten, enables real-time natural full-duplex spoken dialogue through a multi-stage post-training technique that integrates speech and text without altering the original model's architecture.

9 authors · Oct 23, 2024
Submitted by cccczshao

Continuous Autoregressive Language Models

Continuous Autoregressive Language Models (CALM) improve language model efficiency by predicting continuous vectors instead of discrete tokens, reducing computational cost while maintaining performance.

tencent Tencent · Oct 31, 2025
Submitted by xinlongwang

Emu3.5: Native Multimodal Models are World Learners

Emu3.5, a large-scale multimodal world model, predicts next states in vision and language, enhanced with reinforcement learning and Discrete Diffusion Adaptation for efficient inference, achieving strong performance in various multimodal tasks.

IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

IndexTTS, an enhanced text-to-speech system combining XTTS and Tortoise models, offers improved naturalness, enhanced voice cloning, and controllable usage through hybrid character-pinyin modeling and optimized vector quantization.

5 authors · Feb 8, 2025
Submitted by KevinQHLin

VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

VCode introduces a benchmark for generating SVG code from images to preserve symbolic meaning, highlighting gaps in visual-centric coding and proposing VCoder to improve performance.

CSU-JPG Jinpeng Group · Nov 4, 2025
Submitted by taesiri

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

MinerU2.5, a 1.2B-parameter document parsing vision-language model, achieves state-of-the-art recognition accuracy with computational efficiency through a coarse-to-fine parsing strategy.

61 authors · Sep 26, 2025
Submitted by zhangshaolei

DeepAnalyze: Agentic Large Language Models for Autonomous Data Science

DeepAnalyze-8B, an agentic LLM, autonomously completes the data science pipeline from raw data to research reports using curriculum-based training and data-grounded trajectory synthesis.

RUC-DataLab · Oct 19, 2025

TradingAgents: Multi-Agents LLM Financial Trading Framework

A multi-agent framework using large language models for stock trading simulates real-world trading firms, improving performance metrics like cumulative returns and Sharpe ratio.

4 authors · Dec 28, 2024
Submitted by wanderkid

MinerU: An Open-Source Solution for Precise Document Content Extraction

MinerU is an open-source tool that enhances document content extraction using fine-tuned models and pre/postprocessing rules across diverse document types.

18 authors · Sep 27, 2024
Submitted by akhaliq

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Mem0, a memory-centric architecture with graph-based memory, enhances long-term conversational coherence in LLMs by efficiently extracting, consolidating, and retrieving information, outperforming existing memory systems in terms of accuracy and computational efficiency.

5 authors · Apr 28, 2025

Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Zep, a memory layer service, outperforms MemGPT on the DMR and LongMemEval benchmarks by excelling in dynamic knowledge integration and temporal reasoning, both critical for enterprise use cases.

5 authors · Jan 20, 2025
Submitted by richardxp888

WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

WebWatcher, a multimodal agent with enhanced visual-language reasoning, outperforms existing agents in complex visual and textual information retrieval tasks using synthetic trajectories and reinforcement learning.

Alibaba-NLP · Aug 7, 2025
Submitted by callanwu

WebDancer: Towards Autonomous Information Seeking Agency

The paper proposes a framework for building end-to-end information-seeking agents through a combination of data construction, trajectory sampling, supervised fine-tuning, and reinforcement learning, and demonstrates its effectiveness on information-seeking benchmarks.

12 authors · May 28, 2025
Submitted by callanwu

WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization

WebShaper, a formalization-driven framework, synthesizes information-seeking datasets using set theory and Knowledge Projections to enhance reasoning structure, achieving top performance among open-source agents on information-seeking benchmarks.

Alibaba-NLP · Jul 20, 2025
Submitted by callanwu

Scaling Agents via Continual Pre-training

AgentFounder, a deep research agent model incorporating Agentic Continual Pre-training, achieves state-of-the-art performance in agentic tasks while maintaining strong tool-use ability.

22 authors · Sep 16, 2025
Submitted by taesiri

WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research

WebWeaver, a dual-agent framework, addresses open-ended deep research challenges by integrating adaptive planning and focused synthesis to produce high-quality, reliable reports.

12 authors · Sep 16, 2025
Submitted by callanwu

WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning

WebSailor, a post-training methodology, enhances open-source models with systematic uncertainty reduction, matching proprietary agents' performance in complex information-seeking tasks.

17 authors · Sep 16, 2025
Submitted by callanwu

ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization

ReSum, a novel paradigm with periodic context summarization, enhances web agents' performance on knowledge-intensive tasks by overcoming context window limitations, achieving significant improvements over ReAct.

14 authors · Sep 16, 2025
Submitted by learn3r

WebSailor: Navigating Super-human Reasoning for Web Agent

WebSailor, a post-training methodology, enhances open-source LLMs with sophisticated reasoning to match proprietary systems in complex information-seeking tasks.

19 authors · Jul 3, 2025

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

PyTorch Fully Sharded Data Parallel (FSDP) enables efficient and scalable training of large models across hardware configurations.

16 authors · Apr 21, 2023

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

The PyTorch distributed data parallel module optimizes large-scale model training using techniques like gradient bucketing, computation-communication overlap, and selective synchronization to achieve near-linear scalability.

11 authors · Jun 28, 2020
Submitted by taesiri

The Denario project: Deep knowledge AI agents for scientific discovery

Denario, an AI multi-agent system, performs various scientific research tasks and generates papers across multiple disciplines, demonstrating its capabilities and limitations through expert evaluations.

36 authors · Oct 30, 2025
Submitted by taesiri

AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications

AgentScope enhances agentic applications by providing flexible tool-based interactions, unified interfaces, and advanced infrastructure based on the ReAct paradigm, supporting efficient and safe development and deployment.

23 authors · Aug 22, 2025

olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models

olmOCR is an open-source toolkit using a fine-tuned vision language model to process PDFs into clean text while preserving structure, optimized for large-scale batch processing.

9 authors · Feb 25, 2025

MedRAX: Medical Reasoning Agent for Chest X-ray

MedRAX, a versatile AI agent integrating advanced CXR analysis tools and multimodal large language models, achieves top performance in a wide range of medical queries without extra training.

5 authors · Feb 4, 2025
Submitted by dyyyyyyyy

FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning

Flawed-Aware Policy Optimization (FAPO) enhances reinforcement learning with verifiable rewards by penalizing flawed-positive rollouts, improving reasoning capability and training stability in large language models.

6 authors · Oct 26, 2025
Submitted by xw-eric

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Agent S2, a compositional framework using Mixture-of-Grounding and Proactive Hierarchical Planning, achieves state-of-the-art performance in computer use automation across various benchmarks and operating systems.

simular-ai Simular · Apr 1, 2025
Submitted by xw-eric

The Unreasonable Effectiveness of Scaling Agents for Computer Use

Behavior Best-of-N (bBoN) improves the reliability and success rates of computer-use agents by generating and selecting among multiple rollouts using behavior narratives, achieving state-of-the-art performance on OSWorld and strong generalization to different operating systems.

simular-ai Simular · Oct 2, 2025
Submitted by jinjieni

Diffusion Language Models are Super Data Learners

Diffusion language models outperform autoregressive models in low-data settings due to any-order modeling, iterative bidirectional denoising, and Monte Carlo augmentation, and maintain advantages even at scale.

Submitted by YuWangX

MIRIX: Multi-Agent Memory System for LLM-Based Agents

MIRIX, a modular multi-agent memory system, enhances language models' memory capabilities by integrating diverse memory types and a dynamic framework, achieving superior performance in multimodal and long-form conversation benchmarks.

2 authors · Jul 10, 2025
Submitted by Rbin

RAG-Anything: All-in-One RAG Framework

RAG-Anything is a unified framework that enhances multimodal knowledge retrieval by integrating cross-modal relationships and semantic matching, outperforming existing methods on complex benchmarks.

Submitted by hiyouga

Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents

A framework called Easy Dataset synthesizes fine-tuning data from unstructured documents using a GUI and LLMs, improving domain-specific performance of LLMs while maintaining general knowledge.

7 authors · Jul 5, 2025
Submitted by lixiaoxi45

DeepAgent: A General Reasoning Agent with Scalable Toolsets

DeepAgent, an end-to-end deep reasoning agent, autonomously performs thinking, tool discovery, and action execution using memory folding and reinforcement learning, outperforming baselines in various tool-use and application tasks.

11 authors · Oct 24, 2025
Submitted by Weiyun1025

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

InternVL3 is a multimodal pre-trained language model that jointly learns from both multimodal data and text, improving performance and scalability through advanced techniques and setting a new state-of-the-art in multimodal tasks.

47 authors · Apr 14, 2025
Submitted by taesiri

LongCat-Video Technical Report

LongCat-Video, a 13.6B-parameter video generation model based on the Diffusion Transformer framework, excels in efficient, high-quality long video generation across multiple tasks using a unified architecture, coarse-to-fine generation, and block sparse attention.

11 authors · Oct 25, 2025
Submitted by taesiri

Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing

Pico-Banana-400K is a large-scale, high-quality dataset for instruction-based image editing, featuring diverse edit pairs, multi-turn editing, preference subsets, and long-short instruction pairs, enabling comprehensive research and benchmarking.

apple Apple · Oct 22, 2025
Submitted by fuvty

Cache-to-Cache: Direct Semantic Communication Between Large Language Models

Cache-to-Cache (C2C) enables direct semantic communication between LLMs using neural network projections, improving accuracy and reducing latency compared to text-based communication.

nics-efc Tsinghua-NICS-EFC · Oct 3, 2025
Submitted by zoeyuchao

RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training

RLinf-VLA is a unified framework for scalable reinforcement learning training of vision-language-action models, offering improved performance and generalization compared to supervised fine-tuning.

RLinf · Oct 8, 2025

LightRAG: Simple and Fast Retrieval-Augmented Generation

LightRAG improves Retrieval-Augmented Generation by integrating graph structures for enhanced contextual awareness and efficient information retrieval, achieving better accuracy and response times.

5 authors · Oct 8, 2024
Submitted by UglyToilet

MemOS: A Memory OS for AI System

MemOS, a memory operating system for Large Language Models, addresses memory management challenges by unifying plaintext, activation-based, and parameter-level memories, enabling efficient storage, retrieval, and continual learning.

39 authors · Jul 4, 2025
Submitted by taesiri

ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

ThinkMorph, a unified model fine-tuned on interleaved reasoning traces, enhances multimodal reasoning by generating coherent text-image steps, achieving significant performance gains and demonstrating emergent capabilities.

8 authors · Oct 30, 2025
Submitted by GuyYariv

DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion

Dynamic Position Extrapolation (DyPE) enhances ultra-high-resolution image generation by dynamically adjusting positional encodings in pre-trained diffusion transformers, achieving state-of-the-art fidelity without additional sampling cost.

Submitted by Seongyun

Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

PaperCoder is a multi-agent LLM framework that converts machine learning papers into functional code repositories through planning, analysis, and generation stages.

4 authors · Apr 24, 2025
Submitted by akhaliq

3D Gaussian Splatting for Real-Time Radiance Field Rendering

A method using 3D Gaussians for scene representation and optimized rendering allows high-quality, real-time novel-view synthesis at 1080p resolution.

4 authors · Aug 8, 2023
Submitted by nielsr

DINOv3

DINOv3, a self-supervised learning model, achieves superior performance across various vision tasks by scaling datasets and models, addressing dense feature degradation, and enhancing flexibility with post-hoc strategies.

facebook AI at Meta · Aug 13, 2025

MarS: a Financial Market Simulation Engine Powered by Generative Foundation Model

Large Market Model (LMM) and MarS simulate realistic financial market interactions, addressing scalability and realism in financial applications.

7 authors · Sep 4, 2024