Conversation


@Shixiaowei02 Shixiaowei02 commented Mar 26, 2024

  • Model Support
    • Support Smaug-72B-v0.1
  • Features
    • Medusa IFB support
    • [Experimental] Support FP8 FMHA; note that performance is not yet optimal, and we will keep optimizing it
    • [BREAKING CHANGE] Support embedding sharing for Gemma
  • API
    • [BREAKING CHANGE] Remove LoRA related parameters from convert checkpoint scripts
    • [BREAKING CHANGE] Simplify Qwen convert checkpoint script
    • Add advanced and multi-GPU examples for Python binding of executor C++ API, see examples/bindings/README.md
    • High-level API, please refer to examples/high-level-api/README.md for a guide
      • [BREAKING CHANGE] Reuse the QuantConfig used in the trtllm-build tool, supporting broader quantization features
      • Add support for TensorRT-LLM checkpoints as model input
      • Refine the SamplingConfig used in the LLM.generate and LLM.generate_async APIs, with support for beam search, a variety of penalties, and more features
      • Add support for the StreamingLLM feature, enabled by setting LLM(streaming_llm=...)
  • Bug fixes
  • Benchmark
    • Add percentile latency report to gptManagerBenchmark
  • Performance
    • Improve custom all-reduce kernel
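Among the penalties the refined SamplingConfig supports is a repetition penalty. As a rough illustration of how the widely used CTRL-style rule acts on logits (a hypothetical sketch for intuition, not TensorRT-LLM's implementation or API):

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Discourage already-generated tokens (CTRL-style rule).

    For each token already seen, a positive logit is divided by `penalty`
    and a negative logit is multiplied by it, so any penalty > 1.0 lowers
    the probability of repeating that token.
    """
    out = list(logits)
    for token_id in set(generated_ids):
        if out[token_id] > 0:
            out[token_id] /= penalty
        else:
            out[token_id] *= penalty
    return out
```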
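The percentile latency report added to gptManagerBenchmark summarizes per-request latencies at points such as p50/p90/p99. A nearest-rank percentile, the convention commonly used for such reports, can be sketched as follows (a hypothetical helper, not the benchmark's code):

```python
import math

def percentile(latencies_ms, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not latencies_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]
```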
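The custom all-reduce kernel implements the collective that sums a tensor across GPUs and leaves the full result on every rank. For readers unfamiliar with the operation, here is a single-process simulation of the classic ring all-reduce (conceptual only; the actual kernel is CUDA and optimized for latency, not this algorithm verbatim):

```python
def ring_allreduce(rank_buffers):
    """Sum-reduce equal-length buffers across simulated ranks.

    Returns one fully reduced copy per rank, leaving inputs untouched.
    """
    n = len(rank_buffers)
    bufs = [list(b) for b in rank_buffers]
    if n == 1:
        return bufs
    length = len(bufs[0])
    assert length % n == 0, "for simplicity, buffer length must divide evenly"
    chunk = length // n

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after n-1 steps, rank r holds the fully
    # summed chunk (r + 1) % n.
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n        # chunk rank r forwards this step
            dst = (r + 1) % n
            for i in span(c):
                bufs[dst][i] += bufs[r][i]

    # Phase 2, all-gather: circulate the summed chunks around the ring so
    # every rank ends up with all of them.
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n    # fully summed chunk rank r forwards
            dst = (r + 1) % n
            for i in span(c):
                bufs[dst][i] = bufs[r][i]
    return bufs
```

Each rank sends and receives 2 * (n - 1) chunks of size length / n, which is why the ring variant is bandwidth-optimal for large tensors; custom kernels like the one improved here instead target the small-message latency regime typical of decoder inference.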
@Shixiaowei02 Shixiaowei02 requested a review from kaiyux March 26, 2024 12:20
@Shixiaowei02 Shixiaowei02 merged commit 850b6fa into main Mar 26, 2024
@Shixiaowei02 Shixiaowei02 deleted the kaiyu/update branch March 26, 2024 12:47
wu1du2 pushed a commit to wu1du2/TensorRT-LLM that referenced this pull request May 11, 2025
Co-authored-by: Kaiyu <26294424+kaiyux@users.noreply.github.com>
