\

LLM Inference Handbook

278 points - today at 2:40 AM

Source
  • sherlockxu

    today at 8:00 AM

    Hi everyone. I'm one of the maintainers of this project. We're both excited and humbled to see it on Hacker News!

    We created this handbook to make LLM inference concepts more accessible, especially for developers building real-world LLM applications. The goal is to pull together scattered knowledge into something clear, practical, and easy to build on.

    We’re continuing to improve it, so feedback is very welcome!

    GitHub repo: https://github.com/bentoml/llm-inference-in-production

      • criemen

        today at 8:18 PM

        Thanks a lot for putting this together!

        I have a question. In https://github.com/bentoml/llm-inference-in-production/blob/..., you have a single picture that defines TTFT and ITL. That does not match my understanding (but you guys know probably more than me): In the graphic, it looks like that the model is generating 4 tokens T0 to T3, before outputting a single output token.

        I'd have expected that picture for ITL (except that then the labeling of the last box is off), but for TTFT, I'd have expected that there's only a single token T0 from the decode step, that then immediately is handed to detokenization and arrives as first output token (if we assume a streaming setup, otherwise measuring TTFT makes little sense).

        • DiabloD3

          today at 6:47 PM

          I'm not going to open an issue on this, but you should consider expanding on the self-hosting part of the handbook and explicitly recommend llama.cpp for local self-hosted inference.

            • leopoldj

              today at 9:51 PM

              The self hosting section covers corporate use case using vLlm and sglang as well as personal desktop use using Ollama which is a wrapper over llama.cpp.

          • armcat

            today at 9:30 AM

            Amazing work on this, beautifully put together and very useful!

        • gchadwick

          today at 8:39 PM

          Very glad to see this. There is (understandably) much excitement and focus on training models in publicly available material.

          Running them well is very important too. As we get to grips with everything models can do and look to deploy them widely knowledge of how to best run them becomes ever more important.

          • subset

            today at 12:08 PM

            Ooh this looks really neat! I'd love to see more content in the future on Structured outputs/Guided generation and sampling. Another great reference on inference-time algorithms for sampling is here: https://rentry.co/samplers

              • larme

                today at 4:13 PM

                Wow that's really thorough

            • aligundogdu

              today at 8:52 AM

              It's a really beautiful project, and I’d like to ask something purely out of curiosity and with the best intentions. What’s the name of the design trend you used for your website? I really loved the website too.

                • Jimmc414

                  today at 5:44 PM

                  it appears to be using Infima, which is Docusaurus's default CSS framework plus a standard system font stack

                  [0] font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial, sans-serif;

              • qrios

                today at 12:38 PM

                Thanks for putting this together! From now on I only need one link to point interested ones to learn.

                Only one suggestion: On page "OpenAI-compatible API" it would be great to have also a simple example for the pure REST call instead of the need to import the OpenAI package.

                • srameshc

                  today at 3:13 PM

                  If I remember, BentoML was about MLOps, I remember trying it about a year back. Did the company pivot ?

                    • fsjayess

                      today at 3:20 PM

                      There is a big pie in the market around LLM serving. It make sense for a serving framework to extend into the space

                  • holografix

                    today at 11:49 AM

                    Very good reference thanks for collating this!

                    • Domainzsite

                      today at 1:36 PM

                      [dead]