Inference latency

Here are some notes from my work on benchmarking the inference latency of CodeGen models; a rough sketch of the measurement setup follows the parameters below. This was inspired by https://github.com/fauxpilot/fauxpilot/issues/144

*(Figure: latency benchmark results.)*
- Prompt size: 1536 tokens
- Output length: varied from 1 to 64 tokens
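For reference, a minimal timing harness along these lines could look like the sketch below. This is my own illustration, not the exact script used to produce these numbers: it assumes a locally running FauxPilot-style, OpenAI-compatible completion endpoint, and the endpoint URL, prompt, and request parameters are placeholders you would adapt to your setup.

```python
# Illustrative latency-measurement sketch (not the original benchmark harness).
# Assumes a FauxPilot-style, OpenAI-compatible completion endpoint running locally;
# ENDPOINT, the prompt, and the request fields are assumptions to adjust as needed.
import time
import requests

ENDPOINT = "http://localhost:5000/v1/engines/codegen/completions"  # assumed default
PROMPT = "def fib(n):\n" * 256  # rough stand-in for a ~1536-token context

def time_completion(max_tokens: int, n_runs: int = 5) -> float:
    """Return the median wall-clock latency (seconds) for one completion request."""
    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        resp = requests.post(
            ENDPOINT,
            json={"prompt": PROMPT, "max_tokens": max_tokens, "temperature": 0.0},
            timeout=60,
        )
        resp.raise_for_status()
        latencies.append(time.perf_counter() - start)
    return sorted(latencies)[len(latencies) // 2]

# Sweep output lengths the same way as the benchmark (1 to 64 tokens).
for n in (1, 8, 16, 32, 64):
    print(f"{n:>3} output tokens: {time_completion(n) * 1000:.0f} ms")
```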

These benchmarks were run on A10 GPUs on my university cluster. Running a 6B model with a decent interactive experience would require at least 4 A10 GPUs; that gives a latency of around 350 ms for most single-line completions. Multi-line completions take longer (0.5-2 s, depending on length). You could use the 2B model with 2 GPUs to get similar latency.

In my benchmarking, I have not found much difference between the 2B and 6B models in terms of quality.

*(Figure: graph showing how close the 2B and 6B models are in terms of suggestion quality.)*


The overall latency depends on two things: processing the context (~1500 tokens) and generating the output token by token. The first phase is essentially a fixed cost you always pay, and the second phase costs O(output length). Most suggestions are expected to be single-line, so expect ~10-20 output tokens, not much more.

Here are latency numbers for different numbers of GPUs:

| Model | GPUs | Phase 1 (context) | Phase 2 (per output token) | Total, 10 tokens | Total, 20 tokens | Total, 50 tokens |
|-------|------|-------------------|----------------------------|------------------|------------------|------------------|
| 2B    | 1    | 250 ms            | 15 ms/token                | 400 ms           | 550 ms           | 1000 ms          |
| 2B    | 2    | 150 ms            | 9 ms/token                 | 240 ms           | 330 ms           | 600 ms           |
| 2B    | 4    | 100 ms            | 6 ms/token                 | 160 ms           | 220 ms           | 400 ms           |
| 6B    | 1    | 410 ms            | 33 ms/token                | 740 ms           | 1070 ms          | 2060 ms          |
| 6B    | 2    | 240 ms            | 19 ms/token                | 430 ms           | 620 ms           | 1190 ms          |
| 6B    | 4    | 170 ms            | 12 ms/token                | 290 ms           | 410 ms           | 770 ms           |
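To make the two-phase cost model concrete, here is a minimal sketch (my own illustration, not part of the benchmark code) that estimates total latency as phase-1 context processing plus per-token generation, using the measured values from the table above:

```python
# Estimate end-to-end completion latency from the two-phase model:
#   total_ms = phase1_ms + phase2_ms_per_token * output_tokens
# The numbers below are the measured values from the table above.

MEASUREMENTS = {
    # (model, num_gpus): (phase1_ms, phase2_ms_per_token)
    ("2B", 1): (250, 15),
    ("2B", 2): (150, 9),
    ("2B", 4): (100, 6),
    ("6B", 1): (410, 33),
    ("6B", 2): (240, 19),
    ("6B", 4): (170, 12),
}

def estimated_latency_ms(model: str, num_gpus: int, output_tokens: int) -> int:
    phase1_ms, per_token_ms = MEASUREMENTS[(model, num_gpus)]
    return phase1_ms + per_token_ms * output_tokens

# Reproduce the "total" columns, e.g. for 6B on 4 GPUs:
for n in (10, 20, 50):
    print(f"6B / 4 GPUs / {n} tokens: {estimated_latency_ms('6B', 4, n)} ms")
# -> 290 ms, 410 ms, 770 ms
```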