Here are some notes from my work on benchmarking the inference latency of CodeGen models. This was inspired by https://github.com/fauxpilot/fauxpilot/issues/144

- Prompt size: 1536 tokens
- Output length: varied from 1 to 64 tokens

These benchmarks were run on A10 machines on my university cluster. Running the 6B model with a decent interactive experience would require at least 4 A10 GPUs; that gives a latency of around 350ms for most single-line completions. Multi-line completions take longer (0.5s-2s depending on length). You could use the 2B model with 2 GPUs to get similar latency.
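
As a rough sketch of how such a measurement can be set up, here is a single-GPU timing loop using plain HuggingFace transformers. The model name, prompt construction, and token counts below are my own illustration, not the exact harness behind the numbers in the table; the multi-GPU splits in the table are not captured by this sketch.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Salesforce/codegen-2B-mono"  # swap for codegen-6B-mono to compare

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).cuda().eval()

# Build a prompt of roughly 1536 tokens by repeating a code snippet and truncating.
snippet = "def add(a, b):\n    return a + b\n\n" * 400
prompt_ids = tokenizer(snippet, return_tensors="pt").input_ids[:, :1536].cuda()

def timed_generate(max_new):
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        # min_new_tokens pins the output length; drop it on older transformers versions.
        model.generate(prompt_ids, max_new_tokens=max_new, min_new_tokens=max_new,
                       do_sample=False, pad_token_id=tokenizer.eos_token_id)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000

timed_generate(1)  # warm-up run so CUDA init doesn't pollute the numbers
for n in (1, 10, 20, 50, 64):
    print(f"{n:3d} output tokens: {timed_generate(n):7.1f} ms")
```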
In my benchmarking, I have not found much difference between the 2B and 6B models in terms of quality.
[Graph: quality of suggestions for the 2B and 6B models, showing how close they are]
The overall latency depends on two things: processing the context (the ~1536-token prompt) and generating the output token by token. The first phase is a fixed cost you always pay; the second phase costs O(output length). Most suggestions are expected to be single-line, so expect ~10-20 tokens, not much more.
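
In other words, total latency ≈ phase-1 cost + per-token cost × output length. As a quick sanity check of that model, the tiny sketch below plugs in the 2B / 1 GPU numbers from the table that follows; it is only an illustration of the arithmetic.

```python
def total_latency_ms(phase1_ms: float, per_token_ms: float, n_output_tokens: int) -> float:
    """Two-phase latency model: fixed context-processing cost + per-token generation cost."""
    return phase1_ms + per_token_ms * n_output_tokens

# 2B model on 1 GPU: Phase 1 = 250ms, Phase 2 = 15ms/token
for n in (10, 20, 50):
    print(f"{n} tokens -> {total_latency_ms(250, 15, n):.0f}ms")  # 400ms, 550ms, 1000ms
```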
Here are the latency numbers for different GPU counts:
Model | Num GPUs | Phase 1 (ms) | Phase 2 (ms/token) | Total for 10 tokens (ms) | Total for 20 tokens (ms) | Total for 50 tokens (ms) |
---|---|---|---|---|---|---|
2B | 1 | 250 | 15 | 400 | 550 | 1000 |
2B | 2 | 150 | 9 | 240 | 330 | 600 |
2B | 4 | 100 | 6 | 160 | 220 | 400 |
6B | 1 | 410 | 33 | 740 | 1070 | 2060 |
6B | 2 | 240 | 19 | 430 | 620 | 1190 |
6B | 4 | 170 | 12 | 290 | 410 | 770 |