The main change is that the model config format has changed. To
handle this, the converter now includes a script that upgrades an
existing model to the new format.
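As a rough sketch of what such an upgrade can look like (the real
script lives in the converter; the section and key names below are
placeholders, not the actual config format):

    #!/usr/bin/env python3
    # Illustrative sketch only: read an existing config.ini, keep the
    # old keys, and fill in newly required keys with defaults. The key
    # names here are placeholders, not the real format.
    import configparser
    import sys

    def upgrade_config(path: str) -> None:
        cfg = configparser.ConfigParser()
        cfg.read(path)
        for section in cfg.sections():
            # Newer versions may expect extra keys; add them only if absent.
            cfg[section].setdefault("model_variant", "gptj")
            cfg[section].setdefault("weight_data_type", "fp32")
        with open(path, "w") as f:
            cfg.write(f)

    if __name__ == "__main__":
        upgrade_config(sys.argv[1])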
We also no longer need to maintain our own fork of Triton, since
upstream has fixed the bug with GPT-J models. This should make it
much easier to stay in sync with upstream (although we still have
to build our own container, since NVIDIA doesn't appear to publish
a prebuilt Triton+FT container).
Newer Triton should let us use some nice features:
- Support for more models, like GPT-NeoX
- Streaming token support (this still needs to be implemented in
the proxy though)
- Dynamic batching (see the config sketch below)
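For reference, enabling dynamic batching is a small addition to a
model's config.pbtxt; the batch sizes and queue delay below are only
example values, not what we ship:

    dynamic_batching {
      preferred_batch_size: [ 4, 8 ]
      max_queue_delay_microseconds: 100
    }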
Still TODO:
- Proxy support for streaming tokens
- Update setup.sh and launch.sh to detect whether a model upgrade
  is needed and run it automatically.
- Add a validation rule to ensure the backend is set to fastertransformer or python-backend (a rough sketch of such a check is below)
- Add a warning if the model is unavailable, which likely means the user has not configured it correctly
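A minimal sketch of what that backend check could look like, assuming
the value is read from an environment variable (the name MODEL_BACKEND
and the error text are placeholders, not the actual setup.sh/proxy
interface):

    # Hypothetical validation sketch; variable names are placeholders.
    import os

    ALLOWED_BACKENDS = {"fastertransformer", "python-backend"}

    def validate_backend() -> str:
        backend = os.environ.get("MODEL_BACKEND", "")
        if backend not in ALLOWED_BACKENDS:
            raise ValueError(
                f"backend must be one of {sorted(ALLOWED_BACKENDS)}, got {backend!r}"
            )
        return backend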
Signed-off-by: Parth Thakkar <thakkarparth007@gmail.com>
- Modify the Dockerfile to include bitsandbytes, transformers, and the latest version of PyTorch
- Minor modifications in utils/codegen.py so that the same client works with both FT and the Python backend
- Minor modifications in launch.sh (no need to name models by GPU)
- Add an installation script for adding a new Python model (with a very simple config_template; a sketch is below)
- Modify setup.sh so that it works with both FT and Python backend models
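To give a sense of how small that config_template can be, a Triton
python-backend model config needs little more than the following (the
model name, tensor names, and types here are placeholders, not the
actual template):

    name: "example-py-model"
    backend: "python"
    max_batch_size: 4
    input [
      {
        name: "input"
        data_type: TYPE_STRING
        dims: [ -1 ]
      }
    ]
    output [
      {
        name: "output"
        data_type: TYPE_STRING
        dims: [ -1 ]
      }
    ]
    instance_group [
      {
        kind: KIND_GPU
      }
    ]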
Signed-off-by: Parth Thakkar <thakkarparth007@gmail.com>