Have you ever wondered how to deploy a large language model (LLM) on Oracle Cloud Infrastructure (OCI)? In this solution, you’ll learn how to deploy LLMs using OCI Compute Bare Metal instances accelerated by NVIDIA GPUs with an inference server called vLLM.
vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to act as a drop-in replacement for applications built on the OpenAI API: instead of calling proprietary models such as GPT-3.5 or GPT-4, an application can point the same requests at a vLLM endpoint serving an open model, typically by changing just two things, the base URL and the model name.
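Here's a minimal sketch of what that drop-in replacement looks like, assuming a vLLM server is already running locally on its default port; the model name, host, and port are illustrative placeholders:

```python
# Minimal sketch: querying a vLLM server through the standard OpenAI Python client.
# Assumes vLLM's OpenAI-compatible entrypoint is already running, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM endpoint instead of api.openai.com
    api_key="EMPTY",                      # vLLM does not require a real API key by default
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # must match the model vLLM is serving
    messages=[{"role": "user", "content": "Explain what an inference server does."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Only the `base_url` and `model` arguments differ from a call against OpenAI's hosted API, which is exactly what makes vLLM a drop-in replacement.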
These LLMs can come from any well-formed Hugging Face repository (the developer's choice). Because many models are gated, we'll need to authenticate to Hugging Face with an access token before vLLM can pull them (unless we have already built or downloaded the model weights locally).
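A brief sketch of that authentication step, assuming a token generated in the Hugging Face account settings; the token value below is a placeholder:

```python
# Minimal sketch: authenticating to Hugging Face so vLLM can pull gated models.
import os
from huggingface_hub import login

# Option 1: log in programmatically; the token is cached for later downloads.
login(token="hf_xxxxxxxxxxxxxxxxx")  # placeholder token

# Option 2: export the token as an environment variable, which the
# huggingface_hub downloader used by vLLM picks up automatically.
os.environ["HF_TOKEN"] = "hf_xxxxxxxxxxxxxxxxx"  # placeholder token
```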
LLMs can also be deployed with NVIDIA NIM, a set of easy-to-use microservices designed for the secure, reliable deployment of high-performance AI model inference on NVIDIA GPU-accelerated instances on OCI.
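NIM microservices for LLMs likewise expose an OpenAI-compatible API, so the same client pattern carries over. A hedged sketch, with a placeholder host and an example NIM model identifier (the actual values depend on the specific NIM container deployed):

```python
# Minimal sketch: querying a deployed NIM LLM microservice, which serves
# an OpenAI-compatible endpoint just like vLLM does.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",  # example NIM model identifier
    messages=[{"role": "user", "content": "Hello from OCI!"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```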