AI Solution

Deploying LLMs with NVIDIA GPUs on OCI Compute Bare Metal

Introduction

Have you ever wondered how to deploy a large language model (LLM) on Oracle Cloud Infrastructure (OCI)? In this solution, you’ll learn how to deploy LLMs using OCI Compute Bare Metal instances accelerated by NVIDIA GPUs with an inference server called vLLM.

vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to act as a drop-in replacement for applications built against the OpenAI API: instead of calling OpenAI-hosted models (such as GPT-3.5 or GPT-4), an application can point the same requests at the vLLM server, which generates text based on just two things (see the sketch after this list):

  • The original user’s query
  • The model name of the LLM you want to run text generation against

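As a concrete illustration, here's a minimal sketch of such a request using the openai Python client, assuming a vLLM server is already running locally on its default port (8000) and serving a model pulled from Hugging Face; the model name and prompt are placeholders:

```python
# Minimal sketch: querying a vLLM server through its OpenAI-compatible API.
# The base_url, model name, and prompt are assumptions for illustration.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",  # vLLM does not require a real API key by default
)

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # the LLM to generate against
    messages=[{"role": "user", "content": "What is OCI Compute Bare Metal?"}],  # the user's query
)
print(completion.choices[0].message.content)
```

Because the request format matches OpenAI's, an existing application only needs its base URL and model name changed to switch from an OpenAI-hosted model to one served by vLLM.
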
These LLMs can come from any well-formed Hugging Face repository of the developer's choice, so unless we've built the models ourselves, we'll need an authentication token to pull them from Hugging Face.
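
If the model is gated or private, a sketch of that authentication step with the huggingface_hub library looks like this; the token value is a placeholder for one generated under your Hugging Face account settings:

```python
# Sketch: authenticate to Hugging Face before vLLM pulls model weights.
# The token below is a placeholder; never hard-code a real secret.
from huggingface_hub import login

login(token="hf_...")  # alternatively, supply the token via an environment variable
```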

LLMs can also be deployed with NVIDIA NIM, a set of easy-to-use microservices designed for the secure, reliable deployment of high-performance AI model inference on NVIDIA GPU-accelerated instances on OCI.
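
NIM containers also expose an OpenAI-compatible endpoint, so the client code mirrors the vLLM example above; in this sketch, the port and model identifier are assumptions based on NVIDIA's published NIM images:

```python
# Sketch: querying a NIM microservice, which serves an OpenAI-compatible API.
# The base_url and model identifier are assumptions for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="meta/llama3-8b-instruct",  # example NIM model identifier
    messages=[{"role": "user", "content": "Summarize what NVIDIA NIM provides."}],
)
print(completion.choices[0].message.content)
```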

Demo

Demo: Deploying LLMs with NVIDIA GPUs on OCI Compute Bare Metal (1:17)

Prerequisites and setup

  1. Oracle Cloud account—sign-up page
  2. Oracle Cloud Infrastructure—documentation
  3. OCI Generative AI—documentation
  4. vLLM—getting started documentation