AI Solution

Deploying LLMs with NVIDIA GPUs on OCI Compute Bare Metal

Introduction

Have you ever wondered how to deploy a large language model (LLM) on Oracle Cloud Infrastructure (OCI)? In this solution, you’ll learn how to deploy LLMs using OCI Compute Bare Metal instances accelerated by NVIDIA GPUs with an inference server called vLLM.

vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to act as a drop-in replacement for applications built against the OpenAI API: instead of calling OpenAI-hosted models (such as GPT-3.5 or GPT-4), an application can point the same requests at the vLLM server, which generates text based on just two things (see the sketch after this list):

  • The original user’s query
  • The model name of the LLM you want to run text generation against

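As a concrete illustration, here's a minimal sketch of such a request using the openai Python client, assuming a vLLM server is already running locally on its default port (8000) and serving a model pulled from Hugging Face; the model name and prompt are placeholders:

```python
# Minimal sketch: querying a vLLM server through its OpenAI-compatible API.
# The base_url, model name, and prompt are assumptions for illustration.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",  # vLLM does not require a real API key by default
)

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # the LLM to generate against
    messages=[{"role": "user", "content": "What is OCI Compute Bare Metal?"}],  # the user's query
)
print(completion.choices[0].message.content)
```

Because the request format matches OpenAI's, an existing application only needs its base URL and model name changed to switch from an OpenAI-hosted model to one served by vLLM.
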
These LLMs can come from any well-formed Hugging Face repository of the developer's choice, so unless we've built the models ourselves, we'll need an authentication token to pull them from Hugging Face.
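
If the model is gated or private, a sketch of that authentication step with the huggingface_hub library looks like this; the token value is a placeholder for one generated under your Hugging Face account settings:

```python
# Sketch: authenticate to Hugging Face before vLLM pulls model weights.
# The token below is a placeholder; never hard-code a real secret.
from huggingface_hub import login

login(token="hf_...")  # alternatively, supply the token via an environment variable
```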

LLMs can also be deployed with NVIDIA NIM, a set of easy-to-use microservices designed for the secure, reliable deployment of high-performance AI model inference on NVIDIA GPU-accelerated instances on OCI.
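
NIM containers also expose an OpenAI-compatible endpoint, so the client code mirrors the vLLM example above; in this sketch, the port and model identifier are assumptions based on NVIDIA's published NIM images:

```python
# Sketch: querying a NIM microservice, which serves an OpenAI-compatible API.
# The base_url and model identifier are assumptions for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="meta/llama3-8b-instruct",  # example NIM model identifier
    messages=[{"role": "user", "content": "Summarize what NVIDIA NIM provides."}],
)
print(completion.choices[0].message.content)
```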

Demo

Demo: Deploying LLMs with NVIDIA GPUs on OCI Compute Bare Metal (1:17)

Prerequisites and setup

  1. Oracle Cloud account—sign-up page
  2. Oracle Cloud Infrastructure—documentation
  3. OCI Generative AI—documentation
  4. vLLM—getting started documentation