OCI Speech is an AI service that both transcribes speech to text and synthesizes speech from text. It applies automatic speech recognition technology to transform audio-based content into text, either in real time or asynchronously. The neural network–based text-to-speech feature generates a natural-sounding voice from your input text. You can easily make API calls to integrate OCI Speech’s pretrained models into your applications. OCI Speech can be used for accurate, text-normalized, time-stamped transcription or synthetic voice generation via the console and REST APIs, as well as the CLI and SDKs. You can also use OCI Speech in an OCI Data Science notebook session. With OCI Speech, you can filter profanities, get confidence scores for both single words and complete transcriptions, and more.
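For illustration, here is a minimal sketch of submitting an asynchronous transcription job with the OCI Python SDK. The compartment OCID, namespace, bucket, and object names are placeholders, and exact model or parameter names may differ slightly depending on your SDK version:

    # Sketch: submit an asynchronous transcription job with the OCI Python SDK.
    # All OCIDs, namespaces, buckets, and object names below are placeholders.
    import oci

    config = oci.config.from_file()  # reads ~/.oci/config by default
    speech = oci.ai_speech.AIServiceSpeechClient(config)

    job_details = oci.ai_speech.models.CreateTranscriptionJobDetails(
        display_name="my-first-transcription",
        compartment_id="ocid1.compartment.oc1..example",          # placeholder OCID
        input_location=oci.ai_speech.models.ObjectListInlineInputLocation(
            object_locations=[
                oci.ai_speech.models.ObjectLocation(
                    namespace_name="my-namespace",                 # placeholder
                    bucket_name="audio-input",                     # placeholder
                    object_names=["meeting.wav"],
                )
            ]
        ),
        output_location=oci.ai_speech.models.OutputLocation(
            namespace_name="my-namespace",
            bucket_name="transcription-output",
            prefix="results/",
        ),
    )

    job = speech.create_transcription_job(job_details).data
    print(job.id, job.lifecycle_state)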
You should use OCI Speech if you need a fast, accurate, time-stamped transcription service. If you’re using OCI to store your audio files, you can also enjoy lower latencies and no network costs associated with transcription. The latest text-to-speech and real-time speech-to-text features, now in limited availability, provide additional capabilities to integrate with your application.
To get started, log in to create your first transcription or read more about the service.
We currently support file-based asynchronous transcription. Real-time transcription is offered in limited availability at this time.
Transcription comes with pretrained models for the following languages: English, Spanish, Portuguese, German, French, Italian, and Hindi. We also support the OpenAI Whisper model for asynchronous file-based transcription, with 57+ languages supported out of the box.
No. We only transcribe your content and keep no information from the file.
Like any other transcription service, the quality of the output depends on the quality of the input audio file. Speakers' accents, background noises, switching between languages, using fusion languages (such as Spanglish), and multiple people speaking simultaneously can all impact the quality of transcription. We are also constantly working to improve the performance of the service to provide more accurate transcriptions for all inputs and speakers.
Not currently, but this capability is coming soon.
We support single-channel, 16-bit PCM WAV audio files with a 16 kHz sample rate. We also support the following media formats and will convert them to PCM WAV before transcribing:
You can also convert your files before submitting jobs to reduce latency. We recommend Audacity (GUI) or FFmpeg (command line) for audio transcoding.
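As a quick pre-flight check, the sketch below (using only the Python standard library; the file name is a placeholder) verifies whether a local WAV file already matches the supported format, so you know whether conversion is needed before submitting a job:

    # Check a WAV file against the supported format:
    # single channel, 16-bit PCM samples, 16 kHz sample rate.
    import wave

    def is_supported_wav(path):
        with wave.open(path, "rb") as wav:
            return (
                wav.getnchannels() == 1          # single channel (mono)
                and wav.getsampwidth() == 2      # 16-bit samples (2 bytes)
                and wav.getframerate() == 16000  # 16 kHz sample rate
            )

    print(is_supported_wav("meeting.wav"))  # True means no conversion is needed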
We support JSON as the default and SRT as an option at no additional cost.
We use precision billing, which means we charge $0.50 for every hour of transcription or voice synthesis but measure aggregated usage in seconds. For example, if you upload three files with durations of 10,860 seconds, 8,575 seconds, and 9,421 seconds, your monthly bill is calculated by summing the seconds (28,856), dividing by 3,600 (the number of seconds in an hour), subtracting 5 (the number of free hours per month), and multiplying by $0.50. In other words, you will be charged $1.508: (28,856 / 3,600 - 5) x $0.50 = $1.508.
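The same example, worked as a short Python calculation:

    # Per-second metering, aggregated monthly, with 5 free hours deducted
    # and the remainder billed at $0.50 per transcription hour.
    durations_seconds = [10_860, 8_575, 9_421]      # three uploaded files
    total_hours = sum(durations_seconds) / 3_600    # 28,856 s ≈ 8.016 hours
    billable_hours = max(total_hours - 5, 0)        # subtract 5 free hours
    monthly_charge = billable_hours * 0.50          # $0.50 per hour
    print(round(monthly_charge, 3))                 # ≈ 1.508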
Our billable metric is the transcription hour, which measures the number of audio hours transcribed or synthesized during a given month of service.
No. OCI Speech does not have any setup charges or minimum service commitments, and there’s no hardware required.
Yes. We offer five hours of free transcription every month per tenancy.
Punctuation is free, as is SRT output. Storing SRT files may increase your storage fee.
OCI Speech works with any recording device and is not device-specific.
We recommend using the FFmpeg utility with the following command: $ ffmpeg -i <input.ext> -fflags +bitexact -acodec pcm_s16le -ac 1 -ar 16000 <output.wav>.
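If you prefer to script the conversion, the following sketch simply wraps that FFmpeg command with Python's subprocess module; it assumes ffmpeg is installed and on your PATH, and the file names are placeholders:

    # Convert an audio file to single-channel, 16-bit PCM WAV at 16 kHz
    # by invoking the recommended FFmpeg command.
    import subprocess

    def to_pcm_wav(input_path, output_path):
        subprocess.run(
            [
                "ffmpeg", "-i", input_path,
                "-fflags", "+bitexact",
                "-acodec", "pcm_s16le",   # 16-bit little-endian PCM
                "-ac", "1",               # single channel (mono)
                "-ar", "16000",           # 16 kHz sample rate
                output_path,
            ],
            check=True,
        )

    to_pcm_wav("interview.mp3", "interview.wav")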
See the Speech policy setup documentation.