Google Cloud Speech-to-Text over gRPC in Delphi

3 June 2026 · Components

Speech-to-Text turns spoken audio into written text. Google Cloud exposes this as a gRPC service, and sgcWebSockets Enterprise ships a typed Speech client built on the generic TsgcGRPCClient so you can transcribe audio straight from Delphi and C++Builder. You assemble a recognition request with a few properties, send it over gRPC, and read the transcript back. No external runtime or hand-written protobufs are required.

How it works

gRPC carries Protocol Buffers messages in length-prefixed frames over HTTP/2, so the Speech client rides on the same transport as the rest of the library. A TsgcHTTP2Client opens a TLS connection to speech.googleapis.com:443, a TsgcGRPCClient handles the gRPC framing and trailers on top of it, and the typed Speech messages in sgcGRPC_Google_Speech serialize and parse the request and response for you.
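
To make the framing concrete: on the HTTP/2 stream, every gRPC message is preceded by a 5-byte prefix, one compression-flag byte followed by the payload length as a 4-byte big-endian integer. TsgcGRPCClient does this for you, so the following is only an illustrative sketch using the Delphi RTL. FrameGrpcMessage is a hypothetical helper name, not part of sgcWebSockets.

```pascal
program GrpcFrameSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper (not part of sgcWebSockets): wraps a serialized
// protobuf payload in the gRPC length-prefixed message format.
function FrameGrpcMessage(const APayload: TBytes;
  ACompressed: Boolean = False): TBytes;
var
  vLen: Integer;
begin
  vLen := Length(APayload);
  SetLength(Result, 5 + vLen);
  // byte 0: compressed flag (0 = uncompressed, 1 = compressed)
  if ACompressed then
    Result[0] := 1
  else
    Result[0] := 0;
  // bytes 1..4: payload length, most significant byte first
  Result[1] := (vLen shr 24) and $FF;
  Result[2] := (vLen shr 16) and $FF;
  Result[3] := (vLen shr 8) and $FF;
  Result[4] := vLen and $FF;
  // remaining bytes: the payload itself
  if vLen > 0 then
    Move(APayload[0], Result[5], vLen);
end;

var
  vFrame: TBytes;
begin
  // four arbitrary bytes stand in for a serialized protobuf message
  vFrame := FrameGrpcMessage(TBytes.Create($0A, $02, $65, $6E));
  Writeln('Frame length: ', Length(vFrame));
  Writeln('Length field, low byte: ', vFrame[4]);
end.
```

The prefix is why a gRPC body cannot be read as a bare protobuf: the receiver strips the five bytes first, then parses the payload.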

Google Cloud APIs require authentication. Here you authenticate with a service account, exchanging its JSON key for a short-lived bearer token, and send that token as gRPC metadata on every call. The request itself is a RecognitionConfig (language, encoding, sample rate) plus the audio, either inline bytes or a Cloud Storage URI. The service replies with one or more results, each holding ranked transcript alternatives with a confidence score.

Authenticating with a service account

The Google Cloud client turns a service-account JSON key into a bearer token. Load the key, set the JWT properties, and bind the audience to the Speech endpoint so the self-signed token is accepted. Once the token arrives, add it to the gRPC client's DefaultMetadata as an authorization header, which then travels on every call.

uses
  sgcHTTP2_Client, sgcGRPC_Client, sgcGRPC_Types,
  sgcHTTP_Google_Cloud, sgcGRPC_Google_Speech;

// service-account JWT authentication
Cloud.GoogleCloudOptions.Authentication := gcaJWT;
Cloud.GoogleCloudOptions.JWT.ClientEmail := ClientEmail;
Cloud.GoogleCloudOptions.JWT.PrivateKeyId := PrivateKeyId;
Cloud.GoogleCloudOptions.JWT.PrivateKey.Text := PrivateKey;
Cloud.GoogleCloudOptions.JWT.ProjectId := ProjectId;
// self-signed service-account JWT is audience-bound to the Speech endpoint
Cloud.GoogleCloudOptions.JWT.API_Endpoint := 'https://speech.googleapis.com/';

// once the token is acquired, send it on every gRPC call
GRPC.DefaultMetadata.Clear;
GRPC.DefaultMetadata.Add('authorization', 'Bearer ' + Token);
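
The snippet above takes ClientEmail, PrivateKeyId, PrivateKey and ProjectId as given. One way to obtain them, sketched here with the RTL's System.JSON unit, is to read the fields client_email, private_key_id, private_key and project_id from the key file that Google Cloud issues for the service account. ParseServiceAccountKey and the file name are hypothetical, and the library may offer its own way to load a key file.

```pascal
program ServiceAccountKeySketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.IOUtils, System.JSON;

// Hypothetical helper (not part of sgcWebSockets): extracts the four fields
// the JWT options need from the text of a service-account key file.
procedure ParseServiceAccountKey(const AJson: string;
  out AClientEmail, APrivateKeyId, APrivateKey, AProjectId: string);
var
  oValue: TJSONValue;
begin
  oValue := TJSONObject.ParseJSONValue(AJson); // nil when the text is not JSON
  try
    if not (oValue is TJSONObject) then
      raise Exception.Create('Not a service-account JSON key');
    // GetValue<T> raises an exception when a field is missing
    AClientEmail := oValue.GetValue<string>('client_email');
    APrivateKeyId := oValue.GetValue<string>('private_key_id');
    // "\n" escapes inside the PEM key are decoded to real line breaks
    APrivateKey := oValue.GetValue<string>('private_key');
    AProjectId := oValue.GetValue<string>('project_id');
  finally
    oValue.Free;
  end;
end;

var
  ClientEmail, PrivateKeyId, PrivateKey, ProjectId: string;
begin
  try
    // 'service-account.json' is a placeholder path
    ParseServiceAccountKey(
      TFile.ReadAllText('service-account.json', TEncoding.UTF8),
      ClientEmail, PrivateKeyId, PrivateKey, ProjectId);
    Writeln('Loaded key for ', ClientEmail);
  except
    on E: Exception do
      Writeln('Could not load key: ', E.Message);
  end;
end.
```

Keep the key file out of source control and out of the deployed executable. Anyone holding it can mint tokens for that service account.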

Connecting the gRPC client

The channel is an HTTP/2 connection. Point a TsgcHTTP2Client at the Speech host with TLS enabled, assign it to the Client property of the gRPC component, and select the wire content type. The Speech service speaks Protocol Buffers, so use grpcProto with no compression.

HTTP2 := TsgcHTTP2Client.Create(nil);
HTTP2.Host := 'speech.googleapis.com';
HTTP2.Port := 443;
HTTP2.TLS  := True;

GRPC := TsgcGRPCClient.Create(nil);
GRPC.Client := HTTP2;
GRPC.ChannelOptions.ContentType := grpcProto;
GRPC.ChannelOptions.Compression := grpcNoCompression;

HTTP2.Active := True;

Recognizing audio and reading the transcript

To transcribe, build a TsgcGRPCSpeechRecognizeRequest. Fill in the Config with the language code, encoding and sample rate, point Audio.Uri at a Cloud Storage object (or set Audio.Content with inline bytes), and call Recognize on the google.cloud.speech.v1.Speech service. The request serializes itself with ToBytes, and the reply parses back into a typed response you can walk for results and alternatives.

var
  oRequest: TsgcGRPCSpeechRecognizeRequest;
  oResponse: TsgcGRPCResponse;
  oSpeech: TsgcGRPCSpeechRecognizeResponse;
  oResult: TsgcGRPCSpeechRecognitionResult;
  oAlt: TsgcGRPCSpeechRecognitionAlternative;
  i, j: Integer;
begin
  oRequest := TsgcGRPCSpeechRecognizeRequest.Create;
  try
    oRequest.Config.Encoding := 0;             // 0 = ENCODING_UNSPECIFIED; allowed for FLAC and WAV, whose headers declare the encoding
    oRequest.Config.SampleRateHertz := 16000;  // must match the audio file; optional for FLAC and WAV
    oRequest.Config.LanguageCode := 'en-US';
    oRequest.Config.EnableAutomaticPunctuation := True;
    oRequest.Audio.Uri := 'gs://my-bucket/audio.flac';

    oResponse := GRPC.Call('google.cloud.speech.v1.Speech', 'Recognize',
      oRequest.ToBytes);
  finally
    oRequest.Free;
  end;

  if oResponse.StatusCode <> grpcOK then
  begin
    ShowMessage('gRPC error: ' + oResponse.StatusMessage);
    Exit;
  end;

  oSpeech := TsgcGRPCSpeechRecognizeResponse.Create;
  try
    oSpeech.LoadFromBytes(oResponse.Data);
    for i := 0 to oSpeech.ResultCount - 1 do
    begin
      oResult := oSpeech.ResultItem(i);
      for j := 0 to oResult.AlternativeCount - 1 do
      begin
        oAlt := oResult.Alternative(j);
        Memo1.Lines.Add('Transcript: ' + oAlt.Transcript);
        Memo1.Lines.Add('Confidence: ' + FloatToStr(oAlt.Confidence));
      end;
    end;
  finally
    oSpeech.Free;
  end;
end;

Recognition config

The Config object on the request maps directly to Google's RecognitionConfig message. Beyond language and sample rate, you can set MaxAlternatives to ask for ranked variants, ProfanityFilter to mask offensive words, AudioChannelCount for multi-channel audio, EnableAutomaticPunctuation for readable output, and Model to pick a tuned recognition model. Apart from the language code, which the API requires, these properties are optional and only emitted on the wire when set, so you send just what you need.
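
As a rough illustration, the optional settings could be applied in one place before the call. The property names come from the paragraph above. Their Delphi types (Integer, Boolean, string) are assumptions to check against the declarations in sgcGRPC_Google_Speech, and ApplyRecognitionOptions is a hypothetical helper name.

```pascal
uses
  sgcGRPC_Google_Speech;

// Hypothetical helper: property types below are assumed, not confirmed.
procedure ApplyRecognitionOptions(const aRequest: TsgcGRPCSpeechRecognizeRequest);
begin
  aRequest.Config.LanguageCode := 'en-US';            // required by the API
  aRequest.Config.MaxAlternatives := 3;               // ask for up to three ranked variants
  aRequest.Config.ProfanityFilter := True;            // mask offensive words
  aRequest.Config.AudioChannelCount := 2;             // must match the channels in the audio
  aRequest.Config.EnableAutomaticPunctuation := True; // commas, periods, question marks
  aRequest.Config.Model := 'latest_long';             // one of Google's published model names
end;
```

Leaving a property untouched keeps it off the wire, so a minimal request only needs the language code and the audio.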

Inline audio or a Cloud Storage URI

Short clips can travel inside the request: assign the raw audio bytes to Audio.Content and the client embeds them in the protobuf. For longer files, upload the audio to a bucket and set Audio.Uri to a gs:// path instead, which keeps the request small and lets Google read the object directly. The two are mutually exclusive: set one or the other on a given request, never both.
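
A minimal sketch of the inline route, assuming Audio.Content accepts a TBytes value (not confirmed here). The helpers FitsInline and LoadInlineAudio, the file name, and the 10 MB ceiling are illustrative assumptions, so check Google's current quotas for the real limit.

```pascal
program InlineAudioSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.IOUtils;

const
  // Assumed inline-content ceiling; verify against Google's current quotas.
  cMaxInlineBytes = 10 * 1024 * 1024;

// Hypothetical helper: True when a clip is non-empty and small enough to embed.
function FitsInline(ASizeBytes: Int64): Boolean;
begin
  Result := (ASizeBytes > 0) and (ASizeBytes <= cMaxInlineBytes);
end;

// Hypothetical helper: reads the whole file, refusing clips that should
// go through Cloud Storage instead.
function LoadInlineAudio(const AFileName: string): TBytes;
begin
  Result := TFile.ReadAllBytes(AFileName);
  if not FitsInline(Length(Result)) then
    raise Exception.CreateFmt(
      '%s is empty or larger than %d bytes; upload it and use a gs:// URI',
      [AFileName, cMaxInlineBytes]);
end;

begin
  try
    // 'clip.flac' is a placeholder file name
    Writeln(Length(LoadInlineAudio('clip.flac')), ' bytes ready to send inline');
    // With the request object from this article, the assignment would be:
    //   oRequest.Audio.Content := LoadInlineAudio('clip.flac');
    // leaving Audio.Uri unset, since the two are mutually exclusive.
  except
    on E: Exception do
      Writeln(E.Message);
  end;
end.
```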

Results and alternatives

A response is a list of results, one per recognized segment of audio. Each result carries one or more alternatives ordered by likelihood, with the most probable transcript first and a Confidence score between 0 and 1. Google generally reports confidence only for the top alternative, and a value of 0 means the score was not set rather than that the transcript is worthless. Iterate ResultCount and AlternativeCount to read them all, or simply take the first alternative of the first result for the best guess. The typed helpers do the protobuf parsing, so you work with plain Delphi strings and floats.
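
Because each result covers a consecutive portion of the audio, the full best-guess transcript is the first alternative of every result joined in order. The sketch below uses only the members already shown in this article (ResultCount, ResultItem, AlternativeCount, Alternative, Transcript). BestTranscript itself is a hypothetical helper, not a library function.

```pascal
uses
  System.SysUtils, sgcGRPC_Google_Speech;

// Hypothetical helper: joins the top-ranked alternative of each result.
function BestTranscript(const aResponse: TsgcGRPCSpeechRecognizeResponse): string;
var
  i: Integer;
  oResult: TsgcGRPCSpeechRecognitionResult;
begin
  Result := '';
  for i := 0 to aResponse.ResultCount - 1 do
  begin
    oResult := aResponse.ResultItem(i);
    // a result can arrive without alternatives, so guard the index
    if oResult.AlternativeCount > 0 then
    begin
      if Result <> '' then
        Result := Result + ' ';
      // alternative 0 is the most probable transcript for this segment
      Result := Result + Trim(oResult.Alternative(0).Transcript);
    end;
  end;
end;
```

Called after LoadFromBytes, this would reduce the display code to a single line such as Memo1.Lines.Add(BestTranscript(oSpeech)).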

Availability

The typed Speech-to-Text gRPC client is part of the sgcWebSockets Enterprise edition and runs on Windows, macOS, Linux, iOS and Android. A ready-to-run sample, the one this article is based on, is in Demos\21.GRPC\11.Speech_to_Text, and the full reference is on the gRPC Client product page.

Questions or feedback? Get in touch. You will get a reply from the people who wrote the code.