Emanuel-Ioan Nazare

October 26, 2023

How to use speech-to-text to transcribe media files faster

Speech to text, or speech recognition, is a technology that was first used in the 1950s, but it has only gained popularity in recent years. As the name suggests, speech to text refers to a technology that receives an audio file, or a video file from which the audio is extracted, and returns, as text, the words spoken by the human voices in that audio. This means that speech to text technology can be used to automatically transcribe a media file, and thus help people with manual transcription.

In this tutorial, we will cover three ways you can automatically transcribe your media files, using the Vatis Tech speech to text technology.

Prerequisites

First of all, if you do not have one yet, you will need to create a free account on the Vatis Tech website here.

A free account on Vatis Tech comes with one free hour of transcription. This means you can use the Vatis Tech speech to text on a single audio or video file of 60 minutes, or on multiple files that add up to 60 minutes.

When you first create your account, you will be redirected to it and prompted with a short tutorial that shows how the speech to text technology works, using one example file.

You can choose between example media files in Romanian or in English.

For the time being, you can skip this tutorial.

If you already have an account, log into it here if you have not done so already.

How to use Vatis Tech speech to text

We will go through three methods of doing this:

  1. Transcribe in app with a click of a button
  2. Transcribe in app by dragging your files
  3. Transcribe using Vatis Tech API endpoints [For developers]

1. Transcribe in app with a click of a button

After you have logged into your account, click the green button,  🔼 New file  , in the upper right corner of your screen.

You will be prompted with the browser's file selection pop-up.

Select as many media files as you want to automatically transcribe and continue. To continue, depending on your browser and/or language, you should have a button in the selection pop-up named Open, or Select. If you have any issues here, feel free to contact me.

Once you have selected the files you want to transcribe, you will be prompted with the transcription pop-up. In this transcription pop-up you need to:

  1. Select the language of your files - the speech language
  2. Select the AI model you would like to use

    Our models have names such as General, Legal, Medical, etc. Choose the one that matches the type of speech in your files
    The General model is the most powerful, and is composed of all the other models, so we suggest you use this one
  3. Depending on the language and model, you might have a few more options that you can choose for your files:

    Post-processing: whether you would like your transcription to have Punctuation, Capitalisation, Numerals Conversion, Entities Recognition and not have Disfluencies
    Here is an example of a transcript without Post-processing: i would like ah to go on the thirty first of october in europe
    And here is the same transcript with Post-processing: I would like to go on 31st of October in Europe.

    Find and replace: this is an advanced filter, where you can tell our speech to text technology how to change some words or phrases in your transcripts. For example, if our speech to text technology outputs SpongeBob as Sponge Bob, or SpongeBwab, or Sponge Bwab, etc., you could specify the correct spelling. You can set your find and replace tags here, on some languages and models.

    Speakers Diarization: you can switch this option on if you would like the speech to text technology to try to split your output transcript into paragraphs by speaker.

    Multiple Channels: if your audio has multiple channels, you can switch this on, and the output transcript will be split into paragraphs by channel. Note that you can use either this option or the Speakers Diarization one, but not both.
  4. Press the green Upload button, and wait for the files to be uploaded and the automated transcription to start.
  5. Once the transcript is done, you will be able to click on it in the list, and open our editor to start editing and checking your transcript alongside your audio or video file. We will cover the editor in a later blog post.
Uploading a file by button on Vatis Tech
First step: press the New file button
Second step: choose the files you want to automatically transcribe and confirm
Third step: configure your transcription and press the Upload button
Last step: wait for the files to get uploaded and automatically transcribed, and press on them to check the transcript

2. Transcribe in app by dragging your files

This method is almost the same as the previous one. The difference is that, instead of clicking a button, you just need to drag and drop your files directly into your Vatis Tech files list.

Once you do that, you will be prompted with the same transcription pop-up as in the previous method.

Uploading a file by drag and drop in Vatis Tech
First step: drag and drop the files you want to transcribe
Second step: configure your transcription and press the Upload button
Last step: wait for the files to get uploaded and automatically transcribed, and press on them to check the transcript

3. Transcribe using Vatis Tech API endpoints [For developers]

Vatis Tech offers three ways of automatically transcribing your files using our APIs:

  1. You can upload a file to our GCS (Google Cloud Storage) bucket and then ask to start the transcription for that file
  2. You can send a public link to your file and let us download it and start the transcription
  3. You can directly upload your file through our API and the transcription will start when the file is uploaded

NOTE: If you upload multiple files, some files may be queued, and their transcription will start once the first files have been transcribed. If you have any questions about this, please let me know at emanuel@vatis.tech.

Whichever method you choose, you will first need to get your API key from here.

Upload to GCS and start the transcript

The first step is to get your signed GCS URL with the following HTTP request:

GET https://vatis.tech/api/v1/files/new?name=your_file.flac

NOTE! The query param name needs to be specified, and it needs to be the actual name of your file.

You will get a response that will look something like:

{ "uploadUrl": "https://storage.googleapis.com/vatis-tech-bucket-euro/xxxx-xxxx?X-Goog-Signature=yyyyy....",   "fileUid": "35dc963f-4107-4699-a4a0-0f8ea0fcbb6e" }
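As a minimal sketch of this first step, the Python snippet below builds the request URL and reads uploadUrl and fileUid from the response body. The helper names are made up for illustration, and authentication is left out because this tutorial does not describe how the API key is attached to the request.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://vatis.tech/api/v1"  # base URL taken from the requests in this tutorial


def signed_url_request(file_name: str) -> str:
    """Build the GET URL that asks for a signed upload URL for file_name."""
    # urlencode escapes spaces and other special characters in the file name.
    return f"{API_BASE}/files/new?{urlencode({'name': file_name})}"


def parse_signed_url_response(body: str) -> tuple:
    """Return (uploadUrl, fileUid) from the JSON response body."""
    data = json.loads(body)
    return data["uploadUrl"], data["fileUid"]
```

Calling signed_url_request("your_file.flac") gives back the GET URL shown above. Keep the fileUid, since it is the uid you pass when starting the transcription.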

The second step is to upload your file with a POST request to the uploadUrl from the previous GET response:

POST https://storage.googleapis.com/vatis-tech-bucket-euro/xxxx-xxxx?X-Goog-Signature=yyyyy....

Send the file as a form body in the request. If you need more info, please check the official Google docs.
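The upload request could be prepared roughly as follows. This is only a sketch based on the description above: the form field name "file" and the generic application/octet-stream content type are assumptions, so check the Google docs for what your signed URL actually expects. The function only builds the request and sends nothing.

```python
import uuid
import urllib.request


def build_upload_request(upload_url: str, file_name: str, content: bytes) -> urllib.request.Request:
    """Prepare a multipart/form-data POST that carries the file to upload_url."""
    boundary = uuid.uuid4().hex  # random separator between form parts
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # ASSUMPTION: the form field is called "file".
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(upload_url, data=body, headers=headers, method="POST")
```

To actually send it, pass the returned object to urllib.request.urlopen, using the uploadUrl you received in the previous step.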

Next, you need to make the following request to start the transcription process:

POST https://vatis.tech/api/v1/files/transcribe/uid?transcribe=[true|false]

This request has the following Query Parameters:

Parameter | Type | Default | Description
transcribe | boolean | TRUE | Specify if the transcription process should be automatically initiated after the upload.
parent | UUID | Root folder | The identifier of the parent folder of this file.
model | UUID | Default model | The model used for transcription. It can be used to specify a custom model.
speaker_diarization | bool | FALSE | Enable speakers diarization.
speakers_number | int | 1 | A hint for the number of speakers in the audio.
multi_channels | bool | FALSE | Transcribe each audio channel separately. multi_channels and speaker_diarization can't be used simultaneously.
disfluencies | bool | TRUE | Enable disfluencies.
punctuation_capitalization | bool | FALSE | Add punctuation and capitalization.
entities_recognition | bool | FALSE | Add an entity tag on each word. Flags are listed here.
numerals_conversion | bool | FALSE | Convert numerals. This automatically enables entities recognition.
find_replace | bool | FALSE | Apply find-replace rules.

And the following Body Attributes:

Attribute | Type | Default | Description
uid | string | (required) | Uid obtained at upload to our GCS bucket.
language | string | (required) | Language code of your file. Full list here.
name | string | (required) | Uploaded file name.
success_url | string | None | The success url where you would like to be notified when the transcript is ready.
fail_url | string | None | The fail url where you would like to be notified if the transcription process has failed.
hotwords | list of strings | None | The list of words, phrases, or both from your custom vocabulary.
hotwords_weight | float | None | A value between 0.1 and 10.0 to control how much weight should be applied to your keywords/phrases.
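The query parameters and body attributes above could be assembled like this. It is a sketch, not an official client: the function name is made up, only a few of the optional parameters are shown, and sending the body attributes as a JSON object is an assumption. Authentication is omitted because it is not described here.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://vatis.tech/api/v1"  # base URL taken from the requests in this tutorial


def build_transcribe_request(uid, language, name, *, speaker_diarization=False,
                             multi_channels=False, success_url=None, fail_url=None):
    """Return (url, json_body) for the POST that starts the transcription."""
    # The two options are mutually exclusive, as stated in the parameter table.
    if speaker_diarization and multi_channels:
        raise ValueError("multi_channels and speaker_diarization can't be used simultaneously")

    # Booleans are written in lowercase, matching ?transcribe=[true|false].
    query = {
        "transcribe": "true",
        "speaker_diarization": str(speaker_diarization).lower(),
        "multi_channels": str(multi_channels).lower(),
    }
    # Required attributes first; optional ones only when they are provided.
    body = {"uid": uid, "language": language, "name": name}
    if success_url:
        body["success_url"] = success_url
    if fail_url:
        body["fail_url"] = fail_url

    url = f"{API_BASE}/files/transcribe/uid?{urlencode(query)}"
    # ASSUMPTION: the body attributes are sent as a JSON object.
    return url, json.dumps(body)
```

The uid is the fileUid returned when you requested the signed URL, and language must be one of the codes from the full list linked above.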

NOTE: The success_url should look like this:

POST https://your-success-url.com?format=type&uid=${uid}

Where type should be one of PLAIN_TEXT or JSON, i.e. the format in which your endpoint can receive the transcript.

NOTE: The fail_url should look like this:

POST https://your-fail-url.com?uid=${uid}

Where you will receive, as plain text, the reason why your file could not be automatically transcribed.

The Vatis Tech POST method above will respond with a JSON object with the following attributes:
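On your side, the endpoints behind success_url and fail_url need to read the uid (and the format) from the query string and the payload from the POST body. A minimal, framework-independent sketch follows. The function names are hypothetical, UTF-8 encoding of the body is an assumption, and the structure of the JSON transcript is not covered here, so it is returned as parsed without further interpretation.

```python
import json
from urllib.parse import parse_qs, urlsplit


def parse_success_callback(request_target: str, body: bytes) -> dict:
    """Extract uid, format and transcript from a POST made to your success_url."""
    query = parse_qs(urlsplit(request_target).query)
    uid = query["uid"][0]
    fmt = query["format"][0]
    text = body.decode("utf-8")  # ASSUMPTION: the body is UTF-8 encoded
    if fmt == "JSON":
        transcript = json.loads(text)
    elif fmt == "PLAIN_TEXT":
        transcript = text
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return {"uid": uid, "format": fmt, "transcript": transcript}


def parse_fail_callback(request_target: str, body: bytes) -> dict:
    """Extract uid and the plain-text error from a POST made to your fail_url."""
    query = parse_qs(urlsplit(request_target).query)
    return {"uid": query["uid"][0], "error": body.decode("utf-8")}
```

These helpers can be called from whatever web framework serves your callback URLs, passing in the request path with its query string and the raw request body.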

Attribute | Type | Description
uid | string | File uid.
discriminator | string | Type of file. Can be one of: TRANSCRIPTED_FILE, VATIS_TRANSCRIBED_FILE.
name | string | File name.
secondsDuration | integer | Duration of the file in seconds.
language | string | Language code of the file. Full list here.
createdDate | date | File created date without timezone (e.g. 2020-09-05T09:02:02.978931Z).
status | string | Transcription status for this file. Can be one of: NOT_TRANSCRIPTED, TRANSCRIPTING, TRANSCRIPTED, QUEUED.
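A client will usually branch on the status attribute of this response. The sketch below shows one way to do that with the four status values listed above. The function name and the wording of the messages are illustrative only.

```python
import json

IN_PROGRESS = {"QUEUED", "TRANSCRIPTING"}


def describe_status(response_body: str) -> str:
    """Turn the JSON response into a short human-readable status line."""
    data = json.loads(response_body)
    name, status = data["name"], data["status"]
    if status == "TRANSCRIPTED":
        return f"{name}: transcript ready ({data['secondsDuration']} s of audio)"
    if status in IN_PROGRESS:
        return f"{name}: still in progress ({status})"
    if status == "NOT_TRANSCRIPTED":
        return f"{name}: uploaded, transcription not started"
    raise ValueError(f"unexpected status: {status}")
```

A file that reports QUEUED or TRANSCRIPTING is not finished yet, so if you passed a success_url you can simply wait for the notification there.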

Upload from public link and start the transcript

This is done in only one step, with the following HTTP request:

POST https://vatis.tech/api/v1/files/transcribe/link?transcribe=[true|false]

It has the same Query Parameters and Body Attributes as the one above, except that instead of the uid you will pass the following Body Attributes:

Attribute | Type | Default | Description
url | string | (required) | Public link to your file.
media_type | string | (generated) | File media type. Full list here.

The response will be the same as the one above, with the same notes about success_url and fail_url.
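For the public link variant, the request could be prepared as in the sketch below. The function name is made up, and the JSON encoding of the body together with its Content-Type header is an assumption. Authentication is omitted because it is not described here. The function only builds the request and sends nothing.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_BASE = "https://vatis.tech/api/v1"  # base URL taken from the requests in this tutorial


def build_link_request(file_url: str, language: str, name: str,
                       transcribe: bool = True) -> urllib.request.Request:
    """Prepare the POST that asks Vatis Tech to download and transcribe file_url."""
    query = urlencode({"transcribe": str(transcribe).lower()})
    # Same body attributes as before, with url in place of uid.
    body = {"url": file_url, "language": language, "name": name}
    return urllib.request.Request(
        f"{API_BASE}/files/transcribe/link?{query}",
        data=json.dumps(body).encode("utf-8"),  # ASSUMPTION: JSON body
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The link has to be publicly reachable, since the file is downloaded on the Vatis Tech side.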

Upload a file directly

Once again, this is done in only one step, with an HTTP request as follows:

POST https://vatis.tech/api/v1/files/transcribe/file?transcribe=[true|false]

It accepts the following parameters, which can be sent either as query parameters or as form data:

Parameter | Type | Default | Description
file | binary | (required) | The binary file.
language | string | (required) | Language code of your file.
success_url | string | None | The success url where you would like to be notified when the transcript is ready.
fail_url | string | None | The fail url where you would like to be notified if the transcription process has failed.
transcribe | boolean | TRUE | Specify if the transcription process should be automatically initiated after the upload.
parent | UUID | Root folder | The identifier of the parent folder of this file.
model | UUID | Default model | The model used for transcription. It can be used to specify a custom model.
hotwords | list of strings | None | The list of words, phrases, or both from your custom vocabulary.
hotwords_weight | float | None | A value between 0.1 and 10.0 to control how much weight should be applied to your keywords/phrases.
speaker_diarization | bool | FALSE | Enable speakers diarization.
speakers_number | int | 1 | A hint for the number of speakers in the audio.
multi_channels | bool | FALSE | Transcribe each audio channel separately. multi_channels and speaker_diarization can't be used simultaneously.
disfluencies | bool | TRUE | Enable disfluencies.
punctuation_capitalization | bool | FALSE | Add punctuation and capitalization.
entities_recognition | bool | FALSE | Add an entity tag on each word.
numerals_conversion | bool | FALSE | Convert numerals. This automatically enables entities recognition.
find_replace | bool | FALSE | Apply find-replace rules.

It will have the same response as the first method.
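A direct upload sends the file and the other parameters together as form data. The sketch below builds such a multipart request with the standard library only. The function name and the extra_fields argument are made up for illustration, and authentication is omitted because it is not described here. Nothing is sent until you pass the result to urllib.request.urlopen.

```python
import uuid
import urllib.request

API_BASE = "https://vatis.tech/api/v1"  # base URL taken from the requests in this tutorial


def build_direct_upload_request(file_name, content, language, extra_fields=None):
    """Prepare a multipart/form-data POST with the file and its parameters."""
    boundary = uuid.uuid4().hex  # random separator between form parts
    fields = {"language": language, **(extra_fields or {})}
    parts = []
    # Plain text parameters, e.g. language or speaker_diarization.
    for key, value in fields.items():
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{key}"\r\n\r\n'
             f"{value}\r\n").encode()
        )
    # The binary file goes in the form field named "file", as in the table above.
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode()
    )
    parts.append(content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        f"{API_BASE}/files/transcribe/file?transcribe=true",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

Optional parameters such as speaker_diarization can be passed through extra_fields as strings, for example {"speaker_diarization": "true"}.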

Conclusions

There are two main ways of automatically transcribing your media files using the Vatis Tech speech to text technology: one through the Vatis Tech Web Application, and one through the Vatis Tech API.

The Web Application is best suited to users with a small number of media files, while the API is best suited to those who want to fully automate their transcription process.

If you have any questions, please let us know at emanuel@vatis.tech.
