Text to Speech using Web Speech API

Introduction

The Web Speech API lets web apps work with voice data. In this tutorial, we'll build a simple webpage that implements text-to-speech with it. The Web Speech API's browser compatibility can be found here.
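Because support varies between browsers, a page can check for the API before wiring anything up. The following is a minimal sketch of such a check; the helper name supportsSpeechSynthesis is hypothetical and not part of the API.

```javascript
// Hypothetical helper: returns true when the given global object exposes
// both the speechSynthesis controller and the utterance constructor.
function supportsSpeechSynthesis(globalObject) {
  return (
    typeof globalObject === "object" &&
    globalObject !== null &&
    "speechSynthesis" in globalObject &&
    "SpeechSynthesisUtterance" in globalObject
  );
}

// In a browser, globalThis is the window object.
// Outside a browser (for example in Node.js) the check reports false.
if (!supportsSpeechSynthesis(globalThis)) {
  console.log("Speech synthesis is not available in this environment.");
}
```

If the check fails, the page could hide the controls or show a short message instead of failing on the first API call.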

Designing the HTML Page

On the HTML page, we'll have the following things:

  • A select menu for selecting voice from the list of available voices
  • Range sliders for volume, pitch, and rate
  • A textarea to put our content
  • Control buttons (Start, Pause, Resume, and Cancel)

Now, let us design the webpage. We'll use Bootstrap 5 to style the webpage. Create an index.html file and put the following content there.

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <link
    href="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/css/bootstrap.min.css"
    rel="stylesheet"
    integrity="sha384-1BmE4kWBq78iYhFldvKuhfTAU6auU8tT94WrHftjDbrCEXSU1oBoqyl2QvZ6jIW3"
    crossorigin="anonymous"
  />
  <title>Text To Speech</title>
</head>
<body class="container mt-5 bg-light">
  <h1 class="text-dark">Text to Speech</h1>
  <div class="row">
    <div class="col-md-5">
      <p class="lead text-dark mt-4">Select Voice</p>

      <!-- Select Menu for Voice -->
      <select id="voices" class="form-select text-dark"></select>
    </div>
    <div class="col-md-1"></div>
    <div class="col-md-6">
      <!-- Range Sliders for Volume, Rate & Pitch -->
      <div class="d-flex mt-4 text-dark">
        <div>
          <p class="lead">Volume</p>
          <input
            type="range"
            min="0"
            max="1"
            value="1"
            step="0.1"
            id="volume"
          />
          <span id="volume-label" class="ms-2">1</span>
        </div>
        <div class="mx-5">
          <p class="lead">Rate</p>
          <input
            type="range"
            min="0.1"
            max="10"
            value="1"
            id="rate"
            step="0.1"
          />
          <span id="rate-label" class="ms-2">1</span>
        </div>
        <div>
          <p class="lead">Pitch</p>
          <input type="range" min="0" max="2" value="1" step="0.1" id="pitch" />
          <span id="pitch-label" class="ms-2">1</span>
        </div>
      </div>
    </div>
  </div>

  <!-- Text Area for the User to Type -->
  <textarea
    class="form-control bg-light text-dark mt-5"
    cols="30"
    rows="10"
    placeholder="Enter text here..."
  ></textarea>

  <!-- Control Buttons -->
  <div class="mb-5">
    <button id="start" class="btn btn-success mt-5 me-3">Start</button>
    <button id="pause" class="btn btn-warning mt-5 me-3">Pause</button>
    <button id="resume" class="btn btn-info mt-5 me-3">Resume</button>
    <button id="cancel" class="btn btn-danger mt-5 me-3">Cancel</button>
  </div>

  <!-- Load the script last so the elements above exist when it runs -->
  <script src="script.js"></script>
</body>
</html>

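The resulting page shows the voice menu on the left, the three sliders on the right, the textarea below them, and the four control buttons at the bottom.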

Web Speech API and Its Properties

Let's start by creating a SpeechSynthesisUtterance instance. It represents a speech request: it holds the text to speak along with the settings that control how it is spoken.

let speech = new SpeechSynthesisUtterance();

Now, let’s configure some properties on this SpeechSynthesisUtterance instance.

If you look at the documentation of SpeechSynthesisUtterance, you'll find the following six properties associated with it:

1. lang

The lang property gets and sets the language of the utterance. If unset, the lang attribute of the page's <html> element (for example, <html lang="en">) is used, or the user-agent default if that is unset too. It accepts a string containing a BCP 47 language tag, such as "en" or "en-US".

speech.lang = "en";

2. pitch

The pitch property gets and sets the pitch at which the utterance will be spoken. It is a float representing the pitch value. It can range between 0 (lowest) and 2 (highest), with 1 being the default pitch for the current platform or voice.

Add an input event listener to the pitch range slider and update the pitch property when the slider value changes. The slider's min, max, and default values have already been specified in the HTML tag.

Let's also update the <span> next to the range slider so that it displays the current pitch value.

document.querySelector("#pitch").addEventListener("input", () => {
  // Get pitch Value from the input
  const pitch = document.querySelector("#pitch").value;

  // Set pitch property of the SpeechSynthesisUtterance instance
  speech.pitch = pitch;

  // Update the pitch label
  document.querySelector("#pitch-label").innerHTML = pitch;
});
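The value read from a range input is a string, so it can help to convert it to a number and keep it inside the allowed range before assigning it. Here is a minimal sketch of that idea; toClampedNumber is a hypothetical helper name, not part of the Web Speech API.

```javascript
// Hypothetical helper: parse a slider string and keep it inside [min, max].
// Falls back to the given default when the input is not a number.
function toClampedNumber(rawValue, min, max, fallback) {
  const parsed = Number.parseFloat(rawValue);
  if (Number.isNaN(parsed)) {
    return fallback;
  }
  return Math.min(max, Math.max(min, parsed));
}

// Pitch accepts values from 0 to 2, with 1 as the default.
const pitchValue = toClampedNumber("1.5", 0, 2, 1);
console.log(pitchValue);
```

Inside the listener, the result could be assigned to speech.pitch in place of the raw string.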

3. rate

The rate property gets and sets the speed at which the utterance will be spoken. It is a float representing the rate value. It can range between 0.1 (lowest) and 10 (highest), with 1 being the default rate for the current platform or voice, which should correspond to a normal speaking rate. Other values act as a multiplier relative to this, so for example 2 is twice as fast, 0.5 is half as fast, etc.

Let’s add an input event listener to the rate range slider and update the rate property when the value of the slider changes. The slider's min, max, and default values have already been specified in the HTML tag.

Let’s also update the <span> next to the range slider so that it displays the current rate value.

document.querySelector("#rate").addEventListener("input", () => {
  // Get rate Value from the input
  const rate = document.querySelector("#rate").value;

  // Set rate property of the SpeechSynthesisUtterance instance
  speech.rate = rate;

  // Update the rate label
  document.querySelector("#rate-label").innerHTML = rate;
});
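To make the multiplier behaviour concrete, here is a small worked sketch that estimates how long a text takes to speak at a given rate. The baseline of 150 words per minute at rate 1 is purely an assumption for illustration; actual speed depends on the voice and the platform, and estimateSeconds is a hypothetical helper.

```javascript
// ASSUMPTION: 150 words per minute at rate 1 is only an illustrative
// baseline. Real voices and platforms differ.
const ASSUMED_WORDS_PER_MINUTE = 150;

// Hypothetical helper: rough speaking time in seconds for a text at a rate.
function estimateSeconds(text, rate, wordsPerMinute = ASSUMED_WORDS_PER_MINUTE) {
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  return (words / (wordsPerMinute * rate)) * 60;
}

// Build a 150-word sample and compare three rates.
const sampleText = Array(150).fill("word").join(" ");
console.log(estimateSeconds(sampleText, 1));
console.log(estimateSeconds(sampleText, 2));
console.log(estimateSeconds(sampleText, 0.5));
```

Doubling the rate halves the estimate and halving the rate doubles it, which matches the description of rate as a multiplier.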

4. text

The text property gets and sets the text that will be synthesized when the utterance is spoken.

The text may be provided as plain text or a well-formed SSML document. The SSML tags will be stripped away by devices that don't support SSML.

Let’s add a click listener to the start button. When the button is clicked, we should get the text value from the textarea and set it to this property.

document.querySelector("#start").addEventListener("click", () => {
  speech.text = document.querySelector("textarea").value;
});
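Since speak() adds each utterance to a queue, a long text could also be split into sentences and queued as separate utterances. The sketch below is one possible approach rather than something this tutorial's project does; splitIntoSentences and queueSentences are hypothetical helpers, and the punctuation-based split is deliberately simplistic.

```javascript
// Hypothetical helper: split text into sentences on ., ! and ? characters.
function splitIntoSentences(text) {
  const matches = text.match(/[^.!?]+[.!?]*/g) || [];
  return matches
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

// Hypothetical helper: queue one utterance per sentence.
// `synth` is expected to offer speak(), and `createUtterance` builds an
// utterance object from a sentence string.
function queueSentences(synth, createUtterance, text) {
  splitIntoSentences(text).forEach((sentence) => {
    synth.speak(createUtterance(sentence));
  });
}

// Browser usage (assumes the textarea from the page above):
// queueSentences(
//   window.speechSynthesis,
//   (sentence) => new SpeechSynthesisUtterance(sentence),
//   document.querySelector("textarea").value
// );
```

Note that each new utterance starts with default settings, so the chosen voice, pitch, rate, and volume would need to be copied onto it inside createUtterance.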

5. voice

The voice property gets and sets the voice that will be used to speak the utterance.

This should be set to one of the SpeechSynthesisVoice objects returned by SpeechSynthesis.getVoices(). If not set by the time the utterance is spoken, the voice used will be the most suitable default voice available for the utterance's lang setting.

To set the voice of the utterance, we first need the list of voices available through window.speechSynthesis. The voices may not be available right away when the page loads, because they are loaded asynchronously. Once they are loaded, a voiceschanged event is fired, and we can assign a handler to onvoiceschanged that runs at that point.

window.speechSynthesis.onvoiceschanged = () => {
  // On Voices Loaded
};

We can get the list of voices using window.speechSynthesis.getVoices(). It’ll return an array of SpeechSynthesisVoice objects that are available. Let’s store the list in a global array voices and update the select menu on the web page with the list of available voices.

let voices = []; // global array

window.speechSynthesis.onvoiceschanged = () => {
  // Get List of Voices
  voices = window.speechSynthesis.getVoices();

  // Initially set the First Voice in the Array.
  speech.voice = voices[0];

  // Set the Voice Select List. (Set the Index as the value, which we'll use later when the user updates the Voice using the Select Menu.)
  let voiceSelect = document.querySelector("#voices");
  voices.forEach(
    (voice, i) => (voiceSelect.options[i] = new Option(voice.name, i))
  );
};
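In some browsers, getVoices() may already return a populated list on the first call, and the voiceschanged event may not fire afterwards. A slightly more defensive variant, sketched here under that assumption, tries once immediately and also listens for the event. The helper name watchVoices is hypothetical.

```javascript
// Hypothetical helper: call `onVoices` with the voice list as soon as it is
// non-empty, both right away and whenever the voiceschanged handler runs.
function watchVoices(synth, onVoices) {
  const publish = () => {
    const available = synth.getVoices();
    if (available.length > 0) {
      onVoices(available);
    }
  };

  // Register the handler first, then try once for browsers that already
  // have the list ready.
  synth.onvoiceschanged = publish;
  publish();
}

// Browser usage (assumes the `speech` and `voices` variables from above):
// watchVoices(window.speechSynthesis, (available) => {
//   voices = available;
//   speech.voice = voices[0];
// });
```

The callback may run more than once, so whatever it does (such as filling the select menu) should be safe to repeat.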

Now that the voice menu has been populated, we can add a change event listener to it to update the voice of the SpeechSynthesisUtterance instance. When the user picks a different voice, we use the index number (which is set as the value of each option) to look up the voice in the global voices array.

document.querySelector("#voices").addEventListener("change", () => {
  speech.voice = voices[document.querySelector("#voices").value];
});
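Each SpeechSynthesisVoice also carries a lang value, so the list could be narrowed to voices that match the utterance's language. The following sketch compares only the primary language subtag; the helper names and the sample data are made up for illustration, and the underscore handling assumes that some platforms may report tags like en_GB.

```javascript
// Hypothetical helper: "en-US" -> "en". Underscores are normalized first,
// on the assumption that some platforms report tags such as "en_GB".
function primarySubtag(tag) {
  return String(tag).replace(/_/g, "-").split("-")[0].toLowerCase();
}

// Hypothetical helper: keep only voices whose language matches `lang`.
function voicesForLanguage(voices, lang) {
  const wanted = primarySubtag(lang);
  return voices.filter((voice) => primarySubtag(voice.lang) === wanted);
}

// Plain objects standing in for SpeechSynthesisVoice entries (sample data).
const sampleVoices = [
  { name: "Voice A", lang: "en-US" },
  { name: "Voice B", lang: "en_GB" },
  { name: "Voice C", lang: "fr-FR" },
];

console.log(voicesForLanguage(sampleVoices, "en").map((voice) => voice.name));
```

In the page, the filtered list could be used to fill the select menu instead of the full voices array.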

6. volume

The volume property gets and sets the volume at which the utterance will be spoken. It is a float between 0 (lowest) and 1 (highest). If not set, the default value 1 will be used.

Let’s add an input event listener to the volume range slider and update the volume property when the value of the slider changes. The slider's min, max, and default values have already been specified in the HTML tag.

Let’s also update the <span> next to the range slider so that it displays the current volume value.

document.querySelector("#volume").addEventListener("input", () => {
  // Get volume Value from the input
  const volume = document.querySelector("#volume").value;

  // Set volume property of the SpeechSynthesisUtterance instance
  speech.volume = volume;

  // Update the volume label
  document.querySelector("#volume-label").innerHTML = volume;
});
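The volume, rate, and pitch listeners all follow the same pattern, so they could be folded into one helper. This is an optional refactoring sketch, not the version used in the final file; bindSlider is a hypothetical name.

```javascript
// Hypothetical helper: keep one utterance property and its label in sync
// with a range slider.
function bindSlider(slider, label, utterance, property) {
  slider.addEventListener("input", () => {
    // Range inputs report strings, so convert before assigning.
    const value = Number(slider.value);
    utterance[property] = value;
    label.textContent = String(value);
  });
}

// Browser usage (assumes the element ids and `speech` variable from above):
// ["volume", "rate", "pitch"].forEach((name) => {
//   bindSlider(
//     document.querySelector(`#${name}`),
//     document.querySelector(`#${name}-label`),
//     speech,
//     name
//   );
// });
```

This works because each slider id matches the name of the property it controls.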

Controls

As mentioned earlier, we'll have four controls: Start, Pause, Resume, and Cancel. Let’s wire each button to the corresponding method of window.speechSynthesis.

1. Start

When the start button is pressed, the SpeechSynthesisUtterance instance should be passed to the window.speechSynthesis.speak() method. This adds the utterance to the speech queue and begins converting the text into speech. The text property must be set before calling this method.

document.querySelector("#start").addEventListener("click", () => {
  speech.text = document.querySelector("textarea").value;
  window.speechSynthesis.speak(speech);
});
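Because speak() queues the utterance, pressing Start while speech is still in progress queues it again rather than restarting. One way to handle that, sketched below as an assumption about the desired behaviour, is to skip empty text and cancel anything pending before speaking. The helper name startSpeaking is hypothetical.

```javascript
// Hypothetical helper: restart speech with fresh text.
// Returns false (and does nothing) when there is no text to speak.
function startSpeaking(synth, utterance, rawText) {
  const text = rawText.trim();
  if (text.length === 0) {
    return false;
  }

  // Drop anything still queued or being spoken before starting over.
  synth.cancel();
  utterance.text = text;
  synth.speak(utterance);
  return true;
}

// Browser usage (assumes the `speech` variable and textarea from above):
// document.querySelector("#start").addEventListener("click", () => {
//   startSpeaking(
//     window.speechSynthesis,
//     speech,
//     document.querySelector("textarea").value
//   );
// });
```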

2. Pause

We can use window.speechSynthesis.pause() to pause the utterance that is currently being spoken. Select the pause button and add a click event listener to it.

document.querySelector("#pause").addEventListener("click", () => {
  window.speechSynthesis.pause();
});

3. Resume

Using window.speechSynthesis.resume(), we can resume the SpeechSynthesisUtterance instance that is currently paused. Let's add a click event listener to the resume button and resume the instance when the button is clicked.

document.querySelector("#resume").addEventListener("click", () => {
  window.speechSynthesis.resume();
});

4. Cancel

We can use window.speechSynthesis.cancel() to stop the utterance that is currently being spoken. It also removes any utterances still waiting in the queue. Let's add a click event listener to the cancel button and cancel the speech when the button is clicked.

document.querySelector("#cancel").addEventListener("click", () => {
  window.speechSynthesis.cancel();
});
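The speechSynthesis object also exposes the read-only speaking, pending, and paused flags, which could be used to decide which buttons make sense at a given moment. The sketch below derives such a set of flags; controlStates is a hypothetical helper, and the rules it encodes are just one reasonable choice.

```javascript
// Hypothetical helper: decide which control buttons should be enabled,
// based on the speaking, pending, and paused flags of the synth object.
function controlStates(synth) {
  const active = Boolean(synth.speaking || synth.pending);
  const paused = Boolean(synth.paused);
  return {
    start: true,
    // speaking can stay true while paused, so check paused separately.
    pause: active && !paused,
    resume: paused,
    cancel: active || paused,
  };
}

// Plain objects standing in for window.speechSynthesis in three situations.
console.log(controlStates({ speaking: false, pending: false, paused: false }));
console.log(controlStates({ speaking: true, pending: false, paused: false }));
console.log(controlStates({ speaking: true, pending: false, paused: true }));
```

In the page, each flag could be applied with something like button.disabled = !states.pause after every click and when an utterance ends.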

Final JavaScript File

Now, our script.js file looks like this:

// Initialize new SpeechSynthesisUtterance object
let speech = new SpeechSynthesisUtterance();

// Set Speech Language
speech.lang = "en";

let voices = []; // global array of available voices

window.speechSynthesis.onvoiceschanged = () => {
  // Get List of Voices
  voices = window.speechSynthesis.getVoices();

  // Initially set the First Voice in the Array.
  speech.voice = voices[0];

  // Set the Voice Select List. (Set the Index as the value, which we'll use later when the user updates the Voice using the Select Menu.)
  let voiceSelect = document.querySelector("#voices");
  voices.forEach(
    (voice, i) => (voiceSelect.options[i] = new Option(voice.name, i))
  );
};

document.querySelector("#rate").addEventListener("input", () => {
  // Get rate Value from the input
  const rate = document.querySelector("#rate").value;

  // Set rate property of the SpeechSynthesisUtterance instance
  speech.rate = rate;

  // Update the rate label
  document.querySelector("#rate-label").innerHTML = rate;
});

document.querySelector("#volume").addEventListener("input", () => {
  // Get volume Value from the input
  const volume = document.querySelector("#volume").value;

  // Set volume property of the SpeechSynthesisUtterance instance
  speech.volume = volume;

  // Update the volume label
  document.querySelector("#volume-label").innerHTML = volume;
});

document.querySelector("#pitch").addEventListener("input", () => {
  // Get pitch Value from the input
  const pitch = document.querySelector("#pitch").value;

  // Set pitch property of the SpeechSynthesisUtterance instance
  speech.pitch = pitch;

  // Update the pitch label
  document.querySelector("#pitch-label").innerHTML = pitch;
});

document.querySelector("#voices").addEventListener("change", () => {
  // On Voice change, use the value of the select menu (which is the index of the voice in the global voice array)
  speech.voice = voices[document.querySelector("#voices").value];
});

document.querySelector("#start").addEventListener("click", () => {
  // Set the text property with the value of the textarea
  speech.text = document.querySelector("textarea").value;

  // Start Speaking
  window.speechSynthesis.speak(speech);
});

document.querySelector("#pause").addEventListener("click", () => {
  // Pause the speechSynthesis instance
  window.speechSynthesis.pause();
});

document.querySelector("#resume").addEventListener("click", () => {
  // Resume the paused speechSynthesis instance
  window.speechSynthesis.resume();
});

document.querySelector("#cancel").addEventListener("click", () => {
  // Cancel the speechSynthesis instance
  window.speechSynthesis.cancel();
});

Result

You can take a look at the project that’s been deployed using GitHub Pages here.

Conclusion

In this post, we saw how to use the Web Speech API to convert text to speech. There are several use cases for this project. For example, you can integrate it into your blog so that readers can listen to your articles.

You can check out the final code in this GitHub Repository.

Thanks for reading!
