
Make Your Application Conversant: Speech-to-Speech Chat with Azure AI Speech and Azure OpenAI
- May 1st, 2025
Imagine being able to talk to your application as naturally as you would with a friend, asking it questions, and getting instant verbal responses. This isn't a scene from a sci-fi movie—it's a reality you can create using Azure AI Speech and Azure OpenAI Service. In this blog post, we'll show you how to harness these powerful tools to build an application that can understand and respond to spoken language. Whether you're developing a customer service bot, a virtual assistant, or an interactive educational tool, integrating Azure Speech-to-Speech can significantly enhance user engagement and experience. Let's dive into how you can make your application conversant, bridging the gap between humans and machines with the seamless integration of Azure's cutting-edge technologies.
Prerequisites
- Python 3.7 or later
- An Azure subscription with an Azure OpenAI resource and a deployed chat model
- An Azure AI Speech resource
- A working microphone and speaker
Implementation in your application
In this step, I am assuming you have already set up the environment for the speech-to-speech chat.
Once you have set up the environment, you can clone the GitHub repo through this link. Alternatively, you can create an app.py file in VS Code and paste in the code below.
import azure.cognitiveservices.speech as speechsdk
from openai import AzureOpenAI

# Replace these placeholders with your Azure OpenAI key, endpoint, and deployment name.
# Your endpoint should look like the following: https://YOUR_OPEN_AI_RESOURCE_NAME.openai.azure.com/
api_key = "YOUR_OPEN_AI_KEY"
client = AzureOpenAI(
    azure_endpoint="YOUR_AZURE_END_POINT",
    api_key=api_key,
    api_version="2024-02-15-preview"
)

# This corresponds to the custom name you chose for your deployment when you deployed a model.
deployment_id = "YOUR_MODEL_NAME"

# Replace this placeholder with your Azure AI Speech key; update the region if your resource is not in eastus.
speech_config = speechsdk.SpeechConfig(subscription='YOUR_SPEECH_SERVICE_KEY', region='eastus')
audio_output_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)

# Should be the locale of the speaker's language.
speech_config.speech_recognition_language = "en-US"
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

# The language of the voice that responds on behalf of Azure OpenAI.
speech_config.speech_synthesis_voice_name = 'en-US-JennyMultilingualNeural'
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_output_config)

# Punctuation marks that end a sentence for text-to-speech purposes.
tts_sentence_end = [".", "!", "?", ";", "。", "！", "？", "；", "\n"]
# Prompts Azure OpenAI with a request and synthesizes the response sentence by sentence.
def ask_openai(prompt):
    # Ask Azure OpenAI in a streaming way.
    response = client.chat.completions.create(model=deployment_id, max_tokens=200, stream=True, messages=[
        {"role": "user", "content": prompt}
    ])
    collected_messages = []
    last_tts_request = None

    # Iterate through the response stream.
    for chunk in response:
        if len(chunk.choices) > 0:
            chunk_message = chunk.choices[0].delta.content  # extract the message
            if chunk_message is not None:
                collected_messages.append(chunk_message)  # save the message
                if chunk_message in tts_sentence_end:  # sentence end found
                    text = ''.join(collected_messages).strip()  # join the received tokens into a sentence
                    if text != '':  # skip sentences that contain only \n or spaces
                        print(f"Speech synthesized to speaker for: {text}")
                        last_tts_request = speech_synthesizer.speak_text_async(text)
                        collected_messages.clear()
    if last_tts_request:
        last_tts_request.get()
# Continuously listens for speech input, recognizes it, and sends it as text to Azure OpenAI.
def chat_with_open_ai():
    while True:
        print("Azure OpenAI is listening. Say 'Stop' or press Ctrl-Z to end the conversation.")
        try:
            # Get audio from the microphone and send it to the speech service for recognition.
            speech_recognition_result = speech_recognizer.recognize_once_async().get()

            # If speech is recognized, send it to Azure OpenAI and listen for the response.
            if speech_recognition_result.reason == speechsdk.ResultReason.RecognizedSpeech:
                if speech_recognition_result.text == "Stop.":
                    print("Conversation ended.")
                    break
                print("Recognized speech: {}".format(speech_recognition_result.text))
                ask_openai(speech_recognition_result.text)
            elif speech_recognition_result.reason == speechsdk.ResultReason.NoMatch:
                print("No speech could be recognized: {}".format(speech_recognition_result.no_match_details))
                break
            elif speech_recognition_result.reason == speechsdk.ResultReason.Canceled:
                cancellation_details = speech_recognition_result.cancellation_details
                print("Speech Recognition canceled: {}".format(cancellation_details.reason))
                if cancellation_details.reason == speechsdk.CancellationReason.Error:
                    print("Error details: {}".format(cancellation_details.error_details))
        except EOFError:
            break

# Main
try:
    chat_with_open_ai()
except Exception as err:
    print("Encountered exception. {}".format(err))
Once the above step is completed, you will need to install two libraries (openai and azure-cognitiveservices-speech).
Run the following command in the terminal to install them:
pip install openai azure-cognitiveservices-speech
Note: Before running the application, remember to replace the Azure OpenAI key, endpoint, and deployment name, as well as the Azure Speech service key, in the fields marked in the code (all of these are available in the Azure portal on the Azure OpenAI and Speech resources you created); otherwise you will encounter authentication errors.
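If you prefer not to hardcode secrets, you can read them from environment variables instead. The following is a minimal sketch, assuming you have exported variables named OPEN_AI_KEY, OPEN_AI_ENDPOINT, OPEN_AI_DEPLOYMENT_NAME, SPEECH_KEY, and SPEECH_REGION in your shell before launching the app:
import os

# Assumed to be set in the shell before the app starts; not created by the code above.
api_key = os.environ["OPEN_AI_KEY"]
client = AzureOpenAI(
    azure_endpoint=os.environ["OPEN_AI_ENDPOINT"],
    api_key=api_key,
    api_version="2024-02-15-preview"
)
deployment_id = os.environ["OPEN_AI_DEPLOYMENT_NAME"]
speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["SPEECH_KEY"],
    region=os.environ["SPEECH_REGION"]
)
This keeps credentials out of source control and lets you switch resources without editing the code. Once the placeholders (or environment variables) are set, start the app from the terminal with python app.py.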
Conclusion
Bringing Azure Speech-to-Speech into your application is a game-changer for creating more natural and engaging user experiences. With Azure AI Speech and Azure OpenAI Service, your app can understand and respond to spoken language, making interactions seamless and intuitive.
In this guide, we've covered everything from setting up your environment to implementing speech recognition and synthesis in your app. With these tools, you're ready to build applications that can converse with users in real-time.
As you continue developing, think about adding features like multi-language support and personalized responses. The potential is limitless, and the technology keeps improving.
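For instance, multi-language input can be prototyped with the Speech SDK's automatic language detection. The sketch below is one possible approach; the candidate language list is an assumption you would tailor to your audience, and the synthesis side can stay unchanged since the en-US-JennyMultilingualNeural voice already speaks multiple languages:
# Sketch: let the Speech SDK auto-detect the spoken language from a candidate list.
# The languages listed here are illustrative; adjust them for your users.
auto_detect_config = speechsdk.languageconfig.AutoDetectSourceLanguageConfig(
    languages=["en-US", "es-ES", "fr-FR", "de-DE"]
)
speech_recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    auto_detect_source_language_config=auto_detect_config,
    audio_config=audio_config
)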
We hope this guide has inspired you to create amazing, interactive applications with Azure Speech-to-Speech. Happy coding!
References
For more information on the step-by-step implementation, see Microsoft's official documentation.