Searching for an optimal speech recognition system with closed source code but an open API for integration. Google has opened access to the Cloud Speech API, the speech recognition technology that underlies Google Assistant.

  • Asterisk
  • Google API
  • Yandex API

    Choosing a Speech Recognition API

    I only considered the API option; boxed (on-premise) solutions were not an option: they require resources, the recognized data is not business-critical, and deploying them is considerably more complicated and takes more man-hours.

    The first was Yandex SpeechKit Cloud. I immediately liked it because of its ease of use:

    curl -X POST -H "Content-Type: audio/x-wav" --data-binary "@speech.wav" "https://asr.yandex.net/asr_xml?uuid=<user ID>&key= &topic=queries"
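
    For reference, a rough Python equivalent of this curl call (a sketch, not taken from the SpeechKit documentation; it assumes the third-party requests library and uses the same placeholder uuid and an API key placeholder):

    import requests

    # Rough Python equivalent of the curl request above (sketch).
    # The uuid and key values are placeholders, just like in the curl example.
    with open("speech.wav", "rb") as f:
        audio = f.read()

    response = requests.post(
        "https://asr.yandex.net/asr_xml",
        params={"uuid": "<user ID>", "key": "<API key>", "topic": "queries"},
        data=audio,
        headers={"Content-Type": "audio/x-wav"},
    )
    print(response.text)  # the service replies with XML containing the recognition variants
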
    Pricing policy: 400 rubles per 1000 requests. The first month is free. But after that there were only disappointments:

    - When sending a long sentence, the response contained only 2-3 words
    - These words were recognized in a strange order
    - Attempts to change the topic parameter did not bring positive results

    Perhaps this was due to the mediocre recording quality; we tested everything through voice gateways and ancient Panasonic phones. Still, I plan to use it in the future to build an IVR.

    The next one was the service from Google. The Internet is full of articles that suggest using the API intended for Chromium developers, but keys for that API can no longer be obtained easily, so we will use the commercial platform.

    Pricing policy: 0-60 minutes per month are free; after that, $0.006 per 15 seconds of speech. Each request is rounded up to a multiple of 15 seconds (for example, a 70-second recording is billed as five 15-second increments, i.e. $0.03). The first two months are free, but a credit card is required to create a project. The documentation describes a variety of API use cases; we will use a Python script:

    Script from documentation

    """Google Cloud Speech API sample application using the REST API for batch processing.""" import argparse import base64 import json from googleapiclient import discovery import httplib2 from oauth2client.client import GoogleCredentials DISCOVERY_URL = ("https://(api). googleapis.com/$discovery/rest?" "version=(apiVersion)") def get_speech_service(): credentials = GoogleCredentials.get_application_default().create_scoped(["https://www.googleapis.com/auth/cloud-platform "]) http = httplib2.Http() credentials.authorize(http) return discovery.build("speech", "v1beta1", http=http, discoveryServiceUrl=DISCOVERY_URL) def main(speech_file): """Transcribe the given audio file. Args: speech_file: the name of the audio file. """ with open(speech_file, "rb") as speech: speech_content = base64.b64encode(speech.read()) service = get_speech_service() service_request = service.speech ().syncrecognize(body=( "config": ( "encoding": "LINEAR16", # raw 16-bit signed LE samples "sampleRate": 16000, # 16 khz "languageCode": "en-US", # a BCP-47 language tag ), "audio": ( "content": speech_content.decode("UTF-8") ) )) response = service_request.execute() print(json.dumps(response)) if __name__ == " __main__": parser = argparse.ArgumentParser() parser.add_argument("speech_file", help="Full path of audio file to be recognized") args = parser.parse_args() main(args.speech_file)

    Preparing to use the Google Cloud Speech API

    We will need to register a project and create a service account key for authorization. Here is the link to get the trial; you need a Google account. After registration, activate the API and create an authorization key, then copy the key to the server.

    Let's move on to setting up the server itself. We will need:

    - python
    - python-pip
    - google-api-python-client

    sudo apt-get install -y python python-pip
    pip install --upgrade google-api-python-client
    Now we need to export two environment variables for the API to work: the first is the path to the service account key, the second is the name of your project.

    export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_file.json
    export GCLOUD_PROJECT=your-project-id
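
    The same variables can also be set from inside a Python wrapper before the client library is used; a minimal sketch with the same placeholder values:

    import os

    # Same placeholders as in the export commands above (sketch)
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service_account_file.json"
    os.environ["GCLOUD_PROJECT"] = "your-project-id"
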
    Let's download the test audio file and try to run the script:

    wget https://cloud.google.com/speech/docs/samples/audio.raw
    python voice.py audio.raw
    {"results": [{"alternatives": [{"confidence": 0.98267895, "transcript": "how old is the Brooklyn Bridge"}]}]}
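
    If you need the transcript itself rather than the raw JSON, the response is an ordinary Python dict and can be unpacked directly; a small sketch based on the response shown above (the helper name is mine, not from the documentation):

    def best_transcript(response):
        # Pull the top alternative out of the dict returned by
        # service_request.execute() in voice.py above.
        results = response.get("results", [])
        if not results:
            return None, None
        best = results[0]["alternatives"][0]
        return best["transcript"], best.get("confidence")

    # For the response above this returns:
    # ("how old is the Brooklyn Bridge", 0.98267895)
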
    Great, the first test is successful. Now let's change the recognition language in the script and try to recognize Russian speech:

    nano voice.py
    service_request = service.speech().syncrecognize(
        body={
            "config": {
                "encoding": "LINEAR16",   # raw 16-bit signed LE samples
                "sampleRate": 16000,      # 16 kHz
                "languageCode": "ru-RU",  # a BCP-47 language tag
    We need a .raw audio file; we use sox to convert it:

    apt-get install -y sox
    sox test.wav -r 16000 -b 16 -c 1 test.raw
    python voice.py test.raw
    {"results": [{"alternatives": [{"confidence": 0.96161985, "transcript": "\u0417\u0434\u0440\u0430\u0432\u0441\u0442\u0432\u0443\u0439\u0442\u0435 \u0412\u0430\u0441 \u043f\u0440\u0438\u0432\u0435\u0442\u0441\u0442\u0432\u0443\u0435\u0442 \u043a\u043e\u043c\u043f\u0430\u043d\u0438\u044f"}]}]}
    Google returns the answer to us as Unicode escape sequences, but we want to see normal Cyrillic letters. Let's change our voice.py a little:

    print(json.dumps(response))
    We will replace it with:

    s = simplejson.dumps({"var": response}, ensure_ascii=False)
    print s
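
    As a side note, the standard json module accepts the same flag, so on Python 3 (where print is a function) simplejson would not strictly be needed; a minimal sketch of the same line inside voice.py:

    # Python 3 variant of the same statement, without simplejson (sketch)
    print(json.dumps({"var": response}, ensure_ascii=False))
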
    Let's add import simplejson. The final script is below the cut:

    voice.py

    """Google Cloud Speech API sample application using the REST API for batch processing.""" import argparse import base64 import json import simplejson from googleapiclient import discovery import httplib2 from oauth2client.client import GoogleCredentials DISCOVERY_URL = ("https://(api ).googleapis.com/$discovery/rest?" "version=(apiVersion)") def get_speech_service(): credentials = GoogleCredentials.get_application_default().create_scoped(["https://www.googleapis.com/auth/cloud -platform"]) http = httplib2.Http() credentials.authorize(http) return discovery.build("speech", "v1beta1", http=http, discoveryServiceUrl=DISCOVERY_URL) def main(speech_file): """Transcribe the given audio file. Args: speech_file: the name of the audio file. """ with open(speech_file, "rb") as speech: speech_content = base64.b64encode(speech.read()) service = get_speech_service() service_request = service .speech().syncrecognize(body=( "config": ( "encoding": "LINEAR16", # raw 16-bit signed LE samples "sampleRate": 16000, # 16 khz "languageCode": "en-US", # a BCP-47 language tag ), "audio": ( "content": speech_content.decode("UTF-8") ) )) response = service_request.execute() s = simplejson.dumps(("var": response ), ensure_ascii=False) print s if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument("speech_file", help="Full path of audio file to be recognized") args = parser.parse_args( ) main(args.speech_file)


    But before running it, you will need to export one more environment variable: export PYTHONIOENCODING=UTF-8. Without it, I had problems with stdout when the script was called from other scripts.

    export PYTHONIOENCODING=UTF-8
    python voice.py test.raw
    {"var": {"results": [{"alternatives": [{"confidence": 0.96161985, "transcript": "Hello, welcome to the company"}]}]}}
    Great. Now we can call this script in the dialplan.

    Asterisk dialplan example

    To call the script, I will use a simple dialplan:

    exten => 1234,1,Answer
    exten => 1234,n,Wait(1)
    exten => 1234,n,Playback(howtomaketicket)
    exten => 1234,n,Playback(beep)
    exten => 1234,n,Set(FILE=${CALLERID(num)}--${EXTEN}--${STRFTIME(${EPOCH},,%d-%m-%Y--%H-%M-%S)}.wav)
    exten => 1234,n,MixMonitor(${FILE},,/opt/test/send.sh [email protected] "${CDR(src)}" "${CALLERID(name)}" "${FILE}")
    exten => 1234,n,Wait(28)
    exten => 1234,n,Playback(beep)
    exten => 1234,n,Playback(Thankyou!)
    exten => 1234,n,Hangup()
    I use MixMonitor to record the call and run the script when the recording finishes. You could use Record() instead, and that would probably be better. Here is an example send.sh for sending the result; it assumes you already have mutt configured:

    #!/bin/bash
    # script for sending notifications
    # export the necessary environment variables
    # Google license file
    export GOOGLE_APPLICATION_CREDENTIALS=/opt/test/project.json
    # project name
    export GCLOUD_PROJECT=project-id
    # python encoding
    export PYTHONIOENCODING=UTF-8
    # list of input variables
    EMAIL=$1
    CALLERIDNUM=$2
    CALLERIDNAME=$3
    FILE=$4
    # re-encode the sound file to raw in order to pass it to the Google API
    sox /var/spool/asterisk/monitor/$FILE -r 16000 -b 16 -c 1 /var/spool/asterisk/monitor/$FILE.raw
    # run the speech-to-text script and cut off the unneeded JSON wrapping
    TEXT=`python /opt/test/voice.py /var/spool/asterisk/monitor/$FILE.raw | sed -e 's/.*"transcript"://' -e 's/}]}]}}//'`
    # send the letter, including the recognized text in the body
    echo "new notification from the number: $CALLERIDNUM $CALLERIDNAME $TEXT" | mutt -s "This is the header of the letter" -e "set [email protected] realname='I am sending alerts'" -a "/var/spool/asterisk/monitor/$FILE" -- $EMAIL
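
    As an aside, the sed trimming above is rather brittle; the transcript could instead be extracted with a tiny Python helper (a sketch with a hypothetical file name get_transcript.py, not part of the original scripts):

    #!/usr/bin/env python
    # get_transcript.py (hypothetical helper, sketch): read the JSON printed
    # by voice.py from stdin and print only the recognized text.
    import json
    import sys

    data = json.loads(sys.stdin.read())
    results = data.get("var", data).get("results", [])
    for result in results:
        print(result["alternatives"][0]["transcript"])

    In send.sh the sed pipeline could then be replaced with something like TEXT=`python /opt/test/voice.py /var/spool/asterisk/monitor/$FILE.raw | python /opt/test/get_transcript.py` (the helper path is hypothetical).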

    Conclusion

    Thus, we solved the problem. I hope my experience is useful to someone. I will be glad to receive comments (perhaps this is the only reason why it’s worth reading Habr!). In the future, I plan to implement an IVR with voice control elements based on this.

    The Google Speech API is Google's voice recognition service.

    Speech recognition makes it possible to build automated customer service systems in cases where touch-tone (DTMF) control is inconvenient. As an example, consider an air ticket booking service that involves choosing from a large number of cities. A tone menu is not convenient for such a service, so voice control is the most effective option. The dialogue between the system and the subscriber may look like this:

    System: Hello. Where do you want to fly?
    Subscriber: Kazan
    System: Where do you want to fly from?
    Subscriber: Moscow
    System: Specify the departure date
    Subscriber: April 10

    A speech recognition system typically consists of the following parts (a small sketch tying them together is given right after the list):

    • Recording a message from a subscriber
    • Voice recognition and receiving text data from the service
    • Analyzing the information received and taking the necessary actions
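
    To make this division concrete, here is a minimal sketch of how the three stages could be glued together in Python, reusing the voice.py script from the first part (record_call and act_on are hypothetical placeholders, not part of any of the scripts above):

    import subprocess

    def record_call():
        # Stage 1 (placeholder): in the Asterisk example above the recording
        # is produced by MixMonitor; here we simply return the path of an
        # already recorded and sox-converted file.
        return "/var/spool/asterisk/monitor/test.raw"

    def recognize(raw_file):
        # Stage 2: send the recording to the service and get the text back,
        # reusing the voice.py script described earlier.
        out = subprocess.check_output(["python", "/opt/test/voice.py", raw_file])
        return out.decode("utf-8")

    def act_on(text):
        # Stage 3 (placeholder): analyze the text and take the necessary
        # action, e.g. route the call or send a notification as in send.sh.
        print("recognized: " + text)

    if __name__ == "__main__":
        act_on(recognize(record_call()))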

    To use the Google Speech API in your Oktell system, do the following:

    Step 1. Download the scripts and import them into your Oktell system.

    Download the script (for Oktell versions older than 2.10):

    The archive contains two scripts:

    • Google_Speech_API_main - a script for recording a voice message; it is an example of the correct use of the recognition service in the main scenario.
    • Google_Speech_API - a script that sends the recording to the Google service and receives the recognized message.

    After importing the scripts into Oktell, save them using "To server".

    NOTE: The Google Speech API is a paid product. The script (the GoogleVoice Web request component) uses a trial key, which can be blocked after a certain number of requests. During testing, no maximum number of requests was found. If you want to purchase the paid version of the Google Speech API, contact Google support.

    Step 2. In the "Administration" - "External numbers" module, add an extension number with the type "Launching IVR". Select the Google_Speech_API_main IVR scenario.

    Here is some information from the Internet, from the site vorabota.ru:

    To start converting voice to text, you will need a microphone (built into laptops), a reasonably fast internet connection, and the Google Chrome browser, version 25 or later. In other browsers the voice text input function, unfortunately, does not work.

    Open the voice input page in the Chrome browser. At the bottom of the window, select the language in which you plan to dictate. Click on the microphone icon in the top right corner, and in the pop-up bar click "Allow" so the browser can use the microphone.

    Now you can slowly and clearly pronounce short phrases. After you finish dictating, select the text, copy it to the clipboard with Ctrl+C, and paste it into any editor for further processing. If desired, the text can be sent immediately by e-mail.

    Perhaps the Web Speech API is the simplest and a fairly high-quality way to convert your speech into text, since there is no need to be distracted by extra manipulations with the keyboard: just turn on the microphone and speak.

    In any case, you will have to use some additional text editor for further correction of the dictated text.

    I opened the page http://vorabota.ru/voice/text.html in the Google Chrome browser and tried voice text input. I read the phrase "Web Speech API Voice typing. Select all. Send Email", but received "Websphere api voice typing select all send email". Second try: the phrase "Click the Allow button to unmute the microphone" came back as "Click the Allow button to enable the microphone".

    A comparison of the original phrases and the results shows that: a) a Russian phrase is converted into Russian text with sufficient quality; b) an English phrase is converted into English text with errors that are easy to correct; c) the text still needs mandatory correction to fix errors and add punctuation marks and capital letters; d) this implementation of voice typing differs from others available on the Internet in its extreme simplicity: there is nothing superfluous in it, which makes it easy to learn and use.

    My conclusion is this: it makes sense to implement voice text input this way on your own website, to make it easier to enter text on its pages.

    You just need to paste the required code onto the appropriate page of the site.

    I created a separate page intended only for voice text input and began debugging it.

    Here is the code of the "Dictate the text" page:

    Debugging code...

    You can use the given code on your website, transforming it as you see fit.

    I invite everyone to speak out in the comments.

    Nowadays it is simply impossible to get by without a computer in the modern world. You are not required to be a Photoshop master or a professional video editor (unless it is work-related, of course), but being able to type some text is the required minimum.

    No. 2. Web Speech API

    The Web Speech API online program is absolutely identical in functionality to the previous ones.

    This service, like those listed above, was also created by Google. Its home page looks like this: a simple interface in which it is immediately obvious that to start recording you need to select a language and then press the microphone.


    After you click on the icon on the right, the system will definitely make a request for access.

    After you give the go-ahead, you can immediately begin work. Type text by voice, and its printed version will appear in the window.

    After finishing the work, you can copy the text wherever you need it (again, ctrl+C, ctrl+V).

    No. 3. Talktyper. A no less popular service is Talktyper.

    To get started, go to the website: https://talktyper.com/ru/index.html.


    To get started, just click on the microphone icon on the right.

    Unlike those described above, this voice typing tool can be opened in any browser. Although the site was created in the USA, the application easily recognizes the most popular languages of the world, including Russian.

    Talktyper is multifunctional: it not only types text, but also puts punctuation marks and corrects mistakes on its own. If the system cannot recognize a word you read as correct, it will definitely be highlighted.

    In addition, Talktyper has a translation function, as well as voiceover.

    Note! After you finish voice typing, be sure to click on the arrow so that the typed document is transferred to another field. After this, it can be sent by email or copied to the desired file.

    Possible problems when working with voice typing programs

    When you start using these programs, you will definitely wonder how the computer recognizes our voice and then translates it into live text.

    The speech recognition scheme of the device looks like this:

    The whole process can be divided into 3 main stages:

      Acoustic recognizer.

      It is important to speak clearly, loudly, and the microphone must transmit your voice without interruption.

      Linguistic processing.

      The more words there are in the program's dictionary, the better the quality of the typed text: everything you say will be recognized and converted into text without distortion.

      The recognized, spelled-out text.

      The program automatically produces a written (correctly spelled) version of the dictated speech, relying on pauses, the clarity of the words, lexemes found in the dictionary, and so on.

    When working with these computer "typists", two problems most often arise:

    1. The acoustic recognizer "catches" your speech intermittently.
    2. There are not enough words in the system's dictionary to recognize everything you said.

    To solve the first problem, you need to speak clearly and loudly. For the second problem there is practically no solution, at least not a free one.

    Freely distributed versions of speech recognition programs have a very limited vocabulary.

    To provide a program with an extensive vocabulary, developers need to invest a lot of money, which is why many recognizers demonstrate a low quality of speech-to-text conversion.

    Google has advanced the furthest in this matter, because it has enough funds to invest. Among other things, the company has created the largest online dictionary, which helps recognize voice and convert it into written text.

    See the detailed guide in this video:

    1. When you dictate, the room should be quiet. Sounds of nature, music, or a crying child are perceived by the system as noise; because of this, the text will be typed with serious errors.
    2. Don't talk while eating. This not only affects the quality of the typed text, it can also be dangerous.
    3. Before you start, you need to choose the correct volume of your voice and also understand how sensitive your microphone is.

      To do this, try recording a few sentences in your normal tone. If there are interruptions in the recording, check the microphone settings.

    4. Take short breaks between words.
    5. Avoid long phrases.

    Some will say that a voice typing program is a wonderful assistant that frees up their hands and generally makes life easier. Others will decide that "the game is not worth the candle." So you will have to decide for yourself whether to use them.

    And you already know which services to choose from...

    Since that day, independent developers have had access to the Cloud Speech API, the speech recognition technology on which Google products are based. The updated product is now available on Google Cloud.

    An open beta version of Cloud Speech was released last summer. With a simple API, this technology allows developers to convert audio to text. Its neural network models can recognize more than 80 languages and dialects, and the finished transcription appears as soon as the text is spoken.

    The API is built on technology that provides speech recognition functionality in Google Assistant, Search and Now, but the new version has made changes to adapt the technology to the needs of Cloud users.

    What's different about the new version of the Cloud Speech API?

    Thanks to developer feedback, the Google team was able to increase the accuracy of transcription of long audio recordings and speed up data processing threefold compared to the original version. Support for additional audio formats has also been added, including WAV, OPUS and Speex.

    According to statistics, in the past this API was most often used to control applications and devices via voice search, speech commands and voice menus. But Cloud Speech can be used in a wide range of IoT devices, including cars, TVs, speakers and, of course, phones and PCs.

    Among the common uses of the technology, it is worth noting its use in organizations to analyze the work of call centers, track communication with customers and increase sales.