
In this blog post I’ll explain what Alexa is and why voice-recognition technology is important, before giving a brief overview of creating Alexa Skills on AWS.

What is Alexa?

Alexa is a cloud-based voice recognition and response service provided by Amazon. It allows users to control their devices and access information by speaking commands.

It works with a variety of hardware, including Amazon’s Echo range and devices from third-party manufacturers.

Why Alexa?

This choice is purely based on my own familiarity. The other main players in the field are Google Assistant and Apple’s Siri, both of which are very comparable voice-recognition services.

Why Voice?

Computer user interface design has gone through various stages of evolution, each step trying to improve on previous designs in terms of usability – some more successfully than others. In the case of accessibility, this is especially important; however, if you ask anyone involved in accessibility, they’ll likely tell you how far short these interfaces often fall of meeting the needs of many users.

A (Very) Brief History of Computer User Interface

Batch Computing / Punch Cards

Very early “general use” computers required the laborious creation of punch cards, and generally had less computing power than your average washing machine.

Command Line Interface (CLI)

Although some people still regard this as the *only* way to control your computer, the general consensus is that it can be a bit inconvenient for many everyday tasks. The main method of input at this point was the keyboard. Although still a staple of any workstation setup (and it probably will be for a long, long time), learning to type can be a difficult barrier to usage.

Graphical User Interface (GUI)

The advent of the graphical user interface saw the introduction of window-based systems, which were a huge leap forward in terms of usability and UX. And of course, to drag these shiny new windows around, the mouse came into widespread use – a very intuitive method of interacting with the computer.

Smartphone

Although previously used with desktop computers, touchscreen technology never really took off until smartphones and other “smart” devices appeared. No more loud clanking of keys!

Voice

The development of voice-operated technology forms another great leap forward in terms of usability and UX. The only ability you need in order to interact with the software is to be able to speak the same language. Since the interface is almost completely natural, there is almost no learning curve.

Why the use of “almost” in the previous sentence? Well, there is still the fact that you have to learn how to pose your commands in a way that will get the results you want.

Aside from the ease with which people can start using this kind of interface, there are other benefits, such as improved posture and reduced eye-strain – both issues that anyone who works at a computer for long periods of time will no doubt be familiar with.

(Image: the evolution of computer user interfaces)

The Balance of Power

One thing that becomes apparent when looking at the various user interfaces that have evolved over the years is that they all have something in common: despite the many resources and person-hours devoted to the problem, they all require people to adapt, to some degree, to the way computers operate. Technologies such as voice and gesture recognition (which are genuinely difficult problems to solve) arguably shift this balance – the computer is now working to fit in with how humans operate.

Challenges of Speech Recognition

Speech recognition research goes way back to the 1950s, and it is one of those problems that is deceptively difficult, because to us speech comes as second nature. Take, for example, the following exchange:

Shopkeeper: “May I ask what you are looking for?”

Customer: “Four candles.”

Of course, when we see this in written form, we know exactly what the person wants. But when heard in spoken form, it could be that the person actually wanted “fork handles”!

This is a fairly trivial and unlikely scenario, but ambiguity such as this is extremely common in spoken language. The implications are enormous when you consider the kinds of operations computers are tasked with.

Many methods have been tried over that time, with hidden Markov models and neural networks amongst the most successful. More recently, deep learning has been applied to the problem, and that’s what has given us the robust solutions currently available.

Comparison of Required Steps

“There is no substitute for hard work.” ~ Thomas A. Edison

… Or is there? Even relatively modern technology comes with an overhead of extra steps, making us adapt to the technology and not the other way around.

Here are some basic tasks set out “algorithmically” to illustrate how much work we do without even realising:

Turn on/off the TV:

Traditional method

  • Find remote control
  • Pick up remote
  • Find correct button
  • Press button

Voice Method

  • “Alexa, turn the TV on/off”

Add item to To-Do list:

Traditional Method

  • Find phone (and pick it up)
  • Unlock phone
  • Find app
  • Open app
  • Type in the item

Voice Method

  • “Alexa, add ‘buy bread’ to my to-do list”

As you can see, having the ability to control devices with your voice involves far fewer steps.

Metrics

Here are some graphs illustrating the general state of voice-activated services. Although numbers are small compared to app-store downloads, the field is growing quickly:

(Charts: smart speaker market share in the US, December 2017; Alexa app downloads on Google)

Credit: https://www.voicebot.ai/amazon-echo-alexa-stats

Intro to the Alexa Skills Kit

The Alexa Skills Kit provides the framework with which to receive, recognise, process, and respond to voice commands received via a device.

The general steps are as follows:

Overview of Required Steps

Name Your Skill

Choose an invocation name for your skill; this is the phrase a user speaks to tell Alexa which skill should handle their command.

Define Your Intent

In the Alexa Skills Kit, an “intent” is a request or action associated with a user’s command.

For example, in the sentence “Alexa, what happened on this day in 1729?”, the command “what happened on this day in 1729?” will map to a predefined intent, giving the skill the keywords it needs to fetch the answer.

To make your skill more flexible, you can use something called “slots”, which are essentially typed placeholders for particular pieces of data, such as the year in the example above; there’s a sketch of this after the next step.

Build the Model

Here, we set out the overall structure of interacting with Alexa – the invocation name, intents, slots, and sample utterances – including prompts for more information and possible answer patterns.
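To make this concrete, here’s a minimal sketch of what such a model might look like. Everything in it is hypothetical: the invocation name “history buff”, the intent name OnThisDayIntent, and the sample utterances were chosen to match the “1729” example above. It’s written as a Python literal mirroring the JSON you’d enter in the Alexa developer console, so it shares a language with the Lambda sketch further down.

```python
import json

# A minimal, hypothetical interaction model. The dict mirrors the JSON
# you would enter in the Alexa developer console's JSON editor.
interaction_model = {
    "interactionModel": {
        "languageModel": {
            # What users say to open the skill (hypothetical name).
            "invocationName": "history buff",
            "intents": [
                {
                    "name": "OnThisDayIntent",  # hypothetical intent name
                    # A slot: a typed placeholder filled in from the utterance.
                    "slots": [
                        {"name": "year", "type": "AMAZON.FOUR_DIGIT_NUMBER"}
                    ],
                    # Sample utterances; {year} marks where the slot value goes.
                    "samples": [
                        "what happened on this day in {year}",
                        "tell me what happened on this day in {year}",
                    ],
                }
            ],
        }
    }
}

# Print as JSON, ready to paste into the console.
print(json.dumps(interaction_model, indent=2))
```

A real model would also include Amazon’s built-in intents (help, stop, cancel and so on), which the developer console typically adds for you when you create a new skill.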

Define an Endpoint

This points to the location of the code that will handle the logic of your application. It can be hosted anywhere on the internet, but the simplest option is to create an AWS Lambda function, which your skill can call remotely.

Create a Lambda Function

Creating this on AWS is pretty straightforward. You need to do a bit of setting up and mapping, but blueprints are provided, which give a good idea of what’s necessary.

When commands are received from your Alexa skill, the code here works out the correct response and updates any state that needs to be tracked.
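As a rough illustration, here’s what a minimal Python handler for the hypothetical OnThisDayIntent above might look like, working with the raw Alexa request/response JSON rather than the ASK SDK, and with the actual look-up left as a placeholder:

```python
def build_response(text, end_session=True):
    """Wrap plain text in the JSON envelope the Alexa service expects back."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }


def lambda_handler(event, context):
    """Entry point AWS Lambda calls with each request from the Alexa service."""
    request = event["request"]

    if request["type"] == "LaunchRequest":
        # The user opened the skill without giving a specific command.
        return build_response(
            "Ask me what happened on this day in any year.", end_session=False
        )

    if request["type"] == "IntentRequest":
        intent = request["intent"]
        if intent["name"] == "OnThisDayIntent":  # hypothetical intent from above
            # Slot values arrive as strings, and may be absent if unfilled.
            year = intent["slots"]["year"].get("value")
            if year is None:
                return build_response("Which year?", end_session=False)
            # Real look-up logic would go here; this reply is a placeholder.
            return build_response(f"Here is what happened in {year}...")

    return build_response("Sorry, I didn't catch that.")
```

Amazon’s ask-sdk package wraps this plumbing up in handler classes, but working with the raw format makes it clearer what actually travels between the Alexa service and your endpoint.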

Conclusion

This has been a (very) brief overview of what voice-activated services do and how to create one using the Alexa Skills Kit. In the next post, we’ll go through actually creating a small skill and publishing it to Amazon Skills! Alternatively, for help with your web design and development, get in touch with us today.
