BBH Labs interview with Marcel Kornblum, Head of Creative Technology, BBH London

Voice is one of the big technological themes of 2017 and with Amazon now claiming that ‘Alexa’s speech is protected by the First Amendment’, it is time to listen in to the conversation around voice.

Bots are already a channel that brands are taking up, so voice seems to be the next logical step. We’re expecting to see voice interfaces in lots of new places, not just Alexa, Siri and Google Home fighting to be the voice-controlled service of choice within people’s four walls.

This poses an interesting brand experience design challenge, as brand strategy and technology converge in a very specific field. On the technology side there are many conversations being had about ‘voice tech’, while brand strategists have been discussing a brand’s tone of voice for years. The intersection of these two areas is what interests us most at the moment.

We are currently exploring this area with a number of our clients, and as this is all about conversations, I sat down with BBH’s Head of Creative Technology, Marcel Kornblum, to discuss the process of designing a brand’s voice and the challenges that we are currently facing.

Labs: Contagious just released an overview on voice interfaces and said that “voice technology is an interesting way of conveying a brand’s personality and it can help foster a more emotional, human-like engagement with consumers.” What are your thoughts on this?

MK: I think there are two distinct challenges in the context of automated assistants: what the bot says, and what the voice that says it sounds like.

Traditional tone of voice is still a factor in the sense of the choice of words that the brands use to communicate, but the shape of the challenge itself is now different. In the context of bots, the interaction between the brand and the user is much more conversational.

Tone of voice is becoming harder to control. For example, when you make a TV ad, you write a script, and the brand’s voice lives within that script. In social, you have people who are trained to speak in the voice of the brand, and that can be conversational, but it’s still down to the individual writing the messages, even though in many cases they speak directly as the brand.

With bots, it’s that conversational side of things except that it’s now all automated, so it’s even tougher to have a distinct voice.

In addition to all that, there is a new challenge which is the actual sound of the voice that says those words.

Labs: So how would you start building a brand’s voice?

MK: There are out-of-the-box frameworks for making a bot, for example to create a new sales channel for your ecommerce site. Microsoft, IBM, Google and Facebook all have open bot frameworks, and you can configure how the bot responds to things. Most of those solutions will have some generic understanding of conversation.

The risk there is that the conversation becomes generic. Inevitably it’s friendly, helpful and clear but not specific to your brand. It’s expensive to build a model that will be more specific to your brand.

On the voice side of things, there are lots of different aspects to the challenge but essentially the tech landscape around bots has some speech generation frameworks and tools.

These are appropriate when you need to be very flexible and allow any text to be spoken aloud, but in the worst case they sound extremely robotic, like the accessibility assistant on your computer or your sat nav.

That approach is valid because it’s very flexible, quick and cost effective but it doesn’t sound very human, although many companies are making great strides in that area — for example IBM Watson has become much more human.

On the other hand, there are very bespoke solutions that allow the brand to really own the voice itself; many of these come from the media sector rather than the bot frameworks, and as a result may not be very flexible.

Labs: So how do you decide which approach to go for?

MK: If you need to be able to say absolutely anything with the voice, as in a free-ranging open conversation, you will probably sacrifice a bespoke brand voice in exchange for an unlimited vocabulary. With Alexa, for example, users can say anything and the framework needs to come back with some sort of response, even if it is ‘Sorry I don’t understand that’.

If your application is much more on rails, as in there are specific choices that a user makes, you can go with a more templated approach.

It is a bit of an old-fashioned lo-fi option: essentially you record templates and individual words, then drop the words into the templates, much as TfL announcements or sat navs do (most of them, anyway).

That gives you a voice that is clearly assembled but has a personality that an auto-generated voice doesn’t. So it depends what kind of application you want to build.
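As a rough illustration of the templated approach described above, here is a minimal Python sketch. Everything in it is hypothetical: the clip names, the file paths and the `assemble` helper are invented for the example, and a real system would also need to play or stitch the audio. It only shows the core idea of dropping pre-recorded words into a pre-recorded template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """A pre-recorded carrier phrase, split around its slots."""
    name: str
    parts: tuple  # clip ids and "{slot}" placeholders, in playback order


# Hypothetical clip library: clip id -> audio file recorded by one voice actor.
CLIPS = {
    "next_station_is": "clips/next_station_is.wav",
    "change_here_for": "clips/change_here_for.wav",
    "oxford_circus": "clips/oxford_circus.wav",
    "central_line": "clips/central_line.wav",
}

NEXT_STATION = Template(
    name="next_station",
    parts=("next_station_is", "{station}", "change_here_for", "{line}"),
)


def assemble(template, slots):
    """Return the ordered list of audio files for one announcement."""
    playlist = []
    for part in template.parts:
        if part.startswith("{") and part.endswith("}"):
            clip_id = slots[part[1:-1]]  # look up the word chosen for this slot
        else:
            clip_id = part  # fixed piece of the template
        if clip_id not in CLIPS:
            # The vocabulary is limited to what was actually recorded.
            raise KeyError(f"no recording for {clip_id!r}")
        playlist.append(CLIPS[clip_id])
    return playlist


playlist = assemble(
    NEXT_STATION, {"station": "oxford_circus", "line": "central_line"}
)
```

The `KeyError` branch is the trade-off in miniature: the voice keeps its personality, but it can only say words that someone has recorded.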

Labs: Google, Amazon and co are the obvious players in this field but are there other areas where the innovation is coming from?

MK: Interestingly, there are more sophisticated options around, coming from specialist companies in the film or gaming industry who do lots of amazing stuff with voice. One of those options is to get a voice actor to do an extremely technical, phoneme-based recording of their voice. So you record every individual ‘oh, ah, uh’ and so on. When you have recorded enough, you can build a speech generation engine that uses that person’s voice, which then sounds much more distinctive and also offers an unlimited choice of words.

Labs: Are there certain types of brands that lend themselves better to being expressed via voice?

MK: I think there are certain types of functionality that lend themselves better to voice than others. Voice recognition is not 100% accurate, and that deficit in accuracy is really important. The standard in the industry is 94%, but the remaining 6% is infuriating. So 94% of the time everything works well, but 6% of the time it’s unbelievably awful, and that is a really significant thing.
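To see why a 6% failure rate feels so significant, here is a small back-of-the-envelope sketch in Python. It rests on a simplifying assumption that is not from the interview: that each turn in a conversation is recognised independently, with the same 94% chance of success. The function name is made up for the example.

```python
def chance_of_clean_conversation(accuracy, turns):
    """Probability that every one of `turns` utterances is recognised,
    assuming each turn succeeds independently with probability `accuracy`."""
    return accuracy ** turns


for turns in (1, 5, 10):
    p = chance_of_clean_conversation(0.94, turns)
    print(f"{turns:>2} turns: {p:.0%} chance of no misrecognition")
```

Under that assumption, a five-turn exchange goes through without a hitch only about 73% of the time, and a ten-turn exchange only about 54% of the time, so longer conversations hit the infuriating cases far more often than the headline figure suggests.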

Most Alexa owners will probably say that generally they love it, but that it’s terrible some of the time when it doesn’t recognise things. Accents play a really big part. If you speak English with a foreign accent, recognition is way down.

When Siri gets stuff wrong it’s annoying, but it’s just search and there is also a screen to fall back on. When Alexa gets stuff wrong, it’s really infuriating because there is no screen, no fallback, and no getting around it. Alexa for that reason is very command based. Google Home supposedly can be much more conversation based. So a few phrases into your conversation, it still knows the context of your first phrase, which makes things more natural.
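The difference between command-based and conversation-based handling can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Alexa or Google Home actually work: requests arrive already parsed into dictionaries, and the `REQUIRED` table, `command_based` function and `Conversational` class are all hypothetical names.

```python
# Hypothetical: the details each kind of request needs before it can be answered.
REQUIRED = {"weather": ("city", "day")}


def command_based(request):
    """Handle one request in isolation: every detail must be in this request."""
    missing = [slot for slot in REQUIRED[request["intent"]] if slot not in request]
    if missing:
        return "Sorry, I don't understand that."
    return f"Weather for {request['city']} {request['day']}"


class Conversational:
    """Remembers details from earlier turns and reuses them when they are omitted."""

    def __init__(self):
        self.context = {}

    def handle(self, request):
        # Newer details override older ones; anything unsaid is carried over.
        merged = {**self.context, **request}
        self.context = merged
        return command_based(merged)


first = {"intent": "weather", "city": "London", "day": "today"}
follow_up = {"intent": "weather", "day": "tomorrow"}  # "And tomorrow?"

isolated_reply = command_based(follow_up)
assistant = Conversational()
first_reply = assistant.handle(first)
contextual_reply = assistant.handle(follow_up)
```

Handled in isolation, the follow-up fails because the city was never repeated; with context carried over, the same follow-up is answered for London.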

The models are trained on voices. I think there is a whole other interesting topic here, which is that of implicit bias, both in recognition and generation. The generated voices are all white; they are all Susan and James. That may matter to brands if they use generated voices.

Labs: Are there differences in how to use voice, depending on the channel or platform?

MK: I think it’s much the same as with brands and their content on owned vs. paid platforms. It’s a tradeoff and there are different options available.

For example, on YouTube you put your video there and what you get is lots of people passing by and the algorithm showing it to people who might be interested, and what you give up for it is the context in which that video sits. On your own platform you get to control the whole thing, and you might still use the YouTube player in there.

If you put a voice service on Alexa or Google Home you need to talk to Alexa to get to your service, and Alexa will talk back to you. So as a brand you can provide a custom skill to Alexa, but ultimately it’s Alexa talking with Alexa’s voice.

Your owned platform offers more control but is more expensive in the development and in actually getting people there.

Labs: What is the right context to use a voice interface?

MK: There is a whole environmental thing. It can feel weird talking to your phone in public, especially if people can hear you.

It’s different in the privacy of your own home but then the challenge is multiple users, as these devices are normally shared technology. Microsoft (among others) is working on distinguishing between different people to recognise individual preferences, which is interesting as bots become part of a group conversation.

For example, bots sit on our messaging services and it is becoming normal for them to be part of the conversation. So when we use Slack as a team, there are bots in there too. There is an interesting question about how bots will become part of our conversations in the offline world via voice: knowing which person is talking and being able to serve group conversations in the real world.

Labs: In the movie ‘Her’, a super smart AI operating system gets intimate with her user and interacts with him in the most natural way possible. When we get there, will voice become the main way to interact with brands?

MK: It’s definitely going to be an important one, but I think it will differ according to territory, sector and functionality.

For example, lots of the voice innovation has come from China. China is ahead in uptake of voice interfaces on platforms like Weibo, because a huge part of the population there is not fully literate.

Another factor is that people speak much faster than they write. As voice recognition becomes better, there will definitely be brands speaking with voices because it’s a marketing and service opportunity.

It’s a really exciting time to be experimenting with this stuff.