VoiceXML – Good, Bad & the Ugly

[Disclaimer: I am the CEO of TringMe, the company that created VoicePHP. However, the views in this article come purely from my personal experience working with VoiceXML; please treat them as the views of an independent programmer.]

Last week, I had an interesting discussion with an industry veteran in voice. The discussion slowly turned to VoicePHP, how it compares to VoiceXML, demos and so on. At the end of the demo, his first reaction was “Is Vo*** dead?” Even though it might be slight hyperbole, it was a genuine, unfiltered reaction of the moment. As a programmer, I couldn’t agree more, but let’s ground the claim and spend some time going over the reasons behind such a reaction and the pain points of programming in VoiceXML.

Although VoiceXML has been around for a while, it hasn’t really gained significant momentum. As a VoiceXML developer for quite a long time, I can definitely see why that is the case. This is my attempt to identify some of the key deficiencies of VoiceXML as a programming language for voice.

While XML (and in general any markup language) has been great for representing data, using it for programming is like using a wrench to hit a nail. All you need to drive a nail is a hammer! Although a wrench can do a manageable job of hitting a nail, it’s not elegant, it creates a mess and it cannot address the problem with precision. The same is the case with XML. It was designed for, and is best suited to, data representation and data transfer, and it has repeatedly proven its worth there. Trying to use it for programming is adapting it for something that it inherently wasn’t meant to do. Let me try to explain this in detail:

Consider a typical “hello world” application in VoiceXML (courtesy, vxml.org). 

<?xml version="1.0" encoding="UTF-8"?>
<vxml version = "2.1" >
  <form>
    <block>
    <prompt>
      Hello World. This is my first telephone application.
    </prompt>
    </block>
  </form>
</vxml>


It took 10 lines of code to write something as simple as that. On top of that, even for a simple hello world application, one has to go and read up on the <form> and <block> tags, which seem like overkill.

Shocked? You are not alone. As Dominique Boucher states on his blog, “from a developer’s perspective, it’s (VoiceXML) like having to program in Cobol! And I only slightly exaggerate.”

The same application in a language like C or PHP is reduced to barely 2-3 lines of code, which is more readable and intuitive.

<?php
      
echo 'Hello World. This is my first telephone application';
?>

See the difference? Don’t take my word for it; here is what veterans and users say about VoiceXML. We will soon delve into why they say so.

From the Industry Veterans

In early 2007, Brian OConnor commented on an annoyance in the VXML standard: as he states, he was unable to use <if> within a <prompt> tag. It’s an arbitrary limitation and requires a nasty workaround.

In an article, “Is VoiceXML the Right Tool for Your Voice Application?”, Brian Brown identifies very precise weaknesses of VoiceXML. For example, when even basic voice controls (pause, resume etc.) are not available in VoiceXML, how can it even be considered the language to program voice? He nailed the problem very well. Look at how VoicePHP addresses it in a sample application here.

Dannis, in his interesting email and unique style, shares the pain of VoiceXML: “TCL was the most ugly language of the 90-ies. VXML has now taken over. The language appears not to have iteration (while, for) and no recursion. But it DOES have the goto primitive, which was banned by Dijkstra 30 years ago. There is no function abstraction and neither object-oriented constructs.”

He further adds a point that I will elaborate on later in this post: “VXML is an interpreted language using Javascript. Why not using only Javascript with a bundle of speech specific predefined functions? Hacking java-servlet code already entails generating HTML and Javascript. I don’t see why we have to follow the same painful route with VXML.”

Even VoiceXML vendors are aware of the limitations, and they have tried to create specific, proprietary enhancements to get around them, for example CallXML by Voxeo. In fact, Voxeo’s CEO commented on Gigaom’s coverage of VoicePHP: “As a developer I do not like VoiceXML. Personally I find it to be too complicated, painful, and a barrier to entry for new developers as others have said. This is why Voxeo offers many other ways to create voice applications, including: CallXML – a very powerful yet simple XML based telephony markup;” Do I need to say anything more?

The big question – Why do they say so? 

Unorganized jungle of XML, JavaScript, CDATA etc.

In my opinion, VoiceXML looks like a creation born of obsession. XML was the new kid on the block and, perhaps out of fascination with it, somehow fitting it to voice programming needs became the name of the game. Basic TTS & ASR was made to work – wow! So far so good!

Then someone realized that even simple, common programming requirements cannot be met in XML. There wasn’t a simple way to address this in XML, and that’s how Javascript (ECMAScript) became a part of the VoiceXML standard. In my vocabulary, this is nothing more than a “workaround”. If Javascript was being considered, then why not do everything in Javascript? When Javascript can completely replace XML (exactly as VoicePHP does), is there any logical reason to keep XML around and, more importantly, to continue it as a “standard”? To be honest, this workaround has only made the life of programmers more complex. To illustrate, consider the following sample code that reads out a caller ID (courtesy):

<?xml version="1.0" encoding="UTF-8"?>
<vxml version = "2.1">
<meta name="maintainer" content="YOUREMAILADDRESS@HERE.com"/>
  <form id="HelloWorld">
    <block>
      <script>
        <![CDATA[
          function sayasDigits(number)
          {
            var digitNumber = number.charAt(0);
            for(var i = 1; i < number.length; i++)
            {
              digitNumber += ' ' + number.charAt(i);
            }
            return digitNumber;
          }
        ]]>
      </script>
    <prompt>
      Hello there. The caller id value is: <value expr="sayasDigits(session.callerid)"/>
      The called i d of this application is: <value expr="sayasDigits(session.calledid)"/>
    </prompt>
    </block>
  </form>
</vxml>

Now compare this code with VoicePHP equivalent (demo here):

<?php
echo "Your caller ID is {$_VOICEPHP['callerid']}";
?>

What a mind-blowing difference between the two solutions. The above example demonstrates the point we are all trying to make: a workaround versus a natural programming language. As one can see, VoiceXML has to fall back on Javascript because XML lacks even the basic capability to manipulate numbers or strings. How can it even be considered for programming?
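To make the “only Javascript with a bundle of speech specific predefined functions” idea concrete, here is a minimal sketch of how the caller-ID sample might read without any XML around it. The `say` helper and the `session` object are assumptions made up for illustration (they are not part of any real platform API); `say` only records the text that would be spoken so the snippet runs anywhere.

```javascript
// Hypothetical sketch: plain JavaScript plus one tiny speech helper.
// `say` is an assumed stand-in for a platform text-to-speech call; here it
// only records what would be spoken.
const spoken = [];

function say(text) {
  spoken.push(text);
}

// Space out the digits so a TTS engine reads them one by one.
// Same logic as sayasDigits in the VoiceXML sample above.
function sayAsDigits(number) {
  return String(number).split("").join(" ");
}

// Assumed session object; the VoiceXML sample reads session.callerid instead.
const session = { callerid: "4155550100" };

say("Hello there. The caller id value is: " + sayAsDigits(session.callerid));
console.log(spoken[0]);
```

The digit-spacing logic is the same as in the VoiceXML version; what disappears is the wrapping of <script>, CDATA and <value expr="..."/> around it.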

As you explore more, you will realize that VoiceXML is handicapped enough that it cannot even offer simple loops on its own. Can you imagine an application without such a basic control statement? Yet such basic structures were not addressed in VoiceXML, which simply falls back to Javascript for this kind of basic work, in a messy way.
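For comparison, this is the kind of explicit loop I have in mind, sketched in plain Javascript: re-prompt the caller up to a fixed number of times until the input is valid. The `ask` function is an assumption standing in for whatever call a platform would offer to play a prompt and collect input; in this sketch it just replays canned answers.

```javascript
// Hypothetical sketch of a re-prompt loop in a general-purpose language.
// `ask` is an assumed stand-in for a platform call that plays a prompt and
// returns the caller's input (or null on silence); here it replays canned input.
function makeAsk(cannedInputs) {
  const queue = cannedInputs.slice();
  return function ask(promptText) {
    return queue.length > 0 ? queue.shift() : null;
  };
}

// Ask up to maxAttempts times until the input is a four-digit PIN.
function collectPin(ask, maxAttempts) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const input = ask("Please enter your four digit PIN.");
    if (input !== null && /^\d{4}$/.test(input)) {
      return { pin: input, attempts: attempt };
    }
  }
  // Every attempt was silent or invalid.
  return { pin: null, attempts: maxAttempts };
}

// Silence, then a short entry, then a valid PIN.
const result = collectPin(makeAsk([null, "12", "4821"]), 3);
console.log(result);
```

The retry policy lives in one ordinary `for` loop, so changing the number of attempts or the validation rule is a one-line edit.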

In contrast, take a look at just about any application at http://code.voicephp.com to see how easily one can take an existing application and move over to VoicePHP with all the programming constructs usually available in most programming languages.

CDATA – Add it to the mess

Well, it keeps getting better. To bring in Javascript, VoiceXML uses the CDATA directive. What is going on? Isn’t it messy already? Why do I have to care about all these subtleties? For the curious, a CDATA section is used so that our script can contain characters that are normally reserved for XML syntax.

It’s truly getting messy – XML, Javascript, CDATA and, of course, unreadable code. Keep in mind that all we have done so far is “read out a phone number”. It makes me ask this question: Why is it so damn complicated?
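To show what the CDATA wrapper is protecting against, here is a small sketch in Javascript. The function names are my own, chosen for illustration. A script line such as the `for` loop in the caller-ID sample contains `<`, which an XML parser would otherwise read as the start of a tag; the two usual ways out are wrapping the script in a CDATA section or escaping the reserved characters.

```javascript
// Sketch: two ways to embed script text inside an XML document.

// Wrap the text in a CDATA section. A CDATA section ends at the first "]]>",
// so any occurrence inside the text is split across two adjacent sections.
function wrapInCdata(scriptText) {
  const safe = scriptText.split("]]>").join("]]]]><![CDATA[>");
  return "<![CDATA[" + safe + "]]>";
}

// Alternative: escape the characters that XML reserves for markup.
// "&" must be replaced first so the other entities are not double-escaped.
function escapeXmlText(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

const loopHeader = "for (var i = 1; i < number.length; i++)";
console.log(wrapInCdata(loopHeader));
console.log(escapeXmlText(loopHeader));
```

Either way, the developer ends up thinking about XML parsing rules while writing what is really just a loop.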

Consider the code for the first tutorial on Voice Recognition from vxml.org:

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1">
  <form id="MainMenu">
    <field name="SouthParkCharacter">
      <!-- Since we are in a field tag, we do not need <prompt> tags surrounding the text-to-speech -->
      Please say your favorite South Park character's name.
      <!-- Define our grammar -->
      <grammar type="text/gsl"> 
      <![CDATA[
        ;Match one of the enclosed terms
        [
          kenny cartman stan kyle canadians chef wendy timmy 
          hanky garrison pip ike mephisto jimbo tweak marvin
          [terrance phillip] 
          (mister hat) (big gay al) (cartmans mom) (mister mackey) 
        ]
      ]]>
      </grammar> 
      <!-- The user was silent, restart the field -->
      <noinput>
        I did not hear anything.  Please try again.
        <reprompt/>
      </noinput>
      <!-- The user said something that was not defined in our grammar -->
      <nomatch>
        I did not recognize that character.  Please try again.
        <reprompt/>
      </nomatch>
    </field>
    <!-- Set the namelist attribute to the name of the corresponding field -->
    <!-- This is filled only when one of the values in the grammar is matched -->
    <!-- That value is then sent to this section -->
    <filled namelist="SouthParkCharacter">
      <!-- Check the "SouthParkCharacter" variable against each of the valid values -->
      <!--  defined in our grammar -->
      <if cond="SouthParkCharacter == 'kenny'">
        <prompt>Kenny has more lives than a cat.</prompt>
      <elseif cond="SouthParkCharacter == 'cartman'"/>
        <prompt>Cartman is not fat.  He is big boned.</prompt>
      <elseif cond="SouthParkCharacter == 'stan'"/>
        <prompt>Stan likes Wendy.</prompt>
      <elseif cond="SouthParkCharacter == 'kyle'"/>
        <prompt>Kyle has a gay dog.</prompt>
      <elseif cond="SouthParkCharacter == 'canadians'"/>
        <prompt>Canada.  What is that aboot?</prompt>
      <elseif cond="SouthParkCharacter == 'chef'"/>
        <prompt>Chef is the coolest man in South Park.</prompt>
      <elseif cond="SouthParkCharacter == 'misterhat'"/>
        <prompt>Mister Hat is a puppet.</prompt>
      <elseif cond="SouthParkCharacter == 'biggayal'"/>
        <prompt>Big Gay Al is gay.</prompt>
      <elseif cond="SouthParkCharacter == 'wendy'"/>
        <prompt>Wendy likes Stan.</prompt>
      <elseif cond="SouthParkCharacter == 'timmy'"/>
        <prompt>Timmmy!  Timmmy tim maugh!</prompt>
      <elseif cond="SouthParkCharacter == 'hanky'"/>
        <prompt>Mister Hanky, the Christmas poo.</prompt>
      <elseif cond="SouthParkCharacter == 'garrison'"/>
        <prompt>Mister Garrison is gay.</prompt>
      <elseif cond="SouthParkCharacter == 'cartmansmom'"/>
        <prompt>Cartman's mom loves the Denver Broncos.</prompt>
      <elseif cond="SouthParkCharacter == 'pip'"/>
        <prompt>Pip is British.</prompt>
      <elseif cond="SouthParkCharacter == 'ike'"/>
        <prompt>Ike is also Canadian.</prompt>
      <elseif cond="SouthParkCharacter == 'mistermackey'"/>
        <prompt>Mister Mackey.  Mmmmmmkay.</prompt>
      <elseif cond="SouthParkCharacter == 'mephisto'"/>
        <prompt>Mephisto enjoys experimenting on animals.</prompt>
      <elseif cond="SouthParkCharacter == 'jimbo'"/>
        <prompt>Jimbo is a redneck.</prompt>
      <elseif cond="SouthParkCharacter == 'marvin'"/>
        <prompt>Marvin is really hungry.</prompt>
      <else/>
        <prompt>
          A match has occurred, but no specific if statement
          was written for it.  Probably just a minor character
          like Tweak or Jimbo's gun-toting friend.
        </prompt>
      </if> 
    </filled>
  </form>
</vxml>

I am sure you need a coffee break after reading the above code; it looks verbose, repetitive and unmanageable. The same application, written in a commonly used ‘real’ programming language, would need a lot less code and read much better. Again, refer to any code snippet at http://code.voicephp.com
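As a rough illustration of “a lot less code”, the long if/elseif chain above collapses into a lookup table in almost any general-purpose language. The sketch below uses Javascript and copies only a handful of entries from the tutorial; the names `responses`, `fallback` and `responseFor` are mine, and the speech recognition step itself is deliberately left out.

```javascript
// Sketch: the if/elseif chain from the tutorial expressed as a lookup table.
// Only a few of the tutorial's entries are copied here.
const responses = new Map([
  ["kenny", "Kenny has more lives than a cat."],
  ["cartman", "Cartman is not fat. He is big boned."],
  ["stan", "Stan likes Wendy."],
  ["pip", "Pip is British."],
  ["ike", "Ike is also Canadian."],
]);

// Plays the role of the <else/> branch.
const fallback =
  "A match has occurred, but no specific response was written for it.";

// Return the line to speak for a recognized character name.
function responseFor(character) {
  return responses.has(character) ? responses.get(character) : fallback;
}

console.log(responseFor("kenny"));
console.log(responseFor("tweak"));
```

Adding a character becomes one new table row instead of another <elseif> and <prompt> pair.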

Server side programming

One cannot use VoiceXML by itself to write a complete application. Even for simple client-side processing you need Javascript. Moving on, if you need to integrate some back-end logic (a.k.a. server-side programming), you need the help of one of the commonly used back-end technologies (e.g. PHP, ASP, .NET etc.).

This is not just me saying it, but vxml.org: “Coding an application with just straight VoiceXML is just fine and dandy, thankyouverymuch, but the *real* potential of VoiceXML is harnessed when we add some ASP or JSP into the mix”. Look at the emphasis on “real”. I sincerely appreciate their candid confession and applaud them for putting the limitation of VoiceXML so succinctly. So as you can see, in addition to learning VoiceXML tags and attributes, Javascript and a different programming style, one now also has to learn a server-side language. Think about it – you need a chilled beer to relax but you are being given a cocktail, like it or not!
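To picture what “adding a server-side language into the mix” means in practice, here is a minimal sketch of a server-side function that generates a VoiceXML page from data. It is written in Javascript purely for illustration; the function names, the customer name and the balance are all made up, and a real deployment would serve the returned string over HTTP from whatever back end is in use.

```javascript
// Sketch: generating a VoiceXML document on the server from dynamic data.

// Escape characters that XML reserves for markup so that dynamic values
// cannot break the generated document.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Build the page; customerName and balance are hypothetical inputs that
// would normally come from a database lookup.
function renderBalancePage(customerName, balance) {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<vxml version="2.1">',
    "  <form>",
    "    <block>",
    "      <prompt>",
    "        Hello " + escapeXml(customerName) +
      ". Your balance is " + escapeXml(balance) + " dollars.",
    "      </prompt>",
    "    </block>",
    "  </form>",
    "</vxml>",
  ].join("\n");
}

console.log(renderBalancePage("Tom & Jerry", 42));
```

Notice that the developer now juggles three layers at once: the server-side language, the XML it emits, and the escaping rules between them.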

In Closing

Any way you slice it, VoiceXML doesn’t come close to meeting the requirements of real-world applications. Voice applications would do really well if there were an easy way to bring them to life. Developers do not want to use complicated technology to achieve something simple, intuitive and obvious. I know I don’t.

We are not against VoiceXML. In fact, VoiceXML paved the way for voice programming and took away the complexity that one had to deal with in the early days (remember the nightmare of hardware cards and proprietary drivers?). When it launched, VoiceXML was the “new” way to program voice and we were completely supportive of it too. We released the world’s first “Adobe Flash based VoiceXML Platform”.

But it’s about time that VoiceXML realizes its inadequacies and makes way for better alternatives. Alternatives like VoicePHP (or maybe even VoicePERL or VoicePYTHON) could do a better job. The web is evolving, and solutions that can tightly integrate with it will become more and more important. Dedicated solutions to tackle a specific problem are a thing of the past. Some technologies (e.g. PHP for web programming, Flash for UI and widgets, mobile applications using the data network etc.) have proven themselves, and it’s about time that we reuse them rather than bind ourselves to technologies that set out with the right attitude to solve a problem but couldn’t really establish themselves due to technical limitations.

10 Comments

Ted Naleid, February 18th, 2009 at 7:46 pm

Nice post. I’m just about to get back into doing some VoiceXML work after a 2-3 year hiatus. It seems like things haven’t changed really at all in that time.

I’d love it if there were broader support for some of the technologies that you suggest, using a non-xml based programming language with real recursion and looping.

What we’re trying to do shouldn’t be that complicated, it’s the equivalent to interacting with a command prompt, but VXML just makes things hard.

[...] Their only concern is at this moment is their existing investment and we will work with them to resolve that. In fact, some of the concern they raised are the same which Yusuf blogged earlier - VoiceXML – Good, Bad & the Ugly. [...]

[...] Developers and enterprises like it alike due to simplicity and natural way of programming. As I predicted earlier, companies will come to this kind of platforms rather than staying with messy [...]

Ilja V., April 24th, 2009 at 9:22 am

Hi Yusuf,
two years ago I started working as a developer in what Brian Brown called a “VoiceXML service bureau” in his article. Without question, the “ugliness” of VoiceXML as a programming language, which you describe so well in your article, was striking. It was obvious to me that VoiceXML cannot be the first choice when it comes to implementing more or less complex applications. That’s why our “bureau” developed frameworks in Java and C that generated VoiceXML as an output. Obviously, VOXEO and many other companies in the Voice business field did the same. In the end, we use VoiceXML solely as a standard for interpreters. For the developers there’s no ECMA-Scripting, no tags, just POJO.
So here’s a question: what’s wrong with it? Maybe I’m on the wrong trail, but what’s the reason for replacing VoiceXML as a standard for data transfer? Why not generate VXML code for voice applications dynamically out of Java just as you do with HTML code for dynamic web applications? I mean, standards do have a purpose…
I agree that there are shortcomings in the VXML standard, such as the lack of variable playback options. But the fact that VXML is a bad programming language doesn’t really bother me as long as I do not use it for programming. Sure enough, you need to develop a framework first, and this can be quite annoying. But I guess developing a VoicePHP interpreter was also rather tricky…

Thomas Howe, May 4th, 2009 at 11:29 pm

Nice post – I think it’s interesting how the flavor-of-the-day affects our programmer sensibilities. Back when HTML was all the rage, it was easy to accept that voice programming for the web would follow graphics programming for the web. Now, with some experience… not so much.

Loved how you used code in your examples… made it much clearer.

Dominique Boucher, May 5th, 2009 at 8:37 am

Hi Yusuf,

I tend to agree with Ilja V. Just use VoiceXML as an abstraction layer atop the telephony/voice resources. And if your framework is well designed, you will be able to unit test your applications (using JUnit, for example). I have been doing this with a lot of success for the past 6 years.

Although I tend to like the scripting approach, one thing that makes me uncomfortable with the current solutions out there is their poor support for handling speech recognition grammars/recognition results, one thing that VoiceXML got (mostly) right, making you think that speech recognition will work all the time, automagically. For DTMF-only applications, it’s a great approach. But developing robust and efficient speech recognition based applications is not easy, and the API should not limit your ability to implement sophisticated voice user interfaces to get the most out of the technology.

Yusuf Motiwala, May 6th, 2009 at 5:18 am

Dominique, Ilja,

TringMe has tried its hand at VoiceXML and was in fact the first to release a beta of Flash-based VoiceXML (link). We’ve also looked at ‘exposing’ VoiceXML for the sake of standards and doing the grunt work underneath in Flash, PHP etc. But this is where we realized that several folks weren’t quite happy with the abstraction approach, since it’s yet another thing that they have to maintain and possibly understand a little to know what’s going on.

With regard to ‘poor support for handling speech recognition’, I think we perhaps haven’t communicated the ASR capabilities in VoicePHP as much as we should have. Complete ASR support is available right within VoicePHP. In fact, ASR, TTS and DTMF are an integral part of VoicePHP. Your applications don’t have to do anything special to use these capabilities. For example, TTS is automatically integrated and used when someone uses ‘echo’. DTMF values are returned by the ‘prompt’ API, which is meant to take user input. Here is a sample script:

$destination = prompt("Dial the destination phone number", 1);
echo "You have typed, $destination";

Similarly, one can use ABNF format for specifying grammar for ASR. The prompt API will return the value from the grammar once the spoken words are recognized by the platform. Here is a sample script:

<?php

$grammar = '#ABNF 1.0;
language en-US;
mode voice;
root $small_number;
$base = one|two|three|four|five|six|seven|eight|nine;
$teen = ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen;
$twenty_to_ninetynine = (twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)[$base];
$tens = $base|$teen|$twenty_to_ninetynine;
$hundred = ([a] hundred | $base hundred);
$small_number = $hundred [[and] $tens] | $tens;';
$result = prompt("Hello Say some number", 1, 10, $grammar);

echo "You said $result";
?>

BTW, I will update the post to contain the information about ASR. Thanks for your comment.

Dominique Boucher, May 6th, 2009 at 8:35 am

Yusuf,

Maybe I was misled by the API documentation.

It’s nice to see that you support the SRGS ABNF standard. Does VoicePHP also support external grammars (URLs)? Can I do something like:

Of course, you can alternatively do:

<?php

$grammar = '#ABNF 1.0;
language en-US;
mode voice;
root $external;

$external = $;';
$result = prompt("Hello Say some number", 1, 10, $grammar);

echo "You said $result";
?>

Will that work?

And is $result limited to be a string? What if I want to return a number? Or a structure with several fields? What semantic tag format does VoicePHP support?

I should definitely give VoicePHP a try instead of guessing ;-)

Nilesh Trivedi, September 13th, 2010 at 3:29 am

Yusuf,

I came here from our discussion at Pluggd.in. I agree with your point here that voice interaction is fundamentally different than interaction on a web page. And VoiceXML isn’t expressive enough.

Having said that, I will still agree with Ilja V. I can’t think of writing the app in VoiceXML directly. Instead, I’d use a dynamic language like Ruby with its blocks and meta-programming capabilities or even PHP. The point being that VXML works with all voice browsers out there which works out great. VoicePHP, even if better than VXML for expressiveness, feels like a vendor-specific solution which makes the TringMe platform that much less attractive to me. VXML at least allowed us to use whatever programming languages / frameworks we wanted to use as long as the output was compliant.

Now if you had something like VoiceRuby in place, I’ll be on board in no time :-)

Jayakrishnan K, June 7th, 2011 at 12:31 am

Rather late in the day for this comment, but it was obvious to me back in 2003 (http://jayakrishnan.livejournal.com/347.html) that VXML was not as neat as it was professed to be. Glad that saner voices are prevailing now.
