Speech Synthesis Markup Language (SSML) Version 1.1

The Voice Browser Working Group has sought to develop standards to enable access to the Web using spoken interaction. The Speech Synthesis Markup Language Specification is one of these standards and is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is the 10 January 2007 First Public Working Draft of "Speech Synthesis Markup Language (SSML) Version 1.1".

This document enhances SSML 1.0 [SSML] to provide better support for a broader set of languages.

The design of SSML 1.0 has been widely reviewed (see the disposition of comments) and satisfies the Working Group's technical requirements. A list of implementations is included in the SSML 1.0 Implementation Report, along with the associated test suite.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

1. Introduction

This W3C specification is known as the Speech Synthesis Markup Language specification (SSML) and is based upon the JSGF and/or JSML specifications, which are owned by Sun Microsystems, Inc., California, U.S.A. The JSML specification can be found at [JSML].

SSML is part of a larger set of markup specifications for voice browsers developed through the open processes of the W3C. It is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to give authors of synthesizable content a standard way to control aspects of speech output such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms. A related initiative to establish a standard system for marking up text input is SABLE [SABLE], which tried to integrate many different XML-based markups for speech synthesis into a new one. The activity carried out in SABLE was also used as the main starting point for defining the Speech Synthesis Markup Requirements for Voice Markup Languages [REQS]. Since then, SABLE itself has not undergone any further development.

The intended use of SSML is to improve the quality of synthesized content. Different markup elements impact different stages of the synthesis process (see Section 1.2). The markup may be produced either automatically, for instance via XSLT or CSS3 from an XHTML document, or by human authoring. Markup may be present within a complete SSML document (see Section 2.2.2) or as part of a fragment (see Section 2.2.1) embedded in another language, although no interactions with other languages are specified as part of SSML itself. Most of the markup included in SSML is suitable for use by the majority of content developers; however, some advanced features like phoneme and prosody (e.g. for speech contour design) may require specialized knowledge.

1.1 Design Concepts

The design and standardization process has followed from the Speech Synthesis Markup Requirements for Voice Markup Languages [REQS].

1.2 Speech Synthesis Process Steps

A Text-To-Speech system (a synthesis processor) that supports SSML will be responsible for rendering a document as spoken output and for using the information contained in the markup to render the document as intended by the author.

Document creation: A text document provided as input to the synthesis processor may be produced automatically, by human authoring, or through a combination of these forms. SSML defines the form of the document.

Document processing: The following are the six major processing steps undertaken by a synthesis processor to convert marked-up text input into automatically generated voice output. The markup language is designed to be sufficiently rich so as to allow control over each of the steps described below so that the document author (human or machine) can control the final voice output. Although each step below is divided into "markup support" and "non-markup behavior", actual behavior is usually a mix of the two and varies depending on the tag. The processor has the ultimate authority to ensure that what it produces is pronounceable (and ideally intelligible). In general the markup provides a way for the author to make prosodic and other information available to the processor, typically information the processor would be unable to acquire on its own. It is then up to the processor to determine whether and in what way to use the information.

1.3 Document Generation, Applications and Contexts

There are many classes of document creator that will produce marked-up documents to be spoken by a synthesis processor. Not all document creators (including human and machine) have access to information that can be used in all of the elements or in each of the processing steps described in the previous section. The following are some of the common cases.

The following are important instances of architectures or designs from which marked-up synthesis documents will be generated. The language design is intended to facilitate each of these approaches.

1.4 Platform-Dependent Output Behavior of SSML Content

SSML provides a standard way to specify gross properties of synthetic speech production such as pronunciation, volume, pitch, rate, etc. Exact specification of synthetic speech output behavior across disparate processors, however, is beyond the scope of this document.

Unless otherwise specified, markup values are merely indications rather than absolutes. For example, it is possible for an author to explicitly indicate the duration of a text segment and also indicate an explicit duration for a subset of that text segment. If the two durations result in a text segment that the synthesis processor cannot reasonably render, the processor is permitted to modify the durations as needed to render the text segment.
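Informative: The following is a minimal sketch of the situation described above, not a normative example. It assumes the duration attribute of the prosody element as defined in SSML 1.0; the text and the time values are placeholders chosen only to make the two requests conflict.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <!-- The whole segment is asked to last 2 seconds... -->
  <prosody duration="2s">
    Please listen carefully,
    <!-- ...yet this subset alone is asked to last 5 seconds. -->
    <prosody duration="5s">the menu options have changed.</prosody>
  </prosody>
</speak>
```

Because the inner duration cannot fit inside the outer one, a synthesis processor would be free to adjust either value in order to produce a reasonable rendering.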

1.5 Terminology

2. SSML Documents

2.1 Document Form

A legal stand-alone Speech Synthesis Markup Language document must have a legal XML Prolog [XML §2.8]. If present, the optional DOCTYPE must read as follows:

The XML prolog is followed by the root speak element. See Section 3.1.1 for details on this element.

The speak element must designate the SSML namespace. This can be achieved by declaring an xmlns attribute or an attribute with an "xmlns" prefix. See [XMLNS §2] for details. Note that when the xmlns attribute is used alone, it sets the default namespace for the element on which it appears and for any child elements. The namespace for SSML is defined to be http://www.w3.org/2001/10/synthesis.

It is recommended that the speak element also indicate the location of the SSML schema (see Appendix D) via the xsi:schemaLocation attribute from [SCHEMA1 §2.6.3]. This indication is not required, but all of the examples in this document provide it in order to encourage its use.

The meta, metadata and lexicon elements must occur before all other elements and text contained within the root speak element. There are no other ordering constraints on the elements in this specification.
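Informative: As a rough illustration of the requirements in this section, a minimal stand-alone document might look like the following. This is a sketch rather than a normative example: the lexicon URI, the seeAlso target and the spoken text are placeholders, and the schema location is the one used by the other examples in this document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The root element declares the SSML namespace as the default namespace. -->
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <!-- meta, metadata and lexicon come before any other element or text. -->
  <meta name="seeAlso" content="http://www.example.com/metadata.xml"/>
  <lexicon uri="http://www.example.com/lexicon.pls" xml:id="lex1"/>
  <p>
    <s>Hello, world.</s>
  </p>
</speak>
```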

2.2 Conformance

2.2.1 Conforming Speech Synthesis Markup Language Fragments

A document fragment is a Conforming Speech Synthesis Markup Language Fragment if:

2.2.2 Conforming Stand-Alone Speech Synthesis Markup Language Documents

A document is a Conforming Stand-Alone Speech Synthesis Markup Language Document if it meets both the following conditions:

The SSML specification and these conformance criteria provide no designated size limits on any aspect of synthesis documents. There are no maximum values on the number of elements, the amount of character data, or the number of characters in attribute values.

2.2.3 Using SSML with other Namespaces

The synthesis namespace may be used with other XML namespaces as per the Namespaces in XML Recommendation [XMLNS]. Future work by W3C is expected to address ways to specify conformance for documents involving multiple namespaces.

2.2.4 Conforming Speech Synthesis Markup Language Processors

In a Conforming Speech Synthesis Markup Language Processor, the XML parser must be able to parse and process all XML constructs defined by XML 1.0 [XML] and Namespaces in XML [XMLNS]. This XML parser is not required to perform validation of an SSML document as per its schema or DTD; this implies that during processing of an SSML document it is optional to apply or expand external entity references defined in an external DTD.

A Conforming Speech Synthesis Markup Language Processor must correctly understand and apply the semantics of each markup element as described by this document.

A Conforming Speech Synthesis Markup Language Processor must meet the following requirements for handling of natural (human) languages:

When a Conforming Speech Synthesis Markup Language Processor encounters elements or attributes, other than xml:lang and xml:base, in a non-synthesis namespace it may:

There is, however, no conformance requirement with respect to performance characteristics of the Speech Synthesis Markup Language Processor. For instance, no statement is required regarding the accuracy, speed or other characteristics of speech produced by the processor. No statement is made regarding the size of input that a Speech Synthesis Markup Language Processor must support.

2.2.5 Conforming User Agent

A Conforming User Agent is a Conforming Speech Synthesis Markup Language Processor that is capable of accepting an SSML document as input and producing a spoken output by using the information contained in the markup to render the document as intended by the author. A Conforming User Agent must support at least one natural language.

Since the output cannot be guaranteed to be a correct representation of all the markup contained in the input, there is no conformance requirement regarding accuracy. A conformance test may, however, require some examples of correct synthesis of a reference document to determine conformance.

2.3 Integration With Other Markup Languages

2.3.1 SMIL

The Synchronized Multimedia Integration Language (SMIL, pronounced "smile") [SMIL] enables simple authoring of interactive audiovisual presentations. SMIL is typically used for "rich media"/multimedia presentations which integrate streaming audio and video with images, text or any other media type. SMIL is an easy-to-learn HTML-like language, and many SMIL presentations are written using a simple text editor. See the SMIL/SSML integration examples in Appendix F.

2.3.2 ACSS

Aural Cascading Style Sheets [CSS2 §19] are employed to augment standard visual forms of documents (like HTML) with additional elements that assist in the synthesis of the text into audio. In comparison to SSML, ACSS-generated documents are capable of more complex specifications of the audio sequence, including the designation of 3D location of the audio source. Many of the other ACSS elements overlap SSML functionality, especially in the specification of voice type/quality. SSML may be viewed as a superset of ACSS capabilities, excepting spatial audio.

2.3.3 VoiceXML

The Voice Extensible Markup Language [VXML] enables Web-based development and content-delivery for interactive voice response applications (see voice browser). VoiceXML supports speech synthesis, recording and playback of digitized audio, speech recognition, DTMF input, telephony call control, and form-driven mixed initiative dialogs. VoiceXML 2.0 extends SSML for the markup of text to be synthesized. For an example of the integration between VoiceXML and SSML see Appendix F.

2.4 Fetching SSML Documents

The fetching and caching behavior of SSML documents is defined by the environment in which the synthesis processor operates. In a VoiceXML interpreter context for example, the caching policy is determined by the VoiceXML interpreter.

3. Elements and Attributes

3.1 Document Structure, Text Processing and Pronunciation

3.1.1 speak Root Element

The Speech Synthesis Markup Language is an XML application. The root element is speak.

xml:lang is a required attribute specifying the language of the root document.

xml:base is an optional attribute specifying the Base URI of the root document.

The version attribute is a required attribute that indicates the version of the specification to be used for the document and must have the value "1.1".

The lang-voice attribute is an optional attribute that has the enumerated values: "static", "dynamic". Its default value is "dynamic". The value of this attribute is used to determine the expected behavior of the synthesis processor with respect to voice changes when xml:lang indicates a change in the natural language of the document content (see Section 3.1.2 for full details).

3.1.2 Language: xml:lang Attribute

The xml:lang attribute, as defined by XML 1.0 [XML §2.12], may be used in SSML to indicate the natural language of the content of the element on which it occurs. BCP47 [BCP47] can help in understanding how to use this attribute.

Language information is inherited down the document hierarchy, i.e. it needs to be given only once if the whole document is in one language, and language information nests, i.e. inner attributes overwrite outer attributes.

xml:lang is permitted on p, s, and w only because it is common to change the language at those levels.

The synthesis processor should use the value of the xml:lang attribute to assist it in determining the best way of rendering the content of the element on which it occurs. The voice, say-as, phoneme, sub, emphasis, and break elements should also be rendered in a manner that is appropriate to the current language.

If the lang-voice attribute of the root speak element is set to "static" then:

If the lang-voice attribute of the root speak element is set to "dynamic" then:

If the document author requires a new voice that is better adapted to the new language, then the synthesis processor can be explicitly requested to select a new voice by using the voice element. Further information about voice selection appears in Section 3.2.1.

In the following example, the lang-voice attribute is set to "static". Thus, a bilingual American English / Japanese voice must be used to render both sentences; however, an error may occur if such a voice is not available.
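Informative: A document of this kind could be sketched as follows. The sentences are placeholders and the markup is illustrative only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US" lang-voice="static">
  <!-- Rendered in American English. -->
  <s>The museum opens at nine.</s>
  <!-- The language changes to Japanese, but with "static" the same
       voice is expected to render this sentence as well. -->
  <s xml:lang="ja">博物館は九時に開きます。</s>
</speak>
```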

In the following example, the lang-voice attribute is set to "dynamic". Thus, a bilingual American English / Japanese voice should be used to render both sentences; however, a voice change may occur if such a voice is not available.
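Informative: A document of this kind could be sketched as follows. The sentences are placeholders and the markup is illustrative only. Since "dynamic" is the default value, omitting the lang-voice attribute would be expected to give the same behavior.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US" lang-voice="dynamic">
  <!-- Rendered in American English. -->
  <s>Your train leaves from platform two.</s>
  <!-- The language changes to Japanese; with "dynamic" the processor
       may switch to another voice if the current one cannot speak it. -->
  <s xml:lang="ja">電車は二番線から出発します。</s>
</speak>
```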

The text normalization processing step may be affected by the enclosing language. This is true for both markup support by the say-as element and non-markup behavior. In the following example the same text "2/1/2000" may be read as "February first two thousand" in the first sentence, following American English pronunciation rules, but as "the second of January two thousand" in the second one, which follows Italian preprocessing rules.
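Informative: The case described above could be sketched as follows. The readings given in the comments are the ones a processor may produce, not guaranteed output.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <!-- May be read as: Today, February first two thousand -->
  <s>Today, 2/1/2000.</s>
  <!-- May be read as: Un mese fa, il due gennaio duemila
       (One month ago, the second of January two thousand) -->
  <s xml:lang="it">Un mese fa, 2/1/2000.</s>
</speak>
```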

3.1.3 Base URI: xml:base Attribute

Relative URIs are resolved according to a base URI, which may come from a variety of sources. The base URI declaration allows authors to specify a document's base URI explicitly. See Section 3.1.3.1 for details on the resolution of relative URIs.

The base URI declaration is permitted but optional. The two elements affected by it are

The xml:base attribute

The base URI declaration follows [XML-BASE] and is indicated by an xml:base attribute on the root speak element.
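Informative: The sketch below shows an xml:base declaration together with relative URIs. It assumes, as in SSML 1.0, that relative URIs occur on the audio and lexicon elements; the host name and paths are placeholders, and the resolved URIs in the comments follow the reference resolution rules of [RFC3986].

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US"
       xml:base="http://www.example.com/media/">
  <!-- Would resolve to http://www.example.com/media/lexicons/names.pls -->
  <lexicon uri="lexicons/names.pls" xml:id="names"/>
  <!-- Would resolve to http://www.example.com/media/prompts/welcome.wav -->
  <audio src="prompts/welcome.wav">Welcome to the service.</audio>
</speak>
```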

3.1.3.1 Resolving Relative URIs

User agents must calculate the base URI for resolving relative URIs according to [RFC3986]. The following describes how RFC3986 applies to synthesis documents.

User agents must calculate the base URI according to the following precedences (highest priority to lowest):

3.1.5 Lexicon Documents: lexicon and lookup Elements

An SSML document may reference one or more lexicon documents. A lexicon document is located by a URI with an optional media type and is assigned a name that is unique in the SSML document.

A lexicon document may contain information (e.g., pronunciation) for tokens that can appear in a text to be rendered. The information contained within a lexicon document should be used by the synthesis processor when rendering tokens that appear within the context of a lookup element. However, the processor may choose not to use the lexicon if it is deemed incompatible with the content of the SSML document. For example, a vendor-specific lexicon may be used only for particular values of the interpret-as attribute of the say-as element, or for a particular set of voices. Vendors should document the expected behavior of the synthesis processor when SSML content refers to a lexicon.

3.1.5.1 The lexicon element

Any number of lexicon elements may occur as immediate children of the speak element.

The lexicon element must have a uri attribute specifying a URI that identifies the location of the lexicon document.

The lexicon element must have an xml:id attribute that assigns a name to the lexicon document. The name must be unique to the current SSML document. The scope of this name is the current SSML document.

The lexicon element may have a type attribute that specifies the media type of the lexicon document. The default value of the type attribute is application/pls+xml, the media type associated with Pronunciation Lexicon Specification [PLS] documents as defined in [RFC4267].

Details of the type attribute

Note: the description and table that follow use an imaginary vendor-specific lexicon type of x-vnd.example.lexicon. This is intended to represent whatever format is returned/available, as appropriate.

A lexicon resource indicated by a URI reference may be available in one or more media types. The SSML author can specify the preferred media type via the type attribute. When the content represented by a URI is available in many data formats, a synthesis processor may use the preferred type to influence which of the multiple formats is used. For instance, on a server implementing HTTP content negotiation, the processor may use the type to order the preferences in the negotiation.

Upon delivery, the resource indicated by a URI reference may be considered in terms of two types. The declared media type is the alleged value for the resource and the actual media type is the true format of its content. The actual type should be the same as the declared type, but this is not always the case (e.g. a misconfigured HTTP server might return text/plain for a document following the vendor-specific x-vnd.example.lexicon format). A specific URI scheme may require that the resource owner always, sometimes, or never return a media type. Whenever a type is returned, it is treated as authoritative. The declared media type is determined by the value returned by the resource owner or, if none is returned, by the preferred media type given in the SSML document.

Three special cases may arise. The declared type may not be supported by the processor; this is an error. The declared type may be supported but the actual type may not match; this is also an error. Finally, no media type may be declared; the behavior depends on the specific URI scheme and the capabilities of the synthesis processor. For instance, HTTP 1.1 allows document introspection (see [RFC2616 §7.2.1]), the data scheme falls back to a default media type, and local file access defines no guidelines. The following table provides some informative examples:

Media type examples
	HTTP 1.1 request	HTTP 1.1 request	HTTP 1.1 request	Local file access
Media type returned by the resource owner	text/plain	x-vnd.example.lexicon	<none>	<none>
Preferred media type from the SSML document	Not applicable; the returned type is authoritative.	Not applicable; the returned type is authoritative.	x-vnd.example.lexicon	application/pls+xml
Declared media type	text/plain	x-vnd.example.lexicon	x-vnd.example.lexicon	<none>
Behavior for an actual media type of x-vnd.example.lexicon	This must be processed as text/plain. This will generate an error if text/plain is not supported or if the document does not follow the expected format.	The declared and actual types match; success if x-vnd.example.lexicon is supported by the synthesis processor; otherwise an error.	The declared and actual types match; success if x-vnd.example.lexicon is supported by the synthesis processor; otherwise an error.	Scheme specific; the synthesis processor might introspect the document to determine the type.

3.1.5.2 The lookup element

The lookup element must have a ref attribute. The ref attribute specifies a name that references a lexicon document as assigned by the xml:id attribute of the lexicon element. The synthesis processor should use the lexicon document named when rendering the content of the lookup element.

A lookup element may contain other lookup elements. When a lookup element contains other lookup elements, the child lookup elements have higher precedence. Precedence means that a token is first looked up in the lexicon with highest precedence. Only if the token is not found in that lexicon is it then looked up in the lexicon with the next lower precedence, and so on until the token is successfully found or until all lexicons have been used for lookup.
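Informative: The precedence rule for nested lookup elements could be illustrated as follows. The two lexicon documents and their names are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <lexicon uri="http://www.example.com/general.pls" xml:id="general"/>
  <lexicon uri="http://www.example.com/medical.pls" xml:id="medical"/>
  <lookup ref="general">
    <!-- Tokens here are looked up in "general" only. -->
    The clinic opens at nine.
    <lookup ref="medical">
      <!-- Tokens here are looked up in "medical" first, then in
           "general" if "medical" has no entry for them. -->
      Take ibuprofen with food.
    </lookup>
  </lookup>
</speak>
```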

3.1.6 meta Element

The metadata and meta elements are containers in which information about the document can be placed. The metadata element provides more general and powerful treatment of metadata information than meta by using a metadata schema.

A meta declaration associates a string to a declared meta property or declares "http-equiv" content. Either a name or http-equiv attribute is required. It is an error to provide both name and http-equiv attributes. A content attribute is required. The seeAlso property is the only defined meta property name. It is used to specify a resource that might provide additional metadata information about the content. This property is modelled on the seeAlso property of Resource Description Framework (RDF) Schema Specification 1.0 [RDF-SCHEMA §5.4.1]. The http-equiv attribute has a special significance when documents are retrieved via HTTP. Although the preferred method of providing HTTP header information is by using HTTP header fields, the "http-equiv" content may be used in situations where the SSML document author is unable to configure HTTP header fields associated with their document on the origin server, for example, cache control information. Note that HTTP servers and caches are not required to introspect the contents of meta in SSML documents and thereby override the header values they would send otherwise.

Informative: This is an example of how meta elements can be included in an SSML document to specify a resource that provides additional metadata information and also indicate that the document must not be cached.
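Such a document could be sketched as follows. The seeAlso target is a placeholder URI, and Cache-Control with a value of no-cache is one common way to express the caching request over HTTP.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <!-- Points to a resource holding additional metadata. -->
  <meta name="seeAlso" content="http://www.example.com/my-ssml-metadata.xml"/>
  <!-- Asks that the document not be cached. -->
  <meta http-equiv="Cache-Control" content="no-cache"/>
  <s>Hello, world.</s>
</speak>
```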

3.1.7 metadata Element

The metadata element is a container in which information about the document can be placed using a metadata schema. Although any metadata schema can be used with metadata, it is recommended that the XML syntax of the Resource Description Framework (RDF) [RDF-XMLSYNTAX] be used in conjunction with the general metadata properties defined in the Dublin Core Metadata Initiative [DC].

The Resource Description Framework [RDF] is a declarative language and provides a standard way for using XML to represent metadata in the form of statements about properties and relationships of items on the Web. Content creators should refer to W3C metadata Recommendations [RDF-XMLSYNTAX] and [RDF-SCHEMA] when deciding which metadata RDF schema to use in their documents. Content creators should also refer to the Dublin Core Metadata Initiative [DC], which is a set of generally applicable core metadata properties (e.g., Title, Creator, Subject, Description, Rights, etc.).

Document properties declared with the metadata element can use any metadata schema.

Informative: This is an example of how metadata can be included in an SSML document using the Dublin Core version 1.0 RDF schema [DC] describing general document information such as title, description, date, and so on:
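A sketch of such a document is given below. The document URI and property values are placeholders, and the dc prefix is bound here to the Dublin Core element set namespace http://purl.org/dc/elements/1.1/, which is an assumption about the schema version an author would choose.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <metadata>
    <rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <!-- Statements about the (hypothetical) document at this URI. -->
      <rdf:Description rdf:about="http://www.example.com/greeting.ssml">
        <dc:title>Welcome greeting</dc:title>
        <dc:description>Greeting played when a caller connects</dc:description>
        <dc:date>2007-01-10</dc:date>
        <dc:language>en-US</dc:language>
      </rdf:Description>
    </rdf:RDF>
  </metadata>
  <s>Welcome.</s>
</speak>
```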

The metadata element can have arbitrary content, although none of the content will be rendered by the synthesis processor.

3.1.8 Text Structure: p, s, and w Elements

3.1.8.1 p and s Elements

The use of p and s elements is optional. Where text occurs without an enclosing p or s element the synthesis processor should attempt to determine the structure using language-specific knowledge of the format of plain text.
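Informative: A paragraph containing two explicitly marked sentences could be written as follows. The text is a placeholder.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <p>
    <s>This is the first sentence of the paragraph.</s>
    <s>Here is another sentence.</s>
  </p>
</speak>
```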

3.1.8.2 w Element

The w element allows the author to indicate its content is a word or a token and to eliminate word segmentation ambiguities of the synthesis processor.

The w element is necessary in order to render languages

that do not use white space as a boundary identifier, such as Chinese, Thai, and Japanese;
that use white space for syllable segmentation, such as Vietnamese;
that use white space for other purposes, such as Urdu.

Use of this element can result in improved cues for prosodic control (e.g., pause) and may assist the synthesis processor in selection of the correct pronunciation for homographs.

Issue: Other names for the element have been suggested. Some people suggest using <token> because the name should be consistent with other specifications, especially SRGS. Some people suggest using <word> because the name is easier for document authors to understand.

The use of w elements is optional. Where text occurs without an enclosing w element the synthesis processor should attempt to determine the word segmentation using language-specific knowledge of the format of plain text.

xml:lang is a defined attribute on the w element to identify the language of the content.

xml:id is a defined attribute on the w element.

role is an optional defined attribute on the w element. The role attribute takes as its value one or more white-space separated QNames (as defined in Section 4 of Namespaces in XML 1.0 (Second Edition) [XMLNS]). A QName in the attribute content is expanded into an expanded-name using the namespace declarations in scope for the containing w element. Thus, each QName provides a reference to a specific item in the designated namespace. In the second example below, the QName within the role attribute expands to the "VV0" item in the "http://www.example.com/claws7tags" namespace. This mechanism allows for referencing defined taxonomies of word classes, with the expectation that they are documented at the specified namespace URI.

The role attribute is intended to be of use in synchronizing with other specifications, for example to describe additional information to help the selection of the most appropriate pronunciation for the contained text inside an external pronunciation lexicon (see the lexicon element).

The w element can only contain text to be rendered and the following elements: audio, break, emphasis, mark, phoneme, prosody, say-as, sub, voice.

Issue: The exact permitted contents of the w element are still under discussion. SSML 1.0 implicitly assumes that the word or token level is the smallest unit that can be marked up. One piece of evidence for this assumption is that any element inserted within a word must be treated as if whitespace had been inserted before and after the element, breaking the word into two separate words. Chinese-language users of SSML 1.0 have, however, used prosodic marking down to the ideograph level, which corresponds roughly to a syllable, and synthesis processors for Chinese languages commonly permit user control at this level. These users would like to preserve this capability when using the w element by permitting prosodic markup within the w element. An issue this presents is that the interpretation of such markup for pronunciation-based orthographies (as in English or Western European languages) is not well-defined, and it is not clear what support requirements should be expected of synthesis processors for those languages if such markup is used within the w element. There is also a problem with recursion of content. If the w element may contain voice and prosody, which themselves can contain s, then w can also contain s, although logically a word cannot contain sentences.

The w element can only be contained in the following elements: audio, emphasis, lang, lookup, prosody, speak, p, s, voice.

Here is an example showing the use of the w element.

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="zh-CN">

  <!-- The Nanjing Changjiang River Bridge -->
  <w>南京市</w><w>长江大桥</w>
  <!-- The mayor of Nanjing city, Jiang Daqiao -->
  南京市长<w>江大桥</w>
  <!-- Shanghai is a metropolis -->
  上海是个<w>大都会</w>
  <!-- Most Shanghainese will say something like that -->
  上海人<w>大都</w>会那么说
</speak>

Here is an example showing the use of the role attribute.

Here is a sample pronunciation lexicon (PLS) for the Chinese word "处":

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         xmlns:claws="http://www.example.com/claws7tags"
         alphabet="x-myorganization-pinyin"
         xml:lang="zh-CN">
  <lexeme role="claws:VV0">
    <!-- base form of lexical verb -->
    <grapheme>处</grapheme>
    <phoneme>chu3</phoneme>
    <!-- pinyin string is: "chǔ" in 处罚 处置 -->
  </lexeme>
  <lexeme role="claws:NN">
    <!-- common noun, neutral for number -->
    <grapheme>处</grapheme>
    <phoneme>chu4</phoneme>
    <!-- pinyin string is: "chù" in 处所 妙处 -->
  </lexeme>
</lexicon>

This is a sample document which references the above lexicon and shows how the role attribute may be used to select the appropriate pronunciation of the Chinese word "处" in the dialog.

<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                             http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xmlns:claws="http://www.example.com/claws7tags"
         xml:lang="zh-CN">
  <lexicon uri="http://www.example.com/lexicon.pls"
           type="application/pls+xml"
           xml:id="mylex"/>
  <lookup ref="mylex">
    他这个人很不好相<w role="claws:VV0">处</w>。
    此<w role="claws:NN">处</w>不准照相。
  </lookup>
</speak>

3.1.9 say-as Element

The say-as element allows the author to indicate information on the type of text construct contained within the element and to help specify the level of detail for rendering the contained text.

Defining a comprehensive set of text format types is difficult because of the variety of languages that have to be considered and because of the innate flexibility of written languages. SSML only specifies the say-as element, its attributes, and their purpose. It does not enumerate the possible values for the attributes. The Working Group expects to produce a separate document that will define standard values and associated normative behavior for these values. Examples given here are only for illustrating the purpose of the element and the attributes.

The say-as element has three attributes: interpret-as, format, and detail. The interpret-as attribute is always required; the other two attributes are optional. The legal values for the format attribute depend on the value of the interpret-as attribute.

The interpret-as and format attributes

The interpret-as attribute indicates the content type of the contained text construct. Specifying the content type helps the synthesis processor to distinguish and interpret text constructs that may be rendered in different ways depending on what type of information is intended. In addition, the optional format attribute can give further hints on the precise formatting of the contained text for content types that may have ambiguous formats.

When specified, the interpret-as and format values are to be interpreted by the synthesis processor as hints provided by the markup document author to aid text normalization and pronunciation.

In all cases, the text enclosed by any say-as element is intended to be a standard, orthographic form of the language currently in context. A synthesis processor should be able to support the common, orthographic forms of the specified language for every content type that it supports.

When the value for the interpret-as attribute is unknown or unsupported by a processor, it must render the contained text as if no interpret-as value were specified.

When the value for the format attribute is unknown or unsupported by a processor, it must render the contained text as if no format value were specified, and should render it using the interpret-as value that is specified.

When the content of the say-as element contains additional text next to the content that is in the indicated format and interpret-as type, then this additional text must be rendered. The processor may make the rendering of the additional text dependent on the interpret-as type of the element in which it appears.

When the content of the say-as element contains no content in the indicated interpret-as type or format, the processor must render the content either as if the format attribute were not present, or as if the interpret-as attribute were not present, or as if neither the format nor interpret-as attributes were present. The processor should also notify the environment of the mismatch.

Indicating the content type or format does not necessarily affect the way the information is pronounced. A synthesis processor should pronounce the contained text in a manner in which such content is normally produced for the language.
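
As a minimal sketch of how these two attributes are written, the following document marks one string as a date and another as a telephone number. The values "date", "mdy" and "telephone" are assumptions chosen for illustration only; this specification does not define any attribute values, and a processor that does not recognize them renders the text as if the attributes were absent.

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
  <!-- "date" and "mdy" are illustrative values, not defined by SSML -->
  Your appointment is on
  <say-as interpret-as="date" format="mdy">01/10/2007</say-as>.
  <!-- "telephone" is likewise an illustrative value -->
  Call <say-as interpret-as="telephone">555-0100</say-as> to reschedule.
</speak>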

The detail attribute

The detail attribute is an optional attribute that indicates the level of detail to be read aloud or rendered. Every value of the detail attribute must render all of the informational content in the contained text; however, specific values for the detail attribute can be used to render content that is not usually informational in running text but may be important to render for specific purposes. For example, a synthesis processor will usually render punctuation through appropriate changes in prosody. Setting a higher level of detail may be used to speak punctuation explicitly, e.g. for reading out coded part numbers or pieces of software code.

If the detail attribute is not specified, the level of detail that is produced by the synthesis processor depends on the text content and the language.

When the value for the detail attribute is unknown or unsupported by a processor, it must render the contained text as if no value were specified for the detail attribute.
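
The following sketch shows where the detail attribute would appear when an author wants the punctuation in a part number spoken aloud. The values "characters" and "punctuation" are hypothetical, since this specification does not enumerate values for interpret-as or detail; a processor that does not support them falls back to its default rendering.

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
  <!-- "characters" and "punctuation" are hypothetical values -->
  The part number is
  <say-as interpret-as="characters" detail="punctuation">AB-1207/C</say-as>.
</speak>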

3.1.10 phoneme Element

The phoneme element provides a phonemic/phonetic pronunciation for the contained text. The phoneme element may be empty. However, it is recommended that the element contain human-readable text that can be used for non-spoken rendering of the document. For example, the content may be displayed visually for users with hearing impairments.

The ph attribute is a required attribute that specifies the phoneme/phone string.

This element is designed strictly for phonemic and phonetic notations and is intended to be used to provide pronunciations for words or very short phrases. The phonemic/phonetic string does not undergo text normalization and is not treated as a token for lookup in the lexicon (see Section 3.1.5), while values in say-as and sub may undergo both. Briefly, phonemic strings consist of phonemes, language-dependent speech units that characterize linguistically significant differences in the language; loosely, phonemes represent all the sounds needed to distinguish one word from another in a given language. On the other hand, phonetic strings consist of phones, speech units that characterize the manner (puff of air, click, vocalized, etc.) and place (front, middle, back, etc.) of articulation within the human vocal tract and are thus independent of language; phones represent realized distinctions in human speech production.

The alphabet attribute is an optional attribute that specifies the phonemic/phonetic alphabet. An alphabet in this context refers to a collection of symbols to represent the sounds of one or more human languages. The only valid values for this attribute are "ipa" (see the next paragraph), values defined in the Pronunciation Alphabet Registry, and vendor-defined strings of the form "x-organization" or "x-organization-alphabet". For example, the Japan Electronics and Information Technology Industries Association [JEITA] might wish to encourage the use of an alphabet such as "x-JEITA" or "x-JEITA-2000" for their phoneme alphabet [JEIDAALPHABET].

Synthesis processors should support a value for alphabet of "ipa", corresponding to Unicode representations of the phonetic characters developed by the International Phonetic Association [IPA]. In addition to an exhaustive set of vowel and consonant symbols, this character set supports a syllable delimiter, numerous diacritics, stress symbols, lexical tone symbols, intonational markers and more. For this alphabet, legal ph values are strings of the values specified in Appendix 2 of [IPAHNDBK]. Informative tables of the IPA-to-Unicode mappings can be found at [IPAUNICODE1] and [IPAUNICODE2]. Note that not all of the IPA characters are available in Unicode. For processors supporting this alphabet,

It is an error if a value for alphabet is specified that is not known or cannot be applied by a synthesis processor. The default behavior when the alphabet attribute is left unspecified is processor-specific.
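
A short sketch of the phoneme element with the "ipa" alphabet follows. The transcription is one plausible US English pronunciation of "tomato" and is given as an assumption for illustration; the element content is the human-readable text recommended above for non-spoken rendering.

<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
  <!-- ph holds the IPA string; it is not text-normalized or looked up -->
  I like <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme> soup.
</speak>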

3.1.10.1 Pronunciation Alphabet Registry

Issue: We are still working out the location and details of the Registry. A link will be provided in this document when it is available.

Issue: The LTRU IETF WG (which is working on language tags) is currently discussing the introduction of a subtag for IPA, and maybe other alphabets. We are coordinating with them to determine what overlap, if any, there is between our two efforts.

3.1.11 sub Element

The sub element is employed to indicate that the text in the alias attribute value replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form. The required alias attribute specifies the string to be spoken instead of the enclosed string. The processor should apply text normalization to the alias value.
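
For instance, an abbreviation can be kept in the written form while its expansion is spoken. This is a minimal sketch; the sentence itself is invented for illustration.

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
  <!-- The alias value is spoken; the element content is the written form -->
  This draft is published by the
  <sub alias="World Wide Web Consortium">W3C</sub>.
</speak>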

3.1.12 lang Element

The lang element is used to specify the natural language of the content.

The lang element has one attribute, xml:lang, which is always required.

This element may be used when there is a change in the natural language. There is no text structure associated with the language change indicated by the lang element. It may be used to specify the language of the content at a level other than a paragraph, sentence or word level. When language change is to be associated with text structure, it is recommended to use the xml:lang attribute on the respective p, s or w element.

Issue: The name of this element is still under discussion. One alternative that has been suggested is "span".

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  The French word for cat is <w xml:lang="fr">chat</w>.
  He prefers to eat pasta that is <lang xml:lang="it">al dente</lang>.
</speak>

The lang element can only contain text to be rendered and the following elements: audio, break, emphasis, lang, mark, p, phoneme, prosody, say-as, sub, s, voice, w.

3.2 Prosody and Style

3.2.1 voice Element

The voice element is a production element that requests a change in speaking voice. Attributes are:

Although each attribute individually is optional, note that the voice element inherits the value of the xml:lang attribute from its superior in the hierarchy of nested elements. A language descriptor is always available because the xml:lang attribute must be specified in the root speak element. The inherited value of the xml:lang attribute indicates the language to be spoken by the voice. Thus, even using the voice element without explicitly specifying any attributes may result in a change in voice if a more appropriate voice for the language is available.

Although indication of language (using xml:lang) and selection of voice (using voice) are independent, there is no requirement that a synthesis processor support every possible combination of values of the two. However, a synthesis processor must document expected rendering behavior for every possible combination.

When there is not a voice available that exactly matches the attributes specified in the document, or there are multiple voices that match the criteria, the following voice selection algorithm must be used. There are cases in the algorithm that are ambiguous; in such cases voice selection may be processor-specific. Approximately speaking, the xml:lang attribute has the highest priority and all other attributes are equal in priority but below xml:lang. The complete algorithm is:

Issue: Note that <lang xml:lang="XXX"><voice attribute=""...>...</voice></lang> is NOT the same as <voice attribute=""...><lang xml:lang="XXX">...</lang></voice> because voice inherits xml:lang from its hierarchy. We are still considering whether or not this situation is acceptable.

The attributes of the voice element are inherited down the tree, including within elements that change the language.

Relative changes in prosodic parameters should be carried across voice changes. However, different voices have different natural defaults for pitch, speaking rate, etc. because they represent different personalities, so absolute values of the prosodic parameters may vary across changes in the voice.

The quality of the output audio or voice may suffer if a change in voice is requested within a sentence.
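
A minimal sketch of voice changes placed at sentence boundaries, which avoids the quality loss mentioned above. It assumes that the gender attribute and its values "female" and "male" are available as in SSML 1.0; the attribute list is not reproduced in this section, so treat the attribute name and values as assumptions.

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
  <!-- gender is assumed to be defined as in SSML 1.0 -->
  <voice gender="female">
    <s>Mary had a little lamb.</s>
  </voice>
  <!-- The language is inherited from the root speak element -->
  <voice gender="male">
    <s>Its fleece was white as snow.</s>
  </voice>
</speak>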

3.2.2 emphasis Element

The emphasis element requests that the contained text be spoken with emphasis (also referred to as prominence or stress). The synthesis processor determines how to render emphasis since the nature of emphasis differs between languages, dialects or even voices. The attributes are:

3.2.3 break Element

The break element is an empty element that controls the pausing or other prosodic boundaries between words. The use of the break element between any pair of words is optional. If the element is not present between words, the synthesis processor is expected to automatically determine a break based on the linguistic context. In practice, the break element is most often used to override the typical automatic behavior of a synthesis processor. The attributes on this element are:

The strength attribute is used to indicate the prosodic strength of the break. For example, the breaks between paragraphs are typically much stronger than the breaks between words within a sentence. The synthesis processor may insert a pause as part of its implementation of the prosodic break. A pause of a specific length can also be inserted by using the time attribute.

If a break element is used with neither strength nor time attributes, a break will be produced by the processor with a prosodic strength greater than that which the processor would otherwise have used if no break element were supplied.

If both strength and time attributes are supplied, the processor will insert a break with a duration as specified by the time attribute, with other prosodic changes in the output based on the value of the strength attribute.
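
The three cases described above can be sketched as follows. The strength value "weak" and the time value "3s" follow the SSML 1.0 value syntax and are assumed to remain valid here; the prompt text is invented for illustration.

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
  <!-- No attributes: a break stronger than the processor's default -->
  Take a deep breath <break/> then continue.
  <!-- time only: a pause of the stated length -->
  Press 1 or wait for the tone. <break time="3s"/>
  <!-- strength and time: duration from time, other prosody from strength -->
  I didn't hear you! <break strength="weak" time="500ms"/> Please repeat.
</speak>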

3.2.4 prosody Element

The prosody element permits control of the pitch, speaking rate and volume of the speech output. The attributes, all optional, are:

Although each attribute individually is optional, it is an error if no attributes are specified when the prosody element is used. The "x-foo" attribute value names are intended to be mnemonics for "extra foo". All units ("Hz", "st") are case-sensitive. Note also that customary pitch levels and standard pitch ranges may vary significantly by language, as may the meanings of the labelled values for pitch targets and ranges.

Number

A number is a simple positive floating point value without exponentials. Legal formats are "n", "n.", ".n" and "n.n" where "n" is a sequence of one or more digits.

Relative values

Pitch contour

The pitch contour is defined as a set of white space-separated targets at specified time positions in the speech output. The algorithm for interpolating between the targets is processor-specific. In each pair of the form (time position,target), the first value is a percentage of the period of the contained text (a number followed by "%") and the second value is the value of the pitch attribute (a number followed by "Hz", a relative change, or a label value). Time position values outside 0% to 100% are ignored. If a pitch value is not defined for 0% or 100% then the nearest pitch target is copied. All relative values for the pitch are relative to the pitch value just before the contained text.
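
A sketch of a contour with three targets is given below. The relative target values ("+20Hz", "+30%") assume the relative-change syntax of SSML 1.0. Because no target is given for 100%, the nearest target, the one at 40%, is copied to the end of the contained text.

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
  <!-- Each pair is (time position,target), separated by white space -->
  <prosody contour="(0%,+20Hz) (10%,+30%) (40%,+10Hz)">
    good morning
  </prosody>
</speak>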

The duration attribute takes precedence over the rate attribute. The contour attribute takes precedence over the pitch and range attributes.

The default value of all prosodic attributes is no change. For example, omitting the rate attribute means that the rate is the same within the element as outside.
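
The "no change" default can be sketched with nested prosody elements: the inner element sets only volume, so the rate requested by the outer element continues to apply. The label values "slow" and "loud" are assumed from SSML 1.0, since the attribute value lists are not reproduced in this section.

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
  <prosody rate="slow">
    This sentence is spoken slowly.
    <!-- rate is omitted here, so it stays the same as outside -->
    <prosody volume="loud">This one is still slow, and louder.</prosody>
  </prosody>
</speak>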

Limitations

All prosodic attribute values are indicative. If a synthesis processor is unable to accurately render a document as specified (e.g., trying to set the pitch to 1 MHz or the speaking rate to 1,000,000 words per minute), it must make a best effort to continue processing by imposing a limit or a substitute for the specified, unsupported value and may inform the host environment when such limits are exceeded.

In some cases, synthesis processors may elect to ignore a given prosodic markup if the processor determines, for example, that the indicated value is redundant, improper or in error. In particular, concatenative-type synthetic speech systems that employ large acoustic units may reject prosody-modifying markup elements if they are redundant with the prosody of a given acoustic unit(s) or would otherwise result in degraded speech quality.

3.3 Other Elements

3.3.1 audio Element

The audio element supports the insertion of recorded audio files (see Appendix A for required formats) and the insertion of other audio formats in conjunction with synthesized speech output. The audio element may be empty. If the audio element is not empty then the contents should be the marked-up text to be spoken if the audio document is not available. The alternate content may include text, speech markup, desc elements, or other audio elements. The alternate content may also be used when rendering the document to non-audible output and for accessibility (see the desc element). The required attribute is src, which is the URI of a document with an appropriate MIME type.

Deciding which conditions result in the alternative content being rendered is processor-dependent. If the audio element is not successfully rendered, a synthesis processor should continue processing and should notify the hosting environment. The processor may determine after beginning playback of an audio source that the audio cannot be played in its entirety. For example, encoding problems, network disruptions, etc. may occur. The processor may designate this either as successful or unsuccessful rendering, but it must document this behavior.
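
The empty and non-empty forms of the audio element can be sketched as follows. The file name "prompt.au" and the example.com URI are hypothetical.

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
  <!-- Empty element: no alternate content is available -->
  Please say your city after the tone.
  <audio src="http://www.example.com/prompt.au"/>
  <!-- Non-empty element: the text may be spoken if the audio is unavailable -->
  <audio src="http://www.example.com/prompt.au">
    What city do you want to fly from?
  </audio>
</speak>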

3.3.2 mark Element

A mark element is an empty element that places a marker into the text/tag sequence. It has one required attribute, name, which is of type xsd:token [SCHEMA2 §3.3.2]. The mark element can be used to reference a specific location in the text/tag sequence, and can additionally be used to insert a marker into an output stream for asynchronous notification. When processing a mark element, a synthesis processor must do one or both of the following:

3.3.3 desc Element

The desc element can only occur within the content of the audio element. When the audio source referenced in audio is not speech, e.g. audio wallpaper or sonicon punctuation, the audio element should contain a desc element whose textual content is a description of the audio source (e.g. "door slamming"). If text-only output is being produced by the synthesis processor, the content of the desc element(s) should be rendered instead of other alternative content in audio. The optional xml:lang attribute can be used to indicate that the content of the element is in a different language from that of the content surrounding the element.
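
A minimal sketch of a non-speech audio source with a description follows. The file name and URI are hypothetical. In text-only output, the desc content would be rendered in place of the other alternative content.

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
  The home team scores.
  <audio src="http://www.example.com/cheering.wav">
    <!-- Description of the non-speech audio source -->
    <desc>crowd cheering</desc>
    The crowd goes wild.
  </audio>
</speak>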

4. References

4.1 Normative References

4.2 Informative References

5. Acknowledgments

This document was written with the participation of the following participants in the W3C Voice Browser Working Group (listed in family name alphabetical order):

Appendix A: Audio File Formats

SSML requires that a platform support the playing of the audio formats specified below.

Required audio formats
Audio Format	Media Type
Raw (headerless) 8kHz 8-bit mono mu-law (PCM) single channel. (G.711)	audio/basic (from [RFC1521])
Raw (headerless) 8kHz 8-bit mono A-law (PCM) single channel. (G.711)	audio/x-alaw-basic
WAV (RIFF header) 8kHz 8-bit mono mu-law (PCM) single channel.	audio/x-wav
WAV (RIFF header) 8kHz 8-bit mono A-law (PCM) single channel.	audio/x-wav

The 'audio/basic' MIME type is commonly used with the 'au' header format as well as the headerless 8-bit 8kHz mu-law format. If this MIME type is specified for playing, the mu-law format must be used. For playback with the 'audio/basic' MIME type, processors must support the mu-law format and may support the 'au' format.

Appendix B: Internationalization

SSML is an application of XML 1.0 [XML] and thus supports [UNICODE] which defines a standard universal character set.

SSML provides a mechanism for control of the spoken language via the use of the xml:lang attribute. Language changes can occur as frequently as per word, although excessive language changes can diminish the output audio quality. SSML also permits finer control over output pronunciations via the lexicon and phoneme elements, features that can help to mitigate poor-quality default lexicons for languages with only minimal commercial support today.

Appendix C: MIME Types and File Suffix

The media type associated with the Speech Synthesis Markup Language specification is "application/ssml+xml" and the filename suffix is ".ssml" as defined in [RFC4267].

Appendix D: Schema for the Speech Synthesis Markup Language

Note: the synthesis schema includes a no-namespace core schema, located at http://www.w3.org/TR/speech-synthesis/synthesis-core.xsd, which may be used as a basis for specifying Speech Synthesis Markup Language Fragments (Sec. 2.2.1) embedded in non-synthesis namespace schemas.

Appendix E: DTD for the Speech Synthesis Markup Language

Due to DTD limitations, the SSML DTD does not correctly express that the metadata element can contain elements from other XML namespaces.

Appendix F: Example SSML

The following is an example of reading headers of email messages. The p and s elements are used to mark the text structure. The break element is placed before the time and has the effect of marking the time as important information for the listener to pay attention to. The prosody element is used to slow the speaking rate of the email subject so that the user has extra time to listen and write down the details.
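
A document of the kind described might look like the sketch below. The message details are invented for illustration, and the rate value "slow" is assumed from the SSML 1.0 label values.

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
  <p>
    <s>You have 4 new messages.</s>
    <!-- The break marks the time as important information -->
    <s>The first is from Stephanie Williams and arrived at <break/> 3:45pm.</s>
    <!-- The subject is slowed so the listener can write it down -->
    <s>The subject is <prosody rate="slow">ski trip</prosody>.</s>
  </p>
</speak>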

The following example combines audio files and different spoken voices to provide information on a collection of music.

It is often the case that an author wishes to include a bit of foreign text (say, a movie title) in an application without having to switch languages (for example via the lang element). A simple way to do this is shown here. In this example the synthesis processor would render the movie name using the pronunciation rules of the container language ("en-US" in this case), similar to how a reader who doesn't know the foreign language might try to read (and pronounce) it.
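
A sketch of this approach, with an Italian movie title left unmarked inside "en-US" content, could read as follows. The sentence is illustrative.

<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
  <!-- No lang element: the title is read with en-US pronunciation rules -->
  The title of the movie is:
  "La vita è bella"
  (Life is beautiful),
  which is directed by Roberto Benigni.
</speak>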

With some additional work the output quality can be improved tremendously either by creating a custom pronunciation in an external lexicon (see Section 3.1.5) or via the phoneme element as shown in the next example.

It is worth noting that IPA alphabet support is an optional feature and that phonemes for an external language may be rendered with some approximation (see Section 3.1.5 for details). The following example only uses phonemes common to US English.

SMIL Integration Example

The SMIL language [SMIL] is an XML-based multimedia control language. It is especially well suited for describing dynamic media applications that include synthetic speech output.

SMIL Example 1: W3C logo image appears, and then one second later, the speech sequence is rendered. File 'greetings.smil' contains the following:

SMIL Example 2: W3C logo image appears, then clicking on the image causes it to disappear and the speech sequence to be rendered. File 'greetings.smil' contains the following: