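MRCP (Media Resource Control Protocol) is the protocol through which a speech server offers services such as speech recognition and speech synthesis to clients. MRCP cannot operate on its own: the control session and audio streams must be established through RTSP or SIP. It is a text-based protocol in the style of HTTP, and every message has three parts: a first line, headers, and a body.
Like HTTP, MRCP follows a request and response model. For example, an MRCP client sends a request asking to stream audio data to the server for speech recognition, and the server replies with a message that includes the port number on which it will receive the data. MRCP itself does not define how the audio is transported, so that part is handled by RTP.
MRCPv2 (RFC 6787) uses SIP to manage the session and audio streams, whereas MRCPv1 (RFC 4463) relies on RTSP (RFC 2326). MRCPv2 is the version discussed most often today. While MRCPv2 was being developed, there was consensus that the way MRCPv1 used RTSP would cause backward-compatibility problems, which is forbidden by Section 3.2 of RFC 4313 (Requirements for Distributed Control of Automatic Speech Recognition (ASR), Speaker Identification/Speaker Verification (SI/SV), and Text-to-Speech (TTS) Resources). This is why MRCPv2 does not run over RTSP.
MRCPv2 uses SIP to set up the sessions and media for speech resources independently, adds support for speaker verification and speaker identification engines, and leaves room for future extensions.
The architecture diagram from the MRCPv2 specification is: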
    MRCPv2 client                      MRCPv2 Media Resource Server
|--------------------|            |------------------------------------|
||------------------||            ||----------------------------------||
|| Application Layer||            ||Synthesis|Recognition|Verification||
||------------------||            || Engine  |  Engine   |   Engine   ||
||Media Resource API||            ||    ||   |    ||     |    ||      ||
||------------------||            ||Synthesis|Recognizer |  Verifier  ||
|| SIP  |  MRCPv2   ||            ||Resource | Resource  |  Resource  ||
||Stack |           ||            ||     Media Resource Management    ||
||      |           ||            ||----------------------------------||
||------------------||            ||   SIP  |        MRCPv2           ||
||   TCP/IP Stack   ||---MRCPv2---||  Stack |                         ||
||                  ||            ||----------------------------------||
||------------------||----SIP-----||           TCP/IP Stack           ||
|--------------------|            ||                                  ||
         |                        ||----------------------------------||
        SIP                       |------------------------------------|
         |                          /
|-------------------|             RTP
|                   |             /
| Media Source/Sink |------------/
|                   |
|-------------------|
Figure 1: Architectural Diagram
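The W3C set up the Voice Browser Working Group (VBWG) in 1999 to study how the web could support speech recognition and DTMF handling. The group then published a web-based voice interface framework whose core is VoiceXML.
The W3C Speech Recognition Grammar Specification (SRGS) is an XML standard for writing speech grammars, that is, the rules describing the words and short phrases a recognizer can accept. Closely related to SRGS is W3C Semantic Interpretation for Speech Recognition (SISR), which annotates speech grammars with semantic information and provides a basic format for natural language understanding. The W3C Speech Synthesis Markup Language (SSML) is an XML-based way to specify content for speech synthesis; it controls properties of the generated speech such as volume, pronunciation, pauses, and speaking rate.
SRGS and SSML are both complemented by the W3C Pronunciation Lexicon Specification (PLS), which uses a standard phonetic alphabet to specify how words and short phrases are pronounced.
Combined with MRCP, VoiceXML can work with a variety of third-party speech recognition and synthesis engines.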
MRCPv2 Media Resource Types
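An MRCPv2 server is a SIP server, so it is addressed by a SIP URI (sip:mrcpv2@example.net or sips:mrcpv2@example.net). It can offer the following media processing resources to clients:
Basic Synthesizer
Produces a speech media stream by concatenating audio clips. The speech data is described with a limited subset of the Speech Synthesis Markup Language (SSML); even the simplest synthesizer must support these SSML tags:
<speak>, <audio>, <say-as>, <mark>
Speech Synthesizer
Provides full TTS capability and must support SSML completely.
Recorder
Records audio and provides a URI pointing to the recording. It must support suppressing silence at the beginning and end of a recording, and may optionally suppress silence in the middle. If silence is suppressed, the recorder must keep timing metadata so that the actual timestamps of the speech in the original media can be recovered.
DTMF Recognizer
Detects Dual-Tone Multi-Frequency (DTMF) digits in the media stream and matches them against a supplied digit grammar.
Speech Recognizer
A full speech recognition resource that receives an audio media stream and produces recognition results. It also includes a natural language semantic interpreter that post-processes the recognized text into the semantic data defined by the grammar.
Speaker Verifier
Identifies a speaker by matching the voice against existing voiceprints.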
Resource Type | Resource Description |
---|---|
speechrecog | Speech Recognizer |
dtmfrecog | DTMF Recognizer |
speechsynth | Speech Synthesizer |
basicsynth | Basic Synthesizer |
speakverify | Speaker Verification |
recorder | Speech Recorder |
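In the MRCPv2 specification, an application uses the protocol as follows:
The MRCP client uses SIP and SDP to establish an MRCP control channel with the MRCP server. Each channel is uniquely identified by an MRCP channel ID, which the server assigns through the a=channel attribute in its 200 response.
A SIP re-INVITE can add or remove MRCP control channels within a session, so one session can have several MRCP control channels (for example, an ASR channel and a TTS channel at the same time).
Multiple MRCP control channels can share the same TCP connection.
An MRCP message carries exactly one MRCP channel ID.
MRCP control messages must not change the state of the SIP dialog.
MRCP itself does not guarantee reliable delivery, so it must run over TCP or TLS.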
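Because several control channels can share one TCP connection while each message carries exactly one channel ID, a client has to route every incoming message by its Channel-Identifier header. The following is a minimal Python sketch of that routing, assuming the byte stream has already been split into complete messages; the function names `channel_of` and `route_by_channel` are hypothetical and not part of any MRCP library.

```python
from collections import defaultdict


def channel_of(message: str) -> str:
    """Return the Channel-Identifier value of one complete MRCPv2 message."""
    header_block = message.split("\r\n\r\n", 1)[0]
    for line in header_block.split("\r\n")[1:]:  # skip the start-line
        name, _, value = line.partition(":")
        if name.strip().lower() == "channel-identifier":
            return value.strip()
    raise ValueError("message has no Channel-Identifier header")


def route_by_channel(messages):
    """Group messages received on one shared TCP connection by channel ID."""
    routed = defaultdict(list)
    for message in messages:
        routed[channel_of(message)].append(message)
    return dict(routed)


# Two channels of the same session, as in a combined ASR + TTS setup.
synth_msg = ("MRCP/2.0 83 543257 200 IN-PROGRESS\r\n"
             "Channel-Identifier:32AECB234338@speechsynth\r\n\r\n")
recog_msg = ("MRCP/2.0 83 543258 200 IN-PROGRESS\r\n"
             "Channel-Identifier:32AECB234338@speechrecog\r\n\r\n")
routed = route_by_channel([synth_msg, recog_msg, synth_msg])
for channel, msgs in routed.items():
    print(channel, len(msgs))
```

A real client would hand each group to the handler for that resource (synthesizer, recognizer, and so on) instead of collecting messages in a dictionary.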
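resource control channel
MRCPv2 channels are negotiated in the SDP carried by SIP. The client connects to the MRCPv2 server with a SIP INVITE, which creates a SIP dialog. Through SDP, the two endpoints negotiate every resource control channel to be set up and establish the media session between the server and the source/sink of audio.
The client needs a separate MRCPv2 resource control channel for each media resource it controls within the SIP dialog, so each channel is given a unique channel identifier string.
The SDP must contain one "m=" line for every MRCPv2 resource in the session. Its transport type must be "TCP/MRCPv2" or "TCP/TLS/MRCPv2", and the client connects to the MRCPv2 server over TCP or TCP/TLS accordingly.
example:
An example of connecting to a synthesizer, where the server sends a one-way audio stream to the client
- Create the Synthesizer Control Channel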
C->S: INVITE sip:mresources@example.com SIP/2.0
Via:SIP/2.0/TCP client.atlanta.example.com:5060;
branch=z9hG4bK74bf1
Max-Forwards:6
To:MediaServer <sip:mresources@example.com>
From:sarvi <sip:sarvi@example.com>;tag=1928301774
Call-ID:a84b4c76e66710
CSeq:314161 INVITE
Contact:<sip:sarvi@client.example.com>
Content-Type:application/sdp
Content-Length:...
v=0
o=sarvi 2890844526 2890844526 IN IP4 192.0.2.12
s=-
c=IN IP4 192.0.2.12
t=0 0
m=application 9 TCP/MRCPv2 1
a=setup:active
a=connection:new
a=resource:speechsynth
a=cmid:1
m=audio 49170 RTP/AVP 0
a=rtpmap:0 pcmu/8000
a=recvonly
a=mid:1
S->C: SIP/2.0 200 OK
Via:SIP/2.0/TCP client.atlanta.example.com:5060;
branch=z9hG4bK74bf1;received=192.0.32.10
To:MediaServer <sip:mresources@example.com>;tag=62784
From:sarvi <sip:sarvi@example.com>;tag=1928301774
Call-ID:a84b4c76e66710
CSeq:314161 INVITE
Contact:<sip:mresources@server.example.com>
Content-Type:application/sdp
Content-Length:...
v=0
o=- 2890842808 2890842808 IN IP4 192.0.2.11
s=-
c=IN IP4 192.0.2.11
t=0 0
m=application 32416 TCP/MRCPv2 1
a=setup:passive
a=connection:new
a=channel:32AECB234338@speechsynth
a=cmid:1
m=audio 48260 RTP/AVP 0
a=rtpmap:0 pcmu/8000
a=sendonly
a=mid:1
C->S: ACK sip:mresources@server.example.com SIP/2.0
Via:SIP/2.0/TCP client.atlanta.example.com:5060;
branch=z9hG4bK74bf2
Max-Forwards:6
To:MediaServer <sip:mresources@example.com>;tag=62784
From:Sarvi <sip:sarvi@example.com>;tag=1928301774
Call-ID:a84b4c76e66710
CSeq:314161 ACK
Content-Length:0
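Reusing the RTP stream above, the client then requests an additional resource control channel for a recognizer and switches the audio stream to sendrecv so that audio flows in both directions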
C->S: INVITE sip:mresources@server.example.com SIP/2.0
Via:SIP/2.0/TCP client.atlanta.example.com:5060;
branch=z9hG4bK74bf3
Max-Forwards:6
To:MediaServer <sip:mresources@example.com>;tag=62784
From:sarvi <sip:sarvi@example.com>;tag=1928301774
Call-ID:a84b4c76e66710
CSeq:314162 INVITE
Contact:<sip:sarvi@client.example.com>
Content-Type:application/sdp
Content-Length:...
v=0
o=sarvi 2890844526 2890844527 IN IP4 192.0.2.12
s=-
c=IN IP4 192.0.2.12
t=0 0
m=application 9 TCP/MRCPv2 1
a=setup:active
a=connection:existing
a=resource:speechsynth
a=cmid:1
m=audio 49170 RTP/AVP 0 96
a=rtpmap:0 pcmu/8000
a=rtpmap:96 telephone-event/8000
a=fmtp:96 0-15
a=sendrecv
a=mid:1
m=application 9 TCP/MRCPv2 1
a=setup:active
a=connection:existing
a=resource:speechrecog
a=cmid:1
S->C: SIP/2.0 200 OK
Via:SIP/2.0/TCP client.atlanta.example.com:5060;
branch=z9hG4bK74bf3;received=192.0.32.10
To:MediaServer <sip:mresources@example.com>;tag=62784
From:sarvi <sip:sarvi@example.com>;tag=1928301774
Call-ID:a84b4c76e66710
CSeq:314162 INVITE
Contact:<sip:mresources@server.example.com>
Content-Type:application/sdp
Content-Length:...
v=0
o=- 2890842808 2890842809 IN IP4 192.0.2.11
s=-
c=IN IP4 192.0.2.11
t=0 0
m=application 32416 TCP/MRCPv2 1
a=setup:passive
a=connection:existing
a=channel:32AECB234338@speechsynth
a=cmid:1
m=audio 48260 RTP/AVP 0 96
a=rtpmap:0 pcmu/8000
a=rtpmap:96 telephone-event/8000
a=fmtp:96 0-15
a=sendrecv
a=mid:1
m=application 32416 TCP/MRCPv2 1
a=setup:passive
a=connection:existing
a=channel:32AECB234338@speechrecog
a=cmid:1
C->S: ACK sip:mresources@server.example.com SIP/2.0
Via:SIP/2.0/TCP client.atlanta.example.com:5060;
branch=z9hG4bK74bf4
Max-Forwards:6
To:MediaServer <sip:mresources@example.com>;tag=62784
From:Sarvi <sip:sarvi@example.com>;tag=1928301774
Call-ID:a84b4c76e66710
CSeq:314162 ACK
Content-Length:0
Release the recognizer channel and switch the audio stream back to recvonly
C->S: INVITE sip:mresources@server.example.com SIP/2.0
Via:SIP/2.0/TCP client.atlanta.example.com:5060;
branch=z9hG4bK74bf5
Max-Forwards:6
To:MediaServer <sip:mresources@example.com>;tag=62784
From:sarvi <sip:sarvi@example.com>;tag=1928301774
Call-ID:a84b4c76e66710
CSeq:314163 INVITE
Contact:<sip:sarvi@client.example.com>
Content-Type:application/sdp
Content-Length:...
v=0
o=sarvi 2890844526 2890844528 IN IP4 192.0.2.12
s=-
c=IN IP4 192.0.2.12
t=0 0
m=application 9 TCP/MRCPv2 1
a=resource:speechsynth
a=cmid:1
m=audio 49170 RTP/AVP 0
a=rtpmap:0 pcmu/8000
a=recvonly
a=mid:1
m=application 0 TCP/MRCPv2 1
a=resource:speechrecog
a=cmid:1
S->C: SIP/2.0 200 OK
Via:SIP/2.0/TCP client.atlanta.example.com:5060;
branch=z9hG4bK74bf5;received=192.0.32.10
To:MediaServer <sip:mresources@example.com>;tag=62784
From:sarvi <sip:sarvi@example.com>;tag=1928301774
Call-ID:a84b4c76e66710
CSeq:314163 INVITE
Contact:<sip:mresources@server.example.com>
Content-Type:application/sdp
Content-Length:...
v=0
o=- 2890842808 2890842810 IN IP4 192.0.2.11
s=-
c=IN IP4 192.0.2.11
t=0 0
m=application 32416 TCP/MRCPv2 1
a=channel:32AECB234338@speechsynth
a=cmid:1
m=audio 48260 RTP/AVP 0
a=rtpmap:0 pcmu/8000
a=sendonly
a=mid:1
m=application 0 TCP/MRCPv2 1
a=channel:32AECB234338@speechrecog
a=cmid:1
C->S: ACK sip:mresources@server.example.com SIP/2.0
Via:SIP/2.0/TCP client.atlanta.example.com:5060;
branch=z9hG4bK74bf6
Max-Forwards:6
To:MediaServer <sip:mresources@example.com>;tag=62784
From:Sarvi <sip:sarvi@example.com>;tag=1928301774
Call-ID:a84b4c76e66710
CSeq:314163 ACK
Content-Length:0
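On the client side, the SDP answers in these exchanges have to be scanned for the control channels the server granted. The sketch below is one possible way to do that in Python, assuming only the attributes that appear in the examples (`m=application ... TCP/MRCPv2` and `a=channel`); the helper name `mrcp_channels` is hypothetical. An m-line with port 0, as in the last answer, marks a stream that is no longer in use.

```python
def mrcp_channels(sdp: str):
    """Return one dict per MRCPv2 control m-line found in an SDP body."""
    channels = []
    current = None
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            fields = line[2:].split()
            current = None
            if (len(fields) >= 3 and fields[0] == "application"
                    and fields[2].endswith("/MRCPv2")):
                current = {"port": int(fields[1]),
                           "transport": fields[2],
                           "channel": None}
                channels.append(current)
        elif line.startswith("a=channel:") and current is not None:
            current["channel"] = line[len("a=channel:"):]
    return channels


# SDP answer from the last 200 OK: the recognizer m-line has port 0.
answer = """v=0
o=- 2890842808 2890842810 IN IP4 192.0.2.11
s=-
c=IN IP4 192.0.2.11
t=0 0
m=application 32416 TCP/MRCPv2 1
a=channel:32AECB234338@speechsynth
a=cmid:1
m=audio 48260 RTP/AVP 0
a=rtpmap:0 pcmu/8000
a=sendonly
a=mid:1
m=application 0 TCP/MRCPv2 1
a=channel:32AECB234338@speechrecog
a=cmid:1
"""

channels = mrcp_channels(answer)
active = [c["channel"] for c in channels if c["port"] != 0]
print(active)
```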
MRCPv2 message
MRCPv2 messages consist of requests from the client to the server, and responses and asynchronous events from the server to the client. Similar to HTTP, a message contains one start-line, one or more headers, an empty line marking the end of the headers, and an optional message body.
generic-message = start-line
message-header
CRLF
[ message-body ]
message-body = *OCTET
start-line = request-line / response-line / event-line
message-header = 1*(generic-header / resource-header / generic-field)
resource-header = synthesizer-header
/ recognizer-header
/ recorder-header
/ verifier-header
ex:
C->S: MRCP/2.0 877 INTERPRET 543266
Channel-Identifier:32AECB23433801@speechrecog
Interpret-Text:may I speak to Andre Roy
Content-Type:application/srgs+xml
Content-ID:<request1@form-level.store>
Content-Length:661
<?xml version="1.0"?>
<!-- the default grammar language is US English -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
xml:lang="en-US" version="1.0" root="request">
<!-- single language attachment to tokens -->
<rule id="yes">
<one-of>
<item xml:lang="fr-CA">oui</item>
<item xml:lang="en-US">yes</item>
</one-of>
</rule>
<!-- single language attachment to a rule expansion -->
<rule id="request">
may I speak to
<one-of xml:lang="fr-CA">
<item>Michel Tremblay</item>
<item>Andre Roy</item>
</one-of>
</rule>
</grammar>
S->C: MRCP/2.0 82 543266 200 IN-PROGRESS
Channel-Identifier:32AECB23433801@speechrecog
S->C: MRCP/2.0 634 INTERPRETATION-COMPLETE 543266 200 COMPLETE
Channel-Identifier:32AECB23433801@speechrecog
Completion-Cause:000 success
Content-Type:application/nlsml+xml
Content-Length:441
<?xml version="1.0"?>
<result xmlns="urn:ietf:params:xml:ns:mrcpv2"
xmlns:ex="http://www.example.com/example"
grammar="session:request1@form-level.store">
<interpretation>
<instance name="Person">
<ex:Person>
<ex:Name> Andre Roy </ex:Name>
</ex:Person>
</instance>
<input> may I speak to Andre Roy </input>
</interpretation>
</result>
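One detail that is easy to miss when producing messages like these is the message-length field in the start-line: it counts the octets of the whole message, including the start-line that contains it. The Python sketch below shows one way to fill it in by iterating until the declared and actual lengths agree. The function name `build_request` is hypothetical, and the sketch assumes CRLF line endings and at least one header.

```python
def build_request(method: str, request_id: int, headers: dict,
                  body: bytes = b"") -> bytes:
    """Serialize an MRCPv2 request and fill in its message-length."""
    lines = [f"{name}:{value}" for name, value in headers.items()]
    if body:
        lines.append(f"Content-Length:{len(body)}")
    tail = ("\r\n".join(lines) + "\r\n\r\n").encode("utf-8") + body

    # message-length covers the whole message, including the start-line
    # that carries it, so repeat until the declared length is consistent.
    length = 0
    while True:
        start_line = f"MRCP/2.0 {length} {method} {request_id}\r\n".encode("ascii")
        total = len(start_line) + len(tail)
        if total == length:
            return start_line + tail
        length = total


request = build_request(
    "SET-PARAMS",
    543256,
    {"Channel-Identifier": "32AECB23433802@speechsynth",
     "Voice-gender": "female",
     "Voice-variant": "3"},
)
print(request.decode("ascii"))
```

The loop settles after a few rounds because the length can only grow when the number of digits in it grows.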
The request-line format is
request-line = mrcp-version SP message-length SP method-name SP request-id CRLF
method-name = generic-method
/ synthesizer-method
/ recognizer-method
/ recorder-method
/ verifier-method
request-id = 1*10DIGIT
The response-line format is
response-line = mrcp-version SP message-length SP request-id
SP status-code SP request-state CRLF
status-code = 3DIGIT
request-state = "COMPLETE"
/ "IN-PROGRESS"
/ "PENDING"
The event-line format is
event-line = mrcp-version SP message-length SP event-name
SP request-id SP request-state CRLF
event-name = synthesizer-event
/ recognizer-event
/ recorder-event
/ verifier-event
Note that the message format defines separate methods, headers, and events for each of the four resource types: synthesizer, recognizer, recorder, and verifier.
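The three start-line formats can be told apart by the number and kind of tokens after message-length. The Python sketch below is one way to classify a start-line following the grammar above; `parse_start_line` is a hypothetical helper, and the message-length 120 in the event example is a made-up value.

```python
REQUEST_STATES = {"COMPLETE", "IN-PROGRESS", "PENDING"}


def parse_start_line(line: str) -> dict:
    """Classify an MRCPv2 start-line as request, response, or event."""
    tokens = line.strip().split(" ")
    if len(tokens) < 4 or tokens[0] != "MRCP/2.0" or not tokens[1].isdigit():
        raise ValueError(f"not an MRCPv2 start-line: {line!r}")
    parsed = {"length": int(tokens[1])}
    rest = tokens[2:]
    if len(rest) == 2 and rest[1].isdigit():
        # request-line: method-name request-id
        parsed.update(kind="request", method=rest[0], request_id=int(rest[1]))
    elif (len(rest) == 3 and rest[0].isdigit() and rest[1].isdigit()
          and rest[2] in REQUEST_STATES):
        # response-line: request-id status-code request-state
        parsed.update(kind="response", request_id=int(rest[0]),
                      status=int(rest[1]), state=rest[2])
    elif len(rest) == 3 and rest[1].isdigit() and rest[2] in REQUEST_STATES:
        # event-line: event-name request-id request-state
        parsed.update(kind="event", event=rest[0],
                      request_id=int(rest[1]), state=rest[2])
    else:
        raise ValueError(f"unrecognized start-line: {line!r}")
    return parsed


for example in ("MRCP/2.0 877 INTERPRET 543266",
                "MRCP/2.0 82 543266 200 IN-PROGRESS",
                "MRCP/2.0 120 START-OF-INPUT 543257 IN-PROGRESS"):
    print(parse_start_line(example))
```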
Generic Methods, Headers, Result Structure
Methods and headers common to all resources
MRCPv2 supports two generic methods, which read and write the state associated with a resource
generic-method = "SET-PARAMS"
/ "GET-PARAMS"
SET-PARAMS
Sent by the client to the server to define parameters on the MRCPv2 resource for the session
C->S: MRCP/2.0 ... SET-PARAMS 543256
Channel-Identifier:32AECB23433802@speechsynth
Voice-gender:female
Voice-variant:3
S->C: MRCP/2.0 ... 543256 200 COMPLETE
Channel-Identifier:32AECB23433802@speechsynth
GET-PARAMS
Sent by the client to the server to retrieve the current session parameters of the MRCPv2 resource
C->S: MRCP/2.0 ... GET-PARAMS 543256
Channel-Identifier:32AECB23433802@speechsynth
Voice-gender:
Voice-variant:
Vendor-Specific-Parameters:com.example.param1;
com.example.param2
S->C: MRCP/2.0 ... 543256 200 COMPLETE
Channel-Identifier:32AECB23433802@speechsynth
Voice-gender:female
Voice-variant:3
Vendor-Specific-Parameters:com.example.param1="Company Name";
com.example.param2="124324234@example.com"
MRCPv2 headers fall into generic headers and resource-specific headers.
Headers are defined as follows
generic-field = field-name ":" [ field-value ]
field-name = token
field-value = *LWS field-content *( CRLF 1*LWS field-content)
field-content = <the OCTETs making up the field-value
and consisting of either *TEXT or combinations
of token, separators, and quoted-string>
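The field-value rule allows a value to continue on following lines that start with whitespace, as the Vendor-Specific-Parameters header in the GET-PARAMS example does. The Python sketch below shows one way to parse a header block with such continuation lines, joining the pieces with a single space. `parse_headers` is a hypothetical helper, and the sketch keeps only the last value if a header name repeats.

```python
def parse_headers(header_block: str) -> dict:
    """Parse MRCPv2 header lines, unfolding continuation lines."""
    headers = {}
    last_name = None
    for line in header_block.split("\r\n"):
        if line == "":
            break  # the empty line ends the header section
        if line[0] in " \t":
            # Continuation of the previous header's value.
            if last_name is None:
                raise ValueError("continuation line without a header")
            headers[last_name] += " " + line.strip()
            continue
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"malformed header line: {line!r}")
        last_name = name.strip()
        headers[last_name] = value.strip()
    return headers


block = (
    "Channel-Identifier:32AECB23433802@speechsynth\r\n"
    "Voice-gender:female\r\n"
    "Voice-variant:3\r\n"
    'Vendor-Specific-Parameters:com.example.param1="Company Name";\r\n'
    '  com.example.param2="124324234@example.com"\r\n'
    "\r\n"
)
headers = parse_headers(block)
print(headers["Vendor-Specific-Parameters"])
```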
The generic headers are
generic-header = channel-identifier
/ accept
/ active-request-id-list
/ proxy-sync-id
/ accept-charset
/ content-type
/ content-id
/ content-base
/ content-encoding
/ content-location
/ content-length
/ fetch-timeout
/ cache-control
/ logging-tag
/ set-cookie
/ vendor-specific
Channel-Identifier
When a control channel is created, the server assigns it a channel ID
channel-identifier = "Channel-Identifier" ":" channel-id CRLF
channel-id = 1*alphanum "@" 1*alphanum
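In the examples in this article, the part after "@" is always one of the resource type names from the table of media resource types. The Python sketch below validates a channel-id against the grammar above and that table. `parse_channel_id` is a hypothetical helper, and restricting the suffix to the six listed resource types is an assumption based on those examples.

```python
import re

# Resource type names from the resource type table.
RESOURCE_TYPES = {"speechrecog", "dtmfrecog", "speechsynth",
                  "basicsynth", "speakverify", "recorder"}
CHANNEL_ID = re.compile(r"([A-Za-z0-9]+)@([A-Za-z0-9]+)")


def parse_channel_id(value: str):
    """Split a channel-id into (identifier, resource type)."""
    match = CHANNEL_ID.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"malformed channel-id: {value!r}")
    identifier, resource = match.groups()
    if resource not in RESOURCE_TYPES:
        raise ValueError(f"unknown resource type: {resource!r}")
    return identifier, resource


print(parse_channel_id("32AECB23433801@speechrecog"))
```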
Accept
Active-Request-Id-List
In a request, this header lists the request-ids that the request applies to. In a response, it lists the request-ids that the method affected.
active-request-id-list = "Active-Request-Id-List" ":" request-id *("," request-id) CRLF
Proxy-Sync-Id
When a server resource generates a "barge-in-able" event, it also generates a unique tag. The tag is placed in this header of the event sent to the client.
proxy-sync-id = "Proxy-Sync-Id" ":" 1*VCHAR CRLF
Accept-Charset
Specifies in a request the character sets that are acceptable for the response or events.
For example, it can specify the character set to use for the Natural Language Semantic Markup Language (NLSML) results of a RECOGNITION-COMPLETE event.
Content-Type
MRCPv2 content supports a limited set of media types, such as speech markup, grammars, and recognition results
content-type = "Content-Type" ":" media-type-value CRLF
media-type-value = type "/" subtype *( ";" parameter )
type = token
subtype = token
parameter = attribute "=" value
attribute = token
value = token / quoted-string
Content-ID
An ID or name by which the content can be referenced
Content-Base
Specifies the base URI against which relative URIs in the message body are resolved
content-base = "Content-Base" ":" absoluteURI CRLF
Content-Encoding
Additional information that modifies the Content-Type, for example
Content-Encoding:gzip
content-encoding = "Content-Encoding" ":" *WSP content-coding *(*WSP "," *WSP content-coding *WSP ) CRLF
Content-Location
content-location = "Content-Location" ":" ( absoluteURI / relativeURI ) CRLF
Content-Length
Length of the message body in octets
content-length = "Content-Length" ":" 1*19DIGIT CRLF
Fetch Timeout
Defines the timeout for the server to fetch documents or other resources over the network when the recognizer or synthesizer needs them
fetch-timeout = "Fetch-Timeout" ":" 1*19DIGIT CRLF
Cache-Control
If the server supports content caching, it caches according to the rules of HTTP 1.1
cache-control = "Cache-Control" ":" [*WSP cache-directive *( *WSP "," *WSP cache-directive *WSP )] CRLF
cache-directive = "max-age" "=" delta-seconds
/ "max-stale" [ "=" delta-seconds ]
/ "min-fresh" "=" delta-seconds
delta-seconds = 1*19DIGIT
Logging-Tag
A header for the SET-PARAMS/GET-PARAMS methods that sets or retrieves the logging tag attached to the logs the server generates
logging-tag = "Logging-Tag" ":" 1*UTFCHAR CRLF
Set-Cookie
Similar to HTTP cookies; lets the server store cookie values on the client
Vendor-Specific Parameters
ex:
com.example.companyA.paramxyz=256
com.example.companyA.paramabc=High
com.example.companyB.paramxyz=Low
Generic Result Structure
The result data produced by Recognizer and Verifier resources is delivered in Natural Language Semantics Markup Language (NLSML) format
ex:
Content-Type:application/nlsml+xml
Content-Length:...
<?xml version="1.0"?>
<result xmlns="urn:ietf:params:xml:ns:mrcpv2"
xmlns:ex="http://www.example.com/example"
grammar="http://theYesNoGrammar">
<interpretation>
<instance>
<ex:response>yes</ex:response>
</instance>
<input>OK</input>
</interpretation>
</result>
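A client typically needs to pull the matched grammar, the spoken input, and the interpreted instance out of such a result. The Python sketch below does this for the example above with the standard library's xml.etree.ElementTree. `parse_nlsml` is a hypothetical helper, and flattening the instance subtree to plain text is a simplification that only suits small results like this one.

```python
import xml.etree.ElementTree as ET

# Namespace of the NLSML elements, taken from the example above.
NS = {"nlsml": "urn:ietf:params:xml:ns:mrcpv2"}


def parse_nlsml(xml_text: str):
    """Return (grammar, interpretations) extracted from an NLSML result."""
    root = ET.fromstring(xml_text)
    interpretations = []
    for interp in root.findall("nlsml:interpretation", NS):
        instance = interp.find("nlsml:instance", NS)
        spoken = interp.find("nlsml:input", NS)
        interpretations.append({
            # Flatten the instance subtree to its text content.
            "instance": ("".join(instance.itertext()).strip()
                         if instance is not None else None),
            "input": ((spoken.text or "").strip()
                      if spoken is not None else None),
        })
    return root.get("grammar"), interpretations


body = """<?xml version="1.0"?>
<result xmlns="urn:ietf:params:xml:ns:mrcpv2"
        xmlns:ex="http://www.example.com/example"
        grammar="http://theYesNoGrammar">
  <interpretation>
    <instance>
      <ex:response>yes</ex:response>
    </instance>
    <input>OK</input>
  </interpretation>
</result>"""

grammar, interpretations = parse_nlsml(body)
print(grammar, interpretations)
```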
Resource Discovery
A client queries the server's capabilities with SIP OPTIONS.
The server must answer with SDP describing its capabilities, including the media type and transport type (m=application 0 TCP/TLS/MRCPv2 1) and the resources (a=resource:speechsynth).
ex:
C->S:
OPTIONS sip:mrcp@server.example.com SIP/2.0
Via:SIP/2.0/TCP client.atlanta.example.com:5060;
branch=z9hG4bK74bf7
Max-Forwards:6
To:<sip:mrcp@example.com>
From:Sarvi <sip:sarvi@example.com>;tag=1928301774
Call-ID:a84b4c76e66710
CSeq:63104 OPTIONS
Contact:<sip:sarvi@client.example.com>
Accept:application/sdp
Content-Length:0
S->C:
SIP/2.0 200 OK
Via:SIP/2.0/TCP client.atlanta.example.com:5060;
branch=z9hG4bK74bf7;received=192.0.32.10
To:<sip:mrcp@example.com>;tag=62784
From:Sarvi <sip:sarvi@example.com>;tag=1928301774
Call-ID:a84b4c76e66710
CSeq:63104 OPTIONS
Contact:<sip:mrcp@server.example.com>
Allow:INVITE, ACK, CANCEL, OPTIONS, BYE
Accept:application/sdp
Accept-Encoding:gzip
Accept-Language:en
Supported:foo
Content-Type:application/sdp
Content-Length:...
v=0
o=sarvi 2890844536 2890842811 IN IP4 192.0.2.12
s=-
i=MRCPv2 server capabilities
c=IN IP4 192.0.2.12/127
t=0 0
m=application 0 TCP/TLS/MRCPv2 1
a=resource:speechsynth
a=resource:speechrecog
a=resource:speakverify
m=audio 0 RTP/AVP 0 3
a=rtpmap:0 PCMU/8000
a=rtpmap:3 GSM/8000
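To act on such a response, a client can collect the a=resource attributes listed under the MRCPv2 m-line. The Python sketch below is a minimal way to do that, assuming the capability SDP has the shape shown above; `advertised_resources` is a hypothetical helper name.

```python
def advertised_resources(sdp: str):
    """List the a=resource values under MRCPv2 m-lines of a capability SDP."""
    resources = []
    in_mrcp_section = False
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            fields = line[2:].split()
            in_mrcp_section = (len(fields) >= 3
                               and fields[0] == "application"
                               and fields[2].endswith("/MRCPv2"))
        elif in_mrcp_section and line.startswith("a=resource:"):
            resources.append(line.split(":", 1)[1])
    return resources


capabilities = """v=0
o=sarvi 2890844536 2890842811 IN IP4 192.0.2.12
s=-
i=MRCPv2 server capabilities
c=IN IP4 192.0.2.12/127
t=0 0
m=application 0 TCP/TLS/MRCPv2 1
a=resource:speechsynth
a=resource:speechrecog
a=resource:speakverify
m=audio 0 RTP/AVP 0 3
a=rtpmap:0 PCMU/8000
a=rtpmap:3 GSM/8000
"""

print(advertised_resources(capabilities))
```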
Speech Synthesizer Resource
The client sends text markup and the server generates an audio stream in real time. Synthesis parameters such as voice characteristics and speaking rate can be specified.
There are two types: speechsynth and basicsynth
Synthesizer State Machine
A pending SPEAK request can be deleted or stopped
Idle                     Speaking                   Paused
State                    State                      State
|                        |                          |
|----------SPEAK-------->|                 |--------|
|<------STOP-------------|             CONTROL      |
|<----SPEAK-COMPLETE-----|                 |------->|
|<----BARGE-IN-OCCURRED--|                          |
|              |---------|                          |
|          CONTROL       |-----------PAUSE--------->|
|              |-------->|<----------RESUME---------|
|                        |               |----------|
|----------|             |              PAUSE       |
|    BARGE-IN-OCCURRED   |               |--------->|
|<---------|             |----------|               |
|                        |      SPEECH-MARKER       |
|                        |<---------|               |
|----------|             |----------|               |
|         STOP           |       RESUME             |
|          |             |<---------|               |
|<---------|             |                          |
|<---------------------STOP-------------------------|
|----------|             |                          |
|     DEFINE-LEXICON     |                          |
|          |             |                          |
|<---------|             |                          |
|<---------------BARGE-IN-OCCURRED------------------|
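The arrows in the diagram can be written down as a transition table. The Python sketch below encodes only the transitions drawn above, covering both methods and events. The lowercase state names and the `step` and `run` helpers are hypothetical, and behavior the diagram does not show (for example, queuing a SPEAK request while another is in progress) is deliberately left out.

```python
# (current state, method or event) -> next state, read off the diagram.
TRANSITIONS = {
    ("idle", "SPEAK"): "speaking",
    ("idle", "STOP"): "idle",
    ("idle", "BARGE-IN-OCCURRED"): "idle",
    ("idle", "DEFINE-LEXICON"): "idle",
    ("speaking", "STOP"): "idle",
    ("speaking", "SPEAK-COMPLETE"): "idle",
    ("speaking", "BARGE-IN-OCCURRED"): "idle",
    ("speaking", "CONTROL"): "speaking",
    ("speaking", "SPEECH-MARKER"): "speaking",
    ("speaking", "RESUME"): "speaking",
    ("speaking", "PAUSE"): "paused",
    ("paused", "CONTROL"): "paused",
    ("paused", "PAUSE"): "paused",
    ("paused", "RESUME"): "speaking",
    ("paused", "STOP"): "idle",
    ("paused", "BARGE-IN-OCCURRED"): "idle",
}


def step(state: str, trigger: str) -> str:
    """Return the next state, or raise if the diagram has no such arrow."""
    try:
        return TRANSITIONS[(state, trigger)]
    except KeyError:
        raise ValueError(f"{trigger} is not shown for state {state!r}") from None


def run(triggers, state: str = "idle") -> str:
    """Apply a sequence of methods/events starting from the given state."""
    for trigger in triggers:
        state = step(state, trigger)
    return state


print(run(["SPEAK", "PAUSE", "RESUME", "SPEAK-COMPLETE"]))
```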
Synthesizer Methods
synthesizer-method = "SPEAK"
/ "STOP"
/ "PAUSE"
/ "RESUME"
/ "BARGE-IN-OCCURRED"
/ "CONTROL"
/ "DEFINE-LEXICON"
Synthesizer Events
synthesizer-event = "SPEECH-MARKER"
/ "SPEAK-COMPLETE"
Synthesizer Header Fields
synthesizer-header = jump-size
/ kill-on-barge-in
/ speaker-profile
/ completion-cause
/ completion-reason
/ voice-parameter
/ prosody-parameter
/ speech-marker
/ speech-language
/ fetch-hint
/ audio-fetch-hint
/ failed-uri
/ failed-uri-cause
/ speak-restart
/ speak-length
/ load-lexicon
/ lexicon-search-order
Example:
The text is synthesized and played into the media stream. The resource returns an IN-PROGRESS response, followed by a SPEAK-COMPLETE event
C->S: MRCP/2.0 ... SPEAK 543257
Channel-Identifier:32AECB23433802@speechsynth
Voice-gender:neutral
Voice-Age:25
Prosody-volume:medium
Content-Type:application/ssml+xml
Content-Length:...
<?xml version="1.0"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<p>
<s>You have 4 new messages.</s>
<s>The first is from Stephanie Williams and arrived at
<break/>
<say-as interpret-as="vxml:time">0342p</say-as>.
</s>
<s>The subject is
<prosody rate="-20%">ski trip</prosody>
</s>
</p>
</speak>
S->C: MRCP/2.0 ... 543257 200 IN-PROGRESS
Channel-Identifier:32AECB23433802@speechsynth
Speech-Marker:timestamp=857206027059
S->C: MRCP/2.0 ... SPEAK-COMPLETE 543257 COMPLETE
Channel-Identifier:32AECB23433802@speechsynth
Completion-Cause:000 normal
Speech-Marker:timestamp=857206027059
Speech Recognizer Resource
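Receives a voice stream from the client and converts it to text
There are two types: speechrecog and dtmfrecog
A recognizer resource has the following capabilities:
Normal Mode Recognition: tries to match the entire speech segment or DTMF sequence against the grammar
Hotword Mode Recognition
Looks for a specific speech grammar or DTMF sequence within the input
Voice Enrolled Grammars
(optional) Enrollment builds grammars from a person's own voice. The server maintains a list of contacts with each person's name and voice. This technique is also called speaker-dependent recognition
Interpretation
natural language interpretation
Takes text as input and produces the interpretation of that text according to the grammar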
Recognizer State Machine
Idle                    Recognizing                Recognized
State                   State                      State
|                       |                          |
|---------RECOGNIZE---->|---RECOGNITION-COMPLETE-->|
|<------STOP------------|<-----RECOGNIZE-----------|
|                       |                          |
|              |--------|              |-----------|
|       START-OF-INPUT  |       GET-RESULT         |
|              |------->|              |---------->|
|------------|          |                          |
|      DEFINE-GRAMMAR   |----------|               |
|<-----------|          | START-INPUT-TIMERS       |
|                       |<---------|               |
|------|                |                          |
|  INTERPRET            |                          |
|<-----|                |------|                   |
|                       |   RECOGNIZE              |
|-------|               |<-----|                   |
|      STOP                                        |
|<------|                                          |
|<-------------------STOP--------------------------|
|<-------------------DEFINE-GRAMMAR----------------|
Recognizer Methods
recognizer-method = recog-only-method
/ enrollment-method
recog-only-method = "DEFINE-GRAMMAR"
/ "RECOGNIZE"
/ "INTERPRET"
/ "GET-RESULT"
/ "START-INPUT-TIMERS"
/ "STOP"
enrollment-method = "START-PHRASE-ENROLLMENT"
/ "ENROLLMENT-ROLLBACK"
/ "END-PHRASE-ENROLLMENT"
/ "MODIFY-PHRASE"
/ "DELETE-PHRASE"
Recognizer Events
recognizer-event = "START-OF-INPUT"
/ "RECOGNITION-COMPLETE"
/ "INTERPRETATION-COMPLETE"
Recognizer Header Fields
recognizer-header = recog-only-header
/ enrollment-header
recog-only-header = confidence-threshold
/ sensitivity-level
/ speed-vs-accuracy
/ n-best-list-length
/ no-input-timeout
/ input-type
/ recognition-timeout
/ waveform-uri
/ input-waveform-uri
/ completion-cause
/ completion-reason
/ recognizer-context-block
/ start-input-timers
/ speech-complete-timeout
/ speech-incomplete-timeout
/ dtmf-interdigit-timeout
/ dtmf-term-timeout
/ dtmf-term-char
/ failed-uri
/ failed-uri-cause
/ save-waveform
/ media-type
/ new-audio-channel
/ speech-language
/ ver-buffer-utterance
/ recognition-mode
/ cancel-if-queue
/ hotword-max-duration
/ hotword-min-duration
/ interpret-text
/ dtmf-buffer-time
/ clear-dtmf-buffer
/ early-no-match
enrollment-header = num-min-consistent-pronunciations
/ consistency-threshold
/ clash-threshold
/ personal-grammar-uri
/ enroll-utterance
/ phrase-id
/ phrase-nl
/ weight
/ save-best-waveform
/ new-phrase-id
/ confusable-phrases-uri
/ abort-phrase-enrollment
Example
C->S:MRCP/2.0 ... RECOGNIZE 543257
Channel-Identifier:32AECB23433801@speechrecog
Confidence-Threshold:0.9
Content-Type:application/srgs+xml
Content-ID:<request1@form-level.store>
Content-Length:...
<?xml version="1.0"?>
<!-- the default grammar language is US English -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
xml:lang="en-US" version="1.0" root="request">
<!-- single language attachment to tokens -->
<rule id="yes">
<one-of>
<item xml:lang="fr-CA">oui</item>
<item xml:lang="en-US">yes</item>
</one-of>
</rule>
<!-- single language attachment to a rule expansion -->
<rule id="request">
may I speak to
<one-of xml:lang="fr-CA">
<item>Michel Tremblay</item>
<item>Andre Roy</item>
</one-of>
</rule>
</grammar>
S->C: MRCP/2.0 ... 543257 200 IN-PROGRESS
Channel-Identifier:32AECB23433801@speechrecog
S->C:MRCP/2.0 ... START-OF-INPUT 543257 IN-PROGRESS
Channel-Identifier:32AECB23433801@speechrecog
S->C:MRCP/2.0 ... RECOGNITION-COMPLETE 543257 COMPLETE
Channel-Identifier:32AECB23433801@speechrecog
Completion-Cause:000 success
Waveform-URI:<http://web.media.com/session123/audio.wav>;
size=424252;duration=2543
Content-Type:application/nlsml+xml
Content-Length:...
<?xml version="1.0"?>
<result xmlns="urn:ietf:params:xml:ns:mrcpv2"
xmlns:ex="http://www.example.com/example"
grammar="session:request1@form-level.store">
<interpretation>
<instance name="Person">
<ex:Person>
<ex:Name> Andre Roy </ex:Name>
</ex:Person>
</instance>
<input> may I speak to Andre Roy </input>
</interpretation>
</result>
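When a completion event such as RECOGNITION-COMPLETE arrives, the first thing a client usually inspects is the Completion-Cause header. The Python sketch below splits it into its numeric code and name, assuming the "three digits, space, name" shape seen in every example in this article; `parse_completion_cause` is a hypothetical helper.

```python
def parse_completion_cause(value: str):
    """Split a Completion-Cause value such as '000 success' into (code, name)."""
    code, _, name = value.strip().partition(" ")
    if len(code) != 3 or not code.isdigit() or not name.strip():
        raise ValueError(f"malformed Completion-Cause: {value!r}")
    return int(code), name.strip()


# Values taken from the examples in this article.
for header_value in ("000 success", "000 normal", "000 success-silence"):
    print(parse_completion_cause(header_value))
```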
Recorder Resource
Stores the received audio/video at the specified URI
Recorder State Machine
Idle                    Recording
State                   State
|                       |
|---------RECORD------->|
|                       |
|<------STOP------------|
|                       |
|<--RECORD-COMPLETE-----|
|                       |
|              |--------|
|       START-OF-INPUT  |
|              |------->|
|                       |
|              |--------|
|    START-INPUT-TIMERS |
|              |------->|
|                       |
Recorder Methods
recorder-method = "RECORD"
/ "STOP"
/ "START-INPUT-TIMERS"
Recorder Events
recorder-event = "START-OF-INPUT"
/ "RECORD-COMPLETE"
Recorder Header Fields
recorder-header = sensitivity-level
/ no-input-timeout
/ completion-cause
/ completion-reason
/ failed-uri
/ failed-uri-cause
/ record-uri
/ media-type
/ max-time
/ trim-length
/ final-silence
/ capture-on-speech
/ ver-buffer-utterance
/ start-input-timers
/ new-audio-channel
example
C->S: MRCP/2.0 ... RECORD 543257
Channel-Identifier:32AECB23433802@recorder
Record-URI:<file://mediaserver/recordings/myfile.wav>
Media-Type:audio/wav
Capture-On-Speech:true
Final-Silence:300
Max-Time:6000
S->C: MRCP/2.0 ... 543257 200 IN-PROGRESS
Channel-Identifier:32AECB23433802@recorder
S->C: MRCP/2.0 ... START-OF-INPUT 543257 IN-PROGRESS
Channel-Identifier:32AECB23433802@recorder
S->C: MRCP/2.0 ... RECORD-COMPLETE 543257 COMPLETE
Channel-Identifier:32AECB23433802@recorder
Completion-Cause:000 success-silence
Record-URI:<file://mediaserver/recordings/myfile.wav>;
size=242552;duration=25645
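The Record-URI header of the RECORD-COMPLETE event carries the recording location in angle brackets, followed by size and duration parameters. The Python sketch below parses that shape after the header has been unfolded onto one line. `parse_uri_header` is a hypothetical helper, and treating every parameter as an integer is an assumption that holds for the size and duration values shown here.

```python
def parse_uri_header(value: str):
    """Parse '<uri>;key=int;key=int' into (uri, {key: int})."""
    value = value.strip()
    if not value.startswith("<") or ">" not in value:
        raise ValueError(f"expected a URI in angle brackets: {value!r}")
    uri, _, rest = value[1:].partition(">")
    params = {}
    for item in rest.split(";"):
        item = item.strip()
        if item:
            key, _, number = item.partition("=")
            params[key.strip()] = int(number)
    return uri, params


record_uri = ("<file://mediaserver/recordings/myfile.wav>; "
              "size=242552;duration=25645")
print(parse_uri_header(record_uri))
```

The same shape appears in the Waveform-URI header of the recognizer example, so the helper can be reused there.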
Speaker Verification and Identification
Determines the identity of a speaker
Speaker Verification State Machine
Idle                Session Opened            Verifying/Training
State               State                     State
|                   |                         |
|--START-SESSION--->|                         |
|                   |                         |
|                   |----------|              |
|                   |     START-SESSION       |
|                   |<---------|              |
|                   |                         |
|<--END-SESSION-----|                         |
|                   |                         |
|                   |---------VERIFY--------->|
|                   |                         |
|                   |---VERIFY-FROM-BUFFER--->|
|                   |                         |
|                   |----------|              |
|                   |  VERIFY-ROLLBACK        |
|                   |<---------|              |
|                   |                         |
|                   |                |--------|
|                   | GET-INTERMEDIATE-RESULT |
|                   |                |------->|
|                   |                         |
|                   |                |--------|
|                   |    START-INPUT-TIMERS   |
|                   |                |------->|
|                   |                         |
|                   |                |--------|
|                   |      START-OF-INPUT     |
|                   |                |------->|
|                   |                         |
|                   |<-VERIFICATION-COMPLETE--|
|                   |                         |
|                   |<--------STOP------------|
|                   |                         |
|                   |----------|              |
|                   |         STOP            |
|                   |<---------|              |
|                   |                         |
|----------|        |                         |
|         STOP      |                         |
|<---------|        |                         |
|                   |----------|              |
|                   |     CLEAR-BUFFER        |
|                   |<---------|              |
|                   |                         |
|----------|        |                         |
|    CLEAR-BUFFER   |                         |
|<---------|        |                         |
|                   |                         |
|                   |----------|              |
|                   |    QUERY-VOICEPRINT     |
|                   |<---------|              |
|                   |                         |
|----------|        |                         |
|  QUERY-VOICEPRINT |                         |
|<---------|        |                         |
|                   |                         |
|                   |----------|              |
|                   |    DELETE-VOICEPRINT    |
|                   |<---------|              |
|                   |                         |
|----------|        |                         |
| DELETE-VOICEPRINT |                         |
|<---------|        |                         |
Speaker Verification Methods
verifier-method = "START-SESSION"
/ "END-SESSION"
/ "QUERY-VOICEPRINT"
/ "DELETE-VOICEPRINT"
/ "VERIFY"
/ "VERIFY-FROM-BUFFER"
/ "VERIFY-ROLLBACK"
/ "STOP"
/ "CLEAR-BUFFER"
/ "START-INPUT-TIMERS"
/ "GET-INTERMEDIATE-RESULT"
Verification Events
verifier-event = "VERIFICATION-COMPLETE"
/ "START-OF-INPUT"
Verification Header Fields
verification-header = repository-uri
/ voiceprint-identifier
/ verification-mode
/ adapt-model
/ abort-model
/ min-verification-score
/ num-min-verification-phrases
/ num-max-verification-phrases
/ no-input-timeout
/ save-waveform
/ media-type
/ waveform-uri
/ voiceprint-exists
/ ver-buffer-utterance
/ input-waveform-uri
/ completion-cause
/ completion-reason
/ speech-complete-timeout
/ new-audio-channel
/ abort-verification
/ start-input-timers