Turbolift: An application server for voice powered computing
Version 2.0, March 2005.

Information: http://hobbiton.thisside.net/turbolift
Builds: http://hobbiton.thisside.net/turbolift/builds
CVSWeb: http://hobbiton.thisside.net/cgi-bin/cvsweb.cgi/turbolift

(C) 2001-2005 Rupert Scammell
GNU General Public License

Hacking Turbolift - The Turbolift Development Guide
---------------------------------------------------

1. Architectural overview

The Turbolift application is built on a server/client model, using TCP sockets for communication between components. The server component (select_ports.py) acts as a router, moving messages between connected clients. No actual processing of data is done on the server side.

Clients connected to the server can take two forms:

- Provider modules
- Consumer modules

Provider modules provide some sort of service within the application to consumer modules. This service can take many forms. In this version of Turbolift, the following provider modules exist:

- lcd_module.py --- Provides LCD screen display service
- speechio_module.py --- Provides speech input/output service

Consumer modules take advantage of the services in provider modules, but usually do not provide a service that other modules can directly take advantage of. For instance, the mp3_module.py module, which allows MP3 selection and play in this version, uses the services of the two provider modules above in order to recognize commands, synthesize speech, and display song names and information on the LCD.

See Docs/turbolift_sc.jpg for a schematic diagram of how modules are connected to one another.

2. Communication

Every module within the system is assigned a module name, which is specified in the Config/alice.config file, within the module registration section (module_reg). This module name is associated with a port number that the module uses for communication with the server.
When a module wishes to send a command to another module, the command sent from the requesting module takes the form:

    destination_module_name: data_for_module

Where destination_module_name represents the name of the module to which the data, data_for_module, is to be passed.

When the event distribution server (EDS) receives this string, it separates the "destination_module_name: " portion of the string, and uses the module name to locate the appropriate destination socket. The server then sends the data_for_module string to this socket. Assuming that the module requested is actually listening and connected, the received data is processed by the destination module, which takes some sort of action on the data, such as speaking it, displaying output on an LCD, or sending a response back to another module connected to the EDS.

If the request cannot be sent for some reason (module not listening, module name not recognized, etc.), the EDS writes an error to the log and console (if so configured). As of this version, however, no error message is sent back to the requesting module.

3. Adding a new module

Adding a new module is easy. ALICE dynamically registers modules at runtime, so all that's required is the addition of a line to Config/alice.config under the section header [module_reg]. In version 1.00b, this section looks like:

    # Module registration section
    # Add entries below to register new modules.
    # Format is module_name = listen_port
    [module_reg]
    lcd_module = 8555
    mp3_module = 8556
    greet_module = 8557
    speechio_module = 8558
    diagnostic_port = 8559

Lines beneath the section header take the form module_name = port_value, where module_name represents the canonical name by which the module will be referenced by all other components, and port_value represents the port that will be opened to allow a connection between the event distribution server and your module.
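The message format from Section 2 and the [module_reg] section shown above can be tied together with a short sketch. This is not the code in select_ports.py; it assumes Python 3's standard configparser module, and the names load_module_ports and split_event are hypothetical helpers invented for illustration. It only shows how a "destination_module_name: data_for_module" string could be split and mapped to a registered port.

```python
import configparser

# Sample text copied from the [module_reg] section shown in this guide.
SAMPLE_CONFIG = """
# Module registration section
[module_reg]
lcd_module = 8555
mp3_module = 8556
greet_module = 8557
speechio_module = 8558
diagnostic_port = 8559
"""


def load_module_ports(config_text):
    """Return a {module_name: port} dict from the [module_reg] section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return {name: int(port) for name, port in parser["module_reg"].items()}


def split_event(event):
    """Split 'destination_module_name: data_for_module' at the first colon.

    Only the first colon is significant, so the data part may itself
    contain colons (e.g. 'speech_out: hello').
    """
    destination, sep, data = event.partition(":")
    if not sep:
        raise ValueError("event has no 'module_name:' prefix: %r" % event)
    return destination.strip(), data.lstrip()


ports = load_module_ports(SAMPLE_CONFIG)
destination, data = split_event("speechio_module: speech_out: hello")
port = ports[destination]  # a KeyError here means the module is not registered
print(destination, port, repr(data))
```

Splitting at the first colon only is a design choice of this sketch: it lets events such as "speechio_module: speech_out: stext" pass through with their inner command intact.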
After adding the line, you can confirm that the module entry has been registered correctly by looking at the initialization output of the event distribution server, which by default looks like this:

    init: registering module lcd_module on port 8555
    init: registering module mp3_module on port 8556
    init: registering module diagnostic_port on port 8559
    init: registering module speechio_module on port 8558
    init: registering module greet_module on port 8557
    init: Module registration completed successfully.

Your module will generate an additional 'registering module' line in the section of log shown above.

4. Writing a new module

The minimum requirement for a connected Turbolift module is that it be capable of connecting to the server on its assigned port, and listening for data that's passed to it by the server. However, most Turbolift modules do quite a bit more. A typical module takes the following steps when started:

a) Loads information specific to the module from Config/alice.config
b) Uses info from step (a) to connect to the server.
c) Opens handles to data files, or secondary servers (e.g. Festival)
d) Sends any initial startup data to other modules.
e) Listens for data from server.
f) When data arrives, enters an event loop, processes data, then returns to step (e).

If the module is being written in the Python language, look at other modules for examples of how to do all of this.

It's also recommended that your module implement the 'ping' event, which allows other modules to query whether a particular module is listening, and receive a 'pong' response back, along with the name of your module. See Sec. 6, 'Command Reference' below for details.

For a simple example of a module, see Docs/stub_module.py. This sample module connects by default to port 8559 (diagnostic_port), and implements several trivial events. Note that it doesn't have many of the nice features of more formal modules (such as using Config/alice.config to obtain configuration information, etc.).
It's still a good starting point, however.

5. Getting Turbolift to recognize custom speech commands

CMU Sphinx II, the speech recognition system that Turbolift uses, is a limited domain, speaker independent system. That is, it doesn't work in the same way as commercial apps like Dragon Dictate, or IBM ViaVoice. Sphinx II only recognizes words that are provided to it within the context of a set of config files called a language model (LM).

Language models are generated by passing a list of phrases and vocabulary words to a language model generator. The generator is either a local application, named SimpleLM, or an online generator, QuickLM, which returns a .tar.gz archive of the necessary files.

SimpleLM can be downloaded from: http://sourceforge.net/projects/cmusphinx .
The QuickLM language model generator is located at: http://www.speech.cs.cmu/tools/lmtool.html

Once you have the necessary words and phrases incorporated into your language model, it's necessary to provide Turbolift with command bindings for the spoken commands. Command bindings are contained in the Config/alice.bind file. Lines within the file take the form:

    matching_regex_str ; bound_command [(1) (2)... (n)]

matching_regex_str represents a regular expression string that matches a given speech input. A word of caution here: more specific regular expression matches should always be placed above less specific ones, since the file is read at runtime into a list structure that's scanned from 0 -> (length of list) each time a command in the hash table gets a positive match. This matching_regex_str string must also be placed in the Config/alice.regex file, which contains a list of regular expressions to use in building keys for the speech data hash table (pickled within Config/speech_data.dat) the first time the .dat file is created.

bound_command represents the event that the positive regular expression match against the given speech input generates.
It takes the form described in Section 2, "Communication", above, with one significant addition. If the regular expression contains numbered subgroups (see http://www.python.org/doc/current/lib/module-re.html for more info), optional, space-separated parameters of the form " (n) ", where n is the corresponding subgroup number within the matching_regex_str regular expression string, may be added as needed. The square brackets shown in the notation above mark these subgroup parameters as optional, and are not part of the actual syntax.

speechio_module holds state for the original speech text, so when the appropriate event template (an event containing subgroup value references, as above) is returned, speechio_module takes the matching_regex_str and calls re.compile(matching_regex_str) on it to generate a temporary regular expression object, temp_regex_ob. This object is matched against the speech text, and .groups() is called on the resulting match, which returns a tuple of subgroup strings whose subscripts within the tuple are 1 less than the numbers of the original subgroups (the special case 0 subgroup is omitted from the tuple). string.replace() is then called on the bound_command [(1) (2)... (n)] string, once for each parameter specified in the bound_command string. The " (n) " substrings are replaced with the actual data contained within the numbered subgroups, and the finished event is passed to the EDS.

An example is definitely called for here.

a) Consider the following alice.bind entry:

    ^GO TO TRACK (.+)$ ; mp3_module: go_to_song (1)

b) Speech text is received:

    GO TO TRACK FOUR

c) speechio_module (via speechrule) positively matches the input speech text in Step b with the regex string in Step a.

d) speechio_module gets back the alice.bind string in Step a. A temporary regular expression object gets compiled with the regex string, and the base event template string is stored.
e) The regex is run against the 'GO TO TRACK FOUR' string, and a call of .groups() on the resulting match object returns the tuple ('FOUR',).

f) A for loop looks through the base event template string for instances of '(1)' to be replaced. It finds a match, so '(1)' gets replaced with 'FOUR'.

g) No more items have been specified for insertion within the base event template string, so the for loop exits, and returns the finished event string:

    mp3_module: go_to_song FOUR

h) This event string gets sent to the EDS, which routes it appropriately.

6. Command Reference

Some knowledge of the standard commands available within each of Turbolift's provider modules is essential to writing a useful application. An attempt is made below to summarize the major commands that exist within each service, along with brief commentary on each.

loader_module - Client module loader
------------------------------------

loader_module: quit
Closes the socket connection to the EDS, and exits the loader module. Since any modules started using the start_module event below run as child processes of the loader, execution of this event will also cause immediate termination of those modules.

loader_module: start_module modname
Starts up a module named modname, where modname represents the name of the module to load. The module file to load is determined by concatenating modname + '.py'. The assumption is made that the module resides in the same directory as the loader_module; however, this could potentially be overridden by prepending an absolute pathname to the actual module name.

loader_module: display_startup_banner
Displays a startup screen on the connected LCD. The startup screen is configured within the loader_module section of the Config/alice.config file.

loader_module: stop_startup_banner
Clears the startup banner from the screen, and stops any running clock displays.
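As a small illustration of the start_module file lookup described above, the sketch below shows one way the modname + '.py' rule and the absolute-path override could behave. It is not taken from loader_module itself; the function name module_file_for and the directory /opt/turbolift are hypothetical, and POSIX-style paths are assumed.

```python
import os.path


def module_file_for(modname, loader_dir):
    """Return the file start_module would load, under this sketch's assumptions.

    The file name is modname + '.py'. os.path.join() discards loader_dir
    when the second argument is an absolute path, which gives the
    'prepend an absolute pathname' override for free.
    """
    return os.path.join(loader_dir, modname + ".py")


# loader_dir stands in for the directory that holds loader_module (hypothetical).
default_path = module_file_for("mp3_module", "/opt/turbolift")
override_path = module_file_for("/home/rupe/dev/test_module", "/opt/turbolift")
print(default_path)
print(override_path)
```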
lcd_module - LCD Display Services
---------------------------------

lcd_module: quit
Closes the socket connection to the EDS, and exits the module.

lcd_module: ping [ref_mod]
Returns a 'pong [lcd_module]' response to the module named in the ref_mod string. This is useful for other modules that want to positively determine that this module is operational and responding to input events.

lcd_module: crlf
LCD carriage return / linefeed combination.

lcd_module: cls
Clear all text from the LCD screen, and return cursor to position 0,0 (top left character).

lcd_module: out 'foobar'
Print text string foobar to LCD at current cursor position.

lcd_module: hide_disp
Hide the contents of the LCD screen (this is a non-destructive operation, unlike cls above).

lcd_module: restore_disp
Show the contents of the LCD display. This command reverses the hiding performed by hide_disp, above. If the display isn't hidden, it has no effect.

lcd_module: uline_cursor
Use an underline style cursor.

lcd_module: block_cursor
Use a block style cursor.

lcd_module: invert_block_cursor
Use an inverted block style cursor.

lcd_module: bs
Send backspace character to LCD.

lcd_module: lf
Send linefeed character to LCD.

lcd_module: del
Send destructive delete character to LCD.

lcd_module: cr
Send carriage return character to LCD.

lcd_module: cursor x y
Set LCD cursor position to column x, row y. x = 0 - 19, y = 0 - 3. (0,0) is top left. x and y are integers.

lcd_module: backlight_on
Turn on the LCD backlight.

lcd_module: backlight_off
Turn off the LCD backlight.

lcd_module: set_contrast x
Set LCD contrast to x (integer).

lcd_module: contrastup
Incrementally raise LCD contrast.

lcd_module: contrastdown
Incrementally decrease LCD contrast.

lcd_module: backlightup
Incrementally raise backlight brightness.

lcd_module: backlightdown
Incrementally decrease backlight brightness.

lcd_module: set_marq 'foobar'
Set the marquee buffer to text string foobar.
lcd_module: marq_on
Display the marquee text set with set_marq, above.

lcd_module: marq_off
Hide the marquee text.

N.B. Marquee text with the CrystalFontz is an interesting kettle of fish. The marquee buffer string (set using set_marq) can contain a maximum of 20 chars, which is the width of a single line on the CF 632/634 LCD. However, it's possible to create a 40 character scrolling marquee by writing the first 20 chars of the desired string statically to the line you want to display the marquee on, and the second 20 characters into the actual marquee buffer via set_marq. Then call marq_on and scroll_on in order to activate the marquee. This is how ALICE's MP3 module displays 20+ character song names in the line 0 scrolling marquee.

lcd_module: scroll_on
Scroll the created marquee.

lcd_module: scroll_off
Stop scrolling the created marquee.

lcd_module: wrap_on
Turn on line wrap.

lcd_module: wrap_off
Turn off line wrap.

lcd_module: reboot
Reboot the firmware of the LCD screen. This is very rarely needed, but returns the screen to a known state.

lcd_module: up
Move cursor up relative to current position.

lcd_module: down
Move cursor down relative to current position.

lcd_module: left
Move cursor left relative to current position.

lcd_module: right
Move cursor right relative to current position.

lcd_module: cline ln lt
Display a line of centered text (lt) on line ln.

lcd_module: hbar graph_index style start_col end_col length row
Display a horizontal bar graph. Parameters are:

graph_index: Custom characters to use. 0 is default.

style: Bit pattern to use when drawing the graph. Useful values here are:

    255: Thick bar
    000: Not visible (all pixels off)
    085: Striped bar
    060: Medium width bar (centered)
    015: Medium width bar (low in the row)
    240: Medium width bar (high in the row)

However, this parameter can be any value from 000 - 255, with the MSB at the top of the row, and the LSB at the bottom of the row.
start_col: X coordinate of the column to start the bar graph display on. Value is in characters. Valid values are 00-19.

end_col: X coordinate of the column to end the bar graph display on. Value is in characters. Valid values are 00-19. N.B. start_col < end_col.

length: Length of the horizontal graph. Value is in pixels. Valid values are 000-120.

row: The row to display the horizontal graph on. Value is in rows. Valid values are 00 - 03.

lcd_module: display_sysdate
Display the system date via time.ctime(time.time()) on the LCD. This time value is not updated.

lcd_module: display_epoch
Display time in epoch seconds via time.time() on the LCD. This time value is not updated.

lcd_module: start_display_clock x
Display a date and time on line x of the LCD, updated once a second, using the output of time.ctime(time.time())[:-4], which is a UNIX standard timestamp with the year trimmed away.

lcd_module: stop_display_clock
Stop the display of the clock created using start_display_clock, and clear the LCD screen.

speechio_module - Speech input/output services
----------------------------------------------

speechio_module: quit
Closes socket connections to all servers, and terminates the module.

speechio_module: ping [ref_mod]
Returns a 'pong [speechio_module]' response to the module named in the ref_mod string. This is useful for other modules that want to positively determine that this module is operational and responding to input events.

speechio_module: speech_in: stext
Speech input, stext, is processed using the speechrule module, and gets converted into a command for another module. For instance, stext of "GO TO THE NEXT SONG" gets bound to "mp3_module: go_next_song", which when sent to the EDS, gets passed to mp3_module, which handles the go_next_song event in its event loop. Speech text will generally only arrive in this fashion when manually input, though.
The usual method for speech input to be received is directly via speechio_module's connection to the Sphinx server.

speechio_module: speech_out: stext
This command provides speechio_module with a text string, stext, which is converted into the Scheme statement (SayText "stext"), and sent to the Festival speech synthesis server, which speaks the stext string.

speechio_module: scheme_cmd: command
Send a raw Scheme language command string (command) directly to the Festival server. The Festival server provides a direct interface to a fully featured Scheme interpreter, allowing any valid Scheme statement to be passed here.

speechio_module: soundcheck
The soundcheck event causes a response of 'Yes, I am listening to you.' to be synthesized via the Festival server. The event provides an easy method of testing both the speech input and output functions of speechio_module. In this version, this event is bound to the spoken command 'are you listening to me?'.

mp3_module - MP3 playing and playlist management
------------------------------------------------

mp3_module is technically a consumer module (see definition in Sec. 1, above), but it has its own set of events associated with it, which speechio_module passes in order to execute commands. The following commands are available:

mp3_module: quit
Closes socket connection to the EDS, and terminates the module.

mp3_module: ping [ref_mod]
Returns a 'pong [mp3_module]' response to the module named in the ref_mod string. This is useful for other modules that want to positively determine that this module is operational and responding to input events.

mp3_module: random_mode
Reception of this event causes the randomplay flag in mp3_module to be set to 1. This causes songs within the playlist to subsequently be played in random order. The current mode is displayed as the last field of information on line 3 of the LCD. In random mode, 'rnd' is displayed.
If random play mode is already selected, this event has no effect.

mp3_module: sequential_mode
Reception of this event causes the randomplay flag in mp3_module to be set to 0. This causes songs within the playlist to subsequently be played in sequential order, beginning with whatever song is currently selected within the playlist. The current mode is displayed as the last field of information on line 3 of the LCD. In sequential mode, 'seq' is displayed. If sequential play mode is already selected, this event has no effect.

mp3_module: play_current_song
Play the currently selected song in the playlist, i.e., the one displayed in the marquee on line 0 of the LCD. Playing a song causes any other song playing at the time to be terminated. Songs play asynchronously in their own thread.

mp3_module: go_next_song
Go to the next song in the playlist. Move the new song name into position on line 0 of the LCD. Attempt to extract artist & song length information from the ID3 tag, and print on line 1. Grab the name of the song following this one, and print it on line 2. Update the current_song / total_songs counter on line 3.

mp3_module: go_prev_song
Go to the previous song in the playlist, and once in position, do the same routine as specified in mp3_module: go_next_song, above.

mp3_module: down_volume
Use the mixer application specified to lower the volume by 10 points. After doing this, do a redraw on the LCD screen, and update the volume value (vol: x) on line 3 of the screen.

mp3_module: up_volume
Use the mixer application specified to raise the volume by 10 points. Do a redraw in the same way as mp3_module: down_volume, above.

mp3_module: go_forward_10
Go forward 10 songs in the playlist, and do the same routine as specified in mp3_module: go_next_song, above.

mp3_module: go_back_10
Go back 10 songs in the playlist, and do the same routine as specified in mp3_module: go_next_song, above.

mp3_module: go_first_song
Go to the first song in the playlist (song 0).
Do a redraw as specified by mp3_module: go_next_song, above.

mp3_module: go_last_song
Go to the last song in the playlist. Do a redraw as specified by mp3_module: go_next_song, above.

mp3_module: stop_play
Stop playing the current song. This is accomplished by the simple and brutish method of doing a kill -9 on the running MP3 player application process. If no song is playing when this event is received, no action is taken.

mp3_module: say_song_name
Speak the name of the currently selected song. If a song name could be extracted from the ID3 tag, that information is used. Otherwise, the filename of the song itself is used. The string to be spoken is passed to speechio_module encapsulated within a speechio_module: speech_out: ... event.

mp3_module: dnp_song
Removes the song from the currently loaded list of songs. This event doesn't currently remove the song from the file playlist which is loaded at run time, however. Voice feedback confirming this action is provided.

mp3_module: say_song_artist
Tries to say the artist name of the currently selected song, which is contained within song_info['artist']. If this value is a blank string, is None, or raises a KeyError (usually indicating that the ID3 tag for the song doesn't have the info), the user is informed that the info wasn't available.

mp3_module: mute_volume
Mute the volume by setting the volume level to 0 via a volume(0) call. This mute value is never set in c_vol, allowing the unmute_volume event below to easily restore the volume when requested.

mp3_module: unmute_volume
Unmute the volume, restoring it to the value stored in c_vol. Done via a volume(c_vol) command.

mp3_module: go_to_song x
Go to the x'th song in the playlist, where x represents the location of the song within the playlist loaded into mp3_module's list structure, 'a', at runtime.
mp3_module: set_volume x
Sets the output volume level to x, where x represents a word-number string in the form 'TWO THREE FOUR', expressed in arbitrary units, using the volume() call. The event also causes the current volume global, c_vol, to be updated, and causes a draw_screen() to occur, updating the displayed volume value.

mp3_module: load_playlist x
Loads playlist x as the current playlist, where x represents a word-number string (e.g. 'FIVE', 'ONE SIX', etc.). Each playlist is defined within the mp3_module section of the Config/alice.config file, and takes the form:

    playlist_y_title = Cool Techno Mix
    playlist_y = /home/rupe/mp3/cool_techno_mix

Where y represents the integer conversion of the x word-number string passed in the event. '..._title' is the title of the playlist, and will be spoken when the playlist is loaded. '..._y' is the actual filename and path of the playlist.

*** Note: The following section requires revision, and information is now partially inaccurate.

7. Speech recognition training

Turbolift incorporates a predictive technique into the speech I/O module, which improves the accuracy of the CMU Sphinx II speech recognition system. The system is based upon the assumption that users, when the system fails to correctly interpret their desired voice command the first time, will continue trying until their desired command is recognized by the system.

The first time Turbolift is started and receives speech input, the speechrule.process_init() function is called. This function generates a hash table that contains words and phrases from the existing language model, which it extracts from the file named by the vocab_file option in the speechio_module section of Config/alice.config. Each of these word or phrase items becomes a unique key within the hash table, which is stored in Config/speech_data.dat using Python's pickle module. A nested hash table is created as the value for each key.
This nested hash table holds another key, 'P' (Parent), which has as a value for each of these initial items the original word or phrase string that is the master key in the top level hash table. A top level key called try_list is also created, which contains a running list of misinterpreted phrases, and is cleared each time a target phrase (i.e. one with an existing top level key in the hash table) is encountered.

Expressed in ASCII art, the sample list of words:

    GO
    RUN
    STOP
    PAUSE

Becomes a hash table structure that takes the form:

    std_dict = {}
    |__'GO' = {}
    |     |__'P' = 'GO'
    |
    |__'RUN' = {}
    |     |__'P' = 'RUN'
    |
    |__'STOP' = {}
    |     |__'P' = 'STOP'
    |
    |__'PAUSE' = {}
    |     |__'P' = 'PAUSE'
    |
    |__'try_list' = [ ]

Sphinx allows recombination of words and phrases that exist in a language model, so using the above set as an example, any of the four words alone, or any combination of the four words, could arrive as a string.

Let's examine a contrived situation where the user is trying to have the utterance 'GO' recognized by the system. Generally an utterance (aka received speech) this simple will be recognized on the first try, but for the sake of the argument, assume the following transpires:

    a) User: 'GO'
    b) Computer interprets: 'GO GO'
    c) User (tries again): 'GO'
    d) Computer interprets: 'GO STOP'
    e) User (again): 'GO'
    f) Computer interprets: 'GO'

So here we have three user attempts, and three interpretations by the computer, the last resulting in a match between what was said by the user and what was interpreted by the computer.
For each stage, the contents of the try_list list inside the master hash table look like this:

    a) []
    b) ['GO GO']
    c) ['GO GO']
    d) ['GO GO', 'GO STOP']
    e) ['GO GO', 'GO STOP']
    f) []

At the conclusion of step (f), the hash table, std_dict, that we looked at above, now looks like this:

    std_dict = {}
    |__'GO' = {}
    |     |__'P' = 'GO'
    |     |__'GO GO' = 1
    |     |__'GO STOP' = 1
    |
    |__'RUN' = {}
    |     |__'P' = 'RUN'
    |
    |__'STOP' = {}
    |     |__'P' = 'STOP'
    |
    |__'PAUSE' = {}
    |     |__'P' = 'PAUSE'
    |
    |__'try_list' = [ ]

Note the addition of the two misinterpretations by the computer to the nested hash table 'GO', with the misinterpretations as keys. Each of these keys has a value of 1, which represents the number of times that the misinterpretation has been encountered in association with the given target phrase (a top level hash table key, which in this case is 'GO').

When the value of a misinterpreted phrase reaches a certain point, the misinterpretation is 'promoted' to a key in the master hash table, std_dict, and inherits the 'P' value of the nested hash table from which it came. Let's assume that the point at which misinterpretations are promoted is when they've been seen two times. To illustrate this, let's look at a second session between the user and computer, which picks up at the point where we left off:

    a) User: 'GO'
    b) Computer interprets: 'GO GO GO'
    c) User (tries again): 'GO'
    d) Computer interprets: 'GO GO'
    e) User (again): 'GO'
    f) Computer interprets: 'GO'

Session try_list contents:

    a) []
    b) ['GO GO GO']
    c) ['GO GO GO']
    d) ['GO GO GO', 'GO GO']
    e) ['GO GO GO', 'GO GO']
    f) []

Notice that in step (d), the phrase 'GO GO' is seen again, and by step (f), is associated with the target phrase 'GO'. Recall that before the session started, 'GO GO' already had a value of 1. So, at the conclusion of the session, it has a value of 2, which by the rules of our example, allows it to be promoted to a top level key.
So, the hash table now takes this form:

    std_dict = {}
    |__'GO' = {}
    |     |__'P' = 'GO'
    |     |__'GO GO' = 2
    |     |__'GO STOP' = 1
    |
    |__'RUN' = {}
    |     |__'P' = 'RUN'
    |
    |__'STOP' = {}
    |     |__'P' = 'STOP'
    |
    |__'PAUSE' = {}
    |     |__'P' = 'PAUSE'
    |
    |__'GO GO' = {}
    |     |__'P' = 'GO'
    |
    |__'try_list' = [ ]

Notice that in the hash table, the original nested key 'GO GO' remains in existence. This isn't a problem, however, since 'GO GO' is now a top level key, and will be matched on the first pass. The nested 'GO GO' key and value can actually be removed safely, but this isn't currently done.

Also, note that the newly created top level key, 'GO GO', does not have a 'P' value equal to its own name, as the initial items do. Instead, it has inherited the 'P' value of the target phrase, 'GO', from which it came. Now, any time the computer interprets speech input as 'GO GO', speechrule will return the 'P' value for the phrase instead.

This inheritance process can happen to n levels, with derived top level keys such as 'GO GO' passing along their 'P' values to other misinterpreted words or phrases that become top level keys. In order to take the example here to completion, we illustrate this property using a third session between the user and computer:

    a) User: 'GO GO'
    b) Computer interprets: 'GO GO GO GO'
    c) User (tries again): 'GO GO'
    d) Computer interprets: 'GO GO GO GO'
    e) User (again): 'GO GO'
    f) Computer interprets: 'GO GO'

Session try_list contents:

    a) []
    b) ['GO GO GO GO']
    c) ['GO GO GO GO']
    d) ['GO GO GO GO', 'GO GO GO GO']
    e) ['GO GO GO GO', 'GO GO GO GO']
    f) []

At the conclusion of the session, the hash table looks like this:

    std_dict = {}
    |__'GO' = {}
    |     |__'P' = 'GO'
    |     |__'GO GO' = 2
    |     |__'GO STOP' = 1
    |
    |__'RUN' = {}
    |     |__'P' = 'RUN'
    |
    |__'STOP' = {}
    |     |__'P' = 'STOP'
    |
    |__'PAUSE' = {}
    |     |__'P' = 'PAUSE'
    |
    |__'GO GO' = {}
    |     |__'P' = 'GO'
    |     |__'GO GO GO GO' = 2
    |
    |__'GO GO GO GO' = {}
    |     |__'P' = 'GO'
    |
    |__'try_list' = [ ]
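The three sessions above can be replayed with a short Python sketch. This is not the speechrule code (and this section is itself flagged as partially out of date); it only follows the rules as written here. The names build_table, interpret, and PROMOTE_AT are hypothetical, the promotion threshold of 2 is the one assumed in the example, and pickling to Config/speech_data.dat is left out.

```python
PROMOTE_AT = 2  # assumed threshold, taken from the worked example


def build_table(vocab):
    """Create the initial table: each phrase is its own parent ('P')."""
    table = {phrase: {'P': phrase} for phrase in vocab}
    table['try_list'] = []
    return table


def interpret(table, heard):
    """Feed one recognizer result into the table.

    Returns the parent phrase if 'heard' is a top level key, else None.
    Caveat of this sketch: a spoken phrase that is literally 'P' would
    collide with the parent key inside a nested table.
    """
    entry = table.get(heard)
    if not isinstance(entry, dict):      # unknown phrase (or 'try_list' itself)
        table['try_list'].append(heard)
        return None
    # Target phrase reached: credit every pending misinterpretation to it.
    for miss in table['try_list']:
        entry[miss] = entry.get(miss, 0) + 1
        if entry[miss] >= PROMOTE_AT and miss not in table:
            table[miss] = {'P': entry['P']}   # promoted key inherits the parent
    table['try_list'] = []
    return entry['P']


table = build_table(['GO', 'RUN', 'STOP', 'PAUSE'])
sessions = [
    ['GO GO', 'GO STOP', 'GO'],
    ['GO GO GO', 'GO GO', 'GO'],
    ['GO GO GO GO', 'GO GO GO GO', 'GO GO'],
]
for session in sessions:
    for heard in session:
        result = interpret(table, heard)
print(table['GO GO'], table['GO GO GO GO'], result)
```

Because this sketch credits every entry in try_list, it also records 'GO GO GO' = 1 under 'GO' after the second session, an entry that the diagrams above do not show. Whether the real speechrule module does the same is not stated in this guide.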