Department of Computer Science
University of Minnesota
Minneapolis, MN 55455
{herlocke,konstan}@cs.umn.edu
http://www.cs.umn.edu/research/GIMME/
This paper discusses the design and implementation of the command stream. We address the intractability of inverting and skipping arbitrary code by constructing commands that are aware of their position in time and include logic for supporting VCR-style playback options.
A command is a set of code fragments, written in an interpreted programming language, together with the logical time describing when in the presentation the command should be executed. A command stream is an ordered collection of commands. Command streams may be synchronized with other media streams, such as audio or video, to define presentations.
The command stream was designed to help support the creation of multimedia presentations belonging to three domains.
Choreographed Presentations: Presentations offering multiple synchronized media types, but are not limited to a predefined media types, nor are they limited to media directly playable by a computer. These presentations can control external devices such as room lighting, audio/visual equipment, and special effects such as stage fog or fireworks. An example of a choreographed presentation could be a interactive educational presentation on astronomy. The presentation would control a telescope attached to the computer. As part of the presentation, a student might select a point on map of the solar system. The presentation would move the telescope to the given point in the sky, then display the image on the screen.
Enriched Interactive Presentations: Presentations that incorporate a greater interactivity and adaptability into a browsing environment. Such presentations require an extensive user interface component to allow a user to interact with the presentation, affecting its playback. These presentations can change media types or change presentation of media based on user input. These highly interactive presentations can also generate dynamic presentation elements based on user input or browsing environment. A multilingual, multi-background training presentation is an example of one such presentation. After determining the native language of the student, the presentation can select the media streams that are understandable by the student. The presentation will ask the user questions in order to determine the extent of the user's knowledge in the subject being taught. Then the presentation will present the user with training for knowledge and skills that the user can understand and has not already learned.
Presentations with Simulations: Presentations containing interactive simulations of real processes. This requires the presentation to be able to display customized simulation runs based on input from a user, without prerecording simulation runs. A physics lab presentation might use simulations to help teach a student about conservation of momentum. A student could change parameters such as mass and velocity of colliding air cars, and see a simulation run based on the given inputs.
The next section presents the requirements for the command stream that we derive from these examples. We then discuss related research. The following sections present our basic model of the command stream, a brief description of its implementation, and our evaluation of its success in meeting the requirements. And, the final sections discuss a revised model that addresses some problems discovered in the initial implementation, our plans for future work, and some conclusions.
Expressiveness. In order to support the desired flexibility, a presentation must be able to execute arbitrary instructions as part of the presentation. Arbitrary instructions will provide the following functionality:
Distribution. With the arrival of gigabit networks, the future of information providing and sharing through multimedia will involve distributed multimedia presentations. In order to be a practical multimedia alternative for the future, any presentation solution must be network capable. Presentations will be stored on central servers, but will be played on local workstations or computers. Presentations must have a small start-up delay (download and play is unacceptable for large presentations).
Synchronization. In order to fully support arbitrary actions as part of a presentation medium, we must be able to synchronize them with other media streams. We must be able to explicitly synchronize discrete arbitrary actions with discrete elements of other media streams such as frames in video clips or points in audio samples. *1
Browsability. In order to support multimedia browsing, a multimedia player should support playback at variable speeds, playback in reverse, and random access to points within the multimedia presentation.
These five requirements lead us to our basic model and initial implementation of a command stream.
The HotJava browser[12] from Sun Microsystems provides an environment that allows distribution of arbitrary code (written in Java[13], a C++-like language) alongside other media. The Java language provides the necessary expressiveness by allowing arbitrary code. Like most media on the World Wide Web[2], Java data must be downloaded in full before execution, which can provide for unacceptable start-up delays for large data files. The HotJava environment does not provide for browsability of presentations.
Kaleida Lab's Media Player[5] provides a complete multimedia environment. Code written in ScriptX[5] can be used to create interactive elements, generate animations, and complete arbitrary actions. Arbitrary ScriptX code can be synchronized with other media. ScriptX does not provide support for reversibility of code elements or random access, although such support could be added. In the Media Player environment, distribution of data is also download and play, again leaving the possibility of unacceptable start-up costs.
Apple Computer's QuickTime[1] standard provides for a text media tracks, in addition to audio and video tracks. A text track could be used to transport interpreted code and QuickTime would provide synchronization with the other media streams. QuickTime does not provide a mechanism to execute interpreted code, but it does allow applications to register their own media handlers that manage playback of data, so a QuickTime extension could easily support interpreted code.
The command stream is a first-class presentation medium that supports random access, variable play speed and direction, and fine-grained synchronization with other media streams involved in the presentation of the presentation. Command streams will maintain dependencies that exist between commands and will support "undoing" of commands when rewinding or playing in reverse a command stream. Commands streams will also support a distributed multimedia environment.
Figure 1
shows an example of two command streams which are synchronized with both audio and video streams.

For the duration of the paper, we will assume a multimedia environment where playback and synchronization of separate media streams is controlled through the use of a logical clock. A logical clock is an object with two attributes: value and speed. The value attribute is basically a counter which increases as a presentation moves forward, and decreases as a presentation moves backward. The rate of change of the clock is governed by the speed attribute of the clock which is usually a multiple of the computer's system clock ticks. Both the value and the speed of the clock can be changed at any point in the presentation. Media synchronization is achieved by scheduling media to play at specific values of the logical clock. When the speed of the clock is positive, then the counter is increasing, and we consider the clock to be moving forward. When the speed of the clock is negative, we consider the clock to be moving backwards.
In our model, there are four significant states of the command stream: forward, rush-ahead, backward, and rush-behind. The state of the command stream will determine the action that a command takes when it is played. There is a different fragment of code associated with each of the four states of a command.
A command stream is in the forward state when the logical clock is moving forward, and the command stream is able to play each command at its given logical time. The forward code fragment is the core element of a command because it represents the action that happens during normal playback. We will refer to the forward code fragment as the primary code fragment, due to its importance.
A command stream enters the rush-ahead state when the logical clock is moving forward, and the command stream is behind in execution of commands. This can happen in three situations: 1) The command prior to the current command took too long to execute, and the start time of the current command has already been passed, 2) the presentation skips the current command by changing the value of the logical clock from a time earlier than the current command's play time to a time later, or 3) a command is unable to reach the client-playback engine before its scheduled play time due to network lags or other bottlenecks. The purpose of the rush-ahead state is expedite re-establishment of synchronization, while maintaining the integrity of the command stream. When a large positive change in logical time happens, audio and video streams can drop data while re-synchronizing. However, since a command represents an arbitrary code fragment, dropping a command may cause later commands to fail, destroying the integrity of command stream. A command which creates a window cannot be dropped because a later command may attempt to draw into it. The rush-ahead code fragment provides the mechanism to service dependencies of later commands. Well written rush-ahead fragments will accomplish the minimal set of tasks necessary to maintain the command stream integrity quickly. In addition, the rush-ahead code allows the presentation to avoid ugly visual artifacts that may result from playing commands fast or out of sync. An example of such a visual artifact would be a presentation that included subtitles. Under the current model, if there was no rush-ahead code and the forward code was always played, jumping forward in such a presentation would result in having all of the skipped subtitles played at super speed as the command stream rushes to catch up with the value of the logical clock. Since the subtitles are being played so fast, they would just appear as a flickering on the screen. This could be avoided by having null rush-ahead code for commands that displayed subtitles.
In order to support the browsability requirement stated in the previous section, a command stream must be able to be reversed and rewound. The inverse state of a command stream occurs when the logical clock is moving backward (speed of clock is negative), and each command is played at its given time. The purpose of the inverse code fragment is to undo all effects created by the forward (or rush-ahead) code fragment of the same command. The inverse of a code fragment that creates a window must remove the window from the screen. The rush-behind state is analogous to the rush-ahead state when the speed of the logical clock is negative, i.e. it is executed when a later command takes to long to execute, or when logical clock is changed from a time before the command to a time earlier.
Our initial implementation of a command stream, called TclStream 1.0, used the Tcl language[7] for interpreted code. A command was specified by a 5-tuple, which included the logical time and the four Tcl code fragments. The logical time is represented by a floating point number that can be expressed in two ways: relative to the previous command or absolutely (relative to the start of the presentation) system. The ability to specify relative values of execution allows collections of commands to be relocated as a whole, without recalculating the logical execution time of each command. This feature supports the development of libraries of command sequences, analogous to media "clips".
The code fragments can be arbitrary pieces of Tcl code. They can be single lines of code, or they can be large blocks of code. Since commands are atomic relative to the command stream, fragments of interpreted code should be "sized" at the granularity that provides the needed synchronization with other media. Large commands can be problematic because the execution time may be much harder to predict, and their length may overlap the start times of later commands. Code fragments can perform any action allowed in Tcl, including defining and calling new procedures.
Figure 2 contains two examples of commands taken from a dance animation program. The value of 0.1 seconds specifies that the first command is to be executed a tenth of a second after the previous command. The primary code will move the end of the arm to (100,100) on the screen. Notice that there is no rush-ahead. This command is in the middle of a sequence of arm movements and therefore it is unnecessary to execute when we are rushing-ahead, because a later command will cancel this command out, when it moves the entire arm to a different location on the screen. The inverse code moves the arm back to its original position. The rush-behind code is empty for the same reason as the rush-ahead.

In the second code fragment, the logical time is specified absolutely by appending an `a' to the end of the logical time. This command will be scheduled exactly 100 seconds into the presentation. During normal execution moving forward, this command will pop up a button, asking if the user wishes to jump to a later point in the presentation. If the user pushes the button, then the TclStream will change the value of the logical time system itself, moving the entire presentation to a point 300 seconds in, where presumably the Polka demonstration starts. If the command is skipped for whatever reason moving forward, the button is not created, but a variable is set so that later commands can react to the fact that the button was not displayed. The inverse and rush-behind commands both remove the button (using the "destroy" command) if it exists and undo all changes to variables that might have occurred.
TclStream was implemented as a component of the Continuous Media Toolkit (CMT) [8]. CMT is a distributed real-time multimedia system, developed at the University of California Berkeley by the Plateau project. It is implemented in a combination of C and Tcl, with the API being in Tcl. CMT runs with a Tcl interpreter whose core has been augmented to better support real time scheduling, as well as the TclDP[10] and CMT command extensions. Since the CMT API is implemented in Tcl, TclStream code fragments can directly control the multimedia playback environment, by initiating new media streams, changing the speed of playback, and changing the point in the timeline.
In the CMT architecture, there are three main processes: the application, the media-server, and the media-player (see figure 3). The media-server process runs on fileservers that hold the data (e.g. video, audio, or commands) and is responsible for reading the data from disk and sending it across the network. The media-player process runs on the machine where the media is to be displayed. The media-player process is responsible for receiving network data and playing it at the right time. The application is responsible for building the initial user interface, and initializing the media streams from the media-server.

Synchronization in CMT is fine-grained and based on the timeline model. An exact mapping of media and tight time tolerances are achieved through a logical clock which is shared by all of the CMT processes. In CMT, the logical clock is known as the Logical Time System, commonly referred to as the LTS. The LTS is implemented as a distributed object, with two slots: speed and value. A speed of 1 indicates a relation of one system second to one logical second. Fast forwards, fast rewinds or slow motion can be attained by setting the speed to a value other than 1 or 0. A value of 2.5 would be a fast forward, -2.5 a fast rewind, and 0.5 a slow motion forward. Jumps to a specific logical time can be achieved by directly setting the value of the logical clock.
Media are synchronized in CMT by assigning them logical play times. Individual frames can be assigned their own times, or more commonly a stream will have a start time and a frame rate. Internally, each frame (e.g. video frame or audio segment) has a designated start time and duration that are used by both the media-server (to supply the media-player) and the media-player (to play the frames). The command is the atomic frame of the command stream medium, and the logical time of the command is used to schedule the play of the command and synchronize that command with other media streams. A command is arbitrary code, and it's duration will be unpredictable, so the duration of a command is not considered in scheduling and playing commands.
The media-server schedules commands to be sent to the media-player based on the value and speed of the current timeline. The media-server can be configured to send commands slightly ahead of time to account for network and processor lag. When the media-server determines that a command needs to be sent, it sends the two fragments of Tcl code that are appropriate to the direction of the LTS (the primary and rush-ahead if speed is positive, inverse and rush-behind if negative) and the logical start time. Figures 4 and 5 illustrate this, with playback starting forward in figure 4 then changing to backward in figure 5. Only the relevant code fragments are sent due to the consideration that the code fragments may be used to transport large amounts of arbitrarily encoded data. If such were the case, sending unnecessary code fragments may result in considerable bandwidth waste.


The media-player receives each pair of Tcl command fragments together with the logical start time. If the logical start time has not yet been reached, then the media-player schedules a timer to execute the first of the two code fragments in the application via RPC at the right time (remember that there is send-ahead). If the LTS has a positive speed, then the first code fragment will be the primary code fragment. If negative, then the code fragment will be the inverse code fragment. If the logical start time of the code fragments has already passed, then the code fragments have arrived late, and the second, "rush" code fragment is executed, be it the rush-ahead or rush-behind. All code fragments are dropped once they have been played.
An application can jump to a different point in logical time by directly setting the value slot of the LTS. This allows the random-access into any point of the command stream. When the value slot of the LTS is changed or the direction of playback is changed, the media-player drops all scheduled code fragments that have not yet been played. This is seen in both figure 5, where the direction of playback is changed, and in figure 6, where the application jumps to a different point in the time line. The media-server determines all commands that were skipped for a change in logical time forwards or the commands that must be undone for a jump backwards. For each of these commands, the media-server sends only the rush code fragment appropriate the direction. Only one fragment per command is sent because the rush-ahead or rush-behind has to be executed (there is no alternative). The media-server sends them all as quickly as possible, and the media-player executes each rush code fragment as quickly as possible. Figure 6 demonstrates a jump.

In order to construct a TclStream, a media author codes by hand each 5-tuple command. For each command, the author determines what the four different fragments of Tcl code will be. Given a primary code fragment, the programmer will have to come up with appropriate inverse, rush-ahead, and rush-behind.
It proved to be easy to create streams of Tcl commands that were synchronized with other media streams. In the matter of a day or two we created a simple animated "karaoke" program (see figure 7
) in which an animated stick-figure danced in a window and sang along with the with the music (in subtitles). Accurately choosing logical time values for commands in order to keep in synch with the music proved to be a trial-and-error process but otherwise creating the presentation required little more than a basic knowledge of Tcl and Tk.

The ease with which we are able to create interactive user interface elements in a presentation with minimal programming effort is the most exciting gain from TclStream 1.0.
The Tk toolkit provides the user interface elements and the simple interface procedures to control them. The three major ways that feedback and adaptation to user input is accomplished are: 1) Tcl code is executed to cause visible changes or to set variables within the Tcl interpreter that will affect execution of later code fragments, 2) completely new media streams are initialized and played, or 3) the speed or value of the LTS can be changed by the code fragments, causing the entire presentation to either move to a different location in the time line or to move at a different speed.
The ability of Tcl code fragments to directly control the multimedia environment through the CMT API provides considerable flexibility and power to a multimedia presentation. Upon determination of the application's environment or upon cue from the user, the TclStream can initiate completely new media streams, choosing audio and video clips that are appropriate to the occasion. The dance animation program could begin by prompting the user to select a language for narration, then open up streams customized to the chosen language (perhaps even a command stream that animates American Sign Language). Being able to change the value of the LTS allows a command stream to branch to different points of the time line based on user input. Multiple alternatives for an presentation sequence could be placed in different segments of the logical time line. A command stream could ensure that only one of the sequences is played by automatically skipping over the other sequences if the application didn't jump directly into them. Consider the dance animation program. At one point, while the presentation is demonstrating how to dance the Polka, we can have the stream create a two Tk buttons, with the messages: "Press here to move on to the Waltz" and "Press here to skip to the Hora." Pressing either of the buttons will change the offset of the logical time line so that it resumes execution at a later point in time where the appropriate dance begins. Note that the change in timeline will affect all other media, so the correct music will be played. If the user does not press either button, the presentation will continue with the Polka, and the buttons will go away after a brief period of time.
The second problem also dealt with the browsability requirement. If a command stream is large then a large change in the value of the logical time system will sometimes result in a considerable delay before the stream returns to normal execution. The delay results from the fact that the rush-ahead or rush-behind code for every command that was skipped must be executed. It is impossible to get around the delay resulting just from executing a large amount of Tcl code fragments (since the rush-aheads must be played), but it is hoped that rush-aheads and rush-behinds will be programmed with speed in mind. When code fragments are kept simple, the major bottleneck of the rush-aheads is the network delay resulting from sending all of the rush code fragments.
The third problem was that creating the 5-tuples of Tcl code that comprised the commands was often daunting. Determining the "inverse" was not always simple, and finding a fragment of code that worked as an inverse in the context of a command stream was often a process of trial and error. Determining the rush-ahead and rush-behind fragments of a command sometimes also required much thought and experimentation to perfect. This led to the addition of a easy-to-use authoring environment to our list of requirements for a multimedia presentation environment.
These three problems led us to revise our model.
Once a command is cached in the media-player, the media-server controls the execution of that individual command by sending messages across the network to the media-player. The code fragments for that specific command are no longer sent. When the media player receives a message from the media-server, specifying a command that is to be played, the media-player sends a message to the command, instructing the command to execute a fragment of code. Each message contains a reference to the desired code fragment and the logical play time of that command or, in the case of a rush, the logical time that the command stream is rushing to. The command will ensure that no code fragment is executed twice, and will ensure that all previous commands have executed by sending a message to the previous command. If a command receives a message to play a rush fragment, it will execute the fragment if the code fragment has not played already and will forward the message to the next command (forward if speed is positive, backward if speed is negative). The rush message will continue to be forwarded until the logical time of a command is later than the logical time being rushed to (or earlier if a rush-behind). This way the entire rush-ahead can be propagated in the media-player with only one message between the media-server and the media-player (assuming that all of the command objects are resident in the media-player).
It is important to note that since code fragments are only sent the first time the command is played, problems can occur if the all the code fragments of a needed command do not arrive at the media-player. Since CMT sends data using a semi-reliable protocol[11], this will rarely happen, but should the media-player be told to play a command for which it has not received the necessary code fragments, the media-player can immediately request the data to be sent through a guaranteed channel using an RPC call to the media-server.
Thus the global state problem of the distributed system is solved by maintaining and tracking all state in the media-player, and removing the need to know state from the remote media-server. The media-server is responsible for telling the media-player what command need to be played, and the media-player will ensure that no commands are executed out-of-order or executed twice. Rush-behinds become efficient as they are freed from network transmission delays. Rush-aheads may still require network transmission, because the commands may not yet be resident in the media-player. This can be minimized or completely avoided by distributing commands to the media-player before they are actually needed. Under this model, all commands in the command stream could be distributed initially before playback, which would eliminate rush-ahead network delays, but because command streams could be very long, it is desirable to maintain the ability to do real-time distribution of commands.
Another possible solution was inspired by an observation of the MPEG encoding standard for video [4]. The basic idea is to designate certain time spots as "key points" and to pre-compute when possible the easiest way to jump from each key point among all other key points. When making a large jump, we would execute the rush-ahead or rush-behind code to get to the nearest key point, then use the pre-computed jump code to jump along the key point chain to the key point closest to the destination, and finally execute the rush-ahead (or behind) code to reach the destination. This mechanism can be implemented (trivially) by combining all of the rush-ahead/rush-behind code into the key point jumps, but could be optimized by placing key points strategically at logical separations in a presentation.
The remaining problem is that of minimizing the complexity of the authoring interface. Common programming elements (such as buttons, windows, labels) require multiple commands to implement. It is our hope that we can build a library of commonly used commands or command templates that will easily allow the TclStream author to build powerful interactive multimedia applications quickly, yet still allow him the complete flexibility to create his commands from scratch. Groups of commands can be organized into higher-level objects or abstractions that are easier for authors to understand. Templates for the higher level objects will make it easy for authors to customize them. Figures 8
and 9 provide an example of a how a template for a generic button widget might work.


With the development and release of a fully functional command-stream (with TclStream 2.0), research will shift to an exploration of applications and the command-stream authoring environment. Prototype applications will probably consist of distributed multimedia training presentations that incorporate a high level of interactivity.
TclStream 2.0 will be released to the public as soon as it is stable. See http://www.cs.umn.edu/research/GIMME for information on the release.
We would also like to thank Lynda Hardman for her valuable suggestions regarding this paper.