A case study of web subtitling formats applied to a video project with substantial translation

Thomas Levine, February 2019


I have been producing a video over the past three years, always rendering it in web formats. Subtitles are essential to the project, and I have consequently developed several opinions related to the specification and implementation of web media formats.

The project is an illustration of tongue twisters in different languages. For example, for "Peter Piper picked a peck of pickled peppers", I sat next to a peck of pickled peppers and spoke the phrase.

The tongue twisters are in different languages, and they are translated to even more languages. Some tongue twisters exist only in particular dialects. Others exist in multiple languages. Some languages have multiple writing systems.


In developing my tongue twister videos and rendering them to alternative output formats, I have been able to compare alternative subtitle formats. Furthermore, the style of translations involved in this project is unusual enough that I may have identified situations that others have not thought much about.

My video software and associated opinions are based mostly on the WebVTT draft from 2016. I have not kept especially up-to-date with recent developments. Even if some of my findings have since been addressed, they remain informative as real demonstrations of the need for these features.

Separation of web subtitles from video files

Because WebVTT files are separate from their corresponding audio/video files, it is easy to render just the subtitles without updating a container file. This is very helpful.

I generally translate tongue twisters by going to hacker meetings and finding people who speak a new language and collaborating with them to translate these bizarre phrases to other languages. It often takes several meetings with different people to arrive at a good translation, as these phrases are very strange.

The result of a translation session may be to add a few translations and to update a few existing translations. Because the subtitle format is separated from the video format, it is straightforward for me to configure a file-based build system (such as make, though the build system is another discussion) to update only the subtitles in this case, so that my collaborator can see his or her contributions right away.
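The partial rebuild described above rests on a simple timestamp comparison, which is what file-based build systems such as make do. The following is a minimal sketch of that check in Python, not the project's actual build rules; the function name is_stale is an assumption made for illustration.

```python
import os

def is_stale(target, sources):
    """Return True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # Rebuild when at least one source was modified after the target.
    return any(os.path.getmtime(source) > target_mtime for source in sources)
```

With such a check, editing a translation makes only the corresponding subtitle file stale, while the rendered video, which does not depend on the translation, is left alone.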

Contrast this with Matroska subtitles, for example, which are part of the container file. (Note that I did not look at what would be involved in updating just the subtitles in a Matroska file and that a more informed build system may be able to handle such a partial rebuild.)

IETF language codes

WebVTT and the associated HTML both specify that languages should be identified by IETF language codes. These are inadequate for the cataloging of my tongue twisters because they do not have information about linguistic similarities.

Language component

One of the simpler examples is the tongue twister "Strč prst skrz krk", which works in both Czech and Slovak, so I use the same text for each. More complicated examples include phrases in Serbocroatian, Serbian, Croatian, Bosnian, and Montenegrin. (And the last language is not recognized by IETF.)

I am addressing this by identifying languages internally by glottocode and converting them to IETF language codes. Glottolog is a taxonomy of languages, so it is possible to specify the above "Strč prst skrz krk" in Czech-Slovak and to render separate Czech and Slovak subtitles, where each one looks for text specific to Czech or Slovak and then falls back to Czech-Slovak.
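The fallback described here can be sketched as a walk up the taxonomy. This is a minimal illustration rather than the project's actual code: the PARENT table and the lookup_text function are hypothetical, and the identifiers are readable placeholders, not real glottocodes.

```python
# Hypothetical taxonomy: each node maps to its parent node.
# These identifiers are placeholders, not real glottocodes.
PARENT = {
    "czech": "czech-slovak",
    "slovak": "czech-slovak",
    "czech-slovak": None,
}

def lookup_text(texts, language, parent=PARENT):
    """Return the text recorded for language or its nearest ancestor, else None."""
    node = language
    while node is not None:
        if node in texts:
            return texts[node]
        node = parent.get(node)  # unknown nodes have no parent, ending the walk
    return None

# The tongue twister is recorded once, at the Czech-Slovak level.
texts = {"czech-slovak": "Strč prst skrz krk"}
czech_subtitle = lookup_text(texts, "czech")
slovak_subtitle = lookup_text(texts, "slovak")
```

Both lookups resolve to the shared text; adding a "czech" entry to texts would override it for the Czech track only.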

IETF code script component

IETF language codes may include the script, but the specification gives much freedom to the implementor about how to encode the script. I am consequently unsure of what I should write for that component.

For example, many languages use slightly different Latin scripts. French and Spanish both have "é", but they mean different things, and English doesn't really have it. Are these all "Latn"?

Russian and Ukrainian are both said to be written in Cyrillic, but only the former has "ё", and only the latter has "ґ". Are they both "Cyrl"?

Furthermore, I am pretty sure I just made up "Cyrl" or took it from Wikipedia; I couldn't find a reference anywhere about standard names for the different scripts.

Chinese languages

The specification of Chinese languages is particularly confusing. I imagine that part of this comes from the Chinese distinction between written and spoken languages, as there is no model for that in IETF language codes.

But it seems that most of the complexity in the IETF language codes for Chinese languages is there for legacy support, so I won't comment much on it.

Unclear motivation

Much of my confusion around IETF language codes relates to how I do not understand their purpose. That is, I do not know why it matters that my language codes follow these conventions. If there is a good reason for the format, my confusion could perhaps be resolved with documentation.

Lack of validation of language codes

Given that the IETF language codes do not match how I record my tongue twisters, it is very convenient that no player seems to validate the codes. For example,

In a way, this is another benefit over Matroska. The Matroska implementations that I have tried require that subtitles be annotated with ISO 639-2 codes.

srclang attribute of the track tag

The srclang attribute is supposed to specify one and only one language, encoded as an IETF language tag. This does not match how my videos work.

Many srclang to one src

As mentioned previously, it is possible that the track with srclang="cs" (Czech) will be exactly the same as the one with srclang="sk" (Slovak). In this case I can reference the same WebVTT file from two different track tags.
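As an illustration of this arrangement, the sketch below generates two track tags that point at one WebVTT file. The file name czech-slovak.vtt and the function track_tags are assumptions made for the example, not names from the project.

```python
from html import escape

def track_tags(src_by_lang):
    """Build one <track> element per language code; several codes may share a src."""
    lines = []
    for lang, src in sorted(src_by_lang.items()):
        lines.append(
            '<track kind="subtitles" srclang="%s" src="%s">'
            % (escape(lang, quote=True), escape(src, quote=True))
        )
    return "\n".join(lines)

# Hypothetical file name; both languages reference the same subtitle file.
tags = track_tags({"cs": "czech-slovak.vtt", "sk": "czech-slovak.vtt"})
```

The result is one line per srclang, each with the same src attribute.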

One/many srclang to many src

Conversely, there may be multiple different tracks in one particular language, such as where differences in dialect within one language and country make for different spellings. I have come across this issue with the Chakavian and Shtokavian dialects of Serbo-Croatian. (Kajkavian is not affected because it has an ISO 639-3 code.) Here are some compliant approaches that all model the languages/dialects incorrectly.

In all of the above options, it is also strange that Kajkavian is treated differently from Chakavian and Shtokavian.

In practice, I have ignored this entirely. I have made separate tracks for "hr", "bs", and "sr", not accounting for minority dialects/languages.

No reasonable srclang

Finally, I make some tracks that are not in any particular language or alphabet.
  1. One track is the text of each tongue twister in its original language, and of course in an appropriate script.
  2. I provide literal translations of each tongue twister, which is a bit different from the normal translation.
  3. In some versions of the video, I have created subtitle tracks that contain all of the alternative scripts for a particular language, such as a Serbian track in both Latin and Cyrillic.

For the first I set the srclang to "xx". For the second I use "en-literal". I don't remember what I did for the third.

WebVTT lang tag

The WebVTT lang tag is much more appropriate than the srclang attribute for annotation of the original language track. However, I am not sure how exactly it should be used in many cases.

Generous strangers have translated the Peter Piper tongue twister to Finnish as "Peter Piper poimi yhdeksän litraa (yksi 'pek') säilöttyjä paprikoita." Should "Peter Piper" and "pek" be annotated as English words?
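For concreteness, this is what the annotated cue would look like if the answer were yes. The sketch uses the WebVTT language span syntax, <lang en>...</lang>; the helper lang_span is hypothetical, and whether these words ought to be annotated at all remains the open question.

```python
def lang_span(language, text):
    """Wrap text in a WebVTT language span such as <lang en>...</lang>."""
    # Cue text may not contain a raw "&" or "<"; ">" is escaped as well
    # so that the text can never contain the "-->" sequence.
    escaped = text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    return "<lang %s>%s</lang>" % (language, escaped)

cue = "%s poimi yhdeksän litraa (yksi '%s') säilöttyjä paprikoita." % (
    lang_span("en", "Peter Piper"),
    lang_span("en", "pek"),
)
```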

I transliterated чокањче/čokanjče to English as "chuokañcheh". Should the transliteration be annotated as a foreign language? Furthermore, what language should it be annotated as? Reasonable options include Serbocroatian, Serbian, Croatian, Bosnian, Montenegrin, Kajkavian, Shtokavian, and Chakavian.

A tangential issue: Shtokavian and Chakavian do not have ISO 639 codes. Montenegrin does, but only since December 2017.

Support in different softwares

Very conveniently for me, all of the WebVTT viewer softwares that I have tried have supported almost all of the features that I have tried to use. There was only one feature that I wanted but could not use: karaoke mode. I would of course like for karaoke mode to be implemented in players, but this got me thinking about the cataloging of supported features.

I imagine that, as WebVTT is standardized and adopted, people will develop tools to catalog which implementations support which features. When this happens, I suggest that such catalogs include softwares other than web browsers, such as video players like mpv that play subtitles without HTML files.

The specification could also indicate which features are most important, so that people interested in implementing basic WebVTT support may know where to focus their efforts.

Listing of many subtitle tracks

I have tried playing my video in several browsers, and each one has failed to display all of the subtitle tracks. The video has so many subtitle tracks that the top-most tracks fall off the screen and cannot be selected.

One could consider this an issue in implementation except that this is a problem on all players. I suggest that the specification explicitly state that the video player must be able to handle a large list of tracks.

Audio subtitles

I plan on releasing tongue twister lessons, and I may do these as just-audio files rather than as video-and-audio files, but still with subtitles. If I were to release these today, I would reference WebVTT files in track tags within a video tag, and I would style the video element so that the subtitles would take up most of the space.

I think it would be straightforward to do this with the video tag, but I thought this style of use was noteworthy.


One of the few complicated parts of my build system is the part where I combine the several short tongue twister videos into one video with many tongue twisters. The complicated part is the calculation of absolute subtitle timings to match the video.

Consider videos of three tongue twisters:

  1. The first, about pickled peppers, is six seconds long.
  2. The second, about red cabbage, is five seconds long.
  3. The third, about fish, is three seconds long.

Within each tongue twister, I specify subtitle timings relative to the beginning of the clip. For example, the first clause of the red cabbage tongue twister may start one second into the clip. See this discussion on suckless-dev for more detail.

To convert these timings from relative to the clip to relative to the full video, I render each clip, check how long the clip rendered as, and cumulatively add the durations of previous clips to the relative timings within each subsequent clip.
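The calculation can be sketched as follows, using the three clip durations above. This is an illustration under stated assumptions rather than the project's build code: the function names, the cue text, and the cue end time of three seconds are all made up for the example.

```python
from itertools import accumulate

def clip_offsets(durations):
    """Return the start time of each clip within the combined video."""
    return [0] + list(accumulate(durations))[:-1]

def to_absolute(clips):
    """Shift clip-relative cues to video-relative cues.

    clips is a list of (duration, cues) pairs, where cues is a list of
    (start, end, text) tuples timed from the beginning of that clip.
    """
    offsets = clip_offsets([duration for duration, _ in clips])
    absolute = []
    for offset, (_, cues) in zip(offsets, clips):
        for start, end, text in cues:
            absolute.append((offset + start, offset + end, text))
    return absolute

clips = [
    (6, []),                                        # pickled peppers
    (5, [(1, 3, "first clause about red cabbage")]),  # hypothetical cue
    (3, []),                                        # fish
]
cues = to_absolute(clips)
```

The red cabbage clip starts six seconds into the full video, so a cue that begins one second into that clip begins at seven seconds overall.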

If it were possible to specify a playlist of several videos and subtitles, where the timings of each subtitle track relate to the beginning of the corresponding video, then I would not need the above calculation. There would be other benefits too. For example, I would not need to re-encode video as much.


When I was initially implementing my WebVTT renderer, I recall that it took me a little while to realize that my only error was to exclude the milliseconds component. This particular issue could be resolved by a change to the specification, but I think it mostly highlights the need for validation and debugging tools.
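As a small example of the kind of tooling meant here, the sketch below formats a timestamp with its mandatory three-digit fraction and checks a timestamp against the WebVTT shape (optional hours, then mm:ss.ttt). The function names are assumptions for illustration, and the pattern is a simplified check, not a full validator.

```python
import re

# Optional hours (two or more digits), then minutes, seconds,
# and a mandatory three-digit milliseconds component.
TIMESTAMP = re.compile(r"^(\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3}$")

def vtt_timestamp(seconds):
    """Format a non-negative number of seconds as hh:mm:ss.ttt."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3600000)
    minutes, rest = divmod(rest, 60000)
    secs, ms = divmod(rest, 1000)
    return "%02d:%02d:%02d.%03d" % (hours, minutes, secs, ms)

def is_valid_timestamp(text):
    """Return True if text has the shape of a WebVTT timestamp."""
    return TIMESTAMP.match(text) is not None
```

A timestamp written as 00:00:07, without the milliseconds, fails the check, which is exactly the mistake described above.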


With little difficulty I managed to render my tongue twister subtitles as WebVTT format. The only thing that I really wanted but could not use was karaoke mode. It was helpful for my build system that subtitles are in separate files from the video files.

I found many of the details of WebVTT and HTML related to track language specifications to be inappropriate for my project. All players seem to ignore these for now, but I am concerned for what will happen if they start to be used for something. In summary,

Moreover, I do not understand the purpose of these language specifications. Regardless of whether the specifications are to be changed based on the above issues, I suggest that the specification document should document the motivations for the particular data model that is specified.

I have some suggestions that could be alternatively seen as issues of specification or implementation.

Finally, I presented a few ideas for new features, all of which go slightly beyond the scope of web-based subtitles.

Even with the many issues that I pointed out and without the features that I proposed, it was straightforward for me to implement basic WebVTT rendering within a video build system. I believe that that is a good sign.


The project of interest is published at The tongue twister catalog (both data and software) is available at