A case study of web subtitling formats applied to a video project with substantial translation
Thomas Levine, February 2019
Introduction
I have been producing a video over the past three years, always rendering it in web formats. Subtitles are essential to the project, and I have consequently developed several opinions related to the specification and implementation of web media formats.
The project is an illustration of tongue twisters in different languages. For example, for "Peter Piper picked a peck of pickled peppers", I sat next to a peck of pickled peppers and spoke the phrase.
The tongue twisters are in different languages, and they are translated to even more languages. Some tongue twisters exist only in particular dialects. Others exist in multiple languages. Some languages have multiple writing systems.
Findings
In developing my tongue twister videos and rendering them to alternative output formats, I have been able to compare alternative subtitle formats. Furthermore, the style of translations involved in this project is unusual enough that I may have identified situations that others have not thought so much about.
My video software and associated opinions are based mostly on the WebVTT draft from 2016. I have not kept especially up-to-date with recent developments. In case some of my findings have been addressed, they are still informative as real demonstrations of the need for these features.
Separation of web subtitles from video files
Because WebVTT files are separate from their corresponding audio/video files, it is easy to render just the subtitles without updating a container file. This is very helpful.
I generally translate tongue twisters by going to hacker meetings and finding people who speak a new language and collaborating with them to translate these bizarre phrases to other languages. It often takes several meetings with different people to arrive at a good translation, as these phrases are very strange.
The result of a translation session may be to add a few translations and to update a few existing translations. Because the subtitle format is separated from the video format, it is straightforward for me to configure a file-based build system (such as make, though the build system is another discussion) to update only the subtitles in this case, so that my collaborator can see his or her contributions right away.
Contrast this with Matroska subtitles, for example, which are part of the container file. (Note that I did not look at what would be involved in updating just the subtitles in a Matroska file and that a more informed build system may be able to handle such a partial rebuild.)
IETF language codes
WebVTT and the associated HTML both specify that languages should be identified by IETF language codes. These are inadequate for the cataloging of my tongue twisters because they do not have information about linguistic similarities.
Language component
One of the simpler examples is the tongue twister "Strč prst skrz krk", which works in both Czech and Slovak, so I use the same text for each. More complicated examples include phrases in Serbocroatian, Serbian, Croatian, Bosnian, and Montenegrin. (And the last language is not recognized by IETF.)
I am addressing this by identifying languages internally by glottocode and converting them to IETF language codes. Glottolog is a taxonomy of languages, so it is possible to specify the above "Strč prst skrz krk" in Czech-Slovak and to render separate Czech and Slovak subtitles where each one looks for text specific to Czech or Slovak and then falls back to Czech-Slovak.
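The fallback described above can be sketched as a lookup that walks from a specific language up through its ancestors in the taxonomy. This is only a minimal sketch under stated assumptions: the node names (`czech`, `slovak`, `czech-slovak`), the `PARENT`, `IETF`, and `TEXTS` tables, and both functions are hypothetical placeholders, and a real catalog would presumably key on glottocodes instead.

```python
# Hypothetical taxonomy: each node names its parent (None at the root).
# A real catalog would use glottocodes as keys; these names are placeholders.
PARENT = {
    "czech": "czech-slovak",
    "slovak": "czech-slovak",
    "czech-slovak": None,
}

# Assumed mapping from the leaf nodes to IETF language tags.
IETF = {"czech": "cs", "slovak": "sk"}

# Text is recorded at whichever level of the taxonomy it applies to.
TEXTS = {"czech-slovak": "Strč prst skrz krk"}


def text_for(node, texts=TEXTS, parent=PARENT):
    """Return the text for node, falling back to its ancestors."""
    while node is not None:
        if node in texts:
            return texts[node]
        node = parent.get(node)
    return None


def subtitles_by_ietf_tag():
    """Build one subtitle text per IETF tag, using the fallback lookup."""
    return {tag: text_for(node) for node, tag in IETF.items()}
```

With these tables, both the `cs` and the `sk` entries resolve to the text stored at the shared Czech-Slovak node, and a text stored directly under `czech` would take precedence for `cs` only.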
IETF code script component
IETF language codes may include the script, but the specification gives much freedom to the implementor about how to encode the script. I am consequently unsure of what I should write for that component.
For example, many languages use slightly different Latin scripts. French and Spanish both have "é", but they mean different things, and English doesn't really have it. Are these all "Latn"?
Russian and Ukrainian are both said to be written in Cyrillic, but only the former has "ё", and only the latter has "ґ". Are they both "Cyrl"?
Furthermore, I am pretty sure I just made up "Cyrl" or took it from Wikipedia; I couldn't find a reference anywhere about standard names for the different scripts.
Chinese languages
The specification of Chinese languages is particularly confusing. I imagine that part of this comes from the Chinese distinction between written and spoken languages, as there is no model for that in IETF language codes.
But it seems that most of the complexity in the IETF language codes for Chinese languages is there for legacy support, so I won't comment much on it.
Unclear motivation
Much of my confusion around IETF language codes comes from not understanding their purpose. That is, I do not know why it matters that my language codes follow these conventions. If there is a good reason for the format, my confusion could perhaps be resolved with documentation.
Lack of validation of language codes
Given that the IETF language codes do not match how I record my tongue twisters, it is very convenient that no player seems to validate the codes. For example,
- I get no validation errors when I use "xx" as a language name. (See the next section for why I do this.)
- Track names are based on the "label" attribute rather than the language code.
- Default track is based on the "default" attribute rather than on the default language of the browser.
In a way, this is another benefit over Matroska. The Matroska implementations that I have tried require that subtitles be annotated with ISO 639-2 codes.
srclang attribute of the track tag
The srclang attribute is supposed to specify one and only one language, encoded as an IETF language tag. This does not match how my videos work.
Many srclang to one src
As mentioned previously, it is possible that the track with srclang="cs" (Czech) will be exactly the same as the one with srclang="sk" (Slovak). In this case I can reference the same WebVTT file from two different track tags.
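As an illustration of two track tags sharing one file, here is a minimal sketch of how a build script might render them. The `track_tag` helper, the labels, and the file name `strc-prst.cs-sk.vtt` are all hypothetical; only the attribute names (`kind`, `srclang`, `label`, `src`, `default`) come from HTML.

```python
from html import escape


def track_tag(srclang, label, src, default=False):
    """Render one HTML track element for a subtitle file."""
    attrs = [
        'kind="subtitles"',
        f'srclang="{escape(srclang, quote=True)}"',
        f'label="{escape(label, quote=True)}"',
        f'src="{escape(src, quote=True)}"',
    ]
    if default:
        # Boolean attribute: its presence marks the default track.
        attrs.append("default")
    return "<track " + " ".join(attrs) + ">"


# Two language tags that point at a single WebVTT file (hypothetical name).
tracks = [
    track_tag("cs", "Čeština", "strc-prst.cs-sk.vtt"),
    track_tag("sk", "Slovenčina", "strc-prst.cs-sk.vtt"),
]
```

Both elements would be placed inside the same video tag; the player lists them as separate tracks even though they load identical cues.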
One/many srclang to many src
Conversely, there may be multiple different tracks in one particular language, such as where differences in dialect within one language and country make for different spellings. I have come across this issue with the Chakavian and Shtokavian dialects of Serbo-Croatian. (Kajkavian is not affected because it has an ISO 639-3 code.) Here are some compliant approaches that all model the languages/dialects incorrectly.
- Assign language tag "hr" to Chakavian. Assign tags "bs" and "sr" to Shtokavian. This is wrong because both languages/dialects are spoken in both places.
- Assign language tag "sh" to both, so the srclang will not tell us which is which. Also, "sh" is deprecated.
- Assign all of the following language tags to both: "hr", "bs", "sr". This makes for lots of tracks and still does not disambiguate.
In all of the above options, it is also strange that Kajkavian is treated differently from Chakavian and Shtokavian.
In practice, I have ignored this entirely. I have made separate tracks for "hr", "bs", and "sr", not accounting for minority dialects/languages.
No reasonable srclang
Finally, I make some tracks that are not in any particular language or alphabet.
- One track is the text of each tongue twister in its original language, and of course in an appropriate script.
- I provide literal translations of each tongue twister, which is a bit different from the normal translation.
- In some versions of the video, I have created subtitle tracks that contain all of the alternative scripts for a particular language, such as a Serbian track in both Latin and Cyrillic.
For the first I set the srclang to "xx". For the second I use "en-literal". I don't remember what I did for the third.
WebVTT lang tag
The WebVTT lang tag is much more appropriate than the srclang attribute for annotation of the original language track. However, I am not sure how exactly it should be used in many cases.
Generous strangers have translated the Peter Piper tongue twister to Finnish as "Peter Piper poimi yhdeksän litraa (yksi 'pek') säilöttyjä paprikoita." Should "Peter Piper" and "pek" be annotated as English words?
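To make the question concrete, here is a minimal sketch of what the cue text would look like if both were annotated as English with WebVTT language spans. It shows one possible answer, not a recommendation; the `lang_span` helper is hypothetical, and only the `<lang …>` span syntax comes from WebVTT.

```python
from html import escape


def lang_span(tag, text):
    """Wrap text in a WebVTT language span for the given language tag."""
    # Escape &, < and > so the text cannot be mistaken for cue markup.
    return f"<lang {tag}>{escape(text, quote=False)}</lang>"


cue_text = (
    lang_span("en", "Peter Piper")
    + " poimi yhdeksän litraa (yksi '"
    + lang_span("en", "pek")
    + "') säilöttyjä paprikoita."
)
```

The rest of the cue would inherit whatever language applies to the track as a whole.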
I transliterated чокањче/čokanjče to English as "chuokañcheh". Should the transliteration be annotated as a foreign language? Furthermore, what language should it be annotated as? Reasonable options include Serbocroatian, Serbian, Croatian, Bosnian, Montenegrin, Kajkavian, Shtokavian, and Chakavian.
A tangential issue: Shtokavian and Chakavian do not have ISO 639 codes. Montenegrin does, but only since December 2017.
Support in different softwares
Very conveniently for me, all of the WebVTT viewer softwares that I have tried have supported almost all of the features that I have tried to use. There was only one feature that I wanted but could not use: karaoke mode. I would of course like for karaoke mode to be implemented in players, but this got me thinking about the cataloging of supported features.
I imagine that, as WebVTT is standardized and adopted, people will develop tools to catalog which implementations support which features. When this happens, I suggest that such catalogs include softwares other than web browsers, such as video players like mpv that play subtitles without HTML files.
The specification could also indicate which features are most important, so that people interested in implementing basic WebVTT support may know where to focus their efforts.
Listing of many subtitle tracks
I have tried playing my video in several browsers, and each one has failed to display all of the subtitle tracks.
One could consider this an issue in implementation except that this is a problem on all players. I suggest that the specification explicitly state that the video player must be able to handle a large list of tracks.
Audio subtitles
I plan on releasing tongue twister lessons, and I may do these as just-audio files rather than as video-and-audio files, but still with subtitles. If I were to release these today, I would reference WebVTT files in track tags within a video tag, and I would style the video element so that the subtitles would take up most of the space.
I think it would be straightforward to do this with the video tag, but I thought this style of use was noteworthy.
Playlist
One of the few complicated parts of my build system is the part where I combine the several short tongue twister videos into one video with many tongue twisters. The complicated part is the calculation of absolute subtitle timings to match the video.
Consider videos of three tongue twisters:
- The first, about pickled peppers, is six seconds long.
- The second, about red cabbage, is five seconds long.
- The third, about fish, is three seconds long.
Within each tongue twister, I specify subtitle timings relative to the beginning of the clip. For example, the first clause of the red cabbage tongue twister may start one second into the clip. See this discussion on suckless-dev for more detail.
To convert these timings from relative to the clip to relative to the full video, I render each clip, check how long the clip rendered as, and cumulatively add the durations of previous clips to the relative timings within each subsequent clip.
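The cumulative calculation can be sketched as follows. This is a minimal sketch rather than the actual build code: the function names and the `(start, end, text)` cue representation are assumptions, and the durations are passed in directly instead of being measured from rendered clips.

```python
from itertools import accumulate


def clip_offsets(durations):
    """Start time of each clip within the combined video, in seconds."""
    # Running total of the durations of all previous clips.
    return list(accumulate(durations, initial=0))[:-1]


def to_absolute(cues_per_clip, durations):
    """Shift (start, end, text) cues from clip-relative to video-relative."""
    absolute = []
    for offset, cues in zip(clip_offsets(durations), cues_per_clip):
        for start, end, text in cues:
            absolute.append((start + offset, end + offset, text))
    return absolute
```

For the three clips above, with durations of six, five, and three seconds, the offsets are 0, 6, and 11 seconds, so a red cabbage cue that starts one second into its clip starts seven seconds into the full video.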
If it were possible to specify a playlist of several videos and subtitles, where the timings of each subtitle track relate to the beginning of the corresponding video, then I would not need the above calculation. There would be other benefits too. For example, I would not need to re-encode video as much.
Debugging
When I was initially implementing my WebVTT renderer, I recall that it took me a little while to realize that my only error was to exclude the milliseconds component. This particular issue could be resolved by a change to the specification, but I think it mostly highlights the need for validation and debugging tools.
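As a small example of the kind of check that would have caught this error, here is a minimal sketch of a timestamp formatter and validator. The function names are hypothetical, and the pattern reflects my reading of the WebVTT timestamp syntax (optional hours of two or more digits, then two-digit minutes and seconds and exactly three digits after the dot) rather than a complete validator.

```python
import re

# WebVTT-style timestamp: optional hours, then mm:ss.ttt with exactly
# three digits after the dot.
TIMESTAMP = re.compile(r"(?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3}")


def format_timestamp(seconds):
    """Format a non-negative number of seconds as hh:mm:ss.ttt."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def is_valid_timestamp(text):
    """True if text is a complete WebVTT-style timestamp."""
    return TIMESTAMP.fullmatch(text) is not None
```

A timestamp written without the milliseconds component, such as `00:00:07`, is rejected by `is_valid_timestamp`, while the output of `format_timestamp(7)` is accepted.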
Conclusions
With little difficulty I managed to render my tongue twister subtitles as WebVTT format. The only thing that I really wanted but could not use was karaoke mode. It was helpful for my build system that subtitles are in separate files from the video files.
I found many of the details of WebVTT and HTML related to track language specifications to be inappropriate for my project. All players seem to ignore these for now, but I am concerned for what will happen if they start to be used for something. In summary,
- The model of IETF language codes is very different from my language catalog, and it is not clear how I should convert between them.
- A particular track does not necessarily have one main language; it may have zero languages, and it may have more than one language.
- A particular phrase within a track is not necessarily in exactly one language.
Moreover, I do not understand the purpose of these language specifications. Regardless of whether the specifications are to be changed based on the above issues, I suggest that the specification document should document the motivations for the particular data model that is specified.
I have some suggestions that could be alternatively seen as issues of specification or implementation.
- Specification that browsers must be able to present long track lists.
- Specification of a subset of WebVTT as the basic version so that small video projects may confidently implement a coherent subset of the features.
Finally, I presented a few ideas for new features, all of which go slightly beyond the scope of web-based subtitles.
- Debugging tools (not necessarily part of the specification)
- Playlists, which would make it easier to align subtitles and would reduce the need for re-encoding of videos.
- Audio-without-video subtitles (possibly already well supported)
Even with the many issues that I pointed out and without the features that I proposed, it was straightforward for me to implement basic WebVTT rendering within a video build system. I believe that that is a good sign.