StoryToolkitAI

How to optimize the transcription process

Added 2023-05-12 07:08:35 +0000 UTC

Hello there!

We've been chatting with some of the Patrons in private and realized that sometimes the transcription process feels like it's 1992 and you're loading Lemmings on a Commodore 64 (takes very long).

Now, this might be fun in 1992 especially if you're a 10-year-old who hasn't heard of TikTok yet, but if you're dealing with hours and hours of footage and a post-production deadline, things might feel sad and depressing...

So, here are a few tips on how to optimize this - no matter what machine you're running StoryToolkit on (we'll talk about this more below):

A few tips for optimization

1. Instead of transcribing one timeline/audio file at a time, render out all your audio files in one long batch and then add them to the transcription queue.

2. Don't use a model that's too heavy for your needs. In our experience, for production-quality English audio, the medium model works quite well in most cases. But, try to test and see what works best for you. Also...

3. For long audio that varies in quality, sometimes it's best to use a smaller model and then re-transcribe only the portions with errors using a larger model: select the bad segments (key V, or click the first and last segment, then Shift+A) and then hit Re-transcribe (or key T).

4. Try to organize your audio into groups by language and quality, so that each group shares the same transcription settings. Set those settings via the Preferences window and also check "Skip Transcription Settings" there. Then add each audio file to the transcription queue, and only change the settings in the Preferences window when you get to the next group of audio (e.g. if you're switching from English to Spanish as the source language).

5. Try to leave your machine alone while it's transcribing. This might be hard, especially if you only have one computer, but the AI models used for transcription use a lot of GPU, VRAM, and RAM, so if other apps are competing for them the process might crash or just be too slow. If you can't afford a dedicated machine, we recommend...

6. Transcribing overnight. As silly as this might sound, it's a simple solution that might boost the transcription speed. If you are dealing with more than 5-10 hours of transcriptions, try to prioritize the ones that you need to have the next day. By the way, you can still add all the audio to the transcription queue (see point 1), and then exit the tool whenever you want to use your computer. The next time you start StoryToolkitAI, it resumes the transcription with the file it was last working on.

7. Last, but not least, if you're constantly dealing with many hours of footage, consider investing in an NVIDIA GPU-powered machine. Even an older-generation GPU, like a GTX 1070, will be significantly faster than an M1 Mac (and M1 Macs are already around 2x faster than Intel Macs, BTW). This last point deserves a longer explanation, which is what the next section is about.
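Before moving on: tips 1, 4 and 6 boil down to a bit of planning before you touch the queue. Here is a minimal Python sketch of that planning, just to make the idea concrete. It is not part of StoryToolkitAI and doesn't talk to the tool in any way. The `AudioFile` fields, the `plan_queue` helper, and the file names are all made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AudioFile:
    path: str             # rendered audio file (hypothetical name)
    language: str         # source language of the recording
    model: str            # model size you plan to use, e.g. "medium"
    urgent: bool = False  # do you need this transcript tomorrow?


def plan_queue(files):
    """Group files by (language, model) and put the urgent stuff first."""
    groups = defaultdict(list)
    for f in files:
        groups[(f.language, f.model)].append(f)

    # Groups that contain at least one urgent file go first.
    # sorted() is stable, so everything else keeps its original order.
    ordered = sorted(groups.items(), key=lambda kv: not any(f.urgent for f in kv[1]))

    # Inside each group, urgent files lead.
    return [
        (settings, [f.path for f in sorted(batch, key=lambda f: not f.urgent)])
        for settings, batch in ordered
    ]


files = [
    AudioFile("interview_01.wav", "English", "medium"),
    AudioFile("interview_02.wav", "Spanish", "medium", urgent=True),
    AudioFile("broll_notes.wav", "English", "medium"),
    AudioFile("interview_03.wav", "Spanish", "medium"),
]

for (language, model), paths in plan_queue(files):
    print(f"{language} / {model}: {', '.join(paths)}")
```

Each printed line is one group: set the language and model once in Preferences, add that group's files to the queue, and only change the settings when you reach the next line.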

Beyond transcriptions

From what we've seen, most Machine Learning (ML) and AI work is currently done either on TPUs (do not click that link unless you like numbers) or on NVIDIA CUDA GPUs. This is of course a generous simplification, but the main idea is that most AI modules and models available online right now are optimized to make use of these two technologies.

Unfortunately, getting TPU machines in your editing room is not as easy as it sounds (unless you're Google). So, if you're really considering making use of AI in the next years/decades/centuries - putting down some cash for an NVIDIA GPU (even an older one, as mentioned above) might not be such a bad idea...
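If you're not sure what your current machine has to offer, here is a small Python sketch that asks PyTorch which accelerator it can see. It assumes PyTorch is installed in the Python environment you run it from and simply answers "cpu" if it isn't. The `pick_device` name is ours, and the result only tells you what PyTorch detects, not which device any particular tool will end up using.

```python
def pick_device():
    """Return "cuda", "mps" or "cpu" depending on what PyTorch can see."""
    try:
        import torch  # only works if PyTorch is installed
    except ImportError:
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU with a working CUDA setup

    # The MPS backend (Apple Silicon GPUs) is missing on older PyTorch builds.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


print(pick_device())
```

If this prints "cuda", you already have the kind of GPU this section is talking about.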

Right now, most features available in the tool are optimized for the GPU because it's really hard to find decent non-GPU/TPU models in the cloud that fulfill the functions that we want them to fulfill. We are keeping our ears and eyes open all the time for optimizations, but most implementations that present themselves as faster are not really worth investing our development time in (e.g. see Whisper JAX - the authors say it's 70x faster, but tests reveal a different reality). Similarly, some solutions that give marginal gains are complicated to distribute in a standalone form or even to install manually using command line madness...

In the near future, we'll add a feature that allows ingesting video together with audio to let you search for objects/people/stuff/anything visual in your footage. This feature will work really well on CUDA GPUs, but it might not be optimal on other machines.

BTW, we're currently looking for ways to make the tool available remotely and will probably roll out an update soon that enables that functionality. This will allow users either to set up a local server for StoryToolkit operations or to rent GPU time and do operations in the cloud.

Hoping this was useful and not too long of a read!!

If you have other ideas or want to know more, feel free to get in touch.

Cheers!