Nice! I really like how many variations on this idea are coming out. MacWhisper used to be great, but is kinda of a buggy mess now.
I'm making my own, for personal use. I did a survey of many and they all (that I could find) skip the fundamentals.
The major issues that I've run into:
- Crash recovery. Most of these apps are incredibly buggy and crash all the time, taking the recorded audio with them. Macwhisper is incredibly bad at this.
- Disk space. Many of these apps save wav files to disk. After a few hours of meetings, you may end up with gigabytes eaten.
- Microphone bleed. People don't always use headphones, the system mic will pick up the speaker sounds, causing duplicate (approximately) transcriptions.
I've yet to find a solution that handles all these correctly, let alone having high quality transcriptions.
I will be happy to spend £10 on this. One feature question though -- does it continue transcribing the meeting even if I've turned my volume down / muted it?
I don't really recommend it. If the software is a one-time purchase, there's no need to rewrite it with an LLM. Rewriting the tokens could cost more than just $10.
It's kinda funny how frontier LLMs change the game when it comes to software. If it becomes so good to make whatever little utility you want, why would I pay 10 dollars when an AI subscription is 20 bucks and I can build way more in a month for that $20? Especially since it's very likely people on show HN have simply used AI anyway, so why would I pay for your prompts?
This looks like a good approach, though I would expect this to be a native macOs feature within 12 months -- this seems totally like it fits into their product roadmap.
Which Speech-to-Text is used? Is it possible to configure it? This might be crucial for supporting languages other than English - the model that comes built-in with macOS fails completely for German.
A lot of the available models are Whisper or Faster-Whisper derived and shared across multiple apps. The tier names are often funny... "Tiny" "base" "small" "medium" "large" "large-v2" "large-v3" "large-v3-turbo" -en only variants, etc.
In my experience, medium is often the sweet spot for English accuracy vs speed, especially if following-up with a post-processing pass. The large options are all fine, but can severely slow it down. There are some speed checks on my website if you're curious (link not posted because I don't want to hijack another post's app).
Agreed with JohnBiz, the moment flagging is interesting and unusual, and a nice contrast to passive transcription. I only recently learned about MacWhisper (I'm Windows primarily) and was floored to learn how expensive the Pro option is. Nowadays it's not so hard to have some-level of DIY transcription, so crazy that it's priced with a premium.
What's your diarization pipeline? Pyannote?
I'd taken a different approach that used a LLM clean-up pass to summarize and progressively compress the transcript for ultra-long content, but I like the idea of targeted "pay attention here" flags.
I'm making my own, for personal use. I did a survey of many and they all (that I could find) skip the fundamentals.
The major issues that I've run into:
- Crash recovery. Most of these apps are incredibly buggy and crash all the time, taking the recorded audio with them. Macwhisper is incredibly bad at this.
- Disk space. Many of these apps save wav files to disk. After a few hours of meetings, you may end up with gigabytes eaten.
- Microphone bleed. People don't always use headphones, the system mic will pick up the speaker sounds, causing duplicate (approximately) transcriptions.
I've yet to find a solution that handles all these correctly, let alone having high quality transcriptions.
Anyway, most of these apps are built around https://github.com/FluidInference/FluidAudio, if anyone is curious. Their readme has a big list of similar apps as well.
I would be more willing to purchase if it was open source and I could build from source to try it first.
Not the subsidized subs
In my experience, medium is often the sweet spot for English accuracy vs speed, especially if following-up with a post-processing pass. The large options are all fine, but can severely slow it down. There are some speed checks on my website if you're curious (link not posted because I don't want to hijack another post's app).
What's your diarization pipeline? Pyannote?
I'd taken a different approach that used a LLM clean-up pass to summarize and progressively compress the transcript for ultra-long content, but I like the idea of targeted "pay attention here" flags.
Add "open source" if you wish as well.
https://hn.algolia.com/?q=macOS+transcription