6 comments

  • LuxBennu 3 hours ago
    I run whisper large-v3 on an m2 max 96gb and even with just inference the memory gets tight on longer audio, can only imagine what fine-tuning looks like. Does the 64gb vs 96gb make a meaningful difference for gemma 4 fine-tuning or does it just push the oom wall back a bit? Been wanting to try local fine-tuning on apple silicon but the tooling gap has kept me on inference only so far.
    • weitendorf 25 minutes ago
      Hey I was literally just working on this today (I was racing ahead on an audio FT myself but OP beat me by a few hours). For audio inference, definitely try running your input through VAD first to drop junk data, as one of several possible preprocessing steps before sending the audio to the large model. You can check out how I did it here: https://github.com/accretional/vad/blob/main/pkg/vad/vad.go

      I was using https://huggingface.co/onnx-community/pyannote-segmentation-... because with ONNX, I could run it on Intel servers with vectorized instructions, locally on my Mac, AND in-browser with transformers.js

      VAD is absurdly time-effective (I think like O(10s) to segment 1hr of audio or something) and reduces the false positive rate/cost of transcription and multimodal inference since you can just pass small bits of segmented audio into another model specializing in that, then encode it as text before passing it to the expensive model.
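      To make the gating idea concrete without reading the Go: here's a minimal energy-threshold sketch in Python. This is NOT what the linked repo does (it uses a learned pyannote segmentation model via ONNX, which is far more robust than an RMS gate); the function name, frame size, and threshold here are invented for illustration only.

```python
# Hypothetical energy-based VAD sketch (illustration only -- the linked
# repo uses a pyannote segmentation model via ONNX, not this heuristic).
import numpy as np

def segment_speech(samples, sample_rate=16000, frame_ms=30, threshold=0.01):
    """Return (start_s, end_s) spans whose frame RMS energy exceeds threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    voiced = rms > threshold

    # Merge consecutive voiced frames into (start, end) spans in seconds.
    spans, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            spans.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:
        spans.append((start * frame_ms / 1000, n_frames * frame_ms / 1000))
    return spans
```

      Only the returned spans get sent on to the expensive transcription/multimodal model; silence and junk never leave this stage.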

      • MediaSquirrel 19 minutes ago
        Great minds think alike!

        Also, I had a huge head start, as I spent a month or two working on this in September 2025, shelved it and dusted it back off this weekend.

    • MediaSquirrel 3 minutes ago
      re: Whisper v3 -- how is this possible? Whisper has a 30s context window. You have to chunk it.
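      The basic workaround is fixed-size windowing. A naive sketch of the slicing (real long-form pipelines are smarter: they overlap windows, or cut at VAD/silence boundaries, so words aren't split mid-chunk):

```python
# Naive fixed-window chunking for a 30 s context model like Whisper.
# Illustrative only; production pipelines overlap windows or cut at
# silence boundaries and then merge transcripts on timestamps.
def chunk_audio(samples, sample_rate=16000, window_s=30.0):
    """Split a 1-D sample sequence into non-overlapping 30 s windows."""
    step = int(sample_rate * window_s)
    return [samples[i : i + step] for i in range(0, len(samples), step)]
```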
    • MediaSquirrel 3 hours ago
      Memory usage increases quadratically with sequence length. Therefore, using shorter sequences during fine-tuning can prevent memory explosions. On my 64GB RAM machine, I'm limited to input sequences of about 2,000 tokens, considering my average output for the fine-tuning task is around 1,000 tokens (~3k tokens total).
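      For intuition, here's a back-of-envelope sketch of why the wall moves so little: with naive (non-fused) attention, the score matrices scale with seq_len squared. The layer/head counts and dtype below are made-up placeholders, not Gemma's actual config.

```python
# Back-of-envelope: naive attention-score memory grows as seq_len**2.
# n_layers/n_heads/bytes_per are placeholder values, not a real config.
def attn_score_bytes(seq_len, n_layers=32, n_heads=16, bytes_per=2):
    """Bytes needed to materialize all (seq_len x seq_len) score matrices."""
    return n_layers * n_heads * seq_len ** 2 * bytes_per

for seq in (1000, 3000, 6000):
    gib = attn_score_bytes(seq) / 2**30
    print(f"{seq:>5} tokens -> {gib:.1f} GiB of score matrices")
```

      Doubling the sequence quadruples this term, which is why going from 64GB to 96GB buys far less than a 1.5x longer sequence.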
      • LuxBennu 1 hour ago
        Ah that makes sense, quadratic scaling is brutal. So with 96gb i'd probably get somewhere around 4-5k total sequence length before hitting the wall, which is still pretty limiting for anything multimodal. Do you do any gradient checkpointing or is that not worth the speed tradeoff at these sizes?
        • MediaSquirrel 21 minutes ago
          Haven’t tried yet. That’s on the to-do list. But good suggestion.
  • conception 1 hour ago
    I’m pretty excited about the edge gallery ios app with gemma 4 on it but it seems like they hobbled it, not giving access to intents and you have to write custom plugins for web search, etc. Does anyone have a favorite way to run these usefully? ChatMCP works pretty well but only supports models via api.
  • craze3 4 hours ago
    Nice! I've been wanting to try local audio fine-tuning. Hopefully it works with music vocals too
  • dsabanin 4 hours ago
    Thanks for doing this. Looks interesting, I'm going to check it out soon.
    • MediaSquirrel 3 hours ago
      you are welcome! It was a fun side quest
  • yousifa 3 hours ago
    This is super cool, will definitely try it out! Nice work
  • pivoshenko 3 hours ago
    nice!