Real-time transcription #4

Open · lucabeetz opened this issue Apr 7, 2023 · 66 comments

@lucabeetz

Hey, awesome package!

I wanted to ask how one could use this for on-device real-time transcription with microphone audio, similar to the whisper.objc example from the whisper.cpp package.

@cerupcat

cerupcat commented Apr 7, 2023

I'd also like to see an SFSpeechRecognizer-like API, so it could serve as an easy drop-in replacement for SFSpeechRecognizer.
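
For illustration, here is one hypothetical shape such a wrapper could take; none of these names exist in SwiftWhisper, it just mirrors SFSpeechRecognizer's task/delegate pattern:

import Foundation

// Hypothetical API sketch only: these types are not part of SwiftWhisper.
// The shape mirrors SFSpeechRecognizer's task/delegate pattern.
protocol WhisperRecognitionDelegate: AnyObject {
    // Called with partial results as audio streams in; `isFinal` marks the last one.
    func task(_ task: WhisperRecognitionTask, didHypothesize text: String, isFinal: Bool)
}

final class WhisperRecognitionTask {
    weak var delegate: WhisperRecognitionDelegate?
    func append(audioFrames: [Float]) { /* feed 16 kHz mono samples */ }
    func finish() { /* flush remaining audio and emit the final result */ }
}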

@libratiger

+1 for this feature

@fakerybakery

Yes, this would be a great feature.

@fakerybakery

Right now, I'm setting a timer to start and stop the transcription every 2 seconds, but it's not that accurate: if a word is cut off at a chunk boundary, Whisper tries to improvise, and the text often has hallucinations.
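
For reference, a minimal sketch of that timer approach, assuming SwiftWhisper's Whisper(fromFileURL:) / transcribe(audioFrames:) API; MicRecorder is a hypothetical helper that accumulates 16 kHz mono samples:

import Foundation
import SwiftWhisper

// Sketch of the 2-second timer approach described above. `MicRecorder` is a
// hypothetical helper that accumulates 16 kHz mono Float samples from the mic.
let modelURL = URL(fileURLWithPath: "ggml-base.en.bin")
let whisper = Whisper(fromFileURL: modelURL)
let recorder = MicRecorder()

let chunkTimer = Timer.scheduledTimer(withTimeInterval: 2.0, repeats: true) { _ in
    let frames = recorder.drainSamples()  // take the last ~2 s of audio, reset the buffer
    whisper.transcribe(audioFrames: frames) { result in
        if case .success(let segments) = result {
            print(segments.map(\.text).joined())
        }
    }
}
// chunkTimer.invalidate() stops the loop. Words cut at the 2 s boundary are
// exactly what produces the hallucinations described above, since Whisper
// never hears them in full.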

@jacobjiangwei

How would Whisper officially support real time? The cut-off issue is the same for the official library, correct? @fakerybakery

@cerupcat

> How would Whisper officially support real time? The cut-off issue is the same for the official library, correct? @fakerybakery

The whisper.cpp repo has examples of how to implement real-time transcription.

@fakerybakery

I think that the whisper.cpp library stores some of the previous recording history and uses that to fix the cut-off issue, but I'm not sure.

@jacobjiangwei

> How would Whisper officially support real time? The cut-off issue is the same for the official library, correct? @fakerybakery

> The whisper.cpp repo has examples of how to implement real-time transcription.

Thanks for pointing that out. Just curious why it's in Obj-C but not in a Swift version...

@fakerybakery

I don't know why, but if someone could port the example to Swift, I would really appreciate that (I'm really bad at Obj-C).

@brytonsf

brytonsf commented May 7, 2023

> I think that the whisper.cpp library stores some of the previous recording history and uses that to fix the cut-off issue, but I'm not sure.

Yep, I believe it does too – see this line (and line 245)
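
In case it helps, a rough sketch of that carry-over idea (my reading of the trick, not whisper.cpp's exact code): keep the tail of each chunk and prepend it to the next one, so a word cut at the boundary is heard whole on the following pass:

// Sketch of the carry-over idea: retain the last ~200 ms of the previous
// chunk so a word cut at the boundary is seen in full on the next pass.
let sampleRate = 16_000
let keepSamples = sampleRate / 5        // ~200 ms of trailing context

var carry: [Float] = []

func nextChunk(appending fresh: [Float]) -> [Float] {
    let chunk = carry + fresh           // old tail + new audio
    carry = Array(fresh.suffix(keepSamples))
    return chunk
}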

@barkb

barkb commented May 11, 2023

I don't have a great understanding, but it looks to me like whisper.objc stores the contents of a buffer when it fills up, then calls its transcribe function against what it just stored, while clearing the buffer and re-enqueuing it. I don't know a ton about AVFAudio, but does anyone know if you could use AVAudioEngine and AVAudioPCMBuffer to create similar functionality? I'm thinking you could call Whisper.transcribe with the buffer data if you can get that buffer data back from AVAudioEngine. Does anyone know if that would work?
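
A sketch of that idea (untested; assumes SwiftWhisper's Whisper(fromFileURL:) and transcribe(audioFrames:) API): tap the input node, resample each AVAudioPCMBuffer to the 16 kHz mono Float32 Whisper expects, and transcribe once a few seconds have accumulated:

import AVFAudio
import SwiftWhisper

// Untested sketch of the AVAudioEngine idea above.
let whisper = Whisper(fromFileURL: URL(fileURLWithPath: "ggml-base.en.bin"))
let engine = AVAudioEngine()
let input = engine.inputNode
let micFormat = input.outputFormat(forBus: 0)

// Whisper expects 16 kHz mono Float32; AVAudioConverter does the resampling.
let whisperFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                  sampleRate: 16_000, channels: 1, interleaved: false)!
let converter = AVAudioConverter(from: micFormat, to: whisperFormat)!
var pending: [Float] = []  // not thread-safe; a real implementation would synchronize

input.installTap(onBus: 0, bufferSize: 4096, format: micFormat) { buffer, _ in
    let ratio = whisperFormat.sampleRate / micFormat.sampleRate
    let capacity = AVAudioFrameCount(Double(buffer.frameLength) * ratio) + 1
    guard let out = AVAudioPCMBuffer(pcmFormat: whisperFormat, frameCapacity: capacity) else { return }

    var consumed = false
    _ = converter.convert(to: out, error: nil) { _, status in
        // Hand the tap's buffer to the converter exactly once per callback.
        if consumed { status.pointee = .noDataNow; return nil }
        consumed = true
        status.pointee = .haveData
        return buffer
    }
    pending.append(contentsOf: UnsafeBufferPointer(start: out.floatChannelData![0],
                                                   count: Int(out.frameLength)))

    if pending.count >= 5 * 16_000 {  // fire a transcription every ~5 s of audio
        let chunk = pending
        pending.removeAll()
        whisper.transcribe(audioFrames: chunk) { result in
            if case .success(let segments) = result {
                print(segments.map(\.text).joined())
            }
        }
    }
}
try? engine.start()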

@ldenoue

ldenoue commented Aug 16, 2023

@barkb did you ever find a solution to this real-time idea?

@moaljazaery

+1

@aehlke

aehlke commented Sep 15, 2023

I found this Swift implementation of streaming: https://github.com/leetcode-mafia/cheetah/blob/b7e301c0ae16df5c597b564b2126e10e532871b2/LibWhisper/stream.cpp with a Swift file inside a Swift project. It's CC0 licensed.

I couldn't tell if it uses the right config to benefit from the latest Metal/Core ML performance options, and it uses some tool that requires a brew install, so I don't know how sandbox-friendly it is.

@fakerybakery

fakerybakery commented Sep 16, 2023

> I found this Swift implementation of streaming: leetcode-mafia/cheetah@b7e301c/LibWhisper/stream.cpp with a Swift file inside a Swift project. It's CC0 licensed.
>
> I couldn't tell if it uses the right config to benefit from the latest Metal/Core ML performance options, and it uses some tool that requires a brew install, so I don't know how sandbox-friendly it is.

@aehlke

The linked app is an AI interview ... er, assistant? It listens to your audio and tries to respond with GPT-4 (it doesn't use SwiftWhisper). It uses the SDL2 library, which, according to their website:

> ... provide low level access to audio, keyboard, mouse, joystick, and graphics hardware via OpenGL and Direct3D ...

I haven't extensively researched this subject, but my interpretation is that this allows the app to listen to your system audio and transcribe it, so you don't have to install external software such as BlackHole. This leads me to believe that the library may not be necessary if the goal is to listen from the microphone, which may mean that it can be run on other devices, such as iOS.

@aehlke

aehlke commented Sep 17, 2023

@fakerybakery it looks to me like https://github.com/leetcode-mafia/cheetah/blob/b7e301c0ae16df5c597b564b2126e10e532871b2/LibWhisper/WhisperStream.swift has similarities to https://github.com/exPHAT/SwiftWhisper/blob/master/Sources/SwiftWhisper/Whisper.swift, and the latter could be extended with that logic with some effort.

@aehlke

aehlke commented Sep 18, 2023

I've ported it into SwiftWhisper here: dougzilla32/SwiftWhisper@master...lake-of-fire:SwiftWhisper:master#diff-bc90b919aba349b74638614ff99f2c0581ae2bcd8b4c2c816a9c9d93969853d0 (still untested, though). Looks like SDL can run on iOS.

@fakerybakery

Wow, thank you so much! Might it be possible to update the README to add documentation?
Also, are you planning to make a PR to merge this into the main repository?

@aehlke

aehlke commented Sep 18, 2023

No plans, but I'll update here if I test it and it works

@fakerybakery

Hi @aehlke, were you able to get it to work?

@aehlke

aehlke commented Sep 23, 2023

Haven't tried yet. I will within a week or two probably

@cgfarmer4

I have created a very poor man's version of the streaming here. It works, but the reading from the buffer queue needs quite a bit of improvement.

@aehlke

aehlke commented Oct 8, 2023

What's the downside of your queue implementation? Like, what's the cost or risk of the technical debt as you implemented it? Thanks.

@cgfarmer4

cgfarmer4 commented Oct 8, 2023

@aehlke lost fidelity. If you test using ggerganov's implementation with AudioQueue, it's a bit more accurate. I would say this implementation is like 90% good enough, though.

I haven't had time to invest in making it a truer buffer that puts audio drops back into the array; this is more of a FILO queue.

@aehlke

aehlke commented Oct 9, 2023

I tested and fixed the one I linked above. I don't have a test implementation to share but it works.

@cgfarmer4

@aehlke mind sharing a code example?

@aehlke

aehlke commented Oct 9, 2023

cheetah-main-2.zip

Here's the "Cheetah" project I linked above, locally forked to use my SwiftWhisper fork with the added SwiftWhisperStream module. I disabled most of the functions of the app; all that remains is a demo of it downloading the medium model and then showing text results as you speak. Ignore the other buttons.

@eni9889

eni9889 commented Oct 9, 2023

@aehlke this is pretty amazing. Are you using a more recent version of the code? When I try to add SwiftWhisper as a dependency from github.com/lake-of-fire/SwiftWhisper.git, I get an error that SwiftWhisperStream cannot be found.

@aehlke

aehlke commented Oct 9, 2023

https://github.com/lake-of-fire/SwiftWhisper/blob/master/Package.swift#L20 it's here...

btw this appears to work on both iOS and macOS, though I only really tested macOS. The licensing of the dependencies involved is all properly open, e.g. MIT, no GPL.

My SwiftWhisper fork is messy and could be simplified for sure, either merged into SwiftWhisper or split out as a separate thing.

@eni9889

eni9889 commented Oct 9, 2023

@aehlke my mistake, looks like I didn't actually add it to the target. Amazing work!

@cgfarmer4

cgfarmer4 commented Oct 10, 2023

Personally I think Metal > CoreML so far, but I've mostly just dissected your project and haven't pulled it into mine yet. CoreML seems to spike the CPU, but maybe I'm doing something wrong. I also completely gutted the Cheetah project and created a simpler example here for anyone else following along. One thing to note: at least on my end, the downloader isn't updating its state correctly when it finishes, but that could just be a me thing. You have to restart, and then it recognizes the file.

StreamWhisperExample.zip

@cgfarmer4

Pulled it into mine and yeah, it seems to run way better than the CoreML version I was using with my gist, and accuracy is slightly better, as expected. Since you don't even use SwiftWhisper at all, it might make sense to split this out, unless there's a clean interface for doing something like realTime = true. Given the dependency on SDL, I don't imagine there is a clean way, though.

A couple of things I think would be nice:

  1. Access to WhisperParams.
  2. A way to reset CaptureDevice so the microphone isn't always being accessed when not in use.
  3. A cleaner delegate interface for the segments, although I easily implemented a callback structure (a sketch of one possible shape follows below).

Going to try to implement this myself, but I don't know C++ very well. Up for the challenge.

Yours:
(screenshot: CPU/memory profile)

Mine has these wild CPU spikes on transcribe, but the memory footprint is similar; not sure if SwiftWhisper can run fully on the GPU.
(screenshot: CPU/memory profile showing transcription spikes)
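
Regarding item 3 in the list above, a hypothetical shape for such a segment callback (not an existing SwiftWhisper API, just a sketch):

// Hypothetical delegate for streaming segments; these names are not part of
// SwiftWhisper or the fork, just a sketch of the wished-for interface.
protocol WhisperStreamDelegate: AnyObject {
    // Called as a segment's text stabilizes; `isProvisional` means it may still be revised.
    func stream(didUpdateSegment text: String, isProvisional: Bool)
    // Called when capture stops, so the client can release the microphone.
    func streamDidStop()
}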

@ldenoue

ldenoue commented Oct 10, 2023

@cgfarmer4 when I run your StreamWhisperExample.zip, Xcode complains. Any idea how I can fix this error?
(screenshot of the Xcode error)

@cgfarmer4

@ldenoue

I think you need to update to Xcode 15 and maybe macOS 14.

swift -version
swift-driver version: 1.87.1 Apple Swift version 5.9 (swiftlang-5.9.0.128.108 clang-1500.0.40.1)
Target: x86_64-apple-macosx14.0

@aehlke

aehlke commented Oct 10, 2023

> A way to reset CaptureDevice so the microphone isn't always being accessed when not in use.

This sounds like a critical feature. I'll also look into it...

@aehlke

aehlke commented Oct 10, 2023

Looks like the Swift SDL calls needed are:

  SDL_CloseAudioDevice(deviceObj) // close the opened capture device
  SDL_Quit() // if we want to also shut down the initialization step, not just the device that was opened

@fakerybakery

Hi @aehlke @cgfarmer4, might it be possible to run this on iOS?

@aehlke

aehlke commented Oct 10, 2023

I think it works on iOS. haven't fully tested yet.

@cgfarmer4

Yeah, I can't figure it out @aehlke. I don't have enough C experience. I tried calling both of those functions, but neither closed the audio session. My hunch is that you need to delete the ctx that's created, and I couldn't figure out how to properly keep a reference to it:

From common-sdl:

audio_async::~audio_async() {
    if (m_dev_id_in) {
        SDL_CloseAudioDevice(m_dev_id_in);  // releases the capture device held by the wrapper
    }
}
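
For what it's worth, an untested Swift-side sketch of the same idea, assuming the SDL2 C API is importable from a module named SDL: keep the SDL_AudioDeviceID returned by SDL_OpenAudioDevice so it can be closed later, mirroring the destructor above:

import SDL  // assumed module name for the SDL2 C wrapper

// Untested sketch: hold on to the device id so the mic can be released later.
final class MicCapture {
    private var deviceID: SDL_AudioDeviceID = 0

    func start() {
        var desired = SDL_AudioSpec()
        desired.freq = 16_000
        desired.format = SDL_AudioFormat(0x8120)  // AUDIO_F32; the C macro may not import into Swift
        desired.channels = 1
        desired.samples = 1024
        var obtained = SDL_AudioSpec()
        // nil = default device, 1 = capture; no callback set, so audio would be
        // polled with SDL_DequeueAudio elsewhere.
        deviceID = SDL_OpenAudioDevice(nil, 1, &desired, &obtained, 0)
        guard deviceID != 0 else { return }
        SDL_PauseAudioDevice(deviceID, 0)  // 0 = unpause: start capturing
    }

    func stop() {
        guard deviceID != 0 else { return }
        SDL_CloseAudioDevice(deviceID)     // releases the microphone
        deviceID = 0
    }
}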

@aehlke

aehlke commented Oct 11, 2023

Should be able to call them from Swift via the SDL module.

@aehlke

aehlke commented Oct 13, 2023

I've cleaned up my fork here: https://github.com/lake-of-fire/SwiftWhisperStream

@aehlke

aehlke commented Oct 15, 2023

Updates:

  • Auto language detection is useless with streaming: it generates a lot of garbage and has low accuracy. You need to set a language and stick to it (or maybe be clever: detect the language as well, but update the chosen language slowly rather than with every chunk of audio).
  • Silence and noise are hard to filter out because they tend to generate hallucinated descriptions. So I'm now implementing libfvad (nearly done in my fork above) to only send audio to Whisper when a separate VAD detects speech activity; see the sketch below.
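
Here's a sketch of that VAD gate, assuming libfvad's C API (fvad_new, fvad_set_mode, fvad_set_sample_rate, fvad_process) is bridged into Swift:

// Sketch of the VAD gate, assuming libfvad's C API is bridged into Swift.
// libfvad consumes 10/20/30 ms frames of 16-bit PCM and flags speech per frame.
let vad = fvad_new()
fvad_set_mode(vad, 2)                 // aggressiveness: 0 (least) ... 3 (most)
fvad_set_sample_rate(vad, 16_000)

func containsSpeech(_ samples: [Int16]) -> Bool {
    let frameSize = 480               // 30 ms at 16 kHz
    var start = 0
    while start + frameSize <= samples.count {
        let verdict = samples[start..<(start + frameSize)].withUnsafeBufferPointer {
            fvad_process(vad, $0.baseAddress, frameSize)
        }
        if verdict == 1 { return true }  // 1 = speech, 0 = silence, -1 = error
        start += frameSize
    }
    return false
}

// Only chunks where containsSpeech(...) is true get forwarded to Whisper.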

@cerupcat

cerupcat commented Oct 15, 2023

@aehlke nice job! Do you have an example project setup anywhere? Looking to stream, but not from the device microphone (from live video).

@aehlke

aehlke commented Oct 15, 2023

You can select/specify a device via CaptureDevice. I don't have an open source demo currently, sorry; it's going into my iOS/macOS app ChatOnMac.com, which isn't fully open source at the moment. I'm almost done with my fork and will update here once it's working.

@cerupcat

Got it. It looks like your fork only supports input from a device. Is there a way to support input from raw buffers (e.g. streaming audio, video audio) that doesn't come from a device microphone?

@aehlke

aehlke commented Oct 16, 2023

It's possible, but I haven't implemented that yet. I'd like to eventually. Another option is to create a virtual device.

Edit: one more discovery: make sure Xcode's Thread Sanitizer is OFF in debug mode, otherwise accuracy plummets and CPU usage shoots up.

@cgfarmer4

@cerupcat if you're using this on macOS, you can use a loopback device: https://github.com/ExistentialAudio/BlackHole

@cerupcat

Thanks @cgfarmer4. Looking to use this for iOS though, so I need a way to pass in an audio buffer.

@aehlke

aehlke commented Oct 22, 2023

My fork isn't working on iOS 😞 Trying to understand why: everything appears to work except that there's no actual incoming audio signal, despite gaining mic permission etc.

EDIT: Never mind, it works on iOS ~

@dkapila

dkapila commented Oct 24, 2023

@aehlke thanks for your work! I'm looking to add live transcriptions to my own iOS app. Are you planning to create a PR into the SwiftWhisper repo? Or will you be keeping it as a separate fork?

@aehlke

aehlke commented Oct 24, 2023

I currently don't plan to spend more time packaging it for reuse or submitting a PR, sorry. Just trying to get it working for my own purposes, and wanted to share my work openly while I'm at it.

@fakerybakery

Hi @aehlke! Thanks for your fork! Are you planning to add some documentation to your fork?

@aehlke

aehlke commented Nov 18, 2023

@fakerybakery hi! I have no plans to improve it for reuse, sorry. It works, but it's a mess, which serves my needs. You may consider it unmaintained...

@shuaiyuhao

+1, I need this feature too!

@whydna

whydna commented Jan 18, 2024

@aehlke looks like the package https://github.com/lake-of-fire/SwiftWhisperStream at main no longer builds :(

Getting errors:

Module 'SwiftWhisperStream' was built with C++ interoperability enabled, but current compilation does not enable C++ interoperability

StreamWhisperExample-btstllfcvyxxmygdgzisjguycght/SourcePackages/checkouts/SwiftWhisper/Sources/whisper_cpp/include/common-sdl.h:6:10 'atomic' file not found

Anyone have any ideas?
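
For the first error, a possible fix sketch (an untested guess, Swift 5.9+): the consuming target may also need C++ interoperability enabled in its Package.swift, matching how the module was built:

// Untested guess at a fix for the C++ interoperability error: the target that
// imports SwiftWhisperStream must itself be compiled with C++ interop enabled.
.target(
    name: "MyApp",  // hypothetical consuming target
    dependencies: ["SwiftWhisperStream"],
    swiftSettings: [.interoperabilityMode(.Cxx)]  // requires Swift tools 5.9+
)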

@aehlke

aehlke commented Jan 18, 2024

It was building for me in my project as of a few weeks ago, but I'll take another look within a month or two.

@cgfarmer4

This is the way ... https://github.com/argmaxinc/WhisperKit

@aehlke

aehlke commented Feb 18, 2024

@cgfarmer4 it uses CoreML instead of Metal? Is that actually better now? It used to be much worse.

btw my fork supports iOS 16 & macOS 13 (maybe earlier as well)

edit: also worth noting that WhisperKit uses the Hugging Face lib, while mine/this one uses whisper.cpp. It would be cool if someone wants to package up my work better, but I can't afford time to work on it for a while. I will revisit its production use within weeks or months and might be able to open-source more then.

@cgfarmer4

cgfarmer4 commented Feb 19, 2024

@aehlke it uses the GPU + Neural Engine. It's a win that yours supports macOS 13; however, this one has a full test suite, and the models are further optimized.

https://www.takeargmax.com/blog/whisperkit

They are also working on a Metal version:

> Our Metal-based inference engine as an alternative backend to harness the GPU with more flexibility and higher performance.
