Research - Dynamic speech reflex #4
Interesting line of thought from here. The issue that immediately pops up for me is "personality": how much of a pushover is the thing? Does it stop talking as soon as you make a sound? Does it only speak when spoken to?
Abstractly, it's an event that fires based on some activation threshold. The threshold should be configurable!
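A minimal sketch of that on the Node side, assuming a plain EventEmitter; the event names, config shape, and the idea of feeding it one activation score per frame are my assumptions, not anything in the repo:

```typescript
import { EventEmitter } from "events";

// Hypothetical config: both knobs are the tunable "personality".
interface ReflexConfig {
  threshold: number; // activation level above which the user counts as speaking
  holdMs: number;    // silence needed before we consider the floor open
}

class SpeechReflex extends EventEmitter {
  private lastAboveThreshold = 0;
  private floorOpen = false;

  constructor(private readonly config: ReflexConfig) {
    super();
  }

  // Feed one activation score in [0, 1] per audio frame.
  update(activation: number, nowMs: number = Date.now()): void {
    if (activation >= this.config.threshold) {
      this.lastAboveThreshold = nowMs;
      this.floorOpen = false;
      this.emit("speech"); // e.g. interrupt: stop talking when the user makes a sound
    } else if (!this.floorOpen && nowMs - this.lastAboveThreshold >= this.config.holdMs) {
      this.floorOpen = true;
      this.emit("floorOpen"); // fires once per silence span
    }
  }
}
```

A "pushover" personality is then just a low threshold and a short holdMs.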
Thoughts on reusing this? https://github.com/ggerganov/whisper.cpp/blob/1d716d6e34f3f4ba57bd9706a9258a0bdb008153/examples/stream/stream.cpp#L584-L592 If that looks good, it should be easy enough to modify the current audio stream and fire an event (actually, what event for Talk?)
This is just a high-pass filter; it would probably fire too often, on almost any kind of noise. I think we need something more specific, ML-based. But we can use the same logic, just replacing this:

```cpp
#include <cmath>   // for M_PI
#include <vector>

// First-order high-pass filter from whisper.cpp's stream example, applied in place.
void high_pass_filter(std::vector<float> & data, float cutoff, float sample_rate) {
    const float rc = 1.0f / (2.0f * M_PI * cutoff);
    const float dt = 1.0f / sample_rate;
    const float alpha = dt / (rc + dt);

    float y = data[0];

    for (size_t i = 1; i < data.size(); i++) {
        y = alpha * (y + data[i] - data[i - 1]);
        data[i] = y;
    }
}
```
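For reference, the detection logic around that filter can stay dead simple even if the scorer later becomes ML-based. A rough TypeScript sketch, loosely modeled on the energy check in whisper.cpp's stream example (the function name and parameters here are invented, not whisper.cpp's API):

```typescript
// Hypothetical energy-based end-of-speech check: compare the energy of the
// most recent slice against the energy of the whole window.
function detectSpeechEnd(
  samples: Float32Array, // mono PCM in [-1, 1], already high-pass filtered
  sampleRate: number,
  lastMs: number,        // size of the trailing slice to inspect
  threshold: number      // e.g. 0.6; lower values fire more eagerly
): boolean {
  const nLast = Math.min(samples.length, Math.floor((sampleRate * lastMs) / 1000));
  if (samples.length === 0 || nLast === 0) return false;

  let energyAll = 0;
  let energyLast = 0;
  for (let i = 0; i < samples.length; i++) {
    energyAll += Math.abs(samples[i]);
    if (i >= samples.length - nLast) energyLast += Math.abs(samples[i]);
  }
  energyAll /= samples.length;
  energyLast /= nLast;

  // If the trailing slice is much quieter than the window as a whole,
  // the speaker has probably stopped: time to fire the event.
  return energyLast <= threshold * energyAll;
}
```

An ML-based scorer could replace detectSpeechEnd without changing anything downstream.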
We could also use [BLANK_AUDIO] as a response reflex when it is transcribed. This might require shrinking the buffer size to reduce latency; I'm not sure how that is controlled right now.
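A sketch of that reflex, assuming we see the transcript segment by segment as plain strings (the segment plumbing here is hypothetical):

```typescript
// whisper.cpp emits "[BLANK_AUDIO]" for silent windows, so a blank segment
// arriving right after non-blank ones can be treated as "floor open".
let sawSpeech = false;

function onTranscriptSegment(segment: string, respond: () => void): void {
  if (segment.includes("[BLANK_AUDIO]")) {
    if (sawSpeech) {
      sawSpeech = false;
      respond(); // the user spoke and has now gone quiet
    }
  } else if (segment.trim().length > 0) {
    sawSpeech = true;
  }
}
```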
@choombaa I merged your voice detection.
Keeping the issue open, as we might make it a bit more involved.
Right now, I'm planning to initiate the response with a "vim pedal", aka a hotkey, because knowing when to respond is difficult. https://github.com/yacineMTB/talk/blob/master/index.ts#L108-L135
When humans speak to each other, we use intonation and other signals to let the other person know when the floor is open, and we also use them to signal that we want the floor.
For now, we just need some naive event that fires when the speaker stops speaking.
Is this something that we can get out of whisper.cpp's embeddings? Possibly a classifier trained on top of the embeddings?
Also, I wouldn't shy away from running a Python sidecar that takes requests from the main Node process.
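If we go that route, the Node side could stay tiny. A sketch, with the endpoint, port, and response shape all invented for illustration (assumes Node 18+ for the global fetch):

```typescript
// Hypothetical client for a Python sidecar that scores audio chunks:
// POST raw PCM bytes, get back an activation score in [0, 1].
async function scoreChunk(chunk: Buffer): Promise<number> {
  const res = await fetch("http://127.0.0.1:8077/activation", {
    method: "POST",
    headers: { "content-type": "application/octet-stream" },
    body: chunk,
  });
  const { activation } = (await res.json()) as { activation: number };
  return activation;
}
```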
What would be awesome:
Figuring out how to get either whisper.cpp, or some sidecar, to take a byte stream and output a continuous "activation function" based on the likelihood that it's time to respond.
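Tying it together, reusing the hypothetical SpeechReflex and scoreChunk sketches from the comments above, that loop could look like:

```typescript
import { Readable } from "stream";

// Hypothetical glue: mic byte stream -> sidecar activation score -> reflex events.
async function runReflexLoop(mic: Readable, reflex: SpeechReflex): Promise<void> {
  for await (const chunk of mic) {
    reflex.update(await scoreChunk(chunk as Buffer));
  }
}

// Usage: reflex.on("floorOpen", () => { /* start responding */ });
```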