This is a demonstration of audio-visual user interactions with AI assistants, much like the Google Gemini demo. It captures video and audio, converts speech to text, processes the video as individual frames, and generates responses through text-to-speech output, simulating a conversation.
- Video + Audio Capture: The application begins by capturing both video and audio inputs (a client-side sketch follows this list).
- Screenshots and Transcription: Video frames are processed as screenshots while audio is transcribed to text.
- AI Processing: Both visual and textual inputs are analyzed by the GPT-4 Vision Model, which then generates a contextual response.
- Response Generation: The AI's response is converted from text to speech, providing the user with audible feedback.
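
As a rough sketch of the capture step, the browser can grab a frame from the live `<video>` element via a canvas and record microphone audio with `MediaRecorder`. This is a minimal illustration rather than the project's actual code; the JPEG quality, recording duration, and audio MIME type are assumptions:

```ts
// Grab the current video frame as a base64 JPEG "screenshot".
// Assumes `video` is already playing a getUserMedia stream.
function captureFrame(video: HTMLVideoElement): string {
  const canvas = document.createElement("canvas");
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  canvas.getContext("2d")!.drawImage(video, 0, 0);
  return canvas.toDataURL("image/jpeg", 0.7); // data URL sent to the API route
}

// Record microphone audio for `ms` milliseconds into a Blob for transcription.
async function recordAudio(ms: number): Promise<Blob> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks: BlobPart[] = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);

  recorder.start();
  await new Promise((r) => setTimeout(r, ms));
  const stopped = new Promise((r) => (recorder.onstop = r));
  recorder.stop();
  await stopped;

  stream.getTracks().forEach((t) => t.stop()); // release the microphone
  return new Blob(chunks, { type: "audio/webm" });
}
```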
- Audio-Visual Input Processing: Captures video and audio simultaneously, leveraging these inputs for dynamic interaction.
- Speech-to-Text Transcription: Utilizes OpenAI's Whisper to accurately transcribe user speech.
- AI-Driven Image Analysis: Uses the GPT-4 Vision Model to understand and interpret visual information from the captured video frames.
- Interactive AI Responses: Generates spoken responses using text-to-speech (TTS-1); see the server-side sketch below.
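
On the server side, the three OpenAI calls can be chained roughly as shown below using the `openai` Node SDK. This is a sketch, not the project's implementation; the function name `respondTo`, the prompt layout, the `alloy` voice, and the token limit are illustrative assumptions:

```ts
import OpenAI from "openai";
import fs from "node:fs";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function respondTo(audioPath: string, frameDataUrl: string): Promise<Buffer> {
  // 1. Speech-to-text with Whisper.
  const transcript = await openai.audio.transcriptions.create({
    model: "whisper-1",
    file: fs.createReadStream(audioPath),
  });

  // 2. GPT-4 Vision: combine the transcribed question with the captured frame.
  //    Swap the model id for a current vision-capable model if this one is deprecated.
  const completion = await openai.chat.completions.create({
    model: "gpt-4-vision-preview",
    max_tokens: 300,
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: transcript.text },
          { type: "image_url", image_url: { url: frameDataUrl } },
        ],
      },
    ],
  });
  const answer = completion.choices[0].message.content ?? "";

  // 3. Text-to-speech with TTS-1; the returned audio is played back to the user.
  const speech = await openai.audio.speech.create({
    model: "tts-1",
    voice: "alloy",
    input: answer,
  });
  return Buffer.from(await speech.arrayBuffer());
}
```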
First, run the development server:
```bash
npm run dev
# or
yarn dev
# or
pnpm dev
# or
bun dev
```
Open http://localhost:3000 with your browser to see the result.
You can start editing the page by modifying `app/page.tsx`. The page auto-updates as you edit the file.
This project uses `next/font` to automatically optimize and load Inter, a custom Google Font.
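
For reference, the `next/font` setup that `create-next-app` scaffolds usually looks like the snippet below in `app/layout.tsx` (a sketch; this project's actual layout file may differ):

```tsx
// app/layout.tsx
import { Inter } from "next/font/google";
import type { ReactNode } from "react";

// next/font downloads Inter at build time and self-hosts it,
// so no runtime request is made to Google Fonts.
const inter = Inter({ subsets: ["latin"] });

export default function RootLayout({ children }: { children: ReactNode }) {
  return (
    <html lang="en">
      <body className={inter.className}>{children}</body>
    </html>
  );
}
```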