14 Ways ML Could Improve Video
There's a long road ahead for machine learning and video, even pre-AGI
~30-minute brainstorm. I haven’t done ML engineering myself, but am an enthusiast.
ML systems are getting scarily good. I’m a fan of online video for sharing information. Video is tougher to automate than text alone, but I think many of the steps can be automated with ML that roughly exists today.
Some video AI integrations I'd like to see include:
Every old video should get automatically cleaned-up audio.
Every video and audio file online should get a generated transcript.
Each transcript should get summarized. Ideally, summaries are individualized or at least customized for viewers.
Each transcript/summary gets evaluated. We get estimates of how accurate/outdated/relevant/important/neglected/innovative the work is.
Using something like reinforcement learning, we get better at connecting people with information that is important for them to learn, so that what they get isn't too difficult, inaccessible, or redundant for them.
If you do want to watch a video, there should be automatic breaks where it interjects extra context. Like, "Clarification: This point is now outdated. We recommend skipping ahead 2min."
Videos could also have a lot of extra text annotation. Text on the side that adds extra relevant information about different scenes.
Instead of watching full 20-minute videos, AI recommends that you only watch minutes 2-5, then 10-15. It summarizes the rest with auto-generated video snippets.
Stock footage can automatically be replaced by generated footage tailored to the viewer's preferences.
Eventually, many videos will be completely autogenerated. AI figures out what information is best for you, using what methods, and creates videos on the fly.
Video is interactive. It's very easy to pause a video and ask it to change topic or answer a specific question.
Autogenerated and personalized video should be able to draw on user-provided data. So an autogenerated personality could say things like, “So, this concept would have been useful to you 5 days ago, when you had a conversation with Amelia.”
Once we get used to Virtual Reality, it might make sense to stop emphasizing 2D videos. It’s not clear how to best incorporate 3D videos into metaverse-like settings, but there are different options.
Once we get brain-computer interfaces, or at least strong camera-driven facial analysis, we could tune video content depending on signals of interest and engagement. If you start getting bored during an educational video, it could jump to a fun example.
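As a rough illustration of the "only watch minutes 2-5, then 10-15" idea above, here is a minimal sketch. It assumes some upstream model has already produced one relevance score per minute of video for a given viewer; that model, the function name recommend_segments, and the 0.5 threshold are all hypothetical placeholders, not an existing system.

```python
from typing import List, Tuple


def recommend_segments(
    scores: List[float],
    threshold: float = 0.5,  # assumed cutoff; a real system would tune this
    min_len: int = 1,        # drop runs shorter than this many minutes
) -> List[Tuple[int, int]]:
    """Return (start, end) minute ranges worth watching, with end exclusive.

    `scores[i]` is a hypothetical per-viewer relevance score for minute i.
    """
    segments: List[Tuple[int, int]] = []
    start = None  # start minute of the run currently being tracked, if any
    for minute, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = minute  # a new run of relevant minutes begins
        elif start is not None:
            # The run just ended; keep it only if it is long enough.
            if minute - start >= min_len:
                segments.append((start, minute))
            start = None
    # Close a run that extends to the end of the video.
    if start is not None and len(scores) - start >= min_len:
        segments.append((start, len(scores)))
    return segments


# Made-up scores for a 20-minute video: only minutes 2-4 and 10-14 score high.
example_scores = [0.1] * 20
for m in [2, 3, 4, 10, 11, 12, 13, 14]:
    example_scores[m] = 0.9

print(recommend_segments(example_scores))  # [(2, 5), (10, 15)]
```

Everything outside the returned ranges would be handed to the summarization step. The hard part, producing good per-viewer scores, is exactly what this sketch leaves out.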
I think video is better than raw text for many people. It's also more work and more information-dense. But much of the pipeline seems automatable to me, mostly with existing technologies. It would be a lot of engineering work, though.
Edit: I asked ChatGPT to come up with some more examples. Here are its best suggestions:
Video content analysis to automatically tag and categorize videos for easy search and discovery.
Personalized video recommendations based on the viewer's preferences and viewing history.
Automatically generated highlights and recaps of videos for quick consumption.
Personalized video speed options.
Automatic generation of video summaries for use in social media and other platforms.
Automatically generated video quizzes and interactive elements to make videos more engaging.
Automatic generation of video chapters for easy navigation and organization of long videos.
Automatic generation of video-based presentations and slideshows.
Automatic generation of video thumbnails to improve click-through rates.
Automatic removal of irrelevant or offensive content from videos using image and speech recognition.
Automatic generation of video subtitles in different font sizes and styles.
Automatic generation of video "study notes" that viewers can refer to while watching the video.
Automatic generation of video explanations and annotations to provide additional context and information.
Automatic generation of video parodies and comedic reinterpretations of existing content for entertainment purposes.
Automatic generation of video-game like interactive elements within videos, allowing viewers to make choices that affect the outcome of the story.
Automatic generation of video "choose your own adventure" style interactive videos where viewers can make choices and influence the outcome of the story.
Automatic generation of video summaries that are transformed into interactive trivia games, where viewers can test their knowledge and compete against others.
Automatic generation of video-based scavenger hunts and puzzles, where viewers have to search for hidden clues and solve challenges to progress through the video.
Automatic generation of video-based social experiments, where viewers can interact with and influence virtual characters in real-time.
Automatic generation of virtual tours and experiences, allowing viewers to explore and interact with 3D environments and virtual worlds.
Automatic generation of video-based "choose your own camera angle" experiences, where viewers can switch between different camera angles to see the action from different perspectives.
Automatic generation of video-based "choose your own level of gore" options for horror movies, giving viewers the ability to control the level of violence and gore in the video.
Automatic generation of video-based personal avatars, allowing viewers to insert themselves into the video as the main character.
Automatic generation of videos that respond to the viewer's physical movements, such as nodding or shaking their head, to progress through the story.
Automatic generation of videos that can be experienced with different perspectives, such as allowing viewers to see the video from the perspective of different characters.
Automatic generation of videos that can be experienced in different temperatures, such as allowing viewers to feel the heat or cold in the video.
Automatic generation of videos that can be experienced in different levels of immersion, such as allowing viewers to see the video from a first-person or third-person perspective.
Automatic generation of videos that can be experienced in different levels of emotional intensity, such as allowing viewers to control the level of sadness or joy in the video.
Automatic generation of videos that can be experienced in different levels of fantasy and reality, such as allowing viewers to control the level of surrealism in the video.