As the pandemic draws cities into lockdown and hospitals into turmoil, we have been heartbroken watching social media posts from frontline health workers expressing their exhaustion at work, their lack of contact with their families, and their desperate need to reunite with their children. These melancholic posts drove us to look for ways to reconnect frontline parents with their kids.
To shorten the distance between children and their heroic parents serving as frontline healthcare workers, we designed and implemented a simple web application, Angel's Tale, which generates videos of a parent telling bedtime stories to their kids from nothing more than a short voice recording and a selfie. A parent submits a ten-second voice sample and a selfie photo through our chat-based input interface, and our AI-driven application automatically generates five story videos, one per story, narrated in the parent's voice. Children can then choose and watch the videos once they are ready. We hope our application saves these parents precious time on the front line while letting them stay connected with their beloved children during this difficult period.
For our backend, we used pre-trained AI models for audio and video processing. For the video, we used the model defined and pre-trained in First Order Motion Model for Image Animation: given a driving video in which one of our team members reads the bedtime story, together with the parent's selfie, it generates a video of the parent telling the story by transferring the motion from the driving video. For the audio, we used a pre-trained Real-Time Voice Cloning model, which takes the parent's ten-second voice sample and the text of the story and generates audio of the parent reading that text. Finally, we combine the video and audio produced by the two models into the resulting video of the parent telling the story.
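As a rough illustration, the sketch below shows only that final combining step, assuming moviepy and assuming the two models have already written their outputs to `animated.mp4` and `cloned_voice.wav` (both filenames are hypothetical placeholders):

```python
from moviepy.editor import VideoFileClip, AudioFileClip

# "animated.mp4": silent face-animation clip produced by the First Order Motion
# Model from the parent's selfie and our driving video.
# "cloned_voice.wav": narration produced by Real-Time Voice Cloning from the
# parent's ten-second voice sample and the story text.
video = VideoFileClip("animated.mp4")
narration = AudioFileClip("cloned_voice.wav")

# Trim both streams to the same length and attach the cloned narration.
duration = min(video.duration, narration.duration)
story = video.subclip(0, duration).set_audio(narration.subclip(0, duration))

# A gentle fade-out suits a bedtime story.
story = story.fadeout(0.5).audio_fadeout(0.5)
story.write_videofile("story.mp4", codec="libx264", audio_codec="aac")
```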
For our frontend, we used Voiceflow for the chat-based user interface, which collects user information through text or voice input; we built our file-upload interface, story-selection landing page, APIs, and user database with Anvil.works, which also hosts our frontend application.
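In simplified form (the table name, column names, and endpoint path below are illustrative rather than our exact schema), the Anvil server code looks roughly like this:

```python
import anvil.server
from anvil.tables import app_tables


@anvil.server.callable
def submit_parent_media(parent_name, selfie, voice_sample):
    # Called from the file-upload form; selfie and voice_sample arrive as
    # Anvil Media objects and are stored in a Data Table for the backend.
    app_tables.submissions.add_row(
        parent_name=parent_name,
        selfie=selfie,
        voice=voice_sample,
        status="pending",
    )


@anvil.server.http_endpoint("/stories/:parent_name")
def list_stories(parent_name, **params):
    # REST-style endpoint the story-choosing page uses to look up a parent's videos.
    rows = app_tables.submissions.search(parent_name=parent_name)
    return {"stories": [row["status"] for row in rows]}
```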
We ran into several challenges:
- Designing the application's workflow was problematic. The culprit is Voiceflow's inability to accept user-uploaded audio files, so we had to implement a second frontend with Anvil.works to collect the user input needed to generate the videos.
- It was difficult to tune the complexity of the user interface to the abilities of our audience (children).
- With three components in our workflow, building APIs in both our backend and Anvil.works, connecting the API calls, and handling the asynchronous video-generation step (see the polling sketch after this list) was complex and time-consuming.
- The AI models we used are pre-trained, which makes it challenging to handle input that differs greatly from the data the models were trained on. In particular, we observed that the audio model struggles with extremely long or short sentences, or with difficult vocabulary.
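For the asynchronous step mentioned above, a simple polling pattern works; the sketch below assumes a hypothetical `/status/<job_id>` endpoint on the Colab-hosted backend, so the URL and JSON fields are illustrative only:

```python
import time
import requests

BACKEND_URL = "https://our-colab-backend.example.com"  # placeholder URL


def wait_for_videos(job_id, poll_every=30, timeout=1800):
    """Poll the backend until the five story videos for this job are ready."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = requests.get(f"{BACKEND_URL}/status/{job_id}", timeout=10).json()
        if status.get("state") == "done":
            return status["video_urls"]  # links the landing page can embed
        time.sleep(poll_every)
    raise TimeoutError(f"Videos for job {job_id} were not ready in time")
```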
Despite these challenges, there are several accomplishments we are proud of:
- We successfully leveraged three platforms (Voiceflow, Anvil.works, and Google Colab) and integrated them into a single application without any previous experience with any of them.
- We manipulated audio and video files programmatically with a variety of operations (resizing, streaming, merging, synthesizing, trimming, fade effects, and so on).
- We leveraged the Alexa Settings API from the Alexa Skills Kit to identify the user's device and its time zone. Using that time zone, we designed a family-friendly feature that only plays stories for the children during a fixed window of the day (from 8AM to 9PM); a sketch of the check follows this list.
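A minimal sketch of the time-window check, assuming the device's IANA time-zone name has already been fetched from the Alexa Settings API:

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def stories_allowed(device_time_zone: str) -> bool:
    """True if the child's local time falls within the 8AM-9PM story window.

    device_time_zone is an IANA name such as "America/New_York", as returned
    by the Alexa Settings API for the user's device.
    """
    local_now = datetime.now(ZoneInfo(device_time_zone))
    return 8 <= local_now.hour < 21
```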
Looking ahead, there is plenty we plan to do next:
- Angel's Tale is a web application built with scalability in mind. The backend currently runs on Google Colab due to the time constraints of the hackathon; we look forward to packaging it into a Docker container and hosting that container on a cloud service such as AWS.
- We currently run inference with the pre-trained weights directly on user input. The models could be improved by training them on more stories with longer sentences and more difficult vocabulary.
- We also look forward to integrating the Voiceflow and Anvil.works components into a single frontend.
Our application is built with:
- Anvil.works
- Machine learning
- Python
- TensorFlow
- UI/UX
- Voiceflow
- HTTP request/response
- Compound sound