Producing a deepfake - a Faceswap workflow on AWS
For a music video project I worked on, we had to produce a deepfake. I had previously made some attempts at deepfakes on a local machine with an NVIDIA GeForce GTX 1080; this time I'd be working from a MacBook Pro and an AWS account with a generous credit (the server only ended up costing about $30 per 24 hours to run).
If you want to understand more about how machine learning works, I would recommend this video.
The first thing was to choose a library. The two main players in the open-source deepfake space are Faceswap and DeepFaceLab. I went with Faceswap: the GUI and the CLI seemed more user-friendly to me, it supports multiple GPUs, and it has a solid workflow and a great, supportive community.
I had originally planned to do the initial tests on my local machine, but my MacBook Pro has an AMD graphics card 🥺. So straight to the cloud I went... I used a g4dn.xlarge EC2 instance and loaded it with the AWS Deep Learning AMI. After wrestling with the installation of Faceswap (I had some dependency issues that may have been related to using the Deep Learning AMI), I managed to get it up and running on the remote instance.
To start, we did a test run with mostly default settings and minimal intervention, just to get an idea of what was really possible with our material. For Face A I used the same footage as the target video I wanted to convert, and for Face B I recorded 60 seconds of the subject performing basic facial movements from different angles.
The preliminary results showed me the parameters I was working within. Many of my faces are in profile view, and deepfakes work much better with frontal views. My source material is also very low resolution (a 480p YouTube rip of an '80s video), which makes faces harder to detect, particularly smaller faces.
I would highly suggest doing a few tests before fully training your model. It'll also help you to tell which angles and expressions are lacking from your source footage.
My initial tests also encouraged me to define a folder structure. I found the following to be an efficient layout for deepfake projects:
```
/my-project
/my-project/src
/my-project/src/faceA/     # Source footage/images for Face A
/my-project/src/faceB/     # Source footage/images for Face B
/my-project/src/target/    # Footage to convert
/my-project/extract/faceA  # Extracted faces from Face A
/my-project/extract/faceB  # Extracted faces from Face B
/my-project/model/         # Trained model
/my-project/output/        # Converted media (final exports)
```
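On a fresh instance the whole layout can be created in one go ("my-project" is just a placeholder name; adjust to taste):

```shell
# Create the project skeleton in one shot.
mkdir -p my-project/src/faceA \
         my-project/src/faceB \
         my-project/src/target \
         my-project/extract/faceA \
         my-project/extract/faceB \
         my-project/model \
         my-project/output
```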
Face A - 6586 faces extracted
I used the 3 video clips of Robin Gibb from that era that I could find, and cut out the footage that was worth extracting.
https://www.youtube.com/watch?v=A-U058BIAGo (This is also the target video)
Face B - 7382 faces extracted
I recorded 5 minutes of footage of the subject replicating the facial movements from the target source.
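With the source clips in place, extraction is one command per identity. This is a sketch using Faceswap's extract sub-command with its input/output flags and everything else left at defaults (the paths are from the layout above); it's guarded so it only actually runs from inside a Faceswap checkout:

```shell
# Hypothetical extraction run for both identities.
if [ -f faceswap.py ]; then
  python faceswap.py extract -i my-project/src/faceA -o my-project/extract/faceA
  python faceswap.py extract -i my-project/src/faceB -o my-project/extract/faceB
  status="ran"
else
  # Not inside a Faceswap checkout: show the intent instead of failing.
  status="skipped"
  echo "run this from a Faceswap checkout (faceswap.py not found)"
fi
```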
To prepare the extracted faces for training, they need to be cleaned up. The better your source material, the better your result will be. This part of the process took about 10 hours.
- Remove any extracted faces that are of poor quality
- Remove any bad detected faces
- Regenerate alignments file to remove the deleted faces
- Manually go through each face and clean up the alignment
- Generate new masks
After extracting the faces to use for training, I cleaned and sorted them to remove any poor-quality faces and false detections. Faceswap actually has a tool that sorts the extracted faces by relevance to help you filter them out. After deleting the bad faces, you'll also need to remove them from your alignments file using the alignments tool.
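As a sketch, the sort and alignments steps look roughly like this. The sort method name and the alignments-tool flags are assumptions about the Faceswap tools CLI; check `python tools.py sort --help` and `python tools.py alignments --help` for the current options:

```shell
# Hypothetical cleanup pass after extraction.
if [ -f tools.py ]; then
  # Sort extracted faces so similar/poor detections cluster together for review.
  python tools.py sort -i my-project/extract/faceA -s face
  # After manually deleting bad faces, drop them from the alignments file too.
  # (-a = alignments file, -fc = faces folder; both flag names are assumptions.)
  python tools.py alignments -j remove-faces \
      -a my-project/src/faceA/alignments.fsa \
      -fc my-project/extract/faceA
  status="ran"
else
  status="skipped"
  echo "run this from a Faceswap checkout (tools.py not found)"
fi
```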
There are a few different types of models you can train; it all depends on the material you have and the results you want to achieve. I had time and computing power on my side, so I launched 3 instances, trained variations of models to 20,000 iterations each, and selected the best one from there. I chose to go with Dfaker using a VGG-Clear mask.
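A training run along those lines would be launched roughly like this. The -A/-B/-m/-t flags are Faceswap's faces-A, faces-B, model-dir and trainer options; treat it as a sketch and confirm against `python faceswap.py train --help`:

```shell
# Hypothetical training launch for the Dfaker trainer.
if [ -f faceswap.py ]; then
  python faceswap.py train \
      -A my-project/extract/faceA \
      -B my-project/extract/faceB \
      -m my-project/model \
      -t dfaker
  status="ran"
else
  status="skipped"
  echo "run this from a Faceswap checkout (faceswap.py not found)"
fi
```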
Note: I made an error and accidentally deleted the full timelapse permanently but I have a few captures here for demo purposes. The final model was much more defined than shown in this timelapse.
My final model was trained for 75,000 iterations. I then restarted it and ran another 10,000 iterations, during which the loss did not improve; the model was no longer learning, so I concluded that training was done.
To prepare the target footage for conversion, we need to repeat step 1 (extract) again, being as meticulous as possible with the alignments. I spent about 4 hours on ~4,100 frames to achieve my result.
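The conversion itself is then one more command. A sketch, assuming the target frames and the trained model from the steps above (flags as in Faceswap's convert sub-command; verify with `--help`):

```shell
# Hypothetical conversion of the target footage using the trained model.
if [ -f faceswap.py ]; then
  python faceswap.py convert \
      -i my-project/src/target \
      -o my-project/output \
      -m my-project/model
  status="ran"
else
  status="skipped"
  echo "run this from a Faceswap checkout (faceswap.py not found)"
fi
```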
I definitely ran into difficulties with the face detection because of the low resolution of my source footage.
The abilities of the technology are really hyped up in the media. To achieve an undetectable result, you really do need source footage that fits within the parameters of what a deepfake can work with.
For this project the aim wasn't a perfectly undetectable deepfake, so the limitations weren't a problem; in fact, they contributed to the concept. The song is a cover, and seeing artifacts from the deepfake reinforces that idea.
Processing the deepfake on AWS costs about $30 per day. If you're only doing a couple of renders a year, it's an amazing option compared to buying your own hardware. However, tasks like manual alignments are really tricky on a remote machine, so ideally you'd have a CUDA-capable graphics card in your local machine. A dedicated graphics machine with comparable specs would start at about $1,400 (about 46 days of AWS EC2 usage), plus electricity bills.
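That break-even figure is just the hardware price divided by the daily EC2 cost:

```shell
# Rough break-even: days of g4dn.xlarge usage that equal a ~$1400 dedicated box.
BOX_PRICE=1400
DAILY_EC2_COST=30
BREAK_EVEN_DAYS=$((BOX_PRICE / DAILY_EC2_COST))
echo "break-even after about ${BREAK_EVEN_DAYS} days of EC2 usage"
```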