FireEye’s Data Science and Information Operations Analysis teams
released this blog post to
coincide with our Black Hat USA 2020 Briefing
, which details how
open source, pre-trained neural networks can be leveraged to generate
synthetic media for malicious purposes. To summarize our presentation,
we first demonstrate three successive proof of concepts for how
machine learning models can be fine-tuned in order to generate
customizable synthetic media in the text, image, and audio domains.
Next, we illustrate examples in which synthetically generated media
have been weaponized for information operations (IO), as detected on
the front lines by Mandiant Threat Intelligence. Finally, we outline
challenges in detecting synthetically generated content, and lay out
potential paths forward in a future where synthetically generated
media will increasingly look, speak, and write like us.


  • Open source, pre-trained
    natural language processing, computer vision, and speech
    recognition neural networks can be weaponized for offensive
    social media-driven IO campaigns.
  • Detection,
    attribution, and response is challenging in scenarios where
    actors can anonymously generate and distribute credible fake
    content using proprietary training datasets.
  • The
    security community can and should help AI researchers,
    policy makers, and other stakeholders mitigate the harmful
    use of open source models.

Background: Synthetic Media, Generative Models, and Transfer Learning

Synthetic media is by no means a new development; methods for
manipulating media for specific agendas are as old as the media
themselves. In the 1930’s, the chief of the Soviet secret police was
photographed walking alongside Joseph Stalin before being retouched
out of an official press photo, after
he himself was arrested and executed during the Great
. Digital graphic manipulation like this became prominent
with the advent of Photoshop. Then later in the 2010’s, the term
“deepfake” was coined. While deepfake videos, including techniques
like face swapping and lip syncing, are concerning in the long term,
this blog post focuses on more basic, but we argue more believable,
synthetic media generation advancements in the text, static image, and
audio domains. Machine learning approaches for creating synthetic
media are underpinned by generative models, which have been
effectively misused to
fabricate high volume submissions to federal public comment
and clone
a voice to trick an executive into handing over $240,000

The pre-training required to produce models capable of synthetic
media generation can cost thousands of dollars, take weeks or months
of time, and require access to expensive GPU clusters. However, the
application of transfer
can drastically reduce the amount of time and effort
involved. In transfer learning, we start from a large generic model
that has been pre-trained for an initial task where copious data is
available. We then leverage the model’s acquired knowledge to train it
further on a different, smaller dataset so that it excels at a
subsequent, related task. This process of training the model further
is referred to as fine-tuning, which typically requires less
resources compared to pre-training from scratch. You can think of this
in more relatable terms—if you’re a professional tennis player, you
don’t need to completely relearn how to swing a racket in order to
excel at badminton.

By admin