AI, or Artificial Intelligence as it’s commonly known, is transforming the way we interact with computers and the world around us. This essay explores the fascinating history of AI, its current state, and the exciting future that lies ahead, with a special focus on the cutting-edge field of what we started to call Motion Prompting and two case studies: the results of residencies in which we collaborated with artists Uncharted Limbo Collective and Natan Sinigaglia. It is based on the work within the S+T+ARTS AIR project and originally written for the Motion Prompting event that took place on October 10th, 2024 at Mu Hybrid ArtHouse in Eindhoven.
What is AI?
The history of AI dates back to 1955, when a group of scientists created the Logic Theorist, considered by many to be the first true AI. This program was designed to mimic the problem-solving skills of a human, proving mathematical theorems by reasoning symbolically, much like a human would. In some cases, it even found new, more elegant proofs than those originally published. In the 1980s, following the first long AI winter of the ’60s and ’70s, expert systems like MYCIN emerged: a medical AI that could diagnose blood infections and recommend treatments. For the first time, doctors could consult a computer for a second opinion – and often, the computer was right! MYCIN could even explain its reasoning, which was a big deal at the time.
These early AIs were like savants – brilliant in very narrow domains, but utterly lost outside their specialties. They were a far cry from the flexible, general intelligence we humans possess.
Today, in 2024, AI has become ubiquitous, often without us even realizing it. From streaming platforms recommending shows, to phones unlocking with facial recognition, to chatbots answering customer service questions, AI is everywhere. And each of these relies on a different type of AI.
Modern AI systems are incredibly diverse – like the animal kingdom, to use a metaphor. We call them all animals, but the differences between cats (mammals), crocodiles (reptiles) and parrots (birds) are enormous. They belong to different groups of species. The same is true for AI, where we now have different groups.
The AI Species Atlas
Let’s take a closer look at what we call AI today. The visual below gives an overview of the most important AI species out there in 2024.
There are AIs that can generate human-like texts, engage in human-like conversations and even write code. These AI species are called LLMs, or large language models – ChatGPT and Claude, for example. When we hear about AI or try it out for the first time, this is very often the species we encounter.
Then there are AIs that can understand and interpret visual information. These computer vision models can recognize objects or faces in images and videos. Closely related are image generators such as Midjourney, which work in the other direction, turning text descriptions into images.
A third species in the web of AI are multimodal models: models that can combine multiple types of data – turning text into videos or sound into text, for instance. They are a sort of hybrid, much like a mule: half horse, half donkey. They are capable of combining the powers of other species, like LLMs and computer vision models. Maybe you have heard of GPT-4o, which is such a model. These models are very new and not yet widely used.
There are also AI models that can learn through interacting with their environment, training by observing and through trial and error. These are called reinforcement learning models. You may know one of them: AlphaGo, the model that beat the best Go player in the world some years ago. Given a strict set of rules – which is why this approach is so well suited to games – these models can become far better than humans ever could within days or sometimes hours of training. But outside of the highly complex game of Go, a model like AlphaGo currently knows absolutely nothing. It couldn’t distinguish an apple from a banana, so to speak.
The next branch of AI species is able to generate realistic synthetic data. These models are called GANs, or Generative Adversarial Networks. Synthetic data is fake data: faces that never existed, photos never taken, or music never composed. Often, when we hear stories about deep fakes, that is the work of these little buggers.
Finally, and deliberately last in this family tree, we reach the auto-encoder models, which are the AI focus within Motion Prompting. More on those later.
These AI species offer unbelievable new skills and possibilities, assisting humans in generating texts, visuals, audio, and much more. But this progress comes at a cost.
Two sides of the AI coin
AI systems, particularly large language models, are energy-hungry beasts. Training a single AI model can consume as much electricity as 200 households use in an entire year. The cooling systems for AI data centres can use millions of litres of water per day. The increasing demand for energy and water resources poses significant challenges, particularly when considering the hierarchy of water usage priorities. This hierarchy, known as the ‘displacement sequence’ or ‘water use prioritization’, traditionally places essential needs like drinking water and food production at the top. However, technologies such as AI, which require substantial amounts of water for cooling data centres, are now competing for these limited resources. This raises critical questions about how we balance the water needs for AI training against those for food production, potentially disrupting the established order of water allocation priorities.
Economically, AI is a powerhouse, projected to add trillions of dollars to the global economy in the coming years. The recent valuations of AI companies are telling: we are no longer talking about unicorns (companies valued at more than 1 billion), but hectocorns (companies valued at more than 100 billion).
It also costs a lot. In 2022, when ChatGPT was released, models typically cost €10 million or less to train. Today’s biggest models cost €100 million to train. The next generation of models is expected to cost €1 billion, and the one after that €10 billion. And these are just the hardware and energy costs. Inherently, this leads to social challenges. There are concerns about whether such expenses are justified, about job displacement, privacy issues, hidden slavery and the potential for AI to perpetuate or even amplify societal biases.
Considering the extremely fast developments, both in opportunities and threats, strengths and weaknesses of AI, it is so important that as many people as possible understand what is happening, and can relate to it. Because only then can we be a voice in co-designing these systems that will become more and more part of our daily lives in the coming years. This is why it is essential, now, to explore and develop these technologies not only technically or scientifically, but also artistically. And that is what art-driven innovation stands for.
Introducing Auto-encoders
Auto-encoders are a particular type of AI that is crucial for this discussion. To make it easier to understand how they work, picture the following: you and a friend are playing a game where you have to describe a detailed picture using only a few words, and your friend has to recreate the picture based on your description. You mention certain features and details of the picture and leave out others. In other words: you decide which high-level features and qualities of the picture are most important for your friend to be able to recreate it as closely as possible. That’s essentially what an auto-encoder does.
An auto-encoder takes complex input data – recordings of sounds in a city, with many types of sound involved, say, or body movement data from head to toes. The auto-encoder “encodes” it into a compact representation, which essentially means it tries to determine which high-level features and qualities of the data are important enough to capture. Finally, it tries to “decode” that representation back into the original input. In this process, it learns to capture the most important features of the data. What makes auto-encoders special is their ability to find patterns and structure in data without being explicitly told what to look for. This makes them incredibly versatile.
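To make the idea concrete, here is a minimal sketch of an auto-encoder, written in Python with PyTorch (our choice of library here, not necessarily what the case studies use). The 75-value pose vector, layer sizes and latent size are illustrative assumptions.

```python
# Minimal auto-encoder sketch (PyTorch assumed; all sizes are illustrative).
import torch
import torch.nn as nn

class PoseAutoencoder(nn.Module):
    def __init__(self, input_dim=75, latent_dim=8):   # e.g. 25 joints x 3 coordinates
        super().__init__()
        # Encoder: squeeze the full pose down to a few "high-level" numbers.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )
        # Decoder: try to rebuild the original pose from that compact description.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)            # the few words you give your friend
        return self.decoder(z), z      # the friend's drawing, plus the description itself

model = PoseAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

pose_batch = torch.randn(32, 75)       # stand-in for real motion-capture frames
reconstruction, latent = model(pose_batch)
loss = nn.functional.mse_loss(reconstruction, pose_batch)  # how far the drawing is from the original
loss.backward()
optimizer.step()
```

The compact code `z` plays the role of the few words in the game above: everything the decoder knows about the original has to pass through it, which is exactly why the model is forced to keep only the most important features.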
Introducing Prompting
If AI is the engine, prompting is like the steering wheel – it’s how we guide AI to do what we want. It’s like having a super-smart assistant who can do almost anything, but takes instructions very literally. Prompting is the art of giving this assistant the right instructions to get the desired outcome. For example, if you’re using an AI image generator, the prompt “a cat” might give you a very basic image of a cat. But “a regal Siamese cat sitting on a velvet throne, digital art style” would give you something much more specific and detailed.
Prompting isn’t just about being specific, though. It’s about understanding how the AI “thinks” and framing your request in a way that plays to its strengths. It’s a skill that’s becoming increasingly valuable as AI systems become more powerful.
Introducing Motion Prompting
Motion prompting is about using movement and gestures to communicate with AI systems. Instead of typing commands or speaking, we use our body language. This might sound futuristic, but you’ve probably already used a simple form of motion prompting if you’ve ever waved at your smart TV to turn it on or used a gesture-controlled video game like on the Nintendo Wii.
Now, how does it work? At its core, motion prompting works like this: sensors or cameras capture our movements, an AI model analyses these movements in real time, and finally the model interprets them as commands or information. In that way, it reaches a conclusion about whatever it was you wanted to prompt the system about. If you are a medical professional who wants to detect subtle movement deviations in your patients, you can prompt for that. If you are a dancer who wants to be presented with new possible dance moves, you can prompt for that too. And so on. The true potential of motion prompting lies in its ability to transform data.
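Schematically, that loop can be sketched as follows. Every helper in this sketch (`capture_frame`, `estimate_pose`, `interpret`) is a hypothetical stand-in for whatever sensor, pose-estimation model and downstream application a real system would use.

```python
# Schematic motion-prompting loop; all helpers are hypothetical stand-ins.
import time

def capture_frame(camera):
    """Grab the latest image or depth frame from the sensor (hypothetical)."""
    return camera.read()

def estimate_pose(frame):
    """Run a pose-estimation model, returning 3D joint positions (hypothetical)."""
    ...

def interpret(pose_history):
    """Map recent movement to a command or quality, e.g. 'fluid' or 'tense' (hypothetical)."""
    ...

def motion_prompt_loop(camera, window=30, fps=30):
    pose_history = []
    while True:
        frame = capture_frame(camera)             # 1. sensors capture our movements
        pose_history.append(estimate_pose(frame))
        pose_history = pose_history[-window:]     # keep a short sliding window of frames
        yield interpret(pose_history)             # 2.-3. the AI analyses and interprets them
        time.sleep(1 / fps)                       # roughly real time, frame by frame
```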
The two cases we present in this essay use a special type of auto-encoder to create what we call “latent space representations” of movements. Think of it like this: instead of memorizing every possible movement, the AI learns the “alphabet” of human motion. Just like how you can create any word using the 26 letters of the alphabet, the AI can understand any movement by combining these basic “motion letters.”
This approach has several advantages: It’s more efficient, allowing for faster processing and response times. It can generalize better, understanding new movements it hasn’t seen before. It can even generate new, realistic movements.
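One way to picture the “motion alphabet”: because every movement is reduced to a short latent code, two codes can be blended and decoded into a new, plausible in-between movement. A minimal sketch, reusing the hypothetical `PoseAutoencoder` from the earlier example:

```python
# Blending two movements in latent space (sketch, reusing PoseAutoencoder from above).
import torch

def blend_movements(model, pose_a, pose_b, alpha=0.5):
    """Encode two poses, mix their latent codes, and decode a new in-between pose."""
    with torch.no_grad():
        z_a = model.encoder(pose_a)
        z_b = model.encoder(pose_b)
        z_mix = (1 - alpha) * z_a + alpha * z_b   # walk between two points in latent space
        return model.decoder(z_mix)               # a movement the model was never shown

# Example: a pose halfway between two captured frames.
new_pose = blend_movements(model, pose_batch[0], pose_batch[1], alpha=0.5)
```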
The Power of Body Language
Body language is powerful because it conveys unique qualities. Up to 65% of human communication is non-verbal. Our gestures, postures, and movements often convey more than our words. They can express emotions, intentions, and ideas that are hard to put into words. Body language is also often faster than speech or typing. A quick gesture can convey an idea that might take a sentence to explain. And in many ways, it’s more natural. We gesture instinctively when we communicate, often without even realizing it.
Let’s dive into two Motion Prompting Case Studies from S+T+ARTS AIR.
Case 1: SYMBODY – Transforming Movement into Machine-Readable Insights
Natan Sinigaglia, an audiovisual artist with a strong background in dance, music, and data visualisation, has dedicated the last decade to developing installations and live shows that capture and translate sound into visual experiences. He had long wanted to do the same with body movement and dance, but was frustrated by the lack of tools that provide rich information about body movement beyond the basic 3D positions of joints.
In the Symbody project, Natan got the opportunity to explore the use of AI and machine learning to analyse body movement and understand more than just spatial positions. The goal was to create a representation that is simple enough for humans to relate to while still capturing the complexity of the input data. They wanted the AI to provide higher-level information about movement – not just the joints in motion but the nature of the movement itself, like whether it’s fluid or rigid, tense or relaxed. The hope is to create a sort of mirror for us to learn more about our movement and engage in exploration, where the AI’s output, like the colour or emission of particles, is associated with specific qualities of the movement.
To achieve this, Natan and his collaborators utilized the AIST++ dataset, which offers over 1,400 dance samples with 3D motion capture data and associated music. They developed a training pipeline specifically designed to handle this sequential data, using only sound (WAV files) and motion data (3D skeleton positions) without video. The goal was to create a visual representation of human movement that still contained (and maybe enhanced) some “subtle” or “hidden” qualities of the original movement. Although not strictly Explainable AI (XAI), as there is no specific target task being solved while explaining the solution, the project aimed to provide interpretable insights into the qualities of movement. Additionally, Natan developed a real-time visualization tool to efficiently explore and compare the generated models, representing how dataset samples are encoded within each model’s latent space. In the final months, they made significant improvements to the machine learning software, adding real-time inference capabilities and experimenting with attention layers, a technique drawn from the revolutionary Transformers architecture. Natan has also developed a real-time motion capture setup for live performance that uses Kinect sensors.
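The details of Natan’s pipeline are not spelled out here, so the following is only a rough sketch of the kind of sequence auto-encoder described above: pose sequences pass through an encoder with a self-attention layer, are compressed into a latent code, and are decoded back into motion. The use of an LSTM backbone, the pooling step and all sizes are assumptions.

```python
# Rough sketch of a sequence auto-encoder with an attention layer (all sizes assumed).
import torch
import torch.nn as nn

class MotionSequenceAutoencoder(nn.Module):
    def __init__(self, pose_dim=75, hidden=128, latent_dim=16, seq_len=120):
        super().__init__()
        self.seq_len = seq_len
        self.encoder_rnn = nn.LSTM(pose_dim, hidden, batch_first=True)
        # Self-attention lets the model weigh which moments in the sequence matter most.
        self.attention = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.to_latent = nn.Linear(hidden, latent_dim)
        self.from_latent = nn.Linear(latent_dim, hidden)
        self.decoder_rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.to_pose = nn.Linear(hidden, pose_dim)

    def forward(self, motion):                        # motion: (batch, seq_len, pose_dim)
        h, _ = self.encoder_rnn(motion)
        h, _ = self.attention(h, h, h)                # attend over the time steps
        z = self.to_latent(h.mean(dim=1))             # pool over time -> one latent code per clip
        h_dec = self.from_latent(z).unsqueeze(1).repeat(1, self.seq_len, 1)
        out, _ = self.decoder_rnn(h_dec)
        return self.to_pose(out), z                   # reconstructed motion + latent code

seq_model = MotionSequenceAutoencoder()
clips = torch.randn(4, 120, 75)                       # stand-in for AIST++-style motion clips
reconstruction, latent = seq_model(clips)
loss = nn.functional.mse_loss(reconstruction, clips)  # reconstruction error drives training
```

Latent codes produced this way are the kind of thing a visualization tool can plot, so that dataset samples with similar movement qualities end up near each other.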
“This project feels like just the beginning of our research, as it is vast and complex—like scratching the surface of a much broader and deeper universe to explore.”
Natan’s goal is to improve the machine’s ability to understand and interpret the subtle nuances and details in human movements. By doing so, he hopes to create a more intuitive and unrestricted way for people to interact with machines. This approach recognizes that our bodies and physicality are the most natural and efficient tools we have for expressing ourselves and controlling our environment. As we enhance the machine’s capability to interpret finer information in our movements, we must also consider the implications of this technology. Will it lead to a more natural and free interaction with machines, or could it be used to monitor and control us in ways we cannot yet imagine?
Case 2: Monolith – A Dialogue Between Human and Machine
Uncharted Limbo Collective (ULC), a group of four unique individuals with diverse backgrounds in architecture, computational design, software engineering, and VFX, has worked extensively with motion capture, collaborating with dancers and choreographers to create pieces that challenge the notions of what a body is, what movement is, and the trajectories of a body in motion.
As generative AI gained traction, ULC began to question whether they could present society with a version of AI that is more than just a servant behind a paywall. This led to their project Monolith, inspired by the iconic monolith in 2001: A Space Odyssey, which poses the question: Can machines truly co-perform? Can an AI system express curiosity, empathy, seduction, deception, panic, and other emotions toward a human co-performer on stage?
ULC delved into the world of human body models – representations of what a human is to a machine. They investigated skeleton-based, contour-based, and volume-based models, navigating a sea of incompatible or semi-compatible representations across different contexts. This made them realize that the definition of a human body varies significantly based on context and purpose, each leading to a different skeletal representation. That raised the question: how effective could a model be if it was trained on a limited range of human behaviours and scenarios? This inquiry led them to realize that most models are based on typical people performing everyday activities. But what about unusual individuals doing unusual things?
They therefore trained an LSTM (Long Short-Term Memory) model on their own directed dance sessions, built around a simple brief: imagine there is a monolith in front of you that can see you and communicate with you through sound and visuals. The aim was not to generate new dance moves, but to enable the machine to infer the “emotional state” of the human it interacts with. They encouraged the dancers to improvise their encounters with this entity, emphasizing that it might be afraid, hostile, deceptive, or even susceptible to seduction. Each dance clip was associated with a label or “mood”, and the LSTM network was trained on these snippets. Before training the machine on dance, the team had to define the context and understand the range of potential human responses. This iterative process of data collection and model refinement allowed them to create an AI system that could engage in emotionally responsive interactions with humans. They captured the 3D positions of the dancers’ joints using ultra-low-light RGB cameras and the GVHMR model for 3D human pose estimation.
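ULC’s exact network is not specified here, so this is only a minimal sketch of the kind of LSTM classifier described: short clips of 3D joint positions go in, a distribution over a handful of “moods” comes out. The label set, clip length and layer sizes are all assumptions.

```python
# Minimal sketch of an LSTM "mood" classifier for short dance clips (labels and sizes assumed).
import torch
import torch.nn as nn

MOODS = ["curious", "afraid", "hostile", "deceptive", "seduced"]   # illustrative label set

class MoodLSTM(nn.Module):
    def __init__(self, pose_dim=75, hidden=128, num_moods=len(MOODS)):
        super().__init__()
        self.lstm = nn.LSTM(pose_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_moods)

    def forward(self, clip):                  # clip: (batch, frames, pose_dim)
        _, (h_final, _) = self.lstm(clip)     # keep only the final hidden state
        return self.classifier(h_final[-1])   # logits over the mood labels

mood_model = MoodLSTM()
optimizer = torch.optim.Adam(mood_model.parameters(), lr=1e-3)

clips = torch.randn(8, 90, 75)                # stand-in for short motion-capture snippets
labels = torch.randint(0, len(MOODS), (8,))   # stand-in for the dancers' "mood" annotations
loss = nn.functional.cross_entropy(mood_model(clips), labels)
loss.backward()
optimizer.step()

# At performance time the same network runs on a live window of frames:
# MOODS[mood_model(live_clip).argmax().item()] would give the Monolith its inferred "mood".
```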
“No matter how you approach tracking humans and training AI, you encounter the inevitable challenge of categorization. There’s always the risk that what you need does not fit into these categories. Even a dataset of significant size must select which categories to apply to the data points.”
This raises the question: Are we, as humans, really as predictable and categorical as these AI systems assume? Or is there something fundamentally human in our ability to defy categorization and surprise even the most advanced algorithms?
ULC discovered that the artistic act occurs in the tension between expectation and reality – between what one anticipates and what unfolds in front of them. They aim to further refine their self-developed Latent Spaceship visualization tool, making it an accessible resource for artists, educators, and researchers, and to continue exploring the evolving relationship between human and digital co-performance. The future of the Monolith project lies in its potential to redefine how we perceive digital agency and better communicate how we train, interact with, and are understood by AI technologies.
Concluding remarks
AI has come a long way from its early days of proving mathematical theorems. Today, with technologies like Motion Prompting, we’re on the cusp of a new era in human-computer interaction. An era where our devices will start to understand not just our words, but our movements, our gestures, our very body language.
In both case studies, the artists and researchers showed how they used a variety of technical tools and approaches to explore the possibilities of AI and Motion Prompting. The technology behind Motion Prompting has the potential to make our interactions with machines more natural, more intuitive, and more human. It could break down barriers for people with disabilities, revolutionize how we train and perform in sports, and even help us understand and express ourselves better. However, we must also consider the ethical implications of these technologies. Motion recognition systems, if not properly regulated, could be used for mass surveillance, infringing on individual privacy rights. They could also perpetuate biases if they are trained on limited or unrepresentative datasets.
To minimize bias, it is essential to include a diverse range of individuals in the development and training of AI systems, encompassing people of different genders, ethnicities, races, abilities, and other characteristics that reflect the diversity of our society. Inclusive technology takes into account the needs and perspectives of a wide range of users, while accessible technology is designed to be usable by people with disabilities. By striving for both inclusivity and accessibility, we can create AI systems that empower and support all members of society.
But as with any powerful technology, it comes with responsibilities. We should be mindful of the energy and resources these systems consume, and vigilant about issues of privacy and consent. As our machines become better at understanding human movement, they ought to be used to empower and enable, not to surveil or control.
Looking to the future, we must ask ourselves: How can we ensure that the benefits of AI and Motion Prompting are distributed fairly across society? How can we prevent these technologies from being used to exploit or manipulate people? And how can we create a regulatory framework that encourages innovation while also protecting individual rights and promoting social good?
The future of Motion Prompting is not set in stone. It will be shaped by the choices we make, the questions we ask, and the values we, the makers, uphold.
Background on S+T+ARTS AIR
S+T+ARTS AIR is a project that harnesses the power of art, science, and technology to address the pressing challenges facing our society by making the invisible visible. By integrating art, science, and technology, we can unlock the transformative potential of the cultural sector, stimulating critical thinking and offering fresh perspectives on the future of our society in the face of technological advancements and environmental impacts.
A short glossary to navigate the case studies
● Autoencoder architecture
A neural network that compresses data and then reconstructs it, helping find patterns in complex information.
● Auto-encoder Models
AI that learns to capture the most important features of data without being explicitly told what to look for.
● Attention layers
A mechanism allowing AI to focus on specific parts of input data, similar to human attention.
● Computer Vision Models
AI that can understand and interpret visual information, recognizing objects or faces in images and videos. Closely related are image generators such as Midjourney, which create images from text.
● Generative Adversarial Networks (GANs)
AI that can generate realistic synthetic data, like faces that never existed, photos never taken, or music never composed. These are often used in deep fakes.
● Kinect sensor
A device that tracks body movements in 3D space.
● Large Language Models (LLMs)
AI that can generate human-like texts, engage in conversations, and even write code. Examples include ChatGPT and Claude.
● Latent space
A compressed representation of data within an AI model, where similar items are grouped together.
● Multimodal Models
AI that can combine multiple types of data, like turning text into videos or sound into text. These are very new, with models like GPT-4o emerging in the last two years. In Motion Prompting, a multimodal system is one that processes multiple types of input data, like movement and sound, simultaneously.
● Pipeline
A series of connected processes or steps, where the output of one step becomes the input for the next. In machine learning, a pipeline typically includes data preparation, model training, and evaluation stages. A bulk training pipeline is a system for training many AI models at once with different settings.
● Real-time inference
AI making instant interpretations as it receives new data.
● Reinforcement Learning Models
AI that learns through interacting with its environment, observing and learning through trial and error. Examples include AlphaGo, which beat the world’s best Go player.
● Sonification
Translating data into sound to perceive patterns through audio.
● Transformers architecture
A powerful AI design that has advanced natural language processing and other tasks.
PROJECT: S+T+ARTS AIR