Large Language Objects
The Design of Physical AI and Generative Experiences
Large language models (LLMs) have shown an unprecedented ability to generate text, images, and code, in many ways surpassing our own capabilities and promising to have a profound impact on design and creativity. While powerful, however, these new forms of intelligence remain largely ignorant of the world outside of natural language, lacking knowledge of our objects, bodies, and physical environments.
By Marcelo Coelho, Jean-Baptiste Labrune
Sep 6, 2024
This article was originally published in ACM Interactions magazine.
Insights
- Large language objects (LLOs) are physical interfaces that extend the capabilities of large language models into the physical world.
- Due to their generative nature, LLOs offer more fluid and adaptable functionality: their behavior can be created and tailored to individual people and use cases, and interactions can progressively develop from simple to complex, better supporting both beginners and advanced users.
Large language objects (LLOs) are a new class of artifacts that extend the capabilities of LLMs into the physical world. Through general-purpose language understanding and generation, these design objects revisit and challenge traditional definitions of form and function, heralding the design of physical AI as a new frontier for design.
In this article, we describe a series of LLOs developed at the MIT Design Intelligence Lab that combine generative and discriminative models to reveal a host of new applications for AI, from new ways of experiencing music or creating and telling stories to new forms of play, communication, and creating physical forms. These LLOs provide a glimpse at a new kind of creative process that weaves together the capabilities of human and artificial intelligence from the early stages of concept development through form-finding, fabrication, and interaction.
A New Form of Intelligence
LLMs are neural network architectures designed to encode and generate human language. Trained on massive amounts of text data, they learn the statistical patterns of natural language and how humans use it, and, from simple input prompts, they can output complex and semantically coherent text. Today's state-of-the-art LLMs have billions of parameters, are trained on trillions of tokens, and require substantial computational resources and cloud infrastructure for both training and inference.
However, rapid developments in hardware architecture, coupled with software strategies such as model quantization, fine-tuning, and retrieval-augmented generation, have improved the quality of LLMs' output and are allowing them to run on increasingly affordable embedded computers. In addition, new prototyping platforms designed for parallelism and tensor operations, such as Google's Coral and Nvidia's Jetson Nano, are lowering the barrier to entry for embedded machine learning, much as Arduino and the Raspberry Pi made embedded computing accessible in the past. What were once powerful and disembodied large-scale models running on high-performance computing clusters will soon be embedded in the objects around us, with a profound impact on how we interact with and experience the physical world.
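To make this concrete, the sketch below shows what on-device inference with a quantized model might look like. The model file, generation parameters, and the llama-cpp-python bindings are our own assumptions for illustration, not tools named in the article.

```python
# Minimal sketch: local inference with a quantized model via llama-cpp-python.
# The model path and generation parameters are placeholders.
from llama_cpp import Llama

# Load a 4-bit quantized model small enough to fit in the memory of a
# Jetson- or Raspberry Pi-class board.
llm = Llama(model_path="models/llm-q4.gguf", n_ctx=2048, n_threads=4)

# A single prompt/completion round trip, entirely on-device.
result = llm(
    "Describe this object in one sentence: a ceramic mug with a cracked handle.",
    max_tokens=64,
    temperature=0.7,
)
print(result["choices"][0]["text"].strip())
```

The point of quantization here is simply that the full round trip, from prompt to completion, runs on hardware cheap and small enough to disappear inside an everyday object.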
Despite their sophistication, however, these emergent forms of intelligence remain largely ignorant of the world outside of language, lacking real-time, contextual understanding of our physical surroundings, our bodily experiences, and our social relationships. They are rarely “situated,” as defined by Lucy Suchman, since they don't interact in real time with their immediate environment [1]. In his 1990 paper “Elephants Don't Play Chess,” Rodney Brooks argued that if we want robots to perform tasks in everyday settings shared with humans, they need to be based primarily on sensory-motor couplings with the environment [2]. Or, as Jacob Browning and Yann LeCun describe it: “The problem is the limited nature of language. Once we abandon old assumptions about the connection between thought and language, it becomes clear that these systems are doomed to a shallow understanding that will never approximate the full-bodied thinking we see in humans” [3].
One area in which LLMs seem to falter is their understanding of geometric and mathematical concepts. Ask GPT-4 to design a table and it will suspend a tabletop in midair, disconnected from its legs. Correcting this requires several prompts and precise user instructions, making it a less-than-ideal replacement for CAD tools [4]. Another challenge is their lack of real-time and contextual knowledge: with an April 2023 training data cutoff, GPT-4 cannot tell us what the weather is like today in Cambridge.
Large Language Objects
LLOs are a new class of artifacts that extend the capabilities of large language models into the physical world. They act as physical interfaces for LLMs by providing physical affordances such as multimodal input, end effectors, user feedback, and, most importantly, world knowledge and real-time context.
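As a rough illustration of this architecture, the sketch below folds sensed context into the prompt on every cycle and routes the model's reply to an actuator. The read_sensors and actuate functions, the desk-lamp scenario, and the local llama-cpp-python model are hypothetical placeholders rather than a description of any specific LLO from the lab.

```python
# Schematic sketch of an LLO interaction loop: sensed, real-time context is
# injected into the prompt, and the model's reply drives physical output.
# read_sensors() and actuate() are hypothetical stand-ins for real hardware I/O.
import time
from llama_cpp import Llama

llm = Llama(model_path="models/llm-q4.gguf", n_ctx=2048)

def read_sensors() -> dict:
    # Placeholder: return whatever real-time context the object can measure,
    # e.g. temperature, ambient light, or a transcribed voice command.
    return {"temperature_c": 21.5, "light": "dim", "speech": "tell me a story"}

def actuate(response: str) -> None:
    # Placeholder: route the model's output to speakers, motors, or displays.
    print(response)

while True:
    context = read_sensors()
    prompt = (
        f"You are embedded in a desk lamp. Room temperature: {context['temperature_c']} C. "
        f"Lighting: {context['light']}. The user said: '{context['speech']}'. "
        "Respond with what you would say and how you would adjust your light."
    )
    reply = llm(prompt, max_tokens=128)["choices"][0]["text"]
    actuate(reply)
    time.sleep(5)  # Poll at a fixed interval for fresh context.
```

The essential move is that physical, real-time context enters the prompt on every cycle and the response leaves through the object's body, which is precisely what a disembodied, cloud-hosted model lacks.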
References
1. Suchman, L.A. Plans and situated actions: The problem of human-machine communication. In Learning in Doing: Social, Cognitive, and Computational Perspectives. Cambridge Univ. Press, 1987; https://api.semanticscholar.org/CorpusID:11242510
2. Brooks, R.A. Elephants don't play chess. Robotics and Autonomous Systems 6, 1–2 (Jun. 1990), 3–15; https://doi.org/10.1016/s0921-8890(05)80025-9
3. Browning, J. and LeCun, Y. AI and the limits of language. Noema. Aug. 23, 2022; https://www.noemamag.com/ai-and-the-limits-of-language