Multi-modal Artificial Intelligence

Although each sensory system (e.g., visual, auditory, olfactory, tactile, gustatory) is generally thought of as distinct, each contributes to cognitive and linguistic function. Often, the knowledge acquired through these different systems is integrated. Consider the following concept:


Each of the media above is linked to the same concept, though each engages a different modality. Additionally, some people may link only a subset of these media to the concept. For example,

  • a monolingual Chinese speaker may not recognize the words cat or gato,
  • an illiterate person may not recognize any of the written words, and
  • a deaf person may not incorporate the audio clip into their concept.

Multi-modal Approach to Petting a Cat

Consider this situation in which you are interacting with a being with artificial intelligence.

You tell the robot, “Pet the cat.”

In order for the robot to carry out the command, it must acquire data through several modes and must understand what you are saying, which requires the following:

  • Phonetics: Speech recognition is necessary to translate audio to words.
  • Semantics: The individual words must be relatable to objects and/or actions.
  • Syntax: An understanding of English sentence structure is needed to understand what the words mean when put together.
  • Pragmatics: Ideally, the interface should infer intention—that this is more than a descriptive sentence, but a command directed to the robot.

Assuming the linguistic system is sufficient and the robot understands the command, the robot must also physically carry out the command. This requires integration of sensory and motor functions:

  • Visual recognition: Identify the object to be interacted with; in this case, the cat.

Note that a connection is established here between the word cat and a visual of a cat. But what if the object looks like a cat but isn't? For example, the Corgi below is not a cat, but looks similar to one:

One is a cat; the other is a Corgi.

How might the system disambiguate dog from cat when they are visually similar? What other sensory systems might contribute?
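One answer is to let other modalities vote alongside vision. The sketch below fuses invented visual features with an audio cue; the features, weights, and threshold are all made-up illustrations of the idea, not a real classifier:

```python
# Toy sketch of combining modalities to disambiguate cat vs. dog.
# Feature names and weights are invented for illustration.

def classify(visual, audio):
    score = 0.0
    # Visual evidence: features more typical of cats than dogs.
    score += 0.4 if visual.get("slit_pupils") else -0.4
    score += 0.2 if visual.get("retractable_claws") else -0.2
    # Auditory evidence can settle cases where vision is ambiguous.
    score += 0.6 if audio == "meow" else (-0.6 if audio == "bark" else 0.0)
    return "cat" if score > 0 else "dog"

# A Corgi that visually resembles a cat at a distance, until it barks:
print(classify({"slit_pupils": False, "retractable_claws": False},
               audio="bark"))  # -> dog
```

The point is architectural: when one sensory channel is ambiguous, evidence from another channel (sound, or even touch) can break the tie, just as it does for people.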

The robot will also need to execute motor skills in order to physically pet the cat. This includes:

  • Sensorimotor feedback: Walking toward the cat.
  • Semantics: Associate an action—and its motor control—with the action verb pet.

While this task may seem simple, remember how much the robot must know in order to perform it. Consider all of the additional nuances involved in the simple statement “Pet the cat.” How long does the robot pet the cat? How much pressure should it use? Where does it pet the cat? What if the cat moves as the robot approaches? It is a far more complex task than the three-word command implies.
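One way to make those hidden parameters explicit is to write the pet action with defaults for duration, pressure, and target location, plus a feedback loop that follows the cat if it moves. All of the values below are assumptions chosen for illustration:

```python
# Sketch of the implicit parameters in "pet the cat".
# Defaults (5 seconds, 0.5 N, the head) are invented, not canonical.

def pet(cat_position, hand_position, duration_s=5.0,
        pressure_n=0.5, target="head"):
    """Pet the cat, re-checking its position before each stroke."""
    strokes = []
    t = 0.0
    while t < duration_s:
        if cat_position != hand_position:
            # Sensorimotor feedback: move the hand to wherever
            # the cat actually is, in case it has wandered off.
            hand_position = cat_position
        strokes.append((target, pressure_n))
        t += 1.0  # assume one stroke per second
    return strokes
```

Every default in the signature is a decision the three-word command silently delegates to the robot.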

Common Misconception

A system (e.g., a robot) must be explicitly programmed to do any and all of the actions it can perform.

IF 'walk' THEN execute step...

This is not quite true, and the distinction between the statements

  1. a computer cannot do anything without being programmed to do it
  2. a computer can learn to do X, Y, or Z

is really a matter of abstraction. In fact, AI relies on this abstraction in defining what intelligence actually means.

The AI sub-field of machine learning is concerned with systems that learn patterns and relationships from data. The programming involved is oriented less toward manipulating known data and more toward implementing strategies for learning from patterns in the data. For example, rather than program a robot to recognize objects with pointy ears, whiskers, and vertically slit pupils as cats, a machine learning system would learn these features (and possibly others) as common features of cats by training over a large set of images labeled cat. Researchers are also working on unsupervised machine learning techniques, in which the system discovers structure in the data without explicitly labeled training samples.
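The contrast between "programmed" and "learned" can be shown with a minimal perceptron, one of the simplest learning algorithms. Nothing below hand-codes the rule for what makes a cat; the weights start at zero and the rule emerges from the labeled examples. The feature set and data are toy assumptions:

```python
# A perceptron learns which features indicate "cat" from labeled
# examples, rather than having the rule explicitly programmed.
# Features per example: [pointy_ears, whiskers, slit_pupils]
examples = [
    ([1, 1, 1], 1),  # cat
    ([1, 1, 0], 0),  # Corgi: pointy ears and whiskers, round pupils
    ([0, 0, 0], 0),  # not a cat
    ([1, 0, 1], 1),  # cat
]

weights = [0.0, 0.0, 0.0]
bias = 0.0

for _ in range(20):  # a few passes over the training data
    for x, label in examples:
        pred = 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0
        err = label - pred  # nonzero only on mistakes
        weights = [w + err * xi for w, xi in zip(weights, x)]
        bias += err

# The trained weights now classify a new animal on their own:
corgi = [1, 1, 0]
pred = 1 if sum(w * xi for w, xi in zip(weights, corgi)) + bias > 0 else 0
print("cat" if pred else "not a cat")  # -> not a cat
```

On this toy data the perceptron discovers that slit pupils, not pointy ears or whiskers, separate the cats from the non-cats; no one told it so. That is the abstraction the misconception overlooks: the programmer wrote the learning strategy, not the rule.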