2004.10151.md

Experience Grounds Language, Bisk et al., 2020

Paper, Tags: #nlp, #linguistics

Today's best systems still make mistakes that arise from a failure to relate language to the physical world it describes and to the social interactions it facilitates.

We posit that the present success of representation learning approaches trained on large text corpora can be deeply enriched from the parallel tradition of research on the contextual and social nature of language.

In this article, we consider work on the contextual foundations of language: grounding, embodiment, and social interaction. We describe a brief history and possible progression of how contextual information can factor into our representations.

We propose the notion of a World Scope (WS) as a lens through which to view progress in NLP. NLP researchers have consistently recognized the limitations of corpora in terms of coverage of language and experience. To organize this intuition, we propose five levels of World Scope, and note that most current work in NLP operates in the second (internet-scale data):

  • WS1. Corpus (our past)
  • WS2. Internet (our present)
  • WS3. Perception
  • WS4. Embodiment
  • WS5. Social

WS1: corpora and representations

The Penn Treebank (Marcus et al., 1993) is the canonical example of a sterilized subset of naturally generated language, processed and annotated for the purpose of studying representations, a perfect example of WS1. Methods noted for this scope: Hidden Markov Models, LDA.

WS2: the written world

Topics noted for this scope: domain adaptation, embeddings, crawling, BERT, transformers. We argue that continuing to expand hardware and data sizes by orders of magnitude is not the path forward.

WS3: the world of sights and sounds

Language learning needs perception. An agent observing the real world can build axioms from which to extrapolate. World knowledge forms the basis for how people make entailment and reasoning decisions. There exists strong evidence that children require grounded sensory input, not just speech, to learn language.

There has been remarkable progress in computer vision in the last decade: researchers have refined and codified the “backbone” of computer vision architectures. While a minority of language researchers make forays into computer vision, researchers in CV are often willing to incorporate language signals.

That the combination of language and vision supervision leads to models that produce clear and coherent sentences, or that their attention can translate parts of an image to words, indicates they truly are learning about language. Moving forward, we believe benchmarks incorporating auditory, tactile, and visual sensory information together with language will be crucial. Such benchmarks will facilitate modeling language meaning with respect to an experienced world.

WS4: embodiment and action

Understanding natural language requires representation of concepts derivable from perception such as color, counting, size, shape, spatial reasoning and stability. Recent work (Gardner et al., 2020) argues that these same dimensions can serve as meaningful perturbations of inputs to evaluate models’ reasoning consistency.
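The idea of using perceptual dimensions as perturbations can be sketched roughly as follows. This is a minimal illustration, not the evaluation protocol of Gardner et al. (2020): the `consistency` function, the `ContrastPair` layout, the toy world (a single small red ball), and the `color_only_model` stand-in are all assumptions made for the example. A pair counts as consistent only when the model is correct on both the original sentence and its perturbed counterpart.

```python
from typing import Callable, List, Tuple

# Hypothetical types for this sketch: a "model" is any callable that maps a
# sentence to a label string.
Model = Callable[[str], str]
# (original input, original gold label, perturbed input, perturbed gold label)
ContrastPair = Tuple[str, str, str, str]


def consistency(model: Model, pairs: List[ContrastPair]) -> float:
    """Fraction of pairs answered correctly on BOTH the original and the perturbed input."""
    if not pairs:
        return 0.0
    hits = sum(
        1
        for orig, gold, pert, pert_gold in pairs
        if model(orig) == gold and model(pert) == pert_gold
    )
    return hits / len(pairs)


def color_only_model(sentence: str) -> str:
    """Toy stand-in model (an assumption): it only 'knows' that the ball is red."""
    return "true" if "red" in sentence.split() else "false"


# Assumed toy world: one small red ball. Labels say whether the sentence holds.
pairs = [
    # Perturbation along the color dimension.
    ("the ball is red", "true", "the ball is blue", "false"),
    # Perturbation along the size dimension.
    ("the ball is small", "true", "the ball is large", "false"),
]

score = consistency(color_only_model, pairs)
print(f"consistency: {score:.2f}")
```

In this toy setup the stand-in model handles the color pair but fails the size pair, since it gets the unperturbed size sentence wrong; scoring pairs rather than single sentences is what exposes that gap.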

We see the need for embodied language with complex meaning when thinking deeply about even the most innocuous of questions:

Is an orange more like a baseball or a banana?

WS1 is likely not to have an answer beyond that the objects are common nouns that can both be held. WS2 may also capture that oranges and baseballs both roll, but is unlikely to capture the deformation strength, surface texture, or relative sizes of these objects (Elazar et al., 2019). WS3 can begin to understand the relative deformability of these objects, but is likely to confuse how much force is necessary given that baseballs are used much more roughly than oranges in widely distributed media. WS4 can appreciate the nuanced nature of the question: the orange and baseball share a similar texture and weight allowing for similar manipulation, while the orange and banana both contain peels, deform under stress, and are edible. We as humans have rich representations of these items. The words evoke a myriad of properties, and that richness allows us to reason.

As computers transition from desktops to pervasive mobile and edge devices, we must make and meet the expectation that NLP can be deployed in any of these contexts. Current representations have very limited utility in even the most basic robotic settings (Scalise et al., 2019), making collaborative robotics (Rosenthal et al., 2010) largely a domain of custom engineering rather than science.

WS5: the social world

Whenever language is used between two people, it exists in a concrete social context: status, role, intention, and countless other variables intersect at a specific point (Wardhaugh, 2011). Currently, these complexities are overlooked by selecting labels crowd workers agree on. That is, our notion of ground truth in dataset construction is based on crowd consensus bereft of social context (Zellers et al., 2020).

Social interaction is a precious signal and some NLP researchers have begun scratching the surface of it, especially for persuasion-related tasks where interpersonal relationships are key (Wang et al., 2019c; Tan et al., 2016).

Self-evaluation

You can’t learn language ...

  • ... from the radio (internet). WS2 ⊂ WS3
    • A learner cannot be said to be in WS3 if it can perform its task without sensory perception such as visual, auditory, or tactile information.
  • ... from a television. WS3 ⊂ WS4
    • A learner cannot be said to be in WS4 if the space of actions and consequences of its environment can be enumerated.
  • ... by yourself. WS4 ⊂ WS5
    • A learner cannot be said to be in WS5 if its cooperators can be replaced with cleverly pre-programmed agents to achieve the same goals.
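The three tests above are nested, so they can be read as a sequence of checks. The sketch below is one possible encoding, not something given in the paper: `WorldScope`, `LearnerProfile`, its three boolean fields, and `highest_scope` are hypothetical names introduced here. Because each test is only a necessary condition ("cannot be said to be in"), the function returns an upper bound on a learner's scope, and since none of the tests separates WS1 from WS2, WS2 is treated as the floor.

```python
from dataclasses import dataclass
from enum import IntEnum


class WorldScope(IntEnum):
    """The five World Scopes, ordered so that a lower scope is contained in a higher one."""
    CORPUS = 1
    INTERNET = 2
    PERCEPTION = 3
    EMBODIMENT = 4
    SOCIAL = 5


@dataclass
class LearnerProfile:
    """Hypothetical yes/no answers to the three self-evaluation tests."""
    needs_perception: bool        # the task cannot be performed without sensory input
    open_ended_actions: bool      # actions and consequences cannot be enumerated
    needs_real_cooperators: bool  # cooperators cannot be replaced by scripted agents


def highest_scope(profile: LearnerProfile) -> WorldScope:
    """Return the highest scope that the three tests do not rule out."""
    if not profile.needs_perception:
        return WorldScope.INTERNET    # fails the WS3 test
    if not profile.open_ended_actions:
        return WorldScope.PERCEPTION  # fails the WS4 test
    if not profile.needs_real_cooperators:
        return WorldScope.EMBODIMENT  # fails the WS5 test
    return WorldScope.SOCIAL


# Assumed example: a text-only learner whose task needs no sensory input.
text_only = LearnerProfile(
    needs_perception=False,
    open_ended_actions=False,
    needs_real_cooperators=False,
)
print(highest_scope(text_only).name)
```

The checks run in order, so a learner that fails an earlier test is never credited with a later scope, which mirrors the containment WS2 ⊂ WS3 ⊂ WS4 ⊂ WS5.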

By these definitions, most of NLP research still resides in WS2, and we believe this is at odds with the goals and origins of our science. This does not invalidate the utility or need for any of the research within NLP, but it is to say that much of that existing research targets a different goal than language learning.

Our proposed World Scopes are steep steps, and it is possible that WS5 is AI-complete. WS5 implies a persistent agent experiencing time and a personalized set of experiences. We call for and embrace the incremental, but purposeful, contextualization of language in human experience. With all that we have learned about what words can tell us and what they keep implicit, now is the time to ask: What tasks, representations, and inductive biases will fill the gaps?