Vokenizing: A new way to give AI language models much-needed common sense
A pair of U.S. researchers say they have improved language understanding in AI using a more-efficient and scalable new technique that they call vokenization.
The University of North Carolina team’s novel method riffs on the term “token,” which refers to the units of text — words or word pieces — used to pretrain language models. A voken is a token matched with a contextually relevant visual input: an image.
Vokenization does away with the need to write captions for every image in every image data set — a task best suited for an infinite number of monkeys sitting at keyboards for eternity. It also provides visual context for language models, something they cannot get on their own when dealing with a malleable and often confusing medium such as the English language.
“This model takes language tokens as input and uses token-related images as visual supervision,” the authors write in their research paper. A sentence becomes a sequence of tokens in a “vokenizer,” which outputs a relevance score for each token-image pair within the context of the whole sentence.
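The matching step can be pictured with a minimal sketch. Assuming we already have contextual embeddings for each token and embeddings for a pool of candidate images (both hypothetical here — the paper's actual models and scoring function are not reproduced), each token is assigned the image whose relevance score is highest:

```python
# Toy sketch of voken assignment — NOT the authors' code.
# Assumes precomputed token embeddings (contextual) and image embeddings.

def relevance(token_vec, image_vec):
    """Dot-product relevance between a token embedding and an image embedding."""
    return sum(t * i for t, i in zip(token_vec, image_vec))

def vokenize(token_vecs, image_vecs):
    """For each token, return the index of the best-matching image (its voken)."""
    vokens = []
    for tv in token_vecs:
        scores = [relevance(tv, iv) for iv in image_vecs]
        vokens.append(scores.index(max(scores)))
    return vokens

# Toy 2-D embeddings: two tokens, two candidate images.
tokens = [[1.0, 0.0], [0.0, 1.0]]
images = [[0.9, 0.1], [0.2, 0.8]]
print(vokenize(tokens, images))  # first token pairs with image 0, second with image 1
```

The resulting token-image pairs (the vokens) then serve as the visual supervision signal during language-model pretraining.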
In more concrete terms, this kind of grounding could make it easier for AI systems to operate autonomously and to explain to a human what they are doing in their environment.
“Our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks such as GLUE, SQuAD, and SWAG,” the authors report.