How Clearview developed its method for fast search on an above-billion scale database
Databases used in facial recognition are growing to previously unseen scale, which for Clearview AI created a need to develop a more efficient way to search them. Now, the company has moved to patent its new method for indexing vectors to enable database searches at scale.
‘Methods and Systems for Indexing Embedding Vectors Representing Disjoint Classes at Above-Billion Scale for Fast High-Recall Retrieval’ was filed under U.S. patent application number 18/214,782 on Tuesday.
Clearview VP of Machine Learning and Research Terence Liu explained the implications of the innovation and its patent protection in an exclusive interview with Biometric Update ahead of the filing.
The company felt that after its work on algorithm training and presentation attack detection, “the development on that side was kind of taken care of,” Liu says, “and the challenge after that was, with this new algorithm, you convert all the faces in your database into embedding vectors, and these vectors have to be stored somewhere” to be searched.
As explained in a company blog post by Liu and expanded on in conversation with Biometric Update, Clearview believes the smarter way is to index vectors so that only a small portion needs to be searched. This means “you can effectively search only a small portion of the database, finding very highly likely matches,” Liu says.
Storing a massive database like Clearview’s in CPU memory is cost-prohibitive, but searching it in disc memory introduces latency (slower responses) and reduces throughput (fewer simultaneous users served at the same response time).
“This challenge was less severe when we had, say, 3 million or 30 million, maybe 300 million images. As soon as we got to a larger database than 1 billion this became more of a research problem,” says Liu.
Fortunately, when training a neural network to recognize facial images, “to tell people apart and try to group faces of the same person very close together in this high-dimensional space,” the same embedding vectors are also effective for grouping similar faces. This is despite the fact that the process results in abstract numerical points, which do not pick out certain areas of the face for comparison.
“When you do a mathematical comparison, like a cosine similarity, similar faces will be grouped together, while different faces of different people will be separated,” Liu explains. He refers to these groups as “buckets.”
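The comparison Liu describes can be sketched with toy vectors. This is an illustration of cosine similarity in general, not Clearview's code; the example embeddings and their dimensionality are made up for clarity (real face embeddings have hundreds of dimensions).

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: 1.0 means identical
    direction, values near 0 mean unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional embeddings for illustration only.
person_a_photo1 = [0.9, 0.1, 0.2, 0.1]
person_a_photo2 = [0.85, 0.15, 0.25, 0.05]  # same person, different photo
person_b_photo = [0.1, 0.9, 0.1, 0.8]       # a different person

same = cosine_similarity(person_a_photo1, person_a_photo2)
diff = cosine_similarity(person_a_photo1, person_b_photo)
```

Because the training objective pulls embeddings of the same person together, `same` comes out far higher than `diff`, which is what makes bucketing by similarity possible.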
The result is that a “probe image’s embedding vector falls into a certain number of buckets that are very promising,” allowing the database search portion of the query to be limited to those buckets.
As described in the blog post, the new system adds “the assigner index” to the search process to identify the likely buckets for search. The patent application covers how the assigner index was created.
The probe goes to the proxy, which reaches into the assigner index to determine where to find the right buckets.
A paradigm shift in search
Taking content from expensive RAM to disc “itself underpins a complete paradigm shift,” Liu claims. It is a necessary one, as “whenever you cross some scale boundary, something has to shift.”
He places the patent application in the context of the evolution of databases and information retrieval, with vector databases as the latest in the current family. This next step is based on the ability to use an approximation “that’s due to the nature of the vectors themselves.”
The shift has received a lot of attention in the large language model community, he says, driven in part by the prominence of large-scale generative neural networks like ChatGPT.
Embeddings from language models are different from facial recognition, but the same concept applies, Liu says.
“I believe our unique contribution or innovation was surrounding embedding vectors that were trained to differ, that were trained to tell things apart. This formulation naturally applies to facial recognition, because facial recognition pushes the boundary in this formulation to the extreme.”
Extreme, because Clearview’s use case has very little metadata to use in limiting the search. Instead, the company makes use of the way the model judges similarity and differences.
Because similarity is a continuous score rather than a binary outcome (match=1, non-match=0), “using these embeddings that you already have, the challenge is indexing it effectively to limit the search scope,” Liu says.
Building a new search architecture
To develop the new system, Liu spent time researching the problem, deriving the index, and, once it was successful, building the rest, “including the C++ fusing of the open-source libraries.” A few months of tuning followed.
Liu credits open-source library providers as being instrumental to the process, in which Clearview developed the in-memory graph index used to determine which vectors belong in which buckets. That shortcut allows vectors to be stored on disc, while keeping much of the search process in memory.
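The memory/disc split can be illustrated with a minimal sketch: only a small offsets table (standing in for the in-memory index) is kept in RAM, while the vectors themselves live in a file and are read with a direct seek when a bucket is selected. The file layout and names here are illustrative assumptions, not Clearview's implementation.

```python
import os
import struct
import tempfile

DIM = 2  # toy dimensionality; real face embeddings are much larger
RECORD = struct.Struct(f"{DIM}f")  # one float32 vector per record

# Hypothetical buckets of vectors to be written to disc.
bucket_vectors = {
    0: [(0.95, 0.05), (0.90, 0.10)],
    1: [(0.10, 0.90)],
}

# Write each bucket's vectors contiguously, remembering where each starts.
path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
offsets = {}  # bucket id -> (byte offset, vector count); this small map stays in RAM
with open(path, "wb") as f:
    for bucket_id, vecs in bucket_vectors.items():
        offsets[bucket_id] = (f.tell(), len(vecs))
        for v in vecs:
            f.write(RECORD.pack(*v))

def load_bucket(bucket_id):
    """Seek straight to one bucket on disc and read only its vectors."""
    offset, count = offsets[bucket_id]
    with open(path, "rb") as f:
        f.seek(offset)
        return [RECORD.unpack(f.read(RECORD.size)) for _ in range(count)]
```

The design point is that disc reads stay proportional to the few buckets actually probed, not to the full database, while the in-memory structure stays small enough to be affordable.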
Clearview claims the change delivers an 80 percent reduction in compute cost and a tenfold improvement in throughput.
The system was deployed to production in April. It has delivered much better performance than the old system, which is being sunsetted, Liu says.
Having made the switch, he says the company is eager to share its “fundamental science and engineering work” with the biometrics and machine learning communities.