This week, Liquid AI released two new retrieval models: LFM2.5-ColBERT-350M and LFM2.5-Embedding-350M. Both have 350M parameters, and both are the first bidirectional members of the LFM family. They build on LFM2.5-350M-Base, released in March. The pair targets fast multilingual and cross-lingual search across 11 languages, with a footprint small enough to run almost anywhere. Both are available now on Hugging Face under the LFM Open License v1.0.
LFM2.5 Retrievers
The two models share one backbone but represent text differently. LFM2.5-Embedding-350M is a dense bi-encoder. It turns each document into a single vector. Pick it when you want the fastest search and the smallest, cheapest index.
LFM2.5-ColBERT-350M is a late-interaction model. It converts each token into a vector rather than producing one vector per document. This lets it match queries token by token for higher accuracy and better generalization. The trade-off is a larger index. Pick it when accuracy matters more than storage. Its query length is capped at 32 tokens. It can also rerank a first-stage retriever’s results without building an index.
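To make the difference between the two scoring schemes concrete, here is a minimal pure-Python sketch. It is not Liquid AI's implementation: the 2-dimensional token vectors are made up for illustration, and the helper names (`dense_score`, `maxsim_score`) are hypothetical. It only shows the general idea of a single dot product per pair versus a per-token best match summed over the query.

```python
from math import sqrt


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = sqrt(dot(v, v))
    return [x / norm for x in v]


def dense_score(query_vec, doc_vec):
    """Bi-encoder scoring: one vector per text, one cosine similarity per pair."""
    return dot(normalize(query_vec), normalize(doc_vec))


def maxsim_score(query_tokens, doc_tokens):
    """Late interaction: each query token keeps its best-matching document
    token, and those best matches are summed over the query."""
    q = [normalize(t) for t in query_tokens]
    d = [normalize(t) for t in doc_tokens]
    return sum(max(dot(qt, dt) for dt in d) for qt in q)


# Toy token vectors (illustrative assumption, not real model output).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # has a match for both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]              # matches only the first query token

score_a = maxsim_score(query_tokens, doc_a)
score_b = maxsim_score(query_tokens, doc_b)
```

In this toy case `doc_a` outscores `doc_b` because every query token finds a close counterpart in it, which is the token-level matching described above.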
Both target short-context search. Good fits include product catalogs, FAQ knowledge bases, and support docs. Liquid AI positions both as drop-in replacements in an existing RAG pipeline.
The Architecture Change: Causal to Bidirectional
Both models start from LFM2.5-350M-Base, a mid-trained general-purpose checkpoint. Liquid AI applies a small set of bidirectional patches to the LFM2 architecture. These adapt it from a causal decoder to a bidirectional encoder.
In a causal setup, each token uses only itself and previous tokens. That suits left-to-right generation but is less natural for retrieval. The team replaces the causal attention mask with a bidirectional one, so every token can attend to both left and right context. They also make the LFM2 short convolutions non-causal. These mix local information symmetrically around each token, not only from the past.
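The two changes can be illustrated with a small, self-contained sketch. This is not the LFM2 code: the kernel size of 3, the all-ones kernel, and the function names are assumptions chosen only to show how a causal mask and a causal convolution differ from their bidirectional counterparts.

```python
def attention_mask(n, causal):
    """mask[i][j] is True when token i may attend to token j."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]


def conv1d(x, kernel, causal):
    """Length-preserving 1-D convolution with zero padding.

    causal=True:  output i sees only x[i-k+1 .. i] (the past and itself).
    causal=False: output i sees a window centred on i (odd kernel assumed).
    """
    k = len(kernel)
    left = k - 1 if causal else (k - 1) // 2
    right = 0 if causal else k - 1 - left
    padded = [0.0] * left + list(x) + [0.0] * right
    return [sum(kernel[j] * padded[i + j] for j in range(k)) for i in range(len(x))]


x = [0.0, 0.0, 1.0, 0.0, 0.0]  # a single impulse at position 2
kernel = [1.0, 1.0, 1.0]       # illustrative kernel, not a trained one

causal_out = conv1d(x, kernel, causal=True)      # impulse reaches positions 2, 3, 4
symmetric_out = conv1d(x, kernel, causal=False)  # impulse reaches positions 1, 2, 3
```

With the causal variant, the impulse only influences its own position and later ones. With the non-causal variant, it spreads equally to both neighbours, which is the symmetric mixing described above.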
This preserves the LFM2 backbone’s efficiency while producing the full-context representations that retrieval needs. Each model has 17 layers: 10 convolution, 6 attention, and 1 pooling or dense. Context length reaches 32,768 tokens, though the models are tuned for documents of up to 512 tokens. From the shared encoder, the two models differ only in output. Embedding uses CLS-style pooling to produce one 1024-dim vector. ColBERT keeps 128-dim per-token embeddings for MaxSim late interaction.
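The output dimensions explain the index-size trade-off. The back-of-the-envelope sketch below assumes raw, uncompressed FP16 storage and a 256-token document; both are assumptions for illustration. Real late-interaction indexes such as PLAID compress token vectors, so the on-disk gap is typically smaller than this raw figure.

```python
BYTES_PER_FP16 = 2


def dense_bytes(dim=1024):
    """Raw storage for one document as a single 1024-dim vector."""
    return dim * BYTES_PER_FP16


def colbert_bytes(num_tokens, dim=128):
    """Raw storage for one document as one 128-dim vector per token."""
    return num_tokens * dim * BYTES_PER_FP16


doc_tokens = 256                      # assumed document length
dense = dense_bytes()                 # 1024 * 2 bytes
late = colbert_bytes(doc_tokens)      # 256 * 128 * 2 bytes
ratio = late / dense                  # how many times larger the raw ColBERT entry is
```

Under these assumptions the per-token representation is larger than the single vector once a document exceeds 8 tokens (8 x 128 = 1024 values), and the gap grows linearly with document length.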
Training and Data
Both models follow the same three-stage recipe:
- Stage one is large-scale contrastive pretraining in English.
- Stage two is multilingual and cross-lingual distillation from a strong teacher across all 11 languages.
- Stage three is final fine-tuning on hard-mined negatives.
The Embedding model receives slightly more cross-lingual data than ColBERT. Cross-lingual retrieval emerges more naturally in the late-interaction setup. Training data combines curated internal data with open-source English retrieval datasets. LLM-based translation expands the multilingual and cross-lingual pairs.
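The article does not specify the loss functions used in these stages. As a generic illustration of what contrastive training with in-batch and hard negatives optimizes, here is a sketch of the widely used InfoNCE objective. The temperature value and the toy similarity matrices are assumptions, and nothing here should be read as Liquid AI's actual recipe.

```python
from math import exp, log


def info_nce(sim_matrix, temperature=0.05):
    """Mean cross-entropy where the positive for query i is document i.

    sim_matrix[i][j] is the similarity between query i and document j;
    every off-diagonal entry acts as a negative.
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + log(sum(exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy similarity matrices (made up for illustration).
well_separated = [[0.9, 0.1], [0.2, 0.8]]   # positives clearly win
hard_negative = [[0.9, 0.85], [0.2, 0.8]]   # query 0 has a near-miss negative
confused = [[0.5, 0.5], [0.5, 0.5]]         # model cannot tell pairs apart

loss_easy = info_nce(well_separated)
loss_hard = info_nce(hard_negative)
loss_confused = info_nce(confused)
```

The near-miss negative raises the loss even though the positive still ranks first, which is why hard-mined negatives give a stronger training signal than random ones.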
Benchmark
Liquid AI evaluated two capabilities. The first is multilingual retrieval with NanoBEIR. The second is cross-lingual open-domain QA with MKQA-11. Both report results across all 11 languages: Arabic, German, English, Spanish, French, Italian, Japanese, Korean, Norwegian, Portuguese, and Swedish.
On average, both models lead their class. The comparison details are:
| Model | Type | NanoBEIR ML (NDCG@10) | MKQA-11 (Recall@20) |
|---|---|---|---|
| LFM2.5-ColBERT-350M | late interaction | 0.605 | 0.694 |
| LFM2.5-Embedding-350M | dense | 0.577 | 0.691 |
| Qwen/Qwen3-Embedding-0.6B | dense | 0.556 | 0.638 |
| LFM2-ColBERT-350M | late interaction | 0.540 | 0.646 |
| Alibaba-NLP/gte-multilingual-base | dense | 0.528 | 0.675 |
| lightonai/GTE-ModernColBERT-v1 | late interaction | 0.489 | 0.459 |
| BAAI/bge-large-en-v1.5 | dense | 0.359 | 0.413 |
ColBERT leads on both averages. Embedding is close behind on MKQA-11 at 0.691 versus 0.694. Both beat Qwen3-Embedding-0.6B, a larger model. The new ColBERT also improves on the earlier LFM2-ColBERT-350M, from 0.540 to 0.605 on NanoBEIR. Liquid AI also notes that NanoBEIR English tracks the more expensive full BEIR. The two stay highly correlated, with NanoBEIR scoring a near-constant ~15% higher. The research team therefore uses NanoBEIR as a practical proxy during training runs.
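For readers unfamiliar with the two metrics in the table, the sketch below implements their textbook definitions in plain Python. The document IDs and relevance labels are invented, and the exact evaluation protocol of each benchmark may differ (for example, some QA evaluations count a hit when any top-k passage contains the answer), so treat this as an approximation of the idea rather than the benchmark code.

```python
from math import log2


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(rel / log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """relevance maps doc id -> graded relevance (missing ids count as 0)."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_ids, relevant_ids, k=20):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


# Hypothetical ranking for one query.
ranked = ["d3", "d1", "d7", "d2"]
relevance = {"d1": 1, "d2": 1}

ndcg = ndcg_at_k(ranked, relevance, k=10)
recall_top2 = recall_at_k(ranked, relevance.keys(), k=2)
recall_top20 = recall_at_k(ranked, relevance.keys(), k=20)
```

NDCG rewards placing relevant documents near the top, while recall only asks whether they appear within the cutoff at all, which is why the two columns can rank models differently.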
Latency and Edge Deployment
Liquid AI released GGUF variants for llama.cpp. These let both models run on CPUs, laptops, and edge devices. The figures below were measured on a MacBook Pro M4 Max at FP16. Queries are 32 tokens; documents are 256 tokens.
| Model | Stage | Docs cached | p50 |
|---|---|---|---|
| LFM2.5-Embedding-350M | Query embedding | yes | 7.3 ms |
| LFM2.5-ColBERT-350M | Query embedding + MaxSim | yes | 8.2 ms |
| LFM2.5-ColBERT-350M | Query + Doc embedding + MaxSim | no | 34.3 ms |
When document embeddings are pre-computed, median (p50) query latency stays under 10 ms. Encoding documents at query time pushes ColBERT to 34.3 ms. For enterprise scale, Liquid AI also built an internal GPU stack. On an H100 at FP16, it observes latencies as low as 1 ms. Embedding query latency there is 1.5 ms p50.
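To reproduce a p50 figure on your own hardware, a simple timing harness is enough. The sketch below is an assumption about how one might measure it, not Liquid AI's benchmark script: `fake_encode` is a stand-in workload, and the run and warm-up counts are arbitrary.

```python
import statistics
import time


def measure_latencies_ms(fn, runs=50, warmup=5):
    """Call fn repeatedly and return per-call wall-clock latencies in ms."""
    for _ in range(warmup):  # warm-up calls are not recorded
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples):
    """p50 is the median; the mean is included to show outlier sensitivity."""
    return {"p50": statistics.median(samples), "mean": statistics.mean(samples)}


def fake_encode():
    # Stand-in workload; replace with a real query-encoding call.
    sum(i * i for i in range(10_000))


stats = summarize(measure_latencies_ms(fake_encode))
```

The median is reported because a single slow call (a cache miss, a background process) can inflate the mean without reflecting typical query latency.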
Use Cases With Examples
- E-commerce: Search a product catalog across many languages with one index. A shopper types a Korean query and the system surfaces an English product listing. Cross-lingual retrieval makes this work without per-language indexes.
- FAQ and support knowledge bases: Retrieve the right answer reliably across customer-facing surfaces. A French support question maps to an English help article.
- On-device semantic search: Search files, emails, and notes locally on consumer hardware. The GGUF build keeps data on the device at near-zero cost.
- Enterprise knowledge assistants: Retrieve internal legal, financial, and technical documents across languages. ColBERT suits this when answer accuracy outranks index size.
Code: Getting Started
The Embedding model runs through sentence-transformers. Always pass the asymmetric prompts, query: and document:. Omitting them silently degrades retrieval quality.
from sentence_transformers import SentenceTransformer

# Load the dense bi-encoder.
model = SentenceTransformer(
    "LiquidAI/LFM2.5-Embedding-350M",
    trust_remote_code=True,
)

queries = ["What is the capital of France?"]
documents = ["Paris is the capital and largest city of France."]

# Use the asymmetric prompts: "query" for queries, "document" for documents.
q_emb = model.encode(queries, prompt_name="query", normalize_embeddings=True)
d_emb = model.encode(documents, prompt_name="document", normalize_embeddings=True)

# Embeddings are normalized, so the dot product is the cosine similarity.
scores = q_emb @ d_emb.T  # shape: (n_queries, n_documents)
The ColBERT model runs through PyLate. Its PLAID index uses FastPLAID for efficient similarity search.
from pylate import indexes, models, retrieve

model = models.ColBERT(
    model_name_or_path="LiquidAI/LFM2.5-ColBERT-350M",
    trust_remote_code=True,
)
model.tokenizer.pad_token = model.tokenizer.eos_token

# Build a PLAID index and add the per-token document embeddings.
index = indexes.PLAID(index_folder="pylate-index", index_name="index", override=True)
docs_emb = model.encode(["document 1 text", "document 2 text"], is_query=False)
index.add_documents(documents_ids=["1", "2"], documents_embeddings=docs_emb)

# Encode the query and retrieve the top 10 matches.
retriever = retrieve.ColBERT(index=index)
q_emb = model.encode(["a search query"], is_query=True)
scores = retriever.retrieve(queries_embeddings=q_emb, k=10)
To rerank the output of an existing first-stage pipeline instead, skip the index and use rank.rerank.
from pylate import models, rank

model = models.ColBERT(model_name_or_path="LiquidAI/LFM2.5-ColBERT-350M", trust_remote_code=True)

queries = ["query A"]
documents = [["candidate doc 1", "candidate doc 2"]]  # one candidate list per query
documents_ids = [[1, 2]]

q_emb = model.encode(queries, is_query=True)
d_emb = model.encode(documents, is_query=False)

# Rerank each query's own candidates; no index is needed.
reranked = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=q_emb,
    documents_embeddings=d_emb,
)
You can also fine-tune either model on your own data. The Embedding model card provides snippets using sentence-transformers and MultipleNegativesRankingLoss.
Key Takeaways
- Liquid AI’s LFM2.5-ColBERT-350M and LFM2.5-Embedding-350M are the first bidirectional LFMs, built for multilingual search across 11 languages.
- Both 350M models lead their class on NanoBEIR and MKQA-11, beating the larger Qwen3-Embedding-0.6B.
- Embedding gives the smallest, cheapest index; ColBERT trades a larger index for higher accuracy through per-token matching.
- GGUF builds run on CPUs, laptops, and edge devices via llama.cpp, with cached p50 query latency under 10 ms.
- They drop into existing RAG pipelines through sentence-transformers and PyLate, under the LFM Open License v1.0.