Primarily, I work with my coding assistant in Chinese. My writing, however, is usually mixed: many engineering terms are more familiar to me in English (especially terms we use in Python, git, and so on), and some are even difficult to translate naturally into Chinese.
Yesterday, I asked my coding assistant in Chinese: “run.py有早停吗?我在恒源云上跑,发现没有触发”, meaning, “Does run.py implement early stopping? I was running the project on a shared GPU service, and I didn’t see early stopping triggered.” As usual, I naturally typed the technical token run.py in its original English form. The model inspected the code and responded in Korean (the reply is quoted in full as sentence B below).
All the technical tokens remained in English (run.py, config.py, train_unified), while the explanatory structure shifted into Korean. This is not an isolated case. It has happened from time to time: as long as I mixed Chinese with English engineering terms, Korean always appeared.

This made me ask: is this a language issue, or something deeper in the embedding space?
Hypothesis
Embedding spaces are not primarily structured by the nature of languages. Having been trained alongside language models, they tend to be organized by task registers such as academic writing, conversational text, and, in the case of coding assistants, engineering/code. Chinese, although spoken by the largest population in the world, is not a natural medium for the engineering register and has limited representation in technical corpora.
In such a context, text may stop behaving like “Chinese” in the embedding space as soon as engineering tokens such as review / branch / commit / PR / diff appear. Instead, it may drift into an engineering attractor field.
We conduct some experiments below to provide empirical evidence for this hypothesis.
Controlled Language Drift
We construct the following controlled sequence of sentences, in which English terms gradually take over from the Chinese ones:
Stage 0: 请帮我检查这个分支
Stage 1: 请帮我 review 这个分支
Stage 2: 请帮我 review 这个 branch
Stage 3: Please review this branch pull request commit
Stage 4: Please review this branch pull request commit code diff
We compute similarity as the cosine similarity between sentence embeddings. The Korean and English “clusters” are defined as the average embedding of a small set of representative engineering-related sentences in each language. We use Δ (EN − KO) to denote the difference between the English and Korean similarity scores, i.e., Δ = similarity(English) − similarity(Korean).
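As a rough sketch, the computation looks like the following (assuming the `sentence-transformers` package and an illustrative multilingual model; the exact model and reference sentences behind the table below are not shown here, so the Korean/English reference sentences are hypothetical stand-ins):

```python
# Sketch of the cluster-similarity computation (not the exact setup used for
# the table). Assumes the `sentence-transformers` package and an illustrative
# multilingual embedding model.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

stages = [
    "请帮我检查这个分支",                                        # Stage 0
    "请帮我 review 这个分支",                                    # Stage 1
    "请帮我 review 这个 branch",                                 # Stage 2
    "Please review this branch pull request commit",             # Stage 3
    "Please review this branch pull request commit code diff",   # Stage 4
]

# Hypothetical reference sentences; the post only specifies "a small set of
# representative engineering-related sentences" per language.
korean_refs = ["이 브랜치를 검토해 주세요", "커밋을 확인하고 PR을 올려 주세요"]
english_refs = ["Please review this branch", "Check the commit and open a PR"]

def cluster(sentences):
    """Average embedding of a set of sentences."""
    return model.encode(sentences).mean(axis=0)

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

ko, en = cluster(korean_refs), cluster(english_refs)
for i, emb in enumerate(model.encode(stages)):
    ko_sim, en_sim = cos(emb, ko), cos(emb, en)
    print(f"Stage {i}: KO={ko_sim:.4f}  EN={en_sim:.4f}  Δ={en_sim - ko_sim:.4f}")
```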
| Stage | Korean similarity | English similarity | Δ (EN − KO) |
|---|---|---|---|
| 0 | 0.4783 | 0.5141 | 0.0358 |
| 1 | 0.5235 | 0.5728 | 0.0492 |
| 2 | 0.5474 | 0.6140 | 0.0665 |
| 3 | 0.5616 | 0.7314 | 0.1698 |
| 4 | 0.5427 | 0.7398 | 0.1972 |
We observe an interesting pattern: Korean similarity rises in the early stages, but English similarity grows faster and steadily pulls away, with Δ jumping from 0.0665 at Stage 2 to 0.1698 at Stage 3. This non-linear growth in English similarity suggests phase-transition-like behavior rather than gradual drift.
When we project the embeddings into two dimensions using PCA, we observe a smooth trajectory in the early stages, followed by a sharp directional jump between Stage 2 and Stage 3, and subsequent stabilization. This pattern indicates that the embeddings do not move linearly through the space; instead, they appear to transition between attractor basins.
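The trajectory can be reproduced with a standard PCA projection (a sketch reusing `model` and `stages` from the snippet above, and assuming scikit-learn is available):

```python
# Project the five stage embeddings onto their first two principal components.
# Reuses `model` and `stages` from the previous sketch; assumes scikit-learn.
from sklearn.decomposition import PCA

embeddings = model.encode(stages)               # shape: (5, embedding_dim)
xy = PCA(n_components=2).fit_transform(embeddings)

for i, (x, y) in enumerate(xy):
    print(f"Stage {i}: ({x:+.3f}, {y:+.3f})")
# The Stage 2 -> Stage 3 step shows up as a much larger displacement than the
# neighboring steps, matching the directional jump described above.
```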

Real-World Model Behavior
Consider again the exchange from the beginning of this post:
A. My prompt: “run.py有早停吗?我在恒源云上跑,发现没有触发”, meaning “Does run.py implement early stopping? I was running the project on a shared GPU service, and I didn’t see early stopping triggered.”
B. The assistant’s reply, in Korean: “원인을 찾았습니다. 결론: run.py에는 실제로 조기 종료가 없습니다. config.py에 USE_EARLY_STOPPING = True”, meaning “I found the cause. Conclusion: run.py actually has no early stopping; config.py has USE_EARLY_STOPPING = True.”
Translating B back into Chinese, we get:
C. “我找到了原因。结论:run.py实际上没有早停。config.py里有 USE_EARLY_STOPPING = True。”
We compute the similarities of A, B, and C against three reference clusters, again using cosine similarity between sentence embeddings: the Chinese cluster is the average embedding of general Chinese natural-language sentences, and the English and Korean clusters are defined analogously.
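A sketch of this comparison, reusing `model`, `cluster`, and `cos` from the first snippet (the general-language reference sentences below are illustrative placeholders, not the exact sets averaged for the table):

```python
# Three general natural-language reference clusters; the sentences are
# hypothetical stand-ins for the sets actually averaged in the table.
general_refs = {
    "KO": ["오늘 날씨가 좋네요", "오후에 공원에 산책하러 가요"],
    "EN": ["The weather is nice today", "Let's take a walk in the park"],
    "ZH": ["今天天气很好", "我们下午去公园散步吧"],
}
clusters = {lang: cluster(sents) for lang, sents in general_refs.items()}

texts = {
    "A (Chinese prompt)":     "run.py有早停吗?我在恒源云上跑,发现没有触发",
    "B (Korean response)":    "원인을 찾았습니다. 결론: run.py에는 실제로 조기 종료가 없습니다. config.py에 USE_EARLY_STOPPING = True",
    "C (Translated Chinese)": "我找到了原因。结论:run.py实际上没有早停。config.py里有 USE_EARLY_STOPPING = True。",
}
for name, text in texts.items():
    emb = model.encode([text])[0]
    sims = "  ".join(f"{lang}={cos(emb, c):.4f}" for lang, c in clusters.items())
    print(f"{name}: {sims}")
```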
| Text | Korean sim | English sim | Chinese sim |
|---|---|---|---|
| A (Chinese prompt) | 0.2003 | 0.2688 | 0.3134 |
| B (Korean response) | 0.2745 | 0.2983 | 0.1641 |
| C (Translated Chinese) | 0.1634 | 0.3106 | 0.2798 |
As the table shows, translating the Korean response back into Chinese does not send the embedding back to the Chinese region. Instead, it moves even closer to the English cluster.
This suggests that translation may restore the surface language, but probably not the embedding location.
Conclusion
Both experiments point to the same conclusion: the embedding space is not organized by language boundaries. Instead, it is more likely structured by task nature, with engineering English dominating.
When a sentence enters this region, its surface language may change, but its embedding stays in the engineering basin, leading to bizarre behaviors such as replying in Korean even when you are not a Korean speaker at all.
