Cross-Lingual Transfer Learning in Large Language Models: Evaluating Zero-Shot Performance Across Low-Resource Languages

The advent of large language models (LLMs) has fundamentally reshaped the landscape of natural language processing (NLP). Trained on vast multilingual corpora, models such as GPT-4, LLaMA, and PaLM exhibit a remarkable emergent capability: they can perform tasks in languages they were never explicitly fine-tuned for, a phenomenon known as zero-shot cross-lingual transfer [1]. This capability holds profound promise for democratizing AI, particularly for the vast majority of the world’s roughly 7,000 languages that are considered low-resource—lacking the large-scale, annotated datasets required for traditional supervised training [2]. This article critically evaluates the mechanisms, current performance, and persistent challenges of cross-lingual transfer in LLMs, with a specific focus on its efficacy for low-resource languages.

The Mechanisms of Cross-Lingual Transfer

Cross-lingual transfer learning in LLMs does not rely on explicit parallel translation data for every language pair. Instead, it emerges from the model’s pre-training on a multilingual, and often web-scraped, text corpus. Several interconnected mechanisms underpin this capability:

Shared Subword Representations and Semantic Alignment

Modern LLMs predominantly use subword tokenization algorithms like Byte-Pair Encoding (BPE) or SentencePiece [3]. These algorithms create a shared vocabulary across languages, where semantically similar concepts—even across different scripts—can map to overlapping or proximate regions in the model’s high-dimensional embedding space. For instance, the embeddings for “dog” (English), “perro” (Spanish), and “犬” (Japanese) become aligned through contextual co-occurrence patterns in the training data, a process facilitated by the transformer architecture’s self-attention mechanism [4].
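To make the subword-overlap idea concrete, the toy sketch below segments English and Spanish cognates against a single shared vocabulary. The greedy longest-match tokenizer and the invented vocabulary are deliberate simplifications of real BPE/SentencePiece training, but they show how related surface forms across languages can end up sharing units:

```python
# Invented shared vocabulary; a real BPE/SentencePiece vocabulary is
# learned from corpus statistics, not hand-written like this.
TOY_VOCAB = {"inter", "nation", "nacion", "al"}

def tokenize(word, vocab=TOY_VOCAB):
    """Greedy longest-match segmentation into subwords (a BPE-like sketch)."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest candidate first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])          # fall back to a raw character
            i += 1
    return pieces

# The cognates share the "inter" and "al" units, giving the model a
# cross-lingual anchor at the representation level.
print(tokenize("international"))   # ['inter', 'nation', 'al']
print(tokenize("internacional"))   # ['inter', 'nacion', 'al']
```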

The Role of English as a Pivotal Language

A significant, and often critiqued, aspect of current LLMs is the central role of English. The pre-training corpora are overwhelmingly dominated by English text, making it a high-resource “anchor” language. In zero-shot transfer, a model is typically prompted or given an example (in-context learning) in English, and then asked to perform the task in a target language. The model effectively uses its internal, cross-lingual representations to project the task understanding from English onto the target language [5]. This creates an implicit translation pathway within the model’s parameters, bypassing the need for explicit translation.
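The English-pivot pattern described above can be sketched as a prompt template: the task instruction is phrased in English while the input is in the target language. The function name, labels, and Swahili example are invented for illustration, and the actual model call is omitted:

```python
def zero_shot_prompt(instruction, text, language):
    """English-pivot zero-shot prompt: the task is described in English,
    while the input to classify is in the target language. (The template
    here is an illustrative sketch, not a fixed API.)"""
    return (f"{instruction}\n\n"
            f"Text ({language}): {text}\n"
            f"Label:")

prompt = zero_shot_prompt(
    "Classify the sentiment of the following text as positive or negative.",
    "Huduma ilikuwa nzuri sana!",   # Swahili: roughly "The service was very good!"
    "Swahili",
)
print(prompt)
```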

In-Context Learning as a Transfer Catalyst

The few-shot or in-context learning capabilities of modern LLMs are a powerful vector for cross-lingual transfer. By providing a few annotated examples in the source language (e.g., English) within the prompt, the model can infer the task pattern and apply it to a query in a different target language. This demonstrates that the model is learning abstract, language-agnostic task templates, which it can then instantiate using its multilingual representations [6].
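A minimal sketch of such a cross-lingual in-context prompt: labeled English demonstrations establish the task pattern, and the final query switches to the target language. The helper name, the sentiment task, and the Swahili query are all invented for this example:

```python
def cross_lingual_few_shot_prompt(examples, query, query_language):
    """Build an in-context prompt: labeled English demonstrations define
    the abstract task pattern; the final query is in the target language."""
    lines = ["Classify each sentence as positive or negative.", ""]
    for text, label in examples:
        lines += [f"Sentence: {text}", f"Label: {label}", ""]
    lines += [f"Sentence ({query_language}): {query}", "Label:"]
    return "\n".join(lines)

demos = [("The movie was wonderful.", "positive"),
         ("I regret buying this.", "negative")]
# Swahili query: roughly "The food was bad."
prompt = cross_lingual_few_shot_prompt(demos, "Chakula kilikuwa kibaya.", "Swahili")
print(prompt)
```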

Evaluating Zero-Shot Performance: A Mixed Landscape

Benchmarking the zero-shot performance of LLMs across languages reveals a stark hierarchy that often mirrors the digital footprint and linguistic proximity of languages to English.

Performance on High- and Medium-Resource Languages

For languages with substantial representation in pre-training data (e.g., Spanish, French, German, Chinese), zero-shot transfer can be surprisingly effective on tasks like text classification, named entity recognition, and question-answering. Performance may reach 70-90% of the supervised baseline in English for these languages [7]. Success is higher for tasks that rely more on semantic understanding than on strict syntactic structure.
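The "fraction of the English supervised baseline" framing can be made concrete with a small calculation. The accuracies below are invented for illustration; real scores vary widely by task, benchmark, and model:

```python
# Invented accuracies, used only to make the relative-performance framing
# concrete; they are not measured results from any benchmark.
english_supervised = 0.88
zero_shot = {"Spanish": 0.79, "German": 0.76, "Chinese": 0.72}

for language, accuracy in zero_shot.items():
    relative = accuracy / english_supervised
    print(f"{language}: {accuracy:.2f} absolute, {relative:.0%} of the English baseline")
```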

The Challenge of Low-Resource and Linguistically Distant Languages

For truly low-resource languages (e.g., Swahili, Yoruba, Nepali) or those with different scripts and linguistic families (e.g., Amharic, Georgian, Inuktitut), performance drops precipitously. Key evaluation findings include:

  • Data Scarcity in Pre-training: The token count for these languages in the pre-training corpus may be orders of magnitude smaller than for English, leading to poor representation learning.
  • Script and Structural Divergence: Languages with non-Latin scripts or vastly different morphosyntactic structures (e.g., agglutinative or polysynthetic languages) struggle with the subword overlap and semantic alignment mechanisms.
  • Translation Artifacts: The dominant “English-pivot” transfer can introduce cultural or linguistic mismatches. A model may force Western conceptual frameworks onto the target language or rely on calques from English that sound unnatural [8].
  • Benchmark Limitations: Many multilingual benchmarks are themselves created via translation from English, potentially embedding English-centric biases and failing to capture language-specific phenomena [9].
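The "orders of magnitude" gap in the first bullet can be illustrated with back-of-the-envelope numbers. The per-language token counts below are invented, chosen only to reflect the kind of imbalance typical of web-scraped pre-training data:

```python
import math

# Invented per-language token counts (not measured from any real corpus),
# illustrating the pre-training imbalance discussed above.
tokens = {
    "English": 2_000_000_000_000,
    "Spanish": 120_000_000_000,
    "Swahili": 900_000_000,
    "Yoruba":  60_000_000,
}
total = sum(tokens.values())
for language, n in tokens.items():
    print(f"{language:8s} {n / total:9.5%} of corpus tokens")

gap = math.log10(tokens["English"] / tokens["Yoruba"])
print(f"English vs. Yoruba: ~{gap:.1f} orders of magnitude")
```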

Key Challenges and Research Frontiers

Improving cross-lingual transfer for low-resource languages is an active area of research, focusing on several frontiers:

Mitigating Representation Bias

The current paradigm entrenches linguistic hegemony. Research is exploring more balanced pre-training data curation, intentional up-sampling of low-resource language data, and novel objectives that force the model to build more equitable cross-lingual representations, such as translation language modeling or code-switching prompts [10].
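One widely used up-sampling scheme is exponentiated (temperature-based) sampling: raw language proportions are raised to a power alpha < 1 and renormalized, flattening the distribution so low-resource languages are seen more often during pre-training (alpha around 0.3 was used for XLM-R). A sketch with invented counts:

```python
def sampling_weights(token_counts, alpha=0.3):
    """Exponentiated sampling: p_i proportional to (n_i / N) ** alpha.
    Alpha below 1 flattens the distribution, up-sampling low-resource
    languages at the expense of the head language."""
    total = sum(token_counts.values())
    smoothed = {lang: (n / total) ** alpha for lang, n in token_counts.items()}
    z = sum(smoothed.values())
    return {lang: w / z for lang, w in smoothed.items()}

# Invented token counts for illustration.
counts = {"English": 1_000_000_000, "Swahili": 10_000_000, "Yoruba": 1_000_000}
raw = {lang: n / sum(counts.values()) for lang, n in counts.items()}
balanced = sampling_weights(counts, alpha=0.3)
# Low-resource shares rise sharply relative to their raw corpus
# proportions, while English is sampled below its raw share.
for lang in counts:
    print(f"{lang:8s} raw {raw[lang]:.4f} -> sampled {balanced[lang]:.4f}")
```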

Beyond the English Pivot: Direct Cross-Lingual Transfer

Emerging work investigates prompting strategies that reduce dependency on English. This includes using a high-resource language closer to the target language as the pivot (e.g., French for Wolof) or developing meta-prompts that explicitly instruct the model to operate in a language-agnostic manner [11].

Adaptation and Efficient Fine-Tuning

For sustained use in a specific low-resource language context, full fine-tuning is often impractical. Parameter-efficient fine-tuning (PEFT) methods such as LoRA (Low-Rank Adaptation) adapt LLMs using very small amounts of target-language data, substantially boosting performance while limiting catastrophic forgetting of other languages [12].
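The core of LoRA is easy to state in code: the pre-trained weight matrix is frozen, and a low-rank update B·A is learned on top of it, so only r·(d_in + d_out) parameters are tuned per layer. The NumPy sketch below shows the forward pass only (no training loop), following the zero-initialization and alpha/r scaling described by Hu et al. (2022):

```python
import numpy as np

class LoRALinear:
    """Sketch of a LoRA-adapted linear layer: the pre-trained weight W
    stays frozen; only the low-rank factors A and B would be trained."""

    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                  # frozen, shape (d_out, d_in)
        d_out, d_in = W.shape
        self.A = rng.normal(0.0, 0.01, (r, d_in))   # trainable down-projection
        self.B = np.zeros((d_out, r))               # trainable up-projection, zero-init
        self.scale = alpha / r

    def __call__(self, x):
        # Frozen path plus scaled low-rank update: (W + scale * B @ A) @ x
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

rng = np.random.default_rng(1)
layer = LoRALinear(rng.normal(size=(8, 16)))
x = rng.normal(size=16)
# Zero-initialized B makes the adapter an exact no-op before training,
# so adaptation starts from the pre-trained model's behavior.
assert np.allclose(layer(x), layer.W @ x)
```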

Intrinsic Evaluation of Multilingual Representations

Moving beyond task-based benchmarks, researchers are developing intrinsic probes to diagnose the quality of multilingual spaces—assessing isomorphism (structural similarity) and the degree of semantic alignment across languages to predict transfer performance [13].
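One simple intrinsic probe of this kind fits an orthogonal map between two embedding spaces over a small bilingual seed dictionary (the Procrustes solution, via SVD) and reports the mean cosine similarity of the mapped pairs as a rough isomorphism score. The sketch below uses synthetic embeddings rather than real model representations:

```python
import numpy as np

def procrustes_alignment_score(X, Y):
    """Fit an orthogonal map W minimizing ||X @ W - Y|| (Procrustes, via
    SVD of X^T Y), then return the mean cosine similarity of mapped
    pairs as a rough structural-alignment score in [-1, 1]."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    W = U @ Vt                                   # optimal orthogonal map
    Xm = X @ W
    cos = np.sum(Xm * Y, axis=1) / (
        np.linalg.norm(Xm, axis=1) * np.linalg.norm(Y, axis=1))
    return float(cos.mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 16))                    # synthetic "source" embeddings
R, _ = np.linalg.qr(rng.normal(size=(16, 16)))   # hidden rotation
score_iso = procrustes_alignment_score(X, X @ R)                    # isomorphic
score_rand = procrustes_alignment_score(X, rng.normal(size=(50, 16)))  # unrelated
# A rotated copy of the space aligns almost perfectly; an unrelated
# space does not, which is the signal such probes exploit.
print(f"isomorphic: {score_iso:.3f}, unrelated: {score_rand:.3f}")
```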

Conclusion: Toward Equitable Multilingual AI

Cross-lingual transfer learning in LLMs represents a paradigm shift, offering a viable path toward NLP applications for hundreds of languages that lack traditional resources. The zero-shot capabilities demonstrated by current models are a testament to the power of scale and the transformer architecture’s ability to learn deep, aligned semantic representations. However, the field must confront the uncomfortable reality that these capabilities are currently highly asymmetric, often failing the communities that stand to benefit most.

The path forward requires a multi-faceted approach: conscientious data stewardship to rebalance pre-training corpora, algorithmic innovations that promote linguistic equity, collaboration with native speaker communities for evaluation and data creation, and the development of benchmarks that reflect genuine, rather than translated, language use. The ultimate goal is not merely to transfer capabilities from high-resource to low-resource languages, but to foster LLMs that are truly multilingual—capable of understanding and generating language with native-level competence and cultural sensitivity across the full spectrum of human linguistic diversity.


[1] Conneau, A., et al. (2020). Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
[2] Joshi, P., et al. (2020). The State and Fate of Linguistic Diversity and Inclusion in the NLP World. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
[3] Sennrich, R., et al. (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
[4] Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30.
[5] Pfeiffer, J., et al. (2020). MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing.
[6] Brown, T., et al. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems 33.
[7] Hu, J., et al. (2020). XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization. Proceedings of the 37th International Conference on Machine Learning.
[8] Ahuja, K., et al. (2023). Beyond English-Centric Cross-Lingual Representation Learning. Findings of the Association for Computational Linguistics: EACL 2023.
[9] Bommasani, R., et al. (2021). On the Opportunities and Risks of Foundation Models. arXiv preprint arXiv:2108.07258.
[10] Liu, Q., et al. (2021). Don’t Forget the Long Tail! A Comprehensive Analysis of Cross-Lingual Transfer from English to Low-Resource Languages. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
[11] Winata, G. I., et al. (2021). Meta-Learning for Fast Cross-Lingual Adaptation in Neural Machine Translation. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
[12] Hu, E. J., et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. Proceedings of the International Conference on Learning Representations.
[13] Patra, B., et al. (2021). Beyond Offline Mapping: Learning Cross-lingual Word Embeddings through Context Anchoring. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics.
