Large Language Models (LLMs) have made significant strides in recent years, demonstrating impressive capabilities in understanding and generating human language. These models, such as GPT-4 and Gemini, are exceptionally proficient in Standard American English and a handful of other languages with abundant online data. However, accessing these models in certain regions, like Vietnam, can be challenging due to restrictions and limited availability.
This raises a question about languages that are less represented online, such as Vietnamese: how well do these powerful LLMs understand and generate text in them?
To address this gap, the Neurond team conducted research and trained a Vietnamese LLM, specifically on Vietnamese literature data. This initiative aims to enhance the model’s understanding and generation capabilities in Vietnamese by leveraging culturally and linguistically rich datasets. In addition, our evaluation methodology provides a comprehensive framework for assessing the performance of LLMs in languages with limited online representation, and it can serve as a valuable template for researchers and developers working on LLMs in other underrepresented languages.
Why Vietnamese LLM?
The Vietnamese language, spoken by over 100 million people worldwide, presents a significant opportunity for advanced language models. Existing Vietnamese LLMs are still in the early stages of development, and most are commercial models that are not widely accessible for research or public use. This scarcity of robust, open-source models hampers research and innovation in the field, and it underscores the need for dedicated efforts to build more advanced, open Vietnamese LLMs that capture the linguistic and cultural nuances of the Vietnamese-speaking population.
Moreover, developing a robust Vietnamese LLM opens up new possibilities for creating effective language models in other underrepresented languages. By addressing the challenges associated with limited data availability and linguistic diversity, our methodologies and insights gained from training Vietnamese LLMs can be applied to similar efforts in other languages. This enhances the global reach and applicability of LLMs and promotes linguistic diversity in the digital realm. As more languages gain representation in advanced AI models, it paves the way for more equitable access to technology and information, fostering innovation and cultural preservation across different linguistic landscapes.
Vietnamese LLMs Training
Pretraining
The literature data our Neurond team uses ensures the model is exposed to a wide range of vocabulary, communication styles, dialects, and contexts unique to the Vietnamese language. This approach improves the model’s linguistic capabilities and helps preserve and promote Vietnamese literary heritage in the digital age.
Our dataset consists of 1,600 questions covering a wide range of topics within Vietnamese literature as outlined in the official curriculum for grades 7 and 8. The questions are drawn from short stories, novels, and poems, and ask about the content, plot, author, and analysis of the literary works. They are designed to test comprehension, interpretation, and critical thinking skills, providing a robust challenge for any language model.
Tested LLM Models
We evaluated the following LLMs to assess their capabilities in answering the questions from our dataset:
- VinaLLaMA: LLaMA 2 based, finetuned with 1 million tokens of synthetic data.
- BKAI Vietnamese-LLaMA-2: Based on LLaMA 2, continually pretrained with 800 million additional tokens.
- GhostX: Finetuned from Mistral 7B on Vietnamese data.
- Sailor 7B and 14B: Based on Qwen 1.5, finetuned on data for languages from the SEA region, including Vietnamese.
- LLaMA 3: Stock model from Meta AI.
*Note: We also tested the Command-R+ model, which proved to be the best performing in zero-shot settings for Vietnamese questions. However, due to its non-commercial license, we did not explore detailed settings for this model.
Performance Metrics
To evaluate each model’s performance, we manually reviewed a sample of each model’s generated responses in both zero-shot and 1-epoch finetuned settings.
For finetuning, we used the codebase from LLaMA-Factory. Each training sample consists of one question, the corresponding context needed to answer that question (one literary work, or one excerpt from a literary work, taken from Vietnamese literature textbooks), and the instruction to use the context to answer the question. We also have the labeled/correct answer from the textbooks for verification. The training set included most of the data (1,550 samples); we chose 50 samples with varied types of questions/documents for the test set.
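To make the sample structure concrete, here is a minimal sketch of how one training record could be assembled in the alpaca-style JSON format that LLaMA-Factory accepts. The field contents, the filename, and the exact instruction wording are illustrative assumptions, not actual dataset entries.

```python
import json

def build_sample(question: str, context: str, answer: str) -> dict:
    """Assemble one alpaca-style training record (field names per the
    alpaca format; contents here are illustrative placeholders)."""
    return {
        # The instruction tells the model to rely on the provided context.
        "instruction": (
            "Use the following excerpt from a Vietnamese literature "
            "textbook to answer the question."
        ),
        # The input bundles the literary excerpt with the question.
        "input": f"Excerpt:\n{context}\n\nQuestion: {question}",
        # The output is the labeled/correct answer from the textbook.
        "output": answer,
    }

samples = [
    build_sample(
        question="Who is the author of this poem?",
        context="<excerpt of the literary work>",
        answer="<labeled answer from the textbook>",
    )
]

# LLaMA-Factory reads a JSON file that is registered in its
# dataset_info.json; the filename below is a placeholder.
with open("viet_literature_train.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```

The same `build_sample` helper can generate both the 1,550-sample training file and the 50-sample test file from the raw question/context/answer triples.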
Training
For initial evaluation, we first tested each model using some samples in the test set. If the answers were at least coherent in Vietnamese, we finetuned the model for 1 epoch and manually checked the output afterward. In case the model results were promising (answers were at least acceptable and based on the context), we further finetuned the model for more epochs and tested the output.
Results
Here’s a summary of the findings:
| Model | Performance Notes |
| --- | --- |
| VinaLLaMA (zero-shot) | Bad performance in Vietnamese; incoherent and returns non-Vietnamese characters and words. |
| BKAI Vietnamese-LLaMA-2 (zero-shot) | Lacks coherency in Vietnamese; tends to be repetitive at times. |
| GhostX (zero-shot and finetuned) | Fluent and coherent in Vietnamese; can answer dataset questions, but sometimes repetitive. |
| Sailor 7B (zero-shot and finetuned) | Best performing model; finetuning enhances tone without breaking coherence. Detailed human evaluation conducted. |
| Sailor 14B (zero-shot and finetuned) | Consistent performance; slightly slower to reduce repetition compared to 7B. |
Sailor 7B Detailed Evaluation
To further investigate the performance of the Sailor 7B model, we curated a test dataset with 15 samples and conducted a human evaluation to assess the model’s performance.
Three evaluators rated each sample on a scale of 1 to 10 (higher is better), and the scores were averaged to produce the final performance score. The evaluators were instructed to compare the model answers to the labeled/correct answers.
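The scoring scheme above can be sketched in a few lines: each sample receives one rating per evaluator, the three ratings are averaged into a per-sample score, and the per-sample scores are averaged into the final performance score. The ratings below are made-up placeholders, not our actual evaluation data.

```python
from statistics import mean

# One inner list per sample; one 1-10 rating per evaluator.
# These numbers are illustrative, not real evaluation results.
ratings_per_sample = [
    [7, 8, 6],   # sample 1
    [9, 9, 8],   # sample 2
    [5, 6, 6],   # sample 3
]

# Average the three evaluators' ratings for each sample.
sample_scores = [mean(r) for r in ratings_per_sample]

# Average the per-sample scores into the final performance score.
overall = mean(sample_scores)
print(sample_scores, overall)
```

In the actual evaluation this was done over 15 samples for each model variant (base, 1 epoch, 2 epochs, 5 epochs).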
- Models Evaluated: Base, 1 epoch, 2 epochs, 5 epochs.
General Evaluation:
- 1 Epoch: The model starts to become coherent but often repeats itself and can cut off unexpectedly.
- 2 Epochs: This model performed the best, rarely repeating itself and not cutting off.
- 5 Epochs: Performance slightly degraded from the 2-epoch model, with more repetition, but showed improvements in some answers.
The fine-tuned model was also capable of handling long-context questions not present in the training set.
Sailor 14B Detailed Evaluation
The Sailor 14B model was finetuned on 1,600 data points, with evaluations conducted across 1 to 5 epochs.
General Evaluation:
- The model’s performance was more consistent across different finetuning stages compared to the 7B model.
- Overall performance did not exceed that of the 7B model.
- It took longer (3-4 epochs) for the model to reduce repetition.
Conclusion
This study highlights the varying capabilities of different LLMs in understanding and responding to questions about Vietnamese literature. While specialized models show promising results, there remains room for improvement. Even the best models we tested still sometimes generated repeated results and had trouble providing the same details as the labeled answer. As educational AI continues to evolve, ongoing research and refinement will be essential to enhance model performance in language education contexts.
Explore our other technical research on vision language models at Key Insights into Vision Language Models—A New Frontier in Multimodal AI.
Trinh Nguyen
I'm Trinh Nguyen, a passionate content writer at Neurond, a leading AI company in Vietnam. Fueled by a love of storytelling and technology, I craft engaging articles that demystify the world of AI and Data. With a keen eye for detail and a knack for SEO, I ensure my content is both informative and discoverable. When I'm not immersed in the latest AI trends, you can find me exploring new hobbies or binge-watching sci-fi.