Feasibility of training a Turkmen LLM from scratch on 100MB data #930

tllmmaster · 2025-12-24T12:49:54Z


Hello everyone,
I have a question about training an LLM from scratch with a small dataset.
I have about 100MB of clean text data in the Turkmen language. My goal is to:

1. Pre-train a model from scratch (next-word prediction) using only this 100MB dataset.
2. Afterwards, perform instruction fine-tuning on the resulting model using the same data (converted into instruction format).

To clarify my expectations: I only need the model to answer questions based on the information contained in this training data. I do not expect it to have general knowledge or capabilities beyond what is in the dataset.

Is this feasible? Can I effectively get a working model by following this pipeline with such a small amount of data?
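To make the "converted into instruction format" step concrete, here is a minimal sketch of one common way to do it, assuming an Alpaca-style prompt template. The thread does not say which template is intended, so the template text, the `format_instruction` helper, and the example record are all hypothetical.

```python
def format_instruction(entry):
    """Render one record as an Alpaca-style training string (assumed template)."""
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    # The optional input section is only added when the record has one.
    if entry.get("input"):
        prompt += f"\n\n### Input:\n{entry['input']}"
    return prompt + f"\n\n### Response:\n{entry['output']}"


# Hypothetical record; real records would be derived from the Turkmen corpus.
example = {
    "instruction": "What is the capital of Turkmenistan?",
    "input": "",
    "output": "Ashgabat.",
}
print(format_instruction(example))
```

Each question-answer pair extracted from the corpus would be rendered this way before fine-tuning, with the text after `### Response:` serving as the target.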

Answered by rasbt


rasbt · 2025-12-24T15:32:38Z

Maintainer

Hi there, I think this sounds somewhat feasible, but given such a small dataset plus your requirement

> To clarify my expectations: I only need the model to answer questions based on the information contained in this training data.

maybe try a RAG setup first. You could try one of these notebooks. They are not totally from scratch, but they should be easy to use: https://github.com/rasbt/RAGs


tllmmaster · 2025-12-25T03:36:29Z

Author

Dear Sebastian,
Thank you so much for the quick reply and the recommendation. I agree that RAG would be the most practical solution for this use case.
However, this is a strict university assignment where the specific requirement is to build a model from scratch and then fine-tune it. Unfortunately, I cannot use RAG due to these constraints.
Given that I must proceed with training from scratch on this small 100MB dataset, could you advise on a reasonable model size (number of parameters) or architecture (e.g., a tiny GPT-2 config) that would minimize overfitting and allow the model to learn at least the basic structure of the language?
Best regards,
Turkmen engineer
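The sizing question above comes down to some simple arithmetic: how many tokens 100MB of text yields, and how many parameters a given GPT-2-style config has. The sketch below is only a rough way to compare the two. The 4 bytes per token figure is a placeholder that should be measured with the actual tokenizer, `TINY_CFG` is a hypothetical config and not a recommendation from this thread, and the parameter formula assumes the standard GPT-2 layout (biases everywhere, 4x MLP expansion, output head tied to the token embedding).

```python
def estimate_tokens(num_bytes, bytes_per_token):
    """Rough token count for a corpus, given an assumed bytes-per-token ratio."""
    return int(num_bytes / bytes_per_token)


def gpt2_param_count(vocab_size, context_length, emb_dim, n_layers,
                     tie_embeddings=True):
    """Parameter count for a standard GPT-2-style decoder."""
    token_emb = vocab_size * emb_dim
    pos_emb = context_length * emb_dim
    # Per block: attention (4*d^2 + 4*d), MLP (8*d^2 + 5*d), two LayerNorms (4*d)
    per_block = 12 * emb_dim**2 + 13 * emb_dim
    final_norm = 2 * emb_dim
    out_head = 0 if tie_embeddings else vocab_size * emb_dim
    return token_emb + pos_emb + n_layers * per_block + final_norm + out_head


# Sanity check against the known GPT-2 small size: 124,439,808 parameters.
print(gpt2_param_count(50257, 1024, 768, 12))

# Hypothetical tiny config (an assumption for illustration only).
TINY_CFG = {"vocab_size": 8000, "context_length": 256,
            "emb_dim": 256, "n_layers": 4}

n_tokens = estimate_tokens(100_000_000, bytes_per_token=4.0)  # placeholder ratio
n_params = gpt2_param_count(**TINY_CFG)
print(f"tokens: {n_tokens:,}  params: {n_params:,}  "
      f"tokens per parameter: {n_tokens / n_params:.2f}")
```

The tokens-per-parameter ratio is useful for comparing candidate configs. A commonly cited rule of thumb for compute-optimal training is roughly 20 tokens per parameter, but that is a loose reference point, not a rule for multi-epoch training on a small dataset. Validation loss on held-out Turkmen text remains the more direct overfitting check.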
