Bangladeshi Bangla LLM with Integrated STT and TTS: A Low-Cost Framework

Current Gaps in Bangla Speech and Language Technology
Existing speech-to-text (STT) and text-to-speech (TTS) systems often struggle with Bangladeshi Bangla. Automatic speech recognition for Bengali has “not reached an acceptable state” compared to major languages ([2109.13217] Challenges and Opportunities of Speech Recognition for Bengali Language). Off-the-shelf models like Whisper report low error rates on paper, but quirks in OpenAI's text normalization understate their true word error rates for Bengali (Breaking Brahmic: How OpenAI's Text Cleaning Hides Whisper's True Word Error Rate for Many South Asian Languages - Deepgram Blog | Deepgram). In practice, without fine-tuning, such models can misrecognize local accents and code-mixed (Bangla-English) speech. Likewise, generic TTS voices may sound unnatural to Bangladeshi listeners, as they often carry Indian accents or overly formal pronunciations that don’t reflect local speech patterns.
Differences from Indian Bengali: Bangladeshi Bangla (Eastern dialects) differs from West Bengal’s Bengali in pronunciation, vocabulary, and cultural context. For example, the Bengali “r” sound is softened or dropped in casual Bangladeshi speech (Understanding The Bengali Accent: #1 Complete Guide - ling-app.com), and intonation tends to be evenly soft. Vocabulary also diverges – Bangladesh’s Bengali includes more Perso-Arabic loanwords (influenced by Islamic culture), whereas Indian Bengali favors Sanskrit-derived terms (Bengali dialects - Wikipedia). These regional differences extend to idioms and references; a phrase or cultural mention common in Dhaka might confuse a model trained only on Kolkata data. Syntax between the variants remains mostly the same (both use standard Bengali grammar). Still, some dialects in Bangladesh even add grammatical gender distinctions that are absent in the standard form (Bengali dialects - Wikipedia). Overall, current models fail to capture these subtleties, leading to errors in understanding and unnatural-sounding speech output. A Bangladeshi-centric model must accurately handle local dialectal variations, sociolects, and cultural references.
Model Selection
LLM Base Models: To build a Bangla-language LLM, we can start with an open-source base model and fine-tune it for Bangladeshi Bangla. Candidates include:
LLaMA 2 (Meta) – a strong foundation, available in 7B and 13B (and larger) sizes. Prior work has shown that fine-tuning LLaMA on Bengali data is feasible; “BengaliLlama” was created by instruction-tuning LLaMA-7B on 252k Bengali instructions using LoRA (BengaliLlama, OpenReview 2023). This demonstrates LLaMA’s suitability for Bengali after adaptation.
Mistral 7B – A newer 7B model with Apache 2.0 license, noted for high efficiency. Mistral 7B outperforms even larger 13B models on many benchmarks (Mistral 7B | Mistral AI), meaning we get strong performance at a lower cost. Its smaller size and open license make it ideal for local fine-tuning and deployment.
BloomZ – An instruction-tuned version of the BLOOM multilingual model. BloomZ was trained on ~46 languages and can follow prompts in dozens of languages zero-shot (bigscience/bloomz-mt · Hugging Face). Bangla is among its training languages, so it already understands Bengali to some degree out of the box. This makes BloomZ (e.g. the 7.1B or 3B variant) a cost-effective starting point for a Bengali assistant after some fine-tuning.
DeepSeek 7B – DeepSeek released open 7B and 67B models (trained on English/Chinese) (GitHub - deepseek-ai/DeepSeek-LLM: DeepSeek LLM: Let there be answers). The 7B variant could be fine-tuned on Bengali; however, since it wasn’t initially trained on Bangla, it’s less ready out-of-the-box than BloomZ. It’s still an option if its architecture or capabilities are desirable, but additional Bengali data would be needed to teach it the language.
STT Models: The primary candidate is OpenAI Whisper, which is trained on 680k hours of multilingual data and generalizes well to many languages (Whisper-Large-Bengali - Kaggle). Whisper supports Bengali, but fine-tuning can improve its accuracy on Bangladeshi accents. Researchers fine-tuning Whisper on regional speech (e.g. the Noakhali dialect) achieved Word Error Rates as low as 1.5% (BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization). This shows Whisper can capture Bangladesh’s speech nuances when adapted with local data.
Another option is Mozilla DeepSpeech (now maintained by Coqui STT), which is fully open-source. However, DeepSpeech is an older architecture and not as accurate as Whisper on conversational speech. Wav2Vec2 (Facebook) models could also be fine-tuned for Bangla ASR if Whisper’s resource requirements are too high. Overall, Whisper-small or medium models, fine-tuned on Bangladeshi data, provide a strong, cost-effective STT backbone.
TTS Models: When trained on local voices, modern neural TTS can generate natural Bengali speech. Two open solutions stand out:
VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is a state-of-the-art end-to-end TTS model that produces high-quality speech. A Bangla VITS model has been demonstrated using open data, achieving a Mean Opinion Score of ~4.1 (out of 5) for naturalness (bangla-speech-processing/bangla_tts_male · Hugging Face). The VITS model in that project was trained with the Coqui TTS toolkit (formerly Mozilla TTS) (bangla-speech-processing/bangla_tts_male · Hugging Face). This shows that VITS can learn the prosody and accent of Bangladeshi speakers if provided with the right training corpus.
Mozilla TTS (Coqui-TTS) – an open framework supporting models like Tacotron2, FastSpeech, Glow-TTS, and VITS. It’s well-suited for low-resource languages. Using Mozilla’s tech, researchers have built Bangla TTS voices from as little as 24 hours of recorded speech (IndicSpeech: Text-to-Speech Corpus for Indian Languages) (IndicSpeech: Text-to-Speech Corpus for Indian Languages). The toolkit supports multi-speaker training, which could incorporate regional accent variations.
Pre-trained TTS for Bangla is limited, but datasets exist (see next section) to train our own. We would select a model architecture (VITS for best quality or FastSpeech for faster inference) and train a voice that sounds local (e.g., a Bangladeshi news narrator style). One could train a model on multiple voices or use voice conversion techniques for multi-speaker output.
Each chosen model is open-source and relatively lightweight. LLMs in the 7–13B range (LLaMA, Mistral, BloomZ) can be fine-tuned on a single GPU with the proper techniques. Whisper small or medium (roughly 244M to 769M parameters) is also manageable for fine-tuning. Training a single-speaker TTS model is feasible on consumer GPUs (TTS models are typically ~30M–100M parameters). Overall, these selections balance capability with cost, and importantly, they all allow community use and further tuning without licensing hurdles.
Dataset Collection and Curation
Building a high-quality Bangladeshi Bangla dataset is crucial. We need text data for the LLM and paired audio-text data for STT/TTS. Here are some strategies for corpus building:
Leverage Public Texts
Gather Bangla texts from government websites, public domain literature, news media, and educational content. For example, Bangladesh’s government documents (laws, parliamentary transcripts, public notices) provide formal language data. Local newspapers and magazines (Prothom Alo, Ittefaq, BDnews24, etc.) offer contemporary vocabulary and topics. Classic literature (works of Kazi Nazrul, Rabindranath Tagore, Humayun Ahmed, etc.) can enrich stylistic and idiomatic coverage, though older texts may use a higher register of Bangla. Web content like Bangla Wikipedia and Bangla blogs can also be scraped. These sources ensure coverage of formal, journalistic, and narrative styles.
Include Social Media and Conversation
To capture colloquial Bangla and code-mixed usage, mine social media (Facebook posts, public groups, Twitter in Bangla, YouTube comments). Transcribe locally popular YouTube videos or TV show subtitles to get conversational transcripts. Caution is needed to filter out noise and offensive content, but this exposes the LLM to slang, regional idioms, and the mix of English words common in everyday Bangla (“Banglish”). Including such data helps the model handle informal queries and user-generated language.
Diversity and Dialect Coverage
Bangladesh has many dialects and sociolects. While Standard Bangla (Choltibhasha) will be the primary focus, our dataset should include some dialectal content to make the system robust. We can crowdsource dialect phrases from different districts or use existing dialect corpora. For instance, gather sample sentences from rural areas and various socio-economic groups (to include local terms and phrasing). Ensuring a mix of urban Dhaka Bangla and rural dialects will make the model more inclusive.
Audio Data for STT/TTS
We can bootstrap the speech dataset from open sources. The Common Voice project by Mozilla has crowdsourced Bangla speech – use it as a starting point for STT training. Additionally, OpenSLR provides a large Bengali ASR dataset (~196k utterances) that was crowdsourced and transcribed (openslr/openslr · Datasets at Hugging Face). This corpus of 900+ hours of speech (collected by Google) includes a variety of speakers. Importantly, there is also a Bangladeshi Bengali TTS corpus: OpenSLR SLR37 contains high-quality recorded speech for bn-BD (~586 MB of narrated text with transcripts) (openslr.org). We should download these and extract transcripts and audio for fine-tuning Whisper and training TTS. Furthermore, local radio/TV archives (if accessible) can provide natural speech clips – e.g. Bangladesh Betar or talk shows. By assembling audio from multiple sources, we ensure variety in speaker age, gender, and accent, which improves model generalization.
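As an illustration, a minimal sketch of pulling the Bangla split of Common Voice with the Hugging Face `datasets` library is shown below; the dataset id and version are illustrative, and the corpus may require accepting its terms of use or supplying an access token.

```python
from datasets import Audio, load_dataset

# Bangla ("bn") split of Common Voice; the exact version is illustrative.
cv_bn = load_dataset("mozilla-foundation/common_voice_13_0", "bn", split="train")
cv_bn = cv_bn.cast_column("audio", Audio(sampling_rate=16000))  # Whisper expects 16 kHz input

sample = cv_bn[0]
print(sample["sentence"])               # the transcript paired with the clip
print(sample["audio"]["array"].shape)   # raw waveform ready for feature extraction
```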
Annotation and Cleaning
To keep costs low, prioritize existing text already in Bangla. However, for some tasks (like instruction-tuning the LLM), we might need to create Q&A or conversational pairs. Here we can translate existing open datasets: e.g. take English instruction datasets (Alpaca, Dolly, etc.) and translate them to Bangla, then have native speakers proofread. This approach was used to create the 252k Bengali instruction dataset for BengaliLlama. We can employ bilingual students or crowdworkers in Bangladesh to do translation and validation at relatively low cost (manual verification can be limited to a subset for quality control). For audio transcripts, a small team of native speakers can clean the text (fix spelling, normalize colloquial shorthand) and ensure the transcripts match the spoken words. Open-source language-identification tools (e.g. langid) can help separate Bangla from English text, and simple scripts can standardize numerals and remove inappropriate content. By combining automated cleaning and selective human annotation, we can curate a robust Bangla dataset without an excessive budget.
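A minimal text-cleaning sketch along these lines is shown below, assuming the open-source `langid` package for language identification; the function and file names are illustrative.

```python
import re

import langid  # open-source language identification (pip install langid)

# Map Western digits to Bangla digits so numerals are consistent across the corpus.
BN_DIGITS = str.maketrans("0123456789", "০১২৩৪৫৬৭৮৯")

def clean_line(line):
    """Return a normalized Bangla line, or None if the line should be dropped."""
    line = re.sub(r"\s+", " ", line).strip()   # collapse whitespace
    if not line:
        return None
    lang, _score = langid.classify(line)       # drop lines that are mostly non-Bangla
    if lang != "bn":
        return None
    return line.translate(BN_DIGITS)

with open("raw_corpus.txt", encoding="utf-8") as src, \
     open("clean_corpus.txt", "w", encoding="utf-8") as dst:
    for raw in src:
        cleaned = clean_line(raw)
        if cleaned:
            dst.write(cleaned + "\n")
```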
In summary, use free and open resources first, augment with targeted data collection (possibly through community contributions). Aim for a balanced corpus that reflects formal and informal Bangladeshi Bangla across various domains. This ensures that the LLM, STT, and TTS components all learn Bangladesh's true linguistic and cultural landscape.
Training and Optimization
Fine-Tuning Strategy: Rather than training from scratch (impossible given resource limits), we’ll fine-tune the chosen base models on our Bangla data. Transfer learning is key – start from a model that already “knows” language basics and shape it with Bangladeshi Bangla examples. For the LLM, this means taking the base (e.g. LLaMA/Mistral/BloomZ) and training on a mixed dataset of Bangla text and instructions/conversations.
We should also align the LLM to follow user instructions in Bangla (similar to how ChatGPT was trained, but in our language). A cost-effective method is Low-Rank Adaptation (LoRA), which freezes the original model weights and only trains a small number of extra parameters (low-rank update matrices) – drastically reducing GPU memory needs. Using LoRA, the team behind BengaliLlama fine-tuned a 7B model in ~4 days on a single A100 GPU. We can apply LoRA to our LLM to inject Bangla knowledge without full dense training.
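A sketch of what attaching LoRA adapters looks like with the Hugging Face PEFT library is shown below; the base checkpoint and hyperparameters are illustrative rather than prescriptive.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"   # illustrative; could equally be a LLaMA or BloomZ checkpoint
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

lora_cfg = LoraConfig(
    r=16,                                   # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # adapt only the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()          # typically well under 1% of the base weights
```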
We will also utilize quantization to optimize both training and inference. Quantizing the model to 8-bit or 4-bit precision can hugely cut memory and compute usage with minimal loss in performance. For example, the QLoRA technique recently allowed fine-tuning a 65B model on a single 48 GB GPU by using 4-bit weights (Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA). We can quantize our model (e.g., use 4-bit integer arithmetic during fine-tuning) so that even a consumer GPU (like an RTX 3090 with 24 GB) could handle a 7B–13B model.
During inference, an int8 or int4 quantized model will run faster and could even be deployed on CPU if needed, though GPUs will give better responsiveness.
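The 4-bit loading step might look like the sketch below, using the bitsandbytes integration in transformers; the quantization settings shown are common defaults, not tuned values.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as used in QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",            # illustrative base checkpoint
    quantization_config=bnb_cfg,
    device_map="auto",
)
# LoRA adapters (see the earlier sketch) are then attached on top of this
# 4-bit base model, which is the usual QLoRA recipe.
```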
Training Procedure
We will likely perform several training runs:
LLM Fine-Tuning
Use our Bangla text corpus to continue training the LLM (language modeling objective) so it becomes fluent in Bangla. Then perform supervised fine-tuning on instruction data (Q&A, dialogues) so that it learns to follow user prompts in Bangla. This two-stage process (pre-train on Bangla, then instruction-tune) aligns the model with both linguistic and task needs. Techniques like gradient checkpointing and mixed precision (FP16/BF16) will be used to save memory. Alternatively, we can shortcut part of the alignment work by starting from an already instruction-tuned multilingual model like BloomZ and doing a lightweight fine-tune on Bangla data.
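The supervised fine-tuning stage could be launched roughly as sketched below with trl's SFTTrainer; the dataset path and arguments are placeholders, `model` is the LoRA-wrapped model from the earlier sketch, and the exact keyword arguments vary somewhat between trl versions.

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Translated/curated Bangla instruction pairs, one formatted prompt-response string per example.
data = load_dataset("json", data_files="bangla_instructions.jsonl", split="train")

args = TrainingArguments(
    output_dir="out/bangla-sft",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    bf16=True,
    gradient_checkpointing=True,    # trades extra compute for lower memory, as noted above
    logging_steps=50,
)

trainer = SFTTrainer(
    model=model,                    # the LoRA-wrapped base model from the earlier sketch
    train_dataset=data,
    args=args,
    dataset_text_field="text",      # assumes each record has a pre-formatted "text" field
)
trainer.train()
```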
Whisper/STT Fine-Tuning
OpenAI’s Whisper models come pre-trained, so we only need to fine-tune on our Bengali speech dataset. We’ll feed in audio-transcript pairs and train for a few epochs so that the model adapts to local accents and proper nouns. Because Whisper is large (the medium model has 769M parameters), we can use smaller batch sizes or freeze early encoder layers. Another trick is prompting Whisper with a task hint (Whisper accepts a language token or an example transcript as a prompt).
In low-resource cases, telling Whisper explicitly that the speaker uses Bengali (Bangladesh) can guide it. Recent research shows prompt-tuning Whisper with language information improved accuracy for Indian languages (Enhancing Whisper’s Accuracy and Speed for Indian Languages through Prompt-Tuning and Tokenization). If prompt-tuning performs well, we can apply it here and perhaps avoid full fine-tuning.
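For reference, pointing Whisper at Bengali through its language token looks like the sketch below (transformers API; the model size and audio path are illustrative).

```python
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

audio, sr = librosa.load("sample_bn.wav", sr=16000)       # illustrative input clip
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

# Force the Bengali language token so the model does not have to guess the language.
forced_ids = processor.get_decoder_prompt_ids(language="bengali", task="transcribe")
pred_ids = model.generate(inputs.input_features, forced_decoder_ids=forced_ids)
print(processor.batch_decode(pred_ids, skip_special_tokens=True)[0])
```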
TTS Training
For TTS, if using VITS or Tacotron, training from scratch on our voice dataset (which might be ~20–30 hours of speech) is feasible. We’ll train a model on Bangladeshi Bengali speech (possibly a single speaker for clarity, or multi-speaker if we have labeled data for that). Training can be accelerated with transfer learning too – e.g., initialize from a pre-trained Tacotron model trained on another language, then train on the Bangla voice (this can help if the phoneme inventories are similar).
The Mozilla/Coqui TTS framework supports transfer learning, where a model trained on one dataset can be adapted to another low-resource dataset. We can also use adapter-style fine-tuning (e.g. in NVIDIA NeMo) or fine-tune a multilingual TTS model on Bangla data to get a decent voice without colossal training.
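Once a Bangla voice is trained, serving it through Coqui TTS is straightforward; the sketch below assumes a locally trained VITS checkpoint whose paths are placeholders.

```python
from TTS.api import TTS

tts = TTS(
    model_path="checkpoints/bn_bd_vits/best_model.pth",   # placeholder for our trained voice
    config_path="checkpoints/bn_bd_vits/config.json",
)
tts.tts_to_file(
    text="আপনার আবেদনটি গৃহীত হয়েছে।",   # "Your application has been accepted."
    file_path="reply.wav",
)
```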
Alignment and Accuracy
After initial training, we should evaluate and refine. If the LLM outputs literal translations of English idioms or shows cultural bias, we can perform an alignment tuning. This could involve Reinforcement Learning from Human Feedback (RLHF) if resources allow – e.g. have human annotators rank outputs for correctness and cultural appropriateness, then further tune the model to favor better responses.
At minimum, we’ll test the LLM on various user prompts (FAQs, casual questions, sensitive topics) and manually adjust the dataset or prompts to fix issues. Ensuring the model avoids misunderstandings (e.g., mixing up honorifics or failing to catch sarcasm common in local usage) may require adding more example dialogues to the fine-tuning data.
Finally, optimize everything for inference: use 4-bit quantized models for deployment and apply decoding tweaks (like adjusting Whisper’s beam width or the LLM’s temperature) to get reliable results quickly. By combining LoRA (to fine-tune effectively on limited hardware) with quantization (to shrink model size), we minimize computational cost while maintaining accuracy. In short, train smart, not brute-force – leverage pre-trained knowledge, adapt with lightweight techniques, and carefully tailor the models to Bangladeshi Bangla.
Cost Reduction Strategies
Developing an LLM system can be expensive, but we employ several strategies to keep costs low:
Efficient Hardware Use
We plan training to fit on affordable hardware. A single modern GPU (such as an Nvidia RTX 4090 or A6000) with 24–48 GB VRAM can handle our fine-tuning tasks thanks to optimization techniques (LoRA and QLoRA, as discussed). This avoids the need for an expensive multi-GPU server. We can also use cloud instances only when needed – e.g., rent an A100 GPU on AWS or Google Cloud for a few days to run the fine-tuning rather than maintaining our own cluster. By using int4 quantization, even a 13B model fine-tune can run on one 48GB GPU (Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA), which dramatically cuts rental time and cost. We will also explore cheaper cloud providers or credits from academic programs (many cloud providers offer grants for research on low-resource languages).
Cloud vs Local Balance
For initial experimentation and smaller models, we can use free or low-cost resources: Google Colab offers free GPU time (though limited), Kaggle kernels, or cloud TPUs from programs like Kaggle TPU or TensorFlow Research Cloud. Local universities in Bangladesh often have GPU servers – collaborating with an academic institution (e.g., a computer science department at BUET or BRAC University) could grant access to their hardware in exchange for co-authorship or open publication.
This educational collaboration spreads out the computing burden. Another angle is to approach local tech companies or startups (perhaps those in the Bangla NLP space) to share resources or sponsor computing in return for credit and early access to the model.
Distributed Training and FP16
If our dataset is large, we can still fit training on smaller GPUs by using gradient accumulation with small per-step batch sizes (or by distributing the run across several modest GPUs). It’s slower but avoids needing a single high-memory GPU. Using mixed precision (FP16) cuts memory use roughly in half, doubling the possible batch size on the same hardware. We’ll ensure our training code uses PyTorch’s automatic mixed precision to maximize throughput per GPU-hour. These software optimizations, combined with selecting smaller model sizes (we focus on 7B–13B LLMs, not 70B monsters), significantly reduce the compute requirements.
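A bare-bones mixed-precision loop with gradient accumulation is sketched below using standard PyTorch AMP; `model`, `dataloader`, and `optimizer` are assumed to be defined elsewhere.

```python
import torch

scaler = torch.cuda.amp.GradScaler()
accum_steps = 8   # simulate a larger effective batch on a small GPU

for step, batch in enumerate(dataloader):          # model/dataloader/optimizer assumed to exist
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(**batch).loss / accum_steps   # scale loss so accumulated gradients average out
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```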
Quantized Inference on CPU
We aim to run the model on modest hardware for deployment. After fine-tuning, we can quantize the LLM to 4-bit and possibly run it on a CPU server or even a high-end mobile device. This avoids the cost of keeping a GPU online for inference. Libraries like GPTQ and bitsandbytes allow running 4-bit models with reasonable speed, and llama.cpp-style runtimes target CPUs directly. Similarly, the STT and TTS models can be optimized. The Whisper small model (~244M parameters) can transcribe in real time on a CPU with Intel MKL optimizations. TTS models like FastSpeech2 are lightweight enough to run on smartphones.
If real-time response is not required, we can batch-process requests on a single machine. In essence, we design the system so that a local NGO, school, or startup could deploy it on a minimal server without needing expensive accelerators.
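As one concrete possibility, CPU inference on a 4-bit GGUF export of the fine-tuned model with llama-cpp-python could look like the sketch below; the file name is a placeholder.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="bangla-llm-7b-q4_k_m.gguf",   # placeholder: 4-bit GGUF export of our model
    n_ctx=2048,
    n_threads=8,                              # tune to the server's CPU core count
)
prompt = "প্রশ্ন: বাংলাদেশের জাতীয় ফুল কী?\nউত্তর:"   # "Question: What is the national flower of Bangladesh? Answer:"
out = llm(prompt, max_tokens=64, temperature=0.7)
print(out["choices"][0]["text"])
```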
Open Collaboration
We tap into the open-source community to further cut costs. By open-sourcing our project, we might attract contributions of data and computing. Enthusiasts could donate training time (for instance, someone with a gaming PC might run a few epochs overnight). There are also community GPU share programs (like Hugging Face’s TPU cluster or volunteer computing grids) that could be leveraged. Collaboration with initiatives like Masakhane (for African languages) and AI4Bharat (for Indian languages) provides a blueprint – those projects pool resources and expertise to build local language AI (Local AI Research Groups are Preserving Non-English Languages in the Digital Age | TechPolicy.Press) (Local AI Research Groups are Preserving Non-English Languages in the Digital Age | TechPolicy.Press).
For Bangladesh, we could engage the Bengali AI community (perhaps via the Bangla NLP Facebook group or Kaggle community) to help with annotation and validation, reducing the need for paid labor.
Phased Development
Before scaling up, we can start with smaller prototypes to test viability (e.g., train a 1.3B model first or use the tiny Whisper model).
By combining these strategies, we minimize financial barriers. The focus is on using intelligence over brute force: using pre-trained models, open data, efficient fine-tuning, and community partnerships. The goal is an affordable pipeline that a Bangladeshi institution could maintain without needing millions of dollars or a giant data center.
Implementation and Deployment
With fine-tuned models, we integrate STT and TTS to create a seamless user experience for Bangladeshi users. The system will be an interactive dialogue agent that users can talk to and listen to in Bangla.
Integration Architecture
The workflow is as follows: The user speaks in Bangla → STT transcribes it to text → LLM processes the text and generates a response (in Bangla) → TTS converts the response text back to speech audio for the user. This pipeline can be implemented modularly. For instance, Whisper (or the STT model) runs first and outputs recognized text. That text is fed into the LLM (which runs on a server or on-device if small enough). The LLM’s text answer is then passed to the TTS engine to synthesize voice.
We will ensure the components communicate quickly – possibly using lightweight protocols or an on-device setup for low latency. An end-to-end pipeline like this has been successfully demonstrated in research for Bangla dialect conversion (BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization), which combined a fine-tuned Whisper, a Bangla language model, and AlignTTS to go from speech → text → standardized speech. We will adopt a similar integrated approach but tuned for general Q&A dialogue.
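One possible wiring of the three stages is sketched below; the model names and paths are placeholders, and the helper function is deliberately minimal rather than production serving code.

```python
import whisper                      # openai-whisper package
from transformers import pipeline
from TTS.api import TTS

stt = whisper.load_model("small")
llm = pipeline("text-generation", model="path/to/bangla-llm")          # placeholder model path
tts = TTS(model_path="checkpoints/bn_bd_vits/best_model.pth",          # placeholder voice
          config_path="checkpoints/bn_bd_vits/config.json")

def handle_voice_query(audio_path: str, out_path: str = "reply.wav") -> str:
    text = stt.transcribe(audio_path, language="bn")["text"]           # speech -> Bangla text
    reply = llm(text, max_new_tokens=128)[0]["generated_text"]         # text -> Bangla answer
    tts.tts_to_file(text=reply, file_path=out_path)                    # answer -> speech audio
    return out_path
```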
Seamless User Experience
The system should feel natural for users of varying literacy and digital skill levels. We can provide a text chat interface for literate users as well, but the primary mode is voice to cater to those who cannot easily read Bangla script or type on a keyboard. The voice assistant should understand casual speech, different dialect inflections, and some mixed English phrases (common in urban speech). Our fine-tuning of the STT on diverse accents will help it handle this variability. On output, the TTS will be configured to speak in a clear Bangladeshi accent and properly pronounce local names and places.
We might offer voice choices (male, female voice) if multiple TTS models are trained, but the default will be a friendly Bangladeshi voice speaking at a moderate pace.
Localization and Accessibility
We will incorporate cultural knowledge into the LLM to respond appropriately. For example, if asked about “pourashava” (municipality) or local dishes like “pitha”, the model should not be confused. This comes from the local corpus we trained on. The model will also handle honorifics (using appropriate polite forms when needed, such as using “আপনি” for formal 'you'). For users with low digital literacy, the system could have a simple interface – possibly a phone line one can call and speak to (since many rural users are more comfortable with phone calls than smartphone apps).
Because our STT/TTS modules are lightweight, deployment on an Android app that works offline is conceivable. The app could carry the quantized models and run entirely on the device, which is excellent for areas with poor internet. Alternatively, a centralized server could handle requests and respond with audio – in that case, we’d optimize the server to handle multiple concurrent users by using efficient batching (processing multiple requests together if using GPUs).
Testing and Iteration
We will test the integrated system with real users in Bangladesh. Their feedback (whether the voice is understandable, whether the system misheard certain words, etc.) will guide further tuning. For example, if many users in Sylhet find it fails to understand a Sylheti word, we can gather those cases and retrain the STT or add that vocabulary to the LLM. We’ll also ensure the TTS output is at the right speed and volume for users not accustomed to listening to long audio. Perhaps include a feature to slow down speech or repeat answers on demand.
For accessibility, the text output can be displayed with Bengali script and optional Romanization for those who can’t read the script but can read Latin letters. This multi-modal output caters to different literacy levels.
Optimization
The final deployed system will be optimized for low resource usage. For instance, on mobile, we could run the STT on-device (Whisper tiny model) and send the text to an LLM server and then return synthesized audio. Or if fully on-device, use a distilled smaller LLM. The key is to maintain accuracy while keeping latency low. We aim for only a few seconds from speaking to hearing the reply, which is achievable with our model sizes. Caching can be used for standard queries (the LLM’s response to frequent questions could be precomputed).
Moreover, we will incorporate fallback handling: if the STT is unsure (low confidence), the system might politely ask the user for clarification rather than giving a wrong answer. This kind of refinement ensures the user trusts the system.
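A simple version of that fallback, using Whisper's per-segment average log-probability as a rough confidence signal, is sketched below; the threshold is illustrative and would be tuned on held-out audio.

```python
# "Sorry, I didn't quite catch that. Could you please repeat?"
CLARIFY_PROMPT = "দুঃখিত, আমি ঠিক বুঝতে পারিনি। একটু আবার বলবেন?"

def transcribe_with_fallback(stt_model, audio_path, threshold=-1.0):
    """Return (transcript, None) on success, or (None, clarification_prompt) on low confidence."""
    result = stt_model.transcribe(audio_path, language="bn")
    segments = result.get("segments", [])
    if segments and min(s["avg_logprob"] for s in segments) < threshold:
        return None, CLARIFY_PROMPT    # ask the user to repeat instead of guessing
    return result["text"], None
```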
In deployment, we will continue to monitor and improve the system. Because it’s built on open-source tools, improvements in those tools (e.g., a new version of Whisper or a better Bangla TTS model) can be integrated. We will also open-source our Bangla fine-tuned models, enabling the community to use them or even contribute improvements. By focusing on Bangladeshi users’ needs and constraints throughout (language style, accent, device limitations, etc.), this LLM with STT/TTS will significantly enhance digital accessibility, allowing users to interact with technology in their mother tongue naturally, whether they are reading, writing, or speaking.
Practical Steps Summary
We identify gaps in current Bangla LLM models, choose open models and tune them, gather local data with community help, optimize training with LoRA and quantization, cut costs via clever resource use, and deploy an integrated voice assistant tuned to Bangladesh’s linguistic and cultural context. The result is a cost-effective yet robust Bangla LLM system that understands and speaks to users as a fellow Bangladeshi would, ensuring inclusivity across literacy levels and regions.
Sources
- Mridha et al., “Challenges and Opportunities of Speech Recognition for Bengali,” Artificial Intelligence Review (2022) – Bengali ASR state and need for language-specific approaches ([2109.13217] Challenges and Opportunities of Speech Recognition for Bengali Language).
- Deepgram AI, “Breaking Brahmic: Whisper’s True Error Rates for South Asian Languages,” (2023) – Whisper accuracy issues for Tamil, Hindi, Bengali due to text normalization (Breaking Brahmic: How OpenAI's Text Cleaning Hides Whisper's True Word Error Rate for Many South Asian Languages - Deepgram Blog).
- Ling App, “Understanding the Bengali Accent: Complete Guide,” (2022) – Differences in Bangladeshi vs. West Bengal pronunciation and accent (e.g. the ‘r’ sound) (Understanding The Bengali Accent: #1 Complete Guide - ling-app.com).
- Wikipedia, “Bengali dialects” – Notes on Eastern (Bangladesh) vs. Western vocabulary influences (Perso-Arabic vs. Sanskrit) and minimal grammar differences (Bengali dialects - Wikipedia).
- Muennighoff et al., “BloomZ & mT0 Models,” Hugging Face (2022) – Multitask-finetuned LLM that follows instructions in dozens of languages zero-shot (bigscience/bloomz-mt · Hugging Face).
- OpenReview (anonymous), “BengaliLlama: Instruction Following LLaMA Model for Bengali,” (2023) – Used LoRA to fine-tune LLaMA-7B on 252k Bengali instructions; highlights data scarcity and approach.
- Mistral AI, “Announcing Mistral 7B,” (Sept 2023) – Performance of Mistral 7B (7.3B params) surpassing larger LLaMA 2 models, and ease of fine-tuning with the Apache 2.0 license (Mistral 7B | Mistral AI).
- Hugging Face Blog, “4-bit Quantization and QLoRA,” by Dettmers et al. (2023) – Describes the QLoRA method that fine-tunes a 65B model on a single 48GB GPU with 4-bit precision (Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA).
- OpenSLR, “High Quality TTS Data for Bengali (bn-BD, bn-IN),” SLR37 (2018) – Dataset of transcribed Bangladeshi and Indian Bengali speech released by Google (CC BY-SA) (openslr.org).
- Bangla Speech Processing Team, Hugging Face Model Card (2023) – Bangla TTS voice using VITS, trained on the IIT Madras 24-hour corpus and Common Voice, achieved MOS 4.10 (bangla-speech-processing/bangla_tts_male · Hugging Face).
- Samin et al., “BanglaDialecto: Dialect Speech Standardization,” arXiv (2023) – End-to-end pipeline converting Noakhali dialect speech to standard Bangla using Whisper ASR, BanglaT5 translation, and AlignTTS (BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization).
- Radiya-Dixit, “Local AI Research Groups Preserving Non-English Languages,” TechPolicy Press (2023) – Highlights community-driven data collection (AI4Bharat’s IndicVoices involving diverse speakers, including low-income telephone users) and the importance of local collaboration (Local AI Research Groups are Preserving Non-English Languages in the Digital Age | TechPolicy.Press).