Daily logs and weekly reflections, newest first.
Jun 8, 2026 – Jun 14, 2026
2026-24
No reflection text.
Final weekly XP (after rules): 0 XP
- [AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios] - as the title suggests, the focus is on realism and task diversity.
Jun 1, 2026 – Jun 7, 2026
2026-23
No reflection text.
Final weekly XP (after rules): 34 XP
- Test Qwen3-ASR on different sets of hyperparams.
Lessons: maybe I was wrong to try solving an already-solved problem, maybe I identified the problem wrongly, or maybe I was wrong to force a solution onto a problem. This is not how to solve something.
: 400
- (SURF: Separation via Unsupervised Remixing Flow; ICML 2026, Google DeepMind) - a combination of supervised flow matching and regression-based self-supervised techniques; it does not require clean source data and connects to the Wake-Sleep algorithm. The idea in this paper is very creative.
- Try constrained decoding; build a better mental model before fine-tuning.
May 25, 2026 – May 31, 2026
2026-22
No reflection text.
Final weekly XP (after rules): 0 XP
- Adaptive Contrastive Search: Uncertainty-Guided Decoding for Open-Ended Text Generation
- Normalize the Chinese text and fine-tune again; add a small trick for tokens predicted at chunk boundaries. Try to get more diverse hypotheses.
- Test TRELLIS
- Integrate the Qwen3-ASR-0.6B EC checkpoint into the current streaming system.
- Ultravox-v0.5-Llama3.2-1B is very bad at following instructions.
Damn, making a comeback today (18 days left)
May 18, 2026 – May 24, 2026
2026-21
No reflection text.
Final weekly XP (after rules): 0 XP
- Pruned Ultravox (Llama backbone) by 30% of the parameters in the Llama layers. Overall performance gets worse, but it is better than I expected for a training-free method.
May 11, 2026 – May 17, 2026
2026-20
Harder please.
Final weekly XP (after rules): 196 XP
- "A SIMPLE AND EFFECTIVE PRUNING APPROACH FOR LARGE LANGUAGE MODELS" - a training-free pruning technique leveraging weight magnitude and corresponding input activations, estimated using a small set of calibration data.
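A minimal NumPy sketch of the scoring idea in this entry, assuming each weight's importance is its magnitude times the L2 norm of the matching input feature over the calibration batch, and that the lowest-scoring weights are dropped within each output row. The function name `prune_by_magnitude_and_activation`, the shapes, and the per-row comparison group are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def prune_by_magnitude_and_activation(weight, calib_inputs, sparsity):
    """Zero out the lowest-scoring weights in each output row.

    weight:       (out_features, in_features) linear-layer weights
    calib_inputs: (n_samples, in_features) activations from calibration data
    sparsity:     fraction of weights to remove per row, in [0, 1)
    """
    # Per-input-feature activation norm estimated from the calibration set.
    act_norm = np.linalg.norm(calib_inputs, axis=0)      # (in_features,)
    # Score = |weight| * norm of the input feature it multiplies.
    score = np.abs(weight) * act_norm[None, :]           # (out, in)
    n_prune = int(weight.shape[1] * sparsity)
    pruned = weight.copy()
    if n_prune == 0:
        return pruned
    # Indices of the n_prune smallest scores in each row.
    drop = np.argsort(score, axis=1)[:, :n_prune]
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned
```

No gradient step or retraining is involved here, which is what makes this kind of method training-free; the calibration data is only used to estimate activation norms.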
- Further fine-tuned on waihu (both normalization + punctuation restoration) and tested on waihu, but the performance is worse than the original recognizer plus fine-tuning on waihu.
Let's try adding some novelty.
- "https://www.youtube.com/watch?v=ptFiH_bHnJw" - LLM architectures and hyperparameters. The talk covers Rotary Position Embeddings (RoPE), pre-norm and RMSNorm, SwiGLU/GeGLU, and dropping bias terms. Hyperparameters: d_ff ~= 4 x d_model; head_dim * n_heads ~= d_model; whether the model should be deep or wide -> d_model / n_layers ~ 10 to 100; vocab size: 30-50k for monolingual and 100-250k for multilingual models; weight decay for better training loss; z-loss and layer norm on the Q and K matrices for training stability; KV cache; attention heads.
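The rules of thumb in this entry can be turned into a quick sanity check. A sketch in Python, assuming a hypothetical `TransformerConfig` dataclass; the field names and the example values are made up for illustration and do not describe any specific model.

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    d_model: int
    n_layers: int
    n_heads: int
    head_dim: int
    d_ff: int

def rule_of_thumb_ratios(cfg):
    """Return the ratios that the notes compare against rough targets."""
    return {
        # Notes: d_ff ~= 4 x d_model, so this should be near 4.
        "d_ff / d_model": cfg.d_ff / cfg.d_model,
        # Notes: head_dim * n_heads ~= d_model, so this should be near 1.
        "head_dim * n_heads / d_model": cfg.head_dim * cfg.n_heads / cfg.d_model,
        # Notes: deep-vs-wide trade-off, roughly 10 to 100.
        "d_model / n_layers": cfg.d_model / cfg.n_layers,
    }

# Illustrative values only.
cfg = TransformerConfig(d_model=1024, n_layers=16, n_heads=16, head_dim=64, d_ff=4096)
ratios = rule_of_thumb_ratios(cfg)
```

Printing `ratios` for a candidate config makes it easy to spot a setting that drifts far from these rough targets before launching a run.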
- Fine-tuned on punctuation restoration data vs. a mix of punctuated and non-punctuated data. Fine-tuned with a 1e-5 lr vs. a 1e-4 lr.
: 300
Countdown: 7 days. Also, just started thinking about my MPhil Thesis
- I read "Why vulnerability discovery is mathematically difficult" at "https://github.com/yo-yo-yo-jbo/vr_difficulty". Why did I read this? First, the topic belongs to CS. Second, it discusses the challenges AI faces in a specific domain, written by a domain expert. Random thought while reading: it would be cool if I could use a reduction approach in a future paper. Note: AI can't solve everything because the Halting Problem is undecidable.
I should practice both thinking and speaking in both English and Vietnamese, since relying on one language for one skill will limit and confuse me when I use that skill in the other language.
- "When AVSR Meets Video Conferencing: Dataset, Degradation, and the Hidden Mechanism Behind Performance Collapse" - reports the performance degradation of AVSR in video conferencing and constructs a VC dataset.
- Fine-tuned on mixed offline + online data, then fine-tuned on online data. The performance improves.
: 150
To-do for today: DynaCocktail, improve StreamCorrect's results on the WeBank data, sleep early
- Fine-tune on offline error correction data before transferring to streaming error correction data.
I have also tried continuing fine-tuning from the previous checkpoint's optimizer state, and with this the loss diverged. I am not sure whether this is good or bad; it could be good if the significantly low losses mean excessive memorization.
: 650
- Realized that I can only work alone or play a support role; I am not good at leading.
- I read SOTA papers, but my experiments are still stuck on training data quality. Damn, how far behind the frontier am I :((?
May 4, 2026 – May 10, 2026
2026-19
No reflection text.
Final weekly XP (after rules): 34 XP
- "Pinyin Regularization in Error Correction for Chinese Speech Recognition with LLMs" - only uses the BELLE model to collect the hypotheses and only infers at a beam size of 10; one of the core techniques is leveraging Pinyin-regularized text. Going to use the dataset from this paper to pretrain my error corrector.
- "Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition" - an old paper on error correction. I am wondering whether any insights from this paper could be applied to my work: weight initialization for the unfrozen layers to better align the weights of distinct models? A learnable adapter (but I already fine-tuned a speechlm)? Their prompt template?
- "Chain of Correction for Full-Text Speech Recognition with LLMs" - corrects documents, articles, news reports, etc. segment by segment. The paper also includes a correction threshold to balance under-correction against over-rephrasing. Interestingly, I realized that this full-text correction is quite similar to my current streaming correction: instead of segment by segment, streaming needs to handle chunk by chunk, a smaller unit of speech.
Learned that during vibe coding, if I give the coding agent the bugs and it patches them case by case, it is highly likely that other bugs/cases remain unfixed. Simply asking the model for a more generalizable algorithm could result in a better fix.
- "Better & Faster Large Language Models via Multi-token Prediction"
MTP wasn't scalable; the paper proposes a scalable paradigm with no training-time or memory overhead.
Works naturally with self-speculative decoding -> 3x faster.
4-token decoding is optimal based on experiments.
- Qwen3-ASR-1.7B full-parameter fine-tuning - damn, just a little fine-tuning on the recognition task improves performance faster than plugging in an error corrector.
- Played around with the Claude API, thanks to the capstone.
Chopin - Nocturne No. 21 in C Minor
- "Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding": uses Bayesian Optimization to selectively skip some layers of the original LLM, hence generating draft tokens with faster inference speed.
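A toy sketch of the draft-then-verify loop under greedy decoding, to make the "lossless" part concrete. Everything here is an assumption for illustration: `target_next` and `draft_next` are hypothetical stand-ins that map a token list to the next token, whereas in the paper the drafter is the same LLM with some layers skipped, and the Bayesian Optimization step that picks those layers is not shown.

```python
def greedy_decode(next_token, prompt, n_new):
    """Plain greedy decoding, used as the reference output."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(next_token(tokens))
    return tokens

def draft_and_verify(target_next, draft_next, prompt, n_new, k=4):
    """Draft up to k tokens cheaply, then keep only the verified prefix."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1) Draft up to k tokens with the cheap model.
        draft = []
        for _ in range(min(k, goal - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # 2) Verify against the target's greedy choice at each position.
        #    (A real system scores all drafted positions in one forward pass.)
        for i, tok in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if tok != expected:
                # First mismatch: take the target's token, discard the rest.
                tokens.extend(draft[:i] + [expected])
                break
        else:
            tokens.extend(draft)
    return tokens

# Hypothetical toy models over digit tokens.
def target_next(tokens):
    return (tokens[-1] + 1) % 10

def draft_next(tokens):
    # Agrees with the target except after most tokens divisible by 3.
    return 0 if tokens[-1] % 3 == 0 else (tokens[-1] + 1) % 10

fast = draft_and_verify(target_next, draft_next, [1], 12)
slow = greedy_decode(target_next, [1], 12)
```

Because every accepted token equals the target's own greedy choice, `fast` and `slow` are identical; the speedup in practice comes from verifying several drafted tokens in a single target pass.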
Currently reading "Better & Faster Large Language Models via Multi-token Prediction", but I am reading so damn slowly
Apr 27, 2026 – May 3, 2026
2026-18
Spent almost the whole week looking for a way to synthesize data with as little noise as possible, then fine-tuned Ultravox-v0.5-Llama3.2-1B and Qwen3-ASR-0.6B. Qwen3-ASR-0.6B was pretrained on the speech recognition task, so when it is fine-tuned on the error correction task the loss fluctuates terribly. Ultravox-v0.5-Llama3.2-1B does better; I wonder whether that is because Ultravox is a general speechlm...
Final weekly XP (after rules): 49 XP
- SepPrune proposes a differentiable pruning technique for speech separation models. After masking layers with high computational cost, the model is fine-tuned on the remaining parameters for 1 epoch, which recovers 86%+ of the performance of the original model fine-tuned for 493 epochs. The codebase of this paper isn't well documented. Pruning a model for 36x faster training is cool, but I am not sure whether I need this technique now. I am just wondering why people who train foundation models don't use this?
- 16-bit LoRA fine-tuning on 2x data size
16-bit LoRA fine-tuning on 2x data size but removing samples where top-1 hypothesis=ground-truth to reduce bias towards top-1.
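The filtering step described above could look roughly like this. A sketch only: the sample layout (`hypotheses` as an n-best list with the top-1 first, plus a `ground_truth` string) and the exact-match comparison after stripping whitespace are assumptions, not the actual data format.

```python
def drop_top1_correct(samples):
    """Keep only samples whose top-1 hypothesis differs from the ground truth."""
    return [
        s for s in samples
        if s["hypotheses"][0].strip() != s["ground_truth"].strip()
    ]

# Hypothetical toy data.
samples = [
    {"hypotheses": ["hello world", "hello word"], "ground_truth": "hello world"},
    {"hypotheses": ["hello word", "hello world"], "ground_truth": "hello world"},
]
kept = drop_top1_correct(samples)
```

With the already-correct top-1 cases removed, the corrector no longer sees examples where copying the first hypothesis is the right answer, which is the bias this filter is meant to reduce.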
- Full-parameter fine-tuning of Qwen3-ASR-0.6B. The losses are crazy.
How much noise is acceptable in training data? I know this varies across models and tasks. There are obviously papers on this; at this stage, I only need a feeling for how noisy my data will be while planning the model architecture and data preparation pipeline.
Building this website took the whole damn afternoon.