About me

Hello! I am Yujia Xiao, a fourth-year PhD student in the DSP & Speech Technology Laboratory (DSP-STL) at The Chinese University of Hong Kong (CUHK), under the supervision of Prof. Tan Lee. Prior to this, I worked as an applied scientist at Microsoft from 2018 to 2022. I earned both my M.S. and B.S. degrees from South China University of Technology. My current research focuses on long-form audio and speech generation as well as multimodal agents.

😊 I plan to graduate in 2026 and am actively seeking research positions in academia or industry. If you are interested in my work, feel free to contact me!

News

🌟 Oct 2, 2025: PodEval is released. PodEval is a comprehensive toolkit for evaluating podcasts across multiple dimensions, including audio, speech, and text, using both objective metrics and subjective evaluation methods.
🌟 May 16, 2025: PodAgent is accepted by ACL 2025 Findings.
🌟 Mar 4, 2025: PodAgent is released. Given the topic to be discussed, PodAgent will simulate human behavior to create podcast-like audio presented as a talk show, featuring one host and several guests. The show will include diverse and insightful viewpoints, delivered in appropriate voices, along with structured sound effects and background music to enrich the listening experience.

Experience

💼 2018.05 - 2022.07: Applied Scientist at Microsoft (TTS Algorithm Team)
💻 2016.08 - 2018.04: Research Intern at Microsoft Research Asia (Speech Group & IEG)

Selected Publications

📖 PodEval: A Multimodal Evaluation Framework for Podcast Audio Generation
- Yujia Xiao, Liumeng Xue, Lei He, Xinyi Chen, Aemon Yat Fei Chiu, Wenjie Tian, Shaofei Zhang, Qiuqiang Kong, Xinfa Zhu, Wei Xue, Tan Lee. Under Review.
📖 PodAgent: A Comprehensive Framework for Podcast Generation
- Yujia Xiao, Lei He, Haohan Guo, Fenglong Xie, Tan Lee. ACL 2025 Findings.
📖 Contrastive context-speech pretraining for expressive text-to-speech synthesis
- Yujia Xiao, Xi Wang, Xu Tan, Lei He, Xinfa Zhu, Sheng Zhao, Tan Lee. ACM Multimedia, 2024.
📖 ContextSpeech: Expressive and efficient text-to-speech for paragraph reading
- Yujia Xiao, Shaofei Zhang, Xi Wang, Xu Tan, Lei He, Sheng Zhao, Frank K. Soong, Tan Lee. INTERSPEECH 2023.
📖 Improving FastSpeech TTS with efficient self-attention and compact feed-forward network
- Yujia Xiao, Xi Wang, Lei He, Frank K. Soong. ICASSP 2022.
📖 Improving prosody with linguistic and BERT derived features in multi-speaker based Mandarin Chinese neural TTS
- Yujia Xiao, Lei He, Huaiping Ming, Frank K. Soong. ICASSP 2020.
📖 Paired phone-posteriors approach to ESL pronunciation quality assessment
- Yujia Xiao, Frank K. Soong, Wenping Hu. INTERSPEECH 2018.
📖 Proficiency Assessment of ESL Learner’s Sentence Prosody with TTS Synthesized Voice as Reference
- Yujia Xiao, Frank K. Soong. INTERSPEECH 2017.
📖 Measuring Audio’s Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models
- Haolin He, Xingjian Du, Renhe Sun, Zheqi Dai, Yujia Xiao, Mingru Yang, Jiayi Zhou, Xiquan Li, Zhengxi Liu, Zining Liang, Chunyat Wu, Qianhua He, Tan Lee, Xie Chen, Wei-Long Zheng, Weiqiang Wang, Mark Plumbley, Jian Liu, Qiuqiang Kong. Under Review.
📖 ZSVC: Zero-shot style voice conversion with disentangled latent diffusion models and adversarial training
- Xinfa Zhu, Lei He, Yujia Xiao, Xi Wang, Xu Tan, Sheng Zhao, Lei Xie. ICASSP 2025.
📖 Audio-FLAN: An Instruction-Following Dataset for Unified Understanding and Generation of Speech, Music, and Sound
- Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Sitong Cheng, Yinghao Ma, Ruibin Yuan, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, ZHANG Xinshen, Tianchi Liu, Zeyue Tian, Ziyang Ma, Haohe Liu, Ge Zhang, Xu Tan, Emmanouil Benetos, Wenhao Huang, Yike Guo, Wei Xue. Under Review.
📖 UniStyle: Unified style modeling for speaking style captioning and stylistic speech synthesis
- Xinfa Zhu, Wenjie Tian, Xinsheng Wang, Lei He, Yujia Xiao, Xi Wang, Xu Tan, Sheng Zhao, Lei Xie. ACM Multimedia, 2024.
📖 QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning
- Haohan Guo, Fenglong Xie, Jiawen Kang, Yujia Xiao, Xixin Wu, Helen Meng. IEEE Transactions on Audio, Speech and Language Processing.

Awards

🌟 2021.12 [Microsoft Hackathon] Executive Challenge - Hack for Consumer Business Growth - 2nd Place
🌟 2020.09 [Microsoft Hackathon] Honorable Mention
🌟 2019.09 [Microsoft Hackathon] Hackathon Challenge - Hack for Big Ideas - 2nd Place
🥇 2016 National Scholarship for Postgraduates
🥇 2013 National Scholarship
🥇 2012 National Scholarship

Teaching & Services

🧑‍🏫️ Teaching Assistant (CUHK) of UGEB1408-ENGG1920 Artificial Intelligence in Action
🧑‍🏫️ Teaching Assistant (CUHK) of ELEG2310B: Principles of Communication Systems
📑 Invited Reviewer for ICASSP 2025-2026 / IJCNN 2025

🎶🎙️💚

I love music, enjoy singing, and play the guzheng (amateur Level 10). I’m also into podcasts, interviews, stand-up comedy, and badminton. I have a strong interest in mental health and hold certifications in QPR Gatekeeping and the MHFA Standard Course. If we share similar interests, let’s connect and explore them together!