CV

Education

Ph.D. in Computer Science, The Graduate Center, CUNY, 2021
M.Phil. in Computer Science, The Graduate Center, CUNY, 2019
B.Tech. in Computer Science, Indian Institute of Engineering, Science and Technology, Shibpur, 2012

Work experience

Mar. 2022 - Present: Postdoctoral Researcher
- Carnegie Mellon University
- Advised by: Professor Shinji Watanabe
Jan. 2021 - Mar. 2022: Machine Learning Research Engineer
- Apple Inc
- Multilingual Text-To-Speech synthesis
- Long form speech synthesis
Jun. 2020 - Oct. 2020: Research Intern, Student Researcher
- Google
- End-to-end source diarization for long sequence modeling.
Jun. 2019 - Aug. 2019: Machine Learning Intern
- Apple Inc
- Multilingual Text-To-Speech using internal data sets.
Jun. 2016 - Aug. 2016: Research Intern
- Interactions LLC
- Quality prediction for dialog systems using large commercial data sets.
- Development of general quality prediction algorithms.
Jun. 2012 - Jun. 2015: Software Engineer
- Polaris Networks

Skills

Programming Languages:
- Python
- C/C++
- Linux programming
Tools:
- TensorFlow
- PyTorch
- Matlab
Languages:
- Bengali (Native)
- English
- Hindi

Publications

Concatenative Resynthesis Using Twin Networks.

Soumi Maiti, Michael Mandel, "Concatenative Resynthesis Using Twin Networks." In the proceedings of INTERSPEECH, 2017.

Large Vocabulary Concatenative Resynthesis.

Soumi Maiti, Joey Ching, Michael Mandel, "Large Vocabulary Concatenative Resynthesis." In the proceedings of INTERSPEECH, 2018.

Predicting interaction quality in customer service dialogs

Svetlana Stoyanchev, Soumi Maiti, Srinivas Bangalore, "Predicting interaction quality in customer service dialogs." In the proceedings of Advanced Social Interaction with Agents: 8th International Workshop on Spoken Dialog Systems, 2018.

Parametric resynthesis with neural vocoders

Soumi Maiti, Michael Mandel, "Parametric resynthesis with neural vocoders." In the proceedings of 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019.

Speech denoising by parametric resynthesis

Soumi Maiti, Michael Mandel, "Speech denoising by parametric resynthesis." In the proceedings of ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.

Generating multilingual voices using speaker space translation based on bilingual speaker data

Soumi Maiti, Erik Marchi, Alistair Conkie, "Generating multilingual voices using speaker space translation based on bilingual speaker data." In the proceedings of ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.

Speaker independence of neural vocoders and their effect on parametric resynthesis speech enhancement

Soumi Maiti, Michael Mandel, "Speaker independence of neural vocoders and their effect on parametric resynthesis speech enhancement." In the proceedings of ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.

End-to-end diarization for variable number of speakers with local-global networks and discriminative speaker embeddings

Soumi Maiti, Hakan Erdogan, Kevin Wilson, Scott Wisdom, Shinji Watanabe, John Hershey, "End-to-end diarization for variable number of speakers with local-global networks and discriminative speaker embeddings." In the proceedings of ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.

TriniTTS: Pitch-controllable end-to-end TTS without external aligner

Yooncheol Ju, Ilhwan Kim, Hongsun Yang, Ji-Hoon Kim, Byeongyeol Kim, Soumi Maiti, Shinji Watanabe, "TriniTTS: Pitch-controllable end-to-end TTS without external aligner." In the proceedings of Interspeech, 2022.

CMU’s IWSLT 2023 Simultaneous Speech Translation System

Brian Yan, Jiatong Shi, Soumi Maiti, William Chen, Xinjian Li, Yifan Peng, Siddhant Arora, Shinji Watanabe, "CMU’s IWSLT 2023 Simultaneous Speech Translation System." In the proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), 2023.

EEND-SS: Joint end-to-end neural speaker diarization and speech separation for flexible number of speakers

Soumi Maiti, Yushi Ueda, Shinji Watanabe, Chunlei Zhang, Meng Yu, Shi-Xiong Zhang, Yong Xu, "EEND-SS: Joint end-to-end neural speaker diarization and speech separation for flexible number of speakers." In the proceedings of 2022 IEEE Spoken Language Technology Workshop (SLT), 2023.

ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit

Brian Yan, Jiatong Shi, Yun Tang, Hirofumi Inaguma, Yifan Peng, Siddharth Dalmia, Peter Polák, Patrick Fernandes, Dan Berrebbi, Tomoki Hayashi, Xiaohui Zhang, Zhaoheng Ni, Moto Hira, Soumi Maiti, Juan Pino, Shinji Watanabe, "ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit." arXiv preprint arXiv:2304.04596, 2023.

Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech

Dareen Alharthi, Roshan Sharma, Hira Dhamyal, Soumi Maiti, Bhiksha Raj, Rita Singh, "Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech." arXiv preprint arXiv:2310.00706, 2023.

FindAdaptNet: Find and Insert Adapters by Learned Layer Importance

Junwei Huang, Karthik Ganesan, Soumi Maiti, Young Kim, Xuankai Chang, Paul Liang, Shinji Watanabe, "FindAdaptNet: Find and Insert Adapters by Learned Layer Importance." In the proceedings of ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.

Improving massively multilingual ASR with auxiliary CTC objectives

William Chen, Brian Yan, Jiatong Shi, Yifan Peng, Soumi Maiti, Shinji Watanabe, "Improving massively multilingual ASR with auxiliary CTC objectives." In the proceedings of ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.

Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

Takaaki Saeki, Soumi Maiti, Xinjian Li, Shinji Watanabe, Shinnosuke Takamichi, Hiroshi Saruwatari, "Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining." arXiv preprint arXiv:2301.12596, 2023.

Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute

William Chen, Xuankai Chang, Yifan Peng, Zhaoheng Ni, Soumi Maiti, Shinji Watanabe, "Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute." arXiv preprint arXiv:2306.06672, 2023.

Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

Yifan Peng, Jinchuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma, Wangyou Zhang, Yui Sudo, Muhammad Shakeel, Jee-weon Jung, Soumi Maiti, Shinji Watanabe, "Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data." arXiv preprint arXiv:2309.13876, 2023.

SpeechLMScore: Evaluating speech generation using speech language model

Soumi Maiti, Yifan Peng, Takaaki Saeki, Shinji Watanabe, "SpeechLMScore: Evaluating speech generation using speech language model." In the proceedings of ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.

Service and leadership

Peer Reviews:
- ICASSP 2022, 2023, 2024
- Interspeech 2021, 2022, 2023
- SLT 2022
Academic Service:
- IEEE SLTC Associate Member, 2024
- Area Chair, EMNLP 2023
- Session Chair, ICASSP 2023
- Session Chair, IEEE SLT, 2023

Soumi Maiti
