Posts by Collection

publications

Concatenative Resynthesis Using Twin Networks.

Published in the proceedings of INTERSPEECH, 2017

Use Google Scholar for full citation

Recommended citation: Soumi Maiti, Michael Mandel, "Concatenative Resynthesis Using Twin Networks." In the proceedings of INTERSPEECH, 2017.

Large Vocabulary Concatenative Resynthesis.

Published in the proceedings of INTERSPEECH, 2018

Use Google Scholar for full citation

Recommended citation: Soumi Maiti, Joey Ching, Michael Mandel, "Large Vocabulary Concatenative Resynthesis." In the proceedings of INTERSPEECH, 2018.

Predicting interaction quality in customer service dialogs

Published in the proceedings of Advanced Social Interaction with Agents: 8th International Workshop on Spoken Dialog Systems, 2018

Use Google Scholar for full citation

Recommended citation: Svetlana Stoyanchev, Soumi Maiti, Srinivas Bangalore, "Predicting interaction quality in customer service dialogs." In the proceedings of Advanced Social Interaction with Agents: 8th International Workshop on Spoken Dialog Systems, 2018.

Parametric resynthesis with neural vocoders

Published in the proceedings of 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019

Use Google Scholar for full citation

Recommended citation: Soumi Maiti, Michael Mandel, "Parametric resynthesis with neural vocoders." In the proceedings of 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019.

Speech denoising by parametric resynthesis

Published in the proceedings of ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019

Use Google Scholar for full citation

Recommended citation: Soumi Maiti, Michael Mandel, "Speech denoising by parametric resynthesis." In the proceedings of ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.

Generating multilingual voices using speaker space translation based on bilingual speaker data

Published in the proceedings of ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020

Use Google Scholar for full citation

Recommended citation: Soumi Maiti, Erik Marchi, Alistair Conkie, "Generating multilingual voices using speaker space translation based on bilingual speaker data." In the proceedings of ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.

Speaker independence of neural vocoders and their effect on parametric resynthesis speech enhancement

Published in the proceedings of ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020

Use Google Scholar for full citation

Recommended citation: Soumi Maiti, Michael Mandel, "Speaker independence of neural vocoders and their effect on parametric resynthesis speech enhancement." In the proceedings of ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.

End-to-end diarization for variable number of speakers with local-global networks and discriminative speaker embeddings

Published in the proceedings of ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021

Use Google Scholar for full citation

Recommended citation: Soumi Maiti, Hakan Erdogan, Kevin Wilson, Scott Wisdom, Shinji Watanabe, John Hershey, "End-to-end diarization for variable number of speakers with local-global networks and discriminative speaker embeddings." In the proceedings of ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.

TriniTTS: Pitch-controllable end-to-end TTS without external aligner

Published in Proc. Interspeech, 2022

Use Google Scholar for full citation

Recommended citation: Yooncheol Ju, Ilhwan Kim, Hongsun Yang, Ji-Hoon Kim, Byeongyeol Kim, Soumi Maiti, Shinji Watanabe, "TriniTTS: Pitch-controllable end-to-end TTS without external aligner." In Proc. Interspeech, 2022.

CMU’s IWSLT 2023 Simultaneous Speech Translation System

Published in Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), 2023

Use Google Scholar for full citation

Recommended citation: Brian Yan, Jiatong Shi, Soumi Maiti, William Chen, Xinjian Li, Yifan Peng, Siddhant Arora, Shinji Watanabe, "CMU’s IWSLT 2023 Simultaneous Speech Translation System." In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), 2023.

EEND-SS: Joint end-to-end neural speaker diarization and speech separation for flexible number of speakers

Published in the proceedings of 2022 IEEE Spoken Language Technology Workshop (SLT), 2023

Use Google Scholar for full citation

Recommended citation: Soumi Maiti, Yushi Ueda, Shinji Watanabe, Chunlei Zhang, Meng Yu, Shi-Xiong Zhang, Yong Xu, "EEND-SS: Joint end-to-end neural speaker diarization and speech separation for flexible number of speakers." In the proceedings of 2022 IEEE Spoken Language Technology Workshop (SLT), 2023.

ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit

Published in arXiv preprint arXiv:2304.04596, 2023

Use Google Scholar for full citation

Recommended citation: Brian Yan, Jiatong Shi, Yun Tang, Hirofumi Inaguma, Yifan Peng, Siddharth Dalmia, Peter Polák, Patrick Fernandes, Dan Berrebbi, Tomoki Hayashi, Xiaohui Zhang, Zhaoheng Ni, Moto Hira, Soumi Maiti, Juan Pino, Shinji Watanabe, "ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit." arXiv preprint arXiv:2304.04596, 2023.

Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech

Published in arXiv preprint arXiv:2310.00706, 2023

Use Google Scholar for full citation

Recommended citation: Dareen Alharthi, Roshan Sharma, Hira Dhamyal, Soumi Maiti, Bhiksha Raj, Rita Singh, "Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech." arXiv preprint arXiv:2310.00706, 2023.

FindAdaptNet: Find and Insert Adapters by Learned Layer Importance

Published in the proceedings of ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023

Use Google Scholar for full citation

Recommended citation: Junwei Huang, Karthik Ganesan, Soumi Maiti, Young Kim, Xuankai Chang, Paul Liang, Shinji Watanabe, "FindAdaptNet: Find and Insert Adapters by Learned Layer Importance." In the proceedings of ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.

Improving massively multilingual ASR with auxiliary CTC objectives

Published in the proceedings of ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023

Use Google Scholar for full citation

Recommended citation: William Chen, Brian Yan, Jiatong Shi, Yifan Peng, Soumi Maiti, Shinji Watanabe, "Improving massively multilingual ASR with auxiliary CTC objectives." In the proceedings of ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.

Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

Published in arXiv preprint arXiv:2301.12596, 2023

Use Google Scholar for full citation

Recommended citation: Takaaki Saeki, Soumi Maiti, Xinjian Li, Shinji Watanabe, Shinnosuke Takamichi, Hiroshi Saruwatari, "Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining." arXiv preprint arXiv:2301.12596, 2023.

Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute

Published in arXiv preprint arXiv:2306.06672, 2023

Use Google Scholar for full citation

Recommended citation: William Chen, Xuankai Chang, Yifan Peng, Zhaoheng Ni, Soumi Maiti, Shinji Watanabe, "Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute." arXiv preprint arXiv:2306.06672, 2023.

Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

Published in arXiv preprint arXiv:2309.13876, 2023

Use Google Scholar for full citation

Recommended citation: Yifan Peng, Jinchuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma, Wangyou Zhang, Yui Sudo, Muhammad Shakeel, Jee-weon Jung, Soumi Maiti, Shinji Watanabe, "Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data." arXiv preprint arXiv:2309.13876, 2023.

SpeechLMScore: Evaluating speech generation using speech language model

Published in the proceedings of ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023

Use Google Scholar for full citation

Recommended citation: Soumi Maiti, Yifan Peng, Takaaki Saeki, Shinji Watanabe, "SpeechLMScore: Evaluating speech generation using speech language model." In the proceedings of ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.

Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens

Published in arXiv preprint arXiv:2309.08531, 2023

Use Google Scholar for full citation

Recommended citation: Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Yeo, Shinji Watanabe, Yong Ro, "Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens." arXiv preprint arXiv:2309.08531, 2023.

Unsupervised data selection for TTS: Using Arabic broadcast news as a case study

Published in arXiv preprint arXiv:2301.09099, 2023

Use Google Scholar for full citation

Recommended citation: Massa Baali, Tomoki Hayashi, Hamdy Mubarak, Soumi Maiti, Shinji Watanabe, Wassim El-Hajj, Ahmed Ali, "Unsupervised data selection for TTS: Using Arabic broadcast news as a case study." arXiv preprint arXiv:2301.09099, 2023.

VoxtLM: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks

Published in arXiv preprint arXiv:2309.07937, 2023

Use Google Scholar for full citation

Recommended citation: Soumi Maiti, Yifan Peng, Shukjae Choi, Jee-weon Jung, Xuankai Chang, Shinji Watanabe, "VoxtLM: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks." arXiv preprint arXiv:2309.07937, 2023.

Joint prediction and denoising for large-scale multilingual self-supervised learning

Published in arXiv preprint arXiv:2309.15317, 2023

Use Google Scholar for full citation

Recommended citation: William Chen, Jiatong Shi, Brian Yan, Dan Berrebbi, Wangyou Zhang, Yifan Peng, Xuankai Chang, Soumi Maiti, Shinji Watanabe, "Joint prediction and denoising for large-scale multilingual self-supervised learning." arXiv preprint arXiv:2309.15317, 2023.

talks

SC-2: Inclusive Neural Speech Synthesis (iNSS)

Published: May 25, 2022

Towards robust speech generation

Published: September 28, 2023

Talk: Consolidating speech tasks with Spoken Language Models

Published: November 03, 2023

Abstract: Recent Large Language Models (LLMs) show great improvements in text processing and natural language processing applications. Spoken language modeling, in comparison, is a very recent research area. Speech, in contrast to text, has many different components: speaker characteristics, emotional cues, pausing, pitch variation, etc. Moreover, speech signals have longer sequence lengths than text. In this talk, I will focus on two parts. First, we will explore the utility of spoken language models for speech evaluation. Second, we will discuss how to build a multi-modal voice and text language model for consolidating speech recognition/synthesis with text and speech continuation tasks.

Guest Lecture: Text To Speech

Published: November 27, 2023

Consolidating Speech Tasks with Spoken Language Models

Published: December 07, 2023

teaching

Introduction to Computer Applications

Undergraduate course, Brooklyn College (CUNY), Computer Science, 2018

Introduction to Computer Applications

Taught computer literacy, word processing, and database systems.
Adjunct Lecturer
Fall 2016, Spring 2017, Fall 2017, Spring 2018, Fall 2018

Data Science workshop series

Workshop, Hunter College (CUNY), Mathematics, 2020

Data Science workshops

Math Fellow
Hunter College
Fall 2019, Spring 2020
Conducted workshops on Python, data analysis, visualization, Git, and LaTeX

Speech Processing

Undergraduate course, Carnegie Mellon University, LTI, 2023

Speech Processing

Co-Instructor
Spring 2023
Introduction to Text-To-Speech processing