Spotlight Talks & Colloquiums

Spotlight Talk

Monday, April 15, 2024 @ GRAND BALLROOM 101

SPOTLIGHTS | Title | Presenters
08:30 ~ 08:35 | Opening
08:35 ~ 09:00 | Talk 1: What is the fundamental essence of a clinical trial? What lies at the core of therapeutics? | Sungjee Kang (Co-founder & CEO of WELT)
09:00 ~ 09:25 | Talk 2: Harnessing Wild and Untamed Data for the Cost-Efficient Development of Customized Speech-Related Services | Wei Chu (Founder and CEO of Olewave)
09:25 ~ 09:50 | Talk 3: Deep dive into the evolution of the automobile | Jaesik Lee (Co-Founder and President of Technology, Navitas Solutions)
09:50 ~ 10:15 | Talk 4: A Pattern for Making Impact in Trustworthy Machine Learning through Basic Research, Open Source, Product, and Deployment | Kush R. Varshney (Research scientist and senior manager at IBM Research – T. J. Watson Research Center)
10:15 ~ 10:30 | Break
10:30 ~ 10:55 | Talk 5: Democratization of technologies | Kazuya Takeda (Professor and VP in charge of startup, Nagoya University; co-founder and former representative director, TierIV Inc.)
10:55 ~ 11:20 | Talk 6: Synthetic data for algorithm development: real-world examples and lessons learned | Wontak Kim (Senior Manager of Research at Amazon)
11:20 ~ 11:45 | Talk 7: AI Empowerment in Mental Health Early Warning Systems | Zhaojin Lu (Director of the Tellhow AI Research Institute and Chief Scientist at Tellhow Creative Technology Group) & Hyunwoo Kim (Senior Research Expert and Principal Investigator at Zhejiang Lab)
11:45 | Closing
  1. (8:35-9:00)
    What is the fundamental essence of a clinical trial? What lies at the core of therapeutics?
    by Sungjee Kang (Co-founder & CEO of WELT)

    Abstract: Healthcare revolves around the generation and analysis of data, and digital technologies are swiftly transforming the healthcare landscape. Digital advances make it feasible to generate new kinds of data, such as Real-World Data (RWD), and to analyze existing data through methods like Decentralized Clinical Trials (DCTs). Digital technologies are characteristically fast, large-scale, cost-effective, continuous, and patient-centric. We must explore their potential without bias or convention, because ‘Data is All We Need’ after all.
  2. (9:00-9:25)
    Harnessing Wild and Untamed Data for the Cost-Efficient Development of Customized Speech-Related Services
    by Wei Chu (Founder and CEO of Olewave)

    Abstract: We recently discovered that models trained with large-scale speech datasets sourced from the web can achieve superior accuracy, at potentially lower cost, compared with models trained on traditional human-labeled or simulated speech datasets. We developed a customizable AI-driven data labeling system. It infers word-level transcriptions with confidence scores, enabling supervised ASR training. It also robustly generates phone-level timestamps even in the presence of transcription or recognition errors, facilitating the training of TTS models. Moreover, it automatically assigns labels such as scenario, accent, language, and topic tags to the data, enabling the selection of task-specific data for training a model tailored to that particular task. We assessed the effectiveness of the datasets by fine-tuning open-source large speech models such as Whisper and SeamlessM4T and analyzing the resulting metrics. In addition to openly available data, our data handling system can also be tailored to provide reliable labels for proprietary data from certain vertical domains. This customization enables supervised training of domain-specific models without the need for human labelers, eliminating data breach risks and significantly reducing data labeling costs.
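
    The sketch below illustrates the general recipe the abstract describes, not Olewave's actual system: keep only web-sourced utterances whose machine-generated transcription confidence clears a threshold, then fine-tune an open-source Whisper checkpoint on the surviving pairs with Hugging Face transformers. The JSONL file name, its fields (audio, text, confidence), and the 0.9 threshold are illustrative assumptions.

```python
# Minimal sketch: confidence-filtered pseudo-labels -> supervised Whisper fine-tuning.
from dataclasses import dataclass
from datasets import load_dataset, Audio
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Hypothetical JSONL of web-crawled audio paths with confidence-scored pseudo-labels.
ds = load_dataset("json", data_files="web_speech_pseudolabels.jsonl", split="train")
ds = ds.filter(lambda ex: ex["confidence"] >= 0.9)  # drop low-confidence labels
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(ex):
    # Log-mel features for the encoder, token ids for the decoder target.
    audio = ex["audio"]
    ex["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    ex["labels"] = processor.tokenizer(ex["text"]).input_ids
    return ex

ds = ds.map(prepare, remove_columns=ds.column_names)

@dataclass
class Collator:
    processor: WhisperProcessor
    def __call__(self, features):
        batch = self.processor.feature_extractor.pad(
            [{"input_features": f["input_features"]} for f in features],
            return_tensors="pt",
        )
        labels = self.processor.tokenizer.pad(
            [{"input_ids": f["labels"]} for f in features], return_tensors="pt"
        )
        # Mask padding so it is ignored by the cross-entropy loss.
        batch["labels"] = labels["input_ids"].masked_fill(
            labels["attention_mask"].ne(1), -100
        )
        return batch

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="whisper-web-ft",
        per_device_train_batch_size=8,
        learning_rate=1e-5,
        max_steps=1000,
    ),
    train_dataset=ds,
    data_collator=Collator(processor),
)
trainer.train()
```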
  3. (9:25-9:50)
    Deep dive into the evolution of the automobile
    by Jaesik Lee (Co-Founder and President of Technology, Navitas Solutions)

    Abstract: Automobiles, which changed little over the past 100 years, have recently entered a period of great transformation. One aspect is electrification; the other is the realization of autonomous driving. The biggest contributor to electrification is the change in powertrain, driven by the shift toward more eco-friendly automobiles amid concerns about climate change. Electric vehicles (EVs) driven by rechargeable batteries are becoming increasingly practical thanks to improved range and charging infrastructure enabled by advances in battery technology. Lithium-ion batteries offer a plethora of benefits, including high energy density, long cycle life, and fast charging. However, they also come with concerns such as cost, limited lifespan, safety, environmental impact, and thermal sensitivity. A battery management system (BMS) ensures that the battery operates within its safe operating area (SOA). I started a company more than 10 years ago to create a smart wireless BMS. The newly proposed BMS allows individual components to be replaced without an entire rebuild, while also eliminating the impact of system modifications on bulky wiring harnesses. The implemented system performed similarly to a wired system while saving weight and significantly reducing failure points. The safety, controllability, and performance of the BMS, especially for EV applications, depend heavily on battery sensor measurement, an interference-robust wireless protocol, fault-tolerant BMS design, and data security. I will present the development process for such a safe, reliable, and cost-effective wireless BMS.
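
    As a toy illustration of the core BMS duty named above (not Navitas' implementation), the sketch below checks each cell reading against a safe operating area. The voltage, temperature, and current limits are hypothetical round numbers for a generic lithium-ion cell.

```python
# Toy SOA monitor: flag any cell reading outside its safe operating area.
from dataclasses import dataclass

@dataclass
class CellReading:
    voltage_v: float       # cell terminal voltage (V)
    temperature_c: float   # cell temperature (degC)
    current_a: float       # positive = discharge (A)

# Hypothetical SOA limits for a generic Li-ion cell.
SOA = {
    "v_min": 2.5, "v_max": 4.2,
    "t_min": -20.0, "t_max": 60.0,
    "i_max": 100.0,
}

def soa_violations(cells: list[CellReading]) -> list[str]:
    """Return a human-readable list of SOA violations across the pack."""
    faults = []
    for i, c in enumerate(cells):
        if not SOA["v_min"] <= c.voltage_v <= SOA["v_max"]:
            faults.append(f"cell {i}: voltage {c.voltage_v:.2f} V out of range")
        if not SOA["t_min"] <= c.temperature_c <= SOA["t_max"]:
            faults.append(f"cell {i}: temperature {c.temperature_c:.1f} C out of range")
        if abs(c.current_a) > SOA["i_max"]:
            faults.append(f"cell {i}: current {c.current_a:.1f} A over limit")
    return faults

pack = [CellReading(3.70, 25.0, 40.0), CellReading(4.35, 41.0, 40.0)]
for fault in soa_violations(pack):
    print("FAULT:", fault)  # a real BMS would derate or open the contactor here
```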
  4. (9:50-10:15)
    A Pattern for Making Impact in Trustworthy Machine Learning through Basic Research, Open Source, Product, and Deployment
    by Kush R. Varshney (Research scientist and senior manager at IBM Research – T. J. Watson Research Center)

    Abstract: In this talk, we will elucidate a repeatable pattern for making real-world impact in industry that begins with basic publishable/patentable algorithmic research and includes the development of open-source toolkits and enterprise-grade software products. The steps must be intentional. We will explain the progression and illustrate how it has come to fruition in the area of trustworthy machine learning, specifically in the areas of (a) algorithmic fairness and (b) interpretable and explainable machine learning. This will include discussion of the ICASSP 2019 paper “Bias Mitigation Post-Processing for Individual and Group Fairness,” open-source toolkits AI Fairness 360 and AI Explainability 360 governed by the Linux Foundation, and the IBM Watson OpenScale software offering. Some examples of deployments with non-profit social change organizations will be presented. As part of explaining the pattern, we will also describe the technical area of trustworthy machine learning based on the presenter’s 2022 book of the same title. At the end, we will also discuss how this pattern is being pursued in the context of detecting, understanding and preventing hallucinations in large language models.
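
    For a flavor of the open-source step of this pattern, the sketch below shows the kind of measure-mitigate-remeasure workflow the AI Fairness 360 toolkit supports. Reweighing is used here for brevity as a stand-in mitigation; it is not the post-processing algorithm of the cited ICASSP 2019 paper. AdultDataset expects the UCI Adult data files to be present per AIF360's setup instructions.

```python
# Measure a group-fairness metric, mitigate, and measure again with AIF360.
from aif360.datasets import AdultDataset
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.algorithms.preprocessing import Reweighing

privileged = [{"sex": 1}]
unprivileged = [{"sex": 0}]

data = AdultDataset()  # standard UCI Adult benchmark bundled with AIF360

before = BinaryLabelDatasetMetric(
    data, privileged_groups=privileged, unprivileged_groups=unprivileged
)
print("Disparate impact before:", before.disparate_impact())

# Reweighing adjusts instance weights so outcomes decouple from group membership.
rw = Reweighing(unprivileged_groups=unprivileged, privileged_groups=privileged)
data_rw = rw.fit_transform(data)

after = BinaryLabelDatasetMetric(
    data_rw, privileged_groups=privileged, unprivileged_groups=unprivileged
)
print("Disparate impact after reweighing:", after.disparate_impact())
```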
  5. (10:30-10:55)
    Democratization of technologies
    by Kazuya Takeda (Professor and VP in charge of startup, Nagoya University, co-founder, former representative director, TierIV Inc.)

    Abstract: My personal research history may be a typical example of building a startup in a Japanese university environment: fundamental research, potential applications, industry collaboration, disappointment, meeting the right person, launching a team, university regulations, founding, and funding... in short, a wonderful challenge. If my talk can encourage young researchers to go forward with ambitious ventures, i.e., changing the world, I would be more than happy.
  6. (10:55-11:20)
    Synthetic data for algorithm development: real-world examples and lessons learned
    by Wontak Kim (Senior Manager of Research at Amazon)

    Abstract: With the advent of ML and deep learning based algorithms, the use of synthesized audio data for tuning and training is becoming popular. But how effective is it? Is it realistic enough? What are the caveats? A number of acoustic modeling approaches have been developed for synthetic data generation, from the simple image-source method to realistic physics-based models, with varying trade-offs between accuracy and computational complexity. We will discuss the pros and cons of different methods from both a scientific and a practical engineering point of view. We will share real-life examples of synthetic data usage in industry, spanning beamforming, echo cancellation, noise reduction, music processing (including spatial audio), and acoustic localization and detection applications. We will also consider the tricky question of how to establish effectiveness: impulse response accuracy compared to real data from a physics point of view, as well as the performance of algorithms tuned or trained with synthetic versus real data. The best of both worlds can often be had with a mixture of real and synthetic data. Lastly, for deep learning applications, it is important to architect the synthetic data generation pipeline early on so that the model development workflow stays efficient and scalable.
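
    The image-source method mentioned above is straightforward to try with the open-source pyroomacoustics package (our choice for illustration; the talk does not prescribe a tool). Room size, absorption, reflection order, and source/microphone positions below are arbitrary example values.

```python
# Simulate a room impulse response (RIR) with the image-source method.
import numpy as np
import pyroomacoustics as pra

fs = 16000
room = pra.ShoeBox(
    [6.0, 5.0, 3.0],              # room dimensions in meters
    fs=fs,
    materials=pra.Material(0.3),  # flat absorption coefficient on all walls
    max_order=10,                 # image-source reflection order
)
room.add_source([2.0, 3.0, 1.5])      # talker position
room.add_microphone([4.0, 2.0, 1.2])  # microphone position

room.compute_rir()
rir = room.rir[0][0]                  # impulse response: mic 0, source 0
print(f"RIR length: {len(rir)} samples ({len(rir) / fs:.3f} s)")

# Convolving dry speech with this RIR yields synthetic reverberant training data.
dry = np.random.randn(fs)             # stand-in for 1 s of dry speech
wet = np.convolve(dry, rir)
```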
  7. (11:20-11:45)
    AI Empowerment in Mental Health Early Warning Systems
    by Zhaojin Lu (Director of the Tellhow AI Research Institute and the Chief Scientist at Tellhow Creative Technology Group) & Hyunwoo Kim (Senior Research Expert and a Principal Investigator at Zhejiang Lab)

    Abstract: The advent of AI and big data mining offers unprecedented opportunities for early warning systems in mental health, building on the principle that behavior is a manifestation of the psyche and the psyche underlies behavior. By leveraging AI algorithms and big data analytics, this initiative analyzes behavioral patterns through digital footprints left on various platforms, including social media, academic performance, and online interactions. These analyses allow for the identification of deviations from baseline behaviors, signaling potential mental health concerns. The system is designed to provide timely interventions, guiding individuals toward appropriate resources and support. This approach not only enhances the efficiency of mental health services but also emphasizes the importance of understanding and addressing underlying psychological factors through observed behaviors. Our work underscores the potential of AI and big data to transform mental health surveillance and support, making it more proactive, personalized, and preventive. The solution has been developed as a standard product that has been deployed on many university campuses in China and has greatly reduced the occurrence of extreme behaviors among students.
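
    A toy sketch of the baseline-deviation idea described above (not the deployed product): z-score a student's recent behavioral features against their own history and flag large deviations for human follow-up. The feature names and the threshold of 3 are illustrative assumptions.

```python
# Flag per-student behavioral features that deviate sharply from their baseline.
import numpy as np

FEATURES = ["late_night_activity", "social_posts", "attendance"]  # hypothetical

def flag_deviation(history: np.ndarray, recent: np.ndarray, z_thresh: float = 3.0):
    """history: (days, features) per-student baseline; recent: (features,)."""
    mu = history.mean(axis=0)
    sigma = history.std(axis=0) + 1e-8  # avoid division by zero
    z = (recent - mu) / sigma
    # Empty dict means the recent readings are within the student's baseline.
    return {f: float(z[i]) for i, f in enumerate(FEATURES) if abs(z[i]) > z_thresh}

rng = np.random.default_rng(0)
history = rng.normal([2.0, 5.0, 0.95], [0.5, 1.0, 0.03], size=(60, 3))
recent = np.array([5.5, 1.0, 0.60])  # a sharp behavioral change
print(flag_deviation(history, recent))
```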

Industry Colloquium

Monday, April 15, 2024 @ GRAND BALLROOM 101

Timetable:

14:00 ~ 15:30 [Colloquium 1]
IVAS – the 3GPP standard for immersive real-time communication
15:30 ~ 16:00 Coffee Break
16:00 ~ 17:30 [Colloquium 2]
Foundation models and the next era of AI

Industry Colloquium 1:

IVAS – the 3GPP standard for immersive real-time communication

Time: Monday, April 15, 2024, 14:00-15:30
Location: GRAND BALLROOM 101

Organizer: Adriana Vasilache (Nokia Technologies)
Presenters:
Brian Lee (Dolby Laboratories, Inc.)
Stefan Bruhn (Dolby Laboratories, Inc.)
Jan Kiene (Fraunhofer IIS)
Lasse Laaksonen (Nokia Technologies)
Stefan Döhla (Fraunhofer IIS)

Scope and Topics of the Colloquium:
3GPP, the worldwide partnership project among seven major regional telecommunications standard development organizations, is now completing the standardization of a new codec intended for Immersive Voice and Audio Services (IVAS) in 5G mobile systems. The new IVAS codec will become a feature of 3GPP Release 18 (5G-Advanced). It extends the existing 3GPP EVS codec currently used in 4G (LTE) mobile telephony to support, beyond mono voice, stereo and spatial voice and audio. The IVAS codec will enable immersive audio experiences in 3GPP mobile real-time communications and allow such experiences to be shared. Addressed service scenarios are conversational voice with stereo and immersive telephony/conferencing; XR (VR/AR/MR) communications, i.e., XR telephony and conferencing; and live and non-live streaming of user-generated immersive and XR content.

The presentation will give a brief overview of the IVAS standardization, the codec structure, its supported formats (spanning stereo, object-based audio, scene-based audio, multichannel audio, and metadata-assisted spatial audio), and its main technology blocks, including spatial audio rendering. It will highlight the capabilities of the new codec, showing formal test results from the 3GPP standardization, and will be accompanied by real-time demos.

Industry Colloquium 2:

Foundation models and the next era of AI

Time: Monday, April 15, 2024, 16:00-17:30
Location: GRAND BALLROOM 101

Organizer: Midia Yousefi - Senior Research Scientist, AI at Microsoft, USA.
Organizer Bio:
Midia Yousefi is a Senior Researcher at Microsoft in the Cloud and AI organization, actively involved in various audio- and speech-related research projects. Prior to joining Microsoft, Midia was a postdoc and PhD student at the University of Texas at Dallas under the supervision of Dr. John Hansen, contributing to multiple projects including the recovery and digitization of the Apollo corpus as well as the development of benchmarks for overlapped speech detection, separation, and recognition. Midia commenced her professional journey in industry as a speech scientist intern at Bosch LLC in the summer of 2019, specializing in extracting target-speaker speech from real audio recordings of smart voice assistants. Subsequently, in the summer of 2020, she transitioned to Microsoft as a research scientist intern, concentrating on emotion and toxic language recognition in the online gaming (Xbox) domain. Her research findings have been published in various journals and conference proceedings, focusing on speech processing and language technology.

Presenters:
Ozlem Kalinli - Senior Research Scientist Manager, AI at Meta, USA.
Midia Yousefi – Senior Research Scientist, AI at Microsoft, USA.

Scope and Topics of the Colloquium:
Over the last decade, AI has made significant progress on perception tasks like Image, Speech, and Language processing. More recently, the field is witnessing new advances in the form of generative AI, underpinned by a class of large-scale models known as “Foundation Models”. Foundation models are trained on massive amounts of data and are capable of performing a wide range of tasks. Throughout this presentation, we aim to dissect the potential of "Foundational Large Audio Models" and their pivotal roles across a spectrum of applications. These models are instrumental in Automatic Speech Recognition, Universal Speech Translation, Music Generation, Speaker Verification, Text-to-Speech Synthesis, Emotion Recognition, and beyond. Moreover, we will navigate the intricate landscape of challenges and opportunities within audio and speech processing, offering insights into its evolutionary path from academic and industrial standpoints. Finally, we will analyze the transformative impact of advancements in Large Audio Models on various sectors, shedding light on their tangible applications in real-world scenarios. Through this exploration, we hope to unravel the intricate web of possibilities that these models offer, paving the way for innovative solutions and enhanced experiences across diverse domains.

Presenter: Ozlem Kalinli - Senior Research Scientist Manager, AI at Meta, USA
Presenter Bio:

Title:  Foundation Models for Speech, Language, and Audio

Bio:  Ozlem Kalinli is a Senior Research Scientist Manager at Meta. She joined Meta in 2020, and her team focuses on advancing state-of-the-art automatic speech recognition (ASR) and voice interaction on AR/VR devices such as Ray-Ban | Meta smart glasses and Oculus headsets. Between 2017 and 2020, she was with the Siri Speech team at Apple, where she managed the acoustic modeling team for ASR. Before joining Apple, she was a senior research scientist in the R&D Group at Sony PlayStation from 2010 to 2017. In 2008, she was a visiting researcher in the Speech Group at Microsoft Research. Her work has mainly focused on speech and audio processing, automatic speech recognition, natural language understanding, deep learning, and, in recent years, generative AI.

Dr. Kalinli has published her work in speech and language processing conferences and journals. She has served as an associate editor of IEEE Transactions on Audio, Speech and Language Processing since 2019, has been a member of the IEEE Women in Signal Processing (WISP) Committee since January 2021, and served on the IEEE Speech and Language Technical Committee (SLTC) from 2018 to 2021. She has also served on the organizing committees of various speech and language conferences, including as technical program chair for ICASSP, Interspeech, and ASRU.