A group of MIT Media Lab researchers has published RadioTalk, a massive corpus of talk radio audio with machine-generated transcriptions, totaling 284,000 hours’ worth of speech and marked up with machine-readable metadata.
The audio was scraped from streaming radio services between Oct 2018 and Mar 2019, and the transcripts run to 2.8 billion words. The researchers hope the corpus will be used by “researchers in the fields of natural language processing, conversational analysis, and the social sciences.”
I’m mostly interested in the social science implications here: talk radio is incredibly important to US political discourse, but because it is ephemeral and because recorded speech is hard to data-mine, we have very little quantitative analysis of this body of work.
As Gretchen McCulloch points out in her new book on internet-era language, Because Internet, research on human speech has historically relied on expensive human transcription, leading to very small corpuses covering a very small fraction of human communication.
This corpus is part of a shift that allows social scientists, linguists and political scientists to study a massive core-sample of spoken language in our public discourse.
We introduce RadioTalk, a corpus of speech recognition transcripts sampled from talk radio broadcasts in the United States between October of 2018 and March of 2019. The corpus is intended for use by researchers in the fields of natural language processing, conversational analysis, and the social sciences. The corpus encompasses approximately 2.8 billion words of automatically transcribed speech from 284,000 hours of radio, together with metadata about the speech, such as geographical location, speaker turn boundaries, gender, and radio program information. In this paper we summarize why and how we prepared the corpus, give some descriptive statistics on stations, shows and speakers, and carry out a few high-level analyses.
RadioTalk: a large-scale corpus of talk radio transcripts [Doug Beeferman, William Brannon and Deb Roy/Arxiv]
(via Four Short Links)