How The New York Times is using generative AI as a reporting tool

If you don’t have a 1960s secretary who can do your audio transcription for you, AI tools can now serve as a very good stand-in.

Credit: Getty Images

This rapid advancement is definitely bad news for people who make a living transcribing spoken words. But for reporters like those at the Times, who can now transcribe hundreds of hours of audio quickly, accurately, and at a much lower cost, these AI systems are just another important tool in the reporting toolbox.
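The article doesn't say which transcription tool the Times used, but as a rough illustration of how cheap this step has become, here is a minimal sketch of batch transcription using the open source Whisper speech-to-text model. The folder names and model size are assumptions, not the Times' actual setup.

```python
# Hypothetical sketch: transcribe a folder of audio files with the open source
# Whisper model (pip install openai-whisper). Not the Times' actual pipeline.
from pathlib import Path

import whisper

model = whisper.load_model("base")  # larger models trade speed for accuracy

Path("transcripts").mkdir(exist_ok=True)
for audio_file in Path("episodes").glob("*.mp3"):
    result = model.transcribe(str(audio_file))
    Path("transcripts", audio_file.stem + ".txt").write_text(result["text"])
```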

Leave the analysis to us?

With the automated transcription done, the NYT reporters still faced the difficult task of reading through 5 million words of transcribed text to pick out relevant, reportable news. To do that, the team says it “employed several large-language models,” which let them “search the transcripts for topics of interest, look for notable guests and identify recurring themes.”
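The Times doesn't detail which models or prompts it used, so the following is only a sketch of what that kind of search might look like in practice, using the OpenAI chat API as a stand-in. The model name, prompt wording, and chunk size are assumptions.

```python
# Hypothetical sketch: ask an LLM to flag topics of interest, notable guests,
# and recurring themes in one chunk of a transcript. Not the Times' actual setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are helping a reporter review podcast transcripts. List any topics of "
    "interest, notable guests, and recurring themes in the following excerpt, "
    "and quote the relevant lines."
)

def review_chunk(chunk: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": chunk},
        ],
    )
    return response.choices[0].message.content

def chunks(text: str, size: int = 3000):
    # Split a long transcript into ~3,000-word chunks so each fits in context.
    words = text.split()
    for i in range(0, len(words), size):
        yield " ".join(words[i:i + size])
```

In a workflow like this, a reporter would still read the flagged passages in the original transcript before reporting on them, which is consistent with how the Times describes using the models as a search aid rather than a source.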

Summarizing complex sets of documents and identifying themes has long been touted as one of the most practical uses for large language models. Last year, for instance, Anthropic hyped the expanded context window of its Claude model by showing off its ability to absorb the entire text of The Great Gatsby and “then interactively answer questions about it or analyze its meaning,” as we put it at the time. More recently, I was wowed by Google’s NotebookLM and its ability to form a cogent review of my Minesweeper book and craft an engaging spoken-word podcast based on it.

There are important limits to LLMs’ text analysis capabilities, though. Earlier this year, for instance, an Australian government study found that Meta’s Llama 2 was much worse than humans at summarizing public submissions made to a government inquiry.

Australian government evaluators found AI summaries were often “wordy and pointless—just repeating what was in the submission.”

Credit: Getty Images

In general, the report found that the AI summaries showed “a limited ability to analyze and summarize complex content requiring a deep understanding of context, subtle nuances, or implicit meaning.” Even worse, the Llama summaries often “generated text that was grammatically correct, but on occasion factually inaccurate,” highlighting the ever-present problem of confabulation inherent to these kinds of tools.
