Anthropic destroyed millions of books to train its AI—was it worth it?

A race for AI dominance led to a shocking sacrifice: millions of books torn apart. Now, critics question whether the ends justified the means.

Anthropic, the AI firm behind the assistant Claude, has scanned millions of physical books using a destructive process. The company bought books in bulk, removed their bindings, and discarded them after digitisation. This large-scale operation aimed to gather training data quickly and at lower cost.

The project began after Anthropic initially relied on pirated ebooks for AI training. To secure legal and high-quality material, the firm shifted to purchasing used physical books. In February 2024, it hired Tom Turvey, a former Google Books executive, to lead efforts in acquiring 'all the books in the world.'

Anthropic spent 'many millions of dollars' on buying and scanning books. The process involved stripping bindings, cutting pages, and converting them into PDFs before throwing away the originals. This method prioritised speed and cost efficiency over preservation. A US judge later ruled that the destructive scanning qualified as fair use under specific conditions.

Meanwhile, competitors like OpenAI and Microsoft took a different approach. They partnered with Harvard’s libraries to train AI models using nearly 1 million public domain books, all of which remain preserved.

Anthropic’s approach to data collection has drawn attention for its scale and methods. The company discarded millions of physical books after digitisation, a process a US judge later found to qualify as fair use under specific conditions. This contrasts with other AI developers who have preserved the original materials while training their models.
