Lazyweb: Getting public domain books out of Gutenberg en masse and onto a Kindle

I’m trying to figure out how to slurp down every single book in Gutenberg to throw onto a Kindle. It’s easy enough to download the Gutenberg archive, but everything comes down either as .txt (with funky line breaks and no title/author information on the Kindle’s index page) or as PDF, which I would have to pay Amazon to convert or run through a conversion program one at a time.

There are sites like Feedbooks and Manybooks that have some of the works pre-formatted as Mobi, but no page from which to slurp down the entire library. I even tried writing a script to grab everything from Manybooks, but when you click the download link they rename the file with the proper author and title, and that renaming doesn’t happen when you use curl or wget. The only way to download every title they offer is by number, which in turn makes it impossible for me to tell which file is which book and remove the ones in other languages.

It’s supposed to be a gift — most of the Western canon (especially any philosophy texts I can gather) on a single device. But I’m having a devil of a time getting it all in the right formats and am a little surprised no one else has done the same and thrown it all up in a torrent. Any ideas?


18 Responses to Lazyweb: Getting public domain books out of Gutenberg en masse and onto a Kindle

  1. Anonymous says:

    I hope that if you do use manybooks.net’s bandwidth to pull down everything they offer that you’ll pay it forward by making your collection available to everyone else — via bittorrent, or some other method.

  2. zuzu says:

    This is exactly why I won’t buy a Kindle. (Lack of) native PDF support is a dealbreaker, for sure.

    The only help I can offer is that I remember reading that the Kindle will (or could be easily hacked to) support Mobipocket .prc files (which is what Amazon’s DRM is built upon), and that there are tools for converting documents to Mobi format yourself.

  3. simmare says:

    I convert multiple documents at a time (not sure of the numerical limit) through Amazon and skip their fee by loading them on the device myself over USB. Here are the instructions from Amazon’s help section:

    “If you are not in a wireless area or would like to avoid the ten-cent fee, you can send attachments to “name”@free.kindle.com to be converted and e-mailed back to your computer at the e-mail address associated with your Amazon.com account. You can then transfer the document to your Kindle using your USB connection.”

  4. mmcclintock says:

    You can download an ISO (3.7 GB) of 20,000+ Kindle files from manybooks.net using bittorrent: Kindle ISO torrent — if you find that useful, please consider making a donation to cover the cost of bandwidth.

  5. brianary says:

    curl -o ${tid}.azw -Ld "tid=${tid}&book=1:kindle:.azw:kindle" http://manybooks.net/scripts/send.php

  6. SamSam says:

    @zuzu: Why an eInk version?

    Seems to me that with an eInk version, you can only get the latest version of any page, and if the page on Wind Turbines just happened to be replaced that minute by “Mike is GAY!!1!lol”, well, I guess your post-apocalyptic civilization is never going to build a wind turbine.

    (Someone really ought to write a book about that…)

    If you download the entire thing with the diffs (or even just the last 20 revisions of each page, but I bet the diffs don’t add much weight to any page), then you get all the versions you need. You then run that on your nice little crankable-OLPC or something, and civilization is saved.

    • Joel Johnson says:

      I couldn’t get any of these commands working, so I downloaded a few hundred books by hand. But I appreciate you guys trying all the same!

  7. zuzu says:

    @PaulR

    This sounds damn good for my desire to perpetually archive Wikipedia on an eInk display device, in case of the collapse of civilization.

  8. brianary says:

    wget -O "${tid}.azw" --post-data="tid=${tid}&book=1:kindle:.azw:kindle" http://manybooks.net/scripts/send.php

    It is a bit odd that neither curl nor wget includes an option to use the redirect URL for the filename when auto-choosing a filename (no -O/default in wget; -O in curl).

    BTW: Why can’t we use the <code> HTML tag for style?
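    The filename idea in this comment can be sketched in a few lines of Python. This is only a sketch: it assumes the final URL after the redirect ends in the readable filename, which the thread does not confirm, and `filename_from_url` and the `download.azw` fallback are hypothetical names used for illustration.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(final_url, fallback="download.azw"):
    """Return the last path segment of a URL, percent-decoded.

    Falls back to a placeholder name when the path has no final segment.
    """
    path = urlparse(final_url).path          # drops the query string and host
    name = posixpath.basename(unquote(path))  # "a%20b.azw" -> "a b.azw"
    return name or fallback


# Example with a made-up URL; example.com stands in for the real redirect target.
print(filename_from_url("http://example.com/books/Plato%20-%20The%20Republic.azw?src=1"))
```

    With `urllib.request.urlopen`, the response object's `geturl()` method reports the URL that was finally fetched after any redirects, so its result could be passed to this function.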

  9. PaulR says:

    I was looking (not that hard) for the links on the iLiad Users’ forums:
    Someone had already started work on this and had (mostly) created a searchable Wiki package for the iLiad.
    The Wiki-folks have put a kibosh on the idea of regular D/Ls of static Wiki, um, builds.

    I’ve got a 4GB-sized static archive of Wiki from the early Summer ’08 on this computer here .. somewhere. With an 8GB CF MicroDrive, I could put both the GP and Wiki…

  10. GeekMan says:

    If the biggest complaint is the conversion aspect, I’d say that you’re lucky it’s all in .txt format; nothing could be easier to work with. You just need to build yourself a little script that will process everything automatically. Perl would be ideally suited.
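    As a rough illustration of such a script (in Python rather than Perl), the sketch below joins hard-wrapped lines back into paragraphs and pulls the title and author out of the file header. It assumes the text file carries `Title:` and `Author:` lines near the top, as Project Gutenberg plain-text files generally do; the function names are made up for this example.

```python
import re


def extract_metadata(text):
    """Return (title, author) from 'Title:' / 'Author:' lines, or None if absent."""
    def find(label):
        match = re.search(rf"^{label}:\s*(.+)$", text, flags=re.MULTILINE)
        return match.group(1).strip() if match else None

    return find("Title"), find("Author")


def unwrap_paragraphs(text):
    """Join hard-wrapped lines; blank lines are kept as paragraph breaks."""
    text = text.replace("\r\n", "\n")
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse every run of whitespace inside a paragraph to a single space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


sample = "Title: The Republic\r\nAuthor: Plato\r\n\r\nI went down yesterday\r\nto the Piraeus.\r\n"
print(extract_metadata(sample))
```

    Note that this naive unwrapping also flattens verse, tables, and the header block itself, so a real run would want to treat those separately before handing the result to a Mobi converter.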

  11. gouldina says:

    Are you sure this is such a good idea? I don’t know about the Kindle but I have a Sony Reader and it slows right down when you have over about 30 books on it. Also, there are a lot of books on Gutenberg – how much memory will this require, and will a Kindle handle it?

  12. ben says:

    I don’t see a problem with crawling http://manybooks.net/language.php?code=en, loading each book page, getting the form inputs, posting to the http://manybooks.net/scripts/send.php script, and saving the data that gets sent.
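    The middle steps of that approach (read the form inputs from a book page, then build the POST to send.php) might look roughly like the Python sketch below. The page markup is an assumption: the sketch simply collects every `<input>` name/value pair, and the field names `tid` and `book` are taken from the curl command earlier in the thread, not from the site itself. Nothing here touches the network; the request is built but not sent.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request

SEND_URL = "http://manybooks.net/scripts/send.php"  # URL quoted in the comment above


class FormInputs(HTMLParser):
    """Collect name/value pairs from every <input> tag on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attributes = dict(attrs)
            if attributes.get("name"):
                self.fields[attributes["name"]] = attributes.get("value") or ""


def form_fields(page_html):
    """Return the input fields found in the given HTML as a dict."""
    parser = FormInputs()
    parser.feed(page_html)
    return parser.fields


def build_download_request(fields):
    """Build (but do not send) a POST request carrying the form fields."""
    data = urlencode(fields).encode("ascii")
    return Request(SEND_URL, data=data)  # supplying data makes this a POST


# Hypothetical book-page fragment; the real markup may differ.
page = '<form><input type="hidden" name="tid" value="abc123"><input name="book" value="1:kindle:.azw:kindle"></form>'
request = build_download_request(form_fields(page))
print(request.get_method(), request.full_url)
```

    Sending the request with `urllib.request.urlopen(request)` and writing the response body to disk would complete the loop; a polite crawler would also pause between requests.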

  13. Anonymous says:

    Mobi has free tools for creating PRC files from PDF automatically. Some of them are command-line, and therefore probably very scriptable.

  14. Anonymous says:

    Hire a student to do the grunt work for $10/hour! They need the money – you get the stuff.

  15. SamSam says:

    I agree with Geekman, and I’d further suggest pushing it through LaTeX to get it all nice and pretty first.

    Wait, the Kindle accepts LaTeX natively, right?

  16. PaulR says:

    Zuzu @6:
    I know I’m annoying about this but..:
    iRex’s iLiad:
    http://www.irextechnologies.com/products/iliad
    It supports PDF out of the box. And you can write on the documents with a stylus.

    Developer SDK available from iRex:
    http://developer.irexnet.com/
    It runs Linux, dontcha know.

    Put a 2G+ Compact Flash card in it and you can carry the whole Gutenberg Project.
    Suppose you D/L all the books in HTML form: Twenty-five thousand books at 60KB/book would give you about 1.5 GB.

    One of the users wrote an iLiad file utility to make transferring files from an external card to the internal memory easy, a la the old Norton Commander / PCTools style.
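    The storage estimate in this comment is easy to check. The sketch below redoes the arithmetic in Python using decimal units (1 GB = 1,000,000 KB); the book count and the 60 KB average are the comment's own figures, and the function name is just for illustration.

```python
def library_size_gb(book_count, avg_kb_per_book):
    """Estimated total size in GB, using decimal units (1 GB = 1,000,000 KB)."""
    return book_count * avg_kb_per_book / 1_000_000


size = library_size_gb(25_000, 60)
print(f"{size:.1f} GB")  # 25,000 books at 60 KB each
```

    At those figures the whole collection comes in under the capacity of a 2 GB card, with room to spare.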

  17. dculberson says:

    Joel, you shouldn’t have! But I still look forward to getting my Kindle. :-P


This work is licensed under a Creative Commons License permitting non-commercial sharing with attribution. Boing Boing is a trademark of Happy Mutants LLC in the United States and other countries.
