Uploading in Bulk

Depending on your situation, you may find yourself needing to upload many things to the Internet Archive at once. These are some ideas for how to proceed.

Jason Scott demonstrates on Twitter
I was handed 5,300 PDFs of medical manuals and now I'm going to put them into the Archive.

First, I'm keeping the original 29 GB .tar.bz2 they gave me, because I know there are folks who want just the one big pile, not my clever little uploads. Keep the originals around, if you can; let someone else have the chance you did to start.

Next, the metadata is partially encoded in the directory tree, so I'm writing a custom script that walks the directory structure and turns it into keywords.
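The original script isn't shown, but the idea of mining a directory tree for keywords can be sketched in a few lines of Python. The root path and directory names below are hypothetical stand-ins, not the actual layout of this collection:

```python
from pathlib import Path

def keywords_from_path(pdf_path: Path, root: Path) -> list[str]:
    """Derive subject keywords from the directory components between
    the collection root and the file itself."""
    parts = pdf_path.relative_to(root).parts[:-1]  # drop the filename
    # Normalize: underscores to spaces, lowercase, skip empty components
    return [p.replace("_", " ").strip().lower() for p in parts if p.strip()]

# Hypothetical example: a manual filed under manufacturer/category/model
root = Path("/data/medical_manuals")
pdf = root / "Welch_Allyn" / "Imaging_Systems" / "LCI_100" / "service_manual.pdf"
print(keywords_from_path(pdf, root))
# -> ['welch allyn', 'imaging systems', 'lci 100']
```

Each directory level becomes one subject keyword, which is exactly the kind of metadata the Archive's search can use later.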

I can make collections because I'm an admin. The collection will be named "manuals_medicaldevices" and will live inside the "manuals" collection.

I'm now rewriting my longtime uploading script to do a little extra work, since the metadata is sitting there in the directory structure. I'll then test with a single item.
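His uploading script isn't shown either, but one way to sketch this step uses the `ia` command-line tool from the Internet Archive's `internetarchive` package. This sketch only builds and prints the command rather than running it, so one item can be eyeballed before looping over thousands; the identifier and file path are hypothetical:

```python
import shlex
from pathlib import Path

def build_upload_command(identifier: str, pdf_path: Path, keywords: list[str]) -> str:
    """Build (but do not execute) an `ia upload` command that attaches
    per-item metadata, including subjects mined from the directory tree."""
    args = [
        "ia", "upload", identifier, str(pdf_path),
        "--metadata=mediatype:texts",
        "--metadata=collection:manuals_medicaldevices",
    ]
    for kw in keywords:
        args.append(f"--metadata=subject:{kw}")
    return shlex.join(args)

# Print the command for a single hypothetical item before automating the rest
cmd = build_upload_command(
    "welch-allyn-lci-100-200-service-manual",
    Path("LCI_100/service_manual.pdf"),
    ["welch allyn", "imaging systems"],
)
print(cmd)
```

Printing first, uploading second mirrors the advice in this thread: run one iteration, check the result, then unleash the loop.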

The collection is now waiting for me.

These things almost never go right the first time, so I'm running just one iteration of my script, on a single item: a Welch Allyn LCI 100 & 200 Imaging System Service Manual. I see it got uploaded, possibly with useful metadata.

Now we're going to run into an interesting situation: the Archive has a massive queue system running, with hundreds of thousands of "jobs" a day. My manual's upload will fall into place over the course of a few minutes and then generate a readable version. It's not instant.

You can now see the item here.

Note that how the item looks depends on when you view it. If you're following this thread at this exact moment, it's going to be very incomplete. Over time, it will pull in a thumbnail, generate an online readable version, and feed OCR text into the search index.

Looking over this item, I already see an interesting situation: I happened to choose an item where the creators of this collection put two identical copies into different directories! I'm going to do an FDUPES run to clean this up wherever it happens. Since the directory structure is so complicated and tight, I expect this will let me dump the dupes right now.

FDUPES runs pretty quick. I do a dry run before I decide to actually go down there and make a mess. Obviously I have backups galore, nothing will be lost, but who likes redoing a lot of work.

Yes, I can now clearly see that whoever made this directory structure, when the same manual covered three models, simply put identical copies into three directories. Not good for my purpose. I'm doing a run that finds all the dupes and keeps only the first copy. This will cut things down. I will then probably see how many there are and potentially move items up one directory, since the model level isn't needed.