Scraping other sites for metadata

This is a series of tutorials on scraping sites using javascript & python to speed up the process of adding metadata to large numbers of items. For example, let's say you have a collection of 30,000 C64 demos, and you want to link them with individual pages on CSDB and pull their metadata across into the archive. How would you achieve something like that without going insane? Here are a few practical examples I've worked on over the years:

Mickey And Friends

Let's start with an easy example I'm working on at the moment: scanning Mickey and Friends, a UK comic. I have over a hundred issues, and am slowly uploading them to the archive. Wouldn't it be nice to know who authored each story in the comics, and which characters appear? Well apparently, people have created a database of that kind of information. And their license encourages use of their data in any way.

At this point, you're going to want to know a bit of javascript or maybe some python, because you don't want to copy/paste this information by hand. If you look at one of the items I uploaded, you'll see the end result of my scraping: artists & characters cross-referenced, so that users can browse other comics featuring the same characters or creators.

Here's the node script I used to add this metadata. And here's what it does:

  • The script makes an axios HTTP request to the internet archive, and grabs the item names for all uploaded Mickey & Friends comics (there's a sketch of this step after the list)
    • This could be improved if written in python, which could just import the internet archive python library and grab the item names directly from there.
  • Since the item names contain the year and the issue number, it's trivial for the script to generate the correct inducks database URL to visit for information about that comic. The KEY is naming each item consistently as I continue to upload issues.
  • There's another axios call to the inducks database for each comic
  • "node-html-parser" is used to parse the page, grabbing whatever information I decide is important and putting it into a javascript object (the second sketch below covers these three steps)
  • Then I generate a description for the item, using html to make it rich with links. The links are all internal links to the internet archive search page. For characters & creators, I know I'm going to put a comma-separated list of names into a metadata field, so I make each link search for that creator's or character's name (third sketch below). Check out the item linked above for an example of what that looks like.
  • The final step is to prepare the data and output a CSV, using the fast-csv package (last sketch below)
  • The CSV can then be used with the internet archive python library to update the metadata fields for description, characters, and creators
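
Here's a rough sketch of that first step. The collection name in the query is hypothetical - swap in however your own items are actually grouped (collection, uploader, title prefix, and so on):

```javascript
// Ask the archive's advancedsearch endpoint for every item identifier.
// "mickey-and-friends" is a made-up collection name - adjust the query.
const axios = require('axios');

async function getItemNames() {
  const res = await axios.get('https://archive.org/advancedsearch.php', {
    params: {
      q: 'collection:mickey-and-friends', // hypothetical
      'fl[]': 'identifier',
      rows: 1000,
      output: 'json',
    },
  });
  // advancedsearch responds with { response: { docs: [{ identifier: '...' }, ...] } }
  return res.data.response.docs.map(doc => doc.identifier);
}
```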
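
Steps two to four could look something like the sketch below. The item-name pattern, the inducks URL, and the CSS selectors are all placeholders - read the real values off an actual issue page (and your own naming scheme) before trusting any of it:

```javascript
// Derive the inducks URL from a consistently named item, fetch the page,
// and pull the interesting bits into a plain javascript object.
const axios = require('axios');
const { parse } = require('node-html-parser');

async function scrapeIssue(itemName) {
  // e.g. "mickey-and-friends-1992-07" -> year 1992, issue 07 (hypothetical naming scheme)
  const [, year, issue] = itemName.match(/(\d{4})-(\d+)$/);
  const url = `https://inducks.org/issue.php?c=uk/MAF+${year}-${issue}`; // placeholder URL format

  const res = await axios.get(url);
  const root = parse(res.data);

  // The selectors are guesses - inspect the real page in your browser's dev tools.
  return {
    identifier: itemName,
    creators: root.querySelectorAll('a.creator').map(a => a.text.trim()),
    characters: root.querySelectorAll('a.character').map(a => a.text.trim()),
  };
}
```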
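
The description step is just string building. The field names ("creator", "characters") and the search query syntax are assumptions - mirror whatever metadata fields you actually end up populating:

```javascript
// Build an html description where each name links back to an archive search.
function searchLink(field, name) {
  const query = encodeURIComponent(`${field}:"${name}"`);
  return `<a href="https://archive.org/search?query=${query}">${name}</a>`;
}

function buildDescription(issue) {
  const creators = issue.creators.map(n => searchLink('creator', n)).join(', ');
  const characters = issue.characters.map(n => searchLink('characters', n)).join(', ');
  return `<div>Creators: ${creators}<br>Characters: ${characters}</div>`;
}
```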
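
And finally the CSV, again with made-up column names - use whatever metadata fields you plan to update:

```javascript
// Flatten each scraped issue into a csv row with one column per metadata field.
const { writeToPath } = require('fast-csv');

function writeCsv(issues) {
  const rows = issues.map(issue => ({
    identifier: issue.identifier,
    description: buildDescription(issue),
    creator: issue.creators.join(','),
    characters: issue.characters.join(','),
  }));

  writeToPath('metadata.csv', rows, { headers: true })
    .on('error', err => console.error(err))
    .on('finish', () => console.log('wrote metadata.csv'));
}
```

If memory serves, the ia command line tool that comes with the python library can take a spreadsheet like this directly (something along the lines of ia metadata --spreadsheet=metadata.csv, keyed on the identifier column) - but double check the docs before running it against real items.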

All of the above can be seen in the go() function, and you can dive into the individual functions to hopefully understand what the script is doing at each step along the way. If you're familiar with code in any way, I'm sure this will be easy. If not, a project like this might make a good starting point. This can also be done in python - I've done it - using the requests library and Beautiful Soup. I'm just more of a javascript guy at the moment.

More to come:

C64 demos & the CSDB

Emulated video games and Mobygames' database

Reeling in the years & the TVDB