The Cary Showtimes

The Cary has a page that displays showtimes, but it 1) is not very readable, 2) does not make full use of external data (like IMDB), and 3) does not give others access to the data.

The Cary Project seeks to fix these issues by scraping the data, storing it, and displaying it in a much more useful way.

Project Activity

Update #4

Python with BeautifulSoup rocks! I'll talk about that and how I can get data from the new website at the 18 Nov. Meetup!

Update #3

21 Oct Meeting

  • Reviewed progress… which isn't much as I was out of town most of the last two weeks.
  • I decided to split off a bit and build a scraping tool that will accept a JSON or YAML object and process a provided HTML file. This should be useful in other scraping projects, and also when refactoring this project or adding new pages to it.
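
A minimal sketch of what such a config-driven scraper could look like. This is only an illustration, not the actual tool: the config layout (field name mapped to a tag and a class), the class names, and the sample HTML are all made up, and it uses the standard library's html.parser instead of Beautiful Soup so it runs without extra installs. It also reads the HTML from a string rather than a file.

```python
import json
from html.parser import HTMLParser

# Hypothetical config: field name -> the tag and class to look for.
CONFIG_JSON = """
{
  "title": {"tag": "span", "class": "event-title"},
  "time":  {"tag": "span", "class": "event-time"}
}
"""

# Made-up markup standing in for a saved page.
SAMPLE_HTML = """
<div class="event">
  <span class="event-title">Example Film</span>
  <span class="event-time">7:00 PM</span>
</div>
"""


class ConfigScraper(HTMLParser):
    """Collect the text of every element that matches a rule in the config."""

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.results = {field: [] for field in config}
        self._field = None  # field currently being captured, if any
        self._depth = 0     # nesting depth of same-named tags inside the match
        self._text = []     # text pieces collected for the current match

    def handle_starttag(self, tag, attrs):
        if self._field is not None:
            # Already inside a match: only track nested tags of the same name.
            if tag == self.config[self._field]["tag"]:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for field, rule in self.config.items():
            if tag == rule["tag"] and rule["class"] in classes:
                self._field, self._depth, self._text = field, 1, []
                break

    def handle_endtag(self, tag):
        if self._field is None or tag != self.config[self._field]["tag"]:
            return
        self._depth -= 1
        if self._depth == 0:
            # The matched element just closed: store its text.
            self.results[self._field].append("".join(self._text).strip())
            self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self._text.append(data)


def scrape(html, config):
    """Run the config rules over an HTML string and return the matches."""
    parser = ConfigScraper(config)
    parser.feed(html)
    parser.close()
    return parser.results


results = scrape(SAMPLE_HTML, json.loads(CONFIG_JSON))
print(results)
```

The point of the design is that adding a new page should only mean writing a new config object, not new parsing code.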

Update #2

7 Oct Meeting

  • Reviewed recent changes to The Cary's calendar. They seem to have started using IMDB data.
  • We discussed possible issues with using IMDB data or using copyright film posters.
  • Discussed a possible API endpoint for IMDB.
  • We could also use Wikipedia if needed.
  • I was able to scrape the calendar page, but I'm not sure it will be any easier overall than the other page. The data itself is definitely easier to access, since nodes on this page use classes, whereas the other page has few useful attributes.
  • I hadn't noticed that the ugly calendar doesn't seem to be updated. It's still showing September dates.

Update #1

23 Sept Meeting

  • It may be easier to use the Cary events calendar to scrape the necessary data.
  • Looking at Python and Beautiful Soup. (I like the API, it's pretty close to JavaScript. :D )
  • Create repo in Github Code for Cary:

Regarding storage of parsed results:

  • We talked about City Gram once, and I was curious how to link this up to a similar service. Would we have one database for all our different projects, or would each project have its own DB with a REST API? A REST API sounds like the better option.
  • Could host it on Digital Ocean once we're ready.
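
To make the "each project has its own DB with a REST API" option a bit more concrete, here is a rough sketch of a tiny read-only endpoint built with Python's standard http.server. Everything specific in it is an assumption: the /showtimes path, the record fields, and the in-memory list standing in for a real database.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder records standing in for this project's own store of parsed results.
SHOWTIMES = [
    {"title": "Example Film", "time": "19:00"},
]


def route(path):
    """Map a request path to a (status code, JSON body) pair."""
    if path == "/showtimes":
        return 200, json.dumps(SHOWTIMES)
    return 404, json.dumps({"error": "not found"})


class ShowtimesHandler(BaseHTTPRequestHandler):
    """Answer GET requests with JSON produced by route()."""

    def do_GET(self):
        status, body = route(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(host="127.0.0.1", port=8000):
    """Build (but do not start) the HTTP server."""
    return HTTPServer((host, port), ShowtimesHandler)
```

Calling make_server().serve_forever() would start it locally. A service like City Gram, or any other project, could then read the data over HTTP without touching this project's database directly.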

Github repo changes:

  • Added my work so far… not much.
  • Creating milestones, labels, and issues… mostly to keep things straight in my head.
  • Starting to create notes in Github Wiki.

Questions for next meeting:

  • Would it make sense to have multiple data sources to scrape? It would help with getting the most accurate data. For example, scrape the main showtime page, the calendar, and the Films@TheCary page. If a showtime shows up only on the calendar but not on the others, we can safely drop it from the final scrape. This could also be implemented later on.
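
A quick sketch of how the cross-checking idea could work. It assumes each source has already been scraped into a list of (title, time) pairs, and it generalizes the rule slightly: keep a showtime only if at least two sources agree on it. The source names and sample entries are made up.

```python
from collections import Counter


def confirmed_showtimes(sources, min_sources=2):
    """Return the showtimes that appear in at least `min_sources` sources."""
    counts = Counter()
    for showtimes in sources.values():
        # set() so a duplicate within one source only counts once.
        counts.update(set(showtimes))
    return sorted(s for s, n in counts.items() if n >= min_sources)


# Hypothetical scrape results, keyed by where they came from.
sources = {
    "showtimes_page": [("Example Film", "19:00"), ("Another Film", "21:00")],
    "calendar": [("Example Film", "19:00"), ("Calendar Only Film", "14:00")],
    "films_page": [("Example Film", "19:00"), ("Another Film", "21:00")],
}

confirmed = confirmed_showtimes(sources)
print(confirmed)
```

Here the entry that appears only on the calendar is dropped, while the ones confirmed by two or more pages are kept.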