Python with BeautifulSoup rocks! I'll talk about that and how I can get data from the new thecarytheater.com website at the 18 Nov. Meetup!
The Cary Showtimes
The Cary has a page that displays show times, but it 1) is not very readable, 2) does not make full use of external data (like IMDB), and 3) does not allow others access to the data.
The Cary Project seeks to fix these issues by scraping the data, storing it, and displaying it in a much more useful way.
21 Oct Meeting
- Reviewed progress… which isn't much as I was out of town most of the last two weeks.
- I decided to split off a bit and build a scraping tool that accepts a JSON or YAML object and processes a provided HTML file. This should be useful in other scraping projects, and also when we have to refactor or add new pages to this project.
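To make the config-driven scraper idea concrete, here is a minimal sketch of how it might look. It is not the tool itself: the config layout (an "item" selector plus a "fields" mapping), the CSS selectors, and the sample HTML are all made up for illustration, and it assumes BeautifulSoup 4 is installed.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical config: the keys and CSS selectors are assumptions,
# not taken from The Cary's real pages.
CONFIG_JSON = """
{
  "item": "div.show",
  "fields": {
    "title": "h2.title",
    "time": "span.time"
  }
}
"""

# Made-up HTML standing in for a saved page.
SAMPLE_HTML = """
<div class="show"><h2 class="title">Example Film A</h2><span class="time">7:00 PM</span></div>
<div class="show"><h2 class="title">Example Film B</h2></div>
"""


def scrape(html, config):
    """Return one dict per element matching config["item"]."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for node in soup.select(config["item"]):
        row = {}
        for name, selector in config["fields"].items():
            match = node.select_one(selector)
            # A field that is missing from the page becomes None.
            row[name] = match.get_text(strip=True) if match else None
        rows.append(row)
    return rows


config = json.loads(CONFIG_JSON)
shows = scrape(SAMPLE_HTML, config)
print(shows)
```

Keeping the selectors in the config means a page redesign would only need a config change, not a code change. A YAML config would work the same way once it is loaded into a dict.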
7 Oct Meeting
- Reviewed recent changes to The Cary's calendar. They seem to have started using IMDB data.
- We discussed possible issues with using IMDB data or using copyright film posters.
- A possible API end point for IMDB.
- We could also use Wikipedia if needed.
- I was able to scrape the calendar page, but I'm not sure it will be any easier than the other page. The data is definitely easier to access, as nodes use classes on this page, whereas the other page has few useful attributes.
- I hadn't noticed that the ugly calendar doesn't seem to be updated. It's still showing September dates.
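The point above about classes making the calendar data easier to reach can be sketched like this. The markup and the class names ("event", "event-title", "event-date") are invented for the example and are not the calendar's real ones.

```python
from bs4 import BeautifulSoup

# Made-up markup; the real calendar's tags and class names will differ.
CALENDAR_HTML = """
<ul>
  <li class="event"><span class="event-title">Example Film A</span><span class="event-date">Oct 9</span></li>
  <li class="event"><span class="event-title">Example Film B</span><span class="event-date">Oct 10</span></li>
  <li>Unrelated list item</li>
</ul>
"""

soup = BeautifulSoup(CALENDAR_HTML, "html.parser")

events = []
# class_ filters on the CSS class, so the unrelated <li> is skipped.
for node in soup.find_all("li", class_="event"):
    title = node.find("span", class_="event-title").get_text(strip=True)
    date = node.find("span", class_="event-date").get_text(strip=True)
    events.append((title, date))

print(events)
```

Without classes to filter on, the same loop would have to rely on tag position or text patterns, which tends to break when the page layout changes.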
23 Sept Meeting
- It may be easier to use the Cary events calendar to scrape the necessary data.
- Create repo in Github Code for Cary: https://github.com/Cary-Code-for-America/TheCary
Regarding storage of parsed results:
- We talked about City Gram once, and I was curious how to link this up to a similar service. Would we have one database for all our different projects, or would each project have its own DB with a REST API? A REST API sounds like the better option.
- Could host it on Digital Ocean once we're ready.
Github repo changes:
- Added my work so far… not much.
- Creating milestones, labels, and issues… mostly to keep things straight in my head.
- Starting to create notes in Github Wiki.
Questions for next meeting:
- Would it make sense to have multiple data sources to scrape? It would help with getting the most accurate data. For example, scrape the main showtime page, the calendar, and the Films@TheCary page. If a showtime shows up only on the calendar and not on the others, we can safely drop it from the final scrape. This could also be implemented later on.
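One way the cross-checking idea might work, as a rough sketch: keep a showtime only if enough sources agree on it. The "at least two sources" threshold is my own generalization of the drop rule above, and the source names and showtimes are made-up sample data.

```python
from collections import Counter


def confirmed_showtimes(sources, min_sources=2):
    """Keep showtimes listed by at least min_sources of the scraped sources."""
    counts = Counter()
    for showtimes in sources.values():
        # set() so a duplicate within one source is only counted once.
        counts.update(set(showtimes))
    return sorted(s for s, n in counts.items() if n >= min_sources)


# Hypothetical scrape results, keyed by source; each entry is (title, time).
sources = {
    "showtimes_page": [("Example Film A", "Oct 9 7:00 PM"), ("Example Film B", "Oct 10 2:00 PM")],
    "calendar": [("Example Film A", "Oct 9 7:00 PM"), ("Example Film C", "Oct 11 7:00 PM")],
    "films_page": [("Example Film A", "Oct 9 7:00 PM"), ("Example Film B", "Oct 10 2:00 PM")],
}

kept = confirmed_showtimes(sources)
print(kept)
```

Here "Example Film C" appears only on the calendar, so it is dropped. This only works if every source is normalized to the same title and time format before comparing.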