A new mission. We need to watch all of the Stephen King adaptations in order of release. It's all on https://stephenking.com/works/movie/index.html but I want to pull each title into a spreadsheet. So I'm going to knock up a quick script to extract all of the movies in order.
The markup on the page is quite nice.
    <div class="works-inner">
      <a href="/works/movie/carrie.html" class="row work" data-date="1976-0-03, " data-sort="Carrie">
        <div class="col-12 col-sm-6 works-title">Carrie</div>
        <div class="col-6 col-sm-3 works-type">Movie</div>
        <div class="col-6 col-sm-3 works-date">November 03rd, 1976</div>
      </a>
      <a href="/works/movie/shining.html" class="row work" data-date="1980-0-23, " data-sort="Shining, The">
        <div class="col-12 col-sm-6 works-title">The Shining</div>
        <div class="col-6 col-sm-3 works-type">Movie</div>
        <div class="col-6 col-sm-3 works-date">May 23rd, 1980</div>
      </a>
    </div>
I can grab the markup quite easily.
    require 'open-uri'

    # URI.open is used because Kernel#open no longer fetches URLs as of Ruby 3.0.
    html = URI.open('https://stephenking.com/works/movie/index.html').read
Ok, but I want to parse it. I can use Nokogiri to extract the data I need.
    require 'nokogiri'
    require 'open-uri'

    # Fetch the page and parse it into a searchable document.
    doc = Nokogiri::HTML(URI.open('https://stephenking.com/works/movie/index.html'))

We can use the .css method to extract every element that matches a CSS selector.
doc.css('.work')
Each link has a class of work, and inside it there is a div for each piece of data we want, each with a convenient class of its own.

    doc.css('.work').map do |w|
      [
        w.css('.works-title')[0].content,
        w.css('.works-date')[0].content
      ]
    end
That's great, but I want to sort by the date of release. Ruby copes with dates and times, sure, but Rails has some handy convenience extensions provided by active_support.

    > Date.parse('November 03rd, 1976')
    => Wed, 03 Nov 1976
Not every value is actually a date.
    > Date.parse('TBD')
    Traceback (most recent call last):
    (irb):10:in `parse': invalid date (Date::Error)
We can use a quick rescue. This is horrid but this is a quick script.
    > Date.parse('TBD') rescue nil
    => nil
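An inline `rescue nil` swallows every StandardError, not only a failed parse. If the script ever grows up, a narrower version might look like the sketch below. The helper name `parse_release_date` is my own invention for illustration, and it assumes the input is always a string.

```ruby
require 'date'

# Hypothetical helper (not part of the original script): returns a Date,
# or nil when the text cannot be parsed as a date, e.g. "TBD".
def parse_release_date(text)
  Date.parse(text)
rescue ArgumentError # Date::Error is a subclass of ArgumentError
  nil
end

p parse_release_date('November 03rd, 1976') # a Date for 3 November 1976
p parse_release_date('TBD')                 # => nil
```

Anything other than a parse failure (a typo in a method name, say) would still raise, which is usually what you want.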
We can then sort our records by the date. Here's the full script. It works a treat.
    require 'active_support'
    require 'nokogiri'
    require 'open-uri'
    require 'time'

    doc = Nokogiri::HTML(URI.open('https://stephenking.com/works/movie/index.html'))

    # Build [title, release time] pairs; unparseable dates such as "TBD" become nil.
    # Assigning to a variable keeps the do...end block bound to map rather than puts.
    movies = doc.css('.work').map do |w|
      [
        w.css('.works-title')[0].content,
        (Time.parse(w.css('.works-date')[0].content) rescue nil)
      ]
    end

    # Undated entries fall back to the current time, so they sort after past releases.
    puts movies.sort_by { |a| a[1] || Time.now }.map { |a| a[0] }

    Carrie
    The Shining
    Creepshow
    Cujo
    The Dead Zone
    Christine
    Children of the Corn
    Cat's Eye
    Silver Bullet
    Maximum Overdrive
    Stand By Me
    Creepshow 2
    The Running Man
    Pet Sematary (1989)
    Tales from the Darkside: The Movie
    Graveyard Shift
    Misery
    Sleepwalkers
    The Dark Half
    Needful Things
    The Shawshank Redemption
    The Mangler
    Dolores Claiborne
    Thinner
    The Night Flier
    Apt Pupil
    The Green Mile
    Hearts in Atlantis
    Dreamcatcher
    Secret Window
    Riding the Bullet
    1408
    The Mist
    Dolan's Cadillac
    Mercy
    A Good Marriage
    Cell
    My Pretty Pony
    The Dark Tower
    IT - Part 1: The Losers' Club
    Gerald's Game
    1922
    Pet Sematary (2019)
    IT: Chapter Two
    In the Tall Grass
    Doctor Sleep
    Firestarter
    Mr. Harrigan's Phone
    The Girl Who Loved Tom Gordon
    Hearts
    Suffer the Little Children
    Salem's Lot
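One caveat with the `|| Time.now` fallback: an entry with a confirmed future release date would sort after the undated ones. A rough alternative, sketched here on sample data rather than the live page, is to split the records first and only sort the dated ones. The two dates come from the markup earlier in the post, and 'Untitled (TBD)' is a made-up placeholder.

```ruby
require 'date'

# Sample records in the same [title, date] shape the script builds.
# 'Untitled (TBD)' is a placeholder standing in for an entry with no date.
records = [
  ['The Shining', Date.new(1980, 5, 23)],
  ['Untitled (TBD)', nil],
  ['Carrie', Date.new(1976, 11, 3)]
]

# Separate dated from undated entries, sort only the dated ones,
# then append the undated ones in their original order.
dated, undated = records.partition { |_title, date| date }
titles = (dated.sort_by { |_title, date| date } + undated).map(&:first)

puts titles
```

This keeps undated entries at the end no matter how far ahead the announced releases are.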
I guess we're starting with Brian De Palma's Carrie.
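Since the whole point was a spreadsheet, here is a minimal sketch of the last step using Ruby's csv library. It runs on two sample rows taken from the markup earlier in the post. The column headings and the commented-out filename are my own assumptions, not anything the site provides.

```ruby
require 'csv'
require 'date'

# Sample [title, date] rows; a nil date would come out as an empty cell.
rows = [
  ['Carrie', Date.new(1976, 11, 3)],
  ['The Shining', Date.new(1980, 5, 23)]
]

# Build the CSV text with a header row; dates are written in ISO 8601 form.
csv_text = CSV.generate do |csv|
  csv << ['Title', 'Released']
  rows.each { |title, date| csv << [title, date&.iso8601] }
end

puts csv_text
# File.write('king_movies.csv', csv_text) # hypothetical filename
```

Going through the csv library rather than joining strings by hand means a title containing a comma gets quoted properly.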