Posted on Oct 26, 2020

How to figure out what skills are being hired for right now, programmatically

Ever been curious which skills are really in demand on job boards? We're programmers, so why not build a scraper to help figure that out?

Here's the type of output we want to get:

    {
      "javascript": 648,
      "react": 442,
      "java": 382,
      "agile": 345,
      "cloud": 309,
      "css": 305,
      "python": 301,
      "apis": 243,
      "sql": 241
    }

Prerequisites

To get started, you'll want to follow this guide, which will walk you through getting your environment set up and some basic scraping working.

Getting the job URLs

Following the same idea as the guide, here's how we'll lay out our code. This first script collects the URL of each individual job listing so that we can pull the words out of the postings later. I opted to split this into two different files in case one of the steps ran into errors. This one writes the URL of each job listing to a JSON file that we'll use in the next file.

I put in a few sample search result URLs; feel free to change the search terms or add as many as you'd like. The run I ended up doing used about 80 URLs with a few different search terms.

    # indeed_url.rb
    require 'kimurai'
    require 'selenium-webdriver'
    require 'json'

    class Indeed < Kimurai::Base
      @name = 'indeed_scrape'
      @start_urls = [
        'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY',
        'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=10',
        'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=20',
        'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=30',
        'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=40',
        'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=50',
        'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=60',
        'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=70',
        'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=80',
        'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=90',
      ]
      @engine = :selenium_chrome

      # class variable so the URLs accumulate across every search results page
      @@jobs = []

      def scrape_page
        # give the page a moment to finish rendering
        sleep 2
        # Update response to current response after interaction with a browser
        doc = browser.current_response
        # browser.save_screenshot

        while doc.css('div.jobsearch-SerpJobCard')[0]
          # this loop goes through however many job listings are on the page
          doc = browser.current_response
          # get first job listing
          single_job = doc.css('div.jobsearch-SerpJobCard')[0]

          # get job information (the href is a relative path, so prepend the host)
          job_url = 'https://indeed.com' + single_job.at_css('a.jobtitle')['href']
          puts ' ===== '
          puts job_url
          puts ' ===== '
          @@jobs << job_url unless @@jobs.include?(job_url)

          # remove the listing we just handled, both locally and in the browser,
          # so the next pass picks up the following one
          doc.css('div.jobsearch-SerpJobCard')[0].remove
          browser.execute_script("document.querySelector('div.jobsearch-SerpJobCard').remove()")
          sleep 0.1
        end
      end

      def parse(response, url:, data: {})
        scrape_page

        # make sure the output directory exists before writing to it
        Dir.mkdir('tmp') unless Dir.exist?('tmp')
        File.open('tmp/indeed_jobs_urls.json', 'w') do |f|
          f.write(JSON.pretty_generate(@@jobs))
        end

        @@jobs
      end
    end

    Indeed.crawl!
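If typing out every page by hand gets tedious, the `@start_urls` list could also be generated. Here's a minimal sketch, assuming the pagination pattern visible in the sample URLs (`start` going up by 10 per page). The `search_urls` helper and its `pages` and `per_page` parameters are my own names for illustration, not part of Kimurai or Indeed.

```ruby
require 'erb'

BASE_URL = 'https://www.indeed.com/jobs'

# Builds one search results URL per page for a query and location.
# Assumption: `start` increases by `per_page` (10 in the sample URLs).
def search_urls(query, location, pages: 10, per_page: 10)
  q = ERB::Util.url_encode(query)    # spaces become %20
  l = ERB::Util.url_encode(location) # commas become %2C
  (0...pages).map do |page|
    url = "#{BASE_URL}?q=#{q}&l=#{l}"
    # the first page has no start parameter in the sample list
    page.zero? ? url : "#{url}&start=#{page * per_page}"
  end
end

urls = search_urls('full stack developer', 'New York, NY')
puts urls.first
puts urls.last
```

To cover several search terms, you could call `search_urls` once per term and concatenate the results before assigning them to `@start_urls`.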

Getting the words

Now that we have the URLs to parse, let's go through each individual posting and see which words pop up the most.

    # indeed_posting.rb
    require 'kimurai'
    require 'json'

    # this loads each url from the JSON file and pulls the description,
    # removes all punctuation and converts it all to lowercase
    # then, throw each word into a hash for JSON
    class JobScraper < Kimurai::Base
      @name = 'indeed_scrape'
      @start_urls = JSON.parse(File.read('tmp/indeed_jobs_urls.json'))
      @engine = :selenium_chrome

      # default of 0 lets us increment words we haven't seen yet
      @@word_count = Hash.new(0)

      def scrape_page
        sleep 2
        doc = browser.current_response
        job_desc = doc.css('div.jobsearch-jobDescriptionText').text.gsub(/[[:punct:]]/, '').downcase
        job_desc.split.each { |word| @@word_count[word] += 1 }
        puts "#{@@word_count.size} unique words so far"
      end

      def parse(response, url:, data: {})
        scrape_page

        # sort by count, highest first
        sorted_hash = @@word_count.sort_by { |_word, count| -count }.to_h
        File.open('tmp/new_sorted_skills.json', 'w') do |f|
          f.write(JSON.pretty_generate(sorted_hash))
        end
      end
    end

    JobScraper.crawl!
    puts 'done scraping'
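The counting step is easy to try on its own without launching a browser. Here's a rough sketch that applies the same punctuation stripping, downcasing, and splitting to plain strings. The `count_words` helper and the sample sentences are made up for illustration.

```ruby
# Counts how often each word appears in a chunk of text.
# Pass the same hash in again to keep accumulating across postings.
def count_words(text, counts = Hash.new(0))
  cleaned = text.gsub(/[[:punct:]]/, '').downcase
  cleaned.split.each_with_object(counts) { |word, tally| tally[word] += 1 }
end

# made-up snippets standing in for two job descriptions
counts = count_words('React, SQL and React.')
count_words('Experience with SQL!', counts)

# sort by count, highest first, just like the scraper does
sorted = counts.sort_by { |_word, count| -count }.to_h
puts sorted
```

Note that words with equal counts can come out in any order, since `sort_by` isn't guaranteed to be stable.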

The only problem I have with this is that you'll get a JSON file that you'll eventually have to trim down by hand to remove all the non-tech words. It's pretty easy to see where the block of tech words pops up, and it only takes a few minutes.
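Some of that trimming could probably be automated with a stop-word list. This is only a sketch of the idea: `STOP_WORDS` and `trim_word_counts` are hypothetical names, the word list is a tiny starting point you'd need to grow yourself, and the sample counts are invented purely to show the filtering.

```ruby
require 'json'
require 'set'

# Hypothetical starter list; extend it with whatever filler shows up in your run.
STOP_WORDS = Set.new(%w[a an and the to of in for with you we our is are be will on as or experience team work])

# Returns a new hash without stop words or purely numeric tokens.
def trim_word_counts(word_count, stop_words: STOP_WORDS)
  word_count.reject do |word, _count|
    stop_words.include?(word) || word.match?(/\A\d+\z/)
  end
end

# invented counts for illustration only, not real scrape results
sample = { 'and' => 50, 'the' => 40, 'javascript' => 12, 'react' => 9, '5' => 7, 'sql' => 4 }
puts JSON.pretty_generate(trim_word_counts(sample))
```

For a real run you could load the counts with `JSON.parse(File.read('tmp/new_sorted_skills.json'))` and pass them in. A list like this won't catch everything, so expect to still skim the result.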

I'm very open to improvement on this though, so let me know if there's something that I should tweak!