Posted on Feb 23, 2020 • Originally published at dishy.dev

Scraping Images from Reddit Threads in Python

Introduction

This is a little side project I did to try to scrape images out of Reddit threads. There are a few different subreddits discussing shows, specifically /r/anime, where users post a lot of screenshots of the episodes. I thought it'd be cool to see how much effort it would take to automatically collate a list of those screenshots from a thread and display them in a simple gallery. The result looked like this:

PRAW

PRAW, the Python Reddit API Wrapper, provides a nice set of bindings for talking to Reddit.

To scrape Reddit you need credentials. The way to generate them is hidden away at https://www.reddit.com/prefs/apps, where you have to register a new "app" with Reddit. Connecting is as simple as:

    import praw

    reddit = praw.Reddit(
        client_id='id',
        client_secret='secret',
        user_agent='useragent',
        username='username',
        password='DevToIsCool',
    )

The API makes traversing Reddit simple. For example, this loops over all of the comments in a thread and prints each one:

    submission = reddit.submission(url="https://reddit.com/r/abcde")
    for comment in submission.comments.list():
        print(comment)

Finding links

99% of the images I was looking for are posted to Imgur, so I just matched on those. I used a regular expression to extract the links. I always recommend a tool like RegEx101, which makes it really easy to debug your regular expressions, as they can be pretty brain-bending.

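    import re

    # Dots are escaped so they only match a literal ".", and \S keeps a
    # match from running across whitespace into the next word
    REGEX_TEST = r"((http|https)://i\.imgur\.com/\S+?\.(jpg|png))"
    p = re.compile(REGEX_TEST, re.IGNORECASE)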
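To see what the pattern hands back, here is a small sketch run against a made-up comment body. The sample text and image IDs are invented for illustration, and the pattern used here escapes the literal dots and uses `\S` so a match cannot span whitespace. Because the pattern contains three groups, `findall` returns a tuple per match, with the full URL in position 0.

```python
import re

# Imgur direct-link pattern: full URL, scheme, and extension are each a group
IMGUR_PATTERN = re.compile(
    r"((http|https)://i\.imgur\.com/\S+?\.(jpg|png))", re.IGNORECASE
)

# Hypothetical comment text; the image IDs are made up
sample_body = (
    "That scene was great! https://i.imgur.com/abc1234.jpg "
    "and also http://i.imgur.com/XyZ9876.PNG (not https://example.com/pic.jpg)"
)

# findall returns one tuple per match: (full_url, scheme, extension)
matches = IMGUR_PATTERN.findall(sample_body)

# Keep only the full URL from each tuple
image_urls = [match[0] for match in matches]
print(image_urls)
```

This is why the later code reads `match[0]` rather than using the match directly.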

Check if an image still exists

One of the problems I found was dead image links, so I created a simple helper that checks the status_code for that link.

    import requests

    # Check if a link still exists
    def checkLinkActive(url):
        # A HEAD request fetches only the headers, not the image itself
        response = requests.head(url)
        return response.status_code == 200

Getting Thumbnails

To save bandwidth and your mobile data, I wanted to return a smaller version of the image. On Imgur you can append a size character to the image ID, just before the file extension, to get the image at a different size, for example 'l' for large and 's' for small.

    # Add a letter to an imgur url to make a small thumbnail
    def getImgurThumbnail(url, size):
        # Split off the last four characters, e.g. ".jpg" or ".png"
        startStr = url[:-4]
        endStr = url[-4:]
        return startStr + size + endStr
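The slicing above assumes the URL ends in a four-character extension with nothing after it. As a hedged alternative, here is a minimal sketch that splits the URL path with the standard library instead, so longer extensions or a trailing query string do not shift the inserted letter. The function name `imgur_thumbnail` and the sample image ID are hypothetical, not part of the original project.

```python
from posixpath import splitext
from urllib.parse import urlsplit, urlunsplit


def imgur_thumbnail(url, size="m"):
    """Insert a size letter between the image ID and its file extension."""
    parts = urlsplit(url)
    # splitext separates "/abc1234" from ".jpg" regardless of extension length
    stem, extension = splitext(parts.path)
    # Rebuild the URL with the new path, leaving any query string untouched
    return urlunsplit(parts._replace(path=stem + size + extension))


# Hypothetical image ID, for illustration only
print(imgur_thumbnail("https://i.imgur.com/abc1234.jpg", "m"))
print(imgur_thumbnail("https://i.imgur.com/abc1234.jpeg?1", "s"))
```

For the plain `.jpg` and `.png` links the regular expression matches, both approaches give the same result; the difference only shows up on URLs the simple slice was never meant to handle.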

Putting it all together

Putting all of these bits together, you get:

    def getImages(url):
        submission = reddit.submission(url=url)

        # Tell the API to return every comment in the thread;
        # results are paginated by default
        submission.comments.replace_more(limit=None)

        # Create a RegEx object for matching images
        REGEX_TEST = r"((http|https)://i\.imgur\.com/\S+?\.(jpg|png))"
        p = re.compile(REGEX_TEST, re.IGNORECASE)

        imageMatches = []
        for comment in submission.comments.list():
            matches = p.findall(comment.body)
            for match in matches:
                # match[0] is the full URL; the other groups are the scheme and extension
                if checkLinkActive(match[0]):
                    imageMatches.append(
                        {"image": match[0], "thumbnail": getImgurThumbnail(match[0], "m")}
                    )
        return imageMatches
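Popular screenshots tend to be linked more than once in a thread, and each link check costs an HTTP request. The sketch below shows one way the same loop could skip repeats and be exercised without touching Reddit or Imgur. It is not the original implementation: `collect_images`, `fake_is_active`, `simple_thumbnail`, and the sample comment bodies are all hypothetical stand-ins.

```python
import re

IMGUR_PATTERN = re.compile(
    r"((http|https)://i\.imgur\.com/\S+?\.(jpg|png))", re.IGNORECASE
)


def collect_images(comment_bodies, is_active, make_thumbnail):
    """Return one entry per unique, live image URL, in first-seen order."""
    seen = set()
    images = []
    for body in comment_bodies:
        for match in IMGUR_PATTERN.findall(body):
            url = match[0]
            if url in seen:
                continue  # already handled, so skip the extra link check
            seen.add(url)
            if is_active(url):
                images.append({"image": url, "thumbnail": make_thumbnail(url, "m")})
    return images


# Stand-in for the link checker so the sketch runs offline (assumption)
def fake_is_active(url):
    return "dead" not in url


# Stand-in thumbnail helper using the same slicing idea as the article
def simple_thumbnail(url, size):
    return url[:-4] + size + url[-4:]


# Made-up comment bodies with a repeated link and a "dead" link
bodies = [
    "Look: https://i.imgur.com/aaa1111.jpg",
    "Same one https://i.imgur.com/aaa1111.jpg and https://i.imgur.com/dead222.png",
    "No links here",
]
result = collect_images(bodies, fake_is_active, simple_thumbnail)
print(result)
```

Against a real thread, the comment bodies would come from `submission.comments.list()` and the two callables would be the link checker and thumbnail helper defined earlier.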

Trying it out

I decided to stand up a quick demo of this, using an Azure Function to host my new function and a simple web form to allow people to try it out. Just copy and paste a Reddit URL and the function will return any images.

The demo app uses Bulma for the look and feel, and a little bit of jQuery for loading the page.

If you want to give it a go, you can have a play on my site here.

In a future article, I'll look at providing a show name search instead of having to paste individual episode URLs. Happy Reddit scraping!