Introduction
This is a little side project I did to try to scrape images out of Reddit threads. There are a few subreddits that discuss shows, specifically /r/anime, where users post a lot of screenshots of the episodes. I thought it'd be cool to see how much effort it would take to automatically collate those screenshots from a thread and display them in a simple gallery. The result looked like this
PRAW
PRAW, the Python Reddit API Wrapper, provides a nice set of bindings for talking to Reddit.
To scrape Reddit you need credentials. The way to generate credentials is hidden away at https://www.reddit.com/prefs/apps where you have to register a new "app" with Reddit. Connecting is as simple as
    import praw

    reddit = praw.Reddit(client_id='id',
                         client_secret='secret',
                         user_agent='useragent',
                         username='username',
                         password='DevToIsCool')
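Hardcoding a password in a script makes it easy to leak by accident. As a minimal sketch, the credentials could be read from environment variables instead. The REDDIT_* variable names and the reddit_credentials helper are made up for this example and are not part of PRAW.

```python
import os

# Hypothetical variable names; pick whatever suits your setup.
REQUIRED_VARS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "user_agent": "REDDIT_USER_AGENT",
    "username": "REDDIT_USERNAME",
    "password": "REDDIT_PASSWORD",
}


def reddit_credentials(env=os.environ):
    """Build the keyword arguments for praw.Reddit from environment variables."""
    missing = [name for name in REQUIRED_VARS.values() if name not in env]
    if missing:
        # Fail early with a readable message instead of a confusing auth error
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in REQUIRED_VARS.items()}
```

With that in place, connecting would look like reddit = praw.Reddit(**reddit_credentials()).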
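The API makes traversing Reddit simple, for example printing all of the comments in a thread.
    submission = reddit.submission(url="https://reddit.com/r/abcde")
    # Drop the "load more comments" placeholders so every item has a body
    submission.comments.replace_more(limit=0)
    for comment in submission.comments.list():
        # comment.body is the text; printing the comment itself shows only its ID
        print(comment.body)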
Finding links
Nearly all of the images I was looking for were hosted on Imgur, so I only matched on those. I used a regular expression to extract the links. I always recommend using a tool like RegEx101, which makes it really easy to debug your regular expressions, as they can be pretty brain-bending.
    import re

    # The dots in the hostname are escaped so they only match a literal "."
    REGEX_TEST = r"((http|https)://i\.imgur\.com/.+?(jpg|png))"
    p = re.compile(REGEX_TEST, re.IGNORECASE)
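To see what the pattern picks up, here is a quick check against a made-up comment body (the links are invented for the example). It uses the same pattern, with the dots in the hostname escaped so they match literally. Because the pattern contains three groups, findall returns one tuple per match, and the full URL is the first element of each tuple.

```python
import re

REGEX_TEST = r"((http|https)://i\.imgur\.com/.+?(jpg|png))"
p = re.compile(REGEX_TEST, re.IGNORECASE)

# Invented sample text, standing in for comment.body
sample = (
    "Loved this shot https://i.imgur.com/abc1234.jpg "
    "and [this one](http://i.imgur.com/XyZ9876.PNG)"
)

# Keep only the first group of each match, which is the whole link
links = [match[0] for match in p.findall(sample)]
print(links)
```

This is why the later code reads match[0] and ignores the other two groups, which only hold the scheme and the extension.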
Check if an image still exists
One of the problems I found was dead image links, so I created a simple helper that checks the status_code for that link.
    import requests

    # Check if a link still exists
    def checkLinkActive(url):
        # HEAD fetches only the headers, so the image itself is not downloaded
        response = requests.head(url, timeout=10)
        return response.status_code == 200
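Each check is a network round trip, so a thread with a lot of links can take a while. One possible refinement, sketched here and not part of the original demo, is to drop duplicate links and run the checks in a thread pool. filter_active is a hypothetical helper; in practice the check argument would be checkLinkActive, but any function that takes a URL and returns True or False works.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_active(urls, check, max_workers=8):
    """Return the unique URLs for which check(url) is truthy, in first-seen order."""
    # dict.fromkeys de-duplicates while preserving insertion order
    unique = list(dict.fromkeys(urls))
    if not unique:
        return []
    # pool.map yields results in the same order as its input
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check, unique))
    return [url for url, ok in zip(unique, results) if ok]
```

The pool size of 8 is an arbitrary assumption; a small number keeps the load on the image host modest.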
Getting Thumbnails
To save bandwidth and your mobile data, I wanted to return a smaller version of the image. With Imgur you can add a size character to the end of the image ID, just before the file extension, to get the image at a different size, for example 'l' for large and 's' for small.
    # Add a letter to an imgur url to make a small thumbnail
    def getImgurThumbnail(url, size):
        # Split off the 4-character extension (".jpg" or ".png")
        startStr = url[:-4]
        endStr = url[-4:]
        return startStr + size + endStr
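The slicing above relies on the extension being exactly four characters (".jpg" or ".png"), which holds for everything the regular expression matches. If the pattern were later widened to longer extensions or to URLs with a query string, a variant along these lines could split the URL path instead. imgur_thumbnail is a hypothetical name, and this is a sketch, not what the demo uses.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def imgur_thumbnail(url, size):
    """Insert a size letter between the image ID and its file extension."""
    parts = urlsplit(url)
    # "/abc1234.jpeg" -> ("/abc1234", ".jpeg"), whatever the extension length
    stem, ext = posixpath.splitext(parts.path)
    # Rebuild the URL with the new path, leaving any query string untouched
    return urlunsplit(parts._replace(path=stem + size + ext))
```

For the plain ".jpg" and ".png" links matched earlier, this returns the same result as getImgurThumbnail.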
Putting it all together
Putting all of these bits together you get
    def getImages(url):
        submission = reddit.submission(url=url)

        # Tell the API to return all comments in the thread; results are
        # paginated by default
        submission.comments.replace_more(limit=None)

        # Create RegEx object for matching images
        REGEX_TEST = r"((http|https)://i\.imgur\.com/.+?(jpg|png))"
        p = re.compile(REGEX_TEST, re.IGNORECASE)

        imageMatches = []
        for comment in submission.comments.list():
            matches = p.findall(comment.body)
            for match in matches:
                # match[0] is the full URL; the other groups are scheme and extension
                if checkLinkActive(match[0]):
                    imageMatches.append(
                        {"image": match[0], "thumbnail": getImgurThumbnail(match[0], "m")}
                    )
        return imageMatches
Trying it out
I decided to stand up a quick demo of this, using an Azure Function to host my new function and a simple web form to allow people to try it out. Just copy and paste a Reddit URL and the function will return any images.
The demo app uses Bulma for the look and feel, and a little bit of jQuery for loading the page.
If you want to give it a go, you can have a play on my site here.
I'll be looking in a future article at providing a show name search instead of having to paste individual episode URLs. Happy Reddit scraping!