There's a famous Wikipedia phenomenon: by repeatedly clicking the first link in the main text of an English Wikipedia article, you'll eventually end up on the Philosophy page. An explanation can be found here. Briefly, it's a consequence of the Wikipedia Manual of Style guidelines, which recommend that articles begin by telling "what or who the subject is, and often when and where".
This was true for roughly 97% of articles, so there's a good chance that if you open a random Wikipedia page and follow the procedure, you'll indeed end up on Philosophy. I could test this by hand, but this wouldn't be a dev.to article without writing some code. We'll start with how to download Wikipedia articles.
How to get data
It's simple - just request the contents of an article with `urllib3`. Wikipedia follows a convenient pattern for naming its articles: after the usual `en.wikipedia.org` there's `/wiki` and then `/article_name` (or media! we'll deal with that later), for example `en.wikipedia.org/wiki/Data_mining`.
Firstly, I'll create a pool from which I'll make requests to Wikipedia.
```python
import time

import urllib3
from bs4 import BeautifulSoup

pool = urllib3.PoolManager()
```
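As a quick sanity check (a minimal sketch, assuming `lxml` is installed as the parser), I can fetch the `Data_mining` article mentioned above and print its `<title>`:

```python
# Fetch a single article through the pool and parse it, just to confirm
# that the URL pattern works before writing the crawler.
response = pool.request("GET", "https://en.wikipedia.org/wiki/Data_mining")
soup = BeautifulSoup(response.data, "lxml")
print(soup.title.text)  # something like "Data mining - Wikipedia"
```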
From now on, I could download the articles one by one. To automate the process of crawling through the site, the crawler will be recursive: each call will return `(current_url, [crawl(link) for link in links_on_site])`, and the recursion will stop at a given depth. In the end, I'll end up with a tree structure.
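To make the shape of that return value concrete, here's a minimal sketch of the recursion (not the real crawler; `fetch_links` is a hypothetical stand-in for the link extraction shown later):

```python
def fetch_links(url):
    # Hypothetical stand-in for the real link extraction defined later;
    # it returns an empty list so the sketch is runnable on its own.
    return []


def crawl_sketch(url, deep):
    links = fetch_links(url)   # first few links found on the page
    if deep <= 0:
        return url, links      # leaf: stop recursing here
    # inner node: a tuple of the current url and the sub-trees of its links
    return url, [crawl_sketch(link, deep - 1) for link in links]
```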
```python
def crawl(
    pool: urllib3.PoolManager,
    url,
    phrase=None,
    deep=1,
    sleep_time=0.5,
    n=5,
    prefix="https://en.wikipedia.org",
    verbose=False,
):
    """
    Crawls given Wikipedia `url` (article) with max depth `deep`.
    For each page extracts `n` urls and, if `phrase` is given, checks if `phrase` is in the urls.

    Parameters
    ----------
    pool : urllib3.PoolManager
        Request pool
    url : str
        Link to Wikipedia article
    phrase : str
        Phrase to search for in urls
    deep : int
        Depth of crawl
    sleep_time : float
        Sleep time between requests
    n : int
        Number of links to return
    prefix : str, default="https://en.wikipedia.org"
        Site prefix
    verbose : bool
        Print visited pages

    Returns
    -------
    tuple
        Tuple of (url, list)
    """
    if verbose:
        site = url.split("/")[-1]
        print(f"{deep} Entering {site}")

    # Sleep to avoid getting banned
    time.sleep(sleep_time)

    site = pool.request("GET", url)
    soup = BeautifulSoup(site.data, "lxml")

    # Get links from the article (I'll show `get_links_from_wiki` later)
    links = get_links_from_wiki(soup=soup, n=n, prefix=prefix)

    # If a phrase was given, check if any of the links contain it
    is_phrase_present = phrase is not None and any(phrase in link for link in links)

    if deep > 0 and not is_phrase_present:
        return (
            url,
            [
                crawl(
                    pool=pool,
                    url=url_,
                    phrase=phrase,
                    deep=deep - 1,
                    sleep_time=sleep_time,
                    n=n,
                    prefix=prefix,
                    verbose=verbose,
                )
                for url_ in links
            ],
        )
    return url, links
```
If you read the code carefully, you'll notice the `get_links_from_wiki` function. It parses the article: it finds the div that contains the whole article, iterates through all paragraphs (and lists), and collects all links that match the pattern `/wiki/article_name`. Because there's no domain in that pattern, the prefix is prepended to each link.
```python
def get_links_from_wiki(soup, n=5, prefix="https://en.wikipedia.org"):
    """
    Extracts the first `n` internal links from a Wikipedia article and adds `prefix` to them.

    Parameters
    ----------
    soup : BeautifulSoup
        Wikipedia page
    n : int
        Number of links to return
    prefix : str, default="https://en.wikipedia.org"
        Site prefix

    Returns
    -------
    list
        List of links
    """
    arr = []

    # Get div with article contents
    div = soup.find("div", class_="mw-parser-output")

    for element in div.find_all("p") + div.find_all("ul"):
        # In each paragraph (or list) find all <a href="/wiki/article_name"></a>
        # and extract "/wiki/article_name"
        for a in element.find_all("a", href=True):
            if len(arr) >= n:
                break
            if (
                a["href"].startswith("/wiki/")
                and len(a["href"].split("/")) == 3
                and "." not in a["href"]
                and "(" not in a["href"]
            ):
                arr.append(prefix + a["href"])

    return arr
```
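To see what it returns, I can call it on a single downloaded page (the exact links depend on the current revision of the article, so treat the call below as an illustration):

```python
# Download one article and extract the first three internal links from it.
response = pool.request("GET", "https://en.wikipedia.org/wiki/Data_mining")
soup = BeautifulSoup(response.data, "lxml")
print(get_links_from_wiki(soup, n=3))
```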
Now we have everything we need to check the phenomenon. I'll set the max depth to 50 and set `n=1` (to only expand the first link in each article).
```python
crawl(
    pool,
    "https://en.wikipedia.org/wiki/Doggart",
    phrase="Philosophy",
    deep=50,
    n=1,
    verbose=True,
)
```
Output:
```
50 Entering Doggart
49 Entering Caroline_Doggart
48 Entering Utrecht
47 Entering Help:Pronunciation_respelling_key
...
28 Entering Mental_state
27 Entering Mind
26 Entering Thought
25 Entering Ideas

('https://en.wikipedia.org/wiki/Doggart',
 [('https://en.wikipedia.org/wiki/Caroline_Doggart',
   [('https://en.wikipedia.org/wiki/Utrecht',
     [('https://en.wikipedia.org/wiki/Help:Pronunciation_respelling_key',
       [('https://en.wikipedia.org/wiki/Pronunciation_respelling_for_English',
         ...
         [('https://en.wikipedia.org/wiki/Ideas',
           ['https://en.wikipedia.org/wiki/Philosophy'])])])])])])])])])])])])])])])])])])])])])])])])])])
```
As you can see, after 25 iterations we indeed found the Philosophy page.
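The nested tuples aren't very readable, so here's a small helper (my addition, not part of the crawler itself) that walks down the first branch of the result and prints the chain of article names; it assumes `n=1`, so every node has at most one child:

```python
def first_link_chain(result):
    """Walk down the first branch of a crawl() result and
    return the visited article names in order."""
    url, children = result
    chain = [url.split("/")[-1]]
    while children:
        child = children[0]
        if isinstance(child, tuple):      # inner node: (url, [children])
            url, children = child
            chain.append(url.split("/")[-1])
        else:                             # leaf level: plain list of links
            chain.append(child.split("/")[-1])
            break
    return chain


result = crawl(
    pool,
    "https://en.wikipedia.org/wiki/Doggart",
    phrase="Philosophy",
    deep=50,
    n=1,
)
print(" -> ".join(first_link_chain(result)))
# Doggart -> Caroline_Doggart -> Utrecht -> ... -> Ideas -> Philosophy
```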
Top comments (2)
I think that you might not have known about the Bacon number. I'm not sure about the English Wikipedia, but the Polish article is literally called "Bacon number"; it's the number of hops it takes to reach Kevin Bacon (just a related concept, not related to Wikipedia).
I also suggest watching the Wikipedia Race video from Chrome Developer Summit 2020, where they play a game in which a person needs to find a page just by clicking links on Wikipedia.

Thanks for the comment, I didn't know about the Bacon number.
On the topic of finding the number of hops between pages, I found the site sixdegreesofwikipedia, where you can find the shortest path between two articles.