1236. Web Crawler

Problem Description

This problem asks you to implement a web crawler that explores web pages starting from a given URL and collects all URLs that belong to the same hostname.

The web crawler should follow these rules:

  1. Start from a given URL (startUrl) and explore linked pages from there.

  2. Use the HtmlParser interface to get URLs from any webpage by calling HtmlParser.getUrls(url), which returns a list of all URLs found on that page.

  3. Stay within the same hostname - only crawl URLs that have the same hostname as the starting URL. For example, if you start at http://example.org/news, you can crawl http://example.org/sports but not http://another.com/page.

  4. Avoid duplicates - each unique URL should be visited only once.

  5. Return all discovered URLs that belong to the same hostname, in any order.

The hostname is the domain part of the URL. For instance, in http://leetcode.com/problems, the hostname is leetcode.com. The crawler should treat URLs with and without trailing slashes as different (e.g., http://site.com and http://site.com/ are considered different URLs).
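To make the hostname rule concrete, here is a minimal sketch; the helper name get_hostname mirrors the one used in the reference implementation below and assumes, as the problem guarantees, that every URL uses the http:// scheme:

# Minimal sketch: extracting a hostname and comparing raw URL strings.
def get_hostname(url: str) -> str:
    # Drop "http://" (7 characters), then keep everything before the first '/'
    return url[7:].split('/')[0]

print(get_hostname("http://leetcode.com/problems"))   # leetcode.com
print(get_hostname("http://example.org/news"))        # example.org

# Trailing slashes matter because URLs are compared as raw strings:
print("http://site.com" == "http://site.com/")        # False -> treated as two different URLs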

The solution uses a depth-first search (DFS) approach:

  • It extracts the hostname from a URL by removing the http:// prefix (7 characters) and splitting by / to get the domain
  • It maintains a set ans to track visited URLs and avoid duplicates
  • For each URL, it gets all linked URLs using the HtmlParser, filters for same hostname, and recursively visits unvisited ones
  • Finally, it returns all collected URLs as a list
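A condensed sketch of the approach just described is shown here for orientation; the full, commented implementations in several languages appear under Solution Implementation below.

# Condensed sketch of the DFS approach described above (full versions below).
def crawl(startUrl, htmlParser):
    host = lambda url: url[7:].split('/')[0]   # strip "http://", keep text before first '/'
    ans = set()                                # visited URLs; doubles as the answer

    def dfs(url):
        if url in ans:
            return                             # already visited: stop (handles cycles)
        ans.add(url)
        for nxt in htmlParser.getUrls(url):
            if host(nxt) == host(url):         # stay on the same hostname
                dfs(nxt)

    dfs(startUrl)
    return list(ans)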

Flowchart Walkthrough

First, let's pin down the algorithm using the Flowchart. Here's a step-by-step walkthrough:

Is it a graph?

  • Yes: The web pages and their links form a graph structure. Each URL is a node, and the links between pages are edges. When a page contains links to other pages, those represent directed edges in the graph.

Is it a tree?

  • No: The web structure is not a tree because:
    • Pages can have cycles (Page A can link to Page B, and Page B can link back to Page A)
    • Multiple pages can link to the same page (multiple parents)
    • There's no strict hierarchical parent-child relationship

Is the problem related to directed acyclic graphs?

  • No: The web graph can contain cycles. Websites commonly have circular references where pages link back to each other.

Is the problem related to shortest paths?

  • No: We're not looking for the shortest path between pages. The goal is to explore and collect all reachable URLs within the same hostname.

Does the problem involve connectivity?

  • No: We're not checking if nodes are connected or finding connected components. We're performing an exhaustive exploration within a constrained domain (same hostname).

Does the problem have small constraints?

  • Yes: The problem is constrained to URLs within the same hostname, which keeps the search space manageable, and we need to exhaustively explore everything reachable from the starting URL.

DFS/backtracking?

  • Yes: DFS is the appropriate choice because:
    • We need to explore all reachable pages systematically
    • We mark visited pages to avoid infinite loops (cycles)
    • We recursively explore each unvisited link before backtracking
    • The traversal naturally follows the link structure depth-first

Conclusion: The flowchart correctly leads us to use a DFS (Depth-First Search) approach for the web crawler problem. DFS allows us to systematically explore all pages within the same hostname while keeping track of visited URLs to avoid revisiting them.


Intuition

Think of web crawling like exploring a maze of interconnected rooms. You start in one room (the starting URL) and can see doors leading to other rooms (links to other pages). Your goal is to visit every room in your building (same hostname) exactly once, while ignoring doors that lead to other buildings (different hostnames).

The key insight is recognizing this as a graph traversal problem. Each webpage is a node, and links between pages are directed edges. Since we need to explore all reachable pages within a domain, we need a systematic way to:

  1. Visit each page
  2. Discover all its links
  3. Follow only the valid links (same hostname)
  4. Remember which pages we've already visited

DFS is a natural fit because it follows a "go as deep as possible" strategy - when you find a link, you immediately follow it, explore everything reachable from there, then backtrack. This is simpler than maintaining a queue (as in BFS) and works well since we don't care about finding the shortest path to pages, just finding all pages.
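For comparison, here is a minimal BFS sketch using an explicit queue instead of the call stack. It is an illustration under the same assumptions (http:// URLs only; crawl_bfs and the inline get_hostname helper are names chosen here), not the reference solution presented below.

# BFS alternative: same result set, explored level by level with a queue.
from collections import deque

def crawl_bfs(startUrl, htmlParser):
    get_hostname = lambda url: url[7:].split('/')[0]
    target = get_hostname(startUrl)

    visited = {startUrl}
    queue = deque([startUrl])
    while queue:
        url = queue.popleft()
        for nxt in htmlParser.getUrls(url):
            if nxt not in visited and get_hostname(nxt) == target:
                visited.add(nxt)
                queue.append(nxt)
    return list(visited)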

The critical observation for avoiding infinite loops is that websites often have cycles (Page A → Page B → Page A). Without tracking visited pages, we'd get stuck in these cycles forever. By maintaining a set of visited URLs, we ensure each page is processed exactly once.

The hostname extraction is straightforward: after removing the http:// prefix (7 characters), the hostname is everything before the first /. By comparing hostnames before following links, we stay within our domain boundary.

The recursive DFS structure emerges naturally: for each unvisited URL with the same hostname, we recursively apply the same exploration process. The base case is when we encounter an already-visited URL, at which point we stop to avoid redundant work.


Solution Implementation

1# """ 2# This is HtmlParser's API interface. 3# You should not implement it, or speculate about its implementation 4# """ 5# class HtmlParser(object): 6# def getUrls(self, url): 7# """ 8# :type url: str 9# :rtype List[str] 10# """ 11 12from typing import List, Set 13 14 15class Solution: 16 def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]: 17 """ 18 Web crawler that crawls all URLs under the same hostname as startUrl. 19 20 Args: 21 startUrl: The starting URL for crawling 22 htmlParser: Parser object to get URLs from a given page 23 24 Returns: 25 List of all crawled URLs under the same hostname 26 """ 27 28 def get_hostname(url: str) -> str: 29 """ 30 Extract hostname from a URL. 31 Assumes URL format starts with 'http://' 32 33 Args: 34 url: URL string starting with 'http://' 35 36 Returns: 37 Hostname portion of the URL 38 """ 39 # Remove 'http://' prefix (7 characters) 40 url_without_protocol = url[7:] 41 # Split by '/' and take the first part (hostname) 42 return url_without_protocol.split('/')[0] 43 44 def depth_first_search(current_url: str) -> None: 45 """ 46 Recursively crawl URLs using depth-first search. 47 Only visits URLs with the same hostname as the start URL. 48 49 Args: 50 current_url: The current URL being processed 51 """ 52 # Skip if URL has already been visited 53 if current_url in visited_urls: 54 return 55 56 # Mark current URL as visited 57 visited_urls.add(current_url) 58 59 # Get all URLs from the current page 60 for next_url in htmlParser.getUrls(current_url): 61 # Only crawl URLs with the same hostname 62 if get_hostname(current_url) == get_hostname(next_url): 63 depth_first_search(next_url) 64 65 # Set to store all visited URLs (prevents duplicate visits) 66 visited_urls: Set[str] = set() 67 68 # Start crawling from the initial URL 69 depth_first_search(startUrl) 70 71 # Convert set to list and return 72 return list(visited_urls) 73
Java Solution:

/**
 * // This is the HtmlParser's API interface.
 * // You should not implement it, or speculate about its implementation
 * interface HtmlParser {
 *     public List<String> getUrls(String url) {}
 * }
 */

class Solution {
    // Set to store all visited URLs from the same hostname
    private Set<String> visitedUrls;

    /**
     * Crawls all URLs under the same hostname starting from the given URL
     * @param startUrl The starting URL for web crawling
     * @param htmlParser The HTML parser to extract URLs from web pages
     * @return List of all URLs under the same hostname
     */
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        visitedUrls = new HashSet<>();
        dfs(startUrl, htmlParser);
        return new ArrayList<>(visitedUrls);
    }

    /**
     * Depth-first search to recursively crawl URLs
     * @param currentUrl The current URL being processed
     * @param htmlParser The HTML parser to extract URLs from web pages
     */
    private void dfs(String currentUrl, HtmlParser htmlParser) {
        // Skip if URL has already been visited
        if (visitedUrls.contains(currentUrl)) {
            return;
        }

        // Mark current URL as visited
        visitedUrls.add(currentUrl);

        // Get all URLs from the current page and process them
        for (String nextUrl : htmlParser.getUrls(currentUrl)) {
            // Only crawl URLs with the same hostname
            if (extractHostname(nextUrl).equals(extractHostname(currentUrl))) {
                dfs(nextUrl, htmlParser);
            }
        }
    }

    /**
     * Extracts the hostname from a URL
     * @param url The URL string (format: http://hostname/path)
     * @return The hostname portion of the URL
     */
    private String extractHostname(String url) {
        // Remove "http://" prefix (7 characters)
        String urlWithoutProtocol = url.substring(7);
        // Split by "/" and return the hostname part
        return urlWithoutProtocol.split("/")[0];
    }
}
C++ Solution:

/**
 * // This is the HtmlParser's API interface.
 * // You should not implement it, or speculate about its implementation
 * class HtmlParser {
 *   public:
 *     vector<string> getUrls(string url);
 * };
 */

class Solution {
public:
    vector<string> crawl(string startUrl, HtmlParser htmlParser) {
        // Initialize result vector and visited set
        vector<string> result;
        unordered_set<string> visited;

        // Start DFS traversal from the starting URL
        dfs(startUrl, htmlParser, visited, result);

        return result;
    }

private:
    /**
     * Performs depth-first search to crawl all URLs under the same hostname
     * @param url: Current URL to process
     * @param htmlParser: Parser to get URLs from current page
     * @param visited: Set of already visited URLs to avoid duplicates
     * @param result: Vector to store all crawled URLs
     */
    void dfs(const string& url, HtmlParser& htmlParser,
             unordered_set<string>& visited, vector<string>& result) {
        // Skip if URL has already been visited
        if (visited.count(url)) {
            return;
        }

        // Mark current URL as visited and add to result
        visited.insert(url);
        result.push_back(url);

        // Get all URLs from current page and recursively visit those with same hostname
        vector<string> nextUrls = htmlParser.getUrls(url);
        for (const string& nextUrl : nextUrls) {
            // Only crawl URLs with the same hostname
            if (extractHostname(url) == extractHostname(nextUrl)) {
                dfs(nextUrl, htmlParser, visited, result);
            }
        }
    }

    /**
     * Extracts hostname from a URL
     * Assumes URL starts with "http://" (7 characters)
     * @param url: URL string to extract hostname from
     * @return: Hostname portion of the URL
     */
    string extractHostname(const string& url) {
        // Skip "http://" prefix (7 characters)
        const int protocolLength = 7;
        string hostname;

        // Extract characters until '/' or end of string
        for (size_t i = protocolLength; i < url.size(); ++i) {
            if (url[i] == '/') {
                break;
            }
            hostname += url[i];
        }

        return hostname;
    }
};
TypeScript Solution:

/**
 * // This is the HtmlParser's API interface.
 * // You should not implement it, or speculate about its implementation
 * interface HtmlParser {
 *     getUrls(url: string): string[];
 * }
 */

/**
 * Main function to crawl all URLs under the same hostname starting from a given URL
 * @param startUrl - The starting URL to begin crawling from
 * @param htmlParser - Parser instance to extract URLs from HTML pages
 * @returns Array of all crawled URLs under the same hostname
 */
function crawl(startUrl: string, htmlParser: HtmlParser): string[] {
    // Initialize result array and visited set
    const result: string[] = [];
    const visited: Set<string> = new Set();

    // Start DFS traversal from the starting URL
    dfs(startUrl, htmlParser, visited, result);

    return result;
}

/**
 * Performs depth-first search to crawl all URLs under the same hostname
 * @param url - Current URL to process
 * @param htmlParser - Parser to get URLs from current page
 * @param visited - Set of already visited URLs to avoid duplicates
 * @param result - Array to store all crawled URLs
 */
function dfs(
    url: string,
    htmlParser: HtmlParser,
    visited: Set<string>,
    result: string[]
): void {
    // Skip if URL has already been visited
    if (visited.has(url)) {
        return;
    }

    // Mark current URL as visited and add to result
    visited.add(url);
    result.push(url);

    // Get all URLs from current page and recursively visit those with same hostname
    const nextUrls: string[] = htmlParser.getUrls(url);
    for (const nextUrl of nextUrls) {
        // Only crawl URLs with the same hostname
        if (extractHostname(url) === extractHostname(nextUrl)) {
            dfs(nextUrl, htmlParser, visited, result);
        }
    }
}

/**
 * Extracts hostname from a URL
 * Assumes URL starts with "http://" (7 characters)
 * @param url - URL string to extract hostname from
 * @returns Hostname portion of the URL
 */
function extractHostname(url: string): string {
    // Skip "http://" prefix (7 characters)
    const protocolLength: number = 7;
    let hostname: string = "";

    // Extract characters until '/' or end of string
    for (let i = protocolLength; i < url.length; i++) {
        if (url[i] === '/') {
            break;
        }
        hostname += url[i];
    }

    return hostname;
}

Solution Approach

The implementation uses a recursive DFS approach with a helper function to extract hostnames and a set to track visited URLs.

Helper Function for Hostname Extraction:

def host(url):
    url = url[7:]             # Remove "http://" (7 characters)
    return url.split('/')[0]  # Get everything before first '/'

This function extracts the hostname from a URL. For example, http://example.com/page becomes example.com.

Main DFS Function: The core exploration logic is implemented as a nested recursive function:

def dfs(url):
    if url in ans:
        return                           # Base case: already visited
    ans.add(url)                         # Mark as visited
    for next in htmlParser.getUrls(url):
        if host(url) == host(next):      # Same hostname check
            dfs(next)                    # Recursive exploration

Algorithm Steps:

  1. Initialize a set ans to store all visited URLs. Using a set ensures O(1) lookup time for checking if a URL has been visited and automatically handles duplicates.

  2. Start DFS from the starting URL by calling dfs(startUrl).

  3. In each DFS call:

    • Check if the current URL has been visited (base case)
    • If visited, return immediately to avoid cycles
    • If not visited, add it to the ans set
    • Get all linked URLs from the current page using htmlParser.getUrls(url)
    • For each linked URL, check if it has the same hostname
    • If the hostname matches, recursively explore that URL
  4. Return the result by converting the set to a list: return list(ans)

Key Design Decisions:

  • Set vs List for tracking visited URLs: A set provides O(1) membership checking compared to O(n) for a list, making the algorithm more efficient when dealing with many URLs.

  • Recursive DFS vs Iterative: The recursive approach is cleaner and more intuitive for this problem. The call stack naturally handles the backtracking.

  • Hostname comparison: By comparing hostnames before making the recursive call, we ensure we never explore URLs outside our domain, maintaining the problem constraints.

The time complexity is O(N + E) where N is the number of unique URLs within the hostname and E is the total number of edges (links) between them. The space complexity is O(N) for storing the visited URLs plus O(H) for the recursion stack depth, where H is the maximum depth of the URL graph.


Example Walkthrough

Let's walk through a small example to illustrate the solution approach.

Given:

  • startUrl = "http://news.com/world"
  • The HtmlParser returns the following links for each URL:
    • "http://news.com/world"["http://news.com/sports", "http://news.com", "http://blog.com/travel"]
    • "http://news.com/sports"["http://news.com/world", "http://news.com/tech"]
    • "http://news.com"["http://news.com/world"]
    • "http://news.com/tech"[]

Step-by-step execution:

  1. Initialize: Create empty set ans = {}

  2. Start DFS with "http://news.com/world":

    • Not in ans, so add it: ans = {"http://news.com/world"}
    • Get linked URLs: ["http://news.com/sports", "http://news.com", "http://blog.com/travel"]
    • Check each link:
      • "http://news.com/sports": hostname is "news.com" (same as start) → recurse
      • "http://news.com": hostname is "news.com" (same) → recurse later
      • "http://blog.com/travel": hostname is "blog.com" (different) → skip
  3. DFS on "http://news.com/sports":

    • Not in ans, so add it: ans = {"http://news.com/world", "http://news.com/sports"}
    • Get linked URLs: ["http://news.com/world", "http://news.com/tech"]
    • Check each link:
      • "http://news.com/world": already in ans → skip
      • "http://news.com/tech": hostname is "news.com" (same) → recurse
  4. DFS on "http://news.com/tech":

    • Not in ans, so add it: ans = {"http://news.com/world", "http://news.com/sports", "http://news.com/tech"}
    • Get linked URLs: [] (no links)
    • Nothing to explore, return
  5. Back to step 2, continue with "http://news.com":

    • Not in ans, so add it: ans = {"http://news.com/world", "http://news.com/sports", "http://news.com/tech", "http://news.com"}
    • Get linked URLs: ["http://news.com/world"]
    • "http://news.com/world": already in ans → skip
  6. Final result: Convert set to list and return ["http://news.com/world", "http://news.com/sports", "http://news.com/tech", "http://news.com"]

The crawler successfully found all URLs within the news.com domain while avoiding the external blog.com link and preventing infinite loops from circular references.
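The walkthrough can be reproduced end to end with a small stand-in parser. FakeHtmlParser below is a hypothetical test double (the real HtmlParser is supplied by the judge), and it assumes the Python Solution class from the Solution Implementation section is in scope.

# Hypothetical test double reproducing the walkthrough above.
class FakeHtmlParser:
    def __init__(self, links):
        self.links = links

    def getUrls(self, url):
        return self.links.get(url, [])

links = {
    "http://news.com/world":  ["http://news.com/sports", "http://news.com", "http://blog.com/travel"],
    "http://news.com/sports": ["http://news.com/world", "http://news.com/tech"],
    "http://news.com":        ["http://news.com/world"],
    "http://news.com/tech":   [],
}

result = Solution().crawl("http://news.com/world", FakeHtmlParser(links))
print(sorted(result))
# ['http://news.com', 'http://news.com/sports', 'http://news.com/tech', 'http://news.com/world']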

Time and Space Complexity

Time Complexity: O(N + E)

Where N is the total number of unique URLs within the same hostname that can be reached from the startUrl, and E is the total number of edges (links between pages).

  • The DFS traversal visits each unique URL exactly once due to the check if url in ans: return, which takes O(1) time for set lookup
  • For each visited URL, we call htmlParser.getUrls(url) once, which returns a list of URLs
  • We iterate through all returned URLs to check if they belong to the same hostname using the host() function
  • The host() function performs string operations that take O(L) time where L is the length of the URL, but this can be considered O(1) for practical URL lengths
  • In total, we process each URL once and examine each edge once, resulting in O(N + E) time complexity

Space Complexity: O(N + H)

Where N is the number of unique URLs visited and H is the maximum depth of the recursion stack.

  • The ans set stores all unique URLs visited within the same hostname, requiring O(N) space
  • The recursive DFS call stack can go as deep as the longest path in the crawl graph, which in the worst case (a linear chain of URLs) could be O(N), but typically would be much smaller
  • The space used by htmlParser.getUrls() to return lists of URLs is temporary and gets garbage collected after each call
  • Converting the set to a list at the end requires O(N) additional space temporarily
  • Overall space complexity is O(N) for storing the URLs plus O(H) for the recursion depth, giving us O(N + H), which simplifies to O(N) in the worst case


Common Pitfalls

1. Incorrect Hostname Extraction for Edge Cases

The Pitfall: The current hostname extraction function assumes all URLs start with http:// and uses a fixed offset of 7 characters. This breaks if:

  • URLs use https:// (8 characters)
  • URLs have different protocols or formats
  • The URL format is unexpected

Example failure case:

# Current implementation fails for:
url = "https://example.com/page"  # url[7:] leaves "s://example.com/page", so the "hostname" becomes "s:"
url = "http://example.com"        # Works even with no trailing path: split('/') returns ["example.com"]

Solution: Use a more robust hostname extraction method:

def get_hostname(url: str) -> str:
    # Handle both http:// and https://
    if url.startswith('http://'):
        url = url[7:]
    elif url.startswith('https://'):
        url = url[8:]

    # Split by '/' and take the hostname part
    # Handles the case where there's no path after the hostname
    parts = url.split('/', 1)  # Split only at first '/'
    return parts[0]
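Outside the LeetCode sandbox, the standard library already handles scheme and path parsing. The sketch below uses urllib.parse as an alternative; it is not required for this problem, where every URL is guaranteed to start with http://.

# Sketch: hostname extraction via the standard library instead of manual slicing.
from urllib.parse import urlparse

def get_hostname(url: str) -> str:
    return urlparse(url).netloc   # "http://example.com/page" -> "example.com"

print(get_hostname("https://example.com/page"))  # example.com
print(get_hostname("http://example.com"))        # example.com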

2. Stack Overflow with Deep or Cyclic URL Structures

The Pitfall: The recursive DFS approach can cause stack overflow if the URL graph is very deep or has many levels. Python's default recursion limit is around 1000, which could be exceeded in large websites.

Solution: Implement an iterative version using an explicit stack:

def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
    visited_urls = set()
    stack = [startUrl]
    start_hostname = get_hostname(startUrl)

    while stack:
        current_url = stack.pop()
        if current_url in visited_urls:
            continue

        visited_urls.add(current_url)

        for next_url in htmlParser.getUrls(current_url):
            if get_hostname(next_url) == start_hostname and next_url not in visited_urls:
                stack.append(next_url)

    return list(visited_urls)
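If you prefer to keep the recursive version, Python's recursion limit can also be raised, though the iterative rewrite above is the more robust fix:

import sys

# Raising the limit trades a RecursionError for the risk of exhausting the
# underlying C stack on extremely deep link chains; use with care.
sys.setrecursionlimit(10_000)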

3. Inefficient Hostname Comparison

The Pitfall: The current code calls get_hostname(current_url) inside the loop for every linked URL, even though the current URL's hostname never changes. This results in redundant parsing.

Solution: Cache the hostname calculation:

def depth_first_search(current_url: str) -> None:
    if current_url in visited_urls:
        return

    visited_urls.add(current_url)
    current_hostname = get_hostname(current_url)  # Calculate once

    for next_url in htmlParser.getUrls(current_url):
        if current_hostname == get_hostname(next_url):  # Use cached value
            depth_first_search(next_url)

4. Not Handling the Start URL's Hostname Consistently

The Pitfall: The code compares get_hostname(current_url) with get_hostname(next_url) in each recursive call, but this could lead to issues if there are redirects or if the comparison logic changes deeper in the recursion.

Solution: Store the target hostname once at the beginning:

def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
    target_hostname = get_hostname(startUrl)  # Fixed target
    visited_urls = set()

    def dfs(current_url: str) -> None:
        if current_url in visited_urls:
            return
        visited_urls.add(current_url)

        for next_url in htmlParser.getUrls(current_url):
            # Always compare against the original hostname
            if get_hostname(next_url) == target_hostname:
                dfs(next_url)

    dfs(startUrl)
    return list(visited_urls)

This ensures that all URLs are compared against the same reference hostname throughout the crawling process, preventing drift or inconsistencies.

Load More