I want to create a personal web crawler. As I download each page, I collect a new list of URLs. To increase the likelihood that I am downloading the most important pages first, I want to keep the list of URLs to visit sorted by length (shorter URLs are more likely to be closer to the front page). That presents a problem though, because I'm also adding URLs.
A naïve list (Vec<T>
in Rust), won't be very efficient. After every batch of inserts, I would need to re-sort the list. Wouldn't it be good if there were a data structure that could maintain the sort order itself? Yes! It's called a binary heap.
Rust's standard library provides a binary heap implementation, as std::collections::BinaryHeap
. In my case, I want to sort a list of url::Url
. By default, they are sorted in alphabetic order.
use url::Url; use std::collections::BinaryHeap; fn main() { let raw_urls = vec![ "https://www.youtube.com/watch?v=ywskA8CoulM", "https://tim.mcnamara.nz/", "https://github.com/rust-lang/rust/issues?labels=E-easy&state=open", ]; let mut urls = BinaryHeap::new(); for url in raw_urls { urls.push(Url::parse(url).unwrap()); } while let Some(url) = urls.pop() { println!("{:?}", url.as_str()); } }
This prints our URLs, but in alphabetic order:
"https://www.youtube.com/watch?v=ywskA8CoulM" "https://tim.mcnamara.nz/" "https://github.com/rust-lang/rust/issues?labels=E-easy&state=open"
Unfortunately, it's a little bit complicated to ask BinaryHeap
to use a custom sort order. To change the default, I create a new type that wraps one as ShortestFirst(Url)
. Then, I implement Ord
and PartialOrd
traits.
use url::Url; use std::collections::BinaryHeap; #[derive(PartialEq, Eq, , Debug)] struct ShortestFirst(Url); impl PartialOrd for ShortestFirst { fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> { Some(self.cmp(other)) } } impl Ord for ShortestFirst { fn cmp(&self, other: &Self) -> std::cmp::Ordering { let left = (other.0).as_str().len(); let right = (self.0).as_str().len(); left.cmp(&right) } } fn main() { let raw_urls = vec![ "https://www.youtube.com/watch?v=ywskA8CoulM", "https://tim.mcnamara.nz/", "https://github.com/rust-lang/rust/issues?labels=E-easy&state=open", ]; let mut urls = BinaryHeap::new(); for url in raw_urls { urls.push(ShortestFirst(Url::parse(url).unwrap())); } while let Some(ShortestFirst(url)) = urls.pop() { println!("{:?}", url.as_str()); } }
This prints the URLs, in the order we wanted:
"https://tim.mcnamara.nz/" "https://www.youtube.com/watch?v=ywskA8CoulM" "https://github.com/rust-lang/rust/issues?labels=E-easy&state=open"
Top comments (0)