| Safe Haskell | Safe-Inferred |
|---|---|
| Language | Haskell2010 |
Network.HTTP.Conduit.Downloader
Description
HTTP downloader tailored for web-crawler needs.
- Handles all possible http-client exceptions and returns human readable error messages.
- Handles some web server bugs (returning
deflatedata instead ofgzip, invalidgzipencoding). - Uses OpenSSL instead of
tlspackage (sincetlsdoesn't handle all sites and works slower than OpenSSL). - Ignores invalid SSL sertificates.
- Receives data in 32k chunks internally to reduce memory fragmentation on many parallel downloads.
- Download timeout.
- Total download size limit.
- Returns HTTP headers for subsequent redownloads and handles
Not modifiedresults. - Can be used with external DNS resolver (hsdns-cache for example).
- Keep-alive connections pool (thanks to http-client).
Typical workflow in crawler:
withDnsCache $ c -> withDownloader $ d -> do ... -- got URL from queue ra <- resolveA c $ hostNameFromUrl url case ra of Left err -> ... -- uh oh, bad host Right ha -> do ... -- crawler politeness stuff (rate limits, queues) dr <- download d url (Just ha) opts case dr of DROK dat redownloadOptions -> ... -- analyze data, save redownloadOpts for next download DRRedirect .. -> ... DRNotModified -> ... DRError e -> ...
It's highly recommended to use http://hackage.haskell.org/package/concurrent-dns-cache (preferably with single resolver pointing to locally running BIND) for DNS resolution since getAddrInfo used in http-client can be buggy and ineffective when it needs to resolve many hosts per second for a long time.
Synopsis
- urlGetContents :: String -> IO ByteString
- urlGetContentsPost :: String -> ByteString -> IO ByteString
- download :: Downloader -> String -> Maybe HostAddress -> DownloadOptions -> IO DownloadResult
- post :: Downloader -> String -> Maybe HostAddress -> ByteString -> IO DownloadResult
- downloadG :: (Request -> IO Request) -> Downloader -> String -> Maybe HostAddress -> DownloadOptions -> IO DownloadResult
- rawDownload :: (Request -> IO Request) -> Downloader -> String -> Maybe HostAddress -> DownloadOptions -> IO (DownloadResult, Maybe RawDownloadResult)
- data DownloadResult
- data RawDownloadResult = RawDownloadResult {}
- type DownloadOptions = [String]
- data DownloaderSettings = DownloaderSettings {}
- data Downloader
- withDownloader :: (Downloader -> IO a) -> IO a
- withDownloaderSettings :: DownloaderSettings -> (Downloader -> IO a) -> IO a
- newDownloader :: DownloaderSettings -> IO Downloader
- postRequest :: ByteString -> Request -> Request
Download operations
urlGetContents :: String -> IO ByteString Source #
Download single URL with default DownloaderSettings. Fails if result is not DROK.
urlGetContentsPost :: String -> ByteString -> IO ByteString Source #
Post data and download single URL with default DownloaderSettings. Fails if result is not DROK.
Arguments
| :: Downloader | |
| -> String | URL |
| -> Maybe HostAddress | Optional resolved |
| -> DownloadOptions | |
| -> IO DownloadResult |
Perform download
post :: Downloader -> String -> Maybe HostAddress -> ByteString -> IO DownloadResult Source #
Perform HTTP POST.
Arguments
| :: (Request -> IO Request) | Function to modify |
| -> Downloader | |
| -> String | URL |
| -> Maybe HostAddress | Optional resolved |
| -> DownloadOptions | |
| -> IO DownloadResult |
Generic version of download with ability to modify http-client Request.
Arguments
| :: (Request -> IO Request) | Function to modify |
| -> Downloader | |
| -> String | URL |
| -> Maybe HostAddress | Optional resolved |
| -> DownloadOptions | |
| -> IO (DownloadResult, Maybe RawDownloadResult) |
Even more generic version of download, which returns RawDownloadResult. RawDownloadResult is optional since it can not be determined on timeouts and connection errors.
data DownloadResult Source #
Result of download operation.
Constructors
| DROK ByteString DownloadOptions | Successful download with data and options for next download. |
| DRRedirect String | Redirect URL |
| DRError String | Error |
| DRNotModified | HTTP 304 Not Modified |
Instances
| Read DownloadResult Source # | |
Defined in Network.HTTP.Conduit.Downloader Methods readsPrec :: Int -> ReadS DownloadResult # readList :: ReadS [DownloadResult] # | |
| Show DownloadResult Source # | |
Defined in Network.HTTP.Conduit.Downloader Methods showsPrec :: Int -> DownloadResult -> ShowS # show :: DownloadResult -> String # showList :: [DownloadResult] -> ShowS # | |
| Eq DownloadResult Source # | |
Defined in Network.HTTP.Conduit.Downloader Methods (==) :: DownloadResult -> DownloadResult -> Bool # (/=) :: DownloadResult -> DownloadResult -> Bool # | |
data RawDownloadResult Source #
Result of rawDownload operation.
Constructors
| RawDownloadResult | |
Fields | |
Instances
| Show RawDownloadResult Source # | |
Defined in Network.HTTP.Conduit.Downloader Methods showsPrec :: Int -> RawDownloadResult -> ShowS # show :: RawDownloadResult -> String # showList :: [RawDownloadResult] -> ShowS # | |
type DownloadOptions = [String] Source #
If-None-Match and/or If-Modified-Since headers.
Downloader
data DownloaderSettings Source #
Settings used in downloader.
Constructors
| DownloaderSettings | |
Fields
| |
Instances
| Default DownloaderSettings Source # | |
Defined in Network.HTTP.Conduit.Downloader Methods | |
data Downloader Source #
Keeps http-client Manager and DownloaderSettings.
withDownloader :: (Downloader -> IO a) -> IO a Source #
Create a new Downloader, use it in the provided function, and then release it.
withDownloaderSettings :: DownloaderSettings -> (Downloader -> IO a) -> IO a Source #
Create a new Downloader with provided settings, use it in the provided function, and then release it.
newDownloader :: DownloaderSettings -> IO Downloader Source #
Create a Downloader with settings.
Utils
postRequest :: ByteString -> Request -> Request Source #
Make HTTP POST request.