GoScrapy requires Go version 1.22 or higher to run.
GoScrapy provides the goscrapy CLI tool to help you scaffold a GoScrapy project.
Usage
- Install

```bash
go install github.com/tech-engine/goscrapy@latest
```

- Verify installation

```bash
goscrapy -v
```

- Create a project

```bash
goscrapy startproject scrapejsp
```

- Create a custom pipeline

```bash
goscrapy pipeline export_2_DB
```
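After startproject, the generated project roughly contains the files discussed on this page (a sketch; names are inferred from this page, and the actual scaffold may differ between versions):

```
scrapejsp/
├── main.go      entry point that runs the spider
├── job.go       Job: input to the spider
├── output.go    Record: output of the spider
├── spider.go    scraping logic
└── settings.go  middlewares, pipelines and tunables
```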
GoScrapy operates around the following three concepts:

- Job: Describes an input to your spider.
- Record: Represents an output produced by your spider.
- Spider: Contains the main logic of your scraper.
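In short, the data flows like this:

```
Job ──▶ Spider (StartRequest → parse) ──▶ Yield(Record) ──▶ Pipelines
```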
A Job represents an input to a GoScrapy spider and must implement the core.IJob interface.
```go
type IJob interface {
    Id() string
}
```

```go
type Job struct {
    id string
    // add your own fields here
}

func (j *Job) Id() string {
    return j.id
}
```
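Since id is unexported, callers need a way to set it; a minimal constructor (our own sketch, not necessarily what the CLI generates) could be:

```go
// NewJob builds a Job with the given id (hypothetical helper).
func NewJob(id string) *Job {
    return &Job{id: id}
}
```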
A Record represents an output produced by a spider (via yield) and must implement core.IOutput.

```go
type IOutput interface {
    Record() *Record
    RecordKeys() []string
    RecordFlat() []any
    Job() IJob
}
```

```go
type Record struct {
    J *Job `json:"-" csv:"-"`
}

func (r *Record) Record() *Record {
    return r
}

func (r *Record) RecordKeys() []string {
    // ...
    keys := make([]string, numFields)
    // ...
    return keys
}

func (r *Record) RecordFlat() []any {
    // ...
    return slice
}

func (r *Record) Job() core.IJob {
    return r.J
}
```
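The RecordKeys and RecordFlat bodies are elided above. Purely as an illustration (a hypothetical, reflection-based take, not the generated code), they could derive keys and values from the csv struct tags:

```go
import "reflect"

// Title is a hypothetical data field added for illustration.
type Record struct {
    J     *Job   `json:"-" csv:"-"`
    Title string `json:"title" csv:"title"`
}

// RecordKeys returns one key per data field, read from its csv tag.
func (r *Record) RecordKeys() []string {
    t := reflect.TypeOf(*r)
    keys := make([]string, 0, t.NumField())
    for i := 0; i < t.NumField(); i++ {
        if tag := t.Field(i).Tag.Get("csv"); tag != "" && tag != "-" {
            keys = append(keys, tag)
        }
    }
    return keys
}

// RecordFlat returns the field values in the same order as RecordKeys.
func (r *Record) RecordFlat() []any {
    v := reflect.ValueOf(*r)
    vals := make([]any, 0, v.NumField())
    for i := 0; i < v.NumField(); i++ {
        if tag := v.Type().Field(i).Tag.Get("csv"); tag != "" && tag != "-" {
            vals = append(vals, v.Field(i).Interface())
        }
    }
    return vals
}
```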
A Spider encapsulates the main logic of a GoScrapy spider. We embed gos.ICoreSpider to make our spider work.

```go
type Spider struct {
    gos.ICoreSpider[*Record]
}

func New(ctx context.Context) (*Spider, <-chan error) {
    // use proxies
    // proxies := core.WithProxies("proxy_url1", "proxy_url2", ...)
    // core := gos.New[*Record]().WithClient(
    //     gos.DefaultClient(proxies),
    // )

    core := gos.New[*Record]()

    // add middlewares
    core.MiddlewareManager.Add(MIDDLEWARES...)

    // add pipelines
    core.PipelineManager.Add(PIPELINES...)

    errCh := make(chan error)

    go func() {
        errCh <- core.Start(ctx)
    }()

    return &Spider{
        core,
    }, errCh
}

// StartRequest is the entry point to the spider.
func (s *Spider) StartRequest(ctx context.Context, job *Job) {
    // for each request we must call NewRequest() and never reuse it
    req := s.NewRequest()

    /* GET is the request method; method chaining is possible
    var headers http.Header
    req.Url("<URL_HERE>").
        Meta("MY_KEY1", "MY_VALUE").
        Meta("MY_KEY2", true).
        Header(headers)
    */

    /* POST
    req.Url(<URL_HERE>)
    req.Method("POST")
    req.Body(<BODY_HERE>)
    */

    // call the next parse method
    s.Request(req, s.parse)
}

// Close is called when the spider exits.
func (s *Spider) Close(ctx context.Context) {
}

func (s *Spider) parse(ctx context.Context, resp core.IResponseReader) {
    // resp.Body()
    // resp.StatusCode()
    // resp.Header()
    // resp.Bytes()
    // resp.Meta("MY_KEY1")

    // yielding pushes the output to be processed by pipelines;
    // also check output.go for the available fields
    var data Record

    err := json.Unmarshal(resp.Bytes(), &data)
    if err != nil {
        log.Panicln(err)
    }

    // s.Yield(&data)
}
```

In addition to the files discussed above, we also have settings.go, where we import all the middlewares and pipelines we want to use in our project.
```go
// HTTP Transport settings

// Default: 10000
const MIDDLEWARE_HTTP_TIMEOUT_MS = ""

// Default: 100
const MIDDLEWARE_HTTP_MAX_IDLE_CONN = ""

// Default: 100
const MIDDLEWARE_HTTP_MAX_CONN_PER_HOST = ""

// Default: 100
const MIDDLEWARE_HTTP_MAX_IDLE_CONN_PER_HOST = ""

// Inbuilt Retry middleware settings

// Default: 3
const MIDDLEWARE_HTTP_RETRY_MAX_RETRIES = ""

// Default: 500, 502, 503, 504, 522, 524, 408, 429
const MIDDLEWARE_HTTP_RETRY_CODES = ""

// Default: 1s
const MIDDLEWARE_HTTP_RETRY_BASE_DELAY = ""

// Default: 1000000
const SCHEDULER_REQ_RES_POOL_SIZE = ""

// Default: num. of CPUs * 3
const SCHEDULER_CONCURRENCY = ""

// Default: 1000000
const SCHEDULER_WORK_QUEUE_SIZE = ""

// Pipeline Manager settings

// Default: 10000
const PIPELINEMANAGER_ITEMPOOL_SIZE = ""

// Default: 24
const PIPELINEMANAGER_ITEM_SIZE = ""

// Default: 0
const PIPELINEMANAGER_OUTPUT_QUEUE_BUF_SIZE = ""

// Default: 1000
const PIPELINEMANAGER_MAX_PROCESS_ITEM_CONCURRENCY = ""

// Middlewares here
var MIDDLEWARES = []middlewaremanager.Middleware{
    middlewares.Retry(),
    middlewares.MultiCookieJar,
    middlewares.DupeFilter,
}

var export2CSV = pipelines.Export2CSV[*Record](pipelines.Export2CSVOpts{
    Filename: "itstimeitsnowornever.csv",
})

// Pipelines here
var PIPELINES = []pm.IPipeline[*Record]{
    export2CSV,
    // export2Json,
}

// ...
```

Example projects:

- scrapejsp - API scraping
- scrapejsp_method2 [this newer method is recommended] - API scraping
- books.toscrape.com - HTML scraping
More examples coming...
[main.go]

```go
func main() {
    ctx, cancel := context.WithCancel(context.Background())

    var wg sync.WaitGroup
    wg.Add(1)

    spider, errCh := test1.New(ctx)

    go func() {
        defer wg.Done()

        // a context.Canceled error is expected on graceful shutdown
        err := <-errCh
        if err != nil && !errors.Is(err, context.Canceled) {
            fmt.Printf("failed: %q", err)
        }
    }()

    // start the scraper with a job; nil is passed here, but you can pass your own job
    spider.StartRequest(ctx, nil)

    OnTerminate(func() {
        fmt.Println("exit signal received: shutting down gracefully")
        cancel()
        wg.Wait()
    })
}
```
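main.go relies on an OnTerminate helper from the scaffold. Its source is not shown on this page, but a minimal sketch of the assumed behavior (block until SIGINT/SIGTERM, then run the callback) looks like:

```go
import (
    "os"
    "os/signal"
    "syscall"
)

// OnTerminate blocks until an exit signal arrives, then runs fn.
// This is our own sketch of the scaffolded helper, not its actual source.
func OnTerminate(fn func()) {
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)
    <-sigCh
    fn()
}
```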
Customize the default client:

| Option | Description | Default |
|---|---|---|
| WithProxies | Accepts multiple proxy URL strings. | the client uses proxies from the environment |
| WithTimeout | HTTP client timeout. | 10 seconds |
| WithMaxIdleConns | Controls the max number of idle (keep-alive) connections across all hosts. 0 means unlimited. | 100 |
| WithMaxIdleConnsPerHost | Same as WithMaxIdleConns, but per host. | 100 |
| WithMaxConnsPerHost | Limits the total number of connections per host. 0 means unlimited. | 100 |
| WithProxyFn | Accepts a custom proxy function for the transport. | round robin |
[spider.go]
```go
func New(ctx context.Context) (*Spider, <-chan error) {
    // default client options
    // proxies := gos.WithProxies("proxy_url1", "proxy_url2", ...)
    // core := gos.New[*Record]().WithClient(
    //     gos.DefaultClient(proxies),
    // )

    // we can also provide our own custom client
    // core := gos.New[*Record]().WithClient(myCustomHTTPClient)
}
```
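The table's options could be combined like this (a sketch: only WithProxies and DefaultClient appear verbatim on this page; the package and argument types of the other option constructors are assumptions):

```go
// hypothetical combination of client options from the table above;
// the WithTimeout / WithMaxConnsPerHost signatures are assumed
core := gos.New[*Record]().WithClient(gos.DefaultClient(
    gos.WithProxies("http://user:pass@host:port"),
    gos.WithTimeout(30*time.Second),
    gos.WithMaxConnsPerHost(50),
))
```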
Pipelines help in managing, transforming, and fine-tuning the scraped data. We can add pipelines using coreSpider.PipelineManager.Add().
[settings.go]
```go
// use the export-to-CSV pipeline
export2Csv := pipelines.Export2CSV[*scrapejsp.Record](pipelines.Export2CSVOpts{
    Filename: "itstimeitsnowornever.csv",
})

// use the export-to-JSON pipeline
export2Json := pipelines.Export2JSON[*scrapejsp.Record](pipelines.Export2JSONOpts{
    Filename:  "itstimeitsnowornever.json",
    Immediate: true,
})
```

A Group allows us to execute multiple pipelines concurrently; all pipelines in a group behave as one single pipeline. This is useful when we want to export our data to multiple destinations: instead of exporting sequentially, we can bundle the exports together in a group.
Pipelines in a group shouldn't be used for data transformation, but for independent tasks such as exporting data to a database.
[settings.go]
```go
func myCustomPipelineGroup() *pm.Group[*Record] {
    // create a group
    pipelineGroup := pm.NewGroup[*Record]()

    pipelineGroup.Add(export2CSV)
    // pipelineGroup.Add(export2Json)

    return pipelineGroup
}

// Pipelines here.
// Executed in the order they appear.
var PIPELINES = []pm.IPipeline[*Record]{
    export2CSV,
    // export2Json,
    // myCustomPipelineGroup(), // use a group as if it were a single pipeline
}
```

GoScrapy also supports built-in and custom middlewares for manipulating outgoing requests.
- MultiCookieJar - used for maintaining different cookie sessions while scraping.
- DupeFilter - filters out duplicate requests.
- Retry - retries a request with exponential back-off upon failure or when the response has HTTP status code 500, 502, 503, 504, 522, 524, 408, or 429.
| Option | Description | Default |
|---|---|---|
| MaxRetries | Additional retries after a failure. | 3 |
| Codes | HTTP status codes that trigger a retry. | 500, 502, 503, 504, 522, 524, 408, 429 |
| BaseDelay | Exponential backoff multiplier. | 1 second |
| Cb | Callback executed after every retry. If the callback returns false, further retries are skipped. | nil |
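These knobs map onto the retry constants in settings.go shown earlier; for example (the string value formats are inferred from the documented defaults, so treat them as assumptions):

```go
// settings.go: override the Retry middleware defaults
const MIDDLEWARE_HTTP_RETRY_MAX_RETRIES = "5"  // default: 3
const MIDDLEWARE_HTTP_RETRY_CODES = "429, 503" // default: 500, 502, 503, 504, 522, 524, 408, 429
const MIDDLEWARE_HTTP_RETRY_BASE_DELAY = "2s"  // default: 1s
```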
We can add middlewares using gos.MiddlewareManager.Add().
[settings.go]
```go
var MIDDLEWARES = []middlewaremanager.Middleware{
    middlewares.Retry(),
    middlewares.MultiCookieJar,
    middlewares.DupeFilter,
}
```

GoScrapy supports custom pipelines. To create one, you can use the goscrapy CLI:
```
abc\go\go-test-scrapy\scrapejsp> goscrapy pipeline export_2_DB
✔️ pipelines\export_2_DB.go
✨ Congrats, export_2_DB created successfully.
```
GoScrapy also supports custom middlewares, which must have the below function signature.

```go
func MultiCookieJar(next http.RoundTripper) http.RoundTripper {
    return core.MiddlewareFunc(func(req *http.Request) (*http.Response, error) {
        // your middleware's custom code here
    })
}
```
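For example, a header-setting middleware could follow that signature (a sketch; only core.MiddlewareFunc and the signature above are taken from this page):

```go
// UserAgent sets a fixed User-Agent header on every outgoing request.
// This is our own sketch built on the signature above, not a built-in.
func UserAgent(ua string) func(next http.RoundTripper) http.RoundTripper {
    return func(next http.RoundTripper) http.RoundTripper {
        return core.MiddlewareFunc(func(req *http.Request) (*http.Response, error) {
            req.Header.Set("User-Agent", ua)
            return next.RoundTrip(req) // continue down the middleware chain
        })
    }
}
```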
GoScrapy supports CSS and XPath selectors.

[spider.go]
```go
func (s *Spider) parse(ctx context.Context, resp core.IResponseReader) {
    // CSS selector: select all product anchor tags and extract their href attribute values
    var productUrls []string
    productUrls = resp.Css("article.product_pod h3 a").Attr("href")

    // select all the text node values
    var productNames []string
    productNames = resp.Css("article.product_pod h3 a").Text()

    // selector chaining is possible too
    productUrls = resp.Css("article.product_pod").Css("h3 a").Attr("href")

    // XPath selector
    productUrls = resp.Xpath("//article[contains(@class, 'product_pod')]//h3//a").Attr("href")

    // chaining XPath and CSS is also possible
    productUrls = resp.Xpath("//article[contains(@class, 'product_pod')]").Css("h3 a").Attr("href")

    // get all matching nodes
    var productUrlNodes []*html.Node
    productUrlNodes = resp.Css("article.product_pod h3 a").GetAll()

    // get the first matching node
    var firstProductUrlNode *html.Node
    firstProductUrlNode = resp.Css("article.product_pod h3 a").Get()

    // silence "declared and not used" in this demo
    _, _, _, _ = productUrls, productNames, productUrlNodes, firstProductUrlNode
}
```
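Putting selectors and requests together, a parse callback can follow the extracted links (a sketch reusing only calls shown on this page; parseProduct is a hypothetical next callback):

```go
func (s *Spider) parse(ctx context.Context, resp core.IResponseReader) {
    // follow every extracted product link
    for _, href := range resp.Css("article.product_pod h3 a").Attr("href") {
        req := s.NewRequest() // a fresh request per URL; never reuse one
        req.Url(href)
        s.Request(req, s.parseProduct)
    }
}
```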