Dmitriy Zub ☀️ for SerpApi

Posted on Jun 1, 2022 • Originally published at serpapi.com

Web Scraping all ResearchGate Publications in Python

#python #programming #tutorial #serpapi

What will be scraped

Prerequisites

Basic knowledge of scraping with CSS selectors

If you haven't scraped with CSS selectors before, I have a dedicated blog post on how to use CSS selectors when web scraping. It covers what they are, their pros and cons, why they matter from a web-scraping perspective, and the most common approaches to using them.

Separate virtual environment

If you haven't worked with a virtual environment before, have a look at my dedicated blog post, Python virtual environments tutorial using Virtualenv and Poetry, to get familiar.

Reduce the chance of being blocked

There's a chance that a request might be blocked. Have a look at how to reduce the chance of being blocked while web scraping, which covers eleven methods to bypass blocks from most websites.

Install libraries and the Chromium browser that `playwright` drives:

pip install parsel playwright
playwright install chromium

Full Code

from parsel import Selector
from playwright.sync_api import sync_playwright
import json


def scrape_researchgate_publications(query: str):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, slow_mo=50)
        page = browser.new_page(user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36")

        publications = []
        page_num = 1

        while True:
            page.goto(f"https://www.researchgate.net/search/publication?q={query}&page={page_num}")
            selector = Selector(text=page.content())

            for publication in selector.css(".nova-legacy-c-card__body--spacing-inherit"):
                title = publication.css(".nova-legacy-v-publication-item__title .nova-legacy-e-link--theme-bare::text").get().title()
                title_link = f'https://www.researchgate.net{publication.css(".nova-legacy-v-publication-item__title .nova-legacy-e-link--theme-bare::attr(href)").get()}'
                publication_type = publication.css(".nova-legacy-v-publication-item__badge::text").get()
                publication_date = publication.css(".nova-legacy-v-publication-item__meta-data-item:nth-child(1) span::text").get()
                publication_doi = publication.css(".nova-legacy-v-publication-item__meta-data-item:nth-child(2) span").xpath("normalize-space()").get()
                publication_isbn = publication.css(".nova-legacy-v-publication-item__meta-data-item:nth-child(3) span").xpath("normalize-space()").get()
                authors = publication.css(".nova-legacy-v-person-inline-item__fullname::text").getall()
                source_link = f'https://www.researchgate.net{publication.css(".nova-legacy-v-publication-item__preview-source .nova-legacy-e-link--theme-bare::attr(href)").get()}'

                publications.append({
                    "title": title,
                    "link": title_link,
                    "source_link": source_link,
                    "publication_type": publication_type,
                    "publication_date": publication_date,
                    "publication_doi": publication_doi,
                    "publication_isbn": publication_isbn,
                    "authors": authors
                })

            print(f"page number: {page_num}")

            # checks if next page arrow key is greyed out `attr(rel)` (inactive) and breaks out of the loop
            if selector.css(".nova-legacy-c-button-group__item:nth-child(9) a::attr(rel)").get():
                break
            else:
                page_num += 1

        print(json.dumps(publications, indent=2, ensure_ascii=False))

        browser.close()


scrape_researchgate_publications(query="coffee")

Code explanation

Import libraries:

from parsel import Selector
from playwright.sync_api import sync_playwright
import json

Code	Explanation
`parsel`	to parse HTML/XML documents. Supports XPath.
`playwright`	to render the page with a browser instance.
`json`	to convert Python dictionary to JSON string.

Define a function and open `playwright` with a context manager:

def scrape_researchgate_publications(query: str):
    with sync_playwright() as p:
        # ...

Code	Explanation
`query: str`	a type hint saying that `query` is expected to be a `str`. Python does not enforce it at runtime.
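
To illustrate what the annotation does and does not do, here is a minimal sketch. `describe_search` is a hypothetical function used only for this example; it is not part of the scraper.

```python
def describe_search(query: str) -> str:
    """Hypothetical function; only shows how annotations behave."""
    return f"searching for {query}"


# Annotations are stored on the function object for tools to read...
print(describe_search.__annotations__)

# ...but nothing stops a caller from passing a non-str value at runtime.
print(describe_search(123))
```

Static type checkers such as mypy can flag the second call; the interpreter itself will not.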

Launch a browser instance and open a new page with the passed user-agent:

browser = p.chromium.launch(headless=True, slow_mo=50)
page = browser.new_page(user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36")

Code	Explanation
`p.chromium.launch()`	to launch Chromium browser instance.
`headless`	to explicitly tell `playwright` to run in headless mode, even though it's the default value.
`slow_mo`	to tell `playwright` to slow down each operation by the given number of milliseconds.
`browser.new_page()`	to open a new page. `user_agent` makes the request look like it comes from a real user's browser. If omitted, the parameter defaults to `None` and `playwright` uses the browser's own user-agent. Check what your user-agent is.

Add a temporary list, set up a while loop, and open a new URL:

publications = []
page_num = 1

while True:
    page.goto(f"https://www.researchgate.net/search/publication?q={query}&page={page_num}")
    selector = Selector(text=page.content())
    # ...

Code	Explanation
`goto()`	to make a request to a specific URL with the passed query and page parameters.
`Selector()`	to take the HTML returned by `page.content()` and make it available for parsing.
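
The f-string passed to `goto()` places `query` into the URL unchanged, which works for a single word such as `coffee`. For queries with spaces or special characters, one option is to encode the parameters with the standard library first. This is a sketch; `build_search_url` is a hypothetical helper, not part of the original code.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.researchgate.net/search/publication"


def build_search_url(query: str, page_num: int = 1) -> str:
    """Hypothetical helper: URL-encode the query and page parameters."""
    return f"{BASE_URL}?{urlencode({'q': query, 'page': page_num})}"


print(build_search_url("coffee"))
print(build_search_url("coffee & tea", page_num=2))
```

Inside the loop this could be used as `page.goto(build_search_url(query, page_num))`.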

Iterate over publication results on each page, extract the data and append to a temporary list:

for publication in selector.css(".nova-legacy-c-card__body--spacing-inherit"):
    title = publication.css(".nova-legacy-v-publication-item__title .nova-legacy-e-link--theme-bare::text").get().title()
    title_link = f'https://www.researchgate.net{publication.css(".nova-legacy-v-publication-item__title .nova-legacy-e-link--theme-bare::attr(href)").get()}'
    publication_type = publication.css(".nova-legacy-v-publication-item__badge::text").get()
    publication_date = publication.css(".nova-legacy-v-publication-item__meta-data-item:nth-child(1) span::text").get()
    publication_doi = publication.css(".nova-legacy-v-publication-item__meta-data-item:nth-child(2) span").xpath("normalize-space()").get()
    publication_isbn = publication.css(".nova-legacy-v-publication-item__meta-data-item:nth-child(3) span").xpath("normalize-space()").get()
    authors = publication.css(".nova-legacy-v-person-inline-item__fullname::text").getall()
    source_link = f'https://www.researchgate.net{publication.css(".nova-legacy-v-publication-item__preview-source .nova-legacy-e-link--theme-bare::attr(href)").get()}'

    publications.append({
        "title": title,
        "link": title_link,
        "source_link": source_link,
        "publication_type": publication_type,
        "publication_date": publication_date,
        "publication_doi": publication_doi,
        "publication_isbn": publication_isbn,
        "authors": authors
    })

Code	Explanation
`css()`	to parse data from the passed CSS selector(s). Every CSS query translates to XPath using the `cssselect` package under the hood.
`::text`/`::attr(attribute)`	to extract textual or attribute data from the node.
`get()`/`getall()`	to get the first match as a string (or `None` if nothing matched), or to get a `list` of all matches.
`xpath("normalize-space()")`	to get the full text of the node, including text inside nested tags, with leading and trailing whitespace stripped and inner whitespace collapsed. `::text` alone returns only the node's own text nodes.

Check if the next page is present and paginate:

# checks if the next page arrow key is greyed out `attr(rel)` (inactive) -> breaks out of the loop
if selector.css(".nova-legacy-c-button-group__item:nth-child(9) a::attr(rel)").get():
    break
else:
    page_num += 1
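
This check is the only exit from `while True`. If the page markup changes and the `rel` attribute is never found, the loop could keep requesting pages indefinitely, so one option is to add an upper bound. Below is a sketch of that idea; `page_numbers`, `has_next_page`, and `max_pages` are hypothetical names, and the callback stands in for the real selector check.

```python
from typing import Callable, Iterator


def page_numbers(has_next_page: Callable[[int], bool], max_pages: int = 50) -> Iterator[int]:
    """Yield page numbers until the last page or the max_pages cap is reached.

    `has_next_page` and `max_pages` are illustrative, not part of the original code.
    """
    for page_num in range(1, max_pages + 1):
        yield page_num
        # Called after the caller has processed the current page.
        if not has_next_page(page_num):
            break


# Pretend the search has 3 pages of results.
print(list(page_numbers(lambda n: n < 3)))

# A check that never reports the last page is still stopped by the cap.
print(list(page_numbers(lambda n: True, max_pages=5)))
```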

Print extracted data, and close browser instance:

print(json.dumps(publications, indent=2, ensure_ascii=False))

browser.close()

# call the function
scrape_researchgate_publications(query="coffee")
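
`ensure_ascii=False` matters here because author names often contain non-ASCII characters. A small sketch of the difference, using one author name from this post's output:

```python
import json

record = {"authors": ["Gülşen Berat Torusdağ"]}

# Default: non-ASCII characters are written as \uXXXX escapes.
print(json.dumps(record))

# ensure_ascii=False keeps the characters readable as-is.
print(json.dumps(record, ensure_ascii=False))
```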

Part of the JSON output:

[
  {
    "title":"The Social Life Of Coffee Turkey’S Local Coffees",
    "link":"https://www.researchgate.netpublication/360540595_The_Social_Life_of_Coffee_Turkey%27s_Local_Coffees?_sg=kzuAi6HlFbSbnLEwtGr3BA_eiFtDIe1VEA4uvJlkBHOcbSjh5XlSQe6GpYvrbi12M0Z2MQ6grwnq9fI",
    "source_link":"https://www.researchgate.netpublication/360540595_The_Social_Life_of_Coffee_Turkey%27s_Local_Coffees?_sg=kzuAi6HlFbSbnLEwtGr3BA_eiFtDIe1VEA4uvJlkBHOcbSjh5XlSQe6GpYvrbi12M0Z2MQ6grwnq9fI",
    "publication_type":"Conference Paper",
    "publication_date":"Apr 2022",
    "publication_doi":null,
    "publication_isbn":null,
    "authors":[
      "Gülşen Berat Torusdağ",
      "Merve Uçkan Çakır",
      "Cinucen Okat"
    ]
  },
  {
    "title":"Coffee With The Algorithm",
    "link":"https://www.researchgate.netpublication/359599064_Coffee_with_the_Algorithm?_sg=3KHP4SXHm_BSCowhgsa4a2B0xmiOUMyuHX2nfqVwRilnvd1grx55EWuJqO0VzbtuG-16TpsDTUywp0o",
    "source_link":"https://www.researchgate.netNone",
    "publication_type":"Chapter",
    "publication_date":"Mar 2022",
    "publication_doi":"DOI: 10.4324/9781003170884-10",
    "publication_isbn":"ISBN: 9781003170884",
    "authors":[
      "Jakob Svensson"
    ]
  },
  ... other publications
  {
    "title":"Coffee In Chhattisgarh", # last publication
    "link":"https://www.researchgate.netpublication/353118247_COFFEE_IN_CHHATTISGARH?_sg=CsJ66DoWjFfkMNdujuE-R9aVTZA4kVb_9lGiy1IrYXls1Nur4XFMdh2s5E9zkF5Skb5ZZzh663USfBA",
    "source_link":"https://www.researchgate.netNone",
    "publication_type":"Technical Report",
    "publication_date":"Jul 2021",
    "publication_doi":null,
    "publication_isbn":null,
    "authors":[
      "Krishan Pal Singh",
      "Beena Nair Singh",
      "Dushyant Singh Thakur",
      "Anurag Kerketta",
      "Shailendra Kumar Sahu"
    ]
  }
]
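
The links in this output read `https://www.researchgate.netpublication/...`, and a missing source link becomes `https://www.researchgate.netNone`. This appears to happen because the f-string glues the base URL directly onto a relative `href`, or onto `None` when nothing matched. One possible fix is to build links with `urllib.parse.urljoin` and keep missing values as `None`. This is a sketch; `absolute_url` is a hypothetical helper name.

```python
from typing import Optional
from urllib.parse import urljoin

BASE_URL = "https://www.researchgate.net"


def absolute_url(href: Optional[str]) -> Optional[str]:
    """Hypothetical helper: join a relative href onto the base URL.

    Returns None when the selector found no href, instead of a "...None" string.
    """
    if href is None:
        return None
    return urljoin(BASE_URL, href)


print(absolute_url("publication/359599064_Coffee_with_the_Algorithm"))
print(absolute_url(None))
```

In the scraper, the result of each `::attr(href)` lookup could be passed through such a helper before being stored in `link` and `source_link`.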
