
Commit 545374c

docs(faq): added faq section and refined installation
1 parent 15b7682 commit 545374c


8 files changed

+94
-84
lines changed


.python-version

Lines changed: 0 additions & 2 deletions
This file was deleted.

README.md

Lines changed: 3 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -168,7 +168,7 @@ Feel free to contribute and join our Discord server to discuss with us improveme
168168

169169
Please see the [contributing guidelines](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/CONTRIBUTING.md).
170170

171-
[![My Skills](https://skillicons.dev/icons?i=discord)](https://discord.gg/gkxQDAjfeX)
171+
[![My Skills](https://skillicons.dev/icons?i=discord)](https://discord.gg/uJN7TYcpNa)
172172
[![My Skills](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/company/scrapegraphai/)
173173
[![My Skills](https://skillicons.dev/icons?i=twitter)](https://twitter.com/scrapegraphai)
174174

@@ -179,13 +179,14 @@ Wanna visualize the roadmap in a more interactive way? Check out the [markmap](h
179179

180180
## ❤️ Contributors
181181
[![Contributors](https://contrib.rocks/image?repo=VinciGit00/Scrapegraph-ai)](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors)
182+
182183
## Sponsors
183184
<div style="text-align: center;">
184185
<a href="https://serpapi.com?utm_source=scrapegraphai">
185186
<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/serp_api_logo.png" alt="SerpAPI" style="width: 10%;">
186187
</a>
187188
<a href="https://dashboard.statproxies.com/?refferal=scrapegraph">
188-
<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/transparent_stat.png" alt="Stats" style="width: 10%;">
189+
<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/transparent_stat.png" alt="Stats" style="width: 15%;">
189190
</a>
190191
</div>
191192

docs/source/getting_started/installation.rst

Lines changed: 9 additions & 2 deletions
@@ -25,11 +25,18 @@ The library is available on PyPI, so it can be installed using the following com
2525

2626
It is highly recommended to install the library in a virtual environment (conda, venv, etc.)
2727

28-
If your clone the repository, you can install the library using `poetry <https://python-poetry.org/docs/>`_:
28+
If you clone the repository, it is recommended to use a package manager like `rye <https://rye.astral.sh/>`_.
29+
To install the library using rye, you can run the following commands:
2930

3031
.. code-block:: bash
3132
32-
poetry install
33+
rye pin 3.10
34+
rye sync
35+
rye build
36+
37+
.. caution::
38+
39+
**Rye** must be installed first by following the instructions on the `official website <https://rye.astral.sh/>`_.
3340

3441
Additionally on Windows when using WSL
3542
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

docs/source/index.rst

Lines changed: 9 additions & 0 deletions
@@ -32,6 +32,15 @@
3232

3333
modules/modules
3434

35+
.. toctree::
36+
:hidden:
37+
:caption: EXTERNAL RESOURCES
38+
39+
GitHub <https://github.com/VinciGit00/Scrapegraph-ai>
40+
Discord <https://discord.gg/uJN7TYcpNa>
41+
LinkedIn <https://www.linkedin.com/company/scrapegraphai/>
42+
Twitter <https://twitter.com/scrapegraphai>
43+
3544
Indices and tables
3645
==================
3746

docs/source/introduction/overview.rst

Lines changed: 67 additions & 11 deletions
@@ -6,13 +6,11 @@
66
Overview
77
========
88

9-
ScrapeGraphAI is a open-source web scraping python library designed to usher in a new era of scraping tools.
10-
In today's rapidly evolving and data-intensive digital landscape, this library stands out by integrating LLM and
11-
direct graph logic to automate the creation of scraping pipelines for websites and various local documents, including XML,
12-
HTML, JSON, and more.
9+
ScrapeGraphAI is an **open-source** Python library designed to revolutionize **scraping** tools.
10+
In today's data-intensive digital landscape, this library stands out by integrating **Large Language Models** (LLMs)
11+
and modular **graph-based** pipelines to automate the scraping of data from various sources (e.g., websites and local files).
1312

14-
Simply specify the information you need to extract, and ScrapeGraphAI handles the rest,
15-
providing a more flexible and low-maintenance solution compared to traditional scraping tools.
13+
Simply specify the information you need to extract, and ScrapeGraphAI handles the rest, providing a more **flexible** and **low-maintenance** solution compared to traditional scraping tools.
1614

1715
Why ScrapegraphAI?
1816
==================
@@ -21,17 +19,75 @@ Traditional web scraping tools often rely on fixed patterns or manual configurat
2119
ScrapegraphAI, leveraging the power of LLMs, adapts to changes in website structures, reducing the need for constant developer intervention.
2220
This flexibility ensures that scrapers remain functional even when website layouts change.
2321

24-
We support many Large Language Models (LLMs) including GPT, Gemini, Groq, Azure, Hugging Face etc.
25-
as well as local models which can run on your machine using Ollama.
22+
We support many LLMs including **GPT, Gemini, Groq, Azure, Hugging Face** etc.
23+
as well as local models which can run on your machine using **Ollama**.
2624

2725
Library Diagram
2826
===============
2927

30-
With ScrapegraphAI you first construct a pipeline of steps you want to execute by combining nodes into a graph.
31-
Executing the graph takes care of all the steps that are often part of scraping: fetching, parsing etc...
32-
Finally the scraped and processed data gets fed to an LLM which generates a response.
28+
With ScrapegraphAI you can use many already implemented scraping pipelines or create your own.
29+
30+
The diagram below illustrates the high-level architecture of ScrapeGraphAI:
3331

3432
.. image:: ../../assets/project_overview_diagram.png
3533
:align: center
3634
:width: 70%
3735
:alt: ScrapegraphAI Overview
36+
37+
FAQ
38+
===
39+
40+
1. **What is ScrapeGraphAI?**
41+
42+
ScrapeGraphAI is an open-source Python library that uses large language models (LLMs) and graph logic to automate the creation of scraping pipelines for websites and various document types.
43+
44+
2. **How does ScrapeGraphAI differ from traditional scraping tools?**
45+
46+
Traditional scraping tools rely on fixed patterns and manual configurations, whereas ScrapeGraphAI adapts to website structure changes using LLMs, reducing the need for constant developer intervention.
47+
48+
3. **Which LLMs are supported by ScrapeGraphAI?**
49+
50+
ScrapeGraphAI supports several LLMs, including GPT, Gemini, Groq, Azure, Hugging Face, and local models that can run on your machine using Ollama.
51+
52+
4. **Can ScrapeGraphAI handle different document formats?**
53+
54+
Yes, ScrapeGraphAI can scrape information from various document formats such as XML, HTML, JSON, and more.
55+
56+
5. **I get an empty or incorrect output when scraping a website. What should I do?**
57+
58+
There can be several reasons for this issue, but in most cases you can try the following:
59+
60+
- Set the `headless` parameter to `False` in the graph_config. Some JavaScript-heavy websites might require it.
61+
62+
- Check your internet connection. A slow or unstable connection can prevent the HTML from loading properly.
63+
64+
- Try using a proxy server to mask your IP address. Check out the :ref:`Proxy` section for more information on how to configure proxy settings.
65+
66+
- Use a different LLM. Some models might perform better on certain websites than others.
67+
68+
- Set the `verbose` parameter to `True` in the graph_config to see more detailed logs.
69+
70+
- Visualize the pipeline graphically using :ref:`Burr`.
71+
72+
If the issue persists, please report it on the GitHub repository.
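As a rough illustration of the troubleshooting settings listed above, a ``graph_config`` dictionary might look like the sketch below. Only the `headless` and `verbose` keys come from the list; the `llm` entry, its model name, and the API key placeholder are assumptions added for illustration and should be adapted to your provider.

```python
# Hypothetical troubleshooting configuration (a sketch, not an official template).
graph_config = {
    "llm": {
        "api_key": "YOUR_API_KEY",  # placeholder, not a real key
        "model": "gpt-3.5-turbo",   # assumed model name, for illustration only
    },
    "headless": False,  # run the browser with a visible window; may help on JavaScript-heavy sites
    "verbose": True,    # print more detailed logs while the graph runs
}
```

Once the output looks right, `headless` and `verbose` can be switched back to the values you normally use.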
73+
74+
6. **How does ScrapeGraphAI handle the context window limit of LLMs?**
75+
76+
ScrapeGraphAI splits large websites and documents into overlapping chunks and applies compression techniques to reduce the number of tokens. When there are multiple chunks, each one yields its own answer to the user prompt, so the answers are merged in the last step of the scraping pipeline.
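The chunk-and-merge idea described in this answer can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the library's implementation: the function names `split_with_overlap` and `merge_answers` are hypothetical, whitespace-separated words stand in for LLM tokens, and the merge step is a naive de-duplication rather than whatever the real pipeline does.

```python
def split_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into chunks of at most `chunk_size` tokens,
    where consecutive chunks share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the input
    return chunks


def merge_answers(answers):
    """Naive merge (hypothetical): keep unique, non-empty answers in order."""
    merged = []
    for answer in answers:
        if answer and answer not in merged:
            merged.append(answer)
    return merged


# Words stand in for tokens in this toy example.
tokens = "a b c d e f g h i j".split()
chunks = split_with_overlap(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap means that content falling on a chunk boundary still appears whole in at least one chunk, at the cost of processing a few tokens twice.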
77+
78+
7. **How can I contribute to ScrapeGraphAI?**
79+
80+
You can contribute to ScrapeGraphAI by submitting bug reports, feature requests, or pull requests on the GitHub repository. Join our `Discord <https://discord.gg/uJN7TYcpNa>`_ community and follow us on social media!
81+
82+
Sponsors
83+
========
84+
85+
.. image:: ../../assets/serp_api_logo.png
86+
:width: 10%
87+
:alt: Serp API
88+
:target: https://serpapi.com?utm_source=scrapegraphai
89+
90+
.. image:: ../../assets/transparent_stat.png
91+
:width: 15%
92+
:alt: Stat Proxies
93+
:target: https://dashboard.statproxies.com/?refferal=scrapegraph

docs/source/scrapers/graph_config.rst

Lines changed: 4 additions & 0 deletions
@@ -14,6 +14,8 @@ Some interesting ones are:
1414
- `burr_kwargs`: A dictionary with additional parameters to enable `Burr` graphical user interface.
1515
- `max_images`: The maximum number of images to be analyzed. Useful in `OmniScraperGraph` and `OmniSearchGraph`.
1616

17+
.. _Burr:
18+
1719
Burr Integration
1820
^^^^^^^^^^^^^^^^
1921

@@ -43,6 +45,8 @@ To log your graph execution in the platform, you need to set the `burr_kwargs` p
4345
}
4446
}
4547
48+
.. _Proxy:
49+
4650
Proxy Rotation
4751
^^^^^^^^^^^^^^
4852

requirements-dev.lock

Lines changed: 2 additions & 26 deletions
@@ -81,8 +81,6 @@ cycler==0.12.1
8181
dataclasses-json==0.6.6
8282
# via langchain
8383
# via langchain-community
84-
decorator==5.1.1
85-
# via ipython
8684
defusedxml==0.7.1
8785
# via langchain-anthropic
8886
distro==1.9.0
@@ -97,10 +95,7 @@ email-validator==2.1.1
9795
# via fastapi
9896
exceptiongroup==1.2.1
9997
# via anyio
100-
# via ipython
10198
# via pytest
102-
executing==2.0.1
103-
# via stack-data
10499
faiss-cpu==1.8.0
105100
# via scrapegraphai
106101
fastapi==0.111.0
@@ -119,7 +114,6 @@ free-proxy==1.1.1
119114
frozenlist==1.4.1
120115
# via aiohttp
121116
# via aiosignal
122-
fsspec==2024.5.0
123117
fsspec==2024.5.0
124118
# via huggingface-hub
125119
furo==2024.5.6
@@ -208,8 +202,6 @@ jmespath==1.0.1
208202
jsonpatch==1.33
209203
# via langchain
210204
# via langchain-core
211-
jsonpickle==3.0.4
212-
# via pyvis
213205
jsonpointer==2.4
214206
# via jsonpatch
215207
jsonschema==4.22.0
@@ -268,9 +260,6 @@ multidict==6.0.5
268260
# via yarl
269261
mypy-extensions==1.0.0
270262
# via typing-inspect
271-
networkx==3.3
272-
# via pyvis
273-
# via scrapegraphai
274263
numpy==1.26.4
275264
# via altair
276265
# via contourpy
@@ -312,8 +301,6 @@ playwright==1.43.0
312301
# via undetected-playwright
313302
pluggy==1.5.0
314303
# via pytest
315-
prompt-toolkit==3.0.43
316-
# via ipython
317304
proto-plus==1.23.0
318305
# via google-ai-generativelanguage
319306
# via google-api-core
@@ -354,8 +341,6 @@ pygments==2.18.0
354341
# via furo
355342
# via rich
356343
# via sphinx
357-
pygments==2.18.0
358-
# via ipython
359344
pyparsing==3.1.2
360345
# via httplib2
361346
# via matplotlib
@@ -373,8 +358,6 @@ python-multipart==0.0.9
373358
# via fastapi
374359
pytz==2024.1
375360
# via pandas
376-
pyvis==0.3.2
377-
# via scrapegraphai
378361
pyyaml==6.0.1
379362
# via huggingface-hub
380363
# via langchain
@@ -414,7 +397,6 @@ sf-hamilton==1.63.0
414397
shellingham==1.5.4
415398
# via typer
416399
six==1.16.0
417-
# via asttokens
418400
# via python-dateutil
419401
smmap==5.0.1
420402
# via gitdb
@@ -453,8 +435,6 @@ starlette==0.37.2
453435
# via fastapi
454436
streamlit==1.34.0
455437
# via burr
456-
stack-data==0.6.3
457-
# via ipython
458438
tenacity==8.3.0
459439
# via langchain
460440
# via langchain-community
@@ -480,9 +460,6 @@ tqdm==4.66.4
480460
# via scrapegraphai
481461
typer==0.12.3
482462
# via fastapi-cli
483-
traitlets==5.14.3
484-
# via ipython
485-
# via matplotlib-inline
486463
typing-extensions==4.11.0
487464
# via altair
488465
# via anthropic
@@ -492,7 +469,6 @@ typing-extensions==4.11.0
492469
# via google-generativeai
493470
# via groq
494471
# via huggingface-hub
495-
# via ipython
496472
# via openai
497473
# via pydantic
498474
# via pydantic-core
@@ -508,10 +484,10 @@ typing-inspect==0.9.0
508484
# via sf-hamilton
509485
tzdata==2024.1
510486
# via pandas
511-
undetected-playwright==0.3.0
512-
# via scrapegraphai
513487
ujson==5.10.0
514488
# via fastapi
489+
undetected-playwright==0.3.0
490+
# via scrapegraphai
515491
uritemplate==4.1.1
516492
# via google-api-python-client
517493
urllib3==2.2.1
