Skip to content

langchain/document_loaders/web_base.py #8505

@tabee

Description

@tabee

System Info

langchain/document_loaders/web_base.py > works for me only when i change:

return await response.text() 

with:

body = await response.read() return body.decode('utf-8', errors='ignore') 

otherwise:

der Code produziert leider einen Fehler:

/home/codespace/.py
thon/current/bin/python3 /workspaces/b3rn_zero_ai/notebooks/ignite_vectorstore.py
Fetching pages: 13%|###8 | 33/256 [00:03<00:19, 11.18it/s]Traceback (most recent call last):
File "/workspaces/b3rn_zero_ai/notebooks/ignite_vectorstore.py", line 68, in
documents = loader.load()
File "/home/codespace/.python/current/lib/python3.10/site-packages/langchain/document_loaders/sitemap.py", line 142, in load
results = self.scrape_all([el["loc"].strip() for el in els if "loc" in el])
File "/home/codespace/.python/current/lib/python3.10/site-packages/langchain/document_loaders/web_base.py", line 168, in scrape_all
results = asyncio.run(self.fetch_all(urls))
File "/home/codespace/.local/lib/python3.10/site-packages/nest_asyncio.py", line 35, in run
return loop.run_until_complete(task)
File "/home/codespace/.local/lib/python3.10/site-packages/nest_asyncio.py", line 90, in run_until_complete
return f.result()
File "/home/codespace/.python/current/lib/python3.10/asyncio/futures.py", line 201, in result
raise self._exception.with_traceback(self._exception_tb)
File "/home/codespace/.python/current/lib/python3.10/asyncio/tasks.py", line 232, in __step
result = coro.send(None)
File "/home/codespace/.python/current/lib/python3.10/site-packages/langchain/document_loaders/web_base.py", line 148, in fetch_all
return await tqdm_asyncio.gather(
File "/home/codespace/.python/current/lib/python3.10/site-packages/tqdm/asyncio.py", line 79, in gather
res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
File "/home/codespace/.python/current/lib/python3.10/site-packages/tqdm/asyncio.py", line 79, in
res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
File "/home/codespace/.python/current/lib/python3.10/asyncio/tasks.py", line 571, in _wait_for_one
return f.result() # May raise f.exception().
File "/home/codespace/.python/current/lib/python3.10/asyncio/futures.py", line 201, in result
raise self._exception.with_traceback(self._exception_tb)
File "/home/codespace/.python/current/lib/python3.10/asyncio/tasks.py", line 234, in __step
result = coro.throw(exc)
File "/home/codespace/.python/current/lib/python3.10/site-packages/tqdm/asyncio.py", line 76, in wrap_awaitable
return i, await f
File "/home/codespace/.python/current/lib/python3.10/asyncio/futures.py", line 285, in await
yield self # This tells Task to wait for completion.
File "/home/codespace/.python/current/lib/python3.10/asyncio/tasks.py", line 304, in __wakeup
future.result()
File "/home/codespace/.python/current/lib/python3.10/asyncio/futures.py", line 201, in result
raise self._exception.with_traceback(self._exception_tb)
File "/home/codespace/.python/current/lib/python3.10/asyncio/tasks.py", line 232, in __step
result = coro.send(None)
File "/home/codespace/.python/current/lib/python3.10/site-packages/langchain/document_loaders/web_base.py", line 136, in _fetch_with_rate_limit
return await self._fetch(url)
File "/home/codespace/.python/current/lib/python3.10/site-packages/langchain/document_loaders/web_base.py", line 120, in _fetch
return await response.text()
File "/home/codespace/.python/current/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 1086, in text
return self._body.decode( # type: ignore[no-any-return,union-attr]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 11: invalid start byte
Fetching pages: 15%|####4 | 38/256 [00:04<00:23, 9.25it/s]

Who can help?

No response

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

I tried to make embedding from a website in "french" language.

Expected behavior

we need a solution when : UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 11: invalid start byte

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugRelated to a bug, vulnerability, unexpected error with an existing feature

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions