Describe the bug
Hello everyone,
I am trying to load mozilla-foundation/common_voice_11_0 in streaming mode, and it fails. Reproducer:
import datasets
datasets.load_dataset("mozilla-foundation/common_voice_11_0", "en", split="test", streaming=True, trust_remote_code=True)
It fails with the following traceback:
File ~/opt/envs/.../lib/python3.10/site-packages/datasets/utils/file_utils.py:827, in _add_retries_to_file_obj_read_method.<locals>.read_with_retries(*args, **kwargs)
825 for retry in range(1, max_retries + 1):
826 try:
--> 827 out = read(*args, **kwargs)
828 break
829 except (
830 _AiohttpClientError,
831 asyncio.TimeoutError,
832 requests.exceptions.ConnectionError,
833 requests.exceptions.Timeout,
834 ) as err:
File /usr/lib/python3.10/codecs.py:322, in BufferedIncrementalDecoder.decode(self, input, final)
319 def decode(self, input, final=False):
320 # decode input (taking the buffer into account)
321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
323 # keep undecoded input until the next call
324 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
When I remove streaming=True, everything works, but I need streaming.
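One observation that may help with triage: 0x1f 0x8b are the two magic bytes that open every gzip stream, so "byte 0x8b in position 1" is the error you would get when gzip-compressed bytes are decoded as UTF-8 text. The issue does not confirm that this is what happens inside the streaming code path, so treat it as a hypothesis. The sketch below reproduces the same message offline with the standard library only; the payload and the `looks_like_gzip` helper are hypothetical and are not part of `datasets`.

```python
import gzip

# Hypothetical payload standing in for a compressed text file;
# the real file contents are not known from the report.
payload = gzip.compress(b"col_a\tcol_b\n")

# Every gzip stream starts with the magic bytes 0x1f 0x8b.
magic = payload[:2]

# Decoding the compressed bytes as UTF-8 fails on the second byte:
# 0x1f is a valid ASCII control character, 0x8b is not a valid start byte.
try:
    payload.decode("utf-8")
    message = ""
except UnicodeDecodeError as err:
    message = str(err)

print(message)


def looks_like_gzip(head: bytes) -> bool:
    """Hypothetical helper: True if the bytes begin with the gzip magic number."""
    return head[:2] == b"\x1f\x8b"
```

If the hypothesis holds, checking the first two bytes of whichever file the loader opens as text should return True from `looks_like_gzip`.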
Steps to reproduce the bug
import datasets
datasets.load_dataset("mozilla-foundation/common_voice_11_0", "en", split="test", streaming=True, trust_remote_code=True)
Expected behavior
I expected the dataset to load in streaming mode without errors.
Environment info
datasets==3.6.0
Python 3.10
Reproduces on all platforms (Linux, Windows, macOS)