Lib/tarfile.py (11 additions, 9 deletions)

@@ -525,7 +525,7 @@ def read(self, size=None):
                 if not buf:
                     break
                 t.append(buf)
-            buf = "".join(t)
+            buf = b"".join(t)
Member:

nice catch.

Contributor Author:

It never caused a problem, since this line is never reached: size is never None in any call of the function. But still, it should be fixed, I guess.
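
A minimal sketch of why the old line was a latent bug all the same: t collects bytes chunks, and in Python 3 joining bytes with a str separator raises a TypeError (the chunk values below are made up for illustration):

    t = [b"chunk1", b"chunk2"]   # _read() returns bytes chunks
    b"".join(t)                  # OK: b'chunk1chunk2'
    "".join(t)                   # TypeError: sequence item 0: expected str instance, bytes found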

         else:
             buf = self._read(size)
         self.pos += len(buf)
@@ -538,6 +538,7 @@ def _read(self, size):
             return self.__read(size)

         c = len(self.dbuf)
+        t = [self.dbuf]
         while c < size:
             buf = self.__read(self.bufsize)
             if not buf:
Member:

How about bypassing self.__read()?

    while c < size:
        if self.buf:
            buf = self.buf
            self.buf = b""
        else:
            buf = self.fileobj.read(self.bufsize)
        if not buf:
            break
Contributor Author:

For compressed streams your suggestion works. However, there is one case where self.__read() is called with a large size: when you open the stream uncompressed with "r|". For this special case, you still need the optimization in self.__read().

cpython/Lib/tarfile.py, lines 537 to 538 at 3c45240:

        if self.comptype == "tar":
            return self.__read(size)
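
To illustrate the case described above, a hedged sketch (the archive name is hypothetical): opening a plain tar over a stream with mode "r|" routes reads through _Stream, and since comptype == "tar", _read() delegates straight to __read(), which can then be asked for a large size:

    import tarfile

    # "r|" forces streaming reads through _Stream; for an uncompressed
    # archive, _read() hands the request straight to __read().
    with open("archive.tar", "rb") as f:              # hypothetical file
        with tarfile.open(fileobj=f, mode="r|") as tar:
            for member in tar:
                if member.isfile():
                    # May reach __read() with a size as large as the member.
                    data = tar.extractfile(member).read()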

Member:

Regardless of the optimization in self.__read(), the double (and unaligned) buffering is useless.

Contributor Author:

Yes, I agree. Handling the different cases of decompressing versus direct read is a bit twisted. So far, this is just a minimal fix for the performance regression.

Member:

You're right.
To backport this, fix only the repeated bytes += bytes and keep the other behavior as-is.
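
For reference, an illustrative sketch of the quadratic pattern being discussed (chunk size, count, and timings are arbitrary): each bytes += bytes copies the entire accumulated buffer, so N appends copy O(N^2) bytes, while collecting chunks in a list and joining once is linear:

    import time

    N, CHUNK = 20_000, b"x" * 512

    # Quadratic: every += copies the whole buffer built so far.
    start = time.perf_counter()
    buf = b""
    for _ in range(N):
        buf += CHUNK
    print("bytes +=:", time.perf_counter() - start)

    # Linear: append chunks, join once at the end (the approach in this PR).
    start = time.perf_counter()
    t = []
    for _ in range(N):
        t.append(CHUNK)
    buf = b"".join(t)
    print("join:    ", time.perf_counter() - start)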

@@ -546,26 +547,27 @@ def _read(self, size):
                 buf = self.cmp.decompress(buf)
             except self.exception:
                 raise ReadError("invalid compressed data")
-            self.dbuf += buf
+            t.append(buf)
             c += len(buf)
-        buf = self.dbuf[:size]
-        self.dbuf = self.dbuf[size:]
-        return buf
+        t = b"".join(t)
+        self.dbuf = t[size:]
+        return t[:size]

     def __read(self, size):
         """Return size bytes from stream. If internal buffer is empty,
            read another block from the stream.
         """
         c = len(self.buf)
+        t = [self.buf]
Member:

I don't think this optimization is needed.
In all callers of __read(), size is small or equal to bufsize, so the while c < size: loop doesn't actually iterate twice.

Member:

For a large size, a chunked read is not needed anyway.

    rem = size - len(self.buf)
    if rem > 0:
        self.buf += self.fileobj.read(max(self.bufsize, rem))
    t = self.buf[:size]
    self.buf = self.buf[size:]
    return t
         while c < size:
             buf = self.fileobj.read(self.bufsize)
             if not buf:
                 break
-            self.buf += buf
+            t.append(buf)
             c += len(buf)
-        buf = self.buf[:size]
-        self.buf = self.buf[size:]
-        return buf
+        t = b"".join(t)
+        self.buf = t[size:]
+        return t[:size]
 # class _Stream

 class _StreamProxy(object):