The code that determines whether all chunks have been read has an off-by-one error (OBOE) that causes not all bytes to be read when the chunk contains exactly 1 byte more than the number of bytes read after the copy:
if g.cIdx >= g.rChunk.size-1 { // chunk complete
	g.sp.give <- g.rChunk.b
	g.chunkID++
	g.rChunk = nil
}
This occurs when the chunk size is 1 byte more than a multiple of the default buffer size (32KiB, 32 * 1024) that io.Copy uses. When the copy exits, g.cIdx is updated with the number of bytes read, which makes the variable a 1-based count (as opposed to a 0-based index into the buffer). The chunk size is a length, which is also 1-based, so there is no need to offset the comparison by 1.
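Since both values are 1-based, the fix is to drop the -1; a sketch of the corrected check:

if g.cIdx >= g.rChunk.size { // chunk complete: all bytes consumed
	g.sp.give <- g.rChunk.b
	g.chunkID++
	g.rChunk = nil
}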
When this is encountered, the goroutine ends up in an infinite select: on the next iteration of the loop it checks whether g.bytesRead == g.contentLen, which is false because 1 byte remains in the chunk, so it proceeds to call g.nextChunk(), since g.rChunk was cleared out earlier. The goroutine then blocks on the select in nextChunk(), and no more chunks are coming.
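The shape of the hang can be shown with a minimal, self-contained sketch (illustrative only, not the library's actual nextChunk() code):

package main

// A select with a single receive case and no timeout blocks forever
// once no sender remains; the runtime eventually reports
// "all goroutines are asleep - deadlock!".
func main() {
	readCh := make(chan []byte) // no worker will ever send

	select {
	case b := <-readCh:
		_ = b // never reached
	}
}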
Goroutine trace from pprof:
goroutine 43 [select, 83 minutes]:
github.com/rlmcpherson/s3gof3r.(*getter).nextChunk(0xc2083705a0, 0x44c2d4, 0x0, 0x0)
/go/src/github.com/rlmcpherson/s3gof3r/getter.go:244 +0x2d9
github.com/rlmcpherson/s3gof3r.(*getter).Read(0xc2083705a0, 0xc21b1e0000, 0x8000, 0x8000, 0x8000, 0x0, 0x0)
/go/src/github.com/rlmcpherson/s3gof3r/getter.go:207 +0x182
io.Copy(0x7feef9a5e5a8, 0xc208064800, 0x7feef9a6e320, 0xc2083705a0, 0x180000, 0x0, 0x0)
/usr/local/go/src/pkg/io/io.go:353 +0x1f3
project/dbimport.(*bucketUtils).concat(0xc21b2071c0, 0xc208242000, 0x5a, 0x80, 0xc21b64ab40, 0x40, 0x0, 0x0)
/go/src/project/main.go:217 +0x7ef
project/dbimport.processS3Files(0xc2080ec1e0, 0xc2081ec660, 0xc2083ce2c0, 0x7fffc6ee6e1f, 0xf, 0xc20821ff50, 0x2a, 0xc208269200, 0x5a, 0x5a, ...)
/go/src/project/main.go:792 +0x17a3
project/dbimport.func·007(0xc2083cc2a0)
/go/src/project/main.go:622 +0x13b6
goroutine 93481 [select]:
github.com/rlmcpherson/s3gof3r.func·002()
/go/src/github.com/rlmcpherson/s3gof3r/pool.go:42 +0x6cd
created by github.com/rlmcpherson/s3gof3r.bufferPool
/go/src/github.com/rlmcpherson/s3gof3r/pool.go:68 +0x15a
This can be reproduced with any file that is 1 byte larger than a multiple of 32KiB; in my case, 1081345 bytes (32 * 1024 * 33 + 1), with the default PartSize of 20MiB.
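For convenience, a minimal sketch for generating a file of the triggering size (the filename is arbitrary):

package main

import (
	"io/ioutil"
	"log"
)

func main() {
	// 1 byte past a multiple of the 32KiB io.Copy buffer:
	// 32*1024*33 + 1 = 1081345 bytes, the size from the logs below.
	data := make([]byte, 32*1024*33+1)
	if err := ioutil.WriteFile("repro.dat", data, 0644); err != nil {
		log.Fatal(err)
	}
}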
The last couple of iterations of the loop, with some debug output added:
...
s3gof3r2015/03/06 23:37:16 getter.go:203: -----loop start------
s3gof3r2015/03/06 23:37:16 getter.go:204: g.chunkTotal: 1
s3gof3r2015/03/06 23:37:16 getter.go:205: g.chunkID: 0
s3gof3r2015/03/06 23:37:16 getter.go:207: g.rChunk.id: 0
s3gof3r2015/03/06 23:37:16 getter.go:211: nw: 0
s3gof3r2015/03/06 23:37:16 getter.go:212: len(p): 32768
s3gof3r2015/03/06 23:37:16 getter.go:213: g.cIdx: 1015808
s3gof3r2015/03/06 23:37:16 getter.go:214: g.bytesRead: 1015808
s3gof3r2015/03/06 23:37:16 getter.go:215: g.contentLen: 1081345
s3gof3r2015/03/06 23:37:16 getter.go:228: g.cIdx: 1015808
s3gof3r2015/03/06 23:37:16 getter.go:229: g.rChunk.size: 1081345
s3gof3r2015/03/06 23:37:16 getter.go:230: len(p): 32768
s3gof3r2015/03/06 23:37:16 getter.go:232: len(p): 32768
s3gof3r2015/03/06 23:37:16 getter.go:233: n: 32768
s3gof3r2015/03/06 23:37:16 getter.go:234: bytesRead: 1015808
s3gof3r2015/03/06 23:37:16 getter.go:238: bytesRead: 1048576
s3gof3r2015/03/06 23:37:16 getter.go:244: g.chunkID: 0
s3gof3r2015/03/06 23:37:16 getter.go:246: g.rChunk.id: 0
s3gof3r2015/03/06 23:37:16 getter.go:250: -----loop end------
s3gof3r2015/03/06 23:37:16 getter.go:203: -----loop start------
s3gof3r2015/03/06 23:37:16 getter.go:204: g.chunkTotal: 1
s3gof3r2015/03/06 23:37:16 getter.go:205: g.chunkID: 0
s3gof3r2015/03/06 23:37:16 getter.go:207: g.rChunk.id: 0
s3gof3r2015/03/06 23:37:16 getter.go:211: nw: 0
s3gof3r2015/03/06 23:37:16 getter.go:212: len(p): 32768
s3gof3r2015/03/06 23:37:16 getter.go:213: g.cIdx: 1048576
s3gof3r2015/03/06 23:37:16 getter.go:214: g.bytesRead: 1048576
s3gof3r2015/03/06 23:37:16 getter.go:215: g.contentLen: 1081345
s3gof3r2015/03/06 23:37:16 getter.go:228: g.cIdx: 1048576
s3gof3r2015/03/06 23:37:16 getter.go:229: g.rChunk.size: 1081345
s3gof3r2015/03/06 23:37:16 getter.go:230: len(p): 32768
s3gof3r2015/03/06 23:37:16 getter.go:232: len(p): 32768
s3gof3r2015/03/06 23:37:16 getter.go:233: n: 32768
s3gof3r2015/03/06 23:37:16 getter.go:234: bytesRead: 1048576
s3gof3r2015/03/06 23:37:16 getter.go:238: bytesRead: 1081344
s3gof3r2015/03/06 23:37:16 getter.go:244: g.chunkID: 1
s3gof3r2015/03/06 23:37:16 getter.go:248: g.rChunk.id: nil
s3gof3r2015/03/06 23:37:16 getter.go:250: -----loop end------
s3gof3r2015/03/06 23:37:16 getter.go:203: -----loop start------
s3gof3r2015/03/06 23:37:16 getter.go:204: g.chunkTotal: 1
s3gof3r2015/03/06 23:37:16 getter.go:205: g.chunkID: 1
s3gof3r2015/03/06 23:37:16 getter.go:209: g.rChunk.id: nil
s3gof3r2015/03/06 23:37:16 getter.go:211: nw: 0
s3gof3r2015/03/06 23:37:16 getter.go:212: len(p): 32768
s3gof3r2015/03/06 23:37:16 getter.go:213: g.cIdx: 1081344
s3gof3r2015/03/06 23:37:16 getter.go:214: g.bytesRead: 1081344
s3gof3r2015/03/06 23:37:16 getter.go:215: g.contentLen: 1081345
s3gof3r2015/03/06 23:37:16 getter.go:271: ------nextChunk select------
This results in an infinite select loop. The file on disk is missing 1 byte (the last byte of the file).
This can also be reproduced by using any PartSize that is a multiple of 32KiB plus 1 byte. Below is an example of a file that succeeded with the default config of 20MiB PartSize but fails with a PartSize of 131073 (32 * 1024 * 4 + 1); it also shows a multipart download, which the example above did not. A reproduction sketch follows the log excerpt.
...
s3gof3r2015/03/06 23:55:15 getter.go:203: -----loop start------
s3gof3r2015/03/06 23:55:15 getter.go:204: g.chunkTotal: 10
s3gof3r2015/03/06 23:55:15 getter.go:205: g.chunkID: 9
s3gof3r2015/03/06 23:55:15 getter.go:207: g.rChunk.id: 9
s3gof3r2015/03/06 23:55:15 getter.go:211: nw: 0
s3gof3r2015/03/06 23:55:15 getter.go:212: len(p): 32768
s3gof3r2015/03/06 23:55:15 getter.go:213: g.cIdx: 32768
s3gof3r2015/03/06 23:55:15 getter.go:214: g.bytesRead: 1212416
s3gof3r2015/03/06 23:55:15 getter.go:215: g.contentLen: 1275093
s3gof3r2015/03/06 23:55:15 getter.go:228: g.cIdx: 32768
s3gof3r2015/03/06 23:55:15 getter.go:229: g.rChunk.size: 95436
s3gof3r2015/03/06 23:55:15 getter.go:230: len(p): 32768
s3gof3r2015/03/06 23:55:15 getter.go:232: len(p): 32768
s3gof3r2015/03/06 23:55:15 getter.go:233: n: 32768
s3gof3r2015/03/06 23:55:15 getter.go:234: bytesRead: 1212416
s3gof3r2015/03/06 23:55:15 getter.go:238: bytesRead: 1245184
s3gof3r2015/03/06 23:55:15 getter.go:244: g.chunkID: 9
s3gof3r2015/03/06 23:55:15 getter.go:246: g.rChunk.id: 9
s3gof3r2015/03/06 23:55:15 getter.go:250: -----loop end------
s3gof3r2015/03/06 23:55:15 getter.go:203: -----loop start------
s3gof3r2015/03/06 23:55:15 getter.go:204: g.chunkTotal: 10
s3gof3r2015/03/06 23:55:15 getter.go:205: g.chunkID: 9
s3gof3r2015/03/06 23:55:15 getter.go:207: g.rChunk.id: 9
s3gof3r2015/03/06 23:55:15 getter.go:211: nw: 0
s3gof3r2015/03/06 23:55:15 getter.go:212: len(p): 32768
s3gof3r2015/03/06 23:55:15 getter.go:213: g.cIdx: 65536
s3gof3r2015/03/06 23:55:15 getter.go:214: g.bytesRead: 1245184
s3gof3r2015/03/06 23:55:15 getter.go:215: g.contentLen: 1275093
s3gof3r2015/03/06 23:55:15 getter.go:228: g.cIdx: 65536
s3gof3r2015/03/06 23:55:15 getter.go:229: g.rChunk.size: 95436
s3gof3r2015/03/06 23:55:15 getter.go:230: len(p): 32768
s3gof3r2015/03/06 23:55:15 getter.go:232: len(p): 32768
s3gof3r2015/03/06 23:55:15 getter.go:233: n: 29900
s3gof3r2015/03/06 23:55:15 getter.go:234: bytesRead: 1245184
s3gof3r2015/03/06 23:55:15 getter.go:238: bytesRead: 1275084
s3gof3r2015/03/06 23:55:15 getter.go:244: g.chunkID: 10
s3gof3r2015/03/06 23:55:15 getter.go:248: g.rChunk.id: nil
s3gof3r2015/03/06 23:55:15 getter.go:250: -----loop end------
s3gof3r2015/03/06 23:55:15 getter.go:203: -----loop start------
s3gof3r2015/03/06 23:55:15 getter.go:204: g.chunkTotal: 10
s3gof3r2015/03/06 23:55:15 getter.go:205: g.chunkID: 10
s3gof3r2015/03/06 23:55:15 getter.go:209: g.rChunk.id: nil
s3gof3r2015/03/06 23:55:15 getter.go:211: nw: 29900
s3gof3r2015/03/06 23:55:15 getter.go:212: len(p): 32768
s3gof3r2015/03/06 23:55:15 getter.go:213: g.cIdx: 95436
s3gof3r2015/03/06 23:55:15 getter.go:214: g.bytesRead: 1275084
s3gof3r2015/03/06 23:55:15 getter.go:215: g.contentLen: 1275093
s3gof3r2015/03/06 23:55:15 getter.go:271: ------nextChunk select------
This results in an infinite select loop. The file on disk will be missing 1 byte for every chunk.
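For reference, the download side of the reproduction could look like the following sketch. It assumes the library's EnvKeys/New/Bucket/GetReader entry points; the bucket and key names are placeholders:

package main

import (
	"io"
	"io/ioutil"
	"log"

	"github.com/rlmcpherson/s3gof3r"
)

func main() {
	keys, err := s3gof3r.EnvKeys() // AWS credentials from the environment
	if err != nil {
		log.Fatal(err)
	}
	b := s3gof3r.New("", keys).Bucket("my-bucket") // placeholder bucket

	conf := *s3gof3r.DefaultConfig
	conf.PartSize = 32*1024*4 + 1 // 131073: a 32KiB multiple + 1 byte

	r, _, err := b.GetReader("my-key", &conf) // placeholder key
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()

	// With the off-by-one present, this copy never completes: the
	// getter blocks in nextChunk() waiting for a chunk that will
	// never arrive.
	if _, err := io.Copy(ioutil.Discard, r); err != nil {
		log.Fatal(err)
	}
}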
Pull request incoming which addresses the issue and adds a couple of guards:
- If the bytes read do not equal the content length and all chunks have been processed, return an error.
- If the bytes read are greater than the content length, return an error.
  - This should not occur, since Go's net/http wraps the response body in a LimitedReader capped at the content length, but it is included for completeness.
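Illustratively, the guards could take roughly this shape, modeling the getter fields from the debug output with a minimal struct (a sketch, not the actual PR diff):

package main

import (
	"fmt"
	"io"
)

// Minimal model of the getter fields seen in the debug output; the
// real struct in getter.go has more fields.
type getter struct {
	bytesRead, contentLen int64
	chunkID, chunkTotal   int
}

func (g *getter) checkProgress() error {
	switch {
	case g.bytesRead == g.contentLen:
		return io.EOF // all bytes delivered
	case g.bytesRead > g.contentLen:
		// Should be impossible: net/http caps the body with a
		// LimitedReader, but guard for completeness.
		return fmt.Errorf("read %d bytes, more than content length %d",
			g.bytesRead, g.contentLen)
	case g.chunkID >= g.chunkTotal:
		// All chunks processed yet bytes are missing: fail loudly
		// instead of blocking forever in nextChunk().
		return fmt.Errorf("read %d of %d bytes with no chunks remaining",
			g.bytesRead, g.contentLen)
	}
	return nil
}

func main() {
	// The failing state from the first log excerpt above.
	g := &getter{bytesRead: 1081344, contentLen: 1081345, chunkID: 1, chunkTotal: 1}
	fmt.Println(g.checkProgress())
}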
For more robustness, there could be a timeout on the select in nextChunk(): if for any reason the routine gets in there when no more chunks are coming, it will block forever. We cannot rely on the underlying HTTP connection going away and triggering a close, because the data has already been read into memory. I did consider having the worker() function close g.readCh, but a receive on a closed channel fires immediately with a zero value, which would have to be distinguished from a real chunk (a range over the channel, by contrast, would simply terminate when it is closed). I have not thought this fully out, but I feel something can be done to signal that no more chunks will be arriving on the channel because the workers are gone.
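One possible shape: race the receive against a timeout and a quit channel that is closed when all workers exit. Both the timeout and the quit channel are hypothetical additions, not the library's current API:

package main

import (
	"fmt"
	"time"
)

// nextChunk sketch: fail instead of blocking forever once the
// workers are gone or the wait exceeds a deadline.
func nextChunk(readCh <-chan []byte, quit <-chan struct{}) ([]byte, error) {
	select {
	case b := <-readCh:
		return b, nil
	case <-quit:
		return nil, fmt.Errorf("no more chunks: all workers have exited")
	case <-time.After(100 * time.Millisecond): // tune for real use
		return nil, fmt.Errorf("timed out waiting for next chunk")
	}
}

func main() {
	readCh := make(chan []byte)
	quit := make(chan struct{})
	close(quit) // simulate: the workers are already gone
	_, err := nextChunk(readCh, quit)
	fmt.Println(err)
}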