I'm trying to do the same.
As far as I understand, we have to use the Video Toolbox APIs (VTDecompressionSessionSetMultiImageCallback) to get the second frame.
So far I haven't figured out which pixel format to use when creating an AVAssetReaderTrackOutput instance, or what the matching CMFormatDescription for the VTDecompressionSession should be.
I've only done simple AVFoundation transcoding in the past, so I'm not even sure I'm on the right track :)
Cheers!