AVSampleBufferDisplayLayer renders half of each frame when using high preset

I am using AVSampleBufferDisplayLayer to display video that is being streamed over the network. On the sending side an AVCaptureSession is used to capture CMSampleBuffers, which are serialized into NAL units and streamed to the receiver, which then turns them back into CMSampleBuffers and feeds them to an AVSampleBufferDisplayLayer (as described, for instance, here). It works quite well - I can see the video and it streams more or less smoothly.


If I set the capture session's sessionPreset to AVCaptureSessionPresetHigh the video shown on the receiving side is cut in half - the top half displays the video from the sender while the bottom half is a solid dark green. If I use any other preset (e.g. AVCaptureSessionPresetMedium or AVCaptureSessionPreset1280x720) the video displays in its entirety.


Has anyone encountered such an issue, or has any idea what might cause it?


I tried examining the data at the source as well as the data at the destination, to see if I could determine where the image is being chopped off, but I have not been successful. It occurred to me that perhaps the high-quality frame is being split into more than one NALU and I am not putting it back together correctly - is that possible? What does such splitting look like at the elementary-stream level (if it happens at all)?


Thank you

Amos

Accepted Reply

Try recording the source (simply save it to a file on your iOS device) and open the elementary stream with an H264 analyzer, or even with ffplay (I think it can open elementary streams directly).

It is possible that you are not copying the entire frame but only half of it (sometimes the iOS HW encoder uses 2 slices per frame).

If the recorded file is fine, then you may have network drops.

Replies


Thank you idrori, that put me on the right track - indeed, two slices were being used per frame and I was dropping the second one. If I assume the stream contains two slices per frame, it displays without distortion.


However, how can I tell by looking at a NALU whether it is just half of a frame or represents the whole frame? From reading the specification it looks like this information would be contained in the slice header, maybe in the first_mb_in_slice field, but I'm having trouble reading the slice header.


Here's what I'm seeing at a medium preset, frames coming in single slices:


#1 - type 5 NALU: 25b8201f c3c6b18f eb87ffe1 462fbb63 ...

#2 - type 1 NALU: 21e1084d ff0132a8 fac022df bc0d5ae0 ...

#3 - type 1 NALU: 21e2107f 23eb2a9e 45bcd5b4 f8102ae0 ...

#4 - type 1 NALU: 21e3184f ff213bc7 3bce4ac6 f6a7ad24 ...

#5 - type 1 NALU: 21e42046 ffb78f7a 10e03fc1 03fcbca6 ...


And at the high preset I see frames coming in two slices:


#1 - type 5 NALU (first half): 25b82017 ff3d756d 94ac5be1 c7bbe9d0 ...

#2 - type 5 NALU (second half): 25001fe2 e0805f18 ec4767ac d0274ce6

#3 - type 1 NALU (first half): 21e1084c ffdc52ee 56a75e6e 5bc7cc2e

#4 - type 1 NALU (second half): 21001fe3 842133ff 097c7015 0e16c346

#5 - type 1 NALU (first half): 21e21046 3fe41fd8 3a9f5443 29c54d7d

#6 - type 1 NALU (second half): 21001fe3 884118ff 00000bea f9778008


The first byte determines the NALU type, but is the second byte already the start of the slice header? How do I read first_mb_in_slice? I see from the spec that it is encoded in Exp-Golomb, which seems reasonable, but I couldn't quite figure the encoding out. Just from looking at the data, every second half of a slice has 00 as its second byte - but that's backwards: I'd like to see a marker on the first slice indicating whether I should expect a second slice or not. Does any such marker exist?


Thank you

All right, after some experimentation and reading the spec: in order to recognize whether two successive NAL units belong to the same frame, it is necessary to parse the slice header of the picture NAL units. This goes more or less like this: the first field of the slice header is first_mb_in_slice, which is encoded in Exp-Golomb encoding. Next come slice_type and pic_parameter_set_id, also in Exp-Golomb encoding, and finally frame_num, as an unsigned integer of (log2_max_frame_num_minus4 + 4) bits (to get the value of log2_max_frame_num_minus4 it is necessary to parse the SPS that the frame's PPS points to). If two consecutive NAL units have the same frame_num, they are part of the same frame and should be put into the same CMSampleBuffer.
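In code the parse looks more or less like the sketch below - the names are my own, and strictly speaking the emulation-prevention bytes (00 00 03) should be stripped from the payload first, which is omitted here because they rarely occur this early in the header:

```swift
import Foundation

// Minimal MSB-first bit reader over a NAL unit payload.
struct BitReader {
    let data: [UInt8]
    var bitIndex = 0

    mutating func readBit() -> UInt32 {
        let bit = (data[bitIndex >> 3] >> (7 - (bitIndex & 7))) & 1
        bitIndex += 1
        return UInt32(bit)
    }

    mutating func readBits(_ count: Int) -> UInt32 {
        var value: UInt32 = 0
        for _ in 0..<count { value = (value << 1) | readBit() }
        return value
    }

    // ue(v): count leading zero bits (n), skip the terminating 1 bit,
    // then the value is 2^n - 1 plus the next n bits.
    mutating func readUE() -> UInt32 {
        var leadingZeros = 0
        while readBit() == 0 { leadingZeros += 1 }
        return (UInt32(1) << leadingZeros) - 1 + readBits(leadingZeros)
    }
}

// Reads the first four slice-header fields of a VCL NAL unit (types 1 and 5).
// log2MaxFrameNum is log2_max_frame_num_minus4 + 4 from the stream's SPS.
func parseSliceHeader(nalUnit: [UInt8], log2MaxFrameNum: Int)
    -> (firstMbInSlice: UInt32, sliceType: UInt32, ppsId: UInt32, frameNum: UInt32) {
    var reader = BitReader(data: Array(nalUnit.dropFirst())) // skip the NAL header byte
    let firstMbInSlice = reader.readUE() // 0 means first slice of a picture
    let sliceType = reader.readUE()
    let ppsId = reader.readUE()
    let frameNum = reader.readBits(log2MaxFrameNum)
    return (firstMbInSlice, sliceType, ppsId, frameNum)
}
```

As a sanity check against the dumps above: in the first-slice NALUs the byte after the NAL header starts with a 1 bit (b8, e1, e2, ...), which decodes to first_mb_in_slice = 0. In the second-half NALUs the bytes after the NAL header are 00 1f e2/e3 - eleven leading zero bits, a one, and then the eleven bits 11111110001 - which decodes to 2^11 - 1 + 2033 = 4080. If the high preset here is 1920x1080 (coded as 120x68 = 8160 macroblocks), that is exactly the halfway point of the frame. So a non-zero first_mb_in_slice is the continuation marker; there is no flag on the first slice announcing that more slices will follow.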

Hey amosasp,


I'm not sure that detecting new access units is just a matter of checking the frame number. The spec has a good paragraph that details this: "7.4.1.2.4 Detection of the first VCL NAL unit of a primary coded picture." I've implemented this and I'm correctly getting the access units from my H264 stream. However, the decoder is complaining and giving me error code -8969; if I'm correct, that's the "bad data" error.
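Simplified, the core of that check looks something like the sketch below. Note that the full clause compares more fields (pic_parameter_set_id, the field and IDR flags, the picture-order-count values), so treat this as a heuristic, not the complete test - the type and names are placeholders:

```swift
// The two slice-header fields this simplified check looks at.
struct SliceInfo {
    let firstMbInSlice: UInt32
    let frameNum: UInt32
}

// A VCL NAL unit is taken to open a new access unit when it is the first
// slice of a picture (first_mb_in_slice == 0) or when its frame_num
// differs from the previous slice's.
func startsNewPicture(previous: SliceInfo?, current: SliceInfo) -> Bool {
    guard let previous = previous else { return true }
    return current.firstMbInSlice == 0 || current.frameNum != previous.frameNum
}
```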


I'm wondering if someone knows, in general, what the HW decoder expects when I call VTDecompressionSessionDecodeFrame(). For example, let's say the encoder generates 4 NALs which make up the access unit/picture. In my case the first NAL has a 4-byte Annex B start code, and the 2nd, 3rd, and 4th NALs all have 3-byte start codes. Do I only rewrite the start code of the first NAL unit and replace it with the size (in big endian), or do I need to remove the start codes of all NALs, replace them with a 4-byte length prefix, then combine this data into one buffer and feed that?
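In code, the second option - stripping the start code from every NAL unit, 3- and 4-byte alike, and replacing each with a 4-byte big-endian length - would look something like this sketch (the function name is mine; the steps in a later reply suggest this is indeed the layout the decoder wants):

```swift
import Foundation

// Converts one Annex B access unit into the length-prefixed (AVCC) layout:
// each NAL unit's 3- or 4-byte start code is replaced by a 4-byte
// big-endian length prefix, and all NAL units are concatenated.
func annexBToAVCC(_ accessUnit: [UInt8]) -> Data {
    // Locate the payload range of every NAL unit between start codes.
    var nalRanges: [Range<Int>] = []
    var payloadStart: Int?
    var i = 0
    while i + 3 <= accessUnit.count {
        let fourByte = i + 4 <= accessUnit.count
            && accessUnit[i] == 0 && accessUnit[i + 1] == 0
            && accessUnit[i + 2] == 0 && accessUnit[i + 3] == 1
        let threeByte = accessUnit[i] == 0 && accessUnit[i + 1] == 0
            && accessUnit[i + 2] == 1
        if fourByte || threeByte {
            if let start = payloadStart { nalRanges.append(start..<i) }
            let codeLength = fourByte ? 4 : 3
            payloadStart = i + codeLength
            i += codeLength
        } else {
            i += 1
        }
    }
    if let start = payloadStart { nalRanges.append(start..<accessUnit.count) }

    // Re-emit each NAL unit with its big-endian length in place of the start code.
    var avcc = Data()
    for range in nalRanges {
        var length = UInt32(range.count).bigEndian
        avcc.append(Data(bytes: &length, count: 4))
        avcc.append(contentsOf: accessUnit[range])
    }
    return avcc
}
```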


Thanks

roxlu

Hi roxlu,

How did you combine the four NALUs? My encoder generates two NALUs at a time and I also need to combine them. And my encoded data looks the same as amosasp's. Thanks.

Hi idrori,

Is there a way to control the number of NALUs per frame after HW encoding? In my situation the HW encoder sometimes generates one NALU and other times two. When there are two NALUs for one frame, the screen shows blurred video after the transfer over RTMP, while one NALU per frame displays correctly.

Hi,

I know this is a bit old, but I've been fighting with this lately and finally made it work. So I want to share some knowledge. I hope it will save somebody some time, unlike in my case 😟.

My scenario is:

A computer on the network is streaming sliced H264 video with the ffmpeg realtime preset; each frame is sliced into several pieces (5 in my case) and transmitted over the network as raw NAL units. We are not using B-frames, so the frames arrive in decode order.

This is what I needed to do to make the decoder actually decode the video:

1. Gather all the NAL units for one frame into one CMBlockBuffer, each unit prepended with its 4-byte length, so the data looks like [4-byte slice1 size][slice1 data][4-byte slice2 size][slice2 data]...[4-byte slice5 size][slice5 data] (see the sketch after this list).

2. Wrap the block buffer in a CMSampleBuffer. Now this was the deal breaker for me - set the number of samples to 1; the sample size array doesn't matter at all, but the timing array must contain one CMSampleTimingInfo with the correct duration and presentation timestamp (that is a different story 🙂).

3. And feed it to the decompression session (or AVSampleBufferDisplayLayer in my case). That was it!!
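A minimal sketch of steps 1 and 2 - assuming the frame's slices arrive with start codes already removed, and that a CMVideoFormatDescription was created earlier from the stream's SPS/PPS (e.g. with CMVideoFormatDescriptionCreateFromH264ParameterSets); the function name is mine:

```swift
import CoreMedia
import Foundation

// Packs all slices of one frame into a single CMSampleBuffer containing
// ONE sample, with each NAL unit prefixed by its 4-byte big-endian length.
func makeSampleBuffer(slices: [Data],
                      formatDescription: CMVideoFormatDescription,
                      timing: CMSampleTimingInfo) -> CMSampleBuffer? {
    // Step 1: one contiguous buffer - [4-byte size][slice][4-byte size][slice]...
    var framed = Data()
    for slice in slices {
        var length = UInt32(slice.count).bigEndian
        framed.append(Data(bytes: &length, count: 4))
        framed.append(slice)
    }

    // Copy it into a CMBlockBuffer.
    var blockBuffer: CMBlockBuffer?
    guard CMBlockBufferCreateWithMemoryBlock(
            allocator: kCFAllocatorDefault,
            memoryBlock: nil,            // let CoreMedia allocate the memory
            blockLength: framed.count,
            blockAllocator: kCFAllocatorDefault,
            customBlockSource: nil,
            offsetToData: 0,
            dataLength: framed.count,
            flags: 0,
            blockBufferOut: &blockBuffer) == kCMBlockBufferNoErr,
          let block = blockBuffer else { return nil }
    let copyStatus = framed.withUnsafeBytes { raw in
        CMBlockBufferReplaceDataBytes(with: raw.baseAddress!,
                                      blockBuffer: block,
                                      offsetIntoDestination: 0,
                                      dataLength: framed.count)
    }
    guard copyStatus == kCMBlockBufferNoErr else { return nil }

    // Step 2: one *sample* per frame, no matter how many slices it holds;
    // no sample size array, one timing entry with duration and PTS.
    var timingInfo = timing
    var sampleBuffer: CMSampleBuffer?
    guard CMSampleBufferCreateReady(
            allocator: kCFAllocatorDefault,
            dataBuffer: block,
            formatDescription: formatDescription,
            sampleCount: 1,
            sampleTimingEntryCount: 1,
            sampleTimingArray: &timingInfo,
            sampleSizeEntryCount: 0,
            sampleSizeArray: nil,
            sampleBufferOut: &sampleBuffer) == noErr else { return nil }
    return sampleBuffer
}
```

Step 3 is then just passing the result to the display layer's enqueue(_:) or to VTDecompressionSessionDecodeFrame.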

What I was doing wrong:

Among other things :-) the part with the slices of the video frame was that I wrongly assumed the number of samples is the number of slices in the CMBlockBuffer, which it is not.

Hope this helps someone.

Best

Thank you very much! This helped me too.


Especially older devices (iPhone 5s in my case) used to generate partial NALUs (each 1/5 of a frame), whereas newer devices generate one per frame.