Basic Concepts

Audio Properties

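Sample Rate
The number of times per second the sound is sampled, expressed in hertz (Hz). It is analogous to the frame rate of moving pictures: the higher the sample rate, the more closely the samples follow the original waveform.
Bit Depth
The resolution with which each sample's amplitude is recorded, measured in bits; it determines the dynamic range that can be captured. The amplitude (Y) axis is divided into 2^n levels up to the maximum amplitude, where n is the bit depth.
Channel
The number of audio channels, for example a left channel and a right channel.
Bit Rate
Sample rate × bit depth × number of channels, i.e. the amount of data transmitted per unit of time. This formula applies to uncompressed (PCM) audio.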
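
As a quick check of the two formulas above, the following minimal sketch computes the number of quantization levels and the uncompressed bit rate. The function names are illustrative, and the CD-audio and telephony figures are only sample inputs.

```python
def quantization_levels(bit_depth: int) -> int:
    """Number of amplitude steps available at a given bit depth (2^n)."""
    return 2 ** bit_depth


def pcm_bit_rate(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Bits per second of uncompressed PCM audio."""
    return sample_rate_hz * bit_depth * channels


# Sample input: 44.1 kHz, 16-bit, stereo (CD audio).
print(quantization_levels(16))      # 65536
print(pcm_bit_rate(44_100, 16, 2))  # 1411200 bits per second

# Sample input: 8 kHz, 8-bit, mono (telephony).
print(pcm_bit_rate(8_000, 8, 1))    # 64000 bits per second
```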

packets/frames/chunks

A small segment of data, split off because of device or codec constraints.

codecs

An algorithm used to represent the data more efficiently. A codec usually consists of two parts: an encoder and a decoder.
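
To make the encoder/decoder split concrete, here is a toy sketch that uses run-length encoding as a stand-in codec. It is not how any real audio or video codec works; it only shows that the encoder produces a more compact representation and the decoder reverses it. All names are illustrative.

```python
def rle_encode(samples: bytes) -> list[tuple[int, int]]:
    """Encoder: collapse runs of equal values into (value, count) pairs."""
    runs: list[tuple[int, int]] = []
    for value in samples:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs


def rle_decode(runs: list[tuple[int, int]]) -> bytes:
    """Decoder: expand (value, count) pairs back into the raw samples."""
    return bytes(value for value, count in runs for _ in range(count))


raw = bytes([0, 0, 0, 0, 7, 7, 255])
encoded = rle_encode(raw)
print(encoded)                     # [(0, 4), (7, 2), (255, 1)]
print(rle_decode(encoded) == raw)  # True: this toy codec is lossless
```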

Encoding

The process of converting a raw media signal into a binary stream in a codec's format, for example encoding a series of raw images with the H.264 video codec. Encoding can also refer to converting a very high quality raw video file into a mezzanine format for simpler sharing and transmission. Example: an uncompressed 16-bit RGB frame has a size of 12.4 MB, so 60 seconds at 24 frames/sec (1,440 frames) totals 17.9 GB. Converting it to 8-bit frames with a size of 3.11 MB each brings the same 60 seconds to about 4.5 GB (1,440 × 3.11 MB), reducing the size of the video file by roughly 13.4 GB.
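
The per-frame sizes quoted above can be reproduced with a short calculation. This is a sketch under two assumptions that the text does not state: a 1920×1080 frame, and 4:2:0 chroma subsampling (1.5 samples per pixel) for the 8-bit frames. Both were chosen only because they yield the quoted 12.4 MB and 3.11 MB; sizes use decimal units (1 MB = 10^6 bytes).

```python
def frame_bytes(width: int, height: int, bits_per_sample: int,
                samples_per_pixel: float) -> int:
    """Size in bytes of one uncompressed frame."""
    return round(width * height * samples_per_pixel * bits_per_sample / 8)


WIDTH, HEIGHT = 1920, 1080  # assumed resolution, not given in the text
FRAMES = 60 * 24            # 60 seconds at 24 frames/sec

rgb16 = frame_bytes(WIDTH, HEIGHT, 16, 3)       # 16-bit RGB, 3 samples per pixel
yuv420_8 = frame_bytes(WIDTH, HEIGHT, 8, 1.5)   # assumed 8-bit 4:2:0 layout

print(rgb16, yuv420_8)                    # 12441600 3110400
print(rgb16 * FRAMES / 1e9)               # 17.915904 (GB)
print(yuv420_8 * FRAMES / 1e9)            # 4.478976 (GB)
print((rgb16 - yuv420_8) * FRAMES / 1e9)  # 13.436928 (GB saved)
```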

Decoding

The opposite of encoding: decoding converts encoded binary data back into raw media signals, for example an H.264 stream into viewable images.

Transcoding

The process of converting one codec to another (or to the same) codec. Both decoding and encoding are necessary steps: the source codec stream is decoded and then encoded again into a new target codec stream. Because encoding is typically lossy, a transcode cannot restore detail that the source has already lost; additional techniques such as frame interpolation and upscaling may improve how the result looks, but they do not recover the original information.

Muxing

The process of adding one or more codec streams to a container format.

Demuxing

Extracting a codec stream from a container format.

Transmuxing

Extracting streams from one container format and putting them in a different (or the same) container format.

Multiplexing

The process of interleaving audio and video into one data stream. Example: the elementary streams (audio and video) from the encoder are turned into Packetized Elementary Streams (PES), which are then converted into Transport Streams (TS).

Demultiplexing

The reverse operation of multiplexing. This means extracting an elementary stream from a media container. E.g.: Extracting the mp3 audio data from an mp4 music video.

In-Band Events

This refers to metadata events that are associated with a specific timestamp. This usually means that these events are synchronized with video and audio streams. E.g.: These events can be used to trigger dynamic content replacement (ad-insertion) or the presentation of supplemental content.

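Stream/Track

A single audio or video stream. For efficiency, it usually holds data that has already been encoded by a codec.

Container

What is commonly called an audio/video file format, such as avi, wav, or mp4. A container generally holds one or more streams.

Format

The on-disk layout of a container.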

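MUX and DEMUX

Mux is short for Multiplex, literally "multi-path transmission". In practice it means "mixing streams" or "packaging": combining video material and audio material into a single file.

Purpose:
Muxing bundles a video stream, an audio stream, and even a subtitle stream into a single file so that they can be transmitted as one signal. Once transmission is complete, demuxing splits the video, audio, and subtitles apart again so that each can be decoded and played.

Key point:
Neither muxing nor demuxing re-encodes the original video, audio, or subtitles. Demuxing a muxed file yields separate video, audio, and subtitle files that are identical to the original material.

Muxing packs multiple streams into one container; demuxing separates the streams in a container from one another.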
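
The key point, that muxing and demuxing only rearrange packets and never touch the encoded payloads, can be sketched with a toy in-memory "container". The `Packet` type and the `mux`/`demux` functions below are hypothetical illustrations, not the API or layout of any real container format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    stream_id: str      # which stream the packet belongs to
    timestamp_ms: int   # presentation time of the packet
    payload: bytes      # already-encoded data, never modified here


def mux(*streams: list[Packet]) -> list[Packet]:
    """Interleave the packets of several streams in timestamp order."""
    merged = [packet for stream in streams for packet in stream]
    # sorted() is stable, so packets with equal timestamps keep their input order.
    return sorted(merged, key=lambda packet: packet.timestamp_ms)


def demux(container: list[Packet]) -> dict[str, list[Packet]]:
    """Split an interleaved packet list back into one list per stream."""
    streams: dict[str, list[Packet]] = {}
    for packet in container:
        streams.setdefault(packet.stream_id, []).append(packet)
    return streams


video = [Packet("video", 0, b"V0"), Packet("video", 40, b"V1")]
audio = [Packet("audio", 0, b"A0"), Packet("audio", 20, b"A1"),
         Packet("audio", 40, b"A2")]

container = mux(video, audio)
separated = demux(container)
print([p.payload for p in container])  # [b'V0', b'A0', b'A1', b'V1', b'A2']
print(separated["video"] == video and separated["audio"] == audio)  # True
```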

Filter

A filter is a defined rule for transforming data. For example, a horizontal-flip filter mirrors the image horizontally.
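
A minimal sketch of the horizontal-flip example, treating an image as a list of pixel rows. The `hflip` name and the nested-list image representation are assumptions made for illustration; real filter frameworks operate on decoded frames.

```python
def hflip(image: list[list[int]]) -> list[list[int]]:
    """Return a copy of the image with every row reversed left to right."""
    return [row[::-1] for row in image]


image = [
    [1, 2, 3],
    [4, 5, 6],
]
print(hflip(image))                  # [[3, 2, 1], [6, 5, 4]]
print(hflip(hflip(image)) == image)  # True: flipping twice restores the image
```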

RTP Timestamp Calculation

The RTP timestamp is an important field in the RTP header. It is used to place each packet at the right point in time for playback and to synchronize audio and video packets. This section shows how RTP timestamps are calculated.

The calculation involves the two parameters explained below.

Packetization time - The duration of media carried in one RTP packet, in milliseconds. For example, with G.711 one RTP packet may carry 20 milliseconds of audio. Packetization times other than 20 ms are also valid.
Sampling rate - The number of analog samples taken per second for conversion to digital form. In a typical G.711 case the sampling rate is 8 kHz, so 8000 analog samples are taken per second. The higher the sampling rate, the better the quality.

Audio RTP Timestamps

The initial audio RTP timestamp can be a random value. For each successive audio RTP packet, the timestamp should be incremented by sampling rate / packets per second. Consider a case where the sampling rate is 8 kHz and the packetization time is 20 ms.

One frame corresponds to 20ms
For 1 second, there will be 1000ms / 20ms = 50 frames

Audio RTP packet timestamp incremental value = 8kHz / 50 = 8000Hz / 50 = 160.
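
The audio calculation above can be sketched as follows. The function names and the initial value 1000 are illustrative; the modulo reflects the fact that the RTP timestamp is a 32-bit field and wraps around.

```python
RTP_TIMESTAMP_MODULUS = 2 ** 32  # the RTP timestamp is a 32-bit field


def timestamp_increment(sampling_rate_hz: int, packet_ms: int) -> int:
    """Timestamp units covered by one packet (samples per packet)."""
    return sampling_rate_hz * packet_ms // 1000


def audio_timestamps(initial: int, sampling_rate_hz: int,
                     packet_ms: int, count: int) -> list[int]:
    """Timestamps of `count` consecutive audio packets, with wrap-around."""
    step = timestamp_increment(sampling_rate_hz, packet_ms)
    return [(initial + i * step) % RTP_TIMESTAMP_MODULUS for i in range(count)]


print(timestamp_increment(8000, 20))        # 160
print(audio_timestamps(1000, 8000, 20, 4))  # [1000, 1160, 1320, 1480]
```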

Video RTP Timestamps

Video is typically 30 or 24 frames per second. Consider a typical case where the RTP clock rate is 90 kHz and the frame rate is 30 fps.

Then the video RTP timestamp increment = 90 kHz / 30 = 90,000 Hz / 30 = 3000.
Hence the timestamp of each successive video frame should be incremented by 3000.

In practice, one video frame may be sent as more than one RTP packet because of its size. If a video frame is sent as 3 RTP packets, all 3 packets must carry the same timestamp. For the next video frame, the timestamp increases by 3000.
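
A minimal sketch of the video case, including frames that are split across several RTP packets. The function name and the packet counts per frame are made up for illustration.

```python
RTP_TIMESTAMP_MODULUS = 2 ** 32  # the RTP timestamp is a 32-bit field


def video_packet_timestamps(initial: int, clock_rate_hz: int, fps: int,
                            packets_per_frame: list[int]) -> list[int]:
    """Timestamp of every RTP packet; packets of one frame share a timestamp."""
    step = clock_rate_hz // fps  # 90000 // 30 == 3000
    timestamps: list[int] = []
    for frame_index, packet_count in enumerate(packets_per_frame):
        frame_ts = (initial + frame_index * step) % RTP_TIMESTAMP_MODULUS
        timestamps.extend([frame_ts] * packet_count)
    return timestamps


# Three frames sent as 3, 1, and 2 packets respectively.
print(video_packet_timestamps(0, 90_000, 30, [3, 1, 2]))
# [0, 0, 0, 3000, 6000, 6000]
```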

If the frame rate is not known, the timestamp can be derived from the system clock instead.
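
One way to derive a timestamp from the system clock is to scale the elapsed time since the first frame by the clock rate. This is a sketch under assumptions: the 90 kHz default, the function name, and the use of `time.monotonic()` as the clock source are choices made for illustration.

```python
import time

RTP_TIMESTAMP_MODULUS = 2 ** 32  # the RTP timestamp is a 32-bit field


def timestamp_from_clock(elapsed_seconds: float, clock_rate_hz: int = 90_000,
                         initial: int = 0) -> int:
    """Convert elapsed wall-clock time into RTP timestamp units."""
    ticks = round(elapsed_seconds * clock_rate_hz)
    return (initial + ticks) % RTP_TIMESTAMP_MODULUS


start = time.monotonic()  # taken when the first frame is captured
# ... later, when another frame is captured:
frame_ts = timestamp_from_clock(time.monotonic() - start)

print(timestamp_from_clock(0.5))  # 45000: half a second at 90 kHz
```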

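Additional Notes

Container, format, mux, and demux are different aspects of the same thing. A container is a logical concept; a format is the concrete layout of that concept on disk or in transit; combining multiple streams into a container is muxing; and extracting the individual streams from a container is demuxing.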

