Ogg Vorbis audio compression method. How audio compression works

MP3 audio compression format

MPEG-1 Audio Layer 3
File extension: .mp3
MIME type: audio/mpeg
Format type: Audio

MP3 (more precisely, MPEG-1/2/2.5 Layer 3, not MPEG-3: it is the third audio coding layer defined by the MPEG standards) is a licensed file format for storing audio information.

MP3 is currently the best known and most popular of the common lossy digital audio formats. It is widely used in file-sharing networks to exchange music for evaluation. The format can be played on almost any popular operating system and on almost any portable audio player, and it is supported by all modern stereo systems and DVD players.

The MP3 format uses a lossy compression algorithm designed to greatly reduce the amount of data needed to play a recording while keeping playback quality very close to the original (in the opinion of most listeners), although audiophiles report a noticeable difference. An MP3 created at an average bitrate of 128 kbps is approximately 1/10 the size of the original audio CD data. MP3 files can be created at higher or lower bitrates, which affects the quality of the result. The principle of compression is to reduce the precision of certain parts of the audio stream in a way that is virtually inaudible to most people. This method is called perceptual coding. In the first stage, the sound is analyzed as a sequence of short time segments; then information that the human ear cannot discern is removed, and the remaining information is stored in a compact form. This approach is similar to the one used to compress images in the JPEG format.

MP3 was developed by a working group of the Fraunhofer Institute (German: Fraunhofer-Institut für Integrierte Schaltungen) led by Karlheinz Brandenburg and the University of Erlangen-Nuremberg in collaboration with AT&T Bell Labs and Thomson (Johnson, Stoll, Deery, etc.).

The basis for the development of MP3 was the experimental codec ASPEC (Adaptive Spectral Perceptual Entropy Coding). The first MP3 encoder was the L3Enc program, released in the summer of 1994. One year later, the first software MP3 player appeared - Winplay3.

During development of the algorithm, tests were carried out on specific popular recordings. The main test song was Suzanne Vega's “Tom's Diner.” Hence the joke that “MP3 was created solely for comfortable listening to Brandenburg’s favorite song,” and Vega came to be called “the mother of MP3.”


Format Description

In this format, sound is encoded in the frequency domain (rather than as discrete time samples). Stereo is supported, in several modes (details below). MP3 is a lossy compression format: the part of the audio information that, according to the psychoacoustic model, the human ear cannot perceive, or that not all people perceive, is permanently removed from the recording. The compression level can be varied, including within a single file. The range of possible bitrates is 8 to 320 kbit/s. For comparison, the data stream of a regular audio CD is 1411.2 kbit/s at a sampling frequency of 44100 Hz.
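The CD figure quoted above can be checked with simple arithmetic. The sketch below assumes the usual audio CD parameters of 16 bits per sample and 2 channels (not stated in the text); the function names are illustrative only.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int) -> float:
    """Bitrate of uncompressed PCM audio in kbit/s."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def compression_ratio(source_kbps: float, encoded_kbps: float) -> float:
    """How many times smaller the encoded stream is than the source."""
    return source_kbps / encoded_kbps


# Assumed CD parameters: 44100 Hz, 16-bit samples, stereo.
cd_kbps = pcm_bitrate_kbps(44100, 16, 2)
ratio_at_128 = compression_ratio(cd_kbps, 128)
print(cd_kbps, round(ratio_at_128, 1))
```

Under these assumptions a 128 kbit/s MP3 is roughly 11 times smaller than the CD stream, which matches the "approximately 1/10" rule of thumb.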

MP3 and "Audio-CD quality"

In the past, it was widely believed that a 128 kbps recording was sufficient for music intended for most listeners and provided Audio-CD quality sound. In reality, things are more complicated. First, the quality of the resulting MP3 depends not only on the bitrate but also on the encoder, because the standard does not prescribe an encoding algorithm; it only describes the representation of the data. Second, in addition to the prevailing CBR (Constant Bitrate) mode, in which, simply put, every second of audio is encoded with the same number of bits, there are ABR (Average Bitrate) and VBR (Variable Bitrate) modes. Third, the 128 kbit/s threshold is arbitrary: it was “invented” in the format's early days, when the playback quality of sound cards and computer speakers was generally lower than it is now.

Currently, the most common MP3 files have a bitrate of 192 kbit/s, which may indirectly indicate that most people consider this bitrate sufficient. The actual perceived "quality" depends on the source audio, the listener, and the audio system. Some music lovers prefer to compress music at “maximum quality”, 320 kbps, or even switch to other formats such as FLAC, where the average bitrate is about 1000 kbps. Some music lovers also hold that certain samples (fragments of recordings) cannot be compressed lossily in a satisfactory way: at every possible bitrate the compressed audio is easy to distinguish from the original.

Encoding modes and options

There are three versions of the MP3 format for different needs: MPEG-1, MPEG-2 and MPEG-2.5. They differ in the possible ranges of bitrate and sampling frequency:

* 32-320 kbps at sampling rates of 32000 Hz, 44100 Hz and 48000 Hz for MPEG-1 Layer 3;

* 8-160 kbps at sampling rates of 16000 Hz, 22050 Hz and 24000 Hz for MPEG-2 Layer 3;

* 8-160 kbps at sampling rates of 8000 Hz, 11025 Hz and 12000 Hz for MPEG-2.5 Layer 3.

Audio channel coding control modes

Since the MP3 format supports two-channel encoding (stereo), there are 4 modes:

* Stereo is a two-channel encoding in which the channels of the original stereo signal are encoded independently of each other, but the distribution of bits between channels in the total bitrate can vary depending on the complexity of the signal in each channel.

* Mono - single-channel encoding. If two-channel material is encoded this way, the differences between the channels are lost completely, since the two channels are mixed into one, which is encoded and then played back on both channels of the stereo system. The only advantage of this mode is better output quality than Stereo mode at the same bitrate, since the single channel gets twice as many bits as each channel in Stereo mode.

* Two-channel - two independent channels, for example, audio tracks in different languages. The bitrate is divided between the two channels. For example, if the specified bitrate is 192 kbit/s, each channel gets only 96 kbit/s.

* Joint Stereo - the most efficient method of two-channel coding. For example, in one of the Joint Stereo modes the left and right channels are converted into their sum (L+R) and difference (L-R). For most sound files the difference channel (L-R) carries much less information than the sum channel (L+R). Human perception also plays a role here: differences in the direction of sound are much less noticeable to the listener. Joint Stereo therefore makes it possible either to save bitrate on the difference channel (L-R) or to improve quality at the same bitrate, since most of the bitrate is allocated to the sum channel (L+R). There is an opinion that this mode is not suitable for stereo material in which the two channels carry subjectively completely different content, because it blurs the differences between the channels. However, modern codecs choose different schemes for different frames (including pure stereo) depending on the source signal.
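The version, bitrate, sampling rate and channel mode described above are all recorded in the 4-byte header that starts every MP3 frame. The sketch below is a minimal illustration: the bit layout and lookup tables are not given in the text above and are stated here as assumptions based on the MPEG audio frame header as commonly documented; the channel mode names follow the list above.

```python
# Assumed Layer 3 lookup tables (index taken from the header bits); None = not usable here.
BITRATES_KBPS = {
    "MPEG-1": [None, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, None],
    "MPEG-2": [None, 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160, None],
}
BITRATES_KBPS["MPEG-2.5"] = BITRATES_KBPS["MPEG-2"]
SAMPLE_RATES_HZ = {
    "MPEG-1": [44100, 48000, 32000],
    "MPEG-2": [22050, 24000, 16000],
    "MPEG-2.5": [11025, 12000, 8000],
}
VERSIONS = {0b11: "MPEG-1", 0b10: "MPEG-2", 0b00: "MPEG-2.5"}
CHANNEL_MODES = ["Stereo", "Joint Stereo", "Two-channel", "Mono"]


def parse_layer3_header(header: bytes) -> dict:
    """Decode the main fields of a 4-byte MPEG audio Layer 3 frame header."""
    if len(header) != 4:
        raise ValueError("a frame header is exactly 4 bytes")
    h = int.from_bytes(header, "big")
    if h >> 21 != 0x7FF:                      # 11 sync bits, all set
        raise ValueError("no frame sync")
    version = VERSIONS.get((h >> 19) & 0b11)
    if version is None or (h >> 17) & 0b11 != 0b01:   # 01 = Layer 3
        raise ValueError("not a Layer 3 frame")
    rate_index = (h >> 10) & 0b11
    if rate_index == 3:
        raise ValueError("reserved sampling rate index")
    return {
        "version": version,
        "bitrate_kbps": BITRATES_KBPS[version][(h >> 12) & 0xF],
        "sample_rate_hz": SAMPLE_RATES_HZ[version][rate_index],
        "channel_mode": CHANNEL_MODES[(h >> 6) & 0b11],
    }


print(parse_layer3_header(bytes.fromhex("fffb9064")))
```

Because every frame carries its own bitrate field, an encoder is free to change the bitrate from frame to frame, which is what makes variable bitrate files possible.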

CBR stands for Constant Bit Rate: the bitrate is set by the user and does not change while the work is being encoded. Each second of the work therefore corresponds to the same number of encoded bits (even when encoding silence). CBR can be useful for streaming over a channel of limited capacity; in this case the encoding makes full use of the channel. For storage, this mode is not optimal, since it cannot allocate enough space to complex sections of the original work while wasting space on simple sections. Higher bitrates (above 256 kbps) can solve this problem by allocating more space for data, but they increase the file size proportionally.

VBR stands for Variable Bit Rate: the encoder changes the bitrate dynamically during encoding, depending on the complexity of the audio material and the quality level set by the user (for example, silence is encoded at the minimum bitrate). This is the most advanced MP3 encoding method and is still being developed and improved, since material of varying complexity can be encoded at a consistent quality, which is usually higher than that of CBR at the same average bitrate. In addition, file size is reduced because fragments that do not need a high bitrate are not given one. One disadvantage is that the size of the output file is hard to predict, but this is minor compared with the advantages. Another is that VBR treats quieter fragments as “insignificant” audio information, so when listening at very high volume these fragments may sound poor, whereas CBR encodes quiet and loud fragments at the same bitrate. VBR keeps improving as the mathematical models in codecs are refined; in particular, since the release of version 3.98 of the free MP3 encoder LAME, variable bitrate encoding is, according to its developers, better in quality than CBR and especially ABR.

ABR stands for Average Bit Rate, a hybrid of VBR and CBR: the user sets the bitrate in kbit/s, and the encoder varies it while continually steering toward the specified average. As a result, the encoder is cautious about using the highest and lowest possible bitrates, since it risks missing the user-specified average. This is a clear disadvantage of the method, as it affects the quality of the output file, which will be slightly better than with CBR but much worse than with VBR. On the other hand, this method offers the most flexibility in setting the bitrate (it can be any number between 8 and 320, whereas CBR is limited to a fixed set of standard values) and makes the size of the output file easy to calculate.

Tags are labels placed within an MP3 file, at the beginning and/or at the end. They can contain the author, album, year of release and other information about the track. Later tag versions can also store album covers and song lyrics. Several tag versions exist.
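As an illustration of a tag stored at the end of a file, the sketch below reads an ID3v1 tag. The text above does not name a tag format, so the layout used here is an assumption based on ID3v1 as commonly documented: the last 128 bytes of the file, starting with the marker "TAG", followed by fixed-width text fields. The function name is illustrative.

```python
import struct
from typing import Optional


def read_id3v1(data: bytes) -> Optional[dict]:
    """Return the ID3v1 fields found in the last 128 bytes, or None if absent."""
    if len(data) < 128 or data[-128:-125] != b"TAG":
        return None
    # 3-byte marker, three 30-byte text fields, 4-byte year, 30-byte comment, 1-byte genre
    _, title, artist, album, year, _comment, _genre = struct.unpack(
        "<3s30s30s30s4s30sB", data[-128:]
    )

    def text(raw: bytes) -> str:
        # Fields are padded with zero bytes (or spaces) up to their fixed width.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    return {
        "title": text(title),
        "artist": text(artist),
        "album": text(album),
        "year": text(year),
    }
```

Typical use would be `read_id3v1(open(path, "rb").read())`, where `path` is a hypothetical file name; tags stored at the beginning of the file use a different, more flexible layout that this sketch does not handle.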

Disadvantages

Technical disadvantages. MP3 is the leader in popularity, but it is not the best in technical terms. Some formats, such as Vorbis and AAC, achieve higher quality at the same file size. MP3 also lacks a lossless encoding mode, which is desirable for professional use. At the same time, thanks to the ubiquity of players, MP3 is quite suitable (from a professional point of view) for distributing demo tracks or otherwise “distributing” your music.

Legal restrictions. Free use of the format is restricted by patents. Alcatel-Lucent holds rights to MP3 and receives royalties from those who use the format, such as manufacturers of players and mobile phones. Because of this, the licensing status of the format is questionable. In particular, Alcatel-Lucent brought claims against Microsoft over the MP3 support built into Windows. However, patents on the technology expire in 2010, after which any company will be able to use it freely.

Formats - Audio compression formats

FLAC (Free Lossless Audio Codec) is a popular free codec for audio compression. Unlike lossy codecs such as Ogg Vorbis and MP3, FLAC does not remove any information from the audio stream and is suitable both for listening to music on high-quality sound reproduction equipment and for archiving an audio collection. Today, the FLAC format is supported by many audio applications.

Audio stream

The main parts of the stream are:

* String of four bytes "fLaC"

* STREAMINFO metadata block

* Other optional metadata blocks

* Audio frames

The first four bytes identify the FLAC stream. The following metadata contains information about the stream, followed by compressed audio data.

Metadata

FLAC defines several types of metadata blocks (all of which are listed on the format page). Metadata blocks can be of any size, and new blocks can be easily added. The decoder has the ability to skip metadata blocks unknown to it. Only the STREAMINFO block is required. It contains the sampling rate, number of channels, etc., as well as data that allows the decoder to configure buffers. The MD5 signature of the uncompressed audio data is also recorded here. This is useful for checking the entire stream after it has been transmitted.
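The stream marker and the mandatory STREAMINFO block can be read with a few lines of code. The sketch below assumes the field widths of the published FLAC format (a 4-byte metadata block header, then a 34-byte STREAMINFO body); these widths are not given in the text above, and the function name is illustrative.

```python
def parse_streaminfo(data: bytes) -> dict:
    """Read the STREAMINFO block that must follow the "fLaC" marker."""
    if data[:4] != b"fLaC":
        raise ValueError("not a FLAC stream")
    block_type = data[4] & 0x7F               # top bit marks the last metadata block
    length = int.from_bytes(data[5:8], "big")
    body = data[8:8 + length]
    if block_type != 0 or length != 34 or len(body) != 34:
        raise ValueError("STREAMINFO must be the first metadata block")
    # Bytes 10-17 pack: 20-bit sample rate, 3-bit (channels - 1),
    # 5-bit (bits per sample - 1), 36-bit total sample count.
    packed = int.from_bytes(body[10:18], "big")
    return {
        "min_block_size": int.from_bytes(body[0:2], "big"),
        "max_block_size": int.from_bytes(body[2:4], "big"),
        "sample_rate": packed >> 44,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": packed & ((1 << 36) - 1),
        "md5": body[18:34].hex(),             # signature of the uncompressed audio
    }
```

A decoder that does not understand a later metadata block can use the same 24-bit length field to skip over it.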

Other blocks are designed to reserve space and to store seek tables, tags, cue sheets (audio CD track layouts), and application-specific data. Options for adding PADDING blocks or seek points are given below. FLAC does not require seek points, but they can significantly increase access speed and can also be used to place marks in audio editors.

Audio data

The metadata is followed by compressed audio data. Metadata and audio data are not interleaved. Like most codecs, FLAC divides the input stream into blocks and encodes them independently of each other. Each block is packed into a frame and added to the stream. The reference encoder uses constant-size blocks for the entire stream, but the format allows blocks of varying length within a stream.

Blocking

The block size is a very important coding parameter. If it is very small, there will be too many frame headers in the stream, which lowers the compression level. If it is large, the encoder will not be able to find an effective compression model. Understanding the modeling process helps to increase the compression level for certain types of input data. Typically, when linear prediction is used on audio with a sampling rate of 44.1 kHz, the optimal block size lies in the range of 2-6 thousand samples.

Interchannel decorrelation

If stereo audio data is input, it may go through an inter-channel decorrelation stage. The right and left channels are converted to the average and difference according to the formulas: average = (left + right)/2, difference = left - right. Unlike joint stereo, this process does not lead to losses. For audio CD data, this usually results in a significantly higher compression level.
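The reason this conversion loses nothing, even though the average is stored as an integer, can be shown in a few lines. This is a minimal sketch rather than the reference encoder's code: the bit dropped by the integer halving is always equal to the lowest bit of the difference, so the decoder can put it back.

```python
def to_mid_side(left: int, right: int) -> tuple:
    """Integer average and difference of one stereo sample pair."""
    return (left + right) >> 1, left - right


def from_mid_side(mid: int, side: int) -> tuple:
    """Exact inverse: recover the bit lost in the halving from the parity of side."""
    total = (mid << 1) | (side & 1)           # left + right, restored exactly
    return (total + side) >> 1, (total - side) >> 1


pair = (5, 2)
print(to_mid_side(*pair), from_mid_side(*to_mid_side(*pair)))
```

The trick works because left + right and left - right always have the same parity, so no extra bit needs to be stored.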

Modeling

At the next stage, the encoder tries to approximate the signal with a function such that the result of subtracting it from the original (called the difference, residual or error) can be encoded with a minimum number of bits. The function parameters must also be stored, so they should not take up much space. FLAC uses two methods for generating approximations:

* fitting a simple polynomial to the signal

* general coding with linear predictors (LPC).

First, fixed polynomial prediction (-l 0) is significantly faster but less accurate than LPC. The higher the LPC order, the slower but better the model. However, as the order increases, the gain becomes less and less significant. At some point (usually around order 9), the encoder routine that estimates the best order starts to guess wrong and the resulting frames grow in size. To overcome this, an exhaustive search can be used, at the cost of a significant increase in encoding time.

Second, the parameters of the fixed predictors can be described by three bits, while the parameters of the LPC model depend on the number of bits per sample and the LPC order. This means that the size of the frame header depends on the method and order chosen and may affect the optimal block size.
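The polynomial predictors can be sketched compactly. The example assumes five predictor orders (0 to 4), which is consistent with the three-bit description above, and uses the fact that the residual of an order-k polynomial predictor is the k-th finite difference of the signal. The cost measure (sum of absolute residuals) is a simplification chosen for illustration, not the reference encoder's exact criterion.

```python
def fixed_residual(samples: list, order: int) -> list:
    """Residual left by the fixed polynomial predictor of the given order.

    The first `order` samples are warm-up values that would be stored verbatim,
    so the residual is shorter than the input by `order` entries.
    """
    residual = list(samples)
    for _ in range(order):
        residual = [residual[i] - residual[i - 1] for i in range(1, len(residual))]
    return residual


def best_fixed_order(samples: list) -> int:
    """Pick the lowest order (0-4) whose residual has the smallest total magnitude."""
    return min(range(5), key=lambda k: sum(abs(e) for e in fixed_residual(samples, k)))


ramp = [3 * n + 5 for n in range(16)]         # a straight line
print(best_fixed_order(ramp), fixed_residual(ramp, 2))
```

For a straight-line signal the order-2 residual is all zeros, which is exactly the kind of input the next stage can encode in very few bits.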

Residual coding

Once the model is fitted, the encoder subtracts the approximation from the original to obtain the residual (error) signal, which is then encoded losslessly. This exploits the fact that the residual usually has a Laplace distribution, and there is a family of special Huffman codes, called Rice codes, that encode such signals efficiently and quickly without a dictionary.

Rice coding consists of finding a single parameter that matches the signal distribution and then using it to construct the codes. When the distribution changes, the optimal parameter changes too, so there is a method for recalculating it as needed. The residual can be divided into contexts, or partitions, each with its own Rice parameter. FLAC lets you specify how the partitioning is done. The residual can be divided into 2^n partitions.
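A Rice code with parameter k can be sketched as follows. This is a simplified illustration that produces a string of "0" and "1" characters rather than a packed bitstream; the folding of signed values onto non-negative integers and the unary convention (zeros ended by a one) are assumptions chosen for the example and are not claimed to be byte-exact FLAC.

```python
def zigzag(x: int) -> int:
    """Fold signed values onto 0, 1, 2, ... so small magnitudes stay small."""
    return (x << 1) if x >= 0 else ((-x << 1) - 1)


def unzigzag(u: int) -> int:
    return (u >> 1) if u & 1 == 0 else -((u + 1) >> 1)


def rice_encode(values: list, k: int) -> str:
    """Quotient in unary, then the k low bits in binary."""
    out = []
    for v in values:
        u = zigzag(v)
        out.append("0" * (u >> k) + "1")
        if k:
            out.append(format(u & ((1 << k) - 1), "0{}b".format(k)))
    return "".join(out)


def rice_decode(bits: str, k: int, count: int) -> list:
    values, pos = [], 0
    for _ in range(count):
        q = 0
        while bits[pos] == "0":               # count the unary zeros
            q += 1
            pos += 1
        pos += 1                              # skip the terminating one
        r = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        values.append(unzigzag((q << k) | r))
    return values


def best_rice_parameter(values: list, max_k: int = 14) -> int:
    """Try each parameter and keep the one giving the shortest code."""
    return min(range(max_k + 1), key=lambda k: len(rice_encode(values, k)))
```

Choosing a separate parameter for each partition of the residual, as described above, amounts to calling `best_rice_parameter` on each partition independently.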

Framing

The audio frame is preceded by a header, which begins with a synchronization code and contains the minimum information the decoder needs to play the stream. The block or sample number and an eight-bit checksum (CRC) of the header itself are also recorded here. The sync code, frame header CRC, and block/sample number allow resynchronization and seeking even in the absence of seek points. A sixteen-bit checksum of the frame is written at its end. If the reference decoder detects an error, it generates a block of silence.
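The two checksums can be sketched with one bitwise CRC routine. The generator polynomials used here (0x07 for the eight-bit header checksum and 0x8005 for the sixteen-bit frame checksum, both starting from zero) are assumptions based on the published FLAC format; they are not stated in the text above.

```python
def crc(data: bytes, width: int, poly: int) -> int:
    """Bitwise CRC, most significant bit first, initial value 0."""
    top = 1 << (width - 1)
    mask = (1 << width) - 1
    reg = 0
    for byte in data:
        reg ^= byte << (width - 8)            # feed the next byte into the top bits
        for _ in range(8):
            reg = ((reg << 1) ^ poly) if reg & top else (reg << 1)
            reg &= mask
    return reg


def crc8(data: bytes) -> int:
    return crc(data, 8, 0x07)


def crc16(data: bytes) -> int:
    return crc(data, 16, 0x8005)


def frame_is_intact(frame: bytes) -> bool:
    """True if the trailing two bytes match the CRC-16 of everything before them."""
    return crc16(frame[:-2]) == int.from_bytes(frame[-2:], "big")
```

A decoder that scans for the sync code can use the header CRC to reject false matches before committing to decode a frame.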

To support basic metadata types, the reference decoder can skip ID3v1 and ID3v2 tags, so they can be added freely. ID3v2 tags must appear before the "fLaC" marker, and ID3v1 tags must appear at the end of the file.

There are modifications of the FLAC encoder: Improved FLAC encoder and Flake.

On January 29, 2003, Xiphophorus (now called the Xiph.Org Foundation) announced the inclusion of the FLAC format in their line of products such as Ogg Vorbis.


Audio compression methods

Audio Compression

Audio compression is the process of reducing the bit rate by reducing the statistical and psychoacoustic redundancy of the digital audio signal.

Audio compression is a type of data compression (encoding) used to reduce the size of audio files or the bandwidth needed for streaming audio. Audio compression algorithms are implemented in computer programs called audio codecs. Special algorithms for compressing audio data were developed because general-purpose compression algorithms are ineffective on audio and cannot work in real time.

As in the general case, there is lossless audio compression, which allows the original data to be restored without distortion, and lossy compression, for which such restoration is impossible. Lossy algorithms achieve a much higher degree of compression: for example, an audio CD holds about an hour of “uncompressed” music; with lossless compression, a CD can hold almost 2 hours of music, and with lossy compression at an average bitrate, 7-10 hours.

Lossless compression

The difficulty with lossless audio compression is that audio recordings have an extremely complex structure. One compression method is to look for repeating patterns, but this is ineffective for more chaotic data such as digitized audio or photographs. Interestingly, while computer-generated graphics are much easier to compress without loss, synthesized audio has no such advantage. Even computer-generated sound usually has a very complex waveform, which makes designing an algorithm difficult.

Another difficulty is that the sound usually changes very quickly and this is also the reason why ordered byte sequences appear very rarely.

The most common lossless compression formats are:
Free Lossless Audio Codec (FLAC), Apple Lossless, MPEG-4 ALS, Monkey's Audio, and TTA.

Lossy compression

Lossy compression has extremely wide applications. In addition to computer programs, lossy compression is used in streaming DVD audio, digital television and radio and internet streaming media.

The innovation of this compression method was the use of psychoacoustics to detect sound components that the human ear does not perceive. Examples are high frequencies, which are perceived only when they are powerful enough, and quiet sounds that occur together with or immediately after loud sounds and are therefore masked by them. Such components may be transmitted less accurately or not at all.

To exploit masking, the signal is converted from a time sequence of amplitude samples into a sequence of sound spectra, in which each spectral component is encoded separately. The fast Fourier transform, MDCT, quadrature mirror filters or other methods are used for this transformation. The total amount of information is unchanged by the transformation itself. Compression in a given frequency band may mean that masked or zero components are not stored at all or are encoded at a lower resolution. For example, frequency components below 200 Hz and above 14 kHz can be encoded with 4 bits, while components in the mid-range are encoded with 16 bits. The result is an encoding with an average depth of about 8 bits, but it will sound significantly better than encoding the entire frequency range with 8 bits.

However, the parts of the spectrum re-encoded at low resolution obviously can no longer be restored exactly and are thus lost forever.

The main parameter of lossy compression is the bitrate, which determines the degree of compression and, accordingly, the quality. Encoding can use a constant bitrate (CBR), a variable bitrate (VBR) or an average bitrate (ABR).
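The band-dependent resolution described above can be illustrated with a toy example. This is a sketch under strong simplifications: it uses a naive discrete Fourier transform instead of the MDCT or filter banks real codecs use, a fixed band split at 200 Hz and 14 kHz instead of a psychoacoustic model, and quantization step sizes chosen arbitrarily for the demonstration.

```python
import cmath
import math


def dft(x: list) -> list:
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum: list) -> list:
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]


def quantize_spectrum(spectrum, sample_rate, fine_step, coarse_step,
                      low_hz=200.0, high_hz=14000.0):
    """Round mid-band components finely and the outer bands coarsely."""
    n = len(spectrum)
    out = []
    for k, c in enumerate(spectrum):
        freq = min(k, n - k) * sample_rate / n   # frequency this bin represents
        step = fine_step if low_hz <= freq <= high_hz else coarse_step
        out.append(complex(round(c.real / step) * step, round(c.imag / step) * step))
    return out


# Demo signal: a strong 1 kHz tone plus a very weak 15 kHz tone, sampled at 32 kHz.
N, RATE = 64, 32000
signal = [math.cos(2 * math.pi * 2 * t / N) + 0.01 * math.cos(2 * math.pi * 30 * t / N)
          for t in range(N)]
coded = quantize_spectrum(dft(signal), RATE, fine_step=0.01, coarse_step=1.0)
restored = idft(coded)
print(max(abs(a - b) for a, b in zip(signal, restored)))
```

The transform alone is reversible; the information is lost only at the rounding step, where the weak out-of-band tone falls below the coarse step and disappears while the mid-band tone is kept almost exactly.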

The most common lossy compression formats are: AAC, ADPCM, ATRAC, Dolby AC-3, MP2, MP3, Musepack, Ogg Vorbis, WMA and others.

Some methods of compressing audio data (addition to Lecture 2)

    Lossless coding is an audio encoding method that allows 100% recovery of the data from the compressed stream. It is used where preserving the original quality is critical. For example, after sound is mixed in a recording studio, the data must be archived in its original quality for possible future use. Current lossless encoders (for example, Monkey's Audio) can reduce the volume of data by 20-50% while still guaranteeing exact restoration of the original. Such encoders are a kind of data archiver (like ZIP, RAR and others), designed specifically for audio.

    Lossy coding. The purpose of such encoding is to make the restored signal sound as similar to the original as possible with as little packed data as possible. This is achieved with algorithms that “simplify” the original signal (discarding “unnecessary”, barely audible details), so the decoded signal is no longer identical to the original but only sounds similar.

There are many compression methods, as well as programs that implement these methods. The most famous are MPEG-1 Layer I,II,III (the last one is the well-known MP3), MPEG-2 AAC (advanced audio coding), Ogg Vorbis, Windows Media Audio (WMA), TwinVQ (VQF), MPEGPlus, TAC, and others.

On average, such encoders provide a compression ratio in the range of 10:1 to 14:1.

Some audio file formats:

AU format. This is a simple and widespread format on Sun and NeXT systems (in the latter case, however, the file has the SND extension). The file consists of a short header (at least 28 bytes) immediately followed by the audio data. It is widely used on Unix-like systems and is the basic audio format of the Java platform.

WAVE format (WAV). The standard file format for storing audio in Windows. It is a special case of the more general RIFF (Resource Interchange File Format); AVI video files are another kind of RIFF. A RIFF file is composed of blocks (chunks), some of which may in turn contain nested blocks; each block is preceded by a four-character identifier and its length. WAV audio files are usually simple and have only one format block and one data block. The first contains general information about the digitized sound (number of channels, sampling frequency, the type of amplitude encoding, etc.), and the second contains the numerical data itself. Each sample occupies a whole number of bytes (for example, 2 bytes for 12-bit numbers, with the most significant bits set to zero). In a stereo recording, the numbers are grouped in pairs for the left and right channels, and each pair forms a complete block; in our example its length is 4 bytes. This seemingly excessive structuring lets software optimize data transfer during playback, but, as always in such cases, the gain in time comes at the cost of a significant increase in file size.
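The block structure of a WAV file can be walked with a short routine. The sketch below builds a tiny stereo file in memory with Python's standard `wave` module and then reads the identifier and length that precede each block. The field order inside the format block is an assumption based on the common PCM layout, and the function names are illustrative.

```python
import io
import struct
import wave


def iter_riff_chunks(data: bytes):
    """Yield (identifier, body) for each top-level block of a RIFF file."""
    riff, size, _form = struct.unpack("<4sI4s", data[:12])
    if riff != b"RIFF":
        raise ValueError("not a RIFF file")
    pos, end = 12, min(len(data), 8 + size)
    while pos + 8 <= end:
        chunk_id, length = struct.unpack("<4sI", data[pos:pos + 8])
        yield chunk_id, data[pos + 8:pos + 8 + length]
        pos += 8 + length + (length & 1)      # blocks are padded to an even size


def describe_wav(data: bytes) -> dict:
    info = {}
    for chunk_id, body in iter_riff_chunks(data):
        if chunk_id == b"fmt ":
            # format tag, channels, sample rate, byte rate, block size, bits per sample
            _tag, channels, rate, _byte_rate, block_align, bits = struct.unpack(
                "<HHIIHH", body[:16])
            info.update(channels=channels, sample_rate=rate,
                        block_align=block_align, bits_per_sample=bits)
        elif chunk_id == b"data":
            info["data_bytes"] = len(body)
    return info


# Build a 10-frame, 16-bit stereo file in memory and inspect it.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as w:
    w.setnchannels(2)
    w.setsampwidth(2)
    w.setframerate(44100)
    w.writeframes(b"\x00\x00\x00\x00" * 10)
print(describe_wav(buffer.getvalue()))
```

For 16-bit stereo the block size reported by the format block is 4 bytes, one left and one right sample, as in the example in the text.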

MP3 format (MPEG Layer 3). One of the audio storage formats, later adopted as part of compressed-video standards. The way this format works is in many respects similar to the JPEG compression of graphic data discussed earlier. Since arbitrary sound data does not compress well enough with reversible methods, irreversible methods are needed: using knowledge of the properties of human hearing, the sound information is “corrected” so that the resulting distortions are imperceptible to the ear while the data compresses better with traditional methods. This is called adaptive encoding, and it saves space on the sound details least significant to human perception. The techniques used in MP3 are not easy to understand and rely on fairly complex mathematics, but they compress audio very effectively. The success of MP3 has led to its use in many consumer audio devices, such as players and cell phones.

MIDI format. The name MIDI is an abbreviation for Musical Instrument Digital Interface, i.e. digital interface for musical instruments. This is a fairly old (1983) standard that combines a variety of musical equipment (synthesizers, drums, lighting). MIDI is based on data packets, each of which corresponds to an event, such as pressing a key or setting a sound mode. Any event can simultaneously control several channels, each of which relates to a specific piece of equipment. Despite its original purpose, the file format has become a standard for music data that can, if desired, be played using a computer's sound card without any external MIDI equipment. The main advantage of MIDI files is their very small size, since they are not detailed recordings of sound, but are actually some kind of advanced electronic equivalent of traditional musical notation. But this same property is also a disadvantage: since the sound is not detailed, different equipment will reproduce it differently, which, in principle, can even noticeably distort the author’s musical intention.
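The compactness of event-based data is easy to see in code. The sketch below builds the byte packets for pressing and releasing a key; the status byte values (0x90 for "note on", 0x80 for "note off", with the channel number in the low four bits) are taken from the MIDI specification as commonly documented and are not given in the text above.

```python
def note_on(channel: int, key: int, velocity: int) -> bytes:
    """Three-byte packet for a key press on one of 16 channels."""
    if not (0 <= channel < 16 and 0 <= key < 128 and 0 <= velocity < 128):
        raise ValueError("channel must be 0-15, key and velocity 0-127")
    return bytes([0x90 | channel, key, velocity])


def note_off(channel: int, key: int, velocity: int = 0) -> bytes:
    """Three-byte packet for releasing a key."""
    if not (0 <= channel < 16 and 0 <= key < 128 and 0 <= velocity < 128):
        raise ValueError("channel must be 0-15, key and velocity 0-127")
    return bytes([0x80 | channel, key, velocity])


# Middle C (key 60) pressed and released on the first channel: 6 bytes in total.
events = note_on(0, 60, 100) + note_off(0, 60)
print(events.hex(), len(events))
```

One second of the same note stored as CD-quality samples would take about 176 KB, which is why MIDI files are so small, and also why their sound depends entirely on the equipment that plays them.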

MOD format. A further development of the idea behind MIDI files. Known as “playback modules,” these files store not only the “electronic sheet music” but also digitized audio samples that serve as templates for individual notes. This gives unambiguous sound reproduction. The format's disadvantages include the considerable processing time needed to mix the samples of simultaneously sounding notes.

Today, the amount of information we consume online is thousands of times greater than in the early 2000s. That is not surprising: back then Internet coverage was far less widespread, and the sites and services we are used to looked completely different.

Every day we read articles and news about some company developing a new connection standard that surpasses its current counterparts in data transfer speed. Over almost two decades, providers and device manufacturers have taken a huge step toward high-speed Internet access. But speed alone is not the reason we have near-instant access to sites.

The development of compression algorithms for images, audio and video has played a huge role in saving our time. Browsing the Internet, we often do not think about how things work or how much effort went into developing a given technology. In this new series of articles, we will look at compression methods for popular formats such as MP3 and JPEG, and also take a basic look at the video encoding process.

Algorithm operation

The first article in the new series covers the most popular audio compression format, *.mp3. It appeared in 1993 thanks to a working group of the Fraunhofer Institute and was standardized by the MPEG group. According to Wikipedia, the group was formed by the international organization ISO to develop standards for compressing audio and video files. It has also established the following standards:

  • MPEG-1: Intended for compressing video and audio; it later became the established standard for VCD (Video CD).
  • MPEG-2: Focused on broadcast television transmission in the ATSC, ISDB and DVB families and on satellite TV services such as Dish Network.
  • MPEG-3: A standard developed for HDTV broadcasting that was not adopted because MPEG-2, with minor modifications, proved quite sufficient for the purpose. And no, this is not the mp3 you might be thinking of: mp3 is in fact Layer 3 of the MPEG-1 audio standard.
  • MPEG-4: A major extension of MPEG-1 that supports 3D content and low-bitrate compression. It also integrated a copyright protection system (DRM). The video formats introduced in the standard include ASP and H.264.
Anyway, let's go back to mp3. The main goal of the format was and is to reduce the size of files by removing certain parts of the sound spectrum that are not perceptible on non-professional audio equipment, in accordance with the psychoacoustic model of human sound perception.

At the first stage, the sound wave is decomposed into its component frequencies using the Fourier transform. All frequencies that are inaudible to our ears are simply removed; for the most part, this is the spectrum above 16,000 Hz. By the way, music identification services such as SoundHound and Shazam also rely on frequency analysis. Their algorithms split the recorded sound wave into several components, pick out the rhythm and the main notes, and compare them with a database.
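
The idea of decomposing a signal into frequencies and discarding the upper part of the spectrum can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it uses a naive discrete Fourier transform on a tiny block of samples, the function names (`dft`, `inverse_dft`, `remove_above`) and the 48 kHz test signal are our own, and a real MP3 encoder uses a more elaborate filter bank.

```python
import cmath
import math

def dft(samples):
    """Naive discrete Fourier transform, O(n^2); fine for a tiny demo."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def inverse_dft(spectrum):
    """Inverse transform; the input was real, so we keep only the real part."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]

def remove_above(samples, sample_rate, cutoff_hz):
    """Zero every frequency bin above cutoff_hz and rebuild the signal."""
    n = len(samples)
    spectrum = dft(samples)
    for k in range(n):
        # Bins above n/2 mirror the lower ones, so map them to their real frequency.
        freq = min(k, n - k) * sample_rate / n
        if freq > cutoff_hz:
            spectrum[k] = 0
    return inverse_dft(spectrum)

SAMPLE_RATE = 48000  # assumed value, chosen so that each bin is exactly 1000 Hz wide
N = 48
low = [math.sin(2 * math.pi * 2000 * t / SAMPLE_RATE) for t in range(N)]
high = [0.5 * math.sin(2 * math.pi * 18000 * t / SAMPLE_RATE) for t in range(N)]
mixed = [a + b for a, b in zip(low, high)]

filtered = remove_above(mixed, SAMPLE_RATE, 16000)
max_error = max(abs(f - l) for f, l in zip(filtered, low))
print(f"largest deviation from the 2 kHz tone: {max_error:.2e}")
```

After filtering, the 18 kHz component is gone and what remains is, up to rounding error, the original 2 kHz tone.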

Nevertheless, the overall sound picture of, for example, an mp3 file with a bitrate of 320 kbps differs little from that of an uncompressed file, while at an average bitrate of 128 kbps the size can be about 1/10 of the original.

Already at this stage the file size can be reduced significantly, but most of the compression happens in the masking stages that follow. The first of them removes frequencies that are drowned out at loud moments in the song: if a loud drum sounds, the much quieter signals from the other instruments in the arrangement can simply be removed, and no one will notice.

In some cases, in accordance with the same psychoacoustic model, it is also possible to remove short portions of sound just before and after a loud sound, since during this period listeners experience short-term deafness (lasting literally a few hundredths of a second).

Then comes the distribution of sound across channels. This is done, not without some loss of detail, using special formulas, which you can see in the picture (simplified). The difference between the channels is reduced to almost zero in order to save another hundred or two bytes.
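
One common form of such channel formulas is the mid/side transform, sketched below. We assume here that this is the kind of formula meant; the function names and sample values are made up for the illustration. Instead of storing left and right separately, the encoder stores their average (mid) and half their difference (side). The transform itself is reversible; the saving, and the loss of detail, comes from the fact that the small side values can afterwards be stored very coarsely.

```python
def to_mid_side(left, right):
    """Convert left/right samples to mid (average) and side (half-difference)."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def from_mid_side(mid, side):
    """Rebuild left/right from mid/side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

# Hypothetical samples of a nearly identical left and right channel.
left = [1000, 1200, -800, 400]
right = [990, 1210, -790, 380]

mid, side = to_mid_side(left, right)
print("mid: ", mid)
print("side:", side)
print("restored:", from_mid_side(mid, side))
```

When the two channels are similar, the side values stay close to zero and need far fewer bits than a full second channel.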

At the end, each compressed frame of the recording, which by now contains many repeated values (for example, zeros), is reduced to the minimum size using Huffman coding. No further information is lost at this step: each value in the frame is simply assigned a code whose length depends on how often that value occurs. Finally, all the pieces of the recording are joined together, and the output is the audio file we are familiar with.
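
A minimal sketch of Huffman coding shows why this step loses nothing. The frame contents and function names below are invented for the example, and real MP3 encoders use predefined Huffman tables rather than building a tree for every frame, so treat this as an illustration of the principle only.

```python
import heapq
from collections import Counter

def huffman_code(values):
    """Return {value: bitstring}; frequent values get shorter codes."""
    counts = Counter(values)
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tie-breaker, {value: partial code}).
    heap = [(w, i, {v: ""}) for i, (v, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, codes1 = heapq.heappop(heap)
        w2, _, codes2 = heapq.heappop(heap)
        # Merge the two rarest subtrees, prefixing their codes with 0 and 1.
        merged = {v: "0" + c for v, c in codes1.items()}
        merged.update({v: "1" + c for v, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

def encode(values, code):
    return "".join(code[v] for v in values)

def decode(bits, code):
    """Read bits until they match a code; works because no code prefixes another."""
    reverse = {c: v for v, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    return out

# A made-up frame of quantized values in which zeros dominate.
frame = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, -1, -1, 2, 5]
code = huffman_code(frame)
bits = encode(frame, code)
print(code)
print(len(bits), "bits instead of", 3 * len(frame), "with fixed 3-bit codes")
print(decode(bits, code) == frame)
```

Decoding returns exactly the original frame, which is what "no further information is lost" means in practice.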

Thank you for reading to the end, now we understand how one of the most common audio formats works. In the next article we will look at the video compression process.

Today, most of us deal primarily with digital audio reproduction systems. In these systems, sound is stored digitally - that is, as sequences of zeros and ones, which, after decoding them using special software and hardware, turn into sound. In the world of digital music, there is a struggle, on the one hand, for the quality of playback, and on the other, for the amount of stored data. These are two opposing concepts - the higher the sound quality, the more space is usually required to store it. In order to preserve digital audio at the highest possible quality in as little information as possible, audio compression algorithms have been developed.

There are two different approaches to compressing audio information. The first is called lossless compression - during such compression, the sound recorded digitally is preserved in its entirety, without loss. Another approach to compressing audio data is called lossy compression - the sound is processed in a special way, everything that is unnecessary, according to the conclusion of the compression algorithm, is removed from it, and what remains is compressed. This compression, in comparison with lossless compression, allows you to achieve much higher levels of compression, that is, reduce the size of audio files, while the sound quality, if you do not try to compress the file too much, does not suffer particularly noticeably.

Music recordings can also be compressed with conventional archivers, but archivers cannot work in real time, and besides, the compression they achieve on uncompressed musical recordings rarely exceeds 50%. The other method, the one used in practice, is to use special programs called codecs, which can compress audio and decode and play compressed compositions “on the fly”.
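
What a conventional archiver does can be tried with Python's standard `zlib` module, which implements the same family of algorithms as ZIP. In this sketch the one-second 440 Hz test tone is our own choice; a pure tone is far more regular than real music, so the ratio printed here says nothing about typical recordings. The point is only that the data comes back bit for bit.

```python
import math
import struct
import zlib

SAMPLE_RATE = 44100

# One second of a 440 Hz tone as 16-bit mono PCM (an artificial test signal).
samples = [
    int(20000 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE))
    for t in range(SAMPLE_RATE)
]
pcm = struct.pack(f"<{len(samples)}h", *samples)

packed = zlib.compress(pcm, 9)      # general-purpose lossless compression
restored = zlib.decompress(packed)  # must reproduce the input exactly

print("original bytes:  ", len(pcm))
print("compressed bytes:", len(packed))
print("identical after decompression:", restored == pcm)
```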

When talking about codecs for compressing audio information, we should distinguish between the concepts of codec and media container. A container is, simply put, a kind of standard shell in which audio data compressed by one codec or another is stored. For example, an MP4 container can store data compressed by various codecs, in particular the lossy codec AAC, the lossless codec ALAC and others. Usually, different file extensions are used for the different types of data stored in an MP4 container. In the same way, a WAV file can store various data, for example audio compressed in the popular MP3 format or uncompressed PCM data. In the case of WAV files, the file name extension remains unchanged (.wav), and the files differ only in their internal structure.
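
The difference in internal structure can be seen by reading the format tag stored in the header of a WAV file. The sketch below builds a small PCM file in memory with the standard `wave` module and then walks its RIFF chunks by hand. The helper name `wav_format_tag` is our own, and the `FORMAT_NAMES` table lists only the two tags relevant here (1 for PCM, 0x0055 for MPEG Layer 3).

```python
import io
import struct
import wave

# Partial list of WAV format tags; many more exist.
FORMAT_NAMES = {0x0001: "PCM", 0x0055: "MPEG Layer 3"}

def wav_format_tag(data):
    """Walk the RIFF chunks and return the format tag from the 'fmt ' chunk."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, pos)
        if chunk_id == b"fmt ":
            (tag,) = struct.unpack_from("<H", data, pos + 8)
            return tag
        pos += 8 + size + (size & 1)  # chunks are padded to an even length
    raise ValueError("no 'fmt ' chunk found")

# Build a tiny stereo 16-bit PCM file in memory (100 frames of silence).
buffer = io.BytesIO()
with wave.open(buffer, "wb") as w:
    w.setnchannels(2)
    w.setsampwidth(2)
    w.setframerate(44100)
    w.writeframes(b"\x00\x00" * 2 * 100)

tag = wav_format_tag(buffer.getvalue())
print(tag, FORMAT_NAMES.get(tag, "unknown"))
```

The same function applied to a .wav file that carries MP3 data would return a different tag, although the extension is identical.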

List of programs

Table 3.1 lists the programs described in this topic. These are mostly universal programs; you can choose any of them to encode certain files. The default input file format is WAV, but almost all programs can convert music between formats and “decompress” source files to standard WAV.

Table 3.1. Programs and file formats
Programs and formats MP3 OGG WMA AAC VQF FLAC WavPack APE ALAC
Lame +
Winlame + + +
RazorLame +
Windows Media Encoder +
aoTuV +
iTunes +
ImToo WMA MP3 Converter* + +
MP4 Converter**
ImToo Audio Encoder + + + + + + +
Flac Frontend +
Cue Splitter ***
WavPack Frontend +
Monkey's Audio +
dBpoweramp + + + + + + + +

* ImToo WMA MP3 Converter supports a large number of input file formats, but the output can only be MP3 and WMA.

** MP4 Converter converts video files of various formats into a format understood by Apple iPod players.

*** A program for splitting large audio files according to index maps.

Lossy compression

Existing lossy audio compression formats include the "big four" - MP3, WMA, Ogg Vorbis and AAC. Your MP3 player is almost 100% likely to support one of these formats, and most likely several. Knowledge about some features of formats will be especially useful when working with audio information in practice. For example, in the following lectures we will look at software for working with audio, in particular, we will dwell in detail on the conversion of audio from one format to another, and if you know a little more about the data compression format than its name, this can help you a lot. So, let's start with the most popular format.

MP3

The full name of MP3 is MPEG-1 Audio Layer 3. MP3 is a lossy audio compression format that has achieved incredible popularity around the world. There are also variants of the standard: MPEG-2 Layer 3 and MPEG-2.5 Layer 3.

The history of MP3 begins in the late 1980s, when a working group of engineers from the Fraunhofer Society began working on the DAB (Digital Audio Broadcast) project. The project was part of the EUREKA research program and within its framework was known as EU-147. MP3 was the result of reworking the Musicam and ASPEC audio compression standards, adding new original concepts to the ideas used in these standards. Thomson is also directly related to the standard.

The standard was developed in the early 1990s; its final version was published in 1995, but back in 1994 the first software MP3 encoder, called l3enc, was created. The .mp3 extension was then chosen for files encoded in this format, and in 1995 the first software MP3 player, Winplay3, appeared and became available to the general public. Thanks to high-quality music at small file sizes, and to the emergence of simple, high-quality software for playing and creating MP3 files (for example, the widely known and still alive WinAmp, which appeared in the mid-1990s), the standard gained enormous popularity and still enjoys it.

MP3 Features

Speaking about the capabilities of the MP3 format, perhaps we should start with the format in which music is stored on ordinary music CDs (Audio CD). The sound recorded on such discs has very specific characteristics: 44.1 kHz, 16-bit, stereo. In plain language, this means that every second of sound consists of 44,100 samples per channel (this parameter is called the sampling frequency), each of which has a size of 16 bits (that is, two bytes), and information is recorded for two channels, right and left. As a result, storing one second of music in Audio CD format takes 44100*16*2=1411200 bits, or 176400 bytes, or about 172.3 KB. Thus, a five-minute composition takes 176400*5*60=52920000 bytes, that is, about 50 megabytes of disk space. Even today, with hard drives of tens and more often hundreds of gigabytes at the disposal of ordinary users, it is quite difficult to imagine a music collection consisting solely of sound recorded in such a wasteful format. What can we say about hard drives with a capacity of a couple of gigabytes, which were the ultimate dream of many ten years ago.
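
The same arithmetic can be written as a small Python function, which makes it easy to try other durations or parameters. The function name `pcm_bytes` is our own; the default values are the Audio CD parameters from the text.

```python
def pcm_bytes(seconds, sample_rate=44100, bits_per_sample=16, channels=2):
    """Size in bytes of uncompressed PCM audio of the given duration."""
    return seconds * sample_rate * bits_per_sample * channels // 8

one_second = pcm_bytes(1)
five_minutes = pcm_bytes(5 * 60)

print(one_second, "bytes per second,", round(one_second / 1024, 1), "KB")
print(five_minutes, "bytes for five minutes,", round(five_minutes / 1024 ** 2, 1), "MB")
```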

Files compressed into MP3 with virtually no loss of the original quality take up 6-10 times less space than the original. That is, a huge 50-megabyte file turns into a quite decent 5-megabyte one. By contrast, if you compress such a file with conventional compression algorithms (RAR or ZIP, for example), which are used for ordinary files, you will get, at best, a 50% gain (that is, a file of about 25 MB). What's the matter? Why is MP3 able to compress files so much without degrading their quality? The answer lies in the word “virtually”. Ordinary compression does not change the quality of a composition, it preserves it completely, whereas MP3 performs manipulations on the audio that can affect its quality.

How MP3 works

MP3 is based on many compression mechanisms, in particular so-called adaptive encoding, which relies on psychoacoustic models that take into account the peculiarities of human perception of sound and remove from it everything “unnecessary”, that is, everything the average person cannot hear when listening to a composition. As we have already said, if you do not try to compress the composition too much and use the highest-quality MP3 encoding settings, its size will be approximately 6-10 times smaller than the CD-quality original, and the quality of the two recordings will be practically identical: even a professional will hardly tell them apart. At higher compression levels, the losses (also called compression artifacts) are much more audible, but those who use highly compressed MP3 music take this step consciously. For example, highly compressed MP3s are extremely popular among cell phone owners: often the device’s built-in memory is too small to hold a sufficient number of high-quality MP3s, so the owner sacrifices recording quality for quantity. But let's return to the principles of operation of MP3, in particular to psychoacoustic models.

Adaptive coding based on psychoacoustic models applies knowledge about the peculiarities of human perception of sound. If two sound signals are played simultaneously and one of them is weaker, the weaker signal is drowned out (or, as they say, masked) by the stronger one. As a result, a person hears the stronger sound but not the weaker one, so information about the weaker sound is simply discarded. The same thing happens if a quiet sound comes immediately after a loud one: the loud sound causes a temporary decrease in hearing sensitivity, the quiet sound is inaudible, and information about it can also be removed. In addition, when processing musical compositions, it is taken into account that most people cannot hear signals whose power is below a certain level, which differs for different frequency ranges.
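
Simultaneous masking can be illustrated with a deliberately crude toy model. Everything specific in it is an assumption made for the example: the function name, the rule itself (a component is dropped when a neighbour within 500 Hz is at least 30 dB louder) and the sample data. Real psychoacoustic models are far more detailed, so this only shows the shape of the decision.

```python
def apply_masking(components, max_distance_hz=500.0, min_gap_db=30.0):
    """Toy masking rule. components: list of (frequency_hz, level_db).

    A component is treated as inaudible when some other component lies
    within max_distance_hz of it and is at least min_gap_db louder.
    Both thresholds are invented for this illustration.
    """
    kept = []
    for freq, level in components:
        masked = any(
            abs(freq - other_freq) <= max_distance_hz
            and other_level - level >= min_gap_db
            for other_freq, other_level in components
        )
        if not masked:
            kept.append((freq, level))
    return kept

# A loud tone at 1000 Hz, a quiet neighbour at 1100 Hz, a quiet distant tone.
components = [(1000.0, 80.0), (1100.0, 40.0), (5000.0, 45.0)]
audible = apply_masking(components)
print(audible)
```

The quiet tone next to the loud one is discarded, while an equally quiet tone far away in frequency is kept.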

Bitrate

In MP3 encoding, the so-called bitrate (the width of the data stream), which is set during encoding, is of particular importance. For example, the Audio CD material already described can be encoded at bitrates from the maximum of 320 Kbps (kilobits per second; this unit is also written kbps, kbs, kb/s) down to 128 and below. In practice, at a bitrate below 128 Kbps the sound quality drops so much that it makes sense to encode at such a bitrate only when there is no alternative.

Different source materials can be encoded at the same bitrate: the sound may be monophonic rather than stereophonic, or the sampling frequency or sample size may differ. Still, bitrate is a very important overall indicator of the quality of an MP3 file. In general, the higher it is, the better. Very often, MP3 recordings made from Audio CD-quality material use a bitrate of 192 Kbps. It is quite suitable for the purpose; however, when listening to such recordings on high-quality audio equipment (especially in comparison with the original Audio CD), compression artifacts are noticeable.

However, it cannot be said unequivocally that any musical composition, say, recorded at a bitrate of 192 Kbps is better than a composition recorded at 128 Kbps. Much depends on the music itself, on the encoder, on the original recording quality, as well as on what type of bitrate was used when recording the composition.

So, the simplest type of bitrate is constant bitrate - or CBR (Constant Bit Rate). This bitrate does not change during encoding of the entire composition, that is, every second of sound, regardless of its content, is encoded with the same number of bits.
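
Because every second takes the same number of bits, the size of a constant-bitrate file follows directly from the bitrate and the duration. The sketch below is our own illustration: the function name is made up, "kilobit" is taken as 1000 bits, and frame headers and tags are ignored, so real files will be slightly larger.

```python
def cbr_size_bytes(bitrate_kbps, seconds):
    """Approximate size of a constant-bitrate stream (headers and tags ignored)."""
    return bitrate_kbps * 1000 * seconds // 8

CD_BITRATE_KBPS = 44100 * 16 * 2 / 1000  # uncompressed Audio CD stream, 1411.2 kbps

sizes = {kbps: cbr_size_bytes(kbps, 5 * 60) for kbps in (128, 192, 320)}
for kbps, size in sizes.items():
    ratio = CD_BITRATE_KBPS / kbps
    print(f"{kbps} kbps: {size} bytes for five minutes, {ratio:.1f} times smaller than CD audio")
```

For a five-minute track this gives roughly 4.8 MB at 128 kbps and 12 MB at 320 kbps, compared with about 53 MB of CD audio.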

Besides constant bitrate there are variable bitrate, or VBR (Variable Bit Rate), and ABR (Average Bit Rate), which can be called a combination of VBR and CBR. With ABR, the user sets the average bitrate before encoding starts, and during encoding the program, using a variable bitrate, makes sure that the overall bitrate stays within the limit set by the user. The quality of the output file is thus worse than with VBR (but slightly better than with CBR at a similar bitrate), while the file size can be flexibly and precisely adjusted.

During encoding, the original audio signal is divided into sections called frames. Each frame is encoded separately, and during decoding, the audio signal is reconstructed from the decoded frames. Of particular interest when encoding MP3 is the method of processing the stereo signal - let's look at this issue in more detail.