Chapter 3

Production

Section Editor: Jay Veloso Batista

Production of media is the beginning of our workflow chain. Most often, compelling content is produced through a collaboration of talented individuals and technological tools. To provide you with a basis for understanding, we need to start with the origination and capture of the media itself, along with its metadata. This section covers motion picture and video production together because, while for many years they were separate processes, with film dominating cinema production, modern tools now supply quality media and have relegated film to specialty projects.

Cameras

Capturing images began when scientists discovered the light-sensitive properties of certain chemicals and began to experiment with substrates and supports, leading to the forerunner of the modern camera and the initial blossoming of tintypes during the mid-nineteenth century. By the twentieth century, innovations had led to motion capture: a video camera is a camera used for electronic motion picture acquisition, initially developed for the television industry but now common in all applications.

The earliest video cameras, based on the mechanical Nipkow disk, were designed by John Logie Baird and used in experimental broadcasts during the 1920s and 1930s. All-electronic designs based on the video camera tube, such as Vladimir Zworykin’s Iconoscope and Philo Farnsworth’s image dissector, replaced the Baird system by the 1930s. These remained in wide use until the 1980s, when technological breakthroughs introduced solid-state image sensors such as CCDs and CMOS active pixel sensors into digital camera systems, completely eliminating problems common to tube technologies, such as “image burn-in,” where an overly bright light or a stationary picture would imprint on the tube. For the first time, these developments made digital video workflows practical. Around the world, digital television gave a boost to the manufacture of digital video cameras, and by the 2010s most video cameras were digital for professional and consumer applications.

With digital video capture an affordable technology, the distinction between professional video cameras and movie cameras disappeared. Today, mid-range cameras dedicated to television and other professional work are termed professional video cameras.

Content creation with video cameras falls into two core industrial applications. The first, a reflection of the early days of broadcasting, is live event production, where the camera provides real-time images directly to a screen for immediate viewing. While a few production systems still serve live television, especially sports event production, most live camera connections are dedicated to security, police, military, and industrial situations where monitoring is required. In the second application, the images are recorded to a storage device for archiving or further processing. For many years, videotape was the primary recording format, although it was gradually replaced by optical disc, hard disk, and then flash memory. Today, recorded video is the basis of television and movie production, as well as surveillance tasks where unattended records of a situation are required for post-event analysis.

Jeff Stansfield of Advantage Video in Los Angeles provides this history lesson on the evolution of cameras and associated technology.

For us to really understand the audio and video acquisition formats, we should first spend a little time looking into where we came from and understand the history of media acquisition. As we look at the past, we can see the battles and choices that led us to where we are today. Those choices really started back in 1982, when Sony developed a professional videocassette product called the Betacam, followed by the Betacam SP and high-end digital recording systems like the Digital Betacam. This was an important step on the path to where we are now for three reasons.

Firstly, the videocassettes all use the same shape with only very slight variations, meaning vaults and other storage facilities do not have to be changed when using a new format. This saved the studios and production companies tons of money in storage fees. It also let a production go right from the set to post without going through any developing or film processing.

Secondly, this started bringing production and post-production together, which had been separate up to this time. Today most production companies also do their own post.

The third advancement this brought us was non-linear editing (NLE, or NLVE if you add the word video), also called non-destructive editing. This also applied to digital audio workstations (DAWs) for audio post. There had already been NLEs, but systems like Lucasfilm’s EditDroid, one of the best known of the 1980s, were very cumbersome and needed a lot of LaserDiscs to support editing.

There had even been NLEs as far back as the early 1970s, with systems like the CMX, but the game changer came from Herb Dow, A.C.E., who used the Ediflex, which could use a bank of multiple Sony, JVC, and Panasonic Video Cassette Recorders (VCRs) as sources. After Dow edited “Still the Beaver” in 1985, the system was adopted in about 75% of network programs and started to penetrate the film community, mostly for shorts and documentaries. Now you could go from filming to post and to air all on one medium.

In January 1984, Eastman Kodak announced the Video8 camera technology, and in 1985 Sony introduced the Handycam. This was an important step, as it gave us the “prosumer” camera, bringing a professional level of production to almost anyone. It allowed many more people who just wanted to produce their own small projects to become filmmakers and brought more interest in growing the independent community.

By late 1989 the first computer-based system, the EMC2, was released. One year later, Avid® showed off their system based on the Apple Macintosh. Even though it ran at only 15 frames per second (FPS) and Lucasfilm’s EditDroid was much higher quality, the die was cast and the market embraced these technological advances.

This also gave us the codec (coder/decoder), which let us deal with video formats in new ways, and almost all video and audio files use some kind of codec today. This is not to be confused with AVI, which is sometimes mistakenly described as a codec: AVI is actually a container format. There are many container formats, like AVI, ASF, QuickTime, RealMedia, and Matroska. Within those containers there are many actual codecs, like DivX, H.264, Sorenson, and dozens more. These containers are packages that can hold multiple encoded streams, one each for audio, video, and other playout sources.

The next advancement came after Disney engineers developed a long-form solution that allowed the Macintosh to go beyond the 50GB storage limitation of that time. In April of that year, Avid® and others introduced new systems that could take advantage of the expanded storage, and within two years the Avid® Media Composer had displaced most 35mm film editing systems in motion picture studios and TV stations worldwide. This made Avid® the undisputed authority in “off-line” non-linear editing systems. Even today the Avid® products are what people first think of when they talk about professional NLEs.

Apple set another bar by acquiring the software developed by Macromedia engineers for the Media 100 hardware after Macromedia decided to get out of the video editing business. The Macromedia product evolved to be named “Final Cut,” and Apple eventually released it as Final Cut Pro. This full-featured, professional-grade software was priced at only $1,499.00. It was revolutionary mostly for its price, as it, like the Video8 and Handycam, allowed small businesses or high schools to create their own projects and finish them. Until then, the Avid® tools started at $28,000.00 for just the software and a basic breakout box. Final Cut Pro was also revolutionary because it was based on QuickTime for media handling and it featured the ability to ingest video and audio via a computer FireWire connection, so you did not need expensive cards or breakout boxes.

Companies like Adobe jumped on this workflow within a few years. Avid® decided to stay with its proprietary, expensive hardware system and suffered as Apple and Adobe took more of Avid®’s market share, as well as the millions, and now billions, of entry-level filmmakers. The advantages of a FireWire workflow caused many camera manufacturers to produce new cameras and decks with FireWire, driving the barrier to entry even lower.

The next big change came when the RED One camera started shipping in August of 2007. RED lowered the cost of a high-end digital cinema production camera from that of a Panavision rig, at about $120,000 fully loaded, down to under $50,000. This was also the writing on the wall for film as the dominant production workflow. Thanks to RED, the professional independent film community started to really grow, now that the cost of equipment was somewhat affordable.

All the elements had now come together: we had professional NLEs, cameras that could shoot in high resolution, and prices almost anyone could afford. This triggered the next major innovation: large-screen productions, which are still evolving and remain the only significant use for 35mm.

In 2003, Sony introduced tapeless video media to the world with the XDCAM format and its Professional Disc (PFD) media. Panasonic followed in 2004 with its P2 format, which used solid-state memory cards as a recording medium for DVCPRO-HD video. Then in 2006 Panasonic and Sony collaborated on the AVCHD format as an inexpensive, tapeless, high-definition video format. Today AVCHD camcorders are produced by Sony, Panasonic, JVC, Canon, and others.

The Types of Cameras and Their Uses

The earliest video cameras were mechanical flying-spot scanners, in use during the 1920s and 1930s in the period of mechanical television. Improvements in tube technology led to the development of video camera tubes in the 1930s, and the underlying technology changed to support television broadcasting. Early cameras were extremely large devices, typically constructed in two sections and often weighing over 250 pounds, too large for an operator to hand carry. The camera section held the lens, tube pre-amplifiers, and other necessary electronics and was connected by a large-diameter multicore cable to the remainder of the camera electronics, mounted in a separate studio room or a remote broadcast truck. Standalone, the camera head was unable to generate a video picture signal. Once created, video signals were output to the studio for transmission or recording. By the 1950s, solid-state electronics miniaturization had progressed to the point where some monochrome cameras could operate standalone and handheld. However, established over decades of practice, the studio configuration remained static, with cameras connected via a large cable bundle that carried the signals back to the camera control unit (CCU) in the studio or truck. The CCU was used to align and operate the camera’s functions, including exposure, system timing, and video and black levels.

By the 1950s in the U.S. and the 1960s in Europe, the first color cameras were introduced, and system complexity increased as there were now three or four pickup tubes; size and weight also increased. Hand-operated color cameras did not come into general use until the early 1970s, with the first generation of cameras still split into a camera head unit, the body of the camera containing the lens and pickup tubes, held on the shoulder or a body brace in front of the operator and connected via a cable bundle to a backpack CCU. For field work, a separate Video Tape Recorder (VTR) was still required to record the camera’s video output. “Camcorders” combine a camera and a VTR. Designed with extreme mobility in mind, these are widely used for television production, home movies, electronic news gathering (ENG), and similar applications. Since the transition to digital video cameras, most cameras have built-in recording media and essentially are camcorders. When these products were first introduced, the typical recorder was either a portable 1" reel-to-reel VTR or a portable 3/4" U-matic VCR (Video Cassette Recorder). To operate these early systems, the camera was carried by the camera operator while a tape operator carried the portable recorder. By 1976, new product designs allowed camera operators to carry on their shoulders a one-piece camera containing all the electronics to output a broadcast-quality composite video signal; however, a separate videotape recording unit was still required.

Due to the cost of producing and editing film, electronic news gathering (ENG) cameras replaced 16mm film cameras for TV news production in the mid-1970s. Portable videotape production also enabled quicker turnaround for timely stories compared with the need to chemically process film before review or editing.

Advances in solid-state technology led to Charge-Coupled Device (CCD) imagers, which were introduced in the mid-1980s. The first CCD cameras could not compete with the color quality or resolution of tube cameras of the same period, but the benefits of CCD technology, including smaller, lighter cameras, a more stable image not prone to burn-in or lag, and easy set-up calibrations, meant that development of CCD imagers quickly expanded their use in the industry. Once the quality of the images reached the level of the tube sensor, CCD cameras began displacing tube-based systems, which were almost completely retired by the start of the 1990s. During the 1990s, cameras with the recorder permanently mated to the camera head became the norm for ENG. At the same time, in studio camera design, the camera electronics shrank as CCD imagers replaced the pickup tubes. The thick multicore cables connecting the camera head to the CCU were replaced by TRIAX connections, a slender video cable that carried multiple video signals, intercom audio, and control circuits, and could be run for a long distance, up to a mile if necessary. While the camera size was reduced and the electronics no longer required over-sized housings, the typical “box” camera shape remained, as it must continue to hold large studio lenses, teleprompters, an electronic viewfinder (EVF), and other gear needed for studio and sports production. Electronic field production cameras are sometimes mounted in studio configurations inside a cage to support additional studio accessories.

In the late 1990s, as High Definition Television broadcasting commenced, HDTV cameras suitable for news and general-purpose work were introduced. They delivered higher image quality, but their operation was identical to that of their standard-definition predecessors. At the turn of the century, new recording methods for cameras were announced, including interchangeable hard drives and systems based on flash memory, which ultimately supplanted other forms of recording media.

Professional-grade video cameras are designed for different purposes. Modern video cameras, like those used for television production, may be studio-based or designed for electronic field production (EFP). These systems generally offer detailed manual control of all parameters for the camera operator, sometimes to the exclusion of automated operation, and usually use three sensors to separately record the red, green, and blue color images.

“Camcorders” combine a camera and a Video Cassette Recorder (VCR), a hard drive, or other recording device in a single package. Some action cameras have 360° recording capabilities and there are specialty systems for super speed capture and systems optimized for live event playback.

Closed-circuit television (CCTV) employs pan-tilt-zoom (PTZ) cameras for security, surveillance, and monitoring requirements. Designed to be small, inconspicuous, and to support unattended operation when used in industrial or scientific settings, they are built for environments typically inaccessible or uncomfortable for humans and are hardened for hostile conditions (e.g., high radiation or heat, toxic chemical exposure, etc.). Webcams are video cameras that stream a live video feed to a computer. Camera phones have video cameras incorporated into mobile devices, and recent quality enhancements of these platforms have led some independent film productions to rely on mobile phones for creative effect. “Lipstick cameras” are so named because the combined lens and sensor block is similar in size and appearance to a lipstick container. These miniature cameras are either hard mounted in a small location, like a race car, or on the end of a boom pole. The sensor block and lens are separated from the rest of the camera electronics by a long, thin multi-conductor cable. The camera settings are manipulated from this remote electronics unit, while the lens settings are normally set when the camera is mounted in place.

The modern professional video camera, still referred to as a television camera even though its use has spread beyond broadcasting, is a feature-rich device for creating electronic moving images. In the 2000s, major manufacturers such as Sony and Philips introduced completely digital professional video cameras. These cameras used CCD sensors and recorded video digitally on flash memory storage. They were followed by digital High Definition (HDTV) cameras. As digital technology improved, supporting the transition to digital television transmissions, digital professional video cameras became dominant in television studios, electronic news gathering, and EFP. With the advent of digital video capture in the 2000s, the distinction between professional video cameras and movie cameras disappeared, as the internal technology and mechanism for capture became the same for both applications. For our purposes, mid-range cameras dedicated to television and professional media collection are termed professional video cameras.

Most professional cameras utilize an optical prism block directly behind the lens. This prism block (a trichroic assembly comprising two dichroic prisms) separates the image into the three primary colors, red, green, and blue, directing each color into a separate charge-coupled device (CCD) or active pixel sensor (CMOS image sensor) mounted to the face of each prism. Some high-end consumer cameras also do this, producing a higher-resolution image with better color fidelity than is normally possible with a single video pickup. In both single-sensor and triple-sensor designs, the weak signal created by the sensors is amplified before being encoded into analog signals for use by the viewfinder and encoded into digital signals for transmission and recording. The analog outputs were normally in the form of either a composite video signal, which combined the color and luminance information into a single output, or an R-Y, B-Y, Y component video output through three separate connectors.

For our purposes, it is helpful to understand the differences between camera applications in their typical settings. Most television studio cameras stand on the floor, usually with pneumatic or hydraulic mechanisms called pedestals to adjust the height, and are mounted on wheels. The CCU is connected via TRIAX, fiber optic, or multicore cable, although multicore is very rarely used in today’s studios. The CCU, along with genlock and other equipment, is installed in the production control room (PCR), often known as the “Gallery,” of the television studio. When used outside a formal television studio in outside broadcasting (OB), cameras are often on tripods that may or may not have wheels. Studio cameras are light and small enough to be taken off the pedestal and, with the lens changed to a smaller size, used on an operator’s shoulder or mounted on a dolly or a crane, making them much more versatile than previous generations of studio cameras. These cameras are outfitted with a “tally light,” a small signal lamp that indicates, for the benefit of those being filmed as well as the camera operator, that the camera is “live” and its signal is being used for the “main program” when the light is lit.

Electronic News Gathering (ENG) video cameras were originally designed based on input and direction from news and field production camera operators. ENG cameras are larger and heavier than consumer hand-held models to dampen small movements and are typically supported on the camera operator’s shoulder, freeing a hand to operate the zoom lens control. These cameras can be mounted on tripods with fluid heads using a quick-release plate and have interchangeable lenses. The lens is focused manually without intermediate servo controls. There are usually options to use a behind-the-lens filter wheel for selecting neutral density light filters. Accessible controls are implemented with hard, physical switches, usually in consistent locations on the camera body, controlling gain select, white/black balance, color bar select, and record start; these are not selected via software or menu selection, to better support manual operation in the field. All settings, like white balance, focus, and iris, can be manually adjusted, and automatic controls can be completely disabled to provide better creative management of the image. Professional-grade BNC connectors are supplied for video output and the “genlock” synchronization feed input, and a minimum of two professional XLR input connectors are supplied for audio. Often there is a direct input for a portable wireless microphone, and audio is fully adjustable via easily accessed knobs. In addition to an electronic viewfinder, the video feed can be output to an external CRT viewfinder, and as a professional tool the camera is equipped with a time code control section, allowing time presets. Multiple-camera setups can be time-code synchronized (“jam-synced”) or forced to synchronize to a master clock. Usually these professional tools have “bars and tone” available in-camera, allowing the operator to insert the Society of Motion Picture and Television Engineers (SMPTE) color bars, a reference signal that simplifies calibration of monitors and level setting when duplicating and transmitting the picture.

Electronic field production (EFP) cameras are similar to studio systems in that they are used in multiple-camera switched configurations, but their main application is deployment outside a controlled studio environment, for concerts, sports, and live news coverage of special events. Versatility is the main advantage of these cameras: they are designed to be flexible, carried on the shoulder or mounted on camera pedestals and cranes, and can support the large, very long focal length zoom lenses made for studio camera mounting. As opposed to camcorder ENG cameras, these cameras have no recording ability and transmit their signals back to the broadcast truck through fiber optic, TRIAX, or a radio frequency link. Remote cameras are a separate type, typically very small camera heads designed to be operated by remote control. Despite their diminutive size, they are capable of performance comparable to the larger ENG and EFP cameras. “Block” cameras are so named because the camera head is a small block, usually smaller than the lens. Some are completely self-contained, while other block cameras contain only the sensor block and pre-amps and require connection to a separate camera control unit in order to operate. All the functions of the camera can be controlled from a distance, including lens focus and zoom. Typically, these cameras are pan-and-tilt mounted and can be placed in a stationary position, such as atop a pole or tower, in a corner of a broadcast booth, or behind a basketball hoop. Block cameras can also be placed on robotic dollies, at the end of camera booms and cranes, or “flown” in a cable-supported harness, as seen in sports productions where a camera has a bird’s-eye view of the field of play.

The Basics of Digital Cameras

The initial digital camera concept began with Eugene F. Lally of the Jet Propulsion Laboratory, who was applying mosaic photosensor technology to capture digital images. In 1961, his idea was to take pictures of the planets and stars while travelling through space to provide information for locating the astronauts’ position. Later, in 1972, a Texas Instruments employee, Willis Adcock, had an idea for a filmless camera (U.S. patent 4,057,830). In both of these cases the concept was good, but the technology was not ready to support the innovation. By 1975 a commercial all-digital camera called the Cromemco Cyclops was introduced. Originally a Popular Electronics hobbyist construction project, the design was published in the February 1975 issue of the magazine, and it used a 32×32 metal-oxide-semiconductor sensor to capture images. Later that same year, an engineer at Eastman Kodak named Steven Sasson invented and built the first self-contained electronic camera that used a charge-coupled device (CCD) image sensor. Early digital camera designs were employed for military and scientific applications, and by 1976 medical and news applications followed. As professional cameras began to adopt digital technology, these ideas and technical innovations were applied to professional designs.

The two major types of digital image sensor are CCD and CMOS. A CCD sensor has one amplifier for all the pixels, while each pixel in a CMOS active-pixel sensor has its own amplifier. Because of this design, compared to CCDs, CMOS sensors use less power. While there are many arguments about the viability of the two methods, and which provides the best images, overall final image quality is more dependent on the image processing capability of the camera than on sensor type.

Resolution is important, whether the system is Standard Definition (SD), High Definition (HD), or Ultra High Definition (UHD). The resolution of a digital camera is often limited by the image sensor that turns light into discrete signals. The brighter the image at a given point on the sensor, the larger the value that is read for that pixel. Depending on the physical structure of the sensor, a color filter array may be used, which requires demosaicing to recreate a full-color image. The number of pixels in the sensor determines the camera’s “pixel count.” In a typical sensor, the pixel count is the product of the number of rows and the number of columns. For example, a 1,000 by 1,000-pixel sensor would have 1,000,000 pixels, or 1 megapixel.

Image sharpness is another measure of camera performance. As you would expect, the final quality of any captured image depends on all the optical transformations in the chain of producing the image, and the weakest link in the optical chain determines the final image quality. In the case of a digital camera, a simplistic way of expressing it is that the lens determines the maximum sharpness of the image while the image sensor determines the maximum resolution.

Let’s take a deeper look into the heart of modern cameras to better understand the methods employed to capture images, recognizing that each of these methods generates more data. When digital innovations were first introduced, three different image capture methods were developed, and the hardware sensors and color filters were adapted to support each method. The three types are single-shot, multi-shot, and scanning. Single-shot capture cameras use either one sensor chip with a “Bayer” filter, or three separate image sensors for the primary additive colors red, green, and blue that are exposed to the same image via a beam splitter; the latter are called three-CCD cameras. A “Bayer” light filter is an arrangement of color filters in a mosaic over the capture pixel array. The Bayer pattern is a repeating 2×2 mosaic of filters, with green at opposite corners and red and blue in the other two positions. This means that green takes twice the proportion of filters, a deliberate choice to mimic the properties of the human eye, which derives most of its brightness perception from green and is more sensitive to brightness than to hue or saturation. Some cameras have used a four-color filter pattern, typically adding two different hues of green, to provide potentially more accurate color at the cost of a more complicated interpolation process.
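
To make the Bayer layout concrete, here is a minimal Python sketch that builds the per-channel sample masks for one common arrangement of the 2×2 tile (an RGGB placement is assumed purely for illustration; manufacturers vary where red and blue sit within the tile):

```python
import numpy as np

def bayer_masks(height, width):
    """Boolean masks marking which photosites sample red, green, and blue
    in an assumed RGGB Bayer mosaic. Green occupies the two opposite
    corners of every 2x2 tile, so it covers half of all photosites."""
    rows = np.arange(height)[:, None]
    cols = np.arange(width)[None, :]
    red   = (rows % 2 == 0) & (cols % 2 == 0)
    green = (rows % 2) != (cols % 2)          # the two opposite corners
    blue  = (rows % 2 == 1) & (cols % 2 == 1)
    return red, green, blue

r, g, b = bayer_masks(4, 4)
print(g.mean())   # 0.5 -> half the photosites are green, as described above
```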

Multi-shot cameras were designed to expose the digital sensor to the image through a sequence of lens aperture openings. As a number of manufacturers developed hardware using this method, the multi-shot technique has been applied in several different ways. Originally, the systems adapted a single image sensor and sequentially passed three filters in front of it to obtain additive color information. Later, a multi-shot innovation called micro-scanning was developed; this method employs one sensor chip with a color filter, typically a Bayer filter, and the hardware physically shifts the sensor in the focal plane of the lens to build an image with better resolution than the native resolution of the chip supports. Later developments combined both methods without a Bayer filter on the chip.

Scanning, as its name suggests, captures images by moving the sensor across the focal plane, in an operation similar to a document scanner. These cameras can offer very high-resolution images. Such systems employ “linear” or “tri-linear” sensors, a single line of photosensors or three lines for the three colors, and the scanning operation is accomplished by physically moving the sensor or by rotating the whole camera. These digital rotating line cameras offer images of very high total resolution and the best color fidelity.

Single-shot systems with Bayer filters are typically consumer models and require an optical anti-aliasing filter to reduce aliasing caused by the reduced sample rate of the different primary color images. The missing color information is then recovered with a demosaicing software algorithm that interpolates the sensor data to supply a full range of RGB color values. Professional cameras that use a beam-splitter single-shot three-chip approach, a three-filter multi-shot approach, color co-site sampling, or specialized sensors do not need anti-aliasing filters or demosaicing software. Software in a raw converter program, such as Adobe Camera Raw, interprets the color data from the sensor to obtain a full-color image, because the RGB color model requires three intensity values for each pixel: one measurement each for red, green, and blue. Other color models also require three or more values per pixel. A single sensor element cannot simultaneously record these three intensities, so a color filter array (CFA) must be used to selectively filter a particular color for each pixel. Using software, any color intensity values not captured for a particular pixel can be interpolated from the values of adjacent pixels that sampled the color being calculated.
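
As a rough illustration of that interpolation step, the following Python sketch performs a naive bilinear demosaic using a normalized convolution; it assumes the masks from the earlier Bayer sketch and is far simpler than the edge-aware algorithms used by real raw converters such as Adobe Camera Raw:

```python
import numpy as np
from scipy.ndimage import convolve

def demosaic_bilinear(raw, masks):
    """Estimate a full RGB image from a single-sensor Bayer capture.

    `raw` is the 2-D sensor readout and `masks` the per-channel boolean
    masks. Each channel's missing samples are filled in from neighboring
    samples of the same channel; sampled sites are also lightly smoothed,
    which is acceptable for a demonstration but not for production use.
    """
    kernel = np.array([[0.25, 0.5, 0.25],
                       [0.5,  1.0, 0.5 ],
                       [0.25, 0.5, 0.25]])
    channels = []
    for mask in masks:                                   # red, green, blue
        weights = convolve(mask.astype(float), kernel, mode='mirror')
        values  = convolve(raw * mask, kernel, mode='mirror')
        channels.append(values / np.maximum(weights, 1e-9))
    return np.stack(channels, axis=-1)                   # H x W x 3 image
```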

Since 2008, manufacturers have offered consumer DSLR cameras with a “movie” mode capable of recording high-definition motion video. A DSLR with this feature is often known as an HDSLR or DSLR video shooter. Early HDSLRs captured video using a nonstandard video resolution or frame rate. The first DSLR introduced with an HD movie mode, the Nikon D90, captures video at 720p24 (1280×720 resolution at 24 frames/s). HDSLRs use the full camera imager area to capture HD video, though not every pixel is read out, which can introduce video artifacts. Compared with the higher-resolution image sensors found in typical camcorders, the much larger sensors in HDSLRs yield distinctly different image characteristics: HDSLRs can achieve much shallower depth of field and superior low-light performance. Still, because of the low ratio of active pixels to total pixels, these consumer models are more susceptible to aliasing artifacts in scenes with particular textures, and the rolling shutter tends to introduce more artifacts. Because of the DSLR’s optical construction, HDSLRs usually lack important video functions found on standard dedicated camcorders, such as autofocus while shooting, powered zoom, and an electronic viewfinder/preview screen. These handling limitations have prevented HDSLR cameras from taking their place in the industry as simple point-and-shoot camcorders, and they require special planning and skills to gather professional-quality images.

Over the past ten years, video functionality in these consumer models has continued to improve, including higher video resolution and bitrate, improved automatic control (autofocus) and manual exposure control, and support for formats compatible with high-definition television broadcast, Blu-ray disc mastering, and Digital Cinema Initiatives (DCI) specifications. Models now offer broadcast-compliant 1080p24 video. These developments have sparked a digital filmmaking revolution, and the “Shot On DSLR” badge of honor is becoming a mainstay for documentary and independent producers. An increasing number of films, documentaries, television shows, and other productions are utilizing the quickly improving features. Affordability and convenient size compared with professional movie cameras are driving rapid adoption.

The Implications of Camera Imaging Sensor Choice on System Requirements

Let’s quickly review what we know about image sensors: An imaging sensor is a detection device that measures light and conveys the information that constitutes a total image. The sensors work by converting the variable attenuation of light waves as they pass through or reflect off objects into electrical signals, small bursts of current that convey the data. Early analog sensors for visible light were video camera tubes. Today the sensors are digital, either semiconductor “charge-coupled devices” (CCD), or active pixel sensors in “complementary metal-oxide-semiconductor” (CMOS) or “N-type metal-oxide-semiconductor” (NMOS, also called Live MOS) technologies. While CCD sensors are used for high-end broadcast quality video cameras, consumer products with internal cameras use lower cost and physically smaller CMOS sensors which offer lower power consumption in battery powered devices.

The fundamental operation of the CCD sensor is analog, as each cell of a CCD pixel image sensor is an analog capture device. When light strikes the chip, it is retained as a small electrical charge in each sensor cell. A small number of output amplifiers are connected to one line of pixel sensors. The charges in the line of pixels nearest the output amplifiers are amplified and output, then each line of pixels shifts its charges one line closer to the amplifier, filling the now-empty line closest to the amplifiers. This process is repeated until all the lines of pixels have had their charge amplified and output. In contrast, the CMOS image sensor has an amplifier for each pixel, compared to the few amplifiers of a CCD. This leaves less area for the capture of photons than a CCD, but the issue is typically resolved by using micro-lenses in front of each photodiode, which redirect light into the photodiode that would otherwise have bounced off the amplifier and gone undetected. Some CMOS imaging sensors also use back-side illumination to increase the number of photons that hit the photodiode. CMOS sensors can potentially be implemented with fewer components, use less power, and/or provide faster readout than CCD sensors, and, most important for consumer products, they are also less vulnerable to static electricity discharges.
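
The line-by-line transfer described above can be modeled in a few lines of Python; this is only a conceptual sketch of the charge-shifting readout, ignoring noise, amplification, and timing:

```python
import numpy as np

def ccd_readout(charge):
    """Simulate serial CCD readout: the line next to the output amplifier
    is read, every remaining line shifts one row toward the amplifier,
    and the cycle repeats until the whole frame has been transferred."""
    frame = charge.copy()
    lines_out = []
    for _ in range(frame.shape[0]):
        lines_out.append(frame[-1].copy())    # line nearest the amplifier
        frame[1:] = frame[:-1].copy()         # shift all lines one row closer
        frame[0] = 0                          # the farthest line is now empty
    return np.array(lines_out)

sensor = np.arange(12, dtype=float).reshape(4, 3)
print(ccd_readout(sensor))   # rows emerge starting with the one nearest the amplifier
```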

Many parameters can be used to evaluate the performance of an image sensor, including dynamic range, signal-to-noise ratio, and low-light sensitivity. For sensors of comparable types, the signal-to-noise ratio and dynamic range improve as the size increases.

Optical resolution describes the ability of an imaging system to resolve detail in the object that is being imaged. An imaging system may have many individual components, including the lens, the recording medium, and the display components. Each of these contributes to the optical resolution of the system, as does the environment in which the imaging is done.

The optical transfer function (OTF) of an optical system such as a camera, microscope, human eye, or projector specifies how different spatial frequencies are handled by the system. The OTF is used by engineers to describe how the optical system projects light from the object or scene onto the next item in the optical transmission chain, whether that is photographic film, a detector array, a retina, or a screen. Some optical and electrical engineers prefer to use the modulation transfer function (MTF), a measurement that neglects phase effects but in most situations is the measured equivalent of the OTF. Either function specifies the response to a periodic sine-wave pattern passing through the lens system, as a mathematical function of the pattern’s spatial frequency and orientation. To get technical, the OTF is formally defined as the Fourier transform of the point spread function (PSF), which is the impulse response of the optics, the image of a point source. As a Fourier transform, the OTF is complex-valued, but it will be real-valued in the common case of a PSF that is symmetric about its center. The MTF is formally defined as the magnitude (absolute value) of the complex OTF. These spatial measurements allow us to quantify the sharpness of a diffraction-limited focused imaging system, or the blurriness of an out-of-focus imaging system, across the full range of spatial frequencies rather than as a single resolution number.
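
Those definitions translate directly into code. The sketch below computes an MTF as the magnitude of the Fourier transform of a PSF; the Gaussian blur spot used here is only a stand-in, since a real PSF would be measured or modeled from the actual optics:

```python
import numpy as np

def mtf_from_psf(psf):
    """MTF = |OTF|, where the OTF is the Fourier transform of the point
    spread function, normalized to 1 at zero spatial frequency."""
    otf = np.fft.fft2(psf)
    otf = otf / otf[0, 0]                 # unity response at DC
    return np.abs(np.fft.fftshift(otf))

# Illustrative PSF: a symmetric Gaussian blur spot on a 64 x 64 grid.
x = np.linspace(-1.0, 1.0, 64)
xx, yy = np.meshgrid(x, x)
psf = np.exp(-(xx**2 + yy**2) / (2 * 0.05**2))
mtf = mtf_from_psf(psf)
print(mtf.max())   # 1.0, at zero spatial frequency
```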

Color separation is a function of the color image sensors employed in the device; there are three types of color-separation mechanism:

  • Bayer filter sensor, using a color filter array that passes red, green, and blue light to selected pixel sensors. Each individual sensor element is made sensitive to green, red, or blue by means of a colored chemical dye gel placed over each individual element. Inexpensive to manufacture, this technique lacks the color purity of dichroic filters. Because each color gel segment must be separated from the others by a frame, a separation like the lead caming in stained-glass windows, less of the areal density of a Bayer filter sensor is available to capture light, making the Bayer filter sensor less sensitive than similarly sized color sensors without such separations. In today’s designs, most common Bayer filters use two green pixels and one each for red and blue, which results in less resolution for red and blue, but this is acceptable because it correlates with the human visual system’s reduced sensitivity at the edges of our visual spectrum. Missing color samples are interpolated using a demosaicing algorithm or ignored altogether by a compression scheme. To improve color information, techniques like color co-site sampling use a hardware mechanism to shift the color sensor in pixel steps.

  • Foveon X3 sensor, using an array of layered pixel sensors, separates light through the wavelength-dependent absorption properties of silicon. This system allows every pixel location to sense all three colors and is similar to the way color photographic film works.

  • 3-CCD, using three discrete digital image sensors, with color separation done by a dichroic optical prism. The dichroic prism elements provide sharper separation, improving overall color quality at full resolution. 3-CCD systems offer better low-light performance and produce a full 4:4:4 signal, which is preferred in broadcasting, video editing, and chroma key visual effects. Every pixel carries its own Y, Cb, and Cr values; the 4:4:4 shorthand used by broadcasters and videographers indicates that the two chroma components are sampled just as often as the luma component, with no chroma subsampling, which represents the best possible color for an image (the notation is illustrated in the short sketch that follows this list).
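
The subsampling notation can be illustrated with a small Python sketch; it is a simplified decimation, not a broadcast-grade resampler, and the J:a:b interpretation below is the conventional one (a chroma samples in the first row of a J-pixel-wide, two-row block, b additional chroma samples in the second row):

```python
import numpy as np

def subsample_chroma(ycbcr, j_a_b=(4, 4, 4)):
    """Return Y at full resolution and Cb/Cr decimated per the J:a:b scheme.
    4:4:4 keeps every chroma sample; 4:2:2 halves chroma horizontally;
    4:2:0 halves it both horizontally and vertically."""
    j, a, b = j_a_b
    y, cb, cr = ycbcr[..., 0], ycbcr[..., 1], ycbcr[..., 2]
    h_step = j // a                       # horizontal chroma decimation
    v_step = 2 if b == 0 else 1           # vertical decimation for 4:2:0
    return y, cb[::v_step, ::h_step], cr[::v_step, ::h_step]

frame = np.random.rand(8, 8, 3)           # a tiny stand-in Y'CbCr frame
y, cb, cr = subsample_chroma(frame, (4, 2, 0))
print(y.shape, cb.shape)                  # (8, 8) (4, 4)
```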

Lens resolution is the ability of a lens to resolve detail; it is determined by the quality of the lens and is ultimately limited by diffraction. Light coming from a point in the object diffracts through the lens aperture in such a way that it forms a diffraction pattern in the image. This pattern has a central bright spot and surrounding bright rings separated by dark nulls; the pattern is known as an Airy pattern and the central bright spot as an Airy disk. Two adjacent points in the object give rise to two diffraction patterns. If the angular separation of the two points is significantly less than the Airy disk angular radius, then the two points cannot be resolved in the image, but if their angular separation is much greater than this, distinct images of the two points are formed and they can therefore be resolved. Only the very highest quality lenses have diffraction-limited resolution; normally the quality of the lens itself limits its ability to resolve detail. This ability is expressed by the aforementioned optical transfer function, which describes the spatial (angular) variation of the light signal as a function of spatial (angular) frequency. When the image is projected onto a flat plane, such as photographic film or a solid-state detector, spatial frequency is the preferred domain, but when the image is referred to the lens alone, angular frequency is preferred. The OTF may be broken down into magnitude and phase components, OTF(ξ, η) = MTF(ξ, η) · PTF(ξ, η), where ξ and η are the spatial frequencies in the x- and y-planes, respectively; the OTF accounts for aberrations. The magnitude is known as the Modulation Transfer Function (MTF) and the phase portion is known as the Phase Transfer Function (PTF). In imaging systems, the phase component is typically not captured by the sensor, making the MTF the important measure.
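
The diffraction limit described above can be put into numbers with the Rayleigh criterion, θ ≈ 1.22 λ/D; the wavelength and aperture in this example are assumed values chosen only to show the arithmetic:

```python
def rayleigh_limit(wavelength_m, aperture_m):
    """Smallest resolvable angular separation (in radians) for a
    diffraction-limited circular aperture, roughly the Airy-disk radius:
    theta = 1.22 * wavelength / aperture_diameter."""
    return 1.22 * wavelength_m / aperture_m

theta = rayleigh_limit(550e-9, 25e-3)   # green light through a 25 mm aperture
print(f"{theta:.2e} rad")               # about 2.7e-05 rad
```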

Optical sensors, whether film, solid-state devices like CCD and CMOS detectors, or tube detectors (vidicon, plumbicon, etc.), detect spatial differences in electromagnetic energy. The ability of such a detector to resolve those differences depends mostly on the size of the detecting elements. Spatial resolution is typically expressed in line pairs per millimeter (lp/mm), lines of resolution (mostly for analog video), contrast vs. cycles/mm, or MTF (the modulus of the OTF). The MTF may be found mathematically by taking the two-dimensional Fourier transform of the spatial sampling function. Smaller pixels result in wider MTF curves and better detection of higher-frequency energy. This is analogous to taking the Fourier transform of a signal sampling function: in that case, the dominant factor is the sampling period, which is analogous to the size of the picture element (pixel). Other factors include pixel noise, pixel crosstalk, substrate penetration, and fill factor.

A common problem among non-technicians is the use of the “number of pixels” on the detector to describe the resolution. If all sensors were the same size, this would be acceptable. But they are not, so the use of the number of pixels is misleading. For example, all else being equal, a 2-megapixel camera of 20-micrometer-square pixels will have worse resolution than a 1-megapixel camera with 8-micrometer pixels.
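
The arithmetic behind that comparison is simply the sampling limit set by the pixel pitch: one line pair needs at least two pixels, so the highest frequency the sensor can sample is 1/(2 × pitch). A quick sketch:

```python
def nyquist_lp_per_mm(pixel_pitch_um):
    """Highest spatial frequency a sensor can sample, in line pairs per mm,
    given its pixel pitch in micrometers."""
    return 1000.0 / (2.0 * pixel_pitch_um)

print(nyquist_lp_per_mm(20))   # 25.0 lp/mm for 20-micrometer pixels
print(nyquist_lp_per_mm(8))    # 62.5 lp/mm for 8-micrometer pixels
```

Pixel size, not pixel count, sets this limit, which is why the smaller-pixel 1-megapixel camera in the example above resolves finer detail.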

For a reliable resolution measurement, film manufacturers typically publish a plot of Response (%) vs. Spatial Frequency (cycles per millimeter). This measurement is derived experimentally. Solid-state sensor and camera manufacturers normally publish specifications from which the user may derive a theoretical MTF according to a recommended procedure. A few may also publish MTF curves, while others will publish the response (%) at the Nyquist frequency, or publish the frequency at which the response is 50%.

To find a theoretical MTF curve for a sensor, it is necessary to know three characteristics of the sensor: the active sensing area; the area encompassing the sensing area plus the interconnection and support structures (the “real estate”); and the total pixel count. The total pixel count is usually supplied, but as we have discussed it can be misleading. Sometimes the overall sensor dimensions are given, from which the real estate area can be calculated. Whether the real estate area is given or derived, if the active pixel area is not given, it may be derived from the real estate area and the fill factor, where fill factor is the ratio of the active area to the dedicated real estate area.
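
As a sketch of how those quantities feed a theoretical curve, the snippet below models only the pixel-aperture contribution to the MTF as a sinc of the active width times spatial frequency, deriving the active width from the pitch and fill factor; this first-order model is an assumption for illustration, not any manufacturer's recommended procedure:

```python
import numpy as np

def pixel_aperture_mtf(freq_lp_per_mm, pitch_um, fill_factor):
    """Approximate MTF of a square pixel aperture: |sinc(a * f)|, where the
    active width a is estimated from the pixel pitch ("real estate") and
    the linear fill factor (square root of the area fill factor)."""
    active_width_mm = pitch_um * np.sqrt(fill_factor) / 1000.0
    return np.abs(np.sinc(active_width_mm * np.asarray(freq_lp_per_mm, dtype=float)))

print(pixel_aperture_mtf([0, 10, 25, 50], pitch_um=8.0, fill_factor=0.5))
```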

Time also impacts sensor resolution. An imaging system running at 24 frames per second is essentially a discrete sampling system that samples a two-dimensional area, and these systems also suffer from sampling limitations. For example, all sensors have a characteristic response time. Film is limited at both the short and the long exposure extremes, typically understood to be anything longer than 1 second or shorter than 1/10,000 second. Additionally, film requires a mechanism to advance it through exposure, or a moving optical system to expose it. These limit the speed at which successive frames may be captured.

Digital systems have a different limitation. A CCD is speed-limited by the rate at which the charge can be moved from one pixel site to the next. CMOS has the advantage of individually addressable pixel cells, and this has led to its advantage in the high-speed photography industry. Tube systems like vidicons, plumbicons, and image intensifiers have specific applications. The speed at which they can be sampled depends upon the decay rate of the phosphor used in the tube. For example, the P46 phosphor has a decay time of less than 2 microseconds, while the P43 decay time is on the order of 2-3 milliseconds; a tube image sensor with P43 phosphor is unusable at frame rates above 1,000 frames per second. Physical temperature impacts detectors as well, since sensor noise rises with heat. If objects within a scene are in motion relative to the imaging system, the resulting motion blur lowers spatial resolution. Short integration times minimize the blur, but integration times are limited by sensor sensitivity. Important in our world of digital video distribution, motion between frames in media production has an impact on digital movie compression schemes (e.g., MPEG-1, MPEG-2). There are also sampling schemes that require real or apparent motion inside the camera (scanning mirrors, rolling shutters) that may result in incorrect rendering of image motion, so sensor sensitivity and other time-related factors have a direct impact on spatial resolution.

In HDTV and VGA systems, the spatial resolution is fixed independently of the analog bandwidth because each pixel is digitized, transmitted, and stored as a discrete value. Cameras, recorders, and displays are selected so that the resolution is identical from digital camera to display. Analog systems are different: the resolution of the camera, recorder, cabling, amplifiers, transmitters, receivers, and display may be completely independent, and overall system resolution is controlled by the bandwidth of the lowest-performing component in the chain. There are two methods for determining system resolution. The first is to perform a series of two-dimensional convolutions, combining first the image with the lens response, then that result with the image sensor response, and continuing through all components of the system. This is a complicated and time-consuming computation and must be performed anew for each object to be imaged. The other method is to transform each of the components of the system into the spatial frequency domain and then multiply the two-dimensional results. In this way a system response may be determined without reference to an image. The mathematical calculation employs the Fourier transform. Although this method is considerably more difficult to comprehend conceptually, it becomes easier to use, especially when different design requirements or imaged objects are to be tested.
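
A minimal sketch of the second method: sample each component's MTF at the same spatial frequencies and multiply point by point (the lens and sensor curves below are made-up values used only to show the cascade):

```python
import numpy as np

def system_mtf(*component_mtfs):
    """Cascade independent components in the spatial-frequency domain by
    multiplying their MTFs, sampled at the same spatial frequencies."""
    result = np.asarray(component_mtfs[0], dtype=float).copy()
    for mtf in component_mtfs[1:]:
        result *= np.asarray(mtf, dtype=float)
    return result

freqs  = [0, 10, 20, 40]        # common frequency axis in lp/mm (for context)
lens   = [1.0, 0.9, 0.7, 0.4]
sensor = [1.0, 0.8, 0.6, 0.2]
print(system_mtf(lens, sensor))  # [1.   0.72 0.42 0.08]
```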

A variety of optical resolution measurement systems are available. Typical test charts for the Contrast Transfer Function (CTF) consist of repeated bar patterns. The limiting resolution is measured by determining the smallest group of bars, both vertically and horizontally, for which the correct number of bars can be seen. By calculating the contrast between the black and white areas at several different frequencies, however, points of the CTF can be determined with a simple contrast equation. In broadcast plants, some modulation may be seen above the limiting resolution, but it may be aliased and phase-reversed. When using other methods, including the interferogram, the sinusoid, and the edge in the ISO 12233 target, it is possible to compute an entire MTF curve. The response to the edge is similar to a step response, and the Fourier transform of the first difference of the step response yields the MTF. Other measurement systems include an interferogram created between two coherent light sources, which can be used for at least two resolution-related purposes: first, to determine the quality of a lens system, and second, to project a pattern onto a sensor to measure resolution. The EIA 1956 resolution target, shown in Figure 3.1, and the similar IEEE 208-1995 resolution target were specifically designed to be used with television systems.
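
The contrast equation referred to above is the usual modulation ratio between the light and dark bars; a one-line Python version, with assumed example readings:

```python
def modulation_contrast(i_max, i_min):
    """Modulation between white and black bars at one spatial frequency:
    C = (Imax - Imin) / (Imax + Imin). Plotting C against bar frequency
    traces out points on the CTF."""
    return (i_max - i_min) / (i_max + i_min)

print(modulation_contrast(i_max=200.0, i_min=40.0))   # 0.666...
```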

Figure 3.1 EIA 1956 Video Resolution Target

The gradually expanding lines near the center are marked with periodic indications of the corresponding spatial frequency. The limiting resolution is found directly through inspection. The most important measure is the limiting horizontal resolution, since the vertical resolution is typically determined by the applicable video standard. The ISO 12233 target was developed for digital camera applications, since modern digital camera spatial resolution may exceed the limitations of the older targets. It includes several knife-edge targets for the purpose of computing MTF by Fourier transform. They are offset from the vertical by 5° so that the edges will be sampled in many different phases, which allows estimation of the spatial frequency response beyond the Nyquist frequency of the sampling.

Finally, a multiburst signal is an electronic waveform used to test transmission, recording, and display systems. The test pattern consists of several short periods of specific frequencies. The contrast of each may be measured by inspection and recorded, giving a plot of attenuation vs. frequency. The NTSC3.58 multiburst pattern consists of 500 kHz, 1 MHz, 2 MHz, 3 MHz, 3.58 MHz, and 4.2 MHz blocks; 3.58 MHz is important because it is the chrominance subcarrier frequency for NTSC video. PAL multiburst packets are 0.5 MHz, 1.0 MHz, 2.0 MHz, 4.0 MHz, 4.8 MHz, and 5.8 MHz, per CCIR 569.

The Various Forms of Color Coding & Color Spaces

Color space is a specific organization of colors. When used in combination with the physical parameters of a specific device, it allows for reproducible representations of color in both analog and digital form. A color space may be defined by individual interpretation, as in the Pantone collection, where particular colors are assigned to a set of physical color swatches and given names or numbers. Alternatively, a color space can be structured mathematically, as in the NCS System, Adobe RGB, or sRGB. Color models are abstract mathematical models describing the way colors can be represented as tuples of numbers (e.g., triples in RGB or quadruples in CMYK), yet a color model with no associated mapping function to an absolute color space is an arbitrary color system with no connection to any internationally recognized standard of interpretation. By employing a specific mapping function between a color model and a reference color space, media producers establish within the reference color space a fixed, absolute gamut, or “footprint,” and for a given color model this defines a color space. For example, Adobe RGB and sRGB are two different absolute color spaces, both based on the RGB color model. When defining a color space, the most commonly used international reference standards are the CIELAB and CIEXYZ color spaces, which were specifically designed to encompass all colors an average human can see.

How did we arrive at the idea of a color space? Starting in 1802, Thomas Young posited the existence of three types of photoreceptors in the eye, which we now call “cone cells,” each of which was sensitive to a particular range of visible light. By 1850, Hermann von Helmholtz had developed the Young-Helmholtz theory that the three types of cone photoreceptors could be classified as short-preferring (blue), middle-preferring (green), and long-preferring (red), according to their response to the wavelengths of light striking the retina. The relative strengths of the signals detected by the three types of cones are interpreted by the brain as a visible color. The color-space concept followed and was introduced by Hermann Grassmann, who developed it in two stages. First, he developed the idea of vector space, which allowed the algebraic representation of geometric concepts in dimensional space. With this conceptual background, in 1853 Grassmann published a theory of how colors mix. This theory and its three color laws are still taught as Grassmann’s laws. As Grassmann’s result is often summarized:

… the light set has the structure of a cone in the infinite-dimensional linear space. As a result, a quotient set of the light cone inherits the conical structure, which allows color to be represented as a convex cone in the 3-D linear space, which is referred to as the color cone.

Colors can be created with color spaces based on the CMYK color model, using the subtractive primary colors of pigment: cyan (C), magenta (M), yellow (Y), and black (K). To create a three-dimensional representation of a given color space, we can assign the amount of magenta to the representation’s X axis, the amount of cyan to its Y axis, and the amount of yellow to its Z axis. The resulting 3-D space provides a unique position for every possible color that can be created by combining those three pigments. Colors can be created on monitors with color spaces based on the RGB color model, using the additive primary colors red, green, and blue. A three-dimensional representation would assign each of the three colors to the X, Y, and Z axes. Note that the colors generated on a given monitor are limited by the reproduction medium, such as the phosphor (in a CRT monitor) or the filters and backlight (in an LCD monitor). Another way of creating colors on a monitor is with an HSL or HSV color space, based on hue, saturation, and lightness or value (brightness). With such a space, the variables are assigned to cylindrical coordinates. Many color spaces can be represented as three-dimensional values in this manner, but some have more or fewer dimensions, and some arbitrarily defined color spaces, such as Pantone, cannot be represented in this way at all.

Additive color mixing: three overlapping lightbulbs in a vacuum, adding together to create white, as shown in Figure 3.2.

Subtractive color mixing: three splotches of paint on white paper, subtracting together to turn the paper black, as shown in Figure 3.3.

RGB uses additive color mixing, because it describes what kind of light needs to be emitted to produce a given color. RGB stores individual values for red, green, and blue. RGBA is RGB with an additional channel, alpha, to indicate transparency. Common color spaces based on the RGB model include sRGB, Adobe RGB, ProPhoto RGB, scRGB, and CIE RGB.

Figure 3.2 Additive Color Mixing

Figure 3.3 Subtractive Color Mixing

YIQ was formerly used in analog NTSC (National Television System Committee) television broadcasts, the standard used in North America, Japan, and elsewhere, for historical reasons. This system stores a “luma” value roughly analogous to, and sometimes incorrectly identified as, luminance, along with two chroma values as approximate representations of the relative amounts of blue and red in the color. It is similar to the YUV scheme used in most video capture systems and in PAL television (Australia and Europe, except France and Russia), except that the YIQ color space is rotated 33° with respect to the YUV color space and the color axes are swapped. The YDbDr scheme used by SECAM analog television in France and Russia is rotated in another way. YPbPr is a scaled version of YUV, and it is most commonly seen in its digital form, YCbCr, which is used widely in video and image compression schemes such as MPEG and JPEG.
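
The digital YCbCr form can be sketched in a few lines; this uses the ITU-R BT.601 luma coefficients and full-range output, a simplifying assumption (studio-swing signals add offsets and scaling):

```python
def rgb_to_ycbcr_601(r, g, b):
    """Convert gamma-encoded R'G'B' values in [0, 1] to Y'CbCr using the
    BT.601 luma weights; Cb and Cr are scaled blue- and red-difference
    signals centered on 0.5."""
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cb = (b - y) / 1.772 + 0.5
    cr = (r - y) / 1.402 + 0.5
    return y, cb, cr

print(rgb_to_ycbcr_601(1.0, 0.0, 0.0))   # pure red: low luma, Cr at its maximum
```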

xvYCC is an international digital video color space standard published by the IEC (IEC 61966-2-4). Based on the ITU BT.601 and BT.709 standards, it extends the gamut beyond the R/G/B primaries specified in those standards. HSV (hue, saturation, value), also known as HSB (hue, saturation, brightness), is often used by artists because it is frequently more natural to think about a color in terms of hue and saturation than in terms of additive or subtractive color components. HSV is a transformation of an RGB color space, and its components and colorimetry are relative to the RGB color space from which it was derived. HSL (hue, saturation, lightness/luminance), also known as HLS or HSI (hue, saturation, intensity), is quite similar to HSV, with “lightness” replacing “brightness.” The difference is that in HSV the brightness of a pure color is equal to the brightness of white, while in HSL the lightness of a pure color is equal to the lightness of a medium gray.
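
Python's standard-library colorsys module performs these RGB/HSV/HLS transformations, which makes the relationship easy to see with a quick example (the sample color is arbitrary):

```python
import colorsys

r, g, b = 0.8, 0.2, 0.2                       # a muted red, components in [0, 1]
h, s, v = colorsys.rgb_to_hsv(r, g, b)
h2, l, s2 = colorsys.rgb_to_hls(r, g, b)      # note the hue-lightness-saturation order
print(f"HSV: hue={h:.2f} sat={s:.2f} val={v:.2f}")   # hue 0.00, sat 0.75, value 0.80
print(f"HSL: hue={h2:.2f} light={l:.2f}")            # same hue, lightness 0.50
```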

What Does Gamma and Log Processing Mean?

Gamma, the shortened and common reference for gamma correction, is the non-linear operation used to encode and decode luminance or tristimulus values in video or still image systems. In the simplest cases, gamma correction is defined by a mathematical power-law expression. A gamma value is sometimes called an encoding gamma, and the process of encoding with this compressive power-law expression is called gamma compression. In the reverse function, the gamma value is called a decoding gamma, and the application of the expansive power-law expression is called gamma expansion.

Gamma encoding of images is used to optimize the usage of data bits when encoding an image, or the bandwidth required to transport an image, by taking advantage of the non-linear way humans perceive color and light. Although it has no actual relationship to the mathematical gamma function, gamma encoding suits the human perception of brightness, which in everyday situations (in other words, neither in darkness nor in blinding sunlight) follows an approximately logarithmic curve, with greater sensitivity to relative differences between darker tones than between lighter ones. We apply gamma correction because images that are not gamma-encoded allocate too many bits, or too much bandwidth, to highlights that humans cannot differentiate, and too few bits, or too little bandwidth, to the shadow values that humans typically do see and that require more bits and bandwidth to maintain an optimal visual experience.
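As a minimal sketch of the idea, the pure power-law form of gamma compression and expansion can be written as follows in Python; the gamma value of 2.2 is a commonly cited example rather than a universal constant.

def gamma_encode(linear, gamma=2.2):
    # Compress a linear light value (0.0-1.0) for storage or transmission.
    return linear ** (1.0 / gamma)

def gamma_decode(encoded, gamma=2.2):
    # Expand an encoded value back to (approximately) linear light.
    return encoded ** gamma

# A dark tone gains code value under encoding, which is the point: more of the
# limited bit budget is spent where human vision is most sensitive.
print(gamma_encode(0.05))                 # roughly 0.256
print(gamma_decode(gamma_encode(0.05)))   # round trip back to roughly 0.05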

Gamma encoding was originally developed to compensate for the input–​output characteristic of cathode ray tube (CRT) displays. Light intensity varies nonlinearly with the electron-gun voltage in CRT displays. Altering the input signal by gamma compression can cancel this nonlinearity, such that the output picture has the intended luminance. The similarity of CRT physics to the inverse of gamma encoding needed for video transmission was a combination of coincidence and engineering, and simplified the early television set electronics. The advantage to modern systems is different. The gamma characteristics of the display device are less of a concern in the gamma encoding of images and video; we employ gamma encoding to maximize the visual quality of the signal, regardless of the gamma characteristics of the display device.

Until the recent advent of High Dynamic Range (HDR) televisions and monitors, most video screens were not capable of displaying the dynamic range of brightness that can be captured by typical digital cameras. Considerable artistic effort has been invested in choosing the reduced form in which the original image is presented, and contrast selection via gamma correction is part of the artistic adjustment used to fine-tune the image for reproduction. It is also important to note that digital cameras record light using electronic sensors that respond linearly, not logarithmically like our eyes. In the process of rendering linear raw data to conventional RGB data (e.g., for storage in the JPEG image format), color space transformations and rendering transformations are performed. In particular, almost all standard RGB color spaces and file formats use a non-linear, gamma-compressed encoding of the primary-color intensities, and the intended reproduction is almost always nonlinearly related to the actual measured scene intensities.

Binary data in still image files, such as JPEG photographic images, are explicitly encoded, which means they carry gamma-encoded values rather than linear intensities, as do motion picture files compressed with the MPEG standard. The gamma encoding system can be tuned to manage both cases through color management, if a better match to the output device gamma is required. The sRGB color space standard used with most cameras does not use a simple power-law nonlinearity as described above, but it has a decoding gamma value near 2.2 over much of its range. Below an encoded value of 0.04045, corresponding to a linear intensity of 0.00313, the curve is linear; in other words, the encoded value is directly proportional to intensity. Output to CRT-based television receivers and monitors does not usually require further gamma correction, since the standard video signals that are transmitted or stored in image files already incorporate gamma compression that produces a pleasing image after the gamma expansion of the CRT. For television signals, the actual gamma values are defined by the video standards (ATSC or DVB-T) and are fixed, published values.
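The piecewise sRGB curve just described can be sketched in code as follows. The constants are the published sRGB values, but the snippet is only an illustration of the linear segment plus power-law segment, not a full color-management implementation.

def srgb_encode(linear):
    # Linear light (0.0-1.0) -> sRGB-encoded value.
    if linear <= 0.0031308:
        return 12.92 * linear
    return 1.055 * (linear ** (1.0 / 2.4)) - 0.055

def srgb_decode(encoded):
    # sRGB-encoded value (0.0-1.0) -> linear light.
    if encoded <= 0.04045:
        return encoded / 12.92
    return ((encoded + 0.055) / 1.055) ** 2.4

print(srgb_decode(0.5))   # about 0.214: a mid-grey code value maps to roughly 21% linear intensity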

What RAW Means and How to Digitally Process It

A camera RAW image file contains minimally processed data from the image sensor of either a digital camera or a motion picture film scanner. RAW files are so named because they are not yet processed and therefore not ready to be printed or edited. Typically, raw images are processed by a RAW converter in a wide-gamut internal color space where precise adjustments can be made before conversion to a house-preferred or “mezzanine” file format for storage or further manipulation, which usually encodes the image in a device-dependent color space. There are hundreds of RAW formats in use by different models of digital equipment, and many are not compatible with one another.

The purpose of RAW image formats is to save, with minimum loss, all data obtained from the sensor, as well as the metadata describing the conditions of the capture of the image. RAW image formats are intended to capture as closely as possible the complete characteristics of the scene, that is, all pertinent physical information about the light intensity and color. Like photographic negatives, RAW digital images may have a wider dynamic range or color gamut than the eventual final image format, as they maintain the most complete record of the captured image. Most RAW image file formats store information sensed according to the geometry of the sensor’s individual photo-receptive elements, the pixels, rather than points in the expected final image: for example, camera sensors with hexagonal element displacement record information for each of their hexagonally displaced cells, which decoding software will eventually transform into a rectangular geometry during “digital developing.” The process of converting a RAW image file into a viewable format is sometimes called “developing” a RAW image, harkening back to the film development process used to convert photographic film into viewable prints. Setting the white balance, color grading, and gamma are all part of this rendering process.

RAW files contain the information required to produce a viewable image from the camera’s sensor data. The structure of RAW files often follows a common pattern:

  • A short file header which typically contains an indicator of the byte-ordering of the file, a file identifier and an offset into the main file data.

  • Camera sensor metadata, which is required to interpret the sensor image data, including the size of the sensor, the attributes of the color filter array (CFA), and its color profile.

  • Image metadata, which is required for inclusion in any CMS environment or database. This includes the exposure settings, camera/scanner/lens model, date (and, optionally, place) of the shoot or scan, authoring information, and other details. Some RAW files contain a standardized metadata section with data in Exif format.

  • In the case of motion picture film scans, either the timecode, keycode or frame number in the file sequence which represents the frame sequence in a scanned reel. This item allows the file to be ordered in a frame sequence (without relying on its filename).

  • The sensor image data: RAW files contain the full resolution data as read out from each of the camera’s image sensor pixels.

If RAW format data is available, it can be used in high-dynamic-range (HDR) imaging conversion as a simpler alternative to the multi-exposure approach of capturing three separate images (one underexposed, one correctly exposed, and one overexposed) and “overlaying” one on top of the other.

To be viewed or edited, the output from a camera’s image sensor has to be converted to a photographic rendering of the scene and stored in a standard format. This processing, whether done in-camera or later in a RAW-file converter, involves a number of operations, including but not limited to:

  • decoding – ​image data in RAW files are typically encoded for compression purposes, but sometimes for security or obfuscation purposes

  • defective pixel removal – ​replacing data in known bad locations with interpolations from nearby locations

  • white balancing – ​accounting for color temperature of the light that was used to take the photograph

  • noise reduction – ​trading off detail for smoothness by removing small fluctuations

  • color translation – ​converting from the camera native color space defined by the spectral sensitivities of the image sensor to an output color space (typically sRGB for JPEG)

  • tone reproduction – ​the scene luminance captured by the camera sensors and stored in the RAW file (with a dynamic range of typically ten or more bits) needs to be rendered for pleasing effect and correct viewing on low-dynamic-range monitors or prints; the tone-reproduction rendering often includes separate tone mapping and gamma compression steps.

  • compression – ​for example JPEG compression

Cameras and image processing software may also perform additional processing to improve image quality, for example:

  • removal of systematic noise – ​bias frame subtraction and flat-field correction

  • dark frame subtraction

  • optical correction – ​lens distortion, chromatic aberration, and color fringing correction

  • contrast manipulation

  • dynamic range compression – ​lighten shadow regions without blowing out highlight regions
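As a rough illustration of a few of the operations listed above, the following sketch applies white balancing, clipping, gamma compression, and 8-bit quantization to linear sensor data using Python and NumPy. It assumes the mosaic data has already been demosaiced into a linear RGB array, and the gain and gamma figures are placeholders rather than values taken from any real camera or RAW converter.

import numpy as np

def develop(linear_rgb, wb_gains=(2.0, 1.0, 1.5), gamma=2.2):
    # Turn linear RGB sensor data (floats in 0.0-1.0) into display-ready 8-bit RGB.
    img = linear_rgb * np.asarray(wb_gains)      # white balancing (example gains)
    img = np.clip(img, 0.0, 1.0)                 # discard out-of-range values
    img = img ** (1.0 / gamma)                   # gamma compression for display
    return (img * 255.0 + 0.5).astype(np.uint8)  # quantize to 8 bits per channel

# A tiny 2x2 "image" of linear values, just to exercise the pipeline.
frame = np.array([[[0.05, 0.10, 0.02], [0.20, 0.20, 0.20]],
                  [[0.40, 0.30, 0.10], [0.01, 0.02, 0.05]]])
print(develop(frame))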

Compressed File Systems

Video, once it is digitized, must be saved in a representative manner so that the content can be stored in an archive or transmitted as a file or a data stream. A video compression format is the coding applied to the content data to make files, and there are many types in use, from consumer to “pro”-sumer to professional methods. Video encoding to compressed formats can be “lossless,” meaning that all the data is saved, resulting in a very large file, or “lossy,” where the data is reduced for a smaller file size or lower transmission bandwidth at the cost of some degradation of the original content. Consumer media is usually compressed using “lossy” video codecs, resulting in significantly smaller files than lossless compression. While most video coding formats are designed either for lossy or lossless compression, some formats support both. Uncompressed video is a form of lossless video handling used in some circumstances, such as when sending video to a display over an HDMI connection, and some high-end cameras can capture video directly in an uncompressed, lossless format. A few examples of video coding formats include MPEG-2 Part 2, MPEG-4 Part 2, JPEG2000, H.264 (MPEG-4 Part 10), DV-DVCpro, AVC-Intra, DNXHD, DPX, HEVC, Theora, RealVideo RV40, VP9, and AV1. A specific software or hardware implementation capable of video data compression and/or decompression to/from a specific video coding format is called a video “codec,” and the operation of changing one format to another is called “transcoding.”

It is important to know that there are hundreds of formats and variations in use in the media industry. While many codecs have been developed for specific software and hardware platforms, such as ProRes for the Apple Final Cut Pro software or DNXHD for the AVID® editing systems, most professional coding formats are documented in a detailed technical specification. Some specifications have been approved by standardization organizations and are considered a video coding standard. Because of the proliferation of codec formats, the term “standard” applies both to formal standards and to de facto, product-promoted standards. Conceptually, there is a difference between a format “specification” and its codec implementations. Video coding formats are detailed in specifications, and the software or hardware that encodes and decodes data to or from uncompressed video are codec implementations of those specifications. This is one of the reasons why the industry struggles with hundreds of versions of the same format: for each specification, there can be many codecs implementing that specification in slightly different manners, often to promote a specific product feature or benefit.

In many cases, video content data encoded using a coding format is bundled with audio channels, each encoded with an audio coding format, as well as subtitle or caption data, and “wrapped” inside a multimedia container format such as MXF (SMPTE specifications 386M, 383M, 381M, and 377 are frequently used), QuickTime, AVI, MPEG4, FLV, or RealMedia. These file packages act as a container holding the pieces of the total media asset and can be easily transmitted or stored. Wrapper multimedia container formats can contain any one of a number of different video coding formats, which complicates the situation for companies that share media files on a regular basis, such as broadcasters, production houses, and post-production houses. When a user analyzes a new file, an MP4 wrapper may contain video compressed with the MPEG-2 Part 2 codec, video compressed with the H.264 coding format, or any of a number of other format options.
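As a practical illustration, the sketch below asks the widely used ffprobe command-line tool (assumed to be installed) which video coding format sits inside a given wrapper file; the file name is a hypothetical example.

import subprocess

def video_codec_of(path):
    # Return the codec name of the first video stream found in a container file.
    result = subprocess.run(
        ["ffprobe", "-v", "error",
         "-select_streams", "v:0",
         "-show_entries", "stream=codec_name",
         "-of", "default=noprint_wrappers=1:nokey=1",
         path],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()

# An .mp4 wrapper might report "mpeg2video", "h264", or something else entirely.
print(video_codec_of("incoming_master.mp4"))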

A video coding format specification does not dictate all algorithms used by a codec implementing the format. For example, a large part of how video compression typically works is by finding similarities between video frames and achieving compression by copying previously coded similar sub-images or “macroblocks” and adding small differences as they occur. In the real world it is almost impossible to find an optimal compression solution, but many manufacturers have developed hardware and software tools to manage the predictions of changes in the macroblocks as well as capture the differences. Since the video coding format standards do not dictate the algorithms that manage the encoding steps, systems can innovate in their support for compression across the video frames in the data stream. The application of the video often requires codec decisions that trade storage space versus time, in other words a live feed for broadcast quality may need a codec that operates fast but is very inefficient in file size and requires higher storage space, while a DVD codec needs to manage space on the storage medium and trades the speed of operation for higher compression and lower storage requirements.

The equipment in your plant will dictate the type of format your company requires. Choices are affected by your editing software, your transmission chain, and your storage and media library selections. One of the most widely used video coding formats is H.264. H.264 is a popular choice for encoding Blu-ray Discs, and it is widely used for the “proxy” video that media asset management systems use to reference high-definition files. It is also widely used by streaming Internet sources, like YouTube, Netflix, Vimeo, and the iTunes Store, web software such as the Adobe Flash Player and Microsoft Silverlight, and various HDTV broadcasts over terrestrial (ATSC standards, ISDB-T, DVB-T, or DVB-T2), cable (DVB-C), and satellite (DVB-S2) systems. Standards vying to be the next-generation video coding format appear to be JPEG2000, the heavily patented HEVC (H.265), and AV1.

A subclass of video coding formats are the intra-frame formats, which apply compression to each picture in the video stream in isolation, with little or no attempt to take advantage of correlations between successive pictures over time for a higher level of compression. An example is Motion JPEG, which is simply a sequence of individually JPEG-compressed images. These codecs tend to operate faster but build much larger files than a video coding format supporting interframe coding. Because interframe compression copies data from one frame to the next, should an originating frame be lost or damaged, the frames that follow cannot be properly reconstructed. Editing a video file compressed with an interframe format is difficult because a fully coded frame may not fall at the selected edit point, whereas making edits in intraframe-compressed video is almost the same operation as editing uncompressed video. Another difference between intraframe and interframe compression is that with intraframe systems, each frame uses a predictable and similar amount of data; however, in most interframe systems, particular frames – ​like the “I-frames” in an MPEG-2 file – ​cannot copy data from previous frames and so require a larger amount of data than nearby frames. This is why you will hear some producers ask to set their MPEG codec for “all I-frames,” in effect emulating an intraframe codec to make smooth editing easier.

Video coding formats can define additional restrictions to be applied to encoded video, called profiles and levels. A “profile” restricts which encoding techniques are allowed, and a “level” is a restriction on parameters such as maximum resolution and data rates. A decoder may support only a subset of the profiles and levels for a given video format; this is most often done to keep the decoder program or hardware simpler and faster in operation, or to constrain the size of the encoded output file.
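As an illustration of how profiles and levels appear in day-to-day tooling, the sketch below drives the common ffmpeg command-line encoder (assumed to be installed) to produce H.264 restricted to a specific profile and level inside an MP4 wrapper; the file names and parameter values are examples only, not a recommended house setting.

import subprocess

subprocess.run(
    ["ffmpeg", "-i", "source.mxf",                # input wrapper and essence
     "-c:v", "libx264",                           # encode video to H.264
     "-profile:v", "high", "-level:v", "4.1",     # profile and level restrictions
     "-crf", "18",                                # quality-based rate control
     "-c:a", "aac",                               # re-encode audio
     "output.mp4"],                               # MP4 wrapper for delivery
    check=True)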

Uncompressed File Systems

“Uncompressed” video is digital video that has never been compressed or was generated by decompressing previously compressed media. It is most often found in video cameras, video monitors, some video recording devices, and in video processors that manage operations like image resizing, deinterlacing, and text and graphics overlay. Uncompressed video can be transmitted over various types of baseband digital video interfaces like SDI. Some High Definition cameras feature the output of high-quality uncompressed video, while others compress the video using a lossy compression method, usually to achieve a lower price point. In a lossy compression process video information is removed, which leads to compression artifacts and reduces the quality of the resulting decompressed video. If your business is based on high-quality editing of digital video, it is best to work with video that has never been compressed or that has used only lossless compression, as this maintains the best possible quality. Compression can always be applied after editing chores are finished.

Common Camera Media Types

As we learned from our history of the development of digital cameras, professional digital cameras began using back-mounted hard drives, and we still find these capture hard drives in use today, with some vendors adopting the SD card for firmware updates and small file transfers. Most professional camera manufacturers, such as Sony, ARRI, Canon, and RED, provide their own capture media as part of their extended product lines. These media types, despite standards, are not interchangeable.

Common File Formats for Cameras

Common acquisition file formats in digital cameras include uncompressed RAW, Apple ProRes, AVID® DNX, and MPEG. For camera work that is captured for immediate delivery or destined for delivery applications, a number of formats are supported, including DPX and TIFF for 4K and 2K images, and Apple QuickTime, JPEG, AVID® AAF, MXF, H.264, and MP4 for High Definition and Standard Definition. The format is usually selected to support the editing and processing tools in the studio or facility. For example, if your production chain uses the AVID® editing systems or the Apple Final Cut Pro editing tools, then the camera output format is chosen to match the in-house system.

All cameras provide metadata about the picture. Data may include aperture, exposure time, focal length, date and time taken, and location. Camera cards from different manufacturers have different file conventions, including folder structure and metadata storage, which often makes the media non-interoperable.

Common Production Formats

There are a few primary production formats used in the industry today, and they are better known by their manufacturers than by the myriad of format or product names: Avid®, Apple, and Adobe, the three “A’s” of media and broadcast production.

AVID®—​The DNXHD video coding format, and its many permutations over the years, is the AVID®-specified format for editing and processing. The AVID® Media Composer has been a mainstay in Hollywood for over 20 years, and it sits within a complete ecosystem of network products (ISIS, etc.), storage products, production asset management (PAM) tools (Interplay, Media Central UX), audio tools like PRO TOOLS, newsroom automation systems (iNEWS), and graphics editing systems. AVID® offers dialogue search tools, a full-featured RESTful API, and support from a global distribution network.

Apple—​ProRes is the video coding format introduced by Apple to support high-quality editing on its Final Cut Pro platform. In the late 1990s, the relatively inexpensive Final Cut Pro software, together with the similarly priced Apple hardware platform that supported it, gave hundreds of operations a low-cost, high-quality video editing tool, lowering the financial barrier to entry for many small producers, broadcasters, and post-production houses.

Adobe Systems—​Touted to work with uncompressed RAW files from any camera, the Premiere Pro video editing suite is coupled with the other tools in the Adobe Creative Cloud, such as After Effects for graphics, Audition for audio, and the internal Adobe Media Encoder. The Media Encoder is a versatile tool that transcodes to and from many different formats and format variants, and over the past five years Premiere Pro has become a popular production software system, replacing many Final Cut Pro installations in the industry.

What Ingest Means and How to Organize It

Ingest is the process of capturing incoming media streams or files, analyzing them, and bringing them into a system for management or further processing. Whether the media arrives on a camera card, from an FTP site, over an accelerated content distribution network like IBM Aspera, Signiant, or FileCatalyst, or over an ordinary Ethernet network, the digitized assets must be recognized, analyzed, and published in the internal systems to enable further use.

Ingest processes are often described as a “workflow” as each company has a set of steps it uses to evaluate, catalog and integrate new digital media as it arrives. The ingest process typically employs tools such as digital video codecs to normalize the media to company specifications. The following list defines one example of an ingest process:

  • Create metadata placeholder(s) – ​this is an optional step that many systems employ to prepare a pre-arrival metadata description of expected incoming digital media, typically with an identifier like a title, a house number or code, or a Universal Unique Identifier (UUID)

  • Search/locate file(s) – ​files that have arrived can automatically trigger a “watch folder” to start the process, or the system can be commanded to seek the digital media files and start the process

  • Pre-analysis of the High-Resolution file – ​what are its technical parameters? Does it meet house requirements? Is there metadata accompanying the media file(s) in a sidecar file or embedded within the “wrapper?” Is this an expected secured delivery or is there a virus scan to run on the data package?

  • Map Media format – ​the ingest process must use the analysis to separate the components inside the wrapper and prepare its tools, such as codecs, to organize the work to be performed on the incoming file(s)

  • Convert Media to house format – ​some companies and some systems require their media to be in a single format for use in their operations. This format is referred to as a “mezzanine” format and all incoming media must match it or be transcoded to the mezzanine format

  • Media Proxy generation – ​if the media is to be used in a library management system, these systems typically create a low-resolution copy of the media to use as a reference to the high-resolution system for media annotations, edit points, and as an aid in quality control operations. These “proxies” substitute for the actual media for many internal functions and provide fast access for manual chores

  • Create XML representation – ​define the incoming asset in a standardized way in a language that can be both human and machine readable, and can be used by asset, content, or production management software

  • Catalog asset – ​place the XML representation in the system of record for the organization, whether it is asset, content, or a production management operation

  • Miscellaneous Map Translator – ​any additional files, media “maps,” or ancillary files found in the wrapper need to be translated into the organization’s system for historical reference or operations if the system uses metadata-driven workflows

  • Store High-Resolution instance – ​the new media asset and any associated components like audio files and caption files, must be stored in the most appropriate storage repository

  • Create Index of High-Resolution asset – ​collect any technical information on the asset as well as any indicated points in the digital file, such as “Start of Media (SOM),” “End of Media (EOM),” etc.

  • Generate storyboard – ​some systems need a series of still images to be generated for each asset to aid in user searchability and downstream processing of the media

  • Set locations on Proxy – ​if the indexed Hi-res asset has points, set those indications on the proxy version of the original file

  • Insert proxy audio – ​attach the audio channels to the video proxy

  • Insert proxy subtitles or closed captions – ​attach the subtitle or caption files to the proxy

  • Generate “Title” – ​in the system of record, there will be some internal name for the new asset, and this is the step that registers the new name, whether it is a title, a house number or some other form of tracking mechanism

  • Report instance UUID – ​if the internal system of record is employing UUIDs (Universal Unique Identifiers), this is the step to register that tracking sequence

  • Auto-tag with metadata and distribute incoming content – ​some sequences recognize media for particular purposes and can “tag” the new asset with a metadata marker to trigger a downstream workflow to deliver the new asset to a particular location, department, or user group

  • Auto assign “new asset” tasks to internal staff – ​some systems move newly ingested assets into a production or quality review workflow and the ingest process can assign work based on the arrival of new media

  • Escalation procedures – ​should the incoming digital file fail the process, the ingest workflow can trigger messages or alarms to notify the supplier or management to the issues of the failure

  • Automated Quality Control (AQC) – ​many organizations use software tools to analyze the digital video and perform a quality review with a posted analysis report; some systems even annotate the proxies with the information from the report to aid in quick human review of the failure points

  • Manual QC – ​typically this step is triggered by failure in an auto quality control review, but some companies insist on a human review of incoming media to check its viability

  • Supplemental files/linking – ​for companies that are distributing Video on Demand (VOD) and Over the Top (OTT) service versions (HULU, Netflix, Verizon, etc.), and especially for companies distributing international-language versions of their programming, the ability to match and link supplemental files, such as audio and subtitle translations or an edited version, to the original asset is key to the success of the business

  • Unknown Media workflows – ​what to do with an asset that is unrecognizable? The ingest system needs an “escape valve” to pass the media to a human review process to correctly catalog and link the asset in the system of record

  • Manager review steps – ​some organizations require management to review incoming media of particular types and these steps can be automatically organized by the ingest process

  • Supplier analysis reports – ​if your organization is accepting media from syndicators or ad agencies, the ingest process may be required to save audit data on media sources for reporting or dashboard monitoring

  • “Refused” media reports and supplier notifications – ​when media fails, especially if the failures are regular and caused by the same source, the ingest process is often required to notify the original supplier of the failures and provide management reports of the media failures.

This is not an exhaustive list of the steps in an ingest process. If the in-house system supports the SMPTE Interoperable Master Format (IMF) for distribution, there may be additional steps, including reading the IMF packing lists and applying them to the media map, noting any IMF mark-up points and indicating those in the proxy annotations, and so on. Ingest workflows are often customized to the internal requirements of the organization as well as the needs of the software and hardware tools in use.
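To make a few of the early steps concrete, here is a deliberately simplified sketch of a watch-folder pass that checksums each new file and writes a minimal XML catalog record beside it. The folder path, the element names, and the file-extension filter are illustrative assumptions, not a standard ingest implementation.

import hashlib
import uuid
import xml.etree.ElementTree as ET
from pathlib import Path

WATCH_FOLDER = Path("/media/ingest/incoming")   # hypothetical watch folder

def sha256_of(path):
    # Hash the file in chunks so large media files do not exhaust memory.
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def catalog_record(path):
    # Build a minimal XML description of the asset for the system of record.
    asset = ET.Element("asset", attrib={"uuid": str(uuid.uuid4())})
    ET.SubElement(asset, "title").text = path.stem
    ET.SubElement(asset, "filename").text = path.name
    ET.SubElement(asset, "checksum", attrib={"type": "SHA-256"}).text = sha256_of(path)
    return ET.ElementTree(asset)

for media_file in WATCH_FOLDER.glob("*.mxf"):
    catalog_record(media_file).write(media_file.with_suffix(".xml"),
                                     encoding="utf-8", xml_declaration=True)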

Asset Management

Media Asset Managers, or MAMs, are our modern libraries for digital media files. They come in many different forms, many focused on particular roles in an organization, and they are not to be confused with DAMs, Digital Asset Managers, which are systems focused on document management and usually boast a completely different set of features and applications than their media counterparts. MAMs are designed specifically for media management, and because modern metadata can drive workflows and automate operations, today’s products are often deeply integrated with workflow orchestration systems and business process managers.

Asset management systems have become specialized, with systems managing specific workflows for archive and preservation operations, production support, media preparation for play-to-air for broadcast networks and television stations, syndication product distribution, news operations support, sports and live event production and “versioning” for international distribution, or to support Video on Demand (VOD) and Over-the-Top (OTT) services like Netflix, HULU, YouTube, etc. It is key to recognize that specialized MAMs do not necessarily address every operational requirement. Cost-effective systems support small operations while multi-site, geographically spanning enterprise systems can organize and manage global operations.

Core to any MAM is its organization of the media and components in the library, along with the system’s searchability and ease of use. The library catalog for each asset is based on a set of metadata, usually built upon a data model adapted to the needs of a particular application or a particular company’s unique data collection requirements. MAMs feature media manipulation tools to play back and annotate proxies, and some orchestration systems use the collected metadata to drive workflows without human interaction. As a workflow orchestration engine, a MAM system must integrate with various third-party tools to complete the steps of the operational path. Recently there has been a move to add value by enhancing workflow operations with machine learning software tools, typically called “Artificial Intelligence” or AI. Clever AI applications are being introduced to reduce manual labor and increase efficiency, for example by automatically applying metadata annotations to media files in the library.

The digital media ingest process is usually designed to be managed by the asset management system, and the automatic harvesting of metadata and technical data is key to the tool’s success. When considering MAM systems, pre-planning and analysis of needs will go a long way in selecting the correct tool for a particular application as not all MAMs work the same way or provide the same results, nor do component systems all interconnect and function well together in a workflow orchestration system.

On-Set Grading

On-set grading is a technique used by cinematographers in which a certain “look” or visual style is applied to video or film material by means of set lighting. In modern film and television production, since the images are captured in a digital format, the method of applying a visual style is color correction or color grading. These are artistic concerns that have implications throughout the production phases of a program. For example, incoming RAW camera footage can be color graded to a specific quality of light early in the capture process, but later in the production process, when other processing is added to the image, it may be found to be too warm or too cool, in other words too red-yellow or too blue for the director’s preference or artistic vision. This is especially apparent in modern film making when computer-generated imagery (CGI) characters are added to productions. The cinematographer can go back to the original RAW footage and re-grade the digital file, adjusting the color grading so that the later processing maintains the image style that the director seeks.

Dailies

What happened yesterday? This is the real question “dailies” answer. Dailies are the unedited camera footage collected during the making of a production, whether it be a television program or a cinematic production. These clips are named “dailies” because at the end of every day the day’s footage is collected and developed, synchronized to the audio channels, and prepared for viewing. Dailies are viewed by members of the production crew, typically early in the morning before the day’s filming starts, but sometimes during a lunch break or at the end of the day. It is common for several members of the production team, including the director, cinematographer, editor, and others, to view and discuss the dailies. Dailies are sometimes separately reviewed by producers or executives who are not directly involved in day-to-day production to ensure their investment is on track as expected. Sometimes multiple copies of the dailies are distributed for individual viewing, via a secure Internet connection or via physical media such as a DVD.

Dailies indicate how the overall filming and actors’ performances are advancing. At the same time, in industry jargon the term can refer to any raw footage, regardless of the actual date of capture. In the UK and Canada, dailies are called “rushes” or daily rushes, referring to the speed required to quickly turn around the print for viewing. In animation projects, dailies are also called rushes, and you may hear the dailies review called “sweat box” sessions.

Active monitoring of dailies allows the film crew to review images and audio that were captured the previous day, and it provides technical as well as artistic analysis of the captured media. Technical problems can be caught and resolved quickly. Directors can evaluate and modify the actors’ performances as well as adjust camera angles and scene positioning. If a scene must be reshot, it is best to address the need immediately rather than later in the process when sets may have been torn down and actors have left the production schedule.

Realistically, the process of reviewing a dailies sequence is monotonous, as dailies often include multiple recordings of the same scene with minor changes or adjustments. High Definition digital video dailies can be as big as 2K resolution (2048×858, 2.39:1 aspect). Many productions have a main production unit which does all primary cinematography and one or more smaller teams shooting additional “pickup” shots, stunts, locations, or special effects shots. This additional video is included with the main unit footage on the dailies reels. Because of the way the dailies are processed, when a unit shoots with more than one camera, typically all the shots from the “A” camera will be followed by all the “B” camera shots, then by the “C” camera, and so on. Because of the time required to review every shot, ordinarily only a small amount of the previous day’s footage is viewed, and with digital media, footage can be fast-forwarded as desired. Sound that was recorded without simultaneous picture recording is called “wild sound” and is sometimes included in the dailies. Visual effects shots are collected daily for viewing by a visual effects supervisor; these dailies contain the previous day’s work by animators and effects artists in various states of completion. When animation or character generation requires additional feedback from the director, the supervisor will collect specific dailies and screen them for the director, either as part of the normal dailies process or in a separate visual effects dailies screening.

As most modern editing is accomplished on computer-based non-linear editing systems, keycode numbers are logged on the media which assign a number to each frame of film and are later used to assemble the original film to conform to the edit. Dailies delivered to the editing department are essentially “proxies” and already have timecode and keycode numbers overlaid on the image. These reference numbers assist the later assembly of the original high-quality film and audio to conform to the edits.

Modern video cameras can record the image and sound simultaneously to video tape or hard disk in a format that can be immediately viewed on a monitor, eliminating the need to undergo a conversion to create dailies. Audio synchronization can be very important, and the film methods of using clapperboards and manual adjustments are still found on production sets. Sound collection and synchronization need to be done for every take.

Rushes and dailies can be used to create trailers and “sizzles,” short promotional clips, even if they may contain footage that is not in the final production.

Sources and Types of Metadata

Metadata, or data about data, is sourced at many different stages in the production process and supplied from many different foundations. Broadly we can separate the sources into machine- versus human-generated metadata. Types of machine-generated metadata include but are not limited to:

  • Camera technical specifications like field of view, sample rates, etc. and captures of the specific settings such as recording format, output format, etc.

  • Technical lighting settings including any color grading information and equipment settings

  • Technical audio settings and equipment settings

  • Ancillary equipment settings such as time code generation

  • Metadata collected in the dailies creation process or in the dailies transmission process

  • Metadata captured by asset management systems during ingest including technical file data such as wrapper and media formats, Sidecar XML descriptive data, captioning data, etc.

  • Metadata created in the editing process, including edit mark-in/mark-out points, multiple media file selections, and specific edit tool project files

  • Metadata generated in the effect creation or animation processes including insert mark-in/mark-out points and specific tool project files

  • Metadata created in a compression or transcode process

Types of human-generated metadata include but again are not limited to:

  • Title, date, and location

  • Light, weather, and time of day references

  • Production schedule references

  • Tracking numbers

  • DIT notes

  • Description of scenes

  • Credits including director, actors, cinematographer

  • Notes from dailies review

  • Scripts and script notes

  • File annotations to note points of interest in the digital files

There are many sources for external metadata which can be used to augment a particular asset. These include the Entertainment Identifier Registry numbers (EIDR), Ad-ID identifiers for commercial advertisement and rights tracking purposes, company-specific media identifiers like house numbers, Universal Unique IDentifiers (UUID), metadata sourced from external descriptive libraries, and work orders or requests for media access, delivery, or edits.

Machine Learning (AI) applications have become a major source of metadata creation. Using asset management systems as a repository, machine learning software can be used to perform metadata augmentation and annotation of digital video and audio media. The output of these systems is typically an XML (eXtensible Markup Language) or JSON (JavaScript Object Notation) file that frame-accurately documents the media or its proxy. Cloud-based, consumer-driven systems like those offered by Amazon Web Services and Google can provide well-trained software systems for generic applications such as speech-to-text, scene change detection, or celebrity recognition. More broadcast- and media-focused platforms such as Veritone, GrayMeta, and Zorroa provide focused engines with particular characteristics to manage specialized jobs, such as content recognition for compliance (nudity, prohibited language, prohibited actions like smoking, etc.), location, product, and logo recognition, object recognition, and so on.
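As an illustration of the kind of output such engines return, the sketch below builds a frame-accurate annotation document in JSON; the field names and values are invented for this example and do not follow any particular vendor’s schema.

import json

annotation = {
    "asset_id": "example-asset-0001",
    "engine": "speech-to-text",
    "results": [
        {"start_timecode": "01:02:03:12", "end_timecode": "01:02:05:00",
         "label": "transcript", "value": "Welcome back to the studio.", "confidence": 0.94},
        {"start_timecode": "01:02:10:00", "end_timecode": "01:02:11:08",
         "label": "logo", "value": "StationBug", "confidence": 0.81},
    ],
}
print(json.dumps(annotation, indent=2))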

The Role of the DIT

As digitization becomes more important in our production chain, more tasks concerning data management have evolved and increased, and the position of the Digital Imaging Technician (DIT) has been created to address this ever-growing need. The DIT position was originally created to manage the transition from the long-established film movie camera medium into the current digital cinema age. As a camera department crew member, the DIT works in collaboration with the cinematographer on operational workflows, system integration, the production’s camera settings, overall signal integrity, and any image manipulation and effects to achieve the creative vision of the director and the cinematographer in digitized media.

Involved in the entire end-to-end production, DITs are responsible for preparation, on-set tasks, and post-production; in effect, the DIT connects the on-set work with the post-production operations. The DIT role has become a blend of several other positions, such as Video Controller, Video Shader, or Video Engineer. DITs support camera teams with technical and creative digital camera management to guarantee the highest technical quality as well as the safety and security of the digital files. They are the established responsible party for managing all data on set, including system quality checks and reliable file backups. Once a project enters post-production, the DIT manages the delivery of recordings to the post-production team, including quality control and generating working copies if they are required. Typically, the DIT must ensure that original camera data and all associated metadata are backed up at least twice daily, ensuring data integrity with checksum verification. Important productions require all backups to be made on LTO tapes, a sturdier and more reliable method of storage than camera hard drives and spinning disk storage.
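A minimal sketch of the “copy twice and verify” duty might look like the following; the paths are placeholders, and real productions typically add LTO targets and formal checksum manifests on top of a routine like this.

import hashlib
import shutil
from pathlib import Path

def sha256_of(path):
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def backup_and_verify(source, destinations):
    # Copy the source file to each destination and confirm the copies match.
    reference = sha256_of(source)
    for dest_dir in destinations:
        copy = Path(shutil.copy2(source, dest_dir))    # copy2 preserves timestamps
        if sha256_of(copy) != reference:
            raise RuntimeError(f"Checksum mismatch on {copy}")

backup_and_verify(Path("/mnt/cardA/clip_0001.mxf"),
                  [Path("/mnt/raid_backup"), Path("/mnt/shuttle_drive")])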

The DIT’s role on set has become important in assisting cinematographers, who are conditioned to working with film stock, in achieving their artistic vision through digital tools. By monitoring exposure and building Color Decision Lists (CDLs) and “look up tables” (LUTs) on a daily basis for the post-production team, the DIT actively assists the team. Across all equipment on the set, the DIT manages the settings in each digital camera’s menu system for recording format and output specifications. The DIT is also responsible for collecting and securing any digital audio recorded by an external recorder operated by the Production Sound Mixer.
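A Color Decision List is built from per-channel slope, offset, and power values (plus a single saturation value, omitted here); the sketch below shows the basic per-channel arithmetic with arbitrary example numbers, using Python and NumPy.

import numpy as np

def apply_cdl(rgb, slope=(1.1, 1.0, 0.95), offset=(0.01, 0.0, -0.01), power=(1.0, 1.0, 1.05)):
    # out = (in * slope + offset) ** power, applied per channel.
    graded = rgb * np.asarray(slope) + np.asarray(offset)
    graded = np.clip(graded, 0.0, 1.0)       # clamp before the power function
    return graded ** np.asarray(power)

pixel = np.array([[[0.18, 0.18, 0.18]]])     # a mid-grey test patch
print(apply_cdl(pixel))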

The data wrangler position, often working beside the DIT, was created as a support role for managing, transferring, and securing all the digital data acquired on set by the digital cameras. Depending on the scale of the project, the DIT can cover the data wrangler position, but usually not the reverse.

The Role of the Data Wrangler

Data wrangling is the process of mapping and/or transforming data from one “raw” form into a different format to add value for a variety of purposes, such as interoperability, analytics, and meeting archive standards. The process is sometimes called data “munging.” A data wrangler is the person who manages and performs these transformation operations, which may include data manipulation, aggregation, visualization, training a statistical model or machine learning (AI) software, and many other processes. As a process, wrangling typically follows a set of general steps: extracting the data in a raw form from the data source; “munging,” or sorting, the raw data using algorithms or parsing it into predefined data structures; and depositing the resulting content into a database for storage and future use. This munged data comprises one important form of metadata for our media purposes.

Data wranglers can be found in most industries today, not just media and entertainment, and the position can hold considerable importance as the coordinator for the acquisition of data across many different sources and devices; in our Internet of Things (IoT) connected world, this has become increasingly important. Specific duties mirror those of a storage administrator working with large amounts of data, involving both data transfer from instruments to a storage grid or facility as well as data manipulation for re-analysis via high-performance computing instruments or access via Internet infrastructure-based and AI tools.

Our data wrangler’s focus is on the massive amount of data generated by cameras, digital audio recorders, and production and editing tools. When data transformations are required, for example between different camera systems, these include actions such as extracting, parsing, joining, standardizing, augmenting, cleansing, consolidating, and filtering to create the desired wrangling outputs that can be leveraged in downstream post-production processes. Depending on the amount and format of the incoming data, data wrangling has traditionally been performed in spreadsheets or via hand-written scripts in languages such as Python or SQL. Sometimes data wrangling can be managed in asset management systems prepared with the proper data model and conversion tools.
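As a small example of such a script, the sketch below normalizes rows from two hypothetical camera-report CSV files, whose column names differ, into one structure and writes the result as JSON for a downstream tool; all file and field names are invented for illustration.

import csv
import json

def load_camera_report(path, fps_field, clip_field):
    # Read one camera report and map its columns onto a common structure.
    rows = []
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            rows.append({
                "clip": row[clip_field].strip(),
                "fps": float(row[fps_field]),
                "source_report": path,
            })
    return rows

clips = (load_camera_report("a_cam_report.csv", fps_field="FrameRate", clip_field="Clip Name")
         + load_camera_report("b_cam_report.csv", fps_field="fps", clip_field="clip"))

with open("normalized_clips.json", "w") as out:
    json.dump(clips, out, indent=2)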

eXtensible Markup Language (XML)

Extensible Markup Language (XML) is a computer language that defines a set of format rules for encoding documents so that they are both human readable and machine readable. XML is defined by the free and open XML 1.0 Specification adopted by the World Wide Web Consortium (W3C), together with several related specifications. The primary design goals of XML emphasize simplicity, generality, and usability across the Internet. It is a textual data format designed to support Unicode, allowing adoption by different human languages. Although designed to manage documents, XML is widely used for the representation of arbitrary data structures such as those used in digital media metadata. Several schema systems aid in the definition of XML-based languages, and programmers have developed many application programming interfaces (APIs) to aid the processing of XML data between tools and systems.

XML was developed as a specification to address the proliferation of non-interoperable document encoding structures. Today, hundreds of document formats using the XML syntax have been developed, including SOAP, SVG, and XHTML. XML-based formats have become the default for many office-productivity tools, and the rich features of the XML schema specification have provided the base language for communication protocols; many of these standards are complex, and it is common for a specification to comprise several thousand pages. XML has come into common use for the interchange of data over the Internet. XML is widely used in a Services Oriented Architecture (SOA), where disparate systems communicate with each other by exchanging XML messages. The message exchange format is standardized as an XML schema (XSD). This is a typical method used by media software systems for the exchange of relevant metadata or command sets.

Key XML Terminology

When looking at an XML file, understanding some terms and usage will help you decipher the meaning of the document. The material in this section is based on the XML Specification. This is by no means an exhaustive list of the constructs that appear in XML; it is provided only as an introduction to the key concepts encountered in media system and metadata usage.

Character: An XML document is a string of characters. Almost every legal Unicode character may appear in an XML document.

Processor and application: The processor analyzes the markup and passes structured information to an application. The specification places requirements on what an XML processor must and must not do; the actual application is outside its scope. The specification calls it a “processor,” but coders typically refer to it as an “XML parser.”

Markup and content: The characters making up an XML document are divided into markup and content, which may be distinguished by the application of fairly straightforward syntactic rules. Strings that constitute markup either begin with the character < and end with a >, or they begin with the character & and end with a ;. Strings of characters that are not markup are content. It is important to note that whitespace before and after the outermost element is classified as markup.

Tag: A tag is a markup construct that begins with < and ends with >. Tags come in three flavors:

  • start-tag, such as <section>;

  • end-tag, such as </section>;

  • empty-element tag, such as <line-break />.

Element: An element is a logical document component that either begins with a start-tag and ends with a matching end-tag or simply consists of an empty-element tag. The characters between the start-tag and end-tag, if any, are the element’s content, and may contain markup, including other elements, which are called child elements. An example is <greeting>Hello, world!</greeting>. Another is <line-break/>.

Attribute: An attribute is a markup construct consisting of a name–​value pair that exists within a start-tag or empty-element tag. An example is <img src="madonna.jpg" alt="Madonna" />, where the names of the attributes are “src” and “alt,” and their values are “madonna.jpg” and “Madonna” respectively. Another example is <step number="3">Connect A to B.</step>, where the name of the attribute is “number” and its value is “3.” An XML attribute can only have a single value, and each attribute can appear at most once on each element. In the common situation where a list of multiple values is desired, the list must be encoded into a single well-formed attribute value using some convention beyond what XML itself defines: usually either a comma- or semicolon-delimited list or, if the individual values are known not to contain spaces, a space-delimited list. For example, in <div class="inner greeting-box">Welcome!</div>, the attribute “class” has the single value “inner greeting-box,” which also indicates the two CSS class names “inner” and “greeting-box.”

XML declaration: XML documents may begin with an XML declaration that describes some information about themselves. An example is <?xml version="1.0" encoding="UTF-8"?>.

With a little basic knowledge and by carefully reviewing an XML document, one can usually discern the meaning of the file without a coding background.
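To connect the terminology to practice, the sketch below parses a small, invented metadata document with Python’s built-in XML parser and pulls out an element, an attribute, and some content.

import xml.etree.ElementTree as ET

document = """<?xml version="1.0" encoding="UTF-8"?>
<asset uuid="example-0001">
  <title>Evening News Promo</title>
  <clip start="00:00:10:00" end="00:00:25:12"/>
</asset>"""

# fromstring is the "processor" step; encode to bytes because the declaration names an encoding.
root = ET.fromstring(document.encode("utf-8"))
print(root.tag, root.attrib["uuid"])          # element name and an attribute value
print(root.find("title").text)                # content of a child element
for clip in root.iter("clip"):                # an empty-element tag still carries attributes
    print(clip.attrib["start"], clip.attrib["end"])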

Cloud Impacts

The ubiquity of cloud infrastructure has simplified production teams’ access to dailies and given the DIT and data wranglers a faster, more comprehensive gateway to tools, storage, and automated backup processes. As more metadata is generated and more munging is required, cloud tools provide a trusted platform with fast ingress and egress. Competition in the cloud services market has begun the inevitable process of driving cloud infrastructure to more affordable levels, and software manufacturers have begun to leverage cloud-native designs with microservices so that a DIT can customize a platform to a production’s unique requirements. Storage options and automatic disaster recovery options for data are a welcome support for the DIT’s protection measures, and advances in security have made these platforms safe and reliable.
