10
Information-Centric Software
10.1 Standard Phones, Smart Phones and Super Phones
In December 2009 Microsoft and NAVTEQ, a digital map and location company, signed a technology agreement on the development of 3D map data and navigation software, including the street-level visuals we now take for granted when we look at Google Maps. Eighteen months earlier Nokia had acquired NAVTEQ for $8.1 billion and started introducing phones such as the 6210 optimised for navigation applications.
Over the next two years smart phones with GPS positioning and associated navigation capabilities became increasingly common, but the unique differentiation that Nokia presumably aimed to achieve through the acquisition failed to materialise, and it is hard to detect any particular market advantage despite the substantial investment. This is arguably a question of timing, with a general rule that it is not a great idea to invest in a specialist technology provider servicing a market that is being aggressively commoditised.
In 2006 we undertook an analysis of the potential of the mobile metadata market. This gives us a benchmark against which market expectation and market reality five years on can be judged.
We premised the analysis on the assertion that three types of mobile phone hardware and software form factor would evolve over time: standard phones, smart phones and super phones. Standard phones are voice dominant or voice/text dominant. Standard phones change the way we relate to one another. These phones have relatively basic imaging and audio capture capabilities. Smart phones change the way we organise our work and social lives. They have more advanced imaging and audio capture capabilities than standard phones. Super phones, also known as ‘future phones’ in that they did not exist in 2005, would change the way in which we relate to the physical world around us.
Super phones would have advanced audio and image capture capabilities equal to or better than human audio and video processing systems, have equivalent memory resources but significantly better memory retrieval than human memory systems, a better sense of direction than most humans and an enhanced ability to sense temperature, gravitational forces and electromagnetic fields.
10.1.1 The Semisuper Phone
In a sense the ingredients for a semisuper phone were already available. We could capture 20 kHz or more of audio bandwidth using solid-state microphones, capture low-g gravity gradients using 3-axis accelerometers, build on low-cost phones that included a temperature sensor, digital compass and positioning capabilities, and add software value to all of these hardware attributes. However, a super phone has to be as good as or better than humans at more or less everything. This means having imaging bandwidth that is at least perceptually equivalent to, and preferably better than, our own human visual system, able to see and hear the world in three dimensions.
This is why humans have two ears and two eyes: to gauge distance and depth and to respond and react to danger or opportunity. Similarly, humans have a well-evolved optical memory. We remember events visually, albeit sometimes rather selectively.
There are, however, still significant hardware constraints. Imaging resolution, even with high-end CCD sensors such as the 40-megapixel CCD referred to in the last chapter, still falls short of present human resolution (about 200 megapixels). Image-processing systems are also still significantly dumber than humans when it comes to taking spatial decisions based on complex multichannel inputs.
For example, a racing driver driving a Formula One car uses audio, visual, vibrational and wind speed information and a sense of smell in subtle and significant ways that are presently hard to replicate either in terms of capture bandwidth or information processing in artificial intelligence systems.
However, leaving aside how audio and visual information is actually used, there are some areas where devices are better at ‘seeing’ the real world than we are. CCD and CMOS sensors for example are capable of capturing and processing photons well into the nonvisible infrared spectrum (up to 900 nanometres for CMOS and over 1000 nanometres for CCD) – the basis for present night-vision systems.
There are similarities between the image processing chain and the RF processing chain. In RF signal processing, the three major system performance requirements are sensitivity (the ability to detect wanted signal energy in the presence of a noise floor), selectivity (the ability to detect wanted signal energy in the presence of interfering signals) and stability (the ability to achieve sensitivity and selectivity over time, temperature and a wide range of operating conditions).
Noise is a composite of the Gaussian white noise introduced by the devices used to process (filter and amplify) the signal and of distortion, with the added complication of nonlinear effects that create harmonic distortion. Image processing is very similar. The lens is the equivalent of the radio antenna but collects photons (radiant energy in the visible optical bands) rather than RF energy.
10.2 Radio Waves, Light Waves and the Mechanics of Information Transfer
Radio waves are useable in terms of our ability to collect wanted energy from them from a few kHz (let's say 3 kHz) to 300 GHz, above which the atmosphere becomes to all intents and purposes completely opaque.
In wavelength terms, assuming radio waves travel at 300 000 000 metres per second, this means our wavelengths of interest scale from 100 kilometres (3 kHz, very long wave) to 0.001 metre, or 1 millimetre (300 GHz).
Optical wavelengths of (human) visible interest scale from roughly 400 to 700 nanometres, but let's say we are also interested in ultraviolet (below 400 nanometres) and infrared (above 700 nanometres). Let's be imaginative and say we are interested in the 100 to 1000 nanometre band. Assuming light waves also travel at 300 000 000 metres per second (of which more later), this means wavelengths of interest from 0.000 001 metres (1000 nanometres) down to 0.000 000 1 metres (100 nanometres), or frequencies from 300 to 3000 THz (a nanometre is one billionth of a metre).
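As a quick sanity check on these figures, the conversion is simply wavelength = speed/frequency. A minimal Python sketch, using the rounded 300 000 000 metres per second assumed above:

```python
# Minimal sketch: wavelength/frequency conversion using the rounded
# propagation speed of 300 000 000 m/s assumed in the text.
C = 300_000_000  # metres per second (rounded)

def wavelength(frequency_hz):
    """Wavelength in metres for a given frequency in Hz."""
    return C / frequency_hz

def frequency(wavelength_m):
    """Frequency in Hz for a given wavelength in metres."""
    return C / wavelength_m

print(wavelength(3e3))           # 3 kHz   -> 100 000 m (100 km, very long wave)
print(wavelength(300e9))         # 300 GHz -> 0.001 m (1 mm)
print(frequency(1000e-9) / 1e12) # 1000 nm -> 300 THz
print(frequency(100e-9) / 1e12)  # 100 nm  -> 3000 THz
```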
10.2.1 Image Noise and Linear and Nonlinear Distortions
Having collected our radiant energy in a lens we subject it to a number of distortion processes partly through the lens itself (optical aberrations) and then via the sensor and colour-filter array.
The sensor array converts the photons into electrons (an obliging property of silicon) but in the process introduces noise. The combination of lens imperfections and sensor noise introduces ambiguity into the image pipe that needs to be dealt with by image-processing algorithms.
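A minimal sketch of where that ambiguity comes from, assuming a much-simplified sensor model in which photon shot noise and read noise are added to an ideal pixel signal (all parameters are illustrative, not taken from any particular sensor):

```python
import random

def noisy_pixel(photons, read_noise_electrons=5.0, quantum_efficiency=0.5):
    """Illustrative sensor model: photons in, noisy electron count out.

    Shot noise is approximated with a Gaussian of variance equal to the
    signal (a common approximation to Poisson statistics) and read noise
    is added as zero-mean Gaussian noise. All parameters are assumptions.
    """
    signal = photons * quantum_efficiency          # ideal electron count
    shot_noise = random.gauss(0, signal ** 0.5)    # photon shot noise
    read_noise = random.gauss(0, read_noise_electrons)
    return max(0.0, signal + shot_noise + read_noise)

# The same scene sampled twice gives two different electron counts -
# the ambiguity the image-processing algorithms have to resolve.
print(noisy_pixel(1000), noisy_pixel(1000))
```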
10.2.2 The Human Visual System
Interestingly, this is not dissimilar to the human visual system, where we sample the real optical analogue world with two parallel sampling grids consisting of cones and rods. Cones are centred around the middle of the retina, work in medium to bright light and are colour sensitive. Rods are concentrated in an area outside the cones. They greatly outnumber the cones and are optimised for low-light conditions but are not colour sensitive. This is why we see in black and white at night. If you look at a distant dim object (a star at night) it will be easier to see if you shift your gaze off centre.
A sensor array emulates the functionality of the cones in the human visual system. It has to deliver similar or better dynamic range, similar or better colour acuity and perceptually similar resolution. Note that although the effective resolution of the human visual system equates to some 200 megapixels, no one seems quite sure why we need that much resolution.
So image processing is just like RF signal processing, except that it has to take into account the idiosyncrasies of the human visual system. Note also that although humans are superficially alike (two arms, two legs, two eyes) we are also intrinsically different in terms of optical performance. Some of us are colour blind, some of us have better resolution. Scientists have solved this by creating an ‘ideal observer’, a sort of ‘model man’ against which we can judge artificial optical and image processing systems.
The human visual system was first described in detail (in the western world at least) by Johannes Kepler (1571–1630) who codified the fundamental laws of physiological optics.
Rene Descartes (1596–1650) verified much of Kepler's work and added some things (like the coordinate system) that have proved indispensable in present-day imaging systems.
Isaac Newton (1642–1726) worked on the laws of colour mixture (‘the rays to speak properly are not coloured’) and did dangerous experiments on himself pushing bodkins (long pins) into his eye. He anticipated the modern understanding of visual colour perception based on the activity of peripheral receptors sensitive to a physical dimension (wavelength). All that stuff with the apple was a bit of a sideshow.
In 1801 Thomas Young proposed that the eye contained three classes of receptor each of which responded over a broad spectral range. These ideas were developed by Helmholtz and became the basis of Young–Helmholtz trichromatic theory.
This optical wavelength business was all a bit curious and still is. Anders Jonas Angstrom (1814–1874) decided quite sensibly that light should be measured in wavelengths and that given that the radiant energy was coming from the sun, then it would be a good idea to relate the measurements to a hydrogen atom.
Thus, an Angstrom was defined as being equivalent to the diameter of a hydrogen atom and equivalent to one tenth of a nanometre. This means that a nanometre is equivalent to 10 hydrogen atoms sitting shoulder to shoulder (assuming this is what hydrogen atoms like doing). This is remarkable given that in 1850, the metre was an ill-defined measure and dangerously French.
In the 1790s when Fourier was thinking about his transforms during the French revolution, his compatriots were deciding that the metre should be made equivalent to one ten millionth of the length of the meridian running through Paris from the pole to the equator on the basis that Paris was and still is the centre of the world if not the universe in general.
Unfortunately, this distance was miscalculated (a failure to account for the flattening of the world by rotation) and so the metre was set to an arbitrary length referenced to a bar of rather dull platinum–iridium alloy.
This was redefined in 1889 using an equally arbitrary and equally dull length of platinum–iridium, then redefined again in 1960 with a definition based on a wavelength of krypton-86 radiation (the stuff that Superman uses so at least slightly less dull).
In 1983 the metre was redefined as the length of the path travelled by light in a vacuum during a time interval of 1/299 792 458 of a second, which effectively determined the speed of light as 299 792 458 metres per second.
This is not the same basis on which radio waves are measured, nor the same way that Angstroms are measured. Anders Jonas Angstrom's work was carried on by his son Knut (1857–1910) and formed a body of work that remains directly relevant today to the science of optical measurement.
10.3 The Optical Pipe and Pixels
10.3.1 Pixels and People
We have said that the sensor array emulates the functionality of the human visual system in that it takes point measurements of light intensity at three wavelengths (red, green and blue). A 4-megapixel sensor will have four million pixels most of which will be capturing photons and turning them into an electron count that can be used to describe luminance (brightness) and colour (chrominance).
Pixels behave very similarly to people. There are black pixels and white pixels and 254 shades of grey pixel in between. Pixels of similar colour tend to congregate together. A white pixel in a predominantly black pixel area stands out as being a bit unusual (an outlier). There are good pixels and bad pixels. Some bad pixels start life as good pixels but become bad later in life. Noisy pixels can be a big problem not only to themselves but also to their neighbours. Pixels see life through red, green, blue or clear spectacles but lack the natural colour-balancing properties of the human visual system. Their behaviour is, however, fairly predictable both spatially in a still image and temporally in a moving image.
10.3.2 The Big Decision
In RF signal processing, we have to take a final decision as to whether a 0 is a 0 or a 1 is a 1 or whether a bit represents a +1 or −1 (See Chapter 3 section on matrix maths).
In the same way, a decision has to be taken in the image pipe – is this a black pixel or a white pixel (or one of 254 shades of pixel in between), or a green, red or blue pixel of a certain intensity, and should that pixel be there or actually moved along a bit? Given that pixels are arranged in 4 by 4 or 8 by 8 or 16 by 16 macroblocks, this is like a game of sudoku, or if preferred a slightly complicated form of noughts and crosses (diagonals are important in the pixel world).
Pixels that do not fit a given set of rules can be changed to meet the rules. These decisions can be based on comparing a point pixel value with that of its neighbours. Simplistically, it is this principle that is applied to image enhancement in the image pipe (noise reduction, blur reduction, edge sharpening) and to image manipulation either in or after the image pipe (colour balancing, etc.).
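A minimal sketch of the neighbour-comparison principle, using a naive 3 by 3 median filter over a small greyscale array as a stand-in for the more elaborate noise-reduction algorithms in a real image pipe:

```python
def median_filter(image):
    """Replace each interior pixel with the median of its 3x3 neighbourhood.

    'image' is a list of lists of greyscale values (0-255). Border pixels
    are left untouched in this simplified sketch.
    """
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            neighbourhood = sorted(
                image[y + dy][x + dx]
                for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            )
            out[y][x] = neighbourhood[4]  # median of the nine values
    return out

# A lone white pixel (an outlier) in a dark region is pulled back into line.
frame = [[10, 12, 11, 13],
         [11, 255, 12, 10],
         [12, 11, 10, 12],
         [10, 13, 12, 11]]
print(median_filter(frame)[1][1])  # 12, not 255
```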
10.3.3 Clerical Mathematicians who Influenced the Science of Pixel Probability
This gives us an excuse to introduce three mathematicians, the Reverend Thomas Bayes (the clerics of middle England and their role in cellular phone design), George Boole and Augustus De Morgan.
The Reverend Thomas Bayes (1702–1761)1 is arguably the father of ‘conditional probability theory’, in which ‘a hypothesis is confirmed by any data that its truth renders probable’. He is best known for his 1764 paper ‘An Essay Towards Solving a Problem in the Doctrine of Chances’ and his principles are widely used in gambling (of which Thomas Bayes would have disapproved) and in image-processing and image-enhancement algorithms.
Conditional probability theory is, for instance, the basis for context modelling. An area of blue sky with occasional white clouds will obey a certain rule set prescribed by the context (a sunny day); a dull day sets a different context in which greys will dominate and contrast will be less pronounced. Adaptive algorithms can be developed that are spatially and/or context driven, conditioned by the image statistics and/or any other available context information.
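A minimal sketch of Bayes' rule applied to context modelling, with invented priors and likelihoods: given that a pixel is bright blue, how probable is the 'sunny day' context?

```python
def bayes(prior, likelihood, evidence):
    """Bayes' rule: P(H|D) = P(D|H) * P(H) / P(D)."""
    return likelihood * prior / evidence

# Illustrative (invented) numbers:
p_sunny = 0.3                # prior probability of the sunny-day context
p_blue_given_sunny = 0.6     # bright blue pixels are common on sunny days
p_blue_given_dull = 0.1      # and rare on dull grey days
p_blue = p_blue_given_sunny * p_sunny + p_blue_given_dull * (1 - p_sunny)

print(bayes(p_sunny, p_blue_given_sunny, p_blue))  # ~0.72 - the context is probably sunny
```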
George Boole (1815–1864) developed Bayes’ work by introducing more precise conditional language, the eponymous Boolean operators of OR (any one of the terms is present, more than one term may be present), AND (all terms are present), NOT (the first term but not the second term is present) and XOR (exclusive OR, one or other term is present but not both).
Every time we use a search engine we are using a combination of Boolean operations and Bayesian conditional probabilities but Boole is also present in almost every decision taken in the image-processing pipe.
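A minimal sketch of the same four operators applied to per-pixel decisions, using two invented Boolean conditions (is the pixel bright, and has it been flagged as sky?):

```python
# Two illustrative per-pixel conditions for a row of four pixels.
is_bright = [True, True, False, False]
is_sky    = [True, False, True, False]

or_mask  = [b or s for b, s in zip(is_bright, is_sky)]       # OR
and_mask = [b and s for b, s in zip(is_bright, is_sky)]      # AND
not_mask = [b and not s for b, s in zip(is_bright, is_sky)]  # NOT (bright but not sky)
xor_mask = [b != s for b, s in zip(is_bright, is_sky)]       # XOR

print(or_mask, and_mask, not_mask, xor_mask)
```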
Augustus De Morgan (1806–1871) was working in parallel to Boole on limits and boundaries and the convergence of mathematical series (or at least the description of convergence in precise mathematical terms). He is probably most famous for De Morgan's Law that applied to people and pixels goes something like this:
In a particular group of people
Most people have shirts.
Most people have shoes.
Therefore, some people have both shirts and shoes.
In a particular group of pixels
Most pixels have chrominance value x.
Most pixels have chrominance value y.
Therefore, some pixels have both values x and y.
More specifically, the work of Augustus De Morgan is directly applied to the truncation of iterative algorithms so that an optimum trade-off can be achieved between accuracy and time – the longer the calculation runs, the more accurate it becomes but the more clock cycles it consumes. The modern image-processing pipe is full of these rate/distortion, complexity/distortion and delay/distortion trade-off calculations.
So the Reverend Bayes, George Boole and Augustus De Morgan have had, and are still having, a fundamental impact on the way in which we manage and manipulate image signals that have particular statistical properties and exhibit specific behavioural characteristics. They allow us to pose the thesis that ‘this cannot be a black pixel because …’.
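A minimal sketch of the truncation trade-off, using an iterative square-root refinement (Newton's method) stopped either by an accuracy threshold or by an iteration budget, whichever comes first. The thresholds are illustrative:

```python
def iterative_sqrt(value, tolerance=1e-6, max_iterations=50):
    """Newton's method for a square root, truncated by accuracy or by budget.

    Returns the estimate and the number of iterations actually spent -
    the accuracy/clock-cycle trade-off in miniature.
    """
    estimate = value if value > 1 else 1.0
    for iteration in range(1, max_iterations + 1):
        next_estimate = 0.5 * (estimate + value / estimate)
        if abs(next_estimate - estimate) < tolerance:
            return next_estimate, iteration
        estimate = next_estimate
    return estimate, max_iterations

print(iterative_sqrt(2.0, tolerance=1e-2))   # rough answer, few cycles
print(iterative_sqrt(2.0, tolerance=1e-12))  # accurate answer, more cycles
```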
10.3.4 The Role of Middle England Clerics in Super-Phone Design
Their role in super-phone design is, however, far more fundamental.
Early super phones will be capable but dumb. They may be able to see and hear as well as we can but they cannot think like us. This requires the power of reason and logical analysis to be applied.
Smart phones’ declared role in life is to make us more efficient. Some of us, many of us, do not want to be made more efficient. We may own a smart phone because buying one seemed like a good idea at the time, but many of us will only use a small subset of the phone's potential capabilities.
Most of us, however, would like some practical help in the way that we relate to the real world around us – knowing where we are, where we need to go, seeing in the dark, hearing sounds that are below our acoustic hearing threshold and having a device that can help us make intuitive and expanded sense of the surrounding physical world of light, sound, gravity, vibration, temperature, time and space.
This is the promise of the smart super phone – the future phone that will deliver future user and telecom value. The value, however, is enhanced if placed in a context of position and location, which brings us to the topic of mobile metadata management.
10.4 Metadata Defined
Metadata is information about information, data that is used to describe other data. The prefix comes from the Greek, meaning ‘among’, ‘with’, ‘beside’ or ‘after’. Aristotle's ‘Metaphysics’ was a follow-on to his work on physics – it was consequential to the original information or, in Aristotle's case, body of existing work.
Metadata can be spatial (where), temporal (when) and social (who). It can be manually created or automatically created. Information about information can be more valuable than the information itself.
Semantic metadata is metadata that not only describes the information but explains its significance or meaning. This implies an ability to interpret and infer (Aristotle).
We are going to stretch a point by also talking about ‘extended metadata’. ‘Extended metadata’ is information about an object, a place, a situation or a body of work (Aristotle again). The object, the place, the situation, the body of work is ‘the data’. An object can be part of a larger object – part of an image, for example.
The combination of metadata and extended metadata, particularly in a mobile wireless context, potentially delivers significant value but is in turn dependent on the successful realisation of ‘inference algorithms’, 'similarity processing algorithms’ and ‘sharing algorithms’ that in themselves realise value (algorithmic value).
This might seem to be ambitiously academic. In practice, metadata and, specifically, mobile metadata is a simple mechanism for increasing ‘user engagement’ that by implication realises user value.
10.4.1 Examples of Metadata
Higher-level examples of metadata include electronic programming guides used in audio and video streaming. These can be temporally based (now showing on channel number) or genre based (sport, classical, jazz). The genre-based indexing of iPod music files is an example.
Lower-level examples include audio, image and video metadata descriptors that allow users to search for particular characteristics, an extension of present word and image search algorithms now ubiquitous on the World Wide Web. These descriptors also allow automated matching of images or video or voice or audio clips.
Manual metadata depends on our ability to name or describe content. We may of course have forgotten what something is called – a tune, perhaps. If we can hum the tune, it is possible to match the temporal and harmonic patterns of the tune to a database. This is an automated, or at least memory-jogging, naming process that results in the automated creation of metadata, putting a name to the tune. This may not realise direct application value (the application will probably be free) but provides an opportunity to realise indirect value (sheet music or audio download value).
The example in Figure 10.1 will be visually recognisable to some because of the pattern combination of the time signature (3/4 waltz time) and the opening interval (a perfect fifth). If you just used the perfect fifth it could be quite a large number of tunes: Twinkle Twinkle Little Star, the theme from 2001, Whisper Not and many more. However, only one of these examples is a waltz (clue to the tune: think River and Moon). Congratulations. You have just performed an inference algorithm based on temporal and harmonic pattern recognition.
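A minimal sketch of that inference, filtering a tiny invented catalogue by time signature and opening interval. The entries simply mirror the examples in the text and are not a verified musical database:

```python
# Illustrative catalogue: (title, time signature, opening interval in semitones).
# Time signatures and intervals follow the examples in the text, simplified.
catalogue = [
    ("Twinkle Twinkle Little Star", "4/4", 7),          # perfect fifth
    ("Also Sprach Zarathustra (2001 theme)", "4/4", 7),
    ("Whisper Not", "4/4", 7),
    ("Moon River", "3/4", 7),
]

def match(time_signature, interval_semitones):
    """Return the titles that share both the time signature and the opening interval."""
    return [title for title, sig, interval in catalogue
            if sig == time_signature and interval == interval_semitones]

print(match("4/4", 7))  # several candidates from the opening fifth alone
print(match("3/4", 7))  # adding the waltz time signature leaves just one
```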
10.4.2 Examples of Mobile Metadata
Any of the metadata examples above are potentially accessible in a wide-area wireless environment. However, they are not unique to wireless. Mobile metadata combines metadata value with mobility value. Mobile metadata is based on our ability to acquire and use spatial knowledge of where a user is at any given time combined with other inputs available to us.
Spatial knowledge may be based on cell ID or observed time difference or GPS (macropositioning) possibly combined with heading information (compass heading), possibly combined with micropositioning (low-G accelerometers), probably combined with time, possibly combined with temperature or other environmental data (light level or wind speed).
Even simple cell ID can add substantial metadata value, for example images can be tagged and/or archived and/or searched and/or shared by cell ID location.
This is an area of substantial present research, for example by the Helsinki University of Technology2 under the generic description of ‘automatic metadata creation’. The research highlights the problems of manual metadata. For example, when people take a picture it is tedious and usually difficult to annotate the image manually. There will generally be a lack of consistency in manual annotation that will make standardised archiving and retrieval more problematic.
In comparison, automated spatial and temporal metadata can be standardised and unambiguous. We know where the user is and what the time is. Areas of potential additional value are identified. These are based, respectively, on ‘inference algorithms’ also known as ‘disambiguation algorithms’, ‘similarity processing algorithms’ also known as ‘guessing algorithms’ and ‘sharing algorithms’ also known simply as ‘group distribution lists’.
10.4.3 Mobile Metadata and Geospatial Value
For example, if it is known that a user has just emerged from Westminster Tube Station (macroposition), has turned right and then left into Parliament Square, has then turned to face south (compass heading) and up (micropositioning) and then takes a picture, you know that the user is taking a picture of Big Ben. Theoretically, this could be double-checked against the typical image statistics of Big Ben.
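A minimal, rule-based sketch of that disambiguation, with approximate coordinates and invented bearings and thresholds; a real implementation would weight many more inputs and, as suggested above, check the result against reference image statistics:

```python
import math

# Illustrative landmark reference data: approximate coordinates and invented
# bearings from Parliament Square - not surveyed reference values.
LANDMARKS = {
    "Big Ben": {"lat": 51.5007, "lon": -0.1246, "bearing_deg": 170},
    "Westminster Abbey": {"lat": 51.4993, "lon": -0.1273, "bearing_deg": 250},
}

def suggest_tag(user_lat, user_lon, compass_deg, tilt_deg,
                max_distance_deg=0.005, max_bearing_error=30):
    """Suggest a landmark tag from macroposition, compass heading and tilt."""
    for name, ref in LANDMARKS.items():
        distance = math.hypot(user_lat - ref["lat"], user_lon - ref["lon"])
        bearing_error = abs((compass_deg - ref["bearing_deg"] + 180) % 360 - 180)
        if distance < max_distance_deg and bearing_error < max_bearing_error and tilt_deg > 10:
            return name  # offer this annotation for the user to accept or reject
    return None

# Just outside Westminster tube, facing roughly south and tilting upwards.
print(suggest_tag(51.5010, -0.1250, compass_deg=175, tilt_deg=25))  # 'Big Ben'
```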
This is the basis of geotagging and smart tagging familiar to us in present mapping systems. Smart tagging was standardised in Windows XP and has become progressively more ubiquitous over time.
‘Big Ben’ can then be offered as a metadata annotation for the user to accept or reject.
However, having automatically identified that it is Big Ben that the user is looking at, there is an opportunity to add additional extended metadata information to the image, for example, see below.
10.4.4 Optional Extended Image Metadata for Big Ben
Big Ben is one of London's best-known landmarks and looks most spectacular at night when the clock faces are illuminated. You even know when Parliament is in session, because a light shines above the clock face.
The four dials of the clock are 23 feet square, the minute hand is 14 feet long and the figures are 2 feet high. Minutely regulated with a stack of coins placed on the huge pendulum, Big Ben is an excellent timekeeper, which has rarely stopped.
The name Big Ben actually refers not to the clock-tower itself, but to the thirteen ton bell hung within. The bell was named after the first commissioner of works, Sir Benjamin Hall. This bell came originally from the old Palace of Westminster; it was given to the Dean of St. Paul's by William III.
The BBC first broadcast the chimes on the 31st December 1923 – there is a microphone in the turret connected to Broadcasting House.
10.4.5 Mobile Metadata's Contribution to Localised Offer Opportunities, Progressive Mapping and Point and Listen Applications
There are of course various related localised ‘offer opportunities’ associated with this ‘engagement moment’, a McDonald's 24-hour ‘Big Ben Burger’ or the latest geospecific downloadable Big Ben ring tone. Of course, if the super phone has decent solid-state microphones, then users can record their own ring tones and use them and/or upload them with a suitable spatial, temporal and personal metadata tag.
Geospatial value is of course closely linked, or could be closely linked, to progressive mapping applications (where only relevant parts of the map are downloaded as you need them) and wide-area ‘point and listen’ applications. Point straight ahead in the Big Ben example and you could be listening to an audio download of Parliamentary question time.
If you wanted a 3D overview of the area for local navigation or local information reasons or a ‘virtual 3D tour’ of the inside of Big Ben then this could be downloaded from a server.
10.5 Mobile Metadata and Super-Phone Capabilities
So there are a number of extended opportunities that are specific to the extended audio and image capture and play back/play out capabilities now being included in new-generation phones and becoming increasingly familiar to us today.
This is important because many of our mobile metadata applications do not necessarily require a connection to a network. A Nikon camera with GPS and a compass capability will know where it is and what it is looking at and could have embedded reference data available. A top-of-the-range Nikon SLR camera, for example, has 30 000 reference images stored that are used for autoexposure, autowhite balance and autofocus, so it is a small step to add another few gigabytes of embedded reference data. The purpose of scene recognition in this example is limited to improving picture quality, but the same principles can be applied to face recognition and other applications such as number-plate recognition.
It is therefore important to identify functionality that can be made unique to a phone, particularly a camera phone, connected to a network. The ability to connect to the web to access smart tag or geotag information is an important differentiator but not the only one.
We suggested that phones could be divided into three functional categories, standard phones that help us communicate with one another, smart phones that help us to organise our social and business lives and super phones that help us to relate to the physical world around us. Smart super phones combine all three functional domains. The mobile metadata that you get from network-connected smart super phones builds on these functional domains.
We have said that smart super phones by definition will have advanced micro- and macropositioning and direction sensing capabilities combined with extended audio (voice and music) and image-capture capabilities combined with extended audio and display and text output capabilities plus access to server-side storage (the web) plus access to network resident rendering, filtering and processing power.
Returning to our Big Ben example, Big Ben is an object with a unique audio signature. Other objects and places have similar but different signatures, the sea side for example, a concert hall, an airport. We mentioned earlier that audio signatures make potentially good ring tones – Big Ben or the Westminster Peal.
If you fancy church bells as ring tones you will be interested in the relevant object metadata tag.
‘Consider the effect of bells rung together in a peal rather than singly, in English changes – with each bell striking a fixed interval, usually 200 to 250 milliseconds after the previous one. Peals of bells can sound very different in character and the effect is influenced by many factors: the frequency and amplitude of the bells’ partials, considered individually and as a group; how they are clappered; and rather critically, the building in which they are being rung’ (from a guide to campanology – the harmonic science and mechanics of bell ringing).
A paper on combinatorics, a branch of discrete mathematics, explores this topic further using bell ringing as an example.3
Objects and places and spaces similarly have unique visual signatures. These can be object specific, The Statue of Liberty for example, or general, the seaside. A seascape will generally consist of some sky (usually grey in the UK), some sea (usually grey in the UK) and sand (usually grey in the UK). The image statistics will have certain recognisable ratios and certain statistical regional boundaries and there will also be a unique audio signature.
Object, place and space visual signatures come ‘for free’ from the autoexposure, autocolour balancing and autofocusing functions in higher-end camera phones. Autoexposure gives you a clue, though not definite proof, that it is day or night; autocolour balancing and autoexposure together give you a clue, though not definite proof, that the image is being taken in daylight outdoors on a sunny or cloudy day or indoors under natural, tungsten or fluorescent lighting. Autofocus gives you a clue, though not definite proof, of distance and spatial relationships within the image. Two cameras mounted a couple of inches apart and two microphones six inches apart would of course greatly increase the spatial relationship capture capability, a function beginning to be included in 3G smart phones.
People have social signatures. These are specific (voice and visual), physical (my laptop has fingerprint recognition, which I am impressed by but have never used) and general, for example behaviour patterns.
10.6 The Role of Audio, Visual and Social Signatures in Developing ‘Inference Value’
Now these are interesting capabilities and necessary for higher-end camera phone functionality but do they have a wider application value when combined with network connectivity?
Well, yes, in that they can add value to ‘inference algorithms’.
‘Inference algorithms’ are algorithms that combine spatial, temporal and social information. Their job is to ‘disambiguate’ the context in which, for example, a picture has been taken or an observed event is taking place.
So the fact that it is 4.00 a.m. (temporal) on the 21 June (temporal) and that you (personal signature) are somewhere on Salisbury Plain (cell ID spatial) looking at the sun (autoexposure and autowhite balance) rise over some strangely shaped rocks (image statistics and object recognition) at a certain distance away (autofocus), listening to some chanting (audio signature), tells the network with a very high level of confidence that you are a Druid priest attending the midsummer solstice at Stonehenge. Does this particular example illustrate realisable value? Probably not, but more general applications of inference algorithms will – this is mechanised, automated mobile metadata inference value.
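A minimal sketch of how such an inference algorithm might fold these cues into a single confidence score. The cue names and weights are invented; a real system would use a properly calibrated probabilistic model rather than a simple weighted sum:

```python
# Invented cue weights for the 'midsummer solstice at Stonehenge' hypothesis.
CUE_WEIGHTS = {
    "date_is_21_june": 0.25,
    "time_is_pre_dawn": 0.15,
    "cell_id_salisbury_plain": 0.25,
    "image_matches_stone_circle": 0.25,
    "audio_matches_chanting": 0.10,
}

def inference_confidence(observed_cues):
    """Sum the weights of the cues that fired, capped at 1.0.

    A naive linear score rather than a proper probabilistic model -
    just enough to show the principle of combining spatial, temporal
    and social evidence into a single confidence value.
    """
    return min(1.0, sum(CUE_WEIGHTS[cue] for cue in observed_cues))

cues = ["date_is_21_june", "time_is_pre_dawn",
        "cell_id_salisbury_plain", "image_matches_stone_circle"]
print(inference_confidence(cues))  # 0.9 - high confidence, offer the Druid tag
```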
10.7 Revenues from Image and Audio and Memory and Knowledge Sharing – The Role of Mobile Metadata and Similarity Processing Algorithms
Inference algorithms potentially create new application value opportunities and add value to existing applications. Downloadable Druid ring tones spring to mind from the above example – a rather specialist market.
‘Similarity processing algorithms’ are related to but different from inference algorithms. The job of similarity processing algorithms is to detect and exploit spatial, temporal and social commonalities and similarities or occasionally, useful and relevant dissimilarities (the attraction of opposites). The relationship with inference algorithms is clear from the example above.
If the network knows I am a Druid (I am not, but would not mind if I was) then the presumption could be made that I have common and similar social interests to other Druids. I am part of the Druid community, which in turn has affiliations with other specialist communities. We may live in similar places. Before long, an apparently insignificant addressable market has become more significant. Not only does the network know that I am a Druid, it also knows that I attended the 2011 summer solstice and may want to share memories of that moment with other members of my social community. This is the cue moment for the suggested recipients list for any image or video and/or audio recording I may have made of the event or for the sharing of particular cosmic thoughts that occurred to me at sunrise. This is sometimes described in the technical literature as ‘context to community’ sharing. The suggested recipients sharing list might include other potentially interested and useful parties. A parallel example is the solar eclipse that happened in March 2006 across Africa and Central Asia. This spawned a swathe of specialist web sites to service the people viewing and experiencing the event either directly or indirectly, including a NASA site.4
This is an example of a contemporary webcast report of the event:
‘Tokyo, Mar 25 2006: The national astronomical observatory of Japan and other organisations held an event to webcast a total solar eclipse from Africa and Central Asia when it occurred on Wednesday evening Japan time.
The event titled “Eclipse Cafe 2006” was the first of its kind organized by the observatory.
Participants viewed live images of the solar eclipse webcast at eight astronomical observatories and science museums across Japan, including Rikubetsu astronomical observatory in Hokkaido and Kazuaki Iwasaki space art gallery in Shizuoka prefecture.
Of the eight locations, participants in Hiroshima and Wakayama prefectures had opportunities to ask questions to astronomical researchers in Turkey using a teleconferencing system.
The webcasting service was offered with webcast images of solar eclipses in the past. On Wednesday, the images were transmitted from Egypt, Turkey and Libya.
The solar eclipse was first observed in Libya at around 7:10 pm local time and lasted four minutes, which is relatively long for a solar eclipse.’
Similarity processing algorithms would automatically capture these spatial (where), temporal (when and for how long) and social (who) relationships and create future potential engagement and revenue opportunities from a discrete but significant special interest group.
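A minimal sketch of that capture step, grouping invented event records by a crude shared date-and-interest key (a stand-in for richer spatial, temporal and social matching):

```python
from collections import defaultdict

# Invented event records: (user, place, date, interest tag).
events = [
    ("anna", "Libya", "2006-03-29", "eclipse"),
    ("ben", "Turkey", "2006-03-29", "eclipse"),
    ("carla", "Tokyo webcast", "2006-03-29", "eclipse"),
    ("dev", "London", "2006-03-29", "commuting"),
]

def group_by_similarity(records):
    """Group users who share a date and an interest tag - a crude similarity
    key standing in for richer spatial/temporal/social matching."""
    groups = defaultdict(list)
    for user, place, date, interest in records:
        groups[(date, interest)].append(user)
    return {key: users for key, users in groups.items() if len(users) > 1}

print(group_by_similarity(events))
# {('2006-03-29', 'eclipse'): ['anna', 'ben', 'carla']}
```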
In the case of Big Ben, it is self-evident that there are potentially millions of spatial, temporal and socially linked images, memories and ‘advice tags’ available to share that can be automatically identified.
10.8 Sharing Algorithms
So automated sharing algorithms can be built on automated inference algorithms and automated similarity algorithms. Sharing algorithms are based on an assumed common interest that may include a spatial common interest (same place), a temporal common interest (previous, present or possible future common interest) and a social common interest. Most of us are offered the opportunity to ‘benefit’ from sharing algorithms at various times through the working day. Plaxo and LinkedIn, Friends Reunited and Facebook are examples. This is algorithmic value in the corporate and personal data domain. All are examples of extended social metadata tagging where the tagged object is … us. This is generally only acceptable if we have elected to accept the process. It also suggests a need to establish that we are who we say we are. There is an implicit need for social disambiguation.
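A minimal sketch of a sharing algorithm along these lines, building a suggested-recipients list from an elective interest register. All names and interests are invented and the opt-in check is made explicit:

```python
# Invented, elective interest register: user -> (interests, opted_in).
REGISTER = {
    "me":    ({"druidry", "photography"}, True),
    "rowan": ({"druidry"}, True),
    "holly": ({"druidry", "astronomy"}, True),
    "ash":   ({"druidry"}, False),   # has not elected to share
}

def suggested_recipients(sender, content_tags):
    """Suggest recipients who share at least one content tag and have opted in."""
    suggestions = []
    for user, (interests, opted_in) in REGISTER.items():
        if user == sender or not opted_in:
            continue
        if interests & content_tags:  # any shared interest
            suggestions.append(user)
    return suggestions

print(suggested_recipients("me", {"druidry", "solstice"}))  # ['rowan', 'holly']
```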
10.9 Disambiguating Social Mobile Metadata
10.9.1 Voice Recognition and the Contribution of Wideband Codecs and Improved Background Noise Cancellation
We have said that spatial and temporal metadata are implicitly unambiguous. Social metadata is potentially more ambiguous. How can we be sure the user of the camera phone is the owner of the camera phone? Voice recognition helps in that it can uniquely identify the user. Wideband codecs and improved background noise cancellation will significantly improve present voice-recognition performance. Note that these capabilities inherently depend on a user's willingness to be identified, which in turn depends on the user's perception of value from the function. Higher-quality codecs and better background noise control will also improve general audio capture quality, making voice tags and audio tag additions to images (manual metadata input) and user-captured ring tones (a form of mobile metadata) more functionally attractive.
10.9.2 Text-Tagging Functionality
The same principle applies to speech recognition in that accuracy should steadily improve over time, making speech-driven text tagging more functionally effective (manual metadata input).
10.10 The Requirement for Standardised Metadata Descriptors
A combination of factors are therefore improving the functionality of automatic metadata capture and manual metadata input over time. Automatic spatial metadata capture improves as positioning accuracy and object recognition/image recognition/image resolution improves; automatic social metadata capture improves as voice recognition improves. Manual metadata input (text tagging) improves as speech recognition improves.
However, the benefits of these improvements can only be realised provided that standardised descriptors have been agreed to provide the basis for interoperability. Interoperability in this context means the ability to capture, process and share metadata from multiple sources.
10.10.1 The Role of MPEG-7 in Mobile Metadata Standardisation5
Even with improved speech recognition, text-based metadata input is problematic in that it is nonstandardised. Each user will tend to use a different vocabulary and syntax. This makes it difficult to implement inference and similarity algorithms other than through slightly haphazard word and phrase matching.
It is easier to standardise automatically generated metadata. For example, to mandate prescriptive methods for describing the harmonic and temporal structures of a voice file or audio file or the colour, texture, shape and gradient of an image file or the structure and semantics (who is doing what to whom) of a video file.
Images, video and audio files can usually use Fourier descriptors, given that the majority of image, video and audio content will have been compressed using Fourier-based JPEG or MPEG compression algorithms. The purpose of MPEG-7 is to standardise these audio and visual descriptors to allow the development of standardised search, match and retrieval techniques, including metadata-based inference and similarity algorithms.
Taking imaging as an example, descriptors can be region specific or macroblock specific. For example, it is possible to search for a man in a red woolly hat in a particular area of an image in multiple images by looking for the frequency descriptor for the colour red and the ‘closest match’ texture descriptor.
Generically, all images are converted into a common unified format in which image features are identified based on the wavelength of the colours making up the scene, expressed as a standardised 63-bit descriptor. Some of the more complex algorithms, for example automated face-recognition algorithms, require more accuracy and resolution and use a 253-bit descriptor.
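As a purely illustrative sketch of the matching idea, the following uses a crude quantised colour histogram rather than the actual MPEG-7 descriptors (the real descriptor formats and bit lengths are defined by the standard):

```python
def colour_descriptor(pixels, bins=4):
    """Crude descriptor: a normalised histogram of quantised (R, G, B) values.

    Not an MPEG-7 descriptor - just an illustration of reducing an image
    region to a fixed-length signature that can be compared numerically.
    """
    histogram = [0] * (bins ** 3)
    for r, g, b in pixels:
        index = (r * bins // 256) * bins * bins + (g * bins // 256) * bins + (b * bins // 256)
        histogram[index] += 1
    total = len(pixels)
    return [count / total for count in histogram]

def distance(d1, d2):
    """L1 distance between two descriptors - smaller means more similar."""
    return sum(abs(a - b) for a, b in zip(d1, d2))

red_hat = [(200, 30, 40)] * 10        # a patch of red woolly hat
other_red_hat = [(210, 25, 35)] * 10  # a similar patch in another image
blue_sky = [(90, 140, 230)] * 10      # an unrelated patch of sky

query = colour_descriptor(red_hat)
print(distance(query, colour_descriptor(other_red_hat)))  # small - closest match
print(distance(query, colour_descriptor(blue_sky)))       # large - rejected
```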
Video descriptions are based on a differentiation of simple scenes and complex scenes together with motion estimation using vectors (direction and magnitude). In a perfect world, this all works with present and proposed MPEG-4-based object coding, though in practice it is still problematic to extract meaningful video objects out of a high frame rate high-resolution video. These are nontrivial processing tasks not to be attempted seriously in embedded software and are presently better performed as a server-side function (potentially good news for network operator and server added value).
Cameras and camera phones equipped with depth sensors and/or twin-camera stereoscopic capture make it easier to capture 3D video and segmented objects, which in turn makes object-based coding more useful. These at present represent a specialised but emerging application sector. Thus, MPEG-7 marks a useful start in terms of a standards process but there is much practical work still to be done.
10.11 Mobile Metadata and the Five Domains of User Value
The fact that we have not arrived at an end destination does not mean we cannot enjoy the journey and mobile metadata already has much to offer in terms of ‘interconnected user domain value’.
To summarise:
The Radio system domain – mobile metadata can extract value from cell ID, direction and speed of movement and all of the other (many) user-specific behaviour metrics that are potentially available from the radio air interface. This value can be realised by integration with parallel mobile metadata inputs including:
The Audio domain – mobile metadata can be associated with voice capture and voice recognition, speech capture and speech recognition and wideband audio capture and audio listening.
The Positioning domain – positioning and location value can be associated with micropositioning (low-G accelerometer) and macropositioning (GPS or equivalent) location systems.
The Imaging domain – mobile metadata as an integral part of the image and video sharing value proposition.
The Data domain – the integration of mobile metadata into the personal and corporate information management proposition.
10.12 Mathematical (Algorithmic) Value as an Integral Part of the Mobile Metadata Proposition
Realisation of mobile metadata value in all five user value domains is, however, dependent on an effective implementation of standardised ‘descriptive maths’ (MPEG-7 descriptors or other standardised equivalents) and comparative algorithms – inference and similarity algorithms.
As stated earlier, some though not all of the algorithmic decisional techniques used in this space can be traced back to the Reverend Thomas Bayes, George Boole and Augustus De Morgan. Similarly, some though not all of the descriptor techniques can be traced back to Joseph Fourier (1768–1830) and Carl Friedrich Gauss (1777–1855).
But life moves on and it is important to realise that mathematical techniques and algorithmic techniques continue to evolve. In the telecoms industry, we have tended historically to focus on engineering capability (infrastructure deployment) and technology capability (product R and D) to provide competitive differentiation. This may be less relevant over time.
Modern mobile phones and modern mobility networks including mobility networks based on mobile metadata constructs are strategically dependent on mathematical and algorithmic value. This includes the ‘descriptive maths’ used to describe signals and systems and the ‘decisional maths’ used to respond to changing needs and conditions. An example of descriptive maths used in the signal space is the Fourier transform and present work on wavelet transforms. Descriptive maths helps us to capture and process the analogue world around us. One example of decisional maths applied at the system level would be an admission control algorithm.
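As a minimal sketch of decisional maths at the system level, a toy admission control rule might admit a new session only while the projected cell load stays under an assumed threshold (all figures invented):

```python
CELL_CAPACITY_KBPS = 10_000   # assumed cell capacity
ADMISSION_THRESHOLD = 0.8     # admit only up to 80% of capacity

def admit(current_load_kbps, requested_kbps):
    """Decisional maths in miniature: admit or reject a new session."""
    projected = current_load_kbps + requested_kbps
    return projected <= CELL_CAPACITY_KBPS * ADMISSION_THRESHOLD

print(admit(6_000, 1_500))  # True  - projected load 75%
print(admit(7_500, 1_500))  # False - projected load 90%
```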
Some of the mathematical value both in pure and applied maths is ‘heritage value’, the legacy of several thousand years of mathematical study and inspiration. Some of the mathematical value is ‘contemporary value’, the contributions presently being made by practicing mathematicians. So it is relevant to consider the areas in contemporary mathematics that are most likely to prove useful in terms of differentiating the future ‘user experience proposition’.
10.12.1 Modern Mathematicians and Their Relevance to the User Experience Proposition
One way to do this is to find out what work, or more accurately, whose work is winning awards. For years there was no mathematical equivalent to the Nobel Prize. Rumour has it that the Swedish mathematician Gosta Mittag Leffler had an affair with Alfred Nobel's wife. This effectively shut out all future mathematicians particularly Swedish mathematicians from the Nobel award and recognition process.
There is a Fields Medal for mathematicians but this is only awarded every four years and is restricted to mathematicians under 40. In 2002, the Norwegian government decided to fund a yearly award known as the Abel Prize6 to mark the bicentenary of the birth of Niels Henrik Abel. Niels Henrik Abel died in 1829 at the age of 26 after contracting TB following an ill-advised sleigh ride. In his short life he had, however, developed the foundations for group theory, also worked on in parallel by Galois.
Group theory is essentially an integration of geometry, number theory and algebra. Abel worked specifically on the commutative properties of group operations, arithmetical processes like addition where it does not matter in which order sums are performed. These came to be known as abelian groups.
Strangely, but perhaps not surprisingly, group theory is increasingly relevant in many areas of telecommunications including mobile telecommunications and mobile metadata management.
Geometry is important in vector maths (mathematical calculations that have direction and magnitude), number theory is important in statistical processing (for example, the processing of image signal statistics) and algebra crops up all over the place.
A popular description of group theory is that it helps discover and describe what happens when one does something to something and then compares the results with the result of doing the same thing to something else, or something else to the same thing. Group theory has been and is used in a wide cross section of physical problem solving including the modelling of turbulence and weather forecasting. Our interest in a wireless telecoms context is to study the role of group theory in the management and manipulation of complex and interrelated data sets that change over time and space at varying rates.
That's why it is useful to keep track of who is winning the Abel Prize each year. They are all mature contemporary working mathematicians and as such provide an insight into evolving areas of mathematics that are potentially strategically, intellectually, economically and socially important. This of course assumes that the Norwegian Academy of Science and Letters knows more than we do about the work of contemporary mathematicians, which for the majority of us is probably a valid assumption.
In 2003, the Abel Prize was awarded to Jean-Pierre Serre for his work on topology (place and space) and group theory.
In 2004, the Abel Prize was awarded jointly to Sir Michael Francis Atiyah and Isadore M Singer for their work on the eponymous Atiyah–Singer index theorem, a construct for measuring and modelling how quantities and forces vary over time and space taking into account their rate of change.
In 2005, the Abel Prize was awarded to Peter Lax for his work on nonlinear differential equations and singularities, the modelling of odd things that happen at odd moments, breaking the sound barrier for example. Dedicated readers of RTT Hot Topics might remember our venture onto this territory in our March 2003 Hot Topic ‘Turbulent Networks’. One reader hurtfully assumed the Hot Topic was an early April fool. This was of course untrue. System stability is going to be a major focus in multimedia multiservice network provision including multimedia multiservice networks with integrated mobile metadata functionality. Peter Lax wittingly or unwittingly is building on work done by Benoit Mandelbrot in the 1980s that in turn was based on the work of Lewis Fry Richardson (uncle of the late Sir Ralph Richardson) on aeronautical turbulence prior to the First World War.
In 2006, the Abel Prize was awarded to Lennart Carleson for his work on harmonic analysis and his theory of smooth dynamical systems.
In 2007 Srinivasa S.R. Varadhan was awarded the prize for his work on the mathematics describing rare chance and probability events and a unified theory of large deviations.
In 2008 Professor John Griggs Thompson, of Cambridge and Florida Universities and Professor Jacques Tits, of the Collège de France were awarded a joint prize for their work on algebra and in modern group theory.
In 2009 Mikhail Gromov received the prize for his revolutionary contributions to geometry.
In 2010 John Tate received the prize for his work on number theory.
In 2011 John Milnor received the prize for work on topology, geometry and algebra.
Surprisingly, there are no Chinese mathematicians yet on this list, but presumably this will change in the future and it is likely that the specialisms may be similar.
From the Abel list, Lennart Carleson conveniently brings together two story lines that we have explored.
His work on harmonic analysis developed Fourier's work, but specifically applied to acoustic waveform synthesis. The popular press describes Lennart as the ‘mathematician who paved the way for the iPod’. This is perhaps a rather journalistic interpretation. In practice, his work in this area in the 1960s marked an important advance in musical set theory, the categorising of musical objects and their harmonic relationships. These have subsequently proved useful in audio system simulation and design and automated audio metadata tagging. Robert Moog was developing his Moog synthesiser at the same time.
Lennart's genius is that he has done this work alongside his work on dynamical systems, the branch of modern mathematics now dedicated to the modelling of large systems like financial markets and the weather – systems that change over time.
Multimedia multiservice network provision including multimedia multiservice networks with integrated mobile metadata functionality increasingly exhibit large-system behaviour. This is not surprising given that they are large systems with multiple inputs, many of which are hard to predict.
As with the weather, it sometimes seems hard to predict human behaviour, particularly long-term behaviour. This is, however, a scaling issue. We do not have enough visibility to the spatial, temporal and social data sets or the vector behaviour (direction and magnitude) of the data sets over a sufficiently wide range of time scales.
The mechanics of mobile metadata potentially allow us to capture and manage this information and ultimately, provided user elective issues can be addressed, to realise value from the process. At this point, data on data (metadata) becomes more valuable than the data itself.
1 http://www.york.ac.uk/depts/maths/histstat/bayesbiog.pdf.
2 http://www.cs.hut.fi/∼rsarvas/publications/Sarvas_MetadataCreationSystem.pdf.
3 http://www.studyblue.com/#note/preview/632416.
4 http://eclipse.gsfc.nasa.gov/SEmono/TSE2006/TSE2006.html.
5 http://mpeg.chiariglione.org/standards/mpeg-7/mpeg-7.htm#3.1_MPEG-7_Multimedia_Description_Schemes.