Computer vision marking programs. Statement of the computer vision problem

The scope of computer vision is very wide: from barcode readers in supermarkets to augmented reality. From this lecture you will learn where computer vision is used and how it works, what images look like in numbers, which problems in this area are solved relatively easily, which are difficult, and why.

The lecture is intended for high school students attending the Small ShAD, but adults will also find plenty of useful material in it.

The ability to see and recognize objects is a natural and habitual ability for humans. However, this is still an extremely difficult task for a computer. Attempts are now being made to teach a computer at least a fraction of what a person uses every day without even noticing it.

Probably the most common place an ordinary person encounters computer vision is at the checkout counter in a supermarket. Of course, we are talking about reading barcodes. They were specially designed in such a way as to make the reading process as easy as possible for the computer. But there are also more complex tasks: reading car license plates, analyzing medical images, flaw detection in production, facial recognition, etc. The use of computer vision to create augmented reality systems is actively developing.

Difference between human and computer vision
A child learns to recognize objects gradually, coming to understand how an object's appearance changes with its position and the lighting. Later, when recognizing objects, a person is guided by this previous experience. Over the course of a lifetime, a person accumulates a huge amount of information; the learning process of this biological neural network does not stop for a second. It is also not particularly difficult for a person to recover perspective from a flat picture and imagine what the scene would look like in three dimensions.

All this is much more difficult for a computer, primarily because of the problem of gaining experience: a huge number of examples must be collected, and so far this has met with only limited success.

In addition, a person always takes into account the environment when recognizing an object. If you take an object out of its usual environment, it becomes noticeably more difficult to recognize it. Here, too, the experience accumulated over a lifetime, which the computer does not have, plays a role.

Boy or girl?
Let's imagine that we need to learn to determine the gender of a (clothed!) person at a glance from a photograph. First we need to identify factors that may indicate membership in one class or the other. In addition, we need to collect a training set, and it should ideally be representative. In our case, we will take everyone present in the audience as the training sample and try to find distinguishing factors based on them: for example, hair length, the presence of a beard, makeup, and clothing (skirt or trousers). Knowing what percentage of people of each gender exhibited certain factors, we can formulate fairly clear rules: the presence of particular combinations of these factors will, with some probability, let us say what gender the person in the photo is.
Machine learning
Of course, this is a very simple and artificial example with a small number of high-level factors. In real tasks posed to computer vision systems there are many more factors. Defining them manually and computing the dependencies between them is an impossible task for a human, so in such cases there is no way to do without machine learning. For example, one can define several dozen initial factors and provide positive and negative examples; the dependencies between these factors are then selected automatically, and a formula is built that allows decisions to be made. Quite often the factors themselves are also identified automatically.
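To make the idea concrete, here is a minimal sketch in Python using scikit-learn; the binary features (long hair, beard, makeup, skirt) and the tiny training table are invented purely for illustration and are not a real dataset.

```python
# A minimal sketch: learning a decision rule from a few hand-picked factors.
# The features and training rows below are invented for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [long_hair, beard, makeup, skirt]; label 1 = "female", 0 = "male".
X = np.array([
    [1, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
])
y = np.array([1, 1, 0, 0, 1, 0])

model = LogisticRegression().fit(X, y)

# Probability that a short-haired, clean-shaven person wearing makeup belongs to class 1.
print(model.predict_proba([[0, 0, 1, 0]])[0, 1])
```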
Image in numbers
The most commonly used color space for storing digital images is RGB. In it, each of the three axes (channels) is assigned a color: red, green, or blue. Each channel is allocated 8 bits of information, so the intensity along each axis can take values in the range from 0 to 255. All colors in the digital RGB space are obtained by mixing the three primary colors.
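To a program, such an image is just an array of numbers; here is a minimal NumPy sketch (the pixel values are arbitrary):

```python
import numpy as np

# A 2x2 RGB image as an array of 8-bit intensities (values 0..255).
img = np.array([
    [[255, 0, 0], [0, 255, 0]],       # a red pixel and a green pixel
    [[0, 0, 255], [255, 255, 255]],   # a blue pixel and a white pixel
], dtype=np.uint8)

print(img.shape)   # (2, 2, 3): height, width, three channels
print(img[0, 0])   # [255   0   0] -- the red pixel
```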

Unfortunately, RGB is not always convenient for analysis. Experiments show that geometric distance between colors in RGB corresponds poorly to how close those colors appear to a human observer.

But there are other color spaces. The HSV (Hue, Saturation, Value) space is particularly interesting in our context. It has a Value axis indicating the amount of light; a separate channel is allocated to it, unlike in RGB, where this value has to be computed each time. In effect, this channel is a black-and-white version of the image that can already be worked with on its own. Hue is represented as an angle and is responsible for the base tone, while saturation (the distance from the center to the edge) determines how vivid the color is.

HSV is much closer to how we perceive colors. If you show a person a red and a green object in the dark, they will not be able to tell the colors apart. The same thing happens in HSV: the further down the V axis we go, the smaller the difference between hues becomes, because the range of saturation values shrinks. In a diagram this looks like a cone with a single black point at its apex.
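A short sketch of moving from RGB to HSV with OpenCV (note that OpenCV stores color images in BGR order and encodes 8-bit hue in the range 0-179):

```python
import cv2
import numpy as np

# One pure-red pixel; OpenCV keeps the channels in BGR order.
bgr = np.uint8([[[0, 0, 255]]])
hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
print(hsv)   # [[[  0 255 255]]]: hue 0 (red), full saturation, full value

# The V channel alone already serves as a black-and-white version of an image:
# value = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)[:, :, 2]
```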

Color and light
Why is it so important to have data on the amount of light? In most computer vision tasks color carries no essential information. Compare two versions of the same picture, color and black-and-white: recognizing all the objects in the black-and-white version is hardly any harder than in the color one. In such cases color gives us nothing extra but creates a great many computational problems: working with a color image roughly triples the amount of data per pixel, and the space of possible pixel values grows from 256 to 256³.

Color is used only in those rare cases where it actually simplifies the computation. For example, when a face needs to be detected, it is easier to first find its likely location in the picture by focusing on the range of skin tones; this removes the need to analyze the entire image.
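A rough sketch of this trick: build a mask of skin-colored pixels in HSV and search for faces only there. The HSV bounds below are an illustrative guess, not a tuned detector.

```python
import cv2
import numpy as np

def skin_candidates(bgr_image):
    """Return a rough mask of skin-coloured pixels to narrow the face search.
    The bounds are illustrative; a real system would tune them carefully."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    lower = np.array([0, 40, 60], dtype=np.uint8)
    upper = np.array([25, 180, 255], dtype=np.uint8)
    mask = cv2.inRange(hsv, lower, upper)
    # Remove isolated pixels before looking for face-sized regions.
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
```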

Local and global features
The features with which we analyze an image can be local or global. Looking at this picture, most will say that it shows a red car:

This answer implies that the person singled out an object in the image and thereby described a local color attribute. Strictly speaking, the picture shows a forest, a road, and a small car; in terms of area the car occupies the smaller part. But we understand that the car is the most important object in this picture. If a person is asked to find pictures similar to this one, they will first of all select images that contain a red car.

Detection and segmentation
In computer vision these processes are called detection and segmentation. Segmentation is the division of an image into many parts that are related to each other visually or semantically, while detection is finding objects of interest in an image. Detection must be clearly distinguished from recognition. Say, in the same picture with the car one can detect a road sign, but it is impossible to recognize it, since it is facing away from us. Likewise, in face recognition the detector can determine the location of a face, while the recognizer then says whose face it is.

Descriptors and visual words
There are many different approaches to recognition.

For example, this one: first, points of interest are identified in the image, i.e. places that stand out from the background: bright spots, sharp transitions, and so on. There are several algorithms for doing this.

One of the most common methods is called Difference of Gaussians (DoG). By blurring the picture with different radii and comparing the results, you can find the most contrasting fragments. The areas around these fragments are the most interesting.

The picture below shows roughly what this looks like. The resulting data is recorded in descriptors.
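A minimal sketch of the Difference of Gaussians idea with OpenCV: blur with two different radii and subtract, so that strong responses mark the most contrasty spots.

```python
import cv2

def difference_of_gaussians(gray, sigma_small=1.0, sigma_large=2.0):
    """Blur the image with two radii and subtract the results;
    large values mark contrasty, 'interesting' places."""
    fine = cv2.GaussianBlur(gray, (0, 0), sigma_small)
    coarse = cv2.GaussianBlur(gray, (0, 0), sigma_large)
    return cv2.subtract(fine, coarse)

# Usage:
# gray = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)
# dog = difference_of_gaussians(gray)
```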

To ensure that identical descriptors are recognized as such regardless of in-plane rotations, they are rotated so that their largest gradient vectors point in the same direction. This is not always done, but it is useful when you need to match two identical objects that are oriented differently.

Descriptors are written in numerical form, so a descriptor can be represented as a point in a multidimensional space. In our illustration the space is two-dimensional. Our descriptors fall into it, and we can cluster them, that is, divide them into groups.

Next, for each cluster we describe a region in space. When a descriptor falls into this area, what becomes important to us is not what it was, but which of the areas it fell into. And then we can compare images by determining how many descriptors of one image ended up in the same clusters as descriptors of another image. Such clusters can be called visual words.
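A sketch of building such a dictionary of visual words: extract local descriptors from many images and cluster them with k-means. It assumes SIFT descriptors via OpenCV (cv2.SIFT_create is available in recent OpenCV releases) and scikit-learn for clustering.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(image_paths, n_words=100):
    """Cluster local descriptors from many images into 'visual words'."""
    sift = cv2.SIFT_create()
    all_descriptors = []
    for path in image_paths:
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, descriptors = sift.detectAndCompute(gray, None)
        if descriptors is not None:
            all_descriptors.append(descriptors)
    stacked = np.vstack(all_descriptors)
    # Each cluster center is one visual word.
    return KMeans(n_clusters=n_words, n_init=10).fit(stacked)
```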

To find not just identical pictures, but images of similar objects, you need to take many images of this object and many pictures in which it is not present. Then select descriptors from them and cluster them. Next, we need to find out which clusters included descriptors from images in which the object we needed was present. Now we know that if the descriptors from the new image fall into the same clusters, then the desired object is present in it.

The coincidence of descriptors does not yet guarantee the identity of the objects containing them. One method of additional verification is geometric validation. In this case, the location of the descriptors relative to each other is compared.
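One common way to do such geometric validation is to fit a single transformation (for example a homography) to the matched points with RANSAC and count the inliers; a sketch using OpenCV keypoints and matches:

```python
import cv2
import numpy as np

def geometric_verification(kp1, kp2, matches, reproj_threshold=3.0):
    """Count how many matches agree on one homography between the images.
    kp1, kp2: OpenCV keypoints; matches: output of a descriptor matcher."""
    if len(matches) < 4:
        return 0
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    _, inliers = cv2.findHomography(src, dst, cv2.RANSAC, reproj_threshold)
    return 0 if inliers is None else int(inliers.sum())
```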

Recognition and classification
For simplicity, let's imagine that we can divide all images into three classes: architecture, nature and portrait. In turn, we can divide nature into plants, animals and birds. And having already realized that it is a bird, we can say which one it is: an owl, a seagull or a crow.

The difference between recognition and classification is quite arbitrary. If we found an owl in the picture, that is closer to recognition; if only a bird, it is some intermediate option; and if only nature, it is definitely classification. That is, the difference between recognition and classification is how far down the tree we go, and the further computer vision advances, the lower the boundary between classification and recognition will slide.

Interest in computer vision arose early in the field of artificial intelligence, along with tasks such as automatic theorem proving and intellectual games. Even the architecture of the first artificial neural network - the perceptron - was proposed by Frank Rosenblatt, based on an analogy with the retina, and its study was carried out on the example of the problem of character image recognition.

The significance of the vision problem has never been in doubt, but its complexity was significantly underestimated. A now-legendary example dates from 1966, when Marvin Minsky, one of the founders of artificial intelligence, did not intend to solve the problem of artificial vision himself but assigned it to a student as a project for the coming summer; at the same time, far more time was allotted to creating a program that plays chess at grandmaster level. Today, however, it is obvious that creating a program that beats a human at chess is easier than creating an adaptive control system with a computer vision subsystem that could simply rearrange chess pieces on an arbitrary real board.

Progress in the field of computer vision is determined by two factors: the development of theory and methods, and the development of hardware. For a long time, theory and academic research outpaced the practical use of computer vision systems. Conventionally, a number of stages in the development of the theory can be distinguished.

  • By the 1970s, the basic conceptual apparatus of image processing, which underlies the study of vision problems, had been formed. The main tasks specific to computer vision had also been identified, related to estimating the physical parameters of a scene (range, velocities, surface reflectivity, etc.) from images, although a number of these problems were still considered in a highly simplified formulation for a "toy world" of blocks.
  • By the 1980s, a theory of levels of image representation in image analysis methods had taken shape. A kind of marker of the end of this stage is David Marr's book "Vision: A Computational Investigation into the Human Representation and Processing of Visual Information."
  • By the 90s, a systematic understanding of approaches to solving basic, now classical, computer vision problems was formed.
  • Since the mid-90s, there has been a transition to the creation and research of large-scale computer vision systems designed to work in various natural conditions.
  • At the current stage, the most interesting direction is the development of methods for automatically constructing image representations in image recognition and computer vision systems based on the principles of machine learning.

At the same time, applications were limited by computing resources. After all, in order to perform even the simplest image processing, you need to look at all its pixels at least once (and usually more than once). To do this, you need to perform at least hundreds of thousands of operations per second, which was impossible for a long time and required simplifications.

For example, for automatic recognition of parts in industry, a black conveyor belt could be used, eliminating the need to separate an object from the background, or a moving object could be scanned by a line of photodiodes with special illumination, which already at the level of signal formation ensured the selection of invariant features for recognition without any complex methods of information analysis. In optical-electronic target tracking and recognition systems, physical stencils were used, allowing matched filtering to be performed "in hardware." Some of these solutions were ingenious from an engineering point of view, but were applicable only to problems with low a priori uncertainty and therefore transferred poorly to new problems.

It is not surprising that the 1970s saw a peak of interest in optical computing for image processing. Optical systems made it possible to implement only a small set of methods (mostly correlation-based) with limited invariance properties, but did so very efficiently.

Gradually, thanks to the increase in processor performance (as well as the development of digital video cameras), the situation changed. Crossing a certain threshold of performance required to perform useful image processing in a reasonable amount of time has paved the way for an avalanche of computer vision applications. It should, however, immediately be emphasized that this transition was not instantaneous and is still ongoing.

First of all, generally applicable image processing algorithms became available for special processors - digital signal processors (DSPs) and programmable logic integrated circuits (FPGAs), which were often used together and are still widely used in on-board and industrial systems.

However, computer vision methods came into truly widespread use only within the last ten years or so, once processor performance in personal and mobile computers reached the required level. Thus, in terms of practical application, computer vision systems have passed through a number of stages: the stage of one-off solutions (in both hardware and algorithms) to specific problems; the stage of application in professional fields (especially industry and defense) using special processors, specialized imaging systems, and algorithms designed to work under conditions of low a priori uncertainty, although these solutions were already scalable; and the stage of mass application.

As you can see, a machine vision system includes the following main components: an image formation system (camera), a computing device, and algorithmic (software) support.

Computer vision systems that use standard cameras and computers as the first two components have become the most widespread (the term "computer vision" suits such systems better, although there is no sharp boundary between the concepts of machine and computer vision). Nevertheless, other machine vision systems are no less important. It is precisely the choice of "non-standard" image formation methods (including spectral ranges outside the visible, coherent radiation, structured illumination, hyperspectral instruments, time-of-flight, omnidirectional and high-speed cameras, telescopes and microscopes, etc.) that significantly expands the capabilities of machine vision systems. While in terms of algorithmic capabilities machine vision systems are significantly inferior to human vision, in terms of the ability to obtain information about observed objects they are significantly superior to it. However, image formation is an independent field, and the methods for working with images obtained from different sensors are so diverse that reviewing them is beyond the scope of this article. We therefore limit ourselves to a review of computer vision systems using conventional cameras.

Application in robotics

Robotics is a traditional application area for machine vision. However, for a long time the bulk of the robot fleet was in industry, where perception was by no means superfluous, but where well-controlled conditions (a low level of non-determinism in the environment) made highly specialized solutions possible, including for machine vision problems. In addition, industrial applications could afford expensive equipment, including optical and computing systems.

In this regard, it is significant (although not related only to computer vision systems) that the share of the robot fleet attributable to industrial robots only became less than 50% in the early 2000s. Robotics intended for the mass consumer began to develop. For household robots, unlike industrial ones, cost is critical, as well as battery life, which implies the use of mobile and embedded processor systems. Moreover, such robots must operate in non-deterministic environments. For example, in industry for a long time (and to this day) photogrammetric marks were used, glued to observation objects or calibration boards, to solve the problems of determining the internal parameters and external orientation of cameras. Naturally, the need for the user to stick such tags on interior items would significantly worsen the consumer qualities of household robots. It is not surprising that the market for household robots waited until it reached a certain level of technology to begin its rapid development, which happened in the late 90s.

The starting point for this event can be the release of the first version of the AIBO robot (Sony), which, despite the relatively high price ($2500), was in great demand. The first batch of these robots in the amount of 5,000 copies was sold out on the Internet in 20 minutes, the second batch (also in 1999) - in 17 seconds, and then the sales rate was about 20,000 copies per year.

Also in the late 90s, devices appeared in mass production that could be called household robots in the full sense of the word. The most typical autonomous household robots are robot vacuum cleaners. The first model released in 2002 by iRobot was the Roomba. Then robotic vacuum cleaners appeared, produced by LG Electronics, Samsung, and others. By 2008, the total sales of robotic vacuum cleaners in the world amounted to more than half a million units per year.

It is significant that the first robotic vacuum cleaners equipped with computer vision systems appeared only in 2006. By that time, mobile processors such as the ARM family running at 200 MHz made it possible to match images of three-dimensional indoor scenes using invariant key-point descriptors for the purpose of robot localization at a rate of about 5 frames per second. Using vision to determine a robot's location had become economically feasible, although until recently manufacturers preferred to use sonars for this purpose.

Further increases in the performance of mobile processors make it possible to set new computer vision tasks for household robots, which are already selling in the millions of units per year worldwide. In addition to navigation, robots intended for personal use may be required to recognize people and their emotions from faces, recognize gestures, furnishings (including cutlery and dishes), clothing, pets, and so on, depending on the type of problem the robot solves. Many of these problems are far from completely solved and are promising from an innovation point of view.

Thus, modern robotics requires solving a wide range of computer vision problems, including, in particular:

  • a set of tasks related to orientation in external space (for example, the task of simultaneous localization and mapping - Simultaneous Localization and Mapping, SLAM), determining distances to objects, etc.;
  • tasks of recognizing various objects and interpreting scenes in general;
  • tasks of detecting people, recognizing their faces and analyzing emotions.

Driver assistance systems

In addition to household robots, computer vision methods have found wide application in driver assistance systems. Work on detecting lane markings and obstacles on the road, recognizing signs, and so on was actively carried out in the 1990s. However, these methods reached a sufficient level (both in the accuracy and reliability of the methods themselves and in the performance of processors capable of executing them in real time) mainly in the last decade.

One notable example is the stereo vision techniques used to detect obstacles on the road. The demands on their reliability, accuracy, and performance can be quite strict: pedestrian detection, in particular, may require building a dense range map at close to real time. Such methods can require hundreds of operations per pixel, with the necessary accuracy reached only at image sizes of a megapixel or more, that is, hundreds of millions of operations per frame (several billion or more operations per second).
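As an illustration of the kind of computation involved, here is a minimal dense-disparity sketch using OpenCV's semi-global block matching; the parameters are illustrative defaults and the input pair is assumed to be rectified.

```python
import cv2

def dense_disparity(left_gray, right_gray):
    """Compute a dense disparity map from a rectified stereo pair.
    Disparity is inversely proportional to distance, so large values
    correspond to nearby obstacles."""
    matcher = cv2.StereoSGBM_create(minDisparity=0,
                                    numDisparities=128,  # must be a multiple of 16
                                    blockSize=5)
    # StereoSGBM returns fixed-point disparities scaled by 16.
    return matcher.compute(left_gray, right_gray).astype("float32") / 16.0
```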

It is worth noting that overall progress in computer vision is by no means due to hardware alone. Hardware only opens up the possibility of using computationally expensive image processing methods; the methods themselves also have to be developed. Over the past 10-15 years, methods for matching images of three-dimensional scenes, reconstructing dense range maps from stereo vision, detecting and recognizing faces, and so on have been brought to effective practical use. The general principles for solving these problems have not changed, but the methods have been enriched with a number of non-trivial technical details and mathematical techniques that made them successful.

Returning to driver assistance systems, we cannot fail to mention modern methods of pedestrian detection, in particular those based on histograms of oriented gradients. Modern machine learning methods, which will be discussed later, have for the first time allowed a computer to solve a fairly general visual task, recognizing road signs, better than a human, and not through special image formation tools but thanks to recognition algorithms that receive exactly the same input information a person has.
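For reference, OpenCV ships a HOG-based pedestrian detector; a minimal sketch of using it (the parameters are common illustrative defaults):

```python
import cv2

# Pedestrian detection with histograms of oriented gradients (HOG),
# using the linear SVM people detector bundled with OpenCV.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def detect_pedestrians(bgr_frame):
    boxes, weights = hog.detectMultiScale(bgr_frame,
                                          winStride=(8, 8),
                                          scale=1.05)
    return boxes  # each box is (x, y, width, height)
```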

One of the significant technical achievements was Google's self-driving car, which, however, uses a rich set of sensors in addition to a video camera, and also does not work on unfamiliar (not previously filmed) roads and in bad weather conditions.

Thus, driver assistance systems require solving various computer vision problems, including:

  • stereo vision;
  • detection of obstacles on roads;
  • recognition of road signs, markings, pedestrians and cars;
  • monitoring the driver's condition, a task that also deserves mention.

Mobile applications

Computer vision tasks for personal mobile devices such as smartphones and tablets are even more widespread than those for household robotics and driver assistance systems. In particular, the number of mobile phones is growing steadily and has already practically exceeded the population of the Earth, and the majority of phones are now produced with cameras. In 2009 the number of such phones exceeded a billion, creating a colossal market for image processing and computer vision systems that is far from saturated, despite numerous R&D projects carried out both by the mobile device manufacturers themselves and by a large number of start-ups.

Some image processing tasks for camera phones coincide with those for digital cameras; the main difference lies in the quality of the lenses and the shooting conditions. An example is the task of synthesizing high dynamic range images (HDRI) from several shots taken at different exposures. On a mobile device there is more noise in the images, the frames are captured at longer intervals, and the camera moves more between shots, all of which complicates obtaining a high-quality HDR image that must, moreover, be computed on the phone's processor. So the solutions to seemingly identical problems may differ between devices, which keeps such solutions in demand on the market.
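A sketch of the core of such an HDR pipeline with OpenCV's exposure-fusion classes; on a phone the frames would also need to be aligned first (e.g. with cv2.createAlignMTB()), which is omitted here.

```python
import cv2
import numpy as np

def merge_hdr(images, exposure_times):
    """Fuse several differently exposed, already aligned BGR frames into one
    displayable image (Debevec calibration + merge + tone mapping)."""
    times = np.array(exposure_times, dtype=np.float32)
    response = cv2.createCalibrateDebevec().process(images, times)
    hdr = cv2.createMergeDebevec().process(images, times, response)
    ldr = cv2.createTonemap(gamma=2.2).process(hdr)
    return np.clip(ldr * 255, 0, 255).astype(np.uint8)
```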

Of greater interest, however, are new applications that were previously absent from the market. A wide class of such applications for personal mobile devices is associated with augmented reality, and they can be very diverse: gaming applications (which require virtual objects to be displayed consistently on top of the real scene as the camera moves), entertainment applications in general, tourism applications (recognition of landmarks with information about them), and many other applications related to information retrieval and object recognition: recognition of inscriptions in foreign languages with their translation displayed, recognition of business cards with automatic entry of the information into the phone book, face recognition with retrieval of information from the phone book, recognition of movie posters (with the poster image replaced by the movie trailer), and so on.

Augmented reality systems can be created in the form of specialized devices such as Google Glass, which further increases the innovative potential of computer vision methods.

Thus, the class of computer vision problems whose solutions can be applied in mobile applications is extremely wide. Image matching methods (finding corresponding points) have an especially broad range of uses, from estimating the three-dimensional structure of a scene and changes in camera orientation to object recognition and the analysis of people's faces. At the same time, an unlimited number of mobile applications can be proposed that will require the development of specialized computer vision methods. To give just two examples: filming a board game on a mobile phone with automatic transcription of the moves, and reconstructing the trajectory of a golf club during a swing.

Information search and training

Many augmented reality tasks are closely related to information retrieval (so some systems, such as Google Goggles, are difficult to attribute to any specific area), which is of significant independent interest.

The tasks of searching images by content are also varied. They include matching images when searching for pictures of unique objects such as architectural structures, sculptures, or paintings; detecting and recognizing objects of classes of varying generality (cars, animals, furniture, human faces, etc., as well as their subclasses); and categorizing scenes (city, forest, mountains, coast, and so on). These tasks appear in a variety of applications: sorting images in home digital photo albums, searching for products by their images in online stores, retrieving images in geographic information systems, biometric identification systems, specialized image search in social networks (for example, searching for faces of people the user finds attractive), and so on, up to image search on the Internet.

Both the progress already achieved and the prospects for its continuation are visible in the example of the Large Scale Visual Recognition Challenge competition, in which the number of recognized classes increased from 20 in 2010 to 200 in 2013.

Recognizing objects of so many classes is now unthinkable without machine learning methods. One extremely popular direction here is deep learning networks, designed to automatically build multi-level feature systems that are then used for recognition. The demand for this area is evident from the acquisition of startups by corporations such as Google and Facebook: Google acquired DNNresearch in 2013 and the DeepMind startup in early 2014. Facebook also competed for the latter purchase (having previously hired Yann LeCun to head its laboratory leading deep learning development), and the purchase price was $400 million. It is worth noting that the road sign recognition competition mentioned above was also won by a method based on deep learning networks.

Deep learning methods require enormous computing resources, and even learning to recognize a limited class of objects can require several days of work on a computing cluster. At the same time, even more powerful methods may be developed in the future, but they require even more computing resources.

Conclusion

We have considered only the most common computer vision applications for the general public. However, there are many other, less typical applications. For example, computer vision methods can be used in microscopy, optical coherence tomography, and digital holography. There are numerous applications of image processing and analysis methods in various professional fields - biomedicine, space industry, forensics, etc.

Reconstruction of a 3D profile of a metal sheet observed using a microscope using the “depth from focusing” method

Currently, the number of relevant computer vision applications continues to grow. In particular, problems related to the analysis of video data are becoming tractable. The active development of three-dimensional television is also increasing demand for computer vision systems for which effective algorithms have not yet been developed and which require even greater computing power; one popular task of this kind is the conversion of 2D video to 3D.

It is no surprise that specialized computing hardware remains at the cutting edge of computer vision systems; in particular, general-purpose computing on graphics processors (GPGPU) and cloud computing are popular now. However, the corresponding solutions are gradually moving into the personal computer segment, with a significant expansion of possible applications.

So, computer vision is a set of techniques that allow you to train a machine to extract information from an image or video. In order for a computer to find certain objects in images, it must be trained. To do this, a huge training sample is compiled, for example, from photographs, some of which contain the desired object, while the other part, on the contrary, does not. Next, machine learning comes into play. The computer analyzes the images from the sample, determines which features and their combinations indicate the presence of the desired objects, and calculates their significance.

After completing the training, computer vision can be used in practice. For a computer, an image is a collection of pixels, each of which has its own brightness or color value. In order for the machine to get an idea of ​​the contents of the picture, it is processed using special algorithms. First, potentially significant locations are identified. This can be done in several ways. For example, the original image is subjected to Gaussian blur several times using different blur radii. The results are then compared with each other. This allows you to identify the most contrasting fragments - bright spots and broken lines.


Once significant places are found, the computer describes them with numbers. Such a numerical record of an image fragment is called a descriptor. Using descriptors, image fragments can be compared fairly accurately without using the fragments themselves. To speed up calculations, the computer clusters the descriptors, that is, distributes them into groups; similar descriptors from different images fall into the same cluster. After clustering, only the number of the cluster whose descriptors are most similar to the given one matters. The transition from a descriptor to a cluster number is called quantization, and the cluster number itself is called a quantized descriptor. Quantization significantly reduces the amount of data the computer has to process.
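A small sketch of quantization and comparison, assuming a k-means vocabulary like the one sketched earlier in this article (the function and variable names are illustrative):

```python
import numpy as np

def quantize(descriptors, vocabulary):
    """Replace each descriptor by the index of its nearest cluster (visual word).
    'vocabulary' is assumed to be a fitted sklearn KMeans model."""
    return vocabulary.predict(descriptors)

def word_histogram(word_indices, n_words):
    """Count how often each visual word occurs in one image."""
    hist = np.bincount(word_indices, minlength=n_words).astype(np.float32)
    return hist / (hist.sum() + 1e-9)

def similarity(hist_a, hist_b):
    """Cosine similarity between two bag-of-visual-words histograms."""
    denom = np.linalg.norm(hist_a) * np.linalg.norm(hist_b) + 1e-9
    return float(np.dot(hist_a, hist_b) / denom)
```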


Based on quantized descriptors, the computer can compare images and recognize objects in them. It compares sets of quantized descriptors from different images and infers how similar they or individual fragments are. This comparison is also used by search engines to search for an uploaded image.

Computer vision and image recognition are an integral part of artificial intelligence (AI), which has gained immense popularity in recent years. In January of this year the CES 2017 exhibition took place, where the latest achievements in this area were on display. Here are some interesting examples of the use of computer vision that could be seen at the exhibition.

8 examples of using computer vision

Veronica Elkina

1. Self-driving cars

The largest stands with computer vision belong to the automotive industry. After all, self-driving and semi-autonomous car technologies work largely because of computer vision.

Products from NVIDIA, which has already made big strides in deep learning, are used in many self-driving cars. For example, the NVIDIA Drive PX 2 supercomputer already serves as the underlying platform for self-driving cars from Volvo, Audi, BMW, and Mercedes-Benz.

NVIDIA's DriveNet artificial perception technology is self-learning computer vision powered by neural networks. With its help, data from lidars, radars, cameras, and ultrasonic sensors is used to recognize the environment, road markings, vehicles, and much more.

3. Interfaces

Eye tracking technology using computer vision is used not only in gaming laptops, but also in desktop and enterprise computers so that they can be controlled by people who cannot use their hands. The Tobii Dynavox PCEye Mini is a ballpoint pen-sized device that makes the perfect discreet accessory for tablets and laptops. This eye tracking technology is also used in new gaming and regular Asus laptops and Huawei smartphones.

Meanwhile, gesture control (computer vision technology that can recognize specific hand movements) continues to develop. It will now be used in future BMW and Volkswagen vehicles.

The new HoloActive Touch interface allows users to control virtual 3D screens and press buttons in space. We can say that it is a simple version of the real Iron Man holographic interface (it even reacts in the same way with a slight vibration when pressing elements). Technologies like ManoMotion will make it easy to add gesture controls to almost any device. Moreover, to gain control over a virtual 3D object using gestures, ManoMotion uses a regular 2D camera, so you don’t need any additional equipment.

eyeSight's Singlecue Gen 2 device uses computer vision (gesture recognition, facial analysis, action detection) and allows you to control TVs, smart lighting systems and refrigerators using gestures.

Hayo

Crowdfunding project Hayo is perhaps the most interesting new interface. This technology allows you to create virtual controls throughout your home - by simply raising or lowering your hand, you can increase or decrease the volume of your music, or turn on the kitchen lights by waving your hand over the countertop. It all works thanks to a cylindrical device that uses computer vision, as well as a built-in camera and 3D, infrared and motion sensors.

4. Household appliances

Expensive cameras that show you what's inside your refrigerator don't seem so revolutionary anymore. But what about an app that analyzes images from your refrigerator's built-in camera and tells you when you're low on certain foods?

Smarter's sleek FridgeCam device attaches to the side of your refrigerator and can detect expiration dates, tell you what's in the fridge, and even recommend recipes for selected foods. The device is sold at an unexpectedly affordable price - just $100.

5. Digital signage

Computer vision could change the way banners and advertisements look in stores, museums, stadiums and amusement parks.

A demo version of the technology for projecting images onto flags was presented at the Panasonic stand. Using infrared markers invisible to the human eye and video stabilization, this technology can project advertising onto hanging banners and even flags fluttering in the wind. Moreover, the image will look as if it were actually printed on them.

6. Smartphones and augmented reality

Many have talked about the game as the first mainstream augmented reality (AR) app. However, like other apps trying to jump on the AR train, this game relied mostly on GPS and triangulation to give users the feeling that an object was right in front of them; real computer vision technology is typically almost absent from smartphones.

However, in November, Lenovo released Phab2, the first smartphone to support Google Tango technology. The technology is a combination of sensors and computer vision software that can recognize images, videos and the world around them in real time through a camera lens.

At CES, Asus debuted the ZenFone AR, a smartphone that supports both Tango and Google's Daydream VR. Not only can the smartphone track motion, analyze the environment, and accurately determine its position, it also uses the Qualcomm Snapdragon 821 processor, which helps distribute the computer vision processing load. All this enables genuine augmented reality that actually analyzes the scene through the smartphone's camera.

Later this year, the Changhong H2 will be released, the first smartphone with a built-in molecular scanner. It collects light that bounces off an object and splits it into a spectrum, and then analyzes its chemical composition. Thanks to software that uses computer vision, the information obtained can be used for a variety of purposes - from prescribing medications and counting calories to determining skin condition and calculating nutrition levels.


Machine vision. What is it and how to use it? Optical Source Image Processing

Machine vision is a scientific direction in the field of artificial intelligence, in particular robotics, and related technologies for obtaining images of real-world objects, processing them and using the obtained data to solve various kinds of applied problems without (full or partial) human participation.

Historical breakthroughs in machine vision

Vision System Components

  • One or more digital or analog cameras (black and white or color) with suitable optics to capture images
  • Software for producing images for processing (for analog cameras, an image digitizer / frame grabber)
  • Processor (modern PC with a multi-core processor or built-in processor, for example - DSP)
  • Computer vision software that provides tools for developing individual software applications.
  • Input/output equipment or communication channels for reporting findings
  • Smart camera: one device that includes all of the above points.
  • Highly specialized light sources (LEDs, fluorescent and halogen lamps, etc.)
  • Specific software applications for image processing and detection of relevant properties.
  • A sensor to synchronize on part detection (often an optical or magnetic sensor) that triggers image capture and processing.
  • Actuators of a suitable form used to sort or discard defective parts.
Machine vision focuses primarily on industrial applications, such as autonomous robots and systems for visual inspection and measurement. This means that image sensor technologies and control theory are coupled with the processing of video data to control the robot, and real-time processing of the resulting data is carried out in software or hardware.

Image processing and image analysis mainly focus on working with 2D images, i.e. how to convert one image to another. For example, pixel-by-pixel operations to increase contrast, operations to highlight edges, remove noise, or geometric transformations such as image rotation. These operations assume that image processing/analysis operates independently of the content of the images themselves.

Computer vision focuses on processing three-dimensional scenes projected onto one or more images. For example, by restoring the structure or other information about a 3D scene from one or more images. Computer vision often depends on more or less complex assumptions about what is represented in images.

There is also a field called visualization, which was originally associated with the process of creating images, but sometimes dealt with processing and analysis. For example, radiography works with the analysis of video data for medical applications.

Finally, pattern recognition is a field that uses various methods to extract information from video data, mainly based on a statistical approach. Much of this field is devoted to the practical application of these methods.

Thus, we can conclude that the concept of "machine vision" today includes computer vision, visual pattern recognition, image analysis and processing, and so on.

Computer vision tasks

  • Recognition
  • Identification
  • Detection
  • Text recognition
  • Restoring 3D shape from 2D images
  • Motion estimation
  • Scene restoration
  • Image recovery
  • Identification of structures of a certain type in images, image segmentation
  • Optical Flow Analysis

Recognition


A classic problem in computer vision, image processing, and machine vision is determining whether video data contains some characteristic object, feature, or activity.

This problem can be reliably and easily solved by humans, but has not yet been satisfactorily solved in computer vision in the general case: random objects in random situations.

One or more predefined or learned objects or classes of objects can be recognized (usually along with their two-dimensional position in the image or three-dimensional position in the scene).

Identification


An individual instance of an object belonging to a class is recognized.
Examples: identification of a specific human face or fingerprint or vehicle.

Detection


The video data is checked for a certain condition.

Detection based on relatively simple and fast calculations is sometimes used to find small areas in the analyzed image, which are then analyzed using more resource-intensive techniques to obtain the correct interpretation.

Text recognition


Searching images by content: Finding all images in a large set of images that have content defined in various ways.

Position estimation: Determining the position or orientation of a certain object relative to the camera.

Optical Character Recognition: recognition of characters in images of printed or handwritten text, usually in order to convert it into a text format more convenient for editing or indexing (for example, ASCII).

Restoring a 3D shape from 2D images is carried out using stereo reconstruction of a depth map, reconstruction of the normal field and depth map from the shading of a halftone image, reconstruction of a depth map from texture, and determination of shape from motion.

An example of restoring a 3D shape from a 2D image

Motion estimation

Several motion estimation tasks in which a sequence of images (video data) is processed to find an estimate of the speed of each point in the image or 3D scene. Examples of such tasks are: determining three-dimensional camera movement, tracking, that is, following the movements of an object (for example, cars or people)

Scene restoration

Two or more scene images, or video data, are given. Scene reconstruction has the task of recreating a three-dimensional model of the scene. In the simplest case, a model can be a set of points in three-dimensional space. More sophisticated methods reproduce the full 3D model.

Image recovery


The task of image restoration is to remove noise (sensor noise, blur of a moving object, etc.).

The simplest approach to this problem is various kinds of filters, such as low-pass or median filters.

Higher levels of noise removal are achieved by first analyzing video data for various structures, such as lines or edges, and then controlling the filtering process based on that data.

Optical flow analysis

Finding the movement of pixels between two images.
Several motion estimation problems in which a sequence of images (video data) is processed to find an estimate of the speed of each point in the image or 3D scene.

Examples of such tasks are: determining three-dimensional camera movement, tracking, i.e. following the movements of an object (for example, cars or people).
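A minimal sketch of dense optical flow between two consecutive grayscale frames using OpenCV's Farneback method; the parameter values are common illustrative defaults.

```python
import cv2

def dense_flow(prev_gray, next_gray):
    """Estimate per-pixel motion between two frames.
    Returns an H x W x 2 array of (dx, dy) displacements."""
    return cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)
```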

Image processing methods

Pixel counter

Counts the number of light or dark pixels.
Using a pixel counter, the user can select a rectangular area on the screen at a location of interest, for example where they expect to see the faces of people passing by. The camera immediately reports the number of pixels along the sides of the rectangle.

The pixel counter allows you to quickly check whether a mounted camera meets regulatory or customer pixel resolution requirements, for example for the faces of people entering camera-monitored doors or for license plate recognition purposes.
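A minimal sketch of such a pixel counter in Python with NumPy; the rectangle and brightness threshold are illustrative.

```python
import numpy as np

def count_pixels(gray, roi, threshold=128):
    """Count light and dark pixels inside a rectangular region of interest.
    gray: 2-D array of brightness values; roi: (x, y, width, height)."""
    x, y, w, h = roi
    patch = gray[y:y + h, x:x + w]
    light = int(np.count_nonzero(patch >= threshold))
    dark = patch.size - light
    return light, dark
```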

Binarization


Converts a grayscale image to a binary one (white and black pixels only).
The value of each pixel is conventionally coded as "0" or "1": "0" is called the background and "1" the foreground.

Often when storing digital binary images, a bitmap is used, where one bit of information is used to represent one pixel.

Historically, especially in the early stages of the technology, the two colors used were black and white, though this is not mandatory.
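A minimal binarization sketch with OpenCV, letting Otsu's method pick the threshold from the brightness histogram:

```python
import cv2

def binarize(gray):
    """Turn a grayscale image into a black-and-white one.
    Otsu's method chooses the threshold automatically."""
    threshold, binary = cv2.threshold(gray, 0, 255,
                                      cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return threshold, binary  # 'binary' contains only 0 (background) and 255 (foreground)
```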

Segmentation

Used to search and/or count parts.

The purpose of segmentation is to simplify and/or change the representation of an image so that it is simpler and easier to analyze.

Image segmentation is commonly used to highlight objects and boundaries (lines, curves, etc.) in images. More precisely, image segmentation is the process of assigning labels to each pixel in an image such that pixels with the same labels share common visual characteristics.

The result of image segmentation is a set of segments that together cover the entire image, or a set of contours extracted from the image. All pixels in a segment are similar in some characteristic or calculated property, such as color, brightness, or texture. Neighboring segments differ significantly in this characteristic.
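A simple way to use segmentation for finding and counting parts is connected-component labelling of a binarized image; a short OpenCV sketch:

```python
import cv2

def count_parts(binary):
    """Label connected foreground regions and report their areas,
    a simple route to finding and counting parts after binarization."""
    n_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)
    areas = stats[1:, cv2.CC_STAT_AREA]  # label 0 is the background
    return n_labels - 1, areas
```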

Reading Barcodes


A barcode is graphic information applied to the surface, label, or packaging of a product that makes it readable by technical means: a sequence of black and white stripes or other geometric shapes.
In machine vision, barcodes are used to decode 1D and 2D codes designed to be read or scanned by machines.
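A short sketch of barcode decoding, assuming the third-party pyzbar library (a Python wrapper around the ZBar decoder) is installed; the function names are those of that library, not part of OpenCV itself.

```python
import cv2
from pyzbar.pyzbar import decode  # third-party library, assumed installed

def read_barcodes(bgr_image):
    """Return (type, payload) pairs for the 1D/2D codes found in the image."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    return [(item.type, item.data.decode("utf-8")) for item in decode(gray)]
```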

Optical Character Recognition

Optical Character Recognition: Automated reading of text such as serial numbers.

OCR is used to convert books and documents into electronic form, to automate business accounting systems, or to publish text on a web page.

Optical text recognition makes it possible to edit the text, search for words or phrases, store it in a more compact form, display or print the material without loss of quality, analyze the information, and apply machine translation, formatting, or text-to-speech.
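A minimal OCR sketch, assuming the pytesseract wrapper and the Tesseract OCR engine are installed:

```python
from PIL import Image
import pytesseract  # requires the Tesseract OCR engine to be installed

def read_text(image_path, language="eng"):
    """Return the machine-readable text found in an image of a document."""
    return pytesseract.image_to_string(Image.open(image_path), lang=language)
```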

My program written in LabView for working with images

Computer vision was used for non-destructive quality control of superconducting materials.

Introduction. Solving the problems of ensuring comprehensive security (anti-terrorist and mechanical protection of facilities as well as the technological safety of engineering systems) currently requires the systematic organization of monitoring of the current state of objects. Among the most promising ways to monitor the current state of objects are optical and optoelectronic methods based on technologies for processing optical-source video images. These include programs for working with images, modern image processing methods, and equipment for acquiring, analyzing, and processing images, that is, the set of tools and methods belonging to the field of computer and machine vision. Computer vision is a general set of techniques that allow computers to see and recognize three- or two-dimensional objects, whether engineering objects or not. Working with computer vision requires digital or analog input/output devices, as well as computer networks and IP location analyzers designed to control the production process and to prepare information for making operational decisions in the shortest possible time.

Statement of the problem. Today, the main task for the designed computer vision systems remains the detection, recognition, identification and qualification of potential risk objects located in a random location in the area of ​​operational responsibility of the complex. Currently existing software products aimed at solving the listed problems have a number of significant disadvantages, namely: significant complexity associated with the high detail of optical images; high power consumption and a fairly narrow range of capabilities. Expanding the tasks of detecting objects of potential risk to the area of ​​searching for random objects in random situations located in a random location is not possible with existing software products, even with the use of a supercomputer.

Goal. Development of a universal program for processing optical-source images with streaming data analysis; that is, the program must be light and fast enough to run on a small-sized computing device.

Tasks:

  • development of a mathematical model of the program;
  • writing a program;
  • testing the program in a laboratory experiment, with full preparation and conduct of the experiment;
  • research into the possibility of using the program in related areas of activity.

Analysis of the relevance of program development. The relevance of the program is determined by:
  • the lack of image processing programs on the software market that provide a detailed analysis of the engineering components of objects;
  • constantly growing requirements for the quality and speed of obtaining visual information, sharply increasing the demand for image processing programs;
  • the existing need for high-performance programs that are reliable and user-friendly;
  • the need for programs that combine high performance with simple operation, which is very difficult to achieve today. Take Adobe Photoshop as an example: this graphic editor offers a harmonious combination of functionality and ease of use for the average user, but it cannot perform complex image analysis (for example, analysis by constructing a mathematical relationship (function), or integral image processing);
  • the high cost of professional visual information processing programs: when such software is of high quality, its price is extremely high, down to charging separately for individual functions of a particular suite. The graph below shows the price/quality relationship for simple analogues of the program.

To simplify the solution of problems of this type, I developed a mathematical model and wrote an image analysis program for a computing device based on simple transformations of the source images.

The program works with transformations such as binarization, brightness, image contrast, etc. The operating principle of the program is demonstrated using the example of the analysis of superconducting materials.

When creating composite superconductors based on Nb3Sn, the volume ratio of bronze and niobium, the size and number of fibers in it, the uniformity of their distribution over the cross section of the bronze matrix, the presence of diffusion barriers and stabilizing materials are varied. For a given volume fraction of niobium in a conductor, an increase in the number of fibers leads, accordingly, to a decrease in their diameter. This leads to a noticeable increase in the Nb/Cu-Sn interaction surface, which significantly accelerates the process of growth of the superconducting phase. Such an increase in the amount of the superconducting phase with an increase in the number of fibers in the conductor ensures an increase in the critical characteristics of the superconductor. In this regard, it is necessary to have a tool to control the volume fraction of the superconducting phase in the final product (composite superconductor).

When creating the program, the importance of conducting research into the materials from which superconducting cables are created was taken into account, since if the ratio of niobium to bronze is incorrect, an explosion of wires is possible, and, consequently, human casualties, monetary costs and loss of time. This program allows you to determine the quality of wires based on a chemical and physical analysis of the object.

Program block diagram


Description of the research stages.

Stage 1. Sample preparation: cutting a composite superconductor on an electrical discharge machine; pressing the sample into a plastic matrix; polishing the sample to a mirror finish; etching the sample to highlight niobium fibers on a bronze matrix. Samples of pressed composite superconducting samples were obtained;

Stage 2. Imaging: obtaining metallographic images using a scanning electron microscope.

Stage 3. Image processing: creation of a tool for determining the volume fraction of the superconducting phase in a metallographic image, and collection of statistically significant data for a specific type of sample. Mathematical models of the various image processing tools were created; a software tool was developed for estimating the volume fraction of the superconducting phase; the program was simplified by combining several mathematical functions into one; the average volume fraction of niobium fibers in the bronze matrix was found to be 24.7 ± 0.1%. The low deviation indicates high repeatability of the composite wire's structure.

Electron microscopy images of composite superconductors

Image processing methods in the program.

  • Identification: an individual instance of an object belonging to a class is recognized.
  • Binarization: the process of converting a color (or grayscale) image into a two-color, black-and-white one.
  • Segmentation: the process of dividing a digital image into multiple segments (sets of pixels, also called superpixels).
  • Erosion: a process in which a structuring element is passed over all the pixels of the image; if at some position every unit pixel of the structuring element coincides with a unit pixel of the binary image, the pixel of the output image corresponding to the central pixel of the structuring element is set. A sketch of erosion and dilation in code is given after this list.
  • Dilation: convolution of an image, or a selected region of it, with a certain kernel. The kernel can have any shape and size; a single anchor position is designated in it, which is aligned with the current pixel when the convolution is computed.
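A minimal sketch of erosion and dilation with OpenCV, as referenced in the list above; applying erosion followed by dilation (morphological opening) removes small noise from a binary mask while preserving the size of the remaining regions.

```python
import cv2
import numpy as np

kernel = np.ones((3, 3), np.uint8)  # the structuring element

def clean_mask(binary):
    """Erosion removes isolated foreground pixels; dilation then restores
    the size of the surviving regions (together: morphological opening)."""
    eroded = cv2.erode(binary, kernel, iterations=1)
    return cv2.dilate(eroded, kernel, iterations=1)
```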

Program formulas

Binarization formula (Otsu method):

Erosion formula:

Dilation formula:

Diagram of dilation and erosion

Color threshold segmentation formulas:

Determining the brightness gradient magnitude for each image pixel:

Threshold calculation:
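The original formula images are not reproduced here; for reference, the standard textbook forms of the operations named above are:

```latex
% Otsu binarization: choose the threshold t that maximizes the
% between-class variance of the two resulting groups of pixels
t^{*} = \arg\max_{t}\; \omega_{0}(t)\,\omega_{1}(t)\,\bigl[\mu_{0}(t)-\mu_{1}(t)\bigr]^{2}

% Erosion and dilation of a binary image A by a structuring element B
A \ominus B = \{\, z \mid B_{z} \subseteq A \,\}, \qquad
A \oplus B = \bigcup_{b \in B} A_{b}

% Brightness-gradient magnitude at each pixel, used for threshold segmentation
\lvert \nabla I(x,y) \rvert = \sqrt{\left(\frac{\partial I}{\partial x}\right)^{2}
                                  + \left(\frac{\partial I}{\partial y}\right)^{2}}
```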

Equipment used

Program interface