Coding the application

Let's jump into the code.

First we have the class, the VideoPlayer:

@Slf4j // assumption: the 'log' used below comes from Lombok's @Slf4j (or a similar logger)
public class VideoPlayer {

    private static final String AUTONOMOUS_DRIVING_RAMOK_TECH = "Autonomous Driving(ramok.tech)";
    private String windowName;
    private volatile boolean stop = false;
    private Yolo yolo;
    private final OpenCVFrameConverter.ToMat converter = new OpenCVFrameConverter.ToMat();
    public static final AtomicInteger atomicInteger = new AtomicInteger();

    public void startRealTimeVideoDetection(String videoFileName, Speed selectedIndex,
                                            boolean yoloModel) throws java.lang.Exception {
        log.info("Start detecting video " + videoFileName);
        int id = atomicInteger.incrementAndGet();
        windowName = AUTONOMOUS_DRIVING_RAMOK_TECH + id;
        log.info(windowName);
        yolo = new Yolo(selectedIndex, yoloModel);
        startYoloThread();
        runVideoMainThread(videoFileName, converter);
    }

The YOLO thread simply runs in the background and predicts the bounding boxes. The pre-trained YOLO model is loaded once, when the application starts; the thread then runs frames through the model and converts the raw output into a list of detected objects. Each detected object contains the center of its bounding box, its width, its height, and the other information needed to draw the rectangles.
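
startYoloThread() isn't shown in this excerpt; here is a minimal sketch of what it might look like, where predictBoundingBoxes() is a hypothetical method name standing in for the actual prediction call:

private Thread yoloThread;

private void startYoloThread() {
    yoloThread = new Thread(() -> {
        while (!stop) {
            try {
                // Hypothetical call: take the most recent frame pushed by the
                // video thread, run the YOLO model on it, and cache the resulting
                // list of detected objects for the drawing step
                yolo.predictBoundingBoxes();
            } catch (Exception e) {
                // Log and continue; one failed prediction shouldn't kill the thread
                log.error("Prediction failed", e);
            }
        }
    });
    yoloThread.start();
}

With the YOLO thread running in the background, we then start the main thread, which is the video thread: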

private void runVideoMainThread(String videoFileName, OpenCVFrameConverter.ToMat toMat)
        throws FrameGrabber.Exception {
    FFmpegFrameGrabber grabber = initFrameGrabber(videoFileName);
    while (!stop) {
        Frame frame = grabber.grab();
        if (frame == null) {
            // End of the video file
            log.info("Stopping");
            stop();
            break;
        }
        if (frame.image == null) {
            // Skip audio-only frames
            continue;
        }
        // Hand the frame to the YOLO thread, then draw the latest predictions on it
        yolo.push(frame);
        opencv_core.Mat mat = toMat.convert(frame);
        yolo.drawBoundingBoxesRectangles(frame, mat);
        imshow(windowName, mat);
        char key = (char) waitKey(20);
        // Exit this loop on Escape:
        if (key == 27) {
            stop();
            break;
        }
    }
}
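
The helper initFrameGrabber() isn't shown in this excerpt; a minimal sketch with JavaCV's FFmpegFrameGrabber, assuming no extra configuration is needed, would be:

private FFmpegFrameGrabber initFrameGrabber(String videoFileName) throws FrameGrabber.Exception {
    // Open the video file and start decoding; grab() can then be called in a loop
    FFmpegFrameGrabber grabber = new FFmpegFrameGrabber(videoFileName);
    grabber.start();
    return grabber;
}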

This thread grabs the frames from the video file and pushes them onto a stack so that the YOLO thread can read them. It then draws the rectangles: since we have the low-level details of each detected object and the grid size, we can draw the boxes directly into the frame, mixing the frame with the bounding boxes. Finally, we show this modified frame to the user.
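
The drawing step boils down to scaling grid coordinates up to pixel coordinates. Here is a hypothetical sketch of it; the DetectedObject accessor names are assumptions, and rectangle() comes from JavaCV's opencv_imgproc:

// Hypothetical: map one detected object from grid units to pixels and draw it
private void drawBox(DetectedObject obj, opencv_core.Mat mat, Speed speed) {
    // Each grid cell covers (columns / gridWidth) x (rows / gridHeight) pixels
    double cellW = (double) mat.cols() / speed.gridWidth;
    double cellH = (double) mat.rows() / speed.gridHeight;
    // Center, width, and height are in grid units; convert them to pixel corners
    int x1 = (int) ((obj.getCenterX() - obj.getWidth() / 2) * cellW);
    int y1 = (int) ((obj.getCenterY() - obj.getHeight() / 2) * cellH);
    int x2 = (int) ((obj.getCenterX() + obj.getWidth() / 2) * cellW);
    int y2 = (int) ((obj.getCenterY() + obj.getHeight() / 2) * cellH);
    rectangle(mat, new opencv_core.Point(x1, y1), new opencv_core.Point(x2, y2),
            opencv_core.Scalar.RED);
}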

We can choose between three modes through the graphical user interface: fast, medium, and slow:

public enum Speed {

    FAST("Real-Time but low accuracy", 224, 224, 7, 7),
    MEDIUM("Almost Real-time and medium accuracy", 416, 416, 13, 13),
    SLOW("Slowest but high accuracy", 608, 608, 19, 19);

    private final String name;

    public final int width;
    public final int height;
    public final int gridWidth;
    public final int gridHeight;

    Speed(String name, int width, int height, int gridWidth, int gridHeight) {
        this.name = name;
        this.width = width;
        this.height = height;
        this.gridWidth = gridWidth;
        this.gridHeight = gridHeight;
    }

    public String getName() {
        return name;
    }

    @Override
    public String toString() {
        return name;
    }
}

The fastest mode has a low resolution and a 7 x 7 grid, which is very small. This will be fast, giving almost real-time bounding boxes, but we may not detect everything in the image.

As we move to medium and slow, we increase the resolution and also grow the grid: from 7 x 7 to 13 x 13, and then, for slow, by six more cells per side to 19 x 19. With the slowest mode we will be able to detect almost everything in the video, but the bounding boxes will be quite outdated: prediction takes more than two seconds per frame, so the boxes we see describe the scene as it was two seconds in the past.
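
One detail worth noticing in the enum above: all three modes keep the same stride of 32 pixels per grid cell (224/7 = 416/13 = 608/19 = 32), so the slower modes simply feed the network a larger input with more cells to predict over. A quick check over the enum values:

// Prints the per-cell stride and cell count for each mode:
// FAST: 32 px per cell, 49 cells
// MEDIUM: 32 px per cell, 169 cells
// SLOW: 32 px per cell, 361 cells
for (Speed s : Speed.values()) {
    System.out.println(s.name() + ": " + (s.width / s.gridWidth)
            + " px per cell, " + (s.gridWidth * s.gridHeight) + " cells");
}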

Let's see how the application will look:

Let's choose the first video. For this one, we'll use the fastest mode: real-time but lower accuracy. Let's see how it looks:

We're able to detect the cars as they come in, but we may miss a few things, such as some of the cars and people.

Next, let's look at a slower YOLO: the medium mode, with nearly twice the resolution and a 13 x 13 grid. Let's see what happens:

As we can see, we have more bounding boxes here:

We detected the person a bit late, but there are more cars, which means more bounding boxes.

The traffic light comes a bit late, but we were able to see it:

If we choose the third option, we'll see more bounding boxes, but the response will be rather slow.

Let's try running another video with the medium mode. As we can see, this bounding box is quite delayed:

Let's choose the fastest option. Now, as we can see, detection is a bit faster, and the bounding boxes update a bit more frequently:

The algorithm does a good job here. That's it for the demo; considering that we have just CPU processing power, the results are actually quite good. If we were to use the sliding window solution, we wouldn't be able to obtain such a good output.
