Distributed Vision Processing

From OMAPpedia

Revision as of 20:15, 11 October 2012 by Emrainey (Talk | contribs)

Glossary

Design

Rationale

The design rationale of DVP is to create a systematic way to process machine vision kernels across multiple cores in a heterogeneous computing system like the OMAP4430, leveraging specialized hardware which can greatly accelerate specific machine vision kernels. DVP is a generic framework of kernels, but is not a generic computation language like OpenCL. Each kernel is precompiled for its desired core and is accessible as a Node in the DVP Graph.

Why not OpenMAX?

OpenMAX is not the right interface for vision. DVP needs more capabilities and lower overhead (as little as one IPC per graph, rather than one per message) than OMX can provide. OMX is specific to media codecs, not vision kernels (which are conceptually like single function calls).

Noteworthy Design Features/Decisions

Manager Prioritization

DVP internally prioritizes some hardware blocks over others because their hardware designs give them implicit performance advantages. If multiple Managers support the same kernel, DVP will internally determine whose kernel is called. DVP will use the Load Balancing information as a second level of decision making.

Currently the prioritization for OMAP4 is:

  1. simcop
  2. dsp
  3. cpu

This means that if a Kernel "A" is implemented on all Managers, DVP will prefer to execute the kernel on the highest priority Manager first, working its way down the priority list only when the Core that the Manager works on is exhausted of resources.
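The selection rule above can be sketched in C. This is only an illustration under stated assumptions: the `manager_t` record, its kernel bitmask, the percent-based load figure, and `pick_manager()` are all hypothetical names invented for this example and are not part of the DVP API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical manager record; field names are illustrative, not DVP's API. */
typedef struct {
    const char *name;      /* e.g. "simcop", "dsp", "cpu" */
    int         priority;  /* lower value = preferred */
    uint32_t    kernels;   /* bitmask of supported kernels (bit n = kernel n) */
    unsigned    load;      /* estimated load already committed, in percent */
} manager_t;

/* Return the index of the most preferred manager that supports `kernel` and
 * still has room for `cost` percent of estimated load, or -1 if none does. */
static int pick_manager(const manager_t *mgrs, size_t count,
                        unsigned kernel, unsigned cost)
{
    int best = -1;
    assert(kernel < 32);
    for (size_t i = 0; i < count; i++) {
        if (!(mgrs[i].kernels & (UINT32_C(1) << kernel)))
            continue;                 /* kernel not implemented on this manager */
        if (mgrs[i].load + cost > 100)
            continue;                 /* core exhausted: try the next manager */
        if (best < 0 || mgrs[i].priority < mgrs[best].priority)
            best = (int)i;
    }
    return best;
}
```

With a table holding simcop (priority 0), dsp (priority 1) and cpu (priority 2), a kernel supported by all three lands on simcop until its estimated load is used up, then on dsp, then on cpu.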

Load Balancing

DVP understands that real-time constraints can be expected in production devices, so it has an Estimated Load Table whose input helps control where the machine vision kernels execute. Load Balancing is predictive only and does not use any run-time checking to compute load. This is done because not all Cores are capable of run-time detection. For example, the SIMCOP is really an accelerator and is more accurately thought of as serially accessed in bursts of 100% utilization, since no two tasks can be present concurrently (or at least aren't in this design).

Multiple DVP Instances

DVP can execute multiple Graphs in parallel within the same process or across multiple processes. The Estimated Load Table is in a semaphore-protected piece of shared memory so that multiple processes can utilize the DVP system at once. Each process gets its own instance of the Managers, but there is only one Estimated Load Table.

Graphs

Machine Vision Kernels can be called in "bulk" by formatting a Graph which indicates the exact set of kernels to call, in what order, and their associated parameters.

In this example, we have Kernels A,B,C,D,E,F,G,H which have Nodes a,b,c,d,e,f,g,h respectively. In an unoptimized example, Nodes a through h can be called in series (in a single Section). The Graph is then:

a -> b -> c -> d -> e -> f -> g -> h

In an optimized version the programmer may have discovered that some nodes are not dependent on previous nodes and can be reordered and made parallel to gain performance. In this example, 'b', 'd' and 'e' depend on 'a'; 'c' depends on 'b'; 'f' on 'e'; 'g' on 'c' and 'f'; and 'h' on 'g'. There are now five Sections ('a', 'bc', 'd', 'ef', and 'gh') and three Orders: 'a' has order 0; 'bc', 'd', and 'ef' have order 1; and 'gh' has order 2.

     |-> b -> c -> |
    /|             |\
   a |-> d         | g -> h
    \|             |/
     |-> e -> f -> | 
Order:
   0     1           2

In this example b, d, and e depend on a but not on each other. This Graph will execute 'a' first and wait until completion, then will concurrently launch 'bc', 'd', and 'ef' (potentially in parallel on an SMP system). When those have completed, it will launch 'gh' and wait for completion.
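The order-by-order execution described above can be sketched in C. The `section_t` record and `run_graph()` are hypothetical names used only for illustration, and the sketch runs sections serially, recording a trace instead of launching threads; a real implementation would presumably launch the sections of one order concurrently and join them before moving to the next order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical section record: a run of nodes plus a programmer-assigned order. */
typedef struct {
    const char *nodes;  /* node names executed in series, e.g. "bc" */
    unsigned    order;  /* sections sharing an order may run concurrently */
} section_t;

/* Walk the graph one order at a time, appending each section's nodes to
 * `trace`. A '|' marks the barrier where all sections of the previous order
 * must have completed. Sections are visited serially in this sketch. */
static void run_graph(const section_t *secs, size_t count,
                      char *trace, size_t trace_len)
{
    unsigned max_order = 0;
    assert(trace_len > 0);
    trace[0] = '\0';
    for (size_t i = 0; i < count; i++)
        if (secs[i].order > max_order)
            max_order = secs[i].order;

    for (unsigned order = 0; order <= max_order; order++) {
        if (order > 0)
            strncat(trace, "|", trace_len - strlen(trace) - 1);
        for (size_t i = 0; i < count; i++)
            if (secs[i].order == order)
                strncat(trace, secs[i].nodes, trace_len - strlen(trace) - 1);
    }
}
```

For the five Sections of the example, the trace reads "a|bcdef|gh": 'a' alone, then the three order 1 sections, then 'gh'. Note that the sketch never inspects data dependencies; it trusts the orders the programmer assigned.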

Allocating Memory

DVP supports multiple memory types for Images and Buffers.

On systems where the TILER API is exposed to DVP directly it will use it to allocate 1D/2D memory. On Host systems, only plain virtual memory is supported.

Section Completion Callbacks

After each section completes execution a callback is issued to the client to notify them of completion. This callback has several features.

Remote Execution Considerations

Latency

One of the biggest challenges of using heterogeneous multi-cores is the latency involved in IPC. This is minimized by offloading as many tasks as can be sent at once in a single transmission. In the context of DVP, this means sending as many kernels as possible to execute on a remote core at once. In some cases, like SIMCOP, this may not be possible as the number of supported kernels is low. However, the DSP can process many different types of kernels, and is a good candidate to offload a myriad of tasks until it is fully utilized. The effect of this optimization of work in the Graph is to coalesce as many core-centric operations into a single Section as possible. Sections are analyzed to see how many nodes ahead of the current node can be sent together to the appropriate remote core. In this manner, entire sections can be offloaded to the remote cores, thus greatly improving local loading and minimizing per-Node latency.
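The look-ahead described above can be sketched in C as a run-length scan over the cores assigned to a Section's nodes. The `core_t` enum and both helper functions are hypothetical and exist only for this illustration; they do not mirror DVP's internal data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical core identifiers; not DVP's actual enum. */
typedef enum { CORE_CPU, CORE_DSP, CORE_SIMCOP } core_t;

/* Number of consecutive nodes, starting at `start`, bound for the same core. */
static size_t batch_length(const core_t *cores, size_t count, size_t start)
{
    size_t n = 1;
    assert(start < count);
    while (start + n < count && cores[start + n] == cores[start])
        n++;
    return n;
}

/* Count the remote transmissions a section needs when each run of same-core
 * nodes is sent as one batch. Runs on the local CPU need no IPC. */
static size_t count_transmissions(const core_t *cores, size_t count)
{
    size_t sent = 0;
    for (size_t i = 0; i < count; i += batch_length(cores, count, i))
        if (cores[i] != CORE_CPU)
            sent++;
    return sent;
}
```

For a section whose nodes map to DSP, DSP, DSP, CPU, SIMCOP, DSP, DSP, batching needs three transmissions instead of the six that a one-message-per-node scheme would use.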

Local Optimization

Each Manager can locally optimize Graph performance, beyond what the Boss may understand. For example, the CPU Manager may have some specialized assembly routines to do an optimized version of a kernel if the right conditions are met (specific parameters, combination with subsequent kernels, etc). Each Manager can and must make these determinations internally. These optimized kernels should only be used if the overhead of checking for and running the optimization is greatly outweighed by the MHz savings. Programmers of custom Managers should carefully weigh optimization checks.

Dependencies

DVP on all OMAP4 platforms (Android/QNX/etc.) depends on the Syslink driver (info at Syslink Project) and TILER memory allocator.

ICS

The Android ICS release changes the IPC mechanism to the Ducati and Tesla cores to use an interface called RPMSG, which itself is built upon the VirtIO framework for kernel-level virtualization. Underneath all the layers is still the Mailbox HW driver.

ICS also adds the ION memory manager, which implements a unified method of allocating 1D/2D buffers in the system.

Extending DVP

It is relatively straightforward to extend DVP to provide private implementations of some kernels. The Kernel Enum list has a definition for a "user" enum base which can be used to create custom kernel enums. These must simply not conflict with existing enums in the system. The Boss will scan all Managers for exported kernel enums and will execute the kernels on those Managers given the prioritization of the Managers. If no Manager supports a kernel except the new extended Manager, then prioritization is not an issue. Prioritization is only considered when two or more Managers contain a kernel enum.
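The "user" enum base idea can be sketched in C as follows. Every identifier and value here (`KN_USER_BASE`, `MY_KN_EDGE_THIN`, the built-in names, and so on) is a made-up placeholder; the real names and numeric values come from DVP's own Kernel Enum header and will differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder built-in kernel enums; the real DVP list is different. */
enum {
    KN_NOOP = 0,
    KN_COLOR_CONVERT,
    KN_SOBEL,
    KN_BUILTIN_END,        /* one past the last built-in kernel */
    KN_USER_BASE = 0x1000  /* assumed start of the range reserved for users */
};

/* Custom kernels are numbered upward from the user base, so they cannot
 * collide with built-in enums as long as the built-ins stay below the base. */
enum {
    MY_KN_EDGE_THIN = KN_USER_BASE,
    MY_KN_BLOB_LABEL
};

/* True when `kernel` lies in the user-defined range. */
static bool is_user_kernel(int kernel)
{
    assert(KN_BUILTIN_END <= KN_USER_BASE);  /* ranges must not overlap */
    return kernel >= KN_USER_BASE;
}
```

A custom Manager would then export `MY_KN_EDGE_THIN` and `MY_KN_BLOB_LABEL` in its kernel list. Since no other Manager exports them, prioritization never comes into play for these kernels.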

Compiling

Each new DVP Manager simply needs to implement the existing DVP Manager API (or duplicate the DVP CPU Manager code and replace the switch statement with your own enums and code).

Loading

On HLOS platforms with scandir() and fnmatch() implemented, the DVP Boss will dynamically load any shared object with the appropriate name ("<system path>/dvp_kgm_XXXX.so"). While this might seem dangerous from a security point of view, the Boss will specifically load only from the system library paths, which must themselves be compromised in order to breach security.
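The discovery step can be sketched in C with the POSIX scandir() and fnmatch() calls named above. The pattern is taken from the file name quoted in the text; the function names and the choice to print the paths are assumptions made for this illustration. A real loader would presumably hand each path to the platform's dynamic loader instead of printing it.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* expose scandir() and alphasort() on glibc */
#endif
#include <assert.h>
#include <dirent.h>
#include <fnmatch.h>
#include <stdio.h>
#include <stdlib.h>

/* True when `name` follows the manager naming pattern quoted in the text. */
static int name_matches(const char *name)
{
    return fnmatch("dvp_kgm_*.so", name, 0) == 0;
}

/* scandir() filter: keep only directory entries that look like managers. */
static int is_manager_lib(const struct dirent *entry)
{
    return name_matches(entry->d_name);
}

/* Print every candidate manager library found in `dir`. Returns the number
 * of matches, or -1 if the directory could not be scanned. */
static int list_manager_libs(const char *dir)
{
    struct dirent **names = NULL;
    int n;
    assert(dir != NULL);
    n = scandir(dir, &names, is_manager_lib, alphasort);
    if (n < 0)
        return -1;
    for (int i = 0; i < n; i++) {
        printf("%s/%s\n", dir, names[i]->d_name);
        free(names[i]);   /* scandir() allocates each entry with malloc() */
    }
    free(names);
    return n;
}
```

Restricting `dir` to the system library paths is what gives the security property described above: the scan itself accepts any file with a matching name.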

Memory Allocation and Usage

DVP allows the programmer to allocate memory in many formats, depending on the local HLOS. When DVP is running on a system with a TILER, DVP can allocate 1D cached, 1D uncached, and 2D uncached tiled memory. Most allocations are normally made via malloc, calloc or memalign. The RPC layer of DVP understands the cache issues associated with remote core execution and works to keep buffers consistent after Section executions.
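For the plain virtual memory case, an aligned image allocation might look like the C sketch below. The `image_t` record, the 64-byte alignment, and the helper names are assumptions for illustration only; DVP's own image structure and allocator entry points are not shown here. The sketch uses posix_memalign(), the POSIX counterpart of memalign.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define IMG_ALIGNMENT 64u  /* assumed alignment in bytes; must be a power of two */

/* Hypothetical single-plane image descriptor; not DVP's image type. */
typedef struct {
    unsigned char *data;
    size_t width, height;  /* in pixels */
    size_t stride;         /* bytes per row, rounded up to IMG_ALIGNMENT */
} image_t;

/* Round `n` up to the next multiple of `a`. */
static size_t round_up(size_t n, size_t a)
{
    return (n + a - 1) / a * a;
}

/* Allocate an image whose base address and row stride are both aligned.
 * Returns 0 on success, -1 on allocation failure. */
static int image_alloc(image_t *img, size_t width, size_t height,
                       size_t bytes_per_pixel)
{
    void *p = NULL;
    assert(img != NULL);
    img->stride = round_up(width * bytes_per_pixel, IMG_ALIGNMENT);
    if (posix_memalign(&p, IMG_ALIGNMENT, img->stride * height) != 0)
        return -1;
    assert((uintptr_t)p % IMG_ALIGNMENT == 0);
    img->data = p;
    img->width = width;
    img->height = height;
    return 0;
}

/* Release the pixel storage; memory from posix_memalign() is freed with free(). */
static void image_free(image_t *img)
{
    free(img->data);
    img->data = NULL;
}
```

A 160x120 image at one byte per pixel gets a stride of 192 bytes under these assumptions, so every row starts on a 64-byte boundary.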

Trade-offs

No Data Dependencies

DVP does not assume that it knows better than the programmer. It will execute Nodes in the order that the programmer gave it. DVP allows the programmer to assemble and execute a Graph regardless of how the data dependencies work out. This means that the programmer may be able to construct a bad graph (incorrect dependencies). The onus of correct behaviour is left to the programmer. The trade-off here is code complexity and run-time overhead versus development-time overhead. If a Manager determines that the kernels can be done in a more efficient manner, it may do so. An example of this is a combined kernel which takes 1 input and produces 3 outputs, work which would normally be done by individual kernels.

No reordering of Graph Sections/Nodes/Kernels

DVP has been designed thus far to assume that the programmer is the best optimizer, not a complex graph dependency system.

Camera Considerations

Machine Vision has a fundamentally different approach to camera control than does Human Vision. Typically sensor tuning and camera controls are designed with Human Vision consumption in mind and not anything else. Machine Vision does not care about aesthetically pleasing images. Machine Vision "care-abouts" can be more varied and are functionally driven by the Machine Vision algorithms being used. To that end, cameras which need to enable Machine Vision need the following functionality:

Implementation

Languages

DVP is implemented in C with some C99 extensions. GCC and Microsoft's CL compiler can both compile the majority of DVP. DVP does contain some NEON ARM assembly (see Writing ARM Assembly) which is in the AT&T assembly style.

DVP contains other components which are implemented in C++ (VisionCam/VisionEngine). These are convenience classes used to simplify usage of DVP within a HLOS environment.

Supported HLOS

Android Specific Issues

Enabled Cores

OMAP4

OMAP5

PC (Ubuntu/Windows)


1: On platforms which enable OpenCL.
2: On platforms with network connectivity and when an EC2 RPC is implemented.

Supported Kernels

DVP has some algorithm kernels which will be released "openly" with DVP.

Other Components

SOSAL

SOSAL is a very simple operating system abstraction layer plus design pattern library which allows for rapid development. It contains:

Display

Display is a critical piece of development code which allows programmers to see the images coming from the camera or the output from kernels. Supported Display Techs are:

VisionCam

VisionCam is a C++ wrapper around the OMX-Camera interface which aims to simplify the OMX interface style down to the bare minimum needed to enable Machine Vision applications.

Subclasses

VisionCam has several subclasses which allow for various stages of development. They include:

Using VisionCam in Socket Mode

On the Android device (Blaze/Tablet/etc)

# vcam_server

On the Host, build DVP using the instructions below for your platform. Connect your platform to the PC via microUSB. Then execute:

$ adb forward tcp:8501 tcp:8501
$ adb forward tcp:8502 tcp:8502

To get single (front) camera image:

$ vcam_simple -t 3 --name localhost -w 160 -h 120 -c NV12 -s 1

To get the stereo (front) camera image (Top-Bottom) on Blaze:

$ vcam_simple -t 3 --name localhost -w 160 -h 240 -c NV12 -s 2 -tb

VisionEngine

VisionEngine is a utility C++ class used to implement Machine Vision applications. It contains a thread and a reference to VisionCam and DVP.

Base Class Features

Dual Port Support

The VisionEngine (on latest develop, post RLS_1.80) supports Multiple Camera Ports and Multiple Graphs. Each port may be associated with multiple graphs. When GraphUpdate receives a VisionCamFrame with a specific port, the subclass must update the appropriate graphs using the m_correlation variable.
