Hi, today I would like to announce that my GitHub fork at https://github.com/sowson/darknet has a new update. The fork is an advanced port of the DarkNet CNN from CUDA to OpenCL. It is tested on macOS with an eGPU from Sonnet, the Breakaway RX 570 Puck, and on my GreenPC. It also supports the Intel Iris GPU and OpenCV 3, and there are several use cases for it: Yolo3, Yolo2, Yolo1 and CIFAR-10 all work fine, and so do the demos from a webcam and from mp4 videos.

The overall performance is quite nice. I reached about 20 FPS on Yolo2, so as far as I know it is the fastest DarkNet in OpenCL on the planet. I rewrote most of the BLAS kernels from scratch, and for some of them I used an auto-tuner of my own design.

And there is one more thing. For training, I replaced the pseudo-random picture selection with selection from a permutation of the training set. This means that when you draw n times from a set of n pictures, you get each picture exactly once, instead of possibly getting some pictures several times and others not at all. The implementation is trivial, but it is a training game changer.
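The permutation idea can be sketched in a few lines of C. This is only a minimal illustration under my own assumptions, not the code from the fork: the names `epoch_sampler`, `sampler_init`, `sampler_next`, `sampler_free` and `shuffle_indices` are hypothetical, and the shuffle uses the standard `rand()` for brevity.

```c
#include <assert.h>
#include <stdlib.h>

/* Fill order[0..n-1] with a random permutation of 0..n-1 (Fisher-Yates).
 * rand() % k has a small modulo bias; acceptable for a sketch. */
void shuffle_indices(size_t *order, size_t n)
{
    for (size_t i = 0; i < n; ++i) order[i] = i;
    if (n < 2) return;
    for (size_t i = n - 1; i > 0; --i) {
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }
}

/* Hypothetical sampler: hands out every index exactly once per pass. */
typedef struct {
    size_t *order; /* current permutation of 0..n-1 */
    size_t n;      /* number of training pictures   */
    size_t pos;    /* next position in the pass     */
} epoch_sampler;

/* Returns 0 on success, -1 if memory allocation fails. */
int sampler_init(epoch_sampler *s, size_t n)
{
    assert(n > 0);
    s->order = malloc(n * sizeof *s->order);
    if (!s->order) return -1;
    s->n = n;
    s->pos = 0;
    shuffle_indices(s->order, n);
    return 0;
}

/* Next picture index; reshuffles when a full pass is finished. */
size_t sampler_next(epoch_sampler *s)
{
    if (s->pos == s->n) {
        shuffle_indices(s->order, s->n);
        s->pos = 0;
    }
    return s->order[s->pos++];
}

void sampler_free(epoch_sampler *s)
{
    free(s->order);
    s->order = NULL;
}
```

With n pictures, n consecutive calls to `sampler_next` return each index exactly once, whereas n independent `rand() % n` draws would usually repeat some pictures and skip others.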

In the previous post I explained how to install it on macOS and/or CentOS GNU/Linux; that information is still up to date. Below I want to share all the test case commands for running this OpenCL port, which I made thanks to the very smart people who share many versions on GitHub.

Training, to remind you where I was in May 2018.

And now you can use all of the test case commands below.

Yolo1

# Yolo1 Test on built-in GPU

./darknet yolo test cfg/yolov1.cfg ../weights/yolov1.weights data/dog.jpg

# Yolo1 Demo from 1st WebCam on Computer on eGPU

./darknet yolo demo cfg/yolov1.cfg ../weights/yolov1.weights -i 1 -c 0

# Yolo1 Demo from MP4 Movie on eGPU

./darknet yolo demo cfg/yolov1.cfg ../weights/yolov1.weights ../movies/movie.mp4 -i 1

# Yolo1 Train from ../train folder on eGPU

../darknet/darknet yolo train yolov1.cfg voc.data extraction.conv.weights -i 1

Yolo2

# Yolo2 Test on built-in GPU

./darknet detect cfg/yolov2.cfg ../weights/yolov2.weights data/dog.jpg

# Yolo2 Demo from 1st WebCam on Computer on eGPU

./darknet detector demo cfg/coco.data cfg/yolov2.cfg ../weights/yolov2.weights -i 1 -c 0

# Yolo2 Demo from MP4 Movie on eGPU

./darknet detector demo cfg/coco.data cfg/yolov2.cfg ../weights/yolov2.weights ../movies/movie.mp4 -i 1

# Yolo2 Train from ../train folder on eGPU

../darknet/darknet detector train voc.data yolo-voc.2.0.cfg darknet19_448.conv.23 -i 1

Yolo3

# Yolo3 Test on built-in GPU

./darknet detect cfg/yolov3.cfg ../weights/yolov3.weights data/dog.jpg

# Yolo3 Demo from 1st WebCam on Computer on eGPU

./darknet detector demo cfg/coco.data cfg/yolov3.cfg ../weights/yolov3.weights -i 1 -c 0

# Yolo3 Demo from MP4 Movie on eGPU

./darknet detector demo cfg/coco.data cfg/yolov3.cfg ../weights/yolov3.weights ../movies/movie.mp4 -i 1

# Yolo3 Train from ../train folder on eGPU

../darknet/darknet detector train voc.data yolov3-voc.cfg darknet53.conv.74 -i 1

CIFAR-10

# CIFAR-10 training from ../cifar folder on CPU

../darknet/darknet classifier train cfg/cifar.data cfg/cifar_small.cfg -nogpu

# CIFAR-10 training from ../cifar folder on built-in GPU

../darknet/darknet classifier train cfg/cifar.data cfg/cifar_small.cfg

# CIFAR-10 training from ../cifar folder on eGPU

../darknet/darknet classifier train cfg/cifar.data cfg/cifar_small.cfg -i 1

# CIFAR-10 validation test from ../cifar folder on eGPU (_test cfg has batch=1)

../darknet/darknet classifier valid cfg/cifar.data cfg/cifar_small_test.cfg backup/cifar_small.backup -i 1

# CIFAR-10 test on built-in GPU (_test cfg has batch=1)

../darknet/darknet classifier predict cfg/cifar.data cfg/cifar_small_test.cfg backup/cifar_small.backup data/cifar/train/35728_automobile.png

# CIFAR-10 test on built-in GPU (_test cfg has batch=1)

../darknet/darknet classifier predict cfg/cifar.data cfg/cifar_small_test.cfg backup/cifar_small.backup data/cifar/test/6298_cat.png

# CIFAR-10 test on built-in GPU (_test cfg has batch=1)

../darknet/darknet classifier predict cfg/cifar.data cfg/cifar_small_test.cfg backup/cifar_small.backup data/cifar/test/4882_frog.png

# CIFAR-10 test on built-in GPU (_test cfg has batch=1)

../darknet/darknet classifier predict cfg/cifar.data cfg/cifar_small_test.cfg backup/cifar_small.backup data/cifar/test/2568_truck.png

# CIFAR-10 test on built-in GPU (_test cfg has batch=1)

../darknet/darknet classifier predict cfg/cifar.data cfg/cifar_small_test.cfg backup/cifar_small.backup data/cifar/test/5238_bird.png

Enjoy!

p ;).

@piotr.sowa,

I would like to know whether I can use this method on a Raspberry Pi 3 with the help of a Movidius neural stick for real-time object recognition.

Thanks in advance

@Hashir, it will not be my priority for some time… but I am happy to announce that… Yolo3-spp is now supported :D

@piotr.sowa

Thanks for your valuable comment. Also, how can I run yolo3-spp on the Pi? Is it the same as for yolo3 or tinyYolo? What is meant by spp?

Thanks in advance

@Hashir, for you, I added an RPI option to the build options. Please clone my repo, then edit the top of the Makefile: set OPENCV=0 (optional) and RPI=1 on your RPi. Then please try to install VC4CL, make darknet and let me know how it goes, OK? I need your help because right now I do not have a free RPi for these tests…

@piotr.sowa, after successfully installing VC4CL and downloading darknet from your GitHub repo, I got two errors when running it.

1) I followed the steps from your previous comment, that is, I set GPU=0, GPU_FAST=0, OPENCV=0 and RPI=1. After making darknet I got the following error, and I get it even with GPU, GPU_FAST and OPENCV set to 1:

ibdarknet.a -o darknet -lm -lpthread libdarknet.a

make: warning: Clock skew detected. Your build may be incomplete.

2) Besides that, I ran the command mentioned above from the command line in your darknet folder:

./darknet yolo test cfg/yolov1.cfg ../weights/yolov1.weights data/dog.jpg

After running this I got an error again:

pi@raspberrypi:~/Downloads/darknet-master $ ./darknet yolo test cfg/yolov1.cfg ../weights/yolov1.weights data/dog.jpg

layer filters size input output

0 conv 64 7 x 7 / 2 448 x 448 x 3 -> 224 x 224 x 64 0.944 BFLOPs

1 max 2 x 2 / 2 224 x 224 x 64 -> 112 x 112 x 64

2 conv 192 3 x 3 / 1 112 x 112 x 64 -> 112 x 112 x 192 2.775 BFLOPs

3 max 2 x 2 / 2 112 x 112 x 192 -> 56 x 56 x 192

4 conv 128 1 x 1 / 1 56 x 56 x 192 -> 56 x 56 x 128 0.154 BFLOPs

5 conv 256 3 x 3 / 1 56 x 56 x 128 -> 56 x 56 x 256 1.850 BFLOPs

6 conv 256 1 x 1 / 1 56 x 56 x 256 -> 56 x 56 x 256 0.411 BFLOPs

7 conv 512 3 x 3 / 1 56 x 56 x 256 -> 56 x 56 x 512 7.399 BFLOPs

8 max 2 x 2 / 2 56 x 56 x 512 -> 28 x 28 x 512

9 conv 256 1 x 1 / 1 28 x 28 x 512 -> 28 x 28 x 256 0.206 BFLOPs

10 conv 512 3 x 3 / 1 28 x 28 x 256 -> 28 x 28 x 512 1.850 BFLOPs

11 conv 256 1 x 1 / 1 28 x 28 x 512 -> 28 x 28 x 256 0.206 BFLOPs

12 conv 512 3 x 3 / 1 28 x 28 x 256 -> 28 x 28 x 512 1.850 BFLOPs

13 conv 256 1 x 1 / 1 28 x 28 x 512 -> 28 x 28 x 256 0.206 BFLOPs

14 conv 512 3 x 3 / 1 28 x 28 x 256 -> 28 x 28 x 512 1.850 BFLOPs

15 conv 256 1 x 1 / 1 28 x 28 x 512 -> 28 x 28 x 256 0.206 BFLOPs

16 conv 512 3 x 3 / 1 28 x 28 x 256 -> 28 x 28 x 512 1.850 BFLOPs

17 conv 512 1 x 1 / 1 28 x 28 x 512 -> 28 x 28 x 512 0.411 BFLOPs

18 conv 1024 3 x 3 / 1 28 x 28 x 512 -> 28 x 28 x1024 7.399 BFLOPs

19 max 2 x 2 / 2 28 x 28 x1024 -> 14 x 14 x1024

20 conv 512 1 x 1 / 1 14 x 14 x1024 -> 14 x 14 x 512 0.206 BFLOPs

21 conv 1024 3 x 3 / 1 14 x 14 x 512 -> 14 x 14 x1024 1.850 BFLOPs

22 conv 512 1 x 1 / 1 14 x 14 x1024 -> 14 x 14 x 512 0.206 BFLOPs

23 conv 1024 3 x 3 / 1 14 x 14 x 512 -> 14 x 14 x1024 1.850 BFLOPs

24 conv 1024 3 x 3 / 1 14 x 14 x1024 -> 14 x 14 x1024 3.699 BFLOPs

25 conv 1024 3 x 3 / 2 14 x 14 x1024 -> 7 x 7 x1024 0.925 BFLOPs

26 conv 1024 3 x 3 / 1 7 x 7 x1024 -> 7 x 7 x1024 0.925 BFLOPs

27 conv 1024 3 x 3 / 1 7 x 7 x1024 -> 7 x 7 x1024 0.925 BFLOPs

28 Segmentation fault

Please help me solve this.

Thanks in advance

Regards

Hashir

@piotr.sowa, I did all the steps mentioned above on a Raspberry Pi 3 with an Intel Movidius neural stick.

@Hashir, please try GPU=1, GPU_FAST=1 and RPI=1, and set all the rest to 0. Then open the file “src/opencl.c”, find the line with CL_DEVICE_TYPE_GPU and try changing it to CL_DEVICE_TYPE_ACCELERATOR. Then use Yolo2-Tiny, not Yolo1, and send me the output of the detection test, OK? It looks like I forgot that there is no GPU there, only an ACCELERATOR. Please let me know how it goes, I am very interested in the result of your work :). Thanks!

My test on the CPU is below. I think we need the GPU because the CPU is too slow, but my VC4CL installation failed.

root@raspberrypi:~/darknet# ./darknet detect cfg/yolov2.cfg ../weights/yolov2.weights data/dog.jpg

layer filters size input output

0 conv 32 3 x 3 / 1 416 x 416 x 3 -> 416 x 416 x 32 0.299 BFLOPs

1 max 2 x 2 / 2 416 x 416 x 32 -> 208 x 208 x 32

2 conv 64 3 x 3 / 1 208 x 208 x 32 -> 208 x 208 x 64 1.595 BFLOPs

3 max 2 x 2 / 2 208 x 208 x 64 -> 104 x 104 x 64

4 conv 128 3 x 3 / 1 104 x 104 x 64 -> 104 x 104 x 128 1.595 BFLOPs

5 conv 64 1 x 1 / 1 104 x 104 x 128 -> 104 x 104 x 64 0.177 BFLOPs

6 conv 128 3 x 3 / 1 104 x 104 x 64 -> 104 x 104 x 128 1.595 BFLOPs

7 max 2 x 2 / 2 104 x 104 x 128 -> 52 x 52 x 128

8 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs

9 conv 128 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 128 0.177 BFLOPs

10 conv 256 3 x 3 / 1 52 x 52 x 128 -> 52 x 52 x 256 1.595 BFLOPs

11 max 2 x 2 / 2 52 x 52 x 256 -> 26 x 26 x 256

12 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs

13 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BFLOPs

14 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs

15 conv 256 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 256 0.177 BFLOPs

16 conv 512 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 512 1.595 BFLOPs

17 max 2 x 2 / 2 26 x 26 x 512 -> 13 x 13 x 512

18 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BFLOPs

19 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BFLOPs

20 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BFLOPs

21 conv 512 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 512 0.177 BFLOPs

22 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BFLOPs

23 conv 1024 3 x 3 / 1 13 x 13 x1024 -> 13 x 13 x1024 3.190 BFLOPs

24 conv 1024 3 x 3 / 1 13 x 13 x1024 -> 13 x 13 x1024 3.190 BFLOPs

25 route 16

26 conv 64 1 x 1 / 1 26 x 26 x 512 -> 26 x 26 x 64 0.044 BFLOPs

27 reorg / 2 26 x 26 x 64 -> 13 x 13 x 256

28 route 27 24

29 conv 1024 3 x 3 / 1 13 x 13 x1280 -> 13 x 13 x1024 3.987 BFLOPs

30 conv 425 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 425 0.147 BFLOPs

31 detection

mask_scale: Using default ‘1.000000’

Loading weights from ../weights/yolov2.weights…Done!

data/dog.jpg: Predicted in 157.359320 seconds.

dog: 81%

truck: 74%

bicycle: 83%

p ;).

@piotr.sowa, I set GPU=1, GPU_FAST=1 and RPI=1 and put 0 for all the rest, and also changed CL_DEVICE_TYPE_GPU to CL_DEVICE_TYPE_ACCELERATOR in opencl.c. After this I ran make again from the darknet-master directory, but unfortunately I got an error:

pi@raspberrypi:~/Downloads/darknet-master $ make

make: Warning: File ‘gemm.c’ has modification time 357508 s in the future

gcc -Iinclude/ -Isrc/ -DGPU -DOPENCL -DRPI -DGPU_FAST -Wall -Wno-unknown-pragmas -Wno-unused-variable -Wfatal-errors -fPIC -O2 -DGPU -DOPENCL -DRPI -I/usr/include/ -I/usr/local/include/ -DGPU_FAST -c ./src/gemm.c -o obj/gemm.o

./src/gemm.c:170:20: fatal error: clBLAS.h: No such file or directory

#include "clBLAS.h"

^

compilation terminated.

Makefile:113: recipe for target ‘obj/gemm.o’ failed

make: *** [obj/gemm.o] Error 1

thanks in advance

regards

hashir

@piotr.sowa, after running the yolov2-tiny version I got output, but the prediction was entirely different from the original image. I would also like to know whether yolo versions other than tiny yolo can run on the Pi.

My output is given below

pi@raspberrypi:~/Downloads/darknet-master $ ./darknet detect cfg/yolov2-tiny-voc.cfg yolov2-tiny-voc.weights data/horses.jpg

layer filters size input output

0 conv 16 3 x 3 / 1 416 x 416 x 3 -> 416 x 416 x 16 0.150 BFLOPs

1 max 2 x 2 / 2 416 x 416 x 16 -> 208 x 208 x 16

2 conv 32 3 x 3 / 1 208 x 208 x 16 -> 208 x 208 x 32 0.399 BFLOPs

3 max 2 x 2 / 2 208 x 208 x 32 -> 104 x 104 x 32

4 conv 64 3 x 3 / 1 104 x 104 x 32 -> 104 x 104 x 64 0.399 BFLOPs

5 max 2 x 2 / 2 104 x 104 x 64 -> 52 x 52 x 64

6 conv 128 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 128 0.399 BFLOPs

7 max 2 x 2 / 2 52 x 52 x 128 -> 26 x 26 x 128

8 conv 256 3 x 3 / 1 26 x 26 x 128 -> 26 x 26 x 256 0.399 BFLOPs

9 max 2 x 2 / 2 26 x 26 x 256 -> 13 x 13 x 256

10 conv 512 3 x 3 / 1 13 x 13 x 256 -> 13 x 13 x 512 0.399 BFLOPs

11 max 2 x 2 / 1 13 x 13 x 512 -> 13 x 13 x 512

12 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BFLOPs

13 conv 1024 3 x 3 / 1 13 x 13 x1024 -> 13 x 13 x1024 3.190 BFLOPs

14 conv 125 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 125 0.043 BFLOPs

15 detection

mask_scale: Using default ‘1.000000’

Loading weights from yolov2-tiny-voc.weights…Done!

data/horses.jpg: Predicted in 38.437360 seconds.

traffic light: 75%

@Hashir, please see my output below, but before you do the same, please clone my repo one more time and then set RPI=1 in the Makefile. My test failed, but it did use OpenCL. I think even Yolo2-Tiny is too big for the RPi, but you may try to train CIFAR-10 and test it… One more thing: after the GPU is detected, it takes a few minutes to build all the OpenCL kernels, so be patient. If you need to test without the GPU, use the “-nogpu” parameter. Thanks!

root@raspberrypi:~/darknet# ./darknet detect cfg/yolov2-tiny.cfg ../weights/yolov2-tiny.weights data/dog.jpg -thersh .1

Device ID: 0

Device name: VideoCore IV GPU

Device vendor: Broadcom

Device opencl availability: OpenCL 1.2 VC4CL 0.4

Device opencl used: 0.4

Device double precision: NO

Device max group size: 12

Device address bits: 32

layer filters size input output

0 could not push array to device. error: CL_OUT_OF_RESOURCES

could not push array to device. error: CL_INVALID_MEM_OBJECT

could not push array to device. error: CL_OUT_OF_RESOURCES

could not push array to device. error: CL_INVALID_MEM_OBJECT

could not push array to device. error: CL_OUT_OF_RESOURCES

could not push array to device. error: CL_INVALID_MEM_OBJECT

could not push array to device. error: CL_OUT_OF_RESOURCES

could not push array to device. error: CL_INVALID_MEM_OBJECT

conv 16 3 x 3 / 1 416 x 416 x 3 -> 416 x 416 x 16 0.150 BFLOPs

1 could not push array to device. error: CL_OUT_OF_RESOURCES

could not push array to device. error: CL_INVALID_MEM_OBJECT

could not push array to device. error: CL_OUT_OF_RESOURCES

could not push array to device. error: CL_INVALID_MEM_OBJECT

could not push array to device. error: CL_OUT_OF_RESOURCES

could not push array to device. error: CL_INVALID_MEM_OBJECT
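As a rough check on the sizes involved, the numbers in the layer table above can be reproduced with a little arithmetic. The sketch below is my own illustration, not code from the fork: the function names are hypothetical, the BFLOPs formula assumes two operations per multiply-accumulate (which reproduces the 0.150 printed for layer 0), and the byte count assumes 32-bit floats. It covers only one layer's output buffer, not the weights or any workspace, so it is not a diagnosis of the CL_OUT_OF_RESOURCES errors.

```c
#include <assert.h>
#include <stddef.h>

/* Billions of floating-point operations for one convolutional layer,
 * assuming 2 operations (multiply + add) per kernel weight per output pixel. */
double conv_bflops(int filters, int size, int in_c, int out_w, int out_h)
{
    assert(filters > 0 && size > 0 && in_c > 0 && out_w > 0 && out_h > 0);
    return 2.0 * filters * size * size * in_c * out_w * out_h / 1e9;
}

/* Bytes needed to hold one w x h x c activation tensor of floats. */
size_t activation_bytes(int w, int h, int c)
{
    assert(w > 0 && h > 0 && c > 0);
    return (size_t)w * (size_t)h * (size_t)c * sizeof(float);
}
```

For layer 0 of Yolo2-Tiny (16 filters of 3 x 3 over 3 input channels, 416 x 416 output), `conv_bflops(16, 3, 3, 416, 416)` gives about 0.1495, which rounds to the printed 0.150, and `activation_bytes(416, 416, 16)` is 11,075,584 bytes with 4-byte floats, roughly 10.6 MiB for that single output buffer.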

p ;).

With the last commit, you should be able to run this calculation. Unfortunately it is slow and the detection it computes is currently wrong, but maybe VC4CL will get better in a short time. Fingers crossed.

root@raspberrypi:~/cifar# ../darknet/darknet classifier predict cfg/cifar.data cfg/cifar_small_test.cfg ../weights/cifar_small.weights data/cifar/test/4882_frog.png

Device ID: 0

Device name: VideoCore IV GPU

Device vendor: Broadcom

Device opencl availability: OpenCL 1.2 VC4CL 0.4

Device opencl used: 0.4

Device double precision: NO

Device max group size: 12

Device address bits: 32

layer filters size input output

0 conv 32 3 x 3 / 1 28 x 28 x 3 -> 28 x 28 x 32 0.001 BFLOPs

1 max 2 x 2 / 2 28 x 28 x 32 -> 14 x 14 x 32

2 conv 16 1 x 1 / 1 14 x 14 x 32 -> 14 x 14 x 16 0.000 BFLOPs

3 conv 64 3 x 3 / 1 14 x 14 x 16 -> 14 x 14 x 64 0.004 BFLOPs

4 max 2 x 2 / 2 14 x 14 x 64 -> 7 x 7 x 64

5 conv 32 1 x 1 / 1 7 x 7 x 64 -> 7 x 7 x 32 0.000 BFLOPs

6 conv 128 3 x 3 / 1 7 x 7 x 32 -> 7 x 7 x 128 0.004 BFLOPs

7 conv 64 1 x 1 / 1 7 x 7 x 128 -> 7 x 7 x 64 0.001 BFLOPs

8 conv 10 1 x 1 / 1 7 x 7 x 64 -> 7 x 7 x 10 0.000 BFLOPs

9 avg 7 x 7 x 10 -> 10

10 softmax 10

Loading weights from ../weights/cifar_small.weights…Done!

data/cifar/test/4882_frog.png: Predicted in 1.647542 seconds.

16.96%: dog

16.44%: deer

p ;).

I would like to try it with a Raspberry Pi, but I’m confused as to how to do it. Are there instructions as to what to get and build?

@PeterQuinn, please look at the Makefile, there are instructions at the top: install VC4CL, set OPENCV=0 and RPI=1, save the file, run make and have fun :-).

Thanks. Lots of steps (and recursive instructions), but it was reasonably straightforward to install VC4CL.

I have darknet compiled but it appears to hang.

pi@raspi3:~/darknet $ sudo ./darknet detect cfg/yolov3-tiny.cfg yolov3-tiny.weights data/dog.jpg

Device ID: 0

Device name: VideoCore IV GPU

Device vendor: Broadcom

Device opencl availability: OpenCL 1.2 VC4CL 0.4

Device opencl used: 0.4

Device double precision: NO

Device max group size: 12

Device address bits: 32

Any ideas?

oh. After a long wait (20 minutes?) it continues:

layer filters size input output

10 conv 512 3 x 3 / 1 13 x 13 x 256 -> 13 x 13 x 512 0.399 BFLOPs

11 max 2 x 2 / 1 13 x 13 x 512 -> 13 x 13 x 512

12 could not push array to device. error: CL_OUT_OF_RESOURCES

could not push array to device. error: CL_INVALID_MEM_OBJECT

could not push array to device. error: CL_OUT_OF_RESOURCES

One more update: I increased the memory available to the GPU to 768 and now it finishes loading the network. It still fails later, though.

terminate called after throwing an instance of ‘std::out_of_range’

what(): vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)

Aborted

Any ideas?

@PeterQuinn, for now only the “-nogpu” switch works for me. I checked with the VC4CL author: that project is still under development and does not work well yet, even with extended memory, because, for example, the log, sqrt and pow functions are not implemented yet and they are critical for this to work. I also updated the source code, so please pull the latest version, it may help. The error you posted happens from time to time and is not deterministic.

Here you go… ;-).

p ;).

Nice work. Two questions…

Do you think this would also work on other single board computers than the RPi, for example on the ASUS Tinker Board (which has a Mali GPU)?

I believe you’re using clBLAS. Have you considered using CLBlast (https://github.com/CNugteren/CLBlast) instead?

Thank you. I do not have an ASUS Tinker Board, but if it has OpenCL it should work, maybe with a small Makefile change. And yes, I considered CLBlast, however it failed with the Intel Iris GPU and I do not want that.

@piotr.sowa Hey, would I be able to run your OpenCL implementation of YOLOv3 on an FPGA? Do I need to make any changes?

Sorry, I do not even know what an FPGA is…

@piotr.sowa Would I be able to run the yolov3 OpenCL port without a GPU?