Building an Object Detection iOS App with YOLO Core ML Model

30 Nov 2020 · CPOL · 3 min read
In this final article of the series, we'll extend the application to use our YOLO v2 model for object detection. We'll combine the Core ML version of the YOLO v2 model with the video capture capabilities of our iOS app to add object detection to that app.

This series assumes that you are familiar with Python, Conda, and ONNX, and that you have some experience developing iOS applications in Xcode. You are welcome to download the source code for this project. We'll run the code using macOS 10.15+, Xcode 11.7+, and iOS 13+.

Adding Model to Our Application

The first thing we need to do is copy yolov2-pipeline.mlmodel (which we previously saved to the Models folder of our ObjectDetectionDemo iOS application) and add it to the project files:

[Image: the yolov2-pipeline.mlmodel file added to the Xcode project]
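
Once the .mlmodel file is part of the target, Xcode generates a Swift wrapper class named after the model file (yolov2_pipeline in our case). As a quick sanity check, assuming the generated class name matches yours, you can try loading the model once, for example during app startup:

Swift
import CoreML
import Vision

// Minimal sketch: confirm that the wrapper class Xcode generates for
// yolov2-pipeline.mlmodel (named yolov2_pipeline) loads and can be wrapped
// into a Vision model. Adjust the class name if your model file differs.
func verifyModelLoads() {
    let mlModel = yolov2_pipeline().model
    do {
        _ = try VNCoreMLModel(for: mlModel)
        print("Model inputs: \(mlModel.modelDescription.inputDescriptionsByName.keys)")
    } catch {
        print("Could not create a VNCoreMLModel: \(error)")
    }
}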

Running Object Detection on Captured Video Frames

To use our model, we need to introduce a few changes to our code from the previous article.

The startCapture method of the VideoCapture class needs to accept and store a Vision framework request parameter, VNRequest:

Swift
public func startCapture(_ visionRequest: VNRequest?) {
    if visionRequest != nil {
        self.visionRequests = [visionRequest!]
    } else {
        self.visionRequests = []
    }
        
    if !captureSession.isRunning {
        captureSession.startRunning()
    }
}

Now let’s add the ObjectDetection class with the following createObjectDetectionVisionRequest method:

Swift
public func createObjectDetectionVisionRequest() -> VNRequest? {
    do {
        let model = yolov2_pipeline().model
        let visionModel = try VNCoreMLModel(for: model)
        let objectRecognition = VNCoreMLRequest(model: visionModel, completionHandler: { (request, error) in
            DispatchQueue.main.async(execute: {
                if let results = request.results {
                    self.processVisionRequestResults(results)
                }
            })
        })

        objectRecognition.imageCropAndScaleOption = .scaleFill
        return objectRecognition
    } catch let error as NSError {
        print("Model loading error: \(error)")
        return nil
    }
}

Note that we use the .scaleFill value for imageCropAndScaleOption. This introduces a small distortion to the image when scaling from the captured 480 x 640 frame to the 416 x 416 size the model requires. It won't have a significant impact on the results, and it will make further scaling operations simpler.
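
For reference, Vision offers three main crop-and-scale strategies. This illustrative snippet (not part of the project code) summarizes the trade-offs for our frame and model input sizes:

Swift
import Vision

// Illustration of the Vision crop-and-scale strategies for a 480 x 640 frame
// feeding a 416 x 416 model input (not part of the project code).
func configureScaling(for request: VNCoreMLRequest) {
    // .centerCrop - crops a centered square; no distortion, but the frame edges are lost.
    // .scaleFit   - fits the whole frame inside the model input, preserving the aspect
    //               ratio, so part of the 416 x 416 input stays unused.
    // .scaleFill  - stretches the whole frame to 416 x 416; slight distortion, nothing lost.
    request.imageCropAndScaleOption = .scaleFill
}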

The introduced code needs to be used in the main ViewController class:

Swift
self.videoCapture = VideoCapture(self.cameraView.layer)
self.objectDetection = ObjectDetection(self.cameraView.layer, videoFrameSize: self.videoCapture.getCaptureFrameSize())
        
let visionRequest = self.objectDetection.createObjectDetectionVisionRequest()
self.videoCapture.startCapture(visionRequest)

With such a framework in place, we can execute the logic defined in visionRequest each time a video frame is captured:

Swift
public func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else {
        return
    }
        
    // Process the frame as delivered, with no rotation hint (see the note on orientation below).
    let frameOrientation: CGImagePropertyOrientation = .up
    let imageRequestHandler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: frameOrientation, options: [:])
    do {
        try imageRequestHandler.perform(self.visionRequests)
    } catch {
        print(error)
    }
}

With the above changes, the yolov2_pipeline model is used on each captured frame, and the detection results are passed to the ObjectDetection.processVisionRequestResults method. Thanks to the model_decoder and model_nms components of the pipeline model we implemented previously, no decoding logic is required on the iOS side. We simply take each detected object (objectObservation), read its most likely label, and draw the corresponding box on the captured frame (calling the createBoundingBoxLayer and addSublayer methods):

Swift
private func processVisionRequestResults(_ results: [Any]) {
    CATransaction.begin()
    CATransaction.setValue(kCFBooleanTrue, forKey: kCATransactionDisableActions)
        
    self.objectDetectionLayer.sublayers = nil
    for observation in results where observation is VNRecognizedObjectObservation {
        guard let objectObservation = observation as? VNRecognizedObjectObservation else {
            continue
        }

        let topLabelObservation = objectObservation.labels[0]
        let objectBounds = VNImageRectForNormalizedRect(
            objectObservation.boundingBox,
            Int(self.objectDetectionLayer.bounds.width), Int(self.objectDetectionLayer.bounds.height))
            
        let bbLayer = self.createBoundingBoxLayer(objectBounds, identifier: topLabelObservation.identifier, confidence: topLabelObservation.confidence)
        self.objectDetectionLayer.addSublayer(bbLayer)
    }
    CATransaction.commit()
}
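
The pipeline's non-maximum suppression step already removes most overlapping duplicates. If the overlay still feels noisy, one optional tweak (not part of the original project) is to drop low-confidence observations before drawing them:

Swift
import Vision

// Hypothetical helper: filter raw Vision results down to confident object observations.
// The 0.5 threshold is an assumption; tune it for your use case.
func confidentObservations(from results: [Any],
                           threshold: VNConfidence = 0.5) -> [VNRecognizedObjectObservation] {
    return results
        .compactMap { $0 as? VNRecognizedObjectObservation }
        .filter { ($0.labels.first?.confidence ?? 0) >= threshold }
}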

Drawing Bounding Boxes

Drawing boxes is relatively simple and not specific to the machine learning part of our application. The main difficulty is getting the scale and coordinate systems right: the Vision framework reports bounding boxes with the origin (0, 0) in the lower-left corner, while the view layers we draw on place it in the upper-left corner, so the detection layer has to be scaled and flipped accordingly.
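
To make that conversion concrete, here is a hypothetical helper (not used in the project, which instead flips the entire detection layer with an affine transform, as shown below) that maps a normalized Vision rectangle to a rectangle in a layer with an upper-left origin:

Swift
import Vision
import CoreGraphics

// Hypothetical helper: convert a normalized Vision bounding box (origin in the
// lower-left corner) into the coordinates of a layer whose origin is in the
// upper-left corner.
func layerRect(for normalizedBox: CGRect, in layerBounds: CGRect) -> CGRect {
    // Scale the normalized rectangle up to the layer's size in points.
    let imageRect = VNImageRectForNormalizedRect(normalizedBox,
                                                 Int(layerBounds.width),
                                                 Int(layerBounds.height))
    // Flip the y axis to move the origin from the lower-left to the upper-left corner.
    return CGRect(x: imageRect.origin.x,
                  y: layerBounds.height - imageRect.origin.y - imageRect.height,
                  width: imageRect.width,
                  height: imageRect.height)
}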

Two methods of the ObjectDetection class will handle this: setupObjectDetectionLayer and createBoundingBoxLayer. The former prepares the layer for the boxes:

Swift
private func setupObjectDetectionLayer(_ viewLayer: CALayer, _ videoFrameSize: CGSize) {
    self.objectDetectionLayer = CALayer()
    self.objectDetectionLayer.name = "ObjectDetectionLayer"
    self.objectDetectionLayer.bounds = CGRect(x: 0.0,
                                     y: 0.0,
                                     width: videoFrameSize.width,
                                     height: videoFrameSize.height)
    self.objectDetectionLayer.position = CGPoint(x: viewLayer.bounds.midX, y: viewLayer.bounds.midY)
        
    viewLayer.addSublayer(self.objectDetectionLayer)

    let bounds = viewLayer.bounds
       
    let scale = fmax(bounds.size.width  / videoFrameSize.width, bounds.size.height / videoFrameSize.height)

    CATransaction.begin()
    CATransaction.setValue(kCFBooleanTrue, forKey: kCATransactionDisableActions)
    
    self.objectDetectionLayer.setAffineTransform(CGAffineTransform(scaleX: scale, y: -scale))
    self.objectDetectionLayer.position = CGPoint(x: bounds.midX, y: bounds.midY)        
    CATransaction.commit()
}

The createBoundingBoxLayer method creates the shapes to draw:

Swift
private func createBoundingBoxLayer(_ bounds: CGRect, identifier: String, confidence: VNConfidence) -> CALayer {
    let path = UIBezierPath(rect: bounds)
        
    let boxLayer = CAShapeLayer()
    boxLayer.path = path.cgPath
    boxLayer.strokeColor = UIColor.red.cgColor
    boxLayer.lineWidth = 2
    boxLayer.fillColor = CGColor(colorSpace: CGColorSpaceCreateDeviceRGB(), components: [0.0, 0.0, 0.0, 0.0])
        
    boxLayer.bounds = bounds
    boxLayer.position = CGPoint(x: bounds.midX, y: bounds.midY)
    boxLayer.name = "Detected Object Box"
    boxLayer.backgroundColor = CGColor(colorSpace: CGColorSpaceCreateDeviceRGB(), components: [0.5, 0.5, 0.2, 0.3])
    boxLayer.cornerRadius = 6

    let textLayer = CATextLayer()
    textLayer.name = "Detected Object Label"
        
    textLayer.string = String(format: "\(identifier)\n(%.2f)", confidence)
    textLayer.fontSize = CGFloat(16.0)
        
    textLayer.bounds = CGRect(x: 0, y: 0, width: bounds.size.width - 10, height: bounds.size.height - 10)
    textLayer.position = CGPoint(x: bounds.midX, y: bounds.midY)
    textLayer.alignmentMode = .center
    textLayer.foregroundColor =  UIColor.red.cgColor
    textLayer.contentsScale = 2.0
        
    textLayer.setAffineTransform(CGAffineTransform(scaleX: 1.0, y: -1.0))
        
    boxLayer.addSublayer(textLayer)       
    return boxLayer
}

Application in Action

Congratulations – we have a working object detection app, which we can test in real life or, in our case, using a free clip from the Pexels portal.

[Image: object detection results overlaid on a frame from the test video clip]

Note that the YOLO v2 model is quite sensitive to image orientation, at least for some object classes (the Person class, for example). Its detection results deteriorate if you rotate the frame before processing.

This can be illustrated using any sample image from the Open Images dataset.

[Image: detection results on a sample Open Images photo in one orientation]

[Image: detection results on the same photo in a different orientation]

Exactly the same image – and two very different results. We need to keep this in mind to ensure proper orientation of the images fed to the model.
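
The captureOutput method shown earlier hard-codes frameOrientation to .up, so each frame is analyzed exactly as it is delivered. If your app lets the device rotate, you can pass a computed orientation hint instead. The helper below is only a sketch: the concrete mapping depends on your camera position and AVCaptureConnection configuration, so treat the values as assumptions to verify.

Swift
import UIKit
import ImageIO

// Hypothetical helper: derive an orientation hint for back-camera frames from the
// current device orientation. The concrete values depend on your camera position
// and AVCaptureConnection settings, so treat them as a starting point to verify.
func frameOrientationForCurrentDevice() -> CGImagePropertyOrientation {
    switch UIDevice.current.orientation {
    case .portrait:           return .right
    case .portraitUpsideDown: return .left
    case .landscapeLeft:      return .up
    case .landscapeRight:     return .down
    default:                  return .up
    }
}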

Next Steps?

And this is it! It was a long trip, but we have finally reached its end. We have a working iOS application for object detection in a live video stream.

If you’re trying to decide what to do next, you’re limited only by your imagination. Consider these questions:

  • How could you use what you’ve learned to build a hazard detector for your car? If your iPhone were mounted on the dashboard of your car, facing forward, could you make an app that detects hazards and warns the driver?
  • Could you build a bird detector that would alert a bird enthusiast when their favorite bird has landed at their bird feeder?
  • If you have deer and other pests that roam into your garden and eat all of the vegetables you are growing, could you use your newly-acquired iOS object detection skills to build an iOS-powered electronic scarecrow to frighten the pests away?

The possibilities are endless. Why not try one of the ideas above - or come up with your own - and then write about it? The CodeProject community would love to see what you come up with.

This article is part of the series 'Mobile Neural Networks on iOS with Core ML - Part 2'.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Architect, Poland
Jarek has two decades of professional experience in software architecture and development, machine learning, business and system analysis, logistics, and business process optimization.
He is passionate about creating software solutions with complex logic, especially with the application of AI.

Comments and Discussions

Question: Unable to download the sample code
ankit sachan 2022, 3-May-22 4:51

General: My vote of 5
Member 15484516, 30-Dec-21 8:25

Question: YOLO object detection in real time
Member 15452944, 2-Dec-21 4:13

Answer: Re: YOLO object detection in real time
Jarek Szczegielniak, 2-Dec-21 9:15

To be honest, I'm not sure (and at the moment I don't have a built app to actually check it). In theory it should work, though, even on iPhones older than the iPhone 11 (https://machinethink.net/faster-neural-networks/).

Anyway, even if it doesn't, the question is whether you absolutely need to do it in real time. If not, you could always capture and buffer the video at 60 fps and process each frame at your own pace.

General: Re: YOLO object detection in real time
Member 15452944, 2-Dec-21 10:09
