Object Detection System | billharit.com

Important Note

See the GitHub link for an explanation of the code. This page only showcases the system.

Short Explanation

This project provides an end-to-end object detection system that has been fine-tuned specifically for face detection. The model's architecture is built on VGGNet16, a deep convolutional neural network widely used for image recognition. Using TensorFlow and OpenCV, I built a system that detects faces in real time, which makes it suitable for a range of applications.

Dataset

For this project, I collected a custom dataset of 90 images for training and testing. The images were captured with a smartphone camera placed on a table to simulate real-world conditions. DroidCam and OpenCV handled the image acquisition, with the phone acting as the camera source. I then labeled each image with the labelme Python library, which provides the ground-truth annotations needed to train and evaluate the model.
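
As a rough illustration of how a labelme annotation could be turned into a training target, the sketch below reads the rectangle from a labelme JSON file and scales it to the 0..1 range. The function names, the use of only the first rectangle, and the [x1, y1, x2, y2] ordering are assumptions made for this example, not details taken from the project code.

```python
import json


def load_annotation(path):
    """Read a labelme JSON file from disk and return it as a dict."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def labelme_to_normalized_box(annotation):
    """Return the first labelled rectangle as [x1, y1, x2, y2] scaled to 0..1.

    Hypothetical helper: assumes at most one face rectangle per image.
    Returns None when the image has no rectangle (no face labelled).
    """
    width = annotation["imageWidth"]
    height = annotation["imageHeight"]
    rectangles = [
        shape for shape in annotation["shapes"]
        if shape.get("shape_type") == "rectangle"
    ]
    if not rectangles:
        return None
    # labelme stores a rectangle as two opposite corners in drag order,
    # so sort each axis to get top-left and bottom-right.
    (xa, ya), (xb, yb) = rectangles[0]["points"]
    x1, x2 = sorted((xa, xb))
    y1, y2 = sorted((ya, yb))
    return [x1 / width, y1 / height, x2 / width, y2 / height]
```

Normalizing the coordinates keeps the targets independent of image resolution, which fits a regression head whose outputs are limited to the 0..1 range.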

The dataset contains a variety of lighting conditions, backgrounds, and facial expressions to ensure the model's robustness in real-world scenarios. It serves as a crucial component in the development and evaluation of the face detection model.

Augmented Dataset

The original images, which are photos of myself, were each augmented 60 times using Albumentations, resulting in a significantly larger dataset for training. The augmented dataset consists of 3180 images, each with a resolution of 450x450 pixels. Augmentation plays a vital role in improving the model's performance, because it exposes the model to a wider range of variations than the original images contain.
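
The important detail in augmenting detection data is that the bounding box has to be transformed together with the image. Albumentations handles this internally. The NumPy-only sketch below is not the Albumentations API and not the project's pipeline; it only illustrates the idea with a horizontal flip and a brightness change. The function names, the transform choices, the brightness range, and the normalized [x1, y1, x2, y2] box format are all assumptions.

```python
import numpy as np


def flip_horizontal(image, box):
    """Mirror an H x W x C image left-right and update its normalized box."""
    x1, y1, x2, y2 = box
    flipped = image[:, ::-1].copy()
    # After mirroring, the old right edge becomes the new left edge.
    return flipped, [1.0 - x2, y1, 1.0 - x1, y2]


def augment(image, box, copies=60, seed=0):
    """Produce `copies` randomly altered (image, box) pairs from one original."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(copies):
        img, new_box = image, list(box)
        if rng.random() < 0.5:
            img, new_box = flip_horizontal(img, new_box)
        # A brightness change alters pixels only, so the box stays the same.
        factor = rng.uniform(0.8, 1.2)
        img = np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)
        samples.append((img, new_box))
    return samples
```

Geometric transforms such as flips and crops change the box, while photometric transforms such as brightness leave it untouched. A library like Albumentations tracks that distinction for every transform in the pipeline.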

With this extensive and diverse dataset, the model is well-equipped to provide accurate and reliable face detection in various scenarios and conditions.

Model

The model used in this project is based on the VGGNet16 architecture. It is designed to perform two distinct tasks: classification and regression. The classification task is focused on identifying faces in images, while the regression task is used to predict bounding boxes around detected faces.

Classification Task

The classification branch of the model determines whether a face is present in the input image. It uses a sigmoid activation function to produce a single output value between 0 and 1. This value serves as a confidence score, with higher values indicating a higher likelihood that a face is present.

Regression Task

The regression branch of the model predicts the bounding box around the detected face. It also uses a sigmoid activation function, generating four output values. Because a sigmoid outputs values between 0 and 1, these four values define the position and size of the bounding box relative to the image dimensions.
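
A two-branch model of this kind could be sketched with the Keras Functional API roughly as follows. The input size, the pooling layer, the width of the dense layers, and the layer names are assumptions chosen for the example; the actual values are in the project code. VGG16 is the name Keras uses for the VGGNet16 backbone.

```python
import tensorflow as tf


def build_face_detector(input_shape=(120, 120, 3), weights=None):
    """Build a VGG16-based model with a classification head and a box head.

    weights=None skips downloading pretrained weights; pass "imagenet"
    to start from the pretrained backbone instead.
    """
    inputs = tf.keras.Input(shape=input_shape)
    backbone = tf.keras.applications.VGG16(
        include_top=False, weights=weights, input_shape=input_shape
    )
    features = backbone(inputs)

    # Classification head: one sigmoid value, the face confidence score.
    c = tf.keras.layers.GlobalMaxPooling2D()(features)
    c = tf.keras.layers.Dense(256, activation="relu")(c)
    face_score = tf.keras.layers.Dense(1, activation="sigmoid", name="face_score")(c)

    # Regression head: four sigmoid values describing the bounding box.
    r = tf.keras.layers.GlobalMaxPooling2D()(features)
    r = tf.keras.layers.Dense(256, activation="relu")(r)
    box = tf.keras.layers.Dense(4, activation="sigmoid", name="box")(r)

    return tf.keras.Model(inputs=inputs, outputs=[face_score, box])
```

Both heads share the same backbone features, so the network computes the convolutional layers once per image and only the small dense layers are specific to each task.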

Realtime Detection

To perform real-time object detection, I use OpenCV and TensorFlow together. OpenCV captures a live video feed, and each captured frame is passed to the model for prediction. The model's output is then drawn onto the video frame as a rectangle representing the bounding box around the detected face.

A confidence threshold of 0.5 is applied to filter the predictions. Any detection with a confidence score above this threshold is considered a valid detection and marked with a bounding box.
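
The thresholding and mapping step can be sketched as a small helper. It assumes the model's four box values are normalized [x1, y1, x2, y2] corners, which is an assumption of this example; the function name is hypothetical.

```python
def to_pixel_box(score, box, frame_width, frame_height, threshold=0.5):
    """Convert a normalized box to pixel corners, or return None.

    Returns None when the confidence score is not above the threshold,
    meaning nothing should be drawn for this frame.
    """
    if score <= threshold:
        return None
    x1, y1, x2, y2 = box
    top_left = (int(x1 * frame_width), int(y1 * frame_height))
    bottom_right = (int(x2 * frame_width), int(y2 * frame_height))
    return top_left, bottom_right


# Example with a 640x480 frame and a confident detection.
corners = to_pixel_box(0.9, [0.25, 0.25, 0.75, 0.75], 640, 480)
```

With OpenCV, the two returned points can be passed to cv2.rectangle(frame, top_left, bottom_right, (0, 255, 0), 2) to draw the box on the frame.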

This real-time detection approach enables the model to provide dynamic and instantaneous face detection, making it well-suited for various real-world applications, including video surveillance and interactive systems.

Technical Difficulties and Lessons Learned

Throughout the course of this project, I encountered various technical challenges and gained valuable insights into the tools and technologies I employed. Here are some key takeaways from this journey:

OpenCV Integration: The project involved integrating OpenCV with TensorFlow for real-time object detection. Combining the two gave me a deeper understanding of how to capture live video feeds and process them with machine learning models, opening up exciting possibilities for applications.
Localization Loss and tf.GradientTape(): This project introduced me to concepts like localization loss and the use of tf.GradientTape(). These aspects of deep learning are crucial for object detection tasks and allowed me to gain a deeper appreciation for the intricacies of model training.
Keras Functional API: Having previously used only the Sequential API, this project was my first time building a model with the Keras Functional API. It provided greater flexibility and customization in the model architecture, enabling me to design a system tailored to my specific needs.
GPU Challenges: GPU acceleration can significantly speed up training. However, I ran into problems with TensorFlow versions on Windows. TensorFlow releases after 2.10 no longer support GPU on native Windows, and I had difficulty setting up Windows Subsystem for Linux (WSL), so I used a virtual environment with TensorFlow 2.9 from my earlier thesis repositories, which allowed me to keep making progress.
Webcam Woes: Unfortunately, during the course of this project, my webcam decided to take an unscheduled break. To keep the project moving forward, I leveraged DroidCam as a viable alternative for capturing live video feeds. This experience highlighted the importance of adaptability in the face of unexpected hardware challenges.
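
To make the localization loss and tf.GradientTape() item above more concrete, here is a minimal sketch of a custom training step. The loss formulation (squared error on the top-left corner plus squared error on the box width and height), the 0.5 weight on the classification loss, and the function names are assumptions for illustration; the page does not specify the project's exact loss.

```python
import tensorflow as tf


def localization_loss(y_true, y_pred):
    """Squared corner error plus squared size error for [x1, y1, x2, y2] boxes."""
    y_true = tf.cast(y_true, tf.float32)
    y_pred = tf.cast(y_pred, tf.float32)
    corner_error = tf.reduce_sum(tf.square(y_true[:, :2] - y_pred[:, :2]))
    # Width and height are the differences between the two corners.
    true_size = y_true[:, 2:] - y_true[:, :2]
    pred_size = y_pred[:, 2:] - y_pred[:, :2]
    size_error = tf.reduce_sum(tf.square(true_size - pred_size))
    return corner_error + size_error


def train_step(model, optimizer, images, y_class, y_box, class_weight=0.5):
    """Run one gradient update on a model that returns (class score, box)."""
    with tf.GradientTape() as tape:
        # Operations inside the tape are recorded for differentiation.
        pred_class, pred_box = model(images, training=True)
        class_loss = tf.reduce_mean(
            tf.keras.losses.binary_crossentropy(
                tf.cast(y_class, tf.float32), pred_class
            )
        )
        box_loss = localization_loss(y_box, pred_box)
        total_loss = box_loss + class_weight * class_loss
    gradients = tape.gradient(total_loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return total_loss
```

Writing the step by hand like this makes it possible to combine two different losses into one training signal, something the default single-loss training setup does not express as directly.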

In summary, these technical challenges and lessons learned have enriched my understanding of machine learning, deep learning, and their practical applications. They also underscore the importance of flexibility and problem-solving when working on real-world projects.