Ball position tracking in the cloud with the PGA TOUR

The PGA TOUR continues to enhance the golf experience with real-time data that brings fans closer to the game. To deliver even richer experiences, the TOUR is pursuing the development of a next-generation ball position tracking system that automatically tracks the position of the ball on the green.
The TOUR currently uses ShotLink powered by CDW, a premier scoring system that uses a complex camera system with on-site compute, to closely track the start and end position of every shot. The TOUR wanted to explore computer vision and machine learning (ML) techniques to develop a next-generation cloud-based pipeline to locate golf balls on the putting green.
The Amazon Generative AI Innovation Center (GAIIC) demonstrated the effectiveness of these techniques on an example dataset from a recent PGA TOUR event. The GAIIC designed a modular pipeline cascading a series of deep convolutional neural networks that successfully localizes players within a camera’s field of view, determines which player is putting, and tracks the ball as it moves toward the cup.
In this post, we describe the development of this pipeline, the raw data, the design of the convolutional neural networks comprising the pipeline, and an evaluation of its performance.
Data
The TOUR provided 3 days of continuous video from a recent tournament from three 4K cameras positioned around the green on one hole. The following figure shows a frame from one camera cropped and zoomed so that the player putting is easily visible. Note that despite the high resolution of the cameras, because of the distance from the green, the ball appears small (usually 3×3, 4×4 or 5×5 pixels), and targets of this size can be difficult to localize accurately.

In addition to the camera feeds, the TOUR provided the GAIIC with annotated scoring data on each shot, including the world location of its resting position and the timestamp. This allowed for visualizations of every putt on the green, as well as the ability to pull all of the video clips of players putting, which could be manually labeled and used to train the detection models that make up the pipeline. The following figure shows the three camera views with approximate putt path overlays, counterclockwise from top left. The pin is moved each day, where day 1 corresponds to blue, day 2 to red, and day 3 to orange.
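The post doesn’t spell out how world coordinates and camera pixels are related for these overlays; one common approach, sketched below, is to estimate a planar homography between each camera’s pixels and coordinates on the green surface from a handful of surveyed reference points. The reference points, coordinates, and helper names here are hypothetical, not values from the actual system.

```python
import numpy as np
import cv2

# Hypothetical reference points: pixel locations of surveyed markers on the green
# in one camera view, and their corresponding local world coordinates (in meters).
# At least four non-collinear pairs are needed.
pixel_pts = np.array([[412, 980], [3110, 1004], [2890, 1870], [618, 1902]], dtype=np.float32)
world_pts = np.array([[0.0, 0.0], [22.5, 0.0], [21.0, 14.2], [1.5, 14.6]], dtype=np.float32)

# Estimate the planar homography that maps pixels on the green to world coordinates.
H, _ = cv2.findHomography(pixel_pts, world_pts, method=cv2.RANSAC)

def pixel_to_world(uv):
    """Map a (u, v) pixel on the green surface to world coordinates."""
    pt = np.array([[uv]], dtype=np.float32)  # shape (1, 1, 2), as OpenCV expects
    return cv2.perspectiveTransform(pt, H)[0, 0]

# Example: project a detected ball-center pixel into world coordinates.
print(pixel_to_world((1500.0, 1400.0)))
```

Inverting H (or calling cv2.findHomography with the arguments swapped) gives the world-to-pixel direction used to draw putt paths on each camera view.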

Pipeline overview
The overall system consists of both a training pipeline and an inference pipeline. The following diagram illustrates the architecture of the training pipeline. The starting point is ingestion of video data, either from a streaming module like Amazon Kinesis for live video or placement directly into Amazon Simple Storage Service (Amazon S3) for historical video. The training pipeline requires video preprocessing and hand labeling of images with Amazon SageMaker Ground Truth. Models can be trained with Amazon SageMaker and their artifacts stored in Amazon S3.

The inference pipeline, shown in the following diagram, consists of a number of modules that successively extract information from the raw video and ultimately predict the world coordinates of the ball at rest. Initially, the green is cropped from the larger field of view from each camera, in order to cut down on the pixel area in which the models must search for players and balls. Next, a deep convolutional neural network (CNN) is used to find the locations of people in the field of view. Another CNN is used to predict which type of person has been found in order to determine whether anyone is about to putt. After a likely putter has been localized in the field of view, the same network is used to predict the location of the ball near the putter. A third CNN tracks the ball during its motion, and lastly, a transformation function from camera pixel position to GPS coordinates is applied.
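As a structural illustration only, the sketch below shows how these modules could be chained per clip: crop the green, find people, classify them, detect the ball near the putter, track it, and convert the final position to world coordinates. None of the class, function, or attribute names come from the actual system; each stage is injected as a callable.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

@dataclass
class PuttPipeline:
    """Structural sketch of the inference pipeline; stage names are placeholders."""
    crop_green: Callable        # frame -> cropped green region
    detect_people: Callable     # frame -> list of person boxes
    classify_people: Callable   # (frame, boxes) -> list of role strings
    detect_ball: Callable       # (frame, region) -> ball box or None
    track_ball: Callable        # (frames, init_box) -> list of ball boxes
    pixel_to_world: Callable    # final box -> world coordinates

    def run(self, frames: Sequence) -> Optional[Tuple[float, float]]:
        green = [self.crop_green(f) for f in frames]
        for i, frame in enumerate(green):
            boxes = self.detect_people(frame)
            roles = self.classify_people(frame, boxes)
            putter = next((b for b, r in zip(boxes, roles) if r == "player-putting"), None)
            if putter is None:
                continue
            ball = self.detect_ball(frame, region=putter)
            if ball is not None:
                track = self.track_ball(green[i:], init_box=ball)
                return self.pixel_to_world(track[-1]) if track else None
        return None
```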

Player detection
Although it would be possible to run a CNN for ball detection over an entire 4K frame at a set interval, given the angular size of the ball at these camera distances, any small white object triggers a detection, resulting in many false alarms. To avoid searching the entire image frame for the ball, it’s possible to take advantage of correlations between player pose and ball location. A ball that is about to be putted must be next to a player, so finding the players in the field of view will greatly restrict the pixel area in which the detector must search for the ball.
We were able to use a CNN pre-trained to predict bounding boxes around all the people in a scene, as shown in the following figure. Unfortunately, there is frequently more than one ball on the green, so further logic is required beyond simply finding all people and searching for a ball: another CNN is needed to find the player who is currently putting.
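The post doesn’t name the specific pre-trained person detector; as an illustrative stand-in, a COCO-pretrained detector from torchvision can produce the same kind of person bounding boxes. The model choice and confidence threshold below are assumptions, not the pipeline’s actual configuration.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Illustrative stand-in for the person-detection stage: a COCO-pretrained
# Faster R-CNN from torchvision (not necessarily the model used in the pipeline).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

PERSON_CLASS_ID = 1  # "person" in the COCO label map

@torch.no_grad()
def detect_people(frame_rgb, score_threshold=0.6):
    """Return [x1, y1, x2, y2] boxes for people in an RGB frame (H x W x 3 uint8)."""
    pred = model([to_tensor(frame_rgb)])[0]
    keep = (pred["labels"] == PERSON_CLASS_ID) & (pred["scores"] >= score_threshold)
    return pred["boxes"][keep].cpu().numpy()
```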

Player classification and ball detection
To further narrow down where the ball could be, we fine-tuned a pre-trained object-detection CNN (YOLO v7) to classify all the people on the green. An important component of this process was manually labeling a set of images using SageMaker Ground Truth. The labels allowed the CNN to classify the player putting with high accuracy. In the labeling process, the ball was also outlined along with the player putting, so this CNN was able to perform ball detection as well, drawing an initial bounding box around the ball before a putt and feeding the position information into the downstream ball tracking CNN.
We use four different labels to annotate the objects in the images:

player-putting – The player holding a club and in the putting position
player-not-putting – The player not in the putting position (may also be holding a club)
other-person – Any other person who is not a player
golf-ball – The golf ball

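For illustration, the sketch below converts a SageMaker Ground Truth bounding-box output manifest into the normalized label files commonly used to fine-tune YOLO-family detectors. The job attribute name and file layout are assumptions about a standard bounding-box labeling job; the actual fine-tuning setup may differ.

```python
import json
from pathlib import Path

# class_id values in the manifest follow the label order configured in the labeling job.
CLASS_NAMES = ["player-putting", "player-not-putting", "other-person", "golf-ball"]
LABEL_ATTR = "putt-labels"  # hypothetical Ground Truth job attribute name

def manifest_to_yolo(manifest_path, out_dir):
    """Convert a Ground Truth bounding-box output manifest into YOLO-format label
    files (one .txt per image, one normalized "class cx cy w h" row per object).
    Assumes the standard bounding-box output layout for the attribute above."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for line in Path(manifest_path).read_text().splitlines():
        record = json.loads(line)
        image_name = Path(record["source-ref"]).stem
        meta = record[LABEL_ATTR]
        img_w = meta["image_size"][0]["width"]
        img_h = meta["image_size"][0]["height"]
        rows = []
        for ann in meta["annotations"]:
            cx = (ann["left"] + ann["width"] / 2) / img_w
            cy = (ann["top"] + ann["height"] / 2) / img_h
            rows.append(f'{ann["class_id"]} {cx:.6f} {cy:.6f} '
                        f'{ann["width"] / img_w:.6f} {ann["height"] / img_h:.6f}')
        (out / f"{image_name}.txt").write_text("\n".join(rows))
```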
The following figure shows the output of the CNN fine-tuned with SageMaker Ground Truth labels to classify each person in the field of view. This is difficult because of the wide range of visual appearances of players, caddies, and fans. After a player was classified as putting, a CNN fine-tuned for ball detection was applied to the small area immediately around that player.

Ball path tracking
A third CNN, a ResNet architecture pre-trained for motion tracking, was used for tracking the ball after it was putted. Motion tracking is a thoroughly researched problem, so this network performed well when integrated into the pipeline without further fine-tuning.
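As a rough sketch of this stage (the Performance section notes the tracker came from MMTracking, but the exact configuration isn’t specified), MMTracking’s single-object-tracking API can be initialized on the ball’s starting bounding box and run frame by frame. The config and checkpoint paths below are placeholders, not the model actually used.

```python
import cv2
from mmtrack.apis import init_model, inference_sot

# Placeholder config/checkpoint for a single-object tracker from the MMTracking
# model zoo; swap in the desired architecture and weights.
tracker = init_model("configs/sot/example_sot_config.py",
                     "checkpoints/example_sot.pth", device="cuda:0")

def track_ball(video_path, init_bbox):
    """Track the ball starting from init_bbox = [x1, y1, x2, y2] in the first frame."""
    cap = cv2.VideoCapture(video_path)
    results, frame_id = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        out = inference_sot(tracker, frame, init_bbox, frame_id=frame_id)
        # Expected as [x1, y1, x2, y2, score] per frame in MMTracking 0.x.
        results.append(out["track_bboxes"])
        frame_id += 1
    cap.release()
    return results
```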
Pipeline output
The cascade of CNNs places bounding boxes around people, classifies people on the green, detects the initial ball position, and tracks the ball once it begins moving. The following figure shows the labeled video output of the pipeline. The pixel positions of the ball as it moves are tracked and recorded. Note that people on the green are tracked and outlined by bounding boxes; the putter at the bottom is labeled correctly as player-putting, and the moving ball is tracked and outlined by a small blue bounding box.

Performance
To assess performance of components of the pipeline, it’s necessary to have labeled data. Although we were provided with the ground truth world position of the ball, we didn’t have intermediate points for ground truth, like the final pixel position of the ball or the pixel location of the player putting. With the labeling job that we carried out, we developed ground truth data for these intermediate outputs of the pipeline that allow us to measure performance.
Player classification and ball detection accuracy
For detection of the player putting and the initial ball location, we labeled a dataset and fine-tuned a YOLO v7 CNN model as described earlier. The model classified the output from the previous person detection module into four classes: a player putting, a player not putting, other people, and the golf ball, as shown in the following figure.

The performance of this module is assessed with a confusion matrix, shown in the following figure. The values in the diagonal boxes show how often the predicted class matched the actual class from the ground truth labels. The model has 89% recall or better for each person class, and 79% recall for golf balls (which is to be expected because the model is pre-trained on examples with people but not on examples with golf balls; this could be improved with more labeled golf balls in the training set).
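For reference, per-class recall values like these can be read directly off a confusion matrix (each diagonal count divided by its row total). A minimal sketch, with the class labels from this post and placeholder prediction lists:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

CLASSES = ["player-putting", "player-not-putting", "other-person", "golf-ball"]

def per_class_recall(y_true, y_pred):
    """Per-class recall = confusion-matrix diagonal divided by row sums."""
    cm = confusion_matrix(y_true, y_pred, labels=CLASSES)
    return dict(zip(CLASSES, np.diag(cm) / cm.sum(axis=1)))
```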

The next step is to trigger the ball tracker. Because the ball detection output is a confidence probability, it’s also possible to set the threshold for “detected ball” and observe how that changes the results, summarized in the following figure. There is a trade-off in this method because a higher threshold will necessarily have fewer false alarms but also miss some of the less certain examples of balls. We tested thresholds of 20% and 50% confidence, and found ball detection rates of 78% and 61%, respectively. By this measure, the 20% threshold is better. The trade-off shows up in the false positives: for the 20% confidence threshold, 80% of total detections were actually balls (20% false positives), whereas for the 50% confidence threshold, 90% were balls (10% false positives). For fewer false positives, the 50% confidence threshold is better. Both of these measures could be improved with more labeled data for a larger training set.
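A minimal sketch of how such a threshold sweep can be computed from detector confidences and ground-truth matches on the labeled evaluation set; the function and variable names are illustrative only.

```python
import numpy as np

def threshold_tradeoff(scores, is_ball, thresholds=(0.2, 0.5)):
    """scores: confidence of each candidate detection; is_ball: whether that
    candidate matches a labeled ball. For each threshold, report the fraction of
    ball-matching candidates kept (detection rate) and the fraction of kept
    detections that are balls (precision)."""
    scores, is_ball = np.asarray(scores), np.asarray(is_ball, dtype=bool)
    out = {}
    for t in thresholds:
        detected = scores >= t
        hits = (detected & is_ball).sum()
        out[t] = {"ball_detection_rate": hits / max(is_ball.sum(), 1),
                  "precision": hits / max(detected.sum(), 1)}
    return out
```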

The detection pipeline throughput is on the order of 10 frames per second, so in its current form, a single instance is not fast enough to run continuously on the 50 frames per second input. Achieving the 7-second mark for output after the ball stops would require further optimization for latency, perhaps by running multiple versions of the pipeline in parallel and compressing the CNN models via quantization (for example).
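One possible route, sketched below with placeholder names, is to export a detector to ONNX and quantize its weights to int8 with ONNX Runtime. Whether this, parallel pipeline instances, or an accelerator-specific compiler yields the needed speedup (and how much accuracy is affected) would have to be measured.

```python
import torch
from onnxruntime.quantization import quantize_dynamic, QuantType

def export_and_quantize(model, example_input, onnx_path="detector.onnx"):
    """Export a PyTorch detector to ONNX, then quantize its weights to int8.
    A sketch only: real latency gains depend on the operators present and the
    target hardware, and accuracy should be re-checked after quantization."""
    model.eval()
    torch.onnx.export(model, example_input, onnx_path, opset_version=13)
    quantize_dynamic(onnx_path, onnx_path.replace(".onnx", "_int8.onnx"),
                     weight_type=QuantType.QInt8)
```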
Ball path tracking accuracy
The pre-trained CNN model from MMTracking works well, but there are interesting failure cases. The following figure shows a case where the tracker starts on the ball, expands its bounding box to include both the putter head and ball, and then unfortunately tracks the putter head and forgets the ball. In this case, the putter head appears white (possibly due to specular reflection), so the confusion is understandable; labeled data for tracking and fine-tuning of the tracking CNN could help improve this in the future.

Conclusion
In this post, we discussed the development of a modular pipeline that localizes players within a camera’s field of view, determines which player is putting, and tracks the ball as it moves toward the cup.
For more information about AWS collaboration with the PGA TOUR, refer to PGA TOUR tees up with AWS to reimagine the fan experience.

About the Authors
James Golden is an applied scientist at Amazon Bedrock with a background in machine learning and neuroscience.
Henry Wang is an applied scientist at the Amazon Generative AI Innovation Center, where he researches and builds generative AI solutions for AWS customers. He focuses on the sports and media & entertainment industries, and has worked with various sports leagues, teams, and broadcasters. In his spare time, he likes to play tennis and golf.
Tryambak Gangopadhyay is an Applied Scientist at the AWS Generative AI Innovation Center, where he collaborates with organizations across a diverse spectrum of industries. His role involves conducting research and developing Generative AI solutions to address crucial business challenges and accelerate AI adoption.
