Deep learning in agriculture is hard because of a lack of annotated data. The obvious solution is to collect and annotate more data! But this is time-consuming and expensive. And, as I know from experience, really, really tedious. So a lot of my work focuses on methods to avoid data annotation entirely.

How does this work? There’s a family of techniques called “Self-Supervised Learning” (SSL) which use some clever methods to generate labels automatically while training the model. These labels are probably not very useful for the task you actually care about, but the idea is that they’re meaningful enough for the model to learn some useful features during a pre-training stage. Then you can go off and use those features for whatever task you want. It’s kind of like pre-training on ImageNet, except that without the need for labels, there’s really nothing stopping you from scaling up your dataset to the entire internet.

Learning with SimCLR

Let’s look at a concrete example with images. If we have an image, we can freely distort it (altering colors, cropping, rotating, etc.) without making it unrecognizable. Thanks to this effect, we can design a “proxy task” for SSL, where we randomly distort an image in multiple ways. Then, we train a model to recognize when the inputs are actually totally different images, as opposed to the same image with some distortion.

It turns out this is enough for the model to learn some useful features.

Diagram of the standard SimCLR process, with cotton field images as examples.
A standard approach to contrastive learning with images. We take normal images (bottom row), and apply random distortions to them. Then, we train the model to recognize when it’s seeing the same image with different distortions (columns) as opposed to when it’s seeing totally different images (rows).
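
To make that proxy task a little more concrete, here's a minimal sketch of the kind of loss SimCLR trains with (usually called NT-Xent), in plain Python. This is an illustration, not the code from our paper: the function names, the toy 2-D embeddings, and the temperature of 0.5 are all assumptions I picked for readability. In a real setup, the embeddings would come from a neural network, and you'd use a proper deep learning framework.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def nt_xent_loss(view_a, view_b, temperature=0.5):
    """Contrastive loss over a batch of paired embeddings.

    view_a[i] and view_b[i] are embeddings of two distorted copies of image i.
    Every other embedding in the batch is treated as a "different image".
    The temperature value is an arbitrary choice for this sketch.
    """
    n = len(view_a)
    z = [l2_normalize(v) for v in view_a + view_b]  # 2n embeddings in total
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same image
        sims = [
            sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(2 * n)
        ]
        # Log-sum-exp over every candidate except the embedding itself.
        others = [s for j, s in enumerate(sims) if j != i]
        peak = max(others)
        log_denominator = peak + math.log(sum(math.exp(s - peak) for s in others))
        total += log_denominator - sims[positive]
    return total / (2 * n)


# Toy example: two "images", each with two slightly different embeddings.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]
matched = nt_xent_loss(view_a, view_b)
mismatched = nt_xent_loss(view_a, list(reversed(view_b)))
print(matched < mismatched)  # correctly paired views give the lower loss
```

The loss is low when the two views of the same image land close together and far from everything else in the batch, which is exactly the "same image or different image?" game described above.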

What About Second Camera?

My research focuses on a particular situation where we have two cameras observing the same scene. So what? Well, two cameras at different locations produce two different views of the same scene. My goal was to try to understand whether these two real views of the same scene are more effective for SSL than the two “fake” views that we created by distorting a single image.

Diagram of the multi-view SimCLR process, with cotton field images as examples.
My approach to multi-camera contrastive learning. In this case, I have two cameras viewing each scene. We train the model to recognize when it’s seeing two different views of the same scene as opposed to two completely different scenes.
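
In code, the main thing that changes is where the positive pairs come from. Here's a rough sketch of that idea, assuming each scene is stored as a mapping from a camera name to an image. The function name, the camera names, and the file names are all hypothetical, and the paper's actual sampling strategy may well differ (for instance, in whether distortions are still applied on top).

```python
import random


def sample_camera_pairs(scenes, rng):
    """Pick two distinct cameras per scene to form one positive pair each.

    `scenes` is a list of dicts mapping a camera name to an image (here just
    a file name). Images from different scenes act as the negatives.
    """
    pairs = []
    for scene in scenes:
        cam_a, cam_b = rng.sample(sorted(scene), 2)  # two different cameras
        pairs.append((scene[cam_a], scene[cam_b]))
    return pairs


# Hypothetical file names, for illustration only.
scenes = [
    {"cam0": "plot1_cam0.png", "cam1": "plot1_cam1.png", "cam2": "plot1_cam2.png"},
    {"cam0": "plot2_cam0.png", "cam1": "plot2_cam1.png", "cam2": "plot2_cam2.png"},
]
for first, second in sample_camera_pairs(scenes, random.Random(0)):
    print(first, second)
```

Each pair then plays the role that two distorted copies of a single image would play in the standard setup.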

As it turns out, it is more effective. In fact, the best versions of our technique were about 14% better at cotton boll detection (measured by mean average precision) than the standard image distortion technique. Of course, there’s a long list of caveats here: we focus explicitly on tasks related to cotton phenotyping. Your mileage may vary.

Oops, All Cameras

“But Daaaniel!” I hear you say, in that annoyingly whiny voice of yours. “Why should I care about two cameras? I only have one camera, and I think that’s a perfectly adequate number of cameras to have.” This may be true for you, but they don’t pay me the (actually not very) big bucks to be merely adequate.

Long-time readers of this blog (who could probably fit around my dining room table) might remember that we happen to have a robot with a crap-ton of cameras. Did you ever wonder what we used all that data we collected for? Well, now you know! And it only took us… oh god, a year and a half? Wow.

MARS robot with 6 cameras mounted.
Ultimate camera power!

I pushed my version of multi-camera SSL to the limit, using all six cameras on the robot. Mostly, what I found was that this was excessive. Pretty much all the improvement from this technique can be obtained with just 3 cameras. That’s good news for those of you with a limited camera budget!

Bar graph showing the model's performance when trained with different numbers of cameras.
Performance of our method increases with up to three views, and then mostly saturates. COCO is a baseline that’s pre-trained with a large, labeled dataset. The vanilla version uses standard single-image distortions. SimCLR and MoCo are two different (but related) contrastive learning techniques.

Please, Cite my Paper

I’m not telling you all this just because I feel like it. A few days ago, I finally published a paper on this technique. If you’re interested in the details, well, they’re all in there.

Overall, this has been a very interesting project, but somewhat limited in scope. We really only looked at a few SSL techniques. We tested them only on cotton. We also tried to systematically evaluate the effect of camera positioning on learned feature quality, but the results were too noisy to draw many conclusions. There’s a lot of future work that could be done.

But for now, I’m just happy to have this sucker published.

Diagram showing the overall pipeline of the proposed SSL approach.
Graphical overview of the proposed SSL method

Petti, Daniel / Li, Changying / Liu, Ninghao
Contrastive multi-view representation learning for multi-camera plant phenotyping: A cotton field study
2026

Plant Phenomics, Vol. 8, No. 2
p. 100193

Abstract: Attempts to deploy computer vision in agricultural tasks often suffer from a shortage of annotated data. One strategy to alleviate the impact of limited data is Self-Supervised Learning (SSL), which involves pre-training a model on a pretext task that utilizes automatically generated annotations. The primary objective of this study is to leverage a multi-camera view dataset of cotton boll images for contrastive learning in order to enable phenotyping tasks with minimal data annotation. This dataset was collected in the field using six camera views. The efficacy of two contrastive learning frameworks (SimCLR and MoCo) in producing representations when positive examples originate from different cameras was investigated, and a comprehensive study of how the camera positions affect performance was conducted. After self-supervised pre-training, linear evaluation and semi-supervised learning experiments were performed on boll detection and plot status downstream tasks. In general, using multiple camera views with SimCLR and MoCo improves cotton boll detection mean average precision by 14% compared to vanilla SimCLR and MoCo. Through careful investigation using synthetic data, it was determined that relative camera poses with an intermediate amount of overlap seem more likely to perform well. Neither MoCo nor SimCLR was consistently superior to the other in this context. The representations embed meaningful features about the cotton plants, such as overall boll density, but also less meaningful ones, such as lighting variations. This technique could potentially accelerate the development of phenotyping algorithms based on data collected from field robots.
