Deep learning in agriculture is hard because of a lack of annotated data. The obvious solution is to collect and annotate more data! But this is time-consuming and expensive. And, as I know from experience, really, really tedious. So a lot of my work focuses on methods to avoid data annotation entirely.
How does this work? There’s a family of techniques called “Self-Supervised Learning” (SSL) that use some clever methods to generate labels automatically while training the model. These labels are probably not very useful for the task you actually care about, but the idea is that they’re meaningful enough for the model to learn some useful features during a pre-training stage. Then you can go off and use those features for whatever task you want. It’s kind of like pre-training on ImageNet, except that, without the need for labels, there’s really nothing stopping you from scaling up your dataset to the entire internet.
Learning with SimCLR
Let’s look at a concrete example with images. If we have an image, we can freely distort it (altering colors, cropping, rotating, etc.) without making it unrecognizable. Thanks to this property, we can design a “proxy task” for SSL, where we randomly distort each image in multiple ways. Then, we train a model to tell when two inputs are distorted copies of the same image, as opposed to totally different images. This is essentially what SimCLR does.
It turns out this is enough for the model to learn some useful features.
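If you want to see what that proxy task looks like as a loss function, here is a minimal sketch of the normalized temperature-scaled cross-entropy (NT-Xent) loss that SimCLR uses. It is an illustration in plain Python, not the code from our experiments: the function names, the list-of-floats embeddings, and the temperature of 0.5 are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nt_xent_loss(views_a, views_b, temperature=0.5):
    """NT-Xent loss over a batch of paired embeddings.

    views_a[i] and views_b[i] are embeddings of two distorted copies of
    image i (a positive pair). Every other embedding in the batch is
    treated as a negative. The temperature value is an assumption.
    """
    embeddings = views_a + views_b
    n = len(views_a)
    total = 0.0
    for i, anchor in enumerate(embeddings):
        # Index of the other view of the same image.
        positive = (i + n) % (2 * n)
        # Similarity to every embedding except the anchor itself.
        logits = [
            cosine_similarity(anchor, other) / temperature
            for j, other in enumerate(embeddings)
            if j != i
        ]
        pos_logit = cosine_similarity(anchor, embeddings[positive]) / temperature
        # -log(exp(pos) / sum(exp(all))) == log(sum(exp(all))) - pos
        total += math.log(sum(math.exp(l) for l in logits)) - pos_logit
    return total / (2 * n)
```

The loss is low when the two views of each image land close together in embedding space and far from everything else in the batch. In a real setup, the embeddings would come from the network being trained, and the loss would be computed with a deep learning framework so that gradients can flow back through it.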

What About a Second Camera?
My research focuses on a particular situation where we have two cameras observing the same scene. So what? Well, two cameras at different locations produce two different views of the same scene. My goal was to try to understand whether these two real views of the same scene are more effective for SSL than the two “fake” views that we created by distorting a single image.

As it turns out, they are more effective. In fact, the best versions of our technique improved cotton boll detection mean average precision by about 14% over the standard image distortion technique. Of course, there’s a long list of caveats here: we focus explicitly on tasks related to cotton phenotyping. Your mileage may vary.
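To make the multi-camera idea concrete, here is a rough sketch of how positive pairs could be drawn from different cameras instead of from two distortions of one image. This is a simplified illustration, not the pipeline from the paper: the function names, the camera names, and the dictionary layout are hypothetical, and it leaves out any extra distortion you might still want to apply to each view.

```python
import random


def sample_positive_pair(capture, rng=random):
    """Pick two views of the same scene taken by different cameras.

    `capture` maps a (hypothetical) camera name to the frame that camera
    recorded at one moment. The frames are opaque here: file paths,
    arrays, anything.
    """
    if len(capture) < 2:
        raise ValueError("need at least two cameras to form a pair")
    # Sample two distinct cameras without replacement.
    cam_a, cam_b = rng.sample(sorted(capture), 2)
    return capture[cam_a], capture[cam_b]


def build_batch(captures, rng=random):
    """Build paired view lists for a contrastive loss, one pair per capture."""
    pairs = [sample_positive_pair(capture, rng) for capture in captures]
    views_a = [a for a, _ in pairs]
    views_b = [b for _, b in pairs]
    return views_a, views_b


# Example: two moments in time, three cameras each (names are made up).
captures = [
    {"cam0": "scene0_cam0.png", "cam1": "scene0_cam1.png", "cam2": "scene0_cam2.png"},
    {"cam0": "scene1_cam0.png", "cam1": "scene1_cam1.png", "cam2": "scene1_cam2.png"},
]
views_a, views_b = build_batch(captures, random.Random(0))
```

Each entry of `views_a` is paired with the entry at the same position in `views_b`, and both show the same scene from different cameras. Views from other captures in the batch play the role of negatives, just as different images do in the single-camera setup.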
Oops, All Cameras
“But Daaaniel!” I hear you say, in that annoyingly whiny voice of yours. “Why should I care about two cameras? I only have one camera, and I think that’s a perfectly adequate number of cameras to have.” This may be true for you, but they don’t pay me the (actually not very) big bucks to be merely adequate.
Long-time readers of this blog (who could probably fit around my dining room table) might remember that we happen to have a robot with a crap-ton of cameras. Did you ever wonder what we used all that data we collected for? Well, now you know! And it only took us… oh god, a year and a half? Wow.

I pushed my version of multi-camera SSL to the limit, using all six cameras on the robot. Mostly, what I found was that this was excessive. Pretty much all the improvement from this technique can be obtained with just three cameras. That’s good news for those of you with a limited camera budget!

Please, Cite my Paper
I’m not telling you all this just because I feel like it. A few days ago, I finally published a paper on this technique. If you’re interested in the details, well, they’re all in there.
Overall, this has been a very interesting project, but somewhat limited in scope. We really only looked at a few SSL techniques. We tested them only on cotton. We also tried to systematically evaluate the effect of camera positioning on learned feature quality, but the results were too noisy to draw many conclusions. There’s a lot of future work that could be done.
But for now, I’m just happy to have this sucker published.

Petti, Daniel / Li, Changying / Liu, Ninghao
Contrastive multi-view representation learning for multi-camera plant phenotyping: A cotton field study
2026
Plant Phenomics, Vol. 8, No. 2
p. 100193
Abstract: Attempts to deploy computer vision in agricultural tasks often suffer from a shortage of annotated data. One strategy to alleviate the impact of limited data is Self-Supervised Learning (SSL), which involves pre-training a model on a pretext task that utilizes automatically generated annotations. The primary objective of this study is to leverage a multi-camera view dataset of cotton boll images for contrastive learning in order to enable phenotyping tasks with minimal data annotation. This dataset was collected in the field using six camera views. The efficacy of two contrastive learning frameworks (SimCLR and MoCo) in producing representations when positive examples originate from different cameras was investigated, and a comprehensive study of how the camera positions affect performance was conducted. After self-supervised pre-training, linear evaluation and semi-supervised learning experiments were performed on boll detection and plot status downstream tasks. In general, using multiple camera views with SimCLR and MoCo improves cotton boll detection mean average precision by 14% compared to vanilla SimCLR and MoCo. Through careful investigation using synthetic data, it was determined that relative camera poses with an intermediate amount of overlap seem more likely to perform well. Neither MoCo nor SimCLR was consistently superior to the other in this context. The representations embed meaningful features about the cotton plants, such as overall boll density, but also less meaningful ones, such as lighting variations. This technique could potentially accelerate the development of phenotyping algorithms based on data collected from field robots.