When we think of farming, most of us probably picture some sort of idyllic rural scene: green pastures, a quaint farmhouse, hedgerows, maybe a couple of cows and chickens. But this has not been the reality for decades (assuming it ever was). In the developed world, farming is highly industrialized and mechanized.

It’s also dangerous. The agricultural sector consistently ranks among the most perilous for workers in the U.S., with 20.3 fatal injuries per 100,000 workers in 2023. It also employs a lot of immigrants, and the U.S. has only a limited ability to track workplace injuries even among native-born workers. Add all that up, and the “official” picture of injuries in the agriculture industry is probably incomplete.

Ag Injury News

A few years ago, some of our colleagues created a database called Ag Injury News, which tracks these injuries and organizes them into a form that’s easy to search and analyze. The idea is that if some hypothetical future government decides that worker safety is a priority, they’ll have all the data they need to design policy interventions in a usable format.

Example of the Ag Injury News website
An example of the Ag Injury News frontend, showing some recent articles that have been added.

Ag Injury News sources data from news reports on the internet. There’s an automated system that monitors for reports that it thinks are relevant, and human volunteers manually sort through the output and enter the information into the database. This has severe throughput limitations, given that there’s only so much you can expect from a volunteer workforce. I already dread reading the news, and I doubt having to take a quiz afterwards would make it any better.

Automating Article Ingestion with LLMs

Around the time that ChatGPT was publicly released, we wondered whether it could at least partially automate the tedious process through which articles are added to the Ag Injury News database. On the one hand, Large Language Models (LLMs) are usually pretty good at these sorts of “reading comprehension” tasks. On the other, LLMs are prone to hallucinations, and this application requires them to output text in a very structured format. Are they up to the task?

To answer that question, we tested two different versions of ChatGPT (3.5 and 4o), as well as the open-source Llama2. We downloaded 300 random articles from the Ag Injury News database, and for each article, we had the models fill out a form. Then we compared the responses to what was recorded in the Ag Injury News database. You can check out a simplified example of this process here.
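The form-filling step can be sketched roughly like this. The field names, prompt wording, and response format below are illustrative (our actual schema was more detailed), but the shape of the pipeline is the same: build a prompt, parse the model’s answer back into a record, and compare against the database.

```python
# Illustrative sketch: FIELDS and the prompt/response format are
# simplified stand-ins for the real schema, not the exact one we used.
FIELDS = ["date", "state", "city", "injury_type"]

def build_prompt(article_text: str) -> str:
    """Ask the model to fill out one line per field."""
    field_lines = "\n".join(f"{f}:" for f in FIELDS)
    return (
        "Read the following news article and fill out the form below.\n\n"
        f"{article_text}\n\nForm:\n{field_lines}"
    )

def parse_response(response: str) -> dict:
    """Parse 'field: value' lines from the model's reply into a record."""
    record = {}
    for line in response.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            key = key.strip().lower()
            if key in FIELDS:
                record[key] = value.strip() or None
    return record

def score(predicted: dict, reference: dict) -> float:
    """Fraction of fields where the model matches the database entry."""
    matches = sum(predicted.get(f) == reference.get(f) for f in FIELDS)
    return matches / len(FIELDS)
```

A reply like `"date: 2023-05-01\nstate: WI\ncity:\ninjury_type: fall"` parses into a record with `city` left empty, and `score` then tells you how many fields agree with what the volunteers entered.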

Diagram of the overall workflow
High-level diagram of our article processing pipeline (Muller et al., 2024)

Both versions of ChatGPT performed well out of the gate, but we found that the much smaller Llama2 model struggled. The latter, however, is open source, which makes it easy to re-train on our own dataset. For that purpose, we pulled the entire contents of the Ag Injury News database, which, at that time, encompassed 5,300 articles about incidents that happened from 2016 to 2023. We generated the expected prompt-response pairs for the articles and used them to fine-tune the Llama2 model. This version of Llama2 worked much better, and still has the advantage of not costing anything to run.
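Generating the fine-tuning data from the database looks something like the sketch below. The instruction/input/output JSONL layout is a common fine-tuning format rather than a description of our exact scripts, and the field names are illustrative:

```python
import json

def make_training_example(article_text: str, record: dict) -> dict:
    """Pair the extraction prompt with the 'gold' completion taken from a
    database row. The instruction/input/output layout matches the format
    many instruction-tuning scripts expect; field names are illustrative."""
    completion = "\n".join(
        f"{k}: {v if v is not None else ''}" for k, v in record.items()
    )
    return {
        "instruction": "Read the article and fill out the incident form.",
        "input": article_text,
        "output": completion,
    }

def write_dataset(examples, path):
    """Write one JSON object per line (JSONL), the usual fine-tuning input."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
```

Running every database row through `make_training_example` and dumping the results with `write_dataset` yields a JSONL file that standard fine-tuning tooling can consume directly.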

Results

Chart of the accuracy on the incident fields

All three models did pretty well at extracting the high-level information from the articles. We were mostly pleased with this performance, although they did struggle a bit with more detailed location information (like city). I think this is because a lot of our articles come from local news sources, which often provide only minimal location details (like the street), assuming that the reader is familiar with the area. We tried to mitigate this by collecting information about the local news sources in our dataset and injecting that into the prompts, but this process isn’t always super reliable.
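The mitigation for vague local references can be sketched as a simple prompt-injection step. The `SOURCE_REGIONS` map below is a hypothetical stand-in for the outlet metadata we collected (WHDH, for instance, is a Boston-area station):

```python
# Hypothetical sketch: prepend what we know about the outlet's coverage
# area so the model can resolve vague local references. The map below is
# illustrative; the real metadata was collected separately per source.
SOURCE_REGIONS = {
    "whdh.com": "Boston, Massachusetts area",
}

def add_source_context(prompt: str, source_domain: str) -> str:
    """Prefix the extraction prompt with the outlet's coverage area, if known."""
    region = SOURCE_REGIONS.get(source_domain)
    if region is None:
        return prompt  # no metadata for this outlet; leave the prompt alone
    hint = (f"Note: this article was published by a local outlet "
            f"covering the {region}.\n\n")
    return hint + prompt
```

When the map has no entry for a source, the prompt passes through unchanged, which is one reason the approach isn’t always reliable in practice.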

Chart of the accuracy on the victim fields

We also tried to extract specific information about every victim. The LLMs are quite effective at this task, with the fine-tuned Llama2 doing particularly well. All of them have some trouble with more complex fields like names, which underscores the limitations of these models: sometimes a model will output the name of a reporter or bystander mentioned in the article instead of the victim’s.

One major caveat of these results is that our dataset tends to be really unbalanced. For instance, there are way more adult victims than child victims in our dataset (as one would hope). Therefore, high accuracy on the “age” field can obscure the fact that the models sometimes miss the rare child victims. A similar thing is going on for many of the other victim fields.
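The imbalance problem is easy to see with per-class recall instead of overall accuracy. The counts below are made up for illustration, not taken from our dataset:

```python
from collections import defaultdict

def per_class_recall(y_true, y_pred):
    """Recall for each class: of the true members of that class, how many
    did the model label correctly? High overall accuracy can hide a very
    low recall on rare classes."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {c: correct[c] / total[c] for c in total}
```

With 98 adult victims and 2 child victims, a model that labels everyone “adult” scores 98% accuracy while its recall on the “child” class is zero, which is exactly the failure mode hidden by the aggregate numbers.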

Addendum: More Capabilities with Better Models

One of the perils of doing research with LLMs is that it becomes outdated so quickly. As a case in point, DeepSeek R1 was released right after we submitted the proofs for our manuscript. R1 is a “reasoning” model, trained to think through complex problems in a step-by-step fashion, which tends to improve performance on certain tasks. The idea is not new, but because it is relatively small and open source, R1 is the first reasoning model that’s really practical for researchers to get hands-on with.

I thought that R1 might solve a particular problem that we had with the Ag Injury News project. When we’re adding new victims to the Ag Injury News database, one of the things our collaborators care about is the OIICS code for their injuries. OIICS is a somewhat byzantine classification system. Each injury type is represented by a 4-digit code, and the classification is hierarchical: all codes that start with 4, for instance, represent “falls, slips, and trips”. The granularity increases with each additional digit; code 4213, for example, indicates a “fall on same level due to tripping over an object”.

Example of an injury category in the OIICS classification system
Example of an injury category in the OIICS classification system.
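The prefix structure means any code can be expanded into its chain of increasingly specific labels. The snippet below shows the idea with a tiny slice of the hierarchy; the top-level and 4-digit labels come from the text above, but the intermediate labels are paraphrased stand-ins, not quotes from the OIICS manual:

```python
# A tiny illustrative slice of the OIICS event hierarchy. The "4" and
# "4213" labels match the ones discussed above; the intermediate levels
# are paraphrased stand-ins for illustration.
OIICS = {
    "4": "FALLS, SLIPS, TRIPS",
    "42": "Falls on same level",
    "421": "Fall on same level due to tripping",
    "4213": "Fall on same level due to tripping over an object",
    "6": "CONTACT WITH OBJECTS AND EQUIPMENT",
}

def describe(code: str) -> list:
    """Expand a code into its chain of labels, one per digit, by walking
    the code's prefixes from most general to most specific."""
    labels = []
    for i in range(1, len(code) + 1):
        prefix = code[:i]
        if prefix in OIICS:
            labels.append(f"{prefix}: {OIICS[prefix]}")
    return labels
```

So `describe("4213")` returns four labels, one per level of the hierarchy, which is also why classification gets harder as the digits accumulate: each extra digit multiplies the number of fine distinctions the model has to get right.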

It turns out that automatically producing these codes for each victim is a pretty hard problem. (The difficulty is compounded by the fact that humans often disagree!) We had trouble getting LLMs to do this well. The best we could manage was a fine-tuned version of Llama2, but even this model had trouble with the really specific levels of the categorization.

Recently, I tested DeepSeek R1 on the same problem, and it did a better job:

User: Please read the following article:  

1 injured after tree crashes down on farm in Westport WESTPORT, MASS. (WHDH) - A man was injured after a tree crashed down on a farm in Westport on Thursday. The man was sitting on a tractor and working on the trees when one crashed down on him. He was taken to a local hospital with non-life-threatening injuries. Authorities say the incident was an extremely close call.  

Please categorize the injury sustained by the victim in this article. The categories are:  

1: VIOLENCE AND OTHER INJURIES BY PERSONS OR ANIMALS
2: TRANSPORTATION INCIDENTS
3: FIRES AND EXPLOSIONS
4: FALLS, SLIPS, TRIPS
5: EXPOSURE TO HARMFUL SUBSTANCES OR ENVIRONMENTS
6: CONTACT WITH OBJECTS AND EQUIPMENT
7: OVEREXERTION AND BODILY REACTION
9999: Nonclassifiable

Note that the options are listed in the order of precedence, so if two or more apply, use the first one. 

R1: <think> 

Okay, I need to figure out how to categorize the injury from the article provided. Let me read through it carefully. 

The article says a man was injured when a tree crashed down on him while he was sitting on a tractor and working on the trees. He was taken to the hospital with non-life-threatening injuries. 

Looking at the categories given: 

1: Violence or persons/animals 2: Transportation incidents 3: Fires/explosions 4: Falls/slips/trips 5: Exposure to harmful substances 6: Contact with objects/equipment 7: Overexertion/bodily reaction 9999: Nonclassifiable 

The man was working on trees when a tree fell on him. He was on a tractor, so the tractor is equipment, but the injury was caused directly by the tree. Trees are objects, not equipment. So the injury resulted from contact with an object (the tree) rather than equipment like machinery. 

I should check if any other categories apply. It wasn't violence or animal-related, no transportation involved beyond the tractor which isn't causing the injury, no fire or explosion, no fall since he was sitting, and no overexertion. So category 6 seems correct. 

</think> 

 

The injury is best categorized under 6: CONTACT WITH OBJECTS AND EQUIPMENT as the man was struck by a falling tree while working on it. 

It seems that the reasoning model is better equipped to solve this type of problem. The disadvantage, of course, is that it requires far more tokens than our Llama2-based approach, which translates to a lot of additional computation. Across hundreds of articles, this can add up.

Anyway, this is just one article. I’m not sure how it would do on the combined dataset. If we were still working on the paper, I would probably have gone back and tested more thoroughly. Unfortunately, though, this model was released a little too late for us.

Conclusion

It turns out that LLMs are pretty good at sifting through reams of boring text and extracting important information. This is not a new insight. Other people have been doing similar things with clinical notes, or sifting through court documents in order to find people who may have been wrongfully convicted.

People, especially the most vulnerable, are easily lost in the streets and back-alleys of the massive bureaucracies we’ve built up to administer our societies. In the past, more often than not, anyone who fell through the cracks simply disappeared, their files left in a forgotten drawer to collect dust. People with power have limited insight into the lives of everyone below them and often prefer to keep it that way. New technology has the potential to sift through all those paper trails and amalgamate that information into something actionable. The question is whether or not we’ll use that capability.

For now, it looks like our project has come to an end. We always struggled to get funding for it, and what little we could muster ran out months ago. The people who run the Ag Injury News website don’t seem to have any interest in deploying what we’ve created. They’re worried about copyright lawsuits from the news organizations that write the content they feed into the system.

If you’d like to know more about our work, you can read our ASABE conference paper. We also have a more complete version that’s currently in-press. I’ll post about it when it gets published.

Update 2/5/25: The full paper has now been published and is available here.
