Add this one to a controversial use case of Claude Code: the business of referee reports. Before I explain how I use Claude Code to undertake referee reports, let me note a couple of things. First, I say “undertake” because using AI agents to do a report is more than merely “writing” one. That’s because, as I’ll show, you are capable of doing more in your reports than you ever have before, and writing may only be a small portion of it, if you even choose to have the agent write at all. Second, I want to remind the reader of my philosophy of when you should use AI agents for social science. I think this list of criteria extends to other tasks, but I’ll stick with social science for now to make my point. You should consider using an AI agent for work when the following four conditions hold.
- High-value tasks. You should use AI agents to do work when the work and/or the output of the work is extremely valuable. Referee reports qualify. They are the backbone of the modern production function of science, and doing a bad job at them imposes costs not only on the editor and the researchers who wrote the paper you’re reviewing; a bad job affects the social scientific record itself by slowing it down and by injecting errors of various kinds into it.
- High time use. Just because the work is valuable does not mean the best response is to use an AI agent for help. One of the things AI agents are useful for, though, is that they complete cognitive tasks and produce cognitive output at lower time use, by which I mean fewer units of human time. So if referee reports usually take a tremendous amount of time to complete, then AI agents are great because they do that work in less time.
- Hard if not impossible to do well. Independent of high value and high time use, AI agents should also be considered when the work is hard to do well, if not impossible to do at all. This is something to always keep in mind. It’s entirely possible that even with an infinite amount of time, the task at hand is hard to do well. In that case, the AI agent can be very useful because it can do that work, due to its training, its training data, and its stubbornness. It will not give up, and since it works fast, it will get to the impossible task in far less than infinite time, which is a major understatement to be honest.
- Easy to do badly, mediocrely, or even flat-out wrong. This principle sometimes seems redundant with number 3, but I think it’s worthwhile to keep it separate. Principle 3 is more of a “can’t do” kind of principle, but principle 4 is more of a “can do” kind of principle. You cannot do it well (principle 3). You are likely to do it badly, maybe even wrong (principle 4).
Referee reports fit all four of these. They are extremely valuable, as I said. They are the backbone of modern science because they are the way in which peer review happens. That’s principle 1: high value.
Referee reports are also time consuming. The more attentive you are to the manuscript, the longer they take. The less attentive you are to the manuscript, the more you mar the scientific record by, in the limit, consistently doing bad jobs on it. Principle 2: high time use.
Referee reports are hard to do well. They require skill, attentiveness, and open-mindedness. They require forgetting the identity of the author: their reputation, your relationship with them, your distrust of or fondness for them and the literature they participate in. Principle 3: hard to do well.
And finally, referee reports are easy to mess up. You can be triggered by the study for any kind of idiosyncratic reason. You can fail to give the author a fair shake. Maintaining an open mind toward it could be borderline impossible for some subjective reason. It may require skills you simply do not have and do not want to have. Principle 4: trivially easy to do the work badly, if not incorrectly.
Thus, I would like to propose some tricks I use to help me undertake referee reports using agents. Your mileage may vary on this.
You will likely get your assignment from the editor with a pdf to review. It is rarely if ever the case that you will be sent a Word document or text file. This is a problem for LLMs because they struggle with pdfs, as pdfs are not text. You can read the text, for sure, because the characters look identical to the shapes of letters and words. They follow grammar. They are writing. But they are not “text” in the way that matters to an LLM. They are more like hieroglyphics or even drawings. They require extraction methods that are costly in terms of tokens and LLM attention, and their use can consume several times more tokens, with far greater error, than had it been text. And the larger the pdf, the worse it gets.
Therefore the first step in your referee report assignment is to find a way to convert the pdf into a markdown document that you can then have the LLM analyze. I use /split-pdf, a skill I created and saved at my MixtapeTools GitHub repository. /split-pdf splits a pdf into smaller chunks, usually around four pages long, and then writes markdown summaries of each shortened pdf. I usually amend the skill with prompts too. Here is an example of one such prompt.
Please use /split-pdf on pdf. Then when you’re done, please write markdowns summarizing each split. Keep them separate. Do your absolute best to extract table contents into a markdown form that is machine readable. Do your best with figures by going directly to them in each split and optically extract all the information you can get, including a description of the style of it, so that we can recreate it in a deck after. Then once you’re done, thoroughly review all of the markdowns you made, and create a single new markdown that is full of insights. I want to know the research question, the motivation for it, the type of estimation, the target parameter, the identification assumptions, I want it as free of jargon as you can, what is the core evidence they present, and their conclusions. I would like an assessment of your overall view of how convinced you think I will find the paper.
Notice how I direct the agent to very narrow things. This is because /split-pdf was not designed for referee reports. So I try to give it more context for what the task actually is, and what I want is the extraction of the relevant information that I can use later for the purposes of the peer review. This isn’t done in lieu of reading the paper, mind you. Rather, it is done as a complement to the reading. What we hope this does is give you a compass, or a map, or a mental model of the manuscript and the study, such that when you do read it, you understand where the study is going and can hopefully dive into it more quickly.
But with this prompt, the LLM will be splitting the manuscript (which in economics, recall, can run over 100 pages with appendices, on a topic you are only marginally familiar with, let alone an expert in) and then summarizing each split according to whatever it is you want it to do. Once it has produced all the splits and all the markdown summaries of each split, it then reviews those markdowns carefully and writes a single review of the entire thing. I always end with a request that it assess the quality of the evidence in terms of what I would find compelling. That is, I ask it to discuss how the authors present the evidence, the order in which they do so, how convincing it is, and most importantly, I ask it to try to keep my own mind in view as it does so.
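To make the chunking step concrete, here is a minimal sketch of how a long manuscript could be carved into roughly four-page splits with one markdown summary file per split. This is not the actual /split-pdf implementation; the function names and the file naming pattern are hypothetical, and the sketch only computes page ranges rather than touching a real pdf.

```python
def chunk_page_ranges(total_pages, chunk_size=4):
    """Return 1-indexed, inclusive (start, end) page ranges covering the document.

    chunk_size=4 mirrors the "around four pages" rule of thumb; it is an
    assumption for illustration, not a setting read from any skill.
    """
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def summary_filenames(stem, ranges):
    """Hypothetical naming scheme: one markdown summary per split."""
    return [f"{stem}_p{start:03d}-{end:03d}.md" for start, end in ranges]


if __name__ == "__main__":
    ranges = chunk_page_ranges(10)
    print(ranges)
    print(summary_filenames("manuscript", ranges))
```

A 100-page paper with appendices becomes 25 splits under this rule, which is why the final step of consolidating the separate summaries into one markdown matters so much.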
The next thing I do after it has done this summary is ask the LLM to use my /beautiful_deck skill to write a deck in beamer that will help me gain a mental map and overview of the study. But again, I don’t simply invoke the beautiful_deck skill. Rather, I amend it with more specific requests, which usually entail a reminder to read my “rhetoric of decks” essay again, and to create the deck according to principles which are technically in the skill markdown, but which I still want to remind it of. This probably does increase tokens, but to be honest, I’m not 100% trusting of skills and so tend to do follow-ups when invoking them just to be safe. This is probably something that with time I’ll become a bit less likely to do, but it’s what I do now, so I wanted to share it. Here is an example of a prompt.
Once you are done, then do this. Please use /beautiful_deck to make a complete and thorough deck that is consistent with and based on the principles of the rhetoric of decks as outlined in my essay, the rhetoric of decks. Remember the moves: utility, intuition, visualization using /tikz and .png, story, low cognitive density per slide, flow, and then and only then the introduction of the more technical and rigorous elements of the study. In our case, I would also like you to consider using simulations that match what the authors are doing, including the use of estimators, the panel length, outcomes and covariates. Report beautiful tables and beautiful figures based on those simulations so long as they match what is being done in their paper and so long as it aids the rhetoric of decks and the narrative you’re trying to tell. I am the audience; I am the one who has to write the report.
It is important when using the /beautiful_deck skill to remind the LLM who the audience is. That’s because rhetoric as a discipline of study is not merely what you say, and not merely how you say it. It is also who you say it to. And in this case, you are not communicating the manuscript’s contents to other people; rather, you are communicating the manuscript’s contents to yourself, and because you have not yet read the manuscript and need a mental map of it, this becomes about the LLM writing a deck that is best for you.
That is why I usually request a few things. First, I request narratives. I think in terms of narratives. If that makes me different from others, so be it, but I know myself, and narratives are one of the main ways I interact with science. It’s detectable to anyone who has read my writings, particularly my Mixtape book. I have a habit of detouring into digressions about the people, their relationships, their places of employment, and the context in which they’re writing. For reasons that are not entirely clear to me, narrative helps me see the technical material better too. When I can see that Alberto Abadie is Josh Angrist’s student, that he coauthored with Guido Imbens throughout the early twenty-first century, that he took a position at MIT, leaving Harvard Kennedy School, and that he is from the Basque Country, then for reasons that are mysterious to me, I find that I can follow the econometric notation and the calculations in some of his papers more closely. Why that is the case, I don’t know. I just know that it is the case, and so my decks tend to emphasize narrative that communicates to me.
Notice that I also heavily encourage the use of “beautiful figures and beautiful tables”, even though the LLM cannot yet itself extract the authors’ own figures using snipping tools. So what I do instead is ask it to create simulations of the authors’ own study in some language, with fake data that matches the description in the paper, then use the same estimator with the same specification as described, and present that to me. This serves a few purposes for me. First, it gets me pictures, and I need pictures. I need pictures to understand others’ work as much as, if not more than, I need the narratives of the work.
Second, by generating simulated data that matches the data in the paper, I can start to better understand precisely how the estimator will and will not interact with it. For instance, take a lengthy panel dataset of minimum wages. Minimum wages increase because of the legal independence of each local jurisdiction. Cities and states can raise their own minimum wages. They can target certain industries but not others. They need not wait for the federal government to do it. But the federal government routinely does it too. And when the federal government raises the minimum wage, there is no untreated control group at that point in time when you are estimating the effect of the minimum wage on employment. What this means is that two-way fixed effects (TWFE) will be in the exact situation where it is the most biased. This is a subtle detail, and you may catch it, but the simulation will for sure catch it. And if you asked for certain other estimators to be run, like Callaway and Sant’Anna (which, incidentally, cannot and will not use the period when the federal minimum wage was raised, since it must have an untreated unit in each 2×2 for that 2×2 to even be calculated), you will find that it would not even attempt it, and that would be something you catch.
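The missing-control-group problem can be checked mechanically before any estimator is run. The sketch below is a stylised illustration and not the Callaway and Sant’Anna estimator itself: it only asks, period by period, whether any not-yet-treated or never-treated unit is left to serve as a comparison. The state names, years, and the assumption that a federal increase treats every remaining state at once are all made up for the example.

```python
def apply_federal_floor(adoption_year, federal_year):
    """Stylised assumption: a federal increase treats every state that has not
    already raised its own minimum wage by federal_year."""
    return {
        unit: federal_year if (g is None or g > federal_year) else g
        for unit, g in adoption_year.items()
    }


def periods_without_controls(adoption_year, years):
    """Return the periods in which no untreated comparison unit exists.

    adoption_year maps unit -> first treated year, with None for never treated.
    A unit counts as untreated in year t if it is never treated or adopts after t.
    """
    blocked = []
    for t in years:
        untreated = [u for u, g in adoption_year.items() if g is None or g > t]
        if not untreated:
            blocked.append(t)
    return blocked


if __name__ == "__main__":
    years = list(range(2000, 2011))
    # Hypothetical staggered state-level adoption; C and D never raise on their own.
    state_only = {"A": 2004, "B": 2006, "C": None, "D": None}
    print(periods_without_controls(state_only, years))

    # Add a hypothetical federal increase in 2007.
    with_federal = apply_federal_floor(state_only, 2007)
    print(periods_without_controls(with_federal, years))
```

With state-level changes only, every year keeps at least one comparison unit. Once the federal increase is layered on, every year from the federal change onward has none, which is exactly the span of the panel where a 2×2 comparison cannot be formed.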
After the /beautiful_deck skill has been invoked to create, well, a beautiful deck, I then have /referee2, my critical thinking skill, review the deck against the markdowns and the markdowns against the pdf. I want to catch as many low-hanging-fruit problems as I can. Here is an example.
Please use now /referee2 on the deck and on the summaries. Review what your interpretation is and how your explanation of it appears in the deck. Remember the rhetoric of decks and the goals.
When /referee2 is invoked, an agent (ideally one on the CLI outside of whatever session you have been working in, since you want this to be a clean critique of the work of the other AI agent) will go through the markdowns as well as the deck to review that agent’s work. Note, /referee2 is not reviewing the manuscript. Rather, /referee2 is reviewing the agent’s interpretation of the manuscript, as well as the deck produced from it to communicate its findings to you. It will also review any simulations.
I do this because I’m trying to pump as much inaccuracy out of the deck and summaries as I can, so that when I am myself ready to read the deck, I can do so without as much worry that the deck is seriously flawed. The bottleneck in science is not production. It’s verification. And thus I try to use the LLM to get as much verification done as possible, so that when I get into it, the easiest things have already been fixed.
This is probably overkill, but I do it anyway. I have another skill called /blindspot whose sole job is to review whatever I throw at it, but to look around the edges of the “main results” of whatever it is reviewing for things that can be missed but which are suspicious. This might be sample sizes in tables that don’t quite add up, but which can be easily missed since they’re really not the star of the paper. It can be things like how it’s impossible to estimate the effect of the minimum wage on employment using Callaway and Sant’Anna if the years span the raising of the federal minimum wage, since that would eliminate all the untreated units. /blindspot is designed to try to find things like this. Here is a generic example.
Once you are done with this, please use /blindspot on the markdown summaries, the deck, but also the referee2 markdown you created. I want you to focus on the things that are easy to miss which are not the star of the paper. For instance, maybe the sample sizes don’t add up because if there are 50 states and 10 years, there should be 50x10=500 units. Yet there are 850 units or there are 100 units -- which is not possible if the panel is balanced, and thus may suggest a problem. Or maybe the paper uses Callaway and Sant’Anna to estimate the minimum wage, and has such a long panel that it includes the federal government’s raising of the minimum wage. This would make CS impossible to use because CS cannot survive a federal minimum wage increase as it needs untreated units to calculate each 2x2, and the raising of the federal minimum wage would leave the authors without one. I am trying to see if referee2 missed anything.
Notice that /blindspot is reviewing not only the deck and the markdown summaries. It is also reviewing the /referee2 report. Again, my attitude is skepticism: what has the LLM missed? To do that, I create a donut hole around the study’s main results, ask the LLM to ignore the main results, and focus instead on the other stuff.
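The sample-size arithmetic in the /blindspot prompt is simple enough to write down. Here is a minimal sketch, with a hypothetical function name and status labels, that compares a reported number of state-year observations against what a balanced panel implies. It says nothing about why a count is off; it only flags that it is.

```python
def balanced_panel_check(n_units, n_periods, reported_n):
    """Compare a reported observation count with the balanced-panel count.

    A balanced panel has exactly n_units * n_periods observations, so any
    other count is worth a second look in the tables.
    """
    expected = n_units * n_periods
    if reported_n == expected:
        status = "consistent"
    elif reported_n < expected:
        status = "too_few"   # unbalanced panel or dropped observations?
    else:
        status = "too_many"  # duplicates, extra units, or a different unit of analysis?
    return {
        "expected": expected,
        "reported": reported_n,
        "gap": reported_n - expected,
        "status": status,
    }


if __name__ == "__main__":
    # The numbers from the prompt: 50 states, 10 years, and three reported counts.
    for n in (500, 850, 100):
        print(balanced_panel_check(50, 10, n))
```

A flag here is a question for the authors, not a verdict: the panel may simply be unbalanced, or the table may be counting something other than state-years.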
And last but not least, I invoke my /tikz skill. My /tikz skill is my way of trying to get the LLM to double-check the labeling problems that are borderline impossible to iron out of the process of creating images and data visualizations in Tikz itself, as well as in .png files made from statistical software like R or python.
See, my beautiful_deck skill is instructed to have zero tolerance for compile errors. A common problem is text spilling over the margins of the slide. This triggers visible warnings, despite a successful compile, with words like “overfull” and “hbox”. Attached to them are the line numbers where the problem occurred, as well as the size of the overflow. So beautiful_deck tackles those head on and works at them until they’re gone.
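For readers who want to see what chasing those warnings looks like, here is a minimal sketch that scans LaTeX log text for overfull-box warnings and pulls out the overflow size and source line numbers. It is an illustration, not the beautiful_deck skill’s actual logic; the function name is hypothetical, and it assumes each warning sits on a single log line, which is the usual case but not guaranteed.

```python
import re

# Matches lines such as:
#   Overfull \hbox (15.2pt too wide) in paragraph at lines 42--44
#   Overfull \vbox (3.5pt too high) detected at line 57
OVERFULL = re.compile(
    r"Overfull \\(?P<box>[hv]box) \((?P<pt>\d+(?:\.\d+)?)pt too (?:wide|high)\)"
    r"(?:.*?at lines? (?P<start>\d+)(?:--(?P<end>\d+))?)?"
)


def find_overfull(log_text):
    """Return one dict per overfull warning: box type, overflow in pt, line span."""
    hits = []
    for line in log_text.splitlines():
        m = OVERFULL.search(line)
        if not m:
            continue
        start = int(m.group("start")) if m.group("start") else None
        end = int(m.group("end")) if m.group("end") else start
        hits.append(
            {"box": m.group("box"), "pt": float(m.group("pt")), "start": start, "end": end}
        )
    return hits


if __name__ == "__main__":
    sample_log = "\n".join(
        [
            r"Overfull \hbox (15.2pt too wide) in paragraph at lines 42--44",
            r"Underfull \hbox (badness 10000) in paragraph at lines 50--51",
            r"Overfull \vbox (3.5pt too high) detected at line 57",
        ]
    )
    for hit in find_overfull(sample_log):
        print(hit)
```

Sorting the hits by overflow size gives a natural worklist: the slides that spill the most get fixed first.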
But this process will not catch the kinds of errors that are produced with labels and drawings in Tikz and R/python/Stata. If words are crossing lines, that will not trigger a compile error. It will compile successfully, in fact, because technically incorrectly drawn figures will still compile.
So I have a skill that is solely devoted to checking the images and the drawings. It uses things like Bézier curves to ensure that labels and objects are not crossing in illegible ways. Bézier curves do not rely on sight; they rely on mathematical formulas. The skill can therefore trace the coordinates of the curves and labels directly to ensure there is enough white space between objects in the image. Here is an example prompt.
Now please use /tikz to review all images be it the Tikz images and/or any simulated images in .png from R, Stata or python. Each object in the image should have enough white space between it on both left and right sides and top and bottom sides so that nothing is interfering. You should be using Bezier curves to ensure this, not vibes or guesses. Do not tolerate any errors.
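To show what “mathematical formulas, not sight” can mean in practice, here is a minimal sketch that evaluates a cubic Bézier curve from its four control points and tests whether it passes through a label’s padded bounding box. I am not claiming this is how the /tikz skill works internally; the function names, the coordinates, and the sampling approach (checking 201 points along the curve rather than solving for intersections exactly) are all assumptions for illustration.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Point on a cubic Bezier curve at parameter t in [0, 1] (Bernstein form)."""
    s = 1.0 - t
    x = s**3 * p0[0] + 3 * s**2 * t * p1[0] + 3 * s * t**2 * p2[0] + t**3 * p3[0]
    y = s**3 * p0[1] + 3 * s**2 * t * p1[1] + 3 * s * t**2 * p2[1] + t**3 * p3[1]
    return (x, y)


def curve_hits_label(control_points, label_box, padding=0.0, samples=200):
    """True if any sampled curve point falls inside the label box grown by padding.

    label_box is (xmin, ymin, xmax, ymax) in the same coordinates as the curve.
    Sampling is an approximation: a very thin label could slip between samples.
    """
    xmin, ymin, xmax, ymax = label_box
    for i in range(samples + 1):
        x, y = cubic_bezier(*control_points, i / samples)
        if (xmin - padding <= x <= xmax + padding
                and ymin - padding <= y <= ymax + padding):
            return True
    return False


if __name__ == "__main__":
    # Hypothetical arc drawn from (0, 0) to (4, 0); its highest point is (2, 1.5).
    curve = ((0, 0), (1, 2), (3, 2), (4, 0))
    on_curve = (1.8, 1.4, 2.2, 1.6)   # label sitting right on the arc
    above = (1.8, 1.8, 2.2, 2.0)      # label placed above the arc
    print(curve_hits_label(curve, on_curve))
    print(curve_hits_label(curve, above, padding=0.2))
    print(curve_hits_label(curve, above, padding=0.35))
```

The padding argument plays the role of the white space requirement: a label can clear the curve itself and still fail once you demand a margin around it.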
Is /tikz perfect? Unfortunately, it is not. Maybe 50% of the time it will completely eliminate all of the problems in image construction. When it fails, it’s usually in areas where there are so many objects, and the need to place labels is so surgically precise, that it’s dealing with one of those Mission Impossible-style situations where Tom Cruise has to cross a room without touching about 50 laser beams. When it is just one laser beam, /tikz works great. When it’s 10, it still works great. But above some number, I’ve noticed that it’s just a challenge, and it would be a challenge even if I were doing it manually. My hope is that in the long run, I figure it out. But at the moment it isn’t quite there. Still, maybe you’ll have more luck.
Once all of this is complete, I have the LLM do one more pass through to ensure that the deck is complete. I use /referee2 and /beautiful_deck to do this by invoking them both. I have /referee2 read the earlier agent’s referee2 markdown report, and then fix the deck. And then I use /beautiful_deck to do its own pass through.
And then I’m done. Then I read the deck closely, flipping around, trying to follow it quickly at a topical level. I roughly memorize the study from the deck up to a point where I feel like I can read the paper fresh.
This is how I work now. Whether this is killing my ability to read papers remains to be seen, but this is how I work. The peer review process is extremely valuable. The time it takes is immense. My own attention problems cause me to do it badly unless I spend an inordinate amount of time on it. And sometimes I just know that, realistically, I’m probably going to do a bad job.
And that’s it. Notice I didn’t say anywhere to use the LLM to write the report. And I didn’t say to use the LLM to avoid reading the paper. I did this to get an executive summary of the manuscript in a form that I can digest so that when I do read it, I can handle it more efficiently.
Your mileage may vary on everything else, but I figured I’d come clean and just share how I’m doing all this, as I think perhaps not everyone knows this is possible. Maybe editors should know about it as well, since they may want to get something like this spun up to help them at the desk, if in fact the desk is getting more and more submissions.
