I’m starting a systematic literature review for a research project and keep seeing recommendations for different AI tools, but I’m not sure which ones are actually reliable, accurate, and acceptable in an academic setting. I’d really appreciate suggestions on which AI tools you’ve personally used for literature review (searching, screening, summarizing, organizing papers), what worked well, what didn’t, and any pitfalls or ethical issues I should watch out for so I don’t harm the quality of my review.
Short version: use AI as a helper for drafting and organizing, not for searching, screening, or extracting final data.
Here is what tends to be acceptable in academic settings for a systematic review:
- Literature search
Use established databases and tools.
• Core: PubMed, Web of Science, Scopus, PsycINFO, Cochrane, etc.
• For search-building help, you can use:
- ChatGPT / Gemini / Claude to draft initial search strings, then you refine with a librarian or your own expertise.
- Tools like Yale MeSH Analyzer for MeSH checking.
AI output here must be checked line by line. Do not trust it to know database syntax or indexing.
- Screening
Most ethics guidelines say you should not delegate inclusion/exclusion to opaque AI models.
Use:
• Rayyan, Covidence, EPPI-Reviewer, or SyRF.
These use simpler ML to prioritize articles, but you still make the yes/no decisions. Journals accept these widely; a rough sketch of the underlying ranking idea follows below.
Avoid using LLMs to auto-decide “include / exclude” without human review.
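For intuition, here is a rough sketch (in Python, with made-up data; this is not Rayyan's or Covidence's actual algorithm) of the kind of "simpler ML" prioritization these platforms use: train on your own early include/exclude decisions, then surface the most likely includes first.

```python
# Rough illustration of ML-based screening prioritization: train on your own
# early include/exclude labels, then rank remaining abstracts by predicted
# relevance. A human still makes every final decision.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical data: abstracts you have already screened, plus unscreened ones.
labeled_abstracts = ["RCT of smartphone CBT for depression...",
                     "Review of crop yields under drought..."]
labels = [1, 0]  # 1 = included, 0 = excluded (your human decisions)
unscreened = ["Pilot trial of app-based therapy...",
              "Survey of soil microbial diversity..."]

vectorizer = TfidfVectorizer(stop_words="english")
model = LogisticRegression().fit(vectorizer.fit_transform(labeled_abstracts), labels)

# Rank unscreened abstracts by predicted probability of inclusion.
scores = model.predict_proba(vectorizer.transform(unscreened))[:, 1]
for score, abstract in sorted(zip(scores, unscreened), reverse=True):
    print(f"{score:.2f}  {abstract[:60]}")
```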
- Data extraction
Keep this manual or use structured tools.
• Excel, REDCap, Covidence, EPPI-Reviewer, DistillerSR.
You can use ChatGPT or similar to:
• Help design extraction tables.
• Rephrase complicated sentences into simpler language while you check accuracy.
Do not let AI extract numeric results or risk of bias judgments unsupervised. It hallucinates and misses details.
- Risk of bias and quality assessment
Use standard tools.
• Cochrane RoB 2, ROBINS-I, Newcastle-Ottawa Scale, GRADE, AMSTAR 2, etc.
AI can help:
• Explain what each domain means.
• Draft your justification text from notes you provide.
You still need to read the trials and score them yourself.
- Writing the protocol and manuscript
Safe and helpful uses:
• Draft sections based on your bullet points.
• Improve clarity and grammar.
• Suggest structure, headings, and flow.
You must:
• Provide all substantive content.
• Check every reference and every number.
• Ensure there are no made-up citations; LLMs frequently invent references. Cross-check every reference in PubMed or Google Scholar.
- Acceptability and transparency
Check:
• Journal policies. Some demand an AI disclosure statement.
• Your institution’s research integrity or AI policy.
Best practice:
• Document where you used AI, for what, and which tool, including version and date.
Example: “We used ChatGPT (OpenAI, version X, accessed Month Year) to help edit language and refine text. All content was reviewed and verified by the authors.”
- Tools people often use and how
• General LLMs (ChatGPT, Gemini, Claude, Copilot)
- Safe for: phrasing help, summarizing papers you already uploaded, brainstorming keywords.
- Risky for: automatic screening, citation lists, quantitative data extraction.
• Domain tools with some AI, more widely accepted:
- Rayyan for screening prioritization.
- Covidence or DistillerSR for workflow management.
- Zotero, Mendeley, EndNote for reference management.
These have clear, documented methods rather than black-box “chat” features making decisions for you.
- What to avoid
• Letting an LLM generate your reference list or PRISMA counts.
• Letting an LLM paraphrase huge blocks without checking for meaning drift.
• Using AI to fabricate rationales for risk of bias or results.
Editors and reviewers catch inconsistencies fast.
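On the PRISMA counts specifically: they should stay simple, human-verifiable arithmetic over your own screening log, never something an LLM asserts. A minimal sketch, assuming a hypothetical CSV export with a `decision` column (adapt the labels to however Rayyan or Covidence exports for you):

```python
# Minimal sketch: derive PRISMA flow counts from your own screening log
# rather than asking an LLM for them. Column names and decision labels
# here are hypothetical.
import csv
from collections import Counter

with open("screening_log.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))  # one row per record

counts = Counter(row["decision"] for row in rows)
identified = len(rows)
duplicates = counts["duplicate"]
screened = identified - duplicates
excluded_ti_ab = counts["exclude_title_abstract"]
fulltext_assessed = counts["exclude_fulltext"] + counts["include"]
included = counts["include"]

print(f"Identified: {identified}, duplicates removed: {duplicates}")
print(f"Screened: {screened}, excluded at title/abstract: {excluded_ti_ab}")
print(f"Full text assessed: {fulltext_assessed}, included: {included}")
```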
If you want a simple workflow:
- Draft protocol in Word. Use an LLM to polish language.
- Build searches with a librarian plus PubMed and the other core databases. Store records in Zotero.
- Use Rayyan or Covidence for screening. You and a second reviewer decide.
- Extract data in Excel or Covidence.
- Use GRADE or similar for certainty.
- Use an LLM only to clean writing, summarize sections you already wrote, and help reorganize text.
- Declare AI use in the methods or acknowledgments.
I mostly agree with @ombrasilente’s take, but I’m slightly less strict about “never use AI for X” and more in the “use it like a nosy RA you don’t really trust” camp.
A few practical angles that haven’t been stressed yet:
1. Where AI actually saves you time (and is defensible)
- Refining your question & scope: Before you even write the protocol, LLMs are great for stress‑testing your PICO, suggesting alternative comparators, or pointing out adjacent terms you might miss. Then you sanity‑check all of it against real literature.
- Conceptual mapping: Paste a few key papers and ask the model to list recurring concepts, measures, and outcomes. Use that to design your data extraction form. You still define the actual items, but AI can surface “Oh, everyone’s using scale X and Y” faster than you scanning 20 PDFs.
- Terminology harmonization: If your field uses messy or overlapping language, AI can help you map “digital therapeutics / mobile health interventions / smartphone‑based CBT” into a cleaner taxonomy that you then document and justify.
2. Where some people do use LLMs, but you have to be brutally careful
This is where I slightly disagree with the super-strict stance. Some teams are experimenting with LLMs for:
- Priority screening: Using an LLM to rank abstracts from “most likely relevant” to “probably irrelevant,” then you still screen everything, but in a smarter order. If you do this:
- You must keep a full audit trail (a minimal logging sketch follows below).
- You must still double‑screen a subset manually to check the model isn’t systematically excluding a subgroup.
- You need to state explicitly in the methods exactly how you used it.
It’s not mainstream‑accepted yet, but in fast‑moving fields (e.g., AI in medicine, COVID‑era topics) some reviewers quietly do this internally, as long as final inclusion/exclusion is human and reproducible.
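To make “full audit trail” concrete, here is a minimal sketch of the kind of logging I mean. The file name, field layout, and model string are all hypothetical; the point is that every AI interaction is recorded verbatim before anyone acts on it:

```python
# Minimal sketch of an audit trail for LLM-assisted priority screening.
# Every ranking or suggestion event is appended to a CSV you can hand to
# a reviewer or ethics board later.
import csv
from datetime import datetime, timezone

AUDIT_FILE = "llm_screening_audit.csv"

def log_llm_event(record_id: str, model: str, action: str, output: str) -> None:
    """Append one LLM interaction to the audit log."""
    with open(AUDIT_FILE, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([
            datetime.now(timezone.utc).isoformat(),  # when it happened
            record_id,                               # which abstract/record
            model,                                   # which model and version
            action,                                  # e.g. "rank", "explain"
            output,                                  # raw model output, verbatim
        ])

# Usage: log each (hypothetical) ranking call before anyone acts on it.
log_llm_event("PMID:12345678", "model-name-and-version", "rank", "score=0.91")
```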
- First‑pass extraction for narrative details: For long, messy results sections, I’ve seen people ask an LLM: “List all outcomes and timepoints mentioned here” or “Highlight the sample characteristics.” Then they manually transcribe and verify into their form. AI here is just a high‑speed highlighter, not the source of truth. Journals won’t complain if you clearly say all final data were verified manually.
3. Tool choices by “risk” level rather than by task
Think about tools in tiers:
- Low risk, almost always acceptable
- General LLMs (ChatGPT, Claude, Gemini, Copilot) for:
- Language polishing
- Brainstorming keywords or synonyms
- Explaining stats or methodology in simpler language
- Reference managers: Zotero, EndNote, Mendeley
- Workflow platforms that are mostly “dumb” (Covidence, Rayyan, DistillerSR, EPPI‑Reviewer)
- Medium risk, acceptable with transparency
- Using LLMs to summarize specific uploaded PDFs, when you already know they are in your corpus.
- Using them to help reword your risk of bias justifications from bullet notes you wrote.
- Generating alternative phrasings for your inclusion/exclusion criteria.
- High risk, currently sketchy in academic eyes
- Letting AI identify “all the studies on X” instead of doing formal database searches.
- Letting AI decide inclusion/exclusion without human sign‑off.
- Letting AI output final effect sizes or PRISMA numbers.
If a methodologist or statistician on a review panel would raise an eyebrow, avoid or at least treat it as purely exploratory.
4. How to make editors and supervisors less nervous about your AI use
Concrete moves that help a lot:
- Keep versioned notes: “On Feb 5, used ChatGPT (GPT‑4) to help rephrase background section; all content double‑checked against original sources.” Sounds boring, but this is exactly the kind of line that satisfies institutional audits.
- Make AI use non‑essential: If you had to strip out every AI contribution tomorrow, your review should still be reproducible from the protocol, search strings, and extraction sheets. That’s the mental test.
- Involve a librarian early: If you mention, “Search strategy was developed with an experienced medical librarian,” reviewers immediately care less about whatever minor LLM help you used for phrasing.
5. Choosing which LLM / ecosystem
Since you asked about tools, not just principles:
- If your institution provides a managed, privacy‑aware instance (e.g., “enterprise” ChatGPT, Claude for Teams, institutional Copilot), use that for anything involving unpublished notes or sensitive data. It helps with compliance arguments.
- If not, keep identifiable or sensitive info out of the prompts. Use short excerpts or anonymized notes instead of dumping full patient‑level details or confidential protocols (a rough redaction sketch follows after this list).
- Don’t obsess over tiny performance differences between big models for this use case. What matters more:
- Can you export chat logs for documentation?
- Does it handle long PDFs without cutting corners too much?
- Does your supervisor / IT department allow it?
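On keeping sensitive info out of prompts, here is a rough redaction sketch. Regex scrubbing is not a substitute for proper de-identification; it just catches obvious leaks (emails, phone-like numbers, record IDs) before you paste an excerpt:

```python
# Rough sketch: scrub obvious identifiers from text before pasting it into
# an LLM prompt. This is a safety net, NOT real de-identification.
import re

PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b(?:MRN|ID)[:\s]*\d+\b", re.IGNORECASE), "[RECORD-ID]"),
]

def redact(text: str) -> str:
    """Replace obvious identifiers with placeholders."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(redact("Contact jane.doe@hospital.org, MRN: 449213, phone 555-867-5309."))
# -> "Contact [EMAIL], [RECORD-ID], phone [PHONE]."
```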
6. Quick decision rule you can actually follow
Every time you think of using AI, ask:
- “If I put this step in my methods section, would a picky reviewer shrug or start writing a complaint?”
- “If the AI vanished, could I redo this step with human work and get essentially the same scientific result?”
If the answer is “shrug” and “yes,” you’re probably in the safe zone. If not, keep AI in the “sidekick, not coauthor” role.
TL;DR:
- Keep the heavy, decisive stuff human and transparent.
- Use AI aggressively for drafting, clarifying, organizing, and concept‑mapping.
- Treat any AI use around screening or extraction as experimental, heavily audited, and always secondary to human judgment.
If you think of AI in your review as “a nosy RA you don’t really trust,” I’d actually go one step further: treat it like a very bright but chronically sloppy co‑op student. Useful, but it should never be the only person in charge of anything that ends up in your PRISMA diagram.
Instead of repeating what @ombrasilente already covered, I’d frame it around tool ecosystems rather than isolated tasks.
1. Ecosystem 1: Classic SR stack + “bolt‑on” LLM
This is the conservative, highly defensible route.
Core stack
- Reference manager: Zotero / EndNote
- Screening platform: Rayyan, Covidence, DistillerSR, EPPI‑Reviewer
- Stats & meta‑analysis: R (metafor, meta), RevMan, Stata, etc.
Where an LLM fits here
- Turn ugly scoping notes into a clean protocol draft.
- Translate search strings between databases.
- Turn your extraction sheet headings into standardized variable definitions.
I actually disagree slightly with @ombrasilente on not using AI to help with effect‑size thinking. You should not let it compute effect sizes you just trust, but you can give it an example paper and ask:
“Given this outcome and design, what effect size measures are typically used, and what would be the pros/cons of each?”
That can be a big time saver when you are picking your statistical strategy, as long as you confirm everything in proper methods papers.
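And if you want to sanity-check what the model tells you, the underlying arithmetic is simple enough to verify by hand. A minimal sketch in plain Python of Hedges' g and a DerSimonian-Laird random-effects pool, with made-up numbers; use validated packages like metafor for the real analysis:

```python
# Minimal sketch of the arithmetic behind Hedges' g and a DerSimonian-Laird
# random-effects pool. Data below is invented; this is for verification and
# intuition only, not for your actual meta-analysis.
import math

def hedges_g(m1, s1, n1, m2, s2, n2):
    """Standardized mean difference with small-sample correction."""
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sp
    j = 1 - 3 / (4 * (n1 + n2) - 9)          # Hedges' correction factor
    g = j * d
    var = (n1 + n2) / (n1 * n2) + g**2 / (2 * (n1 + n2))  # approximate variance
    return g, var

def dersimonian_laird(effects, variances):
    """Random-effects pooled estimate and its standard error."""
    w = [1 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)   # between-study variance
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    return pooled, math.sqrt(1 / sum(w_star))

# Hypothetical studies: (mean1, sd1, n1, mean2, sd2, n2)
studies = [(12.1, 4.0, 40, 10.3, 4.2, 38), (15.5, 5.1, 60, 13.0, 5.0, 55)]
gs, vs = zip(*(hedges_g(*s) for s in studies))
est, se = dersimonian_laird(gs, vs)
print(f"Pooled g = {est:.2f} (SE {se:.2f})")
```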
2. Ecosystem 2: AI‑centric screening helpers (experimental but useful)
You mentioned “reliable, accurate, acceptable.” This is where the line gets blurry.
You can:
- Use an LLM to generate rationales for why an abstract might be in or out.
- Pair that with human coders who decide, but sometimes use the AI explanation to spot subtleties they would miss.
I’m less enthusiastic than @ombrasilente about LLM‑driven priority screening. Ranking is fine in theory, but it is incredibly hard to prove it is not systematically down‑ranking entire subfields or marginalized populations. If you try it:
- Keep a random sample that you screen in the original random order and directly compare inclusion rates (a minimal sketch of this check follows below).
- Treat this as a methodological side experiment, not part of your core protocol unless your team explicitly approves it.
If a journal asks “What would have changed without AI?” and your answer is “We would have found different studies,” that is a red flag. Your answer should be “Nothing substantive; we just did the same things slower.”
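Here is a minimal sketch of the comparison I mean: hold out a sample screened in the original random order, and compare its inclusion rate against the AI-prioritized stream. The decisions below are hypothetical human judgments:

```python
# Minimal sketch of the bias check described above. All decisions are human;
# the data is made up.

def inclusion_rate(decisions):
    """Fraction of records a human reviewer marked include (1) vs exclude (0)."""
    return sum(decisions) / len(decisions)

# Hypothetical human decisions on two samples of the same corpus:
ai_order_decisions = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]      # screened in AI-ranked order
random_order_decisions = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]  # held-out, original random order

gap = inclusion_rate(ai_order_decisions) - inclusion_rate(random_order_decisions)
print(f"AI order: {inclusion_rate(ai_order_decisions):.2f}, "
      f"random order: {inclusion_rate(random_order_decisions):.2f}, gap: {gap:+.2f}")
# Some gap is expected early (ranking front-loads relevant records), but a
# persistent, systematic gap deserves a proper statistical test, and the
# *kinds* of studies being surfaced should also be audited by hand.
```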
3. Ecosystem 3: AI for synthesis & writing, not for discovery
Where AI is safest and most academically palatable is post‑extraction.
Examples that usually pass even strict supervisors:
- Turning raw extraction tables into first‑draft narrative summaries, which you then edit heavily.
- Asking it to produce competing conceptual models of how the interventions might work, based on the included studies, which you then argue for or against.
- Checking logical consistency between your inclusion criteria, PICO, and the way you actually discuss the evidence.
Here I slightly push beyond @ombrasilente’s caution: it is fine to let the model “push back” on your interpretation. Asking “What is the strongest argument that my conclusion is overstated, given these results?” can reveal your own blind spots. Still, all final claims are yours.
4. Quick “sanity checklist” for any AI tool you pick
Before you lock in a tool, ask:
- Does it store or train on my data?
- If yes, you probably cannot feed it confidential protocols, peer review reports, or nonpublic datasets.
- Can I export dialogues/logs as evidence for methods transparency?
- This is critical if a supervisor, ethics board, or reviewer wants proof of how you used it.
- Can it handle long inputs without silently truncating?
- Many models claim big context windows but still skip detail. Test on one long paper you know well.
- Is there an institutionally approved version?
- If your university offers an enterprise LLM, use that first. It resolves 80% of policy headaches.
5. On the specific product you mentioned (pros & cons)
Since you asked about a specific AI‑assisted SR product, here is how I would position something like that in a systematic review workflow, assuming it is marketed as an AI‑assisted SR helper.
Potential pros
- Could centralize key SR tasks in one place: screening, extraction, summarization.
- Might integrate LLM‑based summarization directly with your extraction tables, so you don’t copy/paste between tools.
- If it logs every AI action, that is gold for methodological transparency.
- Can be a good teaching tool for junior team members to see how questions map to structured extractions.
Potential cons
- If it claims to “find all the relevant studies for you,” be extremely wary. That is not yet methodologically accepted as a standalone search.
- Any hidden or opaque ranking/screening algorithm is a reproducibility problem. If you cannot describe the algorithm in Methods, treat it as auxiliary only.
- Vendor lock‑in risk: if your project outlives the product, you must still be able to export your data to generic formats (CSV, RIS, JSON); a minimal export sketch follows after this list.
- Journal reviewers may be skeptical until there is a solid validation paper for that specific tool. You might end up spending time defending it.
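The lock-in point is worth making concrete. A minimal export sketch, assuming hypothetical record dicts mapped from whatever the product gives you:

```python
# Minimal sketch of the "escape hatch": dump your records to RIS so the
# project survives any single product. The record dicts are hypothetical;
# map them from whatever export your tool provides.
def to_ris(record: dict) -> str:
    lines = ["TY  - JOUR"]                      # journal article record type
    for author in record.get("authors", []):
        lines.append(f"AU  - {author}")
    lines.append(f"TI  - {record['title']}")
    lines.append(f"PY  - {record['year']}")
    if record.get("doi"):
        lines.append(f"DO  - {record['doi']}")
    lines.append("ER  - ")                      # end of record
    return "\n".join(lines)

records = [{"title": "Example trial of app-based CBT", "year": 2023,
            "authors": ["Doe, J.", "Roe, R."], "doi": "10.1000/example"}]
with open("export.ris", "w", encoding="utf-8") as f:
    f.write("\n".join(to_ris(r) for r in records))
```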
In short, a product like this is fine as a convenience layer on top of a standard SR pipeline, but not as a replacement for core PRISMA‑compliant steps.
6. How to integrate AI without triggering reviewer alarms
Concrete moves that keep you safe:
- Define in your protocol which steps are strictly human decisions:
- Final inclusion / exclusion
- Risk of bias judgments
- Effect size calculations and meta‑analysis
- Document AI tasks at the level of “role,” not brand:
- “A large language model was used for drafting plain‑language summaries; all text was verified and edited by the authors.”
- Involve at least one librarian or experienced methodologist, and explicitly state this. Their stamp of approval matters more than which AI model you used.
If you boil all of this down:
- Use AI heavily for thinking, organizing, drafting, and rephrasing.
- Use it cautiously, and in an auditable way, for screening support or extraction support, keeping humans as the definitive gatekeepers.
- Do not let any tool, including the product you mentioned, become a black box that determines what goes into your evidence base.