Indexing in (Gen)AI: reliability, reproducibility, and why it matters for your organization

During my generative (Gen)AI workshops, I always spend at least 15 minutes on one crucial topic: indexing. In the custom workshops I give to organizations and companies, I explain why indexing is fundamental to reliable (Gen)AI applications. Because I still notice a lot of uncertainty around it, I decided to share this knowledge more widely through this blog.
In this blog, I clearly explain what indexing is, why you need to take it into account, what this means for reproducibility and rights, and how indexing affects the reliability of information. Let’s start with the basics.
What is indexing?
Indexing is the process by which information is collected, stored, structured, and made searchable in a system. Without indexing, a (Gen)AI system cannot retrieve relevant information. But here it gets interesting: what gets indexed determines what becomes visible. In other words, (Gen)AI does not see “the whole internet.” It sees what falls within its index. And that has major consequences.
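To make this concrete, here is a minimal sketch of the classic data structure behind search: an inverted index, which maps terms to the documents that contain them. All names and documents below are illustrative, not any vendor's actual implementation; the point is that a query can only ever return what was indexed in the first place.

```python
from collections import defaultdict

# Toy corpus: imagine doc 3 was never crawled (paywall, robots.txt, licensing).
documents = {
    1: "reddit licensing deal with google",
    2: "eu ai act rules on training data",
    # 3: "recent niche publication"  <- never indexed, so never retrievable
}

# Build an inverted index: term -> set of document ids containing that term.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

def search(query: str) -> set[int]:
    """Return ids of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = inverted_index[terms[0]].copy()
    for term in terms[1:]:
        results &= inverted_index[term]
    return results

print(search("reddit google"))      # {1}
print(search("niche publication"))  # set() -- invisible, not nonexistent
```

Real (Gen)AI systems use far richer indexes (embeddings, rankers, freshness signals), but the constraint is identical: what is outside the index is outside the answer.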
Indexing determines reliability
Indexing is not just about storage, but also about selection. (Gen)AI systems work with certain data sources that may be limited in selection, exclusively accessible, commercially agreed upon, or temporarily available. This means the quality of the output strongly depends on which sources are indexed, how current that index is, and what exclusive partnerships exist.
For example, in February 2024, Reddit struck a content licensing deal with Google reportedly worth about $60 million per year, giving Google access to Reddit content for (Gen)AI training. Reddit subsequently adjusted its robots.txt file, blocking search engines such as Bing, DuckDuckGo, and other competitors from crawling new Reddit posts, leaving Google as the only major search engine with that access. This explains why Reddit content is more prominently present in Google products and less visible elsewhere in (Gen)AI (Reuters, 2024; SiliconANGLE, 2024).
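Reddit's block works through the web's standard robots.txt mechanism. As an illustration, the sketch below uses Python's built-in urllib.robotparser to show how a compliant crawler decides whether it may fetch a page. The user-agent strings are real crawler names, but the actual outcome depends on whatever Reddit's live robots.txt says at the moment you run it.

```python
from urllib.robotparser import RobotFileParser

# A compliant crawler fetches and parses robots.txt before indexing anything.
parser = RobotFileParser()
parser.set_url("https://www.reddit.com/robots.txt")
parser.read()  # downloads the live robots.txt over the network

# Whether a crawler may fetch a page depends on its user agent:
for agent in ("Googlebot", "Bingbot", "DuckDuckBot"):
    allowed = parser.can_fetch(agent, "https://www.reddit.com/r/some_post")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```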
Perplexity AI, on the other hand, has access to certain scientific literature sources, which means the quality of its answers can be higher in cases where academic support is needed. This clearly shows that indexing determines perspective.
What happens with limited or poor indexing?
When indexing is missing or limited, problems arise. Information may not be current, URLs may no longer be available, or sources may have disappeared behind paywalls.
In such situations, a (Gen)AI system may start to hallucinate. Hallucinations occur not only when a model cannot find a correct source in its index, but also due to structural limitations in the training process itself. Think of shortcut learning, teacher forcing, and unbalanced representation of training data. These cause models to generate confidently incorrect information, even when relevant sources are available (Huang et al., 2025; Tonmoy et al., 2025).
This is not deliberate deception, but a structural limitation. That is why I always say in my workshops that (Gen)AI is only as reliable as its index.
An important and often underestimated point is that generative AI systems do not automatically search the entire internet for every question. They work with a pre-compiled index that is regularly updated but never in real time. This means recent developments, new publications, trends, or changed legislation are often not available in the (Gen)AI output. Even large, advanced systems sometimes miss information from smaller or less popular sources, and some sources can never be indexed at all due to copyright restrictions, privacy rules, or exclusive licensing agreements. The information you get through (Gen)AI is therefore never complete and always depends on the selection and currency of the index used. This is one of the reasons it is crucial to remain critical and check whether the data is current, complete, and reliable.
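Here is a minimal sketch of why this matters, with entirely illustrative titles and dates: a retrieval layer can only return documents that were in the snapshot when it was last built, so a publication from after the snapshot date is not "ranked low" — it is simply absent.

```python
from datetime import date

# Illustrative snapshot date: the index is refreshed periodically, never live.
INDEX_LAST_UPDATED = date(2026, 1, 15)

# Only documents crawled before the snapshot exist in the index at all.
indexed_documents = [
    {"title": "Regulation overview", "published": date(2025, 11, 2)},
    {"title": "Vendor comparison", "published": date(2025, 6, 30)},
]

def retrieve(query: str) -> list[str]:
    """Naive keyword retrieval over the snapshot: nothing newer can appear."""
    return [
        d["title"] for d in indexed_documents
        if query.lower() in d["title"].lower()
        and d["published"] <= INDEX_LAST_UPDATED
    ]

# An amendment published after INDEX_LAST_UPDATED was never crawled into the
# snapshot, so no phrasing of the prompt can surface it:
print(retrieve("regulation"))  # ['Regulation overview']
print(retrieve("amendment"))   # [] -- absent, not merely ranked low
```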
Indexing, reproducibility, and rights
Besides reliability, there is also a legal and ethical dimension. When you input data into a (Gen)AI tool, that data—depending on the terms—can be stored, used to improve the model, shared with partners, or made reproducible within an ecosystem.
Perplexity AI, for example, states in its terms of service that submitted content may be used for operational purposes such as display and distribution under license (Perplexity AI, 2026a; Perplexity AI, 2026b).
In February 2026, OpenAI also reached an agreement with the US Department of Defense to deploy its AI models within the Pentagon’s classified network. The contract included explicit ethical safeguards, such as a ban on domestic mass surveillance and a requirement of human accountability in the use of force, including autonomous weapons systems. Following intense public and political criticism, the contract was further tightened on March 3, 2026 (Politico, 2026; The New York Times, 2026). At the same time, rival AI developer Anthropic publicly refused such cooperation altogether: CEO Dario Amodei stated that AI systems are not yet reliable enough for fully autonomous weapons and that participation in the Pentagon program carries risks for privacy and democratic values (Republic World, 2026; AI Haberleri, 2026). This underlines how important it is to be aware of what you enter and share online, especially when working with (Gen)AI: through contracts, indexing, and partnerships, data can be used in ways that affect reliability and privacy.
The above does not mean that you cannot use (Gen)AI. It means you must be aware of where your data goes, who potentially has access to it, and under what conditions this happens.
The quality of training data: an underestimated risk
It is also worth looking more closely at how the quality of training data takes shape within the AI ecosystem.
(Gen)AI models are trained on enormous amounts of data from the public internet, including websites, forums, social media, news articles, and scientific publications. Additionally, models learn from interactions with users: the questions people ask, the corrections they make, and the feedback they give all contribute to the further development of the model. This has a direct impact on the quality of the generated output. After all, the internet does not contain only reliable, expert-verified information. A considerable portion of what is published online comes from anonymous users, unqualified sources, outdated insights, or biased perspectives. When an AI model is trained on such data, it inevitably adopts the inaccuracies, biases, and knowledge gaps of that data. This also affects indexing: because AI systems select what to index based on what appears available and reliable, inaccurate or biased sources can end up disproportionately represented.
This phenomenon is referred to in scientific literature as “garbage in, garbage out”: the quality of the output is never better than the quality of the input data. It is therefore of great importance that users of AI systems are aware of this principle and never blindly trust AI-generated information. Critical thinking, additional research through primary and authoritative sources, and consulting subject experts remain indispensable. Not everyone who publishes online is an expert in the relevant field, and not everything online is factually correct or current (Bender et al., 2021).
Why all of this is essential for organizations
Many organizations today focus on efficiency and innovation. But without insight into indexing, you risk strategic blindness, because you only see part of reality. This can lead to faulty decision-making, reputational damage, or legal risks. Indexing directly affects the quality of analyses, the objectivity of answers, the currency of information, and the reproducibility of input. Precisely because data flows through (Gen)AI, it is all the more important to be critical of the output. That is why I explicitly make time for this in every workshop and emphasize the importance of conscious and critical use of (Gen)AI systems.
Additionally, indexing plays a role in how complete and nuanced (Gen)AI information is. Because (Gen)AI models can only use the data in their index, important context may be missing. This not only limits the completeness of answers but can also remove nuance, causing complex topics to be represented too simply. Incomplete indexing can lead to a skewed representation of facts, for example by ignoring small, specialized, or newer sources that are not standardly indexed. It also means certain sectors, niche knowledge, or local developments are less well included.
From a European legal perspective, the EU AI Act rules that apply from 2026 oblige all providers of general-purpose AI models (GPAI) to be transparent about their training data and to actively honor copyright opt-outs. This has direct consequences for how organizations in Europe may and must assess AI suppliers when procuring and deploying them (ScaleVise, 2026). The same Act also requires that employees within many organizations and companies be AI-literate; awareness of indexing and its effect on their work is therefore essential.
That is why it is essential that organizations understand (Gen)AI output is never automatically “complete” and is always a reflection of the chosen index and its selection criteria.
Tips for increasing the reliability of (Gen)AI data
To increase the reliability of data obtained through (Gen)AI, there are a number of practical techniques and best practices you can apply. Below is an overview of 15 concrete tips you can integrate into your work with (Gen)AI systems (a short code sketch after the list illustrates the cross-checking tip in practice):
- Formulate prompts explicitly to look from multiple perspectives instead of just one interpretation.
- Add an instruction in prompts that only recent information may be used, for example after January 2026.
- Ask in your prompt for at least 10 different sources to be cited in the answer, and then always verify those sources yourself via the original publications. A prompt instructs a model but does not guarantee that ten unique sources were actually consulted.
- Ask the (Gen)AI explicitly to check or mention the reliability of each source.
- Have the (Gen)AI make a summary of conflicting sources and name the differences.
- Always use multiple (Gen)AI tools to check whether the output is consistent.
- Have the (Gen)AI explicitly critique its own generated text and indicate possible errors.
- Check factual data manually through original or primary sources.
- State in prompts that only information from verified and authoritative sources may be used.
- Ask the (Gen)AI to provide references or hyperlinks with every fact.
- Use a checklist for logical inconsistencies and verify those with the (Gen)AI.
- Ask the (Gen)AI to create a timeline so it becomes clear which information is current.
- Have (Gen)AI distinguish between facts, opinions, and interpretations in its output.
- Implement a process whereby the generated information is approved by an expert within the field.
- Keep track of which sources (Gen)AI consistently consults when repeatedly using the same prompt, and check for possible bias or outdated information.
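As promised above, here is a minimal sketch of the cross-checking tip. The two ask_* functions are hypothetical placeholders returning canned strings; in a real setup you would replace their bodies with calls to each vendor's official SDK or API.

```python
# Minimal cross-check of the same prompt against two (Gen)AI tools.
# ask_model_a / ask_model_b are hypothetical stand-ins; replace their bodies
# with real calls to each vendor's official SDK or API.

def ask_model_a(prompt: str) -> str:
    return "The EU AI Act applies to GPAI providers."   # placeholder answer

def ask_model_b(prompt: str) -> str:
    return "The EU AI Act obliges GPAI transparency."   # placeholder answer

def cross_check(prompt: str) -> None:
    """Send one prompt to both tools and flag divergence for human review."""
    answers = {"model_a": ask_model_a(prompt), "model_b": ask_model_b(prompt)}
    if len(set(answers.values())) > 1:
        print("Answers diverge -- verify against primary sources:")
    else:
        print("Answers agree (still verify; agreement is not proof):")
    for name, answer in answers.items():
        print(f"  {name}: {answer}")

cross_check("What does the EU AI Act require of GPAI providers?")
```

Agreement between tools is weak evidence, since they may share training data and indexes; divergence, however, is a strong signal to go back to primary sources.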
By applying these techniques, you significantly increase the reliability of (Gen)AI-generated data and reduce the risk of hallucinations, outdated information, or unintended bias.
Click here for an additional blog on how to stay vigilant when using (Gen)AI.
Summary
Below I have placed a visual summary. Click on the image to enlarge it. Sharing it online is allowed, provided my website is mentioned: www.maryayaqin.com
Do you want to go deeper into this?
Do you want to organize a custom workshop on (Gen)AI literacy or exchange thoughts about reliable implementation of (Gen)AI within your organization? Do you need support with (Gen)AI initiatives and creating internal support? Feel free to contact me. Together we ensure that (Gen)AI is not only smart and innovative, but also reliable, strategically thoughtful, and responsibly deployed.
Source list
- Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT). ACM. https://dl.acm.org/doi/10.1145/3442188.3445922
- Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., & Liu, T. (2025). Large language models hallucination: A comprehensive survey (arXiv:2510.06265v2). arXiv. https://arxiv.org/html/2510.06265v2
- Perplexity AI. (2026a, February 25). Enterprise terms of service. https://www.perplexity.ai/hub/legal/enterprise-terms-of-service
- Perplexity AI. (2026b, February 25). Perplexity API terms of service. https://www.perplexity.ai/hub/legal/perplexity-api-terms-of-service
- Politico. (2026, February 28). OpenAI announces new deal with Pentagon, including ethical safeguards. https://www.politico.com/news/2026/02/28/openai-announces-new-deal-with-pentagon-including-ethical-safeguards-00805546
- Reuters. (2024, February 22). Exclusive: Reddit in AI content licensing deal with Google. https://www.reuters.com/technology/reddit-ai-content-licensing-deal-with-google-sources-say-2024-02-22/
- ScaleVise. (2026, February 27). EU AI Act 2026: New rules for training data and copyright. https://scalevise.com/resources/eu-ai-act-2026-changes/
- SiliconANGLE. (2024, July 25). Reddit blocks Bing, several other search engines from indexing its platform. https://siliconangle.com/2024/07/25/reddit-blocks-bing-several-search-engines-indexing-platform/
- The New York Times. (2026, February 27). OpenAI reaches A.I. agreement with Defense Dept. after Anthropic clash. https://www.nytimes.com/2026/02/27/technology/openai-agreement-pentagon-ai.html
- Tonmoy, S. M. T. I., Zaman, S. M. M., Jain, V., Rani, A., Ghosh, D., & Bhattacharya, S. (2025). Survey and analysis of hallucinations in large language models. Frontiers in Artificial Intelligence. https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2025.1622292/full


