
What data foundation in terms of format and volume is needed to train an AI?

In Can Do, the data already captured and maintained in the system is sufficient. This data pool grows over time and does not have to be collected separately for the AI.

Using its algorithms, Can Do automatically derives a wealth of information from comparatively little data. The AI then analyzes this information and learns from regular, day-to-day user behavior, so no additional effort is required.


Here is a basic explanation of the topic:

The data foundation required for training Artificial Intelligence (AI) varies greatly depending on the type of AI, the specific application domain, and the desired level of complexity. However, some general aspects regarding the format and volume of data can serve as guidance:

Format

  • Structured Data: These are data presented in an organized form, such as databases or tables, and often include numerical or categorical values. They are suitable for machine learning in areas like finance or customer relationship management.
  • Unstructured Data: This includes texts, images, videos, and audio files. Such data typically require preprocessing steps to be usable for machine learning.
  • Semi-structured Data: This combines elements of both categories above, such as emails or web pages that contain both clearly defined fields (like header information) and unstructured content (like body text).
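
To make the email example concrete, here is a minimal sketch that splits a message into its structured part (header fields) and its unstructured part (body text). It uses only Python's standard `email` module. The sample message and the helper name `split_email` are invented for illustration and are not part of Can Do.

```python
from email import message_from_string

# Hypothetical sample message, invented for illustration only.
RAW_EMAIL = """\
From: alice@example.com
To: bob@example.com
Subject: Project status

The rollout slipped by a week because testing took longer than planned.
"""


def split_email(raw: str) -> tuple[dict[str, str], str]:
    """Return (structured header fields, unstructured body text)."""
    message = message_from_string(raw)
    # Header fields are clearly defined name/value pairs: the structured part.
    headers = {name: value for name, value in message.items()}
    # The body of a plain, non-multipart message is free text: the unstructured part.
    body = message.get_payload()
    return headers, body.strip()


headers, body = split_email(RAW_EMAIL)
print(headers)
print(body)
```

The header dictionary can go straight into a table, while the body text would still need preprocessing before a model could use it.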

The choice of data format depends on the goal of the AI: text analysis requires large text corpora, image recognition requires image sets, and prediction models typically need structured historical data.

Volume

The required data volume also varies greatly:

  • Simple Models and Tasks: For less complex tasks or simple machine learning algorithms, small to medium-sized datasets (ranging from hundreds to tens of thousands of examples) may suffice.
  • Deep Learning: More complex models, especially in deep learning, often require very large datasets with millions of examples. Datasets of this size are needed to train the many parameters of deep neural networks effectively and to reduce the risk of overfitting.
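
As a rough illustration of why model complexity drives data volume, the sketch below counts the trainable parameters of fully connected networks. The layer sizes are hypothetical, and the parameter count is only a coarse indicator. There is no fixed rule that converts it into a required number of examples.

```python
def count_dense_parameters(layer_sizes: list[int]) -> int:
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total = 0
    for inputs, outputs in zip(layer_sizes, layer_sizes[1:]):
        total += inputs * outputs  # one weight per input/output connection
        total += outputs           # one bias per output unit
    return total


# Hypothetical architectures, chosen only for comparison.
simple_model = [20, 1]              # a linear model on 20 features
deep_model = [784, 512, 512, 10]    # a small multi-layer network

print(count_dense_parameters(simple_model))  # 21
print(count_dense_parameters(deep_model))    # 669706
```

Even this small multi-layer network has several hundred thousand values to fit, compared with 21 for the linear model, which is why deep models generally need far more examples.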

Further Considerations

  • Quality: The data must be of high quality, meaning relevant, complete, and correct. Dirty data must be cleaned so that it does not degrade the accuracy of the AI.
  • Diversity: The data should be diverse enough to cover all aspects of the problem that the AI is supposed to solve. Lack of diversity can lead to biases and poor performance in real-world applications.
  • Timeliness: Especially in fast-paced domains, training data should be up-to-date to ensure the relevance and effectiveness of the AI solutions.
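
The quality and timeliness checks above can be sketched as a simple report over a dataset. This is a minimal illustration under stated assumptions: the record fields (`task`, `hours`, `updated`), the sample values, and the one-year age threshold are all invented for the example and do not describe how Can Do stores or checks its data.

```python
from datetime import date

# Hypothetical records, invented for illustration only.
RECORDS = [
    {"task": "Design", "hours": 12.0, "updated": date(2024, 5, 2)},
    {"task": "Design", "hours": 12.0, "updated": date(2024, 5, 2)},   # exact duplicate
    {"task": "Testing", "hours": None, "updated": date(2024, 5, 3)},  # missing value
    {"task": "Rollout", "hours": 8.0, "updated": date(2021, 1, 15)},  # outdated
]


def quality_report(records: list[dict], today: date, max_age_days: int) -> dict[str, int]:
    """Count incomplete, duplicate, and outdated records."""
    report = {"incomplete": 0, "duplicates": 0, "outdated": 0}
    seen = set()
    for record in records:
        # Build a hashable fingerprint of the record to detect exact duplicates.
        key = tuple(sorted(record.items(), key=lambda item: item[0]))
        if key in seen:
            report["duplicates"] += 1
            continue  # do not count a duplicate in the other checks
        seen.add(key)
        if any(value is None for value in record.values()):
            report["incomplete"] += 1
        if (today - record["updated"]).days > max_age_days:
            report["outdated"] += 1
    return report


report = quality_report(RECORDS, today=date(2024, 6, 1), max_age_days=365)
print(report)
```

A report like this only flags candidates for cleaning. Deciding whether to correct, complete, or drop a record still depends on the application.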

In summary, the data foundation required for training an AI depends heavily on the specific application. Simple models can get by with smaller, simpler datasets, while more sophisticated applications require extensive, high-quality, and diverse data.