What problem are we solving?

Trained on massive publicly available datasets, large language models (LLMs) have demonstrated tremendous success across various fields. While more data contributes to better performance, a disconcerting reality is that high-quality public data is expected to be exhausted within a few years. Meanwhile, a large amount of underutilized, distributed private data remains trapped on edge devices and within research institutions and private organizations working in their respective domains.

We must harness this vast amount of high-quality data to enhance AI training and shape the future of intelligent systems.

Contributors are willing to share their high-quality data, provided that their privacy is protected and they receive fair compensation for helping make AI models better and more efficient. Neither condition is met today.

Generative AI currently faces several significant challenges:

  • Issue of Data Ownership - Data ownership concerns arise when personal or proprietary data is shared, particularly with centralized AI systems. Organizations or individuals contributing data may want to retain control and rights over their data and how it’s used.

  • Data Provenance/Citation - In traditional AI systems, it’s challenging to track and verify the origins of data used for model training, which is crucial for credibility, trust, and compliance.

  • Data Scarcity - A lack of diverse data can lead to biased AI models, particularly when data from different populations or regions is scarce or unrepresentative.

  • Lack of data privacy - Sharing sensitive data in centralized models can lead to privacy risks and data breaches.

  • Lack of personalized learning - Generalized models often overlook user-specific patterns, reducing their effectiveness in personalized applications.

  • Compliance - Regulations like GDPR and CCPA require strict control over personal data, making it difficult for organizations to meet data handling requirements.

  • Integrity - In centralized systems, data can be tampered with, either accidentally or maliciously, affecting model accuracy and reliability.

  • Copyright Protection - When data is used in training models, there’s a risk of infringing upon copyright if data owners cannot control or track its use.

  • Trustworthy AI Outputs - AI systems are often viewed as black boxes, leading to skepticism about the reliability, fairness, and accuracy of their outputs.

  • Secure Data Sharing and Collaboration - Sharing data securely among multiple parties (e.g., in healthcare, finance, or research) is challenging due to privacy and confidentiality concerns.

Along with the above significant challenges, there are several problems related to centralized machine learning systems:

  1. Centralization: Current machine learning advancements are predominantly centralized, leading to limited accessibility and control over the models and datasets.

  2. Proprietary Datasets: The datasets required for training these centralized models are often proprietary, creating barriers to entry for individuals or organizations without access to such data.

  3. Monetization of Predictions: Predictions generated by these models are typically sold on a per-query basis, limiting their widespread use and accessibility.

  4. Model Obsolescence: Published models quickly become outdated without continuous effort to acquire new data and retrain them, posing a challenge to maintaining their effectiveness over time.


To tackle these issues, the proposed framework introduces Collaborative Model Update (CMU) powered by federated learning on blockchain.
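
As a rough illustration of the federated learning idea mentioned above, the sketch below shows one common scheme, federated averaging: each contributor trains on its own private data and shares only model weights, which a coordinator then averages. This is an assumption-laden toy, not the CMU protocol itself: the function names `local_update` and `federated_average`, the tiny linear model, and the hyperparameters are all hypothetical, and the blockchain layer is omitted entirely.

```python
from typing import List, Sequence, Tuple

Weights = List[float]            # [bias, slope] of a toy model y = w0 + w1 * x
Sample = Tuple[float, float]     # one private (x, y) observation


def local_update(global_w: Weights, samples: Sequence[Sample],
                 lr: float = 0.05, epochs: int = 20) -> Weights:
    """Hypothetical client step: refine the global weights on private data.

    Only the resulting weights are returned; the raw samples stay local.
    """
    w0, w1 = global_w
    n = len(samples)
    for _ in range(epochs):
        # Gradient of the mean squared error (up to a constant factor).
        g0 = sum((w0 + w1 * x - y) for x, y in samples) / n
        g1 = sum((w0 + w1 * x - y) * x for x, y in samples) / n
        w0 -= lr * g0
        w1 -= lr * g1
    return [w0, w1]


def federated_average(updates: Sequence[Tuple[Weights, int]]) -> Weights:
    """Hypothetical coordinator step: average client weights.

    Each client's weights count in proportion to its number of samples.
    """
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[i] * n for w, n in updates) / total for i in range(dim)]


def run_rounds(clients: Sequence[Sequence[Sample]], rounds: int) -> Weights:
    """Repeat local training and averaging for a fixed number of rounds."""
    global_w: Weights = [0.0, 0.0]
    for _ in range(rounds):
        updates = [(local_update(global_w, data), len(data)) for data in clients]
        global_w = federated_average(updates)
    return global_w
```

For example, calling `run_rounds` with two clients whose private samples both follow y = 2x + 1 moves the shared weights toward `[1.0, 2.0]` without either client revealing its samples. A real deployment would also need secure aggregation, contributor compensation, and tamper-evident record keeping, which this sketch does not attempt.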
