
An Artificial Intelligence (AI) system is only as good as its training. For Machine Learning (ML) and Deep Learning (DL) frameworks, the training datasets are a crucial element that defines how the system will operate. Feed it skewed or biased information and it will produce a flawed inference engine.

MIT recently removed a dataset that has been popular with AI developers. The training set, 80 Million Tiny Images, was scraped from Google in 2008 and used to train AI software to identify objects. It consists of images labeled with short descriptions. During the learning phase, an AI system ingests the dataset and ‘learns’ how to classify images. The problem is that many of the images are questionable and many of the labels are inappropriate. For example, women are described with derogatory terms, body parts are identified with offensive slang, and racial slurs are sometimes used to label people from minority groups. Such training should never be allowed.
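To make the ingest-and-learn phase concrete, here is a minimal sketch of how a classifier absorbs whatever labels a dataset provides. It assumes a PyTorch/torchvision workflow and a hypothetical directory "tiny_images/" organized as one folder per label; the path and layout are illustrative, not the actual 80 Million Tiny Images format.

```python
# Sketch: training on a labeled image set. Whatever terms the collectors
# used as folder names -- including offensive ones -- become the classes.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

transform = transforms.Compose([
    transforms.Resize((32, 32)),   # Tiny Images were 32x32 pixels
    transforms.ToTensor(),
])

# Hypothetical layout: tiny_images/<label_name>/<image>.png
dataset = datasets.ImageFolder("tiny_images/", transform=transform)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

model = models.resnet18(num_classes=len(dataset.classes))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()  # the model treats every label as ground truth
```

Note that nothing in this loop questions the labels: a slur used as a folder name is learned just as readily as a legitimate category.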

AI developers need vast amounts of data to train their systems. Collections are often assembled out of convenience, without consideration for the appropriateness of the content, copyright restrictions, compliance with licensing agreements, people’s privacy rights, or respect for society. Unfortunately, many of the available sets were haphazardly created by scraping the internet, social sites, copyrighted content, and human interactions without approval or notice.

Many of the most widely used training datasets have issues. A large number were created by unethically acquiring content, some contain derogatory or inflammatory information, and others are unrepresentative because they exclude groups that would benefit from inclusion.

The problem has become worse over time. Flawed datasets that were made openly available to the developer community early on became so popular that they are now considered a standard. These benchmarks are used to check accuracy and performance across different AI systems and configurations.

Too few datasets are vetted for what they include, where the content came from, its accuracy, or whether it is socially acceptable. Using such flawed records is simply unethical because the resulting systems can be racially charged, biased, and promote inequality.
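Even a basic automated vetting pass would catch the most obvious problems before training begins. The following is a hedged sketch of one such check, flagging class labels that match a curated blocklist; the blocklist file "offensive_terms.txt" and the folder-per-label layout are assumptions for illustration, not a standard tool.

```python
# Hypothetical vetting pass: refuse to train if any class label
# (folder name) appears on a curated blocklist of offensive terms.
from pathlib import Path

def audit_labels(dataset_root: str, blocklist_path: str) -> list[str]:
    """Return class labels found in the dataset that are on the blocklist."""
    blocked = {
        line.strip().lower()
        for line in Path(blocklist_path).read_text().splitlines()
        if line.strip()
    }
    labels = [p.name.lower() for p in Path(dataset_root).iterdir() if p.is_dir()]
    return [label for label in labels if label in blocked]

flagged = audit_labels("tiny_images/", "offensive_terms.txt")
if flagged:
    raise SystemExit(f"Refusing to train: flagged labels {flagged}")
```

A blocklist is only a first filter; it cannot judge image content, context, or representativeness, which is why human review of datasets remains essential.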

We cannot have good AI if the commonly used datasets create unethical systems. All datasets should be vetted, and both their creators and the product developers who use them held responsible. Just as chefs are held accountable for the ingredients they put into their prepared dishes, so should the AI community be held responsible for allowing poor data to result in harmful AI systems.

Interested in more? Follow me on LinkedIn, Medium, and Twitter (@Matt_Rosenquist) to hear insights, rants, and what is going on in cybersecurity.
