Our take on the ethical maze of data scraping in AI

Every day, an astonishing 2.5 quintillion bytes of data are generated globally. 

AI models harness a significant portion of this data through data scraping. This practice is central to AI advancements but has recently entangled leading AI entities like OpenAI and Microsoft in legal and ethical controversies.

How can we navigate the ethical maze of data scraping in AI training? Below, we take a closer look at how we gather data to train the Secure Redact AI models.


Behind the AI curtain 

Data scraping in AI involves web crawling – an automated process where vast amounts of internet data, ranging from text to images, are harvested. This data undergoes extraction, aggregation, and preprocessing, forming the foundation for training AI models. 
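
To make the pipeline concrete, here is a minimal sketch of a crawl, extract, and preprocess loop using only the Python standard library. The URL, user agent, and preprocessing step are illustrative assumptions, not a description of any real production pipeline.

```python
# Minimal sketch of a crawl -> extract -> preprocess loop.
# The URL, user agent, and preprocessing are illustrative assumptions.
import time
import urllib.request
import urllib.robotparser
from html.parser import HTMLParser
from urllib.parse import urlparse


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def allowed_by_robots(url, user_agent="example-crawler"):
    """Honour robots.txt -- one of the access controls sites use."""
    parts = urlparse(url)
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()
    return robots.can_fetch(user_agent, url)


def scrape(url):
    """Fetch a page and reduce it to whitespace-normalised text."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    extractor = TextExtractor()
    extractor.feed(html)
    # Preprocessing here is just whitespace normalisation; real pipelines
    # also deduplicate, filter, and tokenise before training.
    return " ".join(extractor.chunks)


if __name__ == "__main__":
    for url in ["https://example.com/"]:  # placeholder crawl frontier
        if allowed_by_robots(url):
            print(scrape(url)[:200])
        time.sleep(1)  # simple rate limiting between requests
```

The robots.txt check and the sleep between requests illustrate the kind of access restrictions and rate limiting that platforms increasingly enforce, as discussed below.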

These models are continually refined with new data to improve their accuracy and capabilities. Google's recent acknowledgement of its data scraping practices, and Twitter's countermeasures against such activity (like access restrictions and rate limiting), underscore how widespread the practice has become.


The ethical dilemma: privacy vs progress

The AI field is caught in a stark dichotomy: the relentless pursuit of technological progress is at odds with individual privacy.

Some experts raise concerns over data transparency and the difficulty of removing data from a model once it has been trained. And while scraping publicly available data is generally legal in the US, the practice still raises significant public and legal concerns.

The use of publicly available data, as in the case of Clearview AI (which scraped billions of public photos to build a facial recognition system), raises questions about how scraped data is deployed and whether those uses were ever anticipated by the people who shared it. The debate extends to the fair use of copyrighted material: does using such material for AI training fall under fair use, or does it infringe on intellectual property rights?

The legality of these practices is murky. Even if it’s illegal by copyright law, proving it is a huge task due to the closed nature of these systems.
— Gary Bhumbra, Head of Machine Learning at Pimloc, the makers of Secure Redact

Transparency and accountability are key

There’s an inherent opacity in AI models. They’re released without disclosing how or from where their data is gathered. Isn’t it questionable to obtain information without user consent, especially when ownership of web-shared data becomes a grey area?
— Gary Bhumbra, Head of Machine Learning at Pimloc, the makers of Secure Redact

The demand for transparency in AI data practices is growing, championed by AI experts and ethicists such as Timnit Gebru, notably in the 2018 paper "Datasheets for Datasets". This need for openness is also evident in legal and regulatory domains, with a push for laws that balance data protection with ethical AI development. Robust citation practices and recognition of copyright in AI development (particularly for large language models) could significantly raise ethical standards.

I think companies should probably say where they got their data from.
— James Leigh, CTO at Pimloc, the makers of Secure Redact

In the UK, there are ongoing efforts to update and share codes of practice and guidance, but no law currently mandates source citation or copyright acknowledgement in AI training. The need for such specific legislation underscores the importance of accountability in AI development. By requiring models to cite sources and acknowledge copyright, such legislation would foster transparency and ensure that creators and rights holders are duly recognised for their contributions. This not only addresses ethical concerns but also builds a culture of respect and responsibility towards intellectual property, ultimately leading to more trustworthy and ethically grounded AI systems.

Imagine a law where a language model, like an LLM, must cite sources and give recognition to copyright holders. Without this, it could be considered fraudulent.
— Gary Bhumbra, Head of Machine Learning at Pimloc, the makers of Secure Redact
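
As a thought experiment, here is a minimal sketch of what per-record provenance metadata might look like in a training corpus. The schema and field names are hypothetical illustrations, not an existing standard or any vendor's format.

```python
# Hypothetical per-record provenance schema; field names are illustrative,
# not an existing standard.
from dataclasses import dataclass, asdict
import json


@dataclass
class ProvenanceRecord:
    source_url: str     # where the text or image was obtained
    retrieved_at: str   # ISO 8601 timestamp of collection
    licence: str        # e.g. "CC-BY-4.0", "proprietary", "unknown"
    rights_holder: str  # creator or copyright owner, if identifiable
    consent_basis: str  # e.g. "permissive licence", "explicit consent"


record = ProvenanceRecord(
    source_url="https://example.com/article",  # placeholder
    retrieved_at="2024-01-15T09:30:00Z",
    licence="CC-BY-4.0",
    rights_holder="Example Author",
    consent_basis="permissive licence",
)

# Stored alongside each training example, metadata like this would make
# source citation and copyright acknowledgement auditable after the fact.
print(json.dumps(asdict(record), indent=2))
```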

A case study in responsible AI training: Secure Redact

At Secure Redact, we try to model ethical AI practice. We use publicly available datasets with permissive licences and responsibly gather data (such as tailored video footage) so that our team can train our AI models without infringing on privacy. Our approach is centred on least privilege and data anonymisation, reflecting our commitment to ethical AI and our mission: to advance visual AI systems in the interests of people and their freedoms.
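
As a rough illustration of what a permissive-licence policy can look like in practice, here is a sketch of a licence gate applied to candidate training data. The allow-list and record format are assumptions for illustration, not Pimloc's actual policy or tooling.

```python
# Sketch of a licence gate for candidate training data.
# The allow-list is an illustrative assumption, not an actual policy.
PERMISSIVE_LICENCES = {"CC0-1.0", "CC-BY-4.0", "MIT", "Apache-2.0"}


def admit_for_training(record: dict) -> bool:
    """Admit a record only if its licence is explicitly permissive."""
    return record.get("licence") in PERMISSIVE_LICENCES


candidates = [
    {"id": 1, "licence": "CC-BY-4.0"},
    {"id": 2, "licence": "unknown"},      # rejected: provenance unclear
    {"id": 3, "licence": "proprietary"},  # rejected: not permissive
]
admitted = [r for r in candidates if admit_for_training(r)]
print([r["id"] for r in admitted])  # -> [1]
```

The key design choice in a gate like this is that records with unknown provenance are rejected by default, rather than admitted until someone objects.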


As the AI industry evolves, the conversation around ethical AI development becomes increasingly relevant. A sustainable path forward involves striking a balance between innovation and privacy, to ensure progress in AI does not undermine individual rights. 

The industry must prioritise transparency and responsible data usage, adhering to ethical standards that protect both the public interest and the integrity of intellectual property. Data scraping is not a trend that will slow down any time soon, so it falls to industry leaders and experts to balance these developments with ethical practice.


How can you harness the power of AI to manage video responsibly?
