This data scientist wants to build an archive about the history of internet measurement

  • Jim Cowie, co-founder and Chief Data Scientist at DeepMacro, is calling for the creation of an online library about internet measurement.
  • He sees the task as three steps: save the records, build the narrative, and explore the results.

Jim Cowie, co-founder and Chief Data Scientist at DeepMacro, recently posted an article titled “Thinking about Internet history” on the APNIC website. He has more than 25 years of experience as a data storyteller in Internet measurement and recently launched the Internet History Initiative, which aims to build an Internet library for future historians by piecing together the recorded history of the Internet.

Curate history to interpret it and make it accessible and meaningful for future scholars.

Jim Cowie, co-founder and Chief Data Scientist at DeepMacro

Cowie argues that if we want to ensure that the story of the Internet is preserved in a quantifiable way for future generations of scholars, and that data is brought together to protect it from irreversible damage, we basically have three collective tasks to accomplish before we all forget how it works:

  • Preserve history by collecting irreplaceable records of how the Internet evolved.
  • Collate history to explain it and make it accessible and meaningful to future scholars.
  • Explore history and create tools and visualizations that everyone can enjoy and celebrate.

Step 1: Save

So what should we keep?

In addition to active measurements, we need to keep a record of registry data (to whom each network resource was assigned on every day in history, from ARIN, RIPE NCC, and APNIC) and any information we can find about the DNS name associated with each IP address on a given day. Together, these are clues to what all these Internet hosts are doing, and to where on Earth they may be located.

Refactor the Internet into a point-in-time database

Finally, all of this DNS and registry data is highly ephemeral: it can change daily without warning. If we later want to build credible indicators, such as the density of Internet hosts in a given area, we have to record the time of each observation.

Recall that in the 2010s, the exhaustion of the available IPv4 pool triggered a wave of sales and international reallocation of network address blocks, so that (for example) a block of network addresses that once hosted DSL customers in Romania might disappear from the Internet for a while, only to reappear in a data center in Saudi Arabia serving web pages. The geography of the Internet changes quickly, so a geographic map of all IP addresses and the purpose of each one is not enough. We also need to know what this map looked like on each day over the past few decades, as the hosts and resources associated with each IP address moved and changed function.
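To make the point-in-time idea concrete, here is a minimal sketch in Python. It is not taken from Cowie's article or from the Internet History Initiative. The class and function names, the holder labels, and the dates are all hypothetical, and the address block is a reserved documentation range rather than a real allocation. The sketch assumes each observation is stored with the date it was made, so that a lookup returns whatever was known on or before a chosen day.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from datetime import date
from ipaddress import ip_network


@dataclass
class PrefixHistory:
    """Dated observations for one address block, kept sorted by date."""
    dates: list = field(default_factory=list)
    records: list = field(default_factory=list)

    def observe(self, day, holder, country):
        # Insert in date order so that lookups can use binary search.
        i = bisect_right(self.dates, day)
        self.dates.insert(i, day)
        self.records.insert(i, (holder, country))

    def as_of(self, day):
        # Latest observation made on or before `day`, or None if there is none.
        i = bisect_right(self.dates, day)
        return self.records[i - 1] if i else None


# Hypothetical in-memory store: one history per address block.
history = {}


def record(prefix, day, holder, country):
    history.setdefault(ip_network(prefix), PrefixHistory()).observe(day, holder, country)


def lookup(prefix, day):
    block = history.get(ip_network(prefix))
    return block.as_of(day) if block else None


# Made-up observations for a documentation-only block (192.0.2.0/24).
record("192.0.2.0/24", date(2008, 1, 1), "example-dsl-provider", "RO")
record("192.0.2.0/24", date(2014, 6, 1), None, None)  # block no longer seen
record("192.0.2.0/24", date(2016, 3, 1), "example-data-center", "SA")

print(lookup("192.0.2.0/24", date(2010, 5, 5)))
print(lookup("192.0.2.0/24", date(2020, 1, 1)))
```

Asking about the same block on different dates gives different answers, which is the property a historical archive would need. A real archive would also have to handle overlapping prefixes, conflicting sources, and far larger volumes of data than an in-memory dictionary can hold.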

Step 2: Narrative

Once we have successfully preserved all of our endangered digital datasets, we can begin to curate them and tell their stories. Most Internet measurement research has focused on operational issues in the here and now: monitoring slowdowns and shutdowns within and between providers, and figuring out how the Internet routes traffic around damage. The question of historical evolution is often secondary. We can find new ways to look at the Internet through the lens of history to get past this “operational trap.”

Part of the reason we do this is to encourage slower-growing, less-diverse parts of the Internet to grow faster, and it’s true that the national regulatory environment (and the central role of state providers in many economies) can prompt some parts of the Internet to behave in economy-specific ways. But Jim Cowie hopes that for the sake of future historians, we can find better ways to maintain geographic intuition, rather than falling into some kind of cognitive trap that sees a national Internet footprint as just another sovereign border to defend.

Some of these “workload fragments” are very specific in time and place, which matters to those who want to understand Internet connectivity in the context of historical events. For example, what was it like for academic users in China to use Google search in 2009? What was it like for a mobile user in Cairo trying to access Wikipedia in 2011? What did the South American financial sector's connections to Bloomberg and Reuters look like in the 2000s? How diverse was the hosting of Ethereum nodes in 2020, or of Mastodon servers in 2023, relative to Internet consumers around the world? Some of these fragments are highly relevant: we might be able to map where their hosts are embedded in the Internet and visualize the connections between the providers that support a given part of the workload.

Step 3: Explore

The reason we strive to preserve and organize the history of the Internet as a technological product is to help the public understand how the Internet works its magic. Today’s Internet works incredibly well, in large part because of the specific conditions under which it grew and developed: under multi-stakeholder governance, which tends to value decentralized openness and innovation, rather than under a multilateral treaty system, in which centralized authorities may be more inclined to prioritize security, predictability, and control.

Once we have saved the history of the Internet and recruited thoughtful scientists who can help us quantify some of the social benefits (net of social costs) of the Internet, we will need tools to help tell those stories. These will mostly be visualizations, perhaps immersive walkthroughs, and certainly the kind of interactive exhibits that data journalists use to inform and entertain. “Our investment in providing these datasets will open the door to larger collaborations with artists, journalists and visual storytellers.”

That’s what Jim Cowie wants to get started with. We can confidently predict that just as the Internet has changed society, society will certainly continue to change the Internet through some competing combination of top-down regulation with bottom-up innovation and popular demand.

For those who care about the future of the Internet, the race is now on to become better librarians of Internet history so that we can preserve and tell the great stories of the Internet.

Fei Wang

Fei Wang is a journalist with BTW Media, specialising in Internet governance and IT infrastructure, with a focus on interviewing leaders in the technology industry. Fei holds a Master of Science degree from the University of Edinburgh. Have a tip? Reach out at f.wang@btw.media.