Downeaster "Alexa" Top Million!
- RG
- Apr 30
- 4 min read
Updated: May 7

The Downeaster ‘Alexa’ is a song on Billy Joel’s 1989 album Storm Front. The album reached #1 in the US, was certified quadruple platinum, placed five songs in the top 100, and was nominated for five Grammy Awards, so I suppose one could say it was fairly successful.
The song tells the story of a struggling fisherman off Long Island, near Billy Joel’s home. It describes real places, highlights the challenges fishermen face, and includes an unusual number of nautical and local references, which (I think) adds to the “feel” of the song. A “downeaster”, apparently, is a style of fishing boat widely used in the US Northeast, and the boat in the song takes its name from one Billy Joel owned at the time, named after his daughter, Alexa Ray Joel.
Alexa is also the name of a company that tried to map the internet. Alexa Internet focused on web traffic analysis, using bots (small programs) to “crawl” through web pages and gather information about them. Besides being a standard source for information about website popularity, the “Alexa Top Million” was frequently used as a proxy in research about the internet as a whole. Alexa’s crawl data also seeded the Internet Archive’s web collection, which is accessed through the Wayback Machine.
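To make “crawling” concrete, here’s a toy sketch of the idea in Python. Everything in it (the seed URL, the page limit) is purely illustrative; a production crawler like Alexa’s dealt with politeness, scale, and deduplication in ways this doesn’t even gesture at.

```python
# Toy illustration of what a web crawler does: fetch a page,
# note something about it, and follow its links. A minimal
# sketch, not how Alexa's crawler actually worked.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, limit=10):
    """Breadth-first crawl from `seed`, visiting at most `limit` pages."""
    queue, seen = [seed], set()
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue  # skip unreachable pages and non-http links
        parser = LinkParser()
        parser.feed(html)
        # Resolve relative links against the current page and queue them.
        queue.extend(urljoin(url, link) for link in parser.links)
        print(f"visited: {url} ({len(parser.links)} links)")
    return seen

# Example: crawl("https://example.com", limit=5)
```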
Amazon bought the company in 1999, but it continued to be an important source of information about the World Wide Web for many years. Then, in 2021, Amazon announced the end of its website ranking and competitive analysis services, and shut them down on May 1, 2022.
Alexa.com now redirects you to information about Amazon Alexa, since Amazon acquired the trademarks when it bought Alexa Internet.

But what do we do now, if we want to identify the most popular sites on the internet?
That question – and its answers – evolve along with the internet, to the point that asking it is arguably as meaningless as asking “How high is up?”
When the web consisted of a handful of static pages, it seemed entirely reasonable to count domains (i.e., the “wikipedia.org” part of https://www.wikipedia.org). But what does that mean now?
Domains still exist, of course, but how should we count cloud services? Say that Al and Bob both have blogs on a cloud service called ‘cloud.cloud’, and the blogs are called al.cloud.cloud and bob.cloud.cloud. Is that one “site”, or two? Or three?
And what about services like www.google.com and www.google.ca? Or the services which support them, like googleapis.com, or googletagmanager.com?
Or, just to make things fun with our example above, what if Al and Bob don’t get subdomains but directories instead, such as cloud.cloud/al and cloud.cloud/bob?
And what if some sites have dedicated domains as well, so that both al.cloud.cloud and alsblog.cloud point to the same site?
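To see how much the definition matters, here’s a minimal Python sketch that counts the hypothetical URLs above under three different rules: registrable domain, full hostname, and hostname plus first path segment. All the names are made up, and the domain extraction is deliberately naive.

```python
# How many "sites" are in this list? It depends entirely on the
# definition. The cloud.cloud URLs are the hypothetical examples
# from the text above.
from urllib.parse import urlsplit

urls = [
    "https://al.cloud.cloud/",
    "https://bob.cloud.cloud/",
    "https://cloud.cloud/al",
    "https://cloud.cloud/bob",
    "https://alsblog.cloud/",  # dedicated domain pointing at Al's blog
]

def registrable_domain(host):
    # Naive: keep the last two labels. Real code should consult the
    # Public Suffix List (e.g. via the tldextract package), since
    # suffixes like "co.uk" break this rule.
    return ".".join(host.split(".")[-2:])

def site_by_domain(url):
    return registrable_domain(urlsplit(url).hostname)

def site_by_host(url):
    return urlsplit(url).hostname

def site_by_host_and_path(url):
    parts = urlsplit(url)
    first_segment = parts.path.strip("/").split("/")[0]
    return f"{parts.hostname}/{first_segment}" if first_segment else parts.hostname

for label, rule in [("registrable domain", site_by_domain),
                    ("full hostname", site_by_host),
                    ("hostname + path", site_by_host_and_path)]:
    sites = {rule(u) for u in urls}
    print(f"{label}: {len(sites)} sites -> {sorted(sites)}")
```

The same five URLs yield two, four, or five “sites” depending on the rule, and no rule here can even tell that al.cloud.cloud and alsblog.cloud are the same blog.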
That’s just the tip of the iceberg. Just on the standalone domain side, we have web hosting services, content delivery networks (CDNs), proxy services like Cloudflare, and others, and that’s only one part of the internet. It’s complicated, to say the least.
As a result, any attempt to understand the internet is enormously complex, as are the tools for understanding it. Everyone needs to establish their own definition of a “site”, decide exactly what they want to track, figure out how to count it, and then figure out what to report. And all of that, of course, depends on being able to access the information in the first place. If you need to log into cloud.cloud before you can see anything, how would you track that? A bot would see a single “site”, but behind the login there could be thousands, depending on the definition used.
We should be sympathetic and acknowledge the magnitude and complexity of the task, particularly since it all changes minute-by-minute. In general, though, reporting by domain is one of the more common approaches.
Some organizations focus on SEO (Search Engine Optimization) and marketing, while others focus on networking and security. Some lists are free, while others come as part of a paid offering. Some are internet-wide, while others cover only a particular slice.
Wikipedia’s list of most-visited websites mentions Similarweb and Semrush, which both appear to be marketing data analysis companies that provide rankings as part of their paid offerings. In contrast, Common Crawl is a non-profit that provides its data for free. Other services, like Cloudflare Radar and Cisco Umbrella, provide more networking- and security-oriented data, but are limited to what passes through their own services. (For example, Cloudflare Radar focuses on DNS traffic to Cloudflare’s own resolver, 1.1.1.1.)
And then I found Tranco. This list is freely available and appears to be aimed at researchers studying web security and internet traffic. Its goal is to provide rankings that combine multiple sources, resist manipulation, and support reproducible results (each list gets a permanent reference, along with details on the sources used).
The download page clearly describes the list, with the sources, how they are used, and how they are consolidated. I think this is definitely worth more investigation!

On the latest list, wikipedia.org came in at #30, mozilla.org at #89, and archive.org at #154. Not bad. Could be better, but at least they’re up there.
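If you’d like to check those numbers yourself, the list is easy to work with. Here’s a minimal Python sketch, assuming the default download is still a zip archive containing a headerless CSV of rank,domain lines (the download page documents the current URL and format):

```python
# Minimal sketch for checking ranks on the Tranco list. Assumes the
# default download: a zip archive holding a single headerless CSV of
# "rank,domain" lines.
import csv
import io
import urllib.request
import zipfile

TRANCO_URL = "https://tranco-list.eu/top-1m.csv.zip"

def load_ranks(url=TRANCO_URL):
    """Download the list and return a {domain: rank} dictionary."""
    with urllib.request.urlopen(url) as resp:
        archive = zipfile.ZipFile(io.BytesIO(resp.read()))
    # The archive holds a single CSV file; read it as text.
    with archive.open(archive.namelist()[0]) as f:
        reader = csv.reader(io.TextIOWrapper(f, encoding="utf-8"))
        return {domain: int(rank) for rank, domain in reader}

ranks = load_ranks()
for domain in ("wikipedia.org", "mozilla.org", "archive.org"):
    rank = ranks.get(domain)
    print(f"{domain}: #{rank}" if rank else f"{domain}: not in the top million")
```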
They also publish their methodology, and the paper that started the ball rolling in the first place: “Tranco: A Research-Oriented Top Sites Ranking Hardened Against Manipulation”.
This is definitely something I’d like to learn more about, and will certainly consider using if I do any research that warrants it.
The world changes, but we adapt to it. This is how we survive.
Cheers!