Link Graph Analysis for esolangs.org

| tags: eso web

I was recently talked into taking on the role of the negligent MediaWiki administrator for the esoteric programming languages wiki. As a consequence, I have now a local copy of the XML dump of the wiki contents (that anyone can download). To celebrate Easter in the traditional manner, then, I've spent some time computing statistics of the link graph of the wiki.

Background

For the following statistics, vertices of the graph are all articles in the main, Category: and Esolang: namespaces, while links from A to B are the (directed) edges. Multiple links with the same target have been generally considered multiple edges; i.e., the link graph is a multidigraph.

The XML dump used was dated 2014-04-08T10:07:14Z. There are some known imperfections in the XML dump parsing: redirect pages have been considered separate vertices in their own right (with a single link to the redirect target), and only links in the raw wikitext source have been counted. (E.g., links hidden inside transcluded templates as well as outgoing links from the automatically generated category pages have been ignored.) In cases where it makes more sense, analysis has been restricted to the "main" strongly connected subgraph.

General Statistics

There were 1781 vertices and 10815 edges, or 10143 edges when collapsing multiple edges with the same head and tail.

The maximum outdegree was 858 ("Language list"), and the maximum indegree was 759 ("Category:Languages"). For full lists sorted by the out- and indegree, see outdegree.txt and indegree.txt. Maximum edge multiplicty was 17, achieved by the 17 links from Brainbool to Brainfuck.

The graph diameter (largest value of the length of the shortest directed path between any pair of vertices) was 14. This distance was achieved by the 24 pairs of vertices listed in diameter.txt.

The largest weakly connected component had 1692 vertices; the remaining 89 vertices were divided in 1 component of size 3, 5 components of size 2 and 76 components of size 1. The components of size 2 and 3 are mostly redirect pages.

The largest strongly connected component had 1303 vertices; the remaining 478 vertices were divided in 4 components of size 2 and 470 components of size 1. The four strongly connected components of size 2 are indicated in the table below.

# Pages
1 SELECT.
SELECT./Hello World
2 Preposterous Programming
Preposterous Programming Language
3 Postfix notation
Prefix notation
4 Esolang:Community portal
The Esoteric File Archive

Closeness Centrality

Looking only at the primary strongly connected component, it takes an average of 3.6990 clicks to get from one page to another. Some pages, however, are more central than others. The following table shows the top and bottom 10 pages ordered by the average number of clicks it takes to get to an arbitrary article, when starting from the mentioned page. (Many of the bottom ten are, quite naturally, redirect pages.)

This analysis was inspired by Six Degrees of Wikipedia.

Rank Clicks Page
1 1.6224 Language list
2 2.2955 Main Page
3 2.3147 List of ideas
4 2.3691 Joke language list
5 2.4198 Esoteric programming language
6 2.4428 Esolang:Featured languages/Candidates
7 2.6201 Category:Languages
8 2.7183 Hello world program in esoteric languages
9 2.9747 Brainfuck
10 3.0276 Popular problem
1294 5.3860 Pedro Gimeno
1295 5.4505 Funciton/Functions
1296 5.4574 Migol
1297 5.4597 Migol 11/IO Functions
1298 5.4820 Recursor
1299 5.6876 ByteByte
1300 6.0169 Amelia
1301 6.0783 Church integer
1302 6.2072 BytePushCore
1303 6.2072 PUSHEM

The full list of articles ranked by closeness is at center.txt.

Markov Chain Distribution

Looking again at the strongly connected component, we can build a Markov chain model, with transition probabilities \[ p(x_t \mid x_{t-1}) = \frac{\left| L(x_{t-1}, x_t) \right|}{\sum_x \left| L_(x_{t-1}, x) \right|}, \] where \(\left| L(x, y) \right|\) is the number of links from \(x\) to \(y\). The stationary distribution of the model (it might not exist, or there might be several; we will ignore these possibilities) corresponds to a (normalized) left eigenvector of the transition matrix \(\left[\mathbf{P}\right]_{ij} = p(x_t=j \mid x_{t-1}=i)\) with an eigenvalue of 1. See e.g. Wikipedia for further details.

In any case, intuitively the stationary distribution denotes the relative frequency of encountering a particular article, when clicking links randomly forever. This is quite close to the well-known PageRank algorithm, though the latter has some further bells and whistles, such as the damping factor. Top ten pages in this sense (highest probability in the stationary distribution) are given in the table below; full list can be found at markov.txt.

Rank Freq. Page
1 0.0868 Language list
2 0.0847 Category:Languages
3 0.0542 Esoteric programming language
4 0.0351 Brainfuck
5 0.0339 Turing-complete
6 0.0202 Turing machine
7 0.0195 Category:Implemented
8 0.0190 Computational class
9 0.0153 Befunge
10 0.0131 Category:Concepts

Shortest Path Finder

The following form can[*] be used to find shortest path between any two articles that are part of the link graph. Do note that this is not going to stay up to date with edits in the wiki. It will (try to) load the full link graph on first click, so a small delay is normal. With luck, that will also add page title completion to the input boxes, assuming HTML5 <datalist> support.

[*] Assuming a browser happy with quickly hacked-together JavaScript, exact spelling of titles, etc.


Force-Directed Graph Visualization

Courtesy of d3.js, here is also a force-directed graph layout visualization of the strongly connected component. It can be pretty slow, and it will be rather complex and messy. Vertices are colored based on their namespace (blue for main, orange for Category: and dark red for Esolang:), and hovering over a node should both show its name as a tooltip as well as highlight its immediate neighbors.