Actually, the snarXiv only generates tantalizing titles and abstracts at the moment, while the arXiv delivers matching papers as well. Details of the implementation are below. I’m the author, and I don’t remember exactly why I decided to do this. I did already have the framework lying around from a previous project, and I swear I spent more time doing research last weekend than implementing snarXiv.org.
Suggested Uses for the snarXiv
- If you’re a graduate student, gloomily read through the abstracts, thinking to yourself that you don’t understand papers on the real arXiv any better.
- If you’re a post-doc, reload until you find something to work on.
- If you’re a professor, get really excited when a paper claims to solve the hierarchy problem, the little hierarchy problem, the mu problem, and the confinement problem. Then experience profound disappointment.
- If you’re a famous physicist, keep reloading until you see your name on something, then claim credit for it.
- Everyone else should play arXiv vs. snarXiv.
The snarXiv is based on a context-free grammar (CFG), basically a set of rules for computer-generated mad libs. Each rule in a CFG consists of a term and a set of choices for how to make that term. The choices can contain text, other terms, or even recursive references to the term being defined. The CFG syntax used on the snarXiv is a collection of statements “term ::= choices”, where choices is a list of possibilities separated by “|”. Some possibilities are just text, but the ones that look like “<newterm>” are directions to go find the definition for newterm and fill it in. For instance, the following grammar
nounphrase ::= <noun> | <adj> <adj> <noun> | super <nounphrase>
noun ::= apple | pear | mailman
adj ::= smelly | chartreuse | enormous
can produce nounphrases like “apple,” “enormous smelly mailman,” or “super super smelly chartreuse mailman.” The snarXiv’s grammar is 622 lines long, and ends like this:
...
morecomments ::= <smallinteger> figures | JHEP style | Latex file | no figures | BibTeX | JHEP3 | typos corrected | <nzdigit> tables | added refs | minor changes | minor corrections | published in PRD | reference added | pdflatex | based on a talk given on <physicistname>'s <nzdigit>0th birthday | talk presented at the international <pluralphysconcept> workshop
comments ::= <smallinteger> pages | <comments>, <morecomments>
primarysubj ::= High Energy Physics - Theory (hep-th)| High Energy Physics - Phenomenology (hep-ph)|
secondarysubj ::= Nuclear Theory (nucl-th)| Cosmology and Extragalactic Astrophysics (astro-ph.CO)| General Relativity and Quantum Cosmology (gr-qc)| Statistical Mechanics (cond-mat.stat-mech)
papersubjects ::= <primarysubj> | <papersubjects>; <secondarysubj>
paper ::= <title> \\ <authors> \\ <comments> \\ <papersubjects> \\ <abstract>
...
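To make the “term ::= choices” syntax concrete, here is a minimal sketch of an interpreter for the toy nounphrase grammar from earlier. It is written in Python rather than the snarXiv’s own tooling, and the names parse_grammar and expand are made up for illustration. It assumes one rule per line, ignores empty choices left by a trailing “|”, and picks uniformly among the choices.

```python
import random
import re

TOY_GRAMMAR = """
nounphrase ::= <noun> | <adj> <adj> <noun> | super <nounphrase>
noun ::= apple | pear | mailman
adj ::= smelly | chartreuse | enormous
"""


def parse_grammar(text):
    """Map each term to its list of choices (assumes one rule per line)."""
    rules = {}
    for line in text.strip().splitlines():
        term, choices = line.split("::=", 1)
        # Drop empty choices, e.g. the one left behind by a trailing "|".
        rules[term.strip()] = [c.strip() for c in choices.split("|") if c.strip()]
    return rules


def expand(rules, term, rng=random):
    """Pick one choice for `term` at random and fill in every <newterm> in it."""
    choice = rng.choice(rules[term])
    # Each <newterm> is replaced by a fresh, recursive expansion of newterm.
    return re.sub(r"<(\w+)>", lambda m: expand(rules, m.group(1), rng), choice)


if __name__ == "__main__":
    rules = parse_grammar(TOY_GRAMMAR)
    for _ in range(5):
        print(expand(rules, "nounphrase"))
```

Because the substitution works on the raw choice string, terms glued to literal text, like <spacetype>s or <physicistname>'s, would expand without special handling.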
The coolest and most natural thing to do with a CFG is to exploit recursion as much as possible. The more recursion built in, the richer and less predictable the output. For instance, the following definition of a “space” uses three rules (space, singspace, and pluralspace) that refer to one another recursively in many different ways, allowing for a huge number of possibilities.
space ::= <pluralspace> | <singspace> | <mathspace>
singspace ::= a <spacetype> | a <spaceadj> <spacetype> | <properspacename> | <spaceadj> <properspacename> | <mathspace> | <mathspace> | a <bundletype> bundle over <space> | <singspace> fibered over <singspace> | the moduli space of <pluralspace> | a <spacetype> <spaceproperty> | the <spacepart> of <space> | a <group> <groupaction> of <singspace> | the near horizon geometry of <singspace>
pluralspace ::= <spacetype>s | <spaceadj> <spacetype>s | <n> copies of <mathspace> | <pluralspace> fibered over <space> | <spacetype>s <spaceproperty> | <bundletype> bundles over <space> | moduli spaces of <pluralspace> | <group> <groupaction>s of <pluralspace>
Of course, there’s also a danger that in a very small number of cases the output might be a little pathological. The nounphrase example above, for instance, can produce any phrase of the form “super super … super enormous smelly pear.” Similarly, the snarXiv occasionally mentions QFTs living on “the moduli space of moduli spaces of moduli spaces of moduli spaces of moduli spaces of SU(3) bundles over elliptically fibered Enriques surfaces.” Too much recursion can also quickly lead to exponentially long abstracts, which are even harder to read all the way through than the usual ones on the arXiv.
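A rough way to put numbers on this, assuming each choice is equally likely (which is how the snarXiv’s OCaml code picks them): if a rule refers back to itself m times per expansion on average, the expected number of expansions is the geometric series 1 + m + m² + …, which equals 1/(1 − m) for m < 1 and diverges otherwise. The Python sketch below applies that to the toy nounphrase rule, where m = 1/3, so a phrase has half a “super” on average. The function names are invented for illustration, and the estimate only counts direct self-references; mutually recursive rules like space, singspace, and pluralspace would need a matrix version of the same argument.

```python
import random
from fractions import Fraction

# Choices of the toy rule:
# nounphrase ::= <noun> | <adj> <adj> <noun> | super <nounphrase>
NOUNPHRASE_CHOICES = ["<noun>", "<adj> <adj> <noun>", "super <nounphrase>"]


def mean_self_references(term, choices):
    """Average number of direct references to `term` in a uniformly chosen choice."""
    tag = f"<{term}>"
    return Fraction(sum(choice.count(tag) for choice in choices), len(choices))


def expected_expansions(m):
    """Expected number of expansions of a rule with m self-references on average.

    Returns None when m >= 1, where the series 1 + m + m^2 + ... diverges.
    """
    if m >= 1:
        return None
    return 1 / (1 - m)


def count_supers(rng):
    """Simulate one nounphrase and count how often the 'super' choice was taken."""
    supers = 0
    while rng.choice(NOUNPHRASE_CHOICES) == "super <nounphrase>":
        supers += 1
    return supers


if __name__ == "__main__":
    m = mean_self_references("nounphrase", NOUNPHRASE_CHOICES)
    print("self-references per expansion:", m)
    print("expected expansions:", expected_expansions(m))
    rng = random.Random(0)
    samples = [count_supers(rng) for _ in range(20000)]
    print("simulated average number of supers:", sum(samples) / len(samples))
```

A hypothetical rule such as tree ::= leaf | <tree> and <tree> has m = 1, so its expected length is already infinite even though every individual expansion eventually stops.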
To get some actual output from the grammar definition, the most straightforward thing would be to write a script that reads in the grammar and works its way down the tree, starting with the top term and filling in definitions recursively until it gets a block of text. Instead of using an external script, the snarXiv compiles each grammar into its own program, a technique that originated in a freshman CS project and has evolved minimally since. It’s less straightforward and not clearly better, but maybe a bit more fun. A Perl script compiles the grammar file into OCaml code (snarxiv.ml):
type phrase = Str of string | Opts of phrase array array

let _ = Random.self_init ()

let randelt a = a.(Random.int (Array.length a))

let rec print phr = match phr with
    Str s -> print_string s
  | Opts options ->
      let parts = randelt options in
      Array.iter print parts

(* Grammar definitions *)
let rec top = Opts [|
  [| paper;|];
|]
...
and comments = Opts [|
  [| smallinteger; Str " pages";|];
  [| comments; Str ", "; morecomments;|];
|]
and primarysubj = Opts [|
  [| Str "High Energy Physics - Theory (hep-th)";|];
  [| Str "High Energy Physics - Phenomenology (hep-ph)";|];
|]
and secondarysubj = Opts [|
  [| Str "Nuclear Theory (nucl-th)";|];
  [| Str "Cosmology and Extragalactic Astrophysics (astro-ph.CO)";|];
  [| Str "General Relativity and Quantum Cosmology (gr-qc)";|];
  [| Str "Statistical Mechanics (cond-mat.stat-mech)";|];
|]
and papersubjects = Opts [|
  [| primarysubj;|];
  [| papersubjects; Str "; "; secondarysubj;|];
|]
and paper = Opts [|
  [| title; Str " \\\\ "; authors; Str " \\\\ "; comments; Str " \\\\ "; papersubjects; Str " \\\\ "; abstract; Str " ";|];
|]

let _ = print top
let _ = print_string "\n"
And snarxiv.ml is now a specialized program that, when compiled and run, spits out a paper title and abstract. This setup is more elaborate than necessary, but OCaml is a lovely language for recursive structures, and the code is nice and simple. OCaml is also fast, allowing the snarXiv to generate papers even more swiftly than your favorite Python script, or Ed Witten in the ’80s.
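The compile step itself could be sketched roughly as follows. This is Python standing in for the Perl script, which isn’t shown here, so compile_choice, compile_rule, and compile_grammar are hypothetical names, and the emitted OCaml only imitates the shape of the snarxiv.ml excerpt above. It assumes the grammar has already been split into a term-to-choices table, and it makes no attempt to handle terms that clash with OCaml keywords.

```python
import re

# Runtime support, copied from the snarxiv.ml excerpt.
PREAMBLE = """\
type phrase = Str of string | Opts of phrase array array
let _ = Random.self_init ()
let randelt a = a.(Random.int (Array.length a))
let rec print phr = match phr with
    Str s -> print_string s
  | Opts options -> let parts = randelt options in Array.iter print parts
"""

# The toy grammar, already split into choices (an assumption of this sketch).
TOY_RULES = {
    "nounphrase": ["<noun>", "<adj> <adj> <noun>", "super <nounphrase>"],
    "noun": ["apple", "pear", "mailman"],
    "adj": ["smelly", "chartreuse", "enormous"],
}


def ocaml_string(text):
    """Quote `text` as an OCaml string literal, escaping backslashes and quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def compile_choice(choice):
    """Turn one choice, e.g. '<smallinteger> pages', into an OCaml phrase array."""
    parts = []
    # The capturing group keeps the <term> markers in the split result.
    for piece in re.split(r"(<\w+>)", choice.strip()):
        if not piece:
            continue
        term = re.fullmatch(r"<(\w+)>", piece)
        parts.append(term.group(1) if term else "Str " + ocaml_string(piece))
    return "[| " + "; ".join(parts) + " |]"


def compile_rule(term, choices):
    """Emit 'term = Opts [| ... |]' with one inner array per choice."""
    options = "; ".join(compile_choice(choice) for choice in choices)
    return f"{term} = Opts [| {options} |]"


def compile_grammar(rules, top):
    """Emit a whole program: runtime, mutually recursive rules, then print `top`."""
    definitions = "\nand ".join(compile_rule(t, c) for t, c in rules.items())
    footer = f'let _ = print {top}\nlet _ = print_string "\\n"\n'
    return f"{PREAMBLE}\nlet rec {definitions}\n\n{footer}"


if __name__ == "__main__":
    print(compile_grammar(TOY_RULES, "nounphrase"))
```

Splitting each choice into terms and literal text, whitespace included, is what yields pieces like Str " pages" next to smallinteger in the excerpt, and escaping backslashes is why the grammar’s \\ shows up there as " \\\\ ".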
A few years ago, the CFG-based CS paper generator SCIgen made a splash by getting one of their papers accepted to the conference SCI 2005. Their website has details, and links to some other random generators around the web.
- For those who aren’t high energy physicists, and are still interested (though I can’t imagine who that would be), the “X” in arXiv or snarXiv is supposed to be a Greek chi. We’re meant to pronounce them like archive (as in “archive of physics papers”) and snarchive (as in “snarky archive of physics papers”).
- Please don’t sue me, arXiv.org, for stealing your CSS file and your beautiful color scheme. Also, Werner Heisenberg, if you’re still alive, please don’t sue me or my computer for libel.
- If someone pretentious is annoying you, and you use the theorem generator instead, you could try something like this.
- And check out the results. Also, pick up the unofficial arXiv vs. snarXiv wallpaper.
- I first encountered these in freshman year of college in an assignment for CS51: Abstraction and Design in Computer Programming. We had to implement a CFG in LISP, and the cleverest won its author lunch at the faculty club. The eventual winner was my friend Matt Gline’s theorem generator, which has since been enhanced with LaTeX, commutative diagrams, Ajax, and stuff like that.