publishing.mathforge.org
discussion forum. The forum is no longer active, but much discussion took place on these pages, so an archive has been preserved.
One aspect of people’s unhappiness with Elsevier in particular, and with paywalled journals more broadly, is text mining: automatically extracting information from vast numbers of papers. This is technologically feasible, but Elsevier does not allow it, apparently because it’s something they might be able to monetize in the future.
Chemists are particularly interested in this, since data is scattered throughout the chemistry literature in ways that could be machine-recognizable. Is this something we should be excited about as mathematicians? And is there anyone out there working on text mining in the mathematics literature? We could clearly learn a lot about the literature itself, and that would be valuable, but I’m wondering whether we could learn more about mathematics. It seems plausible to me, but I’m having trouble coming up with really compelling examples.
I was once told you effectively cannot text-mine the arXiv (or rather, you cannot make the results public) because the arXiv would have to ask for permission during the submission process (which it still doesn’t, I think).
Judging by what Heather Piwowar went through, it seems difficult to get the rights (which also suggests that publishers are not familiar with these kinds of requests). See also Cameron Neylon’s biting comment in “They just don’t get it”.
Is there really an issue with text mining the arXiv? They have the legal right to distribute the papers, so it seems simple: they send you the papers and you process them however you like. Maybe I’m overlooking some subtlety, but http://arxiv.org/abs/cs/0702012 seems to be a case of arXiv text mining of a sort (although done by arXiv personnel).
I think with Elsevier there are two issues:
(1) They won’t give you all the papers: even if you have a subscription, if you try to download them all they will cut you off. [Of course, this part is reasonable. Even with the arXiv, they do not allow massive automated downloads, so you would have to get the papers another way.]
(2) They won’t give you access at all unless you agree to terms of use that do not allow text mining.
One thing I’d love to do is to search through the literature for apparent coincidences. For example, the start of monstrous moonshine was the observation that the monster group had a 196,883-dimensional irreducible representation while 196,884 was one of the coefficients of the j-function. It would be fun to do brute force searches for various sorts of coincidences and patterns. Probably nothing important would come up, but it’s the sort of experiment that could be worth trying.
For example, one could try to make an index of where each natural number occurs in the literature. For small numbers this would be ridiculous, and of course it’s not even worth trying to index every 0 or 1. However, I’d bet the occurrences thin out relatively quickly, especially if you ignore numbers with particularly simple descriptions. For example, how many papers have ever mentioned the number 3485? It would be interesting to see what the occurrences of the remaining numbers looked like. Some would just be random numbers from numerical examples, but certain numbers would show up much more frequently, and it could be fun to take a look at why.
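A minimal sketch of such an index, in Python, assuming the papers are already available as plain text. The paper ids, the sample sentences, the `minimum` cutoff of 100, and the function names are all made up for illustration. It also lists pairs of numbers that differ by 1 and never occur in the same paper, which is the shape of the 196,883 / 196,884 coincidence. Real papers would need TeX-aware parsing, and the pattern below deliberately skips decimals and comma-separated lists rather than guess at them.

```python
import re
from collections import defaultdict

# Integers written plainly (3485) or with thousands separators (196,883).
# The lookarounds reject pieces of decimals such as 3.14 and of lists such as 1,2,3.
NUMBER_RE = re.compile(r"(?<![\d.,])(?:\d{1,3}(?:,\d{3})+|\d+)(?!\d|[.,]\d)")


def extract_numbers(text, minimum=100):
    """Return the set of integers >= minimum mentioned in text."""
    found = set()
    for match in NUMBER_RE.finditer(text):
        value = int(match.group().replace(",", ""))
        if value >= minimum:  # small numbers are too common to be worth indexing
            found.add(value)
    return found


def build_index(papers, minimum=100):
    """Map each number to the sorted list of paper ids that mention it."""
    index = defaultdict(set)
    for paper_id, text in papers.items():
        for value in extract_numbers(text, minimum):
            index[value].add(paper_id)
    return {n: sorted(ids) for n, ids in index.items()}


def near_coincidences(index, gap=1):
    """Yield (n, n + gap, papers with n, papers with n + gap) when no paper mentions both."""
    for n in sorted(index):
        m = n + gap
        if m in index and not set(index[n]) & set(index[m]):
            yield n, m, index[n], index[m]


# Hypothetical three-paper corpus, just to exercise the functions.
papers = {
    "monster": "The Monster has an irreducible representation of dimension 196,883.",
    "j-function": "The q-expansion of j begins 1/q + 744 + 196884 q + ...",
    "misc": "We checked 3485 cases; see Section 3.14.",
}
index = build_index(papers)
for n, m, first, second in near_coincidences(index):
    print(n, first, "<->", m, second)
```

On a real corpus most hits from `near_coincidences` would be noise (page numbers, years, numerical examples), so some filtering by frequency or by context would be needed before anything interesting stood out.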
I think the arXMLiv project may be relevant to this discussion:
The last few years have seen the emergence of various content-oriented XML-based, content-oriented markup languages for mathematics and natural sciences on the web, e.g. OpenMath, Content MathML, or our own OMDoc and PhysML. These representation languages mathematics [sic], that makes the structure of the mathematical knowledge in a document explicit enough that machines can operate on it. The promise if these content-oriented approaches is that various tasks involved in doing mathematics (e.g. search, navigation, cross-referencing, quality control, user-adaptive presentation, proving, simulation) can be machine-supported, and thus the working mathematician is relieved to do what humans can still do infinitely better than machines.
In the arXMLiv project we try to translate the vast collection of scientific knowledge captured in the arXiv repository into content-based form, so that we can use it as a basis for added-value services.
https://trac.kwarc.info/arXMLiv/wiki/arXMLiv-project-description (authentication certificate probably needs to be accepted to open the page)
Four years ago, Scott Morrison mentioned that he had a local copy of the entire arXiv (http://sbseminar.wordpress.com/2008/03/12/mathematical-grammer-ii/#comment-2874). So it sounds like, at least in 2008, there was no difficulty getting the raw data.
@David - I know that the arXiv has been put on BitTorrent or similar in the past. How it got there I cannot guess, but it is a good insurance policy for them if such a thing happens periodically.
arXiv has put up the entire repository in an S3 bucket and you would just need to pay Amazon around $20 for the bandwidth (plus your own costs of hosting and processing), so it is rather tempting to try some large-scale data mining. When I have the time I’ll use this to add some features to arXaliv such as full-text search, and following/tracking citations, and I should be able to expose an API for it.
Ummm, I’ve mentioned this before here with a link but nobody seemed interested.
Maybe this post would have been better as a comment in that empty thread.