Query Google Scholar using Python
In desperate need to organize my collection of scientific papers, I had a look at various tools that could help me organize them. Probably one of the best out there is Mendeley, which seems to be a very good tool for keeping a massive collection of PDFs under control. Unfortunately a very basic function, namely looking up a newly imported paper in Google Scholar to get attributes like authors, year, etc. right, is tied to a Mendeley account. I guess that’s their way of forcing users to participate in their community stuff, since without the Google Scholar lookup Mendeley is pretty useless unless you want to fill in all the attributes manually.
So I decided to write my own tool to do the lookup. Unfortunately Google does not really want to give away that precious data: they don’t provide an API and even block certain User-Agents from accessing the page. Then, there is also the problem of scraping the results page to get the right data.
The first problem can be trivially solved by setting a common User-Agent string; the second one can be elegantly circumvented by using the bibtex files provided in the search results. The bibtex entries are however only shown if you enabled them in the settings, which are stored in a cookie. After a few tries, I figured out that the CF attribute (citation format?) controls which bibliography format is offered on the results page, and that CF=4 corresponds to bibtex. Generating a fake cookie is easy, but you have to know what must be included. In this case it looks like a 16-digit hex string as ID plus the CF attribute is sufficient. The ID is probably supposed to be your id, but a randomly generated one also works like a charm.
The resulting cookie looks like this: GSP=ID=762a112b5c765732:CF=4
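As a minimal sketch of how such a cookie value could be put together in Python 3: the helper name `make_gsp_cookie` is made up for illustration, and `secrets.token_hex` is swapped in here as a simpler way to get 16 random hex digits than hashing a random number.

```python
import secrets


def make_gsp_cookie(citation_format=4):
    """Build a GSP cookie value with a random 16-digit hex ID (hypothetical helper)."""
    # token_hex(8) returns 8 random bytes rendered as 16 hex characters.
    fake_id = secrets.token_hex(8)
    # CF=4 is the value this post found to correspond to bibtex.
    return "GSP=ID=%s:CF=%d" % (fake_id, citation_format)


print(make_gsp_cookie())
```

The printed value has the same shape as the cookie shown above, with a different random ID on every run.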
All you have to do now is to query Google Scholar using the User-Agent string and the cookie:
...
# fake google id (looks like it is a 16 elements hex)
google_id = hashlib.md5(str(random.random())).hexdigest()[:16]

GOOGLE_SCHOLAR_URL = "https://scholar.google.com"
HEADERS = {'User-Agent' : 'Mozilla/5.0',
           'Cookie' : 'GSP=ID=%s:CF=4' % google_id }

def query(searchstr):
    """Return a list of bibtex items."""
    searchstr = '/scholar?q='+urllib2.quote(searchstr)
    url = GOOGLE_SCHOLAR_URL + searchstr
    request = urllib2.Request(url, headers=HEADERS)
    response = urllib2.urlopen(request)
    html = response.read()
    # grab the bibtex links
    ...
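The snippet above relies on `urllib2`, which only exists in Python 2. A rough Python 3 equivalent of the request construction might look like the following. This is a sketch under assumptions, not the tool's actual code: `build_request` is a hypothetical helper, and it only builds the request object without sending it.

```python
import urllib.parse
import urllib.request

GOOGLE_SCHOLAR_URL = "https://scholar.google.com"


def build_request(searchstr, google_id):
    """Build (but do not send) the Scholar request; hypothetical helper."""
    headers = {
        "User-Agent": "Mozilla/5.0",
        # Same cookie layout as in the post: fake ID plus CF=4 for bibtex.
        "Cookie": "GSP=ID=%s:CF=4" % google_id,
    }
    # quote() percent-encodes spaces and other unsafe characters in the query.
    url = GOOGLE_SCHOLAR_URL + "/scholar?q=" + urllib.parse.quote(searchstr)
    return urllib.request.Request(url, headers=headers)


request = build_request("some author or title", "762a112b5c765732")
print(request.full_url)
# Sending it would then be: urllib.request.urlopen(request).read()  (returns bytes)
```

Keeping the request construction separate from the network call makes it easy to inspect the URL and headers before anything is sent.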
With that cookie set, Google Scholar will offer you links to the bibtex files of the results. Getting those links is easy since they all start with "/scholar.bib". Just search for those and download the targets.
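One way to search for those links is a small regular expression over the returned page. The sketch below assumes the links appear as double-quoted `href` attributes with HTML-escaped ampersands; the helper name `extract_bibtex_links` and the sample markup are invented for illustration and may not match what Google Scholar really returns.

```python
import html
import re

# Matches href values that start with /scholar.bib (assumed double-quoted).
BIB_LINK_RE = re.compile(r'href="(/scholar\.bib[^"]*)"')


def extract_bibtex_links(page, base="https://scholar.google.com"):
    """Return absolute URLs for every /scholar.bib link in the page (hypothetical helper)."""
    # html.unescape turns &amp; in attribute values back into a plain &.
    return [base + html.unescape(path) for path in BIB_LINK_RE.findall(page)]


# Made-up sample markup, only to exercise the function.
sample = '<a href="/scholar.bib?q=info:abc&amp;output=citation">Import into BibTeX</a>'
print(extract_bibtex_links(sample))
```

Each returned URL can then be fetched with the same headers as the search request to download the bibtex entry.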
The complete code is available on GitHub. It can be used as a Python library or a standalone application; you just call it like this: gscolar "some author or title" and it will print the first ten results in bibtex to stdout.