Automating the extraction of information from papers

lesandrop · 21 April, 2020 14:46

Hi everyone,

I talked to Karen yesterday about automating the search for papers in the Scopus database. I investigated the issue and found that the process is easy to automate.

Scopus provides an API (https://dev.elsevier.com/sc_apis.html). I created an API token and wrote a simple Python program to get the papers’ metadata through the API. A search through the API can be restricted by the country of affiliation of the authors of the papers that are returned.
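A minimal sketch of how such a search request could be assembled, not the actual script from the repository. The endpoint URL, the `X-ELS-APIKey` header and the `TITLE-ABS-KEY` / `AFFILCOUNTRY` field codes are assumptions based on the public Elsevier documentation and should be checked there; `build_query` and `build_request` are hypothetical helper names. Nothing is sent over the network.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the Elsevier developer documentation.
SCOPUS_SEARCH_URL = "https://api.elsevier.com/content/search/scopus"


def build_query(keyword, countries):
    """Build a query for papers that mention `keyword` in the title, abstract
    or keywords and have at least one author affiliated with any of `countries`.
    The field codes used here are assumptions."""
    country_clause = " OR ".join(f'AFFILCOUNTRY("{c}")' for c in countries)
    return f'TITLE-ABS-KEY("{keyword}") AND ({country_clause})'


def build_request(keyword, countries, api_key, start=0, count=25):
    """Return (url, headers) for one page of results. Nothing is sent here."""
    params = {
        "query": build_query(keyword, countries),
        "start": start,   # offset of the first result on this page
        "count": count,   # number of results per page
    }
    headers = {"X-ELS-APIKey": api_key, "Accept": "application/json"}
    return f"{SCOPUS_SEARCH_URL}?{urlencode(params)}", headers


query = build_query("citizen science", ["Brazil", "Argentina", "Colombia", "Chile"])
url, headers = build_request("citizen science", ["Brazil"], api_key="YOUR-API-KEY")
print(query)
```

The resulting `url` and `headers` can be passed to any HTTP client; paging through all results would repeat the request with an increasing `start`.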

For testing, I implemented a search for papers that use the term “citizen science” and have at least one researcher from “Brazil”, “Argentina”, “Colombia” or “Chile”. It worked well. The search returned 113 papers. The result is in this file (https://github.com/lesandrop/papercollection-ricap/blob/master/scopusPapers.txt). In that file, each line is one paper. For each paper, I saved the following information separated by a semicolon (";"):

Paper title
Name of Journal or Conference
Paper DOI
Paper URL
Authors Affiliation
Publication Date
Searched KeyWord
Searched Authors Countries
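A minimal sketch of reading one line of that file back into named fields, assuming the eight fields appear in the order listed above and that no field contains a semicolon itself. The field names and the sample line are invented for illustration; they are not taken from scopusPapers.txt.

```python
# Hypothetical field names, in the order the fields are listed in the post.
FIELDS = ["title", "venue", "doi", "url", "affiliation",
          "date", "keyword", "countries"]


def parse_line(line):
    """Split one semicolon-separated line into a dict keyed by FIELDS.
    Raises ValueError when the number of fields is unexpected, for example
    when a title itself contains a semicolon."""
    parts = [p.strip() for p in line.rstrip("\n").split(";")]
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


# Invented sample line; the date format is an assumption.
sample = ("A made-up title;Some Journal;10.0000/example;"
          "https://example.org/paper;Example University;2019-05-01;"
          "citizen science;Brazil")
record = parse_line(sample)
print(record["title"], record["doi"])
```

A delimiter that cannot occur in titles, or a CSV writer with quoting, would avoid the field-count problem altogether.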

So, in this way, we are able to collect from Scopus the papers written by researchers who have an affiliation in the Ibero-American region. For each search string, we would run one search per country in the region and then aggregate the results. We can write code that collects the data from the papers, removes duplicates (these occur when the same article has, for example, one researcher from Brazil and another from Colombia), and categorizes the papers by country of the researchers, by year of publication, by journal, etc. If you tell me what would be useful, I will check whether it is possible. Here is a sample of what the API allows us to collect: http://alm.plos.org/docs/scopus
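The aggregation step could be sketched as follows, assuming each per-country search yields records that carry a DOI and that the DOI is a reliable key for spotting duplicates. The records below are invented; `merge_by_doi` is a hypothetical helper, not code from the repository.

```python
from collections import Counter


def merge_by_doi(records):
    """Merge records that share a DOI (case-insensitive) into one paper each,
    collecting every country under which the paper was found.
    Assumes every record has a non-empty DOI."""
    merged = {}
    for rec in records:
        key = rec["doi"].lower()
        if key not in merged:
            merged[key] = {**rec, "countries": set()}
        merged[key]["countries"].add(rec["country"])
    return list(merged.values())


# Invented results of three per-country searches; the first two rows are the
# same paper, found once under Brazil and once under Colombia.
records = [
    {"doi": "10.0000/a", "year": 2018, "venue": "Journal X", "country": "Brazil"},
    {"doi": "10.0000/a", "year": 2018, "venue": "Journal X", "country": "Colombia"},
    {"doi": "10.0000/b", "year": 2019, "venue": "Journal Y", "country": "Chile"},
]

papers = merge_by_doi(records)
by_year = Counter(p["year"] for p in papers)
by_venue = Counter(p["venue"] for p in papers)
by_country = Counter(c for p in papers for c in p["countries"])
print(len(papers), by_year, by_country)
```

Note that the per-country counts still add up to more than the number of unique papers whenever a paper has authors from several countries.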

My Python code is available here: https://github.com/lesandrop/papercollection-ricap. Feel free to use and modify it.

Now, I will try to do the same for the Web of Science database.

Best regards
Lesandro Ponciano

lesandrop · 21 April, 2020 20:15

To complement the previous example, I collected the data for all countries in the Ibero-American region for the keyword “citizen science” (this does not include the other keywords that were listed):

Argentina = 15 papers
Bolivia = 1 papers
Chile = 21 papers
Colombia = 4 papers
Costa Rica = 11 papers
Cuba = 0 papers
Dominican Republic = 0 papers
Ecuador = 10 papers
El Salvador = 0 papers
Guatemala = 0 papers
Honduras = 2 papers
Mexico = 44 papers
Nicaragua = 1 papers
Panama = 1 papers
Paraguay = 1 papers
Peru = 7 papers
Puerto Rico = 3 papers
Spain = 127 papers
Uruguay = 1 papers
Venezuela = 1 papers
Brazil = 73 papers
Portugal = 40 papers
Andorra = 0 papers
TOTAL in Ibero-America = 363 papers

When were these papers published?
2006 = 1 papers
2009 = 2 papers
2010 = 6 papers
2011 = 6 papers
2012 = 6 papers
2013 = 12 papers
2014 = 29 papers
2015 = 34 papers
2016 = 47 papers
2017 = 60 papers
2018 = 57 papers
2019 = 69 papers
2020 = 34 papers
TOTAL in Ibero-America = 363 papers

I also checked where the papers have been published. In other words, where have citizen science researchers in the Ibero-American region published? There are 227 different journals / conferences. The top 20 are:

Journal of Apicultural Research = 11 papers
Fungal Diversity = 8 papers
Advances in Intelligent Systems and Computing = 8 papers
CEUR Workshop Proceedings = 7 papers
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) = 6 papers
Biodiversity and Conservation = 6 papers
ACM International Conference Proceeding Series = 6 papers
Marine Pollution Bulletin = 4 papers
Ecological Indicators = 4 papers
Proceedings of SPIE - The International Society for Optical Engineering = 4 papers
Proceedings of the National Academy of Sciences of the United States of America = 4 papers
PLoS ONE = 4 papers
International Journal of Biometeorology = 4 papers
Journal of Science Communication = 4 papers
Biological Conservation = 4 papers
European Journal of Wildlife Research = 4 papers
Check List = 3 papers
Environmental Monitoring and Assessment = 3 papers
International Journal of Environmental Research and Public Health = 3 papers
Water (Switzerland) = 3 papers
Society and Natural Resources = 3 papers

This is, of course, a very simple and very limited example. I only worked with one of the several keywords that we identified. But I think you can already get a sense of what we can do automatically. Now it is important for us to decide what needs to be done and check whether it can be done automatically through the API.

diego.torres · 21 April, 2020 20:35

Excellent work @lesandrop!! Thank you very much.
It would be very nice to make a map colored with some kind of heat index. Spain, Brazil and Portugal are clearly the predictable ones. Personally, I am surprised by Mexico's output.

It is very interesting that the number of publications has been growing. Even so, I think we need to distinguish journal papers from conference papers.

Again, what you have done is really interesting. Thank you for sharing it.

Diego

lesandrop · 21 April, 2020 20:54

Thank you, @diego.torres. I will put this on a map; it should look quite interesting. I also think we need to separate conferences from journals, so I will look into that too.

About Mexico, we have to check the other databases that @KarenSoacha is using. This pattern may be specific to Scopus and may not hold in the other databases or when we use other keywords, such as “ciência participativa” (participatory science).

Another thing I have been thinking about: it might be interesting to look at the degree of cooperation among researchers in the Ibero-American region. That is, to check how common it is for a single citizen science paper to have, for example, one author from Argentina and another from Colombia. Are there cooperation groups in the region? I have no idea what the result would be.
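One way to measure this cooperation is to count, for every pair of countries, how many papers have authors from both. A minimal sketch, assuming each paper has already been reduced to the set of its authors' affiliation countries; the data below is invented and `country_pairs` is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations


def country_pairs(papers):
    """Count how many papers each unordered pair of countries shares.
    `papers` is an iterable of country collections, one per paper."""
    pairs = Counter()
    for countries in papers:
        # Sorting makes ("Argentina", "Colombia") and the reverse one key.
        for a, b in combinations(sorted(set(countries)), 2):
            pairs[(a, b)] += 1
    return pairs


# Invented example: three papers and their authors' countries.
papers = [
    {"Argentina", "Colombia"},
    {"Brazil"},
    {"Argentina", "Colombia", "Spain"},
]

pairs = country_pairs(papers)
for (a, b), n in pairs.most_common():
    print(f"{a} + {b} = {n} papers")
```

Papers with authors from a single country contribute no pairs, so the share of such papers is itself a simple indicator of how little cross-country cooperation there is.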

Thank you for the feedback!

Best regards,
Lesandro Ponciano

KarenSoacha · 23 April, 2020 10:41

Thank you @lesandrop, there is very good potential in the queries you are running. I think that at the next meeting of the mapping group we can discuss potential visualizations or queries. The one from @diego.torres seems great to me.

A bit of context for those who are new to the thread: this query is meant to extract, from databases such as Scopus and WoS (we have selected 7 for now for the RICAP mapping), the articles that include the term “citizen science” or its “synonyms” according to the list we created.

From that list, the challenge is to filter which articles are about participatory science experiences in Ibero-America. Lesandro is supporting that query so that it is done as automatically as possible. One option is to tie the query to the country of affiliation of the researcher. This does not mean that the article is necessarily about the region, but it is a first filter.

For now, I think that from the metadata we bring from the databases we can extract all available fields, the key ones of course being title, abstract, authors, year, keywords and country of affiliation of the author, in BibTeX format. One, because when we import them into Scolr, where we will do the collaborative review, only the fields required by the program are inserted, so there is no problem of “excess” data. Two, because this query does not take up much space in terms of metadata, and we can later play with queries like the one @diego.torres suggests on articles vs. conferences.

@lesandrop How does extraction of the article PDF itself work? Does the API give us only metadata?

@npiland @Alexandra please take a look at this thread when you can.

Thanks, and let's keep going
Karen

lesandrop · 23 Abril, 2020 14:05

In the case of the Scopus database, it is possible to apply this filter automatically, because there is a field in the metadata that indicates whether or not a paper was published in a journal.
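A minimal sketch of that filter. The field name `prism:aggregationType` and its values ("Journal", "Conference Proceeding", ...) are assumptions about what the Scopus search results contain and should be checked against a real response; the entries below are invented.

```python
def split_by_type(entries, field="prism:aggregationType"):
    """Split result entries into (journal_papers, other_papers) based on the
    assumed aggregation-type field. Entries without the field go to `others`."""
    journals, others = [], []
    for entry in entries:
        kind = entry.get(field, "").strip().lower()
        (journals if kind == "journal" else others).append(entry)
    return journals, others


# Invented entries mimicking the assumed metadata layout.
entries = [
    {"dc:title": "Paper A", "prism:aggregationType": "Journal"},
    {"dc:title": "Paper B", "prism:aggregationType": "Conference Proceeding"},
    {"dc:title": "Paper C"},  # field missing
]

journals, others = split_by_type(entries)
print(len(journals), "journal papers,", len(others), "others")
```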

It is possible to automate the paper download process, but the databases prohibit it. One reason is to prevent all papers from being downloaded and made public on the internet in disregard of copyright. When we download automatically, the first action the paper databases take is to block the IP address that is downloading. My suggestion is: if each PDF will be analyzed manually, then it can also be downloaded manually. I can generate the direct link for each paper that will be analyzed.
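Generating the links could look like the sketch below, assuming the DOI saved for each paper is the basis for the link and using the public https://doi.org/ resolver. `doi_link` is a hypothetical helper and the DOIs are invented. The code only builds URLs; it downloads nothing.

```python
from urllib.parse import quote


def doi_link(doi):
    """Return a resolver URL for a DOI, or None when the DOI is empty.
    Accepts bare DOIs as well as 'doi:' or URL-prefixed forms."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not doi:
        return None
    # Percent-encode characters that are unsafe in a URL path, keeping "/".
    return "https://doi.org/" + quote(doi, safe="/")


# Invented DOIs for illustration.
for doi in ["10.0000/example", " doi:10.0000/Example ", ""]:
    print(repr(doi), "->", doi_link(doi))
```

Each reviewer can then open the link in a browser and download the PDF manually through their institution's access.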

In addition, Web of Science also has an API, so it is possible to automate the search for information there as well. I intend to write a test script in the next few days (I’m teaching classes at PUC Minas remotely, so I’m short on time).

Something that I think is very important is to see what all the databases have in common, that is, which analyses will be possible in all of them. Otherwise, it will be difficult to integrate them all into something cohesive.

Thanks,
Lesandro Ponciano

npiland · 15 May, 2020 02:08

Incredible! This is very exciting, thank you for working on this. I’m also excited to see the heat map and whether or not there are collaborations among researchers in Ibero-America. This reminds me of some of the work done in Dr. Emilio Bruna’s lab, where it has been shown that most researchers partner Global North-Global South, and not Global South-Global South. Seeing what the patterns of collaboration are in our region would be extremely interesting.

@KarenSoacha - I looked into the potential to batch-download PDFs and/or their contents, and it’s a story that gets dark quickly (people have tried to do this and ended up in a Lot of Trouble). Essentially, the databases keep track of how many PDFs are being downloaded per minute per IP address, and there are limits on the download rate. As @lesandrop noted, the databases block the IP to keep this from happening.

I hope everyone’s staying safe, and @lesandrop good luck with the remote classes!