Wednesday, February 13, 2013

Find text inside huge collection of pdf files

So you have a big collection of PDF files (papers, for example) and you are looking for a phrase. How do you search them from the GNU/Linux command line?
I found this method (it is surely not optimal; better ideas are welcome).

First, create a file where the results will be stored:
$ touch /tmp/hits

Then list the files in your folder, send them to pdftotext, and grep the result. For each file we print its name followed by whatever grep matched, so every output line starts with the file name and all matches from that file end up on the same line. In this case I was searching for papers mentioning "Tensegrity":
ls -1 | parallel 'pdftotext {1} - | echo {1}: $(grep -i tensegrity) >> /tmp/hits'
Finally you can explore the results in the /tmp/hits file. Note that this uses the GNU parallel command.
A useful command to explore the results is
 cat /tmp/hits | sed -n -e '/:$/!p'
This shows only the files that actually contain the string we searched for: a line ending in a colon is a file name with no matches after it, and sed skips those lines.
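
As one possible answer to the "better ideas?" question, here is a minimal sketch of the same search as a plain shell function, without GNU parallel and without the filtering step. The function name search_pdfs and its two arguments (phrase, then folder) are my own invention for illustration; the only external tools assumed are pdftotext and grep. It prints one "file: matching line" entry per hit, so files without matches never show up. It runs the files one after another, so on a very large collection it will likely be slower than the parallel version.

```shell
# Sketch only: search_pdfs is a hypothetical helper, not part of any package.
# Usage: search_pdfs PHRASE [FOLDER]   (FOLDER defaults to the current one)
search_pdfs() (
    # The body runs in a subshell, so these variables do not leak out.
    phrase=$1
    dir=${2:-.}
    for file in "$dir"/*.pdf; do
        # Skip the literal pattern when the folder has no .pdf files.
        [ -f "$file" ] || continue
        # "-" makes pdftotext write the extracted text to standard output.
        # grep -i ignores case; -F treats the phrase as a fixed string.
        pdftotext "$file" - 2>/dev/null |
            grep -i -F -- "$phrase" |
            while IFS= read -r line; do
                printf '%s: %s\n' "$file" "$line"
            done
    done
)
```

For the example in this post, something like search_pdfs tensegrity . > /tmp/hits would write the hits to the same results file used above. Quoting "$file" everywhere keeps file names with spaces intact.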
