Searching lots of inconveniently formatted files at once
Here are some approaches to searching lots of PDF and Word documents at once without knowing much about computers.
What sort of search?
To choose software, first decide which of these three kinds of tool you need: full text search, optical character recognition or document clustering.
We were mostly thinking about full text search. Full text search displays every occurrence of a particular phrase, much like a "Find" box where you type something and see each place it appears.
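To make the idea concrete, here is a minimal Python sketch of what a full text search does. It assumes the text has already been pulled out of the files into plain strings (the real tools read PDF and Word formats for you). The function name `find_phrase` and the sample documents are made up for illustration.

```python
def find_phrase(documents, phrase):
    """Return (name, line number, line) for every line containing the phrase.

    `documents` maps a document name to its already-extracted text.
    Matching ignores upper/lower case, like most "Find" boxes.
    """
    needle = phrase.lower()
    hits = []
    for name, text in documents.items():
        for line_number, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                hits.append((name, line_number, line.strip()))
    return hits


# Hypothetical sample documents, standing in for extracted PDF/Word text.
sample_documents = {
    "minutes.txt": "Budget review\nThe annual budget was approved.",
    "memo.txt": "Reminder: submit travel receipts by Friday.",
}

for name, line_number, line in find_phrase(sample_documents, "budget"):
    print(f"{name}:{line_number}: {line}")
```

This sketch rereads every document on every search, which is fine for a handful of files but slow for thousands.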
If your files started as scans of paper documents (images), you'll need to run them through optical character recognition (OCR) first, so that they contain text that can be searched.
Rather than searching for specific phrases, you might consider clustering the documents by similarity. This is probably a bit different from what you are used to, so it might yield results that you hadn't anticipated.
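As a rough illustration of document similarity, the Python sketch below compares documents by the words they share (cosine similarity of word counts) and puts similar ones in the same group. This is only one simple approach and is not necessarily what any particular clustering tool does. The function names, the 0.5 threshold and the sample texts are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count the lowercase words in a text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count Counters (0.0 to 1.0)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_similar(documents, threshold=0.5):
    """Greedy grouping: a document joins the first group whose first
    member is at least `threshold` similar to it, else starts a new group."""
    counts = {name: word_counts(text) for name, text in documents.items()}
    groups = []
    for name in documents:
        for group in groups:
            if cosine_similarity(counts[name], counts[group[0]]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical sample documents (already-extracted text).
sample_documents = {
    "lease_a.txt": "the tenant shall pay rent to the landlord each month",
    "lease_b.txt": "the landlord shall return the deposit to the tenant",
    "recipe.txt": "mix flour sugar and butter then bake",
}

print(group_similar(sample_documents))
```

Common words such as "the" weigh heavily in this simple measure, so a more careful version would probably ignore or down-weight them.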
Full text search
There are a lot of tools in this area. First decide whether you want the files to be stored online. Storing them online can help with collaboration and can simplify backup. On the other hand, keeping files offline can be more secure, reduce your dependence on a good internet connection and give you more flexibility with software.
Recoll, DocumentCloud and DocFetcher are free/libre/open-source.
Full text search was our main focus, so we wrote directions for using the full text search tools.
The online tools (Google Drive and DocumentCloud) work in much the same way as each other. To search with these tools:
- Make an account.
- Upload the documents.
- Select the documents.
The offline tools (Recoll, DocFetcher, Spotlight and Alfresco) also work in much the same way as each other. For these:
- Install the software.
- Put the documents on your hard drive, and remember where you put them.
- In the program (Recoll, Spotlight, DocFetcher or Alfresco), indicate that the directory containing the files should be "indexed". This might also be phrased as "adding" a directory.
- In the program, "index" your hard drive.
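If you are curious what "indexing" means, the Python sketch below shows the general idea: the program reads every document once and records which words appear in which documents, so that later searches only consult that record. This is a simplified illustration, not how Recoll, DocFetcher, Spotlight or Alfresco are actually implemented. The names `build_index` and `search` and the sample documents are invented for the example.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the sorted names of documents that contain every query word."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


# Hypothetical sample documents (already-extracted text).
sample_documents = {
    "minutes.txt": "The annual budget was approved.",
    "memo.txt": "Submit travel receipts for the annual conference.",
    "notes.txt": "Budget questions go to the treasurer.",
}

index = build_index(sample_documents)  # done once, up front
print(search(index, "annual budget"))  # later searches reuse the index
```

Note that this sketch finds documents containing all of the words anywhere in the text; it does not check that the words appear next to each other as a phrase.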
Optical character recognition
Adobe Acrobat Pro can run optical character recognition on several files in batch.
If that doesn't work for you or if you are opposed to proprietary software, consider the many free/libre/open-source graphical OCR tools.
These free tools can also run OCR on many files in a batch, but batch jobs might be less convenient in them than in Adobe Acrobat Pro.
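For readers comfortable with a little scripting, a batch job can also be sketched in Python. The example below assumes the free/libre/open-source Tesseract OCR engine is installed and available as the `tesseract` command. Tesseract is a command-line program, not one of the graphical tools mentioned above. The function names and the folder names `scans` and `searchable` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def ocr_commands(input_dir, output_dir,
                 extensions=(".png", ".jpg", ".tif", ".tiff")):
    """Build one Tesseract command per scanned image in input_dir."""
    commands = []
    for image in sorted(Path(input_dir).iterdir()):
        if image.suffix.lower() in extensions:
            output_base = Path(output_dir) / image.stem
            # "tesseract IMAGE OUTPUTBASE pdf" writes OUTPUTBASE.pdf,
            # a PDF of the scan with a searchable text layer.
            commands.append(
                ["tesseract", str(image), str(output_base), "pdf"]
            )
    return commands


def run_batch(input_dir, output_dir):
    """Run OCR on every image in input_dir, stopping at the first failure."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("The tesseract command was not found.")
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for command in ocr_commands(input_dir, output_dir):
        subprocess.run(command, check=True)
```

Calling `run_batch("scans", "searchable")` would then write one searchable PDF into `searchable` for each image in `scans`, and those PDFs can be handed to any of the full text search tools.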