JFreq takes plain text documents and turns them into a word frequency matrix. It tries hard to be a) quick, and b) light on memory. It could be better at both, but it's quite usable. The graphical version looks like this on a Mac with the online help open.
The plain text files can be added directly, or by the folder-load. If folders are offered, JFreq only looks one level down into them for documents and assumes that everything it finds is a plain text file. It is helpful to make sure this is true.
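The 'one level down' rule can be illustrated with a short Python sketch. This is not JFreq's own code (JFreq is written in Java); the function name collect_documents is hypothetical and the sketch only mirrors the behaviour described above: files are taken as given, folders contribute their immediate children, and nothing checks that a file really is plain text.

```python
from pathlib import Path


def collect_documents(paths):
    """Gather files named directly, plus files one level inside any folders.

    Hypothetical helper for illustration only; not part of JFreq.
    """
    documents = []
    for p in map(Path, paths):
        if p.is_dir():
            # Only the folder's immediate children are considered;
            # sub-folders are skipped rather than searched recursively.
            documents.extend(sorted(c for c in p.iterdir() if c.is_file()))
        elif p.is_file():
            documents.append(p)
    return documents
```

Running something like this over your own folders before loading them is a quick way to spot stray non-text files or documents hidden in sub-folders.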
During the counting process JFreq can, optionally
JFreq output is a folder containing your new word (or category) frequency matrix in a choice of formats, optionally gzipped to save space on your disk. The formats are:
LDA-C: Blei's sparse matrix format used for fitting topic models, but quite generally useful for word frequency data.
MTX: The Matrix Market sparse matrix format used in numerical analysis, in the 'coordinate integer' format.
first choice of output format. Not well-suited for large scale word-frequency data but reasonable for small document collections and for content analyses.
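To make the two sparse formats concrete, here is a minimal Python sketch that counts words in two tiny documents and writes the result both ways. It is an illustration under stated assumptions, not JFreq's implementation: the function names are hypothetical, and the tokenizer (lowercase, split on whitespace) is a stand-in for whatever JFreq actually does. The format details follow the published conventions: LDA-C uses one line per document with the number of distinct terms followed by 0-based term:count pairs, and Matrix Market coordinate files use a banner line, a 'rows columns entries' line, then 1-based 'row column value' triples.

```python
from collections import Counter


def count_words(docs):
    """docs maps a document name to its text.

    Returns (row labels, sorted vocabulary, one Counter per document).
    Tokenization here is an assumption: lowercase and split on whitespace.
    """
    names = list(docs)
    counts = [Counter(docs[name].lower().split()) for name in names]
    vocab = sorted(set().union(*counts))
    return names, vocab, counts


def to_ldac(vocab, counts):
    """One line per document: '<distinct terms> <term>:<count> ...' (0-based terms)."""
    index = {word: j for j, word in enumerate(vocab)}
    lines = []
    for c in counts:
        pairs = sorted((index[word], n) for word, n in c.items())
        fields = [str(len(pairs))] + [f"{j}:{n}" for j, n in pairs]
        lines.append(" ".join(fields))
    return "\n".join(lines) + "\n"


def to_mtx(vocab, counts):
    """Matrix Market 'coordinate integer' text with 1-based row and column indices."""
    index = {word: j for j, word in enumerate(vocab)}
    entries = sorted(
        (i + 1, index[word] + 1, n)
        for i, c in enumerate(counts)
        for word, n in c.items()
    )
    header = "%%MatrixMarket matrix coordinate integer general"
    size = f"{len(counts)} {len(vocab)} {len(entries)}"
    body = [f"{i} {j} {n}" for i, j, n in entries]
    return "\n".join([header, size] + body) + "\n"
```

For the documents "the cat sat" and "the the dog" the vocabulary is cat, dog, sat, the, so the LDA-C lines are '3 0:1 2:1 3:1' and '2 1:1 3:2', and the Matrix Market size line is '2 4 5'. Neither file carries the words or filenames themselves, which is why labels have to travel separately.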
For the MTX and LDA-C formats row labels (filenames) and column labels (word types) are provided alongside the main sparse matrix file in the output folder. There's also a README to remind you how the format works.
R users may find it useful to know that the 'lda' package for topic models (like Blei's LDA-C software itself) expects files in the LDA-C format and has functions for reading them. The 'Matrix' package also contains a 'readMM' function for reading Matrix Market format; just point it at the 'data.mtx' file in your output folder.
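If you are not working in R, the Matrix Market file is simple enough to read by hand. The following Python sketch parses the text of an uncompressed 'coordinate integer' file into a dictionary of counts keyed by 0-based (row, column) pairs. The function name parse_mtx is hypothetical and the sketch assumes a well-formed file; SciPy users can probably skip it, since scipy.io.mmread reads the same format.

```python
def parse_mtx(text):
    """Parse Matrix Market 'coordinate integer' text.

    Returns (n_rows, n_cols, {(row, col): count}) with 0-based indices.
    Illustrative sketch only; assumes a well-formed, uncompressed file.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    banner = "%%matrixmarket matrix coordinate integer"
    if not lines or not lines[0].lower().startswith(banner):
        raise ValueError("not a 'coordinate integer' Matrix Market file")

    # Everything after the banner that starts with '%' is a comment.
    data = [ln for ln in lines[1:] if not ln.startswith("%")]
    n_rows, n_cols, n_entries = map(int, data[0].split())

    entries = {}
    for ln in data[1:]:
        i, j, value = map(int, ln.split())
        entries[(i - 1, j - 1)] = value  # the file format is 1-based

    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return n_rows, n_cols, entries
```

To use it, read the contents of 'data.mtx' into a string and pass that in; rows then correspond to documents and columns to word types, in the order given by the label files.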
You can download Version 0.5.4 from SourceForge. Alternative versions for the command line and for other operating systems are also available there, as is the source code.
JFreq is open source software distributed under the GNU General Public License (GPL).
If you'd like to refer to the package in written work, you can use this:
Lowe W. (2011) 'JFreq: Count words, quickly'. Java software version 0.5.4, URL: http://www.conjugateprior.org/software/jfreq/