Basic Frequency Analysis with Ruby

Ruby is a pretty fun language; I've previously used it for N-body simulations, following the Maya Open Lab Development series.

I decided to do a short brush-up project: a command-line Ruby script that computes some basic word statistics on text files, for example the number of occurrences of words of a certain length. Along the way we can use the optparse and benchmark standard libraries to make the interface a little cleaner and do some timing.
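The timing part is handled by the benchmark library. A minimal sketch of how the elapsed-time lines in the output further down can be produced (the block body here is just stand-in work, not the actual file processing):

```ruby
require 'benchmark'

# Benchmark.measure runs the block and returns a Benchmark::Tms object;
# its #real field is the wall-clock time in seconds.
elapsed = Benchmark.measure do
  100_000.times { |i| i * i } # stand-in for processing a file
end
puts "Time elapsed: #{elapsed.real}s"
```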

How can we find the most common words in a text file? First, we parse the file into word-sized chunks. This is the job of a lexical parser, which converts a sequence of characters into a sequence of tokens; in our case the tokens are words. Luckily, a regular expression can do this handily for us, so we don't have to write our own parser, again.
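As a quick illustration of regex tokenization, here is the same pattern the script uses later, run on a made-up sentence:

```ruby
text = "The U.S.A. report was short, wasn't it?"
# One alternative matches plain words of two or more characters;
# the other matches dotted abbreviations like U.S.A.
tokens = text.scan(/[A-Za-z](?:\w)+|(?:[A-Za-z]\.)+[A-Za-z]?/)
# tokens => ["The", "U.S.A.", "report", "was", "short", "wasn", "it"]
```

Note that the pattern drops single letters and splits on apostrophes, so "wasn't" is tokenized as just "wasn".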

Then, given this list of tokens, we want to detect duplicates and keep track of the number of times each word appears. One way to do this is to keep a memory of words seen so far, and for each word check whether we've seen it before.

  • If it's new, create a new entry in the memory with its count initialized to one. 
  • Otherwise, if it already exists in memory, increment the count of that word. 

A hash table can do this handily. In Ruby, the dictionary type is called a Hash, and it accepts a default value for missing keys, so a counter with a default of 0 is as simple as words = Hash.new(0).
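The default value collapses both branches above into a single increment, since looking up an unseen word yields 0 rather than nil. A tiny sketch with made-up words:

```ruby
words = Hash.new(0) # unseen keys return 0 instead of nil
%w[the cat the].each do |word|
  words[word] += 1 # works for new and existing keys alike
end
# words => {"the" => 2, "cat" => 1}
```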

Then, we can scan words from a line using regex and increment the dictionary like so:

while line = f.gets
  line.chomp! # Remove trailing newline character \n
  cleanLine = line.scan(/[A-Za-z](?:\w)+|(?:[A-Za-z]\.)+[A-Za-z]?/)
  cleanLine.each do |word|
    words[word] += 1
  end
end

We can sort and cull lengths using Ruby's inbuilt syntax:

words.reject! { |k, v| v < options[:minOccurences] } # Drop words below the occurrence threshold
wordDist = words.to_a.sort { |x, y| y[1] <=> x[1] } # Sort by occurrences, descending
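Run on a toy hash, those two lines behave like this (the counts are made up for illustration, and a local variable stands in for options[:minOccurences]):

```ruby
words = { "the" => 5, "druitt" => 3, "rare" => 1 }
min_occurences = 2 # stand-in for options[:minOccurences]
words.reject! { |k, v| v < min_occurences } # reject! mutates words in place
wordDist = words.to_a.sort { |x, y| y[1] <=> x[1] }
# wordDist => [["the", 5], ["druitt", 3]]
```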

Using optparse also gives us a nice help menu:

Usage: wordStats.rb [options] file1 file2...
    -v, --verbose                    Output more information
    -w, --word-length N              Minimum word length (Integer > 1)
    -o, --occurences N               Minimum number of occurences (Default: 2)
    -i, --interactive                Script pauses between each file
    -h, --help                       Display this screen
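A sketch of an OptionParser setup that could produce a help screen like the one above (only three of the flags are shown; the handler bodies are assumptions about the script's internals):

```ruby
require 'optparse'

options = { minOccurences: 2 }
parser = OptionParser.new do |opts|
  opts.banner = "Usage: wordStats.rb [options] file1 file2..."
  opts.on("-v", "--verbose", "Output more information") { options[:verbose] = true }
  opts.on("-o N", "--occurences N", Integer,
          "Minimum number of occurences (Default: 2)") { |n| options[:minOccurences] = n }
  opts.on("-h", "--help", "Display this screen") { puts opts; exit }
end

argv = ["-o", "10", "file1.txt"]
parser.parse!(argv) # consumes the flags, leaves the filenames behind
# options[:minOccurences] => 10, argv => ["file1.txt"]
```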

Example usage:
./wordStats.rb *.txt -w 3 -o 10

Here's an example run on the Wikipedia article on Montague Druitt. The top three most common words are the, and, and was, which sounds about right.

$ ./wordStats.rb montaguedruitt.txt
--- montaguedruitt.txt
Number of characters per line:
10 0 849 0 0 18 30 1342 652 12 0 0 0 22 25 968 659 636 13 0 618 1226 602 707 69 11 0 524 541 922 408 471
Number of lines: 32, Number of characters: 11335
Word count (2+ occurences, ignore single-letter words):
the : 88, and : 65, was : 49, of : 46, in : 45, Druitt : 40, his : 38, to : 29, for : 25, by : 21, he : 21
Time elapsed: 0.0025387509958818555s
---
Finished : 1 Files processed in 0.0026010969886556268s

Source code on GitHub
