Cuttlepress

Blogging about Hacker School?

Day 36

Busy day! I finally got my Twitter project fully working. The last piece was making it set-it-and-forget-it on my AWS server, so I was determined to find some way of automatically restarting the process if it broke or if the server was rebooted. Moshe helped me get supervisord set up, though we had a lot of trouble getting the “program” section of the conf file working. Tom helped us determine that the solution was to use absolute paths for both Node itself and my .js file, like so:

[supervisord]

[program:curateddannel]
command= /usr/local/bin/node /home/ubuntu/twitter.js
autostart=true
autorestart=true
stdout_logfile=/home/ubuntu/supervisorlogs.log
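
After editing the conf file, supervisord has to be told to pick up the change; the usual route is supervisorctl, roughly like this (with sudo as needed):

supervisorctl reread    # re-parse the conf file
supervisorctl update    # apply the changes and (re)start affected programs
supervisorctl status    # confirm curateddannel shows up as RUNNING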

I also followed this tutorial to make sure the supervisord process runs on machine startup. I’ll do a real announcement and link to the project tomorrow, once I’m more sure it’s stable.

I attended Lindsey’s talk on LVars and parallel computing, and I really enjoyed it. The talk and many of the questions led to a lot of “oh, so that’s what that means” moments for me, and like all of the best presentations, it made me interested in something I’d never even thought about before.

I had an office hour with Alex, which was also really great. We determined that a good approach to my “how do we define ‘similar to Hamlet’?” question would be to generate a language model of Hamlet and check incoming phrases against the probabilities contained therein. He walked me through the basics of how I could construct such a model (build up a dictionary of bigrams, along with their probabilities), along with some implementation tips (for example, storing probabilities as negative logprobs). It was a lot of very comprehensible information in a fairly short time, and it really felt like learning and progress were happening. I wrote up the code for grabbing the unigram probabilities from Hamlet and got most of the way through the bigrams (with some debug help from Rishi). Here’s the unigram function, where choppedtext is an array of all the words in the text:

var fs = require('fs'); // needed for writing the model out to a file
// unigramsfile (the output path) is assumed to be defined elsewhere in the script

function makeUnigrams(choppedtext) {
  var dict = {};
  var highestscore = 0;
  // count the occurrences of each word
  for (var i = 0; i < choppedtext.length; i++) {
      if (choppedtext[i] in dict) {
          dict[choppedtext[i]] += 1;
      }
      else {
          dict[choppedtext[i]] = 1;
      }
  }
  // then use that count to calculate the probability of each word
  for (var word in dict) {
      var count = dict[word];
      var unlikeliness = (-1) * Math.log(count / choppedtext.length);
      // storing the numbers as "negative logarithmic probabilities" helps by
      // 1) ensuring no numbers too small for floats, and
      // 2) allowing us to add them instead of multiplying
      dict[word] = unlikeliness;
      if (unlikeliness > highestscore) {
          highestscore = unlikeliness;
      }
  }
  // dict also stores the highest unlikeliness,
  // which can be assigned to words that aren't recognized at all
  dict["LEAST_COMMON_PROB"] = highestscore;
  // and write it to the external file
  fs.writeFile(unigramsfile, JSON.stringify(dict, null, 4), function(err) {
      if (err) {
          console.log(err);
      }
      else {
          console.log("Unigrams saved to " + unigramsfile);
      }
  });
}
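
For anyone following along, here’s a minimal sketch of the shape the bigram side is heading toward; makeBigrams and scorePhrase are hypothetical names for illustration, not the project’s actual code, but they follow the same negative-logprob scheme as the unigrams:

// Sketch only: bigrams follow the same pattern as unigrams, except the keys are
// pairs of adjacent words and the counts are divided by the number of bigrams.
function makeBigrams(choppedtext) {
  var dict = {};
  var total = choppedtext.length - 1;
  // count each pair of adjacent words
  for (var i = 0; i < total; i++) {
      var pair = choppedtext[i] + " " + choppedtext[i+1];
      if (pair in dict) {
          dict[pair] += 1;
      }
      else {
          dict[pair] = 1;
      }
  }
  // convert counts to negative logprobs, tracking the largest one
  var highestscore = 0;
  for (var pair in dict) {
      var unlikeliness = (-1) * Math.log(dict[pair] / total);
      dict[pair] = unlikeliness;
      if (unlikeliness > highestscore) {
          highestscore = unlikeliness;
      }
  }
  dict["LEAST_COMMON_PROB"] = highestscore;
  return dict;
}

// Scoring an incoming phrase is then just summing the negative logprobs of its
// bigrams, falling back to LEAST_COMMON_PROB for pairs the model has never seen;
// a lower total means "more Hamlet-like".
function scorePhrase(words, bigrams) {
  var score = 0;
  for (var i = 0; i < words.length - 1; i++) {
      var pair = words[i] + " " + words[i+1];
      score += (pair in bigrams) ? bigrams[pair] : bigrams["LEAST_COMMON_PROB"];
  }
  return score;
}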