Hey all! The class notes are not fully finished, but have come a long way with the addition of links to book chapters with our material. If the book chapters are too dense, I can rewrite them in my super less dense casual style – just give me a heads up.

Y’all know what to do!

At the Secret Convention of Logicians, the Master Logician placed a band on each attendee’s head, such that everyone else could see it but the person themselves could not. There were many, many different colors of band. The Logicians all sat in a circle, and the Master instructed them that a bell was to be rung in the forest at regular intervals: at the moment when a Logician knew the color on his own forehead, he was to leave at the next bell. Anyone who left at the wrong bell was clearly not a true Logician but an evil infiltrator and would be thrown out of the Convention post haste; but the Master reassured the group, stating that the puzzle would not be impossible for anybody present. How did they do it?

You have all been sent an email. Also, if you like your LaTeX editing in a Web 2.0 flavor, try out www.sharelatex.com

Today’s lab is meant to be an extension of the day’s activities, with most of the day left open for projects.

**Part 0: Variable Bit Encoding for Information Retrieval**

Although I stressed it in class, it is worth saying again: IR is very concerned with space efficiency! So much so that significant work goes into the careful compression of postings lists. Consider, for example, the Reuters corpus, a commonly used and cited collection of around 800,000 documents. Each document in this corpus has around 200 tokens, each token around 6 characters, and the index contains around 100,000,000 postings. Uncompressed, this gives a corpus size of nearly 960MB and a postings list of 250MB.

Given an inverted index storing, for each term, its list of docIDs and the gaps between them

| term           | encoding | postings list                   |
| -------------- | -------- | ------------------------------- |
| the            | docIDs   | … 283042 283043 283044 283045 … |
|                | gaps     | 1 1 1 …                         |
| computer       | docIDs   | … 283047 283154 283159 283202 … |
|                | gaps     | 107 5 43 …                      |
| arachnocentric | docIDs   | 252000 500100                   |
|                | gaps     | 252000 248100                   |

we see that storing the gaps between successive docIDs in a postings list often means storing much smaller numbers. In those cases we can use fewer bits to encode each entry, but not in all cases: we need a variable bit encoding in order to store the odd large gap here and there.

Variable Byte Encoding:

Recall that an integer is stored as four bytes. We will pretend that they are unsigned, taking on only non-negative values. This means that we can use them to represent numbers from 0 to 2^32 − 1 = 4,294,967,295. However, we could certainly use fewer bits for numbers like 10 or 15 or even 511. A simple method for this is variable byte encoding. Given a number, we get its variable byte encoding by writing it first in binary, padding it at the front with 0s so that the number of bits used is a multiple of seven, then chunking the bits into groups of seven. The first group has a 1 prepended to it to show that it is the first byte of the representation, and all other groups have a 0 prepended to them. Take, for example, 2056, which is 100000001000 in straight binary. We chunk this into groups of at most seven as 10000 0001000, then pad it to 0010000 0001000. Finally, we prepend a one to the first chunk and a zero to the rest to get 10010000 00001000, a two-byte representation for 2056.
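The scheme above can be sketched in a few lines of Java. A couple of caveats: the class and method names here are just for illustration, and note that these notes mark the *first* byte of each code with a 1; some textbook presentations flip this and mark the last byte instead.

```java
import java.util.ArrayList;
import java.util.List;

public class VariableByte {
    // Encode a non-negative number using the convention in these notes:
    // split the number into 7-bit groups, then set the high bit of the
    // FIRST byte to 1 and the high bit of every other byte to 0.
    public static List<Integer> encode(long n) {
        List<Integer> groups = new ArrayList<>();
        do {
            groups.add(0, (int) (n & 0x7F)); // prepend the lowest 7 bits
            n >>= 7;
        } while (n > 0);
        groups.set(0, groups.get(0) | 0x80); // mark the first byte with a 1
        return groups;
    }

    public static void main(String[] args) {
        // 2056 -> 10010000 00001000, i.e. the bytes 144 and 8
        System.out.println(encode(2056)); // [144, 8]
        System.out.println(encode(1));    // [129]
    }
}
```

Running it on 2056 reproduces the two-byte example worked out above.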

Write the following numbers in variable byte notation. Note that Google can convert numbers to binary for you!

- 1
- 1023
- 12345678

Recall that the gamma code of a number consists of its *offset* (the number in binary with the leading 1 chopped off) preceded by the *length* of that offset written in unary. For example, the number 15 is 1111 in binary, so its offset is 111, its length is 3 (1110 in unary), and its gamma code is 1110111. Woo, less than a whole byte for 15, just barely better than the above variable byte representation.

What sequence of gaps is encoded by the following string of gamma codes?

111001011110000010111111010110

How many bits would we have needed to encode this with the standard 4 byte integer encoding?
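If you want to check your work, here is a sketch of a gamma encoder and decoder (the class and method names are my own, not part of any starter code):

```java
import java.util.ArrayList;
import java.util.List;

public class GammaCode {
    // Gamma-encode n >= 1: the length of the offset in unary, then the
    // offset (the binary representation with its leading 1 removed).
    public static String encode(int n) {
        String offset = Integer.toBinaryString(n).substring(1);
        return "1".repeat(offset.length()) + "0" + offset;
    }

    // Decode a concatenation of gamma codes back into numbers.
    public static List<Integer> decode(String bits) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < bits.length()) {
            int len = 0;
            while (bits.charAt(i) == '1') { len++; i++; }
            i++; // skip the 0 that terminates the unary length
            int n = 1; // restore the implicit leading 1 of the offset
            for (int j = 0; j < len; j++, i++) {
                n = (n << 1) | (bits.charAt(i) - '0');
            }
            out.add(n);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(encode(15)); // 1110111
        System.out.println(decode(encode(13) + encode(1))); // [13, 1]
    }
}
```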

A cool idea from the IR textbook.

γ codes are relatively inefficient for large numbers as they encode the length of the offset in inefficient unary code. *δ codes* differ from γ codes in that they encode the first part of the code (*length*) in γ code instead of unary code. The encoding of *offset* is the same. For example, the δ code of 7 is 10,0,11 (again, we add commas for readability). 10,0 is the γ code for *length* (2 in this case) and the encoding of *offset* (11) is unchanged. (i) Compute the δ codes for the other numbers in Table 5.5. For what range of numbers is the δ code shorter than the γ code? (ii) γ code beats variable byte code in Table 5.6 because the index contains stop words and thus many small gaps. Show that variable byte code is more compact if larger gaps dominate. (iii) Compare the compression ratios of δ code and variable byte code for a distribution of gaps dominated by large gaps.

Part 1: Final Projects

Part 2: Daily Feedback

An *n-simplex* is a collection of n+1 points which are connected by faces, edges, cells, etc. For example, a 1-simplex is a line segment, a 2-simplex is a triangle, and a 3-simplex is a tetrahedron. A *regular simplex* has all edges with equal length. Note that when we put a point on the center of every face of an n-simplex, we get a smaller n-simplex inside. For example, if we put a point in the middle of each edge of a triangle, then connect these points, we get a smaller triangle. If we put a point in the center of each face of a tetrahedron, then connect them, we get a smaller tetrahedron. Call this the *inscribed simplex*.

Given a regular 5-simplex, what is the ratio of the edge length of the original simplex to the edge length of its inscribed simplex?

Today’s lab is all final projects! Keep in mind that your presentation will be a week from today, so if you need more resources, advice, etc, get in touch with me, and I will do what I can. Also, if you feel that your project is in a good place, you can always work on older projects, or find Ben or the RCs to learn something cool.

Your friend Ramsey invites you to a potluck at his house. Wanting to know how many people you’ll need to cook for, you ask how many people will be there. Rather than give a straight answer, Ramsey tells you that the number of attendees will be the smallest number such that he is guaranteed that, among the guests, either:

- At least three of the guests will be mutual friends, or
- At least three of the guests will be mutual strangers (none of the three knows either of the other two)

How many guests will there be?

Today, there are only two main tasks, EPGYAuthor and Final Projects, with the optional ImageClusterer on the side. Spend your time wisely! If you are not interested in EPGYAuthor and would rather just work on your final project, feel free, but make sure that you take a peek at EPGYAuthor and get a sense of what is going on. Also, if you really want to make it through previous labs, feel free to spend some time on them, but don’t get too caught up in older projects at the cost of not working on your final project!

From today, I used the following demos:

Also, if you want a great resource on clustering, check out the chapter on clustering from Jeff Ullman’s book.

**Part 0: Submissions**

Before getting started, make sure that you have updated your submissions directory fully and you have as much of your work in it as possible.

**Part 1: ImageClusterer**

Remember the demos we saw of using clustering to find image features? We are going to do that ourselves! Get excited to implement k-means clustering! Download the starter code here. Read through the project and look at the part you need to implement, the clusterPixels() method. The general steps are outlined in the comments, but as far as actual implementation is concerned, you need to think carefully about a few parts, especially how you will keep track of which pixels wind up clustered to which centers. Remember, this is totally optional! If you want to devote your time to your final project, work on that instead.
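To give a feel for the algorithm without spoiling clusterPixels(), here is a stripped-down sketch of k-means on 1-D grayscale intensities. The class name and the naive first-k initialization are my own choices; the real project works on full RGB pixels and has its own structure.

```java
public class KMeansSketch {
    // A stripped-down 1-D k-means: assign each point to its nearest
    // center, recompute each center as the mean of its cluster, repeat.
    public static double[] cluster(double[] points, int k, int iterations) {
        double[] centers = new double[k];
        // Naive initialization: the first k points. Real code should
        // pick k distinct random points instead.
        for (int c = 0; c < k; c++) {
            centers[c] = points[c];
        }
        int[] assignment = new int[points.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: nearest center wins.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (Math.abs(points[p] - centers[c])
                            < Math.abs(points[p] - centers[best])) {
                        best = c;
                    }
                }
                assignment[p] = best;
            }
            // Update step: move each center to the mean of its cluster.
            double[] sum = new double[k];
            int[] count = new int[k];
            for (int p = 0; p < points.length; p++) {
                sum[assignment[p]] += points[p];
                count[assignment[p]]++;
            }
            for (int c = 0; c < k; c++) {
                if (count[c] > 0) centers[c] = sum[c] / count[c];
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        // Two obvious intensity clusters, around 11 and around 201.
        double[] pixels = {10, 12, 11, 200, 205, 198};
        double[] centers = cluster(pixels, 2, 10);
        System.out.println(centers[0] + " " + centers[1]); // 11.0 201.0
    }
}
```

Tracking the `assignment` array between the two steps is exactly the bookkeeping the project asks you to think carefully about.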

**Part 1: EPGYAuthor**

(Read the entire part before starting.)

In this short project, we’re going to give a computer a source text or book, and then have the computer generate some text that’s similar to the text or book. For example, if you give it the complete works of Shakespeare, this program should spit out some text that looks like it could have been written by Shakespeare, as long as you don’t read too closely.

Sounds hard, right?

Not if you use the correct model! We learned in lecture that supervised learning has two main steps: feature/model selection, and actual learning (populating our model). If we select a great model, the learning should be pretty simple. In this project, we’re going to build a Markov model, then use a trivial learning technique (just picking each character based on the probability that it showed up before).

The model we’re going to use here is called the *Markov model*. Markov models were created by Andrey Markov, and were subsequently used by Claude Shannon in 1948 to model language. Simply put, a Markov model derives the current state from some probability distribution over the state(s) immediately preceding it. Markov models are widely used in machine learning, data mining, and statistics.

So, how can we use Markov models in modeling English? As we discussed in lecture, the distribution of letters in English is not random. Certain letters are much more likely to appear after certain sequences of letters than others. For example, the letters ‘e’ and ‘o’ are much more likely to follow the sequence of letters “indicat” than, say, ‘y’ or ‘b’. We can therefore build a Markov model to model which letters follow other sequences of letters, given a text, and then use that model to generate some new text.

Markov models have a notion of “order,” which is how many previous states it uses to generate the current state. For example, an order 0 Markov model would not use any previous states, so it would just use the distribution of single characters in a text to generate a new text. With an order 1 Markov model, we would now be representing common *bigrams*, or two letter sequences, since we will be picking letters that commonly follow other single letters. An order 2 Markov model would then look at each 2-character sequence in the text, and record what letters are likely to follow each 2-letter sequence. For example, in English, the 2-character sequence “tw” is much more likely to be followed by a vowel than a consonant.

To generate text, we could just take that model, start with a random letter or letters, and then use the Markov model to generate letters one at a time, based on the preceding letters. With order 0, you will just get a bunch of nonsense. With order 1, you’ll get sequences of letters that may actually look like text in some weird language. With order 2, you can definitely see traces of English; it could even be read as some strange form of long-lost English. At order 4, words start becoming apparent, but the sentence structure and the writing style still differ from the original text. Between orders 5 and 7, you definitely start seeing the original style. You can go longer, but if you pick an order that’s too long, it becomes increasingly unlikely that your source text will have many examples of the long letter sequences, so you risk copying the source text exactly instead of generating relatively novel text.

So, your mission, if you choose to accept it, is to build a k-order Markov model for some source text, then generate random text output that uses this model. Take the following steps:

- Download the starter project: EPGYAuthor
- Read the source file character by character. As you go, track each sequence of characters and the character that follows it. For example, for the short text “Hello world!” with an order 5 Markov model, you would track “Hello”->’ ‘, “ello “->’w’, “llo w”->’o’, “lo wo”->’r’, and so on.
- You’ll need to keep track of which letters follow every sequence of letters. In your text, if the letter ‘a’ follows “hell” 5 times, and the letter ‘o’ follows “hell” 15 times, you will then, when generating a random text, pick the letter ‘a’ 1/4 of the time when the 4 previous letters are “hell”, and pick ‘o’ 3/4 of the time when encountering those letters. Think about how you will store these. Maps and Lists will definitely be used together. When storing frequency data, consider just using duplicate values instead of storing a frequency number. If you do this correctly, it will make it very easy to randomly choose a character given the preceding sequence of characters.
- To read one character at a time, use BufferedReader.read(). This will read the character as an int, so you will need to cast it to a char. However, make sure to check first if the read() method returns -1. In that case it means we have reached the end of the file.
- Take some time to think through your data structure. Coding without first understanding what you’re doing and how you’re doing it will NOT work here.

- Now, use the model to randomly generate text. Pick a random sequence from your collection of sequences as the starting seed, then just pick the next letter one at a time depending on your model. You may want to cap how many characters it generates at some predetermined value.
- Have fun! You can find a lot of classic texts for free online, especially at Project Gutenberg. Also fun are generating film scripts and plays. Feel free to do any extensions by modifying the model. If you generate some interesting or hilarious texts, feel free to post on the blog, and make sure to include your source text as an attachment, and mention which order Markov model you used.
- (All texts provided in the starter project from Project Gutenberg.)
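Putting the steps above together, here is a minimal sketch of the whole pipeline. The class and method names are hypothetical, not the starter code; note the duplicate-values trick for frequencies mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class MarkovSketch {
    // Build an order-k model: map each k-character sequence to the list
    // of characters that followed it in the text. Duplicates are kept on
    // purpose, so picking a uniformly random element of the list already
    // respects the observed frequencies.
    public static Map<String, List<Character>> train(String text, int k) {
        Map<String, List<Character>> model = new HashMap<>();
        for (int i = 0; i + k < text.length(); i++) {
            String seq = text.substring(i, i + k);
            model.computeIfAbsent(seq, s -> new ArrayList<>())
                 .add(text.charAt(i + k));
        }
        return model;
    }

    // Generate up to maxLength characters from a random seed sequence.
    public static String generate(Map<String, List<Character>> model,
                                  int k, int maxLength, Random rand) {
        List<String> seeds = new ArrayList<>(model.keySet());
        StringBuilder out =
            new StringBuilder(seeds.get(rand.nextInt(seeds.size())));
        while (out.length() < maxLength) {
            String seq = out.substring(out.length() - k);
            List<Character> followers = model.get(seq);
            if (followers == null) break; // sequence only occurred at the end
            out.append(followers.get(rand.nextInt(followers.size())));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, List<Character>> model = train("Hello world!", 5);
        System.out.println(model.get("Hello")); // [ ]  (a single space)
        System.out.println(generate(model, 5, 12, new Random()));
    }
}
```

On a real source text you would replace the string literal with characters read one at a time via BufferedReader.read(), as described above.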

**Part 2: Final projects**

**Part 3: Daily feedback**

You are sending four bits over a network to a friend, but you suspect that the evil bit flipper man might try to flip a bit. To be cautious, you allow yourself three extra bits to help encode the values of the original four. Given that the evil bit flipper man may flip one of your bits, devise a scheme for using the seven bits to make sure that your friend receives the four original bits, regardless of which of the seven bits the evil bit flipper man might flip.