Probstat/random word

จาก Theory Wiki
ไปยังการนำทาง ไปยังการค้นหา
This is part of probstat course.

Download words.txt which is a list of common 4325 English words. (I obtain this list from [1], which took it from [2]. The original list of 5000 words contain duplicates because some word appears in many parts of speech. I also removed words with punctuation marks (e.g., o'clock).)

Random experiment: Pick one random word from the list.

Probability

Compute the probabilities of the following events.

  • Pick a word with an alphabet 'a'.
  • Pick a word witn an alphabet 'z'.
  • Pick a word with both alphabet 'a' and alphabet 'z'.
  • Pick a word with an alphabet 'b'.
  • Pick a word with alphabets 'a' and 'b'.
  • Pick a word that contains 'a' or 'b'. (Can you find this one without running a program?)
  • Pick a word that contains 'na'.
  • Pick a word that contains 'nal'.
  • Pick a word with no vowels.
  • Pick a word that contains more than 3 vowels. (Vowels are 'a', 'e', 'i', 'o', and 'u'.)

Conditional probability

Compute the conditional probabilities of the following events.

  • Probability that you pick a word with an alphabet 'a' given that you pick a word with 'z'.
  • Probability that you pick a word with 'nal' given that you pick a word with 'na'.
  • Probability that you pick a word with 'l' given that you pick a word with 'na'.
  • Probability that you pick a word with an alphabet 'c' given that you pick a word with 'r'.
  • Probability that you pick a word with an alphabet 'r' given that you pick a word with 'c'.

Answer these questions.

  • Given that we pick a word without vowels, which character has the highest probability of occurring? What's the probability?

Character prediction

This problem was inspired by this fb status by Supasate Choochaisri.

Here is a simple character prediction. Let K be a fixed parameter. (You will experiment with various values of K).

The parameter K is important as we will take the information from all the words whose length is at least K. You can ignore all shorter words. We will use that to predict the K-th after we get K - 1 characters for a valid words.

For example:

  • for K = 2, we will use information from all words whose length is at least 2, to predict the 2nd character of a word after receiving the first character of a random word whose length is at least 2.
  • for K = 4, we will use information from all words whose length is at least 4, to predict the 4th character of a word after receiving the first 3 characters of a random word whose length is at least 4.

Our prediction algorithm is simple:

procedure PREDICT( w ):                  // note that the length of w K-1 characters
1. find character with the highest probability of appearing after  w  in if you pick a word from the list at random.
        if there are more than one character with the highest probability, pick the smallest one  (e.g., if 'd' and 't' get the highest probability, use 'd'.)

For example if we use K=3, and the current string is 'ag', the following list shows all possible words:

again 
against
age
agency
agenda
agent
aggression
aggressive
ago
agree
agreement
agricultural
agriculture

Note that the most probable character after 'ag' is 'e' and 'r' (probability = 0.308), but 'e' is smaller; therefore the algorithm will predict 'e'.

Can you analyze the performance of this prediction? More precisely, can you compute the probability that the algorithm predict correctly given that we pick a random word of length at least K.

Answer in the following table:

K Number of words whose length is at least K Probability of correct prediction
2
3
4
5