Probstat/random word

This is part of probstat course.

Download words.txt which is a list of common 4325 English words. (I obtain this list from [1], which took it from [2]. The original list of 5000 words contain duplicates because some word appears in many parts of speech. I also removed words with punctuation marks (e.g., o'clock).)

Random experiment: Pick one random word from the list.

Probability

Compute the probabilities of the following events.

Pick a word with an alphabet 'a'.
Pick a word witn an alphabet 'z'.
Pick a word with both alphabet 'a' and alphabet 'z'.
Pick a word with an alphabet 'b'.
Pick a word with alphabets 'a' and 'b'.
Pick a word that contains 'a' or 'b'. (Can you find this one without running a program?)
Pick a word that contains 'na'.
Pick a word that contains 'nal'.
Pick a word with no vowels.
Pick a word that contains more than 3 vowels. (Vowels are 'a', 'e', 'i', 'o', and 'u'.)

Conditional probability

Compute the conditional probabilities of the following events.

Probability that you pick a word with an alphabet 'a' given that you pick a word with 'z'.
Probability that you pick a word with 'nal' given that you pick a word with 'na'.
Probability that you pick a word with 'l' given that you pick a word with 'na'.
Probability that you pick a word with an alphabet 'c' given that you pick a word with 'r'.
Probability that you pick a word with an alphabet 'r' given that you pick a word with 'c'.

Answer these questions.

Given that we pick a word without vowels, which character has the highest probability of occurring? What's the probability?

Character prediction

Here is a simple character prediction. Let K be a fixed parameter. (You will experiment with various values of K).

The parameter K is important as we will take the information from all the words whose length is at least K. You can ignore all shorter words. We will use that to predict the K-th after we get K - 1 characters for a valid words.

For example:

for K = 2, we will use information from all words whose length is at least 2, to predict the 2nd character of a word after receiving the first character of a random word whose length is at least 2.
for K = 4, we will use information from all words whose length is at least 4, to predict the 4th character of a word after receiving the first 3 characters of a random word whose length is at least 4.

Probstat/random word

Probability

Conditional probability

Character prediction

รายการเลือกการนำทาง

เครื่องมือส่วนตัว

เนมสเปซ

สิ่งที่แตกต่าง

ดู

เพิ่มเติม

ค้นหา

การนำทาง

เครื่องมือ