ML: Stochastic Processes
Introduction
In a number of problems we consider an ordered
sequence of random variables $\{X_1,X_2,...,X_n\}$.
If this sequence is in time, then we are talking about a random process $X_t$.
In this case, the sequence
$\{x_1,x_2,...,x_n\}$ is observed in one experiment (this realization of random variables).
This document continues the series of materials on probability theory. First, you should look into the Introduction to the Theory, Bayesian Method and the concept of Entropy.
Probabilities of letters in text
Let's consider natural language as an example of a random process. Suppose we have a Russian dictionary $\mathcal{V}=\{w^{(1)},...,w^{(\text{V})}\}$, consisting of $\text{V}=|\mathcal{V}|$ words (or symbols). The sequence $\mathcal{T}= w_1\,w_2....\,w_N$, where $w_i\in \mathcal{V}$ forms a text of length $N=\mathrm{len}(\mathcal{T})$. Further, for simplicity, we will talk about letters.
The probability of encountering a specific letter in the text depends on its history (preceding letters). For example, after '$\mathbf{э}$' the probability of encountering '$\mathbf{т}$' is 14 times higher than the unconditional probability of encountering the letter '$\mathbf{т}$': $P(\mathbf{т})=0.051$. On the contrary, some letter combinations are difficult to pronounce. For instance, after '$\mathbf{б}$' the appearance of '$\mathbf{п}$' is unlikely.
To calculate the joint probability $P(\mathbf{эт})$ of encountering the line 'эт' in the text, it is necessary to count the number of such lines $N(\mathbf{эт})$ and divide it by the number of all substrings of length two: $N(\mathbf{**})$, where the asterisk denotes any character.
To calculate the conditional probability $P(\mathbf{э} \to \mathbf{т})$, which is the probability of encountering the letter 'т' if the letter 'э' precedes it in the text, we need to select all substrings that satisfy the mask '$\mathbf{э*}$' ('$\mathbf{э}$', followed by any symbol '$\mathbf{*}$'), and determine how many of them contain the substring 'эт': $$ P(\mathbf{эт})= \frac{N(\mathbf{эт})}{N(\mathbf{**})} = 0.002,~~~~~~~~~~~~~~~~~~~~ P(\mathbf{э}\to \mathbf{т}) = \frac{N(\mathbf{эт})}{N(\mathbf{э*})} = 0.83. $$
Note that the substring 'эт' itself has a low probability of 0.002, but if we have already encountered the letter 'э', then the letter 'т' will follow it with a probability of 0.83.
The total number of both joint and conditional probabilities of two letters in the Russian alphabet, consisting of $33$ letters (excluding 'ё', but including spaces), is equal to $33^2=1089$. For a text of $N$ characters $N(\mathbf{**})=N-1$, and $N(\mathbf{э*})\approx N(\mathbf{э})=P(\mathbf{э})\cdot N$. By the definition of conditional probability: $$ P(\mathbf{эт}) = P(\mathbf{э})\cdot P(\mathbf{э}\to \mathbf{т}). $$
Since there are no terminal letters in the language (letters after which there are no other letters), the sum of $P(w_1\to w_2)$ over all $w_2$ in the vocabulary must equal one (there will be some letter after $w_1$). This holds true in the general case as well. If $P(w_1,...,w_{n-1})$ is nonzero, then the sum of $P(w_1,...,w_{n-1}\to w_n)$ over all words $w_n$ in the vocabulary $\mathcal{V}$ is 1.Markov Processes
By the definition of conditional probability, the joint probability of the sequence $w_1,...,w_n$ is: $$ P(w_1,...,w_n) = P(w_1,...,w_{n-1})\,P(w_1,...,w_{n-1} \to w_n). $$ Applying this relation recursively, it is straightforward to derive the сhain rule for calculating joint probability through conditionals: $$ P(w_1,...,w_n) ~=~ P(w_1)\cdot P(w_1\to w_2)\cdot...\cdot P(w_1...w_{n-1} \to w_n) ~=~ \prod^N_{k=1} P(w_1...w_{k-1} \to w_k). $$ This formula is interpreted as follows: first, the letter $w_1$ appears with the probability $P(w_1)$. If this happens, $w_2$ will appear with the probability $P(w_1\to w_2)$, and so on.
In Markov random processes the long history is "forgotten," and $$ P(w_1,...,w_{n-1} \to w_n) = P(w_{n-k},...,w_{n-1} \to w_n). $$ If the conditional probabilities do not change along the sequence, then such a process is called stationary.
A.Markov considered the simplest case $k=1$, where the joint probability $P(w_1,...,w_n)$ can be expressed through the probabilities $P(w_{k-1}\to w_k)$. Let's provide an example of a slightly more complex case with a history length of $2$: $$ P(w_1,...,w_n) = P(w_1)\cdot P(w_1\to w_2) \cdot P(w_1 w_2\to w_3) \cdot P(w_2w_3\to w_4)\cdot...\cdot P(w_{n-1}w_{n-2}\to w_n). $$
Naturally, the properties of "short memory" and "stationarity" are approximations, and their validity is assessed by the performance of models that use these approximations.