Abstract: The probability generating function is the simplest of all the generating functions, yet it is compact and useful for dealing with discrete random variables, especially sums of independent random variables. This makes it a powerful tool in theories such as stochastic processes, where one needs to handle collections of independent random variables. It is therefore worthwhile to explore further properties of the p.g.f. Two applications, to the simple random walk and to the
Galton-Watson process, will also be introduced in this paper.
Keywords: probability generating function; random walk; Galton-Watson process
Introduction
Assume X is a discrete integer-valued random variable with probability mass function $\{p_k = P(X=k)\}_{k=0}^{\infty}$. Its probability generating function (p.g.f.) is defined by the following power series: $$P(s):=E\left(s^{X}\right)=\sum_{k=0}^{\infty} p_{k} s^{k}, \quad |s| \leq 1.$$ We know that in the discrete case, a probability distribution is determined by all the possible values of the random variable and the probabilities corresponding to those values, which is characterized by the so-called probability mass function. The probability generating function is another way to represent and characterize a probability distribution. From the following properties, we will see that it actually “encodes” everything about the distribution. Meanwhile, it makes some probability calculations easier. This is also the reason why it is a useful tool in the study of branching processes and stochastic processes. Note that there are other ways to transform a distribution: the characteristic function, the moment generating function and the cumulant generating function. Many properties of the probability generating function carry over to these functions, but they are beyond the scope of this article.
Properties
Probability Generating Function to Probability Distribution
If we take the kth derivative of the p.g.f. $P_X(s) = \sum_{k=0}^{\infty} p_{k} s^{k}$, we get:
$P_X^{(k)}(s) = \frac{d^{k}P_{X}(s)}{ds^k} = k!\,p_k + \sum_{j=k+1}^{\infty} \frac{j!}{(j-k)!}\, p_j s^{j-k}$
$\Rightarrow P_X^{(k)}(0) = \left.\frac{d^{k}P_{X}(s)}{d s^{k}}\right|_{s=0} = k!\,p_k$
$\Rightarrow p_k = P(X = k) = \frac{1}{k!} P_X^{(k)}(0)$
Therefore, we can recover the probability distribution from the probability generating function: the p.g.f. alone contains all the information about the original distribution. Since we can already derive all the probabilities of the distribution from the p.g.f., it is reasonable that we can also extract quantities like the expectation, variance or moments of the distribution from it. Through a simple calculation, we get (proof omitted):

$E(X) = P_X^{(1)}(1)$ (technically, it is a left derivative)

$Var(X) = P_X^{(2)}(1) - [P_X^{(1)}(1)]^2 + P_X^{(1)}(1)$

$\mathbb{E}\{X(X-1)(X-2) \cdots (X-k+1)\}=P_{X}^{(k)}(1)$
(The kth derivative of the p.g.f. at 1 is the kth factorial moment of X. In principle, all the moments can be recovered from the factorial moments.)
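As an illustration of how these formulas recover a distribution from its p.g.f., here is a small symbolic sketch. It assumes sympy is available; the Poisson distribution and the rate value are my own example choices, not from the paper.

```python
# Hedged sketch: recover probabilities and moments of a Poisson(lam)
# distribution from its p.g.f. P(s) = exp(lam*(s - 1)).
import sympy as sp

s = sp.symbols('s')
lam = sp.Rational(3, 2)              # example rate, chosen arbitrarily
P = sp.exp(lam * (s - 1))            # Poisson p.g.f.

# p_k = P^{(k)}(0) / k!  -- for k = 2 this equals lam^2 * exp(-lam) / 2
p2 = sp.diff(P, s, 2).subs(s, 0) / sp.factorial(2)

# E(X) = P'(1), Var(X) = P''(1) + P'(1) - P'(1)^2; both equal lam for Poisson
mean = sp.diff(P, s).subs(s, 1)
var = sp.diff(P, s, 2).subs(s, 1) + mean - mean**2
```

The same recipe works for any p.g.f. with a closed form, since differentiation and evaluation are mechanical.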
Uniqueness Theorem for Probability Generating Function
From section 2.1, we know that p.g.f. determines $\{p_n\}$, which is the distribution. Uniqueness Theorem for p.g.f. tells us additionally that it determines a unique distribution.
Uniqueness Theorem: Let X and Y be discrete random variables with probability generating functions $P_X$ and $P_Y$, respectively. Then $P_X(s) = P_Y(s)$ iff P(X=k) = P(Y=k) for all integers k $\geq$ 0, i.e. the probability generating functions are the same iff the distributions of X and Y are the same.
Proof: If the distributions are the same, then by the definition of the p.g.f., $P_X(s) = P_Y(s)$. If $P_X(s) = P_Y(s)$, expand them into power series: $P_X(s) = \sum_{k=0}^{\infty} P(X=k)s^k = P_Y(s) = \sum_{k=0}^{\infty} P(Y=k)s^k$.
It follows that the coefficients of the two power series must be the same, thus P(X=k) = P(Y=k) for all integers $k \geq 0$, which means the p.g.f. uniquely determines a distribution.
Probability Generating Function of the Sum of Independent R.V.s
The main reason why the p.g.f. is powerful is that it makes dealing with sums of independent random variables easier: it turns convolutions into products.
Theorem 1: Assume $X_1, . . . , X_n$ are independent random variables. Let $Y = X_1 + . . . + X_n$. Then $$P_{Y}(s)=\prod_{i=1}^{n} P_{X_{i}}(s)$$ .
Proof:
$\begin{aligned} P_{Y}(s) &=\mathbb{E}\left(s^{X_{1}+\ldots+X_{n}}\right) \\ &=\mathbb{E}\left(s^{X_{1}} s^{X_{2}} \ldots s^{X_{n}}\right) \\ &=\mathbb{E}\left(s^{X_{1}}\right) \mathbb{E}\left(s^{X_{2}}\right) \ldots \mathbb{E}\left(s^{X_{n}}\right) \quad \text{(since } X_1, \ldots, X_n \text{ are independent)} \end{aligned}$
$\Rightarrow P_Y(s) =\prod_{i=1}^{n} P_{X_{i}}(s) . \quad$
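Theorem 1 can be checked symbolically on a concrete case. A minimal sketch, assuming sympy is available; the Bernoulli/Binomial example is my own choice:

```python
# Hedged sketch: the p.g.f. of a sum of n i.i.d. Bernoulli(p) variables is
# the product of n identical Bernoulli p.g.f.s, which should expand to the
# Binomial(n, p) p.g.f. with coefficients C(n, k) p^k (1-p)^(n-k).
import sympy as sp

s, p = sp.symbols('s p')
n = 4
bernoulli_pgf = (1 - p) + p * s              # E(s^X) for X ~ Bernoulli(p)
sum_pgf = sp.expand(bernoulli_pgf ** n)      # product of n equal factors

# read off the coefficient of s^2; it should be the Binomial(4, p) mass at 2
coeff_2 = sum_pgf.coeff(s, 2)
```

Reading off coefficients like this is exactly the "convolution turned into product" simplification the theorem provides.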
If we don’t know how many random variables are in the sum, i.e. n is not fixed but is itself a random variable (say N), then we have the following theorem:
Theorem 2: Let $X_1, X_2, . . .$ be a sequence of i.i.d random variables with common p.g.f. $P_X(s)$. Let N be a random variable, independent of $X_i$, with p.g.f $P_N(s)$, and let $S_N = X_1+. . .+X_N$. Then $$P_{S_N}(s) = P_N(P_X(s))$$
Proof:
$\begin{aligned} P_{S_{N}}(s) &=\mathbb{E}\left(s^{S_{N}}\right)=\mathbb{E}\left(s^{X_{1}+\ldots+X_{N}}\right) \\ &=\mathbb{E}_{N}\left\{\mathbb{E}\left(s^{X_{1}+\ldots+X_{N}} \mid N\right)\right\} \quad(\text { law of total expectation }) \\ &=\mathbb{E}_{N}\left\{\mathbb{E}\left(s^{X_{1}} \ldots s^{X_{N}} \mid N\right)\right\} \\ &=\mathbb{E}_{N}\left\{\mathbb{E}\left(s^{X_{1}} \ldots s^{X_{N}}\right)\right\} \quad\left(X_{i} \text { are independent of } N\right) \\
&=\mathbb{E}_{N}\left\{\mathbb{E}\left(s^{X_{1}}\right) \ldots \mathbb{E}\left(s^{X_{N}}\right)\right\} \quad\left(X_{i}\right.\text{ are independent of each other})\\
&=\mathbb{E}_{N}\left\{\left(P_{X}(s)\right)^{N}\right\} \quad\text{(by definition of }\left.P_{X}\right)\\
&=P_{N}\left(P_{X}(s)\right) \quad\text{(by definition of }\left.P_{N}\right)
\end{aligned}$
This theorem helps us compute the p.g.f. (equivalently, find the distribution) of a sum of i.i.d. random variables when the number of summands is itself random.
Note that we can get the basic version of Wald’s Identity from the above result (using $P_X(1) = 1$): $$\begin{aligned}
P_{S_N} &= P_N \circ P_X\\
P_{S_N}^{\prime}(s) &= P_X^{\prime}(s)\, P_N^{\prime}(P_X(s))\\
P_{S_N}^{\prime}(1) &= P_X^{\prime}(1)\, P_N^{\prime}(P_X(1)) = P_X^{\prime}(1)\, P_N^{\prime}(1)\\
E(S_N) &= E(N)\,E(X_1)\end{aligned}$$
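Theorem 2 and Wald's Identity can be sanity-checked symbolically. The sketch below uses the classical Poisson-thinning case (Poisson N, Bernoulli X); the distributions and parameter values are my own example choices.

```python
# Hedged sketch: with N ~ Poisson(lam) and X_i ~ Bernoulli(p),
# P_{S_N}(s) = P_N(P_X(s)) should be the Poisson(lam*p) p.g.f.,
# and Wald's identity gives E(S_N) = E(N) * E(X_1) = lam * p.
import sympy as sp

s = sp.symbols('s')
lam, p = sp.Rational(2), sp.Rational(1, 3)   # example parameters

P_N = sp.exp(lam * (s - 1))                  # p.g.f. of N ~ Poisson(lam)
P_X = (1 - p) + p * s                        # p.g.f. of X ~ Bernoulli(p)
P_SN = P_N.subs(s, P_X)                      # composition P_N(P_X(s))

gap = sp.simplify(P_SN - sp.exp(lam * p * (s - 1)))  # 0 if thinning holds
mean_SN = sp.diff(P_SN, s).subs(s, 1)                # Wald: lam * p
```

The composition collapses to the Poisson(lam*p) p.g.f., i.e. a thinned Poisson stream is again Poisson, with the mean Wald's Identity predicts.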
Convergence of Probability Generating Function
Note that the definition of the probability generating function simply assumes $|s| \leq 1$ to ensure that the power series converges. Though this is a sufficient condition for the convergence of the p.g.f., it is not necessary: in many cases the radius of convergence can be greater than 1. We now explore the convergence properties of the p.g.f., starting from some simple facts.
Theorem 1: The p.g.f. must converge (pointwise) at least for $s \in [-1,1]$.
Proof: Suppose the p.g.f. is $P_X(s) = \sum_{k=0}^{\infty} p_{k} s^{k}$.
For all $-1 \leq s \leq 1$ we have $|p_k s^k| \leq p_k$,
and $\sum_{k=0}^{\infty} p_k = 1$ converges, so by the comparison test
the series $\sum_{k=0}^{\infty} p_{k} s^{k}$ converges (absolutely).
Remark: The radius of convergence for p.g.f is at least 1.
Theorem 2: Suppose the p.g.f. converges on $(-R, R)$, i.e., the radius of convergence is R. Then the p.g.f. converges uniformly on any closed interval $[-r, r]$ with $0 < r < R$.
Proof: For all $0 < r < R$, $\sum_{k=0}^{\infty} p_{k} r^{k}$ converges, and $|p_k x^k| \leq p_k r^k$ when $|x| \leq r$. By the Weierstrass M-test, $\sum_{k=0}^{\infty} p_{k} x^{k}$ converges uniformly on $[-r, r]$.
Remark: If the p.g.f. converges on a closed interval $[-R, R]$, then it must converge uniformly on $[-R, R]$. Taking R = 1, we get that the p.g.f. must converge uniformly on $[-1, 1]$.
Theorem 3: Suppose the p.g.f. converges on $(-R, R)$. Then the p.g.f. is continuous on $(-R, R)$.
Proof: For any $x_0 \in (-R, R)$, there exists $r > 0$ such that $x_0 \in [-r, r] \subset (-R, R)$. Since the p.g.f. converges uniformly on $[-r, r]$, it is continuous at $x_0$. Since $x_0$ can be chosen arbitrarily in $(-R, R)$, the p.g.f. is continuous at every point of $(-R, R)$.
Theorem 4: Suppose the p.g.f. converges at the right (left) endpoint of the interval $(-R, R)$. Then the p.g.f. is left (right) continuous at that endpoint.
Proof: Without loss of generality, assume the p.g.f. converges at $s = R$.
We only need to prove that $\sum_{k=0}^{\infty} p_k s^k$ converges uniformly on $[0, R]$.
Write $\sum_{k=0}^{\infty} p_k s^k = \sum_{k=0}^{\infty} p_k R^k \left(\frac{s}{R}\right)^k$.
Because $\sum_{k=0}^{\infty} p_k R^k$ converges and $\{(\frac{s}{R})^k\}$ is monotonically decreasing and uniformly bounded on $[0, R]$, by Abel’s test $\sum_{k=0}^{\infty} p_k s^k$ converges uniformly on $[0, R]$.
Remark: Taking R = 1, the theorem gives $P_X(1) = P_X(1^-)$ and $P_X(-1) = P_X(-1^+)$, i.e., the p.g.f. is continuous at both endpoints of $(-1, 1)$. For a radius R > 1, the p.g.f. may fail to be continuous at the endpoints $\pm R$.
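As a quick numerical illustration that the radius of convergence can exceed 1 (my own example, not from the paper): for a geometric distribution with $p_k = p q^k$, the p.g.f. is $\frac{p}{1-qs}$ with radius of convergence $1/q > 1$.

```python
# Hedged numerical sketch: the geometric p.g.f. sum of p * q**k * s**k
# converges for |s| < 1/q, so it also converges at a point s > 1.
p, q = 0.5, 0.5          # success probability p, with q = 1 - p
s = 1.2                  # outside [-1, 1] but inside (-1/q, 1/q) = (-2, 2)

partial = sum(p * q**k * s**k for k in range(200))   # truncated series
closed_form = p / (1 - q * s)                        # geometric series limit
```

The truncated series agrees with the closed form $p/(1-qs) = 1.25$ to many digits, even though $s > 1$.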
Theoretical Applications
The first reaching time in the simple random walk
Model construction:
A random walk is a stochastic process that describes a path consisting of a succession of random steps on some mathematical space. The probability generating function is especially useful in this area because of its good behavior with respect to sums of independent random variables. Here we only consider a one-dimensional random walk on the real line where the length of each step is a constant (assume it is 1). Let $S_n$ denote the position of the object at the nth step, and let $X_1, ..., X_n$ denote the individual steps, i.e., $X_i = 1$ if at the ith step the object moves right on the axis and $X_i = -1$ if it moves left. The probability that the object moves right at each step is p, and the probability that it moves left is $q = 1-p$.
The p.g.f. (distribution) of the first reaching time:
In this section, we aim to use the probability generating function to find the distribution of the reaching times of the object. The reaching time is defined as $T_{i,j} :=$ “the first time to reach position j from position i”. We will concentrate on the case i = 0, j = 1, and assume the distribution is characterized by the p.m.f. $p_n = P(T_{0,1} = n)$; the goal is then to find $\{p_n\}$, the distribution of the random variable $T_{0,1}$.
Firstly, $p_1 = p$ is obvious. When n > 1, the object must first move left to position $-1$ and then reach 1 through the following $(n-1)$ steps. Additionally, we can decompose those $(n-1)$ steps into two phases: moving from $-1$ to 0, then moving from 0 to 1. So by the law of total probability we get: $$\begin{aligned}
p_n &= \sum_{j=1}^{n-2} q\,P(\text{“$j$ steps to first reach 0 from $-1$” and “$n-j-1$ steps to first reach 1 from 0”})\\
&= q\sum_{j=1}^{n-2} P(T_{-1,0}=j \text{ and } T_{0,1} = n-1-j) \end{aligned}$$
Intuitively, we can sense that the event “j steps to first reach 0 from $-1$” has no effect on the event “$n-1-j$ steps to first reach 1 from 0”, which means that for any given j, $P(T_{0,1}=n-1-j) = P(T_{0,1}=n-1-j \mid T_{-1,0}=j)$. So it is reasonable to regard the two events as independent. This is the regeneration property of the random walk.
With independence, and noting that by translation invariance $T_{-1,0}$ has the same distribution as $T_{0,1}$, we get: $$\begin{aligned}
p_{n} &= q\sum_{j=1}^{n-2} P(T_{-1,0}=j)P(T_{0,1} = n-1-j)\\
&= q\sum_{j=1}^{n-2} P(T_{0,1}=j)P(T_{0,1} = n-1-j)\\
&= q \sum_{j=1}^{n-2} p_{j} p_{n-j-1}, \quad n \geq 3, \quad p_{0}=0, \quad p_{1}=p, \quad p_{2} = 0\end{aligned}$$
It is hard to solve for the distribution $p_n$ from the above equation since it contains a convolution. But we can use the generating function method to convert the convolution into a product and reduce the complexity of the computation:
$\begin{aligned} \text{By the definition of the p.g.f., } P_X(s) &= \sum_{k=0}^{\infty} p_ks^k\\ \Rightarrow (P_X(s))^2 &= \sum_{k=0}^{\infty}\left(\sum_{i=0}^{k} p_{i} p_{k-i}\right) s^{k}\\ \sum_{i=0}^{k} p_ip_{k-i} &= p_0p_k + p_kp_0 + \sum_{i=1}^{k-1} p_ip_{k-i} = \frac{p_{k+1}}{q}, \quad k \geq 2\\ \Rightarrow q(P_X(s))^2 &= \sum_{k=2}^{\infty} p_{k+1} s^k\\ qs(P_X(s))^2 &= \sum_{k=2}^{\infty} p_{k+1}s^{k+1} = \sum_{k=0}^{\infty} p_ks^k - sp = P_X(s) - sp\\ \text{Solving the quadratic equation, } \Rightarrow P_X(s) &= \frac{1-\sqrt{1-4 p q s^{2}}}{2 q s}, \quad |s| \leq \frac{1}{2 \sqrt{p q}}\end{aligned}$
By now we have successfully obtained the p.g.f. of $T_{0,1}$. From section 2 above, we know that the p.g.f. characterizes a unique distribution of a random variable, and we can extract useful information directly from the p.g.f. without calculating its “real distribution” $\{p_n\}$. There are two main pieces of information of interest in this random walk example.
The probability that the object will never reach 1 from 0:
Since the p.g.f. is defined as $P_X(s) = \sum_{k=0}^{\infty} p_ks^k \; (|s| \leq 1)$, we have $P_X(1) = \sum_{k=0}^{\infty} p_k = P(X < \infty)$. Thus if the random variable X can take the value $\infty$, i.e., $P(X=\infty) > 0$, then $P_X(1) = 1 - P(X = \infty) < 1$, and vice versa. So to determine whether a random variable can take the value $\infty$, one way is to check whether $P_X(1) < 1$.
In this case, the event “the object may never reach 1 from 0” is equivalent to $P(T_{0,1}=\infty) > 0$, so we only need to examine the value of $P_X(1)$. If $P_X(1) = 1$, then $P(T_{0,1}=\infty) = 0$; otherwise $P(T_{0,1}=\infty) > 0$:
$P_X(1)=\frac{1-\sqrt{1-4 p q}}{2 q}=\frac{1-|p-q|}{2 q}=\left\{\begin{array}{ll}
1, & p \geq \frac{1}{2} \\
\frac{p}{q} < 1, & p<\frac{1}{2}
\end{array}\right.$.
So the conclusion is: the object will definitely reach position 1 from position 0 when $p \geq 0.5 \Leftrightarrow p \geq q$, no matter how long it takes; but when $p < 0.5 \Leftrightarrow p < q$, it is possible that the object never reaches 1 from 0, with probability $1-\frac{p}{q}$.
Expected time (number of steps) to first reach 1 from 0:
Since we can easily extract the expectation (first-order moment) of a random variable from its p.g.f., we again apply the generating function method to find the expected time to reach 1 from 0, i.e., $E(T_{0,1})$. It should be emphasized that in this case $E(T_{0,1})$ may not equal $P_{T_{0,1}}^{\prime}(1)$, because $T_{0,1}$ may take the value infinity while the p.g.f. only accounts for finite values.

If $p < 0.5$, then $P(T_{0,1} = \infty) > 0$, so by the definition of expectation we have $E(T_{0, 1}) = \infty$.
If $p = 0.5$,
$P_X^{\prime}(s)=\frac{2 p}{\sqrt{1-4 p q s^{2}}}-\frac{1-\sqrt{1-4 p q s^{2}}}{2 q s^{2}}$,
$E(T_{0,1}) = \lim _{s \rightarrow 1^-} P_X^{\prime}(s)=\lim _{s \rightarrow 1^-}\left(\frac{1}{\sqrt{1-s^{2}}}-\frac{1-\sqrt{1-s^{2}}}{s^{2}}\right)=+\infty$.
This means that even though the object will definitely reach 1, the expected reaching time is nevertheless $\infty$.
If $p > 0.5$,
$E(T_{0,1}) = \lim _{s \rightarrow 1^-} P_X^{\prime}(s) = P_X^{\prime}(1) = \frac{1}{p-q}$.
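Both conclusions can be verified symbolically from the closed-form p.g.f. Here is a sketch with sympy; the particular values of p are my own choices for illustration:

```python
# Hedged sketch: check P_X(1) = p/q for p < 1/2 and E(T_{0,1}) = 1/(p - q)
# for p > 1/2, from P(s) = (1 - sqrt(1 - 4*p*q*s**2)) / (2*q*s).
import sympy as sp

s = sp.symbols('s', positive=True)

def walk_pgf(p):
    q = 1 - p
    return (1 - sp.sqrt(1 - 4 * p * q * s**2)) / (2 * q * s)

# p = 1/3 < 1/2: reaching 1 is not certain; P_X(1) should equal p/q = 1/2
reach_prob = sp.simplify(walk_pgf(sp.Rational(1, 3)).subs(s, 1))

# p = 2/3 > 1/2: reaching is certain and E(T_{0,1}) = 1/(p - q) = 3
mean_T = sp.simplify(sp.diff(walk_pgf(sp.Rational(2, 3)), s).subs(s, 1))
```

For $p = 1/3$ the walk reaches 1 only with probability $1/2$, and for $p = 2/3$ the expected first-reaching time is exactly 3 steps.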
The population size in the Galton-Watson process
Model Construction:
At the 0th generation, there is one individual. It produces offspring and then dies; these offspring belong to the 1st generation. Afterwards, each of these offspring continues the same reproduction process, producing more and more generations.
Now define $Z_n$ as the population size at the nth generation (i.e. the number of individuals in generation n); clearly $Z_0 = 1$. Let Y be the number of offspring of an individual. The individuals’ family sizes (i.e. numbers of offspring) $Y_1, Y_2, \ldots$ are i.i.d. $\sim Y$.
The p.g.f. (distribution) of $Z_n$:
To know the information of the nth generation, first consider the (n-1)th generation. Assume each individual in the (n-1)th generation is labelled $1, ..., Z_{n-1}$, and their numbers of offspring are given by $Y_1, ..., Y_{Z_{n-1}}$. Then we get the formula $Z_n = \sum_{i=1}^{Z_{n-1}} Y_i$. This means $Z_n$ is a sum of i.i.d. random variables with a random number of summands. Section 2.3 shows that the probability generating function is a useful tool for this kind of problem.
Let $P_Y(s)$, $P_{Z_n}(s)$, $P_{Z_{n-1}}(s)$ be the p.g.f.s of the random
variables $Y, Z_{n}, Z_{n-1}$.
Since $Z_n = \sum_{i=1}^{Z_{n-1}} Y_i$, by theorem 2 from section 2.3,
we can find a recursion formula for the p.g.f. of $Z_n$:
$P_{Z_n}(s) = P_{Z_{n-1}}(P_Y(s))$.
Continuing the iteration, we get the p.g.f. of $Z_n$:
$P_{Z_n}(s)=\underbrace{P_Y(P_Y(\ldots P_Y(s) \ldots))}_{n \text { times }}$.
Note that $Z_1 = Y$; then we have the following important recursion formula.
Branching process recursion formula:
$P_{Z_n}(s)=\underbrace{P_{Z_1}(P_{Z_1}(\ldots P_{Z_1}(s) \ldots))}_{n \text { times }} = P_{Z_{n-1}}(P_{Z_1}(s)) = P_{Z_1}(P_{Z_{n-1}}(s))$
Remark: In many cases it is not easy to obtain an explicit expression for the p.g.f. from the above formula when n is large, but for any offspring distribution Y we can use it to work out the mean and variance of $Z_n$.
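To make the recursion concrete, here is a small sympy sketch (the offspring distribution $P(Y=0)=P(Y=2)=\frac{1}{4}$, $P(Y=1)=\frac{1}{2}$ is my own example): iterating the composition yields the full distribution of $Z_2$ as coefficients.

```python
# Hedged sketch: iterate P_{Z_n}(s) = P_{Z_{n-1}}(P_Y(s)) for a concrete
# offspring p.g.f. and read the distribution of Z_2 off the coefficients.
import sympy as sp

s = sp.symbols('s')
P_Y = sp.Rational(1, 4) + sp.Rational(1, 2) * s + sp.Rational(1, 4) * s**2

P_Z = s                       # P_{Z_0}(s) = s, since Z_0 = 1
for _ in range(2):            # compose twice to reach generation 2
    P_Z = P_Z.subs(s, P_Y)    # P_{Z_n}(s) = P_{Z_{n-1}}(P_Y(s))

P_Z = sp.expand(P_Z)
prob_z2_zero = P_Z.subs(s, 0)   # P(Z_2 = 0) = P_{Z_2}(0) = P_Y(P_Y(0))
```

For this offspring law, $P(Z_2 = 0) = P_Y(P_Y(0)) = P_Y(\frac14) = \frac{25}{64}$, and the coefficient of $s^k$ in the expanded polynomial is $P(Z_2 = k)$.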
Mean and variance of $Z_n$:
Now we proceed to extract information about the distribution of $Z_n$ using p.g.f. and some of the consequences obtained in previous sections.
Theorem 1 (Mean): Let $E(Y) = \mu$, if $\mu < \infty$, then $E(Z_n) = \mu^n$.
Proof: Since $P_{Z_n}(s) = P_{Z_{n-1}}(P_Y(s))$, differentiating both sides at $s = 1$ (or directly using Wald’s Identity introduced previously) gives $E(Z_n) = E(Y)E(Z_{n-1}) = \mu E(Z_{n-1}) = \mu^2 E(Z_{n-2}) = \ldots = \mu^n$.
Remark: In natural language, if each individual is expected to produce $\mu$ offspring, then there are expected to be $\mu^n$ individuals in the nth generation. This implies that if $\mu > 1$, the population size grows exponentially from generation to generation, while if $\mu < 1$, it decreases exponentially.
Theorem 2 (Variance): Let $E(Y) = \mu$, $Var(Y) = \sigma^2$. If $\mu < \infty$ and $\sigma^2 < \infty$, then
$V_n = \operatorname{Var}\left(Z_{n}\right)=\left\{\begin{array}{cl}
\sigma^{2} n & \text { if } \mu=1 \\
\sigma^{2} \mu^{n-1}\left(\frac{1-\mu^{n}}{1-\mu}\right) & \text { if } \mu \neq 1
\end{array}\right.$
Proof:
$\begin{aligned}
Z_{n} &=\sum_{i=1}^{Z_{n-1}} Y_{i} \\
\Rightarrow \operatorname{Var}\left(Z_{n}\right) &=\left\{\mathbb{E}\left(Y\right)\right\}^{2} \times \operatorname{Var}\left(Z_{n-1}\right)+\operatorname{Var}\left(Y\right) \times \mathbb{E}\left(Z_{n-1}\right) \quad \text{(law of total variance)}\\
\Rightarrow \quad V_{n} &=\mu^{2} V_{n-1}+\sigma^{2} \mathbb{E}\left(Z_{n-1}\right) \\
\Rightarrow \quad V_{n} &=\mu^{2} V_{n-1}+\sigma^{2} \mu^{n-1}
\end{aligned}$
Unrolling the recursion with $V_1 = \sigma^2$ gives $V_n = \sigma^{2} \mu^{n-1}\left(1+\mu+\cdots+\mu^{n-1}\right)$, which yields the stated result in both cases.
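Both theorems can be checked against a concrete offspring distribution via the p.g.f. recursion. A sketch (the distribution $P(Y=0)=0.2$, $P(Y=1)=0.5$, $P(Y=2)=0.3$ is my own example, giving $\mu = 1.1$ and $\sigma^2 = 0.49$):

```python
# Hedged sketch: compare E(Z_3) and Var(Z_3), extracted from the composed
# p.g.f., with the closed forms mu**3 and sigma2 * mu**2 * (1 - mu**3)/(1 - mu).
import sympy as sp

s = sp.symbols('s')
P_Y = sp.Rational(1, 5) + sp.Rational(1, 2) * s + sp.Rational(3, 10) * s**2
mu, sigma2 = sp.Rational(11, 10), sp.Rational(49, 100)   # E(Y), Var(Y)

P_Z = s                       # P_{Z_0}(s) = s
for _ in range(3):            # compose three times to reach generation 3
    P_Z = P_Z.subs(s, P_Y)

# E(Z_3) = P'(1); Var(Z_3) = P''(1) + P'(1) - P'(1)^2 (section 2.1 formulas)
mean_Z3 = sp.diff(P_Z, s).subs(s, 1)
var_Z3 = sp.diff(P_Z, s, 2).subs(s, 1) + mean_Z3 - mean_Z3**2
```

Since all quantities are exact rationals, the agreement with $\mu^n$ and $\sigma^2\mu^{n-1}\frac{1-\mu^n}{1-\mu}$ is exact, not just numerical.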
Conclusion
This article covers the important properties of the probability generating function and uses two small examples to illustrate how the p.g.f. works and why it is useful. The two examples are basic and fundamental topics in the theory of random walks and branching processes. Digging further, the generating function method can also be used to analyze classic problems such as the Gambler’s Ruin or the Extinction Problem. These theoretical results can be applied in many areas of the real world, such as demography or econometrics.