Beam Search
Greedy search
Generate (or "decode") the target sentence by taking the argmax on each step of the decoder
Problem with greedy search:
- Greedy decoding has no way to undo decisions!
- Input: il a m’entarté (he hit me with a pie)
- → he ____
- → he hit ____
- → he hit a ____ (whoops! no going back now…)
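As an illustration, here is a minimal sketch of greedy decoding; `decoder_step`, `enc_out`, `START`, and `END` are hypothetical stand-ins for a trained seq2seq decoder and its special tokens, not a specific library API:

```python
import numpy as np

def greedy_decode(decoder_step, enc_out, START, END, max_len=50):
    tokens = [START]
    for _ in range(max_len):
        log_probs = decoder_step(tokens, enc_out)  # log-probs over the vocab
        next_tok = int(np.argmax(log_probs))       # commit to the single best token
        tokens.append(next_tok)                    # no way to undo this choice later
        if next_tok == END:
            break
    return tokens
```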
Exhaustive search decoding
Ideally we want to find a (length T) translation y that maximizes:
\[ \begin{aligned} P(y | x) &=P\left(y_{1} | x\right) P\left(y_{2} | y_{1}, x\right) P\left(y_{3} | y_{1}, y_{2}, x\right) \ldots P\left(y_{T} | y_{1}, \ldots, y_{T-1}, x\right) \\ &=\prod_{t=1}^{T} P\left(y_{t} | y_{1}, \ldots, y_{t-1}, x\right) \end{aligned} \]
We could try computing all possible sequences y:
- This means that on each step t of the decoder, we’re tracking \(V^t\) possible partial translations, where \(V\) is vocab size
- This \(O(V^T)\) complexity is far too expensive!
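To get a sense of the blow-up, with an (assumed, illustrative) vocabulary size of \(V = 50{,}000\) and sentence length \(T = 20\):
\[ V^{T} = 50{,}000^{20} \approx 10^{94} \text{ possible sequences} \]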
Beam search decoding
- Core idea: On each step of the decoder, keep track of the \(k\) most probable partial translations (which we call hypotheses), where \(k\) is the beam size (in practice around 5 to 10)
- A hypothesis \(y_1,\cdots,y_t\) has a score which is its log probability:
\[ \operatorname{score}\left(y_{1}, \ldots, y_{t}\right)=\log P_{\mathrm{LM}}\left(y_{1}, \ldots, y_{t} | x\right)=\sum_{i=1}^{t} \log P_{\mathrm{LM}}\left(y_{i} | y_{1}, \ldots, y_{i-1}, x\right) \]
- Scores are all negative, and higher score is better
- We search for high-scoring hypotheses, tracking top \(k\) on each step
- Beam search is not guaranteed to find optimal solution
- But much more efficient than exhaustive search!
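A minimal sketch of a single beam-search step, assuming a hypothetical `decoder_step(tokens, enc_out)` that returns log-probabilities over the vocabulary for a partial hypothesis:

```python
import numpy as np

def beam_step(hypotheses, decoder_step, enc_out, k=5):
    """hypotheses: list of (tokens, score) pairs; score = sum of log-probs so far."""
    candidates = []
    for tokens, score in hypotheses:
        log_probs = decoder_step(tokens, enc_out)          # shape: (V,)
        # Expand this hypothesis with every possible next token.
        for tok, lp in enumerate(log_probs):
            candidates.append((tokens + [tok], score + lp))
    # Keep only the k highest-scoring partial translations.
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:k]
```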
Beam search decoding: stopping criterion
- In greedy decoding, usually we decode until the model produces an <END> token
- In beam search decoding, different hypotheses may produce <END> tokens on different timesteps
- When a hypothesis produces <END>, that hypothesis is complete.
- Place it aside and continue exploring other hypotheses via beam search.
- Usually we continue beam search until:
- We reach timestep T (where T is some pre-defined cutoff), or
- We have at least n completed hypotheses (where n is a pre-defined cutoff)
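A hedged sketch of the outer loop with this stopping criterion, reusing the hypothetical `beam_step`, `decoder_step`, `START`, and `END` from the sketches above:

```python
def beam_search(decoder_step, enc_out, START, END, k=5, max_T=50, n_complete=5):
    hypotheses = [([START], 0.0)]      # (tokens, summed log-prob)
    completed = []
    for t in range(max_T):             # pre-defined cutoff T
        hypotheses = beam_step(hypotheses, decoder_step, enc_out, k)
        still_open = []
        for tokens, score in hypotheses:
            if tokens[-1] == END:      # hypothesis is complete: set it aside
                completed.append((tokens, score))
            else:
                still_open.append((tokens, score))
        hypotheses = still_open
        if len(completed) >= n_complete or not hypotheses:
            break                      # at least n completed hypotheses
    return completed or hypotheses
```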
Beam search decoding: finishing up
- We have our list of completed hypotheses.
- How to select top one with highest score?
- Each hypothesis \(y_1,\cdots,y_t\) on our list has a score:
\[ \operatorname{score}\left(y_{1}, \ldots, y_{t}\right)=\log P_{\mathrm{LM}}\left(y_{1}, \ldots, y_{t} | x\right)=\sum_{i=1}^{t} \log P_{\mathrm{LM}}\left(y_{i} | y_{1}, \ldots, y_{i-1}, x\right) \]
Problem with this: longer hypotheses have lower scores (each additional term adds a negative log-probability)
Fix: Normalize by length. Use this to select the top one instead:
\[ \frac{1}{t} \sum_{i=1}^{t} \log P_{\mathrm{LM}}\left(y_{i} | y_{1}, \ldots, y_{i-1}, x\right) \]
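A one-line sketch of this final selection, assuming each completed hypothesis is stored as a (tokens, summed_log_prob) pair as in the sketches above:

```python
def select_best(completed):
    """completed: list of (tokens, summed_log_prob) pairs from beam search."""
    # Divide the summed log-probability by the hypothesis length t.
    return max(completed, key=lambda h: h[1] / len(h[0]))
```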