
[Paper Review] Dense Passage Retrieval for Open-Domain Question Answering

Paper Review for DPR model

Dense Passage Retrieval for Open-Domain Question Answering
arxiv pdf link for DPR

0. Abstract ๐ŸŽฌ

Open-domain question answering ์€ ๋‹ต๋ณ€์— ํšจ๊ณผ์ ์ธ passage๋ฅผ retrieval ํ•˜๋Š” ๋ฐฉ์‹์— ์˜์กดํ•œ๋‹ค. ์ด ๋ฐฉ์‹์€ ์ง€๊ธˆ๊นŒ์ง€ TF-IDF, BM25์™€ ๊ฐ™์€ sparse vector space๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๋ฐฉ๋ฒ•์— ๊ธฐ์ธํ•ด ์™”๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ด retrieval์„ ์œ„ํ•œ vector representation์ด ์ ์€ ์ˆ˜์˜ question & passage๋ฅผ ์‚ฌ์šฉํ•œ dual-encoder framework๋ฅผ ์ด์šฉํ•˜์—ฌ, dense representation์œผ๋กœ๋„ ์‚ฌ์‹ค์ƒ ๊ฐ€๋Šฅํ•˜๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์ธ๋‹ค.

Evaluated on a wide range of open-domain QA datasets, the paper's dense retriever outperforms the existing Lucene-BM25 system by 9% ~ 19% absolute in top-20 retrieval accuracy. It also establishes a new SOTA on open-domain QA.

1. Introduction โ˜•๏ธ

Open-domain question answering (QA) is the task of answering factoid questions using a huge collection of documents. Earlier QA systems were rather complex and made up of many components, but with advances in reading comprehension models, the task has been reduced to a very simple two-stage framework:

  1. context retriever์ด ๋จผ์ € ๋‹ต๋ณ€์„ ์œ„ํ•œ passage๋“ค์˜ ์ž‘์€ ์ง‘ํ•ฉ์„ ์„ ํƒํ•œ๋‹ค.

  2. ๊ทธ๋ฆฌ๊ณ  reader์ด retrieved ๋œ context๋“ค์„ ๋ถ„์„ํ•˜์—ฌ ์˜ฌ๋ฐ”๋ฅธ ์ •๋‹ต์„ ๋„์ถœํ•œ๋‹ค.

Of course, viewing QA purely as a machine reading task is worth considering, but documented cases of huge performance degradation suggest that retrieval itself needs to be improved.

QA์—์„œ retrieval์€ ์ฃผ๋กœ TF-IDF ์ด๋‚˜ BM25๋กœ ๊ตฌํ˜„๋˜์–ด ์™”๋Š”๋ฐ, ์ด๋Š” keyword ๋ฅผ ์ค‘์ ์œผ๋กœ sparse vector๋กœ ํ‘œํ˜„ํ•˜๋Š” ๋ฐฉ๋ฒ•์ด์˜€๋‹ค. ๋ฐ˜๋Œ€๋กœ, dense vector์„ latent semantic encoding ์„ ํ™œ์šฉํ•˜์—ฌ ์•ž์„  sparse vector๊ณผ๋Š” ์ƒ๋ณด์ ์ธ ๊ด€๊ณ„์— ์žˆ๋‹ค.

For example, consider the following:

Q : Who is the bad guy in lord of the rings?

Useful context : Sala Baker is best known for portraying the villain Sauron in the Lord of the Rings trilogy.

Term-based system์€ villain ๊ณผ bad guy์— ๋Œ€ํ•œ semantic similarity๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์ง€ ์•Š๊ธฐ ๋–„๋ฌธ์—, ํ•ด๋‹น context๋ฅผ retrieval ํ•˜๊ธฐ ์–ด๋ ต์ง€๋งŒ, dense retrieval system์€ ์ด ๋‘ ๋‹จ์–ด๋ฅผ ์—ฐ๊ฒฐ์ง€์–ด ํ•ด๋‹น context๋ฅผ reteival ํ•  ๊ฐ€๋Šฅ์„ฑ์ด ๋†’๋‹ค.

Furthermore, because dense encodings are learnable, they offer the flexibility to be trained specifically for a given task. Retrieval with these representations is carried out with MIPS (maximum inner product search) algorithms.

๊ทธ๋Ÿฌ๋‚˜, ์ผ๋ฐ˜์ ์œผ๋กœ ์ข‹์€ dense vector representation์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์€ ํฐ ์ˆ˜์˜ question & context pair ์ด ํ•„์š”ํ•˜๋‹ค๊ณ  ์—ฌ๊ฒจ์ ธ ์™”๋‹ค. Dense retrieval ๋ฐฉ๋ฒ•์€ TF-IDF/BM25์™€ ๊ฐ™์€ ๊ณ ์ „ ๋ฐฉ์‹์„ ๋Šฅ๊ฐ€ํ•˜์ง€ ๋ชปํ–ˆ์—ˆ์ง€๋งŒ, ICT (inverse cloze task) training์„ ์ด์šฉํ•œ ๋ชจ๋ธ์ธ ORQA๊ฐ€ ์ฒ˜์Œ์œผ๋กœ ์ด ๋ฐฉ์‹์„ ๋Šฅ๊ฐ€ํ•˜๊ฒŒ ๋˜์—ˆ๋‹ค.

Here, ICT (inverse cloze task) means extracting a sentence from a context and training the model to predict which context that sentence belongs to.

Despite ORQA's performance, however, two weaknesses keep it from achieving SOTA across multiple domains:

  1. ICT is computationally intensive, and it is not clear that simply matching sentences is an effective objective for question answering.
  2. Because the context encoder is not fine-tuned with question-answer pairs, the representations it produces may be sub-optimal.

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š”, ์ถ”๊ฐ€์ ์ธ pre-training ์—†์ด question-answer ์˜ ์Œ๋“ค(Not so much)๋งŒ ์ด์šฉํ•˜์—ฌ ๋” ๋‚˜์€ dense embedding model ์„ ๋งŒ๋“œ๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ•œ๋‹ค. Pretrained BERT model๊ณผ dual-encoder์„ ํ™œ์šฉํ•˜์—ฌ, ์ƒ๋Œ€์ ์œผ๋กœ ์ ์€ ์ˆ˜์˜ question-passage(answer) ์Œ์„ ์ด์šฉํ•˜๋„๋ก ํ•  ๊ฒƒ์ด๋‹ค.

์ด ๊ณผ์ •์—์„œ, ์œ ์‚ฌํ•œ question-passage ๋“ค์˜ ๋‚ด์ ์„ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ์ตœ์ ํ™”๋ฅผ ์ง„ํ–‰ํ•  ๊ฒƒ์ด๋ฉฐ, batch ๋‚ด์˜ ๋ชจ๋“  question, passage ์Œ์„ ๋น„๊ตํ•  ๊ฒƒ์ด๋‹ค. ๋ณธ ๋…ผ๋ฌธ์˜ DPR method์€ BM25 ๋ฐฉ์‹์„ ํฐ ์ฐจ์ด๋กœ ๋Šฅ๊ฐ€ํ•˜๋ฉฐ, ๋‹จ์ˆœํžˆ representation์ด ์•„๋‹Œ end-to-end QA ์ •ํ™•๋„ ๋˜ํ•œ ORQA์— ๋น„ํ•ด ํฐ ์ฐจ์ด๋ฅผ ๋ƒˆ๋‹ค.

๋ณธ ๋…ผ๋ฌธ์€ ์ด๋Ÿฌํ•œ ๋ฐฉ์‹์œผ๋กœ ๊ฐ„๋‹จํ•˜๊ฒŒ question-answer ์˜ ์Œ๋“ค์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ๋งŒ์œผ๋กœ๋„, BM25์˜ ์„ฑ๋Šฅ์„ ๋Šฅ๊ฐ€ํ•œ๋‹ค. ๋˜ํ•œ, ์ด๋Š” ์ถ”๊ฐ€์ ์ธ pre-train์„ ์š”๊ตฌํ•˜์ง€ ์•Š๋Š”๋‹ค. ๋˜ํ•œ Open-domain ์—์„œ retrieval์˜ ์„ฑ๋Šฅ์ด ๋†’์„์ˆ˜๋ก, end-to-end ์˜ QA ์„ฑ๋Šฅ ๋˜ํ•œ ๋†’์•„์ง„๋‹ค.

2. Background ๐Ÿง

open-domain ์˜ ์ฃผ์š” task๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ factoid question์ด ์ฃผ์–ด์กŒ์„ ๋•Œ, ๋‹ค์–‘ํ•œ ์ฃผ์ œ์˜ topic ์— ๋Œ€ํ•œ corpus๋ฅผ ์ฐธ์กฐํ•˜์—ฌ ์ •๋‹ต์„ ์ฐพ์•„๋‚ด๋Š” ๊ฒƒ์ด๋‹ค.

Q : Who first voiced Meg on Family Guy?
Q: Where was the 8th Dalai Lama born?

๋” ๊ตฌ์ฒด์ ์œผ๋กœ๋Š”, QA ๋ฅผ extractive ํ•œ question ์— ํ•œ์ •์ง“๋Š”๋‹ค. ๋‹ค์‹œ ๋งํ•ด question์— ๋Œ€ํ•œ ์ •๋‹ต์€ ํ•ญ์ƒ corpus set์˜ document ์— ํ•˜๋‚˜ ์ด์ƒ ์กด์žฌํ•œ๋‹ค๊ณ  ๊ฐ€์ •ํ•œ๋‹ค.

ํ•˜์ง€๋งŒ ์ด๋Ÿฐ ๋ฉ”์ปค๋‹ˆ์ฆ˜์˜ ๊ฒฝ์šฐ, open-domain question์˜ ํŠน์„ฑ์ƒ ๋งค์šฐ ๋งŽ์€ document๊ฐ€ ์กด์žฌํ•ด์•ผ ํ•˜๋ฉฐ, corpus์˜ ํฌ๊ธฐ๋Š” millions of document ์—์„œ billion๊นŒ์ง€ ๋งค์šฐ ํฐ ์ˆ˜๋Ÿ‰์„ ๊ฐ€์ง„๋‹ค๋‹ค.

๋”ฐ๋ผ์„œ ์ด๋ฅผ ์œ„ํ•œ efficient retriever component, ์ฆ‰ ์ •ํ™•ํ•œ ์ •๋‹ต์„ ์ฐพ์•„๋‚ด๊ธฐ ์ „์— query์™€ ์œ ์‚ฌํ•œ ์ง‘ํ•ฉ (์ „์ฒด corpus์˜ ๋ถ€๋ถ„์ง‘ํ•ฉ)์„ ๊ตฌํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์ด ํ•„์š”ํ•˜๋‹ค. Retriever $R$ ์„ $R : (q, \mathcal{C}) \rightarrow \mathcal{C}_{\mathcal{F}}$ , $\mathcal{C}$๋ฅผ corpus, $q$ ๋ฅผ question ์ด๋ผ๊ณ  ํ–ˆ์„ ๋•Œ, retriever์€ $\mathcal{C}_{\mathcal{F}} \in \mathcal{C},\;\; |\mathcal{C}_{\mathcal{F}}| = k \ll |\mathcal{C}|$ ํ•œ corpus $\mathcal{C}$์˜ subset์ธ $\mathcal{C}_{\mathcal{F}}$๋ฅผ ๊ตฌํ•ด๋‚ด์–ด์•ผ ํ•œ๋‹ค.

3. Dense Passage Retriever (DPR) ๐Ÿฅฝ

์ด ๋…ผ๋ฌธ์—์„œ๋Š” open-domain QA task์—์„œ retrieval component ๋ฅผ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ๋ฐ์— ์ค‘์ ์„ ๋‘”๋‹ค. $M$ ๊ฐœ์˜ text passage ๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ, DPR ์˜ ๋ชฉํ‘œ๋Š” ์ด ๋ชจ๋“  passage๋ฅผ ๋ชจ๋‘ low-dimensional๋กœ ๋ณ€ํ™˜์‹œ์ผœ top-k relevant passage ๋ฅผ ์‹ค์‹œ๊ฐ„์œผ๋กœ ํšจ๊ณผ์ ์œผ๋กœ retrieval ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. (๋‹น์—ฐํ•˜๊ฒŒ๋„ $M$์€ ๋งค์šฐ ํฐ ์ง‘ํ•ฉ์œผ๋กœ, ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” 21 million ์ •๋„์ด๋‹ค.)

3.1 Overview

๋ณธ ๋…ผ๋ฌธ์˜ DPR์€ text passage๋ฅผ $d$-dimensional real-valued vector๋กœ encoding ํ•˜๋Š” dense encoder $E_p()$ ๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. run-time ๋•Œ๋Š”, ์ด์™€ ๋‹ค๋ฅธ encoder์ธ $E_Q()$ ๊ฐ€ ์‚ฌ์šฉ๋˜๋Š”๋ฐ, ์ด๋Š” input question ์„ $d$-dimensional vector๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ, ์ด ๋ฒกํ„ฐ๋“ค๊ฐ„์˜ ๊ณ„์‚ฐ์„ ํ†ตํ•ด top-k relevant passage ๋ฅผ retrieval ํ•˜๊ฒŒ ๋œ๋‹ค. ์ด relevantness ๊ณ„์‚ฐ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ dot product๋กœ ๊ณ„์‚ฐ๋œ๋‹ค.

\[sim(q, p) = E_Q(q) \cdot E_P(p)\]

Of course, methods such as cross attention can measure the similarity between two pieces of text more accurately, but to score a huge number of passages, a decomposable similarity function is far more efficient. Most decomposable similarity functions are transformations of Euclidean distance (L2), and cosine similarity is also convenient for unit vectors; this paper, however, adopts the simpler dot product, which is closely related to both L2 distance and cosine similarity.
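As a concrete illustration, the dot-product scoring above can be sketched in a few lines of NumPy. The random vectors here are hypothetical stand-ins for the outputs of $E_Q$ and $E_P$, not real encoder outputs:

```python
import numpy as np

def sim(q_vec: np.ndarray, p_vecs: np.ndarray) -> np.ndarray:
    """Dot-product similarity between one question vector and many passage vectors."""
    return p_vecs @ q_vec  # shape: (num_passages,)

rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=d)               # stand-in for E_Q(question)
passages = rng.normal(size=(5, d))   # stand-ins for E_P(passage_i)

scores = sim(q, passages)
top_k = np.argsort(-scores)[:2]      # indices of the 2 most relevant passages
```

Because the score decomposes into two independent encodings, the passage side can be computed once offline and reused for every incoming question.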

Encoders

Although any neural network could encode questions and passages, this paper uses two independent BERT networks and takes the representation at the [CLS] token.

Inference

Inference ์‹œ๊ฐ„์—๋Š”, passage encoder $E_P$๋ฅผ ๋ชจ๋“  passage ์— ๋Œ€ํ•ด์„œ ์ ์šฉํ•˜๋ฉฐ, FAISS๋ฅผ ์ด์šฉํ•˜์—ฌ indexing ํ•œ๋‹ค. FAISS๋Š” open-source library๋กœ, dense vector๋“ค์— ๋Œ€ํ•œ ํšจ์œจ์ ์ธ clustering์„ ํ†ตํ•ด์„œ ๋งค์šฐ ๋งŽ์€ ์ˆ˜์˜ vector๋“ค์„ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ๋‹ค. Question $p$๊ฐ€ ์ฃผ์–ด์ง€๋ฉด, $v_q = E_Q(q)$๋ฅผ ํ†ตํ•ด top-$k$ passage๋“ค์„ ์ฐพ๋Š”๋‹ค.

์ด ๊ณผ์ •์—์„œ HNSW ๋ผ๋Š” ANN ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜๋Š”๋ฐ, ๋‹ค์Œ ๋งํฌ์— ์ƒ์„ธํ•˜๊ฒŒ ์„ค๋ช…๋˜์–ด ์žˆ๋‹ค.

3.2 Training

Encoder๋“ค์„ ํ•™์Šต์‹œ์ผœ dot-product similarity๋ฅผ ํ™œ์šฉํ•ด retrieval์„ ํšจ๊ณผ์ ์œผ๋กœ ๋งŒ๋“œ๋Š” ๊ฒƒ์€ metric learning problem์ด๋‹ค. ์ด ๋ชฉ์ ์€ ๊ณง relevant question-passage ์Œ์— ๋Œ€ํ•œ distance๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ์— ์žˆ๋‹ค.

For a question $q_i$, let $p_i^+$ denote its positive passage (a passage relevant to $q_i$). The training data $\mathcal{D}$ is then defined as:

\[\mathcal{D} = \{\left\langle q_i, p_i^+, p_{i,1}^-, \cdots, p_{i, n}^- \right\rangle\}_{i=1}^m\]

$\mathcal{D}$๋Š” n๊ฐœ์˜ instance๋ฅผ ๊ฐ€์ง€๋ฉฐ, 1์Œ์˜ ์˜ฌ๋ฐ”๋ฅธ question-answer ๊ณผ question ์— ๊ด€๊ณ„์—†๋Š” negative passage ๋ฅผ m ๊ฐœ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ์ด data์— ๋Œ€ํ•œ loss function์„ positive passage์— ๋Œ€ํ•œ negative log likelihood๋ฅผ ์ด์šฉํ•˜์—ฌ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜ํ•  ์ˆ˜ ์žˆ๋‹ค.

\[L(q_i, p_i^+, p_{i,1}^-, \cdots, p_{i, n}^-) = -\log\frac{e^{sim(q_i, p_i^+)}}{e^{sim(q_i, p_i^+)} + \sum_{j=1}^{n}{e^{sim(q_i, p_{i,j}^-)}}}\]
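A minimal sketch of this loss, assuming the similarity scores have already been computed as scalars:

```python
import numpy as np

def nll_loss(sim_pos: float, sim_negs: np.ndarray) -> float:
    """Negative log-likelihood of the positive passage among 1 positive + n negatives."""
    logits = np.concatenate([[sim_pos], sim_negs])
    logits = logits - logits.max()                 # shift for numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum())
    return -log_softmax[0]                         # positive sits at index 0
```

When the positive's similarity dominates the negatives', the loss approaches zero; pushing any negative's similarity up increases it.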

Positive and negative passages

์ด ํ•™์Šต์„ ์œ„ํ•ด์„œ๋Š”, question-passage์˜ ์ ์ ˆํ•œ ์Œ์„ ์ฐพ๊ธฐ์—๋Š” ๋ช…ํ™•ํ•˜์ง€๋งŒ, negative passage๋“ค์„ ์ฐพ๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋งค์šฐ ํฐ ํ’€์—์„œ sampling ๋˜์–ด์•ผ ํ•œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, positive passage๋Š” QA dataset๋‚ด์— context๊ฐ€ ์กด์žฌํ•˜๊ฑฐ๋‚˜ answer๋ฅผ searching ํ•ด์„œ ์ฐพ์„ ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๋‹ค๋ฅธ ๋ชจ๋“  passage, ์ฆ‰ relevant ํ•˜์ง€ ์•Š์€ passage๋Š” negative passage ์ด๋‹ค. ์‹ค์ œ๋กœ๋„ ์–ด๋– ํ•œ ๋ฐฉ์‹์œผ๋กœ negative passage๋ฅผ ๊ตฌํ•˜๋А๋ƒ๋Š” ์ฃผ๋กœ ์ค‘์š”ํ•˜๊ฒŒ ์—ฌ๊ฒจ์ง€์ง€ ์•Š์ง€๋งŒ, ๋•Œ๋กœ๋Š” ๋†’์€ ์„ฑ๋Šฅ์˜ encoder์„ ๊ตฌํ˜„ํ•˜๋Š” ๋ฐ์— ์ฃผ์š”ํ•œ ์—ญํ• ์„ ํ•œ๋‹ค.

๋”ฐ๋ผ์„œ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” negative passage๋ฅผ ๊ตฌํ•˜๋Š” ๋ฐฉ์‹์„ 3๊ฐ€์ง€ ์ข…๋ฅ˜๋กœ ๋‚˜๋ˆ„์—ˆ๋‹ค.

  1. Random : a random passage from the corpus
  2. BM25 : a top passage returned by BM25 that does not contain the answer but matches most of the question tokens
  3. Gold : a passage that serves as the positive passage for a different question in the training set

These options are compared in Section 5.2; the best-performing scheme uses the gold passages of the other questions in the same mini-batch plus one BM25-based negative passage. Moreover, reusing gold passages from the batch as negatives is computationally efficient.

In-batch negatives

Suppose a mini-batch contains $B$ questions, each paired with its relevant passage. Stacking the question and passage vectors into matrices $Q$ and $P$ (each of size $(B\times d)$), $S=QP^T$ gives the similarity between every question and every passage in the batch.

With $S$ computed this way, $(q_i, p_j)$ is a positive pair only when $i=j$; every other pair serves as a gold-based negative, yielding $B-1$ negatives per question. This scheme also guarantees computational efficiency, since the passage embeddings are reused across all questions in the batch.
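The in-batch trick can be sketched as follows, assuming `Q` and `P` are the $(B \times d)$ matrices of question and passage embeddings: the diagonal of $S = QP^T$ holds the positive scores and everything off-diagonal acts as a negative:

```python
import numpy as np

def in_batch_nll(Q: np.ndarray, P: np.ndarray) -> float:
    """In-batch negatives: S = Q P^T, where the diagonal holds the positive pairs."""
    S = Q @ P.T                                    # (B, B) similarity matrix
    S = S - S.max(axis=1, keepdims=True)           # stabilize the row-wise softmax
    log_softmax = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
    return float(-log_softmax.diagonal().mean())   # mean NLL of the B positives
```

One matrix product thus yields $B$ positives and $B(B-1)$ negatives at once, which is why the scheme is both effective and cheap.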

(figure omitted)

4. Experimental Setup ๐Ÿ”ฌ

4.1 Wikipedia Data Pre-processing

No additional commentary is needed here, so the original text is quoted verbatim.

Following (Lee et al., 2019), we use the English Wikipedia dump from Dec. 20, 2018 as the source documents for answering questions. We first apply the pre-processing code released in DrQA (Chen et al., 2017) to extract the clean, text-portion of articles from the Wikipedia dump. This step removes semi-structured data, such as tables, infoboxes, lists, as well as the disambiguation pages. We then split each article into multiple, disjoint text blocks of 100 words as passages, serving as our basic retrieval units, following (Wang et al., 2019), which results in 21,015,324 passages in the end.5 Each passage is also prepended with the title of the Wikipedia article where the passage is from, along with an [SEP] token.

4.2 Question Answering Datasets

์ด ๋…ผ๋ฌธ์—์„œ๋Š” 5๊ฐ€์ง€ QA dataset์„ ์ด์šฉํ•˜์—ฌ ํ•™์Šต์„ ์ง„ํ–‰ํ–ˆ๋‹ค. ๊ทธ ๋ชฉ๋ก์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  1. Natural Questions (NQ) : designed for end-to-end QA, built from real Google search queries with answers from Wikipedia.
  2. TriviaQA : built from trivia questions found on the Web.
  3. WebQuestions (WQ) : questions collected via the Google Suggest API, with answers restricted to Freebase entities.
  4. CuratedTREC (TREC) : an open-domain QA dataset based on the TREC QA track and various Web sources.
  5. SQuAD v1.1 : a well-known reading comprehension dataset.

(table omitted)

4.3 Selection of positive passages

Because TREC, WebQuestions, and TriviaQA provide only small numbers of question-answer pairs, BM25 is used to find the passage most likely to contain the answer. If none of the top 100 retrieved passages contains the answer, the question is discarded.

For SQuAD and Natural Questions, since the original passages were split and processed differently from the passages in the candidate pool, each gold passage is matched and replaced with the corresponding passage in the pool. Questions for which this matching fails are discarded.

5. Experiments: Passage Retrieval ๐Ÿงช

์ด ์„น์…˜์—์„œ๋Š”, retrieval performance ๋ฅผ ๋‹ค๋ฃฌ๋‹ค. ๊ธฐ์กด์˜ retrieval method์— ๋Œ€ํ•ด์„œ ์–ด๋–ค ํšจ๊ณผ๋ฅผ ๊ฐ€์ง€๋Š”์ง€์— ๋Œ€ํ•ด์„œ ์‚ดํŽด๋ณธ๋‹ค.

๋ณธ ๋…ผ๋ฌธ์˜ main experiement์—์„œ ์‚ฌ์šฉ๋œ DPR ๋ชจ๋ธ์€ batch size๊ฐ€ 128์ด๋ฉฐ, BM25 ๊ธฐ์ค€์˜ negative passage๋ฅผ ํฌํ•จํ•œ in-batch negative ๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  question-passage ์Œ๋“ค์„ ํฐ ๋ฐ์ดํ„ฐ์…‹ (NQ, TriviaQA, SQuAD) ์— ๋Œ€ํ•ด์„œ๋Š” 40 epoch ๋งŒํผ, ์ž‘์€ ๋ฐ์ดํ„ฐ์…‹ (TREC, WQ)์— ๋Œ€ํ•ด์„œ๋Š” 100 epoch ํ•™์Šต์‹œํ‚จ๋‹ค. ๋˜ํ•œ lr ์€ $10^{-5}$ ๋กœ ์„ค์ •ํ•˜๊ณ , optimizer์€ Adam ์„ ์‚ฌ์šฉํ–ˆ๋‹ค. (dropout : 0.1)

๊ฐ ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•ด์„œ ์ž˜ ํ•™์Šต๋˜๋Š” retriever์„ ์œ ์—ฐํ•˜๊ฒŒ ๋‹ค๋ฃจ๋Š” ๊ฒƒ๋„ ์ข‹์ง€๋งŒ, ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•ด์„œ ์ „๋ฐ˜์ ์œผ๋กœ ์ข‹์€ ํ•™์Šต๋ฅ ์„ ๊ฐ€์ง€๋Š” retriever ์„ ์ฐพ์•„๋‚ด๋Š” ๊ฒƒ ๋˜ํ•œ ์ข‹์€ ์ ‘๊ทผ์ด ๋”œ ๊ฒƒ์ด๋‹ค.

๋”ฐ๋ผ์„œ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” multi-dataset encoder์„ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•˜์—ฌ SQuAD ๋ฐ์ดํ„ฐ์…‹์„ ์ œ์™ธํ•œ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์…‹์„ ๋ณ‘ํ•ฉํ•˜์˜€๋‹ค. ๋˜ํ•œ BM25, BM25 + DPR, traditional retriever ์„ ๋ชจ๋‘ ์‹คํ—˜ํ•ด๋ณด์•˜๋‹ค. ์ด ๊ณผ์ •์—์„œ BM25 ์™€ DPR์˜ ๊ฒฐ๊ณผ๋ฅผ ๊ฒฐํ•ฉํ•˜๊ธฐ ์œ„ํ•ด์„œ $BM25(q, p) + \lambda * sim(q,p)$ ์™€ ๊ฐ™์ด ์„ ํ˜•์  ๊ฒฐํ•ฉ์œผ๋กœ ๋ชจ๋ธ์„ ๊ตฌํ˜„ํ•˜์˜€๋‹ค. ($\lambda = 1.1$์ผ ๋•Œ๊ฐ€ ๊ฐ€์žฅ ์„ฑ๋Šฅ์ด ๋†’์•˜๋‹ค.)

5.1 Main Results

(table omitted)

๋ณธ ๋…ผ๋ฌธ์—์„œ 5๊ฐ€์ง€ ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•ด์„œ top-k passage๋ฅผ ๋ฝ‘์•„๋‚ด๋Š” passage retrieval์„ ์ง„ํ–‰ํ–ˆ๋‹ค. SQuAD dataset๋ฅผ ์ œ์™ธํ•˜๊ณ , DPR์€ BM25 ๋ณด๋‹ค ๋ชจ๋“  ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•ด ๋” ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€๊ณ , k๊ฐ€ ์ž‘์„ ๋•Œ ํŠนํžˆ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์…‹๋“ค ๊ฐ„์˜ ์ •ํ™•๋„์˜ gap์ด ์ปค์กŒ๋‹ค.

Training on multiple datasets gave a very large boost on TREC, the smallest of the five datasets. In contrast, Natural Questions and WQ improved only modestly, and TriviaQA even degraded slightly. These results can be further improved by combining DPR with BM25.

So why does SQuAD perform better with BM25?
1. Annotators wrote the questions after seeing the passage, so the questions are likely to share keywords with the passage.
2. The data was collected from only about 500 Wikipedia articles, so the training examples are likely to be highly biased.

5.2 Ablation Study on Model Training

Sample efficiency

(figure omitted)

๊ฐ training dataset์˜ ํฌ๊ธฐ์— ๋”ฐ๋ผ์„œ ์ •ํ™•๋„๊ฐ€ ๋‹ฌ๋ผ์ง€๊ฒŒ ๋˜๋Š”๋ฐ, ๊ทธ๋ž˜ํ”„์—์„œ ๋ณด๋‹ค์‹œํ”ผ dataset์˜ ํฌ๊ธฐ๊ฐ€ 1k ๊ฐœ๋งŒ ๋˜์–ด๋„ BM25์˜ ์„ฑ๋Šฅ์„ ๋Šฅ๊ฐ€ํ•˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๋˜ํ•œ, retrieve ํ•˜๋Š” top-k ์ˆ˜๊ฐ€ ์ฆ๊ฐ€ํ• ์ˆ˜๋ก ์„ฑ๋Šฅ ๋˜ํ•œ ์ฆ๊ฐ€ํ•˜๊ฒŒ ๋œ๋‹ค.

In-batch negative training

(table omitted)

The table above shows how performance varies with the negative-passage selection scheme, the number of negative passages, whether in-batch negatives are used, and the number of retrieved passages.

Clearly, performance improves as #N (the number of negative passages) grows, and mixing in one BM25-based negative per question performed better than using the Gold scheme alone.

However, increasing the number of BM25 negatives from one to two made almost no difference, suggesting that the number of BM25 negatives does not significantly affect the model.

๊ฐœ์ธ์ ์œผ๋กœ ์กฐ๊ธˆ ํฅ๋ฏธ๋กœ์› ๋˜ ๊ฒƒ์€, In-batch negative passage method๋ฅผ ํ™œ์šฉํ•œ ๊ฒฐ๊ณผ๊ฐ€ ๋‹จ์ง€ computational efficency๋ฅผ ๋ณด์žฅํ•˜๋Š” ๊ฒƒ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ์„ฑ๋Šฅ์ ์ธ ์ธก๋ฉด์—์„œ๋„ ์˜์˜๊ฐ€ ์žˆ์—ˆ๋‹ค.

Impact of gold passages

(table omitted)

Gold passage์˜ ํ•„์š”์„ฑ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์‹คํ—˜์„ ํ†ตํ•ด ์•Œ์•„๋ƒˆ๋‹ค. Dist. Sup์€ BM25์— ๋”ฐ๋ฅธ negative passage๋ฅผ ์˜๋ฏธํ•˜๋Š”๋ฐ, ์ด์— ๋น„ํ•ด 1% ์ •๋„์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๋ณด์ด๋Š” ๊ฒƒ์œผ๋กœ ๋ณด์•„, Gold passage๊ฐ€ ๋”์šฑ ์ข‹์€ ๋ฐฉ๋ฒ•์ด๋ผ๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

Similarity and loss

(table omitted)

First, two similarity functions are compared: dot product and L2 distance. Likewise, two loss functions are compared: NLL (negative log likelihood) and triplet loss.

๊ฒฐ๊ณผ์ ์œผ๋กœ Dot Product ์™€ NLL ์„ ์‚ฌ์šฉํ–ˆ์„ ๊ฒฝ์šฐ retrieval ์„ฑ๋Šฅ์ด ์ข‹์•˜๊ธฐ ๋•Œ๋ฌธ์—, ์ด ๋‘ ๊ฐ€์ง€ ๋ฐฉ์‹์„ ์ด์šฉํ•˜์—ฌ ๋ชจ๋ธ์„ ๊ตฌ์„ฑํ–ˆ๋‹ค.

Cross-dataset generalization

์ด์™ธ์—๋„, ์ถ”๊ฐ€์ ์ธ fine-tuning ์ด ํ•„์š”ํ•˜์ง€ ์•Š๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š”, ํŠน์ • dataset ์„ ์ด์šฉํ•ด ํ•™์Šตํ•œ ๋ชจ๋ธ์„ ๋‹ค๋ฅธ dataset์— ์ ์šฉํ•ด๋ด„์œผ๋กœ์จ ์ด๋ฅผ ์ฆ๋ช…ํ•œ๋‹ค. NQ dataset์— ๋Œ€ํ•ด์„œ๋งŒ DPR๋ฅผ ํ•™์Šต์‹œํ‚จ ํ›„, WQ, TREC์— ์‹คํ—˜์„ ํ•ด๋ณธ ๊ฒฐ๊ณผ, ์œ ์˜๋ฏธํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋„๋ฉฐ, ์ƒ๋‹นํžˆ ๋†’์€ ์„ฑ๋Šฅ์„ ๋‚ด๋Š” ๊ฒƒ์„ ํ™•์ธํ•˜์˜€๋‹ค.

5.3 Qualitative Analysis

Although DPR outperforms BM25 overall, term-matching methods like BM25 are sensitive to selective phrases and keywords, whereas DPR captures semantic relationships better.

5.4 Run-time Efficiency

The main reason that we require a retrieval component for open-domain QA is to reduce the number of candidate passages that the reader needs to consider, which is crucial for answering userโ€™s questions in real-time. We profiled the passage retrieval speed on a server with Intel Xeon CPU E5-2698 v4 @ 2.20GHz and 512GB memory. With the help of FAISS in-memory index for real-valued vectors10, DPR can be made incredibly efficient, processing 995.0 questions per second, returning top 100 passages per question. In contrast, BM25/Lucene (implemented in Java, using file index) processes 23.7 questions per second per CPU thread.

On the other hand, the time required for building an index for dense vectors is much longer. Computing dense embeddings on 21-million passages is resource intensive, but can be easily parallelized, taking roughly 8.8 hours on 8 GPUs. However, building the FAISS index on 21-million vectors on a single server takes 8.5 hours. In comparison, building an inverted index using Lucene is much cheaper and takes only about 30 minutes in total.

6. Experiments: Question Answering ๐Ÿง

6.1 End-to-end QA System

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์„œ๋กœ ๋‹ค๋ฅธ retriever system์— ๋Œ€ํ•ด์„œ ์œ ์—ฐํ•˜๊ฒŒ ์ ์šฉํ•  ์ˆ˜ ์žˆ๋Š” end-to-end QA system์„ ๊ตฌํ˜„ํ–ˆ๋‹ค. ์ด ์‹œ์Šคํ…œ์€ neural reader ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋‹ค.

๋จผ์ €, retriever์ด top-k retrieved passage๋ฅผ ์ œ๊ณตํ•˜๋ฉด, reader model์€ passage๋“ค์— ๋Œ€ํ•œ selection score ์„ ๊ฐ passage์— ๋ถ€์—ฌํ•˜๊ฒŒ ๋œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๊ฐ passage๋“ค์— ๋Œ€ํ•ด์„œ answer span์„ ์ถ”์ถœํ•˜๊ณ , ๊ฐ span ๋งˆ๋‹ค์˜ ์ ์ˆ˜๋ฅผ ๋ถ€์—ฌํ•˜๊ฒŒ ๋œ๋‹ค. ๊ฒฐ๋ก ์ ์œผ๋กœ ๊ฐ€์žฅ ๋†’์€ passage score์—์„œ์˜ answer span ์ด ์ •๋‹ต์œผ๋กœ ์ถ”์ถœ๋œ๋‹ค.

์ด ๊ณผ์ •์—์„œ passage selection model์€ reranker ์ด๋ผ๋Š” ์ƒˆ๋กœ์šด ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜๋Š”๋ฐ, question ๊ณผ passage ๊ฐ„์˜ cross attention์„ ์ด์šฉํ•ด์„œ passage ๊ฐ„์˜ similarity ๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค. ์ด ์—ฐ์‚ฐ์€ decomposable ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋งŽ์€ passage์— ๋Œ€ํ•ด์„œ ์ ์šฉํ•  ์ˆ˜๋Š” ์—†์ง€๋งŒ, dual-encoder๋ณด๋‹ค ์„ฑ๋Šฅ์ด ์ข‹๊ธฐ ๋•Œ๋ฌธ์—, ์ž‘์€ top-k ์— ๋Œ€ํ•ด์„œ ์ด ๊ณผ์ •์„ ์ˆ˜ํ–‰ํ•œ๋‹ค.

\[P_{start, i}(s) = \mathrm{softmax}(P_i w_{start})_s\]
\[P_{end, i}(t) = \mathrm{softmax}(P_i w_{end})_t\]
\[P_{selected}(i) = \mathrm{softmax}(\hat{P}^{T} w_{selected})_i\]

The formulas are shown above: cross-attention is performed over $\hat{P}$, the representations of all top-k passages, and a softmax over them learns which passage is most relevant. Then, within that passage, learnable vectors $w$ are multiplied with the token representations to score start and end tokens, and the product of these scores is used as the answer span score for finding the answer span.

reader์˜ ํ•™์Šต ๊ณผ์ •์€, positive-passage ์— ๋Œ€ํ•œ selection score์˜ log-likelihood ๋ฅผ ํ†ตํ•ด ํ•™์Šต๋˜๋ฉฐ, answer span์€ positive passage์—์„œ์˜ ๋ชจ๋“  answer span์˜ marginal log-likelihood๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ํ•™์Šต๋œ๋‹ค. ํ•˜๋‚˜์˜ passage ๋‚ด๋ถ€์—์„œ ์ •๋‹ต์ด ์—ฌ๋Ÿฌ ๋ฒˆ ๋‚˜ํƒ€๋‚  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์—, ๋ชจ๋“  answer span ์— ๋Œ€ํ•ด์„œ ํ•™์Šตํ•œ๋‹ค.

6.2 Results

(table omitted)

๊ฐ Model, dataset์˜ ํ†ตํ•ฉ์„ ๊ธฐ์ค€์œผ๋กœ ์œ„์™€ ๊ฐ™์ด ์ •ํ™•๋„๋ฅผ ์ธก์ •ํ–ˆ๋‹ค. ์ „์ฒด์ ์œผ๋กœ retriever์˜ ์ •ํ™•๋„๊ฐ€ ๋†’์„์ˆ˜๋ก end-to-end ์ •ํ™•๋„๊ฐ€ ๋†’์•„์ง„๋‹ค. ๋‹ค๋ฅธ ๋ชจ๋ธ๋“ค (ORQA, REALM, etc)์€ ๋ชจ๋ธ์„ ์œ„ํ•œ pre-training์„ ์ˆ˜ํ–‰ํ–ˆ๊ณ , ๋†’์€ ๊ณ„์‚ฐ๋ณต์žก๋„๋ฅผ ์ง€๋‹ˆ๊ณ  ์žˆ์ง€๋งŒ, ๋ณธ ๋…ผ๋ฌธ์˜ DPR ๋ชจ๋ธ์€ ์ถ”๊ฐ€์ ์ธ pre-training ์—†์ด ๊ฐ„๋‹จํ•˜๊ฒŒ ๊ตฌํ˜„ํ•˜์—ฌ ๋” ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€๋‹ค.

์ถ”๊ฐ€์ ์œผ๋กœ, Retrieval model ๊ณผ reader model์„ joint ํ•˜์—ฌ ๊ฐ™์€ ํ•˜๋‚˜์˜ ๋ชจ๋ธ๋กœ์จ ๋™์‹œ์— ํ›ˆ๋ จ์‹œํ‚ค๋Š” ์‹คํ—˜ ๋˜ํ•œ ํ•ด๋ณด์•˜์œผ๋‚˜, 39.8EM์„ ๋‹ฌ์„ฑํ•˜๋ฉฐ ๋…๋ฆฝ์ ์ธ retrieval, reader model์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ๋”์šฑ ๋†’์€ ์„ฑ๋Šฅ์„ ๊ฐ€์ง„๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ–ˆ๋‹ค.

8. Conclusion ๐ŸŽฌ

๋ณธ ๋…ผ๋ฌธ์—์„œ ์ œ์‹œํ•œ dense retrieval method๊ฐ€ ๊ธฐ์กด์˜ traiditional sparse retrieveal componet ๋ฅผ ๋Šฅ๊ฐ€ํ•˜๊ณ , ์ž ์žฌ์ ์œผ๋กœ ๋Œ€์ฒดํ•˜์˜€๋‹ค. ๊ฐ„๋‹จํ•˜๊ฒŒ dual-encoder๋ฅผ ์‚ฌ์šฉํ•จ์œผ๋กœ์จ ๋†€๋ผ์šด ์„ฑ๋Šฅ์„ ๋ณด์˜€์œผ๋ฉฐ, ์ด ์†์— ๋ช‡๋ช‡ ์ค‘์š” ์š”์†Œ๋“ค์ด ์กด์žฌํ•˜๊ธฐ๋„ ํ–ˆ๋‹ค. ๋” ๋‚˜์•„๊ฐ€์„œ ์ด ๋…ผ๋ฌธ์—์„œ์˜ ์‹œํ—˜์  ๋ถ„์„์€, ๋”์šฑ ๋ณต์žกํ•œ ๋ชจ๋ธ๋“ค์ด ํ•ญ์ƒ ์ถ”๊ฐ€์ ์ธ value๋ฅผ ์ œ๊ณตํ•˜๋Š” ๊ฒƒ์€ ์•„๋‹ˆ๋ผ๋Š” ๊ฒƒ ๋˜ํ•œ ์•Œ๊ฒŒ ๋˜์—ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฐฉ์‹์œผ๋กœ, ๊ฒฐ๊ตญ ๋ณธ ๋…ผ๋ฌธ์˜ ๋ฐฉ์‹์œผ๋กœ SOTA๋ฅผ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค.

This post is licensed under CC BY 4.0 by the author.
