TF-IDF (Term Frequency – Inverse Document Frequency)

📌 What is TF-IDF (Term Frequency – Inverse Document Frequency)?

TF-IDF is a standard method for scoring how important a word is within a document. It turns that importance into a number, which is then used to measure similarity between documents or to rank results in a search engine.

1️⃣ Concepts

(1) TF (Term Frequency)

TF measures how often a given word appears within a document.

$$TF(t, d) = \frac{f(t, d)}{N}$$

  • $f(t, d)$: number of times term $t$ appears in document $d$
  • $N$: total number of words in document $d$

✅ What TF means: the more often a word appears in a document, the more likely it is to matter in that document.
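The TF definition above can be sketched in a few lines of Python. The whitespace tokenizer and the sample sentence are assumptions made for illustration only; real pipelines usually tokenize more carefully.

```python
from collections import Counter

def term_frequency(term, tokens):
    """TF(t, d) = f(t, d) / N for an already tokenized document."""
    if not tokens:
        return 0.0  # avoid dividing by zero on an empty document
    counts = Counter(tokens)
    return counts[term] / len(tokens)

# Hypothetical document, tokenized by lowercasing and splitting on whitespace.
doc = "machine learning is fun and machine learning is useful"
tokens = doc.lower().split()

tf_machine = term_frequency("machine", tokens)  # 2 occurrences out of 9 tokens
print(round(tf_machine, 4))
```

Dividing by the document length keeps long documents from scoring higher simply because they contain more words.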

(2) IDF (Inverse Document Frequency)

IDF is a coefficient that adjusts for how widely a word appears across documents. Words that appear in too many documents (for example, stopwords such as “the”, “is”, and “and”) carry little information, so IDF lowers their weight.

$$IDF(t) = \log \left(\frac{D}{1 + df(t)}\right)$$

  • $D$: total number of documents
  • $df(t)$: number of documents in which term $t$ appears

✅ What IDF means: the rarer a word is (the fewer documents it appears in), the higher the weight it receives.
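A minimal sketch of the IDF formula above, assuming the natural logarithm (the text does not state the base) and a non-empty, already tokenized corpus. The four tiny documents are made up for illustration.

```python
import math

def inverse_document_frequency(term, tokenized_docs):
    """IDF(t) = log(D / (1 + df(t))), using the natural logarithm."""
    num_docs = len(tokenized_docs)
    # df(t): number of documents containing the term at least once.
    doc_freq = sum(1 for tokens in tokenized_docs if term in tokens)
    return math.log(num_docs / (1 + doc_freq))

# Hypothetical corpus of four tokenized documents.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "bird", "sang"],
    ["the", "quantum", "computer"],
]

idf_the = inverse_document_frequency("the", corpus)          # log(4 / 5)
idf_quantum = inverse_document_frequency("quantum", corpus)  # log(4 / 2)
print(round(idf_the, 3), round(idf_quantum, 3))
```

Note that with the `1 +` term in the denominator, a word that appears in every document gets a slightly negative IDF under this particular formula, while a word found in a single document gets a positive one.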

(3) The TF-IDF formula

The product of TF and IDF is the final importance score of the word.

$$\text{TF-IDF}(t, d) = TF(t, d) \times IDF(t)$$

✅ What TF-IDF means:

  • A word that appears often in one document but rarely across the whole collection gets a high score.
  • Words that are common across all documents (“the”, “is”, “and”) are automatically down-weighted.

2️⃣ Computing TF-IDF with Python

(1) TF-IDF vectorization with scikit-learn

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "I love machine learning. Machine learning is amazing.",
    "Natural language processing is a part of AI.",
    "Deep learning advances AI and machine learning.",
]

# Create the TF-IDF vectorizer and fit it to the documents
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Print the learned vocabulary
print("TF-IDF vocabulary:", vectorizer.get_feature_names_out())

# Print the TF-IDF values for each document
print("TF-IDF matrix:\n", tfidf_matrix.toarray())

✅ Explanation:

  • TfidfVectorizer() converts the documents into vectors.
  • fit_transform() learns the vocabulary and computes the TF-IDF values.
  • toarray() converts the resulting sparse matrix into a dense array for printing.

(2) Inspecting the TF-IDF values of specific words

import pandas as pd

# Convert the TF-IDF result into a DataFrame (one column per vocabulary term)
df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(df)

📌 Example output (illustrative values)

         ai    advances  amazing  deep  is    learning  love  machine  natural  part  processing
Doc 1    0     0         0.52     0     0     0.37      0.52  0.37     0        0     0
Doc 2    0.42  0         0        0     0.42  0         0     0        0.42     0.42  0.42
Doc 3    0.33  0.46      0        0.46  0     0.33      0     0.33     0        0     0

✅ Explanation:

  • You can read off, as a number, how important a given word (e.g. “machine”) is in a given document.
  • The same word receives a different weight in each document.

(3) Comparing documents with cosine similarity

TF-IDF vectors can also be used to measure how similar documents are to each other.

from sklearn.metrics.pairwise import cosine_similarity

# Compute pairwise cosine similarity between all documents
cos_sim = cosine_similarity(tfidf_matrix)
print("Cosine similarity between documents:\n", cos_sim)

✅ Example output

Cosine similarity between documents:
[[1.    0.118 0.529]
 [0.118 1.    0.206]
 [0.529 0.206 1.   ]]

✅ Explanation:

  • cosine_similarity() measures the similarity between each pair of documents.
  • A value close to 1 means the documents are similar; a value close to 0 means they have little in common.
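To make the similarity score less of a black box, the sketch below computes the cosine between two document vectors by hand and compares it with scikit-learn's result. It then scores a query against the documents, which is the basic idea behind TF-IDF search ranking. The query string "deep learning" is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "I love machine learning. Machine learning is amazing.",
    "Natural language processing is a part of AI.",
    "Deep learning advances AI and machine learning.",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Manual cosine similarity between the first and third documents:
# dot product divided by the product of the vector lengths.
a = tfidf_matrix[0].toarray().ravel()
b = tfidf_matrix[2].toarray().ravel()
manual_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
library_sim = cosine_similarity(tfidf_matrix)[0, 2]

# Rank the documents against a query, reusing the fitted vocabulary.
query_vec = vectorizer.transform(["deep learning"])  # hypothetical query
query_sims = cosine_similarity(query_vec, tfidf_matrix).ravel()
ranking = np.argsort(query_sims)[::-1]  # document indices, most similar first

print(manual_sim, library_sim)
print(ranking, query_sims)
```

The query must go through `transform()`, not `fit_transform()`, so that it is projected onto the vocabulary and IDF weights learned from the documents.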

3️⃣ Strengths and weaknesses of TF-IDF

✅ Strengths

  1. Down-weights uninformative words: automatically reduces the influence of stopwords such as “the”, “is”, and “and”.
  2. Highlights the words that matter in a document: words that appear frequently in only a few documents receive a high weight.
  3. Simple and efficient: it is fast to compute and usable in many NLP tasks such as document retrieval and document comparison.

❌ Weaknesses

  1. No contextual information: TF-IDF ignores word order and meaning (for example, it cannot recognize the relationship between “apple” and “fruit”).
  2. Rare-word weighting problem: very rare words can receive a higher weight than they deserve.
  3. Sensitive to corpus size: as the number of documents grows, the IDF values change, so the weights can shift.
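The first weakness is easy to demonstrate. In the sketch below (with sentences invented for the purpose), two sentences that use the same words in a different order get identical TF-IDF vectors, while two sentences about related things that share no words get a similarity of zero.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sentences chosen to expose the bag-of-words limitation.
sentences = [
    "dog bites man",
    "man bites dog",            # same words, opposite meaning
    "apple pie recipe",
    "fruit tart instructions",  # related topic, no shared words
]

matrix = TfidfVectorizer().fit_transform(sentences)
sims = cosine_similarity(matrix)

order_blind_sim = sims[0, 1]    # word order is ignored
no_semantics_sim = sims[2, 3]   # no overlap means no similarity

print(round(order_blind_sim, 3), round(no_semantics_sim, 3))
```

This is the gap that embedding-based methods are meant to close: they can place “apple” and “fruit” near each other even when the surface words differ.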

📌 TF-IDF vs. other NLP techniques

Technique  Description                            Strength                           Weakness
TF-IDF     Weights words based on frequency       Fast and simple                    Ignores context
Word2Vec   Converts word meaning into vectors     Captures semantic relationships    Hard to analyze whole sentences
BERT       Learns meaning at the sentence level   Context-aware semantic analysis    Large and slow model

✅ TF-IDF is fast and efficient, but it does not capture the meaning shared between words, so combining it with methods such as Word2Vec or BERT enables more refined NLP analysis.

🚀 Where TF-IDF is used

✔ Search engines: compute the similarity between a query and each document to order the results
✔ Document classification: automatically categorize news articles, emails, reviews, and so on
✔ Spam filtering: classify mail by comparing the TF-IDF patterns of spam and legitimate messages
✔ Sentiment analysis: analyze the importance of positive and negative keywords in a text
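As a rough illustration of the spam-filtering use case, the sketch below feeds TF-IDF features into a Naive Bayes classifier using a scikit-learn pipeline. The four training messages and their labels are made up, and the choice of `MultinomialNB` is an assumption; a real filter would need far more data and evaluation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical, tiny training set for illustration only.
train_texts = [
    "win a free prize now",
    "free money click now",
    "meeting at noon tomorrow",
    "please review the attached report",
]
train_labels = ["spam", "spam", "ham", "ham"]

# TF-IDF vectorization followed by a Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(["free prize money", "report for the meeting"])
print(list(predictions))
```

Wrapping both steps in one pipeline ensures that new messages are vectorized with the vocabulary and IDF weights learned during training.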

✅ Conclusion

  • TF-IDF is a foundational method for measuring word importance and comparing documents.
  • It is fast and simple, and is used in search engines, recommender systems, document classification, and more.
  • Because it cannot take context into account, pairing it with techniques such as Word2Vec or BERT enables more refined analysis.
