Natural Language Processing (NLP) Fundamentals and Applications: Analyzing Text Data with Python - An Engineering Blog for Beginners Changing Careers with Python

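Natural language processing (NLP) is the field concerned with enabling computers to understand and process human language. Python offers a rich set of NLP libraries, such as NLTK, gensim, scikit-learn, and TextBlob, and this article walks through NLP from the fundamentals to applications with Python code along the way.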

1. NLP Fundamentals

1.1 Preprocessing Text Data

The first step in NLP is preprocessing the text data. This includes cleaning the text, tokenizing it, and removing stop words.

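import re

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the NLTK resources used below (needed only once per environment).
# Assumption: some newer NLTK releases may also ask for "punkt_tab".
nltk.download("stopwords")
nltk.download("punkt")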

# Clean the text data
def clean_text(text):
    text = re.sub(r"[^a-zA-Z]", " ", text)  # Replace every non-alphabetic character with a space
    text = text.lower()  # Convert to lowercase
    return text

# Tokenize the text and remove stop words
def tokenize_and_remove_stopwords(text):
    stop_words = set(stopwords.words("english"))
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word.lower() not in stop_words]
    return tokens

# Example: preprocessing text data
original_text = "Natural Language Processing is an exciting field of study!"
cleaned_text = clean_text(original_text)
tokens = tokenize_and_remove_stopwords(cleaned_text)
print(tokens)
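
To see what the pipeline above does without installing NLTK, here is a minimal standard-library sketch of the same three steps. The names `TOY_STOP_WORDS` and `preprocess` are hypothetical, and the stop-word set is a tiny illustrative stand-in for NLTK's much larger English list, so results on other sentences may differ from the NLTK version.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only;
# NLTK's English list is much larger.
TOY_STOP_WORDS = {"is", "an", "of", "the", "a", "and"}


def preprocess(text, stop_words=TOY_STOP_WORDS):
    """Lowercase the text, keep alphabetic runs only, and drop stop words."""
    # Matching [a-z]+ on the lowercased text cleans and tokenizes in one pass.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stop_words]


example_tokens = preprocess("Natural Language Processing is an exciting field of study!")
print(example_tokens)
# ['natural', 'language', 'processing', 'exciting', 'field', 'study']
```

A regular expression is enough for this English sentence, but a real tokenizer such as `word_tokenize` handles contractions and punctuation more carefully.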

1.2 Word Embeddings

Word embedding maps words into a vector space, which helps capture the meaning of words and how they relate to one another. Word2Vec and GloVe are well-known algorithms.

from gensim.models import Word2Vec

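# Build a Word2Vec model from the text data.
# Note: one short sentence only demonstrates the API; meaningful vectors need a large corpus.
word2vec_model = Word2Vec(sentences=[tokens], vector_size=100, window=5, min_count=1, workers=4)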
word_vectors = word2vec_model.wv

# Look up the vector for a specific word
vector = word_vectors["natural"]
print(vector)
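
Relatedness between embedded words is usually measured with cosine similarity. The sketch below computes it by hand on made-up 3-dimensional vectors; the words and numbers in `toy_vectors` are hypothetical and do not come from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, hand-written for illustration only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.1, 0.2, 0.9],
}

sim_royal = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_fruit = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])
print(sim_royal > sim_fruit)  # True
```

With a trained gensim model, `word_vectors.similarity(w1, w2)` and `word_vectors.most_similar(word)` provide the same kind of comparison on real vectors.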

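2. NLP Applications

2.1 Text Classification

Text classification is the task of assigning text to predefined categories. Sentiment analysis and topic classification are typical examples.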

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

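# Prepare the data (use a suitable dataset).
# text_data (a list of documents) and labels (one label per document) must be defined beforehand.
X_train, X_test, y_train, y_test = train_test_split(text_data, labels, test_size=0.2, random_state=42)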

# Apply the TF-IDF vectorizer (fit on the training data only, then transform the test data)
vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train the naive Bayes classifier and predict
classifier = MultinomialNB()
classifier.fit(X_train_tfidf, y_train)
y_pred = classifier.predict(X_test_tfidf)

# Evaluate the classification results
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:\n", report)
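
To make the TF-IDF step less of a black box, the following standard-library sketch computes TF-IDF weights by hand. It assumes the smoothed IDF formula and L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults, but it tokenizes on whitespace only, so it is an approximation rather than a drop-in replacement. The function `tfidf` and the documents in `toy_docs` are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (whitespace-tokenized)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # Raw term counts within this document
        weights = {term: count * idf[term] for term, count in tf.items()}
        # L2-normalize so that every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows


# Hypothetical toy documents for illustration only.
toy_docs = ["good movie", "bad movie", "good good plot"]
toy_rows = tfidf(toy_docs)
print(toy_rows[1])
```

In the second document, "bad" appears in only one document while "movie" appears in two, so "bad" receives the larger weight. Rare terms carrying more signal than common ones is the intuition behind TF-IDF.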

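2.2 Sentiment Analysis

Sentiment analysis is the task of extracting sentiment or attitude from text, typically by judging whether the text is positive, negative, or neutral.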

from textblob import TextBlob

# Sentiment analysis of text data (polarity ranges from -1.0 to 1.0)
def sentiment_analysis(text):
    analysis = TextBlob(text)
    sentiment = analysis.sentiment.polarity
    if sentiment > 0:
        return "Positive"
    elif sentiment < 0:
        return "Negative"
    else:
        return "Neutral"

# Example: sentiment analysis of text data
sample_text = "I love using Python for natural language processing!"
result = sentiment_analysis(sample_text)
print("Sentiment:", result)
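
The idea of turning a polarity score into a label can be illustrated with a toy lexicon-based scorer. This is a simplified sketch, not TextBlob's actual implementation: `TOY_LEXICON`, its scores, and both function names are hypothetical. The optional `threshold` parameter is an added assumption that shows how a dead zone around zero would keep weakly scored text in the "Neutral" class.

```python
import re

# Hypothetical word-level polarity scores in [-1.0, 1.0], for illustration only.
TOY_LEXICON = {"love": 0.5, "great": 0.8, "hate": -0.8, "boring": -0.5}


def toy_polarity(text, lexicon=TOY_LEXICON):
    """Average the scores of the lexicon words found in the text (0.0 if none)."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[word] for word in words if word in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


def label_from_polarity(polarity, threshold=0.0):
    """Map a polarity score to a label; scores within the threshold are Neutral."""
    if polarity > threshold:
        return "Positive"
    if polarity < -threshold:
        return "Negative"
    return "Neutral"


sample = "I love using Python for natural language processing!"
print(label_from_polarity(toy_polarity(sample)))  # Positive
```

With `threshold=0.0` the labeling matches the rule used in `sentiment_analysis` above. A small positive threshold may be preferable when near-zero scores are mostly noise.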

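3. Summary and Next Topic

NLP is widely applied to analyzing and understanding text data, and Python makes these tasks practical to carry out. The next topic is "Building and Optimizing Machine Learning Models", which will explain how to build and optimize machine learning models with Python. Stay tuned!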