Python AI Natural Language Processing - Beginner AI Development


What is natural language processing?
Tools
word2vec
Preprocessing

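1. What is natural language processing?

Natural language processing (NLP) is the technology of having computers process the languages that humans use, and it is one of the fields where artificial intelligence is widely applied.

The goal this time is to convert words into vectors.

Once words are represented as vectors, the computer can perform calculations on them.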
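To make "calculating with words" concrete, here is a minimal sketch that compares words by the cosine similarity of their vectors. The three-dimensional vectors and the helper name `cosine_similarity` are made up for illustration; real word2vec vectors are learned and have many more dimensions.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only.
vectors = {
    "car":   [0.9, 0.1, 0.0],
    "truck": [0.8, 0.2, 0.1],
    "money": [0.0, 0.9, 0.4],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sim_car_truck = cosine_similarity(vectors["car"], vectors["truck"])
sim_car_money = cosine_similarity(vectors["car"], vectors["money"])

# With these made-up vectors, "car" is closer to "truck" than to "money".
print(sim_car_truck, sim_car_money)
```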

2. Tools

Python
Jupyter notebook
keras
tensorflow

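3. word2vec

To vectorize words, this article uses the word2vec N-gram model.

In this N-gram model, given a sequence of words A B C, a neural network computes the probability that B appears given A and C. (In word2vec terminology, predicting the middle word from its neighbors is the CBOW, or continuous bag-of-words, architecture.)

[Figure: two green context words and one red target word]

Given the two green words, the neural network computes the probability that the red word appears.

The word vectors are then extracted from the result of this computation; that is word2vec.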
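The "A and C predict B" setup can be sketched as the construction of training pairs. This is only an illustration of how the inputs and targets might be laid out, not the network itself; the function name `context_target_pairs` and the `window` parameter are assumptions introduced here.

```python
def context_target_pairs(words, window=1):
    """Return (context_words, target_word) for every position with a full window."""
    pairs = []
    for i in range(window, len(words) - window):
        # Words to the left and right of position i form the context.
        context = words[i - window:i] + words[i + 1:i + window + 1]
        pairs.append((context, words[i]))
    return pairs


words = ["a", "b", "c", "d"]
pairs = context_target_pairs(words)
for context, target in pairs:
    print(context, "->", target)
```

With `window=1`, each pair holds the two neighbors (the green words) and the middle word (the red word) that the network would be asked to predict.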

4. Preprocessing

Preprocessing steps

Prepare the text
Split the text into words
Assign an ID to each word so that the AI can compute with words

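1. Preparing the text

To test the preprocessing, copy and paste an English text.

English is easier to process than Japanese.

Reason: English words are separated by spaces, so they are easy to split.

The prepared text: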

The core of the model consists of an LSTM cell that processes one word at a time and computes probabilities of the possible values for the next word in the sentence. The memory state of the network is initialized with a vector of zeros and gets updated after reading each word. For computational reasons, we will process data in mini-batches of size batch_size. In this example, it is important to note that current_batch_of_words does not correspond to a of words. Every word in a batch should correspond to a time t. TensorFlow will automatically sum the gradients of each batch for you.

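In Jupyter notebook:

text = "<pasted text>"

2. Splitting the text into words

Use keras.preprocessing.text.text_to_word_sequence().

text: the text to split

filters: the characters to remove from the text. Here, symbols are removed.

split: the separator on which the text is split into words. Here, the text is split on spaces.

words[0:13]: shows the first 13 words of the result (indices 0 to 12)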
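Since the notebook code itself was shown only as a screenshot, here is a dependency-free sketch of the same idea. The function `split_into_words` is a hypothetical stand-in that only approximates what text_to_word_sequence() does (lowercase, drop symbols, split on spaces); its default `filters` value is an assumption and may differ from the Keras default.

```python
import string


def split_into_words(text, filters=string.punctuation + "\t\n", split=" "):
    """Lowercase text, replace every filter character with the separator, then split."""
    table = str.maketrans({ch: split for ch in filters})
    cleaned = text.lower().translate(table)
    # Drop the empty strings produced by consecutive separators.
    return [w for w in cleaned.split(split) if w]


text = "The core of the model consists of an LSTM cell. It processes one word at a time!"
words = split_into_words(text)

# Show the first 13 words (indices 0 to 12).
print(words[0:13])
```

In the notebook itself, the Keras function would take the place of `split_into_words`, provided the installed Keras version still ships `keras.preprocessing.text`.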

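3. Assigning an ID to each word

Goal

Assign a distinct number to each word.

(Example) word → number (word2id): takoyaki → 1, money → 2, car → 3

number → word (id2word): 1 → takoyaki, 2 → money, 3 → car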

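sorted(set(words)): removes duplicate words and sorts the result

word2idx = {}: a Python dictionary

Dictionary: {key1: value1, key2: value2, key3: value3}

A dictionary ties each key to a value.

(Example) dictionary[key1] → value1

Looking up key1 returns value1.

idx2word = np.array(vocab): converts a number into a word (vocab is presumably the deduplicated, sorted word list from above)

(Example) Uses the index (the position counted from the front, starting at 0).

idx2word = ["a", "b", "c", "d", "e"]

Index 2: counting starts at 0, so this is the third element from the front.

idx2word[2] → "c"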

for i, word in enumerate(idx2word):
    word2idx[word] = i

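This loop fills in the dictionary.

key: each of the split words is set as a key, one after another, by the loop

value: the loop counter at the moment the key is set, which becomes the word's number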
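Putting the pieces above together, a minimal sketch of the two lookup tables might look like this. It uses a short made-up word list and a plain Python list for idx2word instead of np.array(vocab), so that it runs without NumPy; that substitution is an assumption made for this example.

```python
# A short, made-up word list standing in for the split text.
words = ["the", "cat", "sat", "on", "the", "mat"]

# Remove duplicates and sort so that every word gets a stable position.
vocab = sorted(set(words))

# index -> word: a plain list is enough here (the article uses np.array(vocab)).
idx2word = list(vocab)

# word -> index: the loop counter becomes the word's ID.
word2idx = {}
for i, word in enumerate(idx2word):
    word2idx[word] = i

print(idx2word)
print(word2idx)
```

Sorting before numbering means the same text always produces the same IDs, which makes results easier to reproduce between notebook runs.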

Test


idx2word[52]: index 52 corresponds to "the"

word2idx['the']: "the" corresponds to 52
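The same check can be extended to a whole sentence: convert every word to its ID and back, and confirm nothing is lost. The helper names `encode` and `decode` are hypothetical, and the word list is a small made-up sample rather than the full text used above.

```python
def encode(words, word2idx):
    """Convert a list of words into a list of IDs."""
    return [word2idx[w] for w in words]


def decode(ids, idx2word):
    """Convert a list of IDs back into a list of words."""
    return [idx2word[i] for i in ids]


words = ["every", "word", "in", "a", "batch"]
idx2word = sorted(set(words))
word2idx = {word: i for i, word in enumerate(idx2word)}

ids = encode(words, word2idx)
restored = decode(ids, idx2word)

print(ids)
print(restored)
```

The ID list is the form in which the text would later be fed to the neural network.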

Preprocessing is now complete. The next step is to have a neural network do the computation and obtain the word vectors.