natural language - 演算法筆記

natural language🚧

linguistics / computational linguistics

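"Linguistics" studies human language: writing and grammar, pronunciation and prosody, speech and communication, composition and translation, and so on.

Among the branches of linguistics, semantics and pragmatics are the ones related to natural language.

"Computational linguistics" is the study of linguistics by means of computers.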

natural language

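"Natural language" is the language that humans speak, as opposed to "formal language."

Human language is vast and complex, and human grammar lacks regularity. Engineers gave up on rule-based grammar and instead observe the associations between words in order to determine the best combination of words. Because these people work only in computer science and not in linguistics, a new name was coined: "natural language processing."

Well-known tools: NLTK, CoreNLP. Corpora for Traditional Chinese.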

text tagging🚧

Introduction

Filtering and annotating text.

text tokenization

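Tokenization. An English sentence is split into individual words.

tokenization  strip the punctuation from an English sentence and break it into words
lemmatisation reduce an inflected English word to its base form (lemma)
stemming      strip the suffix from an inflected English word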
http://tartarus.org/martin/PorterStemmer/
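
As a rough illustration of the first and third steps, here is a minimal sketch using only the standard library. The suffix list `SUFFIXES` and the function `naive_stem` are hypothetical simplifications made up for this example; the Porter stemmer linked above applies a much larger, ordered set of rules, and lemmatisation (which needs a vocabulary of base forms) is not attempted here.

```python
import re

def tokenize(sentence):
    """Lowercase the sentence and return its words, dropping punctuation."""
    return re.findall(r"[a-z']+", sentence.lower())

# Hypothetical, deliberately tiny suffix list (NOT the Porter rule set).
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = tokenize("The cats were jumping over fences, barking loudly.")
stems = [naive_stem(t) for t in tokens]
print(tokens)
print(stems)
```

Note that a stem such as "fenc" need not be a real word, which is the usual difference between stemming and lemmatisation.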

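part-of-speech tagging     label each English word with its part of speech, e.g. noun, verb, adjective, adverb, preposition
shallow parsing / chunking group English words into phrase chunks, e.g. noun phrase, verb phrase, prepositional phrase
named-entity recognition   label English words with their entity type, e.g. person, time, place, event, object

constituency parser yields a phrase-structure tree built on top of the parts of speech
dependency parser   yields a tree of grammatical relations, e.g. subject, object, complement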
http://nlp.stanford.edu:8080/parser/

text tokenization

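Word segmentation. A Chinese sentence is split into words.

將一句話斷開成許多詞彙 ---> 將 一句 話 斷開 成 許多 詞彙
(the sentence means "split a sentence into many words")

Readers can try out Academia Sinica's word segmentation system and the Google Books Ngram Viewer.

When people converse, the brain instantly combines tone, grammar, context, facial expression, body language, cognition, and knowledge in order to segment words correctly. During the first ten years of life the brain keeps developing this ability, yet to this day we do not know in detail how it works. All that computer scientists can currently get a handle on is grammar.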

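1. Build a "dictionary of common words."
    Enumerate every way of combining dictionary words and pick the best combination.
    a. greedy method: match the longest word first.
    b. dynamic programming: find the combination with the largest total count (or the largest product of probabilities).
    Advantage: very fast, and accurate on words the dictionary contains.
    Drawback: cannot handle unknown words.

2. n-gram, where n is a variable. Taking 2-gram as the example:

        常見詞彙何其多 --2-gram--> 常見、見詞、詞彙、彙何、何其、其多

    Collect a large number of articles and count how often every 2-gram occurs. It is that simple.
    Textbooks usually state this as a probability: the count divided by the total count.
    Drawback: Chinese grammar is ignored entirely, so the result often contains nonsensical words.
    Advantage: being a probabilistic model, it tolerates the irregular syntax of real human sentences.

3. Parse tree.
    Following the grammar, decompose the sentence into a tree and determine each word's part of speech.
    However, people speak without regard to the rules, and enumerating every case has very high time complexity.

4. Directed acyclic graph (DAG).
    More flexible than a tree.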
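
The dictionary-based approach (greedy longest match versus dynamic programming) and 2-gram extraction can be sketched as follows. The dictionary `WORD_COUNTS`, its counts, and the penalty `UNKNOWN_LOGP` for unlisted characters are hypothetical values chosen for illustration, not figures from any real corpus. The test sentence 研究生命起源 ("study the origin of life") is a classic ambiguity: 研究生 ("graduate student") is a longer word than 研究 ("study").

```python
import math

# Hypothetical toy dictionary: word -> count in some imagined corpus.
WORD_COUNTS = {"研究": 50, "研究生": 20, "生命": 40, "命": 5, "起源": 30}
TOTAL = sum(WORD_COUNTS.values())
MAX_LEN = max(len(w) for w in WORD_COUNTS)
UNKNOWN_LOGP = math.log(0.5 / TOTAL)  # assumed penalty for an unlisted character

def greedy_segment(text):
    """Forward maximum matching: always take the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in WORD_COUNTS or length == 1:  # single char is the fallback
                words.append(piece)
                i += length
                break
    return words

def dp_segment(text):
    """Pick the segmentation with the largest sum of word log-probabilities."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[j]: best score for text[:j]
    back = [0] * (n + 1)          # back[j]: start index of the last word
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - MAX_LEN), j):
            piece = text[i:j]
            if piece in WORD_COUNTS:
                score = math.log(WORD_COUNTS[piece] / TOTAL)
            elif j - i == 1:
                score = UNKNOWN_LOGP
            else:
                continue
            if best[i] + score > best[j]:
                best[j], back[j] = best[i] + score, i
    words, j = [], n
    while j > 0:
        words.append(text[back[j]:j])
        j = back[j]
    return words[::-1]

def bigrams(text):
    """All overlapping 2-grams of the text, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(greedy_segment("研究生命起源"))  # greedy is misled by the longer word
print(dp_segment("研究生命起源"))
print(bigrams("常見詞彙何其多"))
```

With these made-up counts the greedy method yields 研究生 / 命 / 起源, while dynamic programming yields 研究 / 生命 / 起源, because the product of probabilities is larger for the latter.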

text segmentation

Segmentation. Automatically divide an article into suitable paragraphs.

TextTiling algorithm

text similarity

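Similarity. Determine how alike two pieces of text are. The algorithms fall into two families:

orthography (spelling) by spelling (the basic unit is the character)
phonetics (sound)      by sound (the basic unit is the phoneme)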
https://www.kdnuggets.com/2019/01/comparison-text-distance-metrics.html

Spelling-based algorithms: for example, edit distance. Sound-based algorithms: for example, Soundex.
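
For the spelling-based family, here is a minimal sketch of edit distance in its common Levenshtein form (unit cost for insertion, deletion, and substitution). The function name `edit_distance` is chosen for this example; other cost schemes exist.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, using two rolling rows."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]                  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # → 3
```

The running time is proportional to len(a) × len(b), and only two rows of the table are kept in memory.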

text prediction（predictive text）

Prediction. Guess the text that comes next.
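
One simple way to do this, assuming a 2-gram model over words, is to count which word most often follows each word in a training corpus. The three-sentence `corpus` below is made up for the example, and real predictive-text systems use far larger models.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Map each word to a Counter of the words that follow it."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never followed."""
    candidates = model.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# Hypothetical toy corpus.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
model = train_bigram_model(corpus)
print(predict_next(model, "the"))  # → cat
print(predict_next(model, "sat"))  # → on
```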

text correction（spell check）

Correction. Fix spelling errors and grammatical errors.

text analysis🚧

Introduction

Statistics and analysis of text.

text frequency

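Frequency. Build a bipartite-graph matrix (a term-document matrix) that counts how many times each word occurs in each document.

TF-IDF        the count of each word in each document (term frequency), weighted down for words that occur in many documents (inverse document frequency).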
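
A minimal sketch of TF-IDF, assuming one common variant: tf is the word's count divided by the document length, and idf is log(N / df), where N is the number of documents and df the number of documents containing the word. Other weightings and smoothing schemes are in use; the three documents are made up for the example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

# Hypothetical toy documents, already tokenized.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
weights = tf_idf(docs)
print(weights[0])
```

A word that appears in every document, such as "the" here, gets weight 0, while a word unique to one document gets the highest weight in it.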

text embedding

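Embedding. Turn the words into vectors, then reduce the dimensionality. This makes it convenient to compute document similarity.

bag of words  how many times each word occurs
word2vec      the contexts (neighboring words) in which each word occurs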
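
As an illustration of the bag-of-words representation and of computing document similarity from it, here is a minimal sketch that uses cosine similarity. The choice of cosine similarity and of whitespace tokenization are assumptions of this example, and no dimensionality reduction is performed.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count how many times each whitespace-separated word occurs."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

x = bag_of_words("the cat sat")
y = bag_of_words("the cat ran")
z = bag_of_words("dogs bark")
print(cosine_similarity(x, y))
print(cosine_similarity(x, z))
```

Here x and y share two of their three words and score 2/3, while x and z share none and score 0. Word order is ignored entirely, which is the defining limitation of bag of words.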

text categorization

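Categorization. Sort articles by type, like the sections of a newspaper.

Advanced applications include sentiment analysis, spam filtering, and chatbots.

topic model: examine the words of each article and, from their occurrence counts and weights, determine the type of the article. The meaning of the article is recorded in numeric form.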

latent semantic analysis, LSA
vector space model + SVD
(not quite the same as eigenxxxxx: the pairwise covariances are not computed first)

probabilistic latent semantic analysis, PLSA
document->topic->word
http://blog.csdn.net/yangliuy/article/details/8330640

LDA, latent Dirichlet allocation
http://cos.name/2013/01/lda-math-gamma-function/
http://cos.name/2013/01/lda-math-beta-dirichlet/

text search🚧

text search（full-text search）

Retrieval. Search the documents for a particular word.

document: a document. A large collection of strings.
word: a word. A single string.

index / n-gram / k-mer

ScanCount: Efficient Merging and Filtering Algorithms for Approximate String Searches

frequency / correlation

topic model = collaborative filtering
TF-IDF
word2vec

top-k / queryselector

LCP            https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5445192/
wavelet tree   https://lars76.github.io/various/wavelet-trees-python/

text compression

Compression. Reduce the storage size.

ACB compression（associative coder of Buyanovsky）
http://www.stringology.org/DataCompression/acb/index_en.html
http://www.cs.brandeis.edu/~fabricio/files/cosci170.htm

PPM compression（prediction by partial matching）
http://www.stringology.org/DataCompression/ppmc/index_en.html
https://en.wikipedia.org/wiki/Prediction_by_Partial_Matching

natural language understanding🚧

Introduction

natural language generation

Generation. Automatic writing: generate articles.

https://en.wikipedia.org/wiki/SCIgen
https://en.wikipedia.org/wiki/Article_spinning
http://images3.wikia.nocookie.net/__cb20120321065110/hunterx/images/9/92/Neon%27s_Lovely_Ghostwriter.jpg

automated content authorship
http://wordai.com/
http://mag.udn.com/mag/digital/storypage.jsp?f_ART_ID=131873
http://www.insead.edu/facultyresearch/faculty/profiles/pparker/
http://www.insead.edu/facultyresearch/faculty/personal/pparker/

report generator
http://narrativescience.com/
http://mag.udn.com/mag/digital/printpage.jsp?f_ART_ID=389334

lyrics generator
http://arxiv.org/abs/1505.04771   DopeLearning
automatic rhyme detection
http://mining4meaning.com/2015/02/13/raplyzer/

neural storyteller
http://www.cs.toronto.edu/~mbweb/

political speech generation
http://arxiv.org/abs/1601.03313

natural language paraphrasing

Paraphrasing. Say the same thing in other words, or re-explain it, much like looking something up in a dictionary.

natural language summarization

Summarization. Extract the key points of an article and generate a shorter text from them.

story understanding
http://alumni.media.mit.edu/~mueller/storyund/storyres.html

natural language comprehension（question answering）

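Comprehension. Given an article and some questions, produce the correct answers based on the content of the article.

Well-known dataset: SQuAD.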

natural language communication（dialog modeling）

Communication. Given an utterance, produce an appropriate reply.

Well-known software: Eugene Goostman, Akinator, 掰噗.

natural language translation（machine translation）

Translation. Translate an article into another language.

Well-known software: Google Translate, 有道翻译, Dr.eye.

http://www.statmt.org/book/
http://mt-class.org/
http://104.131.78.120/

natural language identification

Identification. Given an article, determine which language it is written in.

natural language programming

Programming. Write programs just by talking.

https://www.threads.com/@satyanadella/post/DMdkzPgqc8F