GitHub project: a curated collection of natural language processing resources

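Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. The author has compiled an extensive overview of the field for NLP beginners. The selected references and resources focus on the latest deep learning research. They offer a good starting point for anyone who wants to dig deeply into a particular NLP task.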

Anaphora resolution

See coreference resolution: https://github.com/Kyubyong/nlp_tasks#coreference-resolution

Automated essay scoring

Paper: Automatic Text Scoring Using Neural Networks: https://arxiv.org/abs/1606.04289
Paper: A Neural Approach to Automated Essay Scoring: http://www.aclweb.org/old_anthology/D/D16/D16-1193.pdf
Challenge: Kaggle: The Hewlett Foundation: Automated Essay Scoring: https://www.kaggle.com/c/asap-aes
Project: Enhanced AI Scoring Engine: https://github.com/edx/ease

Automatic speech recognition

Wikipedia: Speech recognition: https://en.wikipedia.org/wiki/Speech_recognition
Paper: DeepSpeech 2: End-to-End Speech Recognition in English and Mandarin: https://arxiv.org/abs/1512.02595
Paper: WaveNet: A Generative Model for Raw Audio: https://arxiv.org/abs/1609.03499
Project: A TensorFlow implementation of Baidu’s Deep Speech architecture: https://github.com/mozilla/DeepSpeech
Project: Speech-to-Text-WaveNet: End-to-end sentence level English speech recognition using DeepMind’s WaveNet: https://github.com/buriburisuri/speech-to-text-wavenet
Challenge: The 5th CHiME Speech Separation and Recognition Challenge: http://spandh.dcs.shef.ac.uk/chime_challenge/
Data: The 5th CHiME Speech Separation and Recognition Challenge: http://spandh.dcs.shef.ac.uk/chime_challenge/download.html
Data: CSTR VCTK Corpus: http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html
Data: LibriSpeech ASR corpus: http://www.openslr.org/12/
Data: Switchboard-1 Telephone Speech Corpus: https://catalog.ldc.upenn.edu/ldc97s62
Data: TED-LIUM Corpus: http://www-lium.univ-lemans.fr/en/content/ted-lium-corpus

Automatic summarization

Wikipedia: Automatic summarization: https://en.wikipedia.org/wiki/Automatic_summarization
Book: Automatic Text Summarization: https://www.amazon.com/Automatic-Text-Summarization-Juan-Manuel-Torres-Moreno/dp/1848216688/ref=sr_1_1?s=books&ie=UTF8&qid=1507782304&sr=1-1&keywords=Automatic+Text+Summarization
Paper: Text Summarization Using Neural Networks: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.823.8025&rep=rep1&type=pdf
Paper: Ranking with Recursive Neural Networks and Its Application to Multi-Document Summarization: https://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/viewFile/9414/9520
Data: Text Analysis Conference (TAC): https://tac.nist.gov/data/index.html
Data: Document Understanding Conferences (DUC): http://www-nlpir.nist.gov/projects/duc/data.html

Coreference resolution

Info: Coreference resolution: https://nlp.stanford.edu/projects/coref.shtml
Paper: Deep Reinforcement Learning for Mention-Ranking Coreference Models: https://arxiv.org/abs/1609.08667
Paper: Improving Coreference Resolution by Learning Entity-Level Distributed Representations: https://arxiv.org/abs/1606.01323
Challenge: CoNLL 2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes: http://conll.cemantix.org/2012/task-description.html
Challenge: CoNLL 2011 Shared Task: Modeling Unrestricted Coreference in OntoNotes: http://conll.cemantix.org/2011/task-description.html

Grammatical error correction

Paper: Neural Network Translation Models for Grammatical Error Correction: https://arxiv.org/abs/1606.00189
Challenge: CoNLL 2013 Shared Task: Grammatical Error Correction: http://www.comp.nus.edu.sg/~nlp/conll13st.html
Challenge: CoNLL 2014 Shared Task: Grammatical Error Correction: http://www.comp.nus.edu.sg/~nlp/conll14st.html
Data: NUS Non-commercial research/trial corpus license: http://www.comp.nus.edu.sg/~nlp/conll14st/nucle_license.pdf
Data: Lang-8 Learner Corpora: http://cl.naist.jp/nldata/lang-8/
Data: Cornell Movie–Dialogs Corpus: http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
Project: Deep Text Corrector: https://github.com/atpaino/deep-text-corrector
Product: deep grammar: http://deepgrammar.com/

Grapheme-to-phoneme conversion

Paper: Grapheme-to-Phoneme Models for (Almost) Any Language: https://pdfs.semanticscholar.org/b9c8/fef9b6f16b92c6859f6106524fdb053e9577.pdf
Paper: Polyglot Neural Language Models: A Case Study in Cross-Lingual Phonetic Representation Learning: https://arxiv.org/pdf/1605.03832.pdf
Paper: Multitask Sequence-to-Sequence Models for Grapheme-to-Phoneme Conversion: https://pdfs.semanticscholar.org/26d0/09959fa2b2e18cddb5783493738a1c1ede2f.pdf
Project: Sequence-to-Sequence G2P toolkit: https://github.com/cmusphinx/g2p-seq2seq
Data: Multilingual Pronunciation Data: https://drive.google.com/drive/folders/0B7R_gATfZJ2aWkpSWHpXUklWUmM

Language identification

Wikipedia: Language identification: https://en.wikipedia.org/wiki/Language_identification
Paper: AUTOMATIC LANGUAGE IDENTIFICATION USING DEEP NEURAL NETWORKS: https://repositorio.uam.es/bitstream/handle/10486/666848/automatic_lopez-moreno_ICASSP_2014_ps.pdf?sequence=1
Challenge: 2015 Language Recognition Evaluation: https://www.nist.gov/itl/iad/mig/2015-language-recognition-evaluation

Language modeling

Wikipedia: Language model: https://en.wikipedia.org/wiki/Language_model
Toolkit: KenLM Language Model Toolkit: http://kheafield.com/code/kenlm/
Paper: Distributed Representations of Words and Phrases and their Compositionality: http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
Paper: Character-Aware Neural Language Models: https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/viewFile/12489/12017
Data: Penn Treebank: https://github.com/townie/PTB-dataset-from-Tomas-Mikolov-s-webpage/tree/master/data
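
For readers new to language modeling, the basic idea behind the resources above can be illustrated with a toy example. The sketch below is not taken from KenLM or from any of the listed papers; it is a minimal bigram model with add-one (Laplace) smoothing, and the three-sentence corpus and all function names are made up for illustration.

```python
from collections import Counter
import math

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over tokenized sentences.

    Each sentence is padded with <s> and </s> boundary markers.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Estimate P(word | prev) with add-one smoothing."""
    vocab_size = len(unigrams)  # includes the boundary markers
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def perplexity(tokens, unigrams, bigrams):
    """Perplexity of one tokenized sentence under the bigram model."""
    padded = ["<s>"] + tokens + ["</s>"]
    pairs = list(zip(padded, padded[1:]))
    log_prob = sum(math.log(bigram_prob(p, w, unigrams, bigrams)) for p, w in pairs)
    return math.exp(-log_prob / len(pairs))

# Hypothetical toy corpus, purely for illustration.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
unigrams, bigrams = train_bigram_lm(corpus)

print(bigram_prob("the", "cat", unigrams, bigrams))
print(perplexity(["the", "cat", "sat"], unigrams, bigrams))
print(perplexity(["sat", "cat", "the"], unigrams, bigrams))
```

A sentence whose word order matches the training data should receive a lower perplexity than a scrambled one. Real toolkits use much larger corpora and stronger smoothing or neural models, so this is only a way to get a feel for the task.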

Lemmatization

Wikipedia: Lemmatisation: https://en.wikipedia.org/wiki/Lemmatisation
Toolkit: WordNet Lemmatizer: http://www.nltk.org/api/nltk.stem.html#nltk.stem.wordnet.WordNetLemmatizer.lemmatize
Data: Treebank-3: https://catalog.ldc.upenn.edu/ldc99t42

Lip reading

Wikipedia: Lip reading: https://en.wikipedia.org/wiki/Lip_reading
Paper: Lip Reading Sentences in the Wild: https://arxiv.org/abs/1611.05358
Paper: 3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition: https://arxiv.org/abs/1706.05739
Project: Lip Reading – Cross Audio-Visual Recognition using 3D Convolutional Neural Networks: https://github.com/astorfi/lip-reading-deeplearning
Data: The GRID audiovisual sentence corpus: http://spandh.dcs.shef.ac.uk/gridcorpus/

Machine translation

Paper: Neural Machine Translation by Jointly Learning to Align and Translate: https://arxiv.org/abs/1409.0473
Paper: Neural Machine Translation in Linear Time: https://arxiv.org/abs/1610.10099
Challenge: ACL 2014 NINTH WORKSHOP ON STATISTICAL MACHINE TRANSLATION: http://www.statmt.org/wmt14/translation-task.html#download
Data: OpenSubtitles2016: http://opus.lingfil.uu.se/OpenSubtitles2016.php
Data: WIT3: Web Inventory of Transcribed and Translated Talks: https://wit3.fbk.eu/
Data: The QCRI Educational Domain (QED) Corpus: http://alt.qcri.org/resources/qedcorpus/

Named entity recognition

Wikipedia: Named-entity recognition: https://en.wikipedia.org/wiki/Named-entity_recognition
Paper: Neural Architectures for Named Entity Recognition: https://arxiv.org/abs/1603.01360
Project: OSU Twitter NLP Tool: https://github.com/aritter/twitter_nlp
Challenge: Named Entity Recognition in Twitter: https://noisy-text.github.io/2016/ner-shared-task.html
Data: CoNLL-2002 NER corpus: https://github.com/teropa/nlp/tree/master/resources/corpora/conll2002
Data: CoNLL-2003 NER corpus: https://github.com/synalp/NER/tree/master/corpus/CoNLL-2003

Paraphrase detection

Paper: Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.650.7199&rep=rep1&type=pdf
Project: Paralex: Paraphrase-Driven Learning for Open Question Answering: http://knowitall.cs.washington.edu/paralex/
Data: Microsoft Research Paraphrase Corpus: https://www.microsoft.com/en-us/download/details.aspx?id=52398
Data: Microsoft Research Video Description Corpus: https://www.microsoft.com/en-us/download/details.aspx?id=52422&from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fdownloads%2F38cf15fd-b8df-477e-a4e4-a4680caa75af%2F
Data: Pascal Dataset: http://nlp.cs.illinois.edu/HockenmaierGroup/pascal-sentences/index.html
Data: Flickr Dataset: http://nlp.cs.illinois.edu/HockenmaierGroup/8k-pictures.html
Data: The SICK data set: http://clic.cimec.unitn.it/composes/sick.html
Data: PPDB: The Paraphrase Database: http://www.cis.upenn.edu/~ccb/ppdb/
Data: WikiAnswers Paraphrase Corpus: http://knowitall.cs.washington.edu/paralex/wikianswers-paraphrases-1.0.tar.gz

Parsing

Wikipedia: Parsing: https://en.wikipedia.org/wiki/Parsing
Tool