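Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. The author has compiled an extensive overview of the field for NLP beginners. The selected references and resources focus on recent deep learning research. They offer a good starting point for anyone who wants to dig deeper into a particular NLP task.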
Anaphora Resolution
Automated Essay Scoring
Paper: Automatic Text Scoring Using Neural Networks: https://arxiv.org/abs/1606.04289
Paper: A Neural Approach to Automated Essay Scoring: http://www.aclweb.org/old_anthology/D/D16/D16-1193.pdf
Challenge: Kaggle: The Hewlett Foundation: Automated Essay Scoring: https://www.kaggle.com/c/asap-aes
Project: Enhanced AI Scoring Engine: https://github.com/edx/ease
Automatic Speech Recognition
Paper: Deep Speech 2: End-to-End Speech Recognition in English and Mandarin: https://arxiv.org/abs/1512.02595
Paper: WaveNet: A Generative Model for Raw Audio: https://arxiv.org/abs/1609.03499
Project: A TensorFlow implementation of Baidu’s Deep Speech architecture: https://github.com/mozilla/DeepSpeech
Project: Speech-to-Text-WaveNet: End-to-end sentence level English speech recognition using DeepMind’s WaveNet: https://github.com/buriburisuri/speech-to-text-wavenet
Challenge: The 5th CHiME Speech Separation and Recognition Challenge: http://spandh.dcs.shef.ac.uk/chime_challenge/
Data: The 5th CHiME Speech Separation and Recognition Challenge: http://spandh.dcs.shef.ac.uk/chime_challenge/download.html
Data: CSTR VCTK Corpus: http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html
Data: LibriSpeech ASR corpus: http://www.openslr.org/12/
Data: Switchboard-1 Telephone Speech Corpus: https://catalog.ldc.upenn.edu/ldc97s62
Data: TED-LIUM Corpus: http://www-lium.univ-lemans.fr/en/content/ted-lium-corpus
Automatic Summarization
Wikipedia: Automatic summarization: https://en.wikipedia.org/wiki/Automatic_summarization
Book: Automatic Text Summarization: https://www.amazon.com/Automatic-Text-Summarization-Juan-Manuel-Torres-Moreno/dp/1848216688/ref=sr_1_1?s=books&ie=UTF8&qid=1507782304&sr=1-1&keywords=Automatic+Text+Summarization
Paper: Text Summarization Using Neural Networks: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.823.8025&rep=rep1&type=pdf
Paper: Ranking with Recursive Neural Networks and Its Application to Multi-Document Summarization: https://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/viewFile/9414/9520
Data: Text Analysis Conference (TAC): https://tac.nist.gov/data/index.html
Data: Document Understanding Conferences (DUC): http://www-nlpir.nist.gov/projects/duc/data.html
Coreference Resolution
Paper: Deep Reinforcement Learning for Mention-Ranking Coreference Models: https://arxiv.org/abs/1609.08667
Paper: Improving Coreference Resolution by Learning Entity-Level Distributed Representations: https://arxiv.org/abs/1606.01323
Challenge: CoNLL 2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes: http://conll.cemantix.org/2012/task-description.html
Challenge: CoNLL 2011 Shared Task: Modeling Unrestricted Coreference in OntoNotes: http://conll.cemantix.org/2011/task-description.html
Grammatical Error Correction
Paper: Neural Network Translation Models for Grammatical Error Correction: https://arxiv.org/abs/1606.00189
Challenge: CoNLL 2013 Shared Task: Grammatical Error Correction: http://www.comp.nus.edu.sg/~nlp/conll13st.html
Challenge: CoNLL 2014 Shared Task: Grammatical Error Correction: http://www.comp.nus.edu.sg/~nlp/conll14st.html
Data: NUS Non-commercial research/trial corpus license: http://www.comp.nus.edu.sg/~nlp/conll14st/nucle_license.pdf
Data: Lang-8 Learner Corpora: http://cl.naist.jp/nldata/lang-8/
Data: Cornell Movie–Dialogs Corpus: http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
Project: Deep Text Corrector: https://github.com/atpaino/deep-text-corrector
Product: deep grammar: http://deepgrammar.com/
Grapheme-to-Phoneme Conversion
Paper: Grapheme-to-Phoneme Models for (Almost) Any Language: https://pdfs.semanticscholar.org/b9c8/fef9b6f16b92c6859f6106524fdb053e9577.pdf
Paper: Polyglot Neural Language Models: A Case Study in Cross-Lingual Phonetic Representation Learning: https://arxiv.org/pdf/1605.03832.pdf
Paper: Multitask Sequence-to-Sequence Models for Grapheme-to-Phoneme Conversion: https://pdfs.semanticscholar.org/26d0/09959fa2b2e18cddb5783493738a1c1ede2f.pdf
Project: Sequence-to-Sequence G2P toolkit: https://github.com/cmusphinx/g2p-seq2seq
Data: Multilingual Pronunciation Data: https://drive.google.com/drive/folders/0B7R_gATfZJ2aWkpSWHpXUklWUmM
Language Identification
Wikipedia: Language identification: https://en.wikipedia.org/wiki/Language_identification
Paper: Automatic Language Identification Using Deep Neural Networks: https://repositorio.uam.es/bitstream/handle/10486/666848/automatic_lopez-moreno_ICASSP_2014_ps.pdf?sequence=1
Challenge: 2015 Language Recognition Evaluation: https://www.nist.gov/itl/iad/mig/2015-language-recognition-evaluation
Language Modeling
Toolkit: KenLM Language Model Toolkit: http://kheafield.com/code/kenlm/
Paper: Distributed Representations of Words and Phrases and their Compositionality: http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
Paper: Character-Aware Neural Language Models: https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/viewFile/12489/12017
Data: Penn Treebank: https://github.com/townie/PTB-dataset-from-Tomas-Mikolov-s-webpage/tree/master/data
Lemmatization
Toolkit: WordNet Lemmatizer: http://www.nltk.org/api/nltk.stem.html#nltk.stem.wordnet.WordNetLemmatizer.lemmatize
Data: Treebank-3: https://catalog.ldc.upenn.edu/ldc99t42
Lip Reading
Paper: Lip Reading Sentences in the Wild: https://arxiv.org/abs/1611.05358
Paper: 3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition: https://arxiv.org/abs/1706.05739
Project: Lip Reading – Cross Audio-Visual Recognition using 3D Convolutional Neural Networks: https://github.com/astorfi/lip-reading-deeplearning
Data: The GRID audiovisual sentence corpus: http://spandh.dcs.shef.ac.uk/gridcorpus/
Machine Translation
Paper: Neural Machine Translation by Jointly Learning to Align and Translate: https://arxiv.org/abs/1409.0473
Paper: Neural Machine Translation in Linear Time: https://arxiv.org/abs/1610.10099
Challenge: ACL 2014 Ninth Workshop on Statistical Machine Translation: http://www.statmt.org/wmt14/translation-task.html#download
Data: OpenSubtitles2016: http://opus.lingfil.uu.se/OpenSubtitles2016.php
Data: WIT3: Web Inventory of Transcribed and Translated Talks: https://wit3.fbk.eu/
Data: The QCRI Educational Domain (QED) Corpus: http://alt.qcri.org/resources/qedcorpus/
Named Entity Recognition
Wikipedia: Named-entity recognition: https://en.wikipedia.org/wiki/Named-entity_recognition
Paper: Neural Architectures for Named Entity Recognition: https://arxiv.org/abs/1603.01360
Project: OSU Twitter NLP Tool: https://github.com/aritter/twitter_nlp
Challenge: Named Entity Recognition in Twitter: https://noisy-text.github.io/2016/ner-shared-task.html
Data: CoNLL-2002 NER corpus: https://github.com/teropa/nlp/tree/master/resources/corpora/conll2002
Data: CoNLL-2003 NER corpus: https://github.com/synalp/NER/tree/master/corpus/CoNLL-2003
Paraphrase Detection
Paper: Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.650.7199&rep=rep1&type=pdf
Project: Paralex: Paraphrase-Driven Learning for Open Question Answering: http://knowitall.cs.washington.edu/paralex/
Data: Microsoft Research Paraphrase Corpus: https://www.microsoft.com/en-us/download/details.aspx?id=52398
Data: Microsoft Research Video Description Corpus: https://www.microsoft.com/en-us/download/details.aspx?id=52422&from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fdownloads%2F38cf15fd-b8df-477e-a4e4-a4680caa75af%2F
Data: Pascal Dataset: http://nlp.cs.illinois.edu/HockenmaierGroup/pascal-sentences/index.html
Data: Flickr Dataset: http://nlp.cs.illinois.edu/HockenmaierGroup/8k-pictures.html
Data: The SICK data set: http://clic.cimec.unitn.it/composes/sick.html
Data: PPDB: The Paraphrase Database: http://www.cis.upenn.edu/~ccb/ppdb/
Data: WikiAnswers Paraphrase Corpus: http://knowitall.cs.washington.edu/paralex/wikianswers-paraphrases-1.0.tar.gz
Parsing