Natural Language Processing (NLP)
Introduction
Natural language: An ordinary language that has evolved naturally in humans through use and repetition, without conscious planning or premeditation.
•NLP: A combination of linguistics, computer science, and artificial intelligence that focuses on the interaction between computers and human language.
•NLP deals with processing and analyzing large amounts of natural language data.
History
1950 / Alan Turing: Published the article “Computing Machinery and Intelligence”, which proposed what is now called the Turing test as a criterion of intelligence.
1964 / ELIZA, Joseph Weizenbaum: A simulation of a Rogerian psychotherapist that rephrases the user's statements as responses using a few grammar rules.
1970 / SHRDLU, Terry Winograd: A natural language system working in a restricted “blocks world” with a restricted vocabulary; it worked extremely well.
1982 / Jabberwacky, Rollo Carpenter: A chatterbot with the stated aim to “simulate natural human chat in an interesting, entertaining and humorous manner”.
1990 / Dr. Sbaitso, Creative Labs (Singaporean company): An AI speech synthesis program for MS-DOS-based personal computers.
2006 / Watson, IBM: AI-based software.
2011 / Siri, Apple: A virtual assistant.
2014 / Amazon Alexa, Amazon: A virtual assistant.
2016 / Google Assistant, Google: A virtual assistant.
NLP Components
Phonetics and phonology: The study of language sounds
Ecology: The study of language conventions for punctuation, text mark-up and encoding
Morphology: The study of meaningful components of words
Syntax: The study of structural relationships among words
Lexical semantics: The study of word meaning
Compositional semantics: The study of the meaning of sentences
Pragmatics: The study of the use of language to accomplish goals
Discourse conventions: The study of conventions of dialogue
What is Text?
•A set of characters that belong to a particular language and have a specific meaning.
•Textual information, which is available in many forms and languages, has to be processed before it is fed to a machine.
Text processing
Three processing techniques are widely used for text analysis: lexical, syntactic, and semantic processing.
Lexical Processing
When we plot word frequency against frequency rank for any text document that has enough words, we see that the word frequencies follow a Zipf distribution: a few words occur very often, and most words occur rarely.
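The rank-frequency table behind such a plot can be sketched in a few lines of Python. This is a minimal illustration, not a full analysis: the function name `rank_frequency` and the sample sentence are assumptions made for this example, and the Zipf pattern (frequency roughly proportional to 1/rank) only becomes visible on a much larger corpus.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Lowercase the text and keep only alphabetic runs as words.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


# Hypothetical sample text; replace it with a real corpus to see the Zipf shape.
sample = "the cat sat on the mat and the dog sat on the rug"

for rank, word, count in rank_frequency(sample):
    # Under Zipf's law, rank * count stays roughly constant on large corpora.
    print(rank, word, count, rank * count)
```

Plotting count against rank on log-log axes for a large document should give an approximately straight line.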
A text corpus mostly contains three types of words:
1. Stop words: such as is, an, the, etc.
2. Significant words: These words help us with the actual text analysis
3. Rarely occurring words
Stop words are not very useful for many applications, so we usually remove them: they take up a lot of memory and can decrease model performance.
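Stop word removal can be sketched as a simple filter. In the Python sketch below, the `STOP_WORDS` set and the sample sentence are small hypothetical examples; real applications typically use a larger, language-specific list.

```python
# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"is", "an", "a", "the", "of", "to", "are", "and"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not stop words (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = "The movie is an adaptation of a novel".split()
print(remove_stop_words(tokens))  # ['movie', 'adaptation', 'novel']
```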
Tokenization
•Tokenization is the process of breaking a text corpus into units such as words, sentences, or paragraphs.
•The unit of splitting is chosen as per the requirements of the application.
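As a rough illustration, sentence-level and word-level tokenization can be sketched with regular expressions. The function names and the sample text are assumptions for this example, and the rules are deliberately naive: the sentence splitter will break on abbreviations that end in a period.

```python
import re


def sentence_tokenize(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


text = "NLP is fun. It deals with language data!"
print(sentence_tokenize(text))       # ['NLP is fun.', 'It deals with language data!']
print(word_tokenize("NLP is fun."))  # ['NLP', 'is', 'fun', '.']
```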
After removal of the stop words, we need to take care of redundant information as well.
Text Representation
Bag of Words
•Also called Bag-of-Words model
•Each row of the table represents one document.
•Columns represent the vocabulary of the text
For example :
Doc1: “Dangal is a super duper hit movie”
Doc2: “The success of movie depends upon the performance of the actors”
Doc3: “No movies are releasing due to Covid”
•In the above model, after removing all the stop words, the value in each cell represents the number of times a term ‘t’ is present in document ‘d’, which is the term frequency.
•The term ‘movie’ is present in all the documents, while ‘actor’ is present only in the second document (once ‘movies’ and ‘actors’ are reduced to their base forms).
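The document-term table for the three example documents can be rebuilt with a short Python sketch. The stop word list and the crude plural-stripping rule in `normalize` are assumptions made for this example; they stand in for a proper stop word list and a real stemmer or lemmatizer.

```python
import re
from collections import Counter

DOCS = {
    "Doc1": "Dangal is a super duper hit movie",
    "Doc2": "The success of movie depends upon the performance of the actors",
    "Doc3": "No movies are releasing due to Covid",
}

# Hypothetical stop word list chosen for these three documents.
STOP_WORDS = {"is", "a", "the", "of", "are", "to", "no", "upon"}


def normalize(token):
    """Crude stand-in for stemming: strip a plural/verb 's' (movies -> movie)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def terms(text):
    """Lowercase, tokenize, drop stop words, and normalize the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [normalize(t) for t in tokens if t not in STOP_WORDS]


def bag_of_words(docs):
    """Return the sorted vocabulary and one term-frequency row per document."""
    counts = {name: Counter(terms(text)) for name, text in docs.items()}
    vocabulary = sorted(set().union(*counts.values()))
    matrix = {name: [c[t] for t in vocabulary] for name, c in counts.items()}
    return vocabulary, matrix


vocabulary, matrix = bag_of_words(DOCS)
print(vocabulary)
for name, row in matrix.items():
    print(name, row)
```

Each printed row is one document and each position in the row is one vocabulary term, matching the rows and columns described above.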
TF-IDF
•Instead of focusing on the raw word frequencies in the table created for the bag-of-words model, we can use a representation that focuses more on word importance.
TF Value Calculation
•Review 2: This movie is not scary and is slow
Here,
•Vocabulary: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’
•Number of words in Review 2 = 8
•TF for the word ‘this’ = (number of times ‘this’ appears in review 2)/(number of terms in review 2) = 1/8
IDF calculation
We can calculate the IDF value for each word in Review 2 in the same way, for example: IDF(‘this’) = log(number of documents/number of documents containing the word ‘this’) = log(3/3) = log(1) = 0
TF-IDF Calculation
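The TF-IDF score of a term in a document is the product of its TF and IDF values, so a word such as ‘this’, which appears in every document, gets a score of 0. The Python sketch below works through the calculation. Only Review 2 comes from the text above; the wording of Reviews 1 and 3 is an assumption chosen to be consistent with the listed vocabulary, and the base-10 logarithm is also an assumption, since the base is not stated.

```python
import math

REVIEWS = {
    "Review 1": "This movie is very scary and long",  # assumed wording
    "Review 2": "This movie is not scary and is slow",
    "Review 3": "This movie is spooky and good",      # assumed wording
}


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def tf(term, tokens):
    """Term frequency: occurrences of the term / number of terms in the document."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """Inverse document frequency; assumes the term occurs in at least one document."""
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log10(len(corpus) / containing)


def tf_idf(term, tokens, corpus):
    """TF-IDF is the product of TF and IDF."""
    return tf(term, tokens) * idf(term, corpus)


corpus = [tokenize(text) for text in REVIEWS.values()]
review2 = tokenize(REVIEWS["Review 2"])

for term in sorted(set(review2)):
    print(term, round(tf_idf(term, review2, corpus), 3))
```

Words that occur in all three reviews score 0, while words that occur only in Review 2, such as ‘not’ and ‘slow’, receive the highest scores in that review.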
Application of TF-IDF
•Spam Email Detector
•Sentiment Classification
Syntactic Processing
•Focuses more on the grammar and syntax of the text.
•Widely used in applications such as:
•Question answering systems
•Information extraction
•Sentiment analysis
•Grammar checking
•Part-of-speech (PoS) tagging is an important task that is used as a preprocessing step in many applications.
Semantic Processing
•Focuses more on the meaning of a given piece of text.
Text Database
•WordNet and ConceptNet: WordNet is a semantically oriented dictionary of English with a richer structure than a traditional thesaurus; ConceptNet is a semantic network of common-sense knowledge.
Word Sense Disambiguation
•The word sense disambiguation (WSD) task is to identify the correct sense of an ambiguous word, such as bank, bark, or pitch.
•Lesk Algorithm
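The idea behind the Lesk algorithm is to pick the sense whose dictionary definition (gloss) shares the most words with the context of the ambiguous word. The Python sketch below is a simplified variant, not the full algorithm: the two glosses for ‘bank’ and the stop word list are hypothetical stand-ins for a real sense inventory such as WordNet.

```python
import re

# Hypothetical stop word list and glosses, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "to", "or", "and", "that", "i", "we", "on"}

BANK_SENSES = {
    "financial": "an institution that accepts money deposits and provides loans",
    "river": "sloping land beside a river or other body of water",
}


def content_words(text):
    """Lowercase, tokenize, and drop stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


def simplified_lesk(word, sentence, sense_glosses):
    """Return the sense whose gloss overlaps most with the sentence context."""
    context = content_words(sentence) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(simplified_lesk("bank", "I went to the bank to deposit money", BANK_SENSES))
print(simplified_lesk("bank", "We sat on the bank of the river", BANK_SENSES))
```

With these glosses, the first sentence overlaps with the financial sense through ‘money’ and the second with the river sense through ‘river’. Ties are resolved in favor of the sense listed first.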
•NLP is used in many applications:
•Social media analysis: Twitter sentiment analysis, topic modeling, etc.
•Chatbots
•Information extraction, etc.