What is spaCy?
The name may sound odd at first, and some readers may guess that it has something to do with space. It does not.
So what is spaCy? spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.
Natural Language Processing (NLP) is an area of computer science and artificial intelligence concerned with the interaction between computers and human (natural) language. With NLP techniques, a program can analyze large amounts of natural language data.
spaCy is one of the best frameworks for processing large amounts of natural language text.
With spaCy you can also build your own model: after creating the model, you train it on a variety of examples, test it on new data, and measure its accuracy.
HOW TO INSTALL SPACY ON YOUR MACHINE?
spaCy is open source and can be installed on various operating systems.
Before installing spaCy, make sure Python is properly installed on your machine. To check whether Python is installed and which version you have, see the link below.
https://stackoverflow.com/questions/8917885/which-version-of-python-do-i-have-installed
1) LINUX OR UBUNTU
Using pip, spaCy releases are available as source packages and binary wheels.
TERMINAL COMMANDS
Step 1) pip install -U spacy
When using pip it is generally recommended to install packages in a virtual environment to avoid modifying system state:
Step 2) python -m venv .env
source .env/bin/activate
pip install spacy
2) CONDA
spaCy can also be installed with conda, from the conda-forge channel:
Command => conda install -c conda-forge spacy
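Whichever install method you use, it can help to confirm from inside Python that the package is actually visible in the active environment. The snippet below is only a small sketch: the helper name installed_version is made up for this post, and it relies on the standard library module importlib.metadata, which assumes Python 3.8 or newer.

```python
import platform
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the interpreter version and whether spaCy can be found.
print("Python version:", platform.python_version())

spacy_version = installed_version("spacy")
if spacy_version is None:
    print("spaCy is not installed in this environment.")
else:
    print("spaCy version:", spacy_version)
```

If the script reports that spaCy is missing even though you installed it, the most likely cause is that the virtual environment is not activated.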
FEATURES PRESENT IN SPACY
spaCy exposes many features as simple built-in functions and attributes, which cuts down the amount of code you have to write yourself. Some of the features of spaCy are:
- Tokenization
- Part-of-speech (POS) Tagging
- Dependency Parsing
- Lemmatization
- Sentence Boundary Detection (SBD)
- Named Entity Recognition (NER)
- Similarity
- Text Classification
- Rule-based Matching
- Training
- Serialization
I will explain some of these concepts in detail below.
1) Tokenization
Tokenization means segmenting the given text into words, punctuation marks and so on. It matters because text can only be analyzed after it has been split into tokens. The tokenizer also has to treat abbreviations correctly, and spaCy is good at recognizing them.
For example, punctuation at the end of a sentence should be split off, whereas "U.K." should remain one token.
CODE
Make sure the model 'en_core_web_sm' is installed on your machine (for example with: python -m spacy download en_core_web_sm).
import spacy
# Load the small English pipeline
nlp = spacy.load('en_core_web_sm')
doc = nlp('Apple is looking at buying U.K. startup for $1 billion')
# Print each token's text on its own line
for token in doc:
    print(token.text)
The output is the following list of tokens, printed one per line (shown here separated by | to save space):
Apple | is | looking | at | buying | U.K. | startup | for | $ | 1 | billion
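To make the splitting rule above more concrete (trailing punctuation is split off, while "U.K." stays whole), here is a toy tokenizer written with plain Python. It is not spaCy's actual implementation, only a rough sketch of the idea: the exception list, the prefix and suffix characters, and the function name toy_tokenize are all assumptions made up for this illustration.

```python
# Hypothetical exception list: strings that must stay a single token.
EXCEPTIONS = {"U.K.", "U.S.", "e.g.", "i.e."}
# Hypothetical characters to split off the start / end of a chunk.
PREFIXES = "$(\"'"
SUFFIXES = ".,!?;:)\"'"


def toy_tokenize(text):
    """Split on whitespace, then peel prefixes and suffixes off each chunk."""
    tokens = []
    for chunk in text.split():
        suffixes = []  # suffixes are collected and appended after the word
        while chunk:
            if chunk in EXCEPTIONS:
                # Known abbreviation: keep it whole.
                tokens.append(chunk)
                chunk = ""
            elif chunk[0] in PREFIXES and len(chunk) > 1:
                tokens.append(chunk[0])
                chunk = chunk[1:]
            elif chunk[-1] in SUFFIXES and len(chunk) > 1:
                suffixes.insert(0, chunk[-1])
                chunk = chunk[:-1]
            else:
                tokens.append(chunk)
                chunk = ""
        tokens.extend(suffixes)
    return tokens


print(toy_tokenize("Apple is looking at buying U.K. startup for $1 billion."))
# ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '1', 'billion', '.']
```

The exception check runs before any punctuation is stripped, which is why the final period of "U.K." survives while the period after "billion" becomes its own token.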
2) Part-of-speech tags and dependencies
spaCy can parse and tag a given Doc. This is where the statistical model comes in, which enables spaCy to make a prediction of which tag or label most likely applies in this context.
Like many NLP libraries, spaCy encodes all strings to hash values to reduce memory usage and improve efficiency.
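As a small illustration of this hashing, the sketch below looks up a string in the vocabulary's string store and maps the resulting hash back to text. It assumes spaCy is installed, and it uses a blank English pipeline (spacy.blank) so that no trained model is needed; the example sentence is arbitrary.

```python
import spacy

# A blank pipeline only tokenizes, so no model download is required.
nlp = spacy.blank("en")
doc = nlp("I love coffee")

# String -> hash: the string store returns a 64-bit integer ID.
coffee_hash = nlp.vocab.strings["coffee"]
# Hash -> string: works because "coffee" was seen while processing the doc.
coffee_text = nlp.vocab.strings[coffee_hash]

print(coffee_hash, coffee_text)
# Attributes without a trailing underscore hold the hash,
# attributes with a trailing underscore hold the readable string.
print(doc[2].orth == coffee_hash, doc[2].orth_ == coffee_text)
```

This is also why most token attributes come in pairs, such as pos and pos_: the first is the integer ID, the second is the string it stands for.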
CODE
import spacy
# Load the small English pipeline (tagger, parser, etc.)
nlp = spacy.load('en_core_web_sm')
doc = nlp('Apple is looking at buying U.K. startup for $1 billion')
# Print the linguistic attributes of every token
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)
The print call outputs the following attributes for each token:
Text: The original word text.
Lemma: The base form of the word. For example, the lemma of "is" is "be".
POS: The simple part-of-speech tag. For example, "looking" is tagged VERB.
Tag: The detailed part-of-speech tag.
Dep: Syntactic dependency, i.e. the relation between tokens.
Shape: The word shape — capitalisation, punctuation, digits.
is alpha: Does the token consist of alphabetic characters?
is stop: Is the token part of a stop list, i.e. the most common words of the language?
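The Shape attribute is probably the least familiar one in this list, so here is a rough re-creation of the idea in plain Python. This is a sketch and not spaCy's own code: the function name toy_word_shape is made up, and the rule it follows (letters become x or X, digits become d, runs of the same shape character are cut off after four) is my reading of how spaCy describes the attribute.

```python
def toy_word_shape(text, max_run=4):
    """Map letters to x/X and digits to d, truncating long runs of one symbol."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # punctuation and symbols are kept as they are
        # Count how many times the same shape character repeats in a row.
        if mapped == last:
            run += 1
        else:
            last = mapped
            run = 1
        if run <= max_run:
            shape.append(mapped)
    return "".join(shape)


for word in ["Apple", "U.K.", "billion", "$", "1"]:
    print(word, toy_word_shape(word))
# Apple Xxxxx
# U.K. X.X.
# billion xxxx
# $ $
# 1 d
```

The shape throws away the exact letters but keeps the pattern of capitals, digits and punctuation, which is often enough to tell a name, a number and an abbreviation apart.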
THE REMAINING FEATURES WILL BE EXPLAINED IN THE NEXT POST