A Simple and Efficient Way to Tokenize Any Kind of Text with Python
Tokenizing a large corpus of text in various languages.
Introduction
Tokenization is the process of converting sentences or runs of words into tokens for later machine learning or NLP tasks. It is a key step in every natural language processing project, because it makes text data easier to analyze and process before further machine learning methods are applied. Tokenization is widely used, through a variety of methods, as a step in feature extraction.
Various Tokenization Techniques
White Space Tokenization
Dictionary based Tokenization
Rule based Tokenization
Penn Treebank Tokenization
spaCy Tokenizer
NLTK Tokenizer
Moses Tokenizer
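To make the first entries in this list concrete, here is a minimal sketch that contrasts whitespace tokenization with one simple rule-based approach. The sample sentence and the regex rule are illustrative assumptions, not taken from any of the libraries listed above.

```python
import re

# Illustrative sample sentence (an assumption, not part of the article's corpus)
sample = "L'écriture des programmes, c'est une étape importante."

# Whitespace tokenization: split on runs of whitespace.
# Punctuation stays attached to the neighbouring word.
whitespace_tokens = sample.split()

# A simple rule-based tokenizer (hypothetical rule): a token is either
# a run of word characters or a single punctuation character.
rule_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(whitespace_tokens)
print(rule_tokens)
```

The whitespace version keeps "programmes," as a single token, while the rule-based version separates the comma and the apostrophes. Which behaviour is preferable depends on the downstream task.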
Method
We'll be using a French text from Wikipedia on computer programming.
fr_text = r"""La programmation informatique est l'ensemble des activités qui permettent l'écriture des programmes informatiques. C'est une étape importante de la conception de logiciel et de matériel.Pour écrire le résultat de cette activité, on utilise un langage de programmation, un code de communication permettant à un être humain de dialoguer avec une machine en lui soumettant des instructions et en analysant les données matérielles fournies par le système, généralement un ordinateur."""
import re

def sentence2token(corpus):
    # Method using regex: replace each sentence-ending character with a common marker
    tempCorpus = re.sub(r"[?]", "END", corpus)
    tempCorpus = re.sub(r"[.]", "END", tempCorpus)
    tempCorpus = re.sub(r"[\n]", "END", tempCorpus)
    # Split on the marker to get a list of sentences
    tokens = re.split("END", tempCorpus)
    # Output
    return tokens

outcome = sentence2token(fr_text)
Discussion
We used re, Python's built-in regular expression library, to split our French text, an excerpt from an article on computer programming, into sentence-level tokens. The function relies on a few regex calls, described below.
First, we used re.sub() to replace every occurrence of a sentence-ending character (a question mark, a period, or a newline) with the marker string END. This removes those special characters and marks the points where the string should be divided.
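The effect of the replacement step is easier to see on a short string. The sketch below uses a made-up sample text and, for brevity, a single character class instead of separate calls. It also shows one caveat worth keeping in mind: if the input already contains the letters END, the later split will cut there too.

```python
import re

# Made-up sample text (an assumption for illustration)
text = "Qui écrit le code ? Un être humain. Une machine l'exécute."

# Replace each '?' or '.' with the marker string
marked = re.sub(r"[?.]", "END", text)
print(marked)

# Caveat: a word that already contains "END" is split in the wrong place
collision = re.split("END", re.sub(r"[?.]", "END", "WEEKEND. Fin."))
print(collision)
```

A marker that is unlikely to occur in real text, or a direct split on the punctuation itself, avoids this collision.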
Then we used re.split() to split the processed string on the END marker, which gives a list of strings, a suitable data type for further machine learning work.
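As a possible variation, the same result can be sketched in one pass by splitting directly on the sentence-ending characters. The helper name split_sentences and the sample text are hypothetical. Unlike the function above, this version also strips surrounding whitespace and drops empty strings.

```python
import re

def split_sentences(text):
    """Hypothetical helper: split on '?', '.', or a newline in one pass."""
    parts = re.split(r"[?.\n]", text)
    # Strip surrounding whitespace and drop the empty strings that appear
    # when two delimiters are adjacent or the text ends with a delimiter
    return [part.strip() for part in parts if part.strip()]

# Made-up sample text (an assumption for illustration)
sample = "La programmation est une étape importante. Qui écrit le code ?\nUn être humain."
sentences = split_sentences(sample)
print(sentences)
```

Splitting directly on the punctuation removes the need for an intermediate marker, so text that happens to contain the marker string is left intact.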