A Simple and Efficient Way to Tokenize Any Kind of Text with Python


Introduction

Tokenization is the process of converting sentences or groups of words into tokens for downstream machine learning or NLP tasks. It is a key step in every natural language processing project, because it makes text data easier to analyze and process before applying machine learning techniques. Many feature-extraction methods rely on some form of tokenization.
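As a minimal illustration of the idea, even splitting a sentence on whitespace already turns it into word tokens:

```python
# Minimal example: the simplest tokenizer splits a sentence on whitespace.
sentence = "Tokenization turns text into tokens"
tokens = sentence.split()
print(tokens)  # ['Tokenization', 'turns', 'text', 'into', 'tokens']
```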

Various Tokenization Techniques

White Space Tokenization

Dictionary based Tokenization

Rule based Tokenization

Penn Treebank Tokenization

Spacy Tokenizer

NLTK Tokenizer

Moses Tokenizer
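To give a feel for the difference between these approaches, here is a minimal sketch of white space tokenization versus a simple rule-based (regex) tokenization, using only Python's standard library; the third-party tokenizers above implement more sophisticated versions of the same idea:

```python
import re

text = "C'est une étape importante."

# White space tokenization: split on runs of whitespace only.
ws_tokens = text.split()
print(ws_tokens)    # ["C'est", 'une', 'étape', 'importante.']

# Rule-based tokenization: a regex rule that also separates punctuation
# (here: runs of word characters, or any single non-word, non-space character).
rule_tokens = re.findall(r"\w+|[^\w\s]", text)
print(rule_tokens)  # ['C', "'", 'est', 'une', 'étape', 'importante', '.']
```

Note how the rule-based version splits off the apostrophe and the final period, while the whitespace version leaves punctuation attached to words.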

Method

We’ll be using a French text from the Wikipedia article on computer programming:

import re

fr_text = r"""La programmation informatique est l'ensemble des activités qui permettent l'écriture des programmes informatiques. C'est une étape importante de la conception de logiciel et de matériel.Pour écrire le résultat de cette activité, on utilise un langage de programmation, un code de communication permettant à un être humain de dialoguer avec une machine en lui soumettant des instructions et en analysant les données matérielles fournies par le système, généralement un ordinateur."""

def sentence2token(corpus):
    # method using regex: mark each sentence ending, then split on the marker
    tempCorpus = re.sub(r"[?]", "END", corpus)
    tempCorpus = re.sub(r"[.]", "END", tempCorpus)
    tempCorpus = re.sub(r"[\n]", "END", tempCorpus)
    tokens = re.split("END", tempCorpus)
    # output: list of sentence tokens
    return tokens

outcome = sentence2token(fr_text)

Discussion

We used Python's built-in re module (regular expressions) to tokenize our French passage from an article on computer programming. The function above walks through a few regex operations.

First of all, we used re.sub() to replace every occurrence of a sentence-ending character (question marks, full stops, and newlines) with the marker END, removing that punctuation and marking where each sentence ends.
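For example, each re.sub() call handles one class of sentence-ending character (the sample sentence here is made up for illustration):

```python
import re

line = "Est-ce une étape importante? Oui."

# Replace question marks with the END marker.
step1 = re.sub(r"[?]", "END", line)
print(step1)  # Est-ce une étape importanteEND Oui.

# Then replace full stops with the same marker.
step2 = re.sub(r"[.]", "END", step1)
print(step2)  # Est-ce une étape importanteEND OuiEND
```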

Finally, we used re.split() to split the processed string on that marker, converting it into a list of sentences, a datatype that is convenient for further machine learning work.
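re.split() then breaks the marked string at every END, yielding a Python list (the input string here is a toy example; stripping empty entries is an optional cleanup step not in the original function):

```python
import re

marked = "Première phraseEND Deuxième phraseEND"

# Split on the marker, then drop surrounding whitespace and empty entries.
sentences = [s.strip() for s in re.split("END", marked) if s.strip()]
print(sentences)  # ['Première phrase', 'Deuxième phrase']
```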


Hi everyone, I am a computer science student trying to enhance my technical experience by sharing and adapting various kinds of technological information.
