
Efficiency Improvement and Cost Reduction with a Simple Text Mining Technique


Let me start by saying that a lot of real-world practical problems can be solved by applying simple artificial intelligence (AI) techniques. To be clear, I'm not talking about the kind of AI research done at Google and the like. More importantly, the keyword here is "SIMPLE".

In this article, I will give an example of improving operational efficiency, and thus reducing a company's operating expenses, with an extremely simple text mining algorithm. You will not believe how many hours of intensive manual labor just a few lines of code can save.

At the end, I will also give my opinion on why inefficiencies that could be solved with simple artificial intelligence techniques sometimes, or maybe most of the time, still persist at these companies.

(Please note that I use the terms artificial intelligence (a.k.a. machine learning) and text mining loosely here. Text mining is a branch of artificial intelligence that deals mainly with unstructured data, i.e. text data.)


Before I detail my personal experience of applying text mining to improve efficiency, let me mention a Bloomberg article I happened to stumble across, on how text mining helped eliminate 360,000 hours of manual labor.

JPMorgan, one of the largest banks in the world, recently completed an automated machine-learning program called “COIN”, for Contract Intelligence. According to JPMorgan, this program parsed financial deals such as commercial loan agreements that used to consume 360,000 hours of work each year by lawyers and loan officers. Furthermore, it claims that the software can review documents in seconds and is less error-prone. The benefits of “COIN” are obvious: reducing expenses as well as risks.

The full Bloomberg article can be accessed below:

You may ask, is this something new (as in, a new technology)? No, I would say. You can almost bet that software firms such as Google, Facebook, Amazon, etc. have long leveraged similar technology in their work. But these techniques haven't been widely used in other industries such as finance. Thus, there are huge opportunities in non-software industries to improve their overall efficiency.

Now, let me talk about my experience. I once worked at a high-tech semiconductor firm where, as you can imagine, the focus was on hardware. Some of the software used at the firm was timeworn; for example, obsolete Perl scripts from 15 years ago. There were also a lot of inefficiencies that could be optimized. For instance, a dedicated team of 50 people manually looked at employee/user requests arriving via email and categorized those requests. After I understood this team's business operations, the first thing that popped into my mind was: couldn't this operation be automated, eliminating the tedious manual labor?

The answer is, of course, "YES". In fact, it's very simple to achieve. I will discuss the automation process in the next two sections. The key point to remember is that this simple automation program could result in $5 million of cost reduction annually.

Text Classification

The challenge I described in the previous section is a text classification problem. That is, given a document (e.g. an email or user request), figure out which category it belongs to. The example everyone should be familiar with is "spam/not spam" for email. How does one look at an email and determine whether or not it's spam?

The simplest method is hard-coding the rules. For instance: the email was not sent to me but to "undisclosed recipients", or the email body contains quite a few capitalized words, or some combination of these characteristics.

The hard-coded rules method can work well, with high accuracy. However, building and maintaining these rules is expensive.
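To make this concrete, here is a minimal sketch of such hard-coded rules in Python. The rule list, the capitalized-word threshold, and the function name are all illustrative assumptions on my part, not rules from any real spam filter:

```python
# A hypothetical hard-coded rule set (thresholds are made up for illustration).
def is_spam(sender_field, body):
    """Flag an email as spam if it trips any hard-coded rule."""
    rules = [
        # Rule 1: not addressed to a real recipient
        "undisclosed recipients" in sender_field.lower(),
        # Rule 2: the body contains many fully capitalized words
        sum(1 for w in body.split() if w.isupper() and len(w) > 1) >= 5,
    ]
    return any(rules)
```

The weakness shows immediately: every new spam pattern means another hand-written rule, which is exactly the maintenance cost described above.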

Thus, we want to keep the benefits of hard-coded rules yet eliminate the cost of building and maintaining them. We can achieve this by combining hard-coded rules with machine learning. Specifically, we will use supervised learning: we train a model on existing documents (e.g. emails) and the categories they belong to (e.g. spam, not spam).

The supervised learning technique we will be using is called "Naïve Bayes". Please note that other machine-learning techniques could be used, such as logistic regression, SVMs, etc. These other algorithms may improve the accuracy of the classification. However, since the intent of this article is not to research machine-learning algorithms, they are beyond our scope at this time. My goal is to demonstrate improving real-world business operations with simple automation using artificial intelligence techniques.

The intuition behind "Naïve Bayes" is very simple; after all, "naïve" suggests simplicity. This classification technique is based on the well-known Bayes rule. If you don't know it, you can look it up on Wikipedia:

It's just conditional probability, as you might have learned in a high school or college statistics class.

The key to using "Naïve Bayes" here is to work with a subset of the words in the documents. Without going through the details of the algorithm, you can think of it as computing the probability of a word given a category. Thus, if you ask "Naïve Bayes" what the chance is of the word "lottery" appearing in a spam email, it can tell you the precise probability.
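As a toy illustration of that conditional-probability view, here is the Bayes-rule arithmetic worked out in a few lines. All the counts below are invented purely for the example:

```python
# Invented counts for illustration: 1,000 emails total.
spam_emails, ham_emails = 400, 600
lottery_in_spam, lottery_in_ham = 80, 3   # emails containing "lottery"

p_spam = spam_emails / (spam_emails + ham_emails)       # P(spam) = 0.4
p_word_given_spam = lottery_in_spam / spam_emails       # P("lottery" | spam) = 0.2
p_word = (lottery_in_spam + lottery_in_ham) / 1000.0    # P("lottery") = 0.083

# Bayes rule: P(spam | "lottery") = P("lottery" | spam) * P(spam) / P("lottery")
p_spam_given_word = p_word_given_spam * p_spam / p_word  # about 0.96
```

With these made-up counts, an email containing "lottery" is about 96% likely to be spam. Naïve Bayes simply does this bookkeeping over many words at once, under the "naïve" assumption that the words are independent of each other.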


Of course, if we had to implement "Naïve Bayes" or another machine-learning algorithm ourselves, there would be a lot of work. Since I have already stressed the word "simplicity" multiple times, we will make good use of the "NLTK" package in Python. You will be surprised how much can be accomplished with just a few lines of code.

For the "NLTK" package, take a look at http://www.nltk.org

The example problem I will use is the one discussed above: 50 people manually look at user requests and categorize them. As an illustration, suppose there are two categories, "Unix" and "Windows", i.e. is the user requesting help with a Unix machine or a Windows machine?

So, how does a real person determine whether a user request is a "Unix" request or a "Windows" request? In other words, what are the simple hard-coded rules? A request that contains the word "Unix" is most likely a Unix request, and vice versa for "Windows".

You can add more words to this set of rules through experience. But we can also figure them out another way: gather basic statistics on all the words. For example, we look at the frequency of each word; if, let's say, the word "archive" appears 1,000 times in the "Unix" category but only 10 times in the "Windows" category, we can add this word to our rules.
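That frequency-gathering step can be sketched in a few lines. The function name, the `ratio` threshold, and the add-one smoothing below are my own illustrative choices, not part of NLTK:

```python
from collections import Counter

# Hypothetical sketch: count word frequencies per category and keep
# words whose usage is heavily skewed toward one category.
def pick_feature_words(docs, ratio=10):
    """docs: list of (text, category) pairs; returns candidate rule words."""
    counts = {"Unix": Counter(), "Windows": Counter()}
    for text, category in docs:
        counts[category].update(text.lower().split())
    features = []
    for word in set(counts["Unix"]) | set(counts["Windows"]):
        u = counts["Unix"][word] + 1      # add-one smoothing avoids
        w = counts["Windows"][word] + 1   # division by zero
        if u / w >= ratio or w / u >= ratio:
            features.append(word)
    return features
```

Words that pass the ratio test ("archive", "laptop", etc.) are exactly the kind of words we want to hand to the classifier as features.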

With this understanding, let's see how we can combine the set of rules with the Naïve Bayes implementation in NLTK to automate our "Unix/Windows" categorization:

# Suppose "docs" is a list of (document text, category) pairs, i.e. we
# have 1000 user requests and already know each request's category.

import nltk

def getFeatures(document, word_features):
    tokens = nltk.word_tokenize(document)
    document_words = set(w.lower() for w in tokens)
    features = {}
    for word in word_features:
        features['contain({})'.format(word)] = (word in document_words)
    return features

# The hard-coded rule words we want the classifier to look at
word_features = ['unix', 'windows', 'archive', 'laptop', ...]

# Build feature sets and split them into training and test sets
f = [(getFeatures(docText, word_features), category)
     for (docText, category) in docs]
trainSet, testSet = f[200:], f[:200]

classifier = nltk.NaiveBayesClassifier.train(trainSet)

# Determine a new user request (i.e. newDocText)'s category
classifier.classify(getFeatures(newDocText, word_features))

# You can also check the accuracy of the classification via:
nltk.classify.accuracy(classifier, testSet)


Wow, as you can see, just a few lines of code can eliminate $5 million in annual company expenses. Isn't that amazing? Of course, one can argue the problem presented is simple and thus solvable with just a few lines of code. I totally agree with this assessment. In fact, I would argue that in the real world, there are more simple problems than complicated ones. Even complicated problems can often be reduced to simpler forms and tackled with simple methods. The small percentage of truly complicated problems can be left to artificial intelligence scientists.

Similar to JPMorgan's COIN program, the above classification of user requests not only reduces cost but also improves turnaround time. It used to take a few hours for a user request to be assigned a category, because human labor is slower and doesn't have the capacity to handle a large volume of requests. Furthermore, the error rate of the automated program is much lower than that of human labor.

Lastly, in the real world outside of the software industry, a lot of problems/inefficiencies are not very complicated. But they go unsolved due to numerous non-technical issues. For instance, employees at these firms don't have the talent; this is understandable, since software and artificial intelligence are not those firms' core business. Second, even when the talent exists at a firm, it can be tough to fix the inefficiency. Can you guess why?

Eliminating inefficiency through automation usually means job losses for a lot of people. If the improvement is an order from executive management, then everything should be fine. However, if you are an employee at the firm and observe an inefficiency (which you could help resolve) in another team or department, what are you going to do? Do you want to tell your boss (depending on how open he/she is) that you can help the company save a certain amount of money by putting your coworkers out of work?

About the Author
Sean Chen
Passionate about technology, data science, finance and trading. Develops R and Python cheat sheets @ http://datasciencefree.com. Plays poker and sports for fun.

December 18, 2017

