Our typical workday is spent reading and writing e-mails, letters, reports, articles, and so on. It is no surprise, then, that an estimated 80% of the information in business comes in unstructured forms, most prominently text.
Computational linguistics is an old discipline, and computer scientists have been working on natural language for decades. In the 1990s, text analytics began to emerge as a subject in its own right. It has found applications ranging from pharmaceutical drug discovery, risk management, and cyber-crime prevention to customer care.
Text analytics, with the use of “natural language processing” (NLP), holds the key to unlocking the business value of these big data assets. In the era of big data, the right platform enables businesses to fully utilize their data and take advantage of the latest text analytics and NLP algorithms.
How does NLP work?
NLP is a way for computers to analyze and understand human language and derive meaningful conclusions from it. Common word processors treat language as a plain string of symbols. NLP, in contrast, considers the hierarchy of language: letters make words, words make phrases, phrases make sentences, and sentences eventually convey an idea.
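The hierarchy described above can be sketched in a few lines of Python. This is only an illustration using hand-written rules from the standard library, not how a production NLP library works; the function names (`split_sentences`, `split_words`, `to_hierarchy`) and the sample text are made up for this example.

```python
import re

def split_sentences(text):
    # Naive rule (an assumption for this sketch): a sentence ends at
    # '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def split_words(sentence):
    # Treat runs of letters, digits or apostrophes as words.
    return re.findall(r"[A-Za-z0-9']+", sentence)

def to_hierarchy(text):
    # Text -> list of sentences, each sentence -> list of words.
    return [split_words(s) for s in split_sentences(text)]

sample = "NLP reads text. It groups letters into words!"
hierarchy = to_hierarchy(sample)
print(hierarchy)
```

A word processor would see `sample` as one string of 45 symbols; the sketch instead recovers two sentences and the words inside each. Real text needs more careful rules, because abbreviations such as "etc." break the naive sentence split.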
NLP algorithms are typically based on machine learning, because the system should learn rules automatically by analyzing examples rather than rely on rules written by hand. They are used to summarize blocks of text, to build chatbots that interact with customers, and to identify the sentiment of a piece of text.
There are open source NLP libraries such as Apache OpenNLP, which provides sentence segmentation and part-of-speech tagging, among other features. The Natural Language Toolkit (NLTK) provides modules for processing text, tagging, parsing and more.
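To make "learning rules from examples" concrete, here is a minimal sketch of a sentiment classifier. It assumes a Naive Bayes model with add-one smoothing, which is one common choice and not necessarily what any particular product uses; the four training sentences and the function names `train` and `classify` are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    # examples: list of (text, label) pairs.
    # Count how often each word appears under each label.
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Start from the log prior of the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Tiny made-up training set; a real system needs far more examples.
training = [
    ("great product fast delivery", "positive"),
    ("love the friendly support", "positive"),
    ("terrible service slow delivery", "negative"),
    ("hate the broken product", "negative"),
]
word_counts, label_counts = train(training)
print(classify("love this great product", word_counts, label_counts))
print(classify("slow and terrible", word_counts, label_counts))
```

Nobody wrote a rule saying that "love" is positive; the classifier inferred it from the labeled examples, which is the point the paragraph above makes.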
Data scraping, sometimes called web scraping, data extraction, or web harvesting, is simply the process of collecting data from websites and storing it in a local database or spreadsheet. To a beginner, data scraping may sound like scary technical jargon, but it is more understandable than you might think. Data scraping tools come in handy not only for recruitment purposes, but also in marketing, finance, e-commerce and many other industries.
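As a rough sketch of what a scraping tool does under the hood, the Python below pulls the rows out of an HTML table and writes them as CSV, the kind of file a spreadsheet can open. The page content is a made-up inline string so the example runs without a network connection, and the `TableScraper` class and `scrape_to_csv` function are hypothetical names for this illustration.

```python
import csv
import io
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collects the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Made-up page standing in for a downloaded job listing.
PAGE = """<table>
<tr><th>Role</th><th>City</th></tr>
<tr><td>Data Analyst</td><td>Pune</td></tr>
<tr><td>NLP Engineer</td><td>Mumbai</td></tr>
</table>"""

def scrape_to_csv(html):
    scraper = TableScraper()
    scraper.feed(html)
    scraper.close()
    buffer = io.StringIO()
    csv.writer(buffer).writerows(scraper.rows)
    return scraper.rows, buffer.getvalue()

rows, csv_text = scrape_to_csv(PAGE)
print(csv_text)
```

For a live site, the HTML would first be downloaded, for instance with `urllib.request.urlopen`, and the site's terms of use should be checked before scraping it.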
Among the best scraping tools on the market is “Octoparse”. Octoparse lets us take all the text from a website, download its content, and save it in a structured format such as Excel, HTML, CSV, or a database, all without coding. Its only drawback is that it can’t scrape data from PDFs.
Another tool, “ParseHub”, is a visual extraction tool that can handle interactive maps, calendars and similar elements. It gives you 5 projects and 200 pages per run, which is enough for a student. ParseHub supports Windows, Linux and Mac OS X. Because it offers API access, it is also the friendlier choice for programmers. Both Octoparse and ParseHub can be used for data extraction free of charge.
There is also a commercial data scraping tool called “Content Grabber”. It can extract data from dynamic websites, such as those built with AJAX. It is more suitable for people with good programming skills because it has editing, scripting and debugging interfaces. It uses many third-party tools that aid in data scraping, and it can integrate with Visual Studio 2013 for script editing and debugging.
Some companies have a stable topic domain, while others have multiple, shifting topic domains. Shifting topic domains call for platforms that excel in speed and flexibility of analysis. Some companies might also need the platform to integrate with their customer relationship management (CRM) systems.
There are some important things a company should consider before choosing a tool for data extraction. The analytical methods it uses should identify topics accurately and be flexible enough to accommodate technical terms and local jargon. The tool should work with large and small data sets with the same precision. It should have flexible reporting dashboards that help visualize the text and can be customized for formatting and editing. It should be able to analyze and extract data at various levels of detail. Finally, it should be able to export the results in flexible formats.
“Unstructured Data and the 80 Percent Rule” – Seth Grimes
By Kaustubh Karanje