This post is Part 1 of a two-part series. Read Part 2: Organization-Wide Setup here.
*The docker command marked with an asterisk below was updated to use the most up-to-date container build available (updated 12/17/18).
In recent years, the use of text analysis for social science research has grown rapidly. Scholars have analyzed text to summarize open-ended survey responses, measure content censorship, understand mental health expression on social media, and analyze exposure to ideological diversity, to name a few.
At Urban, we recently combined our Spark for Social Science platform and our internal text analytics service to study government interaction with citizens on social media.
Analyzing text can take a lot of time, but you can use container technology to set up your own advanced text analytics service in minutes. In this post, I’ll show you how.
Analyzing Text Can Be Hard
Yes, analyzing text can be a lot of work.
First, you have to clean the text by putting it into a standard format. Then comes stemming or lemmatizing, which reduces words to their roots (“running” becomes “run”, “was” becomes “be”, “goes” becomes “go”, etc.). Next, you might train or deploy a language model to tag parts of speech (“went” is a VBD, a past-tense verb; “oranges” is an NNS, a plural noun).
Then you might run another model to recognize key entities like people and places (“George Washington” is a name, “Washington, DC” a location), then another model to analyze dependencies (in “Al ran from the dog”, “ran” refers to Al’s action), then another model to tag the sentiment of the text (“I love ice cream!” is positive), and so on.
The good news for our researchers is that they do not have to go through this process every time they want to analyze text. At Urban, our Data Science team deploys a scalable, cloud-based container of Stanford CoreNLP that does it for them.
Enter Stanford CoreNLP
About four years ago, a group at Stanford developed a Java-based tool called CoreNLP, now widely used, which can accomplish most of the necessary advanced text analytics tasks a researcher might need — from the basics of tokenizing and stemming to the more advanced topics of dependency parsing and sentiment analysis — without training your own model for each one. The tool is currently available in seven languages and is continually being updated.
Because it is an open-source, Java-based application, it can be difficult to install. Luckily, if you want to set it up yourself, you don’t need to worry about any of the installation. You just need to download the right container.
The Container Revolution
Think of a container like a self-contained computer that lives on your computer (hear me out). It has its own operating system, files, and programs already installed. When you download a container someone has already made, you’re downloading a fully functional system with all the pieces you need to make that system work already installed, configured, and in the right places.
As you can imagine, this has made container technology, led by the company Docker, immensely popular, because it streamlines the setup process for complex applications like Stanford CoreNLP.
Using It Yourself
If you want to use CoreNLP yourself, you’ll need to install Docker by following the instructions on the Docker website and then download the proper container, which, in the case shown below, is named graham3333/corenlp-complete. Windows 7 users will need to follow a different installation process (Docker is natively supported on Windows 10, Mac OS X, and Linux, but not on Windows 7). Note that you will be prompted to install a few additional services to make Docker work, and you will need an administrator password.
To download and run the container, type the following in your command prompt or terminal. How you open the command prompt or terminal varies by device. Windows 10 users can open the Start menu and type “cmd”, and Windows 7 users can double-click the Docker Quickstart Terminal shortcut on the Desktop. Mac users can type “terminal” into Spotlight search.*
docker run -itd -p 9000:9000 --name corenlp graham3333/corenlp-complete
Accessing the Service to Analyze Text
Now that you’ve run the container, the server should be running and ready to receive requests (you may have to wait a minute or two). To check that it is working on any platform except Windows 7, navigate to http://0.0.0.0:9000 in your browser; if that address does not load, try http://localhost:9000, which reaches the same published port. On Windows 7, navigate to http://192.168.99.100:9000/ instead.
Enter some simple sentences into the interactive interface to test the service and see the different types of analyses available.
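If you would rather script the readiness check than refresh a browser tab, a minimal Python sketch might look like the following. It uses only the standard library, and the helper names `server_is_up` and `wait_for_server`, the `localhost` address, and the retry settings are illustrative assumptions rather than part of CoreNLP or Docker.

```python
import time
import urllib.request


def server_is_up(url, timeout=5):
    """Return True if a plain HTTP GET to the server succeeds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, connection refusals, and socket timeouts are all OSError subclasses.
        return False


def wait_for_server(url="http://localhost:9000", attempts=30, delay=5, probe=server_is_up):
    """Poll the server until it answers or the attempts run out.

    `probe` is injectable so the loop can be exercised without a running container.
    """
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # give the container time to finish starting
    return False
```

Calling `wait_for_server()` after `docker run` should return `True` once the container answers; Windows 7 users would pass `http://192.168.99.100:9000` as the URL.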
The tool is easiest to access in the two dominant open-source programming languages for data analysis: R and Python. Programmatically, you can make a request in R as follows (keep in mind, you’ll need to replace 0.0.0.0 with 192.168.99.100 if you’re in Windows 7):
And in Python like so:
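As a hedged illustration of such a request in Python, the sketch below posts raw text to the server and asks for JSON back, passing the annotator list through the server's `properties` query parameter. It uses the standard library's `urllib` so that nothing extra needs installing, which may differ from the library the original snippet used. The helper names `build_url` and `annotate`, the `localhost` address, and the default annotator list are assumptions chosen for the example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the container started with the docker command above is listening here.
# Windows 7 users would use "http://192.168.99.100:9000" instead.
CORENLP_URL = "http://localhost:9000"


def build_url(annotators, base_url=CORENLP_URL):
    """Return the server URL with the requested annotators encoded in the query string."""
    properties = {"annotators": ",".join(annotators), "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(properties)})
    return f"{base_url}/?{query}"


def annotate(text, annotators=("tokenize", "ssplit", "pos", "lemma"),
             base_url=CORENLP_URL, timeout=120):
    """POST raw text to the CoreNLP server and return the parsed JSON response."""
    request = urllib.request.Request(
        build_url(annotators, base_url),
        data=text.encode("utf-8"),  # the request body is the text to analyze
        method="POST",
    )
    # The generous timeout allows for the slow first call while the models load.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, `annotate("Al ran from the dog.")` should return a dictionary whose `sentences` entry holds the per-token annotations.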
Note that the first time CoreNLP analyzes text, it is initializing, so it will be a bit slow. It is much faster after the first attempt.
In both languages, the analyzed text is returned as JSON, which you will need to parse and convert to a dataframe before working with it. For a list of the available requests and their descriptions, see the CoreNLP Annotators section of the website.
To analyze all your text data and convert to a dataframe format, you can do the following in R (again, keep in mind, you’ll need to replace 0.0.0.0 with 192.168.99.100 if you’re in Windows 7):
And Python:
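One possible way to do the conversion in Python is sketched below: flatten each JSON response into one row per token, then hand the rows to pandas. The function names, the `doc_id` column, and the choice of fields are assumptions made for illustration, and `SAMPLE` is a hand-written, abbreviated stand-in for a server response, not real CoreNLP output.

```python
def tokens_to_rows(doc_id, annotation):
    """Flatten one CoreNLP JSON response into a list with one dict per token."""
    rows = []
    for sentence in annotation.get("sentences", []):
        for token in sentence.get("tokens", []):
            rows.append({
                "doc_id": doc_id,
                "sentence": sentence.get("index"),
                "token": token.get("index"),
                "word": token.get("word"),
                "lemma": token.get("lemma"),  # None unless the lemma annotator ran
                "pos": token.get("pos"),      # None unless the pos annotator ran
            })
    return rows


def to_dataframe(annotations):
    """Build a token-level DataFrame from a {doc_id: CoreNLP response} mapping."""
    import pandas as pd  # imported here so the flattening step works without pandas

    rows = []
    for doc_id, annotation in annotations.items():
        rows.extend(tokens_to_rows(doc_id, annotation))
    return pd.DataFrame(rows)


# Hand-written sample shaped like a CoreNLP response (illustrative, not server output).
SAMPLE = {
    "sentences": [
        {"index": 0, "tokens": [
            {"index": 1, "word": "Al", "lemma": "Al", "pos": "NNP"},
            {"index": 2, "word": "ran", "lemma": "run", "pos": "VBD"},
        ]}
    ]
}

for row in tokens_to_rows("doc-1", SAMPLE):
    print(row)
```

In practice you would collect one response per record in your dataset, keyed by an identifier of your choosing, and pass that mapping to `to_dataframe`.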
Other Options
This is not the only option for conducting natural language processing research. Many others are just as simple, including cloud-provider NLP services available via API (such as Amazon Comprehend, Google’s Cloud Natural Language, and Microsoft’s Text Analytics API) and distributed natural language processing using Apache Spark, to name just a few.
Setting Up Advanced Text Analytics for Your Organization
Having trouble getting Docker to work with Windows 7? Have large datasets and can’t wait for CoreNLP to process all the records? Want to set this service up for others? In Part 2, I dive into how to set up CoreNLP for the whole organization.