Learn to Analyze Text Data in Bash Shell and Linux!
Bash may not be the best way to handle every kind of data. But there often comes a time when you are given a pure Bash environment, such as on a typical Linux-based supercomputer, and you just want an early result or view of the data before you dive into the real programming with Python, R, SQL, SPSS, and so on. Expertise in these data-intensive languages comes at the price of a lot of time spent learning them. In contrast, Bash scripting is simple, easy to learn, and perfect for mining textual data, particularly if you deal with genomics, microarrays, social networks, life sciences, and so on. It can help you quickly sort, search, match, replace, clean, and optimise various aspects of your data, without any tough learning curve.
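As a small taste of the sort-search-replace workflow described above, here is a minimal sketch using made-up gene-expression data (the file name, genes, and values are all hypothetical, invented for illustration):

```shell
# Create a small, made-up CSV of expression values (hypothetical data).
cat > expr.csv <<'EOF'
gene,sample,value
BRCA1,s1,7.2
TP53,s1,3.1
BRCA1,s2,8.4
EGFR,s1,5.0
EOF

# Search: keep only the BRCA1 rows.
grep '^BRCA1,' expr.csv

# Sort: order the data rows (header skipped) by value, highest first.
tail -n +2 expr.csv | sort -t, -k3,3 -nr

# Replace/clean: turn the commas into tabs for easier reading.
tr ',' '\t' < expr.csv | head -n 2
```

Each step is an ordinary command reading and writing plain text, which is exactly what makes Bash pipelines quick to throw together for a first look at a dataset.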
Use of Bash in data mining
A typical practical data-mining workflow starts by importing specific data sources into flat, text-type files. Bash can then run different programs (grep, sort, sed, and so on) on those files to clean and optimise the data, and to extract preliminary views of it (cut, csvlook, view, cat, head, etc.). Another part of data mining involves taking unstructured data and transforming it into structured data, and a scripting language like Bash (together with awk) is very useful for that transformation. Therefore, we strongly believe that learning and using the Bash shell and shell scripting should be your first step if you want to say, Hello Big Data!
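To illustrate the unstructured-to-structured step mentioned above, here is a minimal awk sketch. The log format, file names, and field names are invented for this example, not taken from the book's projects:

```shell
# Made-up, semi-structured log lines (hypothetical key=value format).
cat > visits.log <<'EOF'
user=alice page=/home time=120
user=bob page=/about time=45
user=alice page=/contact time=30
EOF

# awk splits each key=value field and emits one structured CSV row
# per input line, in a fixed column order.
awk '{
    for (i = 1; i <= NF; i++) {
        split($i, kv, "=")          # kv[1] is the key, kv[2] the value
        row[kv[1]] = kv[2]
    }
    print row["user"] "," row["page"] "," row["time"]
}' visits.log > visits.csv

cat visits.csv
```

The resulting visits.csv is a regular flat file that the rest of the pipeline (sort, cut, grep, and friends) can work on directly.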
This book starts with some practical bash-based flat file data mining projects involving:
- University ranking data [Previews: Part I, Part II, Part III]
- Facebook data [Previews: Part I, Part II]
- Crime data
- Shakespeare-era plays and poems data
If you haven’t used Bash before, feel free to skip the projects for now and go straight to the tutorials. Read the tutorials and then come back to the projects. The tutorial section introduces Bash scripting, regular expressions, AWK, sed, grep, and so on.
Finally, the book gives you a concise, beginner-friendly guide to the Big Data landscape, including an overview of critical Big Data tools such as HDFS, MapReduce, YARN, Flume, Hive, and more. It finishes with a near-complete list of references to all the relevant command-line and Big Data tools.
- Don’t forget to get the animated video course too!
Also available at
- Udemy (video lectures)
- Educative (videos with code playgrounds) and
- Leanpub.com (PDF/ electronic book)
About the author
Ahmed Arefin, PhD, is an enthusiastic computer programmer with more than a decade of well-rounded computational experience. Following his PhD (Computer Science) and postdoctoral (Parallel Data Mining) research, he moved on to become a scientific computing professional, keeping up his research interests in parallel, distributed, and accelerated computing. He loves to code, research, write, and teach.