Daniel Adanza Dopazo

MACHINE LEARNING ON BIG DATA USING MONGODB, R AND HADOOP

Master thesis

Maribor, December 2016



STROJNO UČENJE NA VELIKIH PODATKIH Z UPORABO MONGODB, R IN HADOOP

MACHINE LEARNING ON BIG DATA USING MONGODB, R AND HADOOP

Master thesis (Magistrsko delo)

Student: Daniel Adanza Dopazo
Study programme: 2nd-cycle study programme Informatics and Communication Technologies
Supervisor: Prof. Dr. Vili Podgorelec
Proofreader:


Strojno učenje na velikih podatkih z uporabo MongoDB, R in Hadoop

Key words: big data, machine learning, data analysis

UDK: 004.8:004.65(043.2)

Povzetek

The central purpose of this master thesis is to test different approaches and to carry out several experiments on various datasets deployed on a big data processing infrastructure. To achieve this goal, the thesis is structured in three main parts. First, we obtained ten publicly available datasets from different domains which are complex enough (in terms of data volume and number of attributes) to carry out big data analysis in a proper way. The collected data was first pre-processed to make it compatible with the MongoDB database. In the second part, we analysed the collected data and performed various experiments with the R tool, which supports statistical data processing; in doing so, R was connected to the MongoDB database. In the last part, we used the Hadoop framework, with which we completed the planned infrastructure for processing and analysing big data. For the purpose of this master thesis, we set the system up in single-node cluster mode. We analysed the differences from the point of view of the performance of the established infrastructure, and concluded the work with a discussion of the advantages and disadvantages of using the presented big data technologies.


Machine learning on big data using MongoDB, R and Hadoop

Key words: big data, machine learning, data analysis

UDK: 004.8:004.65(043.2)

Abstract

The main purpose of this master thesis is to test different approaches and perform several experiments on different datasets, deployed on a big data infrastructure. In order to achieve that goal, we will structure the thesis in three parts. First of all, we will obtain ten publicly available datasets from different domains, which are complex enough (in terms of size and number of attributes) to perform big data analysis in the proper way. Once they are gathered, we will pre-process them in order to make them compatible with the MongoDB database. Secondly, we will analyse the data and perform various experiments using the R statistical and data analysis tool, which will in turn be linked to the MongoDB database. Finally, we will use Hadoop for deploying this structure on big data; for the purpose of this master thesis, we will use it in single-node cluster mode. We will analyse the differences from the performance point of view and discuss the advantages and disadvantages of using the presented big data technologies.


CONTENTS

1 INTRODUCTION
2 MASTERING THE DATABASES
  2.1 Installing and configuring MongoDB server
  2.2 Learning mongo commands
  2.3 Preparation of the datasets
3 MASTERING THE DATA MINING
  3.1 Installing and configuring R
  3.2 Learning to use R with its library rmongodb
4 MASTERING THE HADOOP
  4.1 What is big data and Hadoop?
  4.2 Installing and configuring Hadoop
  4.3 Deployment of R algorithms in Hadoop
5 PERFORMING THE EXPERIMENTS AND ANALYZING THE RESULTS
6 APPLYING MACHINE LEARNING ALGORITHMS
7 CONCLUSION
REFERENCES


LIST OF FIGURES

Picture 2.1: Files contained in the MongoDB server version 3.0 folder
Picture 2.2: Files contained in the bin folder of the MongoDB server
Picture 2.3: Practical example of the usage of Mongo
Picture 2.4: Practical example of Mongo running in command mode
Picture 2.5: Window in the advanced configuration for the operating system Windows 8
Picture 2.6: Window with the environment variables already configured
Picture 2.7: Connecting with the hockey dataset in Mongo
Picture 2.8: Inserting data into our players dataset
Picture 2.9: Output result after including a new row into our players dataset
Picture 2.10: Accessing collections in Mongo
Picture 2.11: Showing the differences after applying the »pretty« function
Picture 2.12: Practical example of removing one row in Mongo
Picture 2.13: Practical example of updating a collection
Picture 2.14: Practical example of dropping a collection
Picture 2.15: Sample query in Mongo
Picture 2.16: Executing a query in our players dataset
Picture 2.17: Executing a different query in our players dataset
Picture 2.18: Getting indexes in the players dataset
Picture 2.19: Showing all implemented datasets in Mongo
Picture 2.20: Showing the WEKA interface
Picture 2.21: Showing the WEKA interface
Picture 2.22: Showing all attributes of the arrhythmia dataset
Picture 2.23: Showing the attributes of the diabetic dataset
Picture 2.24: Showing the attributes of the letter dataset
Picture 2.25: Showing the attributes of the nursery dataset
Picture 2.26: Showing the attributes of the splice dataset
Picture 2.27: Showing the attributes of the student dataset
Picture 2.28: Showing the attributes of the tumor dataset
Picture 2.29: Showing the attributes of the waveform dataset
Picture 2.30: Showing the attributes of the cmu dataset
Picture 2.31: Showing the attributes of the kddcup dataset
Picture 3.1: Official documentation of R
Picture 3.2: Installing the package »rmongodb« in R
Picture 3.3: Output result after installing »rmongodb«
Picture 3.4: Connecting with Mongo datasets from the R IDE
Picture 3.5: Accessing databases and collections from R
Picture 3.6: Executing some queries with our sample data from R
Picture 3.7: Executing some queries from R
Picture 3.8: Graphics showing the results after executing some queries
Picture 3.9: Executing the »count« function and the »head« function
Picture 3.10: Converting to BSON format in R
Picture 3.11: Output result after executing queries on our sample data in R
Picture 3.12: Sample using the »count« function in »rmongodb«
Picture 3.13: Executing some experiments in R
Picture 3.14: Graphic showing the output results of the experiment
Picture 3.15: Graphic with bars showing the output results of the experiment
Picture 4.1: Window showing the environment variables
Picture 4.2: Command window showing the current Java version installed on my computer
Picture 4.3: Output result showing the Hadoop version installed
Picture 4.4: Output results after executing the »hdfs namenode -format« command
Picture 4.5: Environment variables of my computer
Picture 4.6: Output results after running Hadoop
Picture 4.7: Output results after executing »yarn«
Picture 4.8: Initial page after running Hadoop
Picture 4.9: Initial page showing the cluster configuration of Hadoop
Picture 4.10: Initial configuration of Hadoop in R
Picture 4.11: Executing »mapreduce«
Picture 4.12: Basic usage of the »rhdfs« library
Picture 4.13: Unserializing data with »rhdfs«
Picture 4.14: Executing different »hdfs« commands
Picture 4.15: Using »mapreduce« in Hadoop
Picture 4.16: Practical example using »mapreduce«
Picture 4.17: Graphic showing the output result of the previous example
Picture 4.18: Set of commands processing data with »rhadoop«
Picture 4.19: Final results after applying »rmr«
Picture 5.1: Results for the experiments of the arrhythmia dataset
Picture 5.2: More results about the arrhythmia dataset
Picture 5.3: Results for the experiments of the cmu dataset
Picture 5.4: Results for the experiments of the diabetic dataset
Picture 5.5: Results for the experiments of the tumor dataset
Picture 5.6: More results for the tumor dataset
Picture 5.7: Results for the experiments of the kddcup dataset
Picture 5.8: Results for the experiments of the letter dataset
Picture 5.9: Results for the experiments of the nursery dataset
Picture 5.10: Results for the experiments of the splice dataset
Picture 5.11: Results for the experiments of the waveform dataset
Picture 5.12: Results for the experiments of the students dataset
Picture 5.13: More results for the experiments of the students dataset
Picture 6.1: Applying machine learning algorithms in R
Picture 6.2: Showing the main features of the iris dataset
Picture 6.3: Summary of the iris dataset
Picture 6.4: Applying machine learning algorithms to the iris data
Picture 6.5: Splitting the iris dataset into training and test sets
Picture 6.6: Final results after applying machine learning on the iris dataset
Picture 6.7: Applying the regression tree algorithm to the letter dataset
Picture 6.8: Applying the regression tree algorithm to the arrhythmia dataset
Picture 6.9: Applying the regression tree algorithm to the diabetic dataset
Picture 6.10: Applying the regression tree algorithm to the kddcup dataset
Picture 6.11: Applying the regression tree algorithm to the nursery dataset
Picture 6.12: Applying the regression tree algorithm to the splice dataset
Picture 6.13: Applying the regression tree algorithm to the student dataset
Picture 6.14: Applying the regression tree algorithm to the tumor dataset
Picture 6.15: Applying the regression tree algorithm to the waveform dataset


1 INTRODUCTION

Context

At the beginning I would like to make a brief comment about the context of this master thesis, starting with data mining. Data mining is a subfield of computer science consisting of the process of discovering patterns in huge data sets; hence it includes methods from the intersection of different fields, such as artificial intelligence, machine learning and statistics. I would like to point out that the main purpose of data mining is the extraction of information from a data set and its transformation into another structure that can be used for further purposes.

Main purpose of the master thesis

The main purpose of the master thesis is to analyse the different relationships within the attributes of a database and to extract some additional information out of them. In order to do that, it will be necessary to use different tools: one for storing data in the database (MongoDB), one for analysing the stored data (the R IDE), and finally one for learning how to deploy big data processing, in our case Hadoop.

Brief description of the content

The first part of the thesis is dedicated to MongoDB. Here I will install the tool, learn to use it and load the different datasets that we are going to use for the project. The second part of the report is about data mining. There we will install, configure and learn how to use the R IDE with the libraries necessary to connect it with our datasets. In the third part of the thesis we will talk about big data and we will use Hadoop. We will also learn how to connect this tool with R and Mongo.

Finally, in the remaining parts I am going to analyse different data, perform different experiments, apply different machine learning algorithms and draw some inferences about big data and the obtained results.

Aims

Even though we will work on different databases, trying to make some inferences about the data and obtain different statistics, the real goal of the project is the application of the previously mentioned technologies and tools that allow us to work with big data, and to demonstrate that they can be quite useful and applicable to a variety of situations. Therefore the main aim of the study is the application of different technologies that allow us to work on big data.

Objectives

These are the other aims of the project:

- application of a database with a NoSQL datastore, as in the case of MongoDB
- application and usage of different machine learning algorithms, as provided by R
- deployment of a big data infrastructure, as in the case of Hadoop
- deployment of some machine learning algorithms from R on Hadoop, as well as setting up MongoDB for storing the datasets, and then using the algorithms from R deployed on Hadoop in order to learn from the data in the MongoDB datasets
- performing different experiments on the selected datasets

Assumptions and limitations

The main purpose of the research is not to get information about the databases that I have taken as examples, but to get familiar with and try out different technologies that allow us to analyse a big quantity of data. I would also like to mention the different assumptions and shortcomings of the research. We should always keep in mind that our inferences are based on a sample of data; hence the numbers could be slightly different from the numbers we would get by analysing other sources. Nevertheless, it is always good to make some inferences and a good estimation about the different features.

2 MASTERING THE DATABASES

In this section I am going to describe everything that is necessary for the database, including the description of the necessary tools and the steps that I took in order to install them. Furthermore, I am going to include a simple guide to the basic commands that will allow us to inspect the information inside the database and manipulate it. In order to achieve this goal, we will use a NoSQL database, a type of database based on JSON-like documents that is somewhat different from the typical SQL databases we have usually used in our projects. The tool that we will use to handle this type of database is called Mongo [1], a cross-platform tool released under the GNU Affero General Public License that works through the terminal in command mode.

2.1 Installing and configuring MongoDB server

The first step necessary to accomplish the final goal of the project is to install and configure the required tools. Mongo [2] is an open-source document database that provides high performance, high availability and automatic scaling. It stores data structures composed of field and value pairs; its documents are quite similar to JSON objects. The first thing that we need to do is to go to its official web page [3] and download the executable file, in my case the one specific for the operating system Windows 8 (64-bit architecture). Right after finishing the installation we will see the files shown in the screenshot below, in the default path: C:/Program Files/MongoDB/Server/3.0/

Picture 2.1: Files contained in the MongoDB server version 3.0 folder.

And this is what we can see in the bin directory right after installing MongoDB:

Picture 2.2: Files contained in the bin folder of the MongoDB server.

In order to configure MongoDB for the very first time, it is necessary to open the terminal cmd.exe and type the following commands in order to move to the necessary folder.
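A sketch of these commands (assuming the default installation path mentioned above; the screenshots below show the real session):

cd "C:\Program Files\MongoDB\Server\3.0\bin"
mongod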

Picture 2.3: Practical example of the usage of Mongo.

As you can see in the image above, we have used the command cd to change to the correct directory; once there, we can execute the command mongod. At first we will see that it does not find its data directory and does not work as we might have expected. The reason for this is that we need to create the folder with the command mkdir \data\db. After creating this directory, we will see that our tool works well.

Picture 2.4: Practical example of Mongo running in command mode.

Here we have a practical demonstration of the MongoDB tool working well. In one terminal we used the mongod command, and in the other terminal we used the mongo command, which initiates a dialog. Finally, I would like to make an additional step that is not strictly necessary but facilitates things when starting the Mongo tool. This step consists of configuring the PATH in our environment variables, for which we need to enter the following location on our computer: Control Panel > All Control Panel Items > System > Advanced system settings

Picture 2.5: Window in the advanced configuration for the operating system Windows 8.

After that we will see something similar to this window. After clicking on the advanced system configuration we will also need to enter System properties > Advanced settings > Environment variables.

Picture 2.6: Window with the environment variables already configured.

In the environment variables we will need to extend the variable named PATH with the value C:/Program Files/MongoDB/Server/3.0/bin. Finally, I would like to say that this step is very useful because from now on it will not be necessary to enter that hard-to-remember path in order to start Mongo: it is enough to just open the terminal and type the word mongod.

2.2 Learning mongo commands

Installing the necessary tools is not enough if we want to accomplish our final goal, so the next step is to create a sample database and fill it with some unimportant sample data. In the next paragraphs I will also explain in greater detail how MongoDB works and which functions its different commands provide. As we saw in the previous step, we will need to open two terminals. In one of them we will type the command mongod, so this window will act as the server. In the other window we will type the command mongo, so this window will act as the client that connects with the server running on the local host. It is in this second window that we will need to type the different commands necessary for inserting, deleting and modifying information.

STEP 1: BASIC DATABASE COMMANDS

There are different commands that allow us to manipulate the databases. Fortunately, they are quite straightforward to use (note that the shell is case-sensitive, so all of them are lowercase). Out of all of them we will highlight the following:

db: tells you which database you are using at the moment (by default you would be using the database named test).

use <database name>: switches the database that you intend to use. For example, if you don't want to use the database local anymore and you prefer to use the database named hockey, you can type use hockey. I would also like to point out that, by default, if no database is found under that name, one will automatically be created.

db.dropDatabase(): this command will erase the database that you are currently using. In order to facilitate your work, the terminal will provide feedback explaining whether everything went as expected or not.

Finally, I would like to demonstrate all that I have mentioned by providing a screenshot of my own terminal, where I used these commands, and its expected output:

Picture 2.7: Connecting with the hockey dataset in Mongo
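For reference, the same session in plain text might look as follows (a minimal sketch; the exact messages depend on the server version):

> db
test
> use hockey
switched to db hockey
> db.dropDatabase()
{ "dropped" : "hockey", "ok" : 1 }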

STEP 2: INSERTING JSON OBJECTS INTO COLLECTIONS

So far we have seen how to create, delete and switch between databases, but we have not yet looked at manipulating the data inside them. For that it is necessary to mention that the Mongo database organizes its information into collections. These collections are analogous to tables with different rows, if we compare them with a normal SQL database; in truth, their syntax is more similar to collections in Java, where we can insert one object that will be added to the overall collection.

db.collectionName.insert(<data that you want to insert>): in this command the word db refers to the database that you are currently using; in our example we are using the database named hockey. collectionName refers to the collection name and, by default, if no collection is found under that name, one will be created. Finally, you need to pass the data that you want to add to the database as a parameter, in JSON format.

Here we have a practical demonstration of the previously explained command. The database is hockey and the collection is players. In the last line we can see that the information has been added successfully.

Picture 2.8: Inserting data into our players dataset.
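In plain text, such an insert looks like the following (a minimal sketch with a hypothetical player; the shell acknowledges the operation with a WriteResult):

db.players.insert( { "name" : "John Doe", "position" : "center", "age" : 24 } )
WriteResult({ "nInserted" : 1 })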

Finally, I would like to mention that this only works with one single JSON object. If we would like to insert more than one object, we need to create an array and separate the different JSON objects with commas. The basic structure would be something like this:

db.collection.insert( [ {JSON object 1}, {JSON object 2}, ... ] )

I would also like to mention that, when introducing more than one JSON object, the output result is quite different than with a single one.

Picture 2.9: Output result after including a new row into our players dataset.

As we can see in the previous image, the tool gives us a lot of different information, such as the number of inserted rows, the number of modified ones and whether there were any errors.

STEP 3: DELVING MORE DEEPLY INTO COLLECTIONS

In order to operate on the collections and see information about them, we have the following commands:

show collections: this command is self-explanatory. It shows all the collections that we have created in the database that we are using. By default, it will also include one collection named system.indexes. Here we can see a practical demonstration of it.
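As a plain-text sketch of what the screenshot shows (assuming only our players collection has been created so far):

> show collections
players
system.indexes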

Picture 2.10: Accessing collections in Mongo.

db.collection.find(): this command allows us to see all the data inside the named collection. One of its main disadvantages is that the view of the data is very compact; nevertheless, this can be remedied with the function pretty(). We can see the difference between both in the following image:

Picture 2.11: Showing the differences after applying the »pretty« function.

db.collection.findOne(): it works exactly like find().pretty(), with the main difference that it only shows the first JSON object of the collection.

db.collection.remove( { id of the object } ): this command removes only one row from the overall collection. In order to distinguish this object from the others, it is necessary to provide its identifier.

Picture 2.12: Practical example of removing one row in Mongo.

db.collection.update({identifier}, {new object}): this function allows you to find an object inside the named collection and update it with the information that you want. As always, I will provide a practical demonstration of it.

Picture 2.13: Practical example of updating a collection.
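One detail worth noting (not visible in the screenshot): if the second argument is a plain document, update() replaces the whole object. To change only selected fields and keep the rest, MongoDB provides the $set operator; a minimal sketch with a hypothetical player:

db.players.update( { "name" : "John Doe" }, { $set : { "age" : 25 } } )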

db.collection.drop(): it eliminates the named collection completely, which includes all the data it has inside and the collection itself.

Picture 2.14: Practical example of dropping a collection.

STEP 4: HOW TO MAKE QUERIES ON THE COLLECTIONS

db.collection.find/findOne( { "parameter" : value } ): we have already seen these commands when we wanted to get the entire output of a collection; still, we can use them with two conditions at the same time. As a practical example, let's show all the players that fulfil two conditions: first, the position is defenseman, and second, the age is twenty-one.

Picture 2.15: Sample query in Mongo.
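The query in the screenshot has this general shape (a sketch; listing both conditions in one document combines them with a logical AND):

db.players.find( { "position" : "defenseman", "age" : 21 } ).pretty()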

$or: [condition1, condition2]: lastly, I would like to talk about another type of query. Sometimes we don't want to be that strict, and we would like to show all of the rows that fulfil either one condition or another. For those cases we need to introduce the operator $or, which is followed by an array; each element of the array is an object, and the objects, separated by commas, state which conditions can be accepted. As always, the best way of understanding it is a practical example with our previously mentioned collection players. In the next image we can see how the find() and pretty() functions are combined with the operator $or in order to get what we want, which is in this case to show all those players that play in the position of left wing or right wing.

Picture 2.16: Executing a query in our players dataset

Other operators for comparison: so far we saw a lot of different types of queries combining the logical operators OR and AND. Nevertheless, we can delve deeper by showing other operators that allow us to make numerical comparisons, for example $gt: value. This expression is used if we want to establish the condition that a chosen parameter should be greater than the specified value. As a practical example, this is the query that we need to type if we want to show all the players whose age is greater than 30:

db.players.find( { "age" : {$gt:30}} ).pretty()

Furthermore, I would also like to mention the operator $gte, which works exactly the same, with the difference that it will show all the rows where the age is greater than or equal to 30:

db.players.find( { "age" : {$gte:30}} ).pretty()

Finally, there are another two complementary operators: $lt for showing all the rows where the value is lower than the specified one:

db.players.find( { "age" : {$lt:30}} ).pretty()

As you can guess, we also have $lte in order to show all the rows where the value is lower than or equal to the specified one, which is in this case age 30:

db.players.find( { "age" : {$lte:30}} ).pretty()

At the end I would like to mention another operator, $ne. In our practical example its function would be to show all those players whose age is NOT EQUAL to 30; for example, 29 would be accepted, as well as 31.

db.players.find ( { "age" : {$ne:30}} ).pretty()

In addition to all that, I would like to mention that we have not customized the queries as much as we could. For example, in all our previous cases we receive all the information from the matching rows; the counterpart in SQL syntax would always be SELECT * FROM. If we want to show, for example, just the name, we need to specify it with a projection, according to the following syntax: { "parameter" : 1/0 }. The number one or zero indicates whether you want to show the parameter or not. Finally, in order to clear some things up, I would like to mention the default behaviour: as we saw in the previous queries, the Mongo tool shows all the parameters contained in the collection, but once you start specifying a projection, the default values change and it will only show the specified values plus the _id, which is shown by default; you need to state explicitly that you don't want to see it, if that is the case. I am going to show you a practical example using our previous players collection: a query returning all the players that play in the center position, where out of all their information I only want to see their names and not their id:

Picture 2.17: Executing a different query in our players dataset.

In order to end this step, I would like to mention another two functions that, while not essential, can be helpful in some contexts. One of the functions is limit(number). This function returns only the first given number of rows and ignores the rest. For example, if in our previous query we want to show just the first three rows, the query would look something like this:

db.players.find( {"position":"center"},{"name":1, _id:0} ).limit(3)

Another complementary function is skip(number). In this case it does not work exactly like in SQL queries: it will not show a certain number of last rows. What it does is ignore the first given number of rows and show the rest. For example, if there are ten different rows and we use the skip(3) function, it will show the last seven rows, ignoring the first three. Here is a practical example:

db.players.find( {"position":"center"},{"name":1, _id:0} ).skip(3)

STEP 5: USING INDEXES

In order to understand why it is useful to use indexes in this type of database, it is necessary to have an overview of how the Mongo tool works inside. When you are making a

query with one condition, the actions that will be executed are: looping over each row in the collection and checking whether each single row fulfils the condition or not; if the condition is fulfilled, its information is printed out. The best way of understanding the procedure is, as always, with an example: in our previous collection, which contains around twenty different players, let's suppose we want to show all of those whose age is lower than twenty-one. The Mongo tool will go one by one, checking if each player's age is lower than twenty-one. Because it is only a sample database and there is only a very limited number of rows, we can see the result in milliseconds. But what happens if for the project I want to use a database with one hundred thousand customers? In this case the performance will be very poor and it will take a lot of time until I can see the first results. For these cases it is quite useful to know how to use indexes, because we will see a huge difference in performance on a real and sizeable database.

I would also like to point out that for the end user it is very difficult to appreciate exactly how much time is saved by using indexes. But there is a way of seeing how much time it takes to execute a query: for that you need to append the function .explain("executionStats") at the end.

So, once we have seen why indexes are used and in which types of databases we should use them, we will continue with the main commands for manipulating them:

db.collection.ensureIndex({parameter:1}): it creates an index on the previously specified parameter.

db.collection.getIndexes(): here we can see an output of all the indexes that we have created for the previously specified collection.

db.collection.dropIndex({parameter:1}): as you can possibly guess, it drops the previously created index.

The best way of understanding what I say is with a practical demonstration using the sample collection players:
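A sketch of such a session (the index is created on age, the timing check uses the explain call mentioned above, and the index is dropped again at the end):

db.players.ensureIndex( { "age" : 1 } )
db.players.getIndexes()
db.players.find( { "age" : { $lt : 21 } } ).explain("executionStats")
db.players.dropIndex( { "age" : 1 } )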

Picture 2.18: Getting indexes in the players dataset.

Finally, I would just like to mention that in order to achieve the best performance we need to use indexes in a sensible way. For example, a quite common mistake is to create an index for each single parameter in the collection: each index that you create has to be maintained, which downgrades the write performance of that collection. Hence it is recommended to use indexes only on the parameters that you will use a lot, like for example the name, the age and the player position, and to leave out the other unimportant parameters that you will barely use in queries. What's more, another disadvantage to count on is that every time you update the collection, its associated indexes also have to be updated.

STEP 6: USING GROUPS AND AGGREGATION

The last topic that I would like to cover when talking about MongoDB commands is groups. If we want to group the rows depending on an exact parameter, we can use the operator $group, although I would like to mention that this operator has to be combined with other operators.

For example, let's write a query for our previous example where we want to group all the players depending on their position and, for each different group, sum the number of those who play that role. Then we need to make a query with the following syntax, combining $group with the operator $sum:

db.players.aggregate( { $group : { _id : "$position", total : {$sum : 1} } } )

At other times, when making groups, we need to get the average value of a different parameter. For example, let's suppose that we want to get the average age of all the players, grouping them by position. Then we need to use the operator $avg in the following way:

db.players.aggregate( { $group : { _id : "$position", avgage : {$avg : "$age"} } } )

In order to end this section, I would like to mention the self-explanatory operators $min and $max, which show the biggest and the lowest value of each group. The way of using both is exactly the same as for the previous group operators:

db.players.aggregate( { $group : { _id : "$position", maxage : {$max : "$age"} } } )

db.players.aggregate( { $group : { _id : "$position", minage : {$min : "$age"} } } )

2.3 Preparation of the datasets

For the purpose of this master thesis I am going to run different experiments on ten different datasets, which are taken as examples in order to show the deployment of big data. In this part of the report I am going to mention each single one of them, and I will also summarize the procedure that I followed in order to include them in Mongo. In addition

to that, I will also show different screenshots in order to demonstrate their correct inclusion:

- Arrhythmia
- Cmu newsgroup clean 1000 sanitized
- Diabetic data
- Tumor data
- Kddcup99
- Letter data
- Nursery data
- Splice data
- Wave form data
- High school students data

Picture 2.19: Showing all implemented datasets in Mongo.

ADDING ALL DATASETS TO MONGO

One of the main problems that I found when converting the databases to Mongo was the format of the data. All datasets were in .arff format, so they had to be parsed to JSON in order to be readable by MongoDB. The solution that I found for this problem was the following. First of all, I downloaded and installed the WEKA tool, version 3.6. WEKA is one of the most popular suites of machine learning software; it is developed by the University of Waikato, is free software, and is written in Java. The image below shows how I opened one of the sample databases, about arrhythmia, in the previously mentioned tool.

Picture 2.20: Showing the WEKA interface.

One of the many functions that this tool gives you is the conversion of files from the arff format into csv. Because there is no direct conversion between JSON and arff, I found two different solutions for handling it. The first solution worked with almost all the different datasets except the cmu sample data, and it consists of parsing them first into the .csv format and then parsing them into an array in a JSON-readable format, in order to successfully include them in the database. Right after the first step I used another tool named JSON Buddy, desktop application version 3.3. By using this tool I managed to convert the files into the final format without any trouble, preserving the structure and the content as they were before.

Picture 2.21: Showing the WEKA interface.

Unfortunately, the first solution didn't work with all my datasets: there is another dataset, named CMU, which is very large, and because of its size it could not be parsed in the JSON Buddy application. So the second solution consists of first parsing the data into csv, like in the first solution, and after that using the following command to import this data directly into Mongo:

mongoimport --host localhost --port 27017 --db cmu --collection data --type csv --headerline --file ./cmu.csv -j 256

For importing into Mongo you have to specify the host and the port, which are in this case those of the local server (the defaults, localhost and 27017). After that it is necessary to specify the name of the database and the name of the collection that will be created. Finally, we state that we have a header line that specifies the names of the fields, give the path of the file, and add one last option, -j 256, which solved several CPU processing problems for me. As a final result I managed to have the ten different databases imported and ready to work with in the Mongo tool.

DESCRIPTION OF THE DATASETS

In order to give an overview of the content, I am going to show a summary of all the different datasets that I will use as examples for applying the different big data technologies. The first dataset consists of 452 records with different attributes about different people.

Picture 2.22: Showing all attributes of the arrhythmia dataset.

The second database gathers different data about diabetic people. In this case we have a much smaller database, but each row has a much wider set of attributes.

Picture 2.23: Showing the attributes of the diabetic dataset.

Finally, I am going to show a summary of the remaining datasets: letters, nursery, splice, high school students, tumor, waveform and cmu:

Picture 2.24: Showing the attributes of the letter dataset.

Picture 2.25: Showing the attributes of the nursery dataset.

Picture 2.26: Showing the attributes of the splice dataset.

Picture 2.27: Showing the attributes of the student dataset.

Picture 2.28: Showing the attributes of the tumor dataset.

Picture 2.29: Showing the attributes of the waveform dataset.

Picture 2.30: Showing the attributes of the cmu dataset.

Picture 2.31: Showing the attributes of the kddcup dataset.

3 MASTERING THE DATA MINING

So far we have seen everything that concerns the databases, for which we learned how to use the Mongo database tool. Unfortunately, this is not enough to achieve the final goal of the project. The second step we need to make is to analyse all of this data, which we have previously loaded on our local host, with a quite popular data mining tool named R [4]. With the help of this tool we will be able to analyse our data in a much better way and perform operations that we cannot do with our Mongo tool alone.

3.1 Installing and configuring R

So far we saw how to install, create, manipulate and master the NoSQL database using the Mongo tool. Unfortunately, this is just one part of the entire task. The next step we need to make in order to accomplish our goal for this master thesis is to analyse the data contained in the database with a data mining tool, and to be able to run operations from there. For this purpose, we will use a quite famous tool named R.

WHAT IS R?

R [5] consists of a software environment for statistical computing and the associated programming language that this environment supports. The software environment has been developed since 1993 and has been in continuous development until today, the latest stable version dating from April 16th. It has been developed by the R Development Core Team, and its paradigm covers the array, object-oriented, imperative, functional, procedural and reflective areas. On the other hand, the programming language is very well known and used among data miners and developers, and it is also widely used in other areas like statistics, data analysis, polls, surveys and studies of scholarly literature databases. The whole R project has been released under the GNU General Public License. Its source code has been written primarily in C, Fortran and R.

Hence R is available for free, with different versions available depending on the operating system. I would like to point out that even though this tool supports some graphical front-ends, it works primarily through a command line interface, allowing the user in this way to work faster, to not waste resources, and to be able to run it on all machines regardless of their characteristics.

STATISTICAL AND PROGRAMMING FEATURES

R can be combined with a big number of supported libraries. Together they are able to implement a big variety of statistical and graphical techniques. Of all of them we can highlight the classical statistical tests [6], linear and nonlinear modelling, classification, clustering and so on. The reason for this number of libraries is the fact that R is very easily extensible through functions and extensions that can even be written in its own language. About the programming features of the R language, I would like to highlight that it is an interpreted language that supports matrix arithmetic and many data structures, like vectors, arrays, matrices, data frames and lists. Moreover, R supports procedural programming with functions, and it also supports object-oriented programming with some generic functions.

INSTALLING R

The best way of getting the R tool is to go to its official page [7]. Here we will see a quite plain web page with different download links depending on our operating system. To clear it up, I will show an image of the official web page where you can download it.

Picture 3.1: Official documentation of R.

After choosing Windows I downloaded the file corresponding to the latest version at that moment. The operating system that I currently have is Windows 7, and after following the straightforward steps of the installer, I managed to install the 64-bit version. I am not going to delve very deeply into the R interface, because we already saw how it works in other subjects at the university and that would be redundant. What's more, I have already explained the different areas that the programming language covers.

INSTALLING THE RMONGODB LIBRARY

So far I have installed the main R tool and explained how it works, but if we want to connect it with our Mongo database, which we talked about in the previous sections, this is not enough. To accomplish this part, we need to install one library, named rmongodb, which will help us connect both tools. The way of installing this library is quite straightforward and the same as for any other library in R: we just have to run the command install.packages("rmongodb"). To clear it up, I am providing an image of the R interface with its output results at the beginning of the installation:

Picture 3.2: Installing the package »rmongodb« in R.

As we can appreciate in the previous image, the installation dialog can be displayed in different languages. Because it is easier for me, I decided to proceed in Spanish, but the functionality is the same regardless of the language. Right after that we will be able to see how all the necessary files and packages are downloaded successfully, and we will see a final message informing us that everything went as expected. To clear it up, I will show you a screenshot of my computer right at the moment when the installation was successfully completed:

Picture 3.3: Output result after installing »rmongodb«.

With the steps that I have explained in this part, we have installed the R tool with its necessary library. I would just like to mention that there is an alternative way of getting the library: the install.packages command installs the latest stable version released for the library, but alternatively we can also install the latest development version from the GitHub repository. In that case it is necessary to run the following commands:

library(devtools)
install_github(repo = "mongosoup/rmongodb")

3.2 Learning to use R with its library rmongodb

In order to connect our R data mining tool with our Mongo database, we need a couple of requirements and we need to write a couple of lines in command mode:

STEP 1: CONNECTING MONGODB TO R

As previously mentioned, we first need to do a couple of actions in order to connect both tools:

- Install and run our Mongo database externally, outside of the R tool. To do this we need to type mongod in our command window.
- Run the R tool and load the library rmongodb that we installed in the previous section. To do that we just need to type the following command: library(rmongodb)

Next, I will present a theoretical explanation of the basic rmongodb commands:

mongo.create(): with this function we are able to connect to a MongoDB server, which can be local or remote, and it returns an object of a class named mongo. This object can be used for further communication over the connection.

mongo.is.connected(variable): this function is used to check whether the variable is properly connected to the MongoDB server or not. If it is connected it returns TRUE, otherwise it prints FALSE.

Variable of class mongo: if you type the name of a variable of the mongo class, it prints all the basic parameters attached to it, like the host or username.

Following the same scheme as always, having given the theoretical explanation, I would like to clear it up by showing an image of the commands used in a practical way:
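A minimal sketch of the same session in plain text (assuming a local mongod running on the default port; the screenshot below shows the actual output):

library(rmongodb)
mongo <- mongo.create()     # connect to the local server
mongo.is.connected(mongo)   # returns TRUE if the connection succeeded
mongo                       # printing the object shows the connection parameters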

Picture 3.4: Connecting with Mongo datasets from the R IDE.

STEP 2: GETTING BASIC INFORMATION ABOUT THE DATABASE AND THE COLLECTIONS

In order to get the list of all the databases or collections in Mongo, we need to learn these quite straightforward commands:

mongo.get.databases(variable): by typing this command we get the list of all databases found by the given object, which has to be of the mongo class.

mongo.get.database.collections(mongo, db): on the other hand, if what we want to get is the list of all collections, we have to type this second line of code.

mongo.count(mongo, coll): as its name implies, it counts the number of elements in the previously specified collection.

I would also like to point out that the best way of accessing the information in the database is to first check whether the variable has been correctly connected and, only if it has, to access the information, doing nothing otherwise. So, as you can guess, you need to write a couple more lines of code, and the final result should look something like this:

if(mongo.is.connected(mongo) == TRUE) {
  mongo.get.databases(mongo)
}

if(mongo.is.connected(mongo) == TRUE) {
  db <- "hockey"
  mongo.get.database.collections(mongo, db)
}

if(mongo.is.connected(mongo) == TRUE) {
  coll <- "hockey.players"
  mongo.count(mongo, coll)
}

Finally, I will provide an image with a practical use of both functions when connecting R with the Mongo databases on my local host. We can see we still have the database hockey that we used as an example in the previous sections:

Picture 3.5: Accessing databases and collections from R.

STEP 3: FINDING SOME DATA

So far we got just basic information, but now we will delve deeper into the advanced options in order to retrieve the selected information that we want to get. For that we need to learn the following commands:

mongo.find.one(mongo, coll): this command finds the first record inside the previously specified collection that matches the query.

mongo.distinct(mongo, coll, key): it finds all the distinct elements in the specified collection, according to the given key.

Once more, we have the same issue as before, and it is better to encapsulate the queries in an if statement in order to avoid possible errors. So these are the final lines of code that we need to write:

if(mongo.is.connected(mongo) == TRUE) {
  mongo.find.one(mongo, coll)
}

if(mongo.is.connected(mongo) == TRUE) {
  res <- mongo.distinct(mongo, coll, "name")
  head(res, 2)
}

if(mongo.is.connected(mongo) == TRUE) {
  cityone <- mongo.find.one(mongo, coll, '{"name":"Craig Adams"}')
  print( cityone )
  mongo.bson.to.list(cityone)
}

Finally, I would like to show the output results that I got after typing those commands into my R interface. I would like to mention that after the last command, mongo.bson.to.list(cityone), I got all the information relative to that object, not only the _id; but since it is very long and not very necessary to show, I decided not to include it in the image and to rather expose only the most meaningful information:

Picture 3.6: Executing some queries with our sample data from R.

STEP 4: CREATING BSON OBJECTS

mongo.bson.from.list: this function is used to convert an R list into a BSON object. The process is very natural, because lists in R are very similar to the real JSON objects in the Mongo database. I would also like to point out that this process internally calls other

functions like mongo.bson.buffer.create, mongo.bson.buffer.append and mongo.bson.from.buffer.

mongo.bson.from.json: alternatively, this function can be used if we want to create a BSON object from a JSON string. It has the same result as the previous one.

mongo.bson.from.buffer: as you can guess, it creates a BSON object from a buffer. This is the last alternative option that you have for creating BSON objects.

Here are the lines of code with the correct use of the previously mentioned functions:

query <- mongo.bson.from.list(list('city' = 'COLORADO CITY'))

query <- mongo.bson.from.list(list('city' = 'COLORADO CITY', 'loc' = list( , )))

buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "city", "COLORADO CITY")
query <- mongo.bson.from.buffer(buf)

mongo.bson.from.json('{"city":"COLORADO CITY", "loc":[ , ]}')

date_string <- " :01:06"
query <- mongo.bson.from.list(list(date = as.POSIXct(date_string, tz='msk')))

Finally, I would like to provide a screenshot with a practical demonstration of these functions:

Picture 3.7: Executing some queries from R.

STEP 5: EXAMPLE OF AN ANALYSIS

In order to perform our first analysis, we will use our example collection named coll, which contains the hockey players. We will also need to use functions like mongo.distinct, which allows us to get a vector with all the different values according to the given key. What's more, I would like to mention that we will also use another two functions that are not from the library but that are still useful for representing the data that we have with a graphic. As a practical example, we will grab the collection of hockey players and analyse their age. For that we will use the following commands:

if(mongo.is.connected(mongo) == TRUE) {
  pop <- mongo.distinct(mongo, coll, "age")
  hist(pop)
  boxplot(pop)
}

With these lines of code, we first check if we have connected correctly to the database and, if so, we get the ages of the players inside the collection. After that we represent this data in two graphics: a histogram for representing the frequency, and a boxplot for expressing the given data in a box-and-whisker plot. I managed to get the following output results:

Picture 3.8: Graphics showing the results after executing some queries.

As we can see, with these graphics we can analyse the average and the frequency of the different age ranges much better. Finally, in order to end our analysis, I would like to find all of those players that are older than 18, which means they are adults, and to analyse the two that are the oldest of them all. In order to do that I used the following code:

nr <- mongo.count(mongo, coll, list('age' = list('$gte' = 18)))
print( nr )
pops <- mongo.find.all(mongo, coll, list('age' = list('$gte' = 18)))
head(pops, 1)

Picture 3.9: Executing the »count« function and the »head« function.

As we can see, there are twenty-six players in the list that are adults, and we also got the information of the first player in the list.

STEP 6: CHANGING THE DATABASE FROM R

In order to achieve this step we will need to use one of the functions that we already saw before:

mongo.bson.from.json: it allows us to create a JSON object that we will add to our collection a couple of lines later. After that we need to use the function mongo.insert.batch in order to insert the previously created JSON objects into the specified collection. Finally, we make the last step, where we prove that the data has been successfully added to our collection.

a <- mongo.bson.from.json( '{"position":"goalie", "id": , "weight":220, "height":"6 1", "imageurl":" ", "birthplace":"Fussen, DEU", "age":29, "name":"Thomas Greiss", "birthdate":"January 29, 1986", "number":1 }' )

b <- mongo.bson.from.json( '{"position":"goalie", "id": , "weight":220, "height":"6 1", "imageurl":" ", "birthplace":"Fussen, DEU", "age":29, "name":"Thomas Greiss", "birthdate":"January 29, 1986", "number":1 }' )

icoll <- paste("hockey", "players", sep=".")
mongo.insert.batch(mongo, icoll, list(a, b))

dbs <- mongo.get.database.collections(mongo, "hockey")
print(dbs)
mongo.find.all(mongo, icoll)

In this case we have added our JSON objects named a and b to our collection hockey.players. Once more, in order to prove the effectiveness of those lines of code, I will provide the screenshots that I got in my own RStudio interface.

Picture 3.10: Converting to BSON format in R.

Finally, after the command mongo.find.all I got every single object in the collection. In order to demonstrate that the data has been added successfully, I will show the information of the last object in the collection. In the image I've highlighted the information of that object in red, and we can see that it is the same information that we previously loaded into the JSON object we created. For example, the position is goalie and the birthplace is Fussen, DEU.

Picture 3.11: Output result after executing queries on our sample data in R.

APPLYING OUR KNOWLEDGE TO ONE OF OUR DATASETS

For this section we are going to run our Mongo database and connect it with our RStudio. This time we will analyse the data relative to our student database, making queries in order to get some useful information. I am not going to delve very deeply into the commands that I used, because that was covered in the previous section; rather, I will directly write down the commands used and the output results.

QUESTION 1: WHICH GENDER DISTRIBUTION DO WE HAVE IN THE HIGH SCHOOLS?

To answer this, I used the following commands, which check whether FIELD2 contains the character F, for females, or M, for males:

females.count <- mongo.count(mongo, coll, list(FIELD2 = "F"))
print(females.count)
males.count <- mongo.count(mongo, coll, list(FIELD2 = "M"))
print(males.count)
counts <- c(females.count, males.count)
barplot(counts, main="Gender Distribution", names.arg=c("Females", "Males"))

The final results showed that of the 649 students, 383 are girls and the other 266 are boys. This means that the girls are the majority in the high school, making up 59.01% of the total students; the boys account for the other 40.99%.

Picture 3.12: Sample using the »count« function in »rmongodb«.
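As a quick check of the percentages quoted above, the shares can be computed directly from the two counts:

total <- females.count + males.count        # 649 students in total
round(100 * females.count / total, 2)       # 59.01
round(100 * males.count / total, 2)         # 40.99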

QUESTION 2: DOES THE LEVEL OF EDUCATION OF THEIR PARENTS INFLUENCE THE MARKS OF THE STUDENTS?

To answer this question, there are two attributes in the database (attributes number seven and eight) that correspond to the level of education of the parents. The first thing that I did was to separate the students into four groups: all the students whose mother or father has a certain level of education are grouped together, with the possibility of overlapping in the case that the two parents have different levels of education. For separating them I used the following commands:

j12 <- '{"$or": [{"FIELD7": "4"}, {"FIELD8": "4"} ] }'
query <- mongo.bson.from.json(j12)
l1.count <- mongo.count(mongo, coll, query)
print(l1.count)

j12 <- '{"$or": [{"FIELD7": "3"}, {"FIELD8": "3"} ] }'
query <- mongo.bson.from.json(j12)
l2.count <- mongo.count(mongo, coll, query)

j12 <- '{"$or": [{"FIELD7": "2"}, {"FIELD8": "2"} ] }'
query <- mongo.bson.from.json(j12)
l3.count <- mongo.count(mongo, coll, query)

j12 <- '{"$or": [{"FIELD7": "1"}, {"FIELD8": "1"} ] }'
query <- mongo.bson.from.json(j12)
l4.count <- mongo.count(mongo, coll, query)

I would also like to show the output results in my terminal:

Picture 3.13: Executing some experiments in R.

QUESTION 3: HOW MUCH TIME ON AVERAGE DO THE STUDENTS DEDICATE TO THEIR STUDIES PER DAY?

The first thing I will do is show the different commands that I used in order to calculate this. In this case the needed functions are quite similar to the ones in the first question:

v1 <- mongo.count(mongo, coll, list(FIELD14 = "1"))
v2 <- mongo.count(mongo, coll, list(FIELD14 = "2"))
v3 <- mongo.count(mongo, coll, list(FIELD14 = "3"))
v4 <- mongo.count(mongo, coll, list(FIELD14 = "4"))
variables <- c(v1, v2, v3, v4)
boxplot(variables)
barplot(variables, main="Amount of study time (h)", names.arg=c("1h", "2h", "3h", "4h"))
hist(variables, main="Amount of study time (h)")

After that I will provide some screenshots with the output results that I got in the graphs. Basically we can see that most of the students (305 out of 649) study 2 hours per day, with 1.93 hours being the average study time across all of them:

Picture 3.14: Graphic showing the output results of the experiment.

Picture 3.15: Graphic with bars showing the output results of the experiment.

4 MASTERING THE HADOOP

4.1 What is big data and Hadoop?

WHAT IS BIG DATA [8]?

The term big data describes a huge volume of data, structured or not, that inundates a business on an everyday basis. Regardless of what it looks like, the amount of data itself is not so important; what really matters is what organizations do with this big amount of data. Secondly, I would like to point out that there is no exact size in bytes that defines big data; there are mainly three different properties that characterize it: the velocity of accessing the information, the volume of the data, and the variety of it. There are two additional dimensions: variability and complexity.

Finally, in order to end this brief introduction to big data, I would like to explain briefly why it is important and in which different fields big data is being used in today's world. First of all, big data is important because it allows you to save costs, to reduce time, to optimize your offering, to make product development easier, and finally to make smart decisions when you combine it with high-powered analytics. Secondly, I would like to mention that big data has applications in today's world in fields like banking, education, government, health care, manufacturing and retail.

WHAT IS HADOOP [9]?

Hadoop is a software project developed by Apache and released as open source. Its main purpose is to enable the distributed processing of large data sets across clusters of commodity servers. Its design is focused on scaling up from a single server to thousands of machines. One of the main points of this technology is its good fault tolerance, which means that the entire system tolerates a high degree of faults or unfortunate circumstances.

This fault tolerance is not achieved by relying on the hardware; the resilience of the clusters comes from the ability of the software to detect and handle the failures that occur in the application layer.

After that I would also like to mention that the Hadoop architecture is divided into three main layers: the ODP Core, which consists of a standalone interface, the IBM Open Platform with Apache Hadoop, and the IBM Hadoop ecosystem.

Finally I would like to mention the main features of Hadoop when operating with big data. The first one is its scalability: it is possible to work with very large amounts of data. The second good feature is its low-cost architecture. The third, as mentioned before, is its good fault tolerance. Finally I would like to point out its flexibility, because this tool is able to manage structured and unstructured data, and it is very easy to join and aggregate multiple sources with the goal of making a deeper analysis.

WHAT IS HDFS [10]?

HDFS is the acronym for Hadoop Distributed File System, and it has been developed using the distributed file system design. The main advantages of this distributed system are its low-cost hardware and its fault tolerance. Other features that we can highlight are, for example, the ease of access to very large amounts of data, as the data is stored across multiple machines. In addition, HDFS also makes applications available for parallel processing.

Secondly I am going to sum up the main features of HDFS:

As previously mentioned, it is suitable for distributed storage and processing.

Hadoop provides a command interface to interact with HDFS.

It offers streaming access to file system data.

Finally, in order to end this section, I am going to talk about the two elements that form the HDFS architecture: the name node and the data node.

Name node: it is the commodity hardware that contains the GNU/Linux operating system and the name node software. The system running the name node acts as the master server, and among its tasks we can highlight: managing the file system namespace, regulating the clients' access to files, and executing file system operations like opening, closing and renaming files and directories.

Data node: it is the commodity hardware that has the GNU/Linux operating system and the data node software. Its role is the management of the data storage of its system. Among its main tasks we can highlight the performance of read-write operations on the file system, and operations like block creation, deletion and replication according to the instructions of the name node.

4.2 Installing and configuring Hadoop

The first thing that is necessary in order to install Hadoop is to get and unpack the source code. The files are available in different places, but in order to rely on a more trusted source I decided to download them from the official site [11]. Also I would like to mention that I will use the Hadoop version 2.4.0 under the operating system Windows 10. Even if the configuration is a little bit more difficult, I decided not to use any virtual machine for that.

INSTALLING HADOOP IN A STANDALONE MODE

The first step that is necessary if we want to run Hadoop on our computer is to install Java. In my case I had it already installed, so what I did was to make sure that it works correctly and to set up the following environment variables. We have to open the following menus: System properties > Environment variables, and then we should see something like this:

Picture 4.1: Window showing the environment variables.

Here in this menu it is necessary to include a new variable named JAVA_HOME with the value C:/java, referring to the place in the local file system where the library is located. Finally, we also need to change one of the system variables, named Path, adding the value C:/java/bin to the array. After that I am going to make sure that the Java tools are set up correctly on my computer by running the following command in the cmd window:

Picture 4.2: Command window showing the current Java version installed on my computer.

As you can see, I have the Java version 1.7.0_80 installed on my computer. Now let's move to the next step, where we go back to the environment variables. In this step it is necessary to set up a new variable named HADOOP_HOME, which contains the path of the directory where we placed Hadoop (in my case C:\hadoop\). After that it is also necessary to modify another variable that we already had on the computer, PATH. Fortunately for us it is possible to set up more than one address in the same variable as long as we separate them with semicolons, hence I also added the Hadoop path (C:/Hadoop-2.4.0/bin) to this environment variable.

Picture 4.3: Output result showing the Hadoop version installed.

As you can see, I have the Hadoop version 2.4.0 installed in my local file system.

INSTALLING HADOOP IN A PSEUDO DISTRIBUTED MODE

In order to accomplish this part I am going to change the Hadoop configuration, applying changes to the different files that I will mention right after. Finally, I will also demonstrate that I have the tool well configured and ready to use. Inside the directories named etc and hadoop (C:\hadoop-2.4.0\etc\hadoop) we can find the following files that we will change:

Core-site.xml: for this file I will add the following tags within the already existent configuration tag:

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>

Hdfs-site.xml: in this case we also need to add some information within the configuration tag. Also I would like to mention that we are assuming that we have the name node and the data node in the following routes: C:/hadoop/hadoopinfra/hdfs/namenode and C:/hadoop/hadoopinfra/hdfs/datanode

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.name.dir</name>
      <value>file:///C:/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.data.dir</name>
      <value>file:///C:/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>

Yarn-site.xml: we also need to add the following configuration to this file:

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>

Mapred-site.xml: we can also find a file named mapred-site.xml.template; we need to rename it to mapred-site.xml and add the following configuration:

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>

After we have finished with the configuration for the pseudo distributed mode, I would also like to mention that the configuration can be slightly different in older or newer versions. Finally, I would like to verify that my Hadoop is working correctly. On the one hand it will be useful for me to know that everything works fine so far, and on the other hand it will also be useful to demonstrate it in this report. The first verification is the name node setup: for that I need to navigate in the command window to the folder Hadoop-2.4.0/bin and type the command hdfs namenode -format.

Picture 4.4: Output results after executing the»hdfs namenode -format«command.

Now I want to run the commands hdfs namenode and hdfs datanode in order to bring up the HDFS (Hadoop distributed file system). In order to execute both commands successfully I had to handle different issues in the configuration of Hadoop. On the one hand, the native libraries for Windows were not included, so I downloaded [11] them and included them in the bin folder of Hadoop. On the other hand I had to handle another issue: in order to run Hadoop on the Windows operating system it is necessary to solve two main problems. First of all it is necessary to install and configure correctly the following tools: the Software Development Kit version 10, which is the only one compatible with my operating system Windows 10; the Maven files [12], downloaded and extracted into the path C:/maven; and the protocol buffer [13], downloaded and extracted into the path C:/protobuf. Also I needed to add the following environment variables and the following link to the PATH:

Picture 4.5: Environment variables of my computer.

Finally, I had to solve an incompatibility issue between the Java version of my computer and the Java version that Hadoop used, because they were not the same (versions 1.7 and 1.8), and also one of them was using x64 bits whereas the other was using x32 bits. There could be different solutions to this problem; the solution that I chose was to use the newer Java version 1.8 in its 32-bit variant for both tools. Right after that I managed to make the HDFS work in a successful and satisfactory way, and in order to demonstrate it I am going to provide a screenshot of the activity of the two daemons while they are running:

Picture 4.6: Output results after running Hadoop.

After that I run the following commands: yarn resourcemanager and yarn nodemanager. Because of the configuration that we have made with the environment variables I do not have to run the commands from any specific path; the system itself finds the files. In order to demonstrate that I managed to make them run correctly, I am going to show a screenshot of these commands running, and I am going to open the Hadoop web interfaces in the browser, where we will be able to see the basic configuration of Hadoop in our system.

Picture 4.7: Output results after executing»yarn«.

Finally, I am also going to show two more screenshots demonstrating that I have both services running on my computer:

Picture 4.8: Initial page after running Hadoop.

Picture 4.9: Initial page showing the cluster configuration of Hadoop.

Finally, I would like to mention another problem that we will have to face on future occasions if we format the file system again: we can get errors with the cluster ID. In order to solve that we need to take the cluster ID from the name node and add it to the format command. For example, in my case: hdfs namenode -format -clusterid CID-8bf db6-a949-8f74b50f2be9.

In this way we will be able to format Hadoop again and run it without any problems.

4.3 Deployment of R algorithms in Hadoop

So far we have managed to install and run R on its own and connect it with MongoDB, and in the previous section we have managed to run the Hadoop tools on our computer successfully. The goal of this section is to configure R in such a way that it is able to connect with our Hadoop. Once we accomplish that, we will have finished the technical part of this master thesis. In order to get there we need to follow a couple of steps:

STEP 1: CONFIGURATION OF THE ENVIRONMENT

The first thing that we need to do is to open our R (in our case version 3.3) and run the following commands:

Sys.setenv("HADOOP_CMD"="/hadoop-2.7.1/bin/hadoop.cmd")
Sys.setenv("HADOOP_PREFIX"="/hadoop-2.7.1")
Sys.setenv("HADOOP_STREAMING"="/hadoop-2.7.1/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar")
Sys.setenv("HADOOP_HOME"="/hadoop-2.7.1")

These commands set up the variables that refer to the Hadoop binary and the Hadoop streaming jar, whose locations can change depending on the Hadoop version that you have. Finally, we want to be sure that the variables were set up correctly, so this time we use the following command:

Sys.getenv("HADOOP_CMD")
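All four variables can also be checked at once; a small sketch that was not part of the original session:

Sys.getenv(c("HADOOP_CMD", "HADOOP_PREFIX", "HADOOP_STREAMING", "HADOOP_HOME"))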

After that we should see the path that we previously entered, without any trouble. Here you have a screenshot of the command shell with all the commands that have been mentioned so far:

Picture 4.10: Initial configuration of Hadoop in R.

STEP 2: INSTALLING THE NECESSARY PACKAGES

The first thing that we need to do is to download [14] and install Rtools. In our case the latest available was version 3.3, so that is the one that I downloaded and installed. Once we have the necessary tools, we need to install nine different packages that are going to be used. For that we use the command install.packages, and inside the parentheses we create a vector naming all the packages that will be automatically installed:

install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2"))

Finally, we also need to install the three main packages that compose RHadoop: rhdfs, rmr2 and rhbase. For that we need to type the following commands:

library(devtools)
install_github("rmr2", "RevolutionAnalytics", subdir="pkg")
install_github("rhdfs", "RevolutionAnalytics", subdir="pkg")
install_github("rhbase", "RevolutionAnalytics", subdir="pkg")

STEP 3: MAKING FIRST TESTS IN ORDER TO KNOW IF EVERYTHING WORKS AS EXPECTED

First of all, I would like to describe briefly the role of the rmr2 package: its main function consists in performing statistical analysis in R via the Hadoop MapReduce functionality on a Hadoop cluster. Secondly I would like to show that rmr2 works correctly on my computer, and in order to check that I run the following basic commands:

library(rmr2)
from.dfs(to.dfs(1:100))
from.dfs(mapreduce(to.dfs(1:100)))

If everything has been set up correctly, you should not see any errors and instead you should see an output like this:

Picture 4.11: Executing»mapreduce«.

Also I would like to describe briefly the role of the rhdfs package. Its main task consists of providing the basic connectivity with HDFS, the Hadoop distributed file system. With this package you are able to perform different operations like reading, writing and modifying files stored in HDFS, and so on. Finally, I am going to run a simple test showing whether the rhdfs package works correctly. In order to verify that, I run the following commands in R:

library(rhdfs)
hdfs.init()
hdfs.ls("/")

And these are the output results that I got. Because I did not receive any error, I assumed that everything works OK.

Picture 4.12: Basic usage of the»rhdfs«library.

STEP 4: MAKING ADVANCED TESTS

Once we have proved that our RHadoop libraries work correctly, we are going to perform different tests in order to explain how to implement machine learning algorithms in R on Hadoop with the data extracted from our mongo database. The first thing that we are going to do is to operate the Hadoop distributed file system through rhdfs. To achieve that, the following prerequisites have to be fulfilled:

To start the following processes in the command window: hdfs namenode, hdfs datanode, yarn resourcemanager and yarn nodemanager.

To have imported and initialized inside R all the libraries that are needed to perform the job correctly.

To have initialized the environment variables called HADOOP_CMD and HADOOP_STREAMING.

Once we have all that, we can run the following commands. The first thing we will do is to write a file called iris.txt into our HDFS:

library(rhdfs)
hdfs.init()
f = hdfs.file("iris.txt", "w")
data(iris)
hdfs.write(iris, f)
hdfs.close(f)
f = hdfs.file("iris.txt", "r")
dfserialized = hdfs.read(f)
df = unserialize(dfserialized)
df
hdfs.close(f)

After that I would like to provide a screenshot in order to show my output results:

Picture 4.13: Unserializing data with»rhdfs«.

Also I would like to demonstrate and explain the function of different commands from this library that we will use in future sections.

Picture 4.14: Executing different»hdfs«commands.

hdfs.ls('./'): reads the list of files and directories from HDFS.

hdfs.copy("name1", "name2"): copies a file from one HDFS directory into another.

hdfs.move("name1", "name2"): moves a file from one HDFS directory into another.

hdfs.delete("file"): deletes the file that is passed as a parameter.

hdfs.get("name1", "name2"): downloads a file located in HDFS to the local store of your computer.

hdfs.rename("name1", "name2"), hdfs.chmod and hdfs.file.info('./') are self-explanatory.
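To make their usage concrete, here is a short illustrative session combining the commands described above; the file names are assumptions, except for iris.txt, which is the file written earlier in this section:

hdfs.copy("iris.txt", "iris_backup.txt")        # duplicate the file inside HDFS
hdfs.rename("iris_backup.txt", "iris_old.txt")  # rename the copy
hdfs.get("iris_old.txt", "C:/tmp/iris_old.txt") # download it to the local disk
hdfs.delete("iris_old.txt")                     # remove the copy from HDFS
hdfs.ls("./")                                   # verify the remaining contents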

Finally, I am also going to run more tests in order to try out both libraries, rmr2 and rhdfs, at the same time. For that I am going to make two different examples. In both cases the first thing that I need is to import all the libraries. In some cases I am not completely sure whether a library is strictly necessary, but what I am sure of is that everything works correctly when all of them are loaded this way.

library(rJava)
library(Rcpp)
library(RJSONIO)
library(bitops)
library(digest)
library(functional)
library(stringr)
library(plyr)
library(reshape2)
library(devtools)
library(methods)
Sys.setenv("HADOOP_CMD"="/hadoop-2.7.1/bin/hadoop.cmd")
Sys.setenv("HADOOP_PREFIX"="/hadoop-2.7.1")
Sys.setenv("HADOOP_STREAMING"="/hadoop-2.7.1/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar")
Sys.setenv("HADOOP_HOME"="/hadoop-2.7.1")
Sys.setenv("HADOOP_CONF"="/hadoop-2.7.1/libexec")

After that it is also necessary to import the rmr2 library and the rhdfs library, and to initialize the rhdfs system in the following way:

library(rmr2)
library(rhdfs)
hdfs.init()

Using map reduce for the first time, the word count problem: in this example we use the mapreduce function for the first time, and because of that we use an example that is as simple as possible. In this case we will not store the output in any file; we will store it in a variable and we will examine the results in different ways. In the first line we use the function for the first time, and among the parameters we include a txt file that contains the novel Moby-Dick; or, The Whale. After that we inspect the output results stored in the variable, and finally we also fetch the contents of the temporary file into another variable.

Picture 4.15: Using»mapreduce«in Hadoop.

So far we have made the simple example, but now we will make it a bit more complicated. Our goal now is to process the input and count the length of every single row inside the text file.

Finally, we will show the results in a graph. At the same time, we also need to fetch the results from the temporary file into different variables, in order to finally be able to represent them in a graph.

Picture 4.16: Practical example using»mapreduce«.

Picture 4.17: Graph showing the output result of the previous example.
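Since the two examples above are shown only as screenshots of the interactive session, here is a minimal sketch of what the word count job could look like in rmr2; the input file name mobydick.txt is an assumption, and the split on blank spaces is deliberately naive:

# Word count with rmr2 (sketch; "mobydick.txt" is an assumed local file name)
text <- to.dfs(readLines("mobydick.txt"))
wc <- mapreduce(
  input  = text,
  map    = function(k, lines) keyval(unlist(strsplit(lines, " ")), 1),  # emit (word, 1)
  reduce = function(word, counts) keyval(word, sum(counts))             # sum the ones per word
)
out <- from.dfs(wc)
head(keys(out)); head(values(out))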

Comparing the performance between a standard R program and an R map reduce program: the first commands implement a standard R program where all the numbers are squared.

a.time = proc.time()
small.ints2 = 1:100000
result.normal = sapply(small.ints2, function(x) x^2)
proc.time() - a.time

In the second part of the exercise we do exactly the same as before, with the difference that this time we implement it with map reduce:

b.time = proc.time()
small.ints = to.dfs(1:100000)
result = mapreduce(input = small.ints, map = function(k,v) cbind(v, v^2))
proc.time() - b.time

Picture 4.18: Set of commands processing data with»rhadoop«.

In the performance comparison we can see that the standard R program outperforms map reduce when we are processing small amounts of data. That is normal, because the Hadoop system needs to spawn daemons, coordinate the job and fetch data from the data nodes; hence the map reduce version takes a few seconds more.

Testing and debugging the rmr2 program: in this example I took a practical approach to some techniques for debugging and testing an rmr2 program. In order to achieve that I made the following steps: first of all, I configured rmr to run in local mode. Second of all, I performed the same basic example in order to obtain the squares of the first hundred thousand numbers. Finally, I printed out the time and the structure of the obtained information.

Picture 4.19: Final results after applying»rmr«.
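For readers who want to reproduce the debugging setup, this is a sketch of how the local backend can be switched on; with backend = "local" the whole job runs inside the R process, without touching the cluster, which makes errors much easier to trace:

rmr.options(backend = "local")              # run map reduce inside R, for debugging
t0 <- proc.time()
out <- from.dfs(mapreduce(to.dfs(1:100000), map = function(k, v) cbind(v, v^2)))
proc.time() - t0                            # timing of the local run
str(out)                                    # structure of the obtained information
rmr.options(backend = "hadoop")             # switch back for the real runs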

5 PERFORMING THE EXPERIMENTS AND ANALYZING THE RESULTS

So far we have managed to do the following things: first, we have installed and used the mongo database, storing the ten different datasets that we use as samples; secondly, we have installed our R tool and we managed to connect R with mongo using the library named rmongodb, in order to analyze the already existent data. Finally, we have installed Hadoop, a tool that allows us to manage big data, and in addition to that we have also connected our R tool with Hadoop through different libraries like rmr2 and rhdfs. I will explain for every single database what you can do with the data and I will also provide the results of some experiments.

FIRST DATA SET: ARRHYTHMIA

WHAT CAN YOU DO WITH THIS DATA SET? This dataset contains 457 different attributes which describe people who had this symptom. It includes characteristics like age, gender, height, weight, or exactly which elliptic pattern the involuntary movements of these people follow. You could arguably infer from this data whether the disease depends on the weight or height, whether it is more likely to appear at a certain age, or whether it is more usual in males or females, and that could be helpful for research, for example.

PROVIDED RESULTS: On the one hand I will show the age distribution among the people who have had arrhythmia detected.
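As the experiments in this chapter are shown only as screenshots, here is a sketch (not taken from the original session) of how such a distribution could be queried with rmongodb; the collection namespace and the field name FIELD1 for the age attribute are assumptions:

coll <- "admin.arrythmia"                       # assumed namespace of the collection
docs <- mongo.find.all(mongo, coll, data.frame = TRUE)
ages <- as.numeric(docs$FIELD1)                 # FIELD1 is assumed to hold the age
hist(ages, main = "Age distribution", xlab = "age")
summary(ages)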

Here you have the results:

Picture 5.1: Results for the experiments of the arrhythmia dataset.

I would like to point out that in this case the database does not contain too many samples, which is why the frequency graph does not show many of them; the size of this data set lies mostly in the number of attributes. On the other hand, on the left side we can see that the average age for having arrhythmia is approximately between 40 and 50 years. I would also like to provide the gender distribution for people who have arrhythmia, in order to know whether one gender is weaker against this symptom than the other.

Picture 5.2: More results about the arrhythmia dataset.

The results show something quite unusual: there is exactly the same number of men as women who have suffered arrhythmia. Usually one gender has a slightly higher count than the other, even if the two are about the same.

SECOND DATA SET: CMU

WHAT CAN YOU DO WITH THIS DATASET? This dataset contains a lot of different attributes about people; the acronym comes from central management unit, and it could be helpful for knowing where those people are from, how many surveys they have completed or how much money they have.

PROVIDED RESULTS: In this case the performed experiments concern which accuracy distribution values they have (graph 1), which relativity values they have (graph 2), and finally which amount of subjectiveness (graph 3).

Picture 5.3: Results for the experiments of the cmu dataset.

THIRD DATA SET: DIABETIC DATA

WHAT CAN YOU DO WITH THIS DATASET? This dataset contains 49 different attributes about diabetic people. We have gathered features like race, number of days in the hospital, the clinical speciality under which the patient was treated, number of operations, number of diagnoses and more medical data. It could be helpful for knowing how the patients respond to the different operations or how many days they usually need to stay in the hospital.

PROVIDED RESULTS: The first experiment that I performed shows how many of them take insulin and what the gender distribution of those people is.

Picture 5.4: Results for the experiments of the diabetic dataset.

In the first case we can see the numbers: 10 of them are female whereas 9 of them are male, so the proportion is almost 50% even if it does not look like it at first sight. On the other hand, in the second graph I would like to point out that in most of the cases there is no information about this attribute; in the rest of the cases all the diabetics take insulin, which is in line with what one would expect.

FOURTH DATA SET: TUMOR

WHAT CAN YOU DO WITH THIS DATASET? This dataset is quite technical and it contains the different features that different tumors have: which part of the body they affect, their size and their behavior. It can be very useful to learn from all those features in order to make predictions for future tumors, or in order to know which types of tumors are more aggressive and which ones are more likely to appear.

PROVIDED RESULTS: In this case I performed different experiments in order to examine the average and the frequency of two of the attributes. The first one is called HG2507-HT2603_at.

Picture 5.5: Results for the experiments of the tumor dataset.

We can see that in the first attribute the values stay within a range.

Picture 5.6: More results for the tumor dataset.

FIFTH DATA SET: KDDCUP99

WHAT CAN YOU DO WITH THIS DATASET? This dataset comes from the data mining and knowledge discovery competition from the year 1999. It contains different features, like which protocol the participants are using or how many logins have failed. It can be useful, for example, in order to know which protocols are becoming more popular among the participants, or in order to find different error prone situations with the contained error data.

PROVIDED RESULTS: The kddcup99 data set has many different attributes, and in this case I am going to show the values that the destination host count and the destination host service count can take.

Picture 5.7: Results for the experiments of the kddcup dataset.

SIXTH DATA SET: LETTER

WHAT CAN YOU DO WITH THIS DATASET? This dataset contains the different features that characterize written characters. We can highlight: height, width, which corners they are touching, and so on. It can be useful for some areas of research to identify the different Roman characters or to compare them with Chinese or Japanese characters.

PROVIDED RESULTS: In this case I show the different values that the attributes width, x-box and y-box can take.

Picture 5.8: Results for the experiments of the letter dataset.

SEVENTH DATA SET: NURSERY

WHAT CAN YOU DO WITH THIS DATASET? This dataset contains different information about children that were in the nursery. It can be useful to analyze data like the health, the finances or the number of siblings they have, in order to know which of them are more likely to go to the nursery and whether that is related to some of these attributes.

PROVIDED RESULTS: In this case we look at the different values and frequencies that the database has for the attributes parents and has_nurs (whether the child has an auxiliary nurse or not).

Picture 5.9: Results for the experiments of the nursery dataset.

EIGHTH DATA SET: SPLICE

WHAT CAN YOU DO WITH THIS DATASET? The dataset contains 61 different attributes describing DNA sequences, which class of splice junction they belong to and which features they have. It can be useful for recognizing the boundaries in gene sequences and for knowing which features work best for classifying them.

PROVIDED RESULTS: In this case I am going to examine two different attributes, named in the data set attribute_1 and attribute_2. In both cases they can take four different values: C, A, G, T. We will examine the likelihood of each value in both attributes, and on the other hand we will examine the average, the minimum and the maximum values.

Picture 5.10: Results for the experiments of the splice dataset.

NINTH DATA SET: WAVEFORM

WHAT CAN YOU DO WITH THIS DATASET? In this case the data set contains forty different attributes for each row, each one holding one of the points that make up the waveform. It can be useful to know which waveforms we have gathered in nature, and also to predict what a typical waveform graph can look like for different purposes.

PROVIDED RESULTS: In this case I show the different values that the attributes x1, x2 and x3 can take.

Picture 5.11: Results for the experiments of the waveform dataset.

TENTH DATA SET: STUDENTS DATA

WHAT CAN YOU DO WITH THIS DATASET? This dataset contains data about high school students, and with it you are able to find out, for example, whether the level of education of their parents influences their marks, the amount of study time, or whether the girls get better marks than the boys.

PROVIDED RESULTS: I will show the results that I obtained for the high school dataset. In the first picture we see the results about whether the level of education of their parents influenced the students.

Picture 5.12: Results for the experiments of the students dataset.

Finally, I also show the gender distribution among the high school students.

Picture 5.13: More results for the experiments of the students dataset.

6 APPLYING MACHINE LEARNING ALGORITHMS

Machine learning is a subfield of computer science that has grown out of pattern recognition and the theory of computational learning in artificial intelligence. We could arguably define machine learning as the field of study that gives computers the ability to learn without being explicitly programmed. This subfield of computer science explores algorithms that are able to learn and to make decisions and predictions based on data. I also would like to point out that machine learning is closely related to, and sometimes overlaps with, the subfield named computational statistics, a discipline that focuses on prediction making through the use of computers. It has strong ties with mathematical optimization, which provides methods, theory and application domains to machine learning. We can also find applications like spam filtering, computer vision, optical character recognition, search engines and so on. Finally, I would also like to point out that the data mining sub-field focuses more on exploratory data analysis and is referred to as unsupervised learning.

STEP 1: STARTING WITH MACHINE LEARNING ALGORITHMS

As an introduction to the topic, I am going to apply in R one of the simplest machine learning algorithms, named KNN (k nearest neighbors); in this case I will apply it to a sample data set named iris. These are the first commands of the procedure and their explanation:

names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
names(iris)
library(ggvis)
iris %>% ggvis(~Sepal.Length, ~Sepal.Width, fill = ~Species) %>% layer_points()
iris %>% ggvis(~Petal.Length, ~Petal.Width, fill = ~Species) %>% layer_points()

First of all, we assign a vector with all the attribute names. Next, we import a library named ggvis, which can produce more complex graphics that are useful in this case. Finally, I create two graphics. We are able to see the relationship between the two attributes petal length and petal width; on the other hand, we can also appreciate that sepal length and sepal width are not as related as the other two.

Picture 6.1: Applying machine learning algorithms in R.

Secondly I typed the following commands:

table(iris$Species)
round(prop.table(table(iris$Species)) * 100, digits = 1)
summary(iris)

The purpose of these commands is basically to see what we have inside the Species attribute and to express the class proportions as rounded percentages, in order to have the data ready for the experiment.

Picture 6.2: Showing the main features of the iris dataset.

Thirdly, we also want to see a summary of the two attributes petal width and sepal width, in order to see the relationship between them and to have a better understanding of the data set that we are experimenting with. After that we also prepare the workspace by importing the class library. These two actions can be summarized in the following commands:

summary(iris[c("Petal.Width", "Sepal.Width")])
library(class)

After that we come to a very important step. This step is named normalization, and it makes the data more consistent. I also would like to mention that sometimes normalization is not strictly necessary: if there are no big differences between the minimum and the maximum values inside the data set, this step might not be strictly required, but it is still always advisable. Coming back to the topic, the following commands perform the normalization step:

normalize <- function(x) {
  num <- x - min(x)
  denom <- max(x) - min(x)
  return (num/denom)
}
iris_norm <- as.data.frame(lapply(iris[1:4], normalize))
summary(iris_norm)

Picture 6.3: Summary of the iris dataset.
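As a quick check of what normalize() above does, a worked example (not in the original text): the smallest value is mapped to 0, the largest to 1, and everything in between proportionally.

normalize(c(1, 5, 10))
# [1] 0.0000000 0.4444444 1.0000000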

The fourth thing that we have to do is to prepare the training and the test sets. The first thing that we need to do in order to accomplish this step is to set a seed, which makes the generated random numbers reproducible. The sample function then draws, for every row of the iris data set, a label 1 or 2 with the given probabilities. Finally, we use the vector obtained with the sample function in order to define our train and our test sets:

set.seed(1234)
ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.67, 0.33))
iris.training <- iris[ind==1, 1:4]
iris.test <- iris[ind==2, 1:4]

Picture 6.4: Applying machine learning algorithms to iris data.

Finally, I would like to point out that in our train and test sets we do not have all five attributes, only four, because the fifth attribute is actually the one we want to predict. We will then apply the knn function in order to predict the results. But even if it seems that the work is all done, we still need to analyze the results, and for that the first thing we need to do is to import the gmodels library.

iris.trainLabels <- iris[ind==1, 5]
iris.testLabels <- iris[ind==2, 5]
iris_pred <- knn(train = iris.training, test = iris.test, cl = iris.trainLabels, k=3)
iris_pred
library(gmodels)

Picture 6.5: Splitting the iris dataset into training and test.

Finally, we analyze the results, and we can see that the algorithm worked quite well and was right in all cases except one. In order to analyze the results, I typed the following command:

CrossTable(x = iris.testLabels, y = iris_pred, prop.chisq=FALSE)

Picture 6.6: Final results after applying machine learning on the iris dataset.
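Besides the cross table, the quality of the prediction can also be summarized in a single number; a one-line sketch that was not part of the original session:

mean(iris_pred == iris.testLabels)   # share of correctly classified test rows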

STEP 2: APPLYING MACHINE LEARNING (REGRESSION TREE) TO OUR DATASETS

In order to apply machine learning to our datasets we will use a regression tree, because it is one of the most popular and recommended algorithms, and we will apply it to all our datasets with the following command structure:

library(class)
library(rpart)
coll <- dataset.collection
dataset <- mongo.find.all(mongo, coll, data.frame=TRUE)
dataset
raw = subset(dataset, select=c("x.box","y.box","width","high"))
raw
row.names(raw) = dataset$casnumber
raw = na.omit(raw)
frmla = high ~ x.box + y.box + width
fit = rpart(frmla, method="class", data=raw)
printcp(fit)   # display the results
plotcp(fit)    # visualize cross-validation results
summary(fit)
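Beyond the printed summaries, the fitted tree can also be drawn and used for prediction. The following is a short sketch under the same assumptions as above (raw and fit as already defined); it is an illustration, not part of the original experiments:

plot(fit, uniform = TRUE, margin = 0.1)
text(fit, use.n = TRUE, cex = 0.8)            # draw the tree with node counts
pred <- predict(fit, newdata = raw, type = "class")
table(pred, raw$high)                         # confusion table on the training data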

DATA SET: LETTER

After applying the regression tree on this dataset with the previously mentioned commands, I got the following output results:

Picture 6.7: Applying the regression tree algorithm to the letter dataset.

DATA SET: ARRHYTHMIA

After applying the regression tree to the chosen attributes of this dataset, I obtained the following results:

Picture 6.8: Applying the regression tree algorithm to the arrhythmia dataset.

DATA SET: DIABETIC

After applying the regression tree to the attributes gender, weight, race and age, I got the following output results:

Picture 6.9: Applying the regression tree algorithm to the diabetic dataset.

DATA SET: KDDCUP99

Now I will analyze the different attributes from the kddcup dataset from the year 1999. These are the output results that I got:

Picture 6.10: Applying the regression tree algorithm to the kddcup dataset.

DATA SET: NURSERY

Now I will apply the regression tree to the attributes of the nursery dataset, and these are the output results:

Picture 6.11: Applying the regression tree algorithm to the nursery dataset.

DATA SET: SPLICE

Now I will apply the regression tree to the attributes of the splice dataset, and these are the output results:

Picture 6.12: Applying the regression tree algorithm to the splice dataset.

DATA SET: STUDENT

Now I will apply the regression tree to the attributes of the student dataset, and these are the output results:

Picture 6.13: Applying the regression tree algorithm to the student dataset.

DATA SET: TUMOR

Now I will apply the regression tree to the attributes of the tumor dataset, and these are the output results:

Picture 6.14: Applying the regression tree algorithm to the tumor dataset.

DATA SET: WAVEFORM

Now I will apply the regression tree to the attributes of the waveform dataset, and these are the output results:

Picture 6.15: Applying the regression tree algorithm to the waveform dataset.

7 CONCLUSION

With the introduction of new technologies, devices and different means of communication, like the social networks, the quantity of data being produced is growing very fast year by year. Just to give a general idea of how much data we create, it is estimated that about 5 billion gigabytes of data were produced from the beginning of time until 2003. The amount of data required to manage applications and technologies is becoming ever bigger, so there has to be a way of handling this issue. It is here where the role of big data becomes clear. Under the name big data we understand a collection of very big datasets that cannot be processed with traditional computing techniques. Furthermore, in recent years big data is not merely data any more; it has become a subject of its own, which involves different tools, techniques and frameworks.

With this master thesis I had the chance of working with big data, which was a good way of appreciating first hand its main benefits, out of which I would like to mention the main two: using such a big quantity of information allows you to learn about the response to campaigns, promotions and other advertisement media, and it gives you more information about your products, which can be useful for planning the production or for future decision making.

I would also like to say some words about Hadoop, the big data technology that I have been using throughout this master thesis. Hadoop started in 2005 as an open source project, building on the MapReduce solution published by Google. Hadoop is able to run applications using the MapReduce algorithm on different CPU nodes that process the data in parallel. With the performance of this work I was able to confirm that Hadoop is a strong solution, with a very high fault tolerance and a very well scalable approach to big data.

I would also like to give some conclusions and explanations about the other subject of this master thesis, which is machine learning. Machine learning is a subfield of computer science which was derived from the study of computational learning and pattern recognition. We could define machine learning as the field of study that gives computers

the ability to learn without being explicitly programmed. This subfield of computer science explores algorithms that can learn from experience and are able to make predictions based on the gathered data. Furthermore, I would like to point out that machine learning is closely related to the discipline of computational statistics, which also focuses on prediction making, in this case through the use of computers.

Once we have a good definition of what machine learning is, I would like to give my own conclusions based on my own experience. As I saw throughout the making of this master thesis, machine learning approaches can be applied to a lot of different fields, improving product quality along the way. They are of particular interest where the amount of material to be reviewed grows steadily, since making the existing evidence accessible is one of the main challenges in the field of quality improvement.

At the end, I would also like to talk about the results of analyzing our different datasets. As we saw before, big data analysis helps you to identify the connections between the different attributes, to gain a better understanding of the already existent data, and at the same time it allows you to make predictions about what the future data entries are going to look like. We could identify in this way the different features of the tumors in order to classify them, or we could learn which features are related to the occurrence of arrhythmia. Finally, we could also see other, less medical examples, like the data about high school students and how the different variables relate to each other.

8 REFERENCES

[1] Official documentation with a general description of MongoDB.

[2] Main description and basic features of MongoDB, providing a theoretical base, from the Wikipedia web page.

[3] Main page of MongoDB, where you can download the necessary tools legally.

[4] R is a tool for analyzing data; its official web page.

[5] The basic features, history and information about the different versions of the R tool; the second link contains a tutorial for getting started with data mining in R.

[6] A theoretical base with the main information about machine learning algorithms and how they work; the second link contains a basic guide for getting started with the rmongodb library inside R.

[7] Main page with the official documentation for R.

[8] The main source of information for the main features of big data.

[9] The official documentation about Hadoop.

[10] Main tutorial about how to use the Hadoop Distributed File System on Windows.

[11] Main page where you can download the latest Hadoop version.

[11] The native libraries for Windows can be downloaded at the following URL:

[12] The Maven files are needed for the purpose of this master thesis and can be downloaded at the following URL:

[13] The protocol buffer is needed for the entire system to work correctly and can be downloaded at the following URL:

[14] The Rtools can be downloaded at the following URL:


Prikaži več

SKUPNE EU PRIJAVE PROJEKTOV RAZISKOVALNE SFERE IN GOSPODARSTVA Maribor, Inovacije v MSP Innovation in SMEs dr. Igor Milek, SME NKO SPIRIT S

SKUPNE EU PRIJAVE PROJEKTOV RAZISKOVALNE SFERE IN GOSPODARSTVA Maribor, Inovacije v MSP Innovation in SMEs dr. Igor Milek, SME NKO SPIRIT S SKUPNE EU PRIJAVE PROJEKTOV RAZISKOVALNE SFERE IN GOSPODARSTVA Maribor, 10.10.2016 Inovacije v MSP Innovation in SMEs dr. Igor Milek, SME NKO SPIRIT Slovenija, javna agencija Pregled predstavitve Koncept

Prikaži več

Slide 1

Slide 1 INCLUSIVE EDUCATION IN SLOVENIA Natalija Vovk-Ornik, Koper, 7th Oktober 2014 Educational system to publish_10_2015_si The Republic of Slovenia Area: 20,273 km² Population: 2,023,358 Capital: Ljubljana

Prikaži več

Microsoft PowerPoint - Sestanek zastopniki_splet.ppt

Microsoft PowerPoint - Sestanek zastopniki_splet.ppt SREČANJE MED PATENTNIMI ZASTOPNIKI IN ZASTOPNIKI ZA MODELE IN ZNAMKE TER URADOM RS ZA INTELEKTUALNO LASTNINO Ljubljana, 21. oktober 2013 Dnevni red Uvodna beseda Vesna Stanković Juričić, v. d. direktorja

Prikaži več

UČNI NAČRT PREDMETA / COURSE SYLLABUS Predmet: Matematična fizika II Course title: Mathematical Physics II Študijski program in stopnja Study programm

UČNI NAČRT PREDMETA / COURSE SYLLABUS Predmet: Matematična fizika II Course title: Mathematical Physics II Študijski program in stopnja Study programm UČNI NAČRT PREDMETA / COURSE SYLLABUS Predmet: Matematična fizika II Course title: Mathematical Physics II Študijski program in stopnja Study programme and level Univerzitetni študijski program 1.stopnje

Prikaži več

Predmet: Course title: UČNI NAČRT PREDMETA / COURSE SYLLABUS Napredne metode računalniškega vida Advanced topics in computer vision Študijski program

Predmet: Course title: UČNI NAČRT PREDMETA / COURSE SYLLABUS Napredne metode računalniškega vida Advanced topics in computer vision Študijski program Predmet: Course title: UČNI NAČRT PREDMETA / COURSE SYLLABUS Napredne metode računalniškega vida Advanced topics in computer vision Študijski program in stopnja Study programme and level Interdisciplinarni

Prikaži več

NEVTRIN d.o.o. Podjetje za razvoj elektronike, Podgorje 42a, 1241 Kamnik, Slovenia Telefon: Faks.: in

NEVTRIN d.o.o. Podjetje za razvoj elektronike, Podgorje 42a, 1241 Kamnik, Slovenia Telefon: Faks.: in NEVTRIN d.o.o. Podjetje za razvoj elektronike, Podgorje 42a, 1241 Kamnik, Slovenia Telefon: +386 1 729 6 460 Faks.: +386 1 729 6 466 www.nevtrin.si info@elektrina.si USB RFID READER Navodila za uporabo?

Prikaži več

Oznaka prijave: Javni razpis za sofinanciranje znanstvenoraziskovalnega sodelovanja med Republiko Slovenijo in Združenimi državami Amerike v letih 201

Oznaka prijave: Javni razpis za sofinanciranje znanstvenoraziskovalnega sodelovanja med Republiko Slovenijo in Združenimi državami Amerike v letih 201 Oznaka prijave: Javni razpis za sofinanciranje znanstvenoraziskovalnega sodelovanja med Republiko Slovenijo in Združenimi državami Amerike v letih 2014-2015 (Uradni list RS, št. 107/2013, z dne 20.12.2013)

Prikaži več

Microsoft Word - CN-BTU4 Quick Guide_SI

Microsoft Word - CN-BTU4 Quick Guide_SI Bluetooth Dongle Artikel: CN-BTU4 NAVODILA v1.0 Sistemske zahteve Zahteve za PC: - Proc.: Intel Pentium III 500MHz or above. - Ram: 256MB ali več. - Disk: vsaj 50MB. - OS: Windows 98SE/Me/2000/XP - Prost

Prikaži več

Navodila za nastavitev mail odjemalca na ios in Android napravah TELEFONI iphone (ios 12) Predlagamo, da do svoje študentske e-pošte dostopate s pomoč

Navodila za nastavitev mail odjemalca na ios in Android napravah TELEFONI iphone (ios 12) Predlagamo, da do svoje študentske e-pošte dostopate s pomoč TELEFONI iphone (ios 12) Predlagamo, da do svoje študentske e-pošte dostopate s pomočjo aplikacije Outlook, katero lahko prenesete s pomočjo trgovine App Store. Ko aplikacijo zaženete se vam pojavi naslednje

Prikaži več

Diapozitiv 1

Diapozitiv 1 ERASMUS+ MOBILNOSTI Štud. leto 2019/2020 http://www.erasmusplus.si/ Erasmus koda: SI LJUBLJA01 27 DRŽAV ERASMUS+ PRAKTIČNO USPOSABLJANJE MOBILNOST IZVEDENA V OBDOBJU OD 1. JUNIJA 2019 DO 30. SEPTEMBRA

Prikaži več

Workhealth II

Workhealth II SEMINAR Development of a European Work-Related Health Report and Establishment of Mechanisms for Dissemination and Co- Operation in the New Member States and Candidate Countries - WORKHEALTH II The European

Prikaži več

Microsoft Word - škofjeloški grad 4.docx

Microsoft Word - škofjeloški grad 4.docx OSNOVNA ŠOLA POLJANE Poljane 100, 4223 Poljane MLADI RAZISKOVALCI ZA RAZVOJ POLJANSKE DOLINE RAZISKOVALNA NALOGA ŠKOFJELOŠKI GRAD ZGODOVINA Avtorici: Meta Perko Balon, 9.b Manica Demšar, 7.b Mentorica:

Prikaži več

PowerPoint Presentation

PowerPoint Presentation Kako dvigniti zadovoljstvo z IT-jem znotraj vašega podjetja? www.span.eu Service Sheeft Miro Budimir miro.budimir@span.eu Nekaj o nas STRAST DO DELA TEŽKO IZMERIMO, REZULTATE DELA PA LAHKO: Naša Vizija

Prikaži več

Predmet: Course title: UČNI NAČRT PREDMETA / COURSE SYLLABUS (leto / year 2016/17) Računalniške storitve v oblaku Cloud computing Študijski program in

Predmet: Course title: UČNI NAČRT PREDMETA / COURSE SYLLABUS (leto / year 2016/17) Računalniške storitve v oblaku Cloud computing Študijski program in Predmet: Course title: UČNI NAČRT PREDMETA / COURSE SYLLABUS (leto / year 2016/17) Računalniške storitve v oblaku Cloud computing Študijski program in stopnja Study programme and level Interdisciplinarni

Prikaži več

Aleš Štempihar Agile in IIBA poslovni analitiki dodana vrednost za organizacijo in njene kupce Povzetek: Kaj je pravzaprav Agile? Je to metodologija z

Aleš Štempihar Agile in IIBA poslovni analitiki dodana vrednost za organizacijo in njene kupce Povzetek: Kaj je pravzaprav Agile? Je to metodologija z Aleš Štempihar Agile in IIBA poslovni analitiki dodana vrednost za organizacijo in njene kupce Povzetek: Kaj je pravzaprav Agile? Je to metodologija za izvajanje projektov, je to tehnika in orodje za razvoj

Prikaži več

Predmet: Course title: UČNI NAČRT PREDMETA / COURSE SYLLABUS Računalniške storitve v oblaku Cloud Computing Študijski program in stopnja Study program

Predmet: Course title: UČNI NAČRT PREDMETA / COURSE SYLLABUS Računalniške storitve v oblaku Cloud Computing Študijski program in stopnja Study program Predmet: Course title: UČNI NAČRT PREDMETA / COURSE SYLLABUS Računalniške storitve v oblaku Cloud Computing Študijski program in stopnja Study programme and level Interdisciplinarni magistrski študijski

Prikaži več

GEOLOGIJA 2312, (1980), Ljubljana UDK (083.58)=863 Nova grafična izvedba števne mreže A new graphical technique to record the observed d

GEOLOGIJA 2312, (1980), Ljubljana UDK (083.58)=863 Nova grafična izvedba števne mreže A new graphical technique to record the observed d GEOLOGIJA 2312, 323 328 (1980), Ljubljana UDK 551.24(083.58)=863 Nova grafična izvedba števne mreže A new graphical technique to record the observed data in geology Ladislav Placer Geološki zavod, 61000

Prikaži več

Microsoft PowerPoint - IBM Celovito Obvladovanje Varnosti Bostjan Gabrijelcic.ppt

Microsoft PowerPoint - IBM Celovito Obvladovanje Varnosti Bostjan Gabrijelcic.ppt IBM Software Group Najbolj iskane rešitve in znanja na področju varnosti informacijskih sistemov Boštjan Gabrijelčič IBM Software Group bostjan.gabrijelcic@si.ibm.com Ključne varnostne zahteve Poslovni

Prikaži več

Programska oprema Phoenix, različica Opombe ob izdaji SL Revizija 8 April 2018

Programska oprema Phoenix, različica Opombe ob izdaji SL Revizija 8 April 2018 Programska oprema Phoenix, različica 10.7.0 Opombe ob izdaji 809720SL Revizija 8 April 2018 Hypertherm Inc. Etna Road, P.O. Box 5010 Hanover, NH 03755 USA 603-643-3441 Tel (Main Office) 603-643-5352 Fax

Prikaži več

Predmet: Course title: UČNI NAČRT PREDMETA / COURSE SYLLABUS Sodobne metode razvoja programske opreme Modern software development methods Študijski pr

Predmet: Course title: UČNI NAČRT PREDMETA / COURSE SYLLABUS Sodobne metode razvoja programske opreme Modern software development methods Študijski pr Predmet: Course title: UČNI NAČRT PREDMETA / COURSE SYLLABUS Sodobne metode razvoja programske opreme Modern software development methods Študijski program in stopnja Study programme and level Interdisciplinarni

Prikaži več

Microsoft Word - Dokument1

Microsoft Word - Dokument1 Plesna zveza Slovenije (DanceSport Federation of Slovenia) Članica mednarodnih zvez WDSF, WRRC in IDO Celovška c. 25, 1000 Ljubljana, Slovenija, Tel.++386 1 230 14 17, Fax: ++ 386 1 430 22 84 vsem klubom

Prikaži več

PowerPointova predstavitev

PowerPointova predstavitev Prostorsko načrtovanje obalnega prostora Primer Strunjan Slavko Mezek Koper 25.11.2015 Projekt SHAPE Shaping an Holistic Approach to Protect the Adriatic Environment between coast and sea Program IPA Adriatic,

Prikaži več

Microsoft Exchange 2013

Microsoft Exchange 2013 Cumulative update 1 (CU1) for Exchange Server 2013 - izdan včeraj 2.4.2013. Get-AdminAuditLogConfig Get-SendConnector "Internet" Remove- ADPermission -AccessRight ExtendedRight - ExtendedRights "ms-exch-send-headers-

Prikaži več

EndNote Basic Online navodila za uporabo Vsebina 1 Kaj je EndNote Online? Dostop in prijava Ustvarjanje računa Uporaba

EndNote Basic Online navodila za uporabo Vsebina 1 Kaj je EndNote Online? Dostop in prijava Ustvarjanje računa Uporaba EndNote Basic Online navodila za uporabo Vsebina 1 Kaj je EndNote Online?... 2 2 Dostop in prijava... 3 2.1 Ustvarjanje računa... 3 3 Uporaba... 5 3.1 Dodajanje referenc... 5 3.2 Navodila za pripravo in

Prikaži več

UČNI NAČRT PREDMETA / COURSE SYLLABUS S Predmet: Course title: Interakcija človek računalnik Human Computer Interaction Študijski program in stopnja S

UČNI NAČRT PREDMETA / COURSE SYLLABUS S Predmet: Course title: Interakcija človek računalnik Human Computer Interaction Študijski program in stopnja S UČNI NAČRT PREDMETA / COURSE SYLLABUS S Predmet: Course title: Interakcija človek računalnik Human Computer Interaction Študijski program in stopnja Study programme and level Študijska smer Study field

Prikaži več

Microsoft Word - Met_postaja_Jelendol1.doc

Microsoft Word - Met_postaja_Jelendol1.doc Naše okolje, junij 212 METEOROLOŠKA POSTAJA JELENDOL Meteorological station Jelendol Mateja Nadbath V Jelendolu je padavinska meteorološka postaja; Agencija RS za okolje ima v občini Tržič še padavinsko

Prikaži več

PowerPointova predstavitev

PowerPointova predstavitev TIK terminal nima povezave s strežnikom Ob vpisu v TIK Admin se pojavi napis ni povezave s strežnikom Na terminalu je ikona 1. preverimo ali je pravilno nastavljen IP strežnika 1. Preverimo datoteko TIKSAdmin.INI

Prikaži več

Predmet: Course title: UČNI NAČRT PREDMETA / COURSE SYLLABUS Uporabniška izkušnja User Experience Študijski program in stopnja Study programme and lev

Predmet: Course title: UČNI NAČRT PREDMETA / COURSE SYLLABUS Uporabniška izkušnja User Experience Študijski program in stopnja Study programme and lev Predmet: Course title: UČNI NAČRT PREDMETA / COURSE SYLLABUS Uporabniška izkušnja User Experience Študijski program in stopnja Study programme and level Informatika v sodobni družbi, magistrski študijski

Prikaži več

PMJ, XPath

PMJ, XPath Imenski prostori, poti in kazalci v XML Iztok Savnik 1 Imenski prostori v XML XML dokument lahko uporablja atribute, elemente in definicije, ki se nahajajo v drugih datotekah Modularna zasnova Ne sme priti

Prikaži več

an-01-sl-Temperaturni_zapisovalnik_podatkov_Tempmate.-S1.docx

an-01-sl-Temperaturni_zapisovalnik_podatkov_Tempmate.-S1.docx SLO - NAVODILA ZA UPORABO IN MONTAŽO Kat. št.: 14 24 835 www.conrad.si NAVODILA ZA UPORABO Temperaturni zapisovalnik podatkov Tempmate. S1 Kataloška št.: 14 24 835 KAZALO 1. OPIS PROGRAMSKE OPREME ZA NAPRAVO...

Prikaži več

Microsoft PowerPoint - OAPS1- Uvod.ppt

Microsoft PowerPoint - OAPS1- Uvod.ppt Univerza v Ljubljani Fakulteta za računalništvo in informatiko Igor Rožanc Osnove algoritmov in podatkovnih struktur I ( OAPS I ) 2. letnik VSP Računalništvo in informatika, vse smeri Študijsko leto 2006/07

Prikaži več

SQL doc. dr. Evelin Krmac RELACIJSKE PODATKOVNE BAZE Relacijski model organizacije podatkov podatki predstavljeni preko relacij 2D tabel operacije se

SQL doc. dr. Evelin Krmac RELACIJSKE PODATKOVNE BAZE Relacijski model organizacije podatkov podatki predstavljeni preko relacij 2D tabel operacije se SQL RELACIJSKE PODATKOVNE BAZE Relacijski model organizacije podatkov podatki predstavljeni preko relacij 2D tabel operacije se izvajajo preko enega jezika (npr. SQL) omogoča izvajanje osnovnih relacijskih

Prikaži več