Installation Guide

Revision History

Revision No.  Description  Date        Author
1.0           Created      2013-10-25  Shirlin Voon

1.0 NUTCH
1. Download Nutch from the link below and unzip the file.
   http://www.apache.org/dyn/closer.cgi/nutch/
2. Install Java on Ubuntu:
   - javac -version
   - sudo apt-get install openjdk-7-jdk
3. Set JAVA_HOME:
   - export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
   - echo $JAVA_HOME //check whether JAVA_HOME is set correctly
4. Run "bin/nutch".
5. Run the following command if you see "Permission denied":
   - chmod +x bin/nutch
6. Create a folder named urls and save the starting URLs to be crawled in a seed.txt file inside it.
7. Edit the file conf/regex-urlfilter.txt. This file holds the regular-expression URL filters that decide which URLs are included in or excluded from the crawl.
8. Download Solr from the link below and unzip the file.
   http://www.apache.org/dyn/closer.cgi/lucene/solr/
9. Start Solr using the commands below:
   - cd Applications/apache-solr-3.6.2/example
   - java -jar start.jar
10. Solr admin website:
   - http://localhost:8983/solr/admin/
   - http://localhost:8983/solr/admin/stats.jsp
11. Start the Nutch crawl using the commands below:
   - cd Applications/apache-nutch-1.7
   - bin/nutch inject crawl/crawldb urls
   - bin/nutch generate crawl/crawldb crawl/segments
   - s1=`ls -d crawl/segments/2* | tail -1`
   - echo $s1
   - bin/nutch fetch $s1
   - bin/nutch parse $s1
   - bin/nutch updatedb crawl/crawldb $s1
   - bin/nutch invertlinks crawl/linkdb -dir crawl/segments
   - bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
12. The commands in step 11 can be simplified into one command, as below:
   - bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr/ -depth 3 -topN 5
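Step 11 performs one round of the crawl cycle (generate, fetch, parse, updatedb) followed by link inversion and Solr indexing; to crawl deeper, the cycle is simply repeated, which is what the -depth flag in step 12 does. Below is a minimal shell sketch of that loop, assuming it is run from the Nutch root directory with urls/seed.txt in place and Solr already running on localhost:8983. The script name crawl.sh and the DEPTH value are illustrative, not part of Nutch itself.

   #!/bin/sh
   # crawl.sh -- repeat the step 11 cycle DEPTH times, then index into Solr.
   DEPTH=3                                   # number of crawl rounds (illustrative)

   bin/nutch inject crawl/crawldb urls       # seed the crawl database

   i=1
   while [ $i -le $DEPTH ]; do
     bin/nutch generate crawl/crawldb crawl/segments
     s1=`ls -d crawl/segments/2* | tail -1`  # pick the newest segment
     bin/nutch fetch $s1
     bin/nutch parse $s1
     bin/nutch updatedb crawl/crawldb $s1
     i=`expr $i + 1`
   done

   bin/nutch invertlinks crawl/linkdb -dir crawl/segments
   bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*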
2.0 APACHE/MYSQL/PHP
1. Create a folder named www in your home directory.
2. Download phpMyAdmin, unzip it, and put it in the www folder.
   http://www.phpmyadmin.net/home_page/downloads.php
3. sudo apt-get update
4. sudo apt-get install phpmyadmin
5. sudo apt-get install apache2
6. sudo apt-get install libapache2-mod-php5
7. Install MySQL (optional):
   - sudo apt-get install mysql-server libapache2-mod-auth-mysql php5-mysql
   - sudo mysql_install_db
   - sudo apt-get install mysql-client-core-5.5
   - sudo apt-get install php5-cli
8. Change the default document root (see the sample site file after this section):
   - sudo cp /etc/apache2/sites-available/default /etc/apache2/sites-available/mysite
   - gksudo gedit /etc/apache2/sites-available/mysite
   - Change DocumentRoot to the new location. PS: make sure there is no space in the new location.
   - Change the <Directory> entry to the new location.
9. Deactivate the old site and activate the new site:
   - sudo a2dissite default && sudo a2ensite mysite
10. Restart Apache2:
   - sudo service apache2 restart
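For step 8, here is a minimal sketch of what /etc/apache2/sites-available/mysite might look like after the edits. It uses the Apache 2.2 syntax shipped with the apache2 package of this era; /home/prism/www is an assumed example path, so substitute your own www folder.

   <VirtualHost *:80>
       ServerAdmin webmaster@localhost
       # New document root (step 8); /home/prism/www is an assumed path
       DocumentRoot /home/prism/www
       <Directory /home/prism/www/>
           Options Indexes FollowSymLinks MultiViews
           AllowOverride None
           Order allow,deny
           Allow from all
       </Directory>
       ErrorLog ${APACHE_LOG_DIR}/error.log
       LogLevel warn
   </VirtualHost>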
3.0 GIT
1. sudo apt-get install git //install Git
2. git init //initialize a repository
3. git status //check status
4. git add * //add all files
   git add Prism3 //add a folder
   git add "test.java" //add one file
5. git rm * //remove all files
6. git commit -m "Description" //commit before pushing to the server
7. git push http://dev@200.15.16.140:8080/Prism.git
8. git pull http://dev@200.15.16.140:8080/Prism.git

4.0 SINGLE NODE CLUSTER SETUP
1. You may refer to the link below:
   http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
2. Make sure Java is installed.
3. Create a user to be used on all machines:
   - sudo addgroup hadoop
   - sudo adduser --ingroup hadoop prism
   - su - prism
4. Install SSH:
   - sudo apt-get install ssh
   - ssh localhost
5. Generate an SSH key:
   - ssh-keygen -t rsa -P ""
6. Enable SSH access to the local machine without keying in the password every time:
   - cat /home/prism/.ssh/id_rsa.pub >> /home/prism/.ssh/authorized_keys
7. Disable IPv6:
   - sudo nano /etc/sysctl.conf
   - Copy the following lines to the end of the file:
     # disable ipv6
     net.ipv6.conf.all.disable_ipv6 = 1
     net.ipv6.conf.default.disable_ipv6 = 1
     net.ipv6.conf.lo.disable_ipv6 = 1
   - Save and restart the PC.
   - Check whether IPv6 is disabled using the following command:
     cat /proc/sys/net/ipv6/conf/all/disable_ipv6
     A value of 0 means IPv6 is enabled; a value of 1 means it is disabled.
8. Update $HOME/.bashrc.
9. Open conf/hadoop-env.sh and set JAVA_HOME.
10. Create the Hadoop temp directory and set permissions:
   - sudo mkdir -p /app/hadoop/tmp
   - sudo chown prism /app/hadoop/tmp
   - sudo chmod 750 /app/hadoop/tmp
11. Set the three site XML files (sample files are shown at the end of this guide):
   - conf/core-site.xml
   - conf/mapred-site.xml
   - conf/hdfs-site.xml
12. bin/hadoop namenode -format
    ** If you see "Permission denied", run "chmod +x bin/hadoop".
13. bin/start-all.sh
14. jps (make sure NameNode, DataNode, JobTracker, TaskTracker, and SecondaryNameNode are all listed)
15. Copy files from the local file system to HDFS and run a job:
   - bin/hadoop dfs -copyFromLocal {localFileDirectory} {HDFS Directory}
   - bin/hadoop dfs -ls {HDFS Directory}
   - bin/hadoop jar hadoop*examples*.jar wordcount {HDFS input} {HDFS output}
16. Other commands:
   - bin/hadoop dfs -cat /home/prism/output/part-r-00000
   - bin/hadoop dfs -getmerge {HDFS Directory} {localFileDirectory}
   - bin/hadoop dfs -rmr /home/prism/test-output //remove a directory from HDFS

5.0 MULTI NODE CLUSTER SETUP
1. You may refer to the link below:
   http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
2. sudo nano /etc/hosts
   200.15.16.141 prism1 //master
   200.15.16.242 prism2 //slave
3. Copy the master's SSH key to the slave's authorized_keys:
   - ssh-copy-id -i /home/prism/.ssh/id_rsa.pub prism@prism2
4. Edit conf/masters (only on the master PC):
   - Change "localhost" to "prism1" //prism1 is the master
5. Add all slave nodes to conf/slaves (only on the master PC):
   - prism1 //prism1 acts as master and also as a slave
   - prism2 //prism2 as slave
   - prism3 //prism3 as slave
6. Set the three site XML files (on all machines; sample files are shown at the end of this guide):
   - conf/core-site.xml //change "localhost" to "prism1"
   - conf/mapred-site.xml //change "localhost" to "prism1"
   - conf/hdfs-site.xml //change the replication value to the number of nodes
7. Format the namenode (on all machines):
   - bin/hadoop namenode -format
   - If formatting the namenode fails:
     i. sudo rm /app/hadoop/tmp -r //remove the directory /app/hadoop/tmp
     ii. sudo mkdir /app/hadoop/tmp //recreate the directory
     iii. sudo chown -R prism /app/hadoop/tmp //give access permission to prism
8. Start DFS and MapReduce:
   - bin/start-dfs.sh
   - bin/start-mapred.sh
   - If the datanode does not start on a slave:
     i. Reformat the namenode on the slave **data will be lost** OR
     ii. Update the namespaceID on the problem datanode: manually copy the NameNode's namespaceID to the DataNode's namespaceID.
        NameNode: /app/hadoop/tmp/dfs/name/current/VERSION
        DataNode: /app/hadoop/tmp/dfs/data/current/VERSION
9. Copy files from the local file system to HDFS and run a job:
   - bin/hadoop dfs -copyFromLocal {localFileDirectory} {HDFS Directory}
   - bin/hadoop dfs -ls {HDFS Directory}
   - bin/hadoop jar hadoop*examples*.jar wordcount {HDFS input} {HDFS output}
10. If the reduce job does not start (or hangs), check the /etc/hosts file on every machine. Make sure every hostname can be resolved.

6.0 LINUX USEFUL COMMANDS
1. sudo mkdir /app //create a directory
2. sudo rm /app -r //delete a directory recursively
3. sudo chown prism /app //give permission for directory "/app" to user "prism"
4. sudo nano /etc/hosts //open the file "/etc/hosts" in the terminal
5. sudo gedit /etc/hosts //open the file "/etc/hosts" in a document editor
6. sudo addgroup hadoop //add group "hadoop"
7. sudo adduser --ingroup hadoop prism //add user "prism" into group "hadoop"
8. su - prism //change user to "prism"

7.0 REPORT FORMAT

8.0 BACKUP & RECOVERY

9.0 CHECKLIST
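For reference, step 11 of section 4.0 and step 6 of section 5.0 edit three site XML files without showing their contents. Below is a minimal sketch based on the Michael Noll tutorials linked in those sections; the ports 54310 and 54311 are the tutorials' conventional choices, not requirements, and prism1 is this guide's master hostname.

conf/core-site.xml:
   <?xml version="1.0"?>
   <configuration>
     <property>
       <name>hadoop.tmp.dir</name>
       <value>/app/hadoop/tmp</value>
     </property>
     <property>
       <name>fs.default.name</name>
       <!-- single node: hdfs://localhost:54310; multi node: hdfs://prism1:54310 -->
       <value>hdfs://localhost:54310</value>
     </property>
   </configuration>

conf/mapred-site.xml:
   <?xml version="1.0"?>
   <configuration>
     <property>
       <name>mapred.job.tracker</name>
       <!-- single node: localhost:54311; multi node: prism1:54311 -->
       <value>localhost:54311</value>
     </property>
   </configuration>

conf/hdfs-site.xml:
   <?xml version="1.0"?>
   <configuration>
     <property>
       <name>dfs.replication</name>
       <!-- single node: 1; multi node: the number of nodes, e.g. 3 -->
       <value>1</value>
     </property>
   </configuration>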
