Setting up Nutch 2.1 with MySQL to handle UTF-8


These instructions assume Ubuntu 12.04 with Java 6 or 7 installed and JAVA_HOME configured.

Install MySQL Server and MySQL Client using the Ubuntu software center or sudo apt-get install mysql-server mysql-client at the command line.

As MySQL defaults to latin1 (are we still in the 1990s?) we need to edit the config file with sudo vi /etc/mysql/my.cnf and under [mysqld] add

innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
max_allowed_packet=500M

The innodb options help deal with MySQL’s small primary key size restriction. The max_allowed_packet option is so you don’t run into issues as your database and the pages you store in it get larger. Restart MySQL (for example with sudo service mysql restart, or by restarting your machine) for the changes to take effect.
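
To make the key size restriction concrete, here is a rough sketch of the arithmetic in shell. It assumes InnoDB’s documented index key limits (767 bytes by default, 3072 bytes once innodb_large_prefix is on and the table uses a COMPRESSED or DYNAMIC row format) and that utf8mb4 reserves up to 4 bytes per character. The function name max_key_chars is just for illustration.

```shell
#!/bin/sh
# max_key_chars LIMIT_BYTES BYTES_PER_CHAR
# Prints how many characters of an indexed VARCHAR fit within the key limit.
max_key_chars() {
    echo $(( $1 / $2 ))
}

max_key_chars 767 4     # default limit with utf8mb4: prints 191
max_key_chars 3072 4    # with innodb_large_prefix:   prints 768
```

So without the innodb options a utf8mb4 primary key tops out at 191 characters, while with them varchar(767) fits.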

Check to make sure MySQL is running by typing sudo netstat -tap | grep mysql  and you should see something like

tcp        0      0 localhost:mysql         *:*                     LISTEN

We need to set up the nutch database manually as the current Nutch/Gora/MySQL generated db schema defaults to latin1. Log into MySQL at the command line with the MySQL user and password you set up earlier by typing

mysql -u xxxxx -p

then in the MySQL editor type the following:

CREATE DATABASE nutch DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci;

Press Enter, then type

use nutch;

Press Enter again, then copy and paste the following all together:

CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` mediumtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED
DEFAULT CHARSET=utf8mb4;

Then press Enter. You are done setting up the MySQL database for Nutch.

 

Set up Nutch 2.1 by downloading the latest version from http://www.apache.org/dyn/closer.cgi/nutch/. Untar the contents of the file you just downloaded. Going forward we will refer to this folder as ${APACHE_NUTCH_HOME}.

From inside the nutch folder ensure the MySQL dependency for Nutch is available by editing the following in ${APACHE_NUTCH_HOME}/ivy/ivy.xml

<!-- Uncomment this to use MySQL as database with SQL as Gora store. -->
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

Edit the ${APACHE_NUTCH_HOME}/conf/gora.properties file either deleting or commenting out the Default SqlStore Properties using #. Then add the MySQL properties below replacing xxxxx with the user and password you set up when installing MySQL earlier.

###############################
# MySQL properties            #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=xxxxx
gora.sqlstore.jdbc.password=xxxxx

Edit the ${APACHE_NUTCH_HOME}/conf/gora-sql-mapping.xml file, changing the length of the primarykey from 512 to 767 in both places:
<primarykey column="id" length="767"/>
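
If you would rather not edit the file by hand, the change can be sketched with sed. This assumes each primarykey element sits on a single line and spells the attribute exactly as length="512" with straight quotes. widen_primary_key is a hypothetical helper name, and a .bak copy of the original file is kept.

```shell
#!/bin/sh
# widen_primary_key FILE
# On lines containing a <primarykey element, change length="512" to
# length="767". All other lines are left alone. A backup is saved as FILE.bak.
widen_primary_key() {
    sed -i.bak '/<primarykey /s/length="512"/length="767"/' "$1"
}
```

You would call it as widen_primary_key ${APACHE_NUTCH_HOME}/conf/gora-sql-mapping.xml and then check the result with grep primarykey on the same file.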

Configure ${APACHE_NUTCH_HOME}/conf/nutch-site.xml by putting a name in the value field under http.agent.name. It can be anything but cannot be left blank. Add additional languages if you want (I have added Japanese ja-jp below) and set utf-8 as the default encoding as well. You must also specify SqlStore as the storage.data.store.class.

<property>
<name>http.agent.name</name>
<value>Your Nutch Spider</value>
</property>

<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the “Accept-Language” request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>

<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>

Install ant using the Ubuntu software center or sudo apt-get install ant at the command line.

From the command line, cd to your Nutch folder and type ant runtime
This may take a few minutes to compile.

 

Start your first crawl by typing the lines below at the terminal (replace ‘http://nutch.apache.org/’ with whatever site you want to crawl):

cd ${APACHE_NUTCH_HOME}/runtime/local
mkdir -p urls
echo 'http://nutch.apache.org/' > urls/seed.txt
bin/nutch crawl urls -depth 3 -topN 5

You can easily add more URLs to crawl by hand in seed.txt, one per line. For the crawl, depth is the number of rounds of generate/fetch/parse/update you want to run (not the depth of links, as you might think at first) and topN is the maximum number of links you want to actually parse in each round. Note, however, that Nutch keeps track of every link it encounters in the webpage table. topN only limits how many it actually parses, so don’t be surprised to see many more rows in the webpage table than topN would suggest.
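
Adding seeds by hand can also be scripted. Below is a small sketch of a helper (add_seed is a made-up name, not part of Nutch) that appends a URL to the seed file only if that exact line is not already there. It assumes you run it from ${APACHE_NUTCH_HOME}/runtime/local so that the default path urls/seed.txt matches the one used above.

```shell
#!/bin/sh
# add_seed URL [SEED_FILE]
# Appends URL to the seed file unless that exact line is already present.
add_seed() {
    url=$1
    file=${2:-urls/seed.txt}
    mkdir -p "$(dirname "$file")"
    touch "$file"
    # -q: quiet, -x: match the whole line, -F: treat the URL as a fixed string
    grep -qxF -- "$url" "$file" || printf '%s\n' "$url" >> "$file"
}
```

For example, add_seed 'http://lucene.apache.org/' adds a second seed, and running the same command again leaves the file unchanged.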

Check your crawl results by looking at the webpage table in the nutch database.

mysql -u xxxxx -p
use nutch;
SELECT * FROM nutch.webpage;

You should see the results of your crawl (around 159 rows). It will be hard to read the columns so you may want to install MySQL Workbench via sudo apt-get install mysql-workbench and use that instead for viewing the data. You may also want to run the following SQL command select * from webpage where status = 2; to limit the rows in the webpage table to only urls that were actually parsed.
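
For a quicker overview than scrolling through every row, something along these lines counts the rows per status value. It is only a sketch: status_summary is a hypothetical wrapper, the argument stands for your MySQL user (xxxxx above), and it assumes the mysql client is on your PATH.

```shell
#!/bin/sh
# status_summary USER
# Prompts for the password, then prints one line per status value found in
# the webpage table of the nutch database, with the number of rows for each.
status_summary() {
    mysql -u "$1" -p nutch \
        -e 'SELECT status, COUNT(*) AS pages FROM webpage GROUP BY status ORDER BY status;'
}
```

Run it as status_summary xxxxx. The count next to status 2 should match what the select * from webpage where status = 2; query returns.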

 

Set up and index with Solr

If you are using Nutch 2.1 at this time you are on the bleeding edge and probably want the latest version of Solr 4.0 as well. Download it and untar it to $HOME/apache-solr-4.0.0-XXXX. This folder will now be referred to as ${APACHE_SOLR_HOME}.
Download http://nlp.solutions.asia/wp-content/uploads/2012/08/schema.xml  and use it to replace ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.

From the terminal start solr:
cd ${APACHE_SOLR_HOME}/example
java -jar start.jar

You can check this is running by opening http://localhost:8983/solr in your web browser.

Leave that terminal running and from a different terminal type the following:
cd ${APACHE_NUTCH_HOME}/runtime/local/
bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex
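
Once the solrindex job finishes, the index can also be checked from the command line instead of the browser. The sketch below only builds the query URL. It assumes the default collection1 core of the Solr 4.0 example, does no URL-encoding of the query, and solr_select_url is a name I made up.

```shell
#!/bin/sh
# solr_select_url QUERY [CORE_URL]
# Prints the URL of a Solr select request for QUERY with JSON output.
# QUERY is inserted as-is, so it must not contain spaces or '&'.
solr_select_url() {
    core=${2:-http://localhost:8983/solr/collection1}
    printf '%s/select?q=%s&wt=json&indent=true\n' "$core" "$1"
}

solr_select_url 'text:nutch'
# prints http://localhost:8983/solr/collection1/select?q=text:nutch&wt=json&indent=true
```

Passing the result to curl, as in curl "$(solr_select_url 'text:nutch')", should return the matching documents as JSON if Solr is running and the crawl was indexed.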

You can now run queries in Solr against your crawled content. Open http://localhost:8983/solr/#/collection1/query and, assuming you have crawled nutch.apache.org, enter text:nutch in the input box titled “q” to run a search. You should see the matching pages from your crawl in the response.

 

There remains a lot to configure to get a good web search going but you are at least started.


  1. #1 by James Sullivan on September 2, 2012 - 4:36 am

    There is no way that I am currently aware of (see http://wiki.apache.org/nutch/InjectOptions) to inject urls from a MySQL database rather than a text file. However, having said that I don’t think it would be that difficult to either work around (create a script to generate the URL text file from the db) or to add that functionality to Nutch.

  2. #2 by yangtc on October 24, 2012 - 12:34 am

    When I set this up following this doc, an error occurs: SolrIndexerJob: java.lang.RuntimeException: job failed: name=solr-index, jobid=job_local_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
    at org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:46)
    at org.apache.nutch.indexer.solr.SolrIndexerJob.indexSolr(SolrIndexerJob.java:54)
    at org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:75)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.solr.SolrIndexerJob.main(SolrIndexerJob.java:84)

    and the Nutch 2.1 output:
    ERROR: [doc=fi.foofactory.blog:http/2007/03/twice-speed-half-size.html] unknown field 'site'
    at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:313)
    at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:73)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:208)

    Why?
    In my environment there is no example/solr/conf/ dir, so I ran mkdir conf in example/solr and downloaded schema.xml, and then this error occurred.

  3. #3 by James Sullivan on November 7, 2012 - 9:38 pm

    Changed to LONGBLOB

  4. #4 by huanvn on November 28, 2012 - 2:24 am

    @yangtc

    I got the same errors and have no idea why. After checking log files, I made some modifications

    1) Put schema.xml file in:
    ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.
    (instead of ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml)

    2) Use command:
    bin/nutch solrindex http://127.0.0.1:8983/solr/collection1 -reindex
    (instead of bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex)

  5. #5 by nutch_install on December 15, 2012 - 2:36 am

    To start crawling , the first command as mentioned above

    cd ${APACHE_NUTCH_HOME}/runtime/local

    requires runtime directory which is not present in my APACHE_NUTCH_HOME . I have downloaded binaries for nutch 2.1 from apache nutch mirror site.
    But it does not contain any runtime folder.

    So, how can i execute this query??

    Please somebody help.

    waiting for reply…

    Thank you..

  7. #7 by wannabe on December 27, 2012 - 4:02 am

    Hi there,

    I followed your tutorial to the letter and it works great. However, I would like to use Nutch 2 with Solr 3.6, because I want to use Drupal as a front-end interface (it has a Solr 3.x module). So my question is, how can I use Nutch 2 with Solr 3.6?

    Thanks in advance

    W

  8. #8 by James Sullivan on December 27, 2012 - 2:21 pm

    Huanvn,

    You are absolutely right. Solr was updated and they made some changes. I will update this tutorial so it is correct.

    James

  9. #9 by James Sullivan on December 27, 2012 - 2:24 pm

    Wannabe,

    I believe if you use the schema.xml for 3.6 you shouldn’t have any problems. So instead of using http://nlp.solutions.asia/wp-content/uploads/2012/08/schema.xml you should use the schema.xml included with Nutch 2.0 which I believe was for 3.6. Let me know if that works.

    James

  10. #10 by James Sullivan on December 27, 2012 - 2:26 pm

    nutch_install

    Are you sure you built it? If it successfully built the folders should be there. If you have not built it then the folders will not be there.

  11. #11 by James Sullivan on December 27, 2012 - 2:28 pm

    All,

    I recommend using the Nutch mailing list user@nutch.apache.org . You will probably get a much quicker response than you will here. I only check this page monthly.

  12. #12 by wannabe on December 30, 2012 - 3:59 pm

    Just tell us one thing. Where the heck did you find instructions to setup MYSQL? All I can find is your blog ..

  13. #13 by James Sullivan on January 1, 2013 - 11:04 am

    Wannabe,

    There are no other instructions to setup MySQL for Nutch that I am aware of. I used the general installation instructions for Nutch 2.X and wrote up what I did along with some changes to deal with issues that other people have run into since then. There are a lot of helpful people on the Nutch mailing lists.

  14. #14 by kiran on January 1, 2013 - 1:11 pm

    Hi,

    Why do you think I keep getting the error (Specified key was too long; max key length is 767 bytes) when I try to create the table?

    When I changed the ‘id’ to varchar(150) it worked, but with more than 200 it does not work.

    Is it some kind of limitation, or a problem only I am having?

    Thank you,
    Kiran.

    • #15 by James Sullivan on February 12, 2013 - 10:18 pm

      Apologies for starting with the obvious, but how long is the URL you are trying to index that is causing the error? If you have set it up exactly like above (particularly with the changes to /etc/mysql/my.cnf) your max key length should be 767 characters (not just bytes in case your URL is UTF-8). May I ask what version of MySQL you are running as some of the older versions cannot support larger primary key sizes.

  15. #16 by Prashant on March 1, 2013 - 1:22 am

    Hi,
    When I try to crawl with this command bin/nutch crawl urls -depth 3 -topN 5, I get following error;


    Exception in thread "main" org.apache.gora.util.GoraException: java.io.IOException: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Column length too big for column 'text' (max = 16383); use BLOB or TEXT instead
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
    at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:69)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
    at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:136)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
    Caused by: java.io.IOException: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Column length too big for column 'text' (max = 16383); use BLOB or TEXT instead
    at org.apache.gora.sql.store.SqlStore.createSchema(SqlStore.java:226)
    at org.apache.gora.sql.store.SqlStore.initialize(SqlStore.java:172)
    at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
    … 8 more
    Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Column length too big for column 'text' (max = 16383); use BLOB or TEXT instead
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
    at com.mysql.jdbc.Util.getInstance(Util.java:386)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1052)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
    at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
    at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2427)
    at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2345)
    at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2330)
    at org.apache.gora.sql.store.SqlStore.createSchema(SqlStore.java:224)
    … 11 more

    I tried to look for more clues in the log directory, but could only find this much there:
    "2013-03-01 13:47:14,845 INFO store.SqlStore - creating schema: webpage"

    This means Nutch was creating the schema for webpage, i.e. the table, when it crashed.

    When I searched the web for this exception, I found that it is a reported bug (https://issues.apache.org/jira/browse/NUTCH-970), and it seems it will only be handled in the next release, i.e. 2.2.

    But from the comments above it seems people were able to successfully crawl with MySQL.

    Has anyone faced the same problem? What could be the reason and solutions?

    Thank you,
    Prashant

    • #17 by James Sullivan on March 31, 2013 - 4:17 pm

      Prashant,

      I have not had that issue but if you are able to share the URLs you are injecting I could try and replicate it.

    • #18 by James Sullivan on April 2, 2013 - 3:56 am

      Prashant,

      You aren’t by any chance using the most recent Nutch 2.X version are you? There have been some changes to 2.X (the addition of the prevModifiedTime and batchId columns to the DB) and you should use the following SQL in your nutch database to create the webpage table:

      CREATE TABLE `webpage` (
      `id` varchar(767) NOT NULL,
      `headers` blob,
      `text` mediumtext DEFAULT NULL,
      `status` int(11) DEFAULT NULL,
      `markers` blob,
      `parseStatus` blob,
      `modifiedTime` bigint(20) DEFAULT NULL,
      `prevModifiedTime` bigint(20) DEFAULT NULL,
      `score` float DEFAULT NULL,
      `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
      `batchId` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
      `baseUrl` varchar(767) DEFAULT NULL,
      `content` longblob,
      `title` varchar(2048) DEFAULT NULL,
      `reprUrl` varchar(767) DEFAULT NULL,
      `fetchInterval` int(11) DEFAULT NULL,
      `prevFetchTime` bigint(20) DEFAULT NULL,
      `inlinks` mediumblob,
      `prevSignature` blob,
      `outlinks` mediumblob,
      `fetchTime` bigint(20) DEFAULT NULL,
      `retriesSinceFetch` int(11) DEFAULT NULL,
      `protocolStatus` blob,
      `signature` blob,
      `metadata` blob,
      PRIMARY KEY (`id`)
      ) ENGINE=InnoDB
      ROW_FORMAT=COMPRESSED
      DEFAULT CHARSET=utf8mb4;

  16. #19 by a.james.wolf on March 16, 2013 - 4:30 pm

    I was successful using your tutorial; however, it did take a bit more to complete than what you mentioned, for example running an ivy-bootstrap and some other missing details.

    My question is this. I see here: http://searchthenews.wikispaces.com/Useful+Nutch+Commands
    For example there are many references to segments and databases.

    With your setup what is the name of the database?

    • #20 by James Sullivan on March 29, 2013 - 1:37 am

      CREATE DATABASE nutch DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci;

      This means that the name of the database is ‘nutch’.

  17. #21 by hush_lucy on March 26, 2013 - 11:13 am

    Hi James,

    I am currently trying to set up Nutch with MySQL and Elasticsearch as the indexer. I followed your instructions diligently but I cannot create the table with id varchar(767) as the primary key. I always get this error: “ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes”. I tried everything but it still didn’t work.

    I’m almost giving up on setting the id varchar(767) and I was wondering if changing it to varchar(199) would cause problems in the future.

    I need your kind assistance please.

    • #22 by James Sullivan on March 29, 2013 - 1:33 am

      It will still work with varchar(199) as long as the URLs you index are less than 200 characters long. In my experience you eventually run into longer URLs if you are doing a webcrawl. Did you edit the file my.cnf and under [mysqld] add

      innodb_file_format=barracuda
      innodb_file_per_table=true
      innodb_large_prefix=true

      and use the create table sql with

      `id` varchar(767) NOT NULL,

      Those are the parts that enable you to have a larger key.

  18. #23 by kimsovon on April 5, 2013 - 10:20 pm

    Me too. When I try to crawl with this command bin/nutch crawl urls -depth 3 -topN 5, I get

    Exception in thread "main" org.apache.gora.util.GoraException: java.io.IOException: java.lang.NullPointerException
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
    at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:214)
    at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:136)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
    Caused by: java.io.IOException: java.lang.NullPointerException
    at org.apache.gora.sql.store.SqlStore.getConnection(SqlStore.java:747)
    at org.apache.gora.sql.store.SqlStore.initialize(SqlStore.java:160)
    at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
    … 8 more
    Caused by: java.lang.NullPointerException
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:188)
    at org.apache.gora.sql.store.SqlStore.getConnection(SqlStore.java:735)
    … 11 more

    Please, help me.

  19. #24 by James Sullivan on July 15, 2013 - 8:03 pm

    First thing to make sure is that the status is greater than 1 for the row you want indexed. Anything with a status of 1 has not been fetched yet so will not be indexed and you need to run the nutch crawl again. Also if you are using Nutch 2.2 please see http://nlp.solutions.asia/?p=362 for more up-to-date instructions.
