A Bumpkin's NLP Log (土人之NLP日志)

Thursday, 22 January 2015

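LaTeX: fixing figures that cover the surrounding text

The clip option of \includegraphics solves this: it clips whatever the figure draws outside its bounding box. For example: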

\includegraphics[width=3in,clip]{examplenew.eps}

That's all.

Thursday, 4 December 2014

How to build a naive (very naive) system that scores over 30,000 in RecSysChallenge 2015?

-The task

Given a sequence of click events performed by some user during a typical session on an e-commerce website, the goal is to predict whether the user is going to buy something and, if so, which items.

Details of RecSysChallenge 2015 can be found at http://2015.recsyschallenge.com/index.html.

-A naive system (too simple and too naive!!)

It is very easy to build a system which achieves a score > 30,000 by using two simple rules.

--Rules
Rule#1: Keep only the items bought at least MINF=10 times in the training data.
Rule#2: Keep only the items clicked at least MINCLICKS=2 times *within the same session* in the test data.

--Steps:
Step#1: Build the list (buys.list) of items bought in the training data, one item per line: item ID, a tab, and the purchase count.
Step#2: For each test session, keep the items that satisfy both Rule#1 and Rule#2.
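
The script below takes buys.list as given. Here is a minimal Python sketch of Step#1. It assumes the training purchases file is comma-separated with the item ID in the third column (session ID, timestamp, item ID, price, quantity), and that "bought N times" means N purchase rows. The names count_buys and format_buys_list are illustrative, not part of the original system:

```python
from collections import Counter


def count_buys(lines):
    """Count how many purchase rows mention each item ID."""
    counts = Counter()
    for line in lines:
        fields = line.strip().split(",")
        # Assumed column order: session ID, timestamp, item ID, price, quantity.
        if len(fields) < 3 or fields[2] == "":
            continue  # skip blank or malformed rows
        counts[fields[2]] += 1
    return counts


def format_buys_list(counts):
    """Render the counts as item<TAB>frequency lines, most frequent first."""
    return [f"{item}\t{freq}" for item, freq in counts.most_common()]
```

Writing the lines returned by format_buys_list to buys.list gives the item<TAB>frequency format that the Perl script reads.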

--The code in Perl
########################

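#Usage: perl a.pl buys.list MINF test_file MINCLICKS
if(@ARGV < 4){
    print STDERR "Usage: perl a.pl buys.list MINF test_file MINCLICKS\n";
    exit 1;
}

$MINF = $ARGV[1];
$MINCLICKS = $ARGV[3];

#Rule#1: keep the items bought at least MINF times (buys.list: item<TAB>freq)
open FP, $ARGV[0] or die "Cannot open $ARGV[0]: $!\n"; #buys.list
while(<FP>){
    chomp;
    ($item, $freq) = split /\t/;
    next if($item eq "" or $freq < $MINF);
    $itemlist{$item} = $freq;
    #print STDERR "ITEM:$item\n";
}
close FP;

#Count the clicks on each kept item within each test session
open FP, $ARGV[2] or die "Cannot open $ARGV[2]: $!\n"; #test file
while(<FP>){
    chomp;
    ($sid, $stime, $item, $sg) = split /\,/;
    next if($sid eq "");
    next if(!exists $itemlist{$item});
    #print STDERR "Tobuy:$sid\t$item\n";
    $pred_buys{$sid}{$item}++;
}
close FP;

#Rule#2: print "session;item1,item2,..." for items clicked at least MINCLICKS times
foreach $sid (sort{$a <=> $b} keys %pred_buys){
    $pred = "";
    foreach $item (keys %{$pred_buys{$sid}}){
        $pred .= $item."," if($pred_buys{$sid}{$item} >= $MINCLICKS);
    }
    next if($pred eq "");
    $pred =~ s/\,$//g;
    print "$sid;$pred\n";
}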

##The END
########################

Uploading the results gives a score of 33780.1.

OK. Time to have a rest. Let us party!!!

Monday, 21 November 2011

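Converting a PPT slide to an EPS file

Anyone who has to write papers ends up needing EPS figures to embed in LaTeX files. Here is how to draw a figure in PPT and turn it into an EPS file.

1. Make a one-page slide in PPT.
2. Print it to PDF (current slide).
3. Print the PDF again to PS (print to file).
4. Use the PS -> EPS option.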

Wednesday, 22 June 2011

How to make an EPS from PPT

1. PPT ---(CutePDF) -> PDF

2. PDF ---(ToPS) -> PS (choose auto-rotate and center)

3. PS -> EPS

How to set up a new printer named ToPS (http://u.cs.biu.ac.il/~herzbea/makeP.htm)

To install the MS Publisher Imagesetter virtual printer, open the `Printers` settings folder, and click `Add Printer`. Select a local printer; since it is virtual, select the FILE: port. Specify `Generic` as manufacturer, and then you can choose the MS Publisher Imagesetter.

After you install the printer, you may want to fine-tune its properties. To do so, open again the Printers setting folder, and right click on the printer; select `Properties`. Under `Device Settings`, set very low values (e.g. 5) to the following two parameters:

Minimum font size to download as outline

Maximum font size to download as bitmap

Next, go to the `Advanced` tab, and from there, select `Printing Defaults…`. In the window that opens, select `Document options`, then `Postscript options`. Set the following two options:

PostScript Output Option: Optimize for Portability

TrueType Font Download Option: Outline

Monday, 23 May 2011

Moses: training a hierarchical model

Notes:

--2011.5.24
1. Set input-type to 0 for hiero/string-to-tree, and to 3 for tree-to-string.
2. Use the chart decoder: /moses-chart-cmd/src/moses_chart

--2011.5.23
When running train-model.perl, use -hierarchical -glue-grammar -max-phrase-length 5. Remember to drop the -reordering option that is used when training the phrase-based model.

Sunday, 23 January 2011

Training a Moses-based Chinese-English translation system

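I have started learning SMT, and the first step is to get familiar with the whole SMT pipeline, so I began practicing with Moses. I ran into a few problems along the way and am writing them down so that I don't forget them.

I basically followed the guide here step by step. It is very well written. Very good!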
http://www.statmt.org/moses_steps.html

Still, a few problems came up.
Test environment: 2.6.31-22-server #68-Ubuntu SMP Tue Oct 26 16:50:02 UTC 2010 x86_64 GNU/Linux

----------------
Supporting software
----------------
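++Keep the versions of the supporting software as close to the guide as possible. If the exact version is unavailable, pick a slightly earlier one; a later version can bring surprises. I have not yet explored exactly which versions work.
For example, SRILM 1.5.7 is not available on the home page. I first tried the latest version and ran into problems; switching to 1.4.6 worked.

++While compiling moses-scripts, there are messages about missing Boost libraries. They do no harm in practice (the libraries may matter somewhere, but for a beginner like me it makes no difference).

++If Moses is downloaded directly, the files sometimes have DOS line endings (^M), and the Perl scripts then fail to run.
Solution 1. Add a space at the end of the first line of every Perl file.
Solution 2. Invoke every script explicitly through perl, i.e. perl a.pl
Solution 3. Check the code out with svn co https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk moses (this is also what the guide recommends; I was naive at first and downloaded it through the web)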
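
The likely cause is that every line ends in CR+LF, so the kernel reads the interpreter path on the #! line with a trailing carriage return and cannot find it. Besides the three workarounds above, the line endings can be converted directly. A minimal Python sketch (to_unix_newlines and convert_file are illustrative names; the dos2unix tool does the same job):

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Replace DOS CRLF line endings with Unix LF."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path: str) -> bool:
    """Rewrite one file in place; return True if anything changed."""
    with open(path, "rb") as fp:
        original = fp.read()
    converted = to_unix_newlines(original)
    if converted != original:
        with open(path, "wb") as fp:
            fp.write(converted)
    return converted != original
```

Calling convert_file on each downloaded Perl script should leave Unix-style files that can be executed directly.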

---------------
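Training
---------------
++The corpus must be UTF-8 encoded.

++GIZA++ crashed during training with an overflow (core dump). Following what I found on Google, I made the change below, and it works!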
*** file_spec.h 2009/07/10 21:38:39 1.1
--- file_spec.h 2009/07/13 11:37:21
! char time_stmp[17];

! sprintf(time_stmp, "%02d-%02d-%02d.%02d%02d%02d.", local->tm_year,
(local->tm_mon + 1), local->tm_mday, local->tm_hour,
local->tm_min, local->tm_sec);
--- 37,49 ----
! char time_stmp[19];

! sprintf(time_stmp, "%04d-%02d-%02d.%02d%02d%02d.", 1900 + local->tm_year,
(local->tm_mon + 1), local->tm_mday, local->tm_hour,
local->tm_min, local->tm_sec);
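
Why the patch works: tm_year holds the number of years since 1900, so for any year from 2000 on it prints as three digits, and the old 17-byte buffer is one byte too short for the string plus its terminating NUL. The arithmetic can be checked with a small Python sketch (Python's % operator follows the same width rules as sprintf for %02d and %04d; c_buffer_size is an illustrative helper, and the date values are arbitrary):

```python
def c_buffer_size(fmt, values):
    """Bytes a C char array needs for sprintf(fmt, ...) plus the NUL terminator."""
    return len(fmt % values) + 1


old_fmt = "%02d-%02d-%02d.%02d%02d%02d."
new_fmt = "%04d-%02d-%02d.%02d%02d%02d."

# tm_year counts years since 1900, so the year 2010 is stored as 110.
tm_year = 110
stamp = (1, 23, 12, 34, 56)  # arbitrary month, day, hour, minute, second

old_needed = c_buffer_size(old_fmt, (tm_year,) + stamp)         # "110-01-23.123456."
new_needed = c_buffer_size(new_fmt, (1900 + tm_year,) + stamp)  # "2010-01-23.123456."

print(old_needed, new_needed)  # prints "18 19"
```

The old format needs 18 bytes but only 17 were allocated; the patched format needs exactly the 19 bytes that char time_stmp[19] provides.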

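++When running MERT, add --mertdir to specify the path. Perhaps older versions did not have this problem? I am not sure.

++I used a 64-bit machine, so the script path should be i686-m64.

That is roughly it. All that is left is to wait... until the model finishes training.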

-------------
Preliminary results
--------------
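++Settings
FBIS as the bilingual corpus
GIGA_xin to train the language model (first 1M sentences only)
GIGA_xin to train the recaser (all of it)
No MERT

++BLEU
19.79

Wow, that high! Time for a break.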

2011.1.24

Wednesday, 8 December 2010

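I extracted the subtrees incorrectly

In the middle of a very busy stretch, I discovered that my extraction of the list of 3-gram bilingual subtrees had been missing many combinations. Damn.

I need to be more careful next time.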