From lrr.hll at gmail.com Fri Oct 7 03:47:17 2011
From: lrr.hll at gmail.com (Larry Hull)
Date: Thu, 6 Oct 2011 20:47:17 -0700
Subject: [Langutils-devel] Error on example?

I just tried loading langutils in sbcl 1.0.51 and ran the example file. The eval-when form (other than generating a deprecation warning) failed with the error message that "the value NIL is not of type HASH-TABLE."

Any ideas?

Larry

From lrr.hll at gmail.com Sun Oct 9 03:57:28 2011
From: lrr.hll at gmail.com (Larry Hull)
Date: Sat, 8 Oct 2011 20:57:28 -0700
Subject: [Langutils-devel] Question on verb phrases

Can anyone give me an example that generates a chunked verb phrase?

I'm having difficulty understanding the verb tags.

Consider the following string, which is actually in the pdf in the docs: "Show me all the coats for winter."

In Section 7.1 of the pdf, this should be chunked such that
(VX Show VX) (NX me NX) (NX all the coats NX) (PX for winter PX)

But running tag generates:

(tag "Show me all the coats for winter")
"show/NNP me/PRP all/DT the/DT coats/NNS for/IN winter./NN "

and running chunk generates:
(chunk "Show me all the coats for winter.")
(# # # # # # # # #)

The tag shows no verbs whatsoever; the chunk has show as noun, verb and adverb phrases.

If I tag the string "The dog ate his food"

(tag "The dog ate his food")
"The/DT dog/NN ate/VBD his/PRP$ food/NN "

"ate" is clearly labeled as a verb, but when I chunk the same string

(chunk "The dog ate his food")
(# # # # # #)

the word "ate" disappears completely and "the dog" and "his food" are NX, VX and ADVP.

Clearly I'm misunderstanding something. Can someone shed a little light on my ignorance?

Larry

From eslick at media.mit.edu Sun Oct 9 04:19:54 2011
From: eslick at media.mit.edu (Ian Eslick)
Date: Sat, 8 Oct 2011 21:19:54 -0700
Subject: [Langutils-devel] Question on verb phrases
Message-ID: <2A8D6C20-BD75-4E22-BBD0-15E627C76947@media.mit.edu>

If you're using the github version, I just pushed a patch for the chunker. A compatibility fix for the latest CCL caused a regression in the phrase patterns. Thanks for reporting it!

'Show' defaults to NN, and that particular sentence construction doesn't seem to allow the Brill rule tagger to figure out that it's a VB in that specific context. I'm not sure if it was always like that, or if it's a regression. However, the tagger is using files trained on a news corpus, which contains far fewer imperative statements, so it's not too surprising.

"Can you tell me about the coats for winter" gets it right because tell is more obviously a VB.

Ian

On Oct 8, 2011, at 8:57 PM, Larry Hull wrote:

> Can anyone give me an example that generates a chunked verb phrase?
>
> I'm having difficulty understanding the verb tags.
>
> Consider the following string which is actually in the pdf in the docs "Show me all the coats for winter."
>
> In Section 7.1 of the pdf, this should be chunked such that
> (VX Show VX) (NX me NX) (NX all the coats NX) (PX for winter PX)
>
> But running tag generates:
>
> (tag "Show me all the coats for winter")
> "show/NNP me/PRP all/DT the/DT coats/NNS for/IN winter./NN "
>
> and running chunk generates:
> (chunk "Show me all the coats for winter.")
> (# # #
> # #
> # # #
> #)
>
> The tag shows no verbs whatsoever; the chunk has show as noun, verb and adverb phrases.
>
> If I tag the string "The dog ate his food"
>
> (tag "The dog ate his food")
> "The/DT dog/NN ate/VBD his/PRP$ food/NN "
>
> "ate" is clearly labeled as a verb, but when I chunk the same string
>
> (chunk "The dog ate his food")
> (# # #
> # # #)
>
> the word "ate" disappears completely and "the dog" and "his food" are NX, VX and ADVP.
>
> Clearly I'm misunderstanding something. Can someone shed a little light on my ignorance?
>
> Larry
> _______________________________________________
> Langutils-devel mailing list
> Langutils-devel at common-lisp.net
> http://lists.common-lisp.net/cgi-bin/mailman/listinfo/langutils-devel

From jianshi.huang at gmail.com Tue Oct 11 07:33:33 2011
From: jianshi.huang at gmail.com (Jianshi Huang)
Date: Tue, 11 Oct 2011 16:33:33 +0900
Subject: [Langutils-devel] Period not correctly tokenized?
References: <4E5653F4.9010102@common-lisp.net>

Hey Kevin,

On Fri, Oct 7, 2011 at 2:57 PM, Jianshi Huang wrote:
> Currently it works for me, but I'm not sure whether it will break
> something else...
>
> There must be a reason for not including #\. in the punctuation type.
>
> Anyway, here's the patch for git.
>

I messed up your repository with eslick's cl-langutils. LOL

So here's the patch for your langutils.

-- 
Jianshi Huang
http://huangjs.net/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Fix-tokenization-for-sentence-ending-periods.patch
Type: text/x-patch
Size: 936 bytes
Desc: not available

From raison at chatsubo.net Tue Oct 11 14:41:10 2011
From: raison at chatsubo.net (Kevin Raison)
Date: Tue, 11 Oct 2011 07:41:10 -0700
Subject: [Langutils-devel] Period not correctly tokenized?
References: <4E5653F4.9010102@common-lisp.net>
Message-ID: <4E945586.4060903@chatsubo.net>

Jianshi, actually, my repo is an old copy of eslick's with some modifications.
eslick's cl-langutils is now the gold standard for this library; please use it and ignore mine from here on out.

cheers,
kevin

On 10/11/2011 12:33 AM, Jianshi Huang wrote:
> Hey Kevin,
>
> On Fri, Oct 7, 2011 at 2:57 PM, Jianshi Huang wrote:
>> Currently it works for me, but I'm not sure whether it will break
>> something else...
>>
>> There must be a reason for not including #\. in the punctuation type.
>>
>> Anyway, here's the patch for git.
>>
>
> I messed up your repository with eslick's cl-langutils. LOL
>
> So here's the patch for your langutils.
>

From jianshi.huang at gmail.com Tue Oct 11 15:21:49 2011
From: jianshi.huang at gmail.com (Jianshi Huang)
Date: Wed, 12 Oct 2011 00:21:49 +0900
Subject: [Langutils-devel] Period not correctly tokenized?
In-Reply-To: <4E945586.4060903@chatsubo.net>
References: <4E5653F4.9010102@common-lisp.net> <4E945586.4060903@chatsubo.net>

On Tue, Oct 11, 2011 at 11:41 PM, Kevin Raison wrote:
> Jianshi, actually, my repo is an old copy of eslick's with some
> modifications. eslick's cl-langutils is now the gold standard for this
> library; please use it and ignore mine from here on out.
>

Ok, I'll ask eslick about the problem.

Thanks for the help.

-- 
Jianshi Huang
http://huangjs.net/

From eslick at media.mit.edu Tue Oct 11 16:01:31 2011
From: eslick at media.mit.edu (Ian Eslick)
Date: Tue, 11 Oct 2011 09:01:31 -0700
Subject: [Langutils-devel] Period not correctly tokenized?
References: <4E5653F4.9010102@common-lisp.net>
Message-ID: <2785386C-E6BC-475A-AC4A-76A58E679FAF@media.mit.edu>

Periods are handled specially because they show up in numbers, abbreviations, e.g. and i.e., etc. You lose numbers as tokens if you split out periods naively.

Sent from my iPhone

On Oct 11, 2011, at 12:33 AM, Jianshi Huang wrote:

> Hey Kevin,
>
> On Fri, Oct 7, 2011 at 2:57 PM, Jianshi Huang wrote:
>> Currently it works for me, but I'm not sure whether it will break
>> something else...
>>
>> There must be a reason for not including #\. in the punctuation type.
>>
>> Anyway, here's the patch for git.
>>
>
> I messed up your repository with eslick's cl-langutils. LOL
>
> So here's the patch for your langutils.
>
> --
> Jianshi Huang
> http://huangjs.net/
> <0001-Fix-tokenization-for-sentence-ending-periods.patch>

From jianshi.huang at gmail.com Wed Oct 12 01:03:27 2011
From: jianshi.huang at gmail.com (Jianshi Huang)
Date: Wed, 12 Oct 2011 10:03:27 +0900
Subject: [Langutils-devel] Period not correctly tokenized?
In-Reply-To: <2785386C-E6BC-475A-AC4A-76A58E679FAF@media.mit.edu>
References: <4E5653F4.9010102@common-lisp.net> <2785386C-E6BC-475A-AC4A-76A58E679FAF@media.mit.edu>

Hi Ian,

On Wed, Oct 12, 2011 at 1:01 AM, Ian Eslick wrote:
> Periods are handled specially because they show up in numbers, abbreviations, e.g. and i.e., etc. You lose numbers as tokens if you split out periods naively.

I see. But it seems there's a bug in the handling.

Please ignore my patch.

-- 
Jianshi Huang
http://huangjs.net/
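
Ian's point about periods (they appear inside numbers and abbreviations, so splitting on every period loses tokens) can be illustrated with a small sketch. This is not the langutils tokenizer, and it is not the scrubbed patch; every name below (*abbreviations*, split-on-whitespace, split-final-period, tokenize) is hypothetical and uses only standard Common Lisp. The assumption is a deliberately simple rule: only a token-final period is ever split off, and only when the token is not in a small abbreviation list.

```lisp
;; Hypothetical sketch; none of these names come from langutils.
(defparameter *abbreviations* '("e.g." "i.e." "etc." "mr." "dr.")
  "Illustrative (incomplete) list of tokens that keep their trailing period.")

(defun split-on-whitespace (string)
  "Return the whitespace-separated substrings of STRING as a list."
  (loop with start = nil
        with tokens = '()
        for i from 0 to (length string)
        ;; Treat the end of the string as whitespace so the last token is flushed.
        for ch = (if (< i (length string)) (char string i) #\Space)
        do (cond ((member ch '(#\Space #\Tab #\Newline))
                  (when start
                    (push (subseq string start i) tokens)
                    (setf start nil)))
                 ((null start) (setf start i)))
        finally (return (nreverse tokens))))

(defun split-final-period (token)
  "Return (TOKEN), or TOKEN without its trailing period followed by \".\".
Periods inside the token (as in 7.1) are never touched."
  (let ((len (length token)))
    (if (and (> len 1)
             (char= (char token (1- len)) #\.)
             (not (member token *abbreviations* :test #'string-equal)))
        (list (subseq token 0 (1- len)) ".")
        (list token))))

(defun tokenize (string)
  "Split STRING on whitespace, then split off sentence-ending periods."
  (loop for token in (split-on-whitespace string)
        append (split-final-period token)))

;; (tokenize "Show me coats, e.g. for winter. See Section 7.1.")
;; => ("Show" "me" "coats," "e.g." "for" "winter" "." "See" "Section" "7.1" ".")
```

The abbreviation list is the weak point of this approach: anything missing from it (for example "U.S.") would lose its final period, which is one reason a real tokenizer treats periods as a special case rather than as ordinary punctuation.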