4 Replies ・ Started by shyro at 2017-09-02 17:03:58 UTC ・ Last reply by Kimtaro Admin at 2017-10-06 23:20:42 UTC

How were the example sentences retrieved from Tatoeba?

I've been working on a project of my own, and I like how jisho.org assigns example sentences to individual JMdict meanings, but I can't find a data source that provides this information. Tatoeba itself has only a limited, clunky search, and I don't think its open data downloads include this information. Jisho.org creators, did you find a better data source than I did, or did you process all the data yourselves to get these cool results?

jakobd2 at 2017-09-02 22:09:43 UTC

I'm not involved with Jisho.org but I guess you already found this page: https://tatoeba.org/eng/downloads
I would assume that sentences are associated with JMdict entries via language parsing, for which I think Jisho.org uses Ve: https://github.com/Kimtaro/ve (which, going by its description, uses MeCab).

I could be wrong about this, though.

shyro at 2017-09-03 10:14:46 UTC

Thanks! I suppose that makes sense.

shyro at 2017-09-03 11:08:43 UTC

Actually, that still doesn't answer my question. My main point was how a sentence gets associated with a specific meaning/sense rather than with the JMdict entry itself. This entry, http://jisho.org/word/%E5%A4%A7%E5%A4%89, is a good example: each meaning written with the same kanji has a different example sentence associated with it, and two of the meanings are even both na-adjectives. So there has to be something more than just language parsing, unless I'm still missing something...

Kimtaro Admin at 2017-10-06 23:20:42 UTC

@shyro I use the Japanese indices file linked from the page that @jakobd2 linked to. It contains the data needed to associate a sentence with the JMdict sense. The format of those indices is described here: http://www.edrdg.org/wiki/index.php/Sentence-Dictionary_Linking
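
For anyone who wants to try this, here is a rough Python sketch of how one line of that indices file might be parsed. It rests on assumptions, not on anything confirmed in this thread: that each line is tab-separated as sentence id, translation id, index text, and that each space-separated index token looks like `headword(reading)[sense]{form in sentence}~`, with every part after the headword optional and `~` marking a checked example. The names `parse_token` and `parse_index_line` and the sample ids are made up for illustration, and this is not Jisho.org's actual code. Check the wiki page above for the authoritative format.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# Assumed token layout: headword(reading)[sense]{surface form}~
# Everything after the headword is treated as optional.
TOKEN_RE = re.compile(
    r"^(?P<headword>[^(\[{~|]+)"      # dictionary headword
    r"(?:\|\d+)?"                     # optional "|n" disambiguation suffix
    r"(?:\((?P<reading>[^)]*)\))?"    # optional reading in parentheses
    r"(?:\[(?P<sense>\d+)\])?"        # optional sense number, e.g. [02]
    r"(?:\{(?P<surface>[^}]*)\})?"    # optional form as it appears in the sentence
    r"(?P<checked>~)?$"               # optional "~" flag
)


class IndexToken(NamedTuple):
    headword: str
    reading: Optional[str]
    sense: Optional[int]
    surface: Optional[str]
    checked: bool


def parse_token(token: str) -> Optional[IndexToken]:
    """Parse one index token; return None if it does not fit the assumed layout."""
    m = TOKEN_RE.match(token)
    if m is None:
        return None
    sense = m.group("sense")
    return IndexToken(
        headword=m.group("headword"),
        reading=m.group("reading"),
        sense=int(sense) if sense else None,
        surface=m.group("surface"),
        checked=m.group("checked") is not None,
    )


def parse_index_line(line: str) -> Tuple[int, int, List[IndexToken]]:
    """Split an assumed 'sentence_id<TAB>meaning_id<TAB>index text' line."""
    sentence_id, meaning_id, index_text = line.rstrip("\n").split("\t", 2)
    tokens = [parse_token(t) for t in index_text.split()]
    return int(sentence_id), int(meaning_id), [t for t in tokens if t is not None]


if __name__ == "__main__":
    # Hypothetical line; the ids are invented for this example.
    sid, mid, toks = parse_index_line("123\t456\t大変(たいへん)[02]~ な")
    for tok in toks:
        print(sid, tok.headword, tok.sense, tok.checked)
```

With tokens parsed this way, a sentence could be attached to a specific sense by looking up the headword (and reading, if present) in JMdict and using the sense number as an index into that entry's senses. Tokens without a sense number would only link to the entry as a whole.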
