December 2006
Les petits papiers de SQLPro - L'art des Soundex - Club d'entraide des développeurs francophones
by parmentierf & 3 othersNon les soundex ne sont pas de petites bêtes rampantes que l’on tue à l’aide d’un insecticide… Ce sont en fait des mécanismes de recherche portant sur la consonance des mots. Ils sont en général utilisé dans de très grandes bases de données pour lesquelles la recherche approchée d’un nom peut être d’une très grande utilité...
Nous avons réalisé ces petites bêtes, à l’aide de DELPHI 3. Mais n’importe quel autre langage performant (C , Java) peut être utilisé à ces fins.
JoshDrew.com
by parmentierfLast month, I needed to use the metaphone and edit (Levenshtein) distance algorithms for a fuzzy search of a MySQL table. Of course, neither is available as a built-in MySQL function. So, I had to install them as UDFs. The MySQL source distribution includes a metaphone UDF function in udf_example.cc. However, I couldn't find a Levenshtein UDF anywhere, so I wrote one, by converting a C implementation by Lorenzo Seidenari. I suspect that other people could benefit from this code, and you can download it from joshdrew.com. I compared the function's output to that of the PHP levensthein() function for a couple million word pairs; the results agreed completely - that's good enough for me. (this code comes with no warranty whatsoever, but I really hope you find it useful)
November 2006
Algorithmes vectoriels et bioinformatique
by parmentierf & 1 otherThèse de doctorat de Sylvie Hamel - recherche approximative de chaînes de caractères.
September 2006
Tame the Beast by Matching Similar Strings
by parmentierfI described the algorithms in two classes: equivalence methods and similarity ranking methods. Equivalence methods return a Boolean result, whereas the similarity ranking methods return a numeric similarity measure or distance metric. In information retrieval systems, it is possible to mix methods to produce a faster hybrid approach. A typical approach is to employ a two-pass mechanism in which an equivalence method is used by the database as a first pass filter, and a ranked similarity method is applied to the filtered entries for the second pass. Ranked similarity methods tend to be algorithmically more complex than equivalence methods, so are usually implemented as custom code outside of the database.
A Guided Tour to Approximate String Matching - Navarro (ResearchIndex)
by parmentierfWe survey the current techniques to cope with the problem of string matching allowing errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities.
FTP Directory: ftp://ftp.cs.arizona.edu/agrep/
by parmentierf (via)Les fichiers concernant la commande agrep
SourceForge.net: The bitap / agrep library
by parmentierf & 1 otherThe bitap library is a clean implementation of regular expression (regex/grep) string matching using the bitap algorithm. Approximate (a.k.a. fuzzy) matching is allowed.
Indexed Approximate String Searching
by parmentierf & 1 otherUne bonne présentation sur la recherche approximative de chaînes (références, algos, ...)
TRE home page
by parmentierf & 1 otherTRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzzy) matching.
agrep is in this library.
1
(10 marks)