Regex-Style Captures Using a Grammar

22 Mar 2018, CPOL
Capture text using a grammar instead of a regex.

Introduction

One of the most useful features of std::regex is the ability to capture text. However, sometimes regex just doesn't cut it for more ambitious captures.

Example

C++
#include <exception>
#include <iostream>
#include <regex>
#include <string>

int main()
{
    try
    {
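        // $1 is the first name. $2 sits inside a repeated group, so each
        // repetition overwrites it and only the last name survives.
        std::regex rx("([A-Z_a-z]\\w*)(?:\\s*,\\s*([A-Z_a-z]\\w*))*");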
        std::string input = "111 One, 2, Three, Four 222 Five,Six,Seven Eight, 9, 10";
        std::cregex_iterator iter(input.c_str(), input.c_str() + input.size(), rx);
        std::cregex_iterator end;

        for (; iter != end; ++iter)
        {
            for (std::size_t i = 0, size = (*iter).size(); i < size; ++i)
            {
                const std::string str = (*iter)[i].str();

                if (!str.empty()) std::cout << str << '\n';
            }

            std::cout << '\n';
        }
    }
    catch (const std::exception &e)
    {
        std::cout << e.what() << '\n';
    }

    return 0;
}

This code outputs the following:

One
One

Three, Four
Three
Four

Five,Six,Seven
Five
Seven

Eight
Eight

The problem is the "Five,Six,Seven" case. A capture group inside a repeated group only keeps the text from its last repetition, so the regex captures "Five" and "Seven" but loses "Six".

The Solution

C++
// http://www.benhanson.net/parsertl.html
#include "parsertl/enums.hpp"
#include "parsertl/generator.hpp"
#include "parsertl/search_iterator.hpp"
#include <exception>
#include <iostream>
#include <string>
int main()
{
    try
    {
        parsertl::rules grules(*parsertl::rule_flags::enable_captures);
        parsertl::state_machine gsm;
        lexertl::rules lrules;
        lexertl::state_machine lsm;

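        grules.token("Name");
        // Parentheses mark the captures: (Name) in the first alternative
        // is $1 and (Name) in the recursive alternative is $2.
        grules.push("list", "(Name) "
            "| list ',' (Name)");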
        parsertl::generator::build(grules, gsm);
        lrules.push("[A-Z_a-z]\\w*", grules.token_id("Name"));
        lrules.push(",", grules.token_id("','"));
        lrules.push("\\s+", lexertl::rules::skip());
        lexertl::generator::build(lrules, lsm);

        std::string input = "111 One, 2, Three, Four 222 Five,Six,Seven Eight, 9, 10";
        lexertl::citerator liter(input.c_str(), input.c_str() + input.size(), lsm);
        parsertl::csearch_iterator iter(liter, gsm);
        parsertl::csearch_iterator end;

        for (; iter != end; ++iter)
        {
            for (const auto &vec : *iter)
            {
                for (const auto &pair : vec)
                {
                    std::cout << std::string(pair.first, pair.second) << '\n';
                }
            }

            std::cout << '\n';
        }
    }
    catch (const std::exception &e)
    {
        std::cout << e.what() << '\n';
    }

    return 0;
}

The above code outputs the following:

One
One

Three, Four
Three
Four

Five,Six,Seven
Five
Six
Seven

Eight
Eight

The first line in each block is the equivalent of $0 in regex parlance. The lines that follow are the contents of $1 and $2. Each capture is a vector in order to support recursion, so in the example of "Five,Six,Seven", Five is stored under $1 and both Six and Seven are stored under $2.

History

22/03/2018 Created.

23/03/2018 Added regex example.

15/02/2024 Updated example to use parsertl17 syntax.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).


Written By
Software Developer (Senior)
United Kingdom
I started programming in 1983 using Sinclair BASIC, then moved on to Z80 machine code and assembler. In 1988 I programmed 68000 assembler on the Atari ST, and in 1990 I started my degree in Computing Systems, where I learnt Pascal, C and C++ as well as various academic programming languages (ML, LISP etc.).

I have been developing commercial software for Windows using C++ since 1994.
