
Service for StackExchange: Implementing a Data Provider

A client-side data provider for importing data into a relational data service built for the StackExchange data dump.

Please extract sample data from the root directory of the document packages into the root directory of the data service site.

And API documents (optional)

Please extract all API document items from the root directory of the document packages into the root directory of the data service site.

Note: The importer and data service now run under .NET 4.5.1, and the latter under ASP.NET MVC 5.


Introduction

StackExchange (SE) provides a collection of question/answer (Q/A) web sites that are quite useful for raising technical questions, in a wide range of knowledge domains, that can be answered by other knowledgeable users. In addition to the online content, SE also makes its data dump available periodically in XML format.

XML data is not suitable for direct querying. Since the data is obviously relational in nature, it is best to import it into a relational database so that structured queries can be performed on it. A few such databases already exist (see, for example, here and here), but they are not up to date and the SE data structure has changed since they were created.

The present article describes how to implement a client-side data importer for a data service built according to a data schema inferred from the published SE data. Depending on the requirements, data import into relational databases can in general be complex to accomplish. Since the current article describes a general solution, it must handle the following cases properly:

  1. Data is expected to be imported multiple times after the initial import.
  2. Data can be obtained from multiple sources.
  3. The database is not read-only, which means new data could be added to or deleted from it in between two imports.
  4. Data tables can have auto-generated keys. Due to the above cases, it is impossible for an entity inside the target database to keep the same auto-generated key value as the original one, so proper key mapping must be performed.
  5. The data may not be clean: there are duplicates and invalid foreign keys in the source data.
  6. The sequence of the data fed by the provider may not be in an order that respects the data relationships (see the following).

Most of the logic for doing relational data import under the above requirements is already handled by the data service; a data provider only needs to supply proper relational data and specify how the data should be fed into the service API one by one.

Online demonstrations of the data service built for SE are:

The data inside them is loaded using the data importer described here.

Although this article is specialized to data importation for SE, it also sets up some basics for a few future articles to be written about the data service built for SE. In addition, most of what is described is also relevant to any of our other data services, like the membership service published here on CodeProject.

Background

The initial idea was triggered by a blog post of Scott Hanselman (see here) on an OData service for SE. Having the capability of semi-automatically building data services (e.g., see here) for relational data sources according to a given data schema, we took up the challenge to build a data service for SE ourselves. It took us a lot more than 12 hours on an airplane, but the extra time spent was worth it since the results are more user and developer friendly, at least for us. In this article and the following ones on the same topic, we shall show why.

SE hosts questions and answers. However, finding answers to questions similar to one's own is not always an easy task due to the large number of unrelated questions there. Although the tagging system might help, a flat, one-layer tagging system becomes less efficient when the size of the data grows beyond a certain level, at which point the tags themselves become a problem, not a solution. A user can use web search engines to do the searching, but the results obtained are fuzzy and uncontrollable at best. This is because general search engines have no knowledge about specific knowledge domains; they have to use statistical algorithms to guess user intention and select results, which can discard information that is truly relevant to the questioner but is misinterpreted or considered to have lower weight by the search engine for certain reasons. Guessing and/or satisfying everybody's needs is hard no matter how sophisticated a single algorithm might be.

One solution to this problem is to build a vertical search engine for the problem domain of interest. SE focuses on a few large knowledge domains in which a questioner is supposed to already have some background knowledge, so that he/she can interact with the system to retrieve more of that background knowledge and raise increasingly intelligent questions that can be expressed in ways simple keyword search cannot match. For example, ${ find questions whose body matches keywords 'word1, word2, ...' raised by users whose name contains 'john' in between 'Datetime1' and 'Datetime2' and whose answers have a score greater than 5 and the comments to the said answers does not matches keyword 'nonesense' } followed by ${ order the results according to the score in descending order and for those having the same score, order them according to the time of post in ascending order } should be easy to construct without having to learn a "strange new language" or being discouraged by constant syntax errors; the expressions should be interpreted and executed by the search engine correctly, without it imposing "subjective" judgment of its own (see here for more details). In addition, the user can "save" the search result in a virtual sub-set defined by the expression for viewing, reporting, or further questioning and data mining using the same procedure, recursively.

Our data service is designed to be such a system.

The data schema and extensions

The exported files from SE are Badges.xml, Comments.xml, PostHistory.xml, Posts.xml, Users.xml, and Votes.xml; they are mapped to corresponding tables in the database. However, after reading the readme.txt one finds that more tables are needed: PostTypes, PostHistoryTypes and VoteTypes. Further inspection of the Posts.xml data file reveals that posts have tags represented by the value of the Tags attribute, where each tag is enclosed by < and >. These are best represented by a Tags table and a link table PostTags that associates a tag with a post. There are 11 tables, as shown in the following diagram. The names and types of the fields are obtained from the data files and from here. There are predefined primary keys and values for the entities inside PostTypes, PostHistoryTypes and VoteTypes. These values should be retrieved from here, since the values defined inside readme.txt are not in sync with the actual data inside the xml files.

The foreign key constraints are illustrated in the following diagram; most of them can be easily recognized from the name of the field. The Posts table has two fields that reference the table itself: AcceptedAnswerId and ParentId, both of which refer to another entity in Posts when they are not null. However, AcceptedAnswerId cannot be a foreign key. This is because a foreign key refers to an existing entity added in the past (relative to the time the current entity was added), whereas a non-null AcceptedAnswerId refers to an entity in the future (an answer to the current question).

Graphical representation of the data schema used.

This kind of data schema is the kind that defines an instance of a database inside a traditional relational database engine. However, it does not provide enough information for the semi-automatic production processes to proceed, so extensions are needed. Most of the extended schema information is not relevant to the current article. One extension that is relevant here is the intrinsic identifiers (ids) introduced in our system. One of the purposes of intrinsic ids is to identify an entity inside a collection of database copies. They are not necessarily identical to primary keys, because some kinds of primary keys are assigned by the database based on its state at the time, like the most commonly used auto-generated primary keys, whose values are local to the database and have nothing to do with what the entity is; they can be different from copy to copy (of a database). However, an immutable foreign key that points to an auto-generated primary key can serve as an intrinsic id, because it is already pre-defined by the said primary key.

The intrinsic ids for an entity in the current case are listed in the corresponding document (see here) for its class. For example, since the primary key for the Posts table is auto-generated, the intrinsic ids for the Post entity are chosen to be [ PostTypeId, OwnerUserId, Title, CreationDate ], which is considered to uniquely identify a post. Two posts with the same set of intrinsic ids are recorded only once inside the database even if their other attributes differ; depending on the application context, the second copy is treated either as a duplicate or as an update. Therefore selecting a proper set of intrinsic ids is very important for the correct behavior of the system.
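
As a conceptual illustration (not code from the included packages), the two posts below agree on all four intrinsic ids and would therefore be stored as a single entity by the service:

C#
// Conceptual example only: p1 and p2 share the intrinsic ids
// [ PostTypeId, OwnerUserId, Title, CreationDate ], so only the first one is
// recorded; the second is reported as a duplicate (or treated as an update,
// depending on the application context), even though Score differs.
var created = new DateTime(2014, 1, 1, 0, 0, 0, DateTimeKind.Utc);
var p1 = new Post { PostTypeId = 1, OwnerUserId = 42,
                    Title = "How to parse XML?", CreationDate = created, Score = 3 };
var p2 = new Post { PostTypeId = 1, OwnerUserId = 42,
                    Title = "How to parse XML?", CreationDate = created, Score = 7 };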

The importer

The importer class is auto-generated given an extended data schema for any relational data source. The included importer class StackExchangeBulkImporter is based on the schema above; however, most of what is described in the following can also be applied to data services constructed for any other (extended) data schema. The data service and the importer together handle the complex logic of data importing, which is implemented to satisfy the following general requirements:

  • The imported data must already have valid values for primary keys and/or foreign keys.
  • Auto-generated primary keys will be assigned a new value that is most likely different from the original one.
  • The validity of foreign keys is checked when the attached database engine supports it; entities having invalid foreign keys will be rejected and reported.
  • Foreign keys that point to auto-generated primary keys will be modified to the new value of the corresponding primary key.
  • Entities having the same set of intrinsic ids are imported only once; all subsequent imports, if any, will be ignored and recorded as duplicates.
  • Multiple imports of the same data set are allowed. Existing data will not be affected by multiple imports.

It should be noted that, except for the PostType, PostHistoryType and VoteType sets (tables), the primary keys of the other data sets are auto-generated integers, in order to satisfy the above requirements. In this case the data import here is different from identical database replication. Client software should therefore not use hard-coded values for auto-generated keys if multiple copies of a database constructed this way are used simultaneously, e.g., in a load-balancing setting, since those values will be different from copy to copy of the data set. The importation is more appropriately called database re-creation, in which all original data relationships are respected and all data regarded as invalid is dropped.

Of course, replication can be achieved by setting all primary keys to be non-auto-generated; such a database would be read only, however. We did not choose this option in this article.

The importer interacts with the client data provider and the user interface via three parts:

  • The input part, to which a custom data feeder can be attached;
  • The output part, which serves the purpose of progress updating, import status reporting, etc.; and
  • The control part, which allows a user to set various parameters for the import and to start or stop the importation.

The input part

The input part consists of a set of client-supplied enumerators, one for each data set, named after the following pattern:

<Entity Name> + "Enum"

where <Entity Name> is the name of the entity inside the data set. For example PostEnum:

C#
/// <summary>
/// Client supplied <c>Post</c> enumerator.
/// </summary>
public IEnumerable<Post> PostEnum
{
    get;
    set;
}
...

There are also properties for the pre-counts of source entities, used for progress reporting, named after the pattern:

<Entity Name> + "Entities"

For example PostEntities:

C#
...
/// <summary>
/// Client supplied number of <c>Post</c> entities, if known.
/// </summary>
public Int64? PostEntities
{
    get;
    set;
}
...

No progress information for a given data set is available if the corresponding "Entities" property is not set.
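
Putting the two together, a client wires up the input part for a given set roughly as follows (a sketch only; provider stands for the data provider described later, and the row count shown is a hypothetical value that would normally come from the provider's pre-count callback):

C#
// Sketch only: attach the client supplied Post enumerator and its pre-count.
var importer = new StackExchangeBulkImporter();
importer.PostEnum = provider.GetPosts();

// Progress for the Post set is reported only when the pre-count is supplied.
importer.PostEntities = 12345; // hypothetical number of rows in Posts.xml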

The output part

Events

C#
...
/// <summary>
/// Occurs when the enumeration state is changed.
/// </summary>
public event Action<EnumProgressStateEvent> EnumStateChanged = null;

/// <summary>
/// Occurs when the enumeration is progressing.
/// </summary>
public event Action<EnumProgressEvent> EnumProgress = null;

/// <summary>
/// Occurs when the update is incomplete.
/// </summary>
public event Action<EntitySetType, IEnumerable<IUpdateResult>> UpdateIncomplete = null;
...

Reports

The status reports are recorded in internal properties named after the pattern:

"Invalid" + <Entity Name> + "s"

For example InvalidPosts:

C#
...
internal List<IUpdateResult> InvalidPosts
{
    get
    {
        if (_invalidPosts == null)
            _invalidPosts = new List<IUpdateResult>();
        return _invalidPosts;
    }
}
private List<IUpdateResult> _invalidPosts = null;
...

They are pushed to the client via the UpdateIncomplete event.
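
For example, a client could consume the output part as follows (illustrative only; importer is a StackExchangeBulkImporter instance, and the console logging is just a stand-in for whatever the UI actually does):

C#
// Subscribe to the output events before starting the import (sketch only).
importer.EnumStateChanged += ev =>
{
    // a data set started or finished enumerating; update the UI state here
};
importer.EnumProgress += ev =>
{
    // periodic progress for the data set currently being imported
};
importer.UpdateIncomplete += (setType, results) =>
{
    // entities rejected by the service, e.g. duplicates or invalid foreign keys
    foreach (var r in results)
        Console.WriteLine("{0}: OpStatus = {1}", setType, r.OpStatus);
};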

The control part

Settings

There are various public fields and properties that can be used to control the import. Their use is documented inside the code, so they will not be described further here.

Methods

There are five methods that client software can call to control the import:

C#
/// <summary>
/// It initializes the visitor before enumerating data entities.
/// </summary>
/// <param name="baseAddress">The base http address for the source data service that contains the data.</param>
/// <param name="credentials">Application agent credentials required to access the data source service.</param>
public void Init(string baseAddress, CallerCredentials credentials = null)
{
    ...
}


/// <summary>
/// Start the importation.
/// </summary>
public void StartProcess()
{
    ...
}


/// <summary>
/// Stop the importation.
/// </summary>
public void StopProcess()
{
    ...
}


/// <summary>
/// Load a previous state so the importing can be started from it next.
/// </summary>
public void LoadState(string statepath)
{
    ...
}


/// <summary>
/// Save the current state so the importing can be started from current state later.
/// </summary>
public void SaveState(string statepath)
{
    ...
}
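
A minimal control sequence might look like the following (a sketch only; the base address and state file path are placeholders, and importer is assumed to have its input part already wired up):

C#
// Sketch of a typical control sequence; the address and path are placeholders.
importer.Init("http://localhost/SEDataService", null);

// Optionally resume a previously interrupted import.
// importer.LoadState(@"C:\Temp\se-import.state");

importer.StartProcess();

// A long running import can be stopped and its state saved so that it can be
// resumed later with LoadState followed by StartProcess.
// importer.StopProcess();
// importer.SaveState(@"C:\Temp\se-import.state");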

Some details about the importer

It is useful to discuss some details of the auto-generated importer so that the data provider can be implemented correctly. The StartProcess method processes each data feed for a particular table from the data provider in an order that respects the inter-dependencies between different kinds of entities. This is relatively easy for entity sets (tables) that depend only on other kinds of entities. However, for entities that depend on other entities inside the same set, like the Post set, the logic is more involved. This is because there is no guarantee that the xml data source from SE feeds the Post entities in an order that respects the data dependency; in fact it does not. In addition, entities in the Post set have a property, called AcceptedAnswerId (see the schema section), that needs to be mapped to new key values, which are available only after all the Post entities are added. These two "problems" require post processing, which is handled at the end of the code block dedicated to the Post set:

C#
#region import into "Posts" set.
//...
//... for each entity fed by the data provider, try to add it to the database ...
//...

if (InvalidPosts.Count > 0)
{
    var ifkeys = (from d in InvalidPosts 
           where (d.OpStatus & (int)EntityOpStatus.InvalideForeignKey) > 0 
           select d).ToArray();
    // remove all invalid posts that have invalid foreign key
    foreach (var e in ifkeys)
        InvalidPosts.Remove(e);
    l.Clear();
    //
    // Load all the invalid posts that have an invalid foreign key and try to add them to the database 
    // a second time. Of course, the ones with truly invalid keys will be rejected and added 
    // to InvalidPosts again.
    //
    foreach (var e in ifkeys)
    {
        l.Add((e as PostUpdateResult).UpdatedItem);
        if (l.Count == DataBlockSize)
        {
            if (HandlePostBlock(svc, set, l, totalKnown, ref add_perc, ref added))
                ProcessedPosts = added;
            l.Clear();
        }
    }
    if (l.Count > 0)
    {
        if (HandlePostBlock(svc, set, l, totalKnown, ref add_perc, ref added))
            ProcessedPosts = added;
    }
    //   ...
    //   ... codes used to update progress
    //   ...
    
}
else
{
    //...
    //... codes used to update progress
    //...
}
// Post processor provided by the data provider
if (PostPostProcessor != null && !PostPostProcessDone)
{
    PostPostProcessor(cntx, svc, PostKeyMap);
    PostPostProcessDone = true;
}
 
#endregion

The following is the handler for a block of Post entities:

C#
private bool HandlePostBlock(PostServiceProxy svc, PostSet set, List<Post> l, 
                               bool totalKnown, ref double add_perc, ref Int64 added)
{
    // update the ParentId foreign key
    for (int i = 0; i < l.Count; i++)
    {
        if (l[i].ParentId != null)
        {
            int newkey;
            if (PostKeyMap.TryGetValue(l[i].ParentId.Value, out newkey))
                l[i].ParentId = newkey;
        }
    }
    // update the LastEditorUserId foreign key
    for (int i = 0; i < l.Count; i++)
    {
        if (l[i].LastEditorUserId != null)
        {
            int newkey;
            if (UserKeyMap.TryGetValue(l[i].LastEditorUserId.Value, out newkey))
                l[i].LastEditorUserId = newkey;
        }
    }
    // update the OwnerUserId foreign key
    for (int i = 0; i < l.Count; i++)
    {
        if (l[i].OwnerUserId != null)
        {
            int newkey;
            if (UserKeyMap.TryGetValue(l[i].OwnerUserId.Value, out newkey))
                l[i].OwnerUserId = newkey;
        }
    }
    int iptr = 0;
    int cnt = l.Count;
    for (int i = 0; i < cnt; i++)
    {
        // check to see whether or not the entity is already in the database
        // according to intrinsic ids: PostTypeId, OwnerUserId, Title and CreationDate
        var r = svc.LoadEntityByNature(cntx, l[iptr].PostTypeId, l[iptr].OwnerUserId, 
                                             l[iptr].Title, l[iptr].CreationDate);
        if (r != null && r.Count > 0)
        {
            // yes, remove it from the entity list to be added
            PostKeyMap.Add(l[iptr].Id, r[0].Id);
            l.RemoveAt(iptr);
            added++;
            //...
            //... codes used to update progress
            //...
        }
        else
            iptr++;
    }
    if (l.Count == 0)
        return true;
    for (int i = 0; i < l.Count; i++)
        l[i].UpdateIndex = i;
    // add the list of entities to the database
    var rs = svc.AddOrUpdateEntities(cntx, set, l.ToArray());
    if (rs != null && rs.ChangedEntities != null)
    {
        // check each entity's status and handle it
        foreach (var r in rs.ChangedEntities)
        {
            if ((r.OpStatus & (int)EntityOpStatus.Added) != 0 || 
                (r.OpStatus & (int)EntityOpStatus.Updated) != 0 || 
                (r.OpStatus & (int)EntityOpStatus.NoOperation) != 0 || 
                (r.OpStatus & (int)EntityOpStatus.NewPrimaryKey) != 0 || 
                (r.OpStatus & (int)EntityOpStatus.NewForeignKey) != 0)
            {
                if ((r.OpStatus & (int)EntityOpStatus.NewPrimaryKey) != 0)
                {
                    var olditem = l[r.UpdatedItem.UpdateIndex];
                    PostKeyMap[olditem.Id] = r.UpdatedItem.Id;
                }
                added++;
            }
            else
                InvalidPosts.Add(r);
        }
    }
    else
        return false;
    if (PostEntities.HasValue)
    {
        //...
        //... codes used to update progress
        //...
    }
    return true;
}

Note that the default behavior of the data service is to throw an exception when adding an invalid entity is attempted. This kind of behavior is not desirable when performing batch processing. The cntx parameter of type CallContext has a property called IgnoreInvalideItems which, when set to true (see the Init method), prevents the service from throwing exceptions when invalid entities are entered; instead, the status of the update is recorded in the OpStatus bit flags of an update result whose type is named after the pattern:

<entity name> + "UpdateResult"

where again <entity name> is the name of the entity under consideration.
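
For reference, the relevant part of the Init method is essentially the following single setting (a sketch; the rest of the initialization performed by Init is omitted):

C#
// Sketch only: configure the call context so that batch adds do not abort on
// the first invalid entity; failures are recorded in the OpStatus flags of
// the returned update results instead.
cntx.IgnoreInvalideItems = true;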

Other implementation details can be found inside the code, which is documented.

The data provider

Data providers should implement a predefined interface.

The data provider interface

A data provider should implement the interface IStackExchangeImportProvider in assembly StackExchangeShared. The document for the interface can be found here.

The DataSourceParams property is a data structure used to specify the data source, which in our current case is a set of xml files, so this property is assigned a string value representing the directory where the xml files can be found.

The UpdateEntityCount property is a callback used to update the entity count of a particular set. Data sets for which no item count is supplied will not be sent progress data.

The entity enumerator for each entity set is returned by calling a method named after the pattern:

"Get" + <entity name> + "s"

where <entity name> is the name of the entity of the corresponding set, e.g. GetPosts().

Since the Post set depends on itself, there is a post processing delegate, PostPostProcessor, which can be set by the implementer to do the post processing mentioned above.

Implementation of data provider

The entities in the PostType, PostHistoryType and VoteType sets have predefined values and primary keys, which should be retrieved from here because the values defined inside readme.txt are not in sync with the actual data inside the xml files. They are hard coded, for example:

C#
public IEnumerable<PostType> GetPostTypes()
{
    List<PostType> DataList = new List<PostType>();
    DataList.Add(new PostType { IsPersisted = true, Id = 1, Name = "Question" });
    DataList.Add(new PostType { IsPersisted = true, Id = 2, Name = "Answer" });
    DataList.Add(new PostType { IsPersisted = true, Id = 3, Name = "Wik" });
    DataList.Add(new PostType { IsPersisted = true, Id = 4, Name = "TagWikiExcerpt" });
    DataList.Add(new PostType { IsPersisted = true, Id = 5, Name = "TagWiki" });
    DataList.Add(new PostType { IsPersisted = true, Id = 6, Name = "ModeratorNomination" });
    DataList.Add(new PostType { IsPersisted = true, Id = 7, Name = "WikiPlaceholder" });
    DataList.Add(new PostType { IsPersisted = true, Id = 8, Name = "PrivilegeWiki" });
    if (UpdateEntityCount != null)
        UpdateEntityCount(EntitySetType.PostType, DataList.Count);
    return DataList;
}

The Badge, Comment, PostHistory, Post, User, and Vote entities are read from the corresponding xml files using a fast, forward-only reader: XmlReader. For example, for Post entities:

C#
public IEnumerable<Post> GetPosts()
{
    char cps = System.IO.Path.DirectorySeparatorChar;
    System.IO.Stream strm;
    string fname = dataPath + cps + "Posts.xml";
    if (System.IO.File.Exists(fname))
    {
        strm = new System.IO.FileStream(fname, System.IO.FileMode.Open, 
                                               System.IO.FileAccess.Read);
        if (UpdateEntityCount != null)
            UpdateEntityCount(EntitySetType.Post, GetRowCount(strm));
        XmlReaderSettings xrs = new XmlReaderSettings();
        xrs.IgnoreWhitespace = true;
        xrs.CloseInput = false;
        XmlReader xr = XmlReader.Create(strm, xrs);
        xr.MoveToContent();
        if (xr.ReadToDescendant("row"))
        {
            do
            {
                var e = new Post { IsPersisted = true };
                xr.MoveToAttribute("Id");
                e.Id = int.Parse(xr.Value);
                xr.MoveToAttribute("PostTypeId");
                e.PostTypeId = byte.Parse(xr.Value);
                if (xr.MoveToAttribute("LastEditorUserId"))
                    e.LastEditorUserId = int.Parse(xr.Value);
                if (xr.MoveToAttribute("OwnerUserId"))
                    e.OwnerUserId = int.Parse(xr.Value);
                if (xr.MoveToAttribute("ParentId"))
                    e.ParentId = int.Parse(xr.Value);
                xr.MoveToAttribute("Body");
                e.Body = xr.Value;
                e.IsBodyLoaded = true;
                if (xr.MoveToAttribute("AcceptedAnswerId"))
                    e.AcceptedAnswerId = int.Parse(xr.Value);
                if (xr.MoveToAttribute("LastEditorDisplayName"))
                    e.LastEditorDisplayName = xr.Value;
                if (xr.MoveToAttribute("OwnerDisplayName"))
                    e.OwnerDisplayName = xr.Value;
                if (xr.MoveToAttribute("Tags"))
                {
                    e.Tags = xr.Value;
                    handleTags(e.Id, e.Tags);
                }
                if (xr.MoveToAttribute("Title"))
                    e.Title = xr.Value;
                xr.MoveToAttribute("CreationDate");
                e.CreationDate = DateTime.Parse(xr.Value + "+00:00").ToUniversalTime();
                if (xr.MoveToAttribute("AnswerCount"))
                    e.AnswerCount = int.Parse(xr.Value);
                if (xr.MoveToAttribute("ClosedDate"))
                    e.ClosedDate = DateTime.Parse(xr.Value + "+00:00").ToUniversalTime();
                if (xr.MoveToAttribute("CommentCount"))
                    e.CommentCount = int.Parse(xr.Value);
                if (xr.MoveToAttribute("CommunityOwnedDate"))
                    e.CommunityOwnedDate = DateTime.Parse(xr.Value + "+00:00").ToUniversalTime();
                if (xr.MoveToAttribute("FavoriteCount"))
                    e.FavoriteCount = int.Parse(xr.Value);
                if (xr.MoveToAttribute("LastActivityDate"))
                    e.LastActivityDate = DateTime.Parse(xr.Value + "+00:00").ToUniversalTime();
                if (xr.MoveToAttribute("LastEditDate"))
                    e.LastEditDate = DateTime.Parse(xr.Value + "+00:00").ToUniversalTime();
                if (xr.MoveToAttribute("Score"))
                    e.Score = int.Parse(xr.Value);
                if (xr.MoveToAttribute("ViewCount"))
                    e.ViewCount = int.Parse(xr.Value);
                e.NormalizeValues();
                yield return e;
            } while (xr.ReadToNextSibling("row"));
        }
        strm.Close();
    }
}

Here one may have already noticed that the datetime properties, like CreationDate, are converted using

C#
DateTime.Parse(xr.Value + "+00:00").ToUniversalTime()

This is because 1) our data service uses Coordinated Universal Time (UTC) to record time, and 2) the date-time values inside the xml data files from SE do not contain time zone information, so we had to assume that they are also in UTC.
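
If one prefers to avoid the string concatenation, the same assumption can be expressed with DateTimeStyles; this is an equivalent alternative, not what the included code does:

C#
// Equivalent alternative (requires using System.Globalization): assume UTC
// when no offset is present in the value and return it as universal time.
e.CreationDate = DateTime.Parse(xr.Value, CultureInfo.InvariantCulture,
                     DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal);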

Another point to note is that the xml files from SE do not include explicit Tags and PostTags files; they must be derived from the Tags field of the Post data using the handleTags method:

C#
private SortedDictionary<string, List<int>> TagMap 
                                     = new SortedDictionary<string, List<int>>();
private Dictionary<int, List<int>> PostTag 
                                     = new Dictionary<int, List<int>>();
 
private void handleTags(int id, string tagstr)
{
    if (string.IsNullOrEmpty(tagstr))
        return;
    int ipos = tagstr.IndexOf("<");
    if (ipos == -1)
        return;
    tagstr = tagstr.Substring(ipos + 1);
    ipos = tagstr.IndexOf(">");
    Action<string> addtag = tag =>
    {
        List<int> l;
        if (!TagMap.TryGetValue(tag, out l))
        {
            l = new List<int>();
            TagMap.Add(tag, l);
        }
        l.Add(id);
    };
    while (ipos != -1)
    {
        string tag = tagstr.Substring(0, ipos).TrimStart('<');
        addtag(tag);
        tagstr = tagstr.Substring(ipos + 1);
        ipos = tagstr.IndexOf("<");
        if (ipos == -1)
            break;
        tagstr = tagstr.Substring(ipos + 1);
        ipos = tagstr.IndexOf('>');
    }
    if (!string.IsNullOrEmpty(tagstr))
    {
        addtag(tagstr.Trim("<>".ToCharArray()));
    }
}

where the tags and the post-tag associations are recorded in TagMap and PostTag, respectively.
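
The actual GetTags and GetPostTags methods are in the included source. Conceptually, GetTags simply walks TagMap and turns each distinct tag name into a Tag entity, as in the simplified sketch below (the Tag property names and the EntitySetType.Tag value are assumptions made for illustration), while GetPostTags analogously pairs each post id recorded under a tag with that tag:

C#
// Simplified sketch only; see the included source for the real implementation.
public IEnumerable<Tag> GetTags()
{
    if (UpdateEntityCount != null)
        UpdateEntityCount(EntitySetType.Tag, TagMap.Count); // assumed member
    int id = 1;
    foreach (var kv in TagMap)
        yield return new Tag { IsPersisted = true, Id = id++, Name = kv.Key };
}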

Finally, we must post process the AcceptedAnswerId property of the posts to map it to the corresponding new post ids:

C#
private void PostPostProc(CallContext cntx, IPostService2 svc, 
                                              Dictionary<int, int> map)
{
    QueryExpresion qexpr = new QueryExpresion();
    qexpr.OrderTks = new List<QToken>(new QToken[] { 
                             new QToken { TkName = "Id" },
                             new QToken { TkName = "asc" }
                         });
    qexpr.FilterTks = new List<QToken>(new QToken[] { 
                             new QToken { TkName = "AcceptedAnswerId" },
                             new QToken { TkName = "is not null" }
                         });
    PostSet set = new PostSet();
    set.PageSize_ = 50;
    set.PageBlockSize = 20;
    PostPage prev_page = null;
    PostPageBlock curr_pages = svc.NextPageBlock(cntx, set, qexpr, null);
    List<Post> l = new List<Post>();
    while (true)
    {
        for (int i = 0; i < curr_pages.Pages.Length; i++)
        {
            PostPage page = curr_pages.Pages[i];
            if (i == 0 && page.LastItem == null)
                continue;
            page.Items = svc.GetPageItems(cntx, set, qexpr, 
                                          prev_page == null ? null : 
                                          prev_page.Items[prev_page.Items.Count - 1]
                                         ).ToList();
            foreach (var e in page.Items)
            {
                int newid;
                if (map.TryGetValue(e.AcceptedAnswerId.Value, out newid))
                {
                    e.AcceptedAnswerId = newid;
                    e.IsAcceptedAnswerIdModified = true;
                    e.IsEntityChanged = true;
                    l.Add(e);
                    if (l.Count == 50)
                    {
                        svc.AddOrUpdateEntities(cntx, set, l.ToArray());
                        l.Clear();
                    }
                }
            }
            prev_page = page;
        }
        if (curr_pages.IsLastBlock)
            break;
        curr_pages = svc.NextPageBlock(cntx, set, qexpr, 
                                        prev_page.Items[prev_page.Items.Count - 1]);
        if (curr_pages == null || curr_pages.Pages.Length == 0)
            break;
    }
    if (l.Count > 0)
        svc.AddOrUpdateEntities(cntx, set, l.ToArray());
}

What it does is enumerate all posts whose AcceptedAnswerId is not null, update that value using the map parameter, and send the modified posts back to the database to be updated. Due to the service nature of the database, entities are retrieved from it in data blocks (pages) and updates are also performed in blocks.

The user interface

The user interface adopts the ModernUI style of WPF using the standard MVVM design pattern. It is multi-lingual. It assumes that there is one data source and possibly multiple target databases; if there is more than one database, only one of them should be enabled for importing.

Import page

Although it has nice progress reporting features, they are not discussed in more detail in the present article due to length considerations. A reader can study the included code to find the parts that are of interest to him/her.

The part that is directly relevant to data import is inside the DataImportPage class, in the OnImportData method, which is bound to the start button:

C#
private void OnImportData(object sender, RoutedEventArgs e)
{
    if (IsProcessing)
        return;
    bool loadstates = false;
    string statepath = AppContext.AppDataPath + AppContext.DefaultStateFile;
    if (System.IO.File.Exists(statepath))
    {
        var dr = MessageBox.Show(
                       Properties.Resources.PreviousStoppedImportWarning, 
                       Properties.Resources.WarningWord, MessageBoxButton.YesNoCancel);
        switch (dr)
        {
            case MessageBoxResult.Cancel:
                return;
            case MessageBoxResult.Yes:
                loadstates = true;
                break;
        }
    }
    IsProcessing = true;
    this.Cursor = Cursors.Wait;
    ImportContext.Current.SetupSyncProc();
    var t = Task.Factory.StartNew(() =>
    {
        ImportContext.Current.ProcessingStopped = false;
        if (loadstates)
            ImportContext.Current.Importer.LoadState(statepath);
        ImportContext.Current.Importer.StartProcess();
        if (!ImportContext.Current.ProcessingStopped)
        {
            ImportContext.Current.SetupSyncProc2();
            ImportContext.Current.Importer.StartProcess();
        }
    }).ContinueWith(task =>
    {
        Dispatcher.Invoke((Action)(() =>
        {
            ImportContext.Current.ShutdownSyncProc();
            this.Cursor = Cursors.Arrow;
            IsProcessing = false;
        }), null);
    });
}

Since the tags and post-tag association is not known before all posts are processed, the import proceeds in two stages:

  • Stage one: It imports the data that does not include the Tag and PostTag entity sets. Stage one is set up by calling the SetupSyncProc method of the ImportContext class.

    C#
    public void SetupSyncProc()
    {
        if ((from d in CommitedSinks where d.IsEnabled select d).Count() > 1)
        {
            MessageBox.Show(Properties.Resources.SelectOneSinkInfo, 
                                              Properties.Resources.WarningWord);
            return;
        }
        ProcessingStopped = false;
        var s = (from d in CommitedSinks where d.IsEnabled select d).SingleOrDefault();
        if (s != null)
        {
            char schar = System.IO.Path.DirectorySeparatorChar;
            DataProvider.DataSourceParams = Source.ImportFileDir.TrimEnd(schar);
            DataProvider.UpdateEntityCount = EntityCountUpdate;
            Importer.PostPostProcessor = DataProvider.PostPostProcessor;
            Importer.ProcessingStoppedHandler = StoppedHandler;
            Importer.PostTypeEnum = DataProvider.GetPostTypes();
            Importer.VoteTypeEnum = DataProvider.GetVoteTypes();
            Importer.PostHistoryTypeEnum = DataProvider.GetPostHistoryTypes();
            Importer.UserEnum = DataProvider.GetUsers();
            Importer.PostEnum = DataProvider.GetPosts();
            Importer.BadgeEnum = DataProvider.GetBadges();
            Importer.TagEnum = null;
            Importer.CommentEnum = DataProvider.GetComments();
            Importer.PostHistoryEnum = DataProvider.GetPostHistorys();
            Importer.VoteEnum = DataProvider.GetVotes();
            Importer.PostTagEnum = null;
            Importer.ErrorHandler = handleError;
            s.DelProcStateChanged = s.OnProcStateChanged;
            Importer.EnumStateChanged += s.DelProcStateChanged;
            s.DelProcProgressing = s.OnProcProgressing;
            Importer.EnumProgress += s.DelProcProgressing;
            Importer.EnumProgress += new Action<EnumProgressEvent>(SourceWalker_EnumProgress);
            s.DelImporterUpdateIncomplete = s.OnImporterUpdateIncomplete;
            Importer.UpdateIncomplete += s.DelImporterUpdateIncomplete;
            Importer.Init(s.ServiceAddress, null);
        }
    }    

    where TagEnum and PostTagEnum are set to null to avoid updating the corresponding data sets in this stage.

  • Stage two: It imports the Tag and PostTag entities that were collected while the Post entities were built from the corresponding xml nodes. Stage two is set up by calling the SetupSyncProc2 method of the ImportContext class.

    C#
    public void SetupSyncProc2()
    {
        if (ProcessingStopped)
            return;
        var s = (from d in CommitedSinks where d.IsEnabled select d).SingleOrDefault();
        if (s != null)
        {
            Importer.PostTypeEnum = null;
            Importer.VoteTypeEnum = null;
            Importer.PostHistoryTypeEnum = null;
            Importer.UserEnum = null;
            Importer.PostEnum = null;
            Importer.BadgeEnum = null;
            Importer.TagEnum = DataProvider.GetTags();
            Importer.CommentEnum = null;
            Importer.PostHistoryEnum = null;
            Importer.VoteEnum = null;
            Importer.PostTagEnum = DataProvider.GetPostTags();
        }
    }

    where all other "Enum" properties except TagEnum and PostTagEnum are set to null.

A plugin architecture

It is desirable to use a plugin framework like MEF to supply the data provider as needed after compilation. However, although the architecture is there, the current version of the importer sets the provider in code for simplicity, partly because there is only one kind of SE data dump at present. It will be upgraded to a truly pluggable one in later versions.
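
As an illustration of what such a plugin setup could look like (a sketch only, not how the current version works), the provider assembly would mark its implementation with [Export(typeof(IStackExchangeImportProvider))] and the importer would compose it from a plugin folder at start up; the folder name "Plugins" is just an example and the code requires a reference to System.ComponentModel.Composition:

C#
// Sketch only (not the current implementation): compose the data provider
// exported by a plugin assembly found in a "Plugins" folder.
private IStackExchangeImportProvider ComposeProvider()
{
    var catalog = new DirectoryCatalog("Plugins");
    var container = new CompositionContainer(catalog);
    return container.GetExportedValue<IStackExchangeImportProvider>();
}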

Setting up test environment

The demo data service

Extract the files from the data service package to a folder and configure a website for it (it is an ASP.NET MVC 5 web application). Enable HTTP activation of WCF on your system. That's basically it.

The included sample data was chosen because it is sufficiently small to be downloaded here at CodeProject. Its content could be quite useful if a reader is interested in developing mobile apps for SE, because it is from stackapps.com. It can serve either as the data source for testing the importation discussed here or as existing data to be loaded directly into the database to study the data itself.

The service is backed by a custom-built in-memory database that can load SE xml data files directly without using any import means. The data is inserted into internal data structures without any data integrity checks. Interested users can use it to load other SE data sets for querying. However, it will not perform well when the size of the data is large (say > 200 k), especially when data joins are required, since it does not currently use any indexing technologies to speed up the search.

The service supports unified full-text index search. The following is a pre-built index package for the stackapps.com data that is directly loaded (not the imported one, since importing will change the primary keys):

Please extract the files from the root of the compressed package to the root of the service site. Note that it is only for the stackapps.com data that is directly loaded.

To load data from other sites of SE, reset the service if it is already loaded and dump the data into the directory

App_Data\StackExchange\Data

and load it from inside of the "Data Source" page.

To reset the service, one can simply make a change to the root Web.config file, change it back, and then save the file. The loaded in-memory data will be cleared this way.

Using the importer

Although our custom in-memory database for SE can load xml data directly (and at very high speed), this is not easily done for other types of database engines. The latter case is where the generic importer is supposed to be used.

Before starting the import process, one needs to specify a source data directory for the selected SE data dump files and a list of target base URLs for the sink data services (for SE). When the number of target data services is greater than one, enable only one of them before proceeding.

Note: if you are using the custom in-memory database and it is loaded, reset the data service using the above means before importing.

Summary

Writing a data provider for data import is not hard for the custom-built SE data service, since most of the complex logic is already handled by the data service itself.

History

  • Version 1: Initial release.
  • Version 1.0.1: Incorrect behavior when adding or updating items, caused by a bug in the service JavaScript, is corrected. More views are added. Only the package for the service is changed.
  • Version 1.0.2: Overall system updates, bug fixes and feature enhancements.
  • Version 1.2.0: The importer and data service now run under .NET 4.5.1, and the latter also under ASP.NET MVC 5.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

