
Comments by Mr. xieguigang 谢桂纲 (Top 15 by date)

Mr. xieguigang 谢桂纲 6-May-15 11:32am
Control.BeginInvoke(MethodInvoker) is still not working; the code still hangs at this point.
Mr. xieguigang 谢桂纲 15-Nov-14 2:33am
Hey, I have worked out how to handle this ultra-large text file parsing job. It takes 3 steps:

1. Load the data into memory in chunks of 786MB, decode each chunk, and cache the decoded chunks in a list. It seems the UTF8.GetString function cannot handle input larger than about 1GB.
2. Use the regular expression to extract the sections. A regex match runs on a single thread, so running the matching through Parallel LINQ speeds this step up.
3. Parse the text of each section as I did before.
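
Step 2 can be sketched roughly as below. This is an illustration only, not the code from the post: the module name and the ExtractSections helper are hypothetical, and the sample chunks in Main are made up. It assumes the chunks have already been decoded and cached as strings (step 1) and reuses the section pattern from the post.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text.RegularExpressions

Module ParallelSectionDemo

    ' Same pattern as in the post: a section runs from "Query=" up to the
    ' "Effective search space used: <number>" line.
    Const SECTION_PATTERN As String = "Query=.+?Effective search space used: \d+"

    ' Hypothetical helper: returns every complete section found in the chunks.
    ' Each chunk is scanned by its own Regex.Matches call, so PLINQ can spread
    ' the chunks over the CPU cores. AsOrdered keeps the original chunk order.
    Function ExtractSections(chunks As IEnumerable(Of String)) As String()
        Dim opts As RegexOptions = RegexOptions.Singleline Or RegexOptions.IgnoreCase
        Dim query = From chunk In chunks.AsParallel().AsOrdered()
                    From m As Match In Regex.Matches(chunk, SECTION_PATTERN, opts)
                    Select m.Value
        Return query.ToArray()
    End Function

    Sub Main()
        ' Made-up sample chunks; the trailing "Query= C" section is incomplete
        ' and is therefore not matched.
        Dim chunks As New List(Of String) From {
            "junk Query= A" & vbLf & "hits..." & vbLf & "Effective search space used: 10" & vbLf & "tail",
            "Query= B" & vbLf & "Effective search space used: 20" & vbLf & "Query= C (incomplete)"
        }

        For Each section As String In ExtractSections(chunks)
            ' Print only the first line of each extracted section.
            Console.WriteLine(section.Split({vbLf}, StringSplitOptions.None)(0))
        Next
    End Sub
End Module
```

The parallelism here is per chunk, so it only helps when there are at least as many chunks as cores. Smaller chunks give better load balancing at the cost of more boundary handling.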

Here is my code:


''' <summary>
''' It seems that 786MB may be the upper bound for the Utf8.GetString function.
''' </summary>
''' <remarks></remarks>
Const CHUNK_SIZE As Long = 1024 * 1024 * 786
Const BLAST_QUERY_HIT_SECTION As String = "Query=.+?Effective search space used: \d+"

''' <summary>
''' Handles files larger than 2GB.
''' </summary>
''' <param name="LogFile"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Shared Function TryParseUltraLarge(LogFile As String, Optional Encoding As System.Text.Encoding = Nothing) As v228
Call Console.WriteLine("Regular Expression parsing blast output...")

'The default text encoding of the blast log is utf8
If Encoding Is Nothing Then Encoding = System.Text.Encoding.UTF8

Using p As Microsoft.VisualBasic.ConsoleProcessBar = New ConsoleProcessBar
Call p.Start()

Dim TextReader As IO.FileStream = New IO.FileStream(LogFile, IO.FileMode.Open)
Dim ChunkBuffer As Byte() = New Byte(CHUNK_SIZE - 1) {}
Dim LastIndex As String = ""
'Dim Sections As List(Of String) = New List(Of String)
Dim SectionChunkBuffer As List(Of String) = New List(Of String)

Do While TextReader.Position < TextReader.Length
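'Remaining bytes; declared as Long because the value exceeds Integer.MaxValue for files larger than 2GB
Dim Delta As Long = TextReader.Length - TextReader.Position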

If Delta < CHUNK_SIZE Then ChunkBuffer = New Byte(Delta - 1) {}

'Request the whole buffer; the third argument is a byte count, not the last index
Call TextReader.Read(ChunkBuffer, 0, ChunkBuffer.Length)

Dim SourceText As String = Encoding.GetString(ChunkBuffer)

If Not String.IsNullOrEmpty(LastIndex) Then
SourceText = LastIndex & SourceText
End If

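Dim i_LastIndex As Integer = InStrRev(SourceText, "Effective search space used:")
If i_LastIndex = 0 Then 'InStrRev returns 0 when nothing is found: there is no complete section within the current chunk
LastIndex = SourceText 'SourceText already starts with the previous leftover text
Continue Do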
Else
i_LastIndex += 42

If Not i_LastIndex >= Len(SourceText) Then
LastIndex = Mid(SourceText, i_LastIndex) 'The text at the end of this chunk belongs to a section that continues in the next chunk.
Else
LastIndex = ""
End If
Call SectionChunkBuffer.Add(SourceText)
End If

'This part of the code is non-parallel

'Dim SectionsTempChunk = (From matche As Match
' In Regex.Matches(SourceText, BLAST_QUERY_HIT_SECTION, RegexOptions.Singleline + RegexOptions.IgnoreCase)
' Select matche.Value).ToArray

'If SectionsTempChunk.IsNullOrEmpty Then
' LastIndex &= SourceText
' Continue Do
'Else
' Call Sections.AddRange(SectionsTempChunk)
'End If

'LastIndex = SectionsTempChunk.Last()
'Dim Last_idx As Integer = InStr(SourceText, LastIndex) + Len(La
Mr. xieguigang 谢桂纲 14-Nov-14 4:06am
I am trying your advice of splitting the big file into chunks and then searching within each chunk; maybe I can find a solution tonight. The difficulty is making the loading process parallel, otherwise the loading alone may take a whole day.
Mr. xieguigang 谢桂纲 14-Nov-14 3:57am
Each section starts with "Query=" and ends with "Effective search space used:". Here is an example:

blablablabla.........

Query= XC_0118 transcriptional regulator

Length=1113
Score E
Sequences producing significant alignments: (Bits) Value

lcl5167|ana:all4503 two-component response regulator; K07657 tw... 57.4 2e-009
lcl2658|cyp:PCC8801_3460 winged helix family two component tran... 56.6 4e-009
lcl8962|tnp:Tnap_0756 two component transcriptional regulator, ... 55.1 7e-009
lcl9057|kol:Kole_0706 two component transcriptional regulator, ... 55.1 1e-008
lcl9114|ter:Tery_2902 two component transcriptional regulator; ... 55.1 1e-008
lcl9051|trq:TRQ2_0821 two component transcriptional regulator (... 54.7 1e-008
lcl9023|tpt:Tpet_0798 two component transcriptional regulator (... 54.7 1e-008
lcl8929|tma:TM0126 response regulator; K02483 two-component sys... 54.3 1e-008
lcl8992|tna:CTN_0563 Response regulator (A)|[Regulog=ZnuR - The... 52.4 6e-008


blablablabla.........


> lcl5167|ana:all4503 two-component response regulator; K07657
two-component system, OmpR family, phosphate regulon response
regulator PhoB (A)|[Regulog=SphR - Cyanobacteria] [tfbs=all3651:-119;alr5291:-229;all4021:-136;alr5259:-329;all1758:-49;all0129:-101;all3822:-149;alr4975:-10;alr2234:-275;all0207:-98;all0911:-105;all4575:-324]
Length=253

Score = 57.4 bits (137), Expect = 2e-009, Method: Compositional matrix adjust.
Identities = 36/109 (33%), Positives = 52/109 (48%), Gaps = 1/109 (1%)

Query 3 LRSERVTQLGSVPRFRLGPLLVEPERLMLIGDGERITLEPRMMEVLVALAERAGEVISAE 62
LR +R+ L +P + + + P+ ++ G+ + L P+ +L A V S E
Sbjct 144 LRRQRLITLPQLPVLKFKDVTLNPQECRVLVRGQEVNLSPKEFRLLELFMSYARRVWSRE 203

Query 63 QLLIDVWHGSFYGDNP-VHKTIAQLRRKLGDDSRQPRFIETIRKRGYRL 110
QLL VW F GD+ V I LR KL D P +I T+R GYR
Sbjct 204 QLLDQVWGPDFVGDSKTVDVHIRWLREKLEQDPSHPEYIVTVRGFGYRF 252


blablablabla.........


Lambda K H a alpha
0.321 0.133 0.395 0.792 4.96

Gapped
Lambda K H a alpha sigma
0.267 0.0410 0.140 1.90 42.6 43.6

Effective search space used: 1655396995

blablablabla.........


And yes, this may be a solution, but it cannot run in parallel. With a For Each loop the program uses only one CPU core, which makes processing a 100GB text file impossible.
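
One possible way around the single-core limit is to keep the reading sequential but hand the per-section parsing to PLINQ. The sketch below is only an illustration under assumptions: ReadSections and ParseSection are hypothetical names, ParseSection is a stand-in for the real parser, and it assumes that "Query=" and "Effective search space used:" each begin at the start of a line, as in the example above. For a real file, the in-memory sample lines would be replaced by File.ReadLines(path), which streams the file instead of loading it whole.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text

Module StreamingSectionDemo

    ' Lazily yields one section at a time: from a line starting with "Query="
    ' through the next line starting with "Effective search space used:".
    Iterator Function ReadSections(lines As IEnumerable(Of String)) As IEnumerable(Of String)
        Dim current As StringBuilder = Nothing

        For Each line As String In lines
            If line.StartsWith("Query=") Then current = New StringBuilder()
            If current Is Nothing Then Continue For   ' text outside any section

            current.AppendLine(line)

            If line.StartsWith("Effective search space used:") Then
                Yield current.ToString()
                current = Nothing
            End If
        Next
    End Function

    ' Hypothetical stand-in for the real per-section parser.
    Function ParseSection(section As String) As Integer
        Return section.Length
    End Function

    Sub Main()
        ' Made-up sample; for a real log use System.IO.File.ReadLines(path).
        Dim lines As String() = {
            "header text",
            "Query= XC_0118 transcriptional regulator",
            "Length=1113",
            "Effective search space used: 1655396995",
            "text between sections",
            "Query= another query",
            "Effective search space used: 42"
        }

        ' One thread reads and splits; the PLINQ workers parse the sections.
        Dim results = ReadSections(lines).AsParallel().AsOrdered().
            Select(Function(s) ParseSection(s)).ToArray()

        Console.WriteLine("Parsed {0} sections", results.Length)
    End Sub
End Module
```

This only pays off when parsing a section costs more than reading it. ToArray keeps every result in memory, so the per-section result should be small, or the results should be written out as they arrive.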
Mr. xieguigang 谢桂纲 14-Nov-14 3:50am
Yes, IO.File.ReadAllText is perfect and clean for files below 2GB, but it crashes when the file is very large. I think MS should improve the .NET classes for processing ultra-large text files.
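
For files too large for ReadAllText, one option is to read the text in fixed-size character chunks through a StreamReader. The sketch below is an illustration, not a tested solution for 100GB inputs: ReadChunks is a hypothetical helper and the chunk size in Main is tiny only to keep the demo short. StreamReader keeps its decoder state between reads, so a multi-byte UTF-8 character that straddles two reads is decoded correctly, which is not guaranteed when raw byte chunks are decoded separately.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Text

Module ChunkedReadDemo

    ' Hypothetical helper: yields the file content as strings of at most
    ' chunkChars characters each, without loading the whole file.
    Iterator Function ReadChunks(fileName As String, chunkChars As Integer) As IEnumerable(Of String)
        Using reader As New StreamReader(fileName, Encoding.UTF8)
            Dim buffer(chunkChars - 1) As Char
            Do
                ' ReadBlock fills the buffer unless the end of the file is reached.
                Dim n As Integer = reader.ReadBlock(buffer, 0, buffer.Length)
                If n = 0 Then Exit Do
                Yield New String(buffer, 0, n)
            Loop
        End Using
    End Function

    Sub Main()
        ' Round-trip a small temporary file containing multi-byte characters.
        Dim tempFile As String = Path.GetTempFileName()
        File.WriteAllText(tempFile, "héllo wörld", Encoding.UTF8)

        Dim rebuilt As New StringBuilder()
        For Each chunk As String In ReadChunks(tempFile, 4)
            rebuilt.Append(chunk)
        Next

        Console.WriteLine(rebuilt.ToString() = "héllo wörld")
        File.Delete(tempFile)
    End Sub
End Module
```

A chunk boundary can still fall in the middle of a section, so the leftover text after the last "Effective search space used:" line has to be carried over to the next chunk, as the code earlier in this thread does.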