|
I am working on a data mining project. My company has given me dummy Twitter data for 6 million customers, and I have been assigned to measure the similarity between any two users. Can anyone give me some ideas on how to deal with such a large community data set? Thanks in advance.
Problem: I use tweets and hashtags (hashtags are the words highlighted by the user) as the two criteria for measuring the similarity between two users. The number of users is large, and each user may have millions of hashtags and tweets. Can anyone tell me a fast way to calculate the similarity between two users? I have tried using TF-IDF to calculate the similarity, but it seems infeasible at this scale. Does anyone have a fast algorithm or a good idea that would let me quickly find all the similarities between users?
For example:
user A's hashtag = {cat, bull, cow, chicken, duck}
user B's hashtag ={cat, chicken, cloth}
user C's hashtag = {lenovo, Hp, Sony}
Clearly, C has no relation to A, so calculating their similarity would be a waste of time; we could filter out all unrelated users before calculating any similarity. In fact, more than 90% of all users are unrelated to any particular user. How can I use hashtags as a criterion to quickly find the group of users potentially similar to A? Is this a good idea, or should we just directly calculate the similarity between A and all other users? Which algorithm would be fastest for this problem?
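One possible way to do the filtering described above is an inverted index from hashtag to users, so that only users sharing at least one hashtag with A are ever scored. The sketch below is a minimal illustration in Python, not a tuned solution for 6 million users; the function names and the choice of Jaccard similarity as the score are assumptions, not something given in the question.

```python
from collections import defaultdict

def build_inverted_index(user_tags):
    """Map each hashtag to the set of users who used it."""
    index = defaultdict(set)
    for user, tags in user_tags.items():
        for tag in tags:
            index[tag].add(user)
    return index

def jaccard(a, b):
    """Jaccard similarity: size of intersection / size of union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similar_users(target, user_tags, index):
    """Score only the users that share at least one hashtag with `target`."""
    candidates = set()
    for tag in user_tags[target]:
        candidates |= index[tag]
    candidates.discard(target)
    scores = {u: jaccard(user_tags[target], user_tags[u]) for u in candidates}
    # Highest similarity first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# The three users from the example in the question.
user_tags = {
    "A": {"cat", "bull", "cow", "chicken", "duck"},
    "B": {"cat", "chicken", "cloth"},
    "C": {"lenovo", "Hp", "Sony"},
}
index = build_inverted_index(user_tags)
result = similar_users("A", user_tags, index)
print(result)
```

With this data, C is never compared against A at all, because the two share no hashtag; only B is scored.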
|
|
|
|
|
Is your company going to give your salary to anyone here for solving this? It's your job after all, not ours.
|
|
|
|
|
No, I am a university student, and I do not get any salary. I just want to discuss this with some coding pros and smart people. I would really appreciate it if someone could give me some ideas. I think the forum is for discussing programming questions, where we can help each other and improve our programming skills. I hope some capable coders can give me a few hints. Thanks.
|
|
|
|
|
You should eliminate trivial words like 'a', 'and', etc.
Then research matching algorithms; I would start with the following Google search string:
algorithms for set matching -string
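As a rough illustration of eliminating trivial words, here is a minimal Python sketch. The stop-word list and the `tokenize` helper are assumptions made up for this example; a real project would use a much longer list.

```python
import re

# Illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}

def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9#']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

tokens = tokenize("A cat and a duck walked to the #farm")
print(tokens)
```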
|
|
|
|
|
Yes, I will definitely have to use strings and arrays to process the data. However, I don't know exactly how to do it; the idea is not clear yet. Thanks very much for your reply.
|
|
|
|
|
Well, you could try to find the similarity, or "document distance", between Twitter users by matching their tweets against each other, much like the way one searches for plagiarism; perhaps that might work. You could start out by searching the tweets of a particular Twitter user using some sort of application. If I am not mistaken, I believe Twitter has something like this available. Comparisons between groups can be carried out the same way, which gives us the similarity or "document distance" of Twitter users.
April
Comm100 - Leading Live Chat Software Provider
modified 27-May-14 8:34am.
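For reference, one common reading of "document distance" is the angle between word-count vectors. The Python sketch below assumes that interpretation and a naive whitespace tokenizer; the function names are hypothetical, and nothing here is specific to any Twitter API.

```python
import math
from collections import Counter

def term_vector(tweets):
    """Count how often each word occurs across all of a user's tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.lower().split())
    return counts

def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

def document_distance(v1, v2):
    """Angle in radians: 0 for the same direction, pi/2 when no words are shared."""
    # Clamp to 1.0 to guard against floating-point rounding just above 1.
    return math.acos(min(1.0, cosine_similarity(v1, v2)))

a = term_vector(["cat duck", "cat cow"])
b = term_vector(["cat duck"])
c = term_vector(["lenovo sony"])
print(cosine_similarity(a, b), document_distance(a, c))
```

Because the dot product only visits words present in both vectors, users with no words in common cost almost nothing to compare.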
|
|
|
|
|
Thanks very much for your suggestion. I will do some research on document distance. With such a huge amount of data, a naive approach is definitely infeasible, so I have to find a good idea for how to implement it. The focus of the project is the idea; the coding should be very simple, but if the idea is poor, the whole project becomes useless. I really appreciate your suggestion.
|
|
|
|
|
You're very welcome! It was what initially popped into my head, though I believe there is probably a stronger, more suitable way to carry out such a project given the large amounts of data you will be dealing with.
I find your project quite interesting!
Best of Luck!
With Kind Regards,
April
modified 27-May-14 8:33am.
|
|
|
|
|
Take a look at the Levenshtein distance
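For readers unfamiliar with it, the Levenshtein distance counts the single-character edits needed to turn one string into another. The Python sketch below is a standard two-row dynamic-programming version. Using it to treat near-identical hashtags (for example misspellings) as the same tag is an assumption about how it might apply here, since it compares strings rather than sets of hashtags.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn s into t."""
    # previous[j] holds the distance between the processed prefix of s and t[:j].
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

print(levenshtein("chicken", "chickens"))
```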
|
|
|
|
|
What is the best approach for client-server application?
I need to start developing a website + database + CRM + back office.
I am developing in C#, working with SQL Server 2008.
I would like someone to advise me on how to build my server side: smart&simple or hard&complex ?
Which is better: building an entity for each stored procedure I have, or deploying all the needed tables to the client side and letting the client handle the data?
Please help me
thanks
Best regards
Adam
|
|
|
|
|
bugal wrote: What is the best approach for client-server application?
FIRST, collect requirements.
SECOND, create the architecture and/or design. Which of these is needed depends on the requirements.
THIRD, based on the architecture/design decide what technologies to use.
bugal wrote: or hard&complex ?
Very hard/complex when one skips the first two steps above.
|
|
|
|
|
An object called MovableObject can move by different methods: legs, wheels, or wings.
When we choose one method, e.g. MoveByLegs, and pass it the legs, the other two are unused.
Should I design it like this:
enum MoveType
{
    MoveByLegs,
    MoveByWheels,
    MoveByWings
}
class MovableObject
{
    List<Leg> lstLegs;
    List<Wheel> lstWheels;
    List<Wing> lstWings;
    MoveType m_Movetype;
    void EnableMovement(MoveType movetype_in, object obj_in)
    {
        m_Movetype = movetype_in;
        // Only the list matching the chosen move type is set; the other two stay unused.
        switch (movetype_in)
        {
            case MoveType.MoveByLegs:
                lstLegs = (List<Leg>)obj_in;
                break;
            case MoveType.MoveByWheels:
                lstWheels = (List<Wheel>)obj_in;
                break;
            case MoveType.MoveByWings:
                lstWings = (List<Wing>)obj_in;
                break;
        }
    }
    void Move()
    {
        if (m_Movetype == MoveType.MoveByLegs)
        {
        }
        // similar cases for MoveByWheels and MoveByWings
    }
}
Starting to think people post kid pics in their profiles because that was the last time they were cute - Jeremy.
|
|
|
|
|
Hmmm, you're mixing and matching things a lot here. As you're creating movable "things", you should consider the fact that each one of these is a separate movable type. This suggests you should drop the enum and change the design to something like this:
public abstract class MovableObject
{
}
public class Legs : MovableObject
{
}
public class Wheels : MovableObject
{
}
And there you have it: it's a lot cleaner and simpler to work with OO features.
|
|
|
|
|
Hmm, let me explain it visually.
The movable object = an Egg.
If you attach a set of legs, it walks on legs.
If you attach a set of wheels, it rolls on wheels.
And when you attach a pair of wings, it flies with wings.
So we cannot inherit legs, wings, or wheels from the Egg.
The Egg _has_ all these, one at a time. Not all at the same time.
|
|
|
|
|
My design still works - change MovableObject to MovableEgg in my example and you see that it still stands (so to speak). Consider this example:
public abstract class MovableEgg
{
    // Each kind of egg supplies its own way of moving.
    public abstract void Move();
}
public class EggWithLegs : MovableEgg
{
    public override void Move()
    {
        Console.WriteLine("I'm walking");
    }
}
public class EggWithWings : MovableEgg
{
    public override void Move()
    {
        Console.WriteLine("I'm flying");
    }
}
|
|
|
|
|
I guess the egg is an object with many things in it other than how it moves. Does the way it moves affect other things, like graphics etc.? If yes, look for the "component design pattern".
If it is just a different way of movement, then you can use this:
interface IMovementMethod
{
    void Move(Egg anEgg, ...);
}
class Egg
{
    public IMovementMethod MovementMethod { get; set; }
    public void Move(...)
    {
        // Delegate to whichever movement method is currently attached.
        MovementMethod.Move(this, ...);
    }
}
This way you can change the MovementMethod property if you want to change the way it moves.
|
|
|
|
|
VuNic wrote: Should I design it like this:
Movement is an attribute not an entity.
So
Entity (not MovableObject) 'has' movement.
Thus objects would be
Entity
MovementWings
MovementLegs
MovementWheels.
And you set it by calling
entity.setMovement(Movement)
The Movement itself is either enabled or disabled (where 'enabled' means movement is actually allowed), so each Movement object would have a property to enable it. Entity could then call that if it has a Movement (if the internal variable is not null).
|
|
|
|
|
I need to explain more about the requirements. I'll come back and write them down soon. Thanks for your replies.
|
|
|
|
|
Better example Here[^]
|
|
|
|
|
Except of course that has nothing to do with movement.
|
|
|
|
|
Hello: I'm working on a large database project that is currently in VBA/Access 2007. I intend to move it to VB.Net/Access 2007, as I think that might be a better idea, but I'm not sure. This program is for a large law case, and I will be required to deploy it by sending it out on disks to various other firms across the country. I'm a bit rusty with this, so I want to know the best way to handle the solution. If all these other firms don't have Access, what other type of database should I use? What is the best practice for deployment?
|
|
|
|
|
What does "large" mean? Presumably you mean a lot of data versus a lot of users.
Using .Net should eliminate the problem of them not having MS Access. MS Access is the GUI part with some additional functionality, but it is not required to actually access the data. However, testing is always a good idea.
Member 8385949 wrote: If all these other firms don't have Access, what other type of database should I use?
Alternatives really depend on exactly what the application is doing and what "large" means.
|
|
|
|
|
Do not use Access, for the precise reason you have indicated: the client will potentially need the exact version of office you are using.
I would look at SQL Server CE; there is a 4 GB limit, but I suspect this would not be a problem. There are also a number of other small, embeddable databases available.
When completed and tested, deploy using an installation project or package.
Let me emphasise Access is WRONG for this job!
Never underestimate the power of human stupidity
RAH
|
|
|
|
|
Mycroft Holmes wrote: the client will potentially need the exact version of office you are using.
That would only be the case if one used some fairly esoteric features, and I doubt those are available via the normal database access of C#/.Net.
|
|
|
|