|
I always wish that questions like this were expanded with something like "and which specific qualities of the language make it particularly well suited for my problem area?"
Or, turned around: "Which specific language qualities are essential to solve problems in [data analysis], and which languages offer these qualities?"
Even if the question is not phrased that way, I always wish that those who provide answers would pretend that it was, and answer accordingly.
|
|
|
|
|
|
(Sounds like a leading question to me.)
Don't get me started...
I thought R was the language of choice for that. But really, I don't think any particular language should be. I use C# to do that.
Any off-the-shelf analysis platform can do only so much and get you so far. Then you will always need to go deeper depending on what you find on the surface. For that, you'll need a proper general-purpose programming language; not a scripting language (Python, ptui) and not some analysis-specific platform.
|
|
|
|
|
https://www.datacamp.com/blog/top-programming-languages-for-data-scientists-in-2022
Python is first, R second, according to this site. Python, due to its increased versatility over other languages.
I don't see C# anywhere.
I love C# and primarily use it for most of my work, but just because I love a specific tool, does not mean it is the tool for everything (i.e. hammer, saw, screwdriver, etc.).
I'm not a data scientist or analyst, so I Google these things, because I have no clue.
|
|
|
|
|
Slacker007 wrote: I don't see C# anywhere.
I'm rather disappointed FORTRAN isn't on that list.
Jeremy Falcon
|
|
|
|
|
|
Shouldn't that be "Go Forth and multiply"?
|
|
|
|
|
more likely, embed thyself
|
|
|
|
|
PIEBALDconsult wrote: I use C# to do that.
As those who know, know...
Q: What's the best language to do XYZ in?
A: The language you're most of an expert in.
So, totally agree. That's the only reason I still haven't learned Rust yet. Even though some people go on about Rust being safer, etc., those are problems I've already solved in C over the decades. Still tempting to learn Rust, and if I didn't know another lower-level language I most likely would. I just don't have the need to. Rust is like C/C++ and JavaScript had a baby... which should be cool. Just don't have a need to learn it.
However, to the original point, Python is so dang popular with big data, he'll be sure to find plenty of libraries to help along the way. So, I can also see the appeal if you don't have years of code lying around from your hardcore days for that one shiny moment in time to be used again.
Jeremy Falcon
|
|
|
|
|
Jeremy Falcon wrote: if you don't have years of code
Or coding experience.
All of these off-the-shelf systems (data analysis or ETL in particular) are there to help beginners make a start, but they can be a detriment if the user never learns to do it from scratch.
A custom system may take longer to get going, but it can (ideally) do exactly what is needed.
|
|
|
|
|
PIEBALDconsult wrote: A custom system may take longer to get going, but it can (ideally) do exactly what is needed.
Kinda like a custom-built PC.
Jeremy Falcon
|
|
|
|
|
I'm not a fan of Python, but when it comes to big data it's extremely popular. So, you'll find a lot of tools, online docs, etc. to work with.
Jeremy Falcon
|
|
|
|
|
Jeremy Falcon wrote: extremely popular
Popularity does not imply suitability.
Python itself can't do very much and any heavy lifting has to be done in a more powerful language.
|
|
|
|
|
PIEBALDconsult wrote: Popularity does not imply suitability.
Yes and no. You gotta look at it from the n00b's standpoint. Popularity does imply there are more libraries available for it that would be useful or suitable. And it implies it would be easier to learn, with more resources available. Even if, say, the language took like 2 more lines per concept to code or whatever. There's usually more than one factor to consider.
Jeremy Falcon
|
|
|
|
|
Answers to questions like this usually have two major elements (or maybe only one of them):
First: "It is the language everyone is using!" Ten to twenty years ago, the obvious answer would be C/C++, regardless of problem area. Thirty to forty years ago, if you asked about "data analysis", maybe Cobol would be what everyone was using. (For numerical problems, Fortran was The Answer.) Today, it is next to impossible to make Python programmers identify any application area where Python is not the best.
Second: "The function and class libraries for the language are excellent!" This may be a more valid argument than "Everyone uses it". To some degree, it can put your fortune into the hands of library writers of various qualities. Note that some languages require libraries written specifically for that language (and conversely: the library cannot be used with any other language), while other libraries are written to language independent interface conventions and may be available from a multitude of programming languages. (The latter was the norm 20-30 years ago; it has been on the decline since.)
Neither argument group says anything about the language as such. Both refer to the 'ecosystem', rather than language. Often, the ecosystem is the more essential. You take it, regardless of the quality of the language that goes with it.
|
|
|
|
|
Depends on the "data" and the objective.
I would argue that Excel and MS Access are adequate for a lot of situations. "Analysis" could mean simply coming up with some totals (i.e., SQL).
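As a rough illustration of "analysis as totals", here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample rows are made up for the example; the point is only that a single GROUP BY already covers this kind of question.

```python
import sqlite3

# Hypothetical sample data; table and column names are invented for illustration.
rows = [
    ("north", 120.0),
    ("south", 80.5),
    ("north", 30.0),
    ("east", 45.25),
]

def totals_by_region(records):
    """Return {region: total} using a plain SQL GROUP BY on an in-memory DB."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
        cur = conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
        )
        return dict(cur.fetchall())
    finally:
        conn.close()

print(totals_by_region(rows))
```

The same query would work unchanged against most SQL databases, which is part of the argument: the language wrapped around it matters less than the data store.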
"Before entering on an understanding, I have meditated for a long time, and have foreseen what might happen. It is not genius which reveals to me suddenly, secretly, what I have to say or to do in a circumstance unexpected by other people; it is reflection, it is meditation." - Napoleon I
|
|
|
|
|
That is definitely a good point.
Because I've been doing a lot of ETL work over the past ten years, I usually think in terms of analyzing a new data source to determine what datatypes and data quality checking will be required, not analyzing data to find trends and such.
I did have to do some of the latter to determine trends in log files -- record count growth and such. But never on the actual incoming data itself.
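The "what datatypes and quality checks will this source need" step can be sketched roughly as below. This is only an assumed approach, not the poster's actual tooling: the function names, the sample CSV, and the single accepted date format are all hypothetical.

```python
import csv
import io
from datetime import datetime

def infer_type(value):
    """Classify one raw string as 'empty', 'int', 'float', 'date' or 'text'."""
    v = value.strip()
    if v == "":
        return "empty"
    try:
        int(v)
        return "int"
    except ValueError:
        pass
    try:
        float(v)
        return "float"
    except ValueError:
        pass
    try:
        datetime.strptime(v, "%Y-%m-%d")  # assumed date format for this sketch
        return "date"
    except ValueError:
        return "text"

def profile(csv_text):
    """Return {column: {type_name: count}} for every column in the CSV text."""
    report = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for col, raw in row.items():
            counts = report.setdefault(col, {})
            kind = infer_type(raw or "")
            counts[kind] = counts.get(kind, 0) + 1
    return report

# Hypothetical three-row source with one blank and two dirty values.
sample = "id,amount,when\n1,9.50,2023-08-21\n2,,2023-08-22\nx,3,not a date\n"
print(profile(sample))
```

A column that reports more than one type (here `id` is mostly int with one text value) is a candidate for a data quality rule before loading.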
|
|
|
|
|
I'll take just about any bet that VBA is still being used for a whole lot of "very critical" things.
|
|
|
|
|
So, what is D.Analysis?
How do we answer this question with zero parameters? For starters, what is the data being stored in? Personally, I think the DB is the first thing to consider. I've done some pretty heavy lifting using PostgreSQL and a few lines of C#. I'm out of touch with SQL Server but I'd guess it has the same or more functionality.
I'd need to know the size of the data in both width and record count, as well as the ultimate goal of the analysis. I can't pre-spec a language with the question asked. If I had to, I'd go with the language you are most proficient with as learning a new language is probably not practical in the real world.
|
|
|
|
|
"Best" is quite subjective. Even applying the scientific method to get an answer is likely to yield multiple top results with differences too small to matter.
When asked within the whole of the software development life cycle (SDLC), there are other considerations in determining what is best for you, your team, and your project.
What language that you know, or can easily learn, has an efficient and relatively simple, repeatable, and programmatically configurable library for data access?
Of the answers to that question, which libraries offer (in the context of the SDLC and your project) the blend of simplicity, scalability, performance, and supportability?
For me, since I use .NET and C#, I use the SqlClient library for whatever DB I am working with, wrapped in a simple, straightforward data access layer with transactional support and parameters to avoid SQL injection. I do not use Entity Framework for production apps, since it tends to have higher support costs as a project evolves after the first production release, can generate some awful SQL, and does not scale well, besides being slower for "real world" CRUD uses.
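The shape of such a layer can be sketched in a few lines. This is not the C#/SqlClient code described above; it is a stand-in using Python's built-in sqlite3 module, and the `DataAccess` class, its method names, and the `users` table are hypothetical. It shows the two properties mentioned: parameters instead of string concatenation, and all-or-nothing transactions.

```python
import sqlite3

class DataAccess:
    """Tiny data access layer: parameterized statements, one transaction per batch."""

    def __init__(self, conn):
        self.conn = conn

    def execute_all(self, statements):
        """Run (sql, params) pairs atomically; roll back everything on any error."""
        with self.conn:  # commits on success, rolls back if an exception escapes
            for sql, params in statements:
                self.conn.execute(sql, params)

    def query(self, sql, params=()):
        """Run a parameterized SELECT and return all rows."""
        return self.conn.execute(sql, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT NOT NULL)")
dal = DataAccess(conn)

# The hostile-looking input is bound as data, so it is stored, not executed.
hostile = "x'); DROP TABLE users; --"
dal.execute_all([("INSERT INTO users (name) VALUES (?)", (hostile,))])

# A failing batch leaves no partial rows behind.
try:
    dal.execute_all([
        ("INSERT INTO users (name) VALUES (?)", ("alice",)),
        ("INSERT INTO users (name) VALUES (?)", (None,)),  # violates NOT NULL
    ])
except sqlite3.IntegrityError:
    pass

print(dal.query("SELECT name FROM users"))
```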
Other developers will have other preferences, based on what they know, what kind of projects they use, and level of experience in a broad range of projects with the tools they select. The right answer is what works best for the developers on the project to deliver a production app that works, that scales, that is reliable, and that has the lowest SDLC cost for updates and extensions as the app matures and changes over time.
I know that is not a simple answer, but our discipline is not a simple one, nor one with which we can be successful by using "cookie cutter" designs or following "recipes" as if we were simply assembling widgets.
To get to your question, having obtained the data, you can apply the same principles to what you use to analyze the data.
modified 21-Aug-23 10:34am.
|
|
|
|
|
Just split a big job into three main threads in Windows. The first of them took longer to run than the other two (on a 4-core processor). Unfortunately, I can't exactly quantify it without re-running the almost 8 hour jobs on my 3.2 GHz machine. But two of the threads completed about an hour or more before the first thread! The last two completed at roughly the same time, far before the first of them.
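One way to avoid having to re-run a job just to quantify it is to have each thread record its own wall-clock time. A minimal sketch of that pattern follows; the workloads are stand-ins (the real jobs are not known), and the helper names are hypothetical. Note that in standard CPython, CPU-bound threads take turns under the global interpreter lock, so this shows the measurement pattern only, not real parallel speed.

```python
import threading
import time

def timed_worker(name, work, results):
    """Run work() and record its result and wall-clock duration under name."""
    start = time.perf_counter()
    value = work()
    results[name] = (value, time.perf_counter() - start)

def run_in_threads(jobs):
    """Start one thread per (name, callable) job; return {name: (result, seconds)}."""
    results = {}
    threads = [
        threading.Thread(target=timed_worker, args=(name, work, results))
        for name, work in jobs
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Stand-in workloads of unequal size; purely illustrative.
jobs = [
    ("first", lambda: sum(range(300_000))),
    ("second", lambda: sum(range(100_000))),
    ("third", lambda: sum(range(100_000))),
]
timings = run_in_threads(jobs)
for name, (value, seconds) in timings.items():
    print(f"{name}: {seconds:.4f}s")
```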
|
|
|
|
|
Could be a ton of reasons why, but assuming it's the same code in all three threads, keep in mind there is such a thing as processor affinity and thread affinity. The whole concept of parallelism is a mirage. Computers are just so fast at context switching it seems that way, but this concept is why they have thread scheduling, etc.
In theory it's a lot like preemptive multitasking in Windows, but it's on a hardware level and thus much, much quicker with less crap in the way. But, even with a multi-core CPU, something's gotta manage what gets run.
At least that's how it was back in my multithreaded days before multi-core became the norm but hyperthreading was a thing. These days in JavaScript land, threads scare people.
Jeremy Falcon
|
|
|
|
|
Let us also not forget there are thousands of other threads running in Windows that your process has to share those four cores with.
|
|
|
|
|
Quote: into three main threads in Windows.
Do you mean that you created _three_ threads in your _one_ process?
If so, don't be surprised by your measurements.
My experience with Windows is: if you want to use the full CPU, spread the work over multiple processes instead of multiple threads in one process.
|
|
|
|
|
Every I/O is an interrupt.
|
|
|
|
|