Monday, November 10, 2008

Old Copyright Laws Hurt Research

Note: Thanks to my brother, Josh, for his comments.  Josh is an IP lawyer in Chicago, IL.

Recently, a question was posed on a research mailing list that went more or less as follows: the researcher was conducting a listening experiment, and there was a chance that the subjects could find and keep the 15-second excerpts for personal use.  The author was worried that this constituted a copyright violation.  I pointed out that more than likely this falls under fair use.  However, reading the fair use webpage gives one clear impression: the law itself is rather meaningless.  First, the law only stipulates what needs to be considered in evaluating fair use, without giving guidelines or specifics.  The webpage states "There is no specific number of words, lines, or notes that may safely be taken without permission," and that it is best to obtain permission from the copyright holder.  Further, the precedent cited offers only a partial list of examples that were relevant in 1961.

These points are key to researchers in information retrieval (and in particular, music information retrieval) because these laws were based on 1960s technology.  Simply put, exchanging songs, text, images, etc., was a rather involved task.  Today, exchange and storage can be conducted on a massive scale, unforeseen by the lawmakers of fifty years ago.  With this increased capacity for storage, researchers can now test large-scale IR algorithms, and they need a (relatively) free, large-scale database to do so.  However, in the case of music, such large-scale databases are either impossible to find or come with severe restrictions.  Every year, I see experiment after experiment of promising algorithms, but the results can only be taken so far because of the size and scope of the testing database.  Even though some schools have access to a large library archive of recordings, researchers at other institutions are unable to duplicate their results because the data is not freely available.

Some researchers have found "loopholes" that allow them to share features extracted from audio, which cannot be used to recreate the audio (e.g., Mel-cepstral coefficients).  This is still not a viable solution because no one can determine a priori the best features for all IR experiments, and experimentation with new features is impossible.  Also, a set of features that is reversible in combination could potentially lead to the best results, but this is impossible to test if only a limited set of features is ever distributed.

A very interesting solution comes in the form of MIREX, where a TREC-like evaluation is conducted by having researchers send in algorithms to various competitions.  However, there are a few drawbacks.  First, it is an enormous burden on the sponsoring institution, IMIRSEL at The University of Illinois.  Its livelihood is also completely dependent on the program's funding, which is fine for the next few years, but long-term stability is not guaranteed.  Second, the evaluation is carried out only once a year, although there has been talk of extending this to a rolling model.  A third problem is that tasks are largely fixed, and a new task is only considered if it has broad approval.  New and interesting tasks are still limited to small, private databases until they are included as a task.

I applaud those at IMIRSEL for coming up with the proposed solution and also those who supply databases in some form or another, but these are patches to the main problem, which, as I have stated, is that copyright regulations are severely out-of-date.  Simply put, when today's regulations were implemented, no one imagined the scalability of today's information age.  Better regulations are needed not only for the public sector, to address today's file-sharing "problem," but also for today's researchers.

The problem ultimately stems from the current practice of common law.  Simply put, our current laws are written as loose guidelines and the specifics are left open to the courts.  Despite what you learned in history class, our laws are not actually written by legislatures, but rather by those on the bench.  Look at The Sherman Act: a single sentence determines when the law is applicable; however, courts have expanded and contracted this law as they see fit.  Instead of a coherent, well-structured law that anyone can follow, one needs a swarm of lawyers to get through any issue.  Worse, many people are completely unaware that they may be breaking copyright law.  Many researchers wrongly assume that if they use less than 30 seconds, then they are legally safe, but this is untrue.  It is purely dependent on whether the recording industry chooses to go after you and how good your defense team is.

So what would a good solution look like?  I have thought of one that is actually rather easy and is found in other research fields.  The handling of nuclear, biological, and chemical materials is governed by a strict set of guidelines that researchers must follow in obtaining, handling, and destroying potentially dangerous materials.  I'm actually a little surprised that a similar structure has not been suggested for the use of copyrighted materials.  Such guidelines could allow researchers access to large amounts of complete, unaltered data (i.e., full songs, raw audio), while still ensuring the rights of the copyright holders.

I can already address the objection that will be raised by the copyright holders: "But very few researchers will want to take home nuclear, biological, and chemical materials."  This is just untrue.  Many research labs conduct studies on illegal drugs, such as marijuana.  Are you telling me that no researcher would want to take home a little stash?  Again, strict guidelines are in place to ensure that researchers use these illegal substances in an ethical and legal manner while also ensuring that necessary research can be conducted.  This is definitely possible in terms of music, text, and other multimedia.

Thursday, November 6, 2008

Science Reporting, Data-Mining, and Terrorism

Disclaimer: This blog is non-political, but can discuss how science, journalism, and politics interact.  I will try my best to simply state the facts and point to where I see a misinterpretation or omission of scientific principles.  As such, I intentionally did not post this until after the election.

Recently, I wrote about the new responsibilities engineers have when describing technical findings to science journalists.  Shortly after my post, I began to see many articles stating that a committee put together by The Department of Homeland Security found that data-mining technology should not be used to track or identify terrorists because the technology would not work and privacy rights would be violated due to false positives.  At first, I did not pay attention to this story, but I started to see more and more stories saying that ultimately, this task is futile.

Futile?  Really?  This implies that we know the limits of data-mining as a science.  I guess we can cancel all those conferences next year.  Unlike many of the journalists, I chose to actually read the report beyond the Executive Summary and found that the committee's objectives and conclusions were mischaracterized.  First, the committee said that such technologies should not be used right now "given the present state of the science" (italics added) and should never be trusted in a fully automatic sense.  The report also says (in a few places) that research should continue.

Second, this report is mostly a legal report and only uses the technological aspects as background.  One common theme in the report is that false positives will occur, which results in privacy violations.  However, the report fails to give the conditions under which a particular invasion may be justified.  Clearly, the answer is not all or none, since privacy violations occur legally in non-terrorism contexts.  For example, many common law-enforcement techniques such as DNA testing, witness accounts, and even confessions have a false-positive rate.  Where are the calls to dismiss these technologies or to stop investigating crimes in general?
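To make the false-positive concern concrete, here is a rough sketch of the arithmetic.  The numbers are made up purely for illustration (they are not taken from the report), and the function name `flagged_breakdown` is my own.  It assumes a screening tool with a given detection rate and false-positive rate applied to a population in which real targets are very rare.

```python
def flagged_breakdown(population, true_targets, sensitivity, false_positive_rate):
    """Return (true_positives, false_positives, precision) for a screening tool.

    sensitivity: fraction of real targets the tool flags.
    false_positive_rate: fraction of everyone else the tool flags by mistake.
    precision: fraction of flagged people who are real targets.
    """
    innocents = population - true_targets
    true_positives = true_targets * sensitivity
    false_positives = innocents * false_positive_rate
    flagged = true_positives + false_positives
    precision = true_positives / flagged if flagged else 0.0
    return true_positives, false_positives, precision


# Hypothetical numbers, chosen only for illustration (not from the report).
tp, fp, precision = flagged_breakdown(
    population=1_000_000,
    true_targets=10,
    sensitivity=0.99,
    false_positive_rate=0.001,
)
print(f"true positives:  {tp:.1f}")
print(f"false positives: {fp:.1f}")
print(f"share of flagged people who are real targets: {precision:.2%}")
```

Under these assumed rates, only about one flagged person in a hundred is a real target, even though the tool is wrong about any given innocent person just 0.1% of the time.  The same arithmetic applies to any rare-event screen, which is why a flag is better treated as a lead for further investigation than as a verdict.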

So what were the real conclusions in the report with regard to using data-mining techniques for counter-terrorism efforts?

1. No fully automatic data-mining technique should be used.

Specifically, the document says that since there is always the possibility of false-positives, data-mining techniques can only be used to identify subjects for further investigation. This is not really new.

2. Technology can be used to reduce the risk of privacy intrusion.

Specifically, the technology can be used as a filter. The report gives an example, where only images with guns detected automatically are seen by humans for further investigation.

3. "Because data mining has proven to be valuable in private-sector applications... there is reason to explore its potential uses in countering terrorism."

Once again, this proves my point that engineers and scientists need to be careful about how they describe their research and findings to journalists.

4. Programs need to be developed with specific goals and legal limitations in mind. In addition, programs must be subject to continual review.

The truth is that many of our laws or legal understandings are based on judicial precedents and are rarely cleaned up by Congress. This becomes an issue when technologies change and new laws are not written. Any legal decision is largely based on the facts of the particular case and will not encompass the facts needed to apply a law in a broader context. A similar problem is seen in our obsolete copyright laws.

For what it's worth, I do not blame the reporters entirely.  Reading a 372-page document is a lot to accomplish with an ever-shrinking news cycle.  But this does demonstrate that engineers and scientists need to be careful about how they state their findings, since public perception and even legal policies can be altered by their mischaracterization in the media.

Pandora Video Series

Even after laying off a significant portion of its workforce, Pandora is continuing to look for ways to expand its business.  One such avenue is the start of a video series version of its musicology podcast series (another favorite).  Personally, I love this and wonder if Pandora might one day expand or split off a business into the area of popular music education.  One can only hope this (and other music-based educational technologies) might be a way to offer music education to public and private schools under the threat of budget cuts.

Monday, November 3, 2008


I talked to a couple of people at ISMIR about a new machine learning toolbox called FASTLIB (although it appears to go by both FASTLIB and MLPACK). This toolbox was developed by Alexander Gray's lab in the College of Computing at Georgia Tech, and I used it extensively in his class. I highly recommend that anyone try this toolbox for their machine learning needs. Programming within the guidelines greatly reduces the programming time (almost to the simplicity of MATLAB), while retaining computational speed and memory capacity. If you are like me, you have had to make the judgement call between programming something in MATLAB and having it run a long time, or spending a long time writing and debugging C++ code so that the algorithm runs quicker.

The official place to download the package seems to be here; however, I found some issues (to be expected with a version 1.0). The stripped-down package on an old class website seemed easier to install. The individual built-in algorithms can be added manually later. I hope to have a small series of posts demonstrating how easy the programming style is.